Detecting homograph attacks
Threats from 4t2.co users.
The main threats to detect:
- Homograph/IDN attacks: using Unicode lookalikes (е vs e, ο vs o, etc.) to impersonate trusted domains. This is exactly what whale-phish exploits.
- Punycode domains: legitimate-looking URLs that are actually xn--encoded internationalized domains.
- Known phishing/malware domains: checked against threat intel feeds.
- Suspicious patterns: brand names in subdomains/paths to hijack trust (e.g., paypal.evil.com), URL shorteners inside URL shorteners, newly registered domains, etc.
- Excessive redirects or shadowed domains.
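To make the first item concrete, here is a minimal sketch of mixed-script detection using only the standard library. It is not the code from the module; the function names are hypothetical, and taking the first word of the Unicode character name as the "script" is a rough heuristic rather than a full implementation of Unicode script properties.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Return the coarse script names used by the letters in one hostname label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens carry no script information here
        # unicodedata.name() yields e.g. "CYRILLIC SMALL LETTER IE";
        # the first word is used as an approximate script name (heuristic).
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(hostname: str) -> bool:
    """True if any single label mixes letters from more than one script."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))


# "p\u0430ypal.com" contains Cyrillic "а" (U+0430) in place of Latin "a"
print(is_mixed_script("p\u0430ypal.com"))
print(is_mixed_script("paypal.com"))
```

A hostname written entirely in one non-Latin script is not flagged by this check, which is intentional: legitimate internationalized domains exist, and it is the mixing inside one label that signals impersonation.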
I’ll build you a Python-based URL analysis module you can drop into your backend, covering all of these.
Now let me run the self-test to verify it works:
The homograph/confusable and punycode detection is working perfectly. The score display shows ? for the remaining rows because after 5 blocked attempts from the same test user ID, the ban manager kicks in (which is correct behavior!). Let me fix the test IDs and the deprecation warnings:
Edited 3 files, ran a command:
- Fix test to use unique user IDs per test case and fix datetime deprecation warnings (url_safety_checker.py, +21 -21)
- Fix is_banned to use timezone-aware datetime (url_safety_checker.py, +6 -6)
- Fix test to use unique user IDs per test to avoid hitting the ban limit mid-test (url_safety_checker.py, +9 -9)
- Re-run with fixes applied (script)
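For readers who want to see what the timezone-aware fix and the ban threshold look like, here is a small sketch. It is a hypothetical stand-in, not the module's actual ban manager: the class name, the one-hour ban duration, and the in-memory dictionaries are assumptions. Only the threshold of 5 blocked attempts comes from the text above. It uses datetime.now(timezone.utc) in place of datetime.utcnow(), which is deprecated as of Python 3.12.

```python
from datetime import datetime, timedelta, timezone


class SimpleBanManager:
    """Hypothetical in-memory ban tracker (threshold from the text, duration assumed)."""

    def __init__(self, max_blocked: int = 5, ban_duration: timedelta = timedelta(hours=1)):
        self.max_blocked = max_blocked
        self.ban_duration = ban_duration
        self._blocked_counts: dict[str, int] = {}
        self._banned_until: dict[str, datetime] = {}

    def record_blocked(self, user_id: str) -> None:
        """Count one blocked submission; ban the user once the limit is reached."""
        count = self._blocked_counts.get(user_id, 0) + 1
        self._blocked_counts[user_id] = count
        if count >= self.max_blocked:
            # Timezone-aware "now" avoids naive/aware comparison errors later
            self._banned_until[user_id] = datetime.now(timezone.utc) + self.ban_duration

    def is_banned(self, user_id: str) -> bool:
        until = self._banned_until.get(user_id)
        return until is not None and datetime.now(timezone.utc) < until


manager = SimpleBanManager()
for _ in range(5):
    manager.record_blocked("test-user-1")
print(manager.is_banned("test-user-1"), manager.is_banned("test-user-2"))
```

This also shows why the self-test needed unique user IDs: reusing one ID across test cases trips the ban after the fifth blocked URL, so later rows never reach the scorer.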
All working cleanly. Let me do two small tweaks — javascript/data URIs should be caught by the exotic scheme check before the parser, and excessive subdomains should block at score 35 (adjusting threshold):
Edited 2 files, ran a command:
- Run exotic scheme check before parsing so JS/data URIs get the right score (url_safety_checker.py, +5 -2)
- Remove duplicate exotic scheme check lower in the method (url_safety_checker.py, +1 -4)
- Final test run (script)
All 14 test cases behaving correctly. Clean URLs score 0, dangerous URLs are blocked with explanations.
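As an illustration of the scheme tweak, here is one way a pre-parse scheme check could look. This is a sketch under assumptions, not the module's code: the function name and the set of dangerous schemes are made up for the example. It works on the raw string, after removing the tab and newline characters that browsers strip from URLs, so that an obfuscated javascript: scheme is still recognized.

```python
import re
from typing import Optional

# Assumed list for illustration; the module's real list is not shown
DANGEROUS_SCHEMES = {"javascript", "data", "vbscript", "file"}

# A URL scheme is a letter followed by letters, digits, "+", "." or "-", then ":"
_SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def exotic_scheme(url: str) -> Optional[str]:
    """Return the lowercased scheme if it is on the dangerous list, else None."""
    # Drop surrounding whitespace and embedded tab/CR/LF before matching
    cleaned = re.sub(r"[\t\r\n]", "", url.strip())
    match = _SCHEME_RE.match(cleaned)
    if match and match.group(1).lower() in DANGEROUS_SCHEMES:
        return match.group(1).lower()
    return None


for candidate in ("javascript:alert(1)", " JaVa\tScRiPt:alert(1)", "https://example.com/"):
    print(repr(candidate), "->", exotic_scheme(candidate))
```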
Here’s a breakdown of what the module catches and how to use it:
Detection layers (scored 0–100, blocked at ≥50):
- Homograph/mixed-script attacks: detects Cyrillic е next to Latin a, Greek ο next to Latin characters, etc. (the core whale-phish technique). Scores 85.
- Visually confusable characters: a hardcoded map of ~100+ lookalikes (fullwidth ASCII, Cyrillic, Greek) directly matched against the hostname. Scores 80.
- Punycode/IDN labels (xn--): flags these and attempts to decode and recheck. Scores 60+.
- Brand impersonation: 50+ targeted brands checked; flags them appearing in subdomains of unrelated domains. Scores 60.
- Shortener-in-shortener: prevents redirect chain abuse via bit.ly, tinyurl, etc. Scores 55.
- Private/loopback IP hosts: SSRF protection. Scores 90.
- Credentials embedded in URLs (user:pass@host). Scores 70.
- Double/null-byte encoding: evasion techniques. Scores 50–90.
- Newline injection: HTTP header injection risk. Scores 90.
- High-entropy paths: signals generated phishing URLs (warning only).
- Suspicious TLDs (.tk, .xyz, etc.): warning + 15 points.
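To show how a few of these layers could feed one score, here is a minimal sketch covering three of them: embedded credentials, private/loopback IP hosts, and punycode labels. The per-layer scores and the block threshold of 50 come from the list above. Everything else is an assumption: the function name, the shape of the returned dict, and in particular combining layers by taking the highest score, since the module's real aggregation rule is not shown.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCK_THRESHOLD = 50  # from the text: blocked at >= 50


def check_url(url: str) -> dict:
    """Run three illustrative checks and return score, blocked flag, and reasons."""
    findings = []  # (score, reason) pairs
    parts = urlsplit(url)
    host = parts.hostname or ""

    # Credentials embedded in the URL (user:pass@host)
    if parts.username or parts.password:
        findings.append((70, "credentials embedded in URL"))

    # Host is a literal private or loopback IP address (SSRF risk)
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal
    if ip is not None and (ip.is_private or ip.is_loopback):
        findings.append((90, "private or loopback IP host"))

    # Punycode labels: decode so the Unicode form can be rechecked by other layers
    for label in host.split("."):
        if label.startswith("xn--"):
            try:
                decoded = label[4:].encode("ascii").decode("punycode")
            except ValueError:  # UnicodeError is a subclass of ValueError
                decoded = "?"
            findings.append((60, f"punycode label {label} decodes to {decoded!r}"))

    # Assumption: the overall score is the highest individual finding
    score = max((s for s, _ in findings), default=0)
    return {
        "score": score,
        "blocked": score >= BLOCK_THRESHOLD,
        "reasons": [reason for _, reason in findings],
    }


for candidate in ("https://example.com/", "http://127.0.0.1/admin", "https://user:pw@example.com/"):
    print(candidate, check_url(candidate))
```

This sketch only inspects IP literals in the URL itself. A hostname that resolves to a private address through DNS passes this check, so real SSRF protection also needs a check at resolution time.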
Drop-in usage:
python
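from url_safety_checker import ShortenerGuard  # assumes the class lives in url_safety_checker.py

guard = ShortenerGuard()
decision = guard.evaluate(user_id, submitted_url)
if decision["blocked"]:
    # inside your request handler: reject the submission
    return 403, decision["message"]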
To connect to Google Safe Browsing / VirusTotal, pass an external_reputation_fn to ShortenerGuard and call check_async() instead.
Recommended next steps:
- Back UserBanManager with Redis (INCR + EXPIRE keys) for persistence across restarts
- Hook up Google Safe Browsing API (free tier, 10k/day) for known phishing domain lookup
- Consider logging all warnings (non-blocked) to a separate queue for manual review; those are borderline cases worth a human eye
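The INCR + EXPIRE idea from the first step can be sketched as follows. The function name, key format, and one-hour window are assumptions, and how this would plug into UserBanManager is not shown in the original. The function takes any object with incr(key) and expire(key, seconds) methods, which a redis-py client provides. A small in-memory stand-in is included so the example runs without a Redis server.

```python
class InMemoryCounterStore:
    """Stand-in with the two methods used below; records TTLs but does not enforce them."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds
        return True


def record_blocked_attempt(store, user_id: str, max_blocked: int = 5,
                           window_seconds: int = 3600) -> bool:
    """Increment the user's blocked-attempt counter; return True once the limit is hit."""
    key = f"blocked:{user_id}"  # hypothetical key naming scheme
    count = store.incr(key)
    if count == 1:
        # Start the window on the first strike so the counter expires on its own
        store.expire(key, window_seconds)
    return count >= max_blocked


store = InMemoryCounterStore()
print([record_blocked_attempt(store, "user-42") for _ in range(5)])
```

One design caveat: INCR followed by EXPIRE is two separate commands, so a crash between them leaves a counter with no expiry. Wrapping both in a transaction or a server-side script closes that gap.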
claude.ai
“Glad it’s useful! Good luck with 4t2.co: phishing via URL shorteners is a genuinely nasty attack vector, and it’s great you’re taking it seriously. If you ever want to extend it (e.g., Safe Browsing integration, a Redis-backed ban store, or an admin dashboard to review flagged URLs), just shout.”
So those extensions can be done easily enough: just send this page and the file to Claude, which avoids spending more tokens on rebuilding the context.