ja4-platform

Author	SHA1	Message	Date
toto	7b7b69dee3	Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic - 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv - 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories - 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs - 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc. Key improvements: - Load browser_ja4.csv at startup, match JA4 to browser family - Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs - Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.) - Realistic navigation patterns with Referer chains and cookies - Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50) - Batch size increased to 2000, progress reporting every 10K rows - New CLI args: --rows, --ips, --seed, --data-dir - Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv Also updated: - Dockerfile: COPY *.py (was missing seed_clickhouse.py) - docker-compose.yml: mount scripts/data as /app/data for CSV access - run-tests.sh: updated seeder description comments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 16:35:40 +02:00
toto	5c5bca71d1	feat: rewrite ASN classification with PeeringDB + expanded heuristics Major improvements to generate_asn_data.py: - Add PeeringDB network data source (34K networks with info_type) - Add new categories: education, government, enterprise - Rename 'human' label to 'isp' across all consumers - Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov) - Add hard-coded lists for education, government, enterprise ASNs - Support both --output-dir and --output-asn/--output-ipasn CLI interfaces - Add --no-peeringdb flag for offline use Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs, education 3.1K, enterprise 5.7K, government 520. Updated consumers: - bot_detector.py: 'human' -> 'isp' for baseline selection - dashboard api.py: 'human' -> 'isp' in SQL queries - run-tests.sh: 'human' -> 'isp' in integration test assertions - update-csv-data.sh: updated label description comment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 16:02:07 +02:00
toto	fc882dd3e7	feat(tests): realistic traffic seeder + IP diversity via mod_remoteip Option A (X-Forwarded-For + mod_remoteip): - httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918 subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which mod_remoteip updates from XFF → each request logged with distinct src_ip - generate_traffic.py: XFF always set (was 30% only); human scenarios use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32; pool of 1168 human IPs and 180 bot IPs; default --requests 500 Option D (direct ClickHouse seeder, seed_clickhouse.py, stdlib only): - Inserts ~4000 rows into http_logs_raw triggering full MV chain: http_logs_raw → mv_http_logs → http_logs → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h • 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215) → dict_asn_reputation maps these to asn_label='human' → satisfies bot_detector human_baseline >= 500 threshold • 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login, SQLi, path traversal), scanner UAs, minimal TCP fingerprints • 100 known-bot sessions: IPs matching bot_ip.csv entries • 20 brute-force clusters: 20-50 POST /login per IP All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3) CSV stubs (mounted at /var/lib/clickhouse/user_files/): - iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo) - asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting') - bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits) - bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab) run-tests.sh: - Phase 4a: seeder runs before live traffic (ensures bot_detector baseline) - Phase 4b: live traffic gen at 500 requests (up from 200) - Phase 5f: new assertions: agg_host_ip_ja4_1h populated, ≥500 human rows in view_ai_features_1h, known-bot labels present - Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 11:35:34 +02:00
toto	12d60975da	feat: Python traffic generator with realistic varied HTTP/HTTPS traffic - Replace curlimages/curl with Python stdlib traffic generator - 200 requests, 10 workers, 16 scenario types: browsers (Chrome/Firefox/Safari/Edge/mobile), bots (Googlebot/Bing/curl/wget), GET/POST/HEAD/PUT/PATCH/DELETE/OPTIONS, HTTP + HTTPS - Multiple SSL contexts (default, TLS1.2-only, TLS1.3-only, few_ciphers) → 4 distinct JA4/JA3 fingerprints per test run - Realistic headers: Accept, Accept-Language, Sec-Fetch-*, Referer, X-Forwarded-For, Cookie, Cache-Control - JSON payloads, form data, CORS preflights - DB always reset (down -v) at start of each test run - Enhanced Phase 5 checks: distinct UAs, method variety, JA4/JA3 counts + uniqueness Results: 199/200 OK, 24 distinct UAs, 7 HTTP methods, TLS 1.2+1.3, 4 JA4 fingerprints Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 21:14:55 +02:00
toto	da8357f43d	fix: TLS capture end-to-end in integration tests - Add traffic-gen container (curlimages/curl) to send HTTPS traffic across Docker network so sentinel (pcap on eth0) captures ClientHello - Seed anubis_ua_rules with catch-all rule (REGEXP_TREE needs ≥1 entry) so MV mv_http_logs processes raw logs without errors - Add JA4/JA3 fingerprint verification in Phase 5 tests - Dashboard healthcheck via python urllib (no curl in image) Results: 59 raw logs, 59 parsed, 53 with JA4+JA3 fingerprints (TLS 1.3) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 21:04:14 +02:00
toto	d4e7e674d8	feat: full-stack Docker Compose integration tests - 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard - Platform builds sentinel on Rocky (CGO+libpcap native), correlator static - mod-reqin-log compiled with apxs on Rocky (matching RPM build target) - ClickHouse init script patches credentials for test env (sed-based) - 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel - All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths) SQL schema fixes discovered during integration: - 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type) - 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs - 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key) - 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 20:33:25 +02:00

6 Commits
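
Commit 7b7b69dee3 mentions generating source IPs from CIDR blocks loaded from bot_ip.csv, with a --seed argument. The sketch below shows one way this kind of seeded sampling could be done with the Python standard library only. It is not the code in seed_clickhouse.py: the function name `sample_ips_from_cidrs`, its parameters, and the example CIDRs (RFC 5737 documentation ranges) are assumptions made for illustration.

```python
import ipaddress
import random


def sample_ips_from_cidrs(cidrs, count, seed=None):
    """Return `count` distinct addresses drawn from the given CIDR blocks.

    Hypothetical helper: the name and signature are illustrative and are
    not taken from the repository.
    """
    rng = random.Random(seed)  # private RNG so a given seed is reproducible
    networks = [ipaddress.ip_network(c, strict=False) for c in cidrs]

    # Overlapping blocks would make this an overestimate; the sketch
    # assumes the input blocks are disjoint.
    capacity = sum(net.num_addresses for net in networks)
    if count > capacity:
        raise ValueError(f"requested {count} IPs but only {capacity} available")

    chosen = []   # keeps draw order, which is deterministic for a given seed
    seen = set()  # fast duplicate check
    while len(chosen) < count:
        net = rng.choice(networks)
        # May return the network or broadcast address of a block;
        # a real seeder might want to exclude those.
        addr = str(net[rng.randrange(net.num_addresses)])
        if addr not in seen:
            seen.add(addr)
            chosen.append(addr)
    return chosen


# Placeholder CIDRs from the RFC 5737 documentation ranges, not real bot ranges.
example_cidrs = ["192.0.2.0/24", "198.51.100.0/24"]
example_ips = sample_ips_from_cidrs(example_cidrs, 5, seed=42)
print(example_ips)
```

Because each block is picked with equal probability, small blocks are over-represented relative to their size. Weighting the choice by `num_addresses` would give a distribution uniform over addresses, if that matters for the seeded traffic.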