74e0406c38
chore: update ASN stubs with new classification labels
...
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 16:05:25 +02:00
7d09c614c3
feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
...
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck,
OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 15:27:37 +02:00
b6184e6529
feat: CSV generation scripts, API filter params, enriched CSV stubs
...
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 15:05:43 +02:00
fc882dd3e7
feat(tests): realistic traffic seeder + IP diversity via mod_remoteip
...
Option A — X-Forwarded-For + mod_remoteip:
- httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918
subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which
mod_remoteip updates from XFF → each request logged with distinct src_ip
- generate_traffic.py: XFF always set (was 30% only); human scenarios
use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32;
pool of 1168 human IPs and 180 bot IPs; default --requests 500
Option D — Direct ClickHouse seeder (seed_clickhouse.py, stdlib only):
- Inserts ~4000 rows into http_logs_raw triggering full MV chain:
http_logs_raw → mv_http_logs → http_logs
→ mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
• 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215)
→ dict_asn_reputation maps these to asn_label='human'
→ satisfies bot_detector human_baseline >= 500 threshold
• 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login,
SQLi, path traversal), scanner UAs, minimal TCP fingerprints
• 100 known-bot sessions: IPs matching bot_ip.csv entries
• 20 brute-force clusters: 20-50 POST /login per IP
All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3)
CSV stubs (mounted at /var/lib/clickhouse/user_files/):
- iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo)
- asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting')
- bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits)
- bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab)
run-tests.sh:
- Phase 4a: seeder runs before live traffic (ensures bot_detector baseline)
- Phase 4b: live traffic gen at 500 requests (up from 200)
- Phase 5f: new assertions — agg_host_ip_ja4_1h populated, ≥500 human
rows in view_ai_features_1h, known-bot labels present
- Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 11:35:34 +02:00
228ad7026a
fix(integration): mount missing SQL files 10-12 in ClickHouse init
...
3 SQL files were missing from the docker-compose.yml volume mounts:
- 10_perf_indexes.sql (performance indexes)
- 11_views.sql (dashboard views)
- 12_thesis_features.sql (thesis §5 MVs and views)
Also make 10_perf_indexes.sql non-fatal in init script since ALTER TABLE
ADD INDEX may fail if index already exists.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 02:55:43 +02:00
da8357f43d
fix: TLS capture end-to-end in integration tests
...
- Add traffic-gen container (curlimages/curl) to send HTTPS traffic
across Docker network so sentinel (pcap on eth0) captures ClientHello
- Seed anubis_ua_rules with catch-all rule (REGEXP_TREE needs ≥1 entry)
so MV mv_http_logs processes raw logs without errors
- Add JA4/JA3 fingerprint verification in Phase 5 tests
- Dashboard healthcheck via python urllib (no curl in image)
Results: 59 raw logs, 59 parsed, 53 with JA4+JA3 fingerprints (TLS 1.3)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-07 21:04:14 +02:00
d4e7e674d8
feat: full-stack Docker Compose integration tests
...
- 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard
- Platform builds sentinel on Rocky (CGO+libpcap native), correlator static
- mod-reqin-log compiled with apxs on Rocky (matching RPM build target)
- ClickHouse init script patches credentials for test env (sed-based)
- 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel
- All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths)
SQL schema fixes discovered during integration:
- 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type)
- 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs
- 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key)
- 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-07 20:33:25 +02:00