d8ca804a55
feat(scripts): add reload-prod-logs.sh for prod→dev data sync
...
Exports http_logs from prod ClickHouse via HTTP API, imports into dev
with dynamic date shifting (max(time) → now() by default).
Features:
- Batch export in Native format (200K rows/batch, ~10s each)
- Auto date shift: prod max(time) aligned to current time
- --shift N: manual override (seconds)
- --days N: filter to last N days only
- --cron: silent mode for scheduled runs
- Staging table approach: export → staging → INSERT SELECT with shift → cleanup
Tested: 3,054,122 rows imported in ~3 minutes, dates 2026-04-03→2026-04-09.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-09 15:41:38 +02:00
98289ccf04
fix: ASN dictionary pipeline + verbose bot-detector logging
...
- Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols)
- Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format)
- Replace org/domain dictGet calls with empty string literals in MV
- Full 714K CIDR stub for complete ASN resolution in tests
- Add header generation to generate_asn_data.py
- Verbose bot-detector stdout: data summary, triage breakdown, model
training details, scoring stats, browser classification, boxed results
- Fix IPv6 filter in traffic seeder (_ips_from_cidrs)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 17:43:55 +02:00
5c5bca71d1
feat: rewrite ASN classification with PeeringDB + expanded heuristics
...
Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use
Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs,
education 3.1K, enterprise 5.7K, government 520.
Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 16:02:07 +02:00
7d09c614c3
feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
...
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck,
OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 15:27:37 +02:00
b6184e6529
feat: CSV generation scripts, API filter params, enriched CSV stubs
...
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-04-08 15:05:43 +02:00