ja4-platform

Author	SHA1	Message	Date
toto	fde6864311	feat(dashboard): browser signatures management UI - Add dict_browser_h2 to /reflists (read-only via dict_browser_h2) - New API endpoints: GET /api/browser-signatures/entries: list browser_h2_signatures (falls back to the CSV dict if migration 06 is not applied) POST /api/browser-signatures/entries: add fingerprint + reload dict DELETE /api/browser-signatures/entries: delete + reload dict - /browsers page: 2 new sections 'Base de signatures H2' (H2 signature database): table of the 10 fingerprints, add form, automatic read-only mode if migration 06 is not applied 'Règles de scoring browser_matcher.py' (browser_matcher.py scoring rules): static table of the 7 dimensions (weights, per-family values, bypass thresholds) - Integration: browser_h2.csv copied into user_files at ClickHouse startup Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 14:46:07 +02:00
toto	9ea36ad22e	feat(scripts): complete stack init + prod data import with date shift Schema cleanup: - Remove anubis_ua_rules table stub from 03_anubis_tables.sql - Remove anubis_ua_rules from bot-detector deploy_schema.sql - Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency) - Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules New scripts: - scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations + validation + cleanup of obsolete tables). Supports --reset, --import-prod. - scripts/import-prod-data.sh: imports pre-exported prod data (Native format) with dynamic date shift (max(time) → now). Supports --shift, --no-truncate. - scripts/data/prod-export/: directory for cached Native format exports Makefile targets: init-stack, import-prod-data, init-and-import Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts import-prod-data.sh: 3M rows in ~37s with auto date shift Dashboard: 55 routes OK, bot-detector: 36/36 tests pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 21:40:05 +02:00
toto	98289ccf04	fix: ASN dictionary pipeline + verbose bot-detector logging - Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols) - Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format) - Replace org/domain dictGet calls with empty string literals in MV - Full 714K CIDR stub for complete ASN resolution in tests - Add header generation to generate_asn_data.py - Verbose bot-detector stdout: data summary, triage breakdown, model training details, scoring stats, browser classification, boxed results - Fix IPv6 filter in traffic seeder (_ips_from_cidrs) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 17:43:55 +02:00
toto	7b7b69dee3	Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic - 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv - 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories - 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs - 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc. Key improvements: - Load browser_ja4.csv at startup, match JA4 to browser family - Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs - Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.) - Realistic navigation patterns with Referer chains and cookies - Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50) - Batch size increased to 2000, progress reporting every 10K rows - New CLI args: --rows, --ips, --seed, --data-dir - Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv Also updated: - Dockerfile: COPY *.py (was missing seed_clickhouse.py) - docker-compose.yml: mount scripts/data as /app/data for CSV access - run-tests.sh: updated seeder description comments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 16:35:40 +02:00
toto	74e0406c38	chore: update ASN stubs with new classification labels Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 16:05:25 +02:00
toto	5c5bca71d1	feat: rewrite ASN classification with PeeringDB + expanded heuristics Major improvements to generate_asn_data.py: - Add PeeringDB network data source (34K networks with info_type) - Add new categories: education, government, enterprise - Rename 'human' label to 'isp' across all consumers - Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov) - Add hard-coded lists for education, government, enterprise ASNs - Support both --output-dir and --output-asn/--output-ipasn CLI interfaces - Add --no-peeringdb flag for offline use Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs, education 3.1K, enterprise 5.7K, government 520. Updated consumers: - bot_detector.py: 'human' -> 'isp' for baseline selection - dashboard api.py: 'human' -> 'isp' in SQL queries - run-tests.sh: 'human' -> 'isp' in integration test assertions - update-csv-data.sh: updated label description comment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 16:02:07 +02:00
toto	7d09c614c3	feat: browser JA4 detection, Anubis bot rules, worldwide ASN data - Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...) - Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck, OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries) - Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs) - Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views - Add /api/browsers dashboard endpoint - Fix CSV quoting for fields containing commas (User-Agent strings) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:27:37 +02:00
toto	b6184e6529	feat: CSV generation scripts, API filter params, enriched CSV stubs - scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries) - scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families - scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes - scripts/update-csv-data.sh: master orchestrator with --install-stubs - api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores - pages.py: add /network route - csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:05:43 +02:00
toto	fc882dd3e7	feat(tests): realistic traffic seeder + IP diversity via mod_remoteip Option A: X-Forwarded-For + mod_remoteip: - httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918 subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which mod_remoteip updates from XFF → each request logged with distinct src_ip - generate_traffic.py: XFF always set (was 30% only); human scenarios use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32; pool of 1168 human IPs and 180 bot IPs; default --requests 500 Option D: Direct ClickHouse seeder (seed_clickhouse.py, stdlib only): - Inserts ~4000 rows into http_logs_raw triggering full MV chain: http_logs_raw → mv_http_logs → http_logs → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h • 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215) → dict_asn_reputation maps these to asn_label='human' → satisfies bot_detector human_baseline >= 500 threshold • 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login, SQLi, path traversal), scanner UAs, minimal TCP fingerprints • 100 known-bot sessions: IPs matching bot_ip.csv entries • 20 brute-force clusters: 20-50 POST /login per IP All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3) CSV stubs (mounted at /var/lib/clickhouse/user_files/): - iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo) - asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting') - bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits) - bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab) run-tests.sh: - Phase 4a: seeder runs before live traffic (ensures bot_detector baseline) - Phase 4b: live traffic gen at 500 requests (up from 200) - Phase 5f: new assertions (agg_host_ip_ja4_1h populated, ≥500 human rows in view_ai_features_1h, known-bot labels present) - Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 11:35:34 +02:00
toto	228ad7026a	fix(integration): mount missing SQL files 10-12 in ClickHouse init 3 SQL files were missing from the docker-compose.yml volume mounts: - 10_perf_indexes.sql (performance indexes) - 11_views.sql (dashboard views) - 12_thesis_features.sql (thesis §5 MVs and views) Also make 10_perf_indexes.sql non-fatal in init script since ALTER TABLE ADD INDEX may fail if index already exists. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:55:43 +02:00
toto	f6e2d3c0ca	feat(bot-detector): implement 8 state-of-the-art improvements - EIF: Extended Isolation Forest via isotree (fallback to sklearn IF) - Benford's Law deviation feature on inter-request timing - Lag-1 autocorrelation feature for cadence analysis - Validation gate: reject model if val_anomaly_rate > 20% - Feature pruning: remove features with variance < 1e-6 before training - Quantile drift: replace N(μ,σ) synthetic with quantile interpolation - Thread safety: Lock for _service_healthy/_consecutive_failures - Score normalization: inverted to [0,1] where 1=most anomalous SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h Tests: 10 new test functions covering all improvements Integration: verify_mvs.py checks new thesis feature columns Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:31:26 +02:00
toto	6d02f21c1e	feat: implement thesis §5 advanced detection techniques as ClickHouse MVs New aggregation tables + materialized views: - agg_path_sequences_1h + MV (§5.1 Path Sequence Entropy) - agg_request_timing_1h + MV (§5.3 Request Cadence Fingerprint) - agg_ip_behavior_1h + MV (§5.5 JA4 Drift + §5.8 Cross-Domain) - agg_resource_cascade_1h + MV (§5.4 Resource Dependency Tree) New analytical views: - view_thesis_features_1h: unified view exposing all computable features (path_transition_entropy, cadence_cv, burst_ratio, pause_ratio, ja4_drift_ratio, host_diversity, host_sweep_speed, host_coverage_uniformity) - view_resource_cascade_1h: root_to_first_asset_delay, asset_load_stddev Documented future techniques (not feasible as MV): - §5.2 Bipartite Fleet Graph (needs Python networkx) - §5.6 DNS Shadow Analysis (needs sentinel UDP/53 extension) - §5.7 Compression Ratio Invariant (needs mod_reqin_log extension) Updated: deploy_schema.sh, verify_mvs.py (sections 8-10) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 01:42:52 +02:00
toto	51b8eb57a8	feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/ deploy_views.sql (v13 → v14): - CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name) ReplacingMergeTree was collapsing all detections to 1 row per IP on merge - Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables - ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d - agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d - view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan) - Remove dead views: summary/timeseries/threat_dist/variability - Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints) - Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics) - Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector) - Mark max_requests_per_sec as DEPRECATED (always 0) New files: - correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration - tests/integration/verify_mvs.py: MV pipeline verification assertions - docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis All DB references use ja4_processing/ja4_logs (no mabase_prod). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 23:51:56 +02:00
toto	12d60975da	feat: Python traffic generator with realistic varied HTTP/HTTPS traffic - Replace curlimages/curl with Python stdlib traffic generator - 200 requests, 10 workers, 16 scenario types: browsers (Chrome/Firefox/Safari/Edge/mobile), bots (Googlebot/Bing/curl/wget), GET/POST/HEAD/PUT/PATCH/DELETE/OPTIONS, HTTP + HTTPS - Multiple SSL contexts (default, TLS1.2-only, TLS1.3-only, few_ciphers) → 4 distinct JA4/JA3 fingerprints per test run - Realistic headers: Accept, Accept-Language, Sec-Fetch-*, Referer, X-Forwarded-For, Cookie, Cache-Control - JSON payloads, form data, CORS preflights - DB always reset (down -v) at start of each test run - Enhanced Phase 5 checks: distinct UAs, method variety, JA4/JA3 counts + uniqueness Results: 199/200 OK, 24 distinct UAs, 7 HTTP methods, TLS 1.2+1.3, 4 JA4 fingerprints Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 21:14:55 +02:00
toto	da8357f43d	fix: TLS capture end-to-end in integration tests - Add traffic-gen container (curlimages/curl) to send HTTPS traffic across Docker network so sentinel (pcap on eth0) captures ClientHello - Seed anubis_ua_rules with catch-all rule (REGEXP_TREE needs ≥1 entry) so MV mv_http_logs processes raw logs without errors - Add JA4/JA3 fingerprint verification in Phase 5 tests - Dashboard healthcheck via python urllib (no curl in image) Results: 59 raw logs, 59 parsed, 53 with JA4+JA3 fingerprints (TLS 1.3) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 21:04:14 +02:00
toto	d4e7e674d8	feat: full-stack Docker Compose integration tests - 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard - Platform builds sentinel on Rocky (CGO+libpcap native), correlator static - mod-reqin-log compiled with apxs on Rocky (matching RPM build target) - ClickHouse init script patches credentials for test env (sed-based) - 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel - All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths) SQL schema fixes discovered during integration: - 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type) - 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs - 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key) - 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 20:33:25 +02:00
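
Commit 9ea36ad22e mentions a dynamic date shift (max(time) → now) applied when importing exported prod data. The log does not show how import-prod-data.sh implements it, so the following is only a minimal Python sketch of the idea. The function name `shift_to_now`, the row layout (dicts with a `time` key) and the `now` parameter are all hypothetical and not taken from the repository.

```python
from datetime import datetime, timezone


def shift_to_now(rows, now=None, time_key="time"):
    """Return copies of `rows` with every timestamp moved forward by the
    same offset, so that the newest row lands exactly on `now`.

    Hypothetical helper: the real script may apply the shift in SQL instead.
    """
    if not rows:
        return []
    if now is None:
        now = datetime.now(timezone.utc)
    # The offset is derived from the newest timestamp in the export.
    newest = max(row[time_key] for row in rows)
    delta = now - newest
    # Applying one shared offset preserves the spacing between rows.
    return [{**row, time_key: row[time_key] + delta} for row in rows]
```

Because one offset is applied to every row, relative ordering and inter-request gaps are unchanged; only the absolute position in time moves.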

16 Commits
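
Commit f6e2d3c0ca adds two timing features, a lag-1 autocorrelation and a Benford's Law deviation computed on inter-request timing. The log does not give their exact definitions, so the sketch below uses common textbook forms as an assumption: the standard lag-1 sample autocorrelation, and the mean absolute difference between observed leading-digit frequencies and the Benford distribution. The function names are hypothetical and may not match bot_detector.py or view_thesis_features_1h.

```python
import math
from collections import Counter


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation; 0.0 for constant or too-short input."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var


def leading_digit(x):
    """First significant digit of a positive number, read from its
    scientific notation (values within ~1e-7 of a power of ten may round up)."""
    return int(f"{x:e}"[0])


def benford_deviation(values):
    """Mean absolute gap between observed leading-digit frequencies and
    Benford's expected log10(1 + 1/d). This definition is an assumption."""
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        return 0.0
    counts = Counter(digits)
    total = len(digits)
    gaps = (
        abs(counts.get(d, 0) / total - math.log10(1 + 1 / d))
        for d in range(1, 10)
    )
    return sum(gaps) / 9
```

Under these definitions, a strictly alternating fast/slow cadence pushes the autocorrelation toward -1, and gaps that all share one leading digit score a high Benford deviation.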