- Ajoute dict_browser_h2 dans /reflists (lecture seule via dict_browser_h2)
- Nouveaux endpoints API :
GET /api/browser-signatures/entries — liste browser_h2_signatures
(fallback dict CSV si migration 06 non appliquée)
POST /api/browser-signatures/entries — ajout fingerprint + reload dict
DELETE /api/browser-signatures/entries — suppression + reload dict
- Page /browsers : 2 nouvelles sections
'Base de signatures H2' — tableau des 10 fingerprints, form d'ajout,
mode lecture seule automatique si migration 06 non appliquée
'Règles de scoring browser_matcher.py' — tableau statique des 7 dimensions
(poids, valeurs par famille, seuils de bypass)
- Integration : browser_h2.csv copié dans user_files au démarrage ClickHouse
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.
Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv
Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3 SQL files were missing from the docker-compose.yml volume mounts:
- 10_perf_indexes.sql (performance indexes)
- 11_views.sql (dashboard views)
- 12_thesis_features.sql (thesis §5 MVs and views)
Also make 10_perf_indexes.sql non-fatal in init script since ALTER TABLE
ADD INDEX may fail if index already exists.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove variance < 1e-6 features before training
- Quantile drift: replace N(μ,σ) synthetic with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous
SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)
New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis
All DB references use ja4_processing/ja4_logs (no mabase_prod).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>