ja4-platform

Author	SHA1	Message	Date
toto	9548b1782d	fix: correct ORDER BY of ml_detected_anomalies in the base schema. CH 24.8 rejects MODIFY ORDER BY on existing columns (error BAD_ARGUMENTS 36), so migration 01 could not correct the ORDER BY post-init. Fix: - 06_ml_tables.sql: ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name) + TTL 30d → 7d (consistent with the documented architecture) - 01_ttl_adjustments.sql: removes the impossible MODIFY ORDER BY, keeps only the MODIFY TTL statements (valid for existing deployments) Result: make init-stack with no ⚠ or ✗ Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 01:34:07 +02:00
toto	b409a70970	fix(views): align SQL views with dashboard API expected columns - view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen - view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start - view_ip_recurrence: add worst_threat alias + top_ja4, top_host columns All three views were missing columns referenced by /api/brute-force, /api/ja4-rotation and /api/recurrence endpoints, causing 500 errors on the Tactiques page. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 00:59:57 +02:00
toto	14db3d9040	refactor: remove User-Agent dependency from browser detection SQL changes: - modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback) - Add has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores - mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score - header_order_confidence: PARTITION BY ja4 instead of first_ua - sec_ch_mobile_mismatch: internal Client Hints comparison (no UA) - Migration 03_remove_ua_browser_detection.sql Python changes: - browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA) - Axis weighting: ja4_known 0.30, tls_coherence 0.20 (strengthened TLS signals) - preprocessing.py: has_sec_ch_ua added to features and binary_features Files changed: 8 SQL/Python + 1 migration, 36/36 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:06:01 +02:00
toto	db306fb9da	fix: P0 audit bugs: bot-detector + dashboard + SQL Bot-detector: - B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies - B1.4/B1.5: log_decision argument order fixed (cycle_id, name) - B1.7: AE broadcast error: model now returns features list, scoring uses model's features instead of current cycle's (prevents dim mismatch) - B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name Dashboard: - C1.1: XSS in ip_detail.html: {{ ip | tojson }} instead of raw string - C1.2: Stored XSS via innerHTML: added escapeHtml() helper, all user-facing formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized - C2.1: status filter now correctly filters http_version column - C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days SQL: - B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous) - B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 23:33:00 +02:00
toto	9a48fb9d29	feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency Add browser legitimacy classification (A9) to the bot detection pipeline: - New features: is_known_browser (binary) and browser_consistency_score [0..5] combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language, cookies, Sec-Fetch-* presence - Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW threat level are reclassified as LEGITIMATE_BROWSER - Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays in normal anomaly scoring — prevents evasion via JA4 spoofing - XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label) - ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores - Dashboard: browser_family filter/sort on detections and scores endpoints, legitimate_browsers count and browser_stats in overview - 6 new unit tests covering classification threshold, spoofing, exclusion logic Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:46:22 +02:00
toto	8d58f2b932	feat(bot-detector): add XGBoost supervised third voice (#10) Triple-voice ensemble architecture: - EIF (unsupervised, zero-day anomalies) - Autoencoder (unsupervised, non-linear correlations) - XGBoost (supervised, known patterns + SOC feedback) XGBoost implementation: - Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1) - Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required - Score = predict_proba, combined via meta-learner: (1-β)(EIF+AE) + β·xgb_prob - Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS - Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only - ClickHouse: xgb_prob column added to ml_all_scores - Tests: 4 new tests (availability, train/predict, meta-learner, save/load) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:45:57 +02:00
toto	57cf6c3828	feat(bot-detector): add parallel Autoencoder scorer (#9) - TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU - Trained alongside EIF on human_baseline, saved/loaded with model versioning - Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30) - AE latent space (16-dim) used for HDBSCAN clustering instead of raw features - Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE - Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues - ClickHouse: ae_recon_error column added to ml_all_scores - Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:40:39 +02:00
toto	ecceb04174	perf(clickhouse): P3 - view_ip_recurrence with TTL filter + remove FINAL view_ip_recurrence: Added WHERE detected_at >= now() - INTERVAL 30 DAY → With PARTITION BY (P1), ClickHouse prunes the partitions outside this range before reading any data. The view scans only the active partitions (instead of the 30 complete daily partitions). → ORDER BY (src_ip) ensures that the GROUP BY src_ip reads contiguous data (no in-memory reordering). rotation.py: remove FINAL on ml_detected_anomalies: FINAL forces a full in-memory deduplication of the ReplacingMergeTree (equivalent to a DISTINCT over the whole table), one of the most expensive operations in ClickHouse. Fix: replace the FINAL sub-SELECT with view_ip_recurrence (already aggregated by src_ip, returns recurrence directly without FINAL). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:33:29 +02:00
toto	f4ffe3410a	perf(clickhouse): P1 - partitioning + skipping indexes on ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h Problem: every dashboard query with WHERE detected_at >= now() - INTERVAL N did a full scan because ml_detected_anomalies had ORDER BY (src_ip) with no partition or time index. Changes: - 06_ml_tables.sql: * ml_detected_anomalies: PARTITION BY toYYYYMMDD(detected_at) → daily partition pruning on all time-based queries * INDEX idx_detected_at (minmax) → skips granules outside the range * INDEX idx_threat_level set(8) → skipping for countIf(threat_level = ...) * INDEX idx_bot_name bloom_filter → skipping for bot_name != '' * ttl_only_drop_parts = 1 → TTL by dropping whole partitions * ml_all_scores: same treatment (PARTITION BY + 2 indexes) - 04_mv_http_logs.sql: * http_logs: INDEX idx_src_ip bloom_filter(0.01) → queries with WHERE src_ip = X (analysis.py, variability.py) skip ~90% of granules without scanning the whole time range * INDEX idx_ja4 bloom_filter(0.01) → same for JA4 filters - 05_aggregation_tables.sql: * agg_host_ip_ja4_1h: PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...) → investigation_summary.py and rotation.py (WHERE src_ip = X) automatically use the projection instead of scanning every window_start - 10_perf_indexes.sql (new): * ALTER TABLE migration for existing instances * ADD INDEX + MATERIALIZE INDEX for the 4 tables * ADD PROJECTION + MATERIALIZE PROJECTION for agg_host_ip_ja4_1h * Note: PARTITION BY on an existing table requires recreating it (documented) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 22:28:04 +02:00
toto	9f3e0621e5	feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing) Architecture: - ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs) - ja4_processing: analytics, aggregation, ML, dictionaries, audit Configuration (env vars): - CLICKHOUSE_DB_LOGS (default: ja4_logs) - CLICKHOUSE_DB_PROCESSING (default: ja4_processing) Changes: - SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing with correct cross-database references (MVs, views, dicts) - deploy_schema.sh: substitutes DB names from env vars at deploy time - Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING - Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING - Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable - Correlator: DSN example updated to ja4_logs - Docker-compose + .env files: new env vars with defaults - All documentation updated (14 markdown files) All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 19:10:35 +02:00
toto	d469e39da7	feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized Services: - ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap) - logcorrelator: JA4 log correlation engine (Go, ClickHouse) - mod_reqin_log: Apache module (C, JSON request logging) - bot_detector: ML bot detection pipeline (Python) - dashboard: FastAPI/Streamlit analytics UI (Python) Shared libraries: - shared/go/ja4common: logger, config, shutdown, ipfilter (Go module) - shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package) - shared/clickhouse/: canonical SQL migrations (10 files) Build & packaging: - Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10) - go.work workspace linking sentinel, correlator, ja4common - Makefile with test-all, build-all, rpm-* targets Fixes applied: - go.work: 1.21 → 1.24.6 (required by sentinel) - correlator Dockerfiles: golang:1.21 → golang:1.24 - replace directives in go.mod for ja4common local path - pyproject.toml: setuptools.backends → setuptools.build_meta - Removed static libpcap linking (unavailable on Rocky 9) - Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32) - Rewrote corrupted test files (logger_test.go × 2) Test coverage: - correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%) - sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse) Documentation: - README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 16:42:59 +02:00

11 Commits
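
The score combination described in commits 57cf6c3828 and 8d58f2b932 can be sketched roughly as below. This is a minimal illustration, not the repository's code: the function name `combine_scores` is hypothetical, all three scores are assumed to be normalized to [0, 1], and the EIF/AE blend is assumed to be a convex combination weighted by AE_WEIGHT (the commit only writes "(EIF+AE)"). The fallback branches mirror the "graceful fallback" behaviour the commit messages mention.

```python
from typing import Optional

# Default weights quoted in the commit messages.
AE_WEIGHT = 0.30   # alpha: share of the autoencoder in the unsupervised score
XGB_WEIGHT = 0.20  # beta: share of the XGBoost probability in the final score


def combine_scores(
    eif: float,
    ae: Optional[float] = None,
    xgb_prob: Optional[float] = None,
    ae_weight: float = AE_WEIGHT,
    xgb_weight: float = XGB_WEIGHT,
) -> float:
    """Blend the three 'voices' into one anomaly score (hypothetical helper).

    Assumption: every input is already scaled to [0, 1].
    """
    # Unsupervised part: EIF only when the autoencoder score is unavailable.
    if ae is None:
        unsupervised = eif
    else:
        # Assumed convex blend of EIF and AE; the source does not spell it out.
        unsupervised = (1.0 - ae_weight) * eif + ae_weight * ae

    # Supervised part: skipped when XGBoost is unavailable or untrained.
    if xgb_prob is None:
        return unsupervised

    # Meta-learner from the commit message: (1 - beta) * (EIF+AE) + beta * xgb_prob
    return (1.0 - xgb_weight) * unsupervised + xgb_weight * xgb_prob
```

Calling `combine_scores(eif)` alone returns the EIF score unchanged, which corresponds to the EIF-only fallback; passing `ae` and `xgb_prob` adds the second and third voices in turn.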