ja4-platform

Author	SHA1	Message	Date
toto	e52cdcc01f	feat(bot-detector): Browser Signature Detection engine (parallel mode) Étape A — browser_signatures.py Données pures : BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES (curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. Valeurs H2 extraites des captures réelles (format Akamai avec virgules, non semicolons). Étape B — browser_matcher.py Moteur vectorisé 7 dimensions (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15, pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10, JA4 dict 0.05). run_browser_matcher(df) ajoute bm_family/bm_score/bm_decision. CDN edge case : dimension H2 neutralisée (0.5) si has_xff=1. BROWSER_MATCHER_REPLACE=false par défaut (mode DUAL_MODE logging uniquement). Étape C — 06_browser_signature_detection.sql (migration) Crée browser_h2_signatures (table MergeTree avec 12 fingerprints de référence). Recrée dict_browser_h2 depuis la table avec champ confidence (remplace CSV). Étape D — 07_ai_features_view.sql +h2_wu_val dans le JOIN http_logs, +h2_window_update_value, +h2_dict_family, +h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent}, +h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw, +tls_h2_family_mismatch (détection incohérence famille JA4 vs famille H2). Étape E — preprocessing.py + pipeline.py preprocessing.py: appelle run_browser_matcher() après compute_browser_axes(), ajoute 7 nouvelles features binaires H2 à FEATURES et binary_features. pipeline.py: appelle log_dual_mode_comparison() après la classification A9. BROWSER_MATCHER_REPLACE=true active le remplacement du bypass. Étape F — test_browser_matcher.py 8 tests : Chrome/Firefox/Safari full match, curl rejeté, httpcloak partiel, TLS↔H2 mismatch, CDN proxy neutralisation, go net/http rejeté. Tous 8 PASSED (+ 36 tests existants inchangés). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-10 13:52:57 +02:00
toto	14db3d9040	refactor: suppression dépendance User-Agent de la détection navigateur Changements SQL : - modern_browser_score : sec-ch-ua→100, Sec-Fetch→70 (plus de UA fallback) - Ajout has_sec_ch_ua (UInt8) dans agg_header_fingerprint_1h et ml_all_scores - mss_mobile_mismatch utilise has_sec_ch_ua au lieu de modern_browser_score - header_order_confidence : PARTITION BY ja4 au lieu de first_ua - sec_ch_mobile_mismatch : comparaison Client Hints interne (sans UA) - Migration 03_remove_ua_browser_detection.sql Changements Python : - browser.py Axe 3 : Client Hints + Sec-Fetch + is_fake_navigation (PAS de UA) - Pondération axes : ja4_known 0.30, tls_coherence 0.20 (signaux TLS renforcés) - preprocessing.py : has_sec_ch_ua ajouté aux features et binary_features Fichiers modifiés : 8 SQL/Python + 1 migration, 36/36 tests passent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 23:06:01 +02:00
toto	f1547423b5	refactor(bot-detector): suppression monolithe, tests multifactoriels - Suppression de bot_detector.py (1982 lignes) remplacé par 11 modules - Tests navigateur mis à jour pour le système multifactoriel (browser_confidence) - 36/36 tests passent avec la nouvelle structure modulaire Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-09 01:03:17 +02:00
toto	9a48fb9d29	feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency Add browser legitimacy classification (A9) to the bot detection pipeline: - New features: is_known_browser (binary) and browser_consistency_score [0..5] combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language, cookies, Sec-Fetch-* presence - Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW threat level are reclassified as LEGITIMATE_BROWSER - Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays in normal anomaly scoring — prevents evasion via JA4 spoofing - XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label) - ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores - Dashboard: browser_family filter/sort on detections and scores endpoints, legitimate_browsers count and browser_stats in overview - 6 new unit tests covering classification threshold, spoofing, exclusion logic Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 15:46:22 +02:00
toto	8d58f2b932	feat(bot-detector): add XGBoost supervised third voice (#10 ) Triple-voice ensemble architecture: - EIF (non-supervisé, anomalies zero-day) - Autoencoder (non-supervisé, corrélations non-linéaires) - XGBoost (supervisé, patterns connus + feedback SOC) XGBoost implementation: - Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1) - Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required - Score = predict_proba, combined via meta-learner: (1-β)(EIF+AE) + βxgb_prob - Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS - Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only - ClickHouse: xgb_prob column added to ml_all_scores - Tests: 4 new tests (availability, train/predict, meta-learner, save/load) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:45:57 +02:00
toto	57cf6c3828	feat(bot-detector): add parallel Autoencoder scorer (#9 ) - TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU - Trained alongside EIF on human_baseline, saved/loaded with model versioning - Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30) - AE latent space (16-dim) used for HDBSCAN clustering instead of raw features - Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE - Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues - ClickHouse: ae_recon_error column added to ml_all_scores - Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:40:39 +02:00
toto	f6e2d3c0ca	feat(bot-detector): implement 8 state-of-art improvements - EIF: Extended Isolation Forest via isotree (fallback to sklearn IF) - Benford's Law deviation feature on inter-request timing - Lag-1 autocorrelation feature for cadence analysis - Validation gate: reject model if val_anomaly_rate > 20% - Feature pruning: remove variance < 1e-6 features before training - Quantile drift: replace N(μ,σ) synthetic with quantile interpolation - Thread safety: Lock for _service_healthy/_consecutive_failures - Score normalization: inverted to [0,1] where 1=most anomalous SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h Tests: 10 new test functions covering all improvements Integration: verify_mvs.py checks new thesis feature columns Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 02:31:26 +02:00
toto	9f3e0621e5	feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing) Architecture: - ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs) - ja4_processing: analytics, aggregation, ML, dictionaries, audit Configuration (env vars): - CLICKHOUSE_DB_LOGS (default: ja4_logs) - CLICKHOUSE_DB_PROCESSING (default: ja4_processing) Changes: - SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing with correct cross-database references (MVs, views, dicts) - deploy_schema.sh: substitutes DB names from env vars at deploy time - Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING - Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING - Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable - Correlator: DSN example updated to ja4_logs - Docker-compose + .env files: new env vars with defaults - All documentation updated (14 markdown files) All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 19:10:35 +02:00
toto	d469e39da7	feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized Services: - ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap) - logcorrelator: JA4 log correlation engine (Go, ClickHouse) - mod_reqin_log: Apache module (C, JSON request logging) - bot_detector: ML bot detection pipeline (Python) - dashboard: FastAPI/Streamlit analytics UI (Python) Shared libraries: - shared/go/ja4common: logger, config, shutdown, ipfilter (Go module) - shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package) - shared/clickhouse/: canonical SQL migrations (10 files) Build & packaging: - Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10) - go.work workspace linking sentinel, correlator, ja4common - Makefile with test-all, build-all, rpm-* targets Fixes applied: - go.work: 1.21 → 1.24.6 (required by sentinel) - correlator Dockerfiles: golang:1.21 → golang:1.24 - replace directives in go.mod for ja4common local path - pyproject.toml: setuptools.backends → setuptools.build_meta - Removed static libpcap linking (unavailable on Rocky 9) - Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32) - Rewrote corrupted test files (logger_test.go × 2) Test coverage: - correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%) - sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse) Documentation: - README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 16:42:59 +02:00

9 Commits