Commit Graph

9 Commits

Author SHA1 Message Date
14db3d9040 refactor: suppression dépendance User-Agent de la détection navigateur
Changements SQL :
- modern_browser_score : sec-ch-ua→100, Sec-Fetch→70 (plus de UA fallback)
- Ajout has_sec_ch_ua (UInt8) dans agg_header_fingerprint_1h et ml_all_scores
- mss_mobile_mismatch utilise has_sec_ch_ua au lieu de modern_browser_score
- header_order_confidence : PARTITION BY ja4 au lieu de first_ua
- sec_ch_mobile_mismatch : comparaison Client Hints interne (sans UA)
- Migration 03_remove_ua_browser_detection.sql

Changements Python :
- browser.py Axe 3 : Client Hints + Sec-Fetch + is_fake_navigation (PAS de UA)
- Pondération axes : ja4_known 0.30, tls_coherence 0.20 (signaux TLS renforcés)
- preprocessing.py : has_sec_ch_ua ajouté aux features et binary_features

Fichiers modifiés : 8 SQL/Python + 1 migration, 36/36 tests passent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:06:01 +02:00
db306fb9da fix: P0 audit bugs — bot-detector + dashboard + SQL
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
  uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name

Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
  formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days

SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomal)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:33:00 +02:00
9a48fb9d29 feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency
Add browser legitimacy classification (A9) to the bot detection pipeline:

- New features: is_known_browser (binary) and browser_consistency_score [0..5]
  combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language,
  cookies, Sec-Fetch-* presence
- Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW
  threat level are reclassified as LEGITIMATE_BROWSER
- Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays
  in normal anomaly scoring — prevents evasion via JA4 spoofing
- XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label)
- ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores
- Dashboard: browser_family filter/sort on detections and scores endpoints,
  legitimate_browsers count and browser_stats in overview
- 6 new unit tests covering classification threshold, spoofing, exclusion logic

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:46:22 +02:00
8d58f2b932 feat(bot-detector): add XGBoost supervised third voice (#10)
Triple-voice ensemble architecture:
- EIF (non-supervisé, anomalies zero-day)
- Autoencoder (non-supervisé, corrélations non-linéaires)
- XGBoost (supervisé, patterns connus + feedback SOC)

XGBoost implementation:
- Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1)
- Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required
- Score = predict_proba, combined via meta-learner: (1-β)*(EIF+AE) + β*xgb_prob
- Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS
- Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only
- ClickHouse: xgb_prob column added to ml_all_scores
- Tests: 4 new tests (availability, train/predict, meta-learner, save/load)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:45:57 +02:00
57cf6c3828 feat(bot-detector): add parallel Autoencoder scorer (#9)
- TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU
- Trained alongside EIF on human_baseline, saved/loaded with model versioning
- Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30)
- AE latent space (16-dim) used for HDBSCAN clustering instead of raw features
- Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE
- Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues
- ClickHouse: ae_recon_error column added to ml_all_scores
- Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:40:39 +02:00
ecceb04174 perf(clickhouse): P3 — view_ip_recurrence avec filtre TTL + supprimer FINAL
view_ip_recurrence :
  Ajout de WHERE detected_at >= now() - INTERVAL 30 DAY
  → Avec PARTITION BY (P1), ClickHouse élagage les partitions hors de cette
    plage avant même de lire les données. La vue ne scanne que les partitions
    actives (au lieu des 30 partitions journalières complètes).
  → ORDER BY (src_ip) garantit que le GROUP BY src_ip lit des données
    contiguës (aucune réorganisation mémoire).

rotation.py — supprimer FINAL sur ml_detected_anomalies :
  FINAL force une déduplication complète du ReplacingMergeTree en mémoire
  (équivalent à un DISTINCT sur toute la table) — une des opérations les plus
  coûteuses dans ClickHouse.
  Fix : remplacer le sous-SELECT FINAL par view_ip_recurrence (déjà aggrégée
  par src_ip, retourne recurrence directement sans FINAL).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:33:29 +02:00
f4ffe3410a perf(clickhouse): P1 — partition + skipping indexes sur ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h
Problème : toutes les requêtes du dashboard WHERE detected_at >= now() - INTERVAL N
faisaient un full scan car ml_detected_anomalies avait ORDER BY (src_ip) sans
partition ni index temporel.

Changements :
- 06_ml_tables.sql :
  * ml_detected_anomalies : PARTITION BY toYYYYMMDD(detected_at)
    → élagage de partitions journalières sur toutes les requêtes temporelles
  * INDEX idx_detected_at (minmax) → skip des granules hors plage
  * INDEX idx_threat_level set(8) → skip pour countIf(threat_level = ...)
  * INDEX idx_bot_name bloom_filter → skip pour bot_name != ''
  * ttl_only_drop_parts = 1 → TTL par suppression de partition entière
  * ml_all_scores : même traitement (PARTITION BY + 2 indexes)

- 04_mv_http_logs.sql :
  * http_logs : INDEX idx_src_ip bloom_filter(0.01)
    → les requêtes WHERE src_ip = X (analysis.py, variability.py) sautent
    ~90% des granules sans scanner toute la plage temporelle
  * INDEX idx_ja4 bloom_filter(0.01) → idem pour filtres JA4

- 05_aggregation_tables.sql :
  * agg_host_ip_ja4_1h : PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...)
    → investigation_summary.py et rotation.py (WHERE src_ip = X) utilisent
    automatiquement la projection au lieu de scanner tous les window_start

- 10_perf_indexes.sql (nouveau) :
  * Migration ALTER TABLE pour instances existantes
  * ADD INDEX + MATERIALIZE INDEX pour les 4 tables
  * ADD PROJECTION + MATERIALIZE PROJECTION pour agg_host_ip_ja4_1h
  * Note : PARTITION BY sur table existante nécessite recréation (documenté)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:28:04 +02:00
9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00
d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00