When has_xff=1, the H2 connection is terminated by the reverse proxy/CDN,
so client H2 fingerprints are lost. Previously only D1 (h2_settings) was
neutralized; D2 (window_update), D3 (pseudo_order), and D4 (priority)
still penalized proxied traffic — a real Chrome behind Cloudflare scored
0.0 on 3 dimensions (45% of total weight).
Now all 4 H2 dimensions return 0.5 (neutral) when has_xff>0, and
non-browser H2 detection is also disabled behind proxies.
Tests: 10/10 passed including 3 new XFF-specific cases.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Étape 2 — Fingerprinting HTTP/2 dans le pipeline ML :
- Ajout du dictionnaire dict_browser_h2 (11 familles de navigateurs) dans 05_aggregation_tables.sql
- Ajout du CTE h2_agg et 4 features HTTP/2 dans 07_ai_features_view.sql :
h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- Calcul du fingerprint_coherence_score (5 axes pondérés) dans la vue
- Ajout du 6e axe axis_h2_coherence dans browser.py (poids rééquilibrés)
- browser_h2.csv : 11 fingerprints Akamai → famille navigateur
Étape 3 — Pré-filtre de cohérence sur la baseline humaine :
- pipeline.py exclut les sessions avec fingerprint_coherence_score < seuil de la baseline d'entraînement
- FINGERPRINT_COHERENCE_THRESHOLD configurable via env (défaut 0.25)
- Log des sessions exclues pour analyse SOC
Étape 4 — Détection de drift améliorée :
- scoring.py : passage de 5 à 9 quantiles (p5…p95)
- Ajout de la divergence KL en complément du test KS
- Détection de drift adversarial (≥80% des features dérivent dans la même direction)
- Split temporel strict pour la validation
Étape 5 — Graphe bipartite JA4×ASN (§5.2) :
- fleet.py : détection de flottes via NetworkX + Louvain (imports optionnels)
- enrich_with_fleet_score() : ajout fleet_score + fleet_campaign_flag au DataFrame
- cycle.py : appel après preprocess_df avec log du nombre de sessions en flotte
- SQL migration 05_fleet_metrics_tables.sql : table fleet_detections (TTL 7j)
- Dashboard : /fleet + /api/fleet (communautés détectées) + template fleet.html
Étape 6 — Cross-domain Jaccard §5.8 :
- 12_thesis_features.sql : CTE jaccard_paths → cross_domain_path_similarity
- Signal : même chemins (/admin, /wp-login) sur plusieurs hosts = scanner
Étape 7 — ExIFFI + erreurs AE par feature :
- scoring.py : compute_exiffi_importance() par permutation, compute_ae_feature_errors()
- pipeline.py : calcul ExIFFI sur X_test, mapping index → dict pour anomalies
- build_reason() enrichi avec exiffi_top quand SHAP inactif
Étape 8 — Méta-learner pour la pondération de l'ensemble :
- scoring.py : classe MetaLearner (LogisticRegression, fallback poids fixes <1000 labels)
- Collecte des labels depuis le cycle courant (known_bots, légitimes, Anubis)
- pipeline.py : remplacement des poids fixes par MetaLearner.predict()
Étape 9 — Métriques de performance et monitoring :
- metrics.py : record_cycle_metrics() — taux anomalie, drift, corrélation, latence
- SQL migration 05_fleet_metrics_tables.sql : table ml_performance_metrics (TTL 90j)
- Dashboard : /health + /api/health + template health.html
- cycle.py : appel record_cycle_metrics en fin de cycle (Complet + Applicatif)
Tests : 36/36 bot-detector tests passent
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
compute_shap_top_features, build_reason, cluster_anomalies renamed from
private (_prefixed) to public to match pipeline.py imports.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Suppression de bot_detector.py (1982 lignes) remplacé par 11 modules
- Tests navigateur mis à jour pour le système multifactoriel (browser_confidence)
- 36/36 tests passent avec la nouvelle structure modulaire
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
XGB: query was selecting features from ml_all_scores which doesn't
store them. Now joins ml_all_scores (labels) with view_ai_features_1h
(features). Dynamically discovers available columns to skip thesis §5
features not present in the view. Returns (model, features) tuple.
SHAP: TreeExplainer doesn't support isotree. Fall back to permutation-
based Explainer(model.decision_function, X_sample) for isotree.
Verified: XGB trained on 50000 labels (18436 positives), triple-voice
ensemble scoring active (EIF+AE+XGB), SHAP silent.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name
Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days
SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomal)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove variance < 1e-6 features before training
- Quantile drift: replace N(μ,σ) synthetic with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous
SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>