docs: full update (7/8 techniques, 85 features, 12 modules)

Reflects the actual state of the system after roadmap steps 1-9:

- §5.2 (fleet_detector NetworkX/Louvain) and §5.8 (Jaccard cross-domain): ✅
- MetaLearner (logistic regression, fixed-weight fallback): documented
- ExIFFI (EIF isolation depth) + per-feature AE error: documented
- KL divergence as a complement to KS, adversarial drift: documented
- HTTP/2 fingerprinting (h2_fingerprint, dict_browser_h2, axis_h2_coherence): documented
- Cycle metrics (metrics.py, ml_performance_metrics, alerts): documented
- Browser confidence: 5 axes → 6 axes (axis_h2_coherence)
- 85 features (73 FEATURES + 12 FEATURES_COMPLET), 12 modules, 53 dashboard routes
- Thesis compliance: 99.4% (was 97.9%), §5: 87.5% (was 62.5%)
- New tables: fleet_detections, ml_performance_metrics, soc_feedback
- Dictionaries: 8 (dict_browser_h2 added)
- Dashboard: 16 pages + 37 API routes (fleet, health added)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:31:20 +02:00
parent edbb4aed2c
commit 51dd376f7a
6 changed files with 383 additions and 164 deletions
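
The score-fusion rule documented in the diff below (MetaLearner with a fixed-weight fallback) can be sketched as follows. This is a minimal illustration, not the project's code: the formula, α=0.30, β=0.20 and the 1000-label threshold come from the documentation, while the names `fallback_score`, `final_score`, `MIN_LABELS`, the plain-callable `meta_learner` and the idea that the label count gates the switch are assumptions made for the example.

```python
from typing import Callable, Optional

ALPHA = 0.30       # weight of the autoencoder in the unsupervised blend (from the docs)
BETA = 0.20        # weight of the XGBoost probability (from the docs)
MIN_LABELS = 1000  # label threshold quoted in the docs; constant name is hypothetical


def fallback_score(eif_norm: float, ae_norm: float, xgb_prob: float,
                   alpha: float = ALPHA, beta: float = BETA) -> float:
    """Fixed linear weighting: (1-β) × ((1-α) × eif + α × ae) + β × xgb."""
    unsupervised = (1 - alpha) * eif_norm + alpha * ae_norm
    return (1 - beta) * unsupervised + beta * xgb_prob


def final_score(eif_norm: float, ae_norm: float, xgb_prob: float, n_labels: int,
                meta_learner: Optional[Callable[[float, float, float], float]] = None
                ) -> float:
    """Use the meta-learner once enough labels exist, otherwise the fixed weights.

    Gating on n_labels is an assumption about how the threshold is applied.
    """
    if meta_learner is not None and n_labels >= MIN_LABELS:
        return meta_learner(eif_norm, ae_norm, xgb_prob)
    return fallback_score(eif_norm, ae_norm, xgb_prob)


if __name__ == "__main__":
    # Too few labels: the fixed weighting is used even if a meta-learner is passed.
    print(final_score(0.9, 0.5, 0.1, n_labels=200, meta_learner=lambda e, a, x: 0.42))
```

With all three inputs equal, the fallback returns that same value, since the weights at each level sum to 1; this makes a quick sanity check when changing α or β.
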
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -57,13 +57,14 @@ ja4-platform is a security pipeline that captures network traffic in real ti
     │                        │                              │  htmx + Chart.js       │
     │  Reads:                │                              │  Tailwind CSS (CDN)    │
     │   view_ai_features_1h  │                              │                        │
-     │   view_thesis_feat_1h  │                              │  55 routes (API+pages) │
-     │   view_ip_recurrence   │                              │  15 Jinja2 templates   │
-     │  Writes:               │                              │  14 SOC pages          │
-     │   ml_detected_anomalies│                              │                        │
-     │   ml_all_scores        │                              │  Reads: ml_*, agg_*,   │
-     └───────────────────────┘                              │   http_logs, audit_logs│
-                                                             └───────────────────────┘
+     │   view_thesis_feat_1h  │                              │  53 routes (37 API     │
+     │   view_ip_recurrence   │                              │   + 16 HTML pages)     │
+     │  Writes:               │                              │  16 Jinja2 templates   │
+     │   ml_detected_anomalies│                              │  16 SOC pages          │
+     │   ml_all_scores        │                              │                        │
+     │   fleet_detections     │                              │  Reads: ml_*, agg_*,   │
+     │   ml_performance_metrics│                             │   http_logs, audit_logs│
+     └───────────────────────┘                              └───────────────────────┘
 ```

 ## Data flow: 5 phases
@@ -96,25 +97,29 @@ ja4-platform is a security pipeline that captures network traffic in real ti

 ### Phase 4: Detection

-7. **bot-detector** (Python 3.11, 11 modules) runs on a 5-minute cycle:
+7. **bot-detector** (Python 3.11, 12 modules) runs on a 5-minute cycle:
   - **Two-branch pipeline**:
-     - **Full** (L3→L7, ~63 features, `correlated=1`): correlated TCP+TLS+HTTP traffic
-     - **Application** (L7 only, ~51 features, `correlated=0`): uncorrelated HTTP traffic
+     - **Full** (L3→L7, ~85 features, `correlated=1`): correlated TCP+TLS+HTTP traffic
+     - **Application** (L7 only, ~73 features, `correlated=0`): uncorrelated HTTP traffic
   - **Three-voice ensemble**:
     - **Extended Isolation Forest** (isotree): main unsupervised scorer
     - **Autoencoder** (PyTorch, architecture n→64→32→16→32→64→n): reconstruction error
     - **XGBoost**: supervised, trained on SOC labels (`soc_feedback`)
-   - **Final score**: `final = (1-β) × ((1-α) × eif_norm + α × ae_norm) + β × xgb_prob` (α=0.30, β=0.20)
-   - **Adaptive threshold** by percentile, concept drift detection
+   - **Final score**: `final = meta_learner.predict(eif_norm, ae_norm, xgb_prob, volume, correlated)`, falling back to the fixed linear weighting `(1-β) × ((1-α) × eif_norm + α × ae_norm) + β × xgb_prob` (α=0.30, β=0.20)
+   - **MetaLearner** (logistic regression), trained automatically on the accumulated labels (threshold: 1000 labels)
+   - **Adaptive threshold** by percentile, concept drift detection (KS + KL divergence)
+   - **fleet_detector** (NetworkX): bipartite JA4×ASN graph, `fleet_score`, `fleet_detections` table
   - **HDBSCAN**: clustering into attack campaigns
-   - **Browser detection**: 5 multifactor axes (confidence ≥ 0.55 → `LEGITIMATE_BROWSER`)
+   - **Browser detection**: 6 multifactor axes (confidence ≥ 0.55 → `LEGITIMATE_BROWSER`)
+   - **ExIFFI**: feature importance native to the EIF (alternative to SHAP)
   - **SHAP explainability**: contribution of each feature to the anomaly score
+   - **Cycle metrics** (`metrics.py`): `ml_performance_metrics` table, calibration alerts
   - **Threat levels**: `CRITICAL`, `HIGH`, `MEDIUM`, `LOW`, `NORMAL`, `LEGITIMATE_BROWSER`, `KNOWN_BOT`, `ANUBIS_DENY`, `ANUBIS_ALLOW`

 ### Phase 5: Visualization

-8. **dashboard** (FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS CDN) exposes 55 routes (35 JSON API + 14 HTML pages + health/static) and 15 Jinja2 templates for SOC analysts:
-   - Pages: overview, detections, scores, traffic, ip_detail, ja4_detail, cluster_detail, campaigns, features, models, classify, tactics, reflists, network
+8. **dashboard** (FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS CDN) exposes 53 routes (37 JSON API + 16 HTML pages) and 16 Jinja2 templates for SOC analysts:
+   - Pages: overview, detections, scores, traffic, ip_detail, ja4_detail, cluster_detail, campaigns, features, models, classify, tactics, reflists, network, fleet, health

 ## Component interaction matrix

@@ -149,6 +154,9 @@ ja4-platform is a security pipeline that captures network traffic in real ti
 | `agg_resource_cascade_1h` | mv_agg_resource_cascade_1h | view_thesis_features_1h |
 | `ml_detected_anomalies` | bot-detector | dashboard |
 | `ml_all_scores` | bot-detector | dashboard |
+| `fleet_detections` | bot-detector (`fleet.py`) | dashboard |
+| `ml_performance_metrics` | bot-detector (`metrics.py`) | dashboard |
+| `soc_feedback` | dashboard (`/api/classify`) | bot-detector |
 | `audit_logs` | dashboard | dashboard |
 | `anubis_ip_rules` | fetch_rules.py | dict_anubis_ip |
 | `anubis_asn_rules` | fetch_rules.py | dict_anubis_asn |
@@ -164,7 +172,7 @@ ja4-platform is a security pipeline that captures network traffic in real ti
 | `view_dashboard_entities` | - (view) | dashboard |
 | `view_resource_cascade_1h` | - (view) | dashboard |

-### Dictionaries (7)
+### Dictionaries (8)

 | Dictionary | Layout | Source | Usage |
 |--------------|--------|--------|-------------|
@@ -172,6 +180,7 @@ ja4-platform is a security pipeline that captures network traffic in real ti
 | `dict_bot_ip` | IP_TRIE | Table `bot_ip` | Known bot IPs |
 | `dict_bot_ja4` | COMPLEX_KEY_HASHED | Table `bot_ja4` | Bot JA4 signatures |
 | `dict_browser_ja4` | COMPLEX_KEY_HASHED | Table (CSV) | Browser JA4 signatures |
+| `dict_browser_h2` | COMPLEX_KEY_HASHED | Table (CSV) | HTTP/2 SETTINGS fingerprints per browser |
 | `dict_asn_reputation` | HASHED | CSV file | ASN reputation (isp/datacenter/hosting/cdn) |
 | `dict_anubis_ip` | IP_TRIE | Table `anubis_ip_rules` | Anubis IP/CIDR rules |
 | `dict_anubis_asn` | FLAT | Table `anubis_asn_rules` | Anubis ASN rules |
@@ -196,22 +205,25 @@ view_ai_features_1h ──┐                    ┌─── ml_detected_anomal
 view_thesis_feat_1h ──┤   ┌────────────┐   │
 view_ip_recurrence ───┤   │ Pre-       │   │
                       ├──▶│ processing │──▶│ Fork:
-                      │   │ + filter   │   │  ├── Full        (correlated=1, ~63 feat.)
-                      │   └────────────┘   │  └── Application (correlated=0, ~51 feat.)
+                      │   │ + filter   │   │  ├── Full        (correlated=1, ~85 feat.)
+                      │   └────────────┘   │  └── Application (correlated=0, ~73 feat.)
                       │                    │
                       │   ┌────────────┐   │  For each branch:
                       │   │ Three-     │   │   ├── Extended Isolation Forest (EIF)
                       │   │ voice      │──▶│   ├── Autoencoder (PyTorch)
                       │   │ ensemble   │   │   └── XGBoost (supervised)
                       │   └────────────┘   │
-                      │                    │  Score = (1-β)×((1-α)×EIF + α×AE) + β×XGB
+                      │                    │  Score = MetaLearner(eif, ae, xgb) or
+                      │                    │          (1-β)×((1-α)×EIF + α×AE) + β×XGB
                       │   ┌────────────┐   │
                       │   │ Post-      │   ├─── ml_all_scores
                       └──▶│ processing │──▶│
                           │ HDBSCAN    │   │  Levels: CRITICAL / HIGH / MEDIUM /
-                          │ Browser 5ax│   │   LOW / NORMAL / LEGITIMATE_BROWSER /
-                          │ SHAP       │   │   KNOWN_BOT / ANUBIS_DENY / ANUBIS_ALLOW
-                          └────────────┘   └───
+                          │ fleet.py   │   │   LOW / NORMAL / LEGITIMATE_BROWSER /
+                          │ Browser 6ax│   │   KNOWN_BOT / ANUBIS_DENY / ANUBIS_ALLOW
+                          │ ExIFFI+SHAP│   │
+                          │ metrics.py │   ├─── fleet_detections
+                          └────────────┘   └─── ml_performance_metrics
 ```

 ## JA4/JA3 fingerprint reference
@@ -253,8 +265,9 @@ Both fingerprints are generated by sentinel from the payload of the TLS Clien
 | ML detection: EIF | Python 3.11 + isotree |
 | ML detection: Autoencoder | Python 3.11 + PyTorch |
 | ML detection: Supervised | Python 3.11 + XGBoost |
-| Campaign clustering | HDBSCAN |
-| Explainability | SHAP |
+| ML detection: Ensemble | Python 3.11 + MetaLearner (logistic regression) |
+| Campaign clustering | HDBSCAN + NetworkX (fleet detection) |
+| Explainability | SHAP + ExIFFI |
 | Dashboard backend | FastAPI + Jinja2 (Python 3.11) |
 | Dashboard frontend | htmx + Chart.js + ECharts + Tailwind CSS (CDN) |
 | Data store | ClickHouse 24.8 (dual-database) |