feat(h2): direct per-parameter SETTINGS comparison in browser_matcher

- Rewrote _d1_h2_settings() with 3-signal weighted formula:
  direct_score×0.60 + dict_match×0.30 + ja4_coherence×0.10
  when individual SETTINGS cols are available in the DataFrame
- Added _H2_SETTINGS_COLS dict (IDs 1,2,3,4,5,6,8 → column names)
- Fallback to dict_match×0.80 + ja4_coherence×0.20 for backward compat
- Fix view_ai_features_1h: pass 7 individual SETTINGS columns through
  base_data CTE (h2_header_table_size, h2_enable_push,
  h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size,
  h2_max_header_list_size, h2_enable_connect_protocol)
- Remove non-existent h2_dict_confidence reference from view SQL
  (dict_browser_h2 only exposes browser_family attribute)
- Add 7 new pytest cases: exact match, one wrong setting, forbidden key
  penalty, unknown fingerprint with correct settings, fallback path,
  CDN proxy neutralisation, full Chrome simulation
- 53/53 bot-detector tests pass
- Update thesis §3.9.2: document direct comparison algorithm + fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 03:05:36 +02:00
parent 95e87149aa
commit f704541f83
4 changed files with 259 additions and 45 deletions
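
The scoring change described in the commit message can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from `browser_matcher.py`: the names `direct_score` and `d1_h2_settings_sketch`, the plain-dict row standing in for a DataFrame row, the pairing of SETTINGS IDs with column names in listed order, and the use of a negative value to mark an absent setting (suggested by the thesis pseudocode `col < 0`) are all assumptions.

```python
from typing import Mapping, Optional

# Assumed ID -> column mapping, pairing the IDs and column names
# listed in the commit message in order (hypothetical reconstruction).
H2_SETTINGS_COLS = {
    1: "h2_header_table_size",
    2: "h2_enable_push",
    3: "h2_max_concurrent_streams",
    4: "h2_initial_window_size",
    5: "h2_max_frame_size",
    6: "h2_max_header_list_size",
    8: "h2_enable_connect_protocol",
}

# Chrome values copied from the signature shown in the thesis diff.
CHROME = {
    "h2_settings_exact": {1: 65536, 2: 0, 4: 6291456, 6: 262144},
    "h2_settings_forbidden_keys": [3, 5],
}


def direct_score(row: Mapping[str, int], signature: Mapping) -> Optional[float]:
    """Fraction of passed checks, or None if a needed column is missing."""
    passed = 0
    total = 0
    for key, expected in signature["h2_settings_exact"].items():
        col = H2_SETTINGS_COLS[key]
        if col not in row:
            return None  # triggers the fallback path
        total += 1
        passed += int(row[col] == expected)
    for key in signature["h2_settings_forbidden_keys"]:
        col = H2_SETTINGS_COLS[key]
        if col not in row:
            return None
        total += 1
        passed += int(row[col] < 0)  # assumption: negative means "absent"
    return passed / total if total else None


def d1_h2_settings_sketch(row, signature, dict_match, ja4_coherence):
    """Three-signal weighting, with the dict-only fallback."""
    direct = direct_score(row, signature)
    if direct is None:
        return dict_match * 0.80 + ja4_coherence * 0.20
    return direct * 0.60 + dict_match * 0.30 + ja4_coherence * 0.10


# Chrome-like client that also sends the forbidden key 3 (value 200).
row = {
    "h2_header_table_size": 65536,
    "h2_enable_push": 0,
    "h2_max_concurrent_streams": 200,
    "h2_initial_window_size": 6291456,
    "h2_max_frame_size": -1,
    "h2_max_header_list_size": 262144,
    "h2_enable_connect_protocol": -1,
}
score = d1_h2_settings_sketch(row, CHROME, dict_match=0.0, ja4_coherence=0.0)
```

With this row, five of the six checks pass, so the direct signal contributes 5/6 × 0.60 even though the dictionary lookup contributes nothing. An empty row has none of the individual columns and therefore takes the fallback weighting.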
--- a/docs/THESIS_HTTP_Traffic_Detection.md
+++ b/docs/THESIS_HTTP_Traffic_Detection.md
@@ -1192,32 +1192,40 @@ Le moteur browser_matcher est structuré en deux modules Python distincts, sépa
 This module is the **signature database**, organised by browser family. Each entry is a structured object defining:

 ```python
-BrowserSignature(
-    family="chrome",
-    h2_settings_exact={
-        1: 65536,   # HEADER_TABLE_SIZE
-        2: 0,       # ENABLE_PUSH
-        4: 6291456, # INITIAL_WINDOW_SIZE
-        6: 262144,  # MAX_HEADER_LIST_SIZE
+# Excerpt from browser_signatures.py: Chrome signature
+BROWSER_SIGNATURES["Chrome"] = {
+    "h2_settings_exact": {
+        1: 65536,    # HEADER_TABLE_SIZE
+        2: 0,        # ENABLE_PUSH (disabled)
+        4: 6291456,  # INITIAL_WINDOW_SIZE
+        6: 262144,   # MAX_HEADER_LIST_SIZE
    },
-    h2_settings_forbidden_keys=[3, 5, 9],  # keys that must NOT be present
-    h2_window_update=15663105,
-    h2_window_update_tolerance=1000,
-    h2_priority_frames_expected=False,
-    pseudo_header_order=["m", "a", "s", "p"],  # masp
-    tls={
-        "min_version": "TLS1.3",
-        "required_alpn": ["h2"],
-        "expect_grease": True,
-        "cipher_count_range": (15, 20),
-        "extension_count_range": (12, 18),
-    },
-    headers_required=["sec-fetch-site", "sec-fetch-mode", "sec-ch-ua"],
-    headers_forbidden=["x-urllib-version", "user-agent: python", "go-http-client"],
-)
+    "h2_settings_forbidden_keys": [3, 5],  # MAX_CONCURRENT_STREAMS and MAX_FRAME_SIZE must be absent
+    "h2_window_update": 15663105,
+    "h2_window_update_tolerance": 1000,
+    "h2_priority_frames_expected": False,  # Chrome ≥119 uses PRIORITY_UPDATE (RFC 9218)
+    "pseudo_header_order": "m,a,s,p",
+    "tls": {"ja4_families": ["Chromium", "Chrome", "Edge"], "grease_expected": True},
+    "headers_sec_fetch_required": True,
+    "headers_ch_ua_required": True,
+    "accept_language_required": True,
+}
 ```

-The module also contains a **NON_BROWSER_SIGNATURES** sub-dictionary: negative patterns for curl, python-httpx and Go net/http that invalidate a high partial score. For example, the presence of the header `User-Agent: Go-http-client/2.0` invalidates any browser-match score, regardless of the values of the H2 dimensions.
+The `h2_settings_exact` field is now consumed directly by `_d1_h2_settings()` through the individual columns of `view_ai_features_1h` (`h2_header_table_size`, `h2_enable_push`, `h2_initial_window_size`, `h2_max_header_list_size`, etc.). Comparing column by column gives partial credit to a fingerprint that is missing from the dictionary but structurally close to a known one (e.g. a browser variant not yet catalogued).
+
+**Implementation of `_d1_h2_settings()`: scoring algorithm**:
+
+```python
+# For each expected key (h2_settings_exact): column value == expected value → 1, else 0
+# For each forbidden key (h2_settings_forbidden_keys): column < 0 (absent) → 1, else 0
+direct_score = passed_checks / total_checks
+base = direct_score * 0.60 + dict_match * 0.30 + h2_ja4_coherence * 0.10
+```
+
+**Advantage over the dict-only approach**: a slightly modified fingerprint (e.g. a Chrome variant with one extra SETTINGS field) would be rejected by the dictionary (dict_match=0) but would still earn a substantial D1 score if the key parameters are correct. For example, a client sending `1:65536, 2:0, 4:6291456, 6:262144` (the 4 expected Chrome keys) plus `3:200` (a forbidden key) passes 5 of the 6 checks: the 4 expected values match, forbidden key 5 is absent, and forbidden key 3 is present. Its direct_score is therefore 5/6, which contributes 5/6 × 0.60 = 0.50 to D1, compared with 0 from the dictionary alone.
+
+**Fallback**: if the individual columns are missing from the DataFrame (backward compatibility), the function reverts to the original behaviour, `dict_match × 0.80 + h2_ja4_coherence × 0.20`.

 #### Module browser_matcher.py

@@ -1227,12 +1235,12 @@ Ce module implémente le **moteur de scoring**. Il calcule un score de correspon

 | # | Dimension | Weight | Scoring logic |
 |---|-----------|-------|--------------------|
-| 1 | H2 SETTINGS match | 0.30 | All expected keys present with their exact value **and** no forbidden key → 1.0; each deviation −0.15 |
-| 2 | H2 WINDOW_UPDATE | 0.15 | `|observed − expected| ≤ tolerance` → 1.0; absent (=0) → 0.0; out of tolerance → max(0, 1 − normalised_distance) |
-| 3 | Pseudo-header order | 0.15 | Exact match → 1.0; 3/4 positions correct → 0.5; otherwise 0.0 |
-| 4 | H2 PRIORITY frames | 0.10 | Presence/absence matches the expectation → 1.0; otherwise 0.0 |
-| 5 | HTTP header consistency | 0.15 | All required headers present (+0.5); no forbidden header present (+0.5) |
-| 6 | TLS structure | 0.10 | TLS 1.3 (+0.25) + ALPN h2 (+0.25) + cipher_count in range (+0.25) + ext_count in range (+0.25) |
+| 1 | H2 SETTINGS match | 0.30 | **Direct per-parameter comparison** (individual columns): each expected key with its exact value → 1.0; each forbidden key absent → 1.0; score = proportion of checks passed. Weighting: `direct × 0.60 + dict_lookup × 0.30 + ja4_coherence × 0.10`. Dict-only fallback if the columns are unavailable |
+| 2 | H2 WINDOW_UPDATE | 0.15 | `|observed − expected| ≤ tolerance` → 1.0; absent (=0) → 0.0; out of tolerance → 0.0 |
+| 3 | Pseudo-header order | 0.15 | Exact match → 1.0; absent → neutral 0.5; otherwise 0.0 |
+| 4 | H2 PRIORITY frames | 0.10 | Presence/absence matches the expectation → 1.0; no H2 data → neutral 0.5 |
+| 5 | HTTP header consistency | 0.15 | Accept-Language (+0.25) + consistent Sec-Fetch (+0.25) + consistent Sec-CH-UA (+0.25) + bonus (+0.25) |
+| 6 | TLS structure | 0.10 | Correct JA4 family (×0.7) + TLS 1.3 (×0.3) |
 | 7 | JA4 dict lookup | 0.05 | Match in `dict_browser_ja4` for this family → 1.0; otherwise 0.0 |

 **General formula**: