feat(clustering): integrate HTTP header fingerprint (agg_header_fingerprint_1h)

Sources of the new features:
- agg_header_fingerprint_1h: Cookie, Referer per src_ip (JOIN on IPv6)
- ml_detected_anomalies: header_order_shared_count, distinct_header_orders (already joined)

New features (indices 27-30):
[27] FP Popularité: popularity of the header fingerprint (log1p/log1p(500k))
     rare fingerprint (hand-crafted bot) → 0.0; very popular (browser) → 1.0
[28] FP Rotation: distinct_header_orders (log1p/log1p(10))
     fingerprint rotation between requests = bot behavior
[29] Cookie Présent: Cookie header present (real user engagement)
[30] Referer Présent: Referer header present (normal HTTP navigation)

risk_score_from_centroid(): 14 terms, sum=1.0
+ hfp_rare (1-popularity) × 0.06
+ hfp_rotating × 0.06
ML × 0.25 remains dominant

name_cluster(): 2 new labels
'🔄 Bot fingerprint tournant': hfp_rotating>0.6 + anomaly>0.15
'🕵️ Fingerprint rare suspect': hfp_popular<0.15 + anomaly>0.20
'🌐 Navigateur légitime': popular fingerprint confirmed

N_FEATURES: 27 → 31

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 11:13:37 +01:00
parent 8fb054c8b7
commit 6ff59a36d7
2 changed files with 114 additions and 66 deletions
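
As a reading aid, the normalization described in the commit message for features 27-30 can be sketched in isolation. The function names below are hypothetical; the formulas are copied from the lambdas this commit adds to FEATURES, and nothing else about the engine is assumed.

```python
import math

# Hypothetical helper names; each body mirrors a lambda added to FEATURES.

def norm_popularity(shared_count):
    """Feature 27: log-scaled count of IPs sharing the header fingerprint."""
    return min(1.0, math.log1p(float(shared_count or 0)) / math.log1p(500_000))

def norm_rotation(distinct_orders):
    """Feature 28: log-scaled count of distinct header orders for one IP."""
    return min(1.0, math.log1p(float(distinct_orders or 0)) / math.log1p(10))

def norm_presence(ratio):
    """Features 29-30: averaged Cookie/Referer presence, capped at 1.0."""
    return min(1.0, float(ratio or 0))

if __name__ == "__main__":
    for count in (0, 6, 7, 1_000, 500_000):
        print(f"shared_count={count:>7} -> popularity={norm_popularity(count):.3f}")
    for orders in (1, 3, 4, 10):
        print(f"distinct_orders={orders:>2} -> rotation={norm_rotation(orders):.3f}")
```

Under these formulas, a centroid needs an average of more than about three distinct header orders to exceed the hfp_rotating > 0.6 threshold, and a shared count of about six or fewer to fall under hfp_popular < 0.15. A single header order still maps to a non-zero rotation value, since log1p(1) > 0.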
--- a/backend/routes/clustering.py
+++ b/backend/routes/clustering.py
@@ -95,7 +95,18 @@ SELECT
    avg(ml.has_accept_language)          AS hdr_accept_lang,
    any(vh.hdr_enc)                      AS hdr_has_encoding,
    any(vh.hdr_sec_fetch)                AS hdr_has_sec_fetch,
-    any(vh.hdr_count)                    AS hdr_count_raw
+    any(vh.hdr_count)                    AS hdr_count_raw,
    -- HTTP header fingerprint (from agg_header_fingerprint_1h + ml_detected_anomalies)
    -- header_order_shared_count: number of IPs sharing the same fingerprint
    --   → low = rare fingerprint = suspicious behavior
    avg(ml.header_order_shared_count)    AS hfp_shared_count,
    -- distinct_header_orders: number of distinct fingerprints emitted by this IP
    --   → high = fingerprint rotation = bot behavior
    avg(ml.distinct_header_orders)       AS hfp_distinct_orders,
    -- Cookie and Referer come from the dedicated fingerprint table
    any(hfp.hfp_cookie)                  AS hfp_cookie,
    any(hfp.hfp_referer)                 AS hfp_referer
 FROM mabase_prod.agg_host_ip_ja4_1h t
 LEFT JOIN mabase_prod.ml_detected_anomalies ml
    ON t.src_ip = ml.src_ip AND t.ja4 = ml.ja4
@@ -112,6 +123,15 @@ LEFT JOIN (
      AND log_date >= today() - 2
    GROUP BY src_ip_v6, ja4
 ) vh ON t.src_ip = vh.src_ip_v6 AND t.ja4 = vh.ja4
 LEFT JOIN (
    SELECT
        src_ip,
        avg(has_cookie)  AS hfp_cookie,
        avg(has_referer) AS hfp_referer
    FROM mabase_prod.agg_header_fingerprint_1h
    WHERE window_start >= now() - INTERVAL %(hours)s HOUR
    GROUP BY src_ip
 ) hfp ON t.src_ip = hfp.src_ip
 WHERE t.window_start >= now() - INTERVAL %(hours)s HOUR
  AND t.tcp_ttl_raw > 0
 GROUP BY t.src_ip, t.ja4
@@ -124,6 +144,7 @@ _SQL_COLS = [
    "h2_eff", "hdr_conf", "ua_ch_mismatch", "asset_ratio", "direct_ratio",
    "ja4_count", "ua_rotating", "threat", "country", "asn_org",
    "hdr_accept_lang", "hdr_has_encoding", "hdr_has_sec_fetch", "hdr_count_raw",
    "hfp_shared_count", "hfp_distinct_orders", "hfp_cookie", "hfp_referer",
 ]
--- a/backend/services/clustering_engine.py
+++ b/backend/services/clustering_engine.py
@@ -6,34 +6,40 @@ Ref:
  scipy.spatial.ConvexHull     - convex hull (Graham/Qhull)
  sklearn-style API             - centroids, labels_, inertia_
-Features (27 dimensions, normalized to [0,1]):
+Features (31 dimensions, normalized to [0,1]):
-  0  ttl_n             : normalized initial TTL
+  0  ttl_n              : normalized initial TTL
-  1  mss_n             : normalized MSS → network type
+  1  mss_n              : normalized MSS → network type
-  2  scale_n           : TCP scaling factor
+  2  scale_n            : TCP scaling factor
-  3  win_n             : normalized TCP window
+  3  win_n              : normalized TCP window
-  4  score_n           : ML anomaly score (abs)
+  4  score_n            : ML anomaly score (abs)
-  5  velocity_n        : request velocity (log1p)
+  5  velocity_n         : request velocity (log1p)
-  6  fuzzing_n         : fuzzing index (log1p)
+  6  fuzzing_n          : fuzzing index (log1p)
-  7  headless_n        : ratio of headless sessions
+  7  headless_n         : ratio of headless sessions
-  8  post_n            : POST/total ratio
+  8  post_n             : POST/total ratio
-  9  ip_id_zero_n      : ratio of IP-ID=0 (Linux/spoofed)
+  9  ip_id_zero_n       : ratio of IP-ID=0 (Linux/spoofed)
-  10 entropy_n         : temporal entropy
+  10 entropy_n          : temporal entropy
-  11 browser_n         : modern browser score
+  11 browser_n          : modern browser score
-  12 alpn_n            : ALPN/protocol mismatch
+  12 alpn_n             : ALPN/protocol mismatch
-  13 alpn_absent_n     : ratio of missing ALPN
+  13 alpn_absent_n      : ratio of missing ALPN
-  14 h2_n              : H2 multiplexing efficiency (log1p)
+  14 h2_n               : H2 multiplexing efficiency (log1p)
-  15 hdr_conf_n        : header order confidence
+  15 hdr_conf_n         : header order confidence
-  16 ua_ch_n           : User-Agent-Client-Hints mismatch
+  16 ua_ch_n            : User-Agent-Client-Hints mismatch
-  17 asset_n           : static asset ratio
+  17 asset_n            : static asset ratio
-  18 direct_n          : direct access ratio
+  18 direct_n           : direct access ratio
-  19 ja4_div_n         : JA4 diversity (log1p)
+  19 ja4_div_n          : JA4 diversity (log1p)
-  20 ua_rot_n          : rotating UA (boolean)
+  20 ua_rot_n           : rotating UA (boolean)
-  21 country_risk_n    : source country risk (CN/RU/KP → 1.0, US/DE/FR → 0.0)
+  21 country_risk_n     : source country risk (CN/RU/KP → 1.0, US/DE/FR → 0.0)
-  22 asn_cloud_n       : cloud/CDN/VPN hosting provider (Cloudflare/AWS/OVH → 1.0)
+  22 asn_cloud_n        : cloud/CDN/VPN hosting provider (Cloudflare/AWS/OVH → 1.0)
-  23 hdr_accept_lang_n : Accept-Language header present (0=absent=bot-like)
+  23 hdr_accept_lang_n  : Accept-Language header present (0=absent=bot-like)
-  24 hdr_encoding_n    : Accept-Encoding header present  (0=absent=bot-like)
+  24 hdr_encoding_n     : Accept-Encoding header present  (0=absent=bot-like)
-  25 hdr_sec_fetch_n   : Sec-Fetch-* headers present     (1=real browser)
+  25 hdr_sec_fetch_n    : Sec-Fetch-* headers present     (1=real browser)
-  26 hdr_count_n       : normalized HTTP header count (3=bot, 15=browser)
+  26 hdr_count_n        : normalized HTTP header count (3=bot, 15=browser)
  27 hfp_popular_n      : header fingerprint popularity (log-normalized)
                          rare fingerprint = suspicious; very popular = legitimate browser
  28 hfp_rotating_n     : fingerprint rotation (distinct_header_orders)
                          several distinct fingerprints → rotating bot
  29 hfp_cookie_n       : Cookie header present (real user engagement)
  30 hfp_referer_n      : Referer header present (normal HTTP navigation)
 """
 from __future__ import annotations
@@ -155,6 +161,16 @@ FEATURES: list[tuple[str, str, object]] = [
    ("hdr_has_encoding",  "Accept-Encoding",       lambda v: 1.0 if float(v or 0) > 0 else 0.0),
    ("hdr_has_sec_fetch", "Sec-Fetch Headers",     lambda v: 1.0 if float(v or 0) > 0 else 0.0),
    ("hdr_count_raw",     "Nb Headers",            lambda v: min(1.0, float(v or 0) / 20.0)),
    # ── HTTP header fingerprint (agg_header_fingerprint_1h) ──────────────
    # header_order_shared_count: number of IPs sharing this fingerprint
    #   high → popular → legitimate browser (normalized log1p / log1p(500000))
    ("hfp_shared_count",    "FP Popularité",       lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(500_000))),
    # distinct_header_orders: number of distinct fingerprints for this IP
    #   high → fingerprint rotation → bot (normalized log1p / log1p(10))
    ("hfp_distinct_orders", "FP Rotation",         lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(10))),
    # Cookie and Referer: signals of legitimate navigation
    ("hfp_cookie",          "Cookie Présent",      lambda v: min(1.0, float(v or 0))),
    ("hfp_referer",         "Referer Présent",     lambda v: min(1.0, float(v or 0))),
 ]
 FEATURE_KEYS  = [f[0] for f in FEATURES]
@@ -334,38 +350,45 @@ def compute_hulls(coords_2d: np.ndarray, labels: np.ndarray,
 def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
    """Readable name based on the dominant features of the centroid [0,1]."""
    s = centroid
    n = len(s)
    ttl_raw = float(raw_stats.get("mean_ttl", 0))
    mss_raw = float(raw_stats.get("mean_mss", 0))
-    country_risk_v = s[21] if len(s) > 21 else 0.0
+    country_risk_v = s[21] if n > 21 else 0.0
-    asn_cloud      = s[22] if len(s) > 22 else 0.0
+    asn_cloud      = s[22] if n > 22 else 0.0
-    # Header features (indices 23-26)
+    accept_lang    = s[23] if n > 23 else 1.0
-    accept_lang    = s[23] if len(s) > 23 else 1.0
+    accept_enc     = s[24] if n > 24 else 1.0
-    accept_enc     = s[24] if len(s) > 24 else 1.0
+    sec_fetch      = s[25] if n > 25 else 0.0
-    sec_fetch      = s[25] if len(s) > 25 else 0.0
+    hdr_count      = s[26] if n > 26 else 0.5
-    hdr_count      = s[26] if len(s) > 26 else 0.5
+    hfp_popular    = s[27] if n > 27 else 0.5
    hfp_rotating   = s[28] if n > 28 else 0.0
-    # Pure scanner: no browser headers, few headers
+    # Pure scanner: no browser headers, rare fingerprint, few headers
    if accept_lang < 0.15 and accept_enc < 0.15 and hdr_count < 0.25:
        return "🤖 Scanner pur (no headers)"
    # Rotating AND suspicious fingerprint: bot that switches header profiles
    if hfp_rotating > 0.6 and s[4] > 0.15:
        return "🔄 Bot fingerprint tournant"
    # Very rare fingerprint plus anomaly: one-off hand-crafted bot
    if hfp_popular < 0.15 and s[4] > 0.20:
        return "🕵️ Fingerprint rare suspect"
    # Masscan scanners
    if s[0] > 0.16 and s[0] < 0.25 and mss_raw in range(1440, 1460) and s[2] > 0.25:
        return "🤖 Masscan Scanner"
-    # Aggressive offensive bots (fuzzing + anomaly + no browser headers)
+    # Aggressive offensive bots (fuzzing + anomaly)
    if s[4] > 0.40 and s[6] > 0.3:
        return "🤖 Bot agressif"
-    # Bot that imitates a browser but lacks the real headers (ua_ch + missing sec_fetch)
+    # Bot that imitates a browser but lacks the real headers
    if s[16] > 0.40 and sec_fetch < 0.2 and accept_lang < 0.3:
        return "🤖 Bot UA simulé"
-    # Very high-risk countries (CN, RU, KP) with abnormal traffic
+    # Very high-risk countries with abnormal traffic
    if country_risk_v > 0.75 and (s[4] > 0.10 or asn_cloud > 0.5):
        return "🌏 Source pays risqué"
-    # Cloud + UA-CH mismatch = cloud crawler/bot
+    # Cloud + UA-CH mismatch
    if s[16] > 0.50 and asn_cloud > 0.70:
        return "☁️ Bot cloud UA-CH"
    # UA-CH mismatch alone
    if s[16] > 0.60:
        return "🤖 UA-CH Mismatch"
-    # Headless browser with real browser headers (Puppeteer, Playwright)
+    # Headless browser (Puppeteer/Playwright): has the Sec-Fetch headers but is headless
    if s[7] > 0.50 and sec_fetch > 0.5:
        return "🤖 Headless Browser"
    if s[7] > 0.50:
@@ -379,8 +402,9 @@ def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
    # High-risk country with no other signal
    if country_risk_v > 0.60:
        return "🌏 Trafic suspect (pays)"
-    # Legitimate browser: all headers present
+    # Legitimate browser: all positive signals, including a popular fingerprint
-    if accept_lang > 0.7 and accept_enc > 0.7 and sec_fetch > 0.6 and hdr_count > 0.5:
+    if (accept_lang > 0.7 and accept_enc > 0.7 and sec_fetch > 0.5
            and hdr_count > 0.5 and hfp_popular > 0.5):
        return "🌐 Navigateur légitime"
    # OS fingerprinting
    if s[3] > 0.85 and ttl_raw > 120:
@@ -399,31 +423,34 @@ def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
 def risk_score_from_centroid(centroid: np.ndarray) -> float:
    """
    Risk score [0,1] from the centroid (original [0,1] space).
-    Incorporates country, cloud infrastructure, and HTTP header profile.
+    31 features; weights calibrated to sum to 1.0.
    Weights calibrated to sum to 1.0.
    """
    s = centroid
-    country_risk_v = s[21] if len(s) > 21 else 0.0
+    n = len(s)
-    asn_cloud      = s[22] if len(s) > 22 else 0.0
+    country_risk_v = s[21] if n > 21 else 0.0
-    # Missing header = risk → invert (1 - presence)
+    asn_cloud      = s[22] if n > 22 else 0.0
-    no_accept_lang = 1.0 - (s[23] if len(s) > 23 else 1.0)
+    no_accept_lang = 1.0 - (s[23] if n > 23 else 1.0)
-    no_encoding    = 1.0 - (s[24] if len(s) > 24 else 1.0)
+    no_encoding    = 1.0 - (s[24] if n > 24 else 1.0)
-    no_sec_fetch   = 1.0 - (s[25] if len(s) > 25 else 0.0)
+    no_sec_fetch   = 1.0 - (s[25] if n > 25 else 0.0)
-    # Few headers → bot: maximum risk when hdr_count=0
+    few_headers    = 1.0 - (s[26] if n > 26 else 0.5)
-    few_headers    = 1.0 - (s[26] if len(s) > 26 else 0.5)
+    # Rare fingerprint = suspicious (low popularity), rotating fingerprint = bot
    hfp_rare       = 1.0 - (s[27] if n > 27 else 0.5)
    hfp_rotating   = s[28] if n > 28 else 0.0
    return float(np.clip(
-        0.28 * s[4]          +   # ML anomaly score (main)
+        0.25 * s[4]          +   # ML anomaly score (main)
-        0.10 * s[6]          +   # fuzzing
+        0.09 * s[6]          +   # fuzzing
-        0.08 * s[16]         +   # UA-CH mismatch
+        0.07 * s[16]         +   # UA-CH mismatch
-        0.07 * s[7]          +   # headless
+        0.06 * s[7]          +   # headless
-        0.06 * s[5]          +   # velocity
+        0.05 * s[5]          +   # velocity
-        0.06 * s[9]          +   # IP-ID zero
+        0.05 * s[9]          +   # IP-ID zero
-        0.10 * country_risk_v+   # source country risk
+        0.09 * country_risk_v+   # source country risk
-        0.07 * asn_cloud     +   # cloud/VPN infrastructure
+        0.06 * asn_cloud     +   # cloud/VPN infrastructure
-        0.05 * no_accept_lang+   # missing Accept-Language
+        0.04 * no_accept_lang+   # missing Accept-Language
-        0.05 * no_encoding   +   # missing Accept-Encoding
+        0.04 * no_encoding   +   # missing Accept-Encoding
        0.04 * no_sec_fetch  +   # missing Sec-Fetch (not a real browser)
-        0.04 * few_headers,      # very few headers (scanner/curl)
+        0.04 * few_headers   +   # very few headers (scanner/curl)
        0.06 * hfp_rare      +   # rare header fingerprint = suspicious
        0.06 * hfp_rotating,     # fingerprint rotation = bot
        0.0, 1.0
    ))
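
The commit message states that the 14 terms of risk_score_from_centroid() sum to 1.0 and that the ML term stays dominant. A minimal sketch of how that invariant could be checked is shown here. The dictionary name and its keys are illustrative labels, not identifiers from the codebase; only the numeric weights are taken from the "+" side of the hunk above.

```python
import math

# Weights copied from the updated risk_score_from_centroid() in this diff.
# RISK_WEIGHTS and its keys are hypothetical names used only for this check.
RISK_WEIGHTS = {
    "ml_anomaly": 0.25,
    "fuzzing": 0.09,
    "ua_ch_mismatch": 0.07,
    "headless": 0.06,
    "velocity": 0.05,
    "ip_id_zero": 0.05,
    "country_risk": 0.09,
    "asn_cloud": 0.06,
    "no_accept_lang": 0.04,
    "no_encoding": 0.04,
    "no_sec_fetch": 0.04,
    "few_headers": 0.04,
    "hfp_rare": 0.06,
    "hfp_rotating": 0.06,
}

# fsum avoids accumulating floating-point error across the 14 terms.
total = math.fsum(RISK_WEIGHTS.values())

if __name__ == "__main__":
    assert len(RISK_WEIGHTS) == 14, "expected 14 weighted terms"
    assert math.isclose(total, 1.0), f"weights sum to {total}, not 1.0"
    print(f"{len(RISK_WEIGHTS)} terms, sum = {total:.2f}")
```

Keeping the weights in one table like this would make a future rebalancing easier to review than editing literals inline, though the commit itself keeps them inline.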