feat(clustering): integrate HTTP header fingerprints (agg_header_fingerprint_1h)

Sources of the new features:
- agg_header_fingerprint_1h: Cookie and Referer presence per src_ip (JOIN on IPv6)
- ml_detected_anomalies: header_order_shared_count, distinct_header_orders (table already joined)

New features (indices 27-30):
[27] FP Popularité: popularity of the header fingerprint, log1p(x) / log1p(500k)
     rare fingerprint (hand-crafted bot) → 0.0; very popular (browser) → 1.0
[28] FP Rotation: distinct_header_orders, log1p(x) / log1p(10)
     fingerprint rotation between requests = bot behavior
[29] Cookie Présent: presence of the Cookie header (real user engagement)
[30] Referer Présent: presence of the Referer header (normal HTTP navigation)

risk_score_from_centroid(): 14 terms, weights sum to 1.0
+ hfp_rare (1 - popularity) × 0.06 + hfp_rotating × 0.06
ML × 0.25 remains the dominant term

name_cluster(): 2 new labels, 1 updated
'🔄 Bot fingerprint tournant': hfp_rotating > 0.6 and anomaly score > 0.15
'🕵️ Fingerprint rare suspect': hfp_popular < 0.15 and anomaly score > 0.20
'🌐 Navigateur légitime' (existing label): now also requires a popular fingerprint (hfp_popular > 0.5)

N_FEATURES: 27 → 31
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
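To make the scales of features 27-30 concrete, here is a minimal sketch of the normalizations described in the commit message. It reuses the formulas from the FEATURES hunk; the helper names `log_norm` and `clamp01` are hypothetical and do not appear in the commit.

```python
import math

def log_norm(value, cap):
    """Map a raw count to [0, 1] on a log scale; `cap` is the count that saturates at 1.0."""
    return min(1.0, math.log1p(float(value or 0)) / math.log1p(cap))

def clamp01(value):
    """Upper-clamp an averaged 0/1 flag at 1.0; None is treated as 0."""
    return min(1.0, float(value or 0))

# [27] FP Popularité: number of IPs sharing the fingerprint, saturating at 500k
popularity = {n: round(log_norm(n, 500_000), 3) for n in (0, 6, 7, 707, 500_000)}

# [28] FP Rotation: distinct header orders seen for one IP, saturating at 10
rotation = {n: round(log_norm(n, 10), 3) for n in (1, 3, 4, 10, 50)}

# [29] / [30] Cookie and Referer presence ratios
cookie_present = clamp01(0.4)
referer_missing = clamp01(None)

print(popularity)
print(rotation)
print(cookie_present, referer_missing)
```

Read against the thresholds in `name_cluster()`, this suggests that `hfp_rotating > 0.6` is crossed from roughly four distinct header orders, `hfp_popular < 0.15` corresponds to a fingerprint shared by about six IPs or fewer, and `hfp_popular > 0.5` needs roughly 700 sharing IPs. The inputs are averages in the SQL, so these integer cut-offs are only indicative.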
@@ -95,7 +95,18 @@ SELECT
     avg(ml.has_accept_language) AS hdr_accept_lang,
     any(vh.hdr_enc) AS hdr_has_encoding,
     any(vh.hdr_sec_fetch) AS hdr_has_sec_fetch,
-    any(vh.hdr_count) AS hdr_count_raw
+    any(vh.hdr_count) AS hdr_count_raw,
+
+    -- HTTP header fingerprint (from agg_header_fingerprint_1h + ml_detected_anomalies)
+    -- header_order_shared_count: number of IPs sharing the same fingerprint
+    -- → low = rare fingerprint = suspicious behavior
+    avg(ml.header_order_shared_count) AS hfp_shared_count,
+    -- distinct_header_orders: number of distinct fingerprints sent by this IP
+    -- → high = fingerprint rotation = bot behavior
+    avg(ml.distinct_header_orders) AS hfp_distinct_orders,
+    -- Cookie and Referer come from the dedicated fingerprint table
+    any(hfp.hfp_cookie) AS hfp_cookie,
+    any(hfp.hfp_referer) AS hfp_referer
 FROM mabase_prod.agg_host_ip_ja4_1h t
 LEFT JOIN mabase_prod.ml_detected_anomalies ml
     ON t.src_ip = ml.src_ip AND t.ja4 = ml.ja4
@@ -112,6 +123,15 @@ LEFT JOIN (
         AND log_date >= today() - 2
     GROUP BY src_ip_v6, ja4
 ) vh ON t.src_ip = vh.src_ip_v6 AND t.ja4 = vh.ja4
+LEFT JOIN (
+    SELECT
+        src_ip,
+        avg(has_cookie) AS hfp_cookie,
+        avg(has_referer) AS hfp_referer
+    FROM mabase_prod.agg_header_fingerprint_1h
+    WHERE window_start >= now() - INTERVAL %(hours)s HOUR
+    GROUP BY src_ip
+) hfp ON t.src_ip = hfp.src_ip
 WHERE t.window_start >= now() - INTERVAL %(hours)s HOUR
     AND t.tcp_ttl_raw > 0
 GROUP BY t.src_ip, t.ja4
@@ -124,6 +144,7 @@ _SQL_COLS = [
     "h2_eff", "hdr_conf", "ua_ch_mismatch", "asset_ratio", "direct_ratio",
     "ja4_count", "ua_rotating", "threat", "country", "asn_org",
     "hdr_accept_lang", "hdr_has_encoding", "hdr_has_sec_fetch", "hdr_count_raw",
+    "hfp_shared_count", "hfp_distinct_orders", "hfp_cookie", "hfp_referer",
 ]
 
 
@@ -6,34 +6,40 @@ Ref:
   scipy.spatial.ConvexHull: convex hull (Graham/Qhull)
   sklearn-style API: centroids, labels_, inertia_
 
-Features (27 dimensions, normalized to [0,1]):
+Features (31 dimensions, normalized to [0,1]):
   0  ttl_n             : normalized initial TTL
   1  mss_n             : normalized MSS → network type
   2  scale_n           : TCP window scale factor
   3  win_n             : normalized TCP window
   4  score_n           : ML anomaly score (abs)
   5  velocity_n        : request velocity (log1p)
   6  fuzzing_n         : fuzzing index (log1p)
   7  headless_n        : headless session ratio
   8  post_n            : POST/total ratio
   9  ip_id_zero_n      : IP-ID=0 ratio (Linux/spoofed)
   10 entropy_n         : temporal entropy
   11 browser_n         : modern browser score
   12 alpn_n            : ALPN/protocol mismatch
   13 alpn_absent_n     : ALPN-absent ratio
   14 h2_n              : H2 multiplexing efficiency (log1p)
   15 hdr_conf_n        : header order confidence
   16 ua_ch_n           : User-Agent / Client Hints mismatch
   17 asset_n           : static asset ratio
   18 direct_n          : direct access ratio
   19 ja4_div_n         : JA4 diversity (log1p)
   20 ua_rot_n          : rotating UA (boolean)
   21 country_risk_n    : source country risk (CN/RU/KP → 1.0, US/DE/FR → 0.0)
   22 asn_cloud_n       : cloud/CDN/VPN hosting provider (Cloudflare/AWS/OVH → 1.0)
   23 hdr_accept_lang_n : Accept-Language header present (0=absent=bot-like)
   24 hdr_encoding_n    : Accept-Encoding header present (0=absent=bot-like)
   25 hdr_sec_fetch_n   : Sec-Fetch-* headers present (1=real browser)
   26 hdr_count_n       : normalized HTTP header count (3=bot, 15=browser)
+  27 hfp_popular_n     : header fingerprint popularity (log-normalized)
+                         rare fingerprint = suspicious; very popular = legitimate browser
+  28 hfp_rotating_n    : fingerprint rotation (distinct_header_orders)
+                         several distinct fingerprints → rotating bot
+  29 hfp_cookie_n      : Cookie header present (real user engagement)
+  30 hfp_referer_n     : Referer header present (normal HTTP navigation)
 """
 from __future__ import annotations
 
@@ -155,6 +161,16 @@ FEATURES: list[tuple[str, str, object]] = [
     ("hdr_has_encoding", "Accept-Encoding", lambda v: 1.0 if float(v or 0) > 0 else 0.0),
     ("hdr_has_sec_fetch", "Sec-Fetch Headers", lambda v: 1.0 if float(v or 0) > 0 else 0.0),
     ("hdr_count_raw", "Nb Headers", lambda v: min(1.0, float(v or 0) / 20.0)),
+    # ── HTTP header fingerprint (agg_header_fingerprint_1h) ──────────────
+    # header_order_shared_count: number of IPs sharing this fingerprint
+    # high → popular → legitimate browser (normalized as log1p / log1p(500000))
+    ("hfp_shared_count", "FP Popularité", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(500_000))),
+    # distinct_header_orders: number of distinct fingerprints for this IP
+    # high → fingerprint rotation → bot (normalized as log1p / log1p(10))
+    ("hfp_distinct_orders", "FP Rotation", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(10))),
+    # Cookie and Referer: signals of legitimate navigation
+    ("hfp_cookie", "Cookie Présent", lambda v: min(1.0, float(v or 0))),
+    ("hfp_referer", "Referer Présent", lambda v: min(1.0, float(v or 0))),
 ]
 
 FEATURE_KEYS = [f[0] for f in FEATURES]
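One consequence of the `float(v or 0)` pattern in this hunk may be worth spelling out: an IP with no matching row in the joined tables ends up with 0.0 for every fingerprint feature, whether the query returns NULL or a numeric default (in ClickHouse this depends on the `join_use_nulls` setting). The sketch below copies the four lambdas into a hypothetical `HFP_FEATURES` dict and feeds them a hypothetical unmatched row; neither name is part of the commit.

```python
import math

# The four lambdas added by this hunk, keyed by column name.
# HFP_FEATURES is a hypothetical name used only for this illustration.
HFP_FEATURES = {
    "hfp_shared_count": lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(500_000)),
    "hfp_distinct_orders": lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(10)),
    "hfp_cookie": lambda v: min(1.0, float(v or 0)),
    "hfp_referer": lambda v: min(1.0, float(v or 0)),
}

def normalize(row):
    """Apply each lambda to its column; a missing key behaves like None."""
    return {key: fn(row.get(key)) for key, fn in HFP_FEATURES.items()}

# Hypothetical row for an IP that matched neither joined table:
# some columns arrive as None, some as 0, some are absent.
unmatched = normalize({"hfp_shared_count": None, "hfp_cookie": 0})
print(unmatched)

# The risk score reads popularity 0.0 as "rare fingerprint" (1 - popularity).
hfp_rare = 1.0 - unmatched["hfp_shared_count"]
print(hfp_rare)
```

If this reading is right, missing fingerprint data is scored like a confirmed rare fingerprint and contributes the full 0.06 `hfp_rare` weight. The 0.5 fallback in `risk_score_from_centroid()` only applies when the centroid vector has fewer than 28 entries, not when the underlying data is absent.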
@@ -334,38 +350,45 @@ def compute_hulls(coords_2d: np.ndarray, labels: np.ndarray,
 def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
     """Human-readable name based on the dominant features of the centroid [0,1]."""
     s = centroid
+    n = len(s)
     ttl_raw = float(raw_stats.get("mean_ttl", 0))
     mss_raw = float(raw_stats.get("mean_mss", 0))
-    country_risk_v = s[21] if len(s) > 21 else 0.0
-    asn_cloud = s[22] if len(s) > 22 else 0.0
-    # Header features (indices 23-26)
-    accept_lang = s[23] if len(s) > 23 else 1.0
-    accept_enc = s[24] if len(s) > 24 else 1.0
-    sec_fetch = s[25] if len(s) > 25 else 0.0
-    hdr_count = s[26] if len(s) > 26 else 0.5
+    country_risk_v = s[21] if n > 21 else 0.0
+    asn_cloud = s[22] if n > 22 else 0.0
+    accept_lang = s[23] if n > 23 else 1.0
+    accept_enc = s[24] if n > 24 else 1.0
+    sec_fetch = s[25] if n > 25 else 0.0
+    hdr_count = s[26] if n > 26 else 0.5
+    hfp_popular = s[27] if n > 27 else 0.5
+    hfp_rotating = s[28] if n > 28 else 0.0
 
-    # Pure scanner: no browser headers, few headers
+    # Pure scanner: no browser headers, rare fingerprint, few headers
     if accept_lang < 0.15 and accept_enc < 0.15 and hdr_count < 0.25:
         return "🤖 Scanner pur (no headers)"
+    # Rotating AND suspicious fingerprint: bot that keeps changing its header profile
+    if hfp_rotating > 0.6 and s[4] > 0.15:
+        return "🔄 Bot fingerprint tournant"
+    # Very rare fingerprint plus anomaly: one-off hand-crafted bot
+    if hfp_popular < 0.15 and s[4] > 0.20:
+        return "🕵️ Fingerprint rare suspect"
     # Masscan scanners
     if s[0] > 0.16 and s[0] < 0.25 and mss_raw in range(1440, 1460) and s[2] > 0.25:
         return "🤖 Masscan Scanner"
-    # Aggressive offensive bots (fuzzing + anomaly + no browser headers)
+    # Aggressive offensive bots (fuzzing + anomaly)
     if s[4] > 0.40 and s[6] > 0.3:
         return "🤖 Bot agressif"
-    # Bot simulating a browser but without the real headers (ua_ch + missing sec_fetch)
+    # Bot simulating a browser but without the real headers
     if s[16] > 0.40 and sec_fetch < 0.2 and accept_lang < 0.3:
         return "🤖 Bot UA simulé"
-    # Very high-risk countries (CN, RU, KP) with abnormal traffic
+    # Very high-risk countries with abnormal traffic
     if country_risk_v > 0.75 and (s[4] > 0.10 or asn_cloud > 0.5):
         return "🌏 Source pays risqué"
-    # Cloud + UA-CH mismatch = cloud crawler/bot
+    # Cloud + UA-CH mismatch
     if s[16] > 0.50 and asn_cloud > 0.70:
         return "☁️ Bot cloud UA-CH"
-    # UA-CH mismatch alone
     if s[16] > 0.60:
         return "🤖 UA-CH Mismatch"
-    # Headless browser with real browser headers (Puppeteer, Playwright)
+    # Headless browser (Puppeteer/Playwright): sends Sec-Fetch headers but is headless
     if s[7] > 0.50 and sec_fetch > 0.5:
         return "🤖 Headless Browser"
     if s[7] > 0.50:
@@ -379,8 +402,9 @@ def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
     # High-risk country with no other signal
     if country_risk_v > 0.60:
         return "🌏 Trafic suspect (pays)"
-    # Legitimate browser: all headers present
-    if accept_lang > 0.7 and accept_enc > 0.7 and sec_fetch > 0.6 and hdr_count > 0.5:
+    # Legitimate browser: all signals positive, including a popular fingerprint
+    if (accept_lang > 0.7 and accept_enc > 0.7 and sec_fetch > 0.5
+            and hdr_count > 0.5 and hfp_popular > 0.5):
         return "🌐 Navigateur légitime"
     # OS fingerprinting
     if s[3] > 0.85 and ttl_raw > 120:
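Because `name_cluster()` returns on the first matching rule, the order of the fingerprint checks matters. The following is a minimal sketch of that precedence, reduced to the three fingerprint-related rules touched by this commit. The function `fingerprint_label` and its keyword defaults are hypothetical; every other rule of the real function is left out.

```python
def fingerprint_label(anomaly, hfp_popular, hfp_rotating,
                      accept_lang=1.0, accept_enc=1.0,
                      sec_fetch=1.0, hdr_count=1.0):
    """Hypothetical reduction of name_cluster() to its fingerprint rules.

    Returns None when none of the three rules fires; in the real function
    many other rules sit between and after these checks.
    """
    # Checked first: rotating fingerprint combined with an anomaly signal.
    if hfp_rotating > 0.6 and anomaly > 0.15:
        return "🔄 Bot fingerprint tournant"
    # Checked second: rare fingerprint combined with a stronger anomaly signal.
    if hfp_popular < 0.15 and anomaly > 0.20:
        return "🕵️ Fingerprint rare suspect"
    # Checked much later in the real function: every signal looks browser-like.
    if (accept_lang > 0.7 and accept_enc > 0.7 and sec_fetch > 0.5
            and hdr_count > 0.5 and hfp_popular > 0.5):
        return "🌐 Navigateur légitime"
    return None

# A cluster that is both rare and rotating gets the rotation label.
both = fingerprint_label(anomaly=0.3, hfp_popular=0.05, hfp_rotating=0.8)

# A popular fingerprint does not protect a rotating, anomalous cluster.
popular_but_rotating = fingerprint_label(anomaly=0.3, hfp_popular=0.9, hfp_rotating=0.8)

# Without an anomaly signal, a rare fingerprint alone triggers no label here.
rare_but_quiet = fingerprint_label(anomaly=0.05, hfp_popular=0.05, hfp_rotating=0.0)

print(both, popular_but_rotating, rare_but_quiet)
```

Under these assumptions both fingerprint labels require a non-trivial anomaly score, so the new features refine clusters that the ML score already flags and do not create bot labels on their own.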
@@ -399,31 +423,34 @@ def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
 def risk_score_from_centroid(centroid: np.ndarray) -> float:
     """
     Risk score in [0,1] from the centroid (original [0,1] space).
-    Accounts for country, cloud infrastructure and HTTP header profile.
-    Weights calibrated to sum to 1.0.
+    31 features; weights calibrated to sum to 1.0.
     """
     s = centroid
-    country_risk_v = s[21] if len(s) > 21 else 0.0
-    asn_cloud = s[22] if len(s) > 22 else 0.0
-    # Missing header = risk → invert (1 - presence)
-    no_accept_lang = 1.0 - (s[23] if len(s) > 23 else 1.0)
-    no_encoding = 1.0 - (s[24] if len(s) > 24 else 1.0)
-    no_sec_fetch = 1.0 - (s[25] if len(s) > 25 else 0.0)
-    # Few headers → bot: maximum risk when hdr_count=0
-    few_headers = 1.0 - (s[26] if len(s) > 26 else 0.5)
+    n = len(s)
+    country_risk_v = s[21] if n > 21 else 0.0
+    asn_cloud = s[22] if n > 22 else 0.0
+    no_accept_lang = 1.0 - (s[23] if n > 23 else 1.0)
+    no_encoding = 1.0 - (s[24] if n > 24 else 1.0)
+    no_sec_fetch = 1.0 - (s[25] if n > 25 else 0.0)
+    few_headers = 1.0 - (s[26] if n > 26 else 0.5)
+    # Rare fingerprint = suspicious (low popularity); rotating fingerprint = bot
+    hfp_rare = 1.0 - (s[27] if n > 27 else 0.5)
+    hfp_rotating = s[28] if n > 28 else 0.0
 
     return float(np.clip(
-        0.28 * s[4] +            # ML anomaly score (main term)
-        0.10 * s[6] +            # fuzzing
-        0.08 * s[16] +           # UA-CH mismatch
-        0.07 * s[7] +            # headless
-        0.06 * s[5] +            # velocity
-        0.06 * s[9] +            # IP-ID zero
-        0.10 * country_risk_v +  # source country risk
-        0.07 * asn_cloud +       # cloud/VPN infrastructure
-        0.05 * no_accept_lang +  # missing Accept-Language
-        0.05 * no_encoding +     # missing Accept-Encoding
+        0.25 * s[4] +            # ML anomaly score (main term)
+        0.09 * s[6] +            # fuzzing
+        0.07 * s[16] +           # UA-CH mismatch
+        0.06 * s[7] +            # headless
+        0.05 * s[5] +            # velocity
+        0.05 * s[9] +            # IP-ID zero
+        0.09 * country_risk_v +  # source country risk
+        0.06 * asn_cloud +       # cloud/VPN infrastructure
+        0.04 * no_accept_lang +  # missing Accept-Language
+        0.04 * no_encoding +     # missing Accept-Encoding
         0.04 * no_sec_fetch +    # missing Sec-Fetch (not a real browser)
-        0.04 * few_headers,      # very few headers (scanner/curl)
+        0.04 * few_headers +     # very few headers (scanner/curl)
+        0.06 * hfp_rare +        # rare header fingerprint = suspicious
+        0.06 * hfp_rotating,     # fingerprint rotation = bot
         0.0, 1.0
     ))
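The claim that the 14 weights still sum to 1.0 can be checked mechanically. The sketch below copies the old and new coefficients from this hunk into two dicts; the dict names and keys are hypothetical, since the commit keeps the weights inline in the expression.

```python
import math

# Coefficients before this commit (12 terms), copied from the removed lines.
OLD_WEIGHTS = {
    "score_ml": 0.28, "fuzzing": 0.10, "ua_ch": 0.08, "headless": 0.07,
    "velocity": 0.06, "ip_id_zero": 0.06, "country_risk": 0.10,
    "asn_cloud": 0.07, "no_accept_lang": 0.05, "no_encoding": 0.05,
    "no_sec_fetch": 0.04, "few_headers": 0.04,
}

# Coefficients after this commit (14 terms), copied from the added lines.
NEW_WEIGHTS = {
    "score_ml": 0.25, "fuzzing": 0.09, "ua_ch": 0.07, "headless": 0.06,
    "velocity": 0.05, "ip_id_zero": 0.05, "country_risk": 0.09,
    "asn_cloud": 0.06, "no_accept_lang": 0.04, "no_encoding": 0.04,
    "no_sec_fetch": 0.04, "few_headers": 0.04,
    "hfp_rare": 0.06, "hfp_rotating": 0.06,
}

old_total = sum(OLD_WEIGHTS.values())
new_total = sum(NEW_WEIGHTS.values())

# Weight taken away from the existing terms to fund the two new ones.
freed = sum(OLD_WEIGHTS[k] - NEW_WEIGHTS[k] for k in OLD_WEIGHTS)
dominant = max(NEW_WEIGHTS, key=NEW_WEIGHTS.get)

# isclose() absorbs floating-point rounding in the sums.
print(math.isclose(old_total, 1.0), math.isclose(new_total, 1.0))
print(round(freed, 2), dominant)
```

Keeping the weights in a structure like this, with an assertion on the total, would be one way to catch a drifting sum the next time a term is added; that is a suggestion, not something the commit does.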