feat(bot-detector): Browser Signature Detection engine (parallel mode)
Step A: browser_signatures.py
Pure data: BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES
(curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. H2 values are extracted
from real captures (Akamai format with commas, not semicolons).
Step B: browser_matcher.py
Vectorized 7-dimension engine (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15,
pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10,
JA4 dict 0.05). run_browser_matcher(df) adds bm_family/bm_score/bm_decision.
CDN edge case: the H2 dimension is neutralized (0.5) when has_xff=1.
BROWSER_MATCHER_REPLACE=false by default (DUAL_MODE: logging only).
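The weighting described in Step B can be illustrated with a small scalar sketch. It is not the vectorized code from browser_matcher.py: the dictionary keys, the function name weighted_score, and the choice to treat all four H2-derived dimensions as "the H2 dimension" for CDN neutralization are assumptions made for illustration. Only the weights and the neutral value 0.5 come from the commit message.

```python
# Weights as listed in the commit message; the key names are hypothetical.
DIMENSION_WEIGHTS = {
    "h2_settings": 0.30,
    "h2_window_update": 0.15,
    "pseudo_header_order": 0.15,
    "h2_priority": 0.10,
    "http_headers": 0.15,
    "tls": 0.10,
    "ja4_dict": 0.05,
}

# Assumption: every H2-derived dimension is neutralized behind a CDN,
# because the proxy terminates HTTP/2 and the client's frames are not seen.
H2_DIMENSIONS = ("h2_settings", "h2_window_update", "pseudo_header_order", "h2_priority")
NEUTRAL_SCORE = 0.5


def weighted_score(dim_scores, has_xff):
    """Combine per-dimension match scores in [0, 1] into one weighted score."""
    total = 0.0
    for name, weight in DIMENSION_WEIGHTS.items():
        score = dim_scores.get(name, 0.0)  # a missing dimension counts as no match
        if has_xff and name in H2_DIMENSIONS:
            score = NEUTRAL_SCORE  # neither reward nor penalize proxied H2 signals
        total += weight * score
    return total
```

Under these assumptions, a request that matches on every dimension scores about 1.0 when seen directly and about 0.65 (0.70 × 0.5 + 0.30) when has_xff=1.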
Step C: 06_browser_signature_detection.sql (migration)
Creates browser_h2_signatures (MergeTree table with 12 reference fingerprints).
Recreates dict_browser_h2 from that table with a confidence field (replaces the CSV).
Step D: 07_ai_features_view.sql
+h2_wu_val in the http_logs JOIN, +h2_window_update_value, +h2_dict_family,
+h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent},
+h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw,
+tls_h2_family_mismatch (detects a mismatch between the JA4 family and the H2 family).
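For readers who prefer Python to SQL, the tls_h2_family_mismatch rule can be restated as a small function. This is a sketch that mirrors the CASE expression added to 07_ai_features_view.sql in this commit. The WINDOW_UPDATE ranges and family names are copied from the diff; the function and constant names are hypothetical.

```python
# WINDOW_UPDATE ranges copied from the SQL view; constant names are hypothetical.
CHROME_WU = (15663000, 15664000)
FIREFOX_WU = (12517000, 12518000)
SAFARI_WU = (10485700, 10485820)


def _in_range(value, bounds):
    """Inclusive range test, equivalent to SQL BETWEEN."""
    low, high = bounds
    return low <= value <= high


def tls_h2_family_mismatch(browser_family, wu_value, h2_settings_known):
    """Return 1 when the JA4 (TLS) family contradicts the H2 WINDOW_UPDATE value."""
    is_chromium = browser_family in ("Chromium", "Chrome", "Edge")
    # Chromium-family JA4 with a Firefox or Safari window size.
    if is_chromium and (_in_range(wu_value, FIREFOX_WU) or _in_range(wu_value, SAFARI_WU)):
        return 1
    # Firefox JA4 with a Chrome window size.
    if browser_family == "Firefox" and _in_range(wu_value, CHROME_WU):
        return 1
    # Browser JA4 but no WINDOW_UPDATE at all, although H2 SETTINGS were seen (tool-like).
    if browser_family != "" and wu_value == 0 and h2_settings_known > 0:
        return 1
    return 0
```

As in the SQL, the function has no rule for a Safari JA4 family paired with a Chrome or Firefox window size; such a row is flagged only by the "no WINDOW_UPDATE" rule.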
Step E: preprocessing.py + pipeline.py
preprocessing.py: calls run_browser_matcher() after compute_browser_axes(),
adds 7 new binary H2 features to FEATURES and binary_features.
pipeline.py: calls log_dual_mode_comparison() after the A9 classification.
BROWSER_MATCHER_REPLACE=true enables replacement of the bypass.
Step F: test_browser_matcher.py
8 tests: Chrome/Firefox/Safari full match, curl rejected, httpcloak partial match,
TLS↔H2 mismatch, CDN proxy neutralization, go net/http rejected.
All 8 PASSED (+ 36 existing tests unchanged).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -122,7 +122,30 @@ WITH base_data AS (
         h2_fp != '' AND
         dictGetOrDefault('ja4_processing.dict_browser_h2', 'browser_family', tuple(h2_fp), '') = '',
         1, 0
-    ) AS h2_settings_rare
+    ) AS h2_settings_rare,
+    -- §4: Family identified by the H2 dictionary (browser_matcher)
+    dictGetOrDefault('ja4_processing.dict_browser_h2', 'browser_family',
+        tuple(h2_fp), '') AS h2_dict_family,
+    dictGetOrDefault('ja4_processing.dict_browser_h2', 'confidence',
+        tuple(h2_fp), toFloat32(0.0)) AS h2_dict_confidence,
+    -- §4: Raw H2 WINDOW_UPDATE value (the most reliable family signal)
+    h2_wu_val AS h2_window_update_value,
+    -- §4: Atomic H2 signals for browser_matcher and the ML feature vector
+    toUInt8(h2_wu_val BETWEEN 15663000 AND 15664000) AS h2_window_chrome,
+    toUInt8(h2_wu_val BETWEEN 12517000 AND 12518000) AS h2_window_firefox,
+    toUInt8(h2_wu_val BETWEEN 10485700 AND 10485820) AS h2_window_safari,
+    toUInt8(h2_wu_val = 0 AND h2_fp != '') AS h2_window_absent,
+    -- Chrome and Safari share the m,a,s,p order; use WU to tell them apart
+    toUInt8(h2_pseudo_ord = 'm,a,s,p') AS h2_order_chromesafari,
+    toUInt8(h2_pseudo_ord = 'm,p,s,a') AS h2_order_firefox,
+    -- Presence of PRIORITY frames (3rd field of h2_fp; != '0' → older Firefox)
+    toUInt8(
+        h2_fp != ''
+        AND length(splitByChar('|', h2_fp)) >= 3
+        AND arrayElement(splitByChar('|', h2_fp), 3) NOT IN ('', '0')
+    ) AS h2_priority_present,
+    -- Raw pseudo-header order value (for the Python matcher)
+    h2_pseudo_ord AS h2_pseudo_ord_raw
 FROM (
     -- Single join with explicit aliases (workaround for a ClickHouse 24.8 scope bug
     -- where PARTITION BY src_ip fails when several JOIN sources expose src_ip)
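The atomic H2 signals added in the hunk above can be mirrored on the Python side, for example to cross-check the view against a captured fingerprint. The sketch below assumes the h2_fp layout implied by the SQL (fields separated by '|', PRIORITY in the third field); the function name and the sample fingerprint string are hypothetical.

```python
def h2_atomic_signals(h2_fp, h2_wu_val, h2_pseudo_ord):
    """Recompute the binary H2 columns of the view for a single row."""
    fields = h2_fp.split("|") if h2_fp else []
    # ClickHouse arrays are 1-indexed, so arrayElement(..., 3) is fields[2] here.
    priority = fields[2] if len(fields) >= 3 else ""
    return {
        "h2_window_chrome": int(15663000 <= h2_wu_val <= 15664000),
        "h2_window_firefox": int(12517000 <= h2_wu_val <= 12518000),
        "h2_window_safari": int(10485700 <= h2_wu_val <= 10485820),
        # A fingerprint exists but no WINDOW_UPDATE value was recorded.
        "h2_window_absent": int(h2_wu_val == 0 and h2_fp != ""),
        "h2_order_chromesafari": int(h2_pseudo_ord == "m,a,s,p"),
        "h2_order_firefox": int(h2_pseudo_ord == "m,p,s,a"),
        "h2_priority_present": int(priority not in ("", "0")),
    }


# Hypothetical Chrome-like fingerprint: SETTINGS|WINDOW_UPDATE|PRIORITY|pseudo-header order.
sample = h2_atomic_signals(
    "1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p", 15663105, "m,a,s,p"
)
```

These are the 7 binary features that Step E adds to FEATURES. Because Chrome and Safari share the m,a,s,p order, h2_order_chromesafari alone does not identify a family; the window flags carry that distinction.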
@@ -193,9 +216,10 @@ WITH base_data AS (
     h.sec_ch_mobile_mismatch AS sec_ch_mobile_mismatch,
     h.sec_fetch_mode AS sec_fetch_mode,
     h.sec_fetch_dest AS sec_fetch_dest,
-    -- HTTP/2 columns (default to empty when there is no H2 traffic)
+    -- HTTP/2 columns (default to empty/0 when there is no H2 traffic)
     COALESCE(h2.h2_fp, '') AS h2_fp,
-    COALESCE(h2.h2_pseudo_ord, '') AS h2_pseudo_ord
+    COALESCE(h2.h2_pseudo_ord, '') AS h2_pseudo_ord,
+    COALESCE(h2.h2_wu_val, 0) AS h2_wu_val
 FROM (
     SELECT
         window_start, src_ip, ja4, host, src_asn,
@@ -258,8 +282,9 @@ WITH base_data AS (
     SELECT
         toStartOfHour(time) AS h2_window,
         toIPv6(src_ip) AS h2_ip,
-        anyIf(h2_fingerprint, h2_fingerprint != '') AS h2_fp,
-        anyIf(h2_pseudo_order, h2_pseudo_order != '') AS h2_pseudo_ord
+        anyIf(h2_fingerprint, h2_fingerprint != '') AS h2_fp,
+        anyIf(h2_pseudo_order, h2_pseudo_order != '') AS h2_pseudo_ord,
+        anyIf(h2_window_update, h2_window_update > 0) AS h2_wu_val
     FROM ja4_logs.http_logs
     WHERE time >= now() - INTERVAL 24 HOUR
         AND (h2_fingerprint != '' OR h2_pseudo_order != '')
@@ -271,6 +296,18 @@ SELECT
     *,
     -(sum((hits / (total_ip_hits + 1)) * log2((hits / (total_ip_hits + 1)) + 0.000001)) OVER (PARTITION BY src_ip)) AS temporal_entropy,
     sum(uniq_ja3_per_row) OVER (PARTITION BY src_ip) / greatest(distinct_ja4_count, 1) AS ja3_diversity_ratio,
+    -- §4: TLS↔H2 inconsistency: JA4 identifies one family but the H2 WINDOW_UPDATE points to another
+    toUInt8(CASE
+        WHEN browser_family IN ('Chromium', 'Chrome', 'Edge')
+            AND h2_window_update_value BETWEEN 12517000 AND 12518000 THEN 1 -- Chrome JA4 / Firefox H2
+        WHEN browser_family IN ('Chromium', 'Chrome', 'Edge')
+            AND h2_window_update_value BETWEEN 10485700 AND 10485820 THEN 1 -- Chrome JA4 / Safari H2
+        WHEN browser_family = 'Firefox'
+            AND h2_window_update_value BETWEEN 15663000 AND 15664000 THEN 1 -- Firefox JA4 / Chrome H2
+        WHEN browser_family != '' AND h2_window_update_value = 0
+            AND h2_settings_known > 0 THEN 1 -- Browser JA4 / no WU (tool)
+        ELSE 0
+    END) AS tls_h2_family_mismatch,
     -- §3: Cross-layer fingerprint consistency score [0.0, 1.0]
     toFloat32(
         CASE WHEN browser_family != '' THEN 0.25 ELSE 0.0 END