feat: bot-detection roadmap §2-9 - HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics

Step 2 - HTTP/2 fingerprinting in the ML pipeline:
- Add the dict_browser_h2 dictionary (11 fingerprints covering 4 browser families) in 05_aggregation_tables.sql
- Add the h2_agg CTE and 4 HTTP/2 features in 07_ai_features_view.sql:
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- Compute fingerprint_coherence_score (5 weighted axes) in the view
- Add a 6th axis, axis_h2_coherence, in browser.py (weights rebalanced)
- browser_h2.csv: 11 Akamai fingerprints → browser family
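
For readers unfamiliar with the fingerprint layout, here is a minimal Python sketch that splits one browser_h2.csv key into its four fields (SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_HEADER_ORDER, as named in the dictionary comment). The function names `parse_akamai_h2` and `duplicate_keys` are hypothetical and not part of this commit; the comma separator between settings follows the CSV rows in this commit.

```python
from collections import Counter


def parse_akamai_h2(fingerprint):
    """Split 'SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_HEADER_ORDER' into parts."""
    settings_raw, window_update, priority, pseudo_order = fingerprint.split("|")
    settings = {}
    for pair in settings_raw.split(","):
        key, value = pair.split(":")
        settings[int(key)] = int(value)  # SETTINGS id -> value
    return {
        "settings": settings,
        "window_update": int(window_update),
        "priority": priority,
        "pseudo_order": pseudo_order,
    }


def duplicate_keys(fingerprints):
    """Return the fingerprint keys that appear more than once."""
    return [fp for fp, n in Counter(fingerprints).items() if n > 1]
```

A check like `duplicate_keys` may be worth running on the CSV: the first Chrome row and the first Edge row carry the identical key, so a keyed lookup can return only one of the two families.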

Step 3 - Coherence pre-filter on the human baseline:
- pipeline.py excludes sessions with fingerprint_coherence_score below the threshold from the training baseline
- FINGERPRINT_COHERENCE_THRESHOLD is configurable via env (default 0.25)
- Excluded sessions are logged for SOC analysis
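
The pre-filter can be sketched as follows. This is an illustration only, not the code in pipeline.py: `filter_baseline` and the logger name are hypothetical, sessions are plain dicts rather than a DataFrame, and treating a missing score as 0.0 is an assumption.

```python
import logging
import os

logger = logging.getLogger("baseline_prefilter")

# Env var name and default (0.25) come from the commit message.
THRESHOLD = float(os.environ.get("FINGERPRINT_COHERENCE_THRESHOLD", "0.25"))


def filter_baseline(sessions, threshold=THRESHOLD):
    """Split sessions into (kept, excluded) by coherence score.

    A missing score is treated as 0.0 (assumption of this sketch).
    """
    kept, excluded = [], []
    for session in sessions:
        score = session.get("fingerprint_coherence_score", 0.0)
        (excluded if score < threshold else kept).append(session)
    if excluded:
        # Excluded sessions are logged so they can be reviewed later.
        logger.info("excluded %d low-coherence sessions from baseline", len(excluded))
    return kept, excluded
```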

Step 4 - Improved drift detection:
- scoring.py: move from 5 to 9 quantiles (p5…p95)
- Add KL divergence alongside the KS test
- Detect adversarial drift (≥80% of features drift in the same direction)
- Strict temporal split for validation
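
Two of these checks are easy to illustrate in isolation. The sketch below assumes each feature is summarized by a histogram over shared bins (for KL) and by one location statistic such as a median (for direction); both function names are hypothetical and the real scoring.py may compute them differently.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms over the same bins, with smoothing."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + eps) / p_total
        q = (q_c + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def is_adversarial_drift(baseline, current, ratio=0.80):
    """True when at least `ratio` of all features moved in the same direction.

    `baseline` and `current` map feature name -> location statistic.
    """
    ups = downs = 0
    for name, base_value in baseline.items():
        delta = current[name] - base_value
        if delta > 0:
            ups += 1
        elif delta < 0:
            downs += 1
    total = len(baseline)
    return total > 0 and max(ups, downs) / total >= ratio
```

The intuition behind the direction check: natural drift tends to push features in mixed directions, whereas a coordinated shift of most features the same way is more suspicious.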

Step 5 - JA4×ASN bipartite graph (§5.2):
- fleet.py: fleet detection via NetworkX + Louvain (optional imports)
- enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame
- cycle.py: called after preprocess_df, logging the number of sessions that belong to a fleet
- SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (7-day TTL)
- Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template
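
To show the shape of the bipartite graph without extra dependencies, the sketch below swaps Louvain for plain connected components, which is a coarser grouping than the community detection used in fleet.py. `find_fleets` and the `min_asns` cut-off are hypothetical.

```python
from collections import defaultdict


def find_fleets(sessions, min_asns=3):
    """Group JA4 and ASN nodes into connected components of the bipartite graph.

    Returns a list of (ja4_set, asn_set) for components that span at least
    `min_asns` distinct ASNs (a hypothetical cut-off for this sketch).
    """
    adjacency = defaultdict(set)
    for session in sessions:
        ja4_node = ("ja4", session["ja4"])
        asn_node = ("asn", session["asn"])
        adjacency[ja4_node].add(asn_node)
        adjacency[asn_node].add(ja4_node)

    seen, fleets = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal of one component.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        ja4s = {value for kind, value in component if kind == "ja4"}
        asns = {value for kind, value in component if kind == "asn"}
        if len(asns) >= min_asns:
            fleets.append((ja4s, asns))
    return fleets
```

One TLS fingerprint seen from many unrelated ASNs is the pattern of interest here: a single client build distributed across networks.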

Step 6 - Cross-domain Jaccard (§5.8):
- 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity
- Signal: the same paths (/admin, /wp-login) on several hosts = scanner
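
As a point of comparison, here is a small Python sketch of both quantities: the textbook Jaccard coefficient between two path sets, and the fraction of normalized paths seen on at least two hosts, which is what the jaccard_paths CTE computes. Function names are hypothetical; the normalization mirrors the SQL (query string dropped, depth 2).

```python
from collections import defaultdict


def normalize_path(path):
    """Drop the query string and keep the first two path segments."""
    return "/".join(path.split("?", 1)[0].split("/")[:3])


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_path_fraction(requests):
    """Fraction of normalized paths requested on at least two distinct hosts.

    `requests` is an iterable of (host, path) pairs for one source IP.
    """
    hosts_by_path = defaultdict(set)
    for host, path in requests:
        hosts_by_path[normalize_path(path)].add(host)
    if not hosts_by_path:
        return 0.0
    shared = sum(1 for hosts in hosts_by_path.values() if len(hosts) >= 2)
    return shared / len(hosts_by_path)
```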

Step 7 - ExIFFI + per-feature AE errors:
- scoring.py: compute_exiffi_importance() via permutation, compute_ae_feature_errors()
- pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies
- build_reason() enriched with exiffi_top when SHAP is inactive
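
The permutation idea can be sketched generically: shuffle one feature column at a time and measure how much the anomaly score moves. This is plain permutation importance under simple assumptions (rows as lists of floats, any scoring callable); it is not the actual compute_exiffi_importance() and the function name is hypothetical.

```python
import random


def permutation_importance(score_fn, rows, n_features, seed=0):
    """Mean absolute change in score when one feature column is shuffled.

    `score_fn` maps one row (a list of floats) to an anomaly score.
    """
    rng = random.Random(seed)
    base_scores = [score_fn(row) for row in rows]
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by a shuffled value.
        shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        delta = sum(abs(score_fn(row) - base) for row, base in zip(shuffled, base_scores))
        importances.append(delta / len(rows) if rows else 0.0)
    return importances
```

A feature that the scorer ignores gets an importance of exactly 0.0, which makes the output easy to sanity-check.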

Step 8 - Meta-learner for ensemble weighting:
- scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels)
- Labels collected from the current cycle (known_bots, legitimate, Anubis)
- pipeline.py: fixed weights replaced by MetaLearner.predict()
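
The fallback behaviour is the key design point, so here is a dependency-free sketch of it. The class below is hypothetical: it stands in for the real MetaLearner, and a small hand-written gradient-descent logistic regression replaces the LogisticRegression named above.

```python
import math


class MetaLearnerSketch:
    """Combine per-model scores; use fixed weights until enough labels exist."""

    def __init__(self, fixed_weights, min_labels=1000):
        self.fixed_weights = list(fixed_weights)
        self.min_labels = min_labels
        self.coef = None  # learned weights, set by fit() when labels suffice
        self.bias = 0.0

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y, lr=0.5, epochs=200):
        if len(y) < self.min_labels:
            self.coef = None  # not enough labels: keep the fixed weights
            return self
        w, b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            grad_w, grad_b = [0.0] * len(w), 0.0
            for row, label in zip(X, y):
                z = sum(wi * xi for wi, xi in zip(w, row)) + b
                err = self._sigmoid(z) - label
                grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
                grad_b += err
            w = [wi - lr * g / len(y) for wi, g in zip(w, grad_w)]
            b -= lr * grad_b / len(y)
        self.coef, self.bias = w, b
        return self

    def predict(self, row):
        if self.coef is None:
            # Fallback: fixed weighted sum of the model scores.
            return sum(wi * xi for wi, xi in zip(self.fixed_weights, row))
        z = sum(wi * xi for wi, xi in zip(self.coef, row)) + self.bias
        return self._sigmoid(z)
```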

Step 9 - Performance metrics and monitoring:
- metrics.py: record_cycle_metrics() records anomaly rate, drift, correlation, latency
- SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (90-day TTL)
- Dashboard: /health + /api/health + health.html template
- cycle.py: record_cycle_metrics called at the end of each cycle (Complet + Applicatif)

Tests: 36/36 bot-detector tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-10 00:11:35 +02:00
parent 8ca4a1e849
commit a108814a56
18 changed files with 1670 additions and 62 deletions

@ -53,6 +53,21 @@ SOURCE(FILE(path '/var/lib/clickhouse/user_files/browser_ja4.csv' format 'CSV'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 300);
-- §2 - HTTP/2 dictionary: SETTINGS fingerprint → browser family
-- Columns: h2_fingerprint (key), browser_family
-- Source file: /var/lib/clickhouse/user_files/browser_h2.csv (CSVWithNames)
-- Fingerprint in Akamai format: SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_HEADER_ORDER
DROP DICTIONARY IF EXISTS ja4_processing.dict_browser_h2;
CREATE DICTIONARY ja4_processing.dict_browser_h2
(
h2_fingerprint String,
browser_family String
)
PRIMARY KEY h2_fingerprint
SOURCE(FILE(path '/var/lib/clickhouse/user_files/browser_h2.csv' format 'CSVWithNames'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 300);
-- -----------------------------------------------------------------------------
-- agg_host_ip_ja4_1h — behavioral aggregation (L4/L5/L7)

@ -2,10 +2,28 @@
-- 07_ai_features_view.sql — AI feature view with full Anubis enrichment
-- Source: bot_detector/anubis/view_ai_features_anubis.sql
-- Includes combined UA+IP priority logic and Anubis bot_name/action/category.
-- §2: HTTP/2 features (dict_browser_h2, H2↔JA4 coherence, pseudo-headers)
-- §3: Cross-layer fingerprint coherence score
-- =============================================================================
CREATE OR REPLACE VIEW ja4_processing.view_ai_features_1h AS
WITH base_data AS (
WITH
-- §2 - Aggregate HTTP/2 fingerprints by (hour, src_ip)
-- Reads directly from http_logs for the columns added in step 1
h2_agg AS (
SELECT
toStartOfHour(time) AS window_start,
toIPv6(src_ip) AS src_ip,
anyIf(h2_fingerprint, h2_fingerprint != '') AS h2_fp,
anyIf(h2_pseudo_order, h2_pseudo_order != '') AS h2_pseudo_ord
FROM ja4_logs.http_logs
WHERE time >= now() - INTERVAL 24 HOUR
AND (h2_fingerprint != '' OR h2_pseudo_order != '')
GROUP BY window_start, src_ip
),
base_data AS (
SELECT
a.window_start, a.src_ip, a.ja4, a.host,
toString(a.src_asn) AS asn_number,
@ -92,7 +110,44 @@ WITH base_data AS (
a.count_unusual_ct_val / greatest(a.count_post, 1) AS unusual_content_type_ratio,
a.count_non_std_port_val / (a.hits + 1) AS non_standard_port_ratio,
a.count_login_post_val / greatest(a.count_post, 1) AS login_post_concentration,
h.sec_ch_mobile_mismatch AS sec_ch_mobile_mismatch
h.sec_ch_mobile_mismatch AS sec_ch_mobile_mismatch,
-- §2 - HTTP/2 features (SETTINGS fingerprint, H2↔JA4 coherence, pseudo-headers)
-- h2_settings_known: the H2 fingerprint is present in dict_browser_h2
IF(
COALESCE(h2.h2_fp, '') != '' AND
dictGetOrDefault('ja4_processing.dict_browser_h2', 'browser_family',
tuple(COALESCE(h2.h2_fp, '')), '') != '',
1, 0
) AS h2_settings_known,
-- h2_pseudo_order_match: the pseudo-header order matches the browser family that the JA4 maps to
CASE
WHEN COALESCE(h2.h2_pseudo_ord, '') = '' THEN 0
WHEN dictGetOrDefault('ja4_processing.dict_browser_ja4', 'browser_family',
tuple(a.ja4), '') IN ('Chromium', 'Chrome', 'Edge', 'Safari')
AND h2.h2_pseudo_ord = 'm,a,s,p' THEN 1
WHEN dictGetOrDefault('ja4_processing.dict_browser_ja4', 'browser_family',
tuple(a.ja4), '') = 'Firefox'
AND h2.h2_pseudo_ord = 'm,p,s,a' THEN 1
ELSE 0
END AS h2_pseudo_order_match,
-- h2_ja4_coherence: the H2 browser family matches the JA4 browser family
IF(
COALESCE(h2.h2_fp, '') != '' AND
dictGetOrDefault('ja4_processing.dict_browser_h2', 'browser_family',
tuple(COALESCE(h2.h2_fp, '')), '') =
dictGetOrDefault('ja4_processing.dict_browser_ja4', 'browser_family',
tuple(a.ja4), '') AND
dictGetOrDefault('ja4_processing.dict_browser_ja4', 'browser_family',
tuple(a.ja4), '') != '',
1, 0
) AS h2_ja4_coherence,
-- h2_settings_rare: H2 fingerprint present but not recognized (potentially suspicious)
IF(
COALESCE(h2.h2_fp, '') != '' AND
dictGetOrDefault('ja4_processing.dict_browser_h2', 'browser_family',
tuple(COALESCE(h2.h2_fp, '')), '') = '',
1, 0
) AS h2_settings_rare
FROM (
SELECT
window_start, src_ip, ja4, host, src_asn,
@ -150,9 +205,21 @@ WITH base_data AS (
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip
) h ON a.src_ip = h.src_ip AND a.window_start = h.window_start
LEFT JOIN h2_agg h2 ON h2.src_ip = a.src_ip AND h2.window_start = a.window_start
)
SELECT
*,
-(sum((hits / (total_ip_hits + 1)) * log2((hits / (total_ip_hits + 1)) + 0.000001)) OVER (PARTITION BY src_ip)) AS temporal_entropy,
sum(uniq_ja3_per_row) OVER (PARTITION BY src_ip) / greatest(distinct_ja4_count, 1) AS ja3_diversity_ratio
sum(uniq_ja3_per_row) OVER (PARTITION BY src_ip) / greatest(distinct_ja4_count, 1) AS ja3_diversity_ratio,
-- §3 - Cross-layer fingerprint coherence score in [0.0, 1.0]
-- Combines: known browser family, H2↔JA4 coherence, TLS coherence (ALPN and SNI),
-- Accept-Language presence, and absence of a UA/CH mismatch.
toFloat32(
CASE WHEN browser_family != '' THEN 0.25 ELSE 0.0 END
+ COALESCE(h2_ja4_coherence, 0) * 0.20
+ (1 - COALESCE(alpn_http_mismatch, 0)) * 0.15
+ (1 - COALESCE(sni_host_mismatch, 0)) * 0.10
+ COALESCE(has_accept_language, 0) * 0.15
+ (1 - COALESCE(ua_ch_mismatch, 0)) * 0.15
) AS fingerprint_coherence_score
FROM base_data;
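
To make the weighting easier to reason about, the score expression in the view can be mirrored in a few lines of Python. This is a sketch for offline checks, not part of the commit: the function name is hypothetical and rows are plain dicts keyed by the column names used in the view.

```python
def fingerprint_coherence_score(row):
    """Weighted sum of the six terms used in the view; the weights total 1.0."""

    def flag(name):
        return row.get(name) or 0  # mirrors COALESCE(col, 0)

    return (
        (0.25 if row.get("browser_family") else 0.0)
        + flag("h2_ja4_coherence") * 0.20
        + (1 - flag("alpn_http_mismatch")) * 0.15
        + (1 - flag("sni_host_mismatch")) * 0.10
        + flag("has_accept_language") * 0.15
        + (1 - flag("ua_ch_mismatch")) * 0.15
    )
```

Worth noting when choosing a threshold: a row with no positive signal and no mismatch flag still scores 0.15 + 0.10 + 0.15 = 0.40, because three terms reward the absence of a mismatch.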

@ -467,6 +467,39 @@ cross_domain_features AS (
0.0
) AS host_coverage_uniformity
FROM ja4_drift_features
),
-- ── §5.8b: Cross-domain Jaccard similarity ──────────────────────────────────
-- Principle: a scanner requests the same paths (/admin, /wp-login.php, /.env)
-- on several distinct hosts. The value computed here is a Jaccard-style proxy:
-- the fraction of normalized paths that are shared between hosts.
-- High signal (>0.5) = same path list on several sites → systematic scanning.
jaccard_paths AS (
SELECT
toStartOfHour(time) AS window_start,
toIPv6(src_ip) AS src_ip,
-- Fraction of normalized paths that appear on ≥2 distinct hosts
toFloat64(countIf(distinct_hosts >= 2)) / greatest(toFloat64(count()), 1.0)
AS cross_domain_path_similarity
FROM (
SELECT
toStartOfHour(time) AS time,
src_ip,
-- Normalize the path to depth 2 (drop query parameters)
arrayStringConcat(
arraySlice(
splitByChar('/', replaceRegexpAll(path, '\\?.*', '')),
1, 3
),
'/'
) AS path_norm,
uniqExact(host) AS distinct_hosts
FROM ja4_logs.http_logs
WHERE time >= now() - INTERVAL 24 HOUR
GROUP BY time, src_ip, path_norm
HAVING distinct_hosts >= 1
)
GROUP BY window_start, src_ip
)
-- ── Final join: §5.1/§5.3 features by (window, ip, ja4, host)
@ -498,7 +531,9 @@ SELECT
-- §5.8 Cross-Domain Session Linking
d.host_diversity,
d.host_sweep_speed,
d.host_coverage_uniformity
d.host_coverage_uniformity,
-- §5.8b Cross-domain Jaccard (proportion of paths shared between hosts)
coalesce(jp.cross_domain_path_similarity, 0.0) AS cross_domain_path_similarity
FROM path_features p
LEFT JOIN cadence_features c
ON p.window_start = c.window_start
@ -508,6 +543,9 @@ LEFT JOIN cadence_features c
LEFT JOIN cross_domain_features d
ON p.window_start = d.window_start
AND p.src_ip = d.src_ip
LEFT JOIN jaccard_paths jp
ON p.window_start = jp.window_start
AND p.src_ip = jp.src_ip
LEFT JOIN ja4_processing.view_resource_cascade_1h rc
ON p.window_start = rc.window_start
AND p.src_ip = rc.src_ip

@ -0,0 +1,12 @@
"h2_fingerprint","browser_family"
"1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p","Chrome"
"1:65536,3:1000,4:6291456,6:262144|15663105|0|m,a,s,p","Chrome"
"1:65536,2:0,3:100,4:6291456,6:262144|15663105|0|m,a,s,p","Chrome"
"1:65536,4:131072,5:16384|12517377|0|m,p,s,a","Firefox"
"1:65536,4:131072|12517377|0|m,p,s,a","Firefox"
"1:65536,3:100,4:131072,5:16384|12517377|0|m,p,s,a","Firefox"
"1:4096,3:100,4:65535|10485760|0|m,a,s,p","Safari"
"1:4096,3:100,4:65535,5:16384|10485760|0|m,a,s,p","Safari"
"1:4096,3:100,4:65535,6:16384|10485760|0|m,a,s,p","Safari"
"1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p","Edge"
"1:65536,2:0,3:1000,4:6291456,6:262144|15663105|0|m,a,s,p","Edge"