145 Commits

Author SHA1 Message Date
85d3b95b7b feat: HTTP/2 passive fingerprinting with individual SETTINGS fields
Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3:

mod-reqin-log (C module):
- Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST)
  to capture H2 preface before mod_http2 takes over the connection
- AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters
- Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order
- Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8)
- Read H2 notes from c1 (master connection) for mod_http2 secondary conns
- Fix header_order_signature JSON length bug (26→strlen)

ClickHouse schema:
- Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size,
  h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size,
  h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol
- Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero
- Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt
- Migration 04_http2_fields.sql updated for existing deployments

Correlator:
- Accept both timestamp_ns and timestamp field names (backward compat)

Integration:
- Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf

Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log →
correlator → ClickHouse with all 12 H2 columns populated correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 02:33:45 +02:00
bd81331411 maj these 2026-04-11 00:27:20 +02:00
8da1b7d8e6 tests/integration/platform/csv-stubs/browser_h2.csv 2026-04-10 23:13:35 +02:00
aa233bc55c docs(thesis): v3 — corrections + §3.9 browser_matcher + XFF proxy accuracy
User-authored updates verified and corrected:
- Correction 1: 85 features / 8 familles (was 65+/7)
- Correction 2: diagram adds MetaLearner, ExIFFI, fleet.py
- Correction 3: axis 5 weight 0.20→0.15, new axis 6 (H2 Coherence 0.05)
- Correction 4: 5 quantiles (p10-p90), p5/p95 as future work
- Correction 5: §5.6 DNS Shadow + §5.7 Compression Ratio named as future
- New §3.9 Browser Signature Detection (browser_matcher 7 dimensions)
- New §2.4.5 ExIFFI + MetaLearner + KL divergence drift
- New §2.5.3 HTTP/2 fingerprinting passif literature
- Updated §5.2 fleet.py implementation details
- Updated §5.8 cross_domain_path_similarity + Jaccard

Additional fixes (code-accuracy alignment):
- XFF proxy: 4 H2 dimensions neutralized (70% weight), not redistributed
- Module count: 12→13 (browser_signatures.py added)
- §6.5 limitations table: precise proxy weight impact (70%→30%)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 19:45:02 +02:00
d098de1a66 fix(bot-detector): neutralize H2 dimensions behind proxy (X-Forwarded-For)
When has_xff=1, the H2 connection is terminated by the reverse proxy/CDN,
so client H2 fingerprints are lost. Previously only D1 (h2_settings) was
neutralized; D2 (window_update), D3 (pseudo_order), and D4 (priority)
still penalized proxied traffic — a real Chrome behind Cloudflare scored
0.0 on 3 dimensions (45% of total weight).

Now all 4 H2 dimensions return 0.5 (neutral) when has_xff>0, and
non-browser H2 detection is also disabled behind proxies.

Tests: 10/10 passed including 3 new XFF-specific cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:15:20 +02:00
261205028d fix(dashboard): campaigns scatter chart — show campaigns not IPs
- API /api/campaigns/scatter: aggregate by campaign_id instead of per-IP
  Returns avg_score, avg_velocity, unique_ips, ja4_list, asn_list, country_list
- Template: one bubble per campaign, sized by IP count
- Tooltip: campaign-level info (IPs, score, velocity, ASNs, pays, JA4s)
- Click navigates to campaign detail (not IP detail)
- Updated doc panel text

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:09:02 +02:00
fb73c60e7d feat(dashboard): fingerprint discovery page — extract and group JA4/H2/headers from traffic
- GET /api/fingerprint-discovery: queries http_logs, groups by JA4, aggregates
  UA family, header presence rates (Sec-CH-UA, Sec-Fetch, Accept-Language,
  zstd, brotli, gzip, XFF), H2 data, TLS info, dict lookups
- /fingerprints page: KPIs, doughnut chart by family, stacked header bars,
  filterable/sortable profile table, expandable detail panel
- Promote button: push H2 fingerprints to browser_h2_signatures via existing
  POST /api/browser-signatures/entries endpoint
- Nav link: Découverte added after Navigateurs in sidebar

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:02:53 +02:00
fde6864311 feat(dashboard): browser signatures management UI
- Ajoute dict_browser_h2 dans /reflists (lecture seule via dict_browser_h2)
- Nouveaux endpoints API :
    GET  /api/browser-signatures/entries — liste browser_h2_signatures
         (fallback dict CSV si migration 06 non appliquée)
    POST /api/browser-signatures/entries — ajout fingerprint + reload dict
    DELETE /api/browser-signatures/entries — suppression + reload dict
- Page /browsers : 2 nouvelles sections
    'Base de signatures H2' — tableau des 10 fingerprints, form d'ajout,
    mode lecture seule automatique si migration 06 non appliquée
    'Règles de scoring browser_matcher.py' — tableau statique des 7 dimensions
    (poids, valeurs par famille, seuils de bypass)
- Integration : browser_h2.csv copié dans user_files au démarrage ClickHouse

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:46:07 +02:00
da1b579d4f fix(dashboard): rename duplicate /api/browsers route to /api/browser-signatures
La route /api/browsers existait déjà (distribution JA4 par famille).
La nouvelle route du browser_matcher était en conflit — FastAPI utilisait
la première définition. Renommage en /api/browser-signatures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:17:38 +02:00
9c308747bd feat(dashboard): page Browser Signature Detection (/browsers)
Nouvelle page dédiée à l'analyse passive des signatures navigateur (§4) :

API — GET /api/browsers :
  Requête view_ai_features_1h pour :
  - Compteurs globaux (total, sessions_with_h2, matched, mismatch %)
  - Distribution h2_dict_family (Chrome/Firefox/Safari/Edge)
  - Répartition des signaux WINDOW_UPDATE (chrome/firefox/safari/absent/autre)
  - Mismatch TLS↔H2 par famille JA4 (total + count + %)
  - Top 20 sessions suspectes (tls_h2_family_mismatch=1, triées par hits)

Page /browsers :
  - 6 KPI header (sessions, avec H2, famille connue, taux match, mismatch, % mismatch)
  - Doc banner expliquant browser_matcher §4 et le mode DUAL_MODE
  - Donut : familles H2 (dict_browser_h2 lookup)
  - Bar horizontal : WINDOW_UPDATE signals par famille
  - Bar groupé + ligne : mismatch TLS↔H2 par famille JA4 (count + %)
  - Table : top 20 imposteurs potentiels avec IP cliquable, pseudo-order, cohérence
  - Mini-KPIs : ordres pseudo-headers Chrome/Safari, Firefox, inconnu, PRIORITY frames
  - Lien nav 'Navigateurs' dans le groupe Surveillance de base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:02:39 +02:00
e52cdcc01f feat(bot-detector): Browser Signature Detection engine (parallel mode)
Étape A — browser_signatures.py
  Données pures : BROWSER_SIGNATURES (Chrome/Firefox/Safari), NON_BROWSER_SIGNATURES
  (curl/httpx/go), BROWSER_THRESHOLDS, DIMENSION_WEIGHTS. Valeurs H2 extraites
  des captures réelles (format Akamai avec virgules, non semicolons).

Étape B — browser_matcher.py
  Moteur vectorisé 7 dimensions (H2 SETTINGS 0.30, WINDOW_UPDATE 0.15,
  pseudo-header order 0.15, H2 PRIORITY 0.10, HTTP headers 0.15, TLS 0.10,
  JA4 dict 0.05). run_browser_matcher(df) ajoute bm_family/bm_score/bm_decision.
  CDN edge case : dimension H2 neutralisée (0.5) si has_xff=1.
  BROWSER_MATCHER_REPLACE=false par défaut (mode DUAL_MODE logging uniquement).

Étape C — 06_browser_signature_detection.sql (migration)
  Crée browser_h2_signatures (table MergeTree avec 12 fingerprints de référence).
  Recrée dict_browser_h2 depuis la table avec champ confidence (remplace CSV).

Étape D — 07_ai_features_view.sql
  +h2_wu_val dans le JOIN http_logs, +h2_window_update_value, +h2_dict_family,
  +h2_dict_confidence, +h2_window_{chrome,firefox,safari,absent},
  +h2_order_{chromesafari,firefox}, +h2_priority_present, +h2_pseudo_ord_raw,
  +tls_h2_family_mismatch (détection incohérence famille JA4 vs famille H2).

Étape E — preprocessing.py + pipeline.py
  preprocessing.py: appelle run_browser_matcher() après compute_browser_axes(),
  ajoute 7 nouvelles features binaires H2 à FEATURES et binary_features.
  pipeline.py: appelle log_dual_mode_comparison() après la classification A9.
  BROWSER_MATCHER_REPLACE=true active le remplacement du bypass.

Étape F — test_browser_matcher.py
  8 tests : Chrome/Firefox/Safari full match, curl rejeté, httpcloak partiel,
  TLS↔H2 mismatch, CDN proxy neutralisation, go net/http rejeté.
  Tous 8 PASSED (+ 36 tests existants inchangés).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:52:57 +02:00
c77d479d6c docs(thesis): 5 corrections — 85 features, MetaLearner diagram, browser axes note, quantile clarification, §5.6/5.7 named
- Correction 1 (l.65, 701): '65+ features sur 7 familles' → '85 features sur 8 familles'
- Correction 2 (l.374-378): diagramme ASCII bot_detector — ajout MetaLearner, ExIFFI, fleet.py
- Correction 3 (après l.506): note poids axe 5 réduit 0.20→0.15, axe 6 ajouté 0.05
- Correction 4 (l.279): clarification quantiles actuels 5 (p10→p90), p5/p95 = futur
- Correction 5 (l.776): §5.6 (DNS UDP/53) et §5.7 (Apache compression) nommés explicitement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:51:31 +02:00
79dbb23d6f feat(dashboard): sélecteur de plage temporelle sur /campaigns
Avant : toutes les vues de campagnes étaient fixes à 7 jours.
Après : sélecteur 1j / 7j (défaut) / 14j / 30j / 90j en haut à droite.

- Ajout du paramètre ?days= (1–90, défaut 7) à :
    GET /api/campaigns
    GET /api/campaigns/graph
    GET /api/campaigns/scatter
    GET /api/campaigns/{cid}
- Le sélecteur recharge simultanément les 3 vues (cartes, scatter, graphe)
  et le panneau de détail avec la même fenêtre temporelle
- Le compteur de campagnes indique la plage active : (4 campagnes — 30j)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:24:08 +02:00
9548b1782d fix: corriger ORDER BY ml_detected_anomalies dans le schéma de base
CH 24.8 refuse MODIFY ORDER BY sur des colonnes existantes (erreur BAD_ARGUMENTS 36).
La migration 01 ne pouvait donc pas corriger l'ORDER BY en post-init.

Correctif :
- 06_ml_tables.sql : ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name)
  + TTL 30j → 7j (cohérent avec l'architecture documentée)
- 01_ttl_adjustments.sql : supprime le MODIFY ORDER BY impossible, conserve
  uniquement les MODIFY TTL (valides pour les déploiements existants)

Résultat : make init-stack sans aucun ⚠ ni ✗

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:34:07 +02:00
51dd376f7a docs: mise à jour complète — 7/8 techniques, 85 features, 12 modules
Reflète l'état réel du système après les étapes 1-9 du roadmap :

- §5.2 (fleet_detector NetworkX/Louvain) et §5.8 (Jaccard cross-domain) : 
- MetaLearner (régression logistique, fallback poids fixes) : documenté
- ExIFFI (profondeur isolation EIF) + erreur AE par feature : documenté
- KL divergence en complément du KS, drift adversarial : documenté
- HTTP/2 fingerprinting (h2_fingerprint, dict_browser_h2, axis_h2_coherence) : documenté
- Métriques de cycle (metrics.py, ml_performance_metrics, alertes) : documenté
- Browser confidence : 5 axes → 6 axes (axis_h2_coherence)
- 85 features (73 FEATURES + 12 FEATURES_COMPLET), 12 modules, 53 routes dashboard
- Conformité thèse : 99.4% (était 97.9%), §5 : 87.5% (était 62.5%)
- Tables nouvelles : fleet_detections, ml_performance_metrics, soc_feedback
- Dictionnaires : 8 (dict_browser_h2 ajouté)
- Dashboard : 16 pages + 37 API routes (fleet, health ajoutés)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:31:20 +02:00
edbb4aed2c fix(import): add h2 columns with defaults for prod data missing 4 cols
The prod data export was made before http/2 columns were added to
http_logs (h2_fingerprint, h2_settings_fp, h2_window_update,
h2_pseudo_order). The INSERT SELECT now provides empty/zero literals
for those 4 columns so the 56-col Native export imports into the
60-col table without a column count mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:16:36 +02:00
92432085e2 fix(campaigns): fix IP navigation URL encoding
fmtIP() returns an HTML <a> tag string. Using encodeURIComponent(fmtIP(ip))
was URL-encoding the entire HTML markup instead of the raw IP address,
resulting in /ip/%3Ca%20href%3D... navigation.

Fix: extract raw IP (stripping ::ffff: prefix) before building the URL.
Applied to all 3 click handlers in campaigns.html:
- members table row onclick
- scatter chart point click
- force graph node click

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:08:53 +02:00
7a04e47041 fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error)
- api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE
  (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:05:04 +02:00
040437921c fix(init-stack): pre-drop mv_http_logs + http_logs before schema apply
Ensure h2 columns are always included on fresh init. Also add migration
loop for fleet_detections and ml_performance_metrics tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:00:04 +02:00
b409a70970 fix(views): align SQL views with dashboard API expected columns
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- view_ip_recurrence: add worst_threat alias + top_ja4, top_host columns

All three views were missing columns referenced by /api/brute-force,
/api/ja4-rotation and /api/recurrence endpoints, causing 500 errors
on the Tactiques page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:59:57 +02:00
2f2c5e03bb fix(sql): contournement bug scope ClickHouse 24.8 dans view_ai_features_1h
- Restructure 07_ai_features_view.sql : single anonymous inner subquery
  avec aliases explicites sur toutes les colonnes (a.xxx AS xxx, h.xxx AS xxx,
  h2.xxx AS xxx) pour résoudre l'ambiguïté PARTITION BY src_ip dans l'outer SELECT
- Supprime les CTEs multiples (h2_agg, enriched) qui déclenchaient le bug
- Fix migration 04_http2_fields.sql : ordre DEFAULT avant CODEC (syntax ClickHouse)
- make init-stack : 0 erreur sur 13 fichiers SQL

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:48:05 +02:00
a108814a56 feat: roadmap détection bots §2-9 — HTTP/2, cohérence, drift, flotte, Jaccard, ExIFFI, méta-learner, métriques
Étape 2 — Fingerprinting HTTP/2 dans le pipeline ML :
- Ajout du dictionnaire dict_browser_h2 (11 familles de navigateurs) dans 05_aggregation_tables.sql
- Ajout du CTE h2_agg et 4 features HTTP/2 dans 07_ai_features_view.sql :
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- Calcul du fingerprint_coherence_score (5 axes pondérés) dans la vue
- Ajout du 6e axe axis_h2_coherence dans browser.py (poids rééquilibrés)
- browser_h2.csv : 11 fingerprints Akamai → famille navigateur

Étape 3 — Pré-filtre de cohérence sur la baseline humaine :
- pipeline.py exclut les sessions avec fingerprint_coherence_score < seuil de la baseline d'entraînement
- FINGERPRINT_COHERENCE_THRESHOLD configurable via env (défaut 0.25)
- Log des sessions exclues pour analyse SOC

Étape 4 — Détection de drift améliorée :
- scoring.py : passage de 5 à 9 quantiles (p5…p95)
- Ajout de la divergence KL en complément du test KS
- Détection de drift adversarial (≥80% des features dérivent dans la même direction)
- Split temporel strict pour la validation

Étape 5 — Graphe bipartite JA4×ASN (§5.2) :
- fleet.py : détection de flottes via NetworkX + Louvain (imports optionnels)
- enrich_with_fleet_score() : ajout fleet_score + fleet_campaign_flag au DataFrame
- cycle.py : appel après preprocess_df avec log du nombre de sessions en flotte
- SQL migration 05_fleet_metrics_tables.sql : table fleet_detections (TTL 7j)
- Dashboard : /fleet + /api/fleet (communautés détectées) + template fleet.html

Étape 6 — Cross-domain Jaccard §5.8 :
- 12_thesis_features.sql : CTE jaccard_paths → cross_domain_path_similarity
- Signal : même chemins (/admin, /wp-login) sur plusieurs hosts = scanner

Étape 7 — ExIFFI + erreurs AE par feature :
- scoring.py : compute_exiffi_importance() par permutation, compute_ae_feature_errors()
- pipeline.py : calcul ExIFFI sur X_test, mapping index → dict pour anomalies
- build_reason() enrichi avec exiffi_top quand SHAP inactif

Étape 8 — Méta-learner pour la pondération de l'ensemble :
- scoring.py : classe MetaLearner (LogisticRegression, fallback poids fixes <1000 labels)
- Collecte des labels depuis le cycle courant (known_bots, légitimes, Anubis)
- pipeline.py : remplacement des poids fixes par MetaLearner.predict()

Étape 9 — Métriques de performance et monitoring :
- metrics.py : record_cycle_metrics() — taux anomalie, drift, corrélation, latence
- SQL migration 05_fleet_metrics_tables.sql : table ml_performance_metrics (TTL 90j)
- Dashboard : /health + /api/health + template health.html
- cycle.py : appel record_cycle_metrics en fin de cycle (Complet + Applicatif)

Tests : 36/36 bot-detector tests passent

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:11:35 +02:00
8ca4a1e849 feat(mod_reqin_log): fingerprinting HTTP/2 passif (Akamai format)
Ajoute un filtre d'entrée de connexion (AP_FTYPE_CONNECTION, APR_HOOK_LAST)
qui s'insère entre mod_ssl et mod_http2 pour lire de manière non-destructive
le preface HTTP/2 (RFC 9113 §3.4) et en extraire :

- h2_fingerprint    : fingerprint Akamai complet
                      ex. '1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p'
- h2_settings_fp    : entrées SETTINGS brutes  (ex. '1:65536,4:6291456')
- h2_window_update  : incrément WINDOW_UPDATE  (ex. '15663105')
- h2_pseudo_order   : ordre des pseudo-headers (ex. 'm,a,s,p' Chrome,
                                                     'm,p,s,a' Firefox)

Technique : lecture spéculative AP_MODE_SPECULATIVE (non-destructive)
de 512 octets — la donnée reste disponible pour mod_http2. Le filtre
se retire de la chaîne après la première invocation.

Stockage dans c->notes (H2_NOTE_*) puis émission JSON dans log_request().
ClickHouse : 4 nouvelles colonnes dans http_logs + JSONExtract dans mv_http_logs.
Migration pour déploiements existants : 04_http2_fields.sql.
14 tests unitaires (cmocka) couvrent Chrome/Firefox/HTTP1/troncature/HPACK.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:46:50 +02:00
bc11cfa8eb fix: init-stack rock-solid — drop/recreate derived tables before views
Root cause: CREATE TABLE IF NOT EXISTS is a no-op on existing tables,
so stale schemas miss new columns. Views (07+) then fail with
UNKNOWN_IDENTIFIER errors.

Fix: split SQL execution into 3 phases:
  Phase 1: databases, raw tables, dictionaries (00-04)
  Phase 2: DROP all derived tables (agg_*, ml_*) — safe, repopulated by MVs
  Phase 3: recreate derived tables + views with full current schema (05-12)

This removes the incomplete inline migrations and makes the script
truly idempotent regardless of prior schema version.

Tested: fresh --reset, existing stale DB, idempotent re-run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:21:15 +02:00
895d7894a9 docs: mise à jour copilot-instructions.md
- bot-detector : monolithe → 10 modules
- Ajout convention browser detection sans UA (5 axes, Client Hints)
- Ajout targets Makefile : init-stack, import-prod-data, purge-db, help
- Anubis : simplifié IP/CIDR + ASN (suppression dict_anubis_ua / REGEXP_TREE)
- Tests bot-detector : clarification imports lourds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:11:24 +02:00
14db3d9040 refactor: suppression dépendance User-Agent de la détection navigateur
Changements SQL :
- modern_browser_score : sec-ch-ua→100, Sec-Fetch→70 (plus de UA fallback)
- Ajout has_sec_ch_ua (UInt8) dans agg_header_fingerprint_1h et ml_all_scores
- mss_mobile_mismatch utilise has_sec_ch_ua au lieu de modern_browser_score
- header_order_confidence : PARTITION BY ja4 au lieu de first_ua
- sec_ch_mobile_mismatch : comparaison Client Hints interne (sans UA)
- Migration 03_remove_ua_browser_detection.sql

Changements Python :
- browser.py Axe 3 : Client Hints + Sec-Fetch + is_fake_navigation (PAS de UA)
- Pondération axes : ja4_known 0.30, tls_coherence 0.20 (signaux TLS renforcés)
- preprocessing.py : has_sec_ch_ua ajouté aux features et binary_features

Fichiers modifiés : 8 SQL/Python + 1 migration, 36/36 tests passent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:06:01 +02:00
00e99e5464 fix(bot-detector): make scoring functions public (remove underscore prefix)
compute_shap_top_features, build_reason, cluster_anomalies renamed from
private (_prefixed) to public to match pipeline.py imports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:49:48 +02:00
629f7b334d fix(bot-detector): rename _compute_drift_score to public, fix import
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:48:21 +02:00
de6d8da931 fix(bot-detector): FEATURES_BASE → FEATURES import name mismatch
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:42:32 +02:00
1fa6aec784 fix: SQL view ordering, purge-db flag, ctest directory
- 12_thesis_features.sql: move view_resource_cascade_1h before view_thesis_features_1h
- Makefile: purge-db uses --reset (not --clean)
- mod-reqin-log: ctest --test-dir build/tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:39:25 +02:00
6d64c2a8a8 fix(rpm): add systemd-rpm-macros to Dockerfile.package, fix correlator spec_version
- sentinel/correlator: install systemd-rpm-macros in rpm-builder stage
- correlator: use build_version macro (not version) to avoid recursive expansion
- mod-reqin-log: fix ctest --test-dir to find tests in build/tests/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:33:53 +02:00
ea488c0b11 feat: add make help with all targets documented
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:22:25 +02:00
0ba66729da feat: add make purge-db target for full database reset
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:21:15 +02:00
6b3cc54652 docs: réécriture audit, DOCUMENTATION.md et IMPROVEMENTS.md pour architecture modulaire
- AUDIT: conformité mise à jour 97.9% (142/145), références modulaires
- DOCUMENTATION.md: 1083 lignes, 7 sections, 11 modules documentés
- IMPROVEMENTS.md: A1-A10/B1-B10 annotés /🔄/ avec localisations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45 docs: réécriture complète de la documentation des services en français
- bot-detector.md : architecture 11 modules, 77/65 features,
  ensemble triple voix (EIF+AE+XGBoost), browser 5 axes, HDBSCAN,
  toutes les variables d'environnement vérifiées depuis le code source
- dashboard.md : corrigé stack (Jinja2+htmx, pas React+Vite),
  14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md : ajouté CLICKHOUSE_DB_PROCESSING/LOGS,
  schéma dual-database, note dashboard n'utilise pas ja4_common

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096 docs: réécriture complète de la documentation base de données en français
Réécriture des 3 fichiers de documentation de la base de données ClickHouse :

- docs/database/schema.md : couverture complète des 2 bases, 14+ tables,
  7 dictionnaires, 8 MVs, 8 vues, TTL, partitions, moteurs et colonnes
- docs/database/migrations.md : 13 fichiers SQL (ajout 10-12), prérequis
  mis à jour (ClickHouse 24.8+, 5 CSV), deploy_schema.sh, init-stack.sh,
  vérification et rollback complets
- shared/clickhouse/README.md : référence rapide des 13 fichiers,
  deploy_schema.sh, patron double-base, prérequis

Suppression des références obsolètes : dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f docs: rewrite architecture/README, update deployment/development
- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865 docs: mise à jour du document de thèse (§2-§8)
- §2.1.3: Simplifié Anubis à 2 dictionnaires (dict_anubis_ip, dict_anubis_asn) avec priorité COALESCE
- §2.4.2: Ajouté bibliothèque isotree, formule de calibration, ntrees=300, sérialisation joblib
- §2.4.2b/§2.4.4: Remplacé DBSCAN par HDBSCAN partout
- §2.4.2c: Remplacé régression logistique par pondération linéaire fixe, ajouté formule et poids
- §2.4.3: Clarifié approximation par 5 quantiles pour la détection de dérive
- §3.1: Mis à jour le diagramme ASCII (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Mis à jour la trifurcation + ajouté détection multifactorielle navigateur (5 axes)
- §4: Élargi taxonomie de 51 à 65+ features sur 8 familles
- §5: Ajouté statut d'implémentation (/) à chaque technique
- §6: Ajouté §6.6 résultats de déploiement (3M+ logs, 34K sessions/cycle)
- §7: Mis à jour conclusion (65+ features, 5/8 techniques, refactorisation modulaire)
- §8: Ajouté références isotree, PyTorch, HDBSCAN, XGBoost, SHAP

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
9ea36ad22e feat(scripts): complete stack init + prod data import with date shift
Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports

Makefile targets: init-stack, import-prod-data, init-and-import

Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts
        import-prod-data.sh: 3M rows in ~37s with auto date shift
        Dashboard: 55 routes OK, bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:40:05 +02:00
d8ca804a55 feat(scripts): add reload-prod-logs.sh for prod→dev data sync
Exports http_logs from prod ClickHouse via HTTP API, imports into dev
with dynamic date shifting (max(time) → now() by default).

Features:
- Batch export in Native format (200K rows/batch, ~10s each)
- Auto date shift: prod max(time) aligned to current time
- --shift N: manual override (seconds)
- --days N: filter to last N days only
- --cron: silent mode for scheduled runs
- Staging table approach: export → staging → INSERT SELECT with shift → cleanup

Tested: 3,054,122 rows imported in ~3 minutes, dates 2026-04-03→2026-04-09.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:41:38 +02:00
8180f4af04 refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
  view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
  UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
  (UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00
98abbc80c7 feat(dashboard): page Listes de référence — visualisation CSV/dictionnaires
Nouvelle page /reflists pour visualiser les 9 dictionnaires ClickHouse :
- bot_ip (3.5K entrées) : IP/CIDR de bots connus
- bot_ja4 (31) : fingerprints JA4 de bots
- browser_ja4 (1.2K) : fingerprints JA4 navigateurs → famille, lib TLS
- asn_reputation (82.5K) : ASN → réputation (isp, datacenter, cdn…)
- iplocate_asn (714K) : géolocalisation IP → ASN, pays, nom
- anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules

Fonctionnalités :
- 9 onglets de navigation entre les listes
- Recherche textuelle avec filtrage côté ClickHouse
- Pagination (200 entrées/page)
- Tri par colonne (ASC/DESC)
- Graphique de répartition (ECharts) par catégorie
- KPIs dictionnaires en haut de page
- Infobulles de documentation

API : /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats
Helpers : esc() (HTML escape) ajouté à base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:56:54 +02:00
039086a0b3 feat: nouvelles techniques de détection et page tactiques SOC
SQL:
- Ajout 5 colonnes d'agrégation (count_xff, count_unusual_ct,
  count_non_std_port, count_login_post, sec_ch_mobile_mismatch)
- Exposition de 5 features calculées dans view_ai_features_1h
- Migration ALTER TABLE pour déploiements existants

Bot-detector:
- 7 nouvelles features ML (has_xff, unusual_content_type_ratio,
  non_standard_port_ratio, login_post_concentration,
  sec_ch_mobile_mismatch, true_window_size, window_mss_ratio)
- Propagation campaign_id vers ml_all_scores (était toujours -1)
- Escalade campagne : HIGH→CRITICAL si cluster ≥5 membres

Dashboard:
- Page Tactiques SOC : brute-force, rotation JA4, récurrence,
  alertes temps réel — 4 KPIs + 4 panneaux + infobulles doc
- Ajout fmtDate() helper global
- Navigation sidebar mise à jour

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:29:18 +02:00
702c0d5edb feat(dashboard): add JA4 fingerprint and cluster investigation pages
- /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores
  table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation
- /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts,
  member table with bulk classify — full campaign investigation
- /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints
- fmtJA4 links now navigate to /ja4/ investigation page
- campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page
- Fix: double-brace {{param}} in non-f-string queries → single {param}
  (was causing HTTP 500 on all parameterized ClickHouse queries)
- 50 routes total, all tests pass, 0 JS console errors

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:05:52 +02:00
70188b508c fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation
Playwright testing revealed 3 critical bugs:

1. Tailwind CDN @apply with custom brand-* colors produces empty CSS
   rules, breaking ALL design components (kpi-card, data-table, badges,
   filter-btn, section-card, nav-item). Fix: replace all @apply
   directives with equivalent raw CSS values.

2. Traffic API and IP detail API reference non-existent 'status' column
   in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove
   status from SELECT, sort whitelist, filters, and templates.

3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside
   clickable <tr onclick> capture clicks, preventing row navigation to
   /ip/ detail. Fix: add event.stopPropagation() to all formatter links.

Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden
by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B),
classification form saves to DB, campaign detail panel opens on click.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:54:38 +02:00
6babc55e3e fix(dashboard): hover infobulles, full-width layout, UX polish
- Fix doc tooltips: split CSS into <style type='text/tailwindcss'> for
  @apply directives + raw CSS for reliable doc panel rendering
- Convert doc panels from click-toggle to hover-based infobulles with
  arrow pointer, fade-in animation, and auto-dismiss on mobile
- Replace '?' icons with 'ⓘ' across all 11 templates (51 tooltips)
- Full-width layout: reduce padding on mobile (px-3), scale up on
  desktop (lg:px-5, xl:px-6) for maximum screen utilization
- Auto-collapse sidebar on narrow screens (<1024px)
- Keyboard shortcuts: Alt+1–9 for page navigation, Alt+B toggle sidebar
- Add LEGITIMATE_BROWSER filter button to detections page
- Sticky header with stronger blur (backdrop-blur-md)
- All 46 routes pass tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:30:16 +02:00
63ba6d203c feat(dashboard): complete SOC dashboard with full monitoring and workflows
- models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate
  chart, threat breakdown per model, enhanced model cards with validation gate
- classify.html: SOC workflow — suggested unclassified IPs, quick-classify
  buttons, classification stats pie, pre-fill from URL params
- traffic.html: Clickable rows → ip_detail, column sorting, status column,
  search filter, doc tooltips on all chart sections
- scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER
  filter button, doc tooltips on distribution + scatter charts
- ip_detail.html: Resource cascade section (headless browser detection),
  status column in HTTP logs table
- detections.html: Doc tooltips on threat/reason/ASN chart sections
- features.html: Doc tooltips on radar/importance/scatter sections
- api.py: 4 new endpoints — /api/models/timeline, /api/models/threats,
  /api/classify/stats, /api/classify/suggested. Traffic API: status + search.

46 routes total. All tests pass (dashboard + bot-detector 36/36).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:25:01 +02:00
396baa90d2 feat(dashboard): visualisation clusters HDBSCAN
- Page /campaigns dédiée avec 4 vues graphiques :
  · Scatter plot (score vs vélocité, bulles colorées par campagne)
  · Graphe réseau force-directed (IPs liées par JA4 partagé)
  · Grille de cartes campagne (KPIs, ASN, pays, JA4)
  · Panneau détail (radar comportemental, timeline horaire, table membres)
- 4 nouveaux endpoints API :
  · GET /api/campaigns (fix: campaign_id >= 0 au lieu de != '')
  · GET /api/campaigns/graph (nœuds + arêtes)
  · GET /api/campaigns/scatter (score/vélocité par IP)
  · GET /api/campaigns/{cid} (détail + profil + timeline)
- Sidebar: lien Campagnes ajouté dans Surveillance
- Overview: campagnes clickables → lien vers /campaigns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:11:16 +02:00
f1547423b5 refactor(bot-detector): suppression monolithe, tests multifactoriels
- Suppression de bot_detector.py (1982 lignes) remplacé par 11 modules
- Tests navigateur mis à jour pour le système multifactoriel (browser_confidence)
- 36/36 tests passent avec la nouvelle structure modulaire

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:03:17 +02:00
1f103392ac refactor(bot-detector): extract monolith into modular package
Split bot_detector.py (~1982 lines) into 10 focused modules:
- config.py: all configuration constants and optional imports
- log.py: logging utilities (log_info, log_decision, append_training_history)
- infra.py: ClickHouse client, health check HTTP server, shutdown
- browser.py: multifactorial browser identification (5 axes)
- scoring.py: drift detection, feature validation, SHAP, clustering
- models.py: EIF, Autoencoder, XGBoost model management
- preprocessing.py: data preprocessing and feature list definitions
- pipeline.py: core semi-supervised scoring loop
- cycle.py: main analysis cycle orchestration
- __main__.py: entry point with startup banner

Update Dockerfile to copy package directory and use python -m bot_detector.

All 36 existing tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:02:04 +02:00