Commit Graph

133 Commits

Author SHA1 Message Date
79dbb23d6f feat(dashboard): time-range selector on /campaigns
Before: all campaign views were fixed at 7 days.
After: selector for 1 / 7 (default) / 14 / 30 / 90 days in the top right.

- Added the ?days= parameter (1–90, default 7) to:
    GET /api/campaigns
    GET /api/campaigns/graph
    GET /api/campaigns/scatter
    GET /api/campaigns/{cid}
- The selector reloads all 3 views (cards, scatter, graph) simultaneously,
  along with the detail panel, using the same time window
- The campaign counter shows the active range (for example, 4 campaigns over a 30-day window)
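
A minimal sketch of how a ?days= value in the 1–90 range with a default of 7 might be validated. The helper name parse_days is hypothetical, and the commit does not say whether out-of-range values are clamped or rejected; this sketch assumes clamping.

```python
DEFAULT_DAYS = 7
MIN_DAYS, MAX_DAYS = 1, 90


def parse_days(raw):
    """Turn a raw ?days= query value into a window in days.

    Hypothetical helper: missing or non-numeric input falls back to the
    default, and out-of-range values are clamped (an assumption).
    """
    if raw is None or raw == "":
        return DEFAULT_DAYS
    try:
        days = int(raw)
    except (TypeError, ValueError):
        return DEFAULT_DAYS
    return max(MIN_DAYS, min(MAX_DAYS, days))
```

Applying the same helper in all four endpoints would keep the cards, scatter, graph and detail panel on an identical window.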

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:24:08 +02:00
9548b1782d fix: correct ORDER BY of ml_detected_anomalies in the base schema
CH 24.8 rejects MODIFY ORDER BY on existing columns (error BAD_ARGUMENTS 36).
Migration 01 therefore could not fix the ORDER BY after init.

Fix:
- 06_ml_tables.sql: ORDER BY (src_ip) → ORDER BY (src_ip, ja4, host, model_name)
  + TTL 30d → 7d (consistent with the documented architecture)
- 01_ttl_adjustments.sql: removes the impossible MODIFY ORDER BY, keeps
  only the MODIFY TTL statements (valid for existing deployments)

Result: make init-stack runs with no ⚠ or ✗

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:34:07 +02:00
51dd376f7a docs: full update (7/8 techniques, 85 features, 12 modules)
Reflects the actual state of the system after roadmap steps 1-9:

- §5.2 (fleet_detector, NetworkX/Louvain) and §5.8 (cross-domain Jaccard):
- MetaLearner (logistic regression, fixed-weight fallback): documented
- ExIFFI (EIF isolation depth) + per-feature AE error: documented
- KL divergence alongside KS, adversarial drift: documented
- HTTP/2 fingerprinting (h2_fingerprint, dict_browser_h2, axis_h2_coherence): documented
- Cycle metrics (metrics.py, ml_performance_metrics, alerts): documented
- Browser confidence: 5 axes → 6 axes (axis_h2_coherence)
- 85 features (73 FEATURES + 12 FEATURES_COMPLET), 12 modules, 53 dashboard routes
- Thesis conformity: 99.4% (was 97.9%), §5: 87.5% (was 62.5%)
- New tables: fleet_detections, ml_performance_metrics, soc_feedback
- Dictionaries: 8 (dict_browser_h2 added)
- Dashboard: 16 pages + 37 API routes (fleet, health added)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:31:20 +02:00
edbb4aed2c fix(import): add h2 columns with defaults for prod data missing 4 cols
The prod data export was made before http/2 columns were added to
http_logs (h2_fingerprint, h2_settings_fp, h2_window_update,
h2_pseudo_order). The INSERT SELECT now provides empty/zero literals
for those 4 columns so the 56-col Native export imports into the
60-col table without a column count mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:16:36 +02:00
92432085e2 fix(campaigns): fix IP navigation URL encoding
fmtIP() returns an HTML <a> tag string. Using encodeURIComponent(fmtIP(ip))
was URL-encoding the entire HTML markup instead of the raw IP address,
resulting in /ip/%3Ca%20href%3D... navigation.

Fix: extract raw IP (stripping ::ffff: prefix) before building the URL.
Applied to all 3 click handlers in campaigns.html:
- members table row onclick
- scatter chart point click
- force graph node click
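
The fix itself lives in JavaScript (campaigns.html); the idea can be illustrated in Python under that caveat. The helper names raw_ip and ip_detail_url are hypothetical. The sketch strips the ::ffff: prefix first and only then percent-encodes the raw address.

```python
from urllib.parse import quote

MAPPED_PREFIX = "::ffff:"


def raw_ip(ip: str) -> str:
    """Strip the IPv4-mapped IPv6 prefix, leaving the plain address."""
    if ip.lower().startswith(MAPPED_PREFIX):
        return ip[len(MAPPED_PREFIX):]
    return ip


def ip_detail_url(ip: str) -> str:
    """Build the /ip/ navigation URL from the raw address, never from markup."""
    return "/ip/" + quote(raw_ip(ip), safe="")
```

Encoding formatted HTML instead of the address reproduces the bug: quote('<a href=...>', safe="") begins with %3Ca%20href%3D, which matches the broken URL quoted in the commit.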

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:08:53 +02:00
7a04e47041 fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error)
- api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE
  (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:05:04 +02:00
040437921c fix(init-stack): pre-drop mv_http_logs + http_logs before schema apply
Ensure h2 columns are always included on fresh init. Also add migration
loop for fleet_detections and ml_performance_metrics tables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:00:04 +02:00
b409a70970 fix(views): align SQL views with dashboard API expected columns
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- view_ip_recurrence: add worst_threat alias + top_ja4, top_host columns

All three views were missing columns referenced by /api/brute-force,
/api/ja4-rotation and /api/recurrence endpoints, causing 500 errors
on the Tactiques page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:59:57 +02:00
2f2c5e03bb fix(sql): work around ClickHouse 24.8 scope bug in view_ai_features_1h
- Restructure 07_ai_features_view.sql: single anonymous inner subquery
  with explicit aliases on all columns (a.xxx AS xxx, h.xxx AS xxx,
  h2.xxx AS xxx) to resolve the PARTITION BY src_ip ambiguity in the outer SELECT
- Remove the multiple CTEs (h2_agg, enriched) that triggered the bug
- Fix migration 04_http2_fields.sql: DEFAULT must come before CODEC (ClickHouse syntax)
- make init-stack: 0 errors across 13 SQL files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:48:05 +02:00
a108814a56 feat: bot-detection roadmap §2-9: HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics
Step 2. HTTP/2 fingerprinting in the ML pipeline:
- Added the dict_browser_h2 dictionary (11 browser families) in 05_aggregation_tables.sql
- Added the h2_agg CTE and 4 HTTP/2 features in 07_ai_features_view.sql:
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- fingerprint_coherence_score (5 weighted axes) is computed in the view
- Added the 6th axis axis_h2_coherence in browser.py (weights rebalanced)
- browser_h2.csv: 11 Akamai fingerprints → browser family

Step 3. Coherence pre-filter on the human baseline:
- pipeline.py excludes sessions with fingerprint_coherence_score < threshold from the training baseline
- FINGERPRINT_COHERENCE_THRESHOLD configurable via env (default 0.25)
- Excluded sessions are logged for SOC analysis
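
The pre-filter described in step 3 can be sketched as follows. The function names and the list-of-dicts session shape are assumptions made for illustration; only the environment variable name, the 0.25 default, and the strict < comparison come from the commit.

```python
import os


def coherence_threshold(env=os.environ) -> float:
    """Read the threshold from the environment, defaulting to 0.25."""
    return float(env.get("FINGERPRINT_COHERENCE_THRESHOLD", "0.25"))


def split_baseline(sessions, threshold):
    """Split sessions into (kept, excluded) for the training baseline.

    A session is excluded when its fingerprint_coherence_score is strictly
    below the threshold; excluded sessions are returned so they can be logged.
    """
    kept, excluded = [], []
    for session in sessions:
        if session["fingerprint_coherence_score"] < threshold:
            excluded.append(session)
        else:
            kept.append(session)
    return kept, excluded
```

Returning the excluded sessions rather than dropping them silently mirrors the logging requirement in the last bullet.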

Step 4. Improved drift detection:
- scoring.py: from 5 to 9 quantiles (p5…p95)
- Added KL divergence alongside the KS test
- Adversarial drift detection (≥80% of features drift in the same direction)
- Strict temporal split for validation
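
Two of the checks in step 4 can be sketched in plain Python. This is not the scoring.py implementation: the function names, the histogram inputs, the smoothing constant, and the choice of all features as the denominator for the 80% rule are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two histograms over the same bins.

    A small eps is added to every bin so empty bins do not produce
    log(0) or a division by zero (smoothing choice is an assumption).
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def is_adversarial_drift(shifts, min_share=0.80):
    """Flag drift when at least min_share of features move the same way.

    shifts holds one signed value per feature (current minus baseline).
    """
    if not shifts:
        return False
    up = sum(1 for s in shifts if s > 0)
    down = sum(1 for s in shifts if s < 0)
    return max(up, down) / len(shifts) >= min_share
```

Natural drift tends to move features in mixed directions, so a strongly one-sided shift is the signal this second check looks for.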

Step 5. Bipartite JA4×ASN graph (§5.2):
- fleet.py: fleet detection via NetworkX + Louvain (optional imports)
- enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame
- cycle.py: called after preprocess_df, logging the number of fleet sessions
- SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (TTL 7d)
- Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template

Step 6. Cross-domain Jaccard (§5.8):
- 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity
- Signal: the same paths (/admin, /wp-login) on several hosts = scanner
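
The signal in step 6 rests on Jaccard similarity between the path sets a client requests on different hosts. The real computation is a SQL CTE; the Python below is only an illustration, and aggregating by the mean over host pairs is an assumption, as are the function names.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of paths."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_domain_path_similarity(paths_by_host):
    """Mean pairwise Jaccard similarity of path sets across hosts.

    paths_by_host maps a host name to the paths one client requested on it.
    With fewer than two hosts there is nothing to compare, so return 0.0.
    """
    pairs = list(combinations(paths_by_host.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A client probing /admin and /wp-login on every host scores close to 1, whereas ordinary browsing of unrelated sites scores close to 0.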

Step 7. ExIFFI + per-feature AE errors:
- scoring.py: compute_exiffi_importance() via permutation, compute_ae_feature_errors()
- pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies
- build_reason() enriched with exiffi_top when SHAP is inactive

Step 8. Meta-learner for ensemble weighting:
- scoring.py: MetaLearner class (LogisticRegression, fixed-weight fallback below 1000 labels)
- Labels collected from the current cycle (known_bots, legitimate, Anubis)
- pipeline.py: fixed weights replaced by MetaLearner.predict()

Step 9. Performance metrics and monitoring:
- metrics.py: record_cycle_metrics() records anomaly rate, drift, correlation, latency
- SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (TTL 90d)
- Dashboard: /health + /api/health + health.html template
- cycle.py: record_cycle_metrics called at end of cycle (Complet + Applicatif)

Tests: 36/36 bot-detector tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:11:35 +02:00
8ca4a1e849 feat(mod_reqin_log): passive HTTP/2 fingerprinting (Akamai format)
Adds a connection input filter (AP_FTYPE_CONNECTION, APR_HOOK_LAST)
that sits between mod_ssl and mod_http2 to read, non-destructively,
the HTTP/2 preface (RFC 9113 §3.4) and extract:

- h2_fingerprint    : full Akamai fingerprint
                      e.g. '1:65536,2:0,4:6291456,6:262144|15663105|0|m,a,s,p'
- h2_settings_fp    : raw SETTINGS entries     (e.g. '1:65536,4:6291456')
- h2_window_update  : WINDOW_UPDATE increment  (e.g. '15663105')
- h2_pseudo_order   : pseudo-header order      (e.g. 'm,a,s,p' Chrome,
                                                     'm,p,s,a' Firefox)

Technique: speculative read of 512 bytes with AP_MODE_SPECULATIVE
(non-destructive), so the data remains available to mod_http2. The filter
removes itself from the chain after the first invocation.

Stored in c->notes (H2_NOTE_*), then emitted as JSON in log_request().
ClickHouse: 4 new columns in http_logs + JSONExtract in mv_http_logs.
Migration for existing deployments: 04_http2_fields.sql.
14 unit tests (cmocka) cover Chrome/Firefox/HTTP1/truncation/HPACK.
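
For readers consuming the logged value downstream, the fingerprint string shown above can be split into its four pipe-separated fields. This parser is a hypothetical consumer-side sketch in Python, not part of the C module; the function name and the returned dictionary keys are assumptions.

```python
def parse_h2_fingerprint(fp: str) -> dict:
    """Split 'SETTINGS|WINDOW_UPDATE|PRIORITY|PSEUDO_ORDER' into fields.

    SETTINGS is a comma-separated list of id:value pairs, as in the
    example from the commit message. Other fields are kept as strings.
    """
    parts = fp.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(parts)}")
    settings_raw, window_update, priority, pseudo_order = parts

    settings = {}
    for entry in filter(None, settings_raw.split(",")):
        key, _, value = entry.partition(":")
        settings[int(key)] = int(value)

    return {
        "settings": settings,
        "window_update": window_update,
        "priority": priority,
        "pseudo_order": pseudo_order.split(",") if pseudo_order else [],
    }
```

With the Chrome-like example, the pseudo-header order comes back as m, a, s, p, which is the field the commit uses to tell Chrome from Firefox.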

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:46:50 +02:00
bc11cfa8eb fix: init-stack rock-solid — drop/recreate derived tables before views
Root cause: CREATE TABLE IF NOT EXISTS is a no-op on existing tables,
so stale schemas miss new columns. Views (07+) then fail with
UNKNOWN_IDENTIFIER errors.

Fix: split SQL execution into 3 phases:
  Phase 1: databases, raw tables, dictionaries (00-04)
  Phase 2: DROP all derived tables (agg_*, ml_*) — safe, repopulated by MVs
  Phase 3: recreate derived tables + views with full current schema (05-12)

This removes the incomplete inline migrations and makes the script
truly idempotent regardless of prior schema version.

Tested: fresh --reset, existing stale DB, idempotent re-run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:21:15 +02:00
895d7894a9 docs: update copilot-instructions.md
- bot-detector: monolith → 10 modules
- Added convention for UA-free browser detection (5 axes, Client Hints)
- Added Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarified heavy imports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:11:24 +02:00
14db3d9040 refactor: remove User-Agent dependency from browser detection
SQL changes:
- modern_browser_score: sec-ch-ua→100, Sec-Fetch→70 (no more UA fallback)
- Added has_sec_ch_ua (UInt8) to agg_header_fingerprint_1h and ml_all_scores
- mss_mobile_mismatch uses has_sec_ch_ua instead of modern_browser_score
- header_order_confidence: PARTITION BY ja4 instead of first_ua
- sec_ch_mobile_mismatch: internal Client Hints comparison (no UA)
- Migration 03_remove_ua_browser_detection.sql

Python changes:
- browser.py Axis 3: Client Hints + Sec-Fetch + is_fake_navigation (NO UA)
- Axis weights: ja4_known 0.30, tls_coherence 0.20 (TLS signals strengthened)
- preprocessing.py: has_sec_ch_ua added to features and binary_features

Files changed: 8 SQL/Python + 1 migration, 36/36 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:06:01 +02:00
00e99e5464 fix(bot-detector): make scoring functions public (remove underscore prefix)
compute_shap_top_features, build_reason, cluster_anomalies renamed from
private (_prefixed) to public to match pipeline.py imports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:49:48 +02:00
629f7b334d fix(bot-detector): rename _compute_drift_score to public, fix import
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:48:21 +02:00
de6d8da931 fix(bot-detector): FEATURES_BASE → FEATURES import name mismatch
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:42:32 +02:00
1fa6aec784 fix: SQL view ordering, purge-db flag, ctest directory
- 12_thesis_features.sql: move view_resource_cascade_1h before view_thesis_features_1h
- Makefile: purge-db uses --reset (not --clean)
- mod-reqin-log: ctest --test-dir build/tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:39:25 +02:00
6d64c2a8a8 fix(rpm): add systemd-rpm-macros to Dockerfile.package, fix correlator spec_version
- sentinel/correlator: install systemd-rpm-macros in rpm-builder stage
- correlator: use build_version macro (not version) to avoid recursive expansion
- mod-reqin-log: fix ctest --test-dir to find tests in build/tests/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:33:53 +02:00
ea488c0b11 feat: add make help with all targets documented
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:22:25 +02:00
0ba66729da feat: add make purge-db target for full database reset
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:21:15 +02:00
6b3cc54652 docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for the modular architecture
- AUDIT: conformity updated to 97.9% (142/145), modular references
- DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented
- IMPROVEMENTS.md: A1-A10/B1-B10 annotated /🔄/ with locations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45 docs: complete rewrite of the service documentation in French
- bot-detector.md: 11-module architecture, 77/65 features,
  triple-voice ensemble (EIF+AE+XGBoost), 5-axis browser detection, HDBSCAN,
  all environment variables verified against the source code
- dashboard.md: corrected stack (Jinja2+htmx, not React+Vite),
  14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md: added CLICKHOUSE_DB_PROCESSING/LOGS,
  dual-database schema, note that the dashboard does not use ja4_common

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096 docs: complete rewrite of the database documentation in French
Rewrite of the 3 ClickHouse database documentation files:

- docs/database/schema.md: full coverage of the 2 databases, 14+ tables,
  7 dictionaries, 8 MVs, 8 views, TTL, partitions, engines and columns
- docs/database/migrations.md: 13 SQL files (10-12 added), updated
  prerequisites (ClickHouse 24.8+, 5 CSVs), deploy_schema.sh, init-stack.sh,
  full verification and rollback
- shared/clickhouse/README.md: quick reference for the 13 files,
  deploy_schema.sh, dual-database pattern, prerequisites

Removed obsolete references: dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f docs: rewrite architecture/README, update deployment/development
- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865 docs: update the thesis document (§2-§8)
- §2.1.3: Simplified Anubis to 2 dictionaries (dict_anubis_ip, dict_anubis_asn) with COALESCE priority
- §2.4.2: Added isotree library, calibration formula, ntrees=300, joblib serialization
- §2.4.2b/§2.4.4: Replaced DBSCAN with HDBSCAN throughout
- §2.4.2c: Replaced logistic regression with fixed linear weighting, added formula and weights
- §2.4.3: Clarified the 5-quantile approximation for drift detection
- §3.1: Updated the ASCII diagram (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Updated the trifurcation + added multifactorial browser detection (5 axes)
- §4: Expanded taxonomy from 51 to 65+ features across 8 families
- §5: Added implementation status (/) to each technique
- §6: Added §6.6 deployment results (3M+ logs, 34K sessions/cycle)
- §7: Updated conclusion (65+ features, 5/8 techniques, modular refactoring)
- §8: Added references for isotree, PyTorch, HDBSCAN, XGBoost, SHAP

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
9ea36ad22e feat(scripts): complete stack init + prod data import with date shift
Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports

Makefile targets: init-stack, import-prod-data, init-and-import

Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts
        import-prod-data.sh: 3M rows in ~37s with auto date shift
        Dashboard: 55 routes OK, bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:40:05 +02:00
d8ca804a55 feat(scripts): add reload-prod-logs.sh for prod→dev data sync
Exports http_logs from prod ClickHouse via HTTP API, imports into dev
with dynamic date shifting (max(time) → now() by default).

Features:
- Batch export in Native format (200K rows/batch, ~10s each)
- Auto date shift: prod max(time) aligned to current time
- --shift N: manual override (seconds)
- --days N: filter to last N days only
- --cron: silent mode for scheduled runs
- Staging table approach: export → staging → INSERT SELECT with shift → cleanup

Tested: 3,054,122 rows imported in ~3 minutes, dates 2026-04-03→2026-04-09.
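
The date-shift rule (prod max(time) aligned to the current time, with an optional manual override) reduces to a single offset applied to every row. The script itself is shell plus SQL; the Python below only illustrates the arithmetic, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def compute_shift_seconds(prod_max_time, now=None, manual_shift=None):
    """Offset in seconds to add to every timestamp.

    By default the newest prod row lands on the current time; a manual
    value (the --shift N case) takes precedence when given.
    """
    if manual_shift is not None:
        return int(manual_shift)
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - prod_max_time).total_seconds())


def shift_time(ts, shift_seconds):
    """Apply the offset to one timestamp."""
    return ts + timedelta(seconds=shift_seconds)
```

Because one constant offset is used for all rows, the spacing between requests is preserved, which matters for velocity and session features.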

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:41:38 +02:00
8180f4af04 refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
  view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
  UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
  (UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00
98abbc80c7 feat(dashboard): Reference Lists page for viewing CSVs/dictionaries
New /reflists page for viewing the 9 ClickHouse dictionaries:
- bot_ip (3.5K entries): IP/CIDR of known bots
- bot_ja4 (31): JA4 fingerprints of bots
- browser_ja4 (1.2K): browser JA4 fingerprints → family, TLS lib
- asn_reputation (82.5K): ASN → reputation (isp, datacenter, cdn…)
- iplocate_asn (714K): IP geolocation → ASN, country, name
- anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules

Features:
- 9 tabs for navigating between the lists
- Text search with ClickHouse-side filtering
- Pagination (200 entries/page)
- Column sorting (ASC/DESC)
- Distribution chart (ECharts) by category
- Dictionary KPIs at the top of the page
- Documentation tooltips

API: /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats
Helpers: esc() (HTML escape) added to base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:56:54 +02:00
039086a0b3 feat: new detection techniques and SOC tactics page
SQL:
- Added 5 aggregation columns (count_xff, count_unusual_ct,
  count_non_std_port, count_login_post, sec_ch_mobile_mismatch)
- Exposed 5 computed features in view_ai_features_1h
- ALTER TABLE migration for existing deployments

Bot-detector:
- 7 new ML features (has_xff, unusual_content_type_ratio,
  non_standard_port_ratio, login_post_concentration,
  sec_ch_mobile_mismatch, true_window_size, window_mss_ratio)
- campaign_id propagated to ml_all_scores (was always -1)
- Campaign escalation: HIGH→CRITICAL if the cluster has ≥5 members
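
The escalation rule can be sketched as below. The row shape (dicts with campaign_id and threat_level) and the function name are assumptions; treating a negative campaign_id as "no campaign" follows the -1 default mentioned above.

```python
from collections import Counter

MIN_CAMPAIGN_SIZE = 5  # threshold from the commit message


def escalate_campaign_threats(rows):
    """Return copies of rows with HIGH raised to CRITICAL in large clusters.

    Rows with a negative campaign_id are treated as unclustered and are
    never escalated. Other threat levels are left unchanged.
    """
    sizes = Counter(r["campaign_id"] for r in rows if r["campaign_id"] >= 0)
    escalated = []
    for r in rows:
        level = r["threat_level"]
        if level == "HIGH" and sizes.get(r["campaign_id"], 0) >= MIN_CAMPAIGN_SIZE:
            level = "CRITICAL"
        escalated.append({**r, "threat_level": level})
    return escalated
```

The rule depends on campaign_id actually being propagated, which is why the previous bullet (campaign_id always -1) had to be fixed in the same commit.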

Dashboard:
- SOC Tactics page: brute force, JA4 rotation, recurrence,
  real-time alerts, with 4 KPIs + 4 panels + doc tooltips
- Added global fmtDate() helper
- Sidebar navigation updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:29:18 +02:00
702c0d5edb feat(dashboard): add JA4 fingerprint and cluster investigation pages
- /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores
  table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation
- /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts,
  member table with bulk classify — full campaign investigation
- /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints
- fmtJA4 links now navigate to /ja4/ investigation page
- campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page
- Fix: double-brace {{param}} in non-f-string queries → single {param}
  (was causing HTTP 500 on all parameterized ClickHouse queries)
- 50 routes total, all tests pass, 0 JS console errors
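
The double-brace bug comes from Python string semantics: {{ collapses to { only inside an f-string (or str.format), while a plain string keeps both braces. The snippet below demonstrates this; the {ip:String} placeholder and the query text are assumptions for illustration, not the project's actual queries.

```python
table = "http_logs"

# Inside an f-string, doubled braces are an escape and yield single braces.
fstring_query = f"SELECT count() FROM {table} WHERE src_ip = {{ip:String}}"

# In a plain string nothing is unescaped, so the doubled braces are sent
# to the server verbatim and the placeholder is no longer recognised.
plain_buggy = "SELECT count() FROM http_logs WHERE src_ip = {{ip:String}}"

# The fix for non-f-string queries: write single braces directly.
plain_fixed = "SELECT count() FROM http_logs WHERE src_ip = {ip:String}"
```

Here fstring_query and plain_fixed are the same string, while plain_buggy still contains the literal {{ and }}.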

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:05:52 +02:00
70188b508c fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation
Playwright testing revealed 3 critical bugs:

1. Tailwind CDN @apply with custom brand-* colors produces empty CSS
   rules, breaking ALL design components (kpi-card, data-table, badges,
   filter-btn, section-card, nav-item). Fix: replace all @apply
   directives with equivalent raw CSS values.

2. Traffic API and IP detail API reference non-existent 'status' column
   in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove
   status from SELECT, sort whitelist, filters, and templates.

3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside
   clickable <tr onclick> capture clicks, preventing row navigation to
   /ip/ detail. Fix: add event.stopPropagation() to all formatter links.

Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden
by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B),
classification form saves to DB, campaign detail panel opens on click.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:54:38 +02:00
6babc55e3e fix(dashboard): hover tooltips, full-width layout, UX polish
- Fix doc tooltips: split CSS into <style type='text/tailwindcss'> for
  @apply directives + raw CSS for reliable doc panel rendering
- Convert doc panels from click-toggle to hover-based tooltips with
  arrow pointer, fade-in animation, and auto-dismiss on mobile
- Replace '?' icons with 'ⓘ' across all 11 templates (51 tooltips)
- Full-width layout: reduce padding on mobile (px-3), scale up on
  desktop (lg:px-5, xl:px-6) for maximum screen utilization
- Auto-collapse sidebar on narrow screens (<1024px)
- Keyboard shortcuts: Alt+1–9 for page navigation, Alt+B toggle sidebar
- Add LEGITIMATE_BROWSER filter button to detections page
- Sticky header with stronger blur (backdrop-blur-md)
- All 46 routes pass tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:30:16 +02:00
63ba6d203c feat(dashboard): complete SOC dashboard with full monitoring and workflows
- models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate
  chart, threat breakdown per model, enhanced model cards with validation gate
- classify.html: SOC workflow — suggested unclassified IPs, quick-classify
  buttons, classification stats pie, pre-fill from URL params
- traffic.html: Clickable rows → ip_detail, column sorting, status column,
  search filter, doc tooltips on all chart sections
- scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER
  filter button, doc tooltips on distribution + scatter charts
- ip_detail.html: Resource cascade section (headless browser detection),
  status column in HTTP logs table
- detections.html: Doc tooltips on threat/reason/ASN chart sections
- features.html: Doc tooltips on radar/importance/scatter sections
- api.py: 4 new endpoints — /api/models/timeline, /api/models/threats,
  /api/classify/stats, /api/classify/suggested. Traffic API: status + search.

46 routes total. All tests pass (dashboard + bot-detector 36/36).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:25:01 +02:00
396baa90d2 feat(dashboard): HDBSCAN cluster visualization
- Dedicated /campaigns page with 4 chart views:
  · Scatter plot (score vs velocity, bubbles colored by campaign)
  · Force-directed network graph (IPs linked by shared JA4)
  · Campaign card grid (KPIs, ASN, country, JA4)
  · Detail panel (behavioral radar, hourly timeline, member table)
- 4 new API endpoints:
  · GET /api/campaigns (fix: campaign_id >= 0 instead of != '')
  · GET /api/campaigns/graph (nodes + edges)
  · GET /api/campaigns/scatter (score/velocity per IP)
  · GET /api/campaigns/{cid} (detail + profile + timeline)
- Sidebar: Campagnes link added under Surveillance
- Overview: campaigns are clickable → link to /campaigns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:11:16 +02:00
f1547423b5 refactor(bot-detector): remove monolith, multifactorial tests
- Removed bot_detector.py (1982 lines), replaced by 11 modules
- Browser tests updated for the multifactorial system (browser_confidence)
- 36/36 tests pass with the new modular structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:03:17 +02:00
1f103392ac refactor(bot-detector): extract monolith into modular package
Split bot_detector.py (~1982 lines) into 10 focused modules:
- config.py: all configuration constants and optional imports
- log.py: logging utilities (log_info, log_decision, append_training_history)
- infra.py: ClickHouse client, health check HTTP server, shutdown
- browser.py: multifactorial browser identification (5 axes)
- scoring.py: drift detection, feature validation, SHAP, clustering
- models.py: EIF, Autoencoder, XGBoost model management
- preprocessing.py: data preprocessing and feature list definitions
- pipeline.py: core semi-supervised scoring loop
- cycle.py: main analysis cycle orchestration
- __main__.py: entry point with startup banner

Update Dockerfile to copy package directory and use python -m bot_detector.

All 36 existing tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:02:04 +02:00
2d04288e95 feat(dashboard): SOC workflow overhaul — sidebar nav, doc tooltips, full-width layout
- base.html: collapsible sidebar navigation, doc tooltip system, JS helpers
  (fmtNum, fmtPct, fmtDuration, ecGrid, buildTable, docHTML)
- overview.html: SOC command center with stacked timeline, live alerts,
  campaigns panel, browser donut, 6 KPIs
- detections.html: threat color dots, raw score column, click-to-navigate rows
- network.html: JA4 rotation, brute-force, persistent threats tables, 6 KPIs
- ip_detail.html: ASN/country KPIs, AE/XGB/campaign columns, enriched features
- scores/traffic/features/models/classify: page_title blocks + doc tooltips
- api.py: 9 new endpoints (campaigns, brute-force, ja4-rotation, recurrence,
  cascade, alerts, timeline-detail, ua-rotation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:29:34 +02:00
c994ad4466 fix: XGB label query + SHAP isotree compatibility
XGB: query was selecting features from ml_all_scores which doesn't
store them. Now joins ml_all_scores (labels) with view_ai_features_1h
(features). Dynamically discovers available columns to skip thesis §5
features not present in the view. Returns (model, features) tuple.

SHAP: TreeExplainer doesn't support isotree. Fall back to permutation-
based Explainer(model.decision_function, X_sample) for isotree.

Verified: XGB trained on 50000 labels (18436 positives), triple-voice
ensemble scoring active (EIF+AE+XGB), SHAP silent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:06:54 +02:00
c6666e2bba fix: isotree score convention — proper sklearn calibration
isotree decision_function returns [0,1] (higher=anomalous, 0.5=boundary).
The entire pipeline (normalize_scores, score_to_threat_level,
compute_adaptive_threshold) expects sklearn convention (negative=anomalous).

Previous fix (-raw_scores) negated all values, making everything
below -0.30 → all CRITICAL. New fix: 0.5 - isotree_score maps
correctly to sklearn's convention:
  isotree 0.80 → -0.30 (CRITICAL)
  isotree 0.65 → -0.15 (HIGH)
  isotree 0.55 → -0.05 (MEDIUM)
  isotree 0.50 →  0.00 (boundary)

Verified: 27,952 LEGITIMATE_BROWSER + 15,843 HIGH + 15,059 MEDIUM
Tests: 36/36 pass.
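
The calibration can be checked against the mapping table above with a small sketch. The function names, the NORMAL label, and the exact threshold boundaries (-0.30, -0.15, -0.05, inclusive) are assumptions inferred from the table; the real score_to_threat_level may differ.

```python
def to_sklearn_convention(isotree_score: float) -> float:
    """Map isotree's [0, 1] score (0.5 = boundary) to negative = anomalous."""
    return 0.5 - isotree_score


def threat_level(score: float) -> str:
    """Classify a sklearn-convention score (assumed inclusive thresholds)."""
    s = round(score, 6)  # absorb float noise such as -0.30000000000000004
    if s <= -0.30:
        return "CRITICAL"
    if s <= -0.15:
        return "HIGH"
    if s <= -0.05:
        return "MEDIUM"
    return "NORMAL"
```

Under these assumed thresholds, plain negation sends an unremarkable isotree score of 0.55 to -0.55, which lands in CRITICAL; that is the failure mode of the earlier fix.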

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:56:05 +02:00
db306fb9da fix: P0 audit bugs — bot-detector + dashboard + SQL
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
  uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name

Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
  formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days

SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:33:00 +02:00
b66d41a200 docs: updated conformity audit bot-detector + dashboard vs thesis
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.

Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
98289ccf04 fix: ASN dictionary pipeline + verbose bot-detector logging
- Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols)
- Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format)
- Replace org/domain dictGet calls with empty string literals in MV
- Full 714K CIDR stub for complete ASN resolution in tests
- Add header generation to generate_asn_data.py
- Verbose bot-detector stdout: data summary, triage breakdown, model
  training details, scoring stats, browser classification, boxed results
- Fix IPv6 filter in traffic seeder (_ips_from_cidrs)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 17:43:55 +02:00
7b7b69dee3 Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.

Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv
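
The new CLI arguments could be wired up roughly like this. It is a sketch only: the defaults are assumptions (500K rows and 20K IPs come from the commit title, /app/data from the docker-compose mount mentioned below), and `build_parser` is a hypothetical name, not necessarily what seed_clickhouse.py uses.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a seeder exposing --rows, --ips, --seed, --data-dir."""
    parser = argparse.ArgumentParser(description="Seed ClickHouse with synthetic traffic")
    # Defaults are assumed from the commit title and the compose mount.
    parser.add_argument("--rows", type=int, default=500_000, help="total rows to generate")
    parser.add_argument("--ips", type=int, default=20_000, help="distinct source IPs")
    parser.add_argument("--seed", type=int, default=None, help="RNG seed for reproducible runs")
    parser.add_argument("--data-dir", default="/app/data", help="directory holding the CSV inputs")
    return parser

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--rows", "1000", "--ips", "50", "--seed", "42"])
print(args.rows, args.ips, args.seed, args.data_dir)
```

Passing a fixed --seed is what makes a generated dataset repeatable between test runs.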

Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:35:40 +02:00
74e0406c38 chore: update ASN stubs with new classification labels
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:05:25 +02:00
5c5bca71d1 feat: rewrite ASN classification with PeeringDB + expanded heuristics
Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use

Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs,
education 3.1K, enterprise 5.7K, government 520.

Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:02:07 +02:00
9a48fb9d29 feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency
Add browser legitimacy classification (A9) to the bot detection pipeline:

- New features: is_known_browser (binary) and browser_consistency_score [0..5]
  combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language,
  cookies, Sec-Fetch-* presence
- Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW
  threat level are reclassified as LEGITIMATE_BROWSER
- Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays
  in normal anomaly scoring — prevents evasion via JA4 spoofing
- XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label)
- ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores
- Dashboard: browser_family filter/sort on detections and scores endpoints,
  legitimate_browsers count and browser_stats in overview
- 6 new unit tests covering classification threshold, spoofing, exclusion logic
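
The reclassification rule described above can be sketched as below. The `Session` fields and function names are hypothetical, and each of the five signals is reduced to a boolean; how the pipeline actually derives them (for example the modern_browser_score cut-off) is not stated in this message.

```python
from dataclasses import dataclass

@dataclass
class Session:
    # One boolean per signal named in the commit message (assumed encoding).
    ja4_is_known_browser: bool
    modern_browser_ok: bool
    has_accept_language: bool
    has_cookies: bool
    has_sec_fetch: bool
    threat_level: str  # level assigned by the anomaly scoring step

def browser_consistency_score(s: Session) -> int:
    """Count how many of the five signals are present (0..5)."""
    return sum([
        s.ja4_is_known_browser,
        s.modern_browser_ok,
        s.has_accept_language,
        s.has_cookies,
        s.has_sec_fetch,
    ])

def classify(s: Session) -> str:
    """Reclassify only known-browser JA4s that behave consistently and
    that the anomaly scoring already rated NORMAL or LOW."""
    if (
        s.ja4_is_known_browser
        and browser_consistency_score(s) >= 4
        and s.threat_level in {"NORMAL", "LOW"}
    ):
        return "LEGITIMATE_BROWSER"
    return s.threat_level  # spoofed or risky sessions keep their scored level

genuine = Session(True, True, True, True, False, "LOW")
spoofed = Session(True, False, False, False, False, "LOW")
risky = Session(True, True, True, True, True, "HIGH")
print(classify(genuine), classify(spoofed), classify(risky))
```

The spoofed case shows why the JA4 match alone is not enough: a copied fingerprint without the accompanying browser behavior scores 1/5 and stays in normal anomaly scoring.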

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:46:22 +02:00
7d09c614c3 feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
  covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuckGo,
  OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)
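
For the CSV quoting fix, a minimal sketch using Python's standard csv module shows the general technique. The column names and sample row are made up, and whether the generator scripts use csv.writer in exactly this way is an assumption.

```python
import csv
import io

# Hypothetical header and row; the User-Agent contains a comma ("KHTML, like Gecko").
rows = [
    ["ja4", "browser_family", "user_agent"],
    [
        "example_ja4",
        "Chromium",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    ],
]

buffer = io.StringIO()
# csv.writer defaults to QUOTE_MINIMAL: only fields containing the delimiter,
# the quote character, or a line break are wrapped in quotes.
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# Reading the text back yields the original fields, comma included.
parsed = list(csv.reader(io.StringIO(text)))
assert parsed == rows
```

Joining fields with ",".join(...) instead would split the User-Agent into two columns on read, which is the kind of breakage this fix addresses.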

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:27:37 +02:00
b6184e6529 feat: CSV generation scripts, API filter params, enriched CSV stubs
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:05:43 +02:00