Commit Graph

30 Commits

Author SHA1 Message Date
7eb3ad21fd feat(dashboard): show individual H2 SETTINGS in the mismatch table
- /api/browser-signatures: top_mismatches now includes the 7 individual
  SETTINGS columns (h2_header_table_size, h2_enable_push,
  h2_max_concurrent_streams, h2_initial_window_size, h2_max_frame_size,
  h2_max_header_list_size, h2_enable_connect_protocol)
- stats: add sessions_with_priority (countIf h2_priority_present > 0)
- browsers.html: compact SETTINGS column in the suspects table
  (format '3:100, 4:65536, 2:0': Akamai IDs with non-null values)
- The pseudo-priority counter uses the real sessions_with_priority value
  instead of displaying '—'
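
Illustrative sketch, not taken from the repository: one way the compact SETTINGS string could be built from the seven columns. The ID mapping follows the standard HTTP/2 SETTINGS identifiers (1-6, plus 8 for ENABLE_CONNECT_PROTOCOL). The function name compact_settings, the use of -1 as the "absent" sentinel (borrowed from the schema commit's DEFAULT -1), and the output order (input order) are assumptions.

```python
# Standard HTTP/2 SETTINGS identifiers, keyed by the column names
# listed in the commit message.
SETTINGS_IDS = {
    "h2_header_table_size": 1,
    "h2_enable_push": 2,
    "h2_max_concurrent_streams": 3,
    "h2_initial_window_size": 4,
    "h2_max_frame_size": 5,
    "h2_max_header_list_size": 6,
    "h2_enable_connect_protocol": 8,
}

ABSENT = -1  # assumed sentinel meaning "the client did not send this parameter"


def compact_settings(row: dict) -> str:
    """Render the present SETTINGS values as 'id:value, id:value'.

    Hypothetical helper: keeps the input order and skips absent values,
    so a real zero (e.g. ENABLE_PUSH = 0) is still shown.
    """
    parts = []
    for column, value in row.items():
        setting_id = SETTINGS_IDS.get(column)
        if setting_id is None or value is None or value == ABSENT:
            continue
        parts.append(f"{setting_id}:{value}")
    return ", ".join(parts)


row = {
    "h2_header_table_size": -1,        # absent, skipped
    "h2_max_concurrent_streams": 100,
    "h2_initial_window_size": 65536,
    "h2_enable_push": 0,               # present and zero, kept
}
print(compact_settings(row))  # 3:100, 4:65536, 2:0
```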

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 03:11:17 +02:00
85d3b95b7b feat: HTTP/2 passive fingerprinting with individual SETTINGS fields
Complete implementation of HTTP/2 passive fingerprinting per thesis §2.5.3:

mod-reqin-log (C module):
- Replace connection-level filter with ap_hook_process_connection (APR_HOOK_FIRST)
  to capture H2 preface before mod_http2 takes over the connection
- AP_MODE_SPECULATIVE read of 512 bytes from c->input_filters
- Parse SETTINGS, WINDOW_UPDATE, PRIORITY flags, pseudo-header order
- Output individual SETTINGS params as separate JSON fields (IDs 1-6, 8)
- Read H2 notes from c1 (master connection) for mod_http2 secondary conns
- Fix header_order_signature JSON length bug (26→strlen)

ClickHouse schema:
- Add 8 new columns to http_logs: h2_has_priority, h2_header_table_size,
  h2_enable_push, h2_max_concurrent_streams, h2_initial_window_size,
  h2_max_frame_size, h2_max_header_list_size, h2_enable_connect_protocol
- Use Int32/Int64 with DEFAULT -1 to distinguish absent vs zero
- Update mv_http_logs to extract individual fields via JSONHas/JSONExtractInt
- Migration 04_http2_fields.sql updated for existing deployments

Correlator:
- Accept both timestamp_ns and timestamp field names (backward compat)

Integration:
- Enable HTTP/2 in Apache: Protocols h2 http/1.1 in httpd-integration.conf

Validated end-to-end via Playwright: H2 curl traffic → mod-reqin-log →
correlator → ClickHouse with all 12 H2 columns populated correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 02:33:45 +02:00
261205028d fix(dashboard): campaigns scatter chart — show campaigns not IPs
- API /api/campaigns/scatter: aggregate by campaign_id instead of per-IP
  Returns avg_score, avg_velocity, unique_ips, ja4_list, asn_list, country_list
- Template: one bubble per campaign, sized by IP count
- Tooltip: campaign-level info (IPs, score, velocity, ASNs, countries, JA4s)
- Click navigates to campaign detail (not IP detail)
- Updated doc panel text

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:09:02 +02:00
fb73c60e7d feat(dashboard): fingerprint discovery page — extract and group JA4/H2/headers from traffic
- GET /api/fingerprint-discovery: queries http_logs, groups by JA4, aggregates
  UA family, header presence rates (Sec-CH-UA, Sec-Fetch, Accept-Language,
  zstd, brotli, gzip, XFF), H2 data, TLS info, dict lookups
- /fingerprints page: KPIs, doughnut chart by family, stacked header bars,
  filterable/sortable profile table, expandable detail panel
- Promote button: push H2 fingerprints to browser_h2_signatures via existing
  POST /api/browser-signatures/entries endpoint
- Nav link: Découverte added after Navigateurs in sidebar

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 15:02:53 +02:00
fde6864311 feat(dashboard): browser signatures management UI
- Add dict_browser_h2 to /reflists (read-only via dict_browser_h2)
- New API endpoints:
    GET  /api/browser-signatures/entries: list browser_h2_signatures
         (falls back to the CSV dict if migration 06 is not applied)
    POST /api/browser-signatures/entries: add a fingerprint + reload dict
    DELETE /api/browser-signatures/entries: delete + reload dict
- /browsers page: 2 new sections
    'Base de signatures H2': table of the 10 fingerprints, add form,
    automatic read-only mode if migration 06 is not applied
    'Règles de scoring browser_matcher.py': static table of the 7 dimensions
    (weights, values per family, bypass thresholds)
- Integration: browser_h2.csv copied into user_files at ClickHouse startup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:46:07 +02:00
da1b579d4f fix(dashboard): rename duplicate /api/browsers route to /api/browser-signatures
The /api/browsers route already existed (JA4 distribution by family).
The new browser_matcher route conflicted with it, and FastAPI used
the first definition. Renamed to /api/browser-signatures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:17:38 +02:00
9c308747bd feat(dashboard): Browser Signature Detection page (/browsers)
New page dedicated to passive analysis of browser signatures (§4):

API: GET /api/browsers
  Queries view_ai_features_1h for:
  - Global counters (total, sessions_with_h2, matched, mismatch %)
  - h2_dict_family distribution (Chrome/Firefox/Safari/Edge)
  - Breakdown of WINDOW_UPDATE signals (chrome/firefox/safari/absent/other)
  - TLS↔H2 mismatch by JA4 family (total + count + %)
  - Top 20 suspicious sessions (tls_h2_family_mismatch=1, sorted by hits)

/browsers page:
  - 6 header KPIs (sessions, with H2, known family, match rate, mismatch, % mismatch)
  - Doc banner explaining browser_matcher §4 and DUAL_MODE
  - Donut: H2 families (dict_browser_h2 lookup)
  - Horizontal bar: WINDOW_UPDATE signals by family
  - Grouped bar + line: TLS↔H2 mismatch by JA4 family (count + %)
  - Table: top 20 potential impostors with clickable IP, pseudo-order, coherence
  - Mini-KPIs: pseudo-header orders Chrome/Safari, Firefox, unknown, PRIORITY frames
  - 'Navigateurs' nav link in the Surveillance group of base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 14:02:39 +02:00
79dbb23d6f feat(dashboard): time range selector on /campaigns
Before: all campaign views were fixed at 7 days.
After: 1d / 7d (default) / 14d / 30d / 90d selector at the top right.

- Add the ?days= parameter (1–90, default 7) to:
    GET /api/campaigns
    GET /api/campaigns/graph
    GET /api/campaigns/scatter
    GET /api/campaigns/{cid}
- The selector reloads the 3 views (cards, scatter, graph) together with
  the detail panel, all using the same time window
- The campaign counter shows the active range: (4 campaigns, 30d)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:24:08 +02:00
7a04e47041 fix(sql+api): fix view column mismatches and ClickHouse 24.8 JOIN issue
- view_form_bruteforce_detected: add post_count, distinct_paths, first_seen, last_seen
- view_host_ip_ja4_rotation: add host, distinct_ja4, ja4_list, window_start
- Replace uniqExact/groupUniqArray with count()/groupArray (no nested-agg error)
- api.py campaigns/graph: move a.src_ip < b.src_ip from JOIN ON to WHERE
  (ClickHouse 24.8 forbids cross-table inequality in JOIN ON condition)
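
Illustrative sketch, not taken from api.py: what the a.src_ip < b.src_ip condition achieves in a self-join, shown in Python. The equality part of the join stays in the grouping step and the inequality is applied afterwards as a filter, which mirrors moving it from JOIN ON to WHERE. That the join key is the JA4 fingerprint is an assumption based on the graph description ("IPs linked by shared JA4"); the function name shared_ja4_edges is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_ja4_edges(rows):
    """Return one edge per unordered pair of IPs that share a JA4.

    rows: iterable of (src_ip, ja4) tuples.
    """
    # Equality part of the join: bucket IPs by JA4.
    ips_by_ja4 = defaultdict(set)
    for src_ip, ja4 in rows:
        ips_by_ja4[ja4].add(src_ip)

    edges = []
    for ja4, ips in ips_by_ja4.items():
        # Inequality part (ip_a < ip_b): combinations over a sorted list
        # yields each pair once and never pairs an IP with itself.
        for ip_a, ip_b in combinations(sorted(ips), 2):
            edges.append((ip_a, ip_b, ja4))
    return sorted(edges)


rows = [
    ("10.0.0.1", "ja4_a"),
    ("10.0.0.2", "ja4_a"),
    ("10.0.0.1", "ja4_a"),  # duplicate observation
    ("10.0.0.3", "ja4_b"),  # no partner, so no edge
]
print(shared_ja4_edges(rows))
```

Without the inequality, the same join would also return (b, a) for every (a, b) plus a self-pair for every IP.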

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:05:04 +02:00
a108814a56 feat: bot detection roadmap §2-9: HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics
Step 2: HTTP/2 fingerprinting in the ML pipeline
- Add the dict_browser_h2 dictionary (11 browser families) in 05_aggregation_tables.sql
- Add the h2_agg CTE and 4 HTTP/2 features in 07_ai_features_view.sql:
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- Compute fingerprint_coherence_score (5 weighted axes) in the view
- Add a 6th axis, axis_h2_coherence, in browser.py (weights rebalanced)
- browser_h2.csv: 11 Akamai fingerprints → browser family

Step 3: coherence pre-filter on the human baseline
- pipeline.py excludes sessions with fingerprint_coherence_score < threshold from the training baseline
- FINGERPRINT_COHERENCE_THRESHOLD configurable via env (default 0.25)
- Excluded sessions are logged for SOC analysis
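
Illustrative sketch, not taken from pipeline.py: the Step 3 pre-filter as a plain function. Only the environment variable name, the default of 0.25, and the "score < threshold is excluded" rule come from the commit message; the helper names and the list-of-dicts session shape are assumptions.

```python
import os


def coherence_threshold(env=os.environ) -> float:
    """Read the threshold from the environment, defaulting to 0.25."""
    return float(env.get("FINGERPRINT_COHERENCE_THRESHOLD", "0.25"))


def split_baseline(sessions, threshold):
    """Split sessions into (kept, excluded) for the training baseline.

    A session is excluded when its fingerprint_coherence_score is
    strictly below the threshold; excluded ones are returned so the
    caller can log them.
    """
    kept, excluded = [], []
    for session in sessions:
        if session["fingerprint_coherence_score"] < threshold:
            excluded.append(session)
        else:
            kept.append(session)
    return kept, excluded


sessions = [
    {"src_ip": "192.0.2.1", "fingerprint_coherence_score": 0.10},
    {"src_ip": "192.0.2.2", "fingerprint_coherence_score": 0.25},
    {"src_ip": "192.0.2.3", "fingerprint_coherence_score": 0.90},
]
kept, excluded = split_baseline(sessions, coherence_threshold({}))
print(len(kept), len(excluded))  # 2 1
```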

Step 4: improved drift detection
- scoring.py: move from 5 to 9 quantiles (p5…p95)
- Add KL divergence alongside the KS test
- Adversarial drift detection (≥80% of features drift in the same direction)
- Strict temporal split for validation
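
Illustrative sketch, not taken from scoring.py: one possible reading of the "≥80% of features drift in the same direction" rule. It assumes each feature's drift is summarized as a signed shift (current value minus baseline) and that the share is computed over all features; the function name and both choices are assumptions.

```python
def is_adversarial_drift(shifts: dict, min_share: float = 0.8) -> bool:
    """Flag drift as adversarial when most features move the same way.

    shifts: feature name -> signed shift (current minus baseline).
    Returns True when the larger of the upward and downward groups
    covers at least min_share of all features.
    """
    if not shifts:
        return False
    up = sum(1 for delta in shifts.values() if delta > 0)
    down = sum(1 for delta in shifts.values() if delta < 0)
    return max(up, down) / len(shifts) >= min_share


coordinated = {"f1": 0.3, "f2": 0.1, "f3": 0.4, "f4": 0.2, "f5": -0.1}
mixed = {"f1": 0.3, "f2": -0.1, "f3": 0.4, "f4": -0.2, "f5": 0.1}
print(is_adversarial_drift(coordinated))  # True  (4 of 5 move up)
print(is_adversarial_drift(mixed))        # False (3 of 5 move up)
```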

Step 5: JA4×ASN bipartite graph (§5.2)
- fleet.py: fleet detection via NetworkX + Louvain (optional imports)
- enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame
- cycle.py: called after preprocess_df, logging the number of sessions in a fleet
- SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (7-day TTL)
- Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template

Step 6: cross-domain Jaccard (§5.8)
- 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity
- Signal: the same paths (/admin, /wp-login) on several hosts = scanner
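
Illustrative sketch, not taken from 12_thesis_features.sql: the Jaccard index on path sets, which is the measure Step 6 names. How the SQL aggregates across more than two hosts is not stated in the commit; the mean over all host pairs used here is an assumption, as is the Python function name.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_domain_path_similarity(paths_by_host: dict) -> float:
    """Mean pairwise Jaccard of the path sets requested on each host.

    paths_by_host: host -> set of paths one client requested there.
    Averaging over host pairs is an assumption of this sketch.
    """
    pairs = list(combinations(paths_by_host.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


scanner = {
    "shop.example": {"/admin", "/wp-login"},
    "blog.example": {"/admin", "/wp-login"},
}
visitor = {
    "shop.example": {"/cart", "/checkout"},
    "blog.example": {"/posts/1"},
}
print(cross_domain_path_similarity(scanner))  # 1.0
print(cross_domain_path_similarity(visitor))  # 0.0
```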

Step 7: ExIFFI + per-feature AE errors
- scoring.py: compute_exiffi_importance() by permutation, compute_ae_feature_errors()
- pipeline.py: compute ExIFFI on X_test, map index → dict for anomalies
- build_reason() enriched with exiffi_top when SHAP is inactive

Step 8: meta-learner for ensemble weighting
- scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels)
- Labels collected from the current cycle (known_bots, legitimate, Anubis)
- pipeline.py: fixed weights replaced by MetaLearner.predict()

Step 9: performance metrics and monitoring
- metrics.py: record_cycle_metrics() records anomaly rate, drift, correlation, latency
- SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (90-day TTL)
- Dashboard: /health + /api/health + health.html template
- cycle.py: record_cycle_metrics called at the end of the cycle (Complet + Applicatif)

Tests: 36/36 bot-detector tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:11:35 +02:00
8180f4af04 refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
  view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
  UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
  (UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00
98abbc80c7 feat(dashboard): 'Listes de référence' (reference lists) page for viewing CSVs/dictionaries
New /reflists page to view the 9 ClickHouse dictionaries:
- bot_ip (3.5K entries): IP/CIDR of known bots
- bot_ja4 (31): bot JA4 fingerprints
- browser_ja4 (1.2K): browser JA4 fingerprints → family, TLS lib
- asn_reputation (82.5K): ASN → reputation (isp, datacenter, cdn…)
- iplocate_asn (714K): IP geolocation → ASN, country, name
- anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules

Features:
- 9 tabs to navigate between the lists
- Text search with ClickHouse-side filtering
- Pagination (200 entries/page)
- Sort by column (ASC/DESC)
- Distribution chart (ECharts) by category
- Dictionary KPIs at the top of the page
- Documentation tooltips

API: /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats
Helpers: esc() (HTML escape) added to base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:56:54 +02:00
039086a0b3 feat: new detection techniques and SOC tactics page
SQL:
- Add 5 aggregation columns (count_xff, count_unusual_ct,
  count_non_std_port, count_login_post, sec_ch_mobile_mismatch)
- Expose 5 computed features in view_ai_features_1h
- ALTER TABLE migration for existing deployments

Bot-detector:
- 7 new ML features (has_xff, unusual_content_type_ratio,
  non_standard_port_ratio, login_post_concentration,
  sec_ch_mobile_mismatch, true_window_size, window_mss_ratio)
- Propagate campaign_id to ml_all_scores (was always -1)
- Campaign escalation: HIGH→CRITICAL if the cluster has ≥5 members
Dashboard:
- Page Tactiques SOC : brute-force, rotation JA4, récurrence,
  alertes temps réel — 4 KPIs + 4 panneaux + infobulles doc
- Ajout fmtDate() helper global
- Navigation sidebar mise à jour

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:29:18 +02:00
702c0d5edb feat(dashboard): add JA4 fingerprint and cluster investigation pages
- /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores
  table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation
- /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts,
  member table with bulk classify — full campaign investigation
- /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints
- fmtJA4 links now navigate to /ja4/ investigation page
- campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page
- Fix: double-brace {{param}} in non-f-string queries → single {param}
  (was causing HTTP 500 on all parameterized ClickHouse queries)
- 50 routes total, all tests pass, 0 JS console errors
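
Illustrative sketch, not taken from api.py: why doubled braces break a query that is not an f-string. Python only unescapes {{ and }} inside f-strings (and str.format); in a plain string literal they are passed through unchanged. The typed placeholder form {ja4:String} and the database name are assumptions made for the example.

```python
db = "processing_db"  # hypothetical database name

# f-string: "{{" and "}}" are escapes and render as single braces,
# while {db} is interpolated.
f_query = f"SELECT count() FROM {db}.http_logs WHERE ja4 = {{ja4:String}}"

# Plain string: nothing is unescaped, so the doubled braces reach the
# server as-is and the placeholder is no longer valid.
plain_broken = "SELECT count() FROM http_logs WHERE ja4 = {{ja4:String}}"

# The fix described in the commit: single braces in non-f-strings.
plain_fixed = "SELECT count() FROM http_logs WHERE ja4 = {ja4:String}"

print(f_query)
print(plain_broken)
print(plain_fixed)
```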

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:05:52 +02:00
70188b508c fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation
Playwright testing revealed 3 critical bugs:

1. Tailwind CDN @apply with custom brand-* colors produces empty CSS
   rules, breaking ALL design components (kpi-card, data-table, badges,
   filter-btn, section-card, nav-item). Fix: replace all @apply
   directives with equivalent raw CSS values.

2. Traffic API and IP detail API reference non-existent 'status' column
   in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove
   status from SELECT, sort whitelist, filters, and templates.

3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside
   clickable <tr onclick> capture clicks, preventing row navigation to
   /ip/ detail. Fix: add event.stopPropagation() to all formatter links.

Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden
by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B),
classification form saves to DB, campaign detail panel opens on click.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:54:38 +02:00
63ba6d203c feat(dashboard): complete SOC dashboard with full monitoring and workflows
- models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate
  chart, threat breakdown per model, enhanced model cards with validation gate
- classify.html: SOC workflow — suggested unclassified IPs, quick-classify
  buttons, classification stats pie, pre-fill from URL params
- traffic.html: Clickable rows → ip_detail, column sorting, status column,
  search filter, doc tooltips on all chart sections
- scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER
  filter button, doc tooltips on distribution + scatter charts
- ip_detail.html: Resource cascade section (headless browser detection),
  status column in HTTP logs table
- detections.html: Doc tooltips on threat/reason/ASN chart sections
- features.html: Doc tooltips on radar/importance/scatter sections
- api.py: 4 new endpoints — /api/models/timeline, /api/models/threats,
  /api/classify/stats, /api/classify/suggested. Traffic API: status + search.

46 routes total. All tests pass (dashboard + bot-detector 36/36).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:25:01 +02:00
396baa90d2 feat(dashboard): HDBSCAN cluster visualization
- Dedicated /campaigns page with 4 chart views:
  · Scatter plot (score vs velocity, bubbles colored by campaign)
  · Force-directed network graph (IPs linked by shared JA4)
  · Campaign card grid (KPIs, ASN, country, JA4)
  · Detail panel (behavioral radar, hourly timeline, member table)
- 4 new API endpoints:
  · GET /api/campaigns (fix: campaign_id >= 0 instead of != '')
  · GET /api/campaigns/graph (nodes + edges)
  · GET /api/campaigns/scatter (score/velocity per IP)
  · GET /api/campaigns/{cid} (detail + profile + timeline)
- Sidebar: Campagnes link added under Surveillance
- Overview: campaigns are clickable → link to /campaigns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:11:16 +02:00
2d04288e95 feat(dashboard): SOC workflow overhaul — sidebar nav, doc tooltips, full-width layout
- base.html: collapsible sidebar navigation, doc tooltip system, JS helpers
  (fmtNum, fmtPct, fmtDuration, ecGrid, buildTable, docHTML)
- overview.html: SOC command center with stacked timeline, live alerts,
  campaigns panel, browser donut, 6 KPIs
- detections.html: threat color dots, raw score column, click-to-navigate rows
- network.html: JA4 rotation, brute-force, persistent threats tables, 6 KPIs
- ip_detail.html: ASN/country KPIs, AE/XGB/campaign columns, enriched features
- scores/traffic/features/models/classify: page_title blocks + doc tooltips
- api.py: 9 new endpoints (campaigns, brute-force, ja4-rotation, recurrence,
  cascade, alerts, timeline-detail, ua-rotation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:29:34 +02:00
db306fb9da fix: P0 audit bugs — bot-detector + dashboard + SQL
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
  uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name

Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
  formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days

SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:33:00 +02:00
5c5bca71d1 feat: rewrite ASN classification with PeeringDB + expanded heuristics
Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use

Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs,
education 3.1K, enterprise 5.7K, government 520.

Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:02:07 +02:00
9a48fb9d29 feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency
Add browser legitimacy classification (A9) to the bot detection pipeline:

- New features: is_known_browser (binary) and browser_consistency_score [0..5]
  combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language,
  cookies, Sec-Fetch-* presence
- Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW
  threat level are reclassified as LEGITIMATE_BROWSER
- Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays
  in normal anomaly scoring — prevents evasion via JA4 spoofing
- XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label)
- ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores
- Dashboard: browser_family filter/sort on detections and scores endpoints,
  legitimate_browsers count and browser_stats in overview
- 6 new unit tests covering classification threshold, spoofing, exclusion logic
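
Illustrative sketch, not taken from bot_detector.py: the reclassification rule described above. The five signals are modelled as booleans summed into a 0..5 score; how modern_browser_score is turned into a yes/no signal is not stated in the commit and is an assumption, as are the function names.

```python
def browser_consistency_score(ja4_browser_match, modern_browser,
                              has_accept_language, has_cookies,
                              has_sec_fetch) -> int:
    """Count how many of the five browser signals are present (0..5)."""
    signals = (ja4_browser_match, modern_browser, has_accept_language,
               has_cookies, has_sec_fetch)
    return sum(1 for signal in signals if signal)


def final_label(threat_level, is_known_browser, consistency,
                min_consistency=4):
    """Reclassify only low-threat sessions from consistent known browsers.

    A known JA4 with low consistency keeps its original threat level,
    so a spoofed fingerprint alone does not earn the benign label.
    """
    if (is_known_browser
            and consistency >= min_consistency
            and threat_level in ("NORMAL", "LOW")):
        return "LEGITIMATE_BROWSER"
    return threat_level


score = browser_consistency_score(True, True, True, True, False)
print(score, final_label("LOW", True, score))  # 4 LEGITIMATE_BROWSER
print(final_label("LOW", True, 2))             # LOW (possible JA4 spoofing)
print(final_label("HIGH", True, 5))            # HIGH (threat level too high)
```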

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:46:22 +02:00
7d09c614c3 feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
  covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck,
  OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:27:37 +02:00
b6184e6529 feat: CSV generation scripts, API filter params, enriched CSV stubs
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:05:43 +02:00
b735bab5a5 feat(dashboard): rebuild SOC dashboard + fix ClickHouse SQL
Complete rewrite of the SOC dashboard using FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS.
Replaces the old React/Vite frontend with server-rendered templates.

Dashboard pages:
- Overview: KPIs, timeline chart, threat distribution, top IPs
- Detections: paginated/filterable anomaly table
- Scores: ml_all_scores with AE error & XGB prob columns
- Traffic: HTTP logs with method/host filters
- IP Investigation: full deep-dive (scores, features, HTTP logs, classify)
- Classification: SOC feedback form + history
- Features: AI + thesis feature stats
- Models: scoring stats + model metadata

API: 9 JSON endpoints with parameterized queries, sort whitelists

SQL fixes:
- 05_aggregation_tables: add deduplicate_merge_projection_mode
- 11_views: fix nested aggregate (argMax inside sum)
- 12_thesis_features: remove invalid 'let' bindings, fix groupArrayIf type

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 03:21:05 +02:00
ecceb04174 perf(clickhouse): P3 - view_ip_recurrence with TTL filter + remove FINAL
view_ip_recurrence:
  Add WHERE detected_at >= now() - INTERVAL 30 DAY
  → With PARTITION BY (P1), ClickHouse prunes the partitions outside this
    range before reading any data. The view scans only the active
    partitions (instead of the 30 full daily partitions).
  → ORDER BY (src_ip) ensures that GROUP BY src_ip reads contiguous
    data (no in-memory reordering).

rotation.py: remove FINAL on ml_detected_anomalies:
  FINAL forces a full in-memory deduplication of the ReplacingMergeTree
  (equivalent to a DISTINCT over the whole table), one of the most
  expensive operations in ClickHouse.
  Fix: replace the FINAL sub-SELECT with view_ip_recurrence (already aggregated
  by src_ip, returns recurrence directly without FINAL).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:33:29 +02:00
2bfb4b7282 perf(dashboard): P2 - replace replaceRegexpAll in WHERE clauses with IPv4MappedToIPv6
Problem: 8 WHERE clauses applied a function to the src_ip column:
  WHERE replaceRegexpAll(toString(src_ip), '^::ffff:', '') = %(ip)s
→ ClickHouse cannot use the sort index or the skipping indexes
  when a function is applied to the filtered column.

Fix: transform the INPUT (the parameter) rather than the column:
  WHERE src_ip = IPv4MappedToIPv6(toIPv4(%(ip)s))
→ src_ip stays untouched → ClickHouse uses the indexes (P1) and the
  proj_by_ip projection (P1) for these queries.
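
Illustrative sketch, not taken from the dashboard code: what the IPv4-to-IPv6 mapping computes, using Python's standard ipaddress module. In the commit the conversion happens inside the SQL expression; this client-side helper (to_ipv4_mapped is a hypothetical name) only shows the idea of converting the parameter once so that the stored column can be compared as-is.

```python
import ipaddress


def to_ipv4_mapped(ip: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an address.

    IPv6 input is returned unchanged.
    """
    addr = ipaddress.ip_address(ip)
    if isinstance(addr, ipaddress.IPv6Address):
        return addr
    # ::ffff:0:0/96 prefix followed by the 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(addr))


param = to_ipv4_mapped("192.0.2.10")
print(param.ipv4_mapped)  # 192.0.2.10
```

The stripped-prefix comparison in the old WHERE clause had to be evaluated for every row; converting the single input value leaves the column bare, which is what lets the engine use its indexes.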

Files changed:
  investigation_summary.py — 6 WHERE (ml_detected_anomalies, agg_host_ip_ja4_1h,
                              view_form_bruteforce_detected, view_host_ip_ja4_rotation,
                              view_ip_recurrence)
  ml_features.py           — 1 WHERE (view_ai_features_1h)
  rotation.py              — 1 WHERE (agg_host_ip_ja4_1h)

Note: the 27 other occurrences of replaceRegexpAll in SELECT clauses are
display transformations (IPv6→IPv4 for the UI) and do not block index use.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:31:57 +02:00
14323f7b05 perf(clickhouse): P10 - create the 4 missing business views + fix DB prefixes
Production bug: view_form_bruteforce_detected, view_host_ip_ja4_rotation,
view_dashboard_entities, and view_dashboard_user_agents were referenced in
13 dashboard endpoints but did not exist anywhere in the schema.
All of these endpoints returned HTTP 500 in production.

shared/clickhouse/11_views.sql (new):

  view_form_bruteforce_detected
    Source: agg_host_ip_ja4_1h (24h)
    Logic:  GROUP BY (src_ip, host) HAVING count_post >= 10
    Usage:  bruteforce.py (3 endpoints), investigation_summary.py

  view_host_ip_ja4_rotation
    Source: agg_host_ip_ja4_1h (24h)
    Logic:  uniqExact(ja4) per src_ip, HAVING >= 2 (fingerprint rotation)
    Usage:  rotation.py (3 endpoints), investigation_summary.py

  view_dashboard_entities
    Source:  http_logs (7 days), UNION ALL 5 branches (ip/ja4/country/asn/host)
    Columns: entity_type, entity_value, src_ip, ja4, host, log_date,
             client_headers Array(String), asns Array, countries Array,
             user_agents Array
    Usage:   entities.py (5 endpoints), clustering.py

  view_dashboard_user_agents
    Source:  http_logs (7 days), GROUP BY (src_ip, ja4, hour)
    Columns: src_ip, ja4, hour, log_date, user_agents Array(String), requests
    Usage:   variability.py (4 endpoints), fingerprints.py (5 endpoints),
             attributes.py (2 endpoints)

deploy_schema.sh: add 10_perf_indexes.sql and 11_views.sql to the list

routes/variability.py + fingerprints.py:
  Fix 9 queries that used view_dashboard_user_agents without a database
  prefix → replaced with {settings.CLICKHOUSE_DB_PROCESSING}.view_*

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:30:09 +02:00
3dfeba860b docs: add standardized comments to all services (Python, Go, Bash)
- Add docs/commenting-standard.md defining per-language comment standards
  (Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)

- services/dashboard: 100% docstring coverage (100/100 functions)
  - All FastAPI route handlers, helpers, classes, and models documented
  - Language: French (project convention)

- services/bot-detector: 100% docstring coverage (53/53 symbols)
  - bot_detector.py: 14 functions + module docstring
  - anubis/fetch_rules.py: 9 functions

- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
  and ClickHouseSettings class

- services/correlator: 24 godoc comments added across 6 Go files
  - correlation_service.go: 10 private helpers
  - unixsocket/source.go: 6 parsing/socket helpers
  - correlated_log.go: 4 field extraction helpers
  - orchestrator.go, logger.go, main.go: 4 comments

- services/correlator/scripts/audit-architecture.sh: standardized header block

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
b6391afbeb refactor: replace hardcoded mabase_prod DB prefix with configurable settings
Replace all hardcoded 'mabase_prod.' table prefixes in dashboard route
SQL queries with configurable database names from settings:

- http_logs, http_logs_raw → settings.CLICKHOUSE_DB_LOGS
- All other tables → settings.CLICKHOUSE_DB_PROCESSING

Also qualify previously unqualified table references (bare FROM/JOIN
table_name) with the appropriate database prefix for consistency.

Each route file now imports 'from ..config import settings' and uses
f-strings with {settings.CLICKHOUSE_DB_PROCESSING} or
{settings.CLICKHOUSE_DB_LOGS} for database-qualified table names.
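
Illustrative sketch, not taken from the route files: the pattern described above, with a stand-in Settings class. The two attribute names come from the commit message; the default values, the dataclass, and the example query are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Hypothetical defaults; the real values come from the project's config.
    CLICKHOUSE_DB_LOGS: str = "logs_db"
    CLICKHOUSE_DB_PROCESSING: str = "processing_db"


settings = Settings()


def anomalies_for_ip_query() -> str:
    """Build a database-qualified query for one IP.

    Only the database name is interpolated by the f-string; the IP
    stays a bound parameter (%(ip)s) filled in by the client library.
    """
    return (
        f"SELECT src_ip, detected_at "
        f"FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies "
        f"WHERE src_ip = %(ip)s"
    )


print(anomalies_for_ip_query())
```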

Files updated: analysis, attributes, audit, botnets, bruteforce,
clustering, detections, entities, fingerprints, header_fingerprint,
heatmap, incidents, investigation_summary, metrics, ml_features,
rotation, search, tcp_spoofing, variability (19 files).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:03:05 +02:00
d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00