Commit Graph

59 Commits

Author SHA1 Message Date
d05969867f docs: rewrite architecture/README, update deployment/development
- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865 docs: update thesis document (§2-§8)
- §2.1.3: Simplified Anubis to 2 dictionaries (dict_anubis_ip, dict_anubis_asn) with COALESCE priority
- §2.4.2: Added isotree library, calibration formula, ntrees=300, joblib serialization
- §2.4.2b/§2.4.4: Replaced DBSCAN with HDBSCAN throughout
- §2.4.2c: Replaced logistic regression with fixed linear weighting, added formula and weights
- §2.4.3: Clarified the 5-quantile approximation used for drift detection
- §3.1: Updated ASCII diagram (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Updated trifurcation + added multifactorial browser detection (5 axes)
- §4: Expanded taxonomy from 51 to 65+ features across 8 families
- §5: Added implementation status (/) to each technique
- §6: Added §6.6 deployment results (3M+ logs, 34K sessions/cycle)
- §7: Updated conclusion (65+ features, 5/8 techniques, modular refactoring)
- §8: Added references for isotree, PyTorch, HDBSCAN, XGBoost, SHAP

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
9ea36ad22e feat(scripts): complete stack init + prod data import with date shift
Schema cleanup:
- Remove anubis_ua_rules table stub from 03_anubis_tables.sql
- Remove anubis_ua_rules from bot-detector deploy_schema.sql
- Remove UA seed step from clickhouse-init.sh (no more REGEXP_TREE dependency)
- Drop dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules

New scripts:
- scripts/init-stack.sh: comprehensive ClickHouse init (13 SQL files + migrations
  + validation + cleanup of obsolete tables). Supports --reset, --import-prod.
- scripts/import-prod-data.sh: imports pre-exported prod data (Native format)
  with dynamic date shift (max(time) → now). Supports --shift, --no-truncate.
- scripts/data/prod-export/: directory for cached Native format exports

Makefile targets: init-stack, import-prod-data, init-and-import

Tested: init-stack.sh passes all 13 SQL + 7 critical tables + 7 dicts
        import-prod-data.sh: 3M rows in ~37s with auto date shift
        Dashboard: 55 routes OK, bot-detector: 36/36 tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:40:05 +02:00
d8ca804a55 feat(scripts): add reload-prod-logs.sh for prod→dev data sync
Exports http_logs from prod ClickHouse via HTTP API, imports into dev
with dynamic date shifting (max(time) → now() by default).

Features:
- Batch export in Native format (200K rows/batch, ~10s each)
- Auto date shift: prod max(time) aligned to current time
- --shift N: manual override (seconds)
- --days N: filter to last N days only
- --cron: silent mode for scheduled runs
- Staging table approach: export → staging → INSERT SELECT with shift → cleanup

Tested: 3,054,122 rows imported in ~3 minutes, dates 2026-04-03→2026-04-09.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:41:38 +02:00
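The date shift described in d8ca804a55 (prod max(time) aligned to the current time, with a manual override in seconds) can be sketched as below. This is a minimal illustration in Python, not the shell script itself; the function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def compute_shift(timestamps, now):
    """Offset that moves the newest timestamp onto `now`."""
    return now - max(timestamps)


def shift_timestamps(timestamps, now, shift_seconds=None):
    """Shift every timestamp by the same offset.

    An explicit shift_seconds plays the role of a manual override;
    otherwise the offset is derived from the newest timestamp.
    """
    if shift_seconds is not None:
        shift = timedelta(seconds=shift_seconds)
    else:
        shift = compute_shift(timestamps, now)
    return [ts + shift for ts in timestamps]


# Hypothetical sample data: two prod log times and a fixed "now".
prod_times = [
    datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 9, 8, 30, tzinfo=timezone.utc),
]
now = datetime(2026, 4, 9, 15, 0, tzinfo=timezone.utc)
shifted = shift_timestamps(prod_times, now)
```

Because a single offset is applied to every row, the spacing between log entries is preserved; only the position of the whole window changes.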
8180f4af04 refactor(anubis): simplify to IP/CIDR + ASN only, remove UA and Country rules
- Remove UA regex extraction (extract_ua_regex, _extract_ua_from_all/any)
- Remove Country rule collection from parse_bot_policies_inline
- Simplify fetch_rules.py: collect_all_rules returns (ip_rules, asn_rules)
- Remove insert_ua_rules and insert_country_rules functions
- reload_dicts now only reloads dict_anubis_ip + dict_anubis_asn
- Simplify CASE blocks in 04_mv_http_logs.sql, 07_ai_features_view.sql,
  view_ai_features_anubis.sql, mv_http_logs.sql: IP > ASN (was 5-level
  UA+IP > UA > IP > ASN > Country cascade)
- Remove dict_anubis_country + dict_anubis_ua from 03_anubis_tables.sql
  (UA table kept as stub for REGEXP_TREE catch-all compatibility)
- Remove anubis_country_rules table from schema
- Remove Anubis UA and Country tabs from dashboard reflists page
- Remove anubis_ua_rules/country_rules from API reflist queries
- deploy_schema.sql simplified from 339 to 122 lines
- 764 lines removed across 9 files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 15:25:33 +02:00
98abbc80c7 feat(dashboard): reference lists page for CSV/dictionary visualization
New /reflists page to visualize the 9 ClickHouse dictionaries:
- bot_ip (3.5K entries): IP/CIDR of known bots
- bot_ja4 (31): bot JA4 fingerprints
- browser_ja4 (1.2K): browser JA4 fingerprints → family, TLS library
- asn_reputation (82.5K): ASN → reputation (isp, datacenter, cdn…)
- iplocate_asn (714K): IP geolocation → ASN, country, name
- anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules

Features:
- 9 tabs for navigating between lists
- Text search with ClickHouse-side filtering
- Pagination (200 entries/page)
- Column sorting (ASC/DESC)
- Distribution chart (ECharts) by category
- Dictionary KPIs at the top of the page
- Documentation tooltips

API: /api/dictionaries, /api/reflist/{name}, /api/reflist/{name}/stats
Helpers: esc() (HTML escape) added to base.html

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:56:54 +02:00
039086a0b3 feat: new detection techniques and SOC tactics page
SQL:
- Add 5 aggregation columns (count_xff, count_unusual_ct,
  count_non_std_port, count_login_post, sec_ch_mobile_mismatch)
- Expose 5 computed features in view_ai_features_1h
- ALTER TABLE migration for existing deployments

Bot-detector:
- 7 new ML features (has_xff, unusual_content_type_ratio,
  non_standard_port_ratio, login_post_concentration,
  sec_ch_mobile_mismatch, true_window_size, window_mss_ratio)
- Propagate campaign_id to ml_all_scores (was always -1)
- Campaign escalation: HIGH→CRITICAL if the cluster has ≥5 members

Dashboard:
- SOC Tactics page: brute-force, JA4 rotation, recurrence,
  real-time alerts; 4 KPIs + 4 panels + doc tooltips
- Add global fmtDate() helper
- Update sidebar navigation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:29:18 +02:00
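The campaign escalation rule from 039086a0b3 (HIGH→CRITICAL when the cluster has at least 5 members) can be sketched roughly as follows. The dict keys and function name are hypothetical; the only details taken from the log are the threshold of 5 and the use of -1 for sessions without a campaign.

```python
from collections import Counter

NOISE_CAMPAIGN_ID = -1  # sessions that belong to no cluster
MIN_CAMPAIGN_SIZE = 5   # threshold stated in the commit message


def escalate_campaigns(sessions):
    """Return a copy of `sessions` with HIGH raised to CRITICAL in large campaigns.

    Each session is a dict with (hypothetical) keys 'campaign_id' and
    'threat_level'. Sessions outside any campaign are never escalated.
    """
    sizes = Counter(
        s["campaign_id"] for s in sessions if s["campaign_id"] != NOISE_CAMPAIGN_ID
    )
    result = []
    for s in sessions:
        level = s["threat_level"]
        if level == "HIGH" and sizes.get(s["campaign_id"], 0) >= MIN_CAMPAIGN_SIZE:
            level = "CRITICAL"
        result.append({**s, "threat_level": level})
    return result
```

Only HIGH sessions are touched here; whether other levels should also be escalated is not stated in the commit, so the sketch leaves them unchanged.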
702c0d5edb feat(dashboard): add JA4 fingerprint and cluster investigation pages
- /ja4/{fingerprint} page: 8 KPIs, timeline, threat pie, IP scores
  table, ASN/geo charts, HTTP logs, AI features — full JA4 investigation
- /cluster/{cid} page: 8 KPIs, timeline, threat/JA4/ASN/host charts,
  member table with bulk classify — full campaign investigation
- /api/ja4/{fingerprint} and /api/cluster/{cid} API endpoints
- fmtJA4 links now navigate to /ja4/ investigation page
- campaigns.html: 'Ouvrir' button links to /cluster/{cid} full page
- Fix: double-brace {{param}} in non-f-string queries → single {param}
  (was causing HTTP 500 on all parameterized ClickHouse queries)
- 50 routes total, all tests pass, 0 JS console errors

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 14:05:52 +02:00
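The double-brace fix in 702c0d5edb comes down to how Python treats braces in f-strings versus plain strings. The sketch below assumes ClickHouse-style `{name:Type}` query placeholders; the table, column, and parameter names are hypothetical.

```python
table = "http_logs"  # hypothetical table name interpolated by the f-string

# f-string: doubled braces are an escape, so the result has single braces.
f_query = f"SELECT count() FROM {table} WHERE ja4 = {{fingerprint:String}}"

# Plain string: doubled braces are NOT an escape and stay doubled,
# so the server receives a malformed placeholder.
broken_query = "SELECT count() FROM http_logs WHERE ja4 = {{fingerprint:String}}"

# The fix: use single braces whenever the query is not an f-string.
fixed_query = "SELECT count() FROM http_logs WHERE ja4 = {fingerprint:String}"

print(f_query == fixed_query)  # both forms reach the server identically
print("{{" in broken_query)    # the unescaped form keeps its double braces
```

A query that starts life as an f-string and later loses its `f` prefix during refactoring hits this failure mode, because the braces silently stop being escapes.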
70188b508c fix(dashboard): eliminate @apply CSS, fix status column, fix click propagation
Playwright testing revealed 3 critical bugs:

1. Tailwind CDN @apply with custom brand-* colors produces empty CSS
   rules, breaking ALL design components (kpi-card, data-table, badges,
   filter-btn, section-card, nav-item). Fix: replace all @apply
   directives with equivalent raw CSS values.

2. Traffic API and IP detail API reference non-existent 'status' column
   in http_logs table → HTTP 500 on /traffic and /ip/{ip}. Fix: remove
   status from SELECT, sort whitelist, filters, and templates.

3. Nested <a> links (fmtJA4, fmtASN, fmtCountry, fmtBotName) inside
   clickable <tr onclick> capture clicks, preventing row navigation to
   /ip/ detail. Fix: add event.stopPropagation() to all formatter links.

Verified with Playwright: 10 pages × 0 JS errors, all tooltips hidden
by default, sidebar toggle works, keyboard shortcuts (Alt+1-9, Alt+B),
classification form saves to DB, campaign detail panel opens on click.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:54:38 +02:00
6babc55e3e fix(dashboard): hover tooltips, full-width layout, UX polish
- Fix doc tooltips: split CSS into <style type='text/tailwindcss'> for
  @apply directives + raw CSS for reliable doc panel rendering
- Convert doc panels from click-toggle to hover-based tooltips with
  arrow pointer, fade-in animation, and auto-dismiss on mobile
- Replace '?' icons with 'ⓘ' across all 11 templates (51 tooltips)
- Full-width layout: reduce padding on mobile (px-3), scale up on
  desktop (lg:px-5, xl:px-6) for maximum screen utilization
- Auto-collapse sidebar on narrow screens (<1024px)
- Keyboard shortcuts: Alt+1–9 for page navigation, Alt+B toggle sidebar
- Add LEGITIMATE_BROWSER filter button to detections page
- Sticky header with stronger blur (backdrop-blur-md)
- All 46 routes pass tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 13:30:16 +02:00
63ba6d203c feat(dashboard): complete SOC dashboard with full monitoring and workflows
- models.html: Full rewrite — 6 KPIs, scoring volume timeline, anomaly rate
  chart, threat breakdown per model, enhanced model cards with validation gate
- classify.html: SOC workflow — suggested unclassified IPs, quick-classify
  buttons, classification stats pie, pre-fill from URL params
- traffic.html: Clickable rows → ip_detail, column sorting, status column,
  search filter, doc tooltips on all chart sections
- scores.html: Search input, clickable rows → ip_detail, LEGITIMATE_BROWSER
  filter button, doc tooltips on distribution + scatter charts
- ip_detail.html: Resource cascade section (headless browser detection),
  status column in HTTP logs table
- detections.html: Doc tooltips on threat/reason/ASN chart sections
- features.html: Doc tooltips on radar/importance/scatter sections
- api.py: 4 new endpoints — /api/models/timeline, /api/models/threats,
  /api/classify/stats, /api/classify/suggested. Traffic API: status + search.

46 routes total. All tests pass (dashboard + bot-detector 36/36).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:25:01 +02:00
396baa90d2 feat(dashboard): HDBSCAN cluster visualization
- Dedicated /campaigns page with 4 chart views:
  · Scatter plot (score vs velocity, bubbles colored by campaign)
  · Force-directed network graph (IPs linked by shared JA4)
  · Campaign card grid (KPIs, ASN, country, JA4)
  · Detail panel (behavioral radar, hourly timeline, member table)
- 4 new API endpoints:
  · GET /api/campaigns (fix: campaign_id >= 0 instead of != '')
  · GET /api/campaigns/graph (nodes + edges)
  · GET /api/campaigns/scatter (score/velocity per IP)
  · GET /api/campaigns/{cid} (detail + profile + timeline)
- Sidebar: Campaigns link added to the Surveillance section
- Overview: campaigns are clickable → link to /campaigns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:11:16 +02:00
f1547423b5 refactor(bot-detector): remove monolith, multifactorial tests
- Remove bot_detector.py (1982 lines), replaced by 11 modules
- Browser tests updated for the multifactorial system (browser_confidence)
- 36/36 tests pass with the new modular structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:03:17 +02:00
1f103392ac refactor(bot-detector): extract monolith into modular package
Split bot_detector.py (~1982 lines) into 10 focused modules:
- config.py: all configuration constants and optional imports
- log.py: logging utilities (log_info, log_decision, append_training_history)
- infra.py: ClickHouse client, health check HTTP server, shutdown
- browser.py: multifactorial browser identification (5 axes)
- scoring.py: drift detection, feature validation, SHAP, clustering
- models.py: EIF, Autoencoder, XGBoost model management
- preprocessing.py: data preprocessing and feature list definitions
- pipeline.py: core semi-supervised scoring loop
- cycle.py: main analysis cycle orchestration
- __main__.py: entry point with startup banner

Update Dockerfile to copy package directory and use python -m bot_detector.

All 36 existing tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 01:02:04 +02:00
2d04288e95 feat(dashboard): SOC workflow overhaul — sidebar nav, doc tooltips, full-width layout
- base.html: collapsible sidebar navigation, doc tooltip system, JS helpers
  (fmtNum, fmtPct, fmtDuration, ecGrid, buildTable, docHTML)
- overview.html: SOC command center with stacked timeline, live alerts,
  campaigns panel, browser donut, 6 KPIs
- detections.html: threat color dots, raw score column, click-to-navigate rows
- network.html: JA4 rotation, brute-force, persistent threats tables, 6 KPIs
- ip_detail.html: ASN/country KPIs, AE/XGB/campaign columns, enriched features
- scores/traffic/features/models/classify: page_title blocks + doc tooltips
- api.py: 9 new endpoints (campaigns, brute-force, ja4-rotation, recurrence,
  cascade, alerts, timeline-detail, ua-rotation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:29:34 +02:00
c994ad4466 fix: XGB label query + SHAP isotree compatibility
XGB: query was selecting features from ml_all_scores which doesn't
store them. Now joins ml_all_scores (labels) with view_ai_features_1h
(features). Dynamically discovers available columns to skip thesis §5
features not present in the view. Returns (model, features) tuple.

SHAP: TreeExplainer doesn't support isotree. Fall back to permutation-
based Explainer(model.decision_function, X_sample) for isotree.

Verified: XGB trained on 50000 labels (18436 positives), triple-voice
ensemble scoring active (EIF+AE+XGB), SHAP silent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 00:06:54 +02:00
c6666e2bba fix: isotree score convention — proper sklearn calibration
isotree decision_function returns [0,1] (higher=anomalous, 0.5=boundary).
The entire pipeline (normalize_scores, score_to_threat_level,
compute_adaptive_threshold) expects sklearn convention (negative=anomalous).

The previous fix (-raw_scores) only negated the values, which pushed
every score below -0.30 and classified everything as CRITICAL. The new
fix, 0.5 - isotree_score, maps correctly to sklearn's convention:
  isotree 0.80 → -0.30 (CRITICAL)
  isotree 0.65 → -0.15 (HIGH)
  isotree 0.55 → -0.05 (MEDIUM)
  isotree 0.50 →  0.00 (boundary)

Verified: 27,952 LEGITIMATE_BROWSER + 15,843 HIGH + 15,059 MEDIUM
Tests: 36/36 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:56:05 +02:00
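The calibration from c6666e2bba can be sketched as below. The cut-offs are read off the commit's mapping table; treating them as inclusive bounds and collapsing everything above -0.05 into NORMAL are assumptions, and the function names are hypothetical rather than the pipeline's own helpers.

```python
def isotree_to_sklearn(score):
    """Map an isotree score ([0, 1], 0.5 = boundary) to sklearn's sign convention."""
    return 0.5 - score


# Cut-offs from the commit's mapping table (inclusive bounds assumed).
THRESHOLDS = [(-0.30, "CRITICAL"), (-0.15, "HIGH"), (-0.05, "MEDIUM")]


def threat_level(sklearn_score, default="NORMAL"):
    """Return the first level whose cut-off the score falls at or below."""
    for cutoff, level in THRESHOLDS:
        if sklearn_score <= cutoff:
            return level
    return default


for raw in (0.9, 0.7, 0.6, 0.4):
    print(raw, threat_level(isotree_to_sklearn(raw)))
```

Subtracting from 0.5 keeps the decision boundary at zero, which is what downstream code written for the sklearn convention expects; plain negation moves the boundary to -0.5 instead.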
db306fb9da fix: P0 audit bugs — bot-detector + dashboard + SQL
Bot-detector:
- B1.1: campaign_id and raw_anomaly_score now inserted into ml_detected_anomalies
- B1.4/B1.5: log_decision argument order fixed (cycle_id, name)
- B1.7: AE broadcast error — model now returns features list, scoring
  uses model's features instead of current cycle's (prevents dim mismatch)
- B1.8: Anubis ALLOW bots now get bot_name from anubis_bot_name

Dashboard:
- C1.1: XSS in ip_detail.html — {{ ip | tojson }} instead of raw string
- C1.2: Stored XSS via innerHTML — added escapeHtml() helper, all user-facing
  formatters (fmtIP, fmtASN, fmtCountry, fmtJA4, fmtBotName, fmtLabel) sanitized
- C2.1: status filter now correctly filters http_version column
- C2.2: heatmap toDayOfWeek() - 1 for 0-indexed JS days

SQL:
- B1.3: view_ip_recurrence worst_score uses max() not min() (0=normal, 1=anomalous)
- B1.6: view_resource_cascade_1h joined into view_thesis_features_1h (§5.4)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:33:00 +02:00
b66d41a200 docs: updated conformity audit bot-detector + dashboard vs thesis
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.

Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
98289ccf04 fix: ASN dictionary pipeline + verbose bot-detector logging
- Fix dict_iplocate_asn: remove non-existent org/domain columns (4→4 cols)
- Add CSV header to iplocate-ip-to-asn.csv (CSVWithNames format)
- Replace org/domain dictGet calls with empty string literals in MV
- Full 714K CIDR stub for complete ASN resolution in tests
- Add header generation to generate_asn_data.py
- Verbose bot-detector stdout: data summary, triage breakdown, model
  training details, scoring stats, browser classification, boxed results
- Fix IPv6 filter in traffic seeder (_ips_from_cidrs)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 17:43:55 +02:00
7b7b69dee3 Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.

Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv

Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:35:40 +02:00
74e0406c38 chore: update ASN stubs with new classification labels
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:05:25 +02:00
5c5bca71d1 feat: rewrite ASN classification with PeeringDB + expanded heuristics
Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use

Results: unknown dropped from 86% to 57%, ISP coverage 21.8K ASNs,
education 3.1K, enterprise 5.7K, government 520.

Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 16:02:07 +02:00
9a48fb9d29 feat: LEGITIMATE_BROWSER classification from JA4 + behavioral consistency
Add browser legitimacy classification (A9) to the bot detection pipeline:

- New features: is_known_browser (binary) and browser_consistency_score [0..5]
  combining 5 signals: JA4 browser match, modern_browser_score, Accept-Language,
  cookies, Sec-Fetch-* presence
- Post-scoring: sessions with known browser JA4 + consistency >= 4/5 + NORMAL/LOW
  threat level are reclassified as LEGITIMATE_BROWSER
- Spoofing detection: inconsistent behavior (known JA4 but low consistency) stays
  in normal anomaly scoring — prevents evasion via JA4 spoofing
- XGBoost treats LEGITIMATE_BROWSER as non-threat (negative label)
- ClickHouse: browser_family column added to ml_detected_anomalies and ml_all_scores
- Dashboard: browser_family filter/sort on detections and scores endpoints,
  legitimate_browsers count and browser_stats in overview
- 6 new unit tests covering classification threshold, spoofing, exclusion logic

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:46:22 +02:00
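The reclassification rule from 9a48fb9d29 can be sketched as follows. The five signals and the 4/5 threshold come from the commit message; the `Session` field names and the modern_browser_score cut-off of 50 are assumptions made for illustration.

```python
from dataclasses import dataclass

MODERN_SCORE_MIN = 50  # assumed cut-off for the modern_browser_score signal
CONSISTENCY_MIN = 4    # "consistency >= 4/5" from the commit message


@dataclass
class Session:
    # Field names are hypothetical stand-ins for the pipeline's features.
    ja4_is_known_browser: bool
    modern_browser_score: int
    has_accept_language: bool
    has_cookies: bool
    has_sec_fetch: bool
    threat_level: str


def browser_consistency_score(s):
    """Count how many of the 5 browser signals are present (0..5)."""
    return sum([
        s.ja4_is_known_browser,
        s.modern_browser_score >= MODERN_SCORE_MIN,
        s.has_accept_language,
        s.has_cookies,
        s.has_sec_fetch,
    ])


def classify(s):
    """Reclassify consistent, low-threat browser sessions; otherwise keep the level."""
    if (
        s.ja4_is_known_browser
        and browser_consistency_score(s) >= CONSISTENCY_MIN
        and s.threat_level in ("NORMAL", "LOW")
    ):
        return "LEGITIMATE_BROWSER"
    return s.threat_level
```

A session that presents a known browser JA4 but few of the other signals keeps its original threat level, which is the spoofing case the commit describes.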
7d09c614c3 feat: browser JA4 detection, Anubis bot rules, worldwide ASN data
- Add generate_browser_ja4.py: 1,186 browser JA4 fingerprints from FoxIO + ja4db.com
  covering 11 families (Chromium, Firefox, Safari, Edge, Tor, Opera, Vivaldi...)
- Rewrite generate_bot_ip.py: Anubis YAML rules (Google, Bing, Apple, DuckDuck,
  OpenAI, Perplexity bots) + Tor exit nodes + cloud scanner IPs (3,555 entries)
- Rewrite generate_asn_data.py: worldwide iptoasn.com data (78,049 ASNs, 714K CIDRs)
- Add dict_browser_ja4 ClickHouse dictionary + browser_family in AI features views
- Add /api/browsers dashboard endpoint
- Fix CSV quoting for fields containing commas (User-Agent strings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:27:37 +02:00
b6184e6529 feat: CSV generation scripts, API filter params, enriched CSV stubs
- scripts/generate_bot_ip.py: download Tor exit nodes + curate scanner IPs (1353 entries)
- scripts/generate_bot_ja4.py: 31 bot JA4 fingerprints across 16 families
- scripts/generate_asn_data.py: 38 ASNs + 96 IP-to-ASN prefixes
- scripts/update-csv-data.sh: master orchestrator with --install-stubs
- api.py: add asn_org/country_code/ja4/bot_name filters on detections+scores
- pages.py: add /network route
- csv-stubs: enriched with generated data (Tor nodes, scanner IPs, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 15:05:43 +02:00
c6ca352db9 feat(dashboard): add clickable drill-down to all data elements
Add navigation helpers (fmtASN, fmtCountry, fmtJA4, fmtBotName,
fmtThreatLink, fmtLabel) to base.html for SOC analyst drill-down.

Update all templates:
- overview.html: clickable table cells + ECharts click handlers for
  ASN, country, JA4, bot, and threat charts
- detections.html: URL param pre-filters, active filter bar with
  clear buttons, clickable ASN/country/JA4/threat in table
- scores.html: URL param pre-filters, clickable threat/JA4/country
- traffic.html: clickable JA4 and country columns
- ip_detail.html: clickable threat/JA4 in detections, clickable
  asn_org/country_code/asn_label in AI features grid
- network.html: click handlers on ASN treemap and country sunburst,
  fmtJA4Full/fmtLabel/fmtBotName/fmtASN in tables
- features.html: scatter plot click navigates to /ip/{ip}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 14:58:48 +02:00
fc882dd3e7 feat(tests): realistic traffic seeder + IP diversity via mod_remoteip
Option A — X-Forwarded-For + mod_remoteip:
- httpd-integration.conf: load mod_remoteip, trust all Docker RFC-1918
  subnets (172/192.168/10). mod_reqin_log uses r->useragent_ip which
  mod_remoteip updates from XFF → each request logged with distinct src_ip
- generate_traffic.py: XFF always set (was 30% only); human scenarios
  use 91.121/78.41/90.x ranges, bot scenarios use 185.220/45.155/193.32;
  pool of 1168 human IPs and 180 bot IPs; default --requests 500

Option D — Direct ClickHouse seeder (seed_clickhouse.py, stdlib only):
- Inserts ~4000 rows into http_logs_raw triggering full MV chain:
    http_logs_raw → mv_http_logs → http_logs
                 → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
  • 720 human sessions: IPs in OVH/SFR/Orange ASN ranges (16276/15557/3215)
    → dict_asn_reputation maps these to asn_label='human'
    → satisfies bot_detector human_baseline >= 500 threshold
  • 150 scanner sessions: datacenter IPs, attack paths (/.env, wp-login,
    SQLi, path traversal), scanner UAs, minimal TCP fingerprints
  • 100 known-bot sessions: IPs matching bot_ip.csv entries
  • 20 brute-force clusters: 20-50 POST /login per IP
  All TCP/TLS metadata is profile-realistic (window, MSS, TTL, JA4, JA3)

CSV stubs (mounted at /var/lib/clickhouse/user_files/):
- iplocate-ip-to-asn.csv: 13 CIDR→ASN mappings (OVH/SFR/Orange/Tor/Contabo)
- asn_reputation.csv: 13 ASN→label (8 'human', 3 'datacenter'/'hosting')
- bot_ip.csv: 14 known scanner/Tor IPs (Shodan, Censys, Tor exits)
- bot_ja4.csv: 5 bot JA4 fingerprints (curl, python-requests, masscan, zgrab)

run-tests.sh:
- Phase 4a: seeder runs before live traffic (ensures bot_detector baseline)
- Phase 4b: live traffic gen at 500 requests (up from 200)
- Phase 5f: new assertions — agg_host_ip_ja4_1h populated, ≥500 human
  rows in view_ai_features_1h, known-bot labels present
- Phase 7: verifies ml_all_scores populated (bot_detector ran a cycle)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 11:35:34 +02:00
f448dcb4b0 fix(rpm): standardize systemd scriptlets and unit installation paths
- Add BuildRequires: systemd-rpm-macros to sentinel and correlator specs
- Replace manual systemctl calls with %systemd_post, %systemd_preun,
  %systemd_postun_with_restart macros (handles daemon-reload, stop/disable,
  try-restart on upgrade correctly and is a no-op in containers)
- ja4sentinel.spec: use %{_unitdir} macro instead of hardcoded path
  (/usr/lib/systemd/system); remove cross-service /var/run/logcorrelator
  from %files and %post (owned by logcorrelator package, not sentinel)
- logcorrelator.spec: move unit from /etc/systemd/system (admin namespace)
  to %{_unitdir} (/usr/lib/systemd/system) — correct packaging location;
  move user/group creation from %post to %pre so file ownership is valid
  during RPM install phase; add Requires(pre): shadow-utils; fix bare
  directory entries in %files with %dir macro; add version fallback macro
  so spec is buildable without --define version
- test-rpm.sh: auto-build RPM via Dockerfile.package if dist/rpm/ is
  empty; update service file path check to /usr/lib/systemd/system/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 10:49:21 +02:00
f7ee5e63f8 fix(docker): add g++ for isotree build, add dashboard Dockerfile.tests
- bot-detector Dockerfile + Dockerfile.tests: install g++ for isotree C++ extension
- dashboard Dockerfile.tests: new smoke test (verify FastAPI app loads)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 08:08:13 +02:00
77c0450a22 docs: update copilot-instructions.md for dashboard rewrite and ML upgrades
- Dashboard: FastAPI+React → FastAPI+Jinja2+htmx+Chart.js (2 route modules)
- Bot-detector: IsolationForest → triple-voice EIF+Autoencoder+XGBoost ensemble
- SQL schema: 10 → 13 files (added thesis features, perf indexes, views)
- Added ClickHouse 24.8 gotchas (projections, nested aggregates, let bindings)
- Added IPv4/IPv6 duality pattern, bot-detector test patterns
- Updated data retention table with 4 new thesis aggregation tables
- Fixed single-test commands to reference existing files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 07:31:10 +02:00
b735bab5a5 feat(dashboard): rebuild SOC dashboard + fix ClickHouse SQL
Complete rewrite of the SOC dashboard using FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS.
Replaces the old React/Vite frontend with server-rendered templates.

Dashboard pages:
- Overview: KPIs, timeline chart, threat distribution, top IPs
- Detections: paginated/filterable anomaly table
- Scores: ml_all_scores with AE error & XGB prob columns
- Traffic: HTTP logs with method/host filters
- IP Investigation: full deep-dive (scores, features, HTTP logs, classify)
- Classification: SOC feedback form + history
- Features: AI + thesis feature stats
- Models: scoring stats + model metadata

API: 9 JSON endpoints with parameterized queries, sort whitelists

SQL fixes:
- 05_aggregation_tables: add deduplicate_merge_projection_mode
- 11_views: fix nested aggregate (argMax inside sum)
- 12_thesis_features: remove invalid 'let' bindings, fix groupArrayIf type

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 03:21:05 +02:00
228ad7026a fix(integration): mount missing SQL files 10-12 in ClickHouse init
3 SQL files were missing from the docker-compose.yml volume mounts:
- 10_perf_indexes.sql (performance indexes)
- 11_views.sql (dashboard views)
- 12_thesis_features.sql (thesis §5 MVs and views)

Also make 10_perf_indexes.sql non-fatal in init script since ALTER TABLE
ADD INDEX may fail if index already exists.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:55:43 +02:00
8d58f2b932 feat(bot-detector): add XGBoost supervised third voice (#10)
Triple-voice ensemble architecture:
- EIF (unsupervised, zero-day anomalies)
- Autoencoder (unsupervised, non-linear correlations)
- XGBoost (supervised, known patterns + SOC feedback)

XGBoost implementation:
- Trained on historical ml_all_scores labels (NORMAL=0, HIGH/CRITICAL/DENY/KNOWN=1)
- Weekly retraining (XGB_RETRAIN_INTERVAL_H=168), min 100 labels required
- Score = predict_proba, combined via meta-learner: (1-β)*(EIF+AE) + β*xgb_prob
- Configurable: XGB_WEIGHT (β=0.20), XGB_MIN_LABELS, XGB_RETRAIN_INTERVAL_HOURS
- Graceful fallback: if xgboost unavailable or labels insufficient, EIF+AE only
- ClickHouse: xgb_prob column added to ml_all_scores
- Tests: 4 new tests (availability, train/predict, meta-learner, save/load)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:45:57 +02:00
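The weighting in 8d58f2b932, (1-β)*(EIF+AE) + β*xgb_prob, can be sketched as below. How EIF and AE are merged is an assumption here: the sketch uses a convex combination with α = AE_WEIGHT = 0.30, following the Autoencoder commit, and assumes all three scores are already normalized to [0, 1] with higher meaning more anomalous.

```python
AE_WEIGHT = 0.30   # α, default from the Autoencoder commit
XGB_WEIGHT = 0.20  # β, default from the XGBoost commit


def ensemble_score(eif, ae, xgb_prob=None, alpha=AE_WEIGHT, beta=XGB_WEIGHT):
    """Blend three anomaly scores, all assumed to lie in [0, 1].

    When xgb_prob is None (model unavailable or too few labels), the
    result falls back to the EIF+AE blend alone.
    """
    unsupervised = (1 - alpha) * eif + alpha * ae  # assumed EIF+AE blend
    if xgb_prob is None:
        return unsupervised
    return (1 - beta) * unsupervised + beta * xgb_prob


print(ensemble_score(1.0, 0.0))       # EIF+AE only
print(ensemble_score(1.0, 0.0, 1.0))  # all three voices
```

Both steps are convex combinations, so the final score stays in [0, 1] whenever the inputs do, and the fallback path needs no rescaling.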
57cf6c3828 feat(bot-detector): add parallel Autoencoder scorer (#9)
- TrafficAutoEncoder class: symmetric AE (n→64→32→16→32→64→n) with BatchNorm+ReLU
- Trained alongside EIF on human_baseline, saved/loaded with model versioning
- Score = per-sample MSE reconstruction error, combined with EIF via AE_WEIGHT (α=0.30)
- AE latent space (16-dim) used for HDBSCAN clustering instead of raw features
- Configurable: AE_WEIGHT, AE_EPOCHS, AE_LATENT_DIM, AE_LEARNING_RATE
- Graceful fallback: if torch unavailable or AE fails, EIF-only scoring continues
- ClickHouse: ae_recon_error column added to ml_all_scores
- Tests: 5 new tests (AE train/score, encode latent, state dict save/load, weight combination)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:40:39 +02:00
f6e2d3c0ca feat(bot-detector): implement 8 state-of-the-art improvements
- EIF: Extended Isolation Forest via isotree (fallback to sklearn IF)
- Benford's Law deviation feature on inter-request timing
- Lag-1 autocorrelation feature for cadence analysis
- Validation gate: reject model if val_anomaly_rate > 20%
- Feature pruning: remove variance < 1e-6 features before training
- Quantile drift: replace N(μ,σ) synthetic with quantile interpolation
- Thread safety: Lock for _service_healthy/_consecutive_failures
- Score normalization: inverted to [0,1] where 1=most anomalous

SQL: add lag1_autocorrelation + benford_deviation to view_thesis_features_1h
Tests: 10 new test functions covering all improvements
Integration: verify_mvs.py checks new thesis feature columns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:31:26 +02:00
0d1a6a81e0 docs: update thesis with EIF, autoencoders, ensemble architecture, quantile drift
- §2.4.2: Add Extended Isolation Forest theory (Hariri et al., TKDE 2021)
- §2.4.2b: New section on autoencoders for network anomaly detection
  (Kitsune, β-VAE, hybrid AE+IF studies)
- §2.4.2c: New section on hybrid supervised+unsupervised ensembles
  (triple-voice architecture: EIF + AE + XGBoost + meta-learner)
- §2.4.3: Enhanced drift detection with quantile digest and validation gate
- §6.2: Multi-level baseline contamination mitigation
- §7: Updated conclusion reflecting ensemble architecture
- §8: 10 new references (Hariri 2021, Mirsky 2018, Baptiste 2026, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:23:00 +02:00
3ae8c7d9c9 feat(bot-detector): upgrade to state-of-the-art detection pipeline
- Fix UnboundLocalError on global _consecutive_failures/_service_healthy
- Add SQL identifier validation for DB names at startup
- Replace Z-score drift detection with KS test (scipy.stats.ks_2samp)
- Replace DBSCAN with HDBSCAN (adaptive clustering, no epsilon needed)
- Fix NaN→0 blanket imputation with per-feature median/sentinel strategy
- Add 80/20 temporal train/validation split with offline metrics logging
- Integrate thesis §5 features from view_thesis_features_1h:
  path_transition_entropy, cadence_cv, burst/pause ratios,
  host_diversity, host_sweep_speed, host_coverage_uniformity,
  ja4_drift_ratio (Complet model only)
- Add SOC feedback loop: read classifications from audit_logs,
  reclassify FP IPs as human, exclude TP IPs from baseline
- Update dependencies: clickhouse-connect 0.8.12, scikit-learn 1.6.1,
  pandas 2.2.3, shap 0.47.2, add scipy>=1.14, hdbscan>=0.8.38
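
To make the drift-detection change concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic with the standard library only. The commit itself uses scipy.stats.ks_2samp, which also returns a p-value; this version only shows the statistic, and the function name `ks_statistic` and the 0.3 threshold in the usage example are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum of |F_ref - F_cur| is reached at one of the observed points.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


if __name__ == "__main__":
    baseline = [1, 2, 3, 4]
    latest = [3, 4, 5, 6]
    drift = ks_statistic(baseline, latest)
    # Hypothetical threshold, for illustration only.
    print("drift detected" if drift > 0.3 else "no drift", drift)
```

Unlike a Z-score on the mean, this comparison reacts to any change in the shape of a feature's distribution, which is presumably why the commit switched to it.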

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:09:18 +02:00
6d02f21c1e feat: implement thesis §5 advanced detection techniques as ClickHouse MVs
New aggregation tables + materialized views:
- agg_path_sequences_1h + MV (§5.1 Path Sequence Entropy)
- agg_request_timing_1h + MV (§5.3 Request Cadence Fingerprint)
- agg_ip_behavior_1h + MV (§5.5 JA4 Drift + §5.8 Cross-Domain)
- agg_resource_cascade_1h + MV (§5.4 Resource Dependency Tree)

New analytical views:
- view_thesis_features_1h: unified view exposing all computable features
  (path_transition_entropy, cadence_cv, burst_ratio, pause_ratio,
   ja4_drift_ratio, host_diversity, host_sweep_speed,
   host_coverage_uniformity)
- view_resource_cascade_1h: root_to_first_asset_delay, asset_load_stddev

Documented future techniques (not feasible as MV):
- §5.2 Bipartite Fleet Graph (needs Python networkx)
- §5.6 DNS Shadow Analysis (needs sentinel UDP/53 extension)
- §5.7 Compression Ratio Invariant (needs mod_reqin_log extension)

Updated: deploy_schema.sh, verify_mvs.py (sections 8-10)
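
The path_transition_entropy feature can be illustrated as the Shannon entropy of consecutive path pairs within a session. This is a sketch of the general idea under that assumption; the actual computation lives in the ClickHouse materialized view and may be defined differently. The function name is hypothetical.

```python
import math
from collections import Counter


def path_transition_entropy(paths):
    """Shannon entropy (bits) of the (previous path, next path) transition distribution."""
    transitions = Counter(zip(paths, paths[1:]))
    total = sum(transitions.values())
    if total == 0:
        return 0.0  # fewer than two requests: no transition to measure
    entropy = 0.0
    for count in transitions.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    crawler = ["/a", "/b", "/a", "/b", "/a"]      # two transitions, equally likely
    hammer = ["/login"] * 5                        # one repeated transition
    print(path_transition_entropy(crawler))
    print(path_transition_entropy(hammer))
```

Low entropy means the client repeats the same few transitions, which is typical of scripted traffic; varied browsing produces higher values.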

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 01:42:52 +02:00
0ccd417a02 docs: audit detection conformance against the state-of-the-art thesis
Exhaustive feature-by-feature analysis of the implemented detection
techniques versus what the thesis describes. Score: 97% baseline, 6% advanced
techniques, 72% weighted overall.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 00:12:51 +02:00
11b46b2eab docs: update copilot-instructions.md for v14 changes
- Fix coverage gate: 60% → 80% for correlator
- Document dual-model pattern (Complet/Applicatif) in bot-detector
- Add SQL deployment paths: deploy_views.sql + service migrations
- Add data retention TTL table with partition info
- Fix integration test description (8 phases, --build-only flag)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:55:28 +02:00
51b8eb57a8 feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
  ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)

New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis

All DB references use ja4_processing/ja4_logs (no mabase_prod).
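
The ORDER BY fix above is easier to see with a toy model of how ReplacingMergeTree deduplicates on merge: rows sharing the same sorting key collapse to the last one inserted. The simulation below is plain Python, not ClickHouse; `merge_replacing` and the sample rows are hypothetical.

```python
def merge_replacing(rows, key_columns):
    """Keep only the last row seen for each sorting-key value, like a merge would."""
    merged = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        merged[key] = row  # a later row replaces an earlier one with the same key
    return list(merged.values())


DETECTIONS = [
    {"src_ip": "203.0.113.7", "ja4": "t13d_a", "host": "shop.example", "model_name": "Complet"},
    {"src_ip": "203.0.113.7", "ja4": "t13d_b", "host": "shop.example", "model_name": "Complet"},
    {"src_ip": "203.0.113.7", "ja4": "t13d_b", "host": "api.example", "model_name": "Applicatif"},
]

if __name__ == "__main__":
    # v13 key: every detection for the IP collapses into a single row.
    print(len(merge_replacing(DETECTIONS, ["src_ip"])))
    # v14 key: each (ip, ja4, host, model) detection survives.
    print(len(merge_replacing(DETECTIONS, ["src_ip", "ja4", "host", "model_name"])))
```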

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:51:56 +02:00
ecceb04174 perf(clickhouse): P3 - view_ip_recurrence with TTL filter + remove FINAL
view_ip_recurrence:
  Added WHERE detected_at >= now() - INTERVAL 30 DAY
  → With PARTITION BY (P1), ClickHouse prunes the partitions outside this
    range before reading any data. The view scans only the active
    partitions (instead of the full 30 daily partitions).
  → ORDER BY (src_ip) ensures that GROUP BY src_ip reads contiguous
    data (no in-memory reordering).

rotation.py (remove FINAL on ml_detected_anomalies):
  FINAL forces a full in-memory deduplication of the ReplacingMergeTree
  (equivalent to a DISTINCT over the whole table), one of the most
  expensive operations in ClickHouse.
  Fix: replace the FINAL sub-SELECT with view_ip_recurrence (already aggregated
  by src_ip; returns recurrence directly without FINAL).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:33:29 +02:00
2bfb4b7282 perf(dashboard): P2 - replace replaceRegexpAll in WHERE clauses with IPv4MappedToIPv6
Problem: 8 WHERE clauses applied a function to the src_ip column:
  WHERE replaceRegexpAll(toString(src_ip), '^::ffff:', '') = %(ip)s
→ ClickHouse cannot use the sorting-key index or the skipping indexes
  when a function is applied to the filtered column.

Fix: transform the INPUT (the parameter) rather than the column:
  WHERE src_ip = IPv4MappedToIPv6(toIPv4(%(ip)s))
→ src_ip is left untouched → ClickHouse uses the indexes (P1) and the
  proj_by_ip projection (P1) for these queries.

Files changed:
  investigation_summary.py - 6 WHERE (ml_detected_anomalies, agg_host_ip_ja4_1h,
                              view_form_bruteforce_detected, view_host_ip_ja4_rotation,
                              view_ip_recurrence)
  ml_features.py           - 1 WHERE (view_ai_features_1h)
  rotation.py              - 1 WHERE (agg_host_ip_ja4_1h)

Note: the 27 other occurrences of replaceRegexpAll in SELECT clauses are
display transformations (IPv6→IPv4 for the UI) and do not block index use.
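
For readers unfamiliar with the `::ffff:` form, the sketch below reproduces the same input-side normalization in Python with the standard ipaddress module. The commit performs the conversion in SQL via IPv4MappedToIPv6(toIPv4(...)); this helper is only an illustration, and the name `to_ipv6_mapped` is hypothetical.

```python
import ipaddress


def to_ipv6_mapped(ip):
    """Return the IPv6 form of an address, mapping IPv4 into ::ffff:0:0/96."""
    address = ipaddress.ip_address(ip)
    if isinstance(address, ipaddress.IPv4Address):
        return ipaddress.IPv6Address(f"::ffff:{address}")
    return address


if __name__ == "__main__":
    # Both spellings normalize to the same stored value, so the query can
    # compare the column directly instead of rewriting it with a regex.
    plain = to_ipv6_mapped("192.0.2.1")
    mapped = to_ipv6_mapped("::ffff:192.0.2.1")
    print(plain == mapped, plain.ipv4_mapped)
```

Normalizing the parameter once costs a single conversion per query, whereas wrapping the column in a function forces the conversion on every scanned row and disables index use.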

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:31:57 +02:00
14323f7b05 perf(clickhouse): P10 - create the 4 missing business views + fix DB prefixes
Production bug: view_form_bruteforce_detected, view_host_ip_ja4_rotation,
view_dashboard_entities, and view_dashboard_user_agents were referenced in
13 dashboard endpoints but did not exist anywhere in the schema.
All of these endpoints returned HTTP 500 in production.

shared/clickhouse/11_views.sql (new):

  view_form_bruteforce_detected
    Source : agg_host_ip_ja4_1h (24h)
    Logic  : GROUP BY (src_ip, host) HAVING count_post >= 10
    Usage  : bruteforce.py (3 endpoints), investigation_summary.py

  view_host_ip_ja4_rotation
    Source : agg_host_ip_ja4_1h (24h)
    Logic  : uniqExact(ja4) per src_ip, HAVING >= 2 (fingerprint rotation)
    Usage  : rotation.py (3 endpoints), investigation_summary.py

  view_dashboard_entities
    Source  : http_logs (7 days), UNION ALL of 5 branches (ip/ja4/country/asn/host)
    Columns : entity_type, entity_value, src_ip, ja4, host, log_date,
              client_headers Array(String), asns Array, countries Array,
              user_agents Array
    Usage   : entities.py (5 endpoints), clustering.py

  view_dashboard_user_agents
    Source  : http_logs (7 days), GROUP BY (src_ip, ja4, hour)
    Columns : src_ip, ja4, hour, log_date, user_agents Array(String), requests
    Usage   : variability.py (4 endpoints), fingerprints.py (5 endpoints),
              attributes.py (2 endpoints)

deploy_schema.sh: added 10_perf_indexes.sql and 11_views.sql to the list

routes/variability.py + fingerprints.py:
  Fixed 9 queries that used view_dashboard_user_agents without a database
  prefix → replaced with {settings.CLICKHOUSE_DB_PROCESSING}.view_*
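
The grouping logic of view_form_bruteforce_detected (GROUP BY (src_ip, host) HAVING count_post >= 10) can be mirrored in a few lines of Python. This is only an illustration of the rule: it assumes the HAVING clause applies to the summed POST count per pair, and the row layout and the name `bruteforce_candidates` are hypothetical.

```python
from collections import defaultdict


def bruteforce_candidates(rows, min_posts=10):
    """Sum POST counts per (src_ip, host) and keep pairs at or above the threshold."""
    totals = defaultdict(int)
    for src_ip, host, count_post in rows:
        totals[(src_ip, host)] += count_post
    return {pair: total for pair, total in totals.items() if total >= min_posts}


if __name__ == "__main__":
    # Hypothetical hourly aggregates: (src_ip, host, count_post)
    hourly = [
        ("198.51.100.4", "login.example", 6),
        ("198.51.100.4", "login.example", 7),
        ("198.51.100.9", "login.example", 2),
    ]
    print(bruteforce_candidates(hourly))
```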

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:30:09 +02:00
f4ffe3410a perf(clickhouse): P1 - partition + skipping indexes on ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h
Problem: every dashboard query with WHERE detected_at >= now() - INTERVAL N
did a full scan because ml_detected_anomalies had ORDER BY (src_ip) with no
partition and no time index.

Changes:
- 06_ml_tables.sql:
  * ml_detected_anomalies: PARTITION BY toYYYYMMDD(detected_at)
    → daily partition pruning on all time-based queries
  * INDEX idx_detected_at (minmax) → skips granules outside the range
  * INDEX idx_threat_level set(8) → skip for countIf(threat_level = ...)
  * INDEX idx_bot_name bloom_filter → skip for bot_name != ''
  * ttl_only_drop_parts = 1 → TTL enforced by dropping whole partitions
  * ml_all_scores: same treatment (PARTITION BY + 2 indexes)

- 04_mv_http_logs.sql:
  * http_logs: INDEX idx_src_ip bloom_filter(0.01)
    → queries with WHERE src_ip = X (analysis.py, variability.py) skip
    ~90% of granules without scanning the whole time range
  * INDEX idx_ja4 bloom_filter(0.01) → same for JA4 filters

- 05_aggregation_tables.sql:
  * agg_host_ip_ja4_1h: PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...)
    → investigation_summary.py and rotation.py (WHERE src_ip = X) use the
    projection automatically instead of scanning every window_start

- 10_perf_indexes.sql (new):
  * ALTER TABLE migration for existing instances
  * ADD INDEX + MATERIALIZE INDEX for the 4 tables
  * ADD PROJECTION + MATERIALIZE PROJECTION for agg_host_ip_ja4_1h
  * Note: PARTITION BY on an existing table requires recreating it (documented)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:28:04 +02:00
69940bf18b docs: update copilot-instructions with integration tests, gotchas, comment standard
- Add make test-integration* commands
- Add SKIP_TESTS=true fast build flag
- Add 'Comments standard' section referencing docs/commenting-standard.md
- Add 'Known gotchas' section with 6 non-obvious issues:
  * go.work build context must include both sentinel + correlator
  * YAML does not expand env vars in Go (hardcode DSN)
  * REGEXP_TREE dict requires >=1 rule or inserts fail
  * pcap only captures non-loopback traffic
  * ClickHouse init needs 120s timeout
  * RPM builds must use Rocky Linux (libpcap.so.1 vs .so.0.8)
  * FLAT() layout requires numeric keys (use COMPLEX_KEY_HASHED for strings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:42:54 +02:00
3b8c06b86d docs: add Doxygen comments to mod_reqin_log.c
- File header: French multi-line description block
- 7 section banners in French (/* ====== Section ====== */ format):
  Configuration du serveur, Buffer dynamique, Sérialisation JSON,
  Gestionnaires de directives, Socket Unix, Journalisation, Hooks Apache
- 26 @brief/@param/@return blocks on every function:
  server config, dynbuf_*, JSON helpers, cmd_set_* handlers,
  socket helpers (try_connect/ensure_connected/write_to_socket),
  log_request, Apache hooks (post_read_request, child_init, etc.)
- No logic changes (1033 → 1268 lines, comments only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:35:19 +02:00
3dfeba860b docs: add standardized comments to all services (Python, Go, Bash)
- Add docs/commenting-standard.md defining per-language comment standards
  (Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)

- services/dashboard: 100% docstring coverage (100/100 functions)
  - All FastAPI route handlers, helpers, classes, and models documented
  - Language: French (project convention)

- services/bot-detector: 100% docstring coverage (53/53 symbols)
  - bot_detector.py: 14 functions + module docstring
  - anubis/fetch_rules.py: 9 functions

- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
  and ClickHouseSettings class

- services/correlator: 24 godoc comments added across 6 Go files
  - correlation_service.go: 10 private helpers
  - unixsocket/source.go: 6 parsing/socket helpers
  - correlated_log.go: 4 field extraction helpers
  - orchestrator.go, logger.go, main.go: 4 comments

- services/correlator/scripts/audit-architecture.sh: standardized header block

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
12d60975da feat: Python traffic generator with realistic varied HTTP/HTTPS traffic
- Replace curlimages/curl with Python stdlib traffic generator
- 200 requests, 10 workers, 16 scenario types:
  browsers (Chrome/Firefox/Safari/Edge/mobile), bots (Googlebot/Bing/curl/wget),
  GET/POST/HEAD/PUT/PATCH/DELETE/OPTIONS, HTTP + HTTPS
- Multiple SSL contexts (default, TLS1.2-only, TLS1.3-only, few_ciphers)
  → 4 distinct JA4/JA3 fingerprints per test run
- Realistic headers: Accept, Accept-Language, Sec-Fetch-*, Referer,
  X-Forwarded-For, Cookie, Cache-Control
- JSON payloads, form data, CORS preflights
- DB always reset (down -v) at start of each test run
- Enhanced Phase 5 checks: distinct UAs, method variety, JA4/JA3 counts + uniqueness

Results: 199/200 OK, 24 distinct UAs, 7 HTTP methods, TLS 1.2+1.3, 4 JA4 fingerprints

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:14:55 +02:00