Commit Graph

22 Commits

Author SHA1 Message Date
3ae8c7d9c9 feat(bot-detector): upgrade to state-of-the-art detection pipeline
- Fix UnboundLocalError on global _consecutive_failures/_service_healthy
- Add SQL identifier validation for DB names at startup
- Replace Z-score drift detection with KS test (scipy.stats.ks_2samp)
- Replace DBSCAN with HDBSCAN (adaptive clustering, no epsilon needed)
- Fix NaN→0 blanket imputation with per-feature median/sentinel strategy
- Add 80/20 temporal train/validation split with offline metrics logging
- Integrate thesis §5 features from view_thesis_features_1h:
  path_transition_entropy, cadence_cv, burst/pause ratios,
  host_diversity, host_sweep_speed, host_coverage_uniformity,
  ja4_drift_ratio (Complet model only)
- Add SOC feedback loop: read classifications from audit_logs,
  reclassify FP IPs as human, exclude TP IPs from baseline
- Update dependencies: clickhouse-connect 0.8.12, scikit-learn 1.6.1,
  pandas 2.2.3, shap 0.47.2, add scipy>=1.14, hdbscan>=0.8.38

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:09:18 +02:00
6d02f21c1e feat: implement thesis §5 advanced detection techniques as ClickHouse MVs
New aggregation tables + materialized views:
- agg_path_sequences_1h + MV (§5.1 Path Sequence Entropy)
- agg_request_timing_1h + MV (§5.3 Request Cadence Fingerprint)
- agg_ip_behavior_1h + MV (§5.5 JA4 Drift + §5.8 Cross-Domain)
- agg_resource_cascade_1h + MV (§5.4 Resource Dependency Tree)

New analytical views:
- view_thesis_features_1h: unified view exposing all computable features
  (path_transition_entropy, cadence_cv, burst_ratio, pause_ratio,
   ja4_drift_ratio, host_diversity, host_sweep_speed,
   host_coverage_uniformity)
- view_resource_cascade_1h: root_to_first_asset_delay, asset_load_stddev

Documented future techniques (not feasible as MV):
- §5.2 Bipartite Fleet Graph (needs Python networkx)
- §5.6 DNS Shadow Analysis (needs sentinel UDP/53 extension)
- §5.7 Compression Ratio Invariant (needs mod_reqin_log extension)

Updated: deploy_schema.sh, verify_mvs.py (sections 8-10)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 01:42:52 +02:00
0ccd417a02 docs: audit conformité détection vs thèse état de l'art
Analyse exhaustive feature-par-feature des techniques de détection
implémentées vs ce que décrit la thèse. Score: 97% base, 6% techniques
avancées, 72% global pondéré.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 00:12:51 +02:00
11b46b2eab docs: update copilot-instructions.md for v14 changes
- Fix coverage gate: 60% → 80% for correlator
- Document dual-model pattern (Complet/Applicatif) in bot-detector
- Add SQL deployment paths: deploy_views.sql + service migrations
- Add data retention TTL table with partition info
- Fix integration test description (8 phases, --build-only flag)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:55:28 +02:00
51b8eb57a8 feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
  ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)

New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis

All DB references use ja4_processing/ja4_logs (no mabase_prod).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:51:56 +02:00
ecceb04174 perf(clickhouse): P3 — view_ip_recurrence avec filtre TTL + supprimer FINAL
view_ip_recurrence :
  Ajout de WHERE detected_at >= now() - INTERVAL 30 DAY
  → Avec PARTITION BY (P1), ClickHouse élagage les partitions hors de cette
    plage avant même de lire les données. La vue ne scanne que les partitions
    actives (au lieu des 30 partitions journalières complètes).
  → ORDER BY (src_ip) garantit que le GROUP BY src_ip lit des données
    contiguës (aucune réorganisation mémoire).

rotation.py — supprimer FINAL sur ml_detected_anomalies :
  FINAL force une déduplication complète du ReplacingMergeTree en mémoire
  (équivalent à un DISTINCT sur toute la table) — une des opérations les plus
  coûteuses dans ClickHouse.
  Fix : remplacer le sous-SELECT FINAL par view_ip_recurrence (déjà aggrégée
  par src_ip, retourne recurrence directement sans FINAL).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:33:29 +02:00
2bfb4b7282 perf(dashboard): P2 — remplacer replaceRegexpAll dans les WHERE par IPv4MappedToIPv6
Problème : 8 clauses WHERE appliquaient une fonction sur la colonne src_ip :
  WHERE replaceRegexpAll(toString(src_ip), '^::ffff:', '') = %(ip)s
→ ClickHouse ne peut pas utiliser l'index de tri ou les skipping indexes
  quand une fonction est appliquée à la colonne filtrée.

Fix : transformer l'INPUT (le paramètre) plutôt que la colonne :
  WHERE src_ip = IPv4MappedToIPv6(toIPv4(%(ip)s))
→ src_ip reste intact → ClickHouse utilise les indexes (P1) et la
  projection proj_by_ip (P1) pour ces requêtes.

Fichiers modifiés :
  investigation_summary.py — 6 WHERE (ml_detected_anomalies, agg_host_ip_ja4_1h,
                              view_form_bruteforce_detected, view_host_ip_ja4_rotation,
                              view_ip_recurrence)
  ml_features.py           — 1 WHERE (view_ai_features_1h)
  rotation.py              — 1 WHERE (agg_host_ip_ja4_1h)

Note : les 27 autres occurrences de replaceRegexpAll dans les SELECT sont des
transformations d'affichage (IPv6→IPv4 pour l'UI) et ne bloquent pas les indexes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:31:57 +02:00
14323f7b05 perf(clickhouse): P10 — créer les 4 vues métier manquantes + corriger préfixes DB
Bug de production : view_form_bruteforce_detected, view_host_ip_ja4_rotation,
view_dashboard_entities, view_dashboard_user_agents étaient référencées dans
13 endpoints du dashboard mais n'existaient nulle part dans le schéma.
Tous ces endpoints retournaient HTTP 500 en production.

shared/clickhouse/11_views.sql (nouveau) :

  view_form_bruteforce_detected
    Source : agg_host_ip_ja4_1h (24h)
    Logique : GROUP BY (src_ip, host) HAVING count_post >= 10
    Usage   : bruteforce.py (3 endpoints), investigation_summary.py

  view_host_ip_ja4_rotation
    Source : agg_host_ip_ja4_1h (24h)
    Logique : uniqExact(ja4) par src_ip, HAVING >= 2 (rotation de fingerprint)
    Usage   : rotation.py (3 endpoints), investigation_summary.py

  view_dashboard_entities
    Source : http_logs (7 jours), UNION ALL 5 branches (ip/ja4/country/asn/host)
    Colonnes : entity_type, entity_value, src_ip, ja4, host, log_date,
               client_headers Array(String), asns Array, countries Array,
               user_agents Array
    Usage   : entities.py (5 endpoints), clustering.py

  view_dashboard_user_agents
    Source : http_logs (7 jours), GROUP BY (src_ip, ja4, hour)
    Colonnes : src_ip, ja4, hour, log_date, user_agents Array(String), requests
    Usage   : variability.py (4 endpoints), fingerprints.py (5 endpoints)
              attributes.py (2 endpoints)

deploy_schema.sh : ajout de 10_perf_indexes.sql et 11_views.sql dans la liste

routes/variability.py + fingerprints.py :
  Correction de 9 requêtes utilisant view_dashboard_user_agents sans préfixe
  de base de données → remplacé par {settings.CLICKHOUSE_DB_PROCESSING}.view_*

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:30:09 +02:00
f4ffe3410a perf(clickhouse): P1 — partition + skipping indexes sur ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h
Problème : toutes les requêtes du dashboard WHERE detected_at >= now() - INTERVAL N
faisaient un full scan car ml_detected_anomalies avait ORDER BY (src_ip) sans
partition ni index temporel.

Changements :
- 06_ml_tables.sql :
  * ml_detected_anomalies : PARTITION BY toYYYYMMDD(detected_at)
    → élagage de partitions journalières sur toutes les requêtes temporelles
  * INDEX idx_detected_at (minmax) → skip des granules hors plage
  * INDEX idx_threat_level set(8) → skip pour countIf(threat_level = ...)
  * INDEX idx_bot_name bloom_filter → skip pour bot_name != ''
  * ttl_only_drop_parts = 1 → TTL par suppression de partition entière
  * ml_all_scores : même traitement (PARTITION BY + 2 indexes)

- 04_mv_http_logs.sql :
  * http_logs : INDEX idx_src_ip bloom_filter(0.01)
    → les requêtes WHERE src_ip = X (analysis.py, variability.py) sautent
    ~90% des granules sans scanner toute la plage temporelle
  * INDEX idx_ja4 bloom_filter(0.01) → idem pour filtres JA4

- 05_aggregation_tables.sql :
  * agg_host_ip_ja4_1h : PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...)
    → investigation_summary.py et rotation.py (WHERE src_ip = X) utilisent
    automatiquement la projection au lieu de scanner tous les window_start

- 10_perf_indexes.sql (nouveau) :
  * Migration ALTER TABLE pour instances existantes
  * ADD INDEX + MATERIALIZE INDEX pour les 4 tables
  * ADD PROJECTION + MATERIALIZE PROJECTION pour agg_host_ip_ja4_1h
  * Note : PARTITION BY sur table existante nécessite recréation (documenté)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:28:04 +02:00
69940bf18b docs: update copilot-instructions with integration tests, gotchas, comment standard
- Add make test-integration* commands
- Add SKIP_TESTS=true fast build flag
- Add 'Comments standard' section referencing docs/commenting-standard.md
- Add 'Known gotchas' section with 6 non-obvious issues:
  * go.work build context must include both sentinel + correlator
  * YAML does not expand env vars in Go (hardcode DSN)
  * REGEXP_TREE dict requires >=1 rule or inserts fail
  * pcap only captures non-loopback traffic
  * ClickHouse init needs 120s timeout
  * RPM builds must use Rocky Linux (libpcap.so.1 vs .so.0.8)
  * FLAT() layout requires numeric keys (use COMPLEX_KEY_HASHED for strings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:42:54 +02:00
3b8c06b86d docs: add Doxygen comments to mod_reqin_log.c
- File header: French multi-line description block
- 7 section banners in French (/* ====== Section ====== */ format):
  Configuration du serveur, Buffer dynamique, Sérialisation JSON,
  Gestionnaires de directives, Socket Unix, Journalisation, Hooks Apache
- 26 @brief/@param/@return blocks on every function:
  server config, dynbuf_*, JSON helpers, cmd_set_* handlers,
  socket helpers (try_connect/ensure_connected/write_to_socket),
  log_request, Apache hooks (post_read_request, child_init, etc.)
- No logic changes (1033 → 1268 lines, comments only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:35:19 +02:00
3dfeba860b docs: add standardized comments to all services (Python, Go, Bash)
- Add docs/commenting-standard.md defining per-language comment standards
  (Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)

- services/dashboard: 100% docstring coverage (100/100 functions)
  - All FastAPI route handlers, helpers, classes, and models documented
  - Language: French (project convention)

- services/bot-detector: 100% docstring coverage (53/53 symbols)
  - bot_detector.py: 14 functions + module docstring
  - anubis/fetch_rules.py: 9 functions

- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
  and ClickHouseSettings class

- services/correlator: 24 godoc comments added across 6 Go files
  - correlation_service.go: 10 private helpers
  - unixsocket/source.go: 6 parsing/socket helpers
  - correlated_log.go: 4 field extraction helpers
  - orchestrator.go, logger.go, main.go: 4 comments

- services/correlator/scripts/audit-architecture.sh: standardized header block

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
12d60975da feat: Python traffic generator with realistic varied HTTP/HTTPS traffic
- Replace curlimages/curl with Python stdlib traffic generator
- 200 requests, 10 workers, 16 scenario types:
  browsers (Chrome/Firefox/Safari/Edge/mobile), bots (Googlebot/Bing/curl/wget),
  GET/POST/HEAD/PUT/PATCH/DELETE/OPTIONS, HTTP + HTTPS
- Multiple SSL contexts (default, TLS1.2-only, TLS1.3-only, few_ciphers)
  → 4 distinct JA4/JA3 fingerprints per test run
- Realistic headers: Accept, Accept-Language, Sec-Fetch-*, Referer,
  X-Forwarded-For, Cookie, Cache-Control
- JSON payloads, form data, CORS preflights
- DB always reset (down -v) at start of each test run
- Enhanced Phase 5 checks: distinct UAs, method variety, JA4/JA3 counts + uniqueness

Results: 199/200 OK, 24 distinct UAs, 7 HTTP methods, TLS 1.2+1.3, 4 JA4 fingerprints

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:14:55 +02:00
da8357f43d fix: TLS capture end-to-end in integration tests
- Add traffic-gen container (curlimages/curl) to send HTTPS traffic
  across Docker network so sentinel (pcap on eth0) captures ClientHello
- Seed anubis_ua_rules with catch-all rule (REGEXP_TREE needs ≥1 entry)
  so MV mv_http_logs processes raw logs without errors
- Add JA4/JA3 fingerprint verification in Phase 5 tests
- Dashboard healthcheck via python urllib (no curl in image)

Results: 59 raw logs, 59 parsed, 53 with JA4+JA3 fingerprints (TLS 1.3)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:04:14 +02:00
a985661369 fix: build correlator RPM on Rocky Linux instead of Debian
Replace golang:1.24 (Debian) builder with rockylinux:9 + dnf golang.
All three RPM packages (sentinel, correlator, mod-reqin-log) now build
entirely on Rocky Linux Docker images, ensuring native ABI compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 20:52:27 +02:00
d4e7e674d8 feat: full-stack Docker Compose integration tests
- 4-container stack: ClickHouse, platform (Rocky 9), bot-detector, dashboard
- Platform builds sentinel on Rocky (CGO+libpcap native), correlator static
- mod-reqin-log compiled with apxs on Rocky (matching RPM build target)
- ClickHouse init script patches credentials for test env (sed-based)
- 8-phase test runner: schema, traffic gen, pipeline, dashboard API, bot-detector, sentinel
- All 13 checks pass, 3 non-blocking warnings (empty dicts, log paths)

SQL schema fixes discovered during integration:
- 02_dictionaries: IPv6CIDR → String (not a valid ClickHouse type)
- 03_anubis_tables: dict_anubis_ua missing has_ip/rule_id/category attrs
- 03_anubis_tables: dict_anubis_country FLAT() → COMPLEX_KEY_HASHED() (String key)
- 09_audit_table: CODEC before DEFAULT → DEFAULT before CODEC

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 20:33:25 +02:00
7b8dff2925 docs: guide de déploiement complet (DB, users, services, vérification)
Couvre les 7 étapes de mise en place :
1. Installation ClickHouse
2. Déploiement du schéma (deploy_schema.sh + migrations)
3. Configuration des utilisateurs (data_writer, analyst, bot_writer)
4. Fichiers CSV externes pour les dictionnaires
5. Installation des services Go via RPM (sentinel, correlator, mod-reqin-log)
6. Installation des services Python via Docker (bot-detector, dashboard)
7. Vérification bout-en-bout (ingestion, agrégation, ML, dashboard)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:28:58 +02:00
d8dbd4e706 docs: add .github/copilot-instructions.md for Copilot context
Covers: build/test/lint commands, architecture overview, ClickHouse
dual-DB pattern, inter-service communication, key conventions for
Go (hexagonal, YAML config), Python (pydantic-settings, FastAPI routes),
C (Apache module), Docker-first builds, and RPM packaging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:21:32 +02:00
9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00
b6391afbeb refactor: replace hardcoded mabase_prod DB prefix with configurable settings
Replace all hardcoded 'mabase_prod.' table prefixes in dashboard route
SQL queries with configurable database names from settings:

- http_logs, http_logs_raw → settings.CLICKHOUSE_DB_LOGS
- All other tables → settings.CLICKHOUSE_DB_PROCESSING

Also qualify previously unqualified table references (bare FROM/JOIN
table_name) with the appropriate database prefix for consistency.

Each route file now imports 'from ..config import settings' and uses
f-strings with {settings.CLICKHOUSE_DB_PROCESSING} or
{settings.CLICKHOUSE_DB_LOGS} for database-qualified table names.

Files updated: analysis, attributes, audit, botnets, bruteforce,
clustering, detections, entities, fingerprints, header_fingerprint,
heatmap, incidents, investigation_summary, metrics, ml_features,
rotation, search, tcp_spoofing, variability (19 files).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:03:05 +02:00
dba2676fa7 fix: correct singleton test for ja4_common ClickHouseClient
get_client() returns a lazy ClickHouseClient wrapper — clickhouse_connect.get_client
is only called on .connect(), not at construction time. Remove incorrect assertion
that expected call_count==1 at module-level get_client() call.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:45:14 +02:00
d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00