Commit Graph

13 Commits

Author SHA1 Message Date
6b3cc54652 docs: réécriture audit, DOCUMENTATION.md et IMPROVEMENTS.md pour architecture modulaire
- AUDIT: conformité mise à jour 97.9% (142/145), références modulaires
- DOCUMENTATION.md: 1083 lignes, 7 sections, 11 modules documentés
- IMPROVEMENTS.md: A1-A10/B1-B10 annotés /🔄/ avec localisations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45 docs: réécriture complète de la documentation des services en français
- bot-detector.md : architecture 11 modules, 77/65 features,
  ensemble triple voix (EIF+AE+XGBoost), browser 5 axes, HDBSCAN,
  toutes les variables d'environnement vérifiées depuis le code source
- dashboard.md : corrigé stack (Jinja2+htmx, pas React+Vite),
  14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md : ajouté CLICKHOUSE_DB_PROCESSING/LOGS,
  schéma dual-database, note dashboard n'utilise pas ja4_common

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096 docs: réécriture complète de la documentation base de données en français
Réécriture des 3 fichiers de documentation de la base de données ClickHouse :

- docs/database/schema.md : couverture complète des 2 bases, 14+ tables,
  7 dictionnaires, 8 MVs, 8 vues, TTL, partitions, moteurs et colonnes
- docs/database/migrations.md : 13 fichiers SQL (ajout 10-12), prérequis
  mis à jour (ClickHouse 24.8+, 5 CSV), deploy_schema.sh, init-stack.sh,
  vérification et rollback complets
- shared/clickhouse/README.md : référence rapide des 13 fichiers,
  deploy_schema.sh, patron double-base, prérequis

Suppression des références obsolètes : dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f docs: rewrite architecture/README, update deployment/development
- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865 docs: mise à jour du document de thèse (§2-§8)
- §2.1.3: Simplifié Anubis à 2 dictionnaires (dict_anubis_ip, dict_anubis_asn) avec priorité COALESCE
- §2.4.2: Ajouté bibliothèque isotree, formule de calibration, ntrees=300, sérialisation joblib
- §2.4.2b/§2.4.4: Remplacé DBSCAN par HDBSCAN partout
- §2.4.2c: Remplacé régression logistique par pondération linéaire fixe, ajouté formule et poids
- §2.4.3: Clarifié approximation par 5 quantiles pour la détection de dérive
- §3.1: Mis à jour le diagramme ASCII (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Mis à jour la trifurcation + ajouté détection multifactorielle navigateur (5 axes)
- §4: Élargi taxonomie de 51 à 65+ features sur 8 familles
- §5: Ajouté statut d'implémentation (/) à chaque technique
- §6: Ajouté §6.6 résultats de déploiement (3M+ logs, 34K sessions/cycle)
- §7: Mis à jour conclusion (65+ features, 5/8 techniques, refactorisation modulaire)
- §8: Ajouté références isotree, PyTorch, HDBSCAN, XGBoost, SHAP

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
b66d41a200 docs: updated conformity audit bot-detector + dashboard vs thesis
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.

Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
0d1a6a81e0 docs: update thesis with EIF, autoencoders, ensemble architecture, quantile drift
- §2.4.2: Add Extended Isolation Forest theory (Hariri et al., TKDE 2021)
- §2.4.2b: New section on autoencoders for network anomaly detection
  (Kitsune, β-VAE, hybrid AE+IF studies)
- §2.4.2c: New section on hybrid supervised+unsupervised ensembles
  (triple-voice architecture: EIF + AE + XGBoost + meta-learner)
- §2.4.3: Enhanced drift detection with quantile digest and validation gate
- §6.2: Multi-level baseline contamination mitigation
- §7: Updated conclusion reflecting ensemble architecture
- §8: 10 new references (Hariri 2021, Mirsky 2018, Baptiste 2026, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:23:00 +02:00
0ccd417a02 docs: audit conformité détection vs thèse état de l'art
Analyse exhaustive feature-par-feature des techniques de détection
implémentées vs ce que décrit la thèse. Score: 97% base, 6% techniques
avancées, 72% global pondéré.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 00:12:51 +02:00
51b8eb57a8 feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
  ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)

New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis

All DB references use ja4_processing/ja4_logs (no mabase_prod).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:51:56 +02:00
3dfeba860b docs: add standardized comments to all services (Python, Go, Bash)
- Add docs/commenting-standard.md defining per-language comment standards
  (Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)

- services/dashboard: 100% docstring coverage (100/100 functions)
  - All FastAPI route handlers, helpers, classes, and models documented
  - Language: French (project convention)

- services/bot-detector: 100% docstring coverage (53/53 symbols)
  - bot_detector.py: 14 functions + module docstring
  - anubis/fetch_rules.py: 9 functions

- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
  and ClickHouseSettings class

- services/correlator: 24 godoc comments added across 6 Go files
  - correlation_service.go: 10 private helpers
  - unixsocket/source.go: 6 parsing/socket helpers
  - correlated_log.go: 4 field extraction helpers
  - orchestrator.go, logger.go, main.go: 4 comments

- services/correlator/scripts/audit-architecture.sh: standardized header block

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
7b8dff2925 docs: guide de déploiement complet (DB, users, services, vérification)
Couvre les 7 étapes de mise en place :
1. Installation ClickHouse
2. Déploiement du schéma (deploy_schema.sh + migrations)
3. Configuration des utilisateurs (data_writer, analyst, bot_writer)
4. Fichiers CSV externes pour les dictionnaires
5. Installation des services Go via RPM (sentinel, correlator, mod-reqin-log)
6. Installation des services Python via Docker (bot-detector, dashboard)
7. Vérification bout-en-bout (ingestion, agrégation, ML, dashboard)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:28:58 +02:00
9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00
d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00