aa233bc55c
docs(thesis): v3 — corrections + §3.9 browser_matcher + XFF proxy accuracy
...
User-authored updates verified and corrected:
- Correction 1: 85 features / 8 families (was 65+/7)
- Correction 2: diagram adds MetaLearner, ExIFFI, fleet.py
- Correction 3: axis 5 weight 0.20→0.15, new axis 6 (H2 Coherence 0.05)
- Correction 4: 5 quantiles (p10-p90), p5/p95 as future work
- Correction 5: §5.6 DNS Shadow + §5.7 Compression Ratio named as future
- New §3.9 Browser Signature Detection (browser_matcher 7 dimensions)
- New §2.4.5 ExIFFI + MetaLearner + KL divergence drift
- New §2.5.3 passive HTTP/2 fingerprinting literature
- Updated §5.2 fleet.py implementation details
- Updated §5.8 cross_domain_path_similarity + Jaccard
Additional fixes (code-accuracy alignment):
- XFF proxy: 4 H2 dimensions neutralized (70% weight), not redistributed
- Module count: 12→13 (browser_signatures.py added)
- §6.5 limitations table: precise proxy weight impact (70%→30%)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 19:45:02 +02:00
c77d479d6c
docs(thesis): 5 corrections — 85 features, MetaLearner diagram, browser axes note, quantile clarification, §5.6/5.7 named
...
- Correction 1 (l.65, 701): '65+ features sur 7 familles' → '85 features sur 8 familles'
- Correction 2 (l.374-378): bot_detector ASCII diagram: added MetaLearner, ExIFFI, fleet.py
- Correction 3 (after l.506): note that axis 5 weight is reduced 0.20→0.15 and axis 6 is added at 0.05
- Correction 4 (l.279): clarified that 5 quantiles are currently used (p10→p90); p5/p95 are future work
- Correction 5 (l.776): §5.6 (DNS UDP/53) and §5.7 (Apache compression) named explicitly
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 13:51:31 +02:00
51dd376f7a
docs: full update (7/8 techniques, 85 features, 12 modules)
...
Reflects the actual state of the system after roadmap steps 1-9:
- §5.2 (fleet_detector NetworkX/Louvain) and §5.8 (Jaccard cross-domain): ✅
- MetaLearner (logistic regression, falls back to fixed weights): documented
- ExIFFI (EIF isolation depth) + per-feature AE error: documented
- KL divergence as a complement to KS, adversarial drift: documented
- HTTP/2 fingerprinting (h2_fingerprint, dict_browser_h2, axis_h2_coherence): documented
- Cycle metrics (metrics.py, ml_performance_metrics, alerts): documented
- Browser confidence: 5 axes → 6 axes (axis_h2_coherence)
- 85 features (73 FEATURES + 12 FEATURES_COMPLET), 12 modules, 53 dashboard routes
- Thesis conformity: 99.4% (was 97.9%), §5: 87.5% (was 62.5%)
- New tables: fleet_detections, ml_performance_metrics, soc_feedback
- Dictionaries: 8 (dict_browser_h2 added)
- Dashboard: 16 pages + 37 API routes (fleet, health added)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 01:31:20 +02:00
6b3cc54652
docs: rewrite audit, DOCUMENTATION.md and IMPROVEMENTS.md for modular architecture
...
- AUDIT: conformity updated to 97.9% (142/145), modular references
- DOCUMENTATION.md: 1083 lines, 7 sections, 11 modules documented
- IMPROVEMENTS.md: A1-A10/B1-B10 annotated ✅/🔄/❌ with locations
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:14:18 +02:00
c96c41fb45
docs: complete rewrite of the services documentation in French
...
- bot-detector.md: 11-module architecture, 77/65 features,
triple-voice ensemble (EIF+AE+XGBoost), 5-axis browser detection, HDBSCAN,
all environment variables verified against the source code
- dashboard.md: corrected stack (Jinja2+htmx, not React+Vite),
14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md: added CLICKHOUSE_DB_PROCESSING/LOGS,
dual-database schema, note that the dashboard does not use ja4_common
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:04:58 +02:00
8f5e771096
docs: complete rewrite of the database documentation in French
...
Rewrite of the 3 ClickHouse database documentation files:
- docs/database/schema.md: full coverage of the 2 databases, 14+ tables,
7 dictionaries, 8 MVs, 8 views, TTL, partitions, engines and columns
- docs/database/migrations.md: 13 SQL files (added 10-12), updated
prerequisites (ClickHouse 24.8+, 5 CSV), deploy_schema.sh, init-stack.sh,
full verification and rollback
- shared/clickhouse/README.md: quick reference for the 13 files,
deploy_schema.sh, dual-database pattern, prerequisites
Removed obsolete references: dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:03:37 +02:00
d05969867f
docs: rewrite architecture/README, update deployment/development
...
- architecture.md: complete rewrite (French) with dual-database diagram,
5-phase data flow, full table ownership, triple-voice ML pipeline,
7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
services table, scripts section, integration tests, full doc index,
Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
add Python ML deps, add scripts/integration test sections,
fix bot-detector run command, add make targets
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 22:00:29 +02:00
7bdc6e2865
docs: update the thesis document (§2-§8)
...
- §2.1.3: Simplified Anubis to 2 dictionaries (dict_anubis_ip, dict_anubis_asn) with COALESCE priority
- §2.4.2: Added the isotree library, calibration formula, ntrees=300, joblib serialization
- §2.4.2b/§2.4.4: Replaced DBSCAN with HDBSCAN throughout
- §2.4.2c: Replaced logistic regression with fixed linear weighting, added formula and weights
- §2.4.3: Clarified the 5-quantile approximation used for drift detection
- §3.1: Updated the ASCII diagram (dual-database, 3×EIF+AE+XGB, HDBSCAN, 55 routes)
- §3.8: Updated the trifurcation + added multifactor browser detection (5 axes)
- §4: Expanded the taxonomy from 51 to 65+ features across 8 families
- §5: Added implementation status (✅/❌) to each technique
- §6: Added §6.6 deployment results (3M+ logs, 34K sessions/cycle)
- §7: Updated the conclusion (65+ features, 5/8 techniques, modular refactoring)
- §8: Added references for isotree, PyTorch, HDBSCAN, XGBoost, SHAP
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 21:59:34 +02:00
b66d41a200
docs: updated conformity audit bot-detector + dashboard vs thesis
...
Score: 93% (was 72%) — 4 thesis techniques now implemented,
browser classification, ASN PeeringDB, SOC feedback loop.
Identifies 9 bot-detector bugs (2 critical: campaign_id/raw_anomaly_score
never inserted, worst_score inverted) and 11 dashboard bugs (4 critical:
XSS, no auth, no CSRF, CORS misconfiguration).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 23:25:19 +02:00
0d1a6a81e0
docs: update thesis with EIF, autoencoders, ensemble architecture, quantile drift
...
- §2.4.2: Add Extended Isolation Forest theory (Hariri et al., TKDE 2021)
- §2.4.2b: New section on autoencoders for network anomaly detection
(Kitsune, β-VAE, hybrid AE+IF studies)
- §2.4.2c: New section on hybrid supervised+unsupervised ensembles
(triple-voice architecture: EIF + AE + XGBoost + meta-learner)
- §2.4.3: Enhanced drift detection with quantile digest and validation gate
- §6.2: Multi-level baseline contamination mitigation
- §7: Updated conclusion reflecting ensemble architecture
- §8: 10 new references (Hariri 2021, Mirsky 2018, Baptiste 2026, etc.)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 02:23:00 +02:00
0ccd417a02
docs: conformity audit of detection vs. the state-of-the-art thesis
...
Exhaustive feature-by-feature analysis of the implemented detection
techniques against what the thesis describes. Score: 97% for baseline techniques,
6% for advanced techniques, 72% weighted overall.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 00:12:51 +02:00
51b8eb57a8
feat: port v14 schema fixes, migration, MV verifier, thesis from ja4/
...
deploy_views.sql (v13 → v14):
- CRITICAL: ml_detected_anomalies ORDER BY (src_ip) → (src_ip, ja4, host, model_name)
ReplacingMergeTree was collapsing all detections to 1 row per IP on merge
- Add PARTITION BY toDate + ttl_only_drop_parts on all 4 data tables
- ml_all_scores TTL 3d → 7d; ml_detected_anomalies TTL 30d → 7d
- agg_host_ip_ja4_1h + agg_header_fingerprint_1h: add partition + TTL 7d
- view_ip_recurrence: add WHERE detected_at >= now() - 7 DAY (was full scan)
- Remove dead views: summary/timeseries/threat_dist/variability
- Add view_dashboard_entities (fixes HTTP 500 in clustering/incidents/fingerprints)
- Add view_dashboard_user_agents (fixes HTTP 500 in fingerprints/metrics)
- Add view_ai_features_24h (enables ENABLE_MULTIWINDOW in bot_detector)
- Mark max_requests_per_sec as DEPRECATED (always 0)
New files:
- correlator/sql/migrations/01_ttl_adjustments.sql: ALTER TABLE migration
- tests/integration/verify_mvs.py: MV pipeline verification assertions
- docs/THESIS_HTTP_Traffic_Detection.md: detection techniques thesis
All DB references use ja4_processing/ja4_logs (no mabase_prod).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 23:51:56 +02:00
3dfeba860b
docs: add standardized comments to all services (Python, Go, Bash)
...
- Add docs/commenting-standard.md defining per-language comment standards
(Go godoc, Python PEP-257, C Doxygen, Bash header blocks, SQL banners)
- services/dashboard: 100% docstring coverage (100/100 functions)
- All FastAPI route handlers, helpers, classes, and models documented
- Language: French (project convention)
- services/bot-detector: 100% docstring coverage (53/53 symbols)
- bot_detector.py: 14 functions + module docstring
- anubis/fetch_rules.py: 9 functions
- shared/python/ja4_common: full docstrings on ClickHouseClient (7 methods)
and ClickHouseSettings class
- services/correlator: 24 godoc comments added across 6 Go files
- correlation_service.go: 10 private helpers
- unixsocket/source.go: 6 parsing/socket helpers
- correlated_log.go: 4 field extraction helpers
- orchestrator.go, logger.go, main.go: 4 comments
- services/correlator/scripts/audit-architecture.sh: standardized header block
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 21:32:29 +02:00
7b8dff2925
docs: complete deployment guide (DB, users, services, verification)
...
Covers the 7 setup steps:
1. ClickHouse installation
2. Schema deployment (deploy_schema.sh + migrations)
3. User configuration (data_writer, analyst, bot_writer)
4. External CSV files for the dictionaries
5. Installation of the Go services via RPM (sentinel, correlator, mod-reqin-log)
6. Installation of the Python services via Docker (bot-detector, dashboard)
7. End-to-end verification (ingestion, aggregation, ML, dashboard)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:28:58 +02:00
9f3e0621e5
feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
...
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit
Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)
Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)
All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00
d469e39da7
feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
...
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)
Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)
Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets
Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)
Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)
Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00