toto a108814a56 feat: bot detection roadmap §2-9: HTTP/2, coherence, drift, fleet, Jaccard, ExIFFI, meta-learner, metrics
Step 2: HTTP/2 fingerprinting in the ML pipeline
- Added the dict_browser_h2 dictionary (11 browser families) to 05_aggregation_tables.sql
- Added the h2_agg CTE and 4 HTTP/2 features to 07_ai_features_view.sql:
  h2_settings_known, h2_pseudo_order_match, h2_ja4_coherence, h2_settings_rare
- fingerprint_coherence_score (5 weighted axes) is now computed in the view
- Added a 6th axis, axis_h2_coherence, in browser.py (weights rebalanced)
- browser_h2.csv: 11 Akamai fingerprints → browser family
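
A weighted multi-axis score of this kind can be sketched as follows. This is a minimal illustration, not the code in browser.py or the SQL view: the weights and every axis name other than axis_h2_coherence are hypothetical, and renormalising over the axes that are present is an assumption.

```python
from typing import Dict, Optional


def coherence_score(axes: Dict[str, Optional[float]],
                    weights: Dict[str, float]) -> float:
    """Weighted mean of per-axis scores, each clipped to [0, 1].

    Axes that are missing or None are skipped and the remaining
    weights are renormalised (an assumption made for this sketch).
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in weights.items():
        value = axes.get(name)
        if value is None:
            continue
        clipped = min(1.0, max(0.0, value))
        total += weight * clipped
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical weights; only axis_h2_coherence is named in the commit.
EXAMPLE_WEIGHTS = {"axis_other": 0.75, "axis_h2_coherence": 0.25}

score = coherence_score(
    {"axis_other": 1.0, "axis_h2_coherence": 0.0}, EXAMPLE_WEIGHTS
)
```

Renormalising keeps the score comparable between sessions where HTTP/2 data is absent and sessions where it is present.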

Step 3: Coherence pre-filter on the human baseline
- pipeline.py excludes sessions whose fingerprint_coherence_score is below the threshold from the training baseline
- FINGERPRINT_COHERENCE_THRESHOLD is configurable via environment variable (default 0.25)
- Excluded sessions are logged for SOC analysis
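
The pre-filter described above could look roughly like the sketch below. The environment variable name and the 0.25 default come from the commit; the function names, the list-of-dicts session shape, and the treatment of a missing score as 0.0 are assumptions for illustration.

```python
import logging
import os
from typing import Dict, List, Tuple

logger = logging.getLogger("baseline_filter")


def load_threshold(default: float = 0.25) -> float:
    """Read FINGERPRINT_COHERENCE_THRESHOLD, falling back to the default."""
    raw = os.environ.get("FINGERPRINT_COHERENCE_THRESHOLD")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        logger.warning("invalid threshold %r, using %s", raw, default)
        return default


def split_baseline(sessions: List[Dict], threshold: float) -> Tuple[List[Dict], List[Dict]]:
    """Return (kept, excluded); sessions scoring below the threshold are excluded."""
    kept: List[Dict] = []
    excluded: List[Dict] = []
    for session in sessions:
        # Assumption: a session without a score is treated as incoherent.
        score = session.get("fingerprint_coherence_score", 0.0)
        (excluded if score < threshold else kept).append(session)
    if excluded:
        logger.info("excluded %d of %d sessions from the baseline",
                    len(excluded), len(sessions))
    return kept, excluded
```

The comparison is strict, so a session scoring exactly the threshold stays in the baseline.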

Step 4: Improved drift detection
- scoring.py: moved from 5 to 9 quantiles (p5…p95)
- Added KL divergence alongside the KS test
- Adversarial drift detection (≥80% of features drifting in the same direction)
- Strict temporal split for validation
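
Two of the checks above, KL divergence and the same-direction test, can be sketched as follows. This is not the code in scoring.py: the function names, the additive smoothing constant, and the use of a signed per-feature shift (for example current median minus reference median) are assumptions. The 80% share is the figure given in the commit.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float],
                  eps: float = 1e-9) -> float:
    """KL(P || Q) between two histograms defined on the same bins.

    Counts are smoothed with eps so that empty bins do not produce
    a division by zero or log(0).
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        p_i = (p + eps) / p_total
        q_i = (q + eps) / q_total
        kl += p_i * math.log(p_i / q_i)
    return kl


def adversarial_drift(shifts: Dict[str, float], min_share: float = 0.8) -> bool:
    """True when at least min_share of all features shift in one direction.

    shifts maps a feature name to a signed shift; zero means no drift.
    """
    nonzero = [s for s in shifts.values() if s != 0.0]
    if not nonzero:
        return False
    up = sum(1 for s in nonzero if s > 0)
    down = len(nonzero) - up
    return max(up, down) / len(shifts) >= min_share
```

Natural drift tends to move features in mixed directions, so a large one-sided share is treated here as a hint that traffic is being shaped deliberately.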

Step 5: JA4×ASN bipartite graph (§5.2)
- fleet.py: fleet detection via NetworkX + Louvain (optional imports)
- enrich_with_fleet_score(): adds fleet_score + fleet_campaign_flag to the DataFrame
- cycle.py: called after preprocess_df, logging the number of sessions that belong to a fleet
- SQL migration 05_fleet_metrics_tables.sql: fleet_detections table (7-day TTL)
- Dashboard: /fleet + /api/fleet (detected communities) + fleet.html template
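
The idea behind the bipartite graph can be illustrated without third-party libraries. The sketch below deliberately swaps Louvain community detection for plain connected components, which is a much coarser grouping, and the min_asns cutoff and all function names are hypothetical. It only shows how one JA4 fingerprint seen across many ASNs ends up grouped as a fleet candidate.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[str, str]  # ("ja4", value) or ("asn", value)


def build_bipartite(sessions: Iterable[Tuple[str, str]]) -> Dict[Node, Set[Node]]:
    """Adjacency map linking each JA4 fingerprint to the ASNs it was seen from."""
    adjacency: Dict[Node, Set[Node]] = defaultdict(set)
    for ja4, asn in sessions:
        left, right = ("ja4", ja4), ("asn", asn)
        adjacency[left].add(right)
        adjacency[right].add(left)
    return dict(adjacency)


def components(adjacency: Dict[Node, Set[Node]]) -> List[Set[Node]]:
    """Connected components via breadth-first search (stand-in for Louvain)."""
    seen: Set[Node] = set()
    result: List[Set[Node]] = []
    for start in adjacency:
        if start in seen:
            continue
        component: Set[Node] = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result


def fleet_candidates(sessions: Iterable[Tuple[str, str]],
                     min_asns: int = 3) -> List[Dict[str, Set[str]]]:
    """Groups whose fingerprints are spread over at least min_asns ASNs."""
    fleets = []
    for component in components(build_bipartite(sessions)):
        asns = {name for kind, name in component if kind == "asn"}
        ja4s = {name for kind, name in component if kind == "ja4"}
        if len(asns) >= min_asns:
            fleets.append({"ja4": ja4s, "asn": asns})
    return fleets
```

On real traffic, connected components would merge unrelated clients through popular fingerprints, which is presumably why a modularity-based method such as Louvain is used in fleet.py.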

Step 6: Cross-domain Jaccard (§5.8)
- 12_thesis_features.sql: jaccard_paths CTE → cross_domain_path_similarity
- Signal: the same paths (/admin, /wp-login) requested on several hosts indicate a scanner
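
The Jaccard signal can be illustrated in a few lines. The feature itself is computed in SQL; this sketch only shows the arithmetic. Averaging over all host pairs is an assumption about how the CTE aggregates, and the host names are made up.

```python
from itertools import combinations
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cross_domain_path_similarity(paths_by_host: Dict[str, Set[str]]) -> float:
    """Mean pairwise Jaccard similarity of the path sets requested per host."""
    hosts = sorted(paths_by_host)
    if len(hosts) < 2:
        return 0.0
    pairs = list(combinations(hosts, 2))
    total = sum(jaccard(paths_by_host[x], paths_by_host[y]) for x, y in pairs)
    return total / len(pairs)


# A client probing identical paths on two hypothetical hosts.
scanner_like = cross_domain_path_similarity({
    "shop.example": {"/admin", "/wp-login"},
    "blog.example": {"/admin", "/wp-login"},
})
```

A human browsing two unrelated sites rarely requests the same paths on both, so values near 1.0 point towards scripted probing.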

Step 7: ExIFFI + per-feature AE errors
- scoring.py: compute_exiffi_importance() by permutation, compute_ae_feature_errors()
- pipeline.py: ExIFFI computed on X_test, index → dict mapping for anomalies
- build_reason() enriched with exiffi_top when SHAP is inactive
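
Permutation-based importance, the approach the commit names for compute_exiffi_importance(), can be sketched generically. This is not the project's function and not the full ExIFFI method: score_fn stands for any anomaly scorer, and the function name, the mean absolute score change, and the fixed seed are assumptions.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(score_fn: Callable[[Sequence[float]], float],
                           rows: List[Sequence[float]],
                           n_features: int,
                           seed: int = 0) -> List[float]:
    """Mean absolute change in score when one feature column is shuffled.

    A feature the scorer ignores gets an importance of exactly 0.0.
    """
    if not rows:
        return [0.0] * n_features
    rng = random.Random(seed)
    baseline = [score_fn(row) for row in rows]
    importances: List[float] = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by its shuffled value.
        shuffled = [
            list(row[:j]) + [column[i]] + list(row[j + 1:])
            for i, row in enumerate(rows)
        ]
        delta = sum(abs(score_fn(s) - b) for s, b in zip(shuffled, baseline))
        importances.append(delta / len(rows))
    return importances
```

Sorting the result in descending order gives the kind of top-feature list that a field such as exiffi_top could carry into the anomaly reason.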

Step 8: Meta-learner for ensemble weighting
- scoring.py: MetaLearner class (LogisticRegression, falls back to fixed weights below 1000 labels)
- Labels collected from the current cycle (known_bots, legitimate, Anubis)
- pipeline.py: fixed weights replaced by MetaLearner.predict()
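
The fallback behaviour can be sketched as below. To stay dependency-free, this sketch replaces scikit-learn's LogisticRegression with a small gradient-descent logistic regression, and the class name, fixed weights, learning rate and epoch count are all hypothetical. Only the 1000-label cutoff comes from the commit.

```python
import math
from typing import List, Optional, Sequence

MIN_LABELS = 1000                 # cutoff stated in the commit
FIXED_WEIGHTS = (0.4, 0.3, 0.3)   # hypothetical weights for the three models


class MetaLearnerSketch:
    """Combines three model scores; trains only when enough labels exist."""

    def __init__(self, fixed_weights: Sequence[float] = FIXED_WEIGHTS,
                 min_labels: int = MIN_LABELS) -> None:
        self.fixed_weights = list(fixed_weights)
        self.min_labels = min_labels
        self.coef: Optional[List[float]] = None
        self.bias = 0.0

    @staticmethod
    def _sigmoid(z: float) -> float:
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, scores: List[Sequence[float]], labels: List[int],
            epochs: int = 200, lr: float = 0.5) -> "MetaLearnerSketch":
        if len(labels) < self.min_labels:
            self.coef = None  # too few labels: keep the fixed weights
            return self
        coef = [0.0] * len(self.fixed_weights)
        bias = 0.0
        n = len(labels)
        for _ in range(epochs):
            grad = [0.0] * len(coef)
            grad_bias = 0.0
            for x, y in zip(scores, labels):
                z = bias + sum(c * v for c, v in zip(coef, x))
                err = self._sigmoid(z) - y
                for j, v in enumerate(x):
                    grad[j] += err * v
                grad_bias += err
            coef = [c - lr * g / n for c, g in zip(coef, grad)]
            bias -= lr * grad_bias / n
        self.coef, self.bias = coef, bias
        return self

    def predict(self, x: Sequence[float]) -> float:
        if self.coef is None:
            return sum(w * v for w, v in zip(self.fixed_weights, x))
        z = self.bias + sum(c * v for c, v in zip(self.coef, x))
        return self._sigmoid(z)
```

Falling back to fixed weights avoids fitting a combiner on a handful of labels, where the learned coefficients would mostly reflect noise.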

Step 9: Performance metrics and monitoring
- metrics.py: record_cycle_metrics() records anomaly rate, drift, correlation, latency
- SQL migration 05_fleet_metrics_tables.sql: ml_performance_metrics table (90-day TTL)
- Dashboard: /health + /api/health + health.html template
- cycle.py: record_cycle_metrics called at the end of each cycle (Complet + Applicatif)

Tests: 36/36 bot-detector tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-10 00:11:35 +02:00

ja4-platform

ja4-platform is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, applies triple-voice ML anomaly detection (Extended Isolation Forest + Autoencoder + XGBoost), and surfaces results through a SOC analyst dashboard — all backed by ClickHouse with a dual-database architecture.

Pipeline Overview

  ┌──────────────────────────────────────────────────────────────────────────────┐
  │                          Linux Server (Apache)                               │
  │                                                                              │
  │  ┌─────────────────┐        UNIX socket (DGRAM)        ┌──────────────────┐  │
  │  │  mod-reqin-log   │──── http.socket ────────────────▶│                  │  │
  │  │  (Apache C11)    │        (source A)                 │   correlator     │  │
  │  └─────────────────┘                                   │   (Go · hex.     │  │
  │                                                         │    architecture) │  │
  │  ┌─────────────────┐        UNIX socket (DGRAM)        │                  │  │
  │  │  sentinel        │──── network.socket ─────────────▶│   Joins by       │  │
  │  │  (Go · libpcap)  │        (source B)                 │   src_ip:src_port│  │
  │  │  JA4/JA3 gen.    │                                   └────────┬─────────┘  │
  │  └─────────────────┘                                             │            │
  └──────────────────────────────────────────────────────────────────┼────────────┘
                                                                     │ INSERT
                                                                     ▼
                                          ┌──────────────────────────────────────┐
                                          │          ClickHouse 24.8             │
                                          │                                      │
                                          │  ja4_logs          ja4_processing    │
                                          │  ┌──────────┐     ┌──────────────┐  │
                                          │  │_raw → MV │────▶│ agg_* (×6)   │  │
                                          │  │→ http_logs│     │ ml_* (×2)    │  │
                                          │  └──────────┘     │ views, dicts │  │
                                          │                    └──────────────┘  │
                                          └─────────┬───────────────┬────────────┘
                                                    │               │
                                   ┌────────────────┘               └───────────────┐
                                   ▼                                                 ▼
                        ┌────────────────────┐                           ┌────────────────────┐
                        │  bot-detector       │                           │  dashboard          │
                        │  Python 3.11        │                           │  FastAPI + Jinja2   │
                        │  EIF + AE + XGBoost │                           │  htmx + Chart.js    │
                        │  HDBSCAN · SHAP     │                           │  55 routes · 14 pp  │
                        └────────────────────┘                           └────────────────────┘

Services

Service        Language     Description                                                                     Interface
sentinel       Go 1.24.6    TLS/TCP packet capture via libpcap, JA4/JA3 fingerprint generation              UNIX socket → network.socket
mod-reqin-log  C11          Apache HTTPD module, HTTP request JSON logging                                  UNIX socket → http.socket
correlator     Go 1.24.6    Hexagonal architecture, correlates HTTP+TLS events by src_ip:src_port           ClickHouse INSERT (Native TCP)
bot-detector   Python 3.11  Triple-voice ML ensemble (EIF+AE+XGB), HDBSCAN campaigns, SHAP explainability  ClickHouse read/write, HTTP :8080
dashboard      Python 3.11  SOC analyst dashboard: 55 routes, 15 templates, 14 pages                        HTTP :8000
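
The join performed by the correlator can be illustrated with a short sketch. The correlator itself is written in Go; this Python version only shows the keying idea. The class name and the ja4 and path fields are hypothetical, and the sketch matches events one-to-one without any expiry of unmatched entries.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]


class CorrelationBuffer:
    """Pairs HTTP and TLS events that share the same src_ip:src_port."""

    def __init__(self) -> None:
        self._pending_http: Dict[Key, dict] = {}
        self._pending_tls: Dict[Key, dict] = {}

    @staticmethod
    def _key(event: dict) -> Key:
        return (event["src_ip"], int(event["src_port"]))

    def add_http(self, event: dict) -> Optional[dict]:
        """Store an HTTP event, or return the merged record if TLS is waiting."""
        key = self._key(event)
        tls = self._pending_tls.pop(key, None)
        if tls is None:
            self._pending_http[key] = event
            return None
        return {**tls, **event}

    def add_tls(self, event: dict) -> Optional[dict]:
        """Store a TLS event, or return the merged record if HTTP is waiting."""
        key = self._key(event)
        http = self._pending_http.pop(key, None)
        if http is None:
            self._pending_tls[key] = event
            return None
        return {**event, **http}


buffer = CorrelationBuffer()
buffer.add_tls({"src_ip": "10.0.0.1", "src_port": 4242, "ja4": "example-ja4"})
merged = buffer.add_http({"src_ip": "10.0.0.1", "src_port": 4242, "path": "/"})
```

Because the two sources arrive on separate sockets, either event may come first, which is why both directions buffer and both can emit the merged record.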

Shared Libraries

Library            Language  Description
go/ja4common       Go        Logger, config loader, graceful shutdown handler, IP filter
python/ja4_common  Python    ClickHouseClient singleton, ClickHouseSettings (pydantic-settings)

Quickstart

Prerequisites

  • Docker (with BuildKit) and Docker Compose
  • make
  • No native Go, Python, or C toolchains required — all builds run inside Docker

Build All Services

make build-all

Run All Tests

make test-all

Build RPM Packages

make rpm-all
# RPMs written to services/<service>/dist/rpm/el{8,9,10}/

Scripts

Helper scripts are located in scripts/:

Script                   Description
init-stack.sh            Full ClickHouse stack initialization: deploys schema, loads CSV data, verifies all components
import-prod-data.sh      Imports pre-exported production data into the dev database with dynamic date shifting
reload-prod-logs.sh      Exports http_logs from production and re-imports into the dev database
update-csv-data.sh       Downloads and generates all CSV reference data (bot IPs, JA4 signatures, ASN reputation)
generate_bot_ip.py       Generates bot_ip.csv from known scanner/bot sources + Tor exit nodes
generate_bot_ja4.py      Generates bot_ja4.csv from known bot TLS fingerprints
generate_asn_data.py     Generates asn_reputation.csv (ASN→label mapping)
generate_browser_ja4.py  Generates browser JA4 reference data for legitimate browser detection

Corresponding Makefile targets:

make init-stack        # runs scripts/init-stack.sh
make import-prod-data  # runs scripts/import-prod-data.sh
make init-and-import   # init-stack + import-prod-data
make reload-prod-logs  # runs scripts/reload-prod-logs.sh

Integration Tests

Full-stack integration tests run against Docker Compose with a real ClickHouse instance:

make test-integration          # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep     # same but leaves stack running after
make test-integration-down     # tear down integration stack

The integration test suite is located in tests/integration/ and resets the database between runs.

Documentation

Document             Description
Architecture         System architecture, data flow, component interactions
Deployment           Step-by-step production deployment guide
Development          Build, test, package, and extend the platform
Database Schema      Every ClickHouse table, view, dictionary, and materialized view
Database Migrations  Migration order, application, verification, and rollback
Commenting Standard  Code commenting conventions (French comments, English identifiers)
Thesis Reference     Academic reference: HTTP traffic detection techniques
Audit vs Thesis      Comparison between platform implementation and thesis techniques

Service Documentation

  • Sentinel — TLS/TCP capture daemon (Go + libpcap)
  • mod-reqin-log — Apache HTTP logging module (C11)
  • Correlator — HTTP/TLS event correlation engine (Go)
  • Bot Detector — Triple-voice ML anomaly detection (Python)
  • Dashboard — SOC analyst dashboard and API (FastAPI)

Shared Library Documentation

  • go-ja4common — Go shared library (logger, config, shutdown, ipfilter)
  • python-ja4common — Python shared library (ClickHouse client, settings)

Go Workspace

The repository uses a Go workspace (go.work) to link the Go modules:

go 1.24.6

use (
    ./services/sentinel
    ./services/correlator
    ./shared/go/ja4common
)

Both Go services have a replace directive in their go.mod pointing to ../../shared/go/ja4common. The workspace takes precedence for local development; the replace is needed for Docker builds where go.work is not available.

License

See individual service directories for license information.
