docs: rewrite architecture/README, update deployment/development

- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-09 22:00:29 +02:00
parent 7bdc6e2865
commit d05969867f
4 changed files with 578 additions and 282 deletions

146
README.md

@ -1,59 +1,63 @@
# ja4-platform # ja4-platform
**ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, detects anomalous behavior using machine learning (Isolation Forest), and presents results through a SOC analyst dashboard — all backed by ClickHouse as the central data store. **ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, applies triple-voice ML anomaly detection (Extended Isolation Forest + Autoencoder + XGBoost), and surfaces results through a SOC analyst dashboard — all backed by ClickHouse with a dual-database architecture.
## Pipeline Overview ## Pipeline Overview
``` ```
┌─────────────────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────────────────
Linux Server (Apache) │ │ Linux Server (Apache)
│ │
│ ┌─────────────────┐ ┌─────────────────────┐ │ ┌─────────────────┐ UNIX socket (DGRAM) ┌──────────────────┐
│ │ mod-reqin-log │───────▶│ UNIX socket (HTTP) │──┐ │ │ mod-reqin-log │──── http.socket ────────────────▶│
│ │ (Apache module) │ JSON /var/run/logcorr/ │ │ │ │ (Apache C11) │ (source A) │ correlator
│ C · httpd DSO http.socket └─────────────────┘ (Go · hex.
└─────────────────┘ └─────────────────────┘ │ architecture) │
┌─────────────────┐ UNIX socket (DGRAM)
┌─────────────────┐ ┌─────────────────────┐ ──────────────────┐ │ sentinel ──── network.socket ─────────────▶│ Joins by │
│ │ sentinel │───────▶│ UNIX socket (TLS) │─▶│ correlator │ │ │ │ (Go · libpcap) │ (source B) src_ip:src_port│ │
│ │ (TLS capture)JSON /var/run/logcorr/ (event join) │ │ │ JA4/JA3 gen. └────────┬─────────┘
│ Go · libpcap network.socket Go · hex. arch └─────────────────┘
└─────────────────┘ └─────────────────────┘ └─────────────────┘ └─────────────────────────────────────────────────────────────────────────────┘
INSERT
└────────────────────────────────────────────────────────────────┼────────────┘
│ INSERT ┌──────────────────────────────────────┐
ClickHouse 24.8
┌──────────────────┐ │ │
│ ClickHouse ja4_logs ja4_processing
│ ja4_processing ┌──────────┐ ┌──────────────┐
│ (all tables) │ │ │_raw → MV │────▶│ agg_* (×6) │
└────────┬─────────┘ │→ http_logs│ │ ml_* (×2) │ │
│ SELECT │ └──────────┘ │ views, dicts │ │
┌────────────────────┼────────────────────┐ │ └──────────────┘ │
▼ ▼ └─────────┬───────────────┬────────────┘
┌──────────────────┐ ┌──────────────────┐
│ bot-detector │ dashboard │ ┌────────────────┘ └───────────────┐
│ (ML anomaly det) │ (SOC web UI)
│ Python · sklearn │ FastAPI + React │ ┌────────────────────┐ ┌────────────────────┐
└──────────────────┘ └──────────────────┘ bot-detector │ dashboard │
│ Python 3.11 │ │ FastAPI + Jinja2 │
│ EIF + AE + XGBoost │ │ htmx + Chart.js │
│ HDBSCAN · SHAP │ │ 55 routes · 14 pp │
└────────────────────┘ └────────────────────┘
``` ```
## Services ## Services
| Service | Language | Purpose | Interface | | Service | Language | Description | Interface |
|---------|----------|---------|-----------| |---------|----------|-------------|-----------|
| [sentinel](docs/services/sentinel.md) | Go | Live TLS packet capture, JA4/JA3 fingerprint generation | UNIX socket (`network.socket`) | | [sentinel](docs/services/sentinel.md) | Go 1.24.6 | TLS/TCP packet capture via libpcap, JA4/JA3 fingerprint generation | UNIX socket `network.socket` |
| [mod-reqin-log](docs/services/mod-reqin-log.md) | C | Apache HTTPD module, HTTP request JSON logging | UNIX socket (`http.socket`) | | [mod-reqin-log](docs/services/mod-reqin-log.md) | C11 | Apache HTTPD module, HTTP request JSON logging | UNIX socket `http.socket` |
| [correlator](docs/services/correlator.md) | Go | Joins HTTP + TLS events by `src_ip:src_port` + time window | ClickHouse INSERT, file, stdout | | [correlator](docs/services/correlator.md) | Go 1.24.6 | Hexagonal architecture, correlates HTTP+TLS events by `src_ip:src_port` | ClickHouse INSERT (Native TCP) |
| [bot-detector](docs/services/bot-detector.md) | Python | Isolation Forest ML anomaly detection on aggregated traffic | ClickHouse read/write, HTTP `:8080` | | [bot-detector](docs/services/bot-detector.md) | Python 3.11 | Triple-voice ML ensemble (EIF+AE+XGB), HDBSCAN campaigns, SHAP explainability | ClickHouse read/write, HTTP `:8080` |
| [dashboard](docs/services/dashboard.md) | Python/JS | SOC analyst web dashboard (FastAPI + React) | HTTP `:8000` | | [dashboard](docs/services/dashboard.md) | Python 3.11 | SOC analyst dashboard: 55 routes, 15 templates, 14 pages | HTTP `:8000` |
## Shared Libraries ## Shared Libraries
| Library | Language | Description | | Library | Language | Description |
|---------|----------|-------------| |---------|----------|-------------|
| [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, shutdown handler, IP filter | | [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, graceful shutdown handler, IP filter |
| [python/ja4_common](docs/shared/python-ja4common.md) | Python | ClickHouse client singleton, settings | | [python/ja4_common](docs/shared/python-ja4common.md) | Python | `ClickHouseClient` singleton, `ClickHouseSettings` (pydantic-settings) |
## Quickstart ## Quickstart
@ -79,37 +83,77 @@ make test-all
```bash ```bash
make rpm-all make rpm-all
# RPMs written to services/<service>/dist/ # RPMs written to services/<service>/dist/rpm/el{8,9,10}/
``` ```
## Scripts
Helper scripts are located in `scripts/`:
| Script | Description |
|--------|-------------|
| `init-stack.sh` | Full ClickHouse stack initialization — deploys schema, loads CSV data, verifies all components |
| `import-prod-data.sh` | Imports pre-exported production data into the dev database with dynamic date shifting |
| `reload-prod-logs.sh` | Exports `http_logs` from production and re-imports into the dev database |
| `update-csv-data.sh` | Downloads and generates all CSV reference data (bot IPs, JA4 signatures, ASN reputation) |
| `generate_bot_ip.py` | Generates `bot_ip.csv` from known scanner/bot sources + Tor exit nodes |
| `generate_bot_ja4.py` | Generates `bot_ja4.csv` from known bot TLS fingerprints |
| `generate_asn_data.py` | Generates `asn_reputation.csv` (ASN→label mapping) |
| `generate_browser_ja4.py` | Generates browser JA4 reference data for legitimate browser detection |
Corresponding Makefile targets:
```bash
make init-stack # runs scripts/init-stack.sh
make import-prod-data # runs scripts/import-prod-data.sh
make init-and-import # init-stack + import-prod-data
make reload-prod-logs # runs scripts/reload-prod-logs.sh
```
## Integration Tests
Full-stack integration tests run against Docker Compose with a real ClickHouse instance:
```bash
make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep # same but leaves stack running after
make test-integration-down # tear down integration stack
```
The integration test suite is located in `tests/integration/` and resets the database between runs.
## Documentation ## Documentation
| Document | Description | | Document | Description |
|----------|-------------| |----------|-------------|
| [Architecture](docs/architecture.md) | System architecture, data flow, component interactions | | [Architecture](docs/architecture.md) | System architecture, data flow, component interactions |
| [Deployment](docs/deployment.md) | Step-by-step production deployment guide |
| [Development](docs/development.md) | Build, test, package, and extend the platform | | [Development](docs/development.md) | Build, test, package, and extend the platform |
| [Database Schema](docs/database/schema.md) | Every ClickHouse table, view, dictionary, and materialized view | | [Database Schema](docs/database/schema.md) | Every ClickHouse table, view, dictionary, and materialized view |
| [Database Migrations](docs/database/migrations.md) | Migration order, application, verification, and rollback | | [Database Migrations](docs/database/migrations.md) | Migration order, application, verification, and rollback |
| [Commenting Standard](docs/commenting-standard.md) | Code commenting conventions (French comments, English identifiers) |
| [Thesis Reference](docs/THESIS_HTTP_Traffic_Detection.md) | Academic reference: HTTP traffic detection techniques |
| [Audit vs Thesis](docs/AUDIT_Detection_vs_Thesis.md) | Comparison between platform implementation and thesis techniques |
### Service Documentation ### Service Documentation
- [Sentinel](docs/services/sentinel.md) — TLS capture daemon - [Sentinel](docs/services/sentinel.md) — TLS/TCP capture daemon (Go + libpcap)
- [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module - [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module (C11)
- [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine - [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine (Go)
- [Bot Detector](docs/services/bot-detector.md) — ML anomaly detection - [Bot Detector](docs/services/bot-detector.md) — Triple-voice ML anomaly detection (Python)
- [Dashboard](docs/services/dashboard.md) — SOC web dashboard and API - [Dashboard](docs/services/dashboard.md) — SOC analyst dashboard and API (FastAPI)
### Shared Library Documentation ### Shared Library Documentation
- [go-ja4common](docs/shared/go-ja4common.md) — Go shared library - [go-ja4common](docs/shared/go-ja4common.md) — Go shared library (logger, config, shutdown, ipfilter)
- [python-ja4common](docs/shared/python-ja4common.md) — Python shared library - [python-ja4common](docs/shared/python-ja4common.md) — Python shared library (ClickHouse client, settings)
## Go Workspace ## Go Workspace
The repository uses a Go workspace (`go.work`) to link the Go modules: The repository uses a Go workspace (`go.work`) to link the Go modules:
``` ```
go 1.21 go 1.24.6
use ( use (
./services/sentinel ./services/sentinel
@ -118,6 +162,8 @@ use (
) )
``` ```
Both Go services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local development; the `replace` is needed for Docker builds where `go.work` is not available.
## License ## License
See individual service directories for license information. See individual service directories for license information.


@ -1,162 +1,285 @@
# Architecture # Architecture
The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services. ja4-platform est un pipeline de sécurité qui capture le trafic réseau en temps réel, génère des empreintes TLS JA4/JA3, corrèle les handshakes TLS avec les requêtes HTTP, applique une détection d'anomalies par apprentissage automatique (ensemble triple voix), et présente les résultats dans un tableau de bord SOC. ClickHouse sert de magasin central reliant tous les services, organisé en deux bases de données distinctes.
## System Architecture ## Architecture système
``` ```
┌───────────────────────────────────────────────────────────────────────────────────┐ ┌─────────────────────────────────────────────────────────────────────────────────────
Target Linux Server Serveur Linux cible (Apache)
│ │
│ ┌─────────────┐ HTTP req ┌───────────────────────┐ UNIX socket (DGRAM) │ ┌─────────────────┐ HTTP req ┌───────────────────────┐
│ │ Client │────────────▶│ Apache HTTPD │──────────────┐ │ │ Client │────────────▶│ Apache HTTPD
│ │ (browser / │ │ + mod-reqin-log │ │ (navigateur / │ │ + mod-reqin-log (C) │ │
│ │ bot) │ └──────────────────────┘ │ │ bot) │ └──────────────────────┘ │
│ │ │ │ │ │ JSON / UNIX DGRAM
│ │ │ TLS CH ┌───────────────────────┐ ┌─────────────────────┐ │ │ │ │ /var/run/logcorrelator/http.socket
│ │ │────────────▶│ sentinel correlator │ │ │ ▼
│ │ │ (pcap) │ (packet capture) │──▶│ (event join) │ │ │ TLS CH ┌──────────────────────────────────────┐
└─────────────┘ └───────────────────────┘ └────────┬────────────┘ │ │────────────▶│ sentinel (Go · libpcap) │
└─────────────────┘ (pcap) → JA4/JA3 fingerprints │ │
└────────────────────────────────────────────────────────────────────┼──────────────┘ │ └───────────┬──────────────────────────┘
INSERT JSON JSON / UNIX DGRAM │
│ /var/run/logcorrelator/network.socket│
┌─────────────────────┐ ▼ │
ClickHouse ┌───────────────────────────────┐
│ ja4_processing correlator (Go · hex. arch) │
join src_ip:src_port + TTL │
http_logs_raw └───────────┬───────────────────┘
──(MV)──▶ http_logs
│ ──(MV)──▶ agg_* │ └───────────────────────────────────────────────┼─────────────────────────────────────┘
│ view_ai_features │ │ INSERT (Native TCP :9000)
│ ml_detected_anom. │
│ ml_all_scores │ ┌───────────────────────────────────────────────────────┐
└──────┬──────┬───────┘ ClickHouse 24.8
┌──────────────────┘ └──────────────────┐ │ ja4_logs ja4_processing │
▼ ▼ ┌────────────────┐ ┌────────────────────┐ │
┌──────────────────────┐ ┌──────────────────────┐ │ │ http_logs_raw │──(MV)──▶ │ agg_host_ip_ja4_1h│ │
│ bot-detector │ │ dashboard │ │ ↓ mv_http_logs │ agg_header_fp_1h │
(Python) │ │ (FastAPI + React) http_logs │──(MVs)──▶│ agg_path_seq_1h │
│ │ │ │ └────────────────┘ │ agg_request_tm_1h│
│ Reads: │ Reads: agg_ip_behavior_1h│
│ view_ai_features ml_detected_anom. agg_resource_cas_1h│
view_ip_recurrence │ │ ml_all_scores │ ml_detected_anom. │
│ Writes: │ http_logs │ ml_all_scores
│ ml_detected_anom. │ │ agg_* tables │ view_ai_features
│ ml_all_scores audit_logs view_thesis_feat. │
└──────────────────────┘ └──────────────────────┘ │ audit_logs, dicts │ │
│ └────────────────────┘ │
└──────────┬──────────────────────────┬────────────────┘
│ SELECT │ SELECT / INSERT
┌──────────────┘ └──────────────┐
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ bot-detector │ │ dashboard │
│ Python 3.11 │ │ FastAPI + Jinja2 │
│ │ │ htmx + Chart.js │
│ Lit : │ │ Tailwind CSS (CDN) │
│ view_ai_features_1h │ │ │
│ view_thesis_feat_1h │ │ 55 routes (API+pages) │
│ view_ip_recurrence │ │ 15 templates Jinja2 │
│ Écrit : │ │ 14 pages SOC │
│ ml_detected_anomalies│ │ │
│ ml_all_scores │ │ Lit : ml_*, agg_*, │
└───────────────────────┘ │ http_logs, audit_logs│
└───────────────────────┘
``` ```
## Data Flow ## Flux de données — 5 phases
### 1. Capture Phase ### Phase 1 — Capture
1. **mod-reqin-log** (Apache C module) hooks into `post_read_request`. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to `/var/run/logcorrelator/http.socket`. 1. **mod-reqin-log** (module Apache C11) intercepte chaque requête HTTP dans le hook `post_read_request`. Il sérialise method, path, headers, client IP/port en JSON et envoie le datagramme vers `/var/run/logcorrelator/http.socket`.
2. **sentinel** (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to `/var/run/logcorrelator/network.socket`. 2. **sentinel** (démon Go) capture les paquets TLS ClientHello via libpcap sur les ports configurés (défaut : 443, 8443). Il extrait les métadonnées IP/TCP, génère les empreintes JA4 et JA3, et envoie le résultat en JSON vers `/var/run/logcorrelator/network.socket`.
### 2. Correlation Phase ### Phase 2 — Corrélation
3. **correlator** (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching `src_ip:src_port` within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported via `one_to_many` matching mode where a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Correlated events merge HTTP fields (method, path, headers) with TLS fields (JA4, JA3, IP/TCP metadata) into a single `CorrelatedLog` JSON object, which is inserted into `http_logs_raw`. 3. **correlator** (démon Go, architecture hexagonale) écoute les deux sockets Unix. Il met en tampon les événements entrants et les corrèle par `src_ip:src_port` dans une fenêtre temporelle configurable (défaut : 10 s). Le mode `one_to_many` (Keep-Alive) permet de réutiliser un seul handshake TLS (source B) pour plusieurs requêtes HTTP (source A). Les événements corrélés fusionnent les champs HTTP + TLS en un objet `CorrelatedLog` JSON, inséré dans **`ja4_logs.http_logs_raw`**.
### 3. Enrichment Phase (ClickHouse) ### Phase 3 — Enrichissement (ClickHouse)
4. **mv_http_logs** materialized view automatically transforms `http_logs_raw` JSON into the structured `http_logs` table, enriching each row with: 4. **mv_http_logs** (vue matérialisée) transforme le JSON de `http_logs_raw` en la table structurée `ja4_logs.http_logs`, enrichissant chaque ligne avec :
- ASN/geo data via `dict_iplocate_asn` - Données ASN via `dict_iplocate_asn` (IP_TRIE)
- Anubis bot identification via `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country` - Identification Anubis via `dict_anubis_ip` (IP_TRIE) et `dict_anubis_asn` (FLAT) — règles IP/CIDR + ASN uniquement, avec priorité COALESCE(IP, ASN)
5. **mv_agg_host_ip_ja4_1h** and **mv_agg_header_fingerprint_1h** aggregate `http_logs` into 1-hour behavioral windows. 5. **6 vues matérialisées d'agrégation** alimentent les tables `ja4_processing.agg_*` en fenêtres comportementales d'1 heure :
- `agg_host_ip_ja4_1h` — Agrégations par (host, src_ip, ja4)
- `agg_header_fingerprint_1h` — Empreintes d'en-têtes HTTP
- `agg_path_sequences_1h` — Séquences de chemins (n-grams)
- `agg_request_timing_1h` — Métriques de timing inter-requêtes
- `agg_ip_behavior_1h` — Comportement réseau par IP
- `agg_resource_cascade_1h` — Cascades de ressources
6. **view_ai_features_1h** joins the two aggregation tables and computes 50+ ML features per `(src_ip, ja4, host)` tuple. 6. **view_ai_features_1h** joint les tables d'agrégation et calcule ~63 features ML par tuple `(src_ip, ja4, host)`.
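The Anubis lookup in step 4 (IP/CIDR rule first, ASN rule as fallback, i.e. `COALESCE(IP, ASN)`) can be illustrated outside ClickHouse. This is a minimal Python sketch, not the platform's SQL: the function names, the rule dictionaries, and the `"ALLOW"`/`"DENY"` action strings are assumptions made for illustration, and the longest-prefix match only approximates what an `IP_TRIE` dictionary does.

```python
import ipaddress

def lookup_ip_rule(ip, ip_rules):
    """Return the action of the most specific CIDR containing ip, else None.

    ip_rules is a hypothetical {cidr: action} mapping standing in for
    dict_anubis_ip (IP_TRIE layout: longest prefix wins).
    """
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, action)
    for cidr, action in ip_rules.items():
        net = ipaddress.ip_network(cidr)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, action)
    return best[1] if best else None

def anubis_action(ip, asn, ip_rules, asn_rules):
    """COALESCE(IP, ASN): the IP/CIDR rule wins, the ASN rule is the fallback."""
    ip_action = lookup_ip_rule(ip, ip_rules)
    if ip_action is not None:
        return ip_action
    # asn_rules is a hypothetical {asn: action} mapping standing in for dict_anubis_asn
    return asn_rules.get(asn)

# Illustrative rules only (documentation address ranges and a private-use ASN)
IP_RULES = {"10.0.0.0/8": "DENY", "10.1.0.0/16": "ALLOW"}
ASN_RULES = {64500: "DENY"}
```

With these sample rules, an address inside `10.1.0.0/16` resolves through the more specific CIDR even when its ASN also has a rule, which is the priority the commit describes.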
### 4. Detection Phase ### Phase 4 — Détection
7. **bot-detector** (Python) runs on a 5-minute cycle: 7. **bot-detector** (Python 3.11, 11 modules) s'exécute en cycle de 5 minutes :
- Reads `view_ai_features_1h` for the last 24 hours - **Pipeline bifurqué** :
- Separates known bots (via reputation dictionaries) from unknown traffic - **Complet** (L3→L7, ~63 features, `correlated=1`) — trafic corrélé TCP+TLS+HTTP
- Trains/loads Isolation Forest models on human-baseline traffic - **Applicatif** (L7 seulement, ~51 features, `correlated=0`) — trafic HTTP non corrélé
- Scores unknown traffic and writes anomalies to `ml_detected_anomalies` and all scores to `ml_all_scores` - **Ensemble triple voix** :
- **Extended Isolation Forest** (isotree) — scoreur non supervisé principal
- **Autoencoder** (PyTorch, architecture n→64→32→16→32→64→n) — erreur de reconstruction
- **XGBoost** — supervisé, entraîné sur les labels SOC (`soc_feedback`)
- **Score final** : `final = (1-β) × ((1-α) × eif_norm + α × ae_norm) + β × xgb_prob` (α=0.30, β=0.20)
- **Seuil adaptatif** par percentile, détection de dérive conceptuelle
- **HDBSCAN** — regroupement en campagnes d'attaque
- **Détection de navigateur** — 5 axes multifactoriels (confiance ≥ 0.55 → `LEGITIMATE_BROWSER`)
- **Explicabilité SHAP** — contribution de chaque feature au score d'anomalie
- **Niveaux de menace** : `CRITICAL`, `HIGH`, `MEDIUM`, `LOW`, `NORMAL`, `LEGITIMATE_BROWSER`, `KNOWN_BOT`, `ANUBIS_DENY`, `ANUBIS_ALLOW`
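The final-score formula above can be checked with a few lines of Python. This is a sketch of the arithmetic only, assuming the three inputs are already normalized to [0, 1]; the function name `ensemble_score` is hypothetical, and the defaults simply reuse the α=0.30 and β=0.20 values quoted in the diff.

```python
def ensemble_score(eif_norm, ae_norm, xgb_prob, alpha=0.30, beta=0.20):
    """final = (1-β) × ((1-α) × eif_norm + α × ae_norm) + β × xgb_prob"""
    # Blend the two unsupervised voices first (EIF weighted 1-α, autoencoder α)
    unsupervised = (1 - alpha) * eif_norm + alpha * ae_norm
    # Then mix in the supervised XGBoost probability with weight β
    return (1 - beta) * unsupervised + beta * xgb_prob

# Worked example: 0.8 * (0.7 * 0.8 + 0.3 * 0.5) + 0.2 * 0.9
score = ensemble_score(eif_norm=0.8, ae_norm=0.5, xgb_prob=0.9)
```

Because the weights at each level sum to 1, the result stays in [0, 1] whenever the inputs do, and the effective weights of the three voices are 0.56 (EIF), 0.24 (autoencoder), and 0.20 (XGBoost).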
### 5. Visualization Phase ### Phase 5 — Visualisation
8. **dashboard** (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts. 8. **dashboard** (FastAPI + Jinja2 + htmx + Chart.js + Tailwind CSS CDN) expose 55 routes (35 API JSON + 14 pages HTML + health/static) et 15 templates Jinja2 pour les analystes SOC :
- Pages : overview, detections, scores, traffic, ip_detail, ja4_detail, cluster_detail, campaigns, features, models, classify, tactics, reflists, network
## Component Interaction Matrix ## Matrice d'interaction des composants
| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard | | De ↓ \ Vers → | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:| |----------------|:---:|:---:|:---:|:---:|:---:|:---:|
| **mod-reqin-log** | — | — | UNIX socket (DGRAM) | — | — | — | | **mod-reqin-log** | — | — | UNIX DGRAM (source A) | — | — | — |
| **sentinel** | — | — | UNIX socket (DGRAM) | — | — | — | | **sentinel** | — | — | UNIX DGRAM (source B) | — | — | — |
| **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — | | **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — |
| **ClickHouse** | — | — | — | | — | — | | **ClickHouse** | — | — | — | MVs internes | — | — |
| **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — | | **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
| **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — | | **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
## ClickHouse Table Ownership ## Propriété des tables ClickHouse
| Table/View | Written By | Read By | ### Base `ja4_logs`
|------------|-----------|---------|
| Table / Vue | Écrit par | Lu par |
|-------------|-----------|--------|
| `http_logs_raw` | correlator | mv_http_logs (MV) | | `http_logs_raw` | correlator | mv_http_logs (MV) |
| `http_logs` | mv_http_logs (MV) | mv_agg_*, dashboard | | `http_logs` | mv_http_logs (MV) | mv_agg_* (6 MVs), dashboard |
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h | | `mv_http_logs` | — (MV automatique) | — |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| `view_ai_features_1h` | — (view) | bot-detector | ### Base `ja4_processing`
| `view_ip_recurrence` | — (view) | bot-detector |
| Table / Vue | Écrit par | Lu par |
|-------------|-----------|--------|
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h | view_ai_features_1h, dashboard |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h | view_ai_features_1h, dashboard |
| `agg_path_sequences_1h` | mv_agg_path_sequences_1h | view_thesis_features_1h |
| `agg_request_timing_1h` | mv_agg_request_timing_1h | view_thesis_features_1h |
| `agg_ip_behavior_1h` | mv_agg_ip_behavior_1h | view_thesis_features_1h |
| `agg_resource_cascade_1h` | mv_agg_resource_cascade_1h | view_thesis_features_1h |
| `ml_detected_anomalies` | bot-detector | dashboard | | `ml_detected_anomalies` | bot-detector | dashboard |
| `ml_all_scores` | bot-detector | dashboard | | `ml_all_scores` | bot-detector | dashboard |
| `audit_logs` | dashboard | dashboard | | `audit_logs` | dashboard | dashboard |
| `anubis_ip_rules` | fetch_rules.py | dict_anubis_ip |
| `anubis_asn_rules` | fetch_rules.py | dict_anubis_asn |
| `ref_bot_networks` | update-csv-data.sh | dict_bot_ip |
| `bot_ip` | update-csv-data.sh | dict_bot_ip |
| `bot_ja4` | update-csv-data.sh | dict_bot_ja4 |
| `view_ai_features_1h` | — (vue) | bot-detector |
| `view_ip_recurrence` | — (vue) | bot-detector |
| `view_thesis_features_1h` | — (vue) | bot-detector |
| `view_form_bruteforce_detected` | — (vue) | dashboard |
| `view_host_ip_ja4_rotation` | — (vue) | dashboard |
| `view_dashboard_user_agents` | — (vue) | dashboard |
| `view_dashboard_entities` | — (vue) | dashboard |
| `view_resource_cascade_1h` | — (vue) | dashboard |
## Correlation Algorithm ### Dictionnaires (7)
The correlator joins HTTP events (source A) with TLS/network events (source B) using a two-key correlation: | Dictionnaire | Layout | Source | Utilisation |
|--------------|--------|--------|-------------|
| `dict_iplocate_asn` | IP_TRIE | Fichier CSV | Géolocalisation IP → ASN |
| `dict_bot_ip` | IP_TRIE | Table `bot_ip` | IPs de bots connues |
| `dict_bot_ja4` | COMPLEX_KEY_HASHED | Table `bot_ja4` | Signatures JA4 de bots |
| `dict_browser_ja4` | COMPLEX_KEY_HASHED | Table (CSV) | Signatures JA4 de navigateurs |
| `dict_asn_reputation` | HASHED | Fichier CSV | Réputation ASN (isp/datacenter/hosting/cdn) |
| `dict_anubis_ip` | IP_TRIE | Table `anubis_ip_rules` | Règles Anubis IP/CIDR |
| `dict_anubis_asn` | FLAT | Table `anubis_asn_rules` | Règles Anubis ASN |
1. **Key**: `src_ip + src_port` — the client's source IP and ephemeral port uniquely identify a TCP connection. ## Algorithme de corrélation
2. **Time window**: Events must arrive within the configured window (default 10 seconds).
3. **Matching mode**:
- `one_to_one`: Each B event matches at most one A event (consumed after match).
- `one_to_many` (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
4. **Orphan handling**: Unmatched A events are emitted after a configurable delay (default 500 ms) with `correlated=false` and `orphan_side=A`.
## JA4/JA3 Fingerprint Format Le correlator joint les événements HTTP (source A) avec les événements TLS/réseau (source B) via une corrélation à deux clés :
1. **Clé** : `src_ip + src_port` — l'IP source et le port éphémère du client identifient une connexion TCP de manière unique.
2. **Fenêtre temporelle** : Les événements doivent arriver dans la fenêtre configurée (défaut 10 secondes).
3. **Mode de correspondance** :
- `one_to_one` : Chaque événement B correspond à un seul événement A (consommé après correspondance).
- `one_to_many` (défaut, Keep-Alive) : Un seul B (handshake TLS) peut correspondre à plusieurs A (requêtes HTTP) sur la même connexion. Le B possède un TTL configurable (défaut 120 s) réinitialisé à chaque correspondance.
4. **Gestion des orphelins** : Les événements A sans correspondance sont émis après un délai configurable (défaut 500 ms) avec `correlated=false` et `orphan_side=A`.
La sortie est insérée dans **`ja4_logs.http_logs_raw`** (base `ja4_logs`), pas dans `ja4_processing`.
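The `one_to_many` matching described above can be sketched in Python. This is a simplified illustration, not the Go correlator: class and method names are hypothetical, timestamps are passed in explicitly, and an unmatched HTTP event is returned as an orphan immediately instead of after the 500 ms delay. The time-window check is also omitted.

```python
from dataclasses import dataclass

@dataclass
class TLSEvent:
    ja4: str
    expires_at: float  # absolute time after which the handshake is dropped

class Correlator:
    """Joins HTTP events (source A) to TLS events (source B) by (src_ip, src_port)."""

    def __init__(self, ttl=120.0):
        self.ttl = ttl   # default B-event TTL quoted in the text (120 s)
        self._tls = {}   # (src_ip, src_port) -> TLSEvent

    def on_tls(self, src_ip, src_port, ja4, now):
        # Buffer the handshake so later HTTP requests on the connection can reuse it
        self._tls[(src_ip, src_port)] = TLSEvent(ja4, now + self.ttl)

    def on_http(self, src_ip, src_port, request, now):
        key = (src_ip, src_port)
        tls = self._tls.get(key)
        if tls is None or tls.expires_at < now:
            # No live handshake for this connection: emit an orphan A event
            self._tls.pop(key, None)
            return {**request, "correlated": False, "orphan_side": "A"}
        # one_to_many: keep the B event and reset its TTL on every match
        tls.expires_at = now + self.ttl
        return {**request, "ja4": tls.ja4, "correlated": True}
```

Keeping the B event after a match is what distinguishes `one_to_many` from `one_to_one`, where the entry would be deleted as soon as it is consumed.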
## Pipeline ML — bot-detector (détail)
```
view_ai_features_1h ──┐ ┌─── ml_detected_anomalies
view_thesis_feat_1h ──┤ ┌────────────┐ │
view_ip_recurrence ───┤ │ Pré- │ │
├──▶│ traitement │──▶│ Bifurcation :
│ │ + filtrage │ │ ├── Complet (correlated=1, ~63 feat.)
│ └────────────┘ │ └── Applicatif(correlated=0, ~51 feat.)
│ │
│ ┌────────────┐ │ Pour chaque branche :
│ │ Ensemble │ │ ├── Extended Isolation Forest (EIF)
│ │ triple │──▶│ ├── Autoencoder (PyTorch)
│ │ voix │ │ └── XGBoost (supervisé)
│ └────────────┘ │
│ │ Score = (1-β)×((1-α)×EIF + α×AE) + β×XGB
│ ┌────────────┐ │
│ │ Post- │ ├─── ml_all_scores
└──▶│ traitement │──▶│
│ HDBSCAN │ │ Niveaux : CRITICAL / HIGH / MEDIUM /
│ Browser 5ax│ │ LOW / NORMAL / LEGITIMATE_BROWSER /
│ SHAP │ │ KNOWN_BOT / ANUBIS_DENY / ANUBIS_ALLOW
└────────────┘ └───
```
## Référence des empreintes JA4/JA3
### JA4 ### JA4
JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure: Format moderne de fingerprinting TLS (successeur de JA3) :
``` ```
t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}_{CIPHER_HASH}_{EXT_HASH} t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}_{CIPHER_HASH}_{EXT_HASH}
``` ```
Example: `t13d1516h2_8daaf6152771_b0da82dd1658` Exemple : `t13d1516h2_8daaf6152771_b0da82dd1658`
- Prefix `t` = TLS, followed by version (`13` = TLS 1.3) - Préfixe `t` = TLS, suivi de la version (`13` = TLS 1.3)
- `d` = SNI present, `i` = SNI absent - `d` = SNI présent, `i` = SNI absent
- Cipher suite count and extension count - Nombre de cipher suites et nombre d'extensions
- SHA-256 truncated hashes of sorted cipher suites and extensions - Hash SHA-256 tronqué des cipher suites et extensions triées
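To make the field layout concrete, the example fingerprint can be split with a short Python helper. This is a sketch that assumes the fixed-width layout of the example string; `parse_ja4` is a hypothetical name, not part of sentinel. Note that the trailing `h2` in the example is the ALPN indicator from the public JA4 specification, which the template line above does not show.

```python
def parse_ja4(fingerprint):
    """Split a JA4 string such as t13d1516h2_8daaf6152771_b0da82dd1658 into fields."""
    head, cipher_hash, ext_hash = fingerprint.split("_")
    return {
        "protocol": head[0],              # 't' = TLS over TCP
        "tls_version": head[1:3],         # '13' = TLS 1.3
        "sni": head[3],                   # 'd' = SNI present, 'i' = SNI absent
        "cipher_count": int(head[4:6]),   # two-digit cipher suite count
        "ext_count": int(head[6:8]),      # two-digit extension count
        "alpn": head[8:10],               # first and last character of the first ALPN value
        "cipher_hash": cipher_hash,       # truncated SHA-256 of sorted cipher suites
        "ext_hash": ext_hash,             # truncated SHA-256 of sorted extensions
    }

fields = parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658")
```

For the example string this yields TLS 1.3, SNI present, 15 cipher suites, and 16 extensions.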
### JA3 ### JA3
JA3 is the original TLS fingerprinting format: Format original de fingerprinting TLS :
``` ```
{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS} {TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}
``` ```
The `ja3_hash` is the MD5 hash of the JA3 string. Le `ja3_hash` est le hash MD5 de la chaîne JA3.
Both fingerprints are generated by sentinel from the TLS ClientHello payload. Les deux empreintes sont générées par sentinel à partir du payload TLS ClientHello.
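The relationship between the JA3 string and `ja3_hash` is a plain MD5 digest, which the following Python sketch reproduces. The function name and the sample string are illustrative assumptions (the sample is not taken from real traffic), and sentinel itself computes this in Go.

```python
import hashlib

def ja3_hash(ja3_string):
    """Return the 32-character hex MD5 digest of a JA3 string."""
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# Illustrative JA3 string: version, ciphers, extensions, curves, point formats
sample = "771,4865-4866-4867,0-23-65281-10-11,29-23-24,0"
digest = ja3_hash(sample)
```

Each of the five comma-separated fields is a dash-joined list of decimal values, so two clients that offer the same values in the same order produce the same hash.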
## Technology Stack ## Stack technologique
| Component | Technology | | Composant | Technologie |
|-----------|-----------| |-----------|-------------|
| Packet capture | Go + libpcap (gopacket) | | Capture de paquets | Go 1.24.6 + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) | | Logging HTTP | Module Apache C11 (APR, `apxs`) |
| Event correlation | Go (hexagonal architecture) | | Corrélation d'événements | Go 1.24.6 (architecture hexagonale) |
| ML detection | Python 3.11 + scikit-learn | | Détection ML — EIF | Python 3.11 + isotree |
| Dashboard backend | FastAPI (Python) | | Détection ML — Autoencoder | Python 3.11 + PyTorch |
| Dashboard frontend | React + Vite | | Détection ML — Supervisé | Python 3.11 + XGBoost |
| Data store | ClickHouse | | Clustering de campagnes | HDBSCAN |
| Deployment | systemd, Docker, RPM | | Explicabilité | SHAP |
| IPC | UNIX datagram sockets | | Backend dashboard | FastAPI + Jinja2 (Python 3.11) |
| Frontend dashboard | htmx + Chart.js + ECharts + Tailwind CSS (CDN) |
| Magasin de données | ClickHouse 24.8 (dual-database) |
| Déploiement | systemd, Docker, RPM (Rocky 8/9/10) |
| IPC | Sockets UNIX datagramme |
| Workspace Go | `go.work` (Go 1.24.6) |
## Fichiers de schéma SQL (13)
```
shared/clickhouse/
├── 00_database.sql # Création des bases ja4_logs + ja4_processing
├── 01_raw_tables.sql # ja4_logs.http_logs_raw (TTL 2 heures)
├── 02_dictionaries.sql # dict_iplocate_asn, ref_bot_networks, bot_ip, bot_ja4
├── 03_anubis_tables.sql # anubis_ip_rules, anubis_asn_rules + 2 dictionnaires
├── 04_mv_http_logs.sql # ja4_logs.http_logs + mv_http_logs (JSON → colonnes)
├── 05_aggregation_tables.sql # agg_host_ip_ja4_1h, agg_header_fingerprint_1h + MVs + 4 dicts
├── 06_ml_tables.sql # ml_detected_anomalies, ml_all_scores, view_ip_recurrence
├── 07_ai_features_view.sql # view_ai_features_1h (feature engineering ~63 features)
├── 08_users.sql # Utilisateurs data_writer + analyst
├── 09_audit_table.sql # audit_logs (trace SOC, TTL 90 jours)
├── 10_perf_indexes.sql # Index et projections de performance
├── 11_views.sql # Vues métier du dashboard (4 vues)
├── 12_thesis_features.sql # agg_path_sequences, agg_request_timing, agg_ip_behavior,
│ # agg_resource_cascade + MVs + view_thesis_features_1h
│ # + view_resource_cascade_1h
└── deploy_schema.sh # Script de déploiement automatisé (substitution env vars)
```


@ -42,7 +42,7 @@ clickhouse-client --query "SELECT version()"
## Step 2: ClickHouse schema deployment ## Step 2: ClickHouse schema deployment
The schema is managed by **10 ordered SQL files** in `shared/clickhouse/` and an automated deployment script. The schema is managed by **13 ordered SQL files** in `shared/clickhouse/` and an automated deployment script.
### Database architecture ### Database architecture
@ -57,17 +57,21 @@ Le système utilise **deux bases de données** séparées :
``` ```
shared/clickhouse/ shared/clickhouse/
├── 00_database.sql # Creates both databases ├── 00_database.sql # Creates the ja4_logs + ja4_processing databases
├── 01_raw_tables.sql # ja4_logs.http_logs_raw (raw ingestion, 1-day TTL) ├── 01_raw_tables.sql # ja4_logs.http_logs_raw (raw ingestion, 2-hour TTL)
├── 02_dictionaries.sql # ja4_processing: dict_iplocate_asn, ref_bot_networks, bot_ip, bot_ja4 ├── 02_dictionaries.sql # ja4_processing: dict_iplocate_asn, ref_bot_networks, bot_ip, bot_ja4
├── 03_anubis_tables.sql # ja4_processing: Anubis tables and dictionaries (UA, IP, ASN, Country) ├── 03_anubis_tables.sql # ja4_processing: anubis_ip_rules, anubis_asn_rules + 2 dicts (IP/CIDR + ASN)
├── 04_mv_http_logs.sql # ja4_logs.http_logs + mv_http_logs (JSON parsing → typed columns) ├── 04_mv_http_logs.sql # ja4_logs.http_logs + mv_http_logs (JSON parsing → typed columns)
├── 05_aggregation_tables.sql # ja4_processing: agg_host_ip_ja4_1h, agg_header_fingerprint_1h + MVs ├── 05_aggregation_tables.sql # ja4_processing: agg_host_ip_ja4_1h, agg_header_fingerprint_1h + MVs + 4 dicts
├── 06_ml_tables.sql # ja4_processing: ml_detected_anomalies, ml_all_scores, view_ip_recurrence ├── 06_ml_tables.sql # ja4_processing: ml_detected_anomalies, ml_all_scores, view_ip_recurrence
├── 07_ai_features_view.sql # ja4_processing: view_ai_features_1h (feature engineering) ├── 07_ai_features_view.sql # ja4_processing: view_ai_features_1h (feature engineering, ~63 features)
├── 08_users.sql # data_writer and analyst users + permissions ├── 08_users.sql # data_writer and analyst users + permissions
├── 09_audit_table.sql # ja4_processing: audit_logs (SOC audit trail, 90-day TTL) ├── 09_audit_table.sql # ja4_processing: audit_logs (SOC audit trail, 90-day TTL)
└── deploy_schema.sh # Automated deployment script ├── 10_perf_indexes.sql # Performance indexes and projections on existing tables
├── 11_views.sql # Dashboard business views (bruteforce, rotation, user_agents, entities)
├── 12_thesis_features.sql # Advanced aggregation tables (path_sequences, request_timing,
│ # ip_behavior, resource_cascade) + MVs + thesis views
└── deploy_schema.sh # Automated deployment script (env var substitution)
``` ```
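As a rough illustration of what "ordered SQL files" plus "env var substitution" could mean in practice, the Python sketch below sorts schema files by their two-digit prefix and rewrites the default database names. It is not the logic of `deploy_schema.sh`, which this diff does not show: the helper names `ordered_sql_files` and `substitute_db_names` are hypothetical, and replacing the literal names `ja4_logs` and `ja4_processing` is an assumption about how the substitution works.

```python
import re

# Assumption: schema files are named NN_<name>.sql and applied in NN order.
SQL_FILE = re.compile(r"^\d{2}_.+\.sql$")


def ordered_sql_files(names):
    """Return only the numbered .sql files, sorted by numeric prefix."""
    sql_files = [n for n in names if SQL_FILE.match(n)]
    return sorted(sql_files, key=lambda n: (int(n[:2]), n))


def substitute_db_names(sql_text, logs_db, processing_db):
    """Swap the default database names in one pass (hypothetical scheme)."""
    mapping = {"ja4_logs": logs_db, "ja4_processing": processing_db}
    pattern = re.compile(r"\b(ja4_logs|ja4_processing)\b")
    return pattern.sub(lambda m: mapping[m.group(1)], sql_text)


if __name__ == "__main__":
    listing = ["10_perf_indexes.sql", "deploy_schema.sh", "00_database.sql"]
    for name in ordered_sql_files(listing):
        print(name)
    print(substitute_db_names("CREATE DATABASE ja4_logs", "dev_logs", "dev_proc"))
```

Doing the replacement in a single regular-expression pass avoids a chained `str.replace` rewriting a name that an earlier replacement just introduced.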
### Deployment with default names ### Deployment with default names
@ -109,8 +113,9 @@ clickhouse-client --query "SHOW TABLES FROM ja4_logs"
# Tables in ja4_processing # Tables in ja4_processing
clickhouse-client --query "SHOW TABLES FROM ja4_processing" clickhouse-client --query "SHOW TABLES FROM ja4_processing"
# Expected: agg_header_fingerprint_1h, agg_host_ip_ja4_1h, anubis_asn_rules, # Expected: agg_header_fingerprint_1h, agg_host_ip_ja4_1h, agg_ip_behavior_1h,
# anubis_country_rules, anubis_ip_rules, anubis_ua_rules, audit_logs, # agg_path_sequences_1h, agg_request_timing_1h, agg_resource_cascade_1h,
# anubis_asn_rules, anubis_ip_rules, audit_logs, bot_ip, bot_ja4,
# ml_all_scores, ml_detected_anomalies, ref_bot_networks, ... # ml_all_scores, ml_detected_anomalies, ref_bot_networks, ...
# Loaded dictionaries # Loaded dictionaries
@ -161,10 +166,9 @@ GRANT INSERT ON ja4_processing.ml_all_scores TO bot_writer;
GRANT INSERT ON ja4_processing.ml_detected_anomalies TO bot_writer; GRANT INSERT ON ja4_processing.ml_detected_anomalies TO bot_writer;
-- Read/write: Anubis tables (ja4_processing), for fetch_rules.py -- Read/write: Anubis tables (ja4_processing), for fetch_rules.py
GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_ua_rules TO bot_writer; -- Only the IP/CIDR and ASN rules are used (UA and Country were removed)
GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_ip_rules TO bot_writer; GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_ip_rules TO bot_writer;
GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_asn_rules TO bot_writer; GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_asn_rules TO bot_writer;
GRANT SELECT, INSERT, ALTER ON ja4_processing.anubis_country_rules TO bot_writer;
GRANT SYSTEM RELOAD DICTIONARY ON *.* TO bot_writer; GRANT SYSTEM RELOAD DICTIONARY ON *.* TO bot_writer;
``` ```
@ -322,9 +326,12 @@ Variables d'environnement clés :
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Logs database | | `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Logs database |
| `CLICKHOUSE_USER` | `admin` | User (use `bot_writer` in production) | | `CLICKHOUSE_USER` | `admin` | User (use `bot_writer` in production) |
| `CLICKHOUSE_PASSWORD` | — | Password | | `CLICKHOUSE_PASSWORD` | — | Password |
| `ANOMALY_THRESHOLD` | `-0.05` | Anomaly detection threshold | | `ANOMALY_THRESHOLD` | `-0.05` | Anomaly detection threshold (fallback) |
| `CYCLE_INTERVAL_SEC` | `300` | Interval between detection cycles (seconds) | | `CYCLE_INTERVAL_SEC` | `300` | Interval between detection cycles (seconds) |
| `RETRAIN_INTERVAL_HOURS` | `24` | Model retraining interval | | `RETRAIN_INTERVAL_HOURS` | `24` | Retraining interval for the models |
| `AE_WEIGHT` | `0.30` | Autoencoder weight in the ensemble (α) |
| `XGB_WEIGHT` | `0.20` | XGBoost weight in the ensemble (β) |
| `ENABLE_MULTIWINDOW` | `false` | Enables the 24h variants (Complet/Applicatif) |
| `HEALTH_PORT` | `8080` | Port of the /health endpoint |
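To make the roles of `AE_WEIGHT` (α) and `XGB_WEIGHT` (β) concrete, here is a minimal Python sketch of a weighted three-model blend. It assumes that each model's score has already been normalized to [0, 1] and that the Extended Isolation Forest receives the remaining weight 1 − α − β. Neither assumption is confirmed by this document, and `ensemble_score` is a hypothetical name, not the function used in `scoring.py`.

```python
def ensemble_score(eif, ae, xgb, ae_weight=0.30, xgb_weight=0.20):
    """Blend three anomaly scores; EIF gets the leftover weight (assumption)."""
    eif_weight = 1.0 - ae_weight - xgb_weight
    # Tolerate tiny floating-point undershoot, reject real misconfiguration.
    if eif_weight < -1e-9:
        raise ValueError("AE_WEIGHT + XGB_WEIGHT must not exceed 1")
    eif_weight = max(0.0, eif_weight)
    return eif_weight * eif + ae_weight * ae + xgb_weight * xgb


if __name__ == "__main__":
    # With the documented defaults the EIF would carry a weight of about 0.50.
    print(ensemble_score(eif=0.9, ae=0.4, xgb=0.1))
```

Under this reading, raising either weight shifts influence away from the unsupervised EIF score, so the two variables would have to be tuned together.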
```bash ```bash
@ -424,29 +431,76 @@ curl -s http://localhost:8000/api/health
--- ---
## Utility scripts
The scripts in `scripts/` simplify initializing and maintaining the stack:
### init-stack.sh: full initialization
Deploys the SQL schema, loads the reference CSV data, and checks that every component is operational:
```bash
./scripts/init-stack.sh # initialize the dev stack
./scripts/init-stack.sh --container my-ch-1 # target a specific ClickHouse container
./scripts/init-stack.sh --reset # DROP databases, recreate everything
./scripts/init-stack.sh --import-prod # init + import production data
```
### import-prod-data.sh: production data import
Imports data exported in Native format, applying an automatic time shift:
```bash
./scripts/import-prod-data.sh # automatic shift (max(time) → now)
./scripts/import-prod-data.sh --shift 3600 # manual shift (seconds)
./scripts/import-prod-data.sh --no-truncate # keep existing data
```
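The automatic shift (`max(time) → now`) can be pictured with a short Python sketch. It assumes the script moves every row by the same offset so that the most recent row lands on the import time, which is one plausible reading of the comment above. The helper names `compute_shift` and `apply_shift` are hypothetical and do not come from the script.

```python
from datetime import datetime, timedelta, timezone


def compute_shift(timestamps, now):
    """Offset that moves the newest timestamp onto `now`."""
    return now - max(timestamps)


def apply_shift(timestamps, shift):
    """Shift every timestamp by the same offset, preserving relative gaps."""
    return [ts + shift for ts in timestamps]


if __name__ == "__main__":
    rows = [
        datetime(2026, 4, 1, 10, 0, tzinfo=timezone.utc),
        datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc),
    ]
    now = datetime(2026, 4, 9, 12, 0, tzinfo=timezone.utc)
    auto = compute_shift(rows, now)              # automatic mode
    manual = timedelta(seconds=3600)             # like --shift 3600
    print(apply_shift(rows, auto))
    print(apply_shift(rows, manual))
```

A uniform offset keeps inter-request timing intact, which matters for any feature derived from request intervals.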
### reload-prod-logs.sh: reload from production
Exports `http_logs` from production and re-imports them into the dev database:
```bash
./scripts/reload-prod-logs.sh # automatic shift
./scripts/reload-prod-logs.sh --days 7 # export the last N days
./scripts/reload-prod-logs.sh --cron # quiet mode (for crontab)
```
### update-csv-data.sh: CSV data update
Downloads and generates all reference CSV files (bot IPs, JA4, ASN):
```bash
./scripts/update-csv-data.sh # generate everything
./scripts/update-csv-data.sh --install-stubs # also copy the test stubs
```
---
## Network diagram summary ## Network diagram summary
``` ```
┌──────────────┐ Unix socket ┌──────────────┐ ClickHouse ┌──────────────┐ ┌──────────────┐ Unix socket ┌──────────────┐ ClickHouse ┌──────────────────
│ mod-reqin-log│──── http.socket ─────→│ │ INSERT INTO │ │ │ mod-reqin-log│──── http.socket ─────→│ │ INSERT INTO │
│ (Apache) │ (source A) │ correlator │───→ ja4_logs. │ ClickHouse │ │ (Apache C11) │ (source A) │ correlator │───→ ja4_logs. │ ClickHouse
└──────────────┘ │ │ http_logs_raw │ │ └──────────────┘ │ │ http_logs_raw │
│ │ │ ┌─────────┐ │ │ │ │ ┌──────────────┐ │
┌──────────────┐ Unix socket │ │ MV parse JSON │ │ja4_logs │ │ ┌──────────────┐ Unix socket │ │ MV parse JSON │ │ ja4_logs │ │
│ sentinel │──── network.socket ──→│ │ ↓ │ │ _raw │ │ │ sentinel │──── network.socket ──→│ │ ↓ │ │ _raw → _logs│ │
│ (TLS capture)│ (source B) └──────────────┘ ja4_logs. │ │ _parsed │ │ (Go+libpcap) │ (source B) └──────────────┘ ja4_logs. │ └──────────────┘
└──────────────┘ http_logs │ └─────────┘ └──────────────┘ http_logs │
│ │ │ │ ┌──────────────┐
MV agrégation │ ┌─────────┐ 6 MVs agrégation│ │ ja4_processing│
↓ │ │ja4_ │ │ ↓ │ │ agg_* (×6) │ │
┌──────────────┐ SELECT features ┌──────────────┐ ja4_processing. │ │processing ┌──────────────┐ SELECT features ┌──────────────┐ ja4_processing. │ │ ml_* (×2) │
│ bot-detector │←─── view_ai_features ─│ │ agg_*, view_* │ _agg │ │ │ bot-detector │←─── view_ai_features ─│ │ agg_*, view_* │ │ views, dicts│ │
│ (ML/Python) │ │ ClickHouse │ │ _ml │ │ │ (Python 3.11)│ view_thesis_feat │ ClickHouse │ │ │ audit_logs │ │
│───→ INSERT scores ────→│ │ ml_all_scores │ │ _views │ EIF+AE+XGB │ │ │ ml_all_scores │ └──────────────┘
└──────────────┘ ml_detected_* └──────────────┘ ml_detected_* │ └─────────┘ │ │───→ INSERT scores ────→│ │ ml_detected_* │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────────
┌──────────────┐ SELECT *
│ dashboard │←─── ja4_processing.* ───────────────────────────┘ ┌──────────────┐ SELECT * │
(FastAPI) │←─── ja4_logs.http_logs ─────────────────────────┘ dashboard │←─── ja4_processing.ml_*, agg_*, views ────────────────────────────┘
│ (FastAPI) │←─── ja4_logs.http_logs ───────────────────────────────────────────┘
└──────────────┘ └──────────────┘
``` ```


@ -1,65 +1,77 @@
# Development Guide # Guide de développement
This guide covers building, testing, packaging, and extending the ja4-platform monorepo. All build and test operations run inside Docker — no native Go, Python, or C toolchains are required on the host. Ce guide couvre la compilation, les tests, le packaging et l'extension du monorepo ja4-platform. Toutes les opérations de build et de test s'exécutent dans Docker — aucune toolchain Go, Python ou C n'est requise sur la machine hôte.
## Prerequisites ## Prérequis
| Requirement | Minimum Version | Notes | | Outil | Version minimale | Notes |
|-------------|----------------|-------| |-------|-----------------|-------|
| Docker | 20.10+ | BuildKit enabled (`DOCKER_BUILDKIT=1`) | | Docker | 20.10+ | BuildKit activé (`DOCKER_BUILDKIT=1`) |
| Docker Compose | 2.x | For bot-detector and dashboard | | Docker Compose | 2.x | Pour bot-detector et dashboard |
| make | 3.81+ | GNU Make | | make | 3.81+ | GNU Make |
| git | 2.x | For version tagging | | git | 2.x | Pour le versioning |
No Go, Python, or C compilers are needed on the host machine. Aucun compilateur Go, Python ou C n'est nécessaire sur la machine hôte.
## Building All Services ## Compilation de tous les services
```bash ```bash
make build-all make build-all
``` ```
This builds Docker images for: Cela construit les images Docker pour :
- `ja4-platform/sentinel:latest` - `ja4-platform/sentinel:latest`
- `ja4-platform/correlator:latest` - `ja4-platform/correlator:latest`
- `ja4-platform/bot-detector:latest` - `ja4-platform/bot-detector:latest`
- `ja4-platform/dashboard:latest` - `ja4-platform/dashboard:latest`
mod-reqin-log is an Apache module and is only built as part of the RPM packaging process. mod-reqin-log est un module Apache et n'est construit que dans le cadre du processus RPM.
### Building Individual Services ### Compilation individuelle
```bash ```bash
make build-sentinel # Go binary in Docker make build-sentinel # Binaire Go dans Docker
make build-correlator # Go binary in Docker make build-correlator # Binaire Go dans Docker
make build-bot-detector # Python image make build-bot-detector # Image Python
make build-dashboard # FastAPI + React image make build-dashboard # Image FastAPI + Jinja2
``` ```
## Running Tests ## Exécution des tests
```bash ```bash
make test-all make test-all
``` ```
### Per-Service Testing ### Tests par service
| Service | Command | Details | | Service | Commande | Détails |
|---------|---------|---------| |---------|----------|---------|
| sentinel | `make test-sentinel` | Go tests with `-race` flag, requires `NET_RAW`/`NET_ADMIN` caps | | sentinel | `make test-sentinel` | Tests Go avec `-race`, nécessite `NET_RAW`/`NET_ADMIN` |
| correlator | `make test-correlator` | Go tests with 80% coverage gate enforced | | correlator | `make test-correlator` | Tests Go avec seuil de couverture 80% |
| mod-reqin-log | `make test-mod-reqin-log` | C unit tests (JSON serialization, config parsing, header handling) | | mod-reqin-log | `make test-mod-reqin-log` | Tests unitaires C (sérialisation JSON, config, headers) |
| bot-detector | `make test-bot-detector` | Python pytest suite | | bot-detector | `make test-bot-detector` | Suite pytest Python |
| dashboard | `make test-dashboard` | Python pytest for FastAPI routes | | dashboard | `make test-dashboard` | pytest pour les routes FastAPI |
| ja4_common (Python) | `make test-ja4common-python` | Shared Python library tests | | ja4_common (Python) | `make test-ja4common-python` | Tests de la librairie Python partagée |
## Building RPM Packages ## Integration tests
The full-stack integration tests run against Docker Compose with a real ClickHouse instance:
```bash
make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep # same, but leaves the stack running
make test-integration-down # tears down the integration stack
```
The test suite lives in `tests/integration/` and resets the database between runs.
## Building RPM packages
```bash ```bash
make rpm-all make rpm-all
``` ```
Builds RPMs for sentinel, correlator, and mod-reqin-log targeting Rocky Linux 8/9/10: Construit les RPMs pour sentinel, correlator et mod-reqin-log ciblant Rocky Linux 8/9/10 :
```bash ```bash
make rpm-sentinel # → services/sentinel/dist/rpm/ make rpm-sentinel # → services/sentinel/dist/rpm/
@ -67,26 +79,26 @@ make rpm-correlator # → services/correlator/dist/rpm/
make rpm-mod-reqin-log # → services/mod-reqin-log/dist/rpm/ make rpm-mod-reqin-log # → services/mod-reqin-log/dist/rpm/
``` ```
Each RPM build uses a multi-stage Docker pipeline: Chaque build RPM utilise un pipeline Docker multi-étapes :
1. Builder stage compiles the binary (Go) or shared object (C) 1. L'étape builder compile le binaire (Go) ou l'objet partagé (C)
2. RPM builder stage runs `rpmbuild` for each target distro (el8, el9, el10) 2. L'étape rpmbuild exécute `rpmbuild` pour chaque distro cible (el8, el9, el10)
3. Output stage copies RPMs to the host via `--output type=local` 3. L'étape de sortie copie les RPMs sur l'hôte via `--output type=local`
### Distribution Packages ### Paquets de distribution
```bash ```bash
make dist # Alias for rpm-all make dist # Alias for rpm-all
# RPMs in services/<service>/dist/rpm/el{8,9,10}/ # RPMs in services/<service>/dist/rpm/el{8,9,10}/
``` ```
## Local Development Workflow ## Développement local
### Go Services (sentinel, correlator) ### Services Go (sentinel, correlator)
The `go.work` workspace links Go modules: Le fichier `go.work` relie les modules Go :
``` ```
go 1.21 go 1.24.6
use ( use (
./services/sentinel ./services/sentinel
@ -95,62 +107,119 @@ use (
) )
``` ```
If you have Go 1.21+ installed locally, you can develop without Docker: Si Go 1.24+ est installé localement, le développement sans Docker est possible :
```bash ```bash
# Run sentinel tests locally # Tests sentinel en local
cd services/sentinel && go test ./... -race -v cd services/sentinel && go test ./... -race -v
# Run correlator tests locally # Tests correlator en local
cd services/correlator && go test ./... -race -cover -v cd services/correlator && go test ./... -race -cover -v
# Build sentinel binary locally (requires libpcap-dev) # Compilation du binaire sentinel (nécessite libpcap-dev)
cd services/sentinel && go build -o ja4sentinel ./cmd/ja4sentinel/ cd services/sentinel && go build -o ja4sentinel ./cmd/ja4sentinel/
``` ```
### Python Services (bot-detector, dashboard) ### Services Python (bot-detector, dashboard)
```bash ```bash
# Install shared library in development mode # Installer la librairie partagée en mode développement
cd shared/python/ja4_common && pip install -e . cd shared/python/ja4_common && pip install -e .
# Run bot-detector locally # Lancer le bot-detector localement
cd services/bot-detector && pip install -r bot_detector/requirements.txt cd services/bot-detector && pip install -r bot_detector/requirements.txt
python -m bot_detector.bot_detector python -m bot_detector
# Run dashboard locally # Lancer le dashboard localement
cd services/dashboard && pip install -r backend/requirements.txt cd services/dashboard && pip install -r backend/requirements.txt
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000 uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
``` ```
### C Module (mod-reqin-log) #### Python dependencies for local bot-detector development
Requires `apxs` (Apache extension tool) and development headers: The main dependencies are listed in `services/bot-detector/bot_detector/requirements.txt`. The key ML libraries are:
| Library | Usage |
|-----------|-------|
| `isotree` | Extended Isolation Forest (primary unsupervised scorer) |
| `torch` | Autoencoder (PyTorch, n→64→32→16→32→64→n architecture) |
| `xgboost` | Supervised model (trained on SOC labels) |
| `hdbscan` | Attack-campaign clustering |
| `shap` | Explainability of anomaly scores |
| `scikit-learn` | Isolation Forest fallback when isotree is unavailable |
| `clickhouse-connect` | ClickHouse client (via ja4_common) |
#### bot-detector module layout (11 modules)
```
services/bot-detector/bot_detector/
├── __init__.py # Package init
├── __main__.py # Entry point (python -m bot_detector)
├── config.py # Configuration (os.getenv, NOT pydantic-settings)
├── log.py # Logging setup
├── infra.py # ClickHouse connection, health check
├── preprocessing.py # Feature engineering, filtering, normalization
├── models.py # EIF, Autoencoder, XGBoost (training + scoring)
├── scoring.py # Triple-voice ensemble, adaptive thresholds
├── browser.py # Multi-factor browser detection along 5 axes
├── pipeline.py # Detection-cycle orchestration
├── cycle.py # Main loop (5-minute cycle)
└── tests/ # pytest tests (self-contained, no heavy imports)
```
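Since `config.py` is described as reading plain environment variables rather than using pydantic-settings, a configuration loader in that style might look like the Python sketch below. The variable names and defaults are the ones listed for the bot-detector in deployment.md. The `DetectorConfig` dataclass, the `load_config` function, and the accepted spellings for a true boolean are assumptions made for illustration, not the contents of the real module.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    anomaly_threshold: float
    cycle_interval_sec: int
    retrain_interval_hours: int
    ae_weight: float
    xgb_weight: float
    enable_multiwindow: bool
    health_port: int


def load_config(env=None):
    """Build the config from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes"}  # assumed spellings, not confirmed
    return DetectorConfig(
        anomaly_threshold=float(env.get("ANOMALY_THRESHOLD", "-0.05")),
        cycle_interval_sec=int(env.get("CYCLE_INTERVAL_SEC", "300")),
        retrain_interval_hours=int(env.get("RETRAIN_INTERVAL_HOURS", "24")),
        ae_weight=float(env.get("AE_WEIGHT", "0.30")),
        xgb_weight=float(env.get("XGB_WEIGHT", "0.20")),
        enable_multiwindow=env.get("ENABLE_MULTIWINDOW", "false").strip().lower() in truthy,
        health_port=int(env.get("HEALTH_PORT", "8080")),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing a mapping instead of calling `os.getenv` directly is a small deviation chosen here so the loader can be exercised in tests without mutating the process environment.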
### C module (mod-reqin-log)
Requires `apxs` (the Apache extension tool) and the development headers:
```bash ```bash
cd services/mod-reqin-log cd services/mod-reqin-log
make build # Compiles mod_reqin_log.so make build # Compile mod_reqin_log.so
make test # Runs unit tests make test # Exécute les tests unitaires
make rpm # Builds RPM packages make rpm # Construit les paquets RPM
``` ```
## Adding a New Service ## Utility scripts
### Go Service The scripts in `scripts/` simplify initializing and managing the development environment:
1. Create the service directory: | Script | Make target | Description |
|--------|--------------|-------------|
| `init-stack.sh` | `make init-stack` | Full initialization of the ClickHouse stack (schema + CSV data) |
| `import-prod-data.sh` | `make import-prod-data` | Imports production data with a time shift |
| `reload-prod-logs.sh` | `make reload-prod-logs` | Production export → dev re-import with a time shift |
| `update-csv-data.sh` | — | Downloads and generates all reference CSV data |
```bash
# Full initialization (schema + CSV data + production import)
make init-and-import
# Initialization only (schema + CSV)
make init-stack
# Import pre-exported production data
make import-prod-data
# Reload from production
make reload-prod-logs
```
## Adding a new service
### Go service
1. Create the service directory:
```bash ```bash
mkdir -p services/my-service/cmd/my-service mkdir -p services/my-service/cmd/my-service
mkdir -p services/my-service/internal mkdir -p services/my-service/internal
``` ```
2. Initialize the Go module: 2. Initialiser le module Go :
```bash ```bash
cd services/my-service cd services/my-service
go mod init github.com/antitbone/ja4/my-service go mod init github.com/antitbone/ja4/my-service
``` ```
3. Add to `go.work`: 3. Ajouter au `go.work` :
``` ```
use ( use (
./services/sentinel ./services/sentinel
@ -160,7 +229,7 @@ make rpm # Builds RPM packages
) )
``` ```
4. Import the shared library: 4. Importer la librairie partagée :
```go ```go
import ( import (
"github.com/antitbone/ja4/ja4common/logger" "github.com/antitbone/ja4/ja4common/logger"
@ -169,7 +238,7 @@ make rpm # Builds RPM packages
) )
``` ```
5. Add Makefile targets: 5. Ajouter les cibles Makefile :
```makefile ```makefile
build-my-service: build-my-service:
docker build -f services/my-service/Dockerfile -t ja4-platform/my-service:latest . docker build -f services/my-service/Dockerfile -t ja4-platform/my-service:latest .
@ -179,21 +248,21 @@ make rpm # Builds RPM packages
docker run --rm ja4-platform/my-service-tests:latest docker run --rm ja4-platform/my-service-tests:latest
``` ```
6. Update `build-all` and `test-all` dependencies. 6. Mettre à jour les dépendances `build-all` et `test-all`.
### Python Service ### Service Python
1. Create the service directory with a `requirements.txt` or `pyproject.toml`. 1. Créer le répertoire du service avec un `requirements.txt` ou `pyproject.toml`.
2. Add `ja4-common` as a dependency (installed from `shared/python/ja4_common`). 2. Ajouter `ja4-common` comme dépendance (installée depuis `shared/python/ja4_common`).
3. Use `from ja4_common.clickhouse import get_client` for ClickHouse access. 3. Utiliser `from ja4_common.clickhouse import get_client` pour l'accès ClickHouse.
4. Add Makefile targets following the bot-detector/dashboard pattern. 4. Ajouter les cibles Makefile en suivant le modèle bot-detector/dashboard.
## go.work Workspace ## Workspace go.work
The `go.work` file at the repository root links all Go modules, allowing cross-module development without publishing: Le fichier `go.work` à la racine du dépôt relie tous les modules Go, permettant le développement cross-module sans publication :
``` ```
go 1.21 go 1.24.6
use ( use (
./services/sentinel ./services/sentinel
@ -202,30 +271,34 @@ use (
) )
``` ```
When adding a new Go module: Lors de l'ajout d'un nouveau module Go :
1. `go mod init` in the service directory 1. `go mod init` dans le répertoire du service
2. Add the path to `go.work` 2. Ajouter le chemin dans `go.work`
3. Reference shared packages via their module path: `github.com/antitbone/ja4/ja4common/...` 3. Référencer les packages partagés via leur chemin de module : `github.com/antitbone/ja4/ja4common/...`
4. Run `go work sync` to update the workspace 4. Exécuter `go work sync` pour mettre à jour le workspace
## ja4_common Python Package Both Go services carry a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence during local development; the `replace` directive is needed for Docker builds, where `go.work` is not available.
The shared Python package (`shared/python/ja4_common`) provides: ## Package Python ja4_common
- `ClickHouseSettings` — pydantic-settings model reading from `.env` Le package Python partagé (`shared/python/ja4_common`) fournit :
- `ClickHouseClient` — singleton client with auto-reconnect
- `get_client()` — module-level singleton accessor
### Extending ja4_common - `ClickHouseSettings` — modèle pydantic-settings lisant depuis `.env`
- `ClickHouseClient` — client singleton avec reconnexion automatique
- `get_client()` — accesseur singleton au niveau du module
1. Add new modules under `shared/python/ja4_common/ja4_common/` > **Note:** the dashboard does NOT use ja4_common; it has its own lightweight `clickhouse-connect` client in `backend/database.py`.
2. Export them in `__init__.py`
3. Add dependencies to `pyproject.toml`
4. Run tests: `make test-ja4common-python`
### Using in a New Service ### Extension de ja4_common
Add to `requirements.txt`: 1. Ajouter de nouveaux modules sous `shared/python/ja4_common/ja4_common/`
2. Les exporter dans `__init__.py`
3. Ajouter les dépendances dans `pyproject.toml`
4. Lancer les tests : `make test-ja4common-python`
### Utilisation dans un nouveau service
Ajouter dans `requirements.txt` :
``` ```
ja4-common @ file:///app/shared/python/ja4_common ja4-common @ file:///app/shared/python/ja4_common
``` ```
@ -236,11 +309,11 @@ COPY shared/python/ja4_common /app/shared/python/ja4_common
RUN pip install /app/shared/python/ja4_common RUN pip install /app/shared/python/ja4_common
``` ```
## Environment Variables ## Variables d'environnement
Each service reads configuration from environment variables and/or YAML config files. See individual service documentation for the full reference: Chaque service lit sa configuration depuis des variables d'environnement et/ou des fichiers YAML. Consultez la documentation de chaque service pour la référence complète :
- [Sentinel configuration](services/sentinel.md#configuration-reference) - [Configuration du Sentinel](services/sentinel.md#configuration-reference)
- [Correlator configuration](services/correlator.md#configuration-reference) - [Configuration du Correlator](services/correlator.md#configuration-reference)
- [Bot Detector configuration](services/bot-detector.md#environment-variables) - [Configuration du Bot Detector](services/bot-detector.md#environment-variables)
- [Dashboard configuration](services/dashboard.md#configuration) - [Configuration du Dashboard](services/dashboard.md#configuration)