docs: rewrite architecture/README, update deployment/development

- architecture.md: complete rewrite (French) with dual-database diagram,
  5-phase data flow, full table ownership, triple-voice ML pipeline,
  7 dictionaries, 13 SQL files, updated tech stack
- README.md: complete rewrite (English) with updated pipeline diagram,
  services table, scripts section, integration tests, full doc index,
  Go 1.24.6 workspace
- deployment.md: update to 13 SQL files, remove Anubis UA/Country refs,
  add scripts section, add ensemble env vars (AE_WEIGHT, XGB_WEIGHT),
  update verification queries and network diagram
- development.md: translate to French, add bot-detector 11-module structure,
  add Python ML deps, add scripts/integration test sections,
  fix bot-detector run command, add make targets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-09 22:00:29 +02:00
parent 7bdc6e2865
commit d05969867f
4 changed files with 578 additions and 282 deletions

146
README.md

@@ -1,59 +1,63 @@
# ja4-platform
**ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, detects anomalous behavior using machine learning (Isolation Forest), and presents results through a SOC analyst dashboard — all backed by ClickHouse as the central data store.
**ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, applies triple-voice ML anomaly detection (Extended Isolation Forest + Autoencoder + XGBoost), and surfaces results through a SOC analyst dashboard — all backed by ClickHouse with a dual-database architecture.
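As a rough illustration of how three detector scores might be blended into a single anomaly score, the sketch below takes a weighted average. `AE_WEIGHT` and `XGB_WEIGHT` are the ensemble environment variables named in the commit message; the default values, the rule that the Extended Isolation Forest receives the remaining weight, and the function name `ensemble_score` are assumptions made for this example and are not taken from the bot-detector source.

```python
import os


def ensemble_score(eif: float, ae: float, xgb: float, env=os.environ) -> float:
    """Blend three anomaly scores (each assumed to lie in [0, 1]) into one."""
    # Hypothetical defaults; the real service may use different values.
    ae_w = float(env.get("AE_WEIGHT", "0.3"))
    xgb_w = float(env.get("XGB_WEIGHT", "0.3"))
    # Assumption: the Extended Isolation Forest takes whatever weight is left.
    eif_w = 1.0 - ae_w - xgb_w
    if min(ae_w, xgb_w, eif_w) < 0:
        raise ValueError("weights must be non-negative and sum to at most 1")
    return eif_w * eif + ae_w * ae + xgb_w * xgb
```

Passing a plain dict as `env` makes the weighting easy to exercise without touching the process environment.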
## Pipeline Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Linux Server (Apache)                                                       │
│                                                                             │
│ ┌─────────────────┐        ┌─────────────────────┐                          │
│ │ mod-reqin-log   │───────▶│ UNIX socket (HTTP)  │───────┐                  │
│ │ (Apache module) │  JSON  │ /var/run/logcorr/   │       │                  │
│ │ C · httpd DSO   │        │ http.socket         │       │                  │
│ └─────────────────┘        └─────────────────────┘       │                  │
│ ┌─────────────────┐        ┌─────────────────────┐    ┌──▼──────────────┐   │
│ │ sentinel        │───────▶│ UNIX socket (TLS)   │───▶│ correlator      │   │
│ │ (TLS capture)   │  JSON  │ /var/run/logcorr/   │    │ (event join)    │   │
│ │ Go · libpcap    │        │ network.socket      │    │ Go · hex. arch  │   │
│ └─────────────────┘        └─────────────────────┘    └────────┬────────┘   │
└────────────────────────────────────────────────────────────────┼────────────┘
                                                                 │ INSERT
                                                                 ▼
                                                        ┌──────────────────┐
                                                        │    ClickHouse    │
                                                        │  ja4_processing  │
                                                        │   (all tables)   │
                                                        └────────┬─────────┘
                                                                 │ SELECT
                                   ┌─────────────────────────────┤
                                   ▼                             ▼
                         ┌──────────────────┐           ┌──────────────────┐
                         │ bot-detector     │           │ dashboard        │
                         │ (ML anomaly det) │           │ (SOC web UI)     │
                         │ Python · sklearn │           │ FastAPI + React  │
                         └──────────────────┘           └──────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Linux Server (Apache)                                                       │
│                                                                             │
│  ┌─────────────────┐   UNIX socket (DGRAM)          ┌──────────────────┐    │
│  │ mod-reqin-log   │──── http.socket ──────────────▶│                  │    │
│  │ (Apache C11)    │     (source A)                 │ correlator       │    │
│  └─────────────────┘                                │ (Go · hex.       │    │
│                                                     │  architecture)   │    │
│  ┌─────────────────┐   UNIX socket (DGRAM)          │                  │    │
│  │ sentinel        │──── network.socket ───────────▶│ Joins by         │    │
│  │ (Go · libpcap)  │     (source B)                 │ src_ip:src_port  │    │
│  │ JA4/JA3 gen.    │                                └───┬──────────────┘    │
│  └─────────────────┘                                    │                   │
└─────────────────────────────────────────────────────────┼───────────────────┘
                                                          │ INSERT
                                                          ▼
                      ┌──────────────────────────────────────┐
                      │           ClickHouse 24.8            │
                      │                                      │
                      │   ja4_logs           ja4_processing  │
                      │ ┌─────────────┐     ┌──────────────┐ │
                      │ │ _raw → MV   │────▶│ agg_* (×6)   │ │
                      │ │ → http_logs │     │ ml_* (×2)    │ │
                      │ └─────────────┘     │ views, dicts │ │
                      │                     └──────────────┘ │
                      └─────────┬───────────────┬────────────┘
                    ┌───────────┘               └───────────┐
                    ▼                                       ▼
         ┌────────────────────┐                  ┌────────────────────┐
         │ bot-detector       │                  │ dashboard          │
         │ Python 3.11        │                  │ FastAPI + Jinja2   │
         │ EIF + AE + XGBoost │                  │ htmx + Chart.js    │
         │ HDBSCAN · SHAP     │                  │ 55 routes · 14 pp  │
         └────────────────────┘                  └────────────────────┘
```
## Services
| Service | Language | Purpose | Interface |
|---------|----------|---------|-----------|
| [sentinel](docs/services/sentinel.md) | Go | Live TLS packet capture, JA4/JA3 fingerprint generation | UNIX socket (`network.socket`) |
| [mod-reqin-log](docs/services/mod-reqin-log.md) | C | Apache HTTPD module, HTTP request JSON logging | UNIX socket (`http.socket`) |
| [correlator](docs/services/correlator.md) | Go | Joins HTTP + TLS events by `src_ip:src_port` + time window | ClickHouse INSERT, file, stdout |
| [bot-detector](docs/services/bot-detector.md) | Python | Isolation Forest ML anomaly detection on aggregated traffic | ClickHouse read/write, HTTP `:8080` |
| [dashboard](docs/services/dashboard.md) | Python/JS | SOC analyst web dashboard (FastAPI + React) | HTTP `:8000` |
| Service | Language | Description | Interface |
|---------|----------|-------------|-----------|
| [sentinel](docs/services/sentinel.md) | Go 1.24.6 | TLS/TCP packet capture via libpcap, JA4/JA3 fingerprint generation | UNIX socket `network.socket` |
| [mod-reqin-log](docs/services/mod-reqin-log.md) | C11 | Apache HTTPD module, HTTP request JSON logging | UNIX socket `http.socket` |
| [correlator](docs/services/correlator.md) | Go 1.24.6 | Hexagonal architecture, correlates HTTP+TLS events by `src_ip:src_port` | ClickHouse INSERT (Native TCP) |
| [bot-detector](docs/services/bot-detector.md) | Python 3.11 | Triple-voice ML ensemble (EIF+AE+XGB), HDBSCAN campaigns, SHAP explainability | ClickHouse read/write, HTTP `:8080` |
| [dashboard](docs/services/dashboard.md) | Python 3.11 | SOC analyst dashboard: 55 routes, 15 templates, 14 pages | HTTP `:8000` |
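The correlator row describes joining HTTP and TLS events on `src_ip:src_port`, and the earlier wording adds a time window. The following is a minimal Python sketch of that idea only, not the Go implementation: the class name, the `feed` method, the 5-second default window, and the event field names other than `src_ip` and `src_port` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    window: float = 5.0  # seconds; assumed value, not from the project
    pending: dict = field(default_factory=dict)  # key -> (source, ts, event)

    def feed(self, source: str, event: dict, ts: float):
        """Return a merged record once both sides of a connection are seen."""
        key = (event["src_ip"], event["src_port"])
        held = self.pending.get(key)
        # Match only against the *other* source, and only inside the window.
        if held and held[0] != source and abs(ts - held[1]) <= self.window:
            del self.pending[key]
            return {**held[2], **event}
        # Otherwise remember this event and wait for its counterpart.
        self.pending[key] = (source, ts, event)
        return None


c = Correlator(window=2.0)
c.feed("http", {"src_ip": "10.0.0.1", "src_port": 4242, "path": "/"}, ts=100.0)
merged = c.feed("tls", {"src_ip": "10.0.0.1", "src_port": 4242, "ja4": "example"}, ts=101.0)
```

A real correlator would also need to evict stale entries from `pending`; that is omitted here to keep the sketch short.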
## Shared Libraries
| Library | Language | Description |
|---------|----------|-------------|
| [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, shutdown handler, IP filter |
| [python/ja4_common](docs/shared/python-ja4common.md) | Python | ClickHouse client singleton, settings |
| [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, graceful shutdown handler, IP filter |
| [python/ja4_common](docs/shared/python-ja4common.md) | Python | `ClickHouseClient` singleton, `ClickHouseSettings` (pydantic-settings) |
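To make the "settings plus singleton client" pairing concrete, here is a standard-library-only sketch. It reuses the names `ClickHouseClient` and `ClickHouseSettings` from the table, but everything else is an assumption: the fields, the `CLICKHOUSE_*` environment variable names, and the `from_env` helper are hypothetical, a frozen dataclass stands in for pydantic-settings, and no real connection is opened.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickHouseSettings:
    host: str = "localhost"
    port: int = 9000  # ClickHouse native TCP default
    database: str = "ja4_processing"

    @classmethod
    def from_env(cls, env=os.environ):
        # Hypothetical variable names; the real library may use others.
        return cls(
            host=env.get("CLICKHOUSE_HOST", cls.host),
            port=int(env.get("CLICKHOUSE_PORT", cls.port)),
            database=env.get("CLICKHOUSE_DATABASE", cls.database),
        )


class ClickHouseClient:
    _instance = None

    def __new__(cls, settings=None):
        # Create the shared instance on first use, then always return it.
        if cls._instance is None:
            inst = super().__new__(cls)
            inst.settings = settings or ClickHouseSettings.from_env()
            cls._instance = inst
        return cls._instance
```

With this shape, the first caller decides the settings and every later `ClickHouseClient()` call returns the same object, which is the usual reason for a singleton around a database connection.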
## Quickstart
@@ -79,37 +83,77 @@ make test-all
```bash
make rpm-all
# RPMs written to services/<service>/dist/
# RPMs written to services/<service>/dist/rpm/el{8,9,10}/
```
## Scripts
Helper scripts are located in `scripts/`:
| Script | Description |
|--------|-------------|
| `init-stack.sh` | Full ClickHouse stack initialization — deploys schema, loads CSV data, verifies all components |
| `import-prod-data.sh` | Imports pre-exported production data into the dev database with dynamic date shifting |
| `reload-prod-logs.sh` | Exports `http_logs` from production and re-imports into the dev database |
| `update-csv-data.sh` | Downloads and generates all CSV reference data (bot IPs, JA4 signatures, ASN reputation) |
| `generate_bot_ip.py` | Generates `bot_ip.csv` from known scanner/bot sources + Tor exit nodes |
| `generate_bot_ja4.py` | Generates `bot_ja4.csv` from known bot TLS fingerprints |
| `generate_asn_data.py` | Generates `asn_reputation.csv` (ASN→label mapping) |
| `generate_browser_ja4.py` | Generates browser JA4 reference data for legitimate browser detection |
Corresponding Makefile targets:
```bash
make init-stack # runs scripts/init-stack.sh
make import-prod-data # runs scripts/import-prod-data.sh
make init-and-import # init-stack + import-prod-data
make reload-prod-logs # runs scripts/reload-prod-logs.sh
```
## Integration Tests
Full-stack integration tests run against Docker Compose with a real ClickHouse instance:
```bash
make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep # same but leaves stack running after
make test-integration-down # tear down integration stack
```
The integration test suite is located in `tests/integration/` and resets the database between runs.
## Documentation
| Document | Description |
|----------|-------------|
| [Architecture](docs/architecture.md) | System architecture, data flow, component interactions |
| [Deployment](docs/deployment.md) | Step-by-step production deployment guide |
| [Development](docs/development.md) | Build, test, package, and extend the platform |
| [Database Schema](docs/database/schema.md) | Every ClickHouse table, view, dictionary, and materialized view |
| [Database Migrations](docs/database/migrations.md) | Migration order, application, verification, and rollback |
| [Commenting Standard](docs/commenting-standard.md) | Code commenting conventions (French comments, English identifiers) |
| [Thesis Reference](docs/THESIS_HTTP_Traffic_Detection.md) | Academic reference: HTTP traffic detection techniques |
| [Audit vs Thesis](docs/AUDIT_Detection_vs_Thesis.md) | Comparison between platform implementation and thesis techniques |
### Service Documentation
- [Sentinel](docs/services/sentinel.md) — TLS capture daemon
- [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module
- [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine
- [Bot Detector](docs/services/bot-detector.md) — ML anomaly detection
- [Dashboard](docs/services/dashboard.md) — SOC web dashboard and API
- [Sentinel](docs/services/sentinel.md) — TLS/TCP capture daemon (Go + libpcap)
- [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module (C11)
- [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine (Go)
- [Bot Detector](docs/services/bot-detector.md) — Triple-voice ML anomaly detection (Python)
- [Dashboard](docs/services/dashboard.md) — SOC analyst dashboard and API (FastAPI)
### Shared Library Documentation
- [go-ja4common](docs/shared/go-ja4common.md) — Go shared library
- [python-ja4common](docs/shared/python-ja4common.md) — Python shared library
- [go-ja4common](docs/shared/go-ja4common.md) — Go shared library (logger, config, shutdown, ipfilter)
- [python-ja4common](docs/shared/python-ja4common.md) — Python shared library (ClickHouse client, settings)
## Go Workspace
The repository uses a Go workspace (`go.work`) to link the Go modules:
```
go 1.21
go 1.24.6
use (
./services/sentinel
@@ -118,6 +162,8 @@ use (
)
```
Both Go services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local development; the `replace` is needed for Docker builds where `go.work` is not available.
## License
See individual service directories for license information.