# ja4-platform
**ja4-platform** is a monorepo security pipeline for TLS fingerprinting (JA4/JA3) and bot detection. It captures live network traffic, correlates TLS handshakes with HTTP requests, applies triple-voice ML anomaly detection (Extended Isolation Forest + Autoencoder + XGBoost), and surfaces the results in a SOC analyst dashboard. All data is stored in ClickHouse using a dual-database architecture (`ja4_logs` and `ja4_processing`).
## Pipeline Overview
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                            Linux Server (Apache)                             │
│                                                                              │
│  ┌─────────────────┐   UNIX socket (DGRAM)            ┌──────────────────┐   │
│  │  mod-reqin-log  │──── http.socket ────────────────▶│                  │   │
│  │  (Apache C11)   │         (source A)               │  correlator      │   │
│  └─────────────────┘                                  │  (Go · hex.      │   │
│                                                       │  architecture)   │   │
│  ┌─────────────────┐   UNIX socket (DGRAM)            │                  │   │
│  │  sentinel       │──── network.socket ─────────────▶│  Joins by        │   │
│  │  (Go · libpcap) │         (source B)               │  src_ip:src_port │   │
│  │  JA4/JA3 gen.   │                                  └────────┬─────────┘   │
│  └─────────────────┘                                           │             │
└────────────────────────────────────────────────────────────────┼─────────────┘
                                                                 │ INSERT
                            ┌──────────────────────────────────────┐
                            │           ClickHouse 24.8            │
                            │                                      │
                            │    ja4_logs         ja4_processing   │
                            │  ┌───────────┐     ┌──────────────┐  │
                            │  │_raw → MV  │────▶│ agg_* (×6)   │  │
                            │  │→ http_logs│     │ ml_* (×2)    │  │
                            │  └───────────┘     │ views, dicts │  │
                            │                    └──────────────┘  │
                            └─────────┬───────────────┬────────────┘
                                      │               │
                     ┌────────────────┘               └───────────────┐
                     ▼                                                ▼
           ┌────────────────────┐                           ┌────────────────────┐
           │ bot-detector       │                           │ dashboard          │
           │ Python 3.11        │                           │ FastAPI + Jinja2   │
           │ EIF + AE + XGBoost │                           │ htmx + Chart.js    │
           │ HDBSCAN · SHAP     │                           │ 55 routes · 14 pp  │
           └────────────────────┘                           └────────────────────┘
```
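
The join on `src_ip:src_port` can be pictured as a keyed buffer: whichever event arrives first waits until its counterpart from the other socket shows up. The sketch below is written in Python for brevity (the correlator itself is a Go service) and is only an illustration. The class name `PendingJoin`, the `ttl_seconds` expiry window, and every event field other than `src_ip` and `src_port` are assumptions, not taken from the project.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingJoin:
    """Buffers one half of an HTTP/TLS pair until the other half arrives."""

    ttl_seconds: float = 5.0  # assumed expiry window, not a project value
    _pending: dict = field(default_factory=dict)

    def offer(self, source: str, event: dict, now: Optional[float] = None):
        """Feed one event from source "http" or "network".

        Returns the merged record when both halves are present, else None.
        """
        now = time.monotonic() if now is None else now
        key = (event["src_ip"], event["src_port"])
        waiting = self._pending.get(key)
        if waiting is not None:
            other_source, other_event, seen_at = waiting
            fresh = now - seen_at <= self.ttl_seconds
            if other_source != source and fresh:
                # Counterpart found in time: emit one merged record.
                del self._pending[key]
                return {**other_event, **event}
        # No usable counterpart yet: store (or overwrite) and keep waiting.
        self._pending[key] = (source, event, now)
        return None
```

A stale or same-source entry is simply overwritten here; a production correlator would more likely emit or log such orphans instead of dropping them silently.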
## Services
| Service | Language | Description | Interface |
|---------|----------|-------------|-----------|
| [sentinel](docs/services/sentinel.md) | Go 1.24.6 | TLS/TCP packet capture via libpcap, JA4/JA3 fingerprint generation | UNIX socket → `network.socket` |
| [mod-reqin-log](docs/services/mod-reqin-log.md) | C11 | Apache HTTPD module, HTTP request JSON logging | UNIX socket → `http.socket` |
| [correlator](docs/services/correlator.md) | Go 1.24.6 | Hexagonal architecture, correlates HTTP+TLS events by `src_ip:src_port` | ClickHouse INSERT (Native TCP) |
| [bot-detector](docs/services/bot-detector.md) | Python 3.11 | Triple-voice ML ensemble (EIF+AE+XGB), HDBSCAN campaigns, SHAP explainability | ClickHouse read/write, HTTP `:8080` |
| [dashboard](docs/services/dashboard.md) | Python 3.11 | SOC analyst dashboard: 55 routes, 15 templates, 14 pages | HTTP `:8000` |
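
This README does not say how the three voices of the bot-detector ensemble are combined. As a rough illustration only, the sketch below assumes each model produces an anomaly score normalized to [0, 1] and that the final score is a weighted mean. The names `VoiceWeights` and `ensemble_score` and the default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceWeights:
    # Hypothetical weights; the project's actual values are not given here.
    eif: float = 0.4
    ae: float = 0.3
    xgb: float = 0.3


def ensemble_score(eif: float, ae: float, xgb: float,
                   w: VoiceWeights = VoiceWeights()) -> float:
    """Weighted mean of three anomaly scores, each assumed to lie in [0, 1]."""
    for name, value in (("eif", eif), ("ae", ae), ("xgb", xgb)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value!r} is outside [0, 1]")
    total = w.eif + w.ae + w.xgb
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total keeps the result in [0, 1] for any positive weights.
    return (w.eif * eif + w.ae * ae + w.xgb * xgb) / total
```

Normalizing by the weight total means the weights do not have to sum to 1, which is convenient when they come from independent configuration values.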
## Shared Libraries
| Library | Language | Description |
|---------|----------|-------------|
| [go/ja4common](docs/shared/go-ja4common.md) | Go | Logger, config loader, graceful shutdown handler, IP filter |
| [python/ja4_common](docs/shared/python-ja4common.md) | Python | `ClickHouseClient` singleton, `ClickHouseSettings` (pydantic-settings) |
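
To give an idea of the "settings object plus singleton client" pattern named in the table, here is a minimal standard-library sketch. It is a stand-in, not the library's code: the real `ClickHouseSettings` is built on pydantic-settings, while this version uses a plain dataclass. The names `SketchSettings`, `SketchClient`, `get_client`, and the environment variable names are all hypothetical.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class SketchSettings:
    """Stand-in settings object; the env var names below are hypothetical."""

    host: str
    port: int
    database: str

    @classmethod
    def from_env(cls, env=None) -> "SketchSettings":
        env = os.environ if env is None else env
        return cls(
            host=env.get("CLICKHOUSE_HOST", "localhost"),
            # 9000 is ClickHouse's default native TCP port.
            port=int(env.get("CLICKHOUSE_PORT", "9000")),
            database=env.get("CLICKHOUSE_DB", "ja4_logs"),
        )


class SketchClient:
    """Placeholder client that only records its settings (no network I/O)."""

    def __init__(self, settings: SketchSettings) -> None:
        self.settings = settings


@lru_cache(maxsize=1)
def get_client() -> SketchClient:
    """Build the client on first call and return the same instance afterwards."""
    return SketchClient(SketchSettings.from_env())
```

Using `lru_cache` on a zero-argument factory is one simple way to get singleton behavior without a custom metaclass or module-level global.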
## Quickstart
### Prerequisites
- Docker (with BuildKit) and Docker Compose
- `make`
- No native Go, Python, or C toolchains required — all builds run inside Docker
### Build All Services
```bash
make build-all
```
### Run All Tests
```bash
make test-all
```
### Build RPM Packages
```bash
make rpm-all
# RPMs written to services/<service>/dist/rpm/el{8,9,10}/
```
## Scripts
Helper scripts are located in `scripts/`:

| Script | Description |
|--------|-------------|
| `init-stack.sh` | Full ClickHouse stack initialization — deploys schema, loads CSV data, verifies all components |
| `import-prod-data.sh` | Imports pre-exported production data into the dev database with dynamic date shifting |
| `reload-prod-logs.sh` | Exports `http_logs` from production and re-imports into the dev database |
| `update-csv-data.sh` | Downloads and generates all CSV reference data (bot IPs, JA4 signatures, ASN reputation) |
| `generate_bot_ip.py` | Generates `bot_ip.csv` from known scanner/bot sources + Tor exit nodes |
| `generate_bot_ja4.py` | Generates `bot_ja4.csv` from known bot TLS fingerprints |
| `generate_asn_data.py` | Generates `asn_reputation.csv` (ASN→label mapping) |
| `generate_browser_ja4.py` | Generates browser JA4 reference data for legitimate browser detection |

Corresponding Makefile targets:
```bash
make init-stack # runs scripts/init-stack.sh
make import-prod-data # runs scripts/import-prod-data.sh
make init-and-import # init-stack + import-prod-data
make reload-prod-logs # runs scripts/reload-prod-logs.sh
```
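
The "dynamic date shifting" done by `import-prod-data.sh` is not detailed in this README. One plausible reading, sketched below purely as an assumption, is that every timestamp is moved by the same offset so that the newest record lands on a reference time (for example "now"), which keeps the imported data inside the dashboard's recent time windows. The function name `shift_to_reference` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def shift_to_reference(timestamps: list[datetime],
                       reference: datetime) -> list[datetime]:
    """Shift all timestamps by one constant offset so the newest equals `reference`.

    The relative spacing between records is preserved.
    """
    if not timestamps:
        return []
    offset: timedelta = reference - max(timestamps)
    return [ts + offset for ts in timestamps]


if __name__ == "__main__":
    old = [
        datetime(2024, 1, 1, 10, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, tzinfo=timezone.utc),
    ]
    for ts in shift_to_reference(old, datetime.now(timezone.utc)):
        print(ts.isoformat())
```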
## Integration Tests
Full-stack integration tests run against Docker Compose with a real ClickHouse instance:
```bash
make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep # same, but leaves the stack running afterwards
make test-integration-down # tear down integration stack
```
The integration test suite is located in `tests/integration/` and resets the database between runs.
## Documentation
| Document | Description |
|----------|-------------|
| [Architecture](docs/architecture.md) | System architecture, data flow, component interactions |
| [Deployment](docs/deployment.md) | Step-by-step production deployment guide |
| [Development](docs/development.md) | Build, test, package, and extend the platform |
| [Database Schema](docs/database/schema.md) | Every ClickHouse table, view, dictionary, and materialized view |
| [Database Migrations](docs/database/migrations.md) | Migration order, application, verification, and rollback |
| [Commenting Standard](docs/commenting-standard.md) | Code commenting conventions (French comments, English identifiers) |
| [Thesis Reference](docs/THESIS_HTTP_Traffic_Detection.md) | Academic reference: HTTP traffic detection techniques |
| [Audit vs Thesis](docs/AUDIT_Detection_vs_Thesis.md) | Comparison between platform implementation and thesis techniques |
### Service Documentation
- [Sentinel](docs/services/sentinel.md) — TLS/TCP capture daemon (Go + libpcap)
- [mod-reqin-log](docs/services/mod-reqin-log.md) — Apache HTTP logging module (C11)
- [Correlator](docs/services/correlator.md) — HTTP/TLS event correlation engine (Go)
- [Bot Detector](docs/services/bot-detector.md) — Triple-voice ML anomaly detection (Python)
- [Dashboard](docs/services/dashboard.md) — SOC analyst dashboard and API (FastAPI)
### Shared Library Documentation
- [go-ja4common](docs/shared/go-ja4common.md) — Go shared library (logger, config, shutdown, ipfilter)
- [python-ja4common](docs/shared/python-ja4common.md) — Python shared library (ClickHouse client, settings)
## Go Workspace
The repository uses a Go workspace (`go.work`) to link the Go modules:
```
go 1.24.6

use (
    ./services/sentinel
    ./services/correlator
    ./shared/go/ja4common
)
```
Both Go services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local development; the `replace` is needed for Docker builds where `go.work` is not available.
## License
See individual service directories for license information.