toto d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00


# Architecture
The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.
## System Architecture
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                 Target Linux Server                                 │
│                                                                                     │
│  ┌─────────────┐  HTTP req   ┌───────────────────────┐  UNIX socket (DGRAM)         │
│  │   Client    │────────────▶│     Apache HTTPD      │──────────────┐               │
│  │ (browser /  │             │    + mod-reqin-log    │              │               │
│  │    bot)     │             └───────────────────────┘              │               │
│  │             │                                                    ▼               │
│  │             │  TLS CH     ┌───────────────────────┐   ┌─────────────────────┐    │
│  │             │────────────▶│       sentinel        │   │     correlator      │    │
│  │             │  (pcap)     │   (packet capture)    │──▶│    (event join)     │    │
│  └─────────────┘             └───────────────────────┘   └──────────┬──────────┘    │
│                                                                     │               │
└─────────────────────────────────────────────────────────────────────┼───────────────┘
                                                                      │ INSERT JSON
                                                           ┌─────────────────────┐
                                                           │     ClickHouse      │
                                                           │     mabase_prod     │
                                                           │                     │
                                                           │  http_logs_raw      │
                                                           │  ──(MV)──▶ http_logs│
                                                           │  ──(MV)──▶ agg_*    │
                                                           │  view_ai_features   │
                                                           │  ml_detected_anom.  │
                                                           │  ml_all_scores      │
                                                           └──────┬──────┬───────┘
                                                                  │      │
                                                       ┌──────────┘      └──────────┐
                                                       ▼                            ▼
                                            ┌──────────────────────┐     ┌──────────────────────┐
                                            │     bot-detector     │     │      dashboard       │
                                            │       (Python)       │     │  (FastAPI + React)   │
                                            │                      │     │                      │
                                            │  Reads:              │     │  Reads:              │
                                            │    view_ai_features  │     │    ml_detected_anom. │
                                            │    view_ip_recurrence│     │    ml_all_scores     │
                                            │  Writes:             │     │    http_logs         │
                                            │    ml_detected_anom. │     │    agg_* tables      │
                                            │    ml_all_scores     │     │    audit_logs        │
                                            └──────────────────────┘     └──────────────────────┘
```
## Data Flow
### 1. Capture Phase
1. **mod-reqin-log** (Apache C module) hooks into `post_read_request`. On each HTTP request, it serializes the method, path, headers, and client IP/port into JSON and sends the result over a UNIX datagram socket to `/var/run/logcorrelator/http.socket`.
2. **sentinel** (Go daemon) uses libpcap to capture live TLS ClientHello packets on the configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON over a UNIX datagram socket to `/var/run/logcorrelator/network.socket`.
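
As a rough illustration of the IPC used in this phase, the Python sketch below sends one JSON event as a single datagram to a UNIX socket and reads it back from a throwaway receiver. The field names (`src_ip`, `src_port`, `ja4`) and the temporary socket path are assumptions made for the example; the real producers are the C module and the Go daemon, and their exact payload schema is not reproduced here.

```python
import json
import os
import socket
import tempfile


def send_event(sock_path: str, event: dict) -> None:
    """Send one JSON event as a single datagram to a UNIX socket."""
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(payload, sock_path)


def demo() -> dict:
    """Bind a throwaway receiver, send one event, and return what arrived."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for /var/run/logcorrelator/network.socket (hypothetical path here).
        path = os.path.join(tmp, "network.socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as server:
            server.bind(path)
            server.settimeout(2.0)
            send_event(path, {
                "src_ip": "203.0.113.7",
                "src_port": 51514,
                "ja4": "t13d1516h2_8daaf6152771_b0da82dd1658",
            })
            data, _ = server.recvfrom(65535)
    return json.loads(data)
```

Datagram sockets preserve message boundaries, so each event arrives whole or not at all, and the receiver needs no framing logic.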
### 2. Correlation Phase
3. **correlator** (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching `src_ip:src_port` within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported through the `one_to_many` matching mode, in which a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Each correlated event merges the HTTP fields (method, path, headers) with the TLS fields (JA4, JA3, IP/TCP metadata) into a single `CorrelatedLog` JSON object, which is inserted into `http_logs_raw`.
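
To make the merge step concrete, here is a minimal Python sketch of how an HTTP event and a TLS event might be combined into one JSON row. The `CorrelatedLog` field layout shown is an assumption for illustration only; the actual Go struct and the `http_logs_raw` schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CorrelatedLog:
    """Hypothetical flattened record: HTTP fields plus TLS fields."""
    src_ip: str
    src_port: int
    method: str
    path: str
    headers: dict = field(default_factory=dict)
    ja4: Optional[str] = None
    ja3: Optional[str] = None
    correlated: bool = False


def merge(http_event: dict, tls_event: Optional[dict]) -> CorrelatedLog:
    """Build one record from an HTTP event and an optional TLS event."""
    log = CorrelatedLog(
        src_ip=http_event["src_ip"],
        src_port=http_event["src_port"],
        method=http_event["method"],
        path=http_event["path"],
        headers=http_event.get("headers", {}),
    )
    if tls_event is not None:
        log.ja4 = tls_event.get("ja4")
        log.ja3 = tls_event.get("ja3")
        log.correlated = True
    return log


def to_json_row(log: CorrelatedLog) -> str:
    """Serialize the record as one JSON line, ready for a JSON insert."""
    return json.dumps(asdict(log), sort_keys=True)
```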
### 3. Enrichment Phase (ClickHouse)
4. **mv_http_logs** materialized view automatically transforms `http_logs_raw` JSON into the structured `http_logs` table, enriching each row with:
   - ASN/geo data via `dict_iplocate_asn`
   - Anubis bot identification via `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`
5. **mv_agg_host_ip_ja4_1h** and **mv_agg_header_fingerprint_1h** aggregate `http_logs` into 1-hour behavioral windows.
6. **view_ai_features_1h** joins the two aggregation tables and computes 50+ ML features per `(src_ip, ja4, host)` tuple.
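
In the platform this aggregation is done by ClickHouse materialized views, not application code. The Python sketch below only illustrates the grouping idea: rows are bucketed into 1-hour windows per `(host, src_ip, ja4)` and reduced to a couple of per-window counters. The row keys (`time`, `path`) and the two counters (`hits`, `unique_paths`) are hypothetical stand-ins for the real feature set.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate(rows: list) -> dict:
    """Group rows by (hour, host, src_ip, ja4) and compute simple counters."""
    acc = defaultdict(lambda: {"hits": 0, "paths": set()})
    for row in rows:
        key = (hour_bucket(row["time"]), row["host"], row["src_ip"], row["ja4"])
        acc[key]["hits"] += 1
        acc[key]["paths"].add(row["path"])
    # Replace the path set with its cardinality so the result is plain data.
    return {
        key: {"hits": value["hits"], "unique_paths": len(value["paths"])}
        for key, value in acc.items()
    }
```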
### 4. Detection Phase
7. **bot-detector** (Python) runs on a 5-minute cycle:
   - Reads `view_ai_features_1h` for the last 24 hours
   - Separates known bots (via reputation dictionaries) from unknown traffic
   - Trains/loads Isolation Forest models on human-baseline traffic
   - Scores unknown traffic and writes anomalies to `ml_detected_anomalies` and all scores to `ml_all_scores`
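
The train-then-score step can be sketched with scikit-learn's `IsolationForest`. This is a minimal example under stated assumptions: the hyperparameters, the function names, and the numeric feature matrix are illustrative, and nothing here reflects how bot-detector actually selects its baseline or thresholds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def train_baseline(features: np.ndarray, seed: int = 0) -> IsolationForest:
    """Fit an Isolation Forest on baseline (assumed human) traffic features."""
    model = IsolationForest(n_estimators=100, contamination="auto", random_state=seed)
    model.fit(features)
    return model


def score_unknown(model: IsolationForest, features: np.ndarray):
    """Return (scores, flags): lower scores are more anomalous; flags mark outliers."""
    scores = model.decision_function(features)
    flags = model.predict(features) == -1  # predict() returns -1 for outliers
    return scores, flags
```

In a pipeline like the one described, every row of `scores` would correspond to an entry in `ml_all_scores`, and only the rows where `flags` is true would go to `ml_detected_anomalies`.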
### 5. Visualization Phase
8. **dashboard** (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.
## Component Interaction Matrix
| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|
| **mod-reqin-log** | — | — | UNIX socket (DGRAM) | — | — | — |
| **sentinel** | — | — | UNIX socket (DGRAM) | — | — | — |
| **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — |
| **ClickHouse** | — | — | — | — | — | — |
| **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
| **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
## ClickHouse Table Ownership
| Table/View | Written By | Read By |
|------------|-----------|---------|
| `http_logs_raw` | correlator | mv_http_logs (MV) |
| `http_logs` | mv_http_logs (MV) | mv_agg_*, dashboard |
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| `view_ai_features_1h` | — (view) | bot-detector |
| `view_ip_recurrence` | — (view) | bot-detector |
| `ml_detected_anomalies` | bot-detector | dashboard |
| `ml_all_scores` | bot-detector | dashboard |
| `audit_logs` | dashboard | dashboard |
## Correlation Algorithm
The correlator joins HTTP events (source A) with TLS/network events (source B) according to the following rules:
1. **Key**: `src_ip + src_port` — the client's source IP and ephemeral port uniquely identify a TCP connection.
2. **Time window**: Events must arrive within the configured window (default 10 seconds).
3. **Matching mode**:
   - `one_to_one`: Each B event matches at most one A event (consumed after match).
   - `one_to_many` (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
4. **Orphan handling**: Unmatched A events are emitted after a configurable delay (default 500 ms) with `correlated=false` and `orphan_side=A`.
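
The rules above can be sketched as a small in-memory matcher. This is a simplified illustration, not the Go implementation: it assumes the B event arrives before the A event, treats the time window as the lifetime of a B event in `one_to_one` mode, and emits orphans immediately instead of waiting out the orphan delay. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    """Toy correlator keyed on (src_ip, src_port)."""
    window: float = 10.0   # seconds a B event lives in one_to_one mode (assumption)
    ttl: float = 120.0     # seconds a B event lives in one_to_many mode
    mode: str = "one_to_many"
    _pending: dict = field(default_factory=dict)  # key -> [tls_event, expiry]

    def add_tls(self, event: dict, now: float) -> None:
        """Buffer a B event until it expires or is consumed."""
        lifetime = self.ttl if self.mode == "one_to_many" else self.window
        key = (event["src_ip"], event["src_port"])
        self._pending[key] = [event, now + lifetime]

    def match_http(self, event: dict, now: float) -> dict:
        """Return the merged record for an A event, or an orphan record."""
        key = (event["src_ip"], event["src_port"])
        entry = self._pending.get(key)
        if entry is None or now > entry[1]:
            self._pending.pop(key, None)  # drop an expired B event, if any
            return {**event, "correlated": False, "orphan_side": "A"}
        if self.mode == "one_to_many":
            entry[1] = now + self.ttl     # TTL resets on each match
        else:
            del self._pending[key]        # B is consumed after one match
        return {**entry[0], **event, "correlated": True}
```

Timestamps are passed in explicitly so that expiry behavior can be exercised without sleeping.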
## JA4/JA3 Fingerprint Format
### JA4
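JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:
```
t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}
```
Example: `t13d1516h2_8daaf6152771_b0da82dd1658`
- Prefix `t` = TLS over TCP, followed by the version (`13` = TLS 1.3)
- `d` = SNI present (domain), `i` = SNI absent (IP)
- Two-digit cipher suite count (`15`) and two-digit extension count (`16`)
- ALPN tag: first and last characters of the first ALPN value (`h2`)
- Two SHA-256 hashes, each truncated to 12 hex characters: one over the sorted cipher suites, one over the sorted extensions (the JA4 specification also appends the signature algorithms to the second input)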
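
For illustration, the following Python sketch splits a JA4 string into its components with a regular expression. It reads the two characters just before the first underscore in the example (`h2`) as the ALPN tag. The `Ja4Parts` type and `parse_ja4` function are hypothetical helpers, not part of sentinel.

```python
import re
from typing import NamedTuple


class Ja4Parts(NamedTuple):
    protocol: str       # "t" for TLS over TCP, "q" for QUIC
    tls_version: str    # e.g. "13" for TLS 1.3
    sni_present: bool   # True for "d", False for "i"
    cipher_count: int
    extension_count: int
    alpn: str           # first and last character of the first ALPN value
    cipher_hash: str
    extension_hash: str


JA4_RE = re.compile(
    r"^(?P<proto>[tq])(?P<version>[0-9a-z]{2})(?P<sni>[di])"
    r"(?P<ciphers>\d{2})(?P<extensions>\d{2})(?P<alpn>[0-9a-z]{2})"
    r"_(?P<cipher_hash>[0-9a-f]{12})_(?P<ext_hash>[0-9a-f]{12})$"
)


def parse_ja4(fingerprint: str) -> Ja4Parts:
    """Split a JA4 fingerprint into its fields, or raise ValueError."""
    match = JA4_RE.match(fingerprint)
    if match is None:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    return Ja4Parts(
        protocol=match["proto"],
        tls_version=match["version"],
        sni_present=match["sni"] == "d",
        cipher_count=int(match["ciphers"]),
        extension_count=int(match["extensions"]),
        alpn=match["alpn"],
        cipher_hash=match["cipher_hash"],
        extension_hash=match["ext_hash"],
    )
```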
### JA3
JA3 is the original TLS fingerprinting format:
```
{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}
```
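Fields are separated by commas; within a field, values are written in decimal and joined with `-`. The `ja3_hash` is the MD5 hash of the resulting JA3 string.
Both fingerprints are generated by sentinel from the TLS ClientHello payload.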
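
A minimal Python sketch of the JA3 construction follows. It assumes the ClientHello values have already been parsed into integer lists and omits details such as filtering GREASE values; the function names and sample values are illustrative only.

```python
import hashlib


def ja3_string(version: int, ciphers: list, extensions: list,
               curves: list, point_formats: list) -> str:
    """Join the five ClientHello fields: commas between fields, dashes within."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in group) for group in groups]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """Return the MD5 hex digest of a JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()
```

For example, `ja3_string(771, [4865, 4866], [0, 10, 11], [29, 23], [0])` yields `771,4865-4866,0-10-11,29-23,0`, and `ja3_hash` of that string is a 32-character hexadecimal digest.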
## Technology Stack
| Component | Technology |
|-----------|-----------|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |