# Architecture
The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.
## System Architecture
```
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                Target Linux Server                                │
│                                                                                   │
│   ┌─────────────┐  HTTP req   ┌───────────────────────┐ UNIX socket (DGRAM)       │
│   │   Client    │────────────▶│     Apache HTTPD      │──────────────┐            │
│   │ (browser /  │             │   + mod-reqin-log     │              │            │
│   │    bot)     │             └───────────────────────┘              │            │
│   │             │                                                    ▼            │
│   │             │   TLS CH    ┌───────────────────────┐   ┌─────────────────────┐ │
│   │             │────────────▶│       sentinel        │   │     correlator      │ │
│   │             │   (pcap)    │   (packet capture)    │──▶│    (event join)     │ │
│   └─────────────┘             └───────────────────────┘   └────────┬────────────┘ │
│                                                                    │              │
└────────────────────────────────────────────────────────────────────┼──────────────┘
                                                                     │ INSERT JSON
                                                          ┌─────────────────────┐
                                                          │     ClickHouse      │
                                                          │   ja4_processing    │
                                                          │                     │
                                                          │  http_logs_raw      │
                                                          │  ──(MV)──▶ http_logs│
                                                          │  ──(MV)──▶ agg_*    │
                                                          │  view_ai_features   │
                                                          │  ml_detected_anom.  │
                                                          │  ml_all_scores      │
                                                          └──────┬──────┬───────┘
                                                                 │      │
                                              ┌──────────────────┘      └──────────────────┐
                                              ▼                                            ▼
                                   ┌──────────────────────┐                     ┌──────────────────────┐
                                   │     bot-detector     │                     │      dashboard       │
                                   │       (Python)       │                     │  (FastAPI + React)   │
                                   │                      │                     │                      │
                                   │  Reads:              │                     │  Reads:              │
                                   │   view_ai_features   │                     │   ml_detected_anom.  │
                                   │   view_ip_recurrence │                     │   ml_all_scores      │
                                   │  Writes:             │                     │   http_logs          │
                                   │   ml_detected_anom.  │                     │   agg_* tables       │
                                   │   ml_all_scores      │                     │   audit_logs         │
                                   └──────────────────────┘                     └──────────────────────┘
```
## Data Flow
### 1. Capture Phase
1. **mod-reqin-log** (Apache C module) hooks into `post_read_request`. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to `/var/run/logcorrelator/http.socket`.
2. **sentinel** (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to `/var/run/logcorrelator/network.socket`.
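
Both producers use the same transport: one JSON object per UNIX datagram. The sketch below illustrates that pattern in Python only (the real producers are written in C and Go). It uses an in-process `socketpair` instead of the socket paths above, and the event fields are illustrative assumptions, not the actual wire schema.

```python
import json
import socket


def encode_event(event: dict) -> bytes:
    """Serialize one event as a compact JSON payload for a single datagram."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


def send_event(sock: socket.socket, event: dict) -> None:
    """Send one event; datagram boundaries delimit messages, so no framing is needed."""
    sock.send(encode_event(event))


# Connected in-process pair standing in for producer and correlator.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
send_event(sender, {"src_ip": "203.0.113.7", "src_port": 51544, "method": "GET", "path": "/"})
received = json.loads(receiver.recv(65535))  # one recv() returns one whole event
sender.close()
receiver.close()
```

Datagram sockets preserve message boundaries, which is why each event can be parsed on its own without a length prefix or delimiter.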
### 2. Correlation Phase
3. **correlator** (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching `src_ip:src_port` within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported through the `one_to_many` matching mode, in which a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Each correlated event merges the HTTP fields (method, path, headers) with the TLS fields (JA4, JA3, IP/TCP metadata) into a single `CorrelatedLog` JSON object, which is inserted into `http_logs_raw`.
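
The merge step can be pictured as combining two dictionaries. This is a minimal Python sketch, not the correlator's Go implementation: `merge_events` is a hypothetical helper, the field names other than `src_ip`, `src_port`, `correlated` and `orphan_side` are assumptions, and letting HTTP fields win on key collisions is an arbitrary choice made for the example.

```python
from typing import Optional


def merge_events(http_event: dict, tls_event: Optional[dict]) -> dict:
    """Build one merged record from an HTTP event and an optional TLS event."""
    merged = dict(http_event)
    if tls_event is None:
        # No TLS side was found: emit the HTTP event alone, flagged as an orphan.
        merged["correlated"] = False
        merged["orphan_side"] = "A"
        return merged
    for key, value in tls_event.items():
        merged.setdefault(key, value)  # keep the HTTP value when both sides share a key
    merged["correlated"] = True
    return merged


http = {"src_ip": "203.0.113.7", "src_port": 51544, "method": "GET", "path": "/login"}
tls = {"src_ip": "203.0.113.7", "src_port": 51544, "ja4": "t13d1516h2_8daaf6152771_b0da82dd1658"}
record = merge_events(http, tls)
orphan = merge_events(http, None)
```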
### 3. Enrichment Phase (ClickHouse)
4. **mv_http_logs** materialized view automatically transforms `http_logs_raw` JSON into the structured `http_logs` table, enriching each row with:
   - ASN/geo data via `dict_iplocate_asn`
   - Anubis bot identification via `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`
5. **mv_agg_host_ip_ja4_1h** and **mv_agg_header_fingerprint_1h** aggregate `http_logs` into 1-hour behavioral windows.
6. **view_ai_features_1h** joins the two aggregation tables and computes 50+ ML features per `(src_ip, ja4, host)` tuple.
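
The 1-hour windows amount to truncating each timestamp to the hour and grouping by `(src_ip, ja4, host)`. The real aggregation happens inside ClickHouse materialized views; the Python sketch below only mirrors the grouping logic. The `time` field name and the single `hits` counter are assumptions, and the actual tables carry many more columns.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hits(rows: list) -> Counter:
    """Count requests per (hour, src_ip, ja4, host) window."""
    counts = Counter()
    for row in rows:
        key = (hour_bucket(row["time"]), row["src_ip"], row["ja4"], row["host"])
        counts[key] += 1
    return counts


base = {"src_ip": "203.0.113.7", "ja4": "t13d1516h2_8daaf6152771_b0da82dd1658", "host": "example.org"}
rows = [
    {**base, "time": datetime(2026, 4, 7, 19, 10, tzinfo=timezone.utc)},
    {**base, "time": datetime(2026, 4, 7, 19, 55, tzinfo=timezone.utc)},
    {**base, "time": datetime(2026, 4, 7, 20, 5, tzinfo=timezone.utc)},
]
hits = aggregate_hits(rows)  # two windows: 19:00 with 2 hits, 20:00 with 1 hit
```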
### 4. Detection Phase
7. **bot-detector** (Python) runs on a 5-minute cycle:
   - Reads `view_ai_features_1h` for the last 24 hours
   - Separates known bots (via reputation dictionaries) from unknown traffic
   - Trains/loads Isolation Forest models on human-baseline traffic
   - Scores unknown traffic and writes anomalies to `ml_detected_anomalies` and all scores to `ml_all_scores`
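
The train-then-score step can be sketched with scikit-learn's `IsolationForest`. This illustrates the technique under stated assumptions and is not the bot-detector's code: the two features, the synthetic baseline, and the hyperparameters are all invented for the example, whereas the real pipeline reads 50+ features from `view_ai_features_1h`.

```python
import random

from sklearn.ensemble import IsolationForest

rng = random.Random(0)

# Hypothetical feature rows: (requests per hour, distinct paths per hour).
baseline = [[rng.gauss(20, 3), rng.gauss(5, 1)] for _ in range(500)]
unknown = [
    [21.0, 5.2],      # close to the baseline
    [900.0, 400.0],   # far outside the baseline range
]

# Fit on baseline traffic only, then score traffic of unknown origin.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(baseline)

scores = model.decision_function(unknown)  # lower means more anomalous
labels = model.predict(unknown)            # -1 = anomaly, 1 = normal

# Rows flagged as anomalies, paired with their scores.
anomalies = [(row, score) for row, score, label in zip(unknown, scores, labels) if label == -1]
```

In this sketch the flagged rows would correspond to what is written to `ml_detected_anomalies`, while every `(row, score)` pair corresponds to `ml_all_scores`.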
### 5. Visualization Phase
8. **dashboard** (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.
## Component Interaction Matrix
| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|
| **mod-reqin-log** | — | — | UNIX socket (DGRAM) | — | — | — |
| **sentinel** | — | — | UNIX socket (DGRAM) | — | — | — |
| **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — |
| **ClickHouse** | — | — | — | — | — | — |
| **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
| **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
## ClickHouse Table Ownership
| Table/View | Written By | Read By |
|------------|-----------|---------|
| `http_logs_raw` | correlator | mv_http_logs (MV) |
| `http_logs` | mv_http_logs (MV) | mv_agg_*, dashboard |
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| `view_ai_features_1h` | — (view) | bot-detector |
| `view_ip_recurrence` | — (view) | bot-detector |
| `ml_detected_anomalies` | bot-detector | dashboard |
| `ml_all_scores` | bot-detector | dashboard |
| `audit_logs` | dashboard | dashboard |
## Correlation Algorithm
The correlator joins HTTP events (source A) with TLS/network events (source B) using a two-part key and the following rules:
1. **Key**: `src_ip + src_port`. Within the correlation window, the client's source IP and ephemeral port identify a single TCP connection to the server.
2. **Time window**: Events must arrive within the configured window (default 10 seconds).
3. **Matching mode**:
   - `one_to_one`: Each B event matches at most one A event (consumed after match).
   - `one_to_many` (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
4. **Orphan handling**: Unmatched A events are emitted after a configurable delay (default 500 ms) with `correlated=false` and `orphan_side=A`.
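
The matching rules above can be sketched as a small in-memory cache of B events. This is a simplified Python model, not the Go implementation: the `Correlator` class and its method names are hypothetical, timestamps are passed in explicitly, and it assumes B arrives before A. Buffering A events that wait for a late B, and the orphan delay, are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Correlator:
    window: float = 10.0      # seconds a B event stays matchable in one_to_one mode
    b_ttl: float = 120.0      # seconds a B event stays matchable in one_to_many mode
    one_to_many: bool = True
    # (src_ip, src_port) -> (B event, expiry timestamp)
    _pending_b: dict = field(default_factory=dict, init=False)

    def add_b(self, event: dict, now: float) -> None:
        """Cache a TLS event under its connection key."""
        lifetime = self.b_ttl if self.one_to_many else self.window
        self._pending_b[(event["src_ip"], event["src_port"])] = (event, now + lifetime)

    def match_a(self, event: dict, now: float) -> Optional[dict]:
        """Return the matching B event, or None if the A event is an orphan candidate."""
        key = (event["src_ip"], event["src_port"])
        entry = self._pending_b.get(key)
        if entry is None:
            return None
        b_event, expires_at = entry
        if now > expires_at:
            del self._pending_b[key]  # expired: drop it and report no match
            return None
        if self.one_to_many:
            self._pending_b[key] = (b_event, now + self.b_ttl)  # TTL resets on each match
        else:
            del self._pending_b[key]  # one_to_one: B is consumed by its first match
        return b_event


tls = {"src_ip": "203.0.113.7", "src_port": 51544, "ja4": "t13d1516h2_8daaf6152771_b0da82dd1658"}
req = {"src_ip": "203.0.113.7", "src_port": 51544, "path": "/"}

keep_alive = Correlator()
keep_alive.add_b(tls, now=0.0)
first = keep_alive.match_a(req, now=1.0)     # match, expiry moves to 121.0
second = keep_alive.match_a(req, now=100.0)  # still matches because the TTL was reset
late = keep_alive.match_a(req, now=300.0)    # expired, treated as an orphan candidate
```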
## JA4/JA3 Fingerprint Format
### JA4
JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:
```
t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}
```
Example: `t13d1516h2_8daaf6152771_b0da82dd1658`
- Prefix `t` = TLS over TCP, followed by the TLS version (`13` = TLS 1.3)
- `d` = SNI present, `i` = SNI absent
- Two-digit cipher suite count (`15`) and two-digit extension count (`16`)
- First and last character of the first ALPN value (`h2`)
- Two SHA-256 hashes, each truncated to 12 hex characters: one over the sorted cipher suites, one over the sorted extensions
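
As an illustration, a JA4 string can be split back into its fields with one regular expression. This is a sketch for reading fingerprints, not sentinel's generator: the `JA4` tuple and `parse_ja4` are hypothetical names. The two characters just before the first underscore (`h2` in the example) are the ALPN marker defined by the JA4 specification.

```python
import re
from typing import NamedTuple


class JA4(NamedTuple):
    protocol: str
    tls_version: str
    sni: str
    cipher_count: int
    ext_count: int
    alpn: str
    cipher_hash: str
    ext_hash: str


_JA4_RE = re.compile(
    r"^(?P<protocol>[a-z])(?P<tls_version>[0-9a-z]{2})(?P<sni>[di])"
    r"(?P<cipher_count>\d{2})(?P<ext_count>\d{2})(?P<alpn>[0-9a-z]{2})"
    r"_(?P<cipher_hash>[0-9a-f]{12})_(?P<ext_hash>[0-9a-f]{12})$"
)


def parse_ja4(fingerprint: str) -> JA4:
    """Split a JA4 fingerprint into named fields; raise ValueError if malformed."""
    match = _JA4_RE.match(fingerprint)
    if match is None:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    parts = match.groupdict()
    parts["cipher_count"] = int(parts["cipher_count"])
    parts["ext_count"] = int(parts["ext_count"])
    return JA4(**parts)


parsed = parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658")
# parsed.tls_version == "13", parsed.cipher_count == 15, parsed.ext_count == 16, parsed.alpn == "h2"
```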
### JA3
JA3 is the original TLS fingerprinting format:
```
{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}
```
The `ja3_hash` is the MD5 hash of the JA3 string.
Both fingerprints are generated by sentinel from the TLS ClientHello payload.
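
A JA3 string is built by joining the decimal values inside each field with `-` and the five fields with `,`, then hashing the result with MD5. The Python sketch below shows that construction. The helper names are hypothetical and the ClientHello values are made up for the example, not captured traffic.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats) -> str:
    """Join the five JA3 fields: values with '-', fields with ','."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in values) for values in fields)


def ja3_hash(ja3: str) -> str:
    """MD5 hex digest of the JA3 string (32 hex characters)."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


example = ja3_string(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
digest = ja3_hash(example)
```

Unlike JA4, the JA3 string keeps cipher suites and extensions in the order the client sent them, so two clients offering the same values in a different order get different hashes.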
## Technology Stack
| Component | Technology |
|-----------|-----------|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |