ja4-platform/docs/architecture.md
toto 9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00


Architecture

The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.

System Architecture

┌───────────────────────────────────────────────────────────────────────────────────┐
│                              Target Linux Server                                  │
│                                                                                   │
│  ┌─────────────┐  HTTP req    ┌───────────────────────┐   UNIX socket (DGRAM)     │
│  │   Client     │────────────▶│   Apache HTTPD        │──────────────┐            │
│  │  (browser /  │             │   + mod-reqin-log     │              │            │
│  │   bot)       │             └───────────────────────┘              │            │
│  │              │                                                    ▼            │
│  │              │  TLS CH     ┌───────────────────────┐   ┌─────────────────────┐ │
│  │              │────────────▶│   sentinel             │   │   correlator        │ │
│  │              │  (pcap)     │   (packet capture)     │──▶│   (event join)      │ │
│  └─────────────┘             └───────────────────────┘   └────────┬────────────┘ │
│                                                                    │              │
└────────────────────────────────────────────────────────────────────┼──────────────┘
                                                                     │ INSERT JSON
                                                                     ▼
                                                          ┌─────────────────────┐
                                                          │     ClickHouse      │
                                                          │  ja4_logs +         │
                                                          │  ja4_processing     │
                                                          │                     │
                                                          │  http_logs_raw      │
                                                          │  ──(MV)──▶ http_logs│
                                                          │  ──(MV)──▶ agg_*   │
                                                          │  view_ai_features   │
                                                          │  ml_detected_anom.  │
                                                          │  ml_all_scores      │
                                                          └──────┬──────┬───────┘
                                                                 │      │
                                              ┌──────────────────┘      └──────────────────┐
                                              ▼                                            ▼
                                   ┌──────────────────────┐                     ┌──────────────────────┐
                                   │    bot-detector       │                     │    dashboard          │
                                   │    (Python)           │                     │    (FastAPI + React)  │
                                   │                       │                     │                       │
                                   │  Reads:               │                     │  Reads:               │
                                   │    view_ai_features   │                     │    ml_detected_anom.  │
                                   │    view_ip_recurrence │                     │    ml_all_scores      │
                                   │  Writes:              │                     │    http_logs          │
                                   │    ml_detected_anom.  │                     │    agg_* tables       │
                                   │    ml_all_scores      │                     │    audit_logs         │
                                   └──────────────────────┘                     └──────────────────────┘

Data Flow

1. Capture Phase

  1. mod-reqin-log (Apache C module) hooks into post_read_request. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to /var/run/logcorrelator/http.socket.

  2. sentinel (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to /var/run/logcorrelator/network.socket.
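
As a rough illustration of the IPC used in the capture phase, the following Python sketch sends one JSON event over a UNIX datagram socket. The real producers are the C module and the Go daemon; the `send_event` helper and the field names in the usage note are hypothetical, and the actual payload schema is not shown in this document.

```python
import json
import socket


def send_event(socket_path: str, event: dict) -> int:
    """Serialize `event` as JSON and send it as a single datagram.

    Returns the number of bytes sent. The keys of `event` are chosen by the
    caller; this sketch does not know the real payload schema.
    """
    payload = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        # A datagram socket needs no connect(); each sendto() is one message.
        return sock.sendto(payload, socket_path)
```

A call such as `send_event("/var/run/logcorrelator/http.socket", {"src_ip": "192.0.2.1", "src_port": 40000})` would deliver one event, provided a receiver is already bound to that path.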

2. Correlation Phase

  1. correlator (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching src_ip:src_port within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported through the one_to_many matching mode, in which a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Each correlated event merges HTTP fields (method, path, headers) with TLS fields (JA4, JA3, IP/TCP metadata) into a single CorrelatedLog JSON object, which is inserted into http_logs_raw.

3. Enrichment Phase (ClickHouse)

  1. mv_http_logs materialized view automatically transforms http_logs_raw JSON into the structured http_logs table, enriching each row with:

    • ASN/geo data via dict_iplocate_asn
    • Anubis bot identification via dict_anubis_ua, dict_anubis_ip, dict_anubis_asn, dict_anubis_country

  2. mv_agg_host_ip_ja4_1h and mv_agg_header_fingerprint_1h aggregate http_logs into 1-hour behavioral windows.

  3. view_ai_features_1h joins the two aggregation tables and computes 50+ ML features per (src_ip, ja4, host) tuple.
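
The 1-hour windows can be pictured as grouping rows by the hour in which they fall. The Python sketch below shows only that idea; it is not the ClickHouse materialized view. The row keys (`time`, `src_ip`, `ja4`, `host`) and the single request counter are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hits(rows):
    """Count requests per (hour window, src_ip, ja4, host)."""
    counts = Counter()
    for row in rows:
        key = (hour_bucket(row["time"]), row["src_ip"], row["ja4"], row["host"])
        counts[key] += 1
    return counts
```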

4. Detection Phase

  1. bot-detector (Python) runs on a 5-minute cycle:
    • Reads view_ai_features_1h for the last 24 hours
    • Separates known bots (via reputation dictionaries) from unknown traffic
    • Trains/loads Isolation Forest models on human-baseline traffic
    • Scores unknown traffic and writes anomalies to ml_detected_anomalies and all scores to ml_all_scores
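
The shape of one detection cycle can be sketched as below. This is a simplified outline, not the bot-detector code: `run_cycle`, `is_known_bot`, and `score` are hypothetical names, and `score` stands in for the trained Isolation Forest. The sketch assumes that lower scores mean more anomalous, which is how scikit-learn's `IsolationForest.decision_function` behaves; the threshold value is likewise an assumption.

```python
def run_cycle(rows, is_known_bot, score, threshold=0.0):
    """Split traffic, score the unknown part, and pick out anomalies.

    Returns (all_scores, anomalies), each a list of (row, score) pairs,
    mirroring the two output tables of the detection phase.
    """
    # Known bots are identified by reputation data and are not scored.
    unknown = [row for row in rows if not is_known_bot(row)]
    all_scores = [(row, score(row)) for row in unknown]
    # Assumption: a score below the threshold marks the row as anomalous.
    anomalies = [(row, s) for row, s in all_scores if s < threshold]
    return all_scores, anomalies
```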

5. Visualization Phase

  1. dashboard (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.

Component Interaction Matrix

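| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|---|---|---|---|---|---|---|
| mod-reqin-log | | | UNIX socket (DGRAM) | | | |
| sentinel | | | UNIX socket (DGRAM) | | | |
| correlator | | | | Native TCP :9000 (INSERT) | | |
| ClickHouse | | | | | | |
| bot-detector | | | | HTTP :8123 (SELECT/INSERT) | | |
| dashboard | | | | HTTP :8123 (SELECT/INSERT) | | |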

ClickHouse Table Ownership

| Table/View | Written By | Read By |
|---|---|---|
| http_logs_raw | correlator | mv_http_logs (MV) |
| http_logs | mv_http_logs (MV) | mv_agg_*, dashboard |
| agg_host_ip_ja4_1h | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| agg_header_fingerprint_1h | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| view_ai_features_1h | n/a (view) | bot-detector |
| view_ip_recurrence | n/a (view) | bot-detector |
| ml_detected_anomalies | bot-detector | dashboard |
| ml_all_scores | bot-detector | dashboard |
| audit_logs | dashboard | dashboard |
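
Because these objects are split across two configurable databases (ja4_logs for http_logs_raw, http_logs and mv_http_logs; ja4_processing for the rest), client code has to qualify table names. The sketch below shows one way this could look in Python. The environment variable names and defaults come from the configuration described for this change; the `qualified` helper and the `LOGS_OBJECTS` set are hypothetical and not taken from the shared settings module.

```python
import os

# Database names, read from the environment with the documented defaults.
DB_LOGS = os.environ.get("CLICKHOUSE_DB_LOGS", "ja4_logs")
DB_PROCESSING = os.environ.get("CLICKHOUSE_DB_PROCESSING", "ja4_processing")

# Objects that live in the logs database; everything else is assumed
# to live in the processing database.
LOGS_OBJECTS = {"http_logs_raw", "http_logs", "mv_http_logs"}


def qualified(table, db_logs=DB_LOGS, db_processing=DB_PROCESSING):
    """Return `database.table` for use in a query string."""
    db = db_logs if table in LOGS_OBJECTS else db_processing
    return f"{db}.{table}"
```

With the defaults, `qualified("ml_all_scores")` yields `ja4_processing.ml_all_scores`.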

Correlation Algorithm

The correlator joins HTTP events (source A) with TLS/network events (source B) on a shared connection key, subject to a time window and a matching mode:

  1. Key: src_ip + src_port — the client's source IP and ephemeral port uniquely identify a TCP connection.
  2. Time window: Events must arrive within the configured window (default 10 seconds).
  3. Matching mode:
    • one_to_one: Each B event matches at most one A event (consumed after match).
    • one_to_many (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
  4. Orphan handling: Unmatched A events are emitted after a configurable delay (default 500 ms) with correlated=false and orphan_side=A.
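
The rules above can be modelled in a few lines of Python. This is a toy model for reasoning about the behavior, not the Go implementation: the `Correlator` class and its method names are hypothetical, times are plain floats in seconds, and in one_to_many mode only the B-event TTL is enforced (the separate time window is used as the B lifetime in one_to_one mode).

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    window: float = 10.0        # B lifetime in one_to_one mode
    b_ttl: float = 120.0        # B lifetime in one_to_many mode
    orphan_delay: float = 0.5   # how long an unmatched A waits for its B
    one_to_many: bool = True
    b_events: dict = field(default_factory=dict)   # key -> [event, expires_at]
    pending_a: list = field(default_factory=list)  # (key, event, arrived_at)

    def on_b(self, key, event, now):
        """Store a TLS event, then match HTTP events already waiting for it."""
        lifetime = self.b_ttl if self.one_to_many else self.window
        self.b_events[key] = [event, now + lifetime]
        matched = []
        for item in [p for p in self.pending_a if p[0] == key]:
            merged = self._try_match(key, item[1], now)
            if merged is None:  # B was consumed (one_to_one)
                break
            self.pending_a.remove(item)
            matched.append(merged)
        return matched

    def on_a(self, key, event, now):
        """Match an HTTP event now, or park it until B arrives or it is orphaned."""
        merged = self._try_match(key, event, now)
        if merged is None:
            self.pending_a.append((key, event, now))
        return merged

    def _try_match(self, key, a_event, now):
        entry = self.b_events.get(key)
        if entry is None or now > entry[1]:
            self.b_events.pop(key, None)  # drop an expired B event
            return None
        if self.one_to_many:
            entry[1] = now + self.b_ttl   # the TTL resets on each match
        else:
            del self.b_events[key]        # consumed after one match
        return {**entry[0], **a_event, "correlated": True}

    def flush_orphans(self, now):
        """Emit A events that waited longer than orphan_delay without a match."""
        due = [p for p in self.pending_a if now - p[2] >= self.orphan_delay]
        self.pending_a = [p for p in self.pending_a if now - p[2] < self.orphan_delay]
        return [{**a, "correlated": False, "orphan_side": "A"} for _, a, _ in due]
```

The key passed to `on_a` and `on_b` would be the `(src_ip, src_port)` tuple. A real daemon would call `flush_orphans` from a timer rather than on demand.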

JA4/JA3 Fingerprint Format

JA4

JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:

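t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}

Example: t13d1516h2_8daaf6152771_b0da82dd1658

  • Prefix t = TLS over TCP, followed by the TLS version (13 = TLS 1.3)
  • d = SNI present, i = SNI absent
  • Two-digit cipher suite count (15) and two-digit extension count (16)
  • ALPN marker: first and last character of the first ALPN value (h2 = HTTP/2), or 00 when no ALPN is offered
  • Two SHA-256 hashes, each truncated to 12 hex characters: one over the sorted cipher suites, one over the sorted extensions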
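
To make the layout concrete, the following Python sketch splits a JA4 string into its parts. It is a minimal parser written for this document, not code from sentinel; the `JA4` tuple, the `parse_ja4` function, and the exact character classes in the pattern are assumptions. The pattern expects a two-character ALPN marker after the counts (h2 in the example above) and also accepts the q prefix that the JA4 specification uses for QUIC.

```python
import re
from typing import NamedTuple


class JA4(NamedTuple):
    protocol: str      # "t" for TLS over TCP
    tls_version: str   # e.g. "13" for TLS 1.3
    sni: str           # "d" = SNI present, "i" = SNI absent
    cipher_count: int
    ext_count: int
    alpn: str          # e.g. "h2"
    cipher_hash: str
    ext_hash: str


_JA4_RE = re.compile(
    r"^([tq])([0-9a-z]{2})([di])(\d{2})(\d{2})([0-9a-z]{2})"
    r"_([0-9a-f]{12})_([0-9a-f]{12})$"
)


def parse_ja4(value: str) -> JA4:
    """Split a JA4 fingerprint into named fields, or raise ValueError."""
    m = _JA4_RE.match(value)
    if m is None:
        raise ValueError(f"not a JA4 fingerprint: {value!r}")
    proto, ver, sni, n_ciphers, n_exts, alpn, c_hash, e_hash = m.groups()
    return JA4(proto, ver, sni, int(n_ciphers), int(n_exts), alpn, c_hash, e_hash)
```

For the example fingerprint, `parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658")` reports 15 cipher suites, 16 extensions, and ALPN `h2`.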

JA3

JA3 is the original TLS fingerprinting format:

{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}

The ja3_hash is the MD5 hash of the JA3 string.
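
As an illustration, the JA3 string and its hash can be built as follows in Python. Within each field, JA3 joins the decimal values with dashes. The function names are hypothetical, and the numeric values in the usage note are sample inputs rather than a captured ClientHello.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in f) for f in fields)


def ja3_hash(ja3: str) -> str:
    """Return the MD5 hex digest of a JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()
```

For instance, `ja3_string(769, [47, 53], [0, 10, 11], [23, 24, 25], [0])` produces `769,47-53,0-10-11,23-24-25,0`, and `ja3_hash` of that string is a 32-character hex digest.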

Both fingerprints are generated by sentinel from the TLS ClientHello payload.

Technology Stack

| Component | Technology |
|---|---|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |