ja4-platform/docs/architecture.md
toto d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00


Architecture

The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.

System Architecture

┌───────────────────────────────────────────────────────────────────────────────────┐
│                              Target Linux Server                                  │
│                                                                                   │
│  ┌─────────────┐  HTTP req    ┌───────────────────────┐   UNIX socket (DGRAM)     │
│  │   Client     │────────────▶│   Apache HTTPD        │──────────────┐            │
│  │  (browser /  │             │   + mod-reqin-log     │              │            │
│  │   bot)       │             └───────────────────────┘              │            │
│  │              │                                                    ▼            │
│  │              │  TLS CH     ┌───────────────────────┐   ┌─────────────────────┐ │
│  │              │────────────▶│   sentinel             │   │   correlator        │ │
│  │              │  (pcap)     │   (packet capture)     │──▶│   (event join)      │ │
│  └─────────────┘             └───────────────────────┘   └────────┬────────────┘ │
│                                                                    │              │
└────────────────────────────────────────────────────────────────────┼──────────────┘
                                                                     │ INSERT JSON
                                                                     ▼
                                                          ┌─────────────────────┐
                                                          │     ClickHouse      │
                                                          │     mabase_prod     │
                                                          │                     │
                                                          │  http_logs_raw      │
                                                          │  ──(MV)──▶ http_logs│
                                                          │  ──(MV)──▶ agg_*   │
                                                          │  view_ai_features   │
                                                          │  ml_detected_anom.  │
                                                          │  ml_all_scores      │
                                                          └──────┬──────┬───────┘
                                                                 │      │
                                              ┌──────────────────┘      └──────────────────┐
                                              ▼                                            ▼
                                   ┌──────────────────────┐                     ┌──────────────────────┐
                                   │    bot-detector       │                     │    dashboard          │
                                   │    (Python)           │                     │    (FastAPI + React)  │
                                   │                       │                     │                       │
                                   │  Reads:               │                     │  Reads:               │
                                   │    view_ai_features   │                     │    ml_detected_anom.  │
                                   │    view_ip_recurrence │                     │    ml_all_scores      │
                                   │  Writes:              │                     │    http_logs          │
                                   │    ml_detected_anom.  │                     │    agg_* tables       │
                                   │    ml_all_scores      │                     │    audit_logs         │
                                   └──────────────────────┘                     └──────────────────────┘

Data Flow

1. Capture Phase

  1. mod-reqin-log (Apache C module) hooks into post_read_request. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to /var/run/logcorrelator/http.socket.

  2. sentinel (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to /var/run/logcorrelator/network.socket.
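Both producers in the capture phase hand events to the correlator as one JSON document per datagram. The following is a minimal Python sketch of that IPC pattern, not the actual C or Go code. The helper names (`send_event`, `open_listener`, `demo`) and the event fields are hypothetical, and the demo binds a temporary socket path rather than the real paths under /var/run/logcorrelator/.

```python
import json
import os
import socket
import tempfile


def send_event(sock_path: str, event: dict) -> None:
    """Serialize one event as compact JSON and send it as a single datagram."""
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(payload, sock_path)


def open_listener(sock_path: str) -> socket.socket:
    """Bind a datagram socket, removing a stale socket file if one exists."""
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    server.bind(sock_path)
    return server


def demo() -> dict:
    """Round-trip one hypothetical HTTP event through a temporary socket."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "http.socket")
        server = open_listener(path)
        try:
            send_event(path, {"src_ip": "203.0.113.7", "src_port": 51234,
                              "method": "GET", "path": "/"})
            data, _ = server.recvfrom(65535)  # one datagram == one event
            return json.loads(data)
        finally:
            server.close()
```

Datagram sockets preserve message boundaries, so the receiver needs no framing logic: each `recvfrom` call yields exactly one JSON event.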

2. Correlation Phase

  1. correlator (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching src_ip:src_port within a configurable time window (default: 10 s). HTTP Keep-Alive connections are handled by the one_to_many matching mode, in which a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Each correlated event merges the HTTP fields (method, path, headers) with the TLS fields (JA4, JA3, IP/TCP metadata) into a single CorrelatedLog JSON object, which is inserted into http_logs_raw.

3. Enrichment Phase (ClickHouse)

  1. mv_http_logs materialized view automatically transforms http_logs_raw JSON into the structured http_logs table, enriching each row with:

    • ASN/geo data via dict_iplocate_asn
    • Anubis bot identification via dict_anubis_ua, dict_anubis_ip, dict_anubis_asn, dict_anubis_country

  2. mv_agg_host_ip_ja4_1h and mv_agg_header_fingerprint_1h aggregate http_logs into 1-hour behavioral windows.

  3. view_ai_features_1h joins the two aggregation tables and computes 50+ ML features per (src_ip, ja4, host) tuple.
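In the platform this windowing happens inside ClickHouse through the materialized views. As an illustration of the idea only, the Python sketch below truncates timestamps to 1-hour buckets and counts requests per (window, src_ip, ja4, host). The row field names (`time`, `host`) and the function names are assumptions, and a request count stands in for the real feature set.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its 1-hour window."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hourly(rows):
    """Count requests per (window_start, src_ip, ja4, host)."""
    counts = Counter()
    for row in rows:
        key = (hour_bucket(row["time"]), row["src_ip"], row["ja4"], row["host"])
        counts[key] += 1
    return counts
```

Each key of the resulting counter corresponds to one row of an hourly aggregate; the real views derive many more columns per key.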

4. Detection Phase

  1. bot-detector (Python) runs on a 5-minute cycle:
    • Reads view_ai_features_1h for the last 24 hours
    • Separates known bots (via reputation dictionaries) from unknown traffic
    • Trains/loads Isolation Forest models on human-baseline traffic
    • Scores unknown traffic and writes anomalies to ml_detected_anomalies and all scores to ml_all_scores

5. Visualization Phase

  1. dashboard (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.

Component Interaction Matrix

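| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|---|---|---|---|---|---|---|
| mod-reqin-log | | | UNIX socket (DGRAM) | | | |
| sentinel | | | UNIX socket (DGRAM) | | | |
| correlator | | | | Native TCP :9000 (INSERT) | | |
| ClickHouse | | | | | | |
| bot-detector | | | | HTTP :8123 (SELECT/INSERT) | | |
| dashboard | | | | HTTP :8123 (SELECT/INSERT) | | |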

ClickHouse Table Ownership

| Table/View | Written By | Read By |
|---|---|---|
| http_logs_raw | correlator | mv_http_logs (MV) |
| http_logs | mv_http_logs (MV) | mv_agg_*, dashboard |
| agg_host_ip_ja4_1h | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| agg_header_fingerprint_1h | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| view_ai_features_1h | — (view) | bot-detector |
| view_ip_recurrence | — (view) | bot-detector |
| ml_detected_anomalies | bot-detector | dashboard |
| ml_all_scores | bot-detector | dashboard |
| audit_logs | dashboard | dashboard |

Correlation Algorithm

The correlator joins HTTP events (source A) with TLS/network events (source B) on a connection key, bounded by a time window, according to the following rules:

  1. Key: src_ip + src_port — the client's source IP and ephemeral port uniquely identify a TCP connection.
  2. Time window: Events must arrive within the configured window (default 10 seconds).
  3. Matching mode:
    • one_to_one: Each B event matches at most one A event (consumed after match).
    • one_to_many (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
  4. Orphan handling: Unmatched A events are emitted after a configurable delay (default 500 ms) with correlated=false and orphan_side=A.
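The rules above can be sketched as a small in-memory matcher. This is an illustrative Python model, not the Go implementation. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and two behaviours are assumptions the text does not spell out: a B event in one_to_one mode stays eligible for one time window, and an A event that arrives before its B waits in a pending queue until the orphan delay elapses.

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    window: float = 10.0        # seconds a B event stays eligible in one_to_one mode
    b_ttl: float = 120.0        # B lifetime in one_to_many mode, reset on each match
    orphan_delay: float = 0.5   # how long an unmatched A waits for its B
    one_to_many: bool = True
    _b: dict = field(default_factory=dict)          # key -> (event, expires_at)
    _pending_a: dict = field(default_factory=dict)  # key -> [(event, arrived_at)]

    @staticmethod
    def _merge(a: dict, b: dict) -> dict:
        return {**b, **a, "correlated": True}

    def on_b(self, key, event, now):
        """Handle a TLS event; returns merged events for any waiting A events."""
        waiting = self._pending_a.pop(key, [])
        if self.one_to_many:
            self._b[key] = (event, now + self.b_ttl)
            return [self._merge(a, event) for a, _ in waiting]
        if waiting:
            (a, _), rest = waiting[0], waiting[1:]
            if rest:
                self._pending_a[key] = rest   # B is consumed by the first A only
            return [self._merge(a, event)]
        self._b[key] = (event, now + self.window)
        return []

    def on_a(self, key, event, now):
        """Handle an HTTP event; returns a merged event if a live B exists."""
        entry = self._b.get(key)
        if entry is not None and now <= entry[1]:
            b = entry[0]
            if self.one_to_many:
                self._b[key] = (b, now + self.b_ttl)  # TTL resets on match
            else:
                del self._b[key]                      # consumed after match
            return [self._merge(event, b)]
        self._b.pop(key, None)                        # drop an expired B, if any
        self._pending_a.setdefault(key, []).append((event, now))
        return []

    def flush_orphans(self, now):
        """Emit A events that have waited at least orphan_delay without a match."""
        out = []
        for key in list(self._pending_a):
            keep = []
            for a, arrived in self._pending_a[key]:
                if now - arrived >= self.orphan_delay:
                    out.append({**a, "correlated": False, "orphan_side": "A"})
                else:
                    keep.append((a, arrived))
            if keep:
                self._pending_a[key] = keep
            else:
                del self._pending_a[key]
        return out
```

The key would be the `(src_ip, src_port)` tuple. A caller feeds events through `on_a` and `on_b` as they arrive and calls `flush_orphans` periodically, for example from a timer.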

JA4/JA3 Fingerprint Format

JA4

JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:

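t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}

Example: t13d1516h2_8daaf6152771_b0da82dd1658

  • Prefix t = TLS over TCP, followed by the version (13 = TLS 1.3)
  • d = SNI present, i = SNI absent
  • Two-digit cipher suite count (15) and two-digit extension count (16)
  • ALPN indicator: first and last character of the first ALPN value (h2)
  • Two SHA-256 hashes, each truncated to 12 hex characters: one over the sorted cipher suites, one over the sorted extensions (in the JA4 specification the second hash also covers the signature algorithms)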
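To make the layout concrete, here is a small Python sketch that splits a JA4 string into its fields. It only parses an existing fingerprint and does not compute one from a ClientHello. The function name `parse_ja4` and the output keys are hypothetical. The two characters before the first underscore (`h2` in the example) are the ALPN indicator defined by the JA4 specification.

```python
import re

# Layout: protocol, 2-char version, SNI flag, 2-digit cipher count,
# 2-digit extension count, 2-char ALPN indicator, then two 12-hex-char hashes.
JA4_RE = re.compile(
    r"(?P<proto>[a-z])(?P<version>[0-9a-z]{2})(?P<sni>[di])"
    r"(?P<ciphers>\d{2})(?P<extensions>\d{2})(?P<alpn>[0-9a-z]{2})"
    r"_(?P<cipher_hash>[0-9a-f]{12})_(?P<ext_hash>[0-9a-f]{12})"
)


def parse_ja4(fingerprint: str) -> dict:
    """Split a JA4 fingerprint into named fields; raise ValueError if malformed."""
    m = JA4_RE.fullmatch(fingerprint)
    if m is None:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    parts = m.groupdict()
    return {
        "protocol": parts["proto"],            # "t" = TLS over TCP
        "tls_version": parts["version"],       # "13" = TLS 1.3
        "sni_present": parts["sni"] == "d",
        "cipher_count": int(parts["ciphers"]),
        "extension_count": int(parts["extensions"]),
        "alpn": parts["alpn"],
        "cipher_hash": parts["cipher_hash"],
        "extension_hash": parts["ext_hash"],
    }
```

Applied to the example fingerprint, the sketch reports TLS 1.3 with SNI present, 15 cipher suites, 16 extensions and ALPN `h2`.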

JA3

JA3 is the original TLS fingerprinting format:

{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}

The ja3_hash is the MD5 hash of the JA3 string.
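As a hedged illustration of that relationship, the Python sketch below assembles a JA3 string from already-extracted numeric values and hashes it with MD5. The function names and the sample values are hypothetical. Following the JA3 convention, values inside a field are joined with `-` and fields with `,`. Extracting the values from a ClientHello, including dropping GREASE values, is out of scope here.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Join the five JA3 fields: ',' between fields, '-' between values."""
    lists = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in vals) for vals in lists]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """Return the MD5 hex digest of a JA3 string (32 hex characters)."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, for illustration only.
example = ja3_string(771, [4865, 4866], [0, 10], [29], [0])
example_hash = ja3_hash(example)
```

An empty list yields an empty field, so the string always contains exactly four commas.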

Both fingerprints are generated by sentinel from the TLS ClientHello payload.

Technology Stack

| Component | Technology |
|---|---|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |