# Architecture

The ja4-platform is a security pipeline that captures live network traffic, generates JA4/JA3 TLS fingerprints, correlates them with HTTP requests, applies machine-learning anomaly detection, and surfaces results through a SOC analyst dashboard. ClickHouse serves as the central data store linking all services.

## System Architecture

```
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                Target Linux Server                                │
│                                                                                   │
│  ┌─────────────┐  HTTP req   ┌───────────────────────┐  UNIX socket (DGRAM)       │
│  │  Client     │────────────▶│  Apache HTTPD         │──────────────┐             │
│  │  (browser / │             │  + mod-reqin-log      │              │             │
│  │   bot)      │             └───────────────────────┘              │             │
│  │             │                                                    ▼             │
│  │             │  TLS CH     ┌───────────────────────┐   ┌─────────────────────┐  │
│  │             │────────────▶│  sentinel             │   │  correlator         │  │
│  │             │  (pcap)     │  (packet capture)     │──▶│  (event join)       │  │
│  └─────────────┘             └───────────────────────┘   └────────┬────────────┘  │
│                                                                   │               │
└───────────────────────────────────────────────────────────────────┼───────────────┘
                                                                    │ INSERT JSON
                                                                    ▼
                                                         ┌─────────────────────┐
                                                         │  ClickHouse         │
                                                         │  mabase_prod        │
                                                         │                     │
                                                         │  http_logs_raw      │
                                                         │  ──(MV)──▶ http_logs│
                                                         │  ──(MV)──▶ agg_*    │
                                                         │  view_ai_features   │
                                                         │  ml_detected_anom.  │
                                                         │  ml_all_scores      │
                                                         └──────┬──────┬───────┘
                                                                │      │
                                             ┌──────────────────┘      └──────────────────┐
                                             ▼                                            ▼
                                  ┌──────────────────────┐                     ┌──────────────────────┐
                                  │  bot-detector        │                     │  dashboard           │
                                  │  (Python)            │                     │  (FastAPI + React)   │
                                  │                      │                     │                      │
                                  │  Reads:              │                     │  Reads:              │
                                  │   view_ai_features   │                     │   ml_detected_anom.  │
                                  │   view_ip_recurrence │                     │   ml_all_scores      │
                                  │  Writes:             │                     │   http_logs          │
                                  │   ml_detected_anom.  │                     │   agg_* tables       │
                                  │   ml_all_scores      │                     │   audit_logs         │
                                  └──────────────────────┘                     └──────────────────────┘
```

## Data Flow

### 1. Capture Phase

1. **mod-reqin-log** (Apache C module) hooks into `post_read_request`. On each HTTP request, it serializes method, path, headers, client IP/port into JSON and sends it via UNIX datagram socket to `/var/run/logcorrelator/http.socket`.
2. **sentinel** (Go daemon) uses libpcap to capture live TLS ClientHello packets on configured ports (default: 443, 8443). It extracts IP/TCP metadata, generates JA4 and JA3 fingerprints, and sends the result as JSON via UNIX datagram socket to `/var/run/logcorrelator/network.socket`.

### 2. Correlation Phase

3. **correlator** (Go daemon) listens on both UNIX sockets. It buffers incoming events and correlates them by matching `src_ip:src_port` within a configurable time window (default: 10 s). HTTP Keep-Alive connections are supported via `one_to_many` matching mode where a single TLS handshake (source B) is reused for multiple HTTP requests (source A). Correlated events merge HTTP fields (method, path, headers) with TLS fields (JA4, JA3, IP/TCP metadata) into a single `CorrelatedLog` JSON object, which is inserted into `http_logs_raw`.

### 3. Enrichment Phase (ClickHouse)

4. **mv_http_logs** materialized view automatically transforms `http_logs_raw` JSON into the structured `http_logs` table, enriching each row with:
   - ASN/geo data via `dict_iplocate_asn`
   - Anubis bot identification via `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`
5. **mv_agg_host_ip_ja4_1h** and **mv_agg_header_fingerprint_1h** aggregate `http_logs` into 1-hour behavioral windows.
6. **view_ai_features_1h** joins the two aggregation tables and computes 50+ ML features per `(src_ip, ja4, host)` tuple.

### 4. Detection Phase

7. **bot-detector** (Python) runs on a 5-minute cycle:
   - Reads `view_ai_features_1h` for the last 24 hours
   - Separates known bots (via reputation dictionaries) from unknown traffic
   - Trains/loads Isolation Forest models on human-baseline traffic
   - Scores unknown traffic and writes anomalies to `ml_detected_anomalies` and all scores to `ml_all_scores`

### 5. Visualization Phase

8. **dashboard** (FastAPI + React) queries ClickHouse to display detections, feature analysis, investigation summaries, and clustering to SOC analysts.
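
The UNIX datagram hand-off used in the capture phase (steps 1 and 2) can be sketched in a few lines. This is a minimal Python illustration, not the actual C or Go implementation: the helper names `send_event` and `recv_event` and the JSON field names are hypothetical, and `socket.socketpair` stands in for the real socket paths so the example runs without touching `/var/run/logcorrelator/`.

```python
import json
import socket


def send_event(sock: socket.socket, event: dict) -> None:
    """Serialize one event as compact JSON and send it as a single datagram."""
    sock.send(json.dumps(event, separators=(",", ":")).encode("utf-8"))


def recv_event(sock: socket.socket, max_size: int = 65536) -> dict:
    """Receive exactly one datagram and decode it as a JSON object."""
    return json.loads(sock.recv(max_size).decode("utf-8"))


if __name__ == "__main__":
    # Assumption: a connected pair replaces the bound socket file used in
    # production; a real producer would connect() to the socket path instead.
    producer, consumer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Field names below are illustrative only.
        send_event(producer, {
            "src_ip": "203.0.113.7",
            "src_port": 51000,
            "method": "GET",
            "path": "/index.html",
        })
        print(recv_event(consumer))
    finally:
        producer.close()
        consumer.close()
```

Because each event travels as one datagram, message boundaries are preserved and the receiver needs no extra framing logic on top of JSON decoding.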
## Component Interaction Matrix

| From → To | mod-reqin-log | sentinel | correlator | ClickHouse | bot-detector | dashboard |
|-----------|:---:|:---:|:---:|:---:|:---:|:---:|
| **mod-reqin-log** | — | — | UNIX socket (DGRAM) | — | — | — |
| **sentinel** | — | — | UNIX socket (DGRAM) | — | — | — |
| **correlator** | — | — | — | Native TCP :9000 (INSERT) | — | — |
| **ClickHouse** | — | — | — | — | — | — |
| **bot-detector** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |
| **dashboard** | — | — | — | HTTP :8123 (SELECT/INSERT) | — | — |

## ClickHouse Table Ownership

| Table/View | Written By | Read By |
|------------|-----------|---------|
| `http_logs_raw` | correlator | mv_http_logs (MV) |
| `http_logs` | mv_http_logs (MV) | mv_agg_*, dashboard |
| `agg_host_ip_ja4_1h` | mv_agg_host_ip_ja4_1h (MV) | view_ai_features_1h |
| `agg_header_fingerprint_1h` | mv_agg_header_fingerprint_1h (MV) | view_ai_features_1h |
| `view_ai_features_1h` | — (view) | bot-detector |
| `view_ip_recurrence` | — (view) | bot-detector |
| `ml_detected_anomalies` | bot-detector | dashboard |
| `ml_all_scores` | bot-detector | dashboard |
| `audit_logs` | dashboard | dashboard |

## Correlation Algorithm

The correlator joins HTTP events (source A) with TLS/network events (source B) on a connection key and a time window, subject to a matching mode and an orphan policy:

1. **Key**: `src_ip + src_port`. The client's source IP and ephemeral port uniquely identify a TCP connection.
2. **Time window**: Events must arrive within the configured window (default 10 seconds).
3. **Matching mode**:
   - `one_to_one`: Each B event matches at most one A event (consumed after match).
   - `one_to_many` (default, Keep-Alive): A single B (TLS handshake) can match multiple A events (HTTP requests) on the same connection. The B event has a configurable TTL (default 120 s) that resets on each match.
4. **Orphan handling**: Unmatched A events are emitted after a configurable delay (default 500 ms) with `correlated=false` and `orphan_side=A`.
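
The rules above can be sketched as a small in-memory join. This is a hedged Python illustration rather than the correlator's Go code: the class and method names (`Correlator`, `on_tls`, `on_http`, `flush`) are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and it assumes a B event is matchable for the time window after arrival and, in `one_to_many` mode, for the TTL after each match. How the window and TTL interact in the real service is not specified here.

```python
from dataclasses import dataclass


@dataclass
class _Tls:
    fields: dict     # TLS-side fields (ja4, ja3, ...)
    deadline: float  # after this time the B event can no longer match


class Correlator:
    """Toy join of HTTP (A) and TLS (B) events on (src_ip, src_port)."""

    def __init__(self, window=10.0, ttl=120.0, orphan_delay=0.5, one_to_many=True):
        self.window, self.ttl = window, ttl
        self.orphan_delay, self.one_to_many = orphan_delay, one_to_many
        self._tls = {}      # key -> _Tls
        self._pending = []  # buffered A events: (flush_at, key, http_fields)

    def _merge(self, key, http, now):
        b = self._tls.get(key)
        if b is None or now > b.deadline:
            self._tls.pop(key, None)     # drop an expired B event, if any
            return None
        if self.one_to_many:
            b.deadline = now + self.ttl  # TTL resets on each match
        else:
            del self._tls[key]           # B is consumed after one match
        return {**http, **b.fields, "correlated": True}

    def on_tls(self, ev, now):
        """Store a B event and match any A events already waiting for it."""
        key = (ev["src_ip"], ev["src_port"])
        self._tls[key] = _Tls(fields=ev, deadline=now + self.window)
        out, still_pending = [], []
        for flush_at, k, http in self._pending:
            merged = self._merge(k, http, now) if k == key else None
            if merged is not None:
                out.append(merged)
            else:
                still_pending.append((flush_at, k, http))
        self._pending = still_pending
        return out

    def on_http(self, ev, now):
        """Match an A event immediately, or buffer it until the orphan delay."""
        key = (ev["src_ip"], ev["src_port"])
        merged = self._merge(key, ev, now)
        if merged is not None:
            return [merged]
        self._pending.append((now + self.orphan_delay, key, ev))
        return []

    def flush(self, now):
        """Emit buffered A events whose orphan delay has elapsed."""
        due = [p for p in self._pending if p[0] <= now]
        self._pending = [p for p in self._pending if p[0] > now]
        return [{**http, "correlated": False, "orphan_side": "A"}
                for _, _, http in due]
```

A caller would feed events from both sockets into `on_tls` and `on_http` and call `flush` on a timer; every returned dictionary corresponds to one row destined for `http_logs_raw`.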
## JA4/JA3 Fingerprint Format

### JA4

JA4 is a modern TLS fingerprinting format (successor to JA3) with the structure:

```
t{TLS_VER}{SNI}{CIPHER_COUNT}{EXT_COUNT}{ALPN}_{CIPHER_HASH}_{EXT_HASH}
```

Example: `t13d1516h2_8daaf6152771_b0da82dd1658`

- Prefix `t` = TLS over TCP (`q` would indicate QUIC), followed by version (`13` = TLS 1.3)
- `d` = SNI present, `i` = SNI absent
- Cipher suite count and extension count, two digits each (`15` and `16`)
- ALPN indicator: first and last character of the first ALPN value (`h2`)
- Truncated SHA-256 hashes (12 hex characters each) of the sorted cipher suites and extensions

### JA3

JA3 is the original TLS fingerprinting format:

```
{TLS_VER},{CIPHERS},{EXTENSIONS},{ELLIPTIC_CURVES},{EC_POINT_FORMATS}
```

Each field holds decimal values joined by `-`. The `ja3_hash` is the MD5 hash of the JA3 string.

Both fingerprints are generated by sentinel from the TLS ClientHello payload.

## Technology Stack

| Component | Technology |
|-----------|-----------|
| Packet capture | Go + libpcap (gopacket) |
| HTTP logging | C Apache module (APR) |
| Event correlation | Go (hexagonal architecture) |
| ML detection | Python 3.11 + scikit-learn |
| Dashboard backend | FastAPI (Python) |
| Dashboard frontend | React + Vite |
| Data store | ClickHouse |
| Deployment | systemd, Docker, RPM |
| IPC | UNIX datagram sockets |
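
To make the fingerprint formats from the JA4/JA3 section concrete, the following Python sketch splits a JA4 string into its parts and computes a JA3 hash. It is an illustration only and not sentinel's Go implementation: the function names `parse_ja4` and `ja3_hash` and the sample ClientHello values are hypothetical. The two characters just before the first underscore (`h2` in the example fingerprint) are treated as the ALPN indicator.

```python
import hashlib
import re

# One regex for the whole fingerprint: prefix fields, then two 12-hex hashes.
JA4_RE = re.compile(
    r"^(?P<proto>[tq])(?P<version>[0-9a-z]{2})(?P<sni>[di])"
    r"(?P<ciphers>\d{2})(?P<extensions>\d{2})(?P<alpn>[0-9a-z]{2})"
    r"_(?P<cipher_hash>[0-9a-f]{12})_(?P<ext_hash>[0-9a-f]{12})$"
)


def parse_ja4(fingerprint: str) -> dict:
    """Split a JA4 string into named fields; raise ValueError if malformed."""
    m = JA4_RE.match(fingerprint)
    if m is None:
        raise ValueError(f"not a JA4 fingerprint: {fingerprint!r}")
    parts = m.groupdict()
    parts["sni_present"] = parts.pop("sni") == "d"
    parts["ciphers"] = int(parts["ciphers"])
    parts["extensions"] = int(parts["extensions"])
    return parts


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3 string (fields joined by ',', values by '-') and MD5 it."""
    fields = ([version], ciphers, extensions, curves, point_formats)
    ja3 = ",".join("-".join(str(v) for v in field) for field in fields)
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


if __name__ == "__main__":
    print(parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658"))
    # Sample values chosen for illustration, not taken from real traffic.
    print(ja3_hash(771, [4865, 4866], [0, 23, 65281], [29, 23, 24], [0]))
```

An empty list (for example, no elliptic curves offered) produces an empty field between two commas, which keeps the field positions stable.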