Database Schema
The ja4-platform uses ClickHouse as its central data store with database mabase_prod. This document describes every table, materialized view, dictionary, and view in the schema.
Tables
http_logs_raw
Raw JSON ingest table — direct target for correlator INSERTs.
| Column | Type | Description |
|---|---|---|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: now()) |
- Engine: MergeTree
- Partition by: `toDate(ingest_time)`
- Order by: `ingest_time`
- TTL: `ingest_time + INTERVAL 1 DAY`
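
Taken together, the columns and settings above correspond to DDL roughly like the following. This is a sketch reconstructed from this page only; the project's migration files remain authoritative, and the `IF NOT EXISTS` guard is an assumption added for illustration.

```sql
-- Sketch of the raw ingest table as described above (not the canonical migration).
CREATE TABLE IF NOT EXISTS mabase_prod.http_logs_raw
(
    raw_json    String CODEC(ZSTD(3)),   -- complete correlated log as a JSON string
    ingest_time DateTime DEFAULT now()   -- set at insertion when omitted
)
ENGINE = MergeTree
PARTITION BY toDate(ingest_time)
ORDER BY ingest_time
TTL ingest_time + INTERVAL 1 DAY;        -- raw rows expire after one day
```

Because `ingest_time` has a default, a writer only needs to supply `raw_json`.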
http_logs
Parsed and enriched HTTP log table — populated by mv_http_logs materialized view.
| Column | Type | Nullable | Description |
|---|---|---|---|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: toDate(time)) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |
- Engine: MergeTree
- Partition by: `log_date`
- Order by: `(time, src_ip, dst_ip, ja4)`
- TTL: `log_date + INTERVAL 7 DAY`
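
As an illustration of how the partition key and sort key can be used together, the query below lists the busiest JA4 fingerprints over the last hour. It is a sketch: the one-hour window and the limit are arbitrary, and it assumes the session's current database is `mabase_prod`.

```sql
-- Top JA4 fingerprints in the last hour.
-- The log_date predicate lets ClickHouse prune daily partitions;
-- the time predicate narrows the scan within them (time leads the ORDER BY key).
SELECT
    ja4,
    count()           AS requests,
    uniqExact(src_ip) AS distinct_ips
FROM http_logs
WHERE log_date >= today() - 1
  AND time >= now() - INTERVAL 1 HOUR
GROUP BY ja4
ORDER BY requests DESC
LIMIT 10;
```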
agg_host_ip_ja4_1h
Behavioral aggregation per (src_ip, ja4, host) per hour. Uses AggregatingMergeTree with SimpleAggregateFunction and AggregateFunction columns for incremental aggregation.
Key columns include: window_start, src_ip, ja4, host, src_asn, hits, count_post, uniq_paths, uniq_query_params, tcp_jitter_variance, unique_src_ports, unique_conn_id, orphan_count, ip_id_zero_count, mss_1460_count, uniq_ua, url_depth_variance, count_anomalous_payload, uniq_ja3, avg_syn_ms, tls12_count, count_head, count_no_sec_fetch, count_generic_accept, count_http10, ip_df_var, avg_ttl, ttl_var, count_no_wscale, count_correlated, count_no_accept_enc, count_http_scheme.
- Engine: AggregatingMergeTree
- Order by: `(window_start, src_ip, ja4, host)`
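
Rows in an AggregatingMergeTree table hold partial aggregation states until background merges run, so readers should re-aggregate at query time. The sketch below assumes, without confirmation from this page, that `hits` is a `SimpleAggregateFunction(sum, ...)` column and that `uniq_paths` is an `AggregateFunction(uniq, ...)` column; adjust the combinators to the actual column definitions.

```sql
-- Re-aggregate hourly states over the last 24 hours.
-- Assumed column kinds (not stated on this page):
--   hits       -> SimpleAggregateFunction(sum, ...), read with sum()
--   uniq_paths -> AggregateFunction(uniq, ...),      read with uniqMerge()
SELECT
    src_ip,
    ja4,
    host,
    sum(hits)             AS hits_24h,
    uniqMerge(uniq_paths) AS distinct_paths_24h
FROM agg_host_ip_ja4_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY src_ip, ja4, host
ORDER BY hits_24h DESC
LIMIT 20;
```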
agg_header_fingerprint_1h
Header-level behavioral fingerprint aggregation per (src_ip) per hour.
| Column | Type | Description |
|---|---|---|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |
- Engine: AggregatingMergeTree
- Order by: `(window_start, src_ip)`
ml_detected_anomalies
Anomaly detections above the threat threshold.
Key columns: detected_at, src_ip (IPv6), ja4, host, bot_name, anomaly_score (Float32), raw_anomaly_score (Float32), threat_level, model_name, recurrence (UInt32), campaign_id (Int32), reason, plus all ML feature columns and Anubis enrichment (anubis_bot_name, anubis_bot_action, anubis_bot_category).
- Engine: ReplacingMergeTree(detected_at)
- Order by: `(src_ip)`
- TTL: `detected_at + INTERVAL 30 DAY`
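
ReplacingMergeTree removes superseded rows only during background merges, so a plain `SELECT` may still return several rows for the same sort key. A minimal sketch of reading the deduplicated state with `FINAL` follows. The 24-hour window and the limit are arbitrary, and the ascending sort assumes that lower scores are more severe, which is how `view_ip_recurrence` treats them.

```sql
-- Latest detection per src_ip (the table's sort key), deduplicated at read time.
SELECT
    src_ip,
    ja4,
    host,
    threat_level,
    anomaly_score,
    detected_at
FROM ml_detected_anomalies FINAL
WHERE detected_at >= now() - INTERVAL 24 HOUR
ORDER BY anomaly_score ASC   -- assumption: lower score = more anomalous
LIMIT 50;
```

`FINAL` adds read-time cost, so on large tables an `argMax(..., detected_at)` aggregation grouped by `src_ip` is a common alternative.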
ml_all_scores
All ML classifications (no threshold filter) for observability.
Key columns: detected_at, window_start, src_ip, ja4, host, bot_name, anomaly_score, raw_anomaly_score, threat_level, model_name, correlated, campaign_id, plus ASN and Anubis enrichment.
- Engine: ReplacingMergeTree(detected_at)
- Order by: `(window_start, src_ip, ja4, host, model_name)`
- TTL: `window_start + INTERVAL 3 DAY`
ref_bot_networks
Bot network CIDR reference table.
| Column | Type | Description |
|---|---|---|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |
- Engine: ReplacingMergeTree(last_update)
- Order by: `(network, bot_name)`
bot_ip / bot_ja4
CSV-backed flat tables for quick bot lookups.
- `bot_ip`: single column `ip` (String). Engine: `File(CSV, 'bot_ip.csv')`
- `bot_ja4`: single column `ja4` (String). Engine: `File(CSV, 'bot_ja4.csv')`
Anubis Rule Tables
| Table | Key | Columns | Engine |
|---|---|---|---|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |
audit_logs
SOC audit trail for dashboard activity.
| Column | Type | Default | Description |
|---|---|---|---|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | (none) | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String (ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |
- Engine: MergeTree
- Partition by: `toDate(timestamp)`
- Order by: `(timestamp, user_name, action)`
- TTL: `toDate(timestamp) + INTERVAL 90 DAY`
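
Since every column except `action` has a default, an audit entry can be written with only the fields that matter. The example below is a sketch: the action name, the JSON payload, and the addresses (taken from documentation-reserved ranges) are hypothetical values, not ones defined by the dashboard.

```sql
-- Record one analyst action; timestamp and user_name fall back to their defaults.
INSERT INTO mabase_prod.audit_logs
    (action, entity_type, entity_id, entity_count, details, client_ip)
VALUES
    ('block_ip', 'ip', '203.0.113.7', 1, '{"reason":"manual review"}', '198.51.100.20');
```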
Materialized Views
mv_http_logs
- Source: `http_logs_raw`
- Target: `http_logs`
- Transformation: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.
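
The priority cascade can be pictured as a first-match-wins expression. The sketch below is not the view's actual SQL: the five aliases stand in for per-request dictionary lookup results and are hypothetical, with an empty string meaning that no rule matched at that level.

```sql
-- First non-empty result wins, in the documented priority order.
WITH
    ''          AS ua_ip_bot,    -- UA rule that also matched on IP (highest priority)
    'ExampleUA' AS ua_bot,       -- UA-only rule
    'ExampleIP' AS ip_bot,       -- IP-only rule
    ''          AS asn_bot,      -- ASN rule
    'ExampleCC' AS country_bot   -- country rule (lowest priority)
SELECT multiIf(
    ua_ip_bot != '', ua_ip_bot,
    ua_bot    != '', ua_bot,
    ip_bot    != '', ip_bot,
    asn_bot   != '', asn_bot,
    country_bot
) AS anubis_bot_name;
```

With these placeholder values the UA-only rule is selected, because the combined UA+IP level is empty and every lower level is ignored once a match is found.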
mv_agg_host_ip_ja4_1h
- Source: `http_logs`
- Target: `agg_host_ip_ja4_1h`
- Transformation: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.
mv_agg_header_fingerprint_1h
- Source: `http_logs`
- Target: `agg_header_fingerprint_1h`
- Transformation: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.
Dictionaries
dict_iplocate_asn
- Source: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- Key: `network` (String)
- Layout: `IP_TRIE`
- Attributes: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- Lifetime: 3600–7200 seconds
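
For reference, an `IP_TRIE` dictionary is queried with an IP address value and returns the attributes of the longest matching prefix. The lookup below is a sketch: the address is a documentation-range placeholder, the unqualified dictionary name assumes the current database is `mabase_prod`, and the fallback values for unmatched addresses are arbitrary.

```sql
-- Resolve ASN and country for a single address; defaults apply when no prefix matches.
SELECT
    dictGetOrDefault('dict_iplocate_asn', 'asn',
                     toIPv4('203.0.113.7'), toUInt32(0)) AS src_asn,
    dictGetOrDefault('dict_iplocate_asn', 'country_code',
                     toIPv4('203.0.113.7'), '')          AS src_country_code;
```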
dict_bot_ip
- Source: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- Key: `prefix` (String)
- Layout: `IP_TRIE`
- Attributes: `bot_name` (String)
- Lifetime: 300 seconds
dict_bot_ja4
- Source: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- Key: `ja4` (String)
- Layout: `COMPLEX_KEY_HASHED`
- Attributes: `bot_name` (String)
- Lifetime: 300 seconds
dict_asn_reputation
- Source: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- Key: `src_asn` (UInt64)
- Layout: `HASHED`
- Attributes: `label` (String)
- Lifetime: 300 seconds
dict_anubis_ua
- Source: ClickHouse table `anubis_ua_rules`
- Key: `regexp` (String)
- Layout: `REGEXP_TREE`
- Attributes: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- Lifetime: 300–600 seconds
dict_anubis_ip
- Source: ClickHouse table `anubis_ip_rules`
- Key: `prefix` (String)
- Layout: `IP_TRIE`
- Attributes: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- Lifetime: 300–600 seconds
dict_anubis_asn
- Source: ClickHouse table `anubis_asn_rules`
- Key: `asn` (UInt32)
- Layout: `FLAT`
- Attributes: `bot_name`, `action`, `category`
- Lifetime: 300–600 seconds
dict_anubis_country
- Source: ClickHouse table `anubis_country_rules`
- Key: `country_code` (String)
- Layout: `FLAT`
- Attributes: `bot_name`, `action`, `category`
- Lifetime: 300–600 seconds
Views
view_ai_features_1h
Computes 50+ ML features per (src_ip, ja4, host) from the last 24 hours by joining agg_host_ip_ja4_1h and agg_header_fingerprint_1h. Includes:
- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`
view_ip_recurrence
Aggregates recurrence data from ml_detected_anomalies:
```sql
SELECT src_ip, count() AS recurrence,
       min(detected_at) AS first_seen, max(detected_at) AS last_seen,
       min(anomaly_score) AS worst_score,
       argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies GROUP BY src_ip;
```
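
A typical use of this view is to surface repeat offenders. The query below is a sketch; the threshold of three detections and the limit are arbitrary illustrations, not values defined by the platform.

```sql
-- Source IPs flagged at least three times, most recurrent first.
SELECT
    src_ip,
    recurrence,
    first_seen,
    last_seen,
    worst_score,
    worst_threat_level
FROM view_ip_recurrence
WHERE recurrence >= 3
ORDER BY recurrence DESC, last_seen DESC
LIMIT 20;
```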
User Accounts
| User | Permissions | Purpose |
|---|---|---|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |
Security note: Default passwords are `ChangeMe`. Replace them with strong passwords before production use, and store credentials in a secrets manager.
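
One way to rotate the default passwords is with `ALTER USER`, sketched below. It assumes the accounts are SQL-managed (created with `CREATE USER`) rather than defined in `users.xml`, and the password literals are placeholders to be replaced with values from your secrets manager.

```sql
-- Replace the default credentials; the literals below are placeholders only.
ALTER USER data_writer IDENTIFIED WITH sha256_password BY 'replace-with-generated-secret-1';
ALTER USER analyst     IDENTIFIED WITH sha256_password BY 'replace-with-generated-secret-2';

-- Confirm that each account still holds only the privileges listed above.
SHOW GRANTS FOR data_writer;
SHOW GRANTS FOR analyst;
```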