ja4-platform/docs/database/schema.md
toto d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
2026-04-07 16:42:59 +02:00

Database Schema

The ja4-platform uses ClickHouse as its central data store; all objects live in the database mabase_prod. This document describes every table, materialized view, dictionary, and view in the schema, along with the user accounts that access them.

Tables

http_logs_raw

Raw JSON ingest table — direct target for correlator INSERTs.

| Column | Type | Description |
|---|---|---|
| raw_json | String (ZSTD(3)) | Complete correlated log as JSON string |
| ingest_time | DateTime | Insertion timestamp (default: now()) |

  • Engine: MergeTree
  • Partition by: toDate(ingest_time)
  • Order by: ingest_time
  • TTL: ingest_time + INTERVAL 1 DAY
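
To illustrate what a writer hands to this table, here is a minimal sketch that serializes one correlated event into the single raw_json string. The helper name build_raw_row and the event keys (borrowed from the http_logs column names) are assumptions; the real correlator payload may differ.

```python
import json


def build_raw_row(event: dict) -> dict:
    """Return one row for http_logs_raw (hypothetical helper).

    Only raw_json is supplied; ingest_time is left out so the
    table default now() applies at insert time.
    """
    # Compact separators keep the stored string small before ZSTD compression.
    payload = json.dumps(event, separators=(",", ":"), sort_keys=True)
    return {"raw_json": payload}


# Assumed event shape: JSON keys named after http_logs columns.
row = build_raw_row({"src_ip": "203.0.113.7", "method": "GET", "host": "example.org"})
```

Because rows expire after one day, this table is best treated as a short-lived ingest buffer rather than a place to query history.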

http_logs

Parsed and enriched HTTP log table — populated by mv_http_logs materialized view.

| Column | Type | Nullable | Description |
|---|---|---|---|
| time | DateTime | No | Request timestamp |
| log_date | Date | No | Date partition key (default: toDate(time)) |
| src_ip | IPv4 | No | Client source IP |
| src_port | UInt16 | No | Client source port |
| dst_ip | IPv4 | No | Server destination IP |
| dst_port | UInt16 | No | Server destination port |
| src_asn | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| src_country_code | LowCardinality(String) | No | Source country code |
| src_as_name | LowCardinality(String) | No | AS name |
| src_org | LowCardinality(String) | No | AS organization |
| src_domain | LowCardinality(String) | No | AS domain |
| method | LowCardinality(String) | No | HTTP method |
| scheme | LowCardinality(String) | No | URL scheme (http/https) |
| host | LowCardinality(String) | No | HTTP Host header |
| path | String (ZSTD(3)) | No | Request path |
| query | String (ZSTD(3)) | No | Query string |
| http_version | LowCardinality(String) | No | HTTP version |
| orphan_side | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| correlated | UInt8 | No | 1 if HTTP+TLS correlated |
| keepalives | UInt16 | No | Keep-alive request sequence |
| a_timestamp | UInt64 | No | Source A event timestamp (ns) |
| b_timestamp | UInt64 | No | Source B event timestamp (ns) |
| conn_id | String (ZSTD(3)) | No | TCP connection identifier |
| ip_meta_df | UInt8 | No | IP Don't Fragment flag |
| ip_meta_id | UInt16 | No | IP identification |
| ip_meta_total_length | UInt16 | No | IP total length |
| ip_meta_ttl | UInt8 | No | IP TTL |
| tcp_meta_options | LowCardinality(String) | No | TCP options list |
| tcp_meta_window_size | UInt32 | No | TCP window size |
| tcp_meta_mss | UInt16 | No | TCP MSS |
| tcp_meta_window_scale | UInt8 | No | TCP window scale |
| syn_to_clienthello_ms | Int32 | No | SYN-to-ClientHello timing (ms) |
| tls_version | LowCardinality(String) | No | TLS version |
| tls_sni | LowCardinality(String) | No | TLS SNI |
| tls_alpn | LowCardinality(String) | No | TLS ALPN |
| ja3 | String (ZSTD(3)) | No | JA3 fingerprint |
| ja3_hash | String (ZSTD(3)) | No | JA3 MD5 hash |
| ja4 | String (ZSTD(3)) | No | JA4 fingerprint |
| client_headers | String (ZSTD(3)) | No | Comma-separated header names |
| header_user_agent | String (ZSTD(3)) | No | User-Agent header |
| header_accept | String (ZSTD(3)) | No | Accept header |
| header_accept_encoding | String (ZSTD(3)) | No | Accept-Encoding header |
| header_accept_language | String (ZSTD(3)) | No | Accept-Language header |
| header_content_type | String (ZSTD(3)) | No | Content-Type header |
| header_x_request_id | String (ZSTD(3)) | No | X-Request-Id header |
| header_x_trace_id | String (ZSTD(3)) | No | X-Trace-Id header |
| header_x_forwarded_for | String (ZSTD(3)) | No | X-Forwarded-For header |
| header_sec_ch_ua | String (ZSTD(3)) | No | Sec-CH-UA header |
| header_sec_ch_ua_mobile | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| header_sec_ch_ua_platform | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| header_sec_fetch_dest | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| header_sec_fetch_mode | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| header_sec_fetch_site | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| anubis_bot_name | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| anubis_bot_action | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| anubis_bot_category | LowCardinality(String) | No | Anubis-detected bot category (default: '') |

  • Engine: MergeTree
  • Partition by: log_date
  • Order by: (time, src_ip, dst_ip, ja4)
  • TTL: log_date + INTERVAL 7 DAY

agg_host_ip_ja4_1h

Behavioral aggregation per (src_ip, ja4, host) per hour. Uses AggregatingMergeTree with SimpleAggregateFunction and AggregateFunction columns for incremental aggregation.

Key columns include: window_start, src_ip, ja4, host, src_asn, hits, count_post, uniq_paths, uniq_query_params, tcp_jitter_variance, unique_src_ports, unique_conn_id, orphan_count, ip_id_zero_count, mss_1460_count, uniq_ua, url_depth_variance, count_anomalous_payload, uniq_ja3, avg_syn_ms, tls12_count, count_head, count_no_sec_fetch, count_generic_accept, count_http10, ip_df_var, avg_ttl, ttl_var, count_no_wscale, count_correlated, count_no_accept_enc, count_http_scheme.

  • Engine: AggregatingMergeTree
  • Order by: (window_start, src_ip, ja4, host)
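
The incremental aggregation described above can be pictured with a small pure-Python model: each insert produces a partial row per (window_start, src_ip, ja4, host), and a background merge combines rows that share that key. This is a conceptual sketch only. The function name is hypothetical, hits is assumed to behave like a sum-style SimpleAggregateFunction, and the set stands in for a uniq state, which ClickHouse stores as an approximate structure rather than an exact set.

```python
from collections import defaultdict


def merge_partials(rows):
    """Combine partial aggregate rows that share the same sorting key."""
    merged = defaultdict(lambda: {"hits": 0, "uniq_paths": set()})
    for key, hits, paths in rows:
        state = merged[key]
        state["hits"] += hits              # assumed sum-style SimpleAggregateFunction
        state["uniq_paths"] |= set(paths)  # stand-in for an AggregateFunction(uniq, ...) state
    return dict(merged)


key = ("2026-04-07 10:00:00", "203.0.113.7", "ja4_example", "example.org")
parts = [
    (key, 3, ["/", "/login"]),       # partial row from a first insert block
    (key, 2, ["/login", "/admin"]),  # partial row from a later insert block
]
result = merge_partials(parts)
```

Since merges happen asynchronously, queries against this table generally need to re-aggregate (GROUP BY with the matching merge functions) rather than assume one row per key.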

agg_header_fingerprint_1h

Header-level behavioral fingerprint aggregation per (src_ip) per hour.

| Column | Type | Description |
|---|---|---|
| window_start | DateTime | Hour window start |
| src_ip | IPv6 | Source IP |
| header_order_hash | SimpleAggregateFunction(any, String) | Hash of header order |
| header_count | SimpleAggregateFunction(max, UInt16) | Max header count |
| has_accept_language | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| has_cookie | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| has_referer | SimpleAggregateFunction(max, UInt8) | Referer presence |
| modern_browser_score | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| ua_ch_mismatch | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| sec_fetch_mode | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| sec_fetch_dest | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |

  • Engine: AggregatingMergeTree
  • Order by: (window_start, src_ip)
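
As an illustration of an order-sensitive header fingerprint, the sketch below derives a hash and a count from a comma-separated header list such as http_logs.client_headers. The hash function (truncated SHA-256), the lowercasing step, and the function name are all assumptions made for this example; the schema does not state how header_order_hash is actually computed.

```python
import hashlib


def header_fingerprint(client_headers: str) -> tuple:
    """Return (order-sensitive hash, header count) for a comma-separated list."""
    # Normalize names so that case differences do not change the fingerprint (assumption).
    names = [h.strip().lower() for h in client_headers.split(",") if h.strip()]
    # Joining in the received order makes the digest depend on header order.
    digest = hashlib.sha256(",".join(names).encode("utf-8")).hexdigest()[:16]
    return digest, len(names)


browser_like = header_fingerprint("Host, User-Agent, Accept")
reordered = header_fingerprint("Host, Accept, User-Agent")
```

Two clients that send the same headers in a different order get different hashes, which is the property a header-order fingerprint relies on.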

ml_detected_anomalies

Anomaly detections above the threat threshold.

Key columns: detected_at, src_ip (IPv6), ja4, host, bot_name, anomaly_score (Float32), raw_anomaly_score (Float32), threat_level, model_name, recurrence (UInt32), campaign_id (Int32), reason, plus all ML feature columns and Anubis enrichment (anubis_bot_name, anubis_bot_action, anubis_bot_category).

  • Engine: ReplacingMergeTree(detected_at)
  • Order by: (src_ip)
  • TTL: detected_at + INTERVAL 30 DAY
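
The effect of ReplacingMergeTree(detected_at) with a sorting key of (src_ip) can be modelled in a few lines: among rows with the same src_ip, a merge keeps the one with the highest detected_at. The sketch below is a pure-Python illustration with a hypothetical function name and made-up rows, not the engine's implementation.

```python
from datetime import datetime


def replacing_merge(rows):
    """Keep, per src_ip, the row with the highest detected_at (the version column)."""
    latest = {}
    for row in rows:
        current = latest.get(row["src_ip"])
        # On equal versions the later-inserted row wins, hence >=.
        if current is None or row["detected_at"] >= current["detected_at"]:
            latest[row["src_ip"]] = row
    return list(latest.values())


rows = [
    {"src_ip": "2001:db8::1", "detected_at": datetime(2026, 4, 7, 10), "anomaly_score": -0.30},
    {"src_ip": "2001:db8::1", "detected_at": datetime(2026, 4, 7, 12), "anomaly_score": -0.55},
    {"src_ip": "2001:db8::2", "detected_at": datetime(2026, 4, 7, 11), "anomaly_score": -0.20},
]
survivors = replacing_merge(rows)
```

ClickHouse deduplicates only during background merges, so a query may still see several rows per src_ip unless it uses FINAL or aggregates explicitly.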

ml_all_scores

All ML classifications (no threshold filter) for observability.

Key columns: detected_at, window_start, src_ip, ja4, host, bot_name, anomaly_score, raw_anomaly_score, threat_level, model_name, correlated, campaign_id, plus ASN and Anubis enrichment.

  • Engine: ReplacingMergeTree(detected_at)
  • Order by: (window_start, src_ip, ja4, host, model_name)
  • TTL: window_start + INTERVAL 3 DAY

ref_bot_networks

Bot network CIDR reference table.

| Column | Type | Description |
|---|---|---|
| network | IPv6CIDR | Network CIDR |
| bot_name | LowCardinality(String) | Bot name |
| is_legitimate | UInt8 | 1 = legitimate bot |
| last_update | DateTime | Last update timestamp |

  • Engine: ReplacingMergeTree(last_update)
  • Order by: (network, bot_name)

bot_ip / bot_ja4

CSV-backed flat tables for quick bot lookups.

  • bot_ip: single column ip (String) — Engine: File(CSV, 'bot_ip.csv')
  • bot_ja4: single column ja4 (String) — Engine: File(CSV, 'bot_ja4.csv')

Anubis Rule Tables

| Table | Key | Columns | Engine |
|---|---|---|---|
| anubis_ua_rules | id (UInt64) | parent_id, regexp, keys, values | ReplacingMergeTree |
| anubis_ip_rules | prefix (String) | bot_name, action, rule_id, has_ua, category | ReplacingMergeTree |
| anubis_asn_rules | asn (UInt32) | bot_name, action, category | ReplacingMergeTree |
| anubis_country_rules | country_code (String) | bot_name, action, category | ReplacingMergeTree |

audit_logs

SOC audit trail for dashboard activity.

| Column | Type | Default | Description |
|---|---|---|---|
| timestamp | DateTime | now() | Event time |
| user_name | LowCardinality(String) | 'soc_user' | Analyst name |
| action | LowCardinality(String) | | Action performed |
| entity_type | LowCardinality(String) | '' | Entity type (ip, ja4, etc.) |
| entity_id | String | '' | Entity identifier |
| entity_count | UInt32 | 0 | Entity count |
| details | String (ZSTD(3)) | '' | JSON details |
| client_ip | String | '' | Analyst client IP |

  • Engine: MergeTree
  • Partition by: toDate(timestamp)
  • Order by: (timestamp, user_name, action)
  • TTL: toDate(timestamp) + INTERVAL 90 DAY

Materialized Views

mv_http_logs

  • Source: http_logs_raw
  • Target: http_logs
  • Transformation: Parses raw_json via JSONExtract* functions, enriches with ASN data from dict_iplocate_asn and Anubis bot detection from dict_anubis_ua, dict_anubis_ip, dict_anubis_asn, dict_anubis_country. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.
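
The 5-level Anubis cascade can be sketched as a simple first-match function. This is one possible reading, not the view's actual SQL: it assumes that a combined match means the UA and IP lookups hit rules sharing the same rule_id, and that has_ip / has_ua mark rules that require the other lookup to match as well. The function name and the sample rule values are hypothetical.

```python
from typing import Optional


def resolve_anubis(ua: Optional[dict], ip: Optional[dict],
                   asn: Optional[dict], country: Optional[dict]) -> Optional[dict]:
    """Return the highest-priority Anubis match, or None when nothing matches."""
    # 1. UA+IP combined: both lookups hit and point at the same rule (assumption).
    if ua is not None and ip is not None and ua["rule_id"] == ip["rule_id"]:
        return ua
    # 2. UA only: the UA rule does not additionally require an IP match.
    if ua is not None and not ua.get("has_ip"):
        return ua
    # 3. IP only: the IP rule does not additionally require a UA match.
    if ip is not None and not ip.get("has_ua"):
        return ip
    # 4. ASN, then 5. country as the weakest signal.
    if asn is not None:
        return asn
    return country


ua_hit = {"bot_name": "crawler-a", "action": "ALLOW", "rule_id": 7, "has_ip": 1}
ip_hit = {"bot_name": "crawler-a", "action": "ALLOW", "rule_id": 7, "has_ua": 1}
asn_hit = {"bot_name": "hosting-asn", "action": "WEIGH"}

combined = resolve_anubis(ua_hit, ip_hit, asn_hit, None)
spoofed_ua = resolve_anubis(ua_hit, None, asn_hit, None)
```

In this model a client that sends a known crawler User-Agent from an unlisted IP does not inherit the crawler's rule; it falls through to the ASN level instead.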

mv_agg_host_ip_ja4_1h

  • Source: http_logs
  • Target: agg_host_ip_ja4_1h
  • Transformation: Groups by (toStartOfHour(time), src_ip, ja4, host, src_asn). Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.
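
The grouping step can be illustrated with a small Python model that truncates each timestamp to the hour and accumulates a few counters per key. Only three of the features are shown, and their definitions (hits as row count, count_post as number of POST requests, uniq_paths as distinct paths) are inferred from the column names rather than taken from the view definition.

```python
from collections import defaultdict
from datetime import datetime


def start_of_hour(ts: datetime) -> datetime:
    """Python counterpart of toStartOfHour(time)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hourly(rows):
    """Group rows by (hour, src_ip, ja4, host, src_asn) and compute sample features."""
    buckets = defaultdict(lambda: {"hits": 0, "count_post": 0, "paths": set()})
    for r in rows:
        key = (start_of_hour(r["time"]), r["src_ip"], r["ja4"], r["host"], r["src_asn"])
        bucket = buckets[key]
        bucket["hits"] += 1
        bucket["count_post"] += int(r["method"] == "POST")
        bucket["paths"].add(r["path"])
    return {
        k: {"hits": v["hits"], "count_post": v["count_post"], "uniq_paths": len(v["paths"])}
        for k, v in buckets.items()
    }


base = {"src_ip": "203.0.113.7", "ja4": "ja4_example", "host": "example.org", "src_asn": 64500}
logs = [
    {**base, "time": datetime(2026, 4, 7, 10, 5), "method": "GET", "path": "/"},
    {**base, "time": datetime(2026, 4, 7, 10, 40), "method": "POST", "path": "/login"},
    {**base, "time": datetime(2026, 4, 7, 11, 2), "method": "GET", "path": "/"},
]
hourly = aggregate_hourly(logs)
```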

mv_agg_header_fingerprint_1h

  • Source: http_logs
  • Target: agg_header_fingerprint_1h
  • Transformation: Groups by (toStartOfHour(time), src_ip). Computes header order hash, header count, browser compliance score, Client Hints mismatch.

Dictionaries

dict_iplocate_asn

  • Source: CSV file /var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv
  • Key: network (String)
  • Layout: IP_TRIE
  • Attributes: asn (UInt32), country_code, name, org, domain
  • Lifetime: 3600-7200 seconds
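
An IP_TRIE dictionary resolves an address to the most specific matching prefix. The sketch below reproduces that longest-prefix behaviour with Python's standard ipaddress module, using a linear scan instead of a real trie. The function names and the sample networks and ASNs (taken from documentation ranges) are illustrative assumptions.

```python
import ipaddress


def build_table(entries):
    """Parse (cidr, attributes) pairs into network objects."""
    return [(ipaddress.ip_network(cidr), attrs) for cidr, attrs in entries]


def lookup(table, ip: str):
    """Return the attributes of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, attrs in table:
        if addr.version == net.version and addr in net:
            # Prefer the more specific (longer) prefix.
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, attrs)
    return best[1] if best else None


table = build_table([
    ("203.0.113.0/24", {"asn": 64500, "country_code": "ZZ"}),
    ("203.0.113.128/25", {"asn": 64501, "country_code": "ZZ"}),
])
specific = lookup(table, "203.0.113.200")
general = lookup(table, "203.0.113.5")
missing = lookup(table, "198.51.100.1")
```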

dict_bot_ip

  • Source: CSV file /var/lib/clickhouse/user_files/bot_ip.csv
  • Key: prefix (String)
  • Layout: IP_TRIE
  • Attributes: bot_name (String)
  • Lifetime: 300 seconds

dict_bot_ja4

  • Source: CSV file /var/lib/clickhouse/user_files/bot_ja4.csv
  • Key: ja4 (String)
  • Layout: COMPLEX_KEY_HASHED
  • Attributes: bot_name (String)
  • Lifetime: 300 seconds

dict_asn_reputation

  • Source: CSV file /var/lib/clickhouse/user_files/asn_reputation.csv
  • Key: src_asn (UInt64)
  • Layout: HASHED
  • Attributes: label (String)
  • Lifetime: 300 seconds

dict_anubis_ua

  • Source: ClickHouse table anubis_ua_rules
  • Key: regexp (String)
  • Layout: REGEXP_TREE
  • Attributes: bot_name, action, has_ip, rule_id, category
  • Lifetime: 300-600 seconds

dict_anubis_ip

  • Source: ClickHouse table anubis_ip_rules
  • Key: prefix (String)
  • Layout: IP_TRIE
  • Attributes: bot_name, action, rule_id, has_ua, category
  • Lifetime: 300-600 seconds

dict_anubis_asn

  • Source: ClickHouse table anubis_asn_rules
  • Key: asn (UInt32)
  • Layout: FLAT
  • Attributes: bot_name, action, category
  • Lifetime: 300-600 seconds

dict_anubis_country

  • Source: ClickHouse table anubis_country_rules
  • Key: country_code (String)
  • Layout: FLAT
  • Attributes: bot_name, action, category
  • Lifetime: 300-600 seconds

Views

view_ai_features_1h

Computes 50+ ML features per (src_ip, ja4, host) from the last 24 hours by joining agg_host_ip_ja4_1h and agg_header_fingerprint_1h. Includes:

  • Behavioral features: hits, hit_velocity, fuzzing_index, post_ratio, orphan_ratio
  • Connection features: max_keepalives, multiplexing_efficiency, port_exhaustion_ratio
  • Browser features: modern_browser_score, ua_ch_mismatch, header_order_shared_count
  • TLS features: alpn_http_mismatch, is_alpn_missing, sni_host_mismatch
  • L4 features: tcp_jitter_variance, avg_ttl, ttl_std, syn_timing_cv
  • Reputation: bot_name (from dict_bot_ip/dict_bot_ja4), anubis_bot_name/action/category
  • Derived: temporal_entropy, ja3_diversity_ratio
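
The view's exact formulas are not reproduced in this document. As a rough guide to how ratio-style and entropy-style features of this kind are typically computed, the sketch below shows a zero-safe ratio and a Shannon entropy over per-hour hit counts. Treating post_ratio as count_post / hits and temporal_entropy as Shannon entropy are assumptions made for illustration only.

```python
import math


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide while guarding against empty windows (denominator of zero)."""
    return numerator / denominator if denominator else 0.0


def shannon_entropy(counts) -> float:
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Assumed definitions: post_ratio = count_post / hits,
# temporal_entropy = entropy of hits across hourly windows.
post_ratio = safe_ratio(5, 20)
even_traffic = shannon_entropy([5, 5, 5, 5])   # spread evenly over four hours
burst_traffic = shannon_entropy([20, 0, 0, 0])  # everything in a single hour
```

Under these assumed definitions, evenly spread traffic yields a higher entropy than a single burst, which is why such a feature can help separate steady crawlers from short scans.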

view_ip_recurrence

Aggregates recurrence data from ml_detected_anomalies:

SELECT src_ip, count() AS recurrence,
       min(detected_at) AS first_seen, max(detected_at) AS last_seen,
       min(anomaly_score) AS worst_score,
       argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies GROUP BY src_ip;
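
For readers less familiar with argMin, the same aggregation can be written out in plain Python. Note that the query treats the lowest anomaly_score as the worst one, and argMin returns the threat_level of that same row. The function name, the sample rows, and the threat level labels are hypothetical.

```python
from datetime import datetime


def ip_recurrence(rows):
    """Python equivalent of the view_ip_recurrence aggregation."""
    out = {}
    for r in rows:
        agg = out.get(r["src_ip"])
        if agg is None:
            out[r["src_ip"]] = {
                "recurrence": 1,
                "first_seen": r["detected_at"],
                "last_seen": r["detected_at"],
                "worst_score": r["anomaly_score"],
                "worst_threat_level": r["threat_level"],
            }
            continue
        agg["recurrence"] += 1
        agg["first_seen"] = min(agg["first_seen"], r["detected_at"])
        agg["last_seen"] = max(agg["last_seen"], r["detected_at"])
        # argMin(threat_level, anomaly_score): keep the level of the lowest score.
        if r["anomaly_score"] < agg["worst_score"]:
            agg["worst_score"] = r["anomaly_score"]
            agg["worst_threat_level"] = r["threat_level"]
    return out


detections = [
    {"src_ip": "2001:db8::1", "detected_at": datetime(2026, 4, 6, 9), "anomaly_score": -0.2, "threat_level": "MEDIUM"},
    {"src_ip": "2001:db8::1", "detected_at": datetime(2026, 4, 7, 9), "anomaly_score": -0.7, "threat_level": "CRITICAL"},
    {"src_ip": "2001:db8::2", "detected_at": datetime(2026, 4, 7, 8), "anomaly_score": -0.1, "threat_level": "LOW"},
]
summary = ip_recurrence(detections)
```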

User Accounts

| User | Permissions | Purpose |
|---|---|---|
| data_writer | INSERT + SELECT on http_logs_raw | Used by correlator service |
| analyst | SELECT on http_logs, ml_detected_anomalies, ml_all_scores, view_ai_features_1h, view_ip_recurrence, audit_logs | Used by dashboard/SOC analysts |

Security note: Default passwords are ChangeMe — replace with strong passwords before production use. Store credentials in a secrets manager.
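
One way to keep credentials out of source and config files is to read them from the environment, where a secrets manager can inject them, and to refuse the placeholder value. The sketch below is a minimal illustration; the variable name CLICKHOUSE_PASSWORD and the function name are assumptions, not names used by the platform.

```python
import os


def load_clickhouse_password(env_var: str = "CLICKHOUSE_PASSWORD", environ=os.environ) -> str:
    """Read the password from the environment and reject unsafe values."""
    password = environ.get(env_var, "")
    if not password:
        raise RuntimeError(f"{env_var} is not set")
    if password == "ChangeMe":
        # The documented default must never reach production.
        raise RuntimeError(f"{env_var} still holds the default password")
    return password
```

Passing a plain dict as environ makes the check easy to exercise in tests without touching the real process environment.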