# Database Schema

The ja4-platform uses ClickHouse as its central data store with database `ja4_processing`. This document describes every table, materialized view, dictionary, and view in the schema.

## Tables

### http_logs_raw

Raw JSON ingest table: direct target for correlator INSERTs.

| Column | Type | Description |
|--------|------|-------------|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) |

- **Engine**: MergeTree
- **Partition by**: `toDate(ingest_time)`
- **Order by**: `ingest_time`
- **TTL**: `ingest_time + INTERVAL 1 DAY`

---

### http_logs

Parsed and enriched HTTP log table, populated by `mv_http_logs` materialized view.

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: `toDate(time)`) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |

- **Engine**: MergeTree
- **Partition by**: `log_date`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 7 DAY`

---

### agg_host_ip_ja4_1h

Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation.

Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`.

- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`

---

### agg_header_fingerprint_1h

Header-level behavioral fingerprint aggregation per `(src_ip)` per hour.

| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |

- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`

---

### ml_detected_anomalies

Anomaly detections above the threat threshold.

Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`).

- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 30 DAY`

---

### ml_all_scores

All ML classifications (no threshold filter) for observability.

Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment.

- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 3 DAY`

---

### ref_bot_networks

Bot network CIDR reference table.

| Column | Type | Description |
|--------|------|-------------|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |

- **Engine**: ReplacingMergeTree(last_update)
- **Order by**: `(network, bot_name)`

---

### bot_ip / bot_ja4

CSV-backed flat tables for quick bot lookups.

- `bot_ip`: single column `ip` (String); Engine: File(CSV, 'bot_ip.csv')
- `bot_ja4`: single column `ja4` (String); Engine: File(CSV, 'bot_ja4.csv')

---

### Anubis Rule Tables

| Table | Key | Columns | Engine |
|-------|-----|---------|--------|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |

---

### audit_logs

SOC audit trail for dashboard activity.

| Column | Type | Default | Description |
|--------|------|---------|-------------|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | - | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String (ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |

- **Engine**: MergeTree
- **Partition by**: `toDate(timestamp)`
- **Order by**: `(timestamp, user_name, action)`
- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY`

---

## Materialized Views

### mv_http_logs

- **Source**: `http_logs_raw`
- **Target**: `http_logs`
- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.

### mv_agg_host_ip_ja4_1h

- **Source**: `http_logs`
- **Target**: `agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.

### mv_agg_header_fingerprint_1h

- **Source**: `http_logs`
- **Target**: `agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.

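
Because `AggregatingMergeTree` collapses rows only during background merges, a query against `agg_header_fingerprint_1h` may see several partial rows for the same `(window_start, src_ip)`. The following is a minimal sketch of reading that table with explicit re-aggregation. It assumes the `SimpleAggregateFunction` column types listed in the table section; the 24-hour window, the `LIMIT`, and the output aliases (`max_header_count` and so on) are illustrative and not part of the schema.

```sql
-- Sketch only: re-apply each column's aggregate function at read time,
-- since background merges may not have collapsed all parts yet.
SELECT
    window_start,
    src_ip,
    any(header_order_hash)    AS any_header_order_hash,     -- SimpleAggregateFunction(any, String)
    max(header_count)         AS max_header_count,          -- SimpleAggregateFunction(max, UInt16)
    max(has_accept_language)  AS any_accept_language,       -- 1 if the header was ever present
    max(modern_browser_score) AS max_modern_browser_score,
    max(ua_ch_mismatch)       AS any_ua_ch_mismatch
FROM ja4_processing.agg_header_fingerprint_1h
WHERE window_start >= now() - INTERVAL 24 HOUR   -- illustrative window
GROUP BY window_start, src_ip
ORDER BY window_start DESC
LIMIT 100;
```

The same pattern would apply to `agg_host_ip_ja4_1h`, except that any `AggregateFunction` columns there need the matching `-Merge` combinator (for example `uniqMerge`), which depends on column definitions this document does not list.
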

---

## Dictionaries

### dict_iplocate_asn

- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- **Key**: `network` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- **Lifetime**: 3600–7200 seconds

### dict_bot_ip

- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds

### dict_bot_ja4

- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- **Key**: `ja4` (String)
- **Layout**: `COMPLEX_KEY_HASHED`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds

### dict_asn_reputation

- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- **Key**: `src_asn` (UInt64)
- **Layout**: `HASHED`
- **Attributes**: `label` (String)
- **Lifetime**: 300 seconds

### dict_anubis_ua

- **Source**: ClickHouse table `anubis_ua_rules`
- **Key**: `regexp` (String)
- **Layout**: `REGEXP_TREE`
- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_ip

- **Source**: ClickHouse table `anubis_ip_rules`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_asn

- **Source**: ClickHouse table `anubis_asn_rules`
- **Key**: `asn` (UInt32)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_country

- **Source**: ClickHouse table `anubis_country_rules`
- **Key**: `country_code` (String)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds

---

## Views

### view_ai_features_1h

Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`.

Includes:

- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`

### view_ip_recurrence

Aggregates recurrence data from `ml_detected_anomalies`:

```sql
SELECT
    src_ip,
    count() AS recurrence,
    min(detected_at) AS first_seen,
    max(detected_at) AS last_seen,
    min(anomaly_score) AS worst_score,
    argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies
GROUP BY src_ip;
```

---

## User Accounts

| User | Permissions | Purpose |
|------|------------|---------|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |

> **Security note**: Default passwords are `ChangeMe`; replace with strong passwords before production use. Store credentials in a secrets manager.

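
The permissions in the table above could be expressed as ClickHouse grants roughly as follows. This is a sketch under assumptions, not the platform's actual provisioning script: it assumes both users already exist and that SQL-driven access management is enabled; the accounts may instead be defined in server configuration files.

```sql
-- Assumption: users data_writer and analyst already exist
-- and SQL-driven access control is enabled on the server.

-- Correlator service: write to and read from the raw ingest table only.
GRANT INSERT, SELECT ON ja4_processing.http_logs_raw TO data_writer;

-- Dashboard / SOC analysts: read-only access to the listed tables and views.
GRANT SELECT ON ja4_processing.http_logs             TO analyst;
GRANT SELECT ON ja4_processing.ml_detected_anomalies TO analyst;
GRANT SELECT ON ja4_processing.ml_all_scores         TO analyst;
GRANT SELECT ON ja4_processing.view_ai_features_1h   TO analyst;
GRANT SELECT ON ja4_processing.view_ip_recurrence    TO analyst;
GRANT SELECT ON ja4_processing.audit_logs            TO analyst;

-- Verify the result.
SHOW GRANTS FOR data_writer;
SHOW GRANTS FOR analyst;
```

Granting per object rather than on `ja4_processing.*` keeps the `analyst` account away from `http_logs_raw` and the Anubis rule tables, which matches the scope listed above.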