ja4-platform/docs/database/schema.md
# Database Schema
The ja4-platform uses ClickHouse as its central data store; all objects live in the `mabase_prod` database. This document describes every table, materialized view, dictionary, and view in the schema, along with the user accounts that access them.
## Tables
### http_logs_raw
Raw JSON ingest table — direct target for correlator INSERTs.
| Column | Type | Description |
|--------|------|-------------|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) |
- **Engine**: MergeTree
- **Partition by**: `toDate(ingest_time)`
- **Order by**: `ingest_time`
- **TTL**: `ingest_time + INTERVAL 1 DAY`
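
The settings above map onto ClickHouse DDL roughly as follows. This is a sketch reconstructed from the column table, not the project's actual migration file, and the JSON keys in the sample insert are hypothetical.

```sql
-- Sketch only: reconstructed from the column table and engine settings above.
CREATE TABLE IF NOT EXISTS mabase_prod.http_logs_raw
(
    raw_json    String CODEC(ZSTD(3)),   -- complete correlated log, stored compressed
    ingest_time DateTime DEFAULT now()   -- filled automatically on insert
)
ENGINE = MergeTree
PARTITION BY toDate(ingest_time)
ORDER BY ingest_time
TTL ingest_time + INTERVAL 1 DAY;        -- rows expire one day after ingestion

-- Hypothetical payload: the real JSON keys are defined by the correlator.
INSERT INTO mabase_prod.http_logs_raw (raw_json)
VALUES ('{"src_ip":"192.0.2.10","host":"example.org","ja4":"t13d1516h2_8daaf6152771_02713d6af862"}');
```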
---
### http_logs
Parsed and enriched HTTP log table — populated by `mv_http_logs` materialized view.
| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: `toDate(time)`) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |
- **Engine**: MergeTree
- **Partition by**: `log_date`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 7 DAY`
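
Because the table is partitioned by `log_date` and ordered by `time` first, queries tend to be cheaper when they filter on both. The query below is an illustrative example of that pattern, not one taken from the dashboard; the one-hour window and the `LIMIT` are arbitrary choices.

```sql
-- Top JA4 fingerprints over the last hour, correlated traffic only.
SELECT
    ja4,
    count()            AS hits,
    uniqExact(src_ip)  AS distinct_ips
FROM mabase_prod.http_logs
WHERE log_date >= today() - 1              -- lets ClickHouse prune old partitions
  AND time >= now() - INTERVAL 1 HOUR      -- uses the leading ORDER BY column
  AND correlated = 1
GROUP BY ja4
ORDER BY hits DESC
LIMIT 20;
```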
---
### agg_host_ip_ja4_1h
Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation.
Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`.
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
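
Rows in an `AggregatingMergeTree` table may still be unmerged when read, so readers generally re-aggregate: `SimpleAggregateFunction` columns with the plain function, `AggregateFunction` columns with the matching `-Merge` combinator. The sketch below assumes `hits` is `SimpleAggregateFunction(sum, UInt64)` and `uniq_paths` is `AggregateFunction(uniq, String)`; neither type is stated in this document, so check the migration before relying on it.

```sql
-- Assumed column types (not confirmed by this document):
--   hits       SimpleAggregateFunction(sum, UInt64)
--   uniq_paths AggregateFunction(uniq, String)
SELECT
    src_ip,
    ja4,
    host,
    sum(hits)             AS hits_24h,        -- re-aggregate partial sums
    uniqMerge(uniq_paths) AS uniq_paths_24h   -- merge partial uniq states
FROM mabase_prod.agg_host_ip_ja4_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY src_ip, ja4, host
ORDER BY hits_24h DESC
LIMIT 20;
```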
---
### agg_header_fingerprint_1h
Header-level behavioral fingerprint aggregation per `(src_ip)` per hour.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`
---
### ml_detected_anomalies
Anomaly detections above the threat threshold.
Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`).
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 30 DAY`
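
With `ReplacingMergeTree(detected_at)` ordered by `src_ip`, background merges keep only the row with the latest `detected_at` for each IP, but duplicates can be visible until a merge runs. One way to read a deduplicated snapshot is the `FINAL` modifier, as in this illustrative query. It sorts ascending on the assumption that lower scores are more anomalous, which is how `view_ip_recurrence` treats them.

```sql
-- Latest detection per source IP from the past day, most anomalous first.
SELECT
    src_ip,
    ja4,
    host,
    anomaly_score,
    threat_level,
    detected_at
FROM mabase_prod.ml_detected_anomalies FINAL   -- collapse not-yet-merged duplicates
WHERE detected_at >= now() - INTERVAL 1 DAY
ORDER BY anomaly_score ASC
LIMIT 50;
```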
---
### ml_all_scores
All ML classifications (no threshold filter) for observability.
Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment.
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 3 DAY`
---
### ref_bot_networks
Bot network CIDR reference table.
| Column | Type | Description |
|--------|------|-------------|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |
- **Engine**: ReplacingMergeTree(last_update)
- **Order by**: `(network, bot_name)`
---
### bot_ip / bot_ja4
CSV-backed flat tables for quick bot lookups.
- `bot_ip`: single column `ip` (String) — Engine: File(CSV, 'bot_ip.csv')
- `bot_ja4`: single column `ja4` (String) — Engine: File(CSV, 'bot_ja4.csv')
---
### Anubis Rule Tables
| Table | Key | Columns | Engine |
|-------|-----|---------|--------|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |
---
### audit_logs
SOC audit trail for dashboard activity.
| Column | Type | Default | Description |
|--------|------|---------|-------------|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | — | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String (ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |
- **Engine**: MergeTree
- **Partition by**: `toDate(timestamp)`
- **Order by**: `(timestamp, user_name, action)`
- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY`
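
Since most columns have defaults, an audit entry only needs the fields that matter for the event. The example below is a sketch: the analyst name, the `action` value, and the JSON in `details` are hypothetical, not values the dashboard is known to emit.

```sql
-- timestamp is omitted, so it defaults to now().
INSERT INTO mabase_prod.audit_logs
    (user_name, action, entity_type, entity_id, entity_count, details, client_ip)
VALUES
    ('alice', 'mark_false_positive', 'ip', '192.0.2.10', 1,
     '{"reason":"internal scanner"}', '198.51.100.7');

-- Review the last day of activity for one analyst.
SELECT timestamp, action, entity_type, entity_id
FROM mabase_prod.audit_logs
WHERE timestamp >= now() - INTERVAL 1 DAY
  AND user_name = 'alice'
ORDER BY timestamp DESC;
```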
---
## Materialized Views
### mv_http_logs
- **Source**: `http_logs_raw`
- **Target**: `http_logs`
- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.
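
The two statements below sketch the shape of that transformation; they are not the view's actual definition. The JSON keys (`src_ip`, `host`, `ja4`) and the candidate bot names are assumptions made for illustration. The first statement shows `JSONExtract*` parsing plus an `IP_TRIE` dictionary lookup. The second shows how a priority cascade can be written with `multiIf`, using literal inputs so it runs on its own.

```sql
-- 1) Parsing and ASN enrichment (JSON keys are assumed).
SELECT
    JSONExtractString(raw_json, 'src_ip') AS src_ip_str,
    JSONExtractString(raw_json, 'host')   AS host,
    JSONExtractString(raw_json, 'ja4')    AS ja4,
    dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'asn',
                     tuple(IPv4StringToNum(src_ip_str)), toUInt32(0)) AS src_asn
FROM mabase_prod.http_logs_raw
LIMIT 10;

-- 2) Priority cascade: the first non-empty candidate wins.
WITH
    ''          AS ua_ip_match,    -- UA+IP combined
    ''          AS ua_match,       -- UA only
    'scanner-x' AS ip_match,       -- IP only
    'cloud-asn' AS asn_match,      -- ASN
    ''          AS country_match   -- Country
SELECT multiIf(
    ua_ip_match != '', ua_ip_match,
    ua_match    != '', ua_match,
    ip_match    != '', ip_match,
    asn_match   != '', asn_match,
    country_match
) AS anubis_bot_name;   -- yields 'scanner-x': the IP match outranks the ASN match
```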
### mv_agg_host_ip_ja4_1h
- **Source**: `http_logs`
- **Target**: `agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.
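
On the write side, a materialized view feeding an `AggregatingMergeTree` table typically emits plain values for `SimpleAggregateFunction` columns and `-State` values for `AggregateFunction` columns. The `SELECT` below is a reduced sketch of that pattern with three features. The feature definitions (for example, that `count_post` counts `POST` requests and `uniq_paths` is a `uniq` state over `path`) are assumptions, not the view's real expressions.

```sql
-- Shape of the aggregation only; run as a plain SELECT to inspect one hour of data.
SELECT
    toStartOfHour(time)       AS window_start,
    src_ip,
    ja4,
    host,
    src_asn,
    count()                   AS hits,         -- plain value for a sum-style column
    countIf(method = 'POST')  AS count_post,   -- assumed definition
    uniqState(path)           AS uniq_paths    -- partial state, merged later with uniqMerge
FROM mabase_prod.http_logs
WHERE time >= now() - INTERVAL 1 HOUR
GROUP BY window_start, src_ip, ja4, host, src_asn
LIMIT 10;
```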
### mv_agg_header_fingerprint_1h
- **Source**: `http_logs`
- **Target**: `agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.
---
## Dictionaries
### dict_iplocate_asn
- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- **Key**: `network` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- **Lifetime**: 3600 to 7200 seconds (min/max)
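
As a rough guide, the properties above correspond to DDL like the following. This is a sketch: it reads the lifetime as a 3600 to 7200 second range, and the `String` attribute types, the defaults, and the `CSV` format name are assumptions not stated in this document.

```sql
-- Sketch only: attribute types other than asn, and the CSV format, are assumed.
CREATE DICTIONARY IF NOT EXISTS mabase_prod.dict_iplocate_asn
(
    network      String,
    asn          UInt32 DEFAULT 0,
    country_code String DEFAULT '',
    name         String DEFAULT '',
    org          String DEFAULT '',
    domain       String DEFAULT ''
)
PRIMARY KEY network
SOURCE(FILE(path '/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv' format 'CSV'))
LAYOUT(IP_TRIE)
LIFETIME(MIN 3600 MAX 7200);   -- reload at a random point inside this window
```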
### dict_bot_ip
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### dict_bot_ja4
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- **Key**: `ja4` (String)
- **Layout**: `COMPLEX_KEY_HASHED`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### dict_asn_reputation
- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- **Key**: `src_asn` (UInt64)
- **Layout**: `HASHED`
- **Attributes**: `label` (String)
- **Lifetime**: 300 seconds
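
Each layout expects a different key shape at lookup time: `IP_TRIE` and `COMPLEX_KEY_HASHED` take the key wrapped in a tuple, while `HASHED` takes a bare `UInt64`. The lookups below illustrate this with made-up inputs; the IP, JA4 string, ASN, and fallback values are examples only.

```sql
SELECT
    -- IP_TRIE: IPv4 address passed as a numeric key inside a tuple
    dictGetOrDefault('mabase_prod.dict_bot_ip', 'bot_name',
                     tuple(IPv4StringToNum('192.0.2.10')), '')            AS ip_bot,
    -- COMPLEX_KEY_HASHED: String key inside a tuple
    dictGetOrDefault('mabase_prod.dict_bot_ja4', 'bot_name',
                     tuple('t13d1516h2_8daaf6152771_02713d6af862'), '')   AS ja4_bot,
    -- HASHED: plain UInt64 key
    dictGetOrDefault('mabase_prod.dict_asn_reputation', 'label',
                     toUInt64(64496), 'unknown')                          AS asn_label;
```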
### dict_anubis_ua
- **Source**: ClickHouse table `anubis_ua_rules`
- **Key**: `regexp` (String)
- **Layout**: `REGEXP_TREE`
- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- **Lifetime**: 300 to 600 seconds (min/max)
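
A `REGEXP_TREE` dictionary backed by a table reads its attributes from parallel `keys` and `values` arrays on each rule row. The sketch below shows one hypothetical rule and a lookup against it. It assumes `keys` and `values` are `Array(String)` and that `parent_id = 0` marks a root rule; the rule itself is invented, so try it in a scratch database rather than production.

```sql
-- Hypothetical rule row: not one of the shipped Anubis rules.
INSERT INTO mabase_prod.anubis_ua_rules (id, parent_id, regexp, keys, values) VALUES
    (1, 0, 'ExampleBot', ['bot_name', 'action', 'category'], ['ExampleBot', 'DENY', 'scraper']);

-- Pick up the new row without waiting for the lifetime to expire.
SYSTEM RELOAD DICTIONARY mabase_prod.dict_anubis_ua;

-- Fetch several attributes at once for a User-Agent string.
SELECT dictGet('mabase_prod.dict_anubis_ua', ('bot_name', 'action'),
               'Mozilla/5.0 (compatible; ExampleBot/1.0)') AS anubis_match;
```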
### dict_anubis_ip
- **Source**: ClickHouse table `anubis_ip_rules`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- **Lifetime**: 300 to 600 seconds (min/max)
### dict_anubis_asn
- **Source**: ClickHouse table `anubis_asn_rules`
- **Key**: `asn` (UInt32)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300 to 600 seconds (min/max)
### dict_anubis_country
- **Source**: ClickHouse table `anubis_country_rules`
- **Key**: `country_code` (String)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300 to 600 seconds (min/max)
---
## Views
### view_ai_features_1h
Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`. Includes:
- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`
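
A consumer such as the ML pipeline or an analyst can read the view like any table. The query below is an illustrative example using feature names from the list above; it assumes `bot_name` is an empty string when no known bot matched, which this document does not state explicitly.

```sql
-- Fastest unidentified clients by hit velocity.
SELECT
    src_ip,
    ja4,
    host,
    hits,
    hit_velocity,
    fuzzing_index,
    post_ratio,
    modern_browser_score
FROM mabase_prod.view_ai_features_1h
WHERE bot_name = ''            -- assumed marker for "no known bot"
ORDER BY hit_velocity DESC
LIMIT 20;
```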
### view_ip_recurrence
Aggregates recurrence data from `ml_detected_anomalies`:
```sql
SELECT
    src_ip,
    count() AS recurrence,
    min(detected_at) AS first_seen,
    max(detected_at) AS last_seen,
    min(anomaly_score) AS worst_score,
    argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies
GROUP BY src_ip;
```
```
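
The view takes `min(anomaly_score)` as the worst score, which implies that lower scores are more anomalous. A typical read might look like the sketch below; the recurrence threshold of 3 is an arbitrary example value.

```sql
-- Repeat offenders, most frequently detected first.
SELECT
    src_ip,
    recurrence,
    first_seen,
    last_seen,
    worst_score,
    worst_threat_level
FROM mabase_prod.view_ip_recurrence
WHERE recurrence >= 3
ORDER BY recurrence DESC, worst_score ASC
LIMIT 20;
```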
---
## User Accounts
| User | Permissions | Purpose |
|------|------------|---------|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |
> **Security note**: Default passwords are `ChangeMe` — replace with strong passwords before production use. Store credentials in a secrets manager.
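
The permissions in the table can be expressed with standard ClickHouse access-control statements. The following is a minimal sketch, assuming SQL-driven access management is enabled and that `sha256_password` is an acceptable authentication type for the deployment; the password literals are placeholders to be replaced with values from a secrets manager.

```sql
-- Placeholders only: never commit real passwords.
CREATE USER IF NOT EXISTS data_writer IDENTIFIED WITH sha256_password BY 'REPLACE_ME_WRITER';
GRANT INSERT, SELECT ON mabase_prod.http_logs_raw TO data_writer;

CREATE USER IF NOT EXISTS analyst IDENTIFIED WITH sha256_password BY 'REPLACE_ME_ANALYST';
GRANT SELECT ON mabase_prod.http_logs             TO analyst;
GRANT SELECT ON mabase_prod.ml_detected_anomalies TO analyst;
GRANT SELECT ON mabase_prod.ml_all_scores         TO analyst;
GRANT SELECT ON mabase_prod.view_ai_features_1h   TO analyst;
GRANT SELECT ON mabase_prod.view_ip_recurrence    TO analyst;
GRANT SELECT ON mabase_prod.audit_logs            TO analyst;

-- Rotating a default password on an existing account.
ALTER USER analyst IDENTIFIED WITH sha256_password BY 'REPLACE_ME_ROTATED';
```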