feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized

Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

256
docs/database/migrations.md
Normal file
@ -0,0 +1,256 @@
# Database Migrations

The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in `shared/clickhouse/`. Migrations are idempotent (using `IF NOT EXISTS` / `IF EXISTS`) and must be applied in numeric order.

## Migration Order

| File | Purpose |
|------|---------|
| `00_database.sql` | Creates the `mabase_prod` database |
| `01_raw_tables.sql` | Creates `http_logs_raw` ingest table (MergeTree, 1-day TTL) |
| `02_dictionaries.sql` | Creates ASN geo dictionary (`dict_iplocate_asn`), bot IP/JA4 reference tables, `ref_bot_networks` |
| `03_anubis_tables.sql` | Creates Anubis crawler rule tables (`anubis_ua_rules`, `anubis_ip_rules`, `anubis_asn_rules`, `anubis_country_rules`) and their dictionaries (`dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`) |
| `04_mv_http_logs.sql` | Creates the canonical `http_logs` table and `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | Creates reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_asn_reputation`), behavioral aggregation tables (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`), and their materialized views |
| `06_ml_tables.sql` | Creates ML output tables (`ml_detected_anomalies`, `ml_all_scores`) and `view_ip_recurrence` |
| `07_ai_features_view.sql` | Creates `view_ai_features_1h` — the 50+ feature view used by bot-detector |
| `08_users.sql` | Creates ClickHouse users (`data_writer`, `analyst`) and grants permissions |
| `09_audit_table.sql` | Creates `audit_logs` table for SOC dashboard audit trail |

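Because the files must run in numeric order, it can help to confirm that the sequence has no gaps or repeated numbers before applying anything. The following is a minimal sketch, assuming the `NN_name.sql` naming shown in the table above; `check_migration_sequence` is a hypothetical helper name and is not part of `shared/clickhouse/`.

```bash
# Print the migration files in DIR in apply order; fail on a gap or a
# repeated number. Assumes two-digit NN_ prefixes starting at 00.
check_migration_sequence() (
    dir=${1:-.}
    expected=0
    for path in "$dir"/[0-9][0-9]_*.sql; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$path" ] || { echo "no migration files in $dir" >&2; exit 1; }
        file=${path##*/}
        want=$(printf '%02d' "$expected")
        if [ "${file%%_*}" != "$want" ]; then
            echo "expected migration $want, found $file" >&2
            exit 1
        fi
        echo "$file"
        expected=$((expected + 1))
    done
)
```

Running `check_migration_sequence shared/clickhouse` prints the files in apply order and returns a non-zero status at the first missing or repeated number.
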
## Prerequisites

### 1. ClickHouse Server

A running ClickHouse server (version 23.8+ recommended for `REGEXP_TREE` dictionary support).

### 2. CSV Data Files

Place the following files in `/var/lib/clickhouse/user_files/`:

| File | Source | Description |
|------|--------|-------------|
| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | IP-to-ASN mapping with country, org, domain |
| `bot_ip.csv` | Custom | Known bot IP prefixes (CIDR format) |
| `bot_ja4.csv` | Custom | Known bot JA4 fingerprints |
| `asn_reputation.csv` | Custom | ASN reputation labels (`human`, `bot`, `unknown`) |

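A quick pre-flight check can catch a missing CSV before a dictionary fails to load. This is a sketch only: `missing_csv_files` is a hypothetical helper, and it checks that the files exist, not that their contents are valid or readable by the ClickHouse service account.

```bash
# List the required CSV files that are absent from DIR.
# Returns 0 when all four are present, 1 otherwise.
missing_csv_files() (
    dir=${1:-/var/lib/clickhouse/user_files}
    status=0
    for name in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv asn_reputation.csv; do
        if [ ! -f "$dir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    exit "$status"
)
```

Called without arguments, `missing_csv_files` inspects the directory named above and prints one missing file name per line.
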
### 3. Anubis Passwords

Migration `03_anubis_tables.sql` contains placeholder passwords (`CHANGE_ME`) for the Anubis dictionaries. Replace these with the actual ClickHouse admin password before applying:

```bash
sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql
```

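A password that contains `/`, `&` or `\` is misinterpreted by the `sed` replacement above. One way around this, sketched here with a hypothetical helper named `escape_sed_replacement`, is to escape those three characters first. The sketch does not handle newlines, and a single quote in the password would still need attention if the placeholder sits inside a quoted SQL string.

```bash
# Escape the characters that are special in a sed replacement when "/" is
# the delimiter: backslash, slash and ampersand.
escape_sed_replacement() {
    printf '%s' "$1" | sed -e 's|[\\/&]|\\&|g'
}

# Example (commented out so that sourcing this file changes nothing):
# sed -i "s/CHANGE_ME/$(escape_sed_replacement "$ANUBIS_PASSWORD")/g" 03_anubis_tables.sql
```

`ANUBIS_PASSWORD` is an assumed variable name used only for illustration.
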
## How to Apply

### Full Initial Setup

Apply all migrations in order:

```bash
cd shared/clickhouse/

clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
```

### With Authentication

```bash
clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql
# ... repeat for each file
```

### One-Liner (All at Once)

```bash
cd shared/clickhouse/
for f in 0*.sql; do
    echo "Applying $f..."
    # Stop at the first failure so later migrations never run out of order.
    clickhouse-client --multiquery < "$f" || break
done
```

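The authentication flags and the loop can be combined into one reusable function. What follows is a sketch under stated assumptions: `apply_migrations`, `CH_CLIENT`, `CH_USER` and `CH_PASSWORD` are hypothetical names introduced here, while the `--user`, `--password` and `--multiquery` flags are the ones already used in this document.

```bash
# Apply every NN_*.sql file in DIR in order and stop at the first failure.
# CH_CLIENT lets a different client binary (or a test stub) be substituted.
apply_migrations() (
    dir=${1:-.}
    client=${CH_CLIENT:-clickhouse-client}
    for f in "$dir"/[0-9][0-9]_*.sql; do
        echo "Applying ${f##*/}..."
        if [ -n "${CH_USER:-}" ]; then
            "$client" --user "$CH_USER" --password "${CH_PASSWORD:-}" --multiquery < "$f"
        else
            "$client" --multiquery < "$f"
        fi || { echo "Failed on ${f##*/}; stopping." >&2; exit 1; }
    done
)
```

With credentials exported, `CH_USER=admin CH_PASSWORD=... apply_migrations shared/clickhouse` applies all ten files and returns non-zero as soon as one of them fails.
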
## How to Verify

After applying all migrations, run these queries to verify each migration was successful:

### 00 — Database

```sql
SHOW DATABASES LIKE 'mabase_prod';
-- Expected: mabase_prod
```

### 01 — Raw Tables

```sql
EXISTS mabase_prod.http_logs_raw;
-- Expected: 1
```

### 02 — Dictionaries

```sql
SELECT dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'country_code',
                        toIPv6(toIPv4('8.8.8.8')), 'MISSING');
-- Expected: US (if CSV loaded) or MISSING
```

### 03 — Anubis Tables

```sql
EXISTS mabase_prod.anubis_ua_rules;
EXISTS mabase_prod.anubis_ip_rules;
EXISTS mabase_prod.anubis_asn_rules;
EXISTS mabase_prod.anubis_country_rules;
-- Expected: 1 for each
```

### 04 — MV + http_logs

```sql
EXISTS mabase_prod.http_logs;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'mv_http_logs';
-- Expected: mv_http_logs
```

### 05 — Aggregation Tables

```sql
EXISTS mabase_prod.agg_host_ip_ja4_1h;
EXISTS mabase_prod.agg_header_fingerprint_1h;
SELECT name FROM system.dictionaries WHERE database = 'mabase_prod' AND name = 'dict_bot_ip';
-- Expected: dict_bot_ip
```

### 06 — ML Tables

```sql
EXISTS mabase_prod.ml_detected_anomalies;
EXISTS mabase_prod.ml_all_scores;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name LIKE 'view_ip%';
-- Expected: view_ip_recurrence
```

### 07 — AI Features View

```sql
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'view_ai_features_1h';
-- Expected: view_ai_features_1h
```

### 08 — Users

```sql
SHOW GRANTS FOR data_writer;
-- Expected: GRANT INSERT, SELECT ON mabase_prod.http_logs_raw TO data_writer
SHOW GRANTS FOR analyst;
-- Expected: GRANT SELECT ON multiple tables
```

### 09 — Audit Table

```sql
EXISTS mabase_prod.audit_logs;
-- Expected: 1
```

### Full Verification Query

```sql
SELECT
    count() AS total_tables
FROM system.tables
WHERE database = 'mabase_prod'
  AND name IN (
    'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
    'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks',
    'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules',
    'audit_logs', 'bot_ip', 'bot_ja4'
  );
-- Expected: 14
```

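A count below 14 does not say which table is absent. The sketch below prints the missing names instead; it reads the actual table names from standard input, one per line. `missing_tables` is a hypothetical helper, and the expected list is copied from the query above.

```bash
# Print each expected table that does not appear on stdin (one name per line).
# Returns 0 when nothing is missing, 1 otherwise.
missing_tables() (
    actual=$(cat)
    status=0
    for t in http_logs_raw http_logs agg_host_ip_ja4_1h agg_header_fingerprint_1h \
             ml_detected_anomalies ml_all_scores ref_bot_networks \
             anubis_ua_rules anubis_ip_rules anubis_asn_rules anubis_country_rules \
             audit_logs bot_ip bot_ja4; do
        # -x matches whole lines only, so http_logs is not satisfied by http_logs_raw.
        if ! printf '%s\n' "$actual" | grep -qxF "$t"; then
            echo "$t"
            status=1
        fi
    done
    exit "$status"
)
```

A typical invocation pipes the table list into it: `clickhouse-client --query "SELECT name FROM system.tables WHERE database = 'mabase_prod'" | missing_tables`.
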
## Rollback Notes

### General Approach

ClickHouse does not support transactional DDL, so a migration cannot be undone automatically. To roll back a migration, drop the objects it created with the statement that matches each object type:

1. **Tables**: `DROP TABLE IF EXISTS mabase_prod.<table_name>`
2. **Materialized Views**: `DROP VIEW IF EXISTS mabase_prod.<mv_name>` (drop MV before its target table)
3. **Dictionaries**: `DROP DICTIONARY IF EXISTS mabase_prod.<dict_name>`
4. **Views**: `DROP VIEW IF EXISTS mabase_prod.<view_name>`
5. **Users**: `DROP USER IF EXISTS <username>`

### Rollback Order (Reverse of Apply)

```sql
-- 09: Audit
DROP TABLE IF EXISTS mabase_prod.audit_logs;

-- 08: Users
DROP USER IF EXISTS data_writer;
DROP USER IF EXISTS analyst;

-- 07: AI Features View
DROP VIEW IF EXISTS mabase_prod.view_ai_features_1h;

-- 06: ML Tables
DROP VIEW IF EXISTS mabase_prod.view_ip_recurrence;
DROP TABLE IF EXISTS mabase_prod.ml_all_scores;
DROP TABLE IF EXISTS mabase_prod.ml_detected_anomalies;

-- 05: Aggregation
DROP VIEW IF EXISTS mabase_prod.mv_agg_header_fingerprint_1h;
DROP VIEW IF EXISTS mabase_prod.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS mabase_prod.agg_header_fingerprint_1h;
DROP TABLE IF EXISTS mabase_prod.agg_host_ip_ja4_1h;
DROP DICTIONARY IF EXISTS mabase_prod.dict_asn_reputation;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ja4;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ip;

-- 04: MV + http_logs
DROP VIEW IF EXISTS mabase_prod.mv_http_logs;
DROP TABLE IF EXISTS mabase_prod.http_logs;

-- 03: Anubis
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_country;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_asn;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ip;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ua;
DROP TABLE IF EXISTS mabase_prod.anubis_country_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_asn_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ip_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ua_rules;

-- 02: Dictionaries
DROP DICTIONARY IF EXISTS mabase_prod.dict_iplocate_asn;
DROP TABLE IF EXISTS mabase_prod.bot_ja4;
DROP TABLE IF EXISTS mabase_prod.bot_ip;
DROP TABLE IF EXISTS mabase_prod.ref_bot_networks;

-- 01: Raw Tables
DROP TABLE IF EXISTS mabase_prod.http_logs_raw;

-- 00: Database
DROP DATABASE IF EXISTS mabase_prod;
```

### Important Notes

- **Data loss**: Dropping tables destroys all data. Always back up before rollback.
- **MV dependency**: Materialized views must be dropped before their target tables.
- **Dictionary dependency**: Views and MVs that reference a dictionary fail once that dictionary is dropped, so drop them before the dictionary.
- **Idempotent re-apply**: After rollback, migrations can be safely re-applied since they use `IF NOT EXISTS`.
- **Canonical MV**: `04_mv_http_logs.sql` is the canonical version of the MV, superseding any base version in `services/correlator/sql/init.sql`.

334
docs/database/schema.md
Normal file
@ -0,0 +1,334 @@
# Database Schema

ja4-platform uses ClickHouse as its central data store; all objects live in the `mabase_prod` database. This document describes every table, materialized view, dictionary, and view in the schema.

## Tables

### http_logs_raw

Raw JSON ingest table — direct target for correlator INSERTs.

| Column | Type | Description |
|--------|------|-------------|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) |

- **Engine**: MergeTree
- **Partition by**: `toDate(ingest_time)`
- **Order by**: `ingest_time`
- **TTL**: `ingest_time + INTERVAL 1 DAY`

---

### http_logs

Parsed and enriched HTTP log table — populated by `mv_http_logs` materialized view.

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: `toDate(time)`) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |

- **Engine**: MergeTree
- **Partition by**: `log_date`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 7 DAY`

---

### agg_host_ip_ja4_1h

Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation.

Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`.

- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`

---

### agg_header_fingerprint_1h

Header-level behavioral fingerprint aggregation per `(src_ip)` per hour.

| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |

- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`

---

### ml_detected_anomalies

Anomaly detections above the threat threshold.

Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`).

- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 30 DAY`

---

### ml_all_scores

All ML classifications (no threshold filter) for observability.

Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment.

- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 3 DAY`

---

### ref_bot_networks

Bot network CIDR reference table.

| Column | Type | Description |
|--------|------|-------------|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |

- **Engine**: ReplacingMergeTree(last_update)
- **Order by**: `(network, bot_name)`

---

### bot_ip / bot_ja4

CSV-backed flat tables for quick bot lookups.

- `bot_ip`: single column `ip` (String) — Engine: File(CSV, 'bot_ip.csv')
- `bot_ja4`: single column `ja4` (String) — Engine: File(CSV, 'bot_ja4.csv')

---

### Anubis Rule Tables

| Table | Key | Columns | Engine |
|-------|-----|---------|--------|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |

---

### audit_logs

SOC audit trail for dashboard activity.

| Column | Type | Default | Description |
|--------|------|---------|-------------|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | — | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String (ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |

- **Engine**: MergeTree
- **Partition by**: `toDate(timestamp)`
- **Order by**: `(timestamp, user_name, action)`
- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY`

---

## Materialized Views

### mv_http_logs

- **Source**: `http_logs_raw`
- **Target**: `http_logs`
- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.

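The priority cascade amounts to "first match wins" across five lookups. The real logic is SQL inside `04_mv_http_logs.sql` and is not reproduced here; the sketch below only illustrates the ordering, with a hypothetical `anubis_pick` function whose arguments stand in for the five lookup results (an empty string means no rule matched).

```bash
# Print the first non-empty candidate. Arguments are given in priority order:
# UA+IP combined, UA only, IP only, ASN, Country.
anubis_pick() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # No rule matched: emit an empty line, mirroring the '' column default.
    printf '\n'
}
```

For example, `anubis_pick '' '' 'ip-rule-bot' 'asn-rule-bot' ''` prints `ip-rule-bot`, because the IP-only result outranks the ASN result.
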
### mv_agg_host_ip_ja4_1h

- **Source**: `http_logs`
- **Target**: `agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.

### mv_agg_header_fingerprint_1h

- **Source**: `http_logs`
- **Target**: `agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.

---

## Dictionaries

### dict_iplocate_asn

- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- **Key**: `network` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- **Lifetime**: 3600–7200 seconds

### dict_bot_ip

- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds

### dict_bot_ja4

- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- **Key**: `ja4` (String)
- **Layout**: `COMPLEX_KEY_HASHED`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds

### dict_asn_reputation

- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- **Key**: `src_asn` (UInt64)
- **Layout**: `HASHED`
- **Attributes**: `label` (String)
- **Lifetime**: 300 seconds

### dict_anubis_ua

- **Source**: ClickHouse table `anubis_ua_rules`
- **Key**: `regexp` (String)
- **Layout**: `REGEXP_TREE`
- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_ip

- **Source**: ClickHouse table `anubis_ip_rules`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_asn

- **Source**: ClickHouse table `anubis_asn_rules`
- **Key**: `asn` (UInt32)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds

### dict_anubis_country

- **Source**: ClickHouse table `anubis_country_rules`
- **Key**: `country_code` (String)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300–600 seconds

---

## Views

### view_ai_features_1h

Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`. Includes:

- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`

### view_ip_recurrence

Aggregates recurrence data from `ml_detected_anomalies`:

```sql
SELECT src_ip, count() AS recurrence,
       min(detected_at) AS first_seen, max(detected_at) AS last_seen,
       min(anomaly_score) AS worst_score,
       argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies GROUP BY src_ip;
```

---

## User Accounts

| User | Permissions | Purpose |
|------|------------|---------|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |

> **Security note**: Default passwords are `ChangeMe` — replace with strong passwords before production use. Store credentials in a secrets manager.

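One way to produce a replacement password is to draw random alphanumeric characters from `/dev/urandom`. This is a sketch, not a project policy: `gen_password` is a hypothetical helper, and the default length of 32 is an assumption.

```bash
# Print a random password of LENGTH characters (default 32) drawn from
# [A-Za-z0-9]. 4096 random bytes yield roughly 990 usable characters, so the
# sketch is only intended for lengths well below that.
gen_password() {
    head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c "1-${1:-32}"
}
```

The result can then be set with a statement such as `ALTER USER data_writer IDENTIFIED BY '<generated password>'` and stored in the secrets manager rather than in the migration file.
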