ja4-platform/shared/clickhouse/README.md
toto d469e39da7 feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized
Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 16:42:59 +02:00

# ClickHouse Migrations — ja4-platform
## Migration Order
Apply these files in numeric order against the ClickHouse server:
```bash
clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
```
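
The same sequence can be written as a loop. The following is a minimal sketch, not part of the repository: `apply_migrations` and the `CH_CLIENT` override are hypothetical names introduced here, and the function assumes it is given the directory that holds the `NN_*.sql` files.

```bash
# Sketch: apply every NN_*.sql file in a directory in numeric order.
# CH_CLIENT is a hypothetical override; it defaults to the client used above.
apply_migrations() {
    local dir="${1:-.}"
    local client="${CH_CLIENT:-clickhouse-client}"
    local f
    # Glob results are sorted lexicographically, which matches numeric
    # order as long as every file keeps its two-digit prefix.
    for f in "$dir"/[0-9][0-9]_*.sql; do
        if [ ! -e "$f" ]; then
            echo "no migration files found in $dir" >&2
            return 1
        fi
        echo "applying $(basename "$f")"
        # Stop at the first failing file so later migrations never run
        # against a partially built schema.
        "$client" --multiquery < "$f" || return 1
    done
}
```

Calling `apply_migrations .` from this directory should be equivalent to the ten commands above, with the difference that it stops at the first failure.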
## File Descriptions
| File | Contents |
|------|----------|
| `00_database.sql` | `CREATE DATABASE` statement |
| `01_raw_tables.sql` | `http_logs_raw` ingest table |
| `02_dictionaries.sql` | ASN geo dict, bot IP/JA4/network reference tables |
| `03_anubis_tables.sql` | Anubis crawler rule tables and dictionaries (UA, IP, ASN, country) |
| `04_mv_http_logs.sql` | Canonical `http_logs` target table + `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | `agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h` + their MVs |
| `06_ml_tables.sql` | `ml_detected_anomalies`, `ml_all_scores` |
| `07_ai_features_view.sql` | `view_ai_features_1h` with Anubis enrichment |
| `08_users.sql` | ClickHouse users and grants |
| `09_audit_table.sql` | `audit_logs` table for SOC dashboard audit trail |
## Prerequisites
Place the following CSV data files in `/var/lib/clickhouse/user_files/` before applying the migrations:
- `iplocate-ip-to-asn.csv` — IP-to-ASN mapping (from IPLocate)
- `bot_ip.csv` — Known bot IP prefixes
- `bot_ja4.csv` — Known bot JA4 fingerprints
- `asn_reputation.csv` — ASN reputation labels
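
A quick pre-flight check can catch a missing file before any migration runs. This is a sketch only: `check_csv_prereqs` is a hypothetical helper, and treating an empty file as missing is an assumption made here, not a rule stated by the migrations.

```bash
# Sketch: confirm the four CSV files listed above are present and non-empty.
check_csv_prereqs() {
    local dir="${1:-/var/lib/clickhouse/user_files}"
    local missing=0
    local f
    for f in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv asn_reputation.csv; do
        # -s is true only for files that exist and have a size greater than zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns a non-zero status if any file is absent, so it can gate the migration step, for example `check_csv_prereqs && clickhouse-client --multiquery < 00_database.sql`.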
## Notes
- `04_mv_http_logs.sql` is the **canonical** version of the materialized view. It supersedes the base version in `services/correlator/sql/init.sql` and includes full Anubis enrichment.
- All migrations are idempotent: they use `IF NOT EXISTS` / `IF EXISTS` guards, so they can be re-applied safely.
- Anubis dictionary passwords in `03_anubis_tables.sql` **must** be changed before production use.
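
The idempotency note can be smoke-tested by applying a file twice and checking that both passes succeed. The helper below is a hypothetical sketch (`check_idempotent` and `CH_CLIENT` are names introduced here); it only checks exit codes and does not compare the resulting schema.

```bash
# Sketch: run one migration file twice; an idempotent file should succeed both times.
# CH_CLIENT is a hypothetical override for the client command.
check_idempotent() {
    local file="$1"
    local client="${CH_CLIENT:-clickhouse-client}"
    local pass
    for pass in 1 2; do
        if ! "$client" --multiquery < "$file"; then
            echo "pass $pass failed for $file" >&2
            return 1
        fi
    done
    echo "ok: $file"
}
```

Running it against a test server, for example `check_idempotent 01_raw_tables.sql`, gives a cheap regression check when a migration file is edited.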