feat: ja4-platform monorepo — 5 services unified, tests & RPM builds standardized

Services:
- ja4sentinel: TLS/JA4 fingerprint capture daemon (Go, libpcap)
- logcorrelator: JA4 log correlation engine (Go, ClickHouse)
- mod_reqin_log: Apache module (C, JSON request logging)
- bot_detector: ML bot detection pipeline (Python)
- dashboard: FastAPI/Streamlit analytics UI (Python)

Shared libraries:
- shared/go/ja4common: logger, config, shutdown, ipfilter (Go module)
- shared/python/ja4_common: ClickHouseClient, ClickHouseSettings (Python package)
- shared/clickhouse/: canonical SQL migrations (10 files)

Build & packaging:
- Unified 3-stage Dockerfile.package for Go RPMs (el8/el9/el10)
- go.work workspace linking sentinel, correlator, ja4common
- Makefile with test-all, build-all, rpm-* targets

Fixes applied:
- go.work: 1.21 → 1.24.6 (required by sentinel)
- correlator Dockerfiles: golang:1.21 → golang:1.24
- replace directives in go.mod for ja4common local path
- pyproject.toml: setuptools.backends → setuptools.build_meta
- Removed static libpcap linking (unavailable on Rocky 9)
- Fixed data races in output/writers_test.go (sync.Mutex + atomic.Int32)
- Rewrote corrupted test files (logger_test.go × 2)

Test coverage:
- correlator: 67.1% total (unixsocket 80.5%, config 91.7%, app 83.3%, multi 87.7%, stdout 100%)
- sentinel: all 10 packages pass (api, capture, config, fingerprint, ipfilter, logging, output, tlsparse)

Documentation:
- README.md + docs/ (architecture, development, 5 services, shared libs, DB schema & migrations)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-07 16:42:59 +02:00
commit d469e39da7
278 changed files with 1621301 additions and 0 deletions

docs/database/migrations.md Normal file

@ -0,0 +1,256 @@
# Database Migrations
The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in `shared/clickhouse/`. Migrations are idempotent (using `IF NOT EXISTS` / `IF EXISTS`) and must be applied in numeric order.
## Migration Order
| File | Purpose |
|------|---------|
| `00_database.sql` | Creates the `mabase_prod` database |
| `01_raw_tables.sql` | Creates `http_logs_raw` ingest table (MergeTree, 1-day TTL) |
| `02_dictionaries.sql` | Creates ASN geo dictionary (`dict_iplocate_asn`), bot IP/JA4 reference tables, `ref_bot_networks` |
| `03_anubis_tables.sql` | Creates Anubis crawler rule tables (`anubis_ua_rules`, `anubis_ip_rules`, `anubis_asn_rules`, `anubis_country_rules`) and their dictionaries (`dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`) |
| `04_mv_http_logs.sql` | Creates the canonical `http_logs` table and `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | Creates reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_asn_reputation`), behavioral aggregation tables (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`), and their materialized views |
| `06_ml_tables.sql` | Creates ML output tables (`ml_detected_anomalies`, `ml_all_scores`) and `view_ip_recurrence` |
| `07_ai_features_view.sql` | Creates `view_ai_features_1h`, the 50+ feature view used by `bot_detector` |
| `08_users.sql` | Creates ClickHouse users (`data_writer`, `analyst`) and grants permissions |
| `09_audit_table.sql` | Creates `audit_logs` table for SOC dashboard audit trail |
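
Because the order matters, it can help to confirm that the directory holds an unbroken `00..N` sequence before applying anything. The following is a minimal sketch, not part of the repository: `check_migration_sequence` is a hypothetical helper name, and it assumes every migration file is named `NN_<name>.sql` with a two-digit prefix, as in the table above.

```bash
# Hypothetical helper (not part of the repository): verify that a directory
# holds migrations numbered 00..N with no gaps and no duplicate prefixes.
# Problems are printed to stderr; the return code is non-zero if any is found.
check_migration_sequence() {
  local dir="$1" expected=0 status=0 f prefix number
  # The glob expands in sorted order, which is numeric order for 2-digit prefixes.
  for f in "$dir"/[0-9][0-9]_*.sql; do
    [ -e "$f" ] || { echo "no migration files in $dir" >&2; return 1; }
    prefix="$(basename "$f")"
    prefix="${prefix%%_*}"
    # Strip one leading zero so that 08 and 09 are compared as decimal numbers.
    number="${prefix#0}"
    if [ "$number" -ne "$expected" ]; then
      echo "expected prefix $(printf '%02d' "$expected"), found $prefix" >&2
      status=1
      expected="$number"
    fi
    expected=$((expected + 1))
  done
  return "$status"
}

# Example: check_migration_sequence shared/clickhouse
```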
## Prerequisites
### 1. ClickHouse Server
A running ClickHouse server (version 23.8+ recommended for `REGEXP_TREE` dictionary support).
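
To compare a running server against that recommendation, a sketch along these lines may be useful. `version_at_least` is a hypothetical helper name, and the version string is assumed to have the dotted form (for example `24.3.2.23`) that `SELECT version()` returns.

```bash
# Hypothetical helper: succeed if a dotted ClickHouse version string is at
# least the given major.minor. Only the first two components are compared.
version_at_least() {
  local version="$1" want_major="$2" want_minor="$3" major minor rest
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # second component
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example:
#   version_at_least "$(clickhouse-client --query 'SELECT version()')" 23 8 \
#     || echo "ClickHouse is older than 23.8" >&2
```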
### 2. CSV Data Files
Place the following files in `/var/lib/clickhouse/user_files/`:
| File | Source | Description |
|------|--------|-------------|
| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | IP-to-ASN mapping with country, org, domain |
| `bot_ip.csv` | Custom | Known bot IP prefixes (CIDR format) |
| `bot_ja4.csv` | Custom | Known bot JA4 fingerprints |
| `asn_reputation.csv` | Custom | ASN reputation labels (`human`, `bot`, `unknown`) |
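
A pre-flight check for these files could look like the sketch below. `check_csv_files` is a hypothetical helper name; the file names and default directory are the ones listed above, and "present" is assumed to mean "exists and is non-empty".

```bash
# Hypothetical pre-flight check: report every required CSV that is missing or
# empty. The default directory is the one named above; pass another to override.
check_csv_files() {
  local dir="${1:-/var/lib/clickhouse/user_files}" missing=0 name
  for name in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv asn_reputation.csv; do
    # -s is true only for files that exist and have a size greater than zero
    if [ ! -s "$dir/$name" ]; then
      echo "missing or empty: $dir/$name" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example: check_csv_files || exit 1
```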
### 3. Anubis Passwords
Migration `03_anubis_tables.sql` contains placeholder passwords (`CHANGE_ME`) for the Anubis dictionaries. Replace them with the actual ClickHouse admin password before applying. The command below edits the file in place, so take care not to commit the modified file:
```bash
sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql
```
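A password containing `/`, `&`, or `\` would break that `sed` expression. As an alternative sketch, the substitution can be done on the fly, leaving the file on disk untouched. `render_with_password` and the `ANUBIS_DICT_PASSWORD` environment variable are hypothetical names chosen for illustration, and the password is assumed to contain no single quotes or newlines.

```bash
# Hypothetical helper: print a migration with CHANGE_ME replaced by the value
# of ANUBIS_DICT_PASSWORD (an assumed variable name). The file is not modified.
render_with_password() {
  local file="$1" escaped
  # Escape the characters that are special in a sed replacement:
  # backslash, ampersand, and the "/" delimiter used below.
  escaped="$(printf '%s' "$ANUBIS_DICT_PASSWORD" | sed -e 's|[\\/&]|\\&|g')"
  sed "s/CHANGE_ME/$escaped/g" "$file"
}

# Example:
#   render_with_password 03_anubis_tables.sql | clickhouse-client --multiquery
```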
## How to Apply
### Full Initial Setup
Apply all migrations in order:
```bash
cd shared/clickhouse/
clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
```
### With Authentication
```bash
clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql
# ... repeat for each file
```
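Rather than repeating the authenticated command by hand for each file, the loop can be wrapped in a function that takes the client command from a variable. This is a sketch under assumptions: `apply_migrations` and `CLICKHOUSE_CLIENT` are hypothetical names, and the command string is split on whitespace, so it does not suit passwords that contain spaces.

```bash
# Hypothetical runner: apply every NN_*.sql file in a directory in glob
# (numeric) order and stop at the first failure. The client command comes from
# CLICKHOUSE_CLIENT (assumed name) so credentials or a stub can be swapped in.
apply_migrations() {
  local dir="$1" f
  for f in "$dir"/[0-9][0-9]_*.sql; do
    [ -e "$f" ] || { echo "no migration files in $dir" >&2; return 1; }
    echo "Applying $(basename "$f")..."
    # Intentionally unquoted: the command string is split into words.
    if ! ${CLICKHOUSE_CLIENT:-clickhouse-client --multiquery} < "$f"; then
      echo "Stopped: $(basename "$f") failed; later migrations were not applied." >&2
      return 1
    fi
  done
}

# Example:
#   CLICKHOUSE_CLIENT="clickhouse-client --user admin --password your_password --multiquery" \
#     apply_migrations shared/clickhouse
```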
### One-Liner (All at Once)
```bash
cd shared/clickhouse/
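for f in 0*.sql; do
  echo "Applying $f..."
  # Stop at the first failure so later migrations never run against a partial schema
  clickhouse-client --multiquery < "$f" || { echo "Failed on $f" >&2; break; }
done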
```
## How to Verify
After applying all migrations, run these queries to verify each migration was successful:
### 00 — Database
```sql
SHOW DATABASES LIKE 'mabase_prod';
-- Expected: mabase_prod
```
### 01 — Raw Tables
```sql
EXISTS mabase_prod.http_logs_raw;
-- Expected: 1
```
### 02 — Dictionaries
```sql
SELECT dictGetOrDefault('mabase_prod.dict_iplocate_asn', 'country_code',
                        toIPv6(toIPv4('8.8.8.8')), 'MISSING');
-- Expected: US (if CSV loaded) or MISSING
```
### 03 — Anubis Tables
```sql
EXISTS mabase_prod.anubis_ua_rules;
EXISTS mabase_prod.anubis_ip_rules;
EXISTS mabase_prod.anubis_asn_rules;
EXISTS mabase_prod.anubis_country_rules;
-- Expected: 1 for each
```
### 04 — MV + http_logs
```sql
EXISTS mabase_prod.http_logs;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'mv_http_logs';
-- Expected: mv_http_logs
```
### 05 — Aggregation Tables
```sql
EXISTS mabase_prod.agg_host_ip_ja4_1h;
EXISTS mabase_prod.agg_header_fingerprint_1h;
SELECT name FROM system.dictionaries WHERE database = 'mabase_prod' AND name = 'dict_bot_ip';
-- Expected: dict_bot_ip
```
### 06 — ML Tables
```sql
EXISTS mabase_prod.ml_detected_anomalies;
EXISTS mabase_prod.ml_all_scores;
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name LIKE 'view_ip%';
-- Expected: view_ip_recurrence
```
### 07 — AI Features View
```sql
SELECT name FROM system.tables WHERE database = 'mabase_prod' AND name = 'view_ai_features_1h';
-- Expected: view_ai_features_1h
```
### 08 — Users
```sql
SHOW GRANTS FOR data_writer;
-- Expected: GRANT INSERT, SELECT ON mabase_prod.http_logs_raw TO data_writer
SHOW GRANTS FOR analyst;
-- Expected: GRANT SELECT ON multiple tables
```
### 09 — Audit Table
```sql
EXISTS mabase_prod.audit_logs;
-- Expected: 1
```
### Full Verification Query
```sql
SELECT
    count() AS total_tables
FROM system.tables
WHERE database = 'mabase_prod'
  AND name IN (
    'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
    'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks',
    'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules',
    'audit_logs', 'bot_ip', 'bot_ja4'
  );
-- Expected: 14
```
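A count lower than 14 does not say which tables are absent. The sketch below lists them by name; `missing_tables` is a hypothetical helper that reads the existing table names on standard input, and its expected list is copied from the query above.

```bash
# Hypothetical helper: read existing table names on stdin (one per line) and
# print every expected table that is absent. Returns non-zero if any is missing.
missing_tables() {
  local existing name status=0
  existing="$(cat)"
  for name in http_logs_raw http_logs agg_host_ip_ja4_1h agg_header_fingerprint_1h \
              ml_detected_anomalies ml_all_scores ref_bot_networks \
              anubis_ua_rules anubis_ip_rules anubis_asn_rules anubis_country_rules \
              audit_logs bot_ip bot_ja4; do
    # -x matches whole lines only; -F treats the name as a fixed string
    if ! printf '%s\n' "$existing" | grep -qxF "$name"; then
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}

# Example:
#   clickhouse-client --query \
#     "SELECT name FROM system.tables WHERE database = 'mabase_prod'" | missing_tables
```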
## Rollback Notes
### General Approach
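ClickHouse does not support transactional DDL, so a failed or unwanted migration cannot be rolled back automatically. To undo one, drop the objects it created, using the statement that matches each object type: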
1. **Tables**: `DROP TABLE IF EXISTS mabase_prod.<table_name>`
2. **Materialized Views**: `DROP VIEW IF EXISTS mabase_prod.<mv_name>` (drop MV before its target table)
3. **Dictionaries**: `DROP DICTIONARY IF EXISTS mabase_prod.<dict_name>`
4. **Views**: `DROP VIEW IF EXISTS mabase_prod.<view_name>`
5. **Users**: `DROP USER IF EXISTS <username>`
### Rollback Order (Reverse of Apply)
```sql
-- 09: Audit
DROP TABLE IF EXISTS mabase_prod.audit_logs;
-- 08: Users
DROP USER IF EXISTS data_writer;
DROP USER IF EXISTS analyst;
-- 07: AI Features View
DROP VIEW IF EXISTS mabase_prod.view_ai_features_1h;
-- 06: ML Tables
DROP VIEW IF EXISTS mabase_prod.view_ip_recurrence;
DROP TABLE IF EXISTS mabase_prod.ml_all_scores;
DROP TABLE IF EXISTS mabase_prod.ml_detected_anomalies;
-- 05: Aggregation
DROP VIEW IF EXISTS mabase_prod.mv_agg_header_fingerprint_1h;
DROP VIEW IF EXISTS mabase_prod.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS mabase_prod.agg_header_fingerprint_1h;
DROP TABLE IF EXISTS mabase_prod.agg_host_ip_ja4_1h;
DROP DICTIONARY IF EXISTS mabase_prod.dict_asn_reputation;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ja4;
DROP DICTIONARY IF EXISTS mabase_prod.dict_bot_ip;
-- 04: MV + http_logs
DROP VIEW IF EXISTS mabase_prod.mv_http_logs;
DROP TABLE IF EXISTS mabase_prod.http_logs;
-- 03: Anubis
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_country;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_asn;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ip;
DROP DICTIONARY IF EXISTS mabase_prod.dict_anubis_ua;
DROP TABLE IF EXISTS mabase_prod.anubis_country_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_asn_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ip_rules;
DROP TABLE IF EXISTS mabase_prod.anubis_ua_rules;
-- 02: Dictionaries
DROP DICTIONARY IF EXISTS mabase_prod.dict_iplocate_asn;
DROP TABLE IF EXISTS mabase_prod.bot_ja4;
DROP TABLE IF EXISTS mabase_prod.bot_ip;
DROP TABLE IF EXISTS mabase_prod.ref_bot_networks;
-- 01: Raw Tables
DROP TABLE IF EXISTS mabase_prod.http_logs_raw;
-- 00: Database
DROP DATABASE IF EXISTS mabase_prod;
```
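For a partial rollback, only the sections from the latest migration down to a chosen one are needed. As a sketch, assuming the block above has been saved to a file (here called `rollback.sql`, a hypothetical name) with its `-- NN:` section comments intact, the hypothetical helper `rollback_down_to` prints just those sections:

```bash
# Hypothetical helper: from a rollback script whose sections start with
# "-- NN: ..." comments in descending order, print only the sections whose
# number is greater than or equal to the given target.
rollback_down_to() {
  local target="$1" file="$2"
  awk -v target="$target" '
    # A section header sets the keep flag for all lines up to the next header.
    /^-- [0-9][0-9]:/ { keep = (substr($2, 1, 2) + 0 >= target + 0) }
    keep { print }
  ' "$file"
}

# Example: undo migrations 09 through 06, leaving 00 to 05 in place
#   rollback_down_to 06 rollback.sql | clickhouse-client --multiquery
```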
### Important Notes
- **Data loss**: Dropping tables destroys all data. Always back up before rollback.
- **MV dependency**: Materialized views must be dropped before their target tables.
- **Dictionary dependency**: Views and materialized views that reference a dictionary will fail once that dictionary is dropped, so drop the dependent views first.
- **Idempotent re-apply**: After rollback, migrations can be safely re-applied since they use `IF NOT EXISTS`.
- **`04_mv_http_logs.sql`** is the canonical version of the MV, superseding any base version in `services/correlator/sql/init.sql`.