Problem: every dashboard query with WHERE detected_at >= now() - INTERVAL N
did a full scan, because ml_detected_anomalies was defined with ORDER BY (src_ip)
and had neither a partition key nor a time-based index.
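
One way to check how much of the table a time-filtered query reads is `EXPLAIN indexes = 1`. The sketch below is only illustrative: the 24-hour window and the `count()` aggregate are assumptions, not taken from the dashboard code. Only the table name and the `detected_at` column come from this message.

```sql
-- Hypothetical dashboard-style query. The 24 HOUR window and count() are
-- illustrative assumptions; only the table and column names are from the commit.
EXPLAIN indexes = 1
SELECT count()
FROM ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 24 HOUR;
```

The output lists each index that was considered, with the number of parts and granules kept out of the total. With a sort key of `(src_ip)` alone and no partition key, a filter on `detected_at` has nothing to prune against.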
Changes:
- 06_ml_tables.sql:
  * ml_detected_anomalies: PARTITION BY toYYYYMMDD(detected_at)
    → daily partition pruning on every time-filtered query
  * INDEX idx_detected_at (minmax) → skips granules outside the time range
  * INDEX idx_threat_level set(8) → skips granules for countIf(threat_level = ...)
  * INDEX idx_bot_name bloom_filter → skips granules for bot_name != ''
  * ttl_only_drop_parts = 1 → TTL drops whole expired parts (one day of data
    at a time with daily partitions) instead of rewriting them
  * ml_all_scores: same treatment (PARTITION BY + 2 indexes)
- 04_mv_http_logs.sql:
  * http_logs: INDEX idx_src_ip bloom_filter(0.01)
    → queries with WHERE src_ip = X (analysis.py, variability.py) skip
    ~90% of granules instead of scanning the whole time range
  * INDEX idx_ja4 bloom_filter(0.01) → same for JA4 filters
- 05_aggregation_tables.sql:
  * agg_host_ip_ja4_1h: PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...)
    → investigation_summary.py and rotation.py (WHERE src_ip = X) use the
    projection automatically instead of scanning every window_start
- 10_perf_indexes.sql (new):
  * ALTER TABLE migration for existing instances
  * ADD INDEX + MATERIALIZE INDEX for the 4 tables
  * ADD PROJECTION + MATERIALIZE PROJECTION for agg_host_ip_ja4_1h
  * Note: PARTITION BY on an existing table requires recreating it (documented)
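
The ADD + MATERIALIZE pairs listed for 10_perf_indexes.sql could look roughly like the sketch below. It is not the file's actual content: the `GRANULARITY` values are assumptions (the message does not state them), and the projection's sort key stops at `window_start` because the remaining columns are elided above.

```sql
-- Sketch of an ALTER-based migration for an existing instance.
-- GRANULARITY 4 is an assumed value, not taken from the commit.

-- ADD INDEX only applies to parts written afterwards; MATERIALIZE INDEX
-- builds the index for parts already on disk (it runs as a mutation).
ALTER TABLE http_logs
    ADD INDEX IF NOT EXISTS idx_src_ip src_ip TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE http_logs MATERIALIZE INDEX idx_src_ip;

-- Same two-step pattern for the projection. The commit lists more sort
-- columns after window_start; they are elided there, so they are omitted here.
ALTER TABLE agg_host_ip_ja4_1h
    ADD PROJECTION IF NOT EXISTS proj_by_ip
    (SELECT * ORDER BY src_ip, window_start);
ALTER TABLE agg_host_ip_ja4_1h MATERIALIZE PROJECTION proj_by_ip;
```

Materialization runs in the background, so queries may keep reading old parts without the index or projection until the mutation finishes. Depending on the table engine and the ClickHouse version, adding a projection to a non-plain MergeTree table may need additional table settings.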
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ClickHouse Migrations — ja4-platform
Migration Order
Apply these files in numeric order against the ClickHouse server:
clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
File Descriptions
| File | Contents |
|---|---|
| `00_database.sql` | `CREATE DATABASE` |
| `01_raw_tables.sql` | `http_logs_raw` ingest table |
| `02_dictionaries.sql` | ASN geo dict, bot IP/JA4/network reference tables |
| `03_anubis_tables.sql` | Anubis crawler rule tables and dictionaries (UA, IP, ASN, country) |
| `04_mv_http_logs.sql` | Canonical `http_logs` target table + `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | `agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h` + their MVs |
| `06_ml_tables.sql` | `ml_detected_anomalies`, `ml_all_scores` |
| `07_ai_features_view.sql` | `view_ai_features_1h` with Anubis enrichment |
| `08_users.sql` | ClickHouse users and grants |
| `09_audit_table.sql` | `audit_logs` table for SOC dashboard audit trail |
Prerequisites
Place CSV data files in /var/lib/clickhouse/user_files/:
- `iplocate-ip-to-asn.csv`: IP-to-ASN mapping (from IPLocate)
- `bot_ip.csv`: Known bot IP prefixes
- `bot_ja4.csv`: Known bot JA4 fingerprints
- `asn_reputation.csv`: ASN reputation labels
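
Before running the migrations, it can help to confirm that the server can read a prerequisite file. The query below is a hedged sketch: the `file()` table function resolves relative paths against the server's `user_files` directory, but the `CSV` format name is an assumption about the file layout (the file may have a header row, for instance), and relying on schema inference assumes a reasonably recent ClickHouse version.

```sql
-- Sanity check that a prerequisite file is visible to the server.
-- The 'CSV' format is an assumption about how bot_ip.csv is laid out.
SELECT *
FROM file('bot_ip.csv', 'CSV')
LIMIT 5;
```

If the file is missing or placed outside `user_files`, the query fails with an error rather than returning rows.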
Notes
- `04_mv_http_logs.sql` is the canonical version of the MV, superseding the base version in `services/correlator/sql/init.sql`. It includes full Anubis enrichment.
- All migrations are idempotent (use `IF NOT EXISTS` / `IF EXISTS`).
- Anubis dictionary passwords in `03_anubis_tables.sql` must be changed before production use.
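
As an illustration of the idempotent pattern mentioned above, the statements below can be run any number of times without error. All object names (`example_db`, `example_table`, `example_view`) are hypothetical and do not exist in this repository.

```sql
-- Idempotent DDL pattern. All names here are hypothetical placeholders.
CREATE DATABASE IF NOT EXISTS example_db;

CREATE TABLE IF NOT EXISTS example_db.example_table
(
    id UInt64
)
ENGINE = MergeTree
ORDER BY id;

-- Dropping is guarded the same way, so a missing object is not an error.
DROP VIEW IF EXISTS example_db.example_view;
```

Note that `IF NOT EXISTS` only skips creation. It does not update a table that already exists with an older definition, so schema changes on a live instance still need explicit `ALTER TABLE` statements.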