toto f4ffe3410a perf(clickhouse): P1 - partition + skipping indexes on ml_detected_anomalies, http_logs, agg_host_ip_ja4_1h
Problem: every dashboard query with WHERE detected_at >= now() - INTERVAL N
did a full scan because ml_detected_anomalies had ORDER BY (src_ip) with
no partitioning and no time-based index.

Changes:
- 06_ml_tables.sql:
  * ml_detected_anomalies: PARTITION BY toYYYYMMDD(detected_at)
    → daily partition pruning on all time-range queries
  * INDEX idx_detected_at (minmax) → skips granules outside the time range
  * INDEX idx_threat_level set(8) → skips granules for countIf(threat_level = ...)
  * INDEX idx_bot_name bloom_filter → skips granules for bot_name != ''
  * ttl_only_drop_parts = 1 → TTL removes whole expired parts instead of rewriting them
  * ml_all_scores: same treatment (PARTITION BY + 2 indexes)

- 04_mv_http_logs.sql:
  * http_logs: INDEX idx_src_ip bloom_filter(0.01)
    → queries with WHERE src_ip = X (analysis.py, variability.py) skip
    ~90% of granules instead of scanning the whole time range
  * INDEX idx_ja4 bloom_filter(0.01) → same for JA4 filters

- 05_aggregation_tables.sql:
  * agg_host_ip_ja4_1h: PROJECTION proj_by_ip ORDER BY (src_ip, window_start, ...)
    → investigation_summary.py and rotation.py (WHERE src_ip = X) automatically
    use the projection instead of scanning every window_start

- 10_perf_indexes.sql (new):
  * ALTER TABLE migration for existing instances
  * ADD INDEX + MATERIALIZE INDEX for the 4 tables
  * ADD PROJECTION + MATERIALIZE PROJECTION for agg_host_ip_ja4_1h
  * Note: PARTITION BY on an existing table requires recreating the table (documented)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 22:28:04 +02:00

ClickHouse Migrations — ja4-platform

Migration Order

Apply these files in numeric order against the ClickHouse server:

clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
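The same sequence can be scripted rather than typed file by file. The following is a minimal sketch, not part of the repository: the function names and the CLICKHOUSE_CLIENT override are hypothetical, and it assumes every migration file starts with a two-digit prefix as in the list above. It only defines functions; nothing runs until apply_migrations is called.

```shell
# Sketch: apply every NN_*.sql migration in prefix order, stopping at the first failure.
# list_migrations, apply_migrations and CLICKHOUSE_CLIENT are illustrative names.

# Print the migration files found in directory $1 (default: .), one per line, sorted.
list_migrations() {
    local f
    for f in "${1:-.}"/[0-9][0-9]_*.sql; do
        # An unmatched glob stays literal, so skip names that do not exist.
        if [ -e "$f" ]; then printf '%s\n' "$f"; fi
    done | LC_ALL=C sort
}

# Feed each file to the client; return non-zero as soon as one migration fails.
apply_migrations() {
    local client="${CLICKHOUSE_CLIENT:-clickhouse-client}"
    list_migrations "${1:-.}" | while IFS= read -r f; do
        echo "Applying $f"
        "$client" --multiquery < "$f" || { echo "Failed on $f" >&2; exit 1; }
    done
}

# Usage (from the directory that holds the .sql files):
#   apply_migrations .
```

Stopping at the first failure matters here because later files depend on objects created by earlier ones.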

File Descriptions

File                       Contents
00_database.sql            CREATE DATABASE
01_raw_tables.sql          http_logs_raw ingest table
02_dictionaries.sql        ASN geo dict, bot IP/JA4/network reference tables
03_anubis_tables.sql       Anubis crawler rule tables and dictionaries (UA, IP, ASN, country)
04_mv_http_logs.sql        Canonical http_logs target table + mv_http_logs materialized view with full Anubis enrichment
05_aggregation_tables.sql  agg_host_ip_ja4_1h, agg_header_fingerprint_1h + their MVs
06_ml_tables.sql           ml_detected_anomalies, ml_all_scores
07_ai_features_view.sql    view_ai_features_1h with Anubis enrichment
08_users.sql               ClickHouse users and grants
09_audit_table.sql         audit_logs table for SOC dashboard audit trail

Prerequisites

Place CSV data files in /var/lib/clickhouse/user_files/:

  • iplocate-ip-to-asn.csv — IP-to-ASN mapping (from IPLocate)
  • bot_ip.csv — Known bot IP prefixes
  • bot_ja4.csv — Known bot JA4 fingerprints
  • asn_reputation.csv — ASN reputation labels
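
A quick pre-flight check can catch a missing CSV before any migration runs. This is a hedged sketch: check_prereqs is a hypothetical helper, and it only tests readability for the user running the script, which may differ from the account the ClickHouse server runs as.

```shell
# Sketch: confirm the four prerequisite CSV files are present and readable.
# check_prereqs is an illustrative name; the directory defaults to the path given above.
check_prereqs() {
    local dir="${1:-/var/lib/clickhouse/user_files}"
    local missing=0
    local name
    for name in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv asn_reputation.csv; do
        if [ ! -r "$dir/$name" ]; then
            echo "missing or unreadable: $dir/$name" >&2
            missing=1
        fi
    done
    # 0 when every file was found, 1 when at least one is missing.
    return "$missing"
}

# Usage:
#   check_prereqs || exit 1
```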

Notes

  • 04_mv_http_logs.sql is the canonical version of the MV, superseding the base version in services/correlator/sql/init.sql. It includes full Anubis enrichment.
  • All migrations are idempotent (use IF NOT EXISTS / IF EXISTS).
  • Anubis dictionary passwords in 03_anubis_tables.sql must be changed before production use.
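
The idempotency note above can be spot-checked with a rough lint. The sketch below is an assumption-laden heuristic, not a repository tool: find_non_idempotent is a hypothetical name, and it only inspects lines that begin with CREATE or DROP, so a guard placed on a following line would be reported as a false positive.

```shell
# Sketch: list CREATE/DROP statements that lack an IF [NOT] EXISTS or OR REPLACE guard.
# find_non_idempotent is an illustrative name; matching is line-based and approximate.
# Exit status follows grep: 0 if suspicious lines were printed, 1 if none were found.
find_non_idempotent() {
    grep -HniE '^[[:space:]]*(CREATE|DROP)[[:space:]]' "$@" \
        | grep -viE 'IF[[:space:]]+(NOT[[:space:]]+)?EXISTS|OR[[:space:]]+REPLACE'
}

# Usage:
#   find_non_idempotent ./*.sql
```

Each reported line has the form file:line:statement, which makes it easy to jump to the statement that needs a guard.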