ja4-platform/docs/database/migrations.md
toto 9f3e0621e5 feat: split ClickHouse into dual configurable databases (ja4_logs / ja4_processing)
Architecture:
- ja4_logs: raw log ingestion (http_logs_raw, http_logs, mv_http_logs)
- ja4_processing: analytics, aggregation, ML, dictionaries, audit

Configuration (env vars):
- CLICKHOUSE_DB_LOGS (default: ja4_logs)
- CLICKHOUSE_DB_PROCESSING (default: ja4_processing)

Changes:
- SQL migrations (10 files): all mabase_prod refs → ja4_logs or ja4_processing
  with correct cross-database references (MVs, views, dicts)
- deploy_schema.sh: substitutes DB names from env vars at deploy time
- Python shared settings: added CLICKHOUSE_DB_LOGS + CLICKHOUSE_DB_PROCESSING
- Dashboard routes (19 files): replaced ~80 hardcoded mabase_prod refs
  with settings.CLICKHOUSE_DB_LOGS / settings.CLICKHOUSE_DB_PROCESSING
- Bot-detector: DB → CLICKHOUSE_DB_PROCESSING, fetch_rules.py configurable
- Correlator: DSN example updated to ja4_logs
- Docker-compose + .env files: new env vars with defaults
- All documentation updated (14 markdown files)

All tests pass: sentinel 10/10, correlator 67.1%, bot-detector 11, dashboard 20, ja4_common 18

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:10:35 +02:00


Database Migrations

The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in shared/clickhouse/. Migrations are idempotent (using IF NOT EXISTS / IF EXISTS) and must be applied in numeric order.
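Because the apply order depends entirely on the numeric prefixes, a quick check that the files are numbered contiguously can catch a missing or duplicated migration before anything is applied. The following is a minimal sketch; check_migration_order is a hypothetical helper that is not part of the repository, and it assumes two-digit prefixes starting at 00.

```shell
# Sketch: confirm that migration files are numbered contiguously from 00.
# check_migration_order is a hypothetical helper, not part of the repository.
check_migration_order() {
    local dir="${1:-shared/clickhouse}"
    local expected=0 f base prefix n
    # Glob expansion is sorted, so zero-padded prefixes come out in numeric order.
    for f in "$dir"/[0-9][0-9]_*.sql; do
        [ -e "$f" ] || { echo "no migration files in $dir" >&2; return 1; }
        base="${f##*/}"
        prefix="${base%%_*}"
        # Strip one leading zero so "08" is not read as an invalid octal number.
        n="${prefix#0}"
        if [ "$n" -ne "$expected" ]; then
            printf 'gap or duplicate before %s (expected prefix %02d)\n' "$base" "$expected" >&2
            return 1
        fi
        expected=$((expected + 1))
    done
    echo "ok: $expected migrations in order"
}
```

Calling check_migration_order shared/clickhouse returns a non-zero status and names the first file whose prefix is out of sequence.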

Migration Order

| File | Purpose |
|------|---------|
| 00_database.sql | Creates the ja4_processing database |
| 01_raw_tables.sql | Creates http_logs_raw ingest table (MergeTree, 1-day TTL) |
| 02_dictionaries.sql | Creates ASN geo dictionary (dict_iplocate_asn), bot IP/JA4 reference tables, ref_bot_networks |
| 03_anubis_tables.sql | Creates Anubis crawler rule tables (anubis_ua_rules, anubis_ip_rules, anubis_asn_rules, anubis_country_rules) and their dictionaries (dict_anubis_ua, dict_anubis_ip, dict_anubis_asn, dict_anubis_country) |
| 04_mv_http_logs.sql | Creates the canonical http_logs table and mv_http_logs materialized view with full Anubis enrichment |
| 05_aggregation_tables.sql | Creates reputation dictionaries (dict_bot_ip, dict_bot_ja4, dict_asn_reputation), behavioral aggregation tables (agg_host_ip_ja4_1h, agg_header_fingerprint_1h), and their materialized views |
| 06_ml_tables.sql | Creates ML output tables (ml_detected_anomalies, ml_all_scores) and view_ip_recurrence |
| 07_ai_features_view.sql | Creates view_ai_features_1h, the 50+ feature view used by bot-detector |
| 08_users.sql | Creates ClickHouse users (data_writer, analyst) and grants permissions |
| 09_audit_table.sql | Creates audit_logs table for SOC dashboard audit trail |

Prerequisites

1. ClickHouse Server

A running ClickHouse server (version 23.8+ recommended for REGEXP_TREE dictionary support).

2. CSV Data Files

Place the following files in /var/lib/clickhouse/user_files/:

| File | Source | Description |
|------|--------|-------------|
| iplocate-ip-to-asn.csv | IPLocate | IP-to-ASN mapping with country, org, domain |
| bot_ip.csv | Custom | Known bot IP prefixes (CIDR format) |
| bot_ja4.csv | Custom | Known bot JA4 fingerprints |
| asn_reputation.csv | Custom | ASN reputation labels (human, bot, unknown) |
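A preflight check can confirm that all four CSV files are in place before the dictionary migrations run. This is a sketch only; check_csv_files is a hypothetical helper, and the default directory is the path named above.

```shell
# Sketch: report which of the required CSV files are missing or unreadable.
# check_csv_files is a hypothetical helper, not part of the repository.
check_csv_files() {
    local dir="${1:-/var/lib/clickhouse/user_files}"
    local missing=0 f
    for f in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv asn_reputation.csv; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
}
```

The function checks readability for the current user only. The ClickHouse server process also needs read access, which this sketch does not verify.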

3. Anubis Passwords

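Migration 03_anubis_tables.sql contains placeholder passwords (CHANGE_ME) for the Anubis dictionaries. Replace these with the actual ClickHouse admin password before applying. The command below edits the file in place and assumes the password contains no /, & or \ characters, which sed would interpret; take care not to commit the modified file: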

sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql
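As an alternative to editing the file in place, the substitution can be done on the fly so the real password never lands in the working tree. The sketch below is one possible approach, not the project's own tooling: render_anubis_migration and the ANUBIS_DICT_PASSWORD variable are hypothetical names. It escapes the characters that sed treats specially, but it does not handle single quotes or newlines in the password, which would still break the SQL string literal.

```shell
# Sketch: print a substituted copy of the migration instead of editing it in place.
# render_anubis_migration and ANUBIS_DICT_PASSWORD are hypothetical names.
render_anubis_migration() {
    local src="${1:-03_anubis_tables.sql}"
    local escaped
    if [ -z "${ANUBIS_DICT_PASSWORD:-}" ]; then
        echo "ANUBIS_DICT_PASSWORD is not set" >&2
        return 1
    fi
    # Escape backslash, the | delimiter and & so sed inserts the password literally.
    escaped=$(printf '%s' "$ANUBIS_DICT_PASSWORD" | sed -e 's/[\\|&]/\\&/g')
    sed "s|CHANGE_ME|$escaped|g" "$src"
}
```

The output can be piped straight into the client, for example render_anubis_migration 03_anubis_tables.sql | clickhouse-client --multiquery, leaving the file on disk unchanged.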

How to Apply

Full Initial Setup

Apply all migrations in order:

cd shared/clickhouse/

clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql

With Authentication

clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql
# ... repeat for each file
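Rather than repeating the command by hand for each file, the authenticated call can be wrapped in a loop that stops at the first failure. This is a sketch under stated assumptions: CH_USER, CH_PASSWORD and CH_CLIENT are hypothetical variable names, with CH_CLIENT allowing a stand-in command for dry runs. Passing the password as a command-line argument exposes it to other local users through the process list, so treat this as suitable for controlled environments only.

```shell
# Sketch: apply every migration with credentials, stopping at the first failure.
# CH_USER, CH_PASSWORD and CH_CLIENT are hypothetical variable names.
apply_migrations_with_auth() {
    local dir="${1:-shared/clickhouse}"
    local client="${CH_CLIENT:-clickhouse-client}"
    local f
    for f in "$dir"/0*.sql; do
        echo "Applying ${f##*/}..."
        "$client" --user "${CH_USER:-admin}" --password "$CH_PASSWORD" --multiquery < "$f" || {
            echo "Failed on ${f##*/}; later migrations were not applied" >&2
            return 1
        }
    done
}
```

With CH_PASSWORD set in the environment, apply_migrations_with_auth shared/clickhouse applies the files in glob order and returns a non-zero status as soon as one of them fails.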

One-Liner (All at Once)

cd shared/clickhouse/
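for f in 0*.sql; do
    echo "Applying $f..."
    # Stop at the first failure so later migrations never run against a partial schema.
    clickhouse-client --multiquery < "$f" || { echo "Failed on $f" >&2; break; }
done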

How to Verify

After applying all migrations, run these queries to verify each migration was successful:

00 — Database

SHOW DATABASES LIKE 'ja4_%';
-- Expected: ja4_logs and ja4_processing

01 — Raw Tables

EXISTS ja4_logs.http_logs_raw;
-- Expected: 1

02 — Dictionaries

SELECT dictGetOrDefault('ja4_processing.dict_iplocate_asn', 'country_code',
       toIPv6(toIPv4('8.8.8.8')), 'MISSING');
-- Expected: US (if CSV loaded) or MISSING

03 — Anubis Tables

EXISTS ja4_processing.anubis_ua_rules;
EXISTS ja4_processing.anubis_ip_rules;
EXISTS ja4_processing.anubis_asn_rules;
EXISTS ja4_processing.anubis_country_rules;
-- Expected: 1 for each

04 — MV + http_logs

EXISTS ja4_logs.http_logs;
SELECT name FROM system.tables WHERE database = 'ja4_logs' AND name = 'mv_http_logs';
-- Expected: mv_http_logs

05 — Aggregation Tables

EXISTS ja4_processing.agg_host_ip_ja4_1h;
EXISTS ja4_processing.agg_header_fingerprint_1h;
SELECT name FROM system.dictionaries WHERE database = 'ja4_processing' AND name = 'dict_bot_ip';
-- Expected: dict_bot_ip

06 — ML Tables

EXISTS ja4_processing.ml_detected_anomalies;
EXISTS ja4_processing.ml_all_scores;
SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name LIKE 'view_ip%';
-- Expected: view_ip_recurrence

07 — AI Features View

SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_ai_features_1h';
-- Expected: view_ai_features_1h

08 — Users

SHOW GRANTS FOR data_writer;
-- Expected: GRANT INSERT, SELECT ON ja4_logs.http_logs_raw TO data_writer
SHOW GRANTS FOR analyst;
-- Expected: GRANT SELECT ON multiple tables

09 — Audit Table

EXISTS ja4_processing.audit_logs;
-- Expected: 1

Full Verification Query

SELECT
    count() AS total_tables
FROM system.tables
WHERE database IN ('ja4_logs', 'ja4_processing')  -- http_logs_raw and http_logs live in ja4_logs
  AND name IN (
    'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
    'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks',
    'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules',
    'audit_logs', 'bot_ip', 'bot_ja4'
  );
-- Expected: 14
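A single count shows that something is missing but not which table. The sketch below runs EXISTS for each fully qualified name and lists the ones that are absent. check_tables and CH_CLIENT are hypothetical names; the sketch assumes clickhouse-client can connect without extra flags.

```shell
# Sketch: run EXISTS for each db.table name and report any that are missing.
# check_tables and CH_CLIENT are hypothetical names.
check_tables() {
    local client="${CH_CLIENT:-clickhouse-client}"
    local missing=0 t
    for t in "$@"; do
        # EXISTS prints 1 when the table is present and 0 otherwise.
        if [ "$("$client" --query "EXISTS TABLE $t")" != "1" ]; then
            echo "missing: $t" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every table was found.
    [ "$missing" -eq 0 ]
}
```

For example, check_tables ja4_logs.http_logs_raw ja4_logs.http_logs ja4_processing.audit_logs returns a non-zero status and prints one line per missing table.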

Rollback Notes

General Approach

ClickHouse does not support transactional DDL. To roll back a migration:

  1. Tables: DROP TABLE IF EXISTS <database>.<table_name>
  2. Materialized Views: DROP VIEW IF EXISTS <database>.<mv_name> (drop MV before its target table)
  3. Dictionaries: DROP DICTIONARY IF EXISTS <database>.<dict_name>
  4. Views: DROP VIEW IF EXISTS <database>.<view_name>
  5. Users: DROP USER IF EXISTS <username>

Rollback Order (Reverse of Apply)

-- 09: Audit
DROP TABLE IF EXISTS ja4_processing.audit_logs;

-- 08: Users
DROP USER IF EXISTS data_writer;
DROP USER IF EXISTS analyst;

-- 07: AI Features View
DROP VIEW IF EXISTS ja4_processing.view_ai_features_1h;

-- 06: ML Tables
DROP VIEW IF EXISTS ja4_processing.view_ip_recurrence;
DROP TABLE IF EXISTS ja4_processing.ml_all_scores;
DROP TABLE IF EXISTS ja4_processing.ml_detected_anomalies;

-- 05: Aggregation
DROP VIEW IF EXISTS ja4_processing.mv_agg_header_fingerprint_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS ja4_processing.agg_header_fingerprint_1h;
DROP TABLE IF EXISTS ja4_processing.agg_host_ip_ja4_1h;
DROP DICTIONARY IF EXISTS ja4_processing.dict_asn_reputation;
DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ja4;
DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ip;

-- 04: MV + http_logs
DROP VIEW IF EXISTS ja4_logs.mv_http_logs;
DROP TABLE IF EXISTS ja4_logs.http_logs;

-- 03: Anubis
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_country;
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_asn;
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ip;
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ua;
DROP TABLE IF EXISTS ja4_processing.anubis_country_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_asn_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_ip_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_ua_rules;

-- 02: Dictionaries
DROP DICTIONARY IF EXISTS ja4_processing.dict_iplocate_asn;
DROP TABLE IF EXISTS ja4_processing.bot_ja4;
DROP TABLE IF EXISTS ja4_processing.bot_ip;
DROP TABLE IF EXISTS ja4_processing.ref_bot_networks;

-- 01: Raw Tables
DROP TABLE IF EXISTS ja4_logs.http_logs_raw;

-- 00: Database
DROP DATABASE IF EXISTS ja4_processing;
DROP DATABASE IF EXISTS ja4_logs;

Important Notes

  • Data loss: Dropping tables destroys all data. Always back up before rollback.
  • MV dependency: Materialized views must be dropped before their target tables.
  • Dictionary dependency: Views/MVs using dictionaries will fail if dictionaries are dropped while they still reference them.
  • Idempotent re-apply: After rollback, migrations can be safely re-applied since they use IF NOT EXISTS.
  • 04_mv_http_logs.sql is the canonical version of the MV, superseding any base version in services/correlator/sql/init.sql.