From 8f5e771096695206e1e6e18bfa5cd3389f17d1dd Mon Sep 17 00:00:00 2001 From: toto Date: Thu, 9 Apr 2026 22:03:37 +0200 Subject: [PATCH] =?UTF-8?q?docs:=20r=C3=A9=C3=A9criture=20compl=C3=A8te=20?= =?UTF-8?q?de=20la=20documentation=20base=20de=20donn=C3=A9es=20en=20fran?= =?UTF-8?q?=C3=A7ais?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Réécriture des 3 fichiers de documentation de la base de données ClickHouse : - docs/database/schema.md : couverture complète des 2 bases, 14+ tables, 7 dictionnaires, 8 MVs, 8 vues, TTL, partitions, moteurs et colonnes - docs/database/migrations.md : 13 fichiers SQL (ajout 10-12), prérequis mis à jour (ClickHouse 24.8+, 5 CSV), deploy_schema.sh, init-stack.sh, vérification et rollback complets - shared/clickhouse/README.md : référence rapide des 13 fichiers, deploy_schema.sh, patron double-base, prérequis Suppression des références obsolètes : dict_anubis_ua, dict_anubis_country, anubis_ua_rules, anubis_country_rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- docs/database/migrations.md | 378 ++++++++++++----- docs/database/schema.md | 825 +++++++++++++++++++++++++----------- shared/clickhouse/README.md | 99 +++-- 3 files changed, 897 insertions(+), 405 deletions(-) diff --git a/docs/database/migrations.md b/docs/database/migrations.md index 5e6c9de..dfdf1ee 100644 --- a/docs/database/migrations.md +++ b/docs/database/migrations.md @@ -1,256 +1,402 @@ -# Database Migrations +# Migrations de base de données -The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in `shared/clickhouse/`. Migrations are idempotent (using `IF NOT EXISTS` / `IF EXISTS`) and must be applied in numeric order. +Le schéma ClickHouse de ja4-platform est géré via 13 fichiers SQL numérotés dans +`shared/clickhouse/`. 
Toutes les migrations sont **idempotentes** (utilisation de +`IF NOT EXISTS` / `IF EXISTS` / `CREATE OR REPLACE`) et doivent être appliquées +dans l'ordre numérique. -## Migration Order +Le schéma utilise un **patron double-base** : -| File | Purpose | -|------|---------| -| `00_database.sql` | Creates the `ja4_processing` database | -| `01_raw_tables.sql` | Creates `http_logs_raw` ingest table (MergeTree, 1-day TTL) | -| `02_dictionaries.sql` | Creates ASN geo dictionary (`dict_iplocate_asn`), bot IP/JA4 reference tables, `ref_bot_networks` | -| `03_anubis_tables.sql` | Creates Anubis crawler rule tables (`anubis_ua_rules`, `anubis_ip_rules`, `anubis_asn_rules`, `anubis_country_rules`) and their dictionaries (`dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`) | -| `04_mv_http_logs.sql` | Creates the canonical `http_logs` table and `mv_http_logs` materialized view with full Anubis enrichment | -| `05_aggregation_tables.sql` | Creates reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_asn_reputation`), behavioral aggregation tables (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`), and their materialized views | -| `06_ml_tables.sql` | Creates ML output tables (`ml_detected_anomalies`, `ml_all_scores`) and `view_ip_recurrence` | -| `07_ai_features_view.sql` | Creates `view_ai_features_1h` — the 50+ feature view used by bot-detector | -| `08_users.sql` | Creates ClickHouse users (`data_writer`, `analyst`) and grants permissions | -| `09_audit_table.sql` | Creates `audit_logs` table for SOC dashboard audit trail | +| Base | Variable d'environnement | Défaut | Contenu | +|------|--------------------------|--------|---------| +| Logs | `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` | +| Processing | `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Agrégations, ML, vues, dictionnaires, audit | -## Prerequisites +--- -### 1. 
ClickHouse Server +## Ordre des migrations -A running ClickHouse server (version 23.8+ recommended for `REGEXP_TREE` dictionary support). +| Fichier | Lignes | Contenu | +|---------|--------|---------| +| `00_database.sql` | 5 | Création des bases `ja4_logs` et `ja4_processing` | +| `01_raw_tables.sql` | 16 | Table d'ingestion `http_logs_raw` (MergeTree, TTL 2h) | +| `02_dictionaries.sql` | 57 | Dictionnaire `dict_iplocate_asn` (IP_TRIE, CSV), tables `ref_bot_networks`, `bot_ip`, `bot_ja4` | +| `03_anubis_tables.sql` | 73 | Tables de règles Anubis (`anubis_ip_rules`, `anubis_asn_rules`) et dictionnaires (`dict_anubis_ip`, `dict_anubis_asn`) | +| `04_mv_http_logs.sql` | 197 | Table `http_logs` (MergeTree, TTL 30j) + vue matérialisée `mv_http_logs` (parse JSON + enrichissement Anubis COALESCE IP→ASN) | +| `05_aggregation_tables.sql` | 234 | Dictionnaires de réputation (`dict_bot_ip`, `dict_bot_ja4`, `dict_browser_ja4`, `dict_asn_reputation`), 2 tables d'agrégation (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`) + 2 vues matérialisées | +| `06_ml_tables.sql` | 144 | Tables ML (`ml_detected_anomalies`, `ml_all_scores`) + vue `view_ip_recurrence` | +| `07_ai_features_view.sql` | 156 | Vue `view_ai_features_1h` (~65+ features ML depuis les agrégations + dictionnaires) | +| `08_users.sql` | 22 | Utilisateurs `data_writer` et `analyst` avec permissions | +| `09_audit_table.sql` | 21 | Table `audit_logs` pour le journal d'audit SOC | +| `10_perf_indexes.sql` | 113 | Index secondaires et projections de performance (migration idempotente pour instances existantes) | +| `11_views.sql` | 216 | Vues dashboard (`view_dashboard_entities`, `view_dashboard_user_agents`, `view_form_bruteforce_detected`, `view_host_ip_ja4_rotation`, `view_resource_cascade_1h`) | +| `12_thesis_features.sql` | 580 | 4 tables d'agrégation thèse (`agg_path_sequences_1h`, `agg_request_timing_1h`, `agg_ip_behavior_1h`, `agg_resource_cascade_1h`) + 4 MVs + vue `view_thesis_features_1h` | -### 2. 
CSV Data Files +--- -Place the following files in `/var/lib/clickhouse/user_files/`: +## Prérequis -| File | Source | Description | -|------|--------|-------------| -| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | IP-to-ASN mapping with country, org, domain | -| `bot_ip.csv` | Custom | Known bot IP prefixes (CIDR format) | -| `bot_ja4.csv` | Custom | Known bot JA4 fingerprints | -| `asn_reputation.csv` | Custom | ASN reputation labels (`human`, `bot`, `unknown`) | +### 1. Serveur ClickHouse -### 3. Anubis Passwords +Un serveur ClickHouse en fonctionnement, **version 24.8+** requise (support des +projections AggregatingMergeTree avec `deduplicate_merge_projection_mode`). -Migration `03_anubis_tables.sql` contains placeholder passwords (`CHANGE_ME`) for the Anubis dictionaries. Replace these with the actual ClickHouse admin password before applying: +### 2. Fichiers CSV de données + +Placer les fichiers suivants dans `/var/lib/clickhouse/user_files/` : + +| Fichier | Source | Description | Entrées approx. | +|---------|--------|-------------|------------------| +| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | Correspondance IP→ASN avec pays, org, domaine | ~714K | +| `bot_ip.csv` | Personnalisé | Préfixes IP de bots connus (format CIDR) | ~3,5K | +| `bot_ja4.csv` | Personnalisé | Empreintes JA4 de bots connus | ~31 | +| `browser_ja4.csv` | Personnalisé | Empreintes JA4 de navigateurs légitimes | ~1,2K | +| `asn_reputation.csv` | Personnalisé | Labels de réputation ASN (`human`, `bot`, `unknown`) | ~82K | + +### 3. Mots de passe Anubis + +Le fichier `03_anubis_tables.sql` contient des mots de passe par défaut (`CHANGE_ME`) +pour les dictionnaires Anubis basés sur ClickHouse. 
Les remplacer avant d'appliquer : ```bash -sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql +sed -i "s/CHANGE_ME/mot_de_passe_réel/g" 03_anubis_tables.sql ``` -## How to Apply +--- -### Full Initial Setup +## Comment appliquer -Apply all migrations in order: +### Méthode recommandée : deploy_schema.sh + +Le script `deploy_schema.sh` applique les 13 fichiers dans l'ordre en substituant +automatiquement les noms de base de données : ```bash cd shared/clickhouse/ -clickhouse-client --multiquery < 00_database.sql -clickhouse-client --multiquery < 01_raw_tables.sql -clickhouse-client --multiquery < 02_dictionaries.sql -clickhouse-client --multiquery < 03_anubis_tables.sql -clickhouse-client --multiquery < 04_mv_http_logs.sql -clickhouse-client --multiquery < 05_aggregation_tables.sql -clickhouse-client --multiquery < 06_ml_tables.sql -clickhouse-client --multiquery < 07_ai_features_view.sql -clickhouse-client --multiquery < 08_users.sql -clickhouse-client --multiquery < 09_audit_table.sql +# Avec les noms de base par défaut (ja4_logs / ja4_processing) +./deploy_schema.sh + +# Avec des noms personnalisés +CLICKHOUSE_DB_LOGS=my_logs \ +CLICKHOUSE_DB_PROCESSING=my_proc \ +CLICKHOUSE_HOST=clickhouse-server \ +CLICKHOUSE_USER=admin \ +CLICKHOUSE_PASSWORD='secret' \ + ./deploy_schema.sh ``` -### With Authentication +Variables d'environnement supportées : + +| Variable | Défaut | Description | +|----------|--------|-------------| +| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Nom de la base de logs | +| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Nom de la base de traitement | +| `CLICKHOUSE_HOST` | `localhost` | Hôte ClickHouse | +| `CLICKHOUSE_PORT` | `9000` | Port natif ClickHouse | +| `CLICKHOUSE_USER` | `default` | Utilisateur ClickHouse | +| `CLICKHOUSE_PASSWORD` | (vide) | Mot de passe ClickHouse | + +### Méthode alternative : init-stack.sh + +Le script `scripts/init-stack.sh` fournit une initialisation complète incluant +le schéma, les migrations, la 
validation et le nettoyage : ```bash -clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql -# ... repeat for each file +./scripts/init-stack.sh ``` -### One-Liner (All at Once) +### Application manuelle ```bash cd shared/clickhouse/ -for f in 0*.sql; do - echo "Applying $f..." + +for f in 0*.sql 1*.sql; do + echo "Application de $f..." clickhouse-client --multiquery < "$f" done ``` -## How to Verify +Avec authentification : -After applying all migrations, run these queries to verify each migration was successful: - -### 00 — Database - -```sql -SHOW DATABASES LIKE 'ja4_processing'; --- Expected: ja4_processing +```bash +clickhouse-client --user admin --password 'secret' --multiquery < 00_database.sql +# ... répéter pour chaque fichier ``` -### 01 — Raw Tables +--- + +## Comment vérifier + +Après l'application de toutes les migrations, exécuter ces requêtes pour valider +chaque étape. + +### 00 — Bases de données + +```sql +SHOW DATABASES LIKE 'ja4%'; +-- Attendu : ja4_logs, ja4_processing +``` + +### 01 — Table brute ```sql EXISTS ja4_logs.http_logs_raw; --- Expected: 1 +-- Attendu : 1 ``` -### 02 — Dictionaries +### 02 — Dictionnaire ASN + tables de référence ```sql SELECT dictGetOrDefault('ja4_processing.dict_iplocate_asn', 'country_code', toIPv6(toIPv4('8.8.8.8')), 'MISSING'); --- Expected: US (if CSV loaded) or MISSING +-- Attendu : US (si CSV chargé) ou MISSING + +EXISTS ja4_processing.ref_bot_networks; +-- Attendu : 1 ``` -### 03 — Anubis Tables +### 03 — Tables Anubis ```sql -EXISTS ja4_processing.anubis_ua_rules; EXISTS ja4_processing.anubis_ip_rules; EXISTS ja4_processing.anubis_asn_rules; -EXISTS ja4_processing.anubis_country_rules; --- Expected: 1 for each +-- Attendu : 1 pour chacune ``` -### 04 — MV + http_logs +### 04 — http_logs + vue matérialisée ```sql EXISTS ja4_logs.http_logs; SELECT name FROM system.tables WHERE database = 'ja4_logs' AND name = 'mv_http_logs'; --- Expected: mv_http_logs +-- Attendu : 
mv_http_logs ``` -### 05 — Aggregation Tables +### 05 — Tables d'agrégation + dictionnaires de réputation ```sql EXISTS ja4_processing.agg_host_ip_ja4_1h; EXISTS ja4_processing.agg_header_fingerprint_1h; -SELECT name FROM system.dictionaries WHERE database = 'ja4_processing' AND name = 'dict_bot_ip'; --- Expected: dict_bot_ip +SELECT name FROM system.dictionaries +WHERE database = 'ja4_processing' AND name IN ('dict_bot_ip', 'dict_bot_ja4', 'dict_browser_ja4', 'dict_asn_reputation'); +-- Attendu : 4 lignes ``` -### 06 — ML Tables +### 06 — Tables ML ```sql EXISTS ja4_processing.ml_detected_anomalies; EXISTS ja4_processing.ml_all_scores; -SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name LIKE 'view_ip%'; --- Expected: view_ip_recurrence +SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_ip_recurrence'; +-- Attendu : view_ip_recurrence ``` -### 07 — AI Features View +### 07 — Vue de features AI ```sql SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_ai_features_1h'; --- Expected: view_ai_features_1h +-- Attendu : view_ai_features_1h ``` -### 08 — Users +### 08 — Utilisateurs ```sql SHOW GRANTS FOR data_writer; --- Expected: GRANT INSERT, SELECT ON ja4_logs.http_logs_raw TO data_writer +-- Attendu : GRANT INSERT, SELECT ON ja4_logs.http_logs_raw TO data_writer SHOW GRANTS FOR analyst; --- Expected: GRANT SELECT ON multiple tables +-- Attendu : GRANT SELECT sur 6 tables/vues ``` -### 09 — Audit Table +### 09 — Table d'audit ```sql EXISTS ja4_processing.audit_logs; --- Expected: 1 +-- Attendu : 1 ``` -### Full Verification Query +### 10 — Index de performance ```sql -SELECT - count() AS total_tables +SELECT name FROM system.data_skipping_indices +WHERE table = 'ml_detected_anomalies' AND database = 'ja4_processing'; +-- Attendu : idx_detected_at, idx_threat_level, idx_bot_name +``` + +### 11 — Vues dashboard + +```sql +SELECT name FROM system.tables +WHERE database = 
'ja4_processing' + AND name LIKE 'view_%' + AND engine = 'View'; +-- Attendu : ≥ 7 vues (view_ip_recurrence, view_ai_features_1h, +-- view_form_bruteforce_detected, view_host_ip_ja4_rotation, +-- view_dashboard_user_agents, view_dashboard_entities, view_resource_cascade_1h) +``` + +### 12 — Tables et vue de thèse + +```sql +EXISTS ja4_processing.agg_path_sequences_1h; +EXISTS ja4_processing.agg_request_timing_1h; +EXISTS ja4_processing.agg_ip_behavior_1h; +EXISTS ja4_processing.agg_resource_cascade_1h; +SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_thesis_features_1h'; +-- Attendu : 1 pour chaque EXISTS, view_thesis_features_1h +``` + +### Vérification complète + +```sql +-- Tables dans ja4_logs +SELECT count() AS tables_logs +FROM system.tables +WHERE database = 'ja4_logs' + AND name IN ('http_logs_raw', 'http_logs', 'mv_http_logs'); +-- Attendu : 3 + +-- Tables dans ja4_processing +SELECT count() AS tables_processing FROM system.tables WHERE database = 'ja4_processing' AND name IN ( - 'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h', - 'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks', - 'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules', - 'audit_logs', 'bot_ip', 'bot_ja4' + 'ref_bot_networks', 'bot_ip', 'bot_ja4', + 'anubis_ip_rules', 'anubis_asn_rules', + 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h', + 'agg_path_sequences_1h', 'agg_request_timing_1h', + 'agg_ip_behavior_1h', 'agg_resource_cascade_1h', + 'ml_detected_anomalies', 'ml_all_scores', 'audit_logs' ); --- Expected: 14 +-- Attendu : 14 + +-- Dictionnaires +SELECT count() AS dicts +FROM system.dictionaries +WHERE database = 'ja4_processing'; +-- Attendu : 7 + +-- Vues matérialisées dans ja4_logs +SELECT count() AS mvs_logs +FROM system.tables +WHERE database = 'ja4_logs' AND engine = 'MaterializedView'; +-- Attendu : 1 + +-- Vues matérialisées dans ja4_processing +SELECT count() AS mvs_proc 
+FROM system.tables +WHERE database = 'ja4_processing' AND engine = 'MaterializedView'; +-- Attendu : 6 ``` -## Rollback Notes +--- -### General Approach +## Rollback -ClickHouse does not support transactional DDL. To roll back a migration: +### Approche générale -1. **Tables**: `DROP TABLE IF EXISTS ja4_processing.` -2. **Materialized Views**: `DROP VIEW IF EXISTS ja4_processing.` (drop MV before its target table) -3. **Dictionaries**: `DROP DICTIONARY IF EXISTS ja4_processing.` -4. **Views**: `DROP VIEW IF EXISTS ja4_processing.` -5. **Users**: `DROP USER IF EXISTS ` +ClickHouse ne supporte pas les DDL transactionnels. Pour annuler une migration : -### Rollback Order (Reverse of Apply) +1. **Vues matérialisées** : supprimer la MV **avant** sa table cible +2. **Dictionnaires** : supprimer le dictionnaire **après** les vues/MVs qui l'utilisent +3. **Tables** : `DROP TABLE IF EXISTS` +4. **Vues** : `DROP VIEW IF EXISTS` +5. **Utilisateurs** : `DROP USER IF EXISTS` + +### Ordre de rollback (inverse de l'application) ```sql --- 09: Audit +-- 12 : Tables et vue de thèse +DROP VIEW IF EXISTS ja4_processing.view_thesis_features_1h; +DROP VIEW IF EXISTS ja4_processing.view_resource_cascade_1h; +DROP VIEW IF EXISTS ja4_processing.mv_agg_resource_cascade_1h; +DROP VIEW IF EXISTS ja4_processing.mv_agg_ip_behavior_1h; +DROP VIEW IF EXISTS ja4_processing.mv_agg_request_timing_1h; +DROP VIEW IF EXISTS ja4_processing.mv_agg_path_sequences_1h; +DROP TABLE IF EXISTS ja4_processing.agg_resource_cascade_1h; +DROP TABLE IF EXISTS ja4_processing.agg_ip_behavior_1h; +DROP TABLE IF EXISTS ja4_processing.agg_request_timing_1h; +DROP TABLE IF EXISTS ja4_processing.agg_path_sequences_1h; + +-- 11 : Vues dashboard +DROP VIEW IF EXISTS ja4_processing.view_dashboard_entities; +DROP VIEW IF EXISTS ja4_processing.view_dashboard_user_agents; +DROP VIEW IF EXISTS ja4_processing.view_host_ip_ja4_rotation; +DROP VIEW IF EXISTS ja4_processing.view_form_bruteforce_detected; + +-- 10 : Index de
performance (pas de rollback nécessaire — idempotent) + +-- 09 : Table d'audit DROP TABLE IF EXISTS ja4_processing.audit_logs; --- 08: Users +-- 08 : Utilisateurs DROP USER IF EXISTS data_writer; DROP USER IF EXISTS analyst; --- 07: AI Features View +-- 07 : Vue de features AI DROP VIEW IF EXISTS ja4_processing.view_ai_features_1h; --- 06: ML Tables +-- 06 : Tables ML DROP VIEW IF EXISTS ja4_processing.view_ip_recurrence; DROP TABLE IF EXISTS ja4_processing.ml_all_scores; DROP TABLE IF EXISTS ja4_processing.ml_detected_anomalies; --- 05: Aggregation +-- 05 : Agrégations + dictionnaires de réputation DROP VIEW IF EXISTS ja4_processing.mv_agg_header_fingerprint_1h; DROP VIEW IF EXISTS ja4_processing.mv_agg_host_ip_ja4_1h; DROP TABLE IF EXISTS ja4_processing.agg_header_fingerprint_1h; DROP TABLE IF EXISTS ja4_processing.agg_host_ip_ja4_1h; DROP DICTIONARY IF EXISTS ja4_processing.dict_asn_reputation; +DROP DICTIONARY IF EXISTS ja4_processing.dict_browser_ja4; DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ja4; DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ip; --- 04: MV + http_logs +-- 04 : MV + http_logs DROP VIEW IF EXISTS ja4_logs.mv_http_logs; DROP TABLE IF EXISTS ja4_logs.http_logs; --- 03: Anubis -DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_country; +-- 03 : Anubis DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_asn; DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ip; -DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ua; -DROP TABLE IF EXISTS ja4_processing.anubis_country_rules; DROP TABLE IF EXISTS ja4_processing.anubis_asn_rules; DROP TABLE IF EXISTS ja4_processing.anubis_ip_rules; -DROP TABLE IF EXISTS ja4_processing.anubis_ua_rules; --- 02: Dictionaries +-- 02 : Dictionnaire ASN + tables de référence DROP DICTIONARY IF EXISTS ja4_processing.dict_iplocate_asn; DROP TABLE IF EXISTS ja4_processing.bot_ja4; DROP TABLE IF EXISTS ja4_processing.bot_ip; DROP TABLE IF EXISTS ja4_processing.ref_bot_networks; --- 01: Raw Tables +-- 01 : 
Table brute DROP TABLE IF EXISTS ja4_logs.http_logs_raw; --- 00: Database +-- 00 : Bases de données DROP DATABASE IF EXISTS ja4_processing; +DROP DATABASE IF EXISTS ja4_logs; ``` -### Important Notes +### Notes importantes -- **Data loss**: Dropping tables destroys all data. Always back up before rollback. -- **MV dependency**: Materialized views must be dropped before their target tables. -- **Dictionary dependency**: Views/MVs using dictionaries will fail if dictionaries are dropped while they still reference them. -- **Idempotent re-apply**: After rollback, migrations can be safely re-applied since they use `IF NOT EXISTS`. -- **`04_mv_http_logs.sql`** is the canonical version of the MV, superseding any base version in `services/correlator/sql/init.sql`. +- **Perte de données** : la suppression d'une table détruit toutes ses données. + Toujours sauvegarder avant un rollback. +- **Dépendance MV** : les vues matérialisées doivent être supprimées **avant** + leur table cible. +- **Dépendance dictionnaire** : les vues/MVs utilisant `dictGet()` échoueront + si le dictionnaire référencé est supprimé. +- **Ré-application idempotente** : après un rollback, les migrations peuvent être + ré-appliquées sans risque grâce aux clauses `IF NOT EXISTS`. +- **`04_mv_http_logs.sql`** est la version canonique de la vue matérialisée, + remplaçant toute version antérieure dans `services/correlator/sql/init.sql`. + +--- + +## Migrations post-déploiement + +Le répertoire `services/correlator/sql/migrations/` contient des instructions +`ALTER TABLE` pour les déploiements existants. Les appliquer manuellement : + +```bash +clickhouse-client --multiquery < services/correlator/sql/migrations/.sql +``` + +Ces migrations sont distinctes du schéma de base et ne sont nécessaires que +pour mettre à jour des instances déjà en production. 
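Reviewer note on the `migrations.md` changes above: the new text says `deploy_schema.sh` applies the 13 files in order while substituting the database names, but the script itself is not part of this patch. A minimal sketch of that idea, assuming the SQL files hard-code the default names `ja4_logs` and `ja4_processing` and that plain text substitution is sufficient, might look like the following. The function names `render_migration` and `apply_all` are hypothetical, and the real script may well work differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual deploy_schema.sh from the repository.
# Assumption: the migrations hard-code the default database names.
set -eu

DB_LOGS="${CLICKHOUSE_DB_LOGS:-ja4_logs}"
DB_PROC="${CLICKHOUSE_DB_PROCESSING:-ja4_processing}"

# Read one migration on stdin and write it to stdout with the default
# database names replaced by the configured ones.
# Assumption: names are plain identifiers (no "/" or "&" that sed would
# interpret in the replacement text).
render_migration() {
  sed -e "s/ja4_logs/${DB_LOGS}/g" -e "s/ja4_processing/${DB_PROC}/g"
}

# Apply every numbered migration found in directory $1, in glob order
# (00..09, then 10..12). With `set -e`, a failing clickhouse-client call
# stops the run at that file.
apply_all() {
  for f in "$1"/0*.sql "$1"/1*.sql; do
    [ -e "$f" ] || continue   # skip a glob pattern that matched nothing
    echo "Applying $f..."
    render_migration < "$f" | clickhouse-client \
      --host "${CLICKHOUSE_HOST:-localhost}" \
      --port "${CLICKHOUSE_PORT:-9000}" \
      --user "${CLICKHOUSE_USER:-default}" \
      --password "${CLICKHOUSE_PASSWORD:-}" \
      --multiquery
  done
}
```

Under these assumptions, running `apply_all shared/clickhouse` with `CLICKHOUSE_DB_LOGS=my_logs` would send statements that reference `my_logs.http_logs_raw` rather than the default name. The substitution is deliberately naive: it would also rewrite the default names inside comments or string literals, which is worth checking against the real script before relying on it.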
diff --git a/docs/database/schema.md b/docs/database/schema.md index dbbca93..7d99502 100644 --- a/docs/database/schema.md +++ b/docs/database/schema.md @@ -1,334 +1,653 @@ -# Database Schema +# Schéma de base de données -The ja4-platform uses ClickHouse as its central data store with database `ja4_processing`. This document describes every table, materialized view, dictionary, and view in the schema. +La plateforme ja4-platform utilise ClickHouse comme entrepôt de données central. +Le schéma est réparti sur **deux bases de données** configurables via variables +d'environnement (`CLICKHOUSE_DB_LOGS`, `CLICKHOUSE_DB_PROCESSING`) : -## Tables +| Variable | Défaut | Rôle | +|----------|--------|------| +| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Ingestion brute + logs HTTP parsés | +| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Agrégations, ML, vues, dictionnaires, audit | + +Les vues matérialisées dans `ja4_processing` lisent depuis `ja4_logs` (références +inter-bases). Le schéma complet est défini dans 13 fichiers SQL ordonnés dans +`shared/clickhouse/` et déployé via `deploy_schema.sh`. 
+ +--- + +## Récapitulatif global + +| Catégorie | Nombre | Objets | +|-----------|--------|--------| +| **Bases de données** | 2 | `ja4_logs`, `ja4_processing` | +| **Tables** | 16 | `http_logs_raw`, `http_logs`, `ref_bot_networks`, `bot_ip`, `bot_ja4`, `anubis_ip_rules`, `anubis_asn_rules`, `agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`, `agg_path_sequences_1h`, `agg_request_timing_1h`, `agg_ip_behavior_1h`, `agg_resource_cascade_1h`, `ml_detected_anomalies`, `ml_all_scores`, `audit_logs` | +| **Dictionnaires** | 7 | `dict_iplocate_asn`, `dict_bot_ip`, `dict_bot_ja4`, `dict_browser_ja4`, `dict_asn_reputation`, `dict_anubis_ip`, `dict_anubis_asn` | +| **Vues matérialisées** | 7 | `mv_http_logs`, `mv_agg_host_ip_ja4_1h`, `mv_agg_header_fingerprint_1h`, `mv_agg_path_sequences_1h`, `mv_agg_request_timing_1h`, `mv_agg_ip_behavior_1h`, `mv_agg_resource_cascade_1h` | +| **Vues** | 8 | `view_ip_recurrence`, `view_ai_features_1h`, `view_form_bruteforce_detected`, `view_host_ip_ja4_rotation`, `view_dashboard_user_agents`, `view_dashboard_entities`, `view_resource_cascade_1h`, `view_thesis_features_1h` | + +--- + +## Rétention des données (TTL) + +| Table | TTL | Clé de partition | +|-------|-----|------------------| +| `http_logs_raw` | 2 heures | `toStartOfHour(ingest_time)` | +| `http_logs` | 30 jours | `toDate(log_date)` | +| `agg_host_ip_ja4_1h` | 7 jours | `toDate(window_start)` | +| `agg_header_fingerprint_1h` | 7 jours | `toDate(window_start)` | +| `agg_path_sequences_1h` | 7 jours | `toDate(window_start)` | +| `agg_request_timing_1h` | 7 jours | `toDate(window_start)` | +| `agg_ip_behavior_1h` | 7 jours | `toDate(window_start)` | +| `agg_resource_cascade_1h` | 7 jours | `toDate(window_start)` | +| `ml_detected_anomalies` | 7 jours | `toDate(detected_at)` | +| `ml_all_scores` | 7 jours | `toDate(window_start)` | +| `audit_logs` | 90 jours | `toDate(timestamp)` | + +Toutes les tables d'agrégation et ML utilisent `ttl_only_drop_parts = 1` pour +une expiration
efficace au niveau des partitions. + +--- + +## Tables — Base `ja4_logs` ### http_logs_raw -Raw JSON ingest table — direct target for correlator INSERTs. +Table d'ingestion brute — cible directe des INSERTs du correlator. -| Column | Type | Description | -|--------|------|-------------| -| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string | -| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) | +| Colonne | Type | Description | +|---------|------|-------------| +| `raw_json` | String CODEC(ZSTD(3)) | Log corrélé complet au format JSON | +| `ingest_time` | DateTime DEFAULT `now()` | Horodatage d'insertion | -- **Engine**: MergeTree -- **Partition by**: `toDate(ingest_time)` -- **Order by**: `ingest_time` -- **TTL**: `ingest_time + INTERVAL 1 DAY` +- **Moteur** : MergeTree +- **Partition** : `toStartOfHour(ingest_time)` +- **Tri** : `ingest_time` +- **TTL** : `ingest_time + INTERVAL 2 HOUR` +- **Settings** : `index_granularity = 8192, ttl_only_drop_parts = 1` --- ### http_logs -Parsed and enriched HTTP log table — populated by `mv_http_logs` materialized view. +Table de logs HTTP parsés et enrichis — alimentée par la vue matérialisée +`mv_http_logs`. 
-| Column | Type | Nullable | Description | -|--------|------|----------|-------------| -| `time` | DateTime | No | Request timestamp | -| `log_date` | Date | No | Date partition key (default: `toDate(time)`) | -| `src_ip` | IPv4 | No | Client source IP | -| `src_port` | UInt16 | No | Client source port | -| `dst_ip` | IPv4 | No | Server destination IP | -| `dst_port` | UInt16 | No | Server destination port | -| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) | -| `src_country_code` | LowCardinality(String) | No | Source country code | -| `src_as_name` | LowCardinality(String) | No | AS name | -| `src_org` | LowCardinality(String) | No | AS organization | -| `src_domain` | LowCardinality(String) | No | AS domain | -| `method` | LowCardinality(String) | No | HTTP method | -| `scheme` | LowCardinality(String) | No | URL scheme (http/https) | -| `host` | LowCardinality(String) | No | HTTP Host header | -| `path` | String (ZSTD(3)) | No | Request path | -| `query` | String (ZSTD(3)) | No | Query string | -| `http_version` | LowCardinality(String) | No | HTTP version | -| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) | -| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated | -| `keepalives` | UInt16 | No | Keep-alive request sequence | -| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) | -| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) | -| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier | -| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag | -| `ip_meta_id` | UInt16 | No | IP identification | -| `ip_meta_total_length` | UInt16 | No | IP total length | -| `ip_meta_ttl` | UInt8 | No | IP TTL | -| `tcp_meta_options` | LowCardinality(String) | No | TCP options list | -| `tcp_meta_window_size` | UInt32 | No | TCP window size | -| `tcp_meta_mss` | UInt16 | No | TCP MSS | -| `tcp_meta_window_scale` | UInt8 | No | TCP window scale | -| `syn_to_clienthello_ms` | Int32 | No | 
SYN-to-ClientHello timing (ms) | -| `tls_version` | LowCardinality(String) | No | TLS version | -| `tls_sni` | LowCardinality(String) | No | TLS SNI | -| `tls_alpn` | LowCardinality(String) | No | TLS ALPN | -| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint | -| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash | -| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint | -| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names | -| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header | -| `header_accept` | String (ZSTD(3)) | No | Accept header | -| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header | -| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header | -| `header_content_type` | String (ZSTD(3)) | No | Content-Type header | -| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header | -| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header | -| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header | -| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header | -| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header | -| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header | -| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header | -| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header | -| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header | -| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') | -| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') | -| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') | +| Colonne | Type | Description | +|---------|------|-------------| +| `time` | DateTime | Horodatage de la requête | +| `log_date` | Date DEFAULT `toDate(time)` | Clé de partition | +| `src_ip` | IPv4 | IP source 
du client | +| `src_port` | UInt16 | Port source | +| `dst_ip` | IPv4 | IP destination du serveur | +| `dst_port` | UInt16 | Port destination | +| `src_asn` | UInt32 | ASN source (enrichi via dict_iplocate_asn) | +| `src_country_code` | LowCardinality(String) | Code pays | +| `src_as_name` | LowCardinality(String) | Nom de l'AS | +| `src_org` | LowCardinality(String) | Organisation de l'AS | +| `src_domain` | LowCardinality(String) | Domaine de l'AS | +| `method` | LowCardinality(String) | Méthode HTTP | +| `scheme` | LowCardinality(String) | Schéma URL (http/https) | +| `host` | LowCardinality(String) | En-tête Host HTTP | +| `path` | String CODEC(ZSTD(3)) | Chemin de la requête | +| `query` | String CODEC(ZSTD(3)) | Paramètres de requête | +| `http_version` | LowCardinality(String) | Version HTTP | +| `orphan_side` | LowCardinality(String) | Côté orphelin (A, B, ou vide) | +| `correlated` | UInt8 | 1 si corrélation HTTP+TLS réussie | +| `keepalives` | UInt16 | Numéro de séquence keep-alive | +| `a_timestamp` | UInt64 | Horodatage source A (ns) | +| `b_timestamp` | UInt64 | Horodatage source B (ns) | +| `conn_id` | String CODEC(ZSTD(3)) | Identifiant de connexion TCP | +| `ip_meta_df` | UInt8 | Drapeau Don't Fragment | +| `ip_meta_id` | UInt16 | Identification IP | +| `ip_meta_total_length` | UInt16 | Longueur totale IP | +| `ip_meta_ttl` | UInt8 | TTL IP | +| `tcp_meta_options` | LowCardinality(String) | Options TCP | +| `tcp_meta_window_size` | UInt32 | Taille de fenêtre TCP | +| `tcp_meta_mss` | UInt16 | MSS TCP | +| `tcp_meta_window_scale` | UInt8 | Facteur d'échelle de fenêtre TCP | +| `syn_to_clienthello_ms` | Int32 | Délai SYN→ClientHello (ms) | +| `tls_version` | LowCardinality(String) | Version TLS | +| `tls_sni` | LowCardinality(String) | SNI TLS | +| `tls_alpn` | LowCardinality(String) | ALPN TLS | +| `ja3` | String CODEC(ZSTD(3)) | Empreinte JA3 | +| `ja3_hash` | String CODEC(ZSTD(3)) | Hash MD5 JA3 | +| `ja4` | String CODEC(ZSTD(3)) | Empreinte JA4 | 
+| `client_headers` | String CODEC(ZSTD(3)) | Noms d'en-têtes séparés par virgule | +| `header_user_agent` | String CODEC(ZSTD(3)) | En-tête User-Agent | +| `header_accept` | String CODEC(ZSTD(3)) | En-tête Accept | +| `header_accept_encoding` | String CODEC(ZSTD(3)) | En-tête Accept-Encoding | +| `header_accept_language` | String CODEC(ZSTD(3)) | En-tête Accept-Language | +| `header_content_type` | String CODEC(ZSTD(3)) | En-tête Content-Type | +| `header_x_request_id` | String CODEC(ZSTD(3)) | En-tête X-Request-Id | +| `header_x_trace_id` | String CODEC(ZSTD(3)) | En-tête X-Trace-Id | +| `header_x_forwarded_for` | String CODEC(ZSTD(3)) | En-tête X-Forwarded-For | +| `header_sec_ch_ua` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA | +| `header_sec_ch_ua_mobile` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA-Mobile | +| `header_sec_ch_ua_platform` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA-Platform | +| `header_sec_fetch_dest` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Dest | +| `header_sec_fetch_mode` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Mode | +| `header_sec_fetch_site` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Site | +| `anubis_bot_name` | LowCardinality(String) DEFAULT `''` | Nom du bot détecté par Anubis | +| `anubis_bot_action` | LowCardinality(String) DEFAULT `''` | Action Anubis | +| `anubis_bot_category` | LowCardinality(String) DEFAULT `''` | Catégorie Anubis | -- **Engine**: MergeTree -- **Partition by**: `log_date` -- **Order by**: `(time, src_ip, dst_ip, ja4)` -- **TTL**: `log_date + INTERVAL 7 DAY` +Index de saut de données : + +| Index | Type | Granularité | +|-------|------|-------------| +| `idx_src_ip` | bloom_filter(0.01) | 4 | +| `idx_ja4` | bloom_filter(0.01) | 4 | + +- **Moteur** : MergeTree +- **Partition** : `toDate(log_date)` +- **Tri** : `(time, src_ip, dst_ip, ja4)` +- **TTL** : `log_date + INTERVAL 30 DAY` +- **Settings** : `index_granularity = 8192, ttl_only_drop_parts = 1` --- -### agg_host_ip_ja4_1h +## Tables — Base 
`ja4_processing` -Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation. +### Tables Anubis -Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`. +Tables de règles pour la détection de crawlers Anubis. -- **Engine**: AggregatingMergeTree -- **Order by**: `(window_start, src_ip, ja4, host)` +| Table | Clé de tri | Colonnes | Moteur | +|-------|-----------|----------|--------| +| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id` (UInt64), `has_ua` (UInt8), `category` | ReplacingMergeTree | +| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree | ---- - -### agg_header_fingerprint_1h - -Header-level behavioral fingerprint aggregation per `(src_ip)` per hour. 
- -| Column | Type | Description | -|--------|------|-------------| -| `window_start` | DateTime | Hour window start | -| `src_ip` | IPv6 | Source IP | -| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order | -| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count | -| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence | -| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence | -| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence | -| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score | -| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch | -| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value | -| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value | - -- **Engine**: AggregatingMergeTree -- **Order by**: `(window_start, src_ip)` - ---- - -### ml_detected_anomalies - -Anomaly detections above the threat threshold. - -Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`). - -- **Engine**: ReplacingMergeTree(detected_at) -- **Order by**: `(src_ip)` -- **TTL**: `detected_at + INTERVAL 30 DAY` - ---- - -### ml_all_scores - -All ML classifications (no threshold filter) for observability. - -Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment. 
- -- **Engine**: ReplacingMergeTree(detected_at) -- **Order by**: `(window_start, src_ip, ja4, host, model_name)` -- **TTL**: `window_start + INTERVAL 3 DAY` +> **Note** : les tables `anubis_ua_rules` et `anubis_country_rules` ainsi que les +> dictionnaires `dict_anubis_ua` et `dict_anubis_country` ont été supprimés. +> L'enrichissement Anubis repose désormais sur deux niveaux : IP/CIDR → ASN. --- ### ref_bot_networks -Bot network CIDR reference table. +Table de référence des réseaux de bots connus (CIDR). -| Column | Type | Description | -|--------|------|-------------| -| `network` | IPv6CIDR | Network CIDR | -| `bot_name` | LowCardinality(String) | Bot name | -| `is_legitimate` | UInt8 | 1 = legitimate bot | -| `last_update` | DateTime | Last update timestamp | +| Colonne | Type | Description | +|---------|------|-------------| +| `network` | String | Réseau CIDR | +| `bot_name` | LowCardinality(String) | Nom du bot | +| `is_legitimate` | UInt8 | 1 = bot légitime | +| `last_update` | DateTime | Dernière mise à jour | -- **Engine**: ReplacingMergeTree(last_update) -- **Order by**: `(network, bot_name)` +- **Moteur** : ReplacingMergeTree(last_update) +- **Tri** : `(network, bot_name)` --- ### bot_ip / bot_ja4 -CSV-backed flat tables for quick bot lookups. +Tables fichier CSV pour la recherche rapide de bots. 
-- `bot_ip`: single column `ip` (String) — Engine: File(CSV, 'bot_ip.csv') -- `bot_ja4`: single column `ja4` (String) — Engine: File(CSV, 'bot_ja4.csv') +| Table | Colonne | Moteur | +|-------|---------|--------| +| `bot_ip` | `ip` (String) | File(CSV, `'bot_ip.csv'`) | +| `bot_ja4` | `ja4` (String) | File(CSV, `'bot_ja4.csv'`) | --- -### Anubis Rule Tables +### agg_host_ip_ja4_1h -| Table | Key | Columns | Engine | -|-------|-----|---------|--------| -| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree | -| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree | -| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree | -| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree | +Agrégation comportementale par `(src_ip, ja4, host)` par heure. Utilise des +colonnes `SimpleAggregateFunction` et `AggregateFunction` pour l'agrégation +incrémentale. + +**Colonnes clés** : `window_start` (DateTime), `src_ip` (IPv6), `ja4` (String), +`host` (String), `src_asn` (UInt32). 
+ +**Colonnes d'agrégation (~50)** : + +| Catégorie | Colonnes | +|-----------|----------| +| Compteurs (SimpleAggregateFunction sum) | `hits`, `count_post`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `count_assets`, `count_no_referer`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`, `count_xff`, `count_unusual_ct`, `count_non_std_port`, `count_login_post` | +| Valeurs uniques (AggregateFunction uniq) | `uniq_paths`, `uniq_query_params`, `unique_src_ports`, `unique_conn_id`, `uniq_ua`, `uniq_ja3` | +| Variances (AggregateFunction varPop) | `tcp_jitter_variance`, `total_ip_length_var`, `url_depth_variance`, `ip_df_var` | +| Moyennes (AggregateFunction avg/avgIf) | `avg_syn_ms`, `avg_ttl` | +| Variance conditionnelle (AggregateFunction varPopIf) | `ttl_var` | + +**Projection** : `proj_by_ip` → ORDER BY `(src_ip, window_start, ja4, host)` + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip, ja4, host)` +- **TTL** : `window_start + INTERVAL 7 DAY` (partition `toDate(window_start)`) +- **Settings** : `deduplicate_merge_projection_mode = 'drop'` + +--- + +### agg_header_fingerprint_1h + +Agrégation d'empreinte d'en-têtes par `(src_ip)` par heure. 
+ +| Colonne | Type | Description | +|---------|------|-------------| +| `window_start` | DateTime | Début de la fenêtre horaire | +| `src_ip` | IPv6 | IP source | +| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash de l'ordre des en-têtes | +| `header_count` | SimpleAggregateFunction(max, UInt16) | Nombre max d'en-têtes | +| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Présence Accept-Language | +| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Présence Cookie | +| `has_referer` | SimpleAggregateFunction(max, UInt8) | Présence Referer | +| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Score de conformité navigateur | +| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | Incohérence UA/Client Hints | +| `sec_ch_mobile_mismatch` | SimpleAggregateFunction(max, UInt8) | Incohérence Sec-CH-UA-Mobile | +| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Valeur Sec-Fetch-Mode | +| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Valeur Sec-Fetch-Dest | + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip)` +- **TTL** : `window_start + INTERVAL 7 DAY` (partition `toDate(window_start)`) + +--- + +### agg_path_sequences_1h (thèse §5.1) + +Entropie des séquences de chemins — transitions de Markov sur les chemins normalisés. 
+ +| Colonne | Type | Description | +|---------|------|-------------| +| `window_start` | DateTime | Début de la fenêtre | +| `src_ip` | IPv6 | IP source | +| `ja4` | LowCardinality(String) | Empreinte JA4 | +| `host` | LowCardinality(String) | Hôte cible | +| `path_sequence` | AggregateFunction(groupArray(100), Tuple(UInt32, String)) | Séquence ordonnée (timestamp, chemin) | + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip, ja4, host)` +- **Partition** : `toDate(window_start)` — **TTL** : 7 jours +- **Settings** : `ttl_only_drop_parts = 1` + +--- + +### agg_request_timing_1h (thèse §5.3) + +Cadence des requêtes — analyse du coefficient de variation et des bursts. + +| Colonne | Type | Description | +|---------|------|-------------| +| `window_start` | DateTime | Début de la fenêtre | +| `src_ip` | IPv6 | IP source | +| `ja4` | LowCardinality(String) | Empreinte JA4 | +| `host` | LowCardinality(String) | Hôte cible | +| `request_times` | AggregateFunction(groupArrayIf(500), UInt64, UInt8) | Horodatages des requêtes (filtré a_timestamp > 0) | + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip, ja4, host)` +- **Partition** : `toDate(window_start)` — **TTL** : 7 jours +- **Settings** : `ttl_only_drop_parts = 1` + +--- + +### agg_ip_behavior_1h (thèse §5.5 / §5.8) + +Dérive JA4 et comportement inter-domaines par IP. 
+ +| Colonne | Type | Description | +|---------|------|-------------| +| `window_start` | DateTime | Début de la fenêtre | +| `src_ip` | IPv6 | IP source | +| `ja4_sequence` | AggregateFunction(groupArray(200), Tuple(UInt32, String)) | Séquence temporelle (timestamp, ja4) | +| `host_hits_keys` | AggregateFunction(sumMap, Array(String), Array(UInt64)) | Distribution hôte → hits | +| `host_count` | AggregateFunction(uniq, String) | Nombre d'hôtes distincts | +| `total_hits` | SimpleAggregateFunction(sum, UInt64) | Requêtes totales | +| `first_seen` | SimpleAggregateFunction(min, DateTime) | Première observation | +| `last_seen` | SimpleAggregateFunction(max, DateTime) | Dernière observation | + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip)` +- **Partition** : `toDate(window_start)` — **TTL** : 7 jours +- **Settings** : `ttl_only_drop_parts = 1` + +--- + +### agg_resource_cascade_1h (thèse §5.4) + +Arbre de dépendances de chargement de ressources. + +| Colonne | Type | Description | +|---------|------|-------------| +| `window_start` | DateTime | Début de la fenêtre | +| `src_ip` | IPv6 | IP source | +| `ja4` | LowCardinality(String) | Empreinte JA4 | +| `host` | LowCardinality(String) | Hôte cible | +| `resource_loads` | AggregateFunction(groupArray(200), Tuple(UInt32, UInt8)) | Chargements (timestamp, is_asset) | + +- **Moteur** : AggregatingMergeTree +- **Tri** : `(window_start, src_ip, ja4, host)` +- **Partition** : `toDate(window_start)` — **TTL** : 7 jours +- **Settings** : `ttl_only_drop_parts = 1` + +--- + +### ml_detected_anomalies + +Détections d'anomalies au-dessus du seuil de menace. 
+ +**Colonnes principales** : + +| Colonne | Type | Description | +|---------|------|-------------| +| `detected_at` | DateTime | Horodatage de la détection | +| `src_ip` | IPv6 | IP source | +| `ja4` | String | Empreinte JA4 | +| `host` | String | Hôte cible | +| `bot_name` | String | Nom du bot identifié | +| `browser_family` | LowCardinality(String) DEFAULT `''` | Famille de navigateur | +| `anomaly_score` | Float32 | Score d'anomalie normalisé | +| `raw_anomaly_score` | Float32 DEFAULT `0` | Score brut avant normalisation | +| `threat_level` | String | Niveau de menace (CRITICAL, HIGH, MEDIUM, LOW) | +| `model_name` | String | Nom du modèle (Complet, Applicatif, etc.) | +| `recurrence` | UInt32 | Nombre de détections précédentes | +| `campaign_id` | Int32 DEFAULT `-1` | Identifiant de campagne HDBSCAN | +| `reason` | String | Explication de la détection | + +**Colonnes de contexte réseau** : `asn_number`, `asn_org`, `asn_detail`, +`asn_domain`, `country_code`, `asn_label` (toutes String). + +**Colonnes de features ML (~40)** : `hits`, `hit_velocity`, `fuzzing_index`, +`post_ratio`, `port_exhaustion_ratio`, `max_keepalives`, `orphan_ratio`, +`tcp_jitter_variance`, `tcp_shared_count`, `true_window_size`, `window_mss_ratio`, +`alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`, `header_count`, +`has_accept_language`, `has_cookie`, `has_referer`, `modern_browser_score`, +`is_headless`, `ua_ch_mismatch`, `header_order_shared_count`, `ip_id_zero_ratio`, +`request_size_variance`, `multiplexing_efficiency`, `mss_mobile_mismatch`, +`correlated`, `asset_ratio`, `direct_access_ratio`, `is_ua_rotating`, +`distinct_ja4_count`, `src_port_density`, `ja4_asn_concentration`, +`ja4_country_concentration`, `is_rare_ja4`, `header_order_confidence`, +`distinct_header_orders`, `temporal_entropy`, `path_diversity_ratio`, +`url_depth_variance`, `anomalous_payload_ratio`. 
+ +**Colonnes Anubis** : `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category` +(LowCardinality(String) DEFAULT `''`). + +**Index de saut** : + +| Index | Type | Granularité | +|-------|------|-------------| +| `idx_detected_at` | minmax | 4 | +| `idx_threat_level` | set(8) | 4 | +| `idx_bot_name` | bloom_filter() | 4 | + +- **Moteur** : ReplacingMergeTree(detected_at) +- **Partition** : `toYYYYMMDD(detected_at)` +- **Tri** : `(src_ip)` +- **TTL** : `detected_at + INTERVAL 7 DAY` +- **Settings** : `index_granularity = 8192, ttl_only_drop_parts = 1` + +--- + +### ml_all_scores + +Toutes les classifications ML (sans filtre de seuil) pour l'observabilité. + +| Colonne | Type | Description | +|---------|------|-------------| +| `detected_at` | DateTime | Horodatage de la détection | +| `window_start` | DateTime | Début de la fenêtre d'analyse | +| `src_ip` | IPv6 | IP source | +| `ja4` | String | Empreinte JA4 | +| `host` | String | Hôte cible | +| `bot_name` | String | Nom du bot | +| `browser_family` | LowCardinality(String) DEFAULT `''` | Famille de navigateur | +| `anomaly_score` | Float32 | Score final | +| `raw_anomaly_score` | Float32 | Score brut | +| `threat_level` | String | Niveau de menace | +| `model_name` | String | Nom du modèle | +| `correlated` | UInt8 | 1 si trafic corrélé | +| `asn_number` | String | Numéro ASN | +| `asn_org` | String | Organisation ASN | +| `country_code` | String | Code pays | +| `asn_label` | String | Label de réputation ASN | +| `hits` | UInt64 | Nombre de requêtes | +| `hit_velocity` | Float32 | Vélocité des hits | +| `fuzzing_index` | Float32 | Indice de fuzzing | +| `post_ratio` | Float32 | Ratio de requêtes POST | +| `campaign_id` | Int32 | Identifiant de campagne | +| `ae_recon_error` | Float32 DEFAULT `0` | Erreur de reconstruction autoencoder | +| `xgb_prob` | Float32 DEFAULT `0` | Probabilité XGBoost supervisé | +| `anubis_bot_name` | LowCardinality(String) DEFAULT `''` | Nom du bot Anubis | +| `anubis_bot_action` 
| LowCardinality(String) DEFAULT `''` | Action Anubis | +| `anubis_bot_category` | LowCardinality(String) DEFAULT `''` | Catégorie Anubis | + +**Index de saut** : + +| Index | Type | Granularité | +|-------|------|-------------| +| `idx_detected_at` | minmax | 4 | +| `idx_threat_level` | set(8) | 4 | + +- **Moteur** : ReplacingMergeTree(detected_at) +- **Partition** : `toYYYYMMDD(window_start)` +- **Tri** : `(window_start, src_ip, ja4, host, model_name)` +- **TTL** : `window_start + INTERVAL 7 DAY` +- **Settings** : `index_granularity = 8192, ttl_only_drop_parts = 1` --- ### audit_logs -SOC audit trail for dashboard activity. +Journal d'audit SOC pour le suivi de l'activité du dashboard. -| Column | Type | Default | Description | -|--------|------|---------|-------------| -| `timestamp` | DateTime | `now()` | Event time | -| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name | -| `action` | LowCardinality(String) | — | Action performed | -| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) | -| `entity_id` | String | `''` | Entity identifier | -| `entity_count` | UInt32 | `0` | Entity count | -| `details` | String (ZSTD(3)) | `''` | JSON details | -| `client_ip` | String | `''` | Analyst client IP | +| Colonne | Type | Défaut | Description | +|---------|------|--------|-------------| +| `timestamp` | DateTime | `now()` | Horodatage de l'événement | +| `user_name` | LowCardinality(String) | `'soc_user'` | Nom de l'analyste | +| `action` | LowCardinality(String) | — | Action effectuée | +| `entity_type` | LowCardinality(String) | `''` | Type d'entité (ip, ja4, etc.) 
| +| `entity_id` | String | `''` | Identifiant de l'entité | +| `entity_count` | UInt32 | `0` | Nombre d'entités | +| `details` | String CODEC(ZSTD(3)) | `''` | Détails en JSON | +| `client_ip` | String | `''` | IP du client analyste | -- **Engine**: MergeTree -- **Partition by**: `toDate(timestamp)` -- **Order by**: `(timestamp, user_name, action)` -- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY` +- **Moteur** : MergeTree +- **Partition** : `toDate(timestamp)` +- **Tri** : `(timestamp, user_name, action)` +- **TTL** : `toDate(timestamp) + INTERVAL 90 DAY` +- **Settings** : `index_granularity = 8192` --- -## Materialized Views +## Vues matérialisées -### mv_http_logs +### mv_http_logs (`ja4_logs`) -- **Source**: `http_logs_raw` -- **Target**: `http_logs` -- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country. +- **Source** : `ja4_logs.http_logs_raw` +- **Cible** : `ja4_logs.http_logs` +- **Transformation** : Parse le champ `raw_json` via les fonctions `JSONExtract*`. + Enrichit avec les données ASN depuis `dict_iplocate_asn` et la détection de bots + Anubis via `dict_anubis_ip` + `dict_anubis_asn` avec cascade de priorité + COALESCE : IP/CIDR → ASN. -### mv_agg_host_ip_ja4_1h +### mv_agg_host_ip_ja4_1h (`ja4_processing`) -- **Source**: `http_logs` -- **Target**: `agg_host_ip_ja4_1h` -- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features. +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_host_ip_ja4_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. 
+ Calcule ~50 features comportementales : compteurs de hits, ratios POST, + unicité des chemins/paramètres, jitter TCP, timing SYN, keep-alives, + compteurs d'orphelins, rotation UA, variance des métadonnées IP, etc. -### mv_agg_header_fingerprint_1h +### mv_agg_header_fingerprint_1h (`ja4_processing`) -- **Source**: `http_logs` -- **Target**: `agg_header_fingerprint_1h` -- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch. +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_header_fingerprint_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip)`. + Calcule le hash d'ordre des en-têtes, le nombre d'en-têtes, le score de + conformité navigateur (Sec-CH-UA = 100, UA seul = 50), l'incohérence + UA↔Sec-CH-UA-Platform et Sec-CH-UA-Mobile. + +### mv_agg_path_sequences_1h (`ja4_processing`) + +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_path_sequences_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip, ja4, host)`. + Stocke `groupArrayState(100)(tuple(timestamp, path))`. + +### mv_agg_request_timing_1h (`ja4_processing`) + +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_request_timing_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip, ja4, host)`. + Stocke `groupArrayIfState(500)(a_timestamp, a_timestamp > 0)`. + +### mv_agg_ip_behavior_1h (`ja4_processing`) + +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_ip_behavior_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip)`. + Stocke la séquence JA4, la distribution sumMap hôte→hits, le compteur + d'hôtes uniques, les hits totaux et les bornes temporelles. 
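
The target tables of these materialized views hold partial aggregation states, not final values. Reading them therefore needs the matching `-Merge` combinator for `AggregateFunction` columns and the plain aggregate for `SimpleAggregateFunction` columns. The query below is a minimal sketch against `agg_ip_behavior_1h`; it assumes the default database name `ja4_processing`, and the 24-hour window, the `LIMIT` and the output aliases are arbitrary choices for illustration.

```sql
-- Finalize the hourly per-IP states written by mv_agg_ip_behavior_1h.
SELECT
    window_start,
    src_ip,
    uniqMerge(host_count) AS distinct_hosts,  -- AggregateFunction(uniq, String)
    sum(total_hits)       AS hits,            -- SimpleAggregateFunction(sum, UInt64)
    min(first_seen)       AS first_seen_at,   -- SimpleAggregateFunction(min, DateTime)
    max(last_seen)        AS last_seen_at     -- SimpleAggregateFunction(max, DateTime)
FROM ja4_processing.agg_ip_behavior_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
GROUP BY window_start, src_ip
ORDER BY hits DESC
LIMIT 20;
```

Grouping by the full sorting key `(window_start, src_ip)` matters: until background merges complete, several unmerged rows can exist for the same key, and only the `GROUP BY` combines them.
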
+ +### mv_agg_resource_cascade_1h (`ja4_processing`) + +- **Source** : `ja4_logs.http_logs` +- **Cible** : `ja4_processing.agg_resource_cascade_1h` +- **Transformation** : GROUP BY `(toStartOfHour(time), src_ip, ja4, host)`. + Stocke `groupArrayState(200)(tuple(timestamp, is_asset))` où is_asset est + déterminé par correspondance regex sur les extensions de fichiers statiques. --- -## Dictionaries +## Dictionnaires -### dict_iplocate_asn +### Dictionnaires basés sur fichier CSV -- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv` -- **Key**: `network` (String) -- **Layout**: `IP_TRIE` -- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain` -- **Lifetime**: 3600–7200 seconds +Tous les fichiers doivent être placés dans `/var/lib/clickhouse/user_files/`. -### dict_bot_ip +| Dictionnaire | Fichier CSV | Clé | Layout | Attributs | Lifetime | Entrées approx. | +|--------------|------------|-----|--------|-----------|----------|------------------| +| `dict_iplocate_asn` | `iplocate-ip-to-asn.csv` | `network` (String) | IP_TRIE | `asn` (UInt32), `country_code`, `name` | 3600–7200 s | ~714K | +| `dict_bot_ip` | `bot_ip.csv` | `prefix` (String) | IP_TRIE | `bot_name` (String) | 300 s | ~3,5K CIDR | +| `dict_bot_ja4` | `bot_ja4.csv` | `ja4` (String) | COMPLEX_KEY_HASHED | `bot_name` (String) | 300 s | ~31 | +| `dict_browser_ja4` | `browser_ja4.csv` | `ja4` (String) | COMPLEX_KEY_HASHED | `browser_family`, `tls_library`, `context` | 300 s | ~1,2K | +| `dict_asn_reputation` | `asn_reputation.csv` | `src_asn` (UInt64) | HASHED | `label` (String) | 300 s | ~82K | -- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv` -- **Key**: `prefix` (String) -- **Layout**: `IP_TRIE` -- **Attributes**: `bot_name` (String) -- **Lifetime**: 300 seconds +### Dictionnaires basés sur ClickHouse -### dict_bot_ja4 +| Dictionnaire | Table source | Clé | Layout | Attributs | Lifetime | 
+|--------------|-------------|-----|--------|-----------|----------| +| `dict_anubis_ip` | `ja4_processing.anubis_ip_rules` | `prefix` (String) | IP_TRIE | `bot_name`, `action`, `rule_id` (UInt64), `has_ua` (UInt8), `category` | 300–600 s | +| `dict_anubis_asn` | `ja4_processing.anubis_asn_rules` | `asn` (UInt32) | FLAT | `bot_name`, `action`, `category` | 300–600 s | -- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv` -- **Key**: `ja4` (String) -- **Layout**: `COMPLEX_KEY_HASHED` -- **Attributes**: `bot_name` (String) -- **Lifetime**: 300 seconds - -### dict_asn_reputation - -- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv` -- **Key**: `src_asn` (UInt64) -- **Layout**: `HASHED` -- **Attributes**: `label` (String) -- **Lifetime**: 300 seconds - -### dict_anubis_ua - -- **Source**: ClickHouse table `anubis_ua_rules` -- **Key**: `regexp` (String) -- **Layout**: `REGEXP_TREE` -- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category` -- **Lifetime**: 300–600 seconds - -### dict_anubis_ip - -- **Source**: ClickHouse table `anubis_ip_rules` -- **Key**: `prefix` (String) -- **Layout**: `IP_TRIE` -- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category` -- **Lifetime**: 300–600 seconds - -### dict_anubis_asn - -- **Source**: ClickHouse table `anubis_asn_rules` -- **Key**: `asn` (UInt32) -- **Layout**: `FLAT` -- **Attributes**: `bot_name`, `action`, `category` -- **Lifetime**: 300–600 seconds - -### dict_anubis_country - -- **Source**: ClickHouse table `anubis_country_rules` -- **Key**: `country_code` (String) -- **Layout**: `FLAT` -- **Attributes**: `bot_name`, `action`, `category` -- **Lifetime**: 300–600 seconds +> **Note** : les dictionnaires Anubis basés sur ClickHouse nécessitent que les +> identifiants de connexion soient configurés dans les fichiers SQL (mot de passe +> par défaut `CHANGE_ME` à remplacer avant la mise en production). 
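
The IP/CIDR → ASN priority used for Anubis enrichment can be reproduced by hand with dictionary lookups. The query below is a sketch, not the actual `mv_http_logs` definition: the address `203.0.113.7` and the ASN `64500` are placeholder values, the dictionaries are assumed to live in `ja4_processing`, and the `nullIf` / `coalesce` wiring is only one possible way to express the cascade.

```sql
-- Look up the same client in both Anubis dictionaries; the IP/CIDR match wins.
SELECT
    coalesce(
        -- IP_TRIE dictionaries take a tuple key (UInt32 for an IPv4 address).
        nullIf(dictGetOrDefault('ja4_processing.dict_anubis_ip', 'bot_name',
                                tuple(IPv4StringToNum('203.0.113.7')), ''), ''),
        -- FLAT dictionaries take a UInt64 key.
        nullIf(dictGetOrDefault('ja4_processing.dict_anubis_asn', 'bot_name',
                                toUInt64(64500), ''), ''),
        ''
    ) AS anubis_bot_name;
```

An empty result means neither dictionary matched, which corresponds to the `DEFAULT ''` of the `anubis_bot_*` columns.
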
--- -## Views +## Vues ### view_ai_features_1h -Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`. Includes: +Calcule ~65+ features ML par `(src_ip, ja4, host)` sur les dernières 24 heures +en joignant `agg_host_ip_ja4_1h` et `agg_header_fingerprint_1h`. -- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio` -- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio` -- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count` -- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch` -- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv` -- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category` -- Derived: `temporal_entropy`, `ja3_diversity_ratio` +| Catégorie | Features | +|-----------|----------| +| Comportementales | `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`, `asset_ratio`, `direct_access_ratio` | +| Connexion | `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`, `src_port_density` | +| Navigateur | `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`, `is_headless` | +| TLS | `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch` | +| L4 | `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`, `window_mss_ratio` | +| Réputation | `bot_name` (dict_bot_ip / dict_bot_ja4), `browser_family` (dict_browser_ja4), `asn_label` (dict_asn_reputation), `anubis_bot_name`/`action`/`category` (dict_anubis_ip / dict_anubis_asn) | +| Statistiques | `temporal_entropy`, `ja3_diversity_ratio`, `ja4_asn_concentration`, `ja4_country_concentration` | +| P1 | `has_xff`, `unusual_content_type_ratio`, `non_standard_port_ratio`, `login_post_concentration` | + +Utilise des fonctions de fenêtrage (`sum() OVER`, `count() OVER`, `uniqExact() OVER`) +pour 
les features de concentration et de partage TCP. ### view_ip_recurrence -Aggregates recurrence data from `ml_detected_anomalies`: +Agrège les données de récurrence depuis `ml_detected_anomalies` (30 derniers jours) : ```sql -SELECT src_ip, count() AS recurrence, - min(detected_at) AS first_seen, max(detected_at) AS last_seen, - min(anomaly_score) AS worst_score, - argMin(threat_level, anomaly_score) AS worst_threat_level -FROM ml_detected_anomalies GROUP BY src_ip; +SELECT + src_ip, + count() AS recurrence, + min(detected_at) AS first_seen, + max(detected_at) AS last_seen, + max(anomaly_score) AS worst_score, + argMax(threat_level, anomaly_score) AS worst_threat_level +FROM ja4_processing.ml_detected_anomalies +WHERE detected_at >= now() - INTERVAL 30 DAY +GROUP BY src_ip; ``` +### view_form_bruteforce_detected + +Détection de force brute sur les formulaires. Source : `agg_host_ip_ja4_1h` +(dernières 24h). Filtre les combinaisons `(src_ip, host)` ayant +`count_post >= 10`. Retourne `src_ip`, `host`, `ja4` (argMax par hits), +`hits`, `query_params_count`. + +### view_host_ip_ja4_rotation + +Détection de rotation d'empreintes JA4 par IP. Source : `agg_host_ip_ja4_1h` +(dernières 24h, ja4 ≠ ''). Filtre les IP ayant `distinct_ja4_count >= 2`. +Retourne `src_ip`, `distinct_ja4_count`, `total_hits`, `first_seen`, `last_seen`. + +### view_dashboard_user_agents + +Agrégation des User-Agents pour le dashboard. Source : `ja4_logs.http_logs` +(7 derniers jours). GROUP BY `(src_ip, ja4, toStartOfHour(time), log_date)`. +Retourne `src_ip` (normalisé IPv4), `ja4`, `hour`, `log_date`, +`user_agents` (groupUniqArray(100)), `requests`. + +### view_dashboard_entities + +Vue d'entités du dashboard. Source : `ja4_logs.http_logs` (7 derniers jours). +Structure `UNION ALL` de 5 branches — une par type d'entité : `ip`, `ja4`, +`country`, `asn`, `host`. Retourne `entity_type`, `entity_value`, `src_ip`, +`ja4`, `host`, `log_date`, `client_headers`, `asns`, `countries`, `user_agents`. 
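
To illustrate how the detection views read the hourly aggregates, the JA4 rotation check of `view_host_ip_ja4_rotation` could be written roughly as follows. This is a reconstruction from the description above, not the shipped `11_views.sql`: the use of `uniqExact`, the use of `window_start` for the time bounds and the final `ORDER BY` are assumptions.

```sql
-- IPs that presented at least two distinct JA4 fingerprints in the last 24 hours.
SELECT
    src_ip,
    uniqExact(ja4)    AS distinct_ja4_count,
    sum(hits)         AS total_hits,   -- hits is SimpleAggregateFunction(sum)
    min(window_start) AS first_seen,
    max(window_start) AS last_seen
FROM ja4_processing.agg_host_ip_ja4_1h
WHERE window_start >= now() - INTERVAL 24 HOUR
  AND ja4 != ''
GROUP BY src_ip
HAVING distinct_ja4_count >= 2
ORDER BY distinct_ja4_count DESC;
```
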
+ +### view_resource_cascade_1h (thèse §5.4) + +Analyse de cascade de chargement de ressources. Source : `agg_resource_cascade_1h` +(dernières 24h). Sépare les chargements en documents (is_asset=0) et assets +(is_asset=1). Calcule `doc_count`, `asset_count`, `root_to_first_asset_delay`, +`asset_load_stddev` (σ des timestamps d'assets — mesure de simultanéité). + +### view_thesis_features_1h (thèse §5) + +Vue unifiée des features avancées de détection de la thèse. Joint (via CTEs +sur les 24 dernières heures) : + +| Source | Features calculées | +|--------|-------------------| +| `agg_path_sequences_1h` (§5.1) | `path_transition_entropy` (entropie de Shannon normalisée des transitions Markov-1) | +| `agg_request_timing_1h` (§5.3) | `cadence_cv`, `burst_ratio` (Δt<100ms), `pause_ratio` (Δt>5s), `lag1_autocorrelation`, `benford_deviation` (χ² vs loi de Benford) | +| `agg_ip_behavior_1h` (§5.5/§5.8) | `ja4_drift_ratio`, `ja4_distinct_in_session`, `host_diversity`, `host_sweep_speed`, `host_coverage_uniformity` | +| `view_resource_cascade_1h` (§5.4) | `doc_count`, `asset_count`, `root_to_first_asset_delay`, `asset_load_stddev` | + +**Clés de jointure** : `(window_start, src_ip, ja4, host)` pour §5.1/§5.3/§5.4 ; +`(window_start, src_ip)` pour §5.5/§5.8. + --- -## User Accounts +## Index de performance (10_perf_indexes.sql) -| User | Permissions | Purpose | -|------|------------|---------| -| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service | -| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts | +Migration idempotente ajoutant des index secondaires et projections aux tables +existantes (les installations fraîches les ont déjà dans 04/05/06) : -> **Security note**: Default passwords are `ChangeMe` — replace with strong passwords before production use. Store credentials in a secrets manager. 
+| Table | Index / Projection ajouté | +|-------|--------------------------| +| `ml_detected_anomalies` | `idx_detected_at` (minmax), `idx_threat_level` (set(8)), `idx_bot_name` (bloom_filter) | +| `ml_all_scores` | `idx_detected_at` (minmax), `idx_threat_level` (set(8)) | +| `http_logs` | `idx_src_ip` (bloom_filter(0.01)), `idx_ja4` (bloom_filter(0.01)) | +| `agg_host_ip_ja4_1h` | `proj_by_ip` (projection ORDER BY src_ip, window_start, ja4, host) | + +--- + +## Comptes utilisateurs + +| Utilisateur | Permissions | Usage | +|-------------|------------|-------| +| `data_writer` | INSERT + SELECT sur `ja4_logs.http_logs_raw` | Service correlator | +| `analyst` | SELECT sur `ja4_logs.http_logs`, `ja4_processing.ml_detected_anomalies`, `ja4_processing.ml_all_scores`, `ja4_processing.view_ai_features_1h`, `ja4_processing.view_ip_recurrence`, `ja4_processing.audit_logs` | Dashboard / analystes SOC | + +> **Sécurité** : les mots de passe par défaut sont `ChangeMe`. Remplacer par des +> mots de passe forts avant la mise en production. Stocker les identifiants dans +> un gestionnaire de secrets. diff --git a/shared/clickhouse/README.md b/shared/clickhouse/README.md index 20632fb..a700d11 100644 --- a/shared/clickhouse/README.md +++ b/shared/clickhouse/README.md @@ -1,48 +1,75 @@ -# ClickHouse Migrations — ja4-platform +# Schéma ClickHouse — ja4-platform -## Migration Order +Ce répertoire contient les 13 fichiers SQL définissant le schéma complet +de la plateforme. Le schéma utilise un **patron double-base** : -Apply these files in numeric order against the ClickHouse server: +- **`ja4_logs`** — ingestion brute et logs HTTP parsés +- **`ja4_processing`** — agrégations, ML, vues, dictionnaires, audit + +Les noms de base sont configurables via variables d'environnement +(`CLICKHOUSE_DB_LOGS`, `CLICKHOUSE_DB_PROCESSING`). 
+
+## SQL files
+
+Apply in numeric order:
+
+| File | Contents |
+|---------|---------|
+| `00_database.sql` | Creates the `ja4_logs` and `ja4_processing` databases |
+| `01_raw_tables.sql` | `http_logs_raw` ingest table |
+| `02_dictionaries.sql` | `dict_iplocate_asn` dictionary; `ref_bot_networks`, `bot_ip`, `bot_ja4` tables |
+| `03_anubis_tables.sql` | Anubis rule tables (`anubis_ip_rules`, `anubis_asn_rules`) and dictionaries (`dict_anubis_ip`, `dict_anubis_asn`) |
+| `04_mv_http_logs.sql` | `http_logs` table + `mv_http_logs` materialized view (JSON parsing + Anubis IP→ASN enrichment) |
+| `05_aggregation_tables.sql` | Reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_browser_ja4`, `dict_asn_reputation`) + aggregation tables + MVs |
+| `06_ml_tables.sql` | ML tables (`ml_detected_anomalies`, `ml_all_scores`) + `view_ip_recurrence` view |
+| `07_ai_features_view.sql` | `view_ai_features_1h` view (~65+ ML features) |
+| `08_users.sql` | `data_writer` and `analyst` users with their permissions |
+| `09_audit_table.sql` | `audit_logs` table for the SOC audit trail |
+| `10_perf_indexes.sql` | Secondary indexes and performance projections |
+| `11_views.sql` | Dashboard views (entities, user agents, brute force, JA4 rotation, cascade) |
+| `12_thesis_features.sql` | Thesis aggregation tables (§5) + `view_thesis_features_1h` view |
+
+## Deployment
+
+### Recommended method
+
+Use `deploy_schema.sh`, which applies the 13 files and automatically
+substitutes the database names:
 
 ```bash
-clickhouse-client --multiquery < 00_database.sql
-clickhouse-client --multiquery < 01_raw_tables.sql
-clickhouse-client --multiquery < 02_dictionaries.sql
-clickhouse-client --multiquery < 03_anubis_tables.sql
-clickhouse-client --multiquery < 04_mv_http_logs.sql
-clickhouse-client --multiquery < 05_aggregation_tables.sql
-clickhouse-client --multiquery < 06_ml_tables.sql
-clickhouse-client --multiquery < 07_ai_features_view.sql
-clickhouse-client --multiquery < 08_users.sql
-clickhouse-client --multiquery < 09_audit_table.sql
+./deploy_schema.sh
+
+# Or with a custom configuration:
+CLICKHOUSE_DB_LOGS=my_logs \
+CLICKHOUSE_DB_PROCESSING=my_proc \
+CLICKHOUSE_HOST=clickhouse-server \
+CLICKHOUSE_PASSWORD='secret' \
+  ./deploy_schema.sh
 ```
 
-## File Descriptions
+### Manual application
 
-| File | Contents |
-|------|----------|
-| `00_database.sql` | CREATE DATABASE |
-| `01_raw_tables.sql` | `http_logs_raw` ingest table |
-| `02_dictionaries.sql` | ASN geo dict, bot IP/JA4/network reference tables |
-| `03_anubis_tables.sql` | Anubis crawler rule tables and dictionaries (UA, IP, ASN, country) |
-| `04_mv_http_logs.sql` | Canonical `http_logs` target table + `mv_http_logs` materialized view with full Anubis enrichment |
-| `05_aggregation_tables.sql` | `agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h` + their MVs |
-| `06_ml_tables.sql` | `ml_detected_anomalies`, `ml_all_scores` |
-| `07_ai_features_view.sql` | `view_ai_features_1h` with Anubis enrichment |
-| `08_users.sql` | ClickHouse users and grants |
-| `09_audit_table.sql` | `audit_logs` table for SOC dashboard audit trail |
+```bash
+for f in 0*.sql 1*.sql; do
+  clickhouse-client --multiquery < "$f"
+done
+```
 
-## Prerequisites
+## Requirements
 
-Place CSV data files in `/var/lib/clickhouse/user_files/`:
-- `iplocate-ip-to-asn.csv` — IP-to-ASN mapping (from IPLocate)
-- `bot_ip.csv` — Known bot IP prefixes
-- `bot_ja4.csv` — Known bot JA4 fingerprints
-- `asn_reputation.csv` — ASN reputation labels
+- **ClickHouse 24.8+** (for `deduplicate_merge_projection_mode` support)
+- **CSV files** in `/var/lib/clickhouse/user_files/`:
+  - `iplocate-ip-to-asn.csv`: IP→ASN mapping (~714K entries)
+  - `bot_ip.csv`: known bot IP prefixes (~3.5K CIDRs)
+  - `bot_ja4.csv`: bot JA4 fingerprints (~31 entries)
+  - `browser_ja4.csv`: browser JA4 fingerprints (~1.2K entries)
+  - `asn_reputation.csv`: ASN reputation labels (~82K entries)
+- **Anubis passwords**: replace `CHANGE_ME` in `03_anubis_tables.sql`
 
 ## Notes
 
-- `04_mv_http_logs.sql` is the **canonical** version of the MV, superseding the base version
-  in `services/correlator/sql/init.sql`. It includes full Anubis enrichment.
-- All migrations are idempotent (use `IF NOT EXISTS` / `IF EXISTS`).
-- Anubis dictionary passwords in `03_anubis_tables.sql` **must** be changed before production use.
+- All migrations are **idempotent** (`IF NOT EXISTS` / `CREATE OR REPLACE`).
+- `04_mv_http_logs.sql` is the **canonical** version of the materialized view.
+- Cross-database references exist: the MVs in `ja4_processing` read
+  from `ja4_logs.http_logs`.
+- Full documentation: `docs/database/schema.md` and `docs/database/migrations.md`.
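
Reviewer note: the README text in this patch says `deploy_schema.sh` substitutes the database names automatically, but the script itself is not part of the diff. A minimal sketch of what such a step could look like, assuming the SQL files hard-code the default names `ja4_logs` and `ja4_processing`; the helpers `render_sql` and `apply_all` are hypothetical and not taken from the actual script:

```bash
#!/bin/sh
# Hypothetical sketch; the real deploy_schema.sh is not shown in this patch.
# Assumption: the SQL files reference the default names ja4_logs / ja4_processing.

# Rewrite the default database names in a SQL stream read from stdin.
# Plain substring replacement: adequate only while no other identifier
# contains "ja4_logs" or "ja4_processing".
render_sql() {
  sed -e "s/ja4_logs/${CLICKHOUSE_DB_LOGS:-ja4_logs}/g" \
      -e "s/ja4_processing/${CLICKHOUSE_DB_PROCESSING:-ja4_processing}/g"
}

# Apply every numbered file in order; two-digit prefixes (00..12) sort
# correctly under the shell's lexical glob expansion.
apply_all() {
  for f in [0-9][0-9]_*.sql; do
    render_sql < "$f" |
      clickhouse-client --host "${CLICKHOUSE_HOST:-localhost}" --multiquery
  done
}
```

Calling `apply_all` from `shared/clickhouse/` would then pipe each rewritten file to `clickhouse-client`. Piping through `sed` leaves the files on disk untouched, so the same checkout can target several database name pairs.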