docs: complete rewrite of the database documentation in French

Rewrite of the 3 ClickHouse database documentation files:

- docs/database/schema.md: full coverage of the 2 databases, 14+ tables,
  7 dictionaries, 8 MVs, 8 views, TTLs, partitions, engines and columns
- docs/database/migrations.md: 13 SQL files (10-12 added), updated
  prerequisites (ClickHouse 24.8+, 5 CSV files), deploy_schema.sh,
  init-stack.sh, complete verification and rollback
- shared/clickhouse/README.md: quick reference for the 13 files,
  deploy_schema.sh, dual-database pattern, prerequisites

Removed obsolete references: dict_anubis_ua, dict_anubis_country,
anubis_ua_rules, anubis_country_rules.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-09 22:03:37 +02:00
parent d05969867f
commit 8f5e771096
3 changed files with 897 additions and 405 deletions

@@ -1,256 +1,402 @@
# Database Migrations
# Migrations de base de données
The ClickHouse schema for ja4-platform is managed through numbered SQL migration files in `shared/clickhouse/`. Migrations are idempotent (using `IF NOT EXISTS` / `IF EXISTS`) and must be applied in numeric order.
Le schéma ClickHouse de ja4-platform est géré via 13 fichiers SQL numérotés dans
`shared/clickhouse/`. Toutes les migrations sont **idempotentes** (utilisation de
`IF NOT EXISTS` / `IF EXISTS` / `CREATE OR REPLACE`) et doivent être appliquées
dans l'ordre numérique.
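The numeric-order requirement can be checked before anything is applied. The sketch below is a hypothetical helper, not part of the repository: `list_migrations` is an assumed name, and it only assumes the two-digit `NN_name.sql` naming visible in the migration table.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print migration files in
# numeric order and fail if a number is missing from the sequence.
list_migrations() {
    local dir="$1" expected=0 f base num
    for f in "$dir"/[0-9][0-9]_*.sql; do
        # An unmatched glob stays literal, so test that the file really exists.
        [ -e "$f" ] || { echo "no migration files in $dir" >&2; return 1; }
        base="${f##*/}"
        num=$((10#${base%%_*}))   # force base 10 so "08" is not parsed as octal
        if [ "$num" -ne "$expected" ]; then
            echo "gap before $base: expected number $expected" >&2
            return 1
        fi
        echo "$base"
        expected=$((expected + 1))
    done
}
```

Running `list_migrations shared/clickhouse` before a manual deployment would surface a missing or misnumbered file early instead of midway through the sequence.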
## Migration Order
Le schéma utilise un **patron double-base** :
| File | Purpose |
|------|---------|
| `00_database.sql` | Creates the `ja4_processing` database |
| `01_raw_tables.sql` | Creates `http_logs_raw` ingest table (MergeTree, 1-day TTL) |
| `02_dictionaries.sql` | Creates ASN geo dictionary (`dict_iplocate_asn`), bot IP/JA4 reference tables, `ref_bot_networks` |
| `03_anubis_tables.sql` | Creates Anubis crawler rule tables (`anubis_ua_rules`, `anubis_ip_rules`, `anubis_asn_rules`, `anubis_country_rules`) and their dictionaries (`dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`) |
| `04_mv_http_logs.sql` | Creates the canonical `http_logs` table and `mv_http_logs` materialized view with full Anubis enrichment |
| `05_aggregation_tables.sql` | Creates reputation dictionaries (`dict_bot_ip`, `dict_bot_ja4`, `dict_asn_reputation`), behavioral aggregation tables (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`), and their materialized views |
| `06_ml_tables.sql` | Creates ML output tables (`ml_detected_anomalies`, `ml_all_scores`) and `view_ip_recurrence` |
| `07_ai_features_view.sql` | Creates `view_ai_features_1h` — the 50+ feature view used by bot-detector |
| `08_users.sql` | Creates ClickHouse users (`data_writer`, `analyst`) and grants permissions |
| `09_audit_table.sql` | Creates `audit_logs` table for SOC dashboard audit trail |
| Base | Variable d'environnement | Défaut | Contenu |
|------|--------------------------|--------|---------|
| Logs | `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| Processing | `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Agrégations, ML, vues, dictionnaires, audit |
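As an illustration of the dual-database pattern, the sketch below resolves both names from the environment and rewrites the default qualifiers in a SQL stream. How `deploy_schema.sh` actually performs its substitution is not shown in this document, so the `sed` approach and the `render_sql` name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the real substitution logic of deploy_schema.sh is not shown
# in this document; the sed-based rewrite below is an assumption.
render_sql() {
    # Fall back to the documented defaults when the variables are unset.
    local logs="${CLICKHOUSE_DB_LOGS:-ja4_logs}"
    local proc="${CLICKHOUSE_DB_PROCESSING:-ja4_processing}"
    # Rewrite "ja4_logs." and "ja4_processing." qualifiers read from stdin.
    sed -e "s/ja4_logs\./${logs}./g" \
        -e "s/ja4_processing\./${proc}./g"
}
```

This only touches `database.object` qualifiers. Quoted names such as `database = 'ja4_logs'` in verification queries would need a separate rule.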
## Prerequisites
---
### 1. ClickHouse Server
## Ordre des migrations
A running ClickHouse server (version 23.8+ recommended for `REGEXP_TREE` dictionary support).
| Fichier | Lignes | Contenu |
|---------|--------|---------|
| `00_database.sql` | 5 | Création des bases `ja4_logs` et `ja4_processing` |
| `01_raw_tables.sql` | 16 | Table d'ingestion `http_logs_raw` (MergeTree, TTL 2h) |
| `02_dictionaries.sql` | 57 | Dictionnaire `dict_iplocate_asn` (IP_TRIE, CSV), tables `ref_bot_networks`, `bot_ip`, `bot_ja4` |
| `03_anubis_tables.sql` | 73 | Tables de règles Anubis (`anubis_ip_rules`, `anubis_asn_rules`) et dictionnaires (`dict_anubis_ip`, `dict_anubis_asn`) |
| `04_mv_http_logs.sql` | 197 | Table `http_logs` (MergeTree, TTL 30j) + vue matérialisée `mv_http_logs` (parse JSON + enrichissement Anubis COALESCE IP→ASN) |
| `05_aggregation_tables.sql` | 234 | Dictionnaires de réputation (`dict_bot_ip`, `dict_bot_ja4`, `dict_browser_ja4`, `dict_asn_reputation`), 2 tables d'agrégation (`agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`) + 2 vues matérialisées |
| `06_ml_tables.sql` | 144 | Tables ML (`ml_detected_anomalies`, `ml_all_scores`) + vue `view_ip_recurrence` |
| `07_ai_features_view.sql` | 156 | Vue `view_ai_features_1h` (~65+ features ML depuis les agrégations + dictionnaires) |
| `08_users.sql` | 22 | Utilisateurs `data_writer` et `analyst` avec permissions |
| `09_audit_table.sql` | 21 | Table `audit_logs` pour le journal d'audit SOC |
| `10_perf_indexes.sql` | 113 | Index secondaires et projections de performance (migration idempotente pour instances existantes) |
| `11_views.sql` | 216 | Vues dashboard (`view_dashboard_entities`, `view_dashboard_user_agents`, `view_form_bruteforce_detected`, `view_host_ip_ja4_rotation`, `view_resource_cascade_1h`) |
| `12_thesis_features.sql` | 580 | 4 tables d'agrégation thèse (`agg_path_sequences_1h`, `agg_request_timing_1h`, `agg_ip_behavior_1h`, `agg_resource_cascade_1h`) + 4 MVs + vue `view_thesis_features_1h` |
### 2. CSV Data Files
---
Place the following files in `/var/lib/clickhouse/user_files/`:
## Prérequis
| File | Source | Description |
|------|--------|-------------|
| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | IP-to-ASN mapping with country, org, domain |
| `bot_ip.csv` | Custom | Known bot IP prefixes (CIDR format) |
| `bot_ja4.csv` | Custom | Known bot JA4 fingerprints |
| `asn_reputation.csv` | Custom | ASN reputation labels (`human`, `bot`, `unknown`) |
### 1. Serveur ClickHouse
### 3. Anubis Passwords
Un serveur ClickHouse en fonctionnement, **version 24.8+** requise (support des
projections AggregatingMergeTree avec `deduplicate_merge_projection_mode`).
Migration `03_anubis_tables.sql` contains placeholder passwords (`CHANGE_ME`) for the Anubis dictionaries. Replace these with the actual ClickHouse admin password before applying:
### 2. Fichiers CSV de données
Placer les fichiers suivants dans `/var/lib/clickhouse/user_files/` :
| Fichier | Source | Description | Entrées approx. |
|---------|--------|-------------|------------------|
| `iplocate-ip-to-asn.csv` | [IPLocate](https://iplocate.io) | Correspondance IP→ASN avec pays, org, domaine | ~714K |
| `bot_ip.csv` | Personnalisé | Préfixes IP de bots connus (format CIDR) | ~3,5K |
| `bot_ja4.csv` | Personnalisé | Empreintes JA4 de bots connus | ~31 |
| `browser_ja4.csv` | Personnalisé | Empreintes JA4 de navigateurs légitimes | ~1,2K |
| `asn_reputation.csv` | Personnalisé | Labels de réputation ASN (`human`, `bot`, `unknown`) | ~82K |
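A pre-flight check for the five CSV files might look like the following. It is a minimal sketch: `check_csv_files` is a hypothetical name, and the only facts taken from the table are the file names and the default directory.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: report which of the five CSV files listed
# in the table are missing or empty in the given directory.
check_csv_files() {
    local dir="${1:-/var/lib/clickhouse/user_files}" missing=0 f
    for f in iplocate-ip-to-asn.csv bot_ip.csv bot_ja4.csv \
             browser_ja4.csv asn_reputation.csv; do
        if [ ! -s "$dir/$f" ]; then      # -s: file exists and is not empty
            echo "missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                 # non-zero exit status if any file is absent
}
```

Calling `check_csv_files` with no argument checks the default `user_files` directory. Passing another path is useful when ClickHouse runs in a container with a mounted volume.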
### 3. Mots de passe Anubis
Le fichier `03_anubis_tables.sql` contient des mots de passe par défaut (`CHANGE_ME`)
pour les dictionnaires Anubis basés sur ClickHouse. Les remplacer avant d'appliquer :
```bash
sed -i "s/CHANGE_ME/your_actual_password/g" 03_anubis_tables.sql
sed -i "s/CHANGE_ME/mot_de_passe_réel/g" 03_anubis_tables.sql
```
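Since a leftover placeholder would be deployed silently, it may be worth checking the file after the `sed` step. The guard below is a hypothetical sketch (`no_placeholder_left` is an assumed name) built on a plain `grep`.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the file is readable and no longer
# contains the CHANGE_ME placeholder.
no_placeholder_left() {
    local file="$1"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
    if grep -q 'CHANGE_ME' "$file"; then
        echo "placeholder password still present in $file" >&2
        return 1
    fi
}
```

Note that the `sed` command above treats `/` and `&` in the replacement specially, so a password containing either character needs escaping or a different delimiter.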
## How to Apply
---
### Full Initial Setup
## Comment appliquer
Apply all migrations in order:
### Méthode recommandée : deploy_schema.sh
Le script `deploy_schema.sh` applique les 13 fichiers dans l'ordre en substituant
automatiquement les noms de base de données :
```bash
cd shared/clickhouse/
clickhouse-client --multiquery < 00_database.sql
clickhouse-client --multiquery < 01_raw_tables.sql
clickhouse-client --multiquery < 02_dictionaries.sql
clickhouse-client --multiquery < 03_anubis_tables.sql
clickhouse-client --multiquery < 04_mv_http_logs.sql
clickhouse-client --multiquery < 05_aggregation_tables.sql
clickhouse-client --multiquery < 06_ml_tables.sql
clickhouse-client --multiquery < 07_ai_features_view.sql
clickhouse-client --multiquery < 08_users.sql
clickhouse-client --multiquery < 09_audit_table.sql
# Avec les noms de base par défaut (ja4_logs / ja4_processing)
./deploy_schema.sh
# Avec des noms personnalisés
CLICKHOUSE_DB_LOGS=my_logs \
CLICKHOUSE_DB_PROCESSING=my_proc \
CLICKHOUSE_HOST=clickhouse-server \
CLICKHOUSE_USER=admin \
CLICKHOUSE_PASSWORD='secret' \
./deploy_schema.sh
```
### With Authentication
Variables d'environnement supportées :
| Variable | Défaut | Description |
|----------|--------|-------------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Nom de la base de logs |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Nom de la base de traitement |
| `CLICKHOUSE_HOST` | `localhost` | Hôte ClickHouse |
| `CLICKHOUSE_PORT` | `9000` | Port natif ClickHouse |
| `CLICKHOUSE_USER` | `default` | Utilisateur ClickHouse |
| `CLICKHOUSE_PASSWORD` | (vide) | Mot de passe ClickHouse |
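The connection variables in this table map directly onto `clickhouse-client` options. The sketch below shows one way to build the argument list with the documented defaults. Whether `deploy_schema.sh` does it this way is an assumption, and `client_args` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: build clickhouse-client connection arguments from the variables in
# the table. That deploy_schema.sh works exactly like this is an assumption.
client_args() {
    local args=(
        --host "${CLICKHOUSE_HOST:-localhost}"
        --port "${CLICKHOUSE_PORT:-9000}"
        --user "${CLICKHOUSE_USER:-default}"
    )
    # Pass --password only when one is set (the documented default is empty).
    if [ -n "${CLICKHOUSE_PASSWORD:-}" ]; then
        args+=(--password "$CLICKHOUSE_PASSWORD")
    fi
    printf '%s\n' "${args[@]}"           # one argument per line
}
```

With bash 4 or later, the output can be read back into an array with `mapfile -t args < <(client_args)` and passed as `clickhouse-client "${args[@]}" --multiquery < 00_database.sql`.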
### Méthode alternative : init-stack.sh
Le script `scripts/init-stack.sh` fournit une initialisation complète incluant
le schéma, les migrations, la validation et le nettoyage :
```bash
clickhouse-client --user admin --password 'your_password' --multiquery < 00_database.sql
# ... repeat for each file
./scripts/init-stack.sh
```
### One-Liner (All at Once)
### Application manuelle
```bash
cd shared/clickhouse/
for f in 0*.sql; do
echo "Applying $f..."
for f in 0*.sql 1*.sql; do
echo "Application de $f..."
clickhouse-client --multiquery < "$f"
done
```
## How to Verify
Avec authentification :
After applying all migrations, run these queries to verify each migration was successful:
### 00 — Database
```sql
SHOW DATABASES LIKE 'ja4_processing';
-- Expected: ja4_processing
```bash
clickhouse-client --user admin --password 'secret' --multiquery < 00_database.sql
# ... répéter pour chaque fichier
```
### 01 — Raw Tables
---
## Comment vérifier
Après l'application de toutes les migrations, exécuter ces requêtes pour valider
chaque étape.
### 00 — Bases de données
```sql
SHOW DATABASES LIKE 'ja4%';
-- Attendu : ja4_logs, ja4_processing
```
### 01 — Table brute
```sql
EXISTS ja4_logs.http_logs_raw;
-- Expected: 1
-- Attendu : 1
```
### 02 — Dictionaries
### 02 — Dictionnaire ASN + tables de référence
```sql
SELECT dictGetOrDefault('ja4_processing.dict_iplocate_asn', 'country_code',
toIPv6(toIPv4('8.8.8.8')), 'MISSING');
-- Expected: US (if CSV loaded) or MISSING
-- Attendu : US (si CSV chargé) ou MISSING
EXISTS ja4_processing.ref_bot_networks;
-- Attendu : 1
```
### 03 — Anubis Tables
### 03 — Tables Anubis
```sql
EXISTS ja4_processing.anubis_ua_rules;
EXISTS ja4_processing.anubis_ip_rules;
EXISTS ja4_processing.anubis_asn_rules;
EXISTS ja4_processing.anubis_country_rules;
-- Expected: 1 for each
-- Attendu : 1 pour chacune
```
### 04 — MV + http_logs
### 04 — http_logs + vue matérialisée
```sql
EXISTS ja4_logs.http_logs;
SELECT name FROM system.tables WHERE database = 'ja4_logs' AND name = 'mv_http_logs';
-- Expected: mv_http_logs
-- Attendu : mv_http_logs
```
### 05 — Aggregation Tables
### 05 — Tables d'agrégation + dictionnaires de réputation
```sql
EXISTS ja4_processing.agg_host_ip_ja4_1h;
EXISTS ja4_processing.agg_header_fingerprint_1h;
SELECT name FROM system.dictionaries WHERE database = 'ja4_processing' AND name = 'dict_bot_ip';
-- Expected: dict_bot_ip
SELECT name FROM system.dictionaries
WHERE database = 'ja4_processing' AND name IN ('dict_bot_ip', 'dict_bot_ja4', 'dict_browser_ja4', 'dict_asn_reputation');
-- Attendu : 4 lignes
```
### 06 — ML Tables
### 06 — Tables ML
```sql
EXISTS ja4_processing.ml_detected_anomalies;
EXISTS ja4_processing.ml_all_scores;
SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name LIKE 'view_ip%';
-- Expected: view_ip_recurrence
SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_ip_recurrence';
-- Attendu : view_ip_recurrence
```
### 07 — AI Features View
### 07 — Vue de features AI
```sql
SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_ai_features_1h';
-- Expected: view_ai_features_1h
-- Attendu : view_ai_features_1h
```
### 08 — Users
### 08 — Utilisateurs
```sql
SHOW GRANTS FOR data_writer;
-- Expected: GRANT INSERT, SELECT ON ja4_logs.http_logs_raw TO data_writer
-- Attendu : GRANT INSERT, SELECT ON ja4_logs.http_logs_raw TO data_writer
SHOW GRANTS FOR analyst;
-- Expected: GRANT SELECT ON multiple tables
-- Attendu : GRANT SELECT sur 6 tables/vues
```
### 09 — Audit Table
### 09 — Table d'audit
```sql
EXISTS ja4_processing.audit_logs;
-- Expected: 1
-- Attendu : 1
```
### Full Verification Query
### 10 — Index de performance
```sql
SELECT
count() AS total_tables
SELECT name FROM system.data_skipping_indices
WHERE table = 'ml_detected_anomalies' AND database = 'ja4_processing';
-- Attendu : idx_detected_at, idx_threat_level, idx_bot_name
```
### 11 — Vues dashboard
```sql
SELECT name FROM system.tables
WHERE database = 'ja4_processing'
AND name LIKE 'view_%'
AND engine = 'View';
-- Attendu : ≥ 7 vues (view_ip_recurrence, view_ai_features_1h,
-- view_form_bruteforce_detected, view_host_ip_ja4_rotation,
-- view_dashboard_user_agents, view_dashboard_entities, view_resource_cascade_1h)
```
### 12 — Tables et vue de thèse
```sql
EXISTS ja4_processing.agg_path_sequences_1h;
EXISTS ja4_processing.agg_request_timing_1h;
EXISTS ja4_processing.agg_ip_behavior_1h;
EXISTS ja4_processing.agg_resource_cascade_1h;
SELECT name FROM system.tables WHERE database = 'ja4_processing' AND name = 'view_thesis_features_1h';
-- Attendu : 1 pour chaque EXISTS, view_thesis_features_1h
```
### Vérification complète
```sql
-- Tables dans ja4_logs
SELECT count() AS tables_logs
FROM system.tables
WHERE database = 'ja4_logs'
AND name IN ('http_logs_raw', 'http_logs', 'mv_http_logs');
-- Attendu : 3
-- Tables dans ja4_processing
SELECT count() AS tables_processing
FROM system.tables
WHERE database = 'ja4_processing'
AND name IN (
'http_logs_raw', 'http_logs', 'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
'ml_detected_anomalies', 'ml_all_scores', 'ref_bot_networks',
'anubis_ua_rules', 'anubis_ip_rules', 'anubis_asn_rules', 'anubis_country_rules',
'audit_logs', 'bot_ip', 'bot_ja4'
'ref_bot_networks', 'bot_ip', 'bot_ja4',
'anubis_ip_rules', 'anubis_asn_rules',
'agg_host_ip_ja4_1h', 'agg_header_fingerprint_1h',
'agg_path_sequences_1h', 'agg_request_timing_1h',
'agg_ip_behavior_1h', 'agg_resource_cascade_1h',
'ml_detected_anomalies', 'ml_all_scores', 'audit_logs'
);
-- Expected: 14
-- Attendu : 14
-- Dictionnaires
SELECT count() AS dicts
FROM system.dictionaries
WHERE database = 'ja4_processing';
-- Attendu : 7
-- Vues matérialisées dans ja4_logs
SELECT count() AS mvs_logs
FROM system.tables
WHERE database = 'ja4_logs' AND engine = 'MaterializedView';
-- Attendu : 1
-- Vues matérialisées dans ja4_processing
SELECT count() AS mvs_proc
FROM system.tables
WHERE database = 'ja4_processing' AND engine = 'MaterializedView';
-- Attendu : 6
```
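The expected counts above lend themselves to a scripted check. The helper below is a hypothetical sketch (`expect_count` is an assumed name) that compares one observed count with its expected value and returns a non-zero status on mismatch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare an observed object count with the expected one.
expect_count() {
    local label="$1" expected="$2" actual="$3"
    if [ "$actual" -eq "$expected" ]; then
        echo "OK   $label: $actual"
    else
        echo "FAIL $label: expected $expected, got $actual"
        return 1
    fi
}
```

A call could look like `expect_count dicts 7 "$(clickhouse-client --query "SELECT count() FROM system.dictionaries WHERE database = 'ja4_processing'")"`, with the expected value taken from the comments in the queries above.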
## Rollback Notes
---
### General Approach
## Rollback
ClickHouse does not support transactional DDL. To roll back a migration:
### Approche générale
1. **Tables**: `DROP TABLE IF EXISTS ja4_processing.<table_name>`
2. **Materialized Views**: `DROP VIEW IF EXISTS ja4_processing.<mv_name>` (drop MV before its target table)
3. **Dictionaries**: `DROP DICTIONARY IF EXISTS ja4_processing.<dict_name>`
4. **Views**: `DROP VIEW IF EXISTS ja4_processing.<view_name>`
5. **Users**: `DROP USER IF EXISTS <username>`
ClickHouse ne supporte pas les DDL transactionnels. Pour annuler une migration :
### Rollback Order (Reverse of Apply)
1. **Vues matérialisées** : supprimer la MV **avant** sa table cible
2. **Dictionnaires** : supprimer le dictionnaire **après** les vues/MVs qui l'utilisent
3. **Tables** : `DROP TABLE IF EXISTS`
4. **Vues** : `DROP VIEW IF EXISTS`
5. **Utilisateurs** : `DROP USER IF EXISTS`
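Because rollback proceeds in the reverse of the apply order, the file sequence to walk through can be derived from the migration directory itself. The sketch below is illustrative only: `rollback_order` is a hypothetical name, and it lists files rather than dropping anything.

```bash
#!/usr/bin/env bash
# Sketch: print migration file names in reverse numeric order, i.e. the order
# in which their objects should be dropped during a rollback.
rollback_order() {
    local dir="$1" f
    for f in "$dir"/[0-9][0-9]_*.sql; do
        # Skip the literal pattern left behind when the glob matches nothing.
        [ -e "$f" ] && echo "${f##*/}"
    done | sort -r
}
```

Within each file, the per-object ordering from the list above still applies: materialized views first, then the dictionaries and tables they depend on.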
### Ordre de rollback (inverse de l'application)
```sql
-- 09: Audit
-- 12 : Tables et vue de thèse
DROP VIEW IF EXISTS ja4_processing.view_thesis_features_1h;
DROP VIEW IF EXISTS ja4_processing.view_resource_cascade_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_resource_cascade_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_ip_behavior_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_request_timing_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_path_sequences_1h;
DROP TABLE IF EXISTS ja4_processing.agg_resource_cascade_1h;
DROP TABLE IF EXISTS ja4_processing.agg_ip_behavior_1h;
DROP TABLE IF EXISTS ja4_processing.agg_request_timing_1h;
DROP TABLE IF EXISTS ja4_processing.agg_path_sequences_1h;
-- 11 : Vues dashboard
DROP VIEW IF EXISTS ja4_processing.view_dashboard_entities;
DROP VIEW IF EXISTS ja4_processing.view_dashboard_user_agents;
DROP VIEW IF EXISTS ja4_processing.view_host_ip_ja4_rotation;
DROP VIEW IF EXISTS ja4_processing.view_form_bruteforce_detected;
-- 10 : Index de performance (pas de rollback nécessaire — idempotent)
-- 09 : Table d'audit
DROP TABLE IF EXISTS ja4_processing.audit_logs;
-- 08: Users
-- 08 : Utilisateurs
DROP USER IF EXISTS data_writer;
DROP USER IF EXISTS analyst;
-- 07: AI Features View
-- 07 : Vue de features AI
DROP VIEW IF EXISTS ja4_processing.view_ai_features_1h;
-- 06: ML Tables
-- 06 : Tables ML
DROP VIEW IF EXISTS ja4_processing.view_ip_recurrence;
DROP TABLE IF EXISTS ja4_processing.ml_all_scores;
DROP TABLE IF EXISTS ja4_processing.ml_detected_anomalies;
-- 05: Aggregation
-- 05 : Agrégations + dictionnaires de réputation
DROP VIEW IF EXISTS ja4_processing.mv_agg_header_fingerprint_1h;
DROP VIEW IF EXISTS ja4_processing.mv_agg_host_ip_ja4_1h;
DROP TABLE IF EXISTS ja4_processing.agg_header_fingerprint_1h;
DROP TABLE IF EXISTS ja4_processing.agg_host_ip_ja4_1h;
DROP DICTIONARY IF EXISTS ja4_processing.dict_asn_reputation;
DROP DICTIONARY IF EXISTS ja4_processing.dict_browser_ja4;
DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ja4;
DROP DICTIONARY IF EXISTS ja4_processing.dict_bot_ip;
-- 04: MV + http_logs
-- 04 : MV + http_logs
DROP VIEW IF EXISTS ja4_logs.mv_http_logs;
DROP TABLE IF EXISTS ja4_logs.http_logs;
-- 03: Anubis
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_country;
-- 03 : Anubis
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_asn;
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ip;
DROP DICTIONARY IF EXISTS ja4_processing.dict_anubis_ua;
DROP TABLE IF EXISTS ja4_processing.anubis_country_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_asn_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_ip_rules;
DROP TABLE IF EXISTS ja4_processing.anubis_ua_rules;
-- 02: Dictionaries
-- 02 : Dictionnaire ASN + tables de référence
DROP DICTIONARY IF EXISTS ja4_processing.dict_iplocate_asn;
DROP TABLE IF EXISTS ja4_processing.bot_ja4;
DROP TABLE IF EXISTS ja4_processing.bot_ip;
DROP TABLE IF EXISTS ja4_processing.ref_bot_networks;
-- 01: Raw Tables
-- 01 : Table brute
DROP TABLE IF EXISTS ja4_logs.http_logs_raw;
-- 00: Database
-- 00 : Bases de données
DROP DATABASE IF EXISTS ja4_processing;
DROP DATABASE IF EXISTS ja4_logs;
```
### Important Notes
### Notes importantes
- **Data loss**: Dropping tables destroys all data. Always back up before rollback.
- **MV dependency**: Materialized views must be dropped before their target tables.
- **Dictionary dependency**: Views/MVs using dictionaries will fail if dictionaries are dropped while they still reference them.
- **Idempotent re-apply**: After rollback, migrations can be safely re-applied since they use `IF NOT EXISTS`.
- **`04_mv_http_logs.sql`** is the canonical version of the MV, superseding any base version in `services/correlator/sql/init.sql`.
- **Perte de données** : la suppression d'une table détruit toutes ses données.
Toujours sauvegarder avant un rollback.
- **Dépendance MV** : les vues matérialisées doivent être supprimées **avant**
leur table cible.
- **Dépendance dictionnaire** : les vues/MVs utilisant `dictGet()` échoueront
si le dictionnaire référencé est supprimé.
- **Ré-application idempotente** : après un rollback, les migrations peuvent être
ré-appliquées sans risque grâce aux clauses `IF NOT EXISTS`.
- **`04_mv_http_logs.sql`** est la version canonique de la vue matérialisée,
remplaçant toute version antérieure dans `services/correlator/sql/init.sql`.
---
## Migrations post-déploiement
Le répertoire `services/correlator/sql/migrations/` contient des instructions
`ALTER TABLE` pour les déploiements existants. Les appliquer manuellement :
```bash
clickhouse-client --multiquery < services/correlator/sql/migrations/<fichier>.sql
```
Ces migrations sont distinctes du schéma de base et ne sont nécessaires que
pour mettre à jour des instances déjà en production.

@@ -1,334 +1,653 @@
# Database Schema
# Schéma de base de données
The ja4-platform uses ClickHouse as its central data store with database `ja4_processing`. This document describes every table, materialized view, dictionary, and view in the schema.
La plateforme ja4-platform utilise ClickHouse comme entrepôt de données central.
Le schéma est réparti sur **deux bases de données** configurables via variables
d'environnement (`CLICKHOUSE_DB_LOGS`, `CLICKHOUSE_DB_PROCESSING`) :
## Tables
| Variable | Défaut | Rôle |
|----------|--------|------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | Ingestion brute + logs HTTP parsés |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Agrégations, ML, vues, dictionnaires, audit |
Les vues matérialisées dans `ja4_processing` lisent depuis `ja4_logs` (références
inter-bases). Le schéma complet est défini dans 13 fichiers SQL ordonnés dans
`shared/clickhouse/` et déployé via `deploy_schema.sh`.
---
## Récapitulatif global
| Catégorie | Nombre | Objets |
|-----------|--------|--------|
| **Bases de données** | 2 | `ja4_logs`, `ja4_processing` |
| **Tables** | 16 (2 dans `ja4_logs`, 14 dans `ja4_processing`) | `http_logs_raw`, `http_logs`, `ref_bot_networks`, `bot_ip`, `bot_ja4`, `anubis_ip_rules`, `anubis_asn_rules`, `agg_host_ip_ja4_1h`, `agg_header_fingerprint_1h`, `agg_path_sequences_1h`, `agg_request_timing_1h`, `agg_ip_behavior_1h`, `agg_resource_cascade_1h`, `ml_detected_anomalies`, `ml_all_scores`, `audit_logs` |
| **Dictionnaires** | 7 | `dict_iplocate_asn`, `dict_bot_ip`, `dict_bot_ja4`, `dict_browser_ja4`, `dict_asn_reputation`, `dict_anubis_ip`, `dict_anubis_asn` |
| **Vues matérialisées** | 7 | `mv_http_logs`, `mv_agg_host_ip_ja4_1h`, `mv_agg_header_fingerprint_1h`, `mv_agg_path_sequences_1h`, `mv_agg_request_timing_1h`, `mv_agg_ip_behavior_1h`, `mv_agg_resource_cascade_1h` |
| **Vues** | 8 | `view_ip_recurrence`, `view_ai_features_1h`, `view_form_bruteforce_detected`, `view_host_ip_ja4_rotation`, `view_dashboard_user_agents`, `view_dashboard_entities`, `view_resource_cascade_1h`, `view_thesis_features_1h` |
---
## Rétention des données (TTL)
| Table | TTL | Clé de partition |
|-------|-----|------------------|
| `http_logs_raw` | 2 heures | `toStartOfHour(ingest_time)` |
| `http_logs` | 30 jours | `toDate(log_date)` |
| `agg_host_ip_ja4_1h` | 7 jours | `toDate(window_start)` |
| `agg_header_fingerprint_1h` | 7 jours | `toDate(window_start)` |
| `agg_path_sequences_1h` | 7 jours | `toDate(window_start)` |
| `agg_request_timing_1h` | 7 jours | `toDate(window_start)` |
| `agg_ip_behavior_1h` | 7 jours | `toDate(window_start)` |
| `agg_resource_cascade_1h` | 7 jours | `toDate(window_start)` |
| `ml_detected_anomalies` | 7 jours | `toDate(detected_at)` |
| `ml_all_scores` | 7 jours | `toDate(window_start)` |
| `audit_logs` | 90 jours | `toDate(timestamp)` |
Toutes les tables d'agrégation et ML utilisent `ttl_only_drop_parts = 1` : une part
n'est supprimée que lorsque toutes ses lignes ont expiré, ce qui évite les réécritures partielles.
---
## Tables — Base `ja4_logs`
### http_logs_raw
Raw JSON ingest table — direct target for correlator INSERTs.
Table d'ingestion brute — cible directe des INSERTs du correlator.
| Column | Type | Description |
|--------|------|-------------|
| `raw_json` | String (ZSTD(3)) | Complete correlated log as JSON string |
| `ingest_time` | DateTime | Insertion timestamp (default: `now()`) |
| Colonne | Type | Description |
|---------|------|-------------|
| `raw_json` | String CODEC(ZSTD(3)) | Log corrélé complet au format JSON |
| `ingest_time` | DateTime DEFAULT `now()` | Horodatage d'insertion |
- **Engine**: MergeTree
- **Partition by**: `toDate(ingest_time)`
- **Order by**: `ingest_time`
- **TTL**: `ingest_time + INTERVAL 1 DAY`
- **Moteur** : MergeTree
- **Partition** : `toStartOfHour(ingest_time)`
- **Tri** : `ingest_time`
- **TTL** : `ingest_time + INTERVAL 2 HOUR`
- **Settings** : `index_granularity = 8192, ttl_only_drop_parts = 1`
---
### http_logs
Parsed and enriched HTTP log table — populated by `mv_http_logs` materialized view.
Table de logs HTTP parsés et enrichis — alimentée par la vue matérialisée
`mv_http_logs`.
| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `time` | DateTime | No | Request timestamp |
| `log_date` | Date | No | Date partition key (default: `toDate(time)`) |
| `src_ip` | IPv4 | No | Client source IP |
| `src_port` | UInt16 | No | Client source port |
| `dst_ip` | IPv4 | No | Server destination IP |
| `dst_port` | UInt16 | No | Server destination port |
| `src_asn` | UInt32 | No | Source ASN (enriched via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | No | Source country code |
| `src_as_name` | LowCardinality(String) | No | AS name |
| `src_org` | LowCardinality(String) | No | AS organization |
| `src_domain` | LowCardinality(String) | No | AS domain |
| `method` | LowCardinality(String) | No | HTTP method |
| `scheme` | LowCardinality(String) | No | URL scheme (http/https) |
| `host` | LowCardinality(String) | No | HTTP Host header |
| `path` | String (ZSTD(3)) | No | Request path |
| `query` | String (ZSTD(3)) | No | Query string |
| `http_version` | LowCardinality(String) | No | HTTP version |
| `orphan_side` | LowCardinality(String) | No | Orphan side (A, B, or empty) |
| `correlated` | UInt8 | No | 1 if HTTP+TLS correlated |
| `keepalives` | UInt16 | No | Keep-alive request sequence |
| `a_timestamp` | UInt64 | No | Source A event timestamp (ns) |
| `b_timestamp` | UInt64 | No | Source B event timestamp (ns) |
| `conn_id` | String (ZSTD(3)) | No | TCP connection identifier |
| `ip_meta_df` | UInt8 | No | IP Don't Fragment flag |
| `ip_meta_id` | UInt16 | No | IP identification |
| `ip_meta_total_length` | UInt16 | No | IP total length |
| `ip_meta_ttl` | UInt8 | No | IP TTL |
| `tcp_meta_options` | LowCardinality(String) | No | TCP options list |
| `tcp_meta_window_size` | UInt32 | No | TCP window size |
| `tcp_meta_mss` | UInt16 | No | TCP MSS |
| `tcp_meta_window_scale` | UInt8 | No | TCP window scale |
| `syn_to_clienthello_ms` | Int32 | No | SYN-to-ClientHello timing (ms) |
| `tls_version` | LowCardinality(String) | No | TLS version |
| `tls_sni` | LowCardinality(String) | No | TLS SNI |
| `tls_alpn` | LowCardinality(String) | No | TLS ALPN |
| `ja3` | String (ZSTD(3)) | No | JA3 fingerprint |
| `ja3_hash` | String (ZSTD(3)) | No | JA3 MD5 hash |
| `ja4` | String (ZSTD(3)) | No | JA4 fingerprint |
| `client_headers` | String (ZSTD(3)) | No | Comma-separated header names |
| `header_user_agent` | String (ZSTD(3)) | No | User-Agent header |
| `header_accept` | String (ZSTD(3)) | No | Accept header |
| `header_accept_encoding` | String (ZSTD(3)) | No | Accept-Encoding header |
| `header_accept_language` | String (ZSTD(3)) | No | Accept-Language header |
| `header_content_type` | String (ZSTD(3)) | No | Content-Type header |
| `header_x_request_id` | String (ZSTD(3)) | No | X-Request-Id header |
| `header_x_trace_id` | String (ZSTD(3)) | No | X-Trace-Id header |
| `header_x_forwarded_for` | String (ZSTD(3)) | No | X-Forwarded-For header |
| `header_sec_ch_ua` | String (ZSTD(3)) | No | Sec-CH-UA header |
| `header_sec_ch_ua_mobile` | String (ZSTD(3)) | No | Sec-CH-UA-Mobile header |
| `header_sec_ch_ua_platform` | String (ZSTD(3)) | No | Sec-CH-UA-Platform header |
| `header_sec_fetch_dest` | String (ZSTD(3)) | No | Sec-Fetch-Dest header |
| `header_sec_fetch_mode` | String (ZSTD(3)) | No | Sec-Fetch-Mode header |
| `header_sec_fetch_site` | String (ZSTD(3)) | No | Sec-Fetch-Site header |
| `anubis_bot_name` | LowCardinality(String) | No | Anubis-detected bot name (default: '') |
| `anubis_bot_action` | LowCardinality(String) | No | Anubis-detected bot action (default: '') |
| `anubis_bot_category` | LowCardinality(String) | No | Anubis-detected bot category (default: '') |
| Colonne | Type | Description |
|---------|------|-------------|
| `time` | DateTime | Horodatage de la requête |
| `log_date` | Date DEFAULT `toDate(time)` | Clé de partition |
| `src_ip` | IPv4 | IP source du client |
| `src_port` | UInt16 | Port source |
| `dst_ip` | IPv4 | IP destination du serveur |
| `dst_port` | UInt16 | Port destination |
| `src_asn` | UInt32 | ASN source (enrichi via dict_iplocate_asn) |
| `src_country_code` | LowCardinality(String) | Code pays |
| `src_as_name` | LowCardinality(String) | Nom de l'AS |
| `src_org` | LowCardinality(String) | Organisation de l'AS |
| `src_domain` | LowCardinality(String) | Domaine de l'AS |
| `method` | LowCardinality(String) | Méthode HTTP |
| `scheme` | LowCardinality(String) | Schéma URL (http/https) |
| `host` | LowCardinality(String) | En-tête Host HTTP |
| `path` | String CODEC(ZSTD(3)) | Chemin de la requête |
| `query` | String CODEC(ZSTD(3)) | Paramètres de requête |
| `http_version` | LowCardinality(String) | Version HTTP |
| `orphan_side` | LowCardinality(String) | Côté orphelin (A, B, ou vide) |
| `correlated` | UInt8 | 1 si corrélation HTTP+TLS réussie |
| `keepalives` | UInt16 | Numéro de séquence keep-alive |
| `a_timestamp` | UInt64 | Horodatage source A (ns) |
| `b_timestamp` | UInt64 | Horodatage source B (ns) |
| `conn_id` | String CODEC(ZSTD(3)) | Identifiant de connexion TCP |
| `ip_meta_df` | UInt8 | Drapeau Don't Fragment |
| `ip_meta_id` | UInt16 | Identification IP |
| `ip_meta_total_length` | UInt16 | Longueur totale IP |
| `ip_meta_ttl` | UInt8 | TTL IP |
| `tcp_meta_options` | LowCardinality(String) | Options TCP |
| `tcp_meta_window_size` | UInt32 | Taille de fenêtre TCP |
| `tcp_meta_mss` | UInt16 | MSS TCP |
| `tcp_meta_window_scale` | UInt8 | Facteur d'échelle de fenêtre TCP |
| `syn_to_clienthello_ms` | Int32 | Délai SYN→ClientHello (ms) |
| `tls_version` | LowCardinality(String) | Version TLS |
| `tls_sni` | LowCardinality(String) | SNI TLS |
| `tls_alpn` | LowCardinality(String) | ALPN TLS |
| `ja3` | String CODEC(ZSTD(3)) | Empreinte JA3 |
| `ja3_hash` | String CODEC(ZSTD(3)) | Hash MD5 JA3 |
| `ja4` | String CODEC(ZSTD(3)) | Empreinte JA4 |
| `client_headers` | String CODEC(ZSTD(3)) | Noms d'en-têtes séparés par virgule |
| `header_user_agent` | String CODEC(ZSTD(3)) | En-tête User-Agent |
| `header_accept` | String CODEC(ZSTD(3)) | En-tête Accept |
| `header_accept_encoding` | String CODEC(ZSTD(3)) | En-tête Accept-Encoding |
| `header_accept_language` | String CODEC(ZSTD(3)) | En-tête Accept-Language |
| `header_content_type` | String CODEC(ZSTD(3)) | En-tête Content-Type |
| `header_x_request_id` | String CODEC(ZSTD(3)) | En-tête X-Request-Id |
| `header_x_trace_id` | String CODEC(ZSTD(3)) | En-tête X-Trace-Id |
| `header_x_forwarded_for` | String CODEC(ZSTD(3)) | En-tête X-Forwarded-For |
| `header_sec_ch_ua` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA |
| `header_sec_ch_ua_mobile` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA-Mobile |
| `header_sec_ch_ua_platform` | String CODEC(ZSTD(3)) | En-tête Sec-CH-UA-Platform |
| `header_sec_fetch_dest` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Dest |
| `header_sec_fetch_mode` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Mode |
| `header_sec_fetch_site` | String CODEC(ZSTD(3)) | En-tête Sec-Fetch-Site |
| `anubis_bot_name` | LowCardinality(String) DEFAULT `''` | Nom du bot détecté par Anubis |
| `anubis_bot_action` | LowCardinality(String) DEFAULT `''` | Action Anubis |
| `anubis_bot_category` | LowCardinality(String) DEFAULT `''` | Catégorie Anubis |
- **Engine**: MergeTree
- **Partition by**: `log_date`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 7 DAY`
Data-skipping indexes:
| Index | Type | Granularity |
|-------|------|-------------|
| `idx_src_ip` | bloom_filter(0.01) | 4 |
| `idx_ja4` | bloom_filter(0.01) | 4 |
- **Engine**: MergeTree
- **Partition by**: `toDate(log_date)`
- **Order by**: `(time, src_ip, dst_ip, ja4)`
- **TTL**: `log_date + INTERVAL 30 DAY`
- **Settings**: `index_granularity = 8192, ttl_only_drop_parts = 1`
---
### agg_host_ip_ja4_1h
## Tables: `ja4_processing` Database
Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses `AggregatingMergeTree` with `SimpleAggregateFunction` and `AggregateFunction` columns for incremental aggregation.
### Anubis Tables
Key columns include: `window_start`, `src_ip`, `ja4`, `host`, `src_asn`, `hits`, `count_post`, `uniq_paths`, `uniq_query_params`, `tcp_jitter_variance`, `unique_src_ports`, `unique_conn_id`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `uniq_ua`, `url_depth_variance`, `count_anomalous_payload`, `uniq_ja3`, `avg_syn_ms`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `ip_df_var`, `avg_ttl`, `ttl_var`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`.
Rule tables for Anubis crawler detection.
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
| Table | Sort key | Columns | Engine |
|-------|-----------|----------|--------|
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id` (UInt64), `has_ua` (UInt8), `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
---
### agg_header_fingerprint_1h
Header-level behavioral fingerprint aggregation per `(src_ip)` per hour.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`
---
### ml_detected_anomalies
Anomaly detections above the threat threshold.
Key columns: `detected_at`, `src_ip` (IPv6), `ja4`, `host`, `bot_name`, `anomaly_score` (Float32), `raw_anomaly_score` (Float32), `threat_level`, `model_name`, `recurrence` (UInt32), `campaign_id` (Int32), `reason`, plus all ML feature columns and Anubis enrichment (`anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`).
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 30 DAY`
---
### ml_all_scores
All ML classifications (no threshold filter) for observability.
Key columns: `detected_at`, `window_start`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `correlated`, `campaign_id`, plus ASN and Anubis enrichment.
- **Engine**: ReplacingMergeTree(detected_at)
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 3 DAY`
> **Note**: the `anubis_ua_rules` and `anubis_country_rules` tables and the
> `dict_anubis_ua` and `dict_anubis_country` dictionaries have been removed.
> Anubis enrichment now relies on two levels: IP/CIDR → ASN.
---
### ref_bot_networks
Reference table of known bot networks (CIDR).
| Column | Type | Description |
|--------|------|-------------|
| `network` | IPv6CIDR | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |
| Column | Type | Description |
|--------|------|-------------|
| `network` | String | Network CIDR |
| `bot_name` | LowCardinality(String) | Bot name |
| `is_legitimate` | UInt8 | 1 = legitimate bot |
| `last_update` | DateTime | Last update timestamp |
- **Engine**: ReplacingMergeTree(last_update)
- **Order by**: `(network, bot_name)`
---
### bot_ip / bot_ja4
CSV file-backed flat tables for fast bot lookups.
- `bot_ip`: single column `ip` (String) — Engine: File(CSV, 'bot_ip.csv')
- `bot_ja4`: single column `ja4` (String) — Engine: File(CSV, 'bot_ja4.csv')
| Table | Column | Engine |
|-------|---------|--------|
| `bot_ip` | `ip` (String) | File(CSV, `'bot_ip.csv'`) |
| `bot_ja4` | `ja4` (String) | File(CSV, `'bot_ja4.csv'`) |
---
### Anubis Rule Tables
### agg_host_ip_ja4_1h
| Table | Key | Columns | Engine |
|-------|-----|---------|--------|
| `anubis_ua_rules` | `id` (UInt64) | `parent_id`, `regexp`, `keys`, `values` | ReplacingMergeTree |
| `anubis_ip_rules` | `prefix` (String) | `bot_name`, `action`, `rule_id`, `has_ua`, `category` | ReplacingMergeTree |
| `anubis_asn_rules` | `asn` (UInt32) | `bot_name`, `action`, `category` | ReplacingMergeTree |
| `anubis_country_rules` | `country_code` (String) | `bot_name`, `action`, `category` | ReplacingMergeTree |
Behavioral aggregation per `(src_ip, ja4, host)` per hour. Uses
`SimpleAggregateFunction` and `AggregateFunction` columns for incremental
aggregation.
**Key columns**: `window_start` (DateTime), `src_ip` (IPv6), `ja4` (String),
`host` (String), `src_asn` (UInt32).
**Aggregation columns (~50)**:
| Category | Columns |
|----------|---------|
| Counters (SimpleAggregateFunction sum) | `hits`, `count_post`, `orphan_count`, `ip_id_zero_count`, `mss_1460_count`, `count_assets`, `count_no_referer`, `tls12_count`, `count_head`, `count_no_sec_fetch`, `count_generic_accept`, `count_http10`, `count_no_wscale`, `count_correlated`, `count_no_accept_enc`, `count_http_scheme`, `count_xff`, `count_unusual_ct`, `count_non_std_port`, `count_login_post` |
| Distinct values (AggregateFunction uniq) | `uniq_paths`, `uniq_query_params`, `unique_src_ports`, `unique_conn_id`, `uniq_ua`, `uniq_ja3` |
| Variances (AggregateFunction varPop) | `tcp_jitter_variance`, `total_ip_length_var`, `url_depth_variance`, `ip_df_var` |
| Averages (AggregateFunction avg/avgIf) | `avg_syn_ms`, `avg_ttl` |
| Conditional variance (AggregateFunction varPopIf) | `ttl_var` |
**Projection**: `proj_by_ip` → ORDER BY `(src_ip, window_start, ja4, host)`
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
- **TTL**: `window_start + INTERVAL 7 DAY` (partition `toDate(window_start)`)
- **Settings**: `deduplicate_merge_projection_mode = 'drop'`
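
Incremental aggregation works because each data part stores a mergeable intermediate state rather than a final value. The sketch below illustrates the idea in Python for a population variance such as `tcp_jitter_variance`, using the standard pairwise-merge formula. The `VarState` class and `state_of` helper are hypothetical and are not meant to mirror ClickHouse's internal state layout.

```python
from dataclasses import dataclass


@dataclass
class VarState:
    """Hypothetical mergeable state for a population variance."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Welford's online update for one observation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "VarState") -> "VarState":
        # Combine two partial states without revisiting the raw rows.
        if self.n == 0:
            return VarState(other.n, other.mean, other.m2)
        if other.n == 0:
            return VarState(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return VarState(n, mean, m2)

    def var_pop(self) -> float:
        return self.m2 / self.n if self.n else 0.0


def state_of(values):
    """Build a state from raw values, as one insert block would."""
    state = VarState()
    for v in values:
        state.add(v)
    return state
```

Merging the states of two insert blocks gives the same variance as aggregating all rows at once, which is the property the hourly aggregate tables rely on.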
---
### agg_header_fingerprint_1h
Header fingerprint aggregation per `(src_ip)` per hour.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Hour window start |
| `src_ip` | IPv6 | Source IP |
| `header_order_hash` | SimpleAggregateFunction(any, String) | Hash of header order |
| `header_count` | SimpleAggregateFunction(max, UInt16) | Max header count |
| `has_accept_language` | SimpleAggregateFunction(max, UInt8) | Accept-Language presence |
| `has_cookie` | SimpleAggregateFunction(max, UInt8) | Cookie presence |
| `has_referer` | SimpleAggregateFunction(max, UInt8) | Referer presence |
| `modern_browser_score` | SimpleAggregateFunction(max, UInt8) | Browser compliance score |
| `ua_ch_mismatch` | SimpleAggregateFunction(max, UInt8) | UA/Client Hints mismatch |
| `sec_ch_mobile_mismatch` | SimpleAggregateFunction(max, UInt8) | Sec-CH-UA-Mobile mismatch |
| `sec_fetch_mode` | SimpleAggregateFunction(any, String) | Sec-Fetch-Mode value |
| `sec_fetch_dest` | SimpleAggregateFunction(any, String) | Sec-Fetch-Dest value |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`
- **TTL**: `window_start + INTERVAL 7 DAY` (partition `toDate(window_start)`)
---
### agg_path_sequences_1h (thesis §5.1)
Path sequence entropy: Markov transitions over normalized paths.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Window start |
| `src_ip` | IPv6 | Source IP |
| `ja4` | LowCardinality(String) | JA4 fingerprint |
| `host` | LowCardinality(String) | Target host |
| `path_sequence` | AggregateFunction(groupArray(100), Tuple(UInt32, String)) | Ordered sequence of (timestamp, path) |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
- **Partition by**: `toDate(window_start)`
- **TTL**: 7 days
- **Settings**: `ttl_only_drop_parts = 1`
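
As an illustration of what can be derived from `path_sequence`, the Python sketch below computes a normalized Shannon entropy over consecutive path transitions, in the spirit of the `path_transition_entropy` feature of `view_thesis_features_1h`. The normalization (dividing by the log of the number of distinct observed transitions) is an assumption; the view's actual SQL may normalize differently.

```python
import math
from collections import Counter


def path_transition_entropy(path_sequence):
    """Normalized Shannon entropy of consecutive path transitions.

    path_sequence: list of (timestamp, path) tuples, in any order.
    Returns a value in [0, 1]; 0.0 when fewer than two distinct
    transitions are observed.
    """
    ordered = [path for _, path in sorted(path_sequence)]
    transitions = Counter(zip(ordered, ordered[1:]))
    if len(transitions) < 2:
        return 0.0
    total = sum(transitions.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in transitions.values())
    # Assumption: normalize by the maximum entropy of the observed transition set.
    return entropy / math.log2(len(transitions))
```

A client that loops over the same page yields 0.0, while one that spreads its requests evenly over many distinct transitions approaches 1.0.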
---
### agg_request_timing_1h (thesis §5.3)
Request cadence: analysis of the coefficient of variation and of bursts.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Window start |
| `src_ip` | IPv6 | Source IP |
| `ja4` | LowCardinality(String) | JA4 fingerprint |
| `host` | LowCardinality(String) | Target host |
| `request_times` | AggregateFunction(groupArrayIf(500), UInt64, UInt8) | Request timestamps (filtered on `a_timestamp > 0`) |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
- **Partition by**: `toDate(window_start)`
- **TTL**: 7 days
- **Settings**: `ttl_only_drop_parts = 1`
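
The cadence features built on `request_times` can be sketched in Python as follows. The burst (Δt < 100 ms) and pause (Δt > 5 s) thresholds are those listed for `view_thesis_features_1h`; the millisecond unit of the timestamps and the use of a population standard deviation are assumptions.

```python
import statistics


def cadence_features(request_times_ms):
    """Cadence features from request timestamps (assumed to be in milliseconds)."""
    times = sorted(request_times_ms)
    deltas = [b - a for a, b in zip(times, times[1:])]
    if not deltas:
        return {"cadence_cv": 0.0, "burst_ratio": 0.0, "pause_ratio": 0.0}
    mean = statistics.fmean(deltas)
    # Coefficient of variation: population standard deviation over the mean gap.
    cv = statistics.pstdev(deltas) / mean if mean > 0 else 0.0
    return {
        "cadence_cv": cv,
        "burst_ratio": sum(d < 100 for d in deltas) / len(deltas),   # gaps under 100 ms
        "pause_ratio": sum(d > 5000 for d in deltas) / len(deltas),  # gaps over 5 s
    }
```

A perfectly regular client has a `cadence_cv` of 0, which is typical of scripted traffic; human browsing tends to mix bursts and pauses.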
---
### agg_ip_behavior_1h (thesis §5.5 / §5.8)
JA4 drift and cross-domain behavior per IP.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Window start |
| `src_ip` | IPv6 | Source IP |
| `ja4_sequence` | AggregateFunction(groupArray(200), Tuple(UInt32, String)) | Time-ordered sequence of (timestamp, ja4) |
| `host_hits_keys` | AggregateFunction(sumMap, Array(String), Array(UInt64)) | Host → hits distribution |
| `host_count` | AggregateFunction(uniq, String) | Number of distinct hosts |
| `total_hits` | SimpleAggregateFunction(sum, UInt64) | Total requests |
| `first_seen` | SimpleAggregateFunction(min, DateTime) | First observation |
| `last_seen` | SimpleAggregateFunction(max, DateTime) | Last observation |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip)`
- **Partition by**: `toDate(window_start)`
- **TTL**: 7 days
- **Settings**: `ttl_only_drop_parts = 1`
---
### agg_resource_cascade_1h (thesis §5.4)
Resource-loading dependency tree.
| Column | Type | Description |
|--------|------|-------------|
| `window_start` | DateTime | Window start |
| `src_ip` | IPv6 | Source IP |
| `ja4` | LowCardinality(String) | JA4 fingerprint |
| `host` | LowCardinality(String) | Target host |
| `resource_loads` | AggregateFunction(groupArray(200), Tuple(UInt32, UInt8)) | Loads as (timestamp, is_asset) |
- **Engine**: AggregatingMergeTree
- **Order by**: `(window_start, src_ip, ja4, host)`
- **Partition by**: `toDate(window_start)`
- **TTL**: 7 days
- **Settings**: `ttl_only_drop_parts = 1`
---
### ml_detected_anomalies
Anomaly detections above the threat threshold.
**Main columns**:
| Column | Type | Description |
|--------|------|-------------|
| `detected_at` | DateTime | Detection timestamp |
| `src_ip` | IPv6 | Source IP |
| `ja4` | String | JA4 fingerprint |
| `host` | String | Target host |
| `bot_name` | String | Name of the identified bot |
| `browser_family` | LowCardinality(String) DEFAULT `''` | Browser family |
| `anomaly_score` | Float32 | Normalized anomaly score |
| `raw_anomaly_score` | Float32 DEFAULT `0` | Raw score before normalization |
| `threat_level` | String | Threat level (CRITICAL, HIGH, MEDIUM, LOW) |
| `model_name` | String | Model name (Complet, Applicatif, etc.) |
| `recurrence` | UInt32 | Number of previous detections |
| `campaign_id` | Int32 DEFAULT `-1` | HDBSCAN campaign identifier |
| `reason` | String | Explanation of the detection |
**Network context columns**: `asn_number`, `asn_org`, `asn_detail`,
`asn_domain`, `country_code`, `asn_label` (all String).
**ML feature columns (41)**: `hits`, `hit_velocity`, `fuzzing_index`,
`post_ratio`, `port_exhaustion_ratio`, `max_keepalives`, `orphan_ratio`,
`tcp_jitter_variance`, `tcp_shared_count`, `true_window_size`, `window_mss_ratio`,
`alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`, `header_count`,
`has_accept_language`, `has_cookie`, `has_referer`, `modern_browser_score`,
`is_headless`, `ua_ch_mismatch`, `header_order_shared_count`, `ip_id_zero_ratio`,
`request_size_variance`, `multiplexing_efficiency`, `mss_mobile_mismatch`,
`correlated`, `asset_ratio`, `direct_access_ratio`, `is_ua_rotating`,
`distinct_ja4_count`, `src_port_density`, `ja4_asn_concentration`,
`ja4_country_concentration`, `is_rare_ja4`, `header_order_confidence`,
`distinct_header_orders`, `temporal_entropy`, `path_diversity_ratio`,
`url_depth_variance`, `anomalous_payload_ratio`.
**Anubis columns**: `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`
(LowCardinality(String) DEFAULT `''`).
**Data-skipping indexes**:
| Index | Type | Granularity |
|-------|------|-------------|
| `idx_detected_at` | minmax | 4 |
| `idx_threat_level` | set(8) | 4 |
| `idx_bot_name` | bloom_filter() | 4 |
- **Engine**: ReplacingMergeTree(detected_at)
- **Partition by**: `toYYYYMMDD(detected_at)`
- **Order by**: `(src_ip)`
- **TTL**: `detected_at + INTERVAL 7 DAY`
- **Settings**: `index_granularity = 8192, ttl_only_drop_parts = 1`
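
With `ReplacingMergeTree(detected_at)` and a sort key of `(src_ip)`, rows sharing the same `src_ip` inside one partition are collapsed at merge time, keeping the row with the greatest `detected_at`. The Python sketch below mimics that end state; the row dictionaries and the `replacing_merge` helper are hypothetical. Real merges run asynchronously, so duplicates can remain visible until a merge happens.

```python
from datetime import datetime


def replacing_merge(rows):
    """Keep, per (daily partition, src_ip), the row with the latest detected_at."""
    latest = {}
    for row in rows:
        # Deduplication never crosses partitions, so the day is part of the key.
        key = (row["detected_at"].date(), row["src_ip"])
        current = latest.get(key)
        # On equal versions the most recently inserted row wins (>=).
        if current is None or row["detected_at"] >= current["detected_at"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: (r["detected_at"], r["src_ip"]))
```

Two detections of the same IP on the same day therefore end up as a single row, while detections on different days survive in their own partitions.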
---
### ml_all_scores
All ML classifications (no threshold filter), kept for observability.
| Column | Type | Description |
|--------|------|-------------|
| `detected_at` | DateTime | Detection timestamp |
| `window_start` | DateTime | Start of the analysis window |
| `src_ip` | IPv6 | Source IP |
| `ja4` | String | JA4 fingerprint |
| `host` | String | Target host |
| `bot_name` | String | Bot name |
| `browser_family` | LowCardinality(String) DEFAULT `''` | Browser family |
| `anomaly_score` | Float32 | Final score |
| `raw_anomaly_score` | Float32 | Raw score |
| `threat_level` | String | Threat level |
| `model_name` | String | Model name |
| `correlated` | UInt8 | 1 if the traffic is correlated |
| `asn_number` | String | ASN number |
| `asn_org` | String | ASN organization |
| `country_code` | String | Country code |
| `asn_label` | String | ASN reputation label |
| `hits` | UInt64 | Number of requests |
| `hit_velocity` | Float32 | Hit velocity |
| `fuzzing_index` | Float32 | Fuzzing index |
| `post_ratio` | Float32 | Ratio of POST requests |
| `campaign_id` | Int32 | Campaign identifier |
| `ae_recon_error` | Float32 DEFAULT `0` | Autoencoder reconstruction error |
| `xgb_prob` | Float32 DEFAULT `0` | Supervised XGBoost probability |
| `anubis_bot_name` | LowCardinality(String) DEFAULT `''` | Anubis bot name |
| `anubis_bot_action` | LowCardinality(String) DEFAULT `''` | Anubis action |
| `anubis_bot_category` | LowCardinality(String) DEFAULT `''` | Anubis category |
**Data-skipping indexes**:
| Index | Type | Granularity |
|-------|------|-------------|
| `idx_detected_at` | minmax | 4 |
| `idx_threat_level` | set(8) | 4 |
- **Engine**: ReplacingMergeTree(detected_at)
- **Partition by**: `toYYYYMMDD(window_start)`
- **Order by**: `(window_start, src_ip, ja4, host, model_name)`
- **TTL**: `window_start + INTERVAL 7 DAY`
- **Settings**: `index_granularity = 8192, ttl_only_drop_parts = 1`
---
### audit_logs
SOC audit trail for tracking dashboard activity.
| Column | Type | Default | Description |
|--------|------|---------|-------------|
| `timestamp` | DateTime | `now()` | Event time |
| `user_name` | LowCardinality(String) | `'soc_user'` | Analyst name |
| `action` | LowCardinality(String) | (none) | Action performed |
| `entity_type` | LowCardinality(String) | `''` | Entity type (ip, ja4, etc.) |
| `entity_id` | String | `''` | Entity identifier |
| `entity_count` | UInt32 | `0` | Entity count |
| `details` | String CODEC(ZSTD(3)) | `''` | JSON details |
| `client_ip` | String | `''` | Analyst client IP |
- **Engine**: MergeTree
- **Partition by**: `toDate(timestamp)`
- **Order by**: `(timestamp, user_name, action)`
- **TTL**: `toDate(timestamp) + INTERVAL 90 DAY`
- **Settings**: `index_granularity = 8192`
---
## Materialized Views
### mv_http_logs
### mv_http_logs (`ja4_logs`)
- **Source**: `http_logs_raw`
- **Target**: `http_logs`
- **Transformation**: Parses `raw_json` via `JSONExtract*` functions, enriches with ASN data from `dict_iplocate_asn` and Anubis bot detection from `dict_anubis_ua`, `dict_anubis_ip`, `dict_anubis_asn`, `dict_anubis_country`. Uses a 5-level priority cascade for Anubis: UA+IP combined > UA only > IP only > ASN > Country.
- **Source**: `ja4_logs.http_logs_raw`
- **Target**: `ja4_logs.http_logs`
- **Transformation**: Parses the `raw_json` field via `JSONExtract*` functions.
Enriches rows with ASN data from `dict_iplocate_asn` and with Anubis bot
detection from `dict_anubis_ip` and `dict_anubis_asn`, using a COALESCE
priority cascade: IP/CIDR → ASN.
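
The two-level cascade can be pictured with the Python sketch below: try the most specific IP/CIDR rule first, then fall back to the ASN rule. The `anubis_lookup` function and the plain dictionaries standing in for `dict_anubis_ip` and `dict_anubis_asn` are hypothetical; the materialized view does this in SQL with dictionary lookups.

```python
import ipaddress


def anubis_lookup(src_ip, src_asn, ip_rules, asn_rules):
    """Return the first matching rule: longest IP/CIDR prefix, then ASN.

    ip_rules: dict mapping CIDR strings to rule dicts (stand-in for dict_anubis_ip).
    asn_rules: dict mapping ASN integers to rule dicts (stand-in for dict_anubis_asn).
    """
    addr = ipaddress.ip_address(src_ip)
    best = None
    for prefix, rule in ip_rules.items():
        net = ipaddress.ip_network(prefix, strict=False)
        if addr.version == net.version and addr in net:
            # Longest-prefix match, as an IP trie lookup would return.
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, rule)
    if best is not None:
        return best[1]
    # Fall back to the ASN level; the empty rule mirrors the '' column defaults.
    return asn_rules.get(src_asn, {"bot_name": "", "action": "", "category": ""})
```

An IP rule always wins over an ASN rule, so a specific crawler range can be treated differently from the rest of its hosting provider.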
### mv_agg_host_ip_ja4_1h
### mv_agg_host_ip_ja4_1h (`ja4_processing`)
- **Source**: `http_logs`
- **Target**: `agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`. Computes counts, unique values, variances, and aggregate functions for 50+ behavioral features.
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_host_ip_ja4_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host, src_asn)`.
Computes ~50 behavioral features: hit counters, POST ratios, path and
parameter uniqueness, TCP jitter, SYN timing, keep-alives, orphan counters,
UA rotation, IP metadata variance, etc.
### mv_agg_header_fingerprint_1h
### mv_agg_header_fingerprint_1h (`ja4_processing`)
- **Source**: `http_logs`
- **Target**: `agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`. Computes header order hash, header count, browser compliance score, Client Hints mismatch.
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_header_fingerprint_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`.
Computes the header order hash, the header count, the browser compliance
score (Sec-CH-UA = 100, UA only = 50), and the UA↔Sec-CH-UA-Platform and
Sec-CH-UA-Mobile mismatches.
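
A minimal Python sketch of the browser compliance score described above. The two score values come from the text; returning 0 when neither header is present is an assumption, and both function names are hypothetical.

```python
def modern_browser_score(sec_ch_ua: str, user_agent: str) -> int:
    """Per-request score: 100 with Sec-CH-UA, 50 with only a User-Agent."""
    if sec_ch_ua:
        return 100
    if user_agent:
        return 50
    return 0  # assumption: neither header present


def hourly_score(requests):
    """Aggregate with max, as SimpleAggregateFunction(max, UInt8) would.

    requests: iterable of (sec_ch_ua, user_agent) pairs for one src_ip and hour.
    """
    return max((modern_browser_score(ch, ua) for ch, ua in requests), default=0)
```

Because the column aggregates with `max`, a single request carrying Client Hints is enough to raise the hourly score to 100.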
### mv_agg_path_sequences_1h (`ja4_processing`)
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_path_sequences_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host)`.
Stores `groupArrayState(100)(tuple(timestamp, path))`.
### mv_agg_request_timing_1h (`ja4_processing`)
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_request_timing_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host)`.
Stores `groupArrayIfState(500)(a_timestamp, a_timestamp > 0)`.
### mv_agg_ip_behavior_1h (`ja4_processing`)
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_ip_behavior_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip)`.
Stores the JA4 sequence, the host→hits sumMap distribution, the distinct
host count, the total hits, and the first-seen and last-seen bounds.
### mv_agg_resource_cascade_1h (`ja4_processing`)
- **Source**: `ja4_logs.http_logs`
- **Target**: `ja4_processing.agg_resource_cascade_1h`
- **Transformation**: Groups by `(toStartOfHour(time), src_ip, ja4, host)`.
Stores `groupArrayState(200)(tuple(timestamp, is_asset))`, where `is_asset` is
determined by a regex match on static file extensions.
---
## Dictionaries
### dict_iplocate_asn
### CSV File-Based Dictionaries
- **Source**: CSV file `/var/lib/clickhouse/user_files/iplocate-ip-to-asn.csv`
- **Key**: `network` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `asn` (UInt32), `country_code`, `name`, `org`, `domain`
- **Lifetime**: 3600-7200 seconds
All files must be placed in `/var/lib/clickhouse/user_files/`.
### dict_bot_ip
| Dictionary | CSV file | Key | Layout | Attributes | Lifetime | Approx. entries |
|------------|----------|-----|--------|------------|----------|-----------------|
| `dict_iplocate_asn` | `iplocate-ip-to-asn.csv` | `network` (String) | IP_TRIE | `asn` (UInt32), `country_code`, `name` | 3600-7200 s | ~714K |
| `dict_bot_ip` | `bot_ip.csv` | `prefix` (String) | IP_TRIE | `bot_name` (String) | 300 s | ~3.5K CIDRs |
| `dict_bot_ja4` | `bot_ja4.csv` | `ja4` (String) | COMPLEX_KEY_HASHED | `bot_name` (String) | 300 s | ~31 |
| `dict_browser_ja4` | `browser_ja4.csv` | `ja4` (String) | COMPLEX_KEY_HASHED | `browser_family`, `tls_library`, `context` | 300 s | ~1.2K |
| `dict_asn_reputation` | `asn_reputation.csv` | `src_asn` (UInt64) | HASHED | `label` (String) | 300 s | ~82K |
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ip.csv`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### ClickHouse-Based Dictionaries
### dict_bot_ja4
| Dictionary | Source table | Key | Layout | Attributes | Lifetime |
|------------|--------------|-----|--------|------------|----------|
| `dict_anubis_ip` | `ja4_processing.anubis_ip_rules` | `prefix` (String) | IP_TRIE | `bot_name`, `action`, `rule_id` (UInt64), `has_ua` (UInt8), `category` | 300-600 s |
| `dict_anubis_asn` | `ja4_processing.anubis_asn_rules` | `asn` (UInt32) | FLAT | `bot_name`, `action`, `category` | 300-600 s |
- **Source**: CSV file `/var/lib/clickhouse/user_files/bot_ja4.csv`
- **Key**: `ja4` (String)
- **Layout**: `COMPLEX_KEY_HASHED`
- **Attributes**: `bot_name` (String)
- **Lifetime**: 300 seconds
### dict_asn_reputation
- **Source**: CSV file `/var/lib/clickhouse/user_files/asn_reputation.csv`
- **Key**: `src_asn` (UInt64)
- **Layout**: `HASHED`
- **Attributes**: `label` (String)
- **Lifetime**: 300 seconds
### dict_anubis_ua
- **Source**: ClickHouse table `anubis_ua_rules`
- **Key**: `regexp` (String)
- **Layout**: `REGEXP_TREE`
- **Attributes**: `bot_name`, `action`, `has_ip`, `rule_id`, `category`
- **Lifetime**: 300-600 seconds
### dict_anubis_ip
- **Source**: ClickHouse table `anubis_ip_rules`
- **Key**: `prefix` (String)
- **Layout**: `IP_TRIE`
- **Attributes**: `bot_name`, `action`, `rule_id`, `has_ua`, `category`
- **Lifetime**: 300-600 seconds
### dict_anubis_asn
- **Source**: ClickHouse table `anubis_asn_rules`
- **Key**: `asn` (UInt32)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300-600 seconds
### dict_anubis_country
- **Source**: ClickHouse table `anubis_country_rules`
- **Key**: `country_code` (String)
- **Layout**: `FLAT`
- **Attributes**: `bot_name`, `action`, `category`
- **Lifetime**: 300-600 seconds
> **Note**: the ClickHouse-based Anubis dictionaries require connection
> credentials to be configured in the SQL files (the default password
> `CHANGE_ME` must be replaced before going to production).
---
## Views
### view_ai_features_1h
Computes 50+ ML features per `(src_ip, ja4, host)` from the last 24 hours by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`. Includes:
Computes 65+ ML features per `(src_ip, ja4, host)` over the last 24 hours
by joining `agg_host_ip_ja4_1h` and `agg_header_fingerprint_1h`.
- Behavioral features: `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`
- Connection features: `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`
- Browser features: `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`
- TLS features: `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch`
- L4 features: `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`
- Reputation: `bot_name` (from dict_bot_ip/dict_bot_ja4), `anubis_bot_name/action/category`
- Derived: `temporal_entropy`, `ja3_diversity_ratio`
| Category | Features |
|----------|----------|
| Behavioral | `hits`, `hit_velocity`, `fuzzing_index`, `post_ratio`, `orphan_ratio`, `asset_ratio`, `direct_access_ratio` |
| Connection | `max_keepalives`, `multiplexing_efficiency`, `port_exhaustion_ratio`, `src_port_density` |
| Browser | `modern_browser_score`, `ua_ch_mismatch`, `header_order_shared_count`, `is_headless` |
| TLS | `alpn_http_mismatch`, `is_alpn_missing`, `sni_host_mismatch` |
| L4 | `tcp_jitter_variance`, `avg_ttl`, `ttl_std`, `syn_timing_cv`, `window_mss_ratio` |
| Reputation | `bot_name` (dict_bot_ip / dict_bot_ja4), `browser_family` (dict_browser_ja4), `asn_label` (dict_asn_reputation), `anubis_bot_name`/`action`/`category` (dict_anubis_ip / dict_anubis_asn) |
| Statistics | `temporal_entropy`, `ja3_diversity_ratio`, `ja4_asn_concentration`, `ja4_country_concentration` |
| P1 | `has_xff`, `unusual_content_type_ratio`, `non_standard_port_ratio`, `login_post_concentration` |
Uses window functions (`sum() OVER`, `count() OVER`, `uniqExact() OVER`)
for the concentration and TCP-sharing features.
### view_ip_recurrence
Aggregates recurrence data from `ml_detected_anomalies` (last 30 days):
```sql
SELECT src_ip, count() AS recurrence,
min(detected_at) AS first_seen, max(detected_at) AS last_seen,
min(anomaly_score) AS worst_score,
argMin(threat_level, anomaly_score) AS worst_threat_level
FROM ml_detected_anomalies GROUP BY src_ip;
SELECT
src_ip,
count() AS recurrence,
min(detected_at) AS first_seen,
max(detected_at) AS last_seen,
max(anomaly_score) AS worst_score,
argMax(threat_level, anomaly_score) AS worst_threat_level
FROM ja4_processing.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 30 DAY
GROUP BY src_ip;
```
### view_form_bruteforce_detected
Form brute-force detection. Source: `agg_host_ip_ja4_1h` (last 24 hours).
Keeps the `(src_ip, host)` combinations with `count_post >= 10`.
Returns `src_ip`, `host`, `ja4` (argMax by hits), `hits`, and
`query_params_count`.
### view_host_ip_ja4_rotation
Detection of JA4 fingerprint rotation per IP. Source: `agg_host_ip_ja4_1h`
(last 24 hours, `ja4 ≠ ''`). Keeps the IPs with `distinct_ja4_count >= 2`.
Returns `src_ip`, `distinct_ja4_count`, `total_hits`, `first_seen`, `last_seen`.
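
The rotation filter can be sketched in Python as follows. The input tuples, the `ja4_rotation` name, and the derivation of `first_seen` and `last_seen` from `window_start` are assumptions made for illustration; the 24-hour window filter is left out.

```python
from collections import defaultdict


def ja4_rotation(rows, min_distinct=2):
    """rows: iterable of (src_ip, ja4, hits, window_start) tuples."""
    per_ip = defaultdict(lambda: {"ja4": set(), "hits": 0, "first": None, "last": None})
    for src_ip, ja4, hits, window_start in rows:
        if ja4 == "":
            continue  # the view ignores empty fingerprints
        entry = per_ip[src_ip]
        entry["ja4"].add(ja4)
        entry["hits"] += hits
        entry["first"] = window_start if entry["first"] is None else min(entry["first"], window_start)
        entry["last"] = window_start if entry["last"] is None else max(entry["last"], window_start)
    # Keep only the IPs that presented at least min_distinct fingerprints.
    return {
        ip: {
            "distinct_ja4_count": len(e["ja4"]),
            "total_hits": e["hits"],
            "first_seen": e["first"],
            "last_seen": e["last"],
        }
        for ip, e in per_ip.items()
        if len(e["ja4"]) >= min_distinct
    }
```

An IP that always presents the same fingerprint is filtered out, whatever its request volume.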
### view_dashboard_user_agents
User-Agent aggregation for the dashboard. Source: `ja4_logs.http_logs`
(last 7 days). Groups by `(src_ip, ja4, toStartOfHour(time), log_date)`.
Returns `src_ip` (normalized to IPv4), `ja4`, `hour`, `log_date`,
`user_agents` (groupUniqArray(100)), `requests`.
### view_dashboard_entities
Entity view for the dashboard. Source: `ja4_logs.http_logs` (last 7 days).
A `UNION ALL` of 5 branches, one per entity type: `ip`, `ja4`,
`country`, `asn`, `host`. Returns `entity_type`, `entity_value`, `src_ip`,
`ja4`, `host`, `log_date`, `client_headers`, `asns`, `countries`, `user_agents`.
### view_resource_cascade_1h (thesis §5.4)
Resource-loading cascade analysis. Source: `agg_resource_cascade_1h`
(last 24 hours). Splits the loads into documents (`is_asset = 0`) and assets
(`is_asset = 1`). Computes `doc_count`, `asset_count`, `root_to_first_asset_delay`,
and `asset_load_stddev` (σ of the asset timestamps, a measure of simultaneity).
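
The four cascade metrics can be sketched in Python as follows. The definition of the delay (first asset minus first document), the population standard deviation, and timestamps in seconds are assumptions; the view's SQL may differ on each point.

```python
import statistics


def resource_cascade(resource_loads):
    """resource_loads: list of (timestamp, is_asset) tuples, timestamps in seconds."""
    docs = sorted(ts for ts, is_asset in resource_loads if is_asset == 0)
    assets = sorted(ts for ts, is_asset in resource_loads if is_asset == 1)
    # Assumption: the delay runs from the first document to the first asset.
    delay = assets[0] - docs[0] if docs and assets else 0
    return {
        "doc_count": len(docs),
        "asset_count": len(assets),
        "root_to_first_asset_delay": delay,
        # Assumption: population standard deviation of the asset timestamps.
        "asset_load_stddev": statistics.pstdev(assets) if len(assets) > 1 else 0.0,
    }
```

A browser typically loads several assets shortly after each document; a client that fetches documents with an `asset_count` of 0 never shows that cascade.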
### view_thesis_features_1h (thesis §5)
Unified view of the thesis's advanced detection features. Joins (via CTEs
over the last 24 hours):
| Source | Computed features |
|--------|-------------------|
| `agg_path_sequences_1h` (§5.1) | `path_transition_entropy` (normalized Shannon entropy of Markov-1 transitions) |
| `agg_request_timing_1h` (§5.3) | `cadence_cv`, `burst_ratio` (Δt<100ms), `pause_ratio` (Δt>5s), `lag1_autocorrelation`, `benford_deviation` (χ² against Benford's law) |
| `agg_ip_behavior_1h` (§5.5/§5.8) | `ja4_drift_ratio`, `ja4_distinct_in_session`, `host_diversity`, `host_sweep_speed`, `host_coverage_uniformity` |
| `view_resource_cascade_1h` (§5.4) | `doc_count`, `asset_count`, `root_to_first_asset_delay`, `asset_load_stddev` |
**Join keys**: `(window_start, src_ip, ja4, host)` for §5.1/§5.3/§5.4;
`(window_start, src_ip)` for §5.5/§5.8.
---
## User Accounts
## Performance Indexes (10_perf_indexes.sql)
| User | Permissions | Purpose |
|------|------------|---------|
| `data_writer` | INSERT + SELECT on `http_logs_raw` | Used by correlator service |
| `analyst` | SELECT on `http_logs`, `ml_detected_anomalies`, `ml_all_scores`, `view_ai_features_1h`, `view_ip_recurrence`, `audit_logs` | Used by dashboard/SOC analysts |
Idempotent migration that adds secondary indexes and projections to existing
tables (fresh installs already have them from 04/05/06):
> **Security note**: Default passwords are `ChangeMe` — replace with strong passwords before production use. Store credentials in a secrets manager.
| Table | Index / projection added |
|-------|--------------------------|
| `ml_detected_anomalies` | `idx_detected_at` (minmax), `idx_threat_level` (set(8)), `idx_bot_name` (bloom_filter) |
| `ml_all_scores` | `idx_detected_at` (minmax), `idx_threat_level` (set(8)) |
| `http_logs` | `idx_src_ip` (bloom_filter(0.01)), `idx_ja4` (bloom_filter(0.01)) |
| `agg_host_ip_ja4_1h` | `proj_by_ip` (projection ORDER BY src_ip, window_start, ja4, host) |
---
## User Accounts
| User | Permissions | Purpose |
|------|-------------|---------|
| `data_writer` | INSERT + SELECT on `ja4_logs.http_logs_raw` | Correlator service |
| `analyst` | SELECT on `ja4_logs.http_logs`, `ja4_processing.ml_detected_anomalies`, `ja4_processing.ml_all_scores`, `ja4_processing.view_ai_features_1h`, `ja4_processing.view_ip_recurrence`, `ja4_processing.audit_logs` | Dashboard / SOC analysts |
> **Security**: the default passwords are `ChangeMe`. Replace them with strong
> passwords before production use, and store the credentials in a secrets
> manager.