# 🔍 Configuration de `ml_detected_anomalies` ## 📊 Structure de la table **RequĂȘte:** `SHOW CREATE TABLE mabase_prod.ml_detected_anomalies` ```sql CREATE TABLE mabase_prod.ml_detected_anomalies ( `detected_at` DateTime, `src_ip` IPv6, `ja4` String, `host` String, `bot_name` String, `anomaly_score` Float32, `threat_level` String, `model_name` String, `recurrence` UInt32, `asn_number` String, `asn_org` String, `asn_detail` String, `asn_domain` String, `country_code` String, `asn_label` String, `hits` UInt64, `hit_velocity` Float32, `fuzzing_index` Float32, `post_ratio` Float32, `port_exhaustion_ratio` Float32, `max_keepalives` UInt32, `orphan_ratio` Float32, `tcp_jitter_variance` Float32, `tcp_shared_count` UInt32, `true_window_size` UInt64, `window_mss_ratio` Float32, `alpn_http_mismatch` UInt8, `is_alpn_missing` UInt8, `sni_host_mismatch` UInt8, `header_count` UInt16, `has_accept_language` UInt8, `has_cookie` UInt8, `has_referer` UInt8, `modern_browser_score` UInt8, `is_headless` UInt8, `ua_ch_mismatch` UInt8, `header_order_shared_count` UInt32, `ip_id_zero_ratio` Float32, `request_size_variance` Float32, `multiplexing_efficiency` Float32, `mss_mobile_mismatch` UInt8, `correlated` UInt8, `reason` String, `asset_ratio` Float32, `direct_access_ratio` Float32, `is_ua_rotating` UInt8, `distinct_ja4_count` UInt32, `src_port_density` Float32, `ja4_asn_concentration` Float32, `ja4_country_concentration` Float32, `is_rare_ja4` UInt8, `header_order_confidence` Float32, `distinct_header_orders` UInt32, `temporal_entropy` Float32, `path_diversity_ratio` Float32, `url_depth_variance` Float32, `anomalous_payload_ratio` Float32 ) ENGINE = ReplacingMergeTree(detected_at) ORDER BY src_ip TTL detected_at + toIntervalDay(30) SETTINGS index_granularity = 8192 ``` --- ## ⚙ Configuration dĂ©taillĂ©e ### 1. 
**Moteur de stockage** ``` ENGINE = ReplacingMergeTree(detected_at) ``` - **Type:** `ReplacingMergeTree` - **Version column:** `detected_at` - **Comportement:** Garde la derniĂšre version des lignes dupliquĂ©es lors des merges ### 2. **ClĂ© de tri (ORDER BY)** ``` ORDER BY src_ip ``` - **ClĂ© primaire:** `src_ip` (IPv6) - **Optimisation:** Les requĂȘtes par IP sont trĂšs rapides - **Impact:** Les requĂȘtes par date (`detected_at`) nĂ©cessitent un scan complet ### 3. **Politique de rĂ©tention (TTL)** ``` TTL detected_at + toIntervalDay(30) ``` - **DurĂ©e actuelle:** **30 jours** - **Comportement:** Les lignes sont supprimĂ©es 30 jours aprĂšs `detected_at` - **Application:** Automatique pendant les opĂ©rations de merge ### 4. **Partitionnement** ``` -- Aucun partitionnement explicite ``` - **Statut:** **Non partitionnĂ©e** (tuple()) - **Impact:** Toutes les donnĂ©es dans une seule partition - **ConsĂ©quence:** - ✅ RequĂȘtes plus simples - ❌ OPTIMIZE FINAL plus lent sur grandes tables - ❌ Impossible de DROPper une partition ancienne ### 5. **Index** ``` SETTINGS index_granularity = 8192 ``` - **GranularitĂ©:** 8192 lignes par marque d'index - **Standard:** Valeur par dĂ©faut de ClickHouse --- ## 📈 Statistiques actuelles **RequĂȘte:** `SELECT count(), min(detected_at), max(detected_at) FROM ml_detected_anomalies` | MĂ©trique | Valeur | |----------|--------| | **Total lignes** | 57,338 | | **DonnĂ©e la plus ancienne** | 2026-03-13 20:30:19 | | **DonnĂ©e la plus rĂ©cente** | 2026-03-15 17:57:10 | | **PĂ©riode couverte** | ~2 jours | | **TTL actuel** | 30 jours | --- ## 🔍 Analyse du problĂšme: 212.30.36.0/24 ### Incident dans `api/incidents/clusters` ```json { "subnet": "212.30.36.0/24", "unique_ips": 10, "total_detections": 10, "first_seen": "2026-03-15T03:55:28", "last_seen": "2026-03-15T03:55:28" } ``` ### DonnĂ©es dans `ml_detected_anomalies` - **Âge:** ~15 heures (bien dans les 30 jours) - **Statut:** **Devrait ĂȘtre prĂ©sent** ✅ ### Pourquoi "Subnet non trouvĂ©" ? 

**Hypotheses:**

1. **IPv6 vs IPv4** ⚠
   - The table stores `src_ip` as **IPv6**
   - IPv4 addresses are stored as `::ffff:x.x.x.x`
   - Our query uses `replaceRegexpAll(toString(src_ip), '^::ffff:', '')`
   - **Check:** Does the IPv4 cleanup work correctly?
2. **ReplacingMergeTree** ⚠
   - Rows superseded by a newer version can remain visible until a merge runs
   - **Check:** Are there duplicate rows with different `detected_at` values?
3. **Data actually absent** ❌
   - The 10 detections for `212.30.36.0/24` have been deleted
   - **Possible cause:** Bug in bot_detector_ai or premature cleanup

---

## 🧪 Diagnostic tests

### Test 1: Check the IPv4 format

```sql
SELECT
    src_ip,
    toString(src_ip) AS ip_string,
    replaceRegexpAll(toString(src_ip), '^::ffff:', '') AS clean_ip
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 1 HOUR
LIMIT 10;
```

### Test 2: Search for the specific subnet

```sql
SELECT
    count(),
    min(detected_at),
    max(detected_at)
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 30 DAY
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[1] = '212'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[2] = '30'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[3] = '36';
```

### Test 3: List the IPs in the subnet

```sql
SELECT
    replaceRegexpAll(toString(src_ip), '^::ffff:', '') AS clean_ip,
    count() AS detections,
    min(detected_at) AS first_seen,
    max(detected_at) AS last_seen
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= now() - INTERVAL 30 DAY
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[1] = '212'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[2] = '30'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[3] = '36'
GROUP BY clean_ip
ORDER BY detections DESC
LIMIT 20;
```

---

## ✅ Recommendations

### 1. **Increase retention** (already documented)

```sql
-- Go from 30 to 90 days
ALTER TABLE mabase_prod.ml_detected_anomalies
MODIFY TTL detected_at + INTERVAL 90 DAY;
```

### 2. **Add partitioning** (optional)

```sql
-- Recreate the table with monthly partitioning
CREATE TABLE mabase_prod.ml_detected_anomalies_new
(
    -- ... same columns ...
)
ENGINE = ReplacingMergeTree(detected_at)
PARTITION BY toYYYYMM(detected_at)  -- One partition per month
ORDER BY src_ip
TTL detected_at + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

-- Migrate the data
INSERT INTO mabase_prod.ml_detected_anomalies_new
SELECT * FROM mabase_prod.ml_detected_anomalies;

-- Swap the tables
RENAME TABLE
    mabase_prod.ml_detected_anomalies TO mabase_prod.ml_detected_anomalies_old,
    mabase_prod.ml_detected_anomalies_new TO mabase_prod.ml_detected_anomalies;

-- Drop the old table after verification
DROP TABLE mabase_prod.ml_detected_anomalies_old;
```

### 3. **Add an index on detected_at** (optional)

```sql
-- Add a secondary (data-skipping) index for time-based queries.
-- GRANULARITY counts index_granularity blocks, not rows, so keep it small.
ALTER TABLE mabase_prod.ml_detected_anomalies
ADD INDEX idx_detected_at detected_at TYPE minmax GRANULARITY 1;
```

### 4. **Fix the 212.30.36.0/24 bug**

**Immediate action:**

```sql
-- Check whether the data exists
SELECT count()
FROM mabase_prod.ml_detected_anomalies
WHERE detected_at >= toDateTime('2026-03-15 03:00:00')
  AND detected_at <= toDateTime('2026-03-15 05:00:00')
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[1] = '212'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[2] = '30'
  AND splitByChar('.', replaceRegexpAll(toString(src_ip), '^::ffff:', ''))[3] = '36';
```

**If count = 0:** The data was likely deleted prematurely (bot_detector_ai bug)

**If count > 0:** There is a bug in the SQL query used by the subnet API

---

## 📚 Files to modify

| File | Change | Status |
|------|--------|--------|
| `deploy_dashboard_entities_view.sql` | TTL: 30 → 90 days | ✅ Done |
| `deploy_user_agents_view.sql` | TTL: 7 → 90 days | ✅ Done |
| `update_retention_policy.sql` | Application script | ✅ Created |
| `ml_detected_anomalies` | TTL: 30 → 90 days | ⏳ To apply |

---

**Last updated:** 2026-03-15
**Version:** 1.0
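
As a complement to the diagnostic tests, the subnet can also be matched by comparing `src_ip` values directly instead of parsing strings. The following is a minimal sketch, not the API's actual query. It assumes that IPv4 addresses are stored in IPv4-mapped form (`::ffff:a.b.c.d`, as hypothesis 1 describes) and that the ClickHouse version in use supports comparison operators on `IPv6` values. Because `src_ip` is the sorting key, a range filter of this kind should be able to use the primary index rather than scanning the whole table.

```sql
-- Sketch: match 212.30.36.0/24 by comparing IPv6 values directly.
-- Assumption: IPv4 sources are stored as IPv4-mapped IPv6 (::ffff:a.b.c.d).
SELECT
    src_ip,
    count() AS detections,
    min(detected_at) AS first_seen,
    max(detected_at) AS last_seen
FROM mabase_prod.ml_detected_anomalies
-- Lower and upper bounds of the /24, expressed as IPv4-mapped IPv6 addresses
WHERE src_ip BETWEEN toIPv6('::ffff:212.30.36.0')
                 AND toIPv6('::ffff:212.30.36.255')
GROUP BY src_ip
ORDER BY src_ip;
```

If this query returns rows while the string-based tests return none, the `::ffff:` cleanup is the likely culprit. If both return nothing, the data is probably absent from the table.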