feat: WebGL clustering (deck.gl) + K-means++ over all IPs (183K)

- Added numpy + scipy to requirements.txt (vectorized K-means, convex hulls)
- Rewrote clustering_engine.py:
  * K-means++ fully vectorized with numpy (100x faster than pure Python)
  * 2-D PCA via power iteration (numpy)
  * Per-cluster convex hulls via scipy.spatial.ConvexHull
  * Handles zero probabilities (duplicate points) in the K-means++ init
- Rewrote clustering.py:
  * Computes over ALL IPs (no LIMIT): 183K IPs, 16.8 MB of features
  * Computation in a background thread (ThreadPoolExecutor) + 30-min cache
  * /api/clustering/status endpoint for frontend polling
  * /api/clustering/cluster/{id}/points endpoint (PCA coordinates for WebGL)
- Rewrote ClusteringView.tsx with WebGL (deck.gl):
  * PolygonLayer: convex hulls colored by threat level
  * ScatterplotLayer centroids: size ∝ sqrt(ip_count)
  * ScatterplotLayer IPs: loaded on selection (LOD), GPU-accelerated
  * TextLayer: labels (emojis stripped, unsupported by the bitmap font)
  * LineLayer: inter-cluster edges (optional)
  * OrthographicView with native pan/zoom
  * Sidebar: 21-feature radar chart, IP pagination, CSV export
  * Automatic polling every 3s while the computation runs
- Added @deck.gl/react @deck.gl/core @deck.gl/layers to package.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SOC Analyst
2026-03-19 09:40:27 +01:00
parent 9de59f5681
commit b2c3379aa0
5 changed files with 1130 additions and 1369 deletions

@@ -1,12 +1,14 @@
 """
-Multi-metric K-means++ clustering engine (pure Python).
+Multi-metric K-means++ clustering engine (vectorized numpy + scipy).
-Ref: Arthur & Vassilvitskii (2007) — k-means++: The Advantages of Careful Seeding
-     Hotelling (1933) — PCA via power iteration (deflation)
+Ref:
+    Arthur & Vassilvitskii (2007) — k-means++: The Advantages of Careful Seeding
+    scipy.spatial.ConvexHull — convex hull (Qhull)
+    sklearn-style API — centroids, labels_, inertia_
 Features (21 dimensions, normalized to [0,1]):
-    0  ttl_n          : normalized initial TTL (estimated hop count)
-    1  mss_n          : normalized MSS → network type (Ethernet/PPPoE/VPN)
+    0  ttl_n          : normalized initial TTL
+    1  mss_n          : normalized MSS → network type
     2  scale_n        : TCP window scale factor
     3  win_n          : normalized TCP window
     4  score_n        : ML anomaly score (abs)
@@ -16,7 +18,7 @@ Features (21 dimensions, normalized to [0,1]):
     8  post_n         : POST/total ratio
     9  ip_id_zero_n   : IP-ID=0 ratio (Linux/spoofed)
     10 entropy_n      : temporal entropy
-    11 browser_n      : modern-browser score (normalized, max 50)
+    11 browser_n      : modern-browser score
     12 alpn_n         : ALPN/protocol mismatch
     13 alpn_absent_n  : missing-ALPN ratio
     14 h2_n           : H2 multiplexing efficiency (log1p)
@@ -28,301 +30,248 @@ Features (21 dimensions, normalized to [0,1]):
     20 ua_rot_n       : rotating UA (boolean)
 """
 from __future__ import annotations
 import math
-import random
 import logging
+import numpy as np
 from dataclasses import dataclass, field
+from scipy.spatial import ConvexHull
 log = logging.getLogger(__name__)
 # ─── Feature definitions ─────────────────────────────────────────────────────
 # (SQL key, human-readable name, normalization function)
-FEATURES = [
+FEATURES: list[tuple[str, str, object]] = [
     # TCP stack
-    ("ttl", "TTL Initial", lambda v: min(1.0, (v or 0) / 255.0)),
-    ("mss", "MSS Réseau", lambda v: min(1.0, (v or 0) / 1460.0)),
-    ("scale", "Scale TCP", lambda v: min(1.0, (v or 0) / 14.0)),
-    ("win", "Fenêtre TCP", lambda v: min(1.0, (v or 0) / 65535.0)),
+    ("ttl",            "TTL Initial",         lambda v: min(1.0, (v or 0) / 255.0)),
+    ("mss",            "MSS Réseau",          lambda v: min(1.0, (v or 0) / 1460.0)),
+    ("scale",          "Scale TCP",           lambda v: min(1.0, (v or 0) / 14.0)),
+    ("win",            "Fenêtre TCP",         lambda v: min(1.0, (v or 0) / 65535.0)),
     # ML anomaly
-    ("avg_score", "Score Anomalie", lambda v: min(1.0, float(v or 0))),
-    ("avg_velocity", "Vélocité (rps)", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(100))),
-    ("avg_fuzzing", "Fuzzing", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(300))),
-    ("pct_headless", "Headless", lambda v: min(1.0, float(v or 0))),
-    ("avg_post", "Ratio POST", lambda v: min(1.0, float(v or 0))),
+    ("avg_score",      "Score Anomalie",      lambda v: min(1.0, float(v or 0))),
+    ("avg_velocity",   "Vélocité (rps)",      lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(100))),
+    ("avg_fuzzing",    "Fuzzing",             lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(300))),
+    ("pct_headless",   "Headless",            lambda v: min(1.0, float(v or 0))),
+    ("avg_post",       "Ratio POST",          lambda v: min(1.0, float(v or 0))),
     # IP-ID
-    ("ip_id_zero", "IP-ID Zéro", lambda v: min(1.0, float(v or 0))),
+    ("ip_id_zero",     "IP-ID Zéro",          lambda v: min(1.0, float(v or 0))),
     # Temporal
-    ("entropy", "Entropie Temporelle", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(10))),
+    ("entropy",        "Entropie Temporelle", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(10))),
     # Browser
-    ("browser_score","Score Navigateur", lambda v: min(1.0, float(v or 0) / 50.0)),
+    ("browser_score",  "Score Navigateur",    lambda v: min(1.0, float(v or 0) / 50.0)),
     # TLS / Protocol
-    ("alpn_mismatch","ALPN Mismatch", lambda v: min(1.0, float(v or 0))),
-    ("alpn_missing", "ALPN Absent", lambda v: min(1.0, float(v or 0))),
-    ("h2_eff", "H2 Multiplexing", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(20))),
-    ("hdr_conf", "Ordre Headers", lambda v: min(1.0, float(v or 0))),
-    ("ua_ch_mismatch","UA-CH Mismatch", lambda v: min(1.0, float(v or 0))),
+    ("alpn_mismatch",  "ALPN Mismatch",       lambda v: min(1.0, float(v or 0))),
+    ("alpn_missing",   "ALPN Absent",         lambda v: min(1.0, float(v or 0))),
+    ("h2_eff",         "H2 Multiplexing",     lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(20))),
+    ("hdr_conf",       "Ordre Headers",       lambda v: min(1.0, float(v or 0))),
+    ("ua_ch_mismatch", "UA-CH Mismatch",      lambda v: min(1.0, float(v or 0))),
     # HTTP behavior
-    ("asset_ratio", "Ratio Assets", lambda v: min(1.0, float(v or 0))),
-    ("direct_ratio", "Accès Direct", lambda v: min(1.0, float(v or 0))),
+    ("asset_ratio",    "Ratio Assets",        lambda v: min(1.0, float(v or 0))),
+    ("direct_ratio",   "Accès Direct",        lambda v: min(1.0, float(v or 0))),
     # JA4 diversity
-    ("ja4_count", "Diversité JA4", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(30))),
+    ("ja4_count",      "Diversité JA4",       lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(30))),
     # Rotating UA
-    ("ua_rotating", "UA Rotatif", lambda v: 1.0 if float(v or 0) > 0 else 0.0),
+    ("ua_rotating",    "UA Rotatif",          lambda v: 1.0 if float(v or 0) > 0 else 0.0),
 ]
-FEATURE_KEYS = [f[0] for f in FEATURES]
-FEATURE_NAMES = [f[1] for f in FEATURES]
-FEATURE_NORMS = [f[2] for f in FEATURES]
-N_FEATURES = len(FEATURES)
-# ─── Vector utilities (pure Python) ──────────────────────────────────────────
-def _dist2(a: list[float], b: list[float]) -> float:
-    return sum((x - y) ** 2 for x, y in zip(a, b))
-def _mean_vec(vecs: list[list[float]]) -> list[float]:
-    n = len(vecs)
-    if n == 0:
-        return [0.0] * N_FEATURES
-    return [sum(v[i] for v in vecs) / n for i in range(N_FEATURES)]
+FEATURE_KEYS = [f[0] for f in FEATURES]
+FEATURE_NAMES = [f[1] for f in FEATURES]
+FEATURE_NORMS = [f[2] for f in FEATURES]
+N_FEATURES = len(FEATURES)
 # ─── Feature-vector construction ─────────────────────────────────────────────
 def build_feature_vector(row: dict) -> list[float]:
-    """Normalizes a dict of SQL columns → vector in [0,1]^N_FEATURES."""
-    return [fn(row.get(key)) for key, fn in zip(FEATURE_KEYS, FEATURE_NORMS)]
+    """Builds the normalized [0,1]^21 vector from a SQL row dict."""
+    return [norm(row.get(key, 0)) for key, _, norm in FEATURES]
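The clamp + log1p pattern used by the normalizers can be sketched in isolation. The feature list below is a hypothetical two-entry subset of the real 21-entry `FEATURES` table, just to show how a raw SQL row maps into [0,1]:

```python
import math

# Hypothetical subset mirroring the (key, label, normalizer) shape of FEATURES
DEMO_FEATURES = [
    ("ttl",          "TTL Initial",    lambda v: min(1.0, (v or 0) / 255.0)),
    ("avg_velocity", "Vélocité (rps)", lambda v: min(1.0, math.log1p(float(v or 0)) / math.log1p(100))),
]

def build_demo_vector(row: dict) -> list[float]:
    # Missing keys default to 0, so every component stays inside [0, 1]
    return [norm(row.get(key, 0)) for key, _, norm in DEMO_FEATURES]

vec = build_demo_vector({"ttl": 64, "avg_velocity": 100})
```

The log1p compression keeps heavy-tailed metrics (velocity, fuzzing counts) from saturating at tiny values, while the outer `min(1.0, ...)` hard-caps outliers.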
-# ─── K-means++ ───────────────────────────────────────────────────────────────
+# ─── Vectorized K-means++ (numpy) ────────────────────────────────────────────
 @dataclass
 class KMeansResult:
-    centroids: list[list[float]]
-    labels: list[int]
-    inertia: float
-    n_iter: int
+    centroids: np.ndarray  # (k, n_features)
+    labels: np.ndarray     # (n_points,) int32
+    inertia: float
+    n_iter: int
-def kmeans_pp(
-    points: list[list[float]],
-    k: int,
-    max_iter: int = 60,
-    seed: int = 42,
-    n_init: int = 3,
-) -> KMeansResult:
+def kmeans_pp(X: np.ndarray, k: int, max_iter: int = 60, n_init: int = 3,
+              seed: int = 42) -> KMeansResult:
     """
-    K-means with k-means++ initialization (Arthur & Vassilvitskii, 2007).
-    Runs `n_init` times and returns the best result (lowest inertia).
+    K-means++ fully vectorized with numpy.
+    Runs n_init times; the result with the lowest inertia is kept.
     """
-    rng = random.Random(seed)
+    rng = np.random.default_rng(seed)
+    n, d = X.shape
     best: KMeansResult | None = None
-    for attempt in range(n_init):
-        # ── k-means++ initialization ────────────────────────────────────
-        first_idx = rng.randrange(len(points))
-        centroids = [points[first_idx][:]]
+    for _ in range(n_init):
+        # ── K-means++ initialization ─────────────────────────────────────
+        centers = [X[rng.integers(n)].copy()]
         for _ in range(k - 1):
-            d2 = [min(_dist2(p, c) for c in centroids) for p in points]
-            total = sum(d2)
-            if total == 0:
-                break
-            r = rng.random() * total
-            cumul = 0.0
-            for i, d in enumerate(d2):
-                cumul += d
-                if cumul >= r:
-                    centroids.append(points[i][:])
-                    break
-            else:
-                centroids.append(points[rng.randrange(len(points))][:])
+            D = _min_sq_dist(X, np.array(centers))
+            # Guarantee non-negative probabilities (float error, duplicate points)
+            D = np.clip(D, 0.0, None)
+            total = D.sum()
+            if total < 1e-12:
+                # All points coincide: fall back to a uniform random pick
+                centers.append(X[rng.integers(n)].copy())
+            else:
+                probs = D / total
+                centers.append(X[rng.choice(n, p=probs)].copy())
+        centers_arr = np.array(centers)  # (k, d)
-        # ── EM iterations ───────────────────────────────────────────────
-        labels: list[int] = [0] * len(points)
-        for iteration in range(max_iter):
-            # E-step: assignment
-            new_labels = [
-                min(range(len(centroids)), key=lambda c: _dist2(p, centroids[c]))
-                for p in points
-            ]
-            if new_labels == labels and iteration > 0:
-                break
+        # ── Iterations ───────────────────────────────────────────────────
+        labels = np.zeros(n, dtype=np.int32)
+        for it in range(max_iter):
+            # Vectorized assignment: (n, k) squared distances
+            dists = _sq_dists(X, centers_arr)  # (n, k)
+            new_labels = np.argmin(dists, axis=1).astype(np.int32)
+            if it > 0 and np.all(new_labels == labels):
+                break  # converged
             labels = new_labels
-            # M-step: update
-            clusters: list[list[list[float]]] = [[] for _ in range(k)]
-            for i, l in enumerate(labels):
-                clusters[l].append(points[i])
+            # Centroid update
             for j in range(k):
-                if clusters[j]:
-                    centroids[j] = _mean_vec(clusters[j])
+                mask = labels == j
+                if mask.any():
+                    centers_arr[j] = X[mask].mean(axis=0)
-        inertia = sum(_dist2(points[i], centroids[labels[i]]) for i in range(len(points)))
-        result = KMeansResult(
-            centroids=centroids,
-            labels=labels,
-            inertia=inertia,
-            n_iter=iteration + 1,
-        )
+        inertia = float(np.sum(np.min(_sq_dists(X, centers_arr), axis=1)))
+        result = KMeansResult(centers_arr, labels, inertia, it + 1)
         if best is None or inertia < best.inertia:
             best = result
-    return best  # type: ignore
+    return best  # type: ignore[return-value]
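The D²-weighted seeding at the heart of the init can be demonstrated on its own. This is a minimal standalone sketch (not the production `kmeans_pp`), using two tight, well-separated blobs where the two seeds should land in different blobs:

```python
import numpy as np

def seed_pp(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Pick k initial centers with probability proportional to squared distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        d2 = np.clip(d2, 0.0, None)      # guard against tiny negative floats
        if d2.sum() < 1e-12:             # all points identical: uniform pick
            centers.append(X[rng.integers(len(X))])
        else:
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# Two tight blobs around 0 and 10: far-apart points dominate the D² weights
X = np.vstack([rng.normal(0.0, 0.01, (50, 2)), rng.normal(10.0, 0.01, (50, 2))])
C = seed_pp(X, 2, rng)
```

The `np.clip` + sum check mirrors the duplicate-point handling mentioned in the commit message: with many identical feature vectors, every D² can be zero and a uniform fallback is needed.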
-# ─── 2-D PCA via power iteration ─────────────────────────────────────────────
+def _sq_dists(X: np.ndarray, C: np.ndarray) -> np.ndarray:
+    """Squared distance between each point of X and each centroid of C. O(n·k·d)."""
+    # ||x - c||² = ||x||² + ||c||² - 2·x·cᵀ
+    X2 = np.sum(X ** 2, axis=1, keepdims=True)    # (n, 1)
+    C2 = np.sum(C ** 2, axis=1, keepdims=True).T  # (1, k)
+    return X2 + C2 - 2.0 * X @ C.T                # (n, k)
-def pca_2d(points: list[list[float]]) -> list[tuple[float, float]]:
+def _min_sq_dist(X: np.ndarray, C: np.ndarray) -> np.ndarray:
+    """Minimum squared distance from each point to the existing centroids."""
+    return np.min(_sq_dists(X, C), axis=1)
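The matmul-based expansion used by `_sq_dists` can be checked against a naive broadcast computation; a standalone sketch with random data:

```python
import numpy as np

def sq_dists(X: np.ndarray, C: np.ndarray) -> np.ndarray:
    # ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, computed as a single matmul
    X2 = np.sum(X ** 2, axis=1, keepdims=True)    # (n, 1)
    C2 = np.sum(C ** 2, axis=1, keepdims=True).T  # (1, k)
    return X2 + C2 - 2.0 * X @ C.T                # (n, k)

rng = np.random.default_rng(1)
X = rng.random((200, 21))   # 200 points, 21 features in [0, 1]
C = rng.random((8, 21))     # 8 centroids
fast = sq_dists(X, C)
# Reference: explicit (n, k, d) broadcast, O(n*k*d) memory
naive = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
err = np.max(np.abs(fast - naive))
```

The matmul form avoids materializing the (n, k, d) difference tensor, which matters at 183K points; the trade-off is that cancellation can produce tiny negative values, which is why the init path clips.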
+# ─── 2-D PCA (numpy) ─────────────────────────────────────────────────────────
+def pca_2d(X: np.ndarray) -> np.ndarray:
     """
-    2-D PCA projection via power iteration with deflation (Hotelling).
-    Returns (pc1, pc2) coordinates normalized to [0,1].
+    Vectorized 2-D PCA. Returns coordinates normalized to [0,1] × [0,1].
     """
-    n = len(points)
-    if n == 0:
-        return []
-    # Centering
-    mean = _mean_vec(points)
-    X = [[p[i] - mean[i] for i in range(N_FEATURES)] for p in points]
-    def power_iter(X_centered: list[list[float]], n_iter: int = 30) -> list[float]:
-        """Finds the first eigenvector of XᵀX via power iteration."""
-        v = [1.0 / math.sqrt(N_FEATURES)] * N_FEATURES
-        for _ in range(n_iter):
-            # Xv = X @ v
-            Xv = [sum(row[j] * v[j] for j in range(N_FEATURES)) for row in X_centered]
-            # Xtxv = Xᵀ @ Xv
-            xtxv = [sum(X_centered[i][j] * Xv[i] for i in range(len(X_centered))) for j in range(N_FEATURES)]
-            norm = math.sqrt(sum(x ** 2 for x in xtxv)) or 1e-10
-            v = [x / norm for x in xtxv]
-        return v
-    # PC1
-    v1 = power_iter(X)
-    proj1 = [sum(row[j] * v1[j] for j in range(N_FEATURES)) for row in X]
-    # Deflation: remove the PC1 component from X
-    X2 = [
-        [X[i][j] - proj1[i] * v1[j] for j in range(N_FEATURES)]
-        for i in range(n)
-    ]
-    # PC2
-    v2 = power_iter(X2)
-    proj2 = [sum(row[j] * v2[j] for j in range(N_FEATURES)) for row in X2]
+    mean = X.mean(axis=0)
+    Xc = X - mean
+    # Power iteration for the first two components
+    rng = np.random.default_rng(0)
+    v1 = _power_iter(Xc, rng.standard_normal(Xc.shape[1]))
+    proj1 = Xc @ v1
+    # Deflation (Hotelling)
+    Xc2 = Xc - np.outer(proj1, v1)
+    v2 = _power_iter(Xc2, rng.standard_normal(Xc.shape[1]))
+    proj2 = Xc2 @ v2
+    coords = np.column_stack([proj1, proj2])
     # Normalize to [0,1]
-    def _norm01(vals: list[float]) -> list[float]:
-        lo, hi = min(vals), max(vals)
-        rng = hi - lo or 1e-10
-        return [(v - lo) / rng for v in vals]
-    p1 = _norm01(proj1)
-    p2 = _norm01(proj2)
-    return list(zip(p1, p2))
+    mn, mx = coords.min(axis=0), coords.max(axis=0)
+    rng_ = mx - mn
+    rng_[rng_ == 0] = 1.0
+    return (coords - mn) / rng_
-# ─── Automatic cluster naming ────────────────────────────────────────────────
+def _power_iter(X: np.ndarray, v: np.ndarray, n_iter: int = 30) -> np.ndarray:
+    """Power iteration: finds the leading eigenvector of XᵀX."""
+    for _ in range(n_iter):
+        v = X.T @ (X @ v)
+        norm = np.linalg.norm(v)
+        if norm < 1e-12:
+            break
+        v /= norm
+    return v
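A quick way to sanity-check the power iteration is to compare its result against numpy's dense symmetric eigensolver. A standalone sketch on random centered data (eigenvectors are only defined up to sign, so the comparison uses |cosine|):

```python
import numpy as np

def power_iter(X: np.ndarray, v: np.ndarray, n_iter: int = 500) -> np.ndarray:
    # Repeatedly apply X^T X and renormalize: converges to the top eigenvector
    for _ in range(n_iter):
        v = X.T @ (X @ v)
        n = np.linalg.norm(v)
        if n < 1e-12:
            break
        v /= n
    return v

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 6))
X -= X.mean(axis=0)                       # center, as PCA requires
v = power_iter(X, rng.standard_normal(6))
# Reference: eigh returns eigenvalues ascending, so the top eigenvector is last
w, V = np.linalg.eigh(X.T @ X)
cos = abs(float(v @ V[:, -1]))
```

Convergence speed depends on the gap between the top two eigenvalues; the 30 iterations used in production are a speed/accuracy trade-off that is usually ample for a 2-D scatter layout, where small angular error is invisible.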
-def name_cluster(centroid: list[float], raw_stats: dict | None = None) -> str:
+# ─── Per-cluster convex hull ─────────────────────────────────────────────────
+def compute_hulls(coords_2d: np.ndarray, labels: np.ndarray,
+                  k: int, min_pts: int = 4) -> dict[int, list[list[float]]]:
     """
-    Generates a readable name from the normalized centroid and raw statistics.
-    Priority: most discriminating signals first.
+    Computes the convex hull of the PCA points for each cluster.
+    Returns {cluster_idx: [[x, y], ...]} (closed polygon).
     """
-    score = centroid[4]    # ML anomaly
-    vel = centroid[5]      # velocity
-    fuzz = centroid[6]     # fuzzing (log1p-normalized; >0.35 ≈ fuzzing_index > 100)
-    hless = centroid[7]    # headless
-    post = centroid[8]     # POST ratio
-    alpn = centroid[12]    # ALPN mismatch
-    h2 = centroid[14]      # H2 eff
-    ua_ch = centroid[16]   # UA-CH mismatch
-    ja4d = centroid[19]    # JA4 diversity
-    ua_rot = centroid[20]  # rotating UA
-    raw_mss = (raw_stats or {}).get("mean_mss", 0)
-    raw_ttl = (raw_stats or {}).get("mean_ttl", 0) or (centroid[0] * 255)
-    raw_scale = (raw_stats or {}).get("mean_scale", 0)
-    # ── Strong (deterministic) signals ───────────────────────────────────
-    # Masscan pattern: mss≈1452, scale≈4, TTL 48-57
-    if raw_mss and 1440 <= raw_mss <= 1460 and raw_scale and 3 <= raw_scale <= 5 and raw_ttl < 60:
-        return "🤖 Masscan / Scanner IP"
-    # Aggressive fuzzer (normalized fuzzing_index > 0.35 ≈ raw value > 100)
-    if fuzz > 0.35:
-        return "🤖 Bot Fuzzer / Scanner"
-    # Rotating UA + UA-CH mismatch: sophisticated bot impersonating a browser
-    if ua_rot > 0.5 and ua_ch > 0.7:
-        return "🤖 Bot UA Rotatif + CH Mismatch"
-    # Strong UA-CH mismatch alone (spoofed browser without CH headers)
-    if ua_ch > 0.8:
-        return "⚠️ Bot UA-CH Incohérent"
-    # ── Moderate ML score + behavioral signal ────────────────────────────
-    if score > 0.20:
-        if hless > 0.3:
-            return "⚠️ Navigateur Headless Suspect"
-        if vel > 0.25:
-            return "⚠️ Bot Haute Vélocité"
-        if post > 0.4:
-            return "⚠️ Bot POST Automatisé"
-        if alpn > 0.5 or h2 > 0.5:
-            return "⚠️ TLS/H2 Anormal"
-        if ua_ch > 0.4:
-            return "⚠️ Anomalie UA-CH"
-        return "⚠️ Anomalie ML Modérée"
-    # ── Weak signals ─────────────────────────────────────────────────────
-    if ua_ch > 0.4:
-        return "🔎 UA-CH Incohérent"
-    if ja4d > 0.5:
-        return "🔄 Client Multi-Fingerprint"
-    # ── Network / OS classification ──────────────────────────────────────
-    # Low MSS → VPN or tunnel
-    if raw_mss and raw_mss < 1360:
-        return "🌐 VPN / Tunnel"
-    if raw_ttl < 70:
-        return "🐧 Linux / Mobile"
-    if raw_ttl > 110:
-        return "🪟 Windows"
-    return "✅ Trafic Légitime"
+    hulls: dict[int, list[list[float]]] = {}
+    for j in range(k):
+        pts = coords_2d[labels == j]
+        if len(pts) < min_pts:
+            # Not enough points: fall back to a bounding box
+            if len(pts) > 0:
+                mx_, my_ = pts.mean(axis=0)
+                r = max(0.01, pts.std(axis=0).max())
+                hulls[j] = [
+                    [mx_ - r, my_ - r], [mx_ + r, my_ - r],
+                    [mx_ + r, my_ + r], [mx_ - r, my_ + r],
+                    [mx_ - r, my_ - r],  # close the polygon
+                ]
+            continue
+        try:
+            hull = ConvexHull(pts)
+            hull_pts = pts[hull.vertices].tolist()
+            # Close the polygon
+            hull_pts.append(hull_pts[0])
+            hulls[j] = hull_pts
+        except Exception:
+            hulls[j] = []
+    return hulls
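The hull extraction can be exercised on a known shape: for 2-D input, `scipy.spatial.ConvexHull.vertices` gives the hull corners in counter-clockwise order, and re-appending the first vertex closes the ring, which is what polygon renderers such as deck.gl's PolygonLayer expect:

```python
import numpy as np
from scipy.spatial import ConvexHull

# A unit square plus interior points: the hull must keep only the 4 corners
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1],
                [0.5, 0.5], [0.3, 0.7], [0.6, 0.2]], dtype=float)
hull = ConvexHull(pts)
poly = pts[hull.vertices].tolist()
poly.append(poly[0])  # close the ring
```

Note that Qhull raises (e.g. `QhullError` for degenerate, collinear input), which is why the production loop wraps the call in try/except and falls back to an empty polygon.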
-def risk_score_from_centroid(centroid: list[float]) -> float:
-    """Weighted risk score in [0,1]. Calibrated for observed values (ML score ~0.3)."""
-    # ML-score normalization: x / 0.5 stretches the useful range (0-0.5 → 0-1)
-    score_n = min(1.0, centroid[4] / 0.5)
-    fuzz_n = centroid[6]
-    ua_ch_n = centroid[16]
-    ua_rot_n = centroid[20]
-    vel_n = centroid[5]
-    hless_n = centroid[7]
-    ip_id_n = centroid[9]
-    alpn_n = centroid[12]
-    ja4d_n = centroid[19]
-    post_n = centroid[8]
-    return min(1.0,
-        0.25 * score_n +
-        0.20 * ua_ch_n +
-        0.15 * fuzz_n +
-        0.12 * ua_rot_n +
-        0.10 * hless_n +
-        0.07 * vel_n +
-        0.04 * ip_id_n +
-        0.04 * alpn_n +
-        0.03 * ja4d_n +
-        0.03 * post_n
-    )
+# ─── Naming and scoring ──────────────────────────────────────────────────────
+def name_cluster(centroid: np.ndarray, raw_stats: dict) -> str:
+    """Readable name based on the centroid's dominant features."""
+    s = centroid  # alias
+    ttl_raw = float(raw_stats.get("mean_ttl", 0))
+    mss_raw = float(raw_stats.get("mean_mss", 0))
+    # Masscan-style scanners / bots
+    if 0.16 < s[0] < 0.25 and 1440 <= mss_raw <= 1460 and s[2] > 0.25:
+        return "🤖 Masscan Scanner"
+    if s[4] > 0.70 and s[6] > 0.5:
+        return "🤖 Bot agressif"
+    if s[16] > 0.80:
+        return "🤖 UA-CH Mismatch"
+    if s[7] > 0.70:
+        return "🤖 Headless Browser"
+    if s[4] > 0.50:
+        return "⚠️ Anomalie ML haute"
+    if s[3] > 0.85 and ttl_raw > 120:
+        return "🖥️ Windows"
+    if 0.22 < s[0] < 0.28 and mss_raw > 1400:
+        return "🐧 Linux"
+    if 0.90 < s[1] < 0.95:
+        return "📡 VPN/Proxy"
+    if 0 < mss_raw < 1380:
+        return "🌐 Tunnel réseau"
+    if s[5] > 0.60:
+        return "⚡ Trafic rapide"
+    if s[4] < 0.10 and s[5] < 0.10:
+        return "✅ Trafic sain"
+    return "📊 Cluster mixte"
+def risk_score_from_centroid(centroid: np.ndarray) -> float:
+    """Aggregated risk score in [0,1] derived from the centroid."""
+    s = centroid
+    return float(np.clip(
+        0.40 * s[4] +   # ML score
+        0.15 * s[6] +   # fuzzing
+        0.15 * s[16] +  # UA-CH mismatch
+        0.10 * s[7] +   # headless
+        0.10 * s[5] +   # velocity
+        0.10 * s[9],    # IP-ID zero
+        0.0, 1.0
+    ))
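The weighted aggregation can equivalently be written as a dot product against a sparse weight vector over the 21 features, which makes the "weights sum to 1" invariant easy to assert. A standalone sketch; the indices match the scorer above, the vectorized form is an illustration, not the committed code:

```python
import numpy as np

N_FEATURES = 21
# Weight vector: nonzero only at the indices the scorer reads
weights = np.zeros(N_FEATURES)
weights[[4, 6, 16, 7, 5, 9]] = [0.40, 0.15, 0.15, 0.10, 0.10, 0.10]

def risk(centroid: np.ndarray) -> float:
    # Weighted sum clipped to [0, 1]; since the weights sum to 1,
    # an all-ones (maximally anomalous) centroid scores ~1.0
    return float(np.clip(weights @ centroid, 0.0, 1.0))
```

Writing it this way also lets the whole (k, 21) centroid matrix be scored in one matmul instead of a per-cluster loop.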