docs: complete rewrite of the service documentation in French

- bot-detector.md: 11-module architecture, 77/65 features, three-vote ensemble (EIF+AE+XGBoost), 5-axis browser detection, HDBSCAN, all environment variables verified against the source code
- dashboard.md: corrected stack (Jinja2+htmx, not React+Vite), 14 pages + 35 API routes + health, dual-database, IPv4/IPv6
- python-ja4common.md: added CLICKHOUSE_DB_PROCESSING/LOGS, dual-database schema, note that the dashboard does not use ja4_common

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -1,221 +1,549 @@
# Bot Detector

The bot-detector is a Python service that performs machine-learning anomaly detection on aggregated HTTP/TLS traffic features stored in ClickHouse. It runs on a continuous cycle (default: every 5 minutes), using Isolation Forest to identify suspicious traffic patterns, enriched with SHAP explainability, DBSCAN clustering, and Anubis bot-rule enrichment.

Python service for semi-supervised machine-learning anomaly detection on HTTP/TLS traffic aggregated in ClickHouse. It runs in a continuous cycle (every 5 minutes by default) using a **three-vote ensemble** (Extended Isolation Forest + Autoencoder + XGBoost), enriched with SHAP explainability, HDBSCAN clustering, and multifactor browser detection.
## ML Algorithm

### Isolation Forest (Semi-Supervised)

The core algorithm is **Isolation Forest** (Liu, Ting & Zhou, 2008), an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning feature space. Anomalies require fewer partitions to isolate than normal points.

The approach is **semi-supervised** because:
1. **Known bots** are identified a priori via reputation dictionaries (IP, JA4, ASN)
2. **Human baseline** is identified via ASN reputation labels (`asn_label = 'human'`)
3. The model trains **only on human-baseline traffic** (minimum 500 sessions required)
4. Unknown traffic is scored by deviation from the human profile

### Two-Model Architecture

| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 35 | HTTP + TCP + TLS (full pipeline data) |
| **Applicatif** | `correlated = 0` | 31 | HTTP only (no TLS correlation available) |

### Threat Levels

| Score Range | Level | Interpretation |
|------------|-------|----------------|
| `< -0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< -0.15` | **HIGH** | Strong anomaly signal |
| `< -0.05` | **MEDIUM** | Moderate anomaly |
| `≥ -0.05` | **LOW** | Slightly unusual |

## Feature List

### Common Features (31, Applicatif model)

#### HTTP Behavior

| Feature | Description |
|---------|-------------|
| `hits` | Request count in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity anomaly score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction missing Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction missing Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |

#### Connection Management

| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |

#### Browser Fingerprint

| Feature | Description |
|---------|-------------|
| `header_count` | HTTP headers sent |
| `has_accept_language` | Accept-Language header presence |
| `has_cookie` | Cookie header presence |
| `has_referer` | Referer header presence |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (headless/minimal stack) |
| `header_order_shared_count` | IPs sharing same header order |
| `header_order_confidence` | Normalized entropy of header order |
| `distinct_header_orders` | Distinct header orderings per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with non-document dest |

#### Navigation Patterns

| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Static asset request fraction |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected (flag) |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Anomalous payload size fraction |

#### Concentration & Rarity

| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |

#### Temporal & Diversity

| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Temporal distribution entropy |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |

### Additional TCP/TLS Features (Complet model only, 4 extra)

| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | TCP inter-packet jitter variance |
| `alpn_http_mismatch` | ALPN vs actual HTTP protocol mismatch |
| `is_alpn_missing` | ALPN absent in ClientHello |
| `sni_host_mismatch` | TLS SNI vs HTTP Host mismatch |

### L4 Fingerprint Features (Complet model)

| Feature | Description |
|---------|-------------|
| `avg_ttl` | Average IP TTL (OS fingerprint) |
| `ttl_std` | TTL standard deviation |
| `no_window_scale_ratio` | Fraction without TCP window scale |
| `syn_timing_cv` | SYN timing coefficient of variation |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | IP Don't-Fragment flag variance |

## Detection Pipeline

```
1. Read view_ai_features_1h (last 24h) → DataFrame
2. Read view_ip_recurrence → recurrence map
3. Clean columns (fillna, astype)
4. Split by correlated=1 / correlated=0
5. For each model (Complet, Applicatif):
   a. A7: Validate features (exclude missing/constant)
   b. Separate known bots → log as KNOWN_BOT
   c. Filter human baseline (asn_label='human', min 500 sessions)
   d. Load or train Isolation Forest model
   e. A1: Check concept drift (KS test on features)
   f. Score unknown traffic
   g. A10: Normalize scores to [-1, 0]
   h. A2: Compute adaptive threshold = min(percentile_5, ANOMALY_THRESHOLD)
   i. A6: Apply recurrence weighting
   j. Filter scores below threshold
   k. A4: SHAP explainability (top 5 features)
   l. A8: DBSCAN clustering (campaign detection)
6. Concatenate results, deduplicate by src_ip (keep lowest score)
7. A5: Deduplication with TTL (skip recently reported IPs)
8. Insert into ml_detected_anomalies + ml_all_scores
```

## Concept Drift Detection (A1)

The service uses the **Kolmogorov-Smirnov test** to compare feature distributions between the current data and the training data. If the fraction of drifted features exceeds `DRIFT_THRESHOLD` (default: 0.30), the model is retrained.

---

## Module architecture

The service is split into **11 modules**, organized as follows:

```
__main__.py                    Entry point (python -m bot_detector)
  └─ cycle.py                  Main loop: ClickHouse query → pipeline → insert
       ├─ config.py            Environment variables, availability flags
       ├─ log.py               Structured JSON logging (structlog + RotatingFileHandler)
       ├─ infra.py             ClickHouse client (via ja4_common), HTTP health check, graceful shutdown
       ├─ preprocessing.py     DataFrame cleaning, imputation, feature lists
       │    └─ browser.py      Multifactor browser identification (5 axes)
       ├─ pipeline.py          Orchestration: filtering → training → scoring → fusion
       │    ├─ models.py       EIF, TrafficAutoEncoder (PyTorch), XGBoost
       │    └─ scoring.py      Normalization, adaptive threshold, SHAP, HDBSCAN, drift
       └─ (inserts into ml_all_scores + ml_detected_anomalies)
```

| Module | Lines | Role |
|--------|-------|------|
| `config.py` | 154 | All configuration via `os.getenv()`, library availability flags |
| `log.py` | 65 | `log_info()`, `log_decision()`, `append_training_history()`: rotating JSONL |
| `infra.py` | 89 | ClickHouse client (delegates to `ja4_common`), `score_to_threat_level()`, health server in a daemon thread |
| `browser.py` | 170 | Multifactor browser detection on 5 weighted axes |
| `scoring.py` | 279 | Normalization, adaptive threshold, SHAP top-3, HDBSCAN, drift detection |
| `models.py` | 478 | `TrafficAutoEncoder`, EIF training/loading, XGBoost, feature pruning |
| `preprocessing.py` | 117 | `preprocess_df()`: cleaning, typing, imputation, `FEATURES` / `FEATURES_COMPLET` lists |
| `pipeline.py` | 378 | `run_semi_supervised_logic()`: full orchestration of one model |
| `cycle.py` | 371 | `fetch_and_analyze()`: main loop, SOC feedback, multiwindow |
| `__main__.py` | 41 | Entry point, startup banner, `while True` loop |
| `__init__.py` | 1 | Package docstring |

---

## SHAP Explainability (A4)

When enabled (`ENABLE_SHAP=true`), the service computes SHAP values for each detected anomaly using `shap.TreeExplainer`. The top 5 contributing features are stored in the `reason` field.

## DBSCAN Clustering (A8)

When enabled (`ENABLE_CLUSTERING=true`), the service applies DBSCAN on anomaly feature vectors to group related anomalies into campaigns. Each anomaly gets a `campaign_id` (-1 = no cluster).

## Configuration

All configuration is read via `os.getenv()` in `config.py`. There is no YAML file and no pydantic-settings.

### ClickHouse connection

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | str | `clickhouse` | ClickHouse server hostname (via `ja4_common`) |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port (via `ja4_common`) |
| `CLICKHOUSE_USER` | str | `admin` | ClickHouse user (via `ja4_common`) |
| `CLICKHOUSE_PASSWORD` | str | `""` | ClickHouse password (via `ja4_common`) |
| `CLICKHOUSE_DB_PROCESSING` | str | `ja4_processing` | ML/aggregation database (fallback: `CLICKHOUSE_DB`) |
| `CLICKHOUSE_DB_LOGS` | str | `ja4_logs` | HTTP logs database |
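
As a rough illustration of the environment-driven configuration described above, the sketch below reads the ClickHouse settings with the documented defaults. The helper name `load_clickhouse_config`, the returned dictionary keys, and the exact fallback order for `CLICKHOUSE_DB_PROCESSING` are assumptions made for this example; the real `config.py` may be organized differently.

```python
import os
from typing import Mapping, Optional


def load_clickhouse_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read ClickHouse settings from environment variables (hypothetical helper)."""
    # Accept an explicit mapping so the function can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    return {
        "host": env.get("CLICKHOUSE_HOST", "clickhouse"),
        "port": int(env.get("CLICKHOUSE_PORT", "8123")),
        "user": env.get("CLICKHOUSE_USER", "admin"),
        "password": env.get("CLICKHOUSE_PASSWORD", ""),
        # Assumed fallback order: CLICKHOUSE_DB_PROCESSING, then the legacy
        # CLICKHOUSE_DB, then the documented default.
        "db_processing": (
            env.get("CLICKHOUSE_DB_PROCESSING")
            or env.get("CLICKHOUSE_DB")
            or "ja4_processing"
        ),
        "db_logs": env.get("CLICKHOUSE_DB_LOGS", "ja4_logs"),
    }
```

Calling `load_clickhouse_config({})` returns the defaults from the table, while passing `{"CLICKHOUSE_DB": "legacy"}` exercises the fallback path.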

## Anubis Bot-Rule Enrichment

The `view_ai_features_1h` view enriches each IP with Anubis bot detection using a priority cascade:
1. **UA + IP combined** (same `rule_id`), highest confidence
2. **UA only** (no IP requirement)
3. **IP only** (no UA requirement)
4. **ASN match**
5. **Country match**

### Model and training

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `N_ESTIMATORS` | int | `300` | Number of trees for the Extended Isolation Forest |
| `ISOLATION_CONTAMINATION` | float | `0.001` | EIF contamination parameter (range (0, 0.5), exclusive) |
| `ANOMALY_THRESHOLD` | float | `-0.05` | Raw-score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for the adaptive threshold |
| `MODEL_DIR` | str | `/var/lib/bot_detector` | Model persistence directory |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions kept |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | EIF retraining interval (hours) |
| `DRIFT_THRESHOLD` | float | `0.30` | KS drift threshold (fraction of drifted features) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Minimum ratio of valid features required to train |
| `PRUNE_VARIANCE_THRESHOLD` | float | `1e-6` | Variance threshold for feature pruning |
| `VAL_ANOMALY_GATE` | float | `0.20` | Safeguard: maximum anomaly rate on validation |

## Environment Variables

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse server hostname |
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
| `CLICKHOUSE_DB` | string | `ja4_processing` | Database name |
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse username |
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
| `ISOLATION_CONTAMINATION` | float | `0.02` | Contamination parameter for Isolation Forest |
| `ANOMALY_THRESHOLD` | float | `-0.03` | Score threshold for anomaly detection |
| `ANOMALY_PERCENTILE` | int | `5` | Percentile for adaptive threshold (A2) |
| `CYCLE_INTERVAL_SEC` | int | `300` | Seconds between detection cycles |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Max consecutive failures before exit |
| `BOT_DETECTOR_LOG` | string | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log backups |
| `MODEL_DIR` | string | `/var/lib/bot_detector` | Model persistence directory |
| `RETRAIN_INTERVAL_HOURS` | int | `24` | Hours between model retraining |
| `MODEL_HISTORY_COUNT` | int | `10` | Number of model versions to keep |
| `DRIFT_THRESHOLD` | float | `0.30` | KS-test drift threshold (A1) |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis (A3) |
| `MULTIWINDOW_VIEW` | string | `view_ai_features_24h` | View for multi-window mode |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (A4) |
| `DEDUP_TTL_MIN` | int | `60` | Deduplication TTL in minutes (A5) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Recurrence score weighting factor (A6) |
| `MIN_VALID_FEATURE_RATIO` | float | `0.50` | Min valid feature ratio (A7) |
| `ENABLE_CLUSTERING` | bool | `true` | Enable DBSCAN clustering (A8) |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | DBSCAN min samples per cluster |
| `HEALTH_PORT` | int | `8080` | Health check HTTP server port |

## Output Tables

### Autoencoder

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `AE_WEIGHT` | float | `0.30` | Autoencoder weight in the combined score (range (0, 1), exclusive) |
| `AE_EPOCHS` | int | `50` | Number of training epochs |
| `AE_LATENT_DIM` | int | `16` | Latent space dimension |
| `AE_LEARNING_RATE` | float | `1e-3` | Adam learning rate |

### XGBoost

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `XGB_WEIGHT` | float | `0.20` | XGBoost weight in the final score (range (0, 1), exclusive) |
| `XGB_MIN_LABELS` | int | `100` | Minimum number of SOC labels required to enable XGBoost |
| `XGB_RETRAIN_INTERVAL_HOURS` | int | `168` | XGBoost retraining interval (7 days) |

### Browser detection

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `BROWSER_CONFIDENCE_THRESHOLD` | float | `0.55` | Minimum confidence to classify a session as `LEGITIMATE_BROWSER` |
| `BROWSER_COHORT_RATIO` | float | `0.70` | If ≥ 70% of a JA4's sessions are browsers → propagation |

### Clustering and explainability

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `ENABLE_CLUSTERING` | bool | `true` | Enable HDBSCAN clustering |
| `CLUSTERING_MIN_SAMPLES` | int | `3` | Minimum cluster size |
| `ENABLE_SHAP` | bool | `true` | Enable SHAP explainability (requires `shap` to be installed) |

### Cycle and operations

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `CYCLE_INTERVAL_SEC` | int | `300` | Interval between cycles (seconds) |
| `MAX_CONSECUTIVE_FAILURES` | int | `3` | Consecutive failures before `healthy=False` |
| `DEDUP_TTL_MIN` | int | `60` | Cross-cycle deduplication TTL (minutes) |
| `RECURRENCE_WEIGHT` | float | `0.005` | Weight of recurrence in the raw score |
| `ENABLE_MULTIWINDOW` | bool | `false` | Enable 24h multi-window analysis |
| `MULTIWINDOW_VIEW` | str | `view_ai_features_24h` | ClickHouse view for multi-window mode |
| `ENABLE_FEEDBACK` | bool | `true` | Enable SOC feedback integration |
| `FEEDBACK_WINDOW_DAYS` | int | `7` | SOC feedback window (days) |

### Logging and health

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `BOT_DETECTOR_LOG` | str | `/var/log/bot_detector/decisions.jsonl` | Decision log file path |
| `LOG_BACKUP_COUNT` | int | `7` | Number of rotated log files kept |
| `HEALTH_PORT` | int | `8080` | HTTP health server port |

---

## ML pipeline: three-vote ensemble

### Overview

The bot-detector uses three models in parallel, combined through configurable weighting:

```
┌──────────────────────┐
│ Extended Isolation   │
│ Forest (isotree)     │──→ eif_norm (0–1)
└──────────────────────┘         │
                                 │ × (1 − AE_WEIGHT)
┌──────────────────────┐         │
│ TrafficAutoEncoder   │         ├──→ combined_norm
│ (PyTorch)            │──→ ae_norm (0–1)
└──────────────────────┘    × AE_WEIGHT
                                 │ × (1 − XGB_WEIGHT)
┌──────────────────────┐         │
│ XGBoost              │         ├──→ anomaly_score
│ (supervised, SOC     │──→ xgb_prob (0–1)
│  labels)             │    × XGB_WEIGHT
└──────────────────────┘
```

**Final score formula:**

```
combined_norm = (1 − AE_WEIGHT) × eif_norm + AE_WEIGHT × ae_norm
anomaly_score = (1 − XGB_WEIGHT) × combined_norm + XGB_WEIGHT × xgb_prob
```

With the default weights (`AE_WEIGHT=0.30`, `XGB_WEIGHT=0.20`):

```
anomaly_score = 0.56 × eif_norm + 0.24 × ae_norm + 0.20 × xgb_prob
```
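
The two-stage weighting above can be checked with a few lines of Python. This is a minimal sketch of the formula only; the function names `combine_scores` and `effective_weights` are illustrative and are not taken from the service's code.

```python
def combine_scores(eif_norm: float, ae_norm: float, xgb_prob: float,
                   ae_weight: float = 0.30, xgb_weight: float = 0.20) -> float:
    """Blend the three normalized votes (each in [0, 1]) into one score."""
    # Stage 1: mix the two unsupervised votes.
    combined_norm = (1 - ae_weight) * eif_norm + ae_weight * ae_norm
    # Stage 2: mix the result with the supervised XGBoost probability.
    return (1 - xgb_weight) * combined_norm + xgb_weight * xgb_prob


def effective_weights(ae_weight: float = 0.30, xgb_weight: float = 0.20):
    """Return the flattened (EIF, AE, XGBoost) weights implied by the two stages."""
    return (
        (1 - xgb_weight) * (1 - ae_weight),
        (1 - xgb_weight) * ae_weight,
        xgb_weight,
    )
```

With the defaults, `effective_weights()` reproduces the 0.56 / 0.24 / 0.20 split, and the three weights always sum to 1 whatever values are configured.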

### Dual architecture

Two independent models run on every cycle:

| Model | Condition | Features | Data |
|-------|-----------|----------|------|
| **Complet** | `correlated = 1` | 77 (`FEATURES_COMPLET`) | HTTP + TCP + TLS (L3→L7) |
| **Applicatif** | `correlated = 0` | 65 (`FEATURES`) | HTTP only (pure L7) |

In multi-window mode (`ENABLE_MULTIWINDOW=true`), two additional variants run on the 24h view: `Complet_24h` and `Applicatif_24h`.

### Extended Isolation Forest (EIF)

Main unsupervised model. It uses `isotree.IsolationForest`:

```python
isotree.IsolationForest(
    ntrees=300,  # N_ESTIMATORS
    ndim=min(3, n_features),
    sample_size='auto',
    missing_action='impute',
    random_seed=42,
    nthreads=-1,
)
```

**Fallback** if `isotree` is not available: `sklearn.ensemble.IsolationForest(n_estimators=300, contamination=CONTAMINATION)`.

**Calibration**: the raw isotree score (∈ [0, 1], > 0.5 = anomalous) is converted to the sklearn convention: `sklearn_equiv = 0.5 − isotree_score`.
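
The calibration step is a simple sign flip around 0.5, so that negative values mean "anomalous" as in scikit-learn. A one-function sketch (the name `isotree_to_sklearn` is illustrative, not the service's own):

```python
def isotree_to_sklearn(isotree_score: float) -> float:
    """Map an isotree score in [0, 1] to the sklearn convention.

    isotree: values above 0.5 are anomalous.
    sklearn: values below 0 are anomalous.
    """
    return 0.5 - isotree_score
```

A score of 0.8 therefore becomes roughly −0.3, while 0.2 becomes roughly +0.3 and is treated as normal.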

### TrafficAutoEncoder (PyTorch)

Symmetric encoder-decoder architecture:

```
Encoder: n_features → dim1 → dim2 → 16 (latent)
Decoder: 16 → dim2 → dim1 → n_features

dim1 = min(64, max(n_features, latent_dim + 4))
dim2 = min(32, max(dim1 // 2, latent_dim + 2))
```

- Activations: `ReLU` + `BatchNorm1d` on the hidden layers, `Sigmoid` at the decoder output
- Optimizer: `Adam(lr=1e-3, weight_decay=1e-5)`
- Loss: `MSELoss`
- Training: 50 epochs, batch_size=256
- Score: per-sample MSE reconstruction error
- Input normalization: min-max [0, 1] per feature
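
To see what the layer-size formulas give in practice, the sketch below evaluates them for a few feature counts. It only reproduces the arithmetic; the helper name `layer_dims` is an assumption, and no PyTorch code is involved.

```python
def layer_dims(n_features: int, latent_dim: int = 16) -> tuple:
    """Return (dim1, dim2), the hidden-layer widths implied by the formulas."""
    dim1 = min(64, max(n_features, latent_dim + 4))
    dim2 = min(32, max(dim1 // 2, latent_dim + 2))
    return dim1, dim2


def encoder_shape(n_features: int, latent_dim: int = 16) -> list:
    """Full list of encoder layer widths; the decoder mirrors it in reverse."""
    dim1, dim2 = layer_dims(n_features, latent_dim)
    return [n_features, dim1, dim2, latent_dim]
```

For both the 77-feature and 65-feature models the caps apply, giving hidden widths of 64 and 32; a small input such as 10 features gives 20 and 18, so the hidden layers never drop below the latent dimension.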

### XGBoost (supervised)

Trained on labels from SOC feedback (`soc_feedback` table):

```python
xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=max(1, n_neg / n_pos),  # computed automatically from label counts
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    n_jobs=-1,
)
```

- Positive labels (bot): `HIGH`, `CRITICAL`, `ANUBIS_DENY`, `KNOWN_BOT`
- Negative labels (legitimate): `NORMAL`, `LEGITIMATE_BROWSER`
- Activation requires ≥ `XGB_MIN_LABELS` (100) labels
- Retraining every `XGB_RETRAIN_INTERVAL_HOURS` (168h = 7 days)
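
The label mapping and class-weight rule above can be sketched without XGBoost itself. In this example the helper name `prepare_targets`, the choice to ignore labels outside the two listed sets, and the requirement that both classes be present are assumptions for illustration.

```python
POSITIVE_LABELS = {"HIGH", "CRITICAL", "ANUBIS_DENY", "KNOWN_BOT"}
NEGATIVE_LABELS = {"NORMAL", "LEGITIMATE_BROWSER"}


def prepare_targets(labels, min_labels: int = 100):
    """Turn SOC labels into binary targets plus the XGBoost class weight.

    Returns (targets, scale_pos_weight, enabled).
    """
    # Assumption: labels outside the two documented sets are skipped.
    targets = [
        1 if label in POSITIVE_LABELS else 0
        for label in labels
        if label in POSITIVE_LABELS or label in NEGATIVE_LABELS
    ]
    n_pos = sum(targets)
    n_neg = len(targets) - n_pos
    # Documented rule: max(1, n_neg / n_pos); guard against division by zero.
    scale_pos_weight = max(1.0, n_neg / n_pos) if n_pos else 1.0
    # Documented gate: at least min_labels usable labels. Requiring both
    # classes to be present is an extra assumption of this sketch.
    enabled = len(targets) >= min_labels and n_pos > 0 and n_neg > 0
    return targets, scale_pos_weight, enabled
```

With 20 positive and 80 negative labels, the weight is 4.0 and the model is enabled; with fewer than `XGB_MIN_LABELS` usable labels it stays disabled.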

---

## Multifactor browser detection

The `browser.py` module classifies each session along 5 weighted axes:

| Axis | Key | Weight | Components |
|------|-----|--------|------------|
| **1. Known JA4** | `axis_ja4_known` | 0.25 | Browser family identified in `dict_browser_ja4` → 1.0, otherwise 0.0 |
| **2. JA4 structure** | `axis_ja4_struct` | 0.15 | TLS 1.3 (×0.35), h2/h3 (×0.25), cipher count 10–25 (×0.20), extension count 10–25 (×0.20) |
| **3. Modern HTTP** | `axis_http_modern` | 0.25 | modern_browser_score ≥ 50 (×0.35), Accept-Language (×0.20), Sec-Fetch < 0.3 (×0.25), generic_accept < 0.3 (×0.10), no ua_ch_mismatch (×0.10) |
| **4. Navigation behavior** | `axis_nav_behavior` | 0.15 | has_cookie (×0.25), has_referer (×0.25), asset_ratio > 0.15 (×0.25), direct_access < 0.5 (×0.25) |
| **5. TLS/TCP coherence** | `axis_tls_coherence` | 0.20 | No alpn_mismatch (×0.25), window_scale OK (×0.20), tls12 < 0.1 (×0.20), no http10 (×0.15), ALPN present (×0.20) |

**Threshold**: `browser_confidence ≥ 0.55` + identified family → `LEGITIMATE_BROWSER`
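
The sketch below shows one plausible way the five axis scores could be combined and thresholded. It assumes `browser_confidence` is the weighted sum of the axis scores using the weights in the table, which the text implies but does not state outright; the function names are illustrative.

```python
AXIS_WEIGHTS = {
    "axis_ja4_known": 0.25,
    "axis_ja4_struct": 0.15,
    "axis_http_modern": 0.25,
    "axis_nav_behavior": 0.15,
    "axis_tls_coherence": 0.20,
}


def browser_confidence(axes: dict) -> float:
    """Weighted sum of axis scores (each expected in [0, 1]); missing axes count as 0."""
    return sum(weight * axes.get(key, 0.0) for key, weight in AXIS_WEIGHTS.items())


def classify_browser(axes: dict, family, threshold: float = 0.55):
    """Return 'LEGITIMATE_BROWSER' when confidence and family both qualify, else None."""
    if family and browser_confidence(axes) >= threshold:
        return "LEGITIMATE_BROWSER"
    return None
```

Under this assumption, a session whose JA4 is unknown (axis 1 at 0.0) can reach at most 0.75, and a session scoring 1.0 on only axes 1 and 3 reaches 0.50, just under the default threshold.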

**Cohort propagation**: if ≥ 70% of the sessions sharing a JA4 are classified as browsers, the remaining `NORMAL`/`LOW` sessions with the same JA4 are also classified `LEGITIMATE_BROWSER`.
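
Cohort propagation can be pictured as a two-pass operation: count the browser share per JA4, then relabel the eligible sessions. The session shape used here (dicts with `ja4` and `label` keys) and the function name are hypothetical; the real implementation presumably works on DataFrame columns.

```python
from collections import defaultdict


def propagate_cohort(sessions, ratio: float = 0.70):
    """Relabel NORMAL/LOW sessions whose JA4 cohort is mostly browsers."""
    total = defaultdict(int)
    browsers = defaultdict(int)
    # Pass 1: per-JA4 totals and browser counts.
    for session in sessions:
        total[session["ja4"]] += 1
        if session["label"] == "LEGITIMATE_BROWSER":
            browsers[session["ja4"]] += 1
    # Pass 2: propagate the label inside cohorts that meet the ratio.
    result = []
    for session in sessions:
        share = browsers[session["ja4"]] / total[session["ja4"]]
        if share >= ratio and session["label"] in ("NORMAL", "LOW"):
            session = {**session, "label": "LEGITIMATE_BROWSER"}
        result.append(session)
    return result
```

Sessions with other labels (for example `HIGH`) are left untouched even when their cohort qualifies, matching the rule that only `NORMAL`/`LOW` sessions are relabeled.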

**Family inference**: for unknown JA4s, the family is inferred by structural matching against the `_BROWSER_JA4_PROFILES` profiles (Chromium, Firefox, Safari, Tor_Browser); this requires `browser_confidence ≥ 0.45`.

---

## Scoring and normalization

### Score normalization (`normalize_scores`)

Raw scores (negative = anomalous) are mapped to **[0, 1]**, with **1 = most anomalous**:

```
result[mask] = clip(−scores / (−s_min + 1e-9), 0, 1)
```

Scores ≥ 0 are set to 0.
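
A pure-Python reading of the normalization formula is sketched below. It assumes `mask` selects the negative scores and `s_min` is the most negative score in the batch; the actual `normalize_scores` most likely operates on NumPy arrays.

```python
def normalize_scores(scores):
    """Map raw scores (negative = anomalous) to [0, 1], 1 = most anomalous."""
    negatives = [s for s in scores if s < 0]
    if not negatives:
        # Nothing anomalous: every score maps to 0.
        return [0.0 for _ in scores]
    s_min = min(negatives)  # most anomalous raw score
    normalized = []
    for s in scores:
        if s < 0:
            value = -s / (-s_min + 1e-9)  # small epsilon avoids division by zero
            normalized.append(min(1.0, max(0.0, value)))
        else:
            normalized.append(0.0)
    return normalized
```

The most negative score in a batch lands just below 1.0 (because of the epsilon), and a score half as negative lands near 0.5.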

### Adaptive threshold (`compute_adaptive_threshold`)

```
threshold = min(percentile(negative_scores, ANOMALY_PERCENTILE), ANOMALY_THRESHOLD)
```

With `ANOMALY_PERCENTILE=5` and `ANOMALY_THRESHOLD=-0.05`.
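
The threshold rule can be illustrated with a small standard-library sketch. The `percentile` helper uses linear interpolation (NumPy's default method); returning `ANOMALY_THRESHOLD` when there are no negative scores is an assumption of this example.

```python
def percentile(values, p: float) -> float:
    """Linear-interpolation percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])


def compute_adaptive_threshold(scores, pct: float = 5, fixed: float = -0.05) -> float:
    """Return the stricter of the data-driven percentile and the fixed threshold."""
    negatives = [s for s in scores if s < 0]
    if not negatives:
        return fixed  # assumption: fall back to the fixed threshold
    return min(percentile(negatives, pct), fixed)
```

Because of the `min`, the adaptive threshold can only be stricter (more negative) than `ANOMALY_THRESHOLD`: a batch with strongly negative scores pulls it down, while a batch of mildly negative scores leaves it at −0.05.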

### Recurrence penalty

Applied to `raw_anomaly_score`:

```
raw_anomaly_score −= log1p(recurrence_count) × RECURRENCE_WEIGHT
```
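
The penalty grows logarithmically with the number of past detections, so repeat offenders are pushed further into the anomalous range without unbounded growth. A minimal sketch (the function name is illustrative):

```python
import math


def apply_recurrence_penalty(raw_anomaly_score: float, recurrence_count: int,
                             recurrence_weight: float = 0.005) -> float:
    """Lower (make more anomalous) the raw score for recurring sources."""
    return raw_anomaly_score - math.log1p(recurrence_count) * recurrence_weight
```

With the default weight, a source seen 9 times before has its score lowered by `ln(10) × 0.005`, about 0.0115; a first-time source is unaffected.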

### Threat levels

| Raw score range | Level | Interpretation |
|-----------------|-------|----------------|
| `< −0.30` | **CRITICAL** | Extremely anomalous behavior |
| `< −0.15` | **HIGH** | Strong anomaly signal |
| `< −0.05` | **MEDIUM** | Moderate anomaly |
| `< 0` | **LOW** | Slightly unusual |
| `≥ 0` | **NORMAL** | Normal traffic |
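
The table reads top to bottom as a chain of strict comparisons. The sketch below mirrors it; it borrows the name `score_to_threat_level` from the module table, but the body is a reconstruction from the documented ranges, not the code in `infra.py`.

```python
def score_to_threat_level(raw_score: float) -> str:
    """Map a raw anomaly score to a threat level using the documented ranges."""
    if raw_score < -0.30:
        return "CRITICAL"
    if raw_score < -0.15:
        return "HIGH"
    if raw_score < -0.05:
        return "MEDIUM"
    if raw_score < 0:
        return "LOW"
    return "NORMAL"
```

Since the comparisons are strict, a score of exactly −0.30 is `HIGH` rather than `CRITICAL`, and exactly 0 is `NORMAL`.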

---

## HDBSCAN campaign clustering

When `ENABLE_CLUSTERING=true`, anomalies are grouped into campaigns by HDBSCAN:

```python
hdbscan.HDBSCAN(
    min_cluster_size=CLUSTERING_MIN_SAMPLES,  # default: 3
    min_samples=max(2, CLUSTERING_MIN_SAMPLES - 1),  # default: 2
    cluster_selection_method='eom',
)
```

**Clustering space**: if an Autoencoder is available, clustering runs in the AE's **16-dim latent space**. Otherwise, `StandardScaler` is applied to the raw features.

**Fallback** if `hdbscan` is not available: `DBSCAN(eps=0.5, min_samples=CLUSTERING_MIN_SAMPLES)`.

Each anomaly receives a `campaign_id` (−1 = no cluster).
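
The backend choice and its parameters can be sketched without importing either clustering library. The example below only decides which backend would be used and with which arguments; the function name `clustering_backend` and the use of `importlib.util.find_spec` for the availability check are assumptions, since the source only says that `config.py` exposes availability flags.

```python
from importlib.util import find_spec


def clustering_backend(min_samples: int = 3):
    """Return (backend_name, constructor_kwargs) for campaign clustering."""
    if find_spec("hdbscan") is not None:
        # Preferred backend: HDBSCAN with the documented parameters.
        return "hdbscan", {
            "min_cluster_size": min_samples,
            "min_samples": max(2, min_samples - 1),
            "cluster_selection_method": "eom",
        }
    # Documented fallback: scikit-learn DBSCAN with a fixed eps.
    return "dbscan", {"eps": 0.5, "min_samples": min_samples}
```

The returned keyword arguments could then be passed to `hdbscan.HDBSCAN(**kwargs)` or `sklearn.cluster.DBSCAN(**kwargs)` respectively.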

---

## Feature list

### Common features: Applicatif model (65 features)

#### HTTP behavior

| Feature | Description |
|---------|-------------|
| `hits` | Number of requests in the window |
| `hit_velocity` | Requests per second |
| `fuzzing_index` | Path/parameter diversity score |
| `post_ratio` | Fraction of POST requests |
| `port_exhaustion_ratio` | Fraction of distinct source ports / total |
| `orphan_ratio` | Requests without TLS correlation |
| `head_ratio` | Fraction of HEAD requests |
| `http10_ratio` | Fraction of HTTP/1.0 requests |
| `generic_accept_ratio` | Fraction of short Accept headers |
| `sec_fetch_absence_rate` | Fraction without Sec-Fetch-Site |
| `missing_accept_enc_ratio` | Fraction without Accept-Encoding |
| `http_scheme_ratio` | Fraction using HTTP (not HTTPS) |

#### Connection management

| Feature | Description |
|---------|-------------|
| `max_keepalives` | Max requests on a single Keep-Alive connection |
| `tcp_shared_count` | TCP connections shared between sessions |
| `multiplexing_efficiency` | HTTP/2 multiplexing efficiency |

#### Browser fingerprint

| Feature | Description |
|---------|-------------|
| `header_count` | Number of HTTP headers sent |
| `has_accept_language` | Presence of the Accept-Language header |
| `has_cookie` | Presence of the Cookie header |
| `has_referer` | Presence of the Referer header |
| `modern_browser_score` | Composite browser compliance score (0–100) |
| `ua_ch_mismatch` | User-Agent vs Client Hints inconsistency |
| `ip_id_zero_ratio` | IP packets with ID=0 (minimal/headless stack) |
| `header_order_shared_count` | IPs sharing the same header order |
| `header_order_confidence` | Normalized entropy of the header order |
| `distinct_header_orders` | Distinct header orders per IP |
| `is_fake_navigation` | Sec-Fetch-Mode=navigate with a non-document destination |

#### Navigation behavior

| Feature | Description |
|---------|-------------|
| `request_size_variance` | Variance of request sizes |
| `mss_mobile_mismatch` | TCP MSS vs mobile profile inconsistency |
| `asset_ratio` | Fraction of static asset requests |
| `direct_access_ratio` | Direct accesses (no referer) |
| `is_ua_rotating` | User-Agent rotation detected |
| `distinct_ja4_count` | Distinct JA4 fingerprints per IP |
| `anomalous_payload_ratio` | Fraction of anomalous payloads |

#### Concentration and rarity

| Feature | Description |
|---------|-------------|
| `src_port_density` | Source port entropy |
| `ja4_asn_concentration` | JA4 concentration within the ASN |
| `ja4_country_concentration` | JA4 concentration per country |
| `is_rare_ja4` | Rare JA4 fingerprint (< 100 total hits) |

#### Temporal and diversity

| Feature | Description |
|---------|-------------|
| `temporal_entropy` | Entropy of the temporal distribution |
| `path_diversity_ratio` | URL path diversity |
| `url_depth_variance` | URL depth variance |
|
||||||
|
|
||||||
|
#### Anubis

| Feature | Description |
|---------|-------------|
| `anubis_is_flagged` | Anubis suspicion signal (bot detected, action neither ALLOW, DENY, nor empty) |

#### Multifactor browser

| Feature | Description |
|---------|-------------|
| `is_known_browser` | JA4 matches a known browser |
| `browser_consistency_score` | Composite browser-consistency score |
| `browser_confidence` | Overall confidence of the browser identification |
| `axis_ja4_known` | Score of axis 1 (known JA4) |
| `axis_ja4_struct` | Score of axis 2 (JA4 structure) |
| `axis_http_modern` | Score of axis 3 (modern HTTP) |
| `axis_nav_behavior` | Score of axis 4 (navigation behavior) |
| `axis_tls_coherence` | Score of axis 5 (TLS/TCP coherence) |

#### Thesis §5: advanced features

| Feature | Description |
|---------|-------------|
| `path_transition_entropy` | Entropy of path transitions |
| `cadence_cv` | Coefficient of variation of the request cadence |
| `burst_ratio` | Fraction of requests sent in bursts |
| `pause_ratio` | Fraction of long pauses |
| `lag1_autocorrelation` | Lag-1 autocorrelation of inter-arrival times |
| `benford_deviation` | Deviation from Benford's law |
| `host_diversity` | Diversity of targeted hosts |
| `host_sweep_speed` | Host sweep speed |
| `host_coverage_uniformity` | Uniformity of host coverage |

#### TCP window features

| Feature | Description |
|---------|-------------|
| `true_window_size` | Actual TCP window size |
| `window_mss_ratio` | TCP window / MSS ratio |

#### Advanced detection features

| Feature | Description |
|---------|-------------|
| `has_xff` | Presence of the X-Forwarded-For header |
| `unusual_content_type_ratio` | Fraction of unusual Content-Type values |
| `non_standard_port_ratio` | Fraction of non-standard ports |
| `login_post_concentration` | Concentration of POST requests on login pages |
| `sec_ch_mobile_mismatch` | Sec-CH-UA-Mobile inconsistency |

### Additional features: Complet model (+12 TCP/TLS features)

| Feature | Description |
|---------|-------------|
| `tcp_jitter_variance` | Variance of the inter-packet TCP jitter |
| `alpn_http_mismatch` | ALPN inconsistent with the actual HTTP protocol |
| `is_alpn_missing` | ALPN absent from the ClientHello |
| `sni_host_mismatch` | TLS SNI inconsistent with the HTTP Host header |
| `ja3_diversity_ratio` | JA3 diversity ratio per IP |
| `syn_timing_cv` | Coefficient of variation of SYN timing |
| `tls12_ratio` | Fraction of TLS 1.2 connections |
| `ip_df_variance` | Variance of the IP Don't-Fragment flag |
| `avg_ttl` | Mean IP TTL (OS fingerprint) |
| `ttl_std` | Standard deviation of the TTL |
| `no_window_scale_ratio` | Fraction of connections without TCP Window Scale |
| `ja4_drift_ratio` | Intra-session JA4 drift (§5.5) |

---

## Detection pipeline

```
1. Query view_ai_features_1h → DataFrame
2. Optional enrichment from view_thesis_features_1h (thesis §5 features)
3. Preprocessing: preprocess_df() (cleaning, browser axes, imputation)
4. Load SOC feedback → reclassification
5. Load the recurrence map (view_ip_recurrence)
6. Split by correlated = 1 / correlated = 0
7. For each model (Complet, Applicatif):
   a. Feature validation (exclude constant/missing features)
   b. Split out known bots → log as KNOWN_BOT
   c. Filter the human baseline (asn_label = 'human')
   d. Load or train EIF + AE
   e. Score unknown traffic (EIF + AE)
   f. Load or train XGBoost (if labels are available)
   g. Combine the scores (triple-vote formula)
   h. Normalize to [0, 1]
   i. Adaptive threshold
   j. Recurrence penalty
   k. SHAP (top-3 features)
   l. HDBSCAN clustering → campaign_id
   m. Drift detection (KS test)
8. Multi-window mode (if enabled): same steps on view_ai_features_24h
9. Insert → ml_all_scores (all scored sessions)
10. Intra-cycle deduplication (keep the lowest raw_anomaly_score per IP)
11. Inter-cycle deduplication (TTL, skip if detected recently unless worsened by ≥ 0.05)
12. Insert → ml_detected_anomalies (filtered anomalies)
```
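
Steps 10 and 11 of the pipeline can be sketched as below. This is a minimal illustration under stated assumptions, not the service's code: the row layout, the function names, the one-hour TTL and the choice to measure worsening on the normalized `anomaly_score` are hypothetical. Only the "lowest `raw_anomaly_score` per IP" rule and the 0.05 margin come from the pipeline listing.

```python
from datetime import datetime, timedelta

WORSENING_MARGIN = 0.05         # margin quoted in step 11
DEDUP_TTL = timedelta(hours=1)  # hypothetical: the text does not give the TTL


def dedup_intra_cycle(rows):
    """Keep, for each src_ip, the row with the lowest raw_anomaly_score."""
    best = {}
    for row in rows:
        kept = best.get(row["src_ip"])
        if kept is None or row["raw_anomaly_score"] < kept["raw_anomaly_score"]:
            best[row["src_ip"]] = row
    return list(best.values())


def dedup_inter_cycle(rows, last_seen, now, ttl=DEDUP_TTL, margin=WORSENING_MARGIN):
    """Drop IPs detected within `ttl`, unless their score worsened by >= margin.

    `last_seen` maps src_ip -> (detected_at, anomaly_score) from earlier cycles.
    Worsening is assumed to mean a higher normalized anomaly_score.
    """
    kept = []
    for row in rows:
        previous = last_seen.get(row["src_ip"])
        if previous is not None:
            detected_at, old_score = previous
            recent = now - detected_at < ttl
            worsened = row["anomaly_score"] - old_score >= margin
            if recent and not worsened:
                continue  # already reported recently and not worse: skip
        kept.append(row)
    return kept
```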

---

## Drift detection (Kolmogorov-Smirnov)

For each feature, a **two-sided KS test** compares the current distribution with the training distribution (reconstructed by interpolation from a p10/p25/p50/p75/p90 quantile digest):

- A feature is considered drifted if `p_value < 0.05`
- Overall drift = fraction of drifted features
- If `drift > DRIFT_THRESHOLD` (0.30) → automatic retraining

**Fallback** (without `scipy`): Z-score method. A feature is considered drifted if `|current_mean − training_mean| / training_std > 2.0`.
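
The Z-score fallback and the retraining decision can be sketched with the standard library alone. The function names and the dictionary layouts below are hypothetical; the 2.0 limit and the 0.30 threshold are the values quoted in this section. The KS path is not shown here.

```python
from statistics import mean

DRIFT_THRESHOLD = 0.30  # default quoted in the text
Z_LIMIT = 2.0           # Z-score limit of the fallback


def drifted_features(current, training_stats, z_limit=Z_LIMIT):
    """Return the features whose current mean deviates by more than z_limit.

    current:        {feature: [values observed in this cycle]}
    training_stats: {feature: (training_mean, training_std)}
    """
    drifted = []
    for name, (train_mean, train_std) in training_stats.items():
        values = current.get(name)
        if not values or train_std <= 0:
            continue  # nothing to compare, or a constant training feature
        z_score = abs(mean(values) - train_mean) / train_std
        if z_score > z_limit:
            drifted.append(name)
    return drifted


def drift_fraction(current, training_stats):
    """Overall drift: fraction of training features flagged as drifted."""
    if not training_stats:
        return 0.0
    return len(drifted_features(current, training_stats)) / len(training_stats)


def needs_retraining(current, training_stats, threshold=DRIFT_THRESHOLD):
    """True when the overall drift exceeds the threshold."""
    return drift_fraction(current, training_stats) > threshold
```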

---

## SHAP explainability

When `ENABLE_SHAP=true` and the `shap` library is available:

- SHAP values are computed with `TreeExplainer` on the EIF model
- The **3 most contributing features** are stored in the `reason` field
- Format: `"feature1=value (±shap), feature2=value (±shap), feature3=value (±shap)"`
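
As an illustration of the `reason` format, the sketch below ranks features by absolute SHAP value and formats the top three. It assumes the SHAP values have already been computed; the function name, the input dictionaries and the numeric precision are hypothetical and may differ from what the service writes.

```python
def build_reason(feature_values, shap_values, top_k=3):
    """Format the top_k features by |SHAP value| as 'name=value (+/-shap)'."""
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    parts = [
        f"{name}={feature_values[name]:.3g} ({shap_values[name]:+.3f})"
        for name in ranked[:top_k]
    ]
    return ", ".join(parts)


example = build_reason(
    {"burst_ratio": 0.82, "cadence_cv": 0.05, "avg_ttl": 64, "has_xff": 0},
    {"burst_ratio": -0.31, "cadence_cv": -0.12, "avg_ttl": 0.02, "has_xff": 0.2},
)
```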

---

## Anubis enrichment

The `view_ai_features_1h` view enriches each IP through the Anubis dictionaries, following a priority cascade:

1. **UA + IP combined** (same `rule_id`): highest confidence
2. **UA only** (no IP condition)
3. **IP only** (no UA condition)
4. **ASN match**
5. **Country match**
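
The cascade itself lives in the ClickHouse view; the Python sketch below only illustrates the precedence order. The level names and the rule dictionaries are hypothetical.

```python
# Priority order from the list above, highest confidence first.
PRIORITY = ("ua_ip", "ua", "ip", "asn", "country")


def resolve_anubis_rule(matches):
    """Return (level, rule) for the highest-priority match, or None.

    `matches` maps a level name to the matching rule (or None when that level
    did not match). Both the level names and the rule layout are assumptions.
    """
    for level in PRIORITY:
        rule = matches.get(level)
        if rule is not None:
            return level, rule
    return None
```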

---

## Output tables
|
||||||
|
|
||||||
### ml_detected_anomalies
|
### ml_detected_anomalies
|
||||||
|
|
||||||
Anomaly detections above the threat threshold. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip)`, TTL 30 days.
|
Détections d'anomalies au-dessus du seuil de menace. Engine : `ReplacingMergeTree(detected_at)`, ORDER BY `(src_ip, model_name)`, TTL 7 jours.
|
||||||
|
|
||||||
Key columns: `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus all ML features.
|
Colonnes clés : `detected_at`, `src_ip`, `ja4`, `host`, `bot_name`, `anomaly_score`, `raw_anomaly_score`, `threat_level`, `model_name`, `recurrence`, `campaign_id`, `reason`, `anubis_bot_name`, `anubis_bot_action`, `anubis_bot_category`, plus toutes les features ML.
|
||||||
|
|
||||||
### ml_all_scores
|
### ml_all_scores
|
||||||
|
|
||||||
All classifications (no threshold filter) for observability. Engine: `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 3 days.
|
Toutes les classifications (sans filtre de seuil) pour l'observabilité. Engine : `ReplacingMergeTree(detected_at)`, ORDER BY `(window_start, src_ip, ja4, host, model_name)`, TTL 7 jours.
|
||||||
|
|
||||||
## Decision Log Format
|
---
|
||||||
|
|
||||||
The `decisions.jsonl` file contains structured JSONL entries:
|
## Format du journal de décisions
|
||||||
|
|
||||||
|
Le fichier `decisions.jsonl` contient des entrées JSONL structurées :
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
|
{"event": "CYCLE_START", "cycle_id": "20260309T143000", "total": 5000, "human": 1500, "known_bot": 200, "correlated": 3000}
|
||||||
@ -224,42 +552,52 @@ The `decisions.jsonl` file contains structured JSONL entries:
|
|||||||
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
|
{"event": "CYCLE_END", "cycle_id": "20260309T143000", "anomalies": 15, "known_bots": 200, "duration_sec": 12.5}
|
||||||
```
|
```
|
||||||
|
|
||||||
Log rotation: 50 MB max size × `LOG_BACKUP_COUNT` backups (default 7).
|
Rotation des logs : 50 Mo max × `LOG_BACKUP_COUNT` sauvegardes (défaut : 7).
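
A rotation policy like this one maps onto the standard library's `RotatingFileHandler`. The sketch below is one possible wiring, not necessarily how the service does it: the function name is hypothetical, and reading "50 MB" as binary megabytes is an assumption.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

MAX_LOG_BYTES = 50 * 1024 * 1024  # "50 MB"; binary megabytes are an assumption


def build_decision_handler(path):
    """Rotating handler: 50 MB per file, LOG_BACKUP_COUNT backups (default 7)."""
    backup_count = int(os.getenv("LOG_BACKUP_COUNT", "7"))
    handler = RotatingFileHandler(
        path,
        maxBytes=MAX_LOG_BYTES,
        backupCount=backup_count,
        encoding="utf-8",
        delay=True,  # do not create the file until the first record is written
    )
    # One JSON document per line: the message is expected to be JSON already.
    handler.setFormatter(logging.Formatter("%(message)s"))
    return handler
```

With these settings the log directory holds at most `LOG_BACKUP_COUNT + 1` files of about 50 MB each.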
|
||||||
|
|
||||||
## Health Check Endpoint
|
---
|
||||||
|
|
||||||
- **URL**: `GET http://localhost:8080/`
|
## Point de santé
|
||||||
- **Response**: `200 OK` with status JSON
|
|
||||||
- Runs in a separate thread
|
|
||||||
|
|
||||||
## Model Persistence
|
- **URL** : `GET http://localhost:8080/`
|
||||||
|
- **Réponses** : `200 OK` (corps `OK`) ou `503 Service Unavailable` (corps `DEGRADED`)
|
||||||
|
- Exécuté dans un thread daemon au démarrage
|
||||||
|
- État mis à `False` après `MAX_CONSECUTIVE_FAILURES` (3) échecs consécutifs ClickHouse, remis à `True` dès le premier succès
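
The health flag described above behaves like a small state machine. A minimal sketch, assuming a hypothetical `HealthState` class that the detection loop updates after each ClickHouse call and that the HTTP thread reads:

```python
MAX_CONSECUTIVE_FAILURES = 3  # value quoted in the text


class HealthState:
    """Tracks consecutive ClickHouse failures and derives the HTTP response."""

    def __init__(self, max_failures=MAX_CONSECUTIVE_FAILURES):
        self.max_failures = max_failures
        self.failures = 0
        self.healthy = True

    def record_success(self):
        # The first success resets the counter and restores the healthy state.
        self.failures = 0
        self.healthy = True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.healthy = False

    def response(self):
        """Return (status_code, body) as served on GET /."""
        return (200, "OK") if self.healthy else (503, "DEGRADED")
```

A real implementation shared between threads would likely guard the counters with a lock; that is omitted here for brevity.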
|
||||||
|
|
||||||
| File | Description |
|
---
|
||||||
|------|-------------|
|
|
||||||
| `model_<name>_<version>.joblib` | Serialized Isolation Forest (joblib) |
|
|
||||||
| `model_<name>_<version>.meta.json` | Model metadata (features, thresholds, training stats) |
|
|
||||||
| `model_<name>.current` | Pointer to active model version |
|
|
||||||
| `training_history.jsonl` | Training history log |
|
|
||||||
|
|
||||||
Models are rotated: only the last `MODEL_HISTORY_COUNT` versions (default 10) are kept.
|
## Persistance des modèles
|
||||||
|
|
||||||
## Docker Deployment
|
| Fichier | Description |
|
||||||
|
|---------|-------------|
|
||||||
|
| `model_<name>_<version>.joblib` | EIF sérialisé (joblib) |
|
||||||
|
| `model_<name>_<version>.meta.json` | Métadonnées (features, seuils, statistiques d'entraînement, digest quantile) |
|
||||||
|
| `model_<name>.current` | Pointeur vers la version active |
|
||||||
|
| `training_history.jsonl` | Historique d'entraînement |
|
||||||
|
|
||||||
|
Rotation automatique : seules les `MODEL_HISTORY_COUNT` dernières versions (défaut : 10) sont conservées.
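
The rotation rule can be sketched as a pure function that picks the files to delete. It assumes that version strings sort chronologically (for example timestamps), which the text does not state; the function name and the regular expression are hypothetical.

```python
import re

# Matches model_<name>_<version>.joblib and model_<name>_<version>.meta.json
MODEL_FILE = re.compile(
    r"^model_(?P<name>.+)_(?P<version>[^_.]+)\.(?:joblib|meta\.json)$"
)


def files_to_prune(filenames, model_name, keep=10):
    """Return the files of every version older than the `keep` most recent."""
    by_version = {}
    for filename in filenames:
        match = MODEL_FILE.match(filename)
        if match and match.group("name") == model_name:
            by_version.setdefault(match.group("version"), []).append(filename)
    versions = sorted(by_version)  # assumed chronological order
    stale = versions[:-keep] if keep > 0 else versions
    return sorted(name for version in stale for name in by_version[version])
```

The `.current` pointer and `training_history.jsonl` never match the pattern, so they are left untouched.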
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Déploiement Docker
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build
|
# Construction de l'image
|
||||||
make build-bot-detector
|
make build-bot-detector
|
||||||
|
|
||||||
# Run with docker-compose
|
# Exécution avec docker-compose
|
||||||
cd services/bot-detector
|
cd services/bot-detector
|
||||||
docker-compose up -d
|
docker-compose up -d
|
||||||
|
|
||||||
|
# Tests
|
||||||
|
make test-bot-detector
|
||||||
```
|
```
|
||||||
|
|
||||||
### Volumes
|
### Volumes
|
||||||
|
|
||||||
| Host Path | Container Path | Description |
|
| Chemin hôte | Chemin conteneur | Description |
|
||||||
|-----------|---------------|-------------|
|
|-------------|-----------------|-------------|
|
||||||
| `./bot_detector_logs` | `/var/log/bot_detector` | Decision logs (JSONL) |
|
| `./bot_detector_logs` | `/var/log/bot_detector` | Journaux de décisions (JSONL) |
|
||||||
| `./bot_detector_models` | `/var/lib/bot_detector` | Persisted ML models |
|
| `./bot_detector_models` | `/var/lib/bot_detector` | Modèles ML persistés |
|
||||||
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Known bot IP list |
|
| `./reputation/data/user_files/bot_ip.csv` | `/data/bot_ip.csv` (ro) | Liste d'IPs de bots connus |
|
||||||
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Known bot JA4 list |
|
| `./reputation/data/user_files/bot_ja4.csv` | `/data/bot_ja4.csv` (ro) | Liste de JA4 de bots connus |
|
||||||
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | ASN reputation labels |
|
| `./reputation/data/user_files/asn_reputation.csv` | `/data/asn_reputation.csv` (ro) | Labels de réputation ASN |
|
||||||
|
|||||||
@ -1,307 +1,343 @@
|
|||||||
# Dashboard
|
# Dashboard
|
||||||
|
|
||||||
The dashboard is a SOC (Security Operations Center) web application built with FastAPI (backend) and React (frontend) that provides real-time visualization, investigation, and analysis of bot detections generated by the [bot-detector](bot-detector.md). It queries ClickHouse (`ja4_processing`) for all data.
|
Application web SOC (Security Operations Center) construite avec **FastAPI + Jinja2 + htmx**,
|
||||||
|
offrant la visualisation en temps réel, l'investigation et l'analyse des détections de bots
|
||||||
|
générées par le [bot-detector](bot-detector.md). Interroge ClickHouse sur deux bases de données
|
||||||
|
(`ja4_processing` et `ja4_logs`).
|
||||||
|
|
||||||
## Technology Stack
|
---
|
||||||
|
|
||||||
| Component | Technology |
|
## Pile technologique
|
||||||
|-----------|-----------|
|
|
||||||
|
| Composant | Technologie |
|
||||||
|
|-----------|-------------|
|
||||||
| Backend | Python 3.11 + FastAPI |
|
| Backend | Python 3.11 + FastAPI |
|
||||||
| Frontend | React + Vite |
|
| Templates | Jinja2 (rendu côté serveur) |
|
||||||
| Database | ClickHouse (via `ja4_common` shared client) |
|
| Interactions dynamiques | htmx (mises à jour partielles via JSON API) |
|
||||||
| API Docs | Swagger UI (`/docs`) and ReDoc (`/redoc`) |
|
| Graphiques | Chart.js + ECharts |
|
||||||
|
| Style | Tailwind CSS (CDN) |
|
||||||
|
| Base de données | ClickHouse via `clickhouse-connect` (client propre, **PAS** `ja4_common`) |
|
||||||
|
| Documentation API | Swagger UI (`/docs`) + OpenAPI JSON (`/openapi.json`) |
|
||||||
|
|
||||||
|
> **Note :** le dashboard n'utilise **pas** la bibliothèque partagée `ja4_common`. Il possède
|
||||||
|
> son propre client ClickHouse léger dans `backend/database.py`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Structure des fichiers
|
||||||
|
|
||||||
|
```
services/dashboard/
├── backend/
│   ├── __init__.py
│   ├── main.py               FastAPI application, templates + static mounts, CORS, health
│   ├── config.py             Environment variables, safe_identifier()
│   ├── database.py           clickhouse-connect singleton client, query(), query_scalar(), execute()
│   └── routes/
│       ├── __init__.py
│       ├── api.py            35 JSON routes (1635 lines)
│       └── pages.py          14 HTML page routes (83 lines)
├── templates/
│   ├── base.html             Base template (layout, navigation, Tailwind CDN)
│   ├── overview.html         Dashboard overview
│   ├── detections.html       Paginated table of anomaly detections
│   ├── scores.html           Paginated table of all ML scores
│   ├── traffic.html          Paginated table of raw HTTP traffic
│   ├── ip_detail.html        Detailed investigation of an IP
│   ├── ja4_detail.html       Detailed investigation of a JA4 fingerprint
│   ├── cluster_detail.html   Detailed investigation of a cluster/campaign
│   ├── campaigns.html        HDBSCAN campaigns view
│   ├── features.html         ML feature analysis
│   ├── models.html           Model metadata and performance
│   ├── classify.html         SOC classification interface
│   ├── tactics.html          Detection tactics (brute-force, rotation, recurrence)
│   ├── reflists.html         Reference lists and ClickHouse dictionaries
│   └── network.html          Campaign network graph
└── static/                   Static files (JS, CSS)
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|
||||||
| Variable | Type | Default | Description |
|
Toute la configuration est lue via `os.getenv()` dans `backend/config.py`.
|
||||||
|----------|------|---------|-------------|
|
|
||||||
| `CLICKHOUSE_HOST` | string | `clickhouse` | ClickHouse hostname |
|
|
||||||
| `CLICKHOUSE_PORT` | int | `8123` | ClickHouse HTTP port |
|
|
||||||
| `CLICKHOUSE_DB` | string | `ja4_processing` | Database name |
|
|
||||||
| `CLICKHOUSE_USER` | string | `admin` | ClickHouse user |
|
|
||||||
| `CLICKHOUSE_PASSWORD` | string | `""` | ClickHouse password |
|
|
||||||
| `API_HOST` | string | `0.0.0.0` | API listen address |
|
|
||||||
| `API_PORT` | int | `8000` | API listen port |
|
|
||||||
| `CORS_ORIGINS` | list | `["http://localhost:3000", "http://127.0.0.1:3000"]` | Allowed CORS origins |
|
|
||||||
|
|
||||||
## API Reference
|
| Variable | Type | Défaut | Description |
|
||||||
|
|----------|------|--------|-------------|
|
||||||
|
| `CLICKHOUSE_HOST` | str | `localhost` | Nom d'hôte du serveur ClickHouse |
|
||||||
|
| `CLICKHOUSE_PORT` | int | `8123` | Port HTTP ClickHouse |
|
||||||
|
| `CLICKHOUSE_USER` | str | `default` | Utilisateur ClickHouse |
|
||||||
|
| `CLICKHOUSE_PASSWORD` | str | `""` | Mot de passe ClickHouse |
|
||||||
|
| `CLICKHOUSE_DB_PROCESSING` | str | `ja4_processing` | Base de données ML/agrégations (fallback : `CLICKHOUSE_DB`) |
|
||||||
|
| `CLICKHOUSE_DB_LOGS` | str | `ja4_logs` | Base de données des logs HTTP |
|
||||||
|
| `API_HOST` | str | `0.0.0.0` | Adresse d'écoute du serveur |
|
||||||
|
| `API_PORT` | int | `8000` | Port d'écoute du serveur |
|
||||||
|
|
||||||
All endpoints are prefixed with `/api/`. The dashboard exposes **74+ endpoints** across 20 routers.
|
### Schéma dual-database
|
||||||
|
|
||||||
### Health
|
Le dashboard utilise deux bases de données ClickHouse :
|
||||||
|
|
||||||
| Method | Path | Description |
|
- **`CLICKHOUSE_DB_PROCESSING`** (`ja4_processing`) : tables ML, agrégations, dictionnaires, feedback SOC
|
||||||
|--------|------|-------------|
|
- **`CLICKHOUSE_DB_LOGS`** (`ja4_logs`) : `http_logs`, trafic brut
|
||||||
| GET | `/health` | Health check — returns ClickHouse connection status |
|
|
||||||
|
Les noms de bases sont validés par `safe_identifier()` (regex alphanumérique + underscore) avant injection dans les requêtes SQL.
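
A validator of this kind could look like the sketch below. It is an assumption based on the description above (alphanumeric characters plus underscore), not the actual code from `backend/config.py`; the error message is hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def safe_identifier(name):
    """Return `name` unchanged if it is alphanumeric/underscore only, else raise."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name
```

Because the database name is interpolated into the SQL text with an f-string, rejecting anything outside this character set keeps quotes, spaces and semicolons out of the query.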
|
||||||
|
|
||||||
|
### Dualité IPv4/IPv6
|
||||||
|
|
||||||
|
- `http_logs.src_ip` est de type `IPv4`
|
||||||
|
- Les tables ML (`ml_all_scores`, `ml_detected_anomalies`) stockent `IPv6` (mappé `::ffff:x.x.x.x`)
|
||||||
|
- Utiliser `toIPv6()` pour les requêtes sur les tables ML
|
||||||
|
- Utiliser `toIPv4OrZero()` pour les requêtes sur `http_logs`
|
||||||
|
- Retirer le préfixe `::ffff:` à l'affichage
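
On the Python side, the same mapping can be handled with the standard `ipaddress` module. A minimal sketch with hypothetical helper names:

```python
import ipaddress


def to_mapped_ipv6(ipv4_text):
    """Build the IPv4-mapped IPv6 form (::ffff:x.x.x.x) of an IPv4 address."""
    ipv4 = ipaddress.IPv4Address(ipv4_text)
    return ipaddress.IPv6Address(f"::ffff:{ipv4}")


def display_ip(text):
    """Return the dotted IPv4 form for mapped addresses, else the address as-is."""
    address = ipaddress.ip_address(text)
    if isinstance(address, ipaddress.IPv6Address) and address.ipv4_mapped is not None:
        return str(address.ipv4_mapped)
    return str(address)
```

Using `ipv4_mapped` instead of slicing off the `::ffff:` prefix avoids depending on how a given Python version prints mapped addresses.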
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Metrics (`/api/metrics`)
|
## Couche base de données (`backend/database.py`)
|
||||||
|
|
||||||
| Method | Path | Description |
|
Client `clickhouse-connect` avec singleton paresseux :
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/metrics` | Global dashboard metrics: detection counts by threat level, unique IPs, time series |
|
| Fonction | Signature | Description |
|
||||||
| GET | `/api/metrics/threats` | Threat distribution summary |
|
|----------|-----------|-------------|
|
||||||
| GET | `/api/metrics/baseline` | Human baseline statistics |
|
| `get_client()` | `→ Client` | Singleton `clickhouse_connect.get_client()` |
|
||||||
|
| `query(sql, params)` | `→ list[dict]` | Exécute un SELECT, retourne une liste de dicts |
|
||||||
|
| `query_scalar(sql, params)` | `→ Any` | Exécute un SELECT, retourne une valeur scalaire |
|
||||||
|
| `execute(sql, params)` | `→ None` | Exécute un DDL/DML sans retour |
|
||||||
|
|
||||||
|
Conversion automatique des types : `IPv4Address`, `IPv6Address`, `bytes` → chaînes JSON-friendly.
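
The conversion step can be illustrated as follows. The helper names are hypothetical, and decoding `bytes` as UTF-8 with replacement characters is an assumption; the real `database.py` may handle bytes differently.

```python
import ipaddress
import json


def to_json_friendly(value):
    """Convert driver-specific types to plain strings; pass everything else through."""
    if isinstance(value, (ipaddress.IPv4Address, ipaddress.IPv6Address)):
        return str(value)
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def row_to_dict(column_names, row):
    """Zip column names with a result row, converting each value on the way."""
    return {name: to_json_friendly(value) for name, value in zip(column_names, row)}
```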
|
||||||
|
|
||||||
|
### Patron de requêtes paramétrées
|
||||||
|
|
||||||
|
```python
from backend.config import DB_PROCESSING, DB_LOGS, safe_identifier
from backend.database import query, query_scalar

# Validate the database name before interpolating it into the SQL text.
_DB = safe_identifier(DB_PROCESSING)

ip_value = "192.0.2.10"  # example value; normally taken from the request

# The IP is passed as a bound parameter ({ip:String}), never concatenated.
rows = query(
    f"SELECT src_ip, anomaly_score FROM {_DB}.ml_detected_anomalies "
    "WHERE src_ip = toIPv6({ip:String}) ORDER BY detected_at DESC LIMIT 10",
    {"ip": ip_value},
)
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Detections (`/api/detections`)
|
## Routes de pages (14)
|
||||||
|
|
||||||
| Method | Path | Description |
|
Toutes les pages sont rendues côté serveur via Jinja2 et utilisent htmx pour les mises à jour dynamiques.
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/detections` | Paginated detection list with filtering, sorting, and text search |
|
|
||||||
| GET | `/api/detections/{detection_id}` | Single detection details |
|
|
||||||
|
|
||||||
**Query Parameters** (GET `/api/detections`):
|
| # | Chemin | Template | Description |
|
||||||
|
|---|--------|----------|-------------|
|
||||||
|
| 1 | `/` | `overview.html` | Vue d'ensemble : statistiques 24h, distribution des menaces, top IPs, timeline |
|
||||||
|
| 2 | `/detections` | `detections.html` | Tableau paginé des détections d'anomalies avec filtres |
|
||||||
|
| 3 | `/scores` | `scores.html` | Tableau paginé de tous les scores ML avec filtres |
|
||||||
|
| 4 | `/traffic` | `traffic.html` | Tableau paginé des logs HTTP bruts avec filtres |
|
||||||
|
| 5 | `/ip/{ip}` | `ip_detail.html` | Investigation détaillée d'une IP (détections, scores, logs, features, récurrence) |
|
||||||
|
| 6 | `/classify` | `classify.html` | Interface de classification SOC (feedback analyste) |
|
||||||
|
| 7 | `/features` | `features.html` | Analyse des features ML (profils humain/bot, importance) |
|
||||||
|
| 8 | `/models` | `models.html` | Métadonnées des modèles, statistiques de scoring |
|
||||||
|
| 9 | `/network` | `network.html` | Graphe réseau des campagnes (IP ↔ JA4 partagés) |
|
||||||
|
| 10 | `/campaigns` | `campaigns.html` | Campagnes HDBSCAN (clusters de bots agrégés) |
|
||||||
|
| 11 | `/ja4/{fingerprint:path}` | `ja4_detail.html` | Investigation détaillée d'une empreinte JA4 |
|
||||||
|
| 12 | `/cluster/{cid}` | `cluster_detail.html` | Investigation détaillée d'un cluster |
|
||||||
|
| 13 | `/tactics` | `tactics.html` | Tactiques de détection (brute-force, rotation JA4/UA, récurrence) |
|
||||||
|
| 14 | `/reflists` | `reflists.html` | Listes de référence et dictionnaires ClickHouse |
|
||||||
|
|
||||||
| Parameter | Type | Description |
|
---
|
||||||
|
|
||||||
|
## Routes API (35)
|
||||||
|
|
||||||
|
Toutes les routes sont préfixées par `/api` et retournent du JSON. Utilisées par htmx depuis les templates et consommables par des clients externes.
|
||||||
|
|
||||||
|
### Vue d'ensemble et statistiques
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 1 | GET | `/api/overview` | Résumé du dashboard : détections/scores/trafic 24h, distribution des menaces, top IPs, timeline, modèles, statistiques navigateur |
|
||||||
|
| 2 | GET | `/api/heatmap` | Heatmap temporelle (heure × jour de semaine) depuis `http_logs` |
|
||||||
|
| 3 | GET | `/api/geo` | Répartition géographique et ASN des sessions |
|
||||||
|
| 4 | GET | `/api/alerts` | Flux d'alertes en direct (détections récentes HIGH/CRITICAL) |
|
||||||
|
| 5 | GET | `/api/timeline-detail` | Détail horaire de la timeline par niveau de menace |
|
||||||
|
|
||||||
|
### Détections et scores
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 6 | GET | `/api/detections` | Liste paginée des détections d'anomalies avec filtres (threat_level, search, asn, country, ja4, bot, browser) et tri |
|
||||||
|
| 7 | GET | `/api/scores` | Liste paginée des scores ML avec filtres similaires |
|
||||||
|
| 8 | GET | `/api/traffic` | Liste paginée des logs HTTP bruts avec filtres |
|
||||||
|
|
||||||
|
### Investigation IP
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 9 | GET | `/api/ip/{ip}` | Détail IP : détections, scores, logs HTTP, features AI, récurrence |
|
||||||
|
| 10 | GET | `/api/ip/{ip}/radar` | Données pour graphique radar : IP vs baseline ISP vs baseline bot |
|
||||||
|
|
||||||
|
### Features et modèles
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 11 | GET | `/api/features` | Statistiques des features AI/thèse, profils humain/bot, importance (variance) |
|
||||||
|
| 12 | GET | `/api/models` | Métadonnées des modèles depuis fichiers JSON + statistiques de scoring ClickHouse |
|
||||||
|
| 13 | GET | `/api/models/timeline` | Volume horaire de scoring par modèle (7 jours) |
|
||||||
|
| 14 | GET | `/api/models/threats` | Répartition des niveaux de menace par modèle |
|
||||||
|
|
||||||
|
### Empreintes et navigateurs
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 15 | GET | `/api/fingerprints` | Analyse des empreintes JA4 avec mapping de bots |
|
||||||
|
| 16 | GET | `/api/browsers` | Distribution des familles de navigateurs via JA4 |
|
||||||
|
| 17 | GET | `/api/behavior` | Données scatter des features comportementales + distributions |
|
||||||
|
|
||||||
|
### Classification SOC
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 18 | GET | `/api/classify/stats` | Résumé statistique des classifications SOC |
|
||||||
|
| 19 | GET | `/api/classify/suggested` | Top des IPs non classifiées triées par sévérité |
|
||||||
|
| 20 | POST | `/api/classify` | Soumettre un feedback analyste SOC (bot/legitimate/suspicious) |
|
||||||
|
| 21 | GET | `/api/classifications` | Entrées de feedback SOC récentes |
|
||||||
|
|
||||||
|
### Campagnes et clusters
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 22 | GET | `/api/campaigns` | Clusters de campagnes HDBSCAN (agrégés) |
|
||||||
|
| 23 | GET | `/api/campaigns/graph` | Données de graphe réseau (nœuds IP + arêtes JA4 partagés) |
|
||||||
|
| 24 | GET | `/api/campaigns/scatter` | Scatter plot : score d'anomalie vs vélocité par IP et campagne |
|
||||||
|
| 25 | GET | `/api/campaigns/{cid}` | Détail campagne : membres, profil, timeline |
|
||||||
|
|
||||||
|
### Investigation JA4 et cluster
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 26 | GET | `/api/ja4/{fingerprint:path}` | Investigation d'empreinte JA4 (IPs, scores, logs, features) |
|
||||||
|
| 27 | GET | `/api/cluster/{cid}` | Investigation enrichie de cluster (profil, membres, graphe, répartitions) |
|
||||||
|
|
||||||
|
### Tactiques de détection
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 28 | GET | `/api/brute-force` | Détection de brute-force / credential stuffing |
|
||||||
|
| 29 | GET | `/api/ja4-rotation` | Détection de rotation d'empreintes JA4 (évasion) |
|
||||||
|
| 30 | GET | `/api/recurrence` | IPs de menaces persistantes/récurrentes |
|
||||||
|
| 31 | GET | `/api/cascade/{ip}` | Cascade de ressources pour détection de navigateur headless |
|
||||||
|
| 32 | GET | `/api/ua-rotation` | Détection de rotation de User-Agent |
|
||||||
|
|
||||||
|
### Dictionnaires et listes de référence
|
||||||
|
|
||||||
|
| # | Méthode | Chemin | Description |
|
||||||
|
|---|---------|--------|-------------|
|
||||||
|
| 33 | GET | `/api/dictionaries` | Métadonnées des dictionnaires ClickHouse |
|
||||||
|
| 34 | GET | `/api/reflist/{name}` | Contenu paginé d'une liste de référence / dictionnaire |
|
||||||
|
| 35 | GET | `/api/reflist/{name}/stats` | Statistiques agrégées d'une liste de référence |
|
||||||
|
|
||||||
|
### Autres routes
|
||||||
|
|
||||||
|
| Méthode | Chemin | Description |
|
||||||
|
|---------|--------|-------------|
|
||||||
|
| GET | `/health` | Point de santé — retourne l'état de connexion ClickHouse |
|
||||||
|
| — | `/static` | Montage de fichiers statiques (`backend/static/`) |
|
||||||
|
| GET | `/docs` | Documentation Swagger UI (auto-généré par FastAPI) |
|
||||||
|
| GET | `/openapi.json` | Schéma OpenAPI (auto-généré par FastAPI) |
|
||||||
|
|
||||||
|
**Total : 14 pages + 35 API + 1 health = 50 routes applicatives**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Paramètres de requête courants
|
||||||
|
|
||||||
|
Les endpoints paginés (`/api/detections`, `/api/scores`, `/api/traffic`) acceptent :
|
||||||
|
|
||||||
|
| Paramètre | Type | Description |
|
||||||
|-----------|------|-------------|
|
|-----------|------|-------------|
|
||||||
| `page` | int | Page number (default: 1) |
|
| `page` | int | Numéro de page (défaut : 1) |
|
||||||
| `page_size` | int | Items per page (default: 20) |
|
| `page_size` | int | Éléments par page (défaut : 20) |
|
||||||
| `threat_level` | string | Filter by threat level |
|
| `threat_level` | str | Filtre par niveau de menace |
|
||||||
| `model_name` | string | Filter by model name |
|
| `model_name` | str | Filtre par nom de modèle |
|
||||||
| `search` | string | Full-text search across IP, JA4, host, bot_name |
|
| `search` | str | Recherche plein texte (IP, JA4, host, bot_name) |
|
||||||
| `sort_by` | string | Sort field |
|
| `sort_by` | str | Champ de tri (validé contre une whitelist) |
|
||||||
| `sort_order` | string | `asc` or `desc` |
|
| `sort_order` | str | `asc` ou `desc` |
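
The whitelist check on `sort_by` matters because column names cannot be passed as ordinary bound values. A minimal sketch of how the pagination and sort parameters could be turned into a SQL suffix; the allowed field list, the page-size cap and the function name are hypothetical:

```python
ALLOWED_SORT_FIELDS = {"detected_at", "anomaly_score", "src_ip", "threat_level"}
MAX_PAGE_SIZE = 500  # hypothetical cap, not taken from the documentation


def build_order_and_limit(page=1, page_size=20, sort_by="detected_at", sort_order="desc"):
    """Return an 'ORDER BY ... LIMIT ... OFFSET ...' suffix from validated inputs."""
    if sort_by not in ALLOWED_SORT_FIELDS:
        raise ValueError(f"sort_by not allowed: {sort_by!r}")
    direction = sort_order.lower()
    if direction not in ("asc", "desc"):
        raise ValueError(f"sort_order must be 'asc' or 'desc': {sort_order!r}")
    page = max(1, int(page))
    page_size = max(1, min(int(page_size), MAX_PAGE_SIZE))
    offset = (page - 1) * page_size
    return f"ORDER BY {sort_by} {direction.upper()} LIMIT {page_size} OFFSET {offset}"
```

Only values that passed the whitelist or an integer conversion reach the SQL text, so the suffix can be appended to a query safely.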
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Investigation (`/api/investigation`)
|
## Workflow SOC
|
||||||
|
|
||||||
| Method | Path | Description |
|
Le dashboard est conçu pour le workflow d'un analyste SOC :
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/investigation/{ip}/summary` | **Primary investigation endpoint.** Aggregates ML score, brute-force, TCP spoofing, JA4 rotation, persistence, and 24h timeline into a single response with a `risk_score` (0–100) |
|
1. **Vue d'ensemble** (`/`) — identifier les tendances de menaces 24h, les pics d'anomalies
|
||||||
|
2. **Détections** (`/detections`) — filtrer les anomalies par sévérité, rechercher des IPs spécifiques
|
||||||
|
3. **Investigation IP** (`/ip/{ip}`) — examiner les détections, scores, logs HTTP, features ML, radar de comparaison
|
||||||
|
4. **Analyse JA4** (`/ja4/{fp}`) — investiguer une empreinte TLS, voir toutes les IPs associées
|
||||||
|
5. **Campagnes** (`/campaigns`) — visualiser les clusters de bots, graphe réseau, scatter plot
|
||||||
|
6. **Cluster** (`/cluster/{cid}`) — plonger dans un cluster spécifique, examiner les membres et profils
|
||||||
|
7. **Tactiques** (`/tactics`) — surveiller le brute-force, la rotation JA4/UA, les menaces récurrentes
|
||||||
|
8. **Classification** (`/classify`) — soumettre un feedback analyste (bot/légitime/suspect) pour alimenter XGBoost
|
||||||
|
9. **Features** (`/features`) — comparer les profils de features humain vs bot
|
||||||
|
10. **Modèles** (`/models`) — surveiller les performances des modèles ML
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Reputation (`/api/reputation`)
|
## Architecture des templates
|
||||||
|
|
||||||
| Method | Path | Description |
|
### Template de base (`base.html`)
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/reputation/ip/{ip_address}` | Full IP reputation from IP-API.com and IPinfo.io (proxy, VPN, Tor, hosting detection) |
|
Fournit le layout commun :
|
||||||
| GET | `/api/reputation/ip/{ip_address}/summary` | Simplified reputation summary |
|
|
||||||
|
- Navigation latérale avec liens vers toutes les pages
|
||||||
|
- Chargement CDN de Tailwind CSS, htmx, Chart.js, ECharts
|
||||||
|
- Bloc `{% block content %}` pour le contenu de chaque page
|
||||||
|
|
||||||
|
### Rendu côté serveur + htmx
|
||||||
|
|
||||||
|
Chaque page est rendue côté serveur par Jinja2. Les interactions dynamiques utilisent htmx :
|
||||||
|
|
||||||
|
```html
|
||||||
|
<!-- Exemple : chargement dynamique des détections -->
|
||||||
|
<div hx-get="/api/detections?page=1&page_size=20"
|
||||||
|
hx-trigger="load"
|
||||||
|
hx-target="#detections-table">
|
||||||
|
</div>
|
||||||
|
```
|
||||||
|
|
||||||
|
Les graphiques sont rendus côté client par Chart.js et ECharts à partir des données JSON de l'API.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Analysis (`/api/analysis`)
|
## CORS
|
||||||
|
|
||||||
| Method | Path | Description |
|
Le middleware CORS est configuré de manière permissive dans `main.py` :
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/analysis/{ip}/subnet` | Subnet analysis for an IP (related IPs in same /24) |
|
```python
|
||||||
| GET | `/api/analysis/{ip}/country` | Country-level analysis for an IP |
|
app.add_middleware(
|
||||||
| GET | `/api/analysis/country` | Global country analysis across all detections |
|
CORSMiddleware,
|
||||||
| GET | `/api/analysis/{ip}/ja4` | JA4 fingerprint analysis for an IP |
|
allow_origins=["*"],
|
||||||
| GET | `/api/analysis/{ip}/user-agents` | User-agent analysis for an IP |
|
allow_credentials=True,
|
||||||
| GET | `/api/analysis/{ip}/recommendation` | SOC classification recommendation |
|
allow_methods=["*"],
|
||||||
| POST | `/api/analysis/classifications` | Create a classification (legitimate/suspicious/malicious) |
|
allow_headers=["*"],
|
||||||
| GET | `/api/analysis/classifications` | List all classifications |
|
)
|
||||||
| GET | `/api/analysis/classifications/stats` | Classification statistics |
|
```
|
||||||
|
|
||||||
|
L'application est principalement SSR mais l'API JSON est aussi consommable par des clients externes.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Entities (`/api/entities`)
|
## Déploiement
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/entities/types` | List available entity types |
|
|
||||||
| GET | `/api/entities/subnet/{subnet}` | Investigate a subnet |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}` | Investigate any entity (IP, JA4, subnet, UA, host) |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}/related` | Related entities |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}/user_agents` | User-agents for entity |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}/client_headers` | Client headers for entity |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}/paths` | URL paths for entity |
|
|
||||||
| GET | `/api/entities/{entity_type}/{entity_value}/query_params` | Query parameters for entity |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Incidents (`/api/incidents`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/incidents` | List all incidents |
|
|
||||||
| GET | `/api/incidents/clusters` | Active incident clusters (behavioral similarity grouping) |
|
|
||||||
| GET | `/api/incidents/{cluster_id}` | Incident cluster details |
|
|
||||||
| POST | `/api/incidents/{cluster_id}/classify` | Classify an incident cluster |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Fingerprints (`/api/fingerprints`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/fingerprints/spoofing` | TLS fingerprint spoofing detection |
|
|
||||||
| GET | `/api/fingerprints/ja4-ua-matrix` | JA4 ↔ User-Agent correlation matrix |
|
|
||||||
| GET | `/api/fingerprints/ua-analysis` | Suspicious user-agent analysis |
|
|
||||||
| GET | `/api/fingerprints/ip/{ip}/coherence` | Fingerprint coherence analysis per IP |
|
|
||||||
| GET | `/api/fingerprints/legitimate-ja4` | Known legitimate JA4 fingerprints |
|
|
||||||
| GET | `/api/fingerprints/asn-correlation` | JA4-ASN correlation analysis |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Brute Force (`/api/bruteforce`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/bruteforce/targets` | Brute-force target hosts |
|
|
||||||
| GET | `/api/bruteforce/attackers` | Brute-force attacker IPs |
|
|
||||||
| GET | `/api/bruteforce/timeline` | Brute-force attack timeline |
|
|
||||||
| GET | `/api/bruteforce/host/{host}/attackers` | Attackers for a specific host |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### TCP Spoofing (`/api/tcp-spoofing`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/tcp-spoofing/overview` | TCP/OS fingerprint spoofing overview |
|
|
||||||
| GET | `/api/tcp-spoofing/list` | Spoofing detection list |
|
|
||||||
| GET | `/api/tcp-spoofing/matrix` | TTL × MSS anomaly matrix |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Header Fingerprint (`/api/headers`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/headers/clusters` | Header fingerprint clusters (suspicious patterns) |
|
|
||||||
| GET | `/api/headers/cluster/{hash}/ips` | IPs sharing a header fingerprint |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Heatmap (`/api/heatmap`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/heatmap/hourly` | Hourly traffic heatmap |
|
|
||||||
| GET | `/api/heatmap/top-hosts` | Top hosts by traffic volume |
|
|
||||||
| GET | `/api/heatmap/matrix` | Activity/hour matrix |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Botnets (`/api/botnets`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/botnets/ja4-spread` | JA4 geographic spread (botnet indicator) |
|
|
||||||
| GET | `/api/botnets/ja4/{ja4}/countries` | Country distribution for a JA4 fingerprint |
|
|
||||||
| GET | `/api/botnets/summary` | Global botnet detection summary |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Rotation (`/api/rotation`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/rotation/ja4-rotators` | IPs rotating JA4 fingerprints (evasion detection) |
|
|
||||||
| GET | `/api/rotation/persistent-threats` | Persistent threats across time windows |
|
|
||||||
| GET | `/api/rotation/ip/{ip}/ja4-history` | JA4 fingerprint history for an IP |
|
|
||||||
| GET | `/api/rotation/sophistication` | Sophistication score analysis |
|
|
||||||
| GET | `/api/rotation/proactive-hunt` | Proactive threat hunting suggestions |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### ML Features (`/api/ml`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/ml/top-anomalies` | Top anomalies with feature details |
|
|
||||||
| GET | `/api/ml/ip/{ip}/radar` | Feature radar chart data for an IP |
|
|
||||||
| GET | `/api/ml/score-distribution` | Anomaly score distribution histogram |
|
|
||||||
| GET | `/api/ml/score-trends` | Score trends over time |
|
|
||||||
| GET | `/api/ml/b-features` | Source B (TCP/TLS) feature analysis |
|
|
||||||
| GET | `/api/ml/campaigns` | ML-detected campaign analysis |
|
|
||||||
| GET | `/api/ml/scatter` | Feature scatter plot data |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Attributes (`/api/attributes`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/attributes/{attr_type}` | List distinct values for an attribute (ja4, user_agent, asn, country, host) with counts |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Variability (`/api/variability`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/variability/{attr_type}/{value}` | Behavioral variability analysis for an attribute value |
|
|
||||||
| GET | `/api/variability/{attr_type}/{value}/ips` | IPs associated with an attribute value |
|
|
||||||
| GET | `/api/variability/{attr_type}/{value}/attributes` | Attribute breakdown for a value |
|
|
||||||
| GET | `/api/variability/{attr_type}/{value}/user_agents` | User-agents for an attribute value |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Clustering (`/api/clustering`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/clustering/status` | Clustering cache status |
|
|
||||||
| GET | `/api/clustering/clusters` | K-Means cluster list |
|
|
||||||
| GET | `/api/clustering/cluster/{cluster_id}/points` | Data points in a cluster |
|
|
||||||
| GET | `/api/clustering/cluster/{cluster_id}/ips` | IPs in a cluster |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Search (`/api/search`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| GET | `/api/search/quick` | Cross-entity search (IP, JA4, host, UA, country, ASN) |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Audit (`/api/audit`)
|
|
||||||
|
|
||||||
| Method | Path | Description |
|
|
||||||
|--------|------|-------------|
|
|
||||||
| POST | `/api/audit/logs` | Create an audit log entry |
|
|
||||||
| GET | `/api/audit/logs` | Query audit logs (filtered, paginated) |
|
|
||||||
| GET | `/api/audit/stats` | Audit statistics |
|
|
||||||
| GET | `/api/audit/users/activity` | Per-user activity summary |
|
|
||||||
|
|
||||||
## Frontend Structure
|
|
||||||
|
|
||||||
The React frontend is built with Vite and served as static assets:
|
|
||||||
|
|
||||||
- **Entry point**: `/` → `frontend/dist/index.html`
|
|
||||||
- **Static assets**: `/assets/*` → `frontend/dist/assets/`
|
|
||||||
- **SPA routing**: All non-`/api/` paths fall through to `index.html` (React Router)
|
|
||||||
- **API proxy**: Frontend calls `/api/*` which is handled by FastAPI routers
|
|
||||||
|
|
||||||
## Services
|
|
||||||
|
|
||||||
### IPReputationService
|
|
||||||
|
|
||||||
Queries public IP reputation databases (IP-API.com, IPinfo.io) without API keys:
|
|
||||||
- Proxy/VPN/Tor detection
|
|
||||||
- ASN, country, ISP information
|
|
||||||
- Hosting provider identification
|
|
||||||
|
|
||||||
### ClusteringEngine
|
|
||||||
|
|
||||||
K-Means clustering on ML features with caching:
|
|
||||||
- Automatic cluster count selection
|
|
||||||
- Feature normalization via StandardScaler
|
|
||||||
- In-memory cache with TTL
|
|
||||||
|
|
||||||
## Deployment
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build Docker image
|
# Construction de l'image Docker
|
||||||
make build-dashboard
|
make build-dashboard
|
||||||
|
|
||||||
# Run tests
|
# Tests
|
||||||
make test-dashboard
|
make test-dashboard
|
||||||
|
|
||||||
# Run locally (development)
|
# Exécution locale (développement)
|
||||||
cd services/dashboard
|
cd services/dashboard
|
||||||
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
|
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000
|
||||||
```
|
```
|
||||||
|
|
||||||
### Health Check
|
### Point de santé
|
||||||
|
|
||||||
```
|
```
|
||||||
GET /health → {"status": "healthy", "clickhouse": "connected"}
|
GET /health → {"status": "healthy", "clickhouse": "connected"}
|
||||||
|
|||||||
@ -1,64 +1,106 @@
|
|||||||
# python-ja4common
|
# python-ja4common
|
||||||
|
|
||||||
`ja4_common` is the shared Python library for the ja4-platform, providing a unified ClickHouse client singleton and configuration settings. It is used by [bot-detector](../services/bot-detector.md) and [dashboard](../services/dashboard.md).
|
Bibliothèque Python partagée pour la plateforme ja4, fournissant un client
|
||||||
|
ClickHouse singleton et une configuration centralisée via pydantic-settings.
|
||||||
|
|
||||||
**Package name**: `ja4-common`
|
**Utilisée par** : [bot-detector](../services/bot-detector.md) (via `ja4_common.clickhouse`)
|
||||||
|
|
||||||
**Python version**: ≥ 3.11
|
> **Note :** le [dashboard](../services/dashboard.md) n'utilise **pas** `ja4_common`. Il possède
|
||||||
|
> son propre client `clickhouse-connect` léger dans `backend/database.py`.
|
||||||
|
|
||||||
**Dependencies**:
|
---
|
||||||
- `clickhouse-connect >= 0.8.0`
|
|
||||||
- `pydantic-settings >= 2.1.0`
|
## Informations du package
|
||||||
|
|
||||||
|
| Clé | Valeur |
|
||||||
|
|-----|--------|
|
||||||
|
| **Nom du package** | `ja4-common` |
|
||||||
|
| **Version** | `0.1.0` |
|
||||||
|
| **Python** | ≥ 3.11 |
|
||||||
|
| **Dépendances** | `clickhouse-connect >= 0.8.0`, `pydantic-settings >= 2.1.0` |
|
||||||
|
| **Build system** | `setuptools >= 68` + `wheel` |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## ClickHouseSettings
|
## ClickHouseSettings
|
||||||
|
|
||||||
Pydantic-settings model that reads configuration from environment variables and `.env` files.
|
Modèle pydantic-settings qui lit la configuration depuis les variables d'environnement
|
||||||
|
et les fichiers `.env`.
|
||||||
|
|
||||||
### Fields
|
### Champs
|
||||||
|
|
||||||
| Field | Type | Default | Env Variable | Description |
|
| Champ | Type | Défaut | Variable d'env | Description |
|
||||||
|-------|------|---------|-------------|-------------|
|
|-------|------|--------|----------------|-------------|
|
||||||
| `CLICKHOUSE_HOST` | str | `"clickhouse"` | `CLICKHOUSE_HOST` | ClickHouse server hostname |
|
| `CLICKHOUSE_HOST` | str | `"clickhouse"` | `CLICKHOUSE_HOST` | Nom d'hôte du serveur ClickHouse |
|
||||||
| `CLICKHOUSE_PORT` | int | `8123` | `CLICKHOUSE_PORT` | ClickHouse HTTP API port |
|
| `CLICKHOUSE_PORT` | int | `8123` | `CLICKHOUSE_PORT` | Port de l'API HTTP ClickHouse |
|
||||||
| `CLICKHOUSE_DB` | str | `"ja4_processing"` | `CLICKHOUSE_DB` | Database name |
|
| `CLICKHOUSE_DB` | str | `"ja4_processing"` | `CLICKHOUSE_DB` | Base de données par défaut (rétro-compatibilité) |
|
||||||
| `CLICKHOUSE_USER` | str | `"admin"` | `CLICKHOUSE_USER` | Username for authentication |
|
| `CLICKHOUSE_DB_PROCESSING` | str | `"ja4_processing"` | `CLICKHOUSE_DB_PROCESSING` | Base de données ML, agrégations, dictionnaires |
|
||||||
| `CLICKHOUSE_PASSWORD` | str | `""` | `CLICKHOUSE_PASSWORD` | Password for authentication |
|
| `CLICKHOUSE_DB_LOGS` | str | `"ja4_logs"` | `CLICKHOUSE_DB_LOGS` | Base de données des logs HTTP bruts |
|
||||||
|
| `CLICKHOUSE_USER` | str | `"admin"` | `CLICKHOUSE_USER` | Utilisateur pour l'authentification |
|
||||||
|
| `CLICKHOUSE_PASSWORD` | str | `""` | `CLICKHOUSE_PASSWORD` | Mot de passe pour l'authentification |
|
||||||
|
|
||||||
### Configuration Sources
|
### Schéma dual-database
|
||||||
|
|
||||||
Settings are loaded in order of precedence:
|
La plateforme utilise deux bases de données ClickHouse :
|
||||||
1. **Environment variables** (highest priority)
|
|
||||||
2. **`.env` file** in the current working directory
|
|
||||||
3. **Default values** (lowest priority)
|
|
||||||
|
|
||||||
Environment variable names are **case-sensitive** (e.g., `CLICKHOUSE_HOST`, not `clickhouse_host`).
|
- **`CLICKHOUSE_DB_PROCESSING`** (`ja4_processing`) : tables ML (`ml_detected_anomalies`,
|
||||||
|
`ml_all_scores`), agrégations (`agg_*`), dictionnaires (`dict_*`), feedback SOC, audit
|
||||||
|
- **`CLICKHOUSE_DB_LOGS`** (`ja4_logs`) : `http_logs_raw`, `http_logs`, vues matérialisées
|
||||||
|
|
||||||
### Usage
|
Des **références croisées** existent entre les deux bases — les vues matérialisées de l'une
|
||||||
|
lisent dans l'autre. Utiliser toujours des noms de tables pleinement qualifiés :
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from ja4_common.settings import settings
|
from ja4_common.settings import settings
|
||||||
|
|
||||||
print(settings.CLICKHOUSE_HOST) # "clickhouse" or from env
|
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
|
||||||
print(settings.CLICKHOUSE_PORT) # 8123 or from env
|
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Ne jamais coder en dur les noms de bases dans les requêtes.
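
The rule above (always fully qualify table names, never hard-code the database) can be sketched with a small helper. This is only an illustration: the `qualified` function and the `Settings` dataclass are hypothetical stand-ins, not part of `ja4_common`. Only the two database settings and the table names come from the documentation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for ja4_common.settings.settings."""
    CLICKHOUSE_DB_PROCESSING: str = "ja4_processing"
    CLICKHOUSE_DB_LOGS: str = "ja4_logs"


# Plain identifiers only: letters, digits, underscores, not starting with a digit.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified(database: str, table: str) -> str:
    """Return `database.table`, rejecting anything that is not a plain identifier."""
    for name in (database, table):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    return f"{database}.{table}"


settings = Settings()
anomalies = qualified(settings.CLICKHOUSE_DB_PROCESSING, "ml_detected_anomalies")
logs = qualified(settings.CLICKHOUSE_DB_LOGS, "http_logs")

# Doubled braces keep the ClickHouse parameter placeholder intact in the f-string.
query = f"SELECT count() FROM {logs} WHERE src_ip = {{ip:String}}"
```

Validating the identifier before interpolation is a design choice of this sketch: database and table names cannot be bound as query parameters, so they are the one part of the query built by string formatting.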
|
||||||
|
|
||||||
|
### Sources de configuration
|
||||||
|
|
||||||
|
Les paramètres sont chargés par ordre de priorité :
|
||||||
|
1. **Variables d'environnement** (priorité la plus haute)
|
||||||
|
2. **Fichier `.env`** dans le répertoire courant
|
||||||
|
3. **Valeurs par défaut** (priorité la plus basse)
|
||||||
|
|
||||||
|
Les noms de variables d'environnement sont **sensibles à la casse** (ex. `CLICKHOUSE_HOST`, pas `clickhouse_host`).
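
The precedence order above can be illustrated with a standard-library sketch. It does not reproduce how pydantic-settings works internally. `parse_dotenv` and `resolve` are hypothetical helpers, and the defaults are the ones listed in the fields table.

```python
DEFAULTS = {
    "CLICKHOUSE_HOST": "clickhouse",
    "CLICKHOUSE_PORT": "8123",
    "CLICKHOUSE_DB_PROCESSING": "ja4_processing",
    "CLICKHOUSE_DB_LOGS": "ja4_logs",
}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve(environ: dict[str, str], dotenv: dict[str, str]) -> dict[str, str]:
    """Layer the sources: defaults first, then .env, then the environment."""
    resolved = dict(DEFAULTS)
    # Exact-match lookup: a lowercase key such as `clickhouse_host` is ignored.
    resolved.update({k: v for k, v in dotenv.items() if k in DEFAULTS})
    resolved.update({k: v for k, v in environ.items() if k in DEFAULTS})
    return resolved


dotenv = parse_dotenv("CLICKHOUSE_HOST=db.internal\nCLICKHOUSE_PORT=9000\n")
environ = {"CLICKHOUSE_PORT": "8443", "clickhouse_host": "ignored"}
config = resolve(environ, dotenv)
```

Here the host comes from the `.env` text, the port from the environment (which overrides the `.env` value), and the two database names fall back to their defaults.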
|
||||||
|
|
||||||
|
### Utilisation
|
||||||
|
|
||||||
|
```python
|
||||||
|
from ja4_common.settings import settings
|
||||||
|
|
||||||
|
print(settings.CLICKHOUSE_HOST) # "clickhouse" ou depuis l'env
|
||||||
|
print(settings.CLICKHOUSE_PORT) # 8123 ou depuis l'env
|
||||||
|
print(settings.CLICKHOUSE_DB_PROCESSING) # "ja4_processing" ou depuis l'env
|
||||||
|
print(settings.CLICKHOUSE_DB_LOGS) # "ja4_logs" ou depuis l'env
|
||||||
|
```
|
||||||
|
|
||||||
|
Le singleton `settings` est créé au niveau du module à l'import.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## ClickHouseClient
|
## ClickHouseClient
|
||||||
|
|
||||||
Wraps `clickhouse_connect` with auto-reconnection and a clean API.
|
Encapsule `clickhouse_connect` avec reconnexion automatique et une API épurée.
|
||||||
|
|
||||||
### Methods
|
### Méthodes
|
||||||
|
|
||||||
| Method | Signature | Description |
|
| Méthode | Signature | Description |
|
||||||
|--------|-----------|-------------|
|
|---------|-----------|-------------|
|
||||||
| `connect` | `connect() -> Client` | Returns the underlying `clickhouse_connect` client, creating or reconnecting as needed |
|
| `connect` | `connect() → Client` | Retourne le client `clickhouse_connect` sous-jacent, crée ou reconnecte si nécessaire |
|
||||||
| `query` | `query(query: str, params: dict = None)` | Execute a SELECT query, returns result set |
|
| `_ping` | `_ping() → bool` | Vérifie la connexion via `client.ping()`, retourne `False` en cas d'exception |
|
||||||
| `command` | `command(query: str, params: dict = None)` | Execute a DDL/DML command (CREATE, INSERT, etc.) |
|
| `query` | `query(query: str, params: dict = None)` | Exécute une requête SELECT, retourne le résultat |
|
||||||
| `insert` | `insert(table: str, data, column_names=None)` | Bulk insert data into a table |
|
| `command` | `command(query: str, params: dict = None)` | Exécute une commande DDL/DML (CREATE, INSERT, etc.) |
|
||||||
| `close` | `close()` | Close the connection and release resources |
|
| `insert` | `insert(table: str, data, column_names=None)` | Insertion en masse dans une table |
|
||||||
|
| `close` | `close()` | Ferme la connexion et libère les ressources |
|
||||||
|
|
||||||
### Auto-Reconnection
|
### Reconnexion automatique
|
||||||
|
|
||||||
The `connect()` method automatically reconnects if the current connection is lost:
|
La méthode `connect()` reconnecte automatiquement si la connexion est perdue :
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def connect(self):
|
def connect(self):
|
||||||
@ -74,41 +116,50 @@ def connect(self):
|
|||||||
return self._client
|
return self._client
|
||||||
```
|
```
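
Since the diff elides most of `connect()`, here is a generic sketch of the ping-then-reconnect pattern that the methods table describes. It is not the library's code. `ReconnectingClient`, `FakeConnection` and the injected `factory` are hypothetical names chosen so the example runs without a ClickHouse server.

```python
class ReconnectingClient:
    """Generic ping-then-reconnect wrapper (illustrative only)."""

    def __init__(self, factory):
        self._factory = factory  # callable that opens a new connection
        self._client = None

    def _ping(self) -> bool:
        # Any exception during the ping is treated as a lost connection.
        try:
            return bool(self._client.ping())
        except Exception:
            return False

    def connect(self):
        # Create on first use, or replace the connection when the ping fails.
        if self._client is None or not self._ping():
            self._client = self._factory()
        return self._client


class FakeConnection:
    """Test double whose ping raises once `alive` is set to False."""

    def __init__(self):
        self.alive = True

    def ping(self) -> bool:
        if not self.alive:
            raise ConnectionError("connection lost")
        return True


created = []


def factory():
    conn = FakeConnection()
    created.append(conn)
    return conn


client = ReconnectingClient(factory)
first = client.connect()   # opens the first connection
same = client.connect()    # ping succeeds, connection is reused
first.alive = False        # simulate a dropped connection
second = client.connect()  # ping fails, a new connection is opened
```

Injecting the connection factory keeps the reconnection logic testable in isolation, which is the only reason it appears in this sketch.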
|
||||||
|
|
||||||
### Usage Example
|
### Exemple d'utilisation
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from ja4_common.clickhouse import get_client
|
from datetime import datetime
from ja4_common.clickhouse import get_client
from ja4_common.settings import settings
|
||||||
|
|
||||||
client = get_client()
|
client = get_client()
|
||||||
|
|
||||||
# SELECT query
|
# Requête SELECT avec noms de tables pleinement qualifiés
|
||||||
result = client.query("SELECT count() FROM http_logs WHERE src_ip = {ip:String}", {"ip": "203.0.113.42"})
|
result = client.query(
|
||||||
|
f"SELECT count() FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs "
|
||||||
|
"WHERE src_ip = {ip:String}",
|
||||||
|
{"ip": "203.0.113.42"},
|
||||||
|
)
|
||||||
print(result.result_rows)
|
print(result.result_rows)
|
||||||
|
|
||||||
# INSERT
|
# INSERT
|
||||||
client.insert("audit_logs", [[datetime.now(), "analyst1", "investigate", "ip", "203.0.113.42"]],
|
client.insert(
|
||||||
column_names=["timestamp", "user_name", "action", "entity_type", "entity_id"])
|
"audit_logs",
|
||||||
|
[[datetime.now(), "analyst1", "investigate", "ip", "203.0.113.42"]],
|
||||||
|
column_names=["timestamp", "user_name", "action", "entity_type", "entity_id"],
|
||||||
|
)
|
||||||
|
|
||||||
# Command
|
# Commande DDL
|
||||||
client.command("OPTIMIZE TABLE http_logs FINAL")
|
client.command(f"OPTIMIZE TABLE {settings.CLICKHOUSE_DB_LOGS}.http_logs FINAL")
|
||||||
```
|
```
|
||||||
|
|
||||||
## get_client() Singleton
|
---
|
||||||
|
|
||||||
The `get_client()` function provides a module-level singleton `ClickHouseClient`:
|
## Singleton `get_client()`
|
||||||
|
|
||||||
|
La fonction `get_client()` fournit un singleton de `ClickHouseClient` au niveau du module :
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from ja4_common.clickhouse import get_client
|
from ja4_common.clickhouse import get_client
|
||||||
|
|
||||||
# First call creates the client
|
# Le premier appel crée le client
|
||||||
client1 = get_client()
|
client1 = get_client()
|
||||||
|
|
||||||
# Subsequent calls return the same instance
|
# Les appels suivants retournent la même instance
|
||||||
client2 = get_client()
|
client2 = get_client()
|
||||||
assert client1 is client2
|
assert client1 is client2
|
||||||
```
|
```
|
||||||
|
|
||||||
### Implementation
|
### Implémentation
|
||||||
|
|
||||||
```python
|
```python
|
||||||
_client: Optional[ClickHouseClient] = None
|
_client: Optional[ClickHouseClient] = None
|
||||||
@ -120,16 +171,40 @@ def get_client() -> ClickHouseClient:
|
|||||||
return _client
|
return _client
|
||||||
```
|
```
|
||||||
|
|
||||||
## Using from a New Service
|
Architecture à **deux niveaux de singleton** :
|
||||||
|
- `get_client()` → singleton pour `ClickHouseClient`
|
||||||
|
- `settings` dans `settings.py` → singleton pour `ClickHouseSettings`
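
A module-level singleton of this kind persists across tests, so it can help to see the pattern next to a reset hook. The sketch below is a generic lazy singleton, not the code of `ja4_common.clickhouse`. The `ClickHouseClient` class is a stand-in, and `reset_client` is a hypothetical helper that the documentation does not mention.

```python
from typing import Optional


class ClickHouseClient:
    """Stand-in class that counts how many instances were created."""

    instances = 0

    def __init__(self) -> None:
        ClickHouseClient.instances += 1


_client: Optional[ClickHouseClient] = None


def get_client() -> ClickHouseClient:
    """Create the client on first call, then always return the same object."""
    global _client
    if _client is None:
        _client = ClickHouseClient()
    return _client


def reset_client() -> None:
    """Hypothetical test helper: drop the cached instance."""
    global _client
    _client = None


a = get_client()
b = get_client()   # same object as `a`
reset_client()
c = get_client()   # a fresh instance after the reset
```

Without a reset hook, a test can get the same isolation by patching the module-level `_client` attribute.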
|
||||||
|
|
||||||
### 1. Add Dependency
|
---
|
||||||
|
|
||||||
In your service's `requirements.txt`:
|
## Exports du package
|
||||||
|
|
||||||
|
Le `__init__.py` n'exporte qu'une chaîne de version :
|
||||||
|
|
||||||
|
```python
|
||||||
|
"""JA4 Common — shared utilities for the JA4 security suite."""
|
||||||
|
__version__ = "0.1.0"
|
||||||
|
```
|
||||||
|
|
||||||
|
Les consommateurs doivent importer directement depuis les sous-modules :
|
||||||
|
|
||||||
|
```python
|
||||||
|
from ja4_common.clickhouse import get_client
|
||||||
|
from ja4_common.settings import settings
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Intégration dans un nouveau service
|
||||||
|
|
||||||
|
### 1. Ajouter la dépendance
|
||||||
|
|
||||||
|
Dans le `requirements.txt` du service :
|
||||||
```
|
```
|
||||||
ja4-common @ file:///app/shared/python/ja4_common
|
ja4-common @ file:///app/shared/python/ja4_common
|
||||||
```
|
```
|
||||||
|
|
||||||
Or in `pyproject.toml`:
|
Ou dans `pyproject.toml` :
|
||||||
```toml
|
```toml
|
||||||
[project]
|
[project]
|
||||||
dependencies = [
|
dependencies = [
|
||||||
@ -137,51 +212,59 @@ dependencies = [
|
|||||||
]
|
]
|
||||||
```
|
```
|
||||||
|
|
||||||
### 2. Docker Setup
|
### 2. Configuration Docker
|
||||||
|
|
||||||
```dockerfile
|
```dockerfile
|
||||||
# Copy shared library
|
# Copier la bibliothèque partagée
|
||||||
COPY shared/python/ja4_common /app/shared/python/ja4_common
|
COPY shared/python/ja4_common /app/shared/python/ja4_common
|
||||||
RUN pip install /app/shared/python/ja4_common
|
RUN pip install /app/shared/python/ja4_common
|
||||||
|
|
||||||
# Copy service code
|
# Copier le code du service
|
||||||
COPY services/my-service /app/services/my-service
|
COPY services/mon-service /app/services/mon-service
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Use in Code
|
### 3. Utiliser dans le code
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from ja4_common.clickhouse import get_client
|
from ja4_common.clickhouse import get_client
|
||||||
from ja4_common.settings import settings
|
from ja4_common.settings import settings
|
||||||
|
|
||||||
# Access settings
|
# Accéder à la configuration
|
||||||
print(f"Connecting to {settings.CLICKHOUSE_HOST}:{settings.CLICKHOUSE_PORT}")
|
print(f"Connexion à {settings.CLICKHOUSE_HOST}:{settings.CLICKHOUSE_PORT}")
|
||||||
|
print(f"Base logs : {settings.CLICKHOUSE_DB_LOGS}")
|
||||||
|
print(f"Base processing : {settings.CLICKHOUSE_DB_PROCESSING}")
|
||||||
|
|
||||||
# Use client
|
# Utiliser le client avec noms pleinement qualifiés
|
||||||
db = get_client()
|
db = get_client()
|
||||||
result = db.query("SELECT count() FROM ml_detected_anomalies")
|
result = db.query(
|
||||||
|
f"SELECT count() FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies"
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Environment Configuration
|
### 4. Configuration d'environnement
|
||||||
|
|
||||||
Create a `.env` file or set environment variables:
|
Créer un fichier `.env` ou définir les variables d'environnement :
|
||||||
```bash
|
```bash
|
||||||
CLICKHOUSE_HOST=clickhouse.example.com
|
CLICKHOUSE_HOST=clickhouse.example.com
|
||||||
CLICKHOUSE_PORT=8123
|
CLICKHOUSE_PORT=8123
|
||||||
CLICKHOUSE_DB=ja4_processing
|
CLICKHOUSE_DB=ja4_processing
|
||||||
|
CLICKHOUSE_DB_PROCESSING=ja4_processing
|
||||||
|
CLICKHOUSE_DB_LOGS=ja4_logs
|
||||||
CLICKHOUSE_USER=data_writer
|
CLICKHOUSE_USER=data_writer
|
||||||
CLICKHOUSE_PASSWORD=secret
|
CLICKHOUSE_PASSWORD=secret
|
||||||
```
|
```
|
||||||
|
|
||||||
## Testing: Mocking the Client
|
---
|
||||||
|
|
||||||
### Using unittest.mock
|
## Tests : simuler le client
|
||||||
|
|
||||||
|
### Avec unittest.mock
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from unittest.mock import MagicMock, patch
|
from unittest.mock import MagicMock, patch
|
||||||
from ja4_common.clickhouse import ClickHouseClient
|
from ja4_common.clickhouse import ClickHouseClient
|
||||||
|
|
||||||
def test_my_service():
|
def test_mon_service():
|
||||||
mock_client = MagicMock(spec=ClickHouseClient)
|
mock_client = MagicMock(spec=ClickHouseClient)
|
||||||
mock_client.query.return_value = MagicMock(result_rows=[(42,)])
|
mock_client.query.return_value = MagicMock(result_rows=[(42,)])
|
||||||
|
|
||||||
@ -192,25 +275,32 @@ def test_my_service():
|
|||||||
assert result.result_rows == [(42,)]
|
assert result.result_rows == [(42,)]
|
||||||
```
|
```
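
A self-contained variant of the same idea, using only `unittest.mock`: the service function takes the client as an argument, so no patching is needed. `count_anomalies` is a hypothetical function written for this example. The table name and the `result_rows` attribute follow the examples in this document.

```python
from unittest.mock import MagicMock


def count_anomalies(client, database: str) -> int:
    """Hypothetical service function: count rows in the anomalies table."""
    result = client.query(f"SELECT count() FROM {database}.ml_detected_anomalies")
    return result.result_rows[0][0]


# The mock mimics the only two things the function relies on:
# a `query` method and a `result_rows` attribute on its return value.
mock_client = MagicMock()
mock_client.query.return_value = MagicMock(result_rows=[(42,)])

total = count_anomalies(mock_client, "test_processing")

# Check that the fully qualified table name reached the client.
mock_client.query.assert_called_once_with(
    "SELECT count() FROM test_processing.ml_detected_anomalies"
)
```

Passing the client in explicitly is a design choice of this sketch. Code that calls `get_client()` internally would need `patch` on the importing module instead.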
|
||||||
|
|
||||||
### Overriding Settings in Tests
|
### Surcharger les paramètres en test
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from ja4_common.settings import ClickHouseSettings
|
from ja4_common.settings import ClickHouseSettings
|
||||||
|
|
||||||
# Create custom settings for tests
|
# Créer des paramètres personnalisés pour les tests
|
||||||
test_settings = ClickHouseSettings(
|
test_settings = ClickHouseSettings(
|
||||||
CLICKHOUSE_HOST="localhost",
|
CLICKHOUSE_HOST="localhost",
|
||||||
CLICKHOUSE_PORT=8123,
|
CLICKHOUSE_PORT=8123,
|
||||||
CLICKHOUSE_DB="test_db",
|
CLICKHOUSE_DB="test_db",
|
||||||
|
CLICKHOUSE_DB_PROCESSING="test_processing",
|
||||||
|
CLICKHOUSE_DB_LOGS="test_logs",
|
||||||
CLICKHOUSE_USER="test_user",
|
CLICKHOUSE_USER="test_user",
|
||||||
CLICKHOUSE_PASSWORD="test_pass",
|
CLICKHOUSE_PASSWORD="test_pass",
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Source Files
|
---
|
||||||
|
|
||||||
| File | Description |
|
## Fichiers sources
|
||||||
|------|-------------|
|
|
||||||
| `ja4_common/settings.py` | `ClickHouseSettings` pydantic-settings model |
|
| Fichier | Description |
|
||||||
| `ja4_common/clickhouse.py` | `ClickHouseClient` class and `get_client()` singleton |
|
|---------|-------------|
|
||||||
| `pyproject.toml` | Package metadata and dependencies |
|
| `ja4_common/__init__.py` | Docstring du package et `__version__` |
|
||||||
|
| `ja4_common/settings.py` | Modèle pydantic-settings `ClickHouseSettings` et singleton `settings` |
|
||||||
|
| `ja4_common/clickhouse.py` | Classe `ClickHouseClient` et singleton `get_client()` |
|
||||||
|
| `pyproject.toml` | Métadonnées du package et dépendances |
|
||||||
|
| `tests/test_settings.py` | Tests unitaires pour `ClickHouseSettings` |
|
||||||
|
| `tests/test_clickhouse.py` | Tests unitaires pour `ClickHouseClient` |
|
||||||
|
|||||||