From 77c0450a228be165827a03aefa5f4e8a48e715be Mon Sep 17 00:00:00 2001
From: toto <antitbone@jefaismonrhum.sdv.fr>
Date: Wed, 8 Apr 2026 07:31:10 +0200
Subject: [PATCH] docs: update copilot-instructions.md for dashboard rewrite
 and ML upgrades
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Dashboard: FastAPI+React → FastAPI+Jinja2+htmx+Chart.js (2 route modules)
- Bot-detector: IsolationForest → triple-voice EIF+Autoencoder+XGBoost ensemble
- SQL schema: 10 → 13 files (added thesis features, perf indexes, views);
  removed the bot-detector deploy_views.sql deployment-path bullet
- Added ClickHouse 24.8 gotchas (projections, nested aggregates, let bindings)
- Added IPv4/IPv6 duality pattern, bot-detector test patterns
- Updated data retention table with 4 new thesis aggregation tables
- Fixed single-test commands to reference existing files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md | 84 ++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 28 deletions(-)
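
Note for reviewers: the ensemble score formula this patch adds to the bot-detector section can be sanity-checked with a few lines of Python. This is a minimal sketch, not code from `bot_detector.py`: the helper name `blend_scores` is hypothetical, and it assumes all three inputs are already normalized to [0, 1]. With the documented defaults (`AE_WEIGHT=0.30`, `XGB_WEIGHT=0.20`), the effective weights work out to 0.56 for EIF, 0.24 for the autoencoder, and 0.20 for XGBoost.

```python
# Hypothetical helper for illustration only; name and signature are
# assumptions, not taken from bot_detector.py.
AE_WEIGHT = 0.30   # default documented in the patch
XGB_WEIGHT = 0.20  # default documented in the patch


def blend_scores(eif_norm, ae_norm, xgb_prob,
                 ae_weight=AE_WEIGHT, xgb_weight=XGB_WEIGHT):
    """Blend the three voices using the formula from the patch.

    All inputs are assumed to be pre-normalized to the [0, 1] range.
    """
    # Inner blend: the two unsupervised scorers (EIF and autoencoder).
    unsupervised = (1 - ae_weight) * eif_norm + ae_weight * ae_norm
    # Outer blend: mix in the supervised XGBoost probability.
    return (1 - xgb_weight) * unsupervised + xgb_weight * xgb_prob


if __name__ == "__main__":
    # Feeding a 1.0 into a single voice exposes its effective weight.
    print("EIF weight:", blend_scores(1.0, 0.0, 0.0))
    print("AE weight: ", blend_scores(0.0, 1.0, 0.0))
    print("XGB weight:", blend_scores(0.0, 0.0, 1.0))
```

Because the formula nests two convex combinations, the final score stays in [0, 1] whenever the inputs do, and setting `xgb_weight=0` reduces it to the purely unsupervised blend.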

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 37a2187..0f8bbdd 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -28,9 +28,13 @@ make test-mod-reqin-log     # C cmocka tests
 docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
   go test -v -run TestConfigLoad ./internal/config/
 
-# Single Python test (from repo root):
-docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
-docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health
+# Single Python test (bot-detector, from repo root):
+docker build -f services/bot-detector/bot_detector/Dockerfile.tests -t bd-tests .
+docker run --rm bd-tests pytest bot_detector/tests/test_detector.py -v -k test_ae_class_train_and_score
+
+# Local Python tests (needs deps installed):
+cd services/bot-detector && pip install -r bot_detector/requirements.txt pytest pytest-mock
+pytest bot_detector/tests/test_detector.py -v -k "test_benford"
 
 # Faster correlator build (skip tests):
 docker build --target builder --build-arg SKIP_TESTS=true -f services/correlator/Dockerfile .
@@ -70,14 +74,25 @@ config/                → YAML config loader
 
 ### Python services
 
-- `bot-detector` — scikit-learn IsolationForest + DBSCAN + SHAP. Single monolithic module (`bot_detector.py`). Runs **two parallel IF models** per cycle:
-  - `Complet` (45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
-  - `Applicatif` (35 features L7-only) on uncorrelated traffic (`correlated=0`)
-  - Optional 24h variants (`Complet_24h`/`Applicatif_24h`) when `ENABLE_MULTIWINDOW=true`
+- `bot-detector` — **Triple-voice ensemble** ML pipeline. Single monolithic module (`bot_detector.py`, ~1550 lines). Runs two parallel models per cycle:
+  - `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
+  - `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
+  - Optional 24h variants when `ENABLE_MULTIWINDOW=true`
+  
+  **ML ensemble:**
+  - **Extended Isolation Forest** (isotree, primary unsupervised scorer, falls back to sklearn)
+  - **Autoencoder** (PyTorch, architecture n→64→32→16→32→64→n, reconstruction error scorer)
+  - **XGBoost** (supervised, trained on SOC analyst labels from `soc_feedback`)
+  - Score: `final = (1-XGB_WEIGHT) * ((1-AE_WEIGHT)*eif_norm + AE_WEIGHT*ae_norm) + XGB_WEIGHT*xgb_prob`
+  - Default weights: `AE_WEIGHT=0.30`, `XGB_WEIGHT=0.20`
+  
+  Also: adaptive thresholds, SHAP explainability, HDBSCAN campaign clustering, Benford deviation, lag-1 autocorrelation, quantile drift detection, feature pruning, validation gate.
   
   `model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. Uses `os.getenv()` directly for config, NOT pydantic-settings.
-- `dashboard` — FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
-- `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.
+
+- `dashboard` — FastAPI + Jinja2 server-side templates + htmx + Chart.js + Tailwind CSS (CDN). Two route modules: `backend/routes/api.py` (9 JSON endpoints) and `backend/routes/pages.py` (8 HTML pages). Uses `os.getenv()` with `safe_identifier()` for DB name validation.
+
+- `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile. **Note:** dashboard does NOT use ja4_common — it has its own lightweight `clickhouse-connect` client in `backend/database.py`.
 
 ### C module
 
@@ -107,9 +122,8 @@ Never hardcode database names in queries.
 **In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
 
 **SQL schema** has two deployment paths:
-- **Base schema**: `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
-- **Bot-detector views**: `services/bot-detector/deploy_views.sql` — aggregation tables, MVs, ML result tables, dashboard views. Version-controlled separately (currently v14).
-- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments (TTL changes, ORDER BY fixes). Run manually: `clickhouse-client --multiquery < file.sql`.
+- **Base schema**: `shared/clickhouse/` (13 ordered files, `00_database.sql` through `12_thesis_features.sql`). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
+- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments. Run manually: `clickhouse-client --multiquery < file.sql`.
 
 ### Data retention (TTL)
 
@@ -119,6 +133,10 @@ Never hardcode database names in queries.
 | `http_logs` | 30 days | `toDate(log_date)` |
 | `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
 | `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
+| `agg_ip_behavior_1h` | 7 days | `toDate(window_start)` |
+| `agg_request_timing_1h` | 7 days | `toDate(window_start)` |
+| `agg_path_sequences_1h` | 7 days | `toDate(window_start)` |
+| `agg_resource_cascade_1h` | 7 days | `toDate(window_start)` |
 | `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
 | `ml_all_scores` | 7 days | `toDate(window_start)` |
 
@@ -134,26 +152,26 @@ Every service has `Dockerfile` (prod), `Dockerfile.dev` or `Dockerfile.tests` (t
 - Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
 - Both support `SIGHUP` for log rotation
 
-### Python config: pydantic-settings
-- Dashboard: `backend/config.py` → `Settings(BaseSettings)` with `.env` file
-- ja4_common: `ClickHouseSettings(BaseSettings)` — singleton at `settings`
-- bot-detector: exception — uses raw `os.getenv()`, not pydantic-settings
+### Python config
+- ja4_common: `ClickHouseSettings(BaseSettings)` — pydantic-settings singleton
+- bot-detector: raw `os.getenv()` with defaults, NOT pydantic-settings
+- dashboard: `backend/config.py` reads `CLICKHOUSE_DB_PROCESSING`/`CLICKHOUSE_DB_LOGS` from env, validates identifiers with `safe_identifier()` regex
 
 ### Dashboard route structure
-Every route file follows this pattern:
+Two route modules: `backend/routes/api.py` (JSON API) and `backend/routes/pages.py` (HTML).
+API endpoints use parameterized ClickHouse queries with sort/order whitelists:
 ```python
-from fastapi import APIRouter, HTTPException, Query
-from ..config import settings
-from ..database import db
+from backend.config import DB_PROCESSING, DB_LOGS, safe_identifier
+from backend.database import query, query_scalar, execute
 
-router = APIRouter()
-
-@router.get("/api/something")
-async def get_something():
-    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
-    result = db.query(query)
-    ...
+_DB = safe_identifier(DB_PROCESSING)
+rows = query(
+    f"SELECT ... FROM {_DB}.ml_detected_anomalies "
+    "WHERE src_ip = toIPv6({ip:String})",
+    {"ip": ip_value},
+)
 ```
+**IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but the ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6`, with IPv4 addresses in IPv4-mapped form (`::ffff:x.x.x.x`). Wrap the bound value in `toIPv6()` when querying ML tables and in `toIPv4OrZero()` when querying `http_logs`. Strip the `::ffff:` prefix when displaying IPs.
 
 ### RPM spec files
 Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
@@ -191,7 +209,17 @@ Go's YAML parser reads `${VAR:-default}` as a **literal string**. Use hardcoded
 `sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.
 
 ### ClickHouse initialization timing
-ClickHouse takes ~15-20s to initialize all 10 SQL files. Integration health checks use a 120s timeout (not the default 60s).
+ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health checks use a 120s timeout (not the default 60s).
+
+### ClickHouse 24.8 SQL gotchas
+- **Projections in AggregatingMergeTree** require `SETTINGS deduplicate_merge_projection_mode = 'drop'`
+- **No `let` bindings** in SQL expressions — inline everything or use CTEs
+- **Nested aggregates are illegal**: `argMax(any(col), sum(hits))` fails. Use a subquery to pre-aggregate, then apply the outer aggregate
+- **`groupArrayIf` column type** must match the State/Merge functions: if the MV uses `groupArrayIfState()`, the column must be `AggregateFunction(groupArrayIf(...), Type, UInt8)`, not the plain `groupArray` variant `AggregateFunction(groupArray(...), Type)`
+- **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error
+
+### Bot-detector test patterns
+Tests in `bot_detector/tests/test_detector.py` are **self-contained**: they re-implement key logic rather than importing from the main module, which avoids heavy `joblib`/`sklearn` import chains. Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully (`pytest.skip`) if `torch` or `xgboost` is not installed.
 
 ### RPM builds must use Rocky Linux
 All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.