From 11b46b2eabbbc920e9b6ef9557fe65bd17527851 Mon Sep 17 00:00:00 2001
From: toto
Date: Tue, 7 Apr 2026 23:55:28 +0200
Subject: [PATCH] docs: update copilot-instructions.md for v14 changes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Fix coverage gate: 60% → 80% for correlator
- Document dual-model pattern (Complet/Applicatif) in bot-detector
- Add SQL deployment paths: deploy_views.sql + service migrations
- Add data retention TTL table with partition info
- Fix integration test description (8 phases, --build-only flag)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 3d3c6cf..37a2187 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -18,7 +18,7 @@ make rpm-all # build RPMs (sentinel, correlator, mod-reqin-log) for el8/
 
 # Per-service tests
 make test-sentinel # Go tests (needs --cap-add=NET_RAW inside)
-make test-correlator # Go tests with 60% coverage gate
+make test-correlator # Go tests with 80% coverage gate
 make test-bot-detector # Python pytest
 make test-dashboard # Python pytest
 make test-ja4common-python # Python pytest (shared lib)
@@ -40,9 +40,10 @@ cd services/sentinel && go vet ./... && gofmt -l .
 cd services/correlator && go vet ./... && gofmt -l .
 
 # Full-stack integration tests (Docker Compose, resets DB each run)
-make test-integration # runs tests/integration/run-tests.sh → down -v + up + traffic + verify
+make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
 make test-integration-keep # same but leaves stack running after
 make test-integration-down # tear down integration stack
+# run-tests.sh also accepts: --build-only (build images without running tests)
 ```
 
 ## Architecture
@@ -69,7 +70,12 @@ config/ → YAML config loader
 
 ### Python services
 
-- `bot-detector` — scikit-learn IsolationForest + DBSCAN. Single monolithic module (`bot_detector.py`). Uses `os.getenv()` directly for config, NOT pydantic-settings.
+- `bot-detector` — scikit-learn IsolationForest + DBSCAN + SHAP. Single monolithic module (`bot_detector.py`). Runs **two parallel IF models** per cycle:
+  - `Complet` (45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
+  - `Applicatif` (35 features L7-only) on uncorrelated traffic (`correlated=0`)
+  - Optional 24h variants (`Complet_24h`/`Applicatif_24h`) when `ENABLE_MULTIWINDOW=true`
+
+  `model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. Uses `os.getenv()` directly for config, NOT pydantic-settings.
 - `dashboard` — FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
 - `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.
 
@@ -100,7 +106,23 @@ Never hardcode database names in queries.
 
 **In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
 
-**SQL migrations** live in `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
+**SQL schema** has two deployment paths:
+- **Base schema**: `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
+- **Bot-detector views**: `services/bot-detector/deploy_views.sql` — aggregation tables, MVs, ML result tables, dashboard views. Version-controlled separately (currently v14).
+- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments (TTL changes, ORDER BY fixes). Run manually: `clickhouse-client --multiquery < file.sql`.
+
+### Data retention (TTL)
+
+| Table | TTL | Partition |
+|-------|-----|-----------|
+| `http_logs_raw` | 2 hours | `toStartOfHour(ingest_time)` |
+| `http_logs` | 30 days | `toDate(log_date)` |
+| `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
+| `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
+| `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
+| `ml_all_scores` | 7 days | `toDate(window_start)` |
+
+All aggregation/ML tables use `ttl_only_drop_parts=1` for efficient partition-level expiry.
 
 ## Key conventions
 