docs: update copilot-instructions.md for v14 changes

- Fix coverage gate: 60% → 80% for correlator
- Document dual-model pattern (Complet/Applicatif) in bot-detector
- Add SQL deployment paths: deploy_views.sql + service migrations
- Add data retention TTL table with partition info
- Fix integration test description (8 phases, --build-only flag)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-07 23:55:28 +02:00
parent 51b8eb57a8
commit 11b46b2eab

@@ -18,7 +18,7 @@ make rpm-all # build RPMs (sentinel, correlator, mod-reqin-log) for el8/
 # Per-service tests
 make test-sentinel # Go tests (needs --cap-add=NET_RAW inside)
-make test-correlator # Go tests with 60% coverage gate
+make test-correlator # Go tests with 80% coverage gate
 make test-bot-detector # Python pytest
 make test-dashboard # Python pytest
 make test-ja4common-python # Python pytest (shared lib)
@@ -40,9 +40,10 @@ cd services/sentinel && go vet ./... && gofmt -l .
 cd services/correlator && go vet ./... && gofmt -l .
 # Full-stack integration tests (Docker Compose, resets DB each run)
-make test-integration # runs tests/integration/run-tests.sh → down -v + up + traffic + verify
+make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
 make test-integration-keep # same but leaves stack running after
 make test-integration-down # tear down integration stack
+# run-tests.sh also accepts: --build-only (build images without running tests)
 ```
 ## Architecture
@@ -69,7 +70,12 @@ config/ → YAML config loader
 ### Python services
-- `bot-detector` — scikit-learn IsolationForest + DBSCAN. Single monolithic module (`bot_detector.py`). Uses `os.getenv()` directly for config, NOT pydantic-settings.
+- `bot-detector` — scikit-learn IsolationForest + DBSCAN + SHAP. Single monolithic module (`bot_detector.py`). Runs **two parallel IF models** per cycle:
+- `Complet` (45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
+- `Applicatif` (35 features L7-only) on uncorrelated traffic (`correlated=0`)
+- Optional 24h variants (`Complet_24h`/`Applicatif_24h`) when `ENABLE_MULTIWINDOW=true`
+`model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. Uses `os.getenv()` directly for config, NOT pydantic-settings.
 - `dashboard` — FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
 - `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.
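Review note: the dual-model selection described in this hunk can be pictured with the minimal sketch below. The model names, feature counts, `correlated` values, and the `ENABLE_MULTIWINDOW` variable come from the diff. Everything else (`ModelSpec`, `models_for_cycle`, `split_rows`, and the idea that the 24h variants reuse the same feature counts) is a hypothetical illustration, not the code in `bot_detector.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str        # value stored in the model_name column
    n_features: int  # feature count quoted in the diff
    correlated: int  # value of the `correlated` flag this model consumes


# Names, feature counts, and flags are taken from the diff.
BASE_MODELS = (
    ModelSpec("Complet", 45, correlated=1),
    ModelSpec("Applicatif", 35, correlated=0),
)


def models_for_cycle(env=None):
    """Return the model specs to run in one detection cycle."""
    env = os.environ if env is None else env
    specs = list(BASE_MODELS)
    if env.get("ENABLE_MULTIWINDOW", "false").lower() == "true":
        # Assumption: the 24h variants reuse the base feature sets.
        specs += [
            ModelSpec(f"{m.name}_24h", m.n_features, m.correlated)
            for m in BASE_MODELS
        ]
    return specs


def split_rows(rows):
    """Partition rows by `correlated` flag, one bucket per base model."""
    buckets = {m.name: [] for m in BASE_MODELS}
    for row in rows:
        for m in BASE_MODELS:
            if row["correlated"] == m.correlated:
                buckets[m.name].append(row)
    return buckets
```

Because `model_name` is part of the ORDER BY key in both result tables, each spec's `name` would need to be written alongside every score so that rows from different models stay distinct.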
@@ -100,7 +106,23 @@ Never hardcode database names in queries.
 **In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
-**SQL migrations** live in `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
+**SQL schema** has two deployment paths:
+- **Base schema**: `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
+- **Bot-detector views**: `services/bot-detector/deploy_views.sql` — aggregation tables, MVs, ML result tables, dashboard views. Version-controlled separately (currently v14).
+- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments (TTL changes, ORDER BY fixes). Run manually: `clickhouse-client --multiquery < file.sql`.
+### Data retention (TTL)
+| Table | TTL | Partition |
+|-------|-----|-----------|
+| `http_logs_raw` | 2 hours | `toStartOfHour(ingest_time)` |
+| `http_logs` | 30 days | `toDate(log_date)` |
+| `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
+| `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
+| `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
+| `ml_all_scores` | 7 days | `toDate(window_start)` |
+All aggregation/ML tables use `ttl_only_drop_parts=1` for efficient partition-level expiry.
 ## Key conventions
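Review note: the retention table added in the last hunk can be read as a simple lookup, sketched below. The table names and TTL values are copied from the diff; `TABLE_TTL` and `oldest_retained` are hypothetical names used only for illustration. ClickHouse applies TTL during background merges, and with `ttl_only_drop_parts=1` a part is dropped only once all of its rows have expired, so the result is the age up to which data is expected to be kept, not a guarantee that older rows are already gone.

```python
from datetime import datetime, timedelta, timezone

# TTL values copied from the retention table in the diff.
TABLE_TTL = {
    "http_logs_raw": timedelta(hours=2),
    "http_logs": timedelta(days=30),
    "agg_host_ip_ja4_1h": timedelta(days=7),
    "agg_header_fingerprint_1h": timedelta(days=7),
    "ml_detected_anomalies": timedelta(days=7),
    "ml_all_scores": timedelta(days=7),
}


def oldest_retained(table, now):
    """Return the earliest timestamp still inside the table's TTL at `now`."""
    return now - TABLE_TTL[table]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for table in TABLE_TTL:
        print(table, oldest_retained(table, now).isoformat())
```

A check like this could help when writing dashboard queries, since a time range that reaches back past the cutoff for `http_logs_raw` may return partial or no rows.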