docs: update copilot-instructions.md

- bot-detector: monolith → 10 modules
- Add convention for browser detection without UA (5 axes, Client Hints)
- Add Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarified heavy imports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-09 23:11:24 +02:00
parent 14db3d9040
commit 895d7894a9

@@ -48,6 +48,12 @@ make test-integration # 8 phases: build → start → schema → traffi
make test-integration-keep # same but leaves stack running after
make test-integration-down # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)
# Database management
make init-stack # initialize ClickHouse schema from scratch
make import-prod-data # import prod data with date shifting
make purge-db # DROP and recreate ja4_logs + ja4_processing
make help # list all 24+ targets with descriptions
```
## Architecture
@@ -74,7 +80,9 @@ config/ → YAML config loader
### Python services
- `bot-detector`: **Triple-voice ensemble** ML pipeline. Single monolithic module (`bot_detector.py`, ~1550 lines). Runs two parallel models per cycle:
- `bot-detector`: **Triple-voice ensemble** ML pipeline. Modular package (10 modules):
- `config.py`, `log.py`, `infra.py`, `browser.py`, `scoring.py`, `models.py`, `preprocessing.py`, `pipeline.py`, `cycle.py`, `__main__.py`
- Runs two parallel models per cycle:
- `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
- `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
- Optional 24h variants when `ENABLE_MULTIWINDOW=true`
@@ -173,6 +181,17 @@ rows = query(
```
**IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6` (mapped as `::ffff:x.x.x.x`). Use `toIPv6()` for ML table queries, `toIPv4OrZero()` for http_logs queries. Strip `::ffff:` prefix when displaying IPs.
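The display-side half of this rule can be sketched in Python with the standard `ipaddress` module. This is a minimal illustration, not code from the repository: the helper names `to_ml_ip` and `display_ip` are hypothetical, and the ClickHouse-side conversion (`toIPv6()`, `toIPv4OrZero()`) still happens in the query itself.

```python
import ipaddress


def to_ml_ip(ipv4: str) -> str:
    """Return the IPv4-mapped IPv6 text form (::ffff:x.x.x.x) of an IPv4 address.

    Hypothetical helper: mirrors how the ML tables are described as storing IPs.
    """
    addr = ipaddress.IPv4Address(ipv4)  # raises ValueError on malformed input
    return f"::ffff:{addr}"


def display_ip(stored: str) -> str:
    """Strip the ::ffff: mapping for display; leave other addresses untouched."""
    addr = ipaddress.ip_address(stored)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)


if __name__ == "__main__":
    print(to_ml_ip("192.0.2.10"))
    print(display_ip("::ffff:192.0.2.10"))
    print(display_ip("2001:db8::1"))
```

Using `ipv4_mapped` rather than a plain string replace means a genuine IPv6 address is left untouched.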
### Browser detection: NO User-Agent
Browser identification uses 5 axes in `browser.py`; **none rely on User-Agent** (trivially spoofable):
- **Axe 1 (0.30)**: JA4 fingerprint match in `dict_browser_ja4` (TLS-level, hardest to fake)
- **Axe 2 (0.15)**: JA4 structural analysis (TLS1.3, h2/h3, cipher/extension counts)
- **Axe 3 (0.20)**: Client Hints (`has_sec_ch_ua`) + Sec-Fetch-* consistency + `is_fake_navigation`
- **Axe 4 (0.15)**: Navigation behavior (cookies, referers, asset ratio)
- **Axe 5 (0.20)**: TLS/TCP coherence (ALPN, window scale, TLS version)
`modern_browser_score` in SQL: sec-ch-ua→100, Sec-Fetch→70, else→0. Never uses UA presence.
`ua_ch_mismatch` and `is_ua_rotating` remain as ML anomaly features (detect bots), but are NOT used for browser identification.
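As a rough illustration of how the five weights above could combine, here is a minimal Python sketch. It assumes each axis yields a score in [0, 1] and that the axes are combined as a weighted sum; the text above gives only the weights, so both are assumptions. The axis keys and function names are hypothetical and are not taken from `browser.py`.

```python
from typing import Mapping

# Weights as listed above; the dictionary keys are illustrative names only.
AXIS_WEIGHTS = {
    "ja4_match": 0.30,          # Axe 1: JA4 fingerprint match
    "ja4_structure": 0.15,      # Axe 2: JA4 structural analysis
    "client_hints": 0.20,       # Axe 3: Client Hints + Sec-Fetch-* consistency
    "navigation": 0.15,         # Axe 4: navigation behavior
    "tls_tcp_coherence": 0.20,  # Axe 5: TLS/TCP coherence
}


def browser_score(axis_scores: Mapping[str, float]) -> float:
    """Weighted sum of per-axis scores (assumed combination rule).

    Missing axes count as 0.0 and each score is clamped to [0, 1].
    """
    total = 0.0
    for axis, weight in AXIS_WEIGHTS.items():
        value = min(1.0, max(0.0, axis_scores.get(axis, 0.0)))
        total += weight * value
    return total


def modern_browser_score(has_sec_ch_ua: bool, has_sec_fetch: bool) -> int:
    """Python rendering of the SQL mapping: sec-ch-ua -> 100, Sec-Fetch -> 70, else 0."""
    if has_sec_ch_ua:
        return 100
    if has_sec_fetch:
        return 70
    return 0
```

Note that neither function takes a User-Agent argument, which matches the rule that UA presence is never an input to browser identification.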
### RPM spec files
Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
@@ -202,8 +221,8 @@ When building either `sentinel` or `correlator` in Docker, the build context mus
### Correlator YAML does not expand env vars
Go's YAML parser reads `${VAR:-default}` as a **literal string**. Use hardcoded values or pass values directly in the YAML file. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.
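If a templated file is still wanted, one possible workaround is to resolve the placeholders before the correlator ever sees the file, so that what it reads contains only literal values. The sketch below is an assumption, not part of the repository: `expand_placeholders` is a hypothetical helper, and the `dsn` line, the `CH_HOST` variable, and the DSN format in the usage example are made up for illustration.

```python
import re
from typing import Mapping

# Matches ${VAR} and ${VAR:-default}; VAR follows shell identifier rules.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_placeholders(text: str, env: Mapping[str, str]) -> str:
    """Replace ${VAR:-default} placeholders with values from env.

    A set, non-empty variable wins; otherwise the default is used.
    A placeholder with no value and no default raises KeyError.
    """
    def _substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value:
            return value
        if default is not None:
            return default
        raise KeyError(f"{name} is not set and has no default")

    return _PLACEHOLDER.sub(_substitute, text)


if __name__ == "__main__":
    import os

    # Hypothetical template line; the real correlator.yml keys may differ.
    template = 'dsn: "clickhouse://${CH_HOST:-clickhouse}:9000/default"'
    print(expand_placeholders(template, os.environ))
```

The rendered output is then an ordinary YAML file with hardcoded values, which is what the correlator expects.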
### REGEXP_TREE dictionary requires ≥1 rule
`dict_anubis_ua` uses `LAYOUT(REGEXP_TREE)`. If `anubis_ua_rules` is empty, **every INSERT into `http_logs_raw` fails** because the materialized view `mv_http_logs` calls `dictGet()` on it. The integration test init script seeds a catch-all rule.
### Anubis reference lists: IP/CIDR + ASN only
The Anubis pipeline was simplified — only `dict_anubis_ip` (IP/CIDR) and `dict_anubis_asn` exist. UA and Country dictionaries were removed. The `fetch_rules.py` script only imports IP ranges and ASN numbers from Anubis lists.
### TLS/pcap capture needs non-loopback traffic
`sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.
@@ -219,7 +238,7 @@ ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health chec
- **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error
### Bot-detector test patterns
Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main module (avoids heavy `joblib`/`sklearn` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main modules (avoids heavy `torch`/`sklearn`/`isotree` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
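The self-contained pattern described above might look roughly like the following. This is a hedged sketch, not an excerpt from `test_detector.py`: `reconstruction_error` is a hypothetical stand-in for "re-implemented key logic", and `module_available` is an illustrative helper. Only the `pytest.skip` call comes from the description above.

```python
import importlib.util
from typing import Sequence


def module_available(name: str) -> bool:
    """True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def reconstruction_error(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Local re-implementation (hypothetical): mean squared error between
    an input vector and its reconstruction, kept free of heavy imports."""
    if len(x) != len(x_hat):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def test_reconstruction_error_is_zero_for_identical_vectors():
    # Pure-Python test: runs even when torch/sklearn/isotree are absent.
    assert reconstruction_error([0.1, 0.2], [0.1, 0.2]) == 0.0


def test_tensor_roundtrip_requires_torch():
    # Heavy dependency is imported lazily and the test skips if it is missing.
    if not module_available("torch"):
        import pytest

        pytest.skip("torch not installed")
    import torch

    t = torch.tensor([1.0, 2.0])
    assert t.tolist() == [1.0, 2.0]
```

Keeping the heavy import inside the test body means merely collecting the file never triggers the `torch` import chain.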
### RPM builds must use Rocky Linux
All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.