docs: update copilot-instructions.md

- bot-detector: monolith → 10 modules
- Add convention: browser detection without UA (5 axes, Client Hints)
- Add Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarify heavy imports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
27
.github/copilot-instructions.md
vendored
@@ -48,6 +48,12 @@ make test-integration # 8 phases: build → start → schema → traffi
make test-integration-keep # same but leaves stack running after
make test-integration-down # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)

# Database management
make init-stack # initialize ClickHouse schema from scratch
make import-prod-data # import prod data with date shifting
make purge-db # DROP and recreate ja4_logs + ja4_processing
make help # list all 24+ targets with descriptions
```

## Architecture
@@ -74,7 +80,9 @@ config/ → YAML config loader

### Python services

- `bot-detector` — **Triple-voice ensemble** ML pipeline. Single monolithic module (`bot_detector.py`, ~1550 lines). Runs two parallel models per cycle:
- `bot-detector` — **Triple-voice ensemble** ML pipeline. Modular package (10 modules):
  - `config.py` → `log.py` → `infra.py` → `browser.py` → `scoring.py` → `models.py` → `preprocessing.py` → `pipeline.py` → `cycle.py` → `__main__.py`
  - Runs two parallel models per cycle:
  - `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
  - `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
  - Optional 24h variants when `ENABLE_MULTIWINDOW=true`
@@ -173,6 +181,17 @@ rows = query(
```
**IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6` (mapped as `::ffff:x.x.x.x`). Use `toIPv6()` for ML table queries, `toIPv4OrZero()` for http_logs queries. Strip `::ffff:` prefix when displaying IPs.
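The display rule can be sketched in Python with the standard `ipaddress` module. This is a minimal illustration rather than code from the repository: the helper names `display_ip` and `to_mapped_ipv6` are hypothetical, and the sketch assumes ML-table values arrive as strings such as `::ffff:203.0.113.7`.

```python
import ipaddress


def display_ip(value: str) -> str:
    """Return a display form of an address read from an ML table.

    IPv4-mapped IPv6 values (::ffff:a.b.c.d) are shown as plain IPv4;
    any other address is returned in its canonical string form.
    """
    addr = ipaddress.ip_address(value)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)


def to_mapped_ipv6(ipv4: str) -> str:
    """Build the ::ffff:a.b.c.d form for an IPv4 string (validates the input)."""
    return "::ffff:" + str(ipaddress.IPv4Address(ipv4))
```

Using `ipv4_mapped` instead of a plain string replace avoids touching genuine IPv6 addresses that merely contain `ffff`.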

### Browser detection: NO User-Agent
Browser identification uses 5 axes in `browser.py` — **none rely on User-Agent** (trivially spoofable):
- **Axis 1 (0.30)**: JA4 fingerprint match in `dict_browser_ja4` (TLS-level, hardest to fake)
- **Axis 2 (0.15)**: JA4 structural analysis (TLS1.3, h2/h3, cipher/extension counts)
- **Axis 3 (0.20)**: Client Hints (`has_sec_ch_ua`) + Sec-Fetch-* consistency + `is_fake_navigation`
- **Axis 4 (0.15)**: Navigation behavior (cookies, referers, asset ratio)
- **Axis 5 (0.20)**: TLS/TCP coherence (ALPN, window scale, TLS version)

`modern_browser_score` in SQL: sec-ch-ua→100, Sec-Fetch→70, else→0. Never uses UA presence.
`ua_ch_mismatch` and `is_ua_rotating` remain as ML anomaly features (detect bots), but are NOT used for browser identification.
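The five weights listed for the axes sum to 1.0, so one plausible reading is a weighted sum of per-axis scores. The sketch below is an assumption about how such a combination could look, not the actual `browser.py` logic: the axis keys, the `[0, 1]` per-axis range, and the function names are hypothetical. Only the weights and the 100/70/0 tiers come from the text above.

```python
# Weights copied from the five axes listed above; the key names are illustrative.
AXIS_WEIGHTS = {
    "ja4_match": 0.30,
    "ja4_structure": 0.15,
    "client_hints": 0.20,
    "navigation": 0.15,
    "tls_tcp_coherence": 0.20,
}


def browser_score(axis_scores: dict[str, float]) -> float:
    """Combine per-axis scores into one weighted score in [0, 1].

    Each axis score is clamped to [0, 1]; missing axes count as 0.
    The function takes no User-Agent input at all.
    """
    total = 0.0
    for axis, weight in AXIS_WEIGHTS.items():
        value = min(1.0, max(0.0, axis_scores.get(axis, 0.0)))
        total += weight * value
    return total


def modern_browser_score(has_sec_ch_ua: bool, has_sec_fetch: bool) -> int:
    """Python mirror of the SQL tiers: sec-ch-ua -> 100, Sec-Fetch -> 70, else 0."""
    if has_sec_ch_ua:
        return 100
    if has_sec_fetch:
        return 70
    return 0
```

With this reading, a client that matches a known JA4 fingerprint and sends consistent Client Hints, but shows nothing on the other axes, would score about 0.5.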

### RPM spec files
Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.

@@ -202,8 +221,8 @@ When building either `sentinel` or `correlator` in Docker, the build context mus
### Correlator YAML does not expand env vars
Go's YAML parser reads `${VAR:-default}` as a **literal string**. Use hardcoded values or pass values directly in the YAML file. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.
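Because the placeholder is kept verbatim, a leftover `${...}` in a correlator config is easy to miss. A small pre-flight check can flag it before the service starts. This is a hypothetical helper, not part of the repository: the name `find_placeholders` and the regular expression are assumptions, and the check works on raw text, so it also reports placeholders inside YAML comments.

```python
import re

# Matches shell-style placeholders such as ${VAR} or ${VAR:-default}.
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def find_placeholders(yaml_text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every ${...} in the text."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

An empty result means the file contains only literal values, which is what the correlator expects.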

### REGEXP_TREE dictionary requires ≥1 rule
`dict_anubis_ua` uses `LAYOUT(REGEXP_TREE)`. If `anubis_ua_rules` is empty, **every INSERT into `http_logs_raw` fails** because the materialized view `mv_http_logs` calls `dictGet()` on it. The integration test init script seeds a catch-all rule.
### Anubis reference lists: IP/CIDR + ASN only
The Anubis pipeline was simplified — only `dict_anubis_ip` (IP/CIDR) and `dict_anubis_asn` exist. UA and Country dictionaries were removed. The `fetch_rules.py` script only imports IP ranges and ASN numbers from Anubis lists.
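The two remaining lookup kinds can be illustrated in plain Python. This sketch does not reproduce `fetch_rules.py` or the ClickHouse dictionaries; the function name `matches_reference` and its arguments are hypothetical, and it only shows the IP/CIDR and ASN membership tests in isolation.

```python
import ipaddress


def matches_reference(ip: str, asn: int, cidrs: list[str], asns: set[int]) -> bool:
    """Return True if the IP falls in any listed CIDR or the ASN is listed.

    A bare IP in `cidrs` is treated as a single-host network (/32 or /128).
    """
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        # Compare only within the same address family.
        if addr.version == net.version and addr in net:
            return True
    return asn in asns
```

No User-Agent or country input appears in the signature, matching the simplified pipeline.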

### TLS/pcap capture needs non-loopback traffic
`sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.
@@ -219,7 +238,7 @@ ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health chec
- **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error

### Bot-detector test patterns
Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main module (avoids heavy `joblib`/`sklearn` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main modules (avoids heavy `torch`/`sklearn`/`isotree` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).

### RPM builds must use Rocky Linux
All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.