From 895d7894a9682be1600e9b1aceb3c587e530b456 Mon Sep 17 00:00:00 2001
From: toto
Date: Thu, 9 Apr 2026 23:11:24 +0200
Subject: [PATCH] docs: update copilot-instructions.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- bot-detector: monolith → 10 modules
- Add browser detection convention without UA (5 axes, Client Hints)
- Add Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarify heavy imports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 0f8bbdd..a9af8a0 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -48,6 +48,12 @@ make test-integration # 8 phases: build → start → schema → traffi
 make test-integration-keep # same but leaves stack running after
 make test-integration-down # tear down integration stack
 # run-tests.sh also accepts: --build-only (build images without running tests)
+
+# Database management
+make init-stack # initialize ClickHouse schema from scratch
+make import-prod-data # import prod data with date shifting
+make purge-db # DROP and recreate ja4_logs + ja4_processing
+make help # list all 24+ targets with descriptions
 ```
 
 ## Architecture
@@ -74,7 +80,9 @@ config/ → YAML config loader
 
 ### Python services
 
-- `bot-detector` — **Triple-voice ensemble** ML pipeline. Single monolithic module (`bot_detector.py`, ~1550 lines). Runs two parallel models per cycle:
+- `bot-detector` — **Triple-voice ensemble** ML pipeline. Modular package (10 modules):
+  - `config.py` → `log.py` → `infra.py` → `browser.py` → `scoring.py` → `models.py` → `preprocessing.py` → `pipeline.py` → `cycle.py` → `__main__.py`
+  - Runs two parallel models per cycle:
   - `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
   - `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
   - Optional 24h variants when `ENABLE_MULTIWINDOW=true`
@@ -173,6 +181,17 @@ rows = query(
 ```
 **IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6` (mapped as `::ffff:x.x.x.x`). Use `toIPv6()` for ML table queries, `toIPv4OrZero()` for http_logs queries. Strip `::ffff:` prefix when displaying IPs.
 
+### Browser detection: NO User-Agent
+Browser identification uses 5 axes in `browser.py` — **none rely on User-Agent** (trivially spoofable):
+- **Axis 1 (0.30)**: JA4 fingerprint match in `dict_browser_ja4` (TLS-level, hardest to fake)
+- **Axis 2 (0.15)**: JA4 structural analysis (TLS1.3, h2/h3, cipher/extension counts)
+- **Axis 3 (0.20)**: Client Hints (`has_sec_ch_ua`) + Sec-Fetch-* consistency + `is_fake_navigation`
+- **Axis 4 (0.15)**: Navigation behavior (cookies, referers, asset ratio)
+- **Axis 5 (0.20)**: TLS/TCP coherence (ALPN, window scale, TLS version)
+
+`modern_browser_score` in SQL: sec-ch-ua→100, Sec-Fetch→70, else→0. Never uses UA presence.
+`ua_ch_mismatch` and `is_ua_rotating` remain as ML anomaly features (detect bots), but are NOT used for browser identification.
+
 ### RPM spec files
 Located at `services//packaging/rpm/.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
 
@@ -202,8 +221,8 @@ When building either `sentinel` or `correlator` in Docker, the build context mus
 ### Correlator YAML does not expand env vars
 Go's YAML parser reads `${VAR:-default}` as a **literal string**. Use hardcoded values or pass values directly in the YAML file. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.
 
-### REGEXP_TREE dictionary requires ≥1 rule
-`dict_anubis_ua` uses `LAYOUT(REGEXP_TREE)`. If `anubis_ua_rules` is empty, **every INSERT into `http_logs_raw` fails** because the materialized view `mv_http_logs` calls `dictGet()` on it. The integration test init script seeds a catch-all rule.
+### Anubis reference lists: IP/CIDR + ASN only
+The Anubis pipeline was simplified — only `dict_anubis_ip` (IP/CIDR) and `dict_anubis_asn` exist. UA and Country dictionaries were removed. The `fetch_rules.py` script only imports IP ranges and ASN numbers from Anubis lists.
 
 ### TLS/pcap capture needs non-loopback traffic
 `sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.
@@ -219,7 +238,7 @@ ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health chec
 - **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error
 
 ### Bot-detector test patterns
-Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main module (avoids heavy `joblib`/`sklearn` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
+Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main modules (avoids heavy `torch`/`sklearn`/`isotree` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
 
 ### RPM builds must use Rocky Linux
 All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.