ja4-platform/.github/copilot-instructions.md
toto 895d7894a9 docs: update copilot-instructions.md
- bot-detector: monolith → 10 modules
- Added convention for browser detection without UA (5 axes, Client Hints)
- Added Makefile targets: init-stack, import-prod-data, purge-db, help
- Anubis: simplified to IP/CIDR + ASN (removed dict_anubis_ua / REGEXP_TREE)
- bot-detector tests: clarified heavy imports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 23:11:24 +02:00

248 lines
14 KiB
Markdown

# Copilot Instructions — ja4-platform
## What is this?
A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All backed by ClickHouse.
**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)
## Build, test, lint
All builds run in Docker — no native Go/Python/C toolchain required on the host.
```sh
# Full suite
make test-all # run all tests (Docker)
make build-all # build all service images
make rpm-all # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10
# Per-service tests
make test-sentinel # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator # Go tests with 80% coverage gate
make test-bot-detector # Python pytest
make test-dashboard # Python pytest
make test-ja4common-python # Python pytest (shared lib)
make test-mod-reqin-log # C cmocka tests
# Single Go test (from service dir, or via Docker):
docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
go test -v -run TestConfigLoad ./internal/config/
# Single Python test (bot-detector, from repo root):
docker build -f services/bot-detector/bot_detector/Dockerfile.tests -t bd-tests .
docker run --rm bd-tests pytest bot_detector/tests/test_detector.py -v -k test_ae_class_train_and_score
# Local Python tests (needs deps installed):
cd services/bot-detector && pip install -r bot_detector/requirements.txt pytest pytest-mock
pytest bot_detector/tests/test_detector.py -v -k "test_benford"
# Faster correlator build (skip tests):
docker build --target builder --build-arg SKIP_TESTS=true -f services/correlator/Dockerfile .
# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .
# Full-stack integration tests (Docker Compose, resets DB each run)
make test-integration # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep # same but leaves stack running after
make test-integration-down # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)
# Database management
make init-stack # initialize ClickHouse schema from scratch
make import-prod-data # import prod data with date shifting
make purge-db # DROP and recreate ja4_logs + ja4_processing
make help # list all 24+ targets with descriptions
```
## Architecture
### Go workspace (`go.work`, Go 1.24.6)
Three modules in the workspace:
- `services/sentinel` — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator` — log correlation engine, hexagonal architecture
- `shared/go/ja4common` — shared logger, config, shutdown, ipfilter
Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local dev; the `replace` is needed for Docker builds.
### Correlator hexagonal architecture
```
ports/source.go → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/ → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/ → clickhouse, file, stdout, multi (fan-out wrapper)
domain/ → CorrelationService, CorrelatedLog, NormalizedEvent
app/ → Orchestrator (wires everything together)
config/ → YAML config loader
```
### Python services
- `bot-detector`: **Triple-voice ensemble** ML pipeline. Modular package (10 modules):
- `config.py`, `log.py`, `infra.py`, `browser.py`, `scoring.py`, `models.py`, `preprocessing.py`, `pipeline.py`, `cycle.py`, `__main__.py`
- Runs two parallel models per cycle:
- `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
- `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
- Optional 24h variants when `ENABLE_MULTIWINDOW=true`
**ML ensemble:**
- **Extended Isolation Forest** (isotree, primary unsupervised scorer, falls back to sklearn)
- **Autoencoder** (PyTorch, architecture n→64→32→16→32→64→n, reconstruction error scorer)
- **XGBoost** (supervised, trained on SOC analyst labels from `soc_feedback`)
- Score: `final = (1-XGB_WEIGHT) * ((1-AE_WEIGHT)*eif_norm + AE_WEIGHT*ae_norm) + XGB_WEIGHT*xgb_prob`
- Default weights: `AE_WEIGHT=0.30`, `XGB_WEIGHT=0.20`
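The blending formula above can be sketched in a few lines. This is a minimal illustration, assuming the three inputs are already normalized to [0, 1]; the function name `final_score` and the way the weights are read from the environment are assumptions, not the bot-detector's actual code.

```python
import os

# Defaults come from the text above. Reading them with os.getenv mirrors the
# bot-detector config convention, but this exact handling is an assumption.
AE_WEIGHT = float(os.getenv("AE_WEIGHT", "0.30"))
XGB_WEIGHT = float(os.getenv("XGB_WEIGHT", "0.20"))


def final_score(eif_norm, ae_norm, xgb_prob, ae_weight=AE_WEIGHT, xgb_weight=XGB_WEIGHT):
    """Blend the EIF, autoencoder and XGBoost scores (hypothetical helper)."""
    # First blend the two unsupervised scorers...
    unsupervised = (1 - ae_weight) * eif_norm + ae_weight * ae_norm
    # ...then mix in the supervised probability.
    return (1 - xgb_weight) * unsupervised + xgb_weight * xgb_prob
```

With the default weights, the EIF score contributes 56% of the final value (0.80 × 0.70), the autoencoder 24%, and XGBoost 20%.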
Also: adaptive thresholds, SHAP explainability, HDBSCAN campaign clustering, Benford deviation, lag-1 autocorrelation, quantile drift detection, feature pruning, validation gate.
`model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. The bot-detector uses `os.getenv()` directly for config, NOT pydantic-settings.
- `dashboard` — FastAPI + Jinja2 server-side templates + htmx + Chart.js + Tailwind CSS (CDN). Two route modules: `backend/routes/api.py` (9 JSON endpoints) and `backend/routes/pages.py` (8 HTML pages). Uses `os.getenv()` with `safe_identifier()` for DB name validation.
- `shared/python/ja4_common`: `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile. **Note:** dashboard does NOT use ja4_common; it has its own lightweight `clickhouse-connect` client in `backend/database.py`.
### C module
- `mod-reqin-log` — Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.
## ClickHouse dual-database pattern
Two configurable databases (env vars with defaults):
| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |
**Cross-database references exist** — materialized views in one DB read from the other:
- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`
**In Python code**, always use fully qualified table names:
```python
from ..config import settings
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```
Never hardcode database names in queries.
**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
**SQL schema** has two deployment paths:
- **Base schema**: `shared/clickhouse/` (13 ordered files, `00_database.sql` through `12_thesis_features.sql`). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments. Run manually: `clickhouse-client --multiquery < file.sql`.
### Data retention (TTL)
| Table | TTL | Partition |
|-------|-----|-----------|
| `http_logs_raw` | 2 hours | `toStartOfHour(ingest_time)` |
| `http_logs` | 30 days | `toDate(log_date)` |
| `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
| `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
| `agg_ip_behavior_1h` | 7 days | `toDate(window_start)` |
| `agg_request_timing_1h` | 7 days | `toDate(window_start)` |
| `agg_path_sequences_1h` | 7 days | `toDate(window_start)` |
| `agg_resource_cascade_1h` | 7 days | `toDate(window_start)` |
| `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
| `ml_all_scores` | 7 days | `toDate(window_start)` |
All aggregation/ML tables use `ttl_only_drop_parts=1` for efficient partition-level expiry.
## Key conventions
### Docker-first builds
Every service has `Dockerfile` (prod), `Dockerfile.dev` or `Dockerfile.tests` (tests), and Go/C services have `Dockerfile.package` (RPM packaging via 3-stage: builder → rpmbuild × 3 distros → alpine output).
### Go config: YAML + env vars
- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation
### Python config
- ja4_common: `ClickHouseSettings(BaseSettings)` — pydantic-settings singleton
- bot-detector: raw `os.getenv()` with defaults, NOT pydantic-settings
- dashboard: `backend/config.py` reads `CLICKHOUSE_DB_PROCESSING`/`CLICKHOUSE_DB_LOGS` from env, validates identifiers with `safe_identifier()` regex
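For illustration, identifier validation of the kind described for the dashboard might look like the sketch below. The regex is an assumption (the real pattern in `backend/config.py` is not shown here); only the env var names and their defaults come from this document.

```python
import os
import re

# Hypothetical pattern: letters, digits and underscores, not starting with a digit.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise ValueError."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe ClickHouse identifier: {name!r}")
    return name


# Database names are read from the environment, then validated before use in f-strings.
DB_PROCESSING = safe_identifier(os.getenv("CLICKHOUSE_DB_PROCESSING", "ja4_processing"))
DB_LOGS = safe_identifier(os.getenv("CLICKHOUSE_DB_LOGS", "ja4_logs"))
```

Validation matters because database names cannot be passed as bound query parameters, so they end up interpolated into the SQL text.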
### Dashboard route structure
Two route modules: `backend/routes/api.py` (JSON API) and `backend/routes/pages.py` (HTML).
API endpoints use parameterized ClickHouse queries with sort/order whitelists:
```python
from backend.config import DB_PROCESSING, DB_LOGS, safe_identifier
from backend.database import query, query_scalar, execute
_DB = safe_identifier(DB_PROCESSING)
rows = query(
f"SELECT ... FROM {_DB}.ml_detected_anomalies "
"WHERE src_ip = toIPv6({ip:String})",
{"ip": ip_value},
)
```
**IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6` (mapped as `::ffff:x.x.x.x`). Use `toIPv6()` for ML table queries, `toIPv4OrZero()` for http_logs queries. Strip `::ffff:` prefix when displaying IPs.
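Stripping the `::ffff:` prefix for display can be done with the standard library rather than string slicing. A minimal sketch, where `display_ip` is a hypothetical helper name:

```python
import ipaddress


def display_ip(value: str) -> str:
    """Render an address for the UI, unwrapping IPv4-mapped IPv6 values."""
    addr = ipaddress.ip_address(value)
    # ipv4_mapped is None unless the address has the ::ffff:a.b.c.d form.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)
```

Native IPv6 addresses and plain IPv4 strings pass through unchanged, so the same helper can be applied to values from either table family.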
### Browser detection: NO User-Agent
Browser identification uses 5 axes in `browser.py`; **none rely on User-Agent** (trivially spoofable):
- **Axis 1 (0.30)**: JA4 fingerprint match in `dict_browser_ja4` (TLS-level, hardest to fake)
- **Axis 2 (0.15)**: JA4 structural analysis (TLS1.3, h2/h3, cipher/extension counts)
- **Axis 3 (0.20)**: Client Hints (`has_sec_ch_ua`) + Sec-Fetch-* consistency + `is_fake_navigation`
- **Axis 4 (0.15)**: Navigation behavior (cookies, referers, asset ratio)
- **Axis 5 (0.20)**: TLS/TCP coherence (ALPN, window scale, TLS version)
`modern_browser_score` in SQL: sec-ch-ua→100, Sec-Fetch→70, else→0. Never uses UA presence.
`ua_ch_mismatch` and `is_ua_rotating` remain as ML anomaly features (detect bots), but are NOT used for browser identification.
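As a rough sketch of how the five weights could combine, assuming each axis yields a score in [0, 1] and the result is a plain weighted sum (the actual aggregation in `browser.py` may differ). The dictionary keys and function names are hypothetical labels; only the weights and the `modern_browser_score` rule come from the text above.

```python
# Axis weights as listed above; the key names are hypothetical labels.
AXIS_WEIGHTS = {
    "ja4_match": 0.30,
    "ja4_structure": 0.15,
    "client_hints": 0.20,
    "navigation": 0.15,
    "tls_tcp_coherence": 0.20,
}


def browser_confidence(axis_scores):
    """Weighted sum of per-axis scores; missing axes count as 0, values are clamped to [0, 1]."""
    total = 0.0
    for name, weight in AXIS_WEIGHTS.items():
        score = min(max(axis_scores.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return total


def modern_browser_score(has_sec_ch_ua: bool, has_sec_fetch: bool) -> int:
    """Python mirror of the SQL rule: sec-ch-ua gives 100, Sec-Fetch gives 70, else 0."""
    if has_sec_ch_ua:
        return 100
    if has_sec_fetch:
        return 70
    return 0
```

Neither function takes the User-Agent string as input, which is the point of the convention.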
### RPM spec files
Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
### Inter-service communication
Services communicate via **Unix sockets**, not HTTP:
- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)
### Sentinel requires elevated privileges
Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).
## Comments standard
All code is commented in **French** (identifiers stay in English). Standard defined in `docs/commenting-standard.md`:
- **Go**: godoc `// FuncName does X`, package-level `// Package foo fournit...`
- **Python**: PEP-257 triple-quoted French docstrings on all functions/classes/modules
- **C**: Doxygen `/** @brief ... @param ... @return ... */` before every function, `/* ====== Section ====== */` banners
- **Bash**: standardized header block with `Usage:` and `Variables d'environnement:`
- **SQL**: `-- === filename.sql — description ===` banner + `-- --- Table ---` section headers
## Known gotchas
### go.work and Docker build contexts
When building either `sentinel` or `correlator` in Docker, the build context must include **both** service directories because `go.work` references them both. The root-level Makefiles always use `.` (repo root) as context — don't change this.
### Correlator YAML does not expand env vars
Go's YAML parser reads `${VAR:-default}` as a **literal string**; it performs no shell-style expansion. Write literal values directly in the YAML file. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.
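If a templated config is ever needed, one option is to render the placeholders before the correlator reads the file. The sketch below is a hypothetical pre-processing helper, not something that exists in the repo; `render_template` and the `CH_DSN` variable in the usage note are illustrative names only.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; other shell expansion forms are left untouched.
_PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def render_template(text, env=None):
    """Expand ${VAR} and ${VAR:-default} in text using env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is not None:
            # Shell `:-` semantics: an unset or empty variable falls back to the default.
            return value if value else default
        if value is None:
            raise KeyError(f"environment variable {name} is not set")
        return value

    return _PLACEHOLDER_RE.sub(_substitute, text)
```

For example, `render_template("dsn: ${CH_DSN:-clickhouse://localhost:9000/ja4_logs}", {})` yields a line with the default DSN filled in, which could then be written to the YAML file the correlator loads.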
### Anubis reference lists: IP/CIDR + ASN only
The Anubis pipeline was simplified: only `dict_anubis_ip` (IP/CIDR) and `dict_anubis_asn` exist. The UA and Country dictionaries were removed. The `fetch_rules.py` script only imports IP ranges and ASNs from Anubis lists.
### TLS/pcap capture needs non-loopback traffic
`sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.
### ClickHouse initialization timing
ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health checks use a 120s timeout (not the default 60s).
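A readiness wait with the 120s budget can be sketched as a generic polling loop. This is an illustration only: the integration suite's real health check is not shown in this document, and `wait_until` plus its parameters are hypothetical.

```python
import time


def wait_until(check, timeout=120.0, interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors (e.g. connection refused while starting) as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `check` callable would typically run a cheap query against a table created by the last schema file, so that success means the whole schema is loaded rather than just the server being up.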
### ClickHouse 24.8 SQL gotchas
- **Projections in AggregatingMergeTree** require `SETTINGS deduplicate_merge_projection_mode = 'drop'`
- **No `let` bindings** in SQL expressions — inline everything or use CTEs
- **Nested aggregates are illegal**: `argMax(any(col), sum(hits))` fails. Use a subquery to pre-aggregate, then apply the outer aggregate
- **`groupArrayIf` column type** must match the State/Merge functions: if the MV uses `groupArrayIfState()`, the column must be `AggregateFunction(groupArrayIf(...), Type, UInt8)`, not plain `groupArrayState`
- **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error
### Bot-detector test patterns
Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main modules (avoids heavy `torch`/`sklearn`/`isotree` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).
### RPM builds must use Rocky Linux
All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.
### ClickHouse `FLAT()` layout requires numeric keys
If adding a new dictionary with a `String` primary key, use `COMPLEX_KEY_HASHED()` not `FLAT()`.