# Copilot Instructions — ja4-platform

## What is this?

A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All backed by ClickHouse.

**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)

## Build, test, lint

All builds run in Docker — no native Go/Python/C toolchain required on the host.

```sh
# Full suite
make test-all                # run all tests (Docker)
make build-all               # build all service images
make rpm-all                 # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10

# Per-service tests
make test-sentinel           # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator         # Go tests with 80% coverage gate
make test-bot-detector       # Python pytest
make test-dashboard          # Python pytest
make test-ja4common-python   # Python pytest (shared lib)
make test-mod-reqin-log      # C cmocka tests

# Single Go test (from service dir, or via Docker):
docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/

# Single Python test (bot-detector, from repo root):
docker build -f services/bot-detector/bot_detector/Dockerfile.tests -t bd-tests .
docker run --rm bd-tests pytest bot_detector/tests/test_detector.py -v -k test_ae_class_train_and_score

# Local Python tests (needs deps installed):
cd services/bot-detector && pip install -r bot_detector/requirements.txt pytest pytest-mock
pytest bot_detector/tests/test_detector.py -v -k "test_benford"

# Faster correlator build (skip tests):
docker build --target builder --build-arg SKIP_TESTS=true -f services/correlator/Dockerfile .

# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .

# Full-stack integration tests (Docker Compose, resets DB each run)
make test-integration        # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep   # same but leaves stack running after
make test-integration-down   # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)

# Database management
make init-stack              # initialize ClickHouse schema from scratch
make import-prod-data        # import prod data with date shifting
make purge-db                # DROP and recreate ja4_logs + ja4_processing

make help                    # list all 24+ targets with descriptions
```

## Architecture

### Go workspace (`go.work`, Go 1.24.6)

Three modules in the workspace:

- `services/sentinel` — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator` — log correlation engine, hexagonal architecture
- `shared/go/ja4common` — shared logger, config, shutdown, ipfilter

Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local dev; the `replace` is needed for Docker builds.

### Correlator hexagonal architecture

```
ports/source.go     → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/   → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/  → clickhouse, file, stdout, multi (fan-out wrapper)
domain/             → CorrelationService, CorrelatedLog, NormalizedEvent
app/                → Orchestrator (wires everything together)
config/             → YAML config loader
```

### Python services

- `bot-detector` — **Triple-voice ensemble** ML pipeline. Modular package (10 modules):
  - `config.py` → `log.py` → `infra.py` → `browser.py` → `scoring.py` → `models.py` → `preprocessing.py` → `pipeline.py` → `cycle.py` → `__main__.py`
  - Runs two parallel models per cycle:
    - `Complet` (~45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
    - `Applicatif` (~35 features L7-only) on uncorrelated traffic (`correlated=0`)
    - Optional 24h variants when `ENABLE_MULTIWINDOW=true`

  **ML ensemble:**
  - **Extended Isolation Forest** (isotree, primary unsupervised scorer, falls back to sklearn)
  - **Autoencoder** (PyTorch, architecture n→64→32→16→32→64→n, reconstruction error scorer)
  - **XGBoost** (supervised, trained on SOC analyst labels from `soc_feedback`)
  - Score: `final = (1-XGB_WEIGHT) * ((1-AE_WEIGHT)*eif_norm + AE_WEIGHT*ae_norm) + XGB_WEIGHT*xgb_prob`
  - Default weights: `AE_WEIGHT=0.30`, `XGB_WEIGHT=0.20`

  Also: adaptive thresholds, SHAP explainability, HDBSCAN campaign clustering, Benford deviation, lag-1 autocorrelation, quantile drift detection, feature pruning, validation gate. `model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. Uses `os.getenv()` directly for config, NOT pydantic-settings.

- `dashboard` — FastAPI + Jinja2 server-side templates + htmx + Chart.js + Tailwind CSS (CDN). Two route modules: `backend/routes/api.py` (9 JSON endpoints) and `backend/routes/pages.py` (8 HTML pages). Uses `os.getenv()` with `safe_identifier()` for DB name validation.

- `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile. **Note:** dashboard does NOT use ja4_common — it has its own lightweight `clickhouse-connect` client in `backend/database.py`.

### C module

- `mod-reqin-log` — Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.
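As an illustration of the ensemble score formula given in the `bot-detector` description above, here is a minimal sketch. The function name `ensemble_score` is hypothetical, and the sketch assumes the three inputs are already normalized to the range 0 to 1; the real pipeline's normalization is not shown here.

```python
def ensemble_score(
    eif_norm: float,
    ae_norm: float,
    xgb_prob: float,
    ae_weight: float = 0.30,   # documented default for AE_WEIGHT
    xgb_weight: float = 0.20,  # documented default for XGB_WEIGHT
) -> float:
    """Blend the three scorer outputs using the documented formula.

    Hypothetical helper: assumes all three inputs are already in [0, 1].
    """
    # Step 1: mix the two unsupervised scorers (EIF and autoencoder).
    unsupervised = (1.0 - ae_weight) * eif_norm + ae_weight * ae_norm
    # Step 2: mix the unsupervised blend with the supervised XGBoost probability.
    return (1.0 - xgb_weight) * unsupervised + xgb_weight * xgb_prob


if __name__ == "__main__":
    # 0.8 * (0.7 * 0.8 + 0.3 * 0.6) + 0.2 * 0.9 = 0.772
    print(round(ensemble_score(0.8, 0.6, 0.9), 4))
```

With the default weights, the Extended Isolation Forest contributes 56% of the final score, the autoencoder 24%, and XGBoost 20%.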
## ClickHouse dual-database pattern

Two configurable databases (env vars with defaults):

| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |

**Cross-database references exist** — materialized views in one DB read from the other:

- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`

**In Python code**, always use fully qualified table names:

```python
from ..config import settings

query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```

Never hardcode database names in queries.

**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).

**SQL schema** has two deployment paths:

- **Base schema**: `shared/clickhouse/` (13 ordered files, `00_database.sql` through `12_thesis_features.sql`). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
- **Post-deploy migrations**: `services/correlator/sql/migrations/` — ALTER TABLE statements for existing deployments. Run manually: `clickhouse-client --multiquery < file.sql`.
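To make the rule about fully qualified table names concrete, here is a minimal sketch of resolving a database name from the environment and validating it before it is interpolated into SQL. The helper `qualified_table` and the identifier pattern are assumptions for illustration; they are not the project's `settings` object or its `safe_identifier()` implementation.

```python
import os
import re

# Assumed identifier rule: letters, digits and underscores, not starting with a digit.
_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Env var names and defaults as listed in the table above.
_DEFAULTS = {
    "CLICKHOUSE_DB_LOGS": "ja4_logs",
    "CLICKHOUSE_DB_PROCESSING": "ja4_processing",
}


def qualified_table(env_var: str, table: str) -> str:
    """Return "<database>.<table>", reading the database name from the environment.

    Hypothetical helper: rejects anything that is not a plain identifier,
    because identifiers cannot be passed as bound query parameters.
    """
    database = os.getenv(env_var, _DEFAULTS[env_var])
    for name in (database, table):
        if not _IDENT_RE.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"{database}.{table}"
```

When `CLICKHOUSE_DB_PROCESSING` is unset, `qualified_table("CLICKHOUSE_DB_PROCESSING", "ml_detected_anomalies")` returns `ja4_processing.ml_detected_anomalies`.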
### Data retention (TTL)

| Table | TTL | Partition |
|-------|-----|-----------|
| `http_logs_raw` | 2 hours | `toStartOfHour(ingest_time)` |
| `http_logs` | 30 days | `toDate(log_date)` |
| `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
| `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
| `agg_ip_behavior_1h` | 7 days | `toDate(window_start)` |
| `agg_request_timing_1h` | 7 days | `toDate(window_start)` |
| `agg_path_sequences_1h` | 7 days | `toDate(window_start)` |
| `agg_resource_cascade_1h` | 7 days | `toDate(window_start)` |
| `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
| `ml_all_scores` | 7 days | `toDate(window_start)` |

All aggregation/ML tables use `ttl_only_drop_parts=1` for efficient partition-level expiry.

## Key conventions

### Docker-first builds

Every service has `Dockerfile` (prod), `Dockerfile.dev` or `Dockerfile.tests` (tests), and Go/C services have `Dockerfile.package` (RPM packaging via 3-stage: builder → rpmbuild × 3 distros → alpine output).

### Go config: YAML + env vars

- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation

### Python config

- ja4_common: `ClickHouseSettings(BaseSettings)` — pydantic-settings singleton
- bot-detector: raw `os.getenv()` with defaults, NOT pydantic-settings
- dashboard: `backend/config.py` reads `CLICKHOUSE_DB_PROCESSING`/`CLICKHOUSE_DB_LOGS` from env, validates identifiers with `safe_identifier()` regex

### Dashboard route structure

Two route modules: `backend/routes/api.py` (JSON API) and `backend/routes/pages.py` (HTML). API endpoints use parameterized ClickHouse queries with sort/order whitelists:

```python
from backend.config import DB_PROCESSING, DB_LOGS, safe_identifier
from backend.database import query, query_scalar, execute

# Validate the database name once; identifiers cannot be bound as parameters.
_DB = safe_identifier(DB_PROCESSING)

rows = query(
    f"SELECT ... FROM {_DB}.ml_detected_anomalies "
    # Plain (non-f) string: {ip:String} is a ClickHouse bound parameter.
    "WHERE src_ip = toIPv6({ip:String})",
    {"ip": ip_value},
)
```

**IPv4/IPv6 duality**: `http_logs.src_ip` is `IPv4`, but ML tables (`ml_all_scores`, `ml_detected_anomalies`) store `IPv6` (mapped as `::ffff:x.x.x.x`). Use `toIPv6()` for ML table queries, `toIPv4OrZero()` for http_logs queries. Strip the `::ffff:` prefix when displaying IPs.

### Browser detection: NO User-Agent

Browser identification uses 5 axes in `browser.py` — **none rely on User-Agent** (trivially spoofable):

- **Axis 1 (0.30)**: JA4 fingerprint match in `dict_browser_ja4` (TLS-level, hardest to fake)
- **Axis 2 (0.15)**: JA4 structural analysis (TLS1.3, h2/h3, cipher/extension counts)
- **Axis 3 (0.20)**: Client Hints (`has_sec_ch_ua`) + Sec-Fetch-* consistency + `is_fake_navigation`
- **Axis 4 (0.15)**: Navigation behavior (cookies, referers, asset ratio)
- **Axis 5 (0.20)**: TLS/TCP coherence (ALPN, window scale, TLS version)

`modern_browser_score` in SQL: sec-ch-ua→100, Sec-Fetch→70, else→0. Never uses UA presence.

`ua_ch_mismatch` and `is_ua_rotating` remain as ML anomaly features (detect bots), but are NOT used for browser identification.

### RPM spec files

Located at `services//packaging/rpm/.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.

### Inter-service communication

Services communicate via **Unix sockets**, not HTTP:

- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)

### Sentinel requires elevated privileges

Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).

## Comments standard

All code is commented in **French** (identifiers stay in English). Standard defined in `docs/commenting-standard.md`:

- **Go**: godoc `// FuncName does X`, package-level `// Package foo fournit...`
- **Python**: PEP-257 triple-quoted French docstrings on all functions/classes/modules
- **C**: Doxygen `/** @brief ... @param ... @return ... */` before every function, `/* ====== Section ====== */` banners
- **Bash**: standardized header block with `Usage:` and `Variables d'environnement:`
- **SQL**: `-- === filename.sql — description ===` banner + `-- --- Table ---` section headers

## Known gotchas

### go.work and Docker build contexts

When building either `sentinel` or `correlator` in Docker, the build context must include **both** service directories because `go.work` references them both. The root-level Makefiles always use `.` (repo root) as context — don't change this.

### Correlator YAML does not expand env vars

Go's YAML parser reads `${VAR:-default}` as a **literal string**. Write literal values directly in the YAML file instead. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.

### Anubis reference lists: IP/CIDR + ASN only

The Anubis pipeline was simplified — only `dict_anubis_ip` (IP/CIDR) and `dict_anubis_asn` exist. UA and Country dictionaries were removed. The `fetch_rules.py` script only imports IP ranges and ASN numbers from Anubis lists.

### TLS/pcap capture needs non-loopback traffic

`sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.

### ClickHouse initialization timing

ClickHouse takes ~15-20s to initialize all 13 SQL files. Integration health checks use a 120s timeout (not the default 60s).
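The 120s health-check budget can be pictured as a simple polling loop. The sketch below is illustrative only: `wait_until_ready` and `clickhouse_ping` are hypothetical helpers, and the URL assumes the ClickHouse HTTP interface is reachable on port 8123 from wherever the check runs. A successful ping only shows the server is up; a real check would likely also confirm that the last schema file has been applied.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,  # budget quoted above for integration health checks
    interval_s: float = 2.0,   # assumed polling interval
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: the server is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def clickhouse_ping(url: str = "http://localhost:8123/ping") -> bool:
    """Probe the ClickHouse HTTP interface (host and port are assumptions)."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.status == 200


if __name__ == "__main__":
    print("ready" if wait_until_ready(clickhouse_ping) else "timed out")
```

Passing the probe as a callable keeps the loop testable without a running server and lets a stricter schema check replace the ping later.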
### ClickHouse 24.8 SQL gotchas

- **Projections in AggregatingMergeTree** require `SETTINGS deduplicate_merge_projection_mode = 'drop'`
- **No `let` bindings** in SQL expressions — inline everything or use CTEs
- **Nested aggregates are illegal**: `argMax(any(col), sum(hits))` fails. Use a subquery to pre-aggregate, then apply the outer aggregate
- **`groupArrayIf` column type** must match the State/Merge functions: if the MV uses `groupArrayIfState()`, the column must be `AggregateFunction(groupArrayIf(...), Type, UInt8)`, not plain `groupArrayState`
- **`SimpleAggregateFunction` columns** are auto-wrapped in their aggregate function during queries on AggregatingMergeTree. So `argMax(ja4, hits)` where `hits` is `SimpleAggregateFunction(sum, UInt64)` becomes `argMax(ja4, sum(hits))` — a nested aggregate error

### Bot-detector test patterns

Tests in `bot_detector/tests/test_detector.py` are **self-contained** — they re-implement key logic rather than importing from the main modules (avoids heavy `torch`/`sklearn`/`isotree` import chains). Autoencoder tests use a local `_make_ae()` helper. XGBoost tests create in-memory models. Tests skip gracefully if `torch` or `xgboost` are not installed (`pytest.skip`).

### RPM builds must use Rocky Linux

All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base — never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.

### ClickHouse `FLAT()` layout requires numeric keys

If adding a new dictionary with a `String` primary key, use `COMPLEX_KEY_HASHED()` not `FLAT()`.