# Copilot Instructions: ja4-platform

## What is this?

A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All are backed by ClickHouse.

**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)

## Build, test, lint

All builds run in Docker; no native Go/Python/C toolchain is required on the host.

```sh
# Full suite
make test-all                # run all tests (Docker)
make build-all               # build all service images
make rpm-all                 # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10

# Per-service tests
make test-sentinel           # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator         # Go tests with 80% coverage gate
make test-bot-detector       # Python pytest
make test-dashboard          # Python pytest
make test-ja4common-python   # Python pytest (shared lib)
make test-mod-reqin-log      # C cmocka tests

# Single Go test (from service dir, or via Docker):
docker run --rm -v "$(pwd)":/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/

# Single Python test (from repo root):
docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health

# Faster correlator build (skip tests):
docker build --target builder --build-arg SKIP_TESTS=true -f services/correlator/Dockerfile .

# Linting (Go only; no Python linter configured)
# Subshells keep each cd local, so both lines work from the repo root.
(cd services/sentinel && go vet ./... && gofmt -l .)
(cd services/correlator && go vet ./... && gofmt -l .)

# Full-stack integration tests (Docker Compose, resets DB each run)
make test-integration        # 8 phases: build → start → schema → traffic → pipeline → dashboard → bot-detector → sentinel
make test-integration-keep   # same but leaves stack running after
make test-integration-down   # tear down integration stack
# run-tests.sh also accepts: --build-only (build images without running tests)
```

## Architecture

### Go workspace (`go.work`, Go 1.24.6)

Three modules in the workspace:

- `services/sentinel`: TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator`: log correlation engine, hexagonal architecture
- `shared/go/ja4common`: shared logger, config, shutdown, ipfilter

Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local dev; the `replace` is needed for Docker builds.

### Correlator hexagonal architecture

```
ports/source.go     → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/   → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/  → clickhouse, file, stdout, multi (fan-out wrapper)
domain/             → CorrelationService, CorrelatedLog, NormalizedEvent
app/                → Orchestrator (wires everything together)
config/             → YAML config loader
```

### Python services

- `bot-detector`: scikit-learn IsolationForest + DBSCAN + SHAP. Single monolithic module (`bot_detector.py`). Runs **two parallel IF models** per cycle:
  - `Complet` (45 features L3→L7) on correlated traffic (`correlated=1`, TCP+TLS+HTTP)
  - `Applicatif` (35 features L7-only) on uncorrelated traffic (`correlated=0`)
  - Optional 24h variants (`Complet_24h`/`Applicatif_24h`) when `ENABLE_MULTIWINDOW=true`

  `model_name` is part of the ORDER BY key in both `ml_detected_anomalies` and `ml_all_scores`. Uses `os.getenv()` directly for config, NOT pydantic-settings.
- `dashboard`: FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
- `shared/python/ja4_common`: `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.

### C module

- `mod-reqin-log`: Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.

## ClickHouse dual-database pattern

Two configurable databases (env vars with defaults):

| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |

**Cross-database references exist.** Materialized views in one DB read from the other:

- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`

**In Python code**, always use fully qualified table names:

```python
from ..config import settings

query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```

Never hardcode database names in queries.

**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).

**SQL schema** is deployed from three places:

- **Base schema**: `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh`, which substitutes DB names from env vars.
- **Bot-detector views**: `services/bot-detector/deploy_views.sql` contains aggregation tables, MVs, ML result tables, and dashboard views. Version-controlled separately (currently v14).
- **Post-deploy migrations**: `services/correlator/sql/migrations/` contains ALTER TABLE statements for existing deployments (TTL changes, ORDER BY fixes). Run manually: `clickhouse-client --multiquery < file.sql`.

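
The fully qualified naming rule from this section can be sketched with the standard library alone. This is a minimal illustration, not code from the repository: `qualified_table` and `DEFAULT_DATABASES` are hypothetical names, and only the two env vars and their defaults come from the table above. Services that use pydantic-settings would read the same values from `settings` instead.

```python
import os

# Env var names and defaults as documented in the dual-database table.
DEFAULT_DATABASES = {
    "CLICKHOUSE_DB_LOGS": "ja4_logs",
    "CLICKHOUSE_DB_PROCESSING": "ja4_processing",
}


def qualified_table(db_env_var: str, table: str) -> str:
    """Return "<database>.<table>", resolving the database from the environment.

    Hypothetical helper for illustration; it is not part of ja4_common.
    """
    if db_env_var not in DEFAULT_DATABASES:
        raise KeyError(f"unknown database env var: {db_env_var}")
    # Fall back to the documented default when the env var is unset.
    database = os.getenv(db_env_var, DEFAULT_DATABASES[db_env_var])
    return f"{database}.{table}"


if __name__ == "__main__":
    table = qualified_table("CLICKHOUSE_DB_PROCESSING", "ml_detected_anomalies")
    print(f"SELECT count() FROM {table}")
```

Resolving the database name in one place keeps cross-database queries correct when a deployment overrides either env var.
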
### Data retention (TTL)

| Table | TTL | Partition |
|-------|-----|-----------|
| `http_logs_raw` | 2 hours | `toStartOfHour(ingest_time)` |
| `http_logs` | 30 days | `toDate(log_date)` |
| `agg_host_ip_ja4_1h` | 7 days | `toDate(window_start)` |
| `agg_header_fingerprint_1h` | 7 days | `toDate(window_start)` |
| `ml_detected_anomalies` | 7 days | `toDate(detected_at)` |
| `ml_all_scores` | 7 days | `toDate(window_start)` |

All aggregation/ML tables use `ttl_only_drop_parts=1` for efficient partition-level expiry.

## Key conventions

### Docker-first builds

Every service has `Dockerfile` (prod), `Dockerfile.dev` or `Dockerfile.tests` (tests), and Go/C services have `Dockerfile.package` (RPM packaging via 3-stage: builder → rpmbuild × 3 distros → alpine output).

### Go config: YAML + env vars

- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation

### Python config: pydantic-settings

- Dashboard: `backend/config.py` → `Settings(BaseSettings)` with `.env` file
- ja4_common: `ClickHouseSettings(BaseSettings)`, singleton at `settings`
- bot-detector: exception; uses raw `os.getenv()`, not pydantic-settings

### Dashboard route structure

Every route file follows this pattern:

```python
from fastapi import APIRouter, HTTPException, Query
from ..config import settings
from ..database import db

router = APIRouter()

@router.get("/api/something")
async def get_something():
    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
    result = db.query(query)
    ...
```

### RPM spec files

Located at `services//packaging/rpm/.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.

### Inter-service communication

Services communicate via **Unix sockets**, not HTTP:

- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)

### Sentinel requires elevated privileges

Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).

## Comments standard

All code is commented in **French** (identifiers stay in English). Standard defined in `docs/commenting-standard.md`:

- **Go**: godoc `// FuncName does X`, package-level `// Package foo fournit...`
- **Python**: PEP-257 triple-quoted French docstrings on all functions/classes/modules
- **C**: Doxygen `/** @brief ... @param ... @return ... */` before every function, `/* ====== Section ====== */` banners
- **Bash**: standardized header block with `Usage:` and `Variables d'environnement:`
- **SQL**: `-- === filename.sql — description ===` banner + `-- --- Table ---` section headers

## Known gotchas

### go.work and Docker build contexts

When building either `sentinel` or `correlator` in Docker, the build context must include **both** service directories because `go.work` references them both. The root-level Makefiles always use `.` (repo root) as context; don't change this.

### Correlator YAML does not expand env vars

Go's YAML parser reads `${VAR:-default}` as a **literal string**, so shell-style placeholders are never substituted. Write concrete values directly in the YAML file. This is why `tests/integration/platform/correlator.yml` has a hardcoded DSN.

### REGEXP_TREE dictionary requires ≥1 rule

`dict_anubis_ua` uses `LAYOUT(REGEXP_TREE)`. If `anubis_ua_rules` is empty, **every INSERT into `http_logs_raw` fails** because the materialized view `mv_http_logs` calls `dictGet()` on it. The integration test init script seeds a catch-all rule.

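
A pre-flight check for the REGEXP_TREE gotcha above could look like the following sketch. It is not code from the repository: `ensure_rules_seeded` is a hypothetical helper, `run_scalar` stands for any callable that executes a query and returns one integer, and the database holding `anubis_ua_rules` is an assumption (shown here as `ja4_processing`), so the fully qualified table name is passed in by the caller.

```python
from typing import Callable


def ensure_rules_seeded(run_scalar: Callable[[str], int], rules_table: str) -> int:
    """Raise if the dictionary's source table is empty, else return its row count.

    `run_scalar` is a hypothetical query runner returning a single integer;
    `rules_table` must be a fully qualified table name.
    """
    count = int(run_scalar(f"SELECT count() FROM {rules_table}"))
    if count < 1:
        raise RuntimeError(
            f"{rules_table} is empty: seed at least one rule before "
            "inserting into http_logs_raw"
        )
    return count


if __name__ == "__main__":
    # Stand-in for a real ClickHouse call so the sketch runs without a server.
    # The database name below is an assumption, not taken from the schema.
    def fake_run_scalar(query: str) -> int:
        return 1

    ensure_rules_seeded(fake_run_scalar, "ja4_processing.anubis_ua_rules")
    print("rules table is seeded")
```

Running a check like this before the first insert turns a confusing INSERT failure into an explicit error about the missing rule.
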
### TLS/pcap capture needs non-loopback traffic

`sentinel` listens on a network interface (e.g., `eth0`), not loopback. Traffic sent to `localhost` or `127.0.0.1` from the same container is invisible to pcap. In integration tests, traffic must come from a separate container crossing the Docker bridge network.

### ClickHouse initialization timing

ClickHouse takes ~15-20s to initialize all 10 SQL files. Integration health checks use a 120s timeout (not the default 60s).

### RPM builds must use Rocky Linux

All `Dockerfile.package` files use `rockylinux:9` (or `rockylinux:8`/`almalinux:10`) as the build base, never Debian-based images. Reason: Rocky provides `libpcap.so.1`; Debian provides `libpcap.so.0.8`. Building sentinel on Debian and running on Rocky produces a missing library error at runtime.

### ClickHouse `FLAT()` layout requires numeric keys

If adding a new dictionary with a `String` primary key, use `COMPLEX_KEY_HASHED()` not `FLAT()`.
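
The layout rule above can be written down as a small decision helper, for example in a script that generates dictionary DDL. This is an illustrative sketch only: `dictionary_layout` and `NUMERIC_KEY_TYPES` are hypothetical names, and treating every unsigned integer type as `FLAT()`-compatible is a simplifying assumption.

```python
# Unsigned integer key types; treating all of them as FLAT-compatible is a
# simplifying assumption of this sketch.
NUMERIC_KEY_TYPES = {"UInt8", "UInt16", "UInt32", "UInt64"}


def dictionary_layout(key_type: str) -> str:
    """Pick a LAYOUT clause for a dictionary from its primary key type.

    Hypothetical helper for illustration; not part of the repository.
    """
    if key_type in NUMERIC_KEY_TYPES:
        return "LAYOUT(FLAT())"
    # String (and any other non-numeric) keys fall back to a complex-key layout.
    return "LAYOUT(COMPLEX_KEY_HASHED())"


if __name__ == "__main__":
    for key_type in ("UInt64", "String"):
        print(key_type, "->", dictionary_layout(key_type))
```
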