# Copilot Instructions — ja4-platform

## What is this?

A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All backed by ClickHouse.

**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)

## Build, test, lint

All builds run in Docker — no native Go/Python/C toolchain required on the host.

```sh
# Full suite
make test-all                # run all tests (Docker)
make build-all               # build all service images
make rpm-all                 # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10

# Per-service tests
make test-sentinel           # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator         # Go tests with 60% coverage gate
make test-bot-detector       # Python pytest
make test-dashboard          # Python pytest
make test-ja4common-python   # Python pytest (shared lib)
make test-mod-reqin-log      # C cmocka tests

# Single Go test (from service dir, or via Docker):
docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/

# Single Python test (from repo root):
docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health

# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .
```

## Architecture

### Go workspace (`go.work`, Go 1.24.6)

Three modules in the workspace:
- `services/sentinel` — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator` — log correlation engine, hexagonal architecture
- `shared/go/ja4common` — shared logger, config, shutdown, ipfilter

Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`.
The workspace takes precedence for local dev; the `replace` is needed for Docker builds.

### Correlator hexagonal architecture

```
ports/source.go      → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/    → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/   → clickhouse, file, stdout, multi (fan-out wrapper)
domain/              → CorrelationService, CorrelatedLog, NormalizedEvent
app/                 → Orchestrator (wires everything together)
config/              → YAML config loader
```

### Python services

- `bot-detector` — scikit-learn IsolationForest + DBSCAN. Single monolithic module (`bot_detector.py`). Uses `os.getenv()` directly for config, NOT pydantic-settings.
- `dashboard` — FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
- `shared/python/ja4_common` — `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.

### C module

- `mod-reqin-log` — Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.

## ClickHouse dual-database pattern

Two configurable databases (env vars with defaults):

| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |

**Cross-database references exist** — materialized views in one DB read from the other:
- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`

**In Python code**, always use fully qualified table names:

```python
from ..config import settings

query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```

Never hardcode database names in queries.

**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).

**SQL migrations** live in `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.

## Key conventions

### Docker-first builds

Every service has `Dockerfile` (prod), `Dockerfile.dev` or `Dockerfile.tests` (tests), and Go/C services have `Dockerfile.package` (RPM packaging via 3-stage: builder → rpmbuild × 3 distros → alpine output).

### Go config: YAML + env vars

- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation

### Python config: pydantic-settings

- Dashboard: `backend/config.py` → `Settings(BaseSettings)` with `.env` file
- ja4_common: `ClickHouseSettings(BaseSettings)` — singleton at `settings`
- bot-detector: exception — uses raw `os.getenv()`, not pydantic-settings

### Dashboard route structure

Every route file follows this pattern:

```python
from fastapi import APIRouter, HTTPException, Query
from ..config import settings
from ..database import db

router = APIRouter()

@router.get("/api/something")
async def get_something():
    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
    result = db.query(query)
    ...
```

### RPM spec files

Located at `services//packaging/rpm/.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
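
As a rough illustration of the version injection, the sketch below assembles an `rpmbuild` argument list that sets the `build_version` macro. The helper name `rpmbuild_argv`, the `-bb` build mode, and the `example.spec` path are assumptions made for the example; the repository's packaging Dockerfiles may invoke `rpmbuild` differently.

```python
import re
import shlex

# Hypothetical helper, not part of the repository: builds an rpmbuild
# command line that passes the version through the `build_version` macro.
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def rpmbuild_argv(spec_path: str, version: str) -> list[str]:
    """Return an argv list for rpmbuild with build_version defined."""
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    # --define takes a single "NAME VALUE" argument; keeping it as one list
    # element avoids shell-quoting problems when the list goes to subprocess.
    # -bb (binary packages only) is an assumption for this sketch.
    return ["rpmbuild", "-bb", "--define", f"build_version {version}", spec_path]


if __name__ == "__main__":
    # Print the command in a copy-pasteable, shell-quoted form.
    print(shlex.join(rpmbuild_argv("example.spec", "1.2.3")))
```

Validating the version before building the command keeps a malformed value from reaching the spec file, where it would otherwise surface as a less obvious packaging error.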
### Inter-service communication

Services communicate via **Unix sockets**, not HTTP:
- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)

### Sentinel requires elevated privileges

Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).
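
To make the Unix-socket hand-off from the inter-service communication section more concrete, here is a minimal Python sketch of a producer sending one JSON event to a consumer. It is not the services' actual protocol: the newline-delimited framing, the stream socket type, and the event fields (`src_ip`, `src_port`) are assumptions, and `socket.socketpair()` stands in for the real socket paths so the example runs without touching the filesystem.

```python
import json
import socket


def send_event(sock: socket.socket, event: dict) -> None:
    """Serialize one event as a single JSON line and write it to the socket."""
    sock.sendall(json.dumps(event).encode("utf-8") + b"\n")


def read_event(sock: socket.socket) -> dict:
    """Read until a newline arrives, then decode the JSON object before it.

    Simplification: assumes exactly one event is in flight at a time.
    """
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full line arrived")
        buf += chunk
    return json.loads(buf.decode("utf-8"))


def demo() -> dict:
    # socketpair() returns two connected sockets (AF_UNIX on Linux/macOS),
    # standing in for a producer service and the correlator.
    producer, consumer = socket.socketpair()
    try:
        # Hypothetical event fields, chosen only for illustration.
        send_event(producer, {"src_ip": "192.0.2.10", "src_port": 44321})
        return read_event(consumer)
    finally:
        producer.close()
        consumer.close()


if __name__ == "__main__":
    print(demo())
```

A real consumer would keep a buffer across reads and split on newlines, since several events can arrive in one `recv` call.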