docs: add .github/copilot-instructions.md for Copilot context

Covers: build/test/lint commands, architecture overview, ClickHouse
dual-DB pattern, inter-service communication, key conventions for
Go (hexagonal, YAML config), Python (pydantic-settings, FastAPI routes),
C (Apache module), Docker-first builds, and RPM packaging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-07 19:21:32 +02:00
parent 9f3e0621e5
commit d8dbd4e706

.github/copilot-instructions.md

@@ -0,0 +1,138 @@
# Copilot Instructions — ja4-platform
## What is this?
A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard, all backed by ClickHouse.
**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)
## Build, test, lint
All builds run in Docker — no native Go/Python/C toolchain required on the host.
```sh
# Full suite
make test-all # run all tests (Docker)
make build-all # build all service images
make rpm-all # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10
# Per-service tests
make test-sentinel # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator # Go tests with 60% coverage gate
make test-bot-detector # Python pytest
make test-dashboard # Python pytest
make test-ja4common-python # Python pytest (shared lib)
make test-mod-reqin-log # C cmocka tests
# Single Go test (from service dir, or via Docker):
docker run --rm -v $(pwd):/build -w /build/services/correlator golang:1.24 \
go test -v -run TestConfigLoad ./internal/config/
# Single Python test (from repo root):
docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health
# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .
```
## Architecture
### Go workspace (`go.work`, Go 1.24.6)
Three modules in the workspace:
- `services/sentinel` — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator` — log correlation engine, hexagonal architecture
- `shared/go/ja4common` — shared logger, config, shutdown, ipfilter
Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local dev; the `replace` is needed for Docker builds.
### Correlator hexagonal architecture
```
ports/source.go → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/ → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/ → clickhouse, file, stdout, multi (fan-out wrapper)
domain/ → CorrelationService, CorrelatedLog, NormalizedEvent
app/ → Orchestrator (wires everything together)
config/ → YAML config loader
```
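The `multi` fan-out wrapper listed above can be illustrated with a short sketch. The correlator itself is written in Go and its real interface signatures are not shown here, so this is only a Python illustration of the pattern: the `write` method and the `ListSink` and `MultiSink` classes are hypothetical names, and only `CorrelatedLogSink` comes from the layout above.

```python
from typing import Protocol


class CorrelatedLogSink(Protocol):
    """Outbound port: anything that can accept a correlated log (assumed shape)."""

    def write(self, log: dict) -> None: ...


class ListSink:
    """Hypothetical in-memory adapter, standing in for clickhouse/file/stdout."""

    def __init__(self) -> None:
        self.items: list[dict] = []

    def write(self, log: dict) -> None:
        self.items.append(log)


class MultiSink:
    """Fan-out wrapper: forwards every log to each wrapped sink, in order."""

    def __init__(self, *sinks: CorrelatedLogSink) -> None:
        self._sinks = sinks

    def write(self, log: dict) -> None:
        for sink in self._sinks:
            sink.write(log)
```

Because `MultiSink` satisfies the same port as the sinks it wraps, the domain code only ever sees a single sink and does not need to know how many outputs are configured.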
### Python services
- `bot-detector`: scikit-learn IsolationForest + DBSCAN in a single monolithic module (`bot_detector.py`). Reads config directly with `os.getenv()`, NOT pydantic-settings.
- `dashboard`: FastAPI + React SPA, with 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
- `shared/python/ja4_common`: `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.
### C module
- `mod-reqin-log` — Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.
## ClickHouse dual-database pattern
Two configurable databases (env vars with defaults):
| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |
**Cross-database references exist** — materialized views in one DB read from the other:
- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`
**In Python code**, always use fully qualified table names:
```python
from ..config import settings
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```
Never hardcode database names in queries.
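One way to keep that rule easy to follow is a small helper that resolves the database from the environment and returns a qualified name. This is a minimal sketch, not code from the repository: `qualified_table` is a hypothetical function, and the identifier check is an added precaution, since the name is interpolated into SQL text. The env var names and defaults are the ones from the table above.

```python
import os
import re
from typing import Mapping, Optional

# Defaults copied from the env var table above.
_DEFAULTS = {
    "CLICKHOUSE_DB_LOGS": "ja4_logs",
    "CLICKHOUSE_DB_PROCESSING": "ja4_processing",
}

# Plain identifiers only: letters, digits, underscore, not starting with a digit.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_table(
    env_var: str, table: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return `<database>.<table>`, reading the database name from env_var."""
    env = os.environ if env is None else env
    database = env.get(env_var, _DEFAULTS[env_var])
    for name in (database, table):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"{database}.{table}"
```

A query would then be built as `f"SELECT ... FROM {qualified_table('CLICKHOUSE_DB_LOGS', 'http_logs')} ..."`. In the dashboard, the same values are already available through `settings`, so a helper like this mainly suits code that reads the environment directly.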
**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
**SQL migrations** live in `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh`, which substitutes DB names from env vars.
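The substitution step can be pictured roughly as follows. The placeholder syntax used by `deploy_schema.sh` is not shown in this document, so the `${CLICKHOUSE_DB_LOGS}` form and the `render_migration` function below are assumptions made only for illustration. Check the script itself for the real mechanism.

```python
from string import Template
from typing import Mapping


def render_migration(sql: str, env: Mapping[str, str]) -> str:
    """Replace ${CLICKHOUSE_DB_*} placeholders (assumed syntax) with DB names."""
    values = {
        "CLICKHOUSE_DB_LOGS": env.get("CLICKHOUSE_DB_LOGS", "ja4_logs"),
        "CLICKHOUSE_DB_PROCESSING": env.get(
            "CLICKHOUSE_DB_PROCESSING", "ja4_processing"
        ),
    }
    # substitute() raises KeyError on an unknown placeholder, so a typo in a
    # migration fails loudly instead of being deployed unreplaced.
    return Template(sql).substitute(values)
```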
## Key conventions
### Docker-first builds
Every service has a `Dockerfile` (prod) and a `Dockerfile.dev` or `Dockerfile.tests` (tests). Go/C services also have a `Dockerfile.package` for RPM packaging, built in 3 stages: builder → rpmbuild × 3 distros → alpine output.
### Go config: YAML + env vars
- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation
### Python config: pydantic-settings
- Dashboard: `backend/config.py` → `Settings(BaseSettings)` with `.env` file
- ja4_common: `ClickHouseSettings(BaseSettings)` — singleton at `settings`
- bot-detector: exception — uses raw `os.getenv()`, not pydantic-settings
### Dashboard route structure
Every route file follows this pattern:
```python
from fastapi import APIRouter, HTTPException, Query

from ..config import settings
from ..database import db

router = APIRouter()


@router.get("/api/something")
async def get_something():
    # Database names always come from settings, never from literals.
    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
    result = db.query(query)
    ...
```
### RPM spec files
Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
### Inter-service communication
Services communicate via **Unix sockets**, not HTTP:
- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)
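For local experiments, the socket hop can be imitated with a few lines of Python. This sketch makes several assumptions that the document does not confirm: it uses datagram sockets (`SOCK_DGRAM`), one JSON object per message, and a temporary socket path instead of `/var/run/logcorrelator/`. The function names are hypothetical, and the real services may frame messages differently.

```python
import json
import os
import socket
import tempfile


def send_event(sock_path: str, event: dict) -> None:
    """Send one JSON-encoded event as a single datagram (assumed framing)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(json.dumps(event).encode("utf-8"), sock_path)


def receive_event(server: socket.socket) -> dict:
    """Read one datagram from a bound socket and decode it as JSON."""
    data = server.recv(65536)
    return json.loads(data.decode("utf-8"))


def roundtrip(event: dict) -> dict:
    """Bind a throwaway socket, send an event to it, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "http.socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as server:
            server.bind(path)
            send_event(path, event)
            return receive_event(server)
```

Calling `roundtrip({"method": "GET"})` returns the same dictionary, which is enough to check serialization without a running correlator.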
### Sentinel requires elevated privileges
Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).