ja4-platform/.github/copilot-instructions.md
toto d8dbd4e706 docs: add .github/copilot-instructions.md for Copilot context
Covers: build/test/lint commands, architecture overview, ClickHouse
dual-DB pattern, inter-service communication, key conventions for
Go (hexagonal, YAML config), Python (pydantic-settings, FastAPI routes),
C (Apache module), Docker-first builds, and RPM packaging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 19:21:32 +02:00


# Copilot Instructions — ja4-platform
## What is this?
A monorepo for a JA4/JA3 TLS fingerprinting security pipeline. Five services capture network traffic, correlate logs, detect bots via ML, and present results in a SOC dashboard. All backed by ClickHouse.
**Data flow:** `mod-reqin-log` (Apache HTTP logs) → unix socket → `correlator` ← unix socket ← `sentinel` (TLS/TCP capture) → ClickHouse → `bot-detector` (ML scoring) → `dashboard` (FastAPI SOC UI)
## Build, test, lint
All builds run in Docker — no native Go/Python/C toolchain required on the host.
```sh
# Full suite
make test-all # run all tests (Docker)
make build-all # build all service images
make rpm-all # build RPMs (sentinel, correlator, mod-reqin-log) for el8/el9/el10
# Per-service tests
make test-sentinel # Go tests (needs --cap-add=NET_RAW inside)
make test-correlator # Go tests with 60% coverage gate
make test-bot-detector # Python pytest
make test-dashboard # Python pytest
make test-ja4common-python # Python pytest (shared lib)
make test-mod-reqin-log # C cmocka tests
# Single Go test (from service dir, or via Docker):
docker run --rm -v "$(pwd)":/build -w /build/services/correlator golang:1.24 \
  go test -v -run TestConfigLoad ./internal/config/
# Single Python test (from repo root):
docker build -f services/dashboard/Dockerfile.tests -t dash-tests .
docker run --rm dash-tests pytest backend/tests/test_metrics.py -v -k test_health
# Linting (Go only — no Python linter configured)
cd services/sentinel && go vet ./... && gofmt -l .
cd services/correlator && go vet ./... && gofmt -l .
```
## Architecture
### Go workspace (`go.work`, Go 1.24.6)
Three modules in the workspace:
- `services/sentinel` — TLS/TCP packet capture daemon (gopacket/pcap, systemd)
- `services/correlator` — log correlation engine, hexagonal architecture
- `shared/go/ja4common` — shared logger, config, shutdown, ipfilter
Both services have a `replace` directive in their `go.mod` pointing to `../../shared/go/ja4common`. The workspace takes precedence for local dev; the `replace` is needed for Docker builds.
### Correlator hexagonal architecture
```
ports/source.go → EventSource, CorrelatedLogSink, CorrelationProcessor interfaces
adapters/inbound/ → unixsocket (reads from sentinel + mod-reqin-log)
adapters/outbound/ → clickhouse, file, stdout, multi (fan-out wrapper)
domain/ → CorrelationService, CorrelatedLog, NormalizedEvent
app/ → Orchestrator (wires everything together)
config/ → YAML config loader
```
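The ports-and-adapters split above, and the `multi` fan-out wrapper in particular, can be illustrated with a minimal sketch. The real correlator is written in Go and its interface methods are not shown here, so the `write` method, `MemorySink`, and the error-handling policy are assumptions made for illustration only.

```python
import json
from typing import Protocol


class CorrelatedLogSink(Protocol):
    """Outbound port: anything that can accept a correlated log."""

    def write(self, log: dict) -> None: ...


class StdoutSink:
    """Adapter that prints each log as one JSON line."""

    def write(self, log: dict) -> None:
        print(json.dumps(log, sort_keys=True))


class MemorySink:
    """Hypothetical adapter that keeps logs in a list (handy in tests)."""

    def __init__(self) -> None:
        self.logs: list[dict] = []

    def write(self, log: dict) -> None:
        self.logs.append(log)


class MultiSink:
    """Fan-out wrapper: forwards every log to all wrapped sinks."""

    def __init__(self, *sinks: CorrelatedLogSink) -> None:
        self._sinks = sinks

    def write(self, log: dict) -> None:
        first_error = None
        for sink in self._sinks:
            try:
                sink.write(log)
            except Exception as exc:
                # Assumed policy: keep going so one failing sink
                # does not prevent delivery to the others.
                if first_error is None:
                    first_error = exc
        if first_error is not None:
            raise first_error
```

Because `MultiSink` itself satisfies the sink port, the orchestrator can treat one output and many outputs the same way, for example `MultiSink(StdoutSink(), MemorySink()).write(log)`.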
### Python services
- `bot-detector` — scikit-learn IsolationForest + DBSCAN. Single monolithic module (`bot_detector.py`). Uses `os.getenv()` directly for config, NOT pydantic-settings.
- `dashboard` — FastAPI + React SPA. 20 route modules in `backend/routes/`. Uses pydantic-settings (`backend/config.py`).
- `shared/python/ja4_common`: `ClickHouseClient` singleton + `ClickHouseSettings` (pydantic-settings). Installed as a local package in each Python Dockerfile.
### C module
- `mod-reqin-log` — Apache HTTPD module (C11, built with `apxs`). Logs HTTP requests as JSON to a Unix socket. Tests use cmocka.
## ClickHouse dual-database pattern
Two configurable databases (env vars with defaults):
| Env var | Default | Contains |
|---------|---------|----------|
| `CLICKHOUSE_DB_LOGS` | `ja4_logs` | `http_logs_raw`, `http_logs`, `mv_http_logs` |
| `CLICKHOUSE_DB_PROCESSING` | `ja4_processing` | Aggregations, ML tables, views, dicts, audit |
**Cross-database references exist** — materialized views in one DB read from the other:
- `ja4_logs.mv_http_logs` references `ja4_processing.dict_anubis_*` and `ja4_processing.dict_iplocate_asn`
- `ja4_processing.mv_agg_*` reads `FROM ja4_logs.http_logs`
**In Python code**, always use fully qualified table names:
```python
from ..config import settings
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.ml_detected_anomalies ..."
query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_LOGS}.http_logs ..."
```
Never hardcode database names in queries.
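One way to make the rule hard to break is to route every table reference through a small helper. The sketch below is a stdlib-only illustration, not code from the repo: `qualified`, `logs_table`, and `processing_table` are hypothetical names, and it reads the env vars directly rather than going through `settings`. The defaults are the ones from the table above.

```python
import os
import re

# Plain identifiers only; anything else is rejected before it reaches a query.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified(db_env: str, default_db: str, table: str) -> str:
    """Return `<database>.<table>`, resolving the database from an env var."""
    db = os.getenv(db_env, default_db)
    for name in (db, table):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"{db}.{table}"


def logs_table(table: str) -> str:
    """Hypothetical helper for tables in the logs database."""
    return qualified("CLICKHOUSE_DB_LOGS", "ja4_logs", table)


def processing_table(table: str) -> str:
    """Hypothetical helper for tables in the processing database."""
    return qualified("CLICKHOUSE_DB_PROCESSING", "ja4_processing", table)
```

A query then becomes `f"SELECT ... FROM {processing_table('ml_detected_anomalies')} ..."`, and a deployment that renames either database only has to change the environment.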
**In Go (correlator)**, the database is part of the ClickHouse DSN (`clickhouse://user:pass@host:9000/ja4_logs`). The target table is configurable via YAML (`outputs.clickhouse.table`).
**SQL migrations** live in `shared/clickhouse/` (10 ordered files). Deploy with `shared/clickhouse/deploy_schema.sh` which substitutes DB names from env vars.
## Key conventions
### Docker-first builds
Every service has a `Dockerfile` (prod) and a `Dockerfile.dev` or `Dockerfile.tests` (tests). Go and C services also have a `Dockerfile.package` for RPM packaging, built in three stages: builder → rpmbuild × 3 distros → alpine output.
### Go config: YAML + env vars
- Sentinel: `config.yml`, env prefix `JA4SENTINEL_`
- Correlator: `config.yml`, env prefix `LOGCORRELATOR_`
- Both support `SIGHUP` for log rotation
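The general idea of layering prefixed env vars over file-based config can be sketched as follows. The services themselves are Go, and their actual key-mapping rules are not described here, so the lowercase flat-key mapping and the `apply_env_overrides` name are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional


def apply_env_overrides(
    config: dict,
    prefix: str,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of `config` with prefixed env vars layered on top.

    Assumed mapping: PREFIX_LOG_LEVEL overrides the flat key "log_level".
    """
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for name, value in environ.items():
        if name.startswith(prefix):
            key = name[len(prefix):].lower()
            if key:
                merged[key] = value
    return merged
```

With this sketch, `apply_env_overrides({"log_level": "info"}, "LOGCORRELATOR_", {"LOGCORRELATOR_LOG_LEVEL": "debug"})` yields a config whose `log_level` is `"debug"`, while unprefixed variables are ignored.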
### Python config: pydantic-settings
- Dashboard: `backend/config.py` → `Settings(BaseSettings)` with `.env` file
- ja4_common: `ClickHouseSettings(BaseSettings)` — singleton at `settings`
- bot-detector: exception — uses raw `os.getenv()`, not pydantic-settings
### Dashboard route structure
Every route file follows this pattern:
```python
from fastapi import APIRouter, HTTPException, Query

from ..config import settings
from ..database import db

router = APIRouter()


@router.get("/api/something")
async def get_something():
    query = f"SELECT ... FROM {settings.CLICKHOUSE_DB_PROCESSING}.table_name ..."
    result = db.query(query)
    ...
```
### RPM spec files
Located at `services/<name>/packaging/rpm/<name>.spec`. Version injected via `--define "build_version X.Y.Z"` at build time.
### Inter-service communication
Services communicate via **Unix sockets**, not HTTP:
- `sentinel` → `/var/run/logcorrelator/network.socket` → `correlator` (source B: TLS/TCP data)
- `mod-reqin-log` → `/var/run/logcorrelator/http.socket` → `correlator` (source A: HTTP data)
- `correlator` → ClickHouse (batch inserts into `ja4_logs.http_logs_raw`)
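For local experiments, the producer-to-correlator hop can be imitated with a few lines of stdlib Python. This is a sketch under stated assumptions: the socket type (`SOCK_STREAM`), the newline-delimited JSON framing, and the event fields are all guesses made for illustration, and a temporary path stands in for the real socket paths above.

```python
import json
import os
import socket
import tempfile
import threading


def serve_once(path: str, received: list, ready: threading.Event) -> None:
    """Stand-in for the correlator side: accept one client, read JSON lines."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(path)
        server.listen(1)
        ready.set()
        conn, _ = server.accept()
        with conn, conn.makefile("r", encoding="utf-8") as reader:
            for line in reader:
                received.append(json.loads(line))


def send_event(path: str, event: dict) -> None:
    """Stand-in for a producer: write one JSON object per line."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall((json.dumps(event) + "\n").encode("utf-8"))


def demo() -> list:
    """Round-trip one hypothetical event through a temporary Unix socket."""
    received: list = []
    ready = threading.Event()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "http.socket")
        worker = threading.Thread(target=serve_once, args=(path, received, ready))
        worker.start()
        ready.wait()
        # Field names below are illustrative, not the real event schema.
        send_event(path, {"src_ip": "192.0.2.10", "src_port": 51234, "method": "GET"})
        worker.join()
    return received
```

Calling `demo()` returns the list of events the listening side decoded, which makes it easy to check a producer's output format without running the full pipeline.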
### Sentinel requires elevated privileges
Tests need `--cap-add=NET_RAW --cap-add=NET_ADMIN` for packet capture (pcap).