Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic
- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.

Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv

Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
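The seeder source itself is not shown on this page, so the following is only a minimal sketch of two ideas from the commit message: drawing bot IPs from CIDR blocks and producing synthetic JA4 strings that cannot collide with the browser list. The function names, the `192.0.2.0/24` documentation range, and the placeholder JA4 prefix are assumptions made for illustration, not the names or values used in seed_clickhouse.py. It also assumes the usual three-part JA4 shape with two 12-hex-character sections.

```python
import hashlib
import ipaddress
import random


def random_ip_in_cidr(cidr: str, rng: random.Random) -> str:
    """Return a uniformly random address inside the given CIDR block.

    The network and broadcast addresses are not excluded in this sketch.
    """
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net[rng.randrange(net.num_addresses)])


def synthetic_ja4(known: set[str], rng: random.Random) -> str:
    """Build a JA4-shaped string that is guaranteed not to be in `known`.

    The "t13d1516h2" prefix is a placeholder; only the hash sections vary.
    """
    while True:
        digest = hashlib.sha256(rng.getrandbits(64).to_bytes(8, "big")).hexdigest()
        candidate = f"t13d1516h2_{digest[:12]}_{digest[12:24]}"
        if candidate not in known:  # explicit rejection gives the guarantee
            return candidate


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed, in the spirit of --seed
    browser_ja4s = {"placeholder"}   # would come from browser_ja4.csv
    print(random_ip_in_cidr("192.0.2.0/24", rng))
    print(synthetic_ja4(browser_ja4s, rng))
```

The explicit membership check, rather than relying on hash collisions being unlikely, is what makes "guaranteed NOT in browser_ja4.csv" hold by construction.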
@@ -157,6 +157,8 @@ services:
     depends_on:
       platform:
         condition: service_healthy
+    volumes:
+      - ../../scripts/data:/app/data:ro
     networks:
       - ja4net
 
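The read-only mount in this hunk is what lets the seeder find its CSV files inside the container. As a rough sketch of how the new flags from the commit message could be wired up alongside the connection flags used by run-tests.sh, here is an `argparse` parser. All default values are assumptions inferred from this page (500K rows, 20K IPs, the `/app/data` mount target); the real script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for the seeder; defaults are assumed, not confirmed."""
    p = argparse.ArgumentParser(description="Seed ClickHouse with synthetic HTTP logs")
    # Connection flags as invoked from run-tests.sh
    p.add_argument("--host", default="clickhouse")
    p.add_argument("--port", type=int, default=8123)
    p.add_argument("--user", default="default")
    p.add_argument("--password", default="")
    # Flags introduced by this commit
    p.add_argument("--rows", type=int, default=500_000, help="total rows to insert")
    p.add_argument("--ips", type=int, default=20_000, help="distinct client IPs")
    p.add_argument("--seed", type=int, default=None, help="RNG seed for reproducible runs")
    p.add_argument("--data-dir", default="/app/data", help="directory holding the CSV files")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` exposes `--data-dir` as `args.data_dir`.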
@@ -162,15 +162,15 @@ log "============================================"
 # The seeder inserts directly into http_logs_raw, triggering all MVs:
 # http_logs_raw → mv_http_logs → http_logs → mv_agg_host_ip_ja4_1h → agg_host_ip_ja4_1h
 # This pre-populates:
-# - 720 human sessions (IPs in residential ASN ranges → asn_label='human')
-# - 150 scanner/anomaly sessions (IPs in datacenter ASN → ML anomaly candidates)
-# - 100 known-bot sessions (IPs/JA4 matching bot_ip.csv / bot_ja4.csv)
-# - 20 brute-force clusters (many POST /login per IP)
+# - ~350,000 rows from 14,000 browser IPs (ISP ASN ranges → asn_label='isp')
+# - ~100,000 rows from 3,000 scanner IPs (datacenter ASN → ML anomaly candidates)
+# - ~30,000 rows from 2,000 legit bot IPs (from bot_ip.csv CIDRs)
+# - ~20,000 rows from 1,000 AI bot IPs (datacenter ranges)
 # After seeding, bot_detector has ≥500 human rows → can train and run.
 log "Running seed_clickhouse.py..."
 if docker compose exec -T traffic-gen python /app/seed_clickhouse.py \
     --host clickhouse --port 8123 --user default --password ""; then
-    pass "ClickHouse seeded (700+ human + 150 scanner + 100 known-bot rows)"
+    pass "ClickHouse seeded (500K rows: 350K browser + 100K scanner + 30K legit-bot + 20K AI-bot)"
 else
     warn "Seeder reported errors (pipeline verification will show impact)"
 fi
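The row and IP counts in the updated comments correspond to a fixed mix of 70/20/6/4 percent of rows and 70/15/10/5 percent of IPs. A sketch of how such a split, the 2000-row batching, and the 10K-row progress reporting might fit together is shown below. `split`, `batches`, `insert_all`, and the `send` callback are hypothetical names; the actual seeder may structure this differently.

```python
from itertools import islice

# Percent weights derived from the counts in the comments above
ROW_WEIGHTS = {"browser": 70, "scanner": 20, "legit_bot": 6, "ai_bot": 4}
IP_WEIGHTS = {"browser": 70, "scanner": 15, "legit_bot": 10, "ai_bot": 5}


def split(total: int, weights: dict[str, int]) -> dict[str, int]:
    """Split `total` by integer percent weights; any remainder goes to the first key."""
    counts = {name: total * w // 100 for name, w in weights.items()}
    first = next(iter(counts))
    counts[first] += total - sum(counts.values())
    return counts


def batches(rows, size=2000):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def insert_all(rows, send, batch_size=2000, report_every=10_000):
    """Send rows in batches via `send` (a stand-in for the real INSERT call)."""
    sent = 0
    for chunk in batches(rows, batch_size):
        send(chunk)
        sent += len(chunk)
        if sent % report_every == 0:
            print(f"{sent:,} rows inserted")
    return sent


if __name__ == "__main__":
    print(split(500_000, ROW_WEIGHTS))
    print(split(20_000, IP_WEIGHTS))
    insert_all(range(25_000), send=lambda chunk: None)
```

Integer percent weights avoid floating-point rounding, so the per-category counts always sum to the requested total.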
@@ -2,7 +2,7 @@ FROM python:3.12-alpine
 
 # No extra deps needed — stdlib only (urllib, ssl, concurrent.futures)
 WORKDIR /app
-COPY generate_traffic.py .
+COPY *.py .
 
 # Keep container alive; traffic is triggered via docker compose exec
 ENTRYPOINT ["sleep", "infinity"]
File diff suppressed because it is too large