Rewrite seed_clickhouse.py: 500K rows from 20K IPs with realistic traffic

- 350K browser rows (14K IPs) using real JA4s from browser_ja4.csv
- 100K scanner rows (3K IPs) with vuln/cred/scraper/DDoS sub-categories
- 30K legit bot rows (2K IPs) from real bot_ip.csv CIDRs
- 20K AI bot rows (1K IPs) for GPTBot, ClaudeBot, etc.

Key improvements:
- Load browser_ja4.csv at startup, match JA4 to browser family
- Load bot_ip.csv to generate IPs from real Googlebot/Bingbot CIDRs
- Hard-coded ISP /24 prefixes from real ASNs (Comcast, Orange, DT, etc.)
- Realistic navigation patterns with Referer chains and cookies
- Sec-CH-UA headers for Chromium browsers (modern_browser_score >= 50)
- Batch size increased to 2000, progress reporting every 10K rows
- New CLI args: --rows, --ips, --seed, --data-dir
- Bot JA4s are synthetic hashes guaranteed NOT in browser_ja4.csv

Also updated:
- Dockerfile: COPY *.py (was missing seed_clickhouse.py)
- docker-compose.yml: mount scripts/data as /app/data for CSV access
- run-tests.sh: updated seeder description comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
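
For readers without the seeder source at hand, the row/IP split and the new CLI flags described above can be sketched roughly as follows. This is a minimal illustration, not the code from seed_clickhouse.py: the helper names (`parse_args`, `split`, `random_ip`), the flag defaults, and the 192.0.2.0/24 documentation range standing in for a bot_ip.csv CIDR are all assumptions. Only the flag names and the category proportions come from the commit message.

```python
import argparse
import ipaddress
import random

# Category weights in thousands, taken from the commit message.
ROW_WEIGHTS = {"browser": 350, "scanner": 100, "legit_bot": 30, "ai_bot": 20}
IP_WEIGHTS = {"browser": 14, "scanner": 3, "legit_bot": 2, "ai_bot": 1}


def parse_args(argv=None):
    """Parse the four flags named in the commit; the defaults are assumed."""
    parser = argparse.ArgumentParser(description="Traffic seeder sketch")
    parser.add_argument("--rows", type=int, default=500_000)
    parser.add_argument("--ips", type=int, default=20_000)
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--data-dir", default="/app/data")
    return parser.parse_args(argv)


def split(total, weights):
    """Split `total` proportionally to integer weights.

    Any rounding remainder is added to the first category so the
    parts always sum to `total`.
    """
    denom = sum(weights.values())
    parts = {name: total * w // denom for name, w in weights.items()}
    first = next(iter(weights))
    parts[first] += total - sum(parts.values())
    return parts


def random_ip(cidr, rng):
    """Pick one address uniformly at random from a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[rng.randrange(net.num_addresses)])
```

Under these assumptions, `split(parse_args([]).rows, ROW_WEIGHTS)` reproduces the 350K/100K/30K/20K mix, and passing a seeded `random.Random(args.seed)` to `random_ip` keeps the generated addresses reproducible between runs.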
@@ -2,7 +2,7 @@ FROM python:3.12-alpine
 
 # No extra deps needed — stdlib only (urllib, ssl, concurrent.futures)
 WORKDIR /app
-COPY generate_traffic.py .
+COPY *.py .
 
 # Keep container alive; traffic is triggered via docker compose exec
 ENTRYPOINT ["sleep", "infinity"]