feat: rewrite ASN classification with PeeringDB + expanded heuristics

Major improvements to generate_asn_data.py:
- Add PeeringDB network data source (34K networks with info_type)
- Add new categories: education, government, enterprise
- Rename 'human' label to 'isp' across all consumers
- Expand keyword heuristics (ISP, datacenter, hosting, CDN, education, gov)
- Add hard-coded lists for education, government, enterprise ASNs
- Support both --output-dir and --output-asn/--output-ipasn CLI interfaces
- Add --no-peeringdb flag for offline use
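
The layering described above (hard-coded lists, PeeringDB `info_type`, keyword heuristics) could be sketched roughly as follows. This is a minimal illustration, not the code in generate_asn_data.py: the precedence order, the `info_type`-to-label mapping, the keyword lists, and the names `classify_asn`, `HARDCODED`, and `KEYWORDS` are all assumptions. The ASNs used are from the range reserved for documentation.

```python
import re
from typing import Optional

# Assumed mapping from PeeringDB info_type values to this commit's labels.
PEERINGDB_TYPE_TO_LABEL = {
    "Cable/DSL/ISP": "isp",
    "NSP": "isp",
    "Content": "cdn",
    "Educational/Research": "education",
    "Government": "government",
    "Enterprise": "enterprise",
}

# Hypothetical hard-coded overrides (ASN -> label); 64496 is a documentation ASN.
HARDCODED = {64496: "education"}

# Hypothetical keyword heuristics, checked in order against whole words.
KEYWORDS = [
    ("education", {"university", "college", "school"}),
    ("government", {"government", "ministry"}),
    ("cdn", {"cdn"}),
    ("hosting", {"hosting", "vps"}),
    ("datacenter", {"datacenter", "cloud"}),
    ("isp", {"telecom", "broadband", "cable", "isp"}),
]


def classify_asn(asn: int, name: str, info_type: Optional[str] = None) -> str:
    """Return a label for an ASN, falling back to 'unknown'."""
    # 1. Hard-coded lists win over every other source (assumed precedence).
    if asn in HARDCODED:
        return HARDCODED[asn]
    # 2. PeeringDB info_type, when present and mappable.
    label = PEERINGDB_TYPE_TO_LABEL.get(info_type or "")
    if label:
        return label
    # 3. Keyword heuristics on whole words, so "isp" does not match "dispatch".
    tokens = set(re.findall(r"[a-z0-9]+", name.lower()))
    for label, words in KEYWORDS:
        if tokens & words:
            return label
    return "unknown"
```

Matching on whole words rather than substrings is a design choice of this sketch; it trades some recall for fewer false positives on short keywords.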

Results: the share of ASNs labeled 'unknown' dropped from 86% to 57%.
Coverage per label: isp 21.8K ASNs, enterprise 5.7K, education 3.1K,
government 520.
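
Percentages like the ones above can be recomputed from a list of labels. The helper below is a hypothetical sketch (the name `label_shares` is not from the commit) and assumes the labels have already been read from data/asn_reputation.csv, whose column layout is not shown here.

```python
from collections import Counter
from typing import Dict, Iterable


def label_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of entries carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```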

Updated consumers:
- bot_detector.py: 'human' -> 'isp' for baseline selection
- dashboard api.py: 'human' -> 'isp' in SQL queries
- run-tests.sh: 'human' -> 'isp' in integration test assertions
- update-csv-data.sh: updated label description comment
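
For consumers that may still encounter the old label in previously generated data, a small normalization step is one option. This is a hypothetical sketch only: the commit itself renames the label in place, and `normalize_label` and `LEGACY_LABELS` are illustrative names, not part of the change.

```python
# Assumed legacy-to-current label mapping; only 'human' -> 'isp' comes from the commit.
LEGACY_LABELS = {"human": "isp"}


def normalize_label(label: str) -> str:
    """Map a legacy label to its current name; pass other labels through."""
    cleaned = label.strip().lower()
    return LEGACY_LABELS.get(cleaned, cleaned)
```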

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
toto
2026-04-08 16:02:07 +02:00
parent 9a48fb9d29
commit 5c5bca71d1
6 changed files with 35197 additions and 30362 deletions


@@ -4,7 +4,7 @@
# Outputs:
# data/bot_ip.csv — Known bot/scanner IPs + Tor exit nodes
# data/bot_ja4.csv — Known bot JA4 TLS fingerprints
-# data/asn_reputation.csv — ASN→label mapping (human/datacenter/hosting)
+# data/asn_reputation.csv — ASN→label mapping (isp/datacenter/hosting/cdn/education/government/enterprise)
# data/iplocate-ip-to-asn.csv — CIDR→ASN for dictionary lookup
#
# Usage: