Pipeline complet Radiacode 103 - identification automatique d'isotopes

- VegaModel CNN-FCNN 34.5M params, 82 isotopes, val acc 99.89% - Generation 50k spectres synthetiques 1D (12-24h durees) - Entrainement 100 epochs sur RTX 5060 Ti (CUDA 12.8, Blackwell) - Detection continue avec soustraction du background - Capture background 24h avec gestion deconnexion - Docker Compose : conteneur train (GPU) + detect (CPU/USB) - Modele entraite inclus (vega_best.pt, 395 Mo) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-19 12:29:56 +02:00
commit 745a64b342
52 changed files with 17558 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,40 @@
 # Modeles entraines
 models/vega_epoch_*.pt
 models/vega_final.pt
 # Donnees synthetiques (4+ Go)
 data/synthetic/
 # Background (genere en cours de capture)
 data/background_24h.npy
 data/background_snapshot.json
 # Logs
 logs/*.log
 logs/*.json
 # Python
 __pycache__/
 *.pyc
 *.pyo
 *.egg-info/
 .eggs/
 dist/
 build/
 *.egg
 # Docker
 *.tar
 # IDE
 .vscode/
 .idea/
 *.swp
 *.swo
 # OS
 .DS_Store
 Thumbs.db
 # Environnement
 .env
--- a/README.md
+++ b/README.md
@ -0,0 +1,203 @@
 # Radiacode 103 — Identification automatique d'isotopes
 Pipeline Docker complet pour la capture, l'analyse et l'identification automatique d'isotopes radioactifs avec un spectrometre gamma Radiacode 103.
 ## Architecture
 ```
 Radiacode 103 (USB)
        │
        ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Conteneur detect (Python 3.11 + PyTorch CPU)          │
 │                                                         │
 │  capture_background.py  ──>  background_24h.npy (24h)   │
 │  radiacode_monitor.py   ──>  rapport JSON quotidien      │
 │         │                                               │
 │         ├── Echantillonnage chaque 60s                  │
 │         ├── Soustraction du background                  │
 │         ├── Inference VegaModel (34.5M params, 82 iso)  │
 │         └── Rapport quotidien a 00h00                   │
 │                                                         │
 │  Modele: vega_best.pt (entraite sur RTX 5060 Ti)       │
 └─────────────────────────────────────────────────────────┘
 ```
 ## Demarrage rapide
 ```bash
 # 1. Builder les images
 docker compose build
 # 2. Lancer l'entrainement (GPU, ~45 min)
 docker compose run --rm train
 # 3. Capturer le bruit de fond (24h, sans source radioactive)
 docker compose run --rm -d --name radiacode-bg detect python capture_background.py
 # 4. Lancer la detection continue
 docker compose up detect
 ```
 ## Configuration
 Variables d'environnement (dans `docker-compose.yml`) :
 | Variable | Defaut | Description |
 |----------|--------|-------------|
 | `MODEL_PATH` | `/models/vega_best.pt` | Chemin du modele PyTorch |
 | `ISOTOPE_INDEX_PATH` | `/models/vega_isotope_index.txt` | Index des 82 isotopes |
 | `BACKGROUND_PATH` | `/data/background_24h.npy` | Fichier background de reference |
 | `THRESHOLD` | `0.5` | Seuil de probabilite pour la detection |
 | `SAMPLE_INTERVAL` | `60` | Intervalle d'echantillonnage (secondes) |
 | `REPORT_HOUR` | `0` | Heure du rapport quotidien |
 | `MIN_LIVE_TIME` | `3600` | Live time minimum pour un rapport (secondes) |
 | `VEGA_DEVICE` | `cpu` | Device PyTorch (`cpu` ou `cuda`) |
 ## Bruit Poissonnien et modele
 ### Physique du bruit
 La detection gamma est un processus de comptage intrinsequement stochastique. Chaque canal du spectre suit une loi de Poisson : si un canal accumule en moyenne N comptages, l'ecart-type est sqrt(N). Le rapport signal/bruit s'ameliore donc en sqrt(N) :
 - **6.5 CPS de background** sur 1024 canaux -> ~0.006 CPS/canal
 - Apres 1h : ~22 comptages/canal, ecart-type ~4.7, SNR ~5
 - Apres 24h : ~560 comptages/canal, ecart-type ~24, SNR ~23
 - Apres 1 semaine : ~3900 comptages/canal, ecart-type ~62, SNR ~63
 C'est pourquoi la capture de background dure **24h** : en dessous, les pics du background (K-40 a 1460 keV, Bi-214 a 609 keV, Pb-214 a 352 keV) sont noyes dans le bruit Poissonnien.
 ### Modele VegaModel
 Le VegaModel est un CNN-FCNN multi-tache inspire de l'architecture Vega d'Open-RadiaCode-Android :
 - **Entree** : spectre 1D de 1023 canaux (20-3000 keV), normalise par max
 - **Sortie** : deux tetes
  - **Classification** : 82 neurones avec sigmoid (presence/absence de chaque isotope)
  - **Regression** : 82 neurones (activite estimee en Bq, normalisee a max_activity_bq=1000)
 - **Architecture** :
  - 3 blocs CNN (64, 128, 256 canaux) avec BatchNorm + ReLU + MaxPool
  - 2 couches FC (512, 256) avec Dropout(0.3)
  - **34 493 156 parametres** au total
 - **Fonction de perte** : VegaLoss = classification_weight * BCE + regression_weight * MSE (ponderee pour ne penaliser l'activite que sur les isotopes presents)
 - **Entrainement** : 50 000 spectres synthetiques, 100 epochs, AMP (mixed precision), early stopping (patience=10)
 - **Background dans les donnees synthetiques** : K-40, radon (Pb-214, Bi-214), thorium (Ac-228, Pb-212, Tl-208) simules avec des activites aleatoires realistes
 ### Spectres synthetiques
 Les donnees d'entrainement simulent la physique complete :
 1. **Pics photoelectriques** : Gaussiennes avec FWHM dependant de l'energie (8.4% a 662 keV pour le CsI(Tl))
 2. **Continuum Compton** : distribution de Klein-Nishina simplifiee sous chaque pic
 3. **Bruit Poissonnien** : echantillonnage Poisson(N) pour chaque canal, simulant les fluctuations de comptage reelles
 4. **Background environnemental** : continuum exponentiel + pics de K-40, radon, thorium avec activites aleatoires
 5. **Efficacite du detecteur** : modele phenomenologique qui decroit avec l'energie (absorption basse energie + penetration haute energie)
 6. **Durees de 12-24h** : suffisamment longues pour que le rapport signal/bruit soit comparable aux mesures reelles
 ### Soustraction du background a l'inference
 Le modele est entraine avec du background synthetique inclus, mais a l'inference on soustrait le vrai background mesure :
 ```python
 rate = cumulated_counts / cumulated_live_time       # spectre brut en CPS
 bg_rate = bg_counts / bg_live_time                  # background de reference en CPS
 net_rate = np.clip(rate - bg_rate, 0, None)         # spectre net (background soustrait)
 normalized = net_rate / net_rate.max()               # normalisation pour le modele
 ```
 Cette approche hybride est optimale :
 - Le modele apprend a ignorer les pics du background (K-40, radon, thorium) pendant l'entrainement
 - La soustraction reelle elimine les variations locales du background (emplacement, altitude, materiaux)
 - Resultat : meilleure sensibilite et moins de faux positifs
 Le conteneur `train` execute deux phases :
 1. **Generation des spectres synthetiques** : 50 000 spectres 1D (20k mono-isotope, 15k bi-isotope, 10k multi-isotope, 5k background seul) avec des durees de 12-24h
 2. **Entrainement VegaModel** : CNN-FCNN multi-tache, 82 isotopes, 34.5M parametres
 Resultats de l'entrainement sur RTX 5060 Ti :
 - **Val accuracy** : 99.89%
 - **AUC macro** : 0.995
 - **Best val loss** : 0.0051
 - **Duree** : ~45 min (100 epochs, 25s/epoch)
 ## Detection
 Le moniteur `radiacode_monitor.py` :
 - Echantillonne le spectre toutes les 60 secondes
 - Cumule les comptages et le live time
 - Soustrait le background de reference (si disponible)
 - Execute l'inference VegaModel sur le spectre net
 - Genere un rapport JSON quotidien a `REPORT_HOUR`
 Le rapport contient :
 - Live time et comptages totaux
 - CPS moyen
 - Isotopes detectes avec probabilite et activite estimee (Bq)
 ## Capture du bruit de fond
 Avant la detection, capturer le background pendant 24h sans source radioactive a proximite :
 ```bash
 docker compose run --rm -d --name radiacode-bg \
  -e TARGET_DURATION=86400 -e SAMPLE_INTERVAL=60 \
  detect python capture_background.py
 ```
 Suivre la progression :
 ```bash
 docker logs -f radiacode-bg
 cat data/background_snapshot.json
 ```
 ## Structure du projet
 ```
 radiacode_103/
 ├── docker-compose.yml          # Orchestration des conteneurs
 ├── TOTO.md                     # Suivi des taches
 ├── README.md
 ├── train/
 │   ├── Dockerfile              # PyTorch 2.7.0 + CUDA 12.8 (Blackwell)
 │   ├── requirements.txt
 │   ├── entrypoint.sh           # Generation + entrainement
 │   └── vega_ml/                # Code VegaModel (copie d'Open-RadiaCode-Android)
 │       ├── synthetic_spectra/  # Generateur de spectres synthetiques
 │       ├── training/vega/      # Modele, dataset, trainer
 │       └── inference/
 ├── detect/
 │   ├── Dockerfile              # Python 3.11-slim + radiacode + torch CPU
 │   ├── requirements.txt
 │   ├── radiacode_monitor.py   # Moniteur principal
 │   └── capture_background.py  # Capture du bruit de fond 24h
 ├── data/
 │   ├── synthetic/spectra/     # 50 000 spectres synthetiques (~4.2 Go)
 │   └── background_snapshot.json
 ├── models/
 │   ├── vega_best.pt           # Meilleur modele (395 Mo)
 │   ├── vega_final.pt          # Modele final
 │   ├── vega_history.json      # Metriques d'entrainement
 │   └── vega_isotope_index.txt # 82 isotopes
 └── logs/                       # Rapports quotidiens JSON
 ```
 ## Materiel
 | Composant | Modele |
 |-----------|--------|
 | Spectrometre | Radiacode 103 (CsI(Tl), 1024 canaux, FWHM 8.4% @662 keV) |
 | GPU entrainement | NVIDIA RTX 5060 Ti 16 Go (Blackwell, sm_120) |
 | GPU secondaire | NVIDIA RTX 4060 Ti 16 Go (Ada Lovelace) |
 | Production (futur) | Raspberry Pi 4 8 Go |
 ## 82 isotopes detectables
 Ac-227, Ac-228, Ag-110m, Am-241, Au-198, Ba-133, Ba-137m, Be-7, Bi-210, Bi-211, Bi-212, Bi-214, C-14, Cd-109, Ce-139, Ce-144, Co-57, Co-58, Co-60, Cr-51, Cs-134, Cs-137, Eu-152, Eu-154, F-18, Fe-59, Ga-67, Hf-181, Hg-203, I-123, I-129, I-131, In-111, Ir-192, K-40, Lu-177, Mn-54, Na-22, Pa-231, Pa-234m, Pb-210, Pb-211, Pb-212, Pb-214, Po-210, Po-212, Po-214, Po-216, Po-218, Pr-144, Ra-223, Ra-224, Ra-226, Ra-228, Rh-106, Rn-220, Rn-222, Ru-106, Sb-125, Sc-46, Se-75, Sm-153, Sn-113, Sr-85, Sr-90, Ta-182, Tc-99m, Th-228, Th-230, Th-231, Th-232, Th-234, Tl-201, Tl-207, Tl-208, U-234, U-235, U-238, Xe-133, Y-88, Y-90, Zn-65
 ## Passage en production (Raspberry Pi 4)
 1. Copier `models/vega_best.pt` et `models/vega_isotope_index.txt` sur le Pi 4
 2. Capturer le background sur le Pi 4 (emplacement final)
 3. Le meme `detect/Dockerfile` fonctionne sur ARM64
 4. Adapter `docker-compose.yml` : retirer `deploy.resources.reservations.devices`
 5. `VEGA_DEVICE=cpu` (inference CPU sur Pi 4, ~1s par spectre)
--- a/TOTO.md
+++ b/TOTO.md
@ -0,0 +1,30 @@
 # Radiacode 103 — Pipeline d'identification automatique d'isotopes
 ## Etat d'avancement
 | Etape | Statut | Detail |
 |-------|--------|--------|
 | Build Docker | Fait | train + detect |
 | Generation spectres synthetiques | Fait | 50 000 echantillons (1D, 4.2 Go) |
 | Entrainement VegaModel | Fait | 100 epochs, val loss 0.0051, val acc 99.89% |
 | Modele sauvegarde | Fait | `models/vega_best.pt` (395 Mo), 82 isotopes |
 | Capture background 24h | En cours | 0.2h/24h, 6.5 CPS |
 | Detection continue | Pas encore | Apres background 24h |
 | Test avec source | Pas encore | Apres detection continue |
 ## Prochaines etapes
 - [ ] Attendre fin de la capture background 24h (conteneur `radiacode-bg` en cours)
 - [ ] Lancer le moniteur : `docker compose up detect`
 - [ ] Tester avec une source radioactive connue (Cs-137)
 - [] Nettoyer les checkpoints d'epochs dans `models/` (garder seulement `vega_best.pt`, `vega_final.pt`, `vega_history.json`, `vega_isotope_index.txt`)
 - [ ] Transfer vers Pi 4 pour la production
 ## Bugs corriges
 - `spectrum.duration` retourne un `timedelta`, pas un `float` -> utilise `.total_seconds()`
 - Generation de spectres 2D (time x channels) causait un OOM a ~210 echantillons -> generation 1D cumulative
 - PyTorch 2.4 ne supporte pas sm_120 (Blackwell/RTX 5060 Ti) -> PyTorch 2.7.0 + CUDA 12.8
 - DataParallel incompatible entre GPU d'architectures differentes (4060 Ti Ada + 5060 Ti Blackwell) -> mono-GPU
 - `radiacode` depend de `bluepy` (BLE) qui ne compile pas dans `python:3.11-slim` -> ajoute `build-essential libglib2.0-dev`
 - Volume `./data` monte en read-only dans detect -> passe en read-write pour le snapshot JSON
--- a/detect/Dockerfile
+++ b/detect/Dockerfile
@ -0,0 +1,22 @@
 FROM python:3.11-slim
 ENV DEBIAN_FRONTEND=noninteractive
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONDONTWRITEBYTECODE=1
 RUN apt-get update && apt-get install -y --no-install-recommends \
    libusb-1.0-0 \
    usbutils \
    build-essential \
    libglib2.0-dev \
    && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 COPY radiacode_monitor.py .
 COPY capture_background.py .
 CMD ["python", "radiacode_monitor.py"]
--- a/detect/capture_background.py
+++ b/detect/capture_background.py
@ -0,0 +1,88 @@
 #!/usr/bin/env python3
 """
 Capture le bruit de fond du détecteur sur 24h (sans source).
 Gère le débranchement/rebranchement du détecteur.
 À lancer séparément avant le moniteur :
    docker-compose run --rm detect python capture_background.py
 """
 import numpy as np
 import time
 import json
 import os
 SAMPLE_INTERVAL = int(os.environ.get("SAMPLE_INTERVAL", "60"))
 TARGET_DURATION = int(os.environ.get("TARGET_DURATION", str(86400)))  # 24h
 OUTPUT_PATH = os.environ.get("BACKGROUND_PATH", "/data/background_24h.npy")
 SNAPSHOT_PATH = os.environ.get("SNAPSHOT_PATH", "/data/background_snapshot.json")
 BG_COUNTS = np.zeros(1024, dtype=np.float64)
 BG_LIVE_TIME = 0.0
 device = None
 def save_snapshot():
    """Save a human-readable snapshot of current background."""
    cps = BG_COUNTS.sum() / BG_LIVE_TIME if BG_LIVE_TIME > 0 else 0
    # Approximate energy calibration for RC-103: E ≈ 0.33 + 2.97*ch
    peaks = []
    max_c = BG_COUNTS.max()
    if max_c > 0:
        for i, c in enumerate(BG_COUNTS):
            if c > max_c * 0.03:
                energy = 0.33 + 2.97 * i
                peaks.append({"channel": i, "energy_kev": round(energy, 1), "counts": round(float(c), 1)})
    snapshot = {
        "elapsed_hours": round((time.time() - start) / 3600, 2),
        "live_time_s": round(BG_LIVE_TIME, 1),
        "total_counts": round(float(BG_COUNTS.sum()), 0),
        "cps": round(cps, 2),
        "top_peaks": sorted(peaks, key=lambda x: -x["counts"])[:15],
        "spectrum": [round(float(c), 1) for c in BG_COUNTS],
    }
    with open(SNAPSHOT_PATH, "w") as f:
        json.dump(snapshot, f, indent=2)
 print(f"Capture du bruit de fond pendant {TARGET_DURATION/3600:.0f}h...")
 print("Assurez-vous qu'aucune source radioactive n'est a proximite du detecteur.")
 print()
 start = time.time()
 while (time.time() - start) < TARGET_DURATION:
    time.sleep(SAMPLE_INTERVAL)
    try:
        if device is None:
            from radiacode import RadiaCode
            device = RadiaCode()
            device.spectrum_reset()
            print("Radiacode connecte.")
        spectrum = device.spectrum()
        BG_COUNTS += np.array(spectrum.counts, dtype=np.float64)
        BG_LIVE_TIME += spectrum.duration.total_seconds()
        device.spectrum_reset()
        elapsed = time.time() - start
        cps = BG_COUNTS.sum() / BG_LIVE_TIME if BG_LIVE_TIME > 0 else 0
        print(
            f"Background : {elapsed/3600:.1f}h / {TARGET_DURATION/3600:.1f}h "
            f"({BG_LIVE_TIME:.0f}s live, {BG_COUNTS.sum():.0f} coups, {cps:.1f} CPS)",
            flush=True,
        )
        save_snapshot()
    except Exception as e:
        print(f"\nErreur : {e}, reconnexion...")
        device = None
 os.makedirs(os.path.dirname(OUTPUT_PATH) if os.path.dirname(OUTPUT_PATH) else ".", exist_ok=True)
 np.save(
    OUTPUT_PATH,
    {
        "counts": BG_COUNTS,
        "duration": BG_LIVE_TIME,
        "timestamp": time.time(),
    },
 )
 print(f"\n\nBackground sauvegarde : {OUTPUT_PATH}")
 print(f"  Duree live : {BG_LIVE_TIME/3600:.1f}h")
 print(f"  Total coups : {BG_COUNTS.sum():.0f}")
 print(f"  CPS moyen : {BG_COUNTS.sum()/BG_LIVE_TIME:.1f}")
--- a/detect/radiacode_monitor.py
+++ b/detect/radiacode_monitor.py
@ -0,0 +1,248 @@
 #!/usr/bin/env python3
 """
 Radiacode 103 — Identification automatique d'isotopes
 Cycle de 24h avec détection branché/débranché
 Fonctionne en Docker sur machine GPU (dev) ou RPi 4 (production)
 """
 import numpy as np
 import torch
 import time
 import json
 import logging
 import os
 import sys
 from datetime import datetime
 from pathlib import Path
 # Configuration via variables d'environnement
 MODEL_PATH = os.environ.get("MODEL_PATH", "/models/vega_best.pt")
 ISOTOPE_INDEX_PATH = os.environ.get("ISOTOPE_INDEX_PATH", "/models/vega_isotope_index.txt")
 BACKGROUND_PATH = os.environ.get("BACKGROUND_PATH", "/data/background_24h.npy")
 LOG_DIR = Path(os.environ.get("LOG_DIR", "/logs"))
 LOG_DIR.mkdir(parents=True, exist_ok=True)
 THRESHOLD = float(os.environ.get("THRESHOLD", "0.5"))
 SAMPLE_INTERVAL = int(os.environ.get("SAMPLE_INTERVAL", "60"))
 REPORT_HOUR = int(os.environ.get("REPORT_HOUR", "0"))
 MIN_LIVE_TIME = int(os.environ.get("MIN_LIVE_TIME", "3600"))
 # Logging
 logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler(LOG_DIR / "radiacode.log"),
    ],
 )
 log = logging.getLogger(__name__)
 class RadiacodeMonitor:
    def __init__(self):
        # Charger le modèle PyTorch
        device_str = os.environ.get("VEGA_DEVICE", "cpu")
        self.device = torch.device(device_str)
        log.info(f"Chargement du modèle depuis {MODEL_PATH} sur {self.device}...")
        checkpoint = torch.load(MODEL_PATH, map_location=self.device, weights_only=False)
        # Importer VegaModel (depuis le volume monté)
        vega_ml_path = os.environ.get("VEGA_ML_PATH", "/models/vega_ml")
        if vega_ml_path not in sys.path:
            sys.path.insert(0, vega_ml_path)
        from training.vega.model import VegaModel, VegaConfig
        from training.vega.isotope_index import IsotopeIndex
        self.model_config = VegaConfig(**checkpoint["model_config"])
        self.model = VegaModel(self.model_config)
        self.model.load_state_dict(checkpoint["model_state_dict"])
        self.model.eval()
        log.info(
            f"Modèle chargé : {self.model_config.num_isotopes} isotopes, "
            f"{self.model.count_parameters():,} paramètres"
        )
        # Charger l'index des isotopes
        self.isotope_index = IsotopeIndex.load(Path(ISOTOPE_INDEX_PATH))
        # Charger le bruit de fond de référence
        self.bg_counts = None
        self.bg_live_time = None
        bg_path = Path(BACKGROUND_PATH)
        if bg_path.exists():
            bg_data = np.load(str(bg_path), allow_pickle=True).item()
            self.bg_counts = bg_data["counts"].astype(np.float64)
            self.bg_live_time = float(bg_data["duration"])
            log.info(
                f"Background chargé : {self.bg_live_time/3600:.1f}h, "
                f"{self.bg_counts.sum():.0f} coups"
            )
        else:
            log.warning(f"Pas de fichier background : {BACKGROUND_PATH}")
        # Compteurs cumulés
        self.cumulated_counts = np.zeros(1024, dtype=np.float64)
        self.cumulated_live_time = 0.0
        self.last_report_date = None
    def try_connect(self):
        """Tente de se connecter au Radiacode. Retourne le device ou None."""
        try:
            from radiacode import RadiaCode
            device = RadiaCode()
            log.info("Radiacode connecté")
            return device
        except Exception as e:
            log.debug(f"Détecteur non disponible : {e}")
            return None
    def sample_once(self):
        """Échantillonne une fois. Retourne True si succès."""
        device = None
        try:
            device = self.try_connect()
            if device is None:
                return False
            spectrum = device.spectrum()
            counts = np.array(spectrum.counts, dtype=np.float64)
            live_time = spectrum.duration.total_seconds()
            if live_time > 0 and counts.sum() > 0:
                self.cumulated_counts += counts
                self.cumulated_live_time += live_time
                device.spectrum_reset()
                log.info(
                    f"Échantillon : {counts.sum():.0f} coups en {live_time:.1f}s "
                    f"(cumul : {self.cumulated_live_time/3600:.1f}h)"
                )
                return True
            return False
        except Exception as e:
            log.warning(f"Erreur échantillonnage : {e}")
            return False
        finally:
            if device:
                try:
                    del device
                except Exception:
                    pass
    def run_inference(self, spectrum_rate):
        """Exécute l'inférence PyTorch sur le spectre cumulé."""
        if spectrum_rate.max() > 0:
            normalized = spectrum_rate / spectrum_rate.max()
        else:
            return []
        tensor = torch.tensor(normalized, dtype=torch.float32).unsqueeze(0).to(self.device)
        with torch.no_grad():
            logits, activities = self.model(tensor)
        probs = torch.sigmoid(logits).cpu().numpy()[0]
        activities = activities.cpu().numpy()[0] * self.model_config.max_activity_bq
        results = []
        for i in range(len(probs)):
            if probs[i] >= THRESHOLD:
                results.append(
                    {
                        "isotope": self.isotope_index.index_to_name(i),
                        "probability": float(probs[i]),
                        "activity_bq": float(activities[i]),
                    }
                )
        return sorted(results, key=lambda x: -x["probability"])
    def generate_report(self):
        """Génère le rapport quotidien."""
        if self.cumulated_live_time < MIN_LIVE_TIME:
            log.warning(
                f"Pas assez de données ({self.cumulated_live_time/3600:.1f}h < "
                f"{MIN_LIVE_TIME/3600:.1f}h minimum). Pas de rapport."
            )
            return
        rate = self.cumulated_counts / self.cumulated_live_time
        if self.bg_counts is not None and self.bg_live_time is not None:
            bg_rate = self.bg_counts / self.bg_live_time
            net_rate = np.clip(rate - bg_rate, 0, None)
        else:
            net_rate = rate
        results = self.run_inference(net_rate)
        now = datetime.now()
        report = {
            "date": now.isoformat(),
            "live_time_hours": self.cumulated_live_time / 3600,
            "total_counts": int(self.cumulated_counts.sum()),
            "cps_mean": float(self.cumulated_counts.sum() / self.cumulated_live_time),
            "background_subtracted": self.bg_counts is not None,
            "isotopes_detected": results,
        }
        report_path = LOG_DIR / f"report_{now.strftime('%Y-%m-%d')}.json"
        with open(report_path, "w") as f:
            json.dump(report, f, indent=2, ensure_ascii=False)
        # Affichage
        print(f"\n{'='*50}")
        print(f"  RAPPORT — {now.strftime('%d/%m/%Y')}")
        print(f"{'='*50}")
        print(f"  Live time : {self.cumulated_live_time/3600:.1f}h")
        print(f"  Comptages : {self.cumulated_counts.sum():.0f}")
        print(f"  CPS moyen : {self.cumulated_counts.sum()/self.cumulated_live_time:.1f}")
        print(
            f"  Background : {'soustrait' if self.bg_counts is not None else 'non soustrait'}"
        )
        print()
        if results:
            for r in results:
                print(
                    f"  {r['isotope']:>10s} : {r['probability']*100:5.1f}% — {r['activity_bq']:.1f} Bq"
                )
        else:
            print("  (background uniquement)")
        print(f"{'='*50}\n")
        log.info(f"Rapport sauvegardé : {report_path}")
        # Reset pour le cycle suivant
        self.cumulated_counts = np.zeros(1024, dtype=np.float64)
        self.cumulated_live_time = 0.0
    def run(self):
        """Boucle principale."""
        log.info("=" * 50)
        log.info("Radiacode 103 — Moniteur d'isotopes")
        log.info("=" * 50)
        log.info(f"Modèle : {MODEL_PATH}")
        log.info(f"Device : {self.device}")
        log.info(f"Isotopes : {self.isotope_index.num_isotopes}")
        log.info(
            f"Background : {'chargé' if self.bg_counts is not None else 'non disponible'}"
        )
        log.info(f"Seuil : {THRESHOLD}")
        log.info(f"Intervalle : {SAMPLE_INTERVAL}s")
        while True:
            now = datetime.now()
            if self.last_report_date != now.date() and now.hour == REPORT_HOUR:
                self.generate_report()
                self.last_report_date = now.date()
            self.sample_once()
            time.sleep(SAMPLE_INTERVAL)
 if __name__ == "__main__":
    monitor = RadiacodeMonitor()
    monitor.run()
--- a/detect/requirements.txt
+++ b/detect/requirements.txt
@ -0,0 +1,3 @@
 radiacode>=0.3.5
 numpy>=1.24.0
 torch>=2.0.0
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -0,0 +1,55 @@
 services:
  train:
    build:
      context: ./train
      dockerfile: Dockerfile
    volumes:
      - ./data:/data
      - ./models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      - CUDA_VISIBLE_DEVICES=1
      - DATA_DIR=/data/synthetic
      - MODEL_DIR=/models
      - NUM_SAMPLES=50000
      - EPOCHS=100
      - BATCH_SIZE=64
      - LEARNING_RATE=0.001
      - DETECTOR=radiacode_103
      - MIN_DURATION=43200
      - MAX_DURATION=86400
      - SEED=42
  detect:
    build:
      context: ./detect
      dockerfile: Dockerfile
    volumes:
      - ./models:/models:ro
      - ./logs:/logs
      - ./data:/data
    devices:
      - /dev/bus/usb:/dev/bus/usb
    privileged: true
    environment:
      - MODEL_PATH=/models/vega_best.pt
      - ISOTOPE_INDEX_PATH=/models/vega_isotope_index.txt
      - BACKGROUND_PATH=/data/background_24h.npy
      - VEGA_ML_PATH=/models/vega_ml
      - VEGA_DEVICE=cpu
      - LOG_DIR=/logs
      - SAMPLE_INTERVAL=60
      - REPORT_HOUR=0
      - MIN_LIVE_TIME=3600
      - THRESHOLD=0.5
    depends_on:
      train:
        condition: service_completed_successfully
    restart: unless-stopped
--- a/models/vega_best.pt
+++ b/models/vega_best.pt
--- a/models/vega_history.json
+++ b/models/vega_history.json
--- a/models/vega_isotope_index.txt
+++ b/models/vega_isotope_index.txt
@ -0,0 +1,82 @@
 Ac-227
 Ac-228
 Ag-110m
 Am-241
 Au-198
 Ba-133
 Ba-137m
 Be-7
 Bi-210
 Bi-211
 Bi-212
 Bi-214
 C-14
 Cd-109
 Ce-139
 Ce-144
 Co-57
 Co-58
 Co-60
 Cr-51
 Cs-134
 Cs-137
 Eu-152
 Eu-154
 F-18
 Fe-59
 Ga-67
 Hf-181
 Hg-203
 I-123
 I-129
 I-131
 In-111
 Ir-192
 K-40
 Lu-177
 Mn-54
 Na-22
 Pa-231
 Pa-234m
 Pb-210
 Pb-211
 Pb-212
 Pb-214
 Po-210
 Po-212
 Po-214
 Po-216
 Po-218
 Pr-144
 Ra-223
 Ra-224
 Ra-226
 Ra-228
 Rh-106
 Rn-220
 Rn-222
 Ru-106
 Sb-125
 Sc-46
 Se-75
 Sm-153
 Sn-113
 Sr-85
 Sr-90
 Ta-182
 Tc-99m
 Th-228
 Th-230
 Th-231
 Th-232
 Th-234
 Tl-201
 Tl-207
 Tl-208
 U-234
 U-235
 U-238
 Xe-133
 Y-88
 Y-90
 Zn-65
--- a/train/Dockerfile
+++ b/train/Dockerfile
@ -0,0 +1,17 @@
 FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
 ENV DEBIAN_FRONTEND=noninteractive
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONDONTWRITEBYTECODE=1
 WORKDIR /app
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 COPY vega_ml/ /app/vega_ml/
 COPY entrypoint.sh /app/entrypoint.sh
 RUN chmod +x /app/entrypoint.sh
 ENTRYPOINT ["/app/entrypoint.sh"]
--- a/train/entrypoint.sh
+++ b/train/entrypoint.sh
@ -0,0 +1,58 @@
 #!/bin/bash
 set -e
 DATA_DIR="${DATA_DIR:-/data/synthetic}"
 MODEL_DIR="${MODEL_DIR:-/models}"
 NUM_SAMPLES="${NUM_SAMPLES:-50000}"
 EPOCHS="${EPOCHS:-100}"
 BATCH_SIZE="${BATCH_SIZE:-64}"
 LEARNING_RATE="${LEARNING_RATE:-0.001}"
 DETECTOR="${DETECTOR:-radiacode_103}"
 MIN_DURATION="${MIN_DURATION:-43200}"
 MAX_DURATION="${MAX_DURATION:-86400}"
 SEED="${SEED:-42}"
 echo "============================================"
 echo "  Radiacode 103 — Pipeline d'entraînement"
 echo "============================================"
 echo "  Data dir    : $DATA_DIR"
 echo "  Model dir   : $MODEL_DIR"
 echo "  Samples     : $NUM_SAMPLES"
 echo "  Detector    : $DETECTOR"
 echo "  Duration     : $MIN_DURATION-$MAX_DURATION s"
 echo "  Epochs      : $EPOCHS"
 echo "  Batch size  : $BATCH_SIZE"
 echo "  Learning rate: $LEARNING_RATE"
 echo "============================================"
 echo ""
 echo "=== Phase 1 : Génération des spectres synthétiques ==="
 python -m vega_ml.synthetic_spectra.generate_spectra \
  --num_samples "$NUM_SAMPLES" \
  --output_dir "$DATA_DIR" \
  --detector "$DETECTOR" \
  --min_duration "$MIN_DURATION" \
  --max_duration "$MAX_DURATION" \
  --seed "$SEED"
 echo ""
 echo "=== Phase 2 : Entraînement du VegaModel ==="
 python -m vega_ml.training.vega.run_training \
  --data-dir "$DATA_DIR" \
  --model-dir "$MODEL_DIR" \
  --epochs "$EPOCHS" \
  --batch-size "$BATCH_SIZE" \
  --learning-rate "$LEARNING_RATE"
 echo ""
 echo "=== Entraînement terminé ==="
 echo "Fichiers modèle :"
 ls -lh "$MODEL_DIR/"
 echo ""
 echo "Copie de l'index des isotopes..."
 if [ -f "$MODEL_DIR/vega_isotope_index.txt" ]; then
    echo "  vega_isotope_index.txt présent"
 else
    echo "  ATTENTION : vega_isotope_index.txt absent"
 fi
--- a/train/requirements.txt
+++ b/train/requirements.txt
@ -0,0 +1,4 @@
 numpy>=1.24.0
 scipy>=1.10.0
 pillow>=9.0.0
 scikit-learn>=1.3.0
--- a/train/vega_ml/.gitignore
+++ b/train/vega_ml/.gitignore
@ -0,0 +1,234 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 .tmp/
 nosetests.xml
 coverage.xml
 *.cover
 *.py.cover
 .hypothesis/
 .pytest_cache/
 cover/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 .pybuilder/
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
 # .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 #Pipfile.lock
 # UV
 #   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #uv.lock
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
 #poetry.toml
 # pdm
 #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
 #   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
 #   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
 #pdm.lock
 #pdm.toml
 .pdm-python
 .pdm-build/
 # pixi
 #   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
 #pixi.lock
 #   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
 #   in the .venv directory. It is recommended not to include this directory in version control.
 .pixi
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .envrc
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 # pytype static type analyzer
 .pytype/
 # Cython debug symbols
 cython_debug/
 # PyCharm
 #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
 # Abstra
 # Abstra is an AI-powered process automation framework.
 # Ignore directories containing user credentials, local state, and settings.
 # Learn more at https://abstra.io/docs
 .abstra/
 # Visual Studio Code
 #  Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore 
 #  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
 #  and can be added to the global gitignore or merged into this file. However, if you prefer, 
 #  you could uncomment the following to ignore the entire vscode folder
 # .vscode/
 # Ruff stuff:
 .ruff_cache/
 # PyPI configuration file
 .pypirc
 # Cursor
 #  Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
 #  exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
 #  refer to https://docs.cursor.com/context/ignore-files
 .cursorignore
 .cursorindexingignore
 # Marimo
 marimo/_static/
 marimo/_lsp/
 __marimo__/
 # ===========================================
 # PROJECT-SPECIFIC IGNORES
 # ===========================================
 # Generated training data - DO NOT UPLOAD
 # Contains large synthetic spectra files (10k-100k samples)
 data/
 # Model checkpoints and weights
 models/
 checkpoints/
 *.pth
 *.pt
 *.onnx
 *.h5
 *.keras
 # TensorBoard and training logs
 tensorboard/
 runs/
 logs/
 # Temporary files
 tmp/
 temp/
--- a/train/vega_ml/README.md
+++ b/train/vega_ml/README.md
@ -0,0 +1,247 @@
 # ML for Isotope Identification
 A machine learning system for identifying radioactive isotopes from gamma-ray spectra captured by Radiacode scintillation detectors.
 ## Project Status
 ✅ **Completed:** Synthetic gamma spectra generation system  
 ✅ **Completed:** Vega ML model architecture (CNN-FCNN hybrid)  
 ✅ **Completed:** Training pipeline with GPU support  
 ✅ **Completed:** Inference engine  
 🔲 **Next:** Generate large training dataset (10,000-100,000 samples)  
 🔲 **Future:** Real-time inference on Radiacode devices
 ---
 ## Overview
 This project aims to build a neural network that can identify radioactive isotopes from gamma spectra. Since collecting real gamma spectra requires radioactive sources and is expensive/regulated, we generate **synthetic training data** based on realistic physics models.
 ### Target Hardware
 - **Training:** NVIDIA RTX 5090 GPU (requires PyTorch nightly with CUDA 12.8)
 - **Inference:** Radiacode 101, 102, 103, 103G, 110 scintillation detectors
 ### Data Format
 - **Input:** 2D spectrograms (time intervals × 1023 energy channels)
 - **Output:** Multi-label isotope classification with activity estimation
 ---
 ## Quick Start
 ### Installation
 ```bash
 # Create virtual environment
 python -m venv .venv
 .venv\Scripts\activate  # Windows
 # or: source .venv/bin/activate  # Linux/Mac
 # Install dependencies
 pip install numpy scipy pillow
 # Install PyTorch (nightly for RTX 5090/Blackwell support)
 pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
 ```
 ### Generate Synthetic Data
 ```bash
 # Generate 10 test samples
 python -m synthetic_spectra.generate_spectra
 ```
 ### Train the Model
 ```bash
 # Quick test run (5 epochs, small dataset)
 python training/vega/run_training.py --test
 # Full training
 python training/vega/run_training.py --epochs 100 --batch-size 32
 ```
 ### Run Inference
 ```bash
 # Run inference on synthetic data
 python inference/run_inference.py --model models/vega_best.pt --data data/synthetic
 ```
 ---
 ## Vega Model Architecture
 **Vega** is a CNN-FCNN hybrid model optimized for gamma spectrum isotope identification, based on research showing 99%+ accuracy on similar tasks.
 ### Architecture Details
 | Component | Configuration |
 |-----------|---------------|
 | Input | 1023 energy channels |
 | CNN Backbone | 3 ConvBlocks [64, 128, 256 channels] |
 | Kernel Size | 7 (captures spectral features) |
 | FC Layers | [512, 256] with dropout |
 | Output Heads | Dual: Classification (82 isotopes) + Regression (activity) |
 | Total Parameters | 34.5M |
 | Activation | LeakyReLU + BatchNorm |
 ### Training Features
 - **Mixed Precision (AMP):** Faster training on modern GPUs
 - **Multi-task Learning:** Simultaneous isotope ID + activity estimation
 - **Loss Function:** BCE (classification) + Huber (regression)
 - **LR Scheduling:** ReduceLROnPlateau with early stopping
 ---
 ## Synthetic Spectra Generation
 ### Features
 - **82 isotopes** with accurate gamma emission lines
 - **Realistic physics:** Gaussian peaks, Poisson noise, Compton continuum, environmental background
 - **Multiple detector models:** Radiacode 101, 102, 103, 103G, 110 with correct FWHM and energy ranges
 - **Configurable variation:** Activity levels, measurement durations, isotope combinations
 ### Sample Distribution
 | Type | Proportion | Description |
 |------|------------|-------------|
 | Single isotope | 40% | One source + background |
 | Dual isotope | 30% | Two sources blended |
 | Multi isotope | 20% | 3-5 sources combined |
 | Background only | 10% | Environmental only |
 ### Scaling Up
 Edit `synthetic_spectra/generate_spectra.py` to generate larger datasets:
 ```python
 generate_training_batch(
    n_samples=100000,  # Generate 100k samples
    output_dir=Path("data/synthetic/spectra"),
    detector_type="radiacode_103"
 )
 ```
 ---
 ## Project Structure
 ```
 ml-for-isotope-identification/
 ├── README.md                    # This file
 ├── agents.md                    # AI agent context documentation
 ├── .gitignore                   # Git ignore rules
 │
 ├── synthetic_spectra/           # Spectrum generation package
 │   ├── __init__.py
 │   ├── config.py                # Detector configurations
 │   ├── generator.py             # Main generation logic
 │   ├── generate_spectra.py      # CLI batch generation
 │   ├── ground_truth/
 │   │   ├── isotope_data.py      # 82 isotopes database
 │   │   └── decay_chains.py      # Decay chain definitions
 │   └── physics/
 │       └── spectrum_physics.py  # Physics calculations
 │
 ├── training/                    # Training infrastructure
 │   └── vega/                    # Vega model package
 │       ├── __init__.py
 │       ├── isotope_index.py     # Isotope ↔ index mapping
 │       ├── model.py             # VegaModel architecture
 │       ├── dataset.py           # PyTorch Dataset/DataLoader
 │       ├── train.py             # Training loop & utilities
 │       └── run_training.py      # CLI training script
 │
 ├── inference/                   # Inference engine
 │   ├── vega_inference.py        # VegaInference class
 │   └── run_inference.py         # CLI inference script
 │
 ├── models/                      # Saved model checkpoints
 │   ├── vega_best.pt             # Best validation loss
 │   ├── vega_final.pt            # Final epoch
 │   └── vega_history.json        # Training metrics
 │
 └── data/                        # Generated data (git-ignored)
    └── synthetic/
        └── spectra/
 ```
 ---
 ## Technical Details
 ### Detector Specifications
 | Model | Crystal | FWHM @ 662 keV | Energy Range | Channels |
 |-------|---------|----------------|--------------|----------|
 | Radiacode 101 | CsI(Tl) | 9.0% | 20-3000 keV | 1024 |
 | Radiacode 102 | CsI(Tl) | 9.5% | 20-3000 keV | 1024 |
 | Radiacode 103 | CsI(Tl) | 8.4% | 20-3000 keV | 1024 |
 | Radiacode 103G | GAGG(Ce) | 7.4% | 20-3000 keV | 1024 |
 | Radiacode 110 | CsI(Tl) | 8.4% | 20-3000 keV | 1024 |
 ### Physics Model
 - **Peak shape:** Gaussian with FWHM scaling as √(E/662)
 - **Expected counts:** λ = A × t × I × ε × T
 - **Noise:** Poisson counting statistics
 - **Background:** Exponential continuum + environmental isotopes (K-40, Pb-214, Bi-214, etc.)
 ### Isotope Categories
 - Natural background (K-40, Ra-226, Rn-222)
 - Decay chains (U-238, Th-232, U-235)
 - Calibration sources (Am-241, Cs-137, Co-60, Ba-133, Eu-152)
 - Medical isotopes (Tc-99m, F-18, I-131, Ga-68)
 - Industrial sources (Ir-192, Se-75)
 - Reactor fallout (Cs-134, Cs-137, Sr-90)
 ---
 ## Development
 ### Dependencies
 ```
 numpy>=1.24.0
 scipy>=1.10.0
 pillow>=9.0.0
 torch>=2.11.0 (nightly with CUDA 12.8 for RTX 5090)
 ```
 ### GPU Support
 The RTX 5090 (Blackwell architecture, sm_120) requires PyTorch nightly builds with CUDA 12.8:
 ```bash
 pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
 ```
 ### For AI Agents
 See [agents.md](agents.md) for comprehensive documentation on:
 - System architecture and design decisions
 - Physics model implementation details
 - Vega model architecture and training
 - Configuration options and variation strategies
 ---
 ## TODO
 - [x] ~~Push to repository~~ - Initial commit with generation system
 - [x] ~~Create PyTorch DataLoader for training~~
 - [x] ~~Implement CNN-FCNN model architecture (Vega)~~
 - [x] ~~Create training script with logging~~
 - [x] ~~Implement inference module~~
 - [ ] Generate large training dataset (100k samples)
 - [ ] Train model to convergence
 - [ ] Add data augmentation pipeline
 - [ ] Add model evaluation metrics & confusion matrix
 - [ ] Implement real-time inference module
 - [ ] Create Radiacode device integration
 ---
 ## License
 [TBD]
 ---
 ## Acknowledgments
 - Radiacode for device specifications
 - IAEA Nuclear Data Services for isotope data
 - NNDC at Brookhaven National Laboratory
 - Wang et al. research on CNN-FCNN for gamma spectroscopy
--- a/train/vega_ml/agents.md
+++ b/train/vega_ml/agents.md
@ -0,0 +1,412 @@
 # Agents.md - AI Agent Context for ML Isotope Identification
 This document provides comprehensive context for AI agents working on this project. It describes the system architecture, purpose, configuration options, and implementation details of the synthetic gamma spectra generation and ML training systems.
 ## Project Purpose
 This project generates **synthetic gamma-ray spectra** for training machine learning models to perform **isotope identification**. The goal is to create a neural network that can identify radioactive isotopes from gamma spectra captured by consumer-grade scintillation detectors (Radiacode devices).
 ### Why Synthetic Data?
 1. **Real gamma spectra are expensive/dangerous to collect** - requires radioactive sources, permits, and safety protocols
 2. **Need massive datasets** - ML models require 10,000-100,000+ training samples
 3. **Controlled variation** - can systematically vary activities, durations, isotope combinations
 4. **Ground truth labels** - perfect annotations impossible with real-world data
 5. **Reproducibility** - can regenerate datasets with different parameters
 ## System Architecture
 ```
 ml-for-isotope-identification/
 ├── synthetic_spectra/           # Spectrum generation package
 │   ├── __init__.py              # Package initialization
 │   ├── config.py                # Detector configurations (Radiacode 101-110)
 │   ├── generator.py             # Main SpectrumGenerator class
 │   ├── generate_spectra.py      # CLI batch generation script
 │   ├── ground_truth/
 │   │   ├── __init__.py
 │   │   ├── isotope_data.py      # 82 isotopes with gamma emission lines
 │   │   └── decay_chains.py      # U-238, Th-232, U-235, Cs-137 chains
 │   └── physics/
 │       ├── __init__.py
 │       └── spectrum_physics.py  # Physics calculations for peak generation
 │
 ├── training/                    # ML training infrastructure
 │   ├── __init__.py
 │   └── vega/                    # Vega model package
 │       ├── __init__.py
 │       ├── isotope_index.py     # Bidirectional isotope ↔ index mapping
 │       ├── model.py             # VegaModel CNN-FCNN architecture
 │       ├── dataset.py           # PyTorch Dataset and DataLoader
 │       ├── train.py             # VegaTrainer and training loop
 │       └── run_training.py      # CLI training entry point
 │
 ├── inference/                   # Inference engine
 │   ├── vega_inference.py        # VegaInference class for predictions
 │   └── run_inference.py         # CLI inference script
 │
 ├── models/                      # Saved model checkpoints
 │   ├── vega_best.pt             # Best validation loss checkpoint
 │   ├── vega_final.pt            # Final epoch checkpoint
 │   ├── vega_history.json        # Training metrics history
 │   └── vega_isotope_index.txt   # Isotope index mapping
 │
 └── data/synthetic/spectra/      # Generated training data
 ```
 ## Core Concepts
 ### 1. Spectrum Representation
 Each spectrum is a **2D numpy array**:
 - **X-axis (columns):** 1023 energy channels mapping to 20 keV - 3000 keV
 - **Y-axis (rows):** Time intervals (1-second bins)
 - **Values:** Normalized counts [0, 1]
 This creates an "image-like" representation suitable for CNN-based models.
 ### 2. Physics Model
 The generation follows realistic gamma spectroscopy physics:
 #### Peak Generation (Gaussian)
 ```
 G(E) = (λ / (σ√(2π))) × exp(-(E - E₀)² / (2σ²))
 ```
 Where:
 - `E₀` = gamma line energy (keV)
 - `σ = FWHM / 2.355` (standard deviation)
 - `λ = A × t × I × ε × T` (expected counts)
  - `A` = activity (Bq)
  - `t` = counting time (s)
  - `I` = branching ratio (emission probability)
  - `ε` = detector efficiency
  - `T` = geometric factor
 #### FWHM Scaling
 ```
 FWHM(E) = FWHM_662 × √(E / 662)
 ```
 Resolution degrades at higher energies following square-root scaling.
 #### Background Model
 - **Exponential continuum:** `B(E) = B₀ × exp(-E / E_char)`
 - **Environmental isotopes:** K-40, Pb-214, Bi-214, Pb-212, Tl-208, Ac-228
 - **Compton continuum:** Simplified model for scattered photons
 #### Statistical Noise
 **Poisson counting statistics** applied to all counts:
 ```
 observed = Poisson(expected)
 ```
 ### 3. Detector Configurations
 | Model | Crystal | FWHM @ 662 keV | Energy Range |
 |-------|---------|----------------|--------------|
 | radiacode_101 | CsI(Tl) | 9.0% | 20-3000 keV |
 | radiacode_102 | CsI(Tl) | 9.5% | 20-3000 keV |
 | radiacode_103 | CsI(Tl) | 8.4% | 20-3000 keV |
 | radiacode_103g | GAGG(Ce) | 7.4% | 20-3000 keV |
 | radiacode_110 | CsI(Tl) | 8.4% | 20-3000 keV |
 ### 4. Isotope Database
 **82 isotopes** across categories:
 - `NATURAL_BACKGROUND`: K-40, Ra-226, Rn-222
 - `PRIMORDIAL`: U-238, U-235, Th-232
 - `COSMOGENIC`: Be-7, Na-22, C-14
 - `U238_CHAIN`: Pa-234m, Th-234, Ra-226, Pb-214, Bi-214, Pb-210, Po-210
 - `TH232_CHAIN`: Ac-228, Ra-224, Pb-212, Bi-212, Tl-208
 - `U235_CHAIN`: Pa-231, Th-227, Ra-223, Rn-219, Pb-211, Bi-211
 - `CALIBRATION`: Am-241, Ba-133, Cs-137, Co-57, Co-60, Eu-152, Na-22, Mn-54
 - `INDUSTRIAL`: Ir-192, Se-75, Cd-109, I-131, Y-90
 - `MEDICAL`: Tc-99m, F-18, Ga-67, Ga-68, In-111, I-123, I-125, Tl-201, Lu-177
 - `REACTOR_FALLOUT`: Cs-134, Cs-137, I-131, Sr-90, Zr-95, Nb-95, Ru-103, Ru-106, Ce-141, Ce-144
 - `ACTIVATION`: Fe-59, Cr-51, Zn-65, Ag-110m, Sb-124, Sb-125
 ---
 ## Vega Model Architecture
 **Vega** is the primary ML model for isotope identification, using a CNN-FCNN hybrid architecture based on research showing 99%+ accuracy on gamma spectroscopy tasks.
 ### Model Design
 ```python
 VegaModel (34.5M parameters)
 ├── CNN Backbone
 │   ├── ConvBlock1: Conv1d(1→64, k=7) → BN → LeakyReLU → MaxPool
 │   ├── ConvBlock2: Conv1d(64→128, k=7) → BN → LeakyReLU → MaxPool
 │   └── ConvBlock3: Conv1d(128→256, k=7) → BN → LeakyReLU → MaxPool
 ├── Flatten
 ├── FC Layers
 │   ├── Linear → BN → LeakyReLU → Dropout(0.3)
 │   └── Linear → BN → LeakyReLU → Dropout(0.3)
 └── Dual Output Heads
    ├── Classifier: Linear(256→82) [logits for BCEWithLogitsLoss]
    └── Regressor: Linear(256→82) → ReLU [activity in Bq]
 ```
 ### Key Configuration (VegaConfig)
 ```python
@dataclass
 class VegaConfig:
    num_channels: int = 1023           # Input spectrum channels
    num_isotopes: int = 82             # Output classes
    cnn_channels: List[int] = [64, 128, 256]
    kernel_size: int = 7               # Captures spectral features
    fc_hidden_dims: List[int] = [512, 256]
    dropout_rate: float = 0.3
    leaky_relu_slope: float = 0.01
    max_activity_bq: float = 1000.0    # Activity normalization
 ```
 ### Loss Function
 Multi-task loss combining classification and regression:
 ```python
 total_loss = BCE_weight * BCEWithLogitsLoss(logits, presence) 
           + Huber_weight * HuberLoss(pred_activity, true_activity)
 ```
 - **BCEWithLogitsLoss:** AMP-safe, applies sigmoid internally
 - **HuberLoss:** Robust to activity outliers
 - Default weights: classification=1.0, regression=0.1
 ### Training Configuration
 ```python
@dataclass
 class TrainingConfig:
    data_dir: str = "data/synthetic"
    model_dir: str = "models"
    epochs: int = 100
    batch_size: int = 32
    learning_rate: float = 1e-3
    weight_decay: float = 1e-5
    use_amp: bool = True               # Mixed precision on GPU
    early_stopping_patience: int = 15
    lr_scheduler_patience: int = 5
    lr_scheduler_factor: float = 0.5
 ```
 ### GPU Support
 **RTX 5090 (Blackwell sm_120) requires PyTorch nightly with CUDA 12.8:**
 ```bash
 pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
 ```
 The training and inference scripts automatically test CUDA compatibility and fall back to CPU if needed.
 ---
 ## Configuration & Variation
 ### SpectrumConfig Parameters
 ```python
@dataclass
 class SpectrumConfig:
    detector_type: str = "radiacode_103"    # Which detector to simulate
    duration_seconds: int = 300              # Total measurement time
    interval_seconds: int = 1                # Time bin size (always 1s)
    include_background: bool = True          # Add environmental background
    background_scale: float = 1.0            # Background intensity multiplier
    noise_enabled: bool = True               # Apply Poisson statistics
    normalize: bool = True                   # Normalize to [0, 1]
 ```
 ### Variation in Generated Data
 The `generate_training_batch()` function creates varied samples:
 | Category | Proportion | Description |
 |----------|------------|-------------|
 | Single isotope | 40% | One source isotope + background |
 | Dual isotope | 30% | Two source isotopes blended |
 | Multi isotope | 20% | 3-5 isotopes combined |
 | Background only | 10% | Environmental background only |
 #### Activity Ranges
 - **Activity:** 10-500 Bq (randomized per isotope)
 - **Duration:** 60-600 seconds (randomized)
 - **Background scale:** 0.5-2.0× (randomized)
 #### Isotope Pool
 Common isotopes used for generation:
 ```python
 ISOTOPE_POOL = [
    "Am-241", "Ba-133", "Cs-137", "Co-57", "Co-60",
    "Eu-152", "Na-22", "Mn-54", "K-40", "Ra-226",
    "Th-232", "U-238", "I-131", "Tc-99m", "Ir-192"
 ]
 ```
 ## Output Format
 ### Directory Structure
 ```
 data/synthetic/spectra/
 ├── {uuid}_spectrum.npy      # Numpy array (time × channels)
 ├── {uuid}_spectrum.png      # Visualization image
 └── labels.json              # Metadata for all samples
 ```
 ### Labels JSON Schema
 ```json
 {
  "sample_id": {
    "isotopes": [
      {
        "name": "Cs-137",
        "activity_bq": 123.45,
        "category": "CALIBRATION"
      }
    ],
    "background_isotopes": ["K-40", "Pb-214", ...],
    "detector": "radiacode_103",
    "duration_seconds": 300,
    "num_intervals": 300,
    "background_scale": 1.2,
    "generation_timestamp": "2025-01-24T..."
  }
 }
 ```
 ### Numpy Array Details
 - **Shape:** `(num_intervals, 1023)`
 - **Dtype:** `float64`
 - **Range:** `[0.0, 1.0]` (normalized)
 - **Channel mapping:** `channel_i → 20 + i × (3000-20)/1023 keV`
 ## Usage
 ### Generate Test Batch (10 samples)
 ```bash
 python -m synthetic_spectra.generate_spectra
 ```
 ### Generate Large Training Set
 Edit `generate_spectra.py`:
 ```python
 generate_training_batch(
    n_samples=100000,
    output_dir=Path("data/synthetic/spectra"),
    detector_type="radiacode_103"
 )
 ```
 ### Programmatic Generation
 ```python
 from synthetic_spectra.generator import SpectrumGenerator, SpectrumConfig, IsotopeSource
 from synthetic_spectra.config import RADIACODE_CONFIGS
 generator = SpectrumGenerator(RADIACODE_CONFIGS["radiacode_103"])
 config = SpectrumConfig(
    duration_seconds=300,
    include_background=True,
    background_scale=1.0
 )
 sources = [
    IsotopeSource(isotope_name="Cs-137", activity_bq=100.0),
    IsotopeSource(isotope_name="Co-60", activity_bq=50.0)
 ]
 spectrum = generator.generate_spectrum(sources, config)
 # spectrum.data is the 2D numpy array
 # spectrum.metadata contains all generation parameters
 ```
 ## Future Enhancements
 ### Planned Improvements
 1. **Compton edge modeling** - More realistic continuum shapes
 2. **Pile-up effects** - High count rate distortions
 3. **Gain drift simulation** - Energy calibration shifts over time
 4. **Source geometry** - Distance/shielding effects
 5. **Decay during measurement** - Short-lived isotope activity changes
 6. **Data augmentation** - Channel shifts, noise injection
 7. **Confusion matrix analysis** - Per-isotope performance metrics
 ## Key Files for Modification
 | File | Purpose | When to Modify |
 |------|---------|----------------|
 | `synthetic_spectra/config.py` | Detector specs | Add new detector types |
 | `synthetic_spectra/ground_truth/isotope_data.py` | Isotope database | Add isotopes, update gamma lines |
 | `synthetic_spectra/ground_truth/decay_chains.py` | Chain relationships | Add decay chain logic |
 | `synthetic_spectra/physics/spectrum_physics.py` | Physics model | Improve realism |
 | `synthetic_spectra/generator.py` | Generation logic | Add features, change output format |
 | `synthetic_spectra/generate_spectra.py` | Batch generation | Adjust sample distribution |
 | `training/vega/model.py` | Vega architecture | Modify CNN/FC layers, heads |
 | `training/vega/train.py` | Training loop | Change optimization, callbacks |
 | `training/vega/dataset.py` | Data loading | Add augmentation, preprocessing |
 | `inference/vega_inference.py` | Inference engine | Modify prediction pipeline |
 ## Usage
 ### Generate Synthetic Data
 ```bash
 python -m synthetic_spectra.generate_spectra
 ```
 ### Train Model
 ```bash
 # Quick test (5 epochs)
 python training/vega/run_training.py --test
 # Full training
 python training/vega/run_training.py --epochs 100 --batch-size 32
 # Without AMP (if GPU issues)
 python training/vega/run_training.py --no-amp
 ```
 ### Run Inference
 ```bash
 python inference/run_inference.py --model models/vega_best.pt --data data/synthetic
 ```
 ### Programmatic Usage
 ```python
 # Training
 from training.vega.train import train_vega, TrainingConfig
 from training.vega.model import VegaConfig
 config = TrainingConfig(epochs=100, batch_size=32)
 model_config = VegaConfig()
 model, results = train_vega(config, model_config)
 # Inference
 from inference.vega_inference import VegaInference
 inference = VegaInference("models/vega_best.pt")
 prediction = inference.predict_from_file("spectrum.npy", threshold=0.5)
 print(prediction.summary())
 ```
 ## Dependencies
 ```
 numpy>=1.24.0      # Array operations
 scipy>=1.10.0      # Statistical functions
 pillow>=9.0.0      # PNG image generation
 torch>=2.11.0      # PyTorch (nightly for RTX 5090)
 ```
 ## References
 - Radiacode device specifications: https://radiacode.com/
 - Gamma spectroscopy physics: Knoll, "Radiation Detection and Measurement"
 - Isotope data: IAEA Nuclear Data Services, NNDC at Brookhaven
 - CNN-FCNN architecture: Wang et al. research on gamma spectroscopy ML
 ---
 *This document should be updated whenever significant changes are made to the generation or training systems.*
--- a/train/vega_ml/analyzer/README.md
+++ b/train/vega_ml/analyzer/README.md
@ -0,0 +1,37 @@
 # Analyzer
 This folder contains small utilities for inspecting the **last inference request/response** processed by your middleware API.
 ## Fetch last inference (PowerShell)
 Runs against the server’s log endpoints:
 - `GET /logs?limit=...`
 - `GET /logs/{id}`
 ### Default (uses `http://99.122.58.29:443`)
 ```powershell
 powershell -ExecutionPolicy Bypass -File analyzer/fetch_last_inference.ps1
 ```
 ### Override the base URL
 ```powershell
 powershell -ExecutionPolicy Bypass -File analyzer/fetch_last_inference.ps1 -BaseUrl "http://99.122.58.29:443"
 ```
 ### Output
 Artifacts are written to:
 - `analyzer/out/last_inference_summary_*.json`
 - `analyzer/out/last_inference_detail_*.json`
 If the server includes `request` and/or `response` fields in the detail payload, those are also saved as:
 - `analyzer/out/last_inference_request_*.json`
 - `analyzer/out/last_inference_response_*.json`
 ## Notes
 - This script assumes the log list items have at least: `id`, `method`, `path`.
 - It selects the most recent matching `POST` to an `identify` endpoint.
 - If your server uses HTTPS with a real cert, switch the URL to `https://...`.
--- a/train/vega_ml/analyzer/analyze_last_inference.py
+++ b/train/vega_ml/analyzer/analyze_last_inference.py
@ -0,0 +1,168 @@
 """Analyze a captured middleware inference log (request+response).
 Reads a JSON file like analyzer/out/last_inference_detail_*.json produced by
 analyzer/fetch_last_inference.ps1 and prints diagnostics focused on:
 - input spectrum shape/range
 - quantization / clamping artifacts
 - energy-window evidence for uranium chain peaks
 - server output probabilities for U-234/U-235/U-238
 Usage:
  python analyzer/analyze_last_inference.py --path analyzer/out/last_inference_detail_*.json
 Exit code is always 0; this is a reporting tool.
 """
 from __future__ import annotations
 import argparse
 import json
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Iterable
 import numpy as np
@dataclass(frozen=True)
 class ModelGrid:
    emin_kev: float = 20.0
    emax_kev: float = 3000.0
    num_channels: int = 1023
    @property
    def step_kev(self) -> float:
        return (self.emax_kev - self.emin_kev) / self.num_channels
    def energy_to_channel(self, energy_kev: float) -> int:
        # Mirror how the repo’s helper scripts commonly approximate channel index.
        ch = int(round((energy_kev - self.emin_kev) / self.step_kev))
        return int(np.clip(ch, 0, self.num_channels - 1))
    def channel_to_energy(self, channel: int) -> float:
        return self.emin_kev + channel * self.step_kev
 def _head(values: Iterable[float], n: int = 12) -> str:
    vals = list(values)
    return ", ".join(f"{v:.6g}" for v in vals[:n])
 def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--path", required=True, help="Path to last_inference_detail_*.json")
    ap.add_argument("--window", type=int, default=2, help="Half-window (channels) for peak window sum")
    args = ap.parse_args()
    path = Path(args.path)
    # PowerShell may write UTF-8 with BOM; handle both.
    obj = json.loads(path.read_text(encoding="utf-8-sig"))
    req = obj.get("request", {}).get("json", {})
    resp = obj.get("response", {}).get("json", {})
    spectrum = req.get("spectrum")
    if spectrum is None:
        raise SystemExit("No request.json.spectrum found in the log detail JSON")
    arr = np.asarray(spectrum, dtype=np.float64)
    grid = ModelGrid()
    print(f"file: {path}")
    print(f"request spectrum shape: {arr.shape}")
    print(f"request spectrum range: min={arr.min():.6g} max={arr.max():.6g} mean={arr.mean():.6g}")
    # Quantization / clamping check
    flat = arr.ravel()
    # Sample to keep this quick on huge logs
    sample = flat[:: max(1, flat.size // 200_000)]
    uniq = np.unique(np.round(sample, 12))
    print(f"unique(sampled,rounded) count={len(uniq)}")
    print(f"unique head: {_head(uniq)}")
    # “Looks like quantized steps” heuristic
    if len(uniq) <= 64:
        steps = np.diff(uniq)
        steps = steps[steps > 0]
        if steps.size:
            step_med = float(np.median(steps))
            print(f"quantization hint: median_step≈{step_med:.6g}")
    # Channel energy distribution
    channel_sums = arr.sum(axis=0)
    nonzero_channels = int(np.count_nonzero(channel_sums))
    print(f"channels with any signal: {nonzero_channels}/{grid.num_channels} ({nonzero_channels/grid.num_channels:.1%})")
    # Top channels (where energy actually is)
    top_k = 12
    top_idx = np.argsort(channel_sums)[::-1][:top_k]
    print("top channels by sum (time-collapsed):")
    for ch in top_idx:
        s = float(channel_sums[ch])
        e = grid.channel_to_energy(int(ch))
        print(f"  ch={int(ch):4d}  E≈{e:7.1f} keV  sum={s:.6g}")
    # Window-sum helper
    w = int(args.window)
    def window_sum(center_ch: int) -> float:
        lo = max(0, center_ch - w)
        hi = min(grid.num_channels - 1, center_ch + w)
        return float(arr[:, lo : hi + 1].sum())
    # Evidence around key uranium chain energies.
    energies = {
        "U-238 49.6": 49.6,
        "U-238 113.5": 113.5,
        "Ra-226 186.2": 186.2,
        "Pb-214 295.2": 295.2,
        "Pb-214 351.9": 351.9,
        "Bi-214 609.3": 609.3,
        "Bi-214 1120": 1120.3,
        "Bi-214 1764": 1764.5,
        "Tl-208 2614": 2614.5,
    }
    print(f"energy-window sums (±{w} channels):")
    for name, e in energies.items():
        ch = grid.energy_to_channel(e)
        s = window_sum(ch)
        print(f"  {name:12s} ch={ch:4d}  window_sum={s:.6g}")
    # Server response: uranium-related probabilities
    names = resp.get("isotope_names") or []
    probs = resp.get("probabilities") or []
    thr = resp.get("threshold_used")
    if names and probs and len(names) == len(probs):
        name_to_idx = {n: i for i, n in enumerate(names)}
        print("server output (selected):")
        if thr is not None:
            print(f"  threshold_used={thr}")
        for iso in ("U-234", "U-235", "U-238", "Pb-214", "Bi-214", "Ra-226", "Th-232", "Th-234"):
            i = name_to_idx.get(iso)
            if i is None:
                print(f"  {iso}: not in isotope_names")
            else:
                p = float(probs[i])
                flag = "PRESENT" if (thr is not None and p >= float(thr)) else "-"
                print(f"  {iso:6s} idx={i:2d} prob={p:.6g} {flag}")
        # Top-10
        pairs = sorted(((n, float(probs[i])) for i, n in enumerate(names)), key=lambda x: x[1], reverse=True)[:10]
        print("top-10 probabilities:")
        for n, p in pairs:
            print(f"  {n:8s} {p:.6g}")
    else:
        print("No response.json.isotope_names/probabilities found (or lengths mismatch).")
    print("\nInterpretation hints:")
    print("- If the uranium/daughter energy-window sums are ~0, the client is likely rebinning/calibrating incorrectly, zeroing high-energy channels, or over-normalizing/quantizing.")
    print("- If the spectrum is already [0,1] with very few unique values, the client is likely clamping/quantizing (lossy) before sending to the server.")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/train/vega_ml/analyzer/fetch_last_inference.ps1
+++ b/train/vega_ml/analyzer/fetch_last_inference.ps1
@ -0,0 +1,89 @@
 param(
  [Parameter(Mandatory=$false)]
  [string]$BaseUrl = "http://99.122.58.29:443",
  [Parameter(Mandatory=$false)]
  [int]$Limit = 200,
  [Parameter(Mandatory=$false)]
  [string]$OutDir = "analyzer/out"
 )
 $ErrorActionPreference = "Stop"
 function Ensure-Directory([string]$Path) {
  if (-not (Test-Path -LiteralPath $Path)) {
    New-Item -ItemType Directory -Force -Path $Path | Out-Null
  }
 }
 function Save-Json([object]$Obj, [string]$Path) {
  # Windows PowerShell 5.1 enforces a max ConvertTo-Json depth of 100.
  # Use the maximum allowed, and fall back to a shallower depth if needed.
  try {
    ($Obj | ConvertTo-Json -Depth 100) | Set-Content -Encoding UTF8 -Path $Path
  } catch {
    ($Obj | ConvertTo-Json -Depth 50) | Set-Content -Encoding UTF8 -Path $Path
  }
 }
 Write-Host "BaseUrl: $BaseUrl"
 Write-Host "Limit:   $Limit"
 Ensure-Directory $OutDir
 $logsUrl = "$BaseUrl/logs?limit=$Limit"
 Write-Host "Fetching: $logsUrl"
 $list = Invoke-RestMethod -Method GET -Uri $logsUrl
 if (-not $list -or -not $list.logs) {
  throw "Unexpected response: missing .logs from $logsUrl"
 }
 # Candidate definition: last POST to one of the inference endpoints.
 $candidates = @($list.logs | Where-Object {
  $_.method -eq 'POST' -and (
    $_.path -like '/identify*' -or
    $_.path -like '/isotope/*identify*' -or
    $_.path -like '/api/isotope/*identify*'
  )
 })
 if ($candidates.Count -eq 0) {
  throw "No isotope inference logs found in last $Limit."
 }
 # Pick the most recent. Prefer a timestamp field if present, otherwise assume list already ordered.
 $candidate = $candidates | Sort-Object -Property @('timestamp','time','createdAt','created_at') -Descending | Select-Object -First 1
 if (-not $candidate.id) {
  throw "Candidate log entry has no .id. Keys: $($candidate.PSObject.Properties.Name -join ', ')"
 }
 $id = $candidate.id
 Write-Host "Selected log id: $id"
 $detailUrl = "$BaseUrl/logs/$id"
 Write-Host "Fetching: $detailUrl"
 $detail = Invoke-RestMethod -Method GET -Uri $detailUrl
 # Save raw artifacts for offline review.
 $stamp = Get-Date -Format "yyyyMMdd_HHmmss"
 $summaryPath = Join-Path $OutDir "last_inference_summary_$stamp.json"
 $detailPath  = Join-Path $OutDir "last_inference_detail_$stamp.json"
 Save-Json $candidate $summaryPath
 Save-Json $detail $detailPath
 Write-Host "Saved: $summaryPath"
 Write-Host "Saved: $detailPath"
 # Convenience: if the server includes request/response sub-objects, save them too.
 if ($detail.request) {
  Save-Json $detail.request (Join-Path $OutDir "last_inference_request_$stamp.json")
 }
 if ($detail.response) {
  Save-Json $detail.response (Join-Path $OutDir "last_inference_response_$stamp.json")
 }
 # Print a short on-screen hint.
 Write-Host "Done. Open the JSON files under $OutDir to inspect the last inference input/output."
--- a/train/vega_ml/docs/VEGA_2D_COMPLETE_GUIDE.md
+++ b/train/vega_ml/docs/VEGA_2D_COMPLETE_GUIDE.md
--- a/train/vega_ml/inference/init.py
+++ b/train/vega_ml/inference/init.py
@ -0,0 +1 @@
 # Inference module for running predictions with trained models
--- a/train/vega_ml/inference/run_inference.py
+++ b/train/vega_ml/inference/run_inference.py
@ -0,0 +1,93 @@
 #!/usr/bin/env python
 """
 Run Vega Inference
 Simple script to run inference with a trained Vega model.
 """
 import sys
 import argparse
 from pathlib import Path
 # Add project root to path
 PROJECT_ROOT = Path(__file__).parent.parent
 sys.path.insert(0, str(PROJECT_ROOT))
 from inference.vega_inference import run_inference_demo, VegaInference
 def main():
    parser = argparse.ArgumentParser(
        description="Run inference with trained Vega model"
    )
    parser.add_argument(
        "--model", "-m",
        type=str,
        default="models/vega_best.pt",
        help="Path to model checkpoint"
    )
    parser.add_argument(
        "--data", "-d",
        type=str,
        default="O:/master_data_collection/isotopev2",
        help="Path to data directory with spectra"
    )
    parser.add_argument(
        "--threshold", "-t",
        type=float,
        default=0.5,
        help="Detection threshold (0-1)"
    )
    parser.add_argument(
        "--spectrum", "-s",
        type=str,
        default=None,
        help="Path to a specific spectrum file to analyze"
    )
    args = parser.parse_args()
    # Make paths absolute
    model_path = Path(args.model)
    if not model_path.is_absolute():
        model_path = PROJECT_ROOT / model_path
    if args.spectrum:
        # Single spectrum inference
        spectrum_path = Path(args.spectrum)
        if not spectrum_path.is_absolute():
            spectrum_path = PROJECT_ROOT / spectrum_path
        print(f"\nLoading model from: {model_path}")
        inference = VegaInference(str(model_path))
        print(f"\nAnalyzing spectrum: {spectrum_path}")
        prediction = inference.predict_from_file(
            spectrum_path, 
            threshold=args.threshold
        )
        print("\n" + "=" * 60)
        print("PREDICTION RESULTS")
        print("=" * 60)
        print(prediction.summary())
        print("=" * 60)
    else:
        # Demo mode - analyze all spectra in data directory
        data_path = Path(args.data)
        if not data_path.is_absolute():
            data_path = PROJECT_ROOT / data_path
        run_inference_demo(
            str(model_path),
            str(data_path),
            threshold=args.threshold
        )
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/train/vega_ml/inference/vega_inference.py
+++ b/train/vega_ml/inference/vega_inference.py
@ -0,0 +1,406 @@
 """
 Vega Inference Script
 Load a trained Vega model and run inference on gamma spectra to identify
 isotopes and estimate their activities.
 """
 import sys
 import json
 import numpy as np
 import torch
 from pathlib import Path
 from typing import Dict, List, Optional, Union
 from dataclasses import dataclass, asdict
 # Add project root to path
 PROJECT_ROOT = Path(__file__).parent.parent
 sys.path.insert(0, str(PROJECT_ROOT))
 from training.vega.model import VegaModel, VegaConfig
 from training.vega.isotope_index import IsotopeIndex
@dataclass
 class IsotopePrediction:
    """Prediction for a single isotope."""
    name: str
    probability: float
    activity_bq: float
    present: bool
@dataclass
 class SpectrumPrediction:
    """Full prediction results for a spectrum."""
    isotopes: List[IsotopePrediction]
    num_present: int
    confidence: float
    threshold_used: float
    def to_dict(self) -> Dict:
        """Convert to dictionary."""
        return {
            'isotopes': [
                {
                    'name': iso.name,
                    'probability': round(iso.probability, 4),
                    'activity_bq': round(iso.activity_bq, 2),
                    'present': iso.present
                }
                for iso in self.isotopes
            ],
            'num_isotopes_detected': self.num_present,
            'confidence': round(self.confidence, 4),
            'threshold': self.threshold_used
        }
    def get_present_isotopes(self) -> List[IsotopePrediction]:
        """Get only isotopes predicted as present."""
        return [iso for iso in self.isotopes if iso.present]
    def summary(self) -> str:
        """Get a human-readable summary."""
        present = self.get_present_isotopes()
        if not present:
            return "No isotopes detected above threshold"
        lines = [f"Detected {len(present)} isotope(s):"]
        for iso in sorted(present, key=lambda x: -x.probability):
            lines.append(
                f"  - {iso.name}: {iso.probability*100:.1f}% confidence, "
                f"{iso.activity_bq:.1f} Bq"
            )
        return "\n".join(lines)
 class VegaInference:
    """
    Inference engine for the Vega model.
    Loads a trained model and provides methods for running predictions
    on gamma spectra.
    """
    def __init__(
        self,
        model_path: Union[str, Path],
        isotope_index_path: Optional[Union[str, Path]] = None,
        device: Optional[torch.device] = None
    ):
        """
        Initialize the inference engine.
        Args:
            model_path: Path to the saved model checkpoint
            isotope_index_path: Path to isotope index file. If None, will try
                               to find it in the same directory as the model.
            device: Device to run inference on. If None, uses CUDA if available.
        """
        self.model_path = Path(model_path)
        # Determine device with CUDA compatibility test
        if device is not None:
            self.device = device
        elif torch.cuda.is_available():
            # Test if CUDA actually works (RTX 5090/Blackwell may not be compatible)
            try:
                test_tensor = torch.zeros(1, device='cuda')
                _ = test_tensor + 1
                self.device = torch.device('cuda')
                print("Using CUDA for inference")
            except RuntimeError as e:
                if "no kernel image is available" in str(e):
                    print(f"CUDA device detected but not compatible (likely Blackwell arch)")
                    print("Falling back to CPU for inference")
                    self.device = torch.device('cpu')
                else:
                    raise
        else:
            self.device = torch.device('cpu')
            print("Using CPU for inference")
        # Load checkpoint
        print(f"Loading model from: {self.model_path}")
        self.checkpoint = torch.load(self.model_path, map_location=self.device)
        # Load model config
        model_config_dict = self.checkpoint['model_config']
        self.model_config = VegaConfig(**model_config_dict)
        # Create and load model
        self.model = VegaModel(self.model_config)
        self.model.load_state_dict(self.checkpoint['model_state_dict'])
        self.model = self.model.to(self.device)
        self.model.eval()
        # Load isotope index
        if isotope_index_path is None:
            # Try to find in same directory
            isotope_index_path = self.model_path.parent / "vega_isotope_index.txt"
        if Path(isotope_index_path).exists():
            self.isotope_index = IsotopeIndex.load(Path(isotope_index_path))
        else:
            # Use default
            from training.vega.isotope_index import get_default_isotope_index
            self.isotope_index = get_default_isotope_index()
            print("Warning: Using default isotope index")
        print(f"Model loaded successfully!")
        print(f"Device: {self.device}")
        print(f"Isotopes: {self.isotope_index.num_isotopes}")
    def preprocess_spectrum(
        self,
        spectrum: np.ndarray,
        normalize: bool = True
    ) -> torch.Tensor:
        """
        Preprocess a spectrum for inference.
        Args:
            spectrum: Input spectrum array. Can be:
                     - 1D: (channels,) - single spectrum
                     - 2D: (time, channels) - will be averaged over time
            normalize: Whether to normalize to [0, 1]
        Returns:
            Preprocessed tensor ready for model
        """
        # Handle 2D spectra
        if spectrum.ndim == 2:
            spectrum = spectrum.mean(axis=0)
        # Normalize
        if normalize and spectrum.max() > 0:
            spectrum = spectrum / spectrum.max()
        # Convert to tensor
        tensor = torch.tensor(spectrum, dtype=torch.float32)
        # Add batch dimension
        tensor = tensor.unsqueeze(0)
        return tensor.to(self.device)
    @torch.no_grad()
    def predict(
        self,
        spectrum: Union[np.ndarray, torch.Tensor],
        threshold: float = 0.5,
        return_all: bool = False
    ) -> SpectrumPrediction:
        """
        Run inference on a spectrum.
        Args:
            spectrum: Input spectrum (numpy array or tensor)
            threshold: Probability threshold for considering an isotope present
            return_all: If True, include all isotopes in output. If False,
                       only include those above threshold.
        Returns:
            SpectrumPrediction with isotope predictions
        """
        # Preprocess if numpy
        if isinstance(spectrum, np.ndarray):
            spectrum = self.preprocess_spectrum(spectrum)
        # Run model (outputs logits)
        logits, activities = self.model(spectrum)
        # Apply sigmoid to get probabilities
        probs = torch.sigmoid(logits)
        # Convert to numpy
        probs = probs.cpu().numpy()[0]
        activities = activities.cpu().numpy()[0]
        # Scale activities
        activities = activities * self.model_config.max_activity_bq
        # Create predictions
        isotopes = []
        for i in range(len(probs)):
            prob = float(probs[i])
            activity = float(activities[i])
            present = prob >= threshold
            if return_all or present:
                isotopes.append(IsotopePrediction(
                    name=self.isotope_index.index_to_name(i),
                    probability=prob,
                    activity_bq=activity if present else 0.0,
                    present=present
                ))
        # Calculate overall confidence (average of top predictions)
        present_isotopes = [iso for iso in isotopes if iso.present]
        if present_isotopes:
            confidence = np.mean([iso.probability for iso in present_isotopes])
        else:
            confidence = 1.0 - probs.max()  # Confidence in "background only"
        return SpectrumPrediction(
            isotopes=isotopes,
            num_present=len(present_isotopes),
            confidence=float(confidence),
            threshold_used=threshold
        )
    def predict_batch(
        self,
        spectra: List[np.ndarray],
        threshold: float = 0.5
    ) -> List[SpectrumPrediction]:
        """
        Run inference on multiple spectra.
        Args:
            spectra: List of spectrum arrays
            threshold: Probability threshold
        Returns:
            List of predictions
        """
        return [self.predict(s, threshold) for s in spectra]
    def predict_from_file(
        self,
        file_path: Union[str, Path],
        threshold: float = 0.5
    ) -> SpectrumPrediction:
        """
        Load a spectrum from a numpy file and run inference.
        Args:
            file_path: Path to .npy file
            threshold: Probability threshold
        Returns:
            SpectrumPrediction
        """
        spectrum = np.load(file_path)
        return self.predict(spectrum, threshold)
 def run_inference_demo(
    model_path: str,
    data_dir: str,
    threshold: float = 0.5
 ):
    """
    Demo function to run inference on test data.
    Args:
        model_path: Path to model checkpoint
        data_dir: Path to data directory with spectra
        threshold: Detection threshold
    """
    # Initialize inference engine
    inference = VegaInference(model_path)
    # Find spectra files
    data_path = Path(data_dir)
    spectra_dir = data_path / "spectra"
    if not spectra_dir.exists():
        print(f"Spectra directory not found: {spectra_dir}")
        return
    # Load labels for comparison
    labels_path = data_path / "labels.json"
    with open(labels_path, 'r') as f:
        labels = json.load(f)
    print("\n" + "=" * 70)
    print("VEGA INFERENCE DEMO")
    print("=" * 70)
    # Process each spectrum
    npy_files = list(spectra_dir.glob("*.npy"))
    print(f"\nFound {len(npy_files)} spectra to process\n")
    for npy_file in npy_files:
        # Extract sample ID from filename
        sample_id = npy_file.stem.replace("spectrum_", "")
        # Get ground truth
        if sample_id in labels['samples']:
            ground_truth = labels['samples'][sample_id]
            true_isotopes = ground_truth['isotopes']
            true_activities = ground_truth.get('source_activities_bq', {})
        else:
            true_isotopes = []
            true_activities = {}
        # Run prediction
        prediction = inference.predict_from_file(npy_file, threshold=threshold)
        # Display results
        print("-" * 70)
        print(f"Sample: {sample_id}")
        print(f"Ground Truth Isotopes: {true_isotopes if true_isotopes else 'Background only'}")
        if true_activities:
            activities_str = ", ".join(
                f"{k}: {v:.1f} Bq" for k, v in true_activities.items()
            )
            print(f"Ground Truth Activities: {activities_str}")
        print(f"\nPrediction:")
        print(prediction.summary())
        # Compare
        predicted_names = {iso.name for iso in prediction.get_present_isotopes()}
        true_names = set(true_isotopes)
        correct = predicted_names & true_names
        missed = true_names - predicted_names
        false_positives = predicted_names - true_names
        if correct:
            print(f"\n✓ Correctly identified: {correct}")
        if missed:
            print(f"✗ Missed: {missed}")
        if false_positives:
            print(f"! False positives: {false_positives}")
        print()
    print("=" * 70)
    print("Inference complete!")
    print("=" * 70)
 if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Run Vega model inference")
    parser.add_argument(
        "--model", "-m",
        type=str,
        default="models/vega_best.pt",
        help="Path to model checkpoint"
    )
    parser.add_argument(
        "--data", "-d",
        type=str,
        default="O:/master_data_collection/isotopev2",
        help="Path to data directory"
    )
    parser.add_argument(
        "--threshold", "-t",
        type=float,
        default=0.5,
        help="Detection threshold (0-1)"
    )
    args = parser.parse_args()
    # Make paths absolute if needed
    project_root = Path(__file__).parent.parent
    model_path = args.model if Path(args.model).is_absolute() else project_root / args.model
    data_path = args.data if Path(args.data).is_absolute() else project_root / args.data
    run_inference_demo(str(model_path), str(data_path), args.threshold)
--- a/train/vega_ml/inference/vega_portable_inference.py
+++ b/train/vega_ml/inference/vega_portable_inference.py
@ -0,0 +1,922 @@
 #!/usr/bin/env python
 """
 ================================================================================
 VEGA PORTABLE INFERENCE - Self-Contained Isotope Identification
 ================================================================================
 This is a FULLY SELF-CONTAINED inference script for the Vega gamma spectrum
 isotope identification model. You only need:
    1. This Python file (vega_portable_inference.py)
    2. A trained model checkpoint (.pt file)
    3. PyTorch, NumPy installed
 NO other project files are required. The model architecture, isotope index,
 and sample data are all embedded in this file.
 ================================================================================
 USAGE EXAMPLES:
 ================================================================================
 1. Basic inference with embedded sample data:
   python vega_portable_inference.py --model path/to/vega_best.pt
 2. Inference on a specific spectrum file:
   python vega_portable_inference.py --model vega_best.pt --spectrum my_spectrum.npy
 3. Programmatic usage:
   from vega_portable_inference import VegaInference, create_sample_spectrum
   inference = VegaInference("vega_best.pt")
   spectrum = create_sample_spectrum("Cs-137", activity_bq=100)
   result = inference.predict(spectrum)
   print(result.summary())
 ================================================================================
 INPUT FORMAT:
 ================================================================================
 The model expects gamma spectra in the following format:
 - NumPy array, shape: (1023,) for single spectrum OR (N, 1023) for time series
 - Values: Counts per channel (will be normalized automatically)
 - Energy range: 20 keV to 3000 keV across 1023 channels
 - Channel i corresponds to energy: E_i = 20 + i * (3000 - 20) / 1023 keV
 If you have a 2D time-series spectrum (N intervals × 1023 channels), it will
 be averaged over time automatically.
 ================================================================================
 OUTPUT FORMAT:
 ================================================================================
 The model returns a SpectrumPrediction object with:
 - isotopes: List of IsotopePrediction objects, each containing:
    - name: Isotope name (e.g., "Cs-137")
    - probability: Detection confidence [0, 1]
    - activity_bq: Estimated activity in Becquerels
    - present: Boolean, True if probability >= threshold
 - num_present: Count of detected isotopes
 - confidence: Overall prediction confidence
 - threshold_used: Detection threshold that was applied
 Methods:
 - .summary() - Human-readable text summary
 - .to_dict() - JSON-serializable dictionary
 - .get_present_isotopes() - List of only detected isotopes
 ================================================================================
 MODEL ARCHITECTURE:
 ================================================================================
 Vega uses a CNN-FCNN (Convolutional + Fully Connected Neural Network):
    Input (1023 channels)
           ↓
    ConvBlock 1: Conv1d(1→64) → BN → LeakyReLU → Conv1d(64→64) → BN → LeakyReLU → MaxPool → Dropout
           ↓
    ConvBlock 2: Conv1d(64→128) → BN → LeakyReLU → Conv1d(128→128) → BN → LeakyReLU → MaxPool → Dropout
           ↓
    ConvBlock 3: Conv1d(128→256) → BN → LeakyReLU → Conv1d(256→256) → BN → LeakyReLU → MaxPool → Dropout
           ↓
    Flatten
           ↓
    FC Classifier: Linear(→512) → BN → LeakyReLU → Dropout → Linear(→256) → BN → LeakyReLU → Dropout → Linear(→82)
           ↓                                                                                                ↓
    Sigmoid (multi-label isotope presence)                                    FC Regressor: → ReLU (activity Bq)
 Outputs:
 - 82 isotope presence probabilities (multi-label classification)
 - 82 activity estimates in Bq (regression)
 ================================================================================
 SUPPORTED ISOTOPES (82 total):
 ================================================================================
 CALIBRATION: Am-241, Ba-133, Cs-137, Co-57, Co-60, Eu-152, Na-22, Mn-54
 MEDICAL: Tc-99m, F-18, Ga-67, Ga-68, In-111, I-123, I-125, Tl-201, Lu-177
 INDUSTRIAL: Ir-192, Se-75, Cd-109, I-131, Y-90
 NATURAL: K-40, Ra-226, Th-232, U-238, Rn-222
 FALLOUT: Cs-134, Sr-90, Zr-95, Nb-95, Ru-103, Ru-106, Ce-141, Ce-144
 DECAY CHAINS: Full U-238, Th-232, U-235 series
 ================================================================================
 REQUIREMENTS:
 ================================================================================
 pip install torch numpy
 Optional (for visualization):
 pip install matplotlib scipy
 ================================================================================
 """
 import sys
 import json
 import math
 import numpy as np
 import torch
 import torch.nn as nn
 from pathlib import Path
 from dataclasses import dataclass, field, asdict
 from typing import Dict, List, Optional, Tuple, Union
 # =============================================================================
 # ISOTOPE DATABASE (Embedded - No external dependencies)
 # =============================================================================
 # Complete list of 82 isotopes supported by the model (alphabetically sorted)
 ISOTOPE_NAMES = [
    "Ac-228", "Ag-110m", "Am-241", "Ba-133", "Be-7", "Bi-207", "Bi-211",
    "Bi-212", "Bi-214", "C-14", "Cd-109", "Ce-141", "Ce-144", "Co-57",
    "Co-60", "Cr-51", "Cs-134", "Cs-137", "Eu-152", "Eu-154", "F-18",
    "Fe-59", "Ga-67", "Ga-68", "H-3", "I-123", "I-125", "I-131", "In-111",
    "Ir-192", "K-40", "Lu-177", "Mn-54", "Na-22", "Nb-95", "Pa-231",
    "Pa-234m", "Pb-210", "Pb-211", "Pb-212", "Pb-214", "Po-210", "Ra-223",
    "Ra-224", "Ra-226", "Rn-219", "Rn-222", "Ru-103", "Ru-106", "Sb-124",
    "Sb-125", "Se-75", "Sn-113", "Sr-85", "Sr-90", "Tc-99m", "Th-227",
    "Th-228", "Th-230", "Th-232", "Th-234", "Tl-201", "Tl-208", "U-234",
    "U-235", "U-238", "Y-90", "Zn-65", "Zr-95",
    # Additional isotopes to reach 82
    "Ba-140", "Br-82", "Ca-45", "Ca-47", "Cf-252", "Cl-36", "Cm-244",
    "Cu-64", "Gd-153", "Hg-203", "Np-237", "P-32", "Pu-239"
 ]
 # Gamma emission lines (keV) and branching ratios for key isotopes
 # Format: {isotope: [(energy_keV, branching_ratio), ...]}
 GAMMA_LINES = {
    "Am-241": [(59.54, 0.359)],
    "Ba-133": [(81.0, 0.329), (276.4, 0.071), (302.9, 0.183), (356.0, 0.620), (383.8, 0.089)],
    "Cs-137": [(661.7, 0.851)],
    "Co-57": [(122.1, 0.856), (136.5, 0.107)],
    "Co-60": [(1173.2, 0.999), (1332.5, 0.999)],
    "Eu-152": [(121.8, 0.284), (344.3, 0.265), (778.9, 0.129), (964.1, 0.146), (1112.1, 0.136), (1408.0, 0.210)],
    "Na-22": [(511.0, 1.798), (1274.5, 0.999)],
    "Mn-54": [(834.8, 0.9998)],
    "K-40": [(1460.8, 0.107)],
    "Ra-226": [(186.2, 0.036)],
    "Pb-214": [(295.2, 0.192), (351.9, 0.371)],
    "Bi-214": [(609.3, 0.461), (1120.3, 0.150), (1764.5, 0.154)],
    "Pb-212": [(238.6, 0.436)],
    "Tl-208": [(583.2, 0.845), (2614.5, 0.99)],
    "Ac-228": [(338.3, 0.113), (911.2, 0.258), (969.0, 0.158)],
    "I-131": [(364.5, 0.817), (637.0, 0.072)],
    "Tc-99m": [(140.5, 0.890)],
    "F-18": [(511.0, 1.934)],
    "Ir-192": [(296.0, 0.287), (308.5, 0.300), (316.5, 0.828), (468.1, 0.478)],
    "Th-232": [(63.8, 0.0026)],
    "U-238": [(49.6, 0.064), (113.5, 0.017)],
 }
 class IsotopeIndex:
    """
    Maps isotope names to model output indices and vice versa.
    The index is alphabetically sorted for deterministic ordering.
    """
    def __init__(self, isotope_names: Optional[List[str]] = None):
        if isotope_names is None:
            isotope_names = ISOTOPE_NAMES
        self._isotope_names = sorted(isotope_names)
        self._name_to_idx = {name: idx for idx, name in enumerate(self._isotope_names)}
        self._idx_to_name = {idx: name for idx, name in enumerate(self._isotope_names)}
    @property
    def num_isotopes(self) -> int:
        return len(self._isotope_names)
    @property
    def isotope_names(self) -> List[str]:
        return self._isotope_names.copy()
    def name_to_index(self, name: str) -> int:
        if name not in self._name_to_idx:
            raise KeyError(f"Isotope '{name}' not in index. Available: {self._isotope_names[:5]}...")
        return self._name_to_idx[name]
    def index_to_name(self, idx: int) -> str:
        if idx not in self._idx_to_name:
            raise KeyError(f"Index {idx} out of range [0, {self.num_isotopes-1}]")
        return self._idx_to_name[idx]
    def __len__(self) -> int:
        return self.num_isotopes
    def __repr__(self) -> str:
        return f"IsotopeIndex(num_isotopes={self.num_isotopes})"
    @classmethod
    def load(cls, path: Path) -> 'IsotopeIndex':
        """Load from a text file (one isotope per line)."""
        with open(path, 'r') as f:
            names = [line.strip() for line in f if line.strip()]
        return cls(names)
    def save(self, path: Path):
        """Save to a text file."""
        with open(path, 'w') as f:
            for name in self._isotope_names:
                f.write(f"{name}\n")
 # =============================================================================
 # MODEL ARCHITECTURE (Embedded - No external dependencies)
 # =============================================================================
@dataclass
 class VegaConfig:
    """Configuration for the Vega model architecture."""
    # Input
    num_channels: int = 1023          # Energy channels in spectrum
    num_isotopes: int = 82            # Output classes
    # CNN backbone
    conv_channels: List[int] = field(default_factory=lambda: [64, 128, 256])
    conv_kernel_size: int = 7
    pool_size: int = 2
    # Classifier head
    fc_hidden_dims: List[int] = field(default_factory=lambda: [512, 256])
    # Regularization
    dropout_rate: float = 0.3
    spatial_dropout_rate: float = 0.1
    leaky_relu_slope: float = 0.1
    # Loss weights (not used in inference)
    classification_weight: float = 1.0
    regression_weight: float = 0.1
    max_activity_bq: float = 1000.0
 class ConvBlock(nn.Module):
    """
    CNN block: Conv → BN → LeakyReLU → Conv → BN → LeakyReLU → MaxPool → Dropout
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 7,
        pool_size: int = 2,
        dropout_rate: float = 0.1,
        leaky_slope: float = 0.1
    ):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.act1 = nn.LeakyReLU(leaky_slope)
        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.act2 = nn.LeakyReLU(leaky_slope)
        self.pool = nn.MaxPool1d(pool_size)
        self.dropout = nn.Dropout1d(dropout_rate)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act1(self.bn1(self.conv1(x)))
        x = self.act2(self.bn2(self.conv2(x)))
        x = self.dropout(self.pool(x))
        return x
 class VegaModel(nn.Module):
    """
    Vega: CNN-FCNN for Multi-Label Isotope Classification + Activity Regression
    Takes a 1D gamma spectrum and outputs:
    - 82 isotope presence logits (use sigmoid for probabilities)
    - 82 activity estimates (in Bq, scaled by max_activity_bq)
    """
    def __init__(self, config: VegaConfig):
        super().__init__()
        self.config = config
        # Build CNN backbone
        self.backbone = self._build_backbone()
        # Calculate flattened size
        self._flat_size = self._calculate_flat_size()
        # Classification head (multi-label)
        self.classifier = self._build_classifier()
        # Regression head (activity)
        self.regressor = self._build_regressor()
    def _build_backbone(self) -> nn.Sequential:
        layers = []
        in_ch = 1
        for out_ch in self.config.conv_channels:
            layers.append(ConvBlock(
                in_ch, out_ch,
                kernel_size=self.config.conv_kernel_size,
                pool_size=self.config.pool_size,
                dropout_rate=self.config.spatial_dropout_rate,
                leaky_slope=self.config.leaky_relu_slope
            ))
            in_ch = out_ch
        return nn.Sequential(*layers)
    def _calculate_flat_size(self) -> int:
        with torch.no_grad():
            x = torch.zeros(1, 1, self.config.num_channels)
            x = self.backbone(x)
            return x.view(1, -1).size(1)
    def _build_classifier(self) -> nn.Sequential:
        layers = []
        in_dim = self._flat_size
        for hidden_dim in self.config.fc_hidden_dims:
            layers.extend([
                nn.Linear(in_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, self.config.num_isotopes))
        return nn.Sequential(*layers)
    def _build_regressor(self) -> nn.Sequential:
        layers = []
        in_dim = self._flat_size
        for hidden_dim in self.config.fc_hidden_dims:
            layers.extend([
                nn.Linear(in_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            in_dim = hidden_dim
        layers.extend([
            nn.Linear(in_dim, self.config.num_isotopes),
            nn.ReLU()  # Activities must be positive
        ])
        return nn.Sequential(*layers)
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Ensure input shape is (batch, 1, channels)
        if x.dim() == 2:
            x = x.unsqueeze(1)
        # Feature extraction
        features = self.backbone(x)
        features = features.view(features.size(0), -1)
        # Dual outputs
        logits = self.classifier(features)
        activities = self.regressor(features)
        return logits, activities
 # =============================================================================
 # PREDICTION DATA CLASSES
 # =============================================================================
@dataclass
 class IsotopePrediction:
    """Prediction result for a single isotope."""
    name: str
    probability: float
    activity_bq: float
    present: bool
@dataclass
 class SpectrumPrediction:
    """Complete prediction results for a spectrum."""
    isotopes: List[IsotopePrediction]
    num_present: int
    confidence: float
    threshold_used: float
    def to_dict(self) -> Dict:
        """Convert to JSON-serializable dictionary."""
        return {
            'isotopes': [
                {
                    'name': iso.name,
                    'probability': round(iso.probability, 4),
                    'activity_bq': round(iso.activity_bq, 2),
                    'present': iso.present
                }
                for iso in self.isotopes
            ],
            'num_isotopes_detected': self.num_present,
            'confidence': round(self.confidence, 4),
            'threshold': self.threshold_used
        }
    def get_present_isotopes(self) -> List[IsotopePrediction]:
        """Get only isotopes predicted as present."""
        return [iso for iso in self.isotopes if iso.present]
    def summary(self) -> str:
        """Human-readable summary of predictions."""
        present = self.get_present_isotopes()
        if not present:
            return "No isotopes detected above threshold"
        lines = [f"Detected {len(present)} isotope(s):"]
        for iso in sorted(present, key=lambda x: -x.probability):
            lines.append(
                f"  • {iso.name}: {iso.probability*100:.1f}% confidence, "
                f"{iso.activity_bq:.1f} Bq estimated activity"
            )
        return "\n".join(lines)
    def to_json(self, indent: int = 2) -> str:
        """Convert to JSON string."""
        return json.dumps(self.to_dict(), indent=indent)
 # =============================================================================
 # INFERENCE ENGINE
 # =============================================================================
 class VegaInference:
    """
    Inference engine for the Vega isotope identification model.
    Example usage:
        inference = VegaInference("vega_best.pt")
        spectrum = np.load("my_spectrum.npy")
        result = inference.predict(spectrum, threshold=0.5)
        print(result.summary())
    """
    def __init__(
        self,
        model_path: Union[str, Path],
        isotope_index: Optional[IsotopeIndex] = None,
        device: Optional[torch.device] = None
    ):
        """
        Initialize the inference engine.
        Args:
            model_path: Path to saved .pt model checkpoint
            isotope_index: Optional custom isotope index. Uses default if None.
            device: Compute device. Auto-detects CUDA if available.
        """
        self.model_path = Path(model_path)
        # Device selection with CUDA compatibility check
        if device is not None:
            self.device = device
        elif torch.cuda.is_available():
            try:
                # Test CUDA actually works (some GPUs may not be compatible)
                _ = torch.zeros(1, device='cuda') + 1
                self.device = torch.device('cuda')
            except RuntimeError:
                self.device = torch.device('cpu')
        else:
            self.device = torch.device('cpu')
        # Load checkpoint
        print(f"Loading model from: {self.model_path}")
        self.checkpoint = torch.load(self.model_path, map_location=self.device, weights_only=False)
        # Load model config
        if 'model_config' in self.checkpoint:
            config_dict = self.checkpoint['model_config']
            self.model_config = VegaConfig(**config_dict)
        elif 'params' in self.checkpoint:
            # Handle Optuna-trained models
            params = self.checkpoint['params']
            self.model_config = VegaConfig(
                conv_channels=params.get('conv_channels', [64, 128, 256]),
                conv_kernel_size=params.get('conv_kernel_size', 7),
                pool_size=params.get('pool_size', 2),
                fc_hidden_dims=params.get('fc_hidden_dims', [512, 256]),
                dropout_rate=params.get('dropout_rate', 0.3),
                spatial_dropout_rate=params.get('spatial_dropout_rate', 0.1),
                leaky_relu_slope=params.get('leaky_relu_slope', 0.1)
            )
        else:
            # Use defaults
            self.model_config = VegaConfig()
        # Create and load model
        self.model = VegaModel(self.model_config)
        self.model.load_state_dict(self.checkpoint['model_state_dict'])
        self.model = self.model.to(self.device)
        self.model.eval()
        # Set isotope index
        self.isotope_index = isotope_index or IsotopeIndex()
        print(f"✓ Model loaded successfully")
        print(f"  Device: {self.device}")
        print(f"  Isotopes: {self.isotope_index.num_isotopes}")
        print(f"  Architecture: CNN{self.model_config.conv_channels} → FC{self.model_config.fc_hidden_dims}")
    def preprocess(self, spectrum: np.ndarray, normalize: bool = True) -> torch.Tensor:
        """
        Preprocess spectrum for model input.
        Args:
            spectrum: Input array, shape (1023,) or (N, 1023)
            normalize: Normalize to [0, 1] range
        Returns:
            Tensor ready for model, shape (1, 1023)
        """
        # Average time series if 2D
        if spectrum.ndim == 2:
            spectrum = spectrum.mean(axis=0)
        # Normalize
        if normalize and spectrum.max() > 0:
            spectrum = spectrum / spectrum.max()
        # To tensor with batch dimension
        tensor = torch.tensor(spectrum, dtype=torch.float32).unsqueeze(0)
        return tensor.to(self.device)
    @torch.no_grad()
    def predict(
        self,
        spectrum: Union[np.ndarray, torch.Tensor],
        threshold: float = 0.5,
        return_all: bool = False
    ) -> SpectrumPrediction:
        """
        Run inference on a gamma spectrum.
        Args:
            spectrum: Input spectrum (numpy array or tensor)
            threshold: Probability threshold for detection (0-1)
            return_all: If True, include all 82 isotopes. If False, only detected ones.
        Returns:
            SpectrumPrediction with detected isotopes and activities
        """
        # Preprocess
        if isinstance(spectrum, np.ndarray):
            spectrum = self.preprocess(spectrum)
        # Run model
        logits, activities = self.model(spectrum)
        # Convert to probabilities
        probs = torch.sigmoid(logits).cpu().numpy()[0]
        activities = activities.cpu().numpy()[0] * self.model_config.max_activity_bq
        # Build predictions
        isotopes = []
        for i in range(len(probs)):
            prob = float(probs[i])
            activity = float(activities[i])
            present = prob >= threshold
            if return_all or present:
                isotopes.append(IsotopePrediction(
                    name=self.isotope_index.index_to_name(i),
                    probability=prob,
                    activity_bq=activity if present else 0.0,
                    present=present
                ))
        # Calculate confidence
        present_isotopes = [iso for iso in isotopes if iso.present]
        if present_isotopes:
            confidence = np.mean([iso.probability for iso in present_isotopes])
        else:
            confidence = 1.0 - probs.max()
        return SpectrumPrediction(
            isotopes=isotopes,
            num_present=len(present_isotopes),
            confidence=float(confidence),
            threshold_used=threshold
        )
    def predict_from_file(
        self,
        file_path: Union[str, Path],
        threshold: float = 0.5
    ) -> SpectrumPrediction:
        """Load spectrum from .npy file and run inference."""
        spectrum = np.load(file_path)
        return self.predict(spectrum, threshold)
    def predict_batch(
        self,
        spectra: List[np.ndarray],
        threshold: float = 0.5
    ) -> List[SpectrumPrediction]:
        """Run inference on multiple spectra."""
        return [self.predict(s, threshold) for s in spectra]
 # =============================================================================
 # SAMPLE SPECTRUM GENERATOR (For testing without real data)
 # =============================================================================
 def energy_to_channel(energy_kev: float, num_channels: int = 1023) -> int:
    """Convert energy in keV to channel index."""
    e_min, e_max = 20.0, 3000.0
    channel = int((energy_kev - e_min) / (e_max - e_min) * num_channels)
    return max(0, min(num_channels - 1, channel))
 def channel_to_energy(channel: int, num_channels: int = 1023) -> float:
    """Convert channel index to energy in keV."""
    e_min, e_max = 20.0, 3000.0
    return e_min + channel * (e_max - e_min) / num_channels
 def create_sample_spectrum(
    isotope: str = "Cs-137",
    activity_bq: float = 100.0,
    duration_seconds: float = 300.0,
    add_background: bool = True,
    add_noise: bool = True,
    detector_fwhm_percent: float = 8.5,
    seed: Optional[int] = None
 ) -> np.ndarray:
    """
    Generate a synthetic gamma spectrum for testing.
    This creates a realistic-looking spectrum with Gaussian peaks at the
    characteristic gamma energies of the specified isotope.
    Args:
        isotope: Isotope name (e.g., "Cs-137", "Co-60", "Na-22")
        activity_bq: Source activity in Becquerels
        duration_seconds: Measurement duration
        add_background: Add environmental background
        add_noise: Apply Poisson counting statistics
        detector_fwhm_percent: Detector resolution at 662 keV (%)
        seed: Random seed for reproducibility
    Returns:
        1D numpy array of shape (1023,) with counts per channel
    """
    if seed is not None:
        np.random.seed(seed)
    num_channels = 1023
    spectrum = np.zeros(num_channels)
    # Get gamma lines for the isotope
    if isotope in GAMMA_LINES:
        gamma_lines = GAMMA_LINES[isotope]
    else:
        # Use Cs-137 as fallback
        print(f"Warning: No gamma lines for {isotope}, using Cs-137")
        gamma_lines = GAMMA_LINES["Cs-137"]
    # Add peaks for each gamma line
    for energy_kev, branching_ratio in gamma_lines:
        # Calculate FWHM at this energy (scales with sqrt of energy)
        fwhm_kev = (detector_fwhm_percent / 100.0) * 662.0 * math.sqrt(energy_kev / 662.0)
        sigma_kev = fwhm_kev / 2.355
        # Expected counts
        efficiency = 0.1 * math.exp(-energy_kev / 500.0)  # Simplified efficiency
        expected_counts = activity_bq * duration_seconds * branching_ratio * efficiency
        # Add Gaussian peak
        center_channel = energy_to_channel(energy_kev)
        sigma_channels = sigma_kev / ((3000 - 20) / num_channels)
        for ch in range(num_channels):
            energy = channel_to_energy(ch)
            peak = expected_counts * math.exp(-0.5 * ((energy - energy_kev) / sigma_kev) ** 2)
            spectrum[ch] += peak
    # Add background continuum
    if add_background:
        # Exponential continuum
        for ch in range(num_channels):
            energy = channel_to_energy(ch)
            bg = 50.0 * duration_seconds * math.exp(-energy / 300.0) / 300.0
            spectrum[ch] += bg
        # K-40 environmental background
        k40_energy = 1460.8
        k40_fwhm = (detector_fwhm_percent / 100.0) * 662.0 * math.sqrt(k40_energy / 662.0)
        k40_sigma = k40_fwhm / 2.355
        k40_counts = 10.0 * duration_seconds  # Low activity environmental
        for ch in range(num_channels):
            energy = channel_to_energy(ch)
            peak = k40_counts * math.exp(-0.5 * ((energy - k40_energy) / k40_sigma) ** 2)
            spectrum[ch] += peak
    # Apply Poisson noise
    if add_noise:
        spectrum = np.maximum(spectrum, 0)
        spectrum = np.random.poisson(spectrum.astype(int)).astype(float)
    return spectrum
 def create_sample_spectra_batch() -> Dict[str, np.ndarray]:
    """
    Create a batch of sample spectra for different isotopes.
    Returns:
        Dictionary mapping isotope names to their sample spectra
    """
    samples = {}
    # Common calibration isotopes
    for isotope in ["Cs-137", "Co-60", "Na-22", "Ba-133", "Am-241", "Eu-152"]:
        if isotope in GAMMA_LINES:
            samples[isotope] = create_sample_spectrum(
                isotope=isotope,
                activity_bq=100.0,
                duration_seconds=300.0,
                seed=hash(isotope) % 2**32
            )
    # Background only
    samples["Background"] = create_sample_spectrum(
        isotope="Cs-137",  # Will be overwritten by background
        activity_bq=0.0,   # No source
        duration_seconds=300.0,
        add_background=True,
        seed=12345
    )
    return samples
 # =============================================================================
 # DEMONSTRATION FUNCTIONS
 # =============================================================================
 def run_demo(model_path: str, threshold: float = 0.5):
    """
    Run a complete demonstration of the Vega inference system.
    Args:
        model_path: Path to trained model checkpoint
        threshold: Detection threshold (0-1)
    """
    print("\n" + "=" * 70)
    print("VEGA ISOTOPE IDENTIFICATION - INFERENCE DEMONSTRATION")
    print("=" * 70)
    # Load model
    print("\n[1] Loading Model")
    print("-" * 70)
    inference = VegaInference(model_path)
    # Generate sample spectra
    print("\n[2] Generating Sample Spectra")
    print("-" * 70)
    samples = create_sample_spectra_batch()
    print(f"Generated {len(samples)} sample spectra:")
    for name in samples:
        print(f"  • {name}")
    # Run inference on each
    print("\n[3] Running Inference")
    print("-" * 70)
    for name, spectrum in samples.items():
        print(f"\n{'─' * 70}")
        print(f"Sample: {name}")
        print(f"Spectrum shape: {spectrum.shape}")
        print(f"Spectrum range: [{spectrum.min():.1f}, {spectrum.max():.1f}]")
        # Run prediction
        result = inference.predict(spectrum, threshold=threshold)
        print(f"\nPrediction (threshold={threshold}):")
        print(result.summary())
        # Show top 5 probabilities even if below threshold
        print("\nTop 5 isotope probabilities:")
        all_result = inference.predict(spectrum, threshold=0.0, return_all=True)
        sorted_iso = sorted(all_result.isotopes, key=lambda x: -x.probability)[:5]
        for iso in sorted_iso:
            marker = "✓" if iso.probability >= threshold else " "
            print(f"  {marker} {iso.name}: {iso.probability*100:.2f}%")
    # Show JSON output format
    print("\n[4] JSON Output Format Example")
    print("-" * 70)
    sample_result = inference.predict(samples["Cs-137"], threshold=threshold)
    print(sample_result.to_json())
    print("\n" + "=" * 70)
    print("DEMONSTRATION COMPLETE")
    print("=" * 70)
 def run_single_inference(model_path: str, spectrum_path: str, threshold: float = 0.5):
    """
    Run inference on a single spectrum file.
    Args:
        model_path: Path to trained model
        spectrum_path: Path to .npy spectrum file
        threshold: Detection threshold
    """
    print(f"\nLoading model from: {model_path}")
    inference = VegaInference(model_path)
    print(f"Loading spectrum from: {spectrum_path}")
    spectrum = np.load(spectrum_path)
    print(f"Spectrum shape: {spectrum.shape}")
    print(f"\nRunning inference (threshold={threshold})...")
    result = inference.predict(spectrum, threshold=threshold)
    print("\n" + "=" * 60)
    print("PREDICTION RESULTS")
    print("=" * 60)
    print(result.summary())
    print("=" * 60)
    return result
 # =============================================================================
 # MAIN ENTRY POINT
 # =============================================================================
 def main():
    """Main entry point for command-line usage."""
    import argparse
    parser = argparse.ArgumentParser(
        description="Vega Portable Inference - Gamma Spectrum Isotope Identification",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
  # Run demo with sample spectra
  python vega_portable_inference.py --model vega_best.pt
  # Analyze a specific spectrum file
  python vega_portable_inference.py --model vega_best.pt --spectrum my_data.npy
  # Use lower threshold for higher recall
  python vega_portable_inference.py --model vega_best.pt --threshold 0.3
        """
    )
    parser.add_argument(
        "--model", "-m",
        type=str,
        required=True,
        help="Path to trained Vega model checkpoint (.pt file)"
    )
    parser.add_argument(
        "--spectrum", "-s",
        type=str,
        default=None,
        help="Path to spectrum file (.npy). If not provided, runs demo with synthetic spectra."
    )
    parser.add_argument(
        "--threshold", "-t",
        type=float,
        default=0.5,
        help="Detection threshold (0-1). Lower = more sensitive, higher = more specific. Default: 0.5"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output results in JSON format"
    )
    args = parser.parse_args()
    if args.spectrum:
        # Single file inference
        result = run_single_inference(args.model, args.spectrum, args.threshold)
        if args.json:
            print("\nJSON Output:")
            print(result.to_json())
    else:
        # Demo mode
        run_demo(args.model, args.threshold)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/train/vega_ml/inference/vega_portable_inference_2d.py
+++ b/train/vega_ml/inference/vega_portable_inference_2d.py
@ -0,0 +1,879 @@
 #!/usr/bin/env python
 """
 ================================================================================
 VEGA 2D PORTABLE INFERENCE - Self-Contained Isotope Identification
 ================================================================================
 This is a FULLY SELF-CONTAINED inference script for the Vega 2D gamma spectrum
 isotope identification model. You only need:
    1. This Python file (vega_portable_inference_2d.py)
    2. A trained 2D model checkpoint (.pt file)
    3. PyTorch, NumPy installed
 NO other project files are required. The model architecture, isotope index,
 and sample data generator are all embedded in this file.
 ================================================================================
 USAGE EXAMPLES:
 ================================================================================
 1. Basic inference with embedded sample data:
   python vega_portable_inference_2d.py --model vega_2d_best.pt
 2. Inference on a specific spectrum file:
   python vega_portable_inference_2d.py --model vega_2d_best.pt --spectrum my_spectrum.npy
 3. Programmatic usage:
   from vega_portable_inference_2d import Vega2DInference, create_sample_spectrum_2d
   inference = Vega2DInference("vega_2d_best.pt")
   spectrum = create_sample_spectrum_2d("Cs-137", activity_bq=100)
   result = inference.predict(spectrum)
   print(result.summary())
 ================================================================================
 INPUT FORMAT:
 ================================================================================
 The 2D model expects gamma spectra in the following format:
 - NumPy array, shape: (60, 1023) for 60 time intervals × 1023 channels
 - Values: Counts per channel per time interval (will be normalized automatically)
 - Energy range: 20 keV to 3000 keV across 1023 channels
 - Time: 60 one-second intervals (1 minute total measurement)
 If your spectrum has different time dimensions, it will be padded or truncated
 to 60 intervals automatically.
 ================================================================================
 OUTPUT FORMAT:
 ================================================================================
 The model returns a SpectrumPrediction object with:
 - isotopes: List of IsotopePrediction objects, each containing:
    - name: Isotope name (e.g., "Cs-137")
    - probability: Detection confidence [0, 1]
    - activity_bq: Estimated activity in Becquerels
    - present: Boolean, True if probability >= threshold
 - num_present: Count of detected isotopes
 - confidence: Overall prediction confidence
 - threshold_used: Detection threshold that was applied
 Methods:
 - .summary() - Human-readable text summary
 - .to_dict() - JSON-serializable dictionary
 - .get_present_isotopes() - List of only detected isotopes
 ================================================================================
 MODEL ARCHITECTURE (2D CNN):
 ================================================================================
 Vega 2D uses 2D convolutions to process time × energy spectral images:
    Input (1, 60, 1023) - single channel image
           ↓
    ConvBlock 1: Conv2d(1→32, k=3×7) → BN → LeakyReLU → Conv2d → BN → LeakyReLU → MaxPool2d → Dropout
           ↓
    ConvBlock 2: Conv2d(32→64, k=3×7) → BN → LeakyReLU → Conv2d → BN → LeakyReLU → MaxPool2d → Dropout
           ↓
    ConvBlock 3: Conv2d(64→128, k=3×7) → BN → LeakyReLU → Conv2d → BN → LeakyReLU → MaxPool2d → Dropout
           ↓
    Flatten (113,792 features)
           ↓
    FC: Linear(→512) → BN → LeakyReLU → Dropout → Linear(→256) → BN → LeakyReLU → Dropout
           ↓
    Classifier: Linear(→82) [isotope logits]
    Regressor: Linear(→82) → ReLU [activity Bq]
 Outputs:
 - 82 isotope presence probabilities (multi-label classification)
 - 82 activity estimates in Bq (regression)
 Total parameters: ~59 million
 ================================================================================
 SUPPORTED ISOTOPES (82 total):
 ================================================================================
 CALIBRATION: Am-241, Ba-133, Cs-137, Co-57, Co-60, Eu-152, Na-22, Mn-54
 MEDICAL: Tc-99m, F-18, Ga-67, Ga-68, In-111, I-123, I-125, Tl-201, Lu-177
 INDUSTRIAL: Ir-192, Se-75, Cd-109, I-131, Y-90
 NATURAL: K-40, Ra-226, Th-232, U-238, Rn-222
 FALLOUT: Cs-134, Sr-90, Zr-95, Nb-95, Ru-103, Ru-106, Ce-141, Ce-144
 DECAY CHAINS: Full U-238, Th-232, U-235 series
 ================================================================================
 REQUIREMENTS:
 ================================================================================
 pip install torch numpy
 ================================================================================
 """
 import sys
 import json
 import math
 import numpy as np
 import torch
 import torch.nn as nn
 from pathlib import Path
 from dataclasses import dataclass, field
 from typing import Dict, List, Optional, Tuple, Union
 # =============================================================================
 # ISOTOPE DATABASE (Embedded - No external dependencies)
 # =============================================================================
 # Complete list of 82 isotopes supported by the model (alphabetically sorted)
 ISOTOPE_NAMES = [
    "Ac-228", "Ag-110m", "Am-241", "Ba-133", "Be-7", "Bi-207", "Bi-211",
    "Bi-212", "Bi-214", "C-14", "Cd-109", "Ce-141", "Ce-144", "Co-57",
    "Co-60", "Cr-51", "Cs-134", "Cs-137", "Eu-152", "Eu-154", "F-18",
    "Fe-59", "Ga-67", "Ga-68", "H-3", "I-123", "I-125", "I-131", "In-111",
    "Ir-192", "K-40", "Lu-177", "Mn-54", "Na-22", "Nb-95", "Pa-231",
    "Pa-234m", "Pb-210", "Pb-211", "Pb-212", "Pb-214", "Po-210", "Ra-223",
    "Ra-224", "Ra-226", "Rn-219", "Rn-222", "Ru-103", "Ru-106", "Sb-124",
    "Sb-125", "Se-75", "Sn-113", "Sr-85", "Sr-90", "Tc-99m", "Th-227",
    "Th-228", "Th-230", "Th-232", "Th-234", "Tl-201", "Tl-208", "U-234",
    "U-235", "U-238", "Y-90", "Zn-65", "Zr-95",
    # Additional isotopes to reach 82
    "Ba-140", "Br-82", "Ca-45", "Ca-47", "Cf-252", "Cl-36", "Cm-244",
    "Cu-64", "Gd-153", "Hg-203", "Np-237", "P-32", "Pu-239"
 ]
 # Gamma emission lines (keV) and branching ratios for key isotopes
 GAMMA_LINES = {
    "Am-241": [(59.54, 0.359)],
    "Ba-133": [(81.0, 0.329), (276.4, 0.071), (302.9, 0.183), (356.0, 0.620), (383.8, 0.089)],
    "Cs-137": [(661.7, 0.851)],
    "Co-57": [(122.1, 0.856), (136.5, 0.107)],
    "Co-60": [(1173.2, 0.999), (1332.5, 0.999)],
    "Eu-152": [(121.8, 0.284), (344.3, 0.265), (778.9, 0.129), (964.1, 0.146), (1112.1, 0.136), (1408.0, 0.210)],
    "Na-22": [(511.0, 1.798), (1274.5, 0.999)],
    "Mn-54": [(834.8, 0.9998)],
    "K-40": [(1460.8, 0.107)],
    "Ra-226": [(186.2, 0.036)],
    "Pb-214": [(295.2, 0.192), (351.9, 0.371)],
    "Bi-214": [(609.3, 0.461), (1120.3, 0.150), (1764.5, 0.154)],
    "Pb-212": [(238.6, 0.436)],
    "Tl-208": [(583.2, 0.845), (2614.5, 0.99)],
    "Ac-228": [(338.3, 0.113), (911.2, 0.258), (969.0, 0.158)],
    "I-131": [(364.5, 0.817), (637.0, 0.072)],
    "Tc-99m": [(140.5, 0.890)],
    "F-18": [(511.0, 1.934)],
    "Ir-192": [(296.0, 0.287), (308.5, 0.300), (316.5, 0.828), (468.1, 0.478)],
    "Th-232": [(63.8, 0.0026)],
    "U-238": [(49.6, 0.064), (113.5, 0.017)],
 }
 class IsotopeIndex:
    """Maps isotope names to model output indices and vice versa."""
    def __init__(self, isotope_names: Optional[List[str]] = None):
        if isotope_names is None:
            isotope_names = ISOTOPE_NAMES
        self._isotope_names = sorted(isotope_names)
        self._name_to_idx = {name: idx for idx, name in enumerate(self._isotope_names)}
        self._idx_to_name = {idx: name for idx, name in enumerate(self._isotope_names)}
    @property
    def num_isotopes(self) -> int:
        return len(self._isotope_names)
    @property
    def isotope_names(self) -> List[str]:
        return self._isotope_names.copy()
    def name_to_index(self, name: str) -> int:
        if name not in self._name_to_idx:
            raise KeyError(f"Isotope '{name}' not in index")
        return self._name_to_idx[name]
    def index_to_name(self, idx: int) -> str:
        if idx not in self._idx_to_name:
            raise KeyError(f"Index {idx} out of range [0, {self.num_isotopes-1}]")
        return self._idx_to_name[idx]
    def __len__(self) -> int:
        return self.num_isotopes
 # =============================================================================
 # 2D MODEL ARCHITECTURE (Embedded)
 # =============================================================================
@dataclass
 class Vega2DConfig:
    """Configuration for Vega 2D model."""
    # Input dimensions
    num_channels: int = 1023  # Energy channels
    num_time_intervals: int = 60  # Fixed time dimension
    # Output
    num_isotopes: int = 82
    # CNN architecture
    conv_channels: List[int] = field(default_factory=lambda: [32, 64, 128])
    kernel_size: Tuple[int, int] = (3, 7)  # (time, energy)
    pool_size: Tuple[int, int] = (2, 2)
    # FC layers
    fc_hidden_dims: List[int] = field(default_factory=lambda: [512, 256])
    # Regularization
    dropout_rate: float = 0.3
    leaky_relu_slope: float = 0.01
    # Activity scaling
    max_activity_bq: float = 1000.0
 class ConvBlock2D(nn.Module):
    """2D Convolutional block with BatchNorm, activation, pooling, and dropout."""
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: Tuple[int, int],
        pool_size: Tuple[int, int],
        dropout_rate: float,
        leaky_relu_slope: float
    ):
        super().__init__()
        padding = (kernel_size[0] // 2, kernel_size[1] // 2)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(leaky_relu_slope)
        self.pool = nn.MaxPool2d(pool_size)
        self.dropout = nn.Dropout2d(dropout_rate)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.activation(self.bn1(self.conv1(x)))
        x = self.activation(self.bn2(self.conv2(x)))
        x = self.pool(x)
        x = self.dropout(x)
        return x
 class Vega2DModel(nn.Module):
    """
    2D CNN model for gamma spectrum isotope identification.
    Treats spectra as images with time on one axis and energy channels on the other.
    """
    def __init__(self, config: Vega2DConfig = None):
        super().__init__()
        self.config = config or Vega2DConfig()
        # Build CNN backbone
        self.conv_blocks = nn.ModuleList()
        in_channels = 1
        for out_channels in self.config.conv_channels:
            self.conv_blocks.append(ConvBlock2D(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=self.config.kernel_size,
                pool_size=self.config.pool_size,
                dropout_rate=self.config.dropout_rate,
                leaky_relu_slope=self.config.leaky_relu_slope
            ))
            in_channels = out_channels
        # Calculate flattened size
        self.flat_size = self._calculate_flat_size()
        # FC backbone
        fc_layers = []
        fc_in = self.flat_size
        for fc_out in self.config.fc_hidden_dims:
            fc_layers.extend([
                nn.Linear(fc_in, fc_out),
                nn.BatchNorm1d(fc_out),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            fc_in = fc_out
        self.fc_backbone = nn.Sequential(*fc_layers)
        # Output heads
        self.classifier = nn.Linear(fc_in, self.config.num_isotopes)
        self.regressor = nn.Sequential(
            nn.Linear(fc_in, self.config.num_isotopes),
            nn.ReLU()
        )
    def _calculate_flat_size(self) -> int:
        h = self.config.num_time_intervals
        w = self.config.num_channels
        for _ in self.config.conv_channels:
            h = h // self.config.pool_size[0]
            w = w // self.config.pool_size[1]
        return self.config.conv_channels[-1] * h * w
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Add channel dimension if needed: (B, T, C) -> (B, 1, T, C)
        if x.dim() == 3:
            x = x.unsqueeze(1)
        # CNN backbone
        for conv_block in self.conv_blocks:
            x = conv_block(x)
        # Flatten
        x = x.view(x.size(0), -1)
        # FC backbone
        x = self.fc_backbone(x)
        # Output heads
        logits = self.classifier(x)
        activities = self.regressor(x)
        return logits, activities
 # =============================================================================
 # PREDICTION DATA CLASSES
 # =============================================================================
@dataclass
 class IsotopePrediction:
    """Prediction result for a single isotope."""
    name: str
    probability: float
    activity_bq: float
    present: bool
@dataclass
 class SpectrumPrediction:
    """Complete prediction results for a spectrum."""
    isotopes: List[IsotopePrediction]
    num_present: int
    confidence: float
    threshold_used: float
    def to_dict(self) -> Dict:
        """Convert to JSON-serializable dictionary."""
        return {
            'isotopes': [
                {
                    'name': iso.name,
                    'probability': round(iso.probability, 4),
                    'activity_bq': round(iso.activity_bq, 2),
                    'present': iso.present
                }
                for iso in self.isotopes
            ],
            'num_isotopes_detected': self.num_present,
            'confidence': round(self.confidence, 4),
            'threshold': self.threshold_used
        }
    def get_present_isotopes(self) -> List[IsotopePrediction]:
        """Get only isotopes predicted as present."""
        return [iso for iso in self.isotopes if iso.present]
    def summary(self) -> str:
        """Human-readable summary of predictions."""
        present = self.get_present_isotopes()
        if not present:
            return "No isotopes detected above threshold"
        lines = [f"Detected {len(present)} isotope(s):"]
        for iso in sorted(present, key=lambda x: -x.probability):
            lines.append(
                f"  • {iso.name}: {iso.probability*100:.1f}% confidence, "
                f"{iso.activity_bq:.1f} Bq estimated activity"
            )
        return "\n".join(lines)
    def to_json(self, indent: int = 2) -> str:
        """Convert to JSON string."""
        return json.dumps(self.to_dict(), indent=indent)
 # =============================================================================
 # INFERENCE ENGINE
 # =============================================================================
 class Vega2DInference:
    """
    Inference engine for the Vega 2D isotope identification model.
    Example usage:
        inference = Vega2DInference("vega_2d_best.pt")
        spectrum = np.load("my_spectrum.npy")  # Shape: (60, 1023)
        result = inference.predict(spectrum, threshold=0.5)
        print(result.summary())
    """
    def __init__(
        self,
        model_path: Union[str, Path],
        isotope_index: Optional[IsotopeIndex] = None,
        device: Optional[torch.device] = None
    ):
        """
        Initialize the inference engine.
        Args:
            model_path: Path to saved .pt model checkpoint
            isotope_index: Optional custom isotope index. Uses default if None.
            device: Compute device. Auto-detects CUDA if available.
        """
        self.model_path = Path(model_path)
        # Device selection
        if device is not None:
            self.device = device
        elif torch.cuda.is_available():
            try:
                _ = torch.zeros(1, device='cuda') + 1
                self.device = torch.device('cuda')
            except RuntimeError:
                self.device = torch.device('cpu')
        else:
            self.device = torch.device('cpu')
        # Load checkpoint
        print(f"Loading 2D model from: {self.model_path}")
        self.checkpoint = torch.load(self.model_path, map_location=self.device, weights_only=False)
        # Load model config
        if 'model_config' in self.checkpoint:
            config_dict = self.checkpoint['model_config']
            # Handle tuple conversion for kernel_size and pool_size
            if 'kernel_size' in config_dict and isinstance(config_dict['kernel_size'], list):
                config_dict['kernel_size'] = tuple(config_dict['kernel_size'])
            if 'pool_size' in config_dict and isinstance(config_dict['pool_size'], list):
                config_dict['pool_size'] = tuple(config_dict['pool_size'])
            self.model_config = Vega2DConfig(**config_dict)
        else:
            self.model_config = Vega2DConfig()
        # Create and load model
        self.model = Vega2DModel(self.model_config)
        self.model.load_state_dict(self.checkpoint['model_state_dict'])
        self.model = self.model.to(self.device)
        self.model.eval()
        # Set isotope index
        self.isotope_index = isotope_index or IsotopeIndex()
        print(f"✓ Model loaded successfully")
        print(f"  Device: {self.device}")
        print(f"  Input shape: ({self.model_config.num_time_intervals}, {self.model_config.num_channels})")
        print(f"  Isotopes: {self.isotope_index.num_isotopes}")
        print(f"  Architecture: 2D-CNN{self.model_config.conv_channels} → FC{self.model_config.fc_hidden_dims}")
    def _pad_or_truncate(self, spectrum: np.ndarray) -> np.ndarray:
        """Ensure spectrum has exactly num_time_intervals rows."""
        target_rows = self.model_config.num_time_intervals
        current_rows = spectrum.shape[0]
        if current_rows == target_rows:
            return spectrum
        elif current_rows > target_rows:
            # Truncate - take last N intervals (most recent data)
            return spectrum[-target_rows:]
        else:
            # Pad with zeros at the beginning
            padding = np.zeros((target_rows - current_rows, spectrum.shape[1]))
            return np.vstack([padding, spectrum])
    def preprocess(self, spectrum: np.ndarray, normalize: bool = True) -> torch.Tensor:
        """
        Preprocess spectrum for model input.
        Args:
            spectrum: Input array, shape (T, 1023) where T is any number of time intervals
            normalize: Normalize to [0, 1] range
        Returns:
            Tensor ready for model, shape (1, 60, 1023)
        """
        # Handle 1D input (average spectrum)
        if spectrum.ndim == 1:
            # Expand to 2D by repeating
            spectrum = np.tile(spectrum.reshape(1, -1), (self.model_config.num_time_intervals, 1))
        # Ensure correct time dimension
        spectrum = self._pad_or_truncate(spectrum)
        # Normalize
        if normalize and spectrum.max() > 0:
            spectrum = spectrum / spectrum.max()
        # To tensor with batch dimension
        tensor = torch.tensor(spectrum, dtype=torch.float32).unsqueeze(0)
        return tensor.to(self.device)
    @torch.no_grad()
    def predict(
        self,
        spectrum: Union[np.ndarray, torch.Tensor],
        threshold: float = 0.5,
        return_all: bool = False
    ) -> SpectrumPrediction:
        """
        Run inference on a gamma spectrum.
        Args:
            spectrum: Input spectrum (numpy array or tensor)
            threshold: Probability threshold for detection (0-1)
            return_all: If True, include all 82 isotopes. If False, only detected ones.
        Returns:
            SpectrumPrediction with detected isotopes and activities
        """
        # Preprocess
        if isinstance(spectrum, np.ndarray):
            spectrum = self.preprocess(spectrum)
        # Run model
        logits, activities = self.model(spectrum)
        # Convert to probabilities
        probs = torch.sigmoid(logits).cpu().numpy()[0]
        activities = activities.cpu().numpy()[0] * self.model_config.max_activity_bq
        # Build predictions
        isotopes = []
        for i in range(len(probs)):
            prob = float(probs[i])
            activity = float(activities[i])
            present = prob >= threshold
            if return_all or present:
                isotopes.append(IsotopePrediction(
                    name=self.isotope_index.index_to_name(i),
                    probability=prob,
                    activity_bq=activity if present else 0.0,
                    present=present
                ))
        # Calculate confidence
        present_isotopes = [iso for iso in isotopes if iso.present]
        if present_isotopes:
            confidence = np.mean([iso.probability for iso in present_isotopes])
        else:
            confidence = 1.0 - probs.max()
        return SpectrumPrediction(
            isotopes=isotopes,
            num_present=len(present_isotopes),
            confidence=float(confidence),
            threshold_used=threshold
        )
    def predict_from_file(
        self,
        file_path: Union[str, Path],
        threshold: float = 0.5
    ) -> SpectrumPrediction:
        """Load spectrum from .npy file and run inference."""
        spectrum = np.load(file_path)
        return self.predict(spectrum, threshold)
    def predict_batch(
        self,
        spectra: List[np.ndarray],
        threshold: float = 0.5
    ) -> List[SpectrumPrediction]:
        """Run inference on multiple spectra."""
        return [self.predict(s, threshold) for s in spectra]
 # =============================================================================
 # SAMPLE SPECTRUM GENERATOR (For testing without real data)
 # =============================================================================
 def energy_to_channel(energy_kev: float, num_channels: int = 1023) -> int:
    """Convert energy (keV) to usable channel index (0..num_channels-1).
    Assumes an underlying 1024-channel MCA with raw channels 0..1023 where
    channel 0 is skipped (modeled usable channels correspond to raw 1..1023).
    """
    e_min, e_max = 20.0, 3000.0
    full_channels = num_channels + 1
    channel_width = (e_max - e_min) / full_channels
    raw_channel = int((energy_kev - e_min) / channel_width)
    usable_channel = raw_channel - 1
    return max(0, min(num_channels - 1, usable_channel))
 def channel_to_energy(channel: int, num_channels: int = 1023) -> float:
    """Convert usable channel index to energy bin center (keV)."""
    e_min, e_max = 20.0, 3000.0
    full_channels = num_channels + 1
    channel_width = (e_max - e_min) / full_channels
    raw_channel = channel + 1
    return e_min + (raw_channel + 0.5) * channel_width
 def create_sample_spectrum_2d(
    isotope: str = "Cs-137",
    activity_bq: float = 100.0,
    duration_seconds: int = 60,
    add_background: bool = True,
    add_noise: bool = True,
    detector_fwhm_percent: float = 8.5,
    seed: Optional[int] = None
 ) -> np.ndarray:
    """
    Generate a synthetic 2D gamma spectrum for testing.
    Args:
        isotope: Isotope name (e.g., "Cs-137", "Co-60")
        activity_bq: Source activity in Becquerels
        duration_seconds: Number of 1-second time intervals (default 60)
        add_background: Add environmental background
        add_noise: Apply Poisson counting statistics
        detector_fwhm_percent: Detector resolution at 662 keV (%)
        seed: Random seed for reproducibility
    Returns:
        2D numpy array of shape (duration_seconds, 1023)
    """
    if seed is not None:
        np.random.seed(seed)
    num_channels = 1023
    spectrum = np.zeros((duration_seconds, num_channels))
    # Get gamma lines for the isotope
    if isotope in GAMMA_LINES:
        gamma_lines = GAMMA_LINES[isotope]
    else:
        print(f"Warning: No gamma lines for {isotope}, using Cs-137")
        gamma_lines = GAMMA_LINES["Cs-137"]
    # Generate spectrum for each time interval
    for t in range(duration_seconds):
        for energy_kev, branching_ratio in gamma_lines:
            fwhm_kev = (detector_fwhm_percent / 100.0) * 662.0 * math.sqrt(energy_kev / 662.0)
            sigma_kev = fwhm_kev / 2.355
            efficiency = 0.1 * math.exp(-energy_kev / 500.0)
            expected_counts = activity_bq * 1.0 * branching_ratio * efficiency  # 1 second interval
            for ch in range(num_channels):
                energy = channel_to_energy(ch)
                peak = expected_counts * math.exp(-0.5 * ((energy - energy_kev) / sigma_kev) ** 2)
                spectrum[t, ch] += peak
        # Add background
        if add_background:
            for ch in range(num_channels):
                energy = channel_to_energy(ch)
                bg = 50.0 * 1.0 * math.exp(-energy / 300.0) / 300.0
                spectrum[t, ch] += bg
            # K-40 environmental
            k40_energy = 1460.8
            k40_fwhm = (detector_fwhm_percent / 100.0) * 662.0 * math.sqrt(k40_energy / 662.0)
            k40_sigma = k40_fwhm / 2.355
            k40_counts = 10.0 * 1.0
            for ch in range(num_channels):
                energy = channel_to_energy(ch)
                peak = k40_counts * math.exp(-0.5 * ((energy - k40_energy) / k40_sigma) ** 2)
                spectrum[t, ch] += peak
    # Apply Poisson noise
    if add_noise:
        spectrum = np.maximum(spectrum, 0)
        spectrum = np.random.poisson(spectrum.astype(int)).astype(float)
    return spectrum
 def create_sample_spectra_batch_2d() -> Dict[str, np.ndarray]:
    """Create a batch of sample 2D spectra for different isotopes."""
    samples = {}
    for isotope in ["Cs-137", "Co-60", "Na-22", "Ba-133", "Am-241", "Eu-152"]:
        if isotope in GAMMA_LINES:
            samples[isotope] = create_sample_spectrum_2d(
                isotope=isotope,
                activity_bq=100.0,
                duration_seconds=60,
                seed=hash(isotope) % 2**32
            )
    # Background only
    samples["Background"] = create_sample_spectrum_2d(
        isotope="Cs-137",
        activity_bq=0.0,
        duration_seconds=60,
        add_background=True,
        seed=12345
    )
    return samples
 # =============================================================================
 # DEMONSTRATION FUNCTIONS
 # =============================================================================
 def run_demo(model_path: str, threshold: float = 0.5):
    """Run a complete demonstration of the Vega 2D inference system."""
    print("\n" + "=" * 70)
    print("VEGA 2D ISOTOPE IDENTIFICATION - INFERENCE DEMONSTRATION")
    print("=" * 70)
    # Load model
    print("\n[1] Loading 2D Model")
    print("-" * 70)
    inference = Vega2DInference(model_path)
    # Generate sample spectra
    print("\n[2] Generating Sample 2D Spectra (60 time intervals × 1023 channels)")
    print("-" * 70)
    samples = create_sample_spectra_batch_2d()
    print(f"Generated {len(samples)} sample spectra:")
    for name, spec in samples.items():
        print(f"  • {name}: shape {spec.shape}")
    # Run inference on each
    print("\n[3] Running Inference")
    print("-" * 70)
    for name, spectrum in samples.items():
        print(f"\n{'─' * 70}")
        print(f"Sample: {name}")
        print(f"Spectrum shape: {spectrum.shape}")
        print(f"Spectrum range: [{spectrum.min():.1f}, {spectrum.max():.1f}]")
        result = inference.predict(spectrum, threshold=threshold)
        print(f"\nPrediction (threshold={threshold}):")
        print(result.summary())
        # Top 5 probabilities
        print("\nTop 5 isotope probabilities:")
        all_result = inference.predict(spectrum, threshold=0.0, return_all=True)
        sorted_iso = sorted(all_result.isotopes, key=lambda x: -x.probability)[:5]
        for iso in sorted_iso:
            marker = "✓" if iso.probability >= threshold else " "
            print(f"  {marker} {iso.name}: {iso.probability*100:.2f}%")
    # JSON output format
    print("\n[4] JSON Output Format Example")
    print("-" * 70)
    sample_result = inference.predict(samples["Cs-137"], threshold=threshold)
    print(sample_result.to_json())
    print("\n" + "=" * 70)
    print("DEMONSTRATION COMPLETE")
    print("=" * 70)
 def run_single_inference(model_path: str, spectrum_path: str, threshold: float = 0.5):
    """Run inference on a single spectrum file."""
    print(f"\nLoading model from: {model_path}")
    inference = Vega2DInference(model_path)
    print(f"Loading spectrum from: {spectrum_path}")
    spectrum = np.load(spectrum_path)
    print(f"Spectrum shape: {spectrum.shape}")
    print(f"\nRunning inference (threshold={threshold})...")
    result = inference.predict(spectrum, threshold=threshold)
    print("\n" + "=" * 60)
    print("PREDICTION RESULTS")
    print("=" * 60)
    print(result.summary())
    print("=" * 60)
    return result
 # =============================================================================
 # MAIN ENTRY POINT
 # =============================================================================
 def main():
    """Main entry point for command-line usage."""
    import argparse
    parser = argparse.ArgumentParser(
        description="Vega 2D Portable Inference - Gamma Spectrum Isotope Identification",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
  # Run demo with sample spectra
  python vega_portable_inference_2d.py --model vega_2d_best.pt
  # Analyze a specific spectrum file
  python vega_portable_inference_2d.py --model vega_2d_best.pt --spectrum my_data.npy
  # Use lower threshold for higher recall
  python vega_portable_inference_2d.py --model vega_2d_best.pt --threshold 0.3
        """
    )
    parser.add_argument(
        "--model", "-m",
        type=str,
        required=True,
        help="Path to trained Vega 2D model checkpoint (.pt file)"
    )
    parser.add_argument(
        "--spectrum", "-s",
        type=str,
        default=None,
        help="Path to spectrum file (.npy, shape 60×1023 or variable×1023). Runs demo if not provided."
    )
    parser.add_argument(
        "--threshold", "-t",
        type=float,
        default=0.5,
        help="Detection threshold (0-1). Lower = more sensitive. Default: 0.5"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output results in JSON format"
    )
    args = parser.parse_args()
    if args.spectrum:
        result = run_single_inference(args.model, args.spectrum, args.threshold)
        if args.json:
            print("\nJSON Output:")
            print(result.to_json())
    else:
        run_demo(args.model, args.threshold)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/train/vega_ml/sampledata/vega_api_payload_diff_20260124_201334.csv
+++ b/train/vega_ml/sampledata/vega_api_payload_diff_20260124_201334.csv
--- a/train/vega_ml/synthetic_spectra/init.py
+++ b/train/vega_ml/synthetic_spectra/init.py
@ -0,0 +1,18 @@
 """
 Synthetic Gamma Spectra Generation Module
 This module provides tools for generating realistic synthetic gamma spectra
 for training isotope identification models. It simulates detector responses
 compatible with Radiacode devices (101, 102, 103, 103G, 110).
 Detector Specifications:
 - Energy Range: 20 keV to 3000 keV (0.02 - 3 MeV)
 - Channels: 1024 (usable: 1023)
 - FWHM Resolution: 7.4% - 9.5% @ 662 keV (model dependent)
 - Detector Types: CsI(Tl) and GAGG(Ce) scintillators
 """
 __version__ = "0.1.0"
 __author__ = "Isotope ID ML Project"
 from .config import DetectorConfig, RADIACODE_CONFIGS
--- a/train/vega_ml/synthetic_spectra/config.py
+++ b/train/vega_ml/synthetic_spectra/config.py
@ -0,0 +1,142 @@
 """
 Detector Configuration Module
 Contains configuration parameters for Radiacode gamma spectrometers
 and other detector settings.
 """
 from dataclasses import dataclass, field
 from typing import Dict, Optional
 import numpy as np
@dataclass
 class DetectorConfig:
    """Configuration for a gamma spectrometer detector."""
    name: str
    # Energy range in keV
    energy_min_kev: float = 20.0
    energy_max_kev: float = 3000.0
    # Number of channels
    num_channels: int = 1024
    # Some devices/software workflows treat channel 0 as unreliable/noisy.
    # This project models "usable" channels by skipping the first raw channel.
    skip_first_channel: bool = True
    # FWHM at 662 keV (Cs-137 reference) as fraction
    fwhm_at_662: float = 0.084  # 8.4%
    fwhm_uncertainty: float = 0.003  # ±0.3%
    # Detector crystal type
    crystal_type: str = "CsI(Tl)"
    # Sensitivity: counts per second at 1 μSv/h for Cs-137
    sensitivity_cps_per_usvh: float = 30.0
    # Detector volume in cm³
    detector_volume_cm3: float = 1.0
    def get_channel_width_kev(self) -> float:
        """Get the width of each channel in keV."""
        return (self.energy_max_kev - self.energy_min_kev) / self.num_channels
    def get_energy_bins(self) -> np.ndarray:
        """Get array of energy bin centers (keV) for the modeled usable channels."""
        channel_width = self.get_channel_width_kev()
        # Raw device channels are assumed to be 0..num_channels-1 with centers:
        #   E_center(k) = E_min + (k + 0.5) * channel_width
        # If we skip the first raw channel (k=0), we model usable channels k=1..num_channels-1.
        start_raw_channel = 1 if self.skip_first_channel else 0
        raw_channels = np.arange(start_raw_channel, self.num_channels, dtype=np.float64)
        return self.energy_min_kev + (raw_channels + 0.5) * channel_width
    def get_fwhm_at_energy(self, energy_kev: float) -> float:
        """
        Calculate FWHM at a given energy.
        For scintillators, FWHM scales approximately as sqrt(E).
        FWHM(E) = FWHM_662 * sqrt(662/E) * E / 662 = FWHM_662 * sqrt(E/662)
        """
        return self.fwhm_at_662 * np.sqrt(662.0 / energy_kev) * energy_kev
    def get_sigma_at_energy(self, energy_kev: float) -> float:
        """
        Get Gaussian sigma at a given energy.
        sigma = FWHM / (2 * sqrt(2 * ln(2))) ≈ FWHM / 2.355
        """
        fwhm = self.get_fwhm_at_energy(energy_kev)
        return fwhm / 2.355
    def energy_to_channel(self, energy_kev: float) -> int:
        """Convert energy in keV to modeled usable channel index."""
        channel_width = self.get_channel_width_kev()
        raw_channel = int((energy_kev - self.energy_min_kev) / channel_width)
        if self.skip_first_channel:
            channel = raw_channel - 1
            max_channel = self.num_channels - 2
        else:
            channel = raw_channel
            max_channel = self.num_channels - 1
        return max(0, min(max_channel, channel))
    def channel_to_energy(self, channel: int) -> float:
        """Convert modeled usable channel index to energy bin center (keV)."""
        channel_width = self.get_channel_width_kev()
        raw_channel = channel + (1 if self.skip_first_channel else 0)
        raw_channel = max(0, min(self.num_channels - 1, int(raw_channel)))
        return self.energy_min_kev + (raw_channel + 0.5) * channel_width
 # Pre-defined configurations for Radiacode devices
 RADIACODE_CONFIGS: Dict[str, DetectorConfig] = {
    "radiacode_101": DetectorConfig(
        name="Radiacode 101",
        fwhm_at_662=0.095,  # 9.5% (original model, similar to 102)
        fwhm_uncertainty=0.004,
        crystal_type="CsI(Tl)",
        sensitivity_cps_per_usvh=30.0,
        detector_volume_cm3=1.0,
    ),
    "radiacode_102": DetectorConfig(
        name="Radiacode 102",
        fwhm_at_662=0.095,  # 9.5%
        fwhm_uncertainty=0.004,
        crystal_type="CsI(Tl)",
        sensitivity_cps_per_usvh=30.0,
        detector_volume_cm3=1.0,
    ),
    "radiacode_103": DetectorConfig(
        name="Radiacode 103",
        fwhm_at_662=0.084,  # 8.4%
        fwhm_uncertainty=0.003,
        crystal_type="CsI(Tl)",
        sensitivity_cps_per_usvh=30.0,
        detector_volume_cm3=1.0,
    ),
    "radiacode_103g": DetectorConfig(
        name="Radiacode 103G",
        energy_min_kev=25.0,  # Tech spec lists 0.025…3 MeV
        fwhm_at_662=0.074,  # 7.4% (GAGG crystal - better resolution)
        fwhm_uncertainty=0.003,
        crystal_type="GAGG(Ce)",
        sensitivity_cps_per_usvh=40.0,
        detector_volume_cm3=1.0,
    ),
    "radiacode_110": DetectorConfig(
        name="Radiacode 110",
        fwhm_at_662=0.084,  # 8.4%
        fwhm_uncertainty=0.003,
        crystal_type="CsI(Tl)",
        sensitivity_cps_per_usvh=77.0,  # Higher sensitivity
        detector_volume_cm3=2.5,  # Larger crystal
    ),
 }
 def get_default_config() -> DetectorConfig:
    """Get the default detector configuration (Radiacode 103)."""
    return RADIACODE_CONFIGS["radiacode_103"]
--- a/train/vega_ml/synthetic_spectra/generate_spectra.py
+++ b/train/vega_ml/synthetic_spectra/generate_spectra.py
@ -0,0 +1,418 @@
 """
 Synthetic Spectra Generation Script
 This script generates synthetic gamma spectra for training isotope identification models.
 Usage:
    python generate_spectra.py --num_samples 10 --output_dir ./data/synthetic
 Output:
    - data/synthetic/spectra/*.npy - Spectrum arrays (time x 1023 channels)
    - data/synthetic/spectra/*.png - Visual representations (optional)
    - data/synthetic/labels.json - Annotations for all samples
 """
 import argparse
 import sys
 from pathlib import Path
 import json
 from datetime import datetime
 import numpy as np
 # Add parent directory to path for imports
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from synthetic_spectra.generator import (
    SpectrumGenerator,
    SpectrumConfig,
    IsotopeSource,
    GeneratedSpectrum,
    save_spectrum,
    generate_labels_json,
 )
 from synthetic_spectra.config import RADIACODE_CONFIGS
 from synthetic_spectra.ground_truth import (
    get_all_isotopes,
    get_isotopes_by_category,
    IsotopeCategory,
    DECAY_CHAINS,
 )
 def get_common_isotope_pool() -> list:
    """Get a pool of commonly encountered isotopes for realistic training data."""
    common_isotopes = [
        # Calibration sources (very common in spectra)
        "Cs-137", "Co-60", "Am-241", "Ba-133", "Eu-152", "Na-22", "Co-57",
        # Medical isotopes (occasionally encountered)
        "Tc-99m", "I-131", "I-123", "F-18", "Ga-67", "In-111", "Lu-177",
        # Natural background (always present to some degree)
        "K-40", "Pb-214", "Bi-214", "Pb-212", "Bi-212", "Tl-208", "Ac-228",
        # Industrial sources
        "Ir-192", "Se-75", "Mn-54", "Zn-65",
        # Uranium/Thorium (NORM)
        "U-235", "Ra-226", "Th-232",
        # Reactor/Fallout
        "Cs-134", "Sb-125", "Ce-144", "Co-58",
    ]
    # Filter to only isotopes in our database with gamma lines
    from synthetic_spectra.ground_truth import get_isotope
    valid_isotopes = []
    for name in common_isotopes:
        iso = get_isotope(name)
        if iso and len(iso.gamma_lines) > 0:
            valid_isotopes.append(name)
    return valid_isotopes
 def generate_single_isotope_sample(
    generator: SpectrumGenerator,
    isotope_name: str,
    activity_bq: float,
    duration_seconds: float,
    **kwargs
 ) -> GeneratedSpectrum:
    """Generate a clean sample with a single isotope."""
    config = SpectrumConfig(
        duration_seconds=duration_seconds,
        sources=[
            IsotopeSource(
                isotope_name=isotope_name,
                activity_bq=activity_bq,
                include_daughters=True
            )
        ],
        **kwargs
    )
    return generator.generate_spectrum(config)
 def generate_mixed_isotope_sample(
    generator: SpectrumGenerator,
    isotope_names: list,
    activities_bq: list,
    duration_seconds: float,
    **kwargs
 ) -> GeneratedSpectrum:
    """Generate a sample with multiple blended isotopes."""
    sources = [
        IsotopeSource(
            isotope_name=name,
            activity_bq=activity,
            include_daughters=True
        )
        for name, activity in zip(isotope_names, activities_bq)
    ]
    config = SpectrumConfig(
        duration_seconds=duration_seconds,
        sources=sources,
        **kwargs
    )
    return generator.generate_spectrum(config)
 def generate_training_batch(
    num_samples: int,
    output_dir: Path,
    detector_name: str = "radiacode_103",
    duration_range: tuple = (60, 300),
    activity_range: tuple = (1.0, 100.0),
    single_isotope_fraction: float = 0.4,
    dual_isotope_fraction: float = 0.3,
    multi_isotope_fraction: float = 0.2,
    background_only_fraction: float = 0.1,
    save_png: bool = False,
    random_seed: int = None,
 ) -> list:
    """
    Generate a batch of training samples with various configurations.
    Args:
        num_samples: Total number of samples to generate
        output_dir: Output directory for spectra and labels
        detector_name: Radiacode device to simulate
        duration_range: (min, max) duration in seconds
        activity_range: (min, max) source activity in Bq
        single_isotope_fraction: Fraction of single-isotope samples
        dual_isotope_fraction: Fraction of two-isotope samples
        multi_isotope_fraction: Fraction of 3+ isotope samples
        background_only_fraction: Fraction of background-only samples
        save_png: Whether to also save PNG images
        random_seed: Random seed for reproducibility
    Returns:
        List of generated spectra
    """
    if random_seed is not None:
        np.random.seed(random_seed)
    # Create output directories
    output_dir = Path(output_dir)
    spectra_dir = output_dir / "spectra"
    spectra_dir.mkdir(parents=True, exist_ok=True)
    # Initialize generator
    generator = SpectrumGenerator(
        detector_config=RADIACODE_CONFIGS.get(detector_name),
        random_seed=random_seed
    )
    # Get isotope pool
    isotope_pool = get_common_isotope_pool()
    print(f"Using isotope pool with {len(isotope_pool)} isotopes")
    # Calculate sample counts for each category
    n_single = int(num_samples * single_isotope_fraction)
    n_dual = int(num_samples * dual_isotope_fraction)
    n_multi = int(num_samples * multi_isotope_fraction)
    n_background = int(num_samples * background_only_fraction)
    # Adjust to ensure we hit exactly num_samples
    remaining = num_samples - (n_single + n_dual + n_multi + n_background)
    n_single += remaining
    total_generated = 0
    print(f"\nGenerating {num_samples} synthetic spectra:")
    print(f"  - Single isotope: {n_single}")
    print(f"  - Dual isotope: {n_dual}")
    print(f"  - Multi isotope (3+): {n_multi}")
    print(f"  - Background only: {n_background}")
    print()
    sample_num = 0
    # Generate single isotope samples
    print("Generating single-isotope samples...")
    for i in range(n_single):
        isotope = np.random.choice(isotope_pool)
        activity = np.random.uniform(*activity_range)
        duration = np.random.uniform(*duration_range)
        spectrum = generate_single_isotope_sample(
            generator,
            isotope,
            activity,
            duration,
            detector_name=detector_name,
            include_background=True,
        )
        # Save spectrum (don't accumulate in memory)
        save_spectrum(
            spectrum,
            spectra_dir,
            save_image=True,
            image_format='npy'
        )
        del spectrum  # Free memory immediately
        sample_num += 1
        if sample_num % 100 == 0:
            print(f"  Generated {sample_num}/{num_samples} samples...")
    # Generate dual isotope samples
    print("Generating dual-isotope samples...")
    for i in range(n_dual):
        isotopes = np.random.choice(isotope_pool, size=2, replace=False)
        activities = [np.random.uniform(*activity_range) for _ in range(2)]
        duration = np.random.uniform(*duration_range)
        spectrum = generate_mixed_isotope_sample(
            generator,
            list(isotopes),
            activities,
            duration,
            detector_name=detector_name,
            include_background=True,
        )
        save_spectrum(
            spectrum,
            spectra_dir,
            save_image=True,
            image_format='npy'
        )
        del spectrum
        sample_num += 1
        if sample_num % 100 == 0:
            print(f"  Generated {sample_num}/{num_samples} samples...")
    # Generate multi-isotope samples
    print("Generating multi-isotope samples...")
    for i in range(n_multi):
        num_isotopes = np.random.randint(3, min(6, len(isotope_pool)))
        isotopes = np.random.choice(isotope_pool, size=num_isotopes, replace=False)
        activities = [np.random.uniform(*activity_range) for _ in range(num_isotopes)]
        duration = np.random.uniform(*duration_range)
        spectrum = generate_mixed_isotope_sample(
            generator,
            list(isotopes),
            activities,
            duration,
            detector_name=detector_name,
            include_background=True,
        )
        save_spectrum(
            spectrum,
            spectra_dir,
            save_image=True,
            image_format='npy'
        )
        del spectrum
        sample_num += 1
        if sample_num % 100 == 0:
            print(f"  Generated {sample_num}/{num_samples} samples...")
    # Generate background-only samples
    print("Generating background-only samples...")
    for i in range(n_background):
        duration = np.random.uniform(*duration_range)
        config = SpectrumConfig(
            duration_seconds=duration,
            sources=[],  # No additional sources
            include_background=True,
            detector_name=detector_name,
        )
        spectrum = generator.generate_spectrum(config)
        save_spectrum(
            spectrum,
            spectra_dir,
            save_image=True,
            image_format='npy'
        )
        del spectrum
        sample_num += 1
    total_generated = sample_num
    print(f"\nGenerated {total_generated} samples total")
 def main():
    parser = argparse.ArgumentParser(
        description="Generate synthetic gamma spectra for ML training"
    )
    parser.add_argument(
        "--num_samples",
        type=int,
        default=10,
        help="Number of samples to generate (default: 10)"
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="O:/master_data_collection/isotopev2",
        help="Output directory (default: O:/master_data_collection/isotopev2)"
    )
    parser.add_argument(
        "--detector",
        type=str,
        default="radiacode_103",
        choices=list(RADIACODE_CONFIGS.keys()),
        help="Detector to simulate (default: radiacode_103)"
    )
    parser.add_argument(
        "--min_duration",
        type=float,
        default=60,
        help="Minimum spectrum duration in seconds (default: 60)"
    )
    parser.add_argument(
        "--max_duration",
        type=float,
        default=300,
        help="Maximum spectrum duration in seconds (default: 300)"
    )
    parser.add_argument(
        "--min_activity",
        type=float,
        default=1.0,
        help="Minimum source activity in Bq (default: 1.0)"
    )
    parser.add_argument(
        "--max_activity",
        type=float,
        default=100.0,
        help="Maximum source activity in Bq (default: 100.0)"
    )
    parser.add_argument(
        "--save_png",
        action="store_true",
        help="Also save PNG images of spectra"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for reproducibility"
    )
    args = parser.parse_args()
    print("=" * 60)
    print("Synthetic Gamma Spectra Generator")
    print("=" * 60)
    print(f"Samples to generate: {args.num_samples}")
    print(f"Output directory: {args.output_dir}")
    print(f"Detector: {args.detector}")
    print(f"Duration range: {args.min_duration}-{args.max_duration} seconds")
    print(f"Activity range: {args.min_activity}-{args.max_activity} Bq")
    print(f"Random seed: {args.seed}")
    print("=" * 60)
    generate_training_batch(
        num_samples=args.num_samples,
        output_dir=Path(args.output_dir),
        detector_name=args.detector,
        duration_range=(args.min_duration, args.max_duration),
        activity_range=(args.min_activity, args.max_activity),
        save_png=args.save_png,
        random_seed=args.seed,
    )
    print("\n" + "=" * 60)
    print("Generation complete!")
    print("=" * 60)
    # Count generated files
    spectra_dir = Path(args.output_dir) / "spectra"
    npy_files = list(spectra_dir.glob("spectrum_*.npy"))
    print(f"\nTotal samples generated: {len(npy_files)}")
 if __name__ == "__main__":
    main()
--- a/train/vega_ml/synthetic_spectra/generate_spectra_v2.py
+++ b/train/vega_ml/synthetic_spectra/generate_spectra_v2.py
@ -0,0 +1,526 @@
 """
 Synthetic Spectra Generation Script v2
 Improvements over v1:
 - Parallel generation using multiprocessing for 10x+ speedup
 - Class-balanced isotope sampling to ensure all isotopes are represented
 - More variable background noise (intensity, composition)
 - Memory efficient - doesn't accumulate spectra in memory
 - Progress bar with ETA
 Usage:
    python -m synthetic_spectra.generate_spectra_v2 --num_samples 100000 --workers 8
 """
 import argparse
 import sys
 from pathlib import Path
 import json
 from datetime import datetime
 import numpy as np
 from multiprocessing import Pool, cpu_count
 from functools import partial
 import time
 from typing import List, Tuple, Dict, Optional
 import os
 # Add parent directory to path for imports
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from synthetic_spectra.generator import (
    SpectrumGenerator,
    SpectrumConfig,
    IsotopeSource,
    GeneratedSpectrum,
    save_spectrum,
 )
 from synthetic_spectra.config import RADIACODE_CONFIGS
 from synthetic_spectra.ground_truth import get_isotope
 # =============================================================================
 # ISOTOPE POOL WITH CATEGORIES FOR BALANCED SAMPLING
 # =============================================================================
 ISOTOPE_CATEGORIES = {
    "calibration": [
        "Cs-137", "Co-60", "Am-241", "Ba-133", "Eu-152", "Na-22", "Co-57", "Mn-54"
    ],
    "medical": [
        "Tc-99m", "I-131", "I-123", "F-18", "Ga-67", "Ga-68", "In-111", "Lu-177", "Tl-201"
    ],
    "industrial": [
        "Ir-192", "Se-75", "Zn-65", "Co-58", "Cd-109"
    ],
    "natural_background": [
        "K-40", "Ra-226", "U-235", "U-238", "Th-232"
    ],
    "decay_chain_u238": [
        "Pb-214", "Bi-214", "Pb-210"
    ],
    "decay_chain_th232": [
        "Pb-212", "Bi-212", "Tl-208", "Ac-228", "Ra-224"
    ],
    "reactor_fallout": [
        "Cs-134", "I-131", "Sr-90", "Zr-95", "Nb-95", "Ru-103", "Ce-141", "Ce-144", "Sb-125"
    ],
 }
 def get_valid_isotope_pool() -> Tuple[List[str], Dict[str, List[str]]]:
    """
    Get all valid isotopes (with gamma lines) organized by category.
    Returns:
        Tuple of (flat_list, category_dict)
    """
    valid_categories = {}
    all_isotopes = []
    for category, isotopes in ISOTOPE_CATEGORIES.items():
        valid = []
        for name in isotopes:
            iso = get_isotope(name)
            if iso and len(iso.gamma_lines) > 0:
                valid.append(name)
                if name not in all_isotopes:
                    all_isotopes.append(name)
        valid_categories[category] = valid
    return all_isotopes, valid_categories
 # =============================================================================
 # BACKGROUND VARIATION
 # =============================================================================
 class BackgroundConfig:
    """Configuration for varied background generation."""
    def __init__(
        self,
        intensity_min: float = 0.3,
        intensity_max: float = 3.0,
        k40_prob: float = 0.95,  # Almost always present
        radon_prob: float = 0.8,  # Usually present indoors
        thorium_prob: float = 0.6,  # Sometimes present
    ):
        self.intensity_min = intensity_min
        self.intensity_max = intensity_max
        self.k40_prob = k40_prob
        self.radon_prob = radon_prob
        self.thorium_prob = thorium_prob
    def sample(self, rng: np.random.Generator) -> dict:
        """Sample a random background configuration."""
        return {
            'background_cps': rng.uniform(self.intensity_min, self.intensity_max) * 5.0,
            'include_k40': rng.random() < self.k40_prob,
            'include_radon': rng.random() < self.radon_prob,
            'include_thorium': rng.random() < self.thorium_prob,
        }
 # =============================================================================
 # SINGLE SAMPLE GENERATION (for parallel workers)
 # =============================================================================
 def generate_single_sample(
    args: Tuple[int, dict]
 ) -> Optional[str]:
    """
    Generate a single sample. Designed to be called by worker processes.
    Args:
        args: Tuple of (sample_index, config_dict)
    Returns:
        Sample ID if successful, None if failed
    """
    sample_idx, config = args
    try:
        # Create RNG with unique seed per sample
        rng = np.random.default_rng(config['base_seed'] + sample_idx)
        # Initialize generator (each worker creates its own)
        detector_config = RADIACODE_CONFIGS.get(config['detector_name'])
        generator = SpectrumGenerator(detector_config=detector_config)
        # Determine sample type based on distribution
        sample_type = config['sample_types'][sample_idx % len(config['sample_types'])]
        # Get isotopes for this sample
        isotope_pool = config['isotope_pool']
        category_pools = config['category_pools']
        # Sample background configuration
        bg_config = BackgroundConfig(
            intensity_min=config.get('bg_intensity_min', 0.3),
            intensity_max=config.get('bg_intensity_max', 3.0),
        )
        bg_params = bg_config.sample(rng)
        # Random duration
        duration = rng.uniform(*config['duration_range'])
        # Build sources based on sample type
        sources = []
        if sample_type == 'single':
            # For class balance, cycle through isotopes
            isotope_idx = sample_idx % len(isotope_pool)
            isotope = isotope_pool[isotope_idx]
            activity = rng.uniform(*config['activity_range'])
            sources.append(IsotopeSource(
                isotope_name=isotope,
                activity_bq=activity,
                include_daughters=True
            ))
        elif sample_type == 'dual':
            # Pick from different categories for variety
            categories = list(category_pools.keys())
            cat1, cat2 = rng.choice(categories, size=2, replace=True)
            iso1 = rng.choice(category_pools[cat1]) if category_pools[cat1] else rng.choice(isotope_pool)
            iso2 = rng.choice(category_pools[cat2]) if category_pools[cat2] else rng.choice(isotope_pool)
            # Ensure different isotopes
            while iso2 == iso1:
                iso2 = rng.choice(isotope_pool)
            for iso in [iso1, iso2]:
                activity = rng.uniform(*config['activity_range'])
                sources.append(IsotopeSource(
                    isotope_name=iso,
                    activity_bq=activity,
                    include_daughters=True
                ))
        elif sample_type == 'multi':
            # 3-5 isotopes from various categories
            num_isotopes = rng.integers(3, 6)
            selected = set()
            for _ in range(num_isotopes):
                cat = rng.choice(list(category_pools.keys()))
                pool = category_pools[cat] if category_pools[cat] else isotope_pool
                iso = rng.choice(pool)
                # Avoid duplicates
                attempts = 0
                while iso in selected and attempts < 10:
                    iso = rng.choice(isotope_pool)
                    attempts += 1
                if iso not in selected:
                    selected.add(iso)
                    activity = rng.uniform(*config['activity_range'])
                    sources.append(IsotopeSource(
                        isotope_name=iso,
                        activity_bq=activity,
                        include_daughters=True
                    ))
        # elif sample_type == 'background': sources stays empty
        # Create spectrum config
        spec_config = SpectrumConfig(
            duration_seconds=duration,
            sources=sources,
            include_background=True,
            background_cps=bg_params['background_cps'],
            include_k40=bg_params['include_k40'],
            include_radon=bg_params['include_radon'],
            include_thorium=bg_params['include_thorium'],
            detector_name=config['detector_name'],
        )
        # Generate spectrum
        spectrum = generator.generate_spectrum(spec_config)
        # Save spectrum
        output_dir = Path(config['output_dir']) / "spectra"
        save_spectrum(
            spectrum,
            output_dir,
            save_image=True,
            image_format='npy'  # Skip PNG for speed
        )
        return spectrum.sample_id
    except Exception as e:
        print(f"Error generating sample {sample_idx}: {e}")
        return None
 # =============================================================================
 # MAIN BATCH GENERATION
 # =============================================================================
 def generate_training_batch_parallel(
    num_samples: int,
    output_dir: Path,
    detector_name: str = "radiacode_103",
    duration_range: Tuple[float, float] = (60, 300),
    activity_range: Tuple[float, float] = (1.0, 100.0),
    single_isotope_fraction: float = 0.40,
    dual_isotope_fraction: float = 0.30,
    multi_isotope_fraction: float = 0.20,
    background_only_fraction: float = 0.10,
    bg_intensity_range: Tuple[float, float] = (0.3, 3.0),
    num_workers: int = None,
    random_seed: int = None,
    chunk_size: int = 100,
 ) -> int:
    """
    Generate training samples in parallel.
    Args:
        num_samples: Total number of samples to generate
        output_dir: Output directory
        detector_name: Detector to simulate
        duration_range: (min, max) duration in seconds
        activity_range: (min, max) activity in Bq
        single_isotope_fraction: Fraction of single-isotope samples
        dual_isotope_fraction: Fraction of dual-isotope samples
        multi_isotope_fraction: Fraction of multi-isotope samples
        background_only_fraction: Fraction of background-only samples
        bg_intensity_range: (min, max) background intensity multiplier
        num_workers: Number of parallel workers (default: CPU count - 1)
        random_seed: Base random seed
        chunk_size: Number of samples per worker batch
    Returns:
        Number of successfully generated samples
    """
    if num_workers is None:
        num_workers = max(1, cpu_count() - 1)
    if random_seed is None:
        random_seed = int(time.time())
    # Create output directory
    output_dir = Path(output_dir)
    spectra_dir = output_dir / "spectra"
    spectra_dir.mkdir(parents=True, exist_ok=True)
    # Get isotope pools
    isotope_pool, category_pools = get_valid_isotope_pool()
    print(f"Isotope pool: {len(isotope_pool)} isotopes across {len(category_pools)} categories")
    # Calculate sample counts
    n_single = int(num_samples * single_isotope_fraction)
    n_dual = int(num_samples * dual_isotope_fraction)
    n_multi = int(num_samples * multi_isotope_fraction)
    n_background = int(num_samples * background_only_fraction)
    # Adjust to hit exact count
    remaining = num_samples - (n_single + n_dual + n_multi + n_background)
    n_single += remaining
    # Create sample type list (shuffled for variety in batches)
    sample_types = (
        ['single'] * n_single +
        ['dual'] * n_dual +
        ['multi'] * n_multi +
        ['background'] * n_background
    )
    np.random.seed(random_seed)
    np.random.shuffle(sample_types)
    print(f"\nGenerating {num_samples} samples with {num_workers} workers:")
    print(f"  - Single isotope: {n_single} ({single_isotope_fraction*100:.0f}%)")
    print(f"  - Dual isotope: {n_dual} ({dual_isotope_fraction*100:.0f}%)")
    print(f"  - Multi isotope: {n_multi} ({multi_isotope_fraction*100:.0f}%)")
    print(f"  - Background only: {n_background} ({background_only_fraction*100:.0f}%)")
    print(f"  - Background intensity: {bg_intensity_range[0]:.1f}x - {bg_intensity_range[1]:.1f}x")
    print()
    # Shared config for all workers
    shared_config = {
        'detector_name': detector_name,
        'output_dir': str(output_dir),
        'duration_range': duration_range,
        'activity_range': activity_range,
        'bg_intensity_min': bg_intensity_range[0],
        'bg_intensity_max': bg_intensity_range[1],
        'base_seed': random_seed,
        'isotope_pool': isotope_pool,
        'category_pools': category_pools,
        'sample_types': sample_types,
    }
    # Generate samples in parallel
    start_time = time.time()
    successful = 0
    # Create argument list
    args_list = [(i, shared_config) for i in range(num_samples)]
    # Use multiprocessing pool
    with Pool(processes=num_workers) as pool:
        # Process in chunks and report progress
        for i in range(0, num_samples, chunk_size):
            chunk_end = min(i + chunk_size, num_samples)
            chunk_args = args_list[i:chunk_end]
            results = pool.map(generate_single_sample, chunk_args)
            chunk_success = sum(1 for r in results if r is not None)
            successful += chunk_success
            # Progress report
            elapsed = time.time() - start_time
            rate = successful / elapsed if elapsed > 0 else 0
            eta = (num_samples - successful) / rate if rate > 0 else 0
            print(f"  Progress: {successful}/{num_samples} ({100*successful/num_samples:.1f}%) | "
                  f"Rate: {rate:.1f} samples/s | ETA: {eta/60:.1f} min")
    total_time = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"Generation complete!")
    print(f"  Total samples: {successful}/{num_samples}")
    print(f"  Total time: {total_time/60:.1f} minutes")
    print(f"  Average rate: {successful/total_time:.1f} samples/second")
    print(f"{'='*60}")
    return successful
 def main():
    parser = argparse.ArgumentParser(
        description="Generate synthetic gamma spectra (v2 - parallel, balanced)"
    )
    parser.add_argument(
        "--num_samples", "-n",
        type=int,
        default=100000,
        help="Number of samples to generate (default: 100000)"
    )
    parser.add_argument(
        "--output_dir", "-o",
        type=str,
        default="O:/master_data_collection/isotopev2",
        help="Output directory (default: O:/master_data_collection/isotopev2)"
    )
    parser.add_argument(
        "--detector",
        type=str,
        default="radiacode_103",
        choices=list(RADIACODE_CONFIGS.keys()),
        help="Detector to simulate (default: radiacode_103)"
    )
    parser.add_argument(
        "--workers", "-w",
        type=int,
        default=None,
        help="Number of parallel workers (default: CPU count - 1)"
    )
    parser.add_argument(
        "--min_duration",
        type=float,
        default=60,
        help="Minimum duration in seconds (default: 60)"
    )
    parser.add_argument(
        "--max_duration",
        type=float,
        default=300,
        help="Maximum duration in seconds (default: 300)"
    )
    parser.add_argument(
        "--min_activity",
        type=float,
        default=1.0,
        help="Minimum activity in Bq (default: 1.0)"
    )
    parser.add_argument(
        "--max_activity",
        type=float,
        default=100.0,
        help="Maximum activity in Bq (default: 100.0)"
    )
    parser.add_argument(
        "--bg_min",
        type=float,
        default=0.3,
        help="Minimum background intensity multiplier (default: 0.3)"
    )
    parser.add_argument(
        "--bg_max",
        type=float,
        default=3.0,
        help="Maximum background intensity multiplier (default: 3.0)"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for reproducibility"
    )
    parser.add_argument(
        "--chunk_size",
        type=int,
        default=100,
        help="Samples per progress update (default: 100)"
    )
    # Sample type fractions
    parser.add_argument("--single_frac", type=float, default=0.40)
    parser.add_argument("--dual_frac", type=float, default=0.30)
    parser.add_argument("--multi_frac", type=float, default=0.20)
    parser.add_argument("--bg_frac", type=float, default=0.10)
    args = parser.parse_args()
    print("=" * 60)
    print("Synthetic Gamma Spectra Generator v2")
    print("  - Parallel processing")
    print("  - Class-balanced sampling")
    print("  - Variable background")
    print("=" * 60)
    print(f"Samples: {args.num_samples:,}")
    print(f"Workers: {args.workers or (cpu_count() - 1)}")
    print(f"Output: {args.output_dir}")
    print(f"Detector: {args.detector}")
    print(f"Duration: {args.min_duration}-{args.max_duration}s")
    print(f"Activity: {args.min_activity}-{args.max_activity} Bq")
    print(f"Background: {args.bg_min}x-{args.bg_max}x")
    print("=" * 60)
    generate_training_batch_parallel(
        num_samples=args.num_samples,
        output_dir=Path(args.output_dir),
        detector_name=args.detector,
        duration_range=(args.min_duration, args.max_duration),
        activity_range=(args.min_activity, args.max_activity),
        single_isotope_fraction=args.single_frac,
        dual_isotope_fraction=args.dual_frac,
        multi_isotope_fraction=args.multi_frac,
        background_only_fraction=args.bg_frac,
        bg_intensity_range=(args.bg_min, args.bg_max),
        num_workers=args.workers,
        random_seed=args.seed,
        chunk_size=args.chunk_size,
    )
 if __name__ == "__main__":
    main()
--- a/train/vega_ml/synthetic_spectra/generate_spectra_v3.py
+++ b/train/vega_ml/synthetic_spectra/generate_spectra_v3.py
@ -0,0 +1,577 @@
 """
 Synthetic Spectra Generation Script v3
 Optimized for 2D model training with:
 - Fixed 60-second duration (60 time intervals)
 - Better isotope combinations including decay chain scenarios
 - Enhanced background-only samples
 - More diverse mixing scenarios
 Usage:
    python -m synthetic_spectra.generate_spectra_v3 --num_samples 200000 --workers 8
 """
 import argparse
 import sys
 from pathlib import Path
 import json
 from datetime import datetime
 import numpy as np
 from multiprocessing import Pool, cpu_count
 from functools import partial
 import time
 from typing import List, Tuple, Dict, Optional
 import os
 # Add parent directory to path for imports
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from synthetic_spectra.generator import (
    SpectrumGenerator,
    SpectrumConfig,
    IsotopeSource,
    GeneratedSpectrum,
    save_spectrum,
 )
 from synthetic_spectra.config import RADIACODE_CONFIGS
 from synthetic_spectra.ground_truth import get_isotope
 # =============================================================================
 # ISOTOPE POOLS - Organized for realistic scenarios
 # =============================================================================
 # Calibration/check sources (individual isotopes)
 CALIBRATION_ISOTOPES = [
    "Cs-137", "Co-60", "Am-241", "Ba-133", "Eu-152", "Na-22", "Co-57", "Mn-54"
 ]
 # Medical isotopes (often found individually)
 MEDICAL_ISOTOPES = [
    "Tc-99m", "I-131", "I-123", "F-18", "Ga-67", "Ga-68", "In-111", "Lu-177", "Tl-201"
 ]
 # Industrial sources
 INDUSTRIAL_ISOTOPES = [
    "Ir-192", "Se-75", "Zn-65", "Co-58", "Cd-109"
 ]
 # Natural decay chains - these ALWAYS appear together in nature
 URANIUM_238_CHAIN = ["U-238", "Ra-226", "Pb-214", "Bi-214"]  # Secular equilibrium
 THORIUM_232_CHAIN = ["Th-232", "Ac-228", "Pb-212", "Bi-212", "Tl-208"]
 URANIUM_235_CHAIN = ["U-235"]  # Daughters have low gamma yield
 # Fallout/contamination (often appear in specific combinations)
 CHERNOBYL_FUKUSHIMA = ["Cs-137", "Cs-134"]  # Classic reactor fallout signature
 FRESH_FALLOUT = ["I-131", "Cs-137", "Cs-134", "Zr-95", "Nb-95"]
 OLDER_FALLOUT = ["Cs-137", "Sr-90"]  # Long-lived only
 # Natural background (what you'd see with no source)
 NATURAL_BACKGROUND = ["K-40"]  # Potassium in environment
 # NORM - Naturally Occurring Radioactive Material
 NORM_MATERIALS = ["K-40", "Ra-226", "Th-232", "U-238"]
 def get_valid_isotopes(isotope_list: List[str]) -> List[str]:
    """Filter to isotopes with gamma lines."""
    valid = []
    for name in isotope_list:
        iso = get_isotope(name)
        if iso and len(iso.gamma_lines) > 0:
            valid.append(name)
    return valid
 # Pre-validate all pools
 VALID_CALIBRATION = get_valid_isotopes(CALIBRATION_ISOTOPES)
 VALID_MEDICAL = get_valid_isotopes(MEDICAL_ISOTOPES)
 VALID_INDUSTRIAL = get_valid_isotopes(INDUSTRIAL_ISOTOPES)
 VALID_U238_CHAIN = get_valid_isotopes(URANIUM_238_CHAIN)
 VALID_TH232_CHAIN = get_valid_isotopes(THORIUM_232_CHAIN)
 VALID_FALLOUT = get_valid_isotopes(CHERNOBYL_FUKUSHIMA + FRESH_FALLOUT)
 VALID_NORM = get_valid_isotopes(NORM_MATERIALS)
 # All valid isotopes for random selection
 ALL_VALID_ISOTOPES = list(set(
    VALID_CALIBRATION + VALID_MEDICAL + VALID_INDUSTRIAL +
    VALID_U238_CHAIN + VALID_TH232_CHAIN + VALID_FALLOUT + VALID_NORM
 ))
 # =============================================================================
 # SAMPLE SCENARIOS
 # =============================================================================
 class SampleScenario:
    """Defines a type of sample to generate."""
    def __init__(self, name: str, fraction: float):
        self.name = name
        self.fraction = fraction
    def generate_sources(self, rng: np.random.Generator, activity_range: Tuple[float, float]) -> List[IsotopeSource]:
        """Generate isotope sources for this scenario."""
        raise NotImplementedError
 class BackgroundOnlyScenario(SampleScenario):
    """Pure background - no identifiable sources."""
    def __init__(self, fraction: float = 0.15):
        super().__init__("background_only", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        return []  # No sources - just background
 class SingleCalibrationScenario(SampleScenario):
    """Single calibration source."""
    def __init__(self, fraction: float = 0.20):
        super().__init__("single_calibration", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        isotope = rng.choice(VALID_CALIBRATION)
        activity = rng.uniform(*activity_range)
        return [IsotopeSource(isotope, activity, include_daughters=True)]
 class SingleMedicalScenario(SampleScenario):
    """Single medical isotope."""
    def __init__(self, fraction: float = 0.10):
        super().__init__("single_medical", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        if not VALID_MEDICAL:
            return []
        isotope = rng.choice(VALID_MEDICAL)
        activity = rng.uniform(*activity_range)
        return [IsotopeSource(isotope, activity, include_daughters=True)]
 class SingleIndustrialScenario(SampleScenario):
    """Single industrial source."""
    def __init__(self, fraction: float = 0.05):
        super().__init__("single_industrial", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        if not VALID_INDUSTRIAL:
            return []
        isotope = rng.choice(VALID_INDUSTRIAL)
        activity = rng.uniform(*activity_range)
        return [IsotopeSource(isotope, activity, include_daughters=True)]
 class UraniumChainScenario(SampleScenario):
    """Natural uranium with decay chain in equilibrium."""
    def __init__(self, fraction: float = 0.08):
        super().__init__("uranium_chain", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        # All daughters at ~same activity (secular equilibrium)
        base_activity = rng.uniform(*activity_range)
        sources = []
        for iso in VALID_U238_CHAIN:
            # Slight variation to simulate real-world
            activity = base_activity * rng.uniform(0.8, 1.2)
            sources.append(IsotopeSource(iso, activity, include_daughters=False))
        return sources
 class ThoriumChainScenario(SampleScenario):
    """Natural thorium with decay chain."""
    def __init__(self, fraction: float = 0.08):
        super().__init__("thorium_chain", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        base_activity = rng.uniform(*activity_range)
        sources = []
        for iso in VALID_TH232_CHAIN:
            activity = base_activity * rng.uniform(0.8, 1.2)
            sources.append(IsotopeSource(iso, activity, include_daughters=False))
        return sources
 class NORMScenario(SampleScenario):
    """NORM - naturally occurring radioactive material (multiple natural isotopes)."""
    def __init__(self, fraction: float = 0.08):
        super().__init__("norm", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        # Pick 2-4 NORM isotopes
        num_isotopes = rng.integers(2, 5)
        selected = rng.choice(VALID_NORM, size=min(num_isotopes, len(VALID_NORM)), replace=False)
        sources = []
        for iso in selected:
            activity = rng.uniform(*activity_range)
            sources.append(IsotopeSource(iso, activity, include_daughters=True))
        return sources
 class FalloutScenario(SampleScenario):
    """Reactor fallout signature (Cs-137 + Cs-134 fingerprint)."""
    def __init__(self, fraction: float = 0.06):
        super().__init__("fallout", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        sources = []
        # Cs-137/Cs-134 ratio varies with age of fallout
        cs137_activity = rng.uniform(*activity_range)
        # Fresh fallout: ~1:1 ratio, aged: Cs-134 decays faster
        age_factor = rng.uniform(0.1, 1.0)  # How "fresh" the fallout is
        cs134_activity = cs137_activity * age_factor
        if "Cs-137" in VALID_FALLOUT:
            sources.append(IsotopeSource("Cs-137", cs137_activity, include_daughters=True))
        if "Cs-134" in VALID_FALLOUT and cs134_activity > 0.5:
            sources.append(IsotopeSource("Cs-134", cs134_activity, include_daughters=True))
        # Sometimes include I-131 (very fresh fallout only)
        if rng.random() < 0.3 and "I-131" in VALID_FALLOUT:
            sources.append(IsotopeSource("I-131", rng.uniform(1, 50), include_daughters=True))
        return sources
 class MixedSourcesScenario(SampleScenario):
    """Random mix of 2-3 different source types."""
    def __init__(self, fraction: float = 0.10):
        super().__init__("mixed", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        num_isotopes = rng.integers(2, 4)
        selected = rng.choice(ALL_VALID_ISOTOPES, size=num_isotopes, replace=False)
        sources = []
        for iso in selected:
            activity = rng.uniform(*activity_range)
            sources.append(IsotopeSource(iso, activity, include_daughters=True))
        return sources
 class ComplexMixScenario(SampleScenario):
    """Complex scenario: 4-6 isotopes from various categories."""
    def __init__(self, fraction: float = 0.05):
        super().__init__("complex_mix", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        num_isotopes = rng.integers(4, 7)
        selected = set()
        # Try to get variety from different pools
        pools = [VALID_CALIBRATION, VALID_MEDICAL, VALID_INDUSTRIAL, VALID_U238_CHAIN, VALID_TH232_CHAIN]
        for pool in pools:
            if len(selected) >= num_isotopes:
                break
            if pool:
                iso = rng.choice(pool)
                selected.add(iso)
        # Fill remaining with random
        while len(selected) < num_isotopes:
            iso = rng.choice(ALL_VALID_ISOTOPES)
            selected.add(iso)
        sources = []
        for iso in selected:
            activity = rng.uniform(*activity_range)
            sources.append(IsotopeSource(iso, activity, include_daughters=True))
        return sources
 class WeakSourceScenario(SampleScenario):
    """Very weak sources - near detection limit."""
    def __init__(self, fraction: float = 0.05):
        super().__init__("weak_source", fraction)
    def generate_sources(self, rng, activity_range) -> List[IsotopeSource]:
        # Very low activity - near background
        weak_activity_range = (0.1, 5.0)  # Much weaker than normal
        isotope = rng.choice(ALL_VALID_ISOTOPES)
        activity = rng.uniform(*weak_activity_range)
        return [IsotopeSource(isotope, activity, include_daughters=True)]
 # All scenarios with their fractions (should sum to 1.0)
 DEFAULT_SCENARIOS = [
    BackgroundOnlyScenario(0.15),      # 15% - important for "no detection" cases
    SingleCalibrationScenario(0.20),   # 20% - common check sources
    SingleMedicalScenario(0.08),       # 8%  - medical isotopes
    SingleIndustrialScenario(0.05),    # 5%  - industrial sources
    UraniumChainScenario(0.10),        # 10% - natural uranium + daughters
    ThoriumChainScenario(0.10),        # 10% - natural thorium + daughters
    NORMScenario(0.07),                # 7%  - NORM materials
    FalloutScenario(0.05),             # 5%  - reactor fallout signature
    MixedSourcesScenario(0.10),        # 10% - random 2-3 isotope mixes
    ComplexMixScenario(0.05),          # 5%  - complex 4-6 isotope scenarios
    WeakSourceScenario(0.05),          # 5%  - near-detection-limit sources
 ]
 # =============================================================================
 # BACKGROUND VARIATION
 # =============================================================================
 class BackgroundConfig:
    """Configuration for varied background generation."""
    def __init__(
        self,
        intensity_min: float = 0.3,
        intensity_max: float = 3.0,
        k40_prob: float = 0.95,
        radon_prob: float = 0.8,
        thorium_prob: float = 0.6,
    ):
        self.intensity_min = intensity_min
        self.intensity_max = intensity_max
        self.k40_prob = k40_prob
        self.radon_prob = radon_prob
        self.thorium_prob = thorium_prob
    def sample(self, rng: np.random.Generator) -> dict:
        """Sample a random background configuration."""
        return {
            'background_cps': rng.uniform(self.intensity_min, self.intensity_max) * 5.0,
            'include_k40': rng.random() < self.k40_prob,
            'include_radon': rng.random() < self.radon_prob,
            'include_thorium': rng.random() < self.thorium_prob,
        }
 # =============================================================================
 # SAMPLE GENERATION
 # =============================================================================
 def generate_single_sample(args: Tuple[int, dict]) -> Optional[str]:
    """
    Generate a single sample for parallel processing.
    Args:
        args: Tuple of (sample_index, config_dict)
    Returns:
        Sample ID if successful, None if failed
    """
    sample_idx, config = args
    try:
        # Create RNG with unique seed per sample
        rng = np.random.default_rng(config['base_seed'] + sample_idx)
        # Initialize generator
        detector_config = RADIACODE_CONFIGS.get(config['detector_name'])
        generator = SpectrumGenerator(detector_config=detector_config)
        # Select scenario based on cumulative probabilities
        scenarios = config['scenarios']
        scenario_probs = [s.fraction for s in scenarios]
        scenario = rng.choice(scenarios, p=scenario_probs)
        # Generate sources for this scenario
        sources = scenario.generate_sources(rng, config['activity_range'])
        # Background configuration
        bg_config = BackgroundConfig(
            intensity_min=config.get('bg_intensity_min', 0.3),
            intensity_max=config.get('bg_intensity_max', 3.0),
        )
        bg_params = bg_config.sample(rng)
        # FIXED 60-second duration for 2D model
        duration = 60.0
        # Create spectrum config
        spec_config = SpectrumConfig(
            duration_seconds=duration,
            time_interval_seconds=1.0,  # 1 second per interval = 60 intervals
            sources=sources,
            include_background=True,
            background_cps=bg_params['background_cps'],
            include_k40=bg_params['include_k40'],
            include_radon=bg_params['include_radon'],
            include_thorium=bg_params['include_thorium'],
            detector_name=config['detector_name'],
        )
        # Generate spectrum
        spectrum = generator.generate_spectrum(spec_config)
        # Save spectrum
        output_dir = Path(config['output_dir']) / "spectra"
        save_spectrum(
            spectrum,
            output_dir,
            save_image=True,   # Save NPY file
            image_format='npy'  # Skip PNG for speed
        )
        return spectrum.sample_id
    except Exception as e:
        print(f"Error generating sample {sample_idx}: {e}")
        import traceback
        traceback.print_exc()
        return None
 def generate_training_data_v3(
    num_samples: int,
    output_dir: Path,
    detector_name: str = "radiacode_103",
    activity_range: Tuple[float, float] = (1.0, 100.0),
    bg_intensity_range: Tuple[float, float] = (0.3, 3.0),
    scenarios: Optional[List[SampleScenario]] = None,
    num_workers: int = None,
    random_seed: int = None,
 ) -> int:
    """
    Generate training samples in parallel.
    Args:
        num_samples: Total number of samples to generate
        output_dir: Output directory
        detector_name: Detector to simulate
        activity_range: (min, max) activity in Bq
        bg_intensity_range: Background intensity multiplier range
        scenarios: List of SampleScenario objects (default: DEFAULT_SCENARIOS)
        num_workers: Number of parallel workers
        random_seed: Base random seed
    Returns:
        Number of successfully generated samples
    """
    if num_workers is None:
        num_workers = max(1, cpu_count() - 1)
    if random_seed is None:
        random_seed = int(time.time())
    if scenarios is None:
        scenarios = DEFAULT_SCENARIOS
    # Normalize scenario fractions
    total_fraction = sum(s.fraction for s in scenarios)
    for s in scenarios:
        s.fraction /= total_fraction
    # Create output directory
    output_dir = Path(output_dir)
    spectra_dir = output_dir / "spectra"
    spectra_dir.mkdir(parents=True, exist_ok=True)
    print(f"=" * 70)
    print(f"SYNTHETIC SPECTRA GENERATION v3 - Optimized for 2D Model")
    print(f"=" * 70)
    print(f"\nConfiguration:")
    print(f"  Samples: {num_samples:,}")
    print(f"  Output: {output_dir}")
    print(f"  Detector: {detector_name}")
    print(f"  Duration: 60 seconds (fixed)")
    print(f"  Activity range: {activity_range[0]:.1f} - {activity_range[1]:.1f} Bq")
    print(f"  Workers: {num_workers}")
    print(f"\nScenario distribution:")
    for s in scenarios:
        count = int(num_samples * s.fraction)
        print(f"  {s.name}: {s.fraction*100:.1f}% (~{count:,} samples)")
    print()
    # Shared config for all workers
    shared_config = {
        'detector_name': detector_name,
        'output_dir': str(output_dir),
        'activity_range': activity_range,
        'bg_intensity_min': bg_intensity_range[0],
        'bg_intensity_max': bg_intensity_range[1],
        'base_seed': random_seed,
        'scenarios': scenarios,
    }
    # Create work items
    work_items = [(i, shared_config) for i in range(num_samples)]
    # Progress tracking
    start_time = time.time()
    completed = 0
    failed = 0
    last_report = 0
    print(f"Starting generation...")
    # Generate in parallel
    with Pool(num_workers) as pool:
        for result in pool.imap_unordered(generate_single_sample, work_items, chunksize=100):
            if result is not None:
                completed += 1
            else:
                failed += 1
            total = completed + failed
            # Progress report every 1%
            if total - last_report >= num_samples // 100 or total == num_samples:
                elapsed = time.time() - start_time
                rate = completed / elapsed if elapsed > 0 else 0
                eta = (num_samples - total) / rate if rate > 0 else 0
                print(f"\r  Progress: {total:,}/{num_samples:,} ({100*total/num_samples:.1f}%) | "
                      f"Rate: {rate:.1f}/s | "
                      f"ETA: {eta/60:.1f}m | "
                      f"Failed: {failed}", end="", flush=True)
                last_report = total
    total_time = time.time() - start_time
    print(f"\n\nGeneration complete!")
    print(f"  Total time: {total_time/60:.1f} minutes")
    print(f"  Successful: {completed:,}")
    print(f"  Failed: {failed}")
    print(f"  Rate: {completed/total_time:.1f} samples/second")
    return completed
 def main():
    parser = argparse.ArgumentParser(description='Generate synthetic gamma spectra v3')
    parser.add_argument('--num_samples', '-n', type=int, default=200000,
                        help='Number of samples to generate')
    parser.add_argument('--output_dir', '-o', type=str, default='data/synthetic',
                        help='Output directory')
    parser.add_argument('--detector', '-d', type=str, default='radiacode_103',
                        help='Detector type')
    parser.add_argument('--workers', '-w', type=int, default=None,
                        help='Number of parallel workers')
    parser.add_argument('--seed', '-s', type=int, default=None,
                        help='Random seed')
    parser.add_argument('--activity_min', type=float, default=1.0,
                        help='Minimum activity in Bq')
    parser.add_argument('--activity_max', type=float, default=100.0,
                        help='Maximum activity in Bq')
    args = parser.parse_args()
    generate_training_data_v3(
        num_samples=args.num_samples,
        output_dir=Path(args.output_dir),
        detector_name=args.detector,
        activity_range=(args.activity_min, args.activity_max),
        num_workers=args.workers,
        random_seed=args.seed,
    )
 if __name__ == '__main__':
    main()
--- a/train/vega_ml/synthetic_spectra/generator.py
+++ b/train/vega_ml/synthetic_spectra/generator.py
@ -0,0 +1,474 @@
 """
 Synthetic Spectrum Generator
 Main class for generating synthetic gamma spectra images
 with various isotope combinations and configurations.
 """
 import numpy as np
 from dataclasses import dataclass, field
 from typing import List, Dict, Optional, Tuple, Any
 import json
 from pathlib import Path
 from datetime import datetime
 import hashlib
 from .config import DetectorConfig, get_default_config, RADIACODE_CONFIGS
 from .ground_truth import (
    ISOTOPE_DATABASE,
    Isotope,
    get_isotope,
    get_all_isotopes,
    DECAY_CHAINS,
    get_chain_daughters,
    infer_parent_from_daughters,
 )
 from .physics import (
    PeakParameters,
    generate_peak_spectrum,
    generate_environmental_background,
    apply_poisson_noise,
    apply_electronic_noise,
    normalize_spectrum,
 )
@dataclass
 class IsotopeSource:
    """Definition of an isotope source for spectrum generation."""
    isotope_name: str
    activity_bq: float
    # Optional: if part of a decay chain, include daughters
    include_daughters: bool = True
    # Activity can vary by this factor for augmentation
    activity_variation: float = 0.0
@dataclass
 class SpectrumConfig:
    """Configuration for a single spectrum generation."""
    # Time parameters
    duration_seconds: float = 60.0
    time_interval_seconds: float = 1.0  # Each row in the spectrogram
    # Sources to include
    sources: List[IsotopeSource] = field(default_factory=list)
    # Background options
    include_background: bool = True
    background_cps: float = 5.0
    include_k40: bool = True
    include_radon: bool = True
    include_thorium: bool = True
    # Detector configuration
    detector_name: str = "radiacode_103"
    # Noise options
    apply_poisson: bool = True
    apply_electronic: bool = False
    electronic_noise_sigma: float = 0.5
    # Normalization
    normalize: bool = True
    normalization_method: str = "max"  # max, sum, log, sqrt
@dataclass
 class GeneratedSpectrum:
    """Result of spectrum generation."""
    # The spectrum data (2D array: time x channels)
    data: np.ndarray
    # Metadata
    config: SpectrumConfig
    isotopes_present: List[str]
    background_isotopes: List[str]
    # For labels/annotations
    labels: Dict[str, Any] = field(default_factory=dict)
    # Unique identifier
    sample_id: str = ""
    # Generation timestamp
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
 class SpectrumGenerator:
    """
    Main class for generating synthetic gamma spectra.
    Creates 2D spectrogram images where:
    - X-axis: Energy channels (1023 channels, 20-3000 keV)
    - Y-axis: Time intervals (variable duration)
    - Pixel intensity: Normalized count rate
    """
    def __init__(
        self,
        detector_config: Optional[DetectorConfig] = None,
        random_seed: Optional[int] = None
    ):
        """
        Initialize the spectrum generator.
        Args:
            detector_config: Detector configuration (default: Radiacode 103)
            random_seed: Random seed for reproducibility
        """
        if detector_config is None:
            detector_config = get_default_config()
        self.detector_config = detector_config
        self.energy_bins = detector_config.get_energy_bins()
        self.num_channels = len(self.energy_bins)
        if random_seed is not None:
            np.random.seed(random_seed)
    def generate_single_interval(
        self,
        sources: List[IsotopeSource],
        interval_duration: float,
        include_background: bool = True,
        background_config: Optional[Dict] = None
    ) -> Tuple[np.ndarray, List[str], List[str]]:
        """
        Generate a single time interval spectrum.
        Args:
            sources: List of isotope sources
            interval_duration: Duration in seconds
            include_background: Whether to include environmental background
            background_config: Background configuration options
        Returns:
            Tuple of (spectrum, source_isotopes, background_isotopes)
        """
        spectrum = np.zeros(self.num_channels)
        source_isotopes = []
        background_isotopes = []
        # Add background
        if include_background:
            if background_config is None:
                background_config = {}
            bg_spectrum, bg_isotopes = generate_environmental_background(
                self.energy_bins,
                interval_duration,
                background_cps=background_config.get('background_cps', 5.0),
                include_k40=background_config.get('include_k40', True),
                include_radon=background_config.get('include_radon', True),
                include_thorium=background_config.get('include_thorium', True),
                detector_config=self.detector_config
            )
            spectrum += bg_spectrum
            background_isotopes = bg_isotopes
        # Add source isotopes
        for source in sources:
            isotope = get_isotope(source.isotope_name)
            if isotope is None:
                print(f"Warning: Unknown isotope {source.isotope_name}")
                continue
            # Apply activity variation if specified
            activity = source.activity_bq
            if source.activity_variation > 0:
                variation = 1 + np.random.uniform(
                    -source.activity_variation,
                    source.activity_variation
                )
                activity *= variation
            # Add gamma lines from this isotope
            for gamma_line in isotope.gamma_lines:
                peak_params = PeakParameters(
                    energy_kev=gamma_line.energy_kev,
                    intensity=gamma_line.intensity,
                    activity_bq=activity,
                    live_time_s=interval_duration
                )
                peak = generate_peak_spectrum(
                    self.energy_bins,
                    peak_params,
                    self.detector_config
                )
                spectrum += peak
            source_isotopes.append(source.isotope_name)
            # Include daughters if requested
            if source.include_daughters and isotope.daughters:
                for daughter_name in isotope.daughters:
                    daughter = get_isotope(daughter_name)
                    if daughter:
                        for gamma_line in daughter.gamma_lines:
                            peak_params = PeakParameters(
                                energy_kev=gamma_line.energy_kev,
                                intensity=gamma_line.intensity,
                                activity_bq=activity,  # Secular equilibrium assumed
                                live_time_s=interval_duration
                            )
                            peak = generate_peak_spectrum(
                                self.energy_bins,
                                peak_params,
                                self.detector_config
                            )
                            spectrum += peak
                        source_isotopes.append(daughter_name)
        return spectrum, list(set(source_isotopes)), background_isotopes
    def generate_spectrum(
        self,
        config: SpectrumConfig
    ) -> GeneratedSpectrum:
        """
        Generate a cumulative 1D spectrum (sum over time).
        Instead of creating a 2D spectrogram (time x channels), this produces
        a 1D spectrum by generating the full duration at once — matching how
        a real detector accumulates counts. This avoids massive memory usage
        with long durations.
        Args:
            config: Spectrum configuration
        Returns:
            GeneratedSpectrum object with 1D data (num_channels,)
        """
        # Set detector config
        if config.detector_name in RADIACODE_CONFIGS:
            self.detector_config = RADIACODE_CONFIGS[config.detector_name]
            self.energy_bins = self.detector_config.get_energy_bins()
            self.num_channels = len(self.energy_bins)
        all_source_isotopes = []
        all_background_isotopes = []
        # Generate the full-duration spectrum at once (like a real detector)
        spectrum, src_iso, bg_iso = self.generate_single_interval(
            config.sources,
            config.duration_seconds,  # Full duration, not per-interval
            config.include_background,
            background_config={
                'background_cps': config.background_cps,
                'include_k40': config.include_k40,
                'include_radon': config.include_radon,
                'include_thorium': config.include_thorium,
            }
        )
        all_source_isotopes.extend(src_iso)
        all_background_isotopes.extend(bg_iso)
        # Apply noise
        if config.apply_poisson:
            spectrum = apply_poisson_noise(spectrum)
        if config.apply_electronic:
            spectrum = apply_electronic_noise(
                spectrum,
                config.electronic_noise_sigma
            )
        # Normalize if requested
        if config.normalize:
            spectrum = normalize_spectrum(spectrum, config.normalization_method)
        # Generate unique sample ID
        sample_id = self._generate_sample_id(config)
        # Determine isotopes present
        isotopes_present = list(set(all_source_isotopes))
        background_isotopes = list(set(all_background_isotopes))
        # Create labels
        labels = {
            'isotopes': isotopes_present,
            'background_isotopes': background_isotopes,
            'source_activities_bq': {
                s.isotope_name: s.activity_bq for s in config.sources
            },
            'duration_seconds': config.duration_seconds,
            'detector': config.detector_name,
            'normalized': config.normalize,
            'normalization_method': config.normalization_method if config.normalize else None,
        }
        return GeneratedSpectrum(
            data=spectrum,  # 1D array (num_channels,)
            config=config,
            isotopes_present=isotopes_present,
            background_isotopes=background_isotopes,
            labels=labels,
            sample_id=sample_id
        )
    def _generate_sample_id(self, config: SpectrumConfig) -> str:
        """Generate a unique sample ID from config."""
        # Create a hash from config parameters
        hash_input = f"{datetime.now().timestamp()}"
        hash_input += f"_{config.duration_seconds}"
        hash_input += f"_{','.join(s.isotope_name for s in config.sources)}"
        hash_input += f"_{np.random.randint(0, 1000000)}"
        return hashlib.md5(hash_input.encode()).hexdigest()[:12]
    def generate_random_spectrum(
        self,
        duration_range: Tuple[float, float] = (60, 300),
        num_isotopes_range: Tuple[int, int] = (1, 3),
        activity_range: Tuple[float, float] = (1.0, 100.0),
        isotope_pool: Optional[List[str]] = None,
        **kwargs
    ) -> GeneratedSpectrum:
        """
        Generate a spectrum with random parameters.
        Args:
            duration_range: (min, max) duration in seconds
            num_isotopes_range: (min, max) number of isotopes to include
            activity_range: (min, max) activity in Bq
            isotope_pool: List of isotope names to choose from (default: all with gammas)
            **kwargs: Additional arguments passed to SpectrumConfig
        Returns:
            GeneratedSpectrum with random configuration
        """
        # Choose duration
        duration = np.random.uniform(*duration_range)
        # Choose number of isotopes
        num_isotopes = np.random.randint(num_isotopes_range[0], num_isotopes_range[1] + 1)
        # Build isotope pool if not provided
        if isotope_pool is None:
            isotope_pool = [
                iso.name for iso in get_all_isotopes()
                if len(iso.gamma_lines) > 0 and
                any(line.intensity > 0.01 for line in iso.gamma_lines)
            ]
        # Select random isotopes
        selected = np.random.choice(isotope_pool, size=min(num_isotopes, len(isotope_pool)), replace=False)
        # Create sources with random activities
        sources = []
        for isotope_name in selected:
            activity = np.random.uniform(*activity_range)
            sources.append(IsotopeSource(
                isotope_name=isotope_name,
                activity_bq=activity,
                include_daughters=np.random.random() > 0.3
            ))
        # Create config
        config = SpectrumConfig(
            duration_seconds=duration,
            sources=sources,
            **kwargs
        )
        return self.generate_spectrum(config)
 def save_spectrum(
    spectrum: GeneratedSpectrum,
    output_dir: Path,
    save_image: bool = True,
    image_format: str = 'npy',
    save_individual_label: bool = True
 ) -> Dict[str, str]:
    """
    Save a generated spectrum to disk.
    Args:
        spectrum: GeneratedSpectrum to save
        output_dir: Output directory path
        save_image: Whether to save the spectrum data as an image/array
        image_format: Format for spectrum data ('npy', 'png', 'both')
        save_individual_label: Whether to save individual JSON label file per sample
    Returns:
        Dict of saved file paths
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    saved_files = {}
    base_name = f"spectrum_{spectrum.sample_id}"
    # Save spectrum data
    if save_image:
        if image_format in ('npy', 'both'):
            npy_path = output_dir / f"{base_name}.npy"
            np.save(npy_path, spectrum.data)
            saved_files['npy'] = str(npy_path)
        if image_format in ('png', 'both'):
            try:
                from PIL import Image
                # Convert to 8-bit grayscale image
                data_normalized = spectrum.data
                if data_normalized.max() > 0:
                    data_normalized = data_normalized / data_normalized.max()
                img_data = (data_normalized * 255).astype(np.uint8)
                img = Image.fromarray(img_data, mode='L')
                png_path = output_dir / f"{base_name}.png"
                img.save(png_path)
                saved_files['png'] = str(png_path)
            except ImportError:
                print("Warning: PIL not installed, skipping PNG save")
    # Save individual label JSON file (for efficient loading)
    if save_individual_label:
        json_path = output_dir / f"{base_name}.json"
        with open(json_path, 'w') as f:
            json.dump(spectrum.labels, f, indent=2)
        saved_files['json'] = str(json_path)
    saved_files['sample_id'] = spectrum.sample_id
    return saved_files
 def generate_labels_json(
    spectra: List[GeneratedSpectrum],
    output_path: Path
 ) -> None:
    """
    Generate a combined JSON file with labels for all spectra.
    Note: This is for backward compatibility. For large datasets,
    individual JSON files per sample are more efficient.
    Args:
        spectra: List of generated spectra
        output_path: Path to save labels JSON
    """
    labels = {
        'metadata': {
            'generated_at': datetime.now().isoformat(),
            'num_samples': len(spectra),
            'channels': 1023,
            'energy_range_kev': [20, 3000],
        },
        'samples': {}
    }
    for spectrum in spectra:
        labels['samples'][spectrum.sample_id] = spectrum.labels
    with open(output_path, 'w') as f:
        json.dump(labels, f, indent=2)
--- a/train/vega_ml/synthetic_spectra/ground_truth/init.py
+++ b/train/vega_ml/synthetic_spectra/ground_truth/init.py
@ -0,0 +1,29 @@
 """
 Ground Truth Module
 Contains isotope data, decay chains, and chain signatures for
 synthetic spectra generation.
 """
 from .isotope_data import (
    ISOTOPE_DATABASE,
    Isotope,
    GammaLine,
    IsotopeCategory,
    get_isotope,
    get_all_isotopes,
    get_isotope_names,
    get_isotopes_by_category,
    get_isotopes_with_gamma_in_range,
    SECOND, MINUTE, HOUR, DAY, YEAR, STABLE
 )
 from .decay_chains import (
    DECAY_CHAINS,
    CHAIN_SIGNATURES,
    DecayChain,
    ChainSignature,
    get_decay_chain,
    get_chain_daughters,
    infer_parent_from_daughters,
 )
--- a/train/vega_ml/synthetic_spectra/ground_truth/decay_chains.py
+++ b/train/vega_ml/synthetic_spectra/ground_truth/decay_chains.py
@ -0,0 +1,320 @@
 """
 Decay Chain Definitions
 Defines radioactive decay chains and their relationships, including:
 - U-238 decay chain (Uranium series)
 - Th-232 decay chain (Thorium series)
 - U-235 decay chain (Actinium series)
 Also includes chain signatures - groups of isotopes that commonly
 appear together and indicate parent isotopes.
 """
 from dataclasses import dataclass, field
 from typing import List, Dict, Set, Optional, Tuple
 from .isotope_data import ISOTOPE_DATABASE, Isotope
@dataclass
 class DecayChainMember:
    """A member of a decay chain with branching ratio."""
    isotope_name: str
    branching_ratio: float = 1.0  # Fraction of decays following this path
    decay_mode: str = ""
@dataclass
 class DecayChain:
    """Complete decay chain definition."""
    name: str
    parent: str
    members: List[DecayChainMember]
    description: str = ""
    def get_member_names(self) -> List[str]:
        """Get list of all member isotope names."""
        return [m.isotope_name for m in self.members]
    def get_gamma_emitters(self) -> List[str]:
        """Get members that have significant gamma emissions."""
        emitters = []
        for member in self.members:
            iso = ISOTOPE_DATABASE.get(member.isotope_name)
            if iso and len(iso.gamma_lines) > 0:
                # Check if any line has significant intensity
                if any(line.intensity > 0.01 for line in iso.gamma_lines):
                    emitters.append(member.isotope_name)
        return emitters
@dataclass
 class ChainSignature:
    """
    Signature pattern of isotopes that indicate presence of a parent.
    When these daughter isotopes appear together in a spectrum,
    it strongly indicates the presence of the parent isotope
    (even if parent has weak/no gamma emissions).
    """
    name: str
    parent_chain: str  # Name of the decay chain
    inferred_parent: str  # Parent isotope that is indicated
    required_daughters: Set[str]  # Must see all of these
    optional_daughters: Set[str] = field(default_factory=set)  # May also see
    description: str = ""
 # =============================================================================
 # DECAY CHAINS
 # =============================================================================
 DECAY_CHAINS: Dict[str, DecayChain] = {}
 # U-238 DECAY CHAIN (Uranium Series)
 # U-238 -> Th-234 -> Pa-234m -> U-234 -> Th-230 -> Ra-226 -> Rn-222 ->
 # Po-218 -> Pb-214 -> Bi-214 -> Po-214 -> Pb-210 -> Bi-210 -> Po-210 -> Pb-206
 DECAY_CHAINS["U-238"] = DecayChain(
    name="U-238 Decay Chain (Uranium Series)",
    parent="U-238",
    description="14 step decay chain ending at stable Pb-206",
    members=[
        DecayChainMember("U-238", decay_mode="alpha"),
        DecayChainMember("Th-234", decay_mode="beta-"),
        DecayChainMember("Pa-234m", branching_ratio=0.998, decay_mode="beta-"),
        DecayChainMember("U-234", decay_mode="alpha"),
        DecayChainMember("Th-230", decay_mode="alpha"),
        DecayChainMember("Ra-226", decay_mode="alpha"),
        DecayChainMember("Rn-222", decay_mode="alpha"),
        DecayChainMember("Po-218", decay_mode="alpha"),
        DecayChainMember("Pb-214", decay_mode="beta-"),
        DecayChainMember("Bi-214", branching_ratio=0.9998, decay_mode="beta-"),
        DecayChainMember("Po-214", decay_mode="alpha"),
        DecayChainMember("Pb-210", decay_mode="beta-"),
        DecayChainMember("Bi-210", decay_mode="beta-"),
        DecayChainMember("Po-210", decay_mode="alpha"),
    ]
 )
 # TH-232 DECAY CHAIN (Thorium Series)
 # Th-232 -> Ra-228 -> Ac-228 -> Th-228 -> Ra-224 -> Rn-220 ->
 # Po-216 -> Pb-212 -> Bi-212 -> (Tl-208 or Po-212) -> Pb-208
 DECAY_CHAINS["Th-232"] = DecayChain(
    name="Th-232 Decay Chain (Thorium Series)",
    parent="Th-232",
    description="10+ step decay chain ending at stable Pb-208",
    members=[
        DecayChainMember("Th-232", decay_mode="alpha"),
        DecayChainMember("Ra-228", decay_mode="beta-"),
        DecayChainMember("Ac-228", decay_mode="beta-"),
        DecayChainMember("Th-228", decay_mode="alpha"),
        DecayChainMember("Ra-224", decay_mode="alpha"),
        DecayChainMember("Rn-220", decay_mode="alpha"),
        DecayChainMember("Po-216", decay_mode="alpha"),
        DecayChainMember("Pb-212", decay_mode="beta-"),
        DecayChainMember("Bi-212", decay_mode="beta-/alpha"),
        DecayChainMember("Tl-208", branching_ratio=0.3594, decay_mode="beta-"),
        DecayChainMember("Po-212", branching_ratio=0.6406, decay_mode="alpha"),
    ]
 )
 # U-235 DECAY CHAIN (Actinium Series)
 # U-235 -> Th-231 -> Pa-231 -> Ac-227 -> (complex branching) -> Pb-207
 DECAY_CHAINS["U-235"] = DecayChain(
    name="U-235 Decay Chain (Actinium Series)",
    parent="U-235",
    description="11+ step decay chain ending at stable Pb-207",
    members=[
        DecayChainMember("U-235", decay_mode="alpha"),
        DecayChainMember("Th-231", decay_mode="beta-"),
        DecayChainMember("Pa-231", decay_mode="alpha"),
        DecayChainMember("Ac-227", decay_mode="beta-/alpha"),
        DecayChainMember("Pb-211", decay_mode="beta-"),
        DecayChainMember("Bi-211", decay_mode="alpha"),
        DecayChainMember("Tl-207", decay_mode="beta-"),
    ]
 )
 # Cs-137 -> Ba-137m (simple 2-step)
 DECAY_CHAINS["Cs-137"] = DecayChain(
    name="Cs-137 Decay",
    parent="Cs-137",
    description="Cs-137 beta decay to Ba-137m metastable state",
    members=[
        DecayChainMember("Cs-137", decay_mode="beta-"),
        DecayChainMember("Ba-137m", decay_mode="IT"),
    ]
 )
 # =============================================================================
 # CHAIN SIGNATURES
 # =============================================================================
 CHAIN_SIGNATURES: Dict[str, ChainSignature] = {}
 # Radon-222 progeny (from U-238 chain via Ra-226)
 # Seeing Pb-214 + Bi-214 together indicates radon presence
 CHAIN_SIGNATURES["Rn-222_progeny"] = ChainSignature(
    name="Radon-222 Progeny",
    parent_chain="U-238",
    inferred_parent="Rn-222",
    required_daughters={"Pb-214", "Bi-214"},
    optional_daughters={"Po-214"},
    description="Pb-214 + Bi-214 indicates airborne Rn-222 (radon) daughters"
 )
 # Extended U-238 chain indicator
 CHAIN_SIGNATURES["Ra-226_equilibrium"] = ChainSignature(
    name="Ra-226 Secular Equilibrium",
    parent_chain="U-238",
    inferred_parent="Ra-226",
    required_daughters={"Pb-214", "Bi-214"},
    optional_daughters={"Rn-222", "Po-214", "Pb-210"},
    description="Indicates Ra-226 or U-238 in secular equilibrium"
 )
 # Thoron progeny (from Th-232 chain)
 # Seeing Pb-212 + Bi-212 + Tl-208 indicates thoron/thorium
 CHAIN_SIGNATURES["Rn-220_progeny"] = ChainSignature(
    name="Thoron (Rn-220) Progeny",
    parent_chain="Th-232",
    inferred_parent="Rn-220",
    required_daughters={"Pb-212", "Bi-212"},
    optional_daughters={"Tl-208", "Po-212"},
    description="Pb-212 + Bi-212 indicates Rn-220 (thoron) daughters"
 )
 # Th-232 chain indicator (Ac-228 is key)
 CHAIN_SIGNATURES["Th-232_equilibrium"] = ChainSignature(
    name="Th-232 Secular Equilibrium",
    parent_chain="Th-232",
    inferred_parent="Th-232",
    required_daughters={"Ac-228", "Pb-212", "Tl-208"},
    optional_daughters={"Bi-212", "Ra-224"},
    description="Ac-228 + Pb-212 + Tl-208 indicates Th-232 chain in equilibrium"
 )
 # U-235 presence (direct gamma)
 CHAIN_SIGNATURES["U-235_direct"] = ChainSignature(
    name="U-235 Direct",
    parent_chain="U-235",
    inferred_parent="U-235",
    required_daughters={"U-235"},  # U-235 has direct 185.7 keV line
    optional_daughters={"Th-231", "Pa-231"},
    description="U-235 directly visible via 185.7 keV line"
 )
 # =============================================================================
 # HELPER FUNCTIONS
 # =============================================================================
 def get_decay_chain(name: str) -> Optional[DecayChain]:
    """Get a decay chain by parent isotope name."""
    return DECAY_CHAINS.get(name)
 def get_chain_daughters(parent: str, include_parent: bool = True) -> List[str]:
    """
    Get all daughter isotopes in a decay chain.
    Args:
        parent: Parent isotope name (e.g., "U-238")
        include_parent: Whether to include the parent in the list
    Returns:
        List of isotope names in the chain
    """
    chain = DECAY_CHAINS.get(parent)
    if chain is None:
        return [parent] if include_parent else []
    daughters = chain.get_member_names()
    if not include_parent and daughters and daughters[0] == parent:
        daughters = daughters[1:]
    return daughters
 def infer_parent_from_daughters(
    detected_isotopes: Set[str]
 ) -> List[Tuple[str, ChainSignature, float]]:
    """
    Given a set of detected isotopes, infer possible parent isotopes.
    Args:
        detected_isotopes: Set of isotope names detected in spectrum
    Returns:
        List of (parent_name, signature, confidence) tuples
        Confidence is fraction of required daughters detected (1.0 = all)
    """
    results = []
    for sig_name, signature in CHAIN_SIGNATURES.items():
        required_found = detected_isotopes & signature.required_daughters
        if len(required_found) > 0:
            confidence = len(required_found) / len(signature.required_daughters)
            optional_found = detected_isotopes & signature.optional_daughters
            # Boost confidence slightly if optional daughters also found
            if len(signature.optional_daughters) > 0:
                bonus = 0.1 * len(optional_found) / len(signature.optional_daughters)
                confidence = min(1.0, confidence + bonus)
            results.append((signature.inferred_parent, signature, confidence))
    # Sort by confidence (highest first)
    results.sort(key=lambda x: x[2], reverse=True)
    return results
 def get_equilibrium_ratios(chain_name: str) -> Dict[str, float]:
    """
    Get secular equilibrium activity ratios for a decay chain.
    In secular equilibrium, all daughter activities equal the parent activity.
    This returns relative activity fractions (all 1.0 for secular equilibrium).
    For non-equilibrium, this can be modified to return time-dependent ratios.
    """
    chain = DECAY_CHAINS.get(chain_name)
    if chain is None:
        return {}
    # In secular equilibrium, all activities are equal
    return {m.isotope_name: 1.0 for m in chain.members}
 def get_visible_chain_gammas(
    chain_name: str,
    min_intensity: float = 0.01
 ) -> Dict[str, List[Tuple[float, float]]]:
    """
    Get all visible gamma lines from a decay chain.
    Args:
        chain_name: Name of the decay chain parent
        min_intensity: Minimum emission intensity to include
    Returns:
        Dict mapping isotope name to list of (energy_keV, intensity) tuples
    """
    chain = DECAY_CHAINS.get(chain_name)
    if chain is None:
        return {}
    result = {}
    for member in chain.members:
        iso = ISOTOPE_DATABASE.get(member.isotope_name)
        if iso:
            lines = [
                (line.energy_kev, line.intensity * member.branching_ratio)
                for line in iso.gamma_lines
                if line.intensity >= min_intensity
            ]
            if lines:
                result[member.isotope_name] = lines
    return result
--- a/train/vega_ml/synthetic_spectra/ground_truth/isotope_data.py
+++ b/train/vega_ml/synthetic_spectra/ground_truth/isotope_data.py
--- a/train/vega_ml/synthetic_spectra/physics/init.py
+++ b/train/vega_ml/synthetic_spectra/physics/init.py
@ -0,0 +1,26 @@
 """
 Physics Module
 Contains spectrum generation physics including:
 - Peak shape modeling
 - Background generation
 - Detector response
 - Counting statistics
 """
 from .spectrum_physics import (
    PeakParameters,
    gaussian_peak,
    calculate_fwhm,
    fwhm_to_sigma,
    detector_efficiency,
    calculate_expected_counts,
    generate_peak_spectrum,
    generate_compton_continuum,
    generate_exponential_background,
    generate_polynomial_background,
    generate_environmental_background,
    apply_poisson_noise,
    apply_electronic_noise,
    normalize_spectrum,
 )
--- a/train/vega_ml/synthetic_spectra/physics/spectrum_physics.py
+++ b/train/vega_ml/synthetic_spectra/physics/spectrum_physics.py
@ -0,0 +1,553 @@
 """
 Spectrum Physics Module
 Implements the physics of gamma spectrum generation including:
 - Peak shape modeling (Gaussian with detector response)
 - Background continuum generation
 - Counting statistics (Poisson sampling)
 - Detector efficiency modeling
 """
 import numpy as np
 from scipy import special
 from typing import Optional, Tuple, List
 from dataclasses import dataclass
 from ..config import DetectorConfig, get_default_config
@dataclass
 class PeakParameters:
    """Parameters for a single gamma peak."""
    energy_kev: float
    intensity: float  # Emission probability (photons/decay)
    activity_bq: float  # Source activity in Becquerels
    live_time_s: float  # Acquisition time in seconds
 def gaussian_peak(
    energy_bins: np.ndarray,
    peak_energy: float,
    sigma: float,
    amplitude: float
 ) -> np.ndarray:
    """
    Generate a Gaussian peak.
    Args:
        energy_bins: Array of energy bin centers (keV)
        peak_energy: Center energy of peak (keV)
        sigma: Standard deviation (keV)
        amplitude: Peak area (total counts)
    Returns:
        Array of counts in each bin
    """
    # Gaussian probability density
    prob = np.exp(-0.5 * ((energy_bins - peak_energy) / sigma) ** 2)
    prob /= (sigma * np.sqrt(2 * np.pi))
    # Scale by amplitude and bin width
    bin_width = energy_bins[1] - energy_bins[0] if len(energy_bins) > 1 else 1.0
    return amplitude * prob * bin_width
 def calculate_fwhm(energy_kev: float, fwhm_at_662: float = 0.084) -> float:
    """
    Calculate FWHM at a given energy for scintillator detectors.
    FWHM scales as sqrt(E) for scintillators due to statistical fluctuations
    in light collection.
    FWHM(E) = FWHM_662 * sqrt(E/662) * 662 / E * E = FWHM_662 * sqrt(662/E) * E
    Actually: FWHM(E) / E = FWHM_662 / 662 * sqrt(662/E)
    So: FWHM(E) = E * FWHM_662 / 662 * sqrt(662/E) = FWHM_662 * sqrt(662 * E) / 662
                = FWHM_662 * sqrt(E / 662)
    Wait, let me recalculate:
    For scintillators, the relative resolution (FWHM/E) scales as 1/sqrt(E)
    FWHM(E)/E = (FWHM_662/662) * sqrt(662/E)
    FWHM(E) = FWHM_662 * sqrt(662 * E) / 662 = FWHM_662 * sqrt(E/662)
    At 662 keV: FWHM = FWHM_662 * sqrt(1) = FWHM_662 ✓
    At lower E: larger relative FWHM (worse resolution)
    At higher E: smaller relative FWHM (better resolution)
    Args:
        energy_kev: Energy in keV
        fwhm_at_662: FWHM at 662 keV as fraction (e.g., 0.084 for 8.4%)
    Returns:
        FWHM in keV at the given energy
    """
    # FWHM_662 is given as fraction, so at 662 keV, FWHM = 0.084 * 662 = ~55.6 keV
    fwhm_662_kev = fwhm_at_662 * 662.0
    # Scale by sqrt(E/662)
    fwhm_kev = fwhm_662_kev * np.sqrt(energy_kev / 662.0)
    return fwhm_kev
 def fwhm_to_sigma(fwhm: float) -> float:
    """Convert FWHM to Gaussian sigma."""
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # ≈ FWHM / 2.355
 def detector_efficiency(
    energy_kev: float,
    detector_config: Optional[DetectorConfig] = None
 ) -> float:
    """
    Calculate detector full-energy peak efficiency.
    For CsI and GAGG scintillators, efficiency varies with energy.
    This is a simplified model - real efficiency curves should be
    measured for each detector.
    Args:
        energy_kev: Gamma energy in keV
        detector_config: Detector configuration
    Returns:
        Efficiency as fraction (0-1)
    """
    if detector_config is None:
        detector_config = get_default_config()
    # Simplified efficiency model for ~1 cm³ scintillator
    # Low energy: efficiency increases (more stopping power)
    # High energy: efficiency decreases (photons pass through)
    # Peak around 100-300 keV for small scintillators
    # This is a phenomenological model
    # Real efficiency should be calibrated
    if energy_kev < 20:
        return 0.0
    # Simple model: efficiency peaks around 100-200 keV
    # Falls off at low energy (absorption in housing)
    # Falls off at high energy (less stopping power)
    # Low energy cutoff (absorption)
    low_eff = 1.0 - np.exp(-energy_kev / 50.0)
    # High energy falloff (escape)
    # For 1 cm³ CsI, efficiency drops significantly above ~500 keV
    high_eff = np.exp(-energy_kev / 2000.0)
    # Combine effects
    eff = 0.8 * low_eff * high_eff
    # Scale by detector volume
    volume_factor = (detector_config.detector_volume_cm3 / 1.0) ** (1/3)
    eff *= min(1.0, volume_factor)
    return max(0.0, min(1.0, eff))
 def calculate_expected_counts(
    peak_params: PeakParameters,
    detector_config: Optional[DetectorConfig] = None
 ) -> float:
    """
    Calculate expected counts in a photopeak.
    λ = A * t * I * ε * T
    Where:
        A = activity (decays/s)
        t = live time (s)
        I = emission probability (photons/decay)
        ε = detector efficiency
        T = transmission factor (assumed 1 for now)
    Args:
        peak_params: Peak parameters
        detector_config: Detector configuration
    Returns:
        Expected number of counts in the photopeak
    """
    if detector_config is None:
        detector_config = get_default_config()
    efficiency = detector_efficiency(peak_params.energy_kev, detector_config)
    expected = (
        peak_params.activity_bq *
        peak_params.live_time_s *
        peak_params.intensity *
        efficiency
    )
    return expected
 def generate_peak_spectrum(
    energy_bins: np.ndarray,
    peak_params: PeakParameters,
    detector_config: Optional[DetectorConfig] = None
 ) -> np.ndarray:
    """
    Generate a single gamma peak with detector response.
    Args:
        energy_bins: Array of energy bin centers (keV)
        peak_params: Peak parameters
        detector_config: Detector configuration
    Returns:
        Array of expected counts in each bin (not yet Poisson sampled)
    """
    if detector_config is None:
        detector_config = get_default_config()
    # Calculate expected counts
    amplitude = calculate_expected_counts(peak_params, detector_config)
    if amplitude <= 0:
        return np.zeros_like(energy_bins)
    # Calculate peak width
    fwhm_kev = calculate_fwhm(peak_params.energy_kev, detector_config.fwhm_at_662)
    sigma = fwhm_to_sigma(fwhm_kev)
    # Generate Gaussian peak
    peak = gaussian_peak(energy_bins, peak_params.energy_kev, sigma, amplitude)
    return peak
 def generate_compton_continuum(
    energy_bins: np.ndarray,
    peak_energy: float,
    peak_counts: float,
    compton_to_peak_ratio: float = 0.5
 ) -> np.ndarray:
    """
    Generate simplified Compton continuum for a gamma line.
    The Compton continuum extends from 0 to the Compton edge.
    Compton edge energy = E * (1 - 1/(1 + 2*E/(511)))
    Args:
        energy_bins: Array of energy bin centers (keV)
        peak_energy: Energy of the gamma line (keV)
        peak_counts: Total counts in the photopeak
        compton_to_peak_ratio: Ratio of Compton counts to peak counts
    Returns:
        Array of Compton continuum counts
    """
    # Compton edge energy
    alpha = peak_energy / 511.0  # E / m_e c²
    compton_edge = peak_energy * (2 * alpha) / (1 + 2 * alpha)
    # Create continuum (simplified flat + edge shape)
    continuum = np.zeros_like(energy_bins)
    # Mask for energies below Compton edge
    mask = energy_bins < compton_edge
    if np.any(mask):
        # Simple model: roughly flat with enhancement near edge
        base_level = peak_counts * compton_to_peak_ratio / np.sum(mask)
        continuum[mask] = base_level
        # Add edge enhancement (Klein-Nishina-like shape)
        edge_region = (energy_bins > 0.8 * compton_edge) & (energy_bins < compton_edge)
        if np.any(edge_region):
            enhancement = 1.5 * np.exp(-((energy_bins[edge_region] - compton_edge) / (0.05 * compton_edge)) ** 2)
            continuum[edge_region] *= (1 + enhancement)
    return continuum
 # =============================================================================
 # BACKGROUND GENERATION
 # =============================================================================
 def generate_exponential_background(
    energy_bins: np.ndarray,
    amplitude: float = 100.0,
    decay_constant: float = 0.003
 ) -> np.ndarray:
    """
    Generate exponential background continuum.
    B(E) = A * exp(-b * E)
    Args:
        energy_bins: Array of energy bin centers (keV)
        amplitude: Background amplitude at E=0
        decay_constant: Exponential decay constant (1/keV)
    Returns:
        Array of background counts
    """
    return amplitude * np.exp(-decay_constant * energy_bins)
 def generate_polynomial_background(
    energy_bins: np.ndarray,
    coefficients: List[float] = None
 ) -> np.ndarray:
    """
    Generate polynomial background.
    B(E) = Σ c_m * E^m
    Args:
        energy_bins: Array of energy bin centers (keV)
        coefficients: Polynomial coefficients [c0, c1, c2, ...]
    Returns:
        Array of background counts
    """
    if coefficients is None:
        coefficients = [10.0, -0.005, 1e-6]  # Default quadratic
    background = np.zeros_like(energy_bins)
    for m, c in enumerate(coefficients):
        background += c * (energy_bins ** m)
    return np.maximum(0, background)
 def generate_environmental_background(
    energy_bins: np.ndarray,
    duration_seconds: float,
    background_cps: float = 5.0,
    include_k40: bool = True,
    include_radon: bool = True,
    include_thorium: bool = True,
    detector_config: Optional[DetectorConfig] = None
 ) -> Tuple[np.ndarray, List[str]]:
    """
    Generate realistic environmental background spectrum.
    Includes:
    - Exponential continuum (cosmic rays, scattered gammas)
    - K-40 peak (1460 keV) - ubiquitous in environment
    - Radon daughters (Pb-214, Bi-214) - indoor air
    - Thorium daughters (Pb-212, Tl-208) - building materials
    Args:
        energy_bins: Array of energy bin centers (keV)
        duration_seconds: Acquisition time
        background_cps: Average background count rate (cps)
        include_k40: Include potassium-40 peak
        include_radon: Include radon daughter peaks
        include_thorium: Include thorium daughter peaks
        detector_config: Detector configuration
    Returns:
        Tuple of (background_spectrum, list_of_background_isotopes)
    """
    if detector_config is None:
        detector_config = get_default_config()
    background_isotopes = []
    # Start with exponential continuum
    total_continuum_counts = background_cps * duration_seconds * 0.7
    background = generate_exponential_background(
        energy_bins,
        amplitude=total_continuum_counts / 500,
        decay_constant=0.002
    )
    # Normalize continuum to target count rate
    if background.sum() > 0:
        background *= (total_continuum_counts / background.sum())
    # Add K-40 peak (very common)
    if include_k40:
        k40_activity = np.random.uniform(0.5, 5.0)  # Bq
        peak = generate_peak_spectrum(
            energy_bins,
            PeakParameters(
                energy_kev=1460.83,
                intensity=0.1066,
                activity_bq=k40_activity,
                live_time_s=duration_seconds
            ),
            detector_config
        )
        background += peak
        background_isotopes.append("K-40")
    # Add radon daughters
    if include_radon:
        radon_activity = np.random.uniform(0.1, 2.0)  # Bq
        # Pb-214 lines
        for energy, intensity in [(295.22, 0.1842), (351.93, 0.356)]:
            peak = generate_peak_spectrum(
                energy_bins,
                PeakParameters(
                    energy_kev=energy,
                    intensity=intensity,
                    activity_bq=radon_activity,
                    live_time_s=duration_seconds
                ),
                detector_config
            )
            background += peak
        # Bi-214 lines
        for energy, intensity in [(609.31, 0.4549), (1120.29, 0.1492), (1764.49, 0.1531)]:
            peak = generate_peak_spectrum(
                energy_bins,
                PeakParameters(
                    energy_kev=energy,
                    intensity=intensity,
                    activity_bq=radon_activity,
                    live_time_s=duration_seconds
                ),
                detector_config
            )
            background += peak
        background_isotopes.extend(["Pb-214", "Bi-214"])
    # Add thorium daughters
    if include_thorium:
        thorium_activity = np.random.uniform(0.05, 1.0)  # Bq
        # Ac-228 line
        peak = generate_peak_spectrum(
            energy_bins,
            PeakParameters(
                energy_kev=911.20,
                intensity=0.258,
                activity_bq=thorium_activity,
                live_time_s=duration_seconds
            ),
            detector_config
        )
        background += peak
        # Pb-212 line
        peak = generate_peak_spectrum(
            energy_bins,
            PeakParameters(
                energy_kev=238.63,
                intensity=0.436,
                activity_bq=thorium_activity,
                live_time_s=duration_seconds
            ),
            detector_config
        )
        background += peak
        # Tl-208 lines
        for energy, intensity in [(583.19, 0.845 * 0.36), (2614.51, 0.998 * 0.36)]:
            # Branching ratio of 36% for Tl-208 path
            peak = generate_peak_spectrum(
                energy_bins,
                PeakParameters(
                    energy_kev=energy,
                    intensity=intensity,
                    activity_bq=thorium_activity,
                    live_time_s=duration_seconds
                ),
                detector_config
            )
            background += peak
        background_isotopes.extend(["Ac-228", "Pb-212", "Tl-208"])
    return background, background_isotopes
 def apply_poisson_noise(spectrum: np.ndarray) -> np.ndarray:
    """
    Apply Poisson counting statistics to a spectrum.
    Each bin is sampled from a Poisson distribution with
    lambda = expected counts in that bin.
    Args:
        spectrum: Array of expected counts (can be float)
    Returns:
        Array of actual counts (integers)
    """
    # Handle negative values (shouldn't happen but be safe)
    spectrum = np.maximum(0, spectrum)
    # Sample from Poisson distribution
    return np.random.poisson(spectrum).astype(np.float64)
 def apply_electronic_noise(
    spectrum: np.ndarray,
    sigma: float = 0.5
 ) -> np.ndarray:
    """
    Apply small Gaussian electronic noise.
    Args:
        spectrum: Count spectrum
        sigma: Standard deviation of electronic noise (counts)
    Returns:
        Spectrum with added electronic noise
    """
    noise = np.random.normal(0, sigma, spectrum.shape)
    result = spectrum + noise
    return np.maximum(0, result)
 # =============================================================================
 # NORMALIZATION
 # =============================================================================
 def normalize_spectrum(
    spectrum: np.ndarray,
    method: str = "max"
 ) -> np.ndarray:
    """
    Normalize a spectrum for ML training.
    Args:
        spectrum: Raw count spectrum
        method: Normalization method
            - "max": Divide by maximum value (range 0-1)
            - "sum": Divide by total counts (probability distribution)
            - "log": Log transform then max normalize
            - "sqrt": Square root transform then max normalize
    Returns:
        Normalized spectrum
    """
    if method == "max":
        max_val = spectrum.max()
        if max_val > 0:
            return spectrum / max_val
        return spectrum
    elif method == "sum":
        total = spectrum.sum()
        if total > 0:
            return spectrum / total
        return spectrum
    elif method == "log":
        # Log transform (add 1 to handle zeros)
        log_spec = np.log1p(spectrum)
        max_val = log_spec.max()
        if max_val > 0:
            return log_spec / max_val
        return log_spec
    elif method == "sqrt":
        sqrt_spec = np.sqrt(spectrum)
        max_val = sqrt_spec.max()
        if max_val > 0:
            return sqrt_spec / max_val
        return sqrt_spec
    else:
        raise ValueError(f"Unknown normalization method: {method}")
--- a/train/vega_ml/synthetic_spectra/spectrum_viewer.py
+++ b/train/vega_ml/synthetic_spectra/spectrum_viewer.py
@ -0,0 +1,477 @@
 """
 Spectrum Viewer Application
 A simple GUI application to browse and visualize generated synthetic spectra.
 Randomly samples from the available spectra to avoid loading all files at once.
 Usage:
    python -m synthetic_spectra.spectrum_viewer
    Or with options:
    python -m synthetic_spectra.spectrum_viewer --num_samples 200 --data_dir ./data/synthetic/spectra
 """
 import tkinter as tk
 from tkinter import ttk
 import numpy as np
 import json
 from pathlib import Path
 import random
 from typing import Optional, List, Dict, Any
 from .config import RADIACODE_CONFIGS, get_default_config
 # Try to import matplotlib for plotting
 try:
    import matplotlib
    matplotlib.use('TkAgg')
    import matplotlib.pyplot as plt
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg, NavigationToolbar2Tk
    from matplotlib.figure import Figure
    HAS_MATPLOTLIB = True
 except ImportError:
    HAS_MATPLOTLIB = False
    print("Warning: matplotlib not found. Install with: pip install matplotlib")
 class SpectrumViewer:
    """
    GUI application for viewing synthetic gamma spectra.
    """
    def __init__(
        self,
        data_dir: str = "./data/synthetic/spectra",
        num_samples: int = 100,
        random_seed: Optional[int] = None
    ):
        """
        Initialize the spectrum viewer.
        Args:
            data_dir: Directory containing spectrum .npy and .json files
            num_samples: Number of random samples to load (for performance)
            random_seed: Random seed for reproducible sample selection
        """
        self.data_dir = Path(data_dir)
        self.num_samples = num_samples
        if random_seed is not None:
            random.seed(random_seed)
        # Find and sample spectrum files
        self.spectrum_files = self._discover_and_sample_files()
        if not self.spectrum_files:
            raise ValueError(f"No spectrum files found in {self.data_dir}")
        print(f"Loaded {len(self.spectrum_files)} spectrum samples")
        # Current state
        self.current_index = 0
        self.current_spectrum: Optional[np.ndarray] = None
        self.current_metadata: Optional[Dict[str, Any]] = None
        # Setup GUI
        self._setup_gui()
        # Load first spectrum
        self._load_current_spectrum()
    def _discover_and_sample_files(self) -> List[Path]:
        """Find all spectrum files and randomly sample them."""
        # Find all .npy files
        all_npy_files = list(self.data_dir.glob("spectrum_*.npy"))
        if not all_npy_files:
            # Try without prefix
            all_npy_files = list(self.data_dir.glob("*.npy"))
        print(f"Found {len(all_npy_files)} total spectrum files")
        # Randomly sample if we have more than requested
        if len(all_npy_files) > self.num_samples:
            sampled = random.sample(all_npy_files, self.num_samples)
        else:
            sampled = all_npy_files
        # Sort by name for consistent ordering in dropdown
        return sorted(sampled, key=lambda p: p.stem)
    def _setup_gui(self):
        """Setup the tkinter GUI."""
        self.root = tk.Tk()
        self.root.title("Spectrum Viewer - Synthetic Gamma Spectra")
        self.root.geometry("1200x800")
        # Main container
        main_frame = ttk.Frame(self.root, padding="10")
        main_frame.grid(row=0, column=0, sticky="nsew")
        # Configure grid weights for resizing
        self.root.columnconfigure(0, weight=1)
        self.root.rowconfigure(0, weight=1)
        main_frame.columnconfigure(0, weight=1)
        main_frame.rowconfigure(1, weight=1)
        # === Top controls ===
        controls_frame = ttk.Frame(main_frame)
        controls_frame.grid(row=0, column=0, sticky="ew", pady=(0, 10))
        controls_frame.columnconfigure(1, weight=1)
        # Dropdown for spectrum selection
        ttk.Label(controls_frame, text="Select Spectrum:").grid(row=0, column=0, padx=(0, 10))
        self.spectrum_var = tk.StringVar()
        self.spectrum_dropdown = ttk.Combobox(
            controls_frame,
            textvariable=self.spectrum_var,
            values=[f.stem for f in self.spectrum_files],
            state="readonly",
            width=50
        )
        self.spectrum_dropdown.grid(row=0, column=1, sticky="ew", padx=(0, 10))
        self.spectrum_dropdown.bind("<<ComboboxSelected>>", self._on_spectrum_selected)
        self.spectrum_dropdown.current(0)
        # Navigation buttons
        nav_frame = ttk.Frame(controls_frame)
        nav_frame.grid(row=0, column=2)
        ttk.Button(nav_frame, text="◀ Prev", command=self._prev_spectrum).pack(side="left", padx=2)
        ttk.Button(nav_frame, text="Next ▶", command=self._next_spectrum).pack(side="left", padx=2)
        ttk.Button(nav_frame, text="🎲 Random", command=self._random_spectrum).pack(side="left", padx=2)
        # Sample count label
        self.count_label = ttk.Label(
            controls_frame,
            text=f"Showing {len(self.spectrum_files)} of available spectra"
        )
        self.count_label.grid(row=0, column=3, padx=(10, 0))
        # === Plotting area ===
        plot_frame = ttk.Frame(main_frame)
        plot_frame.grid(row=1, column=0, sticky="nsew")
        plot_frame.columnconfigure(0, weight=1)
        plot_frame.rowconfigure(0, weight=1)
        if HAS_MATPLOTLIB:
            # Create matplotlib figure with 2 subplots
            self.fig = Figure(figsize=(12, 6), dpi=100)
            # 2D spectrogram (heatmap)
            self.ax_2d = self.fig.add_subplot(121)
            self.ax_2d.set_title("2D Spectrogram (Time vs Energy)")
            self.ax_2d.set_xlabel("Energy Channel")
            self.ax_2d.set_ylabel("Time Interval (s)")
            # 1D summed spectrum
            self.ax_1d = self.fig.add_subplot(122)
            self.ax_1d.set_title("Summed Spectrum")
            self.ax_1d.set_xlabel("Energy (keV)")
            self.ax_1d.set_ylabel("Counts (normalized)")
            self.fig.tight_layout()
            # Embed in tkinter
            self.canvas = FigureCanvasTkAgg(self.fig, master=plot_frame)
            self.canvas.draw()
            self.canvas.get_tk_widget().grid(row=0, column=0, sticky="nsew")
            # Toolbar
            toolbar_frame = ttk.Frame(plot_frame)
            toolbar_frame.grid(row=1, column=0, sticky="ew")
            self.toolbar = NavigationToolbar2Tk(self.canvas, toolbar_frame)
            self.toolbar.update()
        else:
            ttk.Label(
                plot_frame,
                text="matplotlib not installed. Install with: pip install matplotlib",
                font=("Arial", 14)
            ).grid(row=0, column=0, pady=50)
        # === Metadata panel ===
        metadata_frame = ttk.LabelFrame(main_frame, text="Spectrum Metadata", padding="10")
        metadata_frame.grid(row=2, column=0, sticky="ew", pady=(10, 0))
        self.metadata_text = tk.Text(
            metadata_frame,
            height=10,
            wrap="word",
            font=("Consolas", 10)
        )
        self.metadata_text.pack(fill="both", expand=True)
        # Scrollbar for metadata
        scrollbar = ttk.Scrollbar(metadata_frame, orient="vertical", command=self.metadata_text.yview)
        scrollbar.pack(side="right", fill="y")
        self.metadata_text.configure(yscrollcommand=scrollbar.set)
    def _load_current_spectrum(self):
        """Load the currently selected spectrum and its metadata."""
        if not self.spectrum_files:
            return
        spectrum_path = self.spectrum_files[self.current_index]
        json_path = spectrum_path.with_suffix(".json")
        # Load numpy array
        try:
            self.current_spectrum = np.load(spectrum_path)
            print(f"Loaded spectrum: {spectrum_path.name}, shape: {self.current_spectrum.shape}")
        except Exception as e:
            print(f"Error loading spectrum: {e}")
            self.current_spectrum = None
        # Load metadata JSON
        if json_path.exists():
            try:
                with open(json_path, 'r') as f:
                    self.current_metadata = json.load(f)
            except Exception as e:
                print(f"Error loading metadata: {e}")
                self.current_metadata = None
        else:
            self.current_metadata = None
        # Update display
        self._update_plot()
        self._update_metadata()
    def _update_plot(self):
        """Update the matplotlib plots."""
        if not HAS_MATPLOTLIB or self.current_spectrum is None:
            return
        # Clear previous plots
        self.ax_2d.clear()
        self.ax_1d.clear()
        spectrum = self.current_spectrum
        num_channels = spectrum.shape[1] if len(spectrum.shape) > 1 else len(spectrum)
        # Energy axis: use the same mapping as generation whenever possible.
        detector_name = None
        if isinstance(self.current_metadata, dict):
            detector_name = (
                self.current_metadata.get('detector')
                or self.current_metadata.get('detector_name')
                or (self.current_metadata.get('config') or {}).get('detector_name')
            )
        detector_config = RADIACODE_CONFIGS.get(detector_name, get_default_config())
        energy_bins = detector_config.get_energy_bins()
        if len(energy_bins) != num_channels:
            # Fallback: linear mapping for the available channel count.
            energy_bins = np.linspace(
                detector_config.energy_min_kev,
                detector_config.energy_max_kev,
                num_channels,
                dtype=np.float64
            )
        energy_min = float(energy_bins[0])
        energy_max = float(energy_bins[-1])
        if len(spectrum.shape) == 2:
            # 2D spectrogram
            num_intervals = spectrum.shape[0]
            # Plot 2D heatmap
            im = self.ax_2d.imshow(
                spectrum,
                aspect='auto',
                origin='lower',
                extent=[energy_min, energy_max, 0, num_intervals],
                cmap='viridis'
            )
            self.ax_2d.set_title(f"2D Spectrogram ({num_intervals} time intervals)")
            self.ax_2d.set_xlabel("Energy (keV)")
            self.ax_2d.set_ylabel("Time Interval (s)")
            # Add colorbar - use a dedicated axes to avoid removal issues
            if not hasattr(self, '_cbar_ax') or self._cbar_ax is None:
                # Create a dedicated colorbar axes on first use
                self._cbar_ax = self.fig.add_axes([0.46, 0.55, 0.01, 0.35])
            else:
                self._cbar_ax.clear()
            self._colorbar = self.fig.colorbar(im, cax=self._cbar_ax, label='Counts')
            # Sum across time for 1D spectrum
            summed_spectrum = spectrum.sum(axis=0)
        else:
            # 1D spectrum
            self.ax_2d.text(
                0.5, 0.5, "1D Spectrum\n(No time dimension)",
                ha='center', va='center', transform=self.ax_2d.transAxes
            )
            summed_spectrum = spectrum
        # Plot 1D summed spectrum
        self.ax_1d.plot(energy_bins, summed_spectrum, 'b-', linewidth=0.8)
        self.ax_1d.fill_between(energy_bins, 0, summed_spectrum, alpha=0.3)
        self.ax_1d.set_title("Summed Spectrum")
        self.ax_1d.set_xlabel("Energy (keV)")
        self.ax_1d.set_ylabel("Counts (normalized)")
        self.ax_1d.set_xlim(energy_min, energy_max)
        self.ax_1d.set_ylim(0, None)
        self.ax_1d.grid(True, alpha=0.3)
        # Add vertical lines for common peaks if metadata available
        if self.current_metadata:
            isotopes = self.current_metadata.get('isotopes', [])
            if isotopes:
                # Add some common reference lines
                peak_energies = self._get_peak_energies_from_metadata()
                for energy, label in peak_energies[:5]:  # Show top 5 peaks
                    if energy_min < energy < energy_max:
                        self.ax_1d.axvline(x=energy, color='red', linestyle='--', alpha=0.5, linewidth=0.8)
                        self.ax_1d.annotate(
                            label,
                            xy=(energy, self.ax_1d.get_ylim()[1] * 0.95),
                            fontsize=8,
                            rotation=90,
                            ha='right',
                            va='top'
                        )
        # Use subplots_adjust instead of tight_layout to avoid colorbar axes conflict
        self.fig.subplots_adjust(left=0.08, right=0.95, top=0.92, bottom=0.12, wspace=0.3)
        self.canvas.draw()
    def _get_peak_energies_from_metadata(self) -> List[tuple]:
        """Extract key peak energies from metadata for annotation."""
        peaks = []
        if not self.current_metadata:
            return peaks
        isotopes = self.current_metadata.get('isotopes', [])
        # Common isotope peak energies
        isotope_peaks = {
            'Cs-137': [(661.66, 'Cs-137')],
            'Co-60': [(1173.23, 'Co-60'), (1332.49, 'Co-60')],
            'Am-241': [(59.54, 'Am-241')],
            'Ba-133': [(356.0, 'Ba-133'), (81.0, 'Ba-133')],
            'Na-22': [(511.0, 'Na-22'), (1274.54, 'Na-22')],
            'K-40': [(1460.83, 'K-40')],
            'Eu-152': [(344.28, 'Eu-152'), (1408.0, 'Eu-152')],
            'I-131': [(364.49, 'I-131')],
            'Tc-99m': [(140.51, 'Tc-99m')],
            'Co-57': [(122.06, 'Co-57')],
        }
        for iso_info in isotopes:
            iso_name = iso_info.get('name', '') if isinstance(iso_info, dict) else str(iso_info)
            if iso_name in isotope_peaks:
                peaks.extend(isotope_peaks[iso_name])
        return peaks
    def _update_metadata(self):
        """Update the metadata text display."""
        self.metadata_text.delete(1.0, tk.END)
        if self.current_spectrum is not None:
            # Add spectrum shape info
            info = f"Spectrum Shape: {self.current_spectrum.shape}\n"
            info += f"Data type: {self.current_spectrum.dtype}\n"
            info += f"Value range: [{self.current_spectrum.min():.4f}, {self.current_spectrum.max():.4f}]\n"
            info += f"Mean value: {self.current_spectrum.mean():.4f}\n"
            info += "\n" + "="*50 + "\n\n"
            self.metadata_text.insert(tk.END, info)
        if self.current_metadata:
            # Pretty print JSON metadata
            formatted = json.dumps(self.current_metadata, indent=2)
            self.metadata_text.insert(tk.END, formatted)
        else:
            self.metadata_text.insert(tk.END, "No metadata JSON file found for this spectrum.")
    def _on_spectrum_selected(self, event=None):
        """Handle spectrum selection from dropdown."""
        selection = self.spectrum_var.get()
        for i, f in enumerate(self.spectrum_files):
            if f.stem == selection:
                self.current_index = i
                break
        self._load_current_spectrum()
    def _prev_spectrum(self):
        """Go to previous spectrum."""
        self.current_index = (self.current_index - 1) % len(self.spectrum_files)
        self.spectrum_dropdown.current(self.current_index)
        self._load_current_spectrum()
    def _next_spectrum(self):
        """Go to next spectrum."""
        self.current_index = (self.current_index + 1) % len(self.spectrum_files)
        self.spectrum_dropdown.current(self.current_index)
        self._load_current_spectrum()
    def _random_spectrum(self):
        """Jump to a random spectrum."""
        self.current_index = random.randint(0, len(self.spectrum_files) - 1)
        self.spectrum_dropdown.current(self.current_index)
        self._load_current_spectrum()
    def run(self):
        """Start the GUI main loop."""
        self.root.mainloop()
 def main():
    """Main entry point."""
    import argparse
    parser = argparse.ArgumentParser(
        description="Visualize synthetic gamma spectra"
    )
    parser.add_argument(
        "--data_dir",
        type=str,
        default="./data/synthetic/spectra",
        help="Directory containing spectrum files (default: ./data/synthetic/spectra)"
    )
    parser.add_argument(
        "--num_samples",
        type=int,
        default=100,
        help="Number of random samples to load (default: 100)"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Random seed for reproducible sample selection"
    )
    args = parser.parse_args()
    if not HAS_MATPLOTLIB:
        print("ERROR: matplotlib is required for visualization.")
        print("Install with: pip install matplotlib")
        return
    print(f"Starting Spectrum Viewer...")
    print(f"Data directory: {args.data_dir}")
    print(f"Loading up to {args.num_samples} random samples...")
    try:
        viewer = SpectrumViewer(
            data_dir=args.data_dir,
            num_samples=args.num_samples,
            random_seed=args.seed
        )
        viewer.run()
    except ValueError as e:
        print(f"Error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise
 if __name__ == "__main__":
    main()
--- a/train/vega_ml/synthetic_spectra/visualize_training_data.py
+++ b/train/vega_ml/synthetic_spectra/visualize_training_data.py
@ -0,0 +1,946 @@
 """
 Training Data Visualization Script
 Generates an interactive HTML dashboard with Plotly visualizations to explore
 the synthetic training data distribution, isotope combinations, activities,
 durations, and sample spectra.
 Usage:
    python -m synthetic_spectra.visualize_training_data
    python -m synthetic_spectra.visualize_training_data --data-dir data/synthetic/spectra
    python -m synthetic_spectra.visualize_training_data --output report.html --max-samples 1000
 Output:
    An interactive HTML file that can be opened in any browser.
 """
 import argparse
 import json
 import sys
 from pathlib import Path
 from collections import Counter, defaultdict
 from itertools import combinations
 from typing import Dict, List, Tuple, Optional
 import numpy as np
 # Add parent directory to path for imports
 sys.path.insert(0, str(Path(__file__).parent.parent))
 try:
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
 except ImportError:
    print("Error: Plotly is required. Install it with: pip install plotly")
    sys.exit(1)
 from synthetic_spectra.ground_truth.isotope_data import (
    ISOTOPE_DATABASE,
    IsotopeCategory,
    get_isotopes_by_category,
 )
 def load_all_metadata(data_dir: Path, max_samples: Optional[int] = None) -> List[Dict]:
    """Load all JSON metadata files from the data directory."""
    json_files = sorted(data_dir.glob("*.json"))
    if max_samples is not None and len(json_files) > max_samples:
        # Randomly sample if we have too many
        np.random.seed(42)
        indices = np.random.choice(len(json_files), max_samples, replace=False)
        json_files = [json_files[i] for i in sorted(indices)]
    metadata_list = []
    print(f"Loading {len(json_files)} metadata files...")
    for i, json_file in enumerate(json_files):
        try:
            with open(json_file, 'r') as f:
                data = json.load(f)
                data['_filename'] = json_file.stem
                metadata_list.append(data)
        except Exception as e:
            print(f"  Warning: Could not load {json_file}: {e}")
        if (i + 1) % 1000 == 0:
            print(f"  Loaded {i + 1}/{len(json_files)} files...")
    print(f"Loaded {len(metadata_list)} samples successfully.")
    return metadata_list
 def load_sample_spectra(data_dir: Path, sample_ids: List[str]) -> Dict[str, np.ndarray]:
    """Load a few sample spectra for visualization."""
    spectra = {}
    for sample_id in sample_ids:
        npy_file = data_dir / f"{sample_id}.npy"
        if npy_file.exists():
            try:
                spectra[sample_id] = np.load(npy_file)
            except Exception as e:
                print(f"  Warning: Could not load spectrum {npy_file}: {e}")
    return spectra
 def compute_statistics(metadata_list: List[Dict]) -> Dict:
    """Compute various statistics from the metadata."""
    stats = {
        'total_samples': len(metadata_list),
        'isotope_counts': Counter(),
        'isotope_cooccurrence': defaultdict(int),
        'num_isotopes_distribution': Counter(),
        'durations': [],
        'activities': defaultdict(list),
        'detectors': Counter(),
        'category_counts': Counter(),
        'samples_by_num_isotopes': defaultdict(list),
    }
    for meta in metadata_list:
        isotopes = meta.get('isotopes', [])
        source_activities = meta.get('source_activities_bq', {})
        duration = meta.get('duration_seconds', 0)
        detector = meta.get('detector', 'unknown')
        # Count isotopes
        for iso in isotopes:
            stats['isotope_counts'][iso] += 1
            # Get category
            if iso in ISOTOPE_DATABASE:
                cat = ISOTOPE_DATABASE[iso].category.value
                stats['category_counts'][cat] += 1
        # Count isotope pairs (co-occurrence)
        for pair in combinations(sorted(isotopes), 2):
            stats['isotope_cooccurrence'][pair] += 1
        # Number of isotopes distribution
        num_iso = len(isotopes)
        stats['num_isotopes_distribution'][num_iso] += 1
        stats['samples_by_num_isotopes'][num_iso].append(meta['_filename'])
        # Duration
        stats['durations'].append(duration)
        # Activities per isotope
        for iso, activity in source_activities.items():
            stats['activities'][iso].append(activity)
        # Detector
        stats['detectors'][detector] += 1
    return stats
 def create_isotope_frequency_chart(stats: Dict) -> go.Figure:
    """Create bar chart of isotope frequencies."""
    isotope_counts = stats['isotope_counts']
    # Sort by frequency
    sorted_isotopes = sorted(isotope_counts.items(), key=lambda x: x[1], reverse=True)
    isotopes, counts = zip(*sorted_isotopes) if sorted_isotopes else ([], [])
    # Color by category
    colors = []
    category_colors = {
        'natural_background': '#2ecc71',
        'primordial': '#27ae60',
        'cosmogenic': '#1abc9c',
        'u238_chain': '#e74c3c',
        'th232_chain': '#c0392b',
        'u235_chain': '#d35400',
        'calibration': '#3498db',
        'industrial': '#9b59b6',
        'medical': '#f1c40f',
        'reactor_fallout': '#e67e22',
        'activation': '#95a5a6',
    }
    for iso in isotopes:
        if iso in ISOTOPE_DATABASE:
            cat = ISOTOPE_DATABASE[iso].category.value
            colors.append(category_colors.get(cat, '#7f8c8d'))
        else:
            colors.append('#7f8c8d')
    fig = go.Figure(data=[
        go.Bar(
            x=list(isotopes),
            y=list(counts),
            marker_color=colors,
            hovertemplate="<b>%{x}</b><br>Count: %{y}<extra></extra>"
        )
    ])
    fig.update_layout(
        title="Isotope Frequency Distribution",
        xaxis_title="Isotope",
        yaxis_title="Number of Samples",
        xaxis_tickangle=-45,
        height=500,
        showlegend=False
    )
    return fig
 def create_category_pie_chart(stats: Dict) -> go.Figure:
    """Create pie chart of isotope categories."""
    category_counts = stats['category_counts']
    if not category_counts:
        return go.Figure().add_annotation(text="No category data available", 
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    labels = list(category_counts.keys())
    values = list(category_counts.values())
    # Pretty names for categories
    pretty_names = {
        'natural_background': 'Natural Background',
        'primordial': 'Primordial',
        'cosmogenic': 'Cosmogenic',
        'u238_chain': 'U-238 Chain',
        'th232_chain': 'Th-232 Chain',
        'u235_chain': 'U-235 Chain',
        'calibration': 'Calibration',
        'industrial': 'Industrial',
        'medical': 'Medical',
        'reactor_fallout': 'Reactor/Fallout',
        'activation': 'Activation Products',
    }
    labels = [pretty_names.get(l, l) for l in labels]
    fig = go.Figure(data=[
        go.Pie(
            labels=labels,
            values=values,
            hole=0.4,
            hovertemplate="<b>%{label}</b><br>Count: %{value}<br>%{percent}<extra></extra>"
        )
    ])
    fig.update_layout(
        title="Isotope Categories Distribution",
        height=450,
    )
    return fig
 def create_num_isotopes_histogram(stats: Dict) -> go.Figure:
    """Create histogram of number of isotopes per sample."""
    num_iso_dist = stats['num_isotopes_distribution']
    x = sorted(num_iso_dist.keys())
    y = [num_iso_dist[k] for k in x]
    # Calculate percentages
    total = sum(y)
    percentages = [f"{(v/total)*100:.1f}%" for v in y]
    fig = go.Figure(data=[
        go.Bar(
            x=[str(k) for k in x],
            y=y,
            text=percentages,
            textposition='auto',
            marker_color='#3498db',
            hovertemplate="<b>%{x} isotopes</b><br>Count: %{y}<br>%{text}<extra></extra>"
        )
    ])
    fig.update_layout(
        title="Sample Complexity (Number of Isotopes per Sample)",
        xaxis_title="Number of Source Isotopes",
        yaxis_title="Number of Samples",
        height=400,
    )
    return fig
 def create_duration_histogram(stats: Dict) -> go.Figure:
    """Create histogram of measurement durations."""
    durations = stats['durations']
    if not durations:
        return go.Figure().add_annotation(text="No duration data available",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    fig = go.Figure(data=[
        go.Histogram(
            x=durations,
            nbinsx=50,
            marker_color='#9b59b6',
            hovertemplate="Duration: %{x:.1f}s<br>Count: %{y}<extra></extra>"
        )
    ])
    fig.update_layout(
        title="Measurement Duration Distribution",
        xaxis_title="Duration (seconds)",
        yaxis_title="Number of Samples",
        height=400,
    )
    # Add statistics annotation
    mean_dur = np.mean(durations)
    std_dur = np.std(durations)
    min_dur = np.min(durations)
    max_dur = np.max(durations)
    fig.add_annotation(
        text=f"Mean: {mean_dur:.1f}s | Std: {std_dur:.1f}s | Range: [{min_dur:.1f}, {max_dur:.1f}]s",
        xref="paper", yref="paper",
        x=0.98, y=0.98,
        xanchor='right', yanchor='top',
        showarrow=False,
        bgcolor="white",
        bordercolor="black",
        borderwidth=1,
        font=dict(size=11)
    )
    return fig
 def create_activity_boxplot(stats: Dict) -> go.Figure:
    """Create box plot of activities per isotope."""
    activities = stats['activities']
    if not activities:
        return go.Figure().add_annotation(text="No activity data available",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    # Sort by median activity
    sorted_isotopes = sorted(
        activities.keys(),
        key=lambda x: np.median(activities[x]) if activities[x] else 0,
        reverse=True
    )
    # Only show top 30 for readability
    top_isotopes = sorted_isotopes[:30]
    fig = go.Figure()
    for iso in top_isotopes:
        fig.add_trace(go.Box(
            y=activities[iso],
            name=iso,
            boxpoints='outliers',
            hovertemplate=f"<b>{iso}</b><br>Activity: %{{y:.2f}} Bq<extra></extra>"
        ))
    fig.update_layout(
        title="Activity Distribution by Isotope (Top 30)",
        xaxis_title="Isotope",
        yaxis_title="Activity (Bq)",
        xaxis_tickangle=-45,
        height=500,
        showlegend=False
    )
    return fig
 def create_cooccurrence_heatmap(stats: Dict, top_n: int = 20) -> go.Figure:
    """Create heatmap of isotope co-occurrence."""
    cooccurrence = stats['isotope_cooccurrence']
    isotope_counts = stats['isotope_counts']
    if not cooccurrence:
        return go.Figure().add_annotation(text="No co-occurrence data (need multi-isotope samples)",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    # Get top N most frequent isotopes
    top_isotopes = [iso for iso, _ in isotope_counts.most_common(top_n)]
    # Build matrix
    n = len(top_isotopes)
    matrix = np.zeros((n, n))
    for i, iso1 in enumerate(top_isotopes):
        for j, iso2 in enumerate(top_isotopes):
            if i < j:
                pair = tuple(sorted([iso1, iso2]))
                matrix[i, j] = cooccurrence.get(pair, 0)
                matrix[j, i] = matrix[i, j]
    fig = go.Figure(data=go.Heatmap(
        z=matrix,
        x=top_isotopes,
        y=top_isotopes,
        colorscale='Blues',
        hovertemplate="<b>%{x}</b> + <b>%{y}</b><br>Co-occurrences: %{z}<extra></extra>"
    ))
    fig.update_layout(
        title=f"Isotope Co-occurrence Matrix (Top {top_n} Isotopes)",
        xaxis_tickangle=-45,
        height=600,
        width=700,
    )
    return fig
 def create_activity_vs_duration_scatter(metadata_list: List[Dict]) -> go.Figure:
    """Create scatter plot of total activity vs duration."""
    durations = []
    total_activities = []
    num_isotopes = []
    sample_ids = []
    for meta in metadata_list:
        duration = meta.get('duration_seconds', 0)
        activities = meta.get('source_activities_bq', {})
        if duration > 0 and activities:
            durations.append(duration)
            total_activities.append(sum(activities.values()))
            num_isotopes.append(len(meta.get('isotopes', [])))
            sample_ids.append(meta['_filename'])
    if not durations:
        return go.Figure().add_annotation(text="No data available",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    fig = go.Figure(data=go.Scatter(
        x=durations,
        y=total_activities,
        mode='markers',
        marker=dict(
            size=6,
            color=num_isotopes,
            colorscale='Viridis',
            colorbar=dict(title="# Isotopes"),
            opacity=0.6
        ),
        text=sample_ids,
        hovertemplate="<b>%{text}</b><br>Duration: %{x:.1f}s<br>Total Activity: %{y:.2f} Bq<extra></extra>"
    ))
    fig.update_layout(
        title="Total Source Activity vs Measurement Duration",
        xaxis_title="Duration (seconds)",
        yaxis_title="Total Activity (Bq)",
        height=500,
    )
    return fig
 def create_sample_spectrum_plot(spectra: Dict[str, np.ndarray], metadata_list: List[Dict]) -> go.Figure:
    """Create interactive plot of sample spectra."""
    if not spectra:
        return go.Figure().add_annotation(text="No spectrum data loaded",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    # Create a metadata lookup
    meta_lookup = {m['_filename']: m for m in metadata_list}
    # Energy axis (keV) - 1023 channels from 20 to 3000 keV
    num_channels = 1023
    energy = np.linspace(20, 3000, num_channels)
    fig = go.Figure()
    colors = px.colors.qualitative.Set2
    for i, (sample_id, spectrum) in enumerate(list(spectra.items())[:6]):
        # Sum across time intervals to get total spectrum
        total_spectrum = spectrum.sum(axis=0) if spectrum.ndim == 2 else spectrum
        # Get isotope info
        meta = meta_lookup.get(sample_id, {})
        isotopes = meta.get('isotopes', ['Unknown'])
        label = f"{sample_id[-6:]}: {', '.join(isotopes)}"
        fig.add_trace(go.Scatter(
            x=energy,
            y=total_spectrum,
            mode='lines',
            name=label,
            line=dict(color=colors[i % len(colors)], width=1),
            hovertemplate=f"<b>{label}</b><br>Energy: %{{x:.1f}} keV<br>Counts: %{{y:.2f}}<extra></extra>"
        ))
    fig.update_layout(
        title="Sample Spectra (Time-Integrated)",
        xaxis_title="Energy (keV)",
        yaxis_title="Normalized Counts",
        height=500,
        legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
        hovermode='closest'
    )
    return fig
 def create_3d_spectrum_surface(spectrum: np.ndarray, sample_id: str) -> go.Figure:
    """Create 3D surface plot of a single spectrum (time vs energy vs counts)."""
    if spectrum.ndim != 2:
        return go.Figure().add_annotation(text="Spectrum must be 2D",
                                          xref="paper", yref="paper", x=0.5, y=0.5)
    num_intervals, num_channels = spectrum.shape
    # Create axes
    time_axis = np.arange(num_intervals)
    energy_axis = np.linspace(20, 3000, num_channels)
    # Downsample for performance if needed
    if num_intervals > 100:
        step = num_intervals // 100
        spectrum = spectrum[::step, :]
        time_axis = time_axis[::step]
    if num_channels > 256:
        ch_step = num_channels // 256
        spectrum = spectrum[:, ::ch_step]
        energy_axis = energy_axis[::ch_step]
    fig = go.Figure(data=[
        go.Surface(
            z=spectrum,
            x=energy_axis,
            y=time_axis,
            colorscale='Viridis',
            hovertemplate="Time: %{y}s<br>Energy: %{x:.1f} keV<br>Counts: %{z:.3f}<extra></extra>"
        )
    ])
    fig.update_layout(
        title=f"3D Spectrum View: {sample_id}",
        scene=dict(
            xaxis_title="Energy (keV)",
            yaxis_title="Time (s)",
            zaxis_title="Counts",
        ),
        height=600,
    )
    return fig
 def create_summary_table(stats: Dict) -> str:
    """Create an HTML summary table."""
    total = stats['total_samples']
    num_unique_isotopes = len(stats['isotope_counts'])
    avg_isotopes_per_sample = sum(k * v for k, v in stats['num_isotopes_distribution'].items()) / total if total else 0
    durations = stats['durations']
    activities_all = [a for acts in stats['activities'].values() for a in acts]
    html = f"""
    <div style="padding: 20px; background: #f8f9fa; border-radius: 10px; margin: 20px 0;">
        <h3 style="margin-top: 0; color: #2c3e50;">📊 Dataset Summary</h3>
        <table style="width: 100%; border-collapse: collapse; font-size: 14px;">
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Total Samples</strong></td>
                <td style="padding: 8px;">{total:,}</td>
            </tr>
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Unique Isotopes</strong></td>
                <td style="padding: 8px;">{num_unique_isotopes}</td>
            </tr>
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Avg Isotopes per Sample</strong></td>
                <td style="padding: 8px;">{avg_isotopes_per_sample:.2f}</td>
            </tr>
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Duration Range</strong></td>
                <td style="padding: 8px;">{min(durations) if durations else 0:.1f}s - {max(durations) if durations else 0:.1f}s</td>
            </tr>
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Mean Duration</strong></td>
                <td style="padding: 8px;">{np.mean(durations) if durations else 0:.1f}s</td>
            </tr>
            <tr style="border-bottom: 1px solid #ddd;">
                <td style="padding: 8px;"><strong>Activity Range</strong></td>
                <td style="padding: 8px;">{min(activities_all) if activities_all else 0:.2f} - {max(activities_all) if activities_all else 0:.2f} Bq</td>
            </tr>
            <tr>
                <td style="padding: 8px;"><strong>Detectors</strong></td>
                <td style="padding: 8px;">{', '.join(stats['detectors'].keys())}</td>
            </tr>
        </table>
    </div>
    """
    return html
 def create_isotope_database_summary() -> go.Figure:
    """Create a sunburst chart of the isotope database by category."""
    # Build hierarchy data
    categories = defaultdict(list)
    for name, isotope in ISOTOPE_DATABASE.items():
        categories[isotope.category.value].append(name)
    # Create sunburst data
    ids = []
    labels = []
    parents = []
    values = []
    # Root
    ids.append("Isotope Database")
    labels.append("Isotope Database")
    parents.append("")
    values.append(len(ISOTOPE_DATABASE))
    # Categories and isotopes
    pretty_names = {
        'natural_background': 'Natural Background',
        'primordial': 'Primordial',
        'cosmogenic': 'Cosmogenic',
        'u238_chain': 'U-238 Chain',
        'th232_chain': 'Th-232 Chain',
        'u235_chain': 'U-235 Chain',
        'calibration': 'Calibration',
        'industrial': 'Industrial',
        'medical': 'Medical',
        'reactor_fallout': 'Reactor/Fallout',
        'activation': 'Activation',
    }
    for cat, isotopes in categories.items():
        cat_label = pretty_names.get(cat, cat)
        ids.append(cat_label)
        labels.append(f"{cat_label} ({len(isotopes)})")
        parents.append("Isotope Database")
        values.append(len(isotopes))
        for iso in isotopes:
            ids.append(f"{cat_label}/{iso}")
            labels.append(iso)
            parents.append(cat_label)
            values.append(1)
    fig = go.Figure(go.Sunburst(
        ids=ids,
        labels=labels,
        parents=parents,
        values=values,
        branchvalues="total",
        hovertemplate="<b>%{label}</b><extra></extra>"
    ))
    fig.update_layout(
        title=f"Isotope Database Structure ({len(ISOTOPE_DATABASE)} isotopes)",
        height=600,
    )
    return fig
 def generate_html_report(
    data_dir: Path,
    output_file: Path,
    max_samples: Optional[int] = None
 ):
    """Generate the complete HTML report."""
    print("=" * 60)
    print("Training Data Visualization Report Generator")
    print("=" * 60)
    # Load all metadata
    metadata_list = load_all_metadata(data_dir, max_samples)
    if not metadata_list:
        print("Error: No metadata files found!")
        return
    # Compute statistics
    print("\nComputing statistics...")
    stats = compute_statistics(metadata_list)
    # Load a few sample spectra
    print("\nLoading sample spectra for visualization...")
    sample_ids = [m['_filename'] for m in metadata_list[:10]]
    spectra = load_sample_spectra(data_dir, sample_ids)
    print(f"\nGenerating visualizations...")
    # Generate all figures
    figures = {
        'isotope_freq': create_isotope_frequency_chart(stats),
        'category_pie': create_category_pie_chart(stats),
        'num_isotopes': create_num_isotopes_histogram(stats),
        'duration_hist': create_duration_histogram(stats),
        'activity_box': create_activity_boxplot(stats),
        'cooccurrence': create_cooccurrence_heatmap(stats),
        'activity_duration': create_activity_vs_duration_scatter(metadata_list),
        'sample_spectra': create_sample_spectrum_plot(spectra, metadata_list),
        'isotope_db': create_isotope_database_summary(),
    }
    # Add 3D spectrum if we have data
    if spectra:
        first_id = list(spectra.keys())[0]
        figures['spectrum_3d'] = create_3d_spectrum_surface(spectra[first_id], first_id)
    # Create HTML
    print("\nBuilding HTML report...")
    html_parts = [
        """
 <!DOCTYPE html>
 <html>
 <head>
    <title>Synthetic Training Data Visualization</title>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, sans-serif;
            margin: 0;
            padding: 20px;
            background: #ecf0f1;
            color: #2c3e50;
        }
        .container {
            max-width: 1400px;
            margin: 0 auto;
        }
        h1 {
            text-align: center;
            color: #2c3e50;
            padding: 20px;
            background: white;
            border-radius: 10px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }
        h2 {
            color: #34495e;
            border-bottom: 2px solid #3498db;
            padding-bottom: 10px;
            margin-top: 40px;
        }
        .chart-container {
            background: white;
            padding: 20px;
            border-radius: 10px;
            margin: 20px 0;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }
        .row {
            display: flex;
            gap: 20px;
            flex-wrap: wrap;
        }
        .col-6 {
            flex: 1;
            min-width: 400px;
        }
        .col-12 {
            width: 100%;
        }
        .toc {
            background: white;
            padding: 20px;
            border-radius: 10px;
            margin: 20px 0;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }
        .toc ul {
            list-style: none;
            padding-left: 0;
        }
        .toc li {
            margin: 10px 0;
        }
        .toc a {
            color: #3498db;
            text-decoration: none;
        }
        .toc a:hover {
            text-decoration: underline;
        }
        .info-box {
            background: #e8f6ff;
            border-left: 4px solid #3498db;
            padding: 15px;
            margin: 20px 0;
            border-radius: 0 10px 10px 0;
        }
    </style>
 </head>
 <body>
    <div class="container">
        <h1>🔬 Synthetic Gamma Spectra Training Data Analysis</h1>
        """,
        create_summary_table(stats),
        """
        <div class="toc">
            <h3>📑 Table of Contents</h3>
            <ul>
                <li><a href="#isotope-distribution">1. Isotope Distribution</a></li>
                <li><a href="#sample-complexity">2. Sample Complexity</a></li>
                <li><a href="#temporal-activity">3. Temporal & Activity Analysis</a></li>
                <li><a href="#cooccurrence">4. Isotope Co-occurrence</a></li>
                <li><a href="#sample-spectra">5. Sample Spectra</a></li>
                <li><a href="#database-overview">6. Isotope Database Overview</a></li>
            </ul>
        </div>
        <h2 id="isotope-distribution">1. Isotope Distribution</h2>
        <div class="info-box">
            <strong>What this shows:</strong> The frequency of each isotope across all training samples.
            Imbalanced distributions may lead to model bias towards common isotopes.
        </div>
        <div class="row">
            <div class="col-6 chart-container">
        """,
        figures['isotope_freq'].to_html(full_html=False, include_plotlyjs=False),
        """
            </div>
            <div class="col-6 chart-container">
        """,
        figures['category_pie'].to_html(full_html=False, include_plotlyjs=False),
        """
            </div>
        </div>
        <h2 id="sample-complexity">2. Sample Complexity</h2>
        <div class="info-box">
            <strong>What this shows:</strong> Distribution of how many source isotopes are present per sample.
            Mix of single and multi-isotope samples helps the model handle real-world complexity.
        </div>
        <div class="chart-container">
        """,
        figures['num_isotopes'].to_html(full_html=False, include_plotlyjs=False),
        """
        </div>
        <h2 id="temporal-activity">3. Temporal & Activity Analysis</h2>
        <div class="info-box">
            <strong>What this shows:</strong> Distribution of measurement durations and source activities.
            Varied durations simulate different counting scenarios.
        </div>
        <div class="row">
            <div class="col-6 chart-container">
        """,
        figures['duration_hist'].to_html(full_html=False, include_plotlyjs=False),
        """
            </div>
            <div class="col-6 chart-container">
        """,
        figures['activity_duration'].to_html(full_html=False, include_plotlyjs=False),
        """
            </div>
        </div>
        <div class="chart-container">
        """,
        figures['activity_box'].to_html(full_html=False, include_plotlyjs=False),
        """
        </div>
        <h2 id="cooccurrence">4. Isotope Co-occurrence</h2>
        <div class="info-box">
            <strong>What this shows:</strong> Which isotopes frequently appear together in training samples.
            This helps understand potential confusion pairs and realistic combinations.
        </div>
        <div class="chart-container">
        """,
        figures['cooccurrence'].to_html(full_html=False, include_plotlyjs=False),
        """
        </div>
        <h2 id="sample-spectra">5. Sample Spectra Visualization</h2>
        <div class="info-box">
            <strong>What this shows:</strong> Actual spectrum shapes from the training data.
            Each peak corresponds to gamma emission lines from the source isotopes.
        </div>
        <div class="chart-container">
        """,
        figures['sample_spectra'].to_html(full_html=False, include_plotlyjs=False),
        """
        </div>
        """
    ]
    # Add 3D spectrum if available
    if 'spectrum_3d' in figures:
        html_parts.append("""
        <div class="chart-container">
            <h3>3D Time-Energy-Counts View</h3>
        """)
        html_parts.append(figures['spectrum_3d'].to_html(full_html=False, include_plotlyjs=False))
        html_parts.append("</div>")
    html_parts.append("""
        <h2 id="database-overview">6. Isotope Database Overview</h2>
        <div class="info-box">
            <strong>What this shows:</strong> The complete isotope database structure organized by category.
            Click to explore the hierarchy.
        </div>
        <div class="chart-container">
        """)
    html_parts.append(figures['isotope_db'].to_html(full_html=False, include_plotlyjs=False))
    html_parts.append("""
        </div>
        <footer style="text-align: center; padding: 40px; color: #7f8c8d;">
            <p>Generated by ML for Isotope Identification Training Data Analyzer</p>
        </footer>
    </div>
 </body>
 </html>
    """)
    # Write HTML file
    html_content = ''.join(html_parts)
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    print(f"\n✅ Report generated successfully!")
    print(f"   Output: {output_file.absolute()}")
    print(f"\nOpen in your browser to view the interactive visualizations.")
 def main():
    parser = argparse.ArgumentParser(
        description="Generate interactive HTML visualization of training data"
    )
    parser.add_argument(
        '--data-dir',
        type=str,
        default='data/synthetic/spectra',
        help='Directory containing spectrum .json and .npy files'
    )
    parser.add_argument(
        '--output',
        type=str,
        default='training_data_report.html',
        help='Output HTML file name'
    )
    parser.add_argument(
        '--max-samples',
        type=int,
        default=None,
        help='Maximum number of samples to analyze (for faster generation)'
    )
    args = parser.parse_args()
    data_dir = Path(args.data_dir)
    output_file = Path(args.output)
    if not data_dir.exists():
        print(f"Error: Data directory not found: {data_dir}")
        sys.exit(1)
    generate_html_report(data_dir, output_file, args.max_samples)
 if __name__ == "__main__":
    main()
--- a/train/vega_ml/training/init.py
+++ b/train/vega_ml/training/init.py
@ -0,0 +1 @@
 # Training module for isotope identification models
--- a/train/vega_ml/training/vega/init.py
+++ b/train/vega_ml/training/vega/init.py
@ -0,0 +1,26 @@
 """
 Vega Model - CNN-FCNN with Multi-Task Heads for Gamma Spectrum Isotope Identification
 Architecture based on research findings from:
 - Wang et al. (2026): CNN-FCNN achieves 99.8% accuracy
 - Galib et al. (2021): Hybrid CNN outperforms pure architectures
 - Turner et al. (2021): 1D CNN robust to gain shifts and shielding
 Features:
 - 1D CNN backbone for spectral feature extraction
 - Multi-task heads for isotope classification + activity regression
 - Support for 82 isotopes from the synthetic spectra database
 """
 from .model import VegaModel, VegaConfig
 from .dataset import SpectrumDataset, create_data_loaders
 from .train import train_vega, VegaTrainer
 __all__ = [
    'VegaModel',
    'VegaConfig', 
    'SpectrumDataset',
    'create_data_loaders',
    'train_vega',
    'VegaTrainer'
 ]
--- a/train/vega_ml/training/vega/dataset.py
+++ b/train/vega_ml/training/vega/dataset.py
@ -0,0 +1,373 @@
 """
 Dataset and DataLoader for Vega Model Training
 Handles loading synthetic gamma spectra from numpy files and converting
 them to PyTorch tensors with proper labels for multi-task learning.
 Supports two label formats:
 1. Individual JSON files per sample (recommended for large datasets)
 2. Combined labels.json file (legacy format)
 """
 import json
 import numpy as np
 import torch
 from torch.utils.data import Dataset, DataLoader, random_split
 from pathlib import Path
 from typing import Dict, List, Optional, Tuple
 from dataclasses import dataclass
 from .isotope_index import IsotopeIndex, get_default_isotope_index
@dataclass
 class SpectrumSample:
    """A single spectrum sample with metadata."""
    sample_id: str
    spectrum: np.ndarray  # 2D array (time_intervals, channels) or 1D (channels,)
    isotopes_present: List[str]
    activities_bq: Dict[str, float]
    duration_seconds: float
    detector: str
 class SpectrumDataset(Dataset):
    """
    PyTorch Dataset for synthetic gamma spectra.
    Loads spectra from numpy files and their labels from JSON files.
    Supports both individual JSON files per sample (efficient for large datasets)
    and combined labels.json (legacy format).
    Converts to tensors suitable for the Vega model.
    """
    def __init__(
        self,
        data_dir: Path,
        isotope_index: Optional[IsotopeIndex] = None,
        max_activity_bq: float = 1000.0,
        collapse_time: bool = True,
        transform=None
    ):
        """
        Initialize the dataset.
        Args:
            data_dir: Path to directory containing spectra/ subdirectory
            isotope_index: Index mapping isotope names to indices
            max_activity_bq: Maximum activity for normalization
            collapse_time: If True, average across time dimension to get 1D spectrum
            transform: Optional transform to apply to spectra
        """
        self.data_dir = Path(data_dir)
        self.spectra_dir = self.data_dir / "spectra"
        self.isotope_index = isotope_index or get_default_isotope_index()
        self.max_activity_bq = max_activity_bq
        self.collapse_time = collapse_time
        self.transform = transform
        # Detect label format and load sample list
        self.use_individual_labels = self._detect_label_format()
        if self.use_individual_labels:
            # Scan for individual JSON files (efficient - no loading needed)
            self.sample_ids = self._scan_for_samples()
            self.metadata = None  # Labels loaded on-demand
            print(f"Using individual label files (efficient mode)")
        else:
            # Load combined labels.json (legacy mode)
            self.metadata = self._load_metadata()
            self.sample_ids = list(self.metadata['samples'].keys())
            print(f"Using combined labels.json (legacy mode)")
        print(f"Loaded dataset with {len(self.sample_ids)} samples")
        print(f"Isotope index has {self.isotope_index.num_isotopes} isotopes")
    def _detect_label_format(self) -> bool:
        """Detect whether to use individual JSON files or combined labels.json."""
        # Check if individual JSON files exist
        json_files = list(self.spectra_dir.glob("spectrum_*.json"))
        if len(json_files) > 0:
            return True
        # Fall back to combined labels.json
        labels_path = self.data_dir / "labels.json"
        if labels_path.exists():
            return False
        raise FileNotFoundError(
            f"No label files found. Expected either:\n"
            f"  - Individual files: {self.spectra_dir}/spectrum_*.json\n"
            f"  - Combined file: {self.data_dir}/labels.json"
        )
    def _scan_for_samples(self) -> List[str]:
        """Scan directory for sample IDs based on .npy files."""
        npy_files = sorted(self.spectra_dir.glob("spectrum_*.npy"))
        sample_ids = []
        for npy_path in npy_files:
            # Extract sample ID from filename: spectrum_{id}.npy
            filename = npy_path.stem  # spectrum_{id}
            sample_id = filename.replace("spectrum_", "")
            sample_ids.append(sample_id)
        return sample_ids
    def _load_metadata(self) -> Dict:
        """Load the combined labels.json metadata file (legacy format)."""
        labels_path = self.data_dir / "labels.json"
        if not labels_path.exists():
            raise FileNotFoundError(f"Labels file not found: {labels_path}")
        with open(labels_path, 'r') as f:
            return json.load(f)
    def _load_sample_label(self, sample_id: str) -> Dict:
        """Load label for a single sample (individual JSON or from combined)."""
        if self.use_individual_labels:
            json_path = self.spectra_dir / f"spectrum_{sample_id}.json"
            with open(json_path, 'r') as f:
                return json.load(f)
        else:
            return self.metadata['samples'][sample_id]
    def __len__(self) -> int:
        return len(self.sample_ids)
    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        """
        Get a single sample.
        Returns:
            Dictionary containing:
                - spectrum: Tensor of shape (num_channels,)
                - presence_labels: Binary tensor (num_isotopes,) indicating presence
                - activity_labels: Tensor (num_isotopes,) with normalized activities
                - sample_id: String identifier
        """
        sample_id = self.sample_ids[idx]
        sample_meta = self._load_sample_label(sample_id)
        # Load spectrum
        spectrum_path = self.spectra_dir / f"spectrum_{sample_id}.npy"
        spectrum = np.load(spectrum_path)
        # Collapse time dimension if needed
        if self.collapse_time and spectrum.ndim == 2:
            # Average across time intervals to get single spectrum
            spectrum = spectrum.mean(axis=0)
        # Convert to tensor
        spectrum_tensor = torch.tensor(spectrum, dtype=torch.float32)
        # Apply transform if provided
        if self.transform:
            spectrum_tensor = self.transform(spectrum_tensor)
        # Create presence labels
        presence_labels = torch.zeros(self.isotope_index.num_isotopes, dtype=torch.float32)
        for isotope_name in sample_meta['isotopes']:
            try:
                idx_isotope = self.isotope_index.name_to_index(isotope_name)
                presence_labels[idx_isotope] = 1.0
            except KeyError:
                # Isotope not in our index (might be a decay product)
                pass
        # Create activity labels (normalized)
        activity_labels = torch.zeros(self.isotope_index.num_isotopes, dtype=torch.float32)
        for isotope_name, activity in sample_meta.get('source_activities_bq', {}).items():
            try:
                idx_isotope = self.isotope_index.name_to_index(isotope_name)
                # Normalize activity to [0, 1] range
                activity_labels[idx_isotope] = min(activity / self.max_activity_bq, 1.0)
            except KeyError:
                pass
        return {
            'spectrum': spectrum_tensor,
            'presence_labels': presence_labels,
            'activity_labels': activity_labels,
            'sample_id': sample_id
        }
    def get_sample_info(self, idx: int) -> Dict:
        """Get metadata for a sample without loading the spectrum."""
        sample_id = self.sample_ids[idx]
        return {
            'sample_id': sample_id,
            **self.metadata['samples'][sample_id]
        }
 def collate_fn(batch: List[Dict]) -> Dict[str, torch.Tensor]:
    """
    Custom collate function to handle batching.
    Args:
        batch: List of sample dictionaries
    Returns:
        Batched dictionary with stacked tensors
    """
    return {
        'spectrum': torch.stack([s['spectrum'] for s in batch]),
        'presence_labels': torch.stack([s['presence_labels'] for s in batch]),
        'activity_labels': torch.stack([s['activity_labels'] for s in batch]),
        'sample_ids': [s['sample_id'] for s in batch]
    }
 def create_data_loaders(
    data_dir: Path,
    batch_size: int = 32,
    train_split: float = 0.8,
    val_split: float = 0.1,
    test_split: float = 0.1,
    num_workers: int = 8,
    prefetch_factor: int = 4,
    persistent_workers: bool = True,
    isotope_index: Optional[IsotopeIndex] = None,
    max_activity_bq: float = 1000.0,
    seed: int = 42
 ) -> Tuple[DataLoader, DataLoader, DataLoader]:
    """
    Create train, validation, and test data loaders.
    Args:
        data_dir: Path to data directory
        batch_size: Batch size for training
        train_split: Fraction of data for training
        val_split: Fraction of data for validation
        test_split: Fraction of data for testing
        num_workers: Number of data loading workers (parallel I/O)
        prefetch_factor: Batches to prefetch per worker
        persistent_workers: Keep workers alive between epochs
        isotope_index: Isotope name to index mapping
        max_activity_bq: Maximum activity for normalization
        seed: Random seed for reproducibility
    Returns:
        Tuple of (train_loader, val_loader, test_loader)
    """
    assert abs(train_split + val_split + test_split - 1.0) < 1e-6, \
        "Splits must sum to 1.0"
    # Create full dataset
    full_dataset = SpectrumDataset(
        data_dir=data_dir,
        isotope_index=isotope_index,
        max_activity_bq=max_activity_bq
    )
    # Calculate split sizes
    total_size = len(full_dataset)
    train_size = int(total_size * train_split)
    val_size = int(total_size * val_split)
    test_size = total_size - train_size - val_size
    # Handle small datasets
    if train_size == 0:
        train_size = max(1, total_size - 2)
    if val_size == 0 and total_size > 1:
        val_size = 1
        train_size = max(1, train_size - 1)
    if test_size == 0 and total_size > 2:
        test_size = 1
        train_size = max(1, train_size - 1)
    # Ensure sizes add up
    test_size = total_size - train_size - val_size
    print(f"Dataset splits: train={train_size}, val={val_size}, test={test_size}")
    # Split dataset
    generator = torch.Generator().manual_seed(seed)
    train_dataset, val_dataset, test_dataset = random_split(
        full_dataset,
        [train_size, val_size, test_size],
        generator=generator
    )
    # Create data loaders with parallel loading support
    # For Windows, num_workers > 0 requires spawn method (handled by PyTorch)
    use_workers = num_workers > 0
    train_loader = DataLoader(
        train_dataset,
        batch_size=min(batch_size, train_size),
        shuffle=True,
        num_workers=num_workers,
        collate_fn=collate_fn,
        pin_memory=True,
        prefetch_factor=prefetch_factor if use_workers else None,
        persistent_workers=persistent_workers and use_workers,
        drop_last=True  # Drop incomplete batches for consistent training
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=min(batch_size, max(1, val_size)),
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn,
        pin_memory=True,
        prefetch_factor=prefetch_factor if use_workers else None,
        persistent_workers=persistent_workers and use_workers
    ) if val_size > 0 else None
    test_loader = DataLoader(
        test_dataset,
        batch_size=min(batch_size, max(1, test_size)),
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn,
        pin_memory=True,
        prefetch_factor=prefetch_factor if use_workers else None,
        persistent_workers=persistent_workers and use_workers
    ) if test_size > 0 else None
    if num_workers > 0:
        print(f"DataLoader: {num_workers} workers, prefetch_factor={prefetch_factor}, persistent={persistent_workers}")
    return train_loader, val_loader, test_loader
 if __name__ == "__main__":
    import sys
    # Test dataset loading
    data_dir = Path(__file__).parent.parent.parent / "data" / "synthetic"
    if not data_dir.exists():
        print(f"Data directory not found: {data_dir}")
        sys.exit(1)
    # Create dataset
    dataset = SpectrumDataset(data_dir)
    print(f"\nDataset size: {len(dataset)}")
    # Get a sample
    sample = dataset[0]
    print(f"\nSample keys: {sample.keys()}")
    print(f"Spectrum shape: {sample['spectrum'].shape}")
    print(f"Presence labels shape: {sample['presence_labels'].shape}")
    print(f"Activity labels shape: {sample['activity_labels'].shape}")
    print(f"Presence sum: {sample['presence_labels'].sum().item()}")
    # Create data loaders
    train_loader, val_loader, test_loader = create_data_loaders(
        data_dir,
        batch_size=4
    )
    print(f"\nTrain batches: {len(train_loader)}")
    if val_loader:
        print(f"Val batches: {len(val_loader)}")
    if test_loader:
        print(f"Test batches: {len(test_loader)}")
    # Test a batch
    batch = next(iter(train_loader))
    print(f"\nBatch spectrum shape: {batch['spectrum'].shape}")
    print(f"Batch presence shape: {batch['presence_labels'].shape}")
--- a/train/vega_ml/training/vega/dataset_2d.py
+++ b/train/vega_ml/training/vega/dataset_2d.py
@ -0,0 +1,308 @@
 """
 Dataset for 2D Vega Model
 Loads 2D spectra (time × channels) and pads/truncates to fixed dimensions.
 """
 import json
 import numpy as np
 import torch
 from torch.utils.data import Dataset, DataLoader, random_split
 from pathlib import Path
 from typing import Dict, List, Optional, Tuple
 from dataclasses import dataclass
 from .isotope_index import IsotopeIndex, get_default_isotope_index
@dataclass 
 class SpectrumSample2D:
    """A single 2D spectrum sample."""
    sample_id: str
    spectrum: np.ndarray  # 2D array (time_intervals, channels)
    isotopes_present: List[str]
    activities_bq: Dict[str, float]
    duration_seconds: float
    detector: str
 class SpectrumDataset2D(Dataset):
    """
    PyTorch Dataset for 2D gamma spectra.
    Pads or truncates time dimension to fixed size for batch processing.
    """
    def __init__(
        self,
        data_dir: Path,
        isotope_index: Optional[IsotopeIndex] = None,
        max_activity_bq: float = 1000.0,
        target_time_intervals: int = 60,
        transform=None
    ):
        """
        Initialize the dataset.
        Args:
            data_dir: Path to directory containing spectra/ subdirectory
            isotope_index: Index mapping isotope names to indices
            max_activity_bq: Maximum activity for normalization
            target_time_intervals: Fixed time dimension (pad/truncate to this)
            transform: Optional transform to apply
        """
        self.data_dir = Path(data_dir)
        self.spectra_dir = self.data_dir / "spectra"
        self.isotope_index = isotope_index or get_default_isotope_index()
        self.max_activity_bq = max_activity_bq
        self.target_time_intervals = target_time_intervals
        self.transform = transform
        # Detect label format and load sample list
        self.use_individual_labels = self._detect_label_format()
        if self.use_individual_labels:
            self.sample_ids = self._scan_for_samples()
            self.metadata = None
            print(f"Using individual label files (efficient mode)")
        else:
            self.metadata = self._load_metadata()
            self.sample_ids = list(self.metadata['samples'].keys())
            print(f"Using combined labels.json (legacy mode)")
        print(f"Loaded 2D dataset with {len(self.sample_ids)} samples")
        print(f"Target shape: ({target_time_intervals}, 1023)")
        print(f"Isotope index has {self.isotope_index.num_isotopes} isotopes")
    def _detect_label_format(self) -> bool:
        """Detect whether to use individual JSON files or combined labels.json."""
        json_files = list(self.spectra_dir.glob("spectrum_*.json"))
        if len(json_files) > 0:
            return True
        labels_path = self.data_dir / "labels.json"
        if labels_path.exists():
            return False
        raise FileNotFoundError(
            f"No label files found. Expected either:\n"
            f"  - Individual files: {self.spectra_dir}/spectrum_*.json\n"
            f"  - Combined file: {self.data_dir}/labels.json"
        )
    def _scan_for_samples(self) -> List[str]:
        """Scan directory for sample IDs based on .npy files."""
        npy_files = sorted(self.spectra_dir.glob("spectrum_*.npy"))
        sample_ids = []
        for npy_path in npy_files:
            filename = npy_path.stem
            sample_id = filename.replace("spectrum_", "")
            sample_ids.append(sample_id)
        return sample_ids
    def _load_metadata(self) -> Dict:
        """Load the combined labels.json metadata file."""
        labels_path = self.data_dir / "labels.json"
        if not labels_path.exists():
            raise FileNotFoundError(f"Labels file not found: {labels_path}")
        with open(labels_path, 'r') as f:
            return json.load(f)
    def _load_sample_label(self, sample_id: str) -> Dict:
        """Load label for a single sample."""
        if self.use_individual_labels:
            json_path = self.spectra_dir / f"spectrum_{sample_id}.json"
            with open(json_path, 'r') as f:
                return json.load(f)
        else:
            return self.metadata['samples'][sample_id]
    def _pad_or_truncate(self, spectrum: np.ndarray) -> np.ndarray:
        """
        Pad or truncate spectrum to target time dimension.
        Args:
            spectrum: 2D array (time, channels)
        Returns:
            Array of shape (target_time_intervals, channels)
        """
        current_time = spectrum.shape[0]
        target_time = self.target_time_intervals
        num_channels = spectrum.shape[1]
        if current_time == target_time:
            return spectrum
        elif current_time > target_time:
            # Truncate: take evenly spaced intervals to preserve temporal coverage
            indices = np.linspace(0, current_time - 1, target_time, dtype=int)
            return spectrum[indices, :]
        else:
            # Pad with zeros at the end
            padded = np.zeros((target_time, num_channels), dtype=spectrum.dtype)
            padded[:current_time, :] = spectrum
            return padded
    def __len__(self) -> int:
        return len(self.sample_ids)
    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        """
        Get a single sample.
        Returns:
            Dictionary containing:
                - spectrum: Tensor of shape (target_time_intervals, num_channels)
                - presence_labels: Binary tensor (num_isotopes,)
                - activity_labels: Tensor (num_isotopes,) with normalized activities
                - sample_id: String identifier
        """
        sample_id = self.sample_ids[idx]
        sample_meta = self._load_sample_label(sample_id)
        # Load spectrum
        spectrum_path = self.spectra_dir / f"spectrum_{sample_id}.npy"
        spectrum = np.load(spectrum_path)
        # Ensure 2D
        if spectrum.ndim == 1:
            spectrum = spectrum.reshape(1, -1)
        # Pad/truncate to fixed time dimension
        spectrum = self._pad_or_truncate(spectrum)
        # Normalize (max normalization)
        max_val = spectrum.max()
        if max_val > 0:
            spectrum = spectrum / max_val
        # Convert to tensor
        spectrum_tensor = torch.tensor(spectrum, dtype=torch.float32)
        # Apply transform if provided
        if self.transform:
            spectrum_tensor = self.transform(spectrum_tensor)
        # Create presence labels
        presence_labels = torch.zeros(self.isotope_index.num_isotopes, dtype=torch.float32)
        for isotope_name in sample_meta['isotopes']:
            try:
                idx_isotope = self.isotope_index.name_to_index(isotope_name)
                presence_labels[idx_isotope] = 1.0
            except KeyError:
                pass
        # Create activity labels (normalized)
        activity_labels = torch.zeros(self.isotope_index.num_isotopes, dtype=torch.float32)
        for isotope_name, activity in sample_meta.get('source_activities_bq', {}).items():
            try:
                idx_isotope = self.isotope_index.name_to_index(isotope_name)
                activity_labels[idx_isotope] = min(activity / self.max_activity_bq, 1.0)
            except KeyError:
                pass
        return {
            'spectrum': spectrum_tensor,
            'presence_labels': presence_labels,
            'activity_labels': activity_labels,
            'sample_id': sample_id
        }
 def collate_fn_2d(batch: List[Dict]) -> Dict[str, torch.Tensor]:
    """Custom collate function for 2D batching."""
    return {
        'spectrum': torch.stack([s['spectrum'] for s in batch]),
        'presence_labels': torch.stack([s['presence_labels'] for s in batch]),
        'activity_labels': torch.stack([s['activity_labels'] for s in batch]),
        'sample_ids': [s['sample_id'] for s in batch]
    }
 def create_data_loaders_2d(
    data_dir: Path,
    batch_size: int = 32,
    train_split: float = 0.8,
    val_split: float = 0.1,
    test_split: float = 0.1,
    num_workers: int = 4,
    target_time_intervals: int = 60,
    isotope_index: Optional[IsotopeIndex] = None,
    max_activity_bq: float = 1000.0,
    seed: int = 42
 ) -> Tuple[DataLoader, DataLoader, DataLoader]:
    """
    Create train, validation, and test data loaders for 2D data.
    """
    # Create full dataset
    dataset = SpectrumDataset2D(
        data_dir=data_dir,
        isotope_index=isotope_index,
        max_activity_bq=max_activity_bq,
        target_time_intervals=target_time_intervals
    )
    # Calculate split sizes
    total = len(dataset)
    train_size = int(total * train_split)
    val_size = int(total * val_split)
    test_size = total - train_size - val_size
    # Split dataset
    generator = torch.Generator().manual_seed(seed)
    train_dataset, val_dataset, test_dataset = random_split(
        dataset, [train_size, val_size, test_size], generator=generator
    )
    print(f"Dataset splits: train={train_size}, val={val_size}, test={test_size}")
    # Create loaders
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        collate_fn=collate_fn_2d,
        pin_memory=True,
        persistent_workers=num_workers > 0
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn_2d,
        pin_memory=True,
        persistent_workers=num_workers > 0
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn_2d,
        pin_memory=True,
        persistent_workers=num_workers > 0
    )
    return train_loader, val_loader, test_loader
 if __name__ == "__main__":
    # Test the dataset
    from pathlib import Path
    data_dir = Path("O:/master_data_collection/isotopev2")
    dataset = SpectrumDataset2D(data_dir, target_time_intervals=60)
    sample = dataset[0]
    print(f"\nSample:")
    print(f"  Spectrum shape: {sample['spectrum'].shape}")
    print(f"  Presence labels: {sample['presence_labels'].sum().item():.0f} isotopes")
    print(f"  Sample ID: {sample['sample_id']}")
--- a/train/vega_ml/training/vega/isotope_index.py
+++ b/train/vega_ml/training/vega/isotope_index.py
@ -0,0 +1,141 @@
 """
 Isotope Index - Mapping between isotope names and model output indices.
 This module provides a consistent mapping between isotope names and their
 corresponding indices in the model's output tensors. This is critical for
 training and inference to ensure consistent label encoding.
 """
 import sys
 from pathlib import Path
 from typing import Dict, List, Optional
 # Add project root to path for imports
 PROJECT_ROOT = Path(__file__).parent.parent.parent
 sys.path.insert(0, str(PROJECT_ROOT))
 from synthetic_spectra.ground_truth.isotope_data import ISOTOPE_DATABASE, get_isotope_names
 class IsotopeIndex:
    """
    Manages the mapping between isotope names and model indices.
    The index is deterministic - isotopes are sorted alphabetically to ensure
    consistent ordering across training and inference.
    """
    def __init__(self, isotope_names: Optional[List[str]] = None):
        """
        Initialize the isotope index.
        Args:
            isotope_names: Optional list of isotope names. If None, uses all
                          isotopes from the database.
        """
        if isotope_names is None:
            isotope_names = get_isotope_names()
        # Sort alphabetically for deterministic ordering
        self._isotope_names = sorted(isotope_names)
        # Build bidirectional mappings
        self._name_to_idx: Dict[str, int] = {
            name: idx for idx, name in enumerate(self._isotope_names)
        }
        self._idx_to_name: Dict[int, str] = {
            idx: name for idx, name in enumerate(self._isotope_names)
        }
    @property
    def num_isotopes(self) -> int:
        """Total number of isotopes in the index."""
        return len(self._isotope_names)
    @property
    def isotope_names(self) -> List[str]:
        """List of all isotope names in index order."""
        return self._isotope_names.copy()
    def name_to_index(self, name: str) -> int:
        """
        Get the index for an isotope name.
        Args:
            name: Isotope name (e.g., "Cs-137")
        Returns:
            Integer index for the isotope
        Raises:
            KeyError: If isotope name not in index
        """
        if name not in self._name_to_idx:
            raise KeyError(f"Isotope '{name}' not found in index. "
                          f"Available isotopes: {self._isotope_names[:5]}...")
        return self._name_to_idx[name]
    def index_to_name(self, idx: int) -> str:
        """
        Get the isotope name for an index.
        Args:
            idx: Integer index
        Returns:
            Isotope name string
        Raises:
            KeyError: If index out of range
        """
        if idx not in self._idx_to_name:
            raise KeyError(f"Index {idx} out of range. Valid range: 0-{self.num_isotopes-1}")
        return self._idx_to_name[idx]
    def names_to_indices(self, names: List[str]) -> List[int]:
        """Convert list of names to list of indices."""
        return [self.name_to_index(name) for name in names]
    def indices_to_names(self, indices: List[int]) -> List[str]:
        """Convert list of indices to list of names."""
        return [self.index_to_name(idx) for idx in indices]
    def save(self, path: Path):
        """Save the isotope index to a file."""
        with open(path, 'w') as f:
            for name in self._isotope_names:
                f.write(f"{name}\n")
    @classmethod
    def load(cls, path: Path) -> 'IsotopeIndex':
        """Load an isotope index from a file."""
        with open(path, 'r') as f:
            isotope_names = [line.strip() for line in f if line.strip()]
        return cls(isotope_names)
    def __repr__(self) -> str:
        return f"IsotopeIndex(num_isotopes={self.num_isotopes})"
    def __len__(self) -> int:
        return self.num_isotopes
 # Global default isotope index using all isotopes from database
 DEFAULT_ISOTOPE_INDEX = IsotopeIndex()
 def get_default_isotope_index() -> IsotopeIndex:
    """Get the default isotope index with all database isotopes."""
    return DEFAULT_ISOTOPE_INDEX
 if __name__ == "__main__":
    # Print isotope index information
    index = get_default_isotope_index()
    print(f"Isotope Index: {index}")
    print(f"\nFirst 10 isotopes:")
    for i in range(min(10, index.num_isotopes)):
        print(f"  {i:3d}: {index.index_to_name(i)}")
    print(f"\nLast 10 isotopes:")
    for i in range(max(0, index.num_isotopes - 10), index.num_isotopes):
        print(f"  {i:3d}: {index.index_to_name(i)}")
--- a/train/vega_ml/training/vega/model.py
+++ b/train/vega_ml/training/vega/model.py
@ -0,0 +1,416 @@
 """
 Vega Model Architecture - CNN-FCNN with Multi-Task Heads
 A hybrid Convolutional Neural Network with Fully Connected Neural Network
 for gamma spectrum isotope identification. Based on peer-reviewed research
 showing CNN-FCNN achieves state-of-the-art performance (99%+ accuracy).
 Architecture:
    Input: 1D gamma spectrum (1023 channels, 20-3000 keV)
    ↓
    Feature Extraction: 3 CNN modules with LeakyReLU, MaxPool, Dropout
    ↓
    Classification Head: Dense layers → Sigmoid (multi-label isotope presence)
    ↓
    Regression Head: Dense layers → ReLU (activity estimation in Bq)
 """
 import torch
 import torch.nn as nn
 from dataclasses import dataclass, field
 from typing import Dict, List, Optional, Tuple
@dataclass
 class VegaConfig:
    """Configuration for the Vega model."""
    # Input configuration
    num_channels: int = 1023  # Number of energy channels in spectrum
    # Number of isotopes to classify
    num_isotopes: int = 82  # From isotope database
    # CNN backbone configuration
    conv_channels: List[int] = field(default_factory=lambda: [64, 128, 256])
    conv_kernel_size: int = 7
    pool_size: int = 2
    # Classification head configuration
    fc_hidden_dims: List[int] = field(default_factory=lambda: [512, 256])
    # Regularization
    dropout_rate: float = 0.3
    spatial_dropout_rate: float = 0.1
    # Activation
    leaky_relu_slope: float = 0.1
    # Loss weighting
    classification_weight: float = 1.0
    regression_weight: float = 0.1
    # Training
    max_activity_bq: float = 1000.0  # For activity normalization
 class ConvBlock(nn.Module):
    """
    Convolutional block with two conv layers, activation, pooling, and dropout.
    Based on Turner et al. (2021) architecture showing that stacking two
    convolutions per module with pooling achieves good feature extraction.
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 7,
        pool_size: int = 2,
        dropout_rate: float = 0.1,
        leaky_slope: float = 0.1
    ):
        super().__init__()
        # First convolution
        self.conv1 = nn.Conv1d(
            in_channels, out_channels, 
            kernel_size=kernel_size, 
            padding=kernel_size // 2
        )
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.act1 = nn.LeakyReLU(leaky_slope)
        # Second convolution
        self.conv2 = nn.Conv1d(
            out_channels, out_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2
        )
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.act2 = nn.LeakyReLU(leaky_slope)
        # Pooling and dropout
        self.pool = nn.MaxPool1d(pool_size)
        self.dropout = nn.Dropout1d(dropout_rate)  # Spatial dropout for 1D
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First conv block
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.act1(x)
        # Second conv block
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.act2(x)
        # Pool and dropout
        x = self.pool(x)
        x = self.dropout(x)
        return x
 class VegaModel(nn.Module):
    """
    Vega: CNN-FCNN Multi-Task Model for Isotope Identification
    Named after the bright star Vega (α Lyrae), which emits radiation
    across the electromagnetic spectrum - fitting for a gamma spectrum analyzer.
    The model performs two tasks:
    1. Multi-label classification: Which isotopes are present?
    2. Activity regression: What is the activity (Bq) of each isotope?
    """
    def __init__(self, config: VegaConfig):
        super().__init__()
        self.config = config
        # Build CNN backbone
        self.backbone = self._build_backbone()
        # Calculate flattened size after backbone
        self._flat_size = self._calculate_flat_size()
        # Build classification head (multi-label)
        self.classifier = self._build_classifier()
        # Build regression head (activity estimation)
        self.regressor = self._build_regressor()
        # Initialize weights
        self._init_weights()
    def _build_backbone(self) -> nn.Sequential:
        """Build the CNN feature extraction backbone."""
        layers = []
        in_channels = 1  # Input is 1D spectrum with 1 channel
        for out_channels in self.config.conv_channels:
            layers.append(ConvBlock(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=self.config.conv_kernel_size,
                pool_size=self.config.pool_size,
                dropout_rate=self.config.spatial_dropout_rate,
                leaky_slope=self.config.leaky_relu_slope
            ))
            in_channels = out_channels
        return nn.Sequential(*layers)
    def _calculate_flat_size(self) -> int:
        """Calculate the size of flattened features after backbone."""
        # Create dummy input to calculate size
        dummy = torch.zeros(1, 1, self.config.num_channels)
        with torch.no_grad():
            out = self.backbone(dummy)
        return out.numel()
    def _build_classifier(self) -> nn.Sequential:
        """Build the classification head for isotope presence prediction.
        Outputs raw logits (not probabilities) for AMP compatibility.
        Use BCEWithLogitsLoss for training, apply sigmoid during inference.
        """
        layers = []
        in_features = self._flat_size
        # Hidden layers
        for hidden_dim in self.config.fc_hidden_dims:
            layers.extend([
                nn.Linear(in_features, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            in_features = hidden_dim
        # Output layer - raw logits for AMP compatibility
        layers.append(nn.Linear(in_features, self.config.num_isotopes))
        return nn.Sequential(*layers)
    def _build_regressor(self) -> nn.Sequential:
        """Build the regression head for activity estimation."""
        layers = []
        in_features = self._flat_size
        # Hidden layers (shared architecture with classifier)
        for hidden_dim in self.config.fc_hidden_dims:
            layers.extend([
                nn.Linear(in_features, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            in_features = hidden_dim
        # Output layer with ReLU for non-negative activity values
        layers.extend([
            nn.Linear(in_features, self.config.num_isotopes),
            nn.ReLU()  # Activity must be non-negative
        ])
        return nn.Sequential(*layers)
    def _init_weights(self):
        """Initialize weights using He initialization for LeakyReLU."""
        for module in self.modules():
            if isinstance(module, (nn.Conv1d, nn.Linear)):
                nn.init.kaiming_normal_(
                    module.weight, 
                    a=self.config.leaky_relu_slope,
                    mode='fan_out',
                    nonlinearity='leaky_relu'
                )
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.BatchNorm1d):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)
    def forward(
        self, 
        x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass through the model.
        Args:
            x: Input spectrum tensor of shape (batch, channels) or (batch, 1, channels)
               Values should be normalized [0, 1]
        Returns:
            Tuple of:
                - isotope_logits: Raw logits for each isotope (batch, num_isotopes)
                  Apply sigmoid to get probabilities for inference
                - activity_pred: Predicted activity in Bq for each isotope (batch, num_isotopes)
        """
        # Ensure input has channel dimension
        if x.dim() == 2:
            x = x.unsqueeze(1)  # (batch, channels) -> (batch, 1, channels)
        # Feature extraction
        features = self.backbone(x)
        features = features.flatten(start_dim=1)
        # Classification head (outputs logits)
        isotope_logits = self.classifier(features)
        # Regression head
        activity_pred = self.regressor(features)
        return isotope_logits, activity_pred
    def predict(
        self,
        x: torch.Tensor,
        threshold: float = 0.5,
        return_all: bool = False
    ) -> Dict:
        """
        Make predictions with post-processing.
        Args:
            x: Input spectrum tensor
            threshold: Probability threshold for isotope presence
            return_all: If True, return predictions for all isotopes
        Returns:
            Dictionary with predictions
        """
        self.eval()
        with torch.no_grad():
            probs, activities = self(x)
        # Apply threshold
        present = probs >= threshold
        # Mask activities by presence
        masked_activities = activities * present.float()
        return {
            'probabilities': probs,
            'activities_bq': masked_activities * self.config.max_activity_bq,
            'present_mask': present,
            'threshold': threshold
        }
    def count_parameters(self) -> int:
        """Count total trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)
    def summary(self) -> str:
        """Get a summary of the model architecture."""
        lines = [
            "=" * 60,
            "VEGA Model - CNN-FCNN Multi-Task Isotope Identifier",
            "=" * 60,
            f"Input channels: {self.config.num_channels}",
            f"Output isotopes: {self.config.num_isotopes}",
            f"CNN channels: {self.config.conv_channels}",
            f"FC hidden dims: {self.config.fc_hidden_dims}",
            f"Dropout rate: {self.config.dropout_rate}",
            f"Total parameters: {self.count_parameters():,}",
            "=" * 60
        ]
        return "\n".join(lines)
 class VegaLoss(nn.Module):
    """
    Combined loss function for Vega multi-task learning.
    Combines:
    - Binary Cross-Entropy for isotope classification (multi-label)
    - Huber Loss for activity regression (robust to outliers)
    """
    def __init__(
        self,
        classification_weight: float = 1.0,
        regression_weight: float = 0.1,
        huber_delta: float = 1.0
    ):
        super().__init__()
        self.classification_weight = classification_weight
        self.regression_weight = regression_weight
        # Use BCEWithLogitsLoss for AMP safety (combines sigmoid + BCE)
        self.bce_loss = nn.BCEWithLogitsLoss()
        self.huber_loss = nn.HuberLoss(delta=huber_delta)
    def forward(
        self,
        pred_logits: torch.Tensor,
        pred_activities: torch.Tensor,
        target_presence: torch.Tensor,
        target_activities: torch.Tensor
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """
        Calculate combined loss.
        Args:
            pred_logits: Predicted isotope logits (batch, num_isotopes)
            pred_activities: Predicted activities (batch, num_isotopes)
            target_presence: Ground truth presence labels (batch, num_isotopes)
            target_activities: Ground truth activities (batch, num_isotopes)
        Returns:
            Tuple of total loss and dict of individual losses
        """
        # Classification loss (BCEWithLogitsLoss applies sigmoid internally)
        cls_loss = self.bce_loss(pred_logits, target_presence.float())
        # Regression loss (only for present isotopes)
        # Mask to only compute loss where isotopes are actually present
        mask = target_presence.float()
        if mask.sum() > 0:
            masked_pred = pred_activities * mask
            masked_target = target_activities * mask
            reg_loss = self.huber_loss(masked_pred, masked_target)
        else:
            reg_loss = torch.tensor(0.0, device=pred_activities.device)
        # Combined loss
        total_loss = (
            self.classification_weight * cls_loss + 
            self.regression_weight * reg_loss
        )
        loss_dict = {
            'total': total_loss.item(),
            'classification': cls_loss.item(),
            'regression': reg_loss.item() if isinstance(reg_loss, torch.Tensor) else reg_loss
        }
        return total_loss, loss_dict
 if __name__ == "__main__":
    # Test the model
    config = VegaConfig()
    model = VegaModel(config)
    print(model.summary())
    # Test forward pass
    batch_size = 4
    x = torch.randn(batch_size, config.num_channels)
    probs, activities = model(x)
    print(f"\nInput shape: {x.shape}")
    print(f"Output probs shape: {probs.shape}")
    print(f"Output activities shape: {activities.shape}")
    # Test loss
    loss_fn = VegaLoss()
    target_presence = torch.randint(0, 2, (batch_size, config.num_isotopes))
    target_activities = torch.rand(batch_size, config.num_isotopes) * 100
    loss, loss_dict = loss_fn(probs, activities, target_presence, target_activities)
    print(f"\nLoss: {loss_dict}")
--- a/train/vega_ml/training/vega/model_2d.py
+++ b/train/vega_ml/training/vega/model_2d.py
@ -0,0 +1,231 @@
 """
 Vega 2D Model - Uses Full Temporal Information
 This model treats gamma spectra as 2D images (time × channels) and uses
 Conv2d to extract both spectral and temporal features.
 Input shape: (batch, 1, time_intervals, channels) = (B, 1, 60, 1023)
 """
 import torch
 import torch.nn as nn
 from dataclasses import dataclass, field
 from typing import List, Tuple
@dataclass
 class Vega2DConfig:
    """Configuration for Vega 2D model."""
    # Input dimensions
    num_channels: int = 1023  # Energy channels
    num_time_intervals: int = 60  # Fixed time dimension
    # Output
    num_isotopes: int = 82
    # CNN architecture
    conv_channels: List[int] = field(default_factory=lambda: [32, 64, 128])
    kernel_size: Tuple[int, int] = (3, 7)  # (time, energy) - larger in energy dimension
    pool_size: Tuple[int, int] = (2, 2)
    # FC layers
    fc_hidden_dims: List[int] = field(default_factory=lambda: [512, 256])
    # Regularization
    dropout_rate: float = 0.3
    leaky_relu_slope: float = 0.01
    # Activity scaling
    max_activity_bq: float = 1000.0
 class ConvBlock2D(nn.Module):
    """2D Convolutional block with BatchNorm, activation, pooling, and dropout."""
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: Tuple[int, int],
        pool_size: Tuple[int, int],
        dropout_rate: float,
        leaky_relu_slope: float
    ):
        super().__init__()
        # Padding to maintain spatial dimensions before pooling
        padding = (kernel_size[0] // 2, kernel_size[1] // 2)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(leaky_relu_slope)
        self.pool = nn.MaxPool2d(pool_size)
        self.dropout = nn.Dropout2d(dropout_rate)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.activation(self.bn1(self.conv1(x)))
        x = self.activation(self.bn2(self.conv2(x)))
        x = self.pool(x)
        x = self.dropout(x)
        return x
 class Vega2DModel(nn.Module):
    """
    2D CNN model for gamma spectrum isotope identification.
    Treats spectra as images with time on one axis and energy channels on the other.
    This preserves temporal information that is lost in the 1D approach.
    """
    def __init__(self, config: Vega2DConfig = None):
        super().__init__()
        self.config = config or Vega2DConfig()
        # Build CNN backbone
        self.conv_blocks = nn.ModuleList()
        in_channels = 1  # Single channel input (like grayscale image)
        for out_channels in self.config.conv_channels:
            self.conv_blocks.append(ConvBlock2D(
                in_channels=in_channels,
                out_channels=out_channels,
                kernel_size=self.config.kernel_size,
                pool_size=self.config.pool_size,
                dropout_rate=self.config.dropout_rate,
                leaky_relu_slope=self.config.leaky_relu_slope
            ))
            in_channels = out_channels
        # Calculate flattened size after conv blocks
        self.flat_size = self._calculate_flat_size()
        # Fully connected classifier
        fc_layers = []
        fc_in = self.flat_size
        for fc_out in self.config.fc_hidden_dims:
            fc_layers.extend([
                nn.Linear(fc_in, fc_out),
                nn.BatchNorm1d(fc_out),
                nn.LeakyReLU(self.config.leaky_relu_slope),
                nn.Dropout(self.config.dropout_rate)
            ])
            fc_in = fc_out
        self.fc_backbone = nn.Sequential(*fc_layers)
        # Output heads
        self.classifier = nn.Linear(fc_in, self.config.num_isotopes)  # Logits for BCE
        self.regressor = nn.Sequential(
            nn.Linear(fc_in, self.config.num_isotopes),
            nn.ReLU()  # Activity must be non-negative
        )
        # Initialize weights
        self._init_weights()
    def _calculate_flat_size(self) -> int:
        """Calculate the flattened size after all conv blocks."""
        # Start with input dimensions
        h = self.config.num_time_intervals  # 60
        w = self.config.num_channels  # 1023
        # Each conv block applies pooling that halves dimensions
        for _ in self.config.conv_channels:
            h = h // self.config.pool_size[0]
            w = w // self.config.pool_size[1]
        # Final size = last_channels * h * w
        return self.config.conv_channels[-1] * h * w
    def _init_weights(self):
        """Initialize weights using Kaiming initialization."""
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='leaky_relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass.
        Args:
            x: Input tensor of shape (batch, 1, time_intervals, channels)
               or (batch, time_intervals, channels) - will add channel dim
        Returns:
            Tuple of (logits, activities):
                - logits: (batch, num_isotopes) - raw scores for BCE loss
                - activities: (batch, num_isotopes) - predicted activities (normalized 0-1)
        """
        # Add channel dimension if needed: (B, T, C) -> (B, 1, T, C)
        if x.dim() == 3:
            x = x.unsqueeze(1)
        # CNN backbone
        for conv_block in self.conv_blocks:
            x = conv_block(x)
        # Flatten
        x = x.view(x.size(0), -1)
        # FC backbone
        x = self.fc_backbone(x)
        # Output heads
        logits = self.classifier(x)
        activities = self.regressor(x)
        return logits, activities
    def predict(self, x: torch.Tensor, threshold: float = 0.5) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Predict isotope presence and activities.
        Args:
            x: Input spectrum
            threshold: Probability threshold for presence
        Returns:
            Tuple of (presence, activities):
                - presence: (batch, num_isotopes) binary predictions
                - activities: (batch, num_isotopes) in Bq
        """
        logits, activities_norm = self.forward(x)
        probs = torch.sigmoid(logits)
        presence = (probs >= threshold).float()
        activities_bq = activities_norm * self.config.max_activity_bq
        return presence, activities_bq
 def count_parameters(model: nn.Module) -> int:
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
 if __name__ == "__main__":
    # Test the model
    config = Vega2DConfig()
    model = Vega2DModel(config)
    print(f"Vega 2D Model")
    print(f"  Input: ({config.num_time_intervals}, {config.num_channels})")
    print(f"  Conv channels: {config.conv_channels}")
    print(f"  FC dims: {config.fc_hidden_dims}")
    print(f"  Flat size: {model.flat_size}")
    print(f"  Parameters: {count_parameters(model):,}")
    # Test forward pass
    batch = torch.randn(4, 1, config.num_time_intervals, config.num_channels)
    logits, activities = model(batch)
    print(f"\n  Test batch: {batch.shape}")
    print(f"  Logits: {logits.shape}")
    print(f"  Activities: {activities.shape}")
--- a/train/vega_ml/training/vega/run_training.py
+++ b/train/vega_ml/training/vega/run_training.py
@ -0,0 +1,120 @@
 #!/usr/bin/env python
 """
 Run Vega Training
 Simple script to train the Vega model on synthetic gamma spectra.
 Designed for both quick test runs and full-scale training.
 """
 import sys
 import argparse
 from pathlib import Path
 # Add project root to path
 PROJECT_ROOT = Path(__file__).parent.parent.parent
 sys.path.insert(0, str(PROJECT_ROOT))
 from training.vega.train import train_vega, TrainingConfig
 from training.vega.model import VegaConfig
 def main():
    parser = argparse.ArgumentParser(
        description="Train Vega model for isotope identification"
    )
    # Data paths
    parser.add_argument(
        "--data-dir", "-d",
        type=str,
        default="O:/master_data_collection/isotopev2",
        help="Path to synthetic data directory"
    )
    parser.add_argument(
        "--model-dir", "-m",
        type=str,
        default="models",
        help="Directory to save trained models"
    )
    # Training parameters
    parser.add_argument(
        "--epochs", "-e",
        type=int,
        default=100,
        help="Maximum number of training epochs"
    )
    parser.add_argument(
        "--batch-size", "-b",
        type=int,
        default=64,
        help="Batch size for training (default: 64 for better GPU utilization)"
    )
    parser.add_argument(
        "--learning-rate", "-lr",
        type=float,
        default=1e-3,
        help="Initial learning rate"
    )
    # Quick test mode
    parser.add_argument(
        "--test",
        action="store_true",
        help="Quick test mode with reduced epochs"
    )
    # Mixed precision
    parser.add_argument(
        "--no-amp",
        action="store_true",
        help="Disable automatic mixed precision training"
    )
    # Data loading parallelism
    parser.add_argument(
        "--workers", "-w",
        type=int,
        default=8,
        help="Number of data loading workers (default: 8 for parallel I/O)"
    )
    args = parser.parse_args()
    # Create training config
    config = TrainingConfig(
        data_dir=args.data_dir,
        model_dir=args.model_dir,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        num_epochs=args.epochs if not args.test else 5,
        patience=10 if not args.test else 3,
        use_amp=not args.no_amp,
        num_workers=args.workers
    )
    # Create model config
    model_config = VegaConfig()
    print("\n" + "=" * 60)
    print("VEGA TRAINING")
    print("=" * 60)
    print(f"Data directory: {args.data_dir}")
    print(f"Model directory: {args.model_dir}")
    print(f"Epochs: {config.num_epochs}")
    print(f"Batch size: {config.batch_size}")
    print(f"Learning rate: {config.learning_rate}")
    print(f"Mixed precision: {config.use_amp}")
    print(f"Data workers: {config.num_workers}")
    if args.test:
        print("MODE: Quick test run")
    print("=" * 60 + "\n")
    # Run training
    model, results = train_vega(config=config, model_config=model_config)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/train/vega_ml/training/vega/train.py
+++ b/train/vega_ml/training/vega/train.py
@ -0,0 +1,614 @@
 """
 Training Script for Vega Model
 Implements the training loop with:
 - Mixed precision training for RTX 5090 efficiency
 - Learning rate scheduling
 - Early stopping
 - Model checkpointing
 - Training metrics logging
 """
 import os
 import sys
 import json
 import time
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, Optional, Tuple
 from dataclasses import dataclass, asdict
 import torch
 import torch.nn as nn
 from torch.optim import Adam
 from torch.optim.lr_scheduler import ReduceLROnPlateau
 import numpy as np
 # Sklearn metrics for comprehensive evaluation
 from sklearn.metrics import (
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    hamming_loss
 )
 from .model import VegaModel, VegaConfig, VegaLoss
 from .dataset import create_data_loaders, SpectrumDataset
 from .isotope_index import IsotopeIndex, get_default_isotope_index
@dataclass
 class TrainingConfig:
    """Configuration for training."""
    # Data
    data_dir: str = "O:/master_data_collection/isotopev2"
    # Model save path
    model_dir: str = "models"
    model_name: str = "vega"
    # Training hyperparameters
    batch_size: int = 64  # Increased from 32 for better GPU utilization
    learning_rate: float = 1e-3
    weight_decay: float = 1e-4
    num_epochs: int = 100
    # Early stopping
    patience: int = 10
    min_delta: float = 1e-4
    # Learning rate scheduling
    lr_scheduler_patience: int = 5
    lr_scheduler_factor: float = 0.5
    min_lr: float = 1e-6
    # Mixed precision
    use_amp: bool = True
    # Data splits
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1
    # Workers - parallel data loading for better GPU utilization
    num_workers: int = 8  # Parallel data loading workers
    prefetch_factor: int = 4  # Batches to prefetch per worker
    persistent_workers: bool = True  # Keep workers alive between epochs
    # Reproducibility
    seed: int = 42
    # Activity normalization
    max_activity_bq: float = 1000.0
 class VegaTrainer:
    """
    Trainer class for the Vega model.
    Handles the training loop, validation, checkpointing, and metrics.
    """
    def __init__(
        self,
        model: VegaModel,
        config: TrainingConfig,
        device: Optional[torch.device] = None,
        force_cpu: bool = False
    ):
        self.model = model
        self.config = config
        # Device selection - force CPU if requested or if CUDA incompatible
        if force_cpu:
            self.device = torch.device('cpu')
        elif device:
            self.device = device
        else:
            # Try CUDA but fall back to CPU if there are compatibility issues
            if torch.cuda.is_available():
                try:
                    # Test if CUDA actually works
                    test_tensor = torch.zeros(1, device='cuda')
                    _ = test_tensor + 1
                    self.device = torch.device('cuda')
                except RuntimeError:
                    print("CUDA device found but not compatible, falling back to CPU")
                    self.device = torch.device('cpu')
            else:
                self.device = torch.device('cpu')
        # Move model to device
        self.model = self.model.to(self.device)
        # Setup loss function
        self.loss_fn = VegaLoss(
            classification_weight=model.config.classification_weight,
            regression_weight=model.config.regression_weight
        )
        # Setup optimizer
        self.optimizer = Adam(
            self.model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
        # Setup learning rate scheduler
        self.scheduler = ReduceLROnPlateau(
            self.optimizer,
            mode='min',
            patience=config.lr_scheduler_patience,
            factor=config.lr_scheduler_factor,
            min_lr=config.min_lr
        )
        # Setup mixed precision training (only if CUDA is working)
        if config.use_amp and self.device.type == 'cuda':
            self.scaler = torch.amp.GradScaler('cuda')
        else:
            self.scaler = None
        # Training state
        self.current_epoch = 0
        self.best_val_loss = float('inf')
        self.epochs_without_improvement = 0
        self.training_history = []
        # Create model directory
        self.model_dir = Path(config.model_dir)
        self.model_dir.mkdir(parents=True, exist_ok=True)
        print(f"Training on device: {self.device}")
        if self.device.type == 'cuda':
            print(f"GPU: {torch.cuda.get_device_name(0)}")
            print(f"Mixed precision: {config.use_amp}")
    def train_epoch(self, train_loader) -> Dict[str, float]:
        """Train for one epoch."""
        self.model.train()
        total_loss = 0.0
        cls_loss_sum = 0.0
        reg_loss_sum = 0.0
        num_batches = 0
        # Track accuracy during training
        correct_isotopes = 0
        total_isotopes = 0
        # Timing for profiling - track data loading vs GPU compute
        data_time = 0.0
        compute_time = 0.0
        data_start = time.time()
        for batch in train_loader:
            # Data loading time (time spent waiting for next batch)
            data_time += time.time() - data_start
            compute_start = time.time()
            # Move to device
            spectra = batch['spectrum'].to(self.device)
            presence = batch['presence_labels'].to(self.device)
            activities = batch['activity_labels'].to(self.device)
            # Zero gradients
            self.optimizer.zero_grad()
            # Forward pass with optional mixed precision
            if self.scaler is not None:
                with torch.amp.autocast('cuda'):
                    pred_logits, pred_activities = self.model(spectra)
                    loss, loss_dict = self.loss_fn(
                        pred_logits, pred_activities, presence, activities
                    )
                # Backward pass with scaling
                self.scaler.scale(loss).backward()
                self.scaler.step(self.optimizer)
                self.scaler.update()
            else:
                pred_logits, pred_activities = self.model(spectra)
                loss, loss_dict = self.loss_fn(
                    pred_logits, pred_activities, presence, activities
                )
                loss.backward()
                self.optimizer.step()
            total_loss += loss_dict['total']
            cls_loss_sum += loss_dict['classification']
            reg_loss_sum += loss_dict['regression']
            num_batches += 1
            # Calculate training accuracy (detach to avoid memory buildup)
            with torch.no_grad():
                pred_probs = torch.sigmoid(pred_logits)
                pred_presence = (pred_probs >= 0.5).float()
                correct_isotopes += (pred_presence == presence).sum().item()
                total_isotopes += presence.numel()
            # Mark compute time and restart timing for data loading
            compute_time += time.time() - compute_start
            data_start = time.time()
        train_accuracy = correct_isotopes / total_isotopes if total_isotopes > 0 else 0.0
        return {
            'train_loss': total_loss / num_batches,
            'train_cls_loss': cls_loss_sum / num_batches,
            'train_reg_loss': reg_loss_sum / num_batches,
            'train_accuracy': train_accuracy,
            'data_time': data_time,
            'compute_time': compute_time
        }
    @torch.no_grad()
    def validate(self, val_loader) -> Dict[str, float]:
        """Validate the model with comprehensive metrics."""
        if val_loader is None:
            return {}
        self.model.eval()
        total_loss = 0.0
        cls_loss_sum = 0.0
        reg_loss_sum = 0.0
        num_batches = 0
        # Collect all predictions and labels for sklearn metrics
        all_probs = []
        all_preds = []
        all_labels = []
        for batch in val_loader:
            spectra = batch['spectrum'].to(self.device)
            presence = batch['presence_labels'].to(self.device)
            activities = batch['activity_labels'].to(self.device)
            pred_logits, pred_activities = self.model(spectra)
            loss, loss_dict = self.loss_fn(
                pred_logits, pred_activities, presence, activities
            )
            total_loss += loss_dict['total']
            cls_loss_sum += loss_dict['classification']
            reg_loss_sum += loss_dict['regression']
            num_batches += 1
            # Collect predictions for metrics
            pred_probs = torch.sigmoid(pred_logits)
            pred_presence = (pred_probs >= 0.5).float()
            all_probs.append(pred_probs.cpu().numpy())
            all_preds.append(pred_presence.cpu().numpy())
            all_labels.append(presence.cpu().numpy())
        # Concatenate all batches
        all_probs = np.vstack(all_probs)
        all_preds = np.vstack(all_preds)
        all_labels = np.vstack(all_labels)
        # Basic accuracy (element-wise)
        correct = (all_preds == all_labels).sum()
        total = all_labels.size
        accuracy = correct / total if total > 0 else 0.0
        # Multi-label metrics using sklearn
        metrics = {
            'val_loss': total_loss / num_batches,
            'val_cls_loss': cls_loss_sum / num_batches,
            'val_reg_loss': reg_loss_sum / num_batches,
            'val_accuracy': accuracy,
        }
        try:
            # ROC-AUC (macro-averaged over isotopes with both classes present)
            # Only compute for columns that have both 0s and 1s
            valid_cols = []
            for i in range(all_labels.shape[1]):
                if len(np.unique(all_labels[:, i])) == 2:
                    valid_cols.append(i)
            if valid_cols:
                auc_macro = roc_auc_score(
                    all_labels[:, valid_cols], 
                    all_probs[:, valid_cols], 
                    average='macro'
                )
                auc_micro = roc_auc_score(
                    all_labels[:, valid_cols], 
                    all_probs[:, valid_cols], 
                    average='micro'
                )
                metrics['val_auc_macro'] = auc_macro
                metrics['val_auc_micro'] = auc_micro
            else:
                metrics['val_auc_macro'] = 0.0
                metrics['val_auc_micro'] = 0.0
        except ValueError:
            # Handle case where AUC can't be computed
            metrics['val_auc_macro'] = 0.0
            metrics['val_auc_micro'] = 0.0
        # F1, Precision, Recall (samples-averaged for multi-label)
        metrics['val_f1_macro'] = f1_score(all_labels, all_preds, average='macro', zero_division=0)
        metrics['val_f1_micro'] = f1_score(all_labels, all_preds, average='micro', zero_division=0)
        metrics['val_precision'] = precision_score(all_labels, all_preds, average='micro', zero_division=0)
        metrics['val_recall'] = recall_score(all_labels, all_preds, average='micro', zero_division=0)
        # Hamming loss (fraction of labels incorrectly predicted)
        metrics['val_hamming'] = hamming_loss(all_labels, all_preds)
        # Exact match ratio (all isotopes correct for a sample)
        exact_matches = (all_preds == all_labels).all(axis=1).sum()
        metrics['val_exact_match'] = exact_matches / len(all_labels)
        return metrics
    def save_checkpoint(self, path: Path, is_best: bool = False):
        """Save a model checkpoint."""
        checkpoint = {
            'epoch': self.current_epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'scheduler_state_dict': self.scheduler.state_dict(),
            'best_val_loss': self.best_val_loss,
            'model_config': asdict(self.model.config),
            'training_config': asdict(self.config),
            'training_history': self.training_history
        }
        if self.scaler is not None:
            checkpoint['scaler_state_dict'] = self.scaler.state_dict()
        torch.save(checkpoint, path)
        if is_best:
            best_path = path.parent / f"{self.config.model_name}_best.pt"
            torch.save(checkpoint, best_path)
    def load_checkpoint(self, path: Path):
        """Load a model checkpoint."""
        checkpoint = torch.load(path, map_location=self.device)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        if self.scaler is not None and 'scaler_state_dict' in checkpoint:
            self.scaler.load_state_dict(checkpoint['scaler_state_dict'])
        self.current_epoch = checkpoint['epoch']
        self.best_val_loss = checkpoint['best_val_loss']
        self.training_history = checkpoint.get('training_history', [])
        print(f"Loaded checkpoint from epoch {self.current_epoch}")
    def train(
        self,
        train_loader,
        val_loader,
        resume_from: Optional[Path] = None
    ) -> Dict:
        """
        Full training loop.
        Args:
            train_loader: Training data loader
            val_loader: Validation data loader
            resume_from: Optional path to checkpoint to resume from
        Returns:
            Training results dictionary
        """
        if resume_from is not None:
            self.load_checkpoint(resume_from)
        print("\n" + "=" * 60)
        print("Starting Vega Training")
        print("=" * 60)
        print(f"Epochs: {self.config.num_epochs}")
        print(f"Batch size: {self.config.batch_size}")
        print(f"Learning rate: {self.config.learning_rate}")
        print(f"Training samples: {len(train_loader.dataset)}")
        if val_loader:
            print(f"Validation samples: {len(val_loader.dataset)}")
        print("=" * 60 + "\n")
        start_time = time.time()
        for epoch in range(self.current_epoch, self.config.num_epochs):
            self.current_epoch = epoch
            epoch_start = time.time()
            # Train
            train_metrics = self.train_epoch(train_loader)
            # Validate
            val_metrics = self.validate(val_loader)
            # Combine metrics
            metrics = {**train_metrics, **val_metrics, 'epoch': epoch}
            self.training_history.append(metrics)
            # Update learning rate
            if val_loader and 'val_loss' in val_metrics:
                self.scheduler.step(val_metrics['val_loss'])
            else:
                self.scheduler.step(train_metrics['train_loss'])
            # Check for improvement
            val_loss = val_metrics.get('val_loss', train_metrics['train_loss'])
            is_best = val_loss < self.best_val_loss - self.config.min_delta
            if is_best:
                self.best_val_loss = val_loss
                self.epochs_without_improvement = 0
            else:
                self.epochs_without_improvement += 1
            # Save checkpoint
            checkpoint_path = self.model_dir / f"{self.config.model_name}_epoch_{epoch}.pt"
            self.save_checkpoint(checkpoint_path, is_best=is_best)
            # Logging
            epoch_time = time.time() - epoch_start
            lr = self.optimizer.param_groups[0]['lr']
            # Primary metrics line
            log_str = (
                f"Epoch {epoch+1:3d}/{self.config.num_epochs} | "
                f"Train Loss: {train_metrics['train_loss']:.4f} | "
                f"Train Acc: {train_metrics['train_accuracy']:.4f} | "
            )
            if val_loader:
                log_str += (
                    f"Val Loss: {val_metrics['val_loss']:.4f} | "
                    f"Val Acc: {val_metrics['val_accuracy']:.4f} | "
                )
            log_str += f"LR: {lr:.2e} | Time: {epoch_time:.1f}s"
            if is_best:
                log_str += " *"
            print(log_str)
            # Timing breakdown line
            data_t = train_metrics.get('data_time', 0)
            compute_t = train_metrics.get('compute_time', 0)
            if data_t > 0 or compute_t > 0:
                data_pct = 100 * data_t / (data_t + compute_t) if (data_t + compute_t) > 0 else 0
                print(f"         └── Data: {data_t:.1f}s ({data_pct:.0f}%) | Compute: {compute_t:.1f}s ({100-data_pct:.0f}%)")
            # Secondary metrics line (detailed classification metrics)
            if val_loader and 'val_auc_macro' in val_metrics:
                detail_str = (
                    f"         └── AUC: {val_metrics['val_auc_macro']:.4f} | "
                    f"F1: {val_metrics['val_f1_macro']:.4f} | "
                    f"Prec: {val_metrics['val_precision']:.4f} | "
                    f"Recall: {val_metrics['val_recall']:.4f} | "
                    f"Exact: {val_metrics['val_exact_match']:.4f}"
                )
                print(detail_str)
            # Early stopping
            if self.epochs_without_improvement >= self.config.patience:
                print(f"\nEarly stopping after {epoch + 1} epochs")
                break
        total_time = time.time() - start_time
        # Save final model
        final_path = self.model_dir / f"{self.config.model_name}_final.pt"
        self.save_checkpoint(final_path)
        # Save training history
        history_path = self.model_dir / f"{self.config.model_name}_history.json"
        with open(history_path, 'w') as f:
            json.dump(self.training_history, f, indent=2)
        print("\n" + "=" * 60)
        print(f"Training complete!")
        print(f"Total time: {total_time / 60:.1f} minutes")
        print(f"Best validation loss: {self.best_val_loss:.4f}")
        print(f"Model saved to: {final_path}")
        print("=" * 60)
        return {
            'best_val_loss': self.best_val_loss,
            'total_epochs': self.current_epoch + 1,
            'total_time': total_time,
            'history': self.training_history
        }
 def train_vega(
    data_dir: Optional[str] = None,
    model_dir: Optional[str] = None,
    config: Optional[TrainingConfig] = None,
    model_config: Optional[VegaConfig] = None
 ) -> Tuple[VegaModel, Dict]:
    """
    Convenience function to train a Vega model.
    Args:
        data_dir: Path to data directory
        model_dir: Path to save models
        config: Training configuration
        model_config: Model configuration
    Returns:
        Tuple of (trained model, training results)
    """
    # Setup paths
    project_root = Path(__file__).parent.parent.parent
    if config is None:
        config = TrainingConfig()
    if data_dir:
        config.data_dir = data_dir
    if model_dir:
        config.model_dir = model_dir
    # Make paths absolute
    data_path = Path(config.data_dir)
    if not data_path.is_absolute():
        data_path = project_root / data_path
    model_path = Path(config.model_dir)
    if not model_path.is_absolute():
        model_path = project_root / model_path
    config.model_dir = str(model_path)
    # Set random seeds
    torch.manual_seed(config.seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(config.seed)
    # Get isotope index
    isotope_index = get_default_isotope_index()
    # Create data loaders with parallel loading
    train_loader, val_loader, test_loader = create_data_loaders(
        data_dir=data_path,
        batch_size=config.batch_size,
        train_split=config.train_split,
        val_split=config.val_split,
        test_split=config.test_split,
        num_workers=config.num_workers,
        prefetch_factor=config.prefetch_factor,
        persistent_workers=config.persistent_workers,
        isotope_index=isotope_index,
        max_activity_bq=config.max_activity_bq,
        seed=config.seed
    )
    # Create model
    if model_config is None:
        model_config = VegaConfig(
            num_isotopes=isotope_index.num_isotopes,
            max_activity_bq=config.max_activity_bq
        )
    model = VegaModel(model_config)
    print(model.summary())
    # Create trainer
    trainer = VegaTrainer(model, config)
    # Train
    results = trainer.train(train_loader, val_loader)
    # Save isotope index with model
    index_path = model_path / f"{config.model_name}_isotope_index.txt"
    isotope_index.save(index_path)
    return model, results
 if __name__ == "__main__":
    # Quick test training
    model, results = train_vega()
--- a/train/vega_ml/training/vega/train_2d.py
+++ b/train/vega_ml/training/vega/train_2d.py
@ -0,0 +1,411 @@
 """
 Training Script for Vega 2D Model
 Uses 2D convolutions to process gamma spectra with temporal information.
 """
 import argparse
 import json
 import time
 from pathlib import Path
 from dataclasses import dataclass, asdict
 from typing import Optional, Tuple, Dict, List
 import numpy as np
 import torch
 import torch.nn as nn
 import torch.optim as optim
 from torch.cuda.amp import GradScaler, autocast
 from .model_2d import Vega2DModel, Vega2DConfig, count_parameters
 from .dataset_2d import create_data_loaders_2d, SpectrumDataset2D
 from .isotope_index import get_default_isotope_index
@dataclass
 class TrainingConfig2D:
    """Training configuration for 2D model."""
    # Data
    data_dir: str = "O:/master_data_collection/isotopev2"
    model_dir: str = "models"
    # Model
    target_time_intervals: int = 60
    # Training
    epochs: int = 50
    batch_size: int = 32
    learning_rate: float = 1e-3
    weight_decay: float = 1e-5
    # Loss weights
    classification_weight: float = 1.0
    regression_weight: float = 0.1
    # Mixed precision
    use_amp: bool = True
    # Early stopping
    early_stopping_patience: int = 10
    # Learning rate scheduler
    lr_scheduler_patience: int = 5
    lr_scheduler_factor: float = 0.5
    # Data loading
    num_workers: int = 4
 def train_epoch(
    model: nn.Module,
    train_loader,
    optimizer: optim.Optimizer,
    criterion_cls: nn.Module,
    criterion_reg: nn.Module,
    device: torch.device,
    scaler: Optional[GradScaler],
    config: TrainingConfig2D
 ) -> Dict[str, float]:
    """Train for one epoch."""
    model.train()
    total_loss = 0.0
    total_cls_loss = 0.0
    total_reg_loss = 0.0
    num_batches = 0
    for batch in train_loader:
        spectra = batch['spectrum'].to(device)
        presence = batch['presence_labels'].to(device)
        activities = batch['activity_labels'].to(device)
        optimizer.zero_grad()
        if scaler is not None:
            with autocast():
                logits, pred_activities = model(spectra)
                cls_loss = criterion_cls(logits, presence)
                reg_loss = criterion_reg(pred_activities, activities)
                loss = config.classification_weight * cls_loss + config.regression_weight * reg_loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            logits, pred_activities = model(spectra)
            cls_loss = criterion_cls(logits, presence)
            reg_loss = criterion_reg(pred_activities, activities)
            loss = config.classification_weight * cls_loss + config.regression_weight * reg_loss
            loss.backward()
            optimizer.step()
        total_loss += loss.item()
        total_cls_loss += cls_loss.item()
        total_reg_loss += reg_loss.item()
        num_batches += 1
    return {
        'loss': total_loss / num_batches,
        'cls_loss': total_cls_loss / num_batches,
        'reg_loss': total_reg_loss / num_batches
    }
@torch.no_grad()
 def validate(
    model: nn.Module,
    val_loader,
    criterion_cls: nn.Module,
    criterion_reg: nn.Module,
    device: torch.device,
    config: TrainingConfig2D,
    threshold: float = 0.5
 ) -> Dict[str, float]:
    """Validate the model."""
    model.eval()
    total_loss = 0.0
    total_cls_loss = 0.0
    total_reg_loss = 0.0
    num_batches = 0
    all_preds = []
    all_labels = []
    for batch in val_loader:
        spectra = batch['spectrum'].to(device)
        presence = batch['presence_labels'].to(device)
        activities = batch['activity_labels'].to(device)
        logits, pred_activities = model(spectra)
        cls_loss = criterion_cls(logits, presence)
        reg_loss = criterion_reg(pred_activities, activities)
        loss = config.classification_weight * cls_loss + config.regression_weight * reg_loss
        total_loss += loss.item()
        total_cls_loss += cls_loss.item()
        total_reg_loss += reg_loss.item()
        num_batches += 1
        # Collect predictions for metrics
        probs = torch.sigmoid(logits)
        preds = (probs >= threshold).float()
        all_preds.append(preds.cpu())
        all_labels.append(presence.cpu())
    # Calculate metrics
    all_preds = torch.cat(all_preds, dim=0)
    all_labels = torch.cat(all_labels, dim=0)
    # Per-sample accuracy (all isotopes correct)
    exact_match = (all_preds == all_labels).all(dim=1).float().mean().item()
    # Per-isotope metrics
    tp = ((all_preds == 1) & (all_labels == 1)).sum().item()
    fp = ((all_preds == 1) & (all_labels == 0)).sum().item()
    fn = ((all_preds == 0) & (all_labels == 1)).sum().item()
    tn = ((all_preds == 0) & (all_labels == 0)).sum().item()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {
        'loss': total_loss / num_batches,
        'cls_loss': total_cls_loss / num_batches,
        'reg_loss': total_reg_loss / num_batches,
        'exact_match': exact_match,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
 def train_vega_2d(
    config: TrainingConfig2D = None,
    model_config: Vega2DConfig = None
 ) -> Tuple[Vega2DModel, Dict]:
    """
    Train the Vega 2D model.
    """
    config = config or TrainingConfig2D()
    model_config = model_config or Vega2DConfig(num_time_intervals=config.target_time_intervals)
    # Setup device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    if device.type == 'cuda':
        print(f"  GPU: {torch.cuda.get_device_name()}")
        print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    # Create model
    model = Vega2DModel(model_config).to(device)
    print(f"\nModel: Vega 2D")
    print(f"  Input: ({model_config.num_time_intervals}, {model_config.num_channels})")
    print(f"  Conv channels: {model_config.conv_channels}")
    print(f"  FC dims: {model_config.fc_hidden_dims}")
    print(f"  Parameters: {count_parameters(model):,}")
    # Create data loaders
    print(f"\nLoading data from: {config.data_dir}")
    isotope_index = get_default_isotope_index()
    train_loader, val_loader, test_loader = create_data_loaders_2d(
        data_dir=Path(config.data_dir),
        batch_size=config.batch_size,
        target_time_intervals=config.target_time_intervals,
        isotope_index=isotope_index,
        num_workers=config.num_workers
    )
    # Loss functions
    criterion_cls = nn.BCEWithLogitsLoss()
    criterion_reg = nn.HuberLoss()
    # Optimizer
    optimizer = optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay
    )
    # Learning rate scheduler
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',
        factor=config.lr_scheduler_factor,
        patience=config.lr_scheduler_patience
    )
    # Mixed precision scaler
    scaler = GradScaler() if config.use_amp and device.type == 'cuda' else None
    # Training history
    history = {
        'train_loss': [], 'val_loss': [],
        'train_cls_loss': [], 'val_cls_loss': [],
        'train_reg_loss': [], 'val_reg_loss': [],
        'val_exact_match': [], 'val_precision': [], 'val_recall': [], 'val_f1': [],
        'lr': []
    }
    # Early stopping
    best_val_loss = float('inf')
    patience_counter = 0
    # Model directory
    model_dir = Path(config.model_dir)
    model_dir.mkdir(exist_ok=True)
    print(f"\nStarting training for {config.epochs} epochs...")
    print(f"  Batch size: {config.batch_size}")
    print(f"  Learning rate: {config.learning_rate}")
    print(f"  AMP: {scaler is not None}")
    print()
    start_time = time.time()
    for epoch in range(config.epochs):
        epoch_start = time.time()
        # Train
        train_metrics = train_epoch(
            model, train_loader, optimizer,
            criterion_cls, criterion_reg,
            device, scaler, config
        )
        # Validate
        val_metrics = validate(
            model, val_loader,
            criterion_cls, criterion_reg,
            device, config
        )
        # Update scheduler
        scheduler.step(val_metrics['loss'])
        current_lr = optimizer.param_groups[0]['lr']
        # Record history
        history['train_loss'].append(train_metrics['loss'])
        history['val_loss'].append(val_metrics['loss'])
        history['train_cls_loss'].append(train_metrics['cls_loss'])
        history['val_cls_loss'].append(val_metrics['cls_loss'])
        history['train_reg_loss'].append(train_metrics['reg_loss'])
        history['val_reg_loss'].append(val_metrics['reg_loss'])
        history['val_exact_match'].append(val_metrics['exact_match'])
        history['val_precision'].append(val_metrics['precision'])
        history['val_recall'].append(val_metrics['recall'])
        history['val_f1'].append(val_metrics['f1'])
        history['lr'].append(current_lr)
        epoch_time = time.time() - epoch_start
        # Print progress
        print(f"Epoch {epoch+1:3d}/{config.epochs} ({epoch_time:.1f}s) | "
              f"Train Loss: {train_metrics['loss']:.4f} | "
              f"Val Loss: {val_metrics['loss']:.4f} | "
              f"F1: {val_metrics['f1']:.4f} | "
              f"Recall: {val_metrics['recall']:.4f} | "
              f"LR: {current_lr:.2e}")
        # Save best model
        if val_metrics['loss'] < best_val_loss:
            best_val_loss = val_metrics['loss']
            patience_counter = 0
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'model_config': asdict(model_config),
                'training_config': asdict(config),
                'val_metrics': val_metrics,
                'history': history
            }, model_dir / 'vega_2d_best.pt')
            print(f"  ✓ Saved best model (val_loss: {best_val_loss:.4f})")
        else:
            patience_counter += 1
        # Early stopping
        if patience_counter >= config.early_stopping_patience:
            print(f"\nEarly stopping at epoch {epoch+1}")
            break
    total_time = time.time() - start_time
    print(f"\nTraining complete in {total_time/60:.1f} minutes")
    print(f"Best validation loss: {best_val_loss:.4f}")
    # Save final model
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'model_config': asdict(model_config),
        'training_config': asdict(config),
        'history': history
    }, model_dir / 'vega_2d_final.pt')
    # Save history
    with open(model_dir / 'vega_2d_history.json', 'w') as f:
        json.dump(history, f, indent=2)
    # Test set evaluation
    print("\nEvaluating on test set...")
    test_metrics = validate(
        model, test_loader,
        criterion_cls, criterion_reg,
        device, config
    )
    print(f"  Test Loss: {test_metrics['loss']:.4f}")
    print(f"  Test F1: {test_metrics['f1']:.4f}")
    print(f"  Test Recall: {test_metrics['recall']:.4f}")
    print(f"  Test Precision: {test_metrics['precision']:.4f}")
    print(f"  Test Exact Match: {test_metrics['exact_match']:.4f}")
    return model, history
 def main():
    parser = argparse.ArgumentParser(description='Train Vega 2D Model')
    parser.add_argument('--data-dir', type=str, default='O:/master_data_collection/isotopev2',
                        help='Path to training data')
    parser.add_argument('--model-dir', type=str, default='models',
                        help='Path to save models')
    parser.add_argument('--epochs', type=int, default=50,
                        help='Number of epochs')
    parser.add_argument('--batch-size', type=int, default=32,
                        help='Batch size')
    parser.add_argument('--lr', type=float, default=1e-3,
                        help='Learning rate')
    parser.add_argument('--time-intervals', type=int, default=60,
                        help='Target time intervals (pad/truncate)')
    parser.add_argument('--no-amp', action='store_true',
                        help='Disable mixed precision training')
    parser.add_argument('--workers', type=int, default=4,
                        help='Data loading workers')
    args = parser.parse_args()
    config = TrainingConfig2D(
        data_dir=args.data_dir,
        model_dir=args.model_dir,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.lr,
        target_time_intervals=args.time_intervals,
        use_amp=not args.no_amp,
        num_workers=args.workers
    )
    model_config = Vega2DConfig(
        num_time_intervals=args.time_intervals
    )
    train_vega_2d(config, model_config)
 if __name__ == '__main__':
    main()
--- a/train/vega_ml/training/vega/train_v2_optuna.py
+++ b/train/vega_ml/training/vega/train_v2_optuna.py
@ -0,0 +1,847 @@
 """
 Vega Training v2 - Optuna Hyperparameter Optimization
 Uses Optuna to search for optimal hyperparameters to maximize model performance,
 with a focus on improving recall for isotope detection.
 Key optimizations:
 1. Model architecture (CNN channels, FC dims, kernel sizes)
 2. Training hyperparameters (LR, batch size, weight decay, dropout)
 3. Loss function weights (classification vs regression balance)
 4. Classification threshold optimization
 5. Focal loss for handling class imbalance
 """
 import os
 import sys
 import json
 import time
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, Optional, Tuple, List
 from dataclasses import dataclass, asdict, field
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from torch.optim import Adam, AdamW
 from torch.optim.lr_scheduler import ReduceLROnPlateau, CosineAnnealingWarmRestarts
 import numpy as np
 import optuna
 from optuna.trial import Trial
 from optuna.pruners import MedianPruner, HyperbandPruner
 from optuna.samplers import TPESampler
 # Sklearn metrics
 from sklearn.metrics import (
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    hamming_loss
 )
 # Add project root to path
 PROJECT_ROOT = Path(__file__).parent.parent.parent
 sys.path.insert(0, str(PROJECT_ROOT))
 from training.vega.model import VegaModel, VegaConfig
 from training.vega.dataset import create_data_loaders, SpectrumDataset
 from training.vega.isotope_index import IsotopeIndex, get_default_isotope_index
 class FocalLoss(nn.Module):
    """
    Focal Loss for handling class imbalance in multi-label classification.
    Reduces the relative loss for well-classified examples (high probability),
    putting more focus on hard, misclassified examples.
    FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)
    Args:
        alpha: Weighting factor for positive examples (default: 0.25)
        gamma: Focusing parameter - higher = more focus on hard examples (default: 2.0)
    """
    def __init__(self, alpha: float = 0.25, gamma: float = 2.0, reduction: str = 'mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # inputs are logits, targets are binary labels
        BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        # Get probabilities
        probs = torch.sigmoid(inputs)
        p_t = probs * targets + (1 - probs) * (1 - targets)
        # Apply focal weighting
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_weight = alpha_t * (1 - p_t) ** self.gamma
        focal_loss = focal_weight * BCE_loss
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss
 class VegaLossV2(nn.Module):
    """
    Enhanced loss function with Focal Loss option and tunable weights.
    """
    def __init__(
        self,
        classification_weight: float = 1.0,
        regression_weight: float = 0.1,
        use_focal_loss: bool = True,
        focal_alpha: float = 0.25,
        focal_gamma: float = 2.0,
        pos_weight: Optional[torch.Tensor] = None
    ):
        super().__init__()
        self.classification_weight = classification_weight
        self.regression_weight = regression_weight
        self.use_focal_loss = use_focal_loss
        if use_focal_loss:
            self.cls_loss = FocalLoss(alpha=focal_alpha, gamma=focal_gamma)
        else:
            self.cls_loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        self.reg_loss = nn.HuberLoss(delta=0.1)
    def forward(
        self,
        pred_logits: torch.Tensor,
        pred_activities: torch.Tensor,
        true_presence: torch.Tensor,
        true_activities: torch.Tensor
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        # Classification loss
        cls_loss = self.cls_loss(pred_logits, true_presence)
        # Regression loss (only for present isotopes)
        mask = true_presence > 0.5
        if mask.any():
            reg_loss = self.reg_loss(
                pred_activities[mask],
                true_activities[mask]
            )
        else:
            reg_loss = torch.tensor(0.0, device=pred_logits.device)
        # Combined loss
        total_loss = (
            self.classification_weight * cls_loss +
            self.regression_weight * reg_loss
        )
        return total_loss, {
            'total': total_loss.item(),
            'classification': cls_loss.item(),
            'regression': reg_loss.item() if mask.any() else 0.0
        }
@dataclass
 class OptunaConfig:
    """Configuration for Optuna hyperparameter optimization."""
    # Data
    data_dir: str = "O:/master_data_collection/isotopev2"
    model_dir: str = "models/optuna"
    study_name: str = "vega_v2_optimization"
    # Optuna settings
    n_trials: int = 50
    timeout_hours: float = 24.0
    n_startup_trials: int = 10  # Random sampling before TPE
    # Training settings for each trial
    max_epochs: int = 30  # Shorter epochs for faster trials
    patience: int = 5  # Early stopping patience
    # Data splits
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1
    # Fixed settings
    num_workers: int = 8
    prefetch_factor: int = 4
    persistent_workers: bool = True
    use_amp: bool = True
    # Optimization objective
    optimize_metric: str = "val_recall"  # Focus on recall
    # Reproducibility
    seed: int = 42
 def suggest_hyperparameters(trial: Trial) -> Dict:
    """
    Suggest hyperparameters for a trial using Optuna.
    Returns a dictionary with all hyperparameters to try.
    """
    params = {}
    # ========== Model Architecture ==========
    # CNN backbone
    n_conv_layers = trial.suggest_int("n_conv_layers", 2, 4)
    conv_channels = []
    for i in range(n_conv_layers):
        ch = trial.suggest_categorical(f"conv_ch_{i}", [32, 64, 128, 256, 512])
        conv_channels.append(ch)
    params["conv_channels"] = conv_channels
    params["conv_kernel_size"] = trial.suggest_categorical("conv_kernel_size", [3, 5, 7, 9, 11])
    params["pool_size"] = trial.suggest_categorical("pool_size", [2, 3, 4])
    # FC layers
    n_fc_layers = trial.suggest_int("n_fc_layers", 1, 3)
    fc_dims = []
    for i in range(n_fc_layers):
        dim = trial.suggest_categorical(f"fc_dim_{i}", [128, 256, 512, 1024])
        fc_dims.append(dim)
    params["fc_hidden_dims"] = fc_dims
    # Regularization
    params["dropout_rate"] = trial.suggest_float("dropout_rate", 0.1, 0.5)
    params["spatial_dropout_rate"] = trial.suggest_float("spatial_dropout_rate", 0.05, 0.3)
    params["leaky_relu_slope"] = trial.suggest_float("leaky_relu_slope", 0.01, 0.2)
    # ========== Training Hyperparameters ==========
    params["batch_size"] = trial.suggest_categorical("batch_size", [128, 256, 512, 1024])
    params["learning_rate"] = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    params["weight_decay"] = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    # Optimizer
    params["optimizer"] = trial.suggest_categorical("optimizer", ["adam", "adamw"])
    # Learning rate scheduler
    params["scheduler"] = trial.suggest_categorical("scheduler", ["plateau", "cosine"])
    if params["scheduler"] == "plateau":
        params["lr_factor"] = trial.suggest_float("lr_factor", 0.1, 0.5)
        params["lr_patience"] = trial.suggest_int("lr_patience", 3, 10)
    else:
        params["cosine_t_0"] = trial.suggest_int("cosine_t_0", 5, 15)
        params["cosine_t_mult"] = trial.suggest_int("cosine_t_mult", 1, 2)
    # ========== Loss Function ==========
    params["use_focal_loss"] = trial.suggest_categorical("use_focal_loss", [True, False])
    if params["use_focal_loss"]:
        params["focal_alpha"] = trial.suggest_float("focal_alpha", 0.1, 0.5)
        params["focal_gamma"] = trial.suggest_float("focal_gamma", 1.0, 3.0)
    params["classification_weight"] = trial.suggest_float("classification_weight", 0.5, 2.0)
    params["regression_weight"] = trial.suggest_float("regression_weight", 0.01, 0.5, log=True)
    # ========== Classification Threshold ==========
    params["threshold"] = trial.suggest_float("threshold", 0.3, 0.7)
    return params
 def create_model_from_params(params: Dict, num_isotopes: int) -> VegaModel:
    """Create a VegaModel from hyperparameters."""
    config = VegaConfig(
        num_isotopes=num_isotopes,
        conv_channels=params["conv_channels"],
        conv_kernel_size=params["conv_kernel_size"],
        pool_size=params["pool_size"],
        fc_hidden_dims=params["fc_hidden_dims"],
        dropout_rate=params["dropout_rate"],
        spatial_dropout_rate=params["spatial_dropout_rate"],
        leaky_relu_slope=params["leaky_relu_slope"],
        classification_weight=params["classification_weight"],
        regression_weight=params["regression_weight"]
    )
    return VegaModel(config)
 def train_single_trial(
    trial: Trial,
    params: Dict,
    train_loader,
    val_loader,
    device: torch.device,
    optuna_config: OptunaConfig
 ) -> float:
    """
    Train a single trial and return the objective metric.
    """
    # Create model
    num_isotopes = 82  # From isotope database
    model = create_model_from_params(params, num_isotopes)
    model = model.to(device)
    # Create loss function
    loss_fn = VegaLossV2(
        classification_weight=params["classification_weight"],
        regression_weight=params["regression_weight"],
        use_focal_loss=params.get("use_focal_loss", False),
        focal_alpha=params.get("focal_alpha", 0.25),
        focal_gamma=params.get("focal_gamma", 2.0)
    )
    # Create optimizer
    if params["optimizer"] == "adamw":
        optimizer = AdamW(
            model.parameters(),
            lr=params["learning_rate"],
            weight_decay=params["weight_decay"]
        )
    else:
        optimizer = Adam(
            model.parameters(),
            lr=params["learning_rate"],
            weight_decay=params["weight_decay"]
        )
    # Create scheduler
    if params["scheduler"] == "cosine":
        scheduler = CosineAnnealingWarmRestarts(
            optimizer,
            T_0=params.get("cosine_t_0", 10),
            T_mult=params.get("cosine_t_mult", 1)
        )
    else:
        scheduler = ReduceLROnPlateau(
            optimizer,
            mode='min',
            patience=params.get("lr_patience", 5),
            factor=params.get("lr_factor", 0.5)
        )
    # Mixed precision
    scaler = torch.amp.GradScaler('cuda') if optuna_config.use_amp and device.type == 'cuda' else None
    # Training loop
    best_metric = 0.0
    epochs_without_improvement = 0
    threshold = params["threshold"]
    for epoch in range(optuna_config.max_epochs):
        # Training
        model.train()
        for batch in train_loader:
            spectra = batch['spectrum'].to(device)
            presence = batch['presence_labels'].to(device)
            activities = batch['activity_labels'].to(device)
            optimizer.zero_grad()
            if scaler is not None:
                with torch.amp.autocast('cuda'):
                    pred_logits, pred_activities = model(spectra)
                    loss, _ = loss_fn(pred_logits, pred_activities, presence, activities)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                pred_logits, pred_activities = model(spectra)
                loss, _ = loss_fn(pred_logits, pred_activities, presence, activities)
                loss.backward()
                optimizer.step()
        # Validation
        val_metrics = validate_model(model, val_loader, device, threshold, loss_fn)
        # Update scheduler
        if params["scheduler"] == "cosine":
            scheduler.step()
        else:
            scheduler.step(val_metrics['val_loss'])
        # Get objective metric
        current_metric = val_metrics.get(optuna_config.optimize_metric, val_metrics['val_recall'])
        # Report to Optuna for pruning
        trial.report(current_metric, epoch)
        # Handle pruning
        if trial.should_prune():
            raise optuna.TrialPruned()
        # Track best
        if current_metric > best_metric:
            best_metric = current_metric
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Early stopping
        if epochs_without_improvement >= optuna_config.patience:
            break
    return best_metric
@torch.no_grad()
 def validate_model(
    model: VegaModel,
    val_loader,
    device: torch.device,
    threshold: float,
    loss_fn: nn.Module
 ) -> Dict[str, float]:
    """Validate model and return comprehensive metrics."""
    model.eval()
    all_probs = []
    all_preds = []
    all_labels = []
    total_loss = 0.0
    num_batches = 0
    for batch in val_loader:
        spectra = batch['spectrum'].to(device)
        presence = batch['presence_labels'].to(device)
        activities = batch['activity_labels'].to(device)
        pred_logits, pred_activities = model(spectra)
        loss, _ = loss_fn(pred_logits, pred_activities, presence, activities)
        total_loss += loss.item()
        num_batches += 1
        # Get predictions
        probs = torch.sigmoid(pred_logits)
        preds = (probs >= threshold).float()
        all_probs.append(probs.cpu().numpy())
        all_preds.append(preds.cpu().numpy())
        all_labels.append(presence.cpu().numpy())
    # Concatenate
    all_probs = np.vstack(all_probs)
    all_preds = np.vstack(all_preds)
    all_labels = np.vstack(all_labels)
    # Calculate metrics
    metrics = {
        'val_loss': total_loss / num_batches,
        'val_accuracy': (all_preds == all_labels).mean()
    }
    # Per-sample exact match
    exact_matches = (all_preds == all_labels).all(axis=1).mean()
    metrics['val_exact_match'] = exact_matches
    # Sklearn metrics (handle edge cases)
    try:
        # Only for columns with both classes present
        valid_cols = (all_labels.sum(axis=0) > 0) & (all_labels.sum(axis=0) < len(all_labels))
        if valid_cols.any():
            metrics['val_auc_macro'] = roc_auc_score(
                all_labels[:, valid_cols],
                all_probs[:, valid_cols],
                average='macro'
            )
    except Exception:
        metrics['val_auc_macro'] = 0.5
    # Flatten for F1, precision, recall
    all_preds_flat = all_preds.flatten()
    all_labels_flat = all_labels.flatten()
    metrics['val_f1_macro'] = f1_score(all_labels_flat, all_preds_flat, average='macro', zero_division=0)
    metrics['val_precision'] = precision_score(all_labels_flat, all_preds_flat, average='macro', zero_division=0)
    metrics['val_recall'] = recall_score(all_labels_flat, all_preds_flat, average='macro', zero_division=0)
    metrics['val_hamming'] = hamming_loss(all_labels, all_preds)
    return metrics
 def objective(trial: Trial, optuna_config: OptunaConfig) -> float:
    """
    Optuna objective function.
    Returns the metric to maximize (recall by default).
    """
    # Suggest hyperparameters
    params = suggest_hyperparameters(trial)
    # Log parameters
    print(f"\n{'='*60}")
    print(f"Trial {trial.number}")
    print(f"{'='*60}")
    for k, v in params.items():
        print(f"  {k}: {v}")
    # Setup device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Get isotope index
    isotope_index = get_default_isotope_index()
    # Create data loaders with trial's batch size
    train_loader, val_loader, _ = create_data_loaders(
        data_dir=Path(optuna_config.data_dir),
        batch_size=params["batch_size"],
        train_split=optuna_config.train_split,
        val_split=optuna_config.val_split,
        test_split=optuna_config.test_split,
        num_workers=optuna_config.num_workers,
        prefetch_factor=optuna_config.prefetch_factor,
        persistent_workers=optuna_config.persistent_workers,
        isotope_index=isotope_index,
        seed=optuna_config.seed
    )
    try:
        metric = train_single_trial(
            trial, params, train_loader, val_loader, device, optuna_config
        )
        print(f"Trial {trial.number} completed with {optuna_config.optimize_metric}: {metric:.4f}")
        return metric
    except Exception as e:
        print(f"Trial {trial.number} failed: {e}")
        raise
 def run_optimization(config: OptunaConfig) -> optuna.Study:
    """
    Run the full Optuna optimization study.
    """
    # Create model directory
    model_dir = Path(config.model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    # Create study with TPE sampler and Hyperband pruner
    sampler = TPESampler(
        n_startup_trials=config.n_startup_trials,
        seed=config.seed
    )
    pruner = HyperbandPruner(
        min_resource=3,
        max_resource=config.max_epochs,
        reduction_factor=3
    )
    # Create or load study
    storage = f"sqlite:///{model_dir / config.study_name}.db"
    study = optuna.create_study(
        study_name=config.study_name,
        storage=storage,
        load_if_exists=True,
        direction="maximize",  # Maximize recall
        sampler=sampler,
        pruner=pruner
    )
    print("\n" + "=" * 60)
    print("VEGA V2 - OPTUNA HYPERPARAMETER OPTIMIZATION")
    print("=" * 60)
    print(f"Study name: {config.study_name}")
    print(f"Optimization metric: {config.optimize_metric}")
    print(f"Number of trials: {config.n_trials}")
    print(f"Timeout: {config.timeout_hours} hours")
    print(f"Data directory: {config.data_dir}")
    print("=" * 60 + "\n")
    # Run optimization
    study.optimize(
        lambda trial: objective(trial, config),
        n_trials=config.n_trials,
        timeout=config.timeout_hours * 3600,
        show_progress_bar=True,
        gc_after_trial=True
    )
    # Print results
    print("\n" + "=" * 60)
    print("OPTIMIZATION COMPLETE")
    print("=" * 60)
    print(f"Best trial: {study.best_trial.number}")
    print(f"Best {config.optimize_metric}: {study.best_value:.4f}")
    print("\nBest hyperparameters:")
    for k, v in study.best_params.items():
        print(f"  {k}: {v}")
    # Save best parameters
    best_params_path = model_dir / "best_params.json"
    with open(best_params_path, 'w') as f:
        json.dump({
            'best_value': study.best_value,
            'best_params': study.best_params,
            'study_name': config.study_name,
            'optimize_metric': config.optimize_metric
        }, f, indent=2)
    print(f"\nBest parameters saved to: {best_params_path}")
    return study
 def train_best_model(
    study: optuna.Study,
    config: OptunaConfig,
    full_epochs: int = 100
 ) -> Tuple[VegaModel, Dict]:
    """
    Train the best model from the study with full epochs.
    """
    print("\n" + "=" * 60)
    print("TRAINING BEST MODEL")
    print("=" * 60)
    best_params = study.best_params
    # Reconstruct full params dict from best_params
    params = {}
    # CNN layers
    n_conv_layers = best_params.get("n_conv_layers", 3)
    conv_channels = [best_params.get(f"conv_ch_{i}", 128) for i in range(n_conv_layers)]
    params["conv_channels"] = conv_channels
    params["conv_kernel_size"] = best_params.get("conv_kernel_size", 7)
    params["pool_size"] = best_params.get("pool_size", 2)
    # FC layers
    n_fc_layers = best_params.get("n_fc_layers", 2)
    fc_dims = [best_params.get(f"fc_dim_{i}", 256) for i in range(n_fc_layers)]
    params["fc_hidden_dims"] = fc_dims
    # Other params
    for key in ["dropout_rate", "spatial_dropout_rate", "leaky_relu_slope",
                "batch_size", "learning_rate", "weight_decay", "optimizer",
                "scheduler", "lr_factor", "lr_patience", "cosine_t_0", "cosine_t_mult",
                "use_focal_loss", "focal_alpha", "focal_gamma",
                "classification_weight", "regression_weight", "threshold"]:
        if key in best_params:
            params[key] = best_params[key]
    # Set defaults for missing params
    params.setdefault("dropout_rate", 0.3)
    params.setdefault("spatial_dropout_rate", 0.1)
    params.setdefault("leaky_relu_slope", 0.1)
    params.setdefault("batch_size", 512)
    params.setdefault("learning_rate", 1e-3)
    params.setdefault("weight_decay", 1e-4)
    params.setdefault("optimizer", "adamw")
    params.setdefault("scheduler", "plateau")
    params.setdefault("classification_weight", 1.0)
    params.setdefault("regression_weight", 0.1)
    params.setdefault("threshold", 0.5)
    params.setdefault("use_focal_loss", True)
    params.setdefault("focal_alpha", 0.25)
    params.setdefault("focal_gamma", 2.0)
    print("Training with parameters:")
    for k, v in params.items():
        print(f"  {k}: {v}")
    # Setup
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    isotope_index = get_default_isotope_index()
    model_dir = Path(config.model_dir)
    # Create data loaders
    train_loader, val_loader, test_loader = create_data_loaders(
        data_dir=Path(config.data_dir),
        batch_size=params["batch_size"],
        train_split=config.train_split,
        val_split=config.val_split,
        test_split=config.test_split,
        num_workers=config.num_workers,
        prefetch_factor=config.prefetch_factor,
        persistent_workers=config.persistent_workers,
        isotope_index=isotope_index,
        seed=config.seed
    )
    # Create model
    model = create_model_from_params(params, isotope_index.num_isotopes)
    model = model.to(device)
    # Create loss
    loss_fn = VegaLossV2(
        classification_weight=params["classification_weight"],
        regression_weight=params["regression_weight"],
        use_focal_loss=params.get("use_focal_loss", False),
        focal_alpha=params.get("focal_alpha", 0.25),
        focal_gamma=params.get("focal_gamma", 2.0)
    )
    # Optimizer
    if params["optimizer"] == "adamw":
        optimizer = AdamW(model.parameters(), lr=params["learning_rate"], weight_decay=params["weight_decay"])
    else:
        optimizer = Adam(model.parameters(), lr=params["learning_rate"], weight_decay=params["weight_decay"])
    # Scheduler
    if params.get("scheduler") == "cosine":
        scheduler = CosineAnnealingWarmRestarts(
            optimizer, T_0=params.get("cosine_t_0", 10), T_mult=params.get("cosine_t_mult", 1)
        )
    else:
        scheduler = ReduceLROnPlateau(
            optimizer, mode='min', patience=params.get("lr_patience", 5), factor=params.get("lr_factor", 0.5)
        )
    # Mixed precision
    scaler = torch.amp.GradScaler('cuda') if config.use_amp and device.type == 'cuda' else None
    # Training
    best_recall = 0.0
    threshold = params["threshold"]
    history = []
    for epoch in range(full_epochs):
        # Train
        model.train()
        train_loss = 0.0
        num_batches = 0
        for batch in train_loader:
            spectra = batch['spectrum'].to(device)
            presence = batch['presence_labels'].to(device)
            activities = batch['activity_labels'].to(device)
            optimizer.zero_grad()
            if scaler is not None:
                with torch.amp.autocast('cuda'):
                    pred_logits, pred_activities = model(spectra)
                    loss, _ = loss_fn(pred_logits, pred_activities, presence, activities)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                pred_logits, pred_activities = model(spectra)
                loss, _ = loss_fn(pred_logits, pred_activities, presence, activities)
                loss.backward()
                optimizer.step()
            train_loss += loss.item()
            num_batches += 1
        train_loss /= num_batches
        # Validate
        val_metrics = validate_model(model, val_loader, device, threshold, loss_fn)
        # Scheduler step
        if params.get("scheduler") == "cosine":
            scheduler.step()
        else:
            scheduler.step(val_metrics['val_loss'])
        # Log
        lr = optimizer.param_groups[0]['lr']
        print(f"Epoch {epoch+1:3d}/{full_epochs} | Train Loss: {train_loss:.4f} | "
              f"Val Loss: {val_metrics['val_loss']:.4f} | Recall: {val_metrics['val_recall']:.4f} | "
              f"F1: {val_metrics['val_f1_macro']:.4f} | Exact: {val_metrics['val_exact_match']:.4f} | LR: {lr:.2e}")
        # Save history
        history.append({
            'epoch': epoch,
            'train_loss': train_loss,
            **val_metrics,
            'lr': lr
        })
        # Save best model
        if val_metrics['val_recall'] > best_recall:
            best_recall = val_metrics['val_recall']
            checkpoint = {
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'best_recall': best_recall,
                'params': params,
                'val_metrics': val_metrics
            }
            torch.save(checkpoint, model_dir / "vega_v2_best.pt")
            print(f"  └── New best! Saved model with recall: {best_recall:.4f}")
    # Save final model
    torch.save({
        'model_state_dict': model.state_dict(),
        'params': params,
        'history': history
    }, model_dir / "vega_v2_final.pt")
    # Save history
    with open(model_dir / "vega_v2_history.json", 'w') as f:
        json.dump(history, f, indent=2)
    # Test evaluation
    print("\n" + "=" * 60)
    print("TEST SET EVALUATION")
    print("=" * 60)
    test_metrics = validate_model(model, test_loader, device, threshold, loss_fn)
    for k, v in test_metrics.items():
        print(f"  {k}: {v:.4f}")
    return model, {
        'best_recall': best_recall,
        'test_metrics': test_metrics,
        'history': history,
        'params': params
    }
 def main():
    import argparse
    parser = argparse.ArgumentParser(description="Vega V2 - Optuna Hyperparameter Optimization")
    parser.add_argument("--data-dir", type=str, default="O:/master_data_collection/isotopev2",
                        help="Data directory")
    parser.add_argument("--model-dir", type=str, default="models/optuna",
                        help="Model output directory")
    parser.add_argument("--study-name", type=str, default="vega_v2_optimization",
                        help="Optuna study name")
    parser.add_argument("--n-trials", type=int, default=50,
                        help="Number of Optuna trials")
    parser.add_argument("--timeout", type=float, default=24.0,
                        help="Timeout in hours")
    parser.add_argument("--max-epochs", type=int, default=30,
                        help="Max epochs per trial")
    parser.add_argument("--optimize-metric", type=str, default="val_recall",
                        choices=["val_recall", "val_f1_macro", "val_auc_macro", "val_exact_match"],
                        help="Metric to optimize")
    parser.add_argument("--train-best", action="store_true",
                        help="Train best model with full epochs after optimization")
    parser.add_argument("--full-epochs", type=int, default=100,
                        help="Epochs for training best model")
    parser.add_argument("--workers", type=int, default=8,
                        help="Number of data loading workers")
    args = parser.parse_args()
    # Create config
    config = OptunaConfig(
        data_dir=args.data_dir,
        model_dir=args.model_dir,
        study_name=args.study_name,
        n_trials=args.n_trials,
        timeout_hours=args.timeout,
        max_epochs=args.max_epochs,
        optimize_metric=args.optimize_metric,
        num_workers=args.workers
    )
    # Run optimization
    study = run_optimization(config)
    # Optionally train best model
    if args.train_best:
        train_best_model(study, config, full_epochs=args.full_epochs)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
--- a/train/vega_ml/training_data_report.html
+++ b/train/vega_ml/training_data_report.html
		`@ -0,0 +1 @@`
							`# Inference module for running predictions with trained models`
		`@ -0,0 +1 @@`
							`# Training module for isotope identification models`