DESI DR1 Spectral Anomaly Catalog
Anomaly Explorer
The first full-DR1-scale autoencoder anomaly search — 22.5M DESI DR1 spectra scored down to 2,145 high-significance anomalies (~90× prior EDR work). Photo-z from unsupervised latent vectors (σNMAD = 0.028); UMAP reveals two populations. Key counts below; browse the top 1,000 by anomaly score in the table.
Enhanced catalog COMPLETE: 22,504,897 unique spectra scored (deduplicated), 173 columns, 46 Parquet batches, 16 GB.
Tiered anomaly system: 2,145 silver (score > 3, SNR > 0.5), 120 gold (> 5σ).
Key discoveries: 12 z>6 reionization-era QSOs with Gunn-Peterson troughs; photo-z from unsupervised latent vectors (σNMAD = 0.028); spontaneous “redshift neuron” (lat_067); 2,575 red-anomaly cluster.
Multi-survey total: after per-survey native retrains, ecliptic/galactic masking, and 8-way positional dedup at 5″, the deduplicated catalog contains 378,280 unique anomalies from 37,292,042 sources across DESI, SDSS, LAMOST, eROSITA, Planck, ACT, Gaia, and NEOWISE.
Published at HuggingFace. Paper 3 draft compiled at pipelines/p3_anomaly_engine/paper3_draft.tex (4.6 MB PDF, 26 pp, 0 undef refs); see Papers page.
Click any row to view the Legacy Survey image, read AI analysis notes, and add your own review comments.
- Photo-z from latent vectors: σNMAD = 0.028 (R² = 0.79) with zero redshift supervision; the autoencoder spontaneously learned spectral features correlated with redshift.
- Redshift neuron (lat_067): a single latent dimension that emergently encodes redshift, the strongest individual predictor of spectroscopic z.
- UMAP clustering: two distinct populations, a 247K B-band noise cluster and a 2,575 red-anomaly cluster with genuinely unusual spectra.
- Anomaly rate uniform: ~1% across the footprint (Spearman r = 0.03 with depth), confirming anomalies are astrophysical, not depth artifacts.
- 12 z>6 QSOs with Gunn-Peterson troughs identified from the gold anomaly catalog.
Scientific Insights
Eight genuinely novel findings from the enhanced 22.5M DESI DR1 catalog — not just "we ran an autoencoder," but concrete scientific contributions including a direct improvement to the flagship fNL bounce prediction.
The "Redshift Neuron"
Latent dimension 067 has 6× the importance of any other dimension for predicting redshift. The autoencoder spontaneously learned to encode spectral shift without ever seeing redshift labels — emergent representation of a physical property.
Unsupervised Photo-z (σNMAD = 0.028)
A simple MLP on the 128-dim latent vectors predicts spectroscopic redshift with R² = 0.79 and 7.7% outlier fraction — competitive with purpose-built photo-z codes using broadband photometry.
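The two quality numbers quoted above can be computed from any pair of predicted and spectroscopic redshift lists. The sketch below assumes the common convention σNMAD = 1.4826 × median(|Δz − median(Δz)|) with Δz = (z_pred − z_spec)/(1 + z_spec), and an outlier cut of |Δz| > 0.15. The page does not state which definitions were used, so the formula, the cut, the function name `photoz_metrics`, and the toy values are all assumptions for illustration.

```python
import statistics

def photoz_metrics(z_spec, z_pred, outlier_cut=0.15):
    """Return (sigma_nmad, outlier_fraction) for predicted vs. spectroscopic z.

    The NMAD convention and the 0.15 outlier cut are assumed, not taken
    from the catalog documentation.
    """
    # Normalized residuals: dz = (z_pred - z_spec) / (1 + z_spec)
    dz = [(zp - zs) / (1.0 + zs) for zs, zp in zip(z_spec, z_pred)]
    med = statistics.median(dz)
    # 1.4826 rescales the median absolute deviation to a Gaussian sigma
    sigma_nmad = 1.4826 * statistics.median([abs(d - med) for d in dz])
    # Fraction of objects whose normalized residual exceeds the cut
    outlier_fraction = sum(abs(d) > outlier_cut for d in dz) / len(dz)
    return sigma_nmad, outlier_fraction

# Toy values for illustration only (not catalog data)
z_spec = [0.10, 0.50, 1.00, 1.50, 2.00]
z_pred = [0.12, 0.47, 1.02, 1.45, 3.00]
sigma_nmad, outlier_fraction = photoz_metrics(z_spec, z_pred)
print(f"sigma_NMAD = {sigma_nmad:.3f}, outliers = {outlier_fraction:.0%}")
```

Because σNMAD is built from medians, a single catastrophic prediction (the last toy object) barely moves it, which is why the outlier fraction is reported alongside it.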
"Correctly Classified but Spectrally Anomalous"
2,575 objects where DESI’s pipeline is confident (Δχ² = 963) yet the autoencoder flags unusual features — genuine spectral structure beyond standard templates.
16 NEOWISE IR-Variable Anomalies
16 anomalies are BOTH spectrally anomalous AND infrared-variable (NEOWISE 10yr). A z=5.65 QSO varies by 5.5 magnitudes in W2 — extreme AGN activity in the reionization era. These variable sources are prime candidates for multi-epoch follow-up.
1,127 Genuinely Uncataloged Objects
1,127 of 2,145 SNR-filtered anomalies (52.5%) are in NEITHER SIMBAD nor NED. Classified into 10 taxonomy families: 76 uncataloged AGN, 27 post-starburst galaxies, 363 blue compact galaxies. Known to DESI, unknown to the astronomical community. Concrete targets for follow-up.
Gold Anomalies Cluster in Latent Space
The 83 gold anomalies are 2.2× more clustered than random objects in the 128-dim latent space — confirming a coherent spectral population, not random noise.
Autoencoder as Survey Quality Probe
Anomaly score is strongly anti-correlated with SNR (Spearman ρ = −0.89): low-SNR spectra reconstruct poorly and score high. The autoencoder therefore doubles, unintentionally, as an independent data quality metric, a new tool for spectroscopic survey validation.
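For readers who want to reproduce a rank correlation of this kind on their own columns, here is a minimal stdlib-only sketch of Spearman's ρ (Pearson correlation of average ranks). The function names and the toy SNR/score values are illustrative assumptions; the catalog's actual computation is not described on this page.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy data: score falls monotonically as SNR rises, so rho is -1
snr = [0.5, 1.0, 2.0, 4.0, 8.0]
score = [9.0, 7.0, 5.0, 3.0, 1.0]
rho = spearman_rho(snr, score)
```

The same function applies to the depth test mentioned earlier on the page: a value near zero means anomaly rate does not track survey depth.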
+7.93% fNL Improvement via 5-Tracer Multi-Tracer
5-tracer anomaly-optimized Fisher forecast yields σ(fNL) = 11.71 vs 12.72 standard multi-tracer (+7.93% improvement). DESI alone contributes 6.1% improvement. Latent-space selection of anomalous objects as high-bias tracers directly strengthens the flagship bounce prediction: spectral anomalies are not just curiosities but observationally useful for testing bounce cosmology via the galaxy bispectrum.
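The quoted gain is a relative reduction in the forecast uncertainty. A short check, assuming the improvement is defined as (σ_standard − σ_optimized)/σ_standard (the page does not spell out the definition, and `relative_improvement` is a name chosen here):

```python
def relative_improvement(sigma_baseline, sigma_new):
    """Fractional reduction in forecast uncertainty relative to the baseline."""
    return (sigma_baseline - sigma_new) / sigma_baseline

# Values quoted above: standard multi-tracer vs. 5-tracer anomaly-optimized
improvement = relative_improvement(12.72, 11.71)
print(f"{improvement:.1%}")
```

With the two rounded σ(fNL) values this gives roughly 7.9%, consistent with the quoted figure up to rounding of the inputs.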
Sky Distribution
All 1,000 top-scored anomalies plotted by RA/Dec. Color indicates anomaly score (yellow = highest, blue = threshold).
Top Anomalies
Showing top 1,000 by anomaly score. Click column headers to sort. Each row links to the Legacy Survey image viewer. Full catalog (195,829 objects) available for download.
| # | Score | RA | Dec | Band | rB | rR | rZ | Image | Full |
|---|---|---|---|---|---|---|---|---|---|
How Anomaly Detection Works
What is an anomaly? A spectral autoencoder is a neural network trained to compress and reconstruct normal DESI spectra (stars, galaxies, quasars). When it encounters a spectrum that doesn’t match any learned pattern, the reconstruction is poor — producing a high residual. Objects with total residual (anomaly score) above 5.0 are flagged. These are spectra the model literally “doesn’t know what to do with.”
What the score means: The anomaly score is the sum of reconstruction errors across DESI’s three spectrograph arms. Higher = more unusual. The score tiers are silver (score > 3, SNR > 0.5) and gold (> 5σ).
Column Definitions & Glossary
Table Columns
- Score: Total reconstruction error across all three spectrograph arms (B + R + Z). Higher = more anomalous.
- RA: Right Ascension (degrees, 0–360). East-west position on the sky in the ICRS coordinate system.
- Dec: Declination (degrees, −90 to +90). North-south position on the sky.
- Band: The spectrograph arm with the largest residual, one of B (blue, 3600–5800 Å), R (red, 5760–7620 Å), or Z (near-IR, 7520–9824 Å).
- rB: Reconstruction error in the B (blue) arm. High rB = anomalous blue-end features (e.g. unusual emission lines, UV excess).
- rR: Reconstruction error in the R (red) arm. High rR = anomalous mid-optical features (e.g. unusual continuum, absorption).
- rZ: Reconstruction error in the Z (near-infrared) arm. High rZ = anomalous near-IR features (e.g. high-redshift emission shifted into IR).
- TID: DESI TARGETID, the unique identifier for this object in the DESI DR1 catalog.
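The Score and Band columns defined above follow directly from the three per-arm residuals. A minimal sketch, assuming the score is the plain sum rB + rR + rZ as described, and using the silver cut (score > 3, SNR > 0.5) quoted in the catalog summary. The function names and the example residuals are hypothetical, and the gold tier is left out because its exact definition is not given on this page.

```python
def summarize_residuals(r_b, r_r, r_z):
    """Return (score, band) from per-arm reconstruction errors."""
    score = r_b + r_r + r_z  # total over the B, R and Z arms
    per_arm = {"B": r_b, "R": r_r, "Z": r_z}
    band = max(per_arm, key=per_arm.get)  # arm with the largest residual
    return score, band

def is_silver(score, snr, min_score=3.0, min_snr=0.5):
    """Silver-tier cut as quoted in the summary: score > 3 and SNR > 0.5."""
    return score > min_score and snr > min_snr

# Illustrative residuals only (not a catalog row)
score, band = summarize_residuals(0.5, 1.0, 2.5)
print(score, band, is_silver(score, snr=1.2))
```

A row like this one, dominated by rZ, is the pattern the rZ definition associates with high-redshift emission shifted into the near-IR arm.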
Astronomy Terms
- AGN: Active Galactic Nucleus. A supermassive black hole at a galaxy’s center actively accreting matter, producing bright emission across the spectrum.
- QSO: Quasi-Stellar Object (Quasar). An extremely luminous AGN, often at high redshift (z > 1). Key tracer for large-scale structure measurements.
- Near-IR: Near-Infrared. Wavelengths just beyond visible red light (~7000–10000 Å in the Z-band). High-redshift features shift into this range.
- High-z: High redshift. Objects at great cosmological distances (z > 1.5), seen as they were billions of years ago.
- BAL: Broad Absorption Line. A QSO showing wide absorption troughs from high-velocity outflows. Rare (~10% of QSOs) and often missed by pipelines.
- PSF: Point Spread Function. The image of a point source (star or distant QSO). “PSF morphology” means it looks like a point, not an extended galaxy.
- REX: Round Exponential. A Legacy Survey morphology classification for a small, round, slightly extended source.
- SER: Sérsic profile. A Legacy Survey classification for galaxies fit with a Sérsic surface brightness profile.
- SIMBAD: Set of Identifications, Measurements and Bibliography for Astronomical Data. The most comprehensive database of known astronomical objects (CDS, Strasbourg).
- NED: NASA/IPAC Extragalactic Database. A database focused on extragalactic objects (galaxies, QSOs, clusters).
- fNL: The amplitude of primordial non-Gaussianity. A key parameter for distinguishing between the Big Bounce and inflation.
Cross-Reference Status
How do we know these are previously unidentified? We cross-match anomaly positions against multiple astronomical databases. An object NOT found in any of these catalogs is a strong candidate for being genuinely new.
| Database | What it contains | Objects | Checked? | Matches |
|---|---|---|---|---|
| SIMBAD | Most comprehensive catalog of identified astronomical objects | ~17M | Top 10,000 | 21/10,000 (0.2%) — 99.8% absent |
| NED | Extragalactic objects (galaxies, QSOs, clusters) | ~400M | Top 10,000 | 1,270/10,000 (12.7%) — 87.3% absent |
| Gaia DR3 | 1.8 billion stars with astrometry & photometry | ~1.8B | Top 1,000 | 6/1,000 (0.6%) — only 1 confirmed Galactic star |
| SDSS DR18 | Sloan Digital Sky Survey — spectra + photometry | ~2.3M spectra | 77,905 anomalies | Native BigAE rescore complete (3.4% anomaly rate, domain-shift scores) |
| AllWISE | 750M infrared sources — photometric detection catalog | ~750M | Top 1,000 | 15/1,000 (1.5%) — 98.5% have no IR counterpart |
| Milliquas v8 | Comprehensive QSO catalog — all known quasars | ~1M | Top 1,000 | 0/1,000 (0%) — ZERO are known QSOs |
| Liang+2023 EDR anomalies | Prior DESI EDR autoencoder anomaly catalog | ~250K | Pending | Catalog not published as downloadable file |
| Nicolaou+2026 EDR anomalies | Prior DESI EDR VAE anomaly catalog | ~208K | Pending | Catalog not published as downloadable file |
Current status: 6 major databases cross-matched, representing over 3 billion cataloged objects. SIMBAD: 0.2% matched. NED: 12.7%. AllWISE: 1.5%. Milliquas: 0%. Gaia: 0.6% (1 star). SDSS DR18: 77,905 anomalies (native BigAE rescore complete). 1,127 of 2,145 (52.5%) are in neither SIMBAD nor NED — genuinely uncataloged. Classified into 10 taxonomy families including 76 AGN, 27 post-starburst, 363 blue compact galaxies.
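The cross-match logic behind these percentages can be sketched as a positional match: an anomaly counts as "absent" from a reference catalog when no reference source lies within a chosen angular radius. The example below is a brute-force stdlib illustration, not the pipeline's code. The 5″ radius is borrowed from the dedup radius quoted in the catalog summary and is an assumption for cross-matching, and the function names and coordinates are made up.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine form), inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def unmatched(anomalies, reference, radius_arcsec=5.0):
    """Return anomalies with no reference source within radius_arcsec."""
    return [
        (ra, dec)
        for ra, dec in anomalies
        if all(
            angular_sep_arcsec(ra, dec, ref_ra, ref_dec) > radius_arcsec
            for ref_ra, ref_dec in reference
        )
    ]

# Hypothetical (RA, Dec) positions in degrees
anomalies = [(150.0, 2.0), (210.5, -1.25)]
reference = [(150.0005, 2.0003), (10.0, 40.0)]
new_candidates = unmatched(anomalies, reference)
```

The first anomaly sits about 2″ from a reference source and is treated as known; the second has no counterpart and survives as a candidate. At catalog scale a spatial index (for example a k-d tree on unit vectors) would replace the double loop.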
Prior Work & Attribution
Prior work: Autoencoder anomaly detection on DESI was pioneered by Liang et al. (2023) on ~250K EDR spectra and Nicolaou et al. (2026, MNRAS, 46 co-authors) on ~208K EDR spectra. This catalog extends their approach by ~90x in scale to the full DR1 release. Both teams must be cited in any publication using this catalog.