2025-2027

Essay · Research Strategy

The Window

Why This Moment in Independent Research Will Never Come Again

Houston Golden · March 2026

1. The Question That Kept Me Up

I ran an autoencoder on 17.65 million spectra from the Dark Energy Spectroscopic Instrument (DESI): every publicly available spectrum from its first data release. It flagged 195,829 objects that don't match any pattern the model learned. 99.8% of them aren't in SIMBAD, the standard reference database for astronomical objects beyond the Solar System. Zero percent are known quasars.

And the question that keeps gnawing at me is: why hasn't anyone done this before?

It wasn't that hard. The autoencoder has 660,000 parameters, tiny by modern standards. The inference run took about 24 hours on a rented GPU. The total compute cost was maybe $200. The data is freely available from a public archive. The methodology (train on normal spectra, flag what the model can't reconstruct) is textbook anomaly detection, nothing exotic.
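The train-on-normal, flag-what-fails-to-reconstruct recipe can be sketched in a few lines. This is a toy illustration, not the pipeline behind the catalog: a one-component linear template fit stands in for the 660,000-parameter autoencoder, the spectra are synthetic, and the function names and the mean-plus-five-sigma threshold are assumptions made for the example.

```python
import math
import random
import statistics

random.seed(0)
N_PIX = 100

def continuum(i):
    """Smooth synthetic continuum shared by all 'normal' spectra."""
    return 1.0 + 0.5 * math.sin(i / 10.0)

def make_normal():
    """A normal spectrum: scaled continuum plus small Gaussian noise."""
    amp = random.uniform(0.5, 2.0)
    return [amp * continuum(i) + random.gauss(0.0, 0.02) for i in range(N_PIX)]

def fit_template(train):
    """'Training' for the stand-in model: average the normal spectra per pixel."""
    return [sum(col) / len(train) for col in zip(*train)]

def reconstruction_error(spectrum, template):
    """Mean squared residual after the best least-squares scaling of the template."""
    scale = (sum(s * t for s, t in zip(spectrum, template))
             / sum(t * t for t in template))
    return sum((s - scale * t) ** 2
               for s, t in zip(spectrum, template)) / len(spectrum)

train = [make_normal() for _ in range(200)]
template = fit_template(train)

# Threshold taken from the training-set error distribution (an assumed choice).
train_errors = [reconstruction_error(s, template) for s in train]
threshold = statistics.mean(train_errors) + 5 * statistics.stdev(train_errors)

normal = make_normal()
anomaly = make_normal()
for i in range(50, 55):
    anomaly[i] += 3.0  # inject a strong emission line the model never saw

normal_flagged = reconstruction_error(normal, template) > threshold
anomaly_flagged = reconstruction_error(anomaly, template) > threshold
print(normal_flagged, anomaly_flagged)
```

The same structure carries over to a real autoencoder: only `fit_template` and the reconstruction step change, while the scoring and thresholding stay as they are.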

So why am I apparently the first person to score every spectrum in DESI DR1 with an autoencoder and publish the full catalog?

The answer, I've come to believe, is not that it's hard. It's that five things had to be true simultaneously — and they only became true in the last 12 months.

2. The Five Convergences

1. The Data Dropped

DESI DR1: June 2025. 17.65 million spectra from the most ambitious spectroscopic survey ever built. Before this date, the dataset was not publicly available. Academic groups are still publishing their first targeted analyses: BAO measurements, dark energy constraints. As far as I can tell, nobody has done a full-survey anomaly sweep because the data release is only ten months old.

2. GPUs Became Rentable

H200s for $3/hour. On RunPod, Lambda, and Vast.ai, on-demand GPU rental became affordable and frictionless in 2024-2025. Before this, running inference on 18 million spectra required institutional HPC allocations: 6-month waits, committee approvals, shared queues. Now you swipe a credit card and have about 141 GB of VRAM on a single H200 in 60 seconds.

3. AI Agents Can Code

Claude, GPT-4, Gemini. Writing a 1,000-line inference pipeline that handles FITS files, HEALPix indexing, GPU batching, checkpoint/resume, cross-matching against 6 databases, and Parquet output would take a professional astronomer weeks to do solo. With an AI coding agent, it takes hours. The agent doesn't know astronomy, but it knows PyTorch, astropy, and how to debug SSH connections at 2 AM.
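Of the pipeline features listed above, checkpoint/resume is the one that makes a day-long run on a rented GPU tolerable. A minimal sketch of the idea follows, assuming a hypothetical `score_fn` that stands in for model inference and a JSON progress file; both are illustrative choices, not the actual pipeline's design.

```python
import json
import os
import tempfile

def score_in_batches(items, score_fn, checkpoint_path, batch_size=4):
    """Score items batch by batch, saving progress so a crashed run can resume."""
    state = {"next_index": 0, "scores": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # pick up where the previous run stopped

    while state["next_index"] < len(items):
        start = state["next_index"]
        batch = items[start:start + batch_size]
        batch_scores = [score_fn(x) for x in batch]  # may raise mid-batch
        state["scores"].extend(batch_scores)
        state["next_index"] = start + len(batch)

        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["scores"]

# Demo: the first run dies on item 6, the second run resumes from item 4.
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "progress.json")

    def crashes_on_six(x):
        if x == 6:
            raise RuntimeError("simulated node failure")
        return x * x

    try:
        score_in_batches(list(range(10)), crashes_on_six, ckpt)
    except RuntimeError:
        pass  # the first batch (items 0-3) is already on disk

    rescored = []

    def square(x):
        rescored.append(x)
        return x * x

    scores = score_in_batches(list(range(10)), square, ckpt)
```

Progress is saved per batch, so a failure costs at most one batch of recomputation. In the demo, items 4 and 5 are scored again because their batch never completed.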

4. The Cultural Gap Is Wide

ML people don't know FITS files. Astronomers don't know PyTorch. The intersection of "comfortable with GPU inference at scale" and "comfortable with astronomical data formats and cross-referencing" is shockingly small. Most ML researchers have never heard of SIMBAD. Most astronomers have never rented a cloud GPU. The few who bridge both worlds are at major labs with institutional priorities.

5. Incentives Don't Reward It

Catalogs are "service work." In academic astronomy, you get tenure for theory papers, targeted observations, and named surveys. Building a 195K-object anomaly catalog from someone else's survey is high-effort, low-prestige infrastructure work. Nobody's career depends on doing it. So nobody does.

Remove any one of these five factors and the window closes. No public DR1 → can't do it. No cheap GPUs → too expensive. No AI agents → too slow for a solo researcher. No cultural gap → someone at a major lab would have done it already. Strong incentives → an academic group would have prioritized it.

All five are true right now. They won't all be true forever.

3. Why the Window Closes

Here is what will happen in the next 2-3 years:

Major labs will catch up. Groups at Berkeley, Cambridge, and the DESI collaboration itself will run their own anomaly detection pipelines. They have more astronomers, more compute, and more credibility. If they publish first, our work reads as a replication rather than a "first discovery."
The cultural gap will narrow. Astronomy departments are hiring ML-literate postdocs. ML labs are starting astro projects. The gap that currently protects independent researchers will shrink rapidly.
AI agents will become ubiquitous. Right now, using Claude Code to write astronomy pipelines is a differentiator. In 2028, every grad student will do it. The competitive advantage of speed will evaporate.
DESI DR2 will drop. When the next data release comes (~2026-2027), there will be a land rush. Dozens of groups will scramble to run anomaly detection on day one. Being first on DR2 requires being ready now.

The window is roughly 2025-2027. Maybe 18 months. Maybe less. Every week that passes without publishing is a week closer to someone else publishing first.

4. What We're Actually Building

This is not a one-paper project. This is a research platform. The vision:

Dataset (public archive)
  → Preprocessing pipeline (survey-specific)
    → AI model (autoencoder / classifier / time-series)
      → Anomaly catalog (scored, classified)
        → Cross-reference (SIMBAD, NED, Gaia, etc.)
          → Human review (anomaly explorer)
            → Paper + community data release

Each survey gets its own instance. Results cross-reference across surveys. Objects flagged by 2+ surveys are highest priority. The platform is reusable, scalable, and fast.
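The "flagged by 2+ surveys" rule amounts to a positional cross-match between per-survey anomaly lists. A brute-force sketch follows, assuming each catalog is a list of `(object_id, ra_deg, dec_deg)` tuples and a 1-arcsecond match radius. The coordinates, the radius, and the function names are invented for illustration. At real catalog sizes a spatial index, or astropy's `SkyCoord.match_to_catalog_sky`, would replace the nested loop.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0

def multi_survey_hits(catalogs, radius_arcsec=1.0, min_surveys=2):
    """Return objects from the first catalog that also appear in other catalogs.

    catalogs maps survey name -> list of (object_id, ra_deg, dec_deg).
    Only the first survey is used as the reference list in this sketch.
    """
    names = list(catalogs)
    reference = names[0]
    hits = []
    for obj_id, ra, dec in catalogs[reference]:
        surveys = {reference}
        for other in names[1:]:
            if any(angular_sep_arcsec(ra, dec, r, d) <= radius_arcsec
                   for _, r, d in catalogs[other]):
                surveys.add(other)
        if len(surveys) >= min_surveys:
            hits.append((obj_id, sorted(surveys)))
    return hits

# Made-up positions: object A has an SDSS counterpart about 0.36 arcsec away.
catalogs = {
    "DESI": [("A", 150.0, 2.0), ("B", 10.0, -5.0)],
    "SDSS": [("X", 150.0001, 2.0)],
    "LAMOST": [("Y", 200.0, 30.0)],
}
priority = multi_survey_hits(catalogs)
print(priority)
```

Anchoring on one reference survey keeps the example short. A symmetric version would also need to group matches that appear only in the non-reference surveys.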

We've already built it for DESI DR1. The next targets are ready:

SDSS DR18

5M spectra. Different survey, same methodology. Proves the approach is survey-independent. Cost: ~$50.

LAMOST DR10

20M spectra. Largest spectroscopic survey before DESI. Chinese telescope, different sky coverage. Cost: ~$100.

eROSITA X-ray

710K X-ray sources. First all-sky X-ray survey since the 1990s. X-ray sources are almost always "interesting." Cost: ~$50.

Planck CMB

Full-sky microwave background. Train autoencoder on simulated CMB patches, apply to real data. Find unusual temperature patterns. Cost: ~$50.

NEOWISE Time-Domain

170 BILLION rows. 10.5 years of infrared observations. Cross-match with our spectral anomalies to find objects that are both spectrally and temporally unusual. Cost: ~$300.

Gaia Epoch Photometry

1.8 BILLION stars observed ~70 times each. Most analyses use averaged data. Full epoch-level anomaly detection is unexplored at scale. Cost: ~$500.

Total cost to run anomaly detection on every major public astronomical dataset: under $2,000. Less than a single night of traditional telescope time.
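The "under $2,000" figure follows from adding up the per-survey estimates above. A quick tally, assuming the roughly $200 already spent on DESI DR1 is counted alongside the six planned targets:

```python
# Per-survey compute estimates quoted in this essay, in US dollars.
estimated_cost = {
    "DESI DR1": 200,  # "maybe $200", already run
    "SDSS DR18": 50,
    "LAMOST DR10": 100,
    "eROSITA X-ray": 50,
    "Planck CMB": 50,
    "NEOWISE Time-Domain": 300,
    "Gaia Epoch Photometry": 500,
}

total = sum(estimated_cost.values())  # 1250
headroom = 2000 - total               # 750 left under the $2,000 ceiling
print(f"total: ${total:,}  headroom: ${headroom:,}")
```

The estimates sum to $1,250, which leaves about $750 of slack for reruns and storage before the $2,000 ceiling is reached.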

5. The Hubify Lab Thesis

The thesis is simple:

The fastest path to high-impact scientific discovery in 2026 is to combine public archival data, commodity GPU compute, and AI-assisted development to do at scale what no individual researcher or lab is currently doing — systematic anomaly detection across every major astronomical survey simultaneously.

This is not about being smarter than astronomers. They know infinitely more about physics than I do. It's about recognizing a structural gap in how science currently operates: the data is public, the tools are available, but the intersection of skills and incentives required to exploit both is nearly empty.

For a brief window, a small lab with the right approach can move faster than institutions with 100x the resources. Not because we're better — because we're willing to do the work that doesn't fit neatly into anyone else's job description.

The goal for Hubify Lab is concrete: become the fastest-growing scientific research operation in terms of published papers, public catalogs, and verified discoveries per dollar spent. Not in a decade. Now. This year. Before the window closes.

6. A Note on Self-Doubt

I want to be honest about something. When I look at what we've done — 195,829 anomalies, 8.47 million galaxies classified for chirality, 424,000 MCMC posterior samples — my first reaction isn't pride. It's suspicion.

This was too easy. What am I missing? Why hasn't a real lab done this? Am I fooling myself?

I've been through this loop many times now. And every time, the answer is the same: it's not that the work is trivial. It's that the combination is rare. Any individual step — training an autoencoder, renting a GPU, downloading DESI data, cross-matching against SIMBAD — is well within the capabilities of hundreds of research groups. But the full pipeline, end to end, requires someone who is willing to do all of it, fast, without waiting for committee approval or grant funding or a postdoc to assign the work to.

The self-doubt is healthy. It keeps the methodology rigorous. But it shouldn't be paralyzing. The data is real. The anomalies are real. The cross-references are verified. The question isn't whether we did it right — it's whether we can publish fast enough before someone else does it too.

7. The Moment

There are rare moments in science when the tools outpace the institutions, when individuals with the right timing can make contributions that would normally require entire departments. The democratization of sequencing did this for genomics in the 2010s. The availability of cloud compute did this for machine learning in the mid-2010s.

For astronomy and cosmology, that moment is right now. Public survey data + commodity GPUs + AI coding agents = a level playing field that has never existed before and will not last long.

We're not waiting for permission. We're not applying for grants. We're not forming committees. We're downloading the data, training the models, scoring the spectra, and publishing the results.

The window is open. We're going through it.

"The best time to start was when DESI DR1 dropped. The second best time is today."

— Houston Golden, March 2026