The Sequence Opinion #529: Where Foundation Models Are Just Getting Started

Domain specialized foundation models might be a non-obvious next frontier for the space.

Last week, I had lunch with my friend Ed Sim, who is widely regarded as one of the top early-stage investors in frontier tech, including AI. We got into a fascinating debate about overlooked opportunities in foundation models. The core thesis was that while it may seem nearly impossible for startups to compete with major labs like OpenAI, Google DeepMind, or Anthropic, there are still plenty of opportunities, particularly in domains that require the creation of new datasets.
The conversation was so enlightening that I decided to make it the topic of today's essay.

Foundation models have transformed NLP and vision, but their application to scientific and engineering domains remains limited. Fields like physics, chemistry, biology, and robotics involve data types and reasoning patterns not captured in general web-scale corpora.
These domains require models trained on proprietary, often experimentally generated datasets and benefit from architectures that go beyond conventional LLM structures. As general-purpose models saturate the market, the real innovation frontier is shifting toward specialized foundation models that unlock domain-specific reasoning and accelerate discovery.

The Data Challenge and Why It Matters

Most of the success in LLMs and vision models stems from massive public datasets: Common Crawl, Wikipedia, ImageNet, LAION-5B.
These models thrive on scale and diversity. In contrast, scientific domains deal with data that is structured, scarce, multimodal, and frequently proprietary.