The Sequence Radar #526: Llama 4 Scout and Maverick are Here!

A major release for open source generative AI.

Next Week in The Sequence:

Our series on AI evals continues with an exploration of the types of benchmarks. The opinion series explores the trend of all the major AI labs building the same primitives (research, reasoning, search, etc.) and its implications. In research, we dive into the new Llama 4 release.
Engineering explores another cool framework. You can subscribe to The Sequence below.
đź“ť Editorial: Llama 4 Scout and Maverick are Here!I had written an editorial for this newsletter and then Meta dropped the Llama 4 release Sat. Ohh well, time to rewrite the whole thing but it was definitely worth it because this Llama release is a big deal! New architectures and enhanced multi-modality. Llama 4 debuts with two models: Llama 4 Scout and Llama 4 Maverick.
Both are natively multimodal, capable of processing not just text but also images and video. This versatility positions them as foundational models for next-generation applications requiring rich contextual understanding across modalities.

At the core of Llama 4 lies a sophisticated mixture-of-experts (MoE) architecture.
Llama 4 Scout incorporates 16 experts with 17 billion active parameters, optimized to run on a single H100 GPU. It also supports an unprecedented 10-million-token context window, dramatically enhancing its capabilities in long-range dependency tasks such as legal document analysis, scientific literature synthesis, and full-codebase reasoning. Meanwhile, Llama 4 Maverick scales the architecture to 128 experts and a total of 400 billion parameters (still 17 billion active per token), achieving state-of-the-art results in reasoning, multilingual tasks, and code generation, rivaling models like DeepSeek V3.
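To make the MoE idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind architectures like Llama 4's. This is an illustrative toy, not Meta's implementation; the class name, dimensions, and routing details are all assumptions:

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Toy sparse mixture-of-experts layer: a learned router sends each
    # token to its top-k experts, so only the selected experts' weights
    # (the "active" parameters) are used for that token.
    def __init__(self, d_model=1024, n_experts=16, k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # route tokens expert by expert
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

With n_experts=16 and k=1, the layer mirrors Scout's headline trade-off: total capacity grows with the expert count while per-token compute stays close to that of a single dense feed-forward block.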
This release signals a strategic pivot from Meta, influenced by the growing competitiveness of open-source models from the broader AI community. With Llama 4 Behemoth still in training (a model boasting 288 billion active parameters and nearly two trillion in total), Meta aims to surpass frontier models like GPT-4.5 and Claude 3.7 Sonnet on STEM-specific benchmarks.
Early signals suggest it is succeeding, with Behemoth outperforming across a wide range of logic and scientific reasoning evaluations.

What sets Llama 4 apart is not just its performance but its accessibility. Both Scout and Maverick are released under open terms and available through platforms like Hugging Face and llama.com, reinforcing Meta's commitment to democratizing advanced AI tooling. This stands in stark contrast to the increasingly closed ecosystems of some competitors, and it gives researchers, developers, and startups the building blocks to create high-performance AI systems without prohibitive licensing barriers. Meta's anticipated $65 billion AI infrastructure spend in 2025 underscores the seriousness of its intent.
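For readers who want to try the models, the sketch below shows one plausible way to load Scout through the Hugging Face transformers pipeline. The model ID, dtype, and hardware assumptions should be verified against Meta's official model cards, and access is gated behind the Llama license:

from transformers import pipeline

# Assumed hub model ID; check Meta's Hugging Face organization page
# before running (gated access, and Scout targets a single H100).
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    device_map="auto",        # shard across available GPUs
    torch_dtype="bfloat16",
)
out = pipe("Explain mixture-of-experts routing in two sentences.",
           max_new_tokens=120)
print(out[0]["generated_text"])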
Llama 4 is not merely another model drop; it is a recalibration of the LLM landscape, balancing technical sophistication with an open invitation to build. In the era of scale, efficiency, and multimodality, Llama 4 is not just keeping pace: it is setting the tempo.

🔎 AI Research

Responsible Path to AGI
Google DeepMind published a very long paper titled “An Approach to Technical AGI Safety and Security,” which outlines their proactive strategy for navigating the development of artificial general intelligence, emphasizing readiness, risk assessment, and collaboration.
This approach involves a systematic exploration of four main risk areas – misuse, misalignment, accidents, and structural risks – and details their ongoing efforts in monitoring progress, implementing safety and security measures, and fostering an ecosystem for responsible AGI development.

CURIE
Google Research published a paper detailing CURIE, a new benchmark designed to evaluate the potential of large language models in scientific problem-solving by testing their long-context understanding, reasoning, and information extraction abilities across six scientific disciplines. Its top contributions include a suite of ten challenging tasks based on full-length scientific papers that represent realistic scientific workflows, along with novel model-based evaluation metrics to assess the varied and heterogeneous forms of ground-truth annotations.
UniDisc
In the paper "Unified Multimodal Discrete Diffusion," researchers from Carnegie Mellon University present UniDisc, a novel unified multimodal discrete diffusion model for jointly understanding and generating text and images. The model leverages discrete diffusion through masking and demonstrates capabilities in tasks such as joint image-text inpainting, outperforming autoregressive models in terms of performance and inference-time compute while also offering enhanced controllability and editability.
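As rough intuition for masking-based discrete diffusion, the training objective looks something like the sketch below: corrupt a token sequence by masking a random fraction of positions, then train the model to reconstruct the originals. This is a generic absorbing-state formulation for illustration, not the UniDisc code itself:

import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    # tokens: (batch, seq_len) of discrete ids (text or image codes).
    b, n = tokens.shape
    t = torch.rand(b, 1)                        # per-sequence noise level
    mask = torch.rand(b, n) < t                 # mask each token with prob t
    noisy = tokens.masked_fill(mask, mask_id)   # absorbing [MASK] state
    logits = model(noisy)                       # (batch, seq_len, vocab)
    # Reconstruction loss only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])

Generation then runs the process in reverse, starting from an all-mask sequence and iteratively unmasking confident predictions, which is what makes inpainting over arbitrary subsets of text and image tokens natural.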
ECLeKTic
Google Research published ECLeKTic, a novel benchmark designed to evaluate the ability of large language models (LLMs) to transfer knowledge across languages using a closed-book question-answering task based on single-language Wikipedia articles. The dataset assesses whether LLMs can access and utilize knowledge originally present in one language when questions are posed in other languages, highlighting discrepancies in current models and providing a tool for improvement.

AI for Software Eng
In the paper "AI for Software Engineering: The State of the Art and Promising Directions," researchers from the University of California, Berkeley and MIT CSAIL provide a comprehensive overview of the field of AI for software engineering, highlighting its recent progress and remaining challenges. The paper offers a structured taxonomy of tasks beyond code generation, emphasizes key limitations of current models, and proposes promising research directions to achieve higher levels of automation in software development.
CodeARC
In the paper "CodeARC: A Dataset and Evaluation Framework for Inductive Program Synthesis of General-Purpose Python Functions," the authors introduce CodeARC, the first comprehensive dataset for general-purpose inductive program synthesis, featuring 1,114 Python functions with initial input-output examples. The benchmark is designed to evaluate the ability of LLM agents to synthesize general-purpose functions and employs differential testing for correctness evaluation, revealing that existing LLMs face significant challenges on this dataset.
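Differential testing here simply means executing a synthesized candidate against the hidden reference function on many inputs and looking for any disagreement. A toy sketch of the idea (the function names and the integer input generator are made up for illustration):

import random

def differential_test(candidate, reference, n_trials=100):
    # Run both functions on random inputs; any mismatch (or crash)
    # is a counterexample that falsifies the synthesized candidate.
    for _ in range(n_trials):
        x = random.randint(-1000, 1000)
        try:
            if candidate(x) != reference(x):
                return False, x
        except Exception:
            return False, x
    return True, None

# A candidate that survives is only "correct up to the inputs tried,"
# which is why this serves as an empirical correctness check.
ok, counterexample = differential_test(lambda x: 2 * x, lambda x: x + x)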
🤖 AI Tech Releases

HallOumi
Oumi introduced HallOumi, a frontier model for claims verification.

Midjourney v7
Midjourney released its new image generation model.

Devin 2.0
Cognition released the second version of its software engineering agent.
Nova Act
Amazon introduced Nova Act, its new web browsing agent.

🛠 AI in Production

LLMs at Pinterest
Pinterest shares details of its LLM-powered search capabilities.

📡 AI Radar

OpenAI raised $40 billion in a round led by SoftBank.
AI video platform Runway raised $308 million in new funding.
Isomorphic Labs, the DeepMind spin-off focused on drug discovery, raised $600 million.
Agentic AI platform Redpanda raised $100 million in new funding.
Anthropic introduced Claude for Education.
GenSpark released its SuperAgent platform for real-world tasks, which competes with the famous Manus AI.
GitHub introduced changes to its Copilot platform.
Voice AI platform Phonic came out of stealth with $4 million in funding.
Uplimit unveiled a suite of learning agents for training employees.
ZenCoder released a new version of its coding and unit testing agents.
Emergence AI unveiled its new agent creation platform.
Qualcomm acquired the AI division of VinAI.