
Market Intelligence · April 22, 2025 · 5 min read

Llama-4 Landed with a Thud—Here's Why That Matters

We break down the reactions to the Llama 4 release, analyzing the missteps but also highlighting strengths that may have been overlooked amid the initial noise.

Lucas Atkins

I've watched for two weeks as the noise around Meta's Llama 4 launch has gotten louder. In the AI world, noise is almost always a good thing; almost is the operative word here. I wanted the dust to settle before weighing in, but it's clear now that the story isn't just the models. It's the launch.

I feel this deserves a straightforward breakdown based on my own observations, conversations with industry contemporaries, and public sources. I'll do a deeper dive into the technical specifics after we get a longer look at performance, and after Alibaba's Qwen 3 models arrive.

What Actually Shipped

On April 5, 2025, Meta released weights for two new models:

  • Llama 4 Scout: Around 109B total parameters, with 17B active per token, multimodal, and claiming a massive 10-million-token context window.
  • Llama 4 Maverick: Approximately 402B total parameters, also 17B active per token, the same basic architecture with a much sparser expert distribution, and a one-million-token context window (see the back-of-the-envelope sketch after this list for what those totals mean in practice).
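
To make the total-versus-active distinction concrete, here's a rough sketch based on my own assumptions (a simple bytes-per-parameter count for weight memory and the common ~2-FLOPs-per-active-parameter rule of thumb, not figures from Meta). In a sparse mixture-of-experts model, total parameters set the memory footprint, because every expert has to stay resident, while active parameters set the per-token compute.

```python
# Rough sizing sketch for sparse MoE models (assumption-laden, illustrative only).

def moe_footprint(total_params_b: float, active_params_b: float, bytes_per_param: float) -> dict:
    """Estimate resident weight memory (GB) and per-token compute (TFLOPs)."""
    weight_gb = total_params_b * bytes_per_param          # all experts stay in memory
    tflops_per_token = 2 * active_params_b / 1000         # ~2 FLOPs per active parameter
    return {"weights_gb": round(weight_gb, 1), "tflops_per_token": round(tflops_per_token, 3)}

# Scout (~109B total / 17B active) and Maverick (~402B total / 17B active),
# at bf16 (2 bytes/param) and a hypothetical 4-bit quantization (0.5 bytes/param).
for name, total_b in [("Scout", 109), ("Maverick", 402)]:
    for precision, bytes_pp in [("bf16", 2.0), ("int4", 0.5)]:
        print(name, precision, moe_footprint(total_b, 17, bytes_pp))
```

The rough takeaway: even with only 17B parameters active per token, Scout's full weight set is on the order of 218 GB at bf16 (roughly 55 GB at 4-bit), which is part of why the community's quick quantization fixes, mentioned later, mattered so much.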

A larger third model, "Behemoth," remains in training.

Collectively, the models are called "the herd," highlighting openness and remixability. Meta positioned them boldly, claiming the longest context of any available model.

How the Rollout Went Off-Script

Benchmark Controversy

Just hours after launch, sharp-eyed community members noticed that the Maverick checkpoint Meta submitted to LM Arena differed from the publicly available version. The submitted model outperformed the public release, prompting accusations of, among other things, "gaming human preference benchmarks." Sloppy and avoidable at best, but not necessarily a deal-breaker.

Issues with Long Context

Early adopters trying to leverage Scout's promised multi-million-token context faced crashes and inconsistent outputs between GPU and CPU environments. Meta blamed these issues on "early-stage deployment bugs," but the label "unstable" quickly stuck.

Unstable and sloppy. Whatever these models prove capable of once the dust settles, it's clear the launch underestimated how costly missing the target on these checkpoints would be.

We participate in an AI ecosystem right now that relies on stability, speed, and innovation. Make no mistake, open-source is the future of AI, but it has to work as advertised.

Mixed Messages

When the controversy erupted, Meta issued firm denials through media outlets, asserting that they were not "gaming" benchmarks. But they did submit an experimental "chat-optimized version of Maverick" to the leaderboard. The very need for defensive statements highlighted how quickly trust can erode in the AI community (just ask Google, c. 2024). This is especially true when open-source expectations seem violated.

My marketing colleague summed it up well: "It's like telling someone their baby isn't that cute." We rely on these benchmarks because, with them, we can all speak the same language. Throw benchmarks out, or get them wrong, and suddenly we're left with subjectivity that doesn't stand up to testing.

My baby may very well end up being cute, but I don't trust you to judge it objectively anymore.

Why Meta's Direction Feels Unclear

Internally at Arcee AI, we keep returning to a simple question: "Best at what, exactly?"

  • Anthropic anchored Claude 3.5 around coding and software development.
  • Google went deep into multimodal, long-context models with Gemini 1.5.
  • OpenAI refined ChatGPT as the go-to consumer assistant.
  • Cohere specialized in enterprise retrieval and secure deployments.

Meta’s broader mandate—"build the best general-purpose open-source model"—is noble but vague. When every metric matters equally, none stand out. This leads to internal confusion and rushed decision-making:

  • Pour tokens in and hope for scaling magic.
  • Experiment with sparse routing, quickly abandoning anything that underperforms initially.
  • Push context limits without fully stabilizing the infrastructure.
  • Over-focus on benchmark metrics because they're marketing-friendly.

This scattershot approach resulted in strong models released prematurely, with engineering and messaging trailing behind. If the open-source community has taught us anything, it's to let your model do the talking, not your marketing.

The Models Themselves Aren’t the Problem

Despite the shaky rollout, I believe Scout and Maverick still represent valuable contributions:

  • Open-Weight Flexibility: Organizations now have solid, mid-sized options suitable for on-premises or private cloud setups.
  • Competitive Pricing Pressure: Their presence will likely help keep pricing reasonable and encourage larger context windows industry-wide.
  • Rapid Community Solutions: Within days, third-party developers provided critical quantization fixes and performance patches.

Even harsh critics acknowledge underlying promise, despite the initial missteps. Meta’s emphasis on lower refusal rates for controversial queries might appeal to certain builders, though it introduces fresh alignment debates.

Where This Leaves Meta—and the Community

Meta's AI division faces enormous pressure to justify its massive $65 billion AI spend and establish leadership in open models. Without a clearly articulated product direction—like software development or enterprise retrieval—the team risks repeating these launch issues.

The charitable interpretation: Meta is learning better release practices publicly, in real-time.

The skeptical view: in the absence of a more defined product identity, Meta fumbling this model release gives me more heartburn than Pete Hegseth's Signal practices.

What I'm Watching Next

  • Patch Frequency: Will Meta stabilize these models quickly, or wait months until Llama 4.1?
  • Qwen’s Performance: Alibaba’s upcoming Qwen models are rumored to rival Claude 3.5, providing clearer competitive benchmarks.
  • Benchmark Reform: Platforms like LM Arena already tightened rules around submissions—expect broader benchmark improvements soon.

Closing Thoughts

I genuinely support open-source AI—each new release is a win. But openness is a means to an end, not an end unto itself. For want of vision and careful execution, Llama 4 has become a textbook market example that how you ship matters just as much as what you ship. The models themselves are powerful, but the rushed release exposed underlying organizational challenges Meta needs to fix before "Behemoth" arrives.

I'll return soon with a detailed technical review—and (likely) kinder words—once more data comes in. Until then, Llama 4 stands as both a gift to the community and a cautionary tale about launching before everything’s fully ready.

Lucas Atkins is the Co-Lead of Arcee Labs. Learn more about Arcee Labs' open-source small language models (SLMs) here. And book a call with our team to learn about our newest tools that leverage our SLMs: the intelligent model routing solution Arcee Conductor, and the AI automation platform Arcee Orchestra.
