My Favorite Closed-Source Models Right Now
This is a point-in-time judgment. Not a review roundup. Not a benchmark comparison. A field report from someone who uses these systems daily to build software, write, research, and ship.
As of March 23, 2026, the landscape looks like this — and it will look different in three months. That is the point of writing it down now.
If You Only Care About the Ranking
1. Anthropic (Opus 4.6) — my favorite model family overall. Best taste, best reasoning texture, best long-form coding judgment.
2. OpenAI (GPT-5.4) — highest-reach closed-source ecosystem for serious daily work. Broadest product surface. Most reliable tool stack.
3. xAI (Grok) — not my primary workhorse, but better at research aggregation and source-finding than most people give it credit for.
Now the reasoning.
What I Mean by "Favorite"
I do not mean highest on a leaderboard. I mean: which model do I open first when the task matters? Which one do I trust with a problem where a wrong answer costs me time? Which one produces output I do not have to rewrite?
"Favorite" is a compound judgment across reliability, taste, stamina on long tasks, and integration into the way I actually work. A model can score well on benchmarks and still not be my default if reaching for it adds friction or if its output requires heavy editing.
OpenAI: The Utility Stack
OpenAI introduced GPT-5.3-Codex on February 5, 2026. A month later, on March 5, GPT-5.4 arrived — described by OpenAI as incorporating 5.3-Codex's coding strengths while improving broader reasoning and tool use.
GPT-5.3-Codex mattered because it made long-running coding work materially more useful. Not a marginal improvement in pass rates on HumanEval — a change in what you could hand off and walk away from. Multi-file refactors that held coherence across dozens of edits. Implementation tasks where the model maintained architectural context over extended sessions. That stamina was a real axis of improvement, not a benchmark gimmick.
Its window of leadership was short. GPT-5.4 compressed and broadened that utility within a month — better reasoning across non-coding tasks, stronger tool orchestration, and the coding capability folded in. The rapid succession was striking. If you only evaluated Codex, you missed the signal; if you only waited for 5.4, you missed the context.
My impression of GPT-5.4 as a daily driver: it is the most broadly useful closed-source system available right now. Not the best at any single axis. The best at not being bad at anything. Its integrations — search, code interpreter, image understanding, file handling, plugins — add up to a product surface no other provider matches. When I need a general-purpose assistant that can handle whatever I throw at it, I reach for GPT-5.4.
Where OpenAI underperforms: Output taste. GPT-5.4 is competent and reliable, but its prose is functional rather than precise. For writing where I care about the exact texture of each sentence, I switch. For complex reasoning where I want the model to push back or surface non-obvious second-order consequences, I often get better results elsewhere. OpenAI optimizes for broad usefulness, and that optimization sometimes rounds off the edges I want.
Anthropic: The Favorite
Anthropic introduced Claude Opus 4.6 on February 5, 2026, emphasizing stronger coding, longer agentic task endurance, and a 1M token context window in beta.
Opus 4.6 is still my favorite model family overall, and the reason is qualitative rather than quantitative. The output has a texture I trust — precise without being sterile, and willing to express genuine uncertainty rather than hedging with boilerplate. It pushes back when my framing is wrong. When I need a model that thinks carefully rather than quickly, Opus is where I go.
For coding specifically, Opus 4.6 produces solutions I agree with more often. Not just solutions that work — solutions that are tasteful. Right abstractions. Appropriate scope. Less over-engineering. I spend less time editing its output and more time building on top of it.
Claude Sonnet, the mid-tier model in the same family, deserves separate mention for its agent mode. It is genuinely impressive for multi-step tasks — navigating codebases, running verifications, maintaining context across tool calls. The agentic behavior feels less like a model executing instructions and more like a competent collaborator working through a problem. It is not flawless, but it is the most natural agentic experience I have used.
Where Anthropic underperforms: Product breadth. The model is excellent, but the ecosystem around it is narrower than OpenAI's. File handling, search integration, multi-modal workflows — Anthropic's product surface is thinner. If I need to process a PDF, query the web, analyze an image, and write a summary in one session, OpenAI's stack handles that more smoothly. Anthropic is a better engine in a less complete vehicle.
The 1M context window went GA on March 13 — no beta header, no long-context surcharge, standard per-token pricing across the full window. That pricing decision matters more than the technical capability itself. When Anthropic removed the 2x input premium, they made it practical to use long context routinely rather than reserving it for special cases. I have started loading full crate trees into single prompts for architectural review, and the results are noticeably better than chunked approaches. Opus 4.6 scored 78.3% on MRCR v2, and that tracks with what I see in practice — it holds references across very long agent traces without the degradation I expect. This is a genuine competitive advantage. GPT-5.4 charges more beyond 200K tokens and supports less total context. For long-context work, Anthropic is not close — it is ahead.
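The crate-tree workflow is simple enough to script. A minimal sketch of how I assemble those single-prompt reviews — the file extensions, header format, and review instruction here are my own conventions, not anything Anthropic prescribes:

```python
from pathlib import Path

def build_context_prompt(root: str, exts: tuple[str, ...] = (".rs", ".toml")) -> str:
    """Concatenate every matching file under `root` into one prompt string.

    Each file gets a path header so the model can keep cross-file
    references straight across a very long context.
    """
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            rel = path.relative_to(root)
            parts.append(f"=== {rel} ===\n{path.read_text(encoding='utf-8')}")
    # The instruction line is illustrative; tailor it to the review you want.
    header = "Review the architecture of this crate. Flag layering violations.\n\n"
    return header + "\n\n".join(parts)
```

Sorting the paths keeps the prompt deterministic between runs, which makes it easier to diff model responses when the code changes.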
xAI: Better Than Its Reputation at One Thing
I am not going to pretend xAI is competing as my primary workhorse. For coding, reasoning, and daily utility, I reach for OpenAI or Anthropic first.
But xAI does something well that is worth acknowledging: research aggregation and source-finding. Grok's real-time search integration and willingness to surface links, pull from recent sources, and synthesize across live data is better than its reputation suggests. When I need to understand the current state of a fast-moving topic — recent papers, industry developments, regulatory actions — Grok often produces a more useful starting point than the alternatives.
This is a narrow band. Research and source-gathering are not the whole job. But for that specific task, dismissing xAI entirely would be wrong.
Where xAI underperforms: Depth and reliability on complex tasks. For multi-step reasoning, long coding sessions, or anything requiring sustained precision, Grok does not match the other two. The model can be impressive in bursts and unreliable across extended work. Tool use feels less mature. The overall product experience is rougher.
What I Optimize For
My model selection is driven by four axes, roughly in this order:
Reliability under real conditions. Not benchmark performance — actual reliability when I hand off a task and check the result an hour later. Does it hold context? Does it degrade gracefully? Does it produce correct output, or output that looks correct?
Output taste. How much editing does the output require before I would publish it or commit it? A model that produces 90% correct output with the wrong tone costs more of my time than a model that produces 85% correct output with the right sensibility.
Stamina on long tasks. Single-turn performance is table stakes. I care about what happens at turn 30 of a complex implementation session, or page 15 of a long document. Models that start strong and degrade are a trap.
Product integration. A great model behind a limited interface is less useful than a good model behind a comprehensive one. The surrounding tools — file handling, search, code execution, multi-modal input — determine what I can actually do in practice.
By these axes: Anthropic wins on taste and reliability for the tasks I care about most. OpenAI wins on product integration and breadth. xAI wins on a narrow but real research axis.
What I Would Tell Builders
Pick your model by your bottleneck, not by the leaderboard.
If your bottleneck is coding quality and you care about the shape of the code, not just whether it passes tests — evaluate Anthropic seriously. If your bottleneck is broad utility and you need one system that handles everything adequately — OpenAI is the pragmatic choice. If your bottleneck is research synthesis and staying current — do not dismiss xAI without trying it for that specific task.
Benchmark leadership is not workflow dominance. Evals do not measure the friction of integration, the texture of output, or the reliability across 50 consecutive tool calls. The model you actually use is the one that fits your work, not the one that wins on paper.
Do not marry a vendor. These rankings will change. The pace of iteration across all three companies is fast enough that any snapshot is provisional. Build abstractions that let you switch. Test new releases against your actual tasks, not against public benchmarks. The model that was best in January may not be best in April.
The Ranking, with Caveats
As of March 23, 2026:
1. Anthropic Opus 4.6 — favorite overall. Best judgment, best taste, best for work I care about most. Limited by thinner product surface.
2. OpenAI GPT-5.4 — most useful overall. Broadest utility, best tool integration, most reliable as a general-purpose system. Limited by output texture on quality-sensitive tasks.
3. xAI Grok — not a primary workhorse but genuinely useful for research and source aggregation. Limited by depth and consistency on complex work.
This will change. That is why it has a date on it.