WTF is AGI?: Inside OpenAI’s o3 and o4-mini
Smashing benchmarks, agentic usage, and feeling the AGI
April 16th saw the release of OpenAI’s latest models, o3 and o4-mini, which combine reasoning with access to tools like web browsing, file reading, and coding. Surprisingly, o3 is available with a ChatGPT Plus subscription, so you don’t need to be a Pro user to try it out.
Every time a new frontier model is released, the same AGI debates play out on X. People post their unbelievable results, while others post examples where o3 can’t count the intersecting lines in a simple image or the ‘r’s in ‘strawberry’. The new model can correctly solve novel, complicated math problems, yet it struggles with tasks the average 15-year-old would get right.
Let’s discuss what sets these models apart, why native agentic-ness matters for future frontier models, and the subjectivity of ‘feeling the AGI’.
WTF is AGI, Really?
Everyone’s throwing the term around, but let’s get clear on what it actually means, and why nobody can agree:
Artificial General Intelligence (AGI), also sometimes referred to as strong AI or human-level AI, represents a hypothetical level of artificial intelligence that possesses the ability to understand, learn, and apply knowledge across a vast range of tasks, much like a human being. Unlike the narrow or weak AI that we see in abundance today – systems designed for specific tasks like playing chess, recognizing faces, or translating languages – AGI would have the capacity for general-purpose intelligence. This means it could theoretically perform any intellectual task that a human can, and potentially even surpass human capabilities in many areas.
This definition has changed over the years. Back in 2015, when narrow AI was synonymous with AI, the idea of AGI was purely theoretical. The transformer architecture wasn’t introduced until 2017, and chatbots of that era were easy to mock for their lack of intelligence. They could be hard-coded to seem smart about some things, but they were not general enough that you could ask them just about anything. By 2018, ‘AGI-like’ systems were expected to feature autonomous learning and self-teaching capabilities. The definition shifted again, this time incorporating an appreciation for the depth and complexity of human intelligence.
GPT-3 changed everything. It was markedly more consistent than previous iterations of the GPT series and could respond quickly to a wide range of prompts. It is easy to forget, but the first few days of using GPT-3 were truly mind-shattering. There was already talk at that point of this being nearly AGI. The modern internet can almost be divided into two distinct phases: pre- and post-LLM.
Now, we’re living in 2025. We have generalized intelligence, these frontier models clearly demonstrate it, and yet the goalposts keep moving. You can use that intelligence for 20 dollars a month. We have AI systems that can perform intellectual tasks faster and more effectively than 99.9% of the population. o3 can solve novel mathematical challenges that only the smartest people can solve, and it does so in a fraction of the time.
As some have mentioned online, the AGI debate has become a philosophical one. This latest release from OpenAI, alongside Codex CLI, showcases the clearly agentic future the company is aiming for. Combine o3’s intelligence with additional agency, and you have a tool that is both increasingly useful and increasingly human-like.
Today, the definition of AGI has changed even more. Now, AGI (roughly) means the following:
It includes everything previously mentioned, such as general-purpose reasoning, cross-domain knowledge, and flexible problem-solving.
It must be capable of taking actions independently. Many believe that embodiment and real-world interaction are necessary for an AI to truly qualify as AGI.
It has to pass established benchmarks and evaluation frameworks. You can’t just say it “feels like AGI” anymore.
It should match or exceed human performance across the full range of intellectual tasks we can measure.
It needs to be fully autonomous, with the ability to improve itself recursively over time without human intervention.
Even once it seems like we’ve smashed every single benchmark, and we have an intelligent robot walking around that can do essentially everything a person can do, the fundamental architecture of LLMs will still be called into question. Is the transformer architecture the true pathway to AGI? Does something need demonstrable sentience to truly ‘be AGI’? We can’t know how we’ll respond until we get there. Remember, almost every benchmark currently used to test these models has already been rewritten, because the models smashed the old ones!
How These Models Hit Different
Out of the box, o3 and o4-mini have enhanced agentic abilities, and they perform similarly to 4o. They can search the web, write and execute Python code, analyze images, and interpret files. Their behavior increasingly looks like what a person would do if given a task.
o3: Outperforms competing models like Gemini 2.5 Pro, excelling in advanced coding, math, philosophical discussions, and even PhD-level scientific reasoning. Benchmarks like Aider Polyglot (81%) and SWE-bench (69.1%) show significant superiority.
o4-mini: Highly praised for speed and cost-effectiveness, particularly on math and visual tasks, and it surpasses Claude 3.7 Sonnet by approximately 6% on software-engineering benchmarks.
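To make the “tool access” described above concrete, here is a minimal sketch of what tool calling looks like through the OpenAI Python SDK. The `o4-mini` model id is assumed to be available to your API key, and `get_current_temperature` is a hypothetical function standing in for whatever tools you expose; the built-in browsing, Python, and file tools in ChatGPT work along similar lines but are managed by OpenAI.

```python
# Minimal sketch of function/tool calling with the OpenAI Python SDK.
# Assumes: OPENAI_API_KEY is set, the "o4-mini" model id is available to you,
# and get_current_temperature is a hypothetical tool you implement yourself.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Look up the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Is it warm in Lisbon right now?"}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a tool call instead of text.
# Your code runs the tool, appends the result as a "tool" message, and calls
# the API again so the model can write its final answer.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The specific API matters less than the loop: the model plans, asks for a tool, reads the result, and keeps going, which is exactly the behavior this release leans into.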
OpenAI has highlighted that o3 and o4-mini are capable of generating genuinely novel ideas, and many people online who have tested the models, including Derya Unutmaz on X, agree.
OpenAI has dropped computational costs significantly across both of these models, with o4-mini being drastically cheaper while retaining performance similar to o3’s. But user experience varies, and in general people want to see these models become more consistent.
o3 User Experiences
Strengths: Users report o3 excels in coding, math, and philosophical discussions, often outperforming previous models like o1. It feels smarter, with deliberate phrasing, and shows improvements on benchmarks like Aider Polyglot (81% vs. Gemini 2.5 Pro's 74%) and SWE-bench. It also performs well on humanities exams and visual reasoning, making it a strong thought partner for complex tasks.
Weaknesses: Some users are underwhelmed, particularly in niche technical areas like reverse-engineering games, where it hallucinates details and returns incorrect results. It made obvious Python errors (e.g., using semicolons, not checking CUDA availability) that o4-mini-high avoided. It also failed tasks like creating animal-themed bar charts due to integration issues, and it was worse than Gemini 2.5 Pro at math research.
General Sentiment: Seen as a strict improvement over o1, but inconsistent, with some users feeling its comprehensive explanations can be overwhelming.
o4-mini User Experiences
Strengths: o4-mini is praised for its speed and cost-efficiency, with users impressed by its performance on the Plus plan (150 messages/day for o4-mini, 50/day for o4-mini-high). It's better at coding and math, providing comprehensive solutions, and outperforms Claude 3.7 Sonnet by ~6% on SWE-bench. Some people on X claim it is ‘10x faster than hiring juniors for C++ coding’, with no syntax mistakes, and that it excels at image generation and editing.
Weaknesses: Some users find its conversational reasoning depressingly bad, failing basic coding tests and hallucinating. It's not as good on SWE-Lancer as expected and is worse than o3 for non-coding tasks. Compared to Claude 3.7 and Gemini 2.5, it's seen as lazy and prone to missing features, though better at fixing complex TypeScript types.
General Sentiment: Efficient and suitable for high-volume needs, but lacks depth in some areas, with mixed feedback on consistency.
This thread is a great compilation of some of the agentic-ish uses of the models so far. Highlights include o3 stitching still frames together into a GIF, impressive image search, automated task management using Codex, impressive mathematical abilities, one-shot game generation, and better vibe-coding in general.
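One of those highlights, stitching still frames into a GIF, is exactly the kind of small scripting job these models now handle end-to-end in their Python tool. As a rough illustration (not the actual code o3 produced), assuming a folder of hypothetical numbered PNG frames, the whole task is a few lines with Pillow:

```python
# Rough illustration of frame-to-GIF stitching with Pillow (pip install pillow).
# Assumes hypothetical files frame_000.png ... frame_023.png in the working directory.
from PIL import Image

frames = [Image.open(f"frame_{i:03d}.png") for i in range(24)]
frames[0].save(
    "animation.gif",
    save_all=True,              # write every frame, not just the first
    append_images=frames[1:],   # remaining frames follow in order
    duration=80,                # milliseconds per frame
    loop=0,                     # 0 = loop forever
)
```

The interesting part isn’t the script itself but that the model writes, runs, and debugs it on its own before handing back the finished file.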

In the coming weeks, we’ll no doubt see more interesting use cases and examples of these models doing incredible work: better vibe-coding, larger context windows, more tool usage, and more X threads where o3 makes dumb mistakes.
Feeling the AGI and Moving Forward
Even though I defined above what AGI is (or is becoming), it is clear there is no ‘real’, settled definition. If you ask 20 people, you’ll get 20 different definitions. o3 is already outperforming most humans and shows clear generalized intelligence. Yet even if o3 were much smarter, it wouldn’t be agentic enough for people to call it ‘truly AGI’. We need more time.
Agentic usage for frontier models is the name of the game. For Google and Anthropic to keep up, they need to become increasingly tools-centric. Users don’t want to pick from a long list of models and decide which will work best when they ask a question. Right now, when you use ChatGPT in a browser, you have the option to pick from 8 different models. That’s 7 too many!
Combining all tools within one interface is the future. That’s what o3 is becoming, and that’s what OpenAI is training GPT-5 to be. As reported yesterday, OpenAI is in talks to purchase Windsurf so it can catch up to Cursor in terms of having a rock-solid agentic IDE. The release of Codex CLI is all the proof you need of where OpenAI (and the industry) is heading.
So, keep experimenting with o3. Accept that it isn’t perfect. Maybe AGI isn’t a finish line. Maybe, like love, you just know it when you feel it.