I've been using o1-mini for coding every day since launch. Here's how it compares to Claude Sonnet 3.5.
tl;dr - use o1-mini for large-scale rewrites and greenfield projects. Use Sonnet 3.5 for general everyday tasks.
⚡️ tl;dr
𝐮𝐬𝐞 𝐨1-𝐦𝐢𝐧𝐢 for large-scale refactors or massive greenfield projects. Its deep processing and 64k output token limit enable comprehensive, one-shot completions.
𝐮𝐬𝐞 𝐂𝐥𝐚𝐮𝐝𝐞 𝐒𝐨𝐧𝐧𝐞𝐭 3.5 for diverse, small-to-medium tasks. It remains the top choice among closed-source coding LLMs.
𝐰𝐡𝐞𝐧 𝐮𝐬𝐢𝐧𝐠 𝐨1-𝐦𝐢𝐧𝐢, 𝐜𝐫𝐚𝐟𝐭 𝐝𝐞𝐭𝐚𝐢𝐥𝐞𝐝, 𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐩𝐫𝐨𝐦𝐩𝐭𝐬. You'll save yourself a ton of time; o1-mini isn't optimized for iterative conversations or minor debugging jobs.
🔍 How I tested
I've been using o1-mini (which OpenAI touts as superior to their preview version for coding) in Cursor as much as my rate limits would allow; I even paid for extra fast responses. I compared it against Claude Sonnet 3.5, which has been my go-to workhorse for coding tasks. Sonnet is the undisputed 👑.
I tested in a production SaaS startup app built with a React/Next.js/Tailwind frontend, a FastAPI Python backend, and an Upstash Redis KV store for config storage - a simple setup by professional standards.
I tasked o1-mini with rearchitecting my JSON config storage in Upstash KV. The goal was to split a single endpoint into two and update seven React components accordingly. My first attempt failed, producing non-functional code. On the second try, with a more explicit prompt detailing the JSON config split, it generated mostly correct code. Some manual fixes were still needed, partly due to an incorrect Redis store value and partly due to quirks in Cursor's beta implementation.
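For context, here's a minimal sketch of the shape of that refactor - one monolithic config endpoint split into two narrower ones, each backed by its own Redis key. The route names, Redis keys, and payload shapes are hypothetical stand-ins, not my actual code, and it assumes the upstash-redis Python SDK:

```python
# Hypothetical sketch of the refactor: one monolithic /config endpoint
# split into two narrower ones, each backed by its own Redis key.
import json

from fastapi import FastAPI
from upstash_redis import Redis  # assumes the upstash-redis Python SDK

app = FastAPI()
redis = Redis.from_env()  # reads UPSTASH_REDIS_REST_URL / _TOKEN

# Before: GET /config returned the entire JSON blob from a single key,
# and every React component parsed out its own slice client-side.

@app.get("/config/ui")
def get_ui_config():
    """Serve only the UI slice of the config (theme, layout, etc.)."""
    raw = redis.get("config:ui")
    return json.loads(raw) if raw else {}

@app.get("/config/features")
def get_feature_config():
    """Serve only the feature-flag slice of the config."""
    raw = redis.get("config:features")
    return json.loads(raw) if raw else {}
```

Each of the seven components then fetches only the endpoint it needs instead of the whole blob.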
✅ o1-mini: Pros
The 64k output context is a game-changer, allowing for extensive refactoring across numerous files with substantial code in each.
With well-crafted prompts, it can typically handle major refactors or architecture changes in 2-3 attempts.
Successfully rearchitected my user config storage system, updating multiple React components in one go.
❌ o1-mini: Cons
Requires extremely specific, verbose prompts, reminiscent of earlier GPT-3.5-era interactions (see the example prompt after this list).
Long processing times necessitate near-perfect initial prompts to avoid wasted interactions. Aim for one-shot.
Limited daily/weekly usage caps greatly restrict adoption and frequent use.
Outputs are insanely verbose, providing unnecessary details. I hate this.
Current Cursor implementation has some bugs, including code duplication and occasional lack of text output.
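To make the prompt-specificity point concrete, here's the rough shape of the second, more explicit prompt that got the config split mostly right. It's reconstructed from memory, and the endpoint and key names are illustrative (matching the sketch above):

```text
Rearchitect the JSON config storage in Upstash KV:
1. Split the single /config endpoint into /config/ui and /config/features.
2. Store each slice under its own Redis key: config:ui and config:features.
3. Update all seven React components that currently fetch /config so each
   one calls only the endpoint it needs. List every file you change and
   output each changed file in full.
4. Do not change any other behavior; preserve existing prop names and types.
```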
⌨️ Final thoughts
o1-mini excels at large-scale tasks. Claude Sonnet 3.5 remains superior for day-to-day coding needs. The 64k output context is impressive, but o1-mini's limitations in conversational ability and required prompt specificity make it less versatile for routine work. Claude Sonnet 3.5 with a similar output token limit would likely outperform o1-mini across the board.
I think Claude Sonnet 3.5 with fine-tuned chain-of-thought (the same way o1 is a fine-tuned chain-of-thought version of GPT-4o) is going to be a vastly superior model. I'm bearish on OpenAI, to be honest.
🤔 Who is the o1 series for?
I am a bit confused by the use-case positioning of the o1 series. It's not the same as a GPT-4o to GPT-5 jump, because its intention is not to replace GPT-4o. It's a vastly different model that excels at some things and falls short at others: it's good at systems thinking and deep reasoning, but it's overly verbose in writing and too slow for general coding.
So who is this model for? OpenAI claims it's for scientists and people who need deep reasoning.
I'm not a scientist or researcher, so I don't know whether this is true.
I don't really know what "people who need deep reasoning" means; laymen don't need to solve brainteasers, puzzles, or count how many R's are in "strawberry".
If you have better insight than me on who o1 is really for, feel free to share below.
What has your experience with o1-{mini, preview} been like?