I've been using o1-mini for coding every day since launch. Here's how it compares to Claude Sonnet 3.5.
tl;dr - use o1-mini for large-scale rewrites and greenfield projects. Use Sonnet 3.5 for general everyday tasks.
⚡️ tl;dr
𝐮𝐬𝐞 𝐨1-𝐦𝐢𝐧𝐢 for large-scale refactors or massive greenfield projects. Its deep processing and 64k output token limit enable comprehensive, one-shot completions.
𝐮𝐬𝐞 𝐂𝐥𝐚𝐮𝐝𝐞 𝐒𝐨𝐧𝐧𝐞𝐭 3.5 for diverse, small-to-medium tasks. It remains the top choice among closed-source coding LLMs.
𝐰𝐡𝐞𝐧 𝐮𝐬𝐢𝐧𝐠 𝐨1-𝐦𝐢𝐧𝐢, 𝐜𝐫𝐚𝐟𝐭 𝐝𝐞𝐭𝐚𝐢𝐥𝐞𝐝, 𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐩𝐫𝐨𝐦𝐩𝐭𝐬. You'll save yourself a ton of time; o1-mini isn't optimized for iterative conversations or minor debugging jobs.
🔍 How I tested
I've been using o1-mini (which OpenAI touts as superior to their preview version for coding) in Cursor as much as my rate limits would allow; I even paid for extra fast responses. I compared it against Claude Sonnet 3.5, which has been my go-to workhorse for coding tasks. Sonnet is the undisputed 👑.
I tested in a production SaaS startup app built with a React/Next.js/Tailwind frontend, a FastAPI Python backend, and an Upstash Redis KV store for config storage - a simple setup by professional standards.
I tasked o1-mini with rearchitecting my JSON config storage in Upstash KV. The goal was to split a single endpoint into two and update seven React components accordingly. My first attempt failed, producing non-functional code. On the second try, with a more explicit prompt detailing the JSON config split, it generated mostly correct code. Some manual fixes were still needed, partly due to an incorrect Redis store value and partly due to quirks in Cursor's beta implementation.
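For context, here's a minimal sketch of the shape of that refactor - one monolithic config endpoint split into two narrower ones, each backed by its own Redis key. The route names, Redis keys, and payload shapes are hypothetical stand-ins, not my actual code, and it assumes the upstash-redis Python SDK:

```python
# Hypothetical sketch of the refactor: one monolithic /config endpoint
# split into two narrower ones, each backed by its own Redis key.
import json

from fastapi import FastAPI
from upstash_redis import Redis  # assumes the upstash-redis Python SDK

app = FastAPI()
redis = Redis.from_env()  # reads UPSTASH_REDIS_REST_URL / _TOKEN

# Before: GET /config returned the entire JSON blob from a single key,
# and every React component parsed out its own slice client-side.

@app.get("/config/ui")
def get_ui_config():
    """Serve only the UI slice of the config (theme, layout, etc.)."""
    raw = redis.get("config:ui")
    return json.loads(raw) if raw else {}

@app.get("/config/features")
def get_feature_config():
    """Serve only the feature-flag slice of the config."""
    raw = redis.get("config:features")
    return json.loads(raw) if raw else {}
```

Each of the seven components then fetches only the endpoint it needs instead of the whole blob.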
✅ o1-mini: Pros
The 64k output context is a game-changer, allowing for extensive refactoring across numerous files with substantial code in each.
With well-crafted prompts, it can typically handle major refactors or architecture changes in 2-3 attempts.
Successfully rearchitected my user config storage system, updating multiple React components in one go.
❌ o1-mini: Cons
Requires extremely specific, verbose prompts, reminiscent of earlier GPT-3.5-era interactions (see the example prompt after this list).
Long processing times necessitate near-perfect initial prompts to avoid wasted interactions. Aim for one-shot.
Limited daily/weekly usage caps greatly restrict adoption and frequent use.
Outputs are insanely verbose, providing unnecessary details. I hate this.
Current Cursor implementation has some bugs, including code duplication and occasional lack of text output.
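To make the prompt-specificity point concrete, here's the rough shape of the second, more explicit prompt that got the config split mostly right. It's reconstructed from memory, and the endpoint and key names are illustrative (matching the sketch above):

```text
Rearchitect the JSON config storage in Upstash KV:
1. Split the single /config endpoint into /config/ui and /config/features.
2. Store each slice under its own Redis key: config:ui and config:features.
3. Update all seven React components that currently fetch /config so each
   one calls only the endpoint it needs. List every file you change and
   output each changed file in full.
4. Do not change any other behavior; preserve existing prop names and types.
```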
⌨️ Final thoughts
o1-mini excels at large-scale tasks. Claude Sonnet 3.5 remains superior for day-to-day coding needs. The 64k output context is impressive, but o1-mini's limitations in conversational ability and required prompt specificity make it less versatile for routine work. Claude Sonnet 3.5 with a similar output token limit would likely outperform o1-mini across the board.
I think Claude Sonnet 3.5 with fine-tuned chain-of-thought (the same way o1 is a fine-tuned chain-of-thought version of GPT-4o) is going to be a vastly superior model. I'm bearish on OpenAI, to be honest.
🤔 Who is the o1 series for?
I am a bit confused by the use-case positioning of the o1 series. It's not the same as a GPT-4o to GPT-5 jump, because its intention is not to replace GPT-4o. It's a vastly different model that excels at some things and falls short at others: it's good at systems thinking and deep reasoning, but it's overly verbose in writing and too slow for general coding.
So who is this model for? OpenAI claims it's for scientists and people who need deep reasoning.
I'm not a scientist or researcher, so I don't know whether this is true.
I don't really know what "people who need deep reasoning" means; laymen don't need to solve brainteasers, puzzles, or count how many R's are in "strawberry".
If you have better insight than me on who o1 is really for, feel free to share below.
What has your experience with o1-{mini, preview} been like?