top | item 43014461

(no title)

ragle | 1 year ago

In a similar situation at my workplace.

What models are you using that you feel comfortable trusting it to understand and operate on 10-20k LOC?

Using the latest and greatest from OpenAI, I've seen output become unreliable with as little as ~300 LOC on a pretty simple personal project. It will drop features as new ones are added, make obvious mistakes, refuse to follow instructions no matter how many different ways I try to tell it to fix a bug, etc.

Tried taking those 300 LOC (generated by o3-mini-high) to cursor and didn't fare much better with the variety of models it offers.

I haven't tried OpenAI's APIs yet - I think I read that they accommodate quite a bit more context than the web interface.

I do find OpenAI's web-based offerings extremely useful for generating short 50-200 LOC support scripts, generating boilerplate, creating short single-purpose functions, etc.

Anything beyond this just hasn't worked all that well for me. Maybe I just need better or different tools though?

discuss

order

Jcampuzano2|1 year ago

I usually use Claude 3.5 sonnett since its still the one I've had my best luck with for coding tasks.

When it comes to 10k LOC codebases, I still don't really trust it with anything. My best luck has been small personal projects where I can sort of trust it to make larger scale changes, but larger scale at a small level in the first place.

I've found it best for generating tests, autocompletion, especially if you give context via function names and parameter names I find it can oftentimes complete a whole function I was about to write using the interfaces available to it in files I've visited recently.

But besides that I don't really use it for much outside of starting from scratch on a new feature or getting helping me with getting a plan together before starting working on something I may be unfamiliar with.

We have access to all models available through copilot including o3 and o1, and access to chatgpt enterprise, and I do find using it via the chat interface nice just for architecting and planning. But I usually do the actual coding with help from autocompletion since it honestly takes longer to try to wrangle it into doing the correct thing than doing it myself with a little bit of its help.

ragle|1 year ago

This makes sense. I've mostly been successful doing these sorts of things as well and really appreciate the way it saves me some typing (even in cases where I only keep 40-80% of what it writes, this is still a huge savings).

It's when I try to give it a clear, logical specification for a full feature and expect it to write everything that's required to deliver that feature (or the entirety of slightly-more-than-non-trivial personal project) that it falls over.

I've experimented trying to get it to do this (for features or personal projects that require maybe 200-400 LOC) mostly just to see what the limitations of the tool are.

Interestingly, I hit a wall with GPT-4 on a ~300 LOC personal project that o3-mini-high was able to overcome. So, as you'd expect - the models are getting better. Pushing my use case only a little bit further with a few more enhancements, however, o3-mini-high similarly fell over in precisely the same ways as GPT-4, only a bit worse in the volume and severity of errors.

The improvement between GPT-4 and o3-mini-high felt nominally incremental (which I guess is what they're claiming it offers).

Just to say: having seen similar small bumps in capability over the last few years of model releases, I tend to agree with other posters that it feels like we'll need something revolutionary to deliver on a lot of the hype being sold at the moment. I don't think current LLM models / approaches are going to cut it.