abossy | 6 months ago

At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval on 10 real GitHub issues so that we could measure it against Claude Code, and the differences were surprisingly large. You can see our write-up here:

https://charlielabs.ai/research/gpt-5

Our tasks often take 30-45 minutes, and the model can handle massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.

While 10 issues isn't especially comprehensive, we found the results directionally very impressive, and we'll likely build on this eval to better understand performance going forward.

bartman|6 months ago

I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on various screens I use and made it impossible for me to read your article.

For better accessibility and a safer experience[1], I would recommend not animating the background, or at least making it easy to toggle off.

[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/G...

neom|6 months ago

Removed. Sorry, and thank you for the feedback.

RyanHamilton|6 months ago

Did you sign any kind of agreement with a non-disparagement clause to get early access? I'm asking because if you did, your data point isn't useful. It would mean anyone else who tried it and got worse results wouldn't be able to post here. We would just be seeing the successful data points.

htrp|6 months ago

Even if they didn't, overly critical or negative commentary would mean their removal from the list of trusted testers.

TechDebtDevin|6 months ago

Waiting 30-45 minutes for code that you're still going to have to read from top to bottom to make sure it doesn't have anything dumb in it does not seem like a productivity enhancement. I would quit if I were an engineer and were told to do this.

rantallion|6 months ago

If you're doing nothing in that 30-45 minutes other than stare at a loading screen, you're doing it wrong.

I'm not sold on the efficacy of AI and I share your reservations about having to scrutinise its output, but I see great value in being able to offload a long-running task to someone/something else and only have to check back later. In the meantime, I can be doing something else - like sitting in those planning meetings we all enjoy!