top | item 43721175

(no title)

hmaxwell | 10 months ago

I did some testing this morning:

Prompt: "can you find any mistakes on my codebase? I put one in there on purpose" + 70,000 tokens of codebase where in one line I have an include for a non-existent file.

Claude 3.7 Sonnet (Try 1 - new chat) Claimed it Found a race condition in the api of the admin interface that would be triggered if two admins were to change the room order at the same time. Claude suggested I group all sql queries in a single transaction. I looked at the code and found that it already used a transaction for all queries. I said: the order_update api is already done with a transaction. Claude replied: "You're absolutely right, and I apologize for my mistake. I was incorrect to claim there was a race condition issue. The transaction ensures atomicity and consistency of the updates, and the SQL queries are properly structured for their intended purpose."

Claude 3.7 Sonnet (Try 2 - new chat) Found a comment in the admin ui javascript code that suggested a potential alternative to event handler cleanup that was not implemented because I decided to go with a cleaner route. Then asked "Is this the issue you intentionally included, or would you like me to look for other potential problems?" I said: "The comment merely suggests an alternative, right?" claude said: "Yes, you're absolutely right. The comment is merely suggesting an alternative approach that isn't being used in the code, rather than indicating a mistake. So there's no actual bug or mistake in this part of the code - just documentation of different possible approaches. I apologize for misinterpreting this as an issue!"

Claude 3.7 Sonnet (Try 3 - new chat) When processing items out of the database to generate QR codes in the admin interface, Claude says that my code both attempts to generate QR codes with undefined data AS WELL AS saying that my error handling skips undefined data. Claude contradicts itself within 2 sentences. When asking about clarification Claude replies: Looking at the code more carefully, I see that the code actually has proper error handling. I incorrectly stated that it "still attempts to call generateQRCode()" in the first part of my analysis, which was wrong. The code properly handles the case when there's no data-room attribute.

Gemnini Advanced 2.5 Pro (Try 1 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.

Gemnini Advanced 2.5 Pro (Try 2 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.

Gemnini Advanced 2.5 Pro (Try 3 - new chat) Found the intentional error and said I should stop putting db creds/api keys into the codebase.

o4-mini-high and o4-mini and o3 and 4.5 and 4o - "The message you submitted was too long, please reload the conversation and submit something shorter."

discuss

order

Tiberium|10 months ago

The thread is about 2.5 Flash though, not 2.5 Pro. Maybe you can try again with 2.5 Flash specifically? Even though it's a small model.

dyauspitr|10 months ago

I don’t particularly care about the non frontier models though, I found the comment very useful.

danielbln|10 months ago

Those responses are very Claude, to. 3.7 has powered our agentic workflows for weeks, but I've been using almost only Gemini for the last week and feel the output is better generally. It's gotten much better at agentic workflows (using 2.0 in an agent setup was not working well at all) and I prefer its tuning over Clause's, more to the point and less meandering.

bambax|10 months ago

> codebase where in one line I have an include for a non-existent file

Ok but you don't need AI for this; almost any IDE will issue a warning for that kind of error...

rendang|10 months ago

3 different answers in 3 tries for Claude? Makes me curious how many times you'd get the same answer if you asked 10/20/100 times

airstrike|10 months ago

Have you tried Claude Code?

fandorin|10 months ago

how did you put your whole codebase in a prompt for gemini?