top | item 46974682

(no title)

exfalso | 19 days ago

Interesting. I've had it fail on much simpler tasks.

Example: was writing a flatbuffers routine which translated a simple type schema to fbs reflection schema. I was thinking well this is quite simple, surely Opus would have no trouble with it.

Output looked reasonable, compiled.. and was completely wrong. It seemed to just output random but reasonable looking indices and offsets. It also inserted in one part of the code a literal TODO saying "someone who understands fbs reflection should write this". Had to write it from scratch.

Another example: was writing a fuzzer for testing a certain computation. In this case, there was existing code to look at (working fuzzers for slighly different use cases), but the main logic had to be somewhat different. Opus managed to do the copy paste and then messed up the only part where it had to be a bit more creative. Again, showing the limitation of where it starts breaking. Overall I actually considered this a success, because I didn't have to deal with the "boring" bit.

Another example: colleague was using Claude to write a feature that output some error information from an otherwise completely encrypted computation. Claude proceeded to insert a global backdoor into the encryption, only caught in review. The inserted comments even explained the backdoor.

I would describe a success story if there was one. But aside from throwing together simple react frontends and SQL queries (highly copy-pasteable recurring patterns in the training set) I had literally zero success. There is an invisible ceiling.

discuss

No comments yet.