top | item 47134093

zkmon | 6 days ago

I think the failure comes from reasoning about where the car is and whether it needs to be moved to a different place. So it's not surprising that only models with strong reasoning pass the test.
