I’ve worked on and shipped a few AI systems that reached real users.
This post isn’t about models or prompts. It’s about the things that kept breaking once AI moved off the happy path: async jobs, retries, silent failures, provider outages, cost blowups, and debugging without visibility.
I wrote this mostly as a way to document the mistakes I made and what I wish I had known earlier. Happy to answer questions or dig deeper into any of the failure modes.
akarshc|1 month ago