Launch HN: Lucidic (YC W25) – Debug, test, and evaluate AI agents in production
116 points | AbhinavX | 7 months ago
Here is a demo: https://youtu.be/Zvoh1QUMhXQ.
Getting started takes just one line of code: call lai.init() in your agent code and log into the dashboard. You'll see traces of each run, cumulative trends across sessions, built-in or custom evals, and grouped failure modes. Call lai.create_step() with any metadata you want (memory snapshots, tool outputs, stateful info) and we'll index it for debugging.
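To make the call shape concrete, here is a minimal stand-in sketch of that instrumentation flow. The real SDK exposes lai.init() and lai.create_step(); everything below (the Tracer class, its fields, the parameter names) is an illustrative assumption, not Lucidic's actual implementation:

```python
# Stand-in sketch of the instrumentation flow described above.
# The real SDK exposes lai.init() and lai.create_step(); this Tracer
# class and its fields are illustrative, not Lucidic's implementation.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Tracer:
    session: str
    steps: list = field(default_factory=list)

    def create_step(self, **metadata: Any) -> None:
        # Each step carries arbitrary metadata (memory snapshots,
        # tool outputs, stateful info) that the backend can index.
        self.steps.append(metadata)


def init(session: str) -> Tracer:  # analogue of lai.init()
    return Tracer(session)


lai = init("checkout-agent")
lai.create_step(state="cart", tool_output={"items": 2})
lai.create_step(state="checkout", memory={"last_error": None})
print(len(lai.steps))  # 2 steps recorded for this session
```

The point is only that each step is a free-form metadata bag tied to a session, which is what makes the later grouping and debugging possible.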
We did NLP research at Stanford AI Lab (SAIL), where we built an AI agent (with fine-tuned models and DSPy) to solve math olympiad problems (focusing on AIME/USAMO), and we realized debugging these agents was hard. The last straw was when we built an e-commerce agent that could buy items online. It kept failing at checkout, and every one-line change (tweaking a prompt, switching to Llama, adjusting tool logic) meant another 10-minute rerun just to see if we hit the same checkout page.
At that point we all agreed this sucked, so we set out to improve agent interpretability with better debugging, monitoring, and evals.
We started by listening to users, who told us traditional LLM observability platforms don't capture the complexity of agents. Agents have tools, memories, and events, not just input/output pairs. So we automatically transform OTel (and/or regular) agent logs into interactive graph visualizations that cluster similar states based on memory and action patterns. We heard that people wanted to test small changes even with the graphs, so we created "time traveling," where you can modify any state (memory contents, tool outputs, context), then re-simulate 30–40 times to see outcome distributions. We embed the responses, cluster by similarity, and show which modifications lead to stable vs. divergent behaviors.
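The "re-simulate and cluster" step can be sketched in a few lines: embed each re-run's outcome, greedily cluster by cosine similarity, and call a modification "stable" when most runs land in one cluster. The threshold and the toy embeddings are illustrative assumptions; Lucidic's actual clustering algorithm isn't described in the post:

```python
# Toy sketch: cluster outcome embeddings from repeated simulations.
# Threshold and embeddings are illustrative, not Lucidic's algorithm.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def cluster(embeddings, threshold=0.95):
    clusters = []  # each cluster's first member acts as its centroid
    for e in embeddings:
        for c in clusters:
            if cosine(c[0], e) >= threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters


# 5 simulated outcome embeddings: four near-identical, one divergent
runs = [[1.0, 0.1], [0.99, 0.12], [1.0, 0.09], [0.98, 0.11], [0.1, 1.0]]
clusters = cluster(runs)
largest = max(len(c) for c in clusters)
print(f"{len(clusters)} clusters; stability {largest / len(runs):.0%}")
```

A modification whose runs collapse into one big cluster is stable; one that scatters across many small clusters is divergent.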
Then we saw people running their agent 10 times on the same task, watching each run individually, and wasting hours looking at mostly repeated states. So we built trajectory clustering on similar state embeddings (like similar tools or memories) to surface behavioral patterns across mass simulations.
We then use that to create a force-directed layout that automatically groups similar paths your agent took, which displays states as nodes, actions as edges, and failure probability as color intensity. The clusters make failure patterns obvious; you see trends across hundreds of runs, not individual traces.
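The aggregation behind that layout can be sketched as collapsing many runs into one graph: identical state names merge into a node, consecutive state pairs become weighted edges, and each node tracks how often runs through it ended in failure (rendered as color intensity in the UI). The data and structure below are illustrative assumptions:

```python
# Sketch: collapse many runs into one graph of states (nodes) and
# transitions (edges), with a per-state failure rate. Illustrative data.
from collections import Counter, defaultdict

runs = [
    (["start", "search", "cart", "checkout"], "fail"),
    (["start", "search", "cart", "checkout"], "fail"),
    (["start", "search", "cart", "done"], "ok"),
]

edges = Counter()
failures = defaultdict(lambda: [0, 0])  # state -> [failing runs, total runs]

for path, outcome in runs:
    for a, b in zip(path, path[1:]):
        edges[(a, b)] += 1  # repeated transitions become heavier edges
    for state in path:
        failures[state][1] += 1
        if outcome == "fail":
            failures[state][0] += 1

print(edges[("cart", "checkout")])  # 2 runs took this transition
print(failures["checkout"][0] / failures["checkout"][1])  # 1.0: always fails
```

With hundreds of runs folded in this way, a consistently failing state like "checkout" stands out as a hot node rather than something you have to spot trace by trace.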
Finally, when people saw our observability features, they naturally wanted evaluation capabilities. So we developed a concept for people to make their own evals called "rubrics," which lets you define specific criteria, assign weights to each criterion, and set score definitions, giving you a structured way to measure agent performance against your exact requirements.
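A minimal sketch of such a rubric, assuming a simple schema (criteria with weights and per-criterion scores, combined into one normalized session score); the field names here are illustrative, not Lucidic's actual rubric format:

```python
# Minimal weighted-rubric sketch. Field names are illustrative
# assumptions, not Lucidic's actual rubric schema.
rubric = [
    {"criterion": "reached checkout page", "weight": 3},
    {"criterion": "used correct payment tool", "weight": 2},
    {"criterion": "no repeated states (loops)", "weight": 1},
]


def score_session(rubric, scores):
    """Combine per-criterion scores (0.0..1.0) into one weighted score."""
    total = sum(c["weight"] for c in rubric)
    earned = sum(c["weight"] * scores[c["criterion"]] for c in rubric)
    return earned / total


s = score_session(rubric, {
    "reached checkout page": 1.0,
    "used correct payment tool": 0.5,
    "no repeated states (loops)": 0.0,
})
print(round(s, 2))  # 0.67
```

Weighting lets you encode that reaching checkout matters more than avoiding loops, rather than treating every criterion equally.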
To evaluate these criteria, we used our own platform to build an investigator agent that reviews your criteria and evaluates performance much more effectively than traditional LLM-as-a-judge approaches.
To get started, visit dashboard.lucidic.ai and https://docs.lucidic.ai/getting-started/quickstart. You can use it for free for 1,000 event and step creations.
We look forward to your thoughts! And don't hesitate to reach out at team@lucidic.ai.
iancarroll|7 months ago
But our stack is in Go, and it has been tough to see so many observability tools focus on Python rather than an agnostic endpoint proxy like Helicone has.
0xdeafcafe|7 months ago
If you're interested, our Go SDK has full support for OpenAI and any OpenAI-compatible endpoints, as well as OpenTelemetry tracing support.
https://github.com/langwatch/langwatch/tree/main/sdk-go https://github.com/langwatch/langwatch/tree/main/sdk-go/inst...
AbhinavX|7 months ago
We then take all the information you give us and transform it in the backend, e.g. grouping together similar nodes, running an agent to evaluate a session, or finding the root cause of a session failure.
AbhinavX|7 months ago
First, while LLMs simply respond to prompts, agents often get stuck in behavioral loops where they repeat the same actions; to address this, we built a graph visualization that automatically detects when an agent reaches the same state multiple times and groups these occurrences together, making loops immediately visible.
Second, our evaluations are much more tailored to AI agents. LLM-ops evaluations usually occur at a per-prompt level (e.g. hallucination, QA correctness), which makes sense for those use cases, but agent evaluations are usually per session or run. Often a single prompt in isolation didn't cause the issue; some downstream memory problem or previous action caused the current tool call to fail. So we spent a lot of time creating a way for you to define a rubric. Then, to evaluate the rubric without context overload, we built an agentic pipeline with tools like viewing rubric examples, zooming "in and out" of a session, and referencing previous examples.
Third, time traveling and clustering of similar responses. LLM debugging is straightforward because prompts are stateless and independent from one another, but agents maintain complex state through tools, context, and memory management. We solved this by creating "time travel" functionality that captures the complete agent state at any point, letting developers modify variables like context or tool availability, replay from that exact moment, simulate it 20–30 times, and group similar responses with our clustering algorithm.
Fourth, agents exhibit far more non-deterministic behavior than LLMs because a single tool call can completely change their trajectory; to handle this complexity, we developed workflow trajectory clustering that groups similar execution paths together, helping developers identify patterns and edge cases that would be impossible to spot in traditional LLM systems.
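The loop detection from the first point can be sketched as indexing every visit to each state and flagging states visited more than once; representing a state as a plain string key is an illustrative simplification of whatever state hashing the real system uses:

```python
# Sketch of loop detection: record every visit to each state and flag
# states visited more than once. String state keys are a simplification.
from collections import defaultdict

trajectory = ["search", "cart", "checkout", "error",
              "checkout", "error", "checkout"]

visits = defaultdict(list)
for i, state in enumerate(trajectory):
    visits[state].append(i)

# Any state with multiple visit indices is part of a behavioral loop.
loops = {s: idx for s, idx in visits.items() if len(idx) > 1}
print(loops)  # {'checkout': [2, 4, 6], 'error': [3, 5]}
```

Grouping the repeat visits into one node is what makes a checkout/error cycle jump out of the graph instead of reading as a long flat trace.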
AbhinavX|7 months ago
Haha also our whole backend is in Django :)
Areibman|7 months ago
https://hegel-ai.com https://www.vellum.ai/ https://www.parea.ai http://baserun.ai https://www.traceloop.com https://www.trychatter.ai https://talc.ai https://langfuse.com https://humanloop.com https://uptrain.ai https://athina.ai https://relari.ai https://phospho.ai https://github.com/BerriAI/bettertest https://www.getzep.com https://hamming.ai https://github.com/DAGWorks-Inc/burr https://www.lmnr.ai https://keywordsai.co https://www.thefoundryai.com https://www.usesynth.ai https://www.vocera.ai https://coval.ai https://andonlabs.com https://lucidic.ai https://roark.ai https://dawn.so/ https://www.atla-ai.com https://www.hud.so https://www.thellmdatacompany.com/ https://casco.com https://www.confident-ai.com