Show HN: Langfuse – Open-source observability and analytics for LLM apps
143 points | marcklingen | 2 years ago | github.com | reply
Langfuse makes capturing and viewing LLM calls (execution traces) a breeze. On top of this data, you can analyze the quality, cost and latency of LLM apps.
When GPT-4 dropped, we started building LLM apps – a lot of them! [1, 2] But they all suffered from the same issue: it’s hard to assure quality in 100% of cases and even to have a clear view of user behavior. Initially, we logged all prompts/completions to our production database to understand what works and what doesn’t. We soon realized we needed more context, more data and better analytics to sustainably improve our apps. So we started building a homegrown tool.
Our first task was to track and view what is going on in production: what user input is provided, how prompt templates or vector db requests work, and which steps of an LLM chain fail. We built async SDKs and a slick frontend to render chains in a nested way. It’s a good way to look at LLM logic ‘natively’. Then we added some basic analytics to understand token usage and quality over time for the entire project or single users (pre-built dashboards).
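To make the nested-trace idea concrete, here is a minimal sketch of how a chain execution could be modeled as a tree of spans. The class and field names are illustrative only, not the actual Langfuse SDK API:

```python
# Hypothetical data model for a nested LLM execution trace (illustrative,
# not the real Langfuse SDK): each step of a chain is a span, and spans
# nest to mirror the chain's structure.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str                      # e.g. "vector-db-lookup", "llm-call"
    input: Optional[str] = None
    output: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def span(self, name: str, **kw) -> "Span":
        # Create a nested child span and return it for further use.
        child = Span(name=name, **kw)
        self.children.append(child)
        return child


# A chain run becomes one root span with the individual steps nested inside:
trace = Span(name="qa-chain", input="What is Langfuse?")
retrieval = trace.span("vector-db-lookup", input="What is Langfuse?")
retrieval.output = "3 matching docs"
llm = trace.span("llm-call", input="<prompt template + retrieved docs>")
llm.output = "Langfuse is an observability tool for LLM apps."
```

Rendering such a tree directly is what makes the frontend able to show chains "natively" rather than as a flat log.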
Under the hood, we use the T3 stack (TypeScript, Next.js, Prisma, tRPC, Tailwind, NextAuth), which lets us move fast and makes it easy to contribute to our repo. The SDKs are heavily influenced by the design of the PostHog SDKs [3] for stable implementations of async network requests. Converting OpenAPI specs to boilerplate Python code was a surprisingly inconvenient experience, and we ended up using Fern [4] for it. We’re fans of Tailwind + shadcn/ui + tremor.so for speed and flexibility when building tables and dashboards.
Our SDKs run fully asynchronously and make network requests in the background. We did our best to reduce any impact on application performance to a minimum. We never block the main execution path.
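The fire-and-forget pattern described above can be sketched like this: events go onto an in-memory queue, and a background thread batches them out, so the caller never waits on the network. The names here are illustrative, not the real SDK; the `_sent` list stands in for the HTTP POSTs a real client would make:

```python
# Sketch of a non-blocking event client (illustrative, not the Langfuse SDK):
# capture() only enqueues; a daemon worker thread drains the queue in batches.
import queue
import threading


class AsyncClient:
    def __init__(self, flush_at: int = 10):
        self._queue: "queue.Queue" = queue.Queue()
        self._flush_at = flush_at
        self._sent = []  # stand-in for batched HTTP POSTs in this sketch
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def capture(self, event: dict) -> None:
        # Non-blocking: enqueue and return immediately.
        self._queue.put(event)

    def _run(self) -> None:
        batch = []
        while True:
            item = self._queue.get()
            if item is None:              # shutdown sentinel
                if batch:
                    self._sent.append(batch)
                break
            batch.append(item)
            if len(batch) >= self._flush_at:
                # A real client would POST here, with retries/backoff.
                self._sent.append(batch)
                batch = []

    def shutdown(self) -> None:
        # Flush remaining events and stop the worker.
        self._queue.put(None)
        self._worker.join()


client = AsyncClient(flush_at=2)
for i in range(4):
    client.capture({"event": i})
client.shutdown()
```

The daemon flag ensures a crashed or exiting app never hangs on the logging thread; the explicit `shutdown()` flushes whatever is still queued.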
We've made two engineering decisions we feel uncertain about: using a Postgres database and Looker Studio for the analytics MVP. Supabase performs well at our scale and integrates seamlessly into our tech stack. We will need to move to an OLAP database soon and are debating whether we need to start batching ingestion and whether we can keep using Vercel. Any experience you could share would be helpful!
Integrating Looker Studio got us to our first analytics charts in half a day. As it is not open source and does not integrate with our UI/UX, we are looking to replace it with an OSS solution for flexibly generating charts and dashboards. We’ve had a look at Lightdash and would be happy to hear your thoughts.
We’re borrowing our OSS business model from PostHog/Supabase, who make it easy to self-host, with some features reserved for enterprise (no plans yet) and a paid version for the managed cloud service. Right now all of our code is available under a permissive license (MIT).
Next, we’re going deep on analytics. For quality specifically, we will build out model-based evaluations and labeling to be able to cluster traces by scores and use cases.
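The model-based evaluation loop mentioned above could be sketched as follows; `score_with_model` here is a hypothetical stand-in for an LLM-as-judge call, not Langfuse's actual implementation, and the heuristic inside it exists only to make the sketch runnable:

```python
# Illustrative sketch (not the Langfuse implementation): attach model-based
# scores to traces so they can be clustered by quality.
from collections import defaultdict


def score_with_model(output: str) -> float:
    # Placeholder heuristic; in practice this would call an eval model
    # (e.g. an LLM judging relevance/correctness of the output).
    return 1.0 if output and not output.startswith("I don't know") else 0.0


traces = [
    {"id": "t1", "output": "Langfuse traces LLM calls."},
    {"id": "t2", "output": "I don't know."},
]

# Group trace ids by their score, so low-quality clusters surface for review.
by_score = defaultdict(list)
for t in traces:
    by_score[score_with_model(t["output"])].append(t["id"])
```

Once traces carry scores, the same grouping works for use-case labels, letting dashboards slice quality per feature or per user segment.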
Looking forward to hearing your thoughts and discussion – we’ll be in the comments. Thanks!
[1] https://learn-from-ai.com/
[2] https://www.loom.com/share/5c044ca77be44ff7821967834dd70cba
[+] [-] phillipcarter|2 years ago|reply
I noticed your SDKs use tracing concepts! Are there plans to implement OpenTelemetry support?
[+] [-] jayunit|2 years ago|reply
Are there any alternatives you’d also suggest evaluating, and any particular strengths/weaknesses we should consider?
I’m also curious about quality metrics, benchmarking, regression testing, and skew measurement. I’ll dig further into the Langfuse documentation (just watched the video so far), but I’d love any additional recommendations based on that.
[+] [-] marcklingen|2 years ago|reply
For those reading this thread later, feel free to reach out with any feedback or questions marc at langfuse dot com
[+] [-] addisonj|2 years ago|reply
I have quite a few years of observability experience behind me and hadn't really considered some of the unique aspects that LLMs bring into the picture. Here are a few thoughts, responses to your questions, and feedback items:
* Generally, I think you do a good job of having a clear, concise story and value proposition, fairly early in a market where the number of people hitting these problems is rapidly growing, which is a pretty nice place to be! But being early can also be a challenge: you have to help people recognize the problem, which often means lots of content and lots of outreach.
* I think going open-source and following a PLG model of cloud/managed services is a pretty reasonable way to go and can certainly be a leg up over the existing players, but I noticed in your pricing a note about enterprise support for self-hosting in a customer VPC and dedicated instances. There is lots of money there... but it can also be an extremely big time sink for early-stage teams, so I would be careful, or at least make sure you price it such that it supports hiring.
* Also on pricing, I wonder whether storage is how people think about this. Generally, I think about observability data in terms of events/sec first and then retention period. If you can make it work with a single usage-based metric of storage, then that's great! But I would be concerned that 1) you aren't telling the user what throughput each plan supports and 2) you could end up with large variance in cost across different usage patterns.
* The biggest question I have is: how much did you explore OpenTelemetry? Obviously, it is not as simple as just going and building your own API and SDK... but when I look at the capabilities, I could see OpenTelemetry being the underlying protocol with some thinner convenience wrappers on top. From your other comments, I understand that you see ways in which this data differs from typical trace/observability data, but I do wonder if that choice will 1) scare off some companies that are already "all in" on otel and 2) cut you off from the ecosystem around otel, for example Kafka integration if you someday need it.
* As far as your question about OLAP, I wouldn't rush it... In general, once you are big enough that the cost/scalability limitations of PG are looming, you will be a different company and know a lot more about the real requirements. I will also say that in all likelihood, ClickHouse is probably the right choice, but even knowing that, there are lots of different ways to tackle that problem (like using hosted vs self-managed) and the right way to do it will depend on usage patterns, cost structure, where you end up with enterprise dedicated / self-hosted, etc. I will mention though that timescaledb is not a bad way to maybe buy you a bit of headroom, but it is important to note that the timescaledb offered by supabase shouldn't be compared to timescaledb community / cloud. The supabase version isn't bad, it just isn't quite the same thing (i.e. no horizontal scalability)
Anyways, congrats again! It looks like you are off to a good start.
If you have any other questions for me, my email is in my profile.
[+] [-] hrpnk|2 years ago|reply
On the other hand, if you target just the applications that implement an API behind an LLM, you will have customers expecting value-added services on top of telemetry, like prompt optimization, classification, result caching, etc.
Your choice which direction and target group you will focus on first.
[+] [-] mdeichmann|2 years ago|reply
> About value prop: Thanks for the feedback! We are already trying to be as vocal about it as possible by writing great docs etc. but can probably do better.
> PLG & OSS: thanks for the hint, we will be careful around managing deployments within customer VPCs.
> Pricing: We currently picked storage as the first metric to price on, as it varies a lot across users. Some use Langfuse to track complex embedding processes with a lot of context; others just track simple chat messages with relatively low-context, low-value events.
> OTel: We looked into it but did not go into all the details. We wanted to have a product out there fast and liked the experience of e.g. Posthog SDKs. I might reach out to you concerning this topic after investing more time on it. Thanks for the offer!
> OLAP: Agree, I also learned to tackle scaling issues once they appear, and so far we are good. Interesting that Supabase's TimescaleDB has no horizontal scaling. That would be one of the main reasons to use it, IMO.
[+] [-] pranay01|2 years ago|reply
Also, how do you compare in terms of features with DataDog's LLM monitoring product which was launched recently?
Disclaimer: I am a maintainer at SigNoz
[+] [-] steventey|2 years ago|reply
Highly recommend https://tinybird.com for this – they're a fantastic OLAP DB for ingesting & visualizing time-series data!