Show HN: Langfuse – Open-source observability and analytics for LLM apps
143 points | marcklingen | 2 years ago | github.com | reply
Langfuse makes capturing and viewing LLM calls (execution traces) a breeze. On top of this data, you can analyze the quality, cost and latency of LLM apps.
When GPT-4 dropped, we started building LLM apps – a lot of them! [1, 2] But they all suffered from the same issue: it’s hard to assure quality in 100% of cases and even to have a clear view of user behavior. Initially, we logged all prompts/completions to our production database to understand what works and what doesn’t. We soon realized we needed more context, more data and better analytics to sustainably improve our apps. So we started building a homegrown tool.
Our first task was to track and view what is going on in production: what user input is provided, how prompt templates or vector db requests work, and which steps of an LLM chain fail. We built async SDKs and a slick frontend to render chains in a nested way. It’s a good way to look at LLM logic ‘natively’. Then we added some basic analytics to understand token usage and quality over time for the entire project or single users (pre-built dashboards).
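To make the nested-trace idea concrete, here is a minimal sketch of how a chain execution could be modeled as a tree of spans. The class and field names are illustrative only, not the actual Langfuse SDK API:

```python
# Hypothetical data model for a nested LLM execution trace (illustrative,
# not the real Langfuse SDK): each step of a chain is a span, and spans
# nest to mirror the chain's structure.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str                      # e.g. "vector-db-lookup", "llm-call"
    input: Optional[str] = None
    output: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def span(self, name: str, **kw) -> "Span":
        # Create a nested child span and return it for further use.
        child = Span(name=name, **kw)
        self.children.append(child)
        return child


# A chain run becomes one root span with the individual steps nested inside:
trace = Span(name="qa-chain", input="What is Langfuse?")
retrieval = trace.span("vector-db-lookup", input="What is Langfuse?")
retrieval.output = "3 matching docs"
llm = trace.span("llm-call", input="<prompt template + retrieved docs>")
llm.output = "Langfuse is an observability tool for LLM apps."
```

Rendering such a tree directly is what makes the frontend able to show chains "natively" rather than as a flat log.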
Under the hood, we use the T3 stack (TypeScript, Next.js, Prisma, tRPC, Tailwind, NextAuth), which lets us move fast and makes it easy to contribute to our repo. The SDKs are heavily influenced by the design of the PostHog SDKs [3] for stable implementations of async network requests. Converting OpenAPI specs to boilerplate Python code was a surprisingly inconvenient experience, and we ended up using Fern [4] for it. We’re fans of Tailwind + shadcn/ui + tremor.so for speed and flexibility when building tables and dashboards.
Our SDKs run fully asynchronously and make network requests in the background. We did our best to reduce any impact on application performance to a minimum. We never block the main execution path.
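The fire-and-forget pattern described above can be sketched like this: events go onto an in-memory queue, and a background thread batches them out, so the caller never waits on the network. The names here are illustrative, not the real SDK; the `_sent` list stands in for the HTTP POSTs a real client would make:

```python
# Sketch of a non-blocking event client (illustrative, not the Langfuse SDK):
# capture() only enqueues; a daemon worker thread drains the queue in batches.
import queue
import threading


class AsyncClient:
    def __init__(self, flush_at: int = 10):
        self._queue: "queue.Queue" = queue.Queue()
        self._flush_at = flush_at
        self._sent = []  # stand-in for batched HTTP POSTs in this sketch
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def capture(self, event: dict) -> None:
        # Non-blocking: enqueue and return immediately.
        self._queue.put(event)

    def _run(self) -> None:
        batch = []
        while True:
            item = self._queue.get()
            if item is None:              # shutdown sentinel
                if batch:
                    self._sent.append(batch)
                break
            batch.append(item)
            if len(batch) >= self._flush_at:
                # A real client would POST here, with retries/backoff.
                self._sent.append(batch)
                batch = []

    def shutdown(self) -> None:
        # Flush remaining events and stop the worker.
        self._queue.put(None)
        self._worker.join()


client = AsyncClient(flush_at=2)
for i in range(4):
    client.capture({"event": i})
client.shutdown()
```

The daemon flag ensures a crashed or exiting app never hangs on the logging thread; the explicit `shutdown()` flushes whatever is still queued.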
We've made two engineering decisions we feel uncertain about: using a Postgres database and Looker Studio for the analytics MVP. Supabase performs well at our scale and integrates seamlessly into our tech stack. We will need to move to an OLAP database soon and are debating whether we need to start batching ingestion and whether we can keep using Vercel. Any experience you could share would be helpful!
Integrating Looker Studio got us to our first analytics charts in half a day. As it is not open source and does not integrate with our UI/UX, we are looking to replace it with an OSS solution for flexibly generating charts and dashboards. We’ve had a look at Lightdash and would be happy to hear your thoughts.
We’re borrowing our OSS business model from PostHog/Supabase, who make it easy to self-host, with some features reserved for enterprise (no plans yet) and a paid version for the managed cloud service. Right now all of our code is available under a permissive license (MIT).
Next, we’re going deep on analytics. For quality specifically, we will build out model-based evaluations and labeling to be able to cluster traces by scores and use cases.
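The model-based evaluation loop mentioned above could be sketched as follows; `score_with_model` here is a hypothetical stand-in for an LLM-as-judge call, not Langfuse's actual implementation, and the heuristic inside it exists only to make the sketch runnable:

```python
# Illustrative sketch (not the Langfuse implementation): attach model-based
# scores to traces so they can be clustered by quality.
from collections import defaultdict


def score_with_model(output: str) -> float:
    # Placeholder heuristic; in practice this would call an eval model
    # (e.g. an LLM judging relevance/correctness of the output).
    return 1.0 if output and not output.startswith("I don't know") else 0.0


traces = [
    {"id": "t1", "output": "Langfuse traces LLM calls."},
    {"id": "t2", "output": "I don't know."},
]

# Group trace ids by their score, so low-quality clusters surface for review.
by_score = defaultdict(list)
for t in traces:
    by_score[score_with_model(t["output"])].append(t["id"])
```

Once traces carry scores, the same grouping works for use-case labels, letting dashboards slice quality per feature or per user segment.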
Looking forward to hearing your thoughts and discussion – we’ll be in the comments. Thanks!
[1] https://learn-from-ai.com/
[2] https://www.loom.com/share/5c044ca77be44ff7821967834dd70cba
[+] [-] phillipcarter|2 years ago|reply
I noticed your SDKs use tracing concepts! Are there plans to implement OpenTelemetry support?
[+] [-] jayunit|2 years ago|reply
Are there any alternatives you’d also suggest evaluating, and any particular strengths/weaknesses we should consider?
I’m also curious about quality metrics, benchmarking, regression testing, and skew measurement. I’ll dig further into the Langfuse documentation (just watched the video so far), but I’d love any additional recommendations based on that.
[+] [-] marcklingen|2 years ago|reply
For those reading this thread later, feel free to reach out with any feedback or questions marc at langfuse dot com
[+] [-] addisonj|2 years ago|reply
I have quite a few years of observability experience behind me and hadn't really considered some of the unique aspects that LLMs bring into the picture. Here are a few thoughts, responses to your questions, and feedback items:
* Generally, I think you do a good job of having a clear, concise story and value proposition, fairly early in a market where the number of people hitting these problems is rapidly growing, which is a pretty nice place to be! But being early can also be a challenge: you have to help people recognize the problem, which often means lots of content and lots of outreach.
* I think going open-source and following a PLG model of cloud/managed services is a pretty reasonable way to go and can certainly be a leg up over the existing players, but I noticed in your pricing a note about enterprise support for self-hosting in a customer VPC and dedicated instances. There is lots of money there... but it can also be an extremely big time sink for early-stage teams, so I would be careful, or at least make sure you price it such that it supports hiring.
* Also on pricing, I wonder whether storage is how people think about this. Generally, I think about observability data in terms of events/sec first and then retention period. If you can make it work with a single usage-based metric of storage, then that's great! But I would be concerned that 1) you aren't telling the user what throughput each plan supports and 2) you could end up with large variance in cost across different usage patterns.
* The biggest question I have is: how much did you explore OpenTelemetry? Obviously, it is not as simple as just going and building your own API and SDK... but when I look at the capabilities, I could see OpenTelemetry being the underlying protocol with some thinner convenience wrappers on top. From your other comments, I understand that you see ways in which this data differs from typical trace/observability data, but I do wonder if that choice will 1) scare off some companies that are already "all in" on otel and 2) cut you off from the ecosystem around otel, for example Kafka integration if you someday need it.
* As far as your question about OLAP, I wouldn't rush it... In general, once you are big enough that the cost/scalability limitations of PG are looming, you will be a different company and know a lot more about the real requirements. I will also say that in all likelihood, ClickHouse is probably the right choice, but even knowing that, there are lots of different ways to tackle that problem (like using hosted vs self-managed) and the right way to do it will depend on usage patterns, cost structure, where you end up with enterprise dedicated / self-hosted, etc. I will mention though that timescaledb is not a bad way to maybe buy you a bit of headroom, but it is important to note that the timescaledb offered by supabase shouldn't be compared to timescaledb community / cloud. The supabase version isn't bad, it just isn't quite the same thing (i.e. no horizontal scalability)
Anyways, congrats again! It looks like you are off to a good start.
If you have any other questions for me, my email is in my profile.
[+] [-] hrpnk|2 years ago|reply
On the other hand, if you target just the applications that implement an API behind an LLM, you will have customers expecting value-added services on top of telemetry, like prompt optimization, classification, result caching, etc.
Your choice which direction and target group you will focus on first.
[+] [-] mdeichmann|2 years ago|reply
> About value prop: Thanks for the feedback! We are already trying to be as vocal about it as possible by writing great docs etc. but can probably do better.
> PLG & OSS: thanks for the hint, we will be careful around managing deployments within customer VPCs.
> Pricing: We currently picked storage as the first metric to price on, as it varies a lot across users. Some use Langfuse to track complex embedding processes with a lot of context; others just track simple chat messages with relatively low-context, low-value events.
> OTel: We looked into it but did not go into all the details. We wanted to have a product out there fast and liked the experience of e.g. Posthog SDKs. I might reach out to you concerning this topic after investing more time on it. Thanks for the offer!
> OLAP: Agree, I also learned to tackle scaling issues once they appear, and so far we are good. Interesting that Supabase's TimescaleDB has no horizontal scaling. That would be one of the main reasons to use it, IMO.
[+] [-] pranay01|2 years ago|reply
Also, how do you compare in terms of features with DataDog's LLM monitoring product which was launched recently?
Disclaimer: I am a maintainer at SigNoz
[+] [-] steventey|2 years ago|reply
Highly recommend https://tinybird.com for this – they're a fantastic OLAP DB for ingesting & visualizing time-series data!