
marcklingen | 1 year ago

appreciate your constructive feedback!

> i wonder if there are new ops solutions for the realtime apis popping up

This is something we have already spent quite some time on, both on internal designs and in conversations with teams using Langfuse for realtime applications. IMO the usage patterns are still developing, and data capturing/visualization needs are not yet aligned across teams. What matters: (1) capture streams, (2) for non-text, provide timestamped transcripts/labels, (3) capture the difference between user-time and api-level-time (e.g. when catching up on a stream after having categorized the input first).
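To make (2) and (3) concrete, here is a rough sketch of what a span for a realtime stream could carry. All names here are hypothetical illustrations, not the Langfuse schema: each span records both clocks plus a timestamped transcript.

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptSegment:
    # offsets (seconds) into the audio stream, plus the recognized text/label
    start: float
    end: float
    text: str

@dataclass
class RealtimeSpan:
    name: str
    user_time: float  # when the user actually heard/saw this output
    api_time: float   # when the API emitted it
    segments: list[TranscriptSegment] = field(default_factory=list)

    @property
    def catch_up_lag(self) -> float:
        # positive when playback lags behind the API, e.g. after an
        # upfront categorization step delayed the start of the stream
        return self.user_time - self.api_time

span = RealtimeSpan("assistant_audio", user_time=12.8, api_time=12.3)
span.segments.append(TranscriptSegment(0.0, 1.4, "Sure, let me check that."))
```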

We are excited to build support for this, if you or others have ideas or a wishlist, please add them to this thread: https://github.com/orgs/langfuse/discussions/4757

> retries for instructor like structured outputs mess up the traces, i wonder if they can be tracked and collapsible

Great feedback. Being able to retroactively downrank LLM calls to the `debug` level in order to collapse/hide them by default would be interesting. Added a thread for this here: https://github.com/orgs/langfuse/discussions/4758
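A minimal sketch of that retry pattern: the `trace_log` list and its `level` values stand in for a tracing SDK's observation levels (this is not the Langfuse API); a UI could then collapse anything below `DEFAULT` by default.

```python
import json

# Hypothetical log sink standing in for a tracing SDK.
trace_log: list[dict] = []

def call_llm_with_retries(prompt: str, llm, max_retries: int = 3) -> dict:
    """Retry until the model returns valid JSON; downrank failed attempts."""
    for attempt in range(1, max_retries + 1):
        raw = llm(prompt)
        entry = {"attempt": attempt, "output": raw, "level": "DEFAULT"}
        trace_log.append(entry)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            entry["level"] = "DEBUG"  # retroactively hide this attempt
    raise ValueError("no valid structured output after retries")

# Flaky model stub: fails twice, then returns valid JSON.
outputs = iter(["oops", "{bad", '{"city": "Berlin"}'])
result = call_llm_with_retries("extract the city", lambda p: next(outputs))
```

Only the final, successful attempt keeps the default level, so the noisy retries disappear from the collapsed trace view.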

> chatgpt canvas like "drafting" workflows are on the rise (https://www.latent.space/p/inference-fast-and-slow) and again its noisy to see in a chat flow

Can you share an example trace for this or open a thread on GitHub? Would love to understand this in more detail, as I have seen different trace representations of it -- the best yet was a _git diff_ on a wrapper span for every iteration.
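For reference, that per-iteration diff can be produced with Python's stdlib `difflib` (the `draft_v1`/`draft_v2` labels are illustrative); attaching this to a wrapper span is much less noisy than storing every full draft inline.

```python
import difflib

def draft_diff(prev: str, new: str) -> str:
    """Render the change between two draft iterations as a unified diff."""
    return "\n".join(difflib.unified_diff(
        prev.splitlines(), new.splitlines(),
        fromfile="draft_v1", tofile="draft_v2", lineterm="",
    ))

v1 = "Dear team,\nThe launch is delayed.\nBest"
v2 = "Dear team,\nThe launch is delayed by one week.\nBest"
diff = draft_diff(v1, v2)
```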

> how often do people actually use the feedback tagging and then subsequently finetuning? i always feel guilty that i dont do it yet and wonder when and where i should.

Have not seen a lot of fine-tuning based on user feedback, as the feedback can be noisy and low in frequency (unless a very clear feedback loop is built into the product). The more common workflow I have seen: identify new problems via user feedback -> review them manually -> create llm-as-a-judge or other automated evals for the problem -> select "good" examples for fine-tuning based on a mix of the evals currently running on production data -> sanitize the dataset (e.g. remove PII).
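The selection and sanitization steps of that workflow can be sketched as below. The record fields, eval names, and the 0.8 threshold are illustrative assumptions, not a Langfuse export format, and the PII scrub is deliberately minimal (emails only).

```python
import re

# Toy production records; each carries automated eval scores on a 0-1 scale.
records = [
    {"input": "My email is a@b.com, summarize this", "output": "...",
     "evals": {"judge_helpfulness": 0.9, "format_ok": 1.0}},
    {"input": "Summarize the report", "output": "...",
     "evals": {"judge_helpfulness": 0.4, "format_ok": 1.0}},
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(text: str) -> str:
    # Minimal PII scrub: redact email addresses before export.
    return EMAIL.sub("<redacted>", text)

def select_for_finetuning(records, threshold=0.8):
    """Keep only examples that pass every eval, then sanitize them."""
    dataset = []
    for r in records:
        if all(score >= threshold for score in r["evals"].values()):
            dataset.append({"input": sanitize(r["input"]),
                            "output": sanitize(r["output"])})
    return dataset

dataset = select_for_finetuning(records)
```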

Fine-tuning has been more popular for structured output and SQL generation (clear feedback loop / retries at run-time if the output does not work). Many teams fine-tune for model distillation on all outputs that passed this initial run-time gate, without further quality controls on the training dataset. They then run evals on a test dataset to verify whether the fine-tuned model hits their quality bar.
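A minimal sketch of such a run-time gate for SQL generation, under illustrative assumptions (the `gated_sql` helper, an in-memory SQLite schema, and retries modeled as a list of candidate generations): every output that executes successfully is collected as a distillation example.

```python
import sqlite3

# In-memory schema the generated SQL must run against; a real setup would
# use the production schema. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

distillation_set: list[dict] = []

def gated_sql(question: str, candidates: list[str]):
    """Run-time gate: try candidate generations until one executes; every
    output that passes the gate becomes a distillation training example."""
    for sql in candidates:
        try:
            conn.execute(sql)
        except sqlite3.Error:
            continue  # failed the gate; would trigger a retry in production
        distillation_set.append({"input": question, "output": sql})
        return sql
    return None

sql = gated_sql("total revenue", ["SELECT SUM(totall) FROM orders",
                                  "SELECT SUM(total) FROM orders"])
```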

swyx | 1 year ago

I'm too lazy to send traces for now haha. Maybe in the future when it REALLY bothers me.

Good luck, keep going.