top | item 40103671

jusgu | 1 year ago

Great work! One thing that would be incredibly useful/interesting would be generating a reusable script with an LLM, instead of just grabbing the data. In theory, this should result in a massive cost reduction (no need to call the LLM every time) as long as the page's source doesn't change, which would make it sustainable for constant and frequent monitoring.

andrew_zhong | 1 year ago

I’ve worked on this exact problem when extracting feeds from news websites. Yes, calling an LLM each time is costly, so I use the LLM only the first time, to extract robust CSS selectors, and on subsequent runs I just rely on those selectors instead of incurring further LLM cost.
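A minimal sketch of that cache-the-selectors pattern, stdlib-only so it runs without an LLM: the one-time LLM call is a hard-coded stand-in, the cache file name and selector set are illustrative, and ElementTree XPath expressions stand in for CSS selectors (a real scraper would likely use a CSS-selector library on messier HTML):

```python
import json
import os
import tempfile
import xml.etree.ElementTree as ET

# Illustrative cache location; in practice you'd key this per site.
CACHE = os.path.join(tempfile.gettempdir(), "selectors.json")

def llm_extract_selectors(html: str) -> dict:
    """Stand-in for the one-time LLM call that proposes robust selectors.

    Hard-coded here so the sketch is self-contained."""
    return {"title": ".//h1", "summary": ".//p[@class='summary']"}

def get_selectors(html: str) -> dict:
    # Subsequent runs read the cached selectors and skip the LLM entirely.
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    selectors = llm_extract_selectors(html)
    with open(CACHE, "w") as f:
        json.dump(selectors, f)
    return selectors

def extract(html: str) -> dict:
    root = ET.fromstring(html)  # assumes well-formed markup
    return {field: root.find(path).text
            for field, path in get_selectors(html).items()}

page = ("<html><body><h1>Feed title</h1>"
        "<p class='summary'>Daily digest</p></body></html>")
print(extract(page))  # {'title': 'Feed title', 'summary': 'Daily digest'}
```

If the cached selectors stop matching because the site changed, you can detect the empty results and fall back to a fresh LLM call to regenerate them.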

ushakov | 1 year ago

Thank you! I’m currently working on supporting local LLMs via llama.cpp, so cost won’t be an issue anymore.

nbbaier | 1 year ago

Given that the Ollama API is OpenAI-compatible, that should be a drop-in, no?
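A sketch of that drop-in, assuming the official `openai` Python client and Ollama's default local OpenAI-compatible endpoint (`http://localhost:11434/v1`); the model name and prompt are illustrative, and this only runs against a local Ollama server with a model pulled:

```python
from openai import OpenAI

# Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint.
# An api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",  # illustrative; any model pulled into Ollama works
    messages=[{"role": "user",
               "content": "Propose robust CSS selectors for this page."}],
)
print(resp.choices[0].message.content)
```

Because only the `base_url` changes, existing code written against the OpenAI API can switch to a local model without touching the call sites.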

fermisea | 1 year ago

I'm working on this problem now. It's possible for some sources, whenever the HTML structure is regular enough that you can map it to the feature of interest, but it can also happen that the information is buried in free text, which makes it virtually impossible.

nbbaier | 1 year ago

This is a really nice idea. Wonder what the prompt would look like for that.