We've had the best success by first converting the HTML to a simpler format (e.g. markdown) before passing it to the LLM.
There are a few ways to do this that we've tried, namely Extractus[0] and dom-to-semantic-markdown[1].
Internally we use Apify[2] and Firecrawl[3] for Magic Loops[4] that run in the cloud, both of which have options for simplifying pages built-in, but for our Chrome Extension we use dom-to-semantic-markdown.
Similar to the article, we're currently exploring a user-assisted flow to generate XPaths for a given site, which we can then use to extract specific elements before hitting the LLM.
By simplifying the "problem" we've had decent success, even with GPT-4o mini.
[0] https://github.com/extractus
[1] https://github.com/romansky/dom-to-semantic-markdown
[2] https://apify.com/
[3] https://www.firecrawl.dev/
[4] https://magicloops.dev/
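As a rough illustration of that simplification step (not our actual pipeline — markdownify here is just a Python stand-in for the tools above, and the URL is made up):

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

# fetch the page, drop script/style, then convert the rest to markdown;
# the markdown version is a fraction of the tokens of the raw HTML
html = requests.get("https://example.com/products").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()
simplified = md(str(soup))
```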
If you're open to it, I'd love to hear what you think of what we're building at https://browserbase.com/ - you can run a chrome extension on a headless browser so you can do the semantic markdown within the browser, before pulling anything off.
We even have an iFrame-able live view of the browser, so your users can get real-time feedback on the XPaths they're generating: https://docs.browserbase.com/features/session-live-view#give...
Happy to answer any questions!
Have you compared markdown to just stripping the HTML down (strip tag attributes, unwrap links, remove obvious non-displaying elements)? My experience has been that performance is pretty similar to markdown, and it’s an easier transformation with fewer edge cases.
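For reference, that stripping transformation is only a few lines with BeautifulSoup (a sketch, untested on edge cases):

```python
from bs4 import BeautifulSoup

def strip_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # remove obvious non-displaying elements entirely
    for tag in soup(["script", "style", "noscript", "template", "head"]):
        tag.decompose()
    # strip all tag attributes
    for tag in soup.find_all(True):
        tag.attrs = {}
    # unwrap links, keeping their text
    for a in soup.find_all("a"):
        a.unwrap()
    return str(soup)
```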
First I’ve heard of Semantic Markdown [0]. It appears to be a way to embed RDF data in Markdown documents.
The page I found is labeled “Alpha Draft,” which suggests there isn’t a huge corpus of Semantic Markdown content out there. This might impede LLMs’ ability to understand it due to lack of training data. However, it seems sufficiently readable that LLMs could get by pretty well by treating its structured metadata as parentheticals.
=====
What is Semantic Markdown?
Semantic Markdown is a plain-text format for writing documents that embed machine-readable data. The documents are easy to author and both human and machine-readable, so that the structured data contained within these documents is available to tools and applications.
Technically speaking, Semantic Markdown is "RDFa Lite for Markdown" and aims at enhancing the HTML generated from Markdown with RDFa Lite attributes.
Design Rationale:
Embed RDFa-like semantic annotation within Markdown
Ability to mix unstructured human-text with machine-readable data in JSON-LD-like lists
Ability to semantically annotate an existing plain Markdown document with semantic annotations
Keep human-readability to a maximum
About this document
=====
[0] https://hackmd.io/@sparna/semantic-markdown-draft
We did something similar, although in a somewhat different context.
Translating a complex JSON representing an execution graph to a simpler graphviz dot format first and then feeding it to an LLM. We had decent success.
OpenAI recently announced a Batch API [1] which allows you to prepare all prompts and then run them as a batch. This reduces costs, as it's just 50% of the regular price. I've used it a lot with GPT-4o mini in the past and was able to prompt 3,000 items in less than 5 minutes. It could be great for non-realtime applications.
[1] https://platform.openai.com/docs/guides/batch
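For anyone curious, a batch job is a JSONL file of requests plus two API calls — something like this sketch (the `pages` list and the prompt are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()
pages = ["<html>...</html>"]  # your pre-simplified pages

# one JSONL line per request
with open("batch.jsonl", "w") as f:
    for i, page in enumerate(pages):
        f.write(json.dumps({
            "custom_id": f"page-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user",
                              "content": f"Extract the items as JSON:\n{page}"}],
            },
        }) + "\n")

batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24h at half price
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until it completes
```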
I hope some of the open-source inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support with the same format; they just haven't gotten around to implementing it on the OpenAI-compatible endpoint yet.
That's a great proposition by OpenAI.
I think however that it is still one to two orders of magnitude too expensive compared to traditional text extraction with very similar precision and recall levels.
Yeah, this was a phenomenal decision on their part. I wish some of the other cloud platforms like Azure would offer the same thing; it just makes so much sense!
For structured content (e.g. lists of items, simple tables), you really don’t need LLMs.
I recently built a web scraper to automatically work on any website [0] and built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).
For most websites, the non-AI approach works incredibly well, so I'd make sure AI is really necessary (e.g. the data is unstructured, or you need to derive or format the output based on the page data) before incorporating it.
[0] https://easyscraper.com
The LLM is resistant to website updates that would break normal scraping
If you do what the author did and ask it to generate XPaths, you can use the LLM once, reuse the XPaths it generated for regular scraping, then fall back to the LLM to update the XPaths once they break, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data is in an unexpected format.
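A sketch of that fallback chain (`llm_generate_xpath`, `save_xpath`, and `alert_human` are hypothetical helpers):

```python
from lxml import html as lxml_html

def extract_rows(page_html: str, cached_xpath: str):
    tree = lxml_html.fromstring(page_html)
    rows = tree.xpath(cached_xpath)
    if rows:
        return rows
    # cached XPath broke: ask the LLM for a fresh one and retry
    new_xpath = llm_generate_xpath(page_html)  # hypothetical helper
    rows = tree.xpath(new_xpath)
    if rows:
        save_xpath(new_xpath)                  # hypothetical helper
        return rows
    alert_human(page_html)                     # last resort: wake someone up
    return []
```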
Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source, it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.
I feel like if you used a DOM parser to walk the tree and kept only the nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?), you'd see significant savings. Perhaps the XPath approach might work better too. You could even drop unnecessary symbols and represent it as an indented text file.
We use Readability for things like this, but you lose the DOM structure, and its quality degrades on JS-heavy websites and pages with actions like "continue reading" which expand the text.
What's the gold standard for something like this?
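A naive sketch of that reducer with BeautifulSoup (assuming class/id are the only attributes worth keeping):

```python
from bs4 import BeautifulSoup

KEEP_ATTRS = {"class", "id"}

def reduce_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # drop elements that never carry useful text
    for tag in soup(["script", "style", "svg", "noscript", "head"]):
        tag.decompose()
    for tag in soup.find_all(True):
        if tag.decomposed:  # already removed along with a parent
            continue
        # keep only class/id, drop every other attribute
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
        # drop nodes with no text anywhere beneath them
        if not tag.get_text(strip=True):
            tag.decompose()
    return str(soup)
```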
Jina.ai offer a really neat (currently free) API for this - you add https://r.jina.ai/ to the beginning of any URL and it gives you back a Markdown version of the main content of that page, suitable for piping into an LLM.
Here's an example: https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato... - for this page: https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-us...
Their code is open source so you can run your own copy if you like: https://github.com/jina-ai/reader - it's written in TypeScript and uses Puppeteer and https://github.com/mozilla/readability
I've been using Readability (minus the Markdown bit) myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...
Only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, render article text via API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazyload images, lazyload text, etc etc.
DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.
That’s easy to do with BeautifulSoup in Python. Look up tutorials on that. Use it on non-essential tags. That will at least work when the content is in the HTML rather than procedurally generated (e.g. via JavaScript).
It's very surprising that the author of this post does 99% of the work and writing and then doesn't go the final 1%: downloading Ollama (or some other llama.cpp-based engine) and testing how some decent local LLM works on this use case. Maybe a 7B or 30B model would do great here, and that's cheap enough to run: no GPT-4o needed.
We've been working on AI-automated web scraping at Kadoa [0] and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.
Here is what we ended up with:
- Extraction: We use codegen to generate CSS selectors or XPath extraction code (see the sketch below). Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.
- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate the data quality.
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.
[0] https://kadoa.com
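For illustration, the extraction step might look something like this (a sketch, not Kadoa's actual code — the prompt and model are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def generate_scraper(sample_html: str) -> str:
    """Ask the model once for extraction code; reuse that code on every page."""
    prompt = (
        "Write a Python function parse(html) that uses BeautifulSoup to extract "
        "every item as {'name': ..., 'price': ...} from pages shaped like this:\n\n"
        + sample_html
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # review/test before executing it
```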
I've had good luck giving it an example of the HTML I want scraped and asking for a BeautifulSoup code snippet. Generally the structure of what you want to scrape remains the same, and it's a tedious exercise coming up with the garbled string of nonsense that ends up parsing it.
Using an LLM for the actual parsing is simultaneously overkill and a risk of your results being polluted with hallucinations.
As others have mentioned here, you might get better results, cheaper (this probably wasn't the point of the article, so just FYI), if you preprocess the HTML first. I personally have had good results with trafilatura [1], which I don't see mentioned yet.
[1] https://trafilatura.readthedocs.io/en/latest/
I strongly second trafilatura. Sending just the extracted text to the LLM will save a huge amount of money.
I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple python library that creates a vector store for any website, using a domain XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store, we need the actual content, removing HTML tags, etc. Trafilatura basically does that for any url, in just a few lines of code.
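For anyone who hasn't tried it, the core of trafilatura really is a couple of lines (the URL here is hypothetical):

```python
import trafilatura

# fetch the page, then pull out the main content with boilerplate removed
downloaded = trafilatura.fetch_url("https://example.com/article")  # hypothetical URL
text = trafilatura.extract(downloaded)
```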
Wow, that's one of the most orange tag-rich posts I've ever seen.
We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse news content. Our rule-based model for extracting data from any article works pretty well, and we never could find a way to improve it with GPT.
"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.
Interesting hack: I think many projects (including us) can get away with generating the code for extraction since the per-website structure rarely changes.
So, we're looking at using an LLM to generate code to parse HTML.
Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com
Funnily enough, web scraping was actually the motivating use-case that got my co-founder and me building what is now openpipe.ai. GPT-4 is really good at it, but extremely expensive. But it's actually pretty easy to distill its skill at scraping a specific class of site down to a fine-tuned model that's way cheaper and also really good at scraping that class of site reliably.
I’ve had problems with hallucinations, though, even for something as simple as city names; the model also often ignores my prompt and returns country names. I'm thinking of trying a two-stage scrape, with one stage checking the output of the other.
I'm working on a Chrome extension to do web scraping using OpenAI, and I've been impressed by what ChatGPT can do. It can scrape complicated text/HTML, and usually returns the correct results.
It's very early still, but check it out at https://FetchFoxAI.com
One of the cool things is that you can scrape non-uniform pages easily. For example I helped someone scrape auto dealer leads from different websites: https://youtu.be/QlWX83uHgHs . This would be a lot harder with a "traditional" scraper.
Same experience here. Been building a classical music database [1] where historical and composer life events are scraped off Wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies.
- Using GPT-4o mini was the only cheap option; it worked well (although I have a feeling it's been dumbed down these days) and made it virtually free.
- Just extracting the webpage text from HTML, with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables)
- At some point I needed to scrape ~10,000 pages that have the same format and it was much more efficient speed-wise and price-wise to provide ChatGPT with the HTML once and say "write some python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.
- Finally, because this article mentions tables: Pandas has a very nice feature, `read_html("http://the-website.com")`, which will detect and parse all the tables on a page (quick example below). But the article does a good job pointing at websites where the method would fail because the tables don't use `<table/>`.
[1] https://github.com/Zulko/composer-timelines
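The Pandas feature from that last point, for reference (it needs lxml or bs4 installed, and only finds real `<table>` elements):

```python
import pandas as pd

# returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("https://the-website.com")  # hypothetical URL
print(tables[0].head())
```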
If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.
Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section so that you can reduce it to only the sections that are likely to be relevant (e.g. "Life and career" type sections).
You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
This doesn't directly address your issue but since this caused me some pain I'll share that if you want to parse structured information from Wikipedia infoboxes the npm module wtf_wikipedia works.
Can you share how long it took for you to parse the HTML? I recently experimented with comparing different AI models, including GPT-4o, alongside Gemini and Claude, to parse raw HTML: https://serpapi.com/blog/web-scraping-with-ai-parsing-html-t.... The results are pretty interesting.
We've had lots of success with this at Rastro.sh - but the biggest unlock came when we used this as benchmark data to build scraping code. Sonnet 3.5 is able to do this. It reduced our cost and improved accuracy for our use case (extracting e-commerce products), as some of these models are not reliable at extracting lists of 50+ items.
> I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.
Would be interesting to compare with the other inexpensive top tier models, Claude 3 Haiku and Gemini 1.5 Flash.
author here: I'm working on a follow-up post where I benchmark pre-processing techniques (to reduce the token count). Turns out, removing all HTML works well (much cheaper and doesn't impact accuracy). So far, I've only tried gpt-4o and the mini version, but trying other models would be interesting!
GPT-4o (and the other top-tier models like Claude 3.5 Sonnet and Gemini 1.5 Pro) is massively more capable than models you can run on your own machine using Ollama - unless you can run something truly monstrous like Llama 3.1 405b, but that requires 100GB+ of GPU RAM, which is very expensive.
I would definitely approach this problem by having the LLM write code to scrape the page. That would address the cost and accuracy problems. And also give you testable code.
As others have mentioned, converting html to markdown works pretty well.
With that said, we've noticed that for some sites that have nested lists or tables, we get better results by reducing those elements to a simplified html instead of markdown. Essentially providing context when the structures start and stop.
It's also been helpful for chunking docs, to ensure that lists / tables aren't broken apart in different chunks.
There’s a lot of data that we should have programmatic access to that we don’t.
The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.
Any website that has my data and doesn’t give me access to it is a great target for scraping.
I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.
There's been a large push to do server-side rendering for web pages which means that companies no longer have a publicly facing API to fetch the data they display on their websites.
Parsing the rendered HTML is the only way to extract the data you need.
What do you think all this LLM stuff will evolve into? It's moving on from chitchat over stale information and into an "automate the web" kind of phase, like it or not.
GPT-4 (and Claude) are definitely the top models out there, but: Llama, even the 8b, is more than capable of handling extraction like this. I've pumped absurd batches through it via vLLM.
With serverless GPUs, the cost has been basically nothing.
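For example, a local batch with vLLM looks roughly like this (the model choice and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
params = SamplingParams(temperature=0, max_tokens=512)
pages = ["<html>...</html>"]  # your stripped-down pages
prompts = [f"Extract the product data as JSON:\n{p}" for p in pages]
for out in llm.generate(prompts, params):  # batched offline inference
    print(out.outputs[0].text)
```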
Can you explain a bit more about what "serverless GPUs" are, exactly? Is there a specific cloud provider you're thinking of? E.g., is there a GPU product on AWS? Googling gives me SageMaker, which is perhaps what you're referring to?
Can anyone recommend an AI vision web browsing automation framework rather than just scraping? My use case: automate the monthly task of logging into a website and downloading the latest invoice PDF.
Most of the discussion I've found on this topic is about how to extract information. Is there any technique for extracting interactive elements? I reckon listing all the inputs/controls wouldn't be hard, but finding the corresponding labels/articles might be tricky.
Another thing I wonder, regarding text extraction: would it be a crazy idea to just snapshot the page and ask the model to OCR it and generate a bare-minimum HTML table layout? That way both the content and the spatial relationships of the elements are maintained (not sure how useful that is, but I'd like to keep it anyway).
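A sketch of what that could look like with GPT-4o-style vision input (the prompt and file name are illustrative):

```python
import base64
from openai import OpenAI

client = OpenAI()
b64 = base64.b64encode(open("page.png", "rb").read()).decode()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": "OCR this screenshot and return a bare-minimum HTML table "
                 "that preserves the spatial layout of the content."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
)
print(resp.choices[0].message.content)
```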
As a PoC, we first took a screenshot of the page, cropped it to the part we needed, and then passed it to GPT. One of the things we do is compare prices from different suppliers for the same product (e.g. airline tickets), and sometimes we need to do it manually. While the approach may look expensive, it is in general cheaper than a real person, and it frees the real person up for more meaningful work... so it's a win-win. I'm looking forward to putting this into production, hopefully.
This looks super useful, but from what I've heard, if you try to do this at any meaningful scale your scrapers will get blocked by Cloudflare and the like.
I used to do a lot of web scraping. Cloudflare is an issue, as are a few Cloudflare competitors, but scraping can still be useful. We had contracts with companies we scraped that allowed us to scrape their sites, specifically so that they didn't need to do any integration work to partner with us. The most anyone had to do on the company side was allowlist us with Cloudflare.
Would recommend web scraping as a "growth hack" in that way, we got a lot of partnerships that we wouldn't otherwise have got.
Instead of directly scraping with GPT-4o, what you could do is have GPT-4o write a script for a simple web scraper and then use a prompt-loop when something breaks or goes wrong.
I have the same opinion about a man and his animals crossing a river on a boat. Instead of spending tokens on trying to solve a word problem, have it create a constraint solver and then run that. Same thing.
What people mentioned above is pretty much what they did at Octabear [0], and as an extension of the idea, it's also what a lot of startup applicants did for other types of media: video scraping, podcast scraping, audio scraping, etc.
[0] https://www.octabear.com/
I think that LLM costs, even for GPT-4o, are probably low compared to the proxy costs usually required for web scraping at scale. Residential/mobile proxies cost a few dollars per GB. If I were to process cleaned data obtained using 1GB of residential/mobile proxy transfer, I wouldn't pay more for the LLM.
The author claims that attempting to retrieve xpaths with the LLM proved to be unreliable. I've been curious about this approach because it seems like the best "bang for your buck" with regards to cost. I bet if you experimented more, you could probably improve your results.
This is also how we started a while ago.
I agree that it's too expensive, hence we're working on making this scalable and cheaper now!
We're launching soon, but here we go: https://expand.ai
I’m curious to know more about your product. Currently, I’m using visualping.io to keep an eye on the website of my local housing community. They share important updates there, and it’s really helpful for me to get an email every few months instead of having to check their site every day.
I was thinking of adding a feature of my app to use LLMs to extract XPaths to generate RSS feeds from sites that don't support it. The section on XPaths is unfortunate.
Not sure why the author didn't use 4o-mini: 4o for reasoning, but things like parsing/summarizing can be done by cheaper models with little loss in quality.
On this note, does anyone know how Cursor scrapes websites? Is it just fetching locally and then feeding the raw html or doing some type of preprocessing?
is it really so hard to look at a couple of XPaths in Chrome? insane that people actually use an LLM when trying to do this for real. we're headed where automakers are now - just put in idiot lights, no one knows how to work on any parts anymore. suit yourself i guess
I just want something that can take all my bookmarks, log into all my subscriptions using my credentials, and archive all those articles. I can then feed them to an LLM of my choice to ask questions later, but having the raw archive is the important part. I don't know if there are any easy-to-use tools for this, though, especially with paywalled subscription-based websites.
What are some good frameworks for web scraping and PDF document processing? Some sources are public and some are behind a login, and some require multiple clicks before the sites display the relevant data.
We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data as API / json.
I have built most of this and have it running on Google Cloud as a service. The framework I built is Open Source. Let me know if you want to discuss: https://mitta.ai
snthpy|1 year ago
I've been wanting to try the same approach and have been looking for the right tools.
suchintan|1 year ago
It's adapted from vimium and works like a charm. It distills the HTML down to its important bits, and handles a ton of edge cases along the way haha
ErikAugust|1 year ago
https://github.com/mozilla/readability
purple-leafy|1 year ago
It strips all JS/event handlers, most attributes, and most CSS, and only keeps important text nodes.
I needed this because I was using an LLM to reimplement portions of a page using just Tailwind, so I needed to minimise input tokens.
artembugara|1 year ago
We've been working on this for quite a while. I'll contact you to show how far we've gotten
Havoc|1 year ago
Plus you can probably use that until it fails (website changes) and then just re-scrape it with an LLM request.
gallerdude|1 year ago
"I'm starting to think computers are a solution in the need of a problem. Have we not already solved doing math?"