Ask HN: Simple API to extract web article text?
2 points | friendofafriend | 1 year ago
Does anyone know of an API that handles article text extraction automatically?
Ideally the API would take a URL and return just the main text content of the page, even for sites with slightly complex layouts.
For example: https://www.nytimes.com/2024/03/28/technology/personaltech/smart-glasses-ray-ban-meta.html
We're most interested in an API that has a decent free tier + usage-based pricing (at least for overages).
So far, most of our searches have turned up website scrapers that return HTML that needs to be further parsed (ScrapingBot, ScrapingBee, Scrapingdog, etc.), or services that are prohibitively priced (Diffbot).
Next, we're looking into Apify, but maybe we've missed something?
Any recommendations would be greatly appreciated!
timoteostewart|1 year ago
friendofafriend|1 year ago
goose3, trafilatura, newspaper3k (and newspaper4k even) all look like great tools. We were not planning on rolling our own, but that might be the right way to go after all. Thanks again.
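For anyone weighing the roll-your-own route, here is a rough, standard-library-only sketch of the basic idea: fetch a page, keep the text of `<p>` elements, and drop anything inside obvious chrome such as `<nav>` or `<footer>`. It is not how goose3, trafilatura, or newspaper3k work internally; the names `fetch_html`, `extract_main_text`, `SKIP_TAGS`, the `min_chars` threshold, and the sample HTML are all made up for illustration, and the heuristic will be far cruder than those libraries on real sites.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tags whose contents are rarely article text (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0        # > 0 while inside a skipped element
        self.in_paragraph = False  # True while a <p> is open
        self.current = []          # text fragments of the open <p>
        self.paragraphs = []       # finished, whitespace-normalized paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.flush()           # an unclosed <p> ends where the next begins
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)

    def flush(self):
        """Close the current paragraph and store it if it has any text."""
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []
        self.in_paragraph = False


def extract_main_text(html, min_chars=40):
    """Return paragraphs of at least min_chars characters, blank-line separated.

    The length filter is a crude way to drop buttons, bylines, and captions.
    """
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing unclosed <p>
    return "\n\n".join(p for p in parser.paragraphs if len(p) >= min_chars)


def fetch_html(url, timeout=10):
    """Download a page and decode it; many news sites block or paywall this."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (text-extraction sketch)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up page used only to exercise the extractor offline.
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
<nav><p>Home | World | Technology | Subscribe to our newsletter today</p></nav>
<article>
<h1>Example headline</h1>
<p>The first paragraph of the article body is long enough to pass the length filter.</p>
<p>Share</p>
<p>The second paragraph also carries real content, with an &amp; entity to decode.</p>
</article>
<footer><p>Copyright notice and other site-wide footer text that should be dropped.</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

On a live page the call would look like `extract_main_text(fetch_html(url))`. The dedicated libraries exist precisely because this kind of tag-and-length heuristic breaks down quickly on real layouts; trafilatura, for instance, documents a `fetch_url` plus `extract` pair that covers the same two steps.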
cranberryturkey|1 year ago
friendofafriend|1 year ago