pkamb | 6 months ago
Tons of public domain sources are locked into websites like Newspapers.com or the nearly dead and now completely unsearchable old Google News Archive.
It would be nice if the massive pursuit of AI training data resulted in some fully-legal open source alternatives to these proprietary, outdated, or abandoned sites. I know some of it is available via the Internet Archive, etc., but something new with an AI-powered search and finding aid sounds so useful.
lioeters|6 months ago
https://archive.org/search?query=title%3ANew+York+Times&sort...
> as a full PDF download set
I imagine it's possible to achieve this through torrents from Anna's Archive, but you'd have to search for and compile a list of all the individual PDFs.
> something new with an AI-powered search
With enough time and willingness, someone could put all the old NYT issues through optical character recognition and convert them to text; then make it available to large language models for semantic search of some kind. Ideally public cultural funds could support the effort as academic research.
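To make the search half of that idea concrete, here is a rough sketch, assuming the OCR step has already produced plain text per page. It ranks pages with TF-IDF and cosine similarity using only the Python standard library. That is lexical matching, not true semantic search: a real system would likely swap in embeddings from a language model. The page IDs and page text are made-up placeholders, not actual newspaper content, and `build_index` and `search` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# Made-up placeholder text standing in for OCR output (not real articles).
PAGES = {
    "page-a": "City council votes to extend the streetcar line along the harbor.",
    "page-b": "Stock prices fall sharply as brokers report heavy selling.",
    "page-c": "Astronauts walk on the moon and plant a flag on the surface.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Return (idf, vectors): smoothed IDF weights and unit-length TF-IDF vectors."""
    counts_by_page = {pid: Counter(tokenize(text)) for pid, text in pages.items()}
    n_pages = len(pages)
    doc_freq = Counter()
    for counts in counts_by_page.values():
        doc_freq.update(counts.keys())
    idf = {t: math.log((1 + n_pages) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    vectors = {}
    for pid, counts in counts_by_page.items():
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[pid] = {t: w / norm for t, w in vec.items()}
    return idf, vectors

def search(query, idf, vectors, top_k=3):
    """Rank pages by cosine similarity to the query; unknown terms are ignored."""
    counts = Counter(tokenize(query))
    qvec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    qnorm = math.sqrt(sum(w * w for w in qvec.values()))
    if qnorm == 0:
        return []
    scored = []
    for pid, vec in vectors.items():
        score = sum(w * vec.get(t, 0.0) for t, w in qvec.items()) / qnorm
        if score > 0:
            scored.append((pid, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

if __name__ == "__main__":
    idf, vectors = build_index(PAGES)
    for pid, score in search("moon landing astronauts", idf, vectors):
        print(pid, round(score, 3))
```

The OCR step itself is left out on purpose, since it depends on external tools and on scan quality. Noisy OCR text is also where plain keyword matching tends to break down, which is part of the argument for an embedding-based approach.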
pkamb|6 months ago