top | item 35906108

vaskal08 | 2 years ago

Hey, other dev on this project. This is a good catch, and we're aware of this issue. What it's doing is actually using a photo caption as part of the article, and we're working on removing the use of that in the summarization process.

kristopolous|2 years ago

There are news APIs.

Start with those, then figure out how to scrape a site as your input and emit the existing API format from it. You'll come in through a clever side route, essentially building a two-phase assembly line.

This also lets users customize their "feed" as a free side effect of the architecture. You can isolate your scraping -> API transform on a per-site basis, also as a free consequence; you can parallelize the work much more cleanly; and you can even let the public add their own "transformer" for their favorite news site.
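The per-site transformer idea above can be sketched roughly like this: each site gets its own scraper registered behind one shared signature, and everything downstream only ever sees the normalized record. All names here (`Article`, `SCRAPERS`, `register`, `scrape_example`) are hypothetical, not from any real project.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Article:
    """The normalized 'API format' every transformer must emit."""
    url: str
    title: str
    body: str

# Phase 1: per-site scrapers, each isolated behind the same signature.
Scraper = Callable[[str], Article]
SCRAPERS: Dict[str, Scraper] = {}

def register(domain: str):
    """Decorator so contributors can plug in a transformer for their site."""
    def wrap(fn: Scraper) -> Scraper:
        SCRAPERS[domain] = fn
        return fn
    return wrap

@register("example.com")
def scrape_example(html: str) -> Article:
    # A real transformer would parse the HTML; this stub just passes it through.
    return Article(url="https://example.com/a", title="stub", body=html)

# Phase 2: dispatch by domain and emit the normalized record.
def to_api(domain: str, html: str) -> Article:
    return SCRAPERS[domain](html)
```

Because each transformer is an independent function keyed by domain, the scraping stage parallelizes trivially and a broken site-specific parser can't corrupt the others.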

lxe|2 years ago

Parsing PDFs or the web semantically is really not an easy job, as I found in my own foray into LLM summarization.

startupsfail|2 years ago

Maybe image search and if the image is not novel, ignore it?
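A minimal sketch of that "ignore non-novel images" idea, assuming you keep a store of hashes of images already seen. Using `hashlib` only catches byte-identical copies; a real system would more likely use perceptual hashing (e.g. the `imagehash` library) or a reverse image search so near-duplicates also match.

```python
import hashlib

# Hypothetical in-memory store of previously seen image digests.
seen: set = set()

def is_novel(image_bytes: bytes) -> bool:
    """Return True the first time an image is seen, False on repeats."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

A summarizer could then skip captions attached to any image where `is_novel(...)` is False, on the theory that recycled stock art says little about the article.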

cutemonster|2 years ago

Good point (it seems to me), and if it's AI-generated, (try to) ignore it too, I guess.