jka | 3 years ago
There's an architecture diagram[1] alongside the source code, and my summary would be:
- The system has in-house web indexes built from Common Crawl[2] data
- The system receives snippets of text from Wikipedia and determines whether a citation already exists and, if so, whether it is valid
- If no valid citation exists, the system queries those indexes to find relevant URLs (a rough sketch of this flow follows the list)
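For concreteness, here's a minimal Python sketch of that flow. The Claim type and the citation_supports / query_index helpers are hypothetical stand-ins, not the project's actual API; the real system relies on learned retrieval over its Common Crawl indexes rather than these placeholders.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Claim:
        text: str                    # Wikipedia snippet that needs a supporting source
        citation_url: Optional[str]  # URL of the existing citation, if any

    def citation_supports(url: str, text: str) -> bool:
        """Hypothetical verifier: does the page at `url` actually support `text`?"""
        return False  # stand-in; the real system uses a learned verification model

    def query_index(text: str, k: int = 5) -> List[str]:
        """Hypothetical retriever: return up to `k` candidate URLs from the web index."""
        return []  # stand-in; the real system searches indexes built from Common Crawl

    def suggest_citations(claim: Claim) -> List[str]:
        # Keep the existing citation if it checks out...
        if claim.citation_url and citation_supports(claim.citation_url, claim.text):
            return [claim.citation_url]
        # ...otherwise query the index for replacement candidates.
        return query_index(claim.text)

    if __name__ == "__main__":
        claim = Claim(text="The Eiffel Tower opened in 1889.", citation_url=None)
        print(suggest_citations(claim))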
It'd be interesting to learn how this approach fares compared to pasting the relevant paragraphs of text into search engines and excluding site:wikipedia.org from the results.
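To be concrete about that baseline: I mean issuing the paragraph text as a query with a site-exclusion operator, roughly as below. The endpoint is a placeholder, not any particular search API.

    from urllib.parse import urlencode

    def baseline_query_url(paragraph: str) -> str:
        # Exclude Wikipedia itself so the engine can't just return the source article.
        query = f'{paragraph} -site:wikipedia.org'
        return "https://search.example/search?" + urlencode({"q": query})

    print(baseline_query_url("The Eiffel Tower opened in 1889."))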
Something about feedback loops and data quality makes me wary that heavy use of automated systems like this would gradually degrade content quality (each updated copy being an imperfect translation of, or reference to, an existing one).
[1] - https://github.com/facebookresearch/side/tree/a595fb09c85233...
[2] - https://commoncrawl.org/