top | item 47118744

genewitch | 6 days ago

It is a lot of fun and rewarding to do this! I've done it several times for medium-sized datasets, like Wikipedia dumps and the entire geospatial dataset, to MapReduce them (pgsql). The Wikipedia one was great: I had it set up to answer queries like "show me all ammunition manufactured after 1950 that is between .30 and .40 caliber", and it could return results nearly instantly. The Wikimedia dumps keep the infoboxes and relations intact, so you can do queries like this easily.
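(The commenter's actual pipeline isn't shared, but the general idea is that each page in a Wikipedia dump carries raw wikitext, and the infobox is a `{{Infobox ...}}` template of `|key = value` lines that can be pulled into structured columns. Here's a minimal, hypothetical sketch of that extraction step; a real pipeline would use a proper wikitext parser, since this flat scan ignores template transclusion.)

```python
# Minimal sketch (not the commenter's proprietary pipeline): extract
# |key = value fields from the first {{Infobox ...}} block in wikitext.
# Assumes fields sit on their own lines, as they do in most articles.

def parse_infobox(wikitext: str) -> dict:
    """Return the first infobox's fields as a dict of strings."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    # Walk forward matching {{ }} pairs so nested templates inside a
    # field value don't end the box early.
    depth, i = 0, start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                break
        else:
            i += 1
    body = wikitext[start:i]
    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
    return fields

sample = """{{Infobox firearm cartridge
| name = .30-06 Springfield
| bullet = .308
| year = 1906
}}"""

print(parse_infobox(sample))
# -> {'name': '.30-06 Springfield', 'bullet': '.308', 'year': '1906'}
```

Once every article's infobox is flattened like this, the rows can be bulk-loaded into a pgsql table and the "ammunition after 1950, between .30 and .40 caliber" query becomes an ordinary WHERE clause over typed columns.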

3eb7988a1663 | 6 days ago

Do you have a write-up of this somewhere? When I last looked at the Wikipedia dumps, they looked like a mess to parse. How were you getting structured information?

iamacyborg | 5 days ago

You'd presumably have to run some part of the transclusion pipeline to properly handle template/module/page transclusion.

genewitch|6 days ago

Unfortunately, I consider it proprietary.