item 43559034

Strategies to download data constantly changing via API

2 points| rupestrecampos | 11 months ago

I have to download a dataset through an API (a WFS provided by GeoServer) that reports the total number of items, delivers at most 1000 items per request, and lets me sort by one field and offset the start index of each request. The layer has ~1 million items. I can run at most 5 parallel requests before the API gets overloaded.
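The paging scheme described above (count, startIndex, sortBy) can be sketched as building one WFS 2.0 GetFeature URL per page. The layer name "sicar:AREA_IMOVEL" and the sort field "id" below are assumptions for illustration; the real names come from the server's capabilities document.

```python
# Sketch: build paginated WFS 2.0 GetFeature URLs. Layer name and sort
# field are hypothetical -- check the GetCapabilities response for real ones.
from urllib.parse import urlencode

BASE = "https://geoserver.car.gov.br/geoserver/sicar/wfs"

def page_urls(total_items, page_size=1000,
              layer="sicar:AREA_IMOVEL", sort_field="id"):
    """Return one GetFeature URL per page, paged via count/startIndex."""
    urls = []
    for start in range(0, total_items, page_size):
        params = {
            "service": "WFS",
            "version": "2.0.0",
            "request": "GetFeature",
            "typeNames": layer,
            "outputFormat": "application/json",
            "count": page_size,
            "startIndex": start,
            "sortBy": sort_field,  # a stable sort keeps pages consistent
        }
        urls.append(BASE + "?" + urlencode(params))
    return urls

urls = page_urls(2500)
print(len(urls))  # 3 pages for 2500 items
```

The resulting URLs could then be fetched with at most 5 workers, e.g. via `concurrent.futures.ThreadPoolExecutor(max_workers=5)`, to stay under the stated overload limit.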

The problem is that items are being added and removed in real time, so by the end of the copy process some of the data I copied is already stale and new items have appeared. What would you do, or have you done, in this situation? Would running a never-ending loop that crawls the data all day long be evil, or is this something to fix on the provider side?

The api url is https://geoserver.car.gov.br/geoserver/sicar/wfs

Source data website: https://consultapublica.car.gov.br/publico/imoveis/index

2 comments


stop50|11 months ago

I currently only have my phone, so I can't judge the API. From my point of view a full scrape at regular intervals is not that bad; it's only 1000 requests. Depending on the data and query methods, you can make fresh data appear sooner than you remove old data. The major question is: how fresh do you need your data?

Not every application needs real-time data; querying it only on occasion, or every few hours, can be good enough.
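The "fresh data appears sooner than old data is removed" idea can be sketched as a reconciliation pass: upsert everything the latest full scrape returned immediately, and only delete records once a complete pass confirms they are gone. The in-memory dict store and integer feature ids are assumptions for illustration.

```python
# Sketch of one reconciliation pass over a local copy of the dataset.
# store maps feature_id -> feature; scraped yields (feature_id, feature)
# pairs from one complete scrape.
def reconcile(store, scraped):
    seen = set()
    for fid, feature in scraped:
        store[fid] = feature      # new and updated items land immediately
        seen.add(fid)
    for fid in list(store):       # removals only after the full pass finished
        if fid not in seen:
            del store[fid]
    return store

store = {1: "old", 2: "stale"}
reconcile(store, [(1, "fresh"), (3, "new")])
print(store)  # {1: 'fresh', 3: 'new'}
```

Run once per scrape interval; if a scrape aborts partway, skip the deletion step so items are never dropped based on an incomplete pass.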

rupestrecampos|11 months ago

Thanks for your point of view, I appreciate it very much. On data freshness: we would like to serve on our system the same information you can find on the source website, and we have a few tickets pointing out the differences to resolve. A continuous strategy, running every hour, might minimize the gap, since today we only scrape once a day.