portInit | 3 years ago

Yeah, the default domain throttle policy is 1 request per second per domain. It's configurable through domain policies (https://www.crul.com/docs/features/domain-policies), although that's currently an enterprise feature.
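
To illustrate what a 1 req/s per-domain throttle means in practice, here is a minimal Python sketch. It is not crul's implementation; the `DomainThrottle` class and its parameters are hypothetical names made up for this example.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Allow at most one request per `min_interval` seconds for each domain.

    Hypothetical helper for illustration only; not part of crul.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        # domain -> earliest time at which the next request may start
        self._next_allowed = {}

    def wait(self, url):
        """Block until a request to the URL's domain is allowed.

        Returns the number of seconds spent waiting.
        """
        domain = urlparse(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually starts.
        self._next_allowed[domain] = max(now, ready_at) + self.min_interval
        return delay
```

Calling `throttle.wait(url)` before each request spaces out hits to the same host while leaving requests to other hosts unaffected.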

We found that without a throttle it becomes too easy to exceed API rate limits or spam a website.

However, if you rerun that query it should load almost instantly thanks to the caching layers, so the actual querying/filtering of the data is smoother and faster.

mdaniel|3 years ago

> due to the caching layers

Every time I see that, the "2 hardest things" springs to mind. Is there a clear-caches option, or I guess the opposite question: does that process honor the HTTP caching semantics? Scrapy actually has a bunch of configurable knobs for that: use the RFC2616 policy ( https://docs.scrapy.org/en/2.8/topics/downloader-middleware.... ), write your own policy, or a ton of other stuff ( https://docs.scrapy.org/en/2.8/topics/downloader-middleware.... ).
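
For context, the Scrapy knobs in question are ordinary project settings. A minimal `settings.py` fragment, assuming a standard Scrapy project; the values are illustrative, not recommendations:

```python
# settings.py (fragment) -- Scrapy HTTP cache configuration

# Turn on the HTTP cache downloader middleware.
HTTPCACHE_ENABLED = True

# Honor HTTP caching semantics (Cache-Control, validators, etc.)
# instead of the default "cache everything" dummy policy.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Where and how cached responses are stored (relative to the project data dir).
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Illustrative choice: never cache responses with these status codes.
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```

Deleting the `HTTPCACHE_DIR` directory is the blunt clear-caches option on the Scrapy side.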

portInit|3 years ago

Agreed, caching does come with its own set of quirks and mind-numbing bugs. crul does have a caching override flag at the command/stage level, which alleviates some of this: https://www.crul.com/docs/queryconcepts/common-flags#--cache

The links you provided are interesting and something for us to think about some more. Honestly, I would be quite interested in hearing more about your experiences.

KomoD|3 years ago

> although currently an enterprise feature.

Wait, so we literally can't go faster than 1 req/s unless we pay?

I have to say I'm pretty disappointed :/

curiousgeorgio|3 years ago

If you attach to the running docker container, these defaults appear to be defined in /crul/dist/crul-docker/packages/startup/.env

Don't spam APIs. That said, if you're determined to do so, there's not much this or any other tool can do to stop you from trying.

portInit|3 years ago

Sorry to hear that - we do need to think about this. It's our first pass at product tiers and features and we may need to adjust.

Scheduling and Domain Policies were the main features we chose to gate initially as they don't affect core functionality other than performance and deployment.