This post is not about database cursors. It's about the style of pagination where you have a ?_next=xxx link to get to the next page, where the xxx bit encodes details about the last item on the current page such that the next page can show everything that comes after that record.
Sounds great on paper, but my experience is that key-based pagination is broken on Solr and likely other Lucene-based engines. If you are evaluating conditions on child objects inside a parent document, the child objects get split out and stored as separate documents that are joined at query time; and if such a child document lands in a different block/file on disk, it won't necessarily be inside the range being scanned, so you will be missing some results that meet your logical criteria.
Possibly just an implementation error in the BlockJoinParser but it did not occur with numeric pagination.
To work around that, you need to “flatten” your JSON, like with the @JsonUnwrapped annotation, and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries to the dataset.
Most users don't care but I very much prefer to have URLs with `/?page=123` instead of `?_next=xxx`. I should get over it but meanwhile I'll stick to offset pagination with deferred join (see aarondf's comment).
This post is about "database cursors" and "keyset pagination". In practice, these terms refer to the same thing, one seen bottom-up and the other seen top-down. Implementation-wise, you save the state of the cursor in the pagination parameters and resume reading from the DB with an equivalent cursor.
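To make that equivalence concrete, here is a minimal keyset sketch using Python's built-in sqlite3 (all table and column names are invented, and SQLite's row-value syntax requires version 3.15 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO items (id, created_at) VALUES (?, ?)",
    [(i, f"2023-01-{i:02d}") for i in range(1, 11)],
)

def next_page(last_row, limit=3):
    """Resume strictly after the (created_at, id) pair of the last row seen.

    The tuple comparison keeps pagination stable even if created_at
    values collide; id breaks the tie.
    """
    last_id, last_created = last_row
    return conn.execute(
        """SELECT id, created_at FROM items
           WHERE (created_at, id) > (?, ?)
           ORDER BY created_at, id
           LIMIT ?""",
        (last_created, last_id, limit),
    ).fetchall()

first = next_page((0, ""))     # sentinel cursor: sorts before everything
second = next_page(first[-1])  # resume from the last row of page one
```

The "cursor" handed to the client is just the last row's sort key, which is exactly the state a server-side DB cursor would have kept for you.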
There are ways to mitigate (though not eliminate) the slowdown of offset/limit pagination on later pages. The technique is called a "deferred join" and it is most effective in MySQL. The basic idea is to paginate over as little data as necessary, and then do a self-join to get the rest of the data for a single page.
To be clear, this technique (which it seems I independently discovered in 2015) mostly only works in MySQL because other databases usually have planners which are smart enough to not pull everything in eagerly.
MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.
The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.
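The shape of a deferred join looks like this, sketched against SQLite purely to show the query structure (the planner win described above is MySQL-specific, and the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title, body) VALUES (?, ?, ?)",
    [(i, f"title {i}", "x" * 200) for i in range(1, 1001)],
)

# Naive offset pagination: the engine may materialize every selected
# column for all 500 skipped rows before discarding them.
naive = conn.execute(
    "SELECT id, title, body FROM posts ORDER BY id LIMIT 10 OFFSET 500"
).fetchall()

# Deferred join: paginate over the narrow primary key alone (ideally a
# covering-index scan), then self-join to fetch the wide columns for
# only the 10 rows on this page.
deferred = conn.execute(
    """SELECT p.id, p.title, p.body
       FROM posts p
       JOIN (SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 500) AS page
         ON p.id = page.id
       ORDER BY p.id"""
).fetchall()
```

Both queries return the same page; the difference is how much row data the engine drags along while skipping the offset.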
Cursor based pagination doesn’t actually solve the described pitfalls if your results are ordered by anything mutable, though.
A result you haven’t yet seen can be mutated to sort before your current cursor, likewise a result that you’ve already seen can be mutated to sort after the current cursor, causing you to see it twice.
Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that lie on page boundaries. But depending on the use case I’m not sure that is worth that added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).
The only approach I can think of which might be able to handle mutability of the rows used for sorting would be to support paginating through a snapshot: provide a mechanism whereby a full snapshot of the query at a specific point in time is captured such that the user can then paginate through that snapshot.
Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.
Why does every web site default to aggressively paginating its information? Pagination sucks; it's a waste of time and of clicks, and should be a last resort. Sure, when Google returns millions of search results, paginate. But:
> For instance, if you have 40 entries and specify that there should be 10 items per page
40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.
"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.
"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.
> Why does every web site default to aggressively paginate their information?
Because data retrieval costs money and resources. Unbounded API responses are susceptible to denial-of-service. Well-architected APIs will always cap the number of results they return per call.
I’ve seen instances where the issue is not database performance, but web server performance. Serializing a graphql response in graphql-ruby, for example, can easily take 100+ms and grows linearly with the number of records in a request.
In the general case this isn’t a good excuse, there’s no reason it should take that long. But you don’t always get to pick your technology.
I'd guess that most people have written some bad query that only worked for small limits, so they started superstitiously making their limits very low.
For anyone curious about this approach, I've posted an excerpt [1] of the shared Python + SQLAlchemy + Postgres code we use to handle pagination at my company.
The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.
I was hoping to read about how they handled cleaning up cursors, what the effect on the server was when having a bunch of open, long-running cursors, etc. Unfortunately the article only treated the subject at a superficial level.
So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?
You probably assume they are talking about database cursors. The cursor is just a record ID in this article; there is no long-term storage of database cursors on the server. Assuming you can sort all your data, the next query just returns all records after the given record ID, plus the record with that ID.
One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.
It’s not a cursor as in a relational database cursor. It’s a cursor, as in an ID of an item in a stably ordered set. There’s no long-running anything to be worried about.
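On the deleted-record corner case: if the server resumes with a strict comparison against the cursor value rather than looking the cursor row up, the row doesn't need to still exist. A minimal sketch (hypothetical table, exclusive rather than inclusive cursor):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)", [(i,) for i in range(1, 8)])

def after(last_id, limit=3):
    # The "cursor" is just the last id the client saw; nothing is held open.
    return [row[0] for row in conn.execute(
        "SELECT id FROM records WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    )]

page = after(0)                                    # first page of ids
conn.execute("DELETE FROM records WHERE id = ?", (page[-1],))
resumed = after(page[-1])                          # works although the cursor row is gone
```

The comparison `id > ?` is satisfied by whatever rows remain, so a deleted cursor record degrades gracefully instead of crashing the pagination.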
I have a few main requirements for what I consider easy to use pagination.
1. I can set how many items per page.
2. I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.
3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.
You know, like how a physical book or catalogue with pages works.
Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.
> I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around
I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need of being able to jump to any specific page arises.
Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.
I totally agree with you from a usability standpoint, and while it is possible to make this work well, for larger-scale services it is rarely without costs; and for the most part, people don't seem to care as much as you'd think.
While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.
(I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)
The pagination con given in the article is wrong: switching to a stream isn't what fixes the dropping issue. Removing the sorting is likely why no records are being dropped (you wouldn't drop records using a numerically sorted ID or creation date).
Pagination is incredibly useful to humans. If I tell you I found something on page 15, you can relate to it, something I cannot do with infinite scroll.
If a user has to go to page 15 to find something useful to them then I would argue that's a bigger failure of the UX/filtering than it is a success of pagination.
Worth noting that another solution to the problems with offset cursors is to do updates. When you add/remove rows you just update the result set, and then there’s no issue with missing or duplicated rows.
Not easy to do of course, but it’s one direction you can go.
Why is offset-based pagination a thing at all? It sucks when a feed is very active, and it sucks when you link to a page or visit search results that may turn out to be hundreds of pages away a week later. I always wondered why nobody does the obvious (order_col > k, order by order_col, limit n). It's not a two-year-old mistake; it's two decades old, and sites still do it by default.
Ime, concurrent updates to the data set isn't a problem in practice and nobody cares if they occasionally get duplicate entries. The cluttered and sometimes useless urls cursors cause are, again ime, a much bigger usability problem. Sure, naively implemented queries suffer if the user types ?page=123456 in the url but such problems are quite easy to fix.
I think the challenge is always the stateless nature of HTTP. It's pretty hard to keep a connection to the API server that has your database cursor.
Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.
You can save the results of a search and page through it, but that's a lot of I/O.
You can not do any of the above, but then you run the risk of non-deterministic pagination (I went back and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.
These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.
At my current job, our intranet site has lackluster performance due, in part, to limit/offset pagination. Unfortunately, the business treats the "reports" we author like glorified spreadsheets and want the ability to filter on any column and order by any column. It makes it near impossible to tune/optimize.
The legacy system we're replacing used cursor pagination in a lot of areas and was perfectly acceptable to them, but now that we're going web based, it's not. Which is unfortunate, really, because it seems vastly superior.
My sympathies to the coders downstream of this large change for the amount of work required. I would like to add cursor based pagination to the APIs I manage. It would not be an option to go to our clients and explain that offset-pagination will be removed. There seems to be very little cost in supporting both.
AWS does the pagination thing for all LIST APIs, and it encodes the token in such a way that clients cannot deconstruct anything out of it. To them it is just an opaque string. In fact it is cryptographically encrypted, not something simple like a Base64 encoding of some JSON data.
And the token has enough metadata to reconstruct the query from where it left off in the previous call, and to perform additional security checks so that it cannot be used by other customers to get results from a different account.
I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.
I recommend this for any external service. For internal services, if you work closely enough with your users, maybe you can opt for something more transparent.
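AWS's actual token format isn't public, so purely as an illustration of the opaque-token idea: a round-trip for an HMAC-signed token in Python, with made-up field names. Note that signing only makes the token tamper-evident; unlike the encryption described above, the payload here is still readable by anyone who strips off the signature bytes.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # hypothetical; held only by the server

def encode_token(state: dict) -> str:
    """Serialize pagination state and prepend an HMAC-SHA256 signature."""
    payload = json.dumps(state, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()

def decode_token(token: str) -> dict:
    """Reject tokens whose payload was altered or signed with another key."""
    raw = base64.urlsafe_b64decode(token.encode())
    sig, payload = raw[:32], raw[32:]
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        raise ValueError("invalid pagination token")
    return json.loads(payload)

token = encode_token({"last_id": 42, "account": "123456789012"})
state = decode_token(token)
```

Binding the account into the signed state is what lets the server reject a token replayed by a different customer.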
I didn't see anything specific about timelines. I would expect you'd want to offer both for some length of time, with a deprecation notice and a sunset date for the older approach. But perhaps they see use cases in their logs showing the current offset approach is already minimally used, and it's better to switch now rather than later if/when adoption is higher?
simonw|3 years ago
This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset
aarondf|3 years ago
You can read more about it here: https://aaronfrancis.com/2022/efficient-pagination-using-def... or here https://planetscale.com/blog/fastpage-faster-offset-paginati....
There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!
Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.
ilammy|3 years ago
You get to see more ads while flipping through pages.
Timing metrics for loading smaller pages make marketing happy.
Timing metrics for time spent on the website make marketing happy.
Aeolun|3 years ago
It’ll be like Jira, pulling in literal megabytes of data and still slow.
feike|3 years ago
His site is a great resource for anyone wanting to take a deeper dive on SQL performance:
https://use-the-index-luke.com/sql/partial-results/fetch-nex...
ReactiveJelly|3 years ago
https://old.reddit.com/?count=25&after=t3_wtpvdp
I noticed Reddit's pagination has that "after" parameter, which points to the last post on the current page.
It glitches out if the last item is deleted by moderators, but otherwise it works smoothly.
croes|3 years ago
No need for row value syntax and it works with MS SQL Server
jvolkman|3 years ago
[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...
erikpukinskis|3 years ago
Do you just crash and ask the user to start over?
Do you have to nudge open cursors on every delete?
Aeolun|3 years ago
At least I do. Passing numbers under 10 is easy. Passing identifiers is hell.
yrgulation|3 years ago
I recommend using Elasticsearch or a NoSQL database to optimise performance. Relational databases can be slow for this use case.