top | item 32495846

We switched to cursor-based pagination

120 points | qin | 3 years ago | moderntreasury.com

114 comments


simonw|3 years ago

This post is not about database cursors. It's about the style of pagination where you have a ?_next=xxx link to get to the next page, where the xxx bit encodes details about the last item on the current page such that the next page can show everything that comes after that record.

This is also sometimes known as keyset pagination. My favourite technical explanation of that is here: https://use-the-index-luke.com/no-offset
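The keyset idea can be sketched in a few lines of Python with sqlite3 (the table name, column names, and page size here are hypothetical):

```python
# Minimal keyset ("cursor") pagination sketch. The cursor is just the id of
# the last row on the previous page, like the xxx in ?_next=xxx.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 26)])

PAGE_SIZE = 10

def fetch_page(after_id=None):
    """Return one page plus the cursor for the next page (None at the end)."""
    if after_id is None:
        rows = conn.execute(
            "SELECT id, name FROM items ORDER BY id LIMIT ?",
            (PAGE_SIZE,)).fetchall()
    else:
        # Seek past the last-seen row instead of counting an offset.
        rows = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (after_id, PAGE_SIZE)).fetchall()
    next_cursor = rows[-1][0] if len(rows) == PAGE_SIZE else None
    return rows, next_cursor

page1, cur1 = fetch_page()
page2, cur2 = fetch_page(cur1)
```

The `WHERE id > ?` seek hits the primary-key index directly, so fetching page 1,000 costs the same as fetching page 1.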

paulmd|3 years ago

Sounds great on paper, but my experience is that key-based indexing is broken on Solr and likely other Lucene-engine DBs. If you are evaluating conditions on child objects inside a parent document, the child objects get split out and stored as a separate document that is joined… and if that child document is in a different block/file on disk, it won’t necessarily be inside the range being scanned, so you will be missing some results that meet your logical criteria.

Possibly just an implementation error in the BlockJoinParser but it did not occur with numeric pagination.

To work around that, you need to “flatten” your JSON, e.g. with the @JsonUnwrapped annotation, and some structures (like arrays) may become problematic and/or require significant lexical mapping of queries to the dataset.

clementmas|3 years ago

Most users don't care but I very much prefer to have URLs with `/?page=123` instead of `?_next=xxx`. I should get over it but meanwhile I'll stick to offset pagination with deferred join (see aarondf's comment).

jlg23|3 years ago

This post is about "database cursors" and "keyset pagination". In practice, these terms refer to the same thing, one seen bottom-up the other seen top-down. Implementation-wise, one saves the state of the cursor in the pagination [parameters] and resumes reading from the DB with an equivalent cursor.

aarondf|3 years ago

There are ways to mitigate (although not eliminate) the slowing down of offset/limit pagination on later pages. The technique is called a "deferred join" and it is most effective in MySQL. The basic idea is to paginate over as little data as necessary, and then do a self-join to get the rest of the data for a single page.

You can read more about it here: https://aaronfrancis.com/2022/efficient-pagination-using-def... or here https://planetscale.com/blog/fastpage-faster-offset-paginati....

There are libraries for Laravel (https://github.com/hammerstonedev/fast-paginate) and Rails (https://github.com/planetscale/fast_page) as well!

Cursor based pagination is wonderful, but sometimes you're stuck with offset/limit for whatever reason. Might as well make it fast.
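As a rough sketch of the deferred-join shape (run here on sqlite3 for self-containedness, though the thread notes the performance win is mainly a MySQL planner quirk; table and column names are hypothetical):

```python
# Deferred join: first paginate over the narrow primary-key index,
# then self-join to pull the wide row data for just one page.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(i, "x" * 100) for i in range(1, 1001)])

# Naive offset pagination materializes (and discards) every wide row
# up to the offset:
naive = "SELECT id, body FROM posts ORDER BY id LIMIT 10 OFFSET 900"

# Deferred join: the inner query only touches the indexed id column;
# the outer join fetches only the 10 wide rows that survive.
deferred = """
SELECT p.id, p.body
FROM posts AS p
JOIN (SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 900) AS page
  ON p.id = page.id
ORDER BY p.id
"""

rows = conn.execute(deferred).fetchall()
```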

barrkel|3 years ago

To be clear, this technique (which it seems I independently discovered in 2015) mostly only works in MySQL because other databases usually have planners which are smart enough to not pull everything in eagerly.

MySQL is fairly predictable, though, so when you understand that it wants to nested-loop join all your rows before evaluating predicates on the parent table, it's a predictable win to stop it doing that.

The technique is still applicable even when you have no joins, because MySQL will materialize rows with every selected column before evaluating the unindexed portion of the predicate, and the order by.

hamandcheese|3 years ago

Cursor based pagination doesn’t actually solve the described pitfalls if your results are ordered by anything mutable, though.

A result you haven’t yet seen can be mutated to sort before your current cursor, likewise a result that you’ve already seen can be mutated to sort after the current cursor, causing you to see it twice.

Cursor based pagination does minimize the issue somewhat, because only the mutated rows are possibly affected, not unrelated records that lie on page boundaries. But depending on the use case I’m not sure that is worth that added complexity of cursor based pagination (it does get a bit tricky once you have any non-trivial sort clause).
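The tricky multi-column case alluded to above can be sketched like this (hypothetical table; the key point is that the WHERE clause must mirror the ORDER BY exactly, including a unique tie-breaker column):

```python
# Keyset cursor over a non-trivial sort: score DESC, id ASC as tie-breaker.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [(1, 50), (2, 50), (3, 40), (4, 60)])

last_score, last_id = 50, 1  # cursor: last row of the previous page

# "Everything after (score=50, id=1) in (score DESC, id ASC) order":
rows = conn.execute(
    """SELECT id, score FROM results
       WHERE score < ? OR (score = ? AND id > ?)
       ORDER BY score DESC, id ASC
       LIMIT 2""",
    (last_score, last_score, last_id)).fetchall()
```

Each extra sort column adds another branch to the OR, which is exactly where hand-written cursor predicates tend to go wrong.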

simonw|3 years ago

The only approach I can think of which might be able to handle mutability of the rows used for sorting would be to support paginating through a snapshot: provide a mechanism whereby a full snapshot of the query at a specific point in time is captured such that the user can then paginate through that snapshot.

Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.

andy800|3 years ago

Why does every web site default to aggressively paginate their information? Pagination sucks, it's a waste of time and of clicks, and should be a last resort. Sure, when Google returns millions of search results, paginate. But:

For instance, if you have 40 entries and specify that there should be 10 items per page

40 entries??? Just show them all to me. My browser has a scrollbar, CTRL-F is much faster than your search box.

"but not everyone has a reliable fast connection" -- yes, which is a good reason to deliver more data per http request than breaking it up and requiring lots of slow requests.

"but the database load" -- half the queries, each returning 2x data, is almost always going to be easier on a RDBMS. If it's not then you probably need to rethink your schema.

ilammy|3 years ago

> Why does every web site default to aggressively paginate their information?

You get to see more ads while flipping through pages.

Timing metrics for loading smaller pages make marketing happy.

Timing metrics for time spent on the website make marketing happy.

VWWHFSfQ|3 years ago

> Why does every web site default to aggressively paginate their information?

Because data retrieval costs money and resources. Unbounded API responses are susceptible to denial-of-service. Well-architected APIs will always cap the number of results they return per call.

notriddle|3 years ago

How many people actually read past entry five?

hamandcheese|3 years ago

I’ve seen instances where the issue is not database performance, but web server performance. Serializing a graphql response in graphql-ruby, for example, can easily take 100+ms and grows linearly with number of records in a request.

In the general case this isn’t a good excuse, there’s no reason it should take that long. But you don’t always get to pick your technology.

phendrenad2|3 years ago

I'd guess that most people have written some bad query that only worked for small limits, so they started superstitiously making their limits very low.

Aeolun|3 years ago

And this kind of thinking is how you end up with 2mb per request.

It’ll be like Jira, pulling in literal megabytes of data and still slow.

feike|3 years ago

Reminds me of Markus Winand, who hands out stickers at database conferences banning OFFSET.

His site is a great resource for anyone wanting to take a deeper dive on SQL performance:

https://use-the-index-luke.com/sql/partial-results/fetch-nex...

ReactiveJelly|3 years ago

So basically do it like Reddit?

https://old.reddit.com/?count=25&after=t3_wtpvdp

I noticed Reddit's pagination has that "after" parameter, which points to the last post on the current page.

It glitches out if the last item is deleted by moderators, but otherwise it works smoothly.

croes|3 years ago

Can't he just use rowversion?

No need for row value syntax and it works with MS SQL Server

dinkledunk|3 years ago

How do you jump to an arbitrary page?

jvolkman|3 years ago

For anyone curious about this approach, I've posted an excerpt [1] of the shared Python + SQLAlchemy + Postgres code we use to handle pagination at my company.

The name "cursor pagination" is super confusing given that "cursor" is an overloaded term in databases. I always call this "token pagination", given that the APIs I've seen usually call the value a token.

[1] https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474...

drdec|3 years ago

I was hoping to read about how they handled cleaning up cursors, what the effect on the server was when having a bunch of open, long-running cursors, etc. Unfortunately the article only treated the subject at a superficial level.

So, anyone here implement pagination via cursors? What do you find to be the drawbacks and how do you mitigate them?

imbusy111|3 years ago

You probably assume they are talking about database cursors. In this article, the cursor is just a record ID. There is no long-term storage of database cursors on the server. Assuming you can sort all your data, the next query just returns all records after the given record ID, plus the record with that ID.

One corner case would be if the cursor record is deleted. I don't see it mentioned how they handle it.

erikpukinskis|3 years ago

Yah, another downside of cursor-based pagination is: what happens when the record the cursor refers to is deleted?

Do you just crash and ask the user to start over?

Do you have to nudge open cursors on every delete?

christophilus|3 years ago

It’s not a cursor as in a relational database cursor. It’s a cursor, as in an ID of an item in a stably ordered set. There’s no long-running anything to be worried about.

fein|3 years ago

I have a few main requirements for what I consider easy to use pagination.

1. I can set how many items per page.

2. I can get to an arbitrary page, either through a path/query param or at least close to it, with a pagination row that contains a way to jump around. If an item gets removed, whatever I was looking for should still be in the same vicinity.

3. As a result of #1 and #2, I can go back and find items I saw previously because their placement is at some fairly reliable position on the page I remember it being on or around.

You know, like how a physical book or catalogue with pages works.

Please stop trying to improve on this and making usability more difficult. I hate infinity scroll and I hate not being able to pick my page in an intuitive fashion.

nemothekid|3 years ago

>I can get to an arbitrary page either through a path/ query param or at least close with a pagination row that contains a way to jump around

I've had quite a few heated discussions on this point. The problem is, once your dataset gets large enough, this use case is incredibly difficult to scale; not impossible (Google does it with millions of search results), but prohibitively expensive compared to how often the need of being able to jump to any specific page arises.

Now I always try to stick to cursor based pagination as the default in order to prevent people from building workflows on top of offsets.

int0x2e|3 years ago

I totally agree with you from a usability standpoint. While it is possible to make this work well, for larger-scale services it is rarely without costs, and for the most part people don't seem to care as much as you'd think.

While it can be frustrating, I doubt much revenue (relatively speaking) was lost due to this issue, which means for most apps and services, it will likely not get done.

(I'm not disputing the merit of your arguments, just explaining why this will rarely get done well in real life...)

debrice|3 years ago

The pagination con given in the article is wrong: switching to a stream isn't what fixed the dropping issue. Removing the sorting is likely why no records are being dropped (you wouldn't drop records when sorting by a numerical ID or creation date).

Pagination is incredibly useful to humans. If I tell you I found something on page 15, you can relate to it; I can't do that with infinite scroll.

indecisive_user|3 years ago

If a user has to go to page 15 to find something useful to them then I would argue that's a bigger failure of the UX/filtering than it is a success of pagination.

erikpukinskis|3 years ago

Worth noting that another solution to the problems with offset cursors is to do updates. When you add/remove rows you just update the results set and then there’s no issue with missing or duplicated rows.

Not easy to do of course, but it’s one direction you can go.

wruza|3 years ago

Why is offset-based pagination a thing at all? It sucks when a feed is very active, it sucks when you link to a page or visit search results, where it may turn out to be hundreds of pages away in a week. I always wondered why nobody does some obvious (order_col > k, order by order_col, limit n). It’s not a two years old mistake, it’s two decades and sites still do that by default.

Aeolun|3 years ago

Because normal people enjoy referring to ‘item 4 on page 3’ instead of ‘search for item 57533898-2’.

At least I do. Passing numbers under 10 is easy. Passing identifiers is hell.

bjourne|3 years ago

Ime, concurrent updates to the data set isn't a problem in practice and nobody cares if they occasionally get duplicate entries. The cluttered and sometimes useless urls cursors cause are, again ime, a much bigger usability problem. Sure, naively implemented queries suffer if the user types ?page=123456 in the url but such problems are quite easy to fix.

nickkell|3 years ago

How do you fix those problems then? Let’s say you have a “last page” button in the UI for example

mike_hock|3 years ago

TL;DR: The headline. TFA doesn't really add any information.

G3rn0ti|3 years ago

Well, except that TFA explains what DB cursors are and why they are faster than page offsets (they seek straight to the next entry instead of scanning past all the skipped ones).

camgunz|3 years ago

I think the challenge is always the stateless nature of HTTP. It's pretty hard to keep a connection to the API server that has your database cursor.

Everything else has big tradeoffs. You can make sure your rows are sortable by a specific column, but that means your searches must all be sorted by that column.

You can save the results of a search and page through it, but that's a lot of I/O.

Or you can do none of the above, but then you run the risk of non-deterministic pagination (I went back, and because someone added/deleted a row, that page is slightly different now). You also can't really link to it.

These things might all be fine. In my experience though, you run into people who really want to do search in way A, and will use the tradeoffs for way B/C/D to argue against it, as though A has no tradeoffs.

WesleyJohnson|3 years ago

At my current job, our intranet site has lackluster performance due, in part, to limit/offset pagination. Unfortunately, the business treats the "reports" we author like glorified spreadsheets and want the ability to filter on any column and order by any column. It makes it near impossible to tune/optimize.

The legacy system we're replacing used cursor pagination in a lot of areas and was perfectly acceptable to them, but now that we're going web based - it's not. Unfortunately, really - it seems vastly superior.

yrgulation|3 years ago

“It makes it near impossible to tune/optimize.”

I recommend using elastic search or a nosql database to optimise performance. Relational databases can be slow for this use case.

hirundo|3 years ago

My sympathies to the coders downstream of this large change for the amount of work required. I would like to add cursor based pagination to the APIs I manage. It would not be an option to go to our clients and explain that offset-pagination will be removed. There seems to be very little cost in supporting both.

redditor98654|3 years ago

AWS does the pagination thing for all LIST APIs, and they encode the token in a way that clients cannot deconstruct anything out of it. To clients it is just an opaque string. In fact it is encrypted, not something simple like a Base64 encoding of some JSON data.

And the token has enough metadata to reconstruct the query from where it left off in the previous call, and to perform additional security checks so that it cannot be used by other customers to get results from a different account.
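AWS's actual token format isn't public, but an opaque, tamper-evident token can be sketched with a signed payload using only the Python stdlib. A real implementation would encrypt rather than merely sign (this version prevents tampering but doesn't hide the contents), and would bind the token to the calling account; all names here are hypothetical:

```python
# Opaque pagination token: serialize resume state, sign it, Base64 the blob.
import base64, hashlib, hmac, json

SECRET = b"server-side-secret"  # hypothetical server-held key

def encode_token(state: dict) -> str:
    payload = json.dumps(state, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()

def decode_token(token: str) -> dict:
    raw = base64.urlsafe_b64decode(token.encode())
    sig, payload = raw[:32], raw[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered or foreign token")
    return json.loads(payload)

token = encode_token({"last_id": 42, "account": "a-123"})
state = decode_token(token)
```

Because clients only ever see an opaque string, the server is free to change what the token means internally, which is what makes migrations like the Aurora-to-DynamoDB one above possible without breaking callers.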

I have actually migrated databases from Aurora to DynamoDb behind the scenes where customers continued to call these APIs with these tokens and the calls would flip to the new DB and resume from the last place. No downtime or "maintenance window" which would be a non-starter at AWS anyway.

I recommend this for any external service. For internal services, if you work closely enough with your users, may be you can opt for something more transparent.

mgkimsal|3 years ago

I didn't see anything specific about timelines. I would expect you'd want to offer both for some length of time, with a deprecation notice and a sunset date for the older approach. But perhaps they see use cases in their logs showing that offset pagination is minimally used already, and it's better to switch now rather than later, if/when adoption is higher?

radiojasper|3 years ago

[deleted]

ReactiveJelly|3 years ago

If we're still using SQL, what did PHP do correctly to make pagination easier?