top | item 28090738

(no title)

neur4lnet | 4 years ago

Its compression a legitimate concern for primary keys?

discuss

Definitely, I've run into this problem many times. UUIDs are pretty bulky data types in database terms, usually larger than most other types in a typical table on average and possibly larger than rest of the row after including the index overhead, which creates cache pressure. Logical columns in a table are commonly stored physically as a vector, of UUIDs in this case, expressly for the purpose of compressing the representation. This saves both disk cache and CPU cache. For queries, the benefit of scanning compressed vectors is that it reduces the average number of page faults per query, which is one of the major bottlenecks in databases.

Also, some data models (think graphs) tend to be not much more than giant collections of primary keys. Using the smallest primary key data type that will satisfy the requirements of the data model is a standard performance optimization in databases. It is not uncommon for the UUID that the user sees to be derived from a stored primary key that is a 32-bit or 64-bit integer.

NyxWulf|4 years ago

Uuids are bulky if you store them in text form, but in binary form they are only 128 bits.

The main feature of a uuid is it allows distributed generation. 32-bit or 64-bit integers are almost always sequential numbers. The sequential nature allows efficient page filling and index creation, but the contention involved in creating a sequence grows rapidly with scale.

So while a 128-bit uuid is larger than a 64 bit integer, this version allows for the bulk of the benefits of sequential integers while reducing the biggest drawback of contention at the point of creation.

bob1029|4 years ago

If you are getting good compression on your primary keys, this may be cause for alarm.

These should be high-entropy items.

jandrewrogers|4 years ago

The sole constraint on a primary key is that it is unique. Even in distributed systems, deterministically unique keys are almost always much more compressible than probabilistically unique keys, which is a double win.

Should you desire a high-entropy key, e.g. for hash tables or security, this can be trivially derived from the low-entropy key while still guaranteeing deterministic uniqueness.

Compressible primary keys are superior in every way.