top | item 37985790


soliton4 | 2 years ago

it didn't go into detail about the purpose of the = / == padding. and it also didn't show in the example how to handle data that cannot be divided into groups of 6 bits without bits left over. i think i have an understanding of how to do it but it would be nice to be certain. could someone address the following 2 questions in a short and exhaustive way:

- when do you use =, when do you use == and do you always add = / == or are there cases where you don't add = / == ?

- how to precisely handle leftover bits. for example the string "5byte". and is there anything to consider when decoding?


tangent128|2 years ago

Your questions are related.

For context: since a base64 character represents 6 bits, every block of three data bytes corresponds to a block of four base64 encoded characters. (8*3 == 24 == 6*4)
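To make the 3-bytes-to-4-characters arithmetic concrete, here is a small Python sketch. The helper name `encoded_length` is made up for illustration; the loop only cross-checks the formulas against the standard library's `base64` module.

```python
import base64
import math

def encoded_length(n_bytes: int, padded: bool = True) -> int:
    """Number of base64 characters needed for n_bytes of input (hypothetical helper)."""
    if padded:
        return 4 * math.ceil(n_bytes / 3)   # whole 4-character blocks
    return math.ceil(n_bytes * 8 / 6)       # just enough 6-bit digits

# Cross-check the formulas against the standard library for small inputs.
for n in range(50):
    encoded = base64.b64encode(bytes(n))
    assert len(encoded) == encoded_length(n)
    assert len(encoded.rstrip(b"=")) == encoded_length(n, padded=False)

print(encoded_length(5), encoded_length(5, padded=False))  # 8 7
```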

That means it's often convenient to process base64 data 4 characters at a time. (in the same way that it's often convenient to process hexadecimal data 2 characters at a time)

1) You use = to pad the encoded string to a multiple of 4 characters, adding zero, one, or two as needed to hit the next multiple-of-4.

So, "543210" becomes "543210==", "6543210" becomes "6543210=", and "76543210" doesn't need padding.

(You'll never need three = for padding, since one byte of data already needs at least two base64 characters)

2) Leftover bits should just be set to zero; the decoder can see that there aren't enough bits for a full byte and discard them.

3) In almost all modern cases, the padding isn't necessary, it's just convention.
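As a rough illustration of points 1 and 2, here is a deliberately naive encoder in Python, applied to the "5byte" string from the question. `b64encode_sketch` is a made-up name and the bit-string approach is chosen for readability, not speed; the final assert compares it with the standard library.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode_sketch(data: bytes) -> str:
    """Naive base64 encoder (illustrative only)."""
    # All input bits as one long string of '0'/'1' characters.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Point 2: zero-fill the leftover bits up to the next multiple of 6.
    bits += "0" * (-len(bits) % 6)
    digits = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Point 1: pad with '=' to the next multiple of 4 characters (0, 1, or 2 of them).
    return digits + "=" * (-len(digits) % 4)

# "5byte" is 5 bytes = 40 bits: six full digits plus 4 leftover bits,
# so 7 digits and a single '='.
print(b64encode_sketch(b"5byte"))  # NWJ5dGU=
assert b64encode_sketch(b"5byte") == base64.b64encode(b"5byte").decode()
```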

The Wikipedia article is pretty exhaustive: https://en.wikipedia.org/wiki/Base64

pixelbeat__|2 years ago

Padding is only required if concatenating / streaming encoded data. I.e. when there are padding chars _within_ the encoded stream.
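A small Python sketch of the concatenation case, assuming every piece was padded before being joined so that each 4-character block can be decoded on its own. `decode_concatenated` is a hypothetical helper, not a library function.

```python
import base64

def decode_concatenated(stream: str) -> bytes:
    """Decode padded base64 strings that were joined end to end.

    Assumes each piece was padded, so every 4-character block stands alone.
    """
    out = bytearray()
    for i in range(0, len(stream), 4):
        out += base64.b64decode(stream[i:i + 4])
    return bytes(out)

a = base64.b64encode(b"A").decode()   # "QQ=="
b = base64.b64encode(b"B").decode()   # "Qg=="
print(decode_concatenated(a + b))     # b'AB'

# Without padding the boundary is lost: "QQ" + "Qg" is just one ordinary
# 4-character block that decodes to three unrelated bytes.
print(base64.b64decode(a.rstrip("=") + b.rstrip("=")))
```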

Padding chars at the end (of stream / file / string) can be inferred from the length already processed, and thus are not strictly necessary.
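For instance, a decoder can restore trailing padding from the length alone. A minimal Python sketch, with `b64decode_unpadded` as a made-up helper name:

```python
import base64

def b64decode_unpadded(text: str) -> bytes:
    """Decode base64 text whose trailing '=' characters were stripped."""
    # A length of 1 mod 4 is impossible: one digit holds only 6 bits,
    # which is less than a single byte.
    if len(text) % 4 == 1:
        raise ValueError("invalid base64 length")
    # Re-add 0, 1, or 2 '=' to reach the next multiple of 4.
    return base64.b64decode(text + "=" * (-len(text) % 4))

print(b64decode_unpadded("NWJ5dGU"))  # b'5byte'
```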

Note that how padding is treated is quite subtle, and this has resulted in interesting variations in handling, as discussed at: https://eprint.iacr.org/2022/361.pdf

eatporktoo|2 years ago

from the article: "Every Base64 digit represents 6 bits of data. There are 8 bits in a byte, and the closest common multiple of 8 and 6 is 24. So 24 bits, or 3 bytes, can be represented using four 6-bit Base64 digits."

So you're essentially encoding in groups of 24 bits at a time. Once the data ends, you pad out the remainder of the 24 bits with = instead of A because A represents 000000 as data.

For the record, I had to read the whole thing twice to understand that too.

jameshart|2 years ago

Not quite. The '=' isn’t strictly padding - it’s the padding marker. You pad the original data with one or two bytes of zeroes. Then you add '=' to indicate how many such bytes you had to add.

This is because if you’ve only got one of the three bytes you’re going to need, your data looks like this:

   XXXXXXXX

Then when you group into 6 bit base64 numbers you get

   XXXXXX XX????

Which you have to pad with two bytes worth of zeroes because otherwise you don’t even have a full second digit.

   XXXXXX XX0000 000000 000000

So to encode all your data you still need the first two of these four base64 digits - although the second one will always have four zeroes in it, so it’ll be 0, 16, 32, or 48.

The '=' isn’t just telling you those last 12 bits are zeroes - it’s telling you to ignore the last four bits of the previous digit too.

Similarly with two bytes remaining:

   XXXXXXXX YYYYYYYY

That groups as

   XXXXXX XXYYYY YYYY??

Which pads out with one byte of zeroes to

   XXXXXX XXYYYY YYYY00 000000

And now your third digit is some multiple of 4 because it’s forced to contain zeroes.
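Both claims about the last real digit can be checked by brute force. A quick Python sketch using the standard `base64` module; `tail_digit_values` is a made-up helper that encodes every possible 1-byte or 2-byte input and collects the value of the digit just before the padding.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def tail_digit_values(tail_len: int) -> set[int]:
    """Values taken by the last non-'=' digit over all inputs of tail_len bytes."""
    values = set()
    for n in range(256 ** tail_len):
        enc = base64.b64encode(n.to_bytes(tail_len, "big")).decode()
        last_digit = enc.rstrip("=")[-1]
        values.add(ALPHABET.index(last_digit))
    return values

print(sorted(tail_digit_values(1)))  # [0, 16, 32, 48]
# With two bytes, the third digit is always a multiple of 4.
print(sorted(tail_digit_values(2)) == list(range(0, 64, 4)))  # True
```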

Funny side effect of this:

Some base64 decoders will accept a digit right before the padding that is not a multiple of four (with one byte of padding) or of 16 (with two).

They will decode the digit as normal, then discard the lower bits.

That means it’s possible in some decoders for dissimilar base64 strings to decode to the same binary value.

Which can occasionally be a security concern, when base64 strings are checked for equality, rather than their decoded values.
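A small Python demonstration, assuming CPython's `base64.b64decode` in its default (lenient) mode, which as far as I know behaves like the decoders described above. `is_canonical` is a hypothetical helper that detects such strings by re-encoding the decoded bytes.

```python
import base64

def is_canonical(encoded: str) -> bool:
    """True if re-encoding the decoded bytes reproduces the input exactly."""
    return base64.b64encode(base64.b64decode(encoded)).decode() == encoded

# 'Q' is digit 16 and 'R' is digit 17; with '==' only the top two bits of
# the second digit are kept, so both strings decode to the same byte.
print(base64.b64decode("QQ=="), base64.b64decode("QR=="))
print(is_canonical("QQ=="), is_canonical("QR=="))
```

Comparing decoded bytes, or rejecting inputs where `is_canonical` is false, avoids treating two spellings of the same value as different.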