
Time for a WTF MySQL Moment

327 points | gbl08ma | 5 years ago | gbl08ma.com

114 comments


munk-a|5 years ago

I'm a big fan of Postgres too for a number of reasons, but this issue is pretty clearly documented, so I'd like to counter with an issue I hit in Postgres recently that is terribly documented. UNNEST works a bit funky, and in particular it works super funky if you have multiple calls in the same select statement (or any set-expanding function calls, it turns out). There's a bit of a dive into it here[1] (though that is out of date: PG10 no longer produces the differing-array-size result, it null-fills instead), which I managed to find after struggling with an experimental query that was producing nulls in its output while unnesting arrays that contained no nulls.

All DBs have their warts, and while MySQL has an overabundance of them, they tend to be quite well documented. The warts Postgres has tend to be quite buried: its documentation is very good for syntax comprehension but rather light when it comes to deeper learning.

1. https://stackoverflow.com/questions/50364475/how-to-force-po...

eyelidlessness|5 years ago

I am also a big fan of Postgres, and tend to have a bit of fun picking on MySQL having been scarred by it in a past life. But since we're picking at Postgres warts...

One which bit me recently, and is still utterly baffling to me, is that a column defined as an array type will accept values of that array's type in any number of dimensions greater than that specified for the column. In other words, `{{{{{{text}}}}}}` can be inserted into columns of the following types:

- `TEXT[]`

- `TEXT[][]`

- `TEXT[][][]`

- `TEXT[][][][]`

- `TEXT[][][][][]`

- `TEXT[][][][][][]`

The inverse is true as well! A column specified `TEXT[][]` (and so on) will accept `{text}`. Of course, none of this (as far as I've been able to find) is documented.

But wait, there's more!

`UNNEST` does not allow you to specify depth, it always unnests to the deepest dimension. This, too, is undocumented. In fact, it's anti-documented. The documents provide an example function to unnest a two-dimensional array that is wholly unnecessary (and likely performs worse than the built-in `UNNEST`, but I'm just guessing). Said documentation would seem to imply that the depth of `UNNEST` is 1, but of course that's not the case.

But wait, there's more still!

What if you want to get at a nested array? Idk, I'm sure it's possible, but if you thought `SELECT that_array[1]` is the way to do it, look under your seat because you're getting a `NULL`!
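All three oddities reproduce in a scratch session; here's a sketch (table and value names are invented, so treat it as illustrative rather than verified output):

```sql
-- A 2-D column happily accepts 1-D and 6-D values.
CREATE TABLE t (v TEXT[][]);
INSERT INTO t VALUES ('{a}');            -- one dimension: accepted
INSERT INTO t VALUES ('{{{{{{a}}}}}}');  -- six dimensions: accepted

SELECT unnest(v) FROM t;  -- flattens all the way down, regardless of depth
SELECT v[1] FROM t;       -- NULL for any row holding a multi-dimensional value
```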

- - -

Postscript: I discovered the first part of this in a production system where a migration had incorrectly nested some data, and where that data was in turn causing certain requests to unexpectedly fail. Of course, given that this was in production, I didn't have a lot of time to research the issue. Found the problem, fixed it, moved on with my day. In the course of fixing it, I discovered the `UNNEST` issue, which... okay fun, fix it a slightly different way than I expected.

So in the course of verifying the particulars to write this comment, I played around with some things, and discovered the `NULL` issue.

At least when Postgres has wildly unexpected behavior, it's exceptionally unexpected behavior.

masklinn|5 years ago

> I'm a big fan of Postgres too for a number of reasons, but this issue is pretty clearly documented

Of course it is; the documentation is where TFAA got the information in the 4th paragraph of the story, out of 15 or so.

The range itself is what nerd-sniped the author and led them to try to find out why MySQL had such an odd yet specific range.

benesch|5 years ago

> I hit in Postgres recently that is terribly documented.

I'm going to have to disagree with you there. This issue is quite well documented in the "SQL Functions Returning Sets" section [0]. The relevant bit starts thusly:

> ...Set-returning functions can be nested in a select list, although that is not allowed in FROM-clause items. In such cases, each level of nesting is treated separately, as though it were a separate LATERAL ROWS FROM( ... ) item...

And there's even a note about the crazy behavior pre-PostgreSQL 10:

> Before PostgreSQL 10, putting more than one set-returning function in the same select list did not behave very sensibly unless they always produced equal numbers of rows. Otherwise, what you got was a number of output rows equal to the least common multiple of the numbers of rows produced by the set-returning functions. Also, nested set-returning functions did not work as described above; instead, a set-returning function could have at most one set-returning argument, and each nest of set-returning functions was run independently. Also, conditional execution (set-returning functions inside CASE etc) was previously allowed, complicating things even more. Use of the LATERAL syntax is recommended when writing queries that need to work in older PostgreSQL versions, because that will give consistent results across different versions.

I agree that allowing SRFs in the SELECT clause is a wart that should never have been permitted, but I think the PostgreSQL docs do a pretty great job describing both the old behavior and the new behavior that has to balance backwards compatibility with sensibility.

(And, indeed, the 9.6 docs have this to say on the behavior of SRFs in the SELECT list: "The key problem with using set-returning functions in the select list, rather than the FROM clause, is that putting more than one set-returning function in the same select list does not behave very sensibly.")
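To make the PG10+ rule concrete, a quick sketch (made-up values):

```sql
-- Two set-returning functions in one select list; since PG10 the shorter
-- result is null-padded: (1,10), (2,20), (3,NULL).
SELECT unnest(ARRAY[1, 2, 3]) AS a,
       unnest(ARRAY[10, 20])  AS b;

-- The version-stable spelling the docs recommend:
SELECT a, b
FROM ROWS FROM (unnest(ARRAY[1, 2, 3]), unnest(ARRAY[10, 20])) AS t(a, b);
```

Pre-10, the first query would instead produce LCM(3, 2) = 6 rows, cycling both sets.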

I do think one notable defect with the PostgreSQL docs is that they were designed in a time before modern search engines. They are better understood as a written manual in electronic form. Almost always the information you need is there, but possibly not in the chapter that Google will surface. But there are all sorts of tricks you can use if you update your mental model of how to read the PostgreSQL docs. For example, there's an old-style index! [1]

[0]: https://www.postgresql.org/docs/current/xfunc-sql.html#XFUNC...

[1]: https://www.postgresql.org/docs/current/bookindex.html

simias|5 years ago

> This format is even less wieldy than the current one, requiring multiplication and division to do basically anything with it, except string formatting and parsing – once again showing that MySQL places too much value on string IO and not so much on having types that are convenient for internal operations and non-string-based protocols.

Not necessarily an odd choice in the Olden Days; after all, BCD representation used to be pretty popular. By modern standards it's insane, but at a time when binary-to-decimal conversions could be a serious performance concern it might have made sense. For instance, if you had a time stored as hours, minutes, and seconds and wanted to add or subtract one of these TIME values, you could do it without a single multiply or divide.
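To sketch the kind of arithmetic that representation enables (the tuple layout here is invented for illustration, not MySQL's actual storage format):

```python
# Toy fields-based time addition: nothing but adds, subtracts, and
# comparisons -- no multiply or divide anywhere.
def hms_add(a, b):
    h, m, s = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    if s >= 60:          # carry seconds into minutes
        s -= 60
        m += 1
    if m >= 60:          # carry minutes into hours
        m -= 60
        h += 1
    return (h, m, s)

print(hms_add((1, 45, 30), (2, 20, 40)))   # (4, 6, 10)
```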

Now, I was 8 when MySQL was first released in 1995, so I can't really comment on whether that choice made sense back then. 1995 does seem a bit late for BCD shenanigans, but maybe they based their design on existing applications and de facto standards that could easily go back to the '80s.

kstrauser|5 years ago

I'll comment: it absolutely didn't make sense back then, either. We didn't use BCD for pretty much anything in '95. If anything, all timestamps were 32 bit signed ints.

Edit: plenty of things still stored dates as strings where the emphasis of the app was on displaying information. Int and float types carried the day whenever any kind of math was going to be used, or when you wanted to output the data in multiple formats.

Roboprog|5 years ago

https://en.wikipedia.org/wiki/Binary-coded_decimal#Packed_BC...

Earlier than the 60s. Had to work on a system in the 90s to “transpile” IBM minicomputer code to C.

Useful for accounting since what you see is what you get as far as cents and other decimals go. 1/5, and thus 1/10, can only be approximated in floating point.

Not that I’m defending MySQL

njharman|5 years ago

This is going to sound insulting, maybe it is, sorry. It's definitely subjective.

The reason I reach for Postgres over MySQL isn't features or technical superiority, although those result from the reason. Which is: PG devs consistently have "taste", they have "good" style. They make good choices. MySQL devs are not consistently strong in these areas. I'm guessing that MySQL is now so full of tech and design debt (like the OP issue) that they're simply stuck, without a choice.

kaslai|5 years ago

In the MySQL vs. PgSQL comparison, it basically feels like MySQL went for fast performance first and worked toward correct, useful behavior later, where PgSQL went for correct, useful behavior first and fast performance later. While the end results after decades of development are comparable, the echoes of those very different beginnings remain in the current products.

smitty1e|5 years ago

MySQL and PHP are a dynamic duo that never fail to surprise.

And not in a desirable way.

dancemethis|5 years ago

I mean, with the exception of the name, of course.

"postgre" feels and sounds like a plate full of very wet, oily rice and the subsequent addition of clear vomit to the mix.

xeeeeeeeeeeenu|5 years ago

kochthesecond|5 years ago

Nice

These three points have made me raving mad when working with MySQL:

- The default 'latin1' character set is in fact cp1252, not ISO-8859-1, meaning it contains the extra characters from the Windows codepage. 'latin2', however, is ISO-8859-2.

- The 'utf8' character set is limited to Unicode characters that encode to 1-3 bytes in UTF-8. 'utf8mb4' was added in MySQL 5.5.3 and supports characters encoding to up to 4 bytes. UTF-8 has been defined to encode characters to up to 4 bytes since 2003.

- Neither the 'utf8' nor 'utf8mb4' character sets have any case-sensitive collation other than 'utf8_bin' and 'utf8mb4_bin', which sort characters by their numeric codepoint.

utf8 being effectively an alias of utf8mb3 has cost us so much work it's not even funny.
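Both the cp1252 mix-up and the 3-byte limit are easy to demonstrate outside MySQL (a quick Python illustration):

```python
# The "extra" cp1252 characters live at 0x80-0x9F, where true ISO-8859-1
# has only invisible C1 control characters.
raw = bytes([0x80, 0x93, 0x99])
print(raw.decode("cp1252"))        # €“™
print(raw.decode("iso-8859-1"))    # three control characters, nothing visible

# And the utf8 vs utf8mb4 split: anything outside the BMP encodes to
# 4 bytes, which MySQL's 3-byte 'utf8' rejects.
print(len("😀".encode("utf-8")))   # 4
```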

jakobdabo|5 years ago

Another one: if you happen to keep big numeric values in a VARCHAR field, be careful: the engine will convert them to DOUBLE when comparing against a big integer, and interesting things will happen. For example, "SELECT id, clmn FROM tbl WHERE clmn = 999999999999999999" may return not only records where `clmn` is '999999999999999999' but also ones where it is, say, '999999999999999998' (because those big integers are too big for DOUBLE and, when converted, have the same representation).

So the correct query is "SELECT id, clmn FROM tbl WHERE clmn = '999999999999999999'"
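The collapse is visible outside MySQL too; a double has only 53 mantissa bits, so distinct 18-digit integers can round to the very same value:

```python
# Both strings round to the same IEEE-754 double (1e18), which is why a
# DOUBLE-based comparison matches both rows.
a = float("999999999999999999")
b = float("999999999999999998")
print(a == b)                                     # True
print(999999999999999999 == 999999999999999998)   # False, as exact integers
```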

justin_oaks|5 years ago

Those are certainly valid WTFs. I've been bitten by the GROUP_CONCAT issue before.

This list is missing the WTF that cascaded deletes/updates don't cause triggers to fire on child tables:

https://bugs.mysql.com/bug.php?id=11472

arriu|5 years ago

Alright, seriously, thank you for this list! You just solved two mysteries that had my team going in circles.

l0b0|5 years ago

Is there something similar for Postgres? I realize it might be a lot shorter, but any system big enough has its share of gotchas.

eska|5 years ago

This gives me flashbacks to the DOS filesystem timestamps. It's exactly the same mistake. By splitting the date into multiple fields, bits are wasted. If they hadn't tried to be smart and just made it one number, it would've been more precise with a wider range.

Dylan16807|5 years ago

I don't really see it as the same thing. Using multiple fields of bits works fine here. 60 possibilities for seconds and minutes fit very well into 6 bits each. A raw 24-bit integer gives you plus or minus 2330 hours, and if you don't reserve a bit the format they actually went with gives you plus or minus 2047 hours. Since it just needs to represent 999 hours and change, that's fine.

The problem is the weird non-bit packing they did before.

DOS is different, partly because there are fields that fit worse, and partly because 32 bits is just barely enough to store seconds in the first place. If you did a DOS-style packing with 34 or 38 bits it would work fine. And it would be able to represent leap seconds, making it arguably better than unix timestamps!
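The packing itself is trivial; a toy sketch (layout invented here: 6 bits each for seconds and minutes, the remaining bits for hours):

```python
# 6 bits comfortably hold 0-59; hours get whatever bits are left over.
def pack(h, m, s):
    return (h << 12) | (m << 6) | s

def unpack(v):
    return (v >> 12, (v >> 6) & 0x3F, v & 0x3F)

print(unpack(pack(838, 59, 59)))   # (838, 59, 59) -- fits in just 22 bits
```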

falcolas|5 years ago

FWIW, even keeping an interval in a single number still imposes limits. As proven by the epoch's 2038 problem.

jarym|5 years ago

MySQL has come a long long way indeed but not enough to make me turn away from Postgres.

Florin_Andrei|5 years ago

I used to see literally this same comment as far back as 10 years ago, maybe more.

Just saying - I have no horse in this race.

pippy|5 years ago

For me, MySQL isn't even worth touching. Just this week, their 'stable' binaries from their website threw segfaults while executing large SQL batches on my dev box. I replaced it with MariaDB and had no issues.

nitramt|5 years ago

That's yet another incompatibility with the ISO/ANSI SQL specification. In the specification, the TIME type is defined as containing HOUR, MINUTE and SECOND fields, representing the "hour within day", "minute within hour" and "second within minute" values, respectively. So the valid range for that type is supposed to be "00:00:00.00000..." to "23:59:59.99999...". It's not intended to represent an interval, although that seems to be the intended semantics of MySQL's TIME type:

(from https://dev.mysql.com/doc/refman/8.0/en/time.html):

> but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative).

For representing temporal intervals, the specification defines two kinds of INTERVAL types (year-month and day-time). Year-month intervals can represent intervals in terms of years, months, or a combination of the two. Similarly, day-time intervals can represent intervals in terms of days, hours, minutes or seconds, or combinations of them (e.g., days+hours, days+hours+minutes, hours+minutes, etc.)

As a sidenote, the TIME and DATE types are related to the TIMESTAMP type in that TIMESTAMP can be thought of as the combination of a DATE part (year, month, day) and a TIME part (hour, minute, second).
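For comparison, here is what the standard's two interval kinds look like in Postgres syntax (a sketch, not tied to any particular schema):

```sql
SELECT INTERVAL '2 years 3 months';   -- year-month interval
SELECT INTERVAL '5 days 04:30:00';    -- day-time interval
SELECT INTERVAL '900 hours';          -- far beyond MySQL TIME's 838-hour cap
```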

hans_castorp|5 years ago

Well, MySQL has a history of ignoring the SQL standard even for the simplest things (even ones that were defined before MySQL existed), so it's not really surprising they got the TIME data type wrong as well.

gigatexal|5 years ago

Here’s a Redshift oddity that I don’t think is documented:

select sum(y.a), count(y.a) from (select distinct x.a from (select 1 as a union all select 2 as a union all select 1 as a) x) y

 sum | count
-----+-------
   4 |     3

SQLite returns the correct result: a sum of 3 and a count of 2.

To fix this, don't use subqueries.
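The SQLite side is easy to verify from a script (same query, just reformatted):

```python
import sqlite3

query = """
    SELECT sum(y.a), count(y.a)
    FROM (SELECT DISTINCT x.a
          FROM (SELECT 1 AS a UNION ALL
                SELECT 2 AS a UNION ALL
                SELECT 1 AS a) x) y
"""
con = sqlite3.connect(":memory:")
print(con.execute(query).fetchone())   # (3, 2)
```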

TedShiller|5 years ago

I gave up on MySQL a long time ago, when I realized that I had to activate special types of settings just to make Unicode characters work in tables.

In Postgres, it just works out of the box.

chrisan|5 years ago

It must have been more than 10 years ago that utf8mb4 was added, after which you don't have to activate any kind of special settings.

I was using MSSQL prior to 2010, so I have no idea how MySQL handled Unicode before that.

sfilargi|5 years ago

“I am struggling to imagine the circumstances where ..... can break ....”

If only I had a penny for everyone I heard this argument and we ended up breaking regression tests or something really obscure in the qa or customer setup

nullsense|5 years ago

I always catch myself when I'm saying it and try to make sense of the cognitive dissonance of saying nothing will break and being somewhat convinced that something, somewhere will.

sfilargi|5 years ago

everytime* not everyone

And yes, I have done it so many times myself too

crazygringo|5 years ago

Ultimately this is a bit of a rant about why MySQL didn't bother changing the TIME type to support an elegant maximum value of 1,024 hours instead of 838.

But, seriously? Who cares? It's not even close to an extra order of magnitude of range. The type is obviously meant to be used for time values that have a context of hours within a day, supporting a few days as headroom... so supporting 1,024 instead of 838 is pointless -- if you're getting anywhere even close to the max value, you probably shouldn't be using this type in the first place.

And yes, it's probably best not to change it for backwards compatibility. Can I imagine a case where it could break something? No, not off the top of my head. But it probably would break some application somewhere. And for such a widely deployed piece of critical foundational infrastructure, being conservative is the way to go.

Nothing about this seems WTF at all, except for the author's seeming opinion that elegant, power-of-two ranges ought to trump backwards compatibility with things that probably made sense at the time.

jmnicolas|5 years ago

IIRC there was also another problem with MySQL and .NET: MySQL used year 0 for an uninitialized DateTime, but .NET's DateTime starts at year 1.

I think the workaround was to pass some parameter in the connection string.

mikorym|5 years ago

TL;DR

So the answer is just that it's for backwards compatibility with MySQL 3?

I was kind of hoping for more.

meritt|5 years ago

[deleted]

robsws|5 years ago

The MySQL documentation suggests using it to store a time interval:

"The hours part may be so large because the TIME type can be used not only to represent a time of day (which must be less than 24 hours), but also elapsed time or a time interval between two events (which may be much greater than 24 hours, or even negative)." [1]

How is the author's usage incorrect? Sure there's a documented allowed range, but it seems to me that the complaint is about the poor size of the range relative to other SQL technologies.

[1] https://dev.mysql.com/doc/refman/5.7/en/time.html

rdiddly|5 years ago

"If we want to complain that MySQL has an utterly useless TIME datatype, and separately they really should have an INTERVAL datatype, then let's do that."

Uhh, done?

"The issue" is that MySQL has a TIME type that is goofy as all get-out, and this guy's friend had a bug that prompted him to look into the TIME type and find out how goofy it is. I don't recall him complaining that it's not in the docs (since that's where he looked it up). The bug was only a story, the thing that prompted him to look into it. Take a breath, this isn't StackOverflow (not that being rude and offended/butthurt is welcome there either).

janwillemb|5 years ago

The issue seems to be that the .NET core provider for MySQL that OP uses maps the .NET Timespan type to this MySQL TIME datatype. They probably didn't think this through.

theon144|5 years ago

>If we want to complain that MySQL has an utterly useless TIME datatype, and separately they really should have an INTERVAL datatype, then let's do that.

Well isn't this complaint basically the last paragraph of TFA verbatim?

parliament32|5 years ago

A vast majority of developer issues can be boiled down to RTFM, as usual.

wnevets|5 years ago

> Not being too experienced with MySQL, as I prefer PostgreSQL

Because that's why. He likes PostgreSQL and you need to know about it.

falcolas|5 years ago

The issue is that the author's expectations didn't match reality, because he hadn't yet read the docs.

The moment the author wrote "I normally use PostgreSQL" I knew that this would devolve into a "MySQL Bad" even if it's well defined and documented behavior.

falcolas|5 years ago

So, mysql is preserving backwards compatibility? Good. A flag to break this backwards compatibility and offer a larger range is probably warranted, but until that's implemented this behavior is fine.

"I think this is stupid" is a really poor reason to break backwards compatibility, despite how many other software projects use this reasoning.

But of course, MySQL bad, PostgreSQL good.

Sebb767|5 years ago

> "I think this is stupid" is a really poor reason to break backwards compatibility, despite how many other software projects use this reasoning.

True. But his argument seems to be along the lines of "this is highly unexpected behaviour", and I tend to agree. The number of applications that would break from extending the range is probably dwarfed by the number of bugs avoided by not having the time span cap out at such a strange length.

hans_castorp|5 years ago

MySQL wouldn't be in that situation if they had paid attention to the definition of the TIME (and INTERVAL) data types specified in the SQL standard long before MySQL was created.

hackbinary|5 years ago

I learned the semi-hard way on a PoC system where the MySQL index got corrupted.

I was lucky and could simply redeploy my application, but I have never used MySQL since.

falcolas|5 years ago

How does a corrupted index and a TIME type come into play with each other?

jspaetzel|5 years ago

How'd this get voted so many points? RTFM and use a more appropriate datatype.