Why Python 3 Exists

[+] andrewstuart|10 years ago|reply

IMO one of the reasons for all the angst is that .encode() and .decode() are so ambiguous and unintuitive which makes them incredibly confusing to use. Which direction are you converting? From what to what? The whole unicode thing is hard enough to understand without Python's encoding and decoding functions adding to the mystery. I still have to refer to the documentation to make sure I'm encoding or decoding as expected.

I think there would have been much less of a problem if encode and decode were far more obvious, unambiguous and intuitive to use. Probably without there being two functions.

Still a problem of course today.

[+] dietrichepp|10 years ago|reply

Hm, I never saw this as ambiguous at all, except for a few weird encodings that Python has as "convenience" method.

Here's how you remember it: "Unicode" is not an encoding. It never was, it never will be. Of course, the data must be encoded in memory somehow, but in Python 3, you cannot be sure what encoding that is because it's not really exposed to the user. From what I understand, there are different encodings that string objects will use, transparently, in order to save memory!

You always "encode" something into bytes, and "decode" bytes back into something. There should be exactly two functions, because the functions have different types: "encode" is str -> bytes, "decode" is bytes -> str. Explicit is better than implicit.

    output = input.decode(self.coding)

With Python 3, I instantly know that "input" is bytes and "output" is str.

[+] aidos|10 years ago|reply

The confusion you're talking about is a Python 2 problem, not one of ambiguity.

Encoding and decoding are pretty well defined (though, I've never thought about the formal definition before). When you have an entity in its native form, it needs to be _encoded_ for the purposes of communication (in a broad sense). The encoded message can then be to be _decoded_ back to the natural form. There is no ambiguity.

Really, the reason people get them mixed up in Python is because Python 2 totally stacked it by adding str.encode and unicode.decode.

In Python 2, you can _decode_ unicode to unicode – which it does by silently _encoding_ as ascii first. This operation is total madness.

    >>> u = u'\xe9'
    >>> u.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

The error here is in the step where you _encode_ the unicode to acsii, something you didn't ask it to do at all.

And similarly, you can encode str to str (where str is really bytes in Python 2, another issue that adds to the confusion).

Don't get me wrong. I feel your pain. When I was using Python 2 I also got confused about what form things were in and where I needed to encode / decode.

Honestly, once I switched to Python 3, that cognitive overhead just totally vanished. str is the natural form of text, and if I need to store / communicate it I _encode_ it to bytes (utf8, generally). When I'm loading a stored/transmitted message, I _decode_ it to its natural text form.

There are edge cases that make certain situations more complex, but in terms of general usage, I feel Python 3 really got this stuff right.

[+] zeeboo|10 years ago|reply

Nobody has a problem with encode/decode when talking about, say, json. You have some semantic object that you want to encode to some bytes representation, and some bytes representation that you want to decode into some semantic object.

The confusions is around people viewing byte strings implicitly as ascii codepoints, and not fully understanding what the things they're looking at actually are.

[+] marklgr|10 years ago|reply

Encoded things are not meant to be human readable, like some sort of "secret code". The underlying bytes of some Unicode string probably won't be human-readable if you read it byte per byte--it will be like the platform's "secret code" for your string.

Conversely, you can decode the byte message into something readable, ie. your Unicode string.

[+] cjauvin|10 years ago|reply

I know it's a deeply subjective and personal thing unlikely to help anyone other than me, but I've always successfully relied on the following mnemonic for this particular problem:

* unicode -> encode, u -> e (vowels)

* str -> decode, s -> d (consonants)

[+] rcthompson|10 years ago|reply

I've really tried to figure out how to use these methods properly, but in the end, the only successful strategy I've found essentially boils down to randomly sprinkling encodes and decodes throughout my code until errors stop happening. Half the time I end up using an encode where my interpretation of the documentation suggested a decode would be required, or vice versa. (This is in Python 2. I haven't tried Python 3 yet since I don't do much Python coding these days.)

[+] jlarocco|10 years ago|reply

It is easy to confuse the names, but it takes 2 seconds in the REPL or in Idle to figure it out, and the concept behind them is pretty obvious, IMO.

I'll take an occasional 2 seconds in the REPL over the headache of debugging codec issues any day of the week.

[+] Too|10 years ago|reply

For me this confusion arises because of ducktyping together with ambiguous interfaces. Take open() as example, it can return a file that reads either in binary more or automatically decodes into unicode strings, just based on if you passed a string argument mode='rb' vs mode='r', also if you enable universal newlines it will implicitly assume some encoding. This in itself all kind of makes sense once you know it, but the danger lies in that it's code that worked in py2, and even now it will still run because most functions that work on bytes also work on unicode. It's not until you try to combine these variables with data from other sources(like string literals) which were/weren't unicode you notice this and then you don't know which side wad correct so you randomly just shotgun either decode on one side or encode on the other side. This could be 100 lines away from where the problem should actually have been solved and people being people solves the symptoms instead of the causes.

[+] the_mitsuhiko|10 years ago|reply

> .encode() and .decode() are so ambiguous and unintuitive

They are not. What is unintuitive is the default encoding in Python where an encode can trigger an implicit encode and the other way round. The `encode()`/`decode()` availability of strings was never a problem has you have many bytes -> bytes and str -> str codecs;

[+] mark-r|10 years ago|reply

I agree that encode() and decode() are ambiguous, I find myself pausing to make sure I'm using the right one.

You can use bytes(string,encoding) to replace encode(). Unfortunately it doesn't have a default encoding, which makes it a pain to use. And str(bstring) isn't symmetric, it can't replace decode().

[+] est|10 years ago|reply

The problem is that both u''.encode() and ''.encode() (at least in Python 2) exists. Why?

[+] Animats|10 years ago|reply

Unicode worked just fine by Python 2.6. I had a whole system with a web crawler and HTML parsers which did everything in Unicode internally. You had to use "unicode()" instead of "str()" in many places, but that wasn't a serious problem.

By Python 2.7, there were types "unicode", "str", and "bytes". That made sense. "str" and "bytes" were still the same thing, for backwards compatibility, but it was clear where things were going. The next step seemed to be a hard break between "str" and "bytes", where "str" would be limited to 0..127 ASCII values. Binary I/O would then return "bytes", which could be decoded into "unicode" or "str" when required. So there was a clear migration path forward.

Python 3 dumped in a whole bunch of incompatible changes that had nothing to do with Unicode, which is why there's still more Python 2 running than Python 3. It was Python's Perl 6 moment.

From the article: "Obviously it will take decades to see if Python 3 code in the world outstrips Python 2 code in terms of lines of code." Right. Seven years in, Python 2.x still has far more use than Python 3. About a year ago, I converted a moderately large system from Python 2 to Python 3, and it took about a month of pain. Not because of the language changes, but because the third-party packages for Python 3 were so buggy. I should not have been the one to discover that the Python connector for MySQL/MariaDB could not do a "LOAD DATA LOCAL" of a large data set. Clearly, no one had ever used that code in production.

One of the big problems with Python and its developers is that the core developers take the position that the quality of third party packages is someone else's problem. Python doesn't even have a third party package repository - PyPI is a link farm of links to packages elsewhere. You can't file a bug report or submit a patch through it. Perl's CPAN is a repository with quality control, bug reporting, and Q/A. Go has good libraries for most server-side tasks, mostly written at Google or used at Google, so you know they've been exercised on lots of data.

That "build it and they will convert" attitude and the growth of alternatives to Python is what killed Python 3.

[+] pwang|10 years ago|reply

> That "build it and they will convert" attitude and the growth of alternatives to Python is what killed Python 3.

Well said.

[+] danso|10 years ago|reply

> We have decided as a team that a change as big as unicode/str/bytes will never happen so abruptly again. When we started Python 3 we thought/hoped that the community would do what Python did and do one last feature release supporting Python 2 and then cut over to Python 3 development for feature development while doing bugfix releases only for the Python 2 version.

I'm guessing it's not a coincidence that string encoding was also behind the Great Sadness of Moving From Ruby 1.8 to 1.9. How have other mainstream languages made this jump, if it was needed, and were they able to do it in a non-breaking way?

https://news.ycombinator.com/item?id=1162122

[+] bdarnell|10 years ago|reply

C and C++ are so widely used that transitions like this are made not at the language level but at the level of platforms or other communities. Some parts of the C/C++ world made this transition relatively seamlessly, while others got caught in the same traps as Python.

The key is UTF-8: UTF-8 is a superset of 7-bit ASCII, so as long as you only convert to/from other encodings at the boundaries of your system, unicode can be introduced to the internal components in a gradual and mostly-compatible way. You only get in trouble when you decide that you need separate "byte string" and "character string" data types (which is generally a mistake: due to the existince of combining characters, graphemes are variable-width even if you're using strings composed of unicode code points, so you don't gain much by using UCS-4 character strings instead of UTF-8 byte strings).

My theory is that the python 3 transition would have gone much smoother and still accomplished its goals if they had left the implicit str/bytes conversion in place but just made it use UTF-8 instead of ASCII (although in environments like Windows where UTF-16 is important this may not have worked out as well).

[+] perlgeek|10 years ago|reply

Well, Perl 6 also introduces a distinction between strings (Unicode) und buffers (octets), but it also introduces loads of other changes, to the point where it's its own language.

That said, the Unicode features were a major thing planned for PHP 6, and (afaict) one of the reasons there has never been a PHP 6, but rather they went straight from 5 to 7.

I'm not aware of any graceful transitions.

[+] autarch|10 years ago|reply

Perl 5 introduced Unicode support in 5.6.0 (2000) but it was kind of a mess. It was essentially redesigned in 5.8.0 (2002). At that point it was fairly buggy, but by 5.8.8 (2006) it was in pretty good shape. The most recent versions of Perl 5 (5.22.1 was released a few days ago) have excellent Unicode support.

That said, Perl 5 does not have different types for strings and bytes, which is definitely a source of bugs. Since Unicode support is essentially opt-in for a library (you have to explicitly decode/encode data somehow) it's easily possible for a library you use to break your data. Most of the major libraries (database, HTTP, etc.) have long since been patched to respect Unicode in appropriate ways, so the state of Unicode in Perl 5 is good overall.

[+] tzs|10 years ago|reply

In the Reddit discussion of this, someone linked to this criticism [1] of Python 3's Unicode handling written by Armin Ronacher, author of the Flask framework.

I am not competent to say whether this is spot on or rubbish or somewhere in between [2], but it seemed interesting at least.

[1] http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

[2] Almost all of my Python 2 experience is in homework assignments in MOOCs for problems where there was no need to care about whether strings were ASCII, UTF-8, binary, or something else. My Python 3 experience is a handful of small scripts in environments where everything was ASCII.

[+] lmm|10 years ago|reply

It's wrong. The whole point of cat is to concatenate files. But if you concatenate two files with different encodings, you end up with an unreadable file. So you want cat to error out if one of the files that was passed has a different encoding from the encoding you told it to use, which is exactly what the python cat will do.

[+] mark-r|10 years ago|reply

It seems simple to me - if you want bytes, open the file in binary mode, if you want strings open it in text mode.

The only glitch is with stdin/stdout. They're opened on your behalf before your program even starts, and the assumption is that you'll be reading and writing text in the default OS encoding. This doesn't mesh well with the Unix pipe paradigm.

[+] rkrzr|10 years ago|reply

IMO the biggest reason to use Python3 is its concurrency support via async + await.

Fixing the unicode mess is nice too of course, but you can get most of the benefits in Python2 as well, by simply putting this at the top of all of your source files:

from __future__ import unicode_literals

Also make sure to decode all data from the outside as early as possible and only encode it again when it goes back to disk or the network etc.

[+] tudborg|10 years ago|reply

So much this! asyncio was the main selling point for me, but in general, why not follow the language?

I never really understood the "rather stay with py2.7" thing. I get it with big old monolithic applications. You don't "just" rewrite those, but _every new python project_ should be done with the latest stable release.

Is anyone starting their PHP projects on PHP4? Any new node projects in 0.10? Of course not, that would be moronic.

[+] ak217|10 years ago|reply

For the record, you do not get most of the benefits by doing the future import. The silent Unicode/bytes coercion and lack of standard library support is still there, making it almost impossible to write correct non-ascii-handling software in Python 2.

[+] BuckRogers|10 years ago|reply

I chose to port from CPython2 to PyPy4, rather than to CPython3. It just made more sense. I for one see no value in Python3 (unicode has been supported since 2.6). My reasons for migrating to PyPy4 instead of Python3-

1) It was easier than porting to CP3.

2) It gave me a tangible benefit by removing all CPU performance worries once and for all. Added "performance" as a feature for Python. Worth the testing involved.

3) It removed the GIL. If you use PyPy4 STM, which is currently a separate JIT. Which will be at some point merged back into PyPy4.

So for me, Python3 can't possibly compete, and likely never will with PyPy4 once you consider the performance and existing code that runs with it. PyPy3 is old, beta, not production-ready, based on 3.2 and Py3 is moving so fast I don't think PyPy3 would be able to keep up if they tried.

Python3 is dead to me. There's not enough value for a new language. I'm not worried about library support because Py2 is still bigger than 3 and 2.7 will be supported by 3rd party libraries for a very long time else choose irrelevance (Python3 was released in 2008, and still struggling to justify its existence...). My views on the language changes themselves are stated much better by Mark Lutz[0]. I'm more likely to leave Python entirely for a new platform than I am to migrate to Python3.

PyPy is the future of Python. If the PyPy team announces within the next 5 years they're taking the mantle of Python2, that would be the nail in the coffin. All they have to do is sit back and backport whatever features the Python2/PyPy4 community wants into PyPy4 from CPython3 as those guys run off with their experiments bloating their language. I believe it's all desperation, throwing any feature against the wall. Yet doing irreparable harm bloating the language, making the famous "beginner friendly" language the exact opposite.

I already consider myself a PyPy4 programmer, so I hope they make it an official language to match the implementation. There's also Pyston to keep an eye on which is also effectively 2.x only at this time.

[0]http://learning-python.com/books/python-changes-2014-plus.ht...

[+] cname|10 years ago|reply

There's just a bit of hyperbole in your comment. Most major libraries have been ported to Python 3. I wonder if the opposite of what you're saying will happen--i.e, the libraries that don't support Python 3 will be left behind. Fabric is an example of that for me.

[+] Avernar|10 years ago|reply

Going to PyPy4 would be nice. But mod_wsgi doesn't support it. It has something to do with PyPy not implementing Python's embedding API I belive.

So I'll be sticking with 2.7 for now.

[+] rdslw|10 years ago|reply

I love when people with native english skills write monsters like this: "If people's hopes of coding bug-free code in Python 2 actually panned out then I wouldn't consistently hear from basically every person who ports their project to Python 3 that they found latent bugs in their code regarding encoding and decoding of text and binary data."

This should be under penalty ;)

Anyone to divide it into few simpler sentences?

UPDATE: And another one from our connected sentences loving author: "We assumed that more code would be written in Python 3 than in Python 2 over a long-enough time frame assuming we didn't botch Python 3 as it would last longer than Python 2 and be used more once Python 2.7 was only used for legacy projects and not new ones."

[+] teek|10 years ago|reply

The first one:

> If people's hopes of coding bug-free code in Python 2 actually panned out

Python2 developers wanted to write bug-free code.

code = for the purpose of processing text and binary data

> then I wouldn't consistently hear from basically every person

Python2 developers could not write bug free code. So they complained.

complained = complained about their algorithms having bugs when they rewrote those algorithms in Python3

> that they found latent bugs in their code regarding encoding and decoding of text and binary data.

Python2 code written by the same developers had bugs that they did not know about.

When the same developers rewrote their code in Python3, they found the bugs.

(If Python3 did not exist, then it would be very hard to write bug-free code in Python2.)

The second one:

> We assumed that more code would be written in Python 3 than in Python 2 over a long-enough time frame assuming we didn't botch Python 3 as it would last longer than Python 2 and be used more once Python 2.7 was only used for legacy projects and not new ones.

If we designed Python 3 correctly, then we expect Python 3 to live longer than Python 2. We also expect more code to be written in Python 3 for the same reason. We also expect only old projects will be written in Python 2.7.

[+] smegel|10 years ago|reply

If people's hopes of coding bug-free code in Python 2 actually panned out

then they would have bug free code

thus when they port their code to python 3, the unicode changes would not reveal latent (existing) bugs

thus they would not blog about or tell people about said bugs, hence the author of that sentence would not have heard about such bugs

moral of the story: Python 2 hides certain kind of unicode related bugs that are not exposed until you port to Python 3.

[+] nemmons|10 years ago|reply

Maybe something like:

"I consistently hear from basically every person who ports their project to Python 3 that they found latent bugs in their code (regarding encoding/decoding of text and binary data). This would not be the case if people's hopes of coding bug free code had actually panned out."

That's a little better, but still not great.

[+] criley2|10 years ago|reply

Division of this sentence would reduce it's legibility.

It's a simple IF <condition> THEN <result>.

You can argue that he's overly verbose, but breaking the IF/THEN into multiple sentences reduces their connection and the ability to understand.

"If we were actually creating bug free code in Python 2 then porting code to Python 3 would be seamless, which it is not".

[+] Scarbutt|10 years ago|reply

Since python3 is not backwards compatible with python2, why didn't the python devs leverage the opportunity for creating a more performant non-GIL runtime for python3?

[+] nulltype|10 years ago|reply

So Python 2 did not have super obvious string handling. One of the odd things that they seemingly could have fixed pretty easily is to change the default encoding from 'ascii' to 'utf8'. That would have fixed a bunch of the UnicodeDecodeErrors that were the most obvious problem with strings: http://www.ianbicking.org/illusive-setdefaultencoding.html

If they had to make Python 3 anyway, I think the main thing they were missing is that they should have added a JIT. That makes upgrading to Python 3 a much easier argument. If the only point of the JIT was to add a selling point to Python 3, that probably would have been worth it.

[+] collinmanderson|10 years ago|reply

It seems to me if bytes/unicode was the only breaking change we would probably be over the transition by now.

There are a lot of other subtle changes that makes the transition harder: comparison changes and keys() being an iterator for example. These are good long term changes, but I wish they weren't bundled in with the bytes/unicode changes.

[+] cft|10 years ago|reply

We migrated to Go from Python 2, since instead of incompatible Python 3 we needed faster Python 2 replacement.

[+] diimdeep|10 years ago|reply

Str is tip of the iceberg. Python before 2.7 and current Python is completely different language semantically; methods, functions, statements, expressions, Global interpreter lock behavior.. This is sad that this blog post and discussions around it didn't mention anything about it.

[+] rcthompson|10 years ago|reply

The article isn't covering all the differences between Python 2 and Python 3. Based on this article as well as other articles I've read in the past, the Unicode issue was the original reason they decided it was necessary to break backward-compatibility, but once that decision was made, there was no reason not to make any further backward-incompatible improvements.

[+] PythonicAlpha|10 years ago|reply

The reason, I still did not port to Python 3:

(and yes, Unicode in Py2 is a mess ...)

They just broke to many things (unnecessarily!) internally. Particularly they changed many C APIs for enhancement modules, so that all of them had to be ported, before they could be used with Python 3. They did not even consider a portability layer ... why not??

Some (not all) of the bad decisions (like the u"..." strings) they did change afterwards, but than it was a little late.

So many modules are still not ported to Python 3 -- so the hurdle is a little to high -- for small to nil benefits!

So, the problem (from my side) is not Unicode at all ... just the lack of reasonable support from the deciders side.

---

Maybe, some time later, when I have to much spare time.

[+] henrik_w|10 years ago|reply

This is a pretty good explanation of unicode in Python: http://nedbatchelder.com/text/unipain.html

[+] euske|10 years ago|reply

I like Python3 personally. It's new and better but a different branch. I'm annoyed by people abbreviating it as "Python" and treating it as a substitute for Python2. In my opinion, the "Python" name should be exclusively used for Python2, and Python3 should've been always used as one word. The whole Python3 situation caused unnecessary confusion to the outside (non-Python) people, which I think could be avoided.

[+] makecheck|10 years ago|reply

Since I'm trying to keep a small footprint, I rely on the system version of Python on Mac OS X, which is 2.7.10 now.

To use anything newer, I'd have to ask users to install a different interpreter, or bundle a particular version that adds bloat. There's no point. The most I've done is to import a few things from __future__; otherwise, my interest in Python 3 begins when Apple installs it.

[+] echlebek|10 years ago|reply

The Go authors have solved this problem thoroughly. When working in Go, I usually never have to think about this.

https://blog.golang.org/strings

[+] eugenekolo2|10 years ago|reply

Go came out in 2009. What's your point? I'd sure hope they'd look at languages older than them.

[+] niels_olson|10 years ago|reply

How long is the transition going to take? Serious question. Because I'm rather tired of starting new work and finding some module that drags me back to 2.x.

[+] ubernostrum|10 years ago|reply

Technically, "forever", since there will be people who never port their code. If you're depending on one of those holdouts, it's time to find a new dependency, because if they haven't ported by now they won't and that's your problem because...

in practical terms, the transition is about to be over, since now the Linux distros are all-in on converting to Python 3 for their current or next releases and that will forcibly move the unported libraries in the bin of obsolescence.

[+] onesixtythree|10 years ago|reply

From the outside, Python 3 seems like a much better language. I don't have strong views of its object system (I avoid OOP as much as I can) but it seems like the string/bytes handling is much better, and I'm also a fan of map and filter returning generators rather than being hard-coded to a list implementation (stream fusion is a good thing). Also, I fail to see any value in print being a special keyword instead of a regular function (as in 3).

What I don't get is: why has Python 3 adoption been so slow? Is it just backward compatibility, or are there deeper problems with it that I'm not aware of?

[+] mathgenius|10 years ago|reply

Ok, fine. Can we have the print statement back?

[+] untothebreach|10 years ago|reply

Genuinely curious, why do you prefer `print` to be a statement rather than a function? I've heard a lot of criticisms of Py3, but this is the first time I've heard this one.

[+] gvalkov|10 years ago|reply

Print as a function is a definite improvement in terms of functionality and API (`print >>fileobj, ...` always looked like syntax from another language). I too though I'd be bothered by the extra (), mainly due to muscle memory, but after spending a week with it, I could hardly feel any inconvenience.

Setting up an abbrev or a snippet also helps. I use these a lot:

    p<tab>   -> print(|)
    pp<tab>  -> import pprint; pprint.pprint(|)

266 comments