top | item 26796923

mjs7231 | 4 years ago

Shouldn't this be considered a bug in Python? Why does it even try to evaluate 0xfor without the space? Trying a few other things:

* 0xfor1 evaluates.

* 1or 2 evaluates.

* 1or2 doesn't.

* ''or'foo' evaluates.

This is gross.
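A quick way to see what the tokenizer actually does with these is the stdlib `tokenize` module (a sketch; on recent CPython, gluing a literal onto a keyword also triggers a SyntaxWarning per the bug linked below, so warnings are suppressed here):

```python
import io
import tokenize
import warnings

def toks(src):
    """Return (token_name, text) pairs for one line of Python source."""
    with warnings.catch_warnings():
        # Recent CPython warns about literals glued to keywords (bpo-43833).
        warnings.simplefilter("ignore")
        return [(tokenize.tok_name[t.type], t.string)
                for t in tokenize.generate_tokens(io.StringIO(src).readline)
                if t.string.strip()]

print(toks("0xfor1"))    # the number stops at '0xf'; 'or1' is one NAME token
print(toks("1or 2"))     # splits into '1', 'or', '2'
print(toks("''or'foo'"))
```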

layer8|4 years ago

That’s the normal way lexers work, given “tight” token definitions. They continue adding to the current token until an invalid (for the current token type) character is reached, and then begin parsing a new token starting with the “invalid” (but now valid for the next token) character (or the next non-whitespace character).

“1or2” is lexed into “1” (integer) followed by “or2” (identifier), which is valid on the lexer level but then fails on the grammar level.
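That maximal-munch behavior can be sketched in a few lines (a toy lexer for illustration, not CPython's tokenizer; the token names are made up):

```python
import re

# Each token type keeps consuming characters for as long as they remain
# valid for it, then the next token starts at the first "invalid" character.
# HEXNUM must be tried before NUMBER so "0xf" isn't lexed as "0" + "xf".
TOKEN_RES = [
    ("HEXNUM", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME",   re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP",     re.compile(r"[+\-*/]")),
    ("WS",     re.compile(r"\s+")),
]

def lex(src):
    pos, out = 0, []
    while pos < len(src):
        for kind, rx in TOKEN_RES:
            m = rx.match(src, pos)
            if m:
                if kind != "WS":          # whitespace separates, then vanishes
                    out.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"bad character {src[pos]!r}")
    return out

print(lex("1or2"))    # [('NUMBER', '1'), ('NAME', 'or2')]
print(lex("0xfor1"))  # [('HEXNUM', '0xf'), ('NAME', 'or1')]
```

Both lines lex cleanly; it's the grammar that later rejects a NAME like `or2` where an operator was expected.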

sabhiram|4 years ago

The lexer, unfortunately, is a greedy token matcher. As soon as `0xf` "made sense" to it and `0xfo` did not, it did the same thing it would do with something like `0xf+3`, except the `+` was an `or` in this case, which is kosher. There is an idempotent step you can take where extra spaces are added before the AST is formed to make this sort of thing easier. The good news is that with a decent lint / format flow, these sorts of things are easy to catch.

gfiorav|4 years ago

Probably a lexer bug. "foo"or should never be processed as "foo" and the token OR.

njharman|4 years ago

Why not? "or" is an operator, like "+"; "foo"+"bar" should be valid, so why have a special, inconsistent case for "or"?

kristaps|4 years ago

Not by design, fortunately: https://bugs.python.org/issue43833

xxpor|4 years ago

That hasn't been confirmed. Are we sure that it's not an inherent ambiguity in the grammar?

goto11|4 years ago

It is not strictly speaking a bug, since it works as intended. But it is clearly counter-intuitive behavior and could be improved. Making 0xfor a syntax error would definitely be an improvement.

But requiring whitespace between all tokens is not an acceptable solution, since "2+2" should work. Always requiring whitespace between alphanumeric characters in different tokens would make sense.
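That last rule is easy to check mechanically: using the stdlib `tokenize` module, flag any two alphanumeric tokens that touch with no whitespace between them (a sketch of the proposed check, not anything CPython currently enforces; the function name is made up, and warnings about glued literals are suppressed):

```python
import io
import tokenize
import warnings

def adjacent_alnum_pairs(src):
    """Find pairs where an alphanumeric token is glued to the next one
    with no whitespace in between -- the case proposed above to ban."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # recent CPython warns on these
        toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
    alnum = (tokenize.NUMBER, tokenize.NAME)  # NAME covers keywords too
    return [(a.string, b.string)
            for a, b in zip(toks, toks[1:])
            if a.type in alnum and b.type in alnum
            and a.end == b.start]  # same (row, col): no gap between them

print(adjacent_alnum_pairs("2+2"))      # [] -- fine, '+' separates the tokens
print(adjacent_alnum_pairs("0xfor 1"))  # [('0xf', 'or')] -- would be rejected
```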