That’s the normal way lexers work, given “tight” token definitions. They continue adding to the current token until an invalid (for the current token type) character is reached, and then begin parsing a new token starting with the “invalid” (but now valid for the next token) character (or the next non-whitespace character).
layer8|4 years ago
“1or2” is lexed into “1” (integer) followed by “or2” (identifier), which is valid on the lexer level but then fails on the grammar level.
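You can watch that split happen with Python's own `tokenize` module (a standard-library illustration; newer CPython versions may additionally emit a SyntaxWarning for such adjacent literals, but the token stream comes out the same):

```python
import io
import tokenize

def lex(source):
    """Return (token_name, token_string) pairs for one line of source."""
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)
    ]

print(lex("1or2"))    # the digit run ends at 'o', so 'or2' becomes one NAME
print(lex("1 or 2"))  # with spaces, 'or' is lexed as its own NAME token
```

So `1or2` is perfectly fine to the lexer; it is the parser that has no rule for NUMBER followed directly by NAME.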
The lexer is, unfortunately, a greedy token matcher. As soon as `0xf` made sense to it and `0xfo` did not, it did the same thing it would do in a case like `0xf+3`, except that here the separator token was `or` instead of `+`, which is just as kosher. There is an idempotent step you can take where extra spaces are inserted before the AST is formed, which makes this sort of thing easier to spot. The good news is that with a decent lint/format flow, these cases are easy to catch.
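The greedy ("maximal munch") behavior is easy to reproduce with a toy lexer. This sketch is hypothetical, not CPython's actual tokenizer: each rule's regex munches as many characters as it can, so `0xfor` falls apart into `0xf` plus `or` exactly as described above:

```python
import re

# Token rules tried in order; each regex matches greedily (maximal munch).
TOKEN_RULES = [
    ("HEX_INT", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("INT",     re.compile(r"[0-9]+")),
    ("NAME",    re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP",      re.compile(r"[+\-*/]")),
    ("WS",      re.compile(r"\s+")),
]

def toy_lex(source):
    """Greedily consume the longest token at each position."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            m = pattern.match(source, pos)
            if m:
                if name != "WS":  # whitespace separates tokens but is dropped
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"bad character at {pos}: {source[pos]!r}")
    return tokens

print(toy_lex("0xfor 3"))  # HEX_INT '0xf', then NAME 'or', then INT '3'
print(toy_lex("0xf+3"))    # HEX_INT '0xf', then OP '+', then INT '3'
```

The hex rule stops at the first non-hex-digit character, so `o` ends the literal and a fresh identifier token begins, with or without whitespace in between.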
That's not totally clear. A bug being filed doesn't mean it's accepted. And this has been (ab)used for quite some time in various Python code golf. See https://codegolf.stackexchange.com/a/56 from 2011.
It is not, strictly speaking, a bug, since it works as intended. But it is clearly counter-intuitive behavior and could be improved. Making `0xfor` a syntax error would definitely be an improvement.
goto11|4 years ago
But requiring whitespace between all tokens is not an acceptable solution, since "2+2" should work. Always requiring whitespace between alphanumeric characters in different tokens would make sense.
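That proposed rule can be prototyped as a lint check (a sketch, not a feature of CPython or any existing linter): tokenize the line and flag any two adjacent tokens that touch with alphanumeric characters on both sides of the seam.

```python
import io
import tokenize

def touching_alnum_tokens(source):
    """Flag adjacent token pairs that abut with alphanumerics on both sides."""
    toks = [t for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.string.strip()]  # keep only real, visible tokens
    flagged = []
    for prev, cur in zip(toks, toks[1:]):
        if (prev.end == cur.start             # no whitespace between them
                and prev.string[-1].isalnum()
                and cur.string[0].isalnum()):
            flagged.append((prev.string, cur.string))
    return flagged

print(touching_alnum_tokens("1or2"))  # [('1', 'or2')] -- would be rejected
print(touching_alnum_tokens("2+2"))   # [] -- fine, '+' separates the digits
```

Under this rule `2+2` stays legal because the seams are digit/operator, while `1or2` and `0xfor` get flagged because a digit or hex digit touches a letter.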
Sohcahtoa82|4 years ago
https://docs.python.org/3/reference/lexical_analysis.html#wh...
This is not a bug.