cschmidt | 8 days ago | on: The Brand Age
Looks great. I just ordered it. Thanks for the recommendation.
cschmidt | 3 months ago | on: Google boss says AI investment boom has 'elements of irrationality'
There are equal-weight S&P ETFs, which avoid having a handful of stocks dominate. However, they do have to do a lot more rebalancing to keep things in line.
cschmidt | 4 months ago | on: Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?
cschmidt | 7 months ago | on: Eleven Music
I worry how often that is happening already on Spotify.
cschmidt | 7 months ago | on: Stanford’s Department of Management Science and Engineering
I’m not sure about this master’s program, but the undergrad program seems to be proper ORMS.
cschmidt | 7 months ago | on: Stanford’s Department of Management Science and Engineering
I think in this context Management Science is an older term that was synonymous with operations research. The flagship journal of INFORMS (the Institute for Operations Research and the Management Sciences) has the same name. It’s about studying how to optimize things, with lots of statistics and math. Stanford was at the forefront of the field from George Dantzig onwards. So it’s not trying to make management a “science” in this case.
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
Attention does help, which is why an LLM can learn arithmetic even with arbitrary tokenization. However, if you put numbers in a standard form, such as right-to-left groups of 3, you make it an easier problem for the LLM to learn: all the examples it sees are in the same format. Here, the issue is that BLT operates in an autoregressive manner (strictly left to right), which makes it harder to tokenize the digits in a way that is easier for the LLM to learn. Making each digit its own token (Llama style), or flipping the digits, might be the best options.
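A minimal sketch of the "standard form" idea (the helper name is mine, not from any of the papers): split a digit string into right-to-left groups of 3, so every number a tokenizer sees has a consistent place-value layout.

```python
def group_digits(s: str) -> list[str]:
    """Split a digit string into groups of 3, anchored at the ones place."""
    groups = []
    while s:
        groups.append(s[-3:])  # take the rightmost (least significant) 3 digits
        s = s[:-3]
    return groups[::-1]  # restore left-to-right reading order

print(group_digits("1234567"))  # ['1', '234', '567']
```

Because grouping is anchored on the right, "567" always means the three lowest-order digits no matter how long the number is.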
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
Math operations go right to left through the digits, while we write numbers left to right. So if you see the digits 123... in an autoregressive manner, you really don't know anything yet, since the number could be 12345 or 1234567. If you flip 12345 into 54321, you know the place value of each digit as you encounter it: the 5 you see first is in the ones place, the 4 is in the tens place, etc. That gives the LLM a better chance of learning arithmetic.
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
Virtually all current tokenization schemes do work at the raw byte level, not the UTF-8 character level. They do this to avoid the out-of-vocabulary (OOV), or unknown token, problem. In older models, if you came across something in the data you couldn't tokenize, you added an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocabulary. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all UTF-8 code points to the vocabulary, but there are more than 150k of those, and enough are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). That does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenizer training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do that. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and the model gets to start from the ones digit. This has been suggested as an approach for LLMs.
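The reversal trick is a one-liner as a preprocessing pass (hypothetical helper, sketched here for illustration): flip every run of digits so the model always sees the ones digit first.

```python
import re

def reverse_digits(text: str) -> str:
    """Reverse each run of digits, leaving the surrounding text alone."""
    return re.sub(r"\d+", lambda m: m.group()[::-1], text)

print(reverse_digits("12334 + 41"))  # "43321 + 14"
```

The same pass would be applied to training data and prompts, and undone on any numbers in the output.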
cschmidt | 8 months ago | on: The bitter lesson is coming for tokenization
cschmidt | 9 months ago | on: Last fifty years of integer linear programming: Recent practical advances (2024)
Gurobi does have a cloud service where you pay by the hour. A full non-academic license is pricey.
cschmidt | 9 months ago | on: Quarkdown: A modern Markdown-based typesetting system
I'm just saying that these systems don't work for me. I write ML/AI conference papers in LaTeX, and I think that use case will be tough to dislodge. I can see this being very attractive to people making other types of documents without a fixed format, especially if you don't already know LaTeX.
cschmidt | 9 months ago | on: Quarkdown: A modern Markdown-based typesetting system
One thing that has helped with ease of use is Overleaf. It is a hosted LaTeX editor with lots of collaboration features (leaving comments, history of edits) that let people collaborate in real time on a paper. It comes with many templates to get you started on a new document. If you're working with collaborators, it has a lock on the market.
LaTeX itself can be easy for simple things (pick a template, and put text in each section). And it can grow into almost anything if you put in enough effort. It is far and away the standard way to write math equations, so if your document has lots of formulas, that's a plus.
cschmidt | 9 months ago | on: Quarkdown: A modern Markdown-based typesetting system
You make a fair point - I'm talking specifically about CS/ML/AI conferences. I shouldn't overgeneralize.
cschmidt | 9 months ago | on: Quarkdown: A modern Markdown-based typesetting system
Every conference has its own required LaTeX style file that must be used. Unless there is an automated way to convert these exactly, I don't see how LaTeX alternatives can be used.
cschmidt | 9 months ago | on: Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
Anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. It is usually done by frequency. Oops.
cschmidt | 9 months ago | on: Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
That's an interesting point. While you're correct, of course, it is so common to consider a hash table lookup an O(1) operation that it never occurred to me. But in this case, the loops are actually really tight and the hash table lookup might be a significant part of the time, so it might well behave more like O(n L^2). I'll update the docs and paper.
cschmidt | 9 months ago | on: Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
Yes, they were concurrent work. (Co-author of BoundlessBPE here.) A sibling comment describes the main differences. Our paper motivates why superwords can lead to such a big improvement, by overcoming a limit that pre-tokenization imposes on current tokenization methods. The SuperBPE paper has a wonderful set of downstream evaluation runs. So if you're interested in either, they are quite complementary papers.