MrManatee | 22 days ago | on: Relicensing with AI-Assisted Rewrite
MrManatee's comments
MrManatee | 7 months ago | on: Researchers find evidence of ChatGPT buzzwords turning up in everyday speech
For example, the word "surpass" was used 1.47 times per million in the pre-2022 dataset and 3.53 times per million in the post-2022 dataset. That's 16 occurrences in 10.92M words and 41 occurrences in 11.63M words, respectively. That's a low enough number that I could just read through every occurrence and see how it feels. In this case I can't because the authors very understandably couldn't publish the whole dataset for copyright reasons. And replicating the analysis from scratch is a bit too much to do just for curiosity's sake. :)
I often find drilling down into the raw data like this to be useful. It can't prove anything, but it can help me formulate a bunch of alternative explanations, and then I can start to think about how I could possibly tell which of the explanations is the best.
What are the competing explanations here? Perhaps the overall usage rate has increased. Or maybe there were just one or a few guests who really like that word. Or perhaps a topic was discussed where it would naturally come up more often. Or maybe some of these podcasts are not quite as unscripted as they seem, and ChatGPT was directly responsible for the increase. These are some alternative explanations I could think of without seeing the raw data, but there could easily be more that would immediately come to mind upon seeing it.
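For scale, a back-of-the-envelope significance check on those counts takes a few lines. This is a sketch that assumes independent word occurrences, which is exactly the assumption that the "one word-loving guest" explanation would violate, so it can distinguish chance from signal but not one signal from another:

```javascript
// Counts from the paper: 16 hits in 10.92M words (pre-2022)
// vs 41 hits in 11.63M words (post-2022).
const [k1, n1] = [16, 10.92e6];
const [k2, n2] = [41, 11.63e6];

const perMillion = (k, n) => (k / n) * 1e6; // 1.47 and 3.53, as reported

// Two-proportion z-test under the (dubious) independence assumption.
const p = (k1 + k2) / (n1 + n2); // pooled rate
const se = Math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2));
const z = (k2 / n2 - k1 / n1) / se; // ≈ 3.1
```

A z around 3 says the rate difference is unlikely to be pure chance, but says nothing about which of the explanations above produced it.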
MrManatee | 8 months ago | on: Observable Notebooks 2.0 Technology Preview
In addition to the hardcoded lookup table, here are some other notes on the generated code:
1. Silently assuming that the first page of results contains all of the data feels a bit fragile. If we're not going to bother with paging otherwise, I'd at least assert that the first page contains everything.
2. Contrary to what the code comment claims, 2022 is not the latest year with data available; 2023 is. The reason this is worrisome is not that the difference is massive, but the methodological implication: it looks like the year 2022 came from "remembering" details from some other code that analyzed this same data set, instead of from looking at the current version of the data set we're supposed to be analyzing.
3. The code for removing the world aggregate doesn't actually work (although it doesn't matter for the map). The place where it says d.country.id !== "WLD" should be either d.country.id !== "1W" or d.countryiso3code !== "WLD" instead. Also, if it would actually be important to filter this then presumably it would also be important to filter out a bunch of other aggregates as well.
4. The text says "darker colors indicating higher life expectancy", which is pretty much the opposite of how I would describe this color scheme.
5. The analysis given is: "Notice how certain regions tend to cluster together in terms of life expectancy, reflecting similar economic, healthcare, and social conditions". This is such a bland take. It feels like something I could have written before seeing the data. I would try to encourage the analyst to look for something more concrete and more interesting.
6. The actual thing that immediately pops out from the map is that the Central African Republic is a massive outlier with a life expectancy of 18.8 years, whereas every other country is over 50. This doesn't seem plausible. I would do something about that just so that it doesn't mess up the colors of the entire map.
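A sketch of how points 1 and 3 could look, assuming the World Bank API v2 JSON shape (a two-element array of [metadata, rows]; the field names below come from that API, not from the generated notebook):

```javascript
// Point 3: drop the world aggregate and missing values.
// "1W" is the World aggregate's two-letter id; "WLD" is its ISO3 code.
function filterCountries(rows) {
  return rows.filter(d => d.countryiso3code !== "WLD" && d.value != null);
}

// SP.DYN.LE00.IN is the World Bank's life expectancy indicator.
async function fetchLifeExpectancy(year) {
  const url = `https://api.worldbank.org/v2/country/all/indicator/` +
    `SP.DYN.LE00.IN?date=${year}&format=json&per_page=400`;
  const [meta, rows] = await (await fetch(url)).json();
  // Point 1: don't silently assume one page covers everything — assert it.
  if (meta.pages !== 1) throw new Error(`expected 1 page, got ${meta.pages}`);
  return filterCountries(rows);
}
```

As noted in point 3, a serious version would also filter the regional and income-group aggregates, not just the world total.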
MrManatee | 10 months ago | on: Show HN: AutoThink – Boosts local LLM performance with adaptive reasoning
If someone asked me to find solutions to these example equations, there are three complications that I would immediately notice:
1. We are looking for solutions over integers.
2. There are three variables.
3. The degree of the equation is 3.
Having all three is a deadly combination. If we were looking for solutions over reals or complex numbers? Solvable. Less than three variables? Solvable. Degree less than 3? Solvable. With all three complications, it's still not necessarily hard, but now it might be. We might even be looking at an unsolved problem.
I haven't studied enough number theory to actually solve either of these problems, but I have studied enough to know where to look. And because I know where to look, it only takes me a few seconds to recognize the "this might be very difficult" vibe that both of these have. Maybe LLMs can learn to pick up on similar cues to classify problems as difficult or not so difficult without needing to solve them. (Or maybe they have already learned?)
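As a toy illustration (these are not the equations from the post): sums of three cubes hit exactly this deadly combination, and they show why brute-force search alone tells you little:

```javascript
// x^3 + y^3 + z^3 = k over the integers: three variables, degree 3.
// Search is trivial to write; absence of small solutions proves nothing.
function searchCubes(k, bound) {
  for (let x = -bound; x <= bound; x++)
    for (let y = x; y <= bound; y++)      // enforce x <= y <= z to avoid
      for (let z = y; z <= bound; z++)    // re-checking permutations
        if (x ** 3 + y ** 3 + z ** 3 === k) return [x, y, z];
  return null;
}
```

k = 29 has a tiny solution (1, 1, 3), while for k = 33 the smallest known solution, found only in 2019, has 16-digit terms — far beyond any naive search. Both look identical from the outside, which is the "this might be very difficult" vibe.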
MrManatee | 2 years ago | on: Ohm: A library and language for building parsers, interpreters, compilers, etc
I have used both Ohm and Lezer - for different use cases - and have been happy with both.
If you want a parser that makes it possible for code editors to provide syntax highlighting, code folding, and other such features, Tree-sitter and Lezer work great for that use case. They are incremental, so it's feasible to re-parse the file every time a new character is typed. Also, for the editor use case it is essential that they can produce some kind of parse tree even when there are syntax errors.
I wouldn't try to build a syntax highlighter on top of Ohm. Ohm is, as the title says, meant for building parsers, interpreters, compilers, etc. And for those use cases, I think Ohm is easier to build upon than Lezer is.
MrManatee | 2 years ago | on: I don't use Bayes factors in my research (2019)
1. Change in unemployment is normally distributed with mean 0% and standard deviation 0.606%.
2. Change in unemployment is uniformly distributed between 1% and 10%.
I don't really agree that "(1) vs (2)" is a particularly good formulation of the original question ("Would raising the minimum wage by $4 lead to greater unemployment?"). But if it were, how would the math work out?
If we observe that unemployment increases 1%, then yes, that piece of evidence is very slightly in favor of explanation (1). This doesn't feel weird or paradoxical to me. But surely we wouldn't want to decide the matter based on just that one inconclusive data point? Instead we would want to look at another instance of the same situation. In that case, an increase of, say, 6% would (almost) conclusively settle the matter in favor of (2), and an increase of, say, 0.8% would (absolutely) conclusively settle the matter in favor of (1).
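Sketching the math, with the densities exactly as given in (1) and (2) above (the numbers in the comments are my own arithmetic):

```javascript
// Hypothesis (1): change ~ Normal(0, 0.606); hypothesis (2): Uniform(1, 10).
const normPdf = (x, mu, sd) =>
  Math.exp(-(((x - mu) / sd) ** 2) / 2) / (sd * Math.sqrt(2 * Math.PI));
const unifPdf = (x, a, b) => (x >= a && x <= b ? 1 / (b - a) : 0);

// Bayes factor for (1) over (2) given one observed change x.
const bf = x => normPdf(x, 0, 0.606) / unifPdf(x, 1, 10);
// bf(1)   ≈ 1.5      — a 1% increase very slightly favors (1)
// bf(6)   ≈ 3e-21    — a 6% increase settles it for (2)
// bf(0.8) = Infinity — 0.8% is impossible under (2), so (1) wins outright
```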
MrManatee | 2 years ago | on: The Undecidability of BB(748): Understanding Godel’s Incompleteness Theorems [pdf]
I'm just one mathematician, but I certainly don't mean that.
Before we can prove anything about sets, we need to pick some axioms. Zermelo set theory (Z) would be enough for most of ordinary mathematics. If we need something stronger, there's Zermelo–Fraenkel set theory with the axiom of choice (ZFC). Or if I need something even stronger, there's, for example, Tarski–Grothendieck set theory (TG).
What I mean by "X is true" is technically difficult to define. The statements
(1) X is provable in Z.
(2) X is provable in ZFC.
(3) X is provable in TG.
are all increasingly accurate characterizations of "X is true", but none of them capture everything about it. And that's kind of the point. There is no proof system P such that "X is provable in P" would work as a faithful definition of "X is true". So the best we can get is this tower of increasingly sophisticated axioms that still always fail to capture the full meaning of "truth".
There is a convention among mathematicians: Anything up to ZFC you can assume without explicitly mentioning it, but if you go beyond it, it's good to state what axioms you have used. ZFC is not a bad choice for this role. It is quite high in the tower. In most cases ZFC is strong enough, or in fact, overkill. But still, it is not at the top of the tower (there is no top!), so sometimes you need stronger axioms. The fact that ZFC has been singled out like this is ultimately a bit arbitrary - a social convention. "X is provable in ZFC" may be the most common justification for "X is true", but that doesn't make it the definition of "X is true".
MrManatee | 3 years ago | on: Functional Programming – How and Why
But also, writing imperative code doesn't guarantee explicit performance characteristics. Whether you mutate references or not, you still need to know which operations are fast and which are slow.
In JavaScript, concatenation, [...array1, ...array2], is non-mutating and slow. Adding an element to the end, array.push(x), is mutating and fast. But adding an element to the beginning, array.unshift(x), is mutating and slow. So even if you're sticking to mutation, you still need to know that push is fast and unshift is slow.
And yeah, sorry, "in JavaScript" is not quite right. I meant in my browser. This is not part of the spec, and it's not mentioned in the documentation. Is it the same in other browsers? Who knows. To me, this is just as much "you need intimate knowledge about compiler and/or runtime to know what's going to happen".
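For what it's worth, here is the kind of micro-benchmark I mean — a rough sketch, and as said above, the numbers only tell you about the engine you actually run it on:

```javascript
// Time one operation repeated N times. Crude, but enough to see
// linear vs quadratic behavior at this scale.
function time(fn) {
  const t0 = performance.now();
  fn();
  return performance.now() - t0;
}

const N = 5000;
const a = [], b = [];
let c = [];
const tPush = time(() => { for (let i = 0; i < N; i++) a.push(i); });       // append: fast
const tUnshift = time(() => { for (let i = 0; i < N; i++) b.unshift(i); }); // prepend: slow
const tConcat = time(() => { for (let i = 0; i < N; i++) c = [...c, i]; }); // copy-all: slow
```

In my runs push is orders of magnitude faster than the other two, but nothing in the spec requires that, which is the whole point.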
MrManatee | 3 years ago | on: Functional Programming – How and Why
On my computer, the article's imperative range(6000) took 0.05 ms and the "functional" range(6000) took 400 ms. The whole [...cur, cur.length+1] thing turns a linear algorithm into a quadratic one. It wouldn't happen in all languages, but that's how JavaScript's arrays work. My advice is that if you really want to do stuff like this, choose appropriate data structures (i.e., learn about persistent data structures).
Except in this case the imperative version is already totally fine. It is modifying an array that it created itself. You have all of the benefits of avoiding shared mutable state already.
Also, the difference in performance would become even worse, except that for range(7000) I already get "Maximum call stack size exceeded" on the recursive one. The imperative implementation handles 10000000 without breaking a sweat. My second piece of advice: don't use recursion to process arrays in languages (like JavaScript) that have a very limited call stack size.
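For reference, the two implementations being compared look roughly like this (my reconstruction from the article's description, not its exact code):

```javascript
// Imperative: O(n). Mutates an array, but one it created itself,
// so there's no shared mutable state to worry about.
function rangeImperative(n) {
  const out = [];
  for (let i = 1; i <= n; i++) out.push(i);
  return out;
}

// "Functional": O(n^2) in JavaScript, because [...cur, x] copies the
// whole array on every step; recursion depth n also blows the call
// stack for large n.
function rangeFunctional(n, cur = []) {
  return cur.length === n ? cur : rangeFunctional(n, [...cur, cur.length + 1]);
}
```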
MrManatee | 3 years ago | on: Take my app ideas
[1] https://austinhenley.com/blog/thisprojectwillonlytake.html
MrManatee | 4 years ago | on: Backblaze IPO
What has been on my radar is that my laptop has a small SSD, and in addition I have an external hard drive containing files that I want to back up but I rarely use. Backblaze used to require that I regularly plug in that hard drive and then keep it plugged in for a few hours. This was a total chore, and meant that I couldn’t just forget about my backups. I seriously considered switching backup providers because of that. Luckily Backblaze eventually introduced a slightly more expensive plan where I don’t have to do that. So at the moment I’m a satisfied customer, and I’m not actively looking for alternatives.
MrManatee | 4 years ago | on: Better Operator Precedence
For example, the proof of the Minkowski inequality in Wikipedia [1] contains a statement of the form "x_1 = x_2 = x_3 ≤ x_4 = x_5 ≤ x_6 = x_7". Except that x_1, ..., x_7 are so complicated that any notation that requires repeating them doesn't feel like a satisfactory notation for "all of mathematics".
MrManatee | 4 years ago | on: Today Sci-Hub is 10 years old. I'll publish 2M new articles to celebrate
I'm absolutely convinced that if copyrights weren't an issue, there would be enough governments, foundations, universities, corporations, and individuals willing to pay the costs of making scientific publications available to everyone. It wouldn't have to be a paid service.
MrManatee | 4 years ago | on: Weird Languages
I don't mean to be too dismissive. I think this short piece was thought-provoking enough to be worth reading, and it was not a bad argument in favor of Lisp. Maybe I really should learn Lisp. But if the heuristic was generalized from just one example, it might overfit to just one particular kind of weirdness.
MrManatee | 4 years ago | on: Captcha pictures force you to look at the world the way an AI does
It reminds me of how nowadays when I do an incognito Google search on my phone, I need three taps just to accept the terms of service. Same with YouTube. Maybe this level of cumbersomeness is somehow legally required now, but I find it more likely that this is an attempt to subconsciously encourage people to not browse incognito—so that they can be tracked.
My understanding is that being logged in in your Google account often allows you to bypass captchas. If captchas are a miserable experience, this would have a similar effect of subconsciously discouraging incognito browsing.
MrManatee | 4 years ago | on: New UUID Formats – IETF Draft
And in any case, 15 October 1582 isn't some hard cutoff point where we can stop worrying about calendar conversions. Only four countries adopted the Gregorian calendar on that day, and even in Europe there are several countries that only switched in the 20th century. If a piece of software needs to support historical dates that go anywhere near 1582, it needs to be aware of the existence of different calendars.
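For example, JavaScript's Date uses the proleptic Gregorian calendar [1], so the "skipped" days render as perfectly ordinary dates:

```javascript
// 10 October 1582 was skipped entirely in the four countries that
// switched that year, yet ECMAScript treats it as an ordinary date
// (month argument is 0-based: 9 = October).
const d = new Date(Date.UTC(1582, 9, 10));
d.toISOString(); // "1582-10-10T00:00:00.000Z"
```

Any software that compares such a value against a date recorded in the Julian calendar is silently off by ten days, which is why calendar-awareness matters anywhere near 1582.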
[1] https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
MrManatee | 4 years ago | on: I have a lot to say about Signal’s Cellebrite hack
But if some app actually decided to use this hack, then wouldn't it be likely that in addition to modifying the contents of the data dump it would also modify the on-device data? In that case it wouldn't matter if the other vendors have vulnerabilities, since the device itself was already compromised.
My understanding of cleanroom design is that the person/team doing the programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.