pyentropy's comments

pyentropy | 1 year ago | on: Nvidia’s $589B DeepSeek rout

If the H800 is a memory-constrained model that NVIDIA built to sidestep the Chinese export ban on the H100 while keeping equivalent fp8 performance, it makes zero sense to believe Elon Musk, Dario Amodei, and Alexandr Wang's claims that DeepSeek smuggled H100s.

The only reason a team would allocate time to memory optimizations and writing NVPTX code rather than focusing on post-training is that they severely struggled with memory during training.

I mean, take a look at the numbers:

https://www.fibermall.com/blog/nvidia-ai-chip.htm#A100_vs_A8...

This is a massive trick pulled by Jensen: take the H100 design, whose sales are regulated by the government, make it look 40x weaker, and call it the H800, while conveniently leaving 8-bit computation as fast as the H100's. Then bring it to China and let companies stockpile it without disclosing production or sales numbers, with no export controls.

Eventually, after 7 months, the US government starts noticing the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started research using fp8. They slowly build bigger and bigger models, work on bandwidth and memory consumption, until they make R1 - their reasoning model.

pyentropy | 1 year ago | on: Why haven't biologists cured cancer?

You should start a blog... or maybe not - pursue the battle in academia/work and occasionally drop nuggets of wisdom like this somewhere. But do not delete them.

pyentropy | 1 year ago | on: Is Aschenbrenner's 165 page paper on AI the naivety of a 25 year old?

It is a question. I tried to give my opinion on a few of its statements, but I absolutely cannot summarize 160 pages (Business Insider did, using GPT, which I find insulting and funny), nor can I have a 100% opinion on something that involves national security, secrets, and other things I don't have access to.

pyentropy | 2 years ago | on: In the long run, we're all Dad

You haven't read Scott's blog enough :)

He's an atheist psychiatrist. However, he enjoys how natural selection, social dynamics, and reputation can also be modeled by the moral rules of most religions. For example, going to therapy isn't that different from practicing confession in a church.

pyentropy | 2 years ago | on: Metaculus

You can gain reputation simply by forecasting the same outcome as the (publicly available) average probability of everybody else - so a user who forgets to forecast on questions is gonna be worse off than a bot that just follows the crowd.

However, it gets more interesting when you try to beat the crowd, because you have to take risk and disagree with the masses. You will end up with either a negative reputation or a very large one. You can learn more about scoring functions and how the accuracy of everyone's forecasts is measured here: https://www.metaculus.com/help/scoring/
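As a toy illustration of why crowd-following is safe and disagreeing is high-variance, here's a sketch using the log scoring rule - a standard proper scoring rule, not Metaculus's exact formula (the numbers are made up):

```python
import math

def log_score(p, outcome):
    """Log score of a probability forecast for a yes/no question.

    Higher is better; you earn log(p) if the event happens and
    log(1 - p) if it doesn't.
    """
    return math.log(p if outcome else 1.0 - p)

crowd = 0.7        # the publicly visible community average
contrarian = 0.95  # a risky disagreement with the crowd

# If the event resolves Yes, the contrarian beats the crowd-follower;
# if it resolves No, the contrarian does much worse.
for outcome in (True, False):
    print(outcome, log_score(crowd, outcome), log_score(contrarian, outcome))
```

Following the crowd always earns exactly the crowd's score; disagreeing is the only way to earn more (or less).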

Personally I have opened one question, and it involves predicting the net sales of Apple Vision Pro until 2025: https://www.metaculus.com/questions/17407/apple-vision-pro-n...

pyentropy | 2 years ago | on: Google claims to have proved its supremacy with new quantum computer

Are you familiar with logic circuits (those made of gates like AND, OR, XOR, NAND)? Just as they are the building blocks of classical computers, the building blocks of quantum computers are quantum circuits.

Quantum circuits are made of quantum logic gates like Hadamard, CNOT, Z, CZ, etc. Instead of bits as inputs and outputs, quantum logic gates have qubits. Unlike boolean logic, where bits are 0 and 1, a qubit is a 2D vector [α β] where α and β are complex numbers, corresponding to a superposition of the zero and one basis states: α|0> + β|1>. You can visualise a qubit as a point on a sphere, the so-called Bloch sphere [1].
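In NumPy terms, a qubit is literally just that 2D complex vector (a sketch; the constraint |α|² + |β|² = 1 is the standard normalisation of amplitudes):

```python
import numpy as np

# A qubit as a 2D complex vector [alpha, beta] = alpha*|0> + beta*|1>
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)  # an equal superposition
qubit = np.array([alpha, beta])

# Measurement probabilities are the squared magnitudes of the amplitudes,
# so they must sum to 1: |alpha|^2 + |beta|^2 == 1
norm = np.abs(qubit) ** 2
print(norm.sum())  # ~1.0
```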

There are multiple ways to implement a qubit, but you need to start with some quantum phenomenon. An example is the polarisation of a photon: horizontal polarisation could be |0>, vertical could be |1>, and the qubit is a complex vector over these two. If you've studied linear algebra, you know manipulating a vector often involves linear transformations. Any linear transformation can be represented as a matrix, so applying gates is just matrix multiplication. Unary gates are 2x2 matrices and binary gates are 4x4 matrices; for photons they would be implemented with mirrors and optical waveplates. Measuring the polarisation at the end gives the output. The output is not deterministic, but it always follows the same distribution, so you could design a circuit that outputs |001> X% of the time, |010> Y%, |111> Z% of the time, etc., such that X + Y + Z + ... = 100%.
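A minimal sketch of "gates are matrices": the Hadamard gate turns |0> into an equal superposition, and a binary gate like CNOT is a 4x4 matrix acting on two qubits:

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)   # |0>

# Hadamard gate: a 2x2 unitary matrix
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

psi = H @ ket0                 # applying a gate = matrix multiplication
probs = np.abs(psi) ** 2       # measurement distribution over |0>, |1>

# CNOT, a 4x4 matrix on two qubits: flips the second qubit iff the first is |1>
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
bell = CNOT @ np.kron(psi, ket0)   # the Bell state (|00> + |11>)/sqrt(2)
print(probs, np.abs(bell) ** 2)
```

Measuring the Bell state gives |00> half the time and |11> half the time - a 50/50 distribution, exactly the "same distribution every run" behaviour described above.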

I'm not too familiar with the details of random circuit sampling, but the idea is that you start with a big circuit that wasn't intentionally designed and therefore has no known structure we can exploit - instead it's a random mess of transformations to the qubits. A classical computer cannot run big quantum circuits - N gates with the 49 Google qubits requires something like 2^49 * N^3 classical gates, so it won't be able to calculate the output distribution. However, what we can do is run the quantum circuit many times (do measurements on the quantum computer) and collect many samples. Given enough samples, a classical computer can verify whether there's consistency between them and whether an actual transformation produced them (and therefore quantum computation happened) or it's just pure noise / garbage, using cross-entropy benchmarks [2].
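Here's a toy sketch of the linear cross-entropy benchmark idea at a size a laptop can handle. The details of Google's actual benchmark differ; the version below uses F = 2^n · mean(p(xᵢ)) − 1, which comes out near 1 for faithful samples and near 0 for uniform noise, and the "ideal distribution" is faked with random amplitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                     # toy qubit count; real experiments use ~50+
dim = 2 ** n

# Stand-in for the circuit's ideal output distribution p(x), which a
# classical simulator can compute for small n. Random complex amplitudes
# give a Porter-Thomas-like distribution, as random circuits do.
amps = rng.normal(size=dim) + 1j * rng.normal(size=dim)
p = np.abs(amps) ** 2
p /= p.sum()

def linear_xeb(samples, p, dim):
    """Linear cross-entropy fidelity: dim * <p(x_i)> - 1.

    ~1 for samples drawn from the ideal distribution, ~0 for uniform noise.
    """
    return dim * p[samples].mean() - 1

good = rng.choice(dim, size=20000, p=p)   # "quantum computer" samples
noise = rng.choice(dim, size=20000)       # uniform garbage

good_f, noise_f = linear_xeb(good, p, dim), linear_xeb(noise, p, dim)
print(good_f, noise_f)  # roughly 1 vs roughly 0
```

The verifier never needs to reproduce the samples, only to score them against the ideal distribution - which is exactly the loophole the supercomputer teams later attacked.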

Note that the purpose of the "random" in the random circuit is to introduce hardness and prevent cheating (assume that the classical computer is the "opponent" of the quantum computer); the circuits don't calculate anything useful / of human value.

What's interesting is that once people with supercomputers saw the benchmark formula and analysed the constant factors, they found a loophole that let them run a classical algorithm generating measurements/samples that satisfy the benchmark with 40K classical CPUs in a week, or even a single A100 within 140 days. Some of their success was due to the sheer power available, and some was due to algorithmic cleverness (see: tensor networks). In my opinion, they are only disproving the Sycamore supremacy in a fussy way.

[1] - https://en.wikipedia.org/wiki/Bloch_sphere

[2] - https://en.wikipedia.org/wiki/Cross-entropy_benchmarking

pyentropy | 2 years ago | on: DMT, Derealization, and Depersonalization

I've had horrible DR and DP as a child. It runs in families with anxiety disorders and personality disorders.

No matter how many times it happened, it was always equally scary - feeling like a passive observer of a movie starring some piece of flesh and bones as the main character, feeling completely separate from that body and unable to control its decisions. The episodes usually lasted <30 minutes.

I don't know about its occurrence with psychedelics, but in my case it always occurred after periods of extreme emotion (seeing a classmate die and becoming aware of my own mortality, being rejected by some 'friends' in school, and a few others). The way I see it (and some neuroscientists claim), the brain shuts off the perception of "self" in order to stop intense emotional pain.

pyentropy | 2 years ago | on: So this guy is now S3. All of S3

Further context: Bluesky lets you use a domain name you own as a user handle.

The official method is to set a TXT record, but apparently their "AT protocol" also lets you confirm a domain by serving `GET your.domainname.com/xrpc/com.atproto.identity.resolveHandle`
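A minimal sketch of a server that would answer that check - the DID value here is hypothetical, and Bluesky's real verification logic may differ in details:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Answer the AT protocol handle-resolution endpoint for this domain."""

    def do_GET(self):
        if self.path.startswith("/xrpc/com.atproto.identity.resolveHandle"):
            # Whoever controls the domain can claim the handle by
            # returning the account's DID here.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"did": "did:plc:example"}')
        else:
            self.send_response(404)
            self.end_headers()

# HTTPServer(("", 8080), Handler).serve_forever()  # uncomment to serve
```

Which is why controlling what `xrpc` resolves to under a domain (like an S3 bucket name) is enough to pass the check.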

and `xrpc` was available as an S3 bucket name :)

pyentropy | 2 years ago | on: OpenAI Tokenizer

A character is the base unit of written communication. Single characters as tokens are not a bad idea; it just takes too many resources to train and run inference that way.

BPE is a tradeoff between single letters (computationally hard) and a word dictionary (can't handle novel words, languages, or complex structures like code syntax). Note that tokens must be hardcoded because the neural network has an output layer with neurons mapped one-to-one to the tokens (and the predicted token is the most activated neuron).
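A toy sketch of the BPE training loop - repeatedly merge the most frequent adjacent pair of symbols. (OpenAI's actual tokenizer is byte-level with learned merge ranks; this is just the core idea on a made-up word list.)

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: fuse the most frequent adjacent symbol pair."""
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word count.
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for word, count in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "lower", "lowest", "low"], 2)
print(merges)
```

After two merges the frequent chunk "low" becomes a single token, while rarer suffixes like "er" and "est" stay split - the syllable-like middle ground described above.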

Human brains roughly do the same thing - that's why we have syllables as a tradeoff between letters and words.

pyentropy | 3 years ago | on: Grover's algorithm offers no quantum advantage

I don't want to read the whole thing because it looks very cynical - it judges TCS scientists who believe in the Grover speedup as naive because they are unaware of real-life noise, without realizing that that's the point of TCS.

We don't know how noise will scale IRL, so the job of theoretical scientists is to design the basic units of quantum computation regardless of how they may or may not work IRL. It's like judging XOR and NAND in the 1920s because transistors might never be able to implement them.
