anishathalye's comments
anishathalye | 1 month ago | on: The Missing Semester of Your CS Education (2026)
Over the years, the three of us helped teach several classes at MIT, and over and over again we saw that students had limited knowledge of the tools available to them. Computers were built to automate manual tasks, yet students often performed repetitive tasks by hand or failed to take full advantage of powerful tools such as version control and IDEs. Common examples include manually renaming a symbol across many source code files, or using the nuclear approach to fix a Git repository (https://xkcd.com/1597/).
At least at MIT, these topics are not taught as part of the university curriculum: students are never shown how to use these tools, or at least not how to use them efficiently, and thus waste time and effort on tasks that should be simple. The standard CS curriculum is missing critical topics about the computing ecosystem that could make students’ lives significantly easier both during school and after graduation (most jobs do not formally teach these topics either).
To help mitigate this, the three of us developed a class, originally called Hacker Tools in 2019 and then renamed to Missing Semester in 2020 (some great past discussion here: https://news.ycombinator.com/item?id=22226380, https://news.ycombinator.com/item?id=19078281). Over the past several years, we’ve seen the course translated into over a dozen languages, inspire similar courses at other universities, and be adopted by several companies as part of their standard onboarding materials.
Based on feedback and discussions here and elsewhere, along with our updated perspective from working in industry for several years, we have developed a new iteration of the course. The 2026 edition covers several new topics such as packaging/shipping code, code quality, agentic coding, and soft skills. Some things never change, though; we’re still using this hacky Python DSL for editing our multi-camera-angle lecture videos: https://github.com/missing-semester/videos.
As always, we’d love to hear any feedback from the community to help us improve the course content!
—Anish, Jon, and Jose
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators; you can check out their open-source library: https://github.com/lotus-data/lotus
Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/
ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
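The bounded I/O-concurrency pattern described above can be sketched with a plain asyncio semaphore. This is a generic illustration of the pattern, not Semlib's actual implementation; `bounded_map` and `fake_llm` are made-up names standing in for the library's internals and a real LLM call:

```python
import asyncio

async def bounded_map(fn, items, max_concurrency=4):
    # Apply an async fn to every item with at most max_concurrency calls
    # in flight at once, preserving input order in the results.
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(worker(x) for x in items))

async def fake_llm(prompt):
    # Stand-in for a real LLM API call; the sleep simulates network I/O.
    await asyncio.sleep(0.01)
    return prompt.upper()

results = asyncio.run(bounded_map(fake_llm, ["a", "b", "c"], max_concurrency=2))
print(results)  # -> ['A', 'B', 'C']
```

With a real API client, raising `max_concurrency` trades provider rate limits against wall-clock time, which is exactly the knob the session parameter exposes.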
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Citations have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email, etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). I appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I would appreciate your comments or a PR.
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.
These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/
There’s also related academic work in this area that discusses applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).
anishathalye | 6 months ago | on: Semlib: LLM-Powered Data Processing
I've been thinking a lot about semantic data processing recently. A lot of the attention in AI has been on agents and chatbots (e.g., Claude Code or Claude Desktop), and I think semantic data processing is not well-served by such tools (or frameworks designed for implementing such tools, like LangChain).
As I was working on some concrete semantic data processing problems and writing a lot of Python code (to call LLMs in a for loop, for example, and then adding more and more code to do things like I/O concurrency and caching), I wanted to figure out how to disentangle data processing pipeline logic from LLM orchestration. Functional programming primitives (map, reduce, etc.), common in data processing systems like MapReduce/Flume/Spark, seemed like a natural fit, so I implemented semantic versions of these operators. It's been pretty effective for the data processing tasks I've been trying to do.
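As a rough sketch of what a "semantic" version of one of these primitives looks like, here is a quicksort whose ordering decisions are delegated to a predicate; in a semantic sort that predicate would be answered by an LLM. The code below is illustrative, not Semlib's API, and uses a deterministic stand-in comparator so it runs without any model:

```python
def semantic_quicksort(items, prefer):
    # prefer(a, b) -> True if a should come before b. In a semantic sort,
    # this pairwise question would be posed to an LLM instead of computed.
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    before = [x for x in rest if prefer(x, pivot)]
    after = [x for x in rest if not prefer(x, pivot)]
    return semantic_quicksort(before, prefer) + [pivot] + semantic_quicksort(after, prefer)

# Deterministic stand-in comparator: order strings by length.
print(semantic_quicksort(["pear", "fig", "banana"], lambda a, b: len(a) < len(b)))
# -> ['fig', 'pear', 'banana']
```

Because the comparisons within each partition are independent, they can be issued concurrently, which is where the I/O-concurrency machinery pays off.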
This blog post shares some more details on the story here and elaborates what I like about this approach to semantic data processing. It also covers some of the related work in this area (like DocETL from Berkeley's EPIC Data Lab, LOTUS from Stanford and Berkeley, and Palimpzest from MIT's Data Systems Group).
Like a lot of my past work, the software itself isn't all that fancy; but it might change the way you think!
The software is open-source at https://github.com/anishathalye/semlib. I'm very curious to hear the Hacker News community's thoughts!
anishathalye | 10 months ago | on: Why I no longer have an old-school cert on my HTTPS site
(6.858 is the old name of the class; it was recently renamed to 6.5660.)
anishathalye | 1 year ago | on: Provably Correct, Secure, and Leakage-Free Systems
Hi HN! For the last six years, I've been working on techniques to build high-assurance systems using formal verification, with a focus on eliminating side-channel leakage. I'm defending my PhD thesis next week, where I'll talk about our approach to verifying hardware security modules with proofs covering the entire hardware and software system down to the wire-I/O-level. In terms of the artifacts we verify: the biggest example is an ECDSA signature HSM, implemented in 2,300 lines of C code and 13,500 lines of Verilog, and we verify its behavior (capturing correctness, security, and non-leakage) against a succinct 50-line specification.
One of the components that I'm most excited about is how we formally define security for a system at the wire-I/O-level --- we do this with a new security definition called "information-preserving refinement," inspired by the real/ideal paradigm from theoretical cryptography.
HN has been a huge part of my life since I started undergrad about 10 years ago (I post occasionally but mostly read). I would love to see some of the HN community there, whether in-person or over Zoom --- PhD thesis defense talks are open to the public, and my talk is aimed at a general CS/systems audience!
anishathalye | 2 years ago | on: TKey is a RISC-V computer in a USB-C case, that can run security applications
Transparency logs indeed are a neat ingredient to use here. I've heard of other software distributors (e.g., Firefox) using binary transparency logs but hadn't heard of anyone use them in the context of HSMs/security tokens/cryptocurrency wallets yet.
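To make the core append-only property concrete, here is a toy hash-chain log. Real transparency logs (e.g., Certificate Transparency) use Merkle trees so that inclusion and consistency can be proven efficiently, which this sketch omits; all names here are illustrative:

```python
import hashlib

def leaf_hash(data):
    # Domain-separate leaf hashes from internal chaining hashes.
    return hashlib.sha256(b"leaf:" + data).hexdigest()

class HashChainLog:
    # Minimal append-only log: each head commits to the previous head, so
    # rewriting any past entry changes every subsequent head, which auditors
    # comparing heads would detect.
    def __init__(self):
        self.head = hashlib.sha256(b"empty").hexdigest()
        self.entries = []

    def append(self, data):
        self.entries.append(data)
        self.head = hashlib.sha256((self.head + leaf_hash(data)).encode()).hexdigest()
        return self.head
```

For binary transparency, the entries would be hashes of released firmware images, and clients would refuse to run a binary that is not present in the log.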
anishathalye | 2 years ago | on: TKey is a RISC-V computer in a USB-C case, that can run security applications
We've been working on some research to formally verify the hardware/software of such devices [1, 2]. Neat how there are so many shared ideas: we also use a PicoRV32, run on an iCE40 FPGA, use UART for communication to/from the PicoRV32 to keep the security-critical part of the hardware simple, and use a separate MCU to convert between USB and UART.
Interesting decision to make the device stateless. Given that the application keys are generated by combining the UDS, USS, and the hash of the application [3], it seems this rules out software updates? Was this an intentional tradeoff, to have a sort of "forward security"?
In an earlier project I worked on [4], we had run into a similar issue (no space for this in the write-up though); there, we ended up using the following approach: applications are _signed_ by the developer (who can use any keypair they generate), the signature is checked at application load time, and the application-specific key is derived using the hash of the developer's public key instead of the hash of the application. This does have the downside that if the developer is compromised, an adversary can use this to sign a malicious application that can leak the key.
[1]: https://github.com/anishathalye/knox-hsm
[2]: https://pdos.csail.mit.edu/papers/knox:osdi22.pdf
[3]: https://tillitis.se/blog/2023/03/31/on-tkey-key-generation/
[4]: https://pdos.csail.mit.edu/papers/notary:sosp19.pdf
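The two derivation schemes discussed above can be sketched as follows. HMAC-SHA-256 is used purely for illustration; the TKey's actual KDF and input formatting differ, and the function names are made up:

```python
import hashlib
import hmac

def derive_app_key(device_secret, user_secret, app_binary):
    # Stateless derivation bound to the exact application binary: any change
    # to the app (including a legitimate update) yields a different key.
    h = hashlib.sha256(app_binary).digest()
    return hmac.new(device_secret, user_secret + h, hashlib.sha256).digest()

def derive_app_key_by_signer(device_secret, user_secret, signer_pubkey):
    # Alternative from the comment: bind the key to the developer's public
    # key (with a load-time signature check, not shown), so any app signed
    # by the same developer derives the same key, permitting updates.
    h = hashlib.sha256(signer_pubkey).digest()
    return hmac.new(device_secret, user_secret + h, hashlib.sha256).digest()
```

The tradeoff is visible in the derivations: the first scheme gives forward security against malicious "updates" at the cost of ruling out updates entirely, while the second preserves keys across signed updates at the cost of trusting the developer's signing key.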
anishathalye | 2 years ago | on: Automated Data Quality at Scale
Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019–2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.
The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!
P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab
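For a flavor of what these algorithms do, here is a drastically simplified version of the idea behind confident learning: flag examples whose given label receives low predicted probability from a trained model. cleanlab's actual algorithms estimate per-class thresholds and label-noise rates rather than using a fixed cutoff, so treat this only as a sketch (`flag_label_issues` is a made-up name):

```python
def flag_label_issues(labels, pred_probs, threshold=0.5):
    # labels[i] is the given class of example i; pred_probs[i][c] is a
    # model's predicted probability that example i belongs to class c.
    # Flag examples where the model assigns low probability to the given label.
    issues = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        if probs[y] < threshold:
            issues.append(i)
    return issues

labels = [0, 1, 0]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]  # third example looks mislabeled
print(flag_label_issues(labels, pred_probs))  # -> [2]
```

The key point is that the model's out-of-sample predictions are used to audit the dataset itself, not just to make predictions.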
anishathalye | 3 years ago | on: MIT Introduction to Data-Centric AI
MIT, like most universities, has many courses on machine learning (6.036, 6.867, and many others). Those classes teach techniques to produce effective models for a given dataset, and they focus heavily on the mathematical details of models rather than practical applications. However, in real-world applications of ML, the dataset is not fixed, and focusing on improving the data often gives better results than improving the model. We’ve personally seen this time and time again in our applied ML work as well as our research.
Data-Centric AI (DCAI) is an emerging science that studies techniques to improve datasets in a systematic/algorithmic way — given that this topic wasn’t covered in the standard curriculum, we (a group of PhD candidates and grads) thought that we should put together a new class! We taught this intensive 2-week course in January over MIT’s IAP term, and we’ve just published all the course material, including lecture videos, lecture notes, hands-on lab assignments, and lab solutions, in hopes that people outside the MIT community would find these resources useful.
We’d be happy to answer any questions related to the class or DCAI in general, and we’d love to hear any feedback on how we can improve the course material. Introduction to Data-Centric AI is open-source opencourseware, so feel free to make improvements directly: https://github.com/dcai-course/dcai-course.
anishathalye | 3 years ago | on: Crowdlab: Effective algorithms to handle data labeled by multiple annotators
The blog post gives some intuition for how it works, along with some benchmarking results, and the math and the nitty-gritty details can be found in this paper: https://cleanlab.github.io/multiannotator-benchmarks/paper.p...
Happy to answer any questions related to multi-annotator datasets or data-centric approaches to ML in general here.
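For a rough flavor of the problem setting, here is a toy consensus rule: majority vote over annotator labels, with a model's predicted probabilities breaking ties. CROWDLAB itself additionally estimates each annotator's quality and weights their votes accordingly, so this is only an illustrative baseline (the function name is made up):

```python
from collections import Counter

def consensus_label(annotations, model_probs):
    # annotations: labels given by the annotators for one example.
    # model_probs: model_probs[c] is a model's predicted probability of class c.
    counts = Counter(annotations)
    ranked = counts.most_common()
    best_count = ranked[0][1]
    tied = [label for label, c in ranked if c == best_count]
    if len(tied) == 1:
        return tied[0]
    # Break ties using the model's prediction.
    return max(tied, key=lambda label: model_probs[label])

print(consensus_label([0, 0, 1], [0.3, 0.7]))  # majority wins -> 0
print(consensus_label([0, 1], [0.3, 0.7]))     # tie, model breaks it -> 1
```

Methods like CROWDLAB go further by producing a confidence score for each consensus label, which tells you which examples are worth sending back for more annotations.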
anishathalye | 4 years ago | on: Cleanlab 2.0: Automatically Find Errors in ML Datasets
We’d love to hear any ideas or feedback from the HN community, especially from those who face data-quality challenges in their work. We (me, @cgn, and @_jonas), who all have a background in ML research, would also be happy to answer any questions you have related to cleanlab or data-centric AI.
anishathalye | 4 years ago | on: Yet Another GitHub Profile Generator
At that time, I was quite interested in adversarial examples and ML security.