anishathalye's comments
anishathalye | 1 month ago | on: The Missing Semester of Your CS Education (2026)
Over the years, the three of us helped teach several classes at MIT, and over and over again we saw that students had limited knowledge of the tools available to them. Computers were built to automate manual tasks, yet students often performed repetitive tasks by hand or failed to take full advantage of powerful tools such as version control and IDEs. Common examples include manually renaming a symbol across many source code files, or using the nuclear approach to fix a Git repository (https://xkcd.com/1597/).
At least at MIT, these topics are not taught as part of the university curriculum: students are never shown how to use these tools, or at least not how to use them efficiently, and thus waste time and effort on tasks that should be simple. The standard CS curriculum is missing critical topics about the computing ecosystem that could make students’ lives significantly easier both during school and after graduation (most jobs do not formally teach these topics either).
To help mitigate this, the three of us developed a class, originally called Hacker Tools in 2019 and then renamed to Missing Semester in 2020 (some great past discussion here: https://news.ycombinator.com/item?id=22226380, https://news.ycombinator.com/item?id=19078281). Over the past several years, we’ve seen the course translated into over a dozen languages, inspire similar courses at other universities, and be adopted by several companies as part of their standard onboarding materials.
Based on feedback and discussions here and elsewhere, along with our updated perspective from working in industry for several years, we have developed a new iteration of the course. The 2026 edition covers several new topics such as packaging/shipping code, code quality, agentic coding, and soft skills. Some things never change, though; we’re still using this hacky Python DSL for editing our multi-camera-angle lecture videos: https://github.com/missing-semester/videos.
As always, we’d love to hear any feedback from the community to help us improve the course content!
—Anish, Jon, and Jose
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators; you can check out their open-source library: https://github.com/lotus-data/lotus
Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for-loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/
ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).
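The bounded I/O-concurrency pattern described above can be sketched with a plain asyncio semaphore. This is a generic illustration of the pattern, not Semlib's actual implementation; `bounded_map` and `fake_llm` are made-up names standing in for the library's internals and a real LLM call:

```python
import asyncio

async def bounded_map(fn, items, max_concurrency=4):
    # Apply an async fn to every item with at most max_concurrency calls
    # in flight at once, preserving input order in the results.
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(item):
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(worker(x) for x in items))

async def fake_llm(prompt):
    # Stand-in for a real LLM API call; the sleep simulates network I/O.
    await asyncio.sleep(0.01)
    return prompt.upper()

results = asyncio.run(bounded_map(fake_llm, ["a", "b", "c"], max_concurrency=2))
print(results)  # -> ['A', 'B', 'C']
```

With a real API client, raising `max_concurrency` trades provider rate limits against wall-clock time, which is exactly the knob the session parameter exposes.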
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work. Citations have been a mess for many of my other GitHub repos, which makes it hard to find who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email, etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). I appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I would appreciate your comments or a PR.
anishathalye | 6 months ago | on: Show HN: Semlib – Semantic Data Processing
I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.
These primitives can be _composed_, and that’s where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don’t worry, I didn’t dump AI-generated outputs on people, I first did the work manually, and shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/
There’s also related academic work in this area that discusses applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).
anishathalye | 6 months ago | on: Semlib: LLM-Powered Data Processing
I've been thinking a lot about semantic data processing recently. A lot of the attention in AI has been on agents and chatbots (e.g., Claude Code or Claude Desktop), and I think semantic data processing is not well-served by such tools (or frameworks designed for implementing such tools, like LangChain).
As I was working on some concrete semantic data processing problems and writing a lot of Python code (to call LLMs in a for loop, for example, and then adding more and more code to do things like I/O concurrency and caching), I wanted to figure out how to disentangle data processing pipeline logic from LLM orchestration. Functional programming primitives (map, reduce, etc.), common in data processing systems like MapReduce/Flume/Spark, seemed like a natural fit, so I implemented semantic versions of these operators. It's been pretty effective for the data processing tasks I've been trying to do.
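As a rough sketch of what a "semantic" version of one of these primitives looks like, here is a quicksort whose ordering decisions are delegated to a predicate; in a semantic sort that predicate would be answered by an LLM. The code below is illustrative, not Semlib's API, and uses a deterministic stand-in comparator so it runs without any model:

```python
def semantic_quicksort(items, prefer):
    # prefer(a, b) -> True if a should come before b. In a semantic sort,
    # this pairwise question would be posed to an LLM instead of computed.
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    before = [x for x in rest if prefer(x, pivot)]
    after = [x for x in rest if not prefer(x, pivot)]
    return semantic_quicksort(before, prefer) + [pivot] + semantic_quicksort(after, prefer)

# Deterministic stand-in comparator: order strings by length.
print(semantic_quicksort(["pear", "fig", "banana"], lambda a, b: len(a) < len(b)))
# -> ['fig', 'pear', 'banana']
```

Because the comparisons within each partition are independent, they can be issued concurrently, which is where the I/O-concurrency machinery pays off.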
This blog post shares some more details on the story here and elaborates what I like about this approach to semantic data processing. It also covers some of the related work in this area (like DocETL from Berkeley's EPIC Data Lab, LOTUS from Stanford and Berkeley, and Palimpzest from MIT's Data Systems Group).
Like a lot of my past work, the software itself isn't all that fancy; but it might change the way you think!
The software is open-source at https://github.com/anishathalye/semlib. I'm very curious to hear the Hacker News community's thoughts!
anishathalye | 10 months ago | on: Why I no longer have an old-school cert on my HTTPS site
(6.858 is the old name of the class; it was recently renamed to 6.5660.)
anishathalye | 1 year ago | on: Provably Correct, Secure, and Leakage-Free Systems
Hi HN! For the last six years, I've been working on techniques to build high-assurance systems using formal verification, with a focus on eliminating side-channel leakage. I'm defending my PhD thesis next week, where I'll talk about our approach to verifying hardware security modules with proofs covering the entire hardware and software system down to the wire-I/O-level. In terms of the artifacts we verify: the biggest example is an ECDSA signature HSM, implemented in 2,300 lines of C code and 13,500 lines of Verilog, and we verify its behavior (capturing correctness, security, and non-leakage) against a succinct 50-line specification.
One of the components that I'm most excited about is how we formally define security for a system at the wire-I/O-level --- we do this with a new security definition called "information-preserving refinement," inspired by the real/ideal paradigm from theoretical cryptography.
HN has been a huge part of my life since I started undergrad about 10 years ago (I post occasionally but mostly read). I would love to see some of the HN community there, whether in-person or over Zoom --- PhD thesis defense talks are open to the public, and my talk is aimed at a general CS/systems audience!
anishathalye | 2 years ago | on: TKey is a RISC-V computer in a USB-C case, that can run security applications
Transparency logs indeed are a neat ingredient to use here. I've heard of other software distributors (e.g., Firefox) using binary transparency logs but hadn't heard of anyone use them in the context of HSMs/security tokens/cryptocurrency wallets yet.
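To make the core append-only property concrete, here is a toy hash-chain log. Real transparency logs (e.g., Certificate Transparency) use Merkle trees so that inclusion and consistency can be proven efficiently, which this sketch omits; all names here are illustrative:

```python
import hashlib

def leaf_hash(data):
    # Domain-separate leaf hashes from internal chaining hashes.
    return hashlib.sha256(b"leaf:" + data).hexdigest()

class HashChainLog:
    # Minimal append-only log: each head commits to the previous head, so
    # rewriting any past entry changes every subsequent head, which auditors
    # comparing heads would detect.
    def __init__(self):
        self.head = hashlib.sha256(b"empty").hexdigest()
        self.entries = []

    def append(self, data):
        self.entries.append(data)
        self.head = hashlib.sha256((self.head + leaf_hash(data)).encode()).hexdigest()
        return self.head
```

For binary transparency, the entries would be hashes of released firmware images, and clients would refuse to run a binary that is not present in the log.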
anishathalye | 2 years ago | on: TKey is a RISC-V computer in a USB-C case, that can run security applications
We've been working on some research to formally verify the hardware/software of such devices [1, 2]. Neat how there are so many shared ideas: we also use a PicoRV32, run on an iCE40 FPGA, use UART for communication to/from the PicoRV32 to keep the security-critical part of the hardware simple, and use a separate MCU to convert between USB and UART.
Interesting decision to make the device stateless. Given that the application keys are generated by combining the UDS, USS, and the hash of the application [3], it seems this rules out software updates? Was this an intentional tradeoff, to have a sort of "forward security"?
In an earlier project I worked on [4], we had run into a similar issue (no space for this in the write-up though); there, we ended up using the following approach: applications are _signed_ by the developer (who can use any keypair they generate), the signature is checked at application load time, and the application-specific key is derived using the hash of the developer's public key instead of the hash of the application. This does have the downside that if the developer is compromised, an adversary can use this to sign a malicious application that can leak the key.
[1]: https://github.com/anishathalye/knox-hsm
[2]: https://pdos.csail.mit.edu/papers/knox:osdi22.pdf
[3]: https://tillitis.se/blog/2023/03/31/on-tkey-key-generation/
[4]: https://pdos.csail.mit.edu/papers/notary:sosp19.pdf
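The two derivation schemes discussed above can be sketched as follows. HMAC-SHA-256 is used purely for illustration; the TKey's actual KDF and input formatting differ, and the function names are made up:

```python
import hashlib
import hmac

def derive_app_key(device_secret, user_secret, app_binary):
    # Stateless derivation bound to the exact application binary: any change
    # to the app (including a legitimate update) yields a different key.
    h = hashlib.sha256(app_binary).digest()
    return hmac.new(device_secret, user_secret + h, hashlib.sha256).digest()

def derive_app_key_by_signer(device_secret, user_secret, signer_pubkey):
    # Alternative from the comment: bind the key to the developer's public
    # key (with a load-time signature check, not shown), so any app signed
    # by the same developer derives the same key, permitting updates.
    h = hashlib.sha256(signer_pubkey).digest()
    return hmac.new(device_secret, user_secret + h, hashlib.sha256).digest()
```

The tradeoff is visible in the derivations: the first scheme gives forward security against malicious "updates" at the cost of ruling out updates entirely, while the second preserves keys across signed updates at the cost of trusting the developer's signing key.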
anishathalye | 2 years ago | on: Automated Data Quality at Scale
Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019–2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.
The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!
P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab
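For a flavor of what these algorithms do, here is a drastically simplified version of the idea behind confident learning: flag examples whose given label receives low predicted probability from a trained model. cleanlab's actual algorithms estimate per-class thresholds and label-noise rates rather than using a fixed cutoff, so treat this only as a sketch (`flag_label_issues` is a made-up name):

```python
def flag_label_issues(labels, pred_probs, threshold=0.5):
    # labels[i] is the given class of example i; pred_probs[i][c] is a
    # model's predicted probability that example i belongs to class c.
    # Flag examples where the model assigns low probability to the given label.
    issues = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        if probs[y] < threshold:
            issues.append(i)
    return issues

labels = [0, 1, 0]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]  # third example looks mislabeled
print(flag_label_issues(labels, pred_probs))  # -> [2]
```

The key point is that the model's out-of-sample predictions are used to audit the dataset itself, not just to make predictions.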
anishathalye | 3 years ago | on: MIT Introduction to Data-Centric AI
MIT, like most universities, has many courses on machine learning (6.036, 6.867, and many others). Those classes teach techniques to produce effective models for a given dataset, and they focus heavily on the mathematical details of models rather than practical applications. However, in real-world applications of ML, the dataset is not fixed, and focusing on improving the data often gives better results than improving the model. We’ve personally seen this time and time again in our applied ML work as well as our research.
Data-Centric AI (DCAI) is an emerging science that studies techniques to improve datasets in a systematic/algorithmic way — given that this topic wasn’t covered in the standard curriculum, we (a group of PhD candidates and grads) thought that we should put together a new class! We taught this intensive 2-week course in January over MIT’s IAP term, and we’ve just published all the course material, including lecture videos, lecture notes, hands-on lab assignments, and lab solutions, in hopes that people outside the MIT community would find these resources useful.
We’d be happy to answer any questions related to the class or DCAI in general, and we’d love to hear any feedback on how we can improve the course material. Introduction to Data-Centric AI is open-source opencourseware, so feel free to make improvements directly: https://github.com/dcai-course/dcai-course.
anishathalye | 3 years ago | on: Crowdlab: Effective algorithms to handle data labeled by multiple annotators
The blog post gives some intuition for how it works, along with some benchmarking results, and the math and the nitty-gritty details can be found in this paper: https://cleanlab.github.io/multiannotator-benchmarks/paper.p...
Happy to answer any questions related to multi-annotator datasets or data-centric approaches to ML in general here.
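For a rough flavor of the problem setting, here is a toy consensus rule: majority vote over annotator labels, with a model's predicted probabilities breaking ties. CROWDLAB itself additionally estimates each annotator's quality and weights their votes accordingly, so this is only an illustrative baseline (the function name is made up):

```python
from collections import Counter

def consensus_label(annotations, model_probs):
    # annotations: labels given by the annotators for one example.
    # model_probs: model_probs[c] is a model's predicted probability of class c.
    counts = Counter(annotations)
    ranked = counts.most_common()
    best_count = ranked[0][1]
    tied = [label for label, c in ranked if c == best_count]
    if len(tied) == 1:
        return tied[0]
    # Break ties using the model's prediction.
    return max(tied, key=lambda label: model_probs[label])

print(consensus_label([0, 0, 1], [0.3, 0.7]))  # majority wins -> 0
print(consensus_label([0, 1], [0.3, 0.7]))     # tie, model breaks it -> 1
```

Methods like CROWDLAB go further by producing a confidence score for each consensus label, which tells you which examples are worth sending back for more annotations.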
anishathalye | 4 years ago | on: Cleanlab 2.0: Automatically Find Errors in ML Datasets
We’d love to hear any ideas or feedback from the HN community, especially from those who face data-quality challenges in their work. We (me, @cgn, and @_jonas), who all have a background in ML research, would also be happy to answer any questions you have related to cleanlab or data-centric AI.
anishathalye | 4 years ago | on: Yet Another GitHub Profile Generator
At that time, I was quite interested in adversarial examples and ML security.