Join us for the reading group on ML and cryptography organized by Shafi Goldwasser, Yael Kalai, Jonathan Shafer, Neekon Vafa and Vinod Vaikuntanathan. For questions, feel free to contact Jonathan or Neekon.
A Theoretical Point of View on Machine Unlearning and Privacy
Odelia Melamed (Weizmann Institute of Science)
May 12, 2026
How do machine learning models really "learn" the data?
Modern models learn complex patterns and high-level structure from massive datasets, yet surprisingly, it is often possible to efficiently remove the influence of a single training example without significantly affecting overall performance. While this phenomenon may appear deceptively simple, it offers a powerful lens for understanding the learning and unlearning process itself.
Machine Unlearning is motivated by privacy regulations and the growing demand for trustworthy AI. In practice, many successful methods rely on surprisingly simple operations such as gradient ascent or projection, while the main theoretical foundations, rooted in Differential Privacy theory, remain limited in scope and difficult to reconcile with strong empirical results. In this talk, I will present our novel unlearning criterion, rooted in learning theory and the implicit bias of training. I will discuss how the effect of a single data point can be isolated from the training process in several settings, and present a new perspective on practical and theoretically grounded unlearning.
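As a sense of how simple the practical operations mentioned above can be, here is a minimal sketch of gradient-ascent unlearning in PyTorch. It is a generic heuristic from the literature, not the criterion or algorithms presented in the talk, and the model, loss function, and hyperparameters are placeholders.

```python
import torch

def gradient_ascent_unlearn(model, forget_batch, loss_fn, lr=1e-4, steps=10):
    """Take a few gradient-ascent steps on the examples to be forgotten,
    degrading the model's fit to them while (hopefully) leaving the rest
    of its behavior intact.  Illustrative heuristic only."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    inputs, targets = forget_batch
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        (-loss).backward()   # ascend, rather than descend, on the forget set
        opt.step()
    return model
```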
How does the choice of training data influence an AI model? This question is of central importance to interpretability, privacy, and basic science. At its core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning process.
I will present the first data deletion scheme capable of predicting model outputs with vanishing error ε and failure probability δ in the deep learning setting. The precomputation and prediction algorithms are only Õ(log(1/δ)/ε^2) factors slower than regular training and inference, respectively. The storage requirements are those of Õ(log(1/δ)/ε^2) models.
The proof is based on a simple assumption about the stability of training. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, I will demonstrate that stability is satisfied in a minimal set of experiments.
At a technical level, this work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.
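To make the last point concrete, here is a minimal sketch of how forward-mode automatic differentiation yields cheap directional derivatives, using JAX's `jax.jvp` and nesting it once for a second-order derivative. The construction in the talk, which uses higher-order derivatives in random complex directions to locally sketch an arithmetic circuit, is more involved; the function `f` below is just a stand-in.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Stand-in for the scalar output of an arithmetic circuit / network.
    return jnp.sum(jnp.tanh(x) ** 3)

x = jnp.arange(4.0)                   # evaluation point
v = jnp.array([1.0, -1.0, 0.5, 2.0])  # direction (random in practice)

# First directional derivative: d/dt f(x + t v) at t = 0.
_, d1 = jax.jvp(f, (x,), (v,))

# Second directional derivative along v, by nesting forward mode.
d_along_v = lambda y: jax.jvp(f, (y,), (v,))[1]
_, d2 = jax.jvp(d_along_v, (x,), (v,))

print(d1, d2)
```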
Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them.
Specifically, we present a consensus sampling algorithm that, given k models and a prompt, achieves risk competitive with the average risk of the safest s of the k models, where s is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (ICML 2023).
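For intuition, the toy sketch below conveys only the flavor of consensus-with-abstention; it is not the paper's algorithm. The averaging proposal and the agreement threshold `tol` are inventions for illustration, and the k models are represented by hypothetical next-output distributions.

```python
import numpy as np

def consensus_sample(probs, s, tol=0.5, rng=None):
    """Toy consensus sampler over a finite output space.

    probs: (k, V) array; row i is model i's distribution over V outputs.
    s:     number of models that must "agree" for the sample to be returned.
    tol:   a model agrees if it assigns the candidate at least `tol` times
           the proposal probability.
    Returns an output index, or None to abstain.
    """
    rng = np.random.default_rng() if rng is None else rng
    proposal = probs.mean(axis=0)                 # mixture of all k models
    y = rng.choice(len(proposal), p=proposal)
    agree = int(np.sum(probs[:, y] >= tol * proposal[y]))
    return y if agree >= s else None              # abstain without consensus
```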
Our algorithm requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying the safety guarantees of an unknown subset of models within a collection to those of a single reliable model.
This is joint work with Adam Tauman Kalai and Or Zamir.
Subliminal Transfer in Training Data and Low Logit Rank
Training modern large language models (LLMs) has become a veritable smorgasbord of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques to understand the effects of datasets on the model's properties. This is exacerbated by recent experiments showing that datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting that a fundamental account of such phenomena is missing. Towards understanding such effects, inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets.
We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets such that models trained on them exhibit behaviors ranging from having specific preferences, to responding to prompts in a different language not present in the dataset, to taking on a different persona. Crucially, the effect of a selected subset persists across models with varying architectures, supporting its generality and universality. Perhaps more importantly, we show that this phenomenon can be explained by the notion of low logit rank (introduced in GLS25), pointing towards an avenue for designing empirically validated experiments starting from a theoretical lens.
Based on joint works with Noah Golowich, Allen Liu, Ishaq Aden-ali, Nika Haghtalab and Ankur Moitra.
Bio: Abhishek Shetty is the incoming Catherine M. and James E. Allchin Early-Career Assistant Professor in Computer Science at Georgia Tech, starting in Fall 2026. Currently, he is a FODSI postdoctoral fellow at the Massachusetts Institute of Technology hosted by Costis Daskalakis, Ankur Moitra, and Sasha Rakhlin; previously, he was a PhD student at the University of California at Berkeley advised by Nika Haghtalab. His research focuses on building fundamental connections between the theory of computation and machine learning, with particular interest in understanding the algorithmic and statistical role of data in generalization and sequential decision making, and a recent focus on understanding large language models. His research has been recognized with an Apple AI/ML fellowship and an ASA SCGS best student paper award.
We study the learning-unlearning paradigm, motivated by emerging requirements for data deletion and algorithmic accountability. Given an initial dataset, the learner first trains a predictor. Subsequently, upon a request to remove a subset of training examples, the goal is to produce an updated predictor that is indistinguishable from the one obtained by retraining from scratch on the remaining data; crucially, this must be done without access to the original dataset and without retraining. This talk focuses on the space complexity of unlearning: what information must a learner retain to support data deletion post-training?
We develop lower and upper bounds under a range of data access and storage models, highlighting fundamental trade-offs between memory, computation, and update fidelity. A key aspect of the talk is a novel ticketed learning-unlearning framework, in which the learning algorithm augments training by issuing each data point a compact, encrypted "ticket," while also maintaining a small amount of centralized state. These tickets are later used for unlearning. We show how this framework enables new algorithmic designs that significantly reduce the memory required for unlearning while preserving exact unlearning guarantees.
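As a toy illustration of the ticketed interface (not the constructions from the talk), consider a nearest-class-mean learner: the central state is a per-class sum and count, and each example's ticket contains exactly what is needed to subtract it later. In the actual framework the tickets are encrypted and the learners are far richer; the class below is only meant to make the interface concrete.

```python
import numpy as np

class TicketedMeanClassifier:
    """Toy ticketed exact unlearning for a nearest-class-mean predictor."""

    def __init__(self, dim, n_classes):
        self.sums = np.zeros((n_classes, dim))       # small central state
        self.counts = np.zeros(n_classes, dtype=int)

    def learn(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1
        # The ticket is handed back to the data owner; here it is stored in
        # the clear, whereas the framework would encrypt it.
        return {"x": np.array(x, copy=True), "y": y}

    def unlearn(self, ticket):
        # Exact removal, without the original dataset and without retraining.
        self.sums[ticket["y"]] -= ticket["x"]
        self.counts[ticket["y"]] -= 1

    def predict(self, x):
        means = self.sums / np.maximum(self.counts[:, None], 1)
        return int(np.argmin(np.linalg.norm(means - x, axis=1)))
```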
Brief bio: Ayush is a Senior Research Scientist at Chan Zuckerberg Initiative (BioHub). Previously, he was a postdoctoral researcher at the Institute of Data, Systems, and Society (IDSS) at MIT, hosted by Prof. Sasha Rakhlin. He earned his Ph.D. in Computer Science at Cornell University, advised by Prof. Karthik Sridharan and Prof. Robert D. Kleinberg. His research focuses on updatable machine learning, developing algorithms to efficiently update ML models as they interact with changing environments and real-world constraints. Before his Ph.D., Ayush received his B.Tech. in Computer Science from IIT Kanpur, where he earned the President's Gold Medal. His work has been recognized with a student best paper award at COLT 2019, and multiple orals and spotlights at premier ML conferences such as NeurIPS and ICLR. He was also a finalist for the Meta AI fellowship in Statistics in 2022.
We initiate an investigation of learning tasks in a setting where the learner is given access to two competing provers, only one of which is honest. Specifically, we consider the power of such learners in assessing purported properties of opaque models. Following prior work in complexity theory that considers the power of competing provers in various settings, we call this setting refereed learning. After formulating a general definition of refereed learning tasks, we show refereed learning protocols that obtain a level of accuracy that far exceeds what is obtainable at comparable cost without provers, or even with a single prover. We concentrate on the task of choosing the better one out of two black-box models, with respect to some ground truth. While we consider a range of parameters, perhaps our most notable result is in the high-precision range: For all $\epsilon > 0$ and ambient dimension $d$, our learner makes only one query to the ground truth function, communicates only $(1 + 1/\epsilon^2) \cdot \mathsf{poly}(d)$ bits with the provers, and outputs a model whose loss is within a multiplicative factor of $(1 + \epsilon)$ of the best model's loss. Obtaining comparable loss with a single prover would require the learner to access the ground truth at almost all of the points in the domain.
We also present lower bounds that demonstrate the optimality of our protocols in a number of respects, including prover complexity, number of samples, and need for query access.
Sequences of Logits and the Low Rank Structure of Language Models
Noah Golowich (Microsoft Research NYC)
November 18, 2025
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to studying the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model's logits for varying sets of prompts and responses have low approximate rank. Taking a theoretical perspective, we then show that any distribution over sequences with low approximate logit rank can be provably learned using polynomially many queries to the model's logits and polynomial time. Finally, we show that insights from this low-rank perspective can be leveraged for generation: in particular, we can generate a response to a target prompt using a linear combination of the model's outputs on unrelated, or even nonsensical, prompts. This new generation procedure may have applications in AI alignment, as it could potentially, for instance, yield new approaches for constructing jailbreaks.
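To make the notion of approximate logit rank concrete, the sketch below builds a prompt-by-response matrix of scores and reports how many singular values are needed to capture most of its energy. The matrix here is a synthetic low-rank stand-in rather than logits queried from an actual language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: entry (i, j) would be the model's logit for response j
# given prompt i; here we fabricate a low-rank matrix plus a little noise.
n_prompts, n_responses, true_rank = 200, 300, 8
M = rng.normal(size=(n_prompts, true_rank)) @ rng.normal(size=(true_rank, n_responses))
M += 0.01 * rng.normal(size=M.shape)

# Approximate rank: singular values needed to capture 99% of the energy.
s = np.linalg.svd(M, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)
print("approximate rank:", int(np.searchsorted(energy, 0.99)) + 1)
```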
Based on joint work with Allen Liu and Abhishek Shetty.
Large language models (LLMs) sometimes generate statements that are plausible but factually incorrect, a phenomenon commonly called "hallucination." We argue that these errors are not mysterious failures of architecture or reasoning, but rather predictable consequences of standard training and evaluation incentives.
We show (i) that hallucinations can be viewed as classification errors: when pretrained models cannot reliably distinguish a false statement from a true one, they may produce the false option rather than saying "I don't know"; (ii) that optimization of benchmark performance encourages guessing rather than abstaining, since most evaluation metrics penalize expressing uncertainty; and (iii) that a possible mitigation path lies in revising existing benchmarks to reward calibrated abstention, thus realigning incentives in model development.
Joint work with Santosh Vempala (Georgia Tech) and Ofir Nachum & Edwin Zhang (OpenAI).
Statistically Undetectable Backdoors in Deep Neural Networks
Neekon Vafa (MIT)
October 21, 2025
In this talk, I will show how an adversarial model trainer can plant backdoors in a large class of deep, feedforward neural networks. These backdoors are statistically undetectable in the white-box setting, meaning that the backdoored and honestly trained models are close in total variation distance, even given the full descriptions of the models (e.g., all of the weights). The backdoor provides access to (invariance-based) adversarial examples for every input. However, without the backdoor, no one can generate any such adversarial examples, assuming the worst-case hardness of shortest vector problems on lattices. Our main technical tool relies on a cryptographic perspective on the ubiquitous Johnson-Lindenstrauss lemma.
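For readers unfamiliar with the Johnson-Lindenstrauss lemma mentioned above, the snippet below shows the classical statement in action: a random Gaussian projection approximately preserves pairwise distances. This illustrates only the lemma itself, not the cryptographic twist or the backdoor construction from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 1_000          # points, ambient dim, projected dim
X = rng.normal(size=(n, d))

# Random Gaussian projection, scaled so norms are preserved in expectation.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i in range(n) for j in range(i + 1, n)]
print("distance distortion range:", min(ratios), max(ratios))   # both near 1
```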
This talk is based on upcoming work with Andrej Bogdanov and Alon Rosen.
What Can Cryptography Tell Us About AI?
Greg Gluch (Simons Institute, UC Berkeley)
October 14, 2025
I will present three results that use cryptographic assumptions to characterize both the limits and possibilities of AI safety. First, we show that AI alignment cannot be achieved using only black-box filters of harmful content. Second, we prove a separation between mitigation and detection at inference time, where mitigation refines an LLM's output with additional computation to produce a safer or more accurate result. Third, we conduct a meta-analysis of watermarks, adversarial defenses, and transferable attacks, showing that for every learning task, at least one of these three schemes must exist.
Each result carries a broader message: the first argues for the necessity of weight access in AI auditing; the second provides a rule of thumb for allocating inference-time resources when safety is the goal; and the third offers an explanation for why adversarial examples often transfer across different LLMs.
Bio. Greg Gluch is a postdoctoral researcher at the Simons Institute, UC Berkeley, working with Shafi Goldwasser, and currently a long-term visitor at MIT. He received his PhD from EPFL, advised by Rüdiger Urbanke and Michael Kapralov. His research spans AI safety (covering topics such as adversarial examples, alignment, and verifiability) as well as quantum interactive proofs and their links to physics.
Assume some untrusted but powerful data analyst claims to have drawn many samples from an unknown distribution, run some complicated analysis over the samples, and arrived at a conclusion, which they reveal to us. Say that we are granted a few samples from the same distribution. Can we verify that the results of the analysis are approximately correct?
In this talk we review a recent line of work showing that the answer to this question is positive, and we present constructions of proof systems that allow a probabilistic verifier to ascertain that the results of an analysis are approximately correct, while drawing fewer samples and using fewer computational resources than would be needed to replicate the analysis. We focus on distribution testing problems: verifying that an unknown distribution is close to having a claimed property.
Moving from data science to AI: Can similar tools and interactive proofs provide a theoretical framework for addressing questions of verification in the context of AI systems? We will open the floor for discussion and ideas.
A Survey of Cryptographic Watermarks for AI-Generated Content
Generative AI watermarks are hidden patterns embedded in AI-generated content to facilitate its detection. A recent line of work draws on cryptography to define desired properties of watermarks and realize these properties with surprisingly simple constructions. In this talk, I will survey common definitions and approaches, focusing on those from the cryptography community. I'll end with some open questions that I find interesting.
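As a taste of how simple such constructions can be, here is a minimal sketch of one well-known "green-list" scheme from the literature: a keyed hash of the previous token pseudorandomly partitions the vocabulary, generation is biased toward the green half, and detection checks whether a suspiciously large fraction of tokens lands in the green lists. The cryptographic schemes surveyed in the talk differ in their definitions and guarantees, but share this shape of a keyed pseudorandom function driving both embedding and detection; all names and parameters below are illustrative.

```python
import hashlib
import numpy as np

SECRET_KEY = b"example-watermark-key"   # hypothetical shared secret

def green_mask(prev_token, vocab_size, frac=0.5):
    """Keyed pseudorandom partition of the vocabulary, seeded by the
    previous token; returns a boolean mask of 'green' tokens."""
    digest = hashlib.sha256(SECRET_KEY + str(prev_token).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.choice(vocab_size, int(frac * vocab_size), replace=False)] = True
    return mask

def watermarked_logits(logits, prev_token, delta=2.0):
    """Embedding: nudge generation toward green tokens."""
    return logits + delta * green_mask(prev_token, len(logits))

def green_fraction(tokens, vocab_size):
    """Detection: with the key, count how often a token lands in the green
    list seeded by its predecessor; unwatermarked text gives about 0.5."""
    hits = [green_mask(prev, vocab_size)[tok]
            for prev, tok in zip(tokens[:-1], tokens[1:])]
    return float(np.mean(hits))
```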