Join us for the reading group on ML and cryptography organized by Shafi Goldwasser, Yael Kalai, Jonathan Shafer, Neekon Vafa and Vinod Vaikuntanathan. For questions, feel free to contact Jonathan or Neekon.
We initiate an investigation of learning tasks in a setting where the learner is given access to two
competing provers, only one of which is honest. Specifically, we consider the power of such learners
in assessing purported properties of opaque models. Following prior work in complexity theory that
considers the power of competing provers in various settings, we call this setting refereed learning.
After formulating a general definition of refereed learning tasks, we show refereed learning protocols
that obtain a level of accuracy that far exceeds what is obtainable at comparable cost without provers,
or even with a single prover. We concentrate on the task of choosing the better of two black-box
models with respect to some ground truth. While we consider a range of parameters, perhaps our most
notable result is in the high-precision range: For all $\epsilon > 0$ and ambient dimension $d$,
our learner makes only one query to the ground truth function, communicates only $(1 + 1/\epsilon^2) \cdot \mathsf{poly}(d)$
bits with the provers, and outputs a model whose loss is within a multiplicative factor of $(1 + \epsilon)$
of the best model's loss. Obtaining comparable loss with a single prover would require the learner to access the
ground truth at almost all of the points in the domain.
We also present lower bounds that demonstrate the optimality of our protocols in a number of respects,
including prover complexity, number of samples, and need for query access.
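The abstract does not spell out the protocol, but the basic leverage a second, competing prover gives a frugal learner can be sketched in a toy setting. Everything below (the models, the provers, the one-sided dispute resolution) is illustrative and not the paper's construction, which achieves a $(1+\epsilon)$ multiplicative loss guarantee with $\mathsf{poly}(d)$ communication:

```python
# Toy refereed comparison (illustrative only): two provers each advocate a
# model; when they disagree, the learner spends its single expensive
# ground-truth query on a witness point named by one prover.
DOMAIN = range(100)
f = lambda x: x % 7        # ground truth; queries to f are the scarce resource
model_a = lambda x: x % 7  # perfect model
model_b = lambda x: 0      # bad model, wrong on most inputs

def honest_prover():
    # Advocates model_a and names a witness where model_b disagrees with
    # the ground truth while model_a agrees.
    for x in DOMAIN:
        if model_b(x) != f(x) and model_a(x) == f(x):
            return "a", x
    return "a", None

def dishonest_prover():
    # Advocates model_b with a fabricated witness.  (A full protocol would
    # handle the symmetric case too; this sketch checks one side only.)
    return "b", 0

def refereed_choice():
    claim_1, witness_1 = honest_prover()
    claim_2, _ = dishonest_prover()
    if claim_1 == claim_2:      # provers agree: no ground-truth query needed
        return claim_1
    truth = f(witness_1)        # the single ground-truth query
    if model_a(witness_1) == truth and model_b(witness_1) != truth:
        return "a"              # witness checks out: prover 1 was honest
    return "b"

print(refereed_choice())  # -> a
```

The point of the toy: with only one prover, the learner has no way to know *where* to spend its lone query; a competing prover volunteers the incriminating point.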
Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them.
Specifically, we present a consensus sampling algorithm that, given k models and a prompt, achieves risk competitive with the average risk of the safest s of the k models, where s is a chosen parameter, while abstaining when there is insufficient agreement among them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (ICML 2023).
Our algorithm requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
This is joint work with Adam Tauman Kalai and Or Zamir.
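One way to make the "output only where enough models agree" idea concrete is rejection sampling against a consensus density. The sketch below is an illustration under our own simplifications (discrete token distributions given as dicts, consensus density taken as the s-th largest model probability), not the algorithm from the talk:

```python
import random
random.seed(1)

def sth_largest(vals, s):
    return sorted(vals, reverse=True)[s - 1]

def consensus_sample(models, s, max_tries=100):
    """Rejection-sample from the (unnormalized) consensus density
    q(y) = s-th largest of {p_i(y)}: q puts mass only where at least
    s of the k models each assign probability >= q(y).
    models: list of dicts mapping token -> probability."""
    k = len(models)
    tokens = list(models[0])
    avg = {y: sum(m[y] for m in models) / k for y in tokens}  # proposal
    for _ in range(max_tries):
        y = random.choices(tokens, weights=[avg[t] for t in tokens])[0]
        q = sth_largest([m[y] for m in models], s)
        # q(y) <= k * avg(y), so the acceptance probability is <= 1
        if random.random() < q / (k * avg[y]):
            return y
    return None  # abstain: not enough agreement among the models

m_safe1 = {"a": 0.9, "b": 0.1, "c": 0.0}
m_safe2 = {"a": 0.8, "b": 0.2, "c": 0.0}
m_rogue = {"a": 0.0, "b": 0.0, "c": 1.0}
# With s = 2, token "c" has consensus density 0 and is never emitted,
# even though the rogue model puts all of its mass there.
print(consensus_sample([m_safe1, m_safe2, m_rogue], s=2))
```

The rogue model can still trigger abstention (by dragging the proposal away from the consensus), but it cannot force an unsafe output on its own.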
Fall 2025
Sequences of Logits and the Low Rank Structure of Language Models
Noah Golowich (Microsoft Research NYC)
November 18, 2025
A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model’s logits for varying sets of prompts and responses have low approximate rank. Taking a theoretical perspective, we then show that any distribution over sequences with such structure of low approximate logit rank can be provably learned using polynomially many queries to the model's logits and polynomial time. Finally, we show that insights resulting from this perspective of low rank can be leveraged for generation: in particular, we can generate a response to a target prompt using a linear combination of the model’s outputs on unrelated, or even nonsensical, prompts. This new generation procedure may have applications in AI alignment, as it could, for instance, yield new approaches for constructing jailbreaks.
Based on joint work with Allen Liu and Abhishek Shetty.
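The notion of "low approximate rank" of a logit matrix can be illustrated on synthetic data: build a matrix whose entry (i, j) plays the role of a logit for response j given prompt i, with planted rank-r structure plus noise, and read the approximate rank off the singular values. The dimensions, rank, and threshold below are all arbitrary choices for illustration:

```python
import numpy as np
rng = np.random.default_rng(0)

n_prompts, n_responses, r = 200, 300, 8
# Synthetic stand-in for a logit matrix: exact rank r plus small noise.
U = rng.normal(size=(n_prompts, r))
V = rng.normal(size=(r, n_responses))
logits = U @ V + 0.01 * rng.normal(size=(n_prompts, n_responses))

# Approximate rank: number of singular values above a small fraction
# of the largest one.  The spectrum collapses after r directions.
sv = np.linalg.svd(logits, compute_uv=False)
approx_rank = int(np.sum(sv > 0.05 * sv[0]))
print(approx_rank)  # -> 8
```

The empirical claim in the talk is that matrices built from real model logits behave like this synthetic one: a few directions carry almost all of the spectrum.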
Large language models (LLMs) sometimes generate statements that are plausible but factually incorrect—a phenomenon commonly called “hallucination.” We argue that these errors are not mysterious failures of architecture or reasoning, but rather predictable consequences of standard training and evaluation incentives.
We show (i) that hallucinations can be viewed as classification errors: when pretrained models cannot reliably distinguish a false statement from a true one, they may produce the false option rather than saying “I don’t know”; (ii) that optimizing for benchmark performance encourages guessing rather than abstaining, since most evaluation metrics penalize expressing uncertainty; and (iii) that a possible mitigation path lies in revising existing benchmarks to reward calibrated abstention, thus realigning incentives in model development.
Joint work with Santosh Vempala (Georgia Tech) and Ofir Nachum & Edwin Zhang (OpenAI).
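The incentive argument in (ii) and (iii) is a two-line expected-value computation. Under accuracy-only grading, a guess with any positive confidence beats abstaining; adding a wrong-answer penalty flips the incentive for low-confidence guesses (the scoring rule below is a generic illustration, not a specific benchmark's):

```python
def expected_score(p_correct, wrong_penalty):
    # 1 point if right, -wrong_penalty if wrong; abstaining scores 0.
    return p_correct * 1 + (1 - p_correct) * (-wrong_penalty)

p = 0.3  # the model is only 30% confident in its answer
print(expected_score(p, wrong_penalty=0))  # -> 0.3: guessing beats abstaining
print(expected_score(p, wrong_penalty=1))  # -> -0.4: abstention is now optimal
```

With accuracy-only metrics (wrong_penalty = 0), guessing strictly dominates abstention at every confidence level, which is exactly the incentive (iii) proposes to remove.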
Statistically Undetectable Backdoors in Deep Neural Networks
Neekon Vafa (MIT)
October 21, 2025
In this talk, I will show how an adversarial model trainer can plant backdoors in a large class of deep, feedforward neural networks. These backdoors are statistically undetectable in the white-box setting, meaning that the backdoored and honestly trained models are close in total variation distance, even given the full descriptions of the models (e.g., all of the weights). The backdoor provides access to (invariance-based) adversarial examples for every input. However, without the backdoor, no one can generate any such adversarial examples, assuming the worst-case hardness of shortest vector problems on lattices. Our main technical tool relies on a cryptographic perspective on the ubiquitous Johnson-Lindenstrauss lemma.
This talk is based on upcoming work with Andrej Bogdanov and Alon Rosen.
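The Johnson-Lindenstrauss lemma mentioned at the end says that a random projection to far fewer dimensions approximately preserves pairwise distances. Here is a standard numerical demonstration with a scaled Gaussian matrix (the cryptographic twist in the talk, replacing the truly random matrix with a pseudorandom one, is not captured here):

```python
import numpy as np
rng = np.random.default_rng(0)

d, k, n = 10_000, 1_000, 20
X = rng.normal(size=(n, d))               # n points in high dimension d
A = rng.normal(size=(k, d)) / np.sqrt(k)  # JL map: scaled Gaussian matrix
Y = X @ A.T                               # project down to k dimensions

# All pairwise distances survive up to a small multiplicative error.
orig = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
proj = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
mask = ~np.eye(n, dtype=bool)
ratios = proj[mask] / orig[mask]
print(ratios.min(), ratios.max())  # both close to 1
```

The distortion shrinks as k grows; the point of the backdoor construction is that whether A was chosen honestly at random or adversarially can be made statistically undetectable.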
What Can Cryptography Tell Us About AI?
Greg Gluch (Simons Institute, UC Berkeley)
October 14, 2025
I will present three results that use cryptographic assumptions to characterize both the limits and possibilities of AI safety. First, we show that AI alignment cannot be achieved using only black-box filters of harmful content. Second, we prove a separation between mitigation and detection at inference time, where mitigation uses additional inference-time computation to refine an LLM’s output into a safer or more accurate result. Third, we conduct a meta-analysis of watermarks, adversarial defenses, and transferable attacks, showing that for every learning task, at least one of these three schemes must exist.
Each result carries a broader message: the first argues for the necessity of weight access in AI auditing; the second provides a rule of thumb for allocating inference-time resources when safety is the goal; and the third offers an explanation for why adversarial examples often transfer across different LLMs.
Bio. Greg Gluch is a postdoctoral researcher at the Simons Institute, UC Berkeley, working with Shafi Goldwasser, and currently a long-term visitor at MIT. He received his PhD from EPFL, advised by Rüdiger Urbanke and Michael Kapralov. His research spans AI safety—covering topics such as adversarial examples, alignment, and verifiability—and quantum interactive proofs and their links to physics.
Suppose an untrusted but powerful data analyst claims to have drawn many samples from an unknown distribution, run some complicated analysis over the samples, and arrived at a conclusion, which they reveal to us. Say that we are granted a few samples from the same distribution. Can we verify that the results of the analysis are approximately correct?
In this talk we review a recent line of work showing that the answer to this question is positive, and present constructions of proof systems that allow a probabilistic verifier to ascertain that the results of an analysis are approximately correct, while drawing fewer samples and using fewer computational resources than would be needed to replicate the analysis. We focus on distribution testing problems: verifying that an unknown distribution is close to having a claimed property.
Moving from data science to AI: Can similar tools and interactive proofs provide a theoretical framework for addressing questions of verification in the context of AI systems? We will open the floor for discussion and ideas.
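For a feel of what "distribution testing" means, here is the classic collision-based uniformity tester, one of the simplest examples in the area (this is the standalone tester, without a prover; the talk's proof systems let the verifier get away with even fewer samples):

```python
import random
from itertools import combinations
random.seed(0)

def collision_rate(samples):
    # Fraction of sample pairs that collide.  Uniform over n elements has
    # collision probability exactly 1/n; anything far from uniform
    # collides noticeably more often.
    m = len(samples)
    hits = sum(x == y for x, y in combinations(samples, 2))
    return hits / (m * (m - 1) / 2)

n, m = 50, 1000
uniform = [random.randrange(n) for _ in range(m)]
skewed = [random.randrange(n // 4) for _ in range(m)]  # only n/4 elements

print(collision_rate(uniform))  # ~ 1/50 = 0.02
print(collision_rate(skewed))   # ~ 1/12 ≈ 0.083: far from uniform
```

Thresholding the collision rate distinguishes the two cases with far fewer samples than would be needed to estimate the distribution itself.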
A Survey of Cryptographic Watermarks for AI-Generated Content
Generative AI watermarks are hidden patterns embedded in AI-generated content to facilitate its detection. A recent line of work draws on cryptography to define desired properties of watermarks and realize these properties with surprisingly simple constructions. In this talk, I will survey common definitions and approaches, focusing on those from the cryptography community. I'll end with some open questions that I find interesting.
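To fix intuition for "hidden patterns embedded in AI-generated content", here is a toy in the spirit of red/green-list watermarks: a keyed PRF splits the vocabulary into a pseudorandom "green" half at every step, generation favors green tokens, and detection counts the green fraction. This is a deliberately crude sketch (a uniform stand-in for the language model, hard rather than soft green bias); the cryptographic schemes surveyed in the talk achieve much stronger guarantees, such as leaving the output distribution unchanged:

```python
import hashlib
import hmac
import random
random.seed(0)

KEY = b"secret watermark key"
VOCAB = [f"tok{i}" for i in range(1000)]

def is_green(prev_token, token):
    # Keyed PRF of (prev_token, token): roughly half the vocabulary is
    # "green" at each step, and only key-holders can tell which half.
    tag = hmac.new(KEY, f"{prev_token}|{token}".encode(), hashlib.sha256)
    return tag.digest()[0] % 2 == 0

def generate(length):
    out = ["<s>"]
    for _ in range(length):
        # Stand-in for a language model: sample uniformly, but only from
        # the green half (a real scheme softly boosts green logits).
        greens = [t for t in VOCAB if is_green(out[-1], t)]
        out.append(random.choice(greens))
    return out[1:]

def green_fraction(tokens):
    prev, hits = "<s>", 0
    for t in tokens:
        hits += is_green(prev, t)
        prev = t
    return hits / len(tokens)

text = generate(100)
print(green_fraction(text))                          # -> 1.0: watermarked
print(green_fraction(random.choices(VOCAB, k=100)))  # roughly 0.5: no watermark
```

Detection needs only the key, not the model, and the statistical gap between ~1.0 and ~0.5 green fractions makes false positives exponentially unlikely in the text length.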