Projects I’m Excited About

I’ve studied AI control for the last few months and accumulated a long list of project ideas. I wrote this post to figure out which ideas to pursue next and get feedback from other researchers. I ended up with four directions I’d like to see some significant progress on. While writing I realized that many of these ideas are closer to categories of research than they are specific project proposals, so I tried to list at least one small experiment under each section. Any similar or relevant projects should be noted under each section, but it’s possible I missed some.

Some of these ideas are directly inspired by posts I read over the past few months:

Bruce Lee’s “Salient Directions in AI Control”

Ryan Greenblatt’s “An Overview of Control Measures”

Thanks to the authors for posting these!

I’m open to guidance on which projects I should pursue, how the experiments should be structured, or anything else. I have limited funding, so many projects won’t be completed for a while, if at all. Feel free to take them and let me know what you find!

Thank you to Bruce Lee for comments and discussion on a draft of this post.

Omission Attacks

Update June 17, 2026: My project proposal is up on LessWrong. The writing below may clarify some of my thinking about omission attacks, but the proposal is much more up to date with my plan moving forwards. If you are interested in funding any part of this project let me know!

Last week I read Redwood’s post on Research Sabotage in ML Codebases. They mention that the hardest type of attack to catch was “Omission Sabotage”, where the red team simply leaves out an important design detail. Separately, during the Apart Research Hackathon I participated in, the winning paper from Jurgis Kemeklis and Nikhil Vemuri was “Omission Attacks”, where the red team’s side task was to simply leave out a key step. They too found this class of attack was very difficult to catch. I think the omission pattern warrants more focus.

Omission attacks don’t quite break from classic control assumptions, but they do break from my conception of classic monitor and auditor patterns. Monitors and auditors are usually tasked to look out for attacks, which take the shape of actions. Omissions are the absence of necessary actions. My hypothesis is that monitors and auditors have a harder time trying to catch omissions rather than “normal” attacks because omissions require some prediction about what should happen, while normal attacks only require that you realize that something shouldn’t have happened. It’s not necessary to find out why omissions are hard to catch to find out how dangerous they are, but the hypothesis may guide how we think about monitoring them later on.

Another hypothesis I have is that there are certain traits of environments that make omission attacks extra difficult to catch. While ML experiments have many common patterns, the results are unknown while completing the experiment (otherwise why would you be doing the experiment), and likely some steps towards the end of the experiment are unknown (it could be the case that every step is known upfront, but intermediate results often change later steps). I imagine the more unknowns there are in an experiment, the easier it is to omit something without it being suspicious. How should the monitor know that the omission didn’t naturally follow from the results of the previous step without its own intuition of what should happen after the previous step?

In contrast to ML experiments, security settings require more scrutiny because they are at risk for concentrated failures, not because they are especially ripe for omissions. I imagine that one mitigation for omissions in security settings is to write great unit tests. An agent will have a harder time sneaking an omission by if the code already has tests that cover the section of the code the agent wants to attack. That being said, the “Omission Attacks” paper above does show some common configurations the attacker can omit.

We may already see omissions happening naturally. For instance, I often go through whole context windows before realizing Claude Code omitted a key step that would have messed up an experiment. While I wouldn’t describe this as an “attack”, it could be an early sign of what we want to watch out for. Additionally, Alex Mallen’s post on fitness-seeking AIs explains how “…AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues…” I think omissions fit cleanly into this pattern of misalignment.

There are a few mitigations I believe should work for omissions. CoT monitoring is at the top of the list. It doesn’t seem likely to me that an attacker will have an easier time hiding omissions than any other attack. I think the next best mitigation is writing good test cases. An async monitor could write test cases for the code afterwards, or we could ask the attacker to write test cases upfront based on a plan. I don’t know if either of these would work in practice, but with enough test coverage it should be really hard to omit something without setting off a test case. In ML experiments, it may make sense for the async monitor to just re-run the experiments with different settings (like the Research Sabotage paper above did), rather than write test cases.

To start, I’d like to collect and/or build a set of tasks where omission attacks are relevant. These could be from security settings and ML sabotage settings. I think benchmarking omission attacks against existing high-stakes attacks, and developing some theory of omission monitoring are the two things I’d like to accomplish. There are interesting add-ons to the project like checking how many attempts it takes a trusted agent to find and fix the omission, letting monitors build and run tests, and seeing if attackers can dodge CoT monitors by not reasoning about omissions.

Scaling Laws for Monitoring

Control generally assumes a model can be trusted if it is less powerful than an untrusted model. This assumption makes sense, but we don’t really have a strong idea of when we will stop trusting models (as far as I know), or if we will be able to catch scheming at the point we stop trusting models. It’s possible the answer is something blurry, like “trust isn’t binary”, or “we won’t have a single point at which we stop trusting models”, but still, we don’t even really seem to have a strong idea of trends in monitoring for the scale-up of capabilities thus far, much less the scaling yet to come.

Some have tried to establish a trend. OpenAI’s Monitoring Monitorability has a section on monitorability with respect to pretraining compute, although it does focus on CoT monitoring only. Bruce Lee’s post notes the best example of this research area I’ve found so far: METR’s “Early Work on Monitorability Evaluations”. I won’t go into detail, but despite not being the main focus of the project, the matrix with model capability on both axes is a proof of concept for this project.

A large project would study how sets of weaker models perform at monitoring a stronger model, with fixed parameters like a capability gap or setting, over capability improvements. I think to define scaling laws we would have to study many attacker-monitor pairs, and even then we would be limiting ourselves to one of the many types of monitors and information types one could use. This is prohibitively expensive.

Something can be gained from projects even smaller than the METR study. We can rebuild part of the matrix from the METR study with no CoT and in a newer control setting like LinuxArena, for instance. I’d choose non-CoT monitoring because it seems likely OpenAI will continue to study CoT monitorability as capabilities improve, and a diverse setting like LinuxArena because it would shore up METR’s concerns about task breadth. Even this seems prohibitively expensive right now, but maybe someone else can try it. Alternatively, maybe there is a preliminary meta-analysis style project to be done by taking results from existing monitoring projects, and trying to piece them together across different sets of assumptions.

It could be more useful to find which types of tasks and attackers have a positive or negative trend in monitorability, rather than find a general trend across tasks and attackers. I’d be more interested in knowing where we need to plug the gaps in monitoring with other methods, rather than if monitoring works better or worse than we expect. If so, this project is best understood in the context of environments and attackers as described below.

Transferable Monitors

We don’t seem to have much of an idea of how well monitors transfer. There are a range of high-stakes control threats like data exfiltration, installing malware, or accessing credentials. There are also a range of ways the schemer could achieve those goals for a given setting, and an even larger range across settings. We should try to learn what enables a monitor to catch many threats, and one way to do that is to test monitors’ ability to generalize.

There are a few ways we can try to generalize monitors:

  1. Environment transfer: train on some settings (BashArena, LinuxArena), test on new settings.
  2. Side-task transfer: train on exfiltration/backdoor/etc., test on unseen threat categories.
  3. Main-task transfer: train on security/research tasks, test on other main task types.
  4. Attacker transfer: train on one attacker model or attack-generation process, test on another.

“Prompt optimization can enable AI control research” is a good example of the type of project I’d like to accomplish at a larger scale and on new settings. I suspect DSPy and GEPA are simple tools that could make general monitors. Training monitors with RL or SFT and testing their transferability also seems good, but I’ll leave that for another time.

Starting with environment transfer seems simplest, and main task/side task transfer is wrapped up in environment transfer to some extent. There is probably a mega-project here where you try to train monitors across environments, with held-out main task sets, held-out side task sets, held-out attacker models and do a leave-one-environment-out strategy across all of the environments. A quicker validation is to pick a small set of environments and train monitors on different subsets of environments. For example, take four environments, train A, B and C on A, AB, ABC respectively, then test all on D. There are a few permutations of this, like having a holdout set for each environment we can test on, being specific about side task selection, etc. The general goal is to find if more samples help generalize, what types of samples help generalize, and how much any of this training makes a difference. The larger question really is if monitoring is an accessible, special property of models that we can improve. This project is just a first stab at that.

The usefulness of these results depends on two things: first, DSPy and GEPA finding a high-quality monitor for each environment set; and second, the environments being diverse enough for the results to mean something. There is also a question of whether it’s worth it to test across environments rather than side task types, when we generally know what deployment settings will look like, but we might not know how attackers could achieve side tasks. Again, this is a little bit of a blurry line because side task type is dependent on environment selection, but it’s worth considering what we actually want to learn when it comes time to do this project.

Update June 8, 2026: Bruce Lee had an interesting question to add to this project: “If you train a monitor to be better at monitoring, how does it affect its ability to be sneaky?” This seems like it would be directly relevant to untrusted monitoring. I’d give this a shot after training a monitor.

Agent Monitors

Monitors that take actions in the environment, especially in async monitoring setups and for diffuse attacks, sound useful. It feels like a massive handicap to limit monitors to just passively viewing actions. I think this experiment would compare agent monitors to simply spending equivalent test-time compute with a standard monitor.

Agent tasks like inspecting diffs, blacklisting tools, running tests, and setting honeypots are the lowest-hanging fruit. I think with current LLMs’ coding abilities, especially on narrow tasks, blacklisting tools and inspecting diffs are safe and reliable ways to restrict an untrusted model without much guidance. The obvious fear of running tests is that depending on what the monitor allows as input to the tests, it could execute unsafe code and achieve something the attacker wants. Honeypots may just be obvious to a stronger attacker, and the attacker may have plausible deniability. I would start with inspecting diffs and blacklisting tools in a setting where we can compare the effects of adding tools across many environments, like LinuxArena.

There may be room for more complex agent setups. Maybe accessing transcripts from other trajectories or other monitors could help the agent catch attackers. Ensembles where a non-agent monitor triggers an agent monitor at a threshold, or because an action isn’t clear, could limit cost. If we establish that letting an agent take actions helps prevent attacks, we could also track the relationship between attacker spend and monitor spend. There may be more room to train better monitors in agent settings rather than static settings, letting monitors learn from mistakes and improve at catching attacks.

I’m most excited for any project that validates an agent monitor performing better than a traditional monitor that spends similar test-time compute. This would set up later research on expanding agent monitor capabilities, research on monitor agent scaling, and research on stateful monitor agents, all of which I think could at least have narrow applications in some settings.

Honorable Mentions

Studying Distillation

Intuitively, the fact that we can train traits of one model into another seems like it is ripe for interesting safety results. While there isn’t a perfect science of when/how distilled traits reveal themselves in the distilled-to model, and Subliminal Learning shows that results are unpredictable, there is likely a lot of room for developing practical control measures with distillation. Redwood proposed that we can distill a powerful but potentially misaligned model into a less capable one, in the hopes that misalignment transfers but the ability to hide misalignment does not. This way, we can audit the distilled-to model for misalignment to incriminate the distilled-from model. Additionally, I wonder if a monitor distilled from a more capable model, possibly the untrusted model, would have an easier time monitoring the untrusted model. I assume there are many more ways to take advantage of distillation that I’m missing, but these two come to mind as the most obvious directions.

I’m quite excited about this direction, but I put it in honorable mentions because I am out of my depth here, and don’t have concrete proposals besides the two above.

Ensemble Monitoring

Learning about the transferability of different monitors may reveal the relative weaknesses and strengths of monitors in different situations. If a monitor is really good at catching a certain type of attack and we assume we aren’t able to easily train it to be good at catching another type of attack, we can just put that effort into adding another monitor. It may also be optimal to have a set of smaller models to save time or money, or because it’s easier to add monitors over time for new threats rather than train existing monitors. I don’t know how this works out in practice, but I’m sure there is some room for ensembling monitors in control protocols.

A project that tries to make ensembling decisions based on our current understanding of monitor transfer doesn’t seem to be practical right now. Some idea of the limitations on training an effective monitor would be sufficient as a starting point for an ensemble project.

Another version of ensemble monitoring that may not need more research before starting is Hierarchical Monitoring. In the paper, they solve for two monitors given an optimization problem to maximize safety under cost constraints. I’m unsure what to test after this, so maybe the next step is to build some basic Control Arena tooling and see if it improves cost experiments later on. For instance, I could see myself training monitors, tracking costs of a dual monitoring vs. single monitoring setup, and comparing the two cost-safety equilibria against each other to decide what to train next.

Some work combining probes with reasoning monitors has shown ensembles beat single-monitor baselines. One monitor with actions, one agent monitor, probes, self-report and so on could all culminate in a strong monitor. There may be a bitter lesson thing here where over time all the fancy scaffolding gets blown away by just using a stronger model with a basic prompt. On top of that, I believe ensemble monitors are already used in control measures at labs, so I’m not sure how many novel results there are in this direction. That being said, building the best possible monitor, analyzing different hierarchies for cost tradeoffs and trying to find an attack that threads the needle between monitors does sound like it would yield something useful.

Conclusion

In general I’m happy with this list. Looking back, I realize many of these questions could take the form of benchmarks. Benchmarks seem important, and I’m glad for work like this, but I hope to narrow some of these down to specific research agendas that solve problems.

I would like to understand more about how other people are thinking through what research to prioritize. I enjoyed Alex Mallen’s post on fitness-seeking AIs, because it seems to reorient control thinking to a very likely form of misalignment that we’ve already seen.

Again, I’m open to any suggestions, comments or critiques on this post!