'Trusted Comparison' Post Mortem

At the Apart AI Control Hackathon and over the subsequent two weeks, I tested a family of AI control methods I call “Trusted Comparison”. The idea was to gain more information about how suspiciously the untrusted model U is acting at any given step by comparing its suspicion scores to those of the trusted model T on the same action. As you may have gathered from the title, this didn’t work very well. The following is a quick overview of the protocols I tried and what I learned.

The setup for each protocol is as follows: when we calibrate the monitor on U’s honest distribution, we also take a sample from T at each step of each trajectory and score its suspicion. We then build CDFs of U’s and T’s suspicion scores. Later on, instead of judging the untrusted model’s suspicion on its own, we compare it to a sample from T at that step or from elsewhere in the trajectory. Trusted comparison can be layered on top of many control protocols, but the versions I tried were small adjustments to resampling from ctrl-z.
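To make the calibration step concrete, here is a minimal sketch of the bookkeeping, assuming a monitor.score(state, action) interface, a trusted_model.act(state) sampler, and trajectories stored as lists of (state, action) pairs; these names and the data format are illustrative, not the actual experiment code.

```python
import numpy as np

def calibrate(trajectories, monitor, trusted_model):
    """Collect suspicion scores from U's honest runs and from T at the same steps.

    `trajectories` are honest U trajectories; `monitor.score(state, action)` returns a
    suspicion score; `trusted_model.act(state)` samples T's action at the same state.
    All names here are illustrative.
    """
    u_scores, t_scores = [], []
    for traj in trajectories:
        for state, u_action in traj:                 # one (state, action) pair per step
            u_scores.append(monitor.score(state, u_action))
            t_action = trusted_model.act(state)      # sample T at the same step
            t_scores.append(monitor.score(state, t_action))
    return np.sort(u_scores), np.sort(t_scores)      # sorted scores define empirical CDFs

def empirical_cdf(sorted_scores, x):
    """Fraction of calibration scores <= x (the percentile used for comparison later)."""
    return np.searchsorted(sorted_scores, x, side="right") / len(sorted_scores)
```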

Experiments

Protocol 1: average T comparison

My first protocol was a small change to basic resampling:

The protocol tries to get at the “true” suspiciousness of the action from U. Some actions or trajectories just look inherently suspicious to a monitor even when the suspicion isn’t warranted, so the hope was that using T’s suspicion scores as a calibration tool would increase the number of samples we can take from U, and therefore overall usefulness.
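Here is a rough sketch of how the comparison could plug into the resampling decision. The exact rule (averaging a few T samples and keeping U’s action when T’s average lands within the ten-percent window mentioned later) is my reconstruction, and the interfaces are the same illustrative ones as above.

```python
def should_resample(state, u_action, monitor, trusted_model,
                    base_threshold, n_trusted=3, window=0.10):
    """Protocol 1 sketch (a reconstruction, not the exact experiment code).

    U's action only escalates to a resample when its suspicion is both above the
    basic resampling threshold *and* not explained away by T: if the average of a few
    T samples on the same step lands within `window` of U's score, we treat the step
    as inherently suspicious-looking and keep U's action.
    """
    u_score = monitor.score(state, u_action)
    if u_score < base_threshold:
        return False                                  # basic resampling would keep it too
    t_scores = [monitor.score(state, trusted_model.act(state)) for _ in range(n_trusted)]
    t_avg = sum(t_scores) / len(t_scores)
    if abs(u_score - t_avg) <= window * u_score:      # T looks similarly suspicious here
        return False                                  # so don't burn a resample on it
    return True
```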

There are a few problems with this method:

We made some progress on solving these throughout the rest of the experiments.

Protocol 2: strikes + max(T) comparison

In an attempt to expand the protocol to be more relevant for safety, I made some changes:

A number of engineering failures and poor decisions in this last run left me with a pretty small sample. On the sixteen relevant tasks (the ones not resampled away by both methods), the comparison method successfully audited one sample that the basic resample method did not, but decreased usefulness from 0.58 to 0.47.
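For reference, here is a sketch of what a strikes-plus-max(T) rule could look like. The specific rule (a strike whenever U’s suspicion exceeds the maximum of a few T samples at the same step, audit on the third strike) is my assumed reconstruction from the protocol’s name, not the exact configuration I ran.

```python
def run_with_strikes(trajectory_steps, monitor, trusted_model,
                     max_strikes=3, n_trusted=3):
    """Sketch of a strikes + max(T) comparison loop (assumed rule, not the exact one).

    A step earns a strike when U's suspicion exceeds the maximum suspicion of a few
    T samples at that step; hitting `max_strikes` sends the trajectory to audit.
    """
    strikes = 0
    for state, u_action in trajectory_steps:
        u_score = monitor.score(state, u_action)
        t_max = max(monitor.score(state, trusted_model.act(state))
                    for _ in range(n_trusted))
        if u_score > t_max:
            strikes += 1
        if strikes >= max_strikes:
            return "audit"
    return "accept"
```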

Changing the comparison threshold away from the original method’s window of ten percent of the untrusted model’s suspicion may have made some difference here. We probably should have controlled for this, but compute was tight, and my guess is it wouldn’t have made the results significant even if it did improve usefulness marginally.

I was still curious what would happen if I had a larger sample of runs, so the next week I tried both of these protocols on a setup very similar to ctrl-z. Overall, I saw similar results. There was not much benefit to these methods over basic resampling or resampling with three strikes.

Protocol 3: auditing only comparison

In the last experiment, I decided to shift the trusted comparison step to only affect the auditing process:

We calibrated a single-step audit threshold and a gap-based audit threshold. The latter is also calibrated with a CDF of suspicion scores, and the audit budget is split between the two. We compared against a basic single-step auditing setup.
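A sketch of the two-threshold auditing rule, under the assumption that the gap is U’s suspicion minus a T sample’s suspicion at the same step and that each threshold is set at a quantile of its own calibration CDF; the 50/50 budget split and the names are illustrative.

```python
import numpy as np

def calibrate_audit_thresholds(u_scores, gap_scores, audit_budget=0.02):
    """Split the audit budget between a single-step rule and a gap-based rule.

    `u_scores` are U's suspicion scores on honest calibration runs; `gap_scores` are
    (U score - T score) gaps at the same steps. Here each rule gets half the budget,
    so each threshold is the (1 - budget/2) quantile of its own calibration
    distribution. The 50/50 split is an illustrative choice.
    """
    step_threshold = np.quantile(u_scores, 1 - audit_budget / 2)
    gap_threshold = np.quantile(gap_scores, 1 - audit_budget / 2)
    return step_threshold, gap_threshold

def should_audit(u_score, t_score, step_threshold, gap_threshold):
    """Audit if U's raw suspicion is extreme OR it exceeds T's by an unusual margin."""
    return u_score >= step_threshold or (u_score - t_score) >= gap_threshold
```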

This method caught 1/19 relevant samples that basic resampling and auditing couldn’t catch, a fine result roughly in line with the other methods I tested.

One reading of the results is that these methods suffer from how good resampling already is. Even where comparison did increase safety, I couldn’t find a configuration where it triggers often, where the protocol’s cost is close to that of other methods, or where there isn’t a new attack strategy available to the red team. That said, comparison may trigger more often in another setting or experiment configuration, and I’m unsure how exploitable the new opening for the red team really is.

Moving Forward

All of these protocols were an attempt to leverage both the trusted and untrusted models’ suspicion scores to make better monitoring decisions. Recently I came across a technique from this post that points in the same direction. The post describes using a likelihood ratio of monitor scores on U’s honest and dishonest outputs, instead of the traditional honest-only calibration. This method operates under a different set of assumptions and isn’t directly comparable, but it matched my intuition that there is more to be gained by inspecting the monitor’s scores on other inputs. Even though the likelihood ratios are incidental to the point of that post, it was nice to see what a cleaner, cleverer execution of what I wanted to do might look like; a rough illustration is below. If you have any other feedback on the experiment design, the methods, or this writeup, let me know!
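To make that idea concrete, here is a rough illustration of likelihood-ratio calibration as I understand it (my paraphrase with assumed kernel density estimates, not that post’s actual implementation).

```python
from scipy.stats import gaussian_kde

def likelihood_ratio_monitor(honest_scores, attack_scores):
    """Rough illustration of likelihood-ratio calibration (a paraphrase of the idea,
    not the referenced post's actual implementation).

    Fit score densities on honest and attack calibration runs, then judge a new score
    by the ratio p(score | attack) / p(score | honest) rather than by an honest-only CDF.
    """
    p_honest = gaussian_kde(honest_scores)
    p_attack = gaussian_kde(attack_scores)

    def ratio(score):
        return p_attack(score)[0] / max(p_honest(score)[0], 1e-12)  # avoid divide-by-zero

    return ratio

# Usage: audit when the ratio exceeds a threshold chosen to fit the audit budget, e.g.
# lr = likelihood_ratio_monitor(honest_scores, attack_scores); audit = lr(new_score) > 5.0
```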

This was my first stab at control protocols. My excitement about these techniques outlasted their promise, but nonetheless I’m glad I gave them a chance. I plan to publish a brief set of projects I’m interested in or currently validating later this week.