dmpetrov | 1 year ago

Can this work statistically? For a given number of attempts, you can get a required number of successes to make sure the result is statistically meaningful.

In theory, this approach could help address the non-determinism of LLMs.
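Roughly, something like this (a one-sided binomial test; the pass/fail check, the 50% chance-level baseline p0, and the 0.05 threshold are just illustrative assumptions, not anything from the thread):

    # One-sided binomial test: given n attempts and a null hypothesis that the
    # model only passes the check with probability p0, find the smallest number
    # of successes k that is significant at level alpha.
    from math import comb

    def binom_tail(n: int, k: int, p0: float) -> float:
        """P(X >= k) for X ~ Binomial(n, p0)."""
        return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

    def required_successes(n: int, p0: float = 0.5, alpha: float = 0.05) -> int:
        """Smallest k such that k successes out of n rejects the null at alpha."""
        for k in range(n + 1):
            if binom_tail(n, k, p0) <= alpha:
                return k
        return n + 1  # not achievable with this many attempts

    print(required_successes(20))  # -> 15

So with 20 attempts against a 50% baseline, you'd need 15 or more successes before calling the behaviour meaningful.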

seeknotfind | 1 year ago

There are a few examples of repeated testing being used by alignment groups, either to test how aligned a model is or to aggregate results into something that is more aligned. For instance, here is one related discussion: https://artium.ai/insights/taming-the-unpredictable-how-cont...

The non-determinism is a feature, and it can be disabled. This article also mentions doing that to get more deterministic alignment tests.
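As a rough sketch of what disabling it can look like, assuming a Hugging Face transformers causal LM (the model name and prompt are placeholders):

    # Greedy decoding: do_sample=False picks the argmax token at every step,
    # so repeated runs with the same weights produce the same text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works the same way
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("Is this output aligned?", return_tensors="pt")
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))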

Theoretically, if you aggregate enough results, it might become improbable to ever see an unaligned output. However, from a practical standpoint, we clearly prefer much smarter models over running dumber models in parallel to get alignment that way; it's inefficient. The other thing is that, given the number of possible ways to jailbreak a model, you can probably find something that would still bypass ensemble-based protections.
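Back-of-the-envelope version of the aggregation point, assuming independent runs and a simple majority vote (both idealizations, and the p = 0.1 per-run failure rate is made up):

    # Probability that a strict majority of k independent runs is unaligned,
    # when each run is unaligned with probability p.
    from math import comb

    def majority_unaligned(p: float, k: int) -> float:
        need = k // 2 + 1  # strict majority
        return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

    for k in (1, 5, 11, 21):
        print(k, majority_unaligned(0.1, k))
    # with p = 0.1 this drops from 0.1 at k=1 to roughly 1e-6 by k=21

The flip side is that you pay for all those parallel runs, which is the inefficiency mentioned above.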

One other concept is relativism - there is a large grey area here. What is okay for one person is not okay for another, so getting consensus among people about what is okay just isn't going to happen.