item 42475331

dimitry12 | 1 year ago

I believe this is a valid point: HF's replication indeed uses a larger off-the-shelf model as the verifier.

In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (the "solver").

boroboro4 | 1 year ago

Using a different 1B model as the verifier makes sense, yes. But using a Llama 8B fine-tune as the verifier, while comparing the inference-time-scaled 1B model against the 8B model, makes little sense to me.

Using a 3B model with an 8B verifier against a 70B model would make sense too. That said, their performance barely crossed the 70B line at 256 samples, which is 256 × (8 + 3) / 70 ≈ 40 times more computationally expensive than running the 70B model as-is.
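The estimate above can be sketched in a few lines. This assumes per-token inference cost scales roughly linearly with parameter count and that all models emit a comparable number of tokens per solution; the function name and numbers are just for illustration:

```python
def relative_cost(n_samples, solver_b, verifier_b, baseline_b):
    """Cost (in model-size units) of sampling n_samples solutions
    with a solver + verifier pair, relative to a single pass of a
    larger baseline model."""
    return n_samples * (solver_b + verifier_b) / baseline_b

# 256 samples from a 3B solver scored by an 8B verifier vs one 70B pass:
print(round(relative_cost(256, 3, 8, 70), 1))  # ~40.2x
```

This ignores search overhead and differing output lengths, so it is a lower bound on the actual cost ratio.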

dimitry12 | 1 year ago

"1B solver + 8B verifier + search" beating 0-shot 70B is nice, agree.

"1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!

zackangelo | 1 year ago

Where did you see that? I thought they used an 8B model for their reward model:

> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision