It should work with any type of model. Longer chains of thought will naturally be harder for the evaluation model to analyse, because there are many more reasoning steps to identify and separate. The quality of the insights also depends heavily on the model you choose as the evaluator. We tested with Llama3-70B, and it worked smoothly most of the time.