I don't have any test sets because I haven't trained any model from scratch. I've built a simple RAG system, and my validation comes directly from users: whether or not they find the answer useful.
The real value of these tools is in the validation, and I don't mean just face validity. User feedback is only face validity.
If you were a doctor and needed to make a real treatment decision for a real patient, would you use this tool without checking the answer thoroughly, reading the literature yourself, and making sure it didn't miss any relevant sources? If no, then you might as well skip the tool and do the work yourself. If yes, then you need to know for certain that the answer is correct.
And I don't think it matters if you trained the model yourself. You validate the tool as a whole.
The problem with using user feedback as validation is that users ask questions they don't know the answer to, so they can't judge whether an answer is correct. What you need is a gold standard to validate against.
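To make "validate against a gold standard" concrete, here is a minimal sketch in Python. It assumes you have a set of questions for which an expert has listed the sources that must be found, and it scores the tool on how many of those it actually retrieves. The names `GoldItem`, `source_recall`, `evaluate`, and the `retrieve` callable are all hypothetical, not part of any particular RAG library, and source recall is only one of several metrics you might want.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GoldItem:
    """One gold-standard entry (hypothetical structure)."""
    question: str
    relevant_sources: set[str]  # sources an expert says must be found


def source_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of expert-identified sources that the tool retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(retrieved & relevant) / len(relevant)


def evaluate(gold: list[GoldItem],
             retrieve: Callable[[str], Iterable[str]]) -> float:
    """Mean source recall of `retrieve` over the whole gold set."""
    scores = [
        source_recall(set(retrieve(item.question)), item.relevant_sources)
        for item in gold
    ]
    return sum(scores) / len(scores)


# Toy usage: a fake retriever standing in for the real tool.
gold = [
    GoldItem("q1", {"paper_a", "paper_b"}),
    GoldItem("q2", {"paper_c"}),
]
fake_index = {"q1": ["paper_a", "paper_x"], "q2": ["paper_c"]}
print(evaluate(gold, lambda q: fake_index[q]))  # 0.75
```

The point is that the score comes from answers known in advance, not from whether the person asking liked what they got.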
arnok|8 months ago