endernac | 2 years ago
I agree with the other comments. You need to do a much better job of motivating and contextualizing the research problem, and of explaining your method in specific, precise language in the README and other documentation (preferably in the README itself). Make it clear that you are using GLUE and Big-Bench for evaluation, along with any other benchmarks you rely on. Be explicit about which LLMs and embedding models you have tested, and which datasets you trained and evaluated on. You should also add graphs and tables comparing your method's speed and evaluation performance against the SOTA.

I like the reference/overview section that shows the diagram (I think you should put it in the README to make it more visible to first-time viewers). However, the class descriptions are cryptic. For example, the Score class says "Evaluate the target with respect to the references." I had no idea what that meant, and I had to google some of the class names just to get an idea of what Score was trying to do. That's true for pretty much all the classes.

You also need to explain what the factory classes are and how they differ from the models classes. For example, why does the bocoel.models.adaptors class require a score and a corpus (per the overview), while factories.adaptor requires "GLUE", lm, and choices (judging from the code in examples/getting_started/__main__.py)? See the sketch at the end of this comment for what I mean. That said, I do like the fact that you include an example (although I haven't tried running it).
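To make the confusion concrete, here is a toy sketch of the two construction paths as I currently understand them. Every class, signature, and default below is a made-up stand-in for illustration, not bocoel's actual API:

```python
# Hypothetical stand-ins, NOT the real bocoel API. This only illustrates
# the two construction paths I think I am seeing in the docs.

class Score:
    """My reading of 'Evaluate the target with respect to the references':
    compare a model's output against a set of gold answers."""

    def __init__(self, references: list[str]) -> None:
        self.references = references

    def __call__(self, target: str) -> float:
        # Toy exact-match metric, purely for illustration.
        return float(target in self.references)


class Adaptor:
    """Path 1 (overview): a bocoel.models.adaptors-style class whose
    parts (a score and a corpus) are assembled by hand."""

    def __init__(self, score: Score, corpus: list[str]) -> None:
        self.score = score
        self.corpus = corpus


def adaptor_factory(benchmark: str, lm: object, choices: list[str]) -> Adaptor:
    """Path 2 (examples/getting_started/__main__.py): a factories.adaptor-style
    helper that takes "GLUE", lm, and choices. Is it simply assembling
    Path 1's parts internally? The docs should say."""
    # lm is unused in this toy; presumably the real library uses it to generate
    # the targets being scored.
    corpus = [f"{benchmark} sentence"]   # guess: corpus derived from the benchmark name
    score = Score(references=choices)    # guess: choices become the references
    return Adaptor(score=score, corpus=corpus)
```

If the factory is just a convenience wrapper that builds the score and corpus from a benchmark name, one sentence in the README saying so would clear this up.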
renchuw | 2 years ago
Not trying to make excuses, though. Your points are very valid and I will take them into account!