top | item 39302690


binocarlos | 2 years ago

This is spot on. The thing we've not yet done is make it easy to import a repo's code (or several repos) and the associated metadata into a fine-tuning session.

> I often wonder how you'd go about organizing training data for a full historic github repo in a way that makes sense for training (or RAG)?

This is the hard part :-) But you are right - it would be intriguing to see what the output of a fine-tuned & RAG model would look like for this use case. We are currently experimenting with adding RAG alongside the fine-tuned model (so it's both, not either/or) to see if it produces better results.

I will make sure we take a look at the GitHub repo use case because it feels like that would be an interesting experiment to do!

disclaimer: I work on Helix


joshka | 2 years ago

Reading through the dataprep stuff, I wonder if doing more RAG during the prep stage might help this sort of task on structured data, e.g. pre-indexing related parts of the repo and using those to build summaries / QA pairs. I took a look at the current prompts, which are very research-focused ("professors" creating questions), and they could be extrapolated to a dev mindset nicely.
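The pre-indexing idea could be sketched roughly like this. Everything here is hypothetical (the chunk data, the similarity measure, the prompt wording are all illustrative, and the actual LLM call that would consume the prompt is left out); it just shows the shape of retrieving related chunks at prep time so each generated QA pair sees cross-file context rather than one chunk in isolation:

```python
# Hypothetical data-prep sketch: index repo chunks, retrieve related ones,
# and build a QA-generation prompt that includes that retrieved context.
# Uses a toy bag-of-words cosine similarity; a real pipeline would likely
# use embeddings and a vector store instead.
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer (placeholder for real code-aware chunking)."""
    return [t.lower() for t in text.split()]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    num = sum(count * b[tok] for tok, count in a.items())
    da = math.sqrt(sum(v * v for v in a.values()))
    db = math.sqrt(sum(v * v for v in b.values()))
    return num / (da * db) if da and db else 0.0


def related_chunks(chunks: list[str], target: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the target (the 'pre-index' step)."""
    vecs = [Counter(tokenize(c)) for c in chunks]
    tv = Counter(tokenize(target))
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(vecs[i], tv), reverse=True)
    return [chunks[i] for i in ranked[:k]]


def build_qa_prompt(chunk: str, context: list[str]) -> str:
    """Assemble a QA-generation prompt; the LLM call itself is not shown."""
    ctx = "\n---\n".join(context)
    return (f"Related context:\n{ctx}\n\n"
            f"Target:\n{chunk}\n\n"
            "Write question/answer pairs a developer might ask "
            "about the target, using the related context where helpful.")


# Illustrative chunks from an imaginary repo.
chunks = [
    "def parse_config(path): ...",
    "parse_config reads YAML settings used by the server",
    "def start_server(cfg): ...",
]
ctx = related_chunks(chunks, chunks[0], k=2)
prompt = build_qa_prompt(chunks[0], ctx)
```

The point of retrieving context before prompting is that the generated questions can span related parts of the codebase (e.g. a function plus the docs describing it) instead of being limited to whatever happened to land in a single chunk.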