The idea I tried to express was purely the loss-function point you mentioned: both framings of the task (1 vs 2 vs n) lead to identical training runs, at least with nanoGPT. I don't know whether that extrapolates to current LLM internals and current training setups.
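If I'm reading the framing right, the reason the runs come out identical is that with teacher forcing every position contributes one cross-entropy term, so however you group those positions into "tasks", the total loss (and hence the gradient) is the same. A minimal sketch of that, in plain NumPy as a stand-in for what nanoGPT computes (all shapes and the even/odd split here are illustrative, not nanoGPT's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, V = 2, 8, 50                      # batch, sequence length, vocab size
logits = rng.normal(size=(B, T, V))     # stand-in for model output
targets = rng.integers(0, V, size=(B, T))  # next-token targets

def token_nll(logit_row, target):
    # negative log-softmax probability of the target token
    z = logit_row - logit_row.max()
    return -(z[target] - np.log(np.exp(z).sum()))

per_token = np.array([[token_nll(logits[b, t], targets[b, t])
                       for t in range(T)] for b in range(B)])

# Framing A: one "predict the next token" loss averaged over every position.
loss_all = per_token.mean()

# Framing B: pretend even and odd offsets are two separate "tasks",
# sum each task's loss, then renormalize over the same token count.
loss_split = (per_token[:, 0::2].sum() + per_token[:, 1::2].sum()) / (B * T)

print(np.allclose(loss_all, loss_split))  # True: same loss, same gradients
```

Since the two framings produce the exact same scalar loss over the same parameters, the optimizer sees identical gradients and the training runs can't diverge.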