riotnrrd | 7 months ago
Even aside from the expense (which penalizes universities and smaller labs), I feel it's a bad idea to require academic research to compare itself to opaque commercial offerings. We have very little detail on what's really happening when, for example, OpenAI does inference. Their technology stack and model can change at any time, and users won't know unless they carefully re-benchmark ($$$) every time they use the model. I feel that academic journals should discourage comparisons to commercial models unless we have very precise information about the architecture, engineering stack, and training data they use.
tough | 7 months ago
You can totally evaluate these as GUIs, CLIs, and TUIs with more or fewer features and connectors.
Model quality is about benchmarks.
aider is great at publishing benchmarks for its users.
gemini-cli now tells you the percentage of correct tool calls at the end of a session.