It's fairly easy to pay OpenAI or Mistral money to use their APIs.
Figuring out how Google Cloud Vertex works and how it's billed is more complicated, and Azure and AWS are similarly complex to use for this.
Could Google Cloud please provide an OpenAI compatible API and service?
I know it's a different department. But it'd make using your models way easier.
It often feels like Google Cloud has had no UX or end-user testing done on it at all (not true of aistudio.google.com, which is better than before, for sure!).
Gemini models on Vertex AI can be called via a preview OpenAI-compatible endpoint [1], but shoving it into existing tooling, where you don't have programmatic control over the API key and the key is expected to be long-lived, is non-trivial, because GCP uses short-lived access tokens (and long-lived ones are not great security-wise).
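The token-lifetime problem can be worked around when you control the client: re-mint the token just before it expires. A minimal sketch, assuming the google-auth and openai packages with Application Default Credentials configured; `PROJECT` is a placeholder, and the base URL follows the documented shape of the Vertex AI OpenAI-compatible endpoint:

```python
import datetime


def fresh_token(credentials, request, margin_s=300.0):
    """Return an access token still valid for at least `margin_s` seconds,
    refreshing the credentials first if needed. `credentials` is anything
    exposing the google-auth Credentials interface (.valid, .expiry, .token,
    .refresh()); google-auth stores .expiry as a naive UTC datetime."""
    now = datetime.datetime.utcnow()
    expiring_soon = (
        credentials.expiry is not None
        and (credentials.expiry - now).total_seconds() < margin_s
    )
    if not credentials.valid or expiring_soon:
        credentials.refresh(request)
    return credentials.token


if __name__ == "__main__":
    # Assumes google-auth and openai are installed and Application Default
    # Credentials are configured; PROJECT is a placeholder project id.
    import google.auth
    import google.auth.transport.requests
    from openai import OpenAI

    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    transport = google.auth.transport.requests.Request()
    client = OpenAI(
        base_url="https://us-central1-aiplatform.googleapis.com/v1beta1/"
                 "projects/PROJECT/locations/us-central1/endpoints/openapi",
        api_key=fresh_token(creds, transport),  # short-lived: re-mint per use
    )
```

The catch described above remains: tooling that only accepts a static key string has nowhere to run `fresh_token` again once the token expires.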
Billing for the Gemini models on Vertex AI (the Generative Language API variant still charges by tokens) is, I would argue, simpler than every other provider's, because you're charged per character, image, video-second, or audio-second. You don't need to run a tokenizer (if one is even available; cough, Claude 3 and Gemini), figure out the chat template to calculate the token cost per message [2], or work out how to count tokens for an image [3], just to get a cost estimate before actually submitting the request and getting usage info back.
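To make the contrast concrete, here is what a pre-flight cost estimate looks like under character billing: plain `len()`, no tokenizer or chat template. The prices below are made-up placeholders, not real Vertex AI rates:

```python
# Hypothetical per-1k-character prices -- placeholders, not real rates.
PRICE_PER_1K_INPUT_CHARS = 0.000125
PRICE_PER_1K_OUTPUT_CHARS = 0.000375


def estimate_cost(prompt, expected_output_chars):
    """Estimate a text request's cost from raw character counts alone."""
    input_cost = len(prompt) / 1000 * PRICE_PER_1K_INPUT_CHARS
    output_cost = expected_output_chars / 1000 * PRICE_PER_1K_OUTPUT_CHARS
    return input_cost + output_cost
```

Doing the same for a token-billed API means shipping the provider's tokenizer (when one exists) and replicating its chat template before you can even approximate the bill.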
If you're an individual developer and not an enterprise, just go straight to Google AI Studio and the Gemini API instead: https://aistudio.google.com/app/apikey. It's dead simple to get an API key and call it with a REST client.
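For a sense of how little ceremony that involves, here is a sketch of a generateContent call using the requests library. The URL follows the public REST shape of the Gemini API; the model name is one example, and `GEMINI_API_KEY` is assumed to hold the key obtained from the page above:

```python
def build_request(prompt, model="gemini-1.5-flash"):
    """Build the URL and JSON body for a Gemini API generateContent call."""
    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, body


if __name__ == "__main__":
    import os
    import requests  # assumes the requests package is installed

    url, body = build_request("Say hello in one word.")
    resp = requests.post(
        url,
        params={"key": os.environ["GEMINI_API_KEY"]},  # key from AI Studio
        json=body,
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])
```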
I have to agree with all of this. I tried switching to Gemini, but the lack of clear billing/quotas, horrible documentation, and even poor implementation of status codes on failed requests have led me to stick with OpenAI.
I don't know who writes Google's documentation or does the copyediting for their console, but it is hard to adapt to. I have spent hours troubleshooting, only to find out that the documentation was referring to the same thing by two different names. Also, it's 2024; I shouldn't be seeing Python 2-style print statements without parentheses.
I plan on downloading a Q5 or Q6 quant of the 27B for my 3090 once someone puts quants on HF, loading it in LM Studio, and starting the API server so I can call it from my OpenAI-API-based scripts. Hopefully it's better at code gen than Llama 3 8B.
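The last step of that plan is essentially a base-URL swap, since LM Studio's local server speaks the OpenAI chat format (it defaults to port 1234). A sketch, with the model name as a placeholder for whichever GGUF ends up loaded:

```python
def lmstudio_client_kwargs(port=1234):
    """Connection settings for a local OpenAI-compatible server."""
    return {
        "base_url": f"http://localhost:{port}/v1",
        "api_key": "lm-studio",  # the local server ignores it; the client requires one
    }


if __name__ == "__main__":
    from openai import OpenAI  # assumes the openai package is installed

    client = OpenAI(**lmstudio_client_kwargs())
    reply = client.chat.completions.create(
        model="gemma-2-27b-it",  # placeholder: whatever model is loaded
        messages=[{"role": "user", "content": "Write a hello-world in Python."}],
    )
    print(reply.choices[0].message.content)
```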
Why is AI Studio not available in Ukraine? I have no problem using the Gemini web UI or other LLM providers from Ukraine, but this Google API restriction is strange.
The 4k sliding window context seems like a controversial choice after Mistral 7B mostly failed at showing any benefits from it. What was the rationale behind that instead of just going for full 8k or 16k?
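For readers unfamiliar with the term: under a sliding window of size w, a token attends only to the previous w positions rather than the whole prefix. A toy illustration of the masking rule (not any model's actual implementation; Gemma 2 in particular mixes local and global attention layers):

```python
def can_attend(i, j, window=None):
    """True if query position i may attend to key position j
    under causal attention, optionally with a sliding window."""
    if j > i:
        return False        # causal: never attend to future positions
    if window is None:
        return True         # full attention: the entire prefix is visible
    return i - j < window   # sliding window: only the last `window` keys


def visible_positions(i, seq_len, window=None):
    """All key positions that query position i can see."""
    return [j for j in range(seq_len) if can_attend(i, j, window)]
```

With window=4096, position 10000 sees only positions 5905 through 10000; anything earlier has to propagate indirectly through intermediate layers, which is the crux of the question above.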
Given the goal of mitigating self-proliferation risks, have you observed a decrease in the model's ability to do things like help a user set up a local LLM with local or cloud software?
How much of this comes from pre-training dataset changes, and how much from tuning?
How do you think about this problem, and how do you solve it?
The literature has identified self-proliferation as a dangerous capability of models, and details of how to define it, along with examples of the forms it can take, have been openly discussed by GDM (https://arxiv.org/pdf/2403.13793).
The current Gemma 2 models' success rate on end-to-end challenges is zero (0 out of 10), so their capability to perform such tasks is currently limited.
I think it makes sense to compare models trained with the same recipe on token count - usually more tokens will give you a better model.
However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. There are many other variables at play - the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc. that will have an influence on training efficiency.
luke-stanley|1 year ago
Deathmax|1 year ago
[1]: https://cloud.google.com/vertex-ai/generative-ai/docs/multim...
[2]: https://platform.openai.com/docs/guides/text-generation/mana...
[3]: https://platform.openai.com/docs/guides/vision/calculating-c...
ankeshanand|1 year ago
bapcon|1 year ago
hnuser123456|1 year ago
alekandreev|1 year ago
unknown|1 year ago
[deleted]
canyon289|1 year ago
You can try the 27B at aistudio.google.com. Send in your favorite prompts, and we hope you like the responses.
dandanua|1 year ago
jpcapdevila|1 year ago
austinvhuang|1 year ago
https://github.com/google/gemma.cpp/pull/274
moffkalast|1 year ago
alekandreev|1 year ago
causal|1 year ago
The Google API models support 1M+ tokens, but these are just 8K. Is that due to a fundamental architecture difference, the training set, or something else?
coreypreston|1 year ago
luke-stanley|1 year ago
Seems tricky to me.
alekandreev|1 year ago
luke-stanley|1 year ago
[deleted]
WhitneyLand|1 year ago
Is this a contradiction or am I misunderstanding something?
Btw, overall this is very impressive work. Great job.
alekandreev|1 year ago
luke-stanley|1 year ago
luke-stanley|1 year ago
Still no 27B 4-bit GGUF quants on HF yet!
I'm monitoring this search: https://huggingface.co/models?library=gguf&sort=trending&sea...
XzAeRosho|1 year ago
chown|1 year ago
https://msty.app
zerojames|1 year ago
alekandreev|1 year ago
np_space|1 year ago
zone411|1 year ago
kristianpaul|1 year ago