iagooar | 25 days ago
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
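One rough way to flag this kind of script mixing in a transcript is to check which Unicode script each letter belongs to. This is a minimal sketch using only Python's standard `unicodedata` module. The helper name `scripts_in` is made up for illustration and is not part of any transcription tool. It assumes the first word of a character's Unicode name identifies its script, which holds for Latin and Cyrillic letters.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script names found among the letters of `text`.

    Assumption: the first word of a letter's Unicode name is its script,
    e.g. "LATIN SMALL LETTER O WITH ACUTE" or "CYRILLIC SMALL LETTER TSE".
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


sample = "A цьому nie mówisz po polsku"
print(sorted(scripts_in(sample)))  # ['CYRILLIC', 'LATIN']
```

A line that is supposed to be Polish but reports more than `LATIN` is a cheap signal that the model drifted into another language.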
loire280|25 days ago
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.
I wonder how much having languages with the same roots (e.g. the Romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?
MarcelOlsz|25 days ago
edit: I stand corrected lol. I'll go with "Gaelic" instead.
_ache_|25 days ago
I guess a European version could be created, but for now it's aimed at worldwide distribution.
sbinnee|25 days ago
lm28469|25 days ago
Try sticking to the supported languages
tdb7893|25 days ago
ricardonunez|25 days ago
yko|25 days ago
overfeed|25 days ago
The base model was likely pretrained on data that included Polish and Ukrainian. You shouldn't be surprised to learn it doesn't perform great on languages it wasn't trained on, or that perhaps didn't have the highest share of the training data.
scotty79|25 days ago
Cthulhu_|25 days ago
Add poor microphone quality (using a laptop to broadcast a presentation to a room audience isn't very good) and you get a perfect storm of untranscribable presentations or meetings.
All I want from e.g. Teams is a good transcript and, more importantly, a clever summary. Because when you think about it, imagine all the words spoken in a meeting and write them down - that's pages and pages of content that nobody would want to read in full.
moffkalast|25 days ago
mystifyingpoi|25 days ago
DaedalusII|25 days ago
iagooar|25 days ago
Polish works with the Latin alphabet just fine.
"Do kraju tego, gdzie kruszynę chleba podnoszą z ziemi przez uszanowanie dla darów Nieba.... Tęskno mi, Panie..."
"Mimozami jesień się zaczyna, złotawa, krucha i miła. To ty, to ty jesteś ta dziewczyna, która do mnie na ulicę wychodziła."
viraptor|25 days ago
That's not the case. Polish uses a Latin-based alphabet due to Czech influence and German printers.