And it seems to be because the training data is largely unofficial movie subtitles, which often have a string like "Translated by X" at the end of the movie, where the audio is often silent while the credits roll.
Looks like they used more official sources for German: there, silence is apparently hallucinated as "Untertitelung des ZDF für funk, 2017" (roughly "Subtitling by ZDF for funk, 2017"), according to one of the comments on the issue. That makes sense, as the public broadcasters' "Mediathek" is probably the largest freely available source of subtitled videos in Germany. I wonder whether ZDF gave its approval for it being used for LLM training, though?
mormegil|7 months ago
actionfromafar|7 months ago
The MPA must be so proud.
aprilthird2021|7 months ago
rob74|7 months ago
4gotunameagain|7 months ago
iqfareez|7 months ago
beshrkayali|7 months ago