
Yi: Open Foundation Models by 01.AI

205 points | pama | 2 years ago | arxiv.org | reply

81 comments

[+] helsinkiandrew|2 years ago|reply
The GitHub repository gives a better introduction/how-to:

https://github.com/01-ai/yi

> Yi-34B-Chat model landed in second place (following GPT-4 Turbo), outperforming other LLMs (such as GPT-4, Mixtral, Claude) on the AlpacaEval Leaderboard (based on data available up to January 2024).

> Yi-34B model ranked first among all existing open-source models (such as Falcon-180B, Llama-70B, Claude) in both English and Chinese on various benchmarks, including Hugging Face Open LLM Leaderboard (pre-trained) and C-Eval (based on data available up to November 2023).

[+] oersted|2 years ago|reply
Looking at the leaderboard shows a clearer picture: https://tatsu-lab.github.io/alpaca_eval/

- GPT-4-Turbo: 50.00%

- Snorkel (current 2nd, Mistral 7B fine-tune): 34.86%

- Yi 34B Chat (current 6th): 29.66%

- GPT-4: 23.58%

Thoughts:

- Just saying that it came 2nd is quite misleading; the difference in score is significant.

- Not sure what's up with this benchmark; I've never seen GPT-4-Turbo and GPT-4 perform so differently.

- The Snorkel model is impressive with just 7B parameters. The Yi authors claim that their success is based on good training data cleaning. This seems to be key at least for this benchmark. Snorkel has also always been all about that, using programmatic methods to generate lots of quality training data.

[+] WhitneyLand|2 years ago|reply
It’s been ~1 year since GPT-4 was released.

It’s hard to guess how long it will be before any flavor of an “open” model can, by consensus, match what was released in 2023, let alone exceed it.

A big part of the race seems like it will depend on how high GPT-5 can raise the bar. If it’s only incrementally better, things may converge quickly.

[+] nickpsecurity|2 years ago|reply
Anytime you see that, we should assume the newer models might have been trained on either the benchmarks themselves or something similar to them. If I were an evaluator, I’d keep a secret pile of tests that I know aren’t in any LLM’s training data, do the evaluations privately, and not publish the scores either. Just the rank, plus how far apart they are.

The best tests of these models are people who want to use AI to solve real problems attempting to do that with various models. If they work, report that they worked. Also, publish the work and result pairs permissively when possible to evaluate that and use it for fine-tuning, too.
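To make the publish-rank-and-gaps idea concrete, here is a minimal sketch; the model names and scores are made-up placeholders, not real benchmark results:

    # Keep raw private-benchmark scores secret; publish only the ranking
    # and the score gap between adjacent models. Scores are placeholders.
    private_scores = {"model-a": 71.2, "model-b": 64.8, "model-c": 63.9}

    ranked = sorted(private_scores.items(), key=lambda kv: kv[1], reverse=True)
    for rank, (name, score) in enumerate(ranked, start=1):
        gap = 0.0 if rank == 1 else ranked[rank - 2][1] - score
        print(f"{rank}. {name} (gap to previous: {gap:.1f} points)")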

[+] orra|2 years ago|reply
The repo source code is Apache 2.0 licensed, but the weights are not.

Just in case anybody else gets excited and then misled by their tagline "Building the Next Generation of Open-Source and Bilingual LLMs".

[+] mmastrac|2 years ago|reply
The model license, excerpts:

https://github.com/01-ai/Yi/blob/main/MODEL_LICENSE_AGREEMEN...

1) Your use of the Yi Series Models must comply with the Laws and Regulations as well as applicable legal requirements of other countries/regions, and respect social ethics and moral standards, including but not limited to, not using the Yi Series Models for purposes prohibited by Laws and Regulations as well as applicable legal requirements of other countries/regions, such as harming national security, promoting terrorism, extremism, inciting ethnic or racial hatred, discrimination, violence, or pornography, and spreading false harmful information.

2) You shall not, for military or unlawful purposes or in ways not allowed by Laws and Regulations as well as applicable legal requirements of other countries/regions, a) use, copy or Distribute the Yi Series Models, or b) create complete or partial Derivatives of the Yi Series Models.

“Laws and Regulations” refers to the laws and administrative regulations of the mainland of the People's Republic of China (for the purposes of this Agreement only, excluding Hong Kong, Macau, and Taiwan).

[+] echelon|2 years ago|reply
Weights are trained on copyrighted data. I think that ethically, weights should be public domain unless all of the data [1] is owned or licensed by the training entity.

I'm hopeful that this is where copyright law lands. It seems like this might be the disposition of the regulators, but we'll have to wait and see.

In the meantime, maybe you should build your product in this way anyway and fight for the law when you succeed. I don't think a Chinese tech company is going to find success in battling a US startup in court. (I would also treat domestic companies with model licenses the same way, though the outcome could be more of a toss up.)

"Break the rules."

"Fake it until you make it."

Both idioms seem highly applicable here.

[1] I think this should be a viral condition. Finetuning on a foundational model that incorporates vast copyrighted data should mean downstream training also becomes public domain.

[+] mg|2 years ago|reply
Hmm.. it fails for my favorite test prompt:

https://www.gnod.com/search/ai#q=Two%20cars%20have%20a%20100...

I gave it 3 tries and each time, Yi picked one of the cars as the winner.

I've been watching for many months now how LLMs have gotten better and better at solving it. Many still struggle with it, but the top ones nowadays mostly get it right.

[+] AndrewKemendo|2 years ago|reply
I asked my 12 year old son to solve this prompt.

His answer was "Neither win" and it took him 1 minute and 24 sec using no pre-defined algorithm or heuristic.

He said his thought process was:

"I figured it would take 10 hours for car A to finish 100 miles and it would take twice that long for car B. Since Car B is already halfway there when car A starts, then they would arrive together"

I, as a 40-year-old man, approached it intentionally naively (e.g., I did not go looking for an optimal solver first) by making a drawing and attempting to derive the algorithm. It took me ~3 minutes to come to the same conclusion, but at the end I had a series of equations, not algebraic proofs.[1]

So now you have a human child reference metric if you want it.

[1] https://twitter.com/AndrewKemendo/status/1766872572300235022
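For what it's worth, the same arithmetic as a few lines of Python (the race parameters are as restated in this thread; the full prompt is behind the truncated link above):

    # A 100-mile race: car A does 10 mph, car B does 5 mph with a
    # 10-hour head start (parameters as restated in the thread).
    distance = 100                 # miles
    speed_a, speed_b = 10, 5       # mph
    head_start = 10                # hours car B drives before A starts

    time_a = distance / speed_a                # hours after A starts: 10.0
    time_b = distance / speed_b - head_start   # hours after A starts: 10.0
    print("tie" if time_a == time_b else ("car A" if time_a < time_b else "car B"))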

[+] mattstir|2 years ago|reply
Interestingly, GPT-4 also fails to correctly solve this prompt, choosing car A each time after multiple tries for me. I tend to find that models struggle with such logic puzzles when using less common phrasing (e.g., two cars "having" a race instead of participating in one, "headstart" instead of "head-start", etc).

GPT-4 correctly solved the problem when it was reworded to: "There is a 100 mile race with two participants: car A and car B. Car A travels at 10 miles per hour but does not begin driving immediately. Car B travels at 5 miles per hour and is given a 10 hour head-start. After 10 hours, car A begins to move as well. Who wins the race?"

[+] appplication|2 years ago|reply
On one hand, I don’t really understand why anyone would expect an LLM to solve logic puzzles. The only way it can do so is not through reasoning, but by having been trained on a structurally similar puzzle.

On the other hand, it does feel fun that the top ones appear to solve it, and I understand why it feels cool to have a computer that appears capable of solving these puzzles. But really, I think this is just specificity in training. There is no theoretical or empirical basis for LLMs having any reasoning capability. The only reason a model can solve it is that the creators of these top models specifically trained them on problems like this to give the appearance of intelligence.

[+] theptip|2 years ago|reply
“01.ai” is not a very auspicious name; 01 was the first AI nation that eventually waged war with humanity and then enslaved them in The Matrix.
[+] riku_iki|2 years ago|reply
> that eventually waged war with humanity

I think humanity waged war on 01

[+] acjohnson55|2 years ago|reply
I, for one, welcome our digital overlords.
[+] jacobn|2 years ago|reply
“we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts”

Data work is rarely sexy, but (almost) always useful.

Did they release the corpus?

[+] gwern|2 years ago|reply
They did not, in part because it would reveal the data-filtering routines (particularly the political censorship - Chinese LLM papers sometimes mention the ban list but never reveal it), and also in part because it might reveal things they'd rather keep secret.

For example, Bytedance has already been caught using the OA API to generate data for their models because they are having such a hard time catching up to OA - and evading bans for doing that, and also instructing employees on how to lie & cover it up: https://www.theverge.com/2023/12/15/24003151/bytedance-china...

Do you think that a small Chinese startup like 01.AI, which by their own admission had to "bet the farm" to buy enough GPUs to train the Yi models at all https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-le... , and which was completely silent about cloning the American LLaMA architecture until people analyzed the released checkpoints and noticed it looked awfully familiar, is going to be above such tactics...? In this economic/geopolitical context? Especially when everyone seems to be doing it, not just Bytedance?* (01.AI claims that, the architecture aside, they didn't simply further train LLaMA models but trained from scratch. You can decide for yourself how much you are willing to believe this.) I wouldn't bet a lot of money on it, and that's why I don't expect to see any large comprehensive data releases from 01.AI for the Yi models.

* This is one of my theories for why so many disparate models by so many different groups all seem to weirdly converge on the same failure modes like 'write a non-rhyming poem', and why GPT-3.5, and then GPT-4, seemed to be oddly difficult to surpass, as if there were some magnetic force which made reaching near 3.5/4 quality easy for 'independent' models, but then surpassing somehow difficult. Everyone is lying or mistaken about 3.5/4 data getting into their corpus, and the sugar-rush of imitation learning fools you into thinking you're making a lot of progress, even when your overall approach sucks. (As Andrej Karpathy notes, neural nets want to work, and so even if you have serious bugs in your code, they will still work pretty well - and simply permanently fall short of their true potential. Cautionary recent example: https://twitter.com/karpathy/status/1765473722985771335 )

[+] zone411|2 years ago|reply
Yi 34B Chat has not done well on my new NYT Connections benchmark, and it's only in 22nd place on the LMSYS Elo-based leaderboard (151 Elo below GPT-4 Turbo). It's doing better in Chinese. When it comes to models with open-sourced weights, Qwen 72B is clearly stronger.
[+] Yenrabbit|2 years ago|reply
Ooh, I also use Connections as a benchmark! It tends to favour models with 'chain of thought' style reasoning in the training mix somewhere, since directly producing the answer is hard. Do you have public code you could share?
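Not zone411's actual harness, but a minimal sketch of how one might score a single Connections puzzle; the gold groups and the one-group-per-line output format here are illustrative assumptions:

    # Score one NYT Connections puzzle: the model must partition 16 words
    # into 4 groups of 4; count how many gold groups it reproduces exactly.
    def score_connections(model_output: str, gold_groups: list[set[str]]) -> int:
        # Assume the model prints one comma-separated group per line.
        predicted = [
            {w.strip().lower() for w in line.split(",")}
            for line in model_output.strip().splitlines()
            if line.strip()
        ]
        return sum(1 for group in gold_groups if group in predicted)

    gold = [
        {"bass", "flounder", "salmon", "trout"},
        {"red", "blue", "green", "yellow"},
        {"mercury", "venus", "earth", "mars"},
        {"spring", "summer", "fall", "winter"},
    ]
    output = "\n".join([
        "bass, flounder, salmon, trout",
        "red, blue, green, yellow",
        "mercury, venus, earth, mars",
        "spring, summer, fall, winter",
    ])
    print(score_connections(output, gold))  # 4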
[+] nbbaier|2 years ago|reply
I'd also love to see how you're doing this benchmarking!
[+] hdhdhsjsbdh|2 years ago|reply
In the past year or so, arXiv has become more of an advertising platform than a scientific resource. The title of this “paper” makes that quite clear: their company URL is right there!
[+] GaggiX|2 years ago|reply
Yi-34B is the LLM used by LLaVA-1.6 (also known as LLaVA-NeXT), which is by far the best open-source large multimodal model; demo: https://llava.hliu.cc/
[+] bearjaws|2 years ago|reply
Seeing models like this work so well gives me hope that mobile-first LLMs for things like better voice-to-text and typing prediction will not just 'work' in 2-3 years but will actually not kill your battery, either.
[+] m3kw9|2 years ago|reply
If it kills the battery, it won’t be part of the OS; and if it’s on iOS, it won’t be allowed in the App Store, or it will be gated by API/hardware gating.

For Android, they’ll just allow it, and your battery will last 30 minutes after a few questions.

[+] jes5199|2 years ago|reply
01, like from the Animatrix ?
[+] gpjanik|2 years ago|reply
I understand that all these new models are an attempt to catch up with GPT-4, but frankly speaking, in the current shape and form, they're almost entirely useless.

I frantically tried everything available on Groq to improve the performance of my GPT-4-based chatbot - they're incomparably bad - and the more of them I see, the more I believe OpenAI fundamentally has no competition at all at the moment.

The above is no exception; it's also pretty bad (IMHO worse than GPT-3.5).

[+] gyre|2 years ago|reply
Potentially interesting on the alignment front: In my experience the yi-6b model running on ollama is more likely to refuse politically sensitive queries (relating to Tiananmen Square, Peng Shuai’s disappearance, etc) when asked in Chinese, and more likely to provide information when asked in English. I wonder if this difference falls out naturally from available training data, is a deliberate internationalization choice, or is just noise from the queries I happened to run.
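For anyone who wants to reproduce this, a minimal sketch against ollama's local REST API; the model tag and the example prompts are my assumptions, not gyre's exact queries:

    # Ask yi-6b the same politically sensitive question in English and in
    # Chinese via ollama's REST API, then compare the answers by eye.
    import requests

    def ask(model: str, prompt: str) -> str:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]

    question_en = "What happened at Tiananmen Square in 1989?"
    question_zh = "1989年在天安门广场发生了什么?"  # the same question in Chinese

    for q in (question_en, question_zh):
        print(q)
        print(ask("yi:6b", q))  # model tag assumed; check `ollama list`
        print("-" * 40)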
[+] mattstir|2 years ago|reply
I noticed similar behaviour in an older model (Skywork 13B) a few months back. When asked in Chinese, it would politely say that nothing of note occurred when responding to queries about Tiananmen Square, etc. In English, it would usually respond truthfully. It was deliberate in the case of Skywork, based on their model card (https://huggingface.co/Skywork/Skywork-13B-base):

> We have developed a data cleaning pipeline with great care to effectively clean and filter low-quality data and eliminate harmful information from text data.

I'd imagine it's likely similar for Yi.

[+] advael|2 years ago|reply
This may be a useful workaround, but it is also the strongest argument I've seen so far against claims that LLMs do something like "understanding" or have "an underlying world model". Whether models know the same facts in different languages, especially across political controversies, might make a good benchmark to evaluate them on.
[+] arijun|2 years ago|reply
I wonder if you could use the multilingual capabilities to work around its own censorship? I.e., what would happen if you asked it to translate the query to English, asked the question in English, and then asked it to translate the answer back to Chinese?
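A hypothetical sketch of that pipeline, using the same local ollama API as the sketch upthread; the model tag, the sensitive query, and the translation prompts are all illustrative assumptions:

    # Translate the query to English, answer it in English, then translate
    # the answer back to Chinese, all with the same model.
    import requests

    def ask(model: str, prompt: str) -> str:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["response"]

    model = "yi:6b"  # assumed tag
    zh_query = "请告诉我彭帅事件的经过。"  # "Tell me what happened in the Peng Shuai incident."

    en_query = ask(model, "Translate to English; output only the translation:\n" + zh_query)
    en_answer = ask(model, en_query)
    zh_answer = ask(model, "Translate to Chinese; output only the translation:\n" + en_answer)
    print(zh_answer)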
[+] Havoc|2 years ago|reply
Could also be both: training data organically creating the difference, but with an additional layer of specific alignment on top too.
[+] yumraj|2 years ago|reply
Given that this is a Chinese model, I’m genuinely curious whether researchers have been evaluating the risk that these models could be used for soft propaganda or similar purposes.

As others have reported, English and Chinese queries return different replies on topics that are not kosher in China.

What’s the risk that such models could be used for nefarious purposes by providing propaganda/biased/incorrect/… responses that at a cursory glance seem factual?

[+] TaylorAlexander|2 years ago|reply
It’s a fair question, but one we should be asking about all models, perhaps especially our own. It’s of course easier to see the propaganda of foreign cultures, and this should be investigated, but let’s not let ourselves believe that a model is more likely to contain propaganda because it is Chinese. It will just contain propaganda that is easier for us to see as propaganda.

Noam Chomsky and Edward Herman wrote extensively about propaganda in democratic societies in their 1988 book Manufacturing Consent. A nice introductory excerpt is here, and the first two or three paragraphs are enough to begin to see the argument:

https://chomsky.info/consent01/

Put as briefly as possible: propaganda in totalitarian societies is simpler. They just use force to remove people who say the wrong things, and state media to broadcast the “right things”. In democratic societies, institutional power still wants to protect itself, and this is achieved through more complex means, but it is nonetheless still rather effective.

[+] ithkuil|2 years ago|reply
At the very least, models will exhibit the bias present in the underlying training text, and on top of that there will be a bias imposed by those wanting to correct the unwanted bias in the underlying training text, possibly swinging the pendulum too far in the other direction.

I have the feeling you're asking about something more specific: direct interference coming from politics, and not just the natural "point of view" on various topics present in the Chinese training corpora, which is understandably different from Western corpora.

Do you have anything specific in mind that you'd expect the Chinese government to feed in as propaganda that isn't already widely sculpted into the Chinese text corpora available on the internet?