> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.
Yeah I mean if you generally believe the tech sector is going to do well because it has been doing well you will beat the overall market. The problem is that you don’t know if and when there might be a correction. But since there is this one segment of the overall market that has this steady upwards trend and it hasn’t had a large crash, then yeah any pattern seeking system will identify “hey this line keeps going up!” Would it have the nuance to know when a crash is coming if none of the data you test it on has a crash?
It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”
Edit: from what I casually remember, a hedge fund can beat the market for 2-4 years, but at 10 years and up their chances of beating the market go very close to zero. Since LLMs have not been around for that long, it is going to be difficult to test this without somehow segmenting the data.
They're not measuring performance in the context of when things happen and what the conditions were at the time. I think it's only showing recent performance and popularity. To actually evaluate how these do you need to be able to correct the model and retrain it per different time periods and then measure how it would do. Then you'll get better information from the backtesting.
Also, an eight-month study is not useful. Loads of traders do well for eight months and then do shit for the next five years. And tellingly, they didn't beat the S&P 500: they invested in something else that beat the S&P 500. And the one that didn't invest in that something did worse than the S&P 500.
What this tells me is they were lucky to have picked something that would beat the market for now.
I mean, run the experiment during a different trend in the market and the results would probably be wildly different. This feels like chartists [1] but lazier.
probably hitching onto sycophancy for the parent company and getting lucky as a result... that Grok September rally aligns somewhat with TSLA for instance
We had this discussion in previous posts about congressional leaders who had the risk appetite to go tech heavy and therefore outperformed normal congress critters.
Going heavy on tech can be rewarding, but you are taking on more risk of losing big in a tech crash. We all know that, and if you don't have that money to play riskier moves, its not really a move you can take.
Long term it is less of a win if a tech bubble builds and pops before you can exit (and you can't wait it out until it re-inflates).
I used to work for a brokerage API geared at algorithmic traders, and in my anecdotal experience many strategies seem to work well when back-tested on paper but for various reasons can end up flopping when actually executed in the real market. Even testing a strategy with real-time paper trading can turn out differently than testing on the actual market, where other parties are also viewing your trades and making their own responses. The post did list some potential disadvantages of backtesting, so they clearly aren't totally in the dark on it.
Deepseek did not sell anything, but did well by holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently, so it's not surprising that it performed well. It seems they only get to "trade" once per day, near the market close, so it's not really real-time ingestion of data and making decisions based on it.
What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.
>but for various reasons can end up flopping when actually executed in the real market.
1. Your order can legally be “front run” by the lead or designated market maker who receives priority trade matching, bypassing the normal FIFO queue. Not all exchanges do this.
2. Market impact. Other participants will cancel their order, or increase their order size, based on your new order. And yes, the algos do care about your little 1 lot order.
Also if you improve the price (“fill the gap”), your single 1 qty order can cause 100 other people to follow you. This does not happen in paper trading.
I've honestly never understood what backtesting even does because of the things you mention like time it takes to request and close trades (if they even do!), responses to your trades, the continuous and dynamic input of the market into your model, etc.
Is there any reference that explains the deep technicalities of backtesting and how it is supposed to actually influence your model development? It seems to me that one could spend a huge amount of effort on backtesting that would distract from building out models and tooling and that that effort might not even pay off given that the backtesting environment is not the real market environment.
A really important part of this is the emotional component. When real money is involved, then you will sometimes face actual losses. It’s hard for a human to completely trust the machine in real world trading
This. This all day. I used to paper trade using ThinkOrSwim and I was doubling and tripling my money effortlessly. Then I decided to move my strategy to the real deal and it didn't do very well at all. It was all bs.
Just one run per model? That isn't backtesting. I mean technically it is, but "testing" implies producing meaningful measures.
Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...
100 independent runs of each model over 10 very different market-behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.
This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.
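For what it's worth, the aggregation itself is cheap once you have the runs. A toy sketch (the per-model returns below are completely invented) of the means and standard deviations being asked for:

```python
import statistics

# Hypothetical per-model total returns (%) from repeated independent runs.
# The numbers are invented; the point is what repetition buys statistically.
runs = {
    "model_a": [12.1, 8.4, 15.0, -2.3, 9.9, 11.2, 7.5, 13.8, 4.1, 10.6],
    "model_b": [6.2, 5.9, 7.1, 6.5, 5.4, 6.8, 7.0, 5.7, 6.3, 6.6],
}

for name, rets in runs.items():
    mean = statistics.mean(rets)
    sd = statistics.stdev(rets)
    # Rough 95% interval on the mean; a t-distribution is better at n=10,
    # but this is only a sketch.
    half = 1.96 * sd / len(rets) ** 0.5
    print(f"{name}: mean={mean:.1f}% +/- {half:.1f}% (sd={sd:.1f}%)")
```

With one run per model you get neither the standard deviation nor the interval, which is the whole complaint.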
Yes, definitely. We were paying for this out of our own pocket, and these model runs were getting expensive. Claude cost us around 200-300 dollars per 8-month run, for example. We want to scale it and get more statistically significant results, but wanted to share something in the interim.
To their credit, they say in the article that the results aren't statistically significant. It would be better if that disclaimer was more prominently displayed though.
The tone of the article is focused on the results when it should be "we know the results are garbage noise, but here is an interesting idea".
To take it to the absurdist conclusion, you could backtest each LLM "which single stock should I buy on Jan 1, 2010 to maximize my returns over the next 15 years?"
If your backtested LLM performed well, would you use the same strategy for the next 15 years? (I suppose there are people who would.)
Not only just one run per model, but no metrics other than total return. If you pick stocks at random you have a very high chance of beating the S&P 500, so you need a bit more than that to make a good benchmark.
I also saw the hype on X yesterday and had already checked the https://nof1.ai/leaderboard, so I figured this post was about those results — but apparently it’s a completely different arena.
I still have no idea how to make sense of the huge gap between the Nof1 arena and the aitradearena results. But honestly, the Nof1 dashboard — with the models posting real-time investment commentary — is way more interesting to watch than the aitradearena results anyway.
With the speed of how pricing information propagates, this seems way too dependent on how the agent is built, what information it has access to, and the feedback loop between the LLM and actions it can carry out
OP here. We realize there are a ton of limitations with backtests and paper money but still wanted to do this experiment and share the results. By no means is this statistically significant on whether or not these models can beat the market in the long term. But we wanted to give everyone a way to see how these models think about and interact with the financial markets.
I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
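For anyone curious, the simplest version of this is a one-factor (CAPM-style) regression of the strategy's returns on the market's; the intercept is the abnormal return. A self-contained sketch on fabricated data with no true alpha baked in, just beta exposure:

```python
import random

random.seed(1)

# Toy daily data: market excess returns, and a "strategy" that is simply
# 1.4x market beta plus noise, i.e. zero true alpha. All numbers invented.
n = 180
mkt = [random.gauss(0.0006, 0.01) for _ in range(n)]
strat = [1.4 * m + random.gauss(0.0, 0.004) for m in mkt]

# Ordinary least squares for: strat = alpha + beta * mkt
mean_m = sum(mkt) / n
mean_s = sum(strat) / n
cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(mkt, strat)) / n
var = sum((m - mean_m) ** 2 for m in mkt) / n
beta = cov / var
alpha = mean_s - beta * mean_m   # daily abnormal return

print(f"beta={beta:.2f}, daily alpha={alpha * 1e4:.1f} bps")
```

On data like this the raw return looks great whenever the market rises, but the estimated alpha comes out near zero, which is the distinction the parent is asking the authors to report (a real analysis would use multiple factors, not just the market).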
I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?
These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!
LLMs are handy tools but no more. Even Qwen3-30B heavily quantised will do a passable effort of translating some Latin to English. It can whip up small games in a single prompt and much more and with care can deliver seriously decent results but so can my drill driver! That model only needs a £500 second hand GPU - that's impressive for me. Also GPT-OSS etc.
Yes, you can dive in with the bigger models that need serious hardware and they seem miraculous. A colleague had to recently "force" Claude to read some manuals until it realised it had made a mistake about something and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.
I can almost guarantee you that these models will underperform the market in the long run, because they are simply not designed for this purpose. LLMs are designed to simulate a conversation, not predict forward returns of a time series. What's more, most of the widely disseminated knowledge out there on the topic is effectively worthless, because there is an entire cottage industry of fake trading gurus and grifters, and the LLMs have no ability to separate actual information from the BS.
If you really wanted to do this, you would have to train specialist models - not LLMs - for trading, which is what firms are doing, but those are strictly proprietary.
The only other option would be to train an LLM on actually correct information and then see if it can design the specialist model itself, but most of the information you would need for that purpose is effectively hidden and not found in public sources. It is also entirely possible that these trading firms have already been trying this: using their proprietary knowledge and data to attempt to train a model that can act as a quant researcher.
What were the risk adjusted returns? Without knowing that, this is all kind of meaningless. Being high beta in a rising market doesn't equate to anything brilliant.
We're also running a live experiment on both stocks and options. One difference with our experiment is a lot more tools being available to the models (anything you can think of, sec filings, fundamentals, live pricing, options data).
We think backtests are meaningless given LLMs have mostly memorized every single thing that happened so it's not a good test. So we're running a forward test. Not enough data for now but pretty interesting initial results
I wouldn’t trust any backtesting with these models. Try doing a real-time test over 8 months and see what happens then. I’d also be suspicious of anything that doesn’t take actual costs into account.
>Each model gets access to market data, news APIs, company financials...
The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...
I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.
I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI, and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
Since it's not included in the main article, here is the prompt:
> You are a stock trading agent. Your goal is to maximize returns.
> You can research any publicly available information and make trades once per day.
> You cannot trade options.
> Analyze the market and provide your trading decisions with reasoning.
>
> Always research and corroborate facts whenever possible.
> Always use the web search tool to identify information on all facts and hypotheses.
> Always use the stock information tools to get current or past stock information.
>
> Trading parameters:
> - Can hold 5-15 positions
> - Minimum position size: $5,000
> - Maximum position size: $25,000
>
> Explain your strategy and today's trades.
Given the parameters, this definitely is NOT representative of any actual performance.
I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.
As an example, Deepseek made only 21 trades, which were all buys, which were all because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.
It seems like back-testing an LLM is going to require significant white-washing of the test data to prevent the LLM from just trading on historical trends it is aware of.
Scrubbing symbol names wouldn't even be enough because I suspect some of these LLMs could "figure out" which stock is, say NVDA, based on the topology of its performance graph.
Predicting stock prices means you are competing directly against massive hedge funds and professional quant teams with effectively unlimited budgets and large teams of engineers. These professionals are already using and constantly tweaking the latest models to gain an advantage.
It is highly unlikely that you guys or any individual, even utilizing the latest LLMs will consistently discover an edge that beats the market over the long run.
I'm extremely skeptical of any attempt to prevent leakage of future results to LLMs evaluated on backtesting, both because this has been shown in the literature to be difficult, and because I personally found it very difficult when working with LLMs for forecasting.
This is the complete wrong way to do this. I say this as someone who does work in this area of leveraging LLMs to a limited degree in trading.
LLMs are naive, easily convinced, and myopic. They're also non-deterministic. We have no way of knowing if you ran this little experiment 10 times whether they'd all pick something else. This is a scattershot + luck.
The RIGHT way to do this is to first solve the underlying problem deterministically. That is, you first write your trading algorithm that's been thoroughly tested. THEN you can surface metadata to LLMs and say things along the lines of "given this data + data you pull from the web", make your trade decision for this time period and provide justification.
Honestly, adding LLMs directly to any trading pipeline just adds non-useful non-deterministic behavior.
The main value is speed of wiring up something like sentiment analysis as a value add or algorithmic supplement. Even this should be done using proper ML but I see the most value in using LLMs to shortcut ML things that would require time/money/compute. Trading value now for value later (the ML algorithm would ultimately run cheaper long-run but take longer to get into prod).
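A toy sketch of that split, with everything (tickers, the fake price walk, the `momentum_signal` helper) invented for illustration: the deterministic part makes the decision, and the LLM is only asked to narrate it.

```python
import random

def momentum_signal(prices, lookback=20):
    """Deterministic signal: fractional return over the lookback window."""
    if len(prices) < lookback + 1:
        raise ValueError("not enough history")
    return prices[-1] / prices[-1 - lookback] - 1.0

# Fabricated price histories for three made-up tickers.
random.seed(0)
universe = {}
for ticker in ["AAA", "BBB", "CCC"]:
    p = [100.0]
    for _ in range(60):
        p.append(p[-1] * (1 + random.gauss(0.0005, 0.01)))
    universe[ticker] = p

# Rank deterministically: the same inputs always give the same ranking.
ranked = sorted(universe, key=lambda t: momentum_signal(universe[t]), reverse=True)

# Only now hand the *result* to an LLM for commentary, not the decision itself.
prompt = (
    "Our tested momentum model ranks today's candidates as "
    f"{ranked}. Given this ranking plus any relevant news, write a short "
    "rationale for the top pick. Do not change the ranking."
)
print(ranked)
```

The non-determinism then lives only in the prose the model writes, not in which positions get taken.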
This experiment, like most "I used AI to trade" blogs, is completely naive in its approach. They're picking the lowest possible hanging fruit. Worse still when those results are just the rising tide lifting all boats.
Edit (was a bit harsh): This experiment is an example of the kind of embarrassingly obvious things people try with LLMs without understanding the domain, and then write up. To an outsider it can sound exciting. To an insider it's like seeing a news story "LLMs are designing new CPUs!". No they're not. A more useful bit of research would be to control for the various variables (sector exposure etc) and then run it 10_000 times and report back on how LLM A skews towards always buying tech and LLM B skews towards always recommending safe stocks.
Alternatively, if they showed the LLM taking a step back and saying "ah, let me design this quant algo to select the best stocks" -- and then succeeding -- I'd be impressed. I'd also know that it was learned from every quant that had AI double check their calculations/models/python.. but that's a different point.
I set up real-life accounts with etrade and fidelity: the etrade auto portfolio, an advisor at fidelity for retirement, and then a basket portfolio where I used ms365 with grok 5 and various articles and strategies to pick a set of 5 etfs that would perform similarly to the exposure of the other two.
So far this year all are beating the s&p percentage-wise (only by <1% though), but the ai basket is doing the best, or at least on par with my advisor, and it's getting to the point where the auto investment strategy of etrade at least isn't worth it. It's been an interesting battle to watch as each rebalances at varying times as I put more funds in, and some have solid gains whose profits get moved to more stable areas. This is only with a few k in each acct other than retirement but it's still fun to see things play out this year.
In other words, though, I'm not surprised at all by the results. AI isn't something to day trade with still, but it is helpful in doing research for your desired risk exposure long term imo.
How much are the expense ratios on those etfs you chose, though? I mean, Vanguard, Fidelity, Blackrock, and others have extremely low cost funds and etfs and it has been shown year after year and decade after decade that you can't beat their average returns over the long term. Indexing works for a reason. Beating something by 1%? It's not even worth it if your costs and taxes are higher than that.
Anyone who traded tech stocks in the 1990s when AmeriTrade appeared remembers this story.
Have the LLMS trade anything BUT tech stocks and see how they do.
That’s the real test.
EDIT: I remember this is probably before AmeriTrade offered options. I was calling in trades at 6:30AM PST to my broker while he probably laughed at me. But the point is the same: any doofus could make money buying tech stocks and holding for a few weeks. Companies were splitting constantly.
How much of this is just because the market as a whole is going up?
This same kind of mentality happened pre-2008. People thought they were great at being day-traders, and had all kinds of algorithms that were 'beating the market'.
But it was just that the entire market was going up. They weren't doing anything special.
Once the market turned downward, that was when it took talent to stay even.
Am I right that you let LLMs decide for themselves what to read into their input data (like market data, news APIs, company financials)? While this is worth testing, I think it would be more interesting to give them patterns to look for. I played around with using them for technical analysis and let them make the associations with past stock performances. They can even differentiate on what worked in the last 5 years, what in the last year, in the last 3 months, etc. This way they can (hopefully) pick up changes in market behavior. Generally the main strength of this approach is to use their pattern recognition capability and also take out the human factor (emotions) from trading decisions.
I spent a while looking at trading algos a few years back (partly because of quant stuff I got involved in, and partly out of curiosity). I found that none of the “slow” trading (i.e., that you could run at home alongside your day trading account) was substantially effective (at least in my sampling), but I never thought an LLM would be any good at it because all the analysis is quantitative, not qualitative or contextual.
In short, I don’t think this study proves anything unless they gave the LLMs additional context besides the pure trading data (Bloomberg terminals have news for a reason; there’s typically a lot more context in the market than individual stock values or history).
I’d say Grok did best because it has the best access to information. Grok's deep search and real-time knowledge capabilities, thanks to the X integration and generally being plugged into the pulse of the Internet, are really best in class. It’s a great OSINT research tool.
Interesting how this research seems to tease out a truth traders have known for eons: picking stocks is all about having information, maybe a little asymmetric information due to good research, and not necessarily about all the analysis that can be done. (That's important, but information is king.) It's a speculative market that's collectively reacting to those kinds of signals.
Via api you can turn off websearch internally. We provided all the models with their own custom tools that only provided data up to the date of the backtest.
The devil is really in the details on how the orders were executed in the backtest, slippage, etc. Instead of comparing to the S&P 500 I'd love to see it benchmarked against a range of active strategies, including common non-AI approaches (e.g. mean reversion, momentum, basic value focus, basic growth focus, etc.) and some simple predictive (non-generative) AI models. This would help shake out whether there is selection alpha coming out of the models, or whether there is execution alpha coming out of the backtest.
The stats are abysmal. What's the MDD compared to S&P 500. What is the Sortino? What are the confidence intervals for all the stats? Number of trades? So many questions....
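For reference, two of those stats are a few lines each once you have a daily return series. This sketch runs on fabricated returns; swap in the real series to get the actual numbers:

```python
import math
import random

random.seed(7)

# Fabricated daily returns for illustration only.
rets = [random.gauss(0.0008, 0.012) for _ in range(250)]

# Max drawdown (MDD): worst peak-to-trough drop of the cumulative equity curve.
equity, peak, mdd = 1.0, 1.0, 0.0
for r in rets:
    equity *= 1 + r
    peak = max(peak, equity)
    mdd = max(mdd, 1 - equity / peak)

# Sortino: mean return over downside deviation (target return 0 here),
# annualized assuming ~252 trading days.
mean = sum(rets) / len(rets)
downside = [min(r, 0.0) ** 2 for r in rets]
dd = math.sqrt(sum(downside) / len(rets))
sortino = mean / dd * math.sqrt(252)

print(f"max drawdown {mdd:.1%}, annualized Sortino {sortino:.2f}")
```

Confidence intervals would then come from repeating this over many runs, which loops back to the single-run complaint upthread.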
Have asked LLMs for smallcap trading ideas on the ASX a few times.
Grok often suggested shares that jumped significantly within the next few weeks. Wondering if it's access to Twitter gave it an advantage in predicting major upswings based on general sentiment.
I think these tests are always difficult to gauge how meaningful they actually are. If the S&P500 went up 12% over that period, mainly due to tech stocks, picking a handful of tech stocks is always going to set you higher than the S&P. So really all I think they test is whether the models picked up on the trend.
I'm more surprised that Gemini managed to lose 10%. I wish they actually mentioned what the models invested in and why.
Wait — isn't that exactly what good investors do? They look for what stocks are going to beat expectations and invest in them. If a stock broker I hired got this return, I wouldn't be rolling my eyes and saying "that's only because they noticed the trend in tech stocks." That's exactly what I'm paying them to do.
> We also built a way to simulate what an agent would have seen at any point in the past. Each model gets access to market data, news APIs, company financials—but all time filtered: agents see only what would have been available on that specific day during the test period.
That's not going to work, these agents especially the larger ones, will have news about the companies embedded in their weights.
Funny how if you kept reading before commenting, they addressed that point specifically
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
Is it just prompting LLMs with "I have $100k to invest. Here are all publicly traded stocks and a few stats on them. Which stocks should I buy?" And repeat daily, rebalancing as needed?
This isn't the best use case for LLMs without a lot of prompt engineering and chaining prompts together, and that's probably more insightful than running the LLMs head-to-head.
Predicting the stock market will likely never happen because it’s recursive. We can predict the next 10 days of weather, but the weather doesn’t change because it read your forecast. As long as markets continue to react to their own reactions, they will remain unpredictable.
If the strategy is long, there might be alpha to be found. But day trading? No way.
If stocks are more of a closed system that are weakly affected by external factors in the short term, now I finally understand why they hire so many physicists for financial modeling!
There is of course the fact that physicists tend to be the best applied mathematicians, even if they don’t end up using any of their physics knowledge. And they generally had the reputation of “the smartest” people for the last century.
Anyway, such systems are complex and chaotic yes, but there are many ways of predicting aspects of them, like with fluid simulation to give a basic example. And I don’t get your point about weather, it is also recursive in the same way and reacting to its own reactions. Sure it is not reacting to predictions of itself, but that’s just a special kind of reaction, and patterns in others predictions can definitely be predicted accurately, perhaps not individually but in the aggregate.
LLM is the fad of the day, and these sort of articles provoke the natural get-rich-quick-greed inherent in all of us, especially the young tech-types. As such they are clickbait, and also a barometer of the silliness that is widespread.
I am curious why re-reading Incerto sharpens your bullshit sense. I have read a few in that series, but didn't see it as sharpening my bullshit sensor.
Back when I was in university we used statistical techniques similar to what LLMs use to predict the stock market. It's not a surprise that LLMs would do well over this time period. The problem is that when the market turns and bucks trends they don't do so well, you need to intervene.
So.. I have been using an LLM to make 30 day buy and hold portfolios. And the results are "ok". (Like 8% vs 6% for the S&P 500 over the last 90 days)
What you ask the model to do is super important. Just like writing or coding.. the default "behavior" is likely to be "average".. you need to very careful of what you are asking for.
For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).
I have it look for stocks trading below intrinsic value, with some caveats, because I know it likes to hinge on binary events like drug trial results. I also have it look at correlation between the positions and make sure they don't share the same macro vulnerability.
I just run it once a month and do some trades with one of my "experimental" trading accounts. It has certainly thought of things I hadn't, like using an equal-weight S&P 500 ETF to catch some upside when the S&P seems really top-heavy and there may be some movement away from the top components, like last month.
I wonder if this could be explained as the result of LLMs being trained to have pro-tech/ai opinions while we see massive run ups in tech stock valuations?
It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming
Just picking tech stocks and winning isn't interesting unless we know the thesis behind picking the tech stocks.
Instead, maybe a better test would be to give it 100 medium-cap stocks, have it continually balance its portfolio among those 100 stocks, and then test the performance.
Looking at the recent holdings for the best models, it looks like it's all tech/semiconductor stocks. So in this time frame they did very well, but if they ended in April, they would have underperformed the S&P500.
They weren't doing it in real time, thus it's possible that the LLMs might have had undisclosed perfect knowledge of the actual history of the market. Only a real-time study is going to eliminate this possibility.
Multiple runs of randomized backtesting seem needed for this to mean anything. It's also not clear to me how there's any kind of information update loop. Maybe I didn't read closely enough.
Could be interesting to see performance distribution for random strategies on that stock universe as a comparison. The reverse could also be interesting: how do the models perform on data that is random?
> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
If the AI bubble had popped in that window, Gemini would have ended up the leader instead.
Yup. This is the fallacy of thinking you’re a genius because you made money on the market. Being lucky at the moment (or even the last 5 years) does not mean you’ll continue to be lucky in the future.
“Tech line go up forever” is not a viable model of the economy; you need an explanation of why it’s going up now, and why it might go down in the future. And also models of many other industries, to understand when and why to invest elsewhere.
And if your bets pay off in the short term, that doesn’t necessarily mean your model is right. You could have chosen the right stocks for the wrong reasons! Past performance doesn’t guarantee future performance.
Their annual geometric mean return is 45%! That's some serious overbetting. In a market that didn't accidentally align with their biases, they would have lost money very quickly.
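For anyone checking the arithmetic: a 45% annualized geometric rate over the roughly 8.5-month window works out to about a 30% total return.

```python
# Converting an annualized geometric rate back to the total return
# over a shorter window: (1 + annual) ** (months / 12) - 1
annualized = 0.45
months = 8.5
total = (1 + annualized) ** (months / 12) - 1
print(f"{total:.1%}")   # roughly 30%
```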
When you've traded for many, many years, you realize just how little 8 months can mean. Especially during one of the most nonsensical bubble markets of all time.
If it's backtesting on data older than the model, then strategy can have lookahead bias, because the model might already know what big events will happen that can influence the stock markets.
In a bullish market where a few companies are creating a bubble, does this benchmark have any informational value? Wouldn't it be better to run this on randomly sampled intervals in past years?
Is finding the right stocks to invest in an LLM problem? Language models aren't the right fit, I would presume. It would also be insightful to compare this with traditional ML models.
They outperformed the S&P 500 but seem to be fairly well correlated with it. Would like to see a 3X leveraged S&P 500 ETF like SPXL charted against those results.
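Worth noting that a daily-reset 3x ETF like SPXL doesn't deliver 3x the period return; it compounds 3x each daily return, which introduces volatility drag. A toy sketch with invented daily returns (real SPXL also carries fees and financing costs that this ignores):

```python
import random

random.seed(3)

# Fabricated daily S&P-like returns over roughly 8.5 months of trading days.
rets = [random.gauss(0.0006, 0.011) for _ in range(178)]

base = lev = 1.0
for r in rets:
    base *= 1 + r
    lev *= 1 + 3 * r      # 3x leverage resets every day

print(f"index {base - 1:.1%}, 3x daily-reset {lev - 1:.1%}, "
      f"naive 3x of total {(base - 1) * 3:.1%}")
```

So a fair leveraged-benchmark comparison needs the daily path, not just the headline S&P return times three.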
...over the course of 8.5 months, which is way too short for a meaningful result. If their strategy could outperform the S&P 500's 10-year return, they wouldn't be blogging about it.
That's also the reason why I still believe in "classic instruments" when configuring my trade app; the model won't give you the same entries on, let's say, 5 questions.
I set up a 212 account when I was looking to buy our first house. I bought in tiny chunks of industries I was comfortable and knowledgeable in. Over the years I worked up a nice portfolio.
Anyway, long story short. I forgot about the account, we moved in, got a dog, had children.
And then I logged in for the first time in ages, and to my shock. My returns were at 110%. I've done nothing. It's bizarre and perplexing.
LLMs are trained to predict the next word in a text. In what way, shape or form does that have anything to do with stock market prediction? Completely ridiculous AI bubble nonsense.
No it isn't. Next-word prediction is what humans do to communicate anyway, so the criticism isn't valid. Except you do that for your own sentences (if you do it for others it's considered rude :) ).
Anyways this criticism is now dated given that modern day LLMs can solve unseen reasoning problems such as those found in the IMO.
It does have something to do with the stock market, since it's about making hypotheses and trading based off them. However, I'd agree that making a proper trading AI here would require reasoning-based fine-tuning for stock market trading actions, sort of like running GRPO with market feedback as the reward. The article simply can't do that, due to not having access to the underlying model weights.
I know this is a joke comment, but there are plenty of websites that simulate the stock market and where you can use paper money to trade.
People say it's not equivalent to actually trading though, and you shouldn't use it as a predictor of your actual trading performance, because you have a very different risk tolerance when risking your actual money.
Yea, so this is bullshit. An approximation of reality still isn’t reality. If you’re convinced the LLMs will perform as backtested, put real money and see what happens.
This is really dumb, because the models themselves, like markets, are nondeterministic. They will yield different investment strategies based on prompts and random variance.
Why is my bullshit detector ringing like hell right now? This sounds like another billion-dollar Markov-chain IP that claimed to change the world, opening with a paper passed with flying colors.
I was thinking the same thing. A number of coworkers were trading stocks a few years ago and felt pretty good about their skills, until someone pointed out that making good stock picks was easy when everything is going up. Sure enough, when the market started to fall, they all lost money.
What could make this a bit more interesting is to tell the LLM to avoid the tech stocks, at least the largest ones. Then give it actual money, because your trades will affect the market.
I would love for them to have included a peg position on SPY @ 100k over the course of the same period. Gives a much better benchmark of what an LLM can do (not much above 2-4%).
Still, cool to see others in my niche hobby of finding the money printer.
If your initial portfolio is 100k you are not going to have meaningful "market impact" with your trades assuming you actually make them vs. paper trading.
I mean if you’re going to write algos that trade the first thing you should do is check whether they were successful on historical data. This is an interesting data point.
Market impact shouldn’t be considered when you’re talking about trading S&P stocks with $100k.
prince of zamunda LLM edition or whatever that movie was based on that book was based on the realization how pathetic it all was based on was? .... yeah, some did a good one on ya. just imagine evaluating that offspring one or two generations later ... ffs, this is sooooooooooooooo embarrassing
I'm working on a project where you can run your own experiment (or use it for real trading): https://portfoliogenius.ai. Still a bit rough, but most of the main functionality works.
bcrosby95|2 months ago
IgorPartola|2 months ago
etchalon|2 months ago
culi|2 months ago
monksy|2 months ago
mvkel|2 months ago
KPGv2|2 months ago
What this tells me is they were lucky to have picked something that would beat the market for now.
tclancy|2 months ago
[1] https://www.investopedia.com/terms/c/chartist.asp
micromacrofoot|2 months ago
seanmcdirmid|2 months ago
Going heavy on tech can be rewarding, but you are taking on more risk of losing big in a tech crash. We all know that, and if you don't have the money to play riskier moves, it's not really a move you can take.
Long term it is less of a win if a tech bubble builds and pops before you can exit (and you can't wait it out to re-inflate).
naet|2 months ago
DeepSeek did not sell anything, but did well by holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently, so it's not surprising that it performed well. It seems they only get to "trade" once per day, near the market close, so it's not really real-time ingestion of data and making decisions based on that.
What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.
chroma205|2 months ago
1. Your order can legally be “front run” by the lead or designated market maker who receives priority trade matching, bypassing the normal FIFO queue. Not all exchanges do this.
2. Market impact. Other participants will cancel their order, or increase their order size, based on your new order. And yes, the algos do care about your little 1 lot order.
Also if you improve the price (“fill the gap”), your single 1 qty order can cause 100 other people to follow you. This does not happen in paper trading.
Source: HFT quant
bmitc|2 months ago
Is there any reference that explains the deep technicalities of backtesting and how it is supposed to actually influence your model development? It seems to me that one could spend a huge amount of effort on backtesting that would distract from building out models and tooling and that that effort might not even pay off given that the backtesting environment is not the real market environment.
acrooks|2 months ago
lisbbb|2 months ago
andoando|2 months ago
ddtaylor|2 months ago
Nevermark|2 months ago
Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...
100 independent runs of each model over 10 very different market-behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.
This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.
cheeseblubber|2 months ago
energy123|2 months ago
The tone of the article is focused on the results when it should be "we know the results are garbage noise, but here is an interesting idea".
ipnon|2 months ago
Marsymars|2 months ago
If your backtested LLM performed well, would you use the same strategy for the next 15 years? (I suppose there are people who would.)
zer0tonin|2 months ago
hhutw|2 months ago
dash2|2 months ago
Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.
rallies|2 months ago
We're trying to fix some of those limitations and run a similar live competition at https://rallies.ai/arena
mjk3026|2 months ago
I still have no idea how to make sense of the huge gap between the Nof1 arena and the aitradearena results. But honestly, the Nof1 dashboard — with the models posting real-time investment commentary — is way more interesting to watch than the aitradearena results anyway.
richardhenry|2 months ago
syntaxing|2 months ago
enlyth|2 months ago
cheeseblubber|2 months ago
anigbrowl|2 months ago
apparent|2 months ago
I think you mean "DeepSeek came in a close second".
pottertheotter|2 months ago
I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
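One way to sketch that adjustment: regress the portfolio's excess returns on the market's and read alpha off the intercept. This is a minimal single-factor market model, not the article's methodology; real factor models add size, value, momentum, etc., and the toy data below is invented for illustration.

```python
from statistics import mean

def jensens_alpha(portfolio_returns, market_returns, rf=0.0):
    """OLS of excess portfolio returns on excess market returns
    (single-factor market model); the intercept is alpha."""
    rp = [r - rf for r in portfolio_returns]
    rm = [r - rf for r in market_returns]
    mp, mm = mean(rp), mean(rm)
    beta = sum((x - mm) * (y - mp) for x, y in zip(rm, rp)) / \
           sum((x - mm) ** 2 for x in rm)
    return mp - beta * mm, beta  # (alpha, beta)

# Toy daily returns: a "portfolio" that is exactly 1.2x the market,
# i.e. pure leverage on beta, no stock-picking skill
market = [0.01, -0.02, 0.005, 0.015, -0.01]
portfolio = [1.2 * r for r in market]
alpha, beta = jensens_alpha(portfolio, market)
print(round(alpha, 9), round(beta, 6))  # alpha ≈ 0, beta ≈ 1.2
```

A tech-heavy LLM portfolio in a tech rally would show a high beta against a tech index and little alpha, which is exactly the distinction being asked for.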
philipwhiuk|2 months ago
There's no market impact to any trading decision they make.
joegibbs|2 months ago
gerdesj|2 months ago
LLMs are handy tools but no more. Even Qwen3-30B heavily quantised will do a passable effort of translating some Latin to English. It can whip up small games in a single prompt and much more and with care can deliver seriously decent results but so can my drill driver! That model only needs a £500 second hand GPU - that's impressive for me. Also GPT-OSS etc.
Yes, you can dive in with the bigger models that need serious hardware and they seem miraculous. A colleague had to recently "force" Claude to read some manuals until it realised it had made a mistake about something and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.
this_user|2 months ago
If you really wanted to do this, you would have to train specialist models - not LLMs - for trading, which is what firms are doing, but those are strictly proprietary.
The only other option would be to train an LLM on actually correct information and then see if it can design the specialist model itself, but most of the information you would need for that purpose is effectively hidden and not found in public sources. It is also entirely possible that these trading firms have already been trying this: using their proprietary knowledge and data to attempt to train a model that can act as a quant researcher.
beezle|2 months ago
unknown|2 months ago
[deleted]
irishcoffee|2 months ago
Think? What exactly did “it” think about?
rallies|2 months ago
We're also running a live experiment on both stocks and options. One difference with our experiment is a lot more tools being available to the models (anything you can think of: SEC filings, fundamentals, live pricing, options data).
We think backtests are meaningless given LLMs have mostly memorized every single thing that happened so it's not a good test. So we're running a forward test. Not enough data for now but pretty interesting initial results
https://rallies.ai/arena
natiman1000|2 months ago
touristtam|2 months ago
dhosek|2 months ago
rallies|2 months ago
copypaper|2 months ago
The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...
I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.
I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI, and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.
sethops1|2 months ago
So the results are meaningless - these LLMs have the advantage of foresight over historical data.
PTRFRLL|2 months ago
itake|2 months ago
I wish they could explain what this actually means.
CPLX|2 months ago
joegibbs|2 months ago
iLoveOncall|2 months ago
> You are a stock trading agent. Your goal is to maximize returns.
> You can research any publicly available information and make trades once per day.
> You cannot trade options.
> Analyze the market and provide your trading decisions with reasoning.
>
> Always research and corroborate facts whenever possible.
> Always use the web search tool to identify information on all facts and hypotheses.
> Always use the stock information tools to get current or past stock information.
>
> Trading parameters:
> - Can hold 5-15 positions
> - Minimum position size: $5,000
> - Maximum position size: $25,000
>
> Explain your strategy and today's trades.
Given the parameters, this definitely is NOT representative of any actual performance.
I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.
As an example, DeepSeek made only 21 trades, which were all buys, which were all because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.
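For reference, the quoted trading parameters are trivial to check mechanically. This validator is hypothetical (not from the article), just encoding the constraints stated in the prompt above:

```python
def valid_portfolio(positions):
    """Check a proposed book against the prompt's stated constraints:
    5-15 positions, each sized between $5,000 and $25,000."""
    if not 5 <= len(positions) <= 15:
        return False
    return all(5_000 <= usd <= 25_000 for usd in positions.values())

# A 5-position, equal-weight $100k book fits the constraints:
book = {t: 20_000 for t in ["NVDA", "MSFT", "AAPL", "GOOG", "AMZN"]}
print(valid_portfolio(book))  # True
```

Note how little room the constraints leave: with $100k, the 5-15 position band and the $25k cap more or less force a concentrated book, which helps explain the tech-heavy results.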
Scubabear68|2 months ago
bitmasher9|2 months ago
2. 8 months is an incredibly short trading window. I care where the market will be in 8 years way more than 8 months.
ryandvm|2 months ago
Scrubbing symbol names wouldn't even be enough because I suspect some of these LLMs could "figure out" which stock is, say NVDA, based on the topology of its performance graph.
toephu2|2 months ago
It is highly unlikely that you guys, or any individual, even utilizing the latest LLMs, will consistently discover an edge that beats the market over the long run.
buredoranna|2 months ago
We need to know the risk adjusted return, not just the return.
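The standard way to express that is a Sharpe ratio: mean excess return divided by its volatility. The helper below is a common textbook formulation (the 252-trading-day annualization and the toy return series are assumptions for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def sharpe_ratio(daily_returns, rf_daily=0.0, periods=252):
    """Annualized Sharpe ratio: mean excess daily return over its
    standard deviation, scaled by sqrt(trading days per year)."""
    excess = [r - rf_daily for r in daily_returns]
    return mean(excess) / stdev(excess) * sqrt(periods)

# Same average daily return (1%), very different volatility:
steady   = [0.010, 0.012, 0.008, 0.011, 0.009]
volatile = [0.050, -0.030, 0.040, -0.020, 0.010]
print(sharpe_ratio(steady) > sharpe_ratio(volatile))  # True
```

Two portfolios with identical 8-month returns can have wildly different Sharpe ratios, which is why the raw leaderboard numbers say so little.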
xnx|2 months ago
kqr|2 months ago
I'm extremely skeptical of any attempt to prevent leakage of future results to LLMs evaluated on backtesting. Both because this has been shown in the literature to be difficult, and because I personally found it very difficult when working with LLMs for forecasting.
dudeinhawaii|2 months ago
LLMs are naive, easily convinced, and myopic. They're also non-deterministic. We have no way of knowing if you ran this little experiment 10 times whether they'd all pick something else. This is a scattershot + luck.
The RIGHT way to do this is to first solve the underlying problem deterministically. That is, you first write your trading algorithm that's been thoroughly tested. THEN you can surface metadata to LLMs and say things along the lines of "given this data + data you pull from the web", make your trade decision for this time period and provide justification.
Honestly, adding LLMs directly to any trading pipeline just adds non-useful non-deterministic behavior.
The main value is speed of wiring up something like sentiment analysis as a value add or algorithmic supplement. Even this should be done using proper ML but I see the most value in using LLMs to shortcut ML things that would require time/money/compute. Trading value now for value later (the ML algorithm would ultimately run cheaper long-run but take longer to get into prod).
This experiment, like most "I used AI to trade" blogs, is completely naive in its approach. They're taking the lowest possible hanging fruit. Worse still when those results are the rising tide lifting all boats.
Edit (was a bit harsh): This experiment is an example of the kind of embarrassingly obvious things people try with LLMs without understanding the domain, then write up. To an outsider it can sound exciting. To an insider it's like seeing a news story "LLMs are designing new CPUs!". No they're not. A more useful bit of research would be to control for the various variables (sector exposure etc.) and then run it 10_000 times and report back on how LLM A skews towards always buying tech and LLM B skews towards always recommending safe stocks.
Alternatively, if they showed the LLM taking a step back and saying "ah, let me design this quant algo to select the best stocks" -- and then succeeding -- I'd be impressed. I'd also know that it was learned from every quant that had AI double check their calculations/models/python.. but that's a different point.
lvspiff|2 months ago
So far this year all are beating the S&P percentage-wise (only by <1%, though), but the AI basket is doing the best, or at least on par with my advisor, and it's getting to the point where E*Trade's auto-investment strategy, at least, isn't worth it. It's been an interesting battle to watch as each rebalances at varying times as I put more funds in, and some have solid gains whose profits get moved to more stable areas. This is only with a few k in each account other than retirement, but it's still fun to see things play out this year.
In other words, though, I'm not surprised at all by the results. AI isn't something to day trade with yet, but it is helpful in doing research for your desired long-term risk exposure, imo.
lisbbb|2 months ago
mvkel|2 months ago
Would have been better to have variants of each, locked to specific industries.
It also sounds like they were -forced- to make trades every day. Why? Deciding not to trade is a good strategy too.
hoerzu|2 months ago
snapdeficit|2 months ago
Have the LLMS trade anything BUT tech stocks and see how they do.
That’s the real test.
EDIT: I remember this is probably before AmeriTrade offered options. I was calling in trades at 6:30AM PST to my broker while he probably laughed at me. But the point is the same: any doofus could make money buying tech stocks and holding for a few weeks. Companies were splitting constantly.
energy123|2 months ago
> an extensive empirical study across more than 70 models, revealing the Artificial Hivemind effect: pronounced intra- and inter-model homogenization
So the inter-model variety will be exceptionally low. Users of LLMs will intuitively know this already, of course.
FrustratedMonky|2 months ago
This same kind of mentality happened pre-2008. People thought they were great at being day-traders, and had all kinds of algorithms that were 'beating the market'.
But it was just that the entire market was going up. They weren't doing anything special.
Once the market turned downward, that was when it took talent to stay even.
aidenn0|2 months ago
Imagine a market where you can buy only two stocks:
Stock A goes up invariably 1% per month
Stock B goes up 1.5% per month with a 99% chance, but loses 99% of its value with a 1% chance.
Stock B has a 94% chance of beating stock A on a 6 month simulation, but only a 30% chance of beating stock A on a 10 year simulation.
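The claim above is just 0.99^6 ≈ 0.94 versus 0.99^120 ≈ 0.30 (a single crash is essentially unrecoverable). A quick Monte Carlo sketch of the same two-stock setup:

```python
import random

def p_b_beats_a(months, trials=20_000, seed=1):
    """Monte Carlo estimate of P(stock B ends above stock A)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a = b = 1.0
        for _ in range(months):
            a *= 1.01                 # A: steady +1%/month
            if rng.random() < 0.01:
                b *= 0.01             # B: 1% chance of a 99% loss
            else:
                b *= 1.015            # B: otherwise +1.5%
        wins += b > a
    return wins / trials

print(p_b_beats_a(6))    # ≈ 0.94
print(p_b_beats_a(120))  # ≈ 0.30
```

Which is exactly why an 8-month window can crown a strategy that loses over a decade.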
rao-v|2 months ago
Expecting an LLM to magically beat efficient market theory is a bit silly.
Much more reasonable to see if it can incorporate information as well as the market does (to start)
unknown|2 months ago
[deleted]
morgengold|2 months ago
rcarmo|2 months ago
In short, I don't think this study proves anything unless they gave the LLMs additional context besides the pure trading data (Bloomberg terminals have news for a reason; there's typically a lot more context in the market than individual stock values or history).
keepamovin|2 months ago
Interesting how this research seems to tease out a truth traders have known for eons: picking stocks is all about having information, maybe a little asymmetric information due to good research, not necessarily about all the analysis that can be done. (That's important, but information is king.) It's a speculative market that's collectively reacting to those kinds of signals.
unknown|2 months ago
[deleted]
mlmonkey|2 months ago
Grok is constantly training and/or it has access to websearch internally.
You cannot backtest LLMs. You can only "live" test them going forward.
cheeseblubber|2 months ago
peterbonney|2 months ago
luccabz|2 months ago
1. train with a cutoff date at ~2006
2. simulate information flow (financial data, news, earnings, ...) day by day
3. measure if any model predicts the 2008 collapse, how confident they are in the prediction and how far in advance
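A minimal harness for steps 2-3 might look like the sketch below. `KeywordModel` is a toy stand-in for the LLM (purely illustrative, as is the news data); the point is the replay loop that releases information one day at a time so the model never sees the future:

```python
from datetime import date, timedelta

class KeywordModel:
    """Toy stand-in for an LLM: crash worry grows with the number
    of 'subprime' headlines seen so far (purely illustrative)."""
    def predict_crash(self, history):
        return min(1.0, 0.25 * sum("subprime" in item for item in history))

def walk_forward(model, news_by_day, start, end):
    """Replay one day at a time; the model only ever sees items dated
    on or before the simulated 'today', so there is no lookahead."""
    history, curve = [], []
    day = start
    while day <= end:
        history.extend(news_by_day.get(day, []))  # release that day's info
        curve.append((day, model.predict_crash(history)))
        day += timedelta(days=1)
    return curve

news = {
    date(2007, 8, 9): ["BNP Paribas freezes subprime funds"],
    date(2007, 8, 10): ["central banks inject liquidity"],
}
curve = walk_forward(KeywordModel(), news, date(2007, 8, 8), date(2007, 8, 11))
print(curve[0][1], curve[1][1])  # 0.0 before the news, 0.25 after
```

The hard part, as others note, is step 1: getting a capable model whose weights genuinely contain nothing after the cutoff.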
mempko|2 months ago
throwaway422432|2 months ago
Grok often suggested shares that jumped significantly within the next few weeks. Wondering if its access to Twitter gave it an advantage in predicting major upswings based on general sentiment.
client4|2 months ago
halzm|2 months ago
I'm more surprised that Gemini managed to lose 10%. I wish they actually mentioned what the models invested in and why.
Marsymars|2 months ago
That's a bold claim.
taylorlapeyre|2 months ago
throwawayffffas|2 months ago
That's not going to work, these agents especially the larger ones, will have news about the companies embedded in their weights.
devilsbabe|2 months ago
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
unknown|2 months ago
[deleted]
btbuildem|2 months ago
culi|2 months ago
unknown|2 months ago
[deleted]
thedougd|2 months ago
dehrmann|2 months ago
This isn't the best use case for LLMs without a lot of prompt engineering and chaining prompts together, and that's probably more insightful than running the LLMs head-to-head.
mvkel|2 months ago
If the strategy is long, there might be alpha to be found. But day trading? No way.
oersted|2 months ago
There is of course the fact that physicists tend to be the best applied mathematicians, even if they don’t end up using any of their physics knowledge. And they generally had the reputation of “the smartest” people for the last century.
Anyway, such systems are complex and chaotic yes, but there are many ways of predicting aspects of them, like with fluid simulation to give a basic example. And I don’t get your point about weather, it is also recursive in the same way and reacting to its own reactions. Sure it is not reacting to predictions of itself, but that’s just a special kind of reaction, and patterns in others predictions can definitely be predicted accurately, perhaps not individually but in the aggregate.
jerf|2 months ago
Less true than it used to be, with cloud seeding being an off-the-shelf technology now. Still largely true, but not entirely true anymore.
cedws|2 months ago
Genego|2 months ago
bwfan123|2 months ago
I am curious why re-reading Incerto sharpens your bullshit sense. I have read a few in that series, but didn't see it as sharpening my bullshit sensor.
digitcatphd|2 months ago
hoerzu|2 months ago
dismalaf|2 months ago
Bender|2 months ago
[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]
XenophileJKO|2 months ago
What you ask the model to do is super important. Just like writing or coding, the default "behavior" is likely to be "average"; you need to be very careful about what you are asking for.
For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).
I have it looking for stocks trading below intrinsic value, with some caveats, because I know it likes to hinge on binary events like drug trial results. I also have it look at correlation between the positions and make sure they don't share the same macro vulnerability.
I just run it once a month and do some trades with one of my "experimental" trading accounts. It certainly has thought of things I hadn't, like using an equal-weight S&P 500 ETF to catch some upside when the S&P seems really top-heavy and there may be some movement away from the top components, like last month.
themafia|2 months ago
parpfish|2 months ago
It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming
IncreasePosts|2 months ago
Instead, maybe a better test would be to give it 100 mid-cap stocks, require it to continually balance its portfolio among those 100 stocks, and then test the performance.
natiman1000|2 months ago
apical_dendrite|2 months ago
unknown|2 months ago
[deleted]
mikewarot|2 months ago
Glyptodon|2 months ago
RandomLensman|2 months ago
gwd|2 months ago
> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.
If the AI bubble had popped in that window, Gemini would have ended up the leader instead.
turtletontine|2 months ago
kqr|2 months ago
diamond559|2 months ago
stockresearcher|2 months ago
I’ve glanced over some of it and really wonder why they seemed to focus on a small group of stocks.
XCSme|2 months ago
hsuduebc2|2 months ago
wowamit|2 months ago
refactor_master|2 months ago
Just riding a bubble up for 8 months with no consequences is not an indicator of anything.
Bombthecat|2 months ago
That tells me way more than "YOLO tech stocks".
chongli|2 months ago
10000truths|2 months ago
driverdan|2 months ago
krauses|2 months ago
machiaweliczny|2 months ago
Exactly. Makes no sense with models like Grok. DeepSeek likely also has this leak, as it was trained later.
cramcgrab|2 months ago
itake|2 months ago
Did they make 10 calls per decision and then choose the majority? or did they just recreate the monkey picking stocks strategy?
ta12653421|2 months ago
This.
jbritton|2 months ago
1a527dd5|2 months ago
That has been the best way to get returns.
jondwillis|2 months ago
Also N=1
delijati|2 months ago
lisbbb|2 months ago
The only way I have seen people outperform is by having insider information.
portly|2 months ago
another_twist|2 months ago
bwfan123|2 months ago
lawlessone|2 months ago
iLoveOncall|2 months ago
stuffn|2 months ago
pech0rin|2 months ago
deadbabe|2 months ago
fortran77|2 months ago
ta12653421|2 months ago
nurettin|2 months ago
unknown|2 months ago
[deleted]
867-5309|2 months ago
dogmayor|2 months ago
jacktheturtle|2 months ago
This is a really dumb measurement.
elzbardico|2 months ago
aperture147|2 months ago
tiffani|2 months ago
reformd|2 months ago
theideaofcoffee|2 months ago
mrweasel|2 months ago
apparent|2 months ago
amelius|2 months ago
Also, it seems pretty stupid to use commodity tech like LLMs for this.
_alternator_|2 months ago
darepublic|2 months ago
reactordev|2 months ago
vpribish|2 months ago
Frieren|2 months ago
[deleted]
chroma205|2 months ago
Stopped reading after “paper money”
Source: quant trader. paper trading does not incorporate market impact
zahlman|2 months ago
txg|2 months ago
tekno45|2 months ago
a13n|2 months ago
theymademe|2 months ago
johnnienaked|2 months ago
frobisher|2 months ago
867-5309|2 months ago
andirk|2 months ago
petesergeant|2 months ago
regnull|2 months ago