I gave a talk at PyData Berlin on how to build your own TikTok recommendation algorithm. The TikTok personalized recommendation engine is the world's most valuable AI. It's TikTok's differentiation. It updates recommendations within 1 second of you clicking, which is at human-perceivable latency. If your AI recommender has poor feature freshness, it will be perceived as slow, not intelligent, no matter how good the recommendations are.

TikTok's recommender is partly built on European technology (Apache Flink for real-time feature computation), along with Kafka and distributed model training infrastructure. The Monolith paper is misleading in suggesting that 'online training' is the key. It is not. The key is that your clicks are made available as features for predictions in less than 1 second. You need a per-event stream processing architecture for this (like Flink; Feldera would be my modern choice as an incremental streaming engine).

* https://www.youtube.com/watch?v=skZ1HcF7AsM

* Monolith paper - https://arxiv.org/pdf/2209.07663
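To make the "feature freshness" point concrete, here is a minimal sketch of per-event feature updates in plain Python. It is not TikTok's, Flink's, or Feldera's implementation; the class `PerEventClickFeatures`, its methods, and the 300-second window are all hypothetical names and values chosen for illustration. The only idea it shows is that each click updates state as it arrives, so a feature lookup a fraction of a second later already reflects it.

```python
from collections import defaultdict, deque


class PerEventClickFeatures:
    """Hypothetical in-memory stand-in for a per-event streaming feature job."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        # user_id -> deque of (timestamp, category), oldest first
        self._clicks = defaultdict(deque)

    def on_click(self, user_id, category, ts):
        """Handle one click event; the state is updated immediately."""
        self._clicks[user_id].append((ts, category))

    def features(self, user_id, now):
        """Return per-category click counts for the window ending at `now`."""
        window = self._clicks[user_id]
        # Evict events that have fallen out of the sliding window.
        while window and window[0][0] < now - self.window_seconds:
            window.popleft()
        counts = defaultdict(int)
        for _, category in window:
            counts[category] += 1
        return dict(counts)


store = PerEventClickFeatures(window_seconds=10.0)
store.on_click("u1", "dogs", ts=100.0)
# Half a second later the click is already visible as a feature.
print(store.features("u1", now=100.5))
```

A batch pipeline would compute the same counts, but only after the next scheduled run, which is the gap the post is describing.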

eddd-ddde|23 days ago

I have to say, it is _extremely_ impressive when a tiktok I watched reminds me of some other tiktok, so I go and search for a very loose description of the tiktok, and the first result is 95% of the time what I wanted to find.

I don't think any single other platform has as good a search feature as TikTok does.

tridentboy|23 days ago

oh wow, you're really lucky. around my friend groups who use tiktok, the main complaint is how bad the search is. unfortunately for us, getting a specific video is almost impossible =(

nik_0_0|20 days ago

That's super interesting (I deleted TikTok because it was too addicting!), but a common complaint about Instagram is that it feels impossible to find a reel based on keywords.

dmix|24 days ago

I noticed YouTube Shorts also seems to update the feed based on how long you watched the last video. If you're scrolling quickly and then stop to watch a dog video long enough, the next one is likely to be another dog video.
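A rough sketch of how a dwell-time signal like this could work, assuming (this is a guess, not anything YouTube has documented) that each topic score decays a little on every video and the watched fraction of the current video is added to its topic. The function names and the `decay=0.8` value are hypothetical.

```python
def update_topic_scores(scores, topic, watched_seconds, video_seconds, decay=0.8):
    """Decay all existing scores, then credit `topic` with the watched fraction."""
    completion = min(watched_seconds / video_seconds, 1.0)
    updated = {t: s * decay for t, s in scores.items()}
    updated[topic] = updated.get(topic, 0.0) + completion
    return updated


def next_topic(scores):
    """Pick the topic with the highest current score."""
    return max(scores, key=scores.get)


scores = {}
scores = update_topic_scores(scores, "cooking", watched_seconds=1, video_seconds=30)
scores = update_topic_scores(scores, "cars", watched_seconds=2, video_seconds=20)
# The user stops scrolling and watches a whole dog video.
scores = update_topic_scores(scores, "dogs", watched_seconds=25, video_seconds=25)
print(next_topic(scores))
```

With these toy numbers, one fully watched video outweighs two quick skips, which matches the "next one is another dog video" behaviour described above.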

kgeist|23 days ago

It creates a weird feedback loop: after I watch video A, it recommends a similar video B, and if I make the mistake of watching that too, it then recommends video C on the same topic. Suddenly my feed is nothing but Stranger Things shorts for two whole days (literally not a single video about anything else). Skipping or disliking didn't help, then somehow it went back to normal after two days.

randysalami|24 days ago

I’ve noticed the same thing and this creates such a negative user experience. Every short is a reaction test and if I fail, I get slop. Makes the whole experience very jarring (for better or for worse).

pandemic_region|24 days ago

I've been insta-skipping tennis videos for months now. Still getting Federer on a daily basis.

beAbU|23 days ago

Facebook does the same. The longer I dwell on an image post, the more likely the next batch of posts would be similar

BoxOfRain|23 days ago

One of my gripes with youtube at the moment is that they break my adblock filters to remove shorts more often than they break the filters stopping the actual ads.

rjh29|23 days ago

youtube's algorithm seems to be "oh you watched this video? now here's every other video by this creator, pretty much without a break, until you downvote it"

It never reliably gives me videos similar but not exactly the same, i.e. things I might be interested in.

vjerancrnjak|24 days ago

Flink is too slow for this.

If by features you mean tracking state per user, that stuff can be tracked without Flink insanely fast with Redis as well.

If you're saying they don't have to load data to update the state, I don't see how these states are massive enough to require in-memory updates, and if they are, you could just do in-memory updates without Flink.

Similarly, any consumer will have to deal with batches of users and pipelining.

Flink is just a bottleneck.

If they actually use Flink for this, it's not the moat.

btown|23 days ago

Yea, the Monolith paper by Bytedance uses Flink but they only say it's in use for their B2B ecommerce optimization system. Maybe this is intentional ambiguity, but I'd believe that they wouldn't rely on something like Flink for their core TikTok infrastructure.

My hunch is we start to learn a lot more about the core internals as Oracle tries to market to B2B customers, as Oracle is wont to do!

lsuresh|23 days ago

Thanks for the Feldera shoutout Jim.

For anyone else, if you want to try out Feldera and IVM for feature-engineering (it gives you perfect offline-online parity), you can start here: https://docs.feldera.com/use_cases/fraud_detection/

bobek|23 days ago

It is not only the recommender, though. These guys [1] seem to be able to react pretty quickly without creating addicts along the way ;(

[1] https://recombee.com

miohtama|24 days ago

TikTok's differentiation is its userbase of all the teenagers in the world.

wongarsu|23 days ago

But go just one layer deeper to 'why is every teenager using Tiktok' and the primary answer once again becomes 'Tiktok's recommendation engine'

AlienRobot|24 days ago

It also provides different opportunities for growth compared to other social media. A video that gets over half a million views on TikTok may not get 5 thousand on Youtube, or even 10 views on Instagram or Facebook.

notyourwork|23 days ago

That didn’t happen by accident, though.

3abiton|23 days ago

It's interesting how they found out that the "lifetime" of a feature is a feature in itself. Meta-features are real.

not_ai|23 days ago

I’m happy to see that Flink is in this stack, I wish that Pulsar was as well instead of Kafka.

permo-w|23 days ago

I'm sorry to point out the obvious here, but who is going to perceive their recommended feed as slow or unfresh if it doesn't learn from exactly the last video you clicked on within 1 second? The bar simply is not that high. The special sauce of TikTok is how it chooses the videos, not the speed it does it at. I'm sure the speed helps to give it that "spookily intelligent" feeling, but that's a cherry on the recommendation cake, a cake which is already twice as good as the nearest competitor. I'm sure your talk goes deeper than this, but if this is the main focus, then you've missed the point.

owenversteeg|22 days ago

I partly agree and disagree.

Speed completely changes the game in a few ways. The first is identifying interests. Imagine every possible interest in a tree structure. Let's say you're into kumiko. There are so many levels of the tree to traverse to find kumiko; perhaps Skilled crafts -> Woodworking -> Japanese -> Construction without use of fasteners -> Panels and decorative elements -> Kumiko. The more iterations you can get through, the better you can match people's interests. If someone has 10 interests and each one requires many questions to determine, it can take forever to find exact interests with a system that only narrows down your interests every X videos vs. after each video.
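As back-of-the-envelope arithmetic for the tree argument, here is a toy model. It assumes, purely for illustration, that each model update can narrow an interest by exactly one tree level; the function `videos_needed` and the 20-videos-per-update figure are hypothetical.

```python
# The example path from the comment above, one entry per tree level.
KUMIKO_PATH = [
    "Skilled crafts",
    "Woodworking",
    "Japanese",
    "Construction without use of fasteners",
    "Panels and decorative elements",
    "Kumiko",
]


def videos_needed(path_depth, videos_per_update, interests=1):
    """Videos watched before every interest is narrowed to its leaf.

    Toy assumption: one update narrows one interest by one level.
    """
    return path_depth * videos_per_update * interests


depth = len(KUMIKO_PATH)
per_video = videos_needed(depth, videos_per_update=1, interests=10)
every_20 = videos_needed(depth, videos_per_update=20, interests=10)
print(per_video, every_20)  # 60 1200
```

Under these assumptions, ten interests at six levels each take 60 videos when the system updates after every video, versus 1200 when it updates every 20 videos.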

The second is matching current moods. Let's say you just broke up with your girlfriend, or your pet fish died, or you're on vacation in Spain. A rapidly-updating system can capture those trends and get right to the heart of them in time for them to matter. A slow system might only get through a few iterations and capture a vague interest in Spain; a fast-updating one can get through countless iterations of guessing. Spain? What city? Tourist or moving there? What type of tourist? Foodie? What type of food? How fancy? Bam, you're watching the perfect video about an upscale seafood restaurant in Barcelona.

The third is type and flavor of content. Even inside of a small niche you will find many flavors of content. Super-short or long form, fast paced or slow, funny or serious, intellectual, irreverent, political leanings, background music, et cetera. Maybe you like slow long-form woodworking content but like fast-paced travel guides. Maybe you hate background music except when it's in skateboarding videos. To determine this requires an incredible amount of "questioning" of the user.

Now, of course, an algorithm that updates once daily can also make inferences about your interests and preferences. It can certainly learn, with enough time, what you are into and how you like to consume it. But the key thing is that these inferences only enable _predetermined_ changes. Imagine you are a human showing someone TikToks. Imagine that you can ask them any questions about their preferences right as they watch a video. You may not ask a question after every video, but you will ask countless questions over the hours of scrolling that day, and you will get good data. Now imagine a new restriction: you must decide your questions once a day in advance. You will manage far fewer questions; and to follow up on them you must wait yet another day.

Now, why do I partly agree? Well, I don't think speed is everything; I think TikTok has another sort of je ne sais quoi to it. I think it has a unique culture and community. It has a better UI and better features than Instagram. It has a young and cool reputation, far from the Millennial taint of Instagram or Facebook. And I suspect that they are good at identifying _who_ you are and acting on that information. But in my eyes, the speed could very well be the most important part of the puzzle.

ryanjshaw|24 days ago

Great insight. Any thoughts on RisingWave?

jamesblonde|23 days ago

That, too, and Materialize. Feldera is my favourite, though.

SpaceManNabs|23 days ago

apache flink is so good. i think netflix used it heavily in 2018. not sure about now.

cactusplant7374|23 days ago

I thought this was secret information. How long has it been publicly known?

permo-w|23 days ago

The secret is how it chooses videos to recommend, not the tech stack it uses and how fast it is
