Reverse Engineering TikTok's VM Obfuscation (Part 2)

[+] nullpt_rs|3 years ago|reply

Nice work. I was going to basically cover the same topics in my second part but it looks like you beat me to it. If you'd like to collaborate with me on the next portion feel free to contact me on Discord (veritas#0001) or email ([email protected])

[+] cute_boi|3 years ago|reply

Hmm, can you clarify how you managed to send the request in your original article? I copied it as curl from the network tab and tried to send the request with curl. However, there was no response. Are they using fingerprints or cookie shenanigans?

[+] kayson|3 years ago|reply

What's the point of such heavy obfuscation? Are they afraid of someone cloning their frontend? At first glance it seems like a waste of performance...

[+] renonce|3 years ago|reply

I think they are just reusing technologies commonly used in China on the TikTok website so they don't have to write it twice. In China many frontend developers write code for WeChat mini programs. A WeChat mini program is just a typical HTML website with extra APIs and the restriction that code has to be uploaded for review before it is deployed, and `eval` (and anything else that allows executing a string as script directly) is disallowed in order to prevent sneaking code past review. Many developers wrote their own JavaScript VMs and interpreters to bypass that restriction, and that same code happens to be reused on all HTML-based platforms, even outside WeChat.

[+] mach1ne|3 years ago|reply

They do heavy data collection, which is a PR hit if it's too obvious. Obfuscation could be a general policy driven by this.

[+] est|3 years ago|reply

You'd be surprised to see China's underground content farms. They go to great lengths to automate content generation by decrypting and exploiting apps' private APIs.

[+] TobyTheDog123|3 years ago|reply

Most likely bot prevention & data security.

There's not a huge performance hit: there's a portion that runs on page load and a smaller portion that runs on each HTTP request (which isn't too often).

[+] thrdbndndn|3 years ago|reply

Lots of web services obfuscate their front-end scripts by default; it's not really that uncommon.

[+] 2OEH8eoCRo0|3 years ago|reply

Maybe they're using stolen IP, copyrighted code, violating licenses, etc?

[+] cactusplant7374|3 years ago|reply

At the business / management level, who decides to green light a project like this and what are their reasons? Is it regulatory obscurity or denying the competition easy understanding of the code?

[+] kerneloops|3 years ago|reply

> Using these accessors, the VMs become able to do anything that JS can do.

In fact the source language is likely to be JS itself; a JS-to-some-sort-of-vm-bytecode-to-JS compiler is made. I know that Tencent has a similar VM; an interesting aspect of that VM is that the instruction set is dependent on the code being compiled (and the opcodes are dynamically generated and shuffled when compiling), so unused instructions are not generated.
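To make the "instruction set depends on the code being compiled" idea concrete, here is a rough sketch of how a compiler could assign opcode numbers per build. Everything in it (the `Op` names, `buildOpcodeTable`, `encode`, the seeded PRNG) is hypothetical and made up for illustration; it is not Tencent's or TikTok's actual scheme, just one way such a thing could work.

```typescript
// Hypothetical sketch: none of these names come from a real obfuscator.
type Op = "push" | "add" | "mul" | "print";
type Instr = { op: Op; arg?: number };

// Small seeded PRNG so that a given build seed always yields the same table.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Give opcode numbers only to operations the program actually uses,
// in an order that depends on the seed (Fisher-Yates shuffle).
function buildOpcodeTable(program: Instr[], seed: number): Map<Op, number> {
  const used: Op[] = Array.from(new Set(program.map((i) => i.op)));
  const rand = seededRandom(seed);
  for (let i = used.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [used[i], used[j]] = [used[j], used[i]];
  }
  return new Map(used.map((op, code) => [op, code] as [Op, number]));
}

// Flatten the program into numbers using the per-build table.
// Only "push" carries an operand in this toy format.
function encode(program: Instr[], table: Map<Op, number>): number[] {
  const out: number[] = [];
  for (const ins of program) {
    out.push(table.get(ins.op)!);
    if (ins.op === "push") out.push(ins.arg ?? 0);
  }
  return out;
}
```

A program that never multiplies gets no `mul` opcode at all, and a different seed can renumber the remaining ones, so a disassembler written against one build's numbering may not work on the next build.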

[+] TobyTheDog123|3 years ago|reply

I, too, was disappointed that this was not a continuation of https://www.nullpt.rs/reverse-engineering-tiktok-vm-1, but as someone who could really use a way to interface with TikTok's API (for legitimate non-bot reasons that allow users to interface with TikTok differently), I'm all for more eyes on this problem.

[+] jeroenhd|3 years ago|reply

Interesting! I wonder, would this virtual machine be hand crafted or auto generated? When I look at obfuscated code like this, I always wonder if the code authors couldn't run their generator a second time and come up with a completely new format that existing reverse engineering efforts wouldn't work with?

I don't know what they're trying to obfuscate but it must be worth hiding to allow such inefficient javascript to run on clients around the world. I can't think of any non malicious reason to develop such a system for a website about silly videos.

[+] TheDong|3 years ago|reply

> I can't think of any non malicious reason to develop such a system for a website about silly videos.

Let me give you a few examples of cases where obfuscated or inefficient code exists:

1. reCAPTCHA is "obfuscated", and so difficult that computers aren't meant to be able to compute it. It is used to rate-limit login attempts to secure people's accounts, among many other uses.

2. Cloudflare runs some obfuscated "are you a human" javascript to fingerprint you for the purpose of DDoS protection. This one is arguably less noble, but still not obviously malicious.

3. Games obfuscate code all the time to make cheats harder to develop and maintain

4. YouTube, Netflix, etc. have obfuscated code (DRM code) because legally they have to make an attempt to protect copyrighted works from being downloaded, or else they lose access to said copyrighted works. TikTok also allows using copyrighted music, I'll point out.

Depending on your viewpoint, perhaps all of those are also malicious, but I think many people view them as non-malicious, though perhaps dumb.

That said, I personally wouldn't give tiktok the benefit of the doubt here. All the other large social media companies maliciously capture as much data as they can about me in order to sell it to brainwashing companies (ahem, advertisers), to the point where I think it should be illegal, so I don't really expect tiktok to be any different.

[+] _0w8t|3 years ago|reply

Typically the obfuscation is done with a special compiler, as that makes it easy to add or alter obfuscation layers. It also allows serving different obfuscation blobs to different clients to slow down reverse engineering even further.

[+] RektBoy|3 years ago|reply

>I wonder, would this virtual machine be hand crafted or auto generated?

Every proper obfuscator with virtualization should be auto-generated. Of course, when I reverse it, my scripts are also automated as much as they can be.

[+] runnerup|3 years ago|reply

A lot of it is probably to prevent automated, wholesale ripping of content for use on other social media platforms, and any related automated data mining.

[+] tinus_hn|3 years ago|reply

The market for promotion using fake views and bots is enormous. It’s the new spam.

[+] unknown|3 years ago|reply

[deleted]

[+] jonatron|3 years ago|reply

dynamic string extraction: https://gist.github.com/jonatron/f7ec44e7ffd41c4dd50d51b3451...

[+] sylware|3 years ago|reply

And then we can use a small piece of software to view TikTok videos, without needing one of the vanguard/blackrock financed, absurdly and grotesquely massive and complex web engines (blink/gecko/webkit).

[+] ralphc|3 years ago|reply

Can someone explain why TikTok needs a virtual machine? What are they doing with it that can't be done with a normal web app?

[+] KiwiJohnno|3 years ago|reply

They are doing it to obfuscate the capabilities and usage of their app.

Yes, this should set off big warning bells.

[+] The_Link|3 years ago|reply

As a thought exercise:

Botter: attempts to spin up phone VM for botting. Okay TikTok, start up.

TikTok: What hardware am I on?

Botter: insert common phone hardware here

TikTok: Okay, then this should work: hyper firmware/bytecode-specific virtualization commands and syscalls

(segfault)

[+] ghostly_s|3 years ago|reply

I skimmed the original post but still don't really grasp how these VMs are deployed. I assume they are running server-side (client-side would be a violation of App Store policies at least, right)? Is a dedicated VM spun up for each user session? Or just a sandbox in which various services run?

[+] kaba0|3 years ago|reply

    while (hasData) {
      switch (code) {
        case INSTR1 -> …,
        case INSTR2 -> …
        …
      }
    }
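The loop above is pseudocode. A minimal runnable sketch of the same fetch-and-dispatch idea in TypeScript might look like the following. The opcode names and the stack-machine design are assumptions chosen for illustration, not TikTok's actual instruction set; the point is only that the "VM" is an ordinary interpreter loop running client-side inside the page's JavaScript.

```typescript
// Hypothetical opcodes for illustration; the real instruction set differs.
const PUSH = 0;
const ADD = 1;
const MUL = 2;
const HALT = 3;

// Interpret a flat array of numbers as a tiny stack-machine program.
function run(code: number[]): number[] {
  const stack: number[] = [];
  let pc = 0; // program counter: index of the next number to read
  while (pc < code.length) {
    const op = code[pc++];
    switch (op) {
      case PUSH:
        // The operand is stored right after the opcode.
        stack.push(code[pc++]);
        break;
      case ADD: {
        const b = stack.pop()!;
        const a = stack.pop()!;
        stack.push(a + b);
        break;
      }
      case MUL: {
        const b = stack.pop()!;
        const a = stack.pop()!;
        stack.push(a * b);
        break;
      }
      case HALT:
        return stack;
      default:
        throw new Error(`unknown opcode ${op} at index ${pc - 1}`);
    }
  }
  return stack;
}

// (2 + 3) * 4
const result = run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]);
console.log(result); // → [ 20 ]
```

A real obfuscating VM would add instructions for property access and function calls so the bytecode can reach the rest of the JS environment, but the control flow stays this same loop-plus-switch shape.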

[+] unknown|3 years ago|reply

[deleted]

[+] yayr|3 years ago|reply

Is there any open source reference implementation of such a VM that is dynamically generated based on JS/TS?

[+] umasi|3 years ago|reply

As others have already touched on, it’s rather disingenuous to pretend that you are “picking up where he left off” after copying his title and labeling your article as part 2, no? Especially considering the first author made it very clear he intended to publish further work on the topic. Further, this article reads a lot like an attempt to beat him to the punch and to hijack his series. But correct me if I’m wrong?

[+] laptou|3 years ago|reply

You're right that copying their title exactly was probably not the right thing to do, and may cause confusion. I'm not trying to hijack anything, I think there's enough room for both of us to research this at the same time.

[+] unknown|3 years ago|reply

[deleted]

[+] captainmuon|3 years ago|reply

It's sad how much human effort is spent on creating obfuscation schemes, and then on analyzing them. Or on creating proprietary things and then reverse engineering them.

Sometimes I wonder if we could just make everything open. No obfuscation, no captchas, just a neat API for everything. Of course that wouldn't work ceteris paribus, everything else unchanged, due to bad actors, spam, or just competitors who want to take your work. But if you changed the incentives (made society non-adversarial, non-profit-oriented), then all that gating and obfuscation would become unnecessary.

[+] vasco|3 years ago|reply

You could say the same thing about locks, police, military. Those are way more wasteful and also just exist due to bad actors. I don't think there's a way around this, it'd be like wishing for no predators in nature so that natural evolution could just do the "good parts" and not have to optimise for survival.

[+] contrapunctus|3 years ago|reply

I often wonder the same thing. A similar aim was expressed in at least one of RMS' essays.

The other comments here remind me of how pessimistic and lacking in imagination the tech crowd tends to be.

Because earlier attempts failed, does that make an endeavor "impossible"/"naive"?

People cite "human nature", but forget that there exist and have existed systems of living, radically different from what they are used to, which gave rise to completely different "human nature" - or perhaps merely gave it a healthier environment, so it manifested very differently. (This "problem" is frequently refuted in anarchist texts.)

I'll leave some links which I think are relevant to this quest. I hope they provide some answers, or at least provoke further searching.

1. https://donellameadows.org/archives/leverage-points-places-t...

2. https://theanarchistlibrary.org/library/the-anarchist-faq-ed...

3. https://www.youtube.com/watch?v=l7TONauJGfc

4. https://en.wikipedia.org/wiki/Tao_Te_Ching

I hope we find a way. And for those looking and trying to move beyond the status quo, I supply a maxim to bear in mind when facing the unimaginative and the complacent - "It's always impossible until it's done."

[+] CGamesPlay|3 years ago|reply

Note: this is not a continuation of the work by the same author. This is a new author who took the original and did further research. I think it's a bit disingenuous to call this by the same name, "part 2".

[+] jeroenhd|3 years ago|reply

I think it's actually more truthful to admit that this is the continuation of an earlier post than it is to say that this is new work. This is just the continuation of the process the first post documented.

The relation between this article and the one it's based upon is clearly indicated in the first paragraph. I don't think this is disingenuous at all.

[+] vore|3 years ago|reply

If Brandon Sanderson is allowed to finish The Wheel of Time, this author is allowed to call it part 2 ;-)

[+] unknown|3 years ago|reply

[deleted]

[+] mrsaint|3 years ago|reply

How could Apple properly review something like this? Isn't it one of Apple's selling pitches that they'd review each app for malicious activity before it makes it to the app store?

[+] valleyer|3 years ago|reply

So, a tricky piece here is that this appears to be behavior of the TikTok web site. Obviously Apple makes no attempt (nor claim) to review the behavior of every web site accessible in Safari from an iPhone. And other native apps can embed WebKit-based web views into their apps.

The good news is that the scope of "malicious activity" is (at least in theory) much smaller when you constrain it to what web sites can do, as opposed to the scope of what can be done by executing ARM instructions and making syscalls.

The bad news is that the scope of "things web sites can do" keeps growing and is fingerprintable.

[+] angulardragon03|3 years ago|reply

> the code that is deployed on TikTok's _website_

This isn't regarding the app at all, which is likely not as heavily obfuscated as this (mostly because you can't just "view source" on an app).

[+] Mindwipe|3 years ago|reply

> How could Apple properly review something like this? Isn't it one of Apple's selling pitches that they'd review each app for malicious activity before it makes it to the app store?

They couldn't. Apple does not perform any meaningful review of apps for malicious activity; they do it for rent seeking.

[+] perttir|3 years ago|reply

I used to develop an Apache Cordova application that had strong obfuscation using javascript-obfuscator. Apple didn't care.

[+] pjmlp|3 years ago|reply

They can't and most likely would kick the app out of the store, hence why this is the Website code.

[+] unknown|3 years ago|reply

[deleted]

[+] slimebot|3 years ago|reply

Yeah I'd also like to understand why they are doing this.

Everyone expects these sites to scrape as much personal information as possible (China did not invent that, they are following), but beyond that any additional imagined state-run initiative would be server side, right? What is worth hiding in the front end beyond preventing people from re-using their code? (A VM would be overkill for that, as light obfuscation would be enough.)

[+] unknown|3 years ago|reply

[deleted]

[+] sposeray|3 years ago|reply

[deleted]

[+] ActionHank|3 years ago|reply

Instead of writing a custom decompiler, it might be a bit quicker to pose it as a question to ChatGPT.

103 comments