top | item 38740953

pvankessel | 2 years ago

This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint, and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, so I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones, but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...

m463|2 years ago

> Google started locking down the API almost immediately after we published our study

Isn't this ironic, given how google bots scour the web relentlessly and hammer sites almost to death?

LeonM|2 years ago

> google bots scour the web relentlessly and hammer sites almost to death

I have been hosting sites and online services for a long time now, and I've never had this problem or heard of this issue before.

If your site can't even handle a crawler, you need to seriously question your hosting provider, or your architecture.

LocalH|2 years ago

"Rules for thee, but not for me"

dotandgtfo|2 years ago

This is one of the most important parts of the EU's upcoming Digital Services Act, in my opinion. Platforms have to share data with (vetted) researchers, public-interest groups, and journalists.

MBCook|2 years ago

This would find things like unlisted videos which don’t have links to them from recommendations.

trogdor|2 years ago

That’s a really good point. I wonder if they have an estimate of the percentage of YouTube videos that are unlisted.

0x1ceb00da|2 years ago

This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, and count the number of tagged fish in the second batch.)

pants2|2 years ago

That's the Lincoln-Petersen estimator. You can use the same approach to estimate the number of bugs in your code, too: if reviewer A catches 4 bugs and reviewer B catches 5, with 2 in common, the Lincoln-Petersen estimator suggests there are 10 bugs in total (7 caught, 3 uncaught).
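
The arithmetic can be sketched as a small helper (the 4/5/2 figures are the comment's hypothetical numbers, not real data):

```python
def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Estimate a population size from two independent samples.

    n1: items found in the first sample (e.g. reviewer A's bugs)
    n2: items found in the second sample (reviewer B's bugs)
    overlap: items found in both samples
    """
    if overlap == 0:
        raise ValueError("no overlap: the estimate is unbounded")
    return n1 * n2 / overlap

# Reviewer A finds 4 bugs, reviewer B finds 5, 2 in common:
print(lincoln_petersen(4, 5, 2))  # -> 10.0 estimated bugs total
```

The intuition: the fraction of B's finds that A had already found (2/5) estimates the fraction of all bugs that A found, so A's 4 bugs represent about 40% of the total.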

justinpombrio|2 years ago

That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= caught / tagged
    (where caught=100 in the example)
In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.
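
A minimal sketch of that flipped estimator (the 11-character/64-bit ID encoding and the example numbers are assumptions drawn from this thread, not the authors' actual code):

```python
import base64
import secrets

ID_SPACE = 2 ** 64  # assumed size of the YouTube video-ID space


def random_video_id() -> str:
    """Encode 64 random bits as an 11-character URL-safe base64 string."""
    raw = secrets.token_bytes(8)  # 64 random bits
    return base64.urlsafe_b64encode(raw).decode()[:11]  # drop '=' padding


def estimate_total(valid: int, tried: int) -> float:
    """total ~= (valid / tried) * 2^64: the observed hit rate scaled up."""
    return valid / tried * ID_SPACE
```

If, hypothetically, 1 in every 50 billion random IDs resolved to a real video, the estimate would come out to roughly 3.7e8 videos.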

zellyn|2 years ago

Do you get the same 100 dumb fish?

dclowd9901|2 years ago

I made the same connection but it’s still the first time I’ve seen it used for reverse looking up IDs.

neurostimulant|2 years ago

> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up the stats, though? It's like a lake monster randomly swapping an untagged fish for a tagged fish as you catch them.

fergbrain|2 years ago

Isn’t this just a variation of the Monte Carlo method?

layer8|2 years ago

That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.

gaucheries|2 years ago

I think YouTube locked down their APIs after the Cambridge Analytica scandal.

herval|2 years ago

in the end, that scandal was the open web's official death sentence :(

nextaccountic|2 years ago

In which ways were the Cambridge Analytica thing and the openness of Youtube APIs (or other web APIs) related? I just don't see the connection

pvankessel|2 years ago

They actually held out for a couple of years after Facebook and didn't start forcing audits and cutting quotas until 2019/2020

hipadev23|2 years ago

[deleted]

blackle|2 years ago

It is a little more sophisticated. They say they use an exploit that was found where a URL of five characters containing a dash will get autocompleted by YouTube (I wonder why that is). That apparently improves sampling efficiency by about 32,000 times.
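
For a back-of-the-envelope sense of why that factor matters (every number below is my assumption for illustration, not from the paper): a ~32,000x speedup is almost exactly 2^15, consistent with each crafted query covering about 2^15 candidate IDs at once.

```python
# Assumed figures, for scale only:
ID_SPACE = 2 ** 64   # size of the YouTube video-ID space
VIDEOS = 10 ** 10    # suppose ~10 billion videos exist
BATCH = 2 ** 15      # IDs covered per crafted query (~32,768, i.e. ~32,000x)

p_hit = VIDEOS / ID_SPACE                  # chance one random ID is a real video
guesses_per_hit = 1 / p_hit                # naive guessing: ~1.8 billion tries per hit
queries_per_hit = guesses_per_hit / BATCH  # with the trick: ~56,000 queries per hit

print(f"{guesses_per_hit:.2e} raw guesses vs {queries_per_hit:.2e} queries per hit")
```

Under these assumptions, the trick turns an infeasible brute-force search into something a research crawler can actually sustain.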