item 46187698 (no title)

ccgreg|2 months ago
Common Crawl is a text-only crawl.

mirandrom|2 months ago
I'm not so sure, they say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime...
In any case, all the images were external CloudFront URLs that have not been archived anywhere afaict.

ccgreg|2 months ago
Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.