knolax|3 years ago
Your argument uses a website whose code was written by English speakers. There would still be ASCII, but verbose element names like "vector-page-toolbar-container" would definitely be shorter in both UTF-8 and UTF-16 if they weren't written in English.
Beltalowda|3 years ago
Plus many identifiers come from libraries, and when creating their own identifiers many people use either full English or partial English no matter what language (it was a huge mistake to not use English for many identifiers in my first programming job, as you will invariably end up with a mishmash of two languages).
But it is easy enough to verify this with some actual websites: https://www.rakuten.co.jp is 330K in UTF-8 and 625K in UTF-16, https://ameblo.jp is 104K in UTF-8 and 187K in UTF-16, baidu.com is 360K in UTF-8 and 717K in UTF-16, sina.com.cn: 455K, 854K, daum.net: 666K, 1.2M.
And all of that is only the HTML document; if we'd add up the CSS – where there's almost no possibility to use non-ASCII outside of class and ID names – and JavaScript – where the filesize is usually dominated by React or jQuery or whatnot – things would skew even more in favour of UTF-8.
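For anyone who wants to repeat the comparison, here is a minimal sketch of how such numbers can be measured. The helper name `encoded_sizes` and the sample fragment are made up for illustration; they are not taken from any of the sites listed above, and the exact byte counts for a real page depend on how it is fetched and decoded.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so the 2-byte byte-order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical fragment: ASCII markup wrapped around a short Japanese string.
sample = '<div class="item-title"><a href="/item/1">日本語のタイトル</a></div>'
sizes = encoded_sizes(sample)
print(sizes)
```

To check a real page, the same function can be applied to the decoded text of a saved HTML file.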
I'm sure there are examples where a page served over UTF-16 is smaller, such as pages with very little markup (like e.g. HN), but that is not the common case, even for websites written exclusively in, and for users of, CJK languages. Someone who does not speak a word of English will save many bytes of data every day with UTF-8. There's a reason all those websites are served over UTF-8 and not UTF-16.
But for the sake of the argument, let's replace all class="...", id="..", and data-event-name=".." with strings of the same length consisting of "回". That grows the filesize from 118K to 151K, and ... it's still larger in UTF-16 with 207K. We could start replacing more stuff and eventually UTF-16 may win, maybe. But you have to use a lot of CJK. Let's use a random excerpt:
Has 91 7-bit characters and 60 multibyte ones (this includes indentation, which may not be represented 100% accurately here). If we do the math, UTF-8 still wins.
And to repeat, there are certainly cases where UTF-16 is smaller. Markdown documents and other plain text files are an obvious one, but HTML is rarely one of them.
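The arithmetic for the excerpt (91 seven-bit characters, 60 multibyte ones) can be sketched as follows. This assumes every multibyte character is a Basic Multilingual Plane character such as "回", which takes 3 bytes in UTF-8 and 2 bytes in UTF-16; the counts themselves are the ones quoted in the comment.

```python
ascii_chars = 91      # 7-bit characters in the excerpt (count from the comment)
multibyte_chars = 60  # non-ASCII characters in the excerpt (count from the comment)

# Assumption: each multibyte character is in the BMP,
# so it costs 3 bytes in UTF-8 and 2 bytes in UTF-16.
utf8_bytes = ascii_chars * 1 + multibyte_chars * 3
utf16_bytes = (ascii_chars + multibyte_chars) * 2

print(utf8_bytes, utf16_bytes)  # 271 302
```

Under the same assumption, UTF-8 comes out smaller whenever the ASCII characters outnumber the CJK ones, since a + 3m < 2a + 2m reduces to m < a.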
But imagine actually checking things before making a claim...
gary_0|3 years ago
[0] http://utf8everywhere.org/#asian
knolax|3 years ago
You're missing the point entirely; the number of characters you used is enough for 2 or 3 sentences. This was not an example constructed in good faith.