top | item 23059638

(no title)

chrfrasco | 5 years ago

Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. It's quite late in Sydney as I write this so I'm not up for a fix right now, but this post is probably Fake News. The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

The one-two combo of 1) better performance on linux & 2) jemalloc seeming to fix the issue lured me into believing that the allocator was to blame. I’m not sure what the lesson here is – perhaps more proof of Cunningham’s law? https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_L...

discuss

order

arcticbull|5 years ago

Thanks for following up. Just as an FYI, there's a few bugs in your implementation, the most obvious one is the use of ".len()" in a number of places interspersed with ".chars().count()". These two return different values. ".len()" returns then number of UTF-8 bytes in the input string, which for ASCII is the same as ".chars().count()" obviously, but if you do attempt any Unicode characters, your function won't work. ".chars()" provides Unicode Scalar Values (USVs) -- which is a subset of code points, excluding surrogate pairs [1]. Note also this is not the same as a Go rune, which is a code point including surrogate pairs.

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.

I've fixed the issues if you're curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

Unrelated, I hate the term "fake news" as it's an intentional attempt to destroy the world public's faith in news media. It's a cancer on civilized society. Somewhere your civics teacher is crying into some whiskey, even though of course you're joking.

[1] http://www.unicode.org/glossary/#unicode_scalar_value

arcticbull|5 years ago

This doesn't even begin to get into the question of what Levenshtein Distance even means in a Unicode context. What's the Levenshtein Distance of 3 emoji flags? I suppose we should be segmenting by grapheme clusters and utilizing a consistent normalization form when comparing, but Rust has no native support for processing grapheme clusters -- or for normalizations I believe. The UnicodeSegementation crate might help.

Based on some cursory research, the go version differs in a more subtle way too. A Rune is a Code Point, which is a superset of the Rust "char" type; it includes surrogate pairs.

afiori|5 years ago

(related to the unrelated part) what if the media is corrupt? I mean, independently on current events (I really don't want to enter that here) we do live in a world where very few amoral corporations own most of the media industry.

If we (correctly) rely on the media to bring to public attentions relevant facts (both criminal and non-criminal) and keep a watchful eye on the nation who then keeps a watchful eye on the media?

is the model entirely based on always being there enough good journalist to spot the bad ones? how is this affected by the very precarious economics of current internet ads-based venture-funded media enterprises?

I just blurted too many questions... what I am trying to say is that similarly with the police there is not as easy answer in shoud-trust should-not-trust (in the US a supreme Court judge advised to "not talk to the police").

in that case I guess part of the problem is that the job of the police can be miscontrued as "arresting people". in the same way the job of a journalist can be miscontrued as "getting clicks"

overall I don't think we can pass an a priori moral judgement on that term, as essentially represent a statement that the default safety measures have failed.

(I want to reiterate that here I try not to intermingle my point with whether I believe or not that the current use is warranted, I am just trying to say that as a concept it needs to be part of an healthy democracy, the same as some distrust in electoral promises)

dcow|5 years ago

"Fake News" entered the realm (not even recently, might I add) of popular misuse. Actually, is there a term for: language/grammar incorrectly used because society has developed a familiar "meme" use?

Common examples:

* Look at this dank "meme".

meme has come to mean "a picture shared on the internet that has words on it".

* Let's [have a] "cheers".

It's a toast. You say "cheers" when you toast.

* You missed Suzie and I's party last night.

It's Suzie and my party. This one is particularly annoying because it's made it way past editors and into writing, screenplay, etc.

masklinn|5 years ago

> Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. […] The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

I don't know that it would be a gain: Rust is pretty good at decoding UTF8 quickly given how absolutely fundamental that operation is, and "caching" the decoded data would increase pressure on the allocator.

Unless you also changed the interface of the levenshtein to hand it preallocated cache, source and destination buffers (or the caller did that).

edit: to the downvoter, burntsushi did the legwork of actually looking at this[0] and found caching the decoding to have no effect at best, unless the buffers get lifted out of the function entirely, which matches my comment's expectations.

[0] https://news.ycombinator.com/item?id=23059753

> But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

otterley|5 years ago

> this post is probably Fake News

It's not Fake News. Fake News is the publication of intentionally false stories. This is just erroneous.

There's a yawning chasm between the two.

arcticbull|5 years ago

"fake news" is an attempt at permanently damaging the American public's faith in the fifth estate. Nobody should ever use that term in the current political climate, ever.

will4274|5 years ago

Says who?

When news organizations take other news organizations word for it and the story is false, that's fake news. We called it something different back then, but fake news led to the invasion of Iraq. Negligence is sufficient for fake news, malice not required.

Ar-Curunir|5 years ago

It’s a joke mate