Interning is a trade-off: you get decreased memory usage (if you use a lot of long-lived duplicated strings) and faster string comparisons (for pairs of strings that you do intern), at the cost of extra work to create the interned strings.
The problem is that it is almost impossible to decide in advance whether interning makes sense. About the only heuristic I know is that, if you run out of memory and use a lot of strings, you can think “let’s give interning a try”.
Even if you know the answer, you cannot tell libraries you use to intern strings they create. So, if your libraries create lots of long-lived strings, you do not have control over whether they intern strings.
Also, if you write a library, the call on whether to intern is impossible to make, because you cannot know whether your callers prefer speed or low memory usage, or whether the objects you return to them will be long-lived.
For example, if you write an XML parser library, interning tag names may be the best guess, but libraries that do not intern may beat you in short benchmarks.
Because of that, I think it would be more useful either to have some way to globally install a ‘should I intern this string?’ handler (but what kind of information should that get, so that it can make an informed decision?), or to have the garbage collector intern strings as it sees fit. The problem with the latter is that it can change the behaviour of programs that compare strings using ‘are the same object’ comparisons. Maybe that feature should be removed from your language.
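To make the trade-off concrete, here is a minimal sketch of a hand-rolled pool, of the kind an XML parser might use for tag names. The class name `TagNamePool` and its methods are hypothetical, invented for illustration, and not taken from any library mentioned in this thread. Each call to `intern` pays for a hash lookup, which is the "extra work" side of the trade-off.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical hand-rolled pool; names are assumptions made for this sketch.
final class TagNamePool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering s itself if it is new.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        TagNamePool tags = new TagNamePool();
        // new String(...) forces two distinct objects with equal contents,
        // as a parser would produce when it sees the same tag twice.
        String first = new String("paragraph");
        String second = new String("paragraph");

        String a = tags.intern(first);
        String b = tags.intern(second);

        System.out.println(first == second); // false: two separate allocations
        System.out.println(a == b);          // true: both map to one canonical object
        System.out.println(tags.size());     // 1
    }
}
```

Note that this pool keeps a strong reference to every string it has ever seen, so it only pays off when the duplicates really are long-lived.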
> Interning is a trade-off: you get decreased memory usage (if you use a lot of long-lived duplicated strings) and faster string comparisons (for pairs of strings that you do intern), at the cost of extra work to create the interned strings.
For garbage-collected languages the benefit isn't only memory consumption, it's also performance, since using canonical objects eliminates the additional allocations and garbage collections.
The string comparison argument is a bit of a dubious one though. Comparing length is only an integer comparison, and depending on the architecture, you can compare up to eight (or more) characters per cycle.
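As a rough sketch of why the comparison argument is weaker than it sounds, an equality check can test identity first, then length, and only then scan the contents. The helper `sameText` below is hypothetical and written out by hand for illustration. It is not a claim about how any particular runtime implements string equality.

```java
// Hypothetical helper showing the order of checks; not a library API.
final class CompareSketch {
    static boolean sameText(String a, String b) {
        if (a == b) return true;                    // identity: always hits for interned pairs
        if (a == null || b == null) return false;
        if (a.length() != b.length()) return false; // a single integer comparison
        for (int i = 0; i < a.length(); i++) {      // contents are scanned only as a last resort
            if (a.charAt(i) != b.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String interned = "tag";
        String copy = new String("tag");
        System.out.println(sameText(interned, interned)); // true, via the identity check
        System.out.println(sameText(interned, "tags"));   // false, via the length check
        System.out.println(sameText(interned, copy));     // true, after a full scan
    }
}
```

The only case where interning saves real work is the last one: two distinct objects with equal contents, where the full scan has to run.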
>You can see why Java needs these types. With permanent default interning, any sort of sequence involving character-level appends, such as reading the contents of a book from a file, would result in a preposterous O(n²) version of an otherwise trivial technique.
I don't think this is true. It looks like Java only interns string literals by default. [1] If you get a string another way (user input, for example), it's not interned unless you call the String.intern() method.
Yeah, the explanation in this article about Java's string interning made me lose confidence that the author knew what they were talking about. If you compare Strings, you should use .equals(); otherwise you risk a logic error unless you are very careful. Indeed, you can deliberately un-intern a string by doing new String("my string") (with little realistic value).
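A short demonstration of the behaviour described in the two comments above, using only standard `java.lang` calls. The class name `LiteralInternDemo` is an arbitrary choice for this sketch.

```java
// Demonstrates which Java strings share an object and which do not.
final class LiteralInternDemo {
    public static void main(String[] args) {
        String literal = "my string";
        String sameLiteral = "my string";
        String copy = new String("my string");                               // deliberately un-interned
        String built = new StringBuilder("my ").append("string").toString(); // created at run time

        System.out.println(literal == sameLiteral);    // true: literals are interned
        System.out.println(literal == copy);           // false: distinct object
        System.out.println(literal == built);          // false: run-time strings are not interned
        System.out.println(literal.equals(copy));      // true: contents match
        System.out.println(literal == copy.intern());  // true: intern() returns the canonical object
        System.out.println(literal == built.intern()); // true
    }
}
```

This is why `==` on strings only works by accident in Java unless every string involved has been interned, and why `.equals()` is the safe default.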
Has anyone compared performance of interned and non-interned versions of a large, string-heavy application? Seems like one of those things that might not be worth it.
> This is most noticeable when writing objects to disk or sending them across the network. As soon as the process needs to communicate the scheme breaks down.
As another point in this space, X Windows moves things out - processes can create "ATOMs" registered with the server.
Someone | 10 years ago
Also note that modern Java allows one to tweak its interning behaviour a bit (http://java-performance.info/string-intern-in-java-6-7-8/)
uxcn | 10 years ago
wtetzner | 10 years ago
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...
richardwhiuk | 10 years ago
chrisohara | 10 years ago
There are also some bindings for Go: https://github.com/chriso/go-intern
dgreensp | 10 years ago
blt | 10 years ago
dllthomas | 10 years ago
hendekagon | 10 years ago
Per_Bothner | 10 years ago
Kerf's idea of a per-object intern pool does not seem as useful for other languages.