sandius's comments

sandius | 9 years ago | on: Ask HN: How Can I Get into NLP (Natural Language Processing)?

NLP is a huge topic, and the choice of materials pretty much depends on what you'd like to focus on. In my experience nothing beats a good textbook, especially if you do the exercises.

The classic NLP textbook is

* Jurafsky, Martin: "Speech and Language Processing" (https://web.stanford.edu/~jurafsky/slp3/) -- already mentioned here: a very solid overview textbook to give you an idea about the field;

Should you be interested in statistical NLP (even if it probably isn't as sexy as it used to be), the classic there is:

* Manning, Schütze: "Foundations of Statistical Natural Language Processing" (http://nlp.stanford.edu/fsnlp/).

sandius | 9 years ago | on: Show HN: Rembulan, an implementation of Lua 5.3 for the JVM

I haven't seen that much written about this either. My only explanation is that it isn't generally seen as a problem worth solving, for two definitions of "worth": either it's too difficult (or even impossible), or a solution to it isn't needed. I tend to disagree with both.

Btw, thanks for the discussion!

About System.gc(): at least in the Oracle JVM, I think so. But then look at what it says in the JavaDoc: "When control returns from the method call, the Java Virtual Machine has made a best effort to reclaim space from all discarded objects." A no-op interpretation can indeed be a best effort (as in, "there's nothing to be done!") :) But since it's the only place in the JDK that gives at least some kind of a GC-related guarantee, it's the best there is.

These pastures may be greener in other JVMs, though!

sandius | 9 years ago | on: Show HN: Rembulan, an implementation of Lua 5.3 for the JVM

I think it would be a good start.

Notice that the value of the heap counter would at all times be an upper bound on the actual memory usage. In other words, we may always assume that there's unreclaimed garbage memory that we're still counting. But that also means that if we throw an error if and only if we detect that the limit has been exceeded, the only error we can make is a false positive. For sandboxing scenarios, that's at least some good news: what we definitely don't want are false negatives, and we won't get those.

Now, when we've detected that the limit has been exceeded, we don't have to signal an error immediately. The execution can be suspended (as with CPU accounting). We can resume the execution any time later, or terminate it with an out-of-memory error. What we actually do depends on the application: we could simply decide that getting false positives is a risk worth living with and terminate the program; we could try calling System.gc() and check the limit afterwards; we could increase the limit temporarily and check again in X ticks to see whether this was just a spike in allocations, and so on.
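The "suspend, then decide" idea above could be sketched roughly like this. Everything here, including the class and policy names, is hypothetical illustration rather than Rembulan's actual API:

```java
// Hypothetical sketch: when the heap counter trips, execution is
// suspended and the host application picks one of several reactions.
enum LimitPolicy { TERMINATE, GC_AND_RECHECK, GRACE_PERIOD }

final class HeapLimitHandler {
    private final LimitPolicy policy;

    HeapLimitHandler(LimitPolicy policy) {
        this.policy = policy;
    }

    // Returns true if execution may be resumed, false if it should be
    // terminated with an out-of-memory error.
    boolean onLimitExceeded(long usedBytes, long limitBytes) {
        switch (policy) {
            case TERMINATE:
                // Accept the false-positive risk and kill the program.
                return false;
            case GC_AND_RECHECK:
                // Ask the JVM for a best-effort collection, then recheck;
                // in practice the caller would re-read the (possibly now
                // lower) counter rather than reuse the stale argument.
                System.gc();
                return usedBytes <= limitBytes;
            case GRACE_PERIOD:
                // Tolerate the spike for now; recheck after X ticks.
                return true;
            default:
                return false;
        }
    }
}
```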

I see the chances of getting to a 0% false-positive rate as close to nil (at least on a generic JVM), but I tend to think that such a technique could go a long way.

Regarding other solutions to this problem out there today: I agree! Using a programming language that evades these problems entirely would be a solution, but I'd be skeptical about the chances of persuading users that it's for their benefit.

sandius | 9 years ago | on: Show HN: Rembulan, an implementation of Lua 5.3 for the JVM

Thanks!

Yes, indeed: there is LuaJ and Kahlua, but neither is able to do Lua 5.3 (they both support 5.2). I actually started by trying to update LuaJ to 5.3, but in the end I found myself disagreeing with some of the fundamental design choices (such as the mapping of Lua coroutines to Java threads), and decided to start from scratch. Lua is a small and well-designed programming language, so it seemed doable. In the process I made some controversial design choices myself :)

Rembulan compiles Lua 5.3 sources to Java bytecode directly -- it's easier (and probably faster) to do it that way, rather than compiling to Java and then using javac to get the bytecode. For instance, there is no "goto" in the Java programming language, but the Java bytecode does have unconditional jumps. Besides, Rembulan needs to be able to suspend a Lua call at almost any point, and restore it later. So even if I went via generating Java sources first, those sources would be very human-unfriendly (not readable or useful).

Writing a Lua parser + compiler is actually quite straightforward. Everything you need -- both syntax and semantics -- is described in the Lua Reference Manual. I used to have a PUC-Lua bytecode loader and compiler (i.e., a recompiler that would read PUC-Lua bytecode and emit Java bytecode), but it was for bootstrapping only -- once I'd verified that my compiler covered the entire language, I got rid of it. That said, I think neither LuaJ nor Kahlua actually requires PUC-Lua. LuaJ uses a parser/compiler that looks like it was ported from C (i.e., it looks suspiciously similar to the PUC-Lua sources). But it generates PUC-Lua bytecode rather than Java bytecode directly; going to Java bytecode is a separate (optional) step.

Regarding sandboxing: Heap limits are tricky, since the JVM doesn't give you direct control over the GC, and as far as I'm aware it isn't possible to "segment" the JVM's heap into smaller chunks. So it seems that the only way to do this is to do some bookkeeping, and keep track of new tables (and new table entries), coroutines and userdata, updating the "heap counter" once they are GC'd. (Plus of course throwing an exception once the limit has been exceeded.) I haven't worked on that yet, but the infrastructure should be more or less ready for this kind of approach.
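The bookkeeping described above might look something like the following: charge an estimated size to a shared counter when a tracked object is allocated, and credit it back once the object is actually collected (detected here with java.lang.ref.Cleaner). This is a sketch under stated assumptions -- the class, the size estimates, and the use of Cleaner are all illustrative, not how Rembulan does it:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical heap-accounting sketch: the counter is an upper bound
// on actual usage, because uncollected garbage stays charged until
// its Cleaner action runs. Exceeding the limit can therefore only
// ever be a false positive, never a false negative.
final class HeapAccountant {
    private static final Cleaner CLEANER = Cleaner.create();

    private final AtomicLong used = new AtomicLong();
    private final long limit;

    HeapAccountant(long limitBytes) {
        this.limit = limitBytes;
    }

    // Charge `estimatedBytes` for `obj`; the charge is reversed when
    // the object becomes unreachable and is collected.
    void charge(Object obj, long estimatedBytes) {
        used.addAndGet(estimatedBytes);
        CLEANER.register(obj, () -> used.addAndGet(-estimatedBytes));
        if (used.get() > limit) {
            throw new IllegalStateException("heap limit exceeded");
        }
    }

    long usedBytes() {
        return used.get();
    }
}
```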

Stack limits are easier: increment a counter on (non-tail) call, decrement it on return, throw an error if the limit is exceeded. I haven't implemented it yet, though.
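The stack-accounting idea is simple enough to sketch in a few lines; the names here are illustrative, not Rembulan's actual API:

```java
// Hypothetical sketch: bump a counter on every (non-tail) call entry,
// decrement on return, and throw once a configured limit is exceeded.
final class CallDepthLimiter {
    private final int limit;
    private int depth;

    CallDepthLimiter(int limit) {
        this.limit = limit;
    }

    // Called on every non-tail call entry.
    void enterCall() {
        if (++depth > limit) {
            depth--;  // keep the counter consistent for any error handler
            throw new IllegalStateException("stack limit exceeded: " + limit);
        }
    }

    // Called on every return, whether normal or exceptional.
    void exitCall() {
        depth--;
    }

    int currentDepth() {
        return depth;
    }
}
```

Tail calls are deliberately not counted, matching Lua's guarantee that proper tail calls consume no stack.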

I think Rembulan might work quite well in a backend of a massive multiplayer game, especially one that the players may program (to a degree) themselves. Something along the lines of Grobots, but programmed in Lua instead of Forth :)
