Show HN: Detecting the Programming Language of Code Using ML and Neural Networks

[+] riscy|9 years ago|reply

From my testing, this classifier seems heavily influenced by the choice of identifiers in the program, which is not necessarily a hallmark of the language but of the particular programs it was trained on. Here's a made-up C function that uses Python's colon + whitespace for the body instead of braces:

    int main(int* x):
        return *x + ((int)x);

This program is classified as C (84% confidence), but if I change 'main' to 'elephant' it drops to 49% C, 41% Python. If I change the name to 'snake' it suddenly becomes Python code with 80% confidence. I guess Python programmers really like talking about animals? :)

This would be more interesting if it were able to accurately classify programs by what family of syntax they're most similar to (Lisp, ML, C, etc). Otherwise, at first glance this seems like another needless application of something complicated like neural networks, since languages already have a formal grammar describing their structure and reserved tokens/symbols that you could take advantage of.

[+] pprogrammer|9 years ago|reply

Interesting example. It isn't valid Python or C, though, so you wouldn't find anything like it in the dataset. The model also isn't trained on small code snippets (only on running code), which is probably a weak spot at the moment.

I agree that a lot would be possible using formal grammars, but it would probably be a lot of work to maintain all the parsers. Some languages also share a lot of syntax, so in some cases this may lead to ambiguity. Maybe you could combine both approaches in that case.

[+] Varinius|9 years ago|reply

Scala 0.98

    #define def int
    
    def main(def x) {
        return x;
    }

[+] sushisource|9 years ago|reply

Oh man, that's a dirty trick. I like it.

[+] dmichulke|9 years ago|reply

Shouldn't one just use a naive Bayes classifier on top of frequency-weighted language keywords?

The data would then really only be needed to guess the frequencies.

I am also not sure whether word bigrams would help a lot, and character-level n-grams would be totally off my radar.

Any opinions on that?
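
For readers who want to try the suggestion above, here is a minimal sketch of a naive Bayes classifier over token counts. It is not the project's code: the tiny `TRAINING` corpus, the tokenizer regex, and the function names are all assumptions made for illustration, and a real experiment would need many files per language.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (an assumption, not the project's dataset).
TRAINING = {
    "python": ["def f(x):\n    return x\n", "import os\nfor k in d:\n    print(k)\n"],
    "c": ["int main(void) {\n    return 0;\n}\n",
          "#include <stdio.h>\nvoid f(int x) { return; }\n"],
    "ruby": ["def f(x)\n  puts x\nend\n", "x = [1, 2]\nputs x\n"],
}

TOKEN = re.compile(r"[A-Za-z_]+")  # keywords and identifiers only


def tokenize(code):
    return TOKEN.findall(code)


def train(corpus):
    """Count how often each token occurs per language."""
    counts = {
        lang: Counter(tok for sample in samples for tok in tokenize(sample))
        for lang, samples in corpus.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab


def classify(code, counts, vocab):
    """Return the language with the highest log-likelihood (uniform prior)."""
    scores = {}
    for lang, freq in counts.items():
        total = sum(freq.values())
        # Laplace smoothing keeps unseen tokens from zeroing out a language.
        scores[lang] = sum(
            math.log((freq[tok] + 1) / (total + len(vocab)))
            for tok in tokenize(code)
            if tok in vocab
        )
    return max(scores, key=scores.get)


COUNTS, VOCAB = train(TRAINING)
print(classify("puts x", COUNTS, VOCAB))  # ruby
```

Because this sketch counts every alphabetic token, identifiers influence the result as much as keywords do, which is the same effect described at the top of the thread.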

[+] pprogrammer|9 years ago|reply

That would probably work OK, I guess, but some languages share the same keywords, so I don't think you'll get very high accuracy. I also tried word-level n-grams with linear classifiers, and I couldn't get very high accuracy out of them. Higher values of n do improve accuracy a lot if you train with a lot of data, for both word-level and character-level models.
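
To make the two feature types concrete, here is a small sketch of word-level and character-level n-gram extraction. The function names are made up for this example, and splitting words on whitespace is an assumption; the actual model's tokenization is not described in the thread.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every run of n consecutive characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count every run of n consecutive whitespace-separated words."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


print(char_ngrams("x << 4", 2)["<<"])        # 1
print(word_ngrams("puts x", 2)[("puts", "x")])  # 1
```

Character n-grams pick up operators and punctuation such as `<<` that a keyword list would miss, at the cost of a much larger feature space as n grows.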

[+] gkbrk|9 years ago|reply

It incorrectly classifies the following valid C++ code as JavaScript. It seems to be thrown off by the use of console.log rather than classifying based on the syntax.

    #include <iostream>

    class Logger {
        public:
            void log(std::string a) {
                std::cout << a << std::endl;
            }
    };

    int main() {
        Logger console = Logger();
        for (int i=0;i<10;i++) {
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
        }
    }

[+] leoh|9 years ago|reply

Interesting case:

    x = [1,2,3]
    x << 4

> haskell: 0.94, ruby: 0.03, swift: 0.01

    x = [1,2,3]
    x << 4
    puts x

> ruby: 0.89, haskell: 0.10, swift: 0.00

[+] pprogrammer|9 years ago|reply

puts is probably much more informative than <<

[+] scott_s|9 years ago|reply

Have you only tested syntactically valid code? For example, the code snippet:

    for (int i = 0; i ? N; ++i) {
        a[i]->val = i;
    }

is clearly either C or C++, but it's not valid in either language because "i ? N" is not a valid boolean expression in the for loop. I actually think your techniques would work on syntactically invalid code, but I'm curious whether you tried.

[+] pprogrammer|9 years ago|reply

Probably almost all of the samples are syntactically valid, but it probably works OK for invalid code. Adding or removing symbols can have an impact on the results though, as those may occur more frequently in other languages.

[+] ZanyProgrammer|9 years ago|reply

Maybe not valid C code either, depending on the version, because of the int declared in the loop header (that needs C99 or later)?

[+] sweezyjeezy|9 years ago|reply

This seems like it's probably massive overkill. Have you compared it with a simple bag-of-words linear classifier?

[+] nicolewhite|9 years ago|reply

Not necessarily overkill; maybe they wanted to learn how to use neural networks and this was a good project to do that. Simple is best for professional projects, but for personal projects like this it's better for you to explore something you don't know already for learning purposes.

[+] pprogrammer|9 years ago|reply

I tried a linear classifier, yes. It works decently too, but has lower accuracy when using a big dataset (roughly 98% top-1 accuracy vs. 99.4%).
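
As a rough illustration of what a bag-of-words linear classifier can look like, here is a minimal multiclass perceptron sketch. It is an assumption-laden toy, not the baseline that was actually tested: the three training samples, the tokenizer regex, and the function names are invented for this example.

```python
import re
from collections import Counter, defaultdict


def bag_of_words(code):
    """Count identifiers/keywords and single punctuation characters (digits are skipped)."""
    return Counter(re.findall(r"[A-Za-z_]+|[^\sA-Za-z_0-9]", code))


def predict(feats, weights, langs):
    """Pick the language whose weight vector scores the features highest."""
    return max(langs, key=lambda lang: sum(weights[lang][t] * c for t, c in feats.items()))


def train_perceptron(samples, epochs=10):
    """Multiclass perceptron: on a mistake, reward the true class and penalize the guess."""
    weights = defaultdict(lambda: defaultdict(float))
    langs = sorted({lang for _, lang in samples})
    for _ in range(epochs):
        for code, lang in samples:
            feats = bag_of_words(code)
            guess = predict(feats, weights, langs)
            if guess != lang:
                for tok, cnt in feats.items():
                    weights[lang][tok] += cnt
                    weights[guess][tok] -= cnt
    return weights, langs


# Hypothetical toy samples; far too small to say anything about real accuracy.
SAMPLES = [("puts x", "ruby"), ("print(x)", "python"), ("printf(x);", "c")]
WEIGHTS, LANGS = train_perceptron(SAMPLES)
print(predict(bag_of_words("puts x"), WEIGHTS, LANGS))  # ruby
```

Top-1 accuracy, the metric quoted above, is then just the fraction of held-out samples for which `predict` returns the labeled language.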

[+] wwwigham|9 years ago|reply

Did the JS corpus not contain any ES6 syntax? The statement

    let x = {}

doesn't even register as JS.

[+] pprogrammer|9 years ago|reply

It probably does not contain a lot of ES6 syntax, and the let keyword probably occurs a lot more in other languages (Swift, Haskell, Lua). You could try adding more code to disambiguate it; for example, just adding a semicolon already moves JavaScript up to second place.

[+] haldean|9 years ago|reply

It would be interesting to set up some sort of similarity comparison between languages; then, if you introduced the date of release for the various languages, you might get a cool tree of which-languages-influenced-what. Very cool project!

[+] pprogrammer|9 years ago|reply

This is a great idea! I did think of something like this, because similar languages often end up in the top-3 of the classifier.
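
One possible starting point for such a comparison is cosine similarity between per-language token-frequency profiles. The sketch below assumes that approach; the `PROFILES` counts are made-up numbers for illustration only, not measurements from the classifier or its dataset.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count profiles (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical token counts per language (invented for this example).
PROFILES = {
    "c": Counter({"int": 5, "return": 3, "{": 4, "}": 4}),
    "cpp": Counter({"int": 4, "return": 3, "{": 4, "}": 4, "class": 2}),
    "ruby": Counter({"def": 4, "end": 5, "puts": 3, "return": 1}),
}

for other in ("cpp", "ruby"):
    print(other, round(cosine(PROFILES["c"], PROFILES[other]), 2))
```

With real profiles, a pairwise similarity matrix like this could be fed into hierarchical clustering and then combined with release dates to sketch the influence tree suggested above.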

[+] mcphage|9 years ago|reply

Given the >99% accuracy rating, it would be interesting to see some examples of code it gets incorrect.

[+] radarsat1|9 years ago|reply

Would love to see this for detecting which markup language was used for `README`.

[+] pprogrammer|9 years ago|reply

Great idea! I'll think about this when I add more languages.

28 comments