
Show HN: Detecting the Programming Language of Code Using ML and Neural Networks

53 points | pprogrammer | 9 years ago | danielheres.space

28 comments

[+] riscy|9 years ago|reply
From my testing, it seems this classifier is heavily influenced by the choice of identifiers used in the program, which is not necessarily a hallmark of the language, but of the particular programs trained with. Here's a made-up C function that uses Python's colon + whitespace for the body instead of braces:

    int main(int* x):
        return *x + ((int)x);
This program is classified as C (84% confidence), but if I change 'main' to 'elephant' it drops to 49% C, 41% Python. If I change the name to 'snake' it suddenly becomes Python code with 80% confidence. I guess Python programmers really like talking about animals? :)

This would be more interesting if it were able to accurately classify programs by what family of syntax they're most similar to (Lisp, ML, C, etc). Otherwise, at first glance this seems like another needless application of something complicated like neural networks, since languages already have a formal grammar describing their structure and reserved tokens/symbols that you could take advantage of.

[+] pprogrammer|9 years ago|reply
Interesting example. It is not valid Python or C, though, so you wouldn't find such an example in the code (edit: dataset). The model also isn't trained on small code snippets (only on running code), so this is currently probably a weak spot.

I agree that a lot would be possible using a formal grammar, but it would probably be a lot of work to maintain all the parsers. Also, some languages share a lot of the same syntax, so in some cases this may lead to ambiguity. Maybe you could use a combination of both approaches in that case.

[+] Varinius|9 years ago|reply
Scala 0.98

    #define def int
    
    def main(def x) {
        return x;
    }
[+] sushisource|9 years ago|reply
Oh man, that's a dirty trick. I like it.
[+] dmichulke|9 years ago|reply
Shouldn't one just use a naive Bayes classifier on top of the frequency-weighted language keywords?

The data would then really only be needed to guess the frequencies.

I am also not sure whether word bigrams would help a lot, but character level n-grams would be totally off my radar.

Any opinions on that?
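
For readers who want to try this suggestion, here is a minimal sketch of a naive Bayes classifier over token frequencies. It is only an illustration of the idea in this comment, not the model from the submission. The `TRAINING` data, the function names, and the choice to count all word-like tokens (rather than a curated keyword list) are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny hypothetical training set; a real one would be far larger.
TRAINING = {
    "python": [
        "def add(a, b):\n    return a + b",
        "import os\nfor x in range(3):\n    print(x)",
    ],
    "c": [
        "int main(void) {\n    return 0;\n}",
        "#include <stdio.h>\nvoid f(int x) { printf(\"%d\", x); }",
    ],
}

def tokens(code):
    """Split code into word-like tokens (keywords and identifiers alike)."""
    return re.findall(r"[A-Za-z_]+", code)

def train(training):
    """Count token frequencies per language and collect the shared vocabulary."""
    counts = {lang: Counter() for lang in training}
    for lang, samples in training.items():
        for sample in samples:
            counts[lang].update(tokens(sample))
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(code, counts, vocab):
    """Pick the language with the highest log-likelihood.

    Uses add-one smoothing and a uniform prior; tokens never seen in
    training are ignored.
    """
    scores = {}
    for lang, c in counts.items():
        total = sum(c.values()) + len(vocab)
        scores[lang] = sum(
            math.log((c[t] + 1) / total) for t in tokens(code) if t in vocab
        )
    return max(scores, key=scores.get), scores

counts, vocab = train(TRAINING)
best, scores = classify("import os\nprint(os)", counts, vocab)
```

Because identifiers are counted along with keywords, a sketch like this is also sensitive to naming choices, so restricting `tokens` to each language's reserved words would be the closer match to the proposal above.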

[+] pprogrammer|9 years ago|reply
That would probably work OK, I guess, but some languages have the same keywords, so I don't think you'll get very high accuracy. I also tried word-level n-grams with linear classifiers, but I couldn't get very high accuracy out of them. Also, n-grams with higher values of n improve accuracy a lot if you train with a lot of data, both for word-level and for character-level models.
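
To make the two feature types concrete, here is a small sketch of character-level and word-level n-gram counting. The function names are hypothetical, and the submission's actual feature extraction is not described in this thread, so this should be read as one plausible way to build such features.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n):
    """Count all overlapping n-grams of whitespace-separated words."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Character bigrams keep operators such as "<<" that word-level
# tokenizers often discard or split.
bigrams = char_ngrams("x << 4", 2)
pairs = word_ngrams("puts x", 2)
```

Larger values of `n` capture more context per feature but produce far more distinct features, which fits the remark that they only pay off with a lot of training data.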
[+] gkbrk|9 years ago|reply
It incorrectly classifies the following valid C++ code as JavaScript. It seems to be thrown off by the use of console.log rather than classifying based on the syntax.

    #include <iostream>

    class Logger {
        public:
            void log(std::string a) {
                std::cout << a << std::endl;
            }
    };

    int main() {
        Logger console = Logger();
        for (int i=0;i<10;i++) {
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
            console.log("test");
        }
    }
[+] leoh|9 years ago|reply
Interesting case:

    x = [1,2,3]
    x << 4
> haskell: 0.94, ruby: 0.03, swift: 0.01

    x = [1,2,3]
    x << 4
    puts x
> ruby: 0.89, haskell: 0.10, swift: 0.00
[+] pprogrammer|9 years ago|reply
puts is probably much more informative than <<
[+] scott_s|9 years ago|reply
Have you only tested syntactically valid code? For example, the code snippet:

    for (int i = 0; i ? N; ++i) {
        a[i]->val = i;
    }
is clearly either C or C++, but it's not valid C or C++ because "i ? N" is not a valid boolean expression in the for loop. I actually think your techniques would work on syntactically invalid code, but I'm curious if you tried.
[+] pprogrammer|9 years ago|reply
Probably almost all of the samples are syntactically valid, but it probably works OK for invalid code. Adding or removing symbols can have an impact on the results though, as those may occur more frequently in other languages.
[+] ZanyProgrammer|9 years ago|reply
Maybe not valid C code, depending on what version, because of the int in the loop?
[+] sweezyjeezy|9 years ago|reply
This seems like it's probably massive overkill. Have you compared it with a simple bag-of-words linear classifier?
[+] nicolewhite|9 years ago|reply
Not necessarily overkill; maybe they wanted to learn how to use neural networks and this was a good project to do that. Simple is best for professional projects, but for personal projects like this it's better for you to explore something you don't know already for learning purposes.
[+] pprogrammer|9 years ago|reply
I tried a linear classifier, yes. It works decently too, but has lower accuracy when using a big dataset (roughly 98% top-1 accuracy vs. 99.4%).
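
As a rough illustration of what a bag-of-words linear baseline and a top-1 accuracy measurement can look like, here is a sketch using a multiclass perceptron. The thread does not say which linear model was tried, so the perceptron, the `SAMPLES` data, and all function names are assumptions for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical labelled samples; a real dataset would be far larger.
SAMPLES = [
    ("python", "def f(x):\n    return x"),
    ("c", "int f(int x) { return x; }"),
    ("python", "import os"),
    ("c", "#include <stdio.h>"),
]

def bag_of_words(code):
    """Count word tokens and single punctuation symbols, ignoring order."""
    return Counter(re.findall(r"[A-Za-z_]+|[^\sA-Za-z_0-9]", code))

def score(weights, lang, features):
    """Linear score: dot product of a language's weights and the counts."""
    return sum(weights[lang][tok] * n for tok, n in features.items())

def predict(weights, langs, features):
    """Top-1 prediction: the language with the highest linear score."""
    return max(langs, key=lambda lang: score(weights, lang, features))

def train_perceptron(samples, epochs=10):
    """Multiclass perceptron: on a mistake, move weights toward the true
    language and away from the wrongly predicted one."""
    langs = sorted({lang for lang, _ in samples})
    weights = {lang: defaultdict(float) for lang in langs}
    for _ in range(epochs):
        for lang, code in samples:
            features = bag_of_words(code)
            guess = predict(weights, langs, features)
            if guess != lang:
                for tok, n in features.items():
                    weights[lang][tok] += n
                    weights[guess][tok] -= n
    return weights, langs

def top1_accuracy(weights, langs, samples):
    """Fraction of samples whose highest-scoring language is the true one."""
    hits = sum(
        predict(weights, langs, bag_of_words(code)) == lang
        for lang, code in samples
    )
    return hits / len(samples)

weights, langs = train_perceptron(SAMPLES)
accuracy = top1_accuracy(weights, langs, SAMPLES)
```

Accuracy measured on the training samples, as here, says little about real performance; figures like the ones quoted above would normally come from held-out data.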
[+] wwwigham|9 years ago|reply
Did the JS corpus not contain any es6 syntax? The statement

    let x = {}
doesn't even register as JS.
[+] pprogrammer|9 years ago|reply
It probably does not contain a lot of es6 syntax, and the let keyword probably occurs a lot more in other languages (Swift, Haskell, Lua). You could try adding more code to disambiguate it; for example, adding a semicolon is already enough to move JS up to second place.
[+] haldean|9 years ago|reply
It would be interesting to set up some sort of similarity comparison between languages; then, if you introduced the date of release for the various languages, you might get a cool tree of which-languages-influenced-what. Very cool project!
[+] pprogrammer|9 years ago|reply
This is a great idea! I did think of something like this, because similar languages often end up in the top-3 of the classifier.
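
One simple way to prototype such a similarity comparison, independent of the neural network, is to compare character n-gram profiles of each language's code with cosine similarity. The sketch below assumes a hypothetical `CORPUS` of one snippet per language and hypothetical helper names; it is not how the classifier in the submission measures similarity.

```python
import math
from collections import Counter

# Hypothetical one-snippet corpora; real profiles need much more code.
CORPUS = {
    "c": ["int main() { return 0; }"],
    "cpp": ["int main() { std::cout << 1; return 0; }"],
    "python": ["def main():\n    return 0"],
}

def profile(samples, n=3):
    """Aggregate character n-gram counts over all samples of one language."""
    counts = Counter()
    for code in samples:
        counts.update(code[i:i + n] for i in range(len(code) - n + 1))
    return counts

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

profiles = {lang: profile(samples) for lang, samples in CORPUS.items()}
c_vs_cpp = cosine(profiles["c"], profiles["cpp"])
c_vs_python = cosine(profiles["c"], profiles["python"])
```

A pairwise similarity matrix built this way could then be fed to a clustering method, with release dates used to orient the resulting tree.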
[+] mcphage|9 years ago|reply
Given the >99% accuracy rating, it would be interesting to see some examples of code it gets incorrect.
[+] radarsat1|9 years ago|reply
Would love to see this for detecting which markup language was used for `README`.
[+] pprogrammer|9 years ago|reply
Great idea! I will think about this when I add more languages.