From my testing, it seems this classifier is heavily influenced by the choice of identifiers used in the program, which is a hallmark not of the language itself but of the particular programs it was trained on. Here's a made-up C function that uses Python's colon-plus-whitespace style for the body instead of braces:
int main(int* x):
    return *x + ((int)x);
This program is classified as C (84% confidence), but if I change 'main' to 'elephant' it drops to 49% C, 41% Python. If I change the name to 'snake' it suddenly becomes Python code with 80% confidence. I guess Python programmers really like talking about animals? :)
This would be more interesting if it could accurately classify programs by which family of syntax they're most similar to (Lisp, ML, C, etc.). Otherwise, at first glance this seems like another needless application of something as complicated as neural networks, since languages already have a formal grammar describing their structure, and reserved tokens/symbols you could take advantage of.
Interesting example. It's not valid Python or C, so you wouldn't find anything like it in the dataset. The model also isn't trained on small code snippets (only on running code), so this is currently probably a weak spot.
I agree a lot would be possible using formal grammars, but maintaining parsers for every language would probably be a lot of work. Also, some languages share a lot of the same syntax, so in some cases this may lead to ambiguity. Maybe you could use a combination of both approaches in that case.
That would probably work OK, I guess, but some languages share the same keywords, so I don't think you'd get very high accuracy.
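To illustrate the shared-keyword problem, here's a toy sketch of the keyword-matching approach. The keyword sets are abbreviated and the tokenizer is deliberately naive; this is just to show how snippets built from common keywords produce ties:

```python
# Toy keyword-based classifier: score a snippet by how many of each
# language's reserved words it contains. Keyword sets are abbreviated
# for illustration, not complete.
KEYWORDS = {
    "c":      {"int", "return", "for", "while", "struct", "typedef"},
    "java":   {"int", "return", "for", "while", "class", "public"},
    "python": {"def", "return", "for", "while", "class", "import"},
}

def classify(code):
    # Naive tokenization: strip parentheses, split on whitespace.
    tokens = set(code.replace("(", " ").replace(")", " ").split())
    return {lang: len(kw & tokens) for lang, kw in KEYWORDS.items()}

# 'for', 'return', and 'while' are in all three keyword sets, so a
# snippet that only uses shared keywords scores a three-way tie:
print(classify("for (;;) { return x; } while"))

# A language-specific keyword like 'def' or 'import' breaks the tie:
print(classify("def f(): import os"))
```

The tie in the first call is exactly the ambiguity mentioned above: keyword overlap alone can't separate C-family languages, which is where frequency statistics or n-grams would have to take over.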
I also tried word-level n-grams with linear classifiers, but I couldn't get very high accuracy out of them. N-grams with higher values of n improve accuracy a lot if you train with a lot of data, for both word-level and character-level models.
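For anyone curious what the character-level variant looks like, here's a minimal sketch. The two-snippet "training corpus" and the overlap-count scoring are made up for illustration; a real model would train on thousands of files per language and use a proper linear classifier over the n-gram features:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character-level n-gram counts for a piece of code."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny hypothetical training corpus: one snippet per language.
corpora = {
    "c":      'int main() { printf("%d\\n", x); return 0; }',
    "python": "def main():\n    print(x)\n    return 0",
}
profiles = {lang: char_ngrams(src) for lang, src in corpora.items()}

def classify(snippet):
    grams = char_ngrams(snippet)
    # Score each language by the number of n-gram occurrences the
    # snippet shares with its profile (Counter & Counter keeps the
    # element-wise minimum of the two counts).
    return max(profiles,
               key=lambda lang: sum((grams & profiles[lang]).values()))

print(classify("print(y)"))      # leans Python: shares 'pri', 'nt(' etc.
print(classify("return 0; }"))   # leans C: shares ' 0;', '; }' etc.
```

Note that nothing here requires the input to parse, which is also why this kind of model degrades gracefully on syntactically invalid code.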
It incorrectly classifies the following valid C++ code as JavaScript. It seems to be thrown off by the use of console.log rather than actually classifying based on the syntax.
#include <iostream>
#include <string>

class Logger {
public:
    void log(std::string a) {
        std::cout << a << std::endl;
    }
};

int main() {
    Logger console = Logger();
    for (int i = 0; i < 10; i++) {
        console.log("test");
        console.log("test");
        console.log("test");
        console.log("test");
        console.log("test");
        console.log("test");
        console.log("test");
        console.log("test");
    }
}
Have you only tested syntactically valid code? For example, the code snippet:
for (int i = 0; i ? N; ++i) {
    a[i]->val = i;
}
is clearly either C or C++, but it's not valid C or C++, because "i ? N" is not a valid boolean expression in the for loop. I actually think your techniques would work on syntactically invalid code, but I'm curious whether you tried.
Probably almost all of the samples are syntactically valid, but it should work OK for invalid code too. Adding or removing symbols can still affect the results, though, since those symbols may occur more frequently in other languages.
Not necessarily overkill; maybe they wanted to learn how to use neural networks, and this was a good project for that. Simple is best for professional projects, but for a personal project like this it's better to explore something you don't already know, for learning's sake.
It probably doesn't contain a lot of ES6 syntax, and the let keyword probably occurs a lot more in other languages (Swift, Haskell, Lua). You could try adding more code to disambiguate it; for example, adding a semicolon already ranks it second.
It would be interesting to set up some sort of similarity comparison between languages; then, if you added each language's release date, you might get a cool tree of which languages influenced which. Very cool project!
dmichulke|9 years ago
The data would then really only be needed to guess the frequencies.
I am also not sure whether word-level bigrams would help a lot, but character-level n-grams would have been totally off my radar.
Any opinions on that?