After a few years of research in DL I have learned not to trust any single paper. 99% of DL papers are not scientific; they read more like 'hey guys, look, this trick is our awesome new discovery'.
Also, I learned to communicate with other teams and exchange ideas proven in practice - this really helps.
To improve things, I would suggest publishing the whole setup: all the parameters used, the code, and either all the data or a reference to a large, freely available data set (no more MNIST in papers, please).
If you have a breakthrough in transfer learning then you will be able to very effectively demonstrate it with MNIST.
The race to the bottom on MNIST error rates is essentially over, but that doesn't mean MNIST can't be used to demonstrate learning.
Regarding setup and parameters: I hope AI researchers move toward something like Pachyderm [https://pachyderm.io] -- providing a single Docker image that completely replicates their work. However, I sincerely doubt that will happen. As "open" as research is, the details are almost always obfuscated to prevent competition with the spin-out company (or other researchers).
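Concretely, "a single image that replicates the work" can be as small as a Dockerfile that pins every layer of the stack. A minimal sketch; the base image tag, package versions, and file names below are illustrative, not taken from any particular paper:

```dockerfile
# Illustrative only: pin the CUDA/cuDNN base image and exact framework
# versions so "works on my machine" becomes "works in this image".
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
RUN apt-get update && apt-get install -y python-pip
RUN pip install tensorflow-gpu==1.0.1 numpy==1.12.1
COPY . /experiment
WORKDIR /experiment
# One command reproduces the paper's main result.
CMD ["python", "train.py", "--config", "paper_config.json"]
```

Anyone with nvidia-docker could then rerun the experiment without reconstructing the environment from prose in the paper.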
Anecdotally, my master's thesis on natural language processing was supposed to consist of first reproducing the results of a then-influential paper and then, hopefully, improving on it by extending the model.
The paper made it seem like they had used a standard PCFG parser (which circulated in the research community at the time) to achieve their results. It turned out they hadn't; instead they had written a custom one, and in fact their results were not reproducible with the standard parser.
What was meant to be an engineering timesaver (using a standard parser instead of writing your own) turned out to be a massive time sink. It also turned out that with their custom parser they had unintentionally deviated from a vanilla PCFG (probabilistic context-free grammar); in other words, implementation details had led to a departure from the assumed underlying theoretical model.
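For readers unfamiliar with the model in question: a vanilla PCFG scores each parse as the product of its rule probabilities, and a standard CKY-style parser finds the highest-scoring parse, so any custom scoring tweak is a departure from the model. A toy sketch (the grammar and probabilities are invented for illustration, not from the paper):

```python
from collections import defaultdict

def cky_best(words, lexical, binary):
    """Best-parse probabilities per span for a PCFG in Chomsky normal form."""
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {nonterminal: best probability}
    # Length-1 spans come from lexical rules: nonterminal -> word.
    for i, w in enumerate(words):
        for (nt, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][nt] = max(chart[(i, i + 1)].get(nt, 0.0), p)
    # Combine adjacent spans with binary rules, shorter spans first.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for (parent, (left, right)), p in binary.items():
                    if left in chart[(i, j)] and right in chart[(j, k)]:
                        # Vanilla PCFG: score is the product of rule probabilities.
                        score = p * chart[(i, j)][left] * chart[(j, k)][right]
                        if score > chart[(i, k)].get(parent, 0.0):
                            chart[(i, k)][parent] = score
    return chart

lexical = {("N", "dogs"): 0.4, ("N", "cats"): 0.4, ("V", "chase"): 1.0}
binary = {("S", ("N", "VP")): 1.0, ("VP", ("V", "N")): 1.0}
chart = cky_best("dogs chase cats".split(), lexical, binary)
print(round(chart[(0, 3)]["S"], 6))  # 1.0 * 0.4 * (1.0 * 1.0 * 0.4) = 0.16
```

The point of the anecdote is that once an implementation stops computing exactly this product, its numbers are no longer those of the theoretical model the paper claims to evaluate.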
A lot of research depends on software written by researchers, and yet writing software is not really supported or incentivised by academia. Organizations like the Software Sustainability Institute in the UK (http://software.ac.uk) are lobbying to change this, with some success, but I suspect it takes a long time to effect cultural change.
See also Artifact Evaluation, a process used in several PL/SE conferences that makes evaluation of code (and datasets/studies) an explicit step in the review process:
Unfortunately CS is not as rigorous as some other academic fields.
I've been to 100+ colloquia in the physics dept at Cornell and I have never been at one that I felt was a waste of time or that the person should not belong there.
The CS department colloquium is a different story: yes, I got to see Geoff Hinton before he became a celebrity, but maybe half of the talks are awful.
IMO (to be fair, some) CS people became engineering bottlenecks the day universities switched from teaching C/C++ to Java and Python (IMO the Why Not Zoidberg? of programming languages). Those who learned C/C++ anyway became my heroes.
I have sat through too many presentations obsessing over HW-level perf/W, especially with respect to Deep Learning ASIC wannabes. Just writing one's code in C/C++ (and doing it well) guarantees at least a 2x improvement over Java and a 10-100x improvement over Python. I won't even bring up the computational coup that is CUDA.
But hey, let's base a mobile phone OS on Java and block low-level access to its GPU, that's a fantastic idea, right?
See also my many experiences with data science and CS prima donnas dismissing low-level coding as "ops." I liken this to the Eloi dismissing the Morlocks as "the help."
Agreed. The tooling around deep learning is not as mature as the tooling around software development. There is a fair amount of engineering and grunt work needed to even get started, let alone build on others' research. A few problems off the top of my mind:
- Setup: Installing DL frameworks, Nvidia drivers and CUDA is an exercise in dependency hell. Trying to run someone's project that has different dependencies from yours is difficult to get right. Docker images [1] and nvidia-docker make this simple, but are still not the norm.
- Reproducibility: This is big as Denny mentions. Folks still use Github for sharing code. But DL pipelines need versioning of more than just code. It's code, environment, parameters, data and results.
- Sharing and collaboration: I've noticed that most collaboration on deep learning research, unlike software, happens only when the folks are co-located (e.g. part of the same school or company). This likely links back to reproducibility, but there are not many good tools for effective collaboration currently IMHO.
[1] https://github.com/floydhub/dl-docker (Disclaimer: I created this)
Sorry to knock the post author off his high horse, but "just like you wouldn’t want a highly trained surgeon spending several hours a day inputting patient data from paper forms." Highly trained surgeons _do_ spend several hours a day doing tedious paperwork.
As a researcher, I expect 50-90% of my time to be slogging through organizational and preparatory work.
I agree with the central thesis: engineering is a huge bottleneck. I work for a FinTech company that is building novel machine learning models and this is our experience.
We've had a few machine learning experts working here for a couple of years, but recently brought in a software engineer with a passion for machine learning. Within a few months he was able to streamline the data acquisition pipeline to the point where we could iterate on a new model in about 30 minutes, down from days. He accomplished this not just with better data but by building efficient in-memory data structures; avoiding the disk I/O literally saves days of time per iteration.
Before his work, the training data and the data we used in production had minor differences. Each new release required intensive manual verification to make sure that our model worked. Now we have much more certainty that the two match up.
Looking down on engineering problems is like a famous architect looking down on structural engineers. You're not gonna have a very good skyscraper if your foundation is shaky and ad-hoc.
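The in-memory approach described above can be sketched in a few lines: pay the disk and parsing cost once, then let every model iteration reuse the resident structure. All names here are illustrative, not the company's actual code:

```python
class InMemoryDataset:
    """Parse the raw data once and keep it resident, so repeated model
    iterations reuse the parsed structure instead of re-reading from disk."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # expensive loader (disk I/O, parsing)
        self._cache = None

    def examples(self):
        if self._cache is None:  # pay the cost only on first access
            self._cache = list(self._load_fn())
        return self._cache

calls = []
def slow_load():
    calls.append(1)  # stands in for expensive disk I/O
    return [(i, i % 2) for i in range(1000)]  # (features, label) pairs

ds = InMemoryDataset(slow_load)
for _ in range(5):  # five "model iterations"
    batch = ds.examples()
assert len(calls) == 1  # the raw data was read only once
```

The engineering here is unglamorous, but it is exactly the kind of foundation work the architect analogy is about.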
Coming from the machine learning research community, I am in awe of the availability and relative ease of use of the deep learning frameworks. Rarely can I find comparison code in ML that I didn't have to bug the author for, or try to implement myself based on the paper alone. In short, DL is on a much better path than the author realizes. Perhaps we have the github/bitbucket era to thank. The real problem with DL is that until there is a more robust theory (if that is even possible - the bane and boon of ML is the complexity of the models), much of the application research will remain a form of digital alchemy.
I was playing around with the emotion data set on Kaggle using TensorFlow. Depending on the seed, I was getting between 58 and 60% accuracy on the held-out test set (what you submit against).
I thought I had come up with a good set of hyperparameters using AWS GPU instances (Python 2.7). I wanted to visualize some of the outputs, so I copied the code to my machine, ran it under Python 3.5 (Windows), and only got 57% accuracy. These swings in accuracy are huge.
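One way to make such swings visible is to treat the seed as part of the experiment and report accuracy as a mean and spread over several seeds rather than a single number. A toy sketch; the "training" below is simulated purely to show the reporting pattern, with the range picked to mimic the 57-60% swings described above:

```python
import random
import statistics

def train_and_eval(seed):
    """Stand-in for a full training run; a real version would train with
    this seed and score the held-out test set."""
    rng = random.Random(seed)
    return 0.58 + rng.uniform(-0.01, 0.02)  # simulated 57-60% range

accuracies = [train_and_eval(seed) for seed in range(10)]
mean = statistics.mean(accuracies)
spread = statistics.stdev(accuracies)
print(f"accuracy: {mean:.3f} +/- {spread:.3f} over {len(accuracies)} seeds")
```

A single-seed result inside that interval tells you very little; a paper (or Kaggle submission) that reports the interval tells you a lot more.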
novaRom | 9 years ago
daveguy | 9 years ago
felxh | 9 years ago
mattsouth | 9 years ago
jpolitz | 9 years ago
http://evaluate.inf.usi.ch/artifacts
http://www.artifact-eval.org/
bottled_poe | 9 years ago
PaulHoule | 9 years ago
AIMunchkin | 9 years ago
elitro | 9 years ago
I had to implement features from multiple papers in order to identify the best ones, with classification results to prove it.
A few of the problems included:
- Incomplete/unavailable datasets (404s on some copyrighted pictures)
- Features described only as math formulas and prose (no code whatsoever)
- Classifier names only (which framework did you use? which parameter values?)
In the end I couldn't contribute either: I was instructed to keep my work in a private repo, despite being funded by an EU academic scholarship.
A_Crazy_Idea | 9 years ago
[deleted]
saip | 9 years ago
rubidium | 9 years ago
transcranial | 9 years ago
audleman | 9 years ago
leecarraher | 9 years ago
siscia | 9 years ago
So that you could link your code to a dataset, have it automatically run, and show the result...
Not sure if it is worth the time...
daveguy | 9 years ago
https://universe.openai.com
https://gym.openai.com
A general dataset pool in OpenAI would be nice. Kaggle has quite a few basic datasets (MNIST etc.) for evaluation.
autokad | 9 years ago