Umm... let's see. Assume I'm trying to build a fancy new recommendation system for movies on a completely new media device, and that so far we've been collecting user data while recommending random movies. The first thing to do is go through this data and figure out what the relevant user features would be. While you're at it, you might find odd issues with the way data is being collected that introduce statistical biases: say you only have the server time at which the user's rating hit your system, not the actual time the user rated the movie. So you go back and fix those. Rinse and repeat. Now you discover that your potential feature set is extremely sparse and high-dimensional, so you think about adding features from the movie side. Go through those too; hopefully you're not starting from scratch, and you've already solved the data collection and feature extraction problems for movies. Now you start thinking algorithmically. Nothing too complicated, because everything has to be made production-ready, and introducing bugs into machine learning algorithms is extremely easy (see Mahout ;-)). You prototype a few algorithms in Hive or your favorite scripting language and pull out some data for A/B testing. You run your A/B tests and hopefully end up with something that looks significantly better than baseline. Then you go through the boring part of making that algorithm production-ready, which means thinking in terms of scale: 1) How do I deliver these recommendations on the fly to millions of users on my fancy new media device? 2) How do I do that without stressing out my system? So you write a bunch of code, integrate it into your current APIs, run a cron job, hope nothing breaks, and monitor graphs like a nervous little rabbit.
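The A/B-testing step in the story above usually comes down to checking whether the new recommender's click-through (or watch) rate beats the random baseline by more than chance. A minimal sketch, assuming a two-proportion z-test and made-up counts (the function name and numbers are illustrative, not from any particular system):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf gives the two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: baseline (random recommendations) vs. new model.
z, p = two_proportion_z(conv_a=480, n_a=10000, conv_b=560, n_b=10000)
print(f"z={z:.2f}, p={p:.4f}")  # reject the null at 0.05 if p < 0.05
```

In practice you would also worry about peeking, multiple comparisons, and per-user rather than per-impression units, but this is the shape of the check.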
Then you repeat all of this the next week.
EDIT: Obviously, depending on the company and the job, this entire story might change completely.
The first big chunk of time is getting data into a usable format. Customers (internal or external) all have different data formats, and nothing is consistent. So you write parsers, deal with the corner cases, and get data into a form you can analyze.
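A minimal sketch of what those parsers end up looking like, assuming hypothetical customer feeds that disagree on column names and date formats (all names here are made up for illustration):

```python
import csv
import io
from datetime import datetime

# Hypothetical: each customer ships ratings with different column names
# and date formats; normalize everything to one schema.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y %H:%M"]

def parse_date(raw):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None  # corner case: record it, don't crash the whole run

def normalize(row):
    return {
        "user_id": row.get("user_id") or row.get("uid"),
        "rating": float(row.get("rating") or row.get("score")),
        "rated_at": parse_date(row.get("rated_at") or row.get("date") or ""),
    }

feed = "uid,score,date\n42,4.5,31/12/2011\n7,3.0,bogus\n"
rows = [normalize(r) for r in csv.DictReader(io.StringIO(feed))]
```

The real versions accumulate far more fallbacks and logging, but the pattern of "try each known format, keep the bad rows visible" is the core of it.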
Then you have to write software that lets you both experiment and release your models to production. That involves writing pipeline architectures to apply things like feature extraction and pruning in a consistent way, and to make sure the result can be serialized and deployed. Off-the-shelf packages typically haven't solved these problems very well, so you have to make sure the thing that looks good in the scripting environment is reproducible on new data.
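A minimal sketch of that idea: a pipeline whose fitted state (here, just frequency-based feature pruning) is pickled, so the production side replays exactly the transforms the experiment used. The class and its behavior are illustrative, not from any particular package:

```python
import pickle

class FeaturePipeline:
    """Fit transforms once, serialize, and replay them identically in prod."""

    def __init__(self, min_count=2):
        self.min_count = min_count
        self.kept = None  # learned during fit

    def fit(self, rows):
        counts = {}
        for row in rows:
            for feat in row:
                counts[feat] = counts.get(feat, 0) + 1
        # Prune rare features so experiment and prod agree on the vocabulary.
        self.kept = sorted(f for f, c in counts.items() if c >= self.min_count)
        return self

    def transform(self, row):
        return [1.0 if f in row else 0.0 for f in self.kept]

train = [{"action", "comedy"}, {"comedy"}, {"drama", "comedy"}]
pipe = FeaturePipeline().fit(train)
blob = pickle.dumps(pipe)                        # ship this artifact to prod
restored = pickle.loads(blob)
vec = restored.transform({"comedy", "western"})  # unseen feature is ignored
```

Serializing the fitted object, rather than re-deriving the feature list in production code, is what keeps the scripting-environment results reproducible on new data.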
Then when you have a model and can deploy it, you start working on automating the training process so the model automatically adapts as new data comes in. Usually the customer has gotten the impression this was happening on day one, so you have to rush to deliver it.
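The retraining loop can be sketched as a simple policy: wait until enough fresh data has accumulated, retrain, and only promote the new model if it doesn't regress. Everything here (function names, the toy stand-in "model") is a hypothetical illustration:

```python
# Hypothetical retraining policy: retrain once enough new rows have
# accumulated, and keep the old model if the candidate scores worse.
def maybe_retrain(model, new_rows, train, evaluate, min_rows=1000):
    if len(new_rows) < min_rows:
        return model, False            # not enough fresh data yet
    candidate = train(new_rows)
    if evaluate(candidate) >= evaluate(model):
        return candidate, True         # promote the new model
    return model, False                # candidate regressed; keep the old one

# Dummy stand-ins so the sketch runs: a "model" is just its training size,
# and we pretend that more data means a better score.
train = lambda rows: {"n": len(rows)}
evaluate = lambda m: m["n"]

model = train(list(range(500)))
model, promoted = maybe_retrain(model, list(range(2000)), train, evaluate)
```

In a real system `train` and `evaluate` hide most of the work, and the whole thing runs on a schedule, but the promote-only-if-better gate is the part that keeps automation from quietly shipping a worse model.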
Then you deal with customer complaints that the model gets something wrong that would have been obvious to a human, even after they've corrected it.
At some point you try to measure the gains you're offering over the non-ML system you replaced, and try to tweak those metrics until they make you look good.
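That measurement is usually a relative-lift computation over whatever heuristic the ML system replaced. A minimal sketch with made-up numbers:

```python
def relative_lift(metric_new, metric_baseline):
    """Percent improvement of the ML system over the non-ML baseline."""
    return 100.0 * (metric_new - metric_baseline) / metric_baseline

# Hypothetical: click-through rate of the old heuristic vs. the new model.
lift = relative_lift(metric_new=0.056, metric_baseline=0.048)
# The "tweaking" the parent mentions is choosing which metric, and which
# slice of traffic, this number gets reported on.
```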
If you're lucky, you got to experiment with some interesting algorithms somewhere in the middle, but you probably got the best results from something fairly standard like random forests and not the latent Bayesian slice sampler you dreamed up when you first heard about the problem.
Getting data ready, cleaned and formatted for analysis: that is a huge part, unless you are super lucky. Then think about how to analyze said data. Come up with a hypothesis, test it, analyze the results, proceed to a secondary hypothesis. Repeat.
I imagine it depends on the company. A friend of mine with a data-scientist job reports that the most important part of his particular job is making convincing PowerPoints to present "insights from data" to management.
eshvk|13 years ago
cityhall|13 years ago
lsiebert|13 years ago
_delirium|13 years ago