Great explanation and I love the fact that the entire presentation is a Jupyter Notebook!
A non-academic observation - the 'real-world' challenge of ML pipelines is what I call the 'last-mile' problem of ML - operationalizing your model. You begin to run into problems of:
1. How often do you 'score' live data? How will this affect latency, data ingestion, etc.?
2. How often do you have to update your weights, if you want your model's performance to be consistent?
3. Integration with source systems
4. If you build your final scoring model on library-dependent languages like Python, how do you ensure no breakages? (Docker solves this to a large extent though)
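Point 2 bites quietly in practice. A minimal sketch of a staleness guard (the `Model` class, the 30-day cutoff, and `trained_days_ago` are all made up for illustration, not any particular framework's API):

```python
class Model:
    """Stand-in for a trained model; in practice you'd load a pickled artifact."""
    version = "v1"

    def predict(self, rows):
        # Placeholder scoring logic.
        return [sum(r) for r in rows]

def score_batch(model, rows, trained_days_ago, max_age_days=30):
    # Refuse to score with stale weights so performance stays consistent
    # (question 2 above); the 30-day cutoff is an arbitrary example.
    if trained_days_ago > max_age_days:
        raise RuntimeError(f"model {model.version} is stale; retrain first")
    return model.predict(rows)

scores = score_batch(Model(), [[1, 2], [3, 4]], trained_days_ago=10)
```

The point isn't the guard itself but that the scoring path, not a human, should decide when the weights are too old to trust.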
Seconding this. I have run a data science and machine learning team for the last couple of years. By far the most challenging part of our work has been convincing our data management team that we aren't just another front-end widget factory, and our development/operations staff that we aren't choosing "non-standard" tech to deliver model results into production. Model maintenance is difficult too, due to poor data management practices, but it's less challenging for my team than the other items.
Could that be because of using Jupyter notebooks themselves? I like Jupyter for data and machine learning 'journalism', but I don't see it as a proper medium for addressing the 'last mile'. The insights derived from Jupyter, in my opinion, are not actionable or well-integrated enough. It is becoming a de facto medium, reminding me of shared Excel files.
Is your code intentionally verbose (for the sake of being explicit)? It seems like it could be condensed a lot by using Pythonic structures. For example you could replace block 39 with a one-liner:
Genre_ID_to_name = {g['id']: g['name'] for g in list_of_genres}
In other places, you would benefit a lot from the enumerate(..) function, which returns (index, item) tuples when called on a list.
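For example (the two-element `list_of_genres` here is a made-up sample; in the notebook it comes from the TMDB API):

```python
list_of_genres = [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}]

# A dict comprehension replaces the dict([...]) construction in one line:
Genre_ID_to_name = {g['id']: g['name'] for g in list_of_genres}

# enumerate() yields (index, item) pairs, replacing range(len(...)) loops:
for i, genre in enumerate(list_of_genres):
    print(i, genre['name'])
```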
Just started an ML course this semester. I'm not sure I'll even have time to use this as an additional resource, but it looks super awesome after skimming through it.
Definitely going into my favorites, and if I don't use it as an additional resource now, I will read it later. Thanks for making all this work public!
This is a very good way to get started building ML pipelines. When you do it at scale, you often need to use a broader range of tools. Here's how we do it in Hopsworks with Python the whole way (using Airflow to orchestrate the different steps):
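The orchestration idea, roughly, in plain Python (this is an illustration only, not the actual Hopsworks or Airflow API; the step functions and toy data are invented):

```python
def ingest():
    # In Airflow each step would be a task; here we just call functions.
    return [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]

def featurize(rows):
    return [(r["x"], r["y"]) for r in rows]

def train(features):
    # Trivial "model": predict 1 when x exceeds the mean of x.
    mean_x = sum(x for x, _ in features) / len(features)
    return lambda x: int(x > mean_x)

# Airflow would express this as ingest >> featurize >> train task
# dependencies in a DAG; here we chain the calls directly.
model = train(featurize(ingest()))
```

The value of an orchestrator is that each arrow in that chain gets retries, scheduling, and monitoring for free.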
Note: at the beginning, you speak about learning a function "g" that approximates a function "f", then later you swap them and learn a function "f" that approximates "g". That could be confusing.
As a beginner and a self-taught learner, I find this style of tutorial practical and persuasive for practitioners. Thank you very much for investing the time to do this.
spandan-madan | 7 years ago
For future tutorial suggestions, mail me at [email protected]. A new one on NLP is coming soon!
jamesblonde | 7 years ago
https://hops.readthedocs.io/en/latest/hopsml/hopsML.html
ratsimihah | 7 years ago
Also, looks like that notebook needs an update.
https://github.com/ContinuumIO/anaconda-issues/issues/6678
heinrichhartman | 7 years ago
For me personally, I find this off-putting. Let your content speak for itself. No added credibility when the affiliation is advertised like this.
Content looks good, though! :)
antpls | 7 years ago
- some values in the command outputs don't match the author's comments (or maybe I misunderstood some?)
- there are some big red blocks of errors in the outputs
- the outputs of the training runs are way too verbose for mobile reading
I guess these are issues on Jupyter's side. It would be nice if mobile were treated as a first-class viewer.