top | item 18298645

End-to-end implementation of a machine learning pipeline (2017)

504 points | spandan-madan | 7 years ago | spandan-madan.github.io

42 comments

[+] roystonvassey | 7 years ago
Great explanation and I love the fact that the entire presentation is a Jupyter Notebook!

A non-academic observation - the 'real-world' challenge of ML pipelines is what I call the 'last-mile' problem of ML - operationalizing your model. You begin to run into problems of:

1. How often do you 'score' live data? How will this affect latency, data ingestion, etc.?

2. How often do you have to update your weights, if you want your model's performance to be consistent?

3. Integration with source systems

4. If you build your final scoring model on library-dependent languages like Python, how do you ensure no breakages? (Docker solves this to a large extent though)
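One common mitigation for points 2 and 4 is to serialize the trained model and reload it at scoring time, so a weight update only means swapping out an artifact rather than redeploying code. A minimal sketch below; the `ThresholdModel` class, file name, and record format are made-up for illustration and are not from the tutorial:

```python
import pickle


class ThresholdModel:
    """Toy stand-in for a trained model (made-up for illustration)."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        # Classify each value as 1 if it exceeds the learned threshold.
        return [int(v > self.threshold) for v in values]


def score_batch(records, model_path):
    # Reload the serialized model on every scoring run, so a retrained
    # model is picked up simply by replacing the file on disk.
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return model.predict(records)
```

This doesn't solve dependency drift by itself (the unpickling environment still needs compatible library versions, which is where Docker helps), but it decouples the model-update cadence from the service-deployment cadence.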

[+] sidlls | 7 years ago
Seconding this. I have run a data science and machine learning team for the last couple of years. By far the most challenging part of our work has been convincing our data management team that we aren't just another front-end widget factory, and convincing our development/operations staff that we aren't choosing "non-standard" tech to deliver model results into production. Model maintenance is difficult too, due to poor data management practices, but it's less challenging for my team than the other items.
[+] DrNuke | 7 years ago
That’s it, really. Any good reference to keep up to date with the last-mile best practices for the average ML practitioner? Thanks!
[+] tamersalama | 7 years ago
Could that be because of using Jupyter notebooks themselves? I like Jupyter for data and machine learning 'journalism', but I don't see it as a proper medium for addressing the 'last mile'. The insights derived from Jupyter, in my opinion, are not actionable and well integrated enough. It is becoming a de facto medium, reminding me of shared Excel files.
[+] spandan-madan | 7 years ago
For feature requests on this, please create an issue on the GitHub repo!

For future tutorial suggestions, mail me at [email protected]. A new one on NLP is coming soon!

[+] d_burfoot | 7 years ago
Is your code intentionally verbose (for the sake of being explicit)? It seems like it could be condensed a lot by using Pythonic structures. For example you could replace block 39 with a one-liner:

Genre_ID_to_name = dict((g['id'], g['name']) for g in list_of_genres)

In other places, you would benefit a lot from the enumerate(..) function, which returns (index, item) tuples when called on a list.
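For instance (a sketch with made-up sample genres, not data from the notebook), the dict construction reads even more cleanly as a dict comprehension, and enumerate removes manual index bookkeeping:

```python
# Dict comprehension: the idiomatic equivalent of the one-liner above.
list_of_genres = [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}]
Genre_ID_to_name = {g['id']: g['name'] for g in list_of_genres}

# enumerate() yields (index, item) pairs, replacing range(len(...)) loops.
for i, genre in enumerate(list_of_genres):
    print(i, genre['name'])  # prints "0 Action", then "1 Comedy"
```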

[+] mario0b1 | 7 years ago
Just started an ML course this semester. I am not sure I even have time to use this as an additional resource, but it looks super awesome after skimming through it. Definitely going into my favorites, and if I don't use it as an additional resource now, I will read it later. Thanks for making all this work public!
[+] junke | 7 years ago
Note: at the beginning, you speak about learning a function "g" that approximates a function "f", then later you swap them and learn a function "f" that approximates "g". That could be confusing.
[+] heinrichhartman | 7 years ago
Did they have to put a big "Harvard University" banner at the top of the GitHub repo: https://github.com/Spandan-Madan/DeepLearningProject ? This is a private repository, right? Is the code owned by Harvard?

Personally, I find this off-putting. Let your content speak for itself; advertising the affiliation like this adds no credibility.

Content looks good, though! :)

[+] Reebz | 7 years ago
Yes, this confused me too. Especially considering Spandan Madan is an MIT researcher, I don't understand the overt branding of a different school.
[+] zdk | 7 years ago
As a self-taught beginner, I find this style of tutorial practical and persuasive for practitioners. Thank you very much for investing the time to do this.
[+] antpls | 7 years ago
It's interesting; however, the page seems broken on Firefox for Android (latest, on Android 8.1):

- some values in the command outputs don't match the author's comments (or maybe I misunderstood some?)

- there are some big red blocks of errors in the outputs

- the outputs of the trainings are way too verbose for mobile reading

I guess these are issues on Jupyter's side. It would be nice if mobile were treated as a first-class viewer.