top | item 18298645

End-to-end implementation of a machine learning pipeline (2017)

504 points | spandan-madan | 7 years ago | spandan-madan.github.io

42 comments

[+] roystonvassey | 7 years ago
Great explanation and I love the fact that the entire presentation is a Jupyter Notebook!

A non-academic observation - the 'real-world' challenge of ML pipelines is what I call the 'last-mile' problem of ML - operationalizing your model. You begin to run into problems of:

1. How often do you 'score' live data? How will this affect latency, data ingestion, etc.?

2. How often do you have to update your weights, if you want your model's performance to be consistent?

3. Integration with source systems

4. If you build your final scoring model on library-dependent languages like Python, how do you ensure no breakages? (Docker solves this to a large extent though)
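One common mitigation for points 2 and 4 is to serialize the trained model and reload it at scoring time, so a weight update only means swapping out an artifact rather than redeploying code. A minimal sketch below; the `ThresholdModel` class, file name, and record format are made-up for illustration and are not from the tutorial:

```python
import pickle


class ThresholdModel:
    """Toy stand-in for a trained model (made-up for illustration)."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        # Classify each value as 1 if it exceeds the learned threshold.
        return [int(v > self.threshold) for v in values]


def score_batch(records, model_path):
    # Reload the serialized model on every scoring run, so a retrained
    # model is picked up simply by replacing the file on disk.
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return model.predict(records)
```

This doesn't solve dependency drift by itself (the unpickling environment still needs compatible library versions, which is where Docker helps), but it decouples the model-update cadence from the service-deployment cadence.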

[+] sidlls | 7 years ago
Seconding this. I have run a data science and machine learning team for the last couple of years. By far the most challenging part of our work has been convincing our data management team that we aren't just another front-end widget factory, and convincing our development/operations staff that we aren't choosing "non-standard" tech to deliver model results into production. Model maintenance is difficult too, due to poor data management practices, but it's less challenging for my team than the other items.
[+] DrNuke | 7 years ago
That’s it, really. Any good reference to keep up to date with the last-mile best practices for the average ML practitioner? Thanks!
[+] tamersalama | 7 years ago
Could that be because of using Jupyter notebooks themselves? I like Jupyter for data and machine learning 'journalism', but I don't see it as a proper medium for addressing the 'last mile'. The insights derived from Jupyter, in my opinion, are not actionable and well integrated enough. It is becoming a de facto medium, reminding me of shared Excel files.
[+] spandan-madan | 7 years ago
For feature requests on this, please create an issue on the GitHub repo!

For future tutorial suggestions, mail me at [email protected]. A new one on NLP is coming soon!

[+] d_burfoot | 7 years ago
Is your code intentionally verbose (for the sake of being explicit)? It seems like it could be condensed a lot by using Pythonic structures. For example you could replace block 39 with a one-liner:

Genre_ID_to_name = dict((g['id'], g['name']) for g in list_of_genres)

In other places, you would benefit a lot from the enumerate(..) function, which returns (index, item) tuples when called on a list.
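For instance (a sketch with made-up sample genres, not data from the notebook), the dict construction reads even more cleanly as a dict comprehension, and enumerate removes manual index bookkeeping:

```python
# Dict comprehension: the idiomatic equivalent of the one-liner above.
list_of_genres = [{'id': 28, 'name': 'Action'}, {'id': 35, 'name': 'Comedy'}]
Genre_ID_to_name = {g['id']: g['name'] for g in list_of_genres}

# enumerate() yields (index, item) pairs, replacing range(len(...)) loops.
for i, genre in enumerate(list_of_genres):
    print(i, genre['name'])  # prints "0 Action", then "1 Comedy"
```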

[+] mario0b1 | 7 years ago
Just started an ML course this semester. I am not sure I even have time to use this as an additional resource, but it looks super awesome after skimming through it. Definitely going into my favorites, and if I don't use it as an additional resource now, I will read it later. Thanks for making all this work public!
[+] junke | 7 years ago
Note: at the beginning, you speak about learning a function "g" that approximates a function "f", then later you swap them and learn a function "f" that approximates "g". That could be confusing.
[+] heinrichhartman | 7 years ago
Did they have to put a big "Harvard University" banner at the top of the GitHub repo: https://github.com/Spandan-Madan/DeepLearningProject ? This is a private repository, right? Is the code owned by Harvard?

Personally, I find this off-putting. Let your content speak for itself; advertising the affiliation like this adds no credibility.

Content looks good, though! :)

[+] Reebz | 7 years ago
Yes, this confused me too. Especially considering Spandan Madan is an MIT researcher, I don't understand the overt branding of a different school.
[+] zdk | 7 years ago
As a self-taught beginner, I find this style of tutorial practical and persuasive for practitioners. Thank you very much for investing the time to do this.
[+] antpls | 7 years ago
It's interesting; however, the page seems broken on Firefox for Android (latest, on Android 8.1):

- some values in the command outputs don't match the author's comments (or maybe I misunderstood some?)

- there are some big red blocks of errors in the outputs

- the outputs of the trainings are way too verbose for mobile reading

I guess these are issues on Jupyter's side. It would be nice if mobile were treated as a first-class viewer.