Visualizing popular machine learning algorithms

[+] blt|10 years ago|reply

For learners it is confusing to see nonlinear decision boundaries for linear and logistic regression. IMO a note about the feature expansion should be added.

[+] wybiral|10 years ago|reply

Good point, I've updated my post. For linear and logistic regression there's a cubic expansion of the features (which is how they can fit curved problems). The relevant JavaScript code is on lines 91 and 96.

PS: It can be changed to "linear" or "quadratic" as well.
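
For anyone reading along, here is a minimal sketch of what a polynomial feature expansion of a 2-D point could look like. It is not the demo's code and not nerdy.js: `expandFeatures` and `linearScore` are hypothetical names, and the ordering of the terms is an assumption made for illustration. The point is only that a model which is linear in the expanded features can draw a curved boundary in the original (x, y) plane.

```javascript
// Hypothetical helper (not from the demo or nerdy.js): expands a 2-D point
// into every monomial x^i * y^j with 1 <= i + j <= degree.
// degree 1 -> [x, y]; degree 2 adds x^2, xy, y^2; degree 3 adds the cubic terms.
function expandFeatures(x, y, degree) {
  const features = [];
  for (let total = 1; total <= degree; total++) {
    for (let i = total; i >= 0; i--) {
      const j = total - i;
      features.push(Math.pow(x, i) * Math.pow(y, j));
    }
  }
  return features;
}

// A model that is linear in the expanded features: bias + sum(w_k * f_k).
function linearScore(weights, bias, features) {
  return features.reduce((sum, f, idx) => sum + weights[idx] * f, bias);
}

// Hand-picked weights (an assumption, not learned): x^2 + y^2 - 1.
// The sign of the score separates the inside of the unit circle from the outside.
const circleWeights = [0, 0, 1, 0, 1]; // order: x, y, x^2, xy, y^2
const inside = linearScore(circleWeights, -1, expandFeatures(0, 0, 2));  // -1
const outside = linearScore(circleWeights, -1, expandFeatures(2, 0, 2)); // 3
```

With `degree` set to 1 the features are just x and y, so the boundary is a straight line again.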

[+] edwinksl|10 years ago|reply

Yeah, that or label the axes....

[+] mtw|10 years ago|reply

Awesome. Would be great to have execution times.

Also what is nerdy.js? I saw it was related to "Carl Edward Rasmussen" but couldn't find another reference on the net

[+] wybiral|10 years ago|reply

It's a Javascript library I put together a long time ago for dealing with datasets and machine learning algorithms. It was used for some of my own personal projects and hasn't been focused on for release in the wild (although I'm considering it now).

The reference to Carl Edward Rasmussen is because I based my minimize function heavily off of this one: http://learning.eng.cam.ac.uk/carl/code/minimize/

[+] adriancooney|10 years ago|reply

I'm intrigued also. Definitely some sort of Machine Learning related library anyway. Found something related to it but it doesn't really have any substantial information on it either:

http://nerdyjs.appspot.com/

[+] indubitably|10 years ago|reply

Looks like K-nearest neighbor does pretty well.

[+] obmelvin|10 years ago|reply

Yes, k-NN is theoretically one of the best ML algorithms in the sense that it will find the closest items in the training set. For classification or finding similar-looking items it is great. However, it has pretty poor running times when evaluating unseen data (http://nlp.stanford.edu/IR-book/html/htmledition/time-comple...), since a naive implementation compares every query against the stored training points. This is in contrast to something like neural networks, which take a while to train but then evaluate very quickly. For real-world use the training times matter to an extent, but in a web app or real-time application the latency from k-NN is just impractical.
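
To make the evaluation cost concrete, here is a rough brute-force k-NN sketch. It is an illustration only, not how the demo implements it: `knnPredict` and the `{x, y, label}` point shape are assumptions. There is no training step at all, but every single prediction has to touch the whole training set.

```javascript
// Hypothetical brute-force k-NN classifier for 2-D points.
// trainingSet: array of {x, y, label}; query: {x, y}; k: number of neighbours.
function knnPredict(trainingSet, query, k) {
  // Distance to EVERY stored point is computed for each query, so the
  // per-prediction cost grows with the size of the training set.
  const byDistance = trainingSet
    .map(p => ({
      label: p.label,
      dist: (p.x - query.x) ** 2 + (p.y - query.y) ** 2, // squared Euclidean
    }))
    .sort((a, b) => a.dist - b.dist);

  // Majority vote among the k closest points.
  const votes = {};
  for (const neighbour of byDistance.slice(0, k)) {
    votes[neighbour.label] = (votes[neighbour.label] || 0) + 1;
  }
  return Object.keys(votes).reduce((best, label) =>
    votes[label] > votes[best] ? label : best
  );
}

const knnTraining = [
  { x: 0, y: 0, label: "blue" }, { x: 0, y: 1, label: "blue" }, { x: 1, y: 0, label: "blue" },
  { x: 5, y: 5, label: "orange" }, { x: 5, y: 6, label: "orange" }, { x: 6, y: 5, label: "orange" },
];
const nearOrigin = knnPredict(knnTraining, { x: 0.5, y: 0.5 }, 3); // "blue"
```

Spatial indexes such as k-d trees can cut the lookup cost in low dimensions, but the sketch above is the simple version the complaint applies to.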

[+] jules|10 years ago|reply

These visualisations are great but misleading regarding the performance of these classifiers. In practice you don't have a lot of data in a small number of dimensions (2 in this case). You have a little bit of data in zillions of dimensions. Think of classifying a 100x100 pixel image: that's 3x100x100=30000 dimensional data. You may not even have one training sample per class per dimension. Generalizing from comparatively little data to a very high dimensional space is the true difficulty of machine learning. Unfortunately you can't easily visualize that.

[+] darkmighty|10 years ago|reply

Try the "Island inside an island" test (put a blue cluster inside an orange island). Only k-means and SVM dealt with it satisfactorily.

[+] maurits|10 years ago|reply

There is also MLDemos [1] which is open source.

[1]: http://mldemos.epfl.ch/

[+] RockyMcNuts|10 years ago|reply

I'm a little surprised neural network comes up with a straight line and linear regression doesn't, which I thought by definition it would do. (e.g. on 2 normal groups)

Some discussion of methods, i.e. how many hidden layers/nodes for the neural network, would probably help make some sense of it.

Random forest could be worth adding.

[+] pedrosorio|10 years ago|reply

Looking at the code (http://jsfiddle.net/wybiral/3bdkp5c0/light/) it seems they are expanding the features to include all second and third order terms (options.expansion = cubic), which is why linear regression does not come up with a straight line.

[+] lottin|10 years ago|reply

Sounds interesting, but I can't see the results with Firefox 38.2.1.

[+] narsil|10 years ago|reply

Try this one: https://jsfiddle.net/752pqyvp/embedded/result

It's because of the browser blocking mixed content: The JS libraries are being loaded over HTTP but the JSFiddle is over HTTPS.

The version above loads the libraries over HTTPS via cdnjs.com

[+] autoreleasepool|10 years ago|reply

I can't see the results either. Chrome 46.0.2490.71 (64-bit)

[+] revorad|10 years ago|reply

Can someone please explain this?

[+] wybiral|10 years ago|reply

It's using the X and Y location of the dots as training data. Each algorithm is trained on (x,y)->color in an attempt to build up a rule for predicting what color an unseen (x,y) pair would be. The hypothesis it builds is then used to color the background so that you can see the decision boundary.
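
Roughly, the idea can be sketched like this. This is not the demo's code: the nearest-centroid classifier is a stand-in chosen to keep the example short, and `trainNearestCentroid`, `colorBackground` and the `{x, y, color}` point shape are all hypothetical.

```javascript
// Stand-in classifier (an assumption, not one of the demo's algorithms):
// learns the average position of each color and predicts the closest one.
function trainNearestCentroid(points) {
  const sums = {};
  for (const p of points) {
    const s = sums[p.color] || (sums[p.color] = { x: 0, y: 0, n: 0 });
    s.x += p.x;
    s.y += p.y;
    s.n += 1;
  }
  const centroids = Object.keys(sums).map(color => ({
    color,
    x: sums[color].x / sums[color].n,
    y: sums[color].y / sums[color].n,
  }));
  const dist2 = (c, x, y) => (c.x - x) ** 2 + (c.y - y) ** 2;

  // The returned function is the "hypothesis": (x, y) -> predicted color.
  return (x, y) => {
    let best = centroids[0];
    for (const c of centroids) {
      if (dist2(c, x, y) < dist2(best, x, y)) best = c;
    }
    return best.color;
  };
}

// Ask the hypothesis for a color at the top-left corner of every grid cell.
// Painting each cell with its predicted color reveals the decision boundary.
function colorBackground(predict, width, height, cell) {
  const rows = [];
  for (let y = 0; y < height; y += cell) {
    const row = [];
    for (let x = 0; x < width; x += cell) {
      row.push(predict(x, y));
    }
    rows.push(row);
  }
  return rows;
}

const dots = [
  { x: 1, y: 1, color: "blue" }, { x: 2, y: 1, color: "blue" },
  { x: 8, y: 8, color: "orange" }, { x: 9, y: 8, color: "orange" },
];
const predictColor = trainNearestCentroid(dots);
const background = colorBackground(predictColor, 10, 10, 5);
// background is [["blue", "blue"], ["blue", "orange"]]
```

Swapping in a different classifier only changes the function passed as `predict`; the background-coloring loop stays the same.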

[+] andrelaszlo|10 years ago|reply

There's a bug somewhere.

Refresh, choose dataset: curved, algorithm: k means clustering. You get this:

http://imageshack.com/a/img633/7110/sfteaE.png

If you play around and select different algorithms before selecting k means clustering you can get very different results. :)

[+] wybiral|10 years ago|reply

I accidentally left k means in there as an option and it doesn't make much sense in the context of this example. So, yeah, it's a bit of a bug. Realistically, linear regression doesn't make sense being included either but it still kinda works.

[+] orliesaurus|10 years ago|reply

Some content is loaded over HTTP rather than HTTPS, so that's why it might display a blank page for some people who have HTTPS forced.

[+] throwaway_bob|10 years ago|reply

Any visualization of these algorithms in 2 dimensions (with cubic feature expansion!) is completely misleading if you intend to work on any real problem with many dimensions. Also, for those asking for execution times, these would be horribly misleading as well.

[+] heinrichhartman|10 years ago|reply

+1!

Are you aware of any reasonable high-dimensional "visualizations"? They can't be accurate, of course, but capturing the essential features would be nice.

E.g. here is a 4d cube: https://commons.wikimedia.org/wiki/File:8-cell.gif

[+] chestervonwinch|10 years ago|reply

How is there no training time delay? How is training all these classifiers not putting my CPU into a sweat??

edit: I should also mention: this is very cool :)

[+] joshvm|10 years ago|reply

The dataset is quite small and you have a fast machine. On my laptop, a 7 year old Core 2, there's a slight delay when running some of the heavier algorithms (e.g. running neural net or svm on the island data set).

[+] p1esk|10 years ago|reply

Note: you can click on the graph to add datapoints.

[+] alexanderb|10 years ago|reply

No sources on GitHub, but why? Nerdy.js looks like an interesting component, but I failed to find any relevant information about it.

[+] 0x99|10 years ago|reply

Wow, this is awesome. Can we also play around with classifier parameters (e.g. k for kNN)?

[+] wybiral|10 years ago|reply

Only in the code :) Click "Edit in JSFiddle" and look at line 109 in the Javascript section. You'll see: options.k = 5

38 comments