mjw's comments | WingNews

mjw | 6 years ago | on: Musicians algorithmically generate melodies, release them to public domain

> A minor key has sharps and flats to create tension.

The natural minor or Aeolian mode doesn't use any notes outside the diatonic scale (probably what you meant by "sharps and flats"). It's very possible to write sad music in the natural minor. REM's "Losing my religion" for example.

> To sound sad, you MUST shift to a minor key

With sufficient skill you can write sad music in any key or scale, there's no hard and fast rules here. Tonality is only one of the elements you can use to shape the emotion that's conveyed, and it's all quite culturally relative too.

mjw | 7 years ago | on: Computing Higher Order Derivatives of Matrix and Tensor Expressions [pdf]

This is very neat. That said the reason these methods haven't received much attention so far is that relatively few people actually need to compute Jacobeans or Hessians directly.

Often only Hessian-vector products or Jacobean-vector products are required, and these can be computed via more standard autodiff techniques, usually a lot more efficiently than if you were to compute the Hessian or Jacobean directly.

Also for models with lots of parameters, the Jacobean and Hessian are usually impractically large to realise in memory (N^2 in the number of parameters).

Nevertheless the symbolic tensor calculus approach is very appealing to me. For one thing it could make it a lot easier to see in a more readable symbolic notation what the gradient computations look like in standard backprop, and could perhaps make it easier to implement powerful symbolic optimizations.

mjw | 7 years ago | on: The Matrix Calculus You Need for Deep Learning

When I started out in ML I was really keen to learn about the most 'mathsy' approaches out there.

I think with hindsight, it's great to have a broad spectrum of methods available to you, but if you focus too much on methods at the hard-math end of the spectrum just for the sake of an intellectual challenge, you can end up fixated on an exotic solution looking for a problem while the rest of the field moves on, rather than doing useful engineering people care about.

Maybe you find a niche where something exotic really helps, maybe you don't -- maybe for research this is a risk worth taking. But just something to keep in mind.

IMO: breadth is good. Mathematical maturity helps. If one sticks around one finds uses for interesting maths eventually, but not worth trying to force it.

Another avenue for people who want to use some hardcore math: try and use it to find some good theory around why things which work well, work well. Not an easy task either by any means.

mjw | 9 years ago | on: Deep Learning Without Poor Local Minima

Ah yep, true. I'd forgotten you can still get the saddle effect from higher-order derivatives, the Hessian eigenvalues aren't enough to characterise it.

I was thinking of examples like (x-y)^2 at zero, although I guess that's still a local minimum, just not a unique local minimum in any neighbourhood.

mjw | 9 years ago | on: Deep Learning Without Poor Local Minima

My main quibble from this paper is:

> For deeper networks, Corollary 2.4 states that there exist “bad” saddle points in the sense that the Hessian at the point has no negative eigenvalue.

To me these sound just as bad as local minima. Also I don't think it's standard to call something a saddle point unless the Hessian has negative as well as positive eigenvalues. Otherwise there's no "saddle", more something like a valley or plateau.

They claim that these can be escaped with some peturbation:

> From the proof of Theorem 2.3, we see that some perturbation is sufficient to escape such bad saddle points.

I haven't read through the (long!) proof in detail but it doesn't seem obvious to me why these would be any easier to escape via peturbation than a local minimum would be, and I think this could use some extra explanation as it seems like an important point for the result to be useful. Did anyone figure this bit out?

mjw | 9 years ago | on: The Sigmoid Function in Logistic Regression

See the other replies above, but: the logistic has heavier tails than the normal, so might do better in cases where we need robustness, where unexpected outcomes remain possible even in cases where the linear predictor is relatively big, and we want to avoid drawing extreme inferences from them.

Probit might lead to more efficient inferences in cases where the mechanism is known to become deterministic relatively quickly as the linear predictor gets big.

You could go further in either direction too (more or less robust) by using other link functions.

mjw | 9 years ago | on: The Sigmoid Function in Logistic Regression

Ah yep, I forgot it's the canonical link. That's more of a small computational convenience though, right, at least when fitting a straightforward GLM -- it should be very cheap to fit regardless.

I suppose the logistic having heavier tails than the normal is probably the main consideration in motivating one or the other as the better model for a given situation.

Logistic being is heavier-tailed, is potentially more robust to outliers. Which in terms of binary data, means that it might be a better choice in cases where an unexpected outcome is possible even in the most clear-cut cases. Probit regression with its heavier normal tails, might be a better fit in cases where the response is expected to be pretty much deterministic in clear-cut cases, and where quite severe inferences can be drawn from unexpected outcomes in those cases. Sound fair?

mjw | 9 years ago | on: The Sigmoid Function in Logistic Regression

Their answer is pretty much 'because it's based on the log-odds', which to me is still only very mild motivation.

There are other non-linearities which people use to map onto (0, 1), for example probit regression uses the Normal CDF. In fact you can use the CDF of any distribution supported on the whole real line, and the sigmoid is an example of this -- it's the CDF of a standard logistic distribution [1].

There's a nice interpretation for this using an extra latent variable: for probit regression, you take your linear predictor, add a standard normal noise term, and the response is determined by the sign of the result. For logistic regression, same thing except make it a standard logistic instead.

This then extends nicely to ordinal regression too.

[0] https://en.wikipedia.org/wiki/Probit_model [1] https://en.wikipedia.org/wiki/Logistic_distribution

mjw | 9 years ago | on: All of Statistics, by Larry Wassserman (2013) [pdf]

Pretty much any kind of mathematical modelling that involves uncertainty, really.

Making inferences and predictions from data, in the presence of uncertainty.

Analysis of the properties of procedures for doing the above.

If you want examples that avoid the feel of just "curve fitting" (assume you mean something like "inferring parameters given noisy observations of them") -- maybe look at models involving latent variables. Bayesian statistics has quite a few interesting examples.

mjw | 10 years ago | on: Free “Deep Learning” Textbook by Goodfellow and Bengio Now Finished

If anything, to me a lot of deep learning literature seems to lack the statistical insight and theory that's available to other subfields in machine learning (whether the Bayesian/graphical models camp, the statistical learning theory camp...)

If this book is trying to do more to bring statistical or probabilistic insights to bear on deep learning than I think that's a very good thing. It might make it less accessible to those coming from a pure computer science background, but potentially more so to those who like to think about machine learning from a probabilistic modelling perspective.

If they're using stats jargon in a gratuitous way that doesn't actually cast any light on the material then that's another thing, but from a quick skim I didn't see anything particularly bad on this front. Do you have any examples of the kind of jargon you're talking about?

To others reading, I just wanted to emphasise that statistics is really important in machine learning! Deep learning lets you get away with less of it than you might need elsewhere, but that doesn't mean one can treat it as an unnecessary inconvenience. It's a language you need to learn, especially if you want to try and get to the bottom of how and why aspects of deep learning work the way they do. As opposed to just an empirical "using GPU clusters to throw lots of clever shit at the wall and see what sticks" engineering field. Bengio seems very interested in these kinds of questions and I'm glad he's leading research in that direction, even if clear answers and intuition aren't always easy to come by at this point.

mjw | 10 years ago | on: XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

It's more an empirically-verified thing than a mathematical fact, there's nothing magic about 16 bits AFAIK. Empirically 16 bits seems to work well enough for some tasks, taking it down to 8 bits is usually taking it too far, and performance-wise there's not a lot of point playing with values in between e.g. 12 bits.

(Half-float arithmetic is implemented natively in recent CUDA CC5 architectures and is quite convenient, in particular it reduces memory bandwidth by 1/2 which is often the bottleneck.)

Stochastic gradient descent is fairly robust to noisy gradients -- any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence [0]. It will increase the variance of course, which when taken too far could in practise mean divergence or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.

Extreme quantisation (like binarisation) the error can't really be modelled as independent zero-mean, UNLESS you do the kind of stochastic quantisation mentioned. From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up, a large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.

[0] https://en.wikipedia.org/wiki/Stochastic_approximation#Robbi...

mjw | 10 years ago | on: Microsoft releases CNTK, its open source deep learning toolkit, on GitHub

Yep. To elaborate: really big batch sizes can speed up training data throughput, but usually mean that less is learned from each example seen, so time-to-convergence might not necessarily improve (might even increase, if you take things too far).

Training data throughput isn't the right metric to compare -- look at time to convergence, or e.g. time to some target accuracy level on held-out data.

mjw | 10 years ago | on: Microsoft releases CNTK, its open source deep learning toolkit, on GitHub

Warp-CTC implements one specific model (or at least, one specific loss function), it's not really a general framework in the same way as the other libraries mentioned.

mjw | 10 years ago | on: My Trouble with Bayes

> If I publish a paper people want to know what the data I found suggests and that's it.

What they're going to get, is what your data and your modelling assumptions suggest.

If you're taking just as much care to make the rest of your model unassailably objective, then fair enough. But a prior is usually just one modelling assumption amongst many.

mjw | 10 years ago | on: My Trouble with Bayes

It's odd to complain specifically about subjectivity of the prior when the likelihood is often just as subjective. Gelman puts this well here:

http://andrewgelman.com/2015/01/27/perhaps-merely-accident-h...

mjw | 10 years ago | on: Evaluation of Deep Learning Toolkits

On the whole this is useful, although I think it's a little unfair to Theano in places.

* Performance

I feel they should score separately here for compilation/startup time vs runtime. Theano's compilation step can be slow the first time around. (In my personal experience not enough to add significant friction at development time, but YMMV -- I hear it can struggle with some more complex architectures like deep stacked RNNs.)

Its compilation process gives some unique advantages though -- for example it can generate and compile custom kernels for fused elementwise operations, which can give speed advantages at runtime that aren't achievable via a simple stacking of layers with pre-canned kernels. Some of its graph optimisations are pretty useful too. In short smarter compilation can save you from having to implement your own kernels to achieve good performance on non-standard architectures. If you're doing research that can matter.

* Architecture

The architecture of Theano's main public API is clean and elegant IMO, which is what matters most.

When it comes to extensibility, firstly you don't need to go implement custom Ops very often, certainly not as often as you might implement a custom Layer in Torch. That's because Theano ships with lots of fundamental tensor operations that you can compose, and a compiler that can optimise the resulting graph well.

About the idea that it's hacky that "the whole code base is Python where C/CUDA code is packaged as Python string": if you want to generate new CUDA kernels programatically then you're going to want to use some high-level language to do it. As stated Theano gets some unique advantages from being able to do this. At some conceptual cost I'm sure it'd be possible to handle this code generation in a slightly cleaner way, but I don't really see anyone else in this area doing it significantly better, so I think given the constraints it's a bit subjective and slightly unfair to call it "hacky".

I also think it's something that matters more for framework developers than users. In my experience, on the relatively rare situations where you do need to implement a custom Op, it's usually as a performance optimisation and you can get away with something relatively simple and problem-specific, essentially a thin python wrapper around some fixed kernel code.

The CGT project (which seems to be aiming for a better Theano) has some valid and more detailed criticism of the architecture of the compiler, which I think is fairer: http://rll.berkeley.edu/cgt/#whynottheano

I'm also hoping in due course that Tensorflow will come closer to parity with some of Theano's compiler smarts, at which point I'll be eager to switch as Tensorflow has some other advantages, multi-GPU for one.

mjw | 10 years ago | on: The category design pattern

Agreed that SQL is ugly as hell, but if you want to talk about its theoretical properties that's a separate debate. Theory doesn't care whether something's aesthetically pleasing, just whether it's possible.

mjw | 10 years ago | on: TensorFlow: open-source library for machine intelligence

Does anyone know if TensorFlow can apply algebraic simplifications and numerical optimisations to the compute graph, in the way that Theano does with its optimisations?

Sounds like it doesn't suffer from the (alleged) slow compile times of Theano, but I wonder if the flipside of that is that you have to implement larger-scale custom Ops (like torch's layers) in order to ensure that a composite compute graph is implemented optimally?

mjw | 10 years ago | on: The category design pattern

Ah OK, fair enough.

Not sure what definition you're using for the category of tables, but I don't think the distinction between table and query is really that significant, at least from a theory point of view. You can declare views and materialized views. In some SQL dialects you can even define triggers which allow you to 'update' them, not that I would have thought mutability would be particularly nice to reason about in a category-theoretic framework.

If you really want to do theory on this stuff, just use the relational algebra, or better yet just plain first-order logic. Much nicer, you have all the products and coproducts you want, and the results can probably be re-applied to SQL with a bit of cludge-work :)

mjw | 10 years ago | on: TensorFlow: open-source library for machine intelligence

> What I mostly see around is just standard linear algebra operations on matrices and vectors lifted to higher-dimensional tensors point-wise

Equally what is matrix multiplication but a bunch of 1-dimensional dot products applied pointwise? why do we need matrices?

I do get what you're saying, and that part of it is that ML / CS folk just use 'tensor' as a fancy word for a multi-dimensional array, whereas physics folk use the word for a related coordinate-basis-independent geometric concept. But for numerical computing broadcasting simple operations over slices of some big array is really useful thing to be able to do fast and to express concisely.

Numerics libraries which don't bother to generalise arrays beyond rank 1 and 2 always feel rather inelegant and limiting to me. Rank 3+ arrays are often really useful (images, video, sensory data, count data grouped by more than 2 factors, ...), and lots of operations generalise to them in a nice way. Good array programming environments (numpy, torch, APL take advantage of this to provide an elegant general framework for broadcasting of operations without ugly special cases.