jrevels|7 years ago
Author here; the arXiv version can be found at https://arxiv.org/abs/1810.08297. Not much different from OP's linked version, but it includes citations to other interesting Julia AD/TPU-related papers that use this technique.
Happy to answer any questions, at least until I turn in for the night :)
rsp1984|7 years ago
Can you explain, from a very high level, what problem is being solved, or which part of the machine learning stack is improved?
I have basic knowledge of machine learning and can handle the maths, but I have never heard the term 'broadcasting' in this context, for instance.
I'm not trying to scrutinize your work, just being genuinely curious and trying to learn.
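(Editorial aside, not part of the thread: "broadcasting" here means applying a scalar function elementwise over arrays of different shapes by virtually repeating any size-1 axis until the shapes match. Julia spells this with dot syntax, e.g. `f.(x, y)`; NumPy applies essentially the same rule implicitly. A minimal sketch of the shape rule itself, in plain Python:)

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Result shape of broadcasting shapes a and b (NumPy/Julia-style rule)."""
    out = []
    # Align shapes from the trailing axis; missing leading axes count as size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            out.append(max(x, y))  # the size-1 axis stretches to match
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(out))

print(broadcast_shape((3, 1), (1, 4)))  # (3, 4): a column meets a row
print(broadcast_shape((5, 4), (4,)))    # (5, 4): the vector repeats per row
```

Fusing a whole chain of such elementwise operations into one GPU kernel (a "broadcast kernel") is the setting the paper's AD technique targets.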
joe_the_user|7 years ago
AD is basically a code transformation method.
What's the most notable way the GPU in particular comes into play?
How does caching come into play? What about intrinsic condensing functions?
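(Editorial aside, not part of the thread: the simplest concrete instance of AD as a code transformation is forward mode via dual numbers — each primitive operation is overloaded to propagate a derivative alongside its value. This is only an illustration of the general idea; the paper itself concerns reverse/mixed-mode AD of GPU broadcast kernels.)

```python
from dataclasses import dataclass

@dataclass
class Dual:
    val: float  # primal value
    der: float  # derivative with respect to the seeded input

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(float(other), 0.0)

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f and df/dx at x by seeding the dual part with 1."""
    return f(Dual(x, 1.0)).der

# f(x) = x^2 + 3x has f'(x) = 2x + 3, so f'(2) = 7
print(derivative(lambda x: x * x + 3 * x, 2.0))  # 7.0
```

Because the transformation operates on the program rather than on a symbolic expression, the same machinery extends to control flow and, as in the paper, to fused elementwise GPU kernels.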