jcjohns's comments

jcjohns | 2 years ago | on: Korbut Flip

The usual narrative around this skill is that it was banned because it was too dangerous, but that doesn't ring true to me -- many release moves performed both by women on the uneven bars and men on the high bar have similar motions whose risk is surely equal to or greater than that of a Korbut flip.

For example, here's a Korbut flip: it's a back flip from the feet to catch the bar: https://youtu.be/NZYPcdj_wn4?t=15

Compare this with a Mo Salto, one of the hardest (legal!) release moves in the women's code of points: https://youtu.be/eIwTquLwGpA?t=26

On the men's side, the Korbut flip is a pretty similar motion to a Kovacs (https://www.youtube.com/shorts/6yRJaivL1TE) which is a staple of high-level men's gymnastics; in fact the basic Kovacs is "so easy" that you rarely see the vanilla version performed by top athletes! It's more common to do them with a full twist (Coleman/Cassina, e.g. https://youtu.be/8IeBXhijY0M?t=40) or in combination with other release moves (e.g. Zonderland at the 2012 Olympics https://youtu.be/I0TM2sOnvyI?t=1160). Hidetaka Miyachi is one of the few people to ever have competed a double-twisting Kovacs (both tucked and straight! https://youtu.be/RgW36EKyKyg?t=23), and there are a few videos online of people practicing a "double Kovacs" with an extra flip (e.g. https://www.youtube.com/shorts/zI8VEll7wKI) but nobody has ever done one in competition.

While its perceived danger might have been a factor in the initial ban of the Korbut flip, in light of these modern release moves it's hard to see how that is still a good reason. Instead, I think the reason it remains banned is more aesthetic; bars are supposed to be a swinging event, and we don't want to allow skills that have athletes standing on the bar instead of swinging around it.

On the other hand, banning the Thomas salto (https://www.youtube.com/watch?v=vkQRWCsKyj0) and other similar roll-out moves on floor is very clearly motivated by safety -- these are indeed very dangerous, and athletes have been seriously injured by them (most famously Elena Mukhina who became a quadriplegic as a result of this skill).

jcjohns | 2 years ago | on: Faster neural networks straight from JPEG (2018)

This makes sense in theory, but is hard to get working in practice.

We tried using nvjpeg to do JPEG decoding on the GPU as an additional baseline, but using it as a drop-in replacement in a standard training pipeline gives huge slowdowns for a few reasons:

(1) Batching: nvjpeg isn't batched; you need to decode one at a time in a loop. This is slow but could in principle be improved with a better GPU decoder.

(2) Concurrent data loading / model execution: In a standard training pipeline, the CPU loads and augments data for the next batch in parallel with the model running forward / backward on the current batch. Using the GPU for decoding blocks it from running the model concurrently. If you were careful I think you could probably find a way to interleave JPEG decoding and model execution on the GPU, but it's not straightforward. Naively swapping in nvjpeg in a standard PyTorch training pipeline gives very bad performance.

(3) Data augmentation: If you do DCT -> RGB decoding on the GPU, then you have to think about how and where to do data augmentation. You can augment in DCT either on CPU or on GPU; however DCT augmentation tends to be more expensive than RGB augmentation (especially for resize operations), so if you are already going to the trouble of decoding to RGB then it's probably much cheaper to augment in RGB. If you augment in RGB on GPU, then you are blocking parallel model execution for both JPEG decoding and augmentation, and problem (2) gets even worse. If you do RGB augmentation on CPU, you end up with an extra GPU -> CPU -> GPU round trip on every model iteration, which again reduces performance.
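To make "augmenting in DCT" concrete, here's a minimal numpy sketch (illustrative, not the paper's code) of a horizontal flip done directly on an 8x8 DCT block: reversing a signal negates its odd-frequency DCT-II coefficients, so within each block the flip is just a per-column sign change (plus reversing the block order along the image width):

```python
import numpy as np

def dct_basis(n=8):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector.
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    B = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2 / n)
    B[0] /= np.sqrt(2)
    return B

B = dct_basis()
block = np.random.default_rng(0).random((8, 8))   # one 8x8 pixel block
coeffs = B @ block @ B.T                          # forward 2-D DCT

signs = (-1.0) ** np.arange(8)            # +1 even, -1 odd frequencies
flipped_coeffs = coeffs * signs[None, :]  # negate odd horizontal freqs

# Decoding the modified coefficients gives the horizontally flipped block.
assert np.allclose(B.T @ flipped_coeffs @ B, block[:, ::-1])
```

A resize, by contrast, mixes information across blocks, which is part of why DCT-domain resizing is more expensive than a simple sign flip like this.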

jcjohns | 2 years ago | on: Faster neural networks straight from JPEG (2018)

I'm one of the authors of this CVPR paper -- cool to see our work mentioned on HN!

The Uber paper from 2018 is one that has been floating around in the back of my head for a while. Decoding DCT to RGB is essentially an 8x8 stride 8 convolution -- it seems wasteful to perform this operation on CPU for data loading, then immediately pass the resulting decoded RGB into convolution layers that probably learn filters similar to those used during DCT decoding anyway.
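A small numpy sketch of that point (illustrative, not from the paper): the 2-D inverse DCT of one 8x8 block is a fixed linear map from 64 coefficients to 64 pixels, so applying it block-by-block across an image is exactly an 8x8 kernel applied with stride 8:

```python
import numpy as np

def dct_basis(n=8):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector.
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    B = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2 / n)
    B[0] /= np.sqrt(2)
    return B

B = dct_basis()
block = np.random.default_rng(0).random((8, 8))  # one 8x8 pixel block
coeffs = B @ block @ B.T    # forward 2-D DCT: what a JPEG file stores
recon = B.T @ coeffs @ B    # inverse DCT: the per-block "decode" step

# The decode is a fixed 64 -> 64 linear map per block -- i.e. a single
# convolution with an 8x8 kernel and stride 8 over the coefficient grid.
assert np.allclose(recon, block)
```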

Compared to the earlier Uber paper, our CVPR paper makes two big advances:

(1) Cleaner architecture: The Uber paper uses a CNN, while we use a ViT. It's kind of awkward to modify an existing CNN architecture to accept DCT instead of RGB since the grayscale data is 8x lower resolution than RGB, and the color information is 16x lower than RGB. With a CNN, you need to add extra layers to deal with the downsampled input, and use some kind of fusion mechanism to fuse the luma/chroma data of different resolution. With a ViT it's very straightforward to accept DCT input; you only need to change the patch embedding layer, and the body of the network is unchanged.

(2) Data augmentation: The original Uber paper only showed speedups during inference. During training they need to perform data augmentation, so they convert DCT to RGB, augment in RGB, then convert back to DCT to feed the augmented data to the model. This means that their approach will be slower during training vs an RGB model. In our paper we show how to perform all standard image augmentations directly in DCT, so we can get speedups during both training and inference.
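The patch-embedding change in (1) can be sketched in a few lines of numpy (dimensions here are my illustrative assumptions, not the paper's exact configuration): with 16x16-pixel patches and 4:2:0 chroma subsampling, each patch corresponds to 2x2 luma DCT blocks plus one Cb and one Cr block, and the only new piece is a linear projection from those coefficients to the token dimension:

```python
import numpy as np

embed_dim = 768                  # assumed ViT token dimension
rng = np.random.default_rng(0)

# One 16x16-pixel patch: 2x2 luma blocks (4 * 64 coeffs) plus one 8x8
# DCT block each for Cb and Cr (2 * 64 coeffs) -> 384 values per patch.
luma = rng.standard_normal((2, 2, 8, 8))
cb, cr = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
coeffs = np.concatenate([luma.ravel(), cb.ravel(), cr.ravel()])

# The only architecture change: project DCT coefficients to a token,
# replacing the usual RGB patch-embedding layer. The ViT body is untouched.
W_embed = rng.standard_normal((embed_dim, coeffs.size)) * 0.02
token = W_embed @ coeffs
assert token.shape == (embed_dim,)
```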

Happy to answer any questions about the project!

jcjohns | 9 years ago | on: Supercharging Style Transfer

Yes, I think that is a likely explanation. Also note that Vincent Dumoulin is an author of both the deconv-checkerboard blog post and the new paper from Google, and that the new Google paper uses the upsample+convolution technique suggested by the deconv-checkerboard blog post.

jcjohns | 9 years ago | on: Supercharging Style Transfer

I've found that instance normalization usually gives better results so I prefer it over batch normalization.

With batch norm there are four scalars per convolutional feature map: mu (mean), sigma (stddev), alpha (scale), and beta (shift). During training, mu and sigma are estimated from data statistics while alpha and beta are learned; during testing, mu and sigma are constants, either estimated from the entire training set or computed as a running mean during training. At test time the batch norm operation is then alpha * (x - mu) / sigma + beta, which is a linear operation since everything but x is constant; since it is linear it can be merged into a convolutional layer.
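The folding is just algebra on those four constants. A minimal numpy sketch, using a linear layer as a stand-in for a convolution (the per-channel algebra is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)     # linear layer
mu, sigma = rng.standard_normal(4), rng.uniform(0.5, 2.0, 4)   # fixed stats
alpha, beta = rng.standard_normal(4), rng.standard_normal(4)   # scale/shift

x = rng.standard_normal(3)
y_ref = alpha * (W @ x + b - mu) / sigma + beta  # layer, then test-time BN

# Merge the constants into new weights: one layer, identical output.
W_merged = (alpha / sigma)[:, None] * W
b_merged = alpha * (b - mu) / sigma + beta
y_merged = W_merged @ x + b_merged
assert np.allclose(y_ref, y_merged)
```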

With instance norm, mu and sigma are estimated from data statistics during both training and testing; this means that the test-time forward pass is nonlinear, so it cannot be merged into a convolution (which is linear).
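A quick numpy sketch of why that fails (my illustration): because instance norm recomputes mu and sigma from the input itself, it doesn't satisfy the homogeneity a linear map would need, so there are no fixed constants to fold into the convolution:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # mu and sigma come from x itself, even at test time.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.random.default_rng(0).standard_normal(64)

# A linear map f would satisfy f(2x) = 2 f(x); instance norm does not:
assert not np.allclose(instance_norm(2 * x), 2 * instance_norm(x))
# In fact it is (nearly) scale-invariant -- the opposite of linear scaling:
assert np.allclose(instance_norm(2 * x), instance_norm(x), atol=1e-3)
```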

jcjohns | 9 years ago | on: Supercharging Style Transfer

Real-time neural style transfer is not new; in the past year there have been several academic papers [1-4] on this topic and several open-source code releases:

https://github.com/jcjohnson/fast-neural-style

https://github.com/DmitryUlyanov/texture_nets

https://github.com/chuanli11/MGANs

Neural style blending is also not new; I did it more than a year ago using an optimization-based method:

https://github.com/jcjohnson/neural-style#multiple-style-ima...

The novelty of this work is a clever way to train a single network that can apply many different styles; existing methods for real-time style transfer train a separate network per style. Their method also allows real-time style blending, which is very cool and to my knowledge has not been done before.

(Disclaimer: I'm the author of [2])

[1] Ulyanov et al, "Texture Networks: Feed-forward Synthesis of Textures and Stylized Images", ICML 2016

[2] Johnson et al, "Perceptual Losses for Real-Time Style Transfer and Super-Resolution", ECCV 2016

[3] Li and Wand, "Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks", ECCV 2016

[4] Ulyanov et al, "Instance Normalization: The Missing Ingredient for Fast Stylization", arXiv 2016

jcjohns | 9 years ago | on: Fast Neural Style Transfer

Author here. I'm not a lawyer so I can't write anything too official myself, and after some searching it seemed like none of the standard open-source licenses apply to this use-case.