Makes for simple enough code, and even before any serious effort at optimization or SIMD it can convert a 3680x2456x4 image in 32-bit float (the source article uses 3x8-bit) to 320x200x4, also in 32-bit float, in about 60ms (across 4 threads on the 2 cores of an i5-6200U).
Edit: If you want to do even better than the tensor product Lanczos filtering, you can do a filter based on Euclidean distance. Make sure that you work in linear (non-gamma-adjusted) color space.
Edit 2: really weird that this is getting downvoted. There’s not really anything to dispute here. It is straightforward to show that the Lanczos method has objectively better output. Moreover, Alvy Ray Smith’s paper is a classic that anyone interested in image processing should read.
Yes, Lanczos really is better quality, but it seems like the pixel mixing article admits and explains that openly. You don't always need better quality; sometimes a box filter is fine, but for some applications it can be super important to use a high-quality filter.
Whether it's faster depends on implementation details, but from the looks of it, to implement pixel mixing you either have to generate your convolution kernel dynamically, or chop some source pixels into potentially 4 separate pieces. I would guess that it's easier to optimize a static kernel than to make a dynamic kernel faster than a static one, but I'm not entirely sure how fast pixel mixing could be made.
> It can be confused with a box filter or with linear interpolation, but it is not the same as either of them... [pixel mixing] Treats pixels as if they were little squares, which gets on some experts’ nerves.
This is a funny way of putting it. Pixel mixing is definitely using a box filter, just slightly differently than what people normally call box filter resizing. The author even says that later: "Another way to think of pixel mixing is as the integral of a nearest-neighbor function." Pixel mixing as described here is clipping the source image under the box filter rather than using a static kernel. The reason that a box filter isn't ideal is well understood. It's because the filter itself has high frequencies in it. This is the reason that the quality of pixel mixing resizes is low.
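A quick way to see the "high frequencies in the filter itself" point is to compare the frequency response of a box kernel against a Lanczos-3 kernel. This is a standalone numpy sketch of my own, not tied to any particular resizer:

```python
import numpy as np

# Sample a Lanczos-3 kernel and a box kernel of the same width, normalize,
# and compare the magnitude of their spectra well above the passband.
# The box's spectrum (a sinc/Dirichlet shape) decays slowly, so it passes
# lots of high-frequency energy that then aliases; Lanczos suppresses it.
def spectrum(k, n=4096):
    return np.abs(np.fft.rfft(k / k.sum(), n))

x = np.linspace(-3, 3, 61)
lanczos3 = np.sinc(x) * np.sinc(x / 3)   # windowed sinc, support [-3, 3]
box = np.ones(61)

hi_box = spectrum(box)[1024:].mean()     # mean magnitude in the far stopband
hi_lcz = spectrum(lanczos3)[1024:].mean()
print(hi_box / hi_lcz)                   # box passes far more high-frequency energy
```

The ratio printed is large: the box filter leaks much more energy above the target Nyquist, which is exactly the aliasing being discussed.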
One of the benefits of a box filter is that applying it multiple times yields a progressively better filter, approximating a Gaussian after several passes. I'm not sure, but I would bet that clipping to the exact filter boundary prevents you from being able to do that with pixel mixing.
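For what it's worth, the box-to-Gaussian convergence is easy to check numerically. A minimal sketch (my own, not pixel-mixing code), relying on the central limit theorem:

```python
import numpy as np

# Iterating a box kernel: each self-convolution adds the box's variance,
# so after a few passes the kernel shape approaches a Gaussian.
box = np.ones(5) / 5.0
kernel = box.copy()
for _ in range(3):                    # 4 box passes total
    kernel = np.convolve(kernel, box)

# Compare against a Gaussian with the same total variance.
n = len(kernel)
x = np.arange(n) - (n - 1) / 2.0
var = 4 * (5**2 - 1) / 12.0           # variance of one width-5 box is (w^2-1)/12
gauss = np.exp(-x**2 / (2 * var))
gauss /= gauss.sum()

print(np.abs(kernel - gauss).max())   # residual is already small after 4 passes
```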
Well, 3680x2456 / 0.06 seconds ≈ 150 Mp/s, or 75 Mp/s per core. Pillow-SIMD's current implementation runs at ≈ 700 Mp/s per core for bicubic (which is closer to this implementation).
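For reference, the arithmetic behind those figures (my own back-of-the-envelope check):

```python
px = 3680 * 2456                       # 9,038,080 pixels, about 9.0 Mp
total = px / 0.06 / 1e6                # megapixels per second overall
per_core = total / 2                   # two physical cores on the i5-6200U
print(round(total), round(per_core))   # -> 151 75
```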
Obviously, your implementation could be further optimized, but quality drawbacks will remain the same.
Nice writeup. I think it spells out the difference pretty well - less aliasing and artifacts for Lanczos. But you gotta admit it's fast. For large shrink factors you'd be just as well off averaging whole pixels, even if the rectangles being averaged aren't all the same size.
Let me insert my personal pet-peeve: have you thought of making it colour-space-aware? Most (all?) images you'll encounter are stored in sRGB colourspace, which isn't linear, so you can't do the convolution by just multiplying and adding (the result will be slightly off). The easiest way would be to convert it to 16-bit-per-channel linear colour space using a lookup table, do the convolution in linear 16-bit space, then convert back to 8-bit sRGB.
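A sketch of that pipeline in numpy. Everything here is my own stand-in code: a 256-entry sRGB-to-16-bit-linear lookup table, a naive 2x box shrink in place of a real resizer, then conversion back to 8-bit sRGB:

```python
import numpy as np

# Standard sRGB transfer functions (piecewise: linear toe + 2.4 power curve).
def srgb_to_linear(u):                 # u in [0, 1]
    return np.where(u <= 0.04045, u / 12.92, ((u + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(v):
    return np.where(v <= 0.0031308, v * 12.92, 1.055 * v ** (1 / 2.4) - 0.055)

# 256-entry lookup table: 8-bit sRGB -> 16-bit linear, as suggested above.
lut = np.round(srgb_to_linear(np.arange(256) / 255.0) * 65535).astype(np.uint16)

img = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
lin = lut[img].astype(np.float64) / 65535.0            # to linear light
small = lin.reshape(2, 2, 2, 2, 3).mean(axis=(1, 3))   # naive 2x box shrink
out = np.round(linear_to_srgb(small) * 255).astype(np.uint8)
```

The classic sanity check: averaging pure black and pure white in linear light and converting back gives sRGB ≈ 188, not the 128 a gamma-naive average would produce.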
You might be able to do this as a 3-part process instead of expecting the resizing to handle it natively. But that brings up a good question, does the new SIMD goodness work on anything other than 8-bit data? You couldn't do linear in anything less than 16 bits.
Color space conversion is a hard topic in terms of performance. First of all, not all images are stored in sRGB; many have other color profiles (such as P3 or ProPhoto). So sRGB conversion is not enough: you need full color management.
Second, you'll see a real benefit from color management only on a few images. Most of the time you'll only notice the difference when you see both images at the same time on the same screen.
For now, I've settled on resizing in the original non-linear color space and saving the original color profile with the resulting image.
You can even see this problem in the article. The convolution-based sample image is clearly darker than the nearest-neighbour one.
People go all crazy about interpolation and then get the brightness wrong. It's even more obvious for high resolution photos of a tree or grass in bright sunlight. Once you start looking you notice the change in brightness everywhere, when you click on a thumbnail or while a JPEG is loading.
I'd be very interested in an optional Pillow-SIMD downsampling resize that produces 16 bit output internally and then uses a dither to convert from 16 bit to 8 bit. Photoshop does this by default and it produces superior downsampling. Without keeping the color resolution higher, you can end up with visible color banding in resized 8 bit images that wasn't visible in the source image.
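A hedged 1-D sketch of the banding effect: quantizing a shallow 16-bit gradient to 8 bits with plain rounding leaves a few hard bands, while adding ±0.5 LSB of noise before rounding trades them for fine grain. This is just the basic idea, not Photoshop's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
hi = np.linspace(1000, 1400, 4096).astype(np.uint16)   # shallow 16-bit ramp

# Plain rounding to 8 bits: the ramp collapses into a couple of flat bands.
plain = (hi.astype(np.float64) / 257.0).round().astype(np.uint8)

# Same conversion with triangular-ish noise dither: more output levels,
# and the local average still tracks the original ramp.
noisy = ((hi.astype(np.float64) / 257.0)
         + rng.uniform(-0.5, 0.5, hi.shape)).round().astype(np.uint8)

print(np.unique(plain).size, np.unique(noisy).size)
```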
I am curious if the reason that Pillow-SIMD is more than 4x faster than IPP is due to features IPP supports - like higher internal resolution - that Pillow-SIMD doesn't? The reported speeds here are amazing, and I'm definitely going to check this project out and probably use it, but I'd love a little clarity on what the tradeoffs are against IPP or others. I assume there are some.
Each resampling algorithm will internally produce some high-precision result before cutting it to 8 bits. For Pillow-SIMD it is 32-bit integers. Currently, I haven't considered dithering, but it is a very interesting idea. Do you have any links for further reading about downsampling banding and dithering?
About IPP's features: the comparison is pretty fair: the same input, the same algorithm and filters, pretty much the same output. If IPP uses more resources internally to produce the same output, then maybe it shouldn't.
Shame on me, I still haven't added the link to the IPP test file I used. Here it is: https://gist.github.com/homm/9b35398e7e105a3c886ab1d60bf598d...
It is a modified ipp_resize_mt program from IPP's examples. If you have IPP installed, you'll easily find and build it.
I suspect on a GPU it will be better to use 32-bit floating point internally. But yeah, dithering the output when converting back to integers would be great.
In Photoshop I always convert to 16-bit linear color before doing any kind of compositing or resampling.
FWIW the Accelerate framework[1] gives roughly comparable performance[2] for Lanczos resizing. Apple platforms only, but all Apple platforms, not limited to x86.
[1] vImageScale_ARGB8888( ).
[2] I don't have identical hardware available to time on, and it's doing an alpha channel as well, so this is slightly hand-wavy.
Accelerate is a really amazing and highly underrated framework. Sure, the function names are, uh, suboptimal (I'm looking at you, vDSP). That said, having a framework guaranteed across all devices to implement algorithms and primitives in the fastest way for each new device as they come out is amazingly valuable.
I've built production systems over the last few years with it that really wouldn't have been possible without it.
Curious how this would compare vs. running it on the GPU? This is literally what GPUs are made for, and they often have levels of parallelism 500+ times greater than SIMD.
I tried to do this once with Theano, and found that the latency of the roundtrip to the GPU and back made it not worthwhile for a single image. Maybe a batch of images at once would make it worthwhile. And this isn't what Theano is intended for, admittedly; custom CUDA might do a better job.
I'm really happy to see this. The one time I tried looking at the PIL sources for resizing, I was appalled at what I saw. Simply seeing that you're expanding the filter size as the input to output ratio shrinks is a huge deal.
When I wrote my own resizing code, I found it helpful to debug using a nearest-neighbor kernel: 1 from -0.5 to 0.5 and 0 everywhere else. It shook out some off-by-one errors.
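In that spirit, here's a minimal 1-D convolution resampler (a hypothetical toy of my own, not Pillow's code) where the kernel support scales with the shrink factor, which is the detail praised above. Plugging in the nearest-neighbor kernel with support scaling turned off degenerates to plain nearest-neighbor sampling, which makes off-by-one errors easy to spot:

```python
import numpy as np

def resample(src, dst_len, kernel, support, scale_support=True):
    ratio = len(src) / dst_len
    s = max(ratio, 1.0) if scale_support else 1.0   # widen kernel when shrinking
    out = np.empty(dst_len)
    for i in range(dst_len):
        center = (i + 0.5) * ratio                   # dst pixel center in src coords
        lo = int(np.floor(center - support * s))
        hi = int(np.ceil(center + support * s))
        xs = np.arange(lo, hi) + 0.5                 # src pixel centers
        w = kernel((xs - center) / s)
        w /= w.sum()
        idx = np.clip(np.arange(lo, hi), 0, len(src) - 1)  # clamp at edges
        out[i] = (w * src[idx]).sum()
    return out

# The debug kernel described above: 1 on [-0.5, 0.5), 0 everywhere else.
nn = lambda x: ((x >= -0.5) & (x < 0.5)).astype(float)
src = np.arange(16, dtype=float)
print(resample(src, 4, nn, 0.5, scale_support=False))  # picks src[1], src[5], src[9], src[13]
```

With `scale_support=True` the same kernel becomes a proper box average over each 4-pixel span instead, which is the filter-size expansion being applauded.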
> No tricks like decoding a smaller image from a JPEG
Given that most cameras produce JPEG now, I'm curious why you don't make use of the compressed / frequency-domain representation. To a novice in this area (read: me), it seems like a quick shortcut to an 8x, 4x, or 2x downsample.
Or is the required iDCT operation just that much more expensive than the convolution approach?
They would likely get another big speedup by doing this. iDCT gets faster as you perform a "DCT downscaling" operation because you require fewer adds and multiplies [1].
You could probably go for another speedup, independently of DCT downscaling, by operating in YCbCr before a colorspace conversion to RGB. For example, for 4:2:0 encoded content (a majority of JPEG photographs), you end up processing 50% fewer pixels in the chroma planes.
When you combine both techniques, you can have your cake and eat it too: for example, to downsample 4:2:0 content by 50% you can do a DCT downscale on only the Y plane, keeping the CbCr planes as they are before colorspace conversion to RGB. No lanczos required!
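One way to read the 50% figure is to count the samples each pipeline touches for a w x h image (simple arithmetic, nothing library-specific):

```python
# Plane-size arithmetic for the 4:2:0 argument above.
w, h = 4000, 3000
rgb_samples = 3 * w * h                    # interleaved RGB: 3 samples per pixel
y = w * h                                  # full-resolution luma plane
cb = cr = (w // 2) * (h // 2)              # each chroma plane is quarter-size
ycbcr420_samples = y + cb + cr             # = 1.5 * w * h
print(ycbcr420_samples / rgb_samples)      # -> 0.5
```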
If you need a downsample other than {1/n; n = 2, 4, 8}, you can round up to the nearest integer n, then perform a Lanczos resize to the final resolution: the resampling filter will be operating on a lot less data.
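The n-selection could look like this hypothetical helper (names are mine): pick the deepest JPEG DCT scaling factor that still leaves the decoded image at or above the target size, so the final Lanczos pass only ever shrinks:

```python
# Choose a DCT prescale factor n from {8, 4, 2, 1} such that decoding the
# JPEG at 1/n is still >= the target width, never below it.
def dct_prescale(src, dst):
    for n in (8, 4, 2):
        if src // n >= dst:       # decoding at 1/n still leaves enough pixels
            return n
    return 1

print(dct_prescale(4000, 320))   # -> 8  (4000/8 = 500 >= 320)
print(dct_prescale(4000, 900))   # -> 4
print(dct_prescale(4000, 3000))  # -> 1
```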
On quality I once saw a comparison roughly equating DCT downscaling to bilinear (if I can find the reference I'll update this comment). With the example above, it really depends on how you compare: if you compare to a 4:2:0 image decoded to RGB where the chroma is first pixel-doubled or bicubic-upsampled before conversion to RGB then downsampled, it might be that the above lanczos-free technique will look just as good because it didn't modify the chroma at all. Ultimately it's best to try-and-compare.
Lastly you could leverage both SIMD and multicore by processing each of the Y, Cb, and/or Cr planes in parallel.
That’s a shortcut if you only ever have to downsample by powers of two and you don’t mind worse image quality, since your down-sampled picture won’t use any data from across block boundaries.
I'd love to see vips in the benchmark comparison, perhaps a Halide-based resizer too as those are the fastest I've found so far. Perhaps GraphicsMagick too, as I believe it's meant to be faster than ImageMagick in many cases.
Have you tried using a fast blur (like StackBlur, for example: http://www.quasimondo.com/BoxBlurForCanvas/FastBlur2Demo.htm... ; the radius should be computed according to the ratio between the original size and the target size) as a first step before classic nearest-neighbour sampling?
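A 1-D sketch of why the blur step matters (assumptions: a plain box blur standing in for StackBlur, with the radius tied to the shrink ratio). A frequency above the new Nyquist aliases badly under plain subsampling, but mostly disappears once blurred first:

```python
import numpy as np

ratio = 8
x = np.arange(1024)
signal = np.sin(2 * np.pi * 0.45 * x)        # frequency far above the new Nyquist

def box_blur(s, radius):
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(s, k, mode='same')

naive = signal[::ratio]                       # plain nearest neighbor: aliases badly
blurred = box_blur(signal, ratio // 2)[::ratio]

print(np.abs(naive).max(), np.abs(blurred).max())
```

The naive subsample still has nearly full amplitude (the tone has just aliased down), while the pre-blurred version is strongly attenuated.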
Also, making an algorithm that resizes to multiple resolutions at the same time could improve speed.
I take an image of 2560x1600 pixels in size and resize it to the following resolutions: 320x200, 2048x1280, and 5478x3424
For small filter sizes, convolution is going to be faster than an FFT approach. Plus, correct me if I'm wrong, but you need to perform a convolution for every output pixel where the filter kernel is different for each convolution (Sampling the lanczos filter at different points depending on the resample ratio), which would really slow down an FFT approach.
It was originally written two years ago, so some things have changed. But in general it is still correct: for most browsers, you need to combine several ugly techniques to get suitable results. Even then, the quality is nothing like what you can get with direct access to hardware.
He's scaling images down, not up. Think taking an uploaded smartphone picture (multiple megapixels) and scaling it down to thumbnail-sized images for various screens.
He addressed that in the article: Pillow is cross-platform and cross-architecture, so the author felt these sorts of specific optimisations (x86-64 only, with some pretty specific instruction requirements) wouldn't be a good fit in the original library.
> With optimizations, Uploadcare now needs six times fewer servers to handle its load than before.
This is devil's advocate, but did you guys have concrete need for this optimization? You now need six times fewer servers, but was that a crippling problem, or is it a cool statistic for the future when you get more users?
Does it need to be a crippling problem before you do anything about it?
Even discounting that, the fact that their server bill will now be 6x smaller is justification enough? Even if the cost savings aren't quite that much (suppose they work out to be 50% of previous costs), if I was running a business I would totally be implementing optimisations that allowed me to halve my running costs...
One can make an argument that the CO2 and energy cost for wasted server usage is a decent reason for it! 6x fewer is not a small amount, that's a great result.
pedrocr | 8 years ago:
http://entropymine.com/imageworsener/pixelmixing/
I implemented this for my image pipeline:
https://github.com/pedrocr/rawloader/blob/230432a403a9febb5e...
jacobolus | 8 years ago:
A couple more resources to read: http://www.imagemagick.org/Usage/filter/#cylindrical http://www.imagemagick.org/Usage/filter/nicolas/
gioele | 8 years ago:
I think I have already seen it in a couple of recent posts about image compression. (It fits the definition of the Baader-Meinhof phenomenon perfectly [1].)
[1] https://en.wikipedia.org/wiki/List_of_cognitive_biases#Frequ...
josteink | 8 years ago:
IMO Lenna does not contain enough sharp edges and contrast to highlight the differences between the various resize techniques.
With Bologna, you can clearly see the problems with a nearest-neighbour approach. I'm not sure that would have been equally visible with Lenna.
homm | 8 years ago:
By the way, thank you for pointing out that this is Bologna! I'm going to Italy at the end of the month and can visit it :-)
McKayDavis | 8 years ago:
[1] http://www.cs.cmu.edu/~chuck/lennapg/editor.html
jpap | 8 years ago:
[1] http://jpegclub.org/djpeg/
ashishuthama | 8 years ago:
>> maxNumCompThreads(1);
>> im = randi(255, [2560, 1600, 3], 'uint8');
>> timeit(@()imresize(im, [320,200], 'bilinear', 'Antialiasing', false))
ans =
>> timeit(@()imresize(im, [320,200], 'bilinear'))
ans =
>> maxNumCompThreads(6);
>> timeit(@()imresize(im, [320,200], 'bilinear', 'Antialiasing', false))
ans =
>> timeit(@()imresize(im, [320,200], 'bilinear'))
ans =
Oh, missed that lanczos2 part:
>> maxNumCompThreads(1);
>> timeit(@()imresize(im, [320,200], 'lanczos2', 'Antialiasing', false))
ans =
>> maxNumCompThreads(6);
>> timeit(@()imresize(im, [320,200], 'lanczos2', 'Antialiasing', false))
ans =
Since MATLAB tries to do most of the computation in double precision, it's harder to extract much from SIMD.
gfody | 8 years ago:
Have you ever considered pushing the work entirely to the client with a resize implemented in JavaScript? That would cut down on bandwidth as well.
homm | 8 years ago:
https://blog.uploadcare.com/image-resize-in-browsers-is-brok...
techdragon | 8 years ago:
I would much rather this feature be in Pillow so ALL of the python ecosystem could get 6 times faster image resizing.