On a general note: there really seems to be an extremely inaccurate narrative regarding AV1 and speed taking hold. I can't understand why it isn't better understood that a reference implementation is about accuracy only, completely ignoring performance considerations. Not in the usual "we'll now try to make it faster" sense, but as in "this is never meant to be used in production, and its performance is in no way indicative of the performance optimised encoders will see".
As but one example: media encoding is pretty close to being "embarrassingly parallel" in principle, making the first three orders of magnitude of speedup easy wins for a straightforward GPU implementation.
> I can't understand why it isn't easier understood that a reference implementation is about accuracy only,
> completely ignoring performance considerations.
Because the official codebase conveys another message.
Have a look, there are SIMD implementations for almost all supported targets (e.g. under https://aomedia.googlesource.com/aom/+/av1-normative/aom_dsp...).
What are these files for, if not performance? They've been maintained and kept synchronized with the reference C code during the whole project, long before the codec was frozen (and it was a huge PITA).
This doesn't look like "completely ignoring performance considerations".
> As but one example: media encoding is pretty close to being "embarrassingly parallel" in principle,
Almost all video codecs exploit some block-level encoding context, which means the way you encode one block depends on how the previous neighboring blocks were encoded. This creates a huge dependency between blocks. There are tools like slicing/tiling that allow you to break these dependencies, and thus encode in parallel, but at the cost of video quality. Making the problem "embarrassingly parallel" at this point would make the video "embarrassingly ugly".
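To make the cost of breaking that context concrete, here is a toy sketch (not a real codec): each sample is predicted from its left neighbor, and the residual magnitude stands in for the bits an entropy coder would spend. A tile boundary resets the predictor, exactly the trade-off described above.

```python
def residuals(samples, tile_size=None):
    """Predict each sample from its left neighbor; a tile boundary
    resets the predictor, mimicking slices/tiles breaking context."""
    out = []
    prev = 0
    for i, s in enumerate(samples):
        if tile_size and i % tile_size == 0:
            prev = 0  # context is thrown away at every tile start
        out.append(abs(s - prev))
        prev = s
    return out

# A smooth scanline: neighboring samples are highly correlated.
line = [100 + i for i in range(32)]

whole = sum(residuals(line))               # one tile, full context kept
tiled = sum(residuals(line, tile_size=8))  # four tiles, context reset

print(whole, tiled)  # the tiled cost is noticeably higher
```

The four independent tiles can be coded in parallel, but every reset point pays for a full-magnitude sample instead of a small delta; that extra cost is the "embarrassingly ugly" part.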
You could encode multiple frames in parallel; but then again, being able to encode them independently means you're basically trashing all the compression context (reference frames), and your video quality goes down the tubes.
In an offline encoding scenario (Netflix, YouTube), if you have lots of memory, you can encode multiple independent video sequences from the same movie. Making the problem "embarrassingly parallel" in this case would require an "embarrassingly huge" amount of memory.
Also, it's not applicable to a live scenario (think: latency).
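A minimal sketch of that chunked offline approach, with a made-up `encode_sequence` stand-in rather than any real encoder API: the movie is split at keyframes so no reference crosses a chunk boundary, and each worker encodes its chunk independently (each one holding its own frame buffers, hence the memory cost).

```python
from concurrent.futures import ThreadPoolExecutor

def encode_sequence(frames):
    # Placeholder for a real encoder instance; each worker would hold
    # its own reference frames and lookahead buffers in memory.
    return bytes(frames)

def split_at_keyframes(frames, gop=4):
    # Chunks start at keyframes, so no reference crosses a chunk
    # boundary and workers never need to communicate.
    return [frames[i:i + gop] for i in range(0, len(frames), gop)]

movie = list(range(16))  # stand-in frame data
chunks = split_at_keyframes(movie)
with ThreadPoolExecutor() as pool:  # real deployments use processes/machines
    bitstreams = list(pool.map(encode_sequence, chunks))
result = b"".join(bitstreams)  # concatenation preserves frame order
```

Because every chunk starts from a keyframe, concatenating the per-chunk bitstreams reproduces the sequential result; and because every worker is a full encoder with its own buffers, memory scales with the number of chunks in flight.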
> media encoding is pretty close to being "embarrassingly parallel" in principle
My understanding is that there are some fairly tight feedback loops in the encoders that make it difficult to offload things to the GPU, at least if you want to maximize the quality per byte metric. If you want to target realtime and don't need optimal compression it probably gets easier.
> As but one example: media encoding is pretty close to being "embarrassingly parallel" in principle
Which part? 90% of what you're doing is context or inter-frame dependent. Video encoders that live on graphics cards today use dedicated ASIC hardware.
People are pragmatic, at least in this regard. They don't really suffer from the bandwidth costs; they want fast encode speeds for offline storage.
And they are simply cautious. They don't really care about the hype: x264 is good enough visually, and now all visual comparisons are done at ridiculously low bitrates (which is a good thing, but people don't really care).
There are a number of features that make AV1 structurally more suited to real-time implementations than its predecessor, VP9.
For example, it does adaptive entropy coding instead of explicitly coding probabilities in the header. That means that you don't need to choose between making multiple passes over the frame (one to count symbol occurrences and one to write the bitstream using R-D optimal probabilities) or encoding with sub-optimal probabilities (which can have an overhead upwards of 5% of the bitrate). libaom has always been based on a multi-pass design, as was libvpx before it, but rav1e only needs a single pass per frame (we may add multiple passes for non-realtime use cases later).
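The gap between the two options can be sketched with a toy symbol-cost model (code lengths as -log2(p), ignoring the arithmetic coder itself): a static model pays for its wrong probabilities on every symbol, while a single-pass adaptive model starts uniform and converges toward the source statistics as it codes, roughly as AV1's entropy coder adapts its CDFs.

```python
import math

def static_cost(symbols, probs):
    # Bits needed if every symbol is coded with fixed probabilities
    # (what you get without a counting pass or adaptation).
    return sum(-math.log2(probs[s]) for s in symbols)

def adaptive_cost(symbols, alphabet):
    # Single-pass adaptive model: start from uniform counts and update
    # after each coded symbol.
    counts = {s: 1 for s in alphabet}
    total = len(alphabet)
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits

data = "aaaaaaabbc" * 20  # heavily skewed source, 200 symbols
uniform = {s: 1 / 3 for s in "abc"}

print(static_cost(data, uniform))   # stuck paying for wrong probabilities
print(adaptive_cost(data, "abc"))   # one pass, approaches the optimum
```

With a skewed source the adaptive single pass lands well under the static cost, without ever counting symbols up front; the two-pass design buys roughly the same win by measuring first and coding second.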
In another example, AV1 has explicit dependencies between frames. VP9 maintained multiple banks of probabilities which could be used as a starting point for a new frame. But any frame was allowed to modify any bank. So if you lost a frame, you had no idea if it modified the bank of probabilities used by the next frame. In AV1, probabilities (and all other inter-frame state) propagate via reference frames. So you're guaranteed that if you have all of your references, you can decode a frame correctly. This is important if you want to make a low-latency interactive application that never shows a broken frame.
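That guarantee reduces to a simple rule, sketched here with a made-up dependency graph (the frame numbers and references are illustrative, not from any real bitstream): a frame is decodable iff it arrived and all of its reference frames are themselves decodable.

```python
def decodable(frame, refs, received, cache=None):
    """A frame decodes correctly iff it was received and its entire
    reference chain is decodable (the AV1 property described above)."""
    if cache is None:
        cache = {}
    if frame in cache:
        return cache[frame]
    ok = frame in received and all(
        decodable(r, refs, received, cache) for r in refs[frame])
    cache[frame] = ok
    return ok

refs = {0: [], 1: [0], 2: [0], 3: [1, 2]}  # hypothetical dependency graph
received = {0, 2, 3}                       # frame 1 was lost in transit

print(decodable(2, refs, received))  # True: its whole chain arrived
print(decodable(3, refs, received))  # False: a reference is missing
```

A low-latency application can run exactly this check per frame and simply skip (or re-request) anything that fails, which is why it never has to show a broken frame. Under VP9's shared probability banks, no such local check was possible after a loss.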
Some of its tools also become more effective in low-complexity settings. One of the new loop filters, CDEF, gives somewhere around a 2% bitrate savings using objective metrics when tested with libaom running at its highest complexity (although subjective testing suggests the actual improvement is larger). However, when you turn down the complexity, the improvement from CDEF goes up to close to 8%. I.e., using this filter helps you to take shortcuts elsewhere in the encoder.
The real reason the reference encoder is so slow is that it searches a lot of things. You can always make things run faster by searching less. Take a look at http://obe.tv/about-us/obe-blog/item/54-a-look-at-the-bbc-uh... to see how drastically people are limiting HEVC to make it run in real time today (though if you have to go up to 35 Mbps to do so, one might wonder what the point is).
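A toy 1-D motion search shows the "search less" lever (this is an illustration of the trade-off, not how any particular encoder prunes): exhaustive search checks every offset, a coarse search checks a fraction of them, and the price of the speedup is a worse match.

```python
def sad(ref, cur, offset):
    # Sum of absolute differences: the standard block-matching cost.
    return sum(abs(ref[i + offset] - c) for i, c in enumerate(cur))

ref = [0] * 41 + [5, 9, 7, 3] + [0] * 39  # pattern hidden at offset 41
cur = [5, 9, 7, 3]                        # block to match against ref

# Exhaustive search: 81 SAD evaluations, guaranteed best match.
full = min(range(81), key=lambda o: sad(ref, cur, o))
# Coarse search: every 8th offset, 11 SAD evaluations, may miss the match.
coarse = min(range(0, 81, 8), key=lambda o: sad(ref, cur, o))

print(full, sad(ref, cur, full))      # exact match found
print(coarse, sad(ref, cur, coarse))  # cheaper, but nonzero residual
```

The coarse search is ~7x cheaper here but lands next to the true offset rather than on it, leaving residual energy that costs bits downstream. Real encoders mitigate this with refinement stages around the coarse winner; skipping those stages is exactly the kind of shortcut real-time HEVC deployments lean on.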
Yes, it's just that everyone is focusing on size/bandwidth optimization for now. Once they nail down the actual format, projects will start work on making it fast.
I believe the possibility of "making it fast" is taken into account in the existing design, to avoid ending up with a format which can't be cheaply optimised and hardware-implemented.