New swscale design to change everything (tm)
============================================

SwsGraph
--------

The entry point to the new architecture, SwsGraph is what coordinates
multiple "passes". These can include cascaded scaling passes, error diffusion
dithering, and so on. Or we could have separate passes for the vertical and
horizontal scaling. In between each SwsPass lies a fully allocated image
buffer. Graph passes may have different levels of threading, e.g. we can have
a single threaded error diffusion pass following a multi-threaded scaling
pass.

SwsGraph is internally recreated whenever the image format, dimensions or
settings change in any way. sws_scale_frame() is itself just a light-weight
wrapper that runs ff_sws_graph_create() whenever the format changes, splits
interlaced images into separate fields, and calls ff_sws_graph_run() on each.
From the point of view of SwsGraph itself, all inputs are progressive.
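
For orientation, here is a heavily simplified sketch of the pass-chaining
idea. All names and fields below are made up for illustration and do not
match the actual SwsGraph / SwsPass definitions in libswscale/graph.h:

    /* Illustrative only; the real SwsGraph / SwsPass structures differ. */
    typedef struct ExampleImage {
        unsigned char *data[4];
        int linesize[4];
        int width, height;
    } ExampleImage;

    typedef struct ExamplePass {
        /* Runs this pass on the rows [slice_start, slice_start + slice_h). */
        void (*run)(const struct ExamplePass *pass, int slice_start, int slice_h);
        const ExampleImage *input;  /* output buffer of the previous pass */
        ExampleImage *output;       /* fully allocated intermediate image */
        int num_threads;            /* threading is decided per pass */
    } ExamplePass;

    typedef struct ExampleGraph {
        ExamplePass *passes;
        int num_passes;
    } ExampleGraph;

    static void example_graph_run(const ExampleGraph *graph)
    {
        for (int i = 0; i < graph->num_passes; i++) {
            const ExamplePass *pass = &graph->passes[i];
            /* A real implementation splits this into per-thread slices. */
            pass->run(pass, 0, pass->output->height);
        }
    }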

SwsOp / SwsOpList
-----------------

This is the newly introduced abstraction layer between the high-level format
handling logic and the low-level backing implementation. Each SwsOp is
designed to be as small and atomic as possible, with the possible exception
of the read / write operations due to their numerous variants.

The basic idea is to split logic between three major components:

1. The high-level format "business logic", which generates in a very
   naive way a sequence of operations guaranteed to get you from point A
   to point B. This logic is written with correctness in mind only,
   ignoring any performance concerns or low-level implementation decisions.

   Semantically, everything is always decoded from the input format to
   normalized (real valued) RGB, and then encoded back to the output format.

   This code lives in libswscale/format.c

2. The optimizer. This is where the "magic" happens, so to speak. The
   optimizer's job is to take the abstract sequence of operations
   produced by the high-level format analysis code and incrementally
   optimize it. Each optimization step is designed to be minute and provably
   lossless, or otherwise guarded behind the BITEXACT flag. This ensures that
   the resulting output is always identical, no matter how many layers of
   optimization we add.

   This code lives in libswscale/ops.c

3. The compiler. Once we have a sequence of operations as output by the
   optimizer, we "compile" this down to a callable function. This is then
   applied by the dispatch wrapper by striping it over the input image.

   See libswscale/ops_backend.c for the reference backend, or
   libswscale/x86/ops.c for a more complex SIMD example.
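
To make the division of labour more concrete, here is a deliberately trimmed,
hypothetical sketch of what a single operation might carry; the real SwsOp /
SwsOpList definitions in libswscale/ops.h contain considerably more detail:

    /* Hypothetical, heavily trimmed sketch; not the real SwsOp definition. */
    typedef enum ExampleOpType {
        EXAMPLE_OP_READ,    /* load pixels from memory */
        EXAMPLE_OP_SWIZZLE, /* reorder components */
        EXAMPLE_OP_CONVERT, /* change the pixel type, e.g. u8 -> f32 */
        EXAMPLE_OP_LINEAR,  /* multiply by an affine (5x5) matrix */
        EXAMPLE_OP_WRITE,   /* store pixels to memory */
    } ExampleOpType;

    typedef struct ExampleOp {
        ExampleOpType type;
        int pixel_type;      /* type of the values this op operates on */
        double matrix[4][5]; /* only used by EXAMPLE_OP_LINEAR */
    } ExampleOp;

format.c emits a naive but correct sequence of such operations, ops.c rewrites
that sequence into an equivalent but cheaper one, and the chosen backend
finally compiles the result into a callable kernel chain.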

This overall approach has a considerable number of benefits:

1. It allows us to verify correctness of logic and spot semantic errors at a
   very high level, by simply looking at the sequence of operations (available
   by default at debug / verbose log level), without having to dig through the
   multiple levels of complicated, interwoven format handling code that is
   legacy swscale.

2. Because most of the brains lives inside the powerful optimizer, we get
   fast paths "for free" for any suitable format conversion, rather than
   having to enumerate them one by one. SIMD code itself can be written in a
   very general way and does not need to be tied to specific pixel formats -
   subsequent low-level implementations can be strung together without much
   overhead.

3. We can in the future, with relative ease, compile these operations
   down to SPIR-V (or even LLVM IR) and generate efficient GPU or
   target-machine specific implementations. This also opens the window for
   adding hardware frame support to libswscale, and even transparently using
   GPU acceleration for CPU frames.

4. Platform-specific SIMD can be reduced down to a comparatively small set of
   optimized routines, while still providing 100% coverage for all possible
   pixel formats and operations. (As of writing, the x86 example backend has
   about 60 unique implementations, of which 20 are trivial swizzles, 10 are
   read/write ops, 10 are pixel type conversions and the remaining 20 are the
   various logic/arithmetic ops).

5. Backends hide behind a layer of abstraction offering them a considerable
   deal of flexibility in how they want to implement their operations. For
   example, the x86 backend has a dedicated function for compiling compatible
   operations down to a single in-place pshufb instruction.

   Platform specific low-level data is self-contained within its own setup()
   function and private data structure, eliminating all reads into SwsContext
   or the possibility of conflicts between platforms.

6. We can compute an exact reference result for each operation with fixed
   precision (ff_sws_op_apply_q), and use that to e.g. measure the amount of
   error introduced by dithering, or even catch bugs in the reference C
   implementation. (In theory - currently checkasm just compares against C)

Examples of SwsOp in action
---------------------------

For illustration, here is the sequence of operations currently generated by
my prototype, for a conversion from RGB24 to YUV444P:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ : 3 elem(s) packed >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR : {_ _ _ 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR : diag3+alpha [[1/255 0 0 0 0] [0 1/255 0 0 0] [0 0 1/255 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR : matrix3 [[0.299000 0.587000 0.114000 0 0] [-0.168736 -0.331264 1/2 0 0] [1/2 -0.418688 -57/701 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR : diag3+off3 [[219 0 0 0 16] [0 224 0 0 128] [0 0 224 0 128] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_DITHER : 16x16 matrix
  [f32 .... -> ....] SWS_OP_MAX : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN : x <= {255 255 255 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u8
  [ u8 .... -> ....] SWS_OP_LSHIFT : << 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_WRITE : 3 elem(s) planar >> 0

This is optimized into the following sequence:

Optimized operation list:
  [ u8 XXXX -> +++X] SWS_OP_READ : 3 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER : 16x16 matrix
  [f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
  [ u8 ...X -> +++X] SWS_OP_WRITE : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

The extra metadata on the left of the operation list is just a dump of the
internal state used by the optimizer during optimization. It keeps track of
knowledge about the pixel values, such as their value range, whether or not
they're exact integers, and so on.

In this example, you can see that the input values are exact (except for
the alpha channel, which is undefined), until the first SWS_OP_LINEAR
multiplies them by a noninteger constant. They regain their exact integer
status only after the (truncating) conversion to u8 in the output step.
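
As a quick sanity check of the folding above, the three LINEAR operations of
the unoptimized list multiply out to exactly the single matrix3+off3 of the
optimized list. A small standalone snippet (not part of swscale) reproducing
the first row:

    /* (1/255 normalization) x (BT.601 RGB->Y row) x (scale by 219, offset 16)
     * folds into the first row of the optimized matrix3+off3. The offset 16
     * passes through unchanged because the earlier operations contribute no
     * offset to this row. */
    #include <stdio.h>

    int main(void)
    {
        const double rgb2y[3] = { 0.299, 0.587, 0.114 };
        for (int i = 0; i < 3; i++)
            printf("%.6f ", rgb2y[i] * (1.0 / 255.0) * 219.0);
        printf("+ 16\n"); /* prints: 0.256788 0.504129 0.097906 + 16 */
        return 0;
    }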

Example of more aggressive optimization
---------------------------------------

Conversion pass for gray -> rgb48:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ : 1 elem(s) planar >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR : {_ 0 0 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR : luma+alpha [[1/255 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR : matrix3 [[1 0 701/500 0 0] [1 -0.344136 -0.714136 0 0] [1 443/250 0 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR : diag3 [[65535 0 0 0 0] [0 65535 0 0 0] [0 0 65535 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_MAX : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN : x <= {65535 65535 65535 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u16
  [u16 .... -> ....] SWS_OP_LSHIFT : << 0
  [u16 .... -> ....] SWS_OP_SWIZZLE : 0123
  [u16 .... -> ....] SWS_OP_WRITE : 3 elem(s) packed >> 0

Optimized operation list:
  [ u8 XXXX -> +XXX] SWS_OP_READ : 1 elem(s) planar >> 0
  [ u8 .XXX -> +XXX] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 .XXX -> +++X] SWS_OP_SWIZZLE : 0003
  [u16 ...X -> +++X] SWS_OP_WRITE : 3 elem(s) packed >> 0
    (X = unused, + = exact, 0 = zero)

Here, the optimizer has managed to eliminate all of the unnecessary linear
operations on previously zeroed values, turn the resulting column matrix into
a swizzle operation, avoid the unnecessary dither (and round trip via float)
because the pixel values are guaranteed to be bit exact, and finally, turn
the multiplication by 65535 / 255 = 257 into a simple integer expand operation.
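
The "expand" conversion works because, for 8-bit values, multiplying by 257 is
the same as replicating the byte into both halves of the 16-bit result, so no
actual multiplier is needed. A standalone illustration (not the kernel code
itself):

    /* x * 257 == (x << 8) | x for every 8-bit x; e.g. the full-range u8
     * value 0xAB expands to the full-range u16 value 0xABAB. */
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        for (int x = 0; x <= 255; x++)
            assert((uint16_t) (x * 257) == (uint16_t) ((x << 8) | x));
        return 0;
    }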

As a final bonus, the x86 backend further optimizes this into a 12-byte shuffle:

    pshufb = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1}

    time=208 us, ref=4212 us, speedup=20.236x faster (single thread)
    time=57 us, ref=472 us, speedup=8.160x faster (multi thread)

Compiler and underlying implementation layer (SwsOpChain)
---------------------------------------------------------

While the backend API is flexible enough to permit more exotic implementations
(e.g. using JIT code generation), we establish a common set of helpers for use
in "traditional" SIMD implementations.

The basic idea is to have one "kernel" (or implementation) per operation,
and then just chain a list of these kernels together as separate function
calls. For best performance, we want to keep data in vector registers in
between function calls using a custom calling convention, thus avoiding any
unnecessary memory accesses. Additionally, we want the per-kernel overhead to
be as low as possible, with each kernel ideally just jumping directly into
the next kernel.

As a result, we arrive at a design where we first divide the image into small
chunks, or "blocks", and then dispatch the "chain" of kernels on each chunk in
sequence. Each kernel processes a fixed number of pixels, with the overall
entry point taking care of looping. Remaining pixels (the "tail") are handled
generically by the backend-invariant dispatch code (located in ops.c), using a
partial memcpy into a suitably sized temporary buffer.
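
Schematically, the backend-invariant dispatch looks something like the
following sketch (simplified, with made-up names; the actual code in ops.c is
more involved):

    /* Simplified sketch of the per-row dispatch loop; names are made up. */
    #include <string.h>

    #define EXAMPLE_BLOCK_SIZE 32 /* pixels per kernel chain invocation */

    typedef void (*example_chain_fn)(void *priv, const void *in, void *out);

    static void example_dispatch_row(example_chain_fn chain, void *priv,
                                     const unsigned char *in, unsigned char *out,
                                     int width, int in_bpp, int out_bpp)
    {
        int x = 0;

        /* Full blocks: the compiled kernel chain always processes exactly
         * EXAMPLE_BLOCK_SIZE pixels per call. */
        for (; x + EXAMPLE_BLOCK_SIZE <= width; x += EXAMPLE_BLOCK_SIZE)
            chain(priv, in + x * in_bpp, out + x * out_bpp);

        /* Tail: copy the remaining pixels into a padded temporary buffer so
         * the kernels never see a partial block. (Buffers here are sized for
         * at most 16 bytes per pixel, purely for the sake of the sketch.) */
        if (x < width) {
            unsigned char tmp_in [EXAMPLE_BLOCK_SIZE * 16] = {0};
            unsigned char tmp_out[EXAMPLE_BLOCK_SIZE * 16];
            const int remaining = width - x;
            memcpy(tmp_in, in + x * in_bpp, (size_t) remaining * in_bpp);
            chain(priv, tmp_in, tmp_out);
            memcpy(out + x * out_bpp, tmp_out, (size_t) remaining * out_bpp);
        }
    }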

To minimize the per-kernel function call overhead, we use a "continuation
passing style" for chaining kernels. Each operation computes its result and
then directly calls the next operation in the sequence, with the appropriate
internal function signature.

The C reference backend reads data onto the stack and then passes the array
pointers to the next continuation as regular function arguments:

    void process(GlobalContext *ctx, OpContext *op,
                 block_t x, block_t y, block_t z, block_t w)
    {
        for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
            // do something with x[i], y[i], z[i], w[i]
        }

        op->next(ctx, &op[1], x, y, z, w);
    }

With type conversions pushing the new data onto the stack as well:

    void convert8to16(GlobalContext *ctx, OpContext *op,
                      block_t x, block_t y, block_t z, block_t w)
    {
        /* Pseudo-code */
        u16block_t x16 = (u16block_t) x;
        u16block_t y16 = (u16block_t) y;
        u16block_t z16 = (u16block_t) z;
        u16block_t w16 = (u16block_t) w;

        op->next(ctx, &op[1], x16, y16, z16, w16);
    }

By contrast, the x86 backend always keeps the X/Y/Z/W values pinned in specific
vector registers (ymm0-ymm3 for the lower half, and ymm4-ymm7 for the second
half).

Each kernel additionally has access to a 32-byte per-op context storing the
pointer to the next kernel plus 16 bytes of arbitrary private data. This is
used during construction of the function chain to place things like small
constants.
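
Roughly, the per-op context can be pictured like this (a hypothetical layout
for illustration only; the real helper structures differ in detail):

    /* Hypothetical 32-byte per-op context. The chain is laid out as a
     * contiguous array of these, so stepping to the next operation is a
     * single pointer increment of 32 bytes. */
    #include <stdint.h>

    typedef struct ExampleOpContext {
        void (*next)(void); /* continuation: entry point of the next kernel
                             * (kernels use an internal calling convention) */
        uint8_t priv[16];   /* small per-op constants: shuffle masks, ... */
        uint8_t padding[32 - sizeof(void (*)(void)) - 16];
    } ExampleOpContext;

    _Static_assert(sizeof(ExampleOpContext) == 32,
                   "this sketch assumes 32-byte per-op contexts");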

In assembly, the per-kernel overhead looks like this:

    load $tmp, $arg1
    ...
    add $arg1, 32
    jump $tmp

This design gives vastly better performance than the alternative of returning
out to a central loop or "trampoline". This is partly because the order of
kernels within a chain is always the same, so the branch predictor can easily
remember the target address of each "jump" instruction.

The only way to realistically improve on this design would be to directly
stitch the kernel bodies together using runtime code generation.

Future considerations and limitations
-------------------------------------

My current prototype has a number of severe limitations and opportunities
for improvement:

1. It does not handle scaling at all. I am not yet entirely sure how I want
   to handle scaling; this includes handling of subsampled content. I have a
   number of vague ideas in my head, but nothing where I can say with
   certainty that it will work out well.

   It's possible that we won't come up with a perfect solution here, and will
   need to decide on which set of compromises we are comfortable accepting:

   1. Do we need the ability to scale YUV -> YUV by handling luma and chroma
      independently? When downscaling 100x100 4:2:0 to 50x50 4:4:4, should we
      support the option of reusing the chroma plane directly (even though
      this would introduce a subpixel shift for typical chroma siting)?

   Looking towards zimg, I am also thinking that we probably want to do
   scaling on floating point values, since this is best for both performance
   and accuracy, especially given that we need to go up to 32-bit
   intermediates during scaling anyway.

   So far, the most promising approach seems to be to handle subsampled
   input/output as a dedicated read/write operation type, perhaps even with a
   fixed/static subsampling kernel. To avoid compromising on performance when
   chroma resampling is not necessary, the optimizer could then relax the
   pipeline to use non-interpolating read/writes when all intermediate
   operations are component-independent.

2. Since each operation is conceptually defined on 4-component pixels, we end
   up defining a lot of variants of each implementation, one for each possible
   *subset* of components. For example, we have four different implementations
   of SWS_OP_SCALE in my current templates:

   - op_scale_1000
   - op_scale_1001
   - op_scale_1110
   - op_scale_1111

   These reflect the four arrangements of pixel components that are typically
   present (or absent); a simplified sketch of two such variants is shown
   below. While best for performance, this does turn into a bit of a chore
   when implementing these kernels.

   The only real alternative would be to either branch inside the kernel
   (bad), or to use separate kernels for each individual component and chain
   them all together. I have not yet tested whether the latter approach would
   be faster after the latest round of refactors to the kernel glue code.
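
   For illustration, the difference between two such variants might look
   roughly like this (simplified stand-alone code, not the actual template
   machinery):

      /* Simplified illustration of per-subset kernel variants. */
      #define EXAMPLE_BLOCK_SIZE 32

      typedef float example_block[EXAMPLE_BLOCK_SIZE];

      /* "1110": scale X, Y and Z; leave W (alpha) untouched. */
      static void example_scale_1110(example_block x, example_block y,
                                     example_block z, example_block w, float c)
      {
          for (int i = 0; i < EXAMPLE_BLOCK_SIZE; i++) {
              x[i] *= c;
              y[i] *= c;
              z[i] *= c;
          }
          (void) w; /* passed through untouched */
      }

      /* "1111": scale all four components, including alpha. */
      static void example_scale_1111(example_block x, example_block y,
                                     example_block z, example_block w, float c)
      {
          for (int i = 0; i < EXAMPLE_BLOCK_SIZE; i++) {
              x[i] *= c;
              y[i] *= c;
              z[i] *= c;
              w[i] *= c;
          }
      }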

3. I do not yet have any support for LUTs. But when I add them, something we
   could do is have the optimizer pass automatically "promote" a sequence of
   operations to a LUT. For example, any sequence that looks like:

   1. [u8] SWS_OP_CONVERT -> X
   2. [X]  ...  // only per-component operations
   3. [X]  SWS_OP_CONVERT -> Y
   4. [Y]  SWS_OP_WRITE

   could be replaced by a LUT with 256 entries, as sketched below. This is
   especially important for anything involving packed 8-bit input (e.g. rgb8,
   rgb4_byte).

   We also definitely want to hook this up to the existing CMS code for
   transformations between different primaries.
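
   Conceptually, such a promotion could work by evaluating the existing
   (exact) operation chain once per possible input value, as in the following
   made-up sketch:

      /* Made-up sketch of LUT promotion for a chain with a single u8 input
       * component and a packed 32-bit output. */
      #include <stdint.h>

      typedef uint32_t (*example_eval_fn)(void *priv, uint8_t value);

      static void example_build_lut(uint32_t lut[256], example_eval_fn eval,
                                    void *priv)
      {
          /* Run the original operation chain once per possible input value. */
          for (int i = 0; i < 256; i++)
              lut[i] = eval(priv, (uint8_t) i);
      }

      static void example_apply_lut(const uint32_t lut[256], const uint8_t *in,
                                    uint32_t *out, int num_pixels)
      {
          /* Everything between READ and WRITE collapses into a lookup. */
          for (int i = 0; i < num_pixels; i++)
              out[i] = lut[in[i]];
      }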

4. Because we rely on AVRational math to generate the coefficients for
   operations, we need to be able to represent all pixel values as an
   AVRational. However, this presents a challenge for 32-bit formats (e.g.
   GRAY32, RGBA128), because their maximum value exceeds INT_MAX, the largest
   value representable by an AVRational.

   It's possible we may want to introduce an AVRational64 for this, or,
   perhaps more flexibly, extend AVRational to an AVFloating type represented
   as { AVRational n; int exp; }, standing for the value n * 2^exp. This
   would preserve our ability to represent pixel values exactly, while
   opening up the range arbitrarily.
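
   A minimal sketch of the second idea (an entirely hypothetical type, not an
   existing libavutil API):

      /* Hypothetical extended rational: value = n * 2^exp. */
      #include <math.h>
      #include "libavutil/rational.h"

      typedef struct ExampleBigRational {
          AVRational n; /* significand, an exact rational */
          int exp;      /* binary exponent */
      } ExampleBigRational;

      static double example_bigrational_to_double(ExampleBigRational x)
      {
          /* ldexp(m, e) computes m * 2^e */
          return ldexp(av_q2d(x.n), x.exp);
      }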

5. Is there ever a situation where the use of floats introduces the risk of
   non-bit-exact output? Because of this risk, and for possible performance
   advantages, we may want to explore the use of a fixed-point 16-bit path as
   an alternative to the floating point math.

   So far, I have managed to avoid any bit exactness issues inside the x86
   backend by ensuring that the order of linear operations is identical
   between the C backend and the x86 backend, but this may not be practical
   to guarantee on all backends. The x86 float code is also dramatically
   faster than the old fixed point code, so I'm tentatively optimistic that a
   fixed point path will not turn out to be needed.