New swscale design to change everything (tm)
============================================

SwsGraph
--------

The entry point to the new architecture, SwsGraph is what coordinates
multiple "passes". These can include cascaded scaling passes, error diffusion
dithering, and so on. Or we could have separate passes for the vertical and
horizontal scaling. In between each SwsPass lies a fully allocated image
buffer. Graph passes may have different levels of threading, e.g. we can have
a single threaded error diffusion pass following a multi-threaded scaling
pass.

SwsGraph is internally recreated whenever the image format, dimensions or
settings change in any way. sws_scale_frame() is itself just a light-weight
wrapper that runs ff_sws_graph_create() whenever the format changes, splits
interlaced images into separate fields, and calls ff_sws_graph_run() on each.
From the point of view of SwsGraph itself, all inputs are progressive.
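
In rough pseudo-C, the wrapper logic described above could look like the
following sketch. This is illustrative only: the helper
frame_params_changed() and the ctx->graph field are assumptions, and the
ff_sws_graph_create() / ff_sws_graph_run() signatures are simplified.

    int sws_scale_frame(SwsContext *ctx, AVFrame *dst, const AVFrame *src)
    {
        int ret;

        /* Recreate the graph only when formats, dimensions or settings change */
        if (frame_params_changed(ctx, dst, src)) {                 /* hypothetical */
            ret = ff_sws_graph_create(ctx, dst, src, &ctx->graph); /* simplified */
            if (ret < 0)
                return ret;
        }

        if (src->flags & AV_FRAME_FLAG_INTERLACED) {
            /* Split interlaced input into fields; SwsGraph only ever sees
             * progressive images */
            for (int field = 0; field < 2; field++)
                ff_sws_graph_run(ctx->graph, dst, src, field);     /* simplified */
        } else {
            ff_sws_graph_run(ctx->graph, dst, src, 0);
        }

        return 0;
    }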

SwsOp / SwsOpList
-----------------

This is the newly introduced abstraction layer between the high-level format
handling logic and the low-level backing implementation. Each SwsOp is
designed to be as small and atomic as possible, with the possible exception
of the read / write operations due to their numerous variants.

The basic idea is to split logic between three major components:

1. The high-level format "business logic", which generates in a very naive
   way a sequence of operations guaranteed to get you from point A to point
   B. This logic is written with correctness in mind only, ignoring any
   performance concerns or low-level implementation decisions. Semantically,
   everything is always decoded from the input format to normalized (real
   valued) RGB, and then encoded back to the output format.

   This code lives in libswscale/format.c

2. The optimizer. This is where the "magic" happens, so to speak. The
   optimizer's job is to take the abstract sequence of operations produced by
   the high-level format analysis code and incrementally optimize it. Each
   optimization step is designed to be minute and provably lossless, or
   otherwise guarded behind the BITEXACT flag. This ensures that the
   resulting output is always identical, no matter how many layers of
   optimization we add.

   This code lives in libswscale/ops.c

3. The compiler. Once we have a sequence of operations as output by the
   optimizer, we "compile" it down to a callable function. This is then
   applied by the dispatch wrapper by striping it over the input image. See
   libswscale/ops_backend.c for the reference backend, or
   libswscale/x86/ops.c for a more complex SIMD example. The overall flow is
   sketched right below this list.
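
Put together, the life cycle of a conversion pass roughly follows this shape.
The sketch below is pseudo-C for illustration only; the function and type
names (generate_ops, optimize_ops, compile_ops, run_compiled, CompiledChain)
are stand-ins, not the actual libswscale API:

    /* Illustrative pseudo-C of the three-stage pipeline; all names are
     * hypothetical stand-ins for the real format.c / ops.c / backend code. */
    static int convert_image(const Image *src, Image *dst, const Backend *backend)
    {
        SwsOpList *ops = generate_ops(src->format, dst->format); /* 1. format.c */
        optimize_ops(ops);                                       /* 2. ops.c    */
        CompiledChain chain = compile_ops(ops, backend);         /* 3. backend  */

        /* the dispatch wrapper then stripes the compiled chain over the image */
        return run_compiled(chain, src, dst);
    }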

This overall approach has a considerable number of benefits:

1. It allows us to verify correctness of logic and spot semantic errors at a
   very high level, by simply looking at the sequence of operations
   (available by default at debug / verbose log level), without having to dig
   through the multiple levels of complicated, interwoven format handling
   code that is legacy swscale.

2. Because most of the brains live inside the powerful optimizer, we get fast
   paths "for free" for any suitable format conversion, rather than having to
   enumerate them one by one. SIMD code itself can be written in a very
   general way and does not need to be tied to specific pixel formats -
   subsequent low-level implementations can be strung together without much
   overhead.

3. We can in the future, with relative ease, compile these operations down to
   SPIR-V (or even LLVM IR) and generate efficient GPU or target-machine
   specific implementations. This also opens the window for adding hardware
   frame support to libswscale, and even transparently using GPU acceleration
   for CPU frames.

4. Platform-specific SIMD can be reduced down to a comparatively small set of
   optimized routines, while still providing 100% coverage for all possible
   pixel formats and operations. (As of writing, the x86 example backend has
   about 60 unique implementations, of which 20 are trivial swizzles, 10 are
   read/write ops, 10 are pixel type conversions and the remaining 20 are the
   various logic/arithmetic ops.)

5. Backends hide behind a layer of abstraction offering them a considerable
   deal of flexibility in how they want to implement their operations. For
   example, the x86 backend has a dedicated function for compiling compatible
   operations down to a single in-place pshufb instruction.

   Platform-specific low-level data is self-contained within its own setup()
   function and private data structure, eliminating all reads into SwsContext
   as well as the possibility of conflicts between platforms.

6. We can compute an exact reference result for each operation with fixed
   precision (ff_sws_op_apply_q), and use that to e.g. measure the amount of
   error introduced by dithering, or even catch bugs in the reference C
   implementation. (In theory - currently checkasm just compares against C.)

Examples of SwsOp in action
---------------------------

For illustration, here is the sequence of operations currently generated by
my prototype, for a conversion from RGB24 to YUV444P:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ _ _ 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+alpha [[1/255 0 0 0 0] [0 1/255 0 0 0] [0 0 1/255 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[0.299000 0.587000 0.114000 0 0] [-0.168736 -0.331264 1/2 0 0] [1/2 -0.418688 -57/701 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3+off3 [[219 0 0 0 16] [0 224 0 0 128] [0 0 224 0 128] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_DITHER  : 16x16 matrix
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {255 255 255 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u8
  [ u8 .... -> ....] SWS_OP_LSHIFT  : << 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_WRITE   : 3 elem(s) planar >> 0

This is optimized into the following sequence:

Optimized operation list:
  [ u8 XXXX -> +++X] SWS_OP_READ    : 3 elem(s) packed >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR  : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER  : 16x16 matrix
  [f32 ...X -> +++X] SWS_OP_CONVERT : f32 -> u8
  [ u8 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) planar >> 0

    (X = unused, + = exact, 0 = zero)

The extra metadata on the left of the operation list is just a dump of the
internal state used by the optimizer during optimization. It keeps track of
knowledge about the pixel values, such as their value range, whether or not
they're exact integers, and so on.

In this example, you can see that the input values are exact (except for the
alpha channel, which is undefined), until the first SWS_OP_LINEAR multiplies
them by a noninteger constant. They regain their exact integer status only
after the (truncating) conversion to u8 in the output step.
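
One of the most important rewrites visible above is the fusion of the three
consecutive SWS_OP_LINEAR operations into a single matrix3+off3 op. Each
linear op is an affine transform y = Mx + c, which can be stored as a 4x5
matrix [M | c]; composing two such ops yields another 4x5 matrix. Below is a
minimal sketch of that composition, using plain doubles for brevity (the
optimizer itself works on exact AVRational coefficients):

    /* Compose two affine pixel transforms, each stored as a 4x5 matrix
     * [M | c] (4x4 matrix plus offset column), so that applying the result
     * equals applying `first` and then `second`. Illustrative only; `out`
     * must not alias either input. */
    static void compose_linear(double out[4][5],
                               const double second[4][5],
                               const double first[4][5])
    {
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 5; j++) {
                /* the offset column also picks up second's own offset */
                double sum = (j == 4) ? second[i][4] : 0.0;
                for (int k = 0; k < 4; k++)
                    sum += second[i][k] * first[k][j];
                out[i][j] = sum;
            }
        }
    }

Folding the 1/255 normalization, the RGB -> YCbCr matrix and the range
scaling from the unoptimized list through this composition reproduces the
fused matrix shown above, e.g. 219 * 0.299 / 255 = 0.256788 for the first
coefficient.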

Example of more aggressive optimization
---------------------------------------

Conversion pass for gray -> rgb48:

Unoptimized operation list:
  [ u8 .... -> ....] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .... -> ....] SWS_OP_SWIZZLE : 0123
  [ u8 .... -> ....] SWS_OP_RSHIFT  : >> 0
  [ u8 .... -> ....] SWS_OP_CLEAR   : {_ 0 0 0}
  [ u8 .... -> ....] SWS_OP_CONVERT : u8 -> f32
  [f32 .... -> ....] SWS_OP_LINEAR  : luma+alpha [[1/255 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 1 1]]
  [f32 .... -> ....] SWS_OP_LINEAR  : matrix3 [[1 0 701/500 0 0] [1 -0.344136 -0.714136 0 0] [1 443/250 0 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_LINEAR  : diag3 [[65535 0 0 0 0] [0 65535 0 0 0] [0 0 65535 0 0] [0 0 0 1 0]]
  [f32 .... -> ....] SWS_OP_MAX     : {0 0 0 0} <= x
  [f32 .... -> ....] SWS_OP_MIN     : x <= {65535 65535 65535 _}
  [f32 .... -> ....] SWS_OP_CONVERT : f32 -> u16
  [u16 .... -> ....] SWS_OP_LSHIFT  : << 0
  [u16 .... -> ....] SWS_OP_SWIZZLE : 0123
  [u16 .... -> ....] SWS_OP_WRITE   : 3 elem(s) packed >> 0

Optimized operation list:
  [ u8 XXXX -> +XXX] SWS_OP_READ    : 1 elem(s) planar >> 0
  [ u8 .XXX -> +XXX] SWS_OP_CONVERT : u8 -> u16 (expand)
  [u16 .XXX -> +++X] SWS_OP_SWIZZLE : 0003
  [u16 ...X -> +++X] SWS_OP_WRITE   : 3 elem(s) packed >> 0

    (X = unused, + = exact, 0 = zero)

Here, the optimizer has managed to eliminate all of the unnecessary linear
operations on previously zeroed values, turn the resulting column matrix into
a swizzle operation, avoid the unnecessary dither (and round trip via float)
because the pixel values are guaranteed to be bit exact, and finally, turn
the multiplication by 65535 / 255 = 257 into a simple integer expand
operation.

As a final bonus, the x86 backend further optimizes this into a 12-byte
shuffle:

  pshufb = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1}

  time=208 us, ref=4212 us, speedup=20.236x faster (single thread)
  time=57 us,  ref=472 us,  speedup=8.160x faster  (multi thread)
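
The "expand" conversion is what makes this possible: scaling a full-range
8-bit value to full-range 16-bit is a multiplication by 65535 / 255 = 257,
and multiplying a byte by 257 simply duplicates it into both bytes of the
result. A scalar illustration of the equivalence:

    #include <stdint.h>

    /* Full-range u8 -> u16 "expand": v * 257 == (v << 8) | v, i.e. byte
     * duplication. The 12-byte pshufb above performs exactly this
     * duplication for the R, G and B words of two output pixels at once. */
    static inline uint16_t expand_u8_to_u16(uint8_t v)
    {
        return (uint16_t)((v << 8) | v); /* == v * 257 */
    }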

Compiler and underlying implementation layer (SwsOpChain)
---------------------------------------------------------

While the backend API is flexible enough to permit more exotic
implementations (e.g. using JIT code generation), we establish a common set
of helpers for use in "traditional" SIMD implementations.

The basic idea is to have one "kernel" (or implementation) per operation, and
then just chain a list of these kernels together as separate function calls.
For best performance, we want to keep data in vector registers in between
function calls using a custom calling convention, thus avoiding any
unnecessary memory accesses. Additionally, we want the per-kernel overhead to
be as low as possible, with each kernel ideally just jumping directly into
the next kernel.

As a result, we arrive at a design where we first divide the image into small
chunks, or "blocks", and then dispatch the "chain" of kernels on each chunk
in sequence. Each kernel processes a fixed number of pixels, with the overall
entry point taking care of looping. Remaining pixels (the "tail") are handled
generically by the backend-invariant dispatch code (located in ops.c), using
a partial memcpy into a suitably sized temporary buffer.
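
A rough sketch of that dispatch, for a single line of pixels. The names and
signatures here (run_blocks, pixel_t) are assumptions for illustration; the
real code in ops.c also has to deal with planes, strides and so on:

    #include <string.h>

    static void dispatch_line(const SwsOpChain *chain, const pixel_t *src,
                              pixel_t *dst, int width)
    {
        const int blocks = width / SWS_BLOCK_SIZE;
        const int tail   = width % SWS_BLOCK_SIZE;

        /* the compiled entry point loops over all complete blocks itself */
        run_blocks(chain, src, dst, blocks);

        if (tail) {
            /* leftover pixels: copy into a block-sized temporary, process
             * one full block, then copy only the valid part back out */
            pixel_t tmp_src[SWS_BLOCK_SIZE], tmp_dst[SWS_BLOCK_SIZE];
            memcpy(tmp_src, src + blocks * SWS_BLOCK_SIZE, tail * sizeof(*src));
            run_blocks(chain, tmp_src, tmp_dst, 1);
            memcpy(dst + blocks * SWS_BLOCK_SIZE, tmp_dst, tail * sizeof(*dst));
        }
    }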

To minimize the per-kernel function call overhead, we use a "continuation
passing style" for chaining kernels. Each operation computes its result and
then directly calls the next operation in the sequence, with the appropriate
internal function signature.

The C reference backend reads data onto the stack and then passes the array
pointers to the next continuation as regular function arguments:

    void process(GlobalContext *ctx, OpContext *op,
                 block_t x, block_t y, block_t z, block_t w)
    {
        for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
            /* do something with x[i], y[i], z[i], w[i] */
        }

        op->next(ctx, &op[1], x, y, z, w);
    }

With type conversions pushing the new data onto the stack as well:

    void convert8to16(GlobalContext *ctx, OpContext *op,
                      block_t x, block_t y, block_t z, block_t w)
    {
        /* Pseudo-code */
        u16block_t x16 = (u16block_t) x;
        u16block_t y16 = (u16block_t) y;
        u16block_t z16 = (u16block_t) z;
        u16block_t w16 = (u16block_t) w;

        op->next(ctx, &op[1], x16, y16, z16, w16);
    }

By contrast, the x86 backend always keeps the X/Y/Z/W values pinned in
specific vector registers (ymm0-ymm3 for the lower half, and ymm4-ymm7 for
the second half).

Each kernel additionally has access to a 32 byte per-op context storing the
pointer to the next kernel plus 16 bytes of arbitrary private data. This is
used during construction of the function chain to place things like small
constants.
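
Illustratively, such a per-op context could look roughly like this. The exact
layout, padding and field names are assumptions, not the actual definition in
the tree; the important property is the fixed 32-byte stride, which is what
lets "&op[1]" in the C example above and "add $arg1, 32" in the assembly
below step to the next kernel:

    typedef struct OpContext OpContext;

    typedef void (*op_fn)(GlobalContext *ctx, OpContext *op,
                          block_t x, block_t y, block_t z, block_t w);

    struct OpContext {
        op_fn   next;     /* entry point of the next kernel in the chain */
        uint8_t pad[8];   /* padding up to a 32-byte stride (assumed)    */
        uint8_t priv[16]; /* per-op private data: small constants,
                           * shuffle masks, dither offsets, ...          */
    };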

In assembly, the per-kernel overhead looks like this:

    load $tmp, $arg1
    ...
    add  $arg1, 32
    jump $tmp

This design gives vastly better performance than the alternative of returning
out to a central loop or "trampoline". This is partly because the order of
kernels within a chain is always the same, so the branch predictor can easily
remember the target address of each "jump" instruction.

The only way to realistically improve on this design would be to directly
stitch the kernel bodies together using runtime code generation.
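
For contrast, the rejected "trampoline" alternative would look roughly like
the sketch below (OpEntry and BlockState are hypothetical names used only for
illustration). Every kernel returns to a central loop, so the single indirect
call site keeps cycling through different targets, which is much harder on
the branch predictor than the fixed jump chain, and each call/return pair
adds overhead of its own:

    /* Hypothetical central dispatch loop, shown only for comparison */
    static void run_chain_trampoline(GlobalContext *ctx, const OpEntry *ops,
                                     int nb_ops, BlockState *blk)
    {
        for (int i = 0; i < nb_ops; i++)
            ops[i].kernel(ctx, &ops[i], blk); /* each kernel returns here */
    }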

Future considerations and limitations
-------------------------------------

My current prototype has a number of severe limitations and opportunities for
improvement:

1. It does not handle scaling at all. I am not yet entirely sure how I want
   to handle scaling; this includes handling of subsampled content. I have a
   number of vague ideas in my head, but nothing where I can say with
   certainty that it will work out well.

   It's possible that we won't come up with a perfect solution here, and will
   need to decide which set of compromises we are comfortable accepting:

   1. Do we need the ability to scale YUV -> YUV by handling luma and chroma
      independently? When downscaling 100x100 4:2:0 to 50x50 4:4:4, should we
      support the option of reusing the chroma plane directly (even though
      this would introduce a subpixel shift for typical chroma siting)?

   Looking towards zimg, I am also thinking that we probably want to do
   scaling on floating point values, since this is best for both performance
   and accuracy, especially given that we need to go up to 32-bit
   intermediates during scaling anyway.

   So far, the most promising approach seems to be to handle subsampled
   input/output as a dedicated read/write operation type, perhaps even with a
   fixed/static subsampling kernel. To avoid compromising on performance when
   chroma resampling is not necessary, the optimizer could then relax the
   pipeline to use non-interpolating read/writes when all intermediate
   operations are component-independent.

2. Since each operation is conceptually defined on 4-component pixels, we end
   up defining a lot of variants of each implementation for each possible
   *subset*. For example, we have four different implementations for
   SWS_OP_SCALE in my current templates:

   - op_scale_1000
   - op_scale_1001
   - op_scale_1110
   - op_scale_1111

   This reflects the four different arrangements of pixel components that are
   typically present (or absent). While best for performance, it does turn
   into a bit of a chore when implementing these kernels.

   The only real alternative would be to either branch inside the kernel
   (bad), or to use separate kernels for each individual component and chain
   them all together. I have not yet tested whether the latter approach would
   be faster after the latest round of refactors to the kernel glue code.

3. I do not yet have any support for LUTs. But when I add them, something we
   could do is have the optimizer automatically "promote" a sequence of
   operations to LUTs. For example, any sequence that looks like:

   1. [u8] SWS_OP_CONVERT -> X
   2. [X]  ... // only per-component operations
   3. [X]  SWS_OP_CONVERT -> Y
   4. [Y]  SWS_OP_WRITE

   could be replaced by a LUT with 256 entries; see the sketch at the end of
   this document. This is especially important for anything involving packed
   8-bit input (e.g. rgb8, rgb4_byte).

   We also definitely want to hook this up to the existing CMS code for
   transformations between different primaries.

4. Because we rely on AVRational math to generate the coefficients for
   operations, we need to be able to represent all pixel values as an
   AVRational. However, this presents a challenge for 32-bit formats (e.g.
   GRAY32, RGBA128), because their maximum value exceeds INT_MAX, the largest
   value representable by an AVRational.

   It's possible we may want to introduce an AVRational64 for this, or
   perhaps more flexibly, extend AVRational to an AVFloating type represented
   as { AVRational n; int exp; } and denoting the value n * 2^exp. This would
   preserve our ability to represent all pixel values exactly, while opening
   up the range arbitrarily.

5. Is there ever a situation where the use of floats introduces the risk of
   non bit-exact output? Because of this risk, and for possible performance
   advantages, we may want to explore the use of a fixed-point 16-bit path as
   an alternative to the floating point math.

   So far, I have managed to avoid any bit exactness issues inside the x86
   backend by ensuring that the order of linear operations is identical
   between the C backend and the x86 backend, but this may not be practical
   to guarantee on all backends. The x86 float code is also dramatically
   faster than the old fixed point code, so I'm tentatively optimistic that a
   fixed-point path will not be needed.
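
To make the LUT promotion idea from point 3 a little more concrete, here is a
minimal sketch. It assumes a hypothetical helper apply_ops_exact() that
evaluates an operation list for a single input value (e.g. built on top of
ff_sws_op_apply_q); none of the names below are actual libswscale API:

    #include <stdint.h>

    /* Hypothetical LUT promotion for a chain whose input is a single 8-bit
     * component: evaluate the operation list exactly once for each of the
     * 256 possible inputs and store the packed result. */
    static void build_lut8(uint32_t lut[256], const SwsOpList *ops)
    {
        for (int v = 0; v < 256; v++)
            lut[v] = apply_ops_exact(ops, v); /* packed output for input v */
    }

    /* The entire conversion pass then degenerates to: dst[i] = lut[src[i]]; */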