Overview

Request 1963 (accepted)

Updated to 1.7

Submit package home:Aloysius:branches:Essentials / x265 to package Essentials / x265

x265.changes Changed
@@ -1,4 +1,45 @@
 -------------------------------------------------------------------
+Fri May 29 09:11:02 UTC 2015 - aloisio@gmx.com
+
+- soname bump to 59
+- Update to version 1.7
+  * large amount of assembly code optimizations
+  * some preliminary support for high dynamic range content
+  * improvements for multi-library support
+  * some new quality features
+    (full documentation at: http://x265.readthedocs.org/en/1.7)
+  * This release simplifies the multi-library support introduced
+    in version 1.6. Any libx265 can now forward API requests to
+    other installed libx265 libraries (by name) so applications
+    like ffmpeg and the x265 CLI can select between 8bit and 10bit
+    encodes at runtime without the need of a shim library or
+    library load path hacks. See --output-depth, and
+    http://x265.readthedocs.org/en/1.7/api.html#multi-library-interface
+  * For quality, x265 now allows you to configure the quantization
+    group size smaller than the CTU size (for finer grained AQ
+    adjustments). See --qg-size.
+  * x265 now supports limited mid-encode reconfigure via a new public
+    method: x265_encoder_reconfig()
+  * For HDR, x265 now supports signaling the SMPTE 2084 color transfer
+    function, the SMPTE 2086 mastering display color primaries, and the
+    content light levels. See --master-display, --max-cll
+  * x265 will no longer emit any non-conformant bitstreams unless
+    --allow-non-conformance is specified.
+  * The x265 CLI now supports a simple encode preview feature. See
+    --recon-y4m-exec.
+  * The AnnexB NAL headers can now be configured off, via x265_param.bAnnexB
+    This is not configurable via the CLI because it is a function of the
+    muxer being used, and the CLI only supports raw output files. See
+    --annexb
+  Misc:
+  * --lossless encodes are now signaled as level 8.5
+  * --profile now has a -P short option
+  * The regression scripts used by x265 are now public, and can be found at:
+    https://bitbucket.org/sborho/test-harness
+  * x265's cmake scripts now support PGO builds, the test-harness can be
+    used to drive the profile-guided build process.
+
+-------------------------------------------------------------------
 Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com
 
 - soname bumped to 51

x265.spec Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name:           x265
-%define soname  51
+%define soname  59
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.6
+Version:        1.7
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary

baselibs.conf Changed
@@ -1,1 +1,1 @@
-libx265-51
+libx265-59

x265_1.6.tar.gz/.hg_archival.txt -> x265_1.7.tar.gz/.hg_archival.txt Changed
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
+node: 8425278def1edf0931dc33fc518e1950063e76b0
 branch: stable
-tag: 1.6
+tag: 1.7

x265_1.6.tar.gz/.hgtags -> x265_1.7.tar.gz/.hgtags Changed
@@ -14,3 +14,4 @@
 c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
 9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
+cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6

x265_1.6.tar.gz/doc/reST/api.rst -> x265_1.7.tar.gz/doc/reST/api.rst Changed
@@ -171,8 +171,26 @@
     *      how x265_encoder_open has changed the parameters.
     *      note that the data accessible through pointers in the returned param struct
     *      (e.g. filenames) should not be modified by the calling application. */
-   void x265_encoder_parameters(x265_encoder *, x265_param *);                                                                      
-
+   void x265_encoder_parameters(x265_encoder *, x265_param *);
+
+**x265_encoder_reconfig()** may be used to reconfigure encoder parameters mid-encode::
+
+   /* x265_encoder_reconfig:
+    *       used to modify encoder parameters.
+    *      various parameters from x265_param are copied.
+    *      this takes effect immediately, on whichever frame is encoded next;
+    *      returns 0 on success, negative on parameter validation error.
+    *
+    *      not all parameters can be changed; see the actual function for a
+    *      detailed breakdown.  since not all parameters can be changed, moving
+    *      from preset to preset may not always fully copy all relevant parameters,
+    *      but should still work usably in practice. however, more so than for
+    *      other presets, many of the speed shortcuts used in ultrafast cannot be
+    *      switched out of; using reconfig to switch between ultrafast and other
+    *      presets is not recommended without a more fine-grained breakdown of
+    *      parameters to take this into account. */
+   int x265_encoder_reconfig(x265_encoder *, x265_param *);
+
 Pictures
 ========
 
@@ -352,7 +370,7 @@
 Multi-library Interface
 =======================
 
-If your application might want to make a runtime selection between among
+If your application might want to make a runtime selection between
 a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
 want to use the multi-library interface.
 
@@ -370,13 +388,34 @@
      *   libx265 */
     const x265_api* x265_api_get(int bitDepth);
 
-The general idea is to request the API for the bitDepth you would prefer
-the encoder to use (8 or 10), and if that returns NULL you request the
-API for bitDepth=0, which returns the system default libx265.
-
 Note that using this multi-library API in your application is only the
-first step. Next your application must dynamically link to libx265 and
-then you must build and install a multi-lib configuration of libx265,
-which includes 8bpp and 16bpp builds of libx265 and a shim library which
-forwards x265_api_get() calls to the appropriate library using dynamic
-loading and binding.
+first step.
+
+Your application must link to one build of libx265 (statically or 
+dynamically) and this linked version of libx265 will support one 
+bit-depth (8 or 10 bits). 
+
+Your application must now request the API for the bitDepth you would 
+prefer the encoder to use (8 or 10). If the requested bitdepth is zero, 
+or if it matches the bitdepth of the system default libx265 (the 
+currently linked library), then this library will be used for encode.
+If you request a different bit-depth, the linked libx265 will attempt 
+to dynamically bind a shared library with a name appropriate for the 
+requested bit-depth:
+
+    8-bit:  libx265_main.dll
+    10-bit: libx265_main10.dll
+
+    (the shared library extension is obviously platform specific. On
+    Linux it is .so while on Mac it is .dylib)
+
+For example on Windows, one could package together an x265.exe
statically linked against the 8bpp libx265 together with a
+libx265_main10.dll in the same folder, and this executable would be able
+to encode main and main10 bitstreams.
+
+On Linux, x265 packagers could install 8bpp static and shared libraries
+under the name libx265 (so all applications link against 8bpp libx265)
+and then also install libx265_main10.so (symlinked to its numbered solib).
+Thus applications which use x265_api_get() will be able to generate main
+or main10 bitstreams.

x265_1.6.tar.gz/doc/reST/cli.rst -> x265_1.7.tar.gz/doc/reST/cli.rst Changed
@@ -159,6 +159,13 @@
    handled implicitly.
 
    One may also directly supply the CPU capability bitmap as an integer.
+
+   Note that by specifying this option you are overriding x265's CPU
+   detection and it is possible to do this wrong. You can cause encoder
+   crashes by specifying SIMD architectures which are not supported on
+   your CPU.
+
+   Default: auto-detected SIMD architectures
 
 .. option:: --frame-threads, -F <integer>
 
@@ -171,7 +178,7 @@
    Over-allocation of frame threads will not improve performance, it
    will generally just increase memory use.
 
-   **Values:** any value between 8 and 16. Default is 0, auto-detect
+   **Values:** any value between 0 and 16. Default is 0, auto-detect
 
 .. option:: --pools <string>, --numa-pools <string>
 
@@ -201,11 +208,11 @@
    their node, they will not be allowed to migrate between nodes, but they
    will be allowed to move between CPU cores within their node.
 
-   If the three pool features: :option:`--wpp` :option:`--pmode` and
-   :option:`--pme` are all disabled, then :option:`--pools` is ignored
-   and no thread pools are created.
+   If the four pool features: :option:`--wpp`, :option:`--pmode`,
+   :option:`--pme` and :option:`--lookahead-slices` are all disabled,
+   then :option:`--pools` is ignored and no thread pools are created.
 
-   If "none" is specified, then all three of the thread pool features are
+   If "none" is specified, then all four of the thread pool features are
    implicitly disabled.
 
    Multiple thread pools will be allocated for any NUMA node with more than
@@ -217,9 +224,22 @@
    :option:`--frame-threads`.  The pools are used for WPP and for
    distributed analysis and motion search.
 
+   On Windows, the native APIs offer sufficient functionality to
+   discover the NUMA topology and enforce the thread affinity that
+   libx265 needs (so long as you have not chosen to target XP or
+   Vista), but on POSIX systems it relies on libnuma for this
+   functionality. If your target POSIX system is single socket, then
+   building without libnuma is a perfectly reasonable option, as it
+   will have no effect on the runtime behavior. On a multiple-socket
+   system, a POSIX build of libx265 without libnuma will be less work
+   efficient. See :ref:`thread pools <pools>` for more detail.
+
    Default "", one thread is allocated per detected hardware thread
    (logical CPU cores) and one thread pool per NUMA node.
 
+   Note that the string value will need to be escaped or quoted to
+   protect against shell expansion on many platforms
+
 .. option:: --wpp, --no-wpp
 
    Enable Wavefront Parallel Processing. The encoder may begin encoding
@@ -399,10 +419,20 @@
 
    **CLI ONLY**
 
+.. option:: --output-depth, -D 8|10
+
+   Bitdepth of output HEVC bitstream, which is also the internal bit
+   depth of the encoder. If the requested bit depth is not the bit
+   depth of the linked libx265, it will attempt to bind libx265_main
+   for an 8bit encoder, or libx265_main10 for a 10bit encoder, with the
+   same API version as the linked libx265.
+
+   **CLI ONLY**
+
 Profile, Level, Tier
 ====================
 
-.. option:: --profile <string>
+.. option:: --profile, -P <string>
 
    Enforce the requirements of the specified profile, ensuring the
    output stream will be decodable by a decoder which supports that
@@ -437,7 +467,7 @@
    times 10, for example level **5.1** is specified as "5.1" or "51",
    and level **5.0** is specified as "5.0" or "50".
 
-   Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2
+   Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2, 8.5
 
 .. option:: --high-tier, --no-high-tier
 
@@ -464,11 +494,22 @@
    HEVC specification.  If x265 detects that the total reference count
    is greater than 8, it will issue a warning that the resulting stream
    is non-compliant and it signals the stream as profile NONE and level
-   NONE but still allows the encode to continue.  Compliant HEVC
+   NONE and will abort the encode unless
+   :option:`--allow-non-conformance` it specified.  Compliant HEVC
    decoders may refuse to decode such streams.
 
    Default 3
 
+.. option:: --allow-non-conformance, --no-allow-non-conformance
+
+   Allow libx265 to generate a bitstream with profile and level NONE.
+   By default it will abort any encode which does not meet strict level
+   compliance. The two most likely causes for non-conformance are
+   :option:`--ctu` being too small, :option:`--ref` being too high,
+   or the bitrate or resolution being out of specification.
+
+   Default: disabled
+
 .. note::
    :option:`--profile`, :option:`--level-idc`, and
    :option:`--high-tier` are only intended for use when you are
@@ -476,7 +517,7 @@
    limitations and must constrain the bitstream within those limits.
    Specifying a profile or level may lower the encode quality
    parameters to meet those requirements but it will never raise
-   them.
+   them. It may enable VBV constraints on a CRF encode.
 
 Mode decision / Analysis
 ========================
@@ -1111,6 +1152,14 @@
 
    **Range of values:** 0.0 to 3.0
 
+.. option:: --qg-size <64|32|16>
+
+   Enable adaptive quantization for sub-CTUs. This parameter specifies 
+   the minimum CU size at which QP can be adjusted, ie. Quantization Group
+   size. Allowed range of values are 64, 32, 16 provided this falls within 
+   the inclusive range [maxCUSize, minCUSize]. Experimental.
+   Default: same as maxCUSize
+
 .. option:: --cutree, --no-cutree
 
    Enable the use of lookahead's lowres motion vector fields to
@@ -1162,12 +1211,12 @@
 .. option:: --strict-cbr, --no-strict-cbr
    
    Enables stricter conditions to control bitrate deviance from the 
-   target bitrate in CBR mode. Bitrate adherence is prioritised
+   target bitrate in ABR mode. Bit rate adherence is prioritised
    over quality. Rate tolerance is reduced to 50%. Default disabled.
    
    This option is for use-cases which require the final average bitrate 
-   to be within very strict limits of the target - preventing overshoots 
-   completely, and achieve bitrates within 5% of target bitrate, 
+   to be within very strict limits of the target; preventing overshoots, 
+   while keeping the bit rate within 5% of the target setting, 
    especially in short segment encodes. Typically, the encoder stays 
    conservative, waiting until there is enough feedback in terms of 
    encoded frames to control QP. strict-cbr allows the encoder to be 
@@ -1209,7 +1258,7 @@
    lookahead).  Default value is 0.6. Increasing it to 1 will
    effectively generate CQP
 
-.. option:: --qstep <integer>
+.. option:: --qpstep <integer>
 
    The maximum single adjustment in QP allowed to rate control. Default
    4
@@ -1451,9 +1500,48 @@
    specification for a description of these values. Default undefined
    (not signaled)
 
+.. option:: --master-display <string>
+
+   SMPTE ST 2086 mastering display color volume SEI info, specified as
+   a string which is parsed when the stream header SEI are emitted. The
+   string format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
+   where %hu are unsigned 16bit integers and %u are unsigned 32bit
+   integers. The SEI includes X,Y display primaries for RGB channels,
+   white point X,Y and max,min luminance values. (HDR)
+
+   Example for P65D3 1000-nits:
+
+       G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
+
+   Note that this string value will need to be escaped or quoted to
+   protect against shell expansion on many platforms. No default.
+
+.. option:: --max-cll <string>
+
+   Maximum content light level and maximum frame average light level as
+   required by the Consumer Electronics Association 861.3 specification.
+
+   Specified as a string which is parsed when the stream header SEI are
+   emitted. The string format is "%hu,%hu" where %hu are unsigned 16bit
+   integers. The first value is the max content light level (or 0 if no
+   maximum is indicated), the second value is the maximum picture
+   average light level (or 0). (HDR)
+
+   Note that this string value will need to be escaped or quoted to
+   protect against shell expansion on many platforms. No default.
+
 Bitstream options
 =================
 
+.. option:: --annexb, --no-annexb
+
+   If enabled, x265 will produce Annex B bitstream format, which places
+   start codes before NAL. If disabled, x265 will produce file format,
+   which places length before NAL. x265 CLI will choose the right option
+   based on output format. Default enabled
+
+   **API ONLY**
+
 .. option:: --repeat-headers, --no-repeat-headers
 
    If enabled, x265 will emit VPS, SPS, and PPS headers with every
@@ -1498,8 +1586,8 @@
 
    Enable a temporal sub layer. All referenced I/P/B frames are in the
    base layer and all unreferenced B frames are placed in a temporal
-   sublayer. A decoder may chose to drop the sublayer and only decode
-   and display the base layer slices.
+   enhancement layer. A decoder may chose to drop the enhancement layer 
+   and only decode and display the base layer slices.
 
    If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes`
    3 then the two layers evenly split the frame rate, with a cadence of
@@ -1525,4 +1613,20 @@
 
    **CLI ONLY**
 
+.. option:: --recon-y4m-exec <string>
+
+   If you have an application which can play a Y4MPEG stream received
+   on stdin, the x265 CLI can feed it reconstructed pictures in display
+   order.  The pictures will have no timing info, obviously, so the
+   picture timing will be determined primarily by encoding elapsed time
+   and latencies, but it can be useful to preview the pictures being
+   output by the encoder to validate input settings and rate control
+   parameters.
+
+   Example command for ffplay (assuming it is in your PATH):
+
+   --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+
+   **CLI ONLY**
+
 .. vim: noet

x265_1.6.tar.gz/doc/reST/threading.rst -> x265_1.7.tar.gz/doc/reST/threading.rst Changed
@@ -2,6 +2,8 @@
 Threading
 *********
 
+.. _pools:
+
 Thread Pools
 ============
 
@@ -31,6 +33,18 @@
 expected to drop that job so the worker thread may go back to the pool
 and find more work.
 
+On Windows, the native APIs offer sufficient functionality to discover
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder from only using the
+threads of a single socket and so you incur a heavier context switching
+cost.
+
 Wavefront Parallel Processing
 =============================
 
@@ -225,6 +239,7 @@
 lowres cost analysis to worker threads. It will use bonded task groups
 to perform batches of frame cost estimates, and it may optionally use
 bonded task groups to measure single frame cost estimates using slices.
+(see :option:`--lookahead-slices`)
 
 The function slicetypeDecide() itself is also be performed by a worker
 thread if your encoder has a thread pool, else it runs within the

x265_1.6.tar.gz/readme.rst -> x265_1.7.tar.gz/readme.rst Changed
@@ -3,7 +3,7 @@
 =================
 
 | **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
-| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_ 
+| **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_ 
 | **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
 
 `x265 <https://www.videolan.org/developers/x265.html>`_ is an open

x265_1.6.tar.gz/source/CMakeLists.txt -> x265_1.7.tar.gz/source/CMakeLists.txt Changed
@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 51)
+set(X265_BUILD 59)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -65,15 +65,19 @@
     if(LIBRT)
         list(APPEND PLATFORM_LIBS rt)
     endif()
+    find_library(LIBDL dl)
+    if(LIBDL)
+        list(APPEND PLATFORM_LIBS dl)
+    endif()
     find_package(Numa)
     if(NUMA_FOUND)
-        list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY})
+        link_directories(${NUMA_LIBRARY_DIR})
+        list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
         check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
         if(NUMA_V2)
             add_definitions(-DHAVE_LIBNUMA)
             message(STATUS "libnuma found, building with support for NUMA nodes")
-            list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY})
-            link_directories(${NUMA_LIBRARY_DIR})
+            list(APPEND PLATFORM_LIBS numa)
             include_directories(${NUMA_INCLUDE_DIR})
         endif()
     endif()
@@ -90,7 +94,7 @@
 if(CMAKE_GENERATOR STREQUAL "Xcode")
   set(XCODE 1)
 endif()
-if (APPLE)
+if(APPLE)
   add_definitions(-DMACOS)
 endif()
 
@@ -196,6 +200,7 @@
         add_definitions(-static)
         list(APPEND LINKER_OPTIONS "-static")
     endif(STATIC_LINK_CRT)
+    check_cxx_compiler_flag(-Wno-strict-overflow CC_HAS_NO_STRICT_OVERFLOW)
     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) 
     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) 
     if (CC_HAS_NO_ARRAY_BOUNDS)
@@ -291,7 +296,7 @@
     endif()
 endif(WARNINGS_AS_ERRORS)
 
-if (WIN32)
+if(WIN32)
     # Visual leak detector
     find_package(VLD QUIET)
     if(VLD_FOUND)
@@ -300,12 +305,15 @@
         list(APPEND PLATFORM_LIBS ${VLD_LIBRARIES})
         link_directories(${VLD_LIBRARY_DIRS})
     endif()
-    option(WINXP_SUPPORT "Make binaries compatible with Windows XP" OFF)
+    option(WINXP_SUPPORT "Make binaries compatible with Windows XP and Vista" OFF)
     if(WINXP_SUPPORT)
         # force use of workarounds for CONDITION_VARIABLE and atomic
         # intrinsics introduced after XP
-        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP)
-    endif()
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP -D_WIN32_WINNT_WIN7=0x0601)
+    else(WINXP_SUPPORT)
+        # default to targeting Windows 7 for the NUMA APIs
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WIN7)
+    endif(WINXP_SUPPORT)
 endif()
 
 include(version) # determine X265_VERSION and X265_LATEST_TAG
@@ -462,8 +470,10 @@
 # Main CLI application
 option(ENABLE_CLI "Build standalone CLI application" ON)
 if(ENABLE_CLI)
-    file(GLOB InputFiles input/*.cpp input/*.h)
-    file(GLOB OutputFiles output/*.cpp output/*.h)
+    file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
+    file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
+                          output/yuv.cpp output/y4m.cpp # recon
+                          output/raw.cpp)               # muxers
     file(GLOB FilterFiles filters/*.cpp filters/*.h)
     source_group(input FILES ${InputFiles})
     source_group(output FILES ${OutputFiles})

x265_1.6.tar.gz/source/common/common.cpp -> x265_1.7.tar.gz/source/common/common.cpp Changed
@@ -100,11 +100,14 @@
     return (x265_exp2_lut[i & 63] + 256) << (i >> 6) >> 8;
 }
 
-void x265_log(const x265_param *param, int level, const char *fmt, ...)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...)
 {
     if (param && level > param->logLevel)
         return;
-    const char *log_level;
+    const int bufferSize = 4096;
+    char buffer[bufferSize];
+    int p = 0;
+    const char* log_level;
     switch (level)
     {
     case X265_LOG_ERROR:
@@ -127,11 +130,13 @@
         break;
     }
 
-    fprintf(stderr, "x265 [%s]: ", log_level);
+    if (caller)
+        p += sprintf(buffer, "%-4s [%s]: ", caller, log_level);
     va_list arg;
     va_start(arg, fmt);
-    vfprintf(stderr, fmt, arg);
+    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
     va_end(arg);
+    fputs(buffer, stderr);
 }
 
 double x265_ssim2dB(double ssim)

x265_1.6.tar.gz/source/common/common.h -> x265_1.7.tar.gz/source/common/common.h Changed
@@ -413,7 +413,8 @@
 
 /* outside x265 namespace, but prefixed. defined in common.cpp */
 int64_t  x265_mdate(void);
-void     x265_log(const x265_param *param, int level, const char *fmt, ...);
+#define  x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
+void     general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
 int      x265_exp2fix8(double x);
 
 double   x265_ssim2dB(double ssim);

x265_1.6.tar.gz/source/common/constants.cpp -> x265_1.7.tar.gz/source/common/constants.cpp Changed
@@ -324,7 +324,7 @@
       4,  12, 20, 28,  5, 13, 21, 29,  6, 14, 22, 30,  7, 15, 23, 31, 36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
 };
 
-const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4] =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
 {
     { 0,  4,  1,  8,  5,  2, 12,  9,  6,  3, 13, 10,  7, 14, 11, 15 },
     { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },

x265_1.6.tar.gz/source/common/contexts.h -> x265_1.7.tar.gz/source/common/contexts.h Changed
@@ -106,6 +106,7 @@
 // private namespace
 
 extern const uint32_t g_entropyBits[128];
+extern const uint32_t g_entropyStateBits[128];
 extern const uint8_t g_nextState[128][2];
 
 #define sbacGetMps(S)            ((S) & 1)

x265_1.6.tar.gz/source/common/cudata.cpp -> x265_1.7.tar.gz/source/common/cudata.cpp Changed
@@ -298,7 +298,7 @@
 }
 
 // initialize Sub partition
-void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp)
 {
     m_absIdxInCTU   = cuGeom.absPartIdx;
     m_encData       = ctu.m_encData;
@@ -312,8 +312,8 @@
     m_cuAboveRight  = ctu.m_cuAboveRight;
     X265_CHECK(m_numPartitions == cuGeom.numPartitions, "initSubCU() size mismatch\n");
 
-    /* sequential memsets */
-    m_partSet((uint8_t*)m_qp, (uint8_t)ctu.m_qp[0]);
+    m_partSet((uint8_t*)m_qp, (uint8_t)qp);
+
     m_partSet(m_log2CUSize,   (uint8_t)cuGeom.log2CUSize);
     m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
     m_partSet(m_tqBypass,     (uint8_t)m_encData->m_param->bLossless);
@@ -1830,6 +1830,10 @@
     }
 }
 
+/* Clip motion vector to within slightly padded boundary of picture (the
+ * MV may reference a block that is completely within the padded area).
+ * Note this function is unaware of how much of this picture is actually
+ * available for use (re: frame parallelism) */
 void CUData::clipMv(MV& outMV) const
 {
     const uint32_t mvshift = 2;
@@ -2027,6 +2031,7 @@
         uint32_t blockSize = 1 << log2CUSize;
         uint32_t sbWidth   = 1 << (g_log2Size[maxCUSize] - log2CUSize);
         int32_t lastLevelFlag = log2CUSize == g_log2Size[minCUSize];
+
         for (uint32_t sbY = 0; sbY < sbWidth; sbY++)
         {
             for (uint32_t sbX = 0; sbX < sbWidth; sbX++)

x265_1.6.tar.gz/source/common/cudata.h -> x265_1.7.tar.gz/source/common/cudata.h Changed
@@ -85,8 +85,8 @@
     uint32_t childOffset;   // offset of the first child CU from current CU
     uint32_t absPartIdx;    // Part index of this CU in terms of 4x4 blocks.
     uint32_t numPartitions; // Number of 4x4 blocks in the CU
-    uint32_t depth;         // depth of this CU relative from CTU
     uint32_t flags;         // CU flags.
+    uint32_t depth;         // depth of this CU relative from CTU
 };
 
 struct MVField
@@ -182,7 +182,7 @@
     static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
 
     void     initCTU(const Frame& frame, uint32_t cuAddr, int qp);
-    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom);
+    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp);
     void     initLosslessCU(const CUData& cu, const CUGeom& cuGeom);
 
     void     copyPartFrom(const CUData& cu, const CUGeom& childGeom, uint32_t subPartIdx);

x265_1.6.tar.gz/source/common/dct.cpp -> x265_1.7.tar.gz/source/common/dct.cpp Changed
 
@@ -752,7 +752,7 @@
     }
 }
 
-int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
 {
     memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
     memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));
@@ -785,6 +785,37 @@
     return scanPosLast - 1;
 }
 
+uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+{
+    int n;
+
+    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+
+    uint32_t lastNZPosInCG = (uint32_t)n;
+
+    for (n = 0;; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    uint32_t firstNZPosInCG = (uint32_t)n;
+
+    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+}
+
 }  // closing - anonymous file-static namespace
 
 namespace x265 {
@@ -817,6 +848,7 @@
     p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
 
-    p.findPosLast = findPosLast_c;
+    p.scanPosLast = scanPosLast_c;
+    p.findPosFirstLast = findPosFirstLast_c;
 }
 }
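The new findPosFirstLast_c above scans one 4x4 coefficient group in scan order and packs the last and first non-zero positions into a single 32-bit return value (last in the high halfword, first in the low halfword). A minimal standalone sketch of that idiom, with a flat 16-coefficient group and an illustrative function name (not the x265 API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative re-implementation of the findPosFirstLast_c scan for a flat
// 16-coefficient group. scanTbl maps scan order to raster positions.
static uint32_t findFirstLast(const int16_t coeff[16], const uint16_t scanTbl[16])
{
    // Walk backwards in scan order to find the last non-zero coefficient.
    int last = 15;
    while (last >= 0 && !coeff[scanTbl[last]])
        --last;

    // Walk forwards to find the first non-zero coefficient.
    int first = 0;
    while (first < 16 && !coeff[scanTbl[first]])
        ++first;

    // Same packing as the real primitive: last position in the high halfword.
    return ((uint32_t)last << 16) | (uint32_t)first;
}
```

The caller recovers both positions with one shift and one mask, which is why the primitive can return them through a single register.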
x265_1.6.tar.gz/source/common/frame.cpp -> x265_1.7.tar.gz/source/common/frame.cpp Changed
 
@@ -31,18 +31,21 @@
 Frame::Frame()
 {
     m_bChromaExtended = false;
+    m_lowresInit = false;
     m_reconRowCount.set(0);
     m_countRefEncoders = 0;
     m_encData = NULL;
     m_reconPic = NULL;
     m_next = NULL;
     m_prev = NULL;
+    m_param = NULL;
     memset(&m_lowres, 0, sizeof(m_lowres));
 }
 
 bool Frame::create(x265_param *param)
 {
     m_fencPic = new PicYuv;
+    m_param = param;
 
     return m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
            m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode);
x265_1.6.tar.gz/source/common/frame.h -> x265_1.7.tar.gz/source/common/frame.h Changed
 
@@ -56,6 +56,7 @@
     void*                  m_userData;           // user provided pointer passed in with this picture
 
     Lowres                 m_lowres;
+    bool                   m_lowresInit;         // lowres init complete (pre-analysis)
     bool                   m_bChromaExtended;    // orig chroma planes motion extended for weight analysis
 
     /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
@@ -64,7 +65,7 @@
 
     Frame*                 m_next;               // PicList doubly linked list pointers
     Frame*                 m_prev;
-
+    x265_param*            m_param;              // Points to the latest param set for the frame.
     x265_analysis_data     m_analysisData;
     Frame();
 
x265_1.6.tar.gz/source/common/framedata.h -> x265_1.7.tar.gz/source/common/framedata.h Changed
 
@@ -74,6 +74,7 @@
         uint32_t numEncodedCUs; /* ctuAddr of last encoded CTU in row */
         uint32_t encodedBits;   /* sum of 'totalBits' of encoded CTUs */
         uint32_t satdForVbv;    /* sum of lowres (estimated) costs for entire row */
+        uint32_t intraSatdForVbv; /* sum of lowres (estimated) intra costs for entire row */
         uint32_t diagSatd;
         uint32_t diagIntraSatd;
         double   diagQp;
x265_1.6.tar.gz/source/common/ipfilter.cpp -> x265_1.7.tar.gz/source/common/ipfilter.cpp Changed
 
@@ -34,27 +34,8 @@
 #endif
 
 namespace {
-template<int dstStride, int width, int height>
-void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
-{
-    int shift = IF_INTERNAL_PREC - X265_DEPTH;
-    int row, col;
-
-    for (row = 0; row < height; row++)
-    {
-        for (col = 0; col < width; col++)
-        {
-            int16_t val = src[col] << shift;
-            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
-        }
-
-        src += srcStride;
-        dst += dstStride;
-    }
-}
-
-template<int dstStride>
-void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+template<int width, int height>
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -398,7 +379,7 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -407,7 +388,7 @@
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -416,7 +397,7 @@
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>; 
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
@@ -426,7 +407,7 @@
     p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>; \
-    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;
 
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -482,6 +463,7 @@
 
     CHROMA_422(4, 8);
     CHROMA_422(4, 4);
+    CHROMA_422(2, 4);
    CHROMA_422(2, 8);
     CHROMA_422(8,  16);
     CHROMA_422(8,  8);
@@ -530,11 +512,6 @@
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-
-    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
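The refactor above merges the two pixel-to-short converters into one filterPixelToShort_c that takes the destination stride as a runtime argument instead of a template parameter. The conversion itself is unchanged: samples are shifted up to the interpolation filter's internal precision and re-centered by subtracting the internal offset. A runtime-sized sketch for an 8-bit build (the constant values below are assumptions matching x265's defaults, not taken from this diff):

```cpp
#include <cassert>
#include <cstdint>

// Assumed x265 constants for a non-HIGH_BIT_DEPTH build:
// internal precision 14 bits, 8-bit source, offset = half the internal range.
enum { IF_INTERNAL_PREC = 14, X265_DEPTH = 8,
       IF_INTERNAL_OFFS = 1 << (IF_INTERNAL_PREC - 1) };

static void pixelToShort(const uint8_t* src, intptr_t srcStride,
                         int16_t* dst, intptr_t dstStride, int width, int height)
{
    const int shift = IF_INTERNAL_PREC - X265_DEPTH; // 6 for 8-bit input

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
            dst[col] = (int16_t)((src[col] << shift) - IF_INTERNAL_OFFS);

        src += srcStride;
        dst += dstStride;
    }
}
```

Passing the stride at runtime lets a single template instantiation per block size serve every destination buffer layout, which is what allows the per-plane p2s entries above to collapse into the per-PU table.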
x265_1.6.tar.gz/source/common/loopfilter.cpp -> x265_1.7.tar.gz/source/common/loopfilter.cpp Changed
 
@@ -42,18 +42,23 @@
         dst[x] = signOf(src1[x] - src2[x]);
 }
 
-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t signLeft)
+void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
 {
-    int x;
-    int8_t signRight;
+    int x, y;
+    int8_t signRight, signLeft0;
     int8_t edgeType;
 
-    for (x = 0; x < width; x++)
+    for (y = 0; y < 2; y++)
     {
-        signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
-        edgeType = signRight + signLeft + 2;
-        signLeft  = -signRight;
-        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        signLeft0 = signLeft[y];
+        for (x = 0; x < width; x++)
+        {
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+            edgeType = signRight + signLeft0 + 2;
+            signLeft0 = -signRight;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
     }
 }
 
@@ -72,6 +77,25 @@
     }
 }
 
+void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+{
+    int x, y;
+    int8_t signDown;
+    int edgeType;
+
+    for (y = 0; y < 2; y++)
+    {
+        for (x = 0; x < width; x++)
+        {
+            signDown = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown + upBuff1[x] + 2;
+            upBuff1[x] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
+    }
+}
+
 void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
 {
     int x;
@@ -119,8 +143,11 @@
 {
     p.saoCuOrgE0 = processSaoCUE0;
     p.saoCuOrgE1 = processSaoCUE1;
-    p.saoCuOrgE2 = processSaoCUE2;
-    p.saoCuOrgE3 = processSaoCUE3;
+    p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
+    p.saoCuOrgE2[0] = processSaoCUE2;
+    p.saoCuOrgE2[1] = processSaoCUE2;
+    p.saoCuOrgE3[0] = processSaoCUE3;
+    p.saoCuOrgE3[1] = processSaoCUE3;
    p.saoCuOrgB0 = processSaoCUB0;
     p.sign = calSign;
 }
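The SAO edge-offset kernels above all share one classification step: each pixel is compared against two neighbours (left/right for E0, up/down for E1), and the pair of signs selects one of five edge categories, which indexes the offset table. A minimal sketch of that classification (helper names here are illustrative):

```cpp
#include <cassert>

// Three-way sign, matching the signOf() used by the SAO kernels above:
// returns -1, 0, or +1.
static inline int signOf(int x)
{
    return (x > 0) - (x < 0);
}

// Edge category for one pixel given its two neighbours along the SAO
// direction. The +2 bias maps the sum of signs from [-2, 2] to [0, 4],
// a direct index into the 5-entry offsetEo table.
static int edgeType(int neighbourA, int cur, int neighbourB)
{
    int signA = signOf(cur - neighbourA);
    int signB = signOf(cur - neighbourB);
    return signA + signB + 2;
}
```

The restructured processSaoCUE0 keeps this logic but processes two rows per call and carries the left sign per row in the signLeft[] array, which is what the width-keyed avx2 variants exploit.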
x265_1.6.tar.gz/source/common/param.cpp -> x265_1.7.tar.gz/source/common/param.cpp Changed
 
@@ -87,7 +87,7 @@
 extern "C"
 void x265_param_free(x265_param* p)
 {
-    return x265_free(p);
+    x265_free(p);
 }
 
 extern "C"
@@ -117,6 +117,7 @@
     param->levelIdc = 0;
     param->bHighTier = 0;
     param->interlaceMode = 0;
+    param->bAnnexB = 1;
     param->bRepeatHeaders = 0;
     param->bEnableAccessUnitDelimiters = 0;
     param->bEmitHRDSEI = 0;
@@ -209,6 +210,7 @@
     param->rc.zones = NULL;
     param->rc.bEnableSlowFirstPass = 0;
     param->rc.bStrictCbr = 0;
+    param->rc.qgSize = 64; /* Same as maxCUSize */
 
     /* Video Usability Information (VUI) */
     param->vui.aspectRatioIdc = 0;
@@ -263,6 +265,7 @@
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "superfast"))
@@ -279,6 +282,7 @@
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableSAO = 0;
             param->bEnableFastIntra = 1;
         }
@@ -292,6 +296,7 @@
             param->rdLevel = 2;
             param->maxNumReferences = 1;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "faster"))
@@ -565,6 +570,7 @@
             p->levelIdc = atoi(value);
     }
     OPT("high-tier") p->bHighTier = atobool(value);
+    OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
     OPT2("log-level", "log")
     {
         p->logLevel = atoi(value);
@@ -575,6 +581,7 @@
         }
     }
     OPT("cu-stats") p->bLogCuStats = atobool(value);
+    OPT("annexb") p->bAnnexB = atobool(value);
     OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
     OPT("wpp") p->bEnableWavefront = atobool(value);
     OPT("ctu") p->maxCUSize = (uint32_t)atoi(value);
@@ -843,6 +850,9 @@
     OPT2("pools", "numa-pools") p->numaPools = strdup(value);
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
     OPT("analysis-file") p->analysisFileName = strdup(value);
+    OPT("qg-size") p->rc.qgSize = atoi(value);
+    OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
+    OPT("max-cll") p->contentLightLevelInfo = strdup(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
@@ -1183,7 +1193,7 @@
     uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize];
     uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize];
 
-    if (g_ctuSizeConfigured || ATOMIC_INC(&g_ctuSizeConfigured) > 1)
+    if (ATOMIC_INC(&g_ctuSizeConfigured) > 1)
     {
         if (g_maxCUSize != param->maxCUSize)
         {
@@ -1264,22 +1274,20 @@
     x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
              param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);
 
+    if (param->rc.aqMode)
+        x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree  : %d / %0.1f / %d / %d\n", param->rc.aqMode,
+                 param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+
     if (param->bLossless)
         x265_log(param, X265_LOG_INFO, "Rate Control                        : Lossless\n");
     else switch (param->rc.rateControlMode)
     {
     case X265_RC_ABR:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : ABR-%d kbps / %0.1f / %d\n", param->rc.bitrate,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress            : ABR-%d kbps / %0.2f\n", param->rc.bitrate, param->rc.qCompress); break;
    case X265_RC_CQP:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CQP-%d / %0.1f / %d\n", param->rc.qp, param->rc.aqStrength,
-                 param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control                        : CQP-%d\n", param->rc.qp); break;
     case X265_RC_CRF:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CRF-%0.1f / %0.1f / %d\n", param->rc.rfConstant,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress            : CRF-%0.1f / %0.2f\n", param->rc.rfConstant, param->rc.qCompress); break;
     }
 
     if (param->rc.vbvBufferSize)
@@ -1327,6 +1335,43 @@
     fflush(stderr);
 }
 
+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam)
+{
+    if (!param || !reconfiguredParam)
+        return;
+
+    x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n");
+
+    char buf[80] = { 0 };
+    char tmp[40];
+#define TOOLCMP(COND1, COND2, STR, VAL)  if (COND1 != COND2) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
+    TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, "ref=%d", reconfiguredParam->maxNumReferences);
+    TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "max-tu-size=%d", reconfiguredParam->maxTUSize);
+    TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "merange=%d", reconfiguredParam->searchRange);
+    TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= %d", reconfiguredParam->subpelRefine);
+    TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", reconfiguredParam->rdLevel);
+    TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", reconfiguredParam->psyRd);
+    TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", reconfiguredParam->rdoqLevel);
+    TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", reconfiguredParam->psyRdoq);
+    TOOLCMP(param->noiseReductionIntra, reconfiguredParam->noiseReductionIntra, "nr-intra=%d", reconfiguredParam->noiseReductionIntra);
+    TOOLCMP(param->noiseReductionInter, reconfiguredParam->noiseReductionInter, "nr-inter=%d", reconfiguredParam->noiseReductionInter);
+    TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast);
+    TOOLCMP(param->bEnableSignHiding, reconfiguredParam->bEnableSignHiding, "signhide=%d", reconfiguredParam->bEnableSignHiding);
+    TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, "fast-intra=%d", reconfiguredParam->bEnableFastIntra);
+    if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != reconfiguredParam->deblockingFilterBetaOffset 
+        || param->deblockingFilterTCOffset != reconfiguredParam->deblockingFilterTCOffset))
+    {
+        sprintf(tmp, "deblock(tC=%d:B=%d)", param->deblockingFilterTCOffset, param->deblockingFilterBetaOffset);
+        appendtool(param, buf, sizeof(buf), tmp);
+    }
+    else
+        TOOLCMP(param->bEnableLoopFilter,  reconfiguredParam->bEnableLoopFilter, "deblock=%d", reconfiguredParam->bEnableLoopFilter);
+
+    TOOLCMP(param->bEnableTemporalMvp, reconfiguredParam->bEnableTemporalMvp, "tmvp=%d", reconfiguredParam->bEnableTemporalMvp);
+    TOOLCMP(param->bEnableEarlySkip, reconfiguredParam->bEnableEarlySkip, "early-skip=%d", reconfiguredParam->bEnableEarlySkip);
+    x265_log(param, X265_LOG_INFO, "tools:%s\n", buf);
+}
+
 char *x265_param2string(x265_param* p)
 {
     char *buf, *s;
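The new options above (allow-non-conformance, annexb, qg-size, master-display, max-cll) are wired into x265_param_parse's OPT() macro chain: each recognised name assigns into the param struct, and an unmatched name falls through to an error return. A simplified sketch of that string-dispatch pattern, with an illustrative struct and error code standing in for x265_param and X265_PARAM_BAD_NAME:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical cut-down param struct, just for illustration.
struct Params
{
    int qgSize;
    int bAnnexB;
};

// Mirrors the shape of the OPT() chain: compare the option name, convert the
// string value, assign; unknown names return an error code (stand-in for
// X265_PARAM_BAD_NAME in the real API).
static int parseOption(Params* p, const char* name, const char* value)
{
    if (!strcmp(name, "qg-size"))
        p->qgSize = atoi(value);
    else if (!strcmp(name, "annexb"))
        p->bAnnexB = (!strcmp(value, "1") || !strcmp(value, "true")) ? 1 : 0;
    else
        return -1;
    return 0;
}
```

In the real parser the OPT() macro hides the `else if (!strcmp(...))` boilerplate, so adding an option like qg-size is a single line.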
x265_1.6.tar.gz/source/common/param.h -> x265_1.7.tar.gz/source/common/param.h Changed
 
@@ -28,6 +28,7 @@
 int   x265_check_params(x265_param *param);
 int   x265_set_globals(x265_param *param);
 void  x265_print_params(x265_param *param);
+void  x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam);
 void  x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int   x265_atoi(const char *str, bool& bError);
x265_1.6.tar.gz/source/common/picyuv.cpp -> x265_1.7.tar.gz/source/common/picyuv.cpp Changed
 
@@ -175,8 +175,7 @@
 
         for (int r = 0; r < height; r++)
         {
-            for (int c = 0; c < width; c++)
-                yPixel[c] = (pixel)yChar[c];
+            memcpy(yPixel, yChar, width * sizeof(pixel));
 
             yPixel += m_stride;
             yChar += pic.stride[0] / sizeof(*yChar);
@@ -184,11 +183,8 @@
 
         for (int r = 0; r < height >> m_vChromaShift; r++)
         {
-            for (int c = 0; c < width >> m_hChromaShift; c++)
-            {
-                uPixel[c] = (pixel)uChar[c];
-                vPixel[c] = (pixel)vChar[c];
-            }
+            memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+            memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));
 
             uPixel += m_strideC;
             vPixel += m_strideC;
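The picyuv.cpp change above replaces per-sample cast-and-copy loops with memcpy. That is only behaviour-preserving when source and destination elements have the same size and representation, which holds on this code path because the input is 8-bit and `pixel` is a byte in a non-HIGH_BIT_DEPTH build. A sketch of the equivalence under that assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: non-HIGH_BIT_DEPTH build, where x265's 'pixel' is 8 bits.
typedef uint8_t pixel;

// memcpy form: byte-for-byte identical to the old loop
//     for (int c = 0; c < width; c++) dst[c] = (pixel)src[c];
// because sizeof(pixel) == sizeof(*src) here.
static void copyRow(pixel* dst, const uint8_t* src, int width)
{
    memcpy(dst, src, width * sizeof(pixel));
}
```

When `pixel` is 16-bit (HIGH_BIT_DEPTH) and the input is 8-bit, the element sizes differ and a widening loop is still required, which is why the surrounding function keeps separate input paths.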
x265_1.6.tar.gz/source/common/pixel.cpp -> x265_1.7.tar.gz/source/common/pixel.cpp Changed
 
@@ -582,7 +582,7 @@
     }
 }
 
-void scale1D_128to64(pixel *dst, const pixel *src, intptr_t /*stride*/)
+void scale1D_128to64(pixel *dst, const pixel *src)
 {
     int x;
     const pixel* src1 = src;
x265_1.6.tar.gz/source/common/predict.cpp -> x265_1.7.tar.gz/source/common/predict.cpp Changed
 
@@ -273,7 +273,7 @@
 void Predict::predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const
 {
     int16_t* dst = dstSYuv.getLumaAddr(pu.puAbsPartIdx);
-    int dstStride = dstSYuv.m_size;
+    intptr_t dstStride = dstSYuv.m_size;
 
     intptr_t srcStride = refPic.m_stride;
     intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;
@@ -288,7 +288,7 @@
     X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n");
 
     if (!(yFrac | xFrac))
-        primitives.luma_p2s(src, srcStride, dst, pu.width, pu.height);
+        primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
     else if (!yFrac)
         primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
     else if (!xFrac)
@@ -375,14 +375,13 @@
     int partEnum = partitionFromSizes(pu.width, pu.height);
     
     uint32_t cxWidth  = pu.width >> m_hChromaShift;
-    uint32_t cxHeight = pu.height >> m_vChromaShift;
 
-    X265_CHECK(((cxWidth | cxHeight) % 2) == 0, "chroma block size expected to be multiple of 2\n");
+    X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n");
 
     if (!(yFrac | xFrac))
     {
-        primitives.chroma[m_csp].p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
-        primitives.chroma[m_csp].p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
     }
     else if (!yFrac)
     {
@@ -817,7 +816,9 @@
             const pixel refSample = *pAdiLineNext;
             // Pad unavailable samples with new value
             int nextOrTop = X265_MIN(next, leftUnits);
+
             // fill left column
+#if HIGH_BIT_DEPTH
             while (curr < nextOrTop)
             {
                 for (int i = 0; i < unitHeight; i++)
@@ -836,6 +837,24 @@
                 adi += unitWidth;
                 curr++;
             }
+#else
+            X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
+            if (curr < nextOrTop)
+            {
+                const int fillSize = unitHeight * (nextOrTop - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = nextOrTop;
+                adi += fillSize;
+            }
+
+            if (curr < next)
+            {
+                const int fillSize = unitWidth * (next - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = next;
+                adi += fillSize;
+            }
+#endif
         }
 
         // pad all other reference samples.
x265_1.6.tar.gz/source/common/primitives.cpp -> x265_1.7.tar.gz/source/common/primitives.cpp Changed
 
@@ -90,7 +90,6 @@
 
     /* alias chroma 4:4:4 from luma primitives (all but chroma filters) */
 
-    p.chroma[X265_CSP_I444].p2s = p.luma_p2s;
     p.chroma[X265_CSP_I444].cu[BLOCK_4x4].sa8d = NULL;
 
     for (int i = 0; i < NUM_PU_SIZES; i++)
@@ -98,7 +97,7 @@
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg  = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd    = p.pu[i].satd;
-        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
+        p.chroma[X265_CSP_I444].pu[i].p2s     = p.pu[i].convert_p2s;
     }
 
     for (int i = 0; i < NUM_CU_SIZES; i++)
x265_1.6.tar.gz/source/common/primitives.h -> x265_1.7.tar.gz/source/common/primitives.h Changed
 
@@ -140,7 +140,8 @@
 typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
 typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
-typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
+typedef void (*scale1D_t)(pixel* dst, const pixel* src);
+typedef void (*scale2D_t)(pixel* dst, const pixel* src, intptr_t stride);
 typedef void (*downscale_t)(const pixel* src0, pixel* dstf, pixel* dsth, pixel* dstv, pixel* dstc,
                             intptr_t src_stride, intptr_t dst_stride, int width, int height);
 typedef void (*extendCURowBorder_t)(pixel* txt, intptr_t stride, int width, int height, int marginX);
@@ -155,8 +156,7 @@
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
 
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -168,7 +168,7 @@
 typedef void (*pixelavg_pp_t)(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int weight);
 typedef void (*addAvg_t)(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);
 
-typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t signLeft);
+typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t* signLeft, intptr_t stride);
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
@@ -179,7 +179,8 @@
 
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
-typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
@@ -210,7 +211,7 @@
         addAvg_t       addAvg;      // bidir motion compensation, uses 16bit values
 
         copy_pp_t      copy_pp;
-        filter_p2s_t   filter_p2s;
+        filter_p2s_t   convert_p2s;
     }
     pu[NUM_PU_SIZES];
 
@@ -266,17 +267,26 @@
     dequant_scaling_t     dequant_scaling;
     dequant_normal_t      dequant_normal;
     denoiseDct_t          denoiseDct;
-    scale_t               scale1D_128to64;
-    scale_t               scale2D_64to32;
+    scale1D_t             scale1D_128to64;
+    scale2D_t             scale2D_64to32;
 
     ssim_4x4x2_core_t     ssim_4x4x2_core;
     ssim_end4_t           ssim_end_4;
 
     sign_t                sign;
     saoCuOrgE0_t          saoCuOrgE0;
-    saoCuOrgE1_t          saoCuOrgE1;
-    saoCuOrgE2_t          saoCuOrgE2;
-    saoCuOrgE3_t          saoCuOrgE3;
+
+    /* To avoid the overhead in avx2 optimization in handling width=16, SAO_E0_1 is split
+     * into two parts: saoCuOrgE1, saoCuOrgE1_2Rows */
+    saoCuOrgE1_t          saoCuOrgE1, saoCuOrgE1_2Rows;
+
+    // saoCuOrgE2[0] is used for width<=16 and saoCuOrgE2[1] is used for width > 16.
+    saoCuOrgE2_t          saoCuOrgE2[2];
+
+    /* In avx2 optimization, two rows cannot be handled simultaneously since it requires 
     * a pixel from the previous row. So, saoCuOrgE3[0] is used for width<=16 and 
     * saoCuOrgE3[1] is used for width > 16. */
+    saoCuOrgE3_t          saoCuOrgE3[2];
     saoCuOrgB0_t          saoCuOrgB0;
 
     downscale_t           frameInitLowres;
@@ -289,9 +299,9 @@
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
 
-    filter_p2s_wxh_t      luma_p2s;
 
-    findPosLast_t         findPosLast;
+    scanPosLast_t         scanPosLast;
+    findPosFirstLast_t    findPosFirstLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
@@ -316,7 +326,7 @@
             filter_hps_t filter_hps;
             addAvg_t     addAvg;
             copy_pp_t    copy_pp;
-            filter_p2s_t chroma_p2s;
+            filter_p2s_t p2s;
 
         }
         pu[NUM_PU_SIZES];
@@ -336,7 +346,6 @@
         }
        cu[NUM_CU_SIZES];
 
-        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
};
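The primitives.h changes above extend x265's dispatch pattern: EncoderPrimitives is a table of function pointers first filled with C reference implementations, then selectively overwritten by SIMD/assembly versions; the new two-entry arrays (saoCuOrgE2[0]/[1], saoCuOrgE3[0]/[1]) key the choice on block width. A cut-down sketch of that pattern with simplified, illustrative names:

```cpp
#include <cassert>
#include <cstdint>

// A trivial stand-in primitive type (the real table holds SAO, filter,
// transform, etc. pointers).
typedef int (*sum_t)(const int16_t* v, int n);

// C reference implementation.
static int sum_c(const int16_t* v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

// Two-entry table keyed on block width, mirroring saoCuOrgE2[2]/saoCuOrgE3[2]:
// [0] serves small widths, [1] serves large widths.
struct Primitives
{
    sum_t sum[2];
};

static void setup_c(Primitives& p)
{
    p.sum[0] = sum_c;
    p.sum[1] = sum_c; // an asm setup would later override only the entries it accelerates
}
```

Because both entries start as the same C function, a SIMD build that only accelerates the wide case can overwrite `sum[1]` and leave `sum[0]` on the reference path, which is exactly how the split SAO entries above are populated in loopfilter.cpp.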
x265_1.6.tar.gz/source/common/quant.cpp -> x265_1.7.tar.gz/source/common/quant.cpp Changed
704
 
1
@@ -198,7 +198,8 @@
2
 {
3
     m_entropyCoder = &entropy;
4
     m_rdoqLevel    = rdoqLevel;
5
-    m_psyRdoqScale = (int64_t)(psyScale * 256.0);
6
+    m_psyRdoqScale = (int32_t)(psyScale * 256.0);
7
+    X265_CHECK((psyScale * 256.0) < (double)MAX_INT, "psyScale value too large\n");
8
     m_scalingList  = &scalingList;
9
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
10
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
11
@@ -225,16 +226,15 @@
12
     X265_FREE(m_fencShortBuf);
13
 }
14
 
15
-void Quant::setQPforQuant(const CUData& cu)
16
+void Quant::setQPforQuant(const CUData& ctu, int qp)
17
 {
18
-    m_tqBypass = !!cu.m_tqBypass[0];
19
+    m_tqBypass = !!ctu.m_tqBypass[0];
20
     if (m_tqBypass)
21
         return;
22
-    m_nr = m_frameNr ? &m_frameNr[cu.m_encData->m_frameEncoderID] : NULL;
-    int qpy = cu.m_qp[0];
-    m_qpParam[TEXT_LUMA].setQpParam(qpy + QP_BD_OFFSET);
-    setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, cu.m_chromaFormat);
-    setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, cu.m_chromaFormat);
+    m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL;
+    m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET);
+    setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat);
+    setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat);
 }
 
 void Quant::setChromaQP(int qpin, TextType ttype, int chFmt)
@@ -515,6 +515,7 @@
 {
     int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
     int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
+    const uint32_t usePsyMask = usePsy ? -1 : 0;
 
     X265_CHECK(scalingListType < 6, "scaling list type out of range\n");
 
@@ -529,9 +530,10 @@
     X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
+
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
-    int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
+    const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda);
 
     /* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4)
      * scale applied that must be removed during unquant. Note that in real dequant there is clipping
@@ -544,7 +546,7 @@
 #define UNQUANT(lvl)    (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift)
 #define SIGCOST(bits)   ((lambda2 * (bits)) >> 8)
 #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits))
-#define PSYVALUE(rec)   ((psyScale * (rec)) >> (16 - scaleBits))
+#define PSYVALUE(rec)   ((psyScale * (rec)) >> (2 * transformShift + 1))
 
     int64_t costCoeff[32 * 32];   /* d*d + lambda * bits */
     int64_t costUncoded[32 * 32]; /* d*d + lambda * 0    */
@@ -557,14 +559,6 @@
     int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */
     uint64_t sigCoeffGroupFlag64 = 0;
 
-    uint32_t ctxSet      = 0;
-    int    c1            = 1;
-    int    c2            = 0;
-    uint32_t goRiceParam = 0;
-    uint32_t c1Idx       = 0;
-    uint32_t c2Idx       = 0;
-    int cgLastScanPos    = -1;
-    int lastScanPos      = -1;
     const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */
     bool bIsLuma = ttype == TEXT_LUMA;
 
@@ -579,30 +573,231 @@
     TUEntropyCodingParameters codeParams;
     cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma);
     const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2);
+    const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE);
+
+    uint8_t coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+    // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group
+    memset(coeffNum, 0, sizeof(coeffNum));
+    memset(coeffSign, 0, sizeof(coeffNum));
+    memset(coeffFlag, 0, sizeof(coeffNum));
+#endif
+    const int lastScanPos = primitives.scanPosLast(codeParams.scan, dstCoeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
+    const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+
 
     /* TODO: update bit estimates if dirty */
     EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac;
 
-    uint32_t scanPos;
-    coeffGroupRDStats cgRdStats;
+    uint32_t scanPos = 0;
+    uint32_t c1 = 1;
+
+    // process trail all zero Coeff Group
+
+    /* coefficients after lastNZ have no distortion signal cost */
+    const int zeroCG = cgNum - 1 - cgLastScanPos;
+    memset(&costCoeff[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+    memset(&costSig[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+
+    /* sum zero coeff (uncodec) cost */
+
+    // TODO: does we need these cost?
+    if (usePsyMask)
+    {
+        for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
+        {
+            X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
+
+            uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos      = codeParams.scan[scanPosBase];
+
+            // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
+            for (int y = 0; y < MLS_CG_SIZE; y++)
+            {
+                for (int x = 0; x < MLS_CG_SIZE; x++)
+                {
+                    int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                    int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
+
+                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                    /* when no residual coefficient is coded, predicted coef == recon coef */
+                    costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
+
+                    totalUncodedCost += costUncoded[blkPos + x];
+                    totalRdCost += costUncoded[blkPos + x];
+                }
+                blkPos += trSize;
+            }
+        }
+    }
+    else
+    {
+        // non-psy path
+        for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
+        {
+            X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
+
+            uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos      = codeParams.scan[scanPosBase];
+
+            for (int y = 0; y < MLS_CG_SIZE; y++)
+            {
+                for (int x = 0; x < MLS_CG_SIZE; x++)
+                {
+                    int signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                    totalUncodedCost += costUncoded[blkPos + x];
+                    totalRdCost += costUncoded[blkPos + x];
+                }
+                blkPos += trSize;
+            }
+        }
+    }
+
+    static const uint8_t table_cnt[5][SCAN_SET_SIZE] =
+    {
+        // patternSigCtx = 0
+        {
+            2, 1, 1, 0,
+            1, 1, 0, 0,
+            1, 0, 0, 0,
+            0, 0, 0, 0,
+        },
+        // patternSigCtx = 1
+        {
+            2, 2, 2, 2,
+            1, 1, 1, 1,
+            0, 0, 0, 0,
+            0, 0, 0, 0,
+        },
+        // patternSigCtx = 2
+        {
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+        },
+        // patternSigCtx = 3
+        {
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+        },
+        // 4x4
+        {
+            0, 1, 4, 5,
+            2, 3, 4, 5,
+            6, 6, 8, 8,
+            7, 7, 8, 8
+        }
+    };
 
     /* iterate over coding groups in reverse scan order */
-    for (int cgScanPos = cgNum - 1; cgScanPos >= 0; cgScanPos--)
+    for (int cgScanPos = cgLastScanPos; cgScanPos >= 0; cgScanPos--)
     {
+        uint32_t ctxSet = (cgScanPos && bIsLuma) ? 2 : 0;
         const uint32_t cgBlkPos = codeParams.scanCG[cgScanPos];
         const uint32_t cgPosY   = cgBlkPos >> codeParams.log2TrSizeCG;
         const uint32_t cgPosX   = cgBlkPos - (cgPosY << codeParams.log2TrSizeCG);
         const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos);
-        memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
+        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
+        const int ctxSigOffset = codeParams.firstSignificanceMapContext + (cgScanPos && bIsLuma ? 3 : 0);
+
+        if (c1 == 0)
+            ctxSet++;
+        c1 = 1;
 
-        const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+        if (cgScanPos && (coeffNum[cgScanPos] == 0))
+        {
+            // TODO: does we need zero-coeff cost?
+            const uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos = codeParams.scan[scanPosBase];
 
+            if (usePsyMask)
+            {
+                // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
+                for (int y = 0; y < MLS_CG_SIZE; y++)
+                {
+                    for (int x = 0; x < MLS_CG_SIZE; x++)
+                    {
+                        int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                        int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
+
+                        costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                        /* when no residual coefficient is coded, predicted coef == recon coef */
+                        costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
+
+                        totalUncodedCost += costUncoded[blkPos + x];
+                        totalRdCost += costUncoded[blkPos + x];
+
+                        const uint32_t scanPosOffset =  y * MLS_CG_SIZE + x;
+                        const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
+                        X265_CHECK(trSize > 4, "trSize check failure\n");
+                        X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");
+
+                        costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+                        costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x];
+                        sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+                    }
+                    blkPos += trSize;
+                }
+            }
+            else
+            {
+                // non-psy path
+                for (int y = 0; y < MLS_CG_SIZE; y++)
+                {
+                    for (int x = 0; x < MLS_CG_SIZE; x++)
+                    {
+                        int signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                        costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                        totalUncodedCost += costUncoded[blkPos + x];
+                        totalRdCost += costUncoded[blkPos + x];
+
+                        const uint32_t scanPosOffset =  y * MLS_CG_SIZE + x;
+                        const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
+                        X265_CHECK(trSize > 4, "trSize check failure\n");
+                        X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");
+
+                        costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+                        costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x];
+                        sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+                    }
+                    blkPos += trSize;
+                }
+            }
+
+            /* there were no coded coefficients in this coefficient group */
+            {
+                uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
+                costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
+                totalRdCost += costCoeffGroupSig[cgScanPos];  /* add cost of 0 bit in significant CG bitmap */
+            }
+            continue;
+        }
+
+        coeffGroupRDStats cgRdStats;
+        memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
+
+        uint32_t subFlagMask = coeffFlag[cgScanPos];
+        int    c2            = 0;
+        uint32_t goRiceParam = 0;
+        uint32_t c1Idx       = 0;
+        uint32_t c2Idx       = 0;
         /* iterate over coefficients in each group in reverse scan order */
         for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
         {
             scanPos              = (cgScanPos << MLS_CG_SIZE) + scanPosinCG;
             uint32_t blkPos      = codeParams.scan[scanPos];
-            uint16_t maxAbsLevel = (int16_t)abs(dstCoeff[blkPos]);             /* abs(quantized coeff) */
+            uint32_t maxAbsLevel = abs(dstCoeff[blkPos]);             /* abs(quantized coeff) */
            int signCoef         = m_resiDctCoeff[blkPos];            /* pre-quantization DCT coeff */
             int predictedCoef    = m_fencDctCoeff[blkPos] - signCoef; /* predicted DCT = source DCT - residual DCT*/
 
@@ -611,22 +806,21 @@
              * FIX15 nature of the CABAC cost tables minus the forward transform scale */
 
             /* cost of not coding this coefficient (all distortion, no signal bits) */
-            costUncoded[scanPos] = (int64_t)(signCoef * signCoef) << scaleBits;
-            if (usePsy && blkPos)
+            costUncoded[blkPos] = ((int64_t)signCoef * signCoef) << scaleBits;
+            X265_CHECK((!!scanPos ^ !!blkPos) == 0, "failed on (blkPos=0 && scanPos!=0)\n");
+            if (usePsyMask & scanPos)
                 /* when no residual coefficient is coded, predicted coef == recon coef */
-                costUncoded[scanPos] -= PSYVALUE(predictedCoef);
+                costUncoded[blkPos] -= PSYVALUE(predictedCoef);
 
-            totalUncodedCost += costUncoded[scanPos];
+            totalUncodedCost += costUncoded[blkPos];
 
-            if (maxAbsLevel && lastScanPos < 0)
-            {
-                /* remember the first non-zero coef found in this reverse scan as the last pos */
-                lastScanPos   = scanPos;
-                ctxSet        = (scanPos < SCAN_SET_SIZE || !bIsLuma) ? 0 : 2;
-                cgLastScanPos = cgScanPos;
-            }
+            // coefficient level estimation
+            const int* greaterOneBits = estBitsSbac.greaterOneBits[4 * ctxSet + c1];
+            const uint32_t ctxSig = (blkPos == 0) ? 0 : table_cnt[(trSize == 4) ? 4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset;
+            X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");
 
-            if (lastScanPos < 0)
+            // before find lastest non-zero coeff
+            if (scanPos > (uint32_t)lastScanPos)
             {
                 /* coefficients after lastNZ have no distortion signal cost */
                 costCoeff[scanPos] = 0;
@@ -635,10 +829,24 @@
                 /* No non-zero coefficient yet found, but this does not mean
                  * there is no uncoded-cost for this coefficient. Pre-
                  * quantization the coefficient may have been non-zero */
-                totalRdCost += costUncoded[scanPos];
+                totalRdCost += costUncoded[blkPos];
+            }
+            else if (!(subFlagMask & 1))
+            {
+                // fast zero coeff path
+                /* set default costs to uncoded costs */
+                costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+                costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos];
+                sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+                totalRdCost += costCoeff[scanPos];
+                rateIncUp[blkPos] = greaterOneBits[0];
+
+                subFlagMask >>= 1;
             }
             else
             {
+                subFlagMask >>= 1;
+
                 const uint32_t c1c2Idx = ((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)) + (((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1) * 2;
                 const uint32_t baseLevel = ((uint32_t)0xD9 >> (c1c2Idx * 2)) & 3;  // {1, 2, 1, 3}
 
@@ -647,12 +855,9 @@
                 X265_CHECK((int)baseLevel == ((c1Idx < C1FLAG_NUMBER) ? (2 + (c2Idx == 0)) : 1), "scan validation 3\n");
 
                 // coefficient level estimation
-                const uint32_t oneCtx = 4 * ctxSet + c1;
-                const uint32_t absCtx = ctxSet + c2;
-                const int* greaterOneBits = estBitsSbac.greaterOneBits[oneCtx];
-                const int* levelAbsBits = estBitsSbac.levelAbsBits[absCtx];
+                const int* levelAbsBits = estBitsSbac.levelAbsBits[ctxSet + c2];
 
-                uint16_t level = 0;
+                uint32_t level = 0;
                 uint32_t sigCoefBits = 0;
                 costCoeff[scanPos] = MAX_INT64;
 
@@ -660,48 +865,82 @@
                     sigRateDelta[blkPos] = 0;
                 else
                 {
-                    const uint32_t ctxSig = getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext);
                     if (maxAbsLevel < 3)
                     {
                         /* set default costs to uncoded costs */
-                        costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[ctxSig][0]);
-                        costCoeff[scanPos] = costUncoded[scanPos] + costSig[scanPos];
+                        costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+                        costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos];
                     }
-                    sigRateDelta[blkPos] = estBitsSbac.significantBits[ctxSig][1] - estBitsSbac.significantBits[ctxSig][0];
-                    sigCoefBits = estBitsSbac.significantBits[ctxSig][1];
+                    sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+                    sigCoefBits = estBitsSbac.significantBits[1][ctxSig];
                 }
-                if (maxAbsLevel)
+
+                // NOTE: X265_MAX(maxAbsLevel - 1, 1) ==> (X>=2 -> X-1), (X<2 -> 1)  | (0 < X < 2 ==> X=1)
+                if (maxAbsLevel == 1)
                 {
-                    uint16_t minAbsLevel = X265_MAX(maxAbsLevel - 1, 1);
-                    for (uint16_t lvl = maxAbsLevel; lvl >= minAbsLevel; lvl--)
+                    uint32_t levelBits = (c1c2Idx & 1) ? greaterOneBits[0] + IEP_RATE : ((1 + goRiceParam) << 15) + IEP_RATE;
+                    X265_CHECK(levelBits == getICRateCost(1, 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE, "levelBits mistake\n");
+
+                    int unquantAbsLevel = UNQUANT(1);
+                    int d = abs(signCoef) - unquantAbsLevel;
+                    int64_t curCost = RDCOST(d, sigCoefBits + levelBits);
+
+                    /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
+                    if (usePsyMask & scanPos)
                     {
-                        uint32_t levelBits = getICRateCost(lvl, lvl - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
+                        int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef));
+                        curCost -= PSYVALUE(reconCoef);
+                    }
 
-                        int unquantAbsLevel = UNQUANT(lvl);
-                        int d = abs(signCoef) - unquantAbsLevel;
-                        int64_t curCost = RDCOST(d, sigCoefBits + levelBits);
+                    if (curCost < costCoeff[scanPos])
+                    {
+                        level = 1;
+                        costCoeff[scanPos] = curCost;
+                        costSig[scanPos] = SIGCOST(sigCoefBits);
+                    }
+                }
+                else if (maxAbsLevel)
+                {
+                    uint32_t levelBits0 = getICRateCost(maxAbsLevel,     maxAbsLevel     - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
+                    uint32_t levelBits1 = getICRateCost(maxAbsLevel - 1, maxAbsLevel - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
 
-                        /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
-                        if (usePsy && blkPos)
-                        {
-                            int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef));
-                            curCost -= PSYVALUE(reconCoef);
-                        }
+                    int unquantAbsLevel0 = UNQUANT(maxAbsLevel);
+                    int d0 = abs(signCoef) - unquantAbsLevel0;
+                    int64_t curCost0 = RDCOST(d0, sigCoefBits + levelBits0);
 
-                        if (curCost < costCoeff[scanPos])
-                        {
-                            level = lvl;
-                            costCoeff[scanPos] = curCost;
-                            costSig[scanPos] = SIGCOST(sigCoefBits);
-                        }
+                    int unquantAbsLevel1 = UNQUANT(maxAbsLevel - 1);
+                    int d1 = abs(signCoef) - unquantAbsLevel1;
+                    int64_t curCost1 = RDCOST(d1, sigCoefBits + levelBits1);
+
+                    /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
+                    if (usePsyMask & scanPos)
+                    {
+                        int reconCoef;
+                        reconCoef = abs(unquantAbsLevel0 + SIGN(predictedCoef, signCoef));
+                        curCost0 -= PSYVALUE(reconCoef);
+
+                        reconCoef = abs(unquantAbsLevel1 + SIGN(predictedCoef, signCoef));
+                        curCost1 -= PSYVALUE(reconCoef);
+                    }
+                    if (curCost0 < costCoeff[scanPos])
+                    {
+                        level = maxAbsLevel;
+                        costCoeff[scanPos] = curCost0;
+                        costSig[scanPos] = SIGCOST(sigCoefBits);
+                    }
+                    if (curCost1 < costCoeff[scanPos])
+                    {
+                        level = maxAbsLevel - 1;
+                        costCoeff[scanPos] = curCost1;
+                        costSig[scanPos] = SIGCOST(sigCoefBits);
                     }
                 }
 
-                dstCoeff[blkPos] = level;
+                dstCoeff[blkPos] = (int16_t)level;
                 totalRdCost += costCoeff[scanPos];
 
                 /* record costs for sign-hiding performed at the end */
-                if (level)
+                if ((cu.m_slice->m_pps->bSignHideEnabled ? ~0 : 0) & level)
                 {
                     const int32_t diff0 = level - 1 - baseLevel;
                     const int32_t diff2 = level + 1 - baseLevel;
@@ -763,41 +1002,27 @@
                 else if ((c1 < 3) && (c1 > 0) && level)
                     c1++;
 
-                /* context set update */
-                if (!(scanPos % SCAN_SET_SIZE) && scanPos)
+                if (dstCoeff[blkPos])
                 {
-                    c2 = 0;
-                    goRiceParam = 0;
-
-                    c1Idx = 0;
-                    c2Idx = 0;
-                    ctxSet = (scanPos == SCAN_SET_SIZE || !bIsLuma) ? 0 : 2;
-                    X265_CHECK(c1 >= 0, "c1 is negative\n");
-                    ctxSet -= ((int32_t)(c1 - 1) >> 31);
-                    c1 = 1;
+                    sigCoeffGroupFlag64 |= cgBlkPosMask;
+                    cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos];
+                    cgRdStats.uncodedDist += costUncoded[blkPos];
+                    cgRdStats.nnzBeforePos0 += scanPosinCG;
                 }
             }
 
             cgRdStats.sigCost += costSig[scanPos];
-            if (!scanPosinCG)
-                cgRdStats.sigCost0 = costSig[scanPos];
-
-            if (dstCoeff[blkPos])
-            {
-                sigCoeffGroupFlag64 |= cgBlkPosMask;
-                cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos];
-                cgRdStats.uncodedDist += costUncoded[scanPos];
-                cgRdStats.nnzBeforePos0 += scanPosinCG;
-            }
         } /* end for (scanPosinCG) */
 
+        X265_CHECK((cgScanPos << MLS_CG_SIZE) == (int)scanPos, "scanPos mistake\n");
+        cgRdStats.sigCost0 = costSig[scanPos];
+
         costCoeffGroupSig[cgScanPos] = 0;
 
-        if (cgLastScanPos < 0)
-        {
-            /* nothing to do at this point */
-        }
-        else if (!cgScanPos || cgScanPos == cgLastScanPos)
+        /* nothing to do at this case */
+        X265_CHECK(cgLastScanPos >= 0, "cgLastScanPos check failure\n");
+
+        if (!cgScanPos || cgScanPos == cgLastScanPos)
         {
             /* coeff group 0 is implied to be present, no signal cost */
             /* coeff group with last NZ is implied to be present, handled below */
@@ -815,7 +1040,7 @@
              * of the significant coefficient group flag and evaluate whether the RD cost of the
              * coded group is more than the RD cost of the uncoded group */
 
-            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+            uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
 
             int64_t costZeroCG = totalRdCost + SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
             costZeroCG += cgRdStats.uncodedDist;       /* add distortion for resetting non-zero levels to zero levels */
@@ -832,23 +1057,17 @@
                 costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
 
                 /* reset all coeffs to 0. UNCODE THIS COEFF GROUP! */
-                for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
-                {
-                    scanPos = cgScanPos * cgSize + scanPosinCG;
-                    uint32_t blkPos = codeParams.scan[scanPos];
-                    if (dstCoeff[blkPos])
-                    {
-                        costCoeff[scanPos] = costUncoded[scanPos];
-                        costSig[scanPos] = 0;
-                    }
-                    dstCoeff[blkPos] = 0;
-                }
+                const uint32_t blkPos = codeParams.scan[cgScanPos * cgSize];
+                memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+                memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
             }
         }
         else
        {
             /* there were no coded coefficients in this coefficient group */
-            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+            uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
             costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
             totalRdCost += costCoeffGroupSig[cgScanPos];  /* add cost of 0 bit in significant CG bitmap */
             totalRdCost -= cgRdStats.sigCost;             /* remove cost of significant coefficient bitmap */
@@ -909,7 +1128,7 @@
              * cost of signaling it as not-significant */
             uint32_t blkPos = codeParams.scan[scanPos];
             if (dstCoeff[blkPos])
-            {                
+            {
                 // Calculates the cost of signaling the last significant coefficient in the block 
                 uint32_t pos[2] = { (blkPos & (trSize - 1)), (blkPos >> log2TrSize) };
                 if (codeParams.scanType == SCAN_VER)
@@ -940,7 +1159,7 @@
                 }
 
                 totalRdCost -= costCoeff[scanPos];
-                totalRdCost += costUncoded[scanPos];
+                totalRdCost += costUncoded[blkPos];
             }
             else
                 totalRdCost -= costSig[scanPos];
@@ -959,34 +1178,40 @@
         dstCoeff[blkPos] = (int16_t)((level ^ mask) - mask);
     }
 
+    // Average 49.62 pixels
     /* clean uncoded coefficients */
-    for (int pos = bestLastIdx; pos <= lastScanPos; pos++)
+    for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++)
+    {
         dstCoeff[codeParams.scan[pos]] = 0;
+    }
+    for (int pos = (bestLastIdx & ~(SCAN_SET_SIZE - 1)) + SCAN_SET_SIZE; pos <= lastScanPos; pos += SCAN_SET_SIZE)
+    {
+        const uint32_t blkPos = codeParams.scan[pos];
+        memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
+    }
 
     /* rate-distortion based sign-hiding */
     if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2)
     {
+        const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE;
         int lastCG = true;
-        for (int subSet = cgLastScanPos; subSet >= 0; subSet--)
+        for (int subSet = realLastScanPos; subSet >= 0; subSet--)
        {
             int subPos = subSet << LOG2_SCAN_SET_SIZE;
             int n;
 
-            /* measure distance between first and last non-zero coef in this
-             * coding group */
-            for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
-                if (dstCoeff[codeParams.scan[n + subPos]])
-                    break;
-            if (n < 0)
+            if (!(sigCoeffGroupFlag64 & (1ULL << codeParams.scanCG[subSet])))
                 continue;
 
-            int lastNZPosInCG = n;
-
-            for (n = 0;; n++)
-                if (dstCoeff[codeParams.scan[n + subPos]])
-                    break;
+            /* measure distance between first and last non-zero coef in this
+             * coding group */
+            const uint32_t posFirstLast = primitives.findPosFirstLast(&dstCoeff[codeParams.scan[subPos]], trSize, g_scan4x4[codeParams.scanType]);
+            int firstNZPosInCG = (uint16_t)posFirstLast;
+            int lastNZPosInCG = posFirstLast >> 16;
 
-            int firstNZPosInCG = n;
 
             if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD)
             {
@@ -1092,22 +1317,6 @@
667
     return numSig;
668
 }
669
 
670
-/* Pattern decision for context derivation process of significant_coeff_flag */
671
-uint32_t Quant::calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
672
-{
673
-    if (!log2TrSizeCG)
674
-        return 0;
675
-
676
-    const uint32_t trSizeCG = 1 << log2TrSizeCG;
677
-    X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
678
-    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
679
-    const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
680
-    const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
681
-    const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
682
-
683
-    return sigRight + sigLower;
684
-}
685
-
686
 /* Context derivation process of coeff_abs_significant_flag */
687
 uint32_t Quant::getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma,
688
                              uint32_t firstSignificanceMapContext)
689
@@ -1175,14 +1384,3 @@
690
     return (bIsLuma && (posX | posY) >= 4) ? 3 + offset : offset;
691
 }
692
 
693
-/* Context derivation process of coeff_abs_significant_flag */
694
-uint32_t Quant::getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
695
-{
696
-    const uint32_t trSizeCG = 1 << log2TrSizeCG;
697
-
698
-    const uint32_t sigPos = (uint32_t)(cgGroupMask >> (1 + (cgPosY << log2TrSizeCG) + cgPosX));
699
-    const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
700
-    const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
701
-
702
-    return (sigRight | sigLower) & 1;
703
-}
704
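The rewritten sign-hiding loop above replaces two scalar scans per coding group with a single `primitives.findPosFirstLast` call that packs both positions into one `uint32_t`. A scalar sketch of that packing contract (a hypothetical reference implementation, assuming a 4x4 coding group addressed through a scan table with row stride `trSize`):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for the findPosFirstLast primitive used by the
// new sign-hiding loop: return the scan index of the first non-zero coefficient
// in the low 16 bits and of the last non-zero coefficient in the high 16 bits.
// The caller guarantees the coding group holds at least one non-zero
// coefficient (its bit in sigCoeffGroupFlag64 is set).
static uint32_t findPosFirstLastRef(const int16_t* coeff, int trSize,
                                    const uint16_t scan4x4[16])
{
    int first = 0, last = 15;
    while (last >= 0)
    {
        uint16_t pos = scan4x4[last];             // position inside the 4x4 CG
        if (coeff[(pos >> 2) * trSize + (pos & 3)])
            break;
        last--;
    }
    while (first < last)
    {
        uint16_t pos = scan4x4[first];
        if (coeff[(pos >> 2) * trSize + (pos & 3)])
            break;
        first++;
    }
    return ((uint32_t)last << 16) | (uint32_t)first;
}
```

With an identity scan and `trSize` 4, non-zeros at scan indices 2 and 5 yield `(5 << 16) | 2`, matching how the caller unpacks `firstNZPosInCG` and `lastNZPosInCG`.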
x265_1.6.tar.gz/source/common/quant.h -> x265_1.7.tar.gz/source/common/quant.h Changed
 
@@ -41,7 +41,7 @@
     int per;
     int qp;
     int64_t lambda2; /* FIX8 */
-    int64_t lambda;  /* FIX8 */
+    int32_t lambda;  /* FIX8, dynamic range is 18-bits in 8bpp and 20-bits in 16bpp */
 
     QpParam() : qp(MAX_INT) {}
 
@@ -53,7 +53,8 @@
             per = qpScaled / 6;
             qp  = qpScaled;
             lambda2 = (int64_t)(x265_lambda2_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
-            lambda  = (int64_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+            lambda  = (int32_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+            X265_CHECK((x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5) < (double)MAX_INT, "x265_lambda_tab[] value too large\n");
         }
     }
 };
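This hunk narrows `QpParam::lambda` from `int64_t` to `int32_t`, which is safe because "FIX8" is a Q8 fixed-point encoding: the double value is stored as round(value * 256), needing at most ~20 bits here. A minimal sketch of that encoding (helper names are illustrative, not from x265):

```cpp
#include <cassert>
#include <cstdint>

// "FIX8" in QpParam means Q8 fixed point: the floating-point lambda is stored
// as round(value * 256). Hypothetical helpers showing the round trip the
// narrowed int32_t field relies on.
static int32_t toFix8(double v)    { return (int32_t)(v * 256.0 + 0.5); }
static double  fromFix8(int32_t f) { return f / 256.0; }
```

Values whose fractional part is a multiple of 1/256 (such as 3.25) round-trip exactly; everything else is quantized to the nearest 1/256 step.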
@@ -82,7 +83,7 @@
     QpParam            m_qpParam[3];
 
     int                m_rdoqLevel;
-    int64_t            m_psyRdoqScale;
+    int32_t            m_psyRdoqScale;  // dynamic range [0,50] * 256 = 14-bits
     int16_t*           m_resiDctCoeff;
     int16_t*           m_fencDctCoeff;
     int16_t*           m_fencShortBuf;
@@ -103,7 +104,7 @@
     bool allocNoiseReduction(const x265_param& param);
 
     /* CU setup */
-    void setQPforQuant(const CUData& cu);
+    void setQPforQuant(const CUData& ctu, int qp);
 
     uint32_t transformNxN(const CUData& cu, const pixel* fenc, uint32_t fencStride, const int16_t* residual, uint32_t resiStride, coeff_t* coeff,
                           uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip);
@@ -111,10 +112,39 @@
     void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
                          uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig);
 
+    /* Pattern decision for context derivation process of significant_coeff_flag */
+    static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        if (trSizeCG == 1)
+            return 0;
+
+        X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: cgBlkPos+1 may exceed 63, which is invalid for a shift, but in
+        //       that case both cgPosX and cgPosY equal (trSizeCG - 1), so the
+        //       sigRight and sigLower masks clear the value to zero and the
+        //       final result is still correct
+        const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid
+
+        // TODO: instruction BT is faster, but _bittest64 still generates instruction 'BT m, r' in VS2012
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
+        return sigRight + sigLower;
+    }
+
+    /* Context derivation process of coeff_abs_significant_flag */
+    static uint32_t getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx
+        const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+
+        return (sigRight | sigLower) & 1;
+    }
+
     /* static methods shared with entropy.cpp */
-    static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);
     static uint32_t getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma, uint32_t firstSignificanceMapContext);
-    static uint32_t getSigCoeffGroupCtxInc(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);
 
 protected:
 
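The inlined helpers above read a 64-bit coding-group significance bitmap with branchless sign-extension masks. An equivalent branchy reference (hypothetical, written only to clarify what they compute) makes the intent plain: the context increment is 1 exactly when the right or lower neighbour coding group is coded.

```cpp
#include <cassert>
#include <cstdint>

// Branchy reference for getSigCoeffGroupCtxInc: bit (cgPosY * trSizeCG + cgPosX)
// of the mask marks a coded 4x4 coding group. The result is 1 when the right
// or lower neighbour CG is coded, 0 otherwise. Hypothetical helper mirroring
// the branchless version in the hunk.
static uint32_t cgCtxIncRef(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY,
                            uint32_t trSizeCG)
{
    uint32_t right = (cgPosX + 1 < trSizeCG) &&
                     ((cgGroupMask >> (cgPosY * trSizeCG + cgPosX + 1)) & 1);
    uint32_t lower = (cgPosY + 1 < trSizeCG) &&
                     ((cgGroupMask >> ((cgPosY + 1) * trSizeCG + cgPosX)) & 1);
    return right | lower;
}
```

The branchless form in the diff gets the same answer without comparisons: shifting the mask down by `cgBlkPos + 1` puts the right neighbour at bit 0 and the lower neighbour at bit `trSizeCG - 1`, and the `(x - (trSizeCG - 1)) >> 31` terms zero the result at the right and bottom edges.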
x265_1.6.tar.gz/source/common/slice.h -> x265_1.7.tar.gz/source/common/slice.h Changed
 
@@ -98,6 +98,7 @@
         LEVEL6 = 180,
         LEVEL6_1 = 183,
         LEVEL6_2 = 186,
+        LEVEL8_5 = 255,
     };
 }
 
x265_1.6.tar.gz/source/common/threading.h -> x265_1.7.tar.gz/source/common/threading.h Changed
 
@@ -189,6 +189,14 @@
         LeaveCriticalSection(&m_cs);
     }
 
+    void poke(void)
+    {
+        /* awaken all waiting threads, but make no change */
+        EnterCriticalSection(&m_cs);
+        WakeAllConditionVariable(&m_cv);
+        LeaveCriticalSection(&m_cs);
+    }
+
     void incr()
     {
         EnterCriticalSection(&m_cs);
@@ -370,6 +378,14 @@
         pthread_mutex_unlock(&m_mutex);
     }
 
+    void poke(void)
+    {
+        /* awaken all waiting threads, but make no change */
+        pthread_mutex_lock(&m_mutex);
+        pthread_cond_broadcast(&m_cond);
+        pthread_mutex_unlock(&m_mutex);
+    }
+
     void incr()
     {
         pthread_mutex_lock(&m_mutex);
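Both platform variants gain `poke()`: broadcast under the lock without touching the guarded value, so every waiter wakes and re-evaluates its own exit condition (typically an abort flag). A portable sketch of the same idiom with `std::condition_variable` (class and method names are illustrative, not the x265 API):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch of the poke() idiom added to threading.h: wake all waiting threads
// without changing the guarded value, so each can re-check an external
// condition such as an abort flag. Names are illustrative.
class ThreadSafeCounter
{
    std::mutex m_mutex;
    std::condition_variable m_cond;
    int m_value = 0;
public:
    void incr()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_value;
        m_cond.notify_all();
    }
    void poke()
    {
        /* awaken all waiting threads, but make no change */
        std::lock_guard<std::mutex> lock(m_mutex);
        m_cond.notify_all();
    }
    void waitForChangeOr(const std::atomic<bool>& stop)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        int seen = m_value;
        m_cond.wait(lock, [&] { return m_value != seen || stop.load(); });
    }
};
```

A worker blocked in `waitForChangeOr()` is released either by `incr()` (the value changed) or by setting the flag and calling `poke()` (nothing changed, but the predicate must be rechecked). Broadcasting while holding the mutex keeps the wakeup ordered with respect to the flag update.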
x265_1.6.tar.gz/source/common/threadpool.cpp -> x265_1.7.tar.gz/source/common/threadpool.cpp Changed
 
@@ -232,7 +232,7 @@
     int cpuCount = getCpuCount();
     bool bNumaSupport = false;
 
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
     bNumaSupport = true;
 #elif HAVE_LIBNUMA
     bNumaSupport = numa_available() >= 0;
@@ -241,10 +241,10 @@
 
     for (int i = 0; i < cpuCount; i++)
     {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
         UCHAR node;
         if (GetNumaProcessorNode((UCHAR)i, &node))
-            cpusPerNode[X265_MIN(node, MAX_NODE_NUM)]++;
+            cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++;
         else
 #elif HAVE_LIBNUMA
         if (bNumaSupport >= 0)
@@ -261,7 +261,7 @@
     /* limit nodes based on param->numaPools */
     if (p->numaPools && *p->numaPools)
     {
-        char *nodeStr = p->numaPools;
+        const char *nodeStr = p->numaPools;
         for (int i = 0; i < numNumaNodes; i++)
         {
             if (!*nodeStr)
@@ -373,7 +373,7 @@
     return true;
 }
 
-void ThreadPool::stop()
+void ThreadPool::stopWorkers()
 {
     if (m_workers)
     {
@@ -408,7 +408,7 @@
 /* static */
 void ThreadPool::setThreadNodeAffinity(int numaNode)
 {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
     GROUP_AFFINITY groupAffinity;
     if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))
     {
@@ -433,7 +433,7 @@
 /* static */
 int ThreadPool::getNumaNodeCount()
 {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
     ULONG num = 1;
     if (GetNumaHighestNodeNumber(&num))
         num++;
x265_1.6.tar.gz/source/common/threadpool.h -> x265_1.7.tar.gz/source/common/threadpool.h Changed
 
@@ -94,7 +94,7 @@
 
     bool create(int numThreads, int maxProviders, int node);
     bool start();
-    void stop();
+    void stopWorkers();
     void setCurrentThreadAffinity();
     int  tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
     int  tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.7.tar.gz/source/common/x86/asm-primitives.cpp Changed
 
@@ -800,6 +800,10 @@
 #error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF"
 #endif
 
+#if X86_64
+    p.scanPosLast = x265_scanPosLast_x64;
+#endif
+
     if (cpuMask & X265_CPU_SSE2)
     {
         /* We do not differentiate CPUs which support MMX and not SSE2. We only check
@@ -859,9 +863,6 @@
         PIXEL_AVG_W4(mmx2);
         LUMA_VAR(sse2);
 
-        p.luma_p2s = x265_luma_p2s_sse2;
-        p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_sse2;
-        p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_sse2;
 
         ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
         ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
@@ -872,15 +873,41 @@
         ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
         ALL_LUMA_TU_S(transpose, transpose, sse2);
 
-        p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
-
-        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+        ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
+        ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);
+
+        p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
+        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
+        p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2;
+        p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2;
+        p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2;
+        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
+        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
+        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+        p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
+        p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
+        p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
+        p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
+        p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
+        p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
+        p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
+        p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
+        p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
+        p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
+        p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
+        p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
+        p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
+        p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
+        p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
+        p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
+        p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
+        p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
+        p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
+        p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
+        p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
+        p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
+        p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
+        p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
 
         p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
         ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
@@ -918,6 +945,74 @@
         p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
         p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
+
+        p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
+        p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
+        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
+        p.findPosFirstLast = x265_findPosFirstLast_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
     {
@@ -957,6 +1052,13 @@
        ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4);
        ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
        ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
     }
     if (cpuMask & X265_CPU_AVX)
     {
@@ -1079,6 +1181,26 @@
     }
     if (cpuMask & X265_CPU_AVX2)
     {
+        p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
+
+        p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
+        p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
+        p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
+        p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
+
+        p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+        p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+        p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+        p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+        p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+
+        p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
+        p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2;
+        p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+        p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2;
+        p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+        p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
         p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
         p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2;
 
@@ -1087,6 +1209,7 @@
         p.dequant_normal  = x265_dequant_normal_avx2;
 
         p.scale1D_128to64 = x265_scale1D_128to64_avx2;
+        p.scale2D_64to32 = x265_scale2D_64to32_avx2;
         // p.weight_pp = x265_weight_pp_avx2; fails tests
 
         p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2;
@@ -1119,12 +1242,84 @@
         ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2);
         ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2);
         ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2);
+
+        p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
+        p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+        p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2;
+
+        p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
+        p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+        p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2;
+
+        p.pu[LUMA_16x4].sad = x265_pixel_sad_16x4_avx2;
+        p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_avx2;
+        p.pu[LUMA_16x12].sad = x265_pixel_sad_16x12_avx2;
+        p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_avx2;
+        p.pu[LUMA_16x32].sad = x265_pixel_sad_16x32_avx2;
+
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_avx2;
+        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_avx2;
+        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_avx2;
+        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_avx2;
+        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_avx2;
+        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_avx2;
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2;
+        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2;
+
+        p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_avx2;
+        p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2;
+        p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2;
+
+        if (cpuMask & X265_CPU_BMI2)
+            p.scanPosLast = x265_scanPosLast_avx2_bmi2;
    }
}
#else // if HIGH_BIT_DEPTH
 
void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 8bpp
{
+#if X86_64
+    p.scanPosLast = x265_scanPosLast_x64;
+#endif
+
    if (cpuMask & X265_CPU_SSE2)
    {
        /* We do not differentiate CPUs which support MMX and not SSE2. We only check
@@ -1175,6 +1370,47 @@
        CHROMA_420_VSP_FILTERS(_sse2);
        CHROMA_422_VSP_FILTERS(_sse2);
        CHROMA_444_VSP_FILTERS(_sse2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = x265_interp_4tap_vert_pp_2x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = x265_interp_4tap_vert_pp_4x32_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+#if X86_64
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = x265_interp_4tap_vert_pp_6x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+#endif
+
+        ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2);
+        p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_sse2;
+        ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2);
+        p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_sse2;
+        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse3;
 
        //p.frameInitLowres = x265_frame_init_lowres_core_mmx2;
        p.frameInitLowres = x265_frame_init_lowres_core_sse2;
@@ -1186,15 +1422,8 @@
        ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
        ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2);
 
-        p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
-
-        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+        ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
+        ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);
 
        p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
@@ -1204,6 +1433,32 @@
        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+        p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
353
+        p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
354
+        p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
355
+        p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
356
+        p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
357
+        p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
358
+        p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
359
+        p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
360
+        p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
361
+        p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
362
+        p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
363
+        p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
364
+        p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
365
+        p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
366
+        p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
367
+        p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
368
+        p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
369
+        p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
370
+        p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
371
+        p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
372
+        p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
373
+        p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
374
+        p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
375
+        p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
376
+
377
+        p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse2;
378
 
379
         p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
380
         p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
381
@@ -1224,6 +1479,12 @@
382
 
383
         p.planecopy_sp = x265_downShift_16_sse2;
384
     }
385
+    if (cpuMask & X265_CPU_SSE3)
386
+    {
387
+        ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
388
+        ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
389
+        ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
390
+    }
391
     if (cpuMask & X265_CPU_SSSE3)
392
     {
393
         p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_ssse3;
394
@@ -1249,48 +1510,86 @@
395
         ASSIGN_SSE_PP(ssse3);
         p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_ssse3;
         p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_ssse3;
-        p.pu[LUMA_4x4].filter_p2s = x265_pixelToShort_4x4_ssse3;
-        p.pu[LUMA_4x8].filter_p2s = x265_pixelToShort_4x8_ssse3;
-        p.pu[LUMA_4x16].filter_p2s = x265_pixelToShort_4x16_ssse3;
-        p.pu[LUMA_8x4].filter_p2s = x265_pixelToShort_8x4_ssse3;
-        p.pu[LUMA_8x8].filter_p2s = x265_pixelToShort_8x8_ssse3;
-        p.pu[LUMA_8x16].filter_p2s = x265_pixelToShort_8x16_ssse3;
-        p.pu[LUMA_8x32].filter_p2s = x265_pixelToShort_8x32_ssse3;
-        p.pu[LUMA_16x4].filter_p2s = x265_pixelToShort_16x4_ssse3;
-        p.pu[LUMA_16x8].filter_p2s = x265_pixelToShort_16x8_ssse3;
-        p.pu[LUMA_16x12].filter_p2s = x265_pixelToShort_16x12_ssse3;
-        p.pu[LUMA_16x16].filter_p2s = x265_pixelToShort_16x16_ssse3;
-        p.pu[LUMA_16x32].filter_p2s = x265_pixelToShort_16x32_ssse3;
-        p.pu[LUMA_16x64].filter_p2s = x265_pixelToShort_16x64_ssse3;
-        p.pu[LUMA_32x8].filter_p2s = x265_pixelToShort_32x8_ssse3;
-        p.pu[LUMA_32x16].filter_p2s = x265_pixelToShort_32x16_ssse3;
-        p.pu[LUMA_32x24].filter_p2s = x265_pixelToShort_32x24_ssse3;
-        p.pu[LUMA_32x32].filter_p2s = x265_pixelToShort_32x32_ssse3;
-        p.pu[LUMA_32x64].filter_p2s = x265_pixelToShort_32x64_ssse3;
-        p.pu[LUMA_64x16].filter_p2s = x265_pixelToShort_64x16_ssse3;
-        p.pu[LUMA_64x32].filter_p2s = x265_pixelToShort_64x32_ssse3;
-        p.pu[LUMA_64x48].filter_p2s = x265_pixelToShort_64x48_ssse3;
-        p.pu[LUMA_64x64].filter_p2s = x265_pixelToShort_64x64_ssse3;
-
-        p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_ssse3;
-        p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_ssse3;
 
         p.dst4x4 = x265_dst4_ssse3;
         p.cu[BLOCK_8x8].idct = x265_idct8_ssse3;
 
         ALL_LUMA_TU(count_nonzero, count_nonzero, ssse3);
 
+        // MUST be done after LUMA_FILTERS() to overwrite default version
+        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
+
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
         p.scale1D_128to64 = x265_scale1D_128to64_ssse3;
         p.scale2D_64to32 = x265_scale2D_64to32_ssse3;
+
+        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+        p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
+        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.findPosFirstLast = x265_findPosFirstLast_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
     {
         p.sign = x265_calSign_sse4;
         p.saoCuOrgE0 = x265_saoCuOrgE0_sse4;
         p.saoCuOrgE1 = x265_saoCuOrgE1_sse4;
-        p.saoCuOrgE2 = x265_saoCuOrgE2_sse4;
-        p.saoCuOrgE3 = x265_saoCuOrgE3_sse4;
+        p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_sse4;
+        p.saoCuOrgE2[0] = x265_saoCuOrgE2_sse4;
+        p.saoCuOrgE2[1] = x265_saoCuOrgE2_sse4;
+        p.saoCuOrgE3[0] = x265_saoCuOrgE3_sse4;
+        p.saoCuOrgE3[1] = x265_saoCuOrgE3_sse4;
         p.saoCuOrgB0 = x265_saoCuOrgB0_sse4;
 
         LUMA_ADDAVG(sse4);
@@ -1321,7 +1620,7 @@
         CHROMA_444_VSP_FILTERS_SSE4(_sse4);
 
         // MUST be done after LUMA_FILTERS() to overwrite default version
-        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse4;
+        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
 
         LUMA_CU_BLOCKCOPY(ps, sse4);
         CHROMA_420_CU_BLOCKCOPY(ps, sse4);
@@ -1348,6 +1647,25 @@
         p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4;
         p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4;
 
+        p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_sse4;
+        p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_sse4;
+        p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_sse4;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
+
#if X86_64
         ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
         ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
@@ -1363,6 +1681,20 @@
         p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = x265_pixel_satd_16x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = x265_pixel_satd_32x64_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = x265_pixel_satd_16x16_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = x265_pixel_satd_32x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = x265_pixel_satd_16x64_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = x265_pixel_satd_16x8_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = x265_pixel_satd_32x16_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = x265_pixel_satd_8x4_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = x265_pixel_satd_8x16_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = x265_pixel_satd_8x8_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = x265_pixel_satd_8x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = x265_pixel_satd_4x8_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = x265_pixel_satd_4x16_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = x265_pixel_satd_4x4_avx;
         ALL_LUMA_PU(satd, pixel_satd, avx);
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = x265_pixel_satd_4x4_avx;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = x265_pixel_satd_8x8_avx;
@@ -1383,6 +1715,10 @@
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = x265_pixel_satd_32x8_avx;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = x265_pixel_satd_8x32_avx;
         ASSIGN_SA8D(avx);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = x265_pixel_sa8d_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = x265_pixel_sa8d_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = x265_pixel_sa8d_8x8_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = x265_pixel_satd_4x4_avx;
         ASSIGN_SSE_PP(avx);
         p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = x265_pixel_ssd_8x8_avx;
         ASSIGN_SSE_SS(avx);
@@ -1405,6 +1741,7 @@
         p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
         p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
         p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+        p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
         p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
@@ -1447,6 +1784,26 @@
#if X86_64
     if (cpuMask & X265_CPU_AVX2)
     {
+        p.planecopy_sp = x265_downShift_16_avx2;
+
+        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_avx2;
+
+        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_avx2;
+        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_avx2;
+
+        p.idst4x4 = x265_idst4_avx2;
+        p.dst4x4 = x265_dst4_avx2;
+        p.scale2D_64to32 = x265_scale2D_64to32_avx2;
+        p.saoCuOrgE0 = x265_saoCuOrgE0_avx2;
+        p.saoCuOrgE1 = x265_saoCuOrgE1_avx2;
+        p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_avx2;
+        p.saoCuOrgE2[0] = x265_saoCuOrgE2_avx2;
+        p.saoCuOrgE2[1] = x265_saoCuOrgE2_32_avx2;
+        p.saoCuOrgE3[0] = x265_saoCuOrgE3_avx2;
+        p.saoCuOrgE3[1] = x265_saoCuOrgE3_32_avx2;
+        p.saoCuOrgB0 = x265_saoCuOrgB0_avx2;
+        p.sign = x265_calSign_avx2;
+
         p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_avx2;
         p.cu[BLOCK_8x8].psy_cost_ss = x265_psyCost_ss_8x8_avx2;
         p.cu[BLOCK_16x16].psy_cost_ss = x265_psyCost_ss_16x16_avx2;
@@ -1494,31 +1851,50 @@
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = x265_addAvg_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = x265_addAvg_8x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = x265_addAvg_8x32_avx2;
-
         p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = x265_addAvg_12x16_avx2;
-
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = x265_addAvg_16x4_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = x265_addAvg_16x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = x265_addAvg_16x12_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = x265_addAvg_16x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = x265_addAvg_16x32_avx2;
-
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = x265_addAvg_32x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = x265_addAvg_32x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = x265_addAvg_32x24_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = x265_addAvg_32x32_avx2;
 
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = x265_addAvg_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = x265_addAvg_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = x265_addAvg_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = x265_addAvg_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = x265_addAvg_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = x265_addAvg_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = x265_addAvg_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = x265_addAvg_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = x265_addAvg_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = x265_addAvg_16x24_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = x265_addAvg_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = x265_addAvg_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = x265_addAvg_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = x265_addAvg_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = x265_addAvg_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = x265_addAvg_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = x265_addAvg_32x64_avx2;
+
         p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
         p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
         p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2;
         p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
         p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2;
 
         p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
         p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
         p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2;
         p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
         p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2;
 
         p.pu[LUMA_16x4].pixelavg_pp = x265_pixel_avg_16x4_avx2;
         p.pu[LUMA_16x8].pixelavg_pp = x265_pixel_avg_16x8_avx2;
@@ -1543,6 +1919,22 @@
         p.pu[LUMA_8x16].satd  = x265_pixel_satd_8x16_avx2;
         p.pu[LUMA_8x8].satd   = x265_pixel_satd_8x8_avx2;
 
+        p.pu[LUMA_16x4].satd  = x265_pixel_satd_16x4_avx2;
+        p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+        p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+        p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
+        p.pu[LUMA_32x8].satd   = x265_pixel_satd_32x8_avx2;
+        p.pu[LUMA_32x16].satd   = x265_pixel_satd_32x16_avx2;
+        p.pu[LUMA_32x24].satd   = x265_pixel_satd_32x24_avx2;
+        p.pu[LUMA_32x32].satd   = x265_pixel_satd_32x32_avx2;
+        p.pu[LUMA_32x64].satd   = x265_pixel_satd_32x64_avx2;
+        p.pu[LUMA_48x64].satd   = x265_pixel_satd_48x64_avx2;
+        p.pu[LUMA_64x16].satd   = x265_pixel_satd_64x16_avx2;
+        p.pu[LUMA_64x32].satd   = x265_pixel_satd_64x32_avx2;
+        p.pu[LUMA_64x48].satd   = x265_pixel_satd_64x48_avx2;
+        p.pu[LUMA_64x64].satd   = x265_pixel_satd_64x64_avx2;
+
         p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2;
         p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2;
         p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2;
@@ -1602,8 +1994,37 @@
 
         p.scale1D_128to64 = x265_scale1D_128to64_avx2;
         p.weight_pp = x265_weight_pp_avx2;
+        p.weight_sp = x265_weight_sp_avx2;
 
         // intra_pred functions
+        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_avx2;
+        p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_avx2;
+        p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_avx2;
+        p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_avx2;
+        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_avx2;
+        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_avx2;
+        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_avx2;
+        p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_avx2;
+        p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_avx2;
+        p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_avx2;
+        p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_avx2;
+        p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_avx2;
+        p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_avx2;
+        p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_avx2;
+        p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_19_avx2;
+        p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_20_avx2;
+        p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_21_avx2;
+        p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_22_avx2;
+        p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_23_avx2;
+        p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_24_avx2;
+        p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_25_avx2;
+        p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_27_avx2;
+        p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_28_avx2;
+        p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_29_avx2;
+        p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_30_avx2;
+        p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_31_avx2;
+        p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_32_avx2;
+        p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_33_avx2;
         p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2;
         p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2;
         p.cu[BLOCK_8x8].intra_pred[4] = x265_intra_pred_ang8_4_avx2;
@@ -1622,6 +2043,24 @@
         p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2;
727
         p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
728
         p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
729
+        p.cu[BLOCK_8x8].intra_pred[13] = x265_intra_pred_ang8_13_avx2;
730
+        p.cu[BLOCK_8x8].intra_pred[20] = x265_intra_pred_ang8_20_avx2;
731
+        p.cu[BLOCK_8x8].intra_pred[21] = x265_intra_pred_ang8_21_avx2;
732
+        p.cu[BLOCK_8x8].intra_pred[22] = x265_intra_pred_ang8_22_avx2;
733
+        p.cu[BLOCK_8x8].intra_pred[23] = x265_intra_pred_ang8_23_avx2;
734
+        p.cu[BLOCK_8x8].intra_pred[14] = x265_intra_pred_ang8_14_avx2;
735
+        p.cu[BLOCK_8x8].intra_pred[15] = x265_intra_pred_ang8_15_avx2;
736
+        p.cu[BLOCK_8x8].intra_pred[16] = x265_intra_pred_ang8_16_avx2;
737
+        p.cu[BLOCK_16x16].intra_pred[3] = x265_intra_pred_ang16_3_avx2;
738
+        p.cu[BLOCK_16x16].intra_pred[4] = x265_intra_pred_ang16_4_avx2;
739
+        p.cu[BLOCK_16x16].intra_pred[5] = x265_intra_pred_ang16_5_avx2;
740
+        p.cu[BLOCK_16x16].intra_pred[6] = x265_intra_pred_ang16_6_avx2;
741
+        p.cu[BLOCK_16x16].intra_pred[7] = x265_intra_pred_ang16_7_avx2;
742
+        p.cu[BLOCK_16x16].intra_pred[8] = x265_intra_pred_ang16_8_avx2;
743
+        p.cu[BLOCK_16x16].intra_pred[9] = x265_intra_pred_ang16_9_avx2;
744
+        p.cu[BLOCK_16x16].intra_pred[12] = x265_intra_pred_ang16_12_avx2;
745
+        p.cu[BLOCK_16x16].intra_pred[11] = x265_intra_pred_ang16_11_avx2;
746
+        p.cu[BLOCK_16x16].intra_pred[13] = x265_intra_pred_ang16_13_avx2;
747
         p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;
748
         p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2;
749
         p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2;
750
@@ -1642,6 +2081,16 @@
751
         p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2;
752
         p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
753
         p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
754
+        p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
755
+        p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2;
756
+        p.cu[BLOCK_32x32].intra_pred[24] = x265_intra_pred_ang32_24_avx2;
757
+        p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2;
758
+        p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2;
759
+        p.cu[BLOCK_32x32].intra_pred[21] = x265_intra_pred_ang32_21_avx2;
760
+        p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2;
761
+
762
+        // all_angs primitives
+        p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_avx2;
 
         // copy_sp primitives
         p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
@@ -1725,6 +2174,8 @@
         p.pu[LUMA_64x48].luma_hps = x265_interp_8tap_horiz_ps_64x48_avx2;
         p.pu[LUMA_64x32].luma_hps = x265_interp_8tap_horiz_ps_64x32_avx2;
         p.pu[LUMA_64x16].luma_hps = x265_interp_8tap_horiz_ps_64x16_avx2;
+        p.pu[LUMA_12x16].luma_hps = x265_interp_8tap_horiz_ps_12x16_avx2;
+        p.pu[LUMA_24x32].luma_hps = x265_interp_8tap_horiz_ps_24x32_avx2;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;
@@ -1744,6 +2195,7 @@
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = x265_interp_4tap_horiz_pp_6x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = x265_interp_4tap_horiz_pp_6x16_avx2;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2;
@@ -1777,6 +2229,7 @@
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2;
 
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2;
@@ -1887,8 +2340,353 @@
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2;
 
-        if ((cpuMask & X265_CPU_BMI1) && (cpuMask & X265_CPU_BMI2))
-            p.findPosLast = x265_findPosLast_x64;
+        //i422 for chroma_vss
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vss = x265_interp_4tap_vert_ss_2x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = x265_interp_4tap_vert_ss_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = x265_interp_4tap_vert_ss_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = x265_interp_4tap_vert_ss_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = x265_interp_4tap_vert_ss_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = x265_interp_4tap_vert_ss_6x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vss = x265_interp_4tap_vert_ss_2x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = x265_interp_4tap_vert_ss_16x24_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = x265_interp_4tap_vert_ss_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = x265_interp_4tap_vert_ss_4x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2;
+
+        //i444 for chroma_vss
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = x265_interp_4tap_vert_ss_64x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = x265_interp_4tap_vert_ss_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = x265_interp_4tap_vert_ss_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = x265_interp_4tap_vert_ss_12x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = x265_interp_4tap_vert_ss_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = x265_interp_4tap_vert_ss_32x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = x265_interp_4tap_vert_ss_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = x265_interp_4tap_vert_ss_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = x265_interp_4tap_vert_ss_48x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = x265_interp_4tap_vert_ss_64x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2;
+
+        p.pu[LUMA_16x16].luma_hvpp = x265_interp_8tap_hv_pp_16x16_avx2;
+
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
+        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2;
+
+        //i422 for chroma_hpp
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = x265_interp_4tap_horiz_pp_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = x265_interp_4tap_horiz_pp_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = x265_interp_4tap_horiz_pp_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = x265_interp_4tap_horiz_pp_8x12_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = x265_interp_4tap_horiz_pp_16x24_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = x265_interp_4tap_horiz_pp_32x48_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hpp = x265_interp_4tap_horiz_pp_2x8_avx2;
+
+        //i444 filters hpp
+
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = x265_interp_4tap_horiz_pp_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = x265_interp_4tap_horiz_pp_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = x265_interp_4tap_horiz_pp_12x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = x265_interp_4tap_horiz_pp_24x32_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = x265_interp_4tap_horiz_pp_64x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = x265_interp_4tap_horiz_pp_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = x265_interp_4tap_horiz_pp_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = x265_interp_4tap_horiz_pp_64x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = x265_interp_4tap_horiz_pp_48x64_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = x265_interp_4tap_horiz_ps_8x64_avx2; //adding macro call
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = x265_interp_4tap_horiz_ps_8x12_avx2; //adding macro call
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2;//adding macro call
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = x265_interp_4tap_horiz_ps_16x24_avx2;//adding macro call
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = x265_interp_4tap_horiz_ps_32x48_avx2;
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hps = x265_interp_4tap_horiz_ps_2x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = x265_interp_4tap_horiz_ps_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hps = x265_interp_4tap_horiz_ps_2x16_avx2;
+
+        //i444 chroma_hps
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = x265_interp_4tap_horiz_ps_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = x265_interp_4tap_horiz_ps_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = x265_interp_4tap_horiz_ps_64x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = x265_interp_4tap_horiz_ps_64x64_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = x265_interp_4tap_horiz_ps_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = x265_interp_4tap_horiz_ps_48x64_avx2;
+
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2;
+
+        //i422 for chroma_vsp
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vsp = x265_interp_4tap_vert_sp_2x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = x265_interp_4tap_vert_sp_24x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = x265_interp_4tap_vert_sp_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = x265_interp_4tap_vert_sp_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = x265_interp_4tap_vert_sp_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = x265_interp_4tap_vert_sp_6x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vsp = x265_interp_4tap_vert_sp_2x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = x265_interp_4tap_vert_sp_16x24_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = x265_interp_4tap_vert_sp_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = x265_interp_4tap_vert_sp_4x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2;
+
+        //i444 for chroma_vsp
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = x265_interp_4tap_vert_sp_64x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = x265_interp_4tap_vert_sp_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = x265_interp_4tap_vert_sp_12x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = x265_interp_4tap_vert_sp_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = x265_interp_4tap_vert_sp_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = x265_interp_4tap_vert_sp_32x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = x265_interp_4tap_vert_sp_64x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = x265_interp_4tap_vert_sp_64x48_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = x265_interp_4tap_vert_sp_48x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = x265_interp_4tap_vert_sp_64x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2;
+
+        //i422 for chroma_vps
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vps = x265_interp_4tap_vert_ps_2x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = x265_interp_4tap_vert_ps_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = x265_interp_4tap_vert_ps_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = x265_interp_4tap_vert_ps_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = x265_interp_4tap_vert_ps_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = x265_interp_4tap_vert_ps_16x24_avx2;
+
+        //i444 for chroma_vps
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = x265_interp_4tap_vert_ps_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = x265_interp_4tap_vert_ps_12x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = x265_interp_4tap_vert_ps_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = x265_interp_4tap_vert_ps_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = x265_interp_4tap_vert_ps_24x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = x265_interp_4tap_vert_ps_32x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
+
+        //i422 for chroma_vpp
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = x265_interp_4tap_vert_pp_32x48_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = x265_interp_4tap_vert_pp_12x32_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = x265_interp_4tap_vert_pp_16x24_avx2;
+
+        //i444 for chroma_vpp
+        p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = x265_interp_4tap_vert_pp_16x12_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = x265_interp_4tap_vert_pp_12x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = x265_interp_4tap_vert_pp_16x4_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = x265_interp_4tap_vert_pp_32x24_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = x265_interp_4tap_vert_pp_24x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = x265_interp_4tap_vert_pp_32x8_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2;
+        p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
+
+        if (cpuMask & X265_CPU_BMI2)
+            p.scanPosLast = x265_scanPosLast_avx2_bmi2;
     }
 #endif
 }
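The C++ hunks above all follow one pattern: x265 keeps a table of function pointers that is first filled with portable C implementations and then selectively overwritten with assembly versions when `cpuMask` reports support for the required instruction set (here AVX2, and BMI2 for `scanPosLast`). As a rough sketch of that dispatch pattern (the `Primitives` struct, flag values, and `add_*` functions below are simplified stand-ins for illustration, not the real x265 definitions):

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins: the real x265 table holds hundreds of entries,
// one per block size and filter type, as the diff above shows.
enum : uint32_t { CPU_AVX2 = 1u << 0, CPU_BMI2 = 1u << 1 };

static int add_c(int a, int b)    { return a + b; }  // portable C reference
static int add_avx2(int a, int b) { return a + b; }  // stand-in for an asm routine

struct Primitives
{
    int (*add)(int, int);
};

static void setupPrimitives(Primitives& p, uint32_t cpuMask)
{
    p.add = add_c;            // install the C fallback first
    if (cpuMask & CPU_AVX2)
        p.add = add_avx2;     // then override it, as the diff does per PU size
}
```

Callers always go through the table (`p.add(x, y)`), so the CPU detection cost is paid once at setup rather than per call.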
x265_1.6.tar.gz/source/common/x86/const-a.asm -> x265_1.7.tar.gz/source/common/x86/const-a.asm Changed
 
@@ -29,81 +29,100 @@
 
 SECTION_RODATA 32
 
-const pb_1,        times 32 db 1
+;; 8-bit constants
 
-const hsub_mul,    times 16 db 1, -1
-const pw_1,        times 16 dw 1
-const pw_16,       times 16 dw 16
-const pw_32,       times 16 dw 32
-const pw_128,      times 16 dw 128
-const pw_256,      times 16 dw 256
-const pw_257,      times 16 dw 257
-const pw_512,      times 16 dw 512
-const pw_1023,     times 8  dw 1023
-ALIGN 32
-const pw_1024,     times 16 dw 1024
-const pw_4096,     times 16 dw 4096
-const pw_00ff,     times 16 dw 0x00ff
-ALIGN 32
-const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1)
-const deinterleave_shufd, dd 0,4,1,5,2,6,3,7
-const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
-const pb_unpackbd2, times 2 db 4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7
-const pb_unpackwq1, db 0,1,0,1,0,1,0,1,2,3,2,3,2,3,2,3
-const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7
-const pw_swap,      times 2 db 6,7,4,5,2,3,0,1
+const pb_0,                 times 16 db 0
+const pb_1,                 times 32 db 1
+const pb_2,                 times 32 db 2
+const pb_3,                 times 16 db 3
+const pb_4,                 times 32 db 4
+const pb_8,                 times 32 db 8
+const pb_15,                times 32 db 15
+const pb_16,                times 32 db 16
+const pb_32,                times 32 db 32
+const pb_64,                times 32 db 64
+const pb_128,               times 16 db 128
+const pb_a1,                times 16 db 0xa1
 
-const pb_2,        times 32 db 2
-const pb_4,        times 32 db 4
-const pb_16,       times 32 db 16
-const pb_64,       times 32 db 64
-const pb_01,       times  8 db 0,1
-const pb_0,        times 16 db 0
-const pb_a1,       times 16 db 0xa1
-const pb_3,        times 16 db 3
-const pb_8,        times 32 db 8
-const pb_32,       times 32 db 32
-const pb_128,      times 16 db 128
-const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6
+const pb_01,                times  8 db   0,   1
+const hsub_mul,             times 16 db   1,  -1
+const pw_swap,              times  2 db   6,   7,   4,   5,   2,   3,   0,   1
+const pb_unpackbd1,         times  2 db   0,   0,   0,   0,   1,   1,   1,   1,   2,   2,   2,   2,   3,   3,   3,   3
+const pb_unpackbd2,         times  2 db   4,   4,   4,   4,   5,   5,   5,   5,   6,   6,   6,   6,   7,   7,   7,   7
+const pb_unpackwq1,         times  1 db   0,   1,   0,   1,   0,   1,   0,   1,   2,   3,   2,   3,   2,   3,   2,   3
+const pb_unpackwq2,         times  1 db   4,   5,   4,   5,   4,   5,   4,   5,   6,   7,   6,   7,   6,   7,   6,   7
+const pb_shuf8x8c,          times  1 db   0,   0,   0,   0,   2,   2,   2,   2,   4,   4,   4,   4,   6,   6,   6,   6
+const pb_movemask,          times 16 db 0x00
+                            times 16 db 0xFF
+const pb_0000000000000F0F,  times  2 db 0xff, 0x00
+                            times 12 db 0x00
+const pb_000000000000000F,           db 0xff
+                            times 15 db 0x00
 
-const pw_0_15,     times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
-const pw_2,        times 8 dw 2
-const pw_m2,       times 8 dw -2
-const pw_4,        times 8 dw 4
-const pw_8,        times 8 dw 8
-const pw_64,       times 8 dw 64
-const pw_256,      times 8 dw 256
-const pw_32_0,     times 4 dw 32,
-                   times 4 dw 0
-const pw_2000,     times 16 dw 0x2000
-const pw_8000,     times 8 dw 0x8000
-const pw_3fff,     times 8 dw 0x3fff
-const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1
-const pw_ppmmppmm, dw 1,1,-1,-1,1,1,-1,-1
-const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
-const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
-const pd_1,        times 8 dd 1
-const pd_2,        times 8 dd 2
-const pd_4,        times 4 dd 4
-const pd_8,        times 4 dd 8
-const pd_16,       times 4 dd 16
-const pd_32,       times 4 dd 32
-const pd_64,       times 4 dd 64
-const pd_128,      times 4 dd 128
-const pd_256,      times 4 dd 256
-const pd_512,      times 4 dd 512
-const pd_1024,     times 4 dd 1024
-const pd_2048,     times 4 dd 2048
-const pd_ffff,     times 4 dd 0xffff
-const pd_32767,    times 4 dd 32767
-const pd_n32768,   times 4 dd 0xffff8000
-const pw_ff00,     times 8 dw 0xff00
+;; 16-bit constants
 
-const multi_2Row,  dw 1, 2, 3, 4, 1, 2, 3, 4
-const multiL,      dw 1, 2, 3, 4, 5, 6, 7, 8
-const multiH,      dw 9, 10, 11, 12, 13, 14, 15, 16
-const multiH2,     dw 17, 18, 19, 20, 21, 22, 23, 24
-const multiH3,     dw 25, 26, 27, 28, 29, 30, 31, 32
+const pw_1,                 times 16 dw 1
+const pw_2,                 times  8 dw 2
+const pw_m2,                times  8 dw -2
+const pw_4,                 times  8 dw 4
+const pw_8,                 times  8 dw 8
+const pw_16,                times 16 dw 16
+const pw_15,                times 16 dw 15
+const pw_31,                times 16 dw 31
+const pw_32,                times 16 dw 32
+const pw_64,                times  8 dw 64
+const pw_128,               times 16 dw 128
+const pw_256,               times 16 dw 256
+const pw_257,               times 16 dw 257
+const pw_512,               times 16 dw 512
+const pw_1023,              times  8 dw 1023
+const pw_1024,              times 16 dw 1024
+const pw_4096,              times 16 dw 4096
+const pw_00ff,              times 16 dw 0x00ff
+const pw_ff00,              times  8 dw 0xff00
+const pw_2000,              times 16 dw 0x2000
+const pw_8000,              times  8 dw 0x8000
+const pw_3fff,              times  8 dw 0x3fff
+const pw_32_0,              times  4 dw 32,
+                            times  4 dw 0
+const pw_pixel_max,         times 16 dw ((1 << BIT_DEPTH)-1)
+
+const pw_0_15,              times  2 dw   0,   1,   2,   3,   4,   5,   6,   7
+const pw_ppppmmmm,          times  1 dw   1,   1,   1,   1,  -1,  -1,  -1,  -1
+const pw_ppmmppmm,          times  1 dw   1,   1,  -1,  -1,   1,   1,  -1,  -1
+const pw_pmpmpmpm,          times  1 dw   1,  -1,   1,  -1,   1,  -1,   1,  -1
+const pw_pmmpzzzz,          times  1 dw   1,  -1,  -1,   1,   0,   0,   0,   0
+const multi_2Row,           times  1 dw   1,   2,   3,   4,   1,   2,   3,   4
+const multiH,               times  1 dw   9,  10,  11,  12,  13,  14,  15,  16
+const multiH3,              times  1 dw  25,  26,  27,  28,  29,  30,  31,  32
+const multiL,               times  1 dw   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,  16
+const multiH2,              times  1 dw  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32
+const pw_planar16_mul,      times  1 dw  15,  14,  13,  12,  11,  10,   9,   8,   7,   6,   5,   4,   3,   2,   1,   0
+const pw_planar32_mul,      times  1 dw  31,  30,  29,  28,  27,  26,  25,  24,  23,  22,  21,  20,  19,  18,  17,  16
+const pw_FFFFFFFFFFFFFFF0,           dw 0x00
+                            times 7  dw 0xff
+
+
+;; 32-bit constants
+
+const pd_1,                 times  8 dd 1
+const pd_2,                 times  8 dd 2
+const pd_4,                 times  4 dd 4
+const pd_8,                 times  4 dd 8
+const pd_16,                times  4 dd 16
+const pd_32,                times  4 dd 32
+const pd_64,                times  4 dd 64
+const pd_128,               times  4 dd 128
+const pd_256,               times  4 dd 256
+const pd_512,               times  4 dd 512
+const pd_1024,              times  4 dd 1024
+const pd_2048,              times  4 dd 2048
+const pd_ffff,              times  4 dd 0xffff
+const pd_32767,             times  4 dd 32767
+const pd_n32768,            times  4 dd 0xffff8000
+
+const trans8_shuf,          times  1 dd   0,   4,   1,   5,   2,   6,   3,   7
+const deinterleave_shufd,   times  1 dd   0,   4,   1,   5,   2,   6,   3,   7
 
 const popcnt_table
 %assign x 0
x265_1.6.tar.gz/source/common/x86/dct8.asm -> x265_1.7.tar.gz/source/common/x86/dct8.asm Changed
 
@@ -261,6 +261,11 @@
                times 2 dw 84, -29, -74, 55
                times 2 dw 55, -84, 74, -29
 
+pw_dst4_tab:    times 4 dw 29,  55,  74,  84
+                times 4 dw 74,  74,   0, -74
+                times 4 dw 84, -29, -74,  55
+                times 4 dw 55, -84,  74, -29
+
 tab_idst4:      times 4 dw 29, +84
                times 4 dw +74, +55
                times 4 dw 55, -29
@@ -270,6 +275,16 @@
                times 4 dw 84, +55
                times 4 dw -74, -29
 
+pw_idst4_tab:   times 4 dw  29,  84
+                times 4 dw  55, -29
+                times 4 dw  74,  55
+                times 4 dw  74, -84
+                times 4 dw  74, -74
+                times 4 dw  84,  55
+                times 4 dw  0,   74
+                times 4 dw -74, -29
+pb_idst4_shuf:  times 2 db 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15
+
 tab_dct8_1:     times 2 dw 89, 50, 75, 18
                times 2 dw 75, -89, -18, -50
                times 2 dw 50, 18, -89, 75
@@ -316,7 +331,7 @@
 cextern pd_1024
 cextern pd_2048
 cextern pw_ppppmmmm
-
+cextern trans8_shuf
 ;------------------------------------------------------
 ;void dct4(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;------------------------------------------------------
@@ -656,6 +671,59 @@
 
     RET
 
+;------------------------------------------------------------------
+;void dst4(const int16_t* src, int16_t* dst, intptr_t srcStride)
+;------------------------------------------------------------------
+INIT_YMM avx2
+cglobal dst4, 3, 4, 6
+%if BIT_DEPTH == 8
+  %define       DST_SHIFT 1
+  vpbroadcastd  m5, [pd_1]
+%elif BIT_DEPTH == 10
+  %define       DST_SHIFT 3
+  vpbroadcastd  m5, [pd_4]
+%endif
+    mova        m4, [trans8_shuf]
+    add         r2d, r2d
+    lea         r3, [pw_dst4_tab]
+
+    movq        xm0, [r0 + 0 * r2]
+    movhps      xm0, [r0 + 1 * r2]
+    lea         r0, [r0 + 2 * r2]
+    movq        xm1, [r0]
+    movhps      xm1, [r0 + r2]
+
+    vinserti128 m0, m0, xm1, 1          ; m0 = src[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+
+    pmaddwd     m2, m0, [r3 + 0 * 32]
+    pmaddwd     m1, m0, [r3 + 1 * 32]
+    phaddd      m2, m1
+    paddd       m2, m5
+    psrad       m2, DST_SHIFT
+    pmaddwd     m3, m0, [r3 + 2 * 32]
+    pmaddwd     m1, m0, [r3 + 3 * 32]
+    phaddd      m3, m1
+    paddd       m3, m5
+    psrad       m3, DST_SHIFT
+    packssdw    m2, m3
+    vpermd      m2, m4, m2
+
+    vpbroadcastd m5, [pd_128]
+    pmaddwd     m0, m2, [r3 + 0 * 32]
+    pmaddwd     m1, m2, [r3 + 1 * 32]
+    phaddd      m0, m1
+    paddd       m0, m5
+    psrad       m0, 8
+    pmaddwd     m3, m2, [r3 + 2 * 32]
+    pmaddwd     m2, m2, [r3 + 3 * 32]
+    phaddd      m3, m2
+    paddd       m3, m5
+    psrad       m3, 8
+    packssdw    m0, m3
+    vpermd      m0, m4, m0
+    movu        [r1], m0
+    RET
+
 ;-------------------------------------------------------
 ;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
 ;-------------------------------------------------------
@@ -748,6 +816,81 @@
     movhps      [r1 + r2], m1
     RET
 
+;-----------------------------------------------------------------
+;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
+;-----------------------------------------------------------------
+INIT_YMM avx2
+cglobal idst4, 3, 4, 6
+%if BIT_DEPTH == 8
+  vpbroadcastd  m4, [pd_2048]
+  %define       IDCT4_SHIFT 12
+%elif BIT_DEPTH == 10
+  vpbroadcastd  m4, [pd_512]
+  %define       IDCT4_SHIFT 10
+%else
+  %error Unsupported BIT_DEPTH!
+%endif
+    add         r2d, r2d
+    lea         r3, [pw_idst4_tab]
+
+    movu        xm0, [r0 + 0 * 16]
+    movu        xm1, [r0 + 1 * 16]
+
+    punpcklwd   m2, m0, m1
+    punpckhwd   m0, m1
+
+    vinserti128 m2, m2, xm2, 1
+    vinserti128 m0, m0, xm0, 1
+
+    vpbroadcastd m5, [pd_64]
+    pmaddwd     m1, m2, [r3 + 0 * 32]
+    pmaddwd     m3, m0, [r3 + 1 * 32]
+    paddd       m1, m3
+    paddd       m1, m5
+    psrad       m1, 7
+    pmaddwd     m3, m2, [r3 + 2 * 32]
+    pmaddwd     m0, [r3 + 3 * 32]
+    paddd       m3, m0
+    paddd       m3, m5
+    psrad       m3, 7
+
+    packssdw    m0, m1, m3
+    pshufb      m0, [pb_idst4_shuf]
+    vpermq      m1, m0, 11101110b
+
+    punpcklwd   m2, m0, m1
+    punpckhwd   m0, m1
+    punpcklwd   m1, m2, m0
+    punpckhwd   m2, m0
+
+    vpermq      m1, m1, 01000100b
+    vpermq      m2, m2, 01000100b
+
+    pmaddwd     m0, m1, [r3 + 0 * 32]
+    pmaddwd     m3, m2, [r3 + 1 * 32]
+    paddd       m0, m3
+    paddd       m0, m4
+    psrad       m0, IDCT4_SHIFT
+    pmaddwd     m3, m1, [r3 + 2 * 32]
+    pmaddwd     m2, m2, [r3 + 3 * 32]
+    paddd       m3, m2
+    paddd       m3, m4
+    psrad       m3, IDCT4_SHIFT
+
+    packssdw    m0, m3
+    pshufb      m1, m0, [pb_idst4_shuf]
+    vpermq      m0, m1, 11101110b
+
+    punpcklwd   m2, m1, m0
+    movq        [r1 + 0 * r2], xm2
+    movhps      [r1 + 1 * r2], xm2
+
+    punpckhwd   m1, m0
+    movq        [r1 + 2 * r2], xm1
+    lea         r1, [r1 + 2 * r2]
+    movhps      [r1 + r2], xm1
+    RET
+
 ;-------------------------------------------------------
 ; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;-------------------------------------------------------
x265_1.6.tar.gz/source/common/x86/dct8.h -> x265_1.7.tar.gz/source/common/x86/dct8.h Changed
 
@@ -26,6 +26,7 @@
 void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void x265_dst4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
@@ -33,6 +34,7 @@
 void x265_dct32_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 
 void x265_idst4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
+void x265_idst4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct8_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
x265_1.6.tar.gz/source/common/x86/intrapred.h -> x265_1.7.tar.gz/source/common/x86/intrapred.h Changed
 
@@ -34,6 +34,7 @@
 void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
+void x265_intra_pred_dc32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 
 void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
@@ -43,6 +44,8 @@
 void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 
 #define DECL_ANG(bsize, mode, cpu) \
     void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -55,6 +58,16 @@
 DECL_ANG(4, 7, sse2);
 DECL_ANG(4, 8, sse2);
 DECL_ANG(4, 9, sse2);
+DECL_ANG(4, 10, sse2);
+DECL_ANG(4, 11, sse2);
+DECL_ANG(4, 12, sse2);
+DECL_ANG(4, 13, sse2);
+DECL_ANG(4, 14, sse2);
+DECL_ANG(4, 15, sse2);
+DECL_ANG(4, 16, sse2);
+DECL_ANG(4, 17, sse2);
+DECL_ANG(4, 18, sse2);
+DECL_ANG(4, 26, sse2);
 
 DECL_ANG(4, 2, ssse3);
 DECL_ANG(4, 3, sse4);
@@ -174,6 +187,34 @@
 DECL_ANG(32, 33, sse4);
 
 #undef DECL_ANG
+void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_17_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_19_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -192,6 +233,24 @@
 void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -212,8 +271,17 @@
 void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_32x32_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
+void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 #endif // ifndef X265_INTRAPRED_H
x265_1.6.tar.gz/source/common/x86/intrapred16.asm -> x265_1.7.tar.gz/source/common/x86/intrapred16.asm Changed
510
 
1
@@ -690,6 +690,508 @@
2
 %endrep
3
     RET
4
 
5
+;-----------------------------------------------------------------------------------------
6
+; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
7
+;-----------------------------------------------------------------------------------------
8
+INIT_XMM sse2
9
+cglobal intra_pred_ang4_2, 3,5,4
10
+    lea         r4,            [r2 + 4]
11
+    add         r2,            20
12
+    cmp         r3m,           byte 34
13
+    cmove       r2,            r4
14
+
15
+    add         r1,            r1
16
+    movu        m0,            [r2]
17
+    movh        [r0],          m0
18
+    psrldq      m0,            2
19
+    movh        [r0 + r1],     m0
20
+    psrldq      m0,            2
21
+    movh        [r0 + r1 * 2], m0
22
+    lea         r1,            [r1 * 3]
23
+    psrldq      m0,            2
24
+    movh        [r0 + r1],     m0
25
+    RET
26
+
27
+cglobal intra_pred_ang4_3, 3,5,8
28
+    mov         r4d, 2
29
+    cmp         r3m, byte 33
30
+    mov         r3d, 18
31
+    cmove       r3d, r4d
32
+
33
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
34
+
35
+    mova        m2, m0
36
+    psrldq      m0, 2
37
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
38
+    mova        m3, m0
39
+    psrldq      m0, 2
40
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
41
+    mova        m4, m0
42
+    psrldq      m0, 2
43
+    punpcklwd   m4, m0      ; [7 6 6 5 5 4 4 3]
44
+    mova        m5, m0
45
+    psrldq      m0, 2
46
+    punpcklwd   m5, m0      ; [8 7 7 6 6 5 5 4]
47
+
48
+
49
+    lea         r3, [ang_table + 20 * 16]
50
+    mova        m0, [r3 + 6 * 16]   ; [26]
51
+    mova        m1, [r3]            ; [20]
52
+    mova        m6, [r3 - 6 * 16]   ; [14]
53
+    mova        m7, [r3 - 12 * 16]  ; [ 8]
54
+    jmp        .do_filter4x4
55
+
56
+
57
+ALIGN 16
58
+.do_filter4x4:
59
+    lea     r4, [pd_16]
60
+    pmaddwd m2, m0
61
+    paddd   m2, [r4]
62
+    psrld   m2, 5
63
+
64
+    pmaddwd m3, m1
65
+    paddd   m3, [r4]
66
+    psrld   m3, 5
67
+    packssdw m2, m3
68
+
69
+    pmaddwd m4, m6
70
+    paddd   m4, [r4]
71
+    psrld   m4, 5
72
+
73
+    pmaddwd m5, m7
74
+    paddd   m5, [r4]
75
+    psrld   m5, 5
76
+    packssdw m4, m5
77
+
78
+    jz         .store
79
+
80
+    ; transpose 4x4
81
+    punpckhwd    m0, m2, m4
82
+    punpcklwd    m2, m4
83
+    punpckhwd    m4, m2, m0
84
+    punpcklwd    m2, m0
85
+
86
+.store:
87
+    add         r1, r1
88
+    movh        [r0], m2
89
+    movhps      [r0 + r1], m2
90
+    movh        [r0 + r1 * 2], m4
91
+    lea         r1, [r1 * 3]
92
+    movhps      [r0 + r1], m4
93
+    RET
94
+
95
+cglobal intra_pred_ang4_4, 3,5,8
96
+    mov         r4d, 2
97
+    cmp         r3m, byte 32
98
+    mov         r3d, 18
99
+    cmove       r3d, r4d
100
+
101
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
102
+    mova        m2, m0
103
+    psrldq      m0, 2
104
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
105
+    mova        m3, m0
106
+    psrldq      m0, 2
107
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
108
+    mova        m4, m3
109
+    mova        m5, m0
110
+    psrldq      m0, 2
111
+    punpcklwd   m5, m0      ; [7 6 6 5 5 4 4 3]
112
+
113
+    lea         r3, [ang_table + 18 * 16]
114
+    mova        m0, [r3 +  3 * 16]  ; [21]
115
+    mova        m1, [r3 -  8 * 16]  ; [10]
116
+    mova        m6, [r3 + 13 * 16]  ; [31]
117
+    mova        m7, [r3 +  2 * 16]  ; [20]
118
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
119
+
120
+cglobal intra_pred_ang4_5, 3,5,8
121
+    mov         r4d, 2
122
+    cmp         r3m, byte 31
123
+    mov         r3d, 18
124
+    cmove       r3d, r4d
125
+
126
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
127
+    mova        m2, m0
128
+    psrldq      m0, 2
129
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
130
+    mova        m3, m0
131
+    psrldq      m0, 2
132
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
133
+    mova        m4, m3
134
+    mova        m5, m0
135
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [7 6 6 5 5 4 4 3]
+
+    lea         r3, [ang_table + 10 * 16]
+    mova        m0, [r3 +  7 * 16]  ; [17]
+    mova        m1, [r3 -  8 * 16]  ; [ 2]
+    mova        m6, [r3 +  9 * 16]  ; [19]
+    mova        m7, [r3 -  6 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_6, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 30
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m0
+    psrldq      m0, 2
+    punpcklwd   m4, m0      ; [6 5 5 4 4 3 3 2]
+    mova        m5, m4
+
+    lea         r3, [ang_table + 19 * 16]
+    mova        m0, [r3 -  6 * 16]  ; [13]
+    mova        m1, [r3 +  7 * 16]  ; [26]
+    mova        m6, [r3 - 12 * 16]  ; [ 7]
+    mova        m7, [r3 +  1 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_7, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 29
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m0
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [6 5 5 4 4 3 3 2]
+
+    lea         r3, [ang_table + 20 * 16]
+    mova        m0, [r3 - 11 * 16]  ; [ 9]
+    mova        m1, [r3 -  2 * 16]  ; [18]
+    mova        m6, [r3 +  7 * 16]  ; [27]
+    mova        m7, [r3 - 16 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_8, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 28
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m2
+
+    lea         r3, [ang_table + 13 * 16]
+    mova        m0, [r3 -  8 * 16]  ; [ 5]
+    mova        m1, [r3 -  3 * 16]  ; [10]
+    mova        m6, [r3 +  2 * 16]  ; [15]
+    mova        m7, [r3 +  7 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_9, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 27
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m2
+
+    lea         r3, [ang_table + 4 * 16]
+    mova        m0, [r3 -  2 * 16]  ; [ 2]
+    mova        m1, [r3 -  0 * 16]  ; [ 4]
+    mova        m6, [r3 +  2 * 16]  ; [ 6]
+    mova        m7, [r3 +  4 * 16]  ; [ 8]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_10, 3,3,3
+    movh        m0,             [r2 + 18]           ; [4 3 2 1]
+
+    punpcklwd   m0,             m0              ;[4 4 3 3 2 2 1 1]
+    pshufd      m1,             m0, 0xFA
+    add         r1,             r1
+    pshufd      m0,             m0, 0x50
+    movhps      [r0 + r1],      m0
+    movh        [r0 + r1 * 2],  m1
+    lea         r1,             [r1 * 3]
+    movhps      [r0 + r1],      m1
+
+    cmp         r4m,            byte 0
+    jz         .quit
+
+    ; filter
+    movd        m2,             [r2]                ; [7 6 5 4 3 2 1 0]
+    pshuflw     m2,             m2, 0x00
+    movh        m1,             [r2 + 2]
+    psubw       m1,             m2
+    psraw       m1,             1
+    paddw       m0,             m1
+    pxor        m1,             m1
+    pmaxsw      m0,             m1
+    pminsw      m0,             [pw_1023]
+.quit:
+    movh        [r0],           m0
+    RET
+
+cglobal intra_pred_ang4_26, 3,3,3
+    movh        m0,             [r2 + 2]            ; [8 7 6 5 4 3 2 1]
+    add         r1d,            r1d
+    ; store
+    movh        [r0],           m0
+    movh        [r0 + r1],      m0
+    movh        [r0 + r1 * 2],  m0
+    lea         r3,             [r1 * 3]
+    movh        [r0 + r3],      m0
+
+    ; filter
+    cmp         r4m,            byte 0
+    jz         .quit
+
+    pshuflw     m0,             m0, 0x00
+    movd        m2,             [r2]
+    pshuflw     m2,             m2, 0x00
+    movh        m1,             [r2 + 18]
+    psubw       m1,             m2
+    psraw       m1,             1
+    paddw       m0,             m1
+    pxor        m1,             m1
+    pmaxsw      m0,             m1
+    pminsw      m0,             [pw_1023]
+
+    movh        r2,             m0
+    mov         [r0],           r2w
+    shr         r2,             16
+    mov         [r0 + r1],      r2w
+    shr         r2,             16
+    mov         [r0 + r1 * 2],  r2w
+    shr         r2,             16
+    mov         [r0 + r3],      r2w
+.quit:
+    RET
+
+cglobal intra_pred_ang4_11, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 25
+    mov         r3d, 16
+    cmove       r3d, r4d
+
+    movh        m1, [r2 + r3 + 2]   ; [x x x 4 3 2 1 0]
+    movh        m2, [r2 - 6]
+    punpcklqdq  m2, m1
+    psrldq      m2, 6
+    punpcklwd   m2, m1          ; [4 3 3 2 2 1 1 0]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m2
+
+    lea         r3, [ang_table + 24 * 16]
+    mova        m0, [r3 +  6 * 16]  ; [30]
+    mova        m1, [r3 +  4 * 16]  ; [28]
+    mova        m6, [r3 +  2 * 16]  ; [26]
+    mova        m7, [r3 +  0 * 16]  ; [24]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_12, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 24
+    mov         r3d, 16
+    cmove       r3d, r4d
+
+    movh        m1, [r2 + r3 + 2]
+    movh        m2, [r2 - 6]
+    punpcklqdq  m2, m1
+    psrldq      m2, 6
+    punpcklwd   m2, m1          ; [4 3 3 2 2 1 1 0]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m2
+
+    lea         r3, [ang_table + 20 * 16]
+    mova        m0, [r3 +  7 * 16]  ; [27]
+    mova        m1, [r3 +  2 * 16]  ; [22]
+    mova        m6, [r3 -  3 * 16]  ; [17]
+    mova        m7, [r3 -  8 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_13, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 23
+    mov         r3d, 16
+    jz          .next
+    xchg        r3d, r4d
+.next:
+    movd        m5, [r2 + r3 + 6]
+    movd        m2, [r2 - 2]
+    movh        m0, [r2 + r4 + 2]
+    punpcklwd   m5, m2
+    punpcklqdq  m5, m0
+    psrldq      m5, 4
+    mova        m2, m5
+    psrldq      m2, 2
+    punpcklwd   m5, m2          ; [3 2 2 1 1 0 0 x]
+    punpcklwd   m2, m0          ; [4 3 3 2 2 1 1 0]
+    mova        m3, m2
+    mova        m4, m2
+
+    lea         r3, [ang_table + 21 * 16]
+    mova        m0, [r3 +  2 * 16]  ; [23]
+    mova        m1, [r3 -  7 * 16]  ; [14]
+    mova        m6, [r3 - 16 * 16]  ; [ 5]
+    mova        m7, [r3 +  7 * 16]  ; [28]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_14, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 22
+    mov         r3d, 16
+    jz          .next
+    xchg        r3d, r4d
+.next:
+    movd        m5, [r2 + r3 + 2]
+    movd        m2, [r2 - 2]
+    movh        m0, [r2 + r4 + 2]
+    punpcklwd   m5, m2
+    punpcklqdq  m5, m0
+    psrldq      m5, 4
+    mova        m2, m5
+    psrldq      m2, 2
+    punpcklwd   m5, m2          ; [3 2 2 1 1 0 0 x]
+    punpcklwd   m2, m0          ; [4 3 3 2 2 1 1 0]
+    mova        m3, m2
+    mova        m4, m5
+
+    lea         r3, [ang_table + 19 * 16]
+    mova        m0, [r3 +  0 * 16]  ; [19]
+    mova        m1, [r3 - 13 * 16]  ; [ 6]
+    mova        m6, [r3 +  6 * 16]  ; [25]
+    mova        m7, [r3 -  7 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_15, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 21
+    mov         r3d, 16
+    jz          .next
+    xchg        r3d, r4d
+.next:
+    movd        m4, [r2]                ;[x x x A]
+    movh        m5, [r2 + r3 + 4]       ;[x C x B]
+    movh        m0, [r2 + r4 + 2]       ;[4 3 2 1]
+    pshuflw     m5, m5, 0x22            ;[B C B C]
+    punpcklqdq  m5, m4                  ;[x x x A B C B C]
+    psrldq      m5, 2                   ;[x x x x A B C B]
+    punpcklqdq  m5, m0
+    psrldq      m5, 2
+    mova        m2, m5
+    mova        m3, m5
+    psrldq      m2, 4
+    psrldq      m3, 2
+    punpcklwd   m5, m3          ; [2 1 1 0 0 x x y]
+    punpcklwd   m3, m2          ; [3 2 2 1 1 0 0 x]
+    punpcklwd   m2, m0          ; [4 3 3 2 2 1 1 0]
+    mova        m4, m3
+
+    lea         r3, [ang_table + 23 * 16]
+    mova        m0, [r3 -  8 * 16]  ; [15]
+    mova        m1, [r3 +  7 * 16]  ; [30]
+    mova        m6, [r3 - 10 * 16]  ; [13]
+    mova        m7, [r3 +  5 * 16]  ; [28]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_16, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 20
+    mov         r3d, 16
+    jz          .next
+    xchg        r3d, r4d
+.next:
+    movd        m4, [r2]                ;[x x x A]
+    movd        m5, [r2 + r3 + 4]       ;[x x C B]
+    movh        m0, [r2 + r4 + 2]       ;[4 3 2 1]
+    punpcklwd   m5, m4                  ;[x C A B]
+    pshuflw     m5, m5, 0x4A            ;[A B C C]
+    punpcklqdq  m5, m0                  ;[4 3 2 1 A B C C]
+    psrldq      m5, 2
+    mova        m2, m5
+    mova        m3, m5
+    psrldq      m2, 4
+    psrldq      m3, 2
+    punpcklwd   m5, m3          ; [2 1 1 0 0 x x y]
+    punpcklwd   m3, m2          ; [3 2 2 1 1 0 0 x]
+    punpcklwd   m2, m0          ; [4 3 3 2 2 1 1 0]
+    mova        m4, m3
+
+    lea         r3, [ang_table + 19 * 16]
+    mova        m0, [r3 -  8 * 16]  ; [11]
+    mova        m1, [r3 +  3 * 16]  ; [22]
+    mova        m6, [r3 - 18 * 16]  ; [ 1]
+    mova        m7, [r3 -  7 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_17, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 19
+    mov         r3d, 16
+    jz          .next
+    xchg        r3d, r4d
+.next:
+    movd        m4, [r2]
+    movh        m5, [r2 + r3 + 2]       ;[D x C B]
+    pshuflw     m5, m5, 0x1F            ;[B C D D]
+    punpcklqdq  m5, m4                  ;[x x x A B C D D]
+    psrldq      m5, 2                   ;[x x x x A B C D]
+    movhps      m5, [r2 + r4 + 2]
+
+    mova        m4, m5
+    psrldq      m4, 2
+    punpcklwd   m5, m4
+    mova        m3, m4
+    psrldq      m3, 2
+    punpcklwd   m4, m3
+    mova        m2, m3
+    psrldq      m2, 2
+    punpcklwd   m3, m2
+    mova        m1, m2
+    psrldq      m1, 2
+    punpcklwd   m2, m1
+
+    lea         r3, [ang_table + 14 * 16]
+    mova        m0, [r3 -  8 * 16]  ; [ 6]
+    mova        m1, [r3 -  2 * 16]  ; [12]
+    mova        m6, [r3 +  4 * 16]  ; [18]
+    mova        m7, [r3 + 10 * 16]  ; [24]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_18, 3,3,1
+    movh        m0, [r2 + 16]
+    pinsrw      m0, [r2], 0
+    pshuflw     m0, m0, q0123
+    movhps      m0, [r2 + 2]
+    add         r1, r1
+    lea         r2, [r1 * 3]
+    movh        [r0 + r2], m0
+    psrldq      m0, 2
+    movh        [r0 + r1 * 2], m0
+    psrldq      m0, 2
+    movh        [r0 + r1], m0
+    psrldq      m0, 2
+    movh        [r0], m0
+    RET
+
 ;-----------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
 ;-----------------------------------------------------------------------------------
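The ang4_* kernels above all preload four (32-frac, frac) weight pairs from ang_table and tail-jump into the shared .do_filter4x4 of intra_pred_ang4_3. What that tail computes per pixel is the standard HEVC 2-tap angular interpolation, clamped to the 10-bit range (the pminsw against pw_1023). A scalar sketch of that arithmetic, for reference only — the function name and list layout here are illustrative, not x265 API:

```python
# Scalar model of one row of the SIMD angular filter: each output pixel
# blends two adjacent reference samples with weights (32 - frac) and frac,
# the same byte pairs stored in the c_ang* tables (each pair sums to 32),
# then rounds (+16, >>5) and clamps like pmaxsw/pminsw [pw_1023].
def angular_filter(ref, frac, n=4, max_pixel=1023):
    out = []
    for i in range(n):
        v = ((32 - frac) * ref[i] + frac * ref[i + 1] + 16) >> 5
        out.append(min(max(v, 0), max_pixel))  # clamp to [0, 1023] for 10-bit
    return out
```

With frac = 0 the weights degenerate to (32, 0) and the reference samples pass through unchanged, which is why the tables end rows with the pair 32, 0.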
x265_1.6.tar.gz/source/common/x86/intrapred8.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8.asm Changed
 
@@ -28,6 +28,7 @@
 SECTION_RODATA 32
 
 intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+intra_pred_shuff_15_0:   times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 
 pb_0_8        times 8 db  0,  8
 pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
@@ -58,7 +59,6 @@
 c_mode16_18:    db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
 
 ALIGN 32
-trans8_shuf:          dd 0, 4, 1, 5, 2, 6, 3, 7
 c_ang8_src1_9_2_10:   db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
 c_ang8_26_20:         db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 c_ang8_src3_11_4_12:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
@@ -124,6 +124,37 @@
                       db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
+ALIGN 32
+c_ang16_mode_11:      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang16_mode_12:      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db  8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_13:      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                      db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
@@ -135,6 +166,15 @@
                       db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
                       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_9:       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
 ALIGN 32
 c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
@@ -150,6 +190,15 @@
 ALIGN 32
 intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
 
+ALIGN 32
+c_ang16_mode_8:       db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
@@ -162,6 +211,15 @@
                      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_7:      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 ALIGN 32
102
 c_ang16_mode_30:      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
103
@@ -175,6 +233,17 @@
104
                       db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
105
 
106
 
107
+
108
+ALIGN 32
109
+c_ang16_mode_6:       db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
110
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
111
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
112
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
113
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
114
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
115
+                      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
116
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
117
+
118
 ALIGN 32
119
 c_ang16_mode_31:      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
120
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
121
@@ -186,6 +255,17 @@
122
                       db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
123
                       db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
124
 
125
+
126
+ALIGN 32
127
+c_ang16_mode_5:       db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
128
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
129
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
130
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
131
+                      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
132
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
133
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
134
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
135
+
136
 ALIGN 32
137
 c_ang16_mode_32:      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
138
                       db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
139
@@ -200,6 +280,16 @@
140
                       db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
141
 
142
 ALIGN 32
143
+c_ang16_mode_4:       db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
144
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
145
+                      db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
146
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
147
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
148
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
149
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
150
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
151
+
152
+ALIGN 32
153
 c_ang16_mode_33:     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
154
                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
155
                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
156
@@ -216,6 +306,16 @@
157
                      db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
158
 
159
 ALIGN 32
160
+c_ang16_mode_3:      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
161
+                     db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
162
+                     db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
163
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
164
+                     db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
165
+                     db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
166
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
167
+                     db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
168
+
169
+ALIGN 32
170
 c_ang16_mode_24:     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
171
                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
172
                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
173
@@ -376,6 +476,191 @@
174
                    db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
175
                    db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
176
 
177
+
178
+ALIGN 32
179
+c_ang32_mode_33:   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
180
+                   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
181
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
182
+                   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
183
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
184
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
185
+                   db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
186
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
187
+                   db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
188
+                   db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
189
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
190
+                   db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
191
+                   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
192
+                   db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
193
+                   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
194
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
195
+                   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
196
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                   db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                   db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                   db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                   db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                   db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+
+ALIGN 32
+c_ang32_mode_25:   db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+                   db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+
+ALIGN 32
+c_ang32_mode_24:   db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                   db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                   db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                   db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                   db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                   db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                   db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                   db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                   db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                   db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                   db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+                   db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang32_mode_23:  db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                  db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+                  db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                  db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                  db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                  db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                  db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                  db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                  db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                  db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                  db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                  db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                  db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                  db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                  db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                  db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                  db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                  db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang32_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                 db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                 db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                 db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+                 db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                 db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                 db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                 db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                 db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                 db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                 db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+                 db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                 db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                 db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                 db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                 db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                 db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                 db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+c_ang32_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                 db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                 db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                 db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+                 db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                 db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+                 db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                 db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                 db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                 db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                 db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                 db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                 db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                 db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                 db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                 db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                 db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+intra_pred_shuff_0_4:    times 4 db 0, 1, 1, 2, 2, 3, 3, 4
+intra_pred4_shuff1:      db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5
+intra_pred4_shuff2:      db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+intra_pred4_shuff31:     db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+intra_pred4_shuff33:     db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+intra_pred4_shuff3:      db 8, 9, 9, 10, 10, 11, 11, 12, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff4:      db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff5:      db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff6:      db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14
+intra_pred4_shuff7:      db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14
+intra_pred4_shuff9:      db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13
+intra_pred4_shuff12:     db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12,0, 9, 9, 10, 10, 11, 11, 12
+intra_pred4_shuff13:     db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11
+intra_pred4_shuff14:     db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11
+intra_pred4_shuff15:     db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10
+intra_pred4_shuff16:     db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10
+intra_pred4_shuff17:     db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9
+intra_pred4_shuff19:     db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1
+intra_pred4_shuff20:     db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2
+intra_pred4_shuff21:     db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2
+intra_pred4_shuff22:     db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3
+intra_pred4_shuff23:     db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3
+
+c_ang4_mode_27:          db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_28:          db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_29:          db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_30:          db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_31:          db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_32:          db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_33:          db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_5:           db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_6:           db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_7:           db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_8:           db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_9:           db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_11:          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_12:          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_13:          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_14:          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_15:          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_16:          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_17:          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_19:          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_20:          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_21:          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_22:          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_23:          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_24:          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_25:          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+
 ALIGN 32
 ;; (blkSize - 1 - x)
 pw_planar4_0:         dw 3,  2,  1,  0,  3,  2,  1,  0
@@ -388,6 +673,29 @@
 pw_planar32_L:        dw 31, 30, 29, 28, 27, 26, 25, 24
 pw_planar32_H:        dw 23, 22, 21, 20, 19, 18, 17, 16
 
+ALIGN 32
+c_ang8_mode_13:       db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_14:       db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_15:       db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_20:       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
 
 const ang_table
 %assign x 0
@@ -409,8 +717,11 @@
 cextern pw_4
 cextern pw_8
 cextern pw_16
+cextern pw_15
+cextern pw_31
 cextern pw_32
 cextern pw_257
+cextern pw_512
 cextern pw_1024
 cextern pw_4096
 cextern pw_00ff
@@ -420,6 +731,9 @@
 cextern multiH2
 cextern multiH3
 cextern multi_2Row
+cextern trans8_shuf
+cextern pw_planar16_mul
+cextern pw_planar32_mul
 
 ;---------------------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
@@ -1249,7 +1563,7 @@
 ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
 ;-----------------------------------------------------------------------------------------
 INIT_XMM sse2
-cglobal intra_pred_ang4_2, 3,5,3
+cglobal intra_pred_ang4_2, 3,5,1
     lea         r4, [r2 + 2]
     add         r2, 10
     cmp         r3m, byte 34
@@ -1257,23 +1571,21 @@
 
     movh        m0, [r2]
     movd        [r0], m0
-    mova        m1, m0
-    psrldq      m1, 1
-    movd        [r0 + r1], m1
-    mova        m2, m0
-    psrldq      m2, 2
-    movd        [r0 + r1 * 2], m2
+    psrldq      m0, 1
+    movd        [r0 + r1], m0
+    psrldq      m0, 1
+    movd        [r0 + r1 * 2], m0
     lea         r1, [r1 * 3]
-    psrldq      m0, 3
+    psrldq      m0, 1
     movd        [r0 + r1], m0
     RET
 
 INIT_XMM sse2
 cglobal intra_pred_ang4_3, 3,5,8
-    mov         r4, 1
+    mov         r4d, 1
     cmp         r3m, byte 33
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
     mova        m1, m0
@@ -1299,7 +1611,6 @@
 ALIGN 16
 .do_filter4x4:
     pxor        m1, m1
-    pxor        m3, m3
     punpckhbw   m3, m0
     psrlw       m3, 8
     pmaddwd     m3, m5
@@ -1308,7 +1619,6 @@
     packssdw    m0, m3
     paddw       m0, [pw_16]
     psraw       m0, 5
-    pxor        m3, m3
     punpckhbw   m3, m2
     psrlw       m3, 8
     pmaddwd     m3, m7
@@ -1335,32 +1645,31 @@
 .store:
     packuswb    m0, m2
     movd        [r0], m0
-    pshufd      m0, m0, 0x39
+    psrldq      m0, 4
     movd        [r0 + r1], m0
-    pshufd      m0, m0, 0x39
+    psrldq      m0, 4
     movd        [r0 + r1 * 2], m0
     lea         r1, [r1 * 3]
-    pshufd      m0, m0, 0x39
+    psrldq      m0, 4
     movd        [r0 + r1], m0
     RET
 
 cglobal intra_pred_ang4_4, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
     cmp         r3m, byte 32
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
+    punpcklbw   m0, m0
+    psrldq      m0, 1
+    mova        m2, m0
+    psrldq      m2, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
     mova        m1, m0
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
-    mova        m1, m0
-    psrldq      m1, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
-    mova        m3, m0
-    psrldq      m3, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
-    punpcklqdq  m0, m1
-    punpcklqdq  m2, m1, m3
+    psrldq      m1, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
+    punpcklqdq  m0, m2
+    punpcklqdq  m2, m1
 
     lea         r3, [pw_ang_table + 18 * 16]
     mova        m4, [r3 +  3 * 16]  ; [21]
@@ -1370,22 +1679,21 @@
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
 
 cglobal intra_pred_ang4_5, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
     cmp         r3m, byte 31
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
-    mova        m1, m0
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
-    mova        m1, m0
-    psrldq      m1, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    punpcklbw   m0, m0          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    psrldq      m0, 1
+    mova        m2, m0
+    psrldq      m2, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
     mova        m3, m0
     psrldq      m3, 4           ; [x x x x x x x x 7 6 6 5 5 4 4 3]
-    punpcklqdq  m0, m1
-    punpcklqdq  m2, m1, m3
+    punpcklqdq  m0, m2
+    punpcklqdq  m2, m3
 
     lea         r3, [pw_ang_table + 10 * 16]
     mova        m4, [r3 +  7 * 16]  ; [17]
@@ -1395,18 +1703,17 @@
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
 
 cglobal intra_pred_ang4_6, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
     cmp         r3m, byte 30
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
-    mova        m1, m0
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    punpcklbw   m0, m0
+    psrldq      m0, 1           ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
     mova        m2, m0
-    psrldq      m2, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+    psrldq      m2, 2           ; [x x x 8 8 7 7 6 6 5 5 4 4 3 3 2]
     punpcklqdq  m0, m0
     punpcklqdq  m2, m2
 
@@ -1418,20 +1725,20 @@
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
 
 cglobal intra_pred_ang4_7, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
     cmp         r3m, byte 29
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
-    mova        m1, m0
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
-    mova        m3, m0
-    psrldq      m3, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
-    punpcklqdq  m2, m0, m3
+    punpcklbw   m0, m0
+    psrldq      m0, 1
+    mova        m2, m0
+    psrldq      m2, 2           ; [x x x x x x x x 6 5 5 4 4 3 3 2]
     punpcklqdq  m0, m0
+    punpcklqdq  m2, m2
+    movhlps     m2, m0
 
     lea         r3, [pw_ang_table + 20 * 16]
     mova        m4, [r3 - 11 * 16]  ; [ 9]
@@ -1441,16 +1748,15 @@
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
 
 cglobal intra_pred_ang4_8, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
     cmp         r3m, byte 28
-    mov         r3, 9
-    cmove       r3, r4
+    mov         r3d, 9
+    cmove       r3d, r4d
 
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
-    mova        m1, m0
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+    punpcklbw   m0, m0
+    psrldq      m0, 1
     punpcklqdq  m0, m0
     mova        m2, m0
 
@@ -1462,16 +1768,15 @@
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
 
 cglobal intra_pred_ang4_9, 3,5,8
-    xor         r4, r4
-    inc         r4
+    xor         r4d, r4d
+    inc         r4d
642
     cmp         r3m, byte 27
643
-    mov         r3, 9
644
-    cmove       r3, r4
645
+    mov         r3d, 9
646
+    cmove       r3d, r4d
647
 
648
     movh        m0, [r2 + r3]    ; [8 7 6 5 4 3 2 1]
649
-    mova        m1, m0
650
-    psrldq      m1, 1           ; [x 8 7 6 5 4 3 2]
651
-    punpcklbw   m0, m1          ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
652
+    punpcklbw   m0, m0
653
+    psrldq      m0, 1           ; [x 8 7 6 5 4 3 2]
654
     punpcklqdq  m0, m0
655
     mova        m2, m0
656
 
657
@@ -1482,6 +1787,292 @@
658
     mova        m7, [r3 +  4 * 16]  ; [ 8]
659
     jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
660
 
661
+cglobal intra_pred_ang4_10, 3,5,4
+    movd        m0, [r2 + 9]            ; [8 7 6 5 4 3 2 1]
+    punpcklbw   m0, m0
+    punpcklwd   m0, m0
+    pshufd      m1, m0, 1
+    movhlps     m2, m0
+    pshufd      m3, m0, 3
+    movd        [r0 + r1], m1
+    movd        [r0 + r1 * 2], m2
+    lea         r1, [r1 * 3]
+    movd        [r0 + r1], m3
+    cmp         r4m, byte 0
+    jz          .quit
+
+    ; filter
+    pxor        m3, m3
+    punpcklbw   m0, m3
+    movh        m1, [r2]                ; [4 3 2 1 0]
+    punpcklbw   m1, m3
+    pshuflw     m2, m1, 0x00
+    psrldq      m1, 2
+    psubw       m1, m2
+    psraw       m1, 1
+    paddw       m0, m1
+    packuswb    m0, m0
+
+.quit:
+    movd        [r0], m0
+    RET
+
+cglobal intra_pred_ang4_26, 3,4,4
+    movd        m0, [r2 + 1]            ; [8 7 6 5 4 3 2 1]
+
+    ; store
+    movd        [r0], m0
+    movd        [r0 + r1], m0
+    movd        [r0 + r1 * 2], m0
+    lea         r3, [r1 * 3]
+    movd        [r0 + r3], m0
+
+    ; filter
+    cmp         r4m, byte 0
+    jz         .quit
+
+    pxor        m3, m3
+    punpcklbw   m0, m3
+    pshuflw     m0, m0, 0x00
+    movd        m2, [r2]
+    punpcklbw   m2, m3
+    pshuflw     m2, m2, 0x00
+    movd        m1, [r2 + 9]
+    punpcklbw   m1, m3
+    psubw       m1, m2
+    psraw       m1, 1
+    paddw       m0, m1
+    packuswb    m0, m0
+
+    movd        r2, m0
+    mov         [r0], r2b
+    shr         r2, 8
+    mov         [r0 + r1], r2b
+    shr         r2, 8
+    mov         [r0 + r1 * 2], r2b
+    shr         r2, 8
+    mov         [r0 + r3], r2b
+
+.quit:
+    RET
+
+cglobal intra_pred_ang4_11, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 25
+    mov         r3d, 8
+    cmove       r3d, r4d
+
+    movd        m1, [r2 + r3 + 1]       ;[4 3 2 1]
+    movh        m0, [r2 - 7]            ;[A x x x x x x x]
+    punpcklbw   m1, m1                  ;[4 4 3 3 2 2 1 1]
+    punpcklqdq  m0, m1                  ;[4 4 3 3 2 2 1 1 A x x x x x x x]]
+    psrldq      m0, 7                   ;[x x x x x x x x 4 3 3 2 2 1 1 A]
+    punpcklqdq  m0, m0
+    mova        m2, m0
+
+    lea         r3, [pw_ang_table + 24 * 16]
+
+    mova        m4, [r3 +  6 * 16]  ; [24]
+    mova        m5, [r3 +  4 * 16]  ; [26]
+    mova        m6, [r3 +  2 * 16]  ; [28]
+    mova        m7, [r3 +  0 * 16]  ; [30]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_12, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 24
+    mov         r3d, 8
+    cmove       r3d, r4d
+
+    movd        m1, [r2 + r3 + 1]
+    movh        m0, [r2 - 7]
+    punpcklbw   m1, m1
+    punpcklqdq  m0, m1
+    psrldq      m0, 7
+    punpcklqdq  m0, m0
+    mova        m2, m0
+
+    lea         r3, [pw_ang_table + 20 * 16]
+    mova        m4, [r3 +  7 * 16]  ; [27]
+    mova        m5, [r3 +  2 * 16]  ; [22]
+    mova        m6, [r3 -  3 * 16]  ; [17]
+    mova        m7, [r3 -  8 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_13, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 23
+    mov         r3d, 8
+    jz          .next
+    xchg        r3d, r4d
+
+.next:
+    movd        m1, [r2 - 1]            ;[x x A x]
+    movd        m2, [r2 + r4 + 1]       ;[4 3 2 1]
+    movd        m0, [r2 + r3 + 3]       ;[x x B x]
+    punpcklbw   m0, m1                  ;[x x x x A B x x]
+    punpckldq   m0, m2                  ;[4 3 2 1 A B x x]
+    psrldq      m0, 2                   ;[x x 4 3 2 1 A B]
+    punpcklbw   m0, m0                  ;[x x x x 4 4 3 3 2 2 1 1 A A B B]
+    mova        m1, m0
+    psrldq      m0, 3                   ;[x x x x x x x 4 4 3 3 2 2 1 1 A]
+    psrldq      m1, 1                   ;[x x x x x 4 4 3 3 2 2 1 1 A A B]
+    movh        m2, m0
+    punpcklqdq  m0, m0
+    punpcklqdq  m2, m1
+
+    lea         r3, [pw_ang_table + 21 * 16]
+    mova        m4, [r3 +  2 * 16]  ; [23]
+    mova        m5, [r3 -  7 * 16]  ; [14]
+    mova        m6, [r3 - 16 * 16]  ; [ 5]
+    mova        m7, [r3 +  7 * 16]  ; [28]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_14, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 22
+    mov         r3d, 8
+    jz          .next
+    xchg        r3d, r4d
+
+.next:
+    movd        m1, [r2 - 1]            ;[x x A x]
+    movd        m0, [r2 + r3 + 1]       ;[x x B x]
+    punpcklbw   m0, m1                  ;[A B x x]
+    movd        m1, [r2 + r4 + 1]       ;[4 3 2 1]
+    punpckldq   m0, m1                  ;[4 3 2 1 A B x x]
+    psrldq      m0, 2                   ;[x x 4 3 2 1 A B]
+    punpcklbw   m0, m0                  ;[x x x x 4 4 3 3 2 2 1 1 A A B B]
+    mova        m2, m0
+    psrldq      m0, 3                   ;[x x x x x x x 4 4 3 3 2 2 1 1 A]
+    psrldq      m2, 1                   ;[x x x x x 4 4 3 3 2 2 1 1 A A B]
+    punpcklqdq  m0, m0
+    punpcklqdq  m2, m2
+
+    lea         r3, [pw_ang_table + 19 * 16]
+    mova        m4, [r3 +  0 * 16]  ; [19]
+    mova        m5, [r3 - 13 * 16]  ; [ 6]
+    mova        m6, [r3 +  6 * 16]  ; [25]
+    mova        m7, [r3 -  7 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_15, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 21
+    mov         r3d, 8
+    jz          .next
+    xchg        r3d, r4d
+
+.next:
+    movd        m0, [r2]                ;[x x x A]
+    movd        m1, [r2 + r3 + 2]       ;[x x x B]
+    punpcklbw   m1, m0                  ;[x x A B]
+    movd        m0, [r2 + r3 + 3]       ;[x x C x]
+    punpcklwd   m0, m1                  ;[A B C x]
+    movd        m1, [r2 + r4 + 1]       ;[4 3 2 1]
+    punpckldq   m0, m1                  ;[4 3 2 1 A B C x]
+    psrldq      m0, 1                   ;[x 4 3 2 1 A B C]
+    punpcklbw   m0, m0                  ;[x x 4 4 3 3 2 2 1 1 A A B B C C]
+    psrldq      m0, 1
+    movh        m1, m0
+    psrldq      m0, 2
+    movh        m2, m0
+    psrldq      m0, 2
+    punpcklqdq  m0, m2
+    punpcklqdq  m2, m1
+
+    lea         r3, [pw_ang_table + 23 * 16]
+    mova        m4, [r3 -  8 * 16]  ; [15]
+    mova        m5, [r3 +  7 * 16]  ; [30]
+    mova        m6, [r3 - 10 * 16]  ; [13]
+    mova        m7, [r3 +  5 * 16]  ; [28]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_16, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 20
+    mov         r3d, 8
+    jz          .next
+    xchg        r3d, r4d
+
+.next:
+    movd        m2, [r2]                ;[x x x A]
+    movd        m1, [r2 + r3 + 2]       ;[x x x B]
+    punpcklbw   m1, m2                  ;[x x A B]
+    movh        m0, [r2 + r3 + 2]       ;[x x C x]
+    punpcklwd   m0, m1                  ;[A B C x]
+    movd        m1, [r2 + r4 + 1]       ;[4 3 2 1]
+    punpckldq   m0, m1                  ;[4 3 2 1 A B C x]
+    psrldq      m0, 1                   ;[x 4 3 2 1 A B C]
+    punpcklbw   m0, m0                  ;[x x 4 4 3 3 2 2 1 1 A A B B C C]
+    psrldq      m0, 1
+    movh        m1, m0
+    psrldq      m0, 2
+    movh        m2, m0
+    psrldq      m0, 2
+    punpcklqdq  m0, m2
+    punpcklqdq  m2, m1
+
+    lea         r3, [pw_ang_table + 19 * 16]
+    mova        m4, [r3 -  8 * 16]  ; [11]
+    mova        m5, [r3 +  3 * 16]  ; [22]
+    mova        m6, [r3 - 18 * 16]  ; [ 1]
+    mova        m7, [r3 -  7 * 16]  ; [12]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_17, 3,5,8
+    xor         r4d, r4d
+    cmp         r3m, byte 19
+    mov         r3d, 8
+    jz          .next
+    xchg        r3d, r4d
+
+.next:
+    movd        m2, [r2]                ;[x x x A]
+    movd        m3, [r2 + r3 + 1]       ;[x x x B]
+    movd        m4, [r2 + r3 + 2]       ;[x x x C]
+    movd        m0, [r2 + r3 + 4]       ;[x x x D]
+    punpcklbw   m3, m2                  ;[x x A B]
+    punpcklbw   m0, m4                  ;[x x C D]
+    punpcklwd   m0, m3                  ;[A B C D]
+    movd        m1, [r2 + r4 + 1]       ;[4 3 2 1]
+    punpckldq   m0, m1                  ;[4 3 2 1 A B C D]
+    punpcklbw   m0, m0                  ;[4 4 3 3 2 2 1 1 A A B B C C D D]
+    psrldq      m0, 1
+    movh        m1, m0
+    psrldq      m0, 2
+    movh        m2, m0
+    punpcklqdq  m2, m1
+    psrldq      m0, 2
+    movh        m1, m0
+    psrldq      m0, 2
+    punpcklqdq  m0, m1
+
+    lea         r3, [pw_ang_table + 14 * 16]
+    mova        m4, [r3 -  8 * 16]  ; [ 6]
+    mova        m5, [r3 -  2 * 16]  ; [12]
+    mova        m6, [r3 +  4 * 16]  ; [18]
+    mova        m7, [r3 + 10 * 16]  ; [24]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_18, 3,4,2
+    mov         r3d, [r2 + 8]
+    mov         r3b, byte [r2]
+    bswap       r3d
+    movd        m0, r3d
+
+    movd        m1, [r2 + 1]
+    punpckldq   m0, m1
+    lea         r3, [r1 * 3]
+    movd        [r0 + r3], m0
+    psrldq      m0, 1
+    movd        [r0 + r1 * 2], m0
+    psrldq      m0, 1
+    movd        [r0 + r1], m0
+    psrldq      m0, 1
+    movd        [r0], m0
+    RET
+
 ;---------------------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
 ;---------------------------------------------------------------------------------------------
@@ -1809,6 +2400,69 @@
 
     RET
 
+;---------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
+;---------------------------------------------------------------------------------------------
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal intra_pred_dc32, 3, 4, 3
+    lea             r3, [r1 * 3]
+    pxor            m0, m0
+    movu            m1, [r2 + 1]
+    movu            m2, [r2 + 65]
+    psadbw          m1, m0
+    psadbw          m2, m0
+    paddw           m1, m2
+    vextracti128    xm2, m1, 1
+    paddw           m1, m2
+    pshufd          m2, m1, 2
+    paddw           m1, m2
+
+    pmulhrsw        m1, [pw_512]    ; sum = (sum + 32) / 64
+    vpbroadcastb    m1, xm1         ; m1 = byte [dc_val ...]
+
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    lea             r0, [r0 + 4 * r1]
+    movu            [r0 + r1 * 0], m1
+    movu            [r0 + r1 * 1], m1
+    movu            [r0 + r1 * 2], m1
+    movu            [r0 + r3 * 1], m1
+    RET
+%endif ;; ARCH_X86_64 == 1
+
 ;---------------------------------------------------------------------------------------
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
 ;---------------------------------------------------------------------------------------
@@ -2000,6 +2654,57 @@
 ;---------------------------------------------------------------------------------------
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
 ;---------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_planar16, 3,3,6
+    vpbroadcastw    m3, [r2 + 17]
+    mova            m5, [pw_00ff]
+    vpbroadcastw    m4, [r2 + 49]
+    mova            m0, [pw_planar16_mul]
+    pmovzxbw        m2, [r2 + 1]
+    pand            m3, m5                      ; v_topRight
+    pand            m4, m5                      ; v_bottomLeft
+
+    pmullw          m3, [multiL]                ; (x + 1) * topRight
+    pmullw          m1, m2, [pw_15]             ; (blkSize - 1 - y) * above[x]
+    paddw           m3, [pw_16]
+    paddw           m3, m4
+    paddw           m3, m1
+    psubw           m4, m2
+    add             r2, 33
+
+%macro INTRA_PRED_PLANAR16_AVX2 1
+    vpbroadcastw    m1, [r2 + %1]
+    vpsrlw          m2, m1, 8
+    pand            m1, m5
+
+    pmullw          m1, m0
+    pmullw          m2, m0
+    paddw           m1, m3
+    paddw           m3, m4
+    psraw           m1, 5
+    paddw           m2, m3
+    psraw           m2, 5
+    paddw           m3, m4
+    packuswb        m1, m2
+    vpermq          m1, m1, 11011000b
+    movu            [r0], xm1
+    vextracti128    [r0 + r1], m1, 1
+    lea             r0, [r0 + r1 * 2]
+%endmacro
+    INTRA_PRED_PLANAR16_AVX2 0
+    INTRA_PRED_PLANAR16_AVX2 2
+    INTRA_PRED_PLANAR16_AVX2 4
+    INTRA_PRED_PLANAR16_AVX2 6
+    INTRA_PRED_PLANAR16_AVX2 8
+    INTRA_PRED_PLANAR16_AVX2 10
+    INTRA_PRED_PLANAR16_AVX2 12
+    INTRA_PRED_PLANAR16_AVX2 14
+%undef INTRA_PRED_PLANAR16_AVX2
+    RET
+
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
 INIT_XMM sse4
 %if ARCH_X86_64 == 1
 cglobal intra_pred_planar32, 3,4,12
@@ -2104,6 +2809,91 @@
     jnz             .loop
     RET
 
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal intra_pred_planar32, 3,4,11
+    mova            m6, [pw_00ff]
+    vpbroadcastw    m3, [r2 + 33]               ; topRight   = above[32]
+    vpbroadcastw    m2, [r2 + 97]               ; bottomLeft = left[32]
+    pand            m3, m6
+    pand            m2, m6
+
+    pmullw          m0, m3, [multiL]            ; (x + 1) * topRight
+    pmullw          m3, [multiH2]               ; (x + 1) * topRight
+
+    paddw           m0, m2
+    paddw           m3, m2
+    paddw           m0, [pw_32]
+    paddw           m3, [pw_32]
+
+    pmovzxbw        m4, [r2 + 1]
+    pmovzxbw        m1, [r2 + 17]
+    pmullw          m5, m4, [pw_31]
+    paddw           m0, m5
+    psubw           m5, m2, m4
+    psubw           m2, m1
+    pmullw          m1, [pw_31]
+    paddw           m3, m1
+    mova            m1, m5
+
+    add             r2, 65                      ; (2 * blkSize + 1)
+    mova            m9, [pw_planar32_mul]
+    mova            m10, [pw_planar16_mul]
+
+%macro INTRA_PRED_PLANAR32_AVX2 0
+    vpbroadcastw    m4, [r2]
+    vpsrlw          m7, m4, 8
+    pand            m4, m6
+
+    pmullw          m5, m4, m9
+    pmullw          m4, m4, m10
+    paddw           m5, m0
+    paddw           m4, m3
+    paddw           m0, m1
+    paddw           m3, m2
+    psraw           m5, 6
+    psraw           m4, 6
+    packuswb        m5, m4
+    pmullw          m8, m7, m9
+    pmullw          m7, m7, m10
+    vpermq          m5, m5, 11011000b
+    paddw           m8, m0
+    paddw           m7, m3
+    paddw           m0, m1
+    paddw           m3, m2
+    psraw           m8, 6
+    psraw           m7, 6
+    packuswb        m8, m7
+    add             r2, 2
+    vpermq          m8, m8, 11011000b
+
+    movu            [r0], m5
+    movu            [r0 + r1], m8
+    lea             r0, [r0 + r1 * 2]
+%endmacro
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+    INTRA_PRED_PLANAR32_AVX2
+%undef INTRA_PRED_PLANAR32_AVX2
+    RET
+%endif ;; ARCH_X86_64 == 1
+
 ;-----------------------------------------------------------------------------------------
 ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
 ;-----------------------------------------------------------------------------------------
@@ -9577,6 +10367,99 @@
 
     RET
 
+INIT_YMM avx2
+cglobal intra_pred_ang32_18, 4, 4, 3
+    movu           m0, [r2]
+    movu           xm1, [r2 + 1 + 64]
+    pshufb         xm1, [intra_pred_shuff_15_0]
+    mova           xm2, xm0
+    vinserti128    m1, m1, xm2, 1
+
+    lea            r3, [r1 * 3]
+
+    movu           [r0], m0
+    palignr        m2, m0, m1, 15
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 14
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 13
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 12
+    movu           [r0], m2
+    palignr        m2, m0, m1, 11
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 10
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 9
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 8
+    movu           [r0], m2
+    palignr        m2, m0, m1, 7
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 6
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 5
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 4
+    movu           [r0], m2
+    palignr        m2, m0, m1, 3
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 2
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 1
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    movu           [r0], m1
+
+    movu           xm0, [r2 + 64 + 17]
+    pshufb         xm0, [intra_pred_shuff_15_0]
+    vinserti128    m0, m0, xm1, 1
+
+    palignr        m2, m1, m0, 15
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0, 14
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 13
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 12
+    movu           [r0], m2
+    palignr        m2, m1, m0, 11
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0, 10
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 9
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 8
+    movu           [r0], m2
+    palignr        m2, m1, m0, 7
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0,6
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 5
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 4
+    movu           [r0], m2
+    palignr        m2, m1, m0, 3
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0,2
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 1
+    movu           [r0 + r3], m2
+    RET
+
 INIT_XMM sse4
 cglobal intra_pred_ang32_18, 4,5,5
     movu        m0, [r2]               ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
@@ -11099,6 +11982,441 @@
     movhps            [r0 + r3], xm2
     RET
 
+INIT_YMM avx2
+cglobal intra_pred_ang8_15, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2 + 16]
+    pinsrb            xm5, [r2], 0
+    lea               r5, [intra_pred_shuff_0_8]
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_15]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 4], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 6], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 8], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    vperm2i128        m2, m1, m4, 00100000b
+    vperm2i128        m1, m1, m4, 00110001b
+    punpcklbw         m4, m2, m1
+    punpckhbw         m2, m1
+    punpcklwd         m1, m4, m2
+    punpckhwd         m4, m2
+    mova              m0, [trans8_shuf]
+    vpermd            m1, m0, m1
+    vpermd            m4, m0, m4
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    movhps            [r0 + r1], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    movhps            [r0 + r1], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_16, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2 + 16]
+    pinsrb            xm5, [r2], 0
+    lea               r5, [intra_pred_shuff_0_8]
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_20]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 3], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 6], 0
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 8], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    vperm2i128        m2, m1, m4, 00100000b
+    vperm2i128        m1, m1, m4, 00110001b
+    punpcklbw         m4, m2, m1
+    punpckhbw         m2, m1
+    punpcklwd         m1, m4, m2
+    punpckhwd         m4, m2
+    mova              m0, [trans8_shuf]
+    vpermd            m1, m0, m1
+    vpermd            m4, m0, m4
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    movhps            [r0 + r1], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    movhps            [r0 + r1], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    RET
+
+INIT_YMM avx2
+INIT_YMM avx2
+cglobal intra_pred_ang8_20, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2]
+    lea               r5, [intra_pred_shuff_0_8]
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2 + 16], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_20]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 3 + 16], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5 + 16], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 6 + 16], 0
+    mova              xm0, xm5
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 8 + 16], 0
+    vinserti128       m0, m0, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm1
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm4
+    movhps            [r0 + r3], xm2
+    RET
+
+INIT_YMM avx2
+INIT_YMM avx2
1457
+cglobal intra_pred_ang8_21, 3, 6, 6
1458
+    mova              m3, [pw_1024]
1459
+    movu              xm5, [r2]
1460
+    lea               r5, [intra_pred_shuff_0_8]
1461
+    mova              xm0, xm5
1462
+    pslldq            xm5, 1
1463
+    pinsrb            xm5, [r2 + 2 + 16], 0
1464
+    vinserti128       m0, m0, xm5, 1
1465
+    pshufb            m0, [r5]
1466
+
1467
+    lea               r4, [c_ang8_mode_15]
1468
+    pmaddubsw         m1, m0, [r4]
1469
+    pmulhrsw          m1, m3
1470
+    mova              xm0, xm5
1471
+    pslldq            xm5, 1
1472
+    pinsrb            xm5, [r2 + 4 + 16], 0
1473
+    vinserti128       m0, m0, xm5, 1
1474
+    pshufb            m0, [r5]
1475
+    pmaddubsw         m2, m0, [r4 + mmsize]
1476
+    pmulhrsw          m2, m3
1477
+    mova              xm0, xm5
1478
+    pslldq            xm5, 1
1479
+    pinsrb            xm5, [r2 + 6 + 16], 0
1480
+    vinserti128       m0, m0, xm5, 1
1481
+    pshufb            m0, [r5]
1482
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
1483
+    pmulhrsw          m4, m3
1484
+    mova              xm0, xm5
1485
+    pslldq            xm5, 1
1486
+    pinsrb            xm5, [r2 + 8 + 16], 0
1487
+    vinserti128       m0, m0, xm5, 1
1488
+    pshufb            m0, [r5]
1489
+    pmaddubsw         m0, [r4 + 3 * mmsize]
1490
+    pmulhrsw          m0, m3
1491
+    packuswb          m1, m2
1492
+    packuswb          m4, m0
1493
+
1494
+    lea               r3, [3 * r1]
1495
+    movq              [r0], xm1
1496
+    vextracti128      xm2, m1, 1
1497
+    movq              [r0 + r1], xm2
1498
+    movhps            [r0 + 2 * r1], xm1
1499
+    movhps            [r0 + r3], xm2
1500
+    lea               r0, [r0 + 4 * r1]
1501
+    movq              [r0], xm4
1502
+    vextracti128      xm2, m4, 1
1503
+    movq              [r0 + r1], xm2
1504
+    movhps            [r0 + 2 * r1], xm4
1505
+    movhps            [r0 + r3], xm2
1506
+    RET
1507
+
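Each of these ang8 kernels evaluates the standard HEVC two-tap angular interpolation: `pmaddubsw` applies a (32-frac, frac) weight pair loaded from the `c_ang8_mode_*` tables, and `pmulhrsw` against `pw_1024` performs the rounding shift back to pixel range. A scalar sketch of the per-pixel computation (illustrative only; the exact weight layout inside the tables is an assumption here):

```python
def angular_pred_pixel(ref_a, ref_b, frac):
    """Two-tap HEVC angular intra filter:
    (a*(32-frac) + b*frac + 16) >> 5, matching the pmaddubsw
    weight pairs plus the pmulhrsw(pw_1024) rounding step."""
    return (ref_a * (32 - frac) + ref_b * frac + 16) >> 5
```

Because the two weights always sum to 32, a flat reference row predicts itself exactly, which is why these kernels never need a separate normalization pass.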
1508
+INIT_YMM avx2
+cglobal intra_pred_ang8_22, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2]
+    lea               r5, [intra_pred_shuff_0_8]
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_14]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2 + 16], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5 + 16], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 7 + 16], 0
+    pshufb            xm5, [r5]
+    vinserti128       m0, m0, xm5, 1
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm1
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm4
+    movhps            [r0 + r3], xm2
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_14, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2 + 16]
+    pinsrb            xm5, [r2], 0
+    lea               r5, [intra_pred_shuff_0_8]
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_14]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 2], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 5], 0
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 7], 0
+    pshufb            xm5, [r5]
+    vinserti128       m0, m0, xm5, 1
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    vperm2i128        m2, m1, m4, 00100000b
+    vperm2i128        m1, m1, m4, 00110001b
+    punpcklbw         m4, m2, m1
+    punpckhbw         m2, m1
+    punpcklwd         m1, m4, m2
+    punpckhwd         m4, m2
+    mova              m0, [trans8_shuf]
+    vpermd            m1, m0, m1
+    vpermd            m4, m0, m4
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    movhps            [r0 + r1], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    movhps            [r0 + r1], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    RET
+
1611
+INIT_YMM avx2
+cglobal intra_pred_ang8_13, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2 + 16]
+    pinsrb            xm5, [r2], 0
+    lea               r5, [intra_pred_shuff_0_8]
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_13]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 4], 0
+    pshufb            xm4, xm5, [r5]
+    vinserti128       m0, m0, xm4, 1
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    vinserti128       m0, m0, xm4, 0
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 7], 0
+    pshufb            xm5, [r5]
+    vinserti128       m0, m0, xm5, 1
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    vperm2i128        m2, m1, m4, 00100000b
+    vperm2i128        m1, m1, m4, 00110001b
+    punpcklbw         m4, m2, m1
+    punpckhbw         m2, m1
+    punpcklwd         m1, m4, m2
+    punpckhwd         m4, m2
+    mova              m0, [trans8_shuf]
+    vpermd            m1, m0, m1
+    vpermd            m4, m0, m4
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    movhps            [r0 + r1], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    movhps            [r0 + r1], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + 2 * r1], xm2
+    movhps            [r0 + r3], xm2
+    RET
+
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_23, 3, 6, 6
+    mova              m3, [pw_1024]
+    movu              xm5, [r2]
+    lea               r5, [intra_pred_shuff_0_8]
+    vinserti128       m0, m5, xm5, 1
+    pshufb            m0, [r5]
+
+    lea               r4, [c_ang8_mode_13]
+    pmaddubsw         m1, m0, [r4]
+    pmulhrsw          m1, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 4 + 16], 0
+    pshufb            xm4, xm5, [r5]
+    vinserti128       m0, m0, xm4, 1
+    pmaddubsw         m2, m0, [r4 + mmsize]
+    pmulhrsw          m2, m3
+    vinserti128       m0, m0, xm4, 0
+    pmaddubsw         m4, m0, [r4 + 2 * mmsize]
+    pmulhrsw          m4, m3
+    pslldq            xm5, 1
+    pinsrb            xm5, [r2 + 7 + 16], 0
+    pshufb            xm5, [r5]
+    vinserti128       m0, m0, xm5, 1
+    pmaddubsw         m0, [r4 + 3 * mmsize]
+    pmulhrsw          m0, m3
+
+    packuswb          m1, m2
+    packuswb          m4, m0
+
+    lea               r3, [3 * r1]
+    movq              [r0], xm1
+    vextracti128      xm2, m1, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm1
+    movhps            [r0 + r3], xm2
+    lea               r0, [r0 + 4 * r1]
+    movq              [r0], xm4
+    vextracti128      xm2, m4, 1
+    movq              [r0 + r1], xm2
+    movhps            [r0 + 2 * r1], xm4
+    movhps            [r0 + r3], xm2
+    RET
 
 INIT_YMM avx2
 cglobal intra_pred_ang8_12, 3, 5, 5
@@ -11228,6 +12546,849 @@
     movu              [%2], xm3
 %endmacro
 
1716
+%if ARCH_X86_64 == 1
+%macro INTRA_PRED_TRANS_STORE_16x16 0
+    punpcklbw    m8, m0, m1
+    punpckhbw    m0, m1
+
+    punpcklbw    m1, m2, m3
+    punpckhbw    m2, m3
+
+    punpcklbw    m3, m4, m5
+    punpckhbw    m4, m5
+
+    punpcklbw    m5, m6, m7
+    punpckhbw    m6, m7
+
+    punpcklwd    m7, m8, m1
+    punpckhwd    m8, m1
+
+    punpcklwd    m1, m3, m5
+    punpckhwd    m3, m5
+
+    punpcklwd    m5, m0, m2
+    punpckhwd    m0, m2
+
+    punpcklwd    m2, m4, m6
+    punpckhwd    m4, m6
+
+    punpckldq    m6, m7, m1
+    punpckhdq    m7, m1
+
+    punpckldq    m1, m8, m3
+    punpckhdq    m8, m3
+
+    punpckldq    m3, m5, m2
+    punpckhdq    m5, m2
+
+    punpckldq    m2, m0, m4
+    punpckhdq    m0, m4
+
+    vpermq       m6, m6, 0xD8
+    vpermq       m7, m7, 0xD8
+    vpermq       m1, m1, 0xD8
+    vpermq       m8, m8, 0xD8
+    vpermq       m3, m3, 0xD8
+    vpermq       m5, m5, 0xD8
+    vpermq       m2, m2, 0xD8
+    vpermq       m0, m0, 0xD8
+
+    movu            [r0], xm6
+    vextracti128    xm4, m6, 1
+    movu            [r0 + r1], xm4
+
+    movu            [r0 + 2 * r1], xm7
+    vextracti128    xm4, m7, 1
+    movu            [r0 + r3], xm4
+
+    lea             r0, [r0 + 4 * r1]
+
+    movu            [r0], xm1
+    vextracti128    xm4, m1, 1
+    movu            [r0 + r1], xm4
+
+    movu            [r0 + 2 * r1], xm8
+    vextracti128    xm4, m8, 1
+    movu            [r0 + r3], xm4
+
+    lea             r0, [r0 + 4 * r1]
+
+    movu            [r0], xm3
+    vextracti128    xm4, m3, 1
+    movu            [r0 + r1], xm4
+
+    movu            [r0 + 2 * r1], xm5
+    vextracti128    xm4, m5, 1
+    movu            [r0 + r3], xm4
+
+    lea             r0, [r0 + 4 * r1]
+
+    movu            [r0], xm2
+    vextracti128    xm4, m2, 1
+    movu            [r0 + r1], xm4
+
+    movu            [r0 + 2 * r1], xm0
+    vextracti128    xm4, m0, 1
+    movu            [r0 + r3], xm4
+%endmacro
+
+%macro INTRA_PRED_ANG16_CAL_ROW 3
+    pmaddubsw         %1, m9, [r4 + (%3 * mmsize)]
+    pmulhrsw          %1, m11
+    pmaddubsw         %2, m10, [r4 + (%3 * mmsize)]
+    pmulhrsw          %2, m11
+    packuswb          %1, %2
+%endmacro
+
+
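The `INTRA_PRED_ANG16_CAL_ROW` macro above pairs `pmaddubsw` with `pmulhrsw` against `pw_1024`; with that constant, `pmulhrsw` reduces to a round-to-nearest right shift by 5, which is how the 16-bit weighted sums are scaled back to pixel range before `packuswb`. A minimal emulation of a single lane (ignoring the 16-bit saturation of the real instruction, which cannot trigger at this multiplier):

```python
def pmulhrsw(x, y):
    # Per-lane pmulhrsw: ((x * y >> 14) + 1) >> 1 on signed 16-bit values.
    return (((x * y) >> 14) + 1) >> 1

# With y = 1024 (the pw_1024 constant) this is exactly (x + 16) >> 5,
# i.e. a round-to-nearest shift by 5.
```

This identity is why the coefficient tables can use weights that sum to 32: the sum of products stays in 16 bits and the final rounding costs one multiply per lane instead of an add-and-shift pair.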
1811
+INIT_YMM avx2
+cglobal intra_pred_ang16_12, 3, 6, 13
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 32]
+    pinsrb            xm9, [r2], 0
+    pslldq            xm7, xm9, 1
+    pinsrb            xm7, [r2 + 6], 0
+    vinserti128       m9, m9, xm7, 1
+    pshufb            m9, [r5]
+
+    movu              xm12, [r2 + 6 + 32]
+
+    psrldq            xm10, xm12, 2
+    psrldq            xm8, xm12, 1
+    vinserti128       m10, m10, xm8, 1
+    pshufb            m10, [r5]
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_12]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    pslldq            xm7, 1
+    pinsrb            xm7, [r2 + 13], 0
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    mova              xm8, xm12
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    movu              xm9, [r2 + 31]
+    pinsrb            xm9, [r2 + 6], 0
+    pinsrb            xm9, [r2 + 0], 1
+    pshufb            xm9, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    psrldq            xm10, xm12, 1
+    vinserti128       m10, m10, xm12, 1
+    pshufb            m10, [r5]
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
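`INTRA_PRED_TRANS_STORE_16x16`, invoked at the end of each of these ang16 kernels, interleaves the eight YMM registers with `punpck{l,h}bw/wd/dq` and `vpermq` so that a block computed column-wise lands row-wise in memory. The net effect is a 16x16 byte transpose; a plain-Python equivalent of that effect (not of the shuffle sequence itself):

```python
def transpose16x16(block):
    # Rows become columns: out[r][c] == block[c][r], the net effect of the
    # punpck/vpermq cascade in INTRA_PRED_TRANS_STORE_16x16.
    return [list(col) for col in zip(*block)]
```

Doing the transpose in registers lets the stores remain sixteen contiguous 16-byte `movu` writes instead of scattered byte stores.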
1869
+INIT_YMM avx2
+cglobal intra_pred_ang16_13, 3, 6, 14
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm13, [r2 + 32]
+    pinsrb            xm13, [r2], 0
+    pslldq            xm7, xm13, 2
+    pinsrb            xm7, [r2 + 7], 0
+    pinsrb            xm7, [r2 + 4], 1
+    vinserti128       m9, m13, xm7, 1
+    pshufb            m9, [r5]
+
+    movu              xm12, [r2 + 4 + 32]
+
+    psrldq            xm10, xm12, 4
+    psrldq            xm8, xm12, 2
+    vinserti128       m10, m10, xm8, 1
+    pshufb            m10, [r5]
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_13]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+    pslldq            xm7, 1
+    pinsrb            xm7, [r2 + 11], 0
+    pshufb            xm2, xm7, [r5]
+    vinserti128       m9, m9, xm2, 1
+
+    psrldq            xm8, xm12, 1
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+    pslldq            xm13, 1
+    pinsrb            xm13, [r2 + 4], 0
+    pshufb            xm3, xm13, [r5]
+    vinserti128       m9, m9, xm3, 0
+
+    psrldq            xm8, xm12, 3
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    pslldq            xm7, 1
+    pinsrb            xm7, [r2 + 14], 0
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    mova              xm8, xm12
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    pslldq            xm13, 1
+    pinsrb            xm13, [r2 + 7], 0
+    pshufb            xm13, [r5]
+    vinserti128       m9, m9, xm13, 0
+
+    psrldq            xm12, 2
+    pshufb            xm12, [r5]
+    vinserti128       m10, m10, xm12, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
1948
+INIT_YMM avx2
+cglobal intra_pred_ang16_11, 3, 5, 12
+    mova              m11, [pw_1024]
+
+    movu              xm9, [r2 + 32]
+    pinsrb            xm9, [r2], 0
+    pshufb            xm9, [intra_pred_shuff_0_8]
+    vinserti128       m9, m9, xm9, 1
+
+    vbroadcasti128    m10, [r2 + 8 + 32]
+    pshufb            m10, [intra_pred_shuff_0_8]
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_11]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
+
1980
+INIT_YMM avx2
+cglobal intra_pred_ang16_3, 3, 6, 12
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 1 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 9 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 8 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 16 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_3]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+    movu              xm9, [r2 + 2 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 10 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 9 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 17 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+    movu              xm7, [r2 + 3 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 11 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+    movu              xm9, [r2 + 4 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 12 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 10 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 18 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    movu              xm9, [r2 + 5 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 13 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 11 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 19 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+    movu              xm7, [r2 + 12 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 20 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    movu              xm9, [r2 + 6 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 14 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 13 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 21 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    movu              xm9, [r2 + 7 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 15 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 14 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 22 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
+
2105
+INIT_YMM avx2
+cglobal intra_pred_ang16_4, 3, 6, 12
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 1 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 9 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 6 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 14 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_4]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+    movu              xm9, [r2 + 2 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 10 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 7 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 15 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+    movu              xm7, [r2 + 8 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 16 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+    movu              xm7, [r2 + 3 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 11 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    movu              xm9, [r2 + 4 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 12 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 9 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 17 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+    movu              xm7, [r2 + 10 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 18 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    movu              xm7, [r2 + 5 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 13 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    movu              xm9, [r2 + 6 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 14 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 11 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 19 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
2219
+INIT_YMM avx2
+cglobal intra_pred_ang16_5, 3, 6, 12
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 1 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 9 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 5 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 13 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_5]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+    movu              xm9, [r2 + 2 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 10 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 6 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 14 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+    movu              xm9, [r2 + 3 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 11 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 7 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 15 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+    movu              xm9, [r2 + 4 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 12 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 8 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 16 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    movu              xm9, [r2 + 5 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 13 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 9 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 17 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
2312
+INIT_YMM avx2
+cglobal intra_pred_ang16_6, 3, 6, 12
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 1 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 9 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 4 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 12 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_6]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+    movu              xm7, [r2 + 5 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 13 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+    movu              xm7, [r2 + 2 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 10 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    movu              xm9, [r2 + 3 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 11 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 6 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 14 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    movu              xm7, [r2 + 7 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 15 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    movu              xm7, [r2 + 4 + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 12 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
2398
+INIT_YMM avx2
+cglobal intra_pred_ang16_7, 3, 6, 12
+    mova              m11, [pw_1024]
+    lea               r5, [intra_pred_shuff_0_8]
+
+    movu              xm9, [r2 + 1 + 32]
+    pshufb            xm9, [r5]
+    movu              xm10, [r2 + 9 + 32]
+    pshufb            xm10, [r5]
+
+    movu              xm7, [r2 + 3  + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 11 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang16_mode_7]
+
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+    movu              xm7, [r2 + 4  + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 12 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+    movu              xm7, [r2 + 2  + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 10 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+    add               r4, 4 * mmsize
+
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+    movu              xm7, [r2 + 5  + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 1
+
+    movu              xm8, [r2 + 13 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 1
+
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+    movu              xm7, [r2 + 3  + 32]
+    pshufb            xm7, [r5]
+    vinserti128       m9, m9, xm7, 0
+
+    movu              xm8, [r2 + 11 + 32]
+    pshufb            xm8, [r5]
+    vinserti128       m10, m10, xm8, 0
+
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+    ; transpose and store
+    INTRA_PRED_TRANS_STORE_16x16
+    RET
+
2471
+INIT_YMM avx2
2472
+cglobal intra_pred_ang16_8, 3, 6, 12
2473
+    mova              m11, [pw_1024]
2474
+    lea               r5, [intra_pred_shuff_0_8]
2475
+
2476
+    movu              xm9, [r2 + 1 + 32]
2477
+    pshufb            xm9, [r5]
2478
+    movu              xm10, [r2 + 9 + 32]
2479
+    pshufb            xm10, [r5]
2480
+
2481
+    movu              xm7, [r2 + 2  + 32]
2482
+    pshufb            xm7, [r5]
2483
+    vinserti128       m9, m9, xm7, 1
2484
+
2485
+    movu              xm8, [r2 + 10 + 32]
2486
+    pshufb            xm8, [r5]
2487
+    vinserti128       m10, m10, xm8, 1
2488
+
2489
+    lea               r3, [3 * r1]
2490
+    lea               r4, [c_ang16_mode_8]
2491
+
2492
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
2493
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
2494
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
2495
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
2496
+
2497
+    add               r4, 4 * mmsize
2498
+
2499
+    movu              xm4, [r2 + 3  + 32]
2500
+    pshufb            xm4, [r5]
2501
+    vinserti128       m9, m9, xm4, 1
2502
+
2503
+    movu              xm5, [r2 + 11 + 32]
2504
+    pshufb            xm5, [r5]
2505
+    vinserti128       m10, m10, xm5, 1
2506
+
2507
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
2508
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
2509
+
2510
+    vinserti128       m9, m9, xm7, 0
2511
+    vinserti128       m10, m10, xm8, 0
2512
+
2513
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
2514
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
2515
+
2516
+    ; transpose and store
2517
+    INTRA_PRED_TRANS_STORE_16x16
2518
+    RET
2519
+
2520
+INIT_YMM avx2
2521
+cglobal intra_pred_ang16_9, 3, 6, 12
2522
+    mova              m11, [pw_1024]
2523
+    lea               r5, [intra_pred_shuff_0_8]
2524
+
2525
+    vbroadcasti128    m9, [r2 + 1 + 32]
2526
+    pshufb            m9, [r5]
2527
+    vbroadcasti128    m10, [r2 + 9 + 32]
2528
+    pshufb            m10, [r5]
2529
+
2530
+    lea               r3, [3 * r1]
2531
+    lea               r4, [c_ang16_mode_9]
2532
+
2533
+    INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
2534
+    INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
2535
+    INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
2536
+    INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
2537
+
2538
+    add               r4, 4 * mmsize
2539
+
2540
+    INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
2541
+    INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
2542
+    INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
2543
+
2544
+    movu              xm7, [r2 + 2 + 32]
2545
+    pshufb            xm7, [r5]
2546
+    vinserti128       m9, m9, xm7, 1
2547
+
2548
+    movu              xm7, [r2 + 10 + 32]
2549
+    pshufb            xm7, [r5]
2550
+    vinserti128       m10, m10, xm7, 1
2551
+
2552
+    INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
2553
+
2554
+    ; transpose and store
2555
+    INTRA_PRED_TRANS_STORE_16x16
2556
+    RET
2557
+%endif
2558
+
2559
 INIT_YMM avx2
2560
 cglobal intra_pred_ang16_25, 3, 5, 5
2561
     mova              m0, [pw_1024]
2562
@@ -13514,5 +15675,2154 @@
2563
     vpermq            m6, m6, 11011000b
2564
     movu              [r0 + r3], m6
2565
     RET
2566
+
2567
+INIT_YMM avx2
2568
+cglobal intra_pred_ang32_33, 3, 5, 11
2569
+    mova              m0, [pw_1024]
2570
+    mova              m1, [intra_pred_shuff_0_8]
2571
+    lea               r3, [3 * r1]
2572
+    lea               r4, [c_ang32_mode_33]
2573
+
2574
+    ;row [0]
2575
+    vbroadcasti128    m2, [r2 + 1]
2576
+    pshufb            m2, m1
2577
+    vbroadcasti128    m3, [r2 + 9]
2578
+    pshufb            m3, m1
2579
+    vbroadcasti128    m4, [r2 + 17]
2580
+    pshufb            m4, m1
2581
+    vbroadcasti128    m5, [r2 + 25]
2582
+    pshufb            m5, m1
2583
+
2584
+    vperm2i128        m6, m2, m3, 00100000b
2585
+    pmaddubsw         m6, [r4 + 0 * mmsize]
2586
+    pmulhrsw          m6, m0
2587
+    vperm2i128        m7, m4, m5, 00100000b
2588
+    pmaddubsw         m7, [r4 + 0 * mmsize]
2589
+    pmulhrsw          m7, m0
2590
+    packuswb          m6, m7
2591
+    vpermq            m6, m6, 11011000b
2592
+    movu              [r0], m6
2593
+
2594
+    ;row [1]
2595
+    vbroadcasti128    m2, [r2 + 2]
2596
+    pshufb            m2, m1
2597
+    vbroadcasti128    m3, [r2 + 10]
2598
+    pshufb            m3, m1
2599
+    vbroadcasti128    m4, [r2 + 18]
2600
+    pshufb            m4, m1
2601
+    vbroadcasti128    m5, [r2 + 26]
2602
+    pshufb            m5, m1
2603
+
2604
+    vperm2i128        m6, m2, m3, 00100000b
2605
+    pmaddubsw         m6, [r4 + 1 * mmsize]
2606
+    pmulhrsw          m6, m0
2607
+    vperm2i128        m7, m4, m5, 00100000b
2608
+    pmaddubsw         m7, [r4 + 1 * mmsize]
2609
+    pmulhrsw          m7, m0
2610
+    packuswb          m6, m7
2611
+    vpermq            m6, m6, 11011000b
2612
+    movu              [r0 + r1], m6
2613
+
2614
+    ;row [2]
2615
+    vbroadcasti128    m2, [r2 + 3]
2616
+    pshufb            m2, m1
2617
+    vbroadcasti128    m3, [r2 + 11]
2618
+    pshufb            m3, m1
2619
+    vbroadcasti128    m4, [r2 + 19]
2620
+    pshufb            m4, m1
2621
+    vbroadcasti128    m5, [r2 + 27]
2622
+    pshufb            m5, m1
2623
+
2624
+    vperm2i128        m6, m2, m3, 00100000b
2625
+    pmaddubsw         m6, [r4 + 2 * mmsize]
2626
+    pmulhrsw          m6, m0
2627
+    vperm2i128        m7, m4, m5, 00100000b
2628
+    pmaddubsw         m7, [r4 + 2 * mmsize]
2629
+    pmulhrsw          m7, m0
2630
+    packuswb          m6, m7
2631
+    vpermq            m6, m6, 11011000b
2632
+    movu              [r0 + 2 * r1], m6
2633
+
2634
+    ;row [3]
2635
+    vbroadcasti128    m2, [r2 + 4]
2636
+    pshufb            m2, m1
2637
+    vbroadcasti128    m3, [r2 + 12]
2638
+    pshufb            m3, m1
2639
+    vbroadcasti128    m4, [r2 + 20]
2640
+    pshufb            m4, m1
2641
+    vbroadcasti128    m5, [r2 + 28]
2642
+    pshufb            m5, m1
2643
+
2644
+    vperm2i128        m6, m2, m3, 00100000b
2645
+    pmaddubsw         m6, [r4 + 3 * mmsize]
2646
+    pmulhrsw          m6, m0
2647
+    vperm2i128        m7, m4, m5, 00100000b
2648
+    pmaddubsw         m7, [r4 + 3 * mmsize]
2649
+    pmulhrsw          m7, m0
2650
+    packuswb          m6, m7
2651
+    vpermq            m6, m6, 11011000b
2652
+    movu              [r0 + r3], m6
2653
+
2654
+    ;row [4, 5]
2655
+    vbroadcasti128    m2, [r2 + 5]
2656
+    pshufb            m2, m1
2657
+    vbroadcasti128    m3, [r2 + 13]
2658
+    pshufb            m3, m1
2659
+    vbroadcasti128    m4, [r2 + 21]
2660
+    pshufb            m4, m1
2661
+    vbroadcasti128    m5, [r2 + 29]
2662
+    pshufb            m5, m1
2663
+
2664
+    add               r4, 4 * mmsize
2665
+    lea               r0, [r0 + 4 * r1]
2666
+    mova              m10, [r4 + 0 * mmsize]
2667
+
2668
+    INTRA_PRED_ANG32_CAL_ROW
2669
+    movu              [r0], m7
2670
+    movu              [r0 + r1], m6
2671
+
2672
+    ;row [6]
2673
+    vbroadcasti128    m2, [r2 + 6]
2674
+    pshufb            m2, m1
2675
+    vbroadcasti128    m3, [r2 + 14]
2676
+    pshufb            m3, m1
2677
+    vbroadcasti128    m4, [r2 + 22]
2678
+    pshufb            m4, m1
2679
+    vbroadcasti128    m5, [r2 + 30]
2680
+    pshufb            m5, m1
2681
+
2682
+    vperm2i128        m6, m2, m3, 00100000b
2683
+    pmaddubsw         m6, [r4 + 1 * mmsize]
2684
+    pmulhrsw          m6, m0
2685
+    vperm2i128        m7, m4, m5, 00100000b
2686
+    pmaddubsw         m7, [r4 + 1 * mmsize]
2687
+    pmulhrsw          m7, m0
2688
+    packuswb          m6, m7
2689
+    vpermq            m6, m6, 11011000b
2690
+    movu              [r0 + 2 * r1], m6
2691
+
2692
+    ;row [7]
2693
+    vbroadcasti128    m2, [r2 + 7]
2694
+    pshufb            m2, m1
2695
+    vbroadcasti128    m3, [r2 + 15]
2696
+    pshufb            m3, m1
2697
+    vbroadcasti128    m4, [r2 + 23]
2698
+    pshufb            m4, m1
2699
+    vbroadcasti128    m5, [r2 + 31]
2700
+    pshufb            m5, m1
2701
+
2702
+    vperm2i128        m6, m2, m3, 00100000b
2703
+    pmaddubsw         m6, [r4 + 2 * mmsize]
2704
+    pmulhrsw          m6, m0
2705
+    vperm2i128        m7, m4, m5, 00100000b
2706
+    pmaddubsw         m7, [r4 + 2 * mmsize]
2707
+    pmulhrsw          m7, m0
2708
+    packuswb          m6, m7
2709
+    vpermq            m6, m6, 11011000b
2710
+    movu              [r0 + r3], m6
2711
+
2712
+    ;row [8]
2713
+    vbroadcasti128    m2, [r2 + 8]
2714
+    pshufb            m2, m1
2715
+    vbroadcasti128    m3, [r2 + 16]
2716
+    pshufb            m3, m1
2717
+    vbroadcasti128    m4, [r2 + 24]
2718
+    pshufb            m4, m1
2719
+    vbroadcasti128    m5, [r2 + 32]
2720
+    pshufb            m5, m1
2721
+
2722
+    lea               r0, [r0 + 4 * r1]
2723
+    vperm2i128        m6, m2, m3, 00100000b
2724
+    pmaddubsw         m6, [r4 + 3 * mmsize]
2725
+    pmulhrsw          m6, m0
2726
+    vperm2i128        m7, m4, m5, 00100000b
2727
+    pmaddubsw         m7, [r4 + 3 * mmsize]
2728
+    pmulhrsw          m7, m0
2729
+    packuswb          m6, m7
2730
+    vpermq            m6, m6, 11011000b
2731
+    movu              [r0], m6
2732
+
2733
+    ;row [9, 10]
2734
+    vbroadcasti128    m2, [r2 + 9]
2735
+    pshufb            m2, m1
2736
+    vbroadcasti128    m3, [r2 + 17]
2737
+    pshufb            m3, m1
2738
+    vbroadcasti128    m4, [r2 + 25]
2739
+    pshufb            m4, m1
2740
+    vbroadcasti128    m5, [r2 + 33]
2741
+    pshufb            m5, m1
2742
+
2743
+    add               r4, 4 * mmsize
2744
+    mova              m10, [r4 + 0 * mmsize]
2745
+
2746
+    INTRA_PRED_ANG32_CAL_ROW
2747
+    movu              [r0 + r1], m7
2748
+    movu              [r0 + 2 * r1], m6
2749
+
2750
+    ;row [11]
2751
+    vbroadcasti128    m2, [r2 + 10]
2752
+    pshufb            m2, m1
2753
+    vbroadcasti128    m3, [r2 + 18]
2754
+    pshufb            m3, m1
2755
+    vbroadcasti128    m4, [r2 + 26]
2756
+    pshufb            m4, m1
2757
+    vbroadcasti128    m5, [r2 + 34]
2758
+    pshufb            m5, m1
2759
+
2760
+    vperm2i128        m6, m2, m3, 00100000b
2761
+    pmaddubsw         m6, [r4 + 1 * mmsize]
2762
+    pmulhrsw          m6, m0
2763
+    vperm2i128        m7, m4, m5, 00100000b
2764
+    pmaddubsw         m7, [r4 + 1 * mmsize]
2765
+    pmulhrsw          m7, m0
2766
+    packuswb          m6, m7
2767
+    vpermq            m6, m6, 11011000b
2768
+    movu              [r0 + r3], m6
2769
+
2770
+    ;row [12]
2771
+    vbroadcasti128    m2, [r2 + 11]
2772
+    pshufb            m2, m1
2773
+    vbroadcasti128    m3, [r2 + 19]
2774
+    pshufb            m3, m1
2775
+    vbroadcasti128    m4, [r2 + 27]
2776
+    pshufb            m4, m1
2777
+    vbroadcasti128    m5, [r2 + 35]
2778
+    pshufb            m5, m1
2779
+
2780
+    lea               r0, [r0 + 4 * r1]
2781
+    vperm2i128        m6, m2, m3, 00100000b
2782
+    pmaddubsw         m6, [r4 + 2 * mmsize]
2783
+    pmulhrsw          m6, m0
2784
+    vperm2i128        m7, m4, m5, 00100000b
2785
+    pmaddubsw         m7, [r4 + 2 * mmsize]
2786
+    pmulhrsw          m7, m0
2787
+    packuswb          m6, m7
2788
+    vpermq            m6, m6, 11011000b
2789
+    movu              [r0], m6
2790
+
2791
+    ;row [13]
2792
+    vbroadcasti128    m2, [r2 + 12]
2793
+    pshufb            m2, m1
2794
+    vbroadcasti128    m3, [r2 + 20]
2795
+    pshufb            m3, m1
2796
+    vbroadcasti128    m4, [r2 + 28]
2797
+    pshufb            m4, m1
2798
+    vbroadcasti128    m5, [r2 + 36]
2799
+    pshufb            m5, m1
2800
+
2801
+    vperm2i128        m6, m2, m3, 00100000b
2802
+    pmaddubsw         m6, [r4 + 3 * mmsize]
2803
+    pmulhrsw          m6, m0
2804
+    vperm2i128        m7, m4, m5, 00100000b
2805
+    pmaddubsw         m7, [r4 + 3 * mmsize]
2806
+    pmulhrsw          m7, m0
2807
+    packuswb          m6, m7
2808
+    vpermq            m6, m6, 11011000b
2809
+    movu              [r0 + r1], m6
2810
+
2811
+    ;row [14]
2812
+    vbroadcasti128    m2, [r2 + 13]
2813
+    pshufb            m2, m1
2814
+    vbroadcasti128    m3, [r2 + 21]
2815
+    pshufb            m3, m1
2816
+    vbroadcasti128    m4, [r2 + 29]
2817
+    pshufb            m4, m1
2818
+    vbroadcasti128    m5, [r2 + 37]
2819
+    pshufb            m5, m1
2820
+
2821
+    add               r4, 4 * mmsize
2822
+    vperm2i128        m6, m2, m3, 00100000b
2823
+    pmaddubsw         m6, [r4 + 0 * mmsize]
2824
+    pmulhrsw          m6, m0
2825
+    vperm2i128        m7, m4, m5, 00100000b
2826
+    pmaddubsw         m7, [r4 + 0 * mmsize]
2827
+    pmulhrsw          m7, m0
2828
+    packuswb          m6, m7
2829
+    vpermq            m6, m6, 11011000b
2830
+    movu              [r0 + 2 * r1], m6
2831
+
2832
+    ;row [15, 16]
2833
+    vbroadcasti128    m2, [r2 + 14]
2834
+    pshufb            m2, m1
2835
+    vbroadcasti128    m3, [r2 + 22]
2836
+    pshufb            m3, m1
2837
+    vbroadcasti128    m4, [r2 + 30]
2838
+    pshufb            m4, m1
2839
+    vbroadcasti128    m5, [r2 + 38]
2840
+    pshufb            m5, m1
2841
+
2842
+    mova              m10, [r4 + 1 * mmsize]
2843
+
2844
+    INTRA_PRED_ANG32_CAL_ROW
2845
+    movu              [r0 + r3], m7
2846
+    lea               r0, [r0 + 4 * r1]
2847
+    movu              [r0], m6
2848
+
2849
+    ;row [17]
2850
+    vbroadcasti128    m2, [r2 + 15]
2851
+    pshufb            m2, m1
2852
+    vbroadcasti128    m3, [r2 + 23]
2853
+    pshufb            m3, m1
2854
+    vbroadcasti128    m4, [r2 + 31]
2855
+    pshufb            m4, m1
2856
+    vbroadcasti128    m5, [r2 + 39]
2857
+    pshufb            m5, m1
2858
+
2859
+    vperm2i128        m6, m2, m3, 00100000b
2860
+    pmaddubsw         m6, [r4 + 2 * mmsize]
2861
+    pmulhrsw          m6, m0
2862
+    vperm2i128        m7, m4, m5, 00100000b
2863
+    pmaddubsw         m7, [r4 + 2 * mmsize]
2864
+    pmulhrsw          m7, m0
2865
+    packuswb          m6, m7
2866
+    vpermq            m6, m6, 11011000b
2867
+    movu              [r0 + r1], m6
2868
+
2869
+    ;row [18]
2870
+    vbroadcasti128    m2, [r2 + 16]
2871
+    pshufb            m2, m1
2872
+    vbroadcasti128    m3, [r2 + 24]
2873
+    pshufb            m3, m1
2874
+    vbroadcasti128    m4, [r2 + 32]
2875
+    pshufb            m4, m1
2876
+    vbroadcasti128    m5, [r2 + 40]
2877
+    pshufb            m5, m1
2878
+
2879
+    vperm2i128        m6, m2, m3, 00100000b
2880
+    pmaddubsw         m6, [r4 + 3 * mmsize]
2881
+    pmulhrsw          m6, m0
2882
+    vperm2i128        m7, m4, m5, 00100000b
2883
+    pmaddubsw         m7, [r4 + 3 * mmsize]
2884
+    pmulhrsw          m7, m0
2885
+    packuswb          m6, m7
2886
+    vpermq            m6, m6, 11011000b
2887
+    movu              [r0 + 2 * r1], m6
2888
+
2889
+    ;row [19]
2890
+    vbroadcasti128    m2, [r2 + 17]
2891
+    pshufb            m2, m1
2892
+    vbroadcasti128    m3, [r2 + 25]
2893
+    pshufb            m3, m1
2894
+    vbroadcasti128    m4, [r2 + 33]
2895
+    pshufb            m4, m1
2896
+    vbroadcasti128    m5, [r2 + 41]
2897
+    pshufb            m5, m1
2898
+
2899
+    add               r4, 4 * mmsize
2900
+    vperm2i128        m6, m2, m3, 00100000b
2901
+    pmaddubsw         m6, [r4 + 0 * mmsize]
2902
+    pmulhrsw          m6, m0
2903
+    vperm2i128        m7, m4, m5, 00100000b
2904
+    pmaddubsw         m7, [r4 + 0 * mmsize]
2905
+    pmulhrsw          m7, m0
2906
+    packuswb          m6, m7
2907
+    vpermq            m6, m6, 11011000b
2908
+    movu              [r0 + r3], m6
2909
+
2910
+    ;row [20, 21]
2911
+    vbroadcasti128    m2, [r2 + 18]
2912
+    pshufb            m2, m1
2913
+    vbroadcasti128    m3, [r2 + 26]
2914
+    pshufb            m3, m1
2915
+    vbroadcasti128    m4, [r2 + 34]
2916
+    pshufb            m4, m1
2917
+    vbroadcasti128    m5, [r2 + 42]
2918
+    pshufb            m5, m1
2919
+
2920
+    lea               r0, [r0 + 4 * r1]
2921
+    mova              m10, [r4 + 1 * mmsize]
2922
+
2923
+    INTRA_PRED_ANG32_CAL_ROW
2924
+    movu              [r0], m7
2925
+    movu              [r0 + r1], m6
2926
+
2927
+    ;row [22]
2928
+    vbroadcasti128    m2, [r2 + 19]
2929
+    pshufb            m2, m1
2930
+    vbroadcasti128    m3, [r2 + 27]
2931
+    pshufb            m3, m1
2932
+    vbroadcasti128    m4, [r2 + 35]
2933
+    pshufb            m4, m1
2934
+    vbroadcasti128    m5, [r2 + 43]
2935
+    pshufb            m5, m1
2936
+
2937
+    vperm2i128        m6, m2, m3, 00100000b
2938
+    pmaddubsw         m6, [r4 + 2 * mmsize]
2939
+    pmulhrsw          m6, m0
2940
+    vperm2i128        m7, m4, m5, 00100000b
2941
+    pmaddubsw         m7, [r4 + 2 * mmsize]
2942
+    pmulhrsw          m7, m0
2943
+    packuswb          m6, m7
2944
+    vpermq            m6, m6, 11011000b
2945
+    movu              [r0 + 2 * r1], m6
2946
+
2947
+    ;row [23]
2948
+    vbroadcasti128    m2, [r2 + 20]
2949
+    pshufb            m2, m1
2950
+    vbroadcasti128    m3, [r2 + 28]
2951
+    pshufb            m3, m1
2952
+    vbroadcasti128    m4, [r2 + 36]
2953
+    pshufb            m4, m1
2954
+    vbroadcasti128    m5, [r2 + 44]
2955
+    pshufb            m5, m1
2956
+
2957
+    vperm2i128        m6, m2, m3, 00100000b
2958
+    pmaddubsw         m6, [r4 + 3 * mmsize]
2959
+    pmulhrsw          m6, m0
2960
+    vperm2i128        m7, m4, m5, 00100000b
2961
+    pmaddubsw         m7, [r4 + 3 * mmsize]
2962
+    pmulhrsw          m7, m0
2963
+    packuswb          m6, m7
2964
+    vpermq            m6, m6, 11011000b
2965
+    movu              [r0 + r3], m6
2966
+
2967
+    ;row [24]
2968
+    vbroadcasti128    m2, [r2 + 21]
2969
+    pshufb            m2, m1
2970
+    vbroadcasti128    m3, [r2 + 29]
2971
+    pshufb            m3, m1
2972
+    vbroadcasti128    m4, [r2 + 37]
2973
+    pshufb            m4, m1
2974
+    vbroadcasti128    m5, [r2 + 45]
2975
+    pshufb            m5, m1
2976
+
2977
+    add               r4, 4 * mmsize
2978
+    lea               r0, [r0 + 4 * r1]
2979
+    vperm2i128        m6, m2, m3, 00100000b
2980
+    pmaddubsw         m6, [r4 + 0 * mmsize]
2981
+    pmulhrsw          m6, m0
2982
+    vperm2i128        m7, m4, m5, 00100000b
2983
+    pmaddubsw         m7, [r4 + 0 * mmsize]
2984
+    pmulhrsw          m7, m0
2985
+    packuswb          m6, m7
2986
+    vpermq            m6, m6, 11011000b
2987
+    movu              [r0], m6
2988
+
2989
+    ;row [25, 26]
2990
+    vbroadcasti128    m2, [r2 + 22]
2991
+    pshufb            m2, m1
2992
+    vbroadcasti128    m3, [r2 + 30]
2993
+    pshufb            m3, m1
2994
+    vbroadcasti128    m4, [r2 + 38]
2995
+    pshufb            m4, m1
2996
+    vbroadcasti128    m5, [r2 + 46]
2997
+    pshufb            m5, m1
2998
+
2999
+    mova              m10, [r4 + 1 * mmsize]
3000
+
3001
+    INTRA_PRED_ANG32_CAL_ROW
3002
+    movu              [r0 + r1], m7
3003
+    movu              [r0 + 2 * r1], m6
3004
+
3005
+    ;row [27]
3006
+    vbroadcasti128    m2, [r2 + 23]
3007
+    pshufb            m2, m1
3008
+    vbroadcasti128    m3, [r2 + 31]
3009
+    pshufb            m3, m1
3010
+    vbroadcasti128    m4, [r2 + 39]
3011
+    pshufb            m4, m1
3012
+    vbroadcasti128    m5, [r2 + 47]
3013
+    pshufb            m5, m1
3014
+
3015
+    vperm2i128        m6, m2, m3, 00100000b
3016
+    pmaddubsw         m6, [r4 + 2 * mmsize]
3017
+    pmulhrsw          m6, m0
3018
+    vperm2i128        m7, m4, m5, 00100000b
3019
+    pmaddubsw         m7, [r4 + 2 * mmsize]
3020
+    pmulhrsw          m7, m0
3021
+    packuswb          m6, m7
3022
+    vpermq            m6, m6, 11011000b
3023
+    movu              [r0 + r3], m6
3024
+
3025
+    ;row [28]
3026
+    vbroadcasti128    m2, [r2 + 24]
3027
+    pshufb            m2, m1
3028
+    vbroadcasti128    m3, [r2 + 32]
3029
+    pshufb            m3, m1
3030
+    vbroadcasti128    m4, [r2 + 40]
3031
+    pshufb            m4, m1
3032
+    vbroadcasti128    m5, [r2 + 48]
3033
+    pshufb            m5, m1
3034
+
3035
+    lea               r0, [r0 + 4 * r1]
3036
+    vperm2i128        m6, m2, m3, 00100000b
3037
+    pmaddubsw         m6, [r4 + 3 * mmsize]
3038
+    pmulhrsw          m6, m0
3039
+    vperm2i128        m7, m4, m5, 00100000b
3040
+    pmaddubsw         m7, [r4 + 3 * mmsize]
3041
+    pmulhrsw          m7, m0
3042
+    packuswb          m6, m7
3043
+    vpermq            m6, m6, 11011000b
3044
+    movu              [r0], m6
3045
+
3046
+    ;row [29]
3047
+    vbroadcasti128    m2, [r2 + 25]
3048
+    pshufb            m2, m1
3049
+    vbroadcasti128    m3, [r2 + 33]
3050
+    pshufb            m3, m1
3051
+    vbroadcasti128    m4, [r2 + 41]
3052
+    pshufb            m4, m1
3053
+    vbroadcasti128    m5, [r2 + 49]
3054
+    pshufb            m5, m1
3055
+
3056
+    add               r4, 4 * mmsize
3057
+    vperm2i128        m6, m2, m3, 00100000b
3058
+    pmaddubsw         m6, [r4 + 0 * mmsize]
3059
+    pmulhrsw          m6, m0
3060
+    vperm2i128        m7, m4, m5, 00100000b
3061
+    pmaddubsw         m7, [r4 + 0 * mmsize]
3062
+    pmulhrsw          m7, m0
3063
+    packuswb          m6, m7
3064
+    vpermq            m6, m6, 11011000b
3065
+    movu              [r0 + r1], m6
3066
+
3067
+    ;row [30]
3068
+    vbroadcasti128    m2, [r2 + 26]
3069
+    pshufb            m2, m1
3070
+    vbroadcasti128    m3, [r2 + 34]
3071
+    pshufb            m3, m1
3072
+    vbroadcasti128    m4, [r2 + 42]
3073
+    pshufb            m4, m1
3074
+    vbroadcasti128    m5, [r2 + 50]
3075
+    pshufb            m5, m1
3076
+
3077
+    vperm2i128        m6, m2, m3, 00100000b
3078
+    pmaddubsw         m6, [r4 + 1 * mmsize]
3079
+    pmulhrsw          m6, m0
3080
+    vperm2i128        m7, m4, m5, 00100000b
3081
+    pmaddubsw         m7, [r4 + 1 * mmsize]
3082
+    pmulhrsw          m7, m0
3083
+    packuswb          m6, m7
3084
+    vpermq            m6, m6, 11011000b
3085
+    movu              [r0 + 2 * r1], m6
3086
+
3087
+    ;row [31]
3088
+    vbroadcasti128    m2, [r2 + 27]
3089
+    pshufb            m2, m1
3090
+    vbroadcasti128    m3, [r2 + 35]
3091
+    pshufb            m3, m1
3092
+    vbroadcasti128    m4, [r2 + 43]
3093
+    pshufb            m4, m1
3094
+    vbroadcasti128    m5, [r2 + 51]
3095
+    pshufb            m5, m1
3096
+
3097
+    vperm2i128        m6, m2, m3, 00100000b
3098
+    pmaddubsw         m6, [r4 + 2 * mmsize]
3099
+    pmulhrsw          m6, m0
3100
+    vperm2i128        m7, m4, m5, 00100000b
3101
+    pmaddubsw         m7, [r4 + 2 * mmsize]
3102
+    pmulhrsw          m7, m0
3103
+    packuswb          m6, m7
3104
+    vpermq            m6, m6, 11011000b
3105
+    movu              [r0 + r3], m6
3106
+    RET
3107
+
3108
+INIT_YMM avx2
3109
+cglobal intra_pred_ang32_25, 3, 5, 11
3110
+    mova              m0, [pw_1024]
3111
+    mova              m1, [intra_pred_shuff_0_8]
3112
+    lea               r3, [3 * r1]
3113
+    lea               r4, [c_ang32_mode_25]
3114
+
3115
+    ;row [0, 1]
3116
+    vbroadcasti128    m2, [r2 + 0]
3117
+    pshufb            m2, m1
3118
+    vbroadcasti128    m3, [r2 + 8]
3119
+    pshufb            m3, m1
3120
+    vbroadcasti128    m4, [r2 + 16]
3121
+    pshufb            m4, m1
3122
+    vbroadcasti128    m5, [r2 + 24]
3123
+    pshufb            m5, m1
3124
+
3125
+    mova              m10, [r4 + 0 * mmsize]
3126
+
3127
+    INTRA_PRED_ANG32_CAL_ROW
3128
+    movu              [r0], m7
3129
+    movu              [r0 + r1], m6
3130
+
3131
+    ;row[2, 3]
3132
+    mova              m10, [r4 + 1 * mmsize]
3133
+
3134
+    INTRA_PRED_ANG32_CAL_ROW
3135
+    movu              [r0 + 2 * r1], m7
3136
+    movu              [r0 + r3], m6
3137
+
3138
+    ;row[4, 5]
3139
+    mova              m10, [r4 + 2 * mmsize]
3140
+    lea               r0, [r0 + 4 * r1]
3141
+
3142
+    INTRA_PRED_ANG32_CAL_ROW
3143
+    movu              [r0], m7
3144
+    movu              [r0 + r1], m6
3145
+
3146
+    ;row[6, 7]
3147
+    mova              m10, [r4 + 3 * mmsize]
3148
+
3149
+    INTRA_PRED_ANG32_CAL_ROW
3150
+    movu              [r0 + 2 * r1], m7
3151
+    movu              [r0 + r3], m6
3152
+
3153
+    ;row[8, 9]
3154
+    add               r4, 4 * mmsize
3155
+    lea               r0, [r0 + 4 * r1]
3156
+    mova              m10, [r4 + 0 * mmsize]
3157
+
3158
+    INTRA_PRED_ANG32_CAL_ROW
3159
+    movu              [r0], m7
3160
+    movu              [r0 + r1], m6
3161
+
3162
+    ;row[10, 11]
3163
+    mova              m10, [r4 + 1 * mmsize]
3164
+
3165
+    INTRA_PRED_ANG32_CAL_ROW
3166
+    movu              [r0 + 2 * r1], m7
3167
+    movu              [r0 + r3], m6
3168
+
3169
+    ;row[12, 13]
3170
+    mova              m10, [r4 + 2 * mmsize]
3171
+    lea               r0, [r0 + 4 * r1]
3172
+
3173
+    INTRA_PRED_ANG32_CAL_ROW
3174
+    movu              [r0], m7
3175
+    movu              [r0 + r1], m6
3176
+
3177
+    ;row[14, 15]
3178
+    mova              m10, [r4 + 3 * mmsize]
3179
+
3180
+    INTRA_PRED_ANG32_CAL_ROW
3181
+    movu              [r0 + 2 * r1], m7
3182
+    movu              [r0 + r3], m6
3183
+
3184
+    ;row[16, 17]
3185
+    movu              xm2, [r2 - 1]
3186
+    pinsrb            xm2, [r2 + 80], 0
3187
+    vinserti128       m2, m2, xm2, 1
3188
+    pshufb            m2, m1
3189
+    vbroadcasti128    m3, [r2 + 7]
3190
+    pshufb            m3, m1
3191
+    vbroadcasti128    m4, [r2 + 15]
3192
+    pshufb            m4, m1
3193
+    vbroadcasti128    m5, [r2 + 23]
3194
+    pshufb            m5, m1
3195
+
3196
+    add               r4, 4 * mmsize
3197
+    lea               r0, [r0 + 4 * r1]
3198
+    mova              m10, [r4 + 0 * mmsize]
3199
+
3200
+    INTRA_PRED_ANG32_CAL_ROW
3201
+    movu              [r0], m7
3202
+    movu              [r0 + r1], m6
3203
+
3204
+    ;row[18, 19]
3205
+    mova              m10, [r4 + 1 * mmsize]
3206
+
3207
+    INTRA_PRED_ANG32_CAL_ROW
3208
+    movu              [r0 + 2 * r1], m7
3209
+    movu              [r0 + r3], m6
3210
+
3211
+    ;row[20, 21]
3212
+    mova              m10, [r4 + 2 * mmsize]
3213
+    lea               r0, [r0 + 4 * r1]
3214
+
3215
+    INTRA_PRED_ANG32_CAL_ROW
3216
+    movu              [r0], m7
3217
+    movu              [r0 + r1], m6
3218
+
3219
+    ;row[22, 23]
3220
+    mova              m10, [r4 + 3 * mmsize]
3221
+
3222
+    INTRA_PRED_ANG32_CAL_ROW
3223
+    movu              [r0 + 2 * r1], m7
3224
+    movu              [r0 + r3], m6
3225
+
3226
+    ;row[24, 25]
3227
+    add               r4, 4 * mmsize
3228
+    lea               r0, [r0 + 4 * r1]
3229
+    mova              m10, [r4 + 0 * mmsize]
3230
+
3231
+    INTRA_PRED_ANG32_CAL_ROW
3232
+    movu              [r0], m7
3233
+    movu              [r0 + r1], m6
3234
+
3235
+    ;row[26, 27]
3236
+    mova              m10, [r4 + 1 * mmsize]
3237
+
3238
+    INTRA_PRED_ANG32_CAL_ROW
3239
+    movu              [r0 + 2 * r1], m7
3240
+    movu              [r0 + r3], m6
3241
+
3242
+    ;row[28, 29]
3243
+    mova              m10, [r4 + 2 * mmsize]
3244
+    lea               r0, [r0 + 4 * r1]
3245
+
3246
+    INTRA_PRED_ANG32_CAL_ROW
3247
+    movu              [r0], m7
3248
+    movu              [r0 + r1], m6
3249
+
3250
+    ;row[30, 31]
3251
+    mova              m10, [r4 + 3 * mmsize]
3252
+
3253
+    INTRA_PRED_ANG32_CAL_ROW
3254
+    movu              [r0 + 2 * r1], m7
3255
+    movu              [r0 + r3], m6
3256
+    RET
3257
+
3258
+INIT_YMM avx2
3259
+cglobal intra_pred_ang32_24, 3, 5, 12
3260
+    mova              m0, [pw_1024]
3261
+    mova              m1, [intra_pred_shuff_0_8]
3262
+    lea               r3, [3 * r1]
3263
+    lea               r4, [c_ang32_mode_24]
3264
+
3265
+    ;row[0, 1]
3266
+    vbroadcasti128    m11, [r2 + 0]
3267
+    pshufb            m2, m11, m1
3268
+    vbroadcasti128    m3, [r2 + 8]
3269
+    pshufb            m3, m1
3270
+    vbroadcasti128    m4, [r2 + 16]
3271
+    pshufb            m4, m1
3272
+    vbroadcasti128    m5, [r2 + 24]
3273
+    pshufb            m5, m1
3274
+
3275
+    mova              m10, [r4 + 0 * mmsize]
3276
+
3277
+    INTRA_PRED_ANG32_CAL_ROW
3278
+    movu              [r0], m7
3279
+    movu              [r0 + r1], m6
3280
+
3281
+    ;row[2, 3]
3282
+    mova              m10, [r4 + 1 * mmsize]
3283
+
3284
+    INTRA_PRED_ANG32_CAL_ROW
3285
+    movu              [r0 + 2 * r1], m7
3286
+    movu              [r0 + r3], m6
3287
+
3288
+    ;row[4, 5]
3289
+    mova              m10, [r4 + 2 * mmsize]
3290
+    lea               r0, [r0 + 4 * r1]
3291
+
3292
+    INTRA_PRED_ANG32_CAL_ROW
3293
+    movu              [r0], m7
3294
+    movu              [r0 + r1], m6
3295
+
3296
+    ;row[6, 7]
3297
+    pslldq            xm11, 1
3298
+    pinsrb            xm11, [r2 + 70], 0
3299
+    vinserti128       m2, m11, xm11, 1
3300
+    pshufb            m2, m1
3301
+    vbroadcasti128    m3, [r2 + 7]
3302
+    pshufb            m3, m1
3303
+    vbroadcasti128    m4, [r2 + 15]
3304
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 23]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[8, 9]
+    add               r4, 4 * mmsize
+    lea               r0, [r0 + 4 * r1]
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[10, 11]
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[12, 13]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 77], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 6]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 14]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 22]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[14, 15]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[16, 17]
+    add               r4, 4 * mmsize
+    lea               r0, [r0 + 4 * r1]
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[18]
+    mova              m10, [r4 + 1 * mmsize]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, m10
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, m10
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + 2 * r1], m6
+
+    ;row[19, 20]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 83], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 5]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 13]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 21]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[21, 22]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[23, 24]
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[25, 26]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 90], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 4]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 12]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 20]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[27, 28]
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[29, 30]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;[row 31]
+    mova              m10, [r4 + 4 * mmsize]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, m10
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, m10
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r3], m6
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_23, 3, 5, 12
+    mova              m0, [pw_1024]
+    mova              m1, [intra_pred_shuff_0_8]
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang32_mode_23]
+
+    ;row[0, 1]
+    vbroadcasti128    m11, [r2 + 0]
+    pshufb            m2, m11, m1
+    vbroadcasti128    m3, [r2 + 8]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 16]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 24]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[2]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 1 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 1 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + 2 * r1], m6
+
+    ;row[3, 4]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 68], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 7]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 15]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 23]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[5, 6]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[7, 8]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 71], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 6]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 14]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 22]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[9]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 1 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 1 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r1], m6
+
+    ;row[10, 11]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 75], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 5]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 13]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 21]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[12, 13]
+    lea               r0, [r0 + 4 * r1]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[14, 15]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 78], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 4]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 12]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 20]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[16]
+    lea               r0, [r0 + 4 * r1]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 1 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 1 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0], m6
+
+    ;row[17, 18]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 82], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 3]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 11]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 19]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[19, 20]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[21, 22]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 85], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 2]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 10]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 18]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[23]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 1 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 1 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r3], m6
+
+    ;row[24, 25]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 89], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 1]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 9]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 17]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[26, 27]
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[28, 29]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 92], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 0]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 8]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 16]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[30, 31]
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_22, 3, 5, 13
+    mova              m0, [pw_1024]
+    mova              m1, [intra_pred_shuff_0_8]
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang32_mode_22]
+
+    ;row[0, 1]
+    vbroadcasti128    m11, [r2 + 0]
+    pshufb            m2, m11, m1
+    vbroadcasti128    m3, [r2 + 8]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 16]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 24]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[2, 3]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 66], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 7]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 15]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 23]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[4, 5]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 69], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 6]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 14]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 22]
+    pshufb            m5, m1
+
+    lea               r0, [r0 + 4 * r1]
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[6]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 3 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 3 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + 2 * r1], m6
+
+    ;row[7, 8]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 71], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 5]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 13]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 21]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[9, 10]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 74], 0
+    vinserti128       m2, m11, xm11, 1
+    vinserti128       m2, m2, xm2, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 4]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 12]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 20]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[11]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 2 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 2 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r3], m6
+
+    ;row[12, 13]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 76], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 3]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 11]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 19]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[14, 15]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 79], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 2]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 10]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 18]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[16]
+    lea               r0, [r0 + 4 * r1]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 1 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 1 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0], m6
+
+    ;row[17, 18]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 81], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 1]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 9]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 17]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[19, 20]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 84], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m12, [r2 + 0]
+    pshufb            m3, m12, m1
+    vbroadcasti128    m4, [r2 + 8]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 16]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[21]
+    add               r4, 4 * mmsize
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 0 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 0 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r1], m6
+
+    ;row[22, 23]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 86], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 66], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 7]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 15]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[24, 25]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 89], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 69], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 6]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 14]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+    lea               r0, [r0 + 4 * r1]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[26]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 3 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 3 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + 2 * r1], m6
+
+    ;row[27, 28]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 91], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 71], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 5]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 13]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[29, 30]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 94], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 74], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 4]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 12]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[31]
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 2 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 2 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r3], m6
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_21, 3, 5, 13
+    mova              m0, [pw_1024]
+    mova              m1, [intra_pred_shuff_0_8]
+    lea               r3, [3 * r1]
+    lea               r4, [c_ang32_mode_21]
+
+    ;row[0]
+    vbroadcasti128    m11, [r2 + 0]
+    pshufb            m2, m11, m1
+    vbroadcasti128    m3, [r2 + 8]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 16]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 24]
+    pshufb            m5, m1
+
+    vperm2i128        m6, m2, m3, 00100000b
+    pmaddubsw         m6, [r4 + 0 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 0 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0], m6
+
+    ;row[1, 2]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 66], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 7]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 15]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 23]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[3, 4]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 68], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 6]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 14]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 22]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[5, 6]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 70], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 5]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 13]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 21]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[7, 8]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 72], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 4]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 12]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 20]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[9, 10]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 73], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 3]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 11]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 19]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[11, 12]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 75], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 2]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 10]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 18]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r3], m7
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m6
+
+    ;row[13, 14]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 77], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m3, [r2 + 1]
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 9]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 17]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + r1], m7
+    movu              [r0 + 2 * r1], m6
+
+    ;row[15]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 79], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    vbroadcasti128    m12, [r2 + 0]
+    pshufb            m3, m12, m1
+    vbroadcasti128    m4, [r2 + 8]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 16]
+    pshufb            m5, m1
+    vperm2i128        m6, m2, m3, 00100000b
+    add               r4, 4 * mmsize
+    pmaddubsw         m6, [r4 + 0 * mmsize]
+    pmulhrsw          m6, m0
+    vperm2i128        m7, m4, m5, 00100000b
+    pmaddubsw         m7, [r4 + 0 * mmsize]
+    pmulhrsw          m7, m0
+    packuswb          m6, m7
+    vpermq            m6, m6, 11011000b
+    movu              [r0 + r3], m6
+
+    ;row[16, 17]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 81], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 66], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 7]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 15]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[18, 19]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 83], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 68], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 6]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 14]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[20, 21]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 85], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 70], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 5]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 13]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 3 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[22, 23]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 87], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 72], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 4]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 12]
+    pshufb            m5, m1
+
+    add               r4, 4 * mmsize
+    mova              m10, [r4 + 0 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[24, 25]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 88], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 73], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 3]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 11]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 1 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    lea               r0, [r0 + 4 * r1]
+    movu              [r0], m7
+    movu              [r0 + r1], m6
+
+    ;row[26, 27]
+    pslldq            xm11, 1
+    pinsrb            xm11, [r2 + 90], 0
+    vinserti128       m2, m11, xm11, 1
+    pshufb            m2, m1
+    pslldq            xm12, 1
+    pinsrb            xm12, [r2 + 75], 0
+    vinserti128       m3, m12, xm12, 1
+    pshufb            m3, m1
+    vbroadcasti128    m4, [r2 + 2]
+    pshufb            m4, m1
+    vbroadcasti128    m5, [r2 + 10]
+    pshufb            m5, m1
+
+    mova              m10, [r4 + 2 * mmsize]
+
+    INTRA_PRED_ANG32_CAL_ROW
+    movu              [r0 + 2 * r1], m7
+    movu              [r0 + r3], m6
+
+    ;row[28, 29]
+    pslldq            xm11, 1
4347
+    pinsrb            xm11, [r2 + 92], 0
4348
+    vinserti128       m2, m11, xm11, 1
4349
+    pshufb            m2, m1
4350
+    pslldq            xm12, 1
4351
+    pinsrb            xm12, [r2 + 77], 0
4352
+    vinserti128       m3, m12, xm12, 1
4353
+    pshufb            m3, m1
4354
+    vbroadcasti128    m4, [r2 + 1]
4355
+    pshufb            m4, m1
4356
+    vbroadcasti128    m5, [r2 + 9]
4357
+    pshufb            m5, m1
4358
+
4359
+    mova              m10, [r4 + 3 * mmsize]
4360
+
4361
+    INTRA_PRED_ANG32_CAL_ROW
4362
+    lea               r0, [r0 + 4 * r1]
4363
+    movu              [r0], m7
4364
+    movu              [r0 + r1], m6
4365
+
4366
+    ;row[30, 31]
4367
+    pslldq            xm11, 1
4368
+    pinsrb            xm11, [r2 + 94], 0
4369
+    vinserti128       m2, m11, xm11, 1
4370
+    pshufb            m2, m1
4371
+    pslldq            xm12, 1
4372
+    pinsrb            xm12, [r2 + 79], 0
4373
+    vinserti128       m3, m12, xm12, 1
4374
+    pshufb            m3, m1
4375
+    vbroadcasti128    m4, [r2 + 0]
4376
+    pshufb            m4, m1
4377
+    vbroadcasti128    m5, [r2 + 8]
4378
+    pshufb            m5, m1
4379
+
4380
+    mova              m10, [r4 + 4 * mmsize]
4381
+
4382
+    INTRA_PRED_ANG32_CAL_ROW
4383
+    movu              [r0 + 2 * r1], m7
4384
+    movu              [r0 + r3], m6
4385
+    RET
4386
 %endif
 
+%macro INTRA_PRED_STORE_4x4 0
+    movd              [r0], xm0
+    pextrd            [r0 + r1], xm0, 1
+    vextracti128      xm0, m0, 1
+    lea               r0, [r0 + 2 * r1]
+    movd              [r0], xm0
+    pextrd            [r0 + r1], xm0, 1
+%endmacro
+
+%macro INTRA_PRED_TRANS_STORE_4x4 0
+    vpermq            m0, m0, 00001000b
+    pshufb            m0, [c_trans_4x4]
+
+    ;store
+    movd              [r0], xm0
+    pextrd            [r0 + r1], xm0, 1
+    lea               r0, [r0 + 2 * r1]
+    pextrd            [r0], xm0, 2
+    pextrd            [r0 + r1], xm0, 3
+%endmacro
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_27, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred_shuff_0_4]
+    pmaddubsw         m0, [c_ang4_mode_27]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_28, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred_shuff_0_4]
+    pmaddubsw         m0, [c_ang4_mode_28]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_29, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff1]
+    pmaddubsw         m0, [c_ang4_mode_29]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_30, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff2]
+    pmaddubsw         m0, [c_ang4_mode_30]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_31, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff31]
+    pmaddubsw         m0, [c_ang4_mode_31]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_32, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff31]
+    pmaddubsw         m0, [c_ang4_mode_32]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_33, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff33]
+    pmaddubsw         m0, [c_ang4_mode_33]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_3, 3, 3, 1
+    vbroadcasti128    m0, [r2 + 1]
+    pshufb            m0, [intra_pred4_shuff3]
+    pmaddubsw         m0, [c_ang4_mode_33]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_4, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff5]
+    pmaddubsw         m0, [c_ang4_mode_32]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_5, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff5]
+    pmaddubsw         m0, [c_ang4_mode_5]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_6, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff6]
+    pmaddubsw         m0, [c_ang4_mode_6]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_7, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff7]
+    pmaddubsw         m0, [c_ang4_mode_7]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_8, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff9]
+    pmaddubsw         m0, [c_ang4_mode_8]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_9, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff9]
+    pmaddubsw         m0, [c_ang4_mode_9]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_11, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff12]
+    pmaddubsw         m0, [c_ang4_mode_11]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_12, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff12]
+    pmaddubsw         m0, [c_ang4_mode_12]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_13, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff13]
+    pmaddubsw         m0, [c_ang4_mode_13]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_14, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff14]
+    pmaddubsw         m0, [c_ang4_mode_14]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_15, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff15]
+    pmaddubsw         m0, [c_ang4_mode_15]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_16, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff16]
+    pmaddubsw         m0, [c_ang4_mode_16]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_17, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff17]
+    pmaddubsw         m0, [c_ang4_mode_17]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_TRANS_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_19, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff19]
+    pmaddubsw         m0, [c_ang4_mode_19]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_20, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff20]
+    pmaddubsw         m0, [c_ang4_mode_20]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_21, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff21]
+    pmaddubsw         m0, [c_ang4_mode_21]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_22, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff22]
+    pmaddubsw         m0, [c_ang4_mode_22]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_23, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred4_shuff23]
+    pmaddubsw         m0, [c_ang4_mode_23]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_24, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred_shuff_0_4]
+    pmaddubsw         m0, [c_ang4_mode_24]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_25, 3, 3, 1
+    vbroadcasti128    m0, [r2]
+    pshufb            m0, [intra_pred_shuff_0_4]
+    pmaddubsw         m0, [c_ang4_mode_25]
+    pmulhrsw          m0, [pw_1024]
+    packuswb          m0, m0
+
+    INTRA_PRED_STORE_4x4
+    RET
x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8_allangs.asm Changed

@@ -2,7 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
-;*          Praveen Tiwari <praveen@multicorewareinc.com>
+;*          Praveen Tiwari <praveen@multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -27,6 +27,64 @@
 
 SECTION_RODATA 32
 
+all_ang4_shuff: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+                db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+                db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+                db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11
+                db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10
+                db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10
+                db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9
+                db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0
+                db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1
+                db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2
+                db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+                db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6
+                db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+                db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8
+                db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8
+
+all_ang4: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+          db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+          db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+          db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+          db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+          db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+          db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+          db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+          db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+          db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+          db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+          db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+          db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+          db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+
+
 SECTION .text
 
 ; global constant
@@ -34,9 +92,14 @@
 
 ; common constant with intrapred8.asm
 cextern ang_table
+cextern pw_ang_table
 cextern tab_S1
 cextern tab_S2
 cextern tab_Si
+cextern pw_16
+cextern pb_000000000000000F
+cextern pb_0000000000000F0F
+cextern pw_FFFFFFFFFFFFFFF0
 
 
 ;-----------------------------------------------------------------------------
@@ -23006,3 +23069,1098 @@
     palignr    m4,              m2,       m1,    14
     movu       [r0 + 2111 * 16],   m4
     RET
+
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal all_angs_pred_4x4, 4, 4, 6
+
+    mova           m5, [pw_1024]
+    lea            r2, [all_ang4]
+    lea            r3, [all_ang4_shuff]
+
+; mode 2
+
+    vbroadcasti128 m0, [r1 + 9]
+    mova           xm1, xm0
+    psrldq         xm1, 1
+    pshufb         xm1, [r3]
+    movu           [r0], xm1
+
+; mode 3
+
+    pshufb         m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m1, [r2]
+    pmulhrsw       m1, m5
+
+; mode 4
+
+    pshufb         m2, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m2, [r2 + 1 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (3 - 2) * 16], m1
+
+; mode 5
+
+    pshufb         m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 6
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 3 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (5 - 2) * 16], m1
+
+    add            r3, 4 * mmsize
+    add            r2, 4 * mmsize
+
+; mode 7
+
+    pshufb         m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw      m1, [r2 + 0 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 8
+
+    pshufb         m2, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m2, [r2 + 1 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (7 - 2) * 16], m1
+
+; mode 9
+
+    pshufb         m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+    packuswb       m1, m1
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (9 - 2) * 16], xm1
+
+; mode 10
+
+    pshufb         xm1, xm0, [r3 + 2 * mmsize]
+    movu           [r0 + (10 - 2) * 16], xm1
+
+    pxor           xm1, xm1
+    movd           xm2, [r1 + 1]
+    pshufd         xm3, xm2, 0
+    punpcklbw      xm3, xm1
+    pinsrb         xm2, [r1], 0
+    pshufb         xm4, xm2, xm1
+    punpcklbw      xm4, xm1
+    psubw          xm3, xm4
+    psraw          xm3, 1
+    pshufb         xm4, xm0, xm1
+    punpcklbw      xm4, xm1
+    paddw          xm3, xm4
+    packuswb       xm3, xm1
+
+    pextrb         [r0 + 128], xm3, 0
+    pextrb         [r0 + 132], xm3, 1
+    pextrb         [r0 + 136], xm3, 2
+    pextrb         [r0 + 140], xm3, 3
+
+; mode 11
+
+    vbroadcasti128 m0, [r1]
+    pshufb         m1, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m1, [r2 + 3 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 12
+
+    add            r2, 4 * mmsize
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 0 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (11 - 2) * 16], m1
+
+; mode 13
+
+    add            r3, 4 * mmsize
+
+    pshufb         m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw      m1, [r2 + 1 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 14
+
+    pshufb         m2, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m2, [r2 + 2 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (13 - 2) * 16], m1
+
+; mode 15
+
+    pshufb         m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m1, [r2 + 3 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 16
+
+    add            r2, 4 * mmsize
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 0 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (15 - 2) * 16], m1
+
+; mode 17
+
+    add            r3, 4 * mmsize
+
+    pshufb         m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw      m1, [r2 + 1 * mmsize]
+    pmulhrsw       m1, m5
+    packuswb       m1, m1
+    vpermq         m1, m1, 11011000b
+
+; mode 18
+
+    pshufb         m2, m0, [r3 + 1 * mmsize]
+    vinserti128    m1, m1, xm2, 1
+    movu           [r0 + (17 - 2) * 16], m1
+
+; mode 19
+
+    pshufb         m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 20
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 3 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (19 - 2) * 16], m1
+
+; mode 21
+
+    add            r2, 4 * mmsize
+    add            r3, 4 * mmsize
+
+    pshufb         m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw      m1, [r2 + 0 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 22
+
+    pshufb         m2, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m2, [r2 + 1 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (21 - 2) * 16], m1
+
+; mode 23
+
+    pshufb         m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 24
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 3 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (23 - 2) * 16], m1
+
+; mode 25
+
+    add            r2, 4 * mmsize
+
+    pshufb         m1, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m1, [r2 + 0 * mmsize]
+    pmulhrsw       m1, m5
+    packuswb       m1, m1
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (25 - 2) * 16], xm1
+
+; mode 26
+
+    add            r3, 4 * mmsize
+
+    pshufb         xm1, xm0, [r3 + 0 * mmsize]
+    movu           [r0 + (26 - 2) * 16], xm1
+
+    pxor           xm1, xm1
+    movd           xm2, [r1 + 9]
+    pshufd         xm3, xm2, 0
+    punpcklbw      xm3, xm1
+    pinsrb         xm4, [r1 + 0], 0
+    pshufb         xm4, xm1
+    punpcklbw      xm4, xm1
+    psubw          xm3, xm4
+    psraw          xm3, 1
+    psrldq         xm2, xm0, 1
+    pshufb         xm2, xm1
+    punpcklbw      xm2, xm1
+    paddw          xm3, xm2
+    packuswb       xm3, xm1
+
+    pextrb       [r0 + 384], xm3, 0
+    pextrb       [r0 + 388], xm3, 1
+    pextrb       [r0 + 392], xm3, 2
+    pextrb       [r0 + 396], xm3, 3
+
+; mode 27
+
+    pshufb        m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw     m1, [r2 + 1 * mmsize]
+    pmulhrsw      m1, m5
+
+; mode 28
+
+    pshufb        m2, m0, [r3 + 1 * mmsize]
+    pmaddubsw     m2, [r2 + 2 * mmsize]
+    pmulhrsw      m2, m5
+    packuswb      m1, m2
+    vpermq        m1, m1, 11011000b
+    movu          [r0 + (27 - 2) * 16], m1
+
+; mode 29
+
+    pshufb        m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw     m1, [r2 + 3 * mmsize]
+    pmulhrsw      m1, m5
+
+; mode 30
+
+    add           r2, 4 * mmsize
+
+    pshufb        m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw     m2, [r2 + 0 * mmsize]
+    pmulhrsw      m2, m5
+    packuswb      m1, m2
+    vpermq        m1, m1, 11011000b
+    movu          [r0 + (29 - 2) * 16], m1
+
+; mode 31
+
+    add           r3, 4 * mmsize
+
+    pshufb        m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw     m1, [r2 + 1 * mmsize]
+    pmulhrsw      m1, m5
+
+; mode 32
+
+    pshufb        m2, m0, [r3 + 0 * mmsize]
+    pmaddubsw     m2, [r2 + 2 * mmsize]
+    pmulhrsw      m2, m5
+    packuswb      m1, m2
+    vpermq        m1, m1, 11011000b
+    movu          [r0 + (31 - 2) * 16], m1
+
+; mode 33
+
+    pshufb        m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw     m1, [r2 + 3 * mmsize]
+    pmulhrsw      m1, m5
+    packuswb      m1, m2
+    vpermq        m1, m1, 11011000b
+
+; mode 34
+
+    pshufb        m0, [r3 + 2 * mmsize]
+    vinserti128   m1, m1, xm0, 1
+    movu          [r0 + (33 - 2) * 16], m1
+    RET
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal all_angs_pred_4x4, 4, 4, 8
+
+; mode 2
+
+    movh        m6,             [r1 + 9]
+    mova        m2,             m6
+    psrldq      m2,             1
+    movd        [r0],           m2              ;byte[A, B, C, D]
+    psrldq      m2,             1
+    movd        [r0 + 4],       m2              ;byte[B, C, D, E]
+    psrldq      m2,             1
+    movd        [r0 + 8],       m2              ;byte[C, D, E, F]
+    psrldq      m2,             1
+    movd        [r0 + 12],      m2              ;byte[D, E, F, G]
+
+; mode 10/26
+
+    pxor        m7,             m7
+    pshufd      m5,             m6,        0
+    mova        [r0 + 128],     m5              ;mode 10 byte[9, A, B, C, 9, A, B, C, 9, A, B, C, 9, A, B, C]
+
+    movd        m4,             [r1 + 1]
+    pshufd      m4,             m4,        0
+    mova        [r0 + 384],     m4              ;mode 26 byte[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
+
+    movd        m1,             [r1]
+    punpcklbw   m1,             m7
+    pshuflw     m1,             m1,     0x00
+    punpcklqdq  m1,             m1              ;m1 = byte[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+
+    punpckldq   m4,             m5
+    punpcklbw   m4,             m7              ;m4 = word[1, 2, 3, 4, 9, A, B, C]
+    pshuflw     m2,             m4,     0x00
+    pshufhw     m2,             m2,     0x00    ;m2 = word[1, 1, 1, 1, 9, 9, 9, 9]
+
+    psubw       m4,             m1
+    psraw       m4,             1
+
+    pshufd      m2,             m2,     q1032   ;m2 = word[9, 9, 9, 9, 1, 1, 1, 1]
+    paddw       m4,             m2
+    packuswb    m4,             m4
+
+%if ARCH_X86_64
+    movq        r2,             m4
+
+    mov         [r0 + 128],     r2b              ;mode 10
+    shr         r2,             8
+    mov         [r0 + 132],     r2b
+    shr         r2,             8
+    mov         [r0 + 136],     r2b
+    shr         r2,             8
+    mov         [r0 + 140],     r2b
+    shr         r2,             8
+    mov         [r0 + 384],     r2b              ;mode 26
+    shr         r2d,            8
+    mov         [r0 + 388],     r2b
+    shr         r2d,            8
+    mov         [r0 + 392],     r2b
+    shr         r2d,            8
+    mov         [r0 + 396],     r2b
+
+%else
+    movd        r2d,             m4
480
+
481
+    mov         [r0 + 128],     r2b              ;mode 10
482
+    shr         r2d,             8
483
+    mov         [r0 + 132],     r2b
484
+    shr         r2d,             8
485
+    mov         [r0 + 136],     r2b
486
+    shr         r2d,             8
487
+    mov         [r0 + 140],     r2b
488
+
489
+    psrldq      m4,             4
490
+    movd        r2d,            m4
491
+
492
+    mov         [r0 + 384],     r2b              ;mode 26
493
+    shr         r2d,            8
494
+    mov         [r0 + 388],     r2b
495
+    shr         r2d,            8
496
+    mov         [r0 + 392],     r2b
497
+    shr         r2d,            8
498
+    mov         [r0 + 396],     r2b
499
+%endif
500
+
501
+; mode 3
502
+
503
+    mova        m2,             [pw_16]
504
+    lea         r3,             [pw_ang_table + 7 * 16]
505
+    lea         r2,             [pw_ang_table + 23 * 16]
506
+    punpcklbw   m6,             m6
507
+    psrldq      m6,             1
508
+    movh        m1,             m6
509
+    psrldq      m6,             2
510
+    movh        m0,             m6
511
+    psrldq      m6,             2
512
+    movh        m3,             m6
513
+    psrldq      m6,             2
514
+    punpcklbw   m1,             m7              ;m1 = word[9, A, A, B, B, C, C, D]
515
+    punpcklbw   m0,             m7              ;m0 = word[A, B, B, C, C, D, D, E]
516
+    punpcklbw   m3,             m7              ;m3 = word[B, C, C, D, D, E, E, F]
517
+    punpcklbw   m6,             m7              ;m6 = word[C, D, D, E, E, F, F, G]
518
+
519
+    mova        m7,             [r2 - 3 * 16]
520
+
521
+    pmaddwd     m5,             m1,     [r2 + 3 * 16]
522
+    pmaddwd     m4,             m0,     m7
523
+
524
+    packssdw    m5,             m4
525
+    paddw       m5,             m2
526
+    psraw       m5,             5
527
+
528
+    pmaddwd     m4,             m3,     [r3 + 7 * 16]
529
+    pmaddwd     m6,             [r3 + 1 * 16]
530
+
531
+    packssdw    m4,             m6
532
+    paddw       m4,             m2
533
+    psraw       m4,             5
534
+
535
+    packuswb    m5,             m4
536
+    mova        [r0 + 16],      m5
537
+    movd        [r0 + 68],      m5              ;mode 6 row 1
538
+    psrldq      m5,             4
539
+    movd        [r0 + 76],      m5              ;mode 6 row 3
540
+
541
+; mode 4
542
+
543
+    pmaddwd     m4,             m0,     [r2 + 8 * 16]
544
+    pmaddwd     m6,             m3,     m7
545
+
546
+    packssdw    m4,             m6
547
+    paddw       m4,             m2
548
+    psraw       m4,             5
549
+
550
+    pmaddwd     m5,             m1,     [r2 - 2 * 16]
551
+    pmaddwd     m6,             m0,     [r3 + 3 * 16]
552
+
553
+    packssdw    m5,             m6
554
+    paddw       m5,             m2
555
+    psraw       m5,             5
556
+
557
+    packuswb    m5,             m4
558
+    mova        [r0 + 32],      m5
559
+
560
+; mode 5
561
+
562
+    pmaddwd     m5,             m1,     [r2 - 6 * 16]
563
+    pmaddwd     m6,             m0,     [r3 - 5 * 16]
564
+
565
+    packssdw    m5,             m6
566
+    paddw       m5,             m2
567
+    psraw       m5,             5
568
+
569
+    pmaddwd     m4,             m0,     [r2 - 4 * 16]
570
+    pmaddwd     m3,             [r3 - 3 * 16]
571
+
572
+    packssdw    m4,             m3
573
+    paddw       m4,             m2
574
+    psraw       m4,             5
575
+
576
+    packuswb    m5,             m4
577
+    mova        [r0 + 48],      m5
578
+
579
+; mode 6
580
+
581
+    pmaddwd     m5,             m1,     [r3 + 6 * 16]
582
+    pmaddwd     m6,             m0,     [r3 + 0 * 16]
583
+
584
+    packssdw    m5,             m6
585
+    paddw       m5,             m2
586
+    psraw       m5,             5
587
+
588
+    packuswb    m5,             m6
589
+    movd        [r0 + 64],      m5
590
+    psrldq      m5,             4
591
+    movd        [r0 + 72],      m5
592
+
593
+; mode 7
594
+
595
+    pmaddwd     m5,             m1,     [r3 + 2 * 16]
596
+    pmaddwd     m6,             m1,     [r2 - 5 * 16]
597
+
598
+    packssdw    m5,             m6
599
+    paddw       m5,             m2
600
+    psraw       m5,             5
601
+
602
+    mova        m3,             [r2 + 4 * 16]
603
+    pmaddwd     m4,             m1,     m3
604
+    pmaddwd     m0,             [r3 - 3 * 16]
605
+
606
+    packssdw    m4,             m0
607
+    paddw       m4,             m2
608
+    psraw       m4,             5
609
+
610
+    packuswb    m5,             m4
611
+    mova        [r0 + 80],      m5
612
+
613
+; mode 8
614
+
615
+    mova        m0,             [r3 - 2 * 16]
616
+    pmaddwd     m5,             m1,     m0
617
+    pmaddwd     m6,             m1,     [r3 + 3 * 16]
618
+
619
+    packssdw    m5,             m6
620
+    paddw       m5,             m2
621
+    psraw       m5,             5
622
+
623
+    pmaddwd     m4,             m1,     [r3 + 8 * 16]
624
+    pmaddwd     m7,             m1
625
+
626
+    packssdw    m4,             m7
627
+    paddw       m4,             m2
628
+    psraw       m4,             5
629
+
630
+    packuswb    m5,             m4
631
+    mova        [r0 + 96],      m5
632
+
633
+; mode 9
634
+
635
+    pmaddwd     m5,             m1,     [r3 - 5 * 16]
636
+    pmaddwd     m6,             m1,     [r3 - 3 * 16]
637
+
638
+    packssdw    m5,             m6
639
+    paddw       m5,             m2
640
+    psraw       m5,             5
641
+
642
+    pmaddwd     m4,             m1,     [r3 - 1 * 16]
643
+    pmaddwd     m6,             m1,     [r3 + 1 * 16]
644
+
645
+    packssdw    m4,             m6
646
+    paddw       m4,             m2
647
+    psraw       m4,             5
648
+
649
+    packuswb    m5,             m4
650
+    mova        [r0 + 112],     m5
651
+
652
+; mode 11
653
+
654
+    movd        m5,             [r1]
655
+    punpcklwd   m5,             m1
656
+    pand        m5,             [pb_0000000000000F0F]
657
+    pslldq      m1,             4
658
+    por         m1,             m5              ;m1 = word[0, 9, 9, A, A, B, B, C]
659
+
660
+    pmaddwd     m5,             m1,     [r2 + 7 * 16]
661
+    pmaddwd     m6,             m1,     [r2 + 5 * 16]
662
+
663
+    packssdw    m5,             m6
664
+    paddw       m5,             m2
665
+    psraw       m5,             5
666
+
667
+    pmaddwd     m4,             m1,     [r2 + 3 * 16]
668
+    pmaddwd     m6,             m1,     [r2 + 1 * 16]
669
+
670
+    packssdw    m4,             m6
671
+    paddw       m4,             m2
672
+    psraw       m4,             5
673
+
674
+    packuswb    m5,             m4
675
+    mova        [r0 + 144],     m5
676
+
677
+; mode 12
678
+
679
+    pmaddwd     m3,             m1
680
+    pmaddwd     m6,             m1,     [r2 - 1 * 16]
681
+
682
+    packssdw    m3,             m6
683
+    paddw       m3,             m2
684
+    psraw       m3,             5
685
+
686
+    pmaddwd     m4,             m1,     [r2 - 6 * 16]
687
+    pmaddwd     m6,             m1,     [r3 + 5 * 16]
688
+
689
+    packssdw    m4,             m6
690
+    paddw       m4,             m2
691
+    psraw       m4,             5
692
+
693
+    packuswb    m3,             m4
694
+    mova        [r0 + 160],     m3
695
+
696
+; mode 13
697
+
698
+    mova        m3,             m1
699
+    movd        m7,             [r1 + 4]
700
+    punpcklwd   m7,             m1
701
+    pand        m7,             [pb_0000000000000F0F]
702
+    pslldq      m3,             4
703
+    por         m3,             m7              ;m3 = word[4, 0, 0, 9, 9, A, A, B]
704
+
705
+    pmaddwd     m5,             m1,     [r2 + 0 * 16]
706
+    pmaddwd     m6,             m1,     [r3 + 7 * 16]
707
+
708
+    packssdw    m5,             m6
709
+    paddw       m5,             m2
710
+    psraw       m5,             5
711
+
712
+    pmaddwd     m4,             m1,     m0
713
+    pmaddwd     m6,             m3,     [r2 + 5 * 16]
714
+
715
+    packssdw    m4,             m6
716
+    paddw       m4,             m2
717
+    psraw       m4,             5
718
+
719
+    packuswb    m5,             m4
720
+    mova        [r0 + 176],     m5
721
+
722
+; mode 14
723
+
724
+    pmaddwd     m5,             m1,     [r2 - 4 * 16]
725
+    pmaddwd     m6,             m1,     [r3 - 1 * 16]
726
+
727
+    packssdw    m5,             m6
728
+    paddw       m5,             m2
729
+    psraw       m5,             5
730
+
731
+    movd        m6,             [r1 + 2]
732
+    pand        m3,             [pw_FFFFFFFFFFFFFFF0]
733
+    pand        m6,             [pb_000000000000000F]
734
+    por         m3,             m6              ;m3 = word[2, 0, 0, 9, 9, A, A, B]
735
+
736
+    pmaddwd     m4,             m3,     [r2 + 2 * 16]
737
+    pmaddwd     m6,             m3,     [r3 + 5 * 16]
738
+
739
+    packssdw    m4,             m6
740
+    paddw       m4,             m2
741
+    psraw       m4,             5
742
+
743
+    packuswb    m5,             m4
744
+    mova        [r0 + 192],     m5
745
+    psrldq      m5,             4
746
+    movd        [r0 + 240],     m5              ;mode 17 row 0
747
+
748
+; mode 15
749
+
750
+    pmaddwd     m5,             m1,     [r3 + 8 * 16]
751
+    pmaddwd     m6,             m3,     [r2 + 7 * 16]
752
+
753
+    packssdw    m5,             m6
754
+    paddw       m5,             m2
755
+    psraw       m5,             5
756
+
757
+    pmaddwd     m6,             m3,     [r3 + 6 * 16]
758
+
759
+    mova        m0,             m3
760
+    punpcklwd   m7,             m3
761
+    pslldq      m0,             4
762
+    pand        m7,             [pb_0000000000000F0F]
763
+    por         m0,             m7              ;m0 = word[4, 2, 2, 0, 0, 9, 9, A]
764
+
765
+    pmaddwd     m4,             m0,     [r2 + 5 * 16]
766
+
767
+    packssdw    m6,             m4
768
+    paddw       m6,             m2
769
+    psraw       m6,             5
770
+
771
+    packuswb    m5,             m6
772
+    mova        [r0 + 208],     m5
773
+
774
+; mode 16
775
+
776
+    pmaddwd     m5,             m1,     [r3 + 4 * 16]
777
+    pmaddwd     m6,             m3,     [r2 - 1 * 16]
778
+
779
+    packssdw    m5,             m6
780
+    paddw       m5,             m2
781
+    psraw       m5,             5
782
+
783
+    pmaddwd     m3,             [r3 - 6 * 16]
784
+
785
+    movd        m6,             [r1 + 3]
786
+    pand        m0,             [pw_FFFFFFFFFFFFFFF0]
787
+    pand        m6,             [pb_000000000000000F]
788
+    por         m0,             m6              ;m0 = word[3, 2, 2, 0, 0, 9, 9, A]
789
+
790
+    pmaddwd     m0,             [r3 + 5 * 16]
791
+    packssdw    m3,             m0
792
+    paddw       m3,             m2
793
+    psraw       m3,             5
794
+
795
+    packuswb    m5,             m3
796
+    mova        [r0 + 224],     m5
797
+
798
+; mode 17
799
+
800
+    movd        m4,             [r1 + 1]
801
+    punpcklwd   m4,             m1
802
+    pand        m4,             [pb_0000000000000F0F]
803
+    pslldq      m1,             4
804
+    por         m1,             m4              ;m1 = word[1, 0, 0, 9, 9, A, A, B]
805
+
806
+    pmaddwd     m6,             m1,     [r3 + 5 * 16]
807
+
808
+    packssdw    m6,             m6
809
+    paddw       m6,             m2
810
+    psraw       m6,             5
811
+
812
+    movd        m5,             [r1 + 2]
813
+    punpcklwd   m5,             m1
814
+    pand        m5,             [pb_0000000000000F0F]
815
+    pslldq      m1,             4
816
+    por         m1,             m5              ;m1 = word[2, 1, 1, 0, 0, 9, 9, A]
817
+
818
+    pmaddwd     m4,             m1,     [r2 - 5 * 16]
819
+
820
+    punpcklwd   m7,             m1
821
+    pand        m7,             [pb_0000000000000F0F]
822
+    pslldq      m1,             4
823
+    por         m1,             m7              ;m1 = word[4, 2, 2, 1, 1, 0, 0, 9]
824
+
825
+    pmaddwd     m1,             [r2 + 1 * 16]
826
+    packssdw    m4,             m1
827
+    paddw       m4,             m2
828
+    psraw       m4,             5
829
+
830
+    packuswb    m6,             m4
831
+    movd        [r0 + 244],     m6
832
+    psrldq      m6,             8
833
+    movh        [r0 + 248],     m6
834
+
835
+; mode 18
836
+
837
+    movh        m1,             [r1]
838
+    movd        [r0 + 256],     m1              ;byte[0, 1, 2, 3]
839
+
840
+    movh        m3,             [r1 + 2]
841
+    punpcklqdq  m3,             m1
842
+    psrldq      m3,             7
843
+    movd        [r0 + 260],     m3              ;byte[2, 1, 0, 9]
844
+
845
+    movh        m4,             [r1 + 3]
846
+    punpcklqdq  m4,             m3
847
+    psrldq      m4,             7
848
+    movd        [r0 + 264],     m4              ;byte[1, 0, 9, A]
849
+
850
+    movh        m0,             [r1 + 4]
851
+    punpcklqdq  m0,             m4
852
+    psrldq      m0,             7
853
+    movd        [r0 + 268],     m0              ;byte[0, 9, A, B]
854
+
855
+; mode 19
856
+
857
+    pxor        m7,             m7
858
+    punpcklbw   m4,             m3
859
+    punpcklbw   m3,             m1
860
+    punpcklbw   m1,             m1
861
+    punpcklbw   m4,             m7              ;m4 = word[A, 9, 9, 0, 0, 1, 1, 2]
862
+    punpcklbw   m3,             m7              ;m3 = word[9, 0, 0, 1, 1, 2, 2, 3]
863
+    psrldq      m1,             1
864
+    punpcklbw   m1,             m7              ;m1 = word[0, 1, 1, 2, 2, 3, 3, 4]
865
+
866
+    pmaddwd     m6,             m1,     [r3 - 1 * 16]
867
+    pmaddwd     m7,             m3,     [r3 + 5 * 16]
868
+
869
+    packssdw    m6,             m7
870
+    paddw       m6,             m2
871
+    psraw       m6,             5
872
+
873
+    pmaddwd     m5,             m4,     [r2 - 5 * 16]
874
+
875
+    movd        m7,             [r1 + 12]
876
+    punpcklwd   m7,             m4
877
+    pand        m7,             [pb_0000000000000F0F]
878
+    pslldq      m4,             4
879
+    por         m4,             m7              ;m4 = word[C, A, A, 9, 9, 0, 0, 1]
880
+
881
+    pmaddwd     m4,             [r2 + 1 * 16]
882
+    packssdw    m5,             m4
883
+    paddw       m5,             m2
884
+    psraw       m5,             5
885
+
886
+    packuswb    m6,             m5
887
+    mova        [r0 + 272],     m6
888
+    movd        [r0 + 324],     m6              ;mode 22 row 1
889
+
890
+; mode 20
891
+
892
+    pmaddwd     m5,             m1,     [r3 + 4 * 16]
893
+
894
+    movd        m4,             [r1 + 10]
895
+    pand        m3,             [pw_FFFFFFFFFFFFFFF0]
896
+    pand        m4,             [pb_000000000000000F]
897
+    por         m3,             m4              ;m3 = word[A, 0, 0, 1, 1, 2, 2, 3]
898
+
899
+    pmaddwd     m6,             m3,     [r2 - 1 * 16]
900
+
901
+    packssdw    m5,             m6
902
+    paddw       m5,             m2
903
+    psraw       m5,             5
904
+
905
+    pmaddwd     m4,             m3,     [r3 - 6 * 16]
906
+
907
+    punpcklwd   m0,             m3
908
+    pand        m0,             [pb_0000000000000F0F]
909
+    mova        m6,             m3
910
+    pslldq      m6,             4
911
+    por         m0,             m6              ;m0 = word[B, A, A, 0, 0, 1, 1, 2]
912
+
913
+    pmaddwd     m6,             m0,     [r3 + 5 * 16]
914
+
915
+    packssdw    m4,             m6
916
+    paddw       m4,             m2
917
+    psraw       m4,             5
918
+
919
+    packuswb    m5,             m4
920
+    mova        [r0 + 288],     m5
921
+
922
+; mode 21
923
+
924
+    pmaddwd     m4,             m1,     [r3 + 8 * 16]
925
+    pmaddwd     m6,             m3,     [r2 + 7 * 16]
926
+
927
+    packssdw    m4,             m6
928
+    paddw       m4,             m2
929
+    psraw       m4,             5
930
+
931
+    pmaddwd     m5,             m3,     [r3 + 6 * 16]
932
+
933
+    pand        m0,             [pw_FFFFFFFFFFFFFFF0]
934
+    pand        m7,             [pb_000000000000000F]
935
+    por         m0,             m7              ;m0 = word[C, A, A, 0, 0, 1, 1, 2]
936
+
937
+    pmaddwd     m0,             [r2 + 5 * 16]
938
+    packssdw    m5,             m0
939
+    paddw       m5,             m2
940
+    psraw       m5,             5
941
+
942
+    packuswb    m4,             m5
943
+    mova        [r0 + 304],     m4
944
+
945
+; mode 22
946
+
947
+    pmaddwd     m4,             m1,     [r2 - 4 * 16]
948
+    packssdw    m4,             m4
949
+    paddw       m4,             m2
950
+    psraw       m4,             5
951
+
952
+    mova        m0,             [r3 + 5 * 16]
953
+    pmaddwd     m5,             m3,     [r2 + 2 * 16]
954
+    pmaddwd     m6,             m3,     m0
955
+
956
+    packssdw    m5,             m6
957
+    paddw       m5,             m2
958
+    psraw       m5,             5
959
+
960
+    packuswb    m4,             m5
961
+    movd        [r0 + 320],     m4
962
+    psrldq      m4,             8
963
+    movh        [r0 + 328],     m4
964
+
965
+; mode 23
966
+
967
+    pmaddwd     m4,             m1,     [r2 + 0 * 16]
968
+    pmaddwd     m5,             m1,     [r3 + 7 * 16]
969
+
970
+    packssdw    m4,             m5
971
+    paddw       m4,             m2
972
+    psraw       m4,             5
973
+
974
+    pmaddwd     m6,             m1,     [r3 - 2 * 16]
975
+
976
+    pand        m3,             [pw_FFFFFFFFFFFFFFF0]
977
+    por         m3,             m7              ;m3 = word[C, 0, 0, 1, 1, 2, 2, 3]
978
+
979
+    pmaddwd     m3,             [r2 + 5 * 16]
980
+    packssdw    m6,             m3
981
+    paddw       m6,             m2
982
+    psraw       m6,             5
983
+
984
+    packuswb    m4,             m6
985
+    mova        [r0 + 336],     m4
986
+
987
+; mode 24
988
+
989
+    pmaddwd     m4,             m1,     [r2 + 4 * 16]
990
+    pmaddwd     m5,             m1,     [r2 - 1 * 16]
991
+
992
+    packssdw    m4,             m5
993
+    paddw       m4,             m2
994
+    psraw       m4,             5
995
+
996
+    pmaddwd     m6,             m1,     [r2 - 6 * 16]
997
+    pmaddwd     m0,             m1
998
+
999
+    packssdw    m6,             m0
1000
+    paddw       m6,             m2
1001
+    psraw       m6,             5
1002
+
1003
+    packuswb    m4,             m6
1004
+    mova        [r0 + 352],     m4
1005
+
1006
+; mode 25
1007
+
1008
+    pmaddwd     m4,             m1,     [r2 + 7 * 16]
1009
+    pmaddwd     m5,             m1,     [r2 + 5 * 16]
1010
+
1011
+    packssdw    m4,             m5
1012
+    paddw       m4,             m2
1013
+    psraw       m4,             5
1014
+
1015
+    pmaddwd     m6,             m1,     [r2 + 3 * 16]
1016
+    pmaddwd     m1,             [r2 + 1 * 16]
1017
+
1018
+    packssdw    m6,             m1
1019
+    paddw       m6,             m2
1020
+    psraw       m6,             5
1021
+
1022
+    packuswb    m4,             m6
1023
+    mova        [r0 + 368],     m4
1024
+
1025
+; mode 27
1026
+
1027
+    movh        m0,             [r1 + 1]
1028
+    pxor        m7,             m7
1029
+    punpcklbw   m0,             m0
1030
+    psrldq      m0,             1
1031
+    movh        m1,             m0
1032
+    psrldq      m0,             2
1033
+    movh        m3,             m0
1034
+    psrldq      m0,             2
1035
+    punpcklbw   m1,             m7              ;m1 = word[1, 2, 2, 3, 3, 4, 4, 5]
1036
+    punpcklbw   m3,             m7              ;m3 = word[2, 3, 3, 4, 4, 5, 5, 6]
1037
+    punpcklbw   m0,             m7              ;m0 = word[3, 4, 4, 5, 5, 6, 6, 7]
1038
+
1039
+    mova        m7,             [r3 - 3 * 16]
1040
+
1041
+    pmaddwd     m4,             m1,     [r3 - 5 * 16]
1042
+    pmaddwd     m5,             m1,     m7
1043
+
1044
+    packssdw    m4,             m5
1045
+    paddw       m4,             m2
1046
+    psraw       m4,             5
1047
+
1048
+    pmaddwd     m6,             m1,     [r3 - 1 * 16]
1049
+    pmaddwd     m5,             m1,     [r3 + 1 * 16]
1050
+
1051
+    packssdw    m6,             m5
1052
+    paddw       m6,             m2
1053
+    psraw       m6,             5
1054
+
1055
+    packuswb    m4,             m6
1056
+    mova        [r0 + 400],     m4
1057
+
1058
+; mode 28
1059
+
1060
+    pmaddwd     m4,             m1,     [r3 - 2 * 16]
1061
+    pmaddwd     m5,             m1,     [r3 + 3 * 16]
1062
+
1063
+    packssdw    m4,             m5
1064
+    paddw       m4,             m2
1065
+    psraw       m4,             5
1066
+
1067
+    pmaddwd     m6,             m1,     [r3 + 8 * 16]
1068
+    pmaddwd     m5,             m1,     [r2 - 3 * 16]
1069
+
1070
+    packssdw    m6,             m5
1071
+    paddw       m6,             m2
1072
+    psraw       m6,             5
1073
+
1074
+    packuswb    m4,             m6
1075
+    mova        [r0 + 416],     m4
1076
+
1077
+; mode 29
1078
+
1079
+    pmaddwd     m4,             m1,     [r3 + 2 * 16]
1080
+    pmaddwd     m6,             m1,     [r2 - 5 * 16]
1081
+
1082
+    packssdw    m4,             m6
1083
+    paddw       m4,             m2
1084
+    psraw       m4,             5
1085
+
1086
+    pmaddwd     m6,             m1,     [r2 + 4 * 16]
1087
+    pmaddwd     m5,             m3,     m7
1088
+
1089
+    packssdw    m6,             m5
1090
+    paddw       m6,             m2
1091
+    psraw       m6,             5
1092
+
1093
+    packuswb    m4,             m6
1094
+    mova        [r0 + 432],     m4
1095
+
1096
+; mode 30
1097
+
1098
+    pmaddwd     m4,             m1,     [r3 + 6 * 16]
1099
+    pmaddwd     m5,             m1,     [r2 + 3 * 16]
1100
+
1101
+    packssdw    m4,             m5
1102
+    paddw       m4,             m2
1103
+    psraw       m4,             5
1104
+
1105
+    pmaddwd     m6,             m3,     [r3 + 0 * 16]
1106
+    pmaddwd     m5,             m3,     [r2 - 3 * 16]
1107
+
1108
+    packssdw    m6,             m5
1109
+    paddw       m6,             m2
1110
+    psraw       m6,             5
1111
+
1112
+    packuswb    m4,             m6
1113
+    mova        [r0 + 448],     m4
1114
+    psrldq      m4,             4
1115
+    movh        [r0 + 496],     m4              ;mode 33 row 0
1116
+    psrldq      m4,             8
1117
+    movd        [r0 + 500],     m4              ;mode 33 row 1
1118
+
1119
+; mode 31
1120
+
1121
+    pmaddwd     m4,             m1,     [r2 - 6 * 16]
1122
+    pmaddwd     m5,             m3,     [r3 - 5 * 16]
1123
+
1124
+    packssdw    m4,             m5
1125
+    paddw       m4,             m2
1126
+    psraw       m4,             5
1127
+
1128
+    pmaddwd     m6,             m3,     [r2 - 4 * 16]
1129
+    pmaddwd     m7,             m0
1130
+
1131
+    packssdw    m6,             m7
1132
+    paddw       m6,             m2
1133
+    psraw       m6,             5
1134
+
1135
+    packuswb    m4,             m6
1136
+    mova        [r0 + 464],     m4
1137
+
1138
+; mode 32
1139
+
1140
+    pmaddwd     m1,             [r2 - 2 * 16]
1141
+    pmaddwd     m5,             m3,     [r3 + 3 * 16]
1142
+
1143
+    packssdw    m1,             m5
1144
+    paddw       m1,             m2
1145
+    psraw       m1,             5
1146
+
1147
+    pmaddwd     m3,             [r2 + 8 * 16]
1148
+    pmaddwd     m5,             m0,     [r2 - 3 * 16]
1149
+    packssdw    m3,             m5
1150
+    paddw       m3,             m2
1151
+    psraw       m3,             5
1152
+
1153
+    packuswb    m1,             m3
1154
+    mova        [r0 + 480],     m1
1155
+
1156
+; mode 33
1157
+
1158
+    pmaddwd     m0,             [r3 + 7 * 16]
1159
+    pxor        m7,             m7
1160
+    movh        m4,             [r1 + 4]
1161
+    punpcklbw   m4,             m4
1162
+    psrldq      m4,             1
1163
+    punpcklbw   m4,             m7
1164
+
1165
+    pmaddwd     m4,             [r3 + 1 * 16]
1166
+
1167
+    packssdw    m0,             m4
1168
+    paddw       m0,             m2
1169
+    psraw       m0,             5
1170
+
1171
+    packuswb    m0,             m0
1172
+    movh        [r0 + 504],     m0
1173
+
1174
+; mode 34
1175
+
1176
+    movh        m7,             [r1 + 2]
1177
+    movd        [r0 + 512],     m7              ;byte[2, 3, 4, 5]
1178
+
1179
+    psrldq      m7,             1
1180
+    movd        [r0 + 516],     m7              ;byte[3, 4, 5, 6]
1181
+
1182
+    psrldq      m7,             1
1183
+    movd        [r0 + 520],     m7              ;byte[4, 5, 6, 7]
1184
+
1185
+    psrldq      m7,             1
1186
+    movd        [r0 + 524],     m7              ;byte[5, 6, 7, 8]
1187
+
1188
+RET
1189
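The pmaddwd / paddw [pw_16] / psraw 5 sequences throughout all_angs_pred_4x4 all vectorize the same per-pixel rounding blend used by HEVC angular intra prediction: each predicted sample is a 1/32-pel weighted average of two neighbouring reference samples. A minimal scalar sketch of that formula follows; `angular_blend` is a hypothetical helper name, not a function in x265.

```c
#include <stdint.h>

/* Scalar form of the blend the SIMD kernels compute:
 * a, b    - two adjacent reference pixels
 * frac    - sub-pel position, 0..31 (rows of pw_ang_table hold (32-frac, frac) pairs)
 * The +16 and >>5 correspond to the [pw_16] add and the psraw-by-5 above. */
static uint8_t angular_blend(uint8_t a, uint8_t b, int frac)
{
    return (uint8_t)(((32 - frac) * a + frac * b + 16) >> 5);
}
```

With frac == 0 the result is exactly `a`, which is why whole-pel modes (2, 10, 18, 26, 34) can be served by plain shuffles and stores instead of multiplies.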
x265_1.6.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter16.asm Changed
 
 @@ -113,10 +113,13 @@
                   times 8 dw 58, -10
                   times 8 dw 4, -1
 
+const interp8_hps_shuf,     dd 0, 4, 1, 5, 2, 6, 3, 7
+
 SECTION .text
 cextern pd_32
 cextern pw_pixel_max
 cextern pd_n32768
+cextern pw_2000
 
 ;------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 @@ -5525,65 +5528,1409 @@
     FILTER_VER_LUMA_SS 64, 16
     FILTER_VER_LUMA_SS 16, 64
 
-;--------------------------------------------------------------------------------------------------
-; void filterConvertPelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
-;--------------------------------------------------------------------------------------------------
-INIT_XMM sse2
-cglobal luma_p2s, 3, 7, 5
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_2xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_2x%1, 3, 6, 2
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r1 * 3]
+    lea        r5, [r3 * 3]
 
-    add         r1, r1
+    ; load constant
+    mova       m1, [pw_2000]
 
-    ; load width and height
-    mov         r3d, r3m
-    mov         r4d, r4m
+%rep %1/4
+    movd       m0, [r0]
+    movhps     m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m1
+
+    movd       [r2 + r3 * 0], m0
+    pextrd     [r2 + r3 * 1], m0, 2
+
+    movd       m0, [r0 + r1 * 2]
+    movhps     m0, [r0 + r4]
+    psllw      m0, 4
+    psubw      m0, m1
+
+    movd       [r2 + r3 * 2], m0
+    pextrd     [r2 + r5], m0, 2
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+%endrep
+    RET
+%endmacro
+P2S_H_2xN 4
+P2S_H_2xN 8
+P2S_H_2xN 16
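All of the new P2S (pixel-to-short) kernels share one per-sample transform, visible in the psllw-by-4 / psubw-[pw_2000] pair: the 10-bit pixel is scaled up and re-centred around zero before entering the interpolation pipeline (the old luma_p2s achieved the same with paddw of tab_c_n8192, since 0x2000 == 8192). A scalar sketch follows; `pixel_to_short` is a hypothetical helper name, and the shift of 4 presumably reflects 14 minus this build's 10-bit depth.

```c
#include <stdint.h>

/* Scalar model of the P2S kernels' per-sample math:
 * dst = (src << 4) - 0x2000, producing a signed 16-bit value
 * centred around zero (pw_2000 == 8192). */
static int16_t pixel_to_short(uint16_t pix)
{
    return (int16_t)((pix << 4) - 0x2000);
}
```

For a 10-bit input range of 0..1023 this maps onto -8192..8176, comfortably inside int16_t for the later pmaddwd accumulation stages.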
+
69
+;-----------------------------------------------------------------------------
70
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
71
+;-----------------------------------------------------------------------------
72
+%macro P2S_H_4xN 1
73
+INIT_XMM ssse3
74
+cglobal filterPixelToShort_4x%1, 3, 6, 2
75
+    add        r1d, r1d
76
+    mov        r3d, r3m
77
+    add        r3d, r3d
78
+    lea        r4, [r3 * 3]
79
+    lea        r5, [r1 * 3]
80
 
81
     ; load constant
82
-    mova        m4, [tab_c_n8192]
83
+    mova       m1, [pw_2000]
84
 
85
-.loopH:
86
+%rep %1/4
87
+    movh       m0, [r0]
88
+    movhps     m0, [r0 + r1]
89
+    psllw      m0, 4
90
+    psubw      m0, m1
91
+    movh       [r2 + r3 * 0], m0
92
+    movhps     [r2 + r3 * 1], m0
93
+
94
+    movh       m0, [r0 + r1 * 2]
95
+    movhps     m0, [r0 + r5]
96
+    psllw      m0, 4
97
+    psubw      m0, m1
98
+    movh       [r2 + r3 * 2], m0
99
+    movhps     [r2 + r4], m0
100
 
101
-    xor         r5d, r5d
102
-.loopW:
103
-    lea         r6, [r0 + r5 * 2]
104
+    lea        r0, [r0 + r1 * 4]
105
+    lea        r2, [r2 + r3 * 4]
106
+%endrep
107
+    RET
108
+%endmacro
109
+P2S_H_4xN 4
110
+P2S_H_4xN 8
111
+P2S_H_4xN 16
112
+P2S_H_4xN 32
 
-    movu        m0, [r6]
-    psllw       m0, 4
-    paddw       m0, m4
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_4x2, 3, 4, 1
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
 
-    movu        m1, [r6 + r1]
-    psllw       m1, 4
-    paddw       m1, m4
+    movh       m0, [r0]
+    movhps     m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, [pw_2000]
+    movh       [r2 + r3 * 0], m0
+    movhps     [r2 + r3 * 1], m0
 
-    movu        m2, [r6 + r1 * 2]
-    psllw       m2, 4
-    paddw       m2, m4
-
-    lea         r6, [r6 + r1 * 2]
-    movu        m3, [r6 + r1]
-    psllw       m3, 4
-    paddw       m3, m4
+    RET
 
-    add         r5, 8
-    cmp         r5, r3
-    jg          .width4
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
-    je          .nextH
-    jmp         .loopW
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_6xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_6x%1, 3, 7, 3
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r3 * 3]
+    lea        r5, [r1 * 3]
 
-.width4:
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
+    ; load height
+    mov        r6d, %1/4
 
-.nextH:
-    lea         r0, [r0 + r1 * 4]
-    add         r2, FENC_STRIDE * 8
+    ; load constant
+    mova       m2, [pw_2000]
 
-    sub         r4d, 4
-    jnz         .loopH
+.loop
+    movu       m0, [r0]
+    movu       m1, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m2
+    psllw      m1, 4
+    psubw      m1, m2
+
+    movh       [r2 + r3 * 0], m0
+    pextrd     [r2 + r3 * 0 + 8], m0, 2
+    movh       [r2 + r3 * 1], m1
+    pextrd     [r2 + r3 * 1 + 8], m1, 2
+
+    movu       m0, [r0 + r1 * 2]
+    movu       m1, [r0 + r5]
+    psllw      m0, 4
+    psubw      m0, m2
+    psllw      m1, 4
+    psubw      m1, m2
+
+    movh       [r2 + r3 * 2], m0
+    pextrd     [r2 + r3 * 2 + 8], m0, 2
+    movh       [r2 + r4], m1
+    pextrd     [r2 + r4 + 8], m1, 2
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+
+    dec        r6d
+    jnz        .loop
+    RET
+%endmacro
+P2S_H_6xN 8
+P2S_H_6xN 16
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_8xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_8x%1, 3, 7, 2
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r3 * 3]
+    lea        r5, [r1 * 3]
+
+    ; load height
+    mov        r6d, %1/4
+
+    ; load constant
+    mova       m1, [pw_2000]
+
+.loop
+    movu       m0, [r0]
+    psllw      m0, 4
+    psubw      m0, m1
+    movu       [r2 + r3 * 0], m0
+
+    movu       m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m1
+    movu       [r2 + r3 * 1], m0
+
+    movu       m0, [r0 + r1 * 2]
+    psllw      m0, 4
+    psubw      m0, m1
+    movu       [r2 + r3 * 2], m0
+
+    movu       m0, [r0 + r5]
+    psllw      m0, 4
+    psubw      m0, m1
+    movu       [r2 + r4], m0
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+
+    dec        r6d
+    jnz        .loop
+    RET
+%endmacro
+P2S_H_8xN 8
+P2S_H_8xN 4
+P2S_H_8xN 16
+P2S_H_8xN 32
+P2S_H_8xN 12
+P2S_H_8xN 64
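
The plain (non-saturating) `psubw` these routines use is safe because the shifted-and-offset value always fits a signed 16-bit word. A quick check under the same illustrative 10-bit assumption:

```python
# Range check (assumption: 10-bit input, pixels span 0..1023).
# (p << 4) - 0x2000 then spans -8192..8176, comfortably inside
# int16 (-32768..32767), so no saturating arithmetic is needed.
results = [(p << 4) - 0x2000 for p in range(1024)]
assert min(results) == -8192
assert max(results) == (1023 << 4) - 0x2000  # 8176
assert all(-32768 <= v <= 32767 for v in results)
```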
270
+
271
+;-----------------------------------------------------------------------------
272
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
273
+;-----------------------------------------------------------------------------
274
+INIT_XMM ssse3
275
+cglobal filterPixelToShort_8x2, 3, 4, 2
276
+    add        r1d, r1d
277
+    mov        r3d, r3m
278
+    add        r3d, r3d
279
+
280
+    movu       m0, [r0]
281
+    movu       m1, [r0 + r1]
282
+
283
+    psllw      m0, 4
284
+    psubw      m0, [pw_2000]
285
+    psllw      m1, 4
286
+    psubw      m1, [pw_2000]
287
+
288
+    movu       [r2 + r3 * 0], m0
289
+    movu       [r2 + r3 * 1], m1
290
+
291
+    RET
292
+
293
+;-----------------------------------------------------------------------------
294
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
295
+;-----------------------------------------------------------------------------
296
+INIT_XMM ssse3
297
+cglobal filterPixelToShort_8x6, 3, 7, 4
298
+    add        r1d, r1d
299
+    mov        r3d, r3m
300
+    add        r3d, r3d
301
+    lea        r4, [r1 * 3]
302
+    lea        r5, [r1 * 5]
303
+    lea        r6, [r3 * 3]
304
+
305
+    ; load constant
306
+    mova       m3, [pw_2000]
307
+
308
+    movu       m0, [r0]
309
+    movu       m1, [r0 + r1]
310
+    movu       m2, [r0 + r1 * 2]
311
+
312
+    psllw      m0, 4
313
+    psubw      m0, m3
314
+    psllw      m1, 4
315
+    psubw      m1, m3
316
+    psllw      m2, 4
317
+    psubw      m2, m3
318
+
319
+    movu       [r2 + r3 * 0], m0
320
+    movu       [r2 + r3 * 1], m1
321
+    movu       [r2 + r3 * 2], m2
322
+
323
+    movu       m0, [r0 + r4]
324
+    movu       m1, [r0 + r1 * 4]
325
+    movu       m2, [r0 + r5 ]
326
+
327
+    psllw      m0, 4
328
+    psubw      m0, m3
329
+    psllw      m1, 4
330
+    psubw      m1, m3
331
+    psllw      m2, 4
332
+    psubw      m2, m3
333
+
334
+    movu       [r2 + r6], m0
335
+    movu       [r2 + r3 * 4], m1
336
+    lea        r2, [r2 + r3 * 4]
337
+    movu       [r2 + r3], m2
338
+
339
+    RET
340
+
341
+;-----------------------------------------------------------------------------
342
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
343
+;-----------------------------------------------------------------------------
344
+%macro P2S_H_16xN 1
345
+INIT_XMM ssse3
346
+cglobal filterPixelToShort_16x%1, 3, 7, 3
347
+    add        r1d, r1d
348
+    mov        r3d, r3m
349
+    add        r3d, r3d
350
+    lea        r4, [r3 * 3]
351
+    lea        r5, [r1 * 3]
352
+
353
+    ; load height
354
+    mov        r6d, %1/4
355
+
356
+    ; load constant
357
+    mova       m2, [pw_2000]
358
+
359
+.loop
360
+    movu       m0, [r0]
361
+    movu       m1, [r0 + r1]
362
+    psllw      m0, 4
363
+    psubw      m0, m2
364
+    psllw      m1, 4
365
+    psubw      m1, m2
366
+
367
+    movu       [r2 + r3 * 0], m0
368
+    movu       [r2 + r3 * 1], m1
369
+
370
+    movu       m0, [r0 + r1 * 2]
371
+    movu       m1, [r0 + r5]
372
+    psllw      m0, 4
373
+    psubw      m0, m2
374
+    psllw      m1, 4
375
+    psubw      m1, m2
376
+
377
+    movu       [r2 + r3 * 2], m0
378
+    movu       [r2 + r4], m1
379
+
380
+    movu       m0, [r0 + 16]
381
+    movu       m1, [r0 + r1 + 16]
382
+    psllw      m0, 4
383
+    psubw      m0, m2
384
+    psllw      m1, 4
385
+    psubw      m1, m2
386
+
387
+    movu       [r2 + r3 * 0 + 16], m0
388
+    movu       [r2 + r3 * 1 + 16], m1
389
+
390
+    movu       m0, [r0 + r1 * 2 + 16]
391
+    movu       m1, [r0 + r5 + 16]
392
+    psllw      m0, 4
393
+    psubw      m0, m2
394
+    psllw      m1, 4
395
+    psubw      m1, m2
396
+
397
+    movu       [r2 + r3 * 2 + 16], m0
398
+    movu       [r2 + r4 + 16], m1
399
+
400
+    lea        r0, [r0 + r1 * 4]
401
+    lea        r2, [r2 + r3 * 4]
402
+
403
+    dec        r6d
404
+    jnz        .loop
405
+    RET
406
+%endmacro
407
+P2S_H_16xN 16
408
+P2S_H_16xN 4
409
+P2S_H_16xN 8
410
+P2S_H_16xN 12
411
+P2S_H_16xN 32
412
+P2S_H_16xN 64
413
+P2S_H_16xN 24
414
+
415
+;-----------------------------------------------------------------------------
416
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
417
+;-----------------------------------------------------------------------------
418
+%macro P2S_H_16xN_avx2 1
419
+INIT_YMM avx2
420
+cglobal filterPixelToShort_16x%1, 3, 7, 3
421
+    add        r1d, r1d
422
+    mov        r3d, r3m
423
+    add        r3d, r3d
424
+    lea        r4, [r3 * 3]
425
+    lea        r5, [r1 * 3]
426
+
427
+    ; load height
428
+    mov        r6d, %1/4
429
+
430
+    ; load constant
431
+    mova       m2, [pw_2000]
432
+
433
+.loop
434
+    movu       m0, [r0]
435
+    movu       m1, [r0 + r1]
436
+    psllw      m0, 4
437
+    psubw      m0, m2
438
+    psllw      m1, 4
439
+    psubw      m1, m2
440
+
441
+    movu       [r2 + r3 * 0], m0
442
+    movu       [r2 + r3 * 1], m1
443
+
444
+    movu       m0, [r0 + r1 * 2]
445
+    movu       m1, [r0 + r5]
446
+    psllw      m0, 4
447
+    psubw      m0, m2
448
+    psllw      m1, 4
449
+    psubw      m1, m2
450
+
451
+    movu       [r2 + r3 * 2], m0
452
+    movu       [r2 + r4], m1
453
+
454
+    lea        r0, [r0 + r1 * 4]
455
+    lea        r2, [r2 + r3 * 4]
456
+
457
+    dec        r6d
458
+    jnz        .loop
459
+    RET
460
+%endmacro
461
+P2S_H_16xN_avx2 16
462
+P2S_H_16xN_avx2 4
463
+P2S_H_16xN_avx2 8
464
+P2S_H_16xN_avx2 12
465
+P2S_H_16xN_avx2 32
466
+P2S_H_16xN_avx2 64
467
+P2S_H_16xN_avx2 24
468
+
469
+;-----------------------------------------------------------------------------
470
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
471
+;-----------------------------------------------------------------------------
472
+%macro P2S_H_32xN 1
473
+INIT_XMM ssse3
474
+cglobal filterPixelToShort_32x%1, 3, 7, 5
475
+    add        r1d, r1d
476
+    mov        r3d, r3m
477
+    add        r3d, r3d
478
+    lea        r4, [r3 * 3]
479
+    lea        r5, [r1 * 3]
480
+
481
+    ; load height
482
+    mov        r6d, %1/4
483
+
484
+    ; load constant
485
+    mova       m4, [pw_2000]
486
+
487
+.loop
488
+    movu       m0, [r0]
489
+    movu       m1, [r0 + r1]
490
+    movu       m2, [r0 + r1 * 2]
491
+    movu       m3, [r0 + r5]
492
+    psllw      m0, 4
493
+    psubw      m0, m4
494
+    psllw      m1, 4
495
+    psubw      m1, m4
496
+    psllw      m2, 4
497
+    psubw      m2, m4
498
+    psllw      m3, 4
499
+    psubw      m3, m4
500
+
501
+    movu       [r2 + r3 * 0], m0
502
+    movu       [r2 + r3 * 1], m1
503
+    movu       [r2 + r3 * 2], m2
504
+    movu       [r2 + r4], m3
505
+
506
+    movu       m0, [r0 + 16]
507
+    movu       m1, [r0 + r1 + 16]
508
+    movu       m2, [r0 + r1 * 2 + 16]
509
+    movu       m3, [r0 + r5 + 16]
510
+    psllw      m0, 4
511
+    psubw      m0, m4
512
+    psllw      m1, 4
513
+    psubw      m1, m4
514
+    psllw      m2, 4
515
+    psubw      m2, m4
516
+    psllw      m3, 4
517
+    psubw      m3, m4
518
+
519
+    movu       [r2 + r3 * 0 + 16], m0
520
+    movu       [r2 + r3 * 1 + 16], m1
521
+    movu       [r2 + r3 * 2 + 16], m2
522
+    movu       [r2 + r4 + 16], m3
523
+
524
+    movu       m0, [r0 + 32]
525
+    movu       m1, [r0 + r1 + 32]
526
+    movu       m2, [r0 + r1 * 2 + 32]
527
+    movu       m3, [r0 + r5 + 32]
528
+    psllw      m0, 4
529
+    psubw      m0, m4
530
+    psllw      m1, 4
531
+    psubw      m1, m4
532
+    psllw      m2, 4
533
+    psubw      m2, m4
534
+    psllw      m3, 4
535
+    psubw      m3, m4
536
+
537
+    movu       [r2 + r3 * 0 + 32], m0
538
+    movu       [r2 + r3 * 1 + 32], m1
539
+    movu       [r2 + r3 * 2 + 32], m2
540
+    movu       [r2 + r4 + 32], m3
541
+
542
+    movu       m0, [r0 + 48]
543
+    movu       m1, [r0 + r1 + 48]
544
+    movu       m2, [r0 + r1 * 2 + 48]
545
+    movu       m3, [r0 + r5 + 48]
546
+    psllw      m0, 4
547
+    psubw      m0, m4
548
+    psllw      m1, 4
549
+    psubw      m1, m4
550
+    psllw      m2, 4
551
+    psubw      m2, m4
552
+    psllw      m3, 4
553
+    psubw      m3, m4
554
+
555
+    movu       [r2 + r3 * 0 + 48], m0
556
+    movu       [r2 + r3 * 1 + 48], m1
557
+    movu       [r2 + r3 * 2 + 48], m2
558
+    movu       [r2 + r4 + 48], m3
559
 
560
+    lea        r0, [r0 + r1 * 4]
561
+    lea        r2, [r2 + r3 * 4]
562
+
563
+    dec        r6d
564
+    jnz        .loop
565
+    RET
566
+%endmacro
567
+P2S_H_32xN 32
568
+P2S_H_32xN 8
569
+P2S_H_32xN 16
570
+P2S_H_32xN 24
571
+P2S_H_32xN 64
572
+P2S_H_32xN 48
573
+
574
+;-----------------------------------------------------------------------------
575
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
576
+;-----------------------------------------------------------------------------
577
+%macro P2S_H_32xN_avx2 1
578
+INIT_YMM avx2
579
+cglobal filterPixelToShort_32x%1, 3, 7, 3
580
+    add        r1d, r1d
581
+    mov        r3d, r3m
582
+    add        r3d, r3d
583
+    lea        r4, [r3 * 3]
584
+    lea        r5, [r1 * 3]
585
+
586
+    ; load height
587
+    mov        r6d, %1/4
588
+
589
+    ; load constant
590
+    mova       m2, [pw_2000]
591
+
592
+.loop
593
+    movu       m0, [r0]
594
+    movu       m1, [r0 + r1]
595
+    psllw      m0, 4
596
+    psubw      m0, m2
597
+    psllw      m1, 4
598
+    psubw      m1, m2
599
+
600
+    movu       [r2 + r3 * 0], m0
601
+    movu       [r2 + r3 * 1], m1
602
+
603
+    movu       m0, [r0 + r1 * 2]
604
+    movu       m1, [r0 + r5]
605
+    psllw      m0, 4
606
+    psubw      m0, m2
607
+    psllw      m1, 4
608
+    psubw      m1, m2
609
+
610
+    movu       [r2 + r3 * 2], m0
611
+    movu       [r2 + r4], m1
612
+
613
+    movu       m0, [r0 + 32]
614
+    movu       m1, [r0 + r1 + 32]
615
+    psllw      m0, 4
616
+    psubw      m0, m2
617
+    psllw      m1, 4
618
+    psubw      m1, m2
619
+
620
+    movu       [r2 + r3 * 0 + 32], m0
621
+    movu       [r2 + r3 * 1 + 32], m1
622
+
623
+    movu       m0, [r0 + r1 * 2 + 32]
624
+    movu       m1, [r0 + r5 + 32]
625
+    psllw      m0, 4
626
+    psubw      m0, m2
627
+    psllw      m1, 4
628
+    psubw      m1, m2
629
+
630
+    movu       [r2 + r3 * 2 + 32], m0
631
+    movu       [r2 + r4 + 32], m1
632
+
633
+    lea        r0, [r0 + r1 * 4]
634
+    lea        r2, [r2 + r3 * 4]
635
+
636
+    dec        r6d
637
+    jnz        .loop
638
+    RET
639
+%endmacro
640
+P2S_H_32xN_avx2 32
641
+P2S_H_32xN_avx2 8
642
+P2S_H_32xN_avx2 16
643
+P2S_H_32xN_avx2 24
644
+P2S_H_32xN_avx2 64
645
+P2S_H_32xN_avx2 48
646
+
647
+;-----------------------------------------------------------------------------
648
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
649
+;-----------------------------------------------------------------------------
650
+%macro P2S_H_64xN 1
651
+INIT_XMM ssse3
652
+cglobal filterPixelToShort_64x%1, 3, 7, 5
653
+    add        r1d, r1d
654
+    mov        r3d, r3m
655
+    add        r3d, r3d
656
+    lea        r4, [r3 * 3]
657
+    lea        r5, [r1 * 3]
658
+
659
+    ; load height
660
+    mov        r6d, %1/4
661
+
662
+    ; load constant
663
+    mova       m4, [pw_2000]
664
+
665
+.loop
666
+    movu       m0, [r0]
667
+    movu       m1, [r0 + r1]
668
+    movu       m2, [r0 + r1 * 2]
669
+    movu       m3, [r0 + r5]
670
+    psllw      m0, 4
671
+    psubw      m0, m4
672
+    psllw      m1, 4
673
+    psubw      m1, m4
674
+    psllw      m2, 4
675
+    psubw      m2, m4
676
+    psllw      m3, 4
677
+    psubw      m3, m4
678
+
679
+    movu       [r2 + r3 * 0], m0
680
+    movu       [r2 + r3 * 1], m1
681
+    movu       [r2 + r3 * 2], m2
682
+    movu       [r2 + r4], m3
683
+
684
+    movu       m0, [r0 + 16]
685
+    movu       m1, [r0 + r1 + 16]
686
+    movu       m2, [r0 + r1 * 2 + 16]
687
+    movu       m3, [r0 + r5 + 16]
688
+    psllw      m0, 4
689
+    psubw      m0, m4
690
+    psllw      m1, 4
691
+    psubw      m1, m4
692
+    psllw      m2, 4
693
+    psubw      m2, m4
694
+    psllw      m3, 4
695
+    psubw      m3, m4
696
+
697
+    movu       [r2 + r3 * 0 + 16], m0
698
+    movu       [r2 + r3 * 1 + 16], m1
699
+    movu       [r2 + r3 * 2 + 16], m2
700
+    movu       [r2 + r4 + 16], m3
701
+
702
+    movu       m0, [r0 + 32]
703
+    movu       m1, [r0 + r1 + 32]
704
+    movu       m2, [r0 + r1 * 2 + 32]
705
+    movu       m3, [r0 + r5 + 32]
706
+    psllw      m0, 4
707
+    psubw      m0, m4
708
+    psllw      m1, 4
709
+    psubw      m1, m4
710
+    psllw      m2, 4
711
+    psubw      m2, m4
712
+    psllw      m3, 4
713
+    psubw      m3, m4
714
+
715
+    movu       [r2 + r3 * 0 + 32], m0
716
+    movu       [r2 + r3 * 1 + 32], m1
717
+    movu       [r2 + r3 * 2 + 32], m2
718
+    movu       [r2 + r4 + 32], m3
719
+
720
+    movu       m0, [r0 + 48]
721
+    movu       m1, [r0 + r1 + 48]
722
+    movu       m2, [r0 + r1 * 2 + 48]
723
+    movu       m3, [r0 + r5 + 48]
724
+    psllw      m0, 4
725
+    psubw      m0, m4
726
+    psllw      m1, 4
727
+    psubw      m1, m4
728
+    psllw      m2, 4
729
+    psubw      m2, m4
730
+    psllw      m3, 4
731
+    psubw      m3, m4
732
+
733
+    movu       [r2 + r3 * 0 + 48], m0
734
+    movu       [r2 + r3 * 1 + 48], m1
735
+    movu       [r2 + r3 * 2 + 48], m2
736
+    movu       [r2 + r4 + 48], m3
737
+
738
+    movu       m0, [r0 + 64]
739
+    movu       m1, [r0 + r1 + 64]
740
+    movu       m2, [r0 + r1 * 2 + 64]
741
+    movu       m3, [r0 + r5 + 64]
742
+    psllw      m0, 4
743
+    psubw      m0, m4
744
+    psllw      m1, 4
745
+    psubw      m1, m4
746
+    psllw      m2, 4
747
+    psubw      m2, m4
748
+    psllw      m3, 4
749
+    psubw      m3, m4
750
+
751
+    movu       [r2 + r3 * 0 + 64], m0
752
+    movu       [r2 + r3 * 1 + 64], m1
753
+    movu       [r2 + r3 * 2 + 64], m2
754
+    movu       [r2 + r4 + 64], m3
755
+
756
+    movu       m0, [r0 + 80]
757
+    movu       m1, [r0 + r1 + 80]
758
+    movu       m2, [r0 + r1 * 2 + 80]
759
+    movu       m3, [r0 + r5 + 80]
760
+    psllw      m0, 4
761
+    psubw      m0, m4
762
+    psllw      m1, 4
763
+    psubw      m1, m4
764
+    psllw      m2, 4
765
+    psubw      m2, m4
766
+    psllw      m3, 4
767
+    psubw      m3, m4
768
+
769
+    movu       [r2 + r3 * 0 + 80], m0
770
+    movu       [r2 + r3 * 1 + 80], m1
771
+    movu       [r2 + r3 * 2 + 80], m2
772
+    movu       [r2 + r4 + 80], m3
773
+
774
+    movu       m0, [r0 + 96]
775
+    movu       m1, [r0 + r1 + 96]
776
+    movu       m2, [r0 + r1 * 2 + 96]
777
+    movu       m3, [r0 + r5 + 96]
778
+    psllw      m0, 4
779
+    psubw      m0, m4
780
+    psllw      m1, 4
781
+    psubw      m1, m4
782
+    psllw      m2, 4
783
+    psubw      m2, m4
784
+    psllw      m3, 4
785
+    psubw      m3, m4
786
+
787
+    movu       [r2 + r3 * 0 + 96], m0
788
+    movu       [r2 + r3 * 1 + 96], m1
789
+    movu       [r2 + r3 * 2 + 96], m2
790
+    movu       [r2 + r4 + 96], m3
791
+
792
+    movu       m0, [r0 + 112]
793
+    movu       m1, [r0 + r1 + 112]
794
+    movu       m2, [r0 + r1 * 2 + 112]
795
+    movu       m3, [r0 + r5 + 112]
796
+    psllw      m0, 4
797
+    psubw      m0, m4
798
+    psllw      m1, 4
799
+    psubw      m1, m4
800
+    psllw      m2, 4
801
+    psubw      m2, m4
802
+    psllw      m3, 4
803
+    psubw      m3, m4
804
+
805
+    movu       [r2 + r3 * 0 + 112], m0
806
+    movu       [r2 + r3 * 1 + 112], m1
807
+    movu       [r2 + r3 * 2 + 112], m2
808
+    movu       [r2 + r4 + 112], m3
809
+
810
+    lea        r0, [r0 + r1 * 4]
811
+    lea        r2, [r2 + r3 * 4]
812
+
813
+    dec        r6d
814
+    jnz        .loop
815
+    RET
816
+%endmacro
817
+P2S_H_64xN 64
818
+P2S_H_64xN 16
819
+P2S_H_64xN 32
820
+P2S_H_64xN 48
821
+
822
+;-----------------------------------------------------------------------------
823
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
824
+;-----------------------------------------------------------------------------
825
+%macro P2S_H_64xN_avx2 1
826
+INIT_YMM avx2
827
+cglobal filterPixelToShort_64x%1, 3, 7, 3
828
+    add        r1d, r1d
829
+    mov        r3d, r3m
830
+    add        r3d, r3d
831
+    lea        r4, [r3 * 3]
832
+    lea        r5, [r1 * 3]
833
+
834
+    ; load height
835
+    mov        r6d, %1/4
836
+
837
+    ; load constant
838
+    mova       m2, [pw_2000]
839
+
840
+.loop
841
+    movu       m0, [r0]
842
+    movu       m1, [r0 + r1]
843
+    psllw      m0, 4
844
+    psubw      m0, m2
845
+    psllw      m1, 4
846
+    psubw      m1, m2
847
+
848
+    movu       [r2 + r3 * 0], m0
849
+    movu       [r2 + r3 * 1], m1
850
+
851
+    movu       m0, [r0 + r1 * 2]
852
+    movu       m1, [r0 + r5]
853
+    psllw      m0, 4
854
+    psubw      m0, m2
855
+    psllw      m1, 4
856
+    psubw      m1, m2
857
+
858
+    movu       [r2 + r3 * 2], m0
859
+    movu       [r2 + r4], m1
860
+
861
+    movu       m0, [r0 + 32]
862
+    movu       m1, [r0 + r1 + 32]
863
+    psllw      m0, 4
864
+    psubw      m0, m2
865
+    psllw      m1, 4
866
+    psubw      m1, m2
867
+
868
+    movu       [r2 + r3 * 0 + 32], m0
869
+    movu       [r2 + r3 * 1 + 32], m1
870
+
871
+    movu       m0, [r0 + r1 * 2 + 32]
872
+    movu       m1, [r0 + r5 + 32]
873
+    psllw      m0, 4
874
+    psubw      m0, m2
875
+    psllw      m1, 4
876
+    psubw      m1, m2
877
+
878
+    movu       [r2 + r3 * 2 + 32], m0
879
+    movu       [r2 + r4 + 32], m1
880
+
881
+    movu       m0, [r0 + 64]
882
+    movu       m1, [r0 + r1 + 64]
883
+    psllw      m0, 4
884
+    psubw      m0, m2
885
+    psllw      m1, 4
886
+    psubw      m1, m2
887
+
888
+    movu       [r2 + r3 * 0 + 64], m0
889
+    movu       [r2 + r3 * 1 + 64], m1
890
+
891
+    movu       m0, [r0 + r1 * 2 + 64]
892
+    movu       m1, [r0 + r5 + 64]
893
+    psllw      m0, 4
894
+    psubw      m0, m2
895
+    psllw      m1, 4
896
+    psubw      m1, m2
897
+
898
+    movu       [r2 + r3 * 2 + 64], m0
899
+    movu       [r2 + r4 + 64], m1
900
+
901
+    movu       m0, [r0 + 96]
902
+    movu       m1, [r0 + r1 + 96]
903
+    psllw      m0, 4
904
+    psubw      m0, m2
905
+    psllw      m1, 4
906
+    psubw      m1, m2
907
+
908
+    movu       [r2 + r3 * 0 + 96], m0
909
+    movu       [r2 + r3 * 1 + 96], m1
910
+
911
+    movu       m0, [r0 + r1 * 2 + 96]
912
+    movu       m1, [r0 + r5 + 96]
913
+    psllw      m0, 4
914
+    psubw      m0, m2
915
+    psllw      m1, 4
916
+    psubw      m1, m2
917
+
918
+    movu       [r2 + r3 * 2 + 96], m0
919
+    movu       [r2 + r4 + 96], m1
920
+
921
+    lea        r0, [r0 + r1 * 4]
922
+    lea        r2, [r2 + r3 * 4]
923
+
924
+    dec        r6d
925
+    jnz        .loop
926
+    RET
927
+%endmacro
928
+P2S_H_64xN_avx2 64
929
+P2S_H_64xN_avx2 16
930
+P2S_H_64xN_avx2 32
931
+P2S_H_64xN_avx2 48
932
+
933
+;-----------------------------------------------------------------------------
934
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
935
+;-----------------------------------------------------------------------------
936
+%macro P2S_H_24xN 1
937
+INIT_XMM ssse3
938
+cglobal filterPixelToShort_24x%1, 3, 7, 5
939
+    add        r1d, r1d
940
+    mov        r3d, r3m
941
+    add        r3d, r3d
942
+    lea        r4, [r3 * 3]
943
+    lea        r5, [r1 * 3]
944
+
945
+    ; load height
946
+    mov        r6d, %1/4
947
+
948
+    ; load constant
949
+    mova       m4, [pw_2000]
950
+
951
+.loop
952
+    movu       m0, [r0]
953
+    movu       m1, [r0 + r1]
954
+    movu       m2, [r0 + r1 * 2]
955
+    movu       m3, [r0 + r5]
956
+    psllw      m0, 4
957
+    psubw      m0, m4
958
+    psllw      m1, 4
959
+    psubw      m1, m4
960
+    psllw      m2, 4
961
+    psubw      m2, m4
962
+    psllw      m3, 4
963
+    psubw      m3, m4
964
+
965
+    movu       [r2 + r3 * 0], m0
966
+    movu       [r2 + r3 * 1], m1
967
+    movu       [r2 + r3 * 2], m2
968
+    movu       [r2 + r4], m3
969
+
970
+    movu       m0, [r0 + 16]
971
+    movu       m1, [r0 + r1 + 16]
972
+    movu       m2, [r0 + r1 * 2 + 16]
973
+    movu       m3, [r0 + r5 + 16]
974
+    psllw      m0, 4
975
+    psubw      m0, m4
976
+    psllw      m1, 4
977
+    psubw      m1, m4
978
+    psllw      m2, 4
979
+    psubw      m2, m4
980
+    psllw      m3, 4
981
+    psubw      m3, m4
982
+
983
+    movu       [r2 + r3 * 0 + 16], m0
984
+    movu       [r2 + r3 * 1 + 16], m1
985
+    movu       [r2 + r3 * 2 + 16], m2
986
+    movu       [r2 + r4 + 16], m3
987
+
988
+    movu       m0, [r0 + 32]
989
+    movu       m1, [r0 + r1 + 32]
990
+    movu       m2, [r0 + r1 * 2 + 32]
991
+    movu       m3, [r0 + r5 + 32]
992
+    psllw      m0, 4
993
+    psubw      m0, m4
994
+    psllw      m1, 4
995
+    psubw      m1, m4
996
+    psllw      m2, 4
997
+    psubw      m2, m4
998
+    psllw      m3, 4
999
+    psubw      m3, m4
1000
+
1001
+    movu       [r2 + r3 * 0 + 32], m0
1002
+    movu       [r2 + r3 * 1 + 32], m1
1003
+    movu       [r2 + r3 * 2 + 32], m2
1004
+    movu       [r2 + r4 + 32], m3
1005
+
1006
+    lea        r0, [r0 + r1 * 4]
1007
+    lea        r2, [r2 + r3 * 4]
1008
+
1009
+    dec        r6d
1010
+    jnz        .loop
1011
+    RET
1012
+%endmacro
1013
+P2S_H_24xN 32
1014
+P2S_H_24xN 64
1015
+
1016
+;-----------------------------------------------------------------------------
1017
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1018
+;-----------------------------------------------------------------------------
1019
+%macro P2S_H_24xN_avx2 1
1020
+INIT_YMM avx2
1021
+cglobal filterPixelToShort_24x%1, 3, 7, 3
1022
+    add        r1d, r1d
1023
+    mov        r3d, r3m
1024
+    add        r3d, r3d
1025
+    lea        r4, [r3 * 3]
1026
+    lea        r5, [r1 * 3]
1027
+
1028
+    ; load height
1029
+    mov        r6d, %1/4
1030
+
1031
+    ; load constant
1032
+    mova       m2, [pw_2000]
1033
+
1034
+.loop
1035
+    movu       m0, [r0]
1036
+    movu       m1, [r0 + 32]
1037
+    psllw      m0, 4
1038
+    psubw      m0, m2
1039
+    psllw      m1, 4
1040
+    psubw      m1, m2
1041
+    movu       [r2 + r3 * 0], m0
1042
+    movu       [r2 + r3 * 0 + 32], xm1
1043
+
1044
+    movu       m0, [r0 + r1]
1045
+    movu       m1, [r0 + r1 + 32]
1046
+    psllw      m0, 4
1047
+    psubw      m0, m2
1048
+    psllw      m1, 4
1049
+    psubw      m1, m2
1050
+    movu       [r2 + r3 * 1], m0
1051
+    movu       [r2 + r3 * 1 + 32], xm1
1052
+
1053
+    movu       m0, [r0 + r1 * 2]
1054
+    movu       m1, [r0 + r1 * 2 + 32]
1055
+    psllw      m0, 4
1056
+    psubw      m0, m2
1057
+    psllw      m1, 4
1058
+    psubw      m1, m2
1059
+    movu       [r2 + r3 * 2], m0
1060
+    movu       [r2 + r3 * 2 + 32], xm1
1061
+
1062
+    movu       m0, [r0 + r5]
1063
+    movu       m1, [r0 + r5 + 32]
1064
+    psllw      m0, 4
1065
+    psubw      m0, m2
1066
+    psllw      m1, 4
1067
+    psubw      m1, m2
1068
+    movu       [r2 + r4], m0
1069
+    movu       [r2 + r4 + 32], xm1
1070
+
1071
+    lea        r0, [r0 + r1 * 4]
1072
+    lea        r2, [r2 + r3 * 4]
1073
+
1074
+    dec        r6d
1075
+    jnz        .loop
1076
+    RET
1077
+%endmacro
1078
+P2S_H_24xN_avx2 32
1079
+P2S_H_24xN_avx2 64
1080
+
1081
+;-----------------------------------------------------------------------------
1082
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1083
+;-----------------------------------------------------------------------------
1084
+%macro P2S_H_12xN 1
1085
+INIT_XMM ssse3
1086
+cglobal filterPixelToShort_12x%1, 3, 7, 3
1087
+    add        r1d, r1d
1088
+    mov        r3d, r3m
1089
+    add        r3d, r3d
1090
+    lea        r4, [r3 * 3]
1091
+    lea        r5, [r1 * 3]
1092
+
1093
+    ; load height
1094
+    mov        r6d, %1/4
1095
+
1096
+    ; load constant
1097
+    mova       m2, [pw_2000]
1098
+
1099
+.loop
1100
+    movu       m0, [r0]
1101
+    movu       m1, [r0 + r1]
1102
+    psllw      m0, 4
1103
+    psubw      m0, m2
1104
+    psllw      m1, 4
1105
+    psubw      m1, m2
1106
+
1107
+    movu       [r2 + r3 * 0], m0
1108
+    movu       [r2 + r3 * 1], m1
1109
+
1110
+    movu       m0, [r0 + r1 * 2]
1111
+    movu       m1, [r0 + r5]
1112
+    psllw      m0, 4
1113
+    psubw      m0, m2
1114
+    psllw      m1, 4
1115
+    psubw      m1, m2
1116
+
1117
+    movu       [r2 + r3 * 2], m0
1118
+    movu       [r2 + r4], m1
1119
+
1120
+    movh       m0, [r0 + 16]
1121
+    movhps     m0, [r0 + r1 + 16]
1122
+    psllw      m0, 4
1123
+    psubw      m0, m2
1124
+
1125
+    movh       [r2 + r3 * 0 + 16], m0
1126
+    movhps     [r2 + r3 * 1 + 16], m0
1127
+
1128
+    movh       m0, [r0 + r1 * 2 + 16]
1129
+    movhps     m0, [r0 + r5 + 16]
1130
+    psllw      m0, 4
1131
+    psubw      m0, m2
1132
+
1133
+    movh       [r2 + r3 * 2 + 16], m0
1134
+    movhps     [r2 + r4 + 16], m0
1135
+
1136
+    lea        r0, [r0 + r1 * 4]
1137
+    lea        r2, [r2 + r3 * 4]
1138
+
1139
+    dec        r6d
1140
+    jnz        .loop
1141
+    RET
1142
+%endmacro
1143
+P2S_H_12xN 16
1144
+P2S_H_12xN 32
1145
+
1146
+;-----------------------------------------------------------------------------
1147
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1148
+;-----------------------------------------------------------------------------
1149
+INIT_XMM ssse3
1150
+cglobal filterPixelToShort_48x64, 3, 7, 5
1151
+    add        r1d, r1d
1152
+    mov        r3d, r3m
1153
+    add        r3d, r3d
1154
+    lea        r4, [r3 * 3]
1155
+    lea        r5, [r1 * 3]
1156
+
1157
+    ; load height
1158
+    mov        r6d, 16
1159
+
1160
+    ; load constant
1161
+    mova       m4, [pw_2000]
1162
+
1163
+.loop
1164
+    movu       m0, [r0]
1165
+    movu       m1, [r0 + r1]
1166
+    movu       m2, [r0 + r1 * 2]
1167
+    movu       m3, [r0 + r5]
1168
+    psllw      m0, 4
1169
+    psubw      m0, m4
1170
+    psllw      m1, 4
1171
+    psubw      m1, m4
1172
+    psllw      m2, 4
1173
+    psubw      m2, m4
1174
+    psllw      m3, 4
1175
+    psubw      m3, m4
1176
+
1177
+    movu       [r2 + r3 * 0], m0
1178
+    movu       [r2 + r3 * 1], m1
1179
+    movu       [r2 + r3 * 2], m2
1180
+    movu       [r2 + r4], m3
1181
+
1182
+    movu       m0, [r0 + 16]
1183
+    movu       m1, [r0 + r1 + 16]
1184
+    movu       m2, [r0 + r1 * 2 + 16]
1185
+    movu       m3, [r0 + r5 + 16]
+    psllw      m0, 4
+    psubw      m0, m4
+    psllw      m1, 4
+    psubw      m1, m4
+    psllw      m2, 4
+    psubw      m2, m4
+    psllw      m3, 4
+    psubw      m3, m4
+
+    movu       [r2 + r3 * 0 + 16], m0
+    movu       [r2 + r3 * 1 + 16], m1
+    movu       [r2 + r3 * 2 + 16], m2
+    movu       [r2 + r4 + 16], m3
+
+    movu       m0, [r0 + 32]
+    movu       m1, [r0 + r1 + 32]
+    movu       m2, [r0 + r1 * 2 + 32]
+    movu       m3, [r0 + r5 + 32]
+    psllw      m0, 4
+    psubw      m0, m4
+    psllw      m1, 4
+    psubw      m1, m4
+    psllw      m2, 4
+    psubw      m2, m4
+    psllw      m3, 4
+    psubw      m3, m4
+
+    movu       [r2 + r3 * 0 + 32], m0
+    movu       [r2 + r3 * 1 + 32], m1
+    movu       [r2 + r3 * 2 + 32], m2
+    movu       [r2 + r4 + 32], m3
+
+    movu       m0, [r0 + 48]
+    movu       m1, [r0 + r1 + 48]
+    movu       m2, [r0 + r1 * 2 + 48]
+    movu       m3, [r0 + r5 + 48]
+    psllw      m0, 4
+    psubw      m0, m4
+    psllw      m1, 4
+    psubw      m1, m4
+    psllw      m2, 4
+    psubw      m2, m4
+    psllw      m3, 4
+    psubw      m3, m4
+
+    movu       [r2 + r3 * 0 + 48], m0
+    movu       [r2 + r3 * 1 + 48], m1
+    movu       [r2 + r3 * 2 + 48], m2
+    movu       [r2 + r4 + 48], m3
+
+    movu       m0, [r0 + 64]
+    movu       m1, [r0 + r1 + 64]
+    movu       m2, [r0 + r1 * 2 + 64]
+    movu       m3, [r0 + r5 + 64]
+    psllw      m0, 4
+    psubw      m0, m4
+    psllw      m1, 4
+    psubw      m1, m4
+    psllw      m2, 4
+    psubw      m2, m4
+    psllw      m3, 4
+    psubw      m3, m4
+
+    movu       [r2 + r3 * 0 + 64], m0
+    movu       [r2 + r3 * 1 + 64], m1
+    movu       [r2 + r3 * 2 + 64], m2
+    movu       [r2 + r4 + 64], m3
+
+    movu       m0, [r0 + 80]
+    movu       m1, [r0 + r1 + 80]
+    movu       m2, [r0 + r1 * 2 + 80]
+    movu       m3, [r0 + r5 + 80]
+    psllw      m0, 4
+    psubw      m0, m4
+    psllw      m1, 4
+    psubw      m1, m4
+    psllw      m2, 4
+    psubw      m2, m4
+    psllw      m3, 4
+    psubw      m3, m4
+
+    movu       [r2 + r3 * 0 + 80], m0
+    movu       [r2 + r3 * 1 + 80], m1
+    movu       [r2 + r3 * 2 + 80], m2
+    movu       [r2 + r4 + 80], m3
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+
+    dec        r6d
+    jnz        .loop
     RET
1278
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal filterPixelToShort_48x64, 3, 7, 4
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r3 * 3]
+    lea        r5, [r1 * 3]
+
+    ; load height
+    mov        r6d, 16
+
+    ; load constant
+    mova       m3, [pw_2000]
+
+.loop:
+    movu       m0, [r0]
+    movu       m1, [r0 + 32]
+    movu       m2, [r0 + 64]
+    psllw      m0, 4
+    psubw      m0, m3
+    psllw      m1, 4
+    psubw      m1, m3
+    psllw      m2, 4
+    psubw      m2, m3
+    movu       [r2 + r3 * 0], m0
+    movu       [r2 + r3 * 0 + 32], m1
+    movu       [r2 + r3 * 0 + 64], m2
+
+    movu       m0, [r0 + r1]
+    movu       m1, [r0 + r1 + 32]
+    movu       m2, [r0 + r1 + 64]
+    psllw      m0, 4
+    psubw      m0, m3
+    psllw      m1, 4
+    psubw      m1, m3
+    psllw      m2, 4
+    psubw      m2, m3
+    movu       [r2 + r3 * 1], m0
+    movu       [r2 + r3 * 1 + 32], m1
+    movu       [r2 + r3 * 1 + 64], m2
+
+    movu       m0, [r0 + r1 * 2]
+    movu       m1, [r0 + r1 * 2 + 32]
+    movu       m2, [r0 + r1 * 2 + 64]
+    psllw      m0, 4
+    psubw      m0, m3
+    psllw      m1, 4
+    psubw      m1, m3
+    psllw      m2, 4
+    psubw      m2, m3
+    movu       [r2 + r3 * 2], m0
+    movu       [r2 + r3 * 2 + 32], m1
+    movu       [r2 + r3 * 2 + 64], m2
+
+    movu       m0, [r0 + r5]
+    movu       m1, [r0 + r5 + 32]
+    movu       m2, [r0 + r5 + 64]
+    psllw      m0, 4
+    psubw      m0, m3
+    psllw      m1, 4
+    psubw      m1, m3
+    psllw      m2, 4
+    psubw      m2, m3
+    movu       [r2 + r4], m0
+    movu       [r2 + r4 + 32], m1
+    movu       [r2 + r4 + 64], m2
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+
+    dec        r6d
+    jnz        .loop
+    RET
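The `filterPixelToShort` kernels above all perform the same per-pixel transform: shift the 16-bit input left by 4 and subtract the `pw_2000` constant (0x2000). A minimal scalar model of that transform, assuming 10-bit input pixels (the function name `pixel_to_short` is ours, not x265's):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the psllw/psubw pw_2000 pair used by the
 * filterPixelToShort kernels: dst = (src << 4) - 0x2000. */
static void pixel_to_short(const uint16_t *src, intptr_t srcStride,
                           int16_t *dst, intptr_t dstStride,
                           int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] =
                (int16_t)((src[y * srcStride + x] << 4) - 0x2000);
}
```

Under this model a 10-bit mid-gray pixel (512) maps to 0, i.e. the output is centered around zero for the later weighted-prediction stages.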
1355
+
+
+;-----------------------------------------------------------------------------------------------------------------------------
+;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+
+%macro IPFILTER_LUMA_PS_4xN_AVX2 1
+INIT_YMM avx2
+%if ARCH_X86_64 == 1
+cglobal interp_8tap_horiz_ps_4x%1, 6,8,7
+    mov                         r5d,               r5m
+    mov                         r4d,               r4m
+    add                         r1d,               r1d
+    add                         r3d,               r3d
+%ifdef PIC
+
+    lea                         r6,                [tab_LumaCoeff]
+    lea                         r4,                [r4 * 8]
+    vbroadcasti128              m0,                [r6 + r4 * 2]
+
+%else
+    lea                         r4,                [r4 * 8]
+    vbroadcasti128              m0,                [tab_LumaCoeff + r4 * 2]
+%endif
+
+    vbroadcasti128              m2,                [pd_n32768]
+
+    ; register map
+    ; m0 - interpolation coefficients
+    ; m2 - rounding offset (pd_n32768)
+
+    sub                         r0,                6
+    test                        r5d,               r5d
+    mov                         r7d,               %1                                    ; loop count variable - height
+    jz                         .preloop
+    lea                         r6,                [r1 * 3]                              ; r6 = (N / 2 - 1) * srcStride
+    sub                         r0,                r6                                    ; r0(src) - 3 * srcStride
+    add                         r7d,               6                                     ; row extension needs N - 1 = 7 extra rows; add 6 here since the last row is handled outside the loop
+
+.preloop:
+    lea                         r6,                [r3 * 3]
+.loop:
+    ; Row 0
+    movu                        xm3,                [r0]                                 ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    movu                        xm4,                [r0 + 2]                             ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    vinserti128                 m3,                 m3,                xm4,       1
+    movu                        xm4,                [r0 + 4]
+    movu                        xm5,                [r0 + 6]
+    vinserti128                 m4,                 m4,                xm5,       1
+    pmaddwd                     m3,                m0
+    pmaddwd                     m4,                m0
+    phaddd                      m3,                m4                                    ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]
+
+    ; Row 1
+    movu                        xm4,                [r0 + r1]                            ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    movu                        xm5,                [r0 + r1 + 2]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    vinserti128                 m4,                 m4,                xm5,       1
+    movu                        xm5,                [r0 + r1 + 4]
+    movu                        xm6,                [r0 + r1 + 6]
+    vinserti128                 m5,                 m5,                xm6,       1
+    pmaddwd                     m4,                m0
+    pmaddwd                     m5,                m0
+    phaddd                      m4,                m5                                     ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
+    phaddd                      m3,                m4                                     ; all rows and columns complete
+
+    mova                        m5,                [interp8_hps_shuf]
+    vpermd                      m3,                m5,                  m3
+    paddd                       m3,                m2
+    vextracti128                xm4,               m3,                  1
+    psrad                       xm3,               2
+    psrad                       xm4,               2
+    packssdw                    xm3,               xm3
+    packssdw                    xm4,               xm4
+
+    movq                        [r2],              xm3                                   ; row 0
+    movq                        [r2 + r3],         xm4                                   ; row 1
+    lea                         r0,                [r0 + r1 * 2]                         ; advance src by two rows
+    lea                         r2,                [r2 + r3 * 2]                         ; advance dst by two rows
+
+    sub                         r7d,               2
+    jg                          .loop
+    test                        r5d,               r5d
+    jz                          .end
+
+    ; last (extension) row
+    movu                        xm3,                [r0]
+    movu                        xm4,                [r0 + 2]
+    vinserti128                 m3,                 m3,                 xm4,      1
+    movu                        xm4,                [r0 + 4]
+    movu                        xm5,                [r0 + 6]
+    vinserti128                 m4,                 m4,                 xm5,      1
+    pmaddwd                     m3,                m0
+    pmaddwd                     m4,                m0
+    phaddd                      m3,                m4
+
+    phaddd                      m3,                m4                                    ; all rows and columns complete
+
+    mova                        m5,                [interp8_hps_shuf]
+    vpermd                      m3,                m5,                  m3
+    paddd                       m3,                m2
+    vextracti128                xm4,               m3,                  1
+    psrad                       xm3,               2
+    psrad                       xm4,               2
+    packssdw                    xm3,               xm3
+    packssdw                    xm4,               xm4
+
+    movq                        [r2],              xm3                                   ; last row
+.end:
+    RET
+%endif
+%endmacro
+
+    IPFILTER_LUMA_PS_4xN_AVX2 4
+    IPFILTER_LUMA_PS_4xN_AVX2 8
+    IPFILTER_LUMA_PS_4xN_AVX2 16
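The macro above computes the luma 8-tap horizontal filter in "ps" form (pixel in, intermediate `int16_t` out). A scalar sketch of one output sample follows; the tap table matches `tab_LumaCoeff` from this file, while the `>> 2` and the -32768 offset mirror the `psrad`/`paddd pd_n32768` sequence (our reading of the 10-bit-input case — the function name `interp8_hps` and the fixed shift are our assumptions, not x265 API):

```c
#include <assert.h>
#include <stdint.h>

/* Taps per tab_LumaCoeff (coeffIdx selects the fractional position). */
static const int16_t luma_taps[4][8] = {
    { 0, 0,   0, 64,  0,   0, 0,  0 },
    {-1, 4, -10, 58, 17,  -5, 1,  0 },
    {-1, 4, -11, 40, 40, -11, 4, -1 },
    { 0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* One horizontal "ps" output sample for 16-bit (10-bit depth) pixels.
 * Assumption: shift = 2 and offset = -32768, matching the paddd
 * pd_n32768 + psrad 2 sequence in the AVX2 macro above. */
static int16_t interp8_hps(const uint16_t *src, int x, int coeffIdx)
{
    int32_t sum = 0;
    for (int k = 0; k < 8; k++)        /* window starts 3 pixels left of x */
        sum += luma_taps[coeffIdx][k] * src[x + k - 3];
    return (int16_t)((sum - 32768) >> 2);
}
```

On flat input every tap row sums to 64, so the result reduces to `(64*v - 32768) >> 2 = 16*v - 8192`, the same mapping the pixel-to-short kernels apply.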
1472
x265_1.6.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter8.asm Changed
 
@@ -27,269 +27,269 @@
 %include "x86util.asm"
 
 SECTION_RODATA 32
-tab_Tm:    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
-           db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
-           db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
+const tab_Tm,    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+                 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
+                 db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
 
12
-ALIGN 32
 const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15
 
-ALIGN 32
 const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9
                         times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13
 
-ALIGN 32
 const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4
                          dd 2, 3, 3, 4, 4, 5, 5, 6
 
-ALIGN 32
 const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
                      times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10
                      times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12
                      times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14
 
-ALIGN 32
-tab_Lm:    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8
-           db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10
-           db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12
-           db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14
-
-tab_Vm:    db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-           db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
-
-tab_Cm:    db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3
-
-tab_c_526336:   times 4 dd 8192*64+2048
-
-pd_526336:      times 8 dd 8192*64+2048
-
-tab_ChromaCoeff: db  0, 64,  0,  0
-                 db -2, 58, 10, -2
-                 db -4, 54, 16, -2
-                 db -6, 46, 28, -4
-                 db -4, 36, 36, -4
-                 db -4, 28, 46, -6
-                 db -2, 16, 54, -4
-                 db -2, 10, 58, -2
-ALIGN 32
-tab_ChromaCoeff_V: times 8 db 0, 64
-                   times 8 db 0,  0
+const tab_Lm,    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8
+                 db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10
+                 db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12
+                 db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14
 
-                   times 8 db -2, 58
-                   times 8 db 10, -2
+const tab_Vm,    db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+                 db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
 
-                   times 8 db -4, 54
-                   times 8 db 16, -2
+const tab_Cm,    db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3
 
-                   times 8 db -6, 46
-                   times 8 db 28, -4
+const pd_526336, times 8 dd 8192*64+2048
 
-                   times 8 db -4, 36
-                   times 8 db 36, -4
+const tab_ChromaCoeff, db  0, 64,  0,  0
+                       db -2, 58, 10, -2
+                       db -4, 54, 16, -2
+                       db -6, 46, 28, -4
+                       db -4, 36, 36, -4
+                       db -4, 28, 46, -6
+                       db -2, 16, 54, -4
+                       db -2, 10, 58, -2
 
-                   times 8 db -4, 28
-                   times 8 db 46, -6
+const tabw_ChromaCoeff, dw  0, 64,  0,  0
+                        dw -2, 58, 10, -2
+                        dw -4, 54, 16, -2
+                        dw -6, 46, 28, -4
+                        dw -4, 36, 36, -4
+                        dw -4, 28, 46, -6
+                        dw -2, 16, 54, -4
+                        dw -2, 10, 58, -2
 
-                   times 8 db -2, 16
-                   times 8 db 54, -4
+const tab_ChromaCoeff_V, times 8 db 0, 64
+                         times 8 db 0,  0
 
-                   times 8 db -2, 10
-                   times 8 db 58, -2
+                         times 8 db -2, 58
+                         times 8 db 10, -2
 
-tab_ChromaCoeffV: times 4 dw 0, 64
-                  times 4 dw 0, 0
+                         times 8 db -4, 54
+                         times 8 db 16, -2
 
-                  times 4 dw -2, 58
-                  times 4 dw 10, -2
+                         times 8 db -6, 46
+                         times 8 db 28, -4
 
-                  times 4 dw -4, 54
-                  times 4 dw 16, -2
+                         times 8 db -4, 36
+                         times 8 db 36, -4
 
-                  times 4 dw -6, 46 
-                  times 4 dw 28, -4
+                         times 8 db -4, 28
+                         times 8 db 46, -6
 
-                  times 4 dw -4, 36
-                  times 4 dw 36, -4
+                         times 8 db -2, 16
+                         times 8 db 54, -4
 
-                  times 4 dw -4, 28
-                  times 4 dw 46, -6
+                         times 8 db -2, 10
+                         times 8 db 58, -2
 
-                  times 4 dw -2, 16
-                  times 4 dw 54, -4
+const tab_ChromaCoeffV, times 4 dw 0, 64
+                        times 4 dw 0, 0
 
-                  times 4 dw -2, 10
-                  times 4 dw 58, -2
+                        times 4 dw -2, 58
+                        times 4 dw 10, -2
 
-ALIGN 32
-pw_ChromaCoeffV:  times 8 dw 0, 64
-                  times 8 dw 0, 0
+                        times 4 dw -4, 54
+                        times 4 dw 16, -2
 
-                  times 8 dw -2, 58
-                  times 8 dw 10, -2
+                        times 4 dw -6, 46
+                        times 4 dw 28, -4
 
-                  times 8 dw -4, 54
-                  times 8 dw 16, -2
+                        times 4 dw -4, 36
+                        times 4 dw 36, -4
 
-                  times 8 dw -6, 46 
-                  times 8 dw 28, -4
-
-                  times 8 dw -4, 36
-                  times 8 dw 36, -4
-
-                  times 8 dw -4, 28
-                  times 8 dw 46, -6
-
-                  times 8 dw -2, 16
-                  times 8 dw 54, -4
-
-                  times 8 dw -2, 10
-                  times 8 dw 58, -2
-
176
-tab_LumaCoeff:   db   0, 0,  0,  64,  0,   0,  0,  0
-                 db  -1, 4, -10, 58,  17, -5,  1,  0
-                 db  -1, 4, -11, 40,  40, -11, 4, -1
-                 db   0, 1, -5,  17,  58, -10, 4, -1
-
-tab_LumaCoeffV: times 4 dw 0, 0
-                times 4 dw 0, 64
-                times 4 dw 0, 0
-                times 4 dw 0, 0
-
-                times 4 dw -1, 4
-                times 4 dw -10, 58
-                times 4 dw 17, -5
-                times 4 dw 1, 0
-
-                times 4 dw -1, 4
-                times 4 dw -11, 40
-                times 4 dw 40, -11
-                times 4 dw 4, -1
-
-                times 4 dw 0, 1
-                times 4 dw -5, 17
-                times 4 dw 58, -10
-                times 4 dw 4, -1
+                        times 4 dw -4, 28
+                        times 4 dw 46, -6
 
-ALIGN 32
-pw_LumaCoeffVer: times 8 dw 0, 0
-                 times 8 dw 0, 64
-                 times 8 dw 0, 0
-                 times 8 dw 0, 0
-
-                 times 8 dw -1, 4
-                 times 8 dw -10, 58
-                 times 8 dw 17, -5
-                 times 8 dw 1, 0
-
-                 times 8 dw -1, 4
-                 times 8 dw -11, 40
-                 times 8 dw 40, -11
-                 times 8 dw 4, -1
-
-                 times 8 dw 0, 1
-                 times 8 dw -5, 17
-                 times 8 dw 58, -10
-                 times 8 dw 4, -1
-
-pb_LumaCoeffVer: times 16 db 0, 0
-                 times 16 db 0, 64
-                 times 16 db 0, 0
-                 times 16 db 0, 0
-
-                 times 16 db -1, 4
-                 times 16 db -10, 58
-                 times 16 db 17, -5
-                 times 16 db 1, 0
-
-                 times 16 db -1, 4
-                 times 16 db -11, 40
-                 times 16 db 40, -11
-                 times 16 db 4, -1
-
-                 times 16 db 0, 1
-                 times 16 db -5, 17
-                 times 16 db 58, -10
-                 times 16 db 4, -1
-
-tab_LumaCoeffVer: times 8 db 0, 0
-                  times 8 db 0, 64
-                  times 8 db 0, 0
-                  times 8 db 0, 0
-
-                  times 8 db -1, 4
-                  times 8 db -10, 58
-                  times 8 db 17, -5
-                  times 8 db 1, 0
-
-                  times 8 db -1, 4
-                  times 8 db -11, 40
-                  times 8 db 40, -11
-                  times 8 db 4, -1
-
-                  times 8 db 0, 1
-                  times 8 db -5, 17
-                  times 8 db 58, -10
-                  times 8 db 4, -1
+                        times 4 dw -2, 16
+                        times 4 dw 54, -4
 
-ALIGN 32
-tab_LumaCoeffVer_32: times 16 db 0, 0
-                     times 16 db 0, 64
-                     times 16 db 0, 0
-                     times 16 db 0, 0
-
-                     times 16 db -1, 4
-                     times 16 db -10, 58
-                     times 16 db 17, -5
-                     times 16 db 1, 0
-
-                     times 16 db -1, 4
-                     times 16 db -11, 40
-                     times 16 db 40, -11
-                     times 16 db 4, -1
-
-                     times 16 db 0, 1
-                     times 16 db -5, 17
-                     times 16 db 58, -10
-                     times 16 db 4, -1
+                        times 4 dw -2, 10
+                        times 4 dw 58, -2
 
-ALIGN 32
-tab_ChromaCoeffVer_32: times 16 db 0, 64
-                       times 16 db 0, 0
+const pw_ChromaCoeffV,  times 8 dw 0, 64
+                        times 8 dw 0, 0
+
+                        times 8 dw -2, 58
+                        times 8 dw 10, -2
+
+                        times 8 dw -4, 54
+                        times 8 dw 16, -2
+
+                        times 8 dw -6, 46
+                        times 8 dw 28, -4
+
+                        times 8 dw -4, 36
+                        times 8 dw 36, -4
+
+                        times 8 dw -4, 28
+                        times 8 dw 46, -6
+
+                        times 8 dw -2, 16
+                        times 8 dw 54, -4
+
+                        times 8 dw -2, 10
+                        times 8 dw 58, -2
+
+const tab_LumaCoeff,   db   0, 0,  0,  64,  0,   0,  0,  0
+                       db  -1, 4, -10, 58,  17, -5,  1,  0
+                       db  -1, 4, -11, 40,  40, -11, 4, -1
+                       db   0, 1, -5,  17,  58, -10, 4, -1
+
+const tabw_LumaCoeff,  dw   0, 0,  0,  64,  0,   0,  0,  0
+                       dw  -1, 4, -10, 58,  17, -5,  1,  0
+                       dw  -1, 4, -11, 40,  40, -11, 4, -1
+                       dw   0, 1, -5,  17,  58, -10, 4, -1
+
+const tab_LumaCoeffV,   times 4 dw 0, 0
+                        times 4 dw 0, 64
+                        times 4 dw 0, 0
+                        times 4 dw 0, 0
+
+                        times 4 dw -1, 4
+                        times 4 dw -10, 58
+                        times 4 dw 17, -5
+                        times 4 dw 1, 0
+
+                        times 4 dw -1, 4
+                        times 4 dw -11, 40
+                        times 4 dw 40, -11
+                        times 4 dw 4, -1
+
+                        times 4 dw 0, 1
+                        times 4 dw -5, 17
+                        times 4 dw 58, -10
+                        times 4 dw 4, -1
+
346
+const pw_LumaCoeffVer,  times 8 dw 0, 0
+                        times 8 dw 0, 64
+                        times 8 dw 0, 0
+                        times 8 dw 0, 0
+
+                        times 8 dw -1, 4
+                        times 8 dw -10, 58
+                        times 8 dw 17, -5
+                        times 8 dw 1, 0
 
-                       times 16 db -2, 58
-                       times 16 db 10, -2
+                        times 8 dw -1, 4
+                        times 8 dw -11, 40
+                        times 8 dw 40, -11
+                        times 8 dw 4, -1
 
-                       times 16 db -4, 54
-                       times 16 db 16, -2
+                        times 8 dw 0, 1
+                        times 8 dw -5, 17
+                        times 8 dw 58, -10
+                        times 8 dw 4, -1
 
-                       times 16 db -6, 46
-                       times 16 db 28, -4
+const pb_LumaCoeffVer,  times 16 db 0, 0
+                        times 16 db 0, 64
+                        times 16 db 0, 0
+                        times 16 db 0, 0
 
-                       times 16 db -4, 36
-                       times 16 db 36, -4
+                        times 16 db -1, 4
+                        times 16 db -10, 58
+                        times 16 db 17, -5
+                        times 16 db 1, 0
 
-                       times 16 db -4, 28
-                       times 16 db 46, -6
+                        times 16 db -1, 4
+                        times 16 db -11, 40
+                        times 16 db 40, -11
+                        times 16 db 4, -1
 
-                       times 16 db -2, 16
-                       times 16 db 54, -4
+                        times 16 db 0, 1
+                        times 16 db -5, 17
+                        times 16 db 58, -10
+                        times 16 db 4, -1
 
-                       times 16 db -2, 10
-                       times 16 db 58, -2
+const tab_LumaCoeffVer, times 8 db 0, 0
+                        times 8 db 0, 64
+                        times 8 db 0, 0
+                        times 8 db 0, 0
 
-tab_c_64_n64:   times 8 db 64, -64
+                        times 8 db -1, 4
+                        times 8 db -10, 58
+                        times 8 db 17, -5
+                        times 8 db 1, 0
+
+                        times 8 db -1, 4
+                        times 8 db -11, 40
+                        times 8 db 40, -11
+                        times 8 db 4, -1
+
+                        times 8 db 0, 1
+                        times 8 db -5, 17
+                        times 8 db 58, -10
+                        times 8 db 4, -1
+
+const tab_LumaCoeffVer_32,  times 16 db 0, 0
+                            times 16 db 0, 64
+                            times 16 db 0, 0
+                            times 16 db 0, 0
+
+                            times 16 db -1, 4
+                            times 16 db -10, 58
+                            times 16 db 17, -5
+                            times 16 db 1, 0
+
+                            times 16 db -1, 4
+                            times 16 db -11, 40
+                            times 16 db 40, -11
+                            times 16 db 4, -1
+
+                            times 16 db 0, 1
+                            times 16 db -5, 17
+                            times 16 db 58, -10
+                            times 16 db 4, -1
+
+const tab_ChromaCoeffVer_32,    times 16 db 0, 64
+                                times 16 db 0, 0
+
+                                times 16 db -2, 58
+                                times 16 db 10, -2
+
+                                times 16 db -4, 54
+                                times 16 db 16, -2
+
+                                times 16 db -6, 46
+                                times 16 db 28, -4
+
+                                times 16 db -4, 36
+                                times 16 db 36, -4
+
+                                times 16 db -4, 28
+                                times 16 db 46, -6
+
+                                times 16 db -2, 16
+                                times 16 db 54, -4
+
+                                times 16 db -2, 10
+                                times 16 db 58, -2
+
+const tab_c_64_n64, times 8 db 64, -64
 
 const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15
 
-ALIGN 32
-interp4_horiz_shuf1:    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
-                        db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
+const interp4_horiz_shuf1,  db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+                            db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
 
-ALIGN 32
-interp4_hpp_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
+const interp4_hpp_shuf,     times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
 
-ALIGN 32
-interp8_hps_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
+const interp8_hps_shuf,     dd 0, 4, 1, 5, 2, 6, 3, 7
 
 ALIGN 32
 interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
@@ -298,9 +298,276 @@
 
 cextern pb_128
 cextern pw_1
+cextern pw_32
 cextern pw_512
 cextern pw_2000
 
493
+%macro FILTER_H4_w2_2_sse2 0
+    pxor        m3, m3
+    movd        m0, [srcq - 1]
+    movd        m2, [srcq]
+    punpckldq   m0, m2
+    punpcklbw   m0, m3
+    movd        m1, [srcq + srcstrideq - 1]
+    movd        m2, [srcq + srcstrideq]
+    punpckldq   m1, m2
+    punpcklbw   m1, m3
+    pmaddwd     m0, m4
+    pmaddwd     m1, m4
+    packssdw    m0, m1
+    pshuflw     m1, m0, q2301
+    pshufhw     m1, m1, q2301
+    paddw       m0, m1
+    psrld       m0, 16
+    packssdw    m0, m0
+    paddw       m0, m5
+    psraw       m0, 6
+    packuswb    m0, m0
+    movd        r4, m0
+    mov         [dstq], r4w
+    shr         r4, 16
+    mov         [dstq + dststrideq], r4w
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x4, 4, 6, 6, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m5,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    FILTER_H4_w2_2_sse2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w2_2_sse2
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m5,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 4
+    FILTER_H4_w2_2_sse2
+%if x < 4
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m5,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 8
+    FILTER_H4_w2_2_sse2
+%if x < 8
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+    RET
595
+
+%macro FILTER_H4_w4_2_sse2 0
+    pxor        m5, m5
+    movd        m0, [srcq - 1]
+    movd        m6, [srcq]
+    punpckldq   m0, m6
+    punpcklbw   m0, m5
+    movd        m1, [srcq + 1]
+    movd        m6, [srcq + 2]
+    punpckldq   m1, m6
+    punpcklbw   m1, m5
+    movd        m2, [srcq + srcstrideq - 1]
+    movd        m6, [srcq + srcstrideq]
+    punpckldq   m2, m6
+    punpcklbw   m2, m5
+    movd        m3, [srcq + srcstrideq + 1]
+    movd        m6, [srcq + srcstrideq + 2]
+    punpckldq   m3, m6
+    punpcklbw   m3, m5
+    pmaddwd     m0, m4
+    pmaddwd     m1, m4
+    pmaddwd     m2, m4
+    pmaddwd     m3, m4
+    packssdw    m0, m1
+    packssdw    m2, m3
+    pshuflw     m1, m0, q2301
+    pshufhw     m1, m1, q2301
+    pshuflw     m3, m2, q2301
+    pshufhw     m3, m3, q2301
+    paddw       m0, m1
+    paddw       m2, m3
+    psrld       m0, 16
+    psrld       m2, 16
+    packssdw    m0, m2
+    paddw       m0, m7
+    psraw       m0, 6
+    packuswb    m0, m2
+    movd        [dstq], m0
+    psrldq      m0, 4
+    movd        [dstq + dststrideq], m0
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x2, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    FILTER_H4_w4_2_sse2
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x4, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    FILTER_H4_w4_2_sse2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w4_2_sse2
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x8, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 4
+    FILTER_H4_w4_2_sse2
+%if x < 4
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x16, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+
+%ifdef PIC
+    lea         r5,         [tabw_ChromaCoeff]
+    movddup     m4,         [r5 + r4 * 8]
+%else
+    movddup     m4,         [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 8
+    FILTER_H4_w4_2_sse2
+%if x < 8
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x32, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+
+%ifdef PIC
+    lea         r5,          [tabw_ChromaCoeff]
+    movddup     m4,       [r5 + r4 * 8]
+%else
+    movddup     m4,       [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 16
+    FILTER_H4_w4_2_sse2
+%if x < 16
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
 %macro FILTER_H4_w2_2 3
     movh        %2, [srcq - 1]
     pshufb      %2, %2, Tm0
@@ -317,6 +584,1298 @@
     mov         [dstq + dststrideq], r4w
 %endmacro
 
+%macro FILTER_H4_w6_sse2 0
+    pxor        m4, m4
+    movh        m0, [srcq - 1]
+    movh        m5, [srcq]
+    punpckldq   m0, m5
+    movhlps     m2, m0
+    punpcklbw   m0, m4
+    punpcklbw   m2, m4
+    movd        m1, [srcq + 1]
+    movd        m5, [srcq + 2]
+    punpckldq   m1, m5
+    punpcklbw   m1, m4
+    pmaddwd     m0, m6
+    pmaddwd     m1, m6
+    pmaddwd     m2, m6
+    packssdw    m0, m1
+    packssdw    m2, m2
+    pshuflw     m1, m0, q2301
+    pshufhw     m1, m1, q2301
+    pshuflw     m3, m2, q2301
+    paddw       m0, m1
+    paddw       m2, m3
+    psrld       m0, 16
+    psrld       m2, 16
+    packssdw    m0, m2
+    paddw       m0, m7
+    psraw       m0, 6
+    packuswb    m0, m0
+    movd        [dstq], m0
+    pextrw      r4d, m0, 2
+    mov         [dstq + 4], r4w
+%endmacro
+
+%macro FILH4W8_sse2 1
+    movh        m0, [srcq - 1 + %1]
+    movh        m5, [srcq + %1]
+    punpckldq   m0, m5
+    movhlps     m2, m0
+    punpcklbw   m0, m4
+    punpcklbw   m2, m4
+    movh        m1, [srcq + 1 + %1]
+    movh        m5, [srcq + 2 + %1]
+    punpckldq   m1, m5
+    movhlps     m3, m1
+    punpcklbw   m1, m4
+    punpcklbw   m3, m4
+    pmaddwd     m0, m6
+    pmaddwd     m1, m6
+    pmaddwd     m2, m6
+    pmaddwd     m3, m6
+    packssdw    m0, m1
+    packssdw    m2, m3
+    pshuflw     m1, m0, q2301
+    pshufhw     m1, m1, q2301
+    pshuflw     m3, m2, q2301
+    pshufhw     m3, m3, q2301
+    paddw       m0, m1
+    paddw       m2, m3
+    psrld       m0, 16
+    psrld       m2, 16
+    packssdw    m0, m2
+    paddw       m0, m7
+    psraw       m0, 6
+    packuswb    m0, m0
+    movh        [dstq + %1], m0
+%endmacro
+
+%macro FILTER_H4_w8_sse2 0
+    FILH4W8_sse2 0
+%endmacro
+
+%macro FILTER_H4_w12_sse2 0
+    FILH4W8_sse2 0
+    movd        m1, [srcq - 1 + 8]
+    movd        m3, [srcq + 8]
+    punpckldq   m1, m3
+    punpcklbw   m1, m4
+    movd        m2, [srcq + 1 + 8]
+    movd        m3, [srcq + 2 + 8]
+    punpckldq   m2, m3
+    punpcklbw   m2, m4
+    pmaddwd     m1, m6
+    pmaddwd     m2, m6
+    packssdw    m1, m2
+    pshuflw     m2, m1, q2301
+    pshufhw     m2, m2, q2301
+    paddw       m1, m2
+    psrld       m1, 16
+    packssdw    m1, m1
+    paddw       m1, m7
+    psraw       m1, 6
+    packuswb    m1, m1
+    movd        [dstq + 8], m1
+%endmacro
+
+%macro FILTER_H4_w16_sse2 0
+    FILH4W8_sse2 0
+    FILH4W8_sse2 8
+%endmacro
+
+%macro FILTER_H4_w24_sse2 0
+    FILH4W8_sse2 0
+    FILH4W8_sse2 8
+    FILH4W8_sse2 16
+%endmacro
+
+%macro FILTER_H4_w32_sse2 0
+    FILH4W8_sse2 0
+    FILH4W8_sse2 8
+    FILH4W8_sse2 16
+    FILH4W8_sse2 24
+%endmacro
+
+%macro FILTER_H4_w48_sse2 0
+    FILH4W8_sse2 0
+    FILH4W8_sse2 8
+    FILH4W8_sse2 16
+    FILH4W8_sse2 24
+    FILH4W8_sse2 32
+    FILH4W8_sse2 40
+%endmacro
+
+%macro FILTER_H4_w64_sse2 0
+    FILH4W8_sse2 0
+    FILH4W8_sse2 8
+    FILH4W8_sse2 16
+    FILH4W8_sse2 24
+    FILH4W8_sse2 32
+    FILH4W8_sse2 40
+    FILH4W8_sse2 48
+    FILH4W8_sse2 56
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro IPFILTER_CHROMA_sse3 2
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,        r4m
+    mova        m7,         [pw_32]
+    pxor        m4,         m4
+
+%ifdef PIC
+    lea         r5,          [tabw_ChromaCoeff]
+    movddup     m6,       [r5 + r4 * 8]
+%else
+    movddup     m6,       [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep %2
+    FILTER_H4_w%1_sse2
+%if x < %2
+    add         srcq,        srcstrideq
+    add         dstq,        dststrideq
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
+%endmacro
+
+    IPFILTER_CHROMA_sse3 6,   8
+    IPFILTER_CHROMA_sse3 8,   2
+    IPFILTER_CHROMA_sse3 8,   4
+    IPFILTER_CHROMA_sse3 8,   6
+    IPFILTER_CHROMA_sse3 8,   8
+    IPFILTER_CHROMA_sse3 8,  16
+    IPFILTER_CHROMA_sse3 8,  32
+    IPFILTER_CHROMA_sse3 12, 16
+
+    IPFILTER_CHROMA_sse3 6,  16
+    IPFILTER_CHROMA_sse3 8,  12
+    IPFILTER_CHROMA_sse3 8,  64
+    IPFILTER_CHROMA_sse3 12, 32
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro IPFILTER_CHROMA_W_sse3 2
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride
+    mov         r4d,         r4m
+    mova        m7,         [pw_32]
+    pxor        m4,         m4
+%ifdef PIC
+    lea         r5,          [tabw_ChromaCoeff]
+    movddup     m6,       [r5 + r4 * 8]
+%else
+    movddup     m6,       [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep %2
+    FILTER_H4_w%1_sse2
+%if x < %2
+    add         srcq,        srcstrideq
+    add         dstq,        dststrideq
+%endif
+%assign x x+1
+%endrep
+
+    RET
+
+%endmacro
+
+    IPFILTER_CHROMA_W_sse3 16,  4
+    IPFILTER_CHROMA_W_sse3 16,  8
+    IPFILTER_CHROMA_W_sse3 16, 12
+    IPFILTER_CHROMA_W_sse3 16, 16
+    IPFILTER_CHROMA_W_sse3 16, 32
+    IPFILTER_CHROMA_W_sse3 32,  8
+    IPFILTER_CHROMA_W_sse3 32, 16
+    IPFILTER_CHROMA_W_sse3 32, 24
+    IPFILTER_CHROMA_W_sse3 24, 32
+    IPFILTER_CHROMA_W_sse3 32, 32
+
+    IPFILTER_CHROMA_W_sse3 16, 24
+    IPFILTER_CHROMA_W_sse3 16, 64
+    IPFILTER_CHROMA_W_sse3 32, 48
+    IPFILTER_CHROMA_W_sse3 24, 64
+    IPFILTER_CHROMA_W_sse3 32, 64
+
+    IPFILTER_CHROMA_W_sse3 64, 64
+    IPFILTER_CHROMA_W_sse3 64, 32
+    IPFILTER_CHROMA_W_sse3 64, 48
+    IPFILTER_CHROMA_W_sse3 48, 64
+    IPFILTER_CHROMA_W_sse3 64, 16
+
+%macro FILTER_H8_W8_sse2 0
+    movh        m1, [r0 + x - 3]
+    movh        m4, [r0 + x - 2]
+    punpcklbw   m1, m6
+    punpcklbw   m4, m6
+    movh        m5, [r0 + x - 1]
+    movh        m0, [r0 + x]
+    punpcklbw   m5, m6
+    punpcklbw   m0, m6
+    pmaddwd     m1, m3
+    pmaddwd     m4, m3
+    pmaddwd     m5, m3
+    pmaddwd     m0, m3
+    packssdw    m1, m4
+    packssdw    m5, m0
+    pshuflw     m4, m1, q2301
+    pshufhw     m4, m4, q2301
+    pshuflw     m0, m5, q2301
+    pshufhw     m0, m0, q2301
+    paddw       m1, m4
+    paddw       m5, m0
+    psrldq      m1, 2
+    psrldq      m5, 2
+    pshufd      m1, m1, q3120
+    pshufd      m5, m5, q3120
+    punpcklqdq  m1, m5
+    movh        m7, [r0 + x + 1]
+    movh        m4, [r0 + x + 2]
+    punpcklbw   m7, m6
+    punpcklbw   m4, m6
+    movh        m5, [r0 + x + 3]
+    movh        m0, [r0 + x + 4]
+    punpcklbw   m5, m6
+    punpcklbw   m0, m6
+    pmaddwd     m7, m3
+    pmaddwd     m4, m3
+    pmaddwd     m5, m3
+    pmaddwd     m0, m3
+    packssdw    m7, m4
+    packssdw    m5, m0
+    pshuflw     m4, m7, q2301
+    pshufhw     m4, m4, q2301
+    pshuflw     m0, m5, q2301
+    pshufhw     m0, m0, q2301
+    paddw       m7, m4
+    paddw       m5, m0
+    psrldq      m7, 2
+    psrldq      m5, 2
+    pshufd      m7, m7, q3120
+    pshufd      m5, m5, q3120
+    punpcklqdq  m7, m5
+    pshuflw     m4, m1, q2301
+    pshufhw     m4, m4, q2301
+    pshuflw     m0, m7, q2301
+    pshufhw     m0, m0, q2301
+    paddw       m1, m4
+    paddw       m7, m0
+    psrldq      m1, 2
+    psrldq      m7, 2
+    pshufd      m1, m1, q3120
+    pshufd      m7, m7, q3120
+    punpcklqdq  m1, m7
+%endmacro
+
+%macro FILTER_H8_W4_sse2 0
+    movh        m1, [r0 + x - 3]
+    movh        m0, [r0 + x - 2]
+    punpcklbw   m1, m6
+    punpcklbw   m0, m6
+    movh        m4, [r0 + x - 1]
+    movh        m5, [r0 + x]
+    punpcklbw   m4, m6
+    punpcklbw   m5, m6
+    pmaddwd     m1, m3
+    pmaddwd     m0, m3
+    pmaddwd     m4, m3
+    pmaddwd     m5, m3
+    packssdw    m1, m0
+    packssdw    m4, m5
+    pshuflw     m0, m1, q2301
+    pshufhw     m0, m0, q2301
+    pshuflw     m5, m4, q2301
+    pshufhw     m5, m5, q2301
+    paddw       m1, m0
+    paddw       m4, m5
+    psrldq      m1, 2
+    psrldq      m4, 2
+    pshufd      m1, m1, q3120
+    pshufd      m4, m4, q3120
+    punpcklqdq  m1, m4
+    pshuflw     m0, m1, q2301
+    pshufhw     m0, m0, q2301
+    paddw       m1, m0
+    psrldq      m1, 2
+    pshufd      m1, m1, q3120
+%endmacro
+
+;----------------------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;----------------------------------------------------------------------------------------------------------------------------
+%macro IPFILTER_LUMA_sse2 3
+INIT_XMM sse2
+cglobal interp_8tap_horiz_%3_%1x%2, 4,6,8
+    mov       r4d, r4m
+    add       r4d, r4d
+    pxor      m6, m6
+
+%ifidn %3, ps
+    add       r3d, r3d
+    cmp       r5m, byte 0
+%endif
+
+%ifdef PIC
+    lea       r5, [tabw_LumaCoeff]
+    movu      m3, [r5 + r4 * 8]
+%else
+    movu      m3, [tabw_LumaCoeff + r4 * 8]
+%endif
+
+    mov       r4d, %2
+
+%ifidn %3, pp
+    mova      m2, [pw_32]
+%else
+    mova      m2, [pw_2000]
+    je        .loopH
+    lea       r5, [r1 + 2 * r1]
+    sub       r0, r5
+    add       r4d, 7
+%endif
+
+.loopH:
+%assign x 0
+%rep %1 / 8
+    FILTER_H8_W8_sse2
+  %ifidn %3, pp
+    paddw     m1, m2
+    psraw     m1, 6
+    packuswb  m1, m1
+    movh      [r2 + x], m1
+  %else
+    psubw     m1, m2
+    movu      [r2 + 2 * x], m1
+  %endif
+%assign x x+8
+%endrep
+
+%rep (%1 % 8) / 4
+    FILTER_H8_W4_sse2
+  %ifidn %3, pp
+    paddw     m1, m2
+    psraw     m1, 6
+    packuswb  m1, m1
+    movd      [r2 + x], m1
+  %else
+    psubw     m1, m2
+    movh      [r2 + 2 * x], m1
+  %endif
+%endrep
+
+    add       r0, r1
+    add       r2, r3
+
+    dec       r4d
+    jnz       .loopH
+    RET
+
+%endmacro
+
+;--------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;--------------------------------------------------------------------------------------------------------------
+    IPFILTER_LUMA_sse2 4, 4, pp
+    IPFILTER_LUMA_sse2 4, 8, pp
+    IPFILTER_LUMA_sse2 8, 4, pp
+    IPFILTER_LUMA_sse2 8, 8, pp
+    IPFILTER_LUMA_sse2 16, 16, pp
+    IPFILTER_LUMA_sse2 16, 8, pp
+    IPFILTER_LUMA_sse2 8, 16, pp
+    IPFILTER_LUMA_sse2 16, 12, pp
+    IPFILTER_LUMA_sse2 12, 16, pp
+    IPFILTER_LUMA_sse2 16, 4, pp
+    IPFILTER_LUMA_sse2 4, 16, pp
+    IPFILTER_LUMA_sse2 32, 32, pp
+    IPFILTER_LUMA_sse2 32, 16, pp
+    IPFILTER_LUMA_sse2 16, 32, pp
+    IPFILTER_LUMA_sse2 32, 24, pp
+    IPFILTER_LUMA_sse2 24, 32, pp
+    IPFILTER_LUMA_sse2 32, 8, pp
+    IPFILTER_LUMA_sse2 8, 32, pp
+    IPFILTER_LUMA_sse2 64, 64, pp
+    IPFILTER_LUMA_sse2 64, 32, pp
+    IPFILTER_LUMA_sse2 32, 64, pp
+    IPFILTER_LUMA_sse2 64, 48, pp
+    IPFILTER_LUMA_sse2 48, 64, pp
+    IPFILTER_LUMA_sse2 64, 16, pp
+    IPFILTER_LUMA_sse2 16, 64, pp
+
+;----------------------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;----------------------------------------------------------------------------------------------------------------------------
+    IPFILTER_LUMA_sse2 4, 4, ps
+    IPFILTER_LUMA_sse2 8, 8, ps
+    IPFILTER_LUMA_sse2 8, 4, ps
+    IPFILTER_LUMA_sse2 4, 8, ps
+    IPFILTER_LUMA_sse2 16, 16, ps
+    IPFILTER_LUMA_sse2 16, 8, ps
+    IPFILTER_LUMA_sse2 8, 16, ps
+    IPFILTER_LUMA_sse2 16, 12, ps
+    IPFILTER_LUMA_sse2 12, 16, ps
+    IPFILTER_LUMA_sse2 16, 4, ps
+    IPFILTER_LUMA_sse2 4, 16, ps
+    IPFILTER_LUMA_sse2 32, 32, ps
+    IPFILTER_LUMA_sse2 32, 16, ps
+    IPFILTER_LUMA_sse2 16, 32, ps
+    IPFILTER_LUMA_sse2 32, 24, ps
+    IPFILTER_LUMA_sse2 24, 32, ps
+    IPFILTER_LUMA_sse2 32, 8, ps
+    IPFILTER_LUMA_sse2 8, 32, ps
+    IPFILTER_LUMA_sse2 64, 64, ps
+    IPFILTER_LUMA_sse2 64, 32, ps
+    IPFILTER_LUMA_sse2 32, 64, ps
+    IPFILTER_LUMA_sse2 64, 48, ps
+    IPFILTER_LUMA_sse2 48, 64, ps
+    IPFILTER_LUMA_sse2 64, 16, ps
+    IPFILTER_LUMA_sse2 16, 64, ps
+
+%macro  WORD_TO_DOUBLE 1
+%if ARCH_X86_64
+    punpcklbw   %1,     m8
+%else
+    punpcklbw   %1,     %1
+    psrlw       %1,     8
+%endif
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_pp_2xn(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro FILTER_V4_W2_H4_sse2 1
+INIT_XMM sse2
+%if ARCH_X86_64
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 9
+    pxor        m8,        m8
+%else
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 8
+%endif
+    mov         r4d,       r4m
+    sub         r0,        r1
+
+%ifdef PIC
+    lea         r5,        [tabw_ChromaCoeff]
+    movh        m0,        [r5 + r4 * 8]
+%else
+    movh        m0,        [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    punpcklqdq  m0,        m0
+    mova        m1,        [pw_32]
+    lea         r5,        [3 * r1]
+
+%assign x 1
+%rep %1/4
+    movd        m2,        [r0]
+    movd        m3,        [r0 + r1]
+    movd        m4,        [r0 + 2 * r1]
+    movd        m5,        [r0 + r5]
+
+    punpcklbw   m2,        m3
+    punpcklbw   m6,        m4,        m5
+    punpcklwd   m2,        m6
+
+    WORD_TO_DOUBLE         m2
+    pmaddwd     m2,        m0
+
+    lea         r0,        [r0 + 4 * r1]
+    movd        m6,        [r0]
+
+    punpcklbw   m3,        m4
+    punpcklbw   m7,        m5,        m6
+    punpcklwd   m3,        m7
+
+    WORD_TO_DOUBLE         m3
+    pmaddwd     m3,        m0
+
+    packssdw    m2,        m3
+    pshuflw     m3,        m2,          q2301
+    pshufhw     m3,        m3,          q2301
+    paddw       m2,        m3
+    psrld       m2,        16
+
+    movd        m7,        [r0 + r1]
+
+    punpcklbw   m4,        m5
+    punpcklbw   m3,        m6,        m7
+    punpcklwd   m4,        m3
+
+    WORD_TO_DOUBLE         m4
+    pmaddwd     m4,        m0
+
+    movd        m3,        [r0 + 2 * r1]
+
+    punpcklbw   m5,        m6
+    punpcklbw   m7,        m3
+    punpcklwd   m5,        m7
+
+    WORD_TO_DOUBLE         m5
+    pmaddwd     m5,        m0
+
+    packssdw    m4,        m5
+    pshuflw     m5,        m4,          q2301
+    pshufhw     m5,        m5,          q2301
+    paddw       m4,        m5
+    psrld       m4,        16
+
+    packssdw    m2,        m4
+    paddw       m2,        m1
+    psraw       m2,        6
+    packuswb    m2,        m2
+
+%if ARCH_X86_64
+    movq        r4,        m2
+    mov         [r2],      r4w
+    shr         r4,        16
+    mov         [r2 + r3], r4w
+    lea         r2,        [r2 + 2 * r3]
+    shr         r4,        16
+    mov         [r2],      r4w
+    shr         r4,        16
+    mov         [r2 + r3], r4w
+%else
+    movd        r4,        m2
+    mov         [r2],      r4w
+    shr         r4,        16
+    mov         [r2 + r3], r4w
+    lea         r2,        [r2 + 2 * r3]
+    psrldq      m2,        4
+    movd        r4,        m2
+    mov         [r2],      r4w
+    shr         r4,        16
+    mov         [r2 + r3], r4w
+%endif
+
+%if x < %1/4
+    lea         r2,        [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+    RET
+
+%endmacro
+
+    FILTER_V4_W2_H4_sse2 4
+    FILTER_V4_W2_H4_sse2 8
+    FILTER_V4_W2_H4_sse2 16
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal interp_4tap_vert_pp_4x2, 4, 6, 8
+
+    mov         r4d,       r4m
+    sub         r0,        r1
+    pxor        m7,        m7
+
+%ifdef PIC
+    lea         r5,        [tabw_ChromaCoeff]
+    movh        m0,        [r5 + r4 * 8]
+%else
+    movh        m0,        [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    lea         r5,        [r0 + 2 * r1]
+    punpcklqdq  m0,        m0
+    movd        m2,        [r0]
+    movd        m3,        [r0 + r1]
+    movd        m4,        [r5]
+    movd        m5,        [r5 + r1]
+
+    punpcklbw   m2,        m3
+    punpcklbw   m1,        m4,        m5
+    punpcklwd   m2,        m1
+
+    movhlps     m6,        m2
+    punpcklbw   m2,        m7
+    punpcklbw   m6,        m7
+    pmaddwd     m2,        m0
+    pmaddwd     m6,        m0
+    packssdw    m2,        m6
+
+    movd        m1,        [r0 + 4 * r1]
+
+    punpcklbw   m3,        m4
+    punpcklbw   m5,        m1
+    punpcklwd   m3,        m5
+
+    movhlps     m6,        m3
+    punpcklbw   m3,        m7
+    punpcklbw   m6,        m7
+    pmaddwd     m3,        m0
+    pmaddwd     m6,        m0
+    packssdw    m3,        m6
+
+    pshuflw     m4,        m2,        q2301
+    pshufhw     m4,        m4,        q2301
+    paddw       m2,        m4
+    pshuflw     m5,        m3,        q2301
+    pshufhw     m5,        m5,        q2301
+    paddw       m3,        m5
+    psrld       m2,        16
+    psrld       m3,        16
+    packssdw    m2,        m3
+
+    paddw       m2,        [pw_32]
+    psraw       m2,        6
+    packuswb    m2,        m2
+
+    movd        [r2],      m2
+    psrldq      m2,        4
+    movd        [r2 + r3], m2
+    RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro FILTER_V4_W4_H4_sse2 1
+INIT_XMM sse2
+%if ARCH_X86_64
+cglobal interp_4tap_vert_pp_4x%1, 4, 6, 9
+    pxor        m8,        m8
+%else
+cglobal interp_4tap_vert_pp_4x%1, 4, 6, 8
+%endif
+
+    mov         r4d,       r4m
+    sub         r0,        r1
+
+%ifdef PIC
+    lea         r5,        [tabw_ChromaCoeff]
+    movh        m0,        [r5 + r4 * 8]
+%else
+    movh        m0,        [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+    mova        m1,        [pw_32]
+    lea         r5,        [3 * r1]
+    punpcklqdq  m0,        m0
+
+%assign x 1
+%rep %1/4
+    movd        m2,        [r0]
+    movd        m3,        [r0 + r1]
+    movd        m4,        [r0 + 2 * r1]
+    movd        m5,        [r0 + r5]
+
+    punpcklbw   m2,        m3
+    punpcklbw   m6,        m4,        m5
+    punpcklwd   m2,        m6
+
+    movhlps     m6,        m2
+    WORD_TO_DOUBLE         m2
+    WORD_TO_DOUBLE         m6
+    pmaddwd     m2,        m0
+    pmaddwd     m6,        m0
+    packssdw    m2,        m6
+
+    lea         r0,        [r0 + 4 * r1]
+    movd        m6,        [r0]
+
+    punpcklbw   m3,        m4
+    punpcklbw   m7,        m5,        m6
+    punpcklwd   m3,        m7
+
+    movhlps     m7,        m3
+    WORD_TO_DOUBLE         m3
+    WORD_TO_DOUBLE         m7
+    pmaddwd     m3,        m0
+    pmaddwd     m7,        m0
+    packssdw    m3,        m7
+
+    pshuflw     m7,        m2,        q2301
+    pshufhw     m7,        m7,        q2301
+    paddw       m2,        m7
+    pshuflw     m7,        m3,        q2301
+    pshufhw     m7,        m7,        q2301
+    paddw       m3,        m7
+    psrld       m2,        16
+    psrld       m3,        16
+    packssdw    m2,        m3
+
+    paddw       m2,        m1
+    psraw       m2,        6
+
+    movd        m7,        [r0 + r1]
+
+    punpcklbw   m4,        m5
+    punpcklbw   m3,        m6,        m7
+    punpcklwd   m4,        m3
+
+    movhlps     m3,        m4
+    WORD_TO_DOUBLE         m4
+    WORD_TO_DOUBLE         m3
+    pmaddwd     m4,        m0
+    pmaddwd     m3,        m0
+    packssdw    m4,        m3
+
+    movd        m3,        [r0 + 2 * r1]
+
+    punpcklbw   m5,        m6
+    punpcklbw   m7,        m3
+    punpcklwd   m5,        m7
+
+    movhlps     m3,        m5
+    WORD_TO_DOUBLE         m5
+    WORD_TO_DOUBLE         m3
+    pmaddwd     m5,        m0
+    pmaddwd     m3,        m0
+    packssdw    m5,        m3
+
+    pshuflw     m7,        m4,        q2301
+    pshufhw     m7,        m7,        q2301
+    paddw       m4,        m7
+    pshuflw     m7,        m5,        q2301
+    pshufhw     m7,        m7,        q2301
+    paddw       m5,        m7
+    psrld       m4,        16
+    psrld       m5,        16
+    packssdw    m4,        m5
+
+    paddw       m4,        m1
+    psraw       m4,        6
+    packuswb    m2,        m4
+
+    movd        [r2],      m2
+    psrldq      m2,        4
+    movd        [r2 + r3], m2
+    lea         r2,        [r2 + 2 * r3]
+    psrldq      m2,        4
+    movd        [r2],      m2
+    psrldq      m2,        4
+    movd        [r2 + r3], m2
+
+%if x < %1/4
+    lea         r2,        [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+    RET
+%endmacro
+
+    FILTER_V4_W4_H4_sse2 4
+    FILTER_V4_W4_H4_sse2 8
+    FILTER_V4_W4_H4_sse2 16
+    FILTER_V4_W4_H4_sse2 32
+
1553
+;-----------------------------------------------------------------------------
1554
+;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1555
+;-----------------------------------------------------------------------------
1556
+%macro FILTER_V4_W6_H4_sse2 1
1557
+INIT_XMM sse2
1558
+cglobal interp_4tap_vert_pp_6x%1, 4, 7, 10
1559
+
1560
+    mov         r4d,       r4m
1561
+    sub         r0,        r1
1562
+    shl         r4d,       5
1563
+    pxor        m9,        m9
1564
+
1565
+%ifdef PIC
1566
+    lea         r5,        [tab_ChromaCoeffV]
1567
+    mova        m6,        [r5 + r4]
1568
+    mova        m5,        [r5 + r4 + 16]
1569
+%else
1570
+    mova        m6,        [tab_ChromaCoeffV + r4]
1571
+    mova        m5,        [tab_ChromaCoeffV + r4 + 16]
1572
+%endif
1573
+
1574
+    mova        m4,        [pw_32]
1575
+    lea         r5,        [3 * r1]
1576
+
1577
+%assign x 1
1578
+%rep %1/4
1579
+    movq        m0,        [r0]
1580
+    movq        m1,        [r0 + r1]
1581
+    movq        m2,        [r0 + 2 * r1]
1582
+    movq        m3,        [r0 + r5]
1583
+
1584
+    punpcklbw   m0,        m1
1585
+    punpcklbw   m1,        m2
1586
+    punpcklbw   m2,        m3
1587
+
1588
+    movhlps     m7,        m0
1589
+    punpcklbw   m0,        m9
1590
+    punpcklbw   m7,        m9
1591
+    pmaddwd     m0,        m6
1592
+    pmaddwd     m7,        m6
1593
+    packssdw    m0,        m7
1594
+
1595
+    movhlps     m8,        m2
1596
+    movq        m7,        m2
1597
+    punpcklbw   m8,        m9
1598
+    punpcklbw   m7,        m9
1599
+    pmaddwd     m8,        m5
1600
+    pmaddwd     m7,        m5
1601
+    packssdw    m7,        m8
1602
+
1603
+    paddw       m0,        m7
1604
+
1605
+    paddw       m0,        m4
1606
+    psraw       m0,        6
1607
+    packuswb    m0,        m0
1608
+    movd        [r2],      m0
1609
+    pextrw      r6d,       m0,        2
1610
+    mov         [r2 + 4],  r6w
1611
+
1612
+    lea         r0,        [r0 + 4 * r1]
1613
+
1614
+    movq        m0,        [r0]
1615
+    punpcklbw   m3,        m0
1616
+
1617
+    movhlps     m8,        m1
1618
+    punpcklbw   m1,        m9
1619
+    punpcklbw   m8,        m9
1620
+    pmaddwd     m1,        m6
1621
+    pmaddwd     m8,        m6
1622
+    packssdw    m1,        m8
1623
+
1624
+    movhlps     m8,        m3
1625
+    movq        m7,        m3
1626
+    punpcklbw   m8,        m9
1627
+    punpcklbw   m7,        m9
1628
+    pmaddwd     m8,        m5
1629
+    pmaddwd     m7,        m5
1630
+    packssdw    m7,        m8
1631
+
1632
+    paddw       m1,        m7
1633
+
1634
+    paddw       m1,        m4
1635
+    psraw       m1,        6
1636
+    packuswb    m1,        m1
1637
+    movd        [r2 + r3], m1
1638
+    pextrw      r6d,       m1,        2
1639
+    mov         [r2 + r3 + 4], r6w
1640
+    movq        m1,        [r0 + r1]
1641
+    punpcklbw   m7,        m0,        m1
1642
+
1643
+    movhlps     m8,        m2
1644
+    punpcklbw   m2,        m9
1645
+    punpcklbw   m8,        m9
1646
+    pmaddwd     m2,        m6
1647
+    pmaddwd     m8,        m6
1648
+    packssdw    m2,        m8
1649
+
1650
+    movhlps     m8,        m7
1651
+    punpcklbw   m7,        m9
1652
+    punpcklbw   m8,        m9
1653
+    pmaddwd     m7,        m5
1654
+    pmaddwd     m8,        m5
1655
+    packssdw    m7,        m8
1656
+
1657
+    paddw       m2,        m7
1658
+
1659
+    paddw       m2,        m4
1660
+    psraw       m2,        6
1661
+    packuswb    m2,        m2
1662
+    lea         r2,        [r2 + 2 * r3]
1663
+    movd        [r2],      m2
1664
+    pextrw      r6d,       m2,    2
1665
+    mov         [r2 + 4],  r6w
1666
+
1667
+    movq        m2,        [r0 + 2 * r1]
1668
+    punpcklbw   m1,        m2
1669
+
1670
+    movhlps     m8,        m3
1671
+    punpcklbw   m3,        m9
1672
+    punpcklbw   m8,        m9
1673
+    pmaddwd     m3,        m6
1674
+    pmaddwd     m8,        m6
1675
+    packssdw    m3,        m8
1676
+
1677
+    movhlps     m8,        m1
1678
+    punpcklbw   m1,        m9
1679
+    punpcklbw   m8,        m9
1680
+    pmaddwd     m1,        m5
1681
+    pmaddwd     m8,        m5
1682
+    packssdw    m1,        m8
1683
+
1684
+    paddw       m3,        m1
1685
+
1686
+    paddw       m3,        m4
1687
+    psraw       m3,        6
1688
+    packuswb    m3,        m3
1689
+
1690
+    movd        [r2 + r3], m3
1691
+    pextrw      r6d,    m3,    2
1692
+    mov         [r2 + r3 + 4], r6w
1693
+
1694
+%if x < %1/4
1695
+    lea         r2,        [r2 + 2 * r3]
1696
+%endif
1697
+%assign x x+1
1698
+%endrep
1699
+    RET
1700
+
1701
+%endmacro
1702
+
1703
+%if ARCH_X86_64
1704
+    FILTER_V4_W6_H4_sse2 8
1705
+    FILTER_V4_W6_H4_sse2 16
1706
+%endif
1707
+
1708
+;-----------------------------------------------------------------------------
1709
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1710
+;-----------------------------------------------------------------------------
1711
+%macro FILTER_V4_W8_sse2 1
1712
+INIT_XMM sse2
1713
+cglobal interp_4tap_vert_pp_8x%1, 4, 7, 12
1714
+
1715
+    mov         r4d,       r4m
1716
+    sub         r0,        r1
1717
+    shl         r4d,       5
1718
+    pxor        m9,        m9
1719
+    mova        m4,        [pw_32]
1720
+
1721
+%ifdef PIC
1722
+    lea         r6,        [tab_ChromaCoeffV]
1723
+    mova        m6,        [r6 + r4]
1724
+    mova        m5,        [r6 + r4 + 16]
1725
+%else
1726
+    mova        m6,        [tab_ChromaCoeffV + r4]
1727
+    mova        m5,        [tab_ChromaCoeffV + r4 + 16]
1728
+%endif
1729
+
1730
+    movq        m0,        [r0]
1731
+    movq        m1,        [r0 + r1]
1732
+    movq        m2,        [r0 + 2 * r1]
1733
+    lea         r5,        [r0 + 2 * r1]
1734
+    movq        m3,        [r5 + r1]
1735
+
1736
+    punpcklbw   m0,        m1
1737
+    punpcklbw   m7,        m2,          m3
1738
+
1739
+    movhlps     m8,        m0
1740
+    punpcklbw   m0,        m9
1741
+    punpcklbw   m8,        m9
1742
+    pmaddwd     m0,        m6
1743
+    pmaddwd     m8,        m6
1744
+    packssdw    m0,        m8
1745
+
1746
+    movhlps     m8,        m7
1747
+    punpcklbw   m7,        m9
1748
+    punpcklbw   m8,        m9
1749
+    pmaddwd     m7,        m5
1750
+    pmaddwd     m8,        m5
1751
+    packssdw    m7,        m8
1752
+
1753
+    paddw       m0,        m7
1754
+
1755
+    paddw       m0,        m4
1756
+    psraw       m0,        6
1757
+
1758
+    movq        m11,        [r0 + 4 * r1]
1759
+
1760
+    punpcklbw   m1,        m2
1761
+    punpcklbw   m7,        m3,        m11
1762
+
1763
+    movhlps     m8,        m1
1764
+    punpcklbw   m1,        m9
1765
+    punpcklbw   m8,        m9
1766
+    pmaddwd     m1,        m6
1767
+    pmaddwd     m8,        m6
1768
+    packssdw    m1,        m8
1769
+
1770
+    movhlps     m8,        m7
1771
+    punpcklbw   m7,        m9
1772
+    punpcklbw   m8,        m9
1773
+    pmaddwd     m7,        m5
1774
+    pmaddwd     m8,        m5
1775
+    packssdw    m7,        m8
1776
+
1777
+    paddw       m1,        m7
1778
+
1779
+    paddw       m1,        m4
1780
+    psraw       m1,        6
1781
+    packuswb    m1,        m0
1782
+
1783
+    movhps      [r2],      m1
1784
+    movh        [r2 + r3], m1
1785
+%if %1 == 2     ;end of 8x2
1786
+    RET
1787
+
1788
+%else
1789
+    lea         r6,        [r0 + 4 * r1]
1790
+    movq        m1,        [r6 + r1]
1791
+
1792
+    punpcklbw   m2,        m3
1793
+    punpcklbw   m7,        m11,        m1
1794
+
1795
+    movhlps     m8,        m2
1796
+    punpcklbw   m2,        m9
1797
+    punpcklbw   m8,        m9
1798
+    pmaddwd     m2,        m6
1799
+    pmaddwd     m8,        m6
1800
+    packssdw    m2,        m8
1801
+
1802
+    movhlps     m8,        m7
1803
+    punpcklbw   m7,        m9
1804
+    punpcklbw   m8,        m9
1805
+    pmaddwd     m7,        m5
1806
+    pmaddwd     m8,        m5
1807
+    packssdw    m7,        m8
1808
+
1809
+    paddw       m2,        m7
1810
+
1811
+    paddw       m2,        m4
1812
+    psraw       m2,        6
1813
+
1814
+    movq        m10,        [r6 + 2 * r1]
1815
+
1816
+    punpcklbw   m3,        m11
1817
+    punpcklbw   m7,        m1,        m10
1818
+
1819
+    movhlps     m8,        m3
1820
+    punpcklbw   m3,        m9
1821
+    punpcklbw   m8,        m9
1822
+    pmaddwd     m3,        m6
1823
+    pmaddwd     m8,        m6
1824
+    packssdw    m3,        m8
1825
+
1826
+    movhlps     m8,        m7
1827
+    punpcklbw   m7,        m9
1828
+    punpcklbw   m8,        m9
1829
+    pmaddwd     m7,        m5
1830
+    pmaddwd     m8,        m5
1831
+    packssdw    m7,        m8
1832
+
1833
+    paddw       m3,        m7
1834
+
1835
+    paddw       m3,        m4
1836
+    psraw       m3,        6
1837
+    packuswb    m3,        m2
1838
+
1839
+    movhps      [r2 + 2 * r3], m3
1840
+    lea         r5,        [r2 + 2 * r3]
1841
+    movh        [r5 + r3], m3
1842
+%if %1 == 4     ;end of 8x4
1843
+    RET
1844
+
1845
+%else
1846
+    lea         r6,        [r6 + 2 * r1]
1847
+    movq        m3,        [r6 + r1]
1848
+
1849
+    punpcklbw   m11,        m1
1850
+    punpcklbw   m7,        m10,        m3
1851
+
1852
+    movhlps     m8,        m11
1853
+    punpcklbw   m11,        m9
1854
+    punpcklbw   m8,        m9
1855
+    pmaddwd     m11,        m6
1856
+    pmaddwd     m8,        m6
1857
+    packssdw    m11,        m8
1858
+
1859
+    movhlps     m8,        m7
1860
+    punpcklbw   m7,        m9
1861
+    punpcklbw   m8,        m9
1862
+    pmaddwd     m7,        m5
1863
+    pmaddwd     m8,        m5
1864
+    packssdw    m7,        m8
1865
+
1866
+    paddw       m11,        m7
1867
+
1868
+    paddw       m11,        m4
1869
+    psraw       m11,        6
1870
+
1871
+    movq        m7,        [r0 + 8 * r1]
1872
+
1873
+    punpcklbw   m1,        m10
1874
+    punpcklbw   m3,        m7
1875
+
1876
+    movhlps     m8,        m1
1877
+    punpcklbw   m1,        m9
1878
+    punpcklbw   m8,        m9
1879
+    pmaddwd     m1,        m6
1880
+    pmaddwd     m8,        m6
1881
+    packssdw    m1,        m8
1882
+
1883
+    movhlps     m8,        m3
1884
+    punpcklbw   m3,        m9
1885
+    punpcklbw   m8,        m9
1886
+    pmaddwd     m3,        m5
1887
+    pmaddwd     m8,        m5
1888
+    packssdw    m3,        m8
1889
+
1890
+    paddw       m1,        m3
1891
+
1892
+    paddw       m1,        m4
1893
+    psraw       m1,        6
1894
+    packuswb    m1,        m11
1895
+
1896
+    movhps      [r2 + 4 * r3], m1
1897
+    lea         r5,        [r2 + 4 * r3]
1898
+    movh        [r5 + r3], m1
1899
+%if %1 == 6
1900
+    RET
1901
+
1902
+%else
1903
+  %error INVALID macro argument, only 2, 4 or 6!
1904
+%endif
1905
+%endif
1906
+%endif
1907
+%endmacro
1908
+
1909
+%if ARCH_X86_64
1910
+    FILTER_V4_W8_sse2 2
1911
+    FILTER_V4_W8_sse2 4
1912
+    FILTER_V4_W8_sse2 6
1913
+%endif
1914
+
1915
+;-----------------------------------------------------------------------------
1916
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1917
+;-----------------------------------------------------------------------------
1918
+%macro FILTER_V4_W8_H8_H16_H32_sse2 2
1919
+INIT_XMM sse2
1920
+cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 11
1921
+
1922
+    mov         r4d,       r4m
1923
+    sub         r0,        r1
1924
+    shl         r4d,       5
1925
+    pxor        m9,        m9
1926
+
1927
+%ifdef PIC
1928
+    lea         r5,        [tab_ChromaCoeffV]
1929
+    mova        m6,        [r5 + r4]
1930
+    mova        m5,        [r5 + r4 + 16]
1931
+%else
1932
+    mova        m6,        [tab_ChromaCoeffV + r4]
+    mova        m5,        [tab_ChromaCoeffV + r4 + 16]
+%endif
+
+    mova        m4,        [pw_32]
+    lea         r5,        [r1 * 3]
+
+%assign x 1
+%rep %2/4
+    movq        m0,        [r0]
+    movq        m1,        [r0 + r1]
+    movq        m2,        [r0 + 2 * r1]
+    movq        m3,        [r0 + r5]
+
+    punpcklbw   m0,        m1
+    punpcklbw   m1,        m2
+    punpcklbw   m2,        m3
+
+    movhlps     m7,        m0
+    punpcklbw   m0,        m9
+    punpcklbw   m7,        m9
+    pmaddwd     m0,        m6
+    pmaddwd     m7,        m6
+    packssdw    m0,        m7
+
+    movhlps     m8,        m2
+    movq        m7,        m2
+    punpcklbw   m8,        m9
+    punpcklbw   m7,        m9
+    pmaddwd     m8,        m5
+    pmaddwd     m7,        m5
+    packssdw    m7,        m8
+
+    paddw       m0,        m7
+    paddw       m0,        m4
+    psraw       m0,        6
+
+    lea         r0,        [r0 + 4 * r1]
+    movq        m10,       [r0]
+    punpcklbw   m3,        m10
+
+    movhlps     m8,        m1
+    punpcklbw   m1,        m9
+    punpcklbw   m8,        m9
+    pmaddwd     m1,        m6
+    pmaddwd     m8,        m6
+    packssdw    m1,        m8
+
+    movhlps     m8,        m3
+    movq        m7,        m3
+    punpcklbw   m8,        m9
+    punpcklbw   m7,        m9
+    pmaddwd     m8,        m5
+    pmaddwd     m7,        m5
+    packssdw    m7,        m8
+
+    paddw       m1,        m7
+    paddw       m1,        m4
+    psraw       m1,        6
+
+    packuswb    m0,        m1
+    movh        [r2],      m0
+    movhps      [r2 + r3], m0
+
+    movq        m1,        [r0 + r1]
+    punpcklbw   m10,       m1
+
+    movhlps     m8,        m2
+    punpcklbw   m2,        m9
+    punpcklbw   m8,        m9
+    pmaddwd     m2,        m6
+    pmaddwd     m8,        m6
+    packssdw    m2,        m8
+
+    movhlps     m8,        m10
+    punpcklbw   m10,       m9
+    punpcklbw   m8,        m9
+    pmaddwd     m10,       m5
+    pmaddwd     m8,        m5
+    packssdw    m10,       m8
+
+    paddw       m2,        m10
+    paddw       m2,        m4
+    psraw       m2,        6
+
+    movq        m7,        [r0 + 2 * r1]
+    punpcklbw   m1,        m7
+
+    movhlps     m8,        m3
+    punpcklbw   m3,        m9
+    punpcklbw   m8,        m9
+    pmaddwd     m3,        m6
+    pmaddwd     m8,        m6
+    packssdw    m3,        m8
+
+    movhlps     m8,        m1
+    punpcklbw   m1,        m9
+    punpcklbw   m8,        m9
+    pmaddwd     m1,        m5
+    pmaddwd     m8,        m5
+    packssdw    m1,        m8
+
+    paddw       m3,        m1
+    paddw       m3,        m4
+    psraw       m3,        6
+
+    packuswb    m2,        m3
+    lea         r2,        [r2 + 2 * r3]
+    movh        [r2],      m2
+    movhps      [r2 + r3], m2
+%if x < %2/4
+    lea         r2,        [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+    RET
+%endmacro
+
+%if ARCH_X86_64
+    FILTER_V4_W8_H8_H16_H32_sse2 8,  8
+    FILTER_V4_W8_H8_H16_H32_sse2 8, 16
+    FILTER_V4_W8_H8_H16_H32_sse2 8, 32
+
+    FILTER_V4_W8_H8_H16_H32_sse2 8, 12
+    FILTER_V4_W8_H8_H16_H32_sse2 8, 64
+%endif
+
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-----------------------------------------------------------------------------
@@ -328,26 +1887,26 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
 %rep 2
-FILTER_H4_w2_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w2_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
 %endrep
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -360,26 +1919,26 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
 %rep 4
-FILTER_H4_w2_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w2_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
 %endrep
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -392,29 +1951,29 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
-mov         r5d,        16/2
+    mov         r5d,        16/2
 
 .loop:
-FILTER_H4_w2_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
-dec         r5d
-jnz         .loop
+    FILTER_H4_w2_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+    dec         r5d
+    jnz         .loop
 
-RET
+    RET
 
 %macro FILTER_H4_w4_2 3
     movh        %2, [srcq - 1]
@@ -442,22 +2001,22 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
-FILTER_H4_w4_2   t0, t1, t2
+    FILTER_H4_w4_2   t0, t1, t2
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -470,26 +2029,26 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
 %rep 2
-FILTER_H4_w4_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w4_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
 %endrep
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -502,26 +2061,26 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
 %rep 4
-FILTER_H4_w4_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w4_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
 %endrep
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -534,26 +2093,26 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
 %rep 8
-FILTER_H4_w4_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
+    FILTER_H4_w4_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
 %endrep
 
-RET
+    RET
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -566,29 +2125,29 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
 
-mov         r5d,        32/2
+    mov         r5d,        32/2
 
 .loop:
-FILTER_H4_w4_2   t0, t1, t2
-lea         srcq,       [srcq + srcstrideq * 2]
-lea         dstq,       [dstq + dststrideq * 2]
-dec         r5d
-jnz         .loop
+    FILTER_H4_w4_2   t0, t1, t2
+    lea         srcq,       [srcq + srcstrideq * 2]
+    lea         dstq,       [dstq + dststrideq * 2]
+    dec         r5d
+    jnz         .loop
 
-RET
+    RET
 
 ALIGN 32
 const interp_4tap_8x8_horiz_shuf,   dd 0, 4, 1, 5, 2, 6, 3, 7
@@ -764,47 +2323,47 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,        r4m
+    mov         r4d,        r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-mov           r5d,       %2
+    mov           r5d,       %2
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
-mova        Tm1,         [tab_Tm + 16]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
+    mova        Tm1,         [tab_Tm + 16]
 
 .loop:
-FILTER_H4_w%1   t0, t1, t2
-add         srcq,        srcstrideq
-add         dstq,        dststrideq
-
-dec         r5d
-jnz        .loop
-
-RET
+    FILTER_H4_w%1   t0, t1, t2
+    add         srcq,        srcstrideq
+    add         dstq,        dststrideq
+
+    dec         r5d
+    jnz        .loop
+
+    RET
 %endmacro
 
 
-IPFILTER_CHROMA 6,   8
-IPFILTER_CHROMA 8,   2
-IPFILTER_CHROMA 8,   4
-IPFILTER_CHROMA 8,   6
-IPFILTER_CHROMA 8,   8
-IPFILTER_CHROMA 8,  16
-IPFILTER_CHROMA 8,  32
-IPFILTER_CHROMA 12, 16
-
-IPFILTER_CHROMA 6,  16
-IPFILTER_CHROMA 8,  12
-IPFILTER_CHROMA 8,  64
-IPFILTER_CHROMA 12, 32
+    IPFILTER_CHROMA 6,   8
+    IPFILTER_CHROMA 8,   2
+    IPFILTER_CHROMA 8,   4
+    IPFILTER_CHROMA 8,   6
+    IPFILTER_CHROMA 8,   8
+    IPFILTER_CHROMA 8,  16
+    IPFILTER_CHROMA 8,  32
+    IPFILTER_CHROMA 12, 16
+
+    IPFILTER_CHROMA 6,  16
+    IPFILTER_CHROMA 8,  12
+    IPFILTER_CHROMA 8,  64
+    IPFILTER_CHROMA 12, 32
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -820,55 +2379,55 @@
 %define t1          m1
 %define t0          m0
 
-mov         r4d,         r4m
+    mov         r4d,         r4m
 
 %ifdef PIC
-lea         r5,          [tab_ChromaCoeff]
-movd        coef2,       [r5 + r4 * 4]
+    lea         r5,          [tab_ChromaCoeff]
+    movd        coef2,       [r5 + r4 * 4]
 %else
-movd        coef2,       [tab_ChromaCoeff + r4 * 4]
+    movd        coef2,       [tab_ChromaCoeff + r4 * 4]
 %endif
 
-mov         r5d,          %2
+    mov         r5d,          %2
 
-pshufd      coef2,       coef2,      0
-mova        t2,          [pw_512]
-mova        Tm0,         [tab_Tm]
-mova        Tm1,         [tab_Tm + 16]
+    pshufd      coef2,       coef2,      0
+    mova        t2,          [pw_512]
+    mova        Tm0,         [tab_Tm]
+    mova        Tm1,         [tab_Tm + 16]
 
 .loop:
-FILTER_H4_w%1   t0, t1, t2, t3
-add         srcq,        srcstrideq
-add         dstq,        dststrideq
-
-dec         r5d
-jnz        .loop
-
-RET
-%endmacro
-
-IPFILTER_CHROMA_W 16,  4
-IPFILTER_CHROMA_W 16,  8
-IPFILTER_CHROMA_W 16, 12
-IPFILTER_CHROMA_W 16, 16
-IPFILTER_CHROMA_W 16, 32
-IPFILTER_CHROMA_W 32,  8
-IPFILTER_CHROMA_W 32, 16
-IPFILTER_CHROMA_W 32, 24
-IPFILTER_CHROMA_W 24, 32
-IPFILTER_CHROMA_W 32, 32
-
-IPFILTER_CHROMA_W 16, 24
-IPFILTER_CHROMA_W 16, 64
-IPFILTER_CHROMA_W 32, 48
-IPFILTER_CHROMA_W 24, 64
-IPFILTER_CHROMA_W 32, 64
-
-IPFILTER_CHROMA_W 64, 64
-IPFILTER_CHROMA_W 64, 32
-IPFILTER_CHROMA_W 64, 48
-IPFILTER_CHROMA_W 48, 64
-IPFILTER_CHROMA_W 64, 16
+    FILTER_H4_w%1   t0, t1, t2, t3
+    add         srcq,        srcstrideq
+    add         dstq,        dststrideq
+
+    dec         r5d
+    jnz        .loop
+
+    RET
+%endmacro
+
+    IPFILTER_CHROMA_W 16,  4
+    IPFILTER_CHROMA_W 16,  8
+    IPFILTER_CHROMA_W 16, 12
+    IPFILTER_CHROMA_W 16, 16
+    IPFILTER_CHROMA_W 16, 32
+    IPFILTER_CHROMA_W 32,  8
+    IPFILTER_CHROMA_W 32, 16
+    IPFILTER_CHROMA_W 32, 24
+    IPFILTER_CHROMA_W 24, 32
+    IPFILTER_CHROMA_W 32, 32
+
+    IPFILTER_CHROMA_W 16, 24
+    IPFILTER_CHROMA_W 16, 64
+    IPFILTER_CHROMA_W 32, 48
+    IPFILTER_CHROMA_W 24, 64
+    IPFILTER_CHROMA_W 32, 64
+
+    IPFILTER_CHROMA_W 64, 64
+    IPFILTER_CHROMA_W 64, 32
+    IPFILTER_CHROMA_W 64, 48
+    IPFILTER_CHROMA_W 48, 64
+    IPFILTER_CHROMA_W 64, 16
 
 
 %macro FILTER_H8_W8 7-8   ; t0, t1, t2, t3, coef, c512, src, dst
@@ -918,7 +2477,7 @@
 %endif
     punpcklqdq  m3, m3
 
-%ifidn %3, pp 
+%ifidn %3, pp
     mova      m2, [pw_512]
 %else
    mova      m2, [pw_2000]
@@ -937,7 +2496,7 @@
 .loopH:
    xor       r5, r5
 %rep %1 / 8
-  %ifidn %3, pp 
+  %ifidn %3, pp
    FILTER_H8_W8  m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5]
  %else
    FILTER_H8_W8  m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5]
@@ -949,7 +2508,7 @@
 
 %rep (%1 % 8) / 4
    FILTER_H8_W4  m0, m1
-  %ifidn %3, pp 
+  %ifidn %3, pp
    pmulhrsw  m1, m2
    packuswb  m1, m1
    movd      [r2 + r5], m1
@@ -1120,8 +2679,8 @@
 %endif
 %endmacro
 
-FILTER_HORIZ_LUMA_AVX2_4xN 8
-FILTER_HORIZ_LUMA_AVX2_4xN 16
+    FILTER_HORIZ_LUMA_AVX2_4xN 8
+    FILTER_HORIZ_LUMA_AVX2_4xN 16
 
 INIT_YMM avx2
 cglobal interp_8tap_horiz_pp_8x4, 4, 6, 7
@@ -1271,9 +2830,9 @@
     RET
 %endmacro
 
-IPFILTER_LUMA_AVX2_8xN 8, 8
-IPFILTER_LUMA_AVX2_8xN 8, 16
-IPFILTER_LUMA_AVX2_8xN 8, 32
+    IPFILTER_LUMA_AVX2_8xN 8, 8
+    IPFILTER_LUMA_AVX2_8xN 8, 16
+    IPFILTER_LUMA_AVX2_8xN 8, 32
 
 %macro IPFILTER_LUMA_AVX2 2
 INIT_YMM avx2
@@ -1306,7 +2865,7 @@
     pmaddubsw         m5,         m1
     paddw             m4,         m5
    pmaddwd           m4,         m7
-    vbroadcasti128    m5,         [r0 + 8]                    ; second 8 elements in Row0 
+    vbroadcasti128    m5,         [r0 + 8]                    ; second 8 elements in Row0
    pshufb            m6,         m5,     m3
    pshufb            m5,         [tab_Tm]
    pmaddubsw         m5,         m0
@@ -1322,7 +2881,7 @@
    pmaddubsw         m5,         m1
    paddw             m2,         m5
    pmaddwd           m2,         m7
-    vbroadcasti128    m5,         [r0 + r1 + 8]                    ; second 8 elements in Row0 
2611
+    vbroadcasti128    m5,         [r0 + r1 + 8]                    ; second 8 elements in Row0
2612
     pshufb            m6,         m5,     m3
2613
     pshufb            m5,         [tab_Tm]
2614
     pmaddubsw         m5,         m0
2615
@@ -1617,7 +3176,7 @@
2616
     jnz               .loop
2617
     RET
2618
 
2619
-INIT_YMM avx2 
2620
+INIT_YMM avx2
2621
 cglobal interp_4tap_horiz_pp_4x4, 4,6,6
2622
     mov             r4d, r4m
2623
 
2624
@@ -1665,7 +3224,7 @@
2625
     pextrd            [r2+r0],      xm3,     3
2626
     RET
2627
 
2628
-INIT_YMM avx2 
2629
+INIT_YMM avx2
2630
 cglobal interp_4tap_horiz_pp_2x4, 4, 6, 3
2631
     mov               r4d,           r4m
2632
 
2633
@@ -1698,7 +3257,7 @@
2634
     pextrw            [r2 + r4],     xm1,         3
2635
     RET
2636
 
2637
-INIT_YMM avx2 
2638
+INIT_YMM avx2
2639
 cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6
2640
     mov               r4d,           r4m
2641
 
2642
@@ -1941,7 +3500,7 @@
2643
 
2644
     IPFILTER_LUMA_AVX2 16, 4
2645
     IPFILTER_LUMA_AVX2 16, 8
2646
-    IPFILTER_LUMA_AVX2 16, 12 
2647
+    IPFILTER_LUMA_AVX2 16, 12
2648
     IPFILTER_LUMA_AVX2 16, 16
2649
     IPFILTER_LUMA_AVX2 16, 32
2650
     IPFILTER_LUMA_AVX2 16, 64
2651
@@ -2144,6 +3703,108 @@
2652
     RET
2653
 
2654
 ;-----------------------------------------------------------------------------------------------------------------------------
2655
+; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
2656
+;-----------------------------------------------------------------------------------------------------------------------------;
2657
+%macro IPFILTER_CHROMA_HPS_64xN 1
2658
+INIT_YMM avx2
2659
+cglobal interp_4tap_horiz_ps_64x%1, 4,7,6
2660
+    mov             r4d, r4m
2661
+    mov             r5d, r5m
2662
+    add             r3d, r3d
2663
+
2664
+%ifdef PIC
2665
+    lea               r6,           [tab_ChromaCoeff]
2666
+    vpbroadcastd      m0,           [r6 + r4 * 4]
2667
+%else
2668
+    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]
2669
+%endif
2670
+
2671
+    vbroadcasti128     m2,           [pw_1]
2672
+    vbroadcasti128     m5,           [pw_2000]
2673
+    mova               m1,           [tab_Tm]
2674
+
2675
+    ; register map
2676
+    ; m0 - interpolate coeff
2677
+    ; m1 - shuffle order table
2678
+    ; m2 - constant word 1
2679
+    mov                r6d,         %1
2680
+    dec                r0
2681
+    test                r5d,      r5d
2682
+    je                 .loop
2683
+    sub                r0 ,         r1
2684
+    add                r6d ,        3
2685
+
2686
+.loop
2687
+    ; Row 0
2688
+    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2689
+    pshufb            m3,           m1
2690
+    pmaddubsw         m3,           m0
2691
+    pmaddwd           m3,           m2
2692
+    vbroadcasti128    m4,           [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2693
+    pshufb            m4,           m1
2694
+    pmaddubsw         m4,           m0
2695
+    pmaddwd           m4,           m2
2696
+
2697
+    packssdw          m3,           m4
2698
+    psubw             m3,           m5
2699
+    vpermq            m3,           m3,          11011000b
2700
+    movu              [r2],         m3
2701
+
2702
+    vbroadcasti128    m3,           [r0 + 16]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2703
+    pshufb            m3,           m1
2704
+    pmaddubsw         m3,           m0
2705
+    pmaddwd           m3,           m2
2706
+    vbroadcasti128    m4,           [r0 + 24]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2707
+    pshufb            m4,           m1
2708
+    pmaddubsw         m4,           m0
2709
+    pmaddwd           m4,           m2
2710
+
2711
+    packssdw          m3,           m4
2712
+    psubw             m3,           m5
2713
+    vpermq            m3,           m3,          11011000b
2714
+    movu              [r2 + 32],    m3
2715
+
2716
+    vbroadcasti128    m3,           [r0 + 32]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2717
+    pshufb            m3,           m1
2718
+    pmaddubsw         m3,           m0
2719
+    pmaddwd           m3,           m2
2720
+    vbroadcasti128    m4,           [r0 + 40]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2721
+    pshufb            m4,           m1
2722
+    pmaddubsw         m4,           m0
2723
+    pmaddwd           m4,           m2
2724
+
2725
+    packssdw          m3,           m4
2726
+    psubw             m3,           m5
2727
+    vpermq            m3,           m3,          11011000b
2728
+    movu              [r2 + 64],    m3
2729
+
2730
+    vbroadcasti128    m3,           [r0 + 48]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2731
+    pshufb            m3,           m1
2732
+    pmaddubsw         m3,           m0
2733
+    pmaddwd           m3,           m2
2734
+    vbroadcasti128    m4,           [r0 + 56]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
2735
+    pshufb            m4,           m1
2736
+    pmaddubsw         m4,           m0
2737
+    pmaddwd           m4,           m2
2738
+
2739
+    packssdw          m3,           m4
2740
+    psubw             m3,           m5
2741
+    vpermq            m3,           m3,          11011000b
2742
+    movu              [r2 + 96],    m3
2743
+
2744
+    add                r2,           r3
2745
+    add                r0,           r1
2746
+    dec                r6d
2747
+    jnz                .loop
2748
+    RET
2749
+%endmacro
2750
+
2751
+   IPFILTER_CHROMA_HPS_64xN 64
2752
+   IPFILTER_CHROMA_HPS_64xN 32
2753
+   IPFILTER_CHROMA_HPS_64xN 48
2754
+   IPFILTER_CHROMA_HPS_64xN 16
2755
+
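Editor's note (not part of the patch): the new `interp_4tap_horiz_ps_64xN` kernels above compute a 4-tap horizontal chroma filter whose output is kept in 16-bit "ps" (pixel-to-short) form, biased down by the `pw_2000` constant (8192). A plain-C sketch of that arithmetic, with a hypothetical function name, is:

```c
#include <stdint.h>

/* Plain-C sketch of a 4-tap horizontal "ps" chroma filter: the SIMD
 * code above does the same per-pixel sum (pmaddubsw + pmaddwd with
 * pw_1), then subtracts 8192 (psubw with pw_2000) before storing. */
static void interp_4tap_horiz_ps_c(const uint8_t *src, intptr_t srcStride,
                                   int16_t *dst, intptr_t dstStride,
                                   const int8_t coeff[4],
                                   int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 4; k++)      /* taps read src[x-1 .. x+2] */
                sum += coeff[k] * src[x + k - 1];
            dst[x] = (int16_t)(sum - 8192);  /* psubw m3, [pw_2000] bias */
        }
        src += srcStride;
        dst += dstStride;
    }
}
```
With the "copy" coefficient set {0, 64, 0, 0} each output word is simply 64 times the input pixel minus 8192, which is the identity case of this intermediate format.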
+;-----------------------------------------------------------------------------------------------------------------------------
 ;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------
 
@@ -2230,7 +3891,7 @@
     pshufb                      m4,                m1
     pmaddubsw                   m4,                m0
     phaddw                      m4,                m4                           ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
-    phaddw                      m3,                m4 
+    phaddw                      m3,                m4
 
     vpermd                      m3,                m5,            m3            ; m5 don't broken in above
     psubw                       m3,                m2
@@ -2312,7 +3973,7 @@
     lea                         r2,         [r2 + r3 * 2]                   ; first loop dst ->5th row(i.e 4)
     sub                         r5d,        2
     jg                         .loop
-    jz                         .end             
+    jz                         .end
 
     ; last row
     movu                        xm1,        [r0]
@@ -2334,10 +3995,10 @@
 %endif
 %endmacro ; IPFILTER_LUMA_PS_8xN_AVX2
 
-IPFILTER_LUMA_PS_8xN_AVX2  4
-IPFILTER_LUMA_PS_8xN_AVX2  8
-IPFILTER_LUMA_PS_8xN_AVX2 16
-IPFILTER_LUMA_PS_8xN_AVX2 32
+    IPFILTER_LUMA_PS_8xN_AVX2  4
+    IPFILTER_LUMA_PS_8xN_AVX2  8
+    IPFILTER_LUMA_PS_8xN_AVX2 16
+    IPFILTER_LUMA_PS_8xN_AVX2 32
 
 
 %macro IPFILTER_LUMA_PS_16x_AVX2 2
@@ -2399,17 +4060,17 @@
     dec                         r9d
     jnz                         .label
 
-RET
+    RET
 %endif
 %endmacro
 
 
-IPFILTER_LUMA_PS_16x_AVX2 16 , 16
-IPFILTER_LUMA_PS_16x_AVX2 16 , 8
-IPFILTER_LUMA_PS_16x_AVX2 16 , 12
-IPFILTER_LUMA_PS_16x_AVX2 16 , 4
-IPFILTER_LUMA_PS_16x_AVX2 16 , 32
-IPFILTER_LUMA_PS_16x_AVX2 16 , 64
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 16
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 8
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 12
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 4
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 32
+    IPFILTER_LUMA_PS_16x_AVX2 16 , 64
 
 
 ;--------------------------------------------------------------------------------------------------------------
@@ -2460,27 +4121,27 @@
     RET
 %endmacro
 
-IPFILTER_LUMA_PP_W8      8,  4
-IPFILTER_LUMA_PP_W8      8,  8
-IPFILTER_LUMA_PP_W8      8, 16
-IPFILTER_LUMA_PP_W8      8, 32
-IPFILTER_LUMA_PP_W8     16,  4
-IPFILTER_LUMA_PP_W8     16,  8
-IPFILTER_LUMA_PP_W8     16, 12
-IPFILTER_LUMA_PP_W8     16, 16
-IPFILTER_LUMA_PP_W8     16, 32
-IPFILTER_LUMA_PP_W8     16, 64
-IPFILTER_LUMA_PP_W8     24, 32
-IPFILTER_LUMA_PP_W8     32,  8
-IPFILTER_LUMA_PP_W8     32, 16
-IPFILTER_LUMA_PP_W8     32, 24
-IPFILTER_LUMA_PP_W8     32, 32
-IPFILTER_LUMA_PP_W8     32, 64
-IPFILTER_LUMA_PP_W8     48, 64
-IPFILTER_LUMA_PP_W8     64, 16
-IPFILTER_LUMA_PP_W8     64, 32
-IPFILTER_LUMA_PP_W8     64, 48
-IPFILTER_LUMA_PP_W8     64, 64
+    IPFILTER_LUMA_PP_W8      8,  4
+    IPFILTER_LUMA_PP_W8      8,  8
+    IPFILTER_LUMA_PP_W8      8, 16
+    IPFILTER_LUMA_PP_W8      8, 32
+    IPFILTER_LUMA_PP_W8     16,  4
+    IPFILTER_LUMA_PP_W8     16,  8
+    IPFILTER_LUMA_PP_W8     16, 12
+    IPFILTER_LUMA_PP_W8     16, 16
+    IPFILTER_LUMA_PP_W8     16, 32
+    IPFILTER_LUMA_PP_W8     16, 64
+    IPFILTER_LUMA_PP_W8     24, 32
+    IPFILTER_LUMA_PP_W8     32,  8
+    IPFILTER_LUMA_PP_W8     32, 16
+    IPFILTER_LUMA_PP_W8     32, 24
+    IPFILTER_LUMA_PP_W8     32, 32
+    IPFILTER_LUMA_PP_W8     32, 64
+    IPFILTER_LUMA_PP_W8     48, 64
+    IPFILTER_LUMA_PP_W8     64, 16
+    IPFILTER_LUMA_PP_W8     64, 32
+    IPFILTER_LUMA_PP_W8     64, 48
+    IPFILTER_LUMA_PP_W8     64, 64
 
 ;----------------------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
@@ -2547,10 +4208,10 @@
 
 ; Round and Saturate
 %macro FILTER_HV8_END 4 ; output in [1, 3]
-    paddd       %1, [tab_c_526336]
-    paddd       %2, [tab_c_526336]
-    paddd       %3, [tab_c_526336]
-    paddd       %4, [tab_c_526336]
+    paddd       %1, [pd_526336]
+    paddd       %2, [pd_526336]
+    paddd       %3, [pd_526336]
+    paddd       %4, [pd_526336]
     psrad       %1, 12
    psrad       %2, 12
    psrad       %3, 12
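Editor's note (not part of the patch): the constant renamed here to `pd_526336` is not arbitrary. After the horizontal pass the intermediates carry a -8192 bias (`pw_2000`); the vertical 8-tap pass multiplies by coefficients summing to 64, growing that bias to -8192 * 64 = -524288. Adding 526336 = 524288 + 2048 cancels the bias and supplies the rounding term for the final `psrad 12`. A sketch of that step, with a hypothetical helper name:

```c
#include <stdint.h>

/* Sketch of the FILTER_HV8_END rounding step above:
 * 526336 = 8192*64 (cancels the pw_2000 bias scaled by the 64-sum
 * vertical taps) + 2048 (round-half-up term for the >> 12 shift). */
static uint8_t hv8_round_and_clip(int32_t acc)
{
    int32_t v = (acc + 526336) >> 12;   /* paddd [pd_526336]; psrad 12 */
    if (v < 0)   v = 0;                 /* packssdw/packuswb saturation */
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```
For a flat source of value p, the unclipped accumulator is 4096*p - 524288, and the helper returns p, confirming the constant restores unity gain.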
@@ -2565,7 +4226,7 @@
 ;-----------------------------------------------------------------------------
 ; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
 ;-----------------------------------------------------------------------------
-INIT_XMM sse4
+INIT_XMM ssse3
 cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16
 %define coef        m7
 %define stk_buf     rsp
@@ -2640,76 +4301,148 @@
     RET
 
 ;-----------------------------------------------------------------------------
+; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16
+    mov         r4d,        r4m
+    mov         r5d,        r5m
+    add         r4d,        r4d
+    pxor        m6,         m6
+
+%ifdef PIC
+    lea         r6,         [tabw_LumaCoeff]
+    mova        m3,         [r6 + r4 * 8]
+%else
+    mova        m3,         [tabw_LumaCoeff + r4 * 8]
+%endif
+
+    ; move to row -3
+    lea         r6,         [r1 + r1 * 2]
+    sub         r0,         r6
+
+    mov         r4,         rsp
+
+%assign x 0     ; needed for FILTER_H8_W8_sse2 macro
+%assign y 1
+%rep 15
+    FILTER_H8_W8_sse2
+    psubw       m1,         [pw_2000]
+    mova        [r4],       m1
+
+%if y < 15
+    add         r0,         r1
+    add         r4,         16
+%endif
+%assign y y+1
+%endrep
+
+    ; ready for the vertical phase
+    ; here all mN registers are free
+
+    ; load coeff table
+    shl         r5,         6
+    lea         r6,         [tab_LumaCoeffV]
+    lea         r5,         [r5 + r6]
+
+    ; load intermediate buffer
+    mov         r0,         rsp
+
+    ; register mapping
+    ; r0 - src
+    ; r5 - coeff
+
+    ; let's go
+%assign y 1
+%rep 4
+    FILTER_HV8_START    m1, m2, m3, m4, m0,             0, 0
+    FILTER_HV8_MID      m6, m2, m3, m4, m0, m1, m7, m5, 3, 1
+    FILTER_HV8_MID      m5, m6, m3, m4, m0, m1, m7, m2, 5, 2
+    FILTER_HV8_MID      m6, m5, m3, m4, m0, m1, m7, m2, 7, 3
+    FILTER_HV8_END      m3, m0, m4, m1
+
+    movh        [r2],       m3
+    movhps      [r2 + r3],  m3
+
+%if y < 4
+    lea         r0,         [r0 + 16 * 2]
+    lea         r2,         [r2 + r3 * 2]
+%endif
+%assign y y+1
+%endrep
+    RET
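Editor's note (not part of the patch): the new SSE3 `interp_8tap_hv_pp_8x8` above is a two-pass separable filter: it first filters 15 source rows (8 output rows plus 7 extra for the 8 vertical taps) horizontally into a 16-bit stack buffer biased by -8192, then filters that buffer vertically and restores pixels with the 526336/>>12 rounding. A hedged plain-C sketch of the same flow, with hypothetical names and placeholder identity taps:

```c
#include <stdint.h>

enum { W = 8, H = 8, TAPS = 8, ROWS = H + TAPS - 1 };  /* 15 buffered rows */

/* Two-pass HV sketch: pass 1 mirrors FILTER_H8_W8_sse2 + psubw pw_2000
 * into the 15*16-byte rsp buffer; pass 2 mirrors the FILTER_HV8_* chain
 * ending in the 526336/>>12 round-and-clip. Tap sets must sum to 64. */
static void hv8_c(const uint8_t *src, intptr_t stride, uint8_t *dst,
                  const int8_t ch[8], const int8_t cv[8])
{
    int16_t mid[ROWS][W];
    src -= 3 * stride + 3;                     /* move to row -3, col -3 */
    for (int y = 0; y < ROWS; y++)
        for (int x = 0; x < W; x++)
        {
            int s = 0;
            for (int k = 0; k < TAPS; k++)
                s += ch[k] * src[y * stride + x + k];
            mid[y][x] = (int16_t)(s - 8192);   /* psubw [pw_2000] */
        }
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
        {
            int s = 0;
            for (int k = 0; k < TAPS; k++)
                s += cv[k] * mid[y + k][x];
            s = (s + 526336) >> 12;            /* FILTER_HV8_END */
            dst[y * W + x] = (uint8_t)(s < 0 ? 0 : s > 255 ? 255 : s);
        }
}
```
With identity taps {0,0,0,64,0,0,0,0} in both directions the output block reproduces the source block exactly, which is a handy sanity check for the bias/rounding bookkeeping.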
2965
+
2966
+;-----------------------------------------------------------------------------
2967
 ;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2968
 ;-----------------------------------------------------------------------------
2969
 INIT_XMM sse4
2970
 cglobal interp_4tap_vert_pp_2x4, 4, 6, 8
2971
 
2972
-mov         r4d,       r4m
2973
-sub         r0,        r1
2974
+    mov         r4d,       r4m
2975
+    sub         r0,        r1
2976
 
2977
 %ifdef PIC
2978
-lea         r5,        [tab_ChromaCoeff]
2979
-movd        m0,        [r5 + r4 * 4]
2980
+    lea         r5,        [tab_ChromaCoeff]
2981
+    movd        m0,        [r5 + r4 * 4]
2982
 %else
2983
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
2984
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
2985
 %endif
2986
-lea         r4,        [r1 * 3]
2987
-lea         r5,        [r0 + 4 * r1]
2988
-pshufb      m0,        [tab_Cm]
2989
-mova        m1,        [pw_512]
2990
+    lea         r4,        [r1 * 3]
2991
+    lea         r5,        [r0 + 4 * r1]
2992
+    pshufb      m0,        [tab_Cm]
2993
+    mova        m1,        [pw_512]
2994
 
2995
-movd        m2,        [r0]
2996
-movd        m3,        [r0 + r1]
2997
-movd        m4,        [r0 + 2 * r1]
2998
-movd        m5,        [r0 + r4]
2999
+    movd        m2,        [r0]
3000
+    movd        m3,        [r0 + r1]
3001
+    movd        m4,        [r0 + 2 * r1]
3002
+    movd        m5,        [r0 + r4]
3003
 
3004
-punpcklbw   m2,        m3
3005
-punpcklbw   m6,        m4,        m5
3006
-punpcklbw   m2,        m6
3007
+    punpcklbw   m2,        m3
3008
+    punpcklbw   m6,        m4,        m5
3009
+    punpcklbw   m2,        m6
3010
 
3011
-pmaddubsw   m2,        m0
3012
+    pmaddubsw   m2,        m0
3013
 
3014
-movd        m6,        [r5]
3015
+    movd        m6,        [r5]
3016
 
3017
-punpcklbw   m3,        m4
3018
-punpcklbw   m7,        m5,        m6
3019
-punpcklbw   m3,        m7
3020
+    punpcklbw   m3,        m4
3021
+    punpcklbw   m7,        m5,        m6
3022
+    punpcklbw   m3,        m7
3023
 
3024
-pmaddubsw   m3,        m0
3025
+    pmaddubsw   m3,        m0
3026
 
3027
-phaddw      m2,        m3
3028
+    phaddw      m2,        m3
3029
 
3030
-pmulhrsw    m2,        m1
3031
+    pmulhrsw    m2,        m1
3032
 
3033
-movd        m7,        [r5 + r1]
3034
+    movd        m7,        [r5 + r1]
3035
 
3036
-punpcklbw   m4,        m5
3037
-punpcklbw   m3,        m6,        m7
3038
-punpcklbw   m4,        m3
3039
+    punpcklbw   m4,        m5
3040
+    punpcklbw   m3,        m6,        m7
3041
+    punpcklbw   m4,        m3
3042
 
3043
-pmaddubsw   m4,        m0
3044
+    pmaddubsw   m4,        m0
3045
 
3046
-movd        m3,        [r5 + 2 * r1]
3047
+    movd        m3,        [r5 + 2 * r1]
3048
 
3049
-punpcklbw   m5,        m6
3050
-punpcklbw   m7,        m3
3051
-punpcklbw   m5,        m7
3052
+    punpcklbw   m5,        m6
3053
+    punpcklbw   m7,        m3
3054
+    punpcklbw   m5,        m7
3055
 
3056
-pmaddubsw   m5,        m0
3057
+    pmaddubsw   m5,        m0
3058
 
3059
-phaddw      m4,        m5
3060
+    phaddw      m4,        m5
3061
 
3062
-pmulhrsw    m4,        m1
3063
-packuswb    m2,        m4
3064
+    pmulhrsw    m4,        m1
3065
+    packuswb    m2,        m4
3066
 
3067
-pextrw      [r2],      m2, 0
3068
-pextrw      [r2 + r3], m2, 2
3069
-lea         r2,        [r2 + 2 * r3]
3070
-pextrw      [r2],      m2, 4
3071
-pextrw      [r2 + r3], m2, 6
3072
+    pextrw      [r2],      m2, 0
3073
+    pextrw      [r2 + r3], m2, 2
3074
+    lea         r2,        [r2 + 2 * r3]
3075
+    pextrw      [r2],      m2, 4
3076
+    pextrw      [r2 + r3], m2, 6
3077
 
3078
-RET
3079
+    RET
3080
 
3081
 %macro FILTER_VER_CHROMA_AVX2_2x4 1
3082
 INIT_YMM avx2
3083
@@ -2762,8 +4495,8 @@
3084
     RET
3085
 %endmacro
3086
 
3087
-FILTER_VER_CHROMA_AVX2_2x4 pp
3088
-FILTER_VER_CHROMA_AVX2_2x4 ps
3089
+    FILTER_VER_CHROMA_AVX2_2x4 pp
3090
+    FILTER_VER_CHROMA_AVX2_2x4 ps
3091
 
3092
 %macro FILTER_VER_CHROMA_AVX2_2x8 1
3093
 INIT_YMM avx2
3094
@@ -2834,8 +4567,8 @@
3095
     RET
3096
 %endmacro
3097
 
3098
-FILTER_VER_CHROMA_AVX2_2x8 pp
3099
-FILTER_VER_CHROMA_AVX2_2x8 ps
3100
+    FILTER_VER_CHROMA_AVX2_2x8 pp
3101
+    FILTER_VER_CHROMA_AVX2_2x8 ps
3102
 
3103
 ;-----------------------------------------------------------------------------
3104
 ; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
3105
@@ -2844,85 +4577,85 @@
3106
 INIT_XMM sse4
3107
 cglobal interp_4tap_vert_pp_2x%2, 4, 6, 8
3108
 
3109
-mov         r4d,       r4m
3110
-sub         r0,        r1
3111
+    mov         r4d,       r4m
3112
+    sub         r0,        r1
3113
 
3114
 %ifdef PIC
3115
-lea         r5,        [tab_ChromaCoeff]
3116
-movd        m0,        [r5 + r4 * 4]
3117
+    lea         r5,        [tab_ChromaCoeff]
3118
+    movd        m0,        [r5 + r4 * 4]
3119
 %else
3120
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
3121
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
3122
 %endif
3123
 
3124
-pshufb      m0,        [tab_Cm]
3125
+    pshufb      m0,        [tab_Cm]
3126
 
3127
-mova        m1,        [pw_512]
3128
+    mova        m1,        [pw_512]
3129
 
3130
-mov         r4d,       %2
3131
-lea         r5,        [3 * r1]
3132
+    mov         r4d,       %2
3133
+    lea         r5,        [3 * r1]
3134
 
3135
 .loop:
3136
-movd        m2,        [r0]
3137
-movd        m3,        [r0 + r1]
3138
-movd        m4,        [r0 + 2 * r1]
3139
-movd        m5,        [r0 + r5]
3140
+    movd        m2,        [r0]
3141
+    movd        m3,        [r0 + r1]
3142
+    movd        m4,        [r0 + 2 * r1]
3143
+    movd        m5,        [r0 + r5]
3144
 
3145
-punpcklbw   m2,        m3
3146
-punpcklbw   m6,        m4,        m5
3147
-punpcklbw   m2,        m6
3148
+    punpcklbw   m2,        m3
3149
+    punpcklbw   m6,        m4,        m5
3150
+    punpcklbw   m2,        m6
3151
 
3152
-pmaddubsw   m2,        m0
3153
+    pmaddubsw   m2,        m0
3154
 
3155
-lea         r0,        [r0 + 4 * r1]
3156
-movd        m6,        [r0]
3157
+    lea         r0,        [r0 + 4 * r1]
3158
+    movd        m6,        [r0]
3159
 
3160
-punpcklbw   m3,        m4
3161
-punpcklbw   m7,        m5,        m6
3162
-punpcklbw   m3,        m7
3163
+    punpcklbw   m3,        m4
3164
+    punpcklbw   m7,        m5,        m6
3165
+    punpcklbw   m3,        m7
3166
 
3167
-pmaddubsw   m3,        m0
3168
+    pmaddubsw   m3,        m0
3169
 
3170
-phaddw      m2,        m3
3171
+    phaddw      m2,        m3
3172
 
3173
-pmulhrsw    m2,        m1
3174
+    pmulhrsw    m2,        m1
3175
 
3176
-movd        m7,        [r0 + r1]
3177
+    movd        m7,        [r0 + r1]
3178
 
3179
-punpcklbw   m4,        m5
3180
-punpcklbw   m3,        m6,        m7
3181
-punpcklbw   m4,        m3
3182
+    punpcklbw   m4,        m5
3183
+    punpcklbw   m3,        m6,        m7
3184
+    punpcklbw   m4,        m3
3185
 
3186
-pmaddubsw   m4,        m0
3187
+    pmaddubsw   m4,        m0
3188
 
3189
-movd        m3,        [r0 + 2 * r1]
3190
+    movd        m3,        [r0 + 2 * r1]
3191
 
3192
-punpcklbw   m5,        m6
3193
-punpcklbw   m7,        m3
3194
-punpcklbw   m5,        m7
3195
+    punpcklbw   m5,        m6
3196
+    punpcklbw   m7,        m3
3197
+    punpcklbw   m5,        m7
3198
 
3199
-pmaddubsw   m5,        m0
3200
+    pmaddubsw   m5,        m0
3201
 
3202
-phaddw      m4,        m5
3203
+    phaddw      m4,        m5
3204
 
3205
-pmulhrsw    m4,        m1
3206
-packuswb    m2,        m4
3207
+    pmulhrsw    m4,        m1
3208
+    packuswb    m2,        m4
3209
 
3210
-pextrw      [r2],      m2, 0
3211
-pextrw      [r2 + r3], m2, 2
3212
-lea         r2,        [r2 + 2 * r3]
3213
-pextrw      [r2],      m2, 4
3214
-pextrw      [r2 + r3], m2, 6
3215
+    pextrw      [r2],      m2, 0
3216
+    pextrw      [r2 + r3], m2, 2
3217
+    lea         r2,        [r2 + 2 * r3]
3218
+    pextrw      [r2],      m2, 4
3219
+    pextrw      [r2 + r3], m2, 6
3220
 
3221
-lea         r2,        [r2 + 2 * r3]
3222
+    lea         r2,        [r2 + 2 * r3]
3223
 
3224
-sub         r4,        4
3225
-jnz        .loop
3226
-RET
3227
+    sub         r4,        4
3228
+    jnz        .loop
3229
+    RET
3230
 %endmacro
3231
 
3232
-FILTER_V4_W2_H4 2, 8
3233
+    FILTER_V4_W2_H4 2, 8
3234
 
3235
-FILTER_V4_W2_H4 2, 16
3236
+    FILTER_V4_W2_H4 2, 16
3237
 
3238
 ;-----------------------------------------------------------------------------
3239
 ; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
3240
@@ -2930,46 +4663,46 @@
3241
 INIT_XMM sse4
3242
 cglobal interp_4tap_vert_pp_4x2, 4, 6, 6
3243
 
3244
-mov         r4d,       r4m
3245
-sub         r0,        r1
3246
+    mov         r4d,       r4m
3247
+    sub         r0,        r1
3248
 
3249
 %ifdef PIC
3250
-lea         r5,        [tab_ChromaCoeff]
3251
-movd        m0,        [r5 + r4 * 4]
3252
+    lea         r5,        [tab_ChromaCoeff]
3253
+    movd        m0,        [r5 + r4 * 4]
3254
 %else
3255
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
3256
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
3257
 %endif
3258
 
3259
-pshufb      m0,        [tab_Cm]
3260
-lea         r5,        [r0 + 2 * r1]
3261
+    pshufb      m0,        [tab_Cm]
3262
+    lea         r5,        [r0 + 2 * r1]
3263
 
3264
-movd        m2,        [r0]
3265
-movd        m3,        [r0 + r1]
3266
-movd        m4,        [r5]
3267
-movd        m5,        [r5 + r1]
3268
+    movd        m2,        [r0]
3269
+    movd        m3,        [r0 + r1]
3270
+    movd        m4,        [r5]
3271
+    movd        m5,        [r5 + r1]
3272
 
3273
-punpcklbw   m2,        m3
3274
-punpcklbw   m1,        m4,        m5
3275
-punpcklbw   m2,        m1
3276
+    punpcklbw   m2,        m3
3277
+    punpcklbw   m1,        m4,        m5
3278
+    punpcklbw   m2,        m1
3279
 
3280
-pmaddubsw   m2,        m0
3281
+    pmaddubsw   m2,        m0
3282
 
3283
-movd        m1,        [r0 + 4 * r1]
3284
+    movd        m1,        [r0 + 4 * r1]
3285
 
3286
-punpcklbw   m3,        m4
3287
-punpcklbw   m5,        m1
3288
-punpcklbw   m3,        m5
3289
+    punpcklbw   m3,        m4
3290
+    punpcklbw   m5,        m1
3291
+    punpcklbw   m3,        m5
3292
 
3293
-pmaddubsw   m3,        m0
3294
+    pmaddubsw   m3,        m0
3295
 
3296
-phaddw      m2,        m3
3297
+    phaddw      m2,        m3
3298
 
3299
-pmulhrsw    m2,        [pw_512]
3300
-packuswb    m2,        m2
3301
-movd        [r2],      m2
3302
-pextrd      [r2 + r3], m2,  1
3303
+    pmulhrsw    m2,        [pw_512]
3304
+    packuswb    m2,        m2
3305
+    movd        [r2],      m2
3306
+    pextrd      [r2 + r3], m2,  1
3307
 
3308
-RET
3309
+    RET
3310
 
3311
 %macro FILTER_VER_CHROMA_AVX2_4x2 1
3312
 INIT_YMM avx2
3313
@@ -3017,8 +4750,8 @@
3314
     RET
3315
 %endmacro
3316
 
3317
-FILTER_VER_CHROMA_AVX2_4x2 pp
3318
-FILTER_VER_CHROMA_AVX2_4x2 ps
3319
+    FILTER_VER_CHROMA_AVX2_4x2 pp
3320
+    FILTER_VER_CHROMA_AVX2_4x2 ps
3321
 
3322
 ;-----------------------------------------------------------------------------
3323
 ; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
3324
@@ -3026,71 +4759,71 @@
3325
 INIT_XMM sse4
3326
 cglobal interp_4tap_vert_pp_4x4, 4, 6, 8
3327
 
3328
-mov         r4d,       r4m
3329
-sub         r0,        r1
3330
+    mov         r4d,       r4m
3331
+    sub         r0,        r1
3332
 
3333
 %ifdef PIC
3334
-lea         r5,        [tab_ChromaCoeff]
3335
-movd        m0,        [r5 + r4 * 4]
3336
+    lea         r5,        [tab_ChromaCoeff]
3337
+    movd        m0,        [r5 + r4 * 4]
3338
 %else
3339
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
3340
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
3341
 %endif
3342
 
3343
-pshufb      m0,        [tab_Cm]
3344
-mova        m1,        [pw_512]
3345
-lea         r5,        [r0 + 4 * r1]
3346
-lea         r4,        [r1 * 3]
3347
+    pshufb      m0,        [tab_Cm]
3348
+    mova        m1,        [pw_512]
3349
+    lea         r5,        [r0 + 4 * r1]
3350
+    lea         r4,        [r1 * 3]
3351
 
3352
-movd        m2,        [r0]
3353
-movd        m3,        [r0 + r1]
3354
-movd        m4,        [r0 + 2 * r1]
3355
-movd        m5,        [r0 + r4]
3356
+    movd        m2,        [r0]
3357
+    movd        m3,        [r0 + r1]
3358
+    movd        m4,        [r0 + 2 * r1]
3359
+    movd        m5,        [r0 + r4]
3360
 
3361
-punpcklbw   m2,        m3
3362
-punpcklbw   m6,        m4,        m5
3363
-punpcklbw   m2,        m6
3364
+    punpcklbw   m2,        m3
3365
+    punpcklbw   m6,        m4,        m5
3366
+    punpcklbw   m2,        m6
3367
 
3368
-pmaddubsw   m2,        m0
3369
+    pmaddubsw   m2,        m0
3370
 
3371
-movd        m6,        [r5]
3372
+    movd        m6,        [r5]
3373
 
3374
-punpcklbw   m3,        m4
3375
-punpcklbw   m7,        m5,        m6
3376
-punpcklbw   m3,        m7
3377
+    punpcklbw   m3,        m4
3378
+    punpcklbw   m7,        m5,        m6
3379
+    punpcklbw   m3,        m7
3380
 
3381
-pmaddubsw   m3,        m0
3382
+    pmaddubsw   m3,        m0
3383
 
3384
-phaddw      m2,        m3
3385
+    phaddw      m2,        m3
3386
 
3387
-pmulhrsw    m2,        m1
3388
+    pmulhrsw    m2,        m1
3389
 
3390
-movd        m7,        [r5 + r1]
3391
+    movd        m7,        [r5 + r1]
3392
 
3393
-punpcklbw   m4,        m5
3394
-punpcklbw   m3,        m6,        m7
3395
-punpcklbw   m4,        m3
3396
+    punpcklbw   m4,        m5
3397
+    punpcklbw   m3,        m6,        m7
3398
+    punpcklbw   m4,        m3
3399
 
3400
-pmaddubsw   m4,        m0
3401
+    pmaddubsw   m4,        m0
3402
 
3403
-movd        m3,        [r5 + 2 * r1]
3404
+    movd        m3,        [r5 + 2 * r1]
3405
 
3406
-punpcklbw   m5,        m6
3407
-punpcklbw   m7,        m3
3408
-punpcklbw   m5,        m7
3409
+    punpcklbw   m5,        m6
3410
+    punpcklbw   m7,        m3
3411
+    punpcklbw   m5,        m7
3412
 
3413
-pmaddubsw   m5,        m0
3414
+    pmaddubsw   m5,        m0
3415
 
3416
-phaddw      m4,        m5
3417
+    phaddw      m4,        m5
3418
 
3419
-pmulhrsw    m4,        m1
3420
+    pmulhrsw    m4,        m1
3421
 
3422
-packuswb    m2,        m4
3423
-movd        [r2],      m2
3424
-pextrd      [r2 + r3], m2, 1
3425
-lea         r2,        [r2 + 2 * r3]
3426
-pextrd      [r2],      m2, 2
3427
-pextrd      [r2 + r3], m2, 3
3428
-RET
3429
+    packuswb    m2,        m4
3430
+    movd        [r2],      m2
3431
+    pextrd      [r2 + r3], m2, 1
3432
+    lea         r2,        [r2 + 2 * r3]
3433
+    pextrd      [r2],      m2, 2
3434
+    pextrd      [r2 + r3], m2, 3
3435
+    RET
3436
 %macro FILTER_VER_CHROMA_AVX2_4x4 1
3437
 INIT_YMM avx2
3438
 cglobal interp_4tap_vert_%1_4x4, 4, 6, 3
3439
@@ -3148,8 +4881,8 @@
3440
 %endif
3441
     RET
3442
 %endmacro
3443
-FILTER_VER_CHROMA_AVX2_4x4 pp
3444
-FILTER_VER_CHROMA_AVX2_4x4 ps
3445
+    FILTER_VER_CHROMA_AVX2_4x4 pp
3446
+    FILTER_VER_CHROMA_AVX2_4x4 ps
3447
 
3448
 %macro FILTER_VER_CHROMA_AVX2_4x8 1
3449
 INIT_YMM avx2
3450
@@ -3235,8 +4968,8 @@
3451
     RET
3452
 %endmacro
3453
 
3454
-FILTER_VER_CHROMA_AVX2_4x8 pp
3455
-FILTER_VER_CHROMA_AVX2_4x8 ps
3456
+    FILTER_VER_CHROMA_AVX2_4x8 pp
3457
+    FILTER_VER_CHROMA_AVX2_4x8 ps
3458
 
3459
 %macro FILTER_VER_CHROMA_AVX2_4x16 1
3460
 INIT_YMM avx2
3461
@@ -3380,8 +5113,8 @@
3462
 %endif
3463
 %endmacro
3464
 
3465
-FILTER_VER_CHROMA_AVX2_4x16 pp
3466
-FILTER_VER_CHROMA_AVX2_4x16 ps
3467
+    FILTER_VER_CHROMA_AVX2_4x16 pp
3468
+    FILTER_VER_CHROMA_AVX2_4x16 ps
3469
 
3470
 ;-----------------------------------------------------------------------------
3471
 ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
3472
@@ -3390,184 +5123,184 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8

-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1

 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m0,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m0,        [r5 + r4 * 4]
 %else
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
 %endif

-pshufb      m0,        [tab_Cm]
+    pshufb      m0,        [tab_Cm]

-mova        m1,        [pw_512]
+    mova        m1,        [pw_512]

-mov         r4d,       %2
+    mov         r4d,       %2

-lea         r5,        [3 * r1]
+    lea         r5,        [3 * r1]

 .loop:
-movd        m2,        [r0]
-movd        m3,        [r0 + r1]
-movd        m4,        [r0 + 2 * r1]
-movd        m5,        [r0 + r5]
+    movd        m2,        [r0]
+    movd        m3,        [r0 + r1]
+    movd        m4,        [r0 + 2 * r1]
+    movd        m5,        [r0 + r5]

-punpcklbw   m2,        m3
-punpcklbw   m6,        m4,        m5
-punpcklbw   m2,        m6
+    punpcklbw   m2,        m3
+    punpcklbw   m6,        m4,        m5
+    punpcklbw   m2,        m6

-pmaddubsw   m2,        m0
+    pmaddubsw   m2,        m0

-lea         r0,        [r0 + 4 * r1]
-movd        m6,        [r0]
+    lea         r0,        [r0 + 4 * r1]
+    movd        m6,        [r0]

-punpcklbw   m3,        m4
-punpcklbw   m7,        m5,        m6
-punpcklbw   m3,        m7
+    punpcklbw   m3,        m4
+    punpcklbw   m7,        m5,        m6
+    punpcklbw   m3,        m7

-pmaddubsw   m3,        m0
+    pmaddubsw   m3,        m0

-phaddw      m2,        m3
+    phaddw      m2,        m3

-pmulhrsw    m2,        m1
+    pmulhrsw    m2,        m1

-movd        m7,        [r0 + r1]
+    movd        m7,        [r0 + r1]

-punpcklbw   m4,        m5
-punpcklbw   m3,        m6,        m7
-punpcklbw   m4,        m3
+    punpcklbw   m4,        m5
+    punpcklbw   m3,        m6,        m7
+    punpcklbw   m4,        m3

-pmaddubsw   m4,        m0
+    pmaddubsw   m4,        m0

-movd        m3,        [r0 + 2 * r1]
+    movd        m3,        [r0 + 2 * r1]

-punpcklbw   m5,        m6
-punpcklbw   m7,        m3
-punpcklbw   m5,        m7
+    punpcklbw   m5,        m6
+    punpcklbw   m7,        m3
+    punpcklbw   m5,        m7

-pmaddubsw   m5,        m0
+    pmaddubsw   m5,        m0

-phaddw      m4,        m5
+    phaddw      m4,        m5

-pmulhrsw    m4,        m1
-packuswb    m2,        m4
-movd        [r2],      m2
-pextrd      [r2 + r3], m2,  1
-lea         r2,        [r2 + 2 * r3]
-pextrd      [r2],      m2, 2
-pextrd      [r2 + r3], m2, 3
+    pmulhrsw    m4,        m1
+    packuswb    m2,        m4
+    movd        [r2],      m2
+    pextrd      [r2 + r3], m2,  1
+    lea         r2,        [r2 + 2 * r3]
+    pextrd      [r2],      m2, 2
+    pextrd      [r2 + r3], m2, 3

-lea         r2,        [r2 + 2 * r3]
+    lea         r2,        [r2 + 2 * r3]

-sub         r4,        4
-jnz        .loop
-RET
+    sub         r4,        4
+    jnz        .loop
+    RET
 %endmacro

-FILTER_V4_W4_H4 4,  8
-FILTER_V4_W4_H4 4, 16
+    FILTER_V4_W4_H4 4,  8
+    FILTER_V4_W4_H4 4, 16

-FILTER_V4_W4_H4 4, 32
+    FILTER_V4_W4_H4 4, 32

 %macro FILTER_V4_W8_H2 0
-punpcklbw   m1,        m2
-punpcklbw   m7,        m3,        m0
+    punpcklbw   m1,        m2
+    punpcklbw   m7,        m3,        m0

-pmaddubsw   m1,        m6
-pmaddubsw   m7,        m5
+    pmaddubsw   m1,        m6
+    pmaddubsw   m7,        m5

-paddw       m1,        m7
+    paddw       m1,        m7

-pmulhrsw    m1,        m4
-packuswb    m1,        m1
+    pmulhrsw    m1,        m4
+    packuswb    m1,        m1
 %endmacro

 %macro FILTER_V4_W8_H3 0
-punpcklbw   m2,        m3
-punpcklbw   m7,        m0,        m1
+    punpcklbw   m2,        m3
+    punpcklbw   m7,        m0,        m1

-pmaddubsw   m2,        m6
-pmaddubsw   m7,        m5
+    pmaddubsw   m2,        m6
+    pmaddubsw   m7,        m5

-paddw       m2,        m7
+    paddw       m2,        m7

-pmulhrsw    m2,        m4
-packuswb    m2,        m2
+    pmulhrsw    m2,        m4
+    packuswb    m2,        m2
 %endmacro

 %macro FILTER_V4_W8_H4 0
-punpcklbw   m3,        m0
-punpcklbw   m7,        m1,        m2
+    punpcklbw   m3,        m0
+    punpcklbw   m7,        m1,        m2

-pmaddubsw   m3,        m6
-pmaddubsw   m7,        m5
+    pmaddubsw   m3,        m6
+    pmaddubsw   m7,        m5

-paddw       m3,        m7
+    paddw       m3,        m7

-pmulhrsw    m3,        m4
-packuswb    m3,        m3
+    pmulhrsw    m3,        m4
+    packuswb    m3,        m3
 %endmacro

 %macro FILTER_V4_W8_H5 0
-punpcklbw   m0,        m1
-punpcklbw   m7,        m2,        m3
+    punpcklbw   m0,        m1
+    punpcklbw   m7,        m2,        m3

-pmaddubsw   m0,        m6
-pmaddubsw   m7,        m5
+    pmaddubsw   m0,        m6
+    pmaddubsw   m7,        m5

-paddw       m0,        m7
+    paddw       m0,        m7

-pmulhrsw    m0,        m4
-packuswb    m0,        m0
+    pmulhrsw    m0,        m4
+    packuswb    m0,        m0
 %endmacro

 %macro FILTER_V4_W8_8x2 2
-FILTER_V4_W8 %1, %2
-movq        m0,        [r0 + 4 * r1]
+    FILTER_V4_W8 %1, %2
+    movq        m0,        [r0 + 4 * r1]

-FILTER_V4_W8_H2
+    FILTER_V4_W8_H2

-movh        [r2 + r3], m1
+    movh        [r2 + r3], m1
 %endmacro

 %macro FILTER_V4_W8_8x4 2
-FILTER_V4_W8_8x2 %1, %2
+    FILTER_V4_W8_8x2 %1, %2
 ;8x3
-lea         r6,        [r0 + 4 * r1]
-movq        m1,        [r6 + r1]
+    lea         r6,        [r0 + 4 * r1]
+    movq        m1,        [r6 + r1]

-FILTER_V4_W8_H3
+    FILTER_V4_W8_H3

-movh        [r2 + 2 * r3], m2
+    movh        [r2 + 2 * r3], m2

 ;8x4
-movq        m2,        [r6 + 2 * r1]
+    movq        m2,        [r6 + 2 * r1]

-FILTER_V4_W8_H4
+    FILTER_V4_W8_H4

-lea         r5,        [r2 + 2 * r3]
-movh        [r5 + r3], m3
+    lea         r5,        [r2 + 2 * r3]
+    movh        [r5 + r3], m3
 %endmacro

 %macro FILTER_V4_W8_8x6 2
-FILTER_V4_W8_8x4 %1, %2
+    FILTER_V4_W8_8x4 %1, %2
 ;8x5
-lea         r6,        [r6 + 2 * r1]
-movq        m3,        [r6 + r1]
+    lea         r6,        [r6 + 2 * r1]
+    movq        m3,        [r6 + r1]

-FILTER_V4_W8_H5
+    FILTER_V4_W8_H5

-movh        [r2 + 4 * r3], m0
+    movh        [r2 + 4 * r3], m0

 ;8x6
-movq        m0,        [r0 + 8 * r1]
+    movq        m0,        [r0 + 8 * r1]

-FILTER_V4_W8_H2
+    FILTER_V4_W8_H2

-lea         r5,        [r2 + 4 * r3]
-movh        [r5 + r3], m1
+    lea         r5,        [r2 + 4 * r3]
+    movh        [r5 + r3], m1
 %endmacro

 ;-----------------------------------------------------------------------------
@@ -3577,60 +5310,60 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8

-mov         r4d,       r4m
+    mov         r4d,       r4m

-sub         r0,        r1
-movq        m0,        [r0]
-movq        m1,        [r0 + r1]
-movq        m2,        [r0 + 2 * r1]
-lea         r5,        [r0 + 2 * r1]
-movq        m3,        [r5 + r1]
+    sub         r0,        r1
+    movq        m0,        [r0]
+    movq        m1,        [r0 + r1]
+    movq        m2,        [r0 + 2 * r1]
+    lea         r5,        [r0 + 2 * r1]
+    movq        m3,        [r5 + r1]

-punpcklbw   m0,        m1
-punpcklbw   m4,        m2,          m3
+    punpcklbw   m0,        m1
+    punpcklbw   m4,        m2,          m3

 %ifdef PIC
-lea         r6,        [tab_ChromaCoeff]
-movd        m5,        [r6 + r4 * 4]
+    lea         r6,        [tab_ChromaCoeff]
+    movd        m5,        [r6 + r4 * 4]
 %else
-movd        m5,        [tab_ChromaCoeff + r4 * 4]
+    movd        m5,        [tab_ChromaCoeff + r4 * 4]
 %endif

-pshufb      m6,        m5,       [tab_Vm]
-pmaddubsw   m0,        m6
+    pshufb      m6,        m5,       [tab_Vm]
+    pmaddubsw   m0,        m6

-pshufb      m5,        [tab_Vm + 16]
-pmaddubsw   m4,        m5
+    pshufb      m5,        [tab_Vm + 16]
+    pmaddubsw   m4,        m5

-paddw       m0,        m4
+    paddw       m0,        m4

-mova        m4,        [pw_512]
+    mova        m4,        [pw_512]

-pmulhrsw    m0,        m4
-packuswb    m0,        m0
-movh        [r2],      m0
+    pmulhrsw    m0,        m4
+    packuswb    m0,        m0
+    movh        [r2],      m0
 %endmacro

 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x2 8, 2
+    FILTER_V4_W8_8x2 8, 2

-RET
+    RET

 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x4 8, 4
+    FILTER_V4_W8_8x4 8, 4

-RET
+    RET

 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x6 8, 6
+    FILTER_V4_W8_8x6 8, 6

-RET
+    RET

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -3638,46 +5371,46 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_ps_4x2, 4, 6, 6

-mov         r4d, r4m
-sub         r0, r1
-add         r3d, r3d
+    mov         r4d, r4m
+    sub         r0, r1
+    add         r3d, r3d

 %ifdef PIC
-lea         r5, [tab_ChromaCoeff]
-movd        m0, [r5 + r4 * 4]
+    lea         r5, [tab_ChromaCoeff]
+    movd        m0, [r5 + r4 * 4]
 %else
-movd        m0, [tab_ChromaCoeff + r4 * 4]
+    movd        m0, [tab_ChromaCoeff + r4 * 4]
 %endif

-pshufb      m0, [tab_Cm]
+    pshufb      m0, [tab_Cm]

-movd        m2, [r0]
-movd        m3, [r0 + r1]
-lea         r5, [r0 + 2 * r1]
-movd        m4, [r5]
-movd        m5, [r5 + r1]
+    movd        m2, [r0]
+    movd        m3, [r0 + r1]
+    lea         r5, [r0 + 2 * r1]
+    movd        m4, [r5]
+    movd        m5, [r5 + r1]

-punpcklbw   m2, m3
-punpcklbw   m1, m4, m5
-punpcklbw   m2, m1
+    punpcklbw   m2, m3
+    punpcklbw   m1, m4, m5
+    punpcklbw   m2, m1

-pmaddubsw   m2, m0
+    pmaddubsw   m2, m0

-movd        m1, [r0 + 4 * r1]
+    movd        m1, [r0 + 4 * r1]

-punpcklbw   m3, m4
-punpcklbw   m5, m1
-punpcklbw   m3, m5
+    punpcklbw   m3, m4
+    punpcklbw   m5, m1
+    punpcklbw   m3, m5

-pmaddubsw   m3, m0
+    pmaddubsw   m3, m0

-phaddw      m2, m3
+    phaddw      m2, m3

-psubw       m2, [pw_2000]
-movh        [r2], m2
-movhps      [r2 + r3], m2
+    psubw       m2, [pw_2000]
+    movh        [r2], m2
+    movhps      [r2 + r3], m2

-RET
+    RET

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -3835,10 +5568,10 @@
     RET
 %endmacro

-FILTER_V_PS_W4_H4 4, 8
-FILTER_V_PS_W4_H4 4, 16
+    FILTER_V_PS_W4_H4 4, 8
+    FILTER_V_PS_W4_H4 4, 16

-FILTER_V_PS_W4_H4 4, 32
+    FILTER_V_PS_W4_H4 4, 32

 ;--------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -3904,12 +5637,12 @@
     RET
 %endmacro

-FILTER_V_PS_W8_H8_H16_H2 8, 2
-FILTER_V_PS_W8_H8_H16_H2 8, 4
-FILTER_V_PS_W8_H8_H16_H2 8, 6
+    FILTER_V_PS_W8_H8_H16_H2 8, 2
+    FILTER_V_PS_W8_H8_H16_H2 8, 4
+    FILTER_V_PS_W8_H8_H16_H2 8, 6

-FILTER_V_PS_W8_H8_H16_H2 8, 12
-FILTER_V_PS_W8_H8_H16_H2 8, 64
+    FILTER_V_PS_W8_H8_H16_H2 8, 12
+    FILTER_V_PS_W8_H8_H16_H2 8, 64

 ;--------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -3999,9 +5732,9 @@
     RET
 %endmacro

-FILTER_V_PS_W8_H8_H16_H32 8,  8
-FILTER_V_PS_W8_H8_H16_H32 8, 16
-FILTER_V_PS_W8_H8_H16_H32 8, 32
+    FILTER_V_PS_W8_H8_H16_H32 8,  8
+    FILTER_V_PS_W8_H8_H16_H32 8, 16
+    FILTER_V_PS_W8_H8_H16_H32 8, 32

 ;------------------------------------------------------------------------------------------------------------
 ;void interp_4tap_vert_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -4095,8 +5828,8 @@
     RET
 %endmacro

-FILTER_V_PS_W6 6, 8
-FILTER_V_PS_W6 6, 16
+    FILTER_V_PS_W6 6, 8
+    FILTER_V_PS_W6 6, 16

 ;---------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_12x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -4181,8 +5914,8 @@
     RET
 %endmacro

-FILTER_V_PS_W12 12, 16
-FILTER_V_PS_W12 12, 32
+    FILTER_V_PS_W12 12, 16
+    FILTER_V_PS_W12 12, 32

 ;---------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -4266,14 +5999,14 @@
     RET
 %endmacro

-FILTER_V_PS_W16 16,  4
-FILTER_V_PS_W16 16,  8
-FILTER_V_PS_W16 16, 12
-FILTER_V_PS_W16 16, 16
-FILTER_V_PS_W16 16, 32
+    FILTER_V_PS_W16 16,  4
+    FILTER_V_PS_W16 16,  8
+    FILTER_V_PS_W16 16, 12
+    FILTER_V_PS_W16 16, 16
+    FILTER_V_PS_W16 16, 32

-FILTER_V_PS_W16 16, 24
-FILTER_V_PS_W16 16, 64
+    FILTER_V_PS_W16 16, 24
+    FILTER_V_PS_W16 16, 64

 ;--------------------------------------------------------------------------------------------------------------
 ;void interp_4tap_vert_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -4389,9 +6122,9 @@
     RET
 %endmacro

-FILTER_V4_PS_W24 24, 32
+    FILTER_V4_PS_W24 24, 32

-FILTER_V4_PS_W24 24, 64
+    FILTER_V4_PS_W24 24, 64

 ;---------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -4482,13 +6215,13 @@
     RET
 %endmacro

-FILTER_V_PS_W32 32,  8
-FILTER_V_PS_W32 32, 16
-FILTER_V_PS_W32 32, 24
-FILTER_V_PS_W32 32, 32
+    FILTER_V_PS_W32 32,  8
+    FILTER_V_PS_W32 32, 16
+    FILTER_V_PS_W32 32, 24
+    FILTER_V_PS_W32 32, 32

-FILTER_V_PS_W32 32, 48
-FILTER_V_PS_W32 32, 64
+    FILTER_V_PS_W32 32, 48
+    FILTER_V_PS_W32 32, 64

 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -4497,95 +6230,95 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8

-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1

 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m5,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m5,        [r5 + r4 * 4]
 %else
-movd        m5,        [tab_ChromaCoeff + r4 * 4]
+    movd        m5,        [tab_ChromaCoeff + r4 * 4]
 %endif

-pshufb      m6,        m5,       [tab_Vm]
-pshufb      m5,        [tab_Vm + 16]
-mova        m4,        [pw_512]
-lea         r5,        [r1 * 3]
+    pshufb      m6,        m5,       [tab_Vm]
+    pshufb      m5,        [tab_Vm + 16]
+    mova        m4,        [pw_512]
+    lea         r5,        [r1 * 3]

-mov         r4d,       %2
+    mov         r4d,       %2

 .loop:
-movq        m0,        [r0]
-movq        m1,        [r0 + r1]
-movq        m2,        [r0 + 2 * r1]
-movq        m3,        [r0 + r5]
+    movq        m0,        [r0]
+    movq        m1,        [r0 + r1]
+    movq        m2,        [r0 + 2 * r1]
+    movq        m3,        [r0 + r5]

-punpcklbw   m0,        m1
-punpcklbw   m1,        m2
-punpcklbw   m2,        m3
+    punpcklbw   m0,        m1
+    punpcklbw   m1,        m2
+    punpcklbw   m2,        m3

-pmaddubsw   m0,        m6
-pmaddubsw   m7,        m2, m5
+    pmaddubsw   m0,        m6
+    pmaddubsw   m7,        m2, m5

-paddw       m0,        m7
+    paddw       m0,        m7

-pmulhrsw    m0,        m4
-packuswb    m0,        m0
-movh        [r2],      m0
+    pmulhrsw    m0,        m4
+    packuswb    m0,        m0
+    movh        [r2],      m0

-lea         r0,        [r0 + 4 * r1]
-movq        m0,        [r0]
+    lea         r0,        [r0 + 4 * r1]
+    movq        m0,        [r0]

-punpcklbw   m3,        m0
+    punpcklbw   m3,        m0

-pmaddubsw   m1,        m6
-pmaddubsw   m7,        m3, m5
+    pmaddubsw   m1,        m6
+    pmaddubsw   m7,        m3, m5

-paddw       m1,        m7
+    paddw       m1,        m7

-pmulhrsw    m1,        m4
-packuswb    m1,        m1
-movh        [r2 + r3], m1
+    pmulhrsw    m1,        m4
+    packuswb    m1,        m1
+    movh        [r2 + r3], m1

-movq        m1,        [r0 + r1]
+    movq        m1,        [r0 + r1]

-punpcklbw   m0,        m1
+    punpcklbw   m0,        m1

-pmaddubsw   m2,        m6
-pmaddubsw   m0,        m5
+    pmaddubsw   m2,        m6
+    pmaddubsw   m0,        m5

-paddw       m2,        m0
+    paddw       m2,        m0

-pmulhrsw    m2,        m4
+    pmulhrsw    m2,        m4

-movq        m7,        [r0 + 2 * r1]
-punpcklbw   m1,        m7
+    movq        m7,        [r0 + 2 * r1]
+    punpcklbw   m1,        m7

-pmaddubsw   m3,        m6
-pmaddubsw   m1,        m5
+    pmaddubsw   m3,        m6
+    pmaddubsw   m1,        m5

-paddw       m3,        m1
+    paddw       m3,        m1

-pmulhrsw    m3,        m4
-packuswb    m2,        m3
+    pmulhrsw    m3,        m4
+    packuswb    m2,        m3

-lea         r2,        [r2 + 2 * r3]
-movh        [r2],      m2
-movhps      [r2 + r3], m2
+    lea         r2,        [r2 + 2 * r3]
+    movh        [r2],      m2
+    movhps      [r2 + r3], m2

-lea         r2,        [r2 + 2 * r3]
+    lea         r2,        [r2 + 2 * r3]

-sub         r4,         4
-jnz        .loop
-RET
+    sub         r4,         4
+    jnz        .loop
+    RET
 %endmacro

-FILTER_V4_W8_H8_H16_H32 8,  8
-FILTER_V4_W8_H8_H16_H32 8, 16
-FILTER_V4_W8_H8_H16_H32 8, 32
+    FILTER_V4_W8_H8_H16_H32 8,  8
+    FILTER_V4_W8_H8_H16_H32 8, 16
+    FILTER_V4_W8_H8_H16_H32 8, 32

-FILTER_V4_W8_H8_H16_H32 8, 12
-FILTER_V4_W8_H8_H16_H32 8, 64
+    FILTER_V4_W8_H8_H16_H32 8, 12
+    FILTER_V4_W8_H8_H16_H32 8, 64

 %macro PROCESS_CHROMA_AVX2_W8_8R 0
     movq            xm1, [r0]                       ; m1 = row 0
@@ -4691,8 +6424,8 @@
     RET
 %endmacro

-FILTER_VER_CHROMA_AVX2_8x8 pp
-FILTER_VER_CHROMA_AVX2_8x8 ps
+    FILTER_VER_CHROMA_AVX2_8x8 pp
+    FILTER_VER_CHROMA_AVX2_8x8 ps

 %macro FILTER_VER_CHROMA_AVX2_8x6 1
 INIT_YMM avx2
@@ -4780,8 +6513,8 @@
     RET
 %endmacro

-FILTER_VER_CHROMA_AVX2_8x6 pp
-FILTER_VER_CHROMA_AVX2_8x6 ps
+    FILTER_VER_CHROMA_AVX2_8x6 pp
+    FILTER_VER_CHROMA_AVX2_8x6 ps

 %macro PROCESS_CHROMA_AVX2_W8_16R 1
     movq            xm1, [r0]                       ; m1 = row 0
@@ -4961,12 +6694,154 @@
     RET
 %endmacro

-FILTER_VER_CHROMA_AVX2_8x16 pp
-FILTER_VER_CHROMA_AVX2_8x16 ps
+    FILTER_VER_CHROMA_AVX2_8x16 pp
+    FILTER_VER_CHROMA_AVX2_8x16 ps
+
+%macro FILTER_VER_CHROMA_AVX2_8x12 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_8x12, 4, 7, 8
+    mov             r4d, r4m
+    shl             r4d, 6
+
+%ifdef PIC
+    lea             r5, [tab_ChromaCoeffVer_32]
+    add             r5, r4
+%else
+    lea             r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r1
+%ifidn %1, pp
+    mova            m7, [pw_512]
+%else
+    add             r3d, r3d
+    mova            m7, [pw_2000]
+%endif
+    lea             r6, [r3 * 3]
+    movq            xm1, [r0]                       ; m1 = row 0
+    movq            xm2, [r0 + r1]                  ; m2 = row 1
+    punpcklbw       xm1, xm2
+    movq            xm3, [r0 + r1 * 2]              ; m3 = row 2
+    punpcklbw       xm2, xm3
+    vinserti128     m5, m1, xm2, 1
+    pmaddubsw       m5, [r5]
+    movq            xm4, [r0 + r4]                  ; m4 = row 3
+    punpcklbw       xm3, xm4
+    lea             r0, [r0 + r1 * 4]
+    movq            xm1, [r0]                       ; m1 = row 4
+    punpcklbw       xm4, xm1
+    vinserti128     m2, m3, xm4, 1
+    pmaddubsw       m0, m2, [r5 + 1 * mmsize]
+    paddw           m5, m0
+    pmaddubsw       m2, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 5
+    punpcklbw       xm1, xm3
+    movq            xm4, [r0 + r1 * 2]              ; m4 = row 6
+    punpcklbw       xm3, xm4
+    vinserti128     m1, m1, xm3, 1
+    pmaddubsw       m0, m1, [r5 + 1 * mmsize]
+    paddw           m2, m0
+    pmaddubsw       m1, [r5]
+    movq            xm3, [r0 + r4]                  ; m3 = row 7
+    punpcklbw       xm4, xm3
+    lea             r0, [r0 + r1 * 4]
+    movq            xm0, [r0]                       ; m0 = row 8
+    punpcklbw       xm3, xm0
+    vinserti128     m4, m4, xm3, 1
+    pmaddubsw       m3, m4, [r5 + 1 * mmsize]
+    paddw           m1, m3
+    pmaddubsw       m4, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 9
+    punpcklbw       xm0, xm3
+    movq            xm6, [r0 + r1 * 2]              ; m6 = row 10
+    punpcklbw       xm3, xm6
+    vinserti128     m0, m0, xm3, 1
+    pmaddubsw       m3, m0, [r5 + 1 * mmsize]
+    paddw           m4, m3
+    pmaddubsw       m0, [r5]
+%ifidn %1, pp
+    pmulhrsw        m5, m7                          ; m5 = word: row 0, row 1
+    pmulhrsw        m2, m7                          ; m2 = word: row 2, row 3
+    pmulhrsw        m1, m7                          ; m1 = word: row 4, row 5
+    pmulhrsw        m4, m7                          ; m4 = word: row 6, row 7
+    packuswb        m5, m2
+    packuswb        m1, m4
+    vextracti128    xm2, m5, 1
+    vextracti128    xm4, m1, 1
+    movq            [r2], xm5
+    movq            [r2 + r3], xm2
+    movhps          [r2 + r3 * 2], xm5
+    movhps          [r2 + r6], xm2
+    lea             r2, [r2 + r3 * 4]
+    movq            [r2], xm1
+    movq            [r2 + r3], xm4
+    movhps          [r2 + r3 * 2], xm1
+    movhps          [r2 + r6], xm4
+%else
+    psubw           m5, m7                          ; m5 = word: row 0, row 1
+    psubw           m2, m7                          ; m2 = word: row 2, row 3
+    psubw           m1, m7                          ; m1 = word: row 4, row 5
+    psubw           m4, m7                          ; m4 = word: row 6, row 7
+    vextracti128    xm3, m5, 1
+    movu            [r2], xm5
+    movu            [r2 + r3], xm3
+    vextracti128    xm3, m2, 1
+    movu            [r2 + r3 * 2], xm2
+    movu            [r2 + r6], xm3
+    lea             r2, [r2 + r3 * 4]
+    vextracti128    xm5, m1, 1
+    vextracti128    xm3, m4, 1
+    movu            [r2], xm1
+    movu            [r2 + r3], xm5
+    movu            [r2 + r3 * 2], xm4
+    movu            [r2 + r6], xm3
+%endif
+    movq            xm3, [r0 + r4]                  ; m3 = row 11
+    punpcklbw       xm6, xm3
+    lea             r0, [r0 + r1 * 4]
+    movq            xm5, [r0]                       ; m5 = row 12
+    punpcklbw       xm3, xm5
+    vinserti128     m6, m6, xm3, 1
+    pmaddubsw       m3, m6, [r5 + 1 * mmsize]
+    paddw           m0, m3
+    pmaddubsw       m6, [r5]
+    movq            xm3, [r0 + r1]                  ; m3 = row 13
+    punpcklbw       xm5, xm3
+    movq            xm2, [r0 + r1 * 2]              ; m2 = row 14
+    punpcklbw       xm3, xm2
+    vinserti128     m5, m5, xm3, 1
+    pmaddubsw       m3, m5, [r5 + 1 * mmsize]
+    paddw           m6, m3
+    lea             r2, [r2 + r3 * 4]
+%ifidn %1, pp
+    pmulhrsw        m0, m7                          ; m0 = word: row 8, row 9
+    pmulhrsw        m6, m7                          ; m6 = word: row 10, row 11
+    packuswb        m0, m6
+    vextracti128    xm6, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r3], xm6
+    movhps          [r2 + r3 * 2], xm0
+    movhps          [r2 + r6], xm6
+%else
+    psubw           m0, m7                          ; m0 = word: row 8, row 9
+    psubw           m6, m7                          ; m6 = word: row 10, row 11
+    vextracti128    xm1, m0, 1
+    vextracti128    xm3, m6, 1
+    movu            [r2], xm0
+    movu            [r2 + r3], xm1
+    movu            [r2 + r3 * 2], xm6
+    movu            [r2 + r6], xm3
+%endif
+    RET
+%endmacro
+
+    FILTER_VER_CHROMA_AVX2_8x12 pp
+    FILTER_VER_CHROMA_AVX2_8x12 ps

-%macro FILTER_VER_CHROMA_AVX2_8x32 1
4366
+%macro FILTER_VER_CHROMA_AVX2_8xN 2
4367
 INIT_YMM avx2
4368
-cglobal interp_4tap_vert_%1_8x32, 4, 7, 8
4369
+cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8
4370
     mov             r4d, r4m
4371
     shl             r4d, 6
4372
 
4373
@@ -4986,15 +6861,17 @@
4374
     mova            m7, [pw_2000]
4375
 %endif
4376
     lea             r6, [r3 * 3]
4377
-%rep 2
4378
+%rep %2 / 16
4379
     PROCESS_CHROMA_AVX2_W8_16R %1
4380
     lea             r2, [r2 + r3 * 4]
4381
 %endrep
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_8x32 pp
-FILTER_VER_CHROMA_AVX2_8x32 ps
+    FILTER_VER_CHROMA_AVX2_8xN pp, 32
+    FILTER_VER_CHROMA_AVX2_8xN ps, 32
+    FILTER_VER_CHROMA_AVX2_8xN pp, 64
+    FILTER_VER_CHROMA_AVX2_8xN ps, 64
 
 %macro PROCESS_CHROMA_AVX2_W8_4R 0
     movq            xm1, [r0]                       ; m1 = row 0
@@ -5065,8 +6942,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_8x4 pp
-FILTER_VER_CHROMA_AVX2_8x4 ps
+    FILTER_VER_CHROMA_AVX2_8x4 pp
+    FILTER_VER_CHROMA_AVX2_8x4 ps
 
 %macro FILTER_VER_CHROMA_AVX2_8x2 1
 INIT_YMM avx2
@@ -5114,8 +6991,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_8x2 pp
-FILTER_VER_CHROMA_AVX2_8x2 ps
+    FILTER_VER_CHROMA_AVX2_8x2 pp
+    FILTER_VER_CHROMA_AVX2_8x2 ps
 
 %macro FILTER_VER_CHROMA_AVX2_6x8 1
 INIT_YMM avx2
@@ -5194,8 +7071,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_6x8 pp
-FILTER_VER_CHROMA_AVX2_6x8 ps
+    FILTER_VER_CHROMA_AVX2_6x8 pp
+    FILTER_VER_CHROMA_AVX2_6x8 ps
 
 ;-----------------------------------------------------------------------------
 ;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -5204,96 +7081,96 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_6x%2, 4, 6, 8
 
-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1
 
 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m5,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m5,        [r5 + r4 * 4]
 %else
-movd        m5,        [tab_ChromaCoeff + r4 * 4]
+    movd        m5,        [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufb      m6,        m5,       [tab_Vm]
-pshufb      m5,        [tab_Vm + 16]
-mova        m4,        [pw_512]
+    pshufb      m6,        m5,       [tab_Vm]
+    pshufb      m5,        [tab_Vm + 16]
+    mova        m4,        [pw_512]
 
-mov         r4d,       %2
-lea         r5,        [3 * r1]
+    mov         r4d,       %2
+    lea         r5,        [3 * r1]
 
 .loop:
-movq        m0,        [r0]
-movq        m1,        [r0 + r1]
-movq        m2,        [r0 + 2 * r1]
-movq        m3,        [r0 + r5]
+    movq        m0,        [r0]
+    movq        m1,        [r0 + r1]
+    movq        m2,        [r0 + 2 * r1]
+    movq        m3,        [r0 + r5]
 
-punpcklbw   m0,        m1
-punpcklbw   m1,        m2
-punpcklbw   m2,        m3
+    punpcklbw   m0,        m1
+    punpcklbw   m1,        m2
+    punpcklbw   m2,        m3
 
-pmaddubsw   m0,        m6
-pmaddubsw   m7,        m2, m5
+    pmaddubsw   m0,        m6
+    pmaddubsw   m7,        m2, m5
 
-paddw       m0,        m7
+    paddw       m0,        m7
 
-pmulhrsw    m0,        m4
-packuswb    m0,        m0
-movd        [r2],      m0
-pextrw      [r2 + 4],  m0,    2
+    pmulhrsw    m0,        m4
+    packuswb    m0,        m0
+    movd        [r2],      m0
+    pextrw      [r2 + 4],  m0,    2
 
-lea         r0,        [r0 + 4 * r1]
+    lea         r0,        [r0 + 4 * r1]
 
-movq        m0,        [r0]
-punpcklbw   m3,        m0
+    movq        m0,        [r0]
+    punpcklbw   m3,        m0
 
-pmaddubsw   m1,        m6
-pmaddubsw   m7,        m3, m5
+    pmaddubsw   m1,        m6
+    pmaddubsw   m7,        m3, m5
 
-paddw       m1,        m7
+    paddw       m1,        m7
 
-pmulhrsw    m1,        m4
-packuswb    m1,        m1
-movd        [r2 + r3],      m1
-pextrw      [r2 + r3 + 4],  m1,    2
+    pmulhrsw    m1,        m4
+    packuswb    m1,        m1
+    movd        [r2 + r3],      m1
+    pextrw      [r2 + r3 + 4],  m1,    2
 
-movq        m1,        [r0 + r1]
-punpcklbw   m7,        m0,        m1
+    movq        m1,        [r0 + r1]
+    punpcklbw   m7,        m0,        m1
 
-pmaddubsw   m2,        m6
-pmaddubsw   m7,        m5
+    pmaddubsw   m2,        m6
+    pmaddubsw   m7,        m5
 
-paddw       m2,        m7
+    paddw       m2,        m7
 
-pmulhrsw    m2,        m4
-packuswb    m2,        m2
-lea         r2,        [r2 + 2 * r3]
-movd        [r2],      m2
-pextrw      [r2 + 4],  m2,    2
+    pmulhrsw    m2,        m4
+    packuswb    m2,        m2
+    lea         r2,        [r2 + 2 * r3]
+    movd        [r2],      m2
+    pextrw      [r2 + 4],  m2,    2
 
-movq        m2,        [r0 + 2 * r1]
-punpcklbw   m1,        m2
+    movq        m2,        [r0 + 2 * r1]
+    punpcklbw   m1,        m2
 
-pmaddubsw   m3,        m6
-pmaddubsw   m1,        m5
+    pmaddubsw   m3,        m6
+    pmaddubsw   m1,        m5
 
-paddw       m3,        m1
+    paddw       m3,        m1
 
-pmulhrsw    m3,        m4
-packuswb    m3,        m3
+    pmulhrsw    m3,        m4
+    packuswb    m3,        m3
 
-movd        [r2 + r3],        m3
-pextrw      [r2 + r3 + 4],    m3,    2
+    movd        [r2 + r3],        m3
+    pextrw      [r2 + r3 + 4],    m3,    2
 
-lea         r2,        [r2 + 2 * r3]
+    lea         r2,        [r2 + 2 * r3]
 
-sub         r4,         4
-jnz        .loop
-RET
+    sub         r4,         4
+    jnz        .loop
+    RET
 %endmacro
 
-FILTER_V4_W6_H4 6, 8
+    FILTER_V4_W6_H4 6, 8
 
-FILTER_V4_W6_H4 6, 16
+    FILTER_V4_W6_H4 6, 16
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -5302,88 +7179,88 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_12x%2, 4, 6, 8
 
-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1
 
 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m0,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m0,        [r5 + r4 * 4]
 %else
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufb      m1,        m0,       [tab_Vm]
-pshufb      m0,        [tab_Vm + 16]
+    pshufb      m1,        m0,       [tab_Vm]
+    pshufb      m0,        [tab_Vm + 16]
 
-mov         r4d,       %2
+    mov         r4d,       %2
 
.loop:
-movu        m2,        [r0]
-movu        m3,        [r0 + r1]
+    movu        m2,        [r0]
+    movu        m3,        [r0 + r1]
 
-punpcklbw   m4,        m2,        m3
-punpckhbw   m2,        m3
+    punpcklbw   m4,        m2,        m3
+    punpckhbw   m2,        m3
 
-pmaddubsw   m4,        m1
-pmaddubsw   m2,        m1
+    pmaddubsw   m4,        m1
+    pmaddubsw   m2,        m1
 
-lea         r0,        [r0 + 2 * r1]
-movu        m5,        [r0]
-movu        m7,        [r0 + r1]
+    lea         r0,        [r0 + 2 * r1]
+    movu        m5,        [r0]
+    movu        m7,        [r0 + r1]
 
-punpcklbw   m6,        m5,        m7
-pmaddubsw   m6,        m0
-paddw       m4,        m6
+    punpcklbw   m6,        m5,        m7
+    pmaddubsw   m6,        m0
+    paddw       m4,        m6
 
-punpckhbw   m6,        m5,        m7
-pmaddubsw   m6,        m0
-paddw       m2,        m6
+    punpckhbw   m6,        m5,        m7
+    pmaddubsw   m6,        m0
+    paddw       m2,        m6
 
-mova        m6,        [pw_512]
+    mova        m6,        [pw_512]
 
-pmulhrsw    m4,        m6
-pmulhrsw    m2,        m6
+    pmulhrsw    m4,        m6
+    pmulhrsw    m2,        m6
 
-packuswb    m4,        m2
+    packuswb    m4,        m2
 
-movh         [r2],     m4
-pextrd       [r2 + 8], m4,  2
+    movh         [r2],     m4
+    pextrd       [r2 + 8], m4,  2
 
-punpcklbw   m4,        m3,        m5
-punpckhbw   m3,        m5
+    punpcklbw   m4,        m3,        m5
+    punpckhbw   m3,        m5
 
-pmaddubsw   m4,        m1
-pmaddubsw   m3,        m1
+    pmaddubsw   m4,        m1
+    pmaddubsw   m3,        m1
 
-movu        m5,        [r0 + 2 * r1]
+    movu        m5,        [r0 + 2 * r1]
 
-punpcklbw   m2,        m7,        m5
-punpckhbw   m7,        m5
+    punpcklbw   m2,        m7,        m5
+    punpckhbw   m7,        m5
 
-pmaddubsw   m2,        m0
-pmaddubsw   m7,        m0
+    pmaddubsw   m2,        m0
+    pmaddubsw   m7,        m0
 
-paddw       m4,        m2
-paddw       m3,        m7
+    paddw       m4,        m2
+    paddw       m3,        m7
 
-pmulhrsw    m4,        m6
-pmulhrsw    m3,        m6
+    pmulhrsw    m4,        m6
+    pmulhrsw    m3,        m6
 
-packuswb    m4,        m3
+    packuswb    m4,        m3
 
-movh        [r2 + r3],      m4
-pextrd      [r2 + r3 + 8],  m4,  2
+    movh        [r2 + r3],      m4
+    pextrd      [r2 + r3 + 8],  m4,  2
 
-lea         r2,        [r2 + 2 * r3]
+    lea         r2,        [r2 + 2 * r3]
 
-sub         r4,        2
-jnz        .loop
-RET
+    sub         r4,        2
+    jnz        .loop
+    RET
 %endmacro
 
-FILTER_V4_W12_H2 12, 16
+    FILTER_V4_W12_H2 12, 16
 
-FILTER_V4_W12_H2 12, 32
+    FILTER_V4_W12_H2 12, 32
 
 ;-----------------------------------------------------------------------------
 ; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -5392,91 +7269,91 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_16x%2, 4, 6, 8
 
-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1
 
 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m0,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m0,        [r5 + r4 * 4]
 %else
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufb      m1,        m0,       [tab_Vm]
-pshufb      m0,        [tab_Vm + 16]
+    pshufb      m1,        m0,       [tab_Vm]
+    pshufb      m0,        [tab_Vm + 16]
 
-mov         r4d,       %2/2
+    mov         r4d,       %2/2
 
.loop:
-movu        m2,        [r0]
-movu        m3,        [r0 + r1]
+    movu        m2,        [r0]
+    movu        m3,        [r0 + r1]
 
-punpcklbw   m4,        m2,        m3
-punpckhbw   m2,        m3
+    punpcklbw   m4,        m2,        m3
+    punpckhbw   m2,        m3
 
-pmaddubsw   m4,        m1
-pmaddubsw   m2,        m1
+    pmaddubsw   m4,        m1
+    pmaddubsw   m2,        m1
 
-lea         r0,        [r0 + 2 * r1]
-movu        m5,        [r0]
-movu        m6,        [r0 + r1]
+    lea         r0,        [r0 + 2 * r1]
+    movu        m5,        [r0]
+    movu        m6,        [r0 + r1]
 
-punpckhbw   m7,        m5,        m6
-pmaddubsw   m7,        m0
-paddw       m2,        m7
+    punpckhbw   m7,        m5,        m6
+    pmaddubsw   m7,        m0
+    paddw       m2,        m7
 
-punpcklbw   m7,        m5,        m6
-pmaddubsw   m7,        m0
-paddw       m4,        m7
+    punpcklbw   m7,        m5,        m6
+    pmaddubsw   m7,        m0
+    paddw       m4,        m7
 
-mova        m7,        [pw_512]
+    mova        m7,        [pw_512]
 
-pmulhrsw    m4,        m7
-pmulhrsw    m2,        m7
+    pmulhrsw    m4,        m7
+    pmulhrsw    m2,        m7
 
-packuswb    m4,        m2
+    packuswb    m4,        m2
 
-movu        [r2],      m4
+    movu        [r2],      m4
 
-punpcklbw   m4,        m3,        m5
-punpckhbw   m3,        m5
+    punpcklbw   m4,        m3,        m5
+    punpckhbw   m3,        m5
 
-pmaddubsw   m4,        m1
-pmaddubsw   m3,        m1
+    pmaddubsw   m4,        m1
+    pmaddubsw   m3,        m1
 
-movu        m5,        [r0 + 2 * r1]
+    movu        m5,        [r0 + 2 * r1]
 
-punpcklbw   m2,        m6,        m5
-punpckhbw   m6,        m5
+    punpcklbw   m2,        m6,        m5
+    punpckhbw   m6,        m5
 
-pmaddubsw   m2,        m0
-pmaddubsw   m6,        m0
+    pmaddubsw   m2,        m0
+    pmaddubsw   m6,        m0
 
-paddw       m4,        m2
-paddw       m3,        m6
+    paddw       m4,        m2
+    paddw       m3,        m6
 
-pmulhrsw    m4,        m7
-pmulhrsw    m3,        m7
+    pmulhrsw    m4,        m7
+    pmulhrsw    m3,        m7
 
-packuswb    m4,        m3
+    packuswb    m4,        m3
 
-movu        [r2 + r3],      m4
+    movu        [r2 + r3],      m4
 
-lea         r2,        [r2 + 2 * r3]
+    lea         r2,        [r2 + 2 * r3]
 
-dec         r4d
-jnz        .loop
-RET
+    dec         r4d
+    jnz        .loop
+    RET
 %endmacro
 
-FILTER_V4_W16_H2 16,  4
-FILTER_V4_W16_H2 16,  8
-FILTER_V4_W16_H2 16, 12
-FILTER_V4_W16_H2 16, 16
-FILTER_V4_W16_H2 16, 32
+    FILTER_V4_W16_H2 16,  4
+    FILTER_V4_W16_H2 16,  8
+    FILTER_V4_W16_H2 16, 12
+    FILTER_V4_W16_H2 16, 16
+    FILTER_V4_W16_H2 16, 32
 
-FILTER_V4_W16_H2 16, 24
-FILTER_V4_W16_H2 16, 64
+    FILTER_V4_W16_H2 16, 24
+    FILTER_V4_W16_H2 16, 64
 
 %macro FILTER_VER_CHROMA_AVX2_16x16 1
 INIT_YMM avx2
@@ -5736,8 +7613,8 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_16x16 pp
-FILTER_VER_CHROMA_AVX2_16x16 ps
+    FILTER_VER_CHROMA_AVX2_16x16 pp
+    FILTER_VER_CHROMA_AVX2_16x16 ps
 %macro FILTER_VER_CHROMA_AVX2_16x8 1
 INIT_YMM avx2
 cglobal interp_4tap_vert_%1_16x8, 4, 7, 7
@@ -5891,8 +7768,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_16x8 pp
-FILTER_VER_CHROMA_AVX2_16x8 ps
+    FILTER_VER_CHROMA_AVX2_16x8 pp
+    FILTER_VER_CHROMA_AVX2_16x8 ps
 
 %macro FILTER_VER_CHROMA_AVX2_16x12 1
 INIT_YMM avx2
@@ -6119,13 +7996,13 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_16x12 pp
-FILTER_VER_CHROMA_AVX2_16x12 ps
+    FILTER_VER_CHROMA_AVX2_16x12 pp
+    FILTER_VER_CHROMA_AVX2_16x12 ps
 
-%macro FILTER_VER_CHROMA_AVX2_16x32 1
-INIT_YMM avx2
+%macro FILTER_VER_CHROMA_AVX2_16xN 2
 %if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_16x32, 4, 8, 8
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_16x%2, 4, 8, 8
     mov             r4d, r4m
     shl             r4d, 6
 
@@ -6145,7 +8022,7 @@
     mova            m7, [pw_2000]
 %endif
     lea             r6, [r3 * 3]
-    mov             r7d, 2
+    mov             r7d, %2 / 16
 .loopH:
     movu            xm0, [r0]
     vinserti128     m0, m0, [r0 + r1 * 2], 1
@@ -6412,8 +8289,381 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_16x32 pp
-FILTER_VER_CHROMA_AVX2_16x32 ps
+    FILTER_VER_CHROMA_AVX2_16xN pp, 32
+    FILTER_VER_CHROMA_AVX2_16xN ps, 32
+    FILTER_VER_CHROMA_AVX2_16xN pp, 64
+    FILTER_VER_CHROMA_AVX2_16xN ps, 64
+
+%macro FILTER_VER_CHROMA_AVX2_16x24 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_16x24, 4, 6, 15
+    mov             r4d, r4m
+    shl             r4d, 6
+
+%ifdef PIC
+    lea             r5, [tab_ChromaCoeffVer_32]
+    add             r5, r4
+%else
+    lea             r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+    mova            m12, [r5]
+    mova            m13, [r5 + mmsize]
+    lea             r4, [r1 * 3]
+    sub             r0, r1
+%ifidn %1,pp
+    mova            m14, [pw_512]
+%else
+    add             r3d, r3d
+    vbroadcasti128  m14, [pw_2000]
+%endif
+    lea             r5, [r3 * 3]
+
+    movu            xm0, [r0]                       ; m0 = row 0
+    movu            xm1, [r0 + r1]                  ; m1 = row 1
+    punpckhbw       xm2, xm0, xm1
+    punpcklbw       xm0, xm1
+    vinserti128     m0, m0, xm2, 1
+    pmaddubsw       m0, m12
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2
+    punpckhbw       xm3, xm1, xm2
+    punpcklbw       xm1, xm2
+    vinserti128     m1, m1, xm3, 1
+    pmaddubsw       m1, m12
+    movu            xm3, [r0 + r4]                  ; m3 = row 3
+    punpckhbw       xm4, xm2, xm3
+    punpcklbw       xm2, xm3
+    vinserti128     m2, m2, xm4, 1
+    pmaddubsw       m4, m2, m13
+    paddw           m0, m4
+    pmaddubsw       m2, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm4, [r0]                       ; m4 = row 4
+    punpckhbw       xm5, xm3, xm4
+    punpcklbw       xm3, xm4
+    vinserti128     m3, m3, xm5, 1
+    pmaddubsw       m5, m3, m13
+    paddw           m1, m5
+    pmaddubsw       m3, m12
+    movu            xm5, [r0 + r1]                  ; m5 = row 5
+    punpckhbw       xm6, xm4, xm5
+    punpcklbw       xm4, xm5
+    vinserti128     m4, m4, xm6, 1
+    pmaddubsw       m6, m4, m13
+    paddw           m2, m6
+    pmaddubsw       m4, m12
+    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6
+    punpckhbw       xm7, xm5, xm6
+    punpcklbw       xm5, xm6
+    vinserti128     m5, m5, xm7, 1
+    pmaddubsw       m7, m5, m13
+    paddw           m3, m7
+    pmaddubsw       m5, m12
+    movu            xm7, [r0 + r4]                  ; m7 = row 7
+    punpckhbw       xm8, xm6, xm7
+    punpcklbw       xm6, xm7
+    vinserti128     m6, m6, xm8, 1
+    pmaddubsw       m8, m6, m13
+    paddw           m4, m8
+    pmaddubsw       m6, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm8, [r0]                       ; m8 = row 8
+    punpckhbw       xm9, xm7, xm8
+    punpcklbw       xm7, xm8
+    vinserti128     m7, m7, xm9, 1
+    pmaddubsw       m9, m7, m13
+    paddw           m5, m9
+    pmaddubsw       m7, m12
+    movu            xm9, [r0 + r1]                  ; m9 = row 9
+    punpckhbw       xm10, xm8, xm9
+    punpcklbw       xm8, xm9
+    vinserti128     m8, m8, xm10, 1
+    pmaddubsw       m10, m8, m13
+    paddw           m6, m10
+    pmaddubsw       m8, m12
+    movu            xm10, [r0 + r1 * 2]             ; m10 = row 10
+    punpckhbw       xm11, xm9, xm10
+    punpcklbw       xm9, xm10
+    vinserti128     m9, m9, xm11, 1
+    pmaddubsw       m11, m9, m13
+    paddw           m7, m11
+    pmaddubsw       m9, m12
+
+%ifidn %1,pp
+    pmulhrsw        m0, m14                         ; m0 = word: row 0
+    pmulhrsw        m1, m14                         ; m1 = word: row 1
+    pmulhrsw        m2, m14                         ; m2 = word: row 2
+    pmulhrsw        m3, m14                         ; m3 = word: row 3
+    pmulhrsw        m4, m14                         ; m4 = word: row 4
+    pmulhrsw        m5, m14                         ; m5 = word: row 5
+    pmulhrsw        m6, m14                         ; m6 = word: row 6
+    pmulhrsw        m7, m14                         ; m7 = word: row 7
+    packuswb        m0, m1
+    packuswb        m2, m3
+    packuswb        m4, m5
+    packuswb        m6, m7
+    vpermq          m0, m0, q3120
+    vpermq          m2, m2, q3120
+    vpermq          m4, m4, q3120
+    vpermq          m6, m6, q3120
+    vextracti128    xm1, m0, 1
+    vextracti128    xm3, m2, 1
+    vextracti128    xm5, m4, 1
+    vextracti128    xm7, m6, 1
+    movu            [r2], xm0
+    movu            [r2 + r3], xm1
+    movu            [r2 + r3 * 2], xm2
+    movu            [r2 + r5], xm3
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], xm4
+    movu            [r2 + r3], xm5
+    movu            [r2 + r3 * 2], xm6
+    movu            [r2 + r5], xm7
+%else
+    psubw           m0, m14                         ; m0 = word: row 0
+    psubw           m1, m14                         ; m1 = word: row 1
+    psubw           m2, m14                         ; m2 = word: row 2
+    psubw           m3, m14                         ; m3 = word: row 3
+    psubw           m4, m14                         ; m4 = word: row 4
+    psubw           m5, m14                         ; m5 = word: row 5
+    psubw           m6, m14                         ; m6 = word: row 6
+    psubw           m7, m14                         ; m7 = word: row 7
+    movu            [r2], m0
+    movu            [r2 + r3], m1
+    movu            [r2 + r3 * 2], m2
+    movu            [r2 + r5], m3
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], m4
+    movu            [r2 + r3], m5
+    movu            [r2 + r3 * 2], m6
+    movu            [r2 + r5], m7
+%endif
+    lea             r2, [r2 + r3 * 4]
+
+    movu            xm11, [r0 + r4]                 ; m11 = row 11
+    punpckhbw       xm6, xm10, xm11
+    punpcklbw       xm10, xm11
+    vinserti128     m10, m10, xm6, 1
+    pmaddubsw       m6, m10, m13
+    paddw           m8, m6
+    pmaddubsw       m10, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm6, [r0]                       ; m6 = row 12
+    punpckhbw       xm7, xm11, xm6
+    punpcklbw       xm11, xm6
+    vinserti128     m11, m11, xm7, 1
+    pmaddubsw       m7, m11, m13
+    paddw           m9, m7
+    pmaddubsw       m11, m12
+
+    movu            xm7, [r0 + r1]                  ; m7 = row 13
+    punpckhbw       xm0, xm6, xm7
+    punpcklbw       xm6, xm7
+    vinserti128     m6, m6, xm0, 1
+    pmaddubsw       m0, m6, m13
+    paddw           m10, m0
+    pmaddubsw       m6, m12
+    movu            xm0, [r0 + r1 * 2]              ; m0 = row 14
+    punpckhbw       xm1, xm7, xm0
+    punpcklbw       xm7, xm0
+    vinserti128     m7, m7, xm1, 1
+    pmaddubsw       m1, m7, m13
+    paddw           m11, m1
+    pmaddubsw       m7, m12
+    movu            xm1, [r0 + r4]                  ; m1 = row 15
+    punpckhbw       xm2, xm0, xm1
+    punpcklbw       xm0, xm1
+    vinserti128     m0, m0, xm2, 1
+    pmaddubsw       m2, m0, m13
+    paddw           m6, m2
+    pmaddubsw       m0, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm2, [r0]                       ; m2 = row 16
+    punpckhbw       xm3, xm1, xm2
+    punpcklbw       xm1, xm2
+    vinserti128     m1, m1, xm3, 1
+    pmaddubsw       m3, m1, m13
+    paddw           m7, m3
+    pmaddubsw       m1, m12
+    movu            xm3, [r0 + r1]                  ; m3 = row 17
+    punpckhbw       xm4, xm2, xm3
+    punpcklbw       xm2, xm3
+    vinserti128     m2, m2, xm4, 1
+    pmaddubsw       m4, m2, m13
+    paddw           m0, m4
+    pmaddubsw       m2, m12
+    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18
+    punpckhbw       xm5, xm3, xm4
+    punpcklbw       xm3, xm4
+    vinserti128     m3, m3, xm5, 1
+    pmaddubsw       m5, m3, m13
+    paddw           m1, m5
+    pmaddubsw       m3, m12
+
+%ifidn %1,pp
+    pmulhrsw        m8, m14                         ; m8 = word: row 8
+    pmulhrsw        m9, m14                         ; m9 = word: row 9
+    pmulhrsw        m10, m14                        ; m10 = word: row 10
+    pmulhrsw        m11, m14                        ; m11 = word: row 11
+    pmulhrsw        m6, m14                         ; m6 = word: row 12
+    pmulhrsw        m7, m14                         ; m7 = word: row 13
+    pmulhrsw        m0, m14                         ; m0 = word: row 14
+    pmulhrsw        m1, m14                         ; m1 = word: row 15
+    packuswb        m8, m9
+    packuswb        m10, m11
+    packuswb        m6, m7
+    packuswb        m0, m1
+    vpermq          m8, m8, q3120
+    vpermq          m10, m10, q3120
+    vpermq          m6, m6, q3120
+    vpermq          m0, m0, q3120
+    vextracti128    xm9, m8, 1
+    vextracti128    xm11, m10, 1
+    vextracti128    xm7, m6, 1
+    vextracti128    xm1, m0, 1
+    movu            [r2], xm8
+    movu            [r2 + r3], xm9
+    movu            [r2 + r3 * 2], xm10
+    movu            [r2 + r5], xm11
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], xm6
+    movu            [r2 + r3], xm7
+    movu            [r2 + r3 * 2], xm0
+    movu            [r2 + r5], xm1
+%else
+    psubw           m8, m14                         ; m8 = word: row 8
+    psubw           m9, m14                         ; m9 = word: row 9
+    psubw           m10, m14                        ; m10 = word: row 10
+    psubw           m11, m14                        ; m11 = word: row 11
+    psubw           m6, m14                         ; m6 = word: row 12
+    psubw           m7, m14                         ; m7 = word: row 13
+    psubw           m0, m14                         ; m0 = word: row 14
+    psubw           m1, m14                         ; m1 = word: row 15
+    movu            [r2], m8
+    movu            [r2 + r3], m9
+    movu            [r2 + r3 * 2], m10
+    movu            [r2 + r5], m11
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], m6
+    movu            [r2 + r3], m7
+    movu            [r2 + r3 * 2], m0
+    movu            [r2 + r5], m1
+%endif
+    lea             r2, [r2 + r3 * 4]
+
+    movu            xm5, [r0 + r4]                  ; m5 = row 19
+    punpckhbw       xm6, xm4, xm5
+    punpcklbw       xm4, xm5
+    vinserti128     m4, m4, xm6, 1
+    pmaddubsw       m6, m4, m13
+    paddw           m2, m6
+    pmaddubsw       m4, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm6, [r0]                       ; m6 = row 20
+    punpckhbw       xm7, xm5, xm6
+    punpcklbw       xm5, xm6
+    vinserti128     m5, m5, xm7, 1
+    pmaddubsw       m7, m5, m13
+    paddw           m3, m7
+    pmaddubsw       m5, m12
+    movu            xm7, [r0 + r1]                  ; m7 = row 21
+    punpckhbw       xm0, xm6, xm7
+    punpcklbw       xm6, xm7
+    vinserti128     m6, m6, xm0, 1
+    pmaddubsw       m0, m6, m13
+    paddw           m4, m0
+    pmaddubsw       m6, m12
+    movu            xm0, [r0 + r1 * 2]              ; m0 = row 22
+    punpckhbw       xm1, xm7, xm0
+    punpcklbw       xm7, xm0
+    vinserti128     m7, m7, xm1, 1
+    pmaddubsw       m1, m7, m13
+    paddw           m5, m1
+    pmaddubsw       m7, m12
+    movu            xm1, [r0 + r4]                  ; m1 = row 23
+    punpckhbw       xm8, xm0, xm1
+    punpcklbw       xm0, xm1
+    vinserti128     m0, m0, xm8, 1
+    pmaddubsw       m8, m0, m13
+    paddw           m6, m8
+    pmaddubsw       m0, m12
+    lea             r0, [r0 + r1 * 4]
+    movu            xm8, [r0]                       ; m8 = row 24
+    punpckhbw       xm9, xm1, xm8
+    punpcklbw       xm1, xm8
+    vinserti128     m1, m1, xm9, 1
+    pmaddubsw       m9, m1, m13
+    paddw           m7, m9
+    pmaddubsw       m1, m12
+    movu            xm9, [r0 + r1]                  ; m9 = row 25
+    punpckhbw       xm10, xm8, xm9
+    punpcklbw       xm8, xm9
+    vinserti128     m8, m8, xm10, 1
+    pmaddubsw       m8, m13
+    paddw           m0, m8
+    movu            xm10, [r0 + r1 * 2]             ; m10 = row 26
+    punpckhbw       xm11, xm9, xm10
+    punpcklbw       xm9, xm10
+    vinserti128     m9, m9, xm11, 1
+    pmaddubsw       m9, m13
+    paddw           m1, m9
+
+%ifidn %1,pp
+    pmulhrsw        m2, m14                         ; m2 = word: row 16
+    pmulhrsw        m3, m14                         ; m3 = word: row 17
+    pmulhrsw        m4, m14                         ; m4 = word: row 18
+    pmulhrsw        m5, m14                         ; m5 = word: row 19
+    pmulhrsw        m6, m14                         ; m6 = word: row 20
+    pmulhrsw        m7, m14                         ; m7 = word: row 21
+    pmulhrsw        m0, m14                         ; m0 = word: row 22
+    pmulhrsw        m1, m14                         ; m1 = word: row 23
+    packuswb        m2, m3
+    packuswb        m4, m5
+    packuswb        m6, m7
+    packuswb        m0, m1
+    vpermq          m2, m2, q3120
+    vpermq          m4, m4, q3120
+    vpermq          m6, m6, q3120
+    vpermq          m0, m0, q3120
+    vextracti128    xm3, m2, 1
+    vextracti128    xm5, m4, 1
+    vextracti128    xm7, m6, 1
+    vextracti128    xm1, m0, 1
+    movu            [r2], xm2
+    movu            [r2 + r3], xm3
+    movu            [r2 + r3 * 2], xm4
+    movu            [r2 + r5], xm5
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], xm6
+    movu            [r2 + r3], xm7
+    movu            [r2 + r3 * 2], xm0
+    movu            [r2 + r5], xm1
+%else
+    psubw           m2, m14                         ; m2 = word: row 16
+    psubw           m3, m14                         ; m3 = word: row 17
+    psubw           m4, m14                         ; m4 = word: row 18
+    psubw           m5, m14                         ; m5 = word: row 19
+    psubw           m6, m14                         ; m6 = word: row 20
+    psubw           m7, m14                         ; m7 = word: row 21
+    psubw           m0, m14                         ; m0 = word: row 22
+    psubw           m1, m14                         ; m1 = word: row 23
+    movu            [r2], m2
+    movu            [r2 + r3], m3
+    movu            [r2 + r3 * 2], m4
+    movu            [r2 + r5], m5
+    lea             r2, [r2 + r3 * 4]
+    movu            [r2], m6
+    movu            [r2 + r3], m7
+    movu            [r2 + r3 * 2], m0
+    movu            [r2 + r5], m1
+%endif
+    RET
+%endif
+%endmacro
+
+    FILTER_VER_CHROMA_AVX2_16x24 pp
+    FILTER_VER_CHROMA_AVX2_16x24 ps
 
 %macro FILTER_VER_CHROMA_AVX2_24x32 1
 INIT_YMM avx2
@@ -6863,8 +9113,8 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_24x32 pp
-FILTER_VER_CHROMA_AVX2_24x32 ps
+    FILTER_VER_CHROMA_AVX2_24x32 pp
+    FILTER_VER_CHROMA_AVX2_24x32 ps
 
 %macro FILTER_VER_CHROMA_AVX2_16x4 1
 INIT_YMM avx2
@@ -6961,12 +9211,12 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_16x4 pp
-FILTER_VER_CHROMA_AVX2_16x4 ps
+    FILTER_VER_CHROMA_AVX2_16x4 pp
+    FILTER_VER_CHROMA_AVX2_16x4 ps
 
-%macro FILTER_VER_CHROMA_AVX2_12x16 1
+%macro FILTER_VER_CHROMA_AVX2_12xN 2
 INIT_YMM avx2
-cglobal interp_4tap_vert_%1_12x16, 4, 7, 8
+cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8
     mov             r4d, r4m
     shl             r4d, 6
 
@@ -6986,7 +9236,7 @@
     vbroadcasti128  m7, [pw_2000]
 %endif
     lea             r6, [r3 * 3]
-
+%rep %2 / 16
     movu            xm0, [r0]                       ; m0 = row 0
     movu            xm1, [r0 + r1]                  ; m1 = row 1
     punpckhbw       xm2, xm0, xm1
@@ -7272,11 +9522,15 @@
     vextracti128    xm5, m5, 1
     movq            [r2 + r6 + 16], xm5
 %endif
+    lea             r2, [r2 + r3 * 4]
+%endrep
     RET
 %endmacro
 
-FILTER_VER_CHROMA_AVX2_12x16 pp
-FILTER_VER_CHROMA_AVX2_12x16 ps
+    FILTER_VER_CHROMA_AVX2_12xN pp, 16
+    FILTER_VER_CHROMA_AVX2_12xN ps, 16
+    FILTER_VER_CHROMA_AVX2_12xN pp, 32
+    FILTER_VER_CHROMA_AVX2_12xN ps, 32
 
 ;-----------------------------------------------------------------------------
 ;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -7285,121 +9539,121 @@
 INIT_XMM sse4
 cglobal interp_4tap_vert_pp_24x%2, 4, 6, 8
 
-mov         r4d,       r4m
-sub         r0,        r1
+    mov         r4d,       r4m
+    sub         r0,        r1
 
 %ifdef PIC
-lea         r5,        [tab_ChromaCoeff]
-movd        m0,        [r5 + r4 * 4]
+    lea         r5,        [tab_ChromaCoeff]
+    movd        m0,        [r5 + r4 * 4]
 %else
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
 %endif
 
-pshufb      m1,        m0,       [tab_Vm]
-pshufb      m0,        [tab_Vm + 16]
+    pshufb      m1,        m0,       [tab_Vm]
+    pshufb      m0,        [tab_Vm + 16]
 
-mov         r4d,       %2
+    mov         r4d,       %2
 
.loop:
-movu        m2,        [r0]
-movu        m3,        [r0 + r1]
+    movu        m2,        [r0]
5389
+    movu        m3,        [r0 + r1]
5390
 
5391
-punpcklbw   m4,        m2,        m3
5392
-punpckhbw   m2,        m3
5393
+    punpcklbw   m4,        m2,        m3
5394
+    punpckhbw   m2,        m3
5395
 
5396
-pmaddubsw   m4,        m1
5397
-pmaddubsw   m2,        m1
5398
+    pmaddubsw   m4,        m1
5399
+    pmaddubsw   m2,        m1
5400
 
5401
-lea         r5,        [r0 + 2 * r1]
5402
-movu        m5,        [r5]
5403
-movu        m7,        [r5 + r1]
5404
+    lea         r5,        [r0 + 2 * r1]
5405
+    movu        m5,        [r5]
5406
+    movu        m7,        [r5 + r1]
5407
 
5408
-punpcklbw   m6,        m5,        m7
5409
-pmaddubsw   m6,        m0
5410
-paddw       m4,        m6
5411
+    punpcklbw   m6,        m5,        m7
5412
+    pmaddubsw   m6,        m0
5413
+    paddw       m4,        m6
5414
 
5415
-punpckhbw   m6,        m5,        m7
5416
-pmaddubsw   m6,        m0
5417
-paddw       m2,        m6
5418
+    punpckhbw   m6,        m5,        m7
5419
+    pmaddubsw   m6,        m0
5420
+    paddw       m2,        m6
5421
 
5422
-mova        m6,        [pw_512]
5423
+    mova        m6,        [pw_512]
5424
 
5425
-pmulhrsw    m4,        m6
5426
-pmulhrsw    m2,        m6
5427
+    pmulhrsw    m4,        m6
5428
+    pmulhrsw    m2,        m6
5429
 
5430
-packuswb    m4,        m2
5431
+    packuswb    m4,        m2
5432
 
5433
-movu        [r2],      m4
5434
+    movu        [r2],      m4
5435
 
5436
-punpcklbw   m4,        m3,        m5
5437
-punpckhbw   m3,        m5
5438
+    punpcklbw   m4,        m3,        m5
5439
+    punpckhbw   m3,        m5
5440
 
5441
-pmaddubsw   m4,        m1
5442
-pmaddubsw   m3,        m1
5443
+    pmaddubsw   m4,        m1
5444
+    pmaddubsw   m3,        m1
5445
 
5446
-movu        m2,        [r5 + 2 * r1]
5447
+    movu        m2,        [r5 + 2 * r1]
5448
 
5449
-punpcklbw   m5,        m7,        m2
5450
-punpckhbw   m7,        m2
5451
+    punpcklbw   m5,        m7,        m2
5452
+    punpckhbw   m7,        m2
5453
 
5454
-pmaddubsw   m5,        m0
5455
-pmaddubsw   m7,        m0
5456
+    pmaddubsw   m5,        m0
5457
+    pmaddubsw   m7,        m0
5458
 
5459
-paddw       m4,        m5
5460
-paddw       m3,        m7
5461
+    paddw       m4,        m5
5462
+    paddw       m3,        m7
5463
 
5464
-pmulhrsw    m4,        m6
5465
-pmulhrsw    m3,        m6
5466
+    pmulhrsw    m4,        m6
5467
+    pmulhrsw    m3,        m6
5468
 
5469
-packuswb    m4,        m3
5470
+    packuswb    m4,        m3
5471
 
5472
-movu        [r2 + r3],      m4
5473
+    movu        [r2 + r3],      m4
5474
 
5475
-movq        m2,        [r0 + 16]
5476
-movq        m3,        [r0 + r1 + 16]
5477
-movq        m4,        [r5 + 16]
5478
-movq        m5,        [r5 + r1 + 16]
5479
+    movq        m2,        [r0 + 16]
5480
+    movq        m3,        [r0 + r1 + 16]
5481
+    movq        m4,        [r5 + 16]
5482
+    movq        m5,        [r5 + r1 + 16]
5483
 
5484
-punpcklbw   m2,        m3
5485
-punpcklbw   m4,        m5
5486
+    punpcklbw   m2,        m3
5487
+    punpcklbw   m4,        m5
5488
 
5489
-pmaddubsw   m2,        m1
5490
-pmaddubsw   m4,        m0
5491
+    pmaddubsw   m2,        m1
5492
+    pmaddubsw   m4,        m0
5493
 
5494
-paddw       m2,        m4
5495
+    paddw       m2,        m4
5496
 
5497
-pmulhrsw    m2,        m6
5498
+    pmulhrsw    m2,        m6
5499
 
5500
-movq        m3,        [r0 + r1 + 16]
5501
-movq        m4,        [r5 + 16]
5502
-movq        m5,        [r5 + r1 + 16]
5503
-movq        m7,        [r5 + 2 * r1 + 16]
5504
+    movq        m3,        [r0 + r1 + 16]
5505
+    movq        m4,        [r5 + 16]
5506
+    movq        m5,        [r5 + r1 + 16]
5507
+    movq        m7,        [r5 + 2 * r1 + 16]
5508
 
5509
-punpcklbw   m3,        m4
5510
-punpcklbw   m5,        m7
5511
+    punpcklbw   m3,        m4
5512
+    punpcklbw   m5,        m7
5513
 
5514
-pmaddubsw   m3,        m1
5515
-pmaddubsw   m5,        m0
5516
+    pmaddubsw   m3,        m1
5517
+    pmaddubsw   m5,        m0
5518
 
5519
-paddw       m3,        m5
5520
+    paddw       m3,        m5
5521
 
5522
-pmulhrsw    m3,        m6
5523
-packuswb    m2,        m3
5524
+    pmulhrsw    m3,        m6
5525
+    packuswb    m2,        m3
5526
 
5527
-movh        [r2 + 16], m2
5528
-movhps      [r2 + r3 + 16], m2
5529
+    movh        [r2 + 16], m2
5530
+    movhps      [r2 + r3 + 16], m2
5531
 
5532
-mov         r0,        r5
5533
-lea         r2,        [r2 + 2 * r3]
5534
+    mov         r0,        r5
5535
+    lea         r2,        [r2 + 2 * r3]
5536
 
5537
-sub         r4,        2
5538
-jnz        .loop
5539
-RET
5540
+    sub         r4,        2
5541
+    jnz        .loop
5542
+    RET
5543
 %endmacro
5544
 
5545
-FILTER_V4_W24 24, 32
5546
+    FILTER_V4_W24 24, 32
5547
 
5548
-FILTER_V4_W24 24, 64
5549
+    FILTER_V4_W24 24, 64
5550
 
5551
 ;-----------------------------------------------------------------------------
5552
 ; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
5553
@@ -7408,100 +9662,100 @@
5554
 INIT_XMM sse4
5555
 cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8
5556
 
5557
-mov         r4d,       r4m
5558
-sub         r0,        r1
5559
+    mov         r4d,       r4m
5560
+    sub         r0,        r1
5561
 
5562
 %ifdef PIC
5563
-lea         r5,        [tab_ChromaCoeff]
5564
-movd        m0,        [r5 + r4 * 4]
5565
+    lea         r5,        [tab_ChromaCoeff]
5566
+    movd        m0,        [r5 + r4 * 4]
5567
 %else
5568
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
5569
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
5570
 %endif
5571
 
5572
-pshufb      m1,        m0,       [tab_Vm]
5573
-pshufb      m0,        [tab_Vm + 16]
5574
+    pshufb      m1,        m0,       [tab_Vm]
5575
+    pshufb      m0,        [tab_Vm + 16]
5576
 
5577
-mova        m7,        [pw_512]
5578
+    mova        m7,        [pw_512]
5579
 
5580
-mov         r4d,       %2
5581
+    mov         r4d,       %2
5582
 
5583
 .loop:
5584
-movu        m2,        [r0]
5585
-movu        m3,        [r0 + r1]
5586
+    movu        m2,        [r0]
5587
+    movu        m3,        [r0 + r1]
5588
 
5589
-punpcklbw   m4,        m2,        m3
5590
-punpckhbw   m2,        m3
5591
+    punpcklbw   m4,        m2,        m3
5592
+    punpckhbw   m2,        m3
5593
 
5594
-pmaddubsw   m4,        m1
5595
-pmaddubsw   m2,        m1
5596
+    pmaddubsw   m4,        m1
5597
+    pmaddubsw   m2,        m1
5598
 
5599
-lea         r5,        [r0 + 2 * r1]
5600
-movu        m3,        [r5]
5601
-movu        m5,        [r5 + r1]
5602
+    lea         r5,        [r0 + 2 * r1]
5603
+    movu        m3,        [r5]
5604
+    movu        m5,        [r5 + r1]
5605
 
5606
-punpcklbw   m6,        m3,        m5
5607
-punpckhbw   m3,        m5
5608
+    punpcklbw   m6,        m3,        m5
5609
+    punpckhbw   m3,        m5
5610
 
5611
-pmaddubsw   m6,        m0
5612
-pmaddubsw   m3,        m0
5613
+    pmaddubsw   m6,        m0
5614
+    pmaddubsw   m3,        m0
5615
 
5616
-paddw       m4,        m6
5617
-paddw       m2,        m3
5618
+    paddw       m4,        m6
5619
+    paddw       m2,        m3
5620
 
5621
-pmulhrsw    m4,        m7
5622
-pmulhrsw    m2,        m7
5623
+    pmulhrsw    m4,        m7
5624
+    pmulhrsw    m2,        m7
5625
 
5626
-packuswb    m4,        m2
5627
+    packuswb    m4,        m2
5628
 
5629
-movu        [r2],      m4
5630
+    movu        [r2],      m4
5631
 
5632
-movu        m2,        [r0 + 16]
5633
-movu        m3,        [r0 + r1 + 16]
5634
+    movu        m2,        [r0 + 16]
5635
+    movu        m3,        [r0 + r1 + 16]
5636
 
5637
-punpcklbw   m4,        m2,        m3
5638
-punpckhbw   m2,        m3
5639
+    punpcklbw   m4,        m2,        m3
5640
+    punpckhbw   m2,        m3
5641
 
5642
-pmaddubsw   m4,        m1
5643
-pmaddubsw   m2,        m1
5644
+    pmaddubsw   m4,        m1
5645
+    pmaddubsw   m2,        m1
5646
 
5647
-movu        m3,        [r5 + 16]
5648
-movu        m5,        [r5 + r1 + 16]
5649
+    movu        m3,        [r5 + 16]
5650
+    movu        m5,        [r5 + r1 + 16]
5651
 
5652
-punpcklbw   m6,        m3,        m5
5653
-punpckhbw   m3,        m5
5654
+    punpcklbw   m6,        m3,        m5
5655
+    punpckhbw   m3,        m5
5656
 
5657
-pmaddubsw   m6,        m0
5658
-pmaddubsw   m3,        m0
5659
+    pmaddubsw   m6,        m0
5660
+    pmaddubsw   m3,        m0
5661
 
5662
-paddw       m4,        m6
5663
-paddw       m2,        m3
5664
+    paddw       m4,        m6
5665
+    paddw       m2,        m3
5666
 
5667
-pmulhrsw    m4,        m7
5668
-pmulhrsw    m2,        m7
5669
+    pmulhrsw    m4,        m7
5670
+    pmulhrsw    m2,        m7
5671
 
5672
-packuswb    m4,        m2
5673
+    packuswb    m4,        m2
5674
 
5675
-movu        [r2 + 16], m4
5676
+    movu        [r2 + 16], m4
5677
 
5678
-lea         r0,        [r0 + r1]
5679
-lea         r2,        [r2 + r3]
5680
+    lea         r0,        [r0 + r1]
5681
+    lea         r2,        [r2 + r3]
5682
 
5683
-dec         r4
5684
-jnz        .loop
5685
-RET
5686
+    dec         r4
5687
+    jnz        .loop
5688
+    RET
5689
 %endmacro
5690
 
5691
-FILTER_V4_W32 32,  8
5692
-FILTER_V4_W32 32, 16
5693
-FILTER_V4_W32 32, 24
5694
-FILTER_V4_W32 32, 32
5695
+    FILTER_V4_W32 32,  8
5696
+    FILTER_V4_W32 32, 16
5697
+    FILTER_V4_W32 32, 24
5698
+    FILTER_V4_W32 32, 32
5699
 
5700
-FILTER_V4_W32 32, 48
5701
-FILTER_V4_W32 32, 64
5702
+    FILTER_V4_W32 32, 48
5703
+    FILTER_V4_W32 32, 64
5704
 
5705
 %macro FILTER_VER_CHROMA_AVX2_32xN 2
5706
-INIT_YMM avx2
5707
 %if ARCH_X86_64 == 1
5708
+INIT_YMM avx2
5709
 cglobal interp_4tap_vert_%1_32x%2, 4, 7, 13
5710
     mov             r4d, r4m
5711
     shl             r4d, 6
5712
@@ -7631,14 +9885,18 @@
5713
 %endif
5714
 %endmacro
5715
 
5716
-FILTER_VER_CHROMA_AVX2_32xN pp, 32
5717
-FILTER_VER_CHROMA_AVX2_32xN pp, 24
5718
-FILTER_VER_CHROMA_AVX2_32xN pp, 16
5719
-FILTER_VER_CHROMA_AVX2_32xN pp, 8
5720
-FILTER_VER_CHROMA_AVX2_32xN ps, 32
5721
-FILTER_VER_CHROMA_AVX2_32xN ps, 24
5722
-FILTER_VER_CHROMA_AVX2_32xN ps, 16
5723
-FILTER_VER_CHROMA_AVX2_32xN ps, 8
5724
+    FILTER_VER_CHROMA_AVX2_32xN pp, 64
5725
+    FILTER_VER_CHROMA_AVX2_32xN pp, 48
5726
+    FILTER_VER_CHROMA_AVX2_32xN pp, 32
5727
+    FILTER_VER_CHROMA_AVX2_32xN pp, 24
5728
+    FILTER_VER_CHROMA_AVX2_32xN pp, 16
5729
+    FILTER_VER_CHROMA_AVX2_32xN pp, 8
5730
+    FILTER_VER_CHROMA_AVX2_32xN ps, 64
5731
+    FILTER_VER_CHROMA_AVX2_32xN ps, 48
5732
+    FILTER_VER_CHROMA_AVX2_32xN ps, 32
5733
+    FILTER_VER_CHROMA_AVX2_32xN ps, 24
5734
+    FILTER_VER_CHROMA_AVX2_32xN ps, 16
5735
+    FILTER_VER_CHROMA_AVX2_32xN ps, 8
5736
 
5737
 ;-----------------------------------------------------------------------------
5738
 ; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
5739
@@ -7647,413 +9905,1338 @@
5740
 INIT_XMM sse4
5741
 cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8
5742
 
5743
-mov         r4d,       r4m
5744
-sub         r0,        r1
5745
+    mov         r4d,       r4m
5746
+    sub         r0,        r1
5747
 
5748
 %ifdef PIC
5749
-lea         r5,        [tab_ChromaCoeff]
5750
-movd        m0,        [r5 + r4 * 4]
5751
+    lea         r5,        [tab_ChromaCoeff]
5752
+    movd        m0,        [r5 + r4 * 4]
5753
 %else
5754
-movd        m0,        [tab_ChromaCoeff + r4 * 4]
5755
+    movd        m0,        [tab_ChromaCoeff + r4 * 4]
5756
 %endif
5757
 
5758
-pshufb      m1,        m0,       [tab_Vm]
5759
-pshufb      m0,        [tab_Vm + 16]
5760
+    pshufb      m1,        m0,       [tab_Vm]
5761
+    pshufb      m0,        [tab_Vm + 16]
5762
 
5763
-mov         r4d,       %2/2
5764
+    mov         r4d,       %2/2
5765
 
5766
 .loop:
5767
 
5768
-mov         r6d,       %1/16
5769
+    mov         r6d,       %1/16
5770
 
5771
 .loopW:
5772
 
5773
-movu        m2,        [r0]
5774
-movu        m3,        [r0 + r1]
5775
+    movu        m2,        [r0]
5776
+    movu        m3,        [r0 + r1]
5777
+
5778
+    punpcklbw   m4,        m2,        m3
5779
+    punpckhbw   m2,        m3
5780
+
5781
+    pmaddubsw   m4,        m1
5782
+    pmaddubsw   m2,        m1
5783
+
5784
+    lea         r5,        [r0 + 2 * r1]
5785
+    movu        m5,        [r5]
5786
+    movu        m6,        [r5 + r1]
5787
 
5788
-punpcklbw   m4,        m2,        m3
5789
-punpckhbw   m2,        m3
5790
+    punpckhbw   m7,        m5,        m6
5791
+    pmaddubsw   m7,        m0
5792
+    paddw       m2,        m7
5793
 
5794
-pmaddubsw   m4,        m1
5795
-pmaddubsw   m2,        m1
5796
+    punpcklbw   m7,        m5,        m6
5797
+    pmaddubsw   m7,        m0
5798
+    paddw       m4,        m7
5799
 
5800
-lea         r5,        [r0 + 2 * r1]
5801
-movu        m5,        [r5]
5802
-movu        m6,        [r5 + r1]
5803
+    mova        m7,        [pw_512]
5804
 
5805
-punpckhbw   m7,        m5,        m6
5806
-pmaddubsw   m7,        m0
5807
-paddw       m2,        m7
5808
+    pmulhrsw    m4,        m7
5809
+    pmulhrsw    m2,        m7
5810
 
5811
-punpcklbw   m7,        m5,        m6
5812
-pmaddubsw   m7,        m0
5813
-paddw       m4,        m7
5814
+    packuswb    m4,        m2
5815
 
5816
-mova        m7,        [pw_512]
5817
+    movu        [r2],      m4
5818
 
5819
-pmulhrsw    m4,        m7
5820
-pmulhrsw    m2,        m7
5821
+    punpcklbw   m4,        m3,        m5
5822
+    punpckhbw   m3,        m5
5823
 
5824
-packuswb    m4,        m2
5825
+    pmaddubsw   m4,        m1
5826
+    pmaddubsw   m3,        m1
5827
 
5828
-movu        [r2],      m4
5829
+    movu        m5,        [r5 + 2 * r1]
5830
 
5831
-punpcklbw   m4,        m3,        m5
5832
-punpckhbw   m3,        m5
5833
+    punpcklbw   m2,        m6,        m5
5834
+    punpckhbw   m6,        m5
5835
 
5836
-pmaddubsw   m4,        m1
5837
-pmaddubsw   m3,        m1
5838
+    pmaddubsw   m2,        m0
5839
+    pmaddubsw   m6,        m0
5840
 
5841
-movu        m5,        [r5 + 2 * r1]
5842
+    paddw       m4,        m2
5843
+    paddw       m3,        m6
5844
 
5845
-punpcklbw   m2,        m6,        m5
5846
-punpckhbw   m6,        m5
5847
+    pmulhrsw    m4,        m7
5848
+    pmulhrsw    m3,        m7
5849
 
5850
-pmaddubsw   m2,        m0
5851
-pmaddubsw   m6,        m0
5852
+    packuswb    m4,        m3
5853
 
5854
-paddw       m4,        m2
5855
-paddw       m3,        m6
5856
+    movu        [r2 + r3],      m4
5857
 
5858
-pmulhrsw    m4,        m7
5859
-pmulhrsw    m3,        m7
5860
+    add         r0,        16
5861
+    add         r2,        16
5862
+    dec         r6d
5863
+    jnz         .loopW
5864
 
5865
-packuswb    m4,        m3
5866
+    lea         r0,        [r0 + r1 * 2 - %1]
5867
+    lea         r2,        [r2 + r3 * 2 - %1]
5868
 
5869
-movu        [r2 + r3],      m4
5870
+    dec         r4d
5871
+    jnz        .loop
5872
+    RET
5873
+%endmacro
5874
 
5875
-add         r0,        16
5876
-add         r2,        16
5877
-dec         r6d
5878
-jnz         .loopW
5879
+    FILTER_V4_W16n_H2 64, 64
5880
+    FILTER_V4_W16n_H2 64, 32
5881
+    FILTER_V4_W16n_H2 64, 48
5882
+    FILTER_V4_W16n_H2 48, 64
5883
+    FILTER_V4_W16n_H2 64, 16
5884
 
5885
-lea         r0,        [r0 + r1 * 2 - %1]
5886
-lea         r2,        [r2 + r3 * 2 - %1]
5887
+;-----------------------------------------------------------------------------
5888
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
5889
+;-----------------------------------------------------------------------------
5890
+%macro P2S_H_2xN 1
5891
+INIT_XMM sse4
5892
+cglobal filterPixelToShort_2x%1, 3, 4, 3
5893
+    mov         r3d, r3m
5894
+    add         r3d, r3d
5895
 
5896
-dec         r4d
5897
-jnz        .loop
5898
-RET
5899
+    ; load constant
5900
+    mova        m1, [pb_128]
5901
+    mova        m2, [tab_c_64_n64]
5902
+
5903
+%rep %1/2
5904
+    movd        m0, [r0]
5905
+    pinsrd      m0, [r0 + r1], 1
5906
+    punpcklbw   m0, m1
5907
+    pmaddubsw   m0, m2
5908
+
5909
+    movd        [r2 + r3 * 0], m0
5910
+    pextrd      [r2 + r3 * 1], m0, 2
5911
+
5912
+    lea         r0, [r0 + r1 * 2]
5913
+    lea         r2, [r2 + r3 * 2]
5914
+%endrep
5915
+    RET
5916
 %endmacro
5917
+    P2S_H_2xN 4
5918
+    P2S_H_2xN 8
5919
+    P2S_H_2xN 16
5920
 
5921
-FILTER_V4_W16n_H2 64, 64
5922
-FILTER_V4_W16n_H2 64, 32
5923
-FILTER_V4_W16n_H2 64, 48
5924
-FILTER_V4_W16n_H2 48, 64
5925
-FILTER_V4_W16n_H2 64, 16
5926
 ;-----------------------------------------------------------------------------
5927
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
5928
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
5929
 ;-----------------------------------------------------------------------------
5930
-%macro PIXEL_WH_4xN 2
5931
-INIT_XMM ssse3
5932
-cglobal pixelToShort_%1x%2, 3, 7, 6
5933
+%macro P2S_H_4xN 1
5934
+INIT_XMM sse4
5935
+cglobal filterPixelToShort_4x%1, 3, 6, 4
5936
+    mov         r3d, r3m
5937
+    add         r3d, r3d
5938
+    lea         r4, [r3 * 3]
5939
+    lea         r5, [r1 * 3]
5940
+
5941
+    ; load constant
5942
+    mova        m2, [pb_128]
5943
+    mova        m3, [tab_c_64_n64]
5944
+
5945
+%assign x 0
5946
+%rep %1/4
5947
+    movd        m0, [r0]
5948
+    pinsrd      m0, [r0 + r1], 1
5949
+    punpcklbw   m0, m2
5950
+    pmaddubsw   m0, m3
5951
+
5952
+    movd        m1, [r0 + r1 * 2]
5953
+    pinsrd      m1, [r0 + r5], 1
5954
+    punpcklbw   m1, m2
5955
+    pmaddubsw   m1, m3
5956
+
5957
+    movq        [r2 + r3 * 0], m0
5958
+    movq        [r2 + r3 * 2], m1
5959
+    movhps      [r2 + r3 * 1], m0
5960
+    movhps      [r2 + r4], m1
5961
+%assign x x+1
5962
+%if (x != %1/4)
5963
+    lea         r0, [r0 + r1 * 4]
5964
+    lea         r2, [r2 + r3 * 4]
5965
+%endif
5966
+%endrep
5967
+    RET
5968
+%endmacro
5969
+    P2S_H_4xN 4
5970
+    P2S_H_4xN 8
5971
+    P2S_H_4xN 16
5972
+    P2S_H_4xN 32
5973
+
5974
+;-----------------------------------------------------------------------------
5975
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
5976
+;-----------------------------------------------------------------------------
5977
+%macro P2S_H_6xN 1
5978
+INIT_XMM sse4
5979
+cglobal filterPixelToShort_6x%1, 3, 7, 6
5980
+    mov         r3d, r3m
5981
+    add         r3d, r3d
5982
+    lea         r4, [r1 * 3]
5983
+    lea         r5, [r3 * 3]
5984
+
5985
+    ; load height
5986
+    mov         r6d, %1/4
5987
 
5988
-    ; load width and height
5989
-    mov         r3d, %1
5990
-    mov         r4d, %2
5991
     ; load constant
5992
     mova        m4, [pb_128]
5993
     mova        m5, [tab_c_64_n64]
5994
-.loopH:
5995
-    xor         r5d, r5d
5996
 
5997
-.loopW:
5998
-    mov         r6, r0
5999
-    movh        m0, [r6]
6000
+.loop:
6001
+    movh        m0, [r0]
6002
     punpcklbw   m0, m4
6003
     pmaddubsw   m0, m5
6004
 
6005
-    movh        m1, [r6 + r1]
6006
+    movh        m1, [r0 + r1]
6007
     punpcklbw   m1, m4
6008
     pmaddubsw   m1, m5
6009
 
6010
-    movh        m2, [r6 + r1 * 2]
6011
+    movh        m2, [r0 + r1 * 2]
6012
     punpcklbw   m2, m4
6013
     pmaddubsw   m2, m5
6014
 
6015
-    lea         r6, [r6 + r1 * 2]
6016
-    movh        m3, [r6 + r1]
6017
+    movh        m3, [r0 + r4]
6018
     punpcklbw   m3, m4
6019
     pmaddubsw   m3, m5
6020
 
6021
-    add         r5, 8
6022
-    cmp         r5, r3
6023
-    jg          .width4
6024
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
6025
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
6026
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
6027
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
6028
-    je          .nextH
6029
-    jmp         .loopW
6030
-
6031
-.width4:
6032
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
6033
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
6034
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
6035
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
6036
+    movh        [r2 + r3 * 0], m0
6037
+    pextrd      [r2 + r3 * 0 + 8], m0, 2
6038
+    movh        [r2 + r3 * 1], m1
6039
+    pextrd      [r2 + r3 * 1 + 8], m1, 2
6040
+    movh        [r2 + r3 * 2], m2
6041
+    pextrd      [r2 + r3 * 2 + 8], m2, 2
6042
+    movh        [r2 + r5], m3
6043
+    pextrd      [r2 + r5 + 8], m3, 2
6044
 
6045
-.nextH:
6046
     lea         r0, [r0 + r1 * 4]
6047
-    add         r2, FENC_STRIDE * 8
6048
+    lea         r2, [r2 + r3 * 4]
6049
 
6050
-    sub         r4d, 4
6051
-    jnz         .loopH
6052
+    dec         r6d
6053
+    jnz         .loop
6054
     RET
6055
 %endmacro
6056
-PIXEL_WH_4xN 4, 4
6057
-PIXEL_WH_4xN 4, 8
6058
-PIXEL_WH_4xN 4, 16
6059
+    P2S_H_6xN 8
6060
+    P2S_H_6xN 16
6061
 
6062
 ;-----------------------------------------------------------------------------
6063
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
6064
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
6065
 ;-----------------------------------------------------------------------------
6066
-%macro PIXEL_WH_8xN 2
6067
+%macro P2S_H_8xN 1
6068
 INIT_XMM ssse3
6069
-cglobal pixelToShort_%1x%2, 3, 7, 6
6070
+cglobal filterPixelToShort_8x%1, 3, 7, 6
6071
+    mov         r3d, r3m
6072
+    add         r3d, r3d
6073
+    lea         r5, [r1 * 3]
6074
+    lea         r6, [r3 * 3]
6075
 
6076
-    ; load width and height
6077
-    mov         r3d, %1
6078
-    mov         r4d, %2
6079
+    ; load height
6080
+    mov         r4d, %1/4
6081
 
6082
     ; load constant
6083
     mova        m4, [pb_128]
6084
     mova        m5, [tab_c_64_n64]
6085
 
6086
-.loopH
6087
-    xor         r5d, r5d
6088
-.loopW
6089
-    lea         r6, [r0 + r5]
6090
-
6091
-    movh        m0, [r6]
6092
+.loop
6093
+    movh        m0, [r0]
6094
     punpcklbw   m0, m4
6095
     pmaddubsw   m0, m5
6096
 
6097
-    movh        m1, [r6 + r1]
6098
+    movh        m1, [r0 + r1]
6099
     punpcklbw   m1, m4
6100
     pmaddubsw   m1, m5
6101
 
6102
-    movh        m2, [r6 + r1 * 2]
6103
+    movh        m2, [r0 + r1 * 2]
6104
     punpcklbw   m2, m4
6105
     pmaddubsw   m2, m5
6106
 
6107
-    lea         r6, [r6 + r1 * 2]
6108
-    movh        m3, [r6 + r1]
6109
+    movh        m3, [r0 + r5]
6110
     punpcklbw   m3, m4
6111
     pmaddubsw   m3, m5
6112
 
6113
-    add         r5, 8
6114
-    cmp         r5, r3
6115
-
6116
-    movu        [r2 + FENC_STRIDE * 0], m0
6117
-    movu        [r2 + FENC_STRIDE * 2], m1
6118
-    movu        [r2 + FENC_STRIDE * 4], m2
6119
-    movu        [r2 + FENC_STRIDE * 6], m3
6120
-
6121
-    je          .nextH
6122
-    jmp         .loopW
6123
-
6124
+    movu        [r2 + r3 * 0], m0
6125
+    movu        [r2 + r3 * 1], m1
6126
+    movu        [r2 + r3 * 2], m2
6127
+    movu        [r2 + r6 ], m3
6128
 
6129
-.nextH:
6130
     lea         r0, [r0 + r1 * 4]
6131
-    add         r2, FENC_STRIDE * 8
6132
+    lea         r2, [r2 + r3 * 4]
6133
 
6134
-    sub         r4d, 4
6135
-    jnz         .loopH
6136
+    dec         r4d
6137
+    jnz         .loop
6138
     RET
6139
 %endmacro
6140
-PIXEL_WH_8xN 8, 8
6141
-PIXEL_WH_8xN 8, 4
6142
-PIXEL_WH_8xN 8, 16
6143
-PIXEL_WH_8xN 8, 32
6144
+    P2S_H_8xN 8
6145
+    P2S_H_8xN 4
6146
+    P2S_H_8xN 16
6147
+    P2S_H_8xN 32
6148
+    P2S_H_8xN 12
6149
+    P2S_H_8xN 64
6150
+
6151
+;-----------------------------------------------------------------------------
6152
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
6153
+;-----------------------------------------------------------------------------
6154
+INIT_XMM ssse3
6155
+cglobal filterPixelToShort_8x6, 3, 7, 5
6156
+    mov         r3d, r3m
6157
+    add         r3d, r3d
6158
+    lea         r4, [r1 * 3]
6159
+    lea         r5, [r1 * 5]
6160
+    lea         r6, [r3 * 3]
6161
+
6162
+    ; load constant
6163
+    mova        m3, [pb_128]
6164
+    mova        m4, [tab_c_64_n64]
6165
+
6166
+    movh        m0, [r0]
6167
+    punpcklbw   m0, m3
6168
+    pmaddubsw   m0, m4
6169
+
6170
+    movh        m1, [r0 + r1]
6171
+    punpcklbw   m1, m3
6172
+    pmaddubsw   m1, m4
6173
+
6174
+    movh        m2, [r0 + r1 * 2]
6175
+    punpcklbw   m2, m3
6176
+    pmaddubsw   m2, m4
6177
 
6178
+    movu        [r2 + r3 * 0], m0
6179
+    movu        [r2 + r3 * 1], m1
6180
+    movu        [r2 + r3 * 2], m2
6181
+
6182
+    movh        m0, [r0 + r4]
6183
+    punpcklbw   m0, m3
6184
+    pmaddubsw   m0, m4
6185
+
6186
+    movh        m1, [r0 + r1 * 4]
6187
+    punpcklbw   m1, m3
6188
+    pmaddubsw   m1, m4
6189
+
6190
+    movh        m2, [r0 + r5]
6191
+    punpcklbw   m2, m3
6192
+    pmaddubsw   m2, m4
6193
+
6194
+    movu        [r2 + r6 ], m0
6195
+    movu        [r2 + r3 * 4], m1
6196
+    lea         r2, [r2 + r3 * 4]
6197
+    movu        [r2 + r3], m2
6198
+
6199
+    RET
6200
 
6201
 ;-----------------------------------------------------------------------------
6202
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
6203
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
6204
 ;-----------------------------------------------------------------------------
6205
-%macro PIXEL_WH_16xN 2
6206
+%macro P2S_H_16xN 1
6207
 INIT_XMM ssse3
6208
-cglobal pixelToShort_%1x%2, 3, 7, 6
6209
+cglobal filterPixelToShort_16x%1, 3, 7, 6
6210
+    mov         r3d, r3m
6211
+    add         r3d, r3d
6212
+    lea         r4, [r3 * 3]
6213
+    lea         r5, [r1 * 3]
6214
 
6215
-    ; load width and height
6216
-    mov         r3d, %1
6217
-    mov         r4d, %2
6218
+   ; load height
6219
+    mov         r6d, %1/4
6220
 
6221
     ; load constant
6222
     mova        m4, [pb_128]
6223
     mova        m5, [tab_c_64_n64]
6224
 
6225
-.loopH:
6226
-    xor         r5d, r5d
6227
-.loopW:
6228
-    lea         r6, [r0 + r5]
6229
-
6230
-    movh        m0, [r6]
6231
+.loop:
6232
+    movh        m0, [r0]
6233
     punpcklbw   m0, m4
6234
     pmaddubsw   m0, m5
6235
 
6236
-    movh        m1, [r6 + r1]
6237
+    movh        m1, [r0 + r1]
6238
     punpcklbw   m1, m4
6239
     pmaddubsw   m1, m5
6240
 
6241
-    movh        m2, [r6 + r1 * 2]
6242
+    movh        m2, [r0 + r1 * 2]
6243
     punpcklbw   m2, m4
6244
     pmaddubsw   m2, m5
6245
 
6246
-    lea         r6, [r6 + r1 * 2]
6247
-    movh        m3, [r6 + r1]
6248
+    movh        m3, [r0 + r5]
6249
     punpcklbw   m3, m4
6250
     pmaddubsw   m3, m5
6251
 
6252
-    add         r5, 8
6253
-    cmp         r5, r3
6254
+    movu        [r2 + r3 * 0], m0
6255
+    movu        [r2 + r3 * 1], m1
6256
+    movu        [r2 + r3 * 2], m2
6257
+    movu        [r2 + r4], m3
6258
 
6259
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
6260
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
6261
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
6262
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
6263
-    je          .nextH
6264
-    jmp         .loopW
6265
+    lea         r0, [r0 + 8]
6266
 
6267
+    movh        m0, [r0]
6268
+    punpcklbw   m0, m4
6269
+    pmaddubsw   m0, m5
6270
 
6271
-.nextH:
6272
-    lea         r0, [r0 + r1 * 4]
6273
-    add         r2, FENC_STRIDE * 8
6274
+    movh        m1, [r0 + r1]
6275
+    punpcklbw   m1, m4
6276
+    pmaddubsw   m1, m5
6277
 
6278
-    sub         r4d, 4
6279
-    jnz         .loopH
6280
+    movh        m2, [r0 + r1 * 2]
6281
+    punpcklbw   m2, m4
6282
+    pmaddubsw   m2, m5
6283
+
6284
+    movh        m3, [r0 + r5]
6285
+    punpcklbw   m3, m4
6286
+    pmaddubsw   m3, m5
6287
+
6288
+    movu        [r2 + r3 * 0 + 16], m0
6289
+    movu        [r2 + r3 * 1 + 16], m1
6290
+    movu        [r2 + r3 * 2 + 16], m2
6291
+    movu        [r2 + r4 + 16], m3
6292
 
6293
+    lea         r0, [r0 + r1 * 4 - 8]
6294
+    lea         r2, [r2 + r3 * 4]
6295
+
6296
+    dec         r6d
6297
+    jnz         .loop
6298
     RET
6299
 %endmacro
6300
-PIXEL_WH_16xN 16, 16
6301
-PIXEL_WH_16xN 16, 8
6302
-PIXEL_WH_16xN 16, 4
6303
-PIXEL_WH_16xN 16, 12
6304
-PIXEL_WH_16xN 16, 32
6305
-PIXEL_WH_16xN 16, 64
6306
+    P2S_H_16xN 16
6307
+    P2S_H_16xN 4
6308
+    P2S_H_16xN 8
6309
+    P2S_H_16xN 12
6310
+    P2S_H_16xN 32
6311
+    P2S_H_16xN 64
6312
+    P2S_H_16xN 24
6313
 
 ;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
 ;-----------------------------------------------------------------------------
-%macro PIXEL_WH_32xN 2
+%macro P2S_H_32xN 1
 INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+cglobal filterPixelToShort_32x%1, 3, 7, 6
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r3 * 3]
+    lea         r5, [r1 * 3]

-    ; load width and height
-    mov         r3d, %1
-    mov         r4d, %2
+    ; load height
+    mov         r6d, %1/4

     ; load constant
     mova        m4, [pb_128]
     mova        m5, [tab_c_64_n64]

-.loopH:
-    xor         r5d, r5d
-.loopW:
-    lea         r6, [r0 + r5]
+.loop:
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0], m0
+    movu        [r2 + r3 * 1], m1
+    movu        [r2 + r3 * 2], m2
+    movu        [r2 + r4], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5

-    movh        m0, [r6]
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 16], m0
+    movu        [r2 + r3 * 1 + 16], m1
+    movu        [r2 + r3 * 2 + 16], m2
+    movu        [r2 + r4 + 16], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
     punpcklbw   m0, m4
     pmaddubsw   m0, m5

-    movh        m1, [r6 + r1]
+    movh        m1, [r0 + r1]
     punpcklbw   m1, m4
     pmaddubsw   m1, m5

-    movh        m2, [r6 + r1 * 2]
+    movh        m2, [r0 + r1 * 2]
     punpcklbw   m2, m4
     pmaddubsw   m2, m5

-    lea         r6, [r6 + r1 * 2]
-    movh        m3, [r6 + r1]
+    movh        m3, [r0 + r5]
     punpcklbw   m3, m4
     pmaddubsw   m3, m5

-    add         r5, 8
-    cmp         r5, r3
+    movu        [r2 + r3 * 0 + 32], m0
+    movu        [r2 + r3 * 1 + 32], m1
+    movu        [r2 + r3 * 2 + 32], m2
+    movu        [r2 + r4 + 32], m3

-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
-    je          .nextH
-    jmp         .loopW
+    lea         r0, [r0 + 8]

+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5

-.nextH:
-    lea         r0, [r0 + r1 * 4]
-    add         r2, FENC_STRIDE * 8
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5

-    sub         r4d, 4
-    jnz         .loopH
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 48], m0
+    movu        [r2 + r3 * 1 + 48], m1
+    movu        [r2 + r3 * 2 + 48], m2
+    movu        [r2 + r4 + 48], m3
+
+    lea         r0, [r0 + r1 * 4 - 24]
+    lea         r2, [r2 + r3 * 4]

+    dec         r6d
+    jnz         .loop
     RET
 %endmacro
-PIXEL_WH_32xN 32, 32
-PIXEL_WH_32xN 32, 8
-PIXEL_WH_32xN 32, 16
-PIXEL_WH_32xN 32, 24
-PIXEL_WH_32xN 32, 64
+    P2S_H_32xN 32
+    P2S_H_32xN 8
+    P2S_H_32xN 16
+    P2S_H_32xN 24
+    P2S_H_32xN 64
+    P2S_H_32xN 48

 ;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
 ;-----------------------------------------------------------------------------
-%macro PIXEL_WH_64xN 2
+%macro P2S_H_32xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_32x%1, 3, 7, 3
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r5, [r1 * 3]
+    lea         r6, [r3 * 3]
+
+    ; load height
+    mov         r4d, %1/4
+
+    ; load constant
+    vpbroadcastd m2, [pw_2000]
+
+.loop:
+    pmovzxbw    m0, [r0 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + 1 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psubw       m0, m2
+    psubw       m1, m2
+    movu        [r2 + 0 * mmsize], m0
+    movu        [r2 + 1 * mmsize], m1
+
+    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psubw       m0, m2
+    psubw       m1, m2
+    movu        [r2 + r3 + 0 * mmsize], m0
+    movu        [r2 + r3 + 1 * mmsize], m1
+
+    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psubw       m0, m2
+    psubw       m1, m2
+    movu        [r2 + r3 * 2 + 0 * mmsize], m0
+    movu        [r2 + r3 * 2 + 1 * mmsize], m1
+
+    pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psubw       m0, m2
+    psubw       m1, m2
+    movu        [r2 + r6 + 0 * mmsize], m0
+    movu        [r2 + r6 + 1 * mmsize], m1
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r4d
+    jnz        .loop
+    RET
+%endmacro
+    P2S_H_32xN_avx2 32
+    P2S_H_32xN_avx2 8
+    P2S_H_32xN_avx2 16
+    P2S_H_32xN_avx2 24
+    P2S_H_32xN_avx2 64
+    P2S_H_32xN_avx2 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_64xN 1
 INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+cglobal filterPixelToShort_64x%1, 3, 7, 6
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r3 * 3]
+    lea         r5, [r1 * 3]

-    ; load width and height
-    mov         r3d, %1
-    mov         r4d, %2
+    ; load height
+    mov         r6d, %1/4

     ; load constant
     mova        m4, [pb_128]
     mova        m5, [tab_c_64_n64]

-.loopH:
-    xor         r5d, r5d
-.loopW:
-    lea         r6, [r0 + r5]
+.loop:
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5

-    movh        m0, [r6]
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0], m0
+    movu        [r2 + r3 * 1], m1
+    movu        [r2 + r3 * 2], m2
+    movu        [r2 + r4], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
     punpcklbw   m0, m4
     pmaddubsw   m0, m5

-    movh        m1, [r6 + r1]
+    movh        m1, [r0 + r1]
     punpcklbw   m1, m4
     pmaddubsw   m1, m5

-    movh        m2, [r6 + r1 * 2]
+    movh        m2, [r0 + r1 * 2]
     punpcklbw   m2, m4
     pmaddubsw   m2, m5

-    lea         r6, [r6 + r1 * 2]
-    movh        m3, [r6 + r1]
+    movh        m3, [r0 + r5]
     punpcklbw   m3, m4
     pmaddubsw   m3, m5

-    add         r5, 8
-    cmp         r5, r3
+    movu        [r2 + r3 * 0 + 16], m0
+    movu        [r2 + r3 * 1 + 16], m1
+    movu        [r2 + r3 * 2 + 16], m2
+    movu        [r2 + r4 + 16], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5

-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
-    je          .nextH
-    jmp         .loopW
+    movu        [r2 + r3 * 0 + 32], m0
+    movu        [r2 + r3 * 1 + 32], m1
+    movu        [r2 + r3 * 2 + 32], m2
+    movu        [r2 + r4 + 32], m3

+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 48], m0
+    movu        [r2 + r3 * 1 + 48], m1
+    movu        [r2 + r3 * 2 + 48], m2
+    movu        [r2 + r4 + 48], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 64], m0
+    movu        [r2 + r3 * 1 + 64], m1
+    movu        [r2 + r3 * 2 + 64], m2
+    movu        [r2 + r4 + 64], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 80], m0
+    movu        [r2 + r3 * 1 + 80], m1
+    movu        [r2 + r3 * 2 + 80], m2
+    movu        [r2 + r4 + 80], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 96], m0
+    movu        [r2 + r3 * 1 + 96], m1
+    movu        [r2 + r3 * 2 + 96], m2
+    movu        [r2 + r4 + 96], m3
+
+    lea         r0, [r0 + 8]
+
+    movh        m0, [r0]
+    punpcklbw   m0, m4
+    pmaddubsw   m0, m5
+
+    movh        m1, [r0 + r1]
+    punpcklbw   m1, m4
+    pmaddubsw   m1, m5
+
+    movh        m2, [r0 + r1 * 2]
+    punpcklbw   m2, m4
+    pmaddubsw   m2, m5
+
+    movh        m3, [r0 + r5]
+    punpcklbw   m3, m4
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0 + 112], m0
+    movu        [r2 + r3 * 1 + 112], m1
+    movu        [r2 + r3 * 2 + 112], m2
+    movu        [r2 + r4 + 112], m3
+
+    lea         r0, [r0 + r1 * 4 - 56]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r6d
+    jnz         .loop
+    RET
+%endmacro
+    P2S_H_64xN 64
+    P2S_H_64xN 16
+    P2S_H_64xN 32
+    P2S_H_64xN 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_64xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_64x%1, 3, 7, 5
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r5, [r1 * 3]
+    lea         r6, [r3 * 3]
+
+    ; load height
+    mov         r4d, %1/4
+
+    ; load constant
+    vpbroadcastd m4, [pw_2000]
+
+.loop:
+    pmovzxbw    m0, [r0 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + 2 * mmsize/2]
+    pmovzxbw    m3, [r0 + 3 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psllw       m3, 6
+    psubw       m0, m4
+    psubw       m1, m4
+    psubw       m2, m4
+    psubw       m3, m4
+
+    movu        [r2 + 0 * mmsize], m0
+    movu        [r2 + 1 * mmsize], m1
+    movu        [r2 + 2 * mmsize], m2
+    movu        [r2 + 3 * mmsize], m3
+
+    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r1 + 2 * mmsize/2]
+    pmovzxbw    m3, [r0 + r1 + 3 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psllw       m3, 6
+    psubw       m0, m4
+    psubw       m1, m4
+    psubw       m2, m4
+    psubw       m3, m4
+
+    movu        [r2 + r3 + 0 * mmsize], m0
+    movu        [r2 + r3 + 1 * mmsize], m1
+    movu        [r2 + r3 + 2 * mmsize], m2
+    movu        [r2 + r3 + 3 * mmsize], m3
+
+    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r1 * 2 + 2 * mmsize/2]
+    pmovzxbw    m3, [r0 + r1 * 2 + 3 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psllw       m3, 6
+    psubw       m0, m4
+    psubw       m1, m4
+    psubw       m2, m4
+    psubw       m3, m4
+
+    movu        [r2 + r3 * 2 + 0 * mmsize], m0
+    movu        [r2 + r3 * 2 + 1 * mmsize], m1
+    movu        [r2 + r3 * 2 + 2 * mmsize], m2
+    movu        [r2 + r3 * 2 + 3 * mmsize], m3
+
+    pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r5 + 2 * mmsize/2]
+    pmovzxbw    m3, [r0 + r5 + 3 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psllw       m3, 6
+    psubw       m0, m4
+    psubw       m1, m4
+    psubw       m2, m4
+    psubw       m3, m4
+
+    movu        [r2 + r6 + 0 * mmsize], m0
+    movu        [r2 + r6 + 1 * mmsize], m1
+    movu        [r2 + r6 + 2 * mmsize], m2
+    movu        [r2 + r6 + 3 * mmsize], m3

-.nextH:
     lea         r0, [r0 + r1 * 4]
-    add         r2, FENC_STRIDE * 8
+    lea         r2, [r2 + r3 * 4]

-    sub         r4d, 4
-    jnz         .loopH
+    dec         r4d
+    jnz        .loop
+    RET
+%endmacro
+    P2S_H_64xN_avx2 64
+    P2S_H_64xN_avx2 16
+    P2S_H_64xN_avx2 32
+    P2S_H_64xN_avx2 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_12xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_12x%1, 3, 7, 6
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r1 * 3]
+    lea         r6, [r3 * 3]
+    mov         r5d, %1/4

+    ; load constant
+    mova        m4, [pb_128]
+    mova        m5, [tab_c_64_n64]
+
+.loop:
+    movu        m0, [r0]
+    punpcklbw   m1, m0, m4
+    punpckhbw   m0, m4
+    pmaddubsw   m0, m5
+    pmaddubsw   m1, m5
+
+    movu        m2, [r0 + r1]
+    punpcklbw   m3, m2, m4
+    punpckhbw   m2, m4
+    pmaddubsw   m2, m5
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 0], m1
+    movu        [r2 + r3 * 1], m3
+
+    movh        [r2 + r3 * 0 + 16], m0
+    movh        [r2 + r3 * 1 + 16], m2
+
+    movu        m0, [r0 + r1 * 2]
+    punpcklbw   m1, m0, m4
+    punpckhbw   m0, m4
+    pmaddubsw   m0, m5
+    pmaddubsw   m1, m5
+
+    movu        m2, [r0 + r4]
+    punpcklbw   m3, m2, m4
+    punpckhbw   m2, m4
+    pmaddubsw   m2, m5
+    pmaddubsw   m3, m5
+
+    movu        [r2 + r3 * 2], m1
+    movu        [r2 + r6], m3
+
+    movh        [r2 + r3 * 2 + 16], m0
+    movh        [r2 + r6 + 16], m2
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r5d
+    jnz         .loop
+    RET
+%endmacro
+    P2S_H_12xN 16
+    P2S_H_12xN 32
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_24xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_24x%1, 3, 7, 5
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r1 * 3]
+    lea         r5, [r3 * 3]
+    mov         r6d, %1/4
+
+    ; load constant
+    mova        m3, [pb_128]
+    mova        m4, [tab_c_64_n64]
+
+.loop:
+    movu        m0, [r0]
+    punpcklbw   m1, m0, m3
+    punpckhbw   m0, m3
+    pmaddubsw   m0, m4
+    pmaddubsw   m1, m4
+
+    movu        m2, [r0 + 16]
+    punpcklbw   m2, m3
+    pmaddubsw   m2, m4
+
+    movu        [r2 +  r3 * 0], m1
+    movu        [r2 +  r3 * 0 + 16], m0
+    movu        [r2 +  r3 * 0 + 32], m2
+
+    movu        m0, [r0 + r1]
+    punpcklbw   m1, m0, m3
+    punpckhbw   m0, m3
+    pmaddubsw   m0, m4
+    pmaddubsw   m1, m4
+
+    movu        m2, [r0 + r1 + 16]
+    punpcklbw   m2, m3
+    pmaddubsw   m2, m4
+
+    movu        [r2 +  r3 * 1], m1
+    movu        [r2 +  r3 * 1 + 16], m0
+    movu        [r2 +  r3 * 1 + 32], m2
+
+    movu        m0, [r0 + r1 * 2]
+    punpcklbw   m1, m0, m3
+    punpckhbw   m0, m3
+    pmaddubsw   m0, m4
+    pmaddubsw   m1, m4
+
+    movu        m2, [r0 + r1 * 2 + 16]
+    punpcklbw   m2, m3
+    pmaddubsw   m2, m4
+
+    movu        [r2 +  r3 * 2], m1
+    movu        [r2 +  r3 * 2 + 16], m0
+    movu        [r2 +  r3 * 2 + 32], m2
+
+    movu        m0, [r0 + r4]
+    punpcklbw   m1, m0, m3
+    punpckhbw   m0, m3
+    pmaddubsw   m0, m4
+    pmaddubsw   m1, m4
+
+    movu        m2, [r0 + r4 + 16]
+    punpcklbw   m2, m3
+    pmaddubsw   m2, m4
+    movu        [r2 +  r5], m1
+    movu        [r2 +  r5 + 16], m0
+    movu        [r2 +  r5 + 32], m2
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r6d
+    jnz         .loop
     RET
 %endmacro
-PIXEL_WH_64xN 64, 64
-PIXEL_WH_64xN 64, 16
-PIXEL_WH_64xN 64, 32
-PIXEL_WH_64xN 64, 48
+    P2S_H_24xN 32
+    P2S_H_24xN 64
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_24xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_24x%1, 3, 7, 4
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r1 * 3]
+    lea         r5, [r3 * 3]
+    mov         r6d, %1/4
+
+    ; load constant
+    vpbroadcastd m1, [pw_2000]
+    vpbroadcastd m2, [pb_128]
+    vpbroadcastd m3, [tab_c_64_n64]
+
+.loop:
+    pmovzxbw    m0, [r0]
+    psllw       m0, 6
+    psubw       m0, m1
+    movu        [r2], m0
+
+    movu        m0, [r0 + mmsize/2]
+    punpcklbw   m0, m2
+    pmaddubsw   m0, m3
+    movu        [r2 +  r3 * 0 + mmsize], xm0
+
+    pmovzxbw    m0, [r0 + r1]
+    psllw       m0, 6
+    psubw       m0, m1
+    movu        [r2 + r3], m0
+
+    movu        m0, [r0 + r1 + mmsize/2]
+    punpcklbw   m0, m2
+    pmaddubsw   m0, m3
+    movu        [r2 +  r3 * 1 + mmsize], xm0
+
+    pmovzxbw    m0, [r0 + r1 * 2]
+    psllw       m0, 6
+    psubw       m0, m1
+    movu        [r2 + r3 * 2], m0
+
+    movu        m0, [r0 + r1 * 2 + mmsize/2]
+    punpcklbw   m0, m2
+    pmaddubsw   m0, m3
+    movu        [r2 +  r3 * 2 + mmsize], xm0
+
+    pmovzxbw    m0, [r0 + r4]
+    psllw       m0, 6
+    psubw       m0, m1
+    movu        [r2 + r5], m0
+
+    movu        m0, [r0 + r4 + mmsize/2]
+    punpcklbw   m0, m2
+    pmaddubsw   m0, m3
+    movu        [r2 + r5 + mmsize], xm0
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r6d
+    jnz         .loop
+    RET
+%endmacro
+    P2S_H_24xN_avx2 32
+    P2S_H_24xN_avx2 64
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_48x64, 3, 7, 4
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r4, [r1 * 3]
+    lea         r5, [r3 * 3]
+    mov         r6d, 16
+
+    ; load constant
+    mova        m2, [pb_128]
+    mova        m3, [tab_c_64_n64]
+
+.loop:
+    movu        m0, [r0]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 0], m1
+    movu        [r2 +  r3 * 0 + 16], m0
+
+    movu        m0, [r0 + 16]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 0 + 32], m1
+    movu        [r2 +  r3 * 0 + 48], m0
+
+    movu        m0, [r0 + 32]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 0 + 64], m1
+    movu        [r2 +  r3 * 0 + 80], m0
+
+    movu        m0, [r0 + r1]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 1], m1
+    movu        [r2 +  r3 * 1 + 16], m0
+
+    movu        m0, [r0 + r1 + 16]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 1 + 32], m1
+    movu        [r2 +  r3 * 1 + 48], m0
+
+    movu        m0, [r0 + r1 + 32]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 1 + 64], m1
+    movu        [r2 +  r3 * 1 + 80], m0
+
+    movu        m0, [r0 + r1 * 2]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 2], m1
+    movu        [r2 +  r3 * 2 + 16], m0
+
+    movu        m0, [r0 + r1 * 2 + 16]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 2 + 32], m1
+    movu        [r2 +  r3 * 2 + 48], m0
+
+    movu        m0, [r0 + r1 * 2 + 32]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r3 * 2 + 64], m1
+    movu        [r2 +  r3 * 2 + 80], m0
+
+    movu        m0, [r0 + r4]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r5], m1
+    movu        [r2 +  r5 + 16], m0
+
+    movu        m0, [r0 + r4 + 16]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r5 + 32], m1
+    movu        [r2 +  r5 + 48], m0
+
+    movu        m0, [r0 + r4 + 32]
+    punpcklbw   m1, m0, m2
+    punpckhbw   m0, m2
+    pmaddubsw   m0, m3
+    pmaddubsw   m1, m3
+
+    movu        [r2 +  r5 + 64], m1
+    movu        [r2 +  r5 + 80], m0
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r6d
+    jnz         .loop
+    RET
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal filterPixelToShort_48x64, 3,7,4
+    mov         r3d, r3m
+    add         r3d, r3d
+    lea         r5, [r1 * 3]
+    lea         r6, [r3 * 3]
+
+    ; load height
+    mov         r4d, 64/4
+
+    ; load constant
+    vpbroadcastd m3, [pw_2000]
+
+    ; just unroll(1) because it is the best choice for 48x64
+.loop:
+    pmovzxbw    m0, [r0 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + 2 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psubw       m0, m3
+    psubw       m1, m3
+    psubw       m2, m3
+    movu        [r2 + 0 * mmsize], m0
+    movu        [r2 + 1 * mmsize], m1
+    movu        [r2 + 2 * mmsize], m2
+
+    pmovzxbw    m0, [r0 + r1 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r1 + 2 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psubw       m0, m3
+    psubw       m1, m3
+    psubw       m2, m3
+    movu        [r2 + r3 + 0 * mmsize], m0
+    movu        [r2 + r3 + 1 * mmsize], m1
+    movu        [r2 + r3 + 2 * mmsize], m2
+
+    pmovzxbw    m0, [r0 + r1 * 2 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r1 * 2 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r1 * 2 + 2 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psubw       m0, m3
+    psubw       m1, m3
+    psubw       m2, m3
+    movu        [r2 + r3 * 2 + 0 * mmsize], m0
+    movu        [r2 + r3 * 2 + 1 * mmsize], m1
+    movu        [r2 + r3 * 2 + 2 * mmsize], m2
+
+    pmovzxbw    m0, [r0 + r5 + 0 * mmsize/2]
+    pmovzxbw    m1, [r0 + r5 + 1 * mmsize/2]
+    pmovzxbw    m2, [r0 + r5 + 2 * mmsize/2]
+    psllw       m0, 6
+    psllw       m1, 6
+    psllw       m2, 6
+    psubw       m0, m3
+    psubw       m1, m3
+    psubw       m2, m3
+    movu        [r2 + r6 + 0 * mmsize], m0
+    movu        [r2 + r6 + 1 * mmsize], m1
+    movu        [r2 + r6 + 2 * mmsize], m2
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+
+    dec         r4d
+    jnz        .loop
+    RET
+

 %macro PROCESS_LUMA_W4_4R 0
     movd        m0, [r0]
 @@ -8495,36 +11678,36 @@
 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 4, pp
+    FILTER_VER_LUMA_4xN 4, 4, pp

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 8, pp
-FILTER_VER_LUMA_AVX2_4xN 4, 8, pp
+    FILTER_VER_LUMA_4xN 4, 8, pp
+    FILTER_VER_LUMA_AVX2_4xN 4, 8, pp

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 16, pp
-FILTER_VER_LUMA_AVX2_4xN 4, 16, pp
+    FILTER_VER_LUMA_4xN 4, 16, pp
+    FILTER_VER_LUMA_AVX2_4xN 4, 16, pp

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_ps_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 4, ps
+    FILTER_VER_LUMA_4xN 4, 4, ps

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_ps_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 8, ps
-FILTER_VER_LUMA_AVX2_4xN 4, 8, ps
+    FILTER_VER_LUMA_4xN 4, 8, ps
+    FILTER_VER_LUMA_AVX2_4xN 4, 8, ps

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_ps_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_4xN 4, 16, ps
-FILTER_VER_LUMA_AVX2_4xN 4, 16, ps
+    FILTER_VER_LUMA_4xN 4, 16, ps
+    FILTER_VER_LUMA_AVX2_4xN 4, 16, ps

 %macro PROCESS_LUMA_AVX2_W8_8R 0
     movq            xm1, [r0]                       ; m1 = row 0
 @@ -8895,50 +12078,50 @@
 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_8xN 8, 4, pp
-FILTER_VER_LUMA_AVX2_8x4 pp
+    FILTER_VER_LUMA_8xN 8, 4, pp
+    FILTER_VER_LUMA_AVX2_8x4 pp

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_8xN 8, 8, pp
-FILTER_VER_LUMA_AVX2_8x8 pp
+    FILTER_VER_LUMA_8xN 8, 8, pp
+    FILTER_VER_LUMA_AVX2_8x8 pp

 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_pp_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
 ;-------------------------------------------------------------------------------------------------------------
-FILTER_VER_LUMA_8xN 8, 16, pp
-FILTER_VER_LUMA_AVX2_8xN 8, 16, pp
+    FILTER_VER_LUMA_8xN 8, 16, pp
+    FILTER_VER_LUMA_AVX2_8xN 8, 16, pp

 ;-------------------------------------------------------------------------------------------------------------
7381
 ; void interp_8tap_vert_pp_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7382
 ;-------------------------------------------------------------------------------------------------------------
7383
-FILTER_VER_LUMA_8xN 8, 32, pp
7384
-FILTER_VER_LUMA_AVX2_8xN 8, 32, pp
7385
+    FILTER_VER_LUMA_8xN 8, 32, pp
7386
+    FILTER_VER_LUMA_AVX2_8xN 8, 32, pp
7387
 
7388
 ;-------------------------------------------------------------------------------------------------------------
7389
 ; void interp_8tap_vert_ps_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7390
 ;-------------------------------------------------------------------------------------------------------------
7391
-FILTER_VER_LUMA_8xN 8, 4, ps
7392
-FILTER_VER_LUMA_AVX2_8x4 ps
7393
+    FILTER_VER_LUMA_8xN 8, 4, ps
7394
+    FILTER_VER_LUMA_AVX2_8x4 ps
7395
 
7396
 ;-------------------------------------------------------------------------------------------------------------
7397
 ; void interp_8tap_vert_ps_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7398
 ;-------------------------------------------------------------------------------------------------------------
7399
-FILTER_VER_LUMA_8xN 8, 8, ps
7400
-FILTER_VER_LUMA_AVX2_8x8 ps
7401
+    FILTER_VER_LUMA_8xN 8, 8, ps
7402
+    FILTER_VER_LUMA_AVX2_8x8 ps
7403
 
7404
 ;-------------------------------------------------------------------------------------------------------------
7405
 ; void interp_8tap_vert_ps_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7406
 ;-------------------------------------------------------------------------------------------------------------
7407
-FILTER_VER_LUMA_8xN 8, 16, ps
7408
-FILTER_VER_LUMA_AVX2_8xN 8, 16, ps
7409
+    FILTER_VER_LUMA_8xN 8, 16, ps
7410
+    FILTER_VER_LUMA_AVX2_8xN 8, 16, ps
7411
 
7412
 ;-------------------------------------------------------------------------------------------------------------
7413
 ; void interp_8tap_vert_ps_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7414
 ;-------------------------------------------------------------------------------------------------------------
7415
-FILTER_VER_LUMA_8xN 8, 32, ps
7416
-FILTER_VER_LUMA_AVX2_8xN 8, 32, ps
7417
+    FILTER_VER_LUMA_8xN 8, 32, ps
7418
+    FILTER_VER_LUMA_AVX2_8xN 8, 32, ps
7419
 
7420
 ;-------------------------------------------------------------------------------------------------------------
7421
 ; void interp_8tap_vert_%3_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7422
@@ -9000,7 +12183,7 @@
7423
 
7424
     lea       r5, [8 * r1 - 8]
7425
     sub       r0, r5
7426
-%ifidn %3,pp 
7427
+%ifidn %3,pp
7428
     add       r2, 8
7429
 %else
7430
     add       r2, 16
7431
@@ -9047,12 +12230,12 @@
7432
 ;-------------------------------------------------------------------------------------------------------------
7433
 ; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7434
 ;-------------------------------------------------------------------------------------------------------------
7435
-FILTER_VER_LUMA_12xN 12, 16, pp
7436
+    FILTER_VER_LUMA_12xN 12, 16, pp
7437
 
7438
 ;-------------------------------------------------------------------------------------------------------------
7439
 ; void interp_8tap_vert_ps_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7440
 ;-------------------------------------------------------------------------------------------------------------
7441
-FILTER_VER_LUMA_12xN 12, 16, ps
7442
+    FILTER_VER_LUMA_12xN 12, 16, ps
7443
 
7444
 %macro FILTER_VER_LUMA_AVX2_12x16 1
7445
 INIT_YMM avx2
7446
@@ -9443,8 +12626,8 @@
7447
 %endif
7448
 %endmacro
7449
 
7450
-FILTER_VER_LUMA_AVX2_12x16 pp
7451
-FILTER_VER_LUMA_AVX2_12x16 ps
7452
+    FILTER_VER_LUMA_AVX2_12x16 pp
7453
+    FILTER_VER_LUMA_AVX2_12x16 ps
7454
 
7455
 %macro FILTER_VER_LUMA_AVX2_16x16 1
7456
 INIT_YMM avx2
7457
@@ -9787,8 +12970,8 @@
7458
 %endif
7459
 %endmacro
7460
 
7461
-FILTER_VER_LUMA_AVX2_16x16 pp
7462
-FILTER_VER_LUMA_AVX2_16x16 ps
7463
+    FILTER_VER_LUMA_AVX2_16x16 pp
7464
+    FILTER_VER_LUMA_AVX2_16x16 ps
7465
 
7466
 %macro FILTER_VER_LUMA_AVX2_16x12 1
7467
 INIT_YMM avx2
7468
@@ -10062,8 +13245,8 @@
7469
 %endif
7470
 %endmacro
7471
 
7472
-FILTER_VER_LUMA_AVX2_16x12 pp
7473
-FILTER_VER_LUMA_AVX2_16x12 ps
7474
+    FILTER_VER_LUMA_AVX2_16x12 pp
7475
+    FILTER_VER_LUMA_AVX2_16x12 ps
7476
 
7477
 %macro FILTER_VER_LUMA_AVX2_16x8 1
7478
 INIT_YMM avx2
7479
@@ -10258,8 +13441,8 @@
7480
 %endif
7481
 %endmacro
7482
 
7483
-FILTER_VER_LUMA_AVX2_16x8 pp
7484
-FILTER_VER_LUMA_AVX2_16x8 ps
7485
+    FILTER_VER_LUMA_AVX2_16x8 pp
7486
+    FILTER_VER_LUMA_AVX2_16x8 ps
7487
 
7488
 %macro FILTER_VER_LUMA_AVX2_16x4 1
7489
 INIT_YMM avx2
7490
@@ -10383,8 +13566,8 @@
7491
 %endif
7492
 %endmacro
7493
 
7494
-FILTER_VER_LUMA_AVX2_16x4 pp
7495
-FILTER_VER_LUMA_AVX2_16x4 ps
7496
+    FILTER_VER_LUMA_AVX2_16x4 pp
7497
+    FILTER_VER_LUMA_AVX2_16x4 ps
7498
 %macro FILTER_VER_LUMA_AVX2_16xN 3
7499
 INIT_YMM avx2
7500
 %if ARCH_X86_64 == 1
7501
@@ -10735,10 +13918,10 @@
7502
 %endif
7503
 %endmacro
7504
 
7505
-FILTER_VER_LUMA_AVX2_16xN 16, 32, pp
7506
-FILTER_VER_LUMA_AVX2_16xN 16, 64, pp
7507
-FILTER_VER_LUMA_AVX2_16xN 16, 32, ps
7508
-FILTER_VER_LUMA_AVX2_16xN 16, 64, ps
7509
+    FILTER_VER_LUMA_AVX2_16xN 16, 32, pp
7510
+    FILTER_VER_LUMA_AVX2_16xN 16, 64, pp
7511
+    FILTER_VER_LUMA_AVX2_16xN 16, 32, ps
7512
+    FILTER_VER_LUMA_AVX2_16xN 16, 64, ps
7513
 
7514
 %macro PROCESS_LUMA_AVX2_W16_16R 1
     movu            xm0, [r0]                       ; m0 = row 0
@@ -11466,8 +14649,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_24x32 pp
-FILTER_VER_LUMA_AVX2_24x32 ps
+    FILTER_VER_LUMA_AVX2_24x32 pp
+    FILTER_VER_LUMA_AVX2_24x32 ps
 
 %macro FILTER_VER_LUMA_AVX2_32xN 3
 INIT_YMM avx2
@@ -11517,10 +14700,10 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_32xN 32, 32, pp
-FILTER_VER_LUMA_AVX2_32xN 32, 64, pp
-FILTER_VER_LUMA_AVX2_32xN 32, 32, ps
-FILTER_VER_LUMA_AVX2_32xN 32, 64, ps
+    FILTER_VER_LUMA_AVX2_32xN 32, 32, pp
+    FILTER_VER_LUMA_AVX2_32xN 32, 64, pp
+    FILTER_VER_LUMA_AVX2_32xN 32, 32, ps
+    FILTER_VER_LUMA_AVX2_32xN 32, 64, ps
 
 %macro FILTER_VER_LUMA_AVX2_32x16 1
 INIT_YMM avx2
@@ -11560,9 +14743,9 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_32x16 pp
-FILTER_VER_LUMA_AVX2_32x16 ps
- 
+    FILTER_VER_LUMA_AVX2_32x16 pp
+    FILTER_VER_LUMA_AVX2_32x16 ps
+
 %macro FILTER_VER_LUMA_AVX2_32x24 1
 INIT_YMM avx2
 %if ARCH_X86_64 == 1
@@ -11620,8 +14803,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_32x24 pp
-FILTER_VER_LUMA_AVX2_32x24 ps
+    FILTER_VER_LUMA_AVX2_32x24 pp
+    FILTER_VER_LUMA_AVX2_32x24 ps
 
 %macro FILTER_VER_LUMA_AVX2_32x8 1
 INIT_YMM avx2
@@ -11663,8 +14846,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_32x8 pp
-FILTER_VER_LUMA_AVX2_32x8 ps
+    FILTER_VER_LUMA_AVX2_32x8 pp
+    FILTER_VER_LUMA_AVX2_32x8 ps
 
 %macro FILTER_VER_LUMA_AVX2_48x64 1
 INIT_YMM avx2
@@ -11722,8 +14905,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_48x64 pp
-FILTER_VER_LUMA_AVX2_48x64 ps
+    FILTER_VER_LUMA_AVX2_48x64 pp
+    FILTER_VER_LUMA_AVX2_48x64 ps
 
 %macro FILTER_VER_LUMA_AVX2_64xN 3
 INIT_YMM avx2
@@ -11781,12 +14964,12 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_64xN 64, 32, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 48, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 64, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 32, ps
-FILTER_VER_LUMA_AVX2_64xN 64, 48, ps
-FILTER_VER_LUMA_AVX2_64xN 64, 64, ps
+    FILTER_VER_LUMA_AVX2_64xN 64, 32, pp
+    FILTER_VER_LUMA_AVX2_64xN 64, 48, pp
+    FILTER_VER_LUMA_AVX2_64xN 64, 64, pp
+    FILTER_VER_LUMA_AVX2_64xN 64, 32, ps
+    FILTER_VER_LUMA_AVX2_64xN 64, 48, ps
+    FILTER_VER_LUMA_AVX2_64xN 64, 64, ps
 
 %macro FILTER_VER_LUMA_AVX2_64x16 1
 INIT_YMM avx2
@@ -11832,8 +15015,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_64x16 pp
-FILTER_VER_LUMA_AVX2_64x16 ps
+    FILTER_VER_LUMA_AVX2_64x16 pp
+    FILTER_VER_LUMA_AVX2_64x16 ps
 
 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -11916,41 +15099,41 @@
     RET
 %endmacro
 
-FILTER_VER_LUMA 16, 4, pp
-FILTER_VER_LUMA 16, 8, pp
-FILTER_VER_LUMA 16, 12, pp
-FILTER_VER_LUMA 16, 16, pp
-FILTER_VER_LUMA 16, 32, pp
-FILTER_VER_LUMA 16, 64, pp
-FILTER_VER_LUMA 24, 32, pp
-FILTER_VER_LUMA 32, 8, pp
-FILTER_VER_LUMA 32, 16, pp
-FILTER_VER_LUMA 32, 24, pp
-FILTER_VER_LUMA 32, 32, pp
-FILTER_VER_LUMA 32, 64, pp
-FILTER_VER_LUMA 48, 64, pp
-FILTER_VER_LUMA 64, 16, pp
-FILTER_VER_LUMA 64, 32, pp
-FILTER_VER_LUMA 64, 48, pp
-FILTER_VER_LUMA 64, 64, pp
-
-FILTER_VER_LUMA 16, 4, ps
-FILTER_VER_LUMA 16, 8, ps
-FILTER_VER_LUMA 16, 12, ps
-FILTER_VER_LUMA 16, 16, ps
-FILTER_VER_LUMA 16, 32, ps
-FILTER_VER_LUMA 16, 64, ps
-FILTER_VER_LUMA 24, 32, ps
-FILTER_VER_LUMA 32, 8, ps
-FILTER_VER_LUMA 32, 16, ps
-FILTER_VER_LUMA 32, 24, ps
-FILTER_VER_LUMA 32, 32, ps
-FILTER_VER_LUMA 32, 64, ps
-FILTER_VER_LUMA 48, 64, ps
-FILTER_VER_LUMA 64, 16, ps
-FILTER_VER_LUMA 64, 32, ps
-FILTER_VER_LUMA 64, 48, ps
-FILTER_VER_LUMA 64, 64, ps
+    FILTER_VER_LUMA 16, 4, pp
+    FILTER_VER_LUMA 16, 8, pp
+    FILTER_VER_LUMA 16, 12, pp
+    FILTER_VER_LUMA 16, 16, pp
+    FILTER_VER_LUMA 16, 32, pp
+    FILTER_VER_LUMA 16, 64, pp
+    FILTER_VER_LUMA 24, 32, pp
+    FILTER_VER_LUMA 32, 8, pp
+    FILTER_VER_LUMA 32, 16, pp
+    FILTER_VER_LUMA 32, 24, pp
+    FILTER_VER_LUMA 32, 32, pp
+    FILTER_VER_LUMA 32, 64, pp
+    FILTER_VER_LUMA 48, 64, pp
+    FILTER_VER_LUMA 64, 16, pp
+    FILTER_VER_LUMA 64, 32, pp
+    FILTER_VER_LUMA 64, 48, pp
+    FILTER_VER_LUMA 64, 64, pp
+
+    FILTER_VER_LUMA 16, 4, ps
+    FILTER_VER_LUMA 16, 8, ps
+    FILTER_VER_LUMA 16, 12, ps
+    FILTER_VER_LUMA 16, 16, ps
+    FILTER_VER_LUMA 16, 32, ps
+    FILTER_VER_LUMA 16, 64, ps
+    FILTER_VER_LUMA 24, 32, ps
+    FILTER_VER_LUMA 32, 8, ps
+    FILTER_VER_LUMA 32, 16, ps
+    FILTER_VER_LUMA 32, 24, ps
+    FILTER_VER_LUMA 32, 32, ps
+    FILTER_VER_LUMA 32, 64, ps
+    FILTER_VER_LUMA 48, 64, ps
+    FILTER_VER_LUMA 64, 16, ps
+    FILTER_VER_LUMA 64, 32, ps
+    FILTER_VER_LUMA 64, 48, ps
+    FILTER_VER_LUMA 64, 64, ps
 
 %macro PROCESS_LUMA_SP_W4_4R 0
     movq       m0, [r0]
@@ -12036,7 +15219,7 @@
     lea       r6, [tab_LumaCoeffV + r4]
 %endif
 
-    mova      m7, [tab_c_526336]
+    mova      m7, [pd_526336]
 
     mov       dword [rsp], %2/4
 .loopH:
@@ -12110,63 +15293,49 @@
     FILTER_VER_LUMA_SP 64, 16
     FILTER_VER_LUMA_SP 16, 64
 
-; TODO: combin of U and V is more performance, but need more register
-; TODO: use two path for height alignment to 4 and otherwise may improvement 10% performance, but code is more complex, so I disable it
-INIT_XMM ssse3
-cglobal chroma_p2s, 3, 7, 4
-
-    ; load width and height
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal filterPixelToShort_4x2, 3, 4, 3
     mov         r3d, r3m
-    mov         r4d, r4m
+    add         r3d, r3d
 
     ; load constant
-    mova        m2, [pb_128]
-    mova        m3, [tab_c_64_n64]
+    mova        m1, [pb_128]
+    mova        m2, [tab_c_64_n64]
 
-.loopH:
-
-    xor         r5d, r5d
-.loopW:
-    lea         r6, [r0 + r5]
+    movd        m0, [r0]
+    pinsrd      m0, [r0 + r1], 1
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2
 
-    movh        m0, [r6]
-    punpcklbw   m0, m2
-    pmaddubsw   m0, m3
+    movq        [r2 + r3 * 0], m0
+    movhps      [r2 + r3 * 1], m0
 
-    movh        m1, [r6 + r1]
-    punpcklbw   m1, m2
-    pmaddubsw   m1, m3
+    RET
 
-    add         r5d, 8
-    cmp         r5d, r3d
-    lea         r6, [r2 + r5 * 2]
-    jg          .width4
-    movu        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movu        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
-    je          .nextH
-    jmp         .loopW
-
-.width4:
-    test        r3d, 4
-    jz          .width2
-    test        r3d, 2
-    movh        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movh        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
-    lea         r6, [r6 + 8]
-    pshufd      m0, m0, 2
-    pshufd      m1, m1, 2
-    jz          .nextH
-
-.width2:
-    movd        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movd        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_8x2, 3, 4, 3
+    mov         r3d, r3m
+    add         r3d, r3d
 
-.nextH:
-    lea         r0, [r0 + r1 * 2]
-    add         r2, FENC_STRIDE / 2 * 4
+    ; load constant
+    mova        m1, [pb_128]
+    mova        m2, [tab_c_64_n64]
 
-    sub         r4d, 2
-    jnz         .loopH
+    movh        m0, [r0]
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2
+    movu        [r2 + r3 * 0], m0
+
+    movh        m0, [r0 + r1]
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2
+    movu        [r2 + r3 * 1], m0
 
     RET
 
@@ -12223,7 +15392,7 @@
     lea       r6, [tab_ChromaCoeffV + r4]
 %endif
 
-    mova      m6, [tab_c_526336]
+    mova      m6, [pd_526336]
 
     mov       dword [rsp], %2/4
 
@@ -12350,7 +15519,7 @@
     lea       r5, [tab_ChromaCoeffV + r4]
 %endif
 
-    mova      m5, [tab_c_526336]
+    mova      m5, [pd_526336]
 
     mov       r4d, (%2/4)
 
@@ -12380,10 +15549,10 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SP_W2_4R 2, 4
-FILTER_VER_CHROMA_SP_W2_4R 2, 8
+    FILTER_VER_CHROMA_SP_W2_4R 2, 4
+    FILTER_VER_CHROMA_SP_W2_4R 2, 8
 
-FILTER_VER_CHROMA_SP_W2_4R 2, 16
+    FILTER_VER_CHROMA_SP_W2_4R 2, 16
 
 ;--------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -12402,7 +15571,7 @@
     lea        r5, [tab_ChromaCoeffV + r4]
 %endif
 
-    mova       m4, [tab_c_526336]
+    mova       m4, [pd_526336]
 
     movq       m0, [r0]
     movq       m1, [r0 + r1]
@@ -12454,7 +15623,7 @@
     lea       r6, [tab_ChromaCoeffV + r4]
 %endif
 
-    mova      m6, [tab_c_526336]
+    mova      m6, [pd_526336]
 
     mov       r4d, %2/4
 
@@ -12512,9 +15681,9 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SP_W6_H4 6, 8
+    FILTER_VER_CHROMA_SP_W6_H4 6, 8
 
-FILTER_VER_CHROMA_SP_W6_H4 6, 16
+    FILTER_VER_CHROMA_SP_W6_H4 6, 16
 
 %macro PROCESS_CHROMA_SP_W8_2R 0
     movu       m1, [r0]
@@ -12566,7 +15735,7 @@
     lea       r5, [tab_ChromaCoeffV + r4]
 %endif
 
-    mova      m7, [tab_c_526336]
+    mova      m7, [pd_526336]
 
     mov       r4d, %2/2
 .loopH:
@@ -12598,15 +15767,15 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SP_W8_H2 8, 2
-FILTER_VER_CHROMA_SP_W8_H2 8, 4
-FILTER_VER_CHROMA_SP_W8_H2 8, 6
-FILTER_VER_CHROMA_SP_W8_H2 8, 8
-FILTER_VER_CHROMA_SP_W8_H2 8, 16
-FILTER_VER_CHROMA_SP_W8_H2 8, 32
+    FILTER_VER_CHROMA_SP_W8_H2 8, 2
+    FILTER_VER_CHROMA_SP_W8_H2 8, 4
+    FILTER_VER_CHROMA_SP_W8_H2 8, 6
+    FILTER_VER_CHROMA_SP_W8_H2 8, 8
+    FILTER_VER_CHROMA_SP_W8_H2 8, 16
+    FILTER_VER_CHROMA_SP_W8_H2 8, 32
 
-FILTER_VER_CHROMA_SP_W8_H2 8, 12
-FILTER_VER_CHROMA_SP_W8_H2 8, 64
+    FILTER_VER_CHROMA_SP_W8_H2 8, 12
+    FILTER_VER_CHROMA_SP_W8_H2 8, 64
 
 
 ;-----------------------------------------------------------------------------------------------------------------------------
@@ -12658,10 +15827,10 @@
     RET
 %endmacro
 
-FILTER_HORIZ_CHROMA_2xN 2, 4
-FILTER_HORIZ_CHROMA_2xN 2, 8
+    FILTER_HORIZ_CHROMA_2xN 2, 4
+    FILTER_HORIZ_CHROMA_2xN 2, 8
 
-FILTER_HORIZ_CHROMA_2xN 2, 16
+    FILTER_HORIZ_CHROMA_2xN 2, 16
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
@@ -12711,12 +15880,12 @@
     RET
 %endmacro
 
-FILTER_HORIZ_CHROMA_4xN 4, 2
-FILTER_HORIZ_CHROMA_4xN 4, 4
-FILTER_HORIZ_CHROMA_4xN 4, 8
-FILTER_HORIZ_CHROMA_4xN 4, 16
+    FILTER_HORIZ_CHROMA_4xN 4, 2
+    FILTER_HORIZ_CHROMA_4xN 4, 4
+    FILTER_HORIZ_CHROMA_4xN 4, 8
+    FILTER_HORIZ_CHROMA_4xN 4, 16
 
-FILTER_HORIZ_CHROMA_4xN 4, 32
+    FILTER_HORIZ_CHROMA_4xN 4, 32
 
 %macro PROCESS_CHROMA_W6 3
     movu       %1, [srcq]
@@ -12794,11 +15963,11 @@
     RET
 %endmacro
 
-FILTER_HORIZ_CHROMA 6, 8
-FILTER_HORIZ_CHROMA 12, 16
+    FILTER_HORIZ_CHROMA 6, 8
+    FILTER_HORIZ_CHROMA 12, 16
 
-FILTER_HORIZ_CHROMA 6, 16
-FILTER_HORIZ_CHROMA 12, 32
+    FILTER_HORIZ_CHROMA 6, 16
+    FILTER_HORIZ_CHROMA 12, 32
 
 %macro PROCESS_CHROMA_W8 3
     movu        %1, [srcq]
@@ -12857,15 +16026,15 @@
     RET
 %endmacro
 
-FILTER_HORIZ_CHROMA_8xN 8, 2
-FILTER_HORIZ_CHROMA_8xN 8, 4
-FILTER_HORIZ_CHROMA_8xN 8, 6
-FILTER_HORIZ_CHROMA_8xN 8, 8
-FILTER_HORIZ_CHROMA_8xN 8, 16
-FILTER_HORIZ_CHROMA_8xN 8, 32
+    FILTER_HORIZ_CHROMA_8xN 8, 2
+    FILTER_HORIZ_CHROMA_8xN 8, 4
+    FILTER_HORIZ_CHROMA_8xN 8, 6
+    FILTER_HORIZ_CHROMA_8xN 8, 8
+    FILTER_HORIZ_CHROMA_8xN 8, 16
+    FILTER_HORIZ_CHROMA_8xN 8, 32
 
-FILTER_HORIZ_CHROMA_8xN 8, 12
-FILTER_HORIZ_CHROMA_8xN 8, 64
+    FILTER_HORIZ_CHROMA_8xN 8, 12
+    FILTER_HORIZ_CHROMA_8xN 8, 64
 
 %macro PROCESS_CHROMA_W16 4
     movu        %1, [srcq]
@@ -13027,28 +16196,28 @@
     RET
 %endmacro
 
-FILTER_HORIZ_CHROMA_WxN 16, 4
-FILTER_HORIZ_CHROMA_WxN 16, 8
-FILTER_HORIZ_CHROMA_WxN 16, 12
-FILTER_HORIZ_CHROMA_WxN 16, 16
-FILTER_HORIZ_CHROMA_WxN 16, 32
-FILTER_HORIZ_CHROMA_WxN 24, 32
-FILTER_HORIZ_CHROMA_WxN 32,  8
-FILTER_HORIZ_CHROMA_WxN 32, 16
-FILTER_HORIZ_CHROMA_WxN 32, 24
-FILTER_HORIZ_CHROMA_WxN 32, 32
-
-FILTER_HORIZ_CHROMA_WxN 16, 24
-FILTER_HORIZ_CHROMA_WxN 16, 64
-FILTER_HORIZ_CHROMA_WxN 24, 64
-FILTER_HORIZ_CHROMA_WxN 32, 48
-FILTER_HORIZ_CHROMA_WxN 32, 64
-
-FILTER_HORIZ_CHROMA_WxN 64, 64
-FILTER_HORIZ_CHROMA_WxN 64, 32
-FILTER_HORIZ_CHROMA_WxN 64, 48
-FILTER_HORIZ_CHROMA_WxN 48, 64
-FILTER_HORIZ_CHROMA_WxN 64, 16
+    FILTER_HORIZ_CHROMA_WxN 16, 4
+    FILTER_HORIZ_CHROMA_WxN 16, 8
+    FILTER_HORIZ_CHROMA_WxN 16, 12
+    FILTER_HORIZ_CHROMA_WxN 16, 16
+    FILTER_HORIZ_CHROMA_WxN 16, 32
+    FILTER_HORIZ_CHROMA_WxN 24, 32
+    FILTER_HORIZ_CHROMA_WxN 32,  8
+    FILTER_HORIZ_CHROMA_WxN 32, 16
+    FILTER_HORIZ_CHROMA_WxN 32, 24
+    FILTER_HORIZ_CHROMA_WxN 32, 32
+
+    FILTER_HORIZ_CHROMA_WxN 16, 24
+    FILTER_HORIZ_CHROMA_WxN 16, 64
+    FILTER_HORIZ_CHROMA_WxN 24, 64
+    FILTER_HORIZ_CHROMA_WxN 32, 48
+    FILTER_HORIZ_CHROMA_WxN 32, 64
+
+    FILTER_HORIZ_CHROMA_WxN 64, 64
+    FILTER_HORIZ_CHROMA_WxN 64, 32
+    FILTER_HORIZ_CHROMA_WxN 64, 48
+    FILTER_HORIZ_CHROMA_WxN 48, 64
+    FILTER_HORIZ_CHROMA_WxN 64, 16
 
 
 ;---------------------------------------------------------------------------------------------------------------
@@ -13144,11 +16313,11 @@
     RET
 %endmacro
 
-FILTER_V_PS_W16n 64, 64
-FILTER_V_PS_W16n 64, 32
-FILTER_V_PS_W16n 64, 48
-FILTER_V_PS_W16n 48, 64
-FILTER_V_PS_W16n 64, 16
+    FILTER_V_PS_W16n 64, 64
+    FILTER_V_PS_W16n 64, 32
+    FILTER_V_PS_W16n 64, 48
+    FILTER_V_PS_W16n 48, 64
+    FILTER_V_PS_W16n 64, 16
 
 
 ;------------------------------------------------------------------------------------------------------------
@@ -13306,12 +16475,12 @@
    dec        r4d
    jnz        .loop
 
-RET
+    RET
 %endmacro
 
-FILTER_V_PS_W2 2, 8
+    FILTER_V_PS_W2 2, 8
 
-FILTER_V_PS_W2 2, 16
+    FILTER_V_PS_W2 2, 16
 
 ;-----------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -13472,8 +16641,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_4x4 sp
-FILTER_VER_CHROMA_S_AVX2_4x4 ss
+    FILTER_VER_CHROMA_S_AVX2_4x4 sp
+    FILTER_VER_CHROMA_S_AVX2_4x4 ss
 
 %macro FILTER_VER_CHROMA_S_AVX2_4x8 1
 INIT_YMM avx2
@@ -13584,8 +16753,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_4x8 sp
-FILTER_VER_CHROMA_S_AVX2_4x8 ss
+    FILTER_VER_CHROMA_S_AVX2_4x8 sp
+    FILTER_VER_CHROMA_S_AVX2_4x8 ss
 
 %macro PROCESS_CHROMA_AVX2_W4_16R 1
8074
     movq            xm0, [r0]
8075
@@ -13779,8 +16948,40 @@
8076
     RET
8077
 %endmacro
8078
 
8079
-FILTER_VER_CHROMA_S_AVX2_4x16 sp
8080
-FILTER_VER_CHROMA_S_AVX2_4x16 ss
8081
+    FILTER_VER_CHROMA_S_AVX2_4x16 sp
8082
+    FILTER_VER_CHROMA_S_AVX2_4x16 ss
8083
+
8084
+%macro FILTER_VER_CHROMA_S_AVX2_4x32 1
8085
+INIT_YMM avx2
8086
+cglobal interp_4tap_vert_%1_4x32, 4, 7, 8
8087
+    mov             r4d, r4m
8088
+    shl             r4d, 6
8089
+    add             r1d, r1d
8090
+    sub             r0, r1
8091
+
8092
+%ifdef PIC
8093
+    lea             r5, [pw_ChromaCoeffV]
8094
+    add             r5, r4
8095
+%else
8096
+    lea             r5, [pw_ChromaCoeffV + r4]
8097
+%endif
8098
+
8099
+    lea             r4, [r1 * 3]
8100
+%ifidn %1,sp
8101
+    mova            m7, [pd_526336]
8102
+%else
8103
+    add             r3d, r3d
8104
+%endif
8105
+    lea             r6, [r3 * 3]
8106
+%rep 2
8107
+    PROCESS_CHROMA_AVX2_W4_16R %1
8108
+    lea             r2, [r2 + r3 * 4]
8109
+%endrep
8110
+    RET
8111
+%endmacro
8112
+
8113
+    FILTER_VER_CHROMA_S_AVX2_4x32 sp
8114
+    FILTER_VER_CHROMA_S_AVX2_4x32 ss
8115
 
8116
 %macro FILTER_VER_CHROMA_S_AVX2_4x2 1
8117
 INIT_YMM avx2
8118
@@ -13836,8 +17037,8 @@
8119
     RET
8120
 %endmacro
8121
 
8122
-FILTER_VER_CHROMA_S_AVX2_4x2 sp
8123
-FILTER_VER_CHROMA_S_AVX2_4x2 ss
8124
+    FILTER_VER_CHROMA_S_AVX2_4x2 sp
8125
+    FILTER_VER_CHROMA_S_AVX2_4x2 ss
8126
 
8127
 %macro FILTER_VER_CHROMA_S_AVX2_2x4 1
8128
 INIT_YMM avx2
8129
@@ -13906,8 +17107,8 @@
8130
     RET
8131
 %endmacro
8132
 
8133
-FILTER_VER_CHROMA_S_AVX2_2x4 sp
8134
-FILTER_VER_CHROMA_S_AVX2_2x4 ss
8135
+    FILTER_VER_CHROMA_S_AVX2_2x4 sp
8136
+    FILTER_VER_CHROMA_S_AVX2_2x4 ss
8137
 
8138
 %macro FILTER_VER_CHROMA_S_AVX2_8x8 1
8139
 INIT_YMM avx2
8140
@@ -14085,8 +17286,8 @@
8141
     RET
8142
 %endmacro
8143
 
8144
-FILTER_VER_CHROMA_S_AVX2_8x8 sp
8145
-FILTER_VER_CHROMA_S_AVX2_8x8 ss
8146
+    FILTER_VER_CHROMA_S_AVX2_8x8 sp
8147
+    FILTER_VER_CHROMA_S_AVX2_8x8 ss
8148
 
8149
 %macro PROCESS_CHROMA_S_AVX2_W8_16R 1
8150
     movu            xm0, [r0]                       ; m0 = row 0
8151
@@ -14401,10 +17602,12 @@
8152
 %endif
8153
 %endmacro
8154
 
8155
-FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16
8156
-FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32
8157
-FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16
8158
-FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32
8159
+    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16
8160
+    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32
8161
+    FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64
8162
+    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16
8163
+    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32
8164
+    FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64
8165
 
8166
 %macro FILTER_VER_CHROMA_S_AVX2_NxN 3
8167
 INIT_YMM avx2
8168
@@ -14453,12 +17656,28 @@
8169
 %endif
8170
 %endmacro
8171
 
8172
-FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp
8173
-FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp
8174
-FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp
8175
-FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss
8176
-FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss
8177
-FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss
8178
+    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp
8179
+    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp
8180
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp
8181
+    FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss
8182
+    FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss
8183
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss
8184
+    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp
8185
+    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp
8186
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp
8187
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, sp
8188
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss
8189
+    FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss
8190
+    FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss
8191
+    FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss
8192
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp
8193
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp
8194
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp
8195
+    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp
8196
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss
8197
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss
8198
+    FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss
8199
+    FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss
8200
 
8201
 %macro PROCESS_CHROMA_S_AVX2_W8_4R 1
8202
     movu            xm0, [r0]                       ; m0 = row 0
8203
@@ -14567,8 +17786,8 @@
8204
     RET
8205
 %endmacro
8206
 
8207
-FILTER_VER_CHROMA_S_AVX2_8x4 sp
8208
-FILTER_VER_CHROMA_S_AVX2_8x4 ss
8209
+    FILTER_VER_CHROMA_S_AVX2_8x4 sp
8210
+    FILTER_VER_CHROMA_S_AVX2_8x4 ss
8211
 
8212
 %macro FILTER_VER_CHROMA_S_AVX2_12x16 1
8213
 INIT_YMM avx2
8214
@@ -14606,8 +17825,55 @@
8215
 %endif
8216
 %endmacro
8217
 
8218
-FILTER_VER_CHROMA_S_AVX2_12x16 sp
8219
-FILTER_VER_CHROMA_S_AVX2_12x16 ss
8220
+    FILTER_VER_CHROMA_S_AVX2_12x16 sp
8221
+    FILTER_VER_CHROMA_S_AVX2_12x16 ss
8222
+
8223
+%macro FILTER_VER_CHROMA_S_AVX2_12x32 1
8224
+%if ARCH_X86_64 == 1
8225
+INIT_YMM avx2
8226
+cglobal interp_4tap_vert_%1_12x32, 4, 9, 10
8227
+    mov             r4d, r4m
8228
+    shl             r4d, 6
8229
+    add             r1d, r1d
8230
+
8231
+%ifdef PIC
8232
+    lea             r5, [pw_ChromaCoeffV]
8233
+    add             r5, r4
8234
+%else
8235
+    lea             r5, [pw_ChromaCoeffV + r4]
8236
+%endif
8237
+
8238
+    lea             r4, [r1 * 3]
8239
+    sub             r0, r1
8240
+%ifidn %1, sp
8241
+    mova            m9, [pd_526336]
8242
+%else
8243
+    add             r3d, r3d
8244
+%endif
8245
+    lea             r6, [r3 * 3]
8246
+%rep 2
8247
+    PROCESS_CHROMA_S_AVX2_W8_16R %1
8248
+%ifidn %1, sp
8249
+    add             r2, 8
8250
+%else
8251
+    add             r2, 16
8252
+%endif
8253
+    add             r0, 16
8254
+    mova            m7, m9
8255
+    PROCESS_CHROMA_AVX2_W4_16R %1
8256
+    sub             r0, 16
8257
+%ifidn %1, sp
8258
+    lea             r2, [r2 + r3 * 4 - 8]
8259
+%else
8260
+    lea             r2, [r2 + r3 * 4 - 16]
8261
+%endif
8262
+%endrep
8263
+    RET
8264
+%endif
8265
+%endmacro
8266
+
8267
+    FILTER_VER_CHROMA_S_AVX2_12x32 sp
8268
+    FILTER_VER_CHROMA_S_AVX2_12x32 ss
8269
 
8270
 %macro FILTER_VER_CHROMA_S_AVX2_16x12 1
 INIT_YMM avx2
@@ -14860,8 +18126,257 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_16x12 sp
-FILTER_VER_CHROMA_S_AVX2_16x12 ss
+    FILTER_VER_CHROMA_S_AVX2_16x12 sp
+    FILTER_VER_CHROMA_S_AVX2_16x12 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_8x12 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_8x12, 4, 7, 9
+    mov             r4d, r4m
+    shl             r4d, 6
+    add             r1d, r1d
+
+%ifdef PIC
+    lea             r5, [pw_ChromaCoeffV]
+    add             r5, r4
+%else
+    lea             r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r1
+%ifidn %1,sp
+    mova            m8, [pd_526336]
+%else
+    add             r3d, r3d
+%endif
+    lea             r6, [r3 * 3]
+    movu            xm0, [r0]                       ; m0 = row 0
+    movu            xm1, [r0 + r1]                  ; m1 = row 1
+    punpckhwd       xm2, xm0, xm1
+    punpcklwd       xm0, xm1
+    vinserti128     m0, m0, xm2, 1
+    pmaddwd         m0, [r5]
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2
+    punpckhwd       xm3, xm1, xm2
+    punpcklwd       xm1, xm2
+    vinserti128     m1, m1, xm3, 1
+    pmaddwd         m1, [r5]
+    movu            xm3, [r0 + r4]                  ; m3 = row 3
+    punpckhwd       xm4, xm2, xm3
+    punpcklwd       xm2, xm3
+    vinserti128     m2, m2, xm4, 1
+    pmaddwd         m4, m2, [r5 + 1 * mmsize]
+    paddd           m0, m4
+    pmaddwd         m2, [r5]
+    lea             r0, [r0 + r1 * 4]
+    movu            xm4, [r0]                       ; m4 = row 4
+    punpckhwd       xm5, xm3, xm4
+    punpcklwd       xm3, xm4
+    vinserti128     m3, m3, xm5, 1
+    pmaddwd         m5, m3, [r5 + 1 * mmsize]
+    paddd           m1, m5
+    pmaddwd         m3, [r5]
+%ifidn %1,sp
+    paddd           m0, m8
+    paddd           m1, m8
+    psrad           m0, 12
+    psrad           m1, 12
+%else
+    psrad           m0, 6
+    psrad           m1, 6
+%endif
+    packssdw        m0, m1
+
+    movu            xm5, [r0 + r1]                  ; m5 = row 5
+    punpckhwd       xm6, xm4, xm5
+    punpcklwd       xm4, xm5
+    vinserti128     m4, m4, xm6, 1
+    pmaddwd         m6, m4, [r5 + 1 * mmsize]
+    paddd           m2, m6
+    pmaddwd         m4, [r5]
+    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6
+    punpckhwd       xm1, xm5, xm6
+    punpcklwd       xm5, xm6
+    vinserti128     m5, m5, xm1, 1
+    pmaddwd         m1, m5, [r5 + 1 * mmsize]
+    pmaddwd         m5, [r5]
+    paddd           m3, m1
+%ifidn %1,sp
+    paddd           m2, m8
+    paddd           m3, m8
+    psrad           m2, 12
+    psrad           m3, 12
+%else
+    psrad           m2, 6
+    psrad           m3, 6
+%endif
+    packssdw        m2, m3
+%ifidn %1,sp
+    packuswb        m0, m2
+    mova            m3, [interp8_hps_shuf]
+    vpermd          m0, m3, m0
+    vextracti128    xm2, m0, 1
+    movq            [r2], xm0
+    movhps          [r2 + r3], xm0
+    movq            [r2 + r3 * 2], xm2
+    movhps          [r2 + r6], xm2
+%else
+    vpermq          m0, m0, 11011000b
+    vpermq          m2, m2, 11011000b
+    movu            [r2], xm0
+    vextracti128    xm0, m0, 1
+    vextracti128    xm3, m2, 1
+    movu            [r2 + r3], xm0
+    movu            [r2 + r3 * 2], xm2
+    movu            [r2 + r6], xm3
+%endif
+    lea             r2, [r2 + r3 * 4]
+
+    movu            xm1, [r0 + r4]                  ; m1 = row 7
+    punpckhwd       xm0, xm6, xm1
+    punpcklwd       xm6, xm1
+    vinserti128     m6, m6, xm0, 1
+    pmaddwd         m0, m6, [r5 + 1 * mmsize]
+    pmaddwd         m6, [r5]
+    paddd           m4, m0
+    lea             r0, [r0 + r1 * 4]
+    movu            xm0, [r0]                       ; m0 = row 8
+    punpckhwd       xm2, xm1, xm0
+    punpcklwd       xm1, xm0
+    vinserti128     m1, m1, xm2, 1
+    pmaddwd         m2, m1, [r5 + 1 * mmsize]
+    pmaddwd         m1, [r5]
+    paddd           m5, m2
+%ifidn %1,sp
+    paddd           m4, m8
+    paddd           m5, m8
+    psrad           m4, 12
+    psrad           m5, 12
+%else
+    psrad           m4, 6
+    psrad           m5, 6
+%endif
+    packssdw        m4, m5
+
+    movu            xm2, [r0 + r1]                  ; m2 = row 9
+    punpckhwd       xm5, xm0, xm2
+    punpcklwd       xm0, xm2
+    vinserti128     m0, m0, xm5, 1
+    pmaddwd         m5, m0, [r5 + 1 * mmsize]
+    paddd           m6, m5
+    pmaddwd         m0, [r5]
+    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10
+    punpckhwd       xm7, xm2, xm5
+    punpcklwd       xm2, xm5
+    vinserti128     m2, m2, xm7, 1
+    pmaddwd         m7, m2, [r5 + 1 * mmsize]
+    paddd           m1, m7
+    pmaddwd         m2, [r5]
+
+%ifidn %1,sp
+    paddd           m6, m8
+    paddd           m1, m8
+    psrad           m6, 12
+    psrad           m1, 12
+%else
+    psrad           m6, 6
+    psrad           m1, 6
+%endif
+    packssdw        m6, m1
+%ifidn %1,sp
+    packuswb        m4, m6
+    vpermd          m4, m3, m4
+    vextracti128    xm6, m4, 1
+    movq            [r2], xm4
+    movhps          [r2 + r3], xm4
+    movq            [r2 + r3 * 2], xm6
+    movhps          [r2 + r6], xm6
+%else
+    vpermq          m4, m4, 11011000b
+    vpermq          m6, m6, 11011000b
+    vextracti128    xm7, m4, 1
+    vextracti128    xm1, m6, 1
+    movu            [r2], xm4
+    movu            [r2 + r3], xm7
+    movu            [r2 + r3 * 2], xm6
+    movu            [r2 + r6], xm1
+%endif
+    lea             r2, [r2 + r3 * 4]
+
+    movu            xm7, [r0 + r4]                  ; m7 = row 11
+    punpckhwd       xm1, xm5, xm7
+    punpcklwd       xm5, xm7
+    vinserti128     m5, m5, xm1, 1
+    pmaddwd         m1, m5, [r5 + 1 * mmsize]
+    paddd           m0, m1
+    pmaddwd         m5, [r5]
+    lea             r0, [r0 + r1 * 4]
+    movu            xm1, [r0]                       ; m1 = row 12
+    punpckhwd       xm4, xm7, xm1
+    punpcklwd       xm7, xm1
+    vinserti128     m7, m7, xm4, 1
+    pmaddwd         m4, m7, [r5 + 1 * mmsize]
+    paddd           m2, m4
+    pmaddwd         m7, [r5]
+%ifidn %1,sp
+    paddd           m0, m8
+    paddd           m2, m8
+    psrad           m0, 12
+    psrad           m2, 12
+%else
+    psrad           m0, 6
+    psrad           m2, 6
+%endif
+    packssdw        m0, m2
+
+    movu            xm4, [r0 + r1]                  ; m4 = row 13
+    punpckhwd       xm2, xm1, xm4
+    punpcklwd       xm1, xm4
+    vinserti128     m1, m1, xm2, 1
+    pmaddwd         m1, [r5 + 1 * mmsize]
+    paddd           m5, m1
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 14
+    punpckhwd       xm6, xm4, xm2
+    punpcklwd       xm4, xm2
+    vinserti128     m4, m4, xm6, 1
+    pmaddwd         m4, [r5 + 1 * mmsize]
+    paddd           m7, m4
+%ifidn %1,sp
+    paddd           m5, m8
+    paddd           m7, m8
+    psrad           m5, 12
+    psrad           m7, 12
+%else
+    psrad           m5, 6
+    psrad           m7, 6
+%endif
+    packssdw        m5, m7
+%ifidn %1,sp
+    packuswb        m0, m5
+    vpermd          m0, m3, m0
+    vextracti128    xm5, m0, 1
+    movq            [r2], xm0
+    movhps          [r2 + r3], xm0
+    movq            [r2 + r3 * 2], xm5
+    movhps          [r2 + r6], xm5
+%else
+    vpermq          m0, m0, 11011000b
+    vpermq          m5, m5, 11011000b
+    vextracti128    xm7, m0, 1
+    vextracti128    xm6, m5, 1
+    movu            [r2], xm0
+    movu            [r2 + r3], xm7
+    movu            [r2 + r3 * 2], xm5
+    movu            [r2 + r6], xm6
+%endif
+    RET
+%endif
+%endmacro
+
+    FILTER_VER_CHROMA_S_AVX2_8x12 sp
+    FILTER_VER_CHROMA_S_AVX2_8x12 ss
 
 %macro FILTER_VER_CHROMA_S_AVX2_16x4 1
 INIT_YMM avx2
@@ -14906,8 +18421,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_16x4 sp
-FILTER_VER_CHROMA_S_AVX2_16x4 ss
+    FILTER_VER_CHROMA_S_AVX2_16x4 sp
+    FILTER_VER_CHROMA_S_AVX2_16x4 ss
 
 %macro PROCESS_CHROMA_S_AVX2_W8_8R 1
     movu            xm0, [r0]                       ; m0 = row 0
@@ -15097,10 +18612,10 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32
-FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16
-FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32
-FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16
+    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32
+    FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16
+    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32
+    FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16
 
 %macro FILTER_VER_CHROMA_S_AVX2_8x2 1
 INIT_YMM avx2
@@ -15172,8 +18687,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_8x2 sp
-FILTER_VER_CHROMA_S_AVX2_8x2 ss
+    FILTER_VER_CHROMA_S_AVX2_8x2 sp
+    FILTER_VER_CHROMA_S_AVX2_8x2 ss
 
 %macro FILTER_VER_CHROMA_S_AVX2_8x6 1
 INIT_YMM avx2
@@ -15315,8 +18830,8 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_8x6 sp
-FILTER_VER_CHROMA_S_AVX2_8x6 ss
+    FILTER_VER_CHROMA_S_AVX2_8x6 sp
+    FILTER_VER_CHROMA_S_AVX2_8x6 ss
 
 %macro FILTER_VER_CHROMA_S_AVX2_8xN 2
 INIT_YMM avx2
@@ -15637,15 +19152,17 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_8xN sp, 16
-FILTER_VER_CHROMA_S_AVX2_8xN sp, 32
-FILTER_VER_CHROMA_S_AVX2_8xN ss, 16
-FILTER_VER_CHROMA_S_AVX2_8xN ss, 32
+    FILTER_VER_CHROMA_S_AVX2_8xN sp, 16
+    FILTER_VER_CHROMA_S_AVX2_8xN sp, 32
+    FILTER_VER_CHROMA_S_AVX2_8xN sp, 64
+    FILTER_VER_CHROMA_S_AVX2_8xN ss, 16
+    FILTER_VER_CHROMA_S_AVX2_8xN ss, 32
+    FILTER_VER_CHROMA_S_AVX2_8xN ss, 64
 
-%macro FILTER_VER_CHROMA_S_AVX2_32x24 1
-INIT_YMM avx2
+%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2
 %if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_32x24, 4, 10, 10
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_%2x24, 4, 10, 10
    mov             r4d, r4m
    shl             r4d, 6
    add             r1d, r1d
@@ -15665,7 +19182,7 @@
    add             r3d, r3d
 %endif
    lea             r6, [r3 * 3]
-    mov             r9d, 4
+    mov             r9d, %2 / 8
 .loopW:
    PROCESS_CHROMA_S_AVX2_W8_16R %1
 %ifidn %1,sp
@@ -15677,13 +19194,13 @@
    dec             r9d
    jnz             .loopW
 %ifidn %1,sp
-    lea             r2, [r8 + r3 * 4 - 24]
+    lea             r2, [r8 + r3 * 4 - %2 + 8]
 %else
-    lea             r2, [r8 + r3 * 4 - 48]
+    lea             r2, [r8 + r3 * 4 - 2 * %2 + 16]
 %endif
-    lea             r0, [r7 - 48]
+    lea             r0, [r7 - 2 * %2 + 16]
    mova            m7, m9
-    mov             r9d, 4
+    mov             r9d, %2 / 8
 .loop:
    PROCESS_CHROMA_S_AVX2_W8_8R %1
 %ifidn %1,sp
@@ -15698,8 +19215,10 @@
 %endif
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_32x24 sp
-FILTER_VER_CHROMA_S_AVX2_32x24 ss
+    FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32
+    FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16
+    FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32
+    FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16
 
 %macro FILTER_VER_CHROMA_S_AVX2_2x8 1
 INIT_YMM avx2
@@ -15797,8 +19316,170 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_2x8 sp
-FILTER_VER_CHROMA_S_AVX2_2x8 ss
+    FILTER_VER_CHROMA_S_AVX2_2x8 sp
+    FILTER_VER_CHROMA_S_AVX2_2x8 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_2x16 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_2x16, 4, 6, 9
+    mov             r4d, r4m
+    shl             r4d, 6
+    add             r1d, r1d
+    sub             r0, r1
+
+%ifdef PIC
+    lea             r5, [pw_ChromaCoeffV]
+    add             r5, r4
+%else
+    lea             r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+%ifidn %1,sp
+    mova            m6, [pd_526336]
+%else
+    add             r3d, r3d
+%endif
+    movd            xm0, [r0]
+    movd            xm1, [r0 + r1]
+    punpcklwd       xm0, xm1
+    movd            xm2, [r0 + r1 * 2]
+    punpcklwd       xm1, xm2
+    punpcklqdq      xm0, xm1                        ; m0 = [2 1 1 0]
+    movd            xm3, [r0 + r4]
+    punpcklwd       xm2, xm3
+    lea             r0, [r0 + 4 * r1]
+    movd            xm4, [r0]
+    punpcklwd       xm3, xm4
+    punpcklqdq      xm2, xm3                        ; m2 = [4 3 3 2]
+    vinserti128     m0, m0, xm2, 1                  ; m0 = [4 3 3 2 2 1 1 0]
+    movd            xm1, [r0 + r1]
+    punpcklwd       xm4, xm1
+    movd            xm3, [r0 + r1 * 2]
+    punpcklwd       xm1, xm3
+    punpcklqdq      xm4, xm1                        ; m4 = [6 5 5 4]
+    vinserti128     m2, m2, xm4, 1                  ; m2 = [6 5 5 4 4 3 3 2]
+    pmaddwd         m0, [r5]
+    pmaddwd         m2, [r5 + 1 * mmsize]
+    paddd           m0, m2
+    movd            xm1, [r0 + r4]
+    punpcklwd       xm3, xm1
+    lea             r0, [r0 + 4 * r1]
+    movd            xm2, [r0]
+    punpcklwd       xm1, xm2
+    punpcklqdq      xm3, xm1                        ; m3 = [8 7 7 6]
+    vinserti128     m4, m4, xm3, 1                  ; m4 = [8 7 7 6 6 5 5 4]
+    movd            xm1, [r0 + r1]
+    punpcklwd       xm2, xm1
+    movd            xm5, [r0 + r1 * 2]
+    punpcklwd       xm1, xm5
+    punpcklqdq      xm2, xm1                        ; m2 = [10 9 9 8]
+    vinserti128     m3, m3, xm2, 1                  ; m3 = [10 9 9 8 8 7 7 6]
+    pmaddwd         m4, [r5]
+    pmaddwd         m3, [r5 + 1 * mmsize]
+    paddd           m4, m3
+    movd            xm1, [r0 + r4]
+    punpcklwd       xm5, xm1
+    lea             r0, [r0 + 4 * r1]
+    movd            xm3, [r0]
+    punpcklwd       xm1, xm3
+    punpcklqdq      xm5, xm1                        ; m5 = [12 11 11 10]
+    vinserti128     m2, m2, xm5, 1                  ; m2 = [12 11 11 10 10 9 9 8]
+    movd            xm1, [r0 + r1]
+    punpcklwd       xm3, xm1
+    movd            xm7, [r0 + r1 * 2]
+    punpcklwd       xm1, xm7
+    punpcklqdq      xm3, xm1                        ; m3 = [14 13 13 12]
+    vinserti128     m5, m5, xm3, 1                  ; m5 = [14 13 13 12 12 11 11 10]
+    pmaddwd         m2, [r5]
+    pmaddwd         m5, [r5 + 1 * mmsize]
+    paddd           m2, m5
+    movd            xm5, [r0 + r4]
+    punpcklwd       xm7, xm5
+    lea             r0, [r0 + 4 * r1]
+    movd            xm1, [r0]
+    punpcklwd       xm5, xm1
+    punpcklqdq      xm7, xm5                        ; m7 = [16 15 15 14]
+    vinserti128     m3, m3, xm7, 1                  ; m3 = [16 15 15 14 14 13 13 12]
+    movd            xm5, [r0 + r1]
+    punpcklwd       xm1, xm5
+    movd            xm8, [r0 + r1 * 2]
+    punpcklwd       xm5, xm8
+    punpcklqdq      xm1, xm5                        ; m1 = [18 17 17 16]
+    vinserti128     m7, m7, xm1, 1                  ; m7 = [18 17 17 16 16 15 15 14]
+    pmaddwd         m3, [r5]
+    pmaddwd         m7, [r5 + 1 * mmsize]
+    paddd           m3, m7
+%ifidn %1,sp
+    paddd           m0, m6
+    paddd           m4, m6
+    paddd           m2, m6
+    paddd           m3, m6
+    psrad           m0, 12
+    psrad           m4, 12
+    psrad           m2, 12
+    psrad           m3, 12
+%else
+    psrad           m0, 6
+    psrad           m4, 6
+    psrad           m2, 6
+    psrad           m3, 6
+%endif
+    packssdw        m0, m4
+    packssdw        m2, m3
+    lea             r4, [r3 * 3]
+%ifidn %1,sp
+    packuswb        m0, m2
+    vextracti128    xm2, m0, 1
+    pextrw          [r2], xm0, 0
+    pextrw          [r2 + r3], xm0, 1
+    pextrw          [r2 + 2 * r3], xm2, 0
+    pextrw          [r2 + r4], xm2, 1
+    lea             r2, [r2 + r3 * 4]
+    pextrw          [r2], xm0, 2
+    pextrw          [r2 + r3], xm0, 3
+    pextrw          [r2 + 2 * r3], xm2, 2
+    pextrw          [r2 + r4], xm2, 3
+    lea             r2, [r2 + r3 * 4]
+    pextrw          [r2], xm0, 4
+    pextrw          [r2 + r3], xm0, 5
+    pextrw          [r2 + 2 * r3], xm2, 4
+    pextrw          [r2 + r4], xm2, 5
+    lea             r2, [r2 + r3 * 4]
+    pextrw          [r2], xm0, 6
+    pextrw          [r2 + r3], xm0, 7
+    pextrw          [r2 + 2 * r3], xm2, 6
+    pextrw          [r2 + r4], xm2, 7
+%else
+    vextracti128    xm4, m0, 1
+    vextracti128    xm3, m2, 1
+    movd            [r2], xm0
+    pextrd          [r2 + r3], xm0, 1
+    movd            [r2 + 2 * r3], xm4
+    pextrd          [r2 + r4], xm4, 1
+    lea             r2, [r2 + r3 * 4]
+    pextrd          [r2], xm0, 2
+    pextrd          [r2 + r3], xm0, 3
+    pextrd          [r2 + 2 * r3], xm4, 2
+    pextrd          [r2 + r4], xm4, 3
+    lea             r2, [r2 + r3 * 4]
+    movd            [r2], xm2
+    pextrd          [r2 + r3], xm2, 1
+    movd            [r2 + 2 * r3], xm3
+    pextrd          [r2 + r4], xm3, 1
+    lea             r2, [r2 + r3 * 4]
+    pextrd          [r2], xm2, 2
+    pextrd          [r2 + r3], xm2, 3
+    pextrd          [r2 + 2 * r3], xm3, 2
+    pextrd          [r2 + r4], xm3, 3
+%endif
+    RET
+%endif
+%endmacro
+
+    FILTER_VER_CHROMA_S_AVX2_2x16 sp
+    FILTER_VER_CHROMA_S_AVX2_2x16 ss
 
 %macro FILTER_VER_CHROMA_S_AVX2_6x8 1
 INIT_YMM avx2
@@ -15985,8 +19666,344 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_S_AVX2_6x8 sp
-FILTER_VER_CHROMA_S_AVX2_6x8 ss
+    FILTER_VER_CHROMA_S_AVX2_6x8 sp
+    FILTER_VER_CHROMA_S_AVX2_6x8 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_6x16 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_6x16, 4, 7, 9
+    mov             r4d, r4m
+    shl             r4d, 6
+    add             r1d, r1d
+
+%ifdef PIC
+    lea             r5, [pw_ChromaCoeffV]
+    add             r5, r4
+%else
+    lea             r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r1
+%ifidn %1,sp
+    mova            m8, [pd_526336]
+%else
+    add             r3d, r3d
+%endif
+    lea             r6, [r3 * 3]
+    movu            xm0, [r0]                       ; m0 = row 0
+    movu            xm1, [r0 + r1]                  ; m1 = row 1
+    punpckhwd       xm2, xm0, xm1
+    punpcklwd       xm0, xm1
+    vinserti128     m0, m0, xm2, 1
+    pmaddwd         m0, [r5]
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2
+    punpckhwd       xm3, xm1, xm2
+    punpcklwd       xm1, xm2
+    vinserti128     m1, m1, xm3, 1
+    pmaddwd         m1, [r5]
+    movu            xm3, [r0 + r4]                  ; m3 = row 3
+    punpckhwd       xm4, xm2, xm3
+    punpcklwd       xm2, xm3
+    vinserti128     m2, m2, xm4, 1
+    pmaddwd         m4, m2, [r5 + 1 * mmsize]
+    paddd           m0, m4
+    pmaddwd         m2, [r5]
+    lea             r0, [r0 + r1 * 4]
+    movu            xm4, [r0]                       ; m4 = row 4
+    punpckhwd       xm5, xm3, xm4
+    punpcklwd       xm3, xm4
+    vinserti128     m3, m3, xm5, 1
+    pmaddwd         m5, m3, [r5 + 1 * mmsize]
+    paddd           m1, m5
+    pmaddwd         m3, [r5]
+%ifidn %1,sp
+    paddd           m0, m8
+    paddd           m1, m8
+    psrad           m0, 12
+    psrad           m1, 12
+%else
+    psrad           m0, 6
+    psrad           m1, 6
+%endif
+    packssdw        m0, m1
+
+    movu            xm5, [r0 + r1]                  ; m5 = row 5
+    punpckhwd       xm6, xm4, xm5
+    punpcklwd       xm4, xm5
+    vinserti128     m4, m4, xm6, 1
+    pmaddwd         m6, m4, [r5 + 1 * mmsize]
+    paddd           m2, m6
+    pmaddwd         m4, [r5]
+    movu            xm6, [r0 + r1 * 2]              ; m6 = row 6
+    punpckhwd       xm1, xm5, xm6
+    punpcklwd       xm5, xm6
+    vinserti128     m5, m5, xm1, 1
+    pmaddwd         m1, m5, [r5 + 1 * mmsize]
+    pmaddwd         m5, [r5]
+    paddd           m3, m1
+%ifidn %1,sp
+    paddd           m2, m8
+    paddd           m3, m8
+    psrad           m2, 12
+    psrad           m3, 12
+%else
+    psrad           m2, 6
+    psrad           m3, 6
+%endif
+    packssdw        m2, m3
+%ifidn %1,sp
+    packuswb        m0, m2
+    vextracti128    xm2, m0, 1
+    movd            [r2], xm0
+    pextrw          [r2 + 4], xm2, 0
+    pextrd          [r2 + r3], xm0, 1
+    pextrw          [r2 + r3 + 4], xm2, 2
+    pextrd          [r2 + r3 * 2], xm0, 2
+    pextrw          [r2 + r3 * 2 + 4], xm2, 4
+    pextrd          [r2 + r6], xm0, 3
+    pextrw          [r2 + r6 + 4], xm2, 6
+%else
+    movq            [r2], xm0
+    movhps          [r2 + r3], xm0
+    movq            [r2 + r3 * 2], xm2
+    movhps          [r2 + r6], xm2
+    vextracti128    xm0, m0, 1
+    vextracti128    xm3, m2, 1
+    movd            [r2 + 8], xm0
+    pextrd          [r2 + r3 + 8], xm0, 2
+    movd            [r2 + r3 * 2 + 8], xm3
+    pextrd          [r2 + r6 + 8], xm3, 2
+%endif
+    lea             r2, [r2 + r3 * 4]
+    movu            xm1, [r0 + r4]                  ; m1 = row 7
+    punpckhwd       xm0, xm6, xm1
+    punpcklwd       xm6, xm1
+    vinserti128     m6, m6, xm0, 1
+    pmaddwd         m0, m6, [r5 + 1 * mmsize]
+    pmaddwd         m6, [r5]
+    paddd           m4, m0
+    lea             r0, [r0 + r1 * 4]
+    movu            xm0, [r0]                       ; m0 = row 8
+    punpckhwd       xm2, xm1, xm0
+    punpcklwd       xm1, xm0
+    vinserti128     m1, m1, xm2, 1
+    pmaddwd         m2, m1, [r5 + 1 * mmsize]
+    pmaddwd         m1, [r5]
+    paddd           m5, m2
+%ifidn %1,sp
+    paddd           m4, m8
+    paddd           m5, m8
+    psrad           m4, 12
+    psrad           m5, 12
+%else
+    psrad           m4, 6
+    psrad           m5, 6
+%endif
+    packssdw        m4, m5
+
+    movu            xm2, [r0 + r1]                  ; m2 = row 9
+    punpckhwd       xm5, xm0, xm2
+    punpcklwd       xm0, xm2
+    vinserti128     m0, m0, xm5, 1
+    pmaddwd         m5, m0, [r5 + 1 * mmsize]
+    paddd           m6, m5
+    pmaddwd         m0, [r5]
+    movu            xm5, [r0 + r1 * 2]              ; m5 = row 10
+    punpckhwd       xm7, xm2, xm5
+    punpcklwd       xm2, xm5
+    vinserti128     m2, m2, xm7, 1
+    pmaddwd         m7, m2, [r5 + 1 * mmsize]
+    paddd           m1, m7
+    pmaddwd         m2, [r5]
+
+%ifidn %1,sp
+    paddd           m6, m8
+    paddd           m1, m8
+    psrad           m6, 12
+    psrad           m1, 12
+%else
+    psrad           m6, 6
+    psrad           m1, 6
+%endif
+    packssdw        m6, m1
+%ifidn %1,sp
+    packuswb        m4, m6
+    vextracti128    xm6, m4, 1
+    movd            [r2], xm4
+    pextrw          [r2 + 4], xm6, 0
+    pextrd          [r2 + r3], xm4, 1
+    pextrw          [r2 + r3 + 4], xm6, 2
+    pextrd          [r2 + r3 * 2], xm4, 2
+    pextrw          [r2 + r3 * 2 + 4], xm6, 4
+    pextrd          [r2 + r6], xm4, 3
+    pextrw          [r2 + r6 + 4], xm6, 6
+%else
+    movq            [r2], xm4
+    movhps          [r2 + r3], xm4
+    movq            [r2 + r3 * 2], xm6
+    movhps          [r2 + r6], xm6
+    vextracti128    xm4, m4, 1
+    vextracti128    xm1, m6, 1
+    movd            [r2 + 8], xm4
+    pextrd          [r2 + r3 + 8], xm4, 2
+    movd            [r2 + r3 * 2 + 8], xm1
+    pextrd          [r2 + r6 + 8], xm1, 2
+%endif
+    lea             r2, [r2 + r3 * 4]
+    movu            xm7, [r0 + r4]                  ; m7 = row 11
+    punpckhwd       xm1, xm5, xm7
+    punpcklwd       xm5, xm7
+    vinserti128     m5, m5, xm1, 1
+    pmaddwd         m1, m5, [r5 + 1 * mmsize]
+    paddd           m0, m1
+    pmaddwd         m5, [r5]
+    lea             r0, [r0 + r1 * 4]
+    movu            xm1, [r0]                       ; m1 = row 12
+    punpckhwd       xm4, xm7, xm1
+    punpcklwd       xm7, xm1
+    vinserti128     m7, m7, xm4, 1
+    pmaddwd         m4, m7, [r5 + 1 * mmsize]
+    paddd           m2, m4
+    pmaddwd         m7, [r5]
+%ifidn %1,sp
+    paddd           m0, m8
+    paddd           m2, m8
+    psrad           m0, 12
+    psrad           m2, 12
+%else
+    psrad           m0, 6
+    psrad           m2, 6
+%endif
+    packssdw        m0, m2
+
+    movu            xm4, [r0 + r1]                  ; m4 = row 13
+    punpckhwd       xm2, xm1, xm4
+    punpcklwd       xm1, xm4
+    vinserti128     m1, m1, xm2, 1
+    pmaddwd         m2, m1, [r5 + 1 * mmsize]
+    paddd           m5, m2
+    pmaddwd         m1, [r5]
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 14
+    punpckhwd       xm6, xm4, xm2
+    punpcklwd       xm4, xm2
+    vinserti128     m4, m4, xm6, 1
+    pmaddwd         m6, m4, [r5 + 1 * mmsize]
+    paddd           m7, m6
+    pmaddwd         m4, [r5]
+%ifidn %1,sp
+    paddd           m5, m8
+    paddd           m7, m8
+    psrad           m5, 12
+    psrad           m7, 12
+%else
+    psrad           m5, 6
+    psrad           m7, 6
+%endif
+    packssdw        m5, m7
+%ifidn %1,sp
+    packuswb        m0, m5
+    vextracti128    xm5, m0, 1
+    movd            [r2], xm0
+    pextrw          [r2 + 4], xm5, 0
+    pextrd          [r2 + r3], xm0, 1
+    pextrw          [r2 + r3 + 4], xm5, 2
+    pextrd          [r2 + r3 * 2], xm0, 2
+    pextrw          [r2 + r3 * 2 + 4], xm5, 4
+    pextrd          [r2 + r6], xm0, 3
+    pextrw          [r2 + r6 + 4], xm5, 6
+%else
+    movq            [r2], xm0
+    movhps          [r2 + r3], xm0
+    movq            [r2 + r3 * 2], xm5
+    movhps          [r2 + r6], xm5
+    vextracti128    xm0, m0, 1
+    vextracti128    xm7, m5, 1
+    movd            [r2 + 8], xm0
+    pextrd          [r2 + r3 + 8], xm0, 2
+    movd            [r2 + r3 * 2 + 8], xm7
+    pextrd          [r2 + r6 + 8], xm7, 2
+%endif
+    lea             r2, [r2 + r3 * 4]
+
+    movu            xm6, [r0 + r4]                  ; m6 = row 15
+    punpckhwd       xm5, xm2, xm6
+    punpcklwd       xm2, xm6
+    vinserti128     m2, m2, xm5, 1
+    pmaddwd         m5, m2, [r5 + 1 * mmsize]
+    paddd           m1, m5
+    pmaddwd         m2, [r5]
+    lea             r0, [r0 + r1 * 4]
+    movu            xm0, [r0]                       ; m0 = row 16
+    punpckhwd       xm5, xm6, xm0
+    punpcklwd       xm6, xm0
+    vinserti128     m6, m6, xm5, 1
+    pmaddwd         m5, m6, [r5 + 1 * mmsize]
+    paddd           m4, m5
+    pmaddwd         m6, [r5]
+%ifidn %1,sp
+    paddd           m1, m8
+    paddd           m4, m8
+    psrad           m1, 12
+    psrad           m4, 12
+%else
+    psrad           m1, 6
+    psrad           m4, 6
+%endif
+    packssdw        m1, m4
+
+    movu            xm5, [r0 + r1]                  ; m5 = row 17
+    punpckhwd       xm4, xm0, xm5
+    punpcklwd       xm0, xm5
+    vinserti128     m0, m0, xm4, 1
+    pmaddwd         m0, [r5 + 1 * mmsize]
+    paddd           m2, m0
+    movu            xm4, [r0 + r1 * 2]              ; m4 = row 18
+    punpckhwd       xm0, xm5, xm4
+    punpcklwd       xm5, xm4
+    vinserti128     m5, m5, xm0, 1
+    pmaddwd         m5, [r5 + 1 * mmsize]
+    paddd           m6, m5
+%ifidn %1,sp
+    paddd           m2, m8
+    paddd           m6, m8
+    psrad           m2, 12
+    psrad           m6, 12
+%else
+    psrad           m2, 6
+    psrad           m6, 6
+%endif
+    packssdw        m2, m6
+%ifidn %1,sp
+    packuswb        m1, m2
+    vextracti128    xm2, m1, 1
+    movd            [r2], xm1
+    pextrw          [r2 + 4], xm2, 0
+    pextrd          [r2 + r3], xm1, 1
+    pextrw          [r2 + r3 + 4], xm2, 2
+    pextrd          [r2 + r3 * 2], xm1, 2
+    pextrw          [r2 + r3 * 2 + 4], xm2, 4
+    pextrd          [r2 + r6], xm1, 3
+    pextrw          [r2 + r6 + 4], xm2, 6
+%else
+    movq            [r2], xm1
+    movhps          [r2 + r3], xm1
+    movq            [r2 + r3 * 2], xm2
+    movhps          [r2 + r6], xm2
+    vextracti128    xm4, m1, 1
+    vextracti128    xm6, m2, 1
+    movd            [r2 + 8], xm4
+    pextrd          [r2 + r3 + 8], xm4, 2
+    movd            [r2 + r3 * 2 + 8], xm6
+    pextrd          [r2 + r6 + 8], xm6, 2
+%endif
+    RET
+%endif
+%endmacro
+
+    FILTER_VER_CHROMA_S_AVX2_6x16 sp
+    FILTER_VER_CHROMA_S_AVX2_6x16 ss
 
 ;---------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -16031,10 +20048,10 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SS_W2_4R 2, 4
-FILTER_VER_CHROMA_SS_W2_4R 2, 8
+    FILTER_VER_CHROMA_SS_W2_4R 2, 4
+    FILTER_VER_CHROMA_SS_W2_4R 2, 8
 
-FILTER_VER_CHROMA_SS_W2_4R 2, 16
+    FILTER_VER_CHROMA_SS_W2_4R 2, 16
 
 ;---------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
@@ -16147,9 +20164,9 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SS_W6_H4 6, 8
+    FILTER_VER_CHROMA_SS_W6_H4 6, 8
 
-FILTER_VER_CHROMA_SS_W6_H4 6, 16
+    FILTER_VER_CHROMA_SS_W6_H4 6, 16
 
 
 ;----------------------------------------------------------------------------------------------------------------
@@ -16194,15 +20211,15 @@
     RET
 %endmacro
 
-FILTER_VER_CHROMA_SS_W8_H2 8, 2
-FILTER_VER_CHROMA_SS_W8_H2 8, 4
-FILTER_VER_CHROMA_SS_W8_H2 8, 6
-FILTER_VER_CHROMA_SS_W8_H2 8, 8
-FILTER_VER_CHROMA_SS_W8_H2 8, 16
-FILTER_VER_CHROMA_SS_W8_H2 8, 32
+    FILTER_VER_CHROMA_SS_W8_H2 8, 2
+    FILTER_VER_CHROMA_SS_W8_H2 8, 4
+    FILTER_VER_CHROMA_SS_W8_H2 8, 6
+    FILTER_VER_CHROMA_SS_W8_H2 8, 8
+    FILTER_VER_CHROMA_SS_W8_H2 8, 16
+    FILTER_VER_CHROMA_SS_W8_H2 8, 32
 
-FILTER_VER_CHROMA_SS_W8_H2 8, 12
-FILTER_VER_CHROMA_SS_W8_H2 8, 64
+    FILTER_VER_CHROMA_SS_W8_H2 8, 12
+    FILTER_VER_CHROMA_SS_W8_H2 8, 64
 
 ;-----------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
9215
@@ -16442,8 +20459,8 @@
9216
     RET
9217
 %endmacro
9218
 
9219
-FILTER_VER_LUMA_AVX2_4x4 sp
9220
-FILTER_VER_LUMA_AVX2_4x4 ss
9221
+    FILTER_VER_LUMA_AVX2_4x4 sp
9222
+    FILTER_VER_LUMA_AVX2_4x4 ss
9223
 
9224
 %macro FILTER_VER_LUMA_AVX2_4x8 1
9225
 INIT_YMM avx2
9226
@@ -16588,8 +20605,8 @@
9227
     RET
9228
 %endmacro
9229
 
9230
-FILTER_VER_LUMA_AVX2_4x8 sp
9231
-FILTER_VER_LUMA_AVX2_4x8 ss
9232
+    FILTER_VER_LUMA_AVX2_4x8 sp
9233
+    FILTER_VER_LUMA_AVX2_4x8 ss
9234
 
9235
 %macro PROCESS_LUMA_AVX2_W4_16R 1
9236
     movq            xm0, [r0]
9237
@@ -16833,8 +20850,8 @@
9238
     RET
9239
 %endmacro
9240
 
9241
-FILTER_VER_LUMA_AVX2_4x16 sp
9242
-FILTER_VER_LUMA_AVX2_4x16 ss
9243
+    FILTER_VER_LUMA_AVX2_4x16 sp
9244
+    FILTER_VER_LUMA_AVX2_4x16 ss
9245
 
9246
 %macro FILTER_VER_LUMA_S_AVX2_8x8 1
9247
 INIT_YMM avx2
9248
@@ -17056,8 +21073,8 @@
9249
 %endif
9250
 %endmacro
9251
 
9252
-FILTER_VER_LUMA_S_AVX2_8x8 sp
9253
-FILTER_VER_LUMA_S_AVX2_8x8 ss
9254
+    FILTER_VER_LUMA_S_AVX2_8x8 sp
9255
+    FILTER_VER_LUMA_S_AVX2_8x8 ss
9256
 
9257
 %macro FILTER_VER_LUMA_S_AVX2_8xN 2
9258
 INIT_YMM avx2
9259
@@ -17446,10 +21463,10 @@
9260
 %endif
9261
 %endmacro
9262
 
9263
-FILTER_VER_LUMA_S_AVX2_8xN sp, 16
9264
-FILTER_VER_LUMA_S_AVX2_8xN sp, 32
9265
-FILTER_VER_LUMA_S_AVX2_8xN ss, 16
9266
-FILTER_VER_LUMA_S_AVX2_8xN ss, 32
9267
+    FILTER_VER_LUMA_S_AVX2_8xN sp, 16
9268
+    FILTER_VER_LUMA_S_AVX2_8xN sp, 32
9269
+    FILTER_VER_LUMA_S_AVX2_8xN ss, 16
9270
+    FILTER_VER_LUMA_S_AVX2_8xN ss, 32
9271
 
9272
 %macro PROCESS_LUMA_S_AVX2_W8_4R 1
9273
     movu            xm0, [r0]                       ; m0 = row 0
9274
@@ -17592,8 +21609,8 @@
9275
     RET
9276
 %endmacro
9277
 
9278
-FILTER_VER_LUMA_S_AVX2_8x4 sp
9279
-FILTER_VER_LUMA_S_AVX2_8x4 ss
9280
+    FILTER_VER_LUMA_S_AVX2_8x4 sp
9281
+    FILTER_VER_LUMA_S_AVX2_8x4 ss
9282
 
9283
 %macro PROCESS_LUMA_AVX2_W8_16R 1
9284
     movu            xm0, [r0]                       ; m0 = row 0
9285
@@ -17988,12 +22005,12 @@
9286
 %endif
9287
 %endmacro
9288
 
9289
-FILTER_VER_LUMA_AVX2_Nx16 sp, 16
9290
-FILTER_VER_LUMA_AVX2_Nx16 sp, 32
9291
-FILTER_VER_LUMA_AVX2_Nx16 sp, 64
9292
-FILTER_VER_LUMA_AVX2_Nx16 ss, 16
9293
-FILTER_VER_LUMA_AVX2_Nx16 ss, 32
9294
-FILTER_VER_LUMA_AVX2_Nx16 ss, 64
9295
+    FILTER_VER_LUMA_AVX2_Nx16 sp, 16
9296
+    FILTER_VER_LUMA_AVX2_Nx16 sp, 32
9297
+    FILTER_VER_LUMA_AVX2_Nx16 sp, 64
9298
+    FILTER_VER_LUMA_AVX2_Nx16 ss, 16
9299
+    FILTER_VER_LUMA_AVX2_Nx16 ss, 32
9300
+    FILTER_VER_LUMA_AVX2_Nx16 ss, 64
9301
 
9302
 %macro FILTER_VER_LUMA_AVX2_NxN 3
9303
 INIT_YMM avx2
9304
@@ -18047,24 +22064,24 @@
9305
 %endif
9306
 %endmacro
9307
 
9308
-FILTER_VER_LUMA_AVX2_NxN 16, 32, sp
9309
-FILTER_VER_LUMA_AVX2_NxN 16, 64, sp
9310
-FILTER_VER_LUMA_AVX2_NxN 24, 32, sp
9311
-FILTER_VER_LUMA_AVX2_NxN 32, 32, sp
9312
-FILTER_VER_LUMA_AVX2_NxN 32, 64, sp
9313
-FILTER_VER_LUMA_AVX2_NxN 48, 64, sp
9314
-FILTER_VER_LUMA_AVX2_NxN 64, 32, sp
9315
-FILTER_VER_LUMA_AVX2_NxN 64, 48, sp
9316
-FILTER_VER_LUMA_AVX2_NxN 64, 64, sp
-FILTER_VER_LUMA_AVX2_NxN 16, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 16, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 24, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 32, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 32, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 48, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 48, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 64, ss
+    FILTER_VER_LUMA_AVX2_NxN 16, 32, sp
+    FILTER_VER_LUMA_AVX2_NxN 16, 64, sp
+    FILTER_VER_LUMA_AVX2_NxN 24, 32, sp
+    FILTER_VER_LUMA_AVX2_NxN 32, 32, sp
+    FILTER_VER_LUMA_AVX2_NxN 32, 64, sp
+    FILTER_VER_LUMA_AVX2_NxN 48, 64, sp
+    FILTER_VER_LUMA_AVX2_NxN 64, 32, sp
+    FILTER_VER_LUMA_AVX2_NxN 64, 48, sp
+    FILTER_VER_LUMA_AVX2_NxN 64, 64, sp
+    FILTER_VER_LUMA_AVX2_NxN 16, 32, ss
+    FILTER_VER_LUMA_AVX2_NxN 16, 64, ss
+    FILTER_VER_LUMA_AVX2_NxN 24, 32, ss
+    FILTER_VER_LUMA_AVX2_NxN 32, 32, ss
+    FILTER_VER_LUMA_AVX2_NxN 32, 64, ss
+    FILTER_VER_LUMA_AVX2_NxN 48, 64, ss
+    FILTER_VER_LUMA_AVX2_NxN 64, 32, ss
+    FILTER_VER_LUMA_AVX2_NxN 64, 48, ss
+    FILTER_VER_LUMA_AVX2_NxN 64, 64, ss
 
 %macro FILTER_VER_LUMA_S_AVX2_12x16 1
 INIT_YMM avx2
@@ -18102,8 +22119,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_S_AVX2_12x16 sp
-FILTER_VER_LUMA_S_AVX2_12x16 ss
+    FILTER_VER_LUMA_S_AVX2_12x16 sp
+    FILTER_VER_LUMA_S_AVX2_12x16 ss
 
 %macro FILTER_VER_LUMA_S_AVX2_16x12 1
 INIT_YMM avx2
@@ -18416,8 +22433,8 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_S_AVX2_16x12 sp
-FILTER_VER_LUMA_S_AVX2_16x12 ss
+    FILTER_VER_LUMA_S_AVX2_16x12 sp
+    FILTER_VER_LUMA_S_AVX2_16x12 ss
 
 %macro FILTER_VER_LUMA_S_AVX2_16x4 1
 INIT_YMM avx2
@@ -18464,8 +22481,8 @@
     RET
 %endmacro
 
-FILTER_VER_LUMA_S_AVX2_16x4 sp
-FILTER_VER_LUMA_S_AVX2_16x4 ss
+    FILTER_VER_LUMA_S_AVX2_16x4 sp
+    FILTER_VER_LUMA_S_AVX2_16x4 ss
 
 %macro PROCESS_LUMA_S_AVX2_W8_8R 1
     movu            xm0, [r0]                       ; m0 = row 0
@@ -18701,10 +22718,10 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_AVX2_Nx8 sp, 32
-FILTER_VER_LUMA_AVX2_Nx8 sp, 16
-FILTER_VER_LUMA_AVX2_Nx8 ss, 32
-FILTER_VER_LUMA_AVX2_Nx8 ss, 16
+    FILTER_VER_LUMA_AVX2_Nx8 sp, 32
+    FILTER_VER_LUMA_AVX2_Nx8 sp, 16
+    FILTER_VER_LUMA_AVX2_Nx8 ss, 32
+    FILTER_VER_LUMA_AVX2_Nx8 ss, 16
 
 %macro FILTER_VER_LUMA_S_AVX2_32x24 1
 INIT_YMM avx2
@@ -18764,13 +22781,13 @@
 %endif
 %endmacro
 
-FILTER_VER_LUMA_S_AVX2_32x24 sp
-FILTER_VER_LUMA_S_AVX2_32x24 ss
+    FILTER_VER_LUMA_S_AVX2_32x24 sp
+    FILTER_VER_LUMA_S_AVX2_32x24 ss
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_32x32, 4,7,6
     mov             r4d, r4m
    mov             r5d, r5m
@@ -18832,12 +22849,12 @@
     add                r0,           r1
     dec               r6d
     jnz                .loop
-   RET
+    RET
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_16x16, 4,7,6
     mov             r4d, r4m
    mov             r5d, r5m
@@ -18885,13 +22902,13 @@
     add                r0,          r1
     dec                r6d
     jnz                .loop
-   RET
+    RET
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------
 %macro IPFILTER_CHROMA_PS_16xN_AVX2 2
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6
     mov                    r4d,        r4m
    mov                    r5d,        r5m
@@ -18947,12 +22964,14 @@
     IPFILTER_CHROMA_PS_16xN_AVX2  16 , 12
     IPFILTER_CHROMA_PS_16xN_AVX2  16 , 8
     IPFILTER_CHROMA_PS_16xN_AVX2  16 , 4
+    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 24
+    IPFILTER_CHROMA_PS_16xN_AVX2  16 , 64
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------
 %macro IPFILTER_CHROMA_PS_32xN_AVX2 2
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6
     mov                r4d,          r4m
    mov                r5d,          r5m
@@ -19019,13 +23038,15 @@
     RET
 %endmacro
 
-IPFILTER_CHROMA_PS_32xN_AVX2  32 , 16
-IPFILTER_CHROMA_PS_32xN_AVX2  32 , 24
-IPFILTER_CHROMA_PS_32xN_AVX2  32 , 8
+    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 16
+    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 24
+    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 8
+    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 64
+    IPFILTER_CHROMA_PS_32xN_AVX2  32 , 48
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_4x4, 4,7,5
     mov             r4d, r4m
    mov             r5d, r5m
@@ -19104,7 +23125,7 @@
     lea               r2,           [r2 + r3 * 2]
     movhps            [r2],         xm3
 .end
-   RET
+    RET
 
 cglobal interp_4tap_horiz_ps_4x2, 4,7,5
     mov             r4d, r4m
@@ -19173,13 +23194,13 @@
     lea               r2,           [r2 + r3 * 2]
     movhps            [r2],         xm3
 .end
-   RET
+    RET
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------;
 %macro IPFILTER_CHROMA_PS_4xN_AVX2 2
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_%1x%2, 4,7,5
     mov             r4d, r4m
    mov             r5d, r5m
@@ -19264,7 +23285,7 @@
     lea               r2,           [r2 + r3 * 2]
     movhps            [r2],         xm3
 .end
-RET
+    RET
 %endmacro
 
     IPFILTER_CHROMA_PS_4xN_AVX2  4 , 8
@@ -19272,7 +23293,7 @@
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_8x8, 4,7,6
     mov             r4d, r4m
     mov             r5d, r5m
@@ -19341,9 +23362,9 @@
     vpermq            m3,           m3,          11011000b
     movu             [r2],         xm3
 .end
-   RET
+    RET
 
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_pp_4x2, 4,6,4
     mov             r4d, r4m
 %ifdef PIC
@@ -19436,9 +23457,11 @@
     RET
 %endmacro
 
-IPFILTER_CHROMA_PP_32xN_AVX2 32, 16
-IPFILTER_CHROMA_PP_32xN_AVX2 32, 24
-IPFILTER_CHROMA_PP_32xN_AVX2 32, 8
+    IPFILTER_CHROMA_PP_32xN_AVX2 32, 16
+    IPFILTER_CHROMA_PP_32xN_AVX2 32, 24
+    IPFILTER_CHROMA_PP_32xN_AVX2 32, 8
+    IPFILTER_CHROMA_PP_32xN_AVX2 32, 64
+    IPFILTER_CHROMA_PP_32xN_AVX2 32, 48
 
 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
@@ -19512,15 +23535,17 @@
     RET
 %endmacro
 
-IPFILTER_CHROMA_PP_8xN_AVX2   8 , 16
-IPFILTER_CHROMA_PP_8xN_AVX2   8 , 32
-IPFILTER_CHROMA_PP_8xN_AVX2   8 , 4
+    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 16
+    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 32
+    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 4
+    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 64
+    IPFILTER_CHROMA_PP_8xN_AVX2   8 , 12
 
 ;-------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
 ;-------------------------------------------------------------------------------------------------------------
 %macro IPFILTER_CHROMA_PP_4xN_AVX2 2
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6
     mov             r4d, r4m
 
@@ -19576,8 +23601,8 @@
     RET
 %endmacro
 
-IPFILTER_CHROMA_PP_4xN_AVX2  4 , 8
-IPFILTER_CHROMA_PP_4xN_AVX2  4 , 16
+    IPFILTER_CHROMA_PP_4xN_AVX2  4 , 8
+    IPFILTER_CHROMA_PP_4xN_AVX2  4 , 16
 
 %macro IPFILTER_LUMA_PS_32xN_AVX2 2
 INIT_YMM avx2
@@ -19674,11 +23699,11 @@
     RET
 %endmacro
 
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 32
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 16
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 24
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 8
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 64
+    IPFILTER_LUMA_PS_32xN_AVX2 32 , 32
+    IPFILTER_LUMA_PS_32xN_AVX2 32 , 16
+    IPFILTER_LUMA_PS_32xN_AVX2 32 , 24
+    IPFILTER_LUMA_PS_32xN_AVX2 32 , 8
+    IPFILTER_LUMA_PS_32xN_AVX2 32 , 64
 
 INIT_YMM avx2
 cglobal interp_8tap_horiz_ps_48x64, 4, 7, 8
@@ -20003,10 +24028,12 @@
     RET
 %endmacro
 
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64
+    IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24
 
 %macro IPFILTER_LUMA_PS_64xN_AVX2 1
 INIT_YMM avx2
@@ -20144,16 +24171,16 @@
     RET
 %endmacro
 
-IPFILTER_LUMA_PS_64xN_AVX2 64
-IPFILTER_LUMA_PS_64xN_AVX2 48
-IPFILTER_LUMA_PS_64xN_AVX2 32
-IPFILTER_LUMA_PS_64xN_AVX2 16
+    IPFILTER_LUMA_PS_64xN_AVX2 64
+    IPFILTER_LUMA_PS_64xN_AVX2 48
+    IPFILTER_LUMA_PS_64xN_AVX2 32
+    IPFILTER_LUMA_PS_64xN_AVX2 16
 
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------
 %macro IPFILTER_CHROMA_PS_8xN_AVX2 1
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_8x%1, 4,7,6
     mov                r4d,             r4m
    mov                r5d,             r5m
@@ -20218,7 +24245,7 @@
     vpermq             m3,              m3,          11011000b
     movu               [r2],            xm3
 .end
-   RET
+    RET
 %endmacro
 
     IPFILTER_CHROMA_PS_8xN_AVX2  2
@@ -20226,6 +24253,8 @@
     IPFILTER_CHROMA_PS_8xN_AVX2  16
     IPFILTER_CHROMA_PS_8xN_AVX2  6
     IPFILTER_CHROMA_PS_8xN_AVX2  4
+    IPFILTER_CHROMA_PS_8xN_AVX2  12
+    IPFILTER_CHROMA_PS_8xN_AVX2  64
 
 INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_2x4, 4, 7, 3
@@ -20253,7 +24282,7 @@
     movhps             xm2,            [r0 + r6]
 
     vinserti128        m1,             m1,          xm2,          1
-    pshufb             m1,             [interp4_hps_shuf]
+    pshufb             m1,             [interp4_hpp_shuf]
     pmaddubsw          m1,             m0
     pmaddwd            m1,             [pw_1]
     vextracti128       xm2,            m1,          1
@@ -20275,7 +24304,7 @@
     movhps             xm1,            [r0 + r1]
     movq               xm2,            [r0 + r1 * 2]
     vinserti128        m1,             m1,          xm2,          1
-    pshufb             m1,             [interp4_hps_shuf]
+    pshufb             m1,             [interp4_hpp_shuf]
     pmaddubsw          m1,             m0
     pmaddwd            m1,             [pw_1]
     vextracti128       xm2,            m1,          1
@@ -20306,7 +24335,7 @@
     sub               r0,             r1
 
 .label
-    mova              m4,            [interp4_hps_shuf]
+    mova              m4,            [interp4_hpp_shuf]
     mova              m5,            [pw_1]
     dec               r0
     lea               r4,            [r1 * 3]
@@ -20488,7 +24517,7 @@
 ;-----------------------------------------------------------------------------------------------------------------------------
 ; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
 ;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2 
+INIT_YMM avx2
 cglobal interp_4tap_horiz_ps_6x8, 4,7,6
     mov                r4d,            r4m
    mov                r5d,            r5m
@@ -20556,3 +24585,1024 @@
     movd              [r2+8],          xm4
 .end
     RET
+
+INIT_YMM avx2
+cglobal interp_8tap_horiz_ps_12x16, 6, 7, 8
+    mov                         r5d,               r5m
+    mov                         r4d,               r4m
+%ifdef PIC
+    lea                         r6,                [tab_LumaCoeff]
+    vpbroadcastq                m0,                [r6 + r4 * 8]
+%else
+    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]
+%endif
+    mova                        m6,                [tab_Lm + 32]
+    mova                        m1,                [tab_Lm]
+    add                         r3d,               r3d
+    vbroadcasti128              m2,                [pw_2000]
+    mov                         r4d,                16
+    vbroadcasti128              m7,                [pw_1]
+    ; register map
+    ; m0 - interpolate coeff
+    ; m1 - shuffle order table
+    ; m2 - pw_2000
+
+    mova                        m5,                [interp8_hps_shuf]
+    sub                         r0,                3
+    test                        r5d,               r5d
+    jz                          .loop
+    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride
+    sub                         r0,                r6                           ; r0(src)-r6
+    add                         r4d,                7
+.loop
+
+    ; Row 0
+
+    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m4,                m3,        m6
+    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm
+    pmaddubsw                   m3,                m0
+    pmaddubsw                   m4,                m0
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+
+    vbroadcasti128              m4,                [r0 + 8]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m4,                m1
+    pmaddubsw                   m4,                m0
+    pmaddwd                     m4,                m7
+    packssdw                    m4,                m4
+
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+
+    vpermd                      m3,                m5,               m3
+    psubw                       m3,                m2
+
+    vextracti128                xm4,               m3,               1
+    movu                        [r2],              xm3                          ;row 0
+    movq                        [r2 + 16],         xm4                          ;row 1
+
+    add                         r0,                r1
+    add                         r2,                r3
+    dec                         r4d
+    jnz                         .loop
+    RET
+
+INIT_YMM avx2
+cglobal interp_8tap_horiz_ps_24x32, 4, 7, 8
+    mov                         r5d,               r5m
+    mov                         r4d,               r4m
+%ifdef PIC
+    lea                         r6,                [tab_LumaCoeff]
+    vpbroadcastq                m0,                [r6 + r4 * 8]
+%else
+    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]
+%endif
+    mova                        m6,                [tab_Lm + 32]
+    mova                        m1,                [tab_Lm]
+    mov                         r4d,               32                           ;height
+    add                         r3d,               r3d
+    vbroadcasti128              m2,                [pw_2000]
+    vbroadcasti128              m7,                [pw_1]
+
+    ; register map
+    ; m0      - interpolate coeff
+    ; m1 , m6 - shuffle order table
+    ; m2      - pw_2000
+
+    sub                         r0,                3
+    test                        r5d,               r5d
+    jz                          .label
+    lea                         r6,                [r1 * 3]                     ; r6 = (N / 2 - 1) * srcStride
+    sub                         r0,                r6                           ; r0(src)-r6
+    add                         r4d,               7                            ; blkheight += N - 1  (7 - 1 = 6 ; since the last one row not in loop)
+
+.label
+    lea                         r6,                [interp8_hps_shuf]
+.loop
+    ; Row 0
+    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)
+    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)
+    pmaddubsw                   m3,                m0
+    pmaddubsw                   m4,                m0
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+
+    vbroadcasti128              m4,                [r0 + 8]                     ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m5,                m4,            m6            ;row 1 (col 4 to 7)
+    pshufb                      m4,                m1                           ;row 1 (col 0 to 3)
+    pmaddubsw                   m4,                m0
+    pmaddubsw                   m5,                m0
+    pmaddwd                     m4,                m7
+    pmaddwd                     m5,                m7
+    packssdw                    m4,                m5
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+    mova                        m5,                [r6]
+    vpermd                      m3,                m5,               m3
+    psubw                       m3,                m2
+    movu                        [r2],              m3                          ;row 0
+
+    vbroadcasti128              m3,                [r0 + 16]
+    pshufb                      m4,                m3,          m6
+    pshufb                      m3,                m1
+    pmaddubsw                   m3,                m0
+    pmaddubsw                   m4,                m0
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+    pmaddwd                     m3,                m7
+    pmaddwd                     m4,                m7
+    packssdw                    m3,                m4
+    mova                        m4,                [r6]
+    vpermd                      m3,                m4,            m3
+    psubw                       m3,                m2
+    movu                        [r2 + 32],         xm3                          ;row 0
+
+    add                         r0,                r1
+    add                         r2,                r3
+    dec                         r4d
+    jnz                         .loop
+    RET
+
+;-----------------------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_24x32, 4,7,6
+    mov                r4d,            r4m
+    mov                r5d,            r5m
+    add                r3d,            r3d
+%ifdef PIC
+    lea                r6,             [tab_ChromaCoeff]
+    vpbroadcastd       m0,             [r6 + r4 * 4]
+%else
+    vpbroadcastd       m0,             [tab_ChromaCoeff + r4 * 4]
+%endif
+    vbroadcasti128     m2,             [pw_1]
+    vbroadcasti128     m5,             [pw_2000]
+    mova               m1,             [tab_Tm]
+
+    ; register map
+    ; m0 - interpolate coeff
+    ; m1 - shuffle order table
+    ; m2 - constant word 1
+    mov                r6d,            32
+    dec                r0
+    test               r5d,            r5d
+    je                 .loop
+    sub                r0 ,            r1
+    add                r6d ,           3
+
+.loop
+    ; Row 0
+    vbroadcasti128     m3,             [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb             m3,             m1
+    pmaddubsw          m3,             m0
+    pmaddwd            m3,             m2
+    vbroadcasti128     m4,             [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb             m4,             m1
+    pmaddubsw          m4,             m0
+    pmaddwd            m4,             m2
+    packssdw           m3,             m4
+    psubw              m3,             m5
+    vpermq             m3,             m3,          11011000b
+    movu               [r2],           m3
+
+    vbroadcasti128     m3,             [r0 + 16]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb             m3,             m1
+    pmaddubsw          m3,             m0
+    pmaddwd            m3,             m2
+    packssdw           m3,             m3
+    psubw              m3,             m5
+    vpermq             m3,             m3,          11011000b
+    movu               [r2 + 32],      xm3
+
+    add                r2,             r3
+    add                r0,             r1
+    dec                r6d
+    jnz                .loop
+    RET
+
+;-----------------------------------------------------------------------------------------------------------------------
+;macro FILTER_H8_W8_16N_AVX2
+;-----------------------------------------------------------------------------------------------------------------------
+%macro  FILTER_H8_W8_16N_AVX2 0
+    vbroadcasti128              m3,                [r0]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m4,                m3,             m6           ; row 0 (col 4 to 7)
+    pshufb                      m3,                m1                           ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)
+    pmaddubsw                   m3,                m0
+    pmaddubsw                   m4,                m0
+    pmaddwd                     m3,                m2
+    pmaddwd                     m4,                m2
+    packssdw                    m3,                m4                         ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]
+
+    vbroadcasti128              m4,                [r0 + 8]                         ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb                      m5,                m4,            m6            ;row 1 (col 4 to 7)
+    pshufb                      m4,                m1                           ;row 1 (col 0 to 3)
+    pmaddubsw                   m4,                m0
+    pmaddubsw                   m5,                m0
+    pmaddwd                     m4,                m2
+    pmaddwd                     m5,                m2
+    packssdw                    m4,                m5                         ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
+
+    pmaddwd                     m3,                m2
+    pmaddwd                     m4,                m2
+    packssdw                    m3,                m4                         ; all rows and col completed.
+
+    mova                        m5,                [interp8_hps_shuf]
+    vpermd                      m3,                m5,               m3
+    psubw                       m3,                m8
+
+    vextracti128                xm4,               m3,               1
9931
+    mova                        [r4],              xm3
9932
+    mova                        [r4 + 16],         xm4
9933
+    %endmacro
9934
+
9935
+;-----------------------------------------------------------------------------
9936
+; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
9937
+;-----------------------------------------------------------------------------
9938
+INIT_YMM avx2
9939
+%if ARCH_X86_64 == 1
9940
+cglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32
9941
+%define stk_buf1    rsp
9942
+    mov                         r4d,               r4m
9943
+    mov                         r5d,               r5m
9944
+%ifdef PIC
9945
+    lea                         r6,                [tab_LumaCoeff]
9946
+    vpbroadcastq                m0,                [r6 + r4 * 8]
9947
+%else
9948
+    vpbroadcastq                m0,                [tab_LumaCoeff + r4 * 8]
9949
+%endif
9950
+
9951
+    xor                         r6,                 r6
9952
+    mov                         r4,                 rsp
9953
+    mova                        m6,                [tab_Lm + 32]
9954
+    mova                        m1,                [tab_Lm]
9955
+    mov                         r8,                16                           ;height
9956
+    vbroadcasti128              m8,                [pw_2000]
9957
+    vbroadcasti128              m2,                [pw_1]
9958
+    sub                         r0,                3
9959
+    lea                         r7,                [r1 * 3]                     ; r7 = (N / 2 - 1) * srcStride
9960
+    sub                         r0,                r7                           ; r0(src)-r7
9961
+    add                         r8,                7
9962
+
9963
+.loopH:
9964
+    FILTER_H8_W8_16N_AVX2
9965
+    add                         r0,                r1
9966
+    add                         r4,                32
9967
+    inc                         r6
9968
+    cmp                         r6,                16+7
9969
+    jnz                        .loopH
9970
+
9971
+; vertical phase
9972
+    xor                         r6,                r6
9973
+    xor                         r1,                r1
9974
+.loopV:
9975
+
9976
+;load necessary variables
9977
+    mov                         r4d,               r5d          ;coeff here for vertical is r5m
9978
+    shl                         r4d,               7
9979
+    mov                         r1d,               16
9980
+    add                         r1d,               r1d
9981
+
9982
+ ; load intermedia buffer
9983
+    mov                         r0,                stk_buf1
9984
+
9985
+    ; register mapping
9986
+    ; r0 - src
9987
+    ; r5 - coeff
9988
+    ; r6 - loop_i
9989
+
9990
+; load coeff table
9991
+%ifdef PIC
9992
+    lea                          r5,                [pw_LumaCoeffVer]
9993
+    add                          r5,                r4
9994
+%else
9995
+    lea                          r5,                [pw_LumaCoeffVer + r4]
9996
+%endif
9997
+
9998
+    lea                          r4,                [r1*3]
9999
+    mova                         m14,               [pd_526336]
10000
+    lea                          r6,                [r3 * 3]
10001
+    mov                          r9d,               16 / 8
10002
+
10003
+.loopW:
10004
+    PROCESS_LUMA_AVX2_W8_16R sp
10005
+    add                          r2,                 8
10006
+    add                          r0,                 16
10007
+    dec                          r9d
10008
+    jnz                          .loopW
10009
+    RET
10010
+%endif
10011
+
10012
+INIT_YMM avx2
10013
+cglobal interp_4tap_horiz_pp_12x32, 4, 6, 7
10014
+    mov               r4d,          r4m
10015
+
10016
+%ifdef PIC
10017
+    lea               r5,           [tab_ChromaCoeff]
10018
+    vpbroadcastd      m0,           [r5 + r4 * 4]
10019
+%else
10020
+    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]
10021
+%endif
10022
+
10023
+    mova              m6,           [pw_512]
10024
+    mova              m1,           [interp4_horiz_shuf1]
10025
+    vpbroadcastd      m2,           [pw_1]
10026
+
10027
+    ; register map
10028
+    ; m0 - interpolate coeff
10029
+    ; m1 - shuffle order table
10030
+    ; m2 - constant word 1
10031
+
10032
+    dec               r0
10033
+    mov               r4d,          16
10034
+
10035
+.loop:
10036
+    ; Row 0
10037
+    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10038
+    pshufb            m3,           m1
10039
+    pmaddubsw         m3,           m0
10040
+    pmaddwd           m3,           m2
10041
+    vbroadcasti128    m4,           [r0 + 4]                    ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10042
+    pshufb            m4,           m1
10043
+    pmaddubsw         m4,           m0
10044
+    pmaddwd           m4,           m2
10045
+    packssdw          m3,           m4
10046
+    pmulhrsw          m3,           m6
10047
+
10048
+    ; Row 1
10049
+    vbroadcasti128    m4,           [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10050
+    pshufb            m4,           m1
10051
+    pmaddubsw         m4,           m0
10052
+    pmaddwd           m4,           m2
10053
+    vbroadcasti128    m5,           [r0 + r1 + 4]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10054
+    pshufb            m5,           m1
10055
+    pmaddubsw         m5,           m0
10056
+    pmaddwd           m5,           m2
10057
+    packssdw          m4,           m5
10058
+    pmulhrsw          m4,           m6
10059
+
10060
+    packuswb          m3,           m4
10061
+    vpermq            m3,           m3,      11011000b
10062
+
10063
+    vextracti128      xm4,          m3,       1
10064
+    movq              [r2],         xm3
10065
+    pextrd            [r2+8],       xm3,      2
10066
+    movq              [r2 + r3],    xm4
10067
+    pextrd            [r2 + r3 + 8],xm4,      2
10068
+    lea               r2,           [r2 + r3 * 2]
10069
+    lea               r0,           [r0 + r1 * 2]
10070
+    dec               r4d
10071
+    jnz               .loop
10072
+    RET
10073
+
10074
+INIT_YMM avx2
10075
+cglobal interp_4tap_horiz_pp_24x64, 4,6,7
10076
+    mov              r4d,           r4m
10077
+
10078
+%ifdef PIC
10079
+    lea               r5,           [tab_ChromaCoeff]
10080
+    vpbroadcastd      m0,           [r5 + r4 * 4]
10081
+%else
10082
+    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]
10083
+%endif
10084
+
10085
+    mova              m1,           [interp4_horiz_shuf1]
10086
+    vpbroadcastd      m2,           [pw_1]
10087
+    mova              m6,           [pw_512]
10088
+    ; register map
10089
+    ; m0 - interpolate coeff
10090
+    ; m1 - shuffle order table
10091
+    ; m2 - constant word 1
10092
+
10093
+    dec               r0
10094
+    mov               r4d,          64
10095
+
10096
+.loop:
10097
+    ; Row 0
10098
+    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10099
+    pshufb            m3,           m1
10100
+    pmaddubsw         m3,           m0
10101
+    pmaddwd           m3,           m2
10102
+    vbroadcasti128    m4,           [r0 + 4]
10103
+    pshufb            m4,           m1
10104
+    pmaddubsw         m4,           m0
10105
+    pmaddwd           m4,           m2
10106
+    packssdw          m3,           m4
10107
+    pmulhrsw          m3,           m6
10108
+
10109
+    vbroadcasti128    m4,           [r0 + 16]
10110
+    pshufb            m4,           m1
10111
+    pmaddubsw         m4,           m0
10112
+    pmaddwd           m4,           m2
10113
+    vbroadcasti128    m5,           [r0 + 20]
10114
+    pshufb            m5,           m1
10115
+    pmaddubsw         m5,           m0
10116
+    pmaddwd           m5,           m2
10117
+    packssdw          m4,           m5
10118
+    pmulhrsw          m4,           m6
10119
+
10120
+    packuswb          m3,           m4
10121
+    vpermq            m3,           m3,      11011000b
10122
+
10123
+    vextracti128      xm4,          m3,       1
10124
+    movu              [r2],         xm3
10125
+    movq              [r2 + 16],    xm4
10126
+    add               r2,           r3
10127
+    add               r0,           r1
10128
+    dec               r4d
10129
+    jnz               .loop
10130
+    RET
10131
+
10132
+
10133
+INIT_YMM avx2
10134
+cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6
10135
+    mov               r4d,           r4m
10136
+
10137
+%ifdef PIC
10138
+    lea               r5,            [tab_ChromaCoeff]
10139
+    vpbroadcastd      m0,            [r5 + r4 * 4]
10140
+%else
10141
+    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]
10142
+%endif
10143
+
10144
+    mova              m4,            [interp4_hpp_shuf]
10145
+    mova              m5,            [pw_1]
10146
+    dec               r0
10147
+    lea               r4,            [r1 * 3]
10148
+    movq              xm1,           [r0]
10149
+    movhps            xm1,           [r0 + r1]
10150
+    movq              xm2,           [r0 + r1 * 2]
10151
+    movhps            xm2,           [r0 + r4]
10152
+    vinserti128       m1,            m1,          xm2,          1
10153
+    lea               r0,            [r0 + r1 * 4]
10154
+    movq              xm3,           [r0]
10155
+    movhps            xm3,           [r0 + r1]
10156
+    movq              xm2,           [r0 + r1 * 2]
10157
+    movhps            xm2,           [r0 + r4]
10158
+    vinserti128       m3,            m3,          xm2,          1
10159
+
10160
+    pshufb            m1,            m4
10161
+    pshufb            m3,            m4
10162
+    pmaddubsw         m1,            m0
10163
+    pmaddubsw         m3,            m0
10164
+    pmaddwd           m1,            m5
10165
+    pmaddwd           m3,            m5
10166
+    packssdw          m1,            m3
10167
+    pmulhrsw          m1,            [pw_512]
10168
+    vextracti128      xm2,           m1,          1
10169
+    packuswb          xm1,           xm2
10170
+
10171
+    lea               r4,            [r3 * 3]
10172
+    pextrw            [r2],          xm1,         0
10173
+    pextrw            [r2 + r3],     xm1,         1
10174
+    pextrw            [r2 + r3 * 2], xm1,         4
10175
+    pextrw            [r2 + r4],     xm1,         5
10176
+    lea               r2,            [r2 + r3 * 4]
10177
+    pextrw            [r2],          xm1,         2
10178
+    pextrw            [r2 + r3],     xm1,         3
10179
+    pextrw            [r2 + r3 * 2], xm1,         6
10180
+    pextrw            [r2 + r4],     xm1,         7
10181
+    lea               r2,            [r2 + r3 * 4]
10182
+    lea               r0,            [r0 + r1 * 4]
10183
+
10184
+    lea               r4,            [r1 * 3]
10185
+    movq              xm1,           [r0]
10186
+    movhps            xm1,           [r0 + r1]
10187
+    movq              xm2,           [r0 + r1 * 2]
10188
+    movhps            xm2,           [r0 + r4]
10189
+    vinserti128       m1,            m1,          xm2,          1
10190
+    lea               r0,            [r0 + r1 * 4]
10191
+    movq              xm3,           [r0]
10192
+    movhps            xm3,           [r0 + r1]
10193
+    movq              xm2,           [r0 + r1 * 2]
10194
+    movhps            xm2,           [r0 + r4]
10195
+    vinserti128       m3,            m3,          xm2,          1
10196
+
10197
+    pshufb            m1,            m4
10198
+    pshufb            m3,            m4
10199
+    pmaddubsw         m1,            m0
10200
+    pmaddubsw         m3,            m0
10201
+    pmaddwd           m1,            m5
10202
+    pmaddwd           m3,            m5
10203
+    packssdw          m1,            m3
10204
+    pmulhrsw          m1,            [pw_512]
10205
+    vextracti128      xm2,           m1,          1
10206
+    packuswb          xm1,           xm2
10207
+
10208
+    lea               r4,            [r3 * 3]
10209
+    pextrw            [r2],          xm1,         0
10210
+    pextrw            [r2 + r3],     xm1,         1
10211
+    pextrw            [r2 + r3 * 2], xm1,         4
10212
+    pextrw            [r2 + r4],     xm1,         5
10213
+    lea               r2,            [r2 + r3 * 4]
10214
+    pextrw            [r2],          xm1,         2
10215
+    pextrw            [r2 + r3],     xm1,         3
10216
+    pextrw            [r2 + r3 * 2], xm1,         6
10217
+    pextrw            [r2 + r4],     xm1,         7
10218
+    RET
10219
+
10220
+;-------------------------------------------------------------------------------------------------------------
10221
+; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
10222
+;-------------------------------------------------------------------------------------------------------------
10223
+%macro IPFILTER_CHROMA_PP_64xN_AVX2 1
10224
+INIT_YMM avx2
10225
+cglobal interp_4tap_horiz_pp_64x%1, 4,6,7
10226
+    mov             r4d, r4m
10227
+
10228
+%ifdef PIC
10229
+    lea               r5,           [tab_ChromaCoeff]
10230
+    vpbroadcastd      m0,           [r5 + r4 * 4]
10231
+%else
10232
+    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]
10233
+%endif
10234
+
10235
+    mova              m1,           [interp4_horiz_shuf1]
10236
+    vpbroadcastd      m2,           [pw_1]
10237
+    mova              m6,           [pw_512]
10238
+    ; register map
10239
+    ; m0 - interpolate coeff
10240
+    ; m1 - shuffle order table
10241
+    ; m2 - constant word 1
10242
+
10243
+    dec               r0
10244
+    mov               r4d,          %1
10245
+
10246
+.loop:
10247
+    ; Row 0
10248
+    vbroadcasti128    m3,           [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10249
+    pshufb            m3,           m1
10250
+    pmaddubsw         m3,           m0
10251
+    pmaddwd           m3,           m2
10252
+    vbroadcasti128    m4,           [r0 + 4]
10253
+    pshufb            m4,           m1
10254
+    pmaddubsw         m4,           m0
10255
+    pmaddwd           m4,           m2
10256
+    packssdw          m3,           m4
10257
+    pmulhrsw          m3,           m6
10258
+
10259
+    vbroadcasti128    m4,           [r0 + 16]
10260
+    pshufb            m4,           m1
10261
+    pmaddubsw         m4,           m0
10262
+    pmaddwd           m4,           m2
10263
+    vbroadcasti128    m5,           [r0 + 20]
10264
+    pshufb            m5,           m1
10265
+    pmaddubsw         m5,           m0
10266
+    pmaddwd           m5,           m2
10267
+    packssdw          m4,           m5
10268
+    pmulhrsw          m4,           m6
10269
+    packuswb          m3,           m4
10270
+    vpermq            m3,           m3,      11011000b
10271
+    movu              [r2],         m3
10272
+
10273
+    vbroadcasti128    m3,           [r0 + 32]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10274
+    pshufb            m3,           m1
10275
+    pmaddubsw         m3,           m0
10276
+    pmaddwd           m3,           m2
10277
+    vbroadcasti128    m4,           [r0 + 36]
10278
+    pshufb            m4,           m1
10279
+    pmaddubsw         m4,           m0
10280
+    pmaddwd           m4,           m2
10281
+    packssdw          m3,           m4
10282
+    pmulhrsw          m3,           m6
10283
+
10284
+    vbroadcasti128    m4,           [r0 + 48]
10285
+    pshufb            m4,           m1
10286
+    pmaddubsw         m4,           m0
10287
+    pmaddwd           m4,           m2
10288
+    vbroadcasti128    m5,           [r0 + 52]
10289
+    pshufb            m5,           m1
10290
+    pmaddubsw         m5,           m0
10291
+    pmaddwd           m5,           m2
10292
+    packssdw          m4,           m5
10293
+    pmulhrsw          m4,           m6
10294
+    packuswb          m3,           m4
10295
+    vpermq            m3,           m3,      11011000b
10296
+    movu              [r2 + 32],         m3
10297
+
10298
+    add               r2,           r3
10299
+    add               r0,           r1
10300
+    dec               r4d
10301
+    jnz               .loop
10302
+    RET
10303
+%endmacro
10304
+
10305
+    IPFILTER_CHROMA_PP_64xN_AVX2  64
10306
+    IPFILTER_CHROMA_PP_64xN_AVX2  32
10307
+    IPFILTER_CHROMA_PP_64xN_AVX2  48
10308
+    IPFILTER_CHROMA_PP_64xN_AVX2  16
10309
+
10310
+;-------------------------------------------------------------------------------------------------------------
10311
+; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
10312
+;-------------------------------------------------------------------------------------------------------------
10313
+INIT_YMM avx2
10314
+cglobal interp_4tap_horiz_pp_48x64, 4,6,7
10315
+    mov             r4d, r4m
10316
+
10317
+%ifdef PIC
10318
+    lea               r5,            [tab_ChromaCoeff]
10319
+    vpbroadcastd      m0,            [r5 + r4 * 4]
10320
+%else
10321
+    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]
10322
+%endif
10323
+
10324
+    mova              m1,            [interp4_horiz_shuf1]
10325
+    vpbroadcastd      m2,            [pw_1]
10326
+    mova              m6,            [pw_512]
10327
+    ; register map
10328
+    ; m0 - interpolate coeff
10329
+    ; m1 - shuffle order table
10330
+    ; m2 - constant word 1
10331
+
10332
+    dec               r0
10333
+    mov               r4d,           64
10334
+
10335
+.loop:
10336
+    ; Row 0
10337
+    vbroadcasti128    m3,            [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10338
+    pshufb            m3,            m1
10339
+    pmaddubsw         m3,            m0
10340
+    pmaddwd           m3,            m2
10341
+    vbroadcasti128    m4,            [r0 + 4]
10342
+    pshufb            m4,            m1
10343
+    pmaddubsw         m4,            m0
10344
+    pmaddwd           m4,            m2
10345
+    packssdw          m3,            m4
10346
+    pmulhrsw          m3,            m6
10347
+
10348
+    vbroadcasti128    m4,            [r0 + 16]
10349
+    pshufb            m4,            m1
10350
+    pmaddubsw         m4,            m0
10351
+    pmaddwd           m4,            m2
10352
+    vbroadcasti128    m5,            [r0 + 20]
10353
+    pshufb            m5,            m1
10354
+    pmaddubsw         m5,            m0
10355
+    pmaddwd           m5,            m2
10356
+    packssdw          m4,            m5
10357
+    pmulhrsw          m4,            m6
10358
+
10359
+    packuswb          m3,            m4
10360
+    vpermq            m3,            m3,      q3120
10361
+
10362
+    movu              [r2],          m3
10363
+
10364
+    vbroadcasti128    m3,            [r0 + mmsize]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10365
+    pshufb            m3,            m1
10366
+    pmaddubsw         m3,            m0
10367
+    pmaddwd           m3,            m2
10368
+    vbroadcasti128    m4,            [r0 + mmsize + 4]
10369
+    pshufb            m4,            m1
10370
+    pmaddubsw         m4,            m0
10371
+    pmaddwd           m4,            m2
10372
+    packssdw          m3,            m4
10373
+    pmulhrsw          m3,            m6
10374
+
10375
+    vbroadcasti128    m4,            [r0 + mmsize + 16]
10376
+    pshufb            m4,            m1
10377
+    pmaddubsw         m4,            m0
10378
+    pmaddwd           m4,            m2
10379
+    vbroadcasti128    m5,            [r0 + mmsize + 20]
10380
+    pshufb            m5,            m1
10381
+    pmaddubsw         m5,            m0
10382
+    pmaddwd           m5,            m2
10383
+    packssdw          m4,            m5
10384
+    pmulhrsw          m4,            m6
10385
+
10386
+    packuswb          m3,            m4
10387
+    vpermq            m3,            m3,      q3120
10388
+    movu              [r2 + mmsize], xm3
10389
+
10390
+    add               r2,            r3
10391
+    add               r0,            r1
10392
+    dec               r4d
10393
+    jnz               .loop
10394
+    RET
10395
+
10396
+;-----------------------------------------------------------------------------------------------------------------------------
10397
+; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
10398
+;-----------------------------------------------------------------------------------------------------------------------------;
10399
+
10400
+INIT_YMM avx2
10401
+cglobal interp_4tap_horiz_ps_48x64, 4,7,6
10402
+    mov             r4d, r4m
10403
+    mov             r5d, r5m
10404
+    add             r3d, r3d
10405
+
10406
+%ifdef PIC
10407
+    lea               r6,           [tab_ChromaCoeff]
10408
+    vpbroadcastd      m0,           [r6 + r4 * 4]
10409
+%else
10410
+    vpbroadcastd      m0,           [tab_ChromaCoeff + r4 * 4]
10411
+%endif
10412
+
10413
+    vbroadcasti128     m2,          [pw_1]
10414
+    vbroadcasti128     m5,          [pw_2000]
10415
+    mova               m1,          [tab_Tm]
10416
+
10417
+    ; register map
10418
+    ; m0 - interpolate coeff
10419
+    ; m1 - shuffle order table
10420
+    ; m2 - constant word 1
10421
+    mov               r6d,          64
10422
+    dec               r0
10423
+    test              r5d,          r5d
10424
+    je                .loop
10425
+    sub               r0 ,          r1
10426
+    add               r6d ,         3
10427
+
10428
+.loop
10429
+    ; Row 0
10430
+    vbroadcasti128    m3,           [r0]                           ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10431
+    pshufb            m3,           m1
10432
+    pmaddubsw         m3,           m0
10433
+    pmaddwd           m3,           m2
10434
+    vbroadcasti128    m4,           [r0 + 8]                       ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10435
+    pshufb            m4,           m1
10436
+    pmaddubsw         m4,           m0
10437
+    pmaddwd           m4,           m2
10438
+
10439
+    packssdw          m3,           m4
10440
+    psubw             m3,           m5
10441
+    vpermq            m3,           m3,          q3120
10442
+    movu              [r2],         m3
10443
+
10444
+    vbroadcasti128    m3,           [r0 + 16]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10445
+    pshufb            m3,           m1
10446
+    pmaddubsw         m3,           m0
10447
+    pmaddwd           m3,           m2
10448
+    vbroadcasti128    m4,           [r0 + 24]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10449
+    pshufb            m4,           m1
10450
+    pmaddubsw         m4,           m0
10451
+    pmaddwd           m4,           m2
10452
+
10453
+    packssdw          m3,           m4
10454
+    psubw             m3,           m5
10455
+    vpermq            m3,           m3,          q3120
10456
+    movu              [r2 + 32],    m3
10457
+
10458
+    vbroadcasti128    m3,           [r0 + 32]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10459
+    pshufb            m3,           m1
10460
+    pmaddubsw         m3,           m0
10461
+    pmaddwd           m3,           m2
10462
+    vbroadcasti128    m4,           [r0 + 40]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10463
+    pshufb            m4,           m1
10464
+    pmaddubsw         m4,           m0
10465
+    pmaddwd           m4,           m2
10466
+
10467
+    packssdw          m3,           m4
10468
+    psubw             m3,           m5
10469
+    vpermq            m3,           m3,          q3120
10470
+    movu              [r2 + 64],    m3
10471
+
10472
+    add               r2,          r3
10473
+    add               r0,          r1
10474
+    dec               r6d
10475
+    jnz               .loop
10476
+    RET
10477
+
10478
+;-----------------------------------------------------------------------------------------------------------------------------
10479
+; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
10480
+;-----------------------------------------------------------------------------------------------------------------------------
10481
+INIT_YMM avx2
10482
+cglobal interp_4tap_horiz_ps_24x64, 4,7,6
10483
+    mov                r4d,            r4m
10484
+    mov                r5d,            r5m
10485
+    add                r3d,            r3d
10486
+%ifdef PIC
10487
+    lea                r6,             [tab_ChromaCoeff]
10488
+    vpbroadcastd       m0,             [r6 + r4 * 4]
10489
+%else
10490
+    vpbroadcastd       m0,             [tab_ChromaCoeff + r4 * 4]
10491
+%endif
10492
+    vbroadcasti128     m2,             [pw_1]
10493
+    vbroadcasti128     m5,             [pw_2000]
10494
+    mova               m1,             [tab_Tm]
10495
+
10496
+    ; register map
10497
+    ; m0 - interpolate coeff
10498
+    ; m1 - shuffle order table
10499
+    ; m2 - constant word 1
10500
+    mov                r6d,            64
10501
+    dec                r0
10502
+    test               r5d,            r5d
10503
+    je                 .loop
10504
+    sub                r0 ,            r1
10505
+    add                r6d ,           3
10506
+
10507
+.loop
10508
+    ; Row 0
10509
+    vbroadcasti128     m3,             [r0]                          ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10510
+    pshufb             m3,             m1
10511
+    pmaddubsw          m3,             m0
10512
+    pmaddwd            m3,             m2
10513
+    vbroadcasti128     m4,             [r0 + 8]                      ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10514
+    pshufb             m4,             m1
10515
+    pmaddubsw          m4,             m0
10516
+    pmaddwd            m4,             m2
10517
+    packssdw           m3,             m4
10518
+    psubw              m3,             m5
10519
+    vpermq             m3,             m3,          q3120
10520
+    movu               [r2],           m3
10521
+
10522
+    vbroadcasti128     m3,             [r0 + 16]                     ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10523
+    pshufb             m3,             m1
10524
+    pmaddubsw          m3,             m0
10525
+    pmaddwd            m3,             m2
10526
+    packssdw           m3,             m3
10527
+    psubw              m3,             m5
10528
+    vpermq             m3,             m3,          q3120
10529
+    movu               [r2 + 32],      xm3
10530
+
10531
+    add                r2,             r3
10532
+    add                r0,             r1
10533
+    dec                r6d
10534
+    jnz                .loop
10535
+    RET
10536
+
10537
+INIT_YMM avx2
10538
+cglobal interp_4tap_horiz_ps_2x16, 4, 7, 7
10539
+    mov               r4d,           r4m
10540
+    mov               r5d,           r5m
10541
+    add               r3d,           r3d
10542
+
10543
+%ifdef PIC
10544
+    lea               r6,            [tab_ChromaCoeff]
10545
+    vpbroadcastd      m0,            [r6 + r4 * 4]
10546
+%else
10547
+    vpbroadcastd      m0,            [tab_ChromaCoeff + r4 * 4]
10548
+%endif
10549
+    vbroadcasti128    m6,            [pw_2000]
10550
+    test              r5d,            r5d
10551
+    jz                .label
10552
+    sub               r0,             r1
10553
+
10554
+.label
10555
+    mova              m4,            [interp4_hps_shuf]
10556
+    mova              m5,            [pw_1]
10557
+    dec               r0
10558
+    lea               r4,            [r1 * 3]
10559
+    movq              xm1,           [r0]                                   ;row 0
+    movhps            xm1,           [r0 + r1]
+    movq              xm2,           [r0 + r1 * 2]
+    movhps            xm2,           [r0 + r4]
+    vinserti128       m1,            m1,           xm2,          1
+    lea               r0,            [r0 + r1 * 4]
+    movq              xm3,           [r0]
+    movhps            xm3,           [r0 + r1]
+    movq              xm2,           [r0 + r1 * 2]
+    movhps            xm2,           [r0 + r4]
+    vinserti128       m3,            m3,           xm2,          1
+
+    pshufb            m1,            m4
+    pshufb            m3,            m4
+    pmaddubsw         m1,            m0
+    pmaddubsw         m3,            m0
+    pmaddwd           m1,            m5
+    pmaddwd           m3,            m5
+    packssdw          m1,            m3
+    psubw             m1,            m6
+
+    lea               r4,            [r3 * 3]
+    vextracti128      xm2,           m1,           1
+
+    movd              [r2],          xm1
+    pextrd            [r2 + r3],     xm1,          1
+    movd              [r2 + r3 * 2], xm2
+    pextrd            [r2 + r4],     xm2,          1
+    lea               r2,            [r2 + r3 * 4]
+    pextrd            [r2],          xm1,          2
+    pextrd            [r2 + r3],     xm1,          3
+    pextrd            [r2 + r3 * 2], xm2,          2
+    pextrd            [r2 + r4],     xm2,          3
+
+    lea               r0,            [r0 + r1 * 4]
+    lea               r2,            [r2 + r3 * 4]
+    lea               r4,            [r1 * 3]
+    movq              xm1,           [r0]
+    movhps            xm1,           [r0 + r1]
+    movq              xm2,           [r0 + r1 * 2]
+    movhps            xm2,           [r0 + r4]
+    vinserti128       m1,            m1,          xm2,           1
+    lea               r0,            [r0 + r1 * 4]
+    movq              xm3,           [r0]
+    movhps            xm3,           [r0 + r1]
+    movq              xm2,           [r0 + r1 * 2]
+    movhps            xm2,           [r0 + r4]
+    vinserti128       m3,            m3,          xm2,           1
+
+    pshufb            m1,            m4
+    pshufb            m3,            m4
+    pmaddubsw         m1,            m0
+    pmaddubsw         m3,            m0
+    pmaddwd           m1,            m5
+    pmaddwd           m3,            m5
+    packssdw          m1,            m3
+    psubw             m1,            m6
+
+    lea               r4,            [r3 * 3]
+    vextracti128      xm2,           m1,           1
+
+    movd              [r2],          xm1
+    pextrd            [r2 + r3],     xm1,          1
+    movd              [r2 + r3 * 2], xm2
+    pextrd            [r2 + r4],     xm2,          1
+    lea               r2,            [r2 + r3 * 4]
+    pextrd            [r2],          xm1,          2
+    pextrd            [r2 + r3],     xm1,          3
+    pextrd            [r2 + r3 * 2], xm2,          2
+    pextrd            [r2 + r4],     xm2,          3
+
+    test              r5d,            r5d
+    jz                .end
+
+    lea               r0,            [r0 + r1 * 4]
+    lea               r2,            [r2 + r3 * 4]
+    movq              xm1,           [r0]
+    movhps            xm1,           [r0 + r1]
+    movq              xm2,           [r0 + r1 * 2]
+    vinserti128       m1,            m1,          xm2,           1
+    pshufb            m1,            m4
+    pmaddubsw         m1,            m0
+    pmaddwd           m1,            m5
+    packssdw          m1,            m1
+    psubw             m1,            m6
+    vextracti128      xm2,           m1,           1
+
+    movd              [r2],          xm1
+    pextrd            [r2 + r3],     xm1,          1
+    movd              [r2 + r3 * 2], xm2
+.end
+    RET
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_6x16, 4, 6, 7
+    mov               r4d,               r4m
+
+%ifdef PIC
+    lea               r5,                [tab_ChromaCoeff]
+    vpbroadcastd      m0,                [r5 + r4 * 4]
+%else
+    vpbroadcastd      m0,                [tab_ChromaCoeff + r4 * 4]
+%endif
+
+    mova              m1,                [tab_Tm]
+    mova              m2,                [pw_1]
+    mova              m6,                [pw_512]
+    lea               r4,                [r1 * 3]
+    lea               r5,                [r3 * 3]
+    ; register map
+    ; m0 - interpolate coeff
+    ; m1 - shuffle order table
+    ; m2 - constant word 1
+
+    dec               r0
+%rep 4
+    ; Row 0
+    vbroadcasti128    m3,                [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb            m3,                m1
+    pmaddubsw         m3,                m0
+    pmaddwd           m3,                m2
+
+    ; Row 1
+    vbroadcasti128    m4,                [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb            m4,                m1
+    pmaddubsw         m4,                m0
+    pmaddwd           m4,                m2
+    packssdw          m3,                m4
+    pmulhrsw          m3,                m6
+
+    ; Row 2
+    vbroadcasti128    m4,                [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb            m4,                m1
+    pmaddubsw         m4,                m0
+    pmaddwd           m4,                m2
+
+    ; Row 3
+    vbroadcasti128    m5,                [r0 + r4]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb            m5,                m1
+    pmaddubsw         m5,                m0
+    pmaddwd           m5,                m2
+    packssdw          m4,                m5
+    pmulhrsw          m4,                m6
+
+    packuswb          m3,                m4
+    vextracti128      xm4,               m3,          1
+    movd              [r2],              xm3
+    pextrw            [r2 + 4],          xm4,         0
+    pextrd            [r2 + r3],         xm3,         1
+    pextrw            [r2 + r3 + 4],     xm4,         2
+    pextrd            [r2 + r3 * 2],     xm3,         2
+    pextrw            [r2 + r3 * 2 + 4], xm4,         4
+    pextrd            [r2 + r5],         xm3,         3
+    pextrw            [r2 + r5 + 4],     xm4,         6
+    lea               r2,                [r2 + r3 * 4]
+    lea               r0,                [r0 + r1 * 4]
+%endrep
+    RET
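For review context: the `interp_4tap_horiz_pp_*` kernels above combine `pmaddubsw`/`pmaddwd` and a `pmulhrsw` by `pw_512`, which is equivalent to a 4-tap horizontal convolution with `(sum + 32) >> 6` rounding, clamped back to 8-bit pixels. The scalar sketch below is a hypothetical reference (names `interp_4tap_horiz_pp_ref` and `clip8` are illustrative, not part of the patch; the real coefficient tables live in `tab_ChromaCoeff`):

```c
#include <stdint.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clip8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v); }

/* Scalar model of the 4-tap horizontal pp filter: each output pixel is a
   weighted sum of src[x-1..x+2]; pmulhrsw with pw_512 == (sum + 32) >> 6. */
void interp_4tap_horiz_pp_ref(const uint8_t *src, long srcStride,
                              uint8_t *dst, long dstStride,
                              int width, int height, const int8_t coeff[4])
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 4; k++)       /* taps read src[x-1..x+2] */
                sum += coeff[k] * src[x + k - 1];
            dst[x] = clip8((sum + 32) >> 6);  /* round and clamp to 8 bits */
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With the identity coefficient set {0, 64, 0, 0} (coefficient index 0 in HEVC chroma interpolation) the filter passes pixels through unchanged, which is a quick sanity check for the vector code.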
x265_1.6.tar.gz/source/common/x86/ipfilter8.h -> x265_1.7.tar.gz/source/common/x86/ipfilter8.h Changed
 
@@ -289,16 +289,114 @@
     SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 16, cpu); \
     SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu)
 
-void x265_chroma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-void x265_luma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+void x265_filterPixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
+    void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
+
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu);
+
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
 
 CHROMA_420_VERT_FILTERS(_sse2);
 CHROMA_420_HORIZ_FILTERS(_sse4);
 CHROMA_420_VERT_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_422_VERT_FILTERS(_sse2);
 CHROMA_422_HORIZ_FILTERS(_sse4);
 CHROMA_422_VERT_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_444_VERT_FILTERS(_sse2);
 CHROMA_444_HORIZ_FILTERS(_sse4);
@@ -572,6 +670,48 @@
     SETUP_CHROMA_SS_FUNC_DEF(64, 16, cpu); \
     SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu);
 
+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
+    void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
+
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
+
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
+
 CHROMA_420_FILTERS(_sse4);
 CHROMA_420_FILTERS(_avx2);
 CHROMA_420_SP_FILTERS(_sse2);
@@ -582,19 +722,32 @@
 CHROMA_420_SS_FILTERS_SSE4(_sse4);
 CHROMA_420_SS_FILTERS(_avx2);
 CHROMA_420_SS_FILTERS_SSE4(_avx2);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_422_FILTERS(_sse4);
 CHROMA_422_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS(_sse2);
+CHROMA_422_SP_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS_SSE4(_sse4);
+CHROMA_422_SP_FILTERS_SSE4(_avx2);
 CHROMA_422_SS_FILTERS(_sse2);
+CHROMA_422_SS_FILTERS(_avx2);
 CHROMA_422_SS_FILTERS_SSE4(_sse4);
+CHROMA_422_SS_FILTERS_SSE4(_avx2);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
+void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 
 CHROMA_444_FILTERS(_sse4);
 CHROMA_444_SP_FILTERS(_sse4);
 CHROMA_444_SS_FILTERS(_sse2);
-
-void x265_chroma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+CHROMA_444_FILTERS(_avx2);
+CHROMA_444_SP_FILTERS(_avx2);
+CHROMA_444_SS_FILTERS(_avx2);
 
 #undef SETUP_CHROMA_FUNC_DEF
 #undef SETUP_CHROMA_SP_FUNC_DEF
@@ -623,29 +776,155 @@
 LUMA_FILTERS(_avx2);
 LUMA_SP_FILTERS(_avx2);
 LUMA_SS_FILTERS(_avx2);
-void x265_interp_8tap_hv_pp_8x8_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_pixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_interp_8tap_hv_pp_8x8_ssse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
+void x265_interp_8tap_hv_pp_16x16_avx2(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
+void x265_filterPixelToShort_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x8_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x16_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_interp_4tap_horiz_pp_2x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_2x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_2x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_6x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_6x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x6_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_12x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_12x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_24x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_24x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_48x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_12x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x12_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_24x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x24_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_48x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x48_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_ps_4x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_4x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_4x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_12x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x12_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
343
+void x265_interp_8tap_horiz_ps_16x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
344
+void x265_interp_8tap_horiz_ps_16x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
345
+void x265_interp_8tap_horiz_ps_16x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
346
+void x265_interp_8tap_horiz_ps_24x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
347
+void x265_interp_8tap_horiz_ps_32x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
348
+void x265_interp_8tap_horiz_ps_32x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
349
+void x265_interp_8tap_horiz_ps_32x24_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
350
+void x265_interp_8tap_horiz_ps_32x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
351
+void x265_interp_8tap_horiz_ps_32x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
352
+void x265_interp_8tap_horiz_ps_48x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
353
+void x265_interp_8tap_horiz_ps_64x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
354
+void x265_interp_8tap_horiz_ps_64x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
355
+void x265_interp_8tap_horiz_ps_64x48_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
356
+void x265_interp_8tap_horiz_ps_64x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
357
+void x265_interp_8tap_hv_pp_8x8_sse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
358
+void x265_interp_4tap_vert_pp_2x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
359
+void x265_interp_4tap_vert_pp_2x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
360
+void x265_interp_4tap_vert_pp_2x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
361
+void x265_interp_4tap_vert_pp_4x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
362
+void x265_interp_4tap_vert_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
363
+void x265_interp_4tap_vert_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
364
+void x265_interp_4tap_vert_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
365
+void x265_interp_4tap_vert_pp_4x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
366
+#ifdef X86_64
367
+void x265_interp_4tap_vert_pp_6x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
368
+void x265_interp_4tap_vert_pp_6x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
369
+void x265_interp_4tap_vert_pp_8x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
370
+void x265_interp_4tap_vert_pp_8x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
371
+void x265_interp_4tap_vert_pp_8x6_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
372
+void x265_interp_4tap_vert_pp_8x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
373
+void x265_interp_4tap_vert_pp_8x12_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
374
+void x265_interp_4tap_vert_pp_8x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
375
+void x265_interp_4tap_vert_pp_8x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
376
+void x265_interp_4tap_vert_pp_8x64_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
377
+#endif
378
 #undef LUMA_FILTERS
379
 #undef LUMA_SP_FILTERS
380
 #undef LUMA_SS_FILTERS
381
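The `interp_8tap_horiz_pp` primitives declared above are width-by-height specializations of one operation: the HEVC 8-tap luma interpolation filter applied horizontally, with rounding back to pixels (`pp` = pixel in, pixel out; the `ps` variants write unrounded `int16_t` intermediates). A plain-C sketch of what each specialization computes, assuming the default 8-bit `pixel` build and the luma coefficient table from the HEVC specification (the function name here is illustrative, not part of x265's API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* HEVC luma interpolation coefficients for the four fractional positions
 * (full-pel, 1/4, 1/2, 3/4); each row of taps sums to 64. */
static const int8_t luma_filter[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

static uint8_t clamp_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Illustrative C equivalent of the x265_interp_8tap_horiz_pp_WxH_* kernels
 * for 8-bit pixels. The filter reads src[x - 3] .. src[x + 4], so the
 * caller must provide that margin around the block. */
static void interp_8tap_horiz_pp_c(const uint8_t *src, intptr_t srcStride,
                                   uint8_t *dst, intptr_t dstStride,
                                   int width, int height, int coeffIdx)
{
    const int8_t *c = luma_filter[coeffIdx];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)          /* 8-tap FIR across the row */
                sum += c[k] * src[x + k - 3];
            dst[x] = clamp_u8((sum + 32) >> 6);  /* round, shift by 6, clip */
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

The assembly versions unroll this per block size (8x16, 16x16, …, 64x64), which is why one declaration exists per geometry.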
x265_1.6.tar.gz/source/common/x86/loopfilter.asm -> x265_1.7.tar.gz/source/common/x86/loopfilter.asm Changed
 
@@ -28,31 +28,39 @@
 %include "x86inc.asm"
 
 SECTION_RODATA 32
-pb_31:      times 16 db 31
-pb_15:      times 16 db 15
+pb_31:      times 32 db 31
+pb_15:      times 32 db 15
+pb_movemask_32:  times 32 db 0x00
+                 times 32 db 0xFF
 
 SECTION .text
 cextern pb_1
 cextern pb_128
 cextern pb_2
 cextern pw_2
+cextern pb_movemask
 
 
 ;============================================================================================================
-; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t signLeft)
+; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
 ;============================================================================================================
INIT_XMM sse4
-cglobal saoCuOrgE0, 4, 4, 8, rec, offsetEo, lcuWidth, signLeft
+cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
 
-    neg         r3                          ; r3 = -signLeft
-    movzx       r3d, r3b
-    movd        m0, r3d
-    mova        m4, [pb_128]                ; m4 = [80]
-    pxor        m5, m5                      ; m5 = 0
-    movu        m6, [r1]                    ; m6 = offsetEo
+    mov         r4d, r4m
+    mova        m4,  [pb_128]                ; m4 = [80]
+    pxor        m5,  m5                      ; m5 = 0
+    movu        m6,  [r1]                    ; m6 = offsetEo
+
+    movzx       r1d, byte [r3]
+    inc         r3
+    neg         r1b
+    movd        m0, r1d
+    lea         r1, [r0 + r4]
+    mov         r4d, r2d
 
 .loop:
-    movu        m7, [r0]                    ; m1 = rec[x]
+    movu        m7, [r0]                    ; m7 = rec[x]
     movu        m2, [r0 + 1]                ; m2 = rec[x+1]
 
     pxor        m1, m7, m4
@@ -69,7 +77,7 @@
     pxor        m0, m0
     palignr     m0, m2, 15
     paddb       m2, m3
-    paddb       m2, [pb_2]                  ; m1 = uiEdgeType
+    paddb       m2, [pb_2]                  ; m2 = uiEdgeType
     pshufb      m3, m6, m2
     pmovzxbw    m2, m7                      ; rec
     punpckhbw   m7, m5
@@ -84,6 +92,97 @@
     add         r0q, 16
     sub         r2d, 16
     jnz        .loop
+
+    movzx       r3d, byte [r3]
+    neg         r3b
+    movd        m0, r3d
+.loopH:
+    movu        m7, [r1]                    ; m7 = rec[x]
+    movu        m2, [r1 + 1]                ; m2 = rec[x+1]
+
+    pxor        m1, m7, m4
+    pxor        m3, m2, m4
+    pcmpgtb     m2, m1, m3
+    pcmpgtb     m3, m1
+    pand        m2, [pb_1]
+    por         m2, m3
+
+    pslldq      m3, m2, 1
+    por         m3, m0
+
+    psignb      m3, m4                      ; m3 = signLeft
+    pxor        m0, m0
+    palignr     m0, m2, 15
+    paddb       m2, m3
+    paddb       m2, [pb_2]                  ; m2 = uiEdgeType
+    pshufb      m3, m6, m2
+    pmovzxbw    m2, m7                      ; rec
+    punpckhbw   m7, m5
+    pmovsxbw    m1, m3                      ; offsetEo
+    punpckhbw   m3, m3
+    psraw       m3, 8
+    paddw       m2, m1
+    paddw       m7, m3
+    packuswb    m2, m7
+    movu        [r1], m2
+
+    add         r1q, 16
+    sub         r4d, 16
+    jnz        .loopH
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
+
+    mov                 r4d,        r4m
+    vbroadcasti128      m4,         [pb_128]                   ; m4 = [80]
+    vbroadcasti128      m6,         [r1]                       ; m6 = offsetEo
+    movzx               r1d,        byte [r3]
+    neg                 r1b
+    movd                xm0,        r1d
+    movzx               r1d,        byte [r3 + 1]
+    neg                 r1b
+    movd                xm1,        r1d
+    vinserti128         m0,         m0,        xm1,           1
+
+.loop:
+    movu                xm5,        [r0]                       ; xm5 = rec[x]
+    movu                xm2,        [r0 + 1]                   ; xm2 = rec[x + 1]
+    vinserti128         m5,         m5,        [r0 + r4],     1
+    vinserti128         m2,         m2,        [r0 + r4 + 1], 1
+
+    pxor                m1,         m5,        m4
+    pxor                m3,         m2,        m4
+    pcmpgtb             m2,         m1,        m3
+    pcmpgtb             m3,         m1
+    pand                m2,         [pb_1]
+    por                 m2,         m3
+
+    pslldq              m3,         m2,        1
+    por                 m3,         m0
+
+    psignb              m3,         m4                         ; m3 = signLeft
+    pxor                m0,         m0
+    palignr             m0,         m2,        15
+    paddb               m2,         m3
+    paddb               m2,         [pb_2]                     ; m2 = uiEdgeType
+    pshufb              m3,         m6,        m2
+    pmovzxbw            m2,         xm5                        ; rec
+    vextracti128        xm5,        m5,        1
+    pmovzxbw            m5,         xm5
+    pmovsxbw            m1,         xm3                        ; offsetEo
+    vextracti128        xm3,        m3,        1
+    pmovsxbw            m3,         xm3
+    paddw               m2,         m1
+    paddw               m5,         m3
+    packuswb            m2,         m5
+    vpermq              m2,         m2,        11011000b
+    movu                [r0],       xm2
+    vextracti128        [r0 + r4],  m2,        1
+
+    add                 r0q,        16
+    sub                 r2d,        16
+    jnz                 .loop
     RET
 
 ;==================================================================================================
@@ -94,117 +193,382 @@
     mov         r3d, r3m
     mov         r4d, r4m
     pxor        m0,    m0                      ; m0 = 0
-    movu        m6,    [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    mova        m6,    [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
     mova        m7,    [pb_128]
     shr         r4d,   4
-    .loop
-         movu        m1,    [r0]                    ; m1 = pRec[x]
-         movu        m2,    [r0 + r3]               ; m2 = pRec[x + iStride]
-
-         pxor        m3,    m1,    m7
-         pxor        m4,    m2,    m7
-         pcmpgtb     m2,    m3,    m4
-         pcmpgtb     m4,    m3
-         pand        m2,    [pb_1]
-         por         m2,    m4
-
-         movu        m3,    [r1]                    ; m3 = m_iUpBuff1
-
-         paddb       m3,    m2
-         paddb       m3,    m6
-
-         movu        m4,    [r2]                    ; m4 = m_iOffsetEo
-         pshufb      m5,    m4,    m3
-
-         psubb       m3,    m0,    m2
-         movu        [r1],  m3
-
-         pmovzxbw    m2,    m1
-         punpckhbw   m1,    m0
-         pmovsxbw    m3,    m5
-         punpckhbw   m5,    m5
-         psraw       m5,    8
-
-         paddw       m2,    m3
-         paddw       m1,    m5
-         packuswb    m2,    m1
-         movu        [r0],  m2
-
-         add         r0,    16
-         add         r1,    16
-         dec         r4d
-         jnz         .loop
+.loop
+    movu        m1,    [r0]                    ; m1 = pRec[x]
+    movu        m2,    [r0 + r3]               ; m2 = pRec[x + iStride]
+
+    pxor        m3,    m1,    m7
+    pxor        m4,    m2,    m7
+    pcmpgtb     m2,    m3,    m4
+    pcmpgtb     m4,    m3
+    pand        m2,    [pb_1]
+    por         m2,    m4
+
+    movu        m3,    [r1]                    ; m3 = m_iUpBuff1
+
+    paddb       m3,    m2
+    paddb       m3,    m6
+
+    movu        m4,    [r2]                    ; m4 = m_iOffsetEo
+    pshufb      m5,    m4,    m3
+
+    psubb       m3,    m0,    m2
+    movu        [r1],  m3
+
+    pmovzxbw    m2,    m1
+    punpckhbw   m1,    m0
+    pmovsxbw    m3,    m5
+    punpckhbw   m5,    m5
+    psraw       m5,    8
+
+    paddw       m2,    m3
+    paddw       m1,    m5
+    packuswb    m2,    m1
+    movu        [r0],  m2
+
+    add         r0,    16
+    add         r1,    16
+    dec         r4d
+    jnz         .loop
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE1, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
+    mov           r3d,    r3m
+    mov           r4d,    r4m
+    movu          xm0,    [r2]                    ; xm0 = m_iOffsetEo
+    mova          xm6,    [pb_2]                  ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    mova          xm7,    [pb_128]
+    shr           r4d,    4
+.loop
+    movu          xm1,    [r0]                    ; xm1 = pRec[x]
+    movu          xm2,    [r0 + r3]               ; xm2 = pRec[x + iStride]
+
+    pxor          xm3,    xm1,    xm7
+    pxor          xm4,    xm2,    xm7
+    pcmpgtb       xm2,    xm3,    xm4
+    pcmpgtb       xm4,    xm3
+    pand          xm2,    [pb_1]
+    por           xm2,    xm4
+
+    movu          xm3,    [r1]                    ; xm3 = m_iUpBuff1
+
+    paddb         xm3,    xm2
+    paddb         xm3,    xm6
+
+    pshufb        xm5,    xm0,    xm3
+    pxor          xm4,    xm4
+    psubb         xm3,    xm4,    xm2
+    movu          [r1],   xm3
+
+    pmovzxbw      m2,     xm1
+    pmovsxbw      m3,     xm5
+
+    paddw         m2,     m3
+    vextracti128  xm3,    m2,     1
+    packuswb      xm2,    xm3
+    movu          [r0],   xm2
+
+    add           r0,     16
+    add           r1,     16
+    dec           r4d
+    jnz           .loop
+    RET
+
+;========================================================================================================
+; void saoCuOrgE1_2Rows(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
+;========================================================================================================
+INIT_XMM sse4
+cglobal saoCuOrgE1_2Rows, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
+    mov         r3d,        r3m
+    mov         r4d,        r4m
+    pxor        m0,         m0                      ; m0 = 0
+    mova        m7,         [pb_128]
+    shr         r4d,        4
+.loop
+    movu        m1,         [r0]                    ; m1 = pRec[x]
+    movu        m2,         [r0 + r3]               ; m2 = pRec[x + iStride]
+
+    pxor        m3,         m1,         m7
+    pxor        m4,         m2,         m7
+    pcmpgtb     m6,         m3,         m4
+    pcmpgtb     m5,         m4,         m3
+    pand        m6,         [pb_1]
+    por         m6,         m5
+
+    movu        m5,         [r0 + r3 * 2]
+    pxor        m3,         m5,         m7
+    pcmpgtb     m5,         m4,         m3
+    pcmpgtb     m3,         m4
+    pand        m5,         [pb_1]
+    por         m5,         m3
+
+    movu        m3,         [r1]                    ; m3 = m_iUpBuff1
+    paddb       m3,         m6
+    paddb       m3,         [pb_2]
+
+    movu        m4,         [r2]                    ; m4 = m_iOffsetEo
+    pshufb      m4,         m3
+
+    psubb       m3,         m0,         m6
+    movu        [r1],       m3
+
+    pmovzxbw    m6,         m1
+    punpckhbw   m1,         m0
+    pmovsxbw    m3,         m4
+    punpckhbw   m4,         m4
+    psraw       m4,         8
+
+    paddw       m6,         m3
+    paddw       m1,         m4
+    packuswb    m6,         m1
+    movu        [r0],       m6
+
+    movu        m3,         [r1]                    ; m3 = m_iUpBuff1
+    paddb       m3,         m5
+    paddb       m3,         [pb_2]
+
+    movu        m4,         [r2]                    ; m4 = m_iOffsetEo
+    pshufb      m4,         m3
+    psubb       m3,         m0,         m5
+    movu        [r1],       m3
+
+    pmovzxbw    m5,         m2
+    punpckhbw   m2,         m0
+    pmovsxbw    m3,         m4
+    punpckhbw   m4,         m4
+    psraw       m4,         8
+
+    paddw       m5,         m3
+    paddw       m2,         m4
+    packuswb    m5,         m2
+    movu        [r0 + r3],  m5
+
+    add         r0,         16
+    add         r1,         16
+    dec         r4d
+    jnz         .loop
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE1_2Rows, 3, 5, 7, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
+    mov             r3d,        r3m
+    mov             r4d,        r4m
+    pxor            m0,         m0                           ; m0 = 0
+    vbroadcasti128  m5,         [pb_128]
+    vbroadcasti128  m6,         [r2]                         ; m6 = m_iOffsetEo
+    shr             r4d,        4
+.loop
+    movu            xm1,        [r0]                         ; m1 = pRec[x]
+    movu            xm2,        [r0 + r3]                    ; m2 = pRec[x + iStride]
+    vinserti128     m1,         m1,       xm2,            1
+    vinserti128     m2,         m2,       [r0 + r3 * 2],  1
+
+    pxor            m3,         m1,       m5
+    pxor            m4,         m2,       m5
+    pcmpgtb         m2,         m3,       m4
+    pcmpgtb         m4,         m3
+    pand            m2,         [pb_1]
+    por             m2,         m4
+
+    movu            xm3,        [r1]                         ; xm3 = m_iUpBuff
+    psubb           m4,         m0,       m2
+    vinserti128     m3,         m3,       xm4,            1
+    paddb           m3,         m2
+    paddb           m3,         [pb_2]
+    pshufb          m2,         m6,       m3
+    vextracti128    [r1],       m4,       1
+
+    pmovzxbw        m4,         xm1
+    vextracti128    xm3,        m1,       1
+    pmovzxbw        m3,         xm3
+    pmovsxbw        m1,         xm2
+    vextracti128    xm2,        m2,       1
+    pmovsxbw        m2,         xm2
+
+    paddw           m4,         m1
+    paddw           m3,         m2
+    packuswb        m4,         m3
+    vpermq          m4,         m4,       11011000b
+    movu            [r0],       xm4
+    vextracti128    [r0 + r3],  m4,       1
+
+    add             r0,         16
+    add             r1,         16
+    dec             r4d
+    jnz             .loop
     RET
 
 ;======================================================================================================================================================
 ; void saoCuOrgE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int lcuWidth, intptr_t stride)
 ;======================================================================================================================================================
 INIT_XMM sse4
-cglobal saoCuOrgE2, 5, 7, 8, rec, bufft, buff1, offsetEo, lcuWidth
-
-    mov         r6,    16
+cglobal saoCuOrgE2, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
+    mov         r4d,   r4m
     mov         r5d,   r5m
     pxor        m0,    m0                      ; m0 = 0
     mova        m6,    [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
     mova        m7,    [pb_128]
-    shr         r4d,   4
-    inc         r1q
-
-    .loop
-         movu        m1,    [r0]                    ; m1 = rec[x]
-         movu        m2,    [r0 + r5 + 1]           ; m2 = rec[x + stride + 1]
-         pxor        m3,    m1,    m7
-         pxor        m4,    m2,    m7
-         pcmpgtb     m2,    m3,    m4
-         pcmpgtb     m4,    m3
-         pand        m2,    [pb_1]
-         por         m2,    m4
-         movu        m3,    [r2]                    ; m3 = buff1
-
-         paddb       m3,    m2
-         paddb       m3,    m6                      ; m3 = edgeType
-
-         movu        m4,    [r3]                    ; m4 = offsetEo
-         pshufb      m4,    m3
-
-         psubb       m3,    m0,    m2
-         movu        [r1],  m3
-
-         pmovzxbw    m2,    m1
-         punpckhbw   m1,    m0
-         pmovsxbw    m3,    m4
-         punpckhbw   m4,    m4
-         psraw       m4,    8
-
-         paddw       m2,    m3
-         paddw       m1,    m4
-         packuswb    m2,    m1
-         movu        [r0],  m2
-
-         add         r0,    r6
-         add         r1,    r6
-         add         r2,    r6
-         dec         r4d
-         jnz         .loop
+    inc         r1
+    movh        m5,    [r0 + r4]
+    movhps      m5,    [r1 + r4]
+
+.loop
+    movu        m1,    [r0]                    ; m1 = rec[x]
+    movu        m2,    [r0 + r5 + 1]           ; m2 = rec[x + stride + 1]
+    pxor        m3,    m1,    m7
+    pxor        m4,    m2,    m7
+    pcmpgtb     m2,    m3,    m4
+    pcmpgtb     m4,    m3
+    pand        m2,    [pb_1]
+    por         m2,    m4
+    movu        m3,    [r2]                    ; m3 = buff1
+
+    paddb       m3,    m2
+    paddb       m3,    m6                      ; m3 = edgeType
+
+    movu        m4,    [r3]                    ; m4 = offsetEo
+    pshufb      m4,    m3
+
+    psubb       m3,    m0,    m2
+    movu        [r1],  m3
+
+    pmovzxbw    m2,    m1
+    punpckhbw   m1,    m0
+    pmovsxbw    m3,    m4
+    punpckhbw   m4,    m4
+    psraw       m4,    8
+
+    paddw       m2,    m3
+    paddw       m1,    m4
+    packuswb    m2,    m1
+    movu        [r0],  m2
+
+    add         r0,    16
+    add         r1,    16
+    add         r2,    16
+    sub         r4,    16
+    jg          .loop
+
+    movh        [r0 + r4], m5
+    movhps      [r1 + r4], m5
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE2, 5, 6, 7, rec, bufft, buff1, offsetEo, lcuWidth
+    mov            r4d,   r4m
+    mov            r5d,   r5m
+    pxor           xm0,   xm0                     ; xm0 = 0
+    mova           xm5,   [pb_128]
+    inc            r1
+    movq           xm6,   [r0 + r4]
+    movhps         xm6,   [r1 + r4]
+
+    movu           xm1,   [r0]                    ; xm1 = rec[x]
+    movu           xm2,   [r0 + r5 + 1]           ; xm2 = rec[x + stride + 1]
+    pxor           xm3,   xm1,   xm5
+    pxor           xm4,   xm2,   xm5
+    pcmpgtb        xm2,   xm3,   xm4
+    pcmpgtb        xm4,   xm3
+    pand           xm2,   [pb_1]
+    por            xm2,   xm4
+    movu           xm3,   [r2]                    ; xm3 = buff1
+
+    paddb          xm3,   xm2
+    paddb          xm3,   [pb_2]                  ; xm3 = edgeType
+
+    movu           xm4,   [r3]                    ; xm4 = offsetEo
+    pshufb         xm4,   xm3
+
+    psubb          xm3,   xm0,   xm2
+    movu           [r1],  xm3
+
+    pmovzxbw       m2,    xm1
+    pmovsxbw       m3,    xm4
+
+    paddw          m2,    m3
+    vextracti128   xm3,   m2,    1
+    packuswb       xm2,   xm3
+    movu           [r0],  xm2
+
+    movq           [r0 + r4], xm6
+    movhps         [r1 + r4], xm6
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE2_32, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
+    mov             r4d,   r4m
+    mov             r5d,   r5m
+    pxor            m0,    m0                      ; m0 = 0
+    vbroadcasti128  m7,    [pb_128]
+    vbroadcasti128  m5,    [r3]                    ; m5 = offsetEo
+    inc             r1
+    movq            xm6,   [r0 + r4]
+    movhps          xm6,   [r1 + r4]
+
+.loop:
+    movu            m1,    [r0]                    ; m1 = rec[x]
+    movu            m2,    [r0 + r5 + 1]           ; m2 = rec[x + stride + 1]
+    pxor            m3,    m1,    m7
+    pxor            m4,    m2,    m7
+    pcmpgtb         m2,    m3,    m4
+    pcmpgtb         m4,    m3
+    pand            m2,    [pb_1]
+    por             m2,    m4
+    movu            m3,    [r2]                    ; m3 = buff1
+
+    paddb           m3,    m2
+    paddb           m3,    [pb_2]                  ; m3 = edgeType
+
+    pshufb          m4,    m5,    m3
+
+    psubb           m3,    m0,    m2
+    movu            [r1],  m3
+
+    pmovzxbw        m2,    xm1
+    vextracti128    xm1,   m1,    1
+    pmovzxbw        m1,    xm1
+    pmovsxbw        m3,    xm4
+    vextracti128    xm4,   m4,    1
+    pmovsxbw        m4,    xm4
+
+    paddw           m2,    m3
+    paddw           m1,    m4
+    packuswb        m2,    m1
+    vpermq          m2,    m2,    11011000b
+    movu            [r0],  m2
+
+    add             r0,    32
+    add             r1,    32
+    add             r2,    32
+    sub             r4,    32
+    jg              .loop
+
+    movq            [r0 + r4], xm6
+    movhps          [r1 + r4], xm6
     RET
 
601
 ;=======================================================================================================
602
 ;void saoCuOrgE3(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX)
603
 ;=======================================================================================================
604
 INIT_XMM sse4
605
-cglobal saoCuOrgE3, 3, 7, 8
606
+cglobal saoCuOrgE3, 3,6,8
607
     mov             r3d, r3m
608
     mov             r4d, r4m
609
     mov             r5d, r5m
610
 
611
-    mov             r6d, r5d
612
-    sub             r6d, r4d
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movh            m7, [r0 + r5]
+    movhps          m7, [r1 + r5 - 1]
 
+    ; move to startX+1
     inc             r4d
     add             r0, r4
     add             r1, r4
-    movh            m7, [r0 + r6 - 1]
-    mov             r6, [r1 + r6 - 2]
+    sub             r5d, r4d
     pxor            m0, m0                      ; m0 = 0
     movu            m6, [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
 
@@ -244,30 +608,143 @@
     packuswb        m2, m1
     movu            [r0], m2
 
-    sub             r5d, 16
-    jle             .end
+    add             r0, 16
+    add             r1, 16
 
-    lea             r0, [r0 + 16]
-    lea             r1, [r1 + 16]
+    sub             r5, 16
+    jg             .loop
 
-    jnz             .loop
+    ; restore last pixels (up to 2)
+    movh            [r0 + r5], m7
+    movhps          [r1 + r5 - 1], m7
+    RET
 
-.end:
-    js              .skip
-    sub             r0, r4
-    sub             r1, r4
-    movh            [r0 + 16], m7
-    mov             [r1 + 15], r6
-    jmp             .quit
+INIT_YMM avx2
+cglobal saoCuOrgE3, 3, 6, 8
+    mov             r3d,  r3m
+    mov             r4d,  r4m
+    mov             r5d,  r5m
+
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movq            xm7,  [r0 + r5]
+    movhps          xm7,  [r1 + r5 - 1]
+
+    ; move to startX+1
+    inc             r4d
+    add             r0,   r4
+    add             r1,   r4
+    sub             r5d,  r4d
+    pxor            xm0,  xm0                     ; xm0 = 0
+    mova            xm6,  [pb_2]                  ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    movu            xm5,  [r2]                    ; xm5 = m_iOffsetEo
+
+.loop:
+    movu            xm1,  [r0]                    ; xm1 = pRec[x]
+    movu            xm2,  [r0 + r3]               ; xm2 = pRec[x + iStride]
+
+    psubusb         xm3,  xm2,  xm1
+    psubusb         xm4,  xm1,  xm2
+    pcmpeqb         xm3,  xm0
+    pcmpeqb         xm4,  xm0
+    pcmpeqb         xm2,  xm1
+
+    pabsb           xm3,  xm3
+    por             xm4,  xm3
+    pandn           xm2,  xm4                     ; xm2 = iSignDown
+
+    movu            xm3,  [r1]                    ; xm3 = m_iUpBuff1
+
+    paddb           xm3,  xm2
+    paddb           xm3,  xm6                     ; xm3 = uiEdgeType
+
+    pshufb          xm4,  xm5,  xm3
+
+    psubb           xm3,  xm0,  xm2
+    movu            [r1 - 1],   xm3
+
+    pmovzxbw        m2,   xm1
+    pmovsxbw        m3,   xm4
+
+    paddw           m2,   m3
+    vextracti128    xm3,  m2,   1
+    packuswb        xm2,  xm3
+    movu            [r0], xm2
+
+    add             r0,   16
+    add             r1,   16
+
+    sub             r5,   16
+    jg             .loop
+
+    ; restore last pixels (up to 2)
+    movq            [r0 + r5],     xm7
+    movhps          [r1 + r5 - 1], xm7
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE3_32, 3, 6, 8
+    mov             r3d,  r3m
+    mov             r4d,  r4m
+    mov             r5d,  r5m
+
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movq            xm7,  [r0 + r5]
+    movhps          xm7,  [r1 + r5 - 1]
+
+    ; move to startX+1
+    inc             r4d
+    add             r0,   r4
+    add             r1,   r4
+    sub             r5d,  r4d
+    pxor            m0,   m0                      ; m0 = 0
+    mova            m6,   [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    vbroadcasti128  m5,   [r2]                    ; m5 = m_iOffsetEo
+
+.loop:
+    movu            m1,   [r0]                    ; m1 = pRec[x]
+    movu            m2,   [r0 + r3]               ; m2 = pRec[x + iStride]
+
+    psubusb         m3,   m2,   m1
+    psubusb         m4,   m1,   m2
+    pcmpeqb         m3,   m0
+    pcmpeqb         m4,   m0
+    pcmpeqb         m2,   m1
+
+    pabsb           m3,   m3
+    por             m4,   m3
+    pandn           m2,   m4                      ; m2 = iSignDown
+
+    movu            m3,   [r1]                    ; m3 = m_iUpBuff1
+
+    paddb           m3,   m2
+    paddb           m3,   m6                      ; m3 = uiEdgeType
+
+    pshufb          m4,   m5,   m3
+
+    psubb           m3,   m0,   m2
+    movu            [r1 - 1],   m3
+
+    pmovzxbw        m2,   xm1
+    vextracti128    xm1,  m1,   1
+    pmovzxbw        m1,   xm1
+    pmovsxbw        m3,   xm4
+    vextracti128    xm4,  m4,   1
+    pmovsxbw        m4,   xm4
 
-.skip:
-    sub             r0, r4
-    sub             r1, r4
-    movh            [r0 + 15], m7
-    mov             [r1 + 14], r6
+    paddw           m2,   m3
+    paddw           m1,   m4
+    packuswb        m2,   m1
+    vpermq          m2,   m2,   11011000b
+    movu            [r0], m2
 
-.quit:
+    add             r0,   32
+    add             r1,   32
+    sub             r5,   32
+    jg             .loop
 
+    ; restore last pixels (up to 2)
+    movq            [r0 + r5],     xm7
+    movhps          [r1 + r5 - 1], xm7
     RET
 
 ;=====================================================================================
@@ -320,32 +797,181 @@
     jnz         .loopH
     RET
 
+INIT_YMM avx2
+cglobal saoCuOrgB0, 4, 7, 8
+
+    mov             r3d,        r3m
+    mov             r4d,        r4m
+    mova            m7,         [pb_31]
+    vbroadcasti128  m3,         [r1 + 0]            ; offset[0-15]
+    vbroadcasti128  m4,         [r1 + 16]           ; offset[16-31]
+    lea             r6,         [r4 * 2]
+    sub             r6d,        r2d
+    shr             r2d,        4
+    mov             r1d,        r3d
+    shr             r3d,        1
+.loopH
+    mov             r5d,        r2d
+.loopW
+    movu            xm2,        [r0]                ; m2 = [rec]
+    vinserti128     m2,         m2,  [r0 + r4],  1
+    psrlw           m1,         m2,  3
+    pand            m1,         m7                  ; m1 = [index]
+    pcmpgtb         m0,         m1,  [pb_15]        ; m0 = [mask]
+
+    pshufb          m6,         m3,  m1
+    pshufb          m5,         m4,  m1
+
+    pblendvb        m6,         m6,  m5,  m0        ; NOTE: don't use 3 parameters style, x264 macro have some bug!
+
+    pmovzxbw        m1,         xm2                 ; rec
+    vextracti128    xm2,        m2,  1
+    pmovzxbw        m2,         xm2
+    pmovsxbw        m0,         xm6                 ; offset
+    vextracti128    xm6,        m6,  1
+    pmovsxbw        m6,         xm6
+
+    paddw           m1,         m0
+    paddw           m2,         m6
+    packuswb        m1,         m2
+    vpermq          m1,         m1,  11011000b
+
+    movu            [r0],       xm1
+    vextracti128    [r0 + r4],  m1,  1
+    add             r0,         16
+    dec             r5d
+    jnz             .loopW
+
+    add             r0,         r6
+    dec             r3d
+    jnz             .loopH
+    test            r1b,        1
+    jz              .end
+    mov             r5d,        r2d
+.loopW1
+    movu            xm2,        [r0]                ; m2 = [rec]
+    psrlw           xm1,        xm2, 3
+    pand            xm1,        xm7                 ; m1 = [index]
+    pcmpgtb         xm0,        xm1, [pb_15]        ; m0 = [mask]
+
+    pshufb          xm6,        xm3, xm1
+    pshufb          xm5,        xm4, xm1
+
+    pblendvb        xm6,        xm6, xm5, xm0       ; NOTE: don't use 3 parameters style, x264 macro have some bug!
+
+    pmovzxbw        m1,         xm2                 ; rec
+    pmovsxbw        m0,         xm6                 ; offset
+
+    paddw           m1,         m0
+    vextracti128    xm0,        m1,  1
+    packuswb        xm1,        xm0
+
+    movu            [r0],       xm1
+    add             r0,         16
+    dec             r5d
+    jnz             .loopW1
+.end
+    RET
+
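The saoCuOrgB0 kernel above implements SAO band offset: each pixel's band index is its value shifted right by 3 and masked to 5 bits (the `pb_31` mask), then a per-band offset is added with unsigned saturation (`packuswb`). As a rough scalar sketch of that behavior (illustrative names, not x265's actual C reference):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of SAO band offset for 8-bit pixels; illustrative only.
 * offsetBo holds 32 per-band offsets (the asm splits them into two
 * 16-entry tables so pshufb can do the lookup). */
static void sao_band_offset_ref(uint8_t *rec, const int8_t *offsetBo,
                                int width, int height, ptrdiff_t stride)
{
    for (int y = 0; y < height; y++, rec += stride)
        for (int x = 0; x < width; x++)
        {
            int band = (rec[x] >> 3) & 31;              /* pb_31 mask */
            int v = rec[x] + offsetBo[band];
            rec[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); /* packuswb clip */
        }
}
```

The two-table split exists because `pshufb` can only index 16 bytes per lane; the `pcmpgtb`/`pblendvb` pair selects between the low and high offset tables per pixel.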
 ;============================================================================================================
-; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int endX)
+; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width)
 ;============================================================================================================
 INIT_XMM sse4
-cglobal calSign, 4, 5, 7
+cglobal calSign, 4,5,6
+    mova        m0,     [pb_128]
+    mova        m1,     [pb_1]
 
-    mov         r4,    16
-    mova        m1,    [pb_128]
-    mova        m0,    [pb_1]
-    shr         r3d,   4
-.loop
-    movu        m2,    [r1]        ; m2 = pRec[x]
-    movu        m3,    [r2]        ; m3 = pTmpU[x]
+    sub         r1,     r0
+    sub         r2,     r0
 
-    pxor        m4,    m2,    m1
-    pxor        m5,    m3,    m1
-    pcmpgtb     m6,    m4,    m5
-    pcmpgtb     m5,    m4
-    pand        m6,    m0
-    por         m6,    m5
+    mov         r4d,    r3d
+    shr         r3d,    4
+    jz         .next
+.loop:
+    movu        m2,     [r0 + r1]            ; m2 = pRec[x]
+    movu        m3,     [r0 + r2]            ; m3 = pTmpU[x]
+    pxor        m4,     m2,     m0
+    pxor        m3,     m0
+    pcmpgtb     m5,     m4,     m3
+    pcmpgtb     m3,     m4
+    pand        m5,     m1
+    por         m5,     m3
+    movu        [r0],   m5
+
+    add         r0,     16
+    dec         r3d
+    jnz        .loop
 
-    movu        [r0],  m6
+    ; process partial
+.next:
+    and         r4d, 15
+    jz         .end
+
+    movu        m2,     [r0 + r1]            ; m2 = pRec[x]
+    movu        m3,     [r0 + r2]            ; m3 = pTmpU[x]
+    pxor        m4,     m2,     m0
+    pxor        m3,     m0
+    pcmpgtb     m5,     m4,     m3
+    pcmpgtb     m3,     m4
+    pand        m5,     m1
+    por         m5,     m3
+
+    lea         r3,     [pb_movemask + 16]
+    sub         r3,     r4
+    movu        xmm0,   [r3]
+    movu        m3,     [r0]
+    pblendvb    m5,     m5,     m3,     xmm0
+    movu        [r0],   m5
 
-    add         r0,    r4
-    add         r1,    r4
-    add         r2,    r4
-    dec         r3d
-    jnz         .loop
+.end:
+    RET
+
+INIT_YMM avx2
+cglobal calSign, 4, 5, 6
+    vbroadcasti128  m0,     [pb_128]
+    mova            m1,     [pb_1]
+
+    sub             r1,     r0
+    sub             r2,     r0
+
+    mov             r4d,    r3d
+    shr             r3d,    5
+    jz              .next
+.loop:
+    movu            m2,     [r0 + r1]            ; m2 = pRec[x]
+    movu            m3,     [r0 + r2]            ; m3 = pTmpU[x]
+    pxor            m4,     m2,     m0
+    pxor            m3,     m0
+    pcmpgtb         m5,     m4,     m3
+    pcmpgtb         m3,     m4
+    pand            m5,     m1
+    por             m5,     m3
+    movu            [r0],   m5
+
+    add             r0,     mmsize
+    dec             r3d
+    jnz             .loop
+
+    ; process partial
+.next:
+    and             r4d,    31
+    jz              .end
+
+    movu            m2,     [r0 + r1]            ; m2 = pRec[x]
+    movu            m3,     [r0 + r2]            ; m3 = pTmpU[x]
+    pxor            m4,     m2,     m0
+    pxor            m3,     m0
+    pcmpgtb         m5,     m4,     m3
+    pcmpgtb         m3,     m4
+    pand            m5,     m1
+    por             m5,     m3
+
+    lea             r3,     [pb_movemask_32 + 32]
+    sub             r3,     r4
+    movu            m0,     [r3]
+    movu            m3,     [r0]
+    pblendvb        m5,     m5,     m3,     m0
+    movu            [r0],   m5
+
+.end:
     RET
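The calSign kernels above compute, for each position, the sign of `src1[x] - src2[x]` as -1, 0, or 1. The XOR with `pb_128` biases the unsigned pixel bytes into signed range so the signed compare `pcmpgtb` gives the right ordering. A scalar sketch of that semantic (illustrative, not x265's exact C reference):

```c
#include <stdint.h>

/* Scalar sketch of calSign: dst[x] = sign(src1[x] - src2[x]).
 * The SIMD version XORs both inputs with 128 (pb_128) so that unsigned
 * pixels can be ordered with the signed pcmpgtb instruction. */
static void cal_sign_ref(int8_t *dst, const uint8_t *src1,
                         const uint8_t *src2, int width)
{
    for (int x = 0; x < width; x++)
    {
        int d = (int)src1[x] - (int)src2[x];
        dst[x] = (int8_t)((d > 0) - (d < 0));   /* 1, 0 or -1 */
    }
}
```

Note the new asm also addresses `src1`/`src2` relative to `dst` (`sub r1, r0` / `sub r2, r0`) so only one pointer has to be advanced per iteration.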
x265_1.6.tar.gz/source/common/x86/loopfilter.h -> x265_1.7.tar.gz/source/common/x86/loopfilter.h Changed
 
@@ -25,11 +25,21 @@
 #ifndef X265_LOOPFILTER_H
 #define X265_LOOPFILTER_H
 
-void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t signLeft);
+void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
+void x265_saoCuOrgE0_avx2(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
 void x265_saoCuOrgE1_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_2Rows_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_2Rows_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 void x265_saoCuOrgE2_sse4(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
+void x265_saoCuOrgE2_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
+void x265_saoCuOrgE2_32_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 void x265_saoCuOrgE3_sse4(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
+void x265_saoCuOrgE3_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
+void x265_saoCuOrgE3_32_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
 void x265_saoCuOrgB0_sse4(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
+void x265_saoCuOrgB0_avx2(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
 void x265_calSign_sse4(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
+void x265_calSign_avx2(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
 
 #endif // ifndef X265_LOOPFILTER_H
x265_1.6.tar.gz/source/common/x86/mc-a.asm -> x265_1.7.tar.gz/source/common/x86/mc-a.asm Changed
 
@@ -1895,8 +1895,10 @@
 
 ADDAVG_W8_H4_AVX2 4
 ADDAVG_W8_H4_AVX2 8
+ADDAVG_W8_H4_AVX2 12
 ADDAVG_W8_H4_AVX2 16
 ADDAVG_W8_H4_AVX2 32
+ADDAVG_W8_H4_AVX2 64
 
 %macro ADDAVG_W12_H4_AVX2 1
 INIT_YMM avx2
@@ -1982,6 +1984,7 @@
 %endmacro
 
 ADDAVG_W12_H4_AVX2 16
+ADDAVG_W12_H4_AVX2 32
 
 %macro ADDAVG_W16_H4_AVX2 1
 INIT_YMM avx2
@@ -2044,6 +2047,7 @@
 ADDAVG_W16_H4_AVX2 8
 ADDAVG_W16_H4_AVX2 12
 ADDAVG_W16_H4_AVX2 16
+ADDAVG_W16_H4_AVX2 24
 ADDAVG_W16_H4_AVX2 32
 ADDAVG_W16_H4_AVX2 64
 
@@ -2101,6 +2105,7 @@
 %endmacro
 
 ADDAVG_W24_H2_AVX2 32
+ADDAVG_W24_H2_AVX2 64
 
 %macro ADDAVG_W32_H2_AVX2 1
 INIT_YMM avx2
@@ -2157,6 +2162,7 @@
 ADDAVG_W32_H2_AVX2 16
 ADDAVG_W32_H2_AVX2 24
 ADDAVG_W32_H2_AVX2 32
+ADDAVG_W32_H2_AVX2 48
 ADDAVG_W32_H2_AVX2 64
 
 %macro ADDAVG_W64_H2_AVX2 1
x265_1.6.tar.gz/source/common/x86/pixel-a.asm -> x265_1.7.tar.gz/source/common/x86/pixel-a.asm Changed
 
@@ -7078,6 +7078,117 @@
 .end:
     RET
 
+; Input 16bpp, Output 8bpp
+;-------------------------------------------------------------------------------------------------------------------------------------
+;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
+;-------------------------------------------------------------------------------------------------------------------------------------
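The downShift_16 kernel added below packs 16-bit samples down to 8-bit pixels: shift right by `shift`, then saturate to 255 (`packuswb`). A scalar sketch of that behavior (illustrative; the prototype's `mask` parameter is not needed in this sketch because the saturation already bounds the result):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of planecopy_sp / downShift_16: shift each 16-bit sample
 * right and saturate to 8 bits, mimicking psrlw + packuswb. Illustrative
 * only, not x265's exact C reference. */
static void plane_copy_sp_ref(const uint16_t *src, ptrdiff_t srcStride,
                              uint8_t *dst, ptrdiff_t dstStride,
                              int width, int height, int shift)
{
    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < width; x++)
        {
            unsigned v = src[x] >> shift;
            dst[x] = (uint8_t)(v > 255 ? 255 : v);   /* packuswb saturation */
        }
}
```

The asm's tail chain (`.loop32`/`.process8`/`.process4`/`.process2`/`.process1`) handles widths that are not a multiple of 32 by halving the store width step by step.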
+INIT_YMM avx2
+cglobal downShift_16, 6,7,3
+    movd        xm0, r6m        ; m0 = shift
+    add         r1d, r1d
+    dec         r5d
+.loopH:
+    xor         r6, r6
+.loopW:
+    movu        m1, [r0 + r6 * 2 +  0]
+    movu        m2, [r0 + r6 * 2 + 32]
+    vpsrlw      m1, xm0
+    vpsrlw      m2, xm0
+    packuswb    m1, m2
+    vpermq      m1, m1, 11011000b
+    movu        [r2 + r6], m1
+
+    add         r6d, mmsize
+    cmp         r6d, r4d
+    jl          .loopW
+
+    ; move to next row
+    add         r0, r1
+    add         r2, r3
+    dec         r5d
+    jnz         .loopH
+
+; processing last row of every frame [To handle width which not a multiple of 32]
+    mov         r6d, r4d
+    and         r4d, 31
+    shr         r6d, 5
+
+.loop32:
+    movu        m1, [r0]
+    movu        m2, [r0 + 32]
+    psrlw       m1, xm0
+    psrlw       m2, xm0
+    packuswb    m1, m2
+    vpermq      m1, m1, 11011000b
+    movu        [r2], m1
+
+    add         r0, 2*mmsize
+    add         r2, mmsize
+    dec         r6d
+    jnz         .loop32
+
+    cmp         r4d, 16
+    jl          .process8
+    movu        m1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    vpermq      m1, m1, 10001000b
+    movu        [r2], xm1
+
+    add         r0, mmsize
+    add         r2, 16
+    sub         r4d, 16
+    jz          .end
+
+.process8:
+    cmp         r4d, 8
+    jl          .process4
+    movu        m1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movq        [r2], xm1
+
+    add         r0, 16
+    add         r2, 8
+    sub         r4d, 8
+    jz          .end
+
+.process4:
+    cmp         r4d, 4
+    jl          .process2
+    movq        xm1,[r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        [r2], xm1
+
+    add         r0, 8
+    add         r2, 4
+    sub         r4d, 4
+    jz          .end
+
+.process2:
+    cmp         r4d, 2
+    jl          .process1
+    movd        xm1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        r6d, xm1
+    mov         [r2], r6w
+
+    add         r0, 4
+    add         r2, 2
+    sub         r4d, 2
+    jz          .end
+
+.process1:
+    movd        xm1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        r3d, xm1
+    mov         [r2], r3b
+.end:
+    RET
+
 ; Input 8bpp, Output 16bpp
 ;---------------------------------------------------------------------------------------------------------------------
 ;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift)
@@ -10395,3 +10506,1372 @@
     mov             rsp, r5
     RET
 %endif
+
+;;---------------------------------------------------------------
+;; SATD AVX2
+;; int pixel_satd(const pixel*, intptr_t, const pixel*, intptr_t)
+;;---------------------------------------------------------------
+;; r0   - pix0
+;; r1   - pix0Stride
+;; r2   - pix1
+;; r3   - pix1Stride
+
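The calc_satd_16x8/calc_satd_16x4 helpers below evaluate Hadamard transforms of the 8-bit residual and accumulate absolute transformed coefficients into m8/m9; the per-size pixel_satd entry points then reduce those accumulators to a scalar in eax. For orientation, a scalar 4x4 SATD sketch of the same idea (unnormalized sum of absolute Hadamard coefficients; illustrative, not x265's exact C primitive):

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Scalar 4x4 SATD sketch: Hadamard-transform the difference block and sum
 * the absolute coefficients. Illustrative reference only. */
static int satd_4x4_ref(const uint8_t *pix1, ptrdiff_t stride1,
                        const uint8_t *pix2, ptrdiff_t stride2)
{
    int d[4][4], tmp[4][4], sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

    for (int i = 0; i < 4; i++)          /* horizontal butterflies */
    {
        int s01 = d[i][0] + d[i][1], s23 = d[i][2] + d[i][3];
        int d01 = d[i][0] - d[i][1], d23 = d[i][2] - d[i][3];
        tmp[i][0] = s01 + s23; tmp[i][1] = s01 - s23;
        tmp[i][2] = d01 + d23; tmp[i][3] = d01 - d23;
    }
    for (int j = 0; j < 4; j++)          /* vertical butterflies + |.| sum */
    {
        int s01 = tmp[0][j] + tmp[1][j], s23 = tmp[2][j] + tmp[3][j];
        int d01 = tmp[0][j] - tmp[1][j], d23 = tmp[2][j] - tmp[3][j];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum;
}
```

The asm fuses the subtraction into `pmaddubsw` against the `hmul_16p` sign pattern and keeps partial sums in 16-bit lanes, widening to 32-bit only at accumulation time.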
133
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
134
+INIT_YMM avx2
135
+cglobal calc_satd_16x8    ; function to compute satd cost for 16 columns, 8 rows
136
+    pxor                m6, m6
137
+    vbroadcasti128      m0, [r0]
138
+    vbroadcasti128      m4, [r2]
139
+    vbroadcasti128      m1, [r0 + r1]
140
+    vbroadcasti128      m5, [r2 + r3]
141
+    pmaddubsw           m4, m7
142
+    pmaddubsw           m0, m7
143
+    pmaddubsw           m5, m7
144
+    pmaddubsw           m1, m7
145
+    psubw               m0, m4
146
+    psubw               m1, m5
147
+    vbroadcasti128      m2, [r0 + r1 * 2]
148
+    vbroadcasti128      m4, [r2 + r3 * 2]
149
+    vbroadcasti128      m3, [r0 + r4]
150
+    vbroadcasti128      m5, [r2 + r5]
151
+    pmaddubsw           m4, m7
152
+    pmaddubsw           m2, m7
153
+    pmaddubsw           m5, m7
154
+    pmaddubsw           m3, m7
155
+    psubw               m2, m4
156
+    psubw               m3, m5
157
+    lea                 r0, [r0 + r1 * 4]
158
+    lea                 r2, [r2 + r3 * 4]
159
+    paddw               m4, m0, m1
160
+    psubw               m1, m1, m0
161
+    paddw               m0, m2, m3
162
+    psubw               m3, m2
163
+    paddw               m2, m4, m0
164
+    psubw               m0, m4
165
+    paddw               m4, m1, m3
166
+    psubw               m3, m1
167
+    pabsw               m2, m2
168
+    pabsw               m0, m0
169
+    pabsw               m4, m4
170
+    pabsw               m3, m3
171
+    pblendw             m1, m2, m0, 10101010b
172
+    pslld               m0, 16
173
+    psrld               m2, 16
174
+    por                 m0, m2
175
+    pmaxsw              m1, m0
176
+    paddw               m6, m1
177
+    pblendw             m2, m4, m3, 10101010b
178
+    pslld               m3, 16
179
+    psrld               m4, 16
180
+    por                 m3, m4
181
+    pmaxsw              m2, m3
182
+    paddw               m6, m2
183
+    vbroadcasti128      m1, [r0]
184
+    vbroadcasti128      m4, [r2]
185
+    vbroadcasti128      m2, [r0 + r1]
186
+    vbroadcasti128      m5, [r2 + r3]
187
+    pmaddubsw           m4, m7
188
+    pmaddubsw           m1, m7
189
+    pmaddubsw           m5, m7
190
+    pmaddubsw           m2, m7
191
+    psubw               m1, m4
192
+    psubw               m2, m5
193
+    vbroadcasti128      m0, [r0 + r1 * 2]
194
+    vbroadcasti128      m4, [r2 + r3 * 2]
195
+    vbroadcasti128      m3, [r0 + r4]
196
+    vbroadcasti128      m5, [r2 + r5]
197
+    lea                 r0, [r0 + r1 * 4]
198
+    lea                 r2, [r2 + r3 * 4]
199
+    pmaddubsw           m4, m7
200
+    pmaddubsw           m0, m7
201
+    pmaddubsw           m5, m7
202
+    pmaddubsw           m3, m7
203
+    psubw               m0, m4
204
+    psubw               m3, m5
205
+    paddw               m4, m1, m2
206
+    psubw               m2, m1
207
+    paddw               m1, m0, m3
208
+    psubw               m3, m0
209
+    paddw               m0, m4, m1
210
+    psubw               m1, m4
211
+    paddw               m4, m2, m3
212
+    psubw               m3, m2
213
+    pabsw               m0, m0
214
+    pabsw               m1, m1
215
+    pabsw               m4, m4
216
+    pabsw               m3, m3
217
+    pblendw             m2, m0, m1, 10101010b
218
+    pslld               m1, 16
219
+    psrld               m0, 16
220
+    por                 m1, m0
221
+    pmaxsw              m2, m1
222
+    paddw               m6, m2
223
+    pblendw             m0, m4, m3, 10101010b
224
+    pslld               m3, 16
225
+    psrld               m4, 16
226
+    por                 m3, m4
227
+    pmaxsw              m0, m3
228
+    paddw               m6, m0
229
+    vextracti128        xm0, m6, 1
230
+    pmovzxwd            m6, xm6
231
+    pmovzxwd            m0, xm0
232
+    paddd               m8, m6
233
+    paddd               m9, m0
234
+    ret
235
+
236
+cglobal calc_satd_16x4    ; function to compute satd cost for 16 columns, 4 rows
237
+    pxor                m6, m6
238
+    vbroadcasti128      m0, [r0]
239
+    vbroadcasti128      m4, [r2]
240
+    vbroadcasti128      m1, [r0 + r1]
241
+    vbroadcasti128      m5, [r2 + r3]
242
+    pmaddubsw           m4, m7
243
+    pmaddubsw           m0, m7
244
+    pmaddubsw           m5, m7
245
+    pmaddubsw           m1, m7
246
+    psubw               m0, m4
247
+    psubw               m1, m5
248
+    vbroadcasti128      m2, [r0 + r1 * 2]
249
+    vbroadcasti128      m4, [r2 + r3 * 2]
250
+    vbroadcasti128      m3, [r0 + r4]
251
+    vbroadcasti128      m5, [r2 + r5]
252
+    pmaddubsw           m4, m7
253
+    pmaddubsw           m2, m7
254
+    pmaddubsw           m5, m7
255
+    pmaddubsw           m3, m7
256
+    psubw               m2, m4
257
+    psubw               m3, m5
258
+    paddw               m4, m0, m1
259
+    psubw               m1, m1, m0
260
+    paddw               m0, m2, m3
261
+    psubw               m3, m2
262
+    paddw               m2, m4, m0
263
+    psubw               m0, m4
264
+    paddw               m4, m1, m3
265
+    psubw               m3, m1
266
+    pabsw               m2, m2
267
+    pabsw               m0, m0
268
+    pabsw               m4, m4
269
+    pabsw               m3, m3
270
+    pblendw             m1, m2, m0, 10101010b
271
+    pslld               m0, 16
272
+    psrld               m2, 16
273
+    por                 m0, m2
274
+    pmaxsw              m1, m0
275
+    paddw               m6, m1
276
+    pblendw             m2, m4, m3, 10101010b
277
+    pslld               m3, 16
278
+    psrld               m4, 16
279
+    por                 m3, m4
280
+    pmaxsw              m2, m3
281
+    paddw               m6, m2
282
+    vextracti128        xm0, m6, 1
283
+    pmovzxwd            m6, xm6
284
+    pmovzxwd            m0, xm0
285
+    paddd               m8, m6
286
+    paddd               m9, m0
287
+    ret
288
+
289
+cglobal pixel_satd_16x4, 4,6,10         ; if WIN64 && cpuflag(avx2)
290
+    mova            m7, [hmul_16p]
291
+    lea             r4, [3 * r1]
292
+    lea             r5, [3 * r3]
293
+    pxor            m8, m8
294
+    pxor            m9, m9
295
+
296
+    call            calc_satd_16x4
297
+
298
+    paddd           m8, m9
299
+    vextracti128    xm0, m8, 1
300
+    paddd           xm0, xm8
301
+    movhlps         xm1, xm0
302
+    paddd           xm0, xm1
303
+    pshuflw         xm1, xm0, q0032
304
+    paddd           xm0, xm1
305
+    movd            eax, xm0
306
+    RET
307
+
308
+cglobal pixel_satd_16x12, 4,6,10        ; if WIN64 && cpuflag(avx2)
309
+    mova            m7, [hmul_16p]
310
+    lea             r4, [3 * r1]
311
+    lea             r5, [3 * r3]
312
+    pxor            m8, m8
313
+    pxor            m9, m9
314
+
315
+    call            calc_satd_16x8
316
+    call            calc_satd_16x4
317
+
318
+    paddd           m8, m9
319
+    vextracti128    xm0, m8, 1
320
+    paddd           xm0, xm8
321
+    movhlps         xm1, xm0
322
+    paddd           xm0, xm1
323
+    pshuflw         xm1, xm0, q0032
324
+    paddd           xm0, xm1
325
+    movd            eax, xm0
326
+    RET
327
+
328
+cglobal pixel_satd_16x32, 4,6,10        ; if WIN64 && cpuflag(avx2)
329
+    mova            m7, [hmul_16p]
330
+    lea             r4, [3 * r1]
331
+    lea             r5, [3 * r3]
332
+    pxor            m8, m8
333
+    pxor            m9, m9
334
+
335
+    call            calc_satd_16x8
336
+    call            calc_satd_16x8
337
+    call            calc_satd_16x8
338
+    call            calc_satd_16x8
339
+
340
+    paddd           m8, m9
341
+    vextracti128    xm0, m8, 1
342
+    paddd           xm0, xm8
343
+    movhlps         xm1, xm0
344
+    paddd           xm0, xm1
345
+    pshuflw         xm1, xm0, q0032
346
+    paddd           xm0, xm1
347
+    movd            eax, xm0
348
+    RET
349
+
350
+cglobal pixel_satd_16x64, 4,6,10        ; if WIN64 && cpuflag(avx2)
351
+    mova            m7, [hmul_16p]
352
+    lea             r4, [3 * r1]
353
+    lea             r5, [3 * r3]
354
+    pxor            m8, m8
355
+    pxor            m9, m9
356
+
357
+    call            calc_satd_16x8
358
+    call            calc_satd_16x8
359
+    call            calc_satd_16x8
360
+    call            calc_satd_16x8
361
+    call            calc_satd_16x8
362
+    call            calc_satd_16x8
363
+    call            calc_satd_16x8
364
+    call            calc_satd_16x8
365
+
366
+    paddd           m8, m9
367
+    vextracti128    xm0, m8, 1
368
+    paddd           xm0, xm8
369
+    movhlps         xm1, xm0
370
+    paddd           xm0, xm1
371
+    pshuflw         xm1, xm0, q0032
372
+    paddd           xm0, xm1
373
+    movd            eax, xm0
374
+    RET
375
+
376
+cglobal pixel_satd_32x8, 4,8,10          ; if WIN64 && cpuflag(avx2)
377
+    mova            m7, [hmul_16p]
378
+    lea             r4, [3 * r1]
379
+    lea             r5, [3 * r3]
380
+    pxor            m8, m8
381
+    pxor            m9, m9
382
+    mov             r6, r0
383
+    mov             r7, r2
384
+
385
+    call            calc_satd_16x8
386
+
387
+    lea             r0, [r6 + 16]
388
+    lea             r2, [r7 + 16]
389
+
390
+    call            calc_satd_16x8
391
+
392
+    paddd           m8, m9
393
+    vextracti128    xm0, m8, 1
394
+    paddd           xm0, xm8
395
+    movhlps         xm1, xm0
396
+    paddd           xm0, xm1
397
+    pshuflw         xm1, xm0, q0032
398
+    paddd           xm0, xm1
399
+    movd            eax, xm0
400
+    RET
401
+
402
+cglobal pixel_satd_32x16, 4,8,10         ; if WIN64 && cpuflag(avx2)
403
+    mova            m7, [hmul_16p]
404
+    lea             r4, [3 * r1]
405
+    lea             r5, [3 * r3]
406
+    pxor            m8, m8
407
+    pxor            m9, m9
408
+    mov             r6, r0
409
+    mov             r7, r2
410
+
411
+    call            calc_satd_16x8
412
+    call            calc_satd_16x8
413
+
414
+    lea             r0, [r6 + 16]
415
+    lea             r2, [r7 + 16]
416
+
417
+    call            calc_satd_16x8
418
+    call            calc_satd_16x8
419
+
420
+    paddd           m8, m9
421
+    vextracti128    xm0, m8, 1
422
+    paddd           xm0, xm8
423
+    movhlps         xm1, xm0
424
+    paddd           xm0, xm1
425
+    pshuflw         xm1, xm0, q0032
426
+    paddd           xm0, xm1
427
+    movd            eax, xm0
428
+    RET
429
+
430
+cglobal pixel_satd_32x24, 4,8,10         ; if WIN64 && cpuflag(avx2)
431
+    mova            m7, [hmul_16p]
432
+    lea             r4, [3 * r1]
433
+    lea             r5, [3 * r3]
434
+    pxor            m8, m8
435
+    pxor            m9, m9
436
+    mov             r6, r0
437
+    mov             r7, r2
438
+
439
+    call            calc_satd_16x8
440
+    call            calc_satd_16x8
441
+    call            calc_satd_16x8
442
+
443
+    lea             r0, [r6 + 16]
444
+    lea             r2, [r7 + 16]
445
+
446
+    call            calc_satd_16x8
447
+    call            calc_satd_16x8
448
+    call            calc_satd_16x8
449
+
450
+    paddd           m8, m9
451
+    vextracti128    xm0, m8, 1
452
+    paddd           xm0, xm8
453
+    movhlps         xm1, xm0
454
+    paddd           xm0, xm1
455
+    pshuflw         xm1, xm0, q0032
456
+    paddd           xm0, xm1
457
+    movd            eax, xm0
458
+    RET
459
+
460
+cglobal pixel_satd_32x32, 4,8,10         ; if WIN64 && cpuflag(avx2)
461
+    mova            m7, [hmul_16p]
462
+    lea             r4, [3 * r1]
463
+    lea             r5, [3 * r3]
464
+    pxor            m8, m8
465
+    pxor            m9, m9
466
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_32x64, 4,8,10         ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_48x64, 4,8,10        ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_64x16, 4,8,10         ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 48]
+    lea             r2, [r7 + 48]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_64x32, 4,8,10         ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 48]
+    lea             r2, [r7 + 48]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_64x48, 4,8,10        ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 48]
+    lea             r2, [r7 + 48]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+
+cglobal pixel_satd_64x64, 4,8,10        ; if WIN64 && cpuflag(avx2)
+    mova            m7, [hmul_16p]
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m8, m8
+    pxor            m9, m9
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 16]
+    lea             r2, [r7 + 16]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    lea             r0, [r6 + 48]
+    lea             r2, [r7 + 48]
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    paddd           m8, m9
+    vextracti128    xm0, m8, 1
+    paddd           xm0, xm8
+    movhlps         xm1, xm0
+    paddd           xm0, xm1
+    pshuflw         xm1, xm0, q0032
+    paddd           xm0, xm1
+    movd            eax, xm0
+    RET
+%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
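Every wide-block wrapper above follows the same tiling scheme: the base pointers are saved in r6/r7, calc_satd_16x8 is called height/8 times per 16-pixel-wide column (each call advances r0/r2 by 8 rows), and the `lea r0, [r6 + 16]`-style resets step to the next column. A hypothetical scalar sketch of that decomposition (satd_16x8 here is a simple stand-in cost function, not the real Hadamard kernel):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the asm helper: any per-strip cost illustrates the
 * tiling.  Here: sum of absolute differences over a 16x8 strip. */
static int satd_16x8(const uint8_t *p1, ptrdiff_t s1,
                     const uint8_t *p2, ptrdiff_t s2)
{
    int cost = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 16; x++) {
            int d = p1[y * s1 + x] - p2[y * s2 + x];
            cost += d < 0 ? -d : d;
        }
    return cost;
}

/* Mirror of the asm structure, e.g. pixel_satd_64x32: the outer loop
 * corresponds to the lea-based column resets, the inner loop to the
 * unrolled run of calc_satd_16x8 calls down one column. */
static int satd_WxH(const uint8_t *p1, ptrdiff_t s1,
                    const uint8_t *p2, ptrdiff_t s2, int w, int h)
{
    int cost = 0;
    for (int x = 0; x < w; x += 16)      /* next 16-wide column */
        for (int y = 0; y < h; y += 8)   /* one helper call per 8 rows */
            cost += satd_16x8(p1 + y * s1 + x, s1,
                              p2 + y * s2 + x, s2);
    return cost;
}
```

The assembly fully unrolls both loops; the `cost +=` accumulation corresponds to the running sums in m8/m9, with the horizontal reduction performed once at the end of each wrapper.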
+
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
+INIT_YMM avx2
+cglobal calc_satd_16x8    ; function to compute satd cost for 16 columns, 8 rows
+    ; rows 0-3
+    movu            m0, [r0]
+    movu            m4, [r2]
+    psubw           m0, m4
+    movu            m1, [r0 + r1]
+    movu            m5, [r2 + r3]
+    psubw           m1, m5
+    movu            m2, [r0 + r1 * 2]
+    movu            m4, [r2 + r3 * 2]
+    psubw           m2, m4
+    movu            m3, [r0 + r4]
+    movu            m5, [r2 + r5]
+    psubw           m3, m5
+    lea             r0, [r0 + r1 * 4]
+    lea             r2, [r2 + r3 * 4]
+    paddw           m4, m0, m1
+    psubw           m1, m0
+    paddw           m0, m2, m3
+    psubw           m3, m2
+    punpckhwd       m2, m4, m1
+    punpcklwd       m4, m1
+    punpckhwd       m1, m0, m3
+    punpcklwd       m0, m3
+    paddw           m3, m4, m0
+    psubw           m0, m4
+    paddw           m4, m2, m1
+    psubw           m1, m2
+    punpckhdq       m2, m3, m0
+    punpckldq       m3, m0
+    paddw           m0, m3, m2
+    psubw           m2, m3
+    punpckhdq       m3, m4, m1
+    punpckldq       m4, m1
+    paddw           m1, m4, m3
+    psubw           m3, m4
+    punpckhqdq      m4, m0, m1
+    punpcklqdq      m0, m1
+    pabsw           m0, m0
+    pabsw           m4, m4
+    pmaxsw          m0, m0, m4
+    punpckhqdq      m1, m2, m3
+    punpcklqdq      m2, m3
+    pabsw           m2, m2
+    pabsw           m1, m1
+    pmaxsw          m2, m1
+    pxor            m7, m7
+    mova            m1, m0
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m0
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    pxor            m7, m7
+    mova            m1, m2
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m2
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    ; rows 4-7
+    movu            m0, [r0]
+    movu            m4, [r2]
+    psubw           m0, m4
+    movu            m1, [r0 + r1]
+    movu            m5, [r2 + r3]
+    psubw           m1, m5
+    movu            m2, [r0 + r1 * 2]
+    movu            m4, [r2 + r3 * 2]
+    psubw           m2, m4
+    movu            m3, [r0 + r4]
+    movu            m5, [r2 + r5]
+    psubw           m3, m5
+    lea             r0, [r0 + r1 * 4]
+    lea             r2, [r2 + r3 * 4]
+    paddw           m4, m0, m1
+    psubw           m1, m0
+    paddw           m0, m2, m3
+    psubw           m3, m2
+    punpckhwd       m2, m4, m1
+    punpcklwd       m4, m1
+    punpckhwd       m1, m0, m3
+    punpcklwd       m0, m3
+    paddw           m3, m4, m0
+    psubw           m0, m4
+    paddw           m4, m2, m1
+    psubw           m1, m2
+    punpckhdq       m2, m3, m0
+    punpckldq       m3, m0
+    paddw           m0, m3, m2
+    psubw           m2, m3
+    punpckhdq       m3, m4, m1
+    punpckldq       m4, m1
+    paddw           m1, m4, m3
+    psubw           m3, m4
+    punpckhqdq      m4, m0, m1
+    punpcklqdq      m0, m1
+    pabsw           m0, m0
+    pabsw           m4, m4
+    pmaxsw          m0, m0, m4
+    punpckhqdq      m1, m2, m3
+    punpcklqdq      m2, m3
+    pabsw           m2, m2
+    pabsw           m1, m1
+    pmaxsw          m2, m1
+    pxor            m7, m7
+    mova            m1, m0
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m0
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    pxor            m7, m7
+    mova            m1, m2
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m2
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    ret
+
+cglobal calc_satd_16x4    ; function to compute satd cost for 16 columns, 4 rows
+    ; rows 0-3
+    movu            m0, [r0]
+    movu            m4, [r2]
+    psubw           m0, m4
+    movu            m1, [r0 + r1]
+    movu            m5, [r2 + r3]
+    psubw           m1, m5
+    movu            m2, [r0 + r1 * 2]
+    movu            m4, [r2 + r3 * 2]
+    psubw           m2, m4
+    movu            m3, [r0 + r4]
+    movu            m5, [r2 + r5]
+    psubw           m3, m5
+    lea             r0, [r0 + r1 * 4]
+    lea             r2, [r2 + r3 * 4]
+    paddw           m4, m0, m1
+    psubw           m1, m0
+    paddw           m0, m2, m3
+    psubw           m3, m2
+    punpckhwd       m2, m4, m1
+    punpcklwd       m4, m1
+    punpckhwd       m1, m0, m3
+    punpcklwd       m0, m3
+    paddw           m3, m4, m0
+    psubw           m0, m4
+    paddw           m4, m2, m1
+    psubw           m1, m2
+    punpckhdq       m2, m3, m0
+    punpckldq       m3, m0
+    paddw           m0, m3, m2
+    psubw           m2, m3
+    punpckhdq       m3, m4, m1
+    punpckldq       m4, m1
+    paddw           m1, m4, m3
+    psubw           m3, m4
+    punpckhqdq      m4, m0, m1
+    punpcklqdq      m0, m1
+    pabsw           m0, m0
+    pabsw           m4, m4
+    pmaxsw          m0, m0, m4
+    punpckhqdq      m1, m2, m3
+    punpcklqdq      m2, m3
+    pabsw           m2, m2
+    pabsw           m1, m1
+    pmaxsw          m2, m1
+    pxor            m7, m7
+    mova            m1, m0
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m0
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    pxor            m7, m7
+    mova            m1, m2
+    punpcklwd       m1, m7
+    paddd           m6, m1
+    mova            m1, m2
+    punpckhwd       m1, m7
+    paddd           m6, m1
+    ret
+
+cglobal pixel_satd_16x4, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x4
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_16x8, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_16x12, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x8
+    call            calc_satd_16x4
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_16x16, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_16x32, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_16x64, 4,6,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_32x8, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_32x16, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_32x24, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_32x32, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_32x64, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_48x64, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 64]
+    lea             r2, [r7 + 64]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
1284
+cglobal pixel_satd_64x16, 4,8,8
1285
+    add             r1d, r1d
1286
+    add             r3d, r3d
1287
+    lea             r4, [3 * r1]
1288
+    lea             r5, [3 * r3]
1289
+    pxor            m6, m6
1290
+    mov             r6, r0
1291
+    mov             r7, r2
1292
+
1293
+    call            calc_satd_16x8
1294
+    call            calc_satd_16x8
1295
+
1296
+    lea             r0, [r6 + 32]
1297
+    lea             r2, [r7 + 32]
1298
+
1299
+    call            calc_satd_16x8
1300
+    call            calc_satd_16x8
1301
+
1302
+    lea             r0, [r6 + 64]
1303
+    lea             r2, [r7 + 64]
1304
+
1305
+    call            calc_satd_16x8
1306
+    call            calc_satd_16x8
1307
+
1308
+    lea             r0, [r6 + 96]
1309
+    lea             r2, [r7 + 96]
1310
+
1311
+    call            calc_satd_16x8
1312
+    call            calc_satd_16x8
1313
+
1314
+    vextracti128    xm7, m6, 1
1315
+    paddd           xm6, xm7
1316
+    pxor            xm7, xm7
1317
+    movhlps         xm7, xm6
1318
+    paddd           xm6, xm7
1319
+    pshufd          xm7, xm6, 1
1320
+    paddd           xm6, xm7
1321
+    movd            eax, xm6
1322
+    RET
1323
+
1324
+cglobal pixel_satd_64x32, 4,8,8
1325
+    add             r1d, r1d
1326
+    add             r3d, r3d
1327
+    lea             r4, [3 * r1]
1328
+    lea             r5, [3 * r3]
1329
+    pxor            m6, m6
1330
+    mov             r6, r0
1331
+    mov             r7, r2
1332
+
1333
+    call            calc_satd_16x8
1334
+    call            calc_satd_16x8
1335
+    call            calc_satd_16x8
1336
+    call            calc_satd_16x8
1337
+
1338
+    lea             r0, [r6 + 32]
1339
+    lea             r2, [r7 + 32]
1340
+
1341
+    call            calc_satd_16x8
1342
+    call            calc_satd_16x8
1343
+    call            calc_satd_16x8
1344
+    call            calc_satd_16x8
1345
+
1346
+    lea             r0, [r6 + 64]
1347
+    lea             r2, [r7 + 64]
1348
+
1349
+    call            calc_satd_16x8
1350
+    call            calc_satd_16x8
1351
+    call            calc_satd_16x8
1352
+    call            calc_satd_16x8
1353
+
1354
+    lea             r0, [r6 + 96]
1355
+    lea             r2, [r7 + 96]
1356
+
1357
+    call            calc_satd_16x8
1358
+    call            calc_satd_16x8
1359
+    call            calc_satd_16x8
1360
+    call            calc_satd_16x8
1361
+
1362
+    vextracti128    xm7, m6, 1
1363
+    paddd           xm6, xm7
1364
+    pxor            xm7, xm7
1365
+    movhlps         xm7, xm6
1366
+    paddd           xm6, xm7
1367
+    pshufd          xm7, xm6, 1
1368
+    paddd           xm6, xm7
1369
+    movd            eax, xm6
1370
+    RET
1371
+
1372
+cglobal pixel_satd_64x48, 4,8,8
1373
+    add             r1d, r1d
1374
+    add             r3d, r3d
1375
+    lea             r4, [3 * r1]
1376
+    lea             r5, [3 * r3]
1377
+    pxor            m6, m6
1378
+    mov             r6, r0
1379
+    mov             r7, r2
1380
+
1381
+    call            calc_satd_16x8
1382
+    call            calc_satd_16x8
1383
+    call            calc_satd_16x8
1384
+    call            calc_satd_16x8
1385
+    call            calc_satd_16x8
1386
+    call            calc_satd_16x8
1387
+
1388
+    lea             r0, [r6 + 32]
1389
+    lea             r2, [r7 + 32]
1390
+
1391
+    call            calc_satd_16x8
1392
+    call            calc_satd_16x8
1393
+    call            calc_satd_16x8
1394
+    call            calc_satd_16x8
1395
+    call            calc_satd_16x8
1396
+    call            calc_satd_16x8
1397
+
1398
+    lea             r0, [r6 + 64]
1399
+    lea             r2, [r7 + 64]
1400
+
1401
+    call            calc_satd_16x8
1402
+    call            calc_satd_16x8
1403
+    call            calc_satd_16x8
1404
+    call            calc_satd_16x8
1405
+    call            calc_satd_16x8
1406
+    call            calc_satd_16x8
1407
+
1408
+    lea             r0, [r6 + 96]
1409
+    lea             r2, [r7 + 96]
1410
+
1411
+    call            calc_satd_16x8
1412
+    call            calc_satd_16x8
1413
+    call            calc_satd_16x8
1414
+    call            calc_satd_16x8
1415
+    call            calc_satd_16x8
1416
+    call            calc_satd_16x8
1417
+
1418
+    vextracti128    xm7, m6, 1
1419
+    paddd           xm6, xm7
1420
+    pxor            xm7, xm7
1421
+    movhlps         xm7, xm6
1422
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+
+cglobal pixel_satd_64x64, 4,8,8
+    add             r1d, r1d
+    add             r3d, r3d
+    lea             r4, [3 * r1]
+    lea             r5, [3 * r3]
+    pxor            m6, m6
+    mov             r6, r0
+    mov             r7, r2
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 32]
+    lea             r2, [r7 + 32]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 64]
+    lea             r2, [r7 + 64]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    lea             r0, [r6 + 96]
+    lea             r2, [r7 + 96]
+
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+    call            calc_satd_16x8
+
+    vextracti128    xm7, m6, 1
+    paddd           xm6, xm7
+    pxor            xm7, xm7
+    movhlps         xm7, xm6
+    paddd           xm6, xm7
+    pshufd          xm7, xm6, 1
+    paddd           xm6, xm7
+    movd            eax, xm6
+    RET
+%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
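As a rough illustration of the tiling pattern above (my sketch, not x265 code): `pixel_satd_64x64` accumulates a 16x8 primitive over four 16-pixel-wide columns (the `r6 + 32/64/96` re-bases) of eight tiles each. The stand-in `tile_cost_16x8` here is a plain SAD, not the Hadamard-based SATD that `calc_satd_16x8` computes; only the traversal is the point.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in primitive: a plain 16x8 SAD (the real calc_satd_16x8
 * computes a Hadamard-transformed cost, but it is tiled the same way). */
static int tile_cost_16x8(const uint16_t *p1, intptr_t s1,
                          const uint16_t *p2, intptr_t s2)
{
    int sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 16; x++)
        {
            int d = p1[y * s1 + x] - p2[y * s2 + x];
            sum += d < 0 ? -d : d;
        }
    return sum;
}

/* 64x64 cost as 4 columns x 8 rows of 16x8 tiles, mirroring the
 * calc_satd_16x8 call sequence and the column re-bases in the asm. */
static int cost_64x64_tiled(const uint16_t *p1, intptr_t s1,
                            const uint16_t *p2, intptr_t s2)
{
    int sum = 0;
    for (int x = 0; x < 64; x += 16)
        for (int y = 0; y < 64; y += 8)
            sum += tile_cost_16x8(p1 + y * s1 + x, s1, p2 + y * s2 + x, s2);
    return sum;
}
```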
x265_1.6.tar.gz/source/common/x86/pixel-util.h -> x265_1.7.tar.gz/source/common/x86/pixel-util.h Changed
 
@@ -73,15 +73,18 @@
 float x265_pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width);
 float x265_pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width);
 
-void x265_scale1D_128to64_ssse3(pixel*, const pixel*, intptr_t);
-void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t);
+void x265_scale1D_128to64_ssse3(pixel*, const pixel*);
+void x265_scale1D_128to64_avx2(pixel*, const pixel*);
 void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t);
+void x265_scale2D_64to32_avx2(pixel*, const pixel*, intptr_t);
 
-int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+int x265_scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+int x265_scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+uint32_t x265_findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 #define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t*  dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
-    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  scr1, intptr_t srcStride0, intptr_t srcStride1);
+    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  src1, intptr_t srcStride0, intptr_t srcStride1);
 
 #define CHROMA_420_PIXELSUB_DEF(cpu) \
     SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 4, cpu); \
@@ -97,7 +100,7 @@
 
 #define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t*  dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
-    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  scr1, intptr_t srcStride0, intptr_t srcStride1);
+    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  src1, intptr_t srcStride0, intptr_t srcStride1);
 
 #define LUMA_PIXELSUB_DEF(cpu) \
     SETUP_LUMA_PIXELSUB_PS_FUNC(8,   8, cpu); \
x265_1.6.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.7.tar.gz/source/common/x86/pixel-util8.asm Changed
 
@@ -40,16 +40,17 @@
 ssim_c1:   times 4 dd 416          ; .01*.01*255*255*64
 ssim_c2:   times 4 dd 235963       ; .03*.03*255*255*64*63
 %endif
-mask_ff:   times 16 db 0xff
-           times 16 db 0
-deinterleave_shuf: db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
-deinterleave_word_shuf: db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
-hmul_16p:  times 16 db 1
-           times 8 db 1, -1
-hmulw_16p:  times 8 dw 1
-            times 4 dw 1, -1
 
-trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
+mask_ff:                times 16 db 0xff
+                        times 16 db 0
+deinterleave_shuf:      times  2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
+deinterleave_word_shuf: times  2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
+hmul_16p:               times 16 db 1
+                        times  8 db 1, -1
+hmulw_16p:              times  8 dw 1
+                        times  4 dw 1, -1
+
+trans8_shuf:            dd 0, 4, 1, 5, 2, 6, 3, 7
 
 SECTION .text
 
@@ -67,6 +68,7 @@
 cextern pb_2
 cextern pb_4
 cextern pb_8
+cextern pb_15
 cextern pb_16
 cextern pb_32
 cextern pb_64
@@ -616,7 +618,7 @@
 
 %if ARCH_X86_64 == 1
 INIT_YMM avx2
-cglobal quant, 5,5,10
+cglobal quant, 5,6,9
     ; fill qbits
     movd            xm4, r4d            ; m4 = qbits
 
@@ -627,7 +629,7 @@
     ; fill offset
     vpbroadcastd    m5, r5m             ; m5 = add
 
-    vpbroadcastw    m9, [pw_1]          ; m9 = word [1]
+    lea             r5, [pw_1]
 
     mov             r4d, r6m
     shr             r4d, 4
@@ -665,7 +667,7 @@
 
     ; count non-zero coeff
     ; TODO: popcnt is faster, but some CPU can't support
-    pminuw          m2, m9
+    pminuw          m2, [r5]
     paddw           m7, m2
 
     add             r0, mmsize
@@ -1285,9 +1287,8 @@
     mov          r6d, r6m
     shl          r6d, 16
     or           r6d, r5d          ; assuming both (w0<<6) and round are using maximum of 16 bits each.
-    movd         xm0, r6d
-    pshufd       xm0, xm0, 0       ; m0 = [w0<<6, round]
-    vinserti128  m0, m0, xm0, 1    ; document says (pshufd + vinserti128) can be replaced with vpbroadcastd m0, xm0, but having build problem, need to investigate
+
+    vpbroadcastd m0, r6d
 
     movd         xm1, r7m
     vpbroadcastd m2, r8m
@@ -1492,6 +1493,84 @@
     dec         r5d
     jnz         .loopH
     RET
+
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal weight_sp, 6, 9, 7
+    mov             r7d, r7m
+    shl             r7d, 16
+    or              r7d, r6m
+    vpbroadcastd    m0, r7d            ; m0 = times 8 dw w0, round
+    movd            xm1, r8m            ; m1 = [shift]
+    vpbroadcastd    m2, r9m            ; m2 = times 16 dw offset
+    vpbroadcastw    m3, [pw_1]
+    vpbroadcastw    m4, [pw_2000]
+
+    add             r2d, r2d            ; 2 * srcstride
+
+    mov             r7, r0
+    mov             r8, r1
+.loopH:
+    mov             r6d, r4d            ; width
+
+    ; save old src and dst
+    mov             r0, r7              ; src
+    mov             r1, r8              ; dst
+.loopW:
+    movu            m5, [r0]
+    paddw           m5, m4
+
+    punpcklwd       m6,m5, m3
+    pmaddwd         m6, m0
+    psrad           m6, xm1
+    paddd           m6, m2
+
+    punpckhwd       m5, m3
+    pmaddwd         m5, m0
+    psrad           m5, xm1
+    paddd           m5, m2
+
+    packssdw        m6, m5
+    packuswb        m6, m6
+    vpermq          m6, m6, 10001000b
+
+    sub             r6d, 16
+    jl              .width8
+    movu            [r1], xm6
+    je              .nextH
+    add             r0, 32
+    add             r1, 16
+    jmp             .loopW
+
+.width8:
+    add             r6d, 16
+    cmp             r6d, 8
+    jl              .width4
+    movq            [r1], xm6
+    je              .nextH
+    psrldq          m6, 8
+    sub             r6d, 8
+    add             r1, 8
+
+.width4:
+    cmp             r6d, 4
+    jl              .width2
+    movd            [r1], xm6
+    je              .nextH
+    add             r1, 4
+    pshufd          m6, m6, 1
+
+.width2:
+    pextrw          [r1], xm6, 0
+
+.nextH:
+    lea             r7, [r7 + r2]
+    lea             r8, [r8 + r3]
+
+    dec             r5d
+    jnz             .loopH
+    RET
+%endif
 %endif  ; end of (HIGH_BIT_DEPTH == 0)
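A scalar sketch of the per-pixel arithmetic in the new `weight_sp` kernel above (my reading of the `paddw`/`pmaddwd`/`psrad`/`paddd`/`packuswb` sequence, not code from x265): the `int16_t` input is re-biased by `0x2000` (`pw_2000`), scaled by `w0` with the rounding term folded in via the multiply-add, shifted, offset, and clamped to 8 bits.

```c
#include <stdint.h>

/* Assumed scalar equivalent of one weight_sp output pixel (8-bit path). */
static uint8_t weight_sp_pixel(int16_t src, int w0, int round, int shift, int offset)
{
    int x = src + 0x2000;                         /* paddw m5, [pw_2000]      */
    int v = ((x * w0 + round) >> shift) + offset; /* pmaddwd / psrad / paddd  */
    if (v < 0)   v = 0;                           /* packssdw + packuswb clamp */
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```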
     
 
@@ -3944,6 +4023,150 @@
     RET
 %endif
 
+;-----------------------------------------------------------------
+; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride)
+;-----------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+cglobal scale2D_64to32, 3, 4, 5, dest, src, stride
+    mov         r3d,     32
+    add         r2d,     r2d
+    mova        m4,      [pw_2000]
+
+.loop:
+    movu        m0,      [r1]
+    movu        m1,      [r1 + 1 * mmsize]
+    movu        m2,      [r1 + r2]
+    movu        m3,      [r1 + r2 + 1 * mmsize]
+
+    paddw       m0,      m2
+    paddw       m1,      m3
+    phaddw      m0,      m1
+
+    pmulhrsw    m0,      m4
+    vpermq      m0,      m0, q3120
+    movu        [r0],    m0
+
+    movu        m0,      [r1 + 2 * mmsize]
+    movu        m1,      [r1 + 3 * mmsize]
+    movu        m2,      [r1 + r2 + 2 * mmsize]
+    movu        m3,      [r1 + r2 + 3 * mmsize]
+
+    paddw       m0,      m2
+    paddw       m1,      m3
+    phaddw      m0,      m1
+
+    pmulhrsw    m0,      m4
+    vpermq      m0,      m0, q3120
+    movu        [r0 + mmsize], m0
+
+    add         r0,      64
+    lea         r1,      [r1 + 2 * r2]
+    dec         r3d
+    jnz         .loop
+    RET
+%else
+
+INIT_YMM avx2
+cglobal scale2D_64to32, 3, 5, 8, dest, src, stride
+    mov         r3d,     16
+    mova        m7,      [deinterleave_shuf]
+.loop:
+    movu        m0,      [r1]                  ; i
+    lea         r4,      [r1 + r2 * 2]
+    psrlw       m1,      m0, 8                 ; j
+    movu        m2,      [r1 + r2]             ; k
+    psrlw       m3,      m2, 8                 ; l
+
+    pxor        m4,      m0, m1                ; i^j
+    pxor        m5,      m2, m3                ; k^l
+    por         m4,      m5                    ; ij|kl
+
+    pavgb       m0,      m1                    ; s
+    pavgb       m2,      m3                    ; t
+    mova        m5,      m0
+    pavgb       m0,      m2                    ; (s+t+1)/2
+    pxor        m5,      m2                    ; s^t
+    pand        m4,      m5                    ; (ij|kl)&st
+    pand        m4,      [pb_1]
+    psubb       m0,      m4                    ; Result
+
+    movu        m1,      [r1 + 32]             ; i
+    psrlw       m2,      m1, 8                 ; j
+    movu        m3,      [r1 + r2 + 32]        ; k
+    psrlw       m4,      m3, 8                 ; l
+
+    pxor        m5,      m1, m2                ; i^j
+    pxor        m6,      m3, m4                ; k^l
+    por         m5,      m6                    ; ij|kl
+
+    pavgb       m1,      m2                    ; s
+    pavgb       m3,      m4                    ; t
+    mova        m6,      m1
+    pavgb       m1,      m3                    ; (s+t+1)/2
+    pxor        m6,      m3                    ; s^t
+    pand        m5,      m6                    ; (ij|kl)&st
+    pand        m5,      [pb_1]
+    psubb       m1,      m5                    ; Result
+
+    pshufb      m0,      m0, m7
+    pshufb      m1,      m1, m7
+
+    punpcklqdq  m0,      m1
+    vpermq      m0,      m0, 11011000b
+    movu        [r0],    m0
+
+    add         r0,      32
+
+    movu        m0,      [r4]                  ; i
+    psrlw       m1,      m0, 8                 ; j
+    movu        m2,      [r4 + r2]             ; k
+    psrlw       m3,      m2, 8                 ; l
+
+    pxor        m4,      m0, m1                ; i^j
+    pxor        m5,      m2, m3                ; k^l
+    por         m4,      m5                    ; ij|kl
+
+    pavgb       m0,      m1                    ; s
+    pavgb       m2,      m3                    ; t
+    mova        m5,      m0
+    pavgb       m0,      m2                    ; (s+t+1)/2
+    pxor        m5,      m2                    ; s^t
+    pand        m4,      m5                    ; (ij|kl)&st
+    pand        m4,      [pb_1]
+    psubb       m0,      m4                    ; Result
+
+    movu        m1,      [r4 + 32]             ; i
+    psrlw       m2,      m1, 8                 ; j
+    movu        m3,      [r4 + r2 + 32]        ; k
+    psrlw       m4,      m3, 8                 ; l
+
+    pxor        m5,      m1, m2                ; i^j
+    pxor        m6,      m3, m4                ; k^l
+    por         m5,      m6                    ; ij|kl
+
+    pavgb       m1,      m2                    ; s
+    pavgb       m3,      m4                    ; t
+    mova        m6,      m1
+    pavgb       m1,      m3                    ; (s+t+1)/2
+    pxor        m6,      m3                    ; s^t
+    pand        m5,      m6                    ; (ij|kl)&st
+    pand        m5,      [pb_1]
+    psubb       m1,      m5                    ; Result
+
+    pshufb      m0,      m0, m7
+    pshufb      m1,      m1, m7
+
+    punpcklqdq  m0,      m1
+    vpermq      m0,      m0, 11011000b
+    movu        [r0],    m0
+
+    lea         r1,      [r1 + 4 * r2]
+    add         r0,      32
+    dec         r3d
+    jnz         .loop
+    RET
+%endif
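For reference, a scalar model of what both `scale2D_64to32` paths above compute (my sketch, assuming the standard rounded 2:1 box downscale: each output pixel is the rounded average of a 2x2 input block; the HBD path gets the rounding from `pmulhrsw`, the 8-bit path from `pavgb` plus the `pb_1` bias correction):

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed scalar reference: 64x64 plane -> 32x32, rounded 2x2 averages. */
static void scale2D_64to32_ref(uint8_t *dst, const uint8_t *src, intptr_t stride)
{
    for (int y = 0; y < 32; y++)
        for (int x = 0; x < 32; x++)
        {
            const uint8_t *p = src + 2 * y * stride + 2 * x;
            /* four source pixels of the 2x2 block, +2 for round-to-nearest */
            dst[y * 32 + x] = (uint8_t)((p[0] + p[1] + p[stride] + p[stride + 1] + 2) >> 2);
        }
}
```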
 
;-----------------------------------------------------------------------------
; void pixel_sub_ps_4x4(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
@@ -4337,18 +4560,70 @@
 ;-----------------------------------------------------------------------------
 ; void pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
 ;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%macro PIXELSUB_PS_W16_H4_avx2 1
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_16x%1, 6, 9, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+    add         r1d,    r1d
+    add         r4d,    r4d
+    add         r5d,    r5d
+    lea         r6,     [r1 * 3]
+    lea         r7,     [r4 * 3]
+    lea         r8,     [r5 * 3]
+
+%rep %1/4
+    movu        m0,     [r2]
+    movu        m1,     [r3]
+    movu        m2,     [r2 + r4]
+    movu        m3,     [r3 + r5]
+
+    psubw       m0,     m1
+    psubw       m2,     m3
+
+    movu        [r0],            m0
+    movu        [r0 + r1],       m2
+
+    movu        m0,     [r2 + r4 * 2]
+    movu        m1,     [r3 + r5 * 2]
+    movu        m2,     [r2 + r7]
+    movu        m3,     [r3 + r8]
+
+    psubw       m0,     m1
+    psubw       m2,     m3
+
+    movu        [r0 + r1 * 2],   m0
+    movu        [r0 + r6],       m2
+
+    lea         r0,     [r0 + r1 * 4]
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+%endrep
+    RET
+%endif
+%endmacro
+PIXELSUB_PS_W16_H4_avx2 16
+PIXELSUB_PS_W16_H4_avx2 32
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
+%macro PIXELSUB_PS_W16_H8_avx2 2
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_sub_ps_16x16, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+cglobal pixel_sub_ps_16x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
     add         r1,     r1
     lea         r6,     [r1 * 3]
+    mov         r7d,    %2/8
 
-%rep 4
+    lea         r9,     [r4 * 3]
+    lea         r8,     [r5 * 3]
+
+.loop
     pmovzxbw    m0,     [r2]
     pmovzxbw    m1,     [r3]
     pmovzxbw    m2,     [r2 + r4]
     pmovzxbw    m3,     [r3 + r5]
-    lea         r2,     [r2 + r4 * 2]
-    lea         r3,     [r3 + r5 * 2]
 
     psubw       m0,     m1
     psubw       m2,     m3
@@ -4356,6 +4631,21 @@
     movu        [r0],            m0
     movu        [r0 + r1],       m2
 
+    pmovzxbw    m0,     [r2 + 2 * r4]
+    pmovzxbw    m1,     [r3 + 2 * r5]
+    pmovzxbw    m2,     [r2 + r9]
+    pmovzxbw    m3,     [r3 + r8]
+
+    psubw       m0,     m1
+    psubw       m2,     m3
+
+    movu        [r0 + r1 * 2],   m0
+    movu        [r0 + r6],       m2
+
+    lea         r0,     [r0 + r1 * 4]
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+
     pmovzxbw    m0,     [r2]
     pmovzxbw    m1,     [r3]
     pmovzxbw    m2,     [r2 + r4]
@@ -4364,14 +4654,34 @@
     psubw       m0,     m1
     psubw       m2,     m3
 
+    movu        [r0],            m0
+    movu        [r0 + r1],       m2
+
+    pmovzxbw    m0,     [r2 + 2 * r4]
+    pmovzxbw    m1,     [r3 + 2 * r5]
+    pmovzxbw    m2,     [r2 + r9]
+    pmovzxbw    m3,     [r3 + r8]
+
+    psubw       m0,     m1
+    psubw       m2,     m3
+
     movu        [r0 + r1 * 2],   m0
     movu        [r0 + r6],       m2
 
     lea         r0,     [r0 + r1 * 4]
-    lea         r2,     [r2 + r4 * 2]
-    lea         r3,     [r3 + r5 * 2]
-%endrep
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+
+    dec         r7d
+    jnz         .loop
     RET
+%endif
+%endmacro
+
+PIXELSUB_PS_W16_H8_avx2 16, 16
+PIXELSUB_PS_W16_H8_avx2 16, 32
+%endif
+
;-----------------------------------------------------------------------------
; void pixel_sub_ps_32x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------
@@ -4509,10 +4819,83 @@
 ;-----------------------------------------------------------------------------
 ; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
 ;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%macro PIXELSUB_PS_W32_H4_avx2 1
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_sub_ps_32x32, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1
-     mov        r6d,    4
-     add        r1,     r1
+cglobal pixel_sub_ps_32x%1, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+    add         r1d,    r1d
+    add         r4d,    r4d
+    add         r5d,    r5d
+    mov         r9d,    %1/4
+    lea         r6,     [r1 * 3]
+    lea         r7,     [r4 * 3]
+    lea         r8,     [r5 * 3]
+
+.loop
+    movu        m0,     [r2]
+    movu        m1,     [r2 + 32]
+    movu        m2,     [r3]
+    movu        m3,     [r3 + 32]
+    psubw       m0,     m2
+    psubw       m1,     m3
+
+    movu        [r0],                 m0
+    movu        [r0 + 32],            m1
+
+    movu        m0,     [r2 + r4]
+    movu        m1,     [r2 + r4 + 32]
+    movu        m2,     [r3 + r5]
+    movu        m3,     [r3 + r5 + 32]
+    psubw       m0,     m2
+    psubw       m1,     m3
+
+    movu        [r0 + r1],            m0
+    movu        [r0 + r1 + 32],       m1
+
+    movu        m0,     [r2 + r4 * 2]
+    movu        m1,     [r2 + r4 * 2 + 32]
+    movu        m2,     [r3 + r5 * 2]
+    movu        m3,     [r3 + r5 * 2 + 32]
+    psubw       m0,     m2
+    psubw       m1,     m3
+
+    movu        [r0 + r1 * 2],        m0
+    movu        [r0 + r1 * 2 + 32],   m1
+
+    movu        m0,     [r2 + r7]
+    movu        m1,     [r2 + r7 + 32]
+    movu        m2,     [r3 + r8]
+    movu        m3,     [r3 + r8 + 32]
+    psubw       m0,     m2
+    psubw       m1,     m3
+
+    movu        [r0 + r6],            m0
+    movu        [r0 + r6 + 32],       m1
+
+    lea         r0,     [r0 + r1 * 4]
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+    dec         r9d
+    jnz         .loop
+    RET
+%endif
+%endmacro
+PIXELSUB_PS_W32_H4_avx2 32
+PIXELSUB_PS_W32_H4_avx2 64
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
+%macro PIXELSUB_PS_W32_H8_avx2 2
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_32x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+    mov        r6d,    %2/8
+    add        r1,     r1
+    lea         r7,         [r4 * 3]
+    lea         r8,         [r5 * 3]
+    lea         r9,         [r1 * 3]
 
 .loop:
     pmovzxbw    m0,     [r2]
@@ -4537,55 +4920,44 @@
     movu        [r0 + r1],       m0
     movu        [r0 + r1 + 32],  m1
 
-    add         r2,     r4
-    add         r3,     r5
-
-    pmovzxbw    m0,     [r2 + r4]
-    pmovzxbw    m1,     [r2 + r4 + 16]
-    pmovzxbw    m2,     [r3 + r5]
-    pmovzxbw    m3,     [r3 + r5 + 16]
+    pmovzxbw    m0,     [r2 + 2 * r4]
+    pmovzxbw    m1,     [r2 + 2 * r4 + 16]
+    pmovzxbw    m2,     [r3 + 2 * r5]
+    pmovzxbw    m3,     [r3 + 2 * r5 + 16]
 
     psubw       m0,     m2
     psubw       m1,     m3
-    lea         r0,     [r0 + r1 * 2]
 
-    movu        [r0 ],           m0
-    movu        [r0 + 32],       m1
-
-    add         r2,     r4
-    add         r3,     r5
+    movu        [r0 + r1 * 2 ],           m0
+    movu        [r0 + r1 * 2 + 32],       m1
 
-    pmovzxbw    m0,     [r2 + r4]
-    pmovzxbw    m1,     [r2 + r4 + 16]
-    pmovzxbw    m2,     [r3 + r5]
-    pmovzxbw    m3,     [r3 + r5 + 16]
+    pmovzxbw    m0,     [r2 + r7]
+    pmovzxbw    m1,     [r2 + r7 + 16]
+    pmovzxbw    m2,     [r3 + r8]
+    pmovzxbw    m3,     [r3 + r8 + 16]
 
 
     psubw       m0,     m2
     psubw       m1,     m3
-    add         r0,     r1
 
-    movu        [r0 ],           m0
-    movu        [r0 + 32],       m1
+    movu        [r0 + r9],           m0
+    movu        [r0 + r9 +32],       m1
 
-    add         r2,     r4
-    add         r3,     r5
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+    lea         r0,     [r0 + r1 * 4]
 
-    pmovzxbw    m0,     [r2 + r4]
-    pmovzxbw    m1,     [r2 + r4 + 16]
-    pmovzxbw    m2,     [r3 + r5]
-    pmovzxbw    m3,     [r3 + r5 + 16]
+    pmovzxbw    m0,     [r2]
+    pmovzxbw    m1,     [r2 + 16]
+    pmovzxbw    m2,     [r3]
+    pmovzxbw    m3,     [r3 + 16]
 
     psubw       m0,     m2
     psubw       m1,     m3
-    add         r0,     r1
 
     movu        [r0 ],           m0
     movu        [r0 + 32],       m1
 
-    add         r2,     r4
-    add         r3,     r5
-
     pmovzxbw    m0,     [r2 + r4]
     pmovzxbw    m1,     [r2 + r4 + 16]
     pmovzxbw    m2,     [r3 + r5]
@@ -4593,48 +4965,45 @@
 
     psubw       m0,     m2
     psubw       m1,     m3
-    add         r0,     r1
 
-    movu        [r0 ],           m0
-    movu        [r0 + 32],       m1
+    movu        [r0 + r1],           m0
+    movu        [r0 + r1 + 32],       m1
 
-    add         r2,     r4
-    add         r3,     r5
-
-    pmovzxbw    m0,     [r2 + r4]
-    pmovzxbw    m1,     [r2 + r4 + 16]
-    pmovzxbw    m2,     [r3 + r5]
-    pmovzxbw    m3,     [r3 + r5 + 16]
+    pmovzxbw    m0,     [r2 + 2 * r4]
+    pmovzxbw    m1,     [r2 + 2 * r4 + 16]
+    pmovzxbw    m2,     [r3 + 2 * r5]
+    pmovzxbw    m3,     [r3 + 2 * r5 + 16]
 
     psubw       m0,     m2
     psubw       m1,     m3
-    add         r0,     r1
 
-    movu        [r0 ],           m0
-    movu        [r0 + 32],       m1
+    movu        [r0 + r1 * 2],           m0
+    movu        [r0 + r1 * 2 + 32],       m1
 
-    add         r2,     r4
-    add         r3,     r5
-
-    pmovzxbw    m0,     [r2 + r4]
-    pmovzxbw    m1,     [r2 + r4 + 16]
-    pmovzxbw    m2,     [r3 + r5]
-    pmovzxbw    m3,     [r3 + r5 + 16]
+    pmovzxbw    m0,     [r2 + r7]
+    pmovzxbw    m1,     [r2 + r7 + 16]
+    pmovzxbw    m2,     [r3 + r8]
+    pmovzxbw    m3,     [r3 + r8 + 16]
 
     psubw       m0,     m2
     psubw       m1,     m3
-    add         r0,     r1
 
-    movu        [r0 ],           m0
-    movu        [r0 + 32],       m1
+    movu        [r0 + r9],           m0
+    movu        [r0 + r9 + 32],       m1
 
-    lea         r0,     [r0 + r1]
-    lea         r2,     [r2 + r4 * 2]
-    lea         r3,     [r3 + r5 * 2]
+    lea         r0,     [r0 + r1 * 4]
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
 
     dec         r6d
     jnz         .loop
     RET
+%endif
+%endmacro
+
+PIXELSUB_PS_W32_H8_avx2 32, 32
+PIXELSUB_PS_W32_H8_avx2 32, 64
+%endif
 
;-----------------------------------------------------------------------------
; void pixel_sub_ps_64x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------
@@ -4858,6 +5227,102 @@
 ;-----------------------------------------------------------------------------
 ; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
 ;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_64x64, 6, 10, 8, dest, deststride, src0, src1, srcstride0, srcstride1
+    add         r1d,    r1d
+    add         r4d,    r4d
+    add         r5d,    r5d
+    mov         r9d,    16
+    lea         r6,     [r1 * 3]
+    lea         r7,     [r4 * 3]
+    lea         r8,     [r5 * 3]
+
+.loop
+    movu        m0,     [r2]
+    movu        m1,     [r2 + 32]
+    movu        m2,     [r2 + 64]
+    movu        m3,     [r2 + 96]
+    movu        m4,     [r3]
+    movu        m5,     [r3 + 32]
+    movu        m6,     [r3 + 64]
+    movu        m7,     [r3 + 96]
+    psubw       m0,     m4
+    psubw       m1,     m5
+    psubw       m2,     m6
+    psubw       m3,     m7
+
+    movu        [r0],                 m0
+    movu        [r0 + 32],            m1
+    movu        [r0 + 64],            m2
+    movu        [r0 + 96],            m3
+
+    movu        m0,     [r2 + r4]
+    movu        m1,     [r2 + r4 + 32]
+    movu        m2,     [r2 + r4 + 64]
+    movu        m3,     [r2 + r4 + 96]
+    movu        m4,     [r3 + r5]
+    movu        m5,     [r3 + r5 + 32]
+    movu        m6,     [r3 + r5 + 64]
+    movu        m7,     [r3 + r5 + 96]
+    psubw       m0,     m4
+    psubw       m1,     m5
+    psubw       m2,     m6
+    psubw       m3,     m7
+
+    movu        [r0 + r1],            m0
+    movu        [r0 + r1 + 32],       m1
+    movu        [r0 + r1 + 64],       m2
+    movu        [r0 + r1 + 96],       m3
+
+    movu        m0,     [r2 + r4 * 2]
+    movu        m1,     [r2 + r4 * 2 + 32]
+    movu        m2,     [r2 + r4 * 2 + 64]
+    movu        m3,     [r2 + r4 * 2 + 96]
+    movu        m4,     [r3 + r5 * 2]
+    movu        m5,     [r3 + r5 * 2 + 32]
+    movu        m6,     [r3 + r5 * 2 + 64]
+    movu        m7,     [r3 + r5 * 2 + 96]
+    psubw       m0,     m4
+    psubw       m1,     m5
+    psubw       m2,     m6
+    psubw       m3,     m7
+
+    movu        [r0 + r1 * 2],        m0
+    movu        [r0 + r1 * 2 + 32],   m1
+    movu        [r0 + r1 * 2 + 64],   m2
+    movu        [r0 + r1 * 2 + 96],   m3
+
+    movu        m0,     [r2 + r7]
+    movu        m1,     [r2 + r7 + 32]
+    movu        m2,     [r2 + r7 + 64]
+    movu        m3,     [r2 + r7 + 96]
+    movu        m4,     [r3 + r8]
+    movu        m5,     [r3 + r8 + 32]
+    movu        m6,     [r3 + r8 + 64]
+    movu        m7,     [r3 + r8 + 96]
+    psubw       m0,     m4
+    psubw       m1,     m5
+    psubw       m2,     m6
+    psubw       m3,     m7
+
+    movu        [r0 + r6],            m0
+    movu        [r0 + r6 + 32],       m1
+    movu        [r0 + r6 + 64],       m2
+    movu        [r0 + r6 + 96],       m3
+
+    lea         r0,     [r0 + r1 * 4]
+    lea         r2,     [r2 + r4 * 4]
+    lea         r3,     [r3 + r5 * 4]
+    dec         r9d
+    jnz         .loop
+    RET
+%endif
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
INIT_YMM avx2
cglobal pixel_sub_ps_64x64, 6, 7, 8, dest, deststride, src0, src1, srcstride0, srcstride1
     mov        r6d,    16
@@ -4963,7 +5428,7 @@
    dec         r6d
    jnz         .loop
    RET
-
+%endif
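The various `pixel_sub_ps_WxH` kernels in this hunk all implement the same scalar operation, unrolled over 4-row strips (with `pmovzxbw` widening the 8-bit pixels before `psubw` in the low-bit-depth path). A sketch of the assumed scalar reference, for orientation only:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed scalar reference: store the 16-bit residual src0 - src1. */
static void pixel_sub_ps_ref(int16_t *dst, intptr_t dstride,
                             const uint8_t *src0, const uint8_t *src1,
                             intptr_t sstride0, intptr_t sstride1,
                             int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * dstride + x] =
                (int16_t)(src0[y * sstride0 + x] - src1[y * sstride1 + x]);
}
```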
 ;=============================================================================
;=============================================================================
; variance
;=============================================================================
@@ -5387,7 +5852,7 @@
    RET
 %endmacro
 
-;int x265_test_func(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+;int scanPosLast(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize)
;{
;    int scanPosLast = 0;
;    do
@@ -5409,8 +5874,104 @@
;}
 
 %if ARCH_X86_64 == 1
+INIT_XMM avx2,bmi2
+cglobal scanPosLast, 7,11,6
+    ; convert unit of Stride(trSize) to int16_t
+    mov         r7d, r7m
+    add         r7d, r7d
+
+    ; loading scan table and convert to Byte
+    mova        m0, [r6]
+    packuswb    m0, [r6 + mmsize]
+    pxor        m1, m0, [pb_15]
+
+    ; clear CG count
+    xor         r9d, r9d
+
+    ; m0 - Zigzag scan table
+    ; m1 - revert order scan table
+    ; m4 - zero
+    ; m5 - ones
+
+    pxor        m4, m4
+    pcmpeqb     m5, m5
+    lea         r8d, [r7d * 3]
+
+.loop:
+    ; position of current CG
+    movzx       r6d, word [r0]
+    lea         r6, [r6 * 2 + r1]
+    add         r0, 16 * 2
+
+    ; loading current CG
+    movh        m2, [r6]
+    movhps      m2, [r6 + r7]
+    movh        m3, [r6 + r7 * 2]
+    movhps      m3, [r6 + r8]
+    packsswb    m2, m3
+
+    ; Zigzag
+    pshufb      m3, m2, m0
+    pshufb      m2, m1
+
+    ; get sign
+    pmovmskb    r6d, m3
+    pcmpeqb     m3, m4
+    pmovmskb    r10d, m3
+    not         r10d
+    pext        r6d, r6d, r10d
+    mov         [r2 + r9 * 2], r6w
+
+    ; get non-zero flag
+    ; TODO: reuse above result with reorder
+    pcmpeqb     m2, m4
+    pxor        m2, m5
+    pmovmskb    r6d, m2
+    mov         [r3 + r9 * 2], r6w
+
+    ; get non-zero number, POPCNT is faster
+    pabsb       m2, m2
+    psadbw      m2, m4
+    movhlps     m3, m2
+    paddd       m2, m3
+    movd        r6d, m2
+    mov         [r4 + r9], r6b
+
+    inc         r9d
+    sub         r5d, r6d
+    jg         .loop
+
+    ; fixup last CG non-zero flag
+    dec         r9d
+    movzx       r0d, word [r3 + r9 * 2]
+;%if cpuflag(bmi1)  ; 2uops?
+;    tzcnt       r1d, r0d
+;%else
+    bsf         r1d, r0d
+;%endif
+    shrx        r0d, r0d, r1d
+    mov         [r3 + r9 * 2], r0w
+
+    ; get last pos
+    mov         eax, r9d
+    shl         eax, 4
+    xor         r1d, 15
+    add         eax, r1d
+    RET
+
890
+
891
+; t3 must be ecx, since it's used for shift.
892
+%if WIN64
893
+    DECLARE_REG_TMP 3,1,2,0
894
+%elif ARCH_X86_64
895
+    DECLARE_REG_TMP 0,1,2,3
896
+%else ; X86_32
897
+    %error Unsupport platform X86_32
898
+%endif
899
 INIT_CPUFLAGS
900
-cglobal findPosLast_x64, 5,12
901
+cglobal scanPosLast_x64, 5,12
902
+    mov         r10, r3mp
903
+    movifnidn   t0, r0mp
904
     mov         r5d, r5m
905
     xor         r11d, r11d                  ; cgIdx
906
     xor         r7d, r7d                    ; tmp for non-zero flag
907
@@ -5418,40 +5979,78 @@
908
 .loop:
909
     xor         r8d, r8d                    ; coeffSign[]
910
     xor         r9d, r9d                    ; coeffFlag[]
911
-    xor         r10d, r10d                  ; coeffNum[]
912
+    xor         t3d, t3d                    ; coeffNum[]
913
 
914
 %assign x 0
915
 %rep 16
916
-    movzx       r6d, word [r0 + x * 2]
917
-    movsx       r6d, word [r1 + r6 * 2]
918
+    movzx       r6d, word [t0 + x * 2]
919
+    movsx       r6d, word [t1 + r6 * 2]
920
     test        r6d, r6d
921
     setnz       r7b
922
     shr         r6d, 31
923
-    shlx        r6d, r6d, r10d
924
+    shl         r6d, t3b
925
     or          r8d, r6d
926
     lea         r9, [r9 * 2 + r7]
927
-    add         r10d, r7d
928
+    add         t3d, r7d
929
 %assign x x+1
930
 %endrep
931
 
932
     ; store latest group data
933
-    mov         [r2 + r11 * 2], r8w
934
-    mov         [r3 + r11 * 2], r9w
935
-    mov         [r4 + r11], r10b
936
+    mov         [t2 + r11 * 2], r8w
937
+    mov         [r10 + r11 * 2], r9w
938
+    mov         [r4 + r11], t3b
939
     inc         r11d
940
 
941
-    add         r0, 16 * 2
942
-    sub         r5d, r10d
943
+    add         t0, 16 * 2
944
+    sub         r5d, t3d
945
     jnz        .loop
946
 
947
     ; store group data
948
-    tzcnt       r6d, r9d
949
-    shrx        r9d, r9d, r6d
950
-    mov         [r3 + (r11 - 1) * 2], r9w
951
+    bsf         t3d, r9d
952
+    shr         r9d, t3b
953
+    mov         [r10 + (r11 - 1) * 2], r9w
954
 
955
     ; get posLast
956
     shl         r11d, 4
957
-    sub         r11d, r6d
958
+    sub         r11d, t3d
959
     lea         eax, [r11d - 1]
960
     RET
961
 %endif
962
+
963
+
964
+;-----------------------------------------------------------------------------
965
+; uint32_t[last first] findPosFirstAndLast(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
966
+;-----------------------------------------------------------------------------
967
+INIT_XMM ssse3
968
+cglobal findPosFirstLast, 3,3,3
969
+    ; convert stride to int16_t
970
+    add         r1d, r1d
971
+
972
+    ; loading scan table and convert to Byte
973
+    mova        m0, [r2]
974
+    packuswb    m0, [r2 + mmsize]
975
+
976
+    ; loading 16 of coeff
977
+    movh        m1, [r0]
978
+    movhps      m1, [r0 + r1]
979
+    movh        m2, [r0 + r1 * 2]
980
+    lea         r1, [r1 * 3]
981
+    movhps      m2, [r0 + r1]
982
+    packsswb    m1, m2
983
+
984
+    ; get non-zero mask
985
+    pxor        m2, m2
986
+    pcmpeqb     m1, m2
987
+
988
+    ; reorder by Zigzag scan
989
+    pshufb      m1, m0
990
+
991
+    ; get First and Last pos
992
+    xor         eax, eax
993
+    pmovmskb    r0d, m1
994
+    not         r0w
995
+    bsr         r1w, r0w
996
+    bsf          ax, r0w
997
+    shl         r1d, 16
998
+    or          eax, r1d
999
+    RET
1000
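For orientation, here is a plain-C sketch of what the new SSSE3 `findPosFirstLast` kernel computes, following the commented-out prototype above: it walks one 4x4 coefficient group in scan order and packs the last and first nonzero scan positions into a single `uint32_t`. The helper name `findPosFirstLast_c`, the raster-index interpretation of `scanTbl`, and the behavior on an all-zero group are assumptions for illustration, not part of the upstream diff.

```c
#include <stdint.h>
#include <stddef.h>

/* Hedged C sketch of the SSSE3 findPosFirstLast kernel above.
 * dstCoeff: top-left of a 4x4 coefficient group, row stride trSize
 * (in coefficients); scanTbl: assumed to map scan position -> raster
 * position inside the 4x4 group. Returns (last << 16) | first, the
 * same packing the asm builds with bsr/bsf + shl/or. */
uint32_t findPosFirstLast_c(const int16_t *dstCoeff, intptr_t trSize,
                            const uint16_t scanTbl[16])
{
    uint32_t first = 16, last = 0;   /* first == 16 means "none found yet" */
    for (uint32_t i = 0; i < 16; i++)
    {
        uint16_t idx = scanTbl[i];   /* raster position inside the group */
        if (dstCoeff[(idx >> 2) * trSize + (idx & 3)])
        {
            if (first == 16)
                first = i;           /* first nonzero in scan order */
            last = i;                /* keeps advancing to the last one */
        }
    }
    return (last << 16) | first;
}
```

With an identity scan table and `trSize == 4` this reduces to scanning the 16 coefficients in memory order, which makes the packing easy to check by hand.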
x265_1.6.tar.gz/source/common/x86/pixel.h -> x265_1.7.tar.gz/source/common/x86/pixel.h Changed
 
@@ -226,6 +226,7 @@
 ADDAVG(addAvg_32x48)
 
 void x265_downShift_16_sse2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
+void x265_downShift_16_avx2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
 void x265_upShift_8_sse4(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
 int x265_psyCost_pp_4x4_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
 int x265_psyCost_pp_8x8_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
@@ -256,10 +257,14 @@
 void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_16x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_32x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 
 void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_16x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_32x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 
 int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
 int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
@@ -272,6 +277,7 @@
 int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+void x265_weight_sp_avx2(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
 
 #undef DECL_PIXELS
 #undef DECL_HEVC_SSD
x265_1.6.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.7.tar.gz/source/common/x86/pixeladd8.asm Changed
 
@@ -398,10 +398,65 @@
 
     jnz         .loop
     RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W16_H4 16, 16
+PIXEL_ADD_PS_W16_H4 16, 32
 
+;-----------------------------------------------------------------------------
+; void pixel_add_ps_16x16(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W16_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
-    mov         r6d,        %2/4
+cglobal pixel_add_ps_16x%1, 6, 10, 4, dest, destride, src0, scr1, srcStride0, srcStride1
+    mova    m3,     [pw_pixel_max]
+    pxor    m2,     m2
+    mov     r6d,    %1/4
+    add     r4d,    r4d
+    add     r5d,    r5d
+    add     r1d,    r1d
+    lea     r7,     [r4 * 3]
+    lea     r8,     [r5 * 3]
+    lea     r9,     [r1 * 3]
+
+.loop:
+    movu    m0,     [r2]
+    movu    m1,     [r3]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0],              m0
+
+    movu    m0,     [r2 + r4]
+    movu    m1,     [r3 + r5]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r1],         m0
+
+    movu    m0,     [r2 + r4 * 2]
+    movu    m1,     [r3 + r5 * 2]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r1 * 2],     m0
+
+    movu    m0,     [r2 + r7]
+    movu    m1,     [r3 + r8]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r9],         m0
+
+    dec     r6d
+    lea     r0,     [r0 + r1 * 4]
+    lea     r2,     [r2 + r4 * 4]
+    lea     r3,     [r3 + r5 * 4]
+    jnz     .loop
+    RET
+%endif
+%else
+INIT_YMM avx2
+cglobal pixel_add_ps_16x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %1/4
     add         r5,         r5
 .loop:
 
@@ -447,8 +502,8 @@
 %endif
 %endmacro
 
-PIXEL_ADD_PS_W16_H4 16, 16
-PIXEL_ADD_PS_W16_H4 16, 32
+PIXEL_ADD_PS_W16_H4_avx2 16
+PIXEL_ADD_PS_W16_H4_avx2 32
 
 
 ;-----------------------------------------------------------------------------
@@ -569,11 +624,90 @@
 
     jnz         .loop
     RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W32_H2 32, 32
+PIXEL_ADD_PS_W32_H2 32, 64
 
+;-----------------------------------------------------------------------------
+; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W32_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
-    mov         r6d,        %2/4
+cglobal pixel_add_ps_32x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1
+    mova    m5,     [pw_pixel_max]
+    pxor    m4,     m4
+    mov     r6d,    %1/4
+    add     r4d,    r4d
+    add     r5d,    r5d
+    add     r1d,    r1d
+    lea     r7,     [r4 * 3]
+    lea     r8,     [r5 * 3]
+    lea     r9,     [r1 * 3]
+
+.loop:
+    movu    m0,     [r2]
+    movu    m2,     [r2 + 32]
+    movu    m1,     [r3]
+    movu    m3,     [r3 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0],               m0
+    movu    [r0 + 32],          m2
+
+    movu    m0,     [r2 + r4]
+    movu    m2,     [r2 + r4 + 32]
+    movu    m1,     [r3 + r5]
+    movu    m3,     [r3 + r5 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r1],          m0
+    movu    [r0 + r1 + 32],     m2
+
+    movu    m0,     [r2 + r4 * 2]
+    movu    m2,     [r2 + r4 * 2 + 32]
+    movu    m1,     [r3 + r5 * 2]
+    movu    m3,     [r3 + r5 * 2 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r1 * 2],      m0
+    movu    [r0 + r1 * 2 + 32], m2
+
+    movu    m0,     [r2 + r7]
+    movu    m2,     [r2 + r7 + 32]
+    movu    m1,     [r3 + r8]
+    movu    m3,     [r3 + r8 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r9],          m0
+    movu    [r0 + r9 + 32],     m2
+
+    dec     r6d
+    lea     r0,     [r0 + r1 * 4]
+    lea     r2,     [r2 + r4 * 4]
+    lea     r3,     [r3 + r5 * 4]
+    jnz     .loop
+    RET
+%endif
+%else
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_add_ps_32x%1, 6, 10, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %1/4
     add         r5,         r5
+    lea         r7,         [r4 * 3]
+    lea         r8,         [r5 * 3]
+    lea         r9,         [r1 * 3]
 .loop:
     pmovzxbw    m0,         [r2]                ; first half of row 0 of src0
     pmovzxbw    m1,         [r2 + 16]           ; second half of row 0 of src0
@@ -597,44 +731,41 @@
     vpermq      m0, m0, 11011000b
     movu        [r0 + r1],      m0              ; row 1 of dst
 
-    lea         r2,         [r2 + r4 * 2]
-    lea         r3,         [r3 + r5 * 2]
-    lea         r0,         [r0 + r1 * 2]
-
-    pmovzxbw    m0,         [r2]                ; first half of row 2 of src0
-    pmovzxbw    m1,         [r2 + 16]           ; second half of row 2 of src0
-    movu        m2,         [r3]                ; first half of row 2 of src1
-    movu        m3,         [r3 + 32]           ; second half of row 2 of src1
+    pmovzxbw    m0,         [r2 + r4 * 2]       ; first half of row 2 of src0
+    pmovzxbw    m1,         [r2 + r4 * 2 + 16]  ; second half of row 2 of src0
+    movu        m2,         [r3 + r5 * 2]       ; first half of row 2 of src1
+    movu        m3,         [r3 + + r5 * 2 + 32]; second half of row 2 of src1
 
     paddw       m0,         m2
     paddw       m1,         m3
     packuswb    m0,         m1
     vpermq      m0, m0, 11011000b
-    movu        [r0],      m0                   ; row 2 of dst
+    movu        [r0 + r1 * 2],      m0          ; row 2 of dst
 
-    pmovzxbw    m0,         [r2 + r4]           ; first half of row 3 of src0
-    pmovzxbw    m1,         [r2 + r4 + 16]      ; second half of row 3 of src0
-    movu        m2,         [r3 + r5]           ; first half of row 3 of src1
-    movu        m3,         [r3 + r5 + 32]      ; second half of row 3 of src1
+    pmovzxbw    m0,         [r2 + r7]           ; first half of row 3 of src0
+    pmovzxbw    m1,         [r2 + r7 + 16]      ; second half of row 3 of src0
+    movu        m2,         [r3 + r8]           ; first half of row 3 of src1
+    movu        m3,         [r3 + r8 + 32]      ; second half of row 3 of src1
 
     paddw       m0,         m2
     paddw       m1,         m3
     packuswb    m0,         m1
     vpermq      m0, m0, 11011000b
-    movu        [r0 + r1],      m0              ; row 3 of dst
+    movu        [r0 + r9],      m0              ; row 3 of dst
 
-    lea         r2,         [r2 + r4 * 2]
-    lea         r3,         [r3 + r5 * 2]
-    lea         r0,         [r0 + r1 * 2]
+    lea         r2,         [r2 + r4 * 4]
+    lea         r3,         [r3 + r5 * 4]
+    lea         r0,         [r0 + r1 * 4]
 
     dec         r6d
     jnz         .loop
     RET
 %endif
+%endif
 %endmacro
 
-PIXEL_ADD_PS_W32_H2 32, 32
-PIXEL_ADD_PS_W32_H2 32, 64
+PIXEL_ADD_PS_W32_H4_avx2 32
+PIXEL_ADD_PS_W32_H4_avx2 64
 
 
 ;-----------------------------------------------------------------------------
@@ -841,10 +972,127 @@
 
     jnz         .loop
     RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W64_H2 64, 64
 
+;-----------------------------------------------------------------------------
+; void pixel_add_ps_64x64(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
-    mov         r6d,        %2/2
+cglobal pixel_add_ps_64x64, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1
+    mova    m5,     [pw_pixel_max]
+    pxor    m4,     m4
+    mov     r6d,    16
+    add     r4d,    r4d
+    add     r5d,    r5d
+    add     r1d,    r1d
+    lea     r7,     [r4 * 3]
+    lea     r8,     [r5 * 3]
+    lea     r9,     [r1 * 3]
+
+.loop:
+    movu    m0,     [r2]
+    movu    m1,     [r2 + 32]
+    movu    m2,     [r3]
+    movu    m3,     [r3 + 32]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0],                m0
+    movu    [r0 + 32],           m1
+
+    movu    m0,     [r2 + 64]
+    movu    m1,     [r2 + 96]
+    movu    m2,     [r3 + 64]
+    movu    m3,     [r3 + 96]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + 64],           m0
+    movu    [r0 + 96],           m1
+
+    movu    m0,     [r2 + r4]
+    movu    m1,     [r2 + r4 + 32]
+    movu    m2,     [r3 + r5]
+    movu    m3,     [r3 + r5 + 32]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r1],           m0
+    movu    [r0 + r1 + 32],      m1
+
+    movu    m0,     [r2 + r4 + 64]
+    movu    m1,     [r2 + r4 + 96]
+    movu    m2,     [r3 + r5 + 64]
+    movu    m3,     [r3 + r5 + 96]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r1 + 64],      m0
+    movu    [r0 + r1 + 96],      m1
+
+    movu    m0,     [r2 + r4 * 2]
+    movu    m1,     [r2 + r4 * 2 + 32]
+    movu    m2,     [r3 + r5 * 2]
+    movu    m3,     [r3 + r5 * 2+ 32]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r1 * 2],       m0
+    movu    [r0 + r1 * 2 + 32],  m1
+
+    movu    m0,     [r2 + r4 * 2 + 64]
+    movu    m1,     [r2 + r4 * 2 + 96]
+    movu    m2,     [r3 + r5 * 2 + 64]
+    movu    m3,     [r3 + r5 * 2 + 96]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r1 * 2 + 64],  m0
+    movu    [r0 + r1 * 2 + 96],  m1
+
+    movu    m0,     [r2 + r7]
+    movu    m1,     [r2 + r7 + 32]
+    movu    m2,     [r3 + r8]
+    movu    m3,     [r3 + r8 + 32]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r9],           m0
+    movu    [r0 + r9 + 32],      m1
+
+    movu    m0,     [r2 + r7 + 64]
+    movu    m1,     [r2 + r7 + 96]
+    movu    m2,     [r3 + r8 + 64]
+    movu    m3,     [r3 + r8 + 96]
+    paddw   m0,     m2
+    paddw   m1,     m3
+
+    CLIPW2  m0, m1, m4, m5
+    movu    [r0 + r9 + 64],      m0
+    movu    [r0 + r9 + 96],      m1
+
+    dec     r6d
+    lea     r0,     [r0 + r1 * 4]
+    lea     r2,     [r2 + r4 * 4]
+    lea     r3,     [r3 + r5 * 4]
+    jnz     .loop
+    RET
+%endif
+%else
+INIT_YMM avx2
+cglobal pixel_add_ps_64x64, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        32
     add         r5,         r5
 .loop:
     pmovzxbw    m0,         [r2]                ; first 16 of row 0 of src0
@@ -896,6 +1144,3 @@
     RET
 
 %endif
-%endmacro
-
-PIXEL_ADD_PS_W64_H2 64, 64
x265_1.6.tar.gz/source/common/x86/sad-a.asm -> x265_1.7.tar.gz/source/common/x86/sad-a.asm Changed
 
@@ -4004,10 +4004,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_32x24, 4,5,6
+cglobal pixel_sad_32x24, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
     mov             r4d, 6
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; row 0 of pix0
     movu           m2, [r2]               ; row 0 of pix1
@@ -4019,21 +4021,18 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; row 2 of pix0
-    movu           m2, [r2]               ; row 2 of pix1
-    movu           m3, [r0 + r1]          ; row 3 of pix0
-    movu           m4, [r2 + r3]          ; row 3 of pix1
+    movu           m1, [r0 + 2 * r1]      ; row 2 of pix0
+    movu           m2, [r2 + 2 * r3]      ; row 2 of pix1
+    movu           m3, [r0 + r5]          ; row 3 of pix0
+    movu           m4, [r2 + r6]          ; row 3 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
@@ -4307,10 +4306,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_64x48, 4,5,6
+cglobal pixel_sad_64x48, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
-    mov             r4d, 24
+    mov             r4d, 12
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; first 32 of row 0 of pix0
     movu           m2, [r2]               ; first 32 of row 0 of pix1
@@ -4332,8 +4333,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 0 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 0 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 0 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 0 of pix1
+
+    psadbw         m1, m2
+    psadbw         m3, m4
+    paddd          m0, m1
+    paddd          m5, m3
+
+    movu           m1, [r0 + r5]          ; first 32 of row 1 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 1 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 1 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 1 of pix1
+
+    psadbw         m1, m2
+    psadbw         m3, m4
+    paddd          m0, m1
+    paddd          m5, m3
+
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
@@ -4347,10 +4368,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_64x64, 4,5,6
+cglobal pixel_sad_64x64, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
     mov             r4d, 8
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; first 32 of row 0 of pix0
     movu           m2, [r2]               ; first 32 of row 0 of pix1
@@ -4372,31 +4395,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; first 32 of row 2 of pix0
-    movu           m2, [r2]               ; first 32 of row 2 of pix1
-    movu           m3, [r0 + 32]          ; second 32 of row 2 of pix0
-    movu           m4, [r2 + 32]          ; second 32 of row 2 of pix1
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 2 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 2 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    movu           m1, [r0 + r1]          ; first 32 of row 3 of pix0
-    movu           m2, [r2 + r3]          ; first 32 of row 3 of pix1
-    movu           m3, [r0 + 32 + r1]     ; second 32 of row 3 of pix0
-    movu           m4, [r2 + 32 + r3]     ; second 32 of row 3 of pix1
+    movu           m1, [r0 + r5]          ; first 32 of row 3 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 3 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 3 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 3 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     movu           m1, [r0]               ; first 32 of row 4 of pix0
     movu           m2, [r2]               ; first 32 of row 4 of pix1
@@ -4418,31 +4438,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; first 32 of row 6 of pix0
-    movu           m2, [r2]               ; first 32 of row 6 of pix1
-    movu           m3, [r0 + 32]          ; second 32 of row 6 of pix0
-    movu           m4, [r2 + 32]          ; second 32 of row 6 of pix1
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 6 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 6 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 6 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 6 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    movu           m1, [r0 + r1]          ; first 32 of row 7 of pix0
-    movu           m2, [r2 + r3]          ; first 32 of row 7 of pix1
-    movu           m3, [r0 + 32 + r1]     ; second 32 of row 7 of pix0
-    movu           m4, [r2 + 32 + r3]     ; second 32 of row 7 of pix1
+    movu           m1, [r0 + r5]          ; first 32 of row 7 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 7 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 7 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 7 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
x265_1.6.tar.gz/source/common/x86/sad16-a.asm -> x265_1.7.tar.gz/source/common/x86/sad16-a.asm Changed
 
@@ -276,9 +276,8 @@
     ABSW2   m3, m4, m3, m4, m7, m5
     paddw   m1, m2
     paddw   m3, m4
-    paddw   m3, m1
-    pmaddwd m3, [pw_1]
-    paddd   m0, m3
+    paddw   m0, m1
+    paddw   m0, m3
 %else
     movu    m1, [r2]
     movu    m2, [r2+2*r3]
@@ -287,15 +286,45 @@
     ABSW2   m1, m2, m1, m2, m3, m4
     lea     r0, [r0+4*r1]
     lea     r2, [r2+4*r3]
-    paddw   m2, m1
-    pmaddwd m2, [pw_1]
-    paddd   m0, m2
+    paddw   m0, m1
+    paddw   m0, m2
 %endif
 %endmacro
 
-;-----------------------------------------------------------------------------
-; int pixel_sad_NxM( uint16_t *, intptr_t, uint16_t *, intptr_t )
-;-----------------------------------------------------------------------------
+%macro SAD_INC_2ROW_Nx64 1
+%if 2*%1 > mmsize
+    movu    m1, [r2 + 0]
+    movu    m2, [r2 + 16]
+    movu    m3, [r2 + 2 * r3 + 0]
+    movu    m4, [r2 + 2 * r3 + 16]
+    psubw   m1, [r0 + 0]
+    psubw   m2, [r0 + 16]
+    psubw   m3, [r0 + 2 * r1 + 0]
+    psubw   m4, [r0 + 2 * r1 + 16]
+    ABSW2   m1, m2, m1, m2, m5, m6
+    lea     r0, [r0 + 4 * r1]
+    lea     r2, [r2 + 4 * r3]
+    ABSW2   m3, m4, m3, m4, m7, m5
+    paddw   m1, m2
+    paddw   m3, m4
+    paddw   m0, m1
+    paddw   m8, m3
+%else
+    movu    m1, [r2]
+    movu    m2, [r2 + 2 * r3]
+    psubw   m1, [r0]
+    psubw   m2, [r0 + 2 * r1]
+    ABSW2   m1, m2, m1, m2, m3, m4
+    lea     r0, [r0 + 4 * r1]
+    lea     r2, [r2 + 4 * r3]
+    paddw   m0, m1
+    paddw   m8, m2
+%endif
+%endmacro
+
+; ---------------------------------------------------------------------------- -
+; int pixel_sad_NxM(uint16_t *, intptr_t, uint16_t *, intptr_t)
+; ---------------------------------------------------------------------------- -
 %macro SAD 2
 cglobal pixel_sad_%1x%2, 4,5-(%2&4/4),8*(%1/mmsize)
     pxor    m0, m0
@@ -309,8 +338,35 @@
     dec    r4d
     jg .loop
 %endif
+%if %2 == 32
+    HADDUWD m0, m1
+    HADDD   m0, m1
+%else
+    HADDW   m0, m1
+%endif
+    movd    eax, xm0
+    RET
+%endmacro
 
+; ---------------------------------------------------------------------------- -
+; int pixel_sad_Nx64(uint16_t *, intptr_t, uint16_t *, intptr_t)
+; ---------------------------------------------------------------------------- -
+%macro SAD_Nx64 1
+cglobal pixel_sad_%1x64, 4,5-(64&4/4), 9
+    pxor    m0, m0
+    pxor    m8, m8
+    mov     r4d, 64 / 2
+.loop:
+    SAD_INC_2ROW_Nx64 %1
+    dec    r4d
+    jg .loop
+
+    HADDUWD m0, m1
+    HADDUWD m8, m1
     HADDD   m0, m1
+    HADDD   m8, m1
+    paddd   m0, m8
+
     movd    eax, xm0
     RET
 %endmacro
@@ -321,7 +377,7 @@
 SAD  16, 12
 SAD  16, 16
 SAD  16, 32
-SAD  16, 64
+SAD_Nx64  16
 
 INIT_XMM sse2
 SAD  8,  4
@@ -329,6 +385,13 @@
 SAD  8, 16
 SAD  8, 32
 
+INIT_YMM avx2
+SAD  16,  4
+SAD  16,  8
+SAD  16, 12
+SAD  16, 16
+SAD  16, 32
+
 ;------------------------------------------------------------------
 ; int pixel_sad_32xN( uint16_t *, intptr_t, uint16_t *, intptr_t )
 ;------------------------------------------------------------------
@@ -716,7 +779,6 @@
 %endif
     movd     eax, xm0
     RET
-
 ;-----------------------------------------------------------------------------
 ; void pixel_sad_xN_WxH( uint16_t *fenc, uint16_t *pix0, uint16_t *pix1,
 ;                        uint16_t *pix2, intptr_t i_stride, int scores[3] )
x265_1.6.tar.gz/source/common/x86/x86inc.asm -> x265_1.7.tar.gz/source/common/x86/x86inc.asm Changed
 
@@ -72,7 +72,7 @@
     %define mangle(x) x
 %endif
 
-%macro SECTION_RODATA 0-1 16
+%macro SECTION_RODATA 0-1 32
     SECTION .rodata align=%1
 %endmacro
 
@@ -715,6 +715,7 @@
     %else
         global %1
     %endif
+    ALIGN 32
     %1: %2
 %endmacro
 
x265_1.6.tar.gz/source/encoder/CMakeLists.txt -> x265_1.7.tar.gz/source/encoder/CMakeLists.txt Changed
 
@@ -1,7 +1,11 @@
 # vim: syntax=cmake
 
 if(GCC)
-   add_definitions(-Wno-uninitialized)
+    add_definitions(-Wno-uninitialized)
+    if(CC_HAS_NO_STRICT_OVERFLOW)
+        # GCC 4.9.2 gives warnings we know we can ignore in this file
+        set_source_files_properties(slicetype.cpp PROPERTIES COMPILE_FLAGS -Wno-strict-overflow)
+    endif(CC_HAS_NO_STRICT_OVERFLOW)
 endif()
 if(MSVC)
    add_definitions(/wd4701) # potentially uninitialized local variable 'foo' used
x265_1.6.tar.gz/source/encoder/analysis.cpp -> x265_1.7.tar.gz/source/encoder/analysis.cpp Changed
 
@@ -130,9 +130,12 @@
     for (uint32_t i = 0; i <= g_maxCUDepth; i++)
         for (uint32_t j = 0; j < MAX_PRED_TYPES; j++)
             m_modeDepth[i].pred[j].invalidate();
-#endif
     invalidateContexts(0);
-    m_quant.setQPforQuant(ctu);
+#endif
+
+    int qp = setLambdaFromQP(ctu, m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, cuGeom) : m_slice->m_sliceQp);
+    ctu.setQPSubParts((int8_t)qp, 0, 0);
+
     m_rqt[0].cur.load(initialContext);
     m_modeDepth[0].fencYuv.copyFromPicYuv(*m_frame->m_fencPic, ctu.m_cuAddr, 0);
 
@@ -140,11 +143,11 @@
     if (m_param->analysisMode)
     {
         if (m_slice->m_sliceType == I_SLICE)
-            m_reuseIntraDataCTU = (analysis_intra_data *)m_frame->m_analysisData.intraData;
+            m_reuseIntraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
         else
         {
             int numPredDir = m_slice->isInterP() ? 1 : 2;
-            m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData;
+            m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
             m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
             m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS];
         }
@@ -155,10 +158,10 @@
     uint32_t zOrder = 0;
     if (m_slice->m_sliceType == I_SLICE)
     {
-        compressIntraCU(ctu, cuGeom, zOrder);
+        compressIntraCU(ctu, cuGeom, zOrder, qp);
         if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.intraData)
         {
-            CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+            CUData* bestCU = &m_modeDepth[0].bestMode->cu;
             memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition);
@@ -173,21 +176,21 @@
             * they are available for intra predictions */
             m_modeDepth[0].fencYuv.copyToPicYuv(*m_frame->m_reconPic, ctu.m_cuAddr, 0);
 
-            compressInterCU_rd0_4(ctu, cuGeom);
+            compressInterCU_rd0_4(ctu, cuGeom, qp);
 
             /* generate residual for entire CTU at once and copy to reconPic */
             encodeResidue(ctu, cuGeom);
         }
         else if (m_param->bDistributeModeAnalysis && m_param->rdLevel >= 2)
-            compressInterCU_dist(ctu, cuGeom);
+            compressInterCU_dist(ctu, cuGeom, qp);
         else if (m_param->rdLevel <= 4)
-            compressInterCU_rd0_4(ctu, cuGeom);
+            compressInterCU_rd0_4(ctu, cuGeom, qp);
         else
         {
-            compressInterCU_rd5_6(ctu, cuGeom, zOrder);
+            compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp);
             if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData)
             {
-                CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+                CUData* bestCU = &m_modeDepth[0].bestMode->cu;
                 memcpy(&m_reuseInterDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
                 memcpy(&m_reuseInterDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_predMode, sizeof(uint8_t) * numPartition);
             }
@@ -206,24 +209,28 @@
         return;
     else if (md.bestMode->cu.isIntra(0))
     {
+        m_quant.m_tqBypass = true;
        md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0];
         uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir;
         checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL);
         checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+        m_quant.m_tqBypass = false;
     }
     else
     {
+        m_quant.m_tqBypass = true;
         md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv);
         encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom);
         checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+        m_quant.m_tqBypass = false;
     }
 }
 
-void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder)
+void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder, int32_t qp)
 {
     uint32_t depth = cuGeom.depth;
     ModeDepth& md = m_modeDepth[depth];
@@ -241,11 +248,9 @@
 
         if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx)
         {
-            m_quant.setQPforQuant(parentCTU);
-
             PartSize size = (PartSize)reusePartSizes[zOrder];
             Mode& mode = size == SIZE_2Nx2N ? md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN];
-            mode.cu.initSubCU(parentCTU, cuGeom);
+            mode.cu.initSubCU(parentCTU, cuGeom, qp);
             checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]);
             checkBestMode(mode, depth);
 
@@ -262,15 +267,13 @@
     }
     else if (mightNotSplit)
     {
-        m_quant.setQPforQuant(parentCTU);
-
-        md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
         checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
         checkBestMode(md.pred[PRED_INTRA], depth);
 
         if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
         {
-            md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
             checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
             checkBestMode(md.pred[PRED_INTRA_NxN], depth);
         }
@@ -287,12 +290,13 @@
         Mode* splitPred = &md.pred[PRED_SPLIT];
         splitPred->initCosts();
         CUData* splitCU = &splitPred->cu;
-        splitCU->initSubCU(parentCTU, cuGeom);
+        splitCU->initSubCU(parentCTU, cuGeom, qp);
 
         uint32_t nextDepth = depth + 1;
         ModeDepth& nd = m_modeDepth[nextDepth];
         invalidateContexts(nextDepth);
         Entropy* nextContext = &m_rqt[depth].cur;
+        int32_t nextQP = qp;
 
         for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
         {
@@ -301,7 +305,11 @@
             {
                 m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                 m_rqt[nextDepth].cur.load(*nextContext);
-                compressIntraCU(parentCTU, childGeom, zOrder);
+
+                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+                compressIntraCU(parentCTU, childGeom, zOrder, nextQP);
 
                 // Save best CU and pred data for this sub CU
                 splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
@@ -322,7 +330,7 @@
         else
             updateModeCost(*splitPred);
 
-        checkDQPForSplitPred(splitPred->cu, cuGeom);
+        checkDQPForSplitPred(*splitPred, cuGeom);
         checkBestMode(*splitPred, depth);
     }
 
@@ -362,24 +370,18 @@
     }
 
     ModeDepth& md = m_modeDepth[pmode.cuGeom.depth];
-    bool bMergeOnly = pmode.cuGeom.log2CUSize == 6;
 
     /* setup slave Analysis */
     if (&slave != this)
     {
         slave.m_slice = m_slice;
         slave.m_frame = m_frame;
-        slave.setQP(*m_slice, m_rdCost.m_qp);
+        slave.m_param = m_param;
+        slave.setLambdaFromQP(md.pred[PRED_2Nx2N].cu, m_rdCost.m_qp);
         slave.invalidateContexts(0);
-
-        if (m_param->rdLevel >= 5)
-        {
-            slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
-            slave.m_quant.setQPforQuant(md.pred[PRED_2Nx2N].cu);
-        }
+        slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
     }
 
-
     /* perform Mode task, repeat until no more work is available */
     do
     {
@@ -388,8 +390,6 @@
             switch (pmode.modes[task])
             {
             case PRED_INTRA:
-                if (&slave != this)
-                    slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
                 slave.checkIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom);
                 if (m_param->rdLevel > 2)
                     slave.encodeIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom);
@@ -441,7 +441,7 @@
                 break;
 
             case PRED_2Nx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, false);
+                slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N);
                 md.pred[PRED_BIDIR].rdCost = MAX_INT64;
                 if (m_slice->m_sliceType == B_SLICE)
                 {
@@ -452,27 +452,27 @@
                 break;
 
             case PRED_Nx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, false);
+                slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N);
                 break;
 
             case PRED_2NxN:
-                slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, false);
+                slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN);
                 break;
 
             case PRED_2NxnU:
-                slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, bMergeOnly);
+                slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU);
                 break;
 
             case PRED_2NxnD:
-                slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, bMergeOnly);
+                slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD);
                 break;
 
             case PRED_nLx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, bMergeOnly);
+                slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N);
                 break;
 
             case PRED_nRx2N:
-                slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, bMergeOnly);
+                slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N);
                 break;
 
             default:
@@ -490,7 +490,7 @@
     while (task >= 0);
 }
 
-void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom)
+void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
 {
     uint32_t depth = cuGeom.depth;
     uint32_t cuAddr = parentCTU.m_cuAddr;
@@ -505,34 +505,34 @@
 
     if (mightNotSplit && depth >= minDepth)
     {
-        int bTryAmp = m_slice->m_sps->maxAMPDepth > depth && (cuGeom.log2CUSize < 6 || m_param->rdLevel > 4);
+        int bTryAmp = m_slice->m_sps->maxAMPDepth > depth;
         int bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;
 
         PMODE pmode(*this, cuGeom);
 
         /* Initialize all prediction CUs based on parentCTU */
-        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
-        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
+        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
         if (bTryIntra)
         {
-            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
             if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3 && m_param->rdLevel >= 5)
-                md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
             pmode.modes[pmode.m_jobTotal++] = PRED_INTRA;
         }
-        md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N;
-        md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N;
+        md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
         if (m_param->bEnableRectInter)
         {
-            md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxN;
-            md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N;
+            md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxN;
+            md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N;
         }
         if (bTryAmp)
         {
-            md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU;
-            md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD;
-            md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N;
-            md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N;
+            md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU;
+            md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD;
+            md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N;
+            md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N;
         }
 
         pmode.tryBondPeers(*m_frame->m_encData->m_jobProvider, pmode.m_jobTotal);
@@ -662,7 +662,7 @@
 
         if (md.bestMode->rdCost == MAX_INT64 && !bTryIntra)
         {
-            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
             checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
             encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
             checkBestMode(md.pred[PRED_INTRA], depth);
@@ -688,12 +688,13 @@
         Mode* splitPred = &md.pred[PRED_SPLIT];
         splitPred->initCosts();
         CUData* splitCU = &splitPred->cu;
-        splitCU->initSubCU(parentCTU, cuGeom);
+        splitCU->initSubCU(parentCTU, cuGeom, qp);
 
         uint32_t nextDepth = depth + 1;
         ModeDepth& nd = m_modeDepth[nextDepth];
         invalidateContexts(nextDepth);
         Entropy* nextContext = &m_rqt[depth].cur;
+        int nextQP = qp;
 
         for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
         {
@@ -702,7 +703,11 @@
             {
                 m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                 m_rqt[nextDepth].cur.load(*nextContext);
-                compressInterCU_dist(parentCTU, childGeom);
+
+                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+                compressInterCU_dist(parentCTU, childGeom, nextQP);
 
                 // Save best CU and pred data for this sub CU
                 splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
@@ -721,7 +726,7 @@
         else
             updateModeCost(*splitPred);
 
-        checkDQPForSplitPred(splitPred->cu, cuGeom);
+        checkDQPForSplitPred(*splitPred, cuGeom);
         checkBestMode(*splitPred, depth);
     }
 
@@ -741,7 +746,7 @@
         md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
 }
 
-void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom)
+void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
 {
     uint32_t depth = cuGeom.depth;
     uint32_t cuAddr = parentCTU.m_cuAddr;
@@ -757,8 +762,8 @@
         bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;
 
         /* Compute Merge Cost */
-        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
-        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
+        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
         checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
 
         bool earlyskip = false;
@@ -767,30 +772,30 @@
 
         if (!earlyskip)
         {
-            md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
             checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N);
 
             if (m_slice->m_sliceType == B_SLICE)
             {
-                md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
                 checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom);
             }
 
             Mode *bestInter = &md.pred[PRED_2Nx2N];
             if (m_param->bEnableRectInter)
             {
-                md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                 checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N);
                 if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost)
                     bestInter = &md.pred[PRED_Nx2N];
 
-                md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
                 checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN);
                 if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
                     bestInter = &md.pred[PRED_2NxN];
             }
 
-            if (m_slice->m_sps->maxAMPDepth > depth && cuGeom.log2CUSize < 6)
+            if (m_slice->m_sps->maxAMPDepth > depth)
             {
                 bool bHor = false, bVer = false;
                 if (bestInter->cu.m_partSize[0] == SIZE_2NxN)
@@ -806,24 +811,24 @@
 
                 if (bHor)
                 {
-                    md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU);
                     if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost)
                         bestInter = &md.pred[PRED_2NxnU];
 
-                    md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD);
                     if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
                         bestInter = &md.pred[PRED_2NxnD];
                 }
                 if (bVer)
                 {
-                    md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N);
                     if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost)
                         bestInter = &md.pred[PRED_nLx2N];
 
-                    md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N);
                     if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
                         bestInter = &md.pred[PRED_nRx2N];
@@ -855,7 +860,7 @@
                 if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) ||
                     md.bestMode->sa8dCost == MAX_INT64)
                 {
-                    md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
                     encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
                     checkBestMode(md.pred[PRED_INTRA], depth);
@@ -873,7 +878,7 @@
 
                 if (bTryIntra || md.bestMode->sa8dCost == MAX_INT64)
                 {
-                    md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
                     if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost)
                         md.bestMode = &md.pred[PRED_INTRA];
@@ -901,7 +906,6 @@
                     {
                         /* generate recon pixels with no rate distortion considerations */
                         CUData& cu = md.bestMode->cu;
-                        m_quant.setQPforQuant(cu);
 
                         uint32_t tuDepthRange[2];
                         cu.getInterTUQtDepthRange(tuDepthRange, 0);
@@ -926,7 +930,6 @@
                     {
                         /* generate recon pixels with no rate distortion considerations */
                         CUData& cu = md.bestMode->cu;
-                        m_quant.setQPforQuant(cu);
 
                         uint32_t tuDepthRange[2];
                         cu.getIntraTUQtDepthRange(tuDepthRange, 0);
@@ -960,12 +963,13 @@
         Mode* splitPred = &md.pred[PRED_SPLIT];
         splitPred->initCosts();
         CUData* splitCU = &splitPred->cu;
-        splitCU->initSubCU(parentCTU, cuGeom);
+        splitCU->initSubCU(parentCTU, cuGeom, qp);
 
         uint32_t nextDepth = depth + 1;
         ModeDepth& nd = m_modeDepth[nextDepth];
         invalidateContexts(nextDepth);
         Entropy* nextContext = &m_rqt[depth].cur;
+        int nextQP = qp;
 
         for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
         {
@@ -974,7 +978,11 @@
             {
                 m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                 m_rqt[nextDepth].cur.load(*nextContext);
-                compressInterCU_rd0_4(parentCTU, childGeom);
+
+                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+                compressInterCU_rd0_4(parentCTU, childGeom, nextQP);
 
                 // Save best CU and pred data for this sub CU
                 splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
@@ -1006,7 +1014,7 @@
         else if (splitPred->sa8dCost < md.bestMode->sa8dCost)
             md.bestMode = splitPred;
 
-        checkDQPForSplitPred(md.bestMode->cu, cuGeom);
+        checkDQPForSplitPred(*md.bestMode, cuGeom);
     }
     if (mightNotSplit)
     {
@@ -1025,7 +1033,7 @@
         md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
 }
 
-void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder)
+void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp)
 {
     uint32_t depth = cuGeom.depth;
     ModeDepth& md = m_modeDepth[depth];
@@ -1040,8 +1048,8 @@
         uint8_t* reuseModes  = &m_reuseInterDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
         if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx && reuseModes[zOrder] == MODE_SKIP)
         {
-            md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
-            md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
+            md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
             checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, true);
 
             if (m_bTryLossless)
@@ -1060,20 +1068,20 @@
 
     if (mightNotSplit)
     {
-        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
-        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
+        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
         checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false);
         bool earlySkip = m_param->bEnableEarlySkip && md.bestMode && !md.bestMode->cu.getQtRootCbf(0);
 
         if (!earlySkip)
         {
-            md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom);
-            checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, false);
+            md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+            checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N);
             checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth);
 
             if (m_slice->m_sliceType == B_SLICE)
             {
-                md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
                 checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom);
                 if (md.pred[PRED_BIDIR].sa8dCost < MAX_INT64)
                 {
@@ -1084,20 +1092,18 @@
 
             if (m_param->bEnableRectInter)
             {
-                md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom);
-                checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, false);
+                md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N);
                 checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth);
 
-                md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom);
-                checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, false);
+                md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+                checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN);
                 checkBestMode(md.pred[PRED_2NxN], cuGeom.depth);
             }
 
             // Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N)
             if (m_slice->m_sps->maxAMPDepth > depth)
             {
-                bool bMergeOnly = cuGeom.log2CUSize == 6;
-
                 bool bHor = false, bVer = false;
                 if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN)
                     bHor = true;
@@ -1111,35 +1117,35 @@
 
                 if (bHor)
                 {
-                    md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom);
-                    checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, bMergeOnly);
+                    md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
+                    checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU);
                     checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth);
 
-                    md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom);
-                    checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, bMergeOnly);
+                    md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+                    checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD);
                     checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth);
                 }
                 if (bVer)
                 {
-                    md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom);
-                    checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, bMergeOnly);
+                    md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                    checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N);
                     checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth);
 
-                    md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom);
-                    checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, bMergeOnly);
+                    md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                    checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N);
                     checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth);
                 }
             }
 
             if (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames)
             {
-                md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+                md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
                 checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
                 checkBestMode(md.pred[PRED_INTRA], depth);
 
                 if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
                 {
-                    md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+                    md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
                     checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
                     checkBestMode(md.pred[PRED_INTRA_NxN], depth);
                 }
@@ -1159,12 +1165,13 @@
         Mode* splitPred = &md.pred[PRED_SPLIT];
         splitPred->initCosts();
         CUData* splitCU = &splitPred->cu;
-        splitCU->initSubCU(parentCTU, cuGeom);
+        splitCU->initSubCU(parentCTU, cuGeom, qp);
 
         uint32_t nextDepth = depth + 1;
         ModeDepth& nd = m_modeDepth[nextDepth];
         invalidateContexts(nextDepth);
         Entropy* nextContext = &m_rqt[depth].cur;
+        int nextQP = qp;
 
         for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
638
         {
639
@@ -1173,7 +1180,11 @@
640
             {
641
                 m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
642
                 m_rqt[nextDepth].cur.load(*nextContext);
643
-                compressInterCU_rd5_6(parentCTU, childGeom, zOrder);
644
+
645
+                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
646
+                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
647
+
648
+                compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP);
649
 
650
                 // Save best CU and pred data for this sub CU
651
                 splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
652
@@ -1193,7 +1204,7 @@
653
         else
654
             updateModeCost(*splitPred);
655
 
656
-        checkDQPForSplitPred(splitPred->cu, cuGeom);
657
+        checkDQPForSplitPred(*splitPred, cuGeom);
658
         checkBestMode(*splitPred, depth);
659
     }
660
 
661
@@ -1308,7 +1319,7 @@
662
     md.bestMode->cu.setPUMv(1, candMvField[bestSadCand][1].mv, 0, 0);
663
     md.bestMode->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0);
664
     md.bestMode->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0);
665
-    checkDQP(md.bestMode->cu, cuGeom);
666
+    checkDQP(*md.bestMode, cuGeom);
667
     X265_CHECK(md.bestMode->ok(), "Merge mode not ok\n");
668
 }
669
 
670
@@ -1440,7 +1451,7 @@
671
         bestPred->cu.setPUMv(1, candMvField[bestCand][1].mv, 0, 0);
672
         bestPred->cu.setPURefIdx(0, (int8_t)candMvField[bestCand][0].refIdx, 0, 0);
673
         bestPred->cu.setPURefIdx(1, (int8_t)candMvField[bestCand][1].refIdx, 0, 0);
674
-        checkDQP(bestPred->cu, cuGeom);
675
+        checkDQP(*bestPred, cuGeom);
676
         X265_CHECK(bestPred->ok(), "merge mode is not ok");
677
     }
678
 
679
@@ -1472,7 +1483,7 @@
680
         }
681
     }
682
 
683
-    predInterSearch(interMode, cuGeom, false, m_bChromaSa8d);
684
+    predInterSearch(interMode, cuGeom, m_bChromaSa8d);
685
 
686
     /* predInterSearch sets interMode.sa8dBits */
687
     const Yuv& fencYuv = *interMode.fencYuv;
688
@@ -1500,7 +1511,7 @@
689
     }
690
 }
691
 
692
-void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly)
693
+void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize)
694
 {
695
     interMode.initCosts();
696
     interMode.cu.setPartSizeSubParts(partSize);
697
@@ -1520,7 +1531,7 @@
698
         }
699
     }
700
 
701
-    predInterSearch(interMode, cuGeom, bMergeOnly, true);
702
+    predInterSearch(interMode, cuGeom, true);
703
 
704
     /* predInterSearch sets interMode.sa8dBits, but this is ignored */
705
     encodeResAndCalcRdInterCU(interMode, cuGeom);
706
@@ -1642,8 +1653,8 @@
707
         uint32_t zcost = zsa8d + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1);
708
 
709
         /* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */
710
-        checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvp0, mvpIdx0, bits0, zcost);
711
-        checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvp1, mvpIdx1, bits1, zcost);
712
+        mvp0 = checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvpIdx0, bits0, zcost);
713
+        mvp1 = checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvpIdx1, bits1, zcost);
714
 
715
         uint32_t zbits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
716
         zcost = zsa8d + m_rdCost.getCost(zbits);
717
@@ -1697,7 +1708,6 @@
718
     CUData& cu = bestMode->cu;
719
 
720
     cu.copyFromPic(ctu, cuGeom);
721
-    m_quant.setQPforQuant(cu);
722
 
723
     Yuv& fencYuv = m_modeDepth[cuGeom.depth].fencYuv;
724
     if (cuGeom.depth)
725
@@ -1913,37 +1923,39 @@
726
     return false;
727
 }
728
 
729
-int Analysis::calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom)
730
+int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom)
731
 {
732
-    uint32_t ctuAddr = ctu.m_cuAddr;
733
     FrameData& curEncData = *m_frame->m_encData;
734
-    double qp = curEncData.m_cuStat[ctuAddr].baseQp;
735
-
736
-    uint32_t width = m_frame->m_fencPic->m_picWidth;
737
-    uint32_t height = m_frame->m_fencPic->m_picHeight;
738
-    uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
739
-    uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
740
-    uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
741
-    uint32_t blockSize = g_maxCUSize >> cuGeom.depth;
742
-    double qp_offset = 0;
743
-    uint32_t cnt = 0;
744
-    uint32_t idx;
745
+    double qp = curEncData.m_cuStat[ctu.m_cuAddr].baseQp;
746
 
747
     /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */
748
     bool isReferenced = IS_REFERENCED(m_frame);
749
     double *qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset;
750
-
751
-    for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16)
752
+    if (qpoffs)
753
     {
754
-        for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16)
755
+        uint32_t width = m_frame->m_fencPic->m_picWidth;
756
+        uint32_t height = m_frame->m_fencPic->m_picHeight;
757
+        uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
758
+        uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
759
+        uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
760
+        uint32_t blockSize = g_maxCUSize >> cuGeom.depth;
761
+        double qp_offset = 0;
762
+        uint32_t cnt = 0;
763
+        uint32_t idx;
764
+
765
+        for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16)
766
         {
767
-            idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16);
768
-            qp_offset += qpoffs[idx];
769
-            cnt++;
770
+            for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16)
771
+            {
772
+                idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16);
773
+                qp_offset += qpoffs[idx];
774
+                cnt++;
775
+            }
776
         }
777
+
778
+        qp_offset /= cnt;
779
+        qp += qp_offset;
780
     }
781
 
782
-    qp_offset /= cnt;
783
-    qp += qp_offset;
784
     return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5));
785
 }
786
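The reworked `calculateQpforCuSize` above averages the lowres per-16x16 QP offsets (cuTree or AQ) covering a CU and adds the result to the CTU's base QP. A minimal standalone sketch of that averaging, with hypothetical names (`averageCuQp`) and the 0..69 clip range assumed from x265's QP_MIN/QP_MAX_MAX:

```cpp
#include <algorithm>
#include <vector>

// Sketch: average the per-16x16 lowres QP offsets that overlap a CU,
// add them to the CTU base QP, and clip to a legal QP range.
// The 0..69 bounds stand in for x265's QP_MIN/QP_MAX_MAX (an assumption).
static int averageCuQp(double baseQp, const std::vector<double>& qpoffs,
                       int picWidth, int picHeight,
                       int cuX, int cuY, int cuSize)
{
    const int maxCols = (picWidth + 15) / 16; // 16x16 lowres block grid
    double offset = 0;
    int cnt = 0;
    for (int y = cuY; y < cuY + cuSize && y < picHeight; y += 16)
        for (int x = cuX; x < cuX + cuSize && x < picWidth; x += 16)
        {
            offset += qpoffs[(y / 16) * maxCols + x / 16];
            cnt++;
        }
    if (cnt)
        baseQp += offset / cnt; // mirrors qp_offset /= cnt; qp += qp_offset
    return std::min(69, std::max(0, (int)(baseQp + 0.5)));
}
```

The `if (cnt)` guard also reflects the new `if (qpoffs)` path: when no offsets apply, the base QP passes through unchanged.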
x265_1.6.tar.gz/source/encoder/analysis.h -> x265_1.7.tar.gz/source/encoder/analysis.h Changed
 
@@ -109,12 +109,12 @@
     uint32_t*            m_reuseBestMergeCand;
 
     /* full analysis for an I-slice CU */
-    void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+    void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
 
     /* full analysis for a P or B slice CU */
-    void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom);
-    void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom);
-    void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+    void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+    void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+    void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
 
     /* measure merge and skip */
     void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);
@@ -122,7 +122,7 @@
 
     /* measure inter options */
     void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
-    void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly);
+    void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
 
     void checkBidir2Nx2N(Mode& inter2Nx2N, Mode& bidir2Nx2N, const CUGeom& cuGeom);
 
@@ -139,7 +139,7 @@
     /* generate residual and recon pixels for an entire CTU recursively (RD0) */
     void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);
 
-    int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom);
+    int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom);
 
     /* check whether current mode is the new best */
     inline void checkBestMode(Mode& mode, uint32_t depth)
x265_1.6.tar.gz/source/encoder/api.cpp -> x265_1.7.tar.gz/source/encoder/api.cpp Changed
 
@@ -39,9 +39,11 @@
     if (!p)
         return NULL;
 
-    x265_param *param = X265_MALLOC(x265_param, 1);
-    if (!param)
-        return NULL;
+    Encoder* encoder = NULL;
+    x265_param* param = x265_param_alloc();
+    x265_param* latestParam = x265_param_alloc();
+    if (!param || !latestParam)
+        goto fail;
 
     memcpy(param, p, sizeof(x265_param));
     x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str);
@@ -50,38 +52,44 @@
     x265_setup_primitives(param, param->cpuid);
 
     if (x265_check_params(param))
-        return NULL;
+        goto fail;
 
     if (x265_set_globals(param))
-        return NULL;
+        goto fail;
 
-    Encoder *encoder = new Encoder;
+    encoder = new Encoder;
     if (!param->rc.bEnableSlowFirstPass)
         x265_param_apply_fastfirstpass(param);
 
     // may change params for auto-detect, etc
     encoder->configure(param);
-    
     // may change rate control and CPB params
     if (!enforceLevel(*param, encoder->m_vps))
-    {
-        delete encoder;
-        return NULL;
-    }
+        goto fail;
 
     // will detect and set profile/tier/level in VPS
     determineLevel(*param, encoder->m_vps);
 
-    encoder->create();
-    if (encoder->m_aborted)
+    if (!param->bAllowNonConformance && encoder->m_vps.ptl.profileIdc == Profile::NONE)
     {
-        delete encoder;
-        return NULL;
+        x265_log(param, X265_LOG_INFO, "non-conformant bitstreams not allowed (--allow-non-conformance)\n");
+        goto fail;
     }
 
-    x265_print_params(param);
+    encoder->create();
+    encoder->m_latestParam = latestParam;
+    memcpy(latestParam, param, sizeof(x265_param));
+    if (encoder->m_aborted)
+        goto fail;
 
+    x265_print_params(param);
     return encoder;
+
+fail:
+    delete encoder;
+    x265_param_free(param);
+    x265_param_free(latestParam);
+    return NULL;
 }
 
 extern "C"
@@ -112,6 +120,27 @@
 }
 
 extern "C"
+int x265_encoder_reconfig(x265_encoder* enc, x265_param* param_in)
+{
+    if (!enc || !param_in)
+        return -1;
+
+    x265_param save;
+    Encoder* encoder = static_cast<Encoder*>(enc);
+    memcpy(&save, encoder->m_latestParam, sizeof(x265_param));
+    int ret = encoder->reconfigureParam(encoder->m_latestParam, param_in);
+    if (ret)
+        /* reconfigure failed, recover saved param set */
+        memcpy(encoder->m_latestParam, &save, sizeof(x265_param));
+    else
+    {
+        encoder->m_reconfigured = true;
+        x265_print_reconfigured_params(&save, encoder->m_latestParam);
+    }
+    return ret;
+}
+
+extern "C"
 int x265_encoder_encode(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out)
 {
     if (!enc)
@@ -173,19 +202,22 @@
     {
         Encoder *encoder = static_cast<Encoder*>(enc);
 
-        encoder->stop();
+        encoder->stopJobs();
         encoder->printSummary();
         encoder->destroy();
         delete encoder;
+        ATOMIC_DEC(&g_ctuSizeConfigured);
     }
 }
 
 extern "C"
 void x265_cleanup(void)
 {
-    BitCost::destroy();
-    CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
-    g_ctuSizeConfigured = 0;
+    if (!g_ctuSizeConfigured)
+    {
+        BitCost::destroy();
+        CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
+    }
 }
 
 extern "C"
@@ -232,6 +264,7 @@
     &x265_picture_init,
     &x265_encoder_open,
     &x265_encoder_parameters,
+    &x265_encoder_reconfig,
     &x265_encoder_headers,
     &x265_encoder_encode,
     &x265_encoder_get_stats,
@@ -243,11 +276,66 @@
     x265_max_bit_depth,
 };
 
+typedef const x265_api* (*api_get_func)(int bitDepth);
+
+#define xstr(s) str(s)
+#define str(s) #s
+
+#if _WIN32
+#define ext ".dll"
+#elif MACOS
+#include <dlfcn.h>
+#define ext ".dylib"
+#else
+#include <dlfcn.h>
+#define ext ".so"
+#endif
+
 extern "C"
 const x265_api* x265_api_get(int bitDepth)
 {
     if (bitDepth && bitDepth != X265_DEPTH)
-        return NULL;
+    {
+        const char* libname = NULL;
+        const char* method = "x265_api_get_" xstr(X265_BUILD);
+
+        if (bitDepth == 12)
+            libname = "libx265_main12" ext;
+        else if (bitDepth == 10)
+            libname = "libx265_main10" ext;
+        else if (bitDepth == 8)
+            libname = "libx265_main" ext;
+        else
+            return NULL;
+
+        const x265_api* api = NULL;
+
+#if _WIN32
+        HMODULE h = LoadLibraryA(libname);
+        if (h)
+        {
+            api_get_func get = (api_get_func)GetProcAddress(h, method);
+            if (get)
+                api = get(0);
+        }
+#else
+        void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
+        if (h)
+        {
+            api_get_func get = (api_get_func)dlsym(h, method);
+            if (get)
+                api = get(0);
+        }
+#endif
+
+        if (api && bitDepth != api->max_bit_depth)
+        {
+            x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth);
+            return NULL;
+        }
+
+        return api;
+    }
 
     return &libapi;
 }
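The new `x265_api_get` path above implements the 1.7 multi-library interface: when the requested bit depth differs from the built-in `X265_DEPTH`, it maps the depth to a sibling library name and resolves `x265_api_get_<X265_BUILD>` from it via `dlopen`/`LoadLibraryA`. The name-selection step can be isolated as a pure function (a sketch; `apiLibName` is a hypothetical helper, and ".so" stands in for the platform extension chosen by the `#if` ladder in the real code):

```cpp
#include <string>

// Sketch of x265_api_get's bit-depth -> library-name mapping.
// Each bit depth corresponds to a separately built libx265;
// an unsupported depth yields an empty name (the real code returns NULL).
static std::string apiLibName(int bitDepth)
{
    switch (bitDepth)
    {
    case 12: return "libx265_main12.so";
    case 10: return "libx265_main10.so";
    case 8:  return "libx265_main.so";
    default: return ""; // x265_api_get() would return NULL here
    }
}
```

After loading, the diff double-checks `api->max_bit_depth` against the request, so a mislabeled library is rejected with a warning rather than used.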
x265_1.6.tar.gz/source/encoder/encoder.cpp -> x265_1.7.tar.gz/source/encoder/encoder.cpp Changed
 
@@ -58,6 +58,7 @@
 Encoder::Encoder()
 {
     m_aborted = false;
+    m_reconfigured = false;
     m_encodedFrameNum = 0;
     m_pocLast = -1;
     m_curEncoder = 0;
@@ -73,6 +74,7 @@
     m_outputCount = 0;
     m_csvfpt = NULL;
     m_param = NULL;
+    m_latestParam = NULL;
     m_cuOffsetY = NULL;
     m_cuOffsetC = NULL;
     m_buOffsetY = NULL;
@@ -106,7 +108,7 @@
     bool allowPools = !p->numaPools || strcmp(p->numaPools, "none");
 
     // Trim the thread pool if --wpp, --pme, and --pmode are disabled
-    if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation)
+    if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices)
         allowPools = false;
 
     if (!p->frameNumThreads)
@@ -140,9 +142,11 @@
             x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pme disabled\n");
         if (p->bDistributeModeAnalysis)
             x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pmode disabled\n");
+        if (p->lookaheadSlices)
+            x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --lookahead-slices disabled\n");
 
         // disable all pool features if the thread pool is disabled or unusable.
-        p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0;
+        p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0;
     }
 
     char buf[128];
@@ -159,7 +163,10 @@
     x265_log(p, X265_LOG_INFO, "frame threads / pool features       : %d / %s\n", p->frameNumThreads, buf);
 
     for (int i = 0; i < m_param->frameNumThreads; i++)
+    {
         m_frameEncoder[i] = new FrameEncoder;
+        m_frameEncoder[i]->m_nalList.m_annexB = !!m_param->bAnnexB;
+    }
 
     if (m_numPools)
     {
@@ -287,15 +294,17 @@
     m_aborted |= parseLambdaFile(m_param);
 
     m_encodeStartTime = x265_mdate();
+
+    m_nalList.m_annexB = !!m_param->bAnnexB;
 }
 
-void Encoder::stop()
+void Encoder::stopJobs()
 {
     if (m_rateControl)
         m_rateControl->terminate(); // unblock all blocked RC calls
 
     if (m_lookahead)
-        m_lookahead->stop();
+        m_lookahead->stopJobs();
     
     for (int i = 0; i < m_param->frameNumThreads; i++)
     {
@@ -309,7 +318,7 @@
     }
 
     if (m_threadPool)
-        m_threadPool->stop();
+        m_threadPool->stopWorkers();
 }
 
 void Encoder::destroy()
@@ -358,15 +367,20 @@
 
     if (m_param)
     {
-        free((void*)m_param->rc.lambdaFileName); // allocs by strdup
-        free(m_param->rc.statFileName);
-        free(m_param->analysisFileName);
-        free((void*)m_param->scalingLists);
-        free(m_param->csvfn);
-        free(m_param->numaPools);
+        /* release string arguments that were strdup'd */
+        free((char*)m_param->rc.lambdaFileName);
+        free((char*)m_param->rc.statFileName);
+        free((char*)m_param->analysisFileName);
+        free((char*)m_param->scalingLists);
+        free((char*)m_param->csvfn);
+        free((char*)m_param->numaPools);
+        free((char*)m_param->masteringDisplayColorVolume);
+        free((char*)m_param->contentLightLevelInfo);
 
-        X265_FREE(m_param);
+        x265_param_free(m_param);
     }
+
+    x265_param_free(m_latestParam);
 }
 
 void Encoder::updateVbvPlan(RateControl* rc)
@@ -436,7 +450,8 @@
         if (m_dpb->m_freeList.empty())
         {
             inFrame = new Frame;
-            if (inFrame->create(m_param))
+            x265_param* p = m_reconfigured? m_latestParam : m_param;
+            if (inFrame->create(p))
             {
                 /* the first PicYuv created is asked to generate the CU and block unit offset
                  * arrays which are then shared with all subsequent PicYuv (orig and recon) 
@@ -477,7 +492,10 @@
             }
         }
         else
+        {
             inFrame = m_dpb->m_freeList.popBack();
+            inFrame->m_lowresInit = false;
+        }
 
         /* Copy input picture into a Frame and PicYuv, send to lookahead */
         inFrame->m_fencPic->copyFromPicture(*pic_in, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);
@@ -486,6 +504,7 @@
         inFrame->m_userData  = pic_in->userData;
         inFrame->m_pts       = pic_in->pts;
         inFrame->m_forceqp   = pic_in->forceqp;
+        inFrame->m_param     = m_reconfigured ? m_latestParam : m_param;
 
         if (m_pocLast == 0)
             m_firstPts = inFrame->m_pts;
@@ -717,6 +736,34 @@
     return ret;
 }
 
+int Encoder::reconfigureParam(x265_param* encParam, x265_param* param)
+{
+    encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers
+    encParam->bEnableLoopFilter = param->bEnableLoopFilter;
+    encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset;
+    encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; 
+    encParam->bEnableFastIntra = param->bEnableFastIntra;
+    encParam->bEnableEarlySkip = param->bEnableEarlySkip;
+    encParam->bEnableTemporalMvp = param->bEnableTemporalMvp;
+    /* Scratch buffer prevents me_range from being increased for esa/tesa
+    if (param->searchMethod < X265_FULL_SEARCH || param->searchMethod < encParam->searchRange)
+        encParam->searchRange = param->searchRange; */
+    encParam->noiseReductionInter = param->noiseReductionInter;
+    encParam->noiseReductionIntra = param->noiseReductionIntra;
+    /* We can't switch out of subme=0 during encoding. */
+    if (encParam->subpelRefine)
+        encParam->subpelRefine = param->subpelRefine;
+    encParam->rdoqLevel = param->rdoqLevel;
+    encParam->rdLevel = param->rdLevel;
+    encParam->bEnableTSkipFast = param->bEnableTSkipFast;
+    encParam->psyRd = param->psyRd;
+    encParam->psyRdoq = param->psyRdoq;
+    encParam->bEnableSignHiding = param->bEnableSignHiding;
+    encParam->bEnableFastIntra = param->bEnableFastIntra;
+    encParam->maxTUSize = param->maxTUSize;
+    return x265_check_params(encParam);
+}
+
 void EncStats::addPsnr(double psnrY, double psnrU, double psnrV)
 {
     m_psnrSumY += psnrY;
@@ -1430,6 +1477,34 @@
     bs.writeByteAlignment();
     list.serialize(NAL_UNIT_PPS, bs);
 
+    if (m_param->masteringDisplayColorVolume)
+    {
+        SEIMasteringDisplayColorVolume mdsei;
+        if (mdsei.parse(m_param->masteringDisplayColorVolume))
+        {
+            bs.resetBits();
+            mdsei.write(bs, m_sps);
+            bs.writeByteAlignment();
+            list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+        }
+        else
+            x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n");
+    }
+
+    if (m_param->contentLightLevelInfo)
+    {
+        SEIContentLightLevel cllsei;
+        if (cllsei.parse(m_param->contentLightLevelInfo))
+        {
+            bs.resetBits();
+            cllsei.write(bs, m_sps);
+            bs.writeByteAlignment();
+            list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+        }
+        else
+            x265_log(m_param, X265_LOG_WARNING, "unable to parse content light level info\n");
+    }
+
     if (m_param->bEmitInfoSEI)
     {
         char *opts = x265_param2string(m_param);
@@ -1559,7 +1634,8 @@
     if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv))
     {
         pps->bUseDQP = true;
-        pps->maxCuDQPDepth = 0; /* TODO: make configurable? */
+        pps->maxCuDQPDepth = g_log2Size[m_param->maxCUSize] - g_log2Size[m_param->rc.qgSize];
+        X265_CHECK(pps->maxCuDQPDepth <= 2, "max CU DQP depth cannot be greater than 2\n");
     }
     else
     {
@@ -1788,6 +1864,23 @@
         p->analysisMode = X265_ANALYSIS_OFF;
         x265_log(p, X265_LOG_WARNING, "Analysis save and load mode not supported for distributed mode analysis\n");
     }
+
+    bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
+    if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv))
+    {
+        if (p->rc.qgSize < X265_MAX(16, p->minCUSize))
+        {
+            p->rc.qgSize = X265_MAX(16, p->minCUSize);
+            x265_log(p, X265_LOG_WARNING, "QGSize should be greater than or equal to 16 and minCUSize, setting QGSize = %d\n", p->rc.qgSize);
+        }
+        if (p->rc.qgSize > p->maxCUSize)
+        {
+            p->rc.qgSize = p->maxCUSize;
+            x265_log(p, X265_LOG_WARNING, "QGSize should be less than or equal to maxCUSize, setting QGSize = %d\n", p->rc.qgSize);
+        }
+    }
+    else
+        m_param->rc.qgSize = p->maxCUSize;
 }
 
 void Encoder::allocAnalysis(x265_analysis_data* analysis)
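The hunks above wire up the new `--qg-size` option: `Encoder::configure` clamps `qgSize` into [max(16, minCUSize), maxCUSize], and the PPS then derives `maxCuDQPDepth` as the log2 difference between the CTU size and the quantization group size. A minimal sketch of those two rules (hypothetical helper names `clampQgSize` and `maxCuDQPDepth`, standing in for the inline logic and the `g_log2Size` table):

```cpp
#include <algorithm>

// Integer log2 for power-of-two sizes; stands in for x265's g_log2Size[].
static int log2u(int v) { int r = 0; while (v > 1) { v >>= 1; r++; } return r; }

// Sketch of the configure() rule: qgSize must lie in
// [max(16, minCUSize), maxCUSize]; out-of-range values are clamped
// (the real code also logs a warning).
static int clampQgSize(int qgSize, int minCUSize, int maxCUSize)
{
    qgSize = std::max(qgSize, std::max(16, minCUSize));
    return std::min(qgSize, maxCUSize);
}

// Sketch of the initPPS() rule: finer QGs mean a deeper CU DQP depth,
// bounded at 2 by the X265_CHECK in the diff.
static int maxCuDQPDepth(int maxCUSize, int qgSize)
{
    return log2u(maxCUSize) - log2u(qgSize);
}
```

So a 64x64 CTU with `--qg-size 16` signals a DQP depth of 2, which is why the AQ offsets in `calculateQpforCuSize` can now vary below the CTU level.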
x265_1.6.tar.gz/source/encoder/encoder.h -> x265_1.7.tar.gz/source/encoder/encoder.h Changed
 
@@ -125,22 +125,26 @@
     uint32_t           m_numDelayedPic;
 
     x265_param*        m_param;
+    x265_param*        m_latestParam;
     RateControl*       m_rateControl;
     Lookahead*         m_lookahead;
     Window             m_conformanceWindow;
 
     bool               m_bZeroLatency;     // x265_encoder_encode() returns NALs for the input picture, zero lag
     bool               m_aborted;          // fatal error detected
+    bool               m_reconfigured;      // reconfigure of encoder detected
 
     Encoder();
     ~Encoder() {}
 
     void create();
-    void stop();
+    void stopJobs();
     void destroy();
 
     int encode(const x265_picture* pic, x265_picture *pic_out);
 
+    int reconfigureParam(x265_param* encParam, x265_param* param);
+
     void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs);
 
     void fetchStats(x265_stats* stats, size_t statsSizeBytes);
x265_1.6.tar.gz/source/encoder/entropy.cpp -> x265_1.7.tar.gz/source/encoder/entropy.cpp Changed
 
@@ -585,7 +585,7 @@
         if (ctu.isSkipped(absPartIdx))
         {
             codeMergeIndex(ctu, absPartIdx);
-            finishCU(ctu, absPartIdx, depth);
+            finishCU(ctu, absPartIdx, depth, bEncodeDQP);
             return;
         }
         codePredMode(ctu.m_predMode[absPartIdx]);
@@ -606,7 +606,7 @@
     codeCoeff(ctu, absPartIdx, bEncodeDQP, tuDepthRange);
 
     // --- write terminating bit ---
-    finishCU(ctu, absPartIdx, depth);
+    finishCU(ctu, absPartIdx, depth, bEncodeDQP);
 }
 
 /* Return bit count of signaling inter mode */
@@ -658,7 +658,7 @@
 }
 
 /* finish encoding a cu and handle end-of-slice conditions */
-void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth)
+void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bCodeDQP)
 {
     const Slice* slice = ctu.m_slice;
     uint32_t realEndAddress = slice->m_endCUAddr;
@@ -672,6 +672,9 @@
     bool granularityBoundary = (((rpelx & granularityMask) == 0 || (rpelx == slice->m_sps->picWidthInLumaSamples )) &&
                                 ((bpely & granularityMask) == 0 || (bpely == slice->m_sps->picHeightInLumaSamples)));
 
+    if (slice->m_pps->bUseDQP)
+        const_cast<CUData&>(ctu).setQPSubParts(bCodeDQP ? ctu.getRefQP(absPartIdx) : ctu.m_qp[absPartIdx], absPartIdx, depth);
+
     if (granularityBoundary)
     {
         // Encode slice finish
@@ -1141,11 +1144,11 @@
     {
         length = 0;
         codeNumber = (codeNumber >> absGoRice) - COEF_REMAIN_BIN_REDUCTION;
-        if (codeNumber != 0)
         {
             unsigned long idx;
             CLZ(idx, codeNumber + 1);
             length = idx;
+            X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
             codeNumber -= (1 << idx) - 1;
         }
         codeNumber = (codeNumber << absGoRice) + codeRemain;
@@ -1461,7 +1464,7 @@
     //const uint32_t maskPosXY = ((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1;
     X265_CHECK((uint32_t)((1 << (log2TrSize - MLS_CG_LOG2_SIZE)) - 1) == (((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1), "maskPosXY fault\n");
 
-    scanPosLast = primitives.findPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig);
+    scanPosLast = primitives.scanPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codingParameters.scanType], trSize);
     posLast = codingParameters.scan[scanPosLast];
 
     const int lastScanSet = scanPosLast >> MLS_CG_SIZE;
@@ -1515,7 +1518,6 @@
     uint8_t * const baseCoeffGroupCtx = &m_contextState[OFF_SIG_CG_FLAG_CTX + (bIsLuma ? 0 : NUM_SIG_CG_FLAG_CTX)];
     uint8_t * const baseCtx = bIsLuma ? &m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
     uint32_t c1 = 1;
-    uint32_t goRiceParam = 0;
     int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1;
     int absCoeff[1 << MLS_CG_SIZE];
     int numNonZero = 1;
@@ -1529,7 +1531,6 @@
         const uint32_t subCoeffFlag = coeffFlag[subSet];
         uint32_t scanFlagMask = subCoeffFlag;
         int subPosBase = subSet << MLS_CG_SIZE;
-        goRiceParam    = 0;
         
         if (subSet == lastScanSet)
         {
@@ -1548,7 +1549,7 @@
         else
         {
             uint32_t sigCoeffGroup = ((sigCoeffGroupFlag64 & cgBlkPosMask) != 0);
-            uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+            uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
             encodeBin(sigCoeffGroup, baseCoeffGroupCtx[ctxSig]);
         }
 
@@ -1556,7 +1557,8 @@
         if (sigCoeffGroupFlag64 & cgBlkPosMask)
         {
             X265_CHECK((log2TrSize != 2) || (log2TrSize == 2 && subSet == 0), "log2TrSize and subSet mistake!\n");
-            const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+            const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+            const uint32_t posOffset = (bIsLuma && subSet) ? 3 : 0;
 
             static const uint8_t ctxIndMap4x4[16] =
             {
@@ -1566,37 +1568,50 @@
                 7, 7, 8, 8
             };
             // NOTE: [patternSigCtx][posXinSubset][posYinSubset]
-            static const uint8_t table_cnt[4][4][4] =
+            static const uint8_t table_cnt[4][SCAN_SET_SIZE] =
             {
                 // patternSigCtx = 0
                 {
-                    { 2, 1, 1, 0 },
-                    { 1, 1, 0, 0 },
-                    { 1, 0, 0, 0 },
-                    { 0, 0, 0, 0 },
+                    2, 1, 1, 0,
+                    1, 1, 0, 0,
+                    1, 0, 0, 0,
+                    0, 0, 0, 0,
                 },
                 // patternSigCtx = 1
                 {
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
+                    2, 2, 2, 2,
+                    1, 1, 1, 1,
+                    0, 0, 0, 0,
+                    0, 0, 0, 0,
                 },
                 // patternSigCtx = 2
                 {
-                    { 2, 2, 2, 2 },
-                    { 1, 1, 1, 1 },
-                    { 0, 0, 0, 0 },
-                    { 0, 0, 0, 0 },
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
                 },
                 // patternSigCtx = 3
                 {
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
145
                 }
146
             };
147
+
148
+            const int offset = codingParameters.firstSignificanceMapContext;
149
+            ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]);
150
+            // TODO: accelerate by PABSW
151
+            const uint32_t blkPosBase  = codingParameters.scan[subPosBase];
152
+            for (int i = 0; i < MLS_CG_SIZE; i++)
153
+            {
154
+                tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]);
155
+                tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]);
156
+                tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]);
157
+                tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]);
158
+            }
159
+
160
             if (m_bitIf)
161
             {
162
                 if (log2TrSize == 2)
163
@@ -1604,16 +1619,16 @@
164
                     uint32_t blkPos, sig, ctxSig;
165
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
166
                     {
167
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
168
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
169
                         sig     = scanFlagMask & 1;
170
                         scanFlagMask >>= 1;
171
-                        X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
172
+                        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
173
                         {
174
                             ctxSig = ctxIndMap4x4[blkPos];
175
                             X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
176
                             encodeBin(sig, baseCtx[ctxSig]);
177
                         }
178
-                        absCoeff[numNonZero] = int(abs(coeff[blkPos]));
179
+                        absCoeff[numNonZero] = tmpCoeff[blkPos];
180
                         numNonZero += sig;
181
                     }
182
                 }
183
@@ -1621,35 +1636,25 @@
184
                 {
185
                     X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n");
186
 
187
-                    const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx];
188
-                    const int offset = codingParameters.firstSignificanceMapContext;
189
-                    const uint32_t lumaMask = bIsLuma ? ~0 : 0;
190
-                    static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C};
191
-                    const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask;
192
+                    const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx];
193
 
194
                     uint32_t blkPos, sig, ctxSig;
195
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
196
                     {
197
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
198
-                        X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n");
199
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
200
                         const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
201
                         sig     = scanFlagMask & 1;
202
                         scanFlagMask >>= 1;
203
-                        X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
204
+                        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
205
                         if (scanPosSigOff != 0 || subSet == 0 || numNonZero)
206
                         {
207
-                            const uint32_t posY = blkPos >> log2TrSize;
208
-                            const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0;
209
-
210
-                            const uint32_t posXinSubset = blkPos & 3;
211
-                            const uint32_t posYinSubset = posY & 3;
212
-                            const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset;
213
+                            const uint32_t cnt = tabSigCtx[blkPos] + offset;
214
                             ctxSig = (cnt + posOffset) & posZeroMask;
215
 
216
-                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
217
+                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
218
                             encodeBin(sig, baseCtx[ctxSig]);
219
                         }
220
-                        absCoeff[numNonZero] = int(abs(coeff[blkPos]));
221
+                        absCoeff[numNonZero] = tmpCoeff[blkPos];
222
                         numNonZero += sig;
223
                     }
224
                 }
225
@@ -1663,19 +1668,26 @@
226
                     uint32_t blkPos, sig, ctxSig;
227
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
228
                     {
229
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
230
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
231
                         sig     = scanFlagMask & 1;
232
                         scanFlagMask >>= 1;
233
-                        X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
234
+                        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
235
                         {
236
                             ctxSig = ctxIndMap4x4[blkPos];
237
-                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
238
+                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
239
                             //encodeBin(sig, baseCtx[ctxSig]);
240
                             const uint32_t mstate = baseCtx[ctxSig];
241
-                            baseCtx[ctxSig] = sbacNext(mstate, sig);
242
-                            sum += sbacGetEntropyBits(mstate, sig);
243
+                            const uint32_t mps = mstate & 1;
244
+                            const uint32_t stateBits = g_entropyStateBits[mstate ^ sig];
245
+                            uint32_t nextState = (stateBits >> 23) + mps;
246
+                            if ((mstate ^ sig) == 1)
247
+                                nextState = sig;
248
+                            X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n");
249
+                            X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n");
250
+                            baseCtx[ctxSig] = (uint8_t)nextState;
251
+                            sum += stateBits;
252
                         }
253
-                        absCoeff[numNonZero] = int(abs(coeff[blkPos]));
254
+                        absCoeff[numNonZero] = tmpCoeff[blkPos];
255
                         numNonZero += sig;
256
                     }
257
                 } // end of 4x4
258
@@ -1683,41 +1695,39 @@
259
                 {
260
                     X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n");
261
 
262
-                    const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx];
263
-                    const int offset = codingParameters.firstSignificanceMapContext;
264
-                    const uint32_t lumaMask = bIsLuma ? ~0 : 0;
265
-                    static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C};
266
-                    const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask;
267
+                    const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx];
268
 
269
                     uint32_t blkPos, sig, ctxSig;
270
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
271
                     {
272
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
273
-                        X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n");
274
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
275
                         const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
276
                         sig     = scanFlagMask & 1;
277
                         scanFlagMask >>= 1;
278
-                        X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
279
+                        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
280
                         if (scanPosSigOff != 0 || subSet == 0 || numNonZero)
281
                         {
282
-                            const uint32_t posY = blkPos >> log2TrSize;
283
-                            const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0;
284
-
285
-                            const uint32_t posXinSubset = blkPos & 3;
286
-                            const uint32_t posYinSubset = posY & 3;
287
-                            const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset;
288
+                            const uint32_t cnt = tabSigCtx[blkPos] + offset;
289
                             ctxSig = (cnt + posOffset) & posZeroMask;
290
 
291
-                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
292
+                            X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
293
                             //encodeBin(sig, baseCtx[ctxSig]);
294
                             const uint32_t mstate = baseCtx[ctxSig];
295
-                            baseCtx[ctxSig] = sbacNext(mstate, sig);
296
-                            sum += sbacGetEntropyBits(mstate, sig);
297
+                            const uint32_t mps = mstate & 1;
298
+                            const uint32_t stateBits = g_entropyStateBits[mstate ^ sig];
299
+                            uint32_t nextState = (stateBits >> 23) + mps;
300
+                            if ((mstate ^ sig) == 1)
301
+                                nextState = sig;
302
+                            X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n");
303
+                            X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n");
304
+                            baseCtx[ctxSig] = (uint8_t)nextState;
305
+                            sum += stateBits;
306
                         }
307
-                        absCoeff[numNonZero] = int(abs(coeff[blkPos]));
308
+                        absCoeff[numNonZero] = tmpCoeff[blkPos];
309
                         numNonZero += sig;
310
                     }
311
                 } // end of non 4x4 path
312
+                sum &= 0xFFFFFF;
313
 
314
                 // update RD cost
315
                 m_fracBits += sum;
316
@@ -1762,31 +1772,77 @@
317
             if (!c1)
318
             {
319
                 baseCtxMod = bIsLuma ? &m_contextState[OFF_ABS_FLAG_CTX + ctxSet] : &m_contextState[OFF_ABS_FLAG_CTX + NUM_ABS_FLAG_CTX_LUMA + ctxSet];
320
-                if (firstC2FlagIdx != -1)
321
-                {
322
-                    uint32_t symbol = absCoeff[firstC2FlagIdx] > 2;
323
-                    encodeBin(symbol, baseCtxMod[0]);
324
-                }
325
+
326
+                X265_CHECK((firstC2FlagIdx != -1), "firstC2FlagIdx check failure\n");
327
+                uint32_t symbol = absCoeff[firstC2FlagIdx] > 2;
328
+                encodeBin(symbol, baseCtxMod[0]);
329
             }
330
 
331
             const int hiddenShift = (bHideFirstSign && signHidden) ? 1 : 0;
332
             encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift);
333
 
334
-            int firstCoeff2 = 1;
335
             if (!c1 || numNonZero > C1FLAG_NUMBER)
336
             {
337
-                for (int idx = 0; idx < numNonZero; idx++)
338
+                uint32_t goRiceParam = 0;
339
+                int firstCoeff2 = 1;
340
+                uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel
341
+
342
+                if (!m_bitIf)
343
                 {
344
-                    int baseLevel = (idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1;
345
+                    // FastRd path
346
+                    for (int idx = 0; idx < numNonZero; idx++)
347
+                    {
348
+                        int baseLevel = (baseLevelN & 3) | firstCoeff2;
349
+                        X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n");
350
+                        baseLevelN >>= 2;
351
+                        int codeNumber = absCoeff[idx] - baseLevel;
352
 
353
-                    if (absCoeff[idx] >= baseLevel)
354
+                        if (codeNumber >= 0)
355
+                        {
356
+                            //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
357
+                            uint32_t length = 0;
358
+
359
+                            codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION;
360
+                            if (codeNumber >= 0)
361
+                            {
362
+                                {
363
+                                    unsigned long cidx;
364
+                                    CLZ(cidx, codeNumber + 1);
365
+                                    length = cidx;
366
+                                }
367
+                                X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
368
+
369
+                                codeNumber = (length + length);
370
+                            }
371
+                            m_fracBits += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber) << 15;
372
+
373
+                            if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
374
+                                goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
375
+                            X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n");
376
+                        }
377
+                        if (absCoeff[idx] >= 2)
378
+                            firstCoeff2 = 0;
379
+                    }
380
+                }
381
+                else
382
+                {
383
+                    // Standard path
384
+                    for (int idx = 0; idx < numNonZero; idx++)
385
                     {
386
-                        writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
387
-                        if (absCoeff[idx] > 3 * (1 << goRiceParam))
388
-                            goRiceParam = std::min<uint32_t>(goRiceParam + 1, 4);
389
+                        int baseLevel = (baseLevelN & 3) | firstCoeff2;
390
+                        X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n");
391
+                        baseLevelN >>= 2;
392
+
393
+                        if (absCoeff[idx] >= baseLevel)
394
+                        {
395
+                            writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
396
+                            if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
397
+                                goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
398
+                            X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n");
399
+                        }
400
+                        if (absCoeff[idx] >= 2)
401
+                            firstCoeff2 = 0;
402
                     }
403
-                    if (absCoeff[idx] >= 2)
404
-                        firstCoeff2 = 0;
405
                 }
406
             }
407
         }
408
@@ -1874,20 +1930,20 @@
409
     if (bIsLuma)
410
     {
411
         for (uint32_t bin = 0; bin < 2; bin++)
412
-            estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin);
413
+            estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin);
414
 
415
         for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++)
416
             for (uint32_t bin = 0; bin < 2; bin++)
417
-                estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin);
418
+                estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin);
419
     }
420
     else
421
     {
422
         for (uint32_t bin = 0; bin < 2; bin++)
423
-            estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin);
424
+            estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin);
425
 
426
         for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++)
427
             for (uint32_t bin = 0; bin < 2; bin++)
428
-                estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin);
429
+                estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin);
430
     }
431
 
432
     int blkSizeOffset = bIsLuma ? ((log2TrSize - 2) * 3 + ((log2TrSize - 1) >> 2)) : NUM_CTX_LAST_FLAG_XY_LUMA;
433
@@ -2187,6 +2243,28 @@
434
     0x0050c, 0x29bab, 0x004c1, 0x2a674, 0x004a7, 0x2aa5e, 0x0046f, 0x2b32f, 0x0041f, 0x2c0ad, 0x003e7, 0x2ca8d, 0x003ba, 0x2d323, 0x0010c, 0x3bfbb
435
 };
436
 
437
+// [8 24] --> [stateMPS BitCost], [stateLPS BitCost]
438
+const uint32_t g_entropyStateBits[128] =
439
+{
440
+    // Corrected table, most notably for last state
441
+    0x01007b23, 0x000085f9, 0x020074a0, 0x00008cbc, 0x03006ee4, 0x01009354, 0x040067f4, 0x02009c1b,
442
+    0x050060b0, 0x0200a62a, 0x06005a9c, 0x0400af5b, 0x0700548d, 0x0400b955, 0x08004f56, 0x0500c2a9,
443
+    0x09004a87, 0x0600cbf7, 0x0a0045d6, 0x0700d5c3, 0x0b004144, 0x0800e01b, 0x0c003d88, 0x0900e937,
444
+    0x0d0039e0, 0x0900f2cd, 0x0e003663, 0x0b00fc9e, 0x0f003347, 0x0b010600, 0x10003050, 0x0c010f95,
445
+    0x11002d4d, 0x0d011a02, 0x12002ad3, 0x0d012333, 0x1300286e, 0x0f012cad, 0x14002604, 0x0f0136df,
446
+    0x15002425, 0x10013f48, 0x160021f4, 0x100149c4, 0x1700203e, 0x1201527b, 0x18001e4d, 0x12015d00,
447
+    0x19001c99, 0x130166de, 0x1a001b18, 0x13017017, 0x1b0019a5, 0x15017988, 0x1c001841, 0x15018327,
448
+    0x1d0016df, 0x16018d50, 0x1e0015d9, 0x16019547, 0x1f00147c, 0x1701a083, 0x2000138e, 0x1801a8a3,
449
+    0x21001251, 0x1801b418, 0x22001166, 0x1901bd27, 0x23001068, 0x1a01c77b, 0x24000f7f, 0x1a01d18e,
450
+    0x25000eda, 0x1b01d91a, 0x26000e19, 0x1b01e254, 0x27000d4f, 0x1c01ec9a, 0x28000c90, 0x1d01f6e0,
451
+    0x29000c01, 0x1d01fef8, 0x2a000b5f, 0x1e0208b1, 0x2b000ab6, 0x1e021362, 0x2c000a15, 0x1e021e46,
452
+    0x2d000988, 0x1f02285d, 0x2e000934, 0x20022ea8, 0x2f0008a8, 0x200239b2, 0x3000081d, 0x21024577,
453
+    0x310007c9, 0x21024ce6, 0x32000763, 0x21025663, 0x33000710, 0x22025e8f, 0x340006a0, 0x22026a26,
454
+    0x35000672, 0x23026f23, 0x360005e8, 0x23027ef8, 0x370005ba, 0x230284b5, 0x3800055e, 0x24029057,
455
+    0x3900050c, 0x24029bab, 0x3a0004c1, 0x2402a674, 0x3b0004a7, 0x2502aa5e, 0x3c00046f, 0x2502b32f,
456
+    0x3d00041f, 0x2502c0ad, 0x3e0003e7, 0x2602ca8d, 0x3e0003ba, 0x2602d323, 0x3f00010c, 0x3f03bfbb,
457
+};
458
+
459
 const uint8_t g_nextState[128][2] =
460
 {
461
     { 2, 1 }, { 0, 3 }, { 4, 0 }, { 1, 5 }, { 6, 2 }, { 3, 7 }, { 8, 4 }, { 5, 9 },
462
x265_1.6.tar.gz/source/encoder/entropy.h -> x265_1.7.tar.gz/source/encoder/entropy.h Changed
@@ -87,7 +87,7 @@
 struct EstBitsSbac
 {
     int significantCoeffGroupBits[NUM_SIG_CG_FLAG_CTX][2];
-    int significantBits[NUM_SIG_FLAG_CTX][2];
+    int significantBits[2][NUM_SIG_FLAG_CTX];
     int lastBits[2][10];
     int greaterOneBits[NUM_ONE_FLAG_CTX][2];
     int levelAbsBits[NUM_ABS_FLAG_CTX][2];
@@ -179,7 +179,7 @@
     inline void codeQtCbfChroma(uint32_t cbf, uint32_t tuDepth)           { encodeBin(cbf, m_contextState[OFF_QT_CBF_CTX + 2 + tuDepth]); }
     inline void codeQtRootCbf(uint32_t cbf)                               { encodeBin(cbf, m_contextState[OFF_QT_ROOT_CBF_CTX]); }
     inline void codeTransformSkipFlags(uint32_t transformSkip, TextType ttype) { encodeBin(transformSkip, m_contextState[OFF_TRANSFORMSKIP_FLAG_CTX + (ttype ? NUM_TRANSFORMSKIP_FLAG_CTX : 0)]); }
-
+    void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
     void codeSaoOffset(const SaoCtuParam& ctuParam, int plane);
 
     /* RDO functions */
@@ -221,7 +221,7 @@
     }
 
     void encodeCU(const CUData& ctu, const CUGeom &cuGeom, uint32_t absPartIdx, uint32_t depth, bool& bEncodeDQP);
-    void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth);
+    void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bEncodeDQP);
 
     void writeOut();
 
@@ -242,7 +242,6 @@
 
     void codeSaoMaxUvlc(uint32_t code, uint32_t maxSymbol);
 
-    void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
     void codeLastSignificantXY(uint32_t posx, uint32_t posy, uint32_t log2TrSize, bool bIsLuma, uint32_t scanIdx);
 
     void encodeTransform(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize,
x265_1.6.tar.gz/source/encoder/frameencoder.cpp -> x265_1.7.tar.gz/source/encoder/frameencoder.cpp Changed
@@ -213,6 +213,7 @@
 {
     m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime;
     m_frame = curFrame;
+    m_param = curFrame->m_param;
     m_sliceType = curFrame->m_lowres.sliceType;
     curFrame->m_encData->m_frameEncoderID = m_jpId;
     curFrame->m_encData->m_jobProvider = this;
@@ -794,6 +795,7 @@
     uint32_t row = (uint32_t)intRow;
     CTURow& curRow = m_rows[row];
 
+    tld.analysis.m_param = m_param;
     if (m_param->bEnableWavefront)
     {
         ScopedLock self(curRow.lock);
@@ -824,6 +826,13 @@
     const uint32_t lineStartCUAddr = row * numCols;
     bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
 
+    /* These store the count of inter, intra and skip cus within quad tree structure of each CTU */
+    uint32_t qTreeInterCnt[NUM_CU_DEPTH];
+    uint32_t qTreeIntraCnt[NUM_CU_DEPTH];
+    uint32_t qTreeSkipCnt[NUM_CU_DEPTH];
+    for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+        qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
+
     while (curRow.completed < numCols)
     {
         ProfileScopeEvent(encodeCTU);
@@ -841,24 +850,34 @@
                 curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc);
             }
 
+            FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr];
             if (row >= col && row && m_vbvResetTriggerRow != intRow)
-                curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
+                cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
             else
-                curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_rowStat[row].diagQp;
-        }
-        else
-            curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
+                cuStat.baseQp = curEncData.m_rowStat[row].diagQp;
+
+            /* TODO: use defines from slicetype.h for lowres block size */
+            uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
+            uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
+            uint32_t noOfBlocks = g_maxCUSize / 16;
+            uint32_t block_y = (cuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
+            uint32_t block_x = (cuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
+
+            cuStat.vbvCost = 0;
+            cuStat.intraVbvCost = 0;
+            for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++)
+            {
+                uint32_t idx = block_x + (block_y * maxBlockCols);
 
-        if (m_param->rc.aqMode || bIsVbv)
-        {
-            int qp = calcQpForCu(cuAddr, curEncData.m_cuStat[cuAddr].baseQp);
-            tld.analysis.setQP(*slice, qp);
-            qp = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
-            ctu->setQPSubParts((int8_t)qp, 0, 0);
-            curEncData.m_rowStat[row].sumQpAq += qp;
+                for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++, idx++)
+                {
+                    cuStat.vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK;
+                    cuStat.intraVbvCost += m_frame->m_lowres.intraCost[idx];
+                }
+            }
         }
         else
-            tld.analysis.setQP(*slice, slice->m_sliceQp);
+            curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
 
         if (m_param->bEnableWavefront && !col && row)
         {
@@ -886,7 +905,9 @@
         curRow.completed++;
 
         if (m_param->bLogCuStats || m_param->rc.bStatWrite)
-            collectCTUStatistics(*ctu);
+            curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, qTreeInterCnt, qTreeIntraCnt, qTreeSkipCnt);
+        else if (m_param->rc.aqMode)
+            curEncData.m_rowStat[row].sumQpAq += calcCTUQP(*ctu);
 
         // copy no. of intra, inter Cu cnt per row into frame stats for 2 pass
         if (m_param->rc.bStatWrite)
@@ -894,18 +915,17 @@
             curRow.rowStats.mvBits += best.mvBits;
             curRow.rowStats.coeffBits += best.coeffBits;
             curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits);
-            StatisticLog* log = &m_sliceTypeLog[slice->m_sliceType];
 
             for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
             {
                 /* 1 << shift == number of 8x8 blocks at current depth */
                 int shift = 2 * (g_maxCUDepth - depth);
-                curRow.rowStats.iCuCnt += log->qTreeIntraCnt[depth] << shift;
-                curRow.rowStats.pCuCnt += log->qTreeInterCnt[depth] << shift;
-                curRow.rowStats.skipCuCnt += log->qTreeSkipCnt[depth] << shift;
+                curRow.rowStats.iCuCnt += qTreeIntraCnt[depth] << shift;
+                curRow.rowStats.pCuCnt += qTreeInterCnt[depth] << shift;
+                curRow.rowStats.skipCuCnt += qTreeSkipCnt[depth] << shift;
 
                 // clear the row cu data from thread local object
-                log->qTreeIntraCnt[depth] = log->qTreeInterCnt[depth] = log->qTreeSkipCnt[depth] = 0;
+                qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
             }
         }
 
@@ -1075,15 +1095,18 @@
         }
     }
 
+    tld.analysis.m_param = NULL;
     curRow.busy = false;
 
     if (ATOMIC_INC(&m_completionCount) == 2 * (int)m_numRows)
         m_completionEvent.trigger();
 }
 
-void FrameEncoder::collectCTUStatistics(CUData& ctu)
+/* collect statistics about CU coding decisions, return total QP */
+int FrameEncoder::collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt)
 {
     StatisticLog* log = &m_sliceTypeLog[ctu.m_slice->m_sliceType];
+    int totQP = 0;
 
     if (ctu.m_slice->m_sliceType == I_SLICE)
     {
@@ -1094,13 +1117,14 @@
 
             log->totalCu++;
             log->cntIntra[depth]++;
-            log->qTreeIntraCnt[depth]++;
+            qtreeIntraCnt[depth]++;
+            totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
 
             if (ctu.m_predMode[absPartIdx] == MODE_NONE)
             {
                 log->totalCu--;
                 log->cntIntra[depth]--;
-                log->qTreeIntraCnt[depth]--;
+                qtreeIntraCnt[depth]--;
             }
             else if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
             {
@@ -1124,6 +1148,7 @@
 
             log->totalCu++;
             log->cntTotalCu[depth]++;
+            totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
 
             if (ctu.m_predMode[absPartIdx] == MODE_NONE)
             {
@@ -1134,12 +1159,12 @@
             {
                 log->totalCu--;
                 log->cntSkipCu[depth]++;
-                log->qTreeSkipCnt[depth]++;
+                qtreeSkipCnt[depth]++;
             }
             else if (ctu.isInter(absPartIdx))
             {
                 log->cntInter[depth]++;
-                log->qTreeInterCnt[depth]++;
+                qtreeInterCnt[depth]++;
 
                 if (ctu.m_partSize[absPartIdx] < AMP_ID)
                     log->cuInterDistribution[depth][ctu.m_partSize[absPartIdx]]++;
@@ -1149,12 +1174,13 @@
             else if (ctu.isIntra(absPartIdx))
             {
                 log->cntIntra[depth]++;
-                log->qTreeIntraCnt[depth]++;
+                qtreeIntraCnt[depth]++;
 
                 if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
                 {
                     X265_CHECK(ctu.m_log2CUSize[absPartIdx] == 3 && ctu.m_slice->m_sps->quadtreeTULog2MinSize < 3, "Intra NxN found at improbable depth\n");
                     log->cntIntraNxN++;
+                    log->cntIntra[depth]--;
                     /* TODO: log intra modes at absPartIdx +0 to +3 */
                 }
                 else if (ctu.m_lumaIntraDir[absPartIdx] > 1)
@@ -1164,6 +1190,23 @@
             }
         }
     }
+
+    return totQP;
+}
+
+/* iterate over coded CUs and determine total QP */
+int FrameEncoder::calcCTUQP(const CUData& ctu)
+{
+    int totQP = 0;
+    uint32_t depth = 0, numParts = ctu.m_numPartitions;
+
+    for (uint32_t absPartIdx = 0; absPartIdx < ctu.m_numPartitions; absPartIdx += numParts)
+    {
+        depth = ctu.m_cuDepth[absPartIdx];
+        numParts = ctu.m_numPartitions >> (depth * 2);
+        totQP += ctu.m_qp[absPartIdx] * numParts;
+    }
+    return totQP;
 }
 
 /* DCT-domain noise reduction / adaptive deadzone from libavcodec */
@@ -1198,55 +1241,6 @@
213
     }
214
 }
215
 
216
-int FrameEncoder::calcQpForCu(uint32_t ctuAddr, double baseQp)
217
-{
218
-    x265_emms();
219
-    double qp = baseQp;
220
-
221
-    FrameData& curEncData = *m_frame->m_encData;
222
-    /* clear cuCostsForVbv from when vbv row reset was triggered */
223
-    bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
224
-    if (bIsVbv)
225
-    {
226
-        curEncData.m_cuStat[ctuAddr].vbvCost = 0;
227
-        curEncData.m_cuStat[ctuAddr].intraVbvCost = 0;
228
-    }
229
-
230
-    /* Derive qpOffet for each CU by averaging offsets for all 16x16 blocks in the cu. */
231
-    double qp_offset = 0;
232
-    uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
233
-    uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
234
-    uint32_t noOfBlocks = g_maxCUSize / 16;
235
-    uint32_t block_y = (ctuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
236
-    uint32_t block_x = (ctuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
237
-
238
-    /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */
239
-    bool isReferenced = IS_REFERENCED(m_frame);
240
-    double *qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset;
241
-
242
-    uint32_t cnt = 0, idx = 0;
243
-    for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++)
244
-    {
245
-        for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++)
246
-        {
247
-            idx = block_x + w + (block_y * maxBlockCols);
248
-            if (m_param->rc.aqMode)
249
-                qp_offset += qpoffs[idx];
250
-            if (bIsVbv)
251
-            {
252
-                curEncData.m_cuStat[ctuAddr].vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK;
253
-                curEncData.m_cuStat[ctuAddr].intraVbvCost += m_frame->m_lowres.intraCost[idx];
254
-            }
255
-            cnt++;
256
-        }
257
-    }
258
-
259
-    qp_offset /= cnt;
260
-    qp += qp_offset;
261
-
262
-    return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5));
263
-}
264
-
265
 Frame *FrameEncoder::getEncodedPicture(NALList& output)
266
 {
267
     if (m_frame)
268
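The new `calcCTUQP()` above weights each coded CU's QP by the number of 4x4 partitions it covers (`numParts = m_numPartitions >> (depth * 2)`), so dividing the accumulated total by the CTU's partition count yields a properly area-weighted average QP. A standalone sketch of that accumulation, using hypothetical simplified types rather than x265's `CUData`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch (hypothetical CuLeaf type) of the totQP accumulation that
// collectCTUStatistics()/calcCTUQP() now return: each coded CU contributes
// its QP weighted by the 4x4 partitions it covers, so totQP divided by the
// CTU's total partition count is the area-weighted average QP.
struct CuLeaf { uint32_t depth; int qp; };

static int ctuTotalQP(const std::vector<CuLeaf>& leaves, uint32_t ctuNumPartitions)
{
    int totQP = 0;
    uint32_t covered = 0;
    for (const CuLeaf& cu : leaves)
    {
        uint32_t numParts = ctuNumPartitions >> (cu.depth * 2); // as in calcCTUQP()
        totQP += cu.qp * numParts;
        covered += numParts;
    }
    assert(covered == ctuNumPartitions); // leaves must exactly tile the CTU
    return totQP;
}
```

For a 64x64 CTU (256 partitions) split into four depth-1 CUs with QPs 30, 30, 32, 32, each CU covers 64 partitions, so the total is 7936 and the average QP is 31.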
x265_1.6.tar.gz/source/encoder/frameencoder.h -> x265_1.7.tar.gz/source/encoder/frameencoder.h Changed
 
@@ -63,11 +63,6 @@
     uint64_t cntTotalCu[4];
     uint64_t totalCu;
 
-    /* These states store the count of inter,intra and skip ctus within quad tree structure of each CU */
-    uint32_t qTreeInterCnt[4];
-    uint32_t qTreeIntraCnt[4];
-    uint32_t qTreeSkipCnt[4];
-
     StatisticLog()
     {
         memset(this, 0, sizeof(StatisticLog));
@@ -226,8 +221,8 @@
     void encodeSlice();
 
     void threadMain();
-    int  calcQpForCu(uint32_t cuAddr, double baseQp);
-    void collectCTUStatistics(CUData& ctu);
+    int  collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt);
+    int  calcCTUQP(const CUData& ctu);
    void noiseReductionUpdate();
 
     /* Called by WaveFront::findJob() */
x265_1.6.tar.gz/source/encoder/level.cpp -> x265_1.7.tar.gz/source/encoder/level.cpp Changed
 
@@ -55,15 +55,14 @@
     { 35651584, 1069547520, 60000,    240000,   60000,  240000,   8, Level::LEVEL6,   "6",   60 },
     { 35651584, 2139095040, 120000,   480000,   120000, 480000,   8, Level::LEVEL6_1, "6.1", 61 },
     { 35651584, 4278190080U, 240000,  800000,   240000, 800000,   6, Level::LEVEL6_2, "6.2", 62 },
+    { MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, 1, Level::LEVEL8_5, "8.5", 85 },
 };
 
 /* determine minimum decoder level required to decode the described video */
 void determineLevel(const x265_param &param, VPS& vps)
 {
     vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
-    if (param.bLossless)
-        vps.ptl.profileIdc = Profile::NONE;
-    else if (param.internalCsp == X265_CSP_I420)
+    if (param.internalCsp == X265_CSP_I420)
     {
         if (param.internalBitDepth == 8)
         {
@@ -104,7 +103,15 @@
 
     const size_t NumLevels = sizeof(levels) / sizeof(levels[0]);
     uint32_t i;
-    for (i = 0; i < NumLevels; i++)
+    if (param.bLossless)
+    {
+        i = 13;
+        vps.ptl.minCrForLevel = 1;
+        vps.ptl.maxLumaSrForLevel = MAX_UINT;
+        vps.ptl.levelIdc = Level::LEVEL8_5;
+        vps.ptl.tierFlag = Level::MAIN;
+    }
+    else for (i = 0; i < NumLevels; i++)
     {
         if (lumaSamples > levels[i].maxLumaSamples)
             continue;
@@ -337,31 +344,40 @@
 extern "C"
 int x265_param_apply_profile(x265_param *param, const char *profile)
 {
-    if (!profile)
+    if (!param || !profile)
         return 0;
-    if (!strcmp(profile, "main"))
-    {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
 
 #if HIGH_BIT_DEPTH
-        /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
-        x265_log(param, X265_LOG_ERROR, "Main profile not supported, compiled for Main10.\n");
+    if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || !strcmp(profile, "main444-8"))
+    {
+        x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main10.\n", profile);
        return -1;
-#endif
    }
-    else if (!strcmp(profile, "main10"))
+#else
+    if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10"))
    {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
-
-        /* SPSs shall have bit_depth_luma_minus8 in the range of 0 to 2, inclusive 
-         * this covers all builds of x265, currently */
+        x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main.\n", profile);
+        return -1;
+    }
+#endif
+    
+    if (!strcmp(profile, "main"))
+    {
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
    }
    else if (!strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp"))
    {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
 
         /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */
         param->maxNumReferences = 1;
@@ -378,25 +394,29 @@
         param->rc.cuTree = 0;
         param->bEnableWeightedPred = 0;
         param->bEnableWeightedBiPred = 0;
-
-#if HIGH_BIT_DEPTH
-        /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
-        x265_log(param, X265_LOG_ERROR, "Mainstillpicture profile not supported, compiled for Main10.\n");
-        return -1;
-#endif
+    }
+    else if (!strcmp(profile, "main10"))
+    {
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
    }
    else if (!strcmp(profile, "main422-10"))
-        param->internalCsp = X265_CSP_I422;
-    else if (!strcmp(profile, "main444-8"))
    {
-        param->internalCsp = X265_CSP_I444;
-#if HIGH_BIT_DEPTH
-        x265_log(param, X265_LOG_ERROR, "Main 4:4:4 8 profile not supported, compiled for Main10.\n");
-        return -1;
-#endif
+        if (!(param->internalCsp & (X265_CSP_I420 | X265_CSP_I422)))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
+    }
+    else if (!strcmp(profile, "main444-8") || !strcmp(profile, "main444-10"))
+    {
+        /* any color space allowed */
    }
-    else if (!strcmp(profile, "main444-10"))
-        param->internalCsp = X265_CSP_I444;
    else
    {
        x265_log(param, X265_LOG_ERROR, "unknown profile <%s>\n", profile);
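The level.cpp change above stops signaling `Profile::NONE` for lossless encodes and instead routes them to the new Level 8.5 table entry, the "no limits" level HEVC reserves for streams that bypass the sample-rate and bitrate constraints. A minimal sketch of that selection rule, with a hypothetical trimmed-down level table (the real table also checks bitrate, tier, and compression-ratio limits):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch (hypothetical, simplified) of the new level-selection rule:
// lossless streams skip the per-level luma-sample limits entirely and
// signal Level 8.5; otherwise the smallest level whose limit fits is used.
struct LevelSpec { uint64_t maxLumaSamples; int levelIdc; };

static int chooseLevelIdc(bool lossless, uint64_t lumaSamples)
{
    static const LevelSpec levels[] = {
        { 36864,   10 },                                  // Level 1
        { 122880,  20 },                                  // Level 2
        { 983040,  31 },                                  // Level 3.1
        { 8912896, 50 },                                  // Level 5
        { std::numeric_limits<uint64_t>::max(), 85 },     // Level 8.5: no limits
    };
    if (lossless)
        return 85; // as in the new determineLevel() branch
    for (const LevelSpec& l : levels)
        if (lumaSamples <= l.maxLumaSamples)
            return l.levelIdc;
    return 85;
}
```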
x265_1.6.tar.gz/source/encoder/motion.cpp -> x265_1.7.tar.gz/source/encoder/motion.cpp Changed
 
@@ -234,9 +234,14 @@
                pix_base + (m1x) + (m1y) * stride, \
                pix_base + (m2x) + (m2y) * stride, \
                stride, costs); \
-        (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
-        (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
-        (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
+        const uint16_t *base_mvx = &m_cost_mvx[(bmv.x + (m0x)) << 2]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(bmv.y + (m0y)) << 2]; \
+        X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        (costs)[0] += (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]); \
+        (costs)[1] += (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]); \
+        (costs)[2] += (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]); \
     }
 
 #define COST_MV_PT_DIST_X4(m0x, m0y, p0, d0, m1x, m1y, p1, d1, m2x, m2y, p2, d2, m3x, m3y, p3, d3) \
@@ -247,10 +252,10 @@
                fref + (m2x) + (m2y) * stride, \
                fref + (m3x) + (m3y) * stride, \
                stride, costs); \
-        costs[0] += mvcost(MV(m0x, m0y) << 2); \
-        costs[1] += mvcost(MV(m1x, m1y) << 2); \
-        costs[2] += mvcost(MV(m2x, m2y) << 2); \
-        costs[3] += mvcost(MV(m3x, m3y) << 2); \
+        (costs)[0] += mvcost(MV(m0x, m0y) << 2); \
+        (costs)[1] += mvcost(MV(m1x, m1y) << 2); \
+        (costs)[2] += mvcost(MV(m2x, m2y) << 2); \
+        (costs)[3] += mvcost(MV(m3x, m3y) << 2); \
         COPY4_IF_LT(bcost, costs[0], bmv, MV(m0x, m0y), bPointNr, p0, bDistance, d0); \
         COPY4_IF_LT(bcost, costs[1], bmv, MV(m1x, m1y), bPointNr, p1, bDistance, d1); \
         COPY4_IF_LT(bcost, costs[2], bmv, MV(m2x, m2y), bPointNr, p2, bDistance, d2); \
@@ -266,10 +271,16 @@
                pix_base + (m2x) + (m2y) * stride, \
                pix_base + (m3x) + (m3y) * stride, \
                stride, costs); \
-        costs[0] += mvcost((omv + MV(m0x, m0y)) << 2); \
-        costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \
-        costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \
-        costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \
+        const uint16_t *base_mvx = &m_cost_mvx[(omv.x << 2)]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(omv.y << 2)]; \
+        X265_CHECK(mvcost((omv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+        costs[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+        costs[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+        costs[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+        costs[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
        COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
        COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
        COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
@@ -285,10 +296,17 @@
                pix_base + (m2x) + (m2y) * stride, \
                pix_base + (m3x) + (m3y) * stride, \
                stride, costs); \
-        (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
-        (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
-        (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
-        (costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \
+        /* TODO: use restrict keyword in ICL */ \
+        const uint16_t *base_mvx = &m_cost_mvx[(bmv.x << 2)]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(bmv.y << 2)]; \
+        X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+        (costs)[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+        (costs)[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+        (costs)[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+        (costs)[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
     }
 
 #define DIA1_ITER(mx, my) \
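The motion.cpp hunks above replace per-candidate `mvcost()` calls with lookups through hoisted base pointers into the per-component cost tables `m_cost_mvx`/`m_cost_mvy`: since the rate cost of a motion vector is separable into an x cost plus a y cost, all candidates around a pivot vector can be priced with two pointers and small offsets. A toy demonstration of the idea (the table contents here are made up; the real x265 tables are indexed in quarter-pel units and hold lambda-weighted MV-delta bit costs):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Toy per-component MV cost tables (hypothetical values).
static uint16_t costX[64], costY[64];

static void initToyTables()
{
    for (int i = 0; i < 64; i++)
    {
        costX[i] = (uint16_t)std::abs(i - 32);        // toy: distance from center
        costY[i] = (uint16_t)(std::abs(i - 32) * 2);
    }
}

// Reference: the separable cost a direct mvcost() call would compute.
static uint16_t mvcost(int x, int y) { return costX[x] + costY[y]; }

// The optimization: hoist base pointers for the pivot (bx, by) once, then
// price each candidate (bx+dx, by+dy) with two indexed loads, as the
// rewritten COST_MV macros do with base_mvx/base_mvy.
static uint16_t candCost(int bx, int by, int dx, int dy)
{
    const uint16_t* baseX = &costX[bx];
    const uint16_t* baseY = &costY[by];
    return baseX[dx] + baseY[dy];
}
```

The `X265_CHECK` lines in the diff assert exactly this equivalence in debug builds before trusting the table lookups.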
x265_1.6.tar.gz/source/encoder/nal.cpp -> x265_1.7.tar.gz/source/encoder/nal.cpp Changed
 
@@ -35,6 +35,7 @@
     , m_extraBuffer(NULL)
     , m_extraOccupancy(0)
     , m_extraAllocSize(0)
+    , m_annexB(true)
 {}
 
 void NALList::takeContents(NALList& other)
@@ -90,7 +91,12 @@
     uint8_t *out = m_buffer + m_occupancy;
     uint32_t bytes = 0;
 
-    if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
+    if (!m_annexB)
+    {
+        /* Will write size later */
+        bytes += 4;
+    }
+    else if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
    {
        memcpy(out, startCodePrefix, 4);
        bytes += 4;
@@ -144,6 +150,16 @@
      * to 0x03 is appended to the end of the data.  */
     if (!out[bytes - 1])
         out[bytes++] = 0x03;
+
+    if (!m_annexB)
+    {
+        uint32_t dataSize = bytes - 4;
+        out[0] = (uint8_t)(dataSize >> 24);
+        out[1] = (uint8_t)(dataSize >> 16);
+        out[2] = (uint8_t)(dataSize >> 8);
+        out[3] = (uint8_t)dataSize;
+    }
+
    m_occupancy += bytes;
 
    X265_CHECK(m_numNal < (uint32_t)MAX_NAL_UNITS, "NAL count overflow\n");
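The nal.cpp change above adds a second framing mode: with `m_annexB` off, the writer reserves 4 bytes up front and back-patches the payload size big-endian, the length-prefixed framing MP4-style muxers expect, instead of emitting an Annex B start code. A simplified standalone sketch of the two framings (hypothetical helper; the real code computes the size only after emulation-prevention bytes are inserted):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the two NAL framing modes selected by the new m_annexB flag:
// Annex B prepends a 00 00 00 01 start code; length-prefixed mode reserves
// 4 bytes and back-patches the big-endian payload size once it is known.
static std::vector<uint8_t> frameNal(const uint8_t* payload, uint32_t len, bool annexB)
{
    std::vector<uint8_t> out;
    if (annexB)
    {
        const uint8_t startCode[4] = { 0, 0, 0, 1 };
        out.insert(out.end(), startCode, startCode + 4);
    }
    else
        out.resize(4); // size written below, as in "Will write size later"
    out.insert(out.end(), payload, payload + len);
    if (!annexB)
    {
        out[0] = (uint8_t)(len >> 24);
        out[1] = (uint8_t)(len >> 16);
        out[2] = (uint8_t)(len >> 8);
        out[3] = (uint8_t)len;
    }
    return out;
}
```

This is why `bAnnexB` is API-only and not a CLI option: the CLI writes raw Annex B files, while the framing choice belongs to whichever muxer consumes the API output.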
x265_1.6.tar.gz/source/encoder/nal.h -> x265_1.7.tar.gz/source/encoder/nal.h Changed
 
@@ -48,6 +48,7 @@
     uint8_t*    m_extraBuffer;
     uint32_t    m_extraOccupancy;
     uint32_t    m_extraAllocSize;
+    bool        m_annexB;
 
     NALList();
     ~NALList() { X265_FREE(m_buffer); X265_FREE(m_extraBuffer); }
x265_1.6.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.7.tar.gz/source/encoder/ratecontrol.cpp Changed
 
@@ -300,7 +300,7 @@
         }
     }
 
-    /* qstep - value set as encoder specific */
+    /* qpstep - value set as encoder specific */
     m_lstep = pow(2, m_param->rc.qpStep / 6.0);
 
     for (int i = 0; i < 2; i++)
@@ -370,14 +370,19 @@
     m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm;
 
     /* Frame Predictors and Row predictors used in vbv */
-    for (int i = 0; i < 5; i++)
+    for (int i = 0; i < 4; i++)
     {
-        m_pred[i].coeff = 1.5;
+        m_pred[i].coeff = 1.0;
         m_pred[i].count = 1.0;
         m_pred[i].decay = 0.5;
         m_pred[i].offset = 0.0;
     }
-    m_pred[0].coeff = 1.0;
+    m_pred[0].coeff = m_pred[3].coeff = 0.75;
+    if (m_param->rc.qCompress >= 0.8) // when tuned for grain 
+    {
+        m_pred[1].coeff = 0.75;
+        m_pred[0].coeff = m_pred[3].coeff = 0.50;
+    }
     if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead))
     {
         /* If the user hasn't defined the stat filename, use the default value */
@@ -945,6 +950,9 @@
     m_curSlice = curEncData.m_slice;
     m_sliceType = m_curSlice->m_sliceType;
     rce->sliceType = m_sliceType;
+    if (!m_2pass)
+        rce->keptAsRef = IS_REFERENCED(curFrame);
+    m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType);
     rce->poc = m_curSlice->m_poc;
     if (m_param->rc.bStatRead)
     {
@@ -1074,7 +1082,7 @@
             m_lastQScaleFor[m_sliceType] = x265_qp2qScale(rce->qpaRc);
             if (rce->poc == 0)
                  m_lastQScaleFor[P_SLICE] = m_lastQScaleFor[m_sliceType] * fabs(m_param->rc.ipFactor);
-            rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], m_qp, (double)m_currentSatd);
+            rce->frameSizePlanned = predictSize(&m_pred[m_predType], m_qp, (double)m_currentSatd);
         }
     }
     m_framesDone++;
@@ -1105,6 +1113,14 @@
         m_accumPQp += m_qp;
 }
 
+int RateControl::getPredictorType(int lowresSliceType, int sliceType)
+{
+    /* Use a different predictor for B Ref and B frames for vbv frame size predictions */
+    if (lowresSliceType == X265_TYPE_BREF)
+        return 3;
+    return sliceType;
+}
+
 double RateControl::getDiffLimitedQScale(RateControlEntry *rce, double q)
 {
     // force I/B quants as a function of P quants
@@ -1379,6 +1395,7 @@
             q += m_pbOffset;
 
         double qScale = x265_qp2qScale(q);
+        rce->qpNoVbv = q;
         double lmin = 0, lmax = 0;
         if (m_isVbv)
         {
@@ -1391,16 +1408,15 @@
                     qScale = x265_clip3(lmin, lmax, qScale);
                 q = x265_qScale2qp(qScale);
             }
-            rce->qpNoVbv = q;
             if (!m_2pass)
             {
                 qScale = clipQscale(curFrame, rce, qScale);
                 /* clip qp to permissible range after vbv-lookahead estimation to avoid possible 
                  * mispredictions by initial frame size predictors */
-                if (m_pred[m_sliceType].count == 1)
+                if (m_pred[m_predType].count == 1)
                     qScale = x265_clip3(lmin, lmax, qScale);
                 m_lastQScaleFor[m_sliceType] = qScale;
-                rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], qScale, (double)m_currentSatd);
+                rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd);
             }
             else
                 rce->frameSizePlanned = qScale2bits(rce, qScale);
@@ -1544,7 +1560,7 @@
             q = clipQscale(curFrame, rce, q);
             /*  clip qp to permissible range after vbv-lookahead estimation to avoid possible
              * mispredictions by initial frame size predictors */
-            if (!m_2pass && m_isVbv && m_pred[m_sliceType].count == 1)
+            if (!m_2pass && m_isVbv && m_pred[m_predType].count == 1)
                 q = x265_clip3(lqmin, lqmax, q);
         }
         m_lastQScaleFor[m_sliceType] = q;
@@ -1554,7 +1570,7 @@
         if (m_2pass && m_isVbv)
             rce->frameSizePlanned = qScale2bits(rce, q);
         else
-            rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+            rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
 
         /* Always use up the whole VBV in this case. */
         if (m_singleFrameVbv)
@@ -1707,7 +1723,7 @@
             {
                 double frameQ[3];
                 double curBits;
-                curBits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+                curBits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
                 double bufferFillCur = m_bufferFill - curBits;
                 double targetFill;
                 double totalDuration = m_frameDuration;
@@ -1726,7 +1742,8 @@
                         bufferFillCur += wantedFrameSize;
                     int64_t satd = curFrame->m_lowres.plannedSatd[j] >> (X265_DEPTH - 8);
                     type = IS_X265_TYPE_I(type) ? I_SLICE : IS_X265_TYPE_B(type) ? B_SLICE : P_SLICE;
-                    curBits = predictSize(&m_pred[type], frameQ[type], (double)satd);
+                    int predType = getPredictorType(curFrame->m_lowres.plannedType[j], type);
+                    curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd);
                     bufferFillCur -= curBits;
                 }
 
@@ -1766,7 +1783,7 @@
             }
             // Now a hard threshold to make sure the frame fits in VBV.
             // This one is mostly for I-frames.
-            double bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+            double bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
 
             // For small VBVs, allow the frame to use up the entire VBV.
             double maxFillFactor;
@@ -1783,18 +1800,21 @@
                 bits *= qf;
                 if (bits < m_bufferRate / minFillFactor)
                     q *= bits * minFillFactor / m_bufferRate;
-                bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+                bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
            }
 
            q = X265_MAX(q0, q);
        }
 
        /* Apply MinCR restrictions */
-        double pbits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+        double pbits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
        if (pbits > rce->frameSizeMaximum)
            q *= pbits / rce->frameSizeMaximum;
-
-        if (!m_isCbr || (m_isAbr && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2))
+        /* To detect frames that are more complex in SATD costs compared to prev window, yet 
+         * lookahead vbv reduces its qscale by half its value. Be on safer side and avoid drastic 
+         * qscale reductions for frames high in complexity */
+        bool mispredCheck = rce->movingAvgSum && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2;
+        if (!m_isCbr || (m_isAbr && mispredCheck))
            q = X265_MAX(q0, q);
 
        if (m_rateFactorMaxIncrement)
@@ -1838,18 +1858,26 @@
         if (satdCostForPendingCus  > 0)
         {
             double pred_s = predictSize(rce->rowPred[0], qScale, satdCostForPendingCus);
-            uint32_t refRowSatdCost = 0, refRowBits = 0, intraCost = 0;
+            uint32_t refRowSatdCost = 0, refRowBits = 0, intraCostForPendingCus = 0;
             double refQScale = 0;
 
             if (picType != I_SLICE)
             {
                 FrameData& refEncData = *refFrame->m_encData;
                 uint32_t endCuAddr = maxCols * (row + 1);
-                for (uint32_t cuAddr = curEncData.m_rowStat[row].numEncodedCUs + 1; cuAddr < endCuAddr; cuAddr++)
+                uint32_t startCuAddr = curEncData.m_rowStat[row].numEncodedCUs;
+                if (startCuAddr)
                 {
-                    refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
-                    refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
-                    intraCost += curEncData.m_cuStat[cuAddr].intraVbvCost;
+                    for (uint32_t cuAddr = startCuAddr + 1 ; cuAddr < endCuAddr; cuAddr++)
+                    {
+                        refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
+                        refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
+                    }
+                }
+                else
+                {
+                    refRowBits = refEncData.m_rowStat[row].encodedBits;
+                    refRowSatdCost = refEncData.m_rowStat[row].satdForVbv;
                 }
 
                 refRowSatdCost >>= X265_DEPTH - 8;
@@ -1859,7 +1887,7 @@
             if (picType == I_SLICE || qScale >= refQScale)
             {
                 if (picType == P_SLICE 
-                    && !refFrame 
+                    && refFrame 
                     && refFrame->m_encData->m_slice->m_sliceType == picType
                     && refQScale > 0
                     && refRowSatdCost > 0)
@@ -1875,8 +1903,9 @@
             }
             else if (picType == P_SLICE)
             {
+                intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd;
                 /* Our QP is lower than the reference! */
-                double pred_intra = predictSize(rce->rowPred[1], qScale, intraCost);
+                double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus);
                 /* Sum: better to overestimate than underestimate by using only one of the two predictors. */
                 totalSatdBits += (int32_t)(pred_intra + pred_s);
             }
@@ -2099,8 +2128,10 @@
 
 void RateControl::updateVbv(int64_t bits, RateControlEntry* rce)
 {
+    int predType = rce->sliceType;
+    predType = rce->sliceType == B_SLICE && rce->keptAsRef ? 3 : predType;
     if (rce->lastSatd >= m_ncu)
-        updatePredictor(&m_pred[rce->sliceType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits);
+        updatePredictor(&m_pred[predType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits);
     if (!m_isVbv)
         return;
 
@@ -2156,23 +2187,24 @@
     {
         if (m_isVbv)
         {
+            /* determine avg QP decided by VBV rate control */
             for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
                 curEncData.m_avgQpRc += curEncData.m_rowStat[i].sumQpRc;
 
             curEncData.m_avgQpRc /= slice->m_sps->numCUsInFrame;
             rce->qpaRc = curEncData.m_avgQpRc;
-
-            // copy avg RC qp to m_avgQpAq. To print out the correct qp when aq/cutree is disabled.
-            curEncData.m_avgQpAq = curEncData.m_avgQpRc;
         }
 
         if (m_param->rc.aqMode)
         {
+            /* determine actual avg encoded QP, after AQ/cutree adjustments */
             for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
                 curEncData.m_avgQpAq += curEncData.m_rowStat[i].sumQpAq;
 
-            curEncData.m_avgQpAq /= slice->m_sps->numCUsInFrame;
+            curEncData.m_avgQpAq /= (slice->m_sps->numCUsInFrame * NUM_4x4_PARTITIONS);
        }
+        else
+            curEncData.m_avgQpAq = curEncData.m_avgQpRc;
    }
 
    // Write frame stats into the stats file if 2 pass is enabled.
@@ -2301,7 +2333,7 @@
 {
     m_finalFrameCount = count;
     /* unblock waiting threads */
-    m_startEndOrder.set(m_startEndOrder.get());
+    m_startEndOrder.poke();
 }
 
 /* called when the encoder is closing, and no more frames will be output.
@@ -2311,7 +2343,7 @@
 {
     m_bTerminated = true;
     /* unblock waiting threads */
-    m_startEndOrder.set(m_startEndOrder.get());
+    m_startEndOrder.poke();
 }
 
 void RateControl::destroy()
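The ratecontrol changes above collapse the five VBV frame-size predictors into four, one per effective frame class: `getPredictorType()` maps the slice type straight to a slot and diverts lookahead B-ref frames to a dedicated fourth slot, since referenced B frames consume bits differently from plain B frames. A standalone sketch of that mapping (the `LowresType` enum here is a hypothetical stand-in for x265's `X265_TYPE_*` lookahead types; `B_SLICE`/`P_SLICE`/`I_SLICE` follow x265's SliceType enum order):

```cpp
#include <cassert>

// Sketch of the new predictor indexing: slice type selects slots 0..2
// (B, P, I in x265's SliceType enum order), while B frames kept as
// references get the dedicated fourth predictor slot.
enum SliceType { B_SLICE = 0, P_SLICE = 1, I_SLICE = 2 };
enum LowresType { TYPE_B, TYPE_BREF, TYPE_P, TYPE_I }; // hypothetical subset

static int getPredictorType(LowresType lowresType, SliceType sliceType)
{
    return lowresType == TYPE_BREF ? 3 : (int)sliceType;
}
```

`updateVbv()` applies the same rule from the other direction: when feeding back actual frame sizes, a B slice with `keptAsRef` set updates slot 3, so predictions and updates stay paired per class.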
x265_1.6.tar.gz/source/encoder/ratecontrol.h -> x265_1.7.tar.gz/source/encoder/ratecontrol.h Changed
 
@@ -157,10 +157,9 @@
     double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */
     double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */
 
-    Predictor m_pred[5];
-    Predictor m_predBfromP;
-
+    Predictor m_pred[4];       /* Slice predictors to predict bits for each slice type - I, P, Bref and B */
     int64_t m_leadingNoBSatd;
+    int     m_predType;       /* Type of slice predictors to be used - depends on the slice type */
     double  m_ipOffset;
     double  m_pbOffset;
     int64_t m_bframeBits;
@@ -266,6 +265,7 @@
     double tuneAbrQScaleFromFeedback(double qScale);
     void   accumPQpUpdate();
 
+    int    getPredictorType(int lowresSliceType, int sliceType);
     void   updateVbv(int64_t bits, RateControlEntry* rce);
     void   updatePredictor(Predictor *p, double q, double var, double bits);
     double clipQscale(Frame* pic, RateControlEntry* rce, double q);
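The header change above collapses the separate `m_predBfromP` predictor into a single four-entry array indexed by slice category (I, P, referenced B, non-referenced B), selected via the new `getPredictorType()`. A hypothetical sketch of such a mapping (the enum values and selection logic here are illustrative assumptions, not x265's actual implementation):

```cpp
// Hypothetical slice-type constants for illustration.
enum { SLICE_I, SLICE_P, SLICE_B };

// Map a slice to one of four bit-size predictors: I, P, Bref, B.
// bReferenced distinguishes a B frame used as a reference (Bref)
// from a non-referenced B frame.
int getPredictorIndex(bool bReferenced, int sliceType)
{
    if (sliceType == SLICE_I)
        return 0; // I-slice predictor
    if (sliceType == SLICE_P)
        return 1; // P-slice predictor
    return bReferenced ? 2 : 3; // Bref vs plain B predictor
}
```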
x265_1.6.tar.gz/source/encoder/rdcost.h -> x265_1.7.tar.gz/source/encoder/rdcost.h Changed
 
@@ -40,13 +40,15 @@
     uint32_t  m_chromaDistWeight[2];
     uint32_t  m_psyRdBase;
     uint32_t  m_psyRd;
-    int       m_qp;
+    int       m_qp; /* QP used to configure lambda, may be higher than QP_MAX_SPEC but <= QP_MAX_MAX */
 
     void setPsyRdScale(double scale)                { m_psyRdBase = (uint32_t)floor(65536.0 * scale * 0.33); }
 
     void setQP(const Slice& slice, int qp)
     {
+        x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
         m_qp = qp;
+        setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);
 
         /* Scale PSY RD factor by a slice type factor */
         static const uint32_t psyScaleFix8[3] = { 300, 256, 96 }; /* B, P, I */
@@ -60,19 +62,21 @@
         }
 
         int qpCb, qpCr;
-        setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);
         if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
-            qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
+        {
+            qpCb = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[0])];
+            qpCr = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[1])];
+        }
         else
-            qpCb = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
+        {
+            qpCb = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[0]);
+            qpCr = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[1]);
+        }
+
         int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET);
         uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
         m_chromaDistWeight[0] = lambdaOffset;
 
-        if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
-            qpCr = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
-        else
-            qpCr = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
         chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET);
         lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
         m_chromaDistWeight[1] = lambdaOffset;
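A notable detail of the chroma QP rework above: the old I420 path clamped the value *after* the `g_chromaScale[]` lookup, so an out-of-range `qp + chromaQpOffset` could index past the table; the new code clamps the index *before* the lookup. A minimal standalone sketch of that safe pattern (the table here is a hypothetical identity mapping, not x265's actual `g_chromaScale` data):

```cpp
#include <algorithm>

// clip3: clamp v into [lo, hi], mirroring x265_clip3's semantics.
inline int clip3(int lo, int hi, int v) { return std::min(hi, std::max(lo, v)); }

// Hypothetical stand-in for a 70-entry chroma scale table lookup
// (identity mapping for illustration only).
inline int chromaScaleAt(int i) { return i; }

// Safe pattern: clamp the index into [0, 69] *before* the table lookup,
// so an out-of-range qp + offset can never read outside the table.
int chromaQP(int qp, int offset)
{
    return chromaScaleAt(clip3(0, 69, qp + offset));
}
```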
x265_1.6.tar.gz/source/encoder/sao.cpp -> x265_1.7.tar.gz/source/encoder/sao.cpp Changed
 
@@ -258,7 +258,7 @@
     pixel* tmpL;
     pixel* tmpU;
 
-    int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1;
+    int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1, signLeft1[2];
     int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1;
 
     memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */
@@ -279,7 +279,7 @@
     {
     case SAO_EO_0: // dir: -
     {
-        pixel firstPxl = 0, lastPxl = 0;
+        pixel firstPxl = 0, lastPxl = 0, row1FirstPxl = 0, row1LastPxl = 0;
         startX = !lpelx;
         endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
         if (ctuWidth & 15)
@@ -301,25 +301,38 @@
         }
         else
         {
-            for (y = 0; y < ctuHeight; y++)
+            for (y = 0; y < ctuHeight; y += 2)
             {
-                int signLeft = signOf(rec[startX] - tmpL[y]);
+                signLeft1[0] = signOf(rec[startX] - tmpL[y]);
+                signLeft1[1] = signOf(rec[stride + startX] - tmpL[y + 1]);
 
                 if (!lpelx)
+                {
                     firstPxl = rec[0];
+                    row1FirstPxl = rec[stride];
+                }
 
                 if (rpelx == picWidth)
+                {
                     lastPxl = rec[ctuWidth - 1];
+                    row1LastPxl = rec[stride + ctuWidth - 1];
+                }
 
-                primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, (int8_t)signLeft);
+                primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, signLeft1, stride);
 
                 if (!lpelx)
+                {
                     rec[0] = firstPxl;
+                    rec[stride] = row1FirstPxl;
+                }
 
                 if (rpelx == picWidth)
+                {
                     rec[ctuWidth - 1] = lastPxl;
+                    rec[stride + ctuWidth - 1] = row1LastPxl;
+                }
 
-                rec += stride;
+                rec += 2 * stride;
             }
         }
         break;
@@ -354,11 +367,14 @@
         {
             primitives.sign(upBuff1, rec, tmpU, ctuWidth);
 
-            for (y = startY; y < endY; y++)
+            int diff = (endY - startY) % 2;
+            for (y = startY; y < endY - diff; y += 2)
             {
-                primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
-                rec += stride;
+                primitives.saoCuOrgE1_2Rows(rec, upBuff1, m_offsetEo, stride, ctuWidth);
+                rec += 2 * stride;
             }
+            if (diff & 1)
+                primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
         }
 
         break;
@@ -421,23 +437,8 @@
             for (y = startY; y < endY; y++)
             {
                 int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]);
-                pixel firstPxl = rec[0];  // copy first Pxl
-                pixel lastPxl = rec[ctuWidth - 1];
-                int8_t one = upBufft[1];
-                int8_t two = upBufft[endX + 1];
 
-                primitives.saoCuOrgE2(rec, upBufft, upBuff1, m_offsetEo, ctuWidth, stride);
-                if (!lpelx)
-                {
-                    rec[0] = firstPxl;
-                    upBufft[1] = one;
-                }
-
-                if (rpelx == picWidth)
-                {
-                    rec[ctuWidth - 1] = lastPxl;
-                    upBufft[endX + 1] = two;
-                }
+                primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, m_offsetEo, endX - startX, stride);
 
                 upBufft[startX] = iSignDown2;
 
@@ -508,7 +509,7 @@
                 upBuff1[x - 1] = -signDown;
                 rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]];
 
-                primitives.saoCuOrgE3(rec, upBuff1, m_offsetEo, stride - 1, startX, endX);
+                primitives.saoCuOrgE3[endX > 16](rec, upBuff1, m_offsetEo, stride - 1, startX, endX);
 
                 upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]);
 
@@ -783,13 +784,7 @@
                 rec += stride;
             }
 
-            if (!(ctuWidth & 15))
-                primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);
-            else
-            {
-                for (x = 0; x < ctuWidth; x++)
-                    upBuff1[x] = signOf(rec[x] - rec[x - stride]);
-            }
+            primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);
 
             for (y = startY; y < endY; y++)
            {
@@ -832,8 +827,7 @@
                 rec += stride;
             }
 
-            for (x = startX; x < endX; x++)
-                upBuff1[x] = signOf(rec[x] - rec[x - stride - 1]);
+            primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX));

             for (y = startY; y < endY; y++)
             {
@@ -879,8 +873,7 @@
                 rec += stride;
             }
 
-            for (x = startX - 1; x < endX; x++)
-                upBuff1[x] = signOf(rec[x] - rec[x - stride + 1]);
+            primitives.sign(&upBuff1[startX - 1], &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1));
 
             for (y = startY; y < endY; y++)
             {
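Several hunks above replace open-coded `signOf()` loops with the vectorizable `primitives.sign` call. The scalar semantics that primitive implements can be sketched as follows (the branchless `signOf` shown matches the common shift-based idiom; buffer types are simplified to `int` for illustration):

```cpp
#include <cstdint>

// signOf: returns -1, 0 or +1 depending on the sign of x, branchlessly.
// (x >> 31) is all-ones for negative x; ((uint32_t)-x >> 31) is 1 for
// positive x; OR-ing them yields the three-valued sign.
inline int8_t signOf(int x)
{
    return (int8_t)((x >> 31) | (int)((uint32_t)-x >> 31));
}

// Scalar reference for primitives.sign(dst, a, b, count):
// dst[i] = signOf(a[i] - b[i]) for each of count elements.
void signBuf(int8_t* dst, const int* a, const int* b, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = signOf(a[i] - b[i]);
}
```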
x265_1.6.tar.gz/source/encoder/search.cpp -> x265_1.7.tar.gz/source/encoder/search.cpp Changed
 
@@ -163,11 +163,16 @@
     X265_FREE(m_tsRecon);
 }
 
-void Search::setQP(const Slice& slice, int qp)
+int Search::setLambdaFromQP(const CUData& ctu, int qp)
 {
-    x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
+    X265_CHECK(qp >= QP_MIN && qp <= QP_MAX_MAX, "QP used for lambda is out of range\n");
+
     m_me.setQP(qp);
-    m_rdCost.setQP(slice, qp);
+    m_rdCost.setQP(*m_slice, qp);
+
+    int quantQP = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
+    m_quant.setQPforQuant(ctu, quantQP);
+    return quantQP;
 }
 
 #if CHECKED_BUILD || _DEBUG
@@ -1185,7 +1190,7 @@
         intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size);
     }
     updateModeCost(intraMode);
-    checkDQP(cu, cuGeom);
+    checkDQP(intraMode, cuGeom);
 }
 
 /* Note that this function does not save the best intra prediction, it must
@@ -1231,16 +1236,11 @@
 
         pixel nScale[129];
         intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
-        primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+        primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);
 
         // we do not estimate filtering for downscaled samples
-        for (int x = 1; x < 65; x++)
-        {
-            intraNeighbourBuf[0][x] = nScale[x];           // Top pixel
-            intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
-            intraNeighbourBuf[1][x] = nScale[x];           // Top pixel
-            intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
-        }
+        memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));   // Top & Left pixels
+        memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));
 
         scaleTuSize = 32;
         scaleStride = 32;
@@ -1369,8 +1369,6 @@
     X265_CHECK(cu.m_partSize[0] == SIZE_2Nx2N, "encodeIntraInInter does not expect NxN intra\n");
     X265_CHECK(!m_slice->isIntra(), "encodeIntraInInter does not expect to be used in I slices\n");
 
-    m_quant.setQPforQuant(cu);
-
     uint32_t tuDepthRange[2];
     cu.getIntraTUQtDepthRange(tuDepthRange, 0);
 
@@ -1405,7 +1403,7 @@
 
     m_entropyCoder.store(intraMode.contexts);
     updateModeCost(intraMode);
-    checkDQP(intraMode.cu, cuGeom);
+    checkDQP(intraMode, cuGeom);
 }
 
 uint32_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes)
@@ -1465,16 +1463,10 @@
 
                     pixel nScale[129];
                     intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
-                    primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+                    primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);
 
-                    // TO DO: primitive
-                    for (int x = 1; x < 65; x++)
-                    {
-                        intraNeighbourBuf[0][x] = nScale[x];           // Top pixel
-                        intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
-                        intraNeighbourBuf[1][x] = nScale[x];           // Top pixel
-                        intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
-                    }
+                    memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));
+                    memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));
 
                     scaleTuSize = 32;
                     scaleStride = 32;
@@ -1869,6 +1861,34 @@
     return outCost;
 }
 
+/* Pick between the two AMVP candidates which is the best one to use as
+ * MVP for the motion search, based on SAD cost */
+int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref)
+{
+    if (amvp[0] == amvp[1])
+        return 0;
+
+    Yuv& tmpPredYuv = m_rqt[cu.m_cuDepth[0]].tmpPredYuv;
+    uint32_t costs[AMVP_NUM_CANDS];
+
+    for (int i = 0; i < AMVP_NUM_CANDS; i++)
+    {
+        MV mvCand = amvp[i];
+
+        // NOTE: skip mvCand if Y is > merange and -FN>1
+        if (m_bFrameParallel && (mvCand.y >= (m_param->searchRange + 1) * 4))
+            costs[i] = m_me.COST_MAX;
+        else
+        {
+            cu.clipMv(mvCand);
+            predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
+            costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+        }
+    }
+
+    return costs[0] <= costs[1] ? 0 : 1;
+}
+
 void Search::PME::processTasks(int workerThreadId)
 {
 #if DETAILED_CU_STATS
@@ -1899,10 +1919,10 @@
     /* Setup slave Search instance for ME for master's CU */
     if (&slave != this)
     {
-        slave.setQP(*m_slice, m_rdCost.m_qp);
         slave.m_slice = m_slice;
         slave.m_frame = m_frame;
-
+        slave.m_param = m_param;
+        slave.setLambdaFromQP(pme.mode.cu, m_rdCost.m_qp);
        slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height);
     }
 
@@ -1910,9 +1930,9 @@
     do
     {
         if (meId < m_slice->m_numRefIdx[0])
-            slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 0, meId);
+            slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, meId);
         else
-            slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);
+            slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);
 
         meId = -1;
         pme.m_lock.acquire();
@@ -1923,55 +1943,30 @@
     while (meId >= 0);
 }
 
-void Search::singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu,
-                                    int part, int list, int ref)
+void Search::singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref)
 {
     uint32_t bits = master.m_listSelBits[list] + MVP_IDX_BITS;
     bits += getTUBits(ref, m_slice->m_numRefIdx[list]);
 
-    MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
-    int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
-
-    int mvpIdx = 0;
-    int merange = m_param->searchRange;
     MotionData* bestME = interMode.bestME[part];
 
-    if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
-    {
-        uint32_t bestCost = MAX_INT;
-        for (int i = 0; i < AMVP_NUM_CANDS; i++)
-        {
-            MV mvCand = interMode.amvpCand[list][ref][i];
-
-            // NOTE: skip mvCand if Y is > merange and -FN>1
-            if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
-                continue;
-
-            interMode.cu.clipMv(mvCand);
-
-            Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
-            predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
-            uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+    MV  mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
+    int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
 
-            if (bestCost > cost)
-            {
-                bestCost = cost;
-                mvpIdx = i;
-            }
-        }
-    }
+    const MV* amvp = interMode.amvpCand[list][ref];
+    int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref);
+    MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
 
-    MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-    setSearchRange(interMode.cu, mvp, merange, mvmin, mvmax);
+    setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax);
 
-    int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+    int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);
 
     /* Get total cost of partition, but only include MV bit cost once */
     bits += m_me.bitcost(outmv);
     uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);
 
-    /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
-    checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+    /* Refine MVP selection, updates: mvpIdx, bits, cost */
+    mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);
 
     /* tie goes to the smallest ref ID, just like --no-pme */
     ScopedLock _lock(master.m_meLock);
@@ -1988,7 +1983,7 @@
 }
 
 /* find the best inter prediction for each PU of specified mode */
-void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChromaSA8D)
+void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC)
 {
     ProfileCUScope(interMode.cu, motionEstimationElapsedTime, countMotionEstimate);
 
@@ -2009,7 +2004,6 @@
     Yuv&     tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
 
     MergeData merge;
-    uint32_t mrgCost;
     memset(&merge, 0, sizeof(merge));
 
     for (int puIdx = 0; puIdx < numPart; puIdx++)
@@ -2020,27 +2014,7 @@
         m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height);
 
         /* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */
-        if (cu.m_partSize[0] != SIZE_2Nx2N)
-        {
-            mrgCost = mergeEstimation(cu, cuGeom, pu, puIdx, merge);
-
-            if (bMergeOnly && mrgCost != MAX_UINT)
-            {
-                cu.m_mergeFlag[pu.puAbsPartIdx] = true;
-                cu.m_mvpIdx[0][pu.puAbsPartIdx] = merge.index; // merge candidate ID is stored in L0 MVP idx
-                cu.setPUInterDir(merge.dir, pu.puAbsPartIdx, puIdx);
-                cu.setPUMv(0, merge.mvField[0].mv, pu.puAbsPartIdx, puIdx);
-                cu.setPURefIdx(0, merge.mvField[0].refIdx, pu.puAbsPartIdx, puIdx);
-                cu.setPUMv(1, merge.mvField[1].mv, pu.puAbsPartIdx, puIdx);
-                cu.setPURefIdx(1, merge.mvField[1].refIdx, pu.puAbsPartIdx, puIdx);
-                totalmebits += merge.bits;
-
-                motionCompensation(cu, pu, *predYuv, true, bChromaSA8D);
-                continue;
-            }
-        }
-        else
-            mrgCost = MAX_UINT;
+        uint32_t mrgCost = numPart == 1 ? MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge);
 
         bestME[0].cost = MAX_UINT;
         bestME[1].cost = MAX_UINT;
@@ -2061,45 +2035,19 @@
 
                 int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
 
-                // Pick the best possible MVP from AMVP candidates based on least residual
-                int mvpIdx = 0;
-                int merange = m_param->searchRange;
+                const MV* amvp = interMode.amvpCand[list][ref];
+                int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
+                MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
 
-                if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
-                {
-                    uint32_t bestCost = MAX_INT;
-                    for (int i = 0; i < AMVP_NUM_CANDS; i++)
-                    {
-                        MV mvCand = interMode.amvpCand[list][ref][i];
-
-                        // NOTE: skip mvCand if Y is > merange and -FN>1
-                        if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
-                            continue;
-
-                        cu.clipMv(mvCand);
-                        predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand);
-                        uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
-
-                        if (bestCost > cost)
-                        {
-                            bestCost = cost;
-                            mvpIdx = i;
-                        }
-                    }
-                }
-
-                MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-
-                int satdCost;
-                setSearchRange(cu, mvp, merange, mvmin, mvmax);
-                satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+                setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
+                int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);
 
                 /* Get total cost of partition, but only include MV bit cost once */
                 bits += m_me.bitcost(outmv);
                 uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);
 
-                /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
-                checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+                /* Refine MVP selection, updates: mvpIdx, bits, cost */
+                mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);
 
                 if (cost < bestME[list].cost)
                 {
@@ -2122,7 +2070,7 @@
             {
                 processPME(pme, *this);
 
-                singleMotionEstimation(*this, interMode, cuGeom, pu, puIdx, 0, 0); /* L0-0 */
+                singleMotionEstimation(*this, interMode, pu, puIdx, 0, 0); /* L0-0 */
 
                 bDoUnidir = false;
 
@@ -2144,44 +2092,19 @@
 
                     int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
 
-                    // Pick the best possible MVP from AMVP candidates based on least residual
-                    int mvpIdx = 0;
-                    int merange = m_param->searchRange;
-
-                    if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
-                    {
-                        uint32_t bestCost = MAX_INT;
-                        for (int i = 0; i < AMVP_NUM_CANDS; i++)
-                        {
-                            MV mvCand = interMode.amvpCand[list][ref][i];
-
-                            // NOTE: skip mvCand if Y is > merange and -FN>1
-                            if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
-                                continue;
+                    const MV* amvp = interMode.amvpCand[list][ref];
+                    int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
+                    MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
 
-                            cu.clipMv(mvCand);
-                            predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand);
-                            uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
-
-                            if (bestCost > cost)
-                            {
-                                bestCost = cost;
-                                mvpIdx = i;
-                            }
-                        }
-                    }
-
-                    MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-
-                    setSearchRange(cu, mvp, merange, mvmin, mvmax);
-                    int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+                    setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
+                    int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);
 
                     /* Get total cost of partition, but only include MV bit cost once */
                     bits += m_me.bitcost(outmv);
                     uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);
 
-                    /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
-                    checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+                    /* Refine MVP selection, updates: mvpIdx, bits, cost */
+                    mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);
 
                     if (cost < bestME[list].cost)
                     {
@@ -2289,8 +2212,8 @@
                 uint32_t cost = satdCost + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1);
 
                 /* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */
-                checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvp0, mvpIdx0, bits0, cost);
-                checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvp1, mvpIdx1, bits1, cost);
+                mvp0 = checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvpIdx0, bits0, cost);
+                mvp1 = checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvpIdx1, bits1, cost);
 
                 if (cost < bidirCost)
                 {
@@ -2370,7 +2293,7 @@
             totalmebits += bestME[1].bits;
         }
 
-        motionCompensation(cu, pu, *predYuv, true, bChromaSA8D);
+        motionCompensation(cu, pu, *predYuv, true, bChromaMC);
     }
     X265_CHECK(interMode.ok(), "inter mode is not ok");
     interMode.sa8dBits += totalmebits;
@@ -2429,27 +2352,21 @@
 }
 
 /* Check if using an alternative MVP would result in a smaller MVD + signal bits */
-void Search::checkBestMVP(MV* amvpCand, MV mv, MV& mvPred, int& outMvpIdx, uint32_t& outBits, uint32_t& outCost) const
+const MV& Search::checkBestMVP(const MV* amvpCand, const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const
 {
-    X265_CHECK(amvpCand[outMvpIdx] == mvPred, "checkBestMVP: unexpected mvPred\n");
-
-    int mvpIdx = !outMvpIdx;
-    MV mvp = amvpCand[mvpIdx];
-    int diffBits = m_me.bitcost(mv, mvp) - m_me.bitcost(mv, mvPred);
+    int diffBits = m_me.bitcost(mv, amvpCand[!mvpIdx]) - m_me.bitcost(mv, amvpCand[mvpIdx]);
    if (diffBits < 0)
     {
-        outMvpIdx = mvpIdx;
-        mvPred = mvp;
+        mvpIdx = !mvpIdx;
        uint32_t origOutBits = outBits;
         outBits = origOutBits + diffBits;
         outCost = (outCost - m_rdCost.getCost(origOutBits)) + m_rdCost.getCost(outBits);
     }
+    return amvpCand[mvpIdx];
 }
 
-void Search::setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const
+void Search::setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const
 {
-    cu.clipMv(mvp);
-
     MV dist((int16_t)merange << 2, (int16_t)merange << 2);
     mvmin = mvp - dist;
     mvmax = mvp + dist;
@@ -2534,9 +2451,6 @@
     uint32_t log2CUSize = cuGeom.log2CUSize;
     int sizeIdx = log2CUSize - 2;
 
-    uint32_t tqBypass = cu.m_tqBypass[0];
-    m_quant.setQPforQuant(interMode.cu);
-
     resiYuv->subtract(*fencYuv, *predYuv, log2CUSize);
 
     uint32_t tuDepthRange[2];
@@ -2547,6 +2461,7 @@
     Cost costs;
     estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
 
+    uint32_t tqBypass = cu.m_tqBypass[0];
     if (!tqBypass)
     {
         uint32_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
@@ -2631,7 +2546,7 @@
     interMode.coeffBits = coeffBits;
     interMode.mvBits = bits - coeffBits;
     updateModeCost(interMode);
-    checkDQP(interMode.cu, cuGeom);
+    checkDQP(interMode, cuGeom);
 }
 
 void Search::residualTransformQuantInter(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, const uint32_t depthRange[2])
@@ -3448,22 +3363,43 @@
     }
 }
 
-void Search::checkDQP(CUData& cu, const CUGeom& cuGeom)
+void Search::checkDQP(Mode& mode, const CUGeom& cuGeom)
 {
+    CUData& cu = mode.cu;
     if (cu.m_slice->m_pps->bUseDQP && cuGeom.depth <= cu.m_slice->m_pps->maxCuDQPDepth)
     {
         if (cu.getQtRootCbf(0))
         {
-            /* When analysing RDO with DQP bits, the entropy encoder should add the cost of DQP bits here
-             * i.e Encode QP */
+            if (m_param->rdLevel >= 3)
+            {
+                mode.contexts.resetBits();
+                mode.contexts.codeDeltaQP(cu, 0);
+                uint32_t bits = mode.contexts.getNumberOfWrittenBits();
+                mode.mvBits += bits;
+                mode.totalBits += bits;
+                updateModeCost(mode);
+            }
+            else if (m_param->rdLevel <= 1)
+            {
+                mode.sa8dBits++;
+                mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits);
+            }
+            else
+            {
+                mode.mvBits++;
+                mode.totalBits++;
+                updateModeCost(mode);
+            }
        }
         else
             cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth);
     }
 }
 
-void Search::checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom)
+void Search::checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom)
 {
+    CUData& cu = mode.cu;
+
     if ((cuGeom.depth == cu.m_slice->m_pps->maxCuDQPDepth) && cu.m_slice->m_pps->bUseDQP)
     {
         bool hasResidual = false;
@@ -3478,10 +3414,31 @@
             }
         }
         if (hasResidual)
-            /* TODO: Encode QP, and recalculate RD cost of splitPred */
+        {
+            if (m_param->rdLevel >= 3)
+            {
+                mode.contexts.resetBits();
+                mode.contexts.codeDeltaQP(cu, 0);
+                uint32_t bits = mode.contexts.getNumberOfWrittenBits();
+                mode.mvBits += bits;
+                mode.totalBits += bits;
+                updateModeCost(mode);
+            }
+            else if (m_param->rdLevel <= 1)
+            {
+                mode.sa8dBits++;
+                mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits);
+            }
+            else
+            {
+                mode.mvBits++;
+                mode.totalBits++;
+                updateModeCost(mode);
+            }
            /* For all zero CBF sub-CUs, reset QP to RefQP (so that deltaQP is not signalled).
             When the non-zero CBF sub-CU is found, stop */
             cu.setQPSubCUs(cu.getRefQP(0), 0, cuGeom.depth);
+        }
        else
             /* No residual within this CU or subCU, so reset QP to RefQP */
             cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth);
x265_1.6.tar.gz/source/encoder/search.h -> x265_1.7.tar.gz/source/encoder/search.h Changed
 
@@ -287,7 +287,7 @@
     ~Search();
 
     bool     initSearch(const x265_param& param, ScalingList& scalingList);
-    void     setQP(const Slice& slice, int qp);
+    int      setLambdaFromQP(const CUData& ctu, int qp); /* returns real quant QP in valid spec range */
 
     // mark temp RD entropy contexts as uninitialized; useful for finding loads without stores
     void     invalidateContexts(int fromDepth);
@@ -301,7 +301,7 @@
     void     encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom);
 
     // estimation inter prediction (non-skip)
-    void     predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChroma);
+    void     predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC);
 
     // encode residual and compute rd-cost for inter mode
     void     encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom);
@@ -316,8 +316,8 @@
     void     getBestIntraModeChroma(Mode& intraMode, const CUGeom& cuGeom);
 
     /* update CBF flags and QP values to be internally consistent */
-    void checkDQP(CUData& cu, const CUGeom& cuGeom);
-    void checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom);
+    void checkDQP(Mode& mode, const CUGeom& cuGeom);
+    void checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom);
 
     class PME : public BondedTaskGroup
     {
@@ -339,7 +339,7 @@
     };
 
     void     processPME(PME& pme, Search& slave);
-    void     singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu, int part, int list, int ref);
+    void     singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref);
 
 protected:
 
@@ -396,8 +396,9 @@
     };
 
     /* inter/ME helper functions */
-    void     checkBestMVP(MV* amvpCand, MV cMv, MV& mvPred, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
-    void     setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const;
+    int       selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref);
+    const MV& checkBestMVP(const MV amvpCand[2], const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
+    void     setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const;
     uint32_t mergeEstimation(CUData& cu, const CUGeom& cuGeom, const PredictionUnit& pu, int puIdx, MergeData& m);
     static void getBlkBits(PartSize cuMode, bool bPSlice, int puIdx, uint32_t lastMode, uint32_t blockBit[3]);
 
x265_1.6.tar.gz/source/encoder/sei.h -> x265_1.7.tar.gz/source/encoder/sei.h Changed
 
@@ -71,6 +71,8 @@
         DECODED_PICTURE_HASH                 = 132,
         SCALABLE_NESTING                     = 133,
         REGION_REFRESH_INFO                  = 134,
+        MASTERING_DISPLAY_INFO               = 137,
+        CONTENT_LIGHT_LEVEL_INFO             = 144,
     };
 
     virtual PayloadType payloadType() const = 0;
@@ -111,6 +113,73 @@
     }
 };
 
+class SEIMasteringDisplayColorVolume : public SEI
+{
+public:
+
+    uint16_t displayPrimaryX[3];
+    uint16_t displayPrimaryY[3];
+    uint16_t whitePointX, whitePointY;
+    uint32_t maxDisplayMasteringLuminance;
+    uint32_t minDisplayMasteringLuminance;
+
+    PayloadType payloadType() const { return MASTERING_DISPLAY_INFO; }
+
+    bool parse(const char* value)
+    {
+        return sscanf(value, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
+                      &displayPrimaryX[0], &displayPrimaryY[0],
+                      &displayPrimaryX[1], &displayPrimaryY[1],
+                      &displayPrimaryX[2], &displayPrimaryY[2],
+                      &whitePointX, &whitePointY,
+                      &maxDisplayMasteringLuminance, &minDisplayMasteringLuminance) == 10;
+    }
+
+    void write(Bitstream& bs, const SPS&)
+    {
+        m_bitIf = &bs;
+
+        WRITE_CODE(MASTERING_DISPLAY_INFO, 8, "payload_type");
+        WRITE_CODE(8 * 2 + 2 * 4, 8, "payload_size");
+
+        for (uint32_t i = 0; i < 3; i++)
+        {
+            WRITE_CODE(displayPrimaryX[i], 16, "display_primaries_x[ c ]");
+            WRITE_CODE(displayPrimaryY[i], 16, "display_primaries_y[ c ]");
+        }
+        WRITE_CODE(whitePointX, 16, "white_point_x");
+        WRITE_CODE(whitePointY, 16, "white_point_y");
+        WRITE_CODE(maxDisplayMasteringLuminance, 32, "max_display_mastering_luminance");
+        WRITE_CODE(minDisplayMasteringLuminance, 32, "min_display_mastering_luminance");
+    }
+};
+
+class SEIContentLightLevel : public SEI
+{
+public:
+
+    uint16_t max_content_light_level;
+    uint16_t max_pic_average_light_level;
+
+    PayloadType payloadType() const { return CONTENT_LIGHT_LEVEL_INFO; }
+
+    bool parse(const char* value)
+    {
+        return sscanf(value, "%hu,%hu",
+                      &max_content_light_level, &max_pic_average_light_level) == 2;
+    }
+
+    void write(Bitstream& bs, const SPS&)
+    {
+        m_bitIf = &bs;
+
+        WRITE_CODE(CONTENT_LIGHT_LEVEL_INFO, 8, "payload_type");
+        WRITE_CODE(4, 8, "payload_size");
+        WRITE_CODE(max_content_light_level,     16, "max_content_light_level");
+        WRITE_CODE(max_pic_average_light_level, 16, "max_pic_average_light_level");
+    }
+};
+
 class SEIDecodedPictureHash : public SEI
 {
 public:
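As context for the SEIMasteringDisplayColorVolume::parse() hunk above: the `--master-display` string it scans carries the three primaries and white point in 0.00002 units and the two luminance bounds in 0.0001 cd/m2 units. A minimal Python sketch of the same parse (not part of x265; the regex simply mirrors the sscanf format string, and the sample string is the commonly cited BT.2020/ST 2084 description):

```python
import re

# Mirrors the sscanf pattern "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
PATTERN = re.compile(
    r"G\((\d+),(\d+)\)B\((\d+),(\d+)\)R\((\d+),(\d+)\)"
    r"WP\((\d+),(\d+)\)L\((\d+),(\d+)\)")

def parse_master_display(value):
    """Return the ten integers of a --master-display string, or None
    on a mismatch (analogous to parse() returning false)."""
    m = PATTERN.fullmatch(value)
    return [int(g) for g in m.groups()] if m else None

# BT.2020 primaries with a 0.0001..1000 cd/m2 mastering display:
example = "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)"
fields = parse_master_display(example)
```

Note the payload size in write() is fixed: 8 sixteen-bit fields plus 2 thirty-two-bit fields, i.e. 8 * 2 + 2 * 4 = 24 bytes.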
x265_1.6.tar.gz/source/encoder/slicetype.cpp -> x265_1.7.tar.gz/source/encoder/slicetype.cpp Changed
 
@@ -44,23 +44,6 @@
 
 namespace {
 
-inline int16_t median(int16_t a, int16_t b, int16_t c)
-{
-    int16_t t = (a - b) & ((a - b) >> 31);
-
-    a -= t;
-    b += t;
-    b -= (b - c) & ((b - c) >> 31);
-    b += (a - b) & ((a - b) >> 31);
-    return b;
-}
-
-inline void median_mv(MV &dst, MV a, MV b, MV c)
-{
-    dst.x = median(a.x, b.x, c.x);
-    dst.y = median(a.y, b.y, c.y);
-}
-
 /* Compute variance to derive AC energy of each block */
 inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane)
 {
@@ -492,8 +475,6 @@
     m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_8x8Width - 2) * (m_8x8Height - 2) : m_8x8Width * m_8x8Height;
 
     m_lastKeyframe = -m_param->keyframeMax;
-    memset(m_preframes, 0, sizeof(m_preframes));
-    m_preTotal = m_preAcquired = m_preCompleted = 0;
     m_sliceTypeBusy = false;
     m_fullQueueSize = X265_MAX(1, m_param->lookaheadDepth);
     m_bAdaptiveQuant = m_param->rc.aqMode || m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred;
@@ -568,14 +549,14 @@
     return m_tld && m_scratch;
 }
 
-void Lookahead::stop()
+void Lookahead::stopJobs()
 {
     if (m_pool && !m_inputQueue.empty())
     {
-        m_preLookaheadLock.acquire();
+        m_inputLock.acquire();
         m_isActive = false;
         bool wait = m_outputSignalRequired = m_sliceTypeBusy;
-        m_preLookaheadLock.release();
+        m_inputLock.release();
 
         if (wait)
             m_outputSignal.wait();
@@ -634,19 +615,11 @@
             m_filled = true; /* full capacity plus mini-gop lag */
     }
 
-    m_preLookaheadLock.acquire();
-
     m_inputLock.acquire();
     m_inputQueue.pushBack(curFrame);
-    m_inputLock.release();
-
-    m_preframes[m_preTotal++] = &curFrame;
-    X265_CHECK(m_preTotal <= X265_LOOKAHEAD_MAX, "prelookahead overflow\n");
-    
-    m_preLookaheadLock.release();
-
-    if (m_pool)
+    if (m_pool && m_inputQueue.size() >= m_fullQueueSize)
         tryWakeOne();
+    m_inputLock.release();
 }
 
 /* Called by API thread */
@@ -657,74 +630,33 @@
     m_filled = true;
 }
 
-void Lookahead::findJob(int workerThreadID)
+void Lookahead::findJob(int /*workerThreadID*/)
 {
-    Frame* preFrame;
-    bool   doDecide;
-
-    if (!m_isActive)
-        return;
-
-    int tld = workerThreadID;
-    if (workerThreadID < 0)
-        tld = m_pool ? m_pool->m_numWorkers : 0;
+    bool doDecide;
 
-    m_preLookaheadLock.acquire();
-    do
-    {
-        preFrame = NULL;
-        doDecide = false;
+    m_inputLock.acquire();
+    if (m_inputQueue.size() >= m_fullQueueSize && !m_sliceTypeBusy && m_isActive)
+        doDecide = m_sliceTypeBusy = true;
+    else
+        doDecide = m_helpWanted = false;
+    m_inputLock.release();
 
-        if (m_preTotal > m_preAcquired)
-            preFrame = m_preframes[m_preAcquired++];
-        else
-        {
-            if (m_preTotal == m_preCompleted)
-                m_preAcquired = m_preTotal = m_preCompleted = 0;
-
-            /* the worker thread that performs the last pre-lookahead will generally get to run
-             * slicetypeDecide() */
-            m_inputLock.acquire();
-            if (!m_sliceTypeBusy && !m_preTotal && m_inputQueue.size() >= m_fullQueueSize && m_isActive)
-                doDecide = m_sliceTypeBusy = true;
-            else
-                m_helpWanted = false;
-            m_inputLock.release();
-        }
-        m_preLookaheadLock.release();
+    if (!doDecide)
+        return;
 
-        if (preFrame)
-        {
-            ProfileLookaheadTime(m_preLookaheadElapsedTime, m_countPreLookahead);
-            ProfileScopeEvent(prelookahead);
-
-            preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
-            if (m_param->rc.bStatRead && m_param->rc.cuTree && IS_REFERENCED(preFrame))
-                /* cu-tree offsets were read from stats file */;
-            else if (m_bAdaptiveQuant)
-                m_tld[tld].calcAdaptiveQuantFrame(preFrame, m_param);
-            m_tld[tld].lowresIntraEstimate(preFrame->m_lowres);
-
-            m_preLookaheadLock.acquire(); /* re-acquire for next pass */
-            m_preCompleted++;
-        }
-        else if (doDecide)
-        {
-            ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
-            ProfileScopeEvent(slicetypeDecideEV);
+    ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
+    ProfileScopeEvent(slicetypeDecideEV);
 
-            slicetypeDecide();
+    slicetypeDecide();
 
-            m_preLookaheadLock.acquire(); /* re-acquire for next pass */
-            if (m_outputSignalRequired)
-            {
-                m_outputSignal.trigger();
-                m_outputSignalRequired = false;
-            }
-            m_sliceTypeBusy = false;
-        }
+    m_inputLock.acquire();
+    if (m_outputSignalRequired)
+    {
+        m_outputSignal.trigger();
+        m_outputSignalRequired = false;
    }
-    while (preFrame || doDecide);
+    m_sliceTypeBusy = false;
+    m_inputLock.release();
 }
 
 /* Called by API thread */
@@ -739,13 +671,11 @@
         if (out)
             return out;
 
-        /* process all pending pre-lookahead frames and run slicetypeDecide() if
-         * necessary */
-        findJob(-1);
+        findJob(-1); /* run slicetypeDecide() if necessary */
 
-        m_preLookaheadLock.acquire();
-        bool wait = m_outputSignalRequired = m_sliceTypeBusy || m_preTotal;
-        m_preLookaheadLock.release();
+        m_inputLock.acquire();
+        bool wait = m_outputSignalRequired = m_sliceTypeBusy;
+        m_inputLock.release();
 
         if (wait)
             m_outputSignal.wait();
@@ -809,7 +739,7 @@
     {
         /* aggregate lowres row satds to CTU resolution */
         curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCosts[b - p0][p1 - b];
-        uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0;
+        uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0;
         uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE);
         uint32_t numCuInHeight = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize;
         uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height;
@@ -823,7 +753,7 @@
             lowresRow = row * scale;
             for (uint32_t cnt = 0; cnt < scale && lowresRow < heightInLowresCu; lowresRow++, cnt++)
             {
-                sum = 0;
+                sum = 0; intraSum = 0;
                 lowresCuIdx = lowresRow * widthInLowresCu;
                 for (lowresCol = 0; lowresCol < widthInLowresCu; lowresCol++, lowresCuIdx++)
                 {
@@ -836,24 +766,57 @@
                     }
                     curFrame->m_lowres.lowresCostForRc[lowresCuIdx] = lowresCuCost;
                     sum += lowresCuCost;
+                    intraSum += curFrame->m_lowres.intraCost[lowresCuIdx];
                 }
                 curFrame->m_encData->m_rowStat[row].satdForVbv += sum;
+                curFrame->m_encData->m_rowStat[row].intraSatdForVbv += intraSum;
             }
         }
     }
 }
 
+void PreLookaheadGroup::processTasks(int workerThreadID)
+{
+    if (workerThreadID < 0)
+        workerThreadID = m_lookahead.m_pool ? m_lookahead.m_pool->m_numWorkers : 0;
+    LookaheadTLD& tld = m_lookahead.m_tld[workerThreadID];
+
+    m_lock.acquire();
+    while (m_jobAcquired < m_jobTotal)
+    {
+        Frame* preFrame = m_preframes[m_jobAcquired++];
+        ProfileLookaheadTime(m_lookahead.m_preLookaheadElapsedTime, m_lookahead.m_countPreLookahead);
+        ProfileScopeEvent(prelookahead);
+        m_lock.release();
+
+        preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
+        if (m_lookahead.m_param->rc.bStatRead && m_lookahead.m_param->rc.cuTree && IS_REFERENCED(preFrame))
+            /* cu-tree offsets were read from stats file */;
+        else if (m_lookahead.m_bAdaptiveQuant)
+            tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param);
+        tld.lowresIntraEstimate(preFrame->m_lowres);
+        preFrame->m_lowresInit = true;
+
+        m_lock.acquire();
+    }
+    m_lock.release();
+}
+
 /* called by API thread or worker thread with inputQueueLock acquired */
 void Lookahead::slicetypeDecide()
 {
-    Lowres *frames[X265_LOOKAHEAD_MAX];
-    Frame *list[X265_LOOKAHEAD_MAX];
-    int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX);
+    PreLookaheadGroup pre(*this);
 
+    Lowres* frames[X265_LOOKAHEAD_MAX + X265_BFRAME_MAX + 4];
+    Frame*  list[X265_BFRAME_MAX + 4];
     memset(frames, 0, sizeof(frames));
     memset(list, 0, sizeof(list));
+    int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX);
+    maxSearch = X265_MAX(1, maxSearch);
+
     {
         ScopedLock lock(m_inputLock);
+
         Frame *curFrame = m_inputQueue.first();
         int j;
         for (j = 0; j < m_param->bframes + 2; j++)
@@ -869,13 +832,25 @@
         {
             if (!curFrame) break;
             frames[j + 1] = &curFrame->m_lowres;
-            X265_CHECK(curFrame->m_lowres.costEst[0][0] > 0, "prelookahead not completed for input picture\n");
+
+            if (!curFrame->m_lowresInit)
+                pre.m_preframes[pre.m_jobTotal++] = curFrame;
+
            curFrame = curFrame->m_next;
         }
 
         maxSearch = j;
     }
 
+    /* perform pre-analysis on frames which need it, using a bonded task group */
+    if (pre.m_jobTotal)
+    {
+        if (m_pool)
+            pre.tryBondPeers(*m_pool, pre.m_jobTotal);
+        pre.processTasks(-1);
+        pre.waitForExit();
+    }
+
     if (m_lastNonB && !m_param->rc.bStatRead &&
         ((m_param->bFrameAdaptive && m_param->bframes) ||
          m_param->rc.cuTree || m_param->scenecutThreshold ||
@@ -2038,12 +2013,10 @@
 
         int numc = 0;
         MV mvc[4], mvp;
-
         MV* fencMV = &fenc->lowresMvs[i][listDist[i]][cuXY];
+        ReferencePlanes* fref = i ? fref1 : wfref0;
 
         /* Reverse-order MV prediction */
-        mvc[0] = 0;
-        mvc[2] = 0;
 #define MVC(mv) mvc[numc++] = mv;
         if (cuX < widthInCU - 1)
             MVC(fencMV[1]);
@@ -2056,12 +2029,29 @@
                 MVC(fencMV[widthInCU + 1]);
         }
 #undef MVC
-        if (numc <= 1)
-            mvp = mvc[0];
+
+        if (!numc)
+            mvp = 0;
         else
-            median_mv(mvp, mvc[0], mvc[1], mvc[2]);
+        {
+            ALIGN_VAR_32(pixel, subpelbuf[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]);
+            int mvpcost = MotionEstimate::COST_MAX;
+
+            /* measure SATD cost of each neighbor MV (estimating merge analysis)
+             * and use the lowest cost MV as MVP (estimating AMVP). Since all
+             * mvc[] candidates are measured here, none are passed to motionEstimate */
+            for (int idx = 0; idx < numc; idx++)
+            {
+                intptr_t stride = X265_LOWRES_CU_SIZE;
+                pixel *src = fref->lowresMC(pelOffset, mvc[idx], subpelbuf, stride);
+                int cost = tld.me.bufSATD(src, stride);
+                COPY2_IF_LT(mvpcost, cost, mvp, mvc[idx]);
+            }
+        }
 
-        fencCost = tld.me.motionEstimate(i ? fref1 : wfref0, mvmin, mvmax, mvp, numc, mvc, s_merange, *fencMV);
+        /* ME will never return a cost larger than the cost @MVP, so we do not
+         * have to check that ME cost is more than the estimated merge cost */
+        fencCost = tld.me.motionEstimate(fref, mvmin, mvmax, mvp, 0, NULL, s_merange, *fencMV);
         COPY2_IF_LT(bcost, fencCost, listused, i + 1);
     }
 
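As context for the slicetype.cpp hunks above: 1.6 predicted the lowres MVP as the component-wise median of up to three neighbour MVs (the deleted median()/median_mv() helpers), while 1.7 measures each candidate's SATD cost and keeps the cheapest. A minimal Python sketch of the two policies (the cost callable is a stand-in for the real lowresMC()/bufSATD() measurement, not x265 code):

```python
def median3(a, b, c):
    # Equivalent to the removed branchless median(): the middle of three values.
    return sorted((a, b, c))[1]

def best_mvp(candidates, cost):
    # New scheme: evaluate every neighbour MV and keep the lowest-cost one,
    # falling back to the zero MV when there are no candidates.
    return min(candidates, key=cost) if candidates else (0, 0)

# Example: sum-of-absolute-components as a toy cost function
mvp = best_mvp([(4, 0), (1, 2), (8, -2)], cost=lambda mv: abs(mv[0]) + abs(mv[1]))
```

Because every candidate is costed up front, motionEstimate() no longer needs the mvc[] list, which is why the call site passes 0 and NULL.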
x265_1.6.tar.gz/source/encoder/slicetype.h -> x265_1.7.tar.gz/source/encoder/slicetype.h Changed
 
@@ -105,8 +105,6 @@
     Lock          m_outputLock;
 
     /* pre-lookahead */
-    Frame*        m_preframes[X265_LOOKAHEAD_MAX];
-    int           m_preTotal, m_preAcquired, m_preCompleted;
     int           m_fullQueueSize;
     bool          m_isActive;
     bool          m_sliceTypeBusy;
@@ -114,7 +112,6 @@
     bool          m_outputSignalRequired;
     bool          m_bBatchMotionSearch;
     bool          m_bBatchFrameCosts;
-    Lock          m_preLookaheadLock;
     Event         m_outputSignal;
 
     LookaheadTLD* m_tld;
@@ -143,7 +140,7 @@
 
     bool    create();
     void    destroy();
-    void    stop();
+    void    stopJobs();
 
     void    addPicture(Frame&, int sliceType);
     void    flush();
@@ -176,6 +173,22 @@
     int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b);
 };
 
+class PreLookaheadGroup : public BondedTaskGroup
+{
+public:
+
+    Frame* m_preframes[X265_LOOKAHEAD_MAX];
+    Lookahead& m_lookahead;
+
+    PreLookaheadGroup(Lookahead& l) : m_lookahead(l) {}
+
+    void processTasks(int workerThreadID);
+
+protected:
+
+    PreLookaheadGroup& operator=(const PreLookaheadGroup&);
+};
+
 class CostEstimateGroup : public BondedTaskGroup
 {
 public:
x265_1.6.tar.gz/source/input/input.cpp -> x265_1.7.tar.gz/source/input/input.cpp Changed
 
@@ -27,7 +27,7 @@
 
 using namespace x265;
 
-Input* Input::open(InputFileInfo& info, bool bForceY4m)
+InputFile* InputFile::open(InputFileInfo& info, bool bForceY4m)
 {
     const char * s = strrchr(info.filename, '.');
 
x265_1.6.tar.gz/source/input/input.h -> x265_1.7.tar.gz/source/input/input.h Changed
 
@@ -48,23 +48,25 @@
     int sarWidth;
     int sarHeight;
     int frameCount;
+    int timebaseNum;
+    int timebaseDenom;
 
     /* user supplied */
     int skipFrames;
     const char *filename;
 };
 
-class Input
+class InputFile
 {
 protected:
 
-    virtual ~Input()  {}
+    virtual ~InputFile()  {}
 
 public:
 
-    Input()           {}
+    InputFile()           {}
 
-    static Input* open(InputFileInfo& info, bool bForceY4m);
+    static InputFile* open(InputFileInfo& info, bool bForceY4m);
 
     virtual void startReader() = 0;
 
x265_1.6.tar.gz/source/input/y4m.cpp -> x265_1.7.tar.gz/source/input/y4m.cpp Changed
 
@@ -46,9 +46,6 @@
     for (int i = 0; i < QUEUE_SIZE; i++)
         buf[i] = NULL;
 
-    readCount.set(0);
-    writeCount.set(0);
-
     threadActive = false;
     colorSpace = info.csp;
     sarWidth = info.sarWidth;
@@ -164,7 +161,7 @@
 void Y4MInput::release()
 {
     threadActive = false;
-    readCount.set(readCount.get()); // unblock file reader
+    readCount.poke();
     stop();
     delete this;
 }
@@ -352,7 +349,7 @@
     while (threadActive);
 
     threadActive = false;
-    writeCount.set(writeCount.get()); // unblock readPicture
+    writeCount.poke();
 }
 
 bool Y4MInput::populateFrameQueue()
x265_1.6.tar.gz/source/input/y4m.h -> x265_1.7.tar.gz/source/input/y4m.h Changed
 
@@ -33,7 +33,7 @@
 namespace x265 {
 // x265 private namespace
 
-class Y4MInput : public Input, public Thread
+class Y4MInput : public InputFile, public Thread
 {
 protected:
 
x265_1.6.tar.gz/source/input/yuv.cpp -> x265_1.7.tar.gz/source/input/yuv.cpp Changed
 
@@ -44,8 +44,6 @@
     for (int i = 0; i < QUEUE_SIZE; i++)
         buf[i] = NULL;
 
-    readCount.set(0);
-    writeCount.set(0);
     depth = info.depth;
     width = info.width;
     height = info.height;
@@ -152,7 +150,7 @@
 void YUVInput::release()
 {
     threadActive = false;
-    readCount.set(readCount.get()); // unblock read thread
+    readCount.poke();
     stop();
     delete this;
 }
@@ -175,7 +173,7 @@
     }
 
     threadActive = false;
-    writeCount.set(writeCount.get()); // unblock readPicture
+    writeCount.poke();
 }
 
 bool YUVInput::populateFrameQueue()
x265_1.6.tar.gz/source/input/yuv.h -> x265_1.7.tar.gz/source/input/yuv.h Changed
 
@@ -33,7 +33,7 @@
 namespace x265 {
 // private x265 namespace
 
-class YUVInput : public Input, public Thread
+class YUVInput : public InputFile, public Thread
 {
 protected:
 
x265_1.6.tar.gz/source/output/output.cpp -> x265_1.7.tar.gz/source/output/output.cpp Changed
 
@@ -1,7 +1,8 @@
 /*****************************************************************************
- * Copyright (C) 2013 x265 project
+ * Copyright (C) 2013-2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -25,9 +26,11 @@
 #include "yuv.h"
 #include "y4m.h"
 
+#include "raw.h"
+
 using namespace x265;
 
-Output* Output::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
+ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
 {
     const char * s = strrchr(fname, '.');
 
@@ -36,3 +39,8 @@
     else
         return new YUVOutput(fname, width, height, bitdepth, csp);
 }
+
+OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo)
+{
+    return new RAWOutput(fname, inputInfo);
+}
x265_1.6.tar.gz/source/output/output.h -> x265_1.7.tar.gz/source/output/output.h Changed
 
@@ -1,7 +1,8 @@
 /*****************************************************************************
- * Copyright (C) 2013 x265 project
+ * Copyright (C) 2013-2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -25,22 +26,23 @@
 #define X265_OUTPUT_H
 
 #include "x265.h"
+#include "input/input.h"
 
 namespace x265 {
 // private x265 namespace
 
-class Output
+class ReconFile
 {
 protected:
 
-    virtual ~Output()  {}
+    virtual ~ReconFile()  {}
 
 public:
 
-    Output()           {}
+    ReconFile()           {}
 
-    static Output* open(const char *fname, int width, int height, uint32_t bitdepth,
-                        uint32_t fpsNum, uint32_t fpsDenom, int csp);
+    static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth,
+                           uint32_t fpsNum, uint32_t fpsDenom, int csp);
 
     virtual bool isFail() const = 0;
 
@@ -50,6 +52,35 @@
 
     virtual const char *getName() const = 0;
 };
+
+class OutputFile
+{
+protected:
+
+    virtual ~OutputFile() {}
+
+public:
+
+    OutputFile() {}
+
+    static OutputFile* open(const char* fname, InputFileInfo& inputInfo);
+
+    virtual bool isFail() const = 0;
+
+    virtual bool needPTS() const = 0;
+
+    virtual void release() = 0;
+
+    virtual const char* getName() const = 0;
+
+    virtual void setParam(x265_param* param) = 0;
+
+    virtual int writeHeaders(const x265_nal* nal, uint32_t nalcount) = 0;
+
+    virtual int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture& pic) = 0;
+
+    virtual void closeFile(int64_t largest_pts, int64_t second_largest_pts) = 0;
+};
 }
 
 #endif // ifndef X265_OUTPUT_H
x265_1.7.tar.gz/source/output/raw.cpp Added
 
@@ -0,0 +1,80 @@
+/*****************************************************************************
+ * Copyright (C) 2013-2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "raw.h"
+
+using namespace x265;
+using namespace std;
+
+RAWOutput::RAWOutput(const char* fname, InputFileInfo&)
+{
+    b_fail = false;
+    if (!strcmp(fname, "-"))
+    {
+        ofs = &cout;
+        return;
+    }
+    ofs = new ofstream(fname, ios::binary | ios::out);
+    if (ofs->fail())
+        b_fail = true;
+}
+
+void RAWOutput::setParam(x265_param* param)
+{
+    param->bAnnexB = true;
+}
+
+int RAWOutput::writeHeaders(const x265_nal* nal, uint32_t nalcount)
+{
+    uint32_t bytes = 0;
+
+    for (uint32_t i = 0; i < nalcount; i++)
+    {
+        ofs->write((const char*)nal->payload, nal->sizeBytes);
+        bytes += nal->sizeBytes;
+        nal++;
+    }
+
+    return bytes;
+}
+
+int RAWOutput::writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&)
+{
+    uint32_t bytes = 0;
+
+    for (uint32_t i = 0; i < nalcount; i++)
+    {
+        ofs->write((const char*)nal->payload, nal->sizeBytes);
+        bytes += nal->sizeBytes;
+        nal++;
+    }
+
+    return bytes;
+}
+
+void RAWOutput::closeFile(int64_t, int64_t)
+{
+    if (ofs != &cout)
+        delete ofs;
+}
x265_1.7.tar.gz/source/output/raw.h Added
 
@@ -0,0 +1,64 @@
+/*****************************************************************************
+ * Copyright (C) 2013-2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_HEVC_RAW_H
+#define X265_HEVC_RAW_H
+
+#include "output.h"
+#include "common.h"
+#include <fstream>
+#include <iostream>
+
+namespace x265 {
+class RAWOutput : public OutputFile
+{
+protected:
+
+    std::ostream* ofs;
+
+    bool b_fail;
+
+public:
+
+    RAWOutput(const char* fname, InputFileInfo&);
+
+    bool isFail() const { return b_fail; }
+
+    bool needPTS() const { return false; }
+
+    void release() { delete this; }
+
+    const char* getName() const { return "raw"; }
+
+    void setParam(x265_param* param);
+
+    int writeHeaders(const x265_nal* nal, uint32_t nalcount);
+
+    int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&);
+
+    void closeFile(int64_t largest_pts, int64_t second_largest_pts);
+};
+}
+
+#endif // ifndef X265_HEVC_RAW_H
x265_1.7.tar.gz/source/output/reconplay.cpp Added
@@ -0,0 +1,197 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "common.h"
+#include "reconplay.h"
+
+#include <signal.h>
+
+using namespace x265;
+
+#if _WIN32
+#define popen  _popen
+#define pclose _pclose
+#define pipemode "wb"
+#else
+#define pipemode "w"
+#endif
+
+bool ReconPlay::pipeValid;
+
+#ifndef _WIN32
+static void sigpipe_handler(int)
+{
+    if (ReconPlay::pipeValid)
+        general_log(NULL, "exec", X265_LOG_ERROR, "pipe closed\n");
+    ReconPlay::pipeValid = false;
+}
+#endif
+
+ReconPlay::ReconPlay(const char* commandLine, x265_param& param)
+{
+#ifndef _WIN32
+    if (signal(SIGPIPE, sigpipe_handler) == SIG_ERR)
+        general_log(&param, "exec", X265_LOG_ERROR, "Unable to register SIGPIPE handler: %s\n", strerror(errno));
+#endif
+
+    width = param.sourceWidth;
+    height = param.sourceHeight;
+    colorSpace = param.internalCsp;
+
+    frameSize = 0;
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+        frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i]));
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+    {
+        poc[i] = -1;
+        CHECKED_MALLOC(frameData[i], pixel, frameSize);
+    }
+
+    outputPipe = popen(commandLine, pipemode);
+    if (outputPipe)
+    {
+        const char* csp = (colorSpace >= X265_CSP_I444) ? "444" : (colorSpace >= X265_CSP_I422) ? "422" : "420";
+        const char* depth = (param.internalBitDepth == 10) ? "p10" : "";
+
+        fprintf(outputPipe, "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n", width, height, param.fpsNum, param.fpsDenom, csp, depth);
+
+        pipeValid = true;
+        threadActive = true;
+        start();
+        return;
+    }
+    else
+        general_log(&param, "exec", X265_LOG_ERROR, "popen(%s) failed\n", commandLine);
+
+fail:
+    threadActive = false;
+}
+
+ReconPlay::~ReconPlay()
+{
+    if (threadActive)
+    {
+        threadActive = false;
+        writeCount.poke();
+        stop();
+    }
+
+    if (outputPipe)
+        pclose(outputPipe);
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+        X265_FREE(frameData[i]);
+}
+
+bool ReconPlay::writePicture(const x265_picture& pic)
+{
+    if (!threadActive || !pipeValid)
+        return false;
+
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = pic.poc % RECON_BUF_SIZE;
+
+    /* TODO: it's probably better to drop recon pictures when the ring buffer is
+     * backed up on the display app */
+    while (written - read > RECON_BUF_SIZE - 2 || poc[currentCursor] != -1)
+    {
+        read = readCount.waitForChange(read);
+        if (!threadActive)
+            return false;
+    }
+
+    X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
+    X265_CHECK(pic.bitDepth == X265_DEPTH,   "invalid bit depth\n");
+
+    pixel* buf = frameData[currentCursor];
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+    {
+        char* src = (char*)pic.planes[i];
+        int pwidth = width >> x265_cli_csps[colorSpace].width[i];
+
+        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+        {
+            memcpy(buf, src, pwidth * sizeof(pixel));
+            src += pic.stride[i];
+            buf += pwidth;
+        }
+    }
+
+    poc[currentCursor] = pic.poc;
+    writeCount.incr();
+
+    return true;
+}
+
+void ReconPlay::threadMain()
+{
+    THREAD_NAME("ReconPlayOutput", 0);
+
+    do
+    {
+        /* extract the next output picture in display order and write to pipe */
+        if (!outputFrame())
+            break;
+    }
+    while (threadActive);
+
+    threadActive = false;
+    readCount.poke();
+}
+
+bool ReconPlay::outputFrame()
+{
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = read % RECON_BUF_SIZE;
+
+    while (poc[currentCursor] != read)
+    {
+        written = writeCount.waitForChange(written);
+        if (!threadActive)
+            return false;
+    }
+
+    char* buf = (char*)frameData[currentCursor];
+    intptr_t remainSize = frameSize * sizeof(pixel);
+
+    fprintf(outputPipe, "FRAME\n");
+    while (remainSize > 0)
+    {
+        intptr_t retCount = (intptr_t)fwrite(buf, sizeof(char), remainSize, outputPipe);
+
+        if (retCount < 0 || !pipeValid)
+            /* pipe failure, stop writing and start dropping recon pictures */
+            return false;
+
+        buf += retCount;
+        remainSize -= retCount;
+    }
+
+    poc[currentCursor] = -1;
+    readCount.incr();
+    return true;
+}
x265_1.7.tar.gz/source/output/reconplay.h Added
@@ -0,0 +1,74 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_RECONPLAY_H
+#define X265_RECONPLAY_H
+
+#include "x265.h"
+#include "threading.h"
+#include <cstdio>
+
+namespace x265 {
+// private x265 namespace
+
+class ReconPlay : public Thread
+{
+public:
+
+    ReconPlay(const char* commandLine, x265_param& param);
+
+    virtual ~ReconPlay();
+
+    bool writePicture(const x265_picture& pic);
+
+    static bool pipeValid;
+
+protected:
+
+    enum { RECON_BUF_SIZE = 40 };
+
+    FILE*  outputPipe;     /* The output pipe for player */
+    size_t frameSize;      /* size of one frame in pixels */
+    bool   threadActive;   /* worker thread is active */
+    int    width;          /* width of frame */
+    int    height;         /* height of frame */
+    int    colorSpace;     /* color space of frame */
+
+    int    poc[RECON_BUF_SIZE];
+    pixel* frameData[RECON_BUF_SIZE];
+
+    /* Note that the class uses read and write counters to signal that reads and
+     * writes have occurred in the ring buffer, but writes into the buffer
+     * happen in decode order and the reader must check that the POC it next
+     * needs to send to the pipe is in fact present.  The counters are used to
+     * prevent the writer from getting too far ahead of the reader */
+    ThreadSafeInteger readCount;
+    ThreadSafeInteger writeCount;
+
+    void threadMain();
+    bool outputFrame();
+};
+}
+
+#endif // ifndef X265_RECONPLAY_H
x265_1.6.tar.gz/source/output/y4m.h -> x265_1.7.tar.gz/source/output/y4m.h Changed
@@ -30,7 +30,7 @@
 namespace x265 {
 // private x265 namespace
 
-class Y4MOutput : public Output
+class Y4MOutput : public ReconFile
 {
 protected:
 
x265_1.6.tar.gz/source/output/yuv.h -> x265_1.7.tar.gz/source/output/yuv.h Changed
@@ -32,7 +32,7 @@
 namespace x265 {
 // private x265 namespace
 
-class YUVOutput : public Output
+class YUVOutput : public ReconFile
 {
protected:
 
x265_1.6.tar.gz/source/test/ipfilterharness.cpp -> x265_1.7.tar.gz/source/test/ipfilterharness.cpp Changed
@@ -61,55 +61,6 @@
     }
 }
 
-bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp)
-{
-    intptr_t rand_srcStride;
-    int min_size = isChroma ? 2 : 4;
-    int max_size = isChroma ? (MAX_CU_SIZE >> 1) : MAX_CU_SIZE;
-
-    if (isChroma && (csp == X265_CSP_I444))
-    {
-        min_size = 4;
-        max_size = MAX_CU_SIZE;
-    }
-
-    for (int i = 0; i < ITERS; i++)
-    {
-        int index = i % TEST_CASES;
-        int rand_height = (int16_t)rand() % 100;
-        int rand_width = (int16_t)rand() % 100;
-
-        rand_srcStride = rand_width + rand() % 100;
-        if (rand_srcStride < rand_width)
-            rand_srcStride = rand_width;
-
-        rand_width &= ~(min_size - 1);
-        rand_width = x265_clip3(min_size, max_size, rand_width);
-
-        rand_height &= ~(min_size - 1);
-        rand_height = x265_clip3(min_size, max_size, rand_height);
-
-        ref(pixel_test_buff[index],
-            rand_srcStride,
-            IPF_C_output_s,
-            rand_width,
-            rand_height);
-
-        checked(opt, pixel_test_buff[index],
-                rand_srcStride,
-                IPF_vec_output_s,
-                rand_width,
-                rand_height);
-
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
-            return false;
-
-        reportfail();
-    }
-
-    return true;
-}
-
 bool IPFilterHarness::check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt)
 {
     intptr_t rand_srcStride, rand_dstStride;
@@ -518,12 +469,13 @@
     {
         intptr_t rand_srcStride = rand() % 100;
         int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;
 
-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);
 
-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);
 
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
             return false;
 
         reportfail();
@@ -538,12 +490,13 @@
     {
         intptr_t rand_srcStride = rand() % 100;
         int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;
 
-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);
 
-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);
 
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
             return false;
 
         reportfail();
@@ -554,15 +507,6 @@
 
 bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
-    if (opt.luma_p2s)
-    {
-        // last parameter does not matter in case of luma
-        if (!check_IPFilter_primitive(ref.luma_p2s, opt.luma_p2s, 0, 1))
-        {
-            printf("luma_p2s failed\n");
-            return false;
-        }
-    }
 
     for (int value = 0; value < NUM_PU_SIZES; value++)
     {
@@ -622,11 +566,11 @@
                 return false;
             }
         }
-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
         {
-            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s))
+            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s))
             {
-                printf("filter_p2s[%s]", lumaPartStr[value]);
+                printf("convert_p2s[%s]", lumaPartStr[value]);
                 return false;
             }
         }
@@ -634,14 +578,6 @@
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
     {
-        if (opt.chroma[csp].p2s)
-        {
-            if (!check_IPFilter_primitive(ref.chroma[csp].p2s, opt.chroma[csp].p2s, 1, csp))
-            {
-                printf("chroma_p2s[%s]", x265_source_csp_names[csp]);
-                return false;
-            }
-        }
         for (int value = 0; value < NUM_PU_SIZES; value++)
         {
             if (opt.chroma[csp].pu[value].filter_hpp)
@@ -692,9 +628,9 @@
                     return false;
                 }
             }
-            if (opt.chroma[csp].pu[value].chroma_p2s)
+            if (opt.chroma[csp].pu[value].p2s)
             {
-                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s))
+                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s))
                 {
                     printf("chroma_p2s[%s]", chromaPartStr[csp][value]);
                     return false;
@@ -708,19 +644,10 @@
 
 void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
-    int height = 64;
-    int width = 64;
     int16_t srcStride = 96;
     int16_t dstStride = 96;
     int maxVerticalfilterHalfDistance = 3;
 
-    if (opt.luma_p2s)
-    {
-        printf("luma_p2s\t");
-        REPORT_SPEEDUP(opt.luma_p2s, ref.luma_p2s,
-                       pixel_buff, srcStride, IPF_vec_output_s, width, height);
-    }
-
     for (int value = 0; value < NUM_PU_SIZES; value++)
     {
         if (opt.pu[value].luma_hpp)
@@ -777,23 +704,18 @@
                            pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
         }
 
-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
         {
-            printf("filter_p2s [%s]\t", lumaPartStr[value]);
-            REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s);
+            printf("convert_p2s[%s]\t", lumaPartStr[value]);
+            REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s,
+                               pixel_buff, srcStride,
+                               IPF_vec_output_s, dstStride);
         }
     }
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
     {
         printf("= Color Space %s =\n", x265_source_csp_names[csp]);
-        if (opt.chroma[csp].p2s)
-        {
-            printf("chroma_p2s\t");
-            REPORT_SPEEDUP(opt.chroma[csp].p2s, ref.chroma[csp].p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s, width, height);
-        }
         for (int value = 0; value < NUM_PU_SIZES; value++)
         {
             if (opt.chroma[csp].pu[value].filter_hpp)
@@ -836,13 +758,11 @@
                                short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
                                IPF_vec_output_s, dstStride, 1);
             }
-
-            if (opt.chroma[csp].pu[value].chroma_p2s)
+            if (opt.chroma[csp].pu[value].p2s)
             {
                 printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]);
-                REPORT_SPEEDUP(opt.chroma[csp].pu[value].chroma_p2s, ref.chroma[csp].pu[value].chroma_p2s,
-                               pixel_buff, srcStride,
-                               IPF_vec_output_s);
+                REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s, ref.chroma[csp].pu[value].p2s,
+                               pixel_buff, srcStride, IPF_vec_output_s, dstStride);
             }
         }
     }
x265_1.6.tar.gz/source/test/ipfilterharness.h -> x265_1.7.tar.gz/source/test/ipfilterharness.h Changed
@@ -50,7 +50,6 @@
     pixel   pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
     int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
 
-    bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp);
     bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
     bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
     bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt);
x265_1.6.tar.gz/source/test/pixelharness.cpp -> x265_1.7.tar.gz/source/test/pixelharness.cpp Changed
@@ -666,7 +666,32 @@
     return true;
 }
 
-bool PixelHarness::check_scale_pp(scale_t ref, scale_t opt)
+bool PixelHarness::check_scale1D_pp(scale1D_t ref, scale1D_t opt)
+{
+    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
+    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0, sizeof(ref_dest));
+    memset(opt_dest, 0, sizeof(opt_dest));
+
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
+    {
+        int index = i % TEST_CASES;
+        checked(opt, opt_dest, pixel_test_buff[index] + j);
+        ref(ref_dest, pixel_test_buff[index] + j);
+
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_scale2D_pp(scale2D_t ref, scale2D_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -845,8 +870,8 @@
 
 bool PixelHarness::check_calSign(sign_t ref, sign_t opt)
 {
-    ALIGN_VAR_16(int8_t, ref_dest[64 * 64]);
-    ALIGN_VAR_16(int8_t, opt_dest[64 * 64]);
+    ALIGN_VAR_16(int8_t, ref_dest[64 * 2]);
+    ALIGN_VAR_16(int8_t, opt_dest[64 * 2]);
 
     memset(ref_dest, 0xCD, sizeof(ref_dest));
     memset(opt_dest, 0xCD, sizeof(opt_dest));
@@ -855,12 +880,12 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int width = 16 * (rand() % 4 + 1);
+        int width = (rand() % 64) + 1;
 
         ref(ref_dest, pbuf2 + j, pbuf3 + j, width);
         checked(opt, opt_dest, pbuf2 + j, pbuf3 + j, width);
 
-        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int8_t)))
+        if (memcmp(ref_dest, opt_dest, sizeof(ref_dest)))
            return false;
 
         reportfail();
@@ -883,12 +908,10 @@
     for (int i = 0; i < ITERS; i++)
     {
         int width = 16 * (rand() % 4 + 1);
-        int8_t sign = rand() % 3;
-        if (sign == 2)
-            sign = -1;
+        int stride = width + 1;
 
-        ref(ref_dest, psbuf1 + j, width, sign);
-        checked(opt, opt_dest, psbuf1 + j, width, sign);
+        ref(ref_dest, psbuf1 + j, width, psbuf2 + j, stride);
+        checked(opt, opt_dest, psbuf1 + j, width, psbuf5 + j, stride);
 
         if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
             return false;
@@ -928,7 +951,43 @@
     return true;
 }
 
-bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt)
+bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref[2], saoCuOrgE2_t opt[2])
+{
+    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
+    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
+
+    for (int id = 0; id < 2; id++)
+    {
+        int j = 0;
+        if (opt[id])
+        {
+            for (int i = 0; i < ITERS; i++)
+            {
+                int width = 16 * (1 << (id * (rand() % 2 + 1))) - (rand() % 2);
+                int stride = width + 1;
+
+                ref[width > 16](ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
+                checked(opt[width > 16], opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
+
+                if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
+                    return false;
+
+                if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+                    return false;
+
+                reportfail();
+                j += INCR;
+            }
+        }
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -940,16 +999,14 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int width = 16 * (rand() % 4 + 1);
-        int stride = width + 1;
-
-        ref(ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
-        checked(opt, opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
+        int stride = 16 * (rand() % 4 + 1);
+        int start = rand() % 2;
+        int end = 16 - rand() % 2;
 
-        if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
-            return false;
+        ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
+        checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
 
-        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)) || memcmp(psbuf2, psbuf5, BUFFSIZE))
            return false;
 
         reportfail();
@@ -959,7 +1016,7 @@
     return true;
 }
 
-bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
+bool PixelHarness::check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -971,9 +1028,9 @@
 
     for (int i = 0; i < ITERS; i++)
    {
-        int stride = 16 * (rand() % 4 + 1);
+        int stride = 32 * (rand() % 2 + 1);
         int start = rand() % 2;
-        int end = (16 * (rand() % 4 + 1)) - rand() % 2;
+        int end = (32 * (rand() % 2 + 1)) - rand() % 2;
 
         ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
         checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
@@ -995,9 +1052,8 @@
 
     memset(ref_dest, 0xCD, sizeof(ref_dest));
     memset(opt_dest, 0xCD, sizeof(opt_dest));
-
-    int width = 16 + rand() % 48;
-    int height = 16 + rand() % 48;
+    int width = 32 + rand() % 32;
+    int height = 32 + rand() % 32;
     intptr_t srcStride = 64;
     intptr_t dstStride = width;
     int j = 0;
@@ -1133,8 +1189,8 @@
     for (int i = 0; i < ITERS; i++)
     {
         int width = 16 * (rand() % 4 + 1);
-        int height = rand() % 64 +1;
-        int stride = rand() % 65;
+        int height = rand() % 63 + 2;
+        int stride = width;
 
         ref(ref_dest, psbuf1 + j, width, height, stride);
         checked(opt, opt_dest, psbuf1 + j, width, height, stride);
@@ -1149,7 +1205,7 @@
     return true;
 }
 
-bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt)
+bool PixelHarness::check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt)
 {
     ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
     uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM];      // value range[0, 16]
@@ -1160,6 +1216,14 @@
     for (int i = 0; i < 32 * 32; i++)
     {
         ref_src[i] = rand() & SHORT_MAX;
+
+        // more zero coeff
+        if (ref_src[i] < SHORT_MAX * 2 / 3)
+            ref_src[i] = 0;
+
+        // more negative
+        if ((rand() % 10) < 8)
+            ref_src[i] *= -1;
         totalCoeffs += (ref_src[i] != 0);
     }
 
@@ -1187,10 +1251,19 @@
         for (int j = 0; j < 1 << (2 * (rand_scan_size + 2)); j++)
             rand_numCoeff += (ref_src[i + j] != 0);
 
+        // at least one coeff in transform block
+        if (rand_numCoeff == 0)
+        {
+            ref_src[i + (1 << (2 * (rand_scan_size + 2))) - 1] = -1;
+            rand_numCoeff = 1;
+        }
+
+        const int trSize = (1 << (rand_scan_size + 2));
         const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size];
+        const uint16_t* const scanTblCG4x4 = g_scan4x4[rand_scan_size <= (MDCS_LOG2_MAX_SIZE - 2) ? rand_scan_type : SCAN_DIAG];
 
-        int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff);
-        int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff);
+        int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff, scanTblCG4x4, trSize);
+        int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff, scanTblCG4x4, trSize);
 
         if (ref_scanPos != opt_scanPos)
             return false;
@@ -1209,6 +1282,56 @@
             rand_numCoeff -= ref_coeffNum[j];
         }
 
+        if (rand_numCoeff != 0)
+            return false;
+
+        reportfail();
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt)
+{
+    ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
+
+    for (int i = 0; i < 32 * 32; i++)
+    {
+        ref_src[i] = rand() & SHORT_MAX;
+    }
+
+    // extra test area all of 0x1234
+    for (int i = 0; i < ITERS * 2; i++)
+    {
+        ref_src[32 * 32 + i] = 0x1234;
+    }
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        int rand_scan_type = rand() % NUM_SCAN_TYPE;
+        int rand_scan_size = (rand() % NUM_SCAN_SIZE) + 2;
+        coeff_t *rand_src = ref_src + i;
+
+        const uint16_t* const scanTbl = g_scan4x4[rand_scan_type];
+
+        int j;
+        for (j = 0; j < SCAN_SET_SIZE; j++)
+        {
+            const uint32_t idxY = j / MLS_CG_SIZE;
+            const uint32_t idxX = j % MLS_CG_SIZE;
+            if (rand_src[idxY * rand_scan_size + idxX]) break;
+        }
+
+        // fill one coeff when all coeff group are zero
+        if (j >= SCAN_SET_SIZE)
+            rand_src[0] = 0x0BAD;
+
+        uint32_t ref_scanPos = ref(rand_src, (1 << rand_scan_size), scanTbl);
+        uint32_t opt_scanPos = (int)checked(opt, rand_src, (1 << rand_scan_size), scanTbl);
+
+        if (ref_scanPos != opt_scanPos)
+            return false;
+
         reportfail();
     }
 
@@ -1414,6 +1537,14 @@
                     return false;
                 }
             }
+            if (opt.chroma[i].cu[part].sa8d)
+            {
+                if (!check_pixelcmp(ref.chroma[i].cu[part].sa8d, opt.chroma[i].cu[part].sa8d))
+                {
+                    printf("chroma_sa8d[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]);
+                    return false;
+                }
+            }
         }
     }
 
@@ -1603,7 +1734,7 @@
 
     if (opt.scale1D_128to64)
     {
-        if (!check_scale_pp(ref.scale1D_128to64, opt.scale1D_128to64))
+        if (!check_scale1D_pp(ref.scale1D_128to64, opt.scale1D_128to64))
         {
             printf("scale1D_128to64 failed!\n");
             return false;
@@ -1612,7 +1743,7 @@
 
     if (opt.scale2D_64to32)
     {
-        if (!check_scale_pp(ref.scale2D_64to32, opt.scale2D_64to32))
+        if (!check_scale2D_pp(ref.scale2D_64to32, opt.scale2D_64to32))
         {
             printf("scale2D_64to32 failed!\n");
             return false;
@@ -1664,20 +1795,41 @@
         }
     }
 
-    if (opt.saoCuOrgE2)
+    if (opt.saoCuOrgE1_2Rows)
+    {
+        if (!check_saoCuOrgE1_t(ref.saoCuOrgE1_2Rows, opt.saoCuOrgE1_2Rows))
+        {
+            printf("SAO_EO_1_2Rows failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuOrgE2[0] || opt.saoCuOrgE2[1])
+    {
+        saoCuOrgE2_t ref1[] = { ref.saoCuOrgE2[0], ref.saoCuOrgE2[1] };
+        saoCuOrgE2_t opt1[] = { opt.saoCuOrgE2[0], opt.saoCuOrgE2[1] };
+
+        if (!check_saoCuOrgE2_t(ref1, opt1))
+        {
+            printf("SAO_EO_2[0] && SAO_EO_2[1] failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuOrgE3[0])
     {
-        if (!check_saoCuOrgE2_t(ref.saoCuOrgE2, opt.saoCuOrgE2))
+        if (!check_saoCuOrgE3_t(ref.saoCuOrgE3[0], opt.saoCuOrgE3[0]))
         {
-            printf("SAO_EO_2 failed\n");
+            printf("SAO_EO_3[0] failed\n");
             return false;
         }
     }
 
-    if (opt.saoCuOrgE3)
+    if (opt.saoCuOrgE3[1])
     {
-        if (!check_saoCuOrgE3_t(ref.saoCuOrgE3, opt.saoCuOrgE3))
+        if (!check_saoCuOrgE3_32_t(ref.saoCuOrgE3[1], opt.saoCuOrgE3[1]))
         {
-            printf("SAO_EO_3 failed\n");
+            printf("SAO_EO_3[1] failed\n");
            return false;
         }
     }
@@ -1718,11 +1870,20 @@
         }
     }
 
-    if (opt.findPosLast)
+    if (opt.scanPosLast)
    {
-        if (!check_findPosLast(ref.findPosLast, opt.findPosLast))
+        if (!check_scanPosLast(ref.scanPosLast, opt.scanPosLast))
        {
-            printf("findPosLast failed!\n");
+            printf("scanPosLast failed!\n");
+            return false;
+        }
+    }
+
+    if (opt.findPosFirstLast)
390
+    {
391
+        if (!check_findPosFirstLast(ref.findPosFirstLast, opt.findPosFirstLast))
392
+        {
393
+            printf("findPosFirstLast failed!\n");
394
             return false;
395
         }
396
     }
397
@@ -1863,6 +2024,11 @@
398
                 HEADER("[%s]  add_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
399
                 REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps, ref.chroma[i].cu[part].add_ps, pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE);
400
             }
401
+            if (opt.chroma[i].cu[part].sa8d)
402
+            {
403
+                HEADER("[%s] sa8d[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
404
+                REPORT_SPEEDUP(opt.chroma[i].cu[part].sa8d, ref.chroma[i].cu[part].sa8d, pbuf1, STRIDE, pbuf2, STRIDE);
405
+            }
406
         }
407
     }
408
 
409
@@ -2003,7 +2169,7 @@
410
     if (opt.scale1D_128to64)
411
     {
412
         HEADER0("scale1D_128to64");
413
-        REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1, 64);
414
+        REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1);
415
     }
416
 
417
     if (opt.scale2D_64to32)
418
@@ -2033,7 +2199,7 @@
419
     if (opt.saoCuOrgE0)
420
     {
421
         HEADER0("SAO_EO_0");
422
-        REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, 1);
423
+        REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, psbuf2, 64);
424
     }
425
 
426
     if (opt.saoCuOrgE1)
427
@@ -2042,16 +2208,34 @@
428
         REPORT_SPEEDUP(opt.saoCuOrgE1, ref.saoCuOrgE1, pbuf1, psbuf2, psbuf1, 64, 64);
429
     }
430
 
431
-    if (opt.saoCuOrgE2)
432
+    if (opt.saoCuOrgE1_2Rows)
433
     {
434
-        HEADER0("SAO_EO_2");
435
-        REPORT_SPEEDUP(opt.saoCuOrgE2, ref.saoCuOrgE2, pbuf1, psbuf1, psbuf2, psbuf3, 64, 64);
436
+        HEADER0("SAO_EO_1_2Rows");
437
+        REPORT_SPEEDUP(opt.saoCuOrgE1_2Rows, ref.saoCuOrgE1_2Rows, pbuf1, psbuf2, psbuf1, 64, 64);
438
     }
439
 
440
-    if (opt.saoCuOrgE3)
441
+    if (opt.saoCuOrgE2[0])
442
     {
443
-        HEADER0("SAO_EO_3");
444
-        REPORT_SPEEDUP(opt.saoCuOrgE3, ref.saoCuOrgE3, pbuf1, psbuf2, psbuf1, 64, 0, 64);
445
+        HEADER0("SAO_EO_2[0]");
446
+        REPORT_SPEEDUP(opt.saoCuOrgE2[0], ref.saoCuOrgE2[0], pbuf1, psbuf1, psbuf2, psbuf3, 16, 64);
447
+    }
448
+
449
+    if (opt.saoCuOrgE2[1])
450
+    {
451
+        HEADER0("SAO_EO_2[1]");
452
+        REPORT_SPEEDUP(opt.saoCuOrgE2[1], ref.saoCuOrgE2[1], pbuf1, psbuf1, psbuf2, psbuf3, 64, 64);
453
+    }
454
+
455
+    if (opt.saoCuOrgE3[0])
456
+    {
457
+        HEADER0("SAO_EO_3[0]");
458
+        REPORT_SPEEDUP(opt.saoCuOrgE3[0], ref.saoCuOrgE3[0], pbuf1, psbuf2, psbuf1, 64, 0, 16);
459
+    }
460
+
461
+    if (opt.saoCuOrgE3[1])
462
+    {
463
+        HEADER0("SAO_EO_3[1]");
464
+        REPORT_SPEEDUP(opt.saoCuOrgE3[1], ref.saoCuOrgE3[1], pbuf1, psbuf2, psbuf1, 64, 0, 64);
465
     }
466
 
467
     if (opt.saoCuOrgB0)
468
@@ -2078,12 +2262,25 @@
469
         REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80);
470
     }
471
 
472
-    if (opt.findPosLast)
473
+    if (opt.scanPosLast)
474
     {
475
-        HEADER0("findPosLast");
476
+        HEADER0("scanPosLast");
477
         coeff_t coefBuf[32 * 32];
478
         memset(coefBuf, 0, sizeof(coefBuf));
479
         memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t));
480
-        REPORT_SPEEDUP(opt.findPosLast, ref.findPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32);
481
+        REPORT_SPEEDUP(opt.scanPosLast, ref.scanPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32, g_scan4x4[SCAN_DIAG], 32);
482
+    }
483
+
484
+    if (opt.findPosFirstLast)
485
+    {
486
+        HEADER0("findPosFirstLast");
487
+        coeff_t coefBuf[32 * MLS_CG_SIZE];
488
+        memset(coefBuf, 0, sizeof(coefBuf));
489
+        // every CG can't be all zeros!
490
+        coefBuf[3 + 0 * 32] = 0x0BAD;
491
+        coefBuf[3 + 1 * 32] = 0x0BAD;
492
+        coefBuf[3 + 2 * 32] = 0x0BAD;
493
+        coefBuf[3 + 3 * 32] = 0x0BAD;
494
+        REPORT_SPEEDUP(opt.findPosFirstLast, ref.findPosFirstLast, coefBuf, 32, g_scan4x4[SCAN_DIAG]);
495
     }
496
 }
497
x265_1.6.tar.gz/source/test/pixelharness.h -> x265_1.7.tar.gz/source/test/pixelharness.h Changed
 
@@ -76,7 +76,8 @@
     bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt);
     bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt);
     bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt);
-    bool check_scale_pp(scale_t ref, scale_t opt);
+    bool check_scale1D_pp(scale1D_t ref, scale1D_t opt);
+    bool check_scale2D_pp(scale2D_t ref, scale2D_t opt);
     bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
     bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt);
     bool check_calresidual(calcresidual_t ref, calcresidual_t opt);
@@ -95,8 +96,9 @@
     bool check_addAvg(addAvg_t, addAvg_t);
     bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt);
     bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt);
-    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt);
+    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
     bool check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
+    bool check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
     bool check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt);
     bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
     bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
@@ -104,7 +106,8 @@
     bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
     bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
     bool check_calSign(sign_t ref, sign_t opt);
-    bool check_findPosLast(findPosLast_t ref, findPosLast_t opt);
+    bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
+    bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);
 
 public:
 
x265_1.6.tar.gz/source/test/rate-control-tests.txt -> x265_1.7.tar.gz/source/test/rate-control-tests.txt Changed
 
@@ -1,34 +1,36 @@
-# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
-
-# This test is listed first since it currently reproduces bugs
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
-
-# VBV tests, non-deterministic so testing for correctness and bitrate
-# fluctuations - up to 1% bitrate fluctuation is allowed between runs
-RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
-112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
-112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
-112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
-112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
-112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
-big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
-
-# multi-pass rate control tests
-big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
-112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
-112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+#These tests should yeild deterministic results
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+fire_1920x1080_30.yuv, --preset slow --bitrate 2000 --tune zero-latency 
+
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+night_cars_1920x1080_30.yuv,--preset medium --crf 25 --vbv-bufsize 5000 --vbv-maxrate 5000 -F6 --crf-max 34 --crf-min 22
+ducks_take_off_420_720p50.y4m,--preset slow --bitrate 1600 --vbv-bufsize 1600 --vbv-maxrate 1600 --strict-cbr --aq-mode 2 --aq-strength 0.5
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryslow --bitrate 4000 --vbv-bufsize 3000 --vbv-maxrate 4000 --tune grain
+fire_1920x1080_30.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud --pmode --tune ssim
+112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
+Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
+Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000  --tune grain
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
+sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd  --crf-max 30
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
+
+
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
+112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
+pine_tree_1920x1080_30.yuv,--preset veryfast --crf 12 --pass 1 -F4,--preset faster --bitrate 4000 --pass 2 -F4
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafast --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 8000 --strict-cbr -F4 --pass 1, --tune grain --preset ultrafast --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 8000 -F4 --pass 2
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000  --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500  --vbv-bufsize 700 --pass 2 -F4
+
x265_1.6.tar.gz/source/test/regression-tests.txt -> x265_1.7.tar.gz/source/test/regression-tests.txt Changed
 
@@ -12,9 +12,9 @@
 # not auto-detected.
 
 BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 32
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
-BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
 BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
 BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
@@ -29,7 +29,7 @@
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
@@ -37,8 +37,8 @@
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
@@ -51,11 +51,11 @@
 Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
 KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
 KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
-KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16
 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
-News-4k.y4m,--preset medium --tune ssim --no-sao
+News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 32
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
@@ -108,13 +108,13 @@
 parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
 silent_cif_420.y4m,--preset medium --me full --rect --amp
 silent_cif_420.y4m,--preset superfast --weightp --rect
-silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
-vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
 washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
 washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
-washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32
 washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
 washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
 washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
x265_1.6.tar.gz/source/test/smoke-tests.txt -> x265_1.7.tar.gz/source/test/smoke-tests.txt Changed
 
@@ -1,14 +1,18 @@
 # List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness
 
+# consider VBV tests a failure if new bitrate is more than 5% different
+# from the old bitrate
+# vbv-tolerance = 0.05
+
 big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
 big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
-washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme --qg-size 16
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
 washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
 washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
 old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
 old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
-old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
 RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
x265_1.6.tar.gz/source/test/testbench.cpp -> x265_1.7.tar.gz/source/test/testbench.cpp Changed
 
@@ -168,6 +168,7 @@
         { "AVX", X265_CPU_AVX },
         { "XOP", X265_CPU_XOP },
         { "AVX2", X265_CPU_AVX2 },
+        { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
         { "", 0 },
     };
 
x265_1.6.tar.gz/source/x265.cpp -> x265_1.7.tar.gz/source/x265.cpp Changed
 
@@ -27,6 +27,7 @@
 
 #include "input/input.h"
 #include "output/output.h"
+#include "output/reconplay.h"
 #include "filters/filters.h"
 #include "common.h"
 #include "param.h"
@@ -46,12 +47,16 @@
 #include <string>
 #include <ostream>
 #include <fstream>
+#include <queue>
 
+#define CONSOLE_TITLE_SIZE 200
 #ifdef _WIN32
 #include <windows.h>
+static char orgConsoleTitle[CONSOLE_TITLE_SIZE] = "";
 #else
 #define GetConsoleTitle(t, n)
 #define SetConsoleTitle(t)
+#define SetThreadExecutionState(es)
 #endif
 
 using namespace x265;
@@ -65,33 +70,34 @@
 
 struct CLIOptions
 {
-    Input*  input;
-    Output* recon;
-    std::fstream bitstreamFile;
+    InputFile* input;
+    ReconFile* recon;
+    OutputFile* output;
+    FILE*       qpfile;
+    const char* reconPlayCmd;
+    const x265_api* api;
+    x265_param* param;
     bool bProgress;
     bool bForceY4m;
    bool bDither;
-
     uint32_t seek;              // number of frames to skip from the beginning
     uint32_t framesToBeEncoded; // number of frames to encode
     uint64_t totalbytes;
-    size_t   analysisRecordSize; // number of bytes read from or dumped into file
-    int      analysisHeaderSize;
-
     int64_t startTime;
     int64_t prevUpdateTime;
-    float   frameRate;
-    FILE*   qpfile;
-    FILE*   analysisFile;
 
     /* in microseconds */
     static const int UPDATE_INTERVAL = 250000;
 
     CLIOptions()
     {
-        frameRate = 0.f;
         input = NULL;
         recon = NULL;
+        output = NULL;
+        qpfile = NULL;
+        reconPlayCmd = NULL;
+        api = NULL;
+        param = NULL;
         framesToBeEncoded = seek = 0;
         totalbytes = 0;
         bProgress = true;
@@ -99,18 +105,12 @@
         startTime = x265_mdate();
         prevUpdateTime = 0;
         bDither = false;
-        qpfile = NULL;
-        analysisFile = NULL;
-        analysisRecordSize = 0;
-        analysisHeaderSize = 0;
     }
 
     void destroy();
-    void writeNALs(const x265_nal* nal, uint32_t nalcount);
-    void printStatus(uint32_t frameNum, x265_param *param);
-    bool parse(int argc, char **argv, x265_param* param);
+    void printStatus(uint32_t frameNum);
+    bool parse(int argc, char **argv);
     bool parseQPFile(x265_picture &pic_org);
-    bool validateFanout(x265_param*);
 };
 
 void CLIOptions::destroy()
@@ -124,23 +124,12 @@
     if (qpfile)
         fclose(qpfile);
     qpfile = NULL;
-    if (analysisFile)
-        fclose(analysisFile);
-    analysisFile = NULL;
+    if (output)
+        output->release();
+    output = NULL;
 }
 
-void CLIOptions::writeNALs(const x265_nal* nal, uint32_t nalcount)
-{
-    ProfileScopeEvent(bitstreamWrite);
-    for (uint32_t i = 0; i < nalcount; i++)
-    {
-        bitstreamFile.write((const char*)nal->payload, nal->sizeBytes);
-        totalbytes += nal->sizeBytes;
-        nal++;
-    }
-}
-
-void CLIOptions::printStatus(uint32_t frameNum, x265_param *param)
+void CLIOptions::printStatus(uint32_t frameNum)
 {
     char buf[200];
     int64_t time = x265_mdate();
@@ -167,15 +156,16 @@
     prevUpdateTime = time;
 }
 
-bool CLIOptions::parse(int argc, char **argv, x265_param* param)
+bool CLIOptions::parse(int argc, char **argv)
 {
     bool bError = 0;
     int help = 0;
     int inputBitDepth = 8;
+    int outputBitDepth = 0;
     int reconFileBitDepth = 0;
     const char *inputfn = NULL;
     const char *reconfn = NULL;
-    const char *bitstreamfn = NULL;
+    const char *outputfn = NULL;
     const char *preset = NULL;
     const char *tune = NULL;
     const char *profile = NULL;
@@ -192,15 +182,31 @@
         int c = getopt_long(argc, argv, short_options, long_options, NULL);
         if (c == -1)
             break;
-        if (c == 'p')
+        else if (c == 'p')
             preset = optarg;
-        if (c == 't')
+        else if (c == 't')
             tune = optarg;
+        else if (c == 'D')
+            outputBitDepth = atoi(optarg);
        else if (c == '?')
             showHelp(param);
     }
 
-    if (x265_param_default_preset(param, preset, tune) < 0)
+    api = x265_api_get(outputBitDepth);
+    if (!api)
+    {
+        x265_log(NULL, X265_LOG_WARNING, "falling back to default bit-depth\n");
+        api = x265_api_get(0);
+    }
+
+    param = api->param_alloc();
+    if (!param)
+    {
+        x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n");
+        return true;
+    }
+
+    if (api->param_default_preset(param, preset, tune) < 0)
     {
         x265_log(NULL, X265_LOG_ERROR, "preset or tune unrecognized\n");
         return true;
@@ -211,9 +217,7 @@
         int long_options_index = -1;
         int c = getopt_long(argc, argv, short_options, long_options, &long_options_index);
         if (c == -1)
-        {
             break;
-        }
 
         switch (c)
         {
@@ -261,7 +265,7 @@
             OPT2("frame-skip", "seek") this->seek = (uint32_t)x265_atoi(optarg, bError);
             OPT("frames") this->framesToBeEncoded = (uint32_t)x265_atoi(optarg, bError);
             OPT("no-progress") this->bProgress = false;
-            OPT("output") bitstreamfn = optarg;
+            OPT("output") outputfn = optarg;
             OPT("input") inputfn = optarg;
             OPT("recon") reconfn = optarg;
             OPT("input-depth") inputBitDepth = (uint32_t)x265_atoi(optarg, bError);
@@ -271,17 +275,19 @@
             OPT("profile") profile = optarg; /* handled last */
             OPT("preset") /* handled above */;
             OPT("tune")   /* handled above */;
+            OPT("output-depth")   /* handled above */;
+            OPT("recon-y4m-exec") reconPlayCmd = optarg;
             OPT("qpfile")
             {
                 this->qpfile = fopen(optarg, "rb");
                 if (!this->qpfile)
                 {
-                    x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file \n", optarg);
+                    x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg);
                     return false;
                 }
             }
             else
-                bError |= !!x265_param_parse(param, long_options[long_options_index].name, optarg);
+                bError |= !!api->param_parse(param, long_options[long_options_index].name, optarg);
 
             if (bError)
             {
@@ -295,8 +301,8 @@
 
     if (optind < argc && !inputfn)
         inputfn = argv[optind++];
-    if (optind < argc && !bitstreamfn)
-        bitstreamfn = argv[optind++];
+    if (optind < argc && !outputfn)
+        outputfn = argv[optind++];
     if (optind < argc)
     {
         x265_log(param, X265_LOG_WARNING, "extra unused command arguments given <%s>\n", argv[optind]);
@@ -306,15 +312,15 @@
     if (argc <= 1 || help)
         showHelp(param);
 
-    if (inputfn == NULL || bitstreamfn == NULL)
+    if (inputfn == NULL || outputfn == NULL)
     {
         x265_log(param, X265_LOG_ERROR, "input or output file not specified, try -V for help\n");
         return true;
     }
 
-    if (param->internalBitDepth != x265_max_bit_depth)
+    if (param->internalBitDepth != api->max_bit_depth)
     {
-        x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", x265_max_bit_depth);
+        x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", api->max_bit_depth);
         return true;
     }
 
@@ -332,7 +338,7 @@
     info.frameCount = 0;
     getParamAspectRatio(param, info.sarWidth, info.sarHeight);
 
-    this->input = Input::open(info, this->bForceY4m);
+    this->input = InputFile::open(info, this->bForceY4m);
     if (!this->input || this->input->isFail())
     {
         x265_log(param, X265_LOG_ERROR, "unable to open input file <%s>\n", inputfn);
@@ -362,7 +368,11 @@
         this->framesToBeEncoded = info.frameCount - seek;
     param->totalFrames = this->framesToBeEncoded;
 
-    if (x265_param_apply_profile(param, profile))
+    /* Force CFR until we have support for VFR */
+    info.timebaseNum = param->fpsDenom;
+    info.timebaseDenom = param->fpsNum;
+
+    if (api->param_apply_profile(param, profile))
         return true;
 
     if (param->logLevel >= X265_LOG_INFO)
@@ -381,7 +391,7 @@
         else
             sprintf(buf + p, " frames %u - %d of %d", this->seek, this->seek + this->framesToBeEncoded - 1, info.frameCount);
 
-        fprintf(stderr, "%s  [info]: %s\n", input->getName(), buf);
+        general_log(param, input->getName(), X265_LOG_INFO, "%s\n", buf);
     }
 
     this->input->startReader();
@@ -390,26 +400,28 @@
     {
         if (reconFileBitDepth == 0)
             reconFileBitDepth = param->internalBitDepth;
-        this->recon = Output::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth,
-                                   param->fpsNum, param->fpsDenom, param->internalCsp);
+        this->recon = ReconFile::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth,
+                                      param->fpsNum, param->fpsDenom, param->internalCsp);
         if (this->recon->isFail())
         {
-            x265_log(param, X265_LOG_WARNING, "unable to write reconstruction file\n");
+            x265_log(param, X265_LOG_WARNING, "unable to write reconstructed outputs file\n");
             this->recon->release();
             this->recon = 0;
         }
         else
-            fprintf(stderr, "%s  [info]: reconstructed images %dx%d fps %d/%d %s\n", this->recon->getName(),
+            general_log(param, this->recon->getName(), X265_LOG_INFO,
+                    "reconstructed images %dx%d fps %d/%d %s\n",
                     param->sourceWidth, param->sourceHeight, param->fpsNum, param->fpsDenom,
                     x265_source_csp_names[param->internalCsp]);
     }
 
-    this->bitstreamFile.open(bitstreamfn, std::fstream::binary | std::fstream::out);
-    if (!this->bitstreamFile)
+    this->output = OutputFile::open(outputfn, info);
+    if (this->output->isFail())
     {
-        x265_log(NULL, X265_LOG_ERROR, "failed to open bitstream file <%s> for writing\n", bitstreamfn);
+        x265_log(param, X265_LOG_ERROR, "failed to open output file <%s> for writing\n", outputfn);
         return true;
     }
+    general_log(param, this->output->getName(), X265_LOG_INFO, "output file: %s\n", outputfn);
     return false;
 }
 
@@ -464,28 +476,45 @@
     PROFILE_INIT();
315
     THREAD_NAME("API", 0);
316
 
317
-    x265_param *param = x265_param_alloc();
318
+    GetConsoleTitle(orgConsoleTitle, CONSOLE_TITLE_SIZE);
319
+    SetThreadExecutionState(ES_CONTINUOUS | ES_SYSTEM_REQUIRED | ES_AWAYMODE_REQUIRED);
320
+
321
+    ReconPlay* reconPlay = NULL;
322
     CLIOptions cliopt;
323
 
324
-    if (cliopt.parse(argc, argv, param))
325
+    if (cliopt.parse(argc, argv))
326
     {
327
         cliopt.destroy();
328
-        x265_param_free(param);
329
+        if (cliopt.api)
330
+            cliopt.api->param_free(cliopt.param);
331
         exit(1);
332
     }
333
 
334
-    x265_encoder *encoder = x265_encoder_open(param);
335
+    x265_param* param = cliopt.param;
336
+    const x265_api* api = cliopt.api;
337
+
338
+    /* This allows muxers to modify bitstream format */
339
+    cliopt.output->setParam(param);
340
+
341
+    if (cliopt.reconPlayCmd)
342
+        reconPlay = new ReconPlay(cliopt.reconPlayCmd, *param);
343
+
344
+    /* note: we could try to acquire a different libx265 API here based on
345
+     * the profile found during option parsing, but it must be done before
346
+     * opening an encoder */
347
+
348
+    x265_encoder *encoder = api->encoder_open(param);
349
     if (!encoder)
350
     {
351
         x265_log(param, X265_LOG_ERROR, "failed to open encoder\n");
352
         cliopt.destroy();
353
-        x265_param_free(param);
354
-        x265_cleanup();
355
+        api->param_free(param);
356
+        api->cleanup();
357
         exit(2);
358
     }
359
 
360
     /* get the encoder parameters post-initialization */
361
-    x265_encoder_parameters(encoder, param);
362
+    api->encoder_parameters(encoder, param);
363
 
364
     /* Control-C handler */
365
     if (signal(SIGINT, sigint_handler) == SIG_ERR)
366
@@ -494,7 +523,8 @@
367
     x265_picture pic_orig, pic_out;
368
     x265_picture *pic_in = &pic_orig;
369
     /* Allocate recon picture if analysisMode is enabled */
370
-    x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode) ? &pic_out : NULL;
371
+    std::priority_queue<int64_t>* pts_queue = cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL;
372
+    x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode || pts_queue || reconPlay) ? &pic_out : NULL;
373
     uint32_t inFrameCount = 0;
374
     uint32_t outFrameCount = 0;
375
     x265_nal *p_nal;
376
@@ -505,17 +535,17 @@
377
 
378
     if (!param->bRepeatHeaders)
379
     {
380
-        if (x265_encoder_headers(encoder, &p_nal, &nal) < 0)
381
+        if (api->encoder_headers(encoder, &p_nal, &nal) < 0)
382
         {
383
             x265_log(param, X265_LOG_ERROR, "Failure generating stream headers\n");
384
             ret = 3;
385
             goto fail;
386
         }
387
         else
388
-            cliopt.writeNALs(p_nal, nal);
389
+            cliopt.totalbytes += cliopt.output->writeHeaders(p_nal, nal);
390
     }
391
 
392
-    x265_picture_init(param, pic_in);
393
+    api->picture_init(param, pic_in);
394
 
395
     if (cliopt.bDither)
396
     {
397
@@ -549,46 +579,72 @@
398
 
399
         if (pic_in)
400
         {
401
-            if (pic_in->bitDepth > X265_DEPTH && cliopt.bDither)
402
+            if (pic_in->bitDepth > param->internalBitDepth && cliopt.bDither)
403
             {
404
-                ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, X265_DEPTH);
405
-                pic_in->bitDepth = X265_DEPTH;
406
+                ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, param->internalBitDepth);
407
+                pic_in->bitDepth = param->internalBitDepth;
408
             }
409
+            /* Overwrite PTS */
410
+            pic_in->pts = pic_in->poc;
411
         }
412
 
413
-        int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon);
414
+        int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon);
415
         if (numEncoded < 0)
416
         {
417
             b_ctrl_c = 1;
418
             ret = 4;
419
             break;
420
         }
421
+
422
+        if (reconPlay && numEncoded)
423
+            reconPlay->writePicture(*pic_recon);
424
+
425
         outFrameCount += numEncoded;
426
 
427
         if (numEncoded && pic_recon && cliopt.recon)
428
             cliopt.recon->writePicture(pic_out);
429
         if (nal)
430
-            cliopt.writeNALs(p_nal, nal);
431
+        {
432
+            cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out);
433
+            if (pts_queue)
434
+            {
435
+                pts_queue->push(-pic_out.pts);
436
+                if (pts_queue->size() > 2)
437
+                    pts_queue->pop();
438
+            }
439
+        }
440
 
441
-        cliopt.printStatus(outFrameCount, param);
442
+        cliopt.printStatus(outFrameCount);
443
     }
444
 
445
     /* Flush the encoder */
446
     while (!b_ctrl_c)
447
     {
448
-        int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon);
449
+        int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon);
450
         if (numEncoded < 0)
451
         {
452
             ret = 4;
453
             break;
454
         }
455
+
456
+        if (reconPlay && numEncoded)
457
+            reconPlay->writePicture(*pic_recon);
458
+
459
         outFrameCount += numEncoded;
460
         if (numEncoded && pic_recon && cliopt.recon)
461
             cliopt.recon->writePicture(pic_out);
462
         if (nal)
463
-            cliopt.writeNALs(p_nal, nal);
464
+        {
465
+            cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out);
466
+            if (pts_queue)
467
+            {
468
+                pts_queue->push(-pic_out.pts);
469
+                if (pts_queue->size() > 2)
470
+                    pts_queue->pop();
471
+            }
472
+        }
473
 
474
-        cliopt.printStatus(outFrameCount, param);
475
+        cliopt.printStatus(outFrameCount);
476
 
477
         if (!numEncoded)
478
             break;
479
@@ -599,42 +655,62 @@
480
         fprintf(stderr, "%*s\r", 80, " ");
481
 
482
 fail:
483
-    x265_encoder_get_stats(encoder, &stats, sizeof(stats));
484
+
485
+    delete reconPlay;
486
+
487
+    api->encoder_get_stats(encoder, &stats, sizeof(stats));
488
     if (param->csvfn && !b_ctrl_c)
489
-        x265_encoder_log(encoder, argc, argv);
490
-    x265_encoder_close(encoder);
491
-    cliopt.bitstreamFile.close();
492
+        api->encoder_log(encoder, argc, argv);
493
+    api->encoder_close(encoder);
494
+
495
+    int64_t second_largest_pts = 0;
496
+    int64_t largest_pts = 0;
497
+    if (pts_queue && pts_queue->size() >= 2)
498
+    {
499
+        second_largest_pts = -pts_queue->top();
500
+        pts_queue->pop();
501
+        largest_pts = -pts_queue->top();
502
+        pts_queue->pop();
503
+        delete pts_queue;
504
+        pts_queue = NULL;
505
+    }
506
+    cliopt.output->closeFile(largest_pts, second_largest_pts);
507
 
 
     if (b_ctrl_c)
-        fprintf(stderr, "aborted at input frame %d, output frame %d\n",
-                cliopt.seek + inFrameCount, stats.encodedPictureCount);
+        general_log(param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d\n",
+                    cliopt.seek + inFrameCount, stats.encodedPictureCount);
 
     if (stats.encodedPictureCount)
     {
-        printf("\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount,
-               stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);
+        char buffer[4096];
+        int p = sprintf(buffer, "\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount,
+                        stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);
 
         if (param->bEnablePsnr)
-            printf(", Global PSNR: %.3f", stats.globalPsnr);
+            p += sprintf(buffer + p, ", Global PSNR: %.3f", stats.globalPsnr);
 
         if (param->bEnableSsim)
-            printf(", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim));
+            p += sprintf(buffer + p, ", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim));
 
-        printf("\n");
+        sprintf(buffer + p, "\n");
+        general_log(param, NULL, X265_LOG_INFO, buffer);
     }
     else
     {
-        printf("\nencoded 0 frames\n");
+        general_log(param, NULL, X265_LOG_INFO, "\nencoded 0 frames\n");
     }
 
-    x265_cleanup(); /* Free library singletons */
+    api->cleanup(); /* Free library singletons */
 
     cliopt.destroy();
 
-    x265_param_free(param);
+    api->param_free(param);
 
     X265_FREE(errorBuf);
 
+    SetConsoleTitle(orgConsoleTitle);
+    SetThreadExecutionState(ES_CONTINUOUS);
+
 #if HAVE_VLD
     assert(VLDReportLeaks() == 0);
 #endif
x265_1.6.tar.gz/source/x265.def.in -> x265_1.7.tar.gz/source/x265.def.in Changed
 
@@ -14,6 +14,7 @@
 x265_build_info_str
 x265_encoder_headers
 x265_encoder_parameters
+x265_encoder_reconfig
 x265_encoder_encode
 x265_encoder_get_stats
 x265_encoder_log
x265_1.6.tar.gz/source/x265.h -> x265_1.7.tar.gz/source/x265.h Changed
 
@@ -416,7 +416,7 @@
      *
      * Frame encoders are distributed between the available thread pools, and
      * the encoder will never generate more thread pools than frameNumThreads */
-    char*     numaPools;
+    const char* numaPools;
 
     /* Enable wavefront parallel processing, greatly increases parallelism for
      * less than 1% compression efficiency loss. Requires a thread pool, enabled
@@ -458,7 +458,7 @@
      * order. Otherwise the encoder will emit per-stream statistics into the log
      * file when x265_encoder_log is called (presumably at the end of the
      * encode) */
-    char*     csvfn;
+    const char* csvfn;
 
     /*== Internal Picture Specification ==*/
 
@@ -522,12 +522,21 @@
      * performance. Value must be between 1 and 16, default is 3 */
     int       maxNumReferences;
 
+    /* Allow libx265 to emit HEVC bitstreams which do not meet strict level
+     * requirements. Defaults to false */
+    int       bAllowNonConformance;
+
     /*== Bitstream Options ==*/
 
     /* Flag indicating whether VPS, SPS and PPS headers should be output with
      * each keyframe. Default false */
     int       bRepeatHeaders;
 
+    /* Flag indicating whether the encoder should generate start codes (Annex B
+     * format) or length (file format) before NAL units. Default true, Annex B.
+     * Muxers should set this to the correct value */
+    int       bAnnexB;
+
     /* Flag indicating whether the encoder should emit an Access Unit Delimiter
      * NAL at the start of every access unit. Default false */
     int       bEnableAccessUnitDelimiters;
@@ -869,7 +878,7 @@
     int       analysisMode;
 
     /* Filename for analysisMode save/load. Default name is "x265_analysis.dat" */
-    char*     analysisFileName;
+    const char* analysisFileName;
 
     /*== Rate Control ==*/
 
@@ -962,7 +971,7 @@
 
         /* Filename of the 2pass output/input stats file, if unspecified the
          * encoder will default to using x265_2pass.log */
-        char*     statFileName;
+        const char* statFileName;
 
         /* temporally blur quants */
         double    qblur;
@@ -988,6 +997,12 @@
         /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise
          * quality to maintain bitrate adherence */
        int bStrictCbr;
+
+        /* Enable adaptive quantization at CU granularity. This parameter specifies
+         * the minimum CU size at which QP can be adjusted, i.e. Quantization Group
+         * (QG) size. Allowed values are 64, 32, 16 provided it falls within the
+         * inclusive range [maxCUSize, minCUSize]. Experimental, default: maxCUSize */
+        uint32_t qgSize;
     } rc;
 
     /*== Video Usability Information ==*/
@@ -1084,6 +1099,22 @@
          * conformance cropping window to further crop the displayed window */
         int defDispWinBottomOffset;
     } vui;
+
+    /* SMPTE ST 2086 mastering display color volume SEI info, specified as a
+     * string which is parsed when the stream header SEI are emitted. The string
+     * format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)" where %hu
+     * are unsigned 16bit integers and %u are unsigned 32bit integers. The SEI
+     * includes X,Y display primaries for RGB channels, white point X,Y and
+     * max,min luminance values. */
+    const char* masteringDisplayColorVolume;
+
+    /* Content light level info SEI, specified as a string which is parsed when
+     * the stream header SEI are emitted. The string format is "%hu,%hu" where
+     * %hu are unsigned 16bit integers. The first value is the max content light
+     * level (or 0 if no maximum is indicated), the second value is the maximum
+     * picture average light level (or 0). */
+    const char* contentLightLevelInfo;
+
 } x265_param;
 
 
 /* x265_param_alloc:
@@ -1162,12 +1193,10 @@
 void x265_picture_init(x265_param *param, x265_picture *pic);
 
 /* x265_max_bit_depth:
- *      Specifies the maximum number of bits per pixel that x265 can input. This
- *      is also the max bit depth that x265 encodes in.  When x265_max_bit_depth
- *      is 8, the internal and input bit depths must be 8.  When
- *      x265_max_bit_depth is 12, the internal and input bit depths can be
- *      either 8, 10, or 12. Note that the internal bit depth must be the same
- *      for all encoders allocated in the same process. */
+ *      Specifies the number of bits per pixel that x265 uses internally to
+ *      represent a pixel, and the bit depth of the output bitstream.
+ *      param->internalBitDepth must be set to this value. x265_max_bit_depth
+ *      will be 8 for default builds, 10 for HIGH_BIT_DEPTH builds. */
 X265_API extern const int x265_max_bit_depth;
 
 /* x265_version_str:
@@ -1214,6 +1243,21 @@
  *      Once flushing has begun, all subsequent calls must pass pic_in as NULL. */
 int x265_encoder_encode(x265_encoder *encoder, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out);
 
+/* x265_encoder_reconfig:
+ *      various parameters from x265_param are copied.
+ *      this takes effect immediately, on whichever frame is encoded next;
+ *      returns 0 on success, negative on parameter validation error.
+ *
+ *      not all parameters can be changed; see the actual function for a
+ *      detailed breakdown.  since not all parameters can be changed, moving
+ *      from preset to preset may not always fully copy all relevant parameters,
+ *      but should still work usably in practice. however, more so than for
+ *      other presets, many of the speed shortcuts used in ultrafast cannot be
+ *      switched out of; using reconfig to switch between ultrafast and other
+ *      presets is not recommended without a more fine-grained breakdown of
+ *      parameters to take this into account. */
+int x265_encoder_reconfig(x265_encoder *, x265_param *);
+
 /* x265_encoder_get_stats:
  *       returns encoder statistics */
 void x265_encoder_get_stats(x265_encoder *encoder, x265_stats *, uint32_t statsSizeBytes);
@@ -1253,6 +1297,7 @@
     void          (*picture_init)(x265_param*, x265_picture*);
     x265_encoder* (*encoder_open)(x265_param*);
     void          (*encoder_parameters)(x265_encoder*, x265_param*);
+    int           (*encoder_reconfig)(x265_encoder*, x265_param*);
     int           (*encoder_headers)(x265_encoder*, x265_nal**, uint32_t*);
     int           (*encoder_encode)(x265_encoder*, x265_nal**, uint32_t*, x265_picture*, x265_picture*);
     void          (*encoder_get_stats)(x265_encoder*, x265_stats*, uint32_t);
@@ -1275,8 +1320,14 @@
 *   Retrieve the programming interface for a linked x265 library.
 *   May return NULL if no library is available that supports the
 *   requested bit depth. If bitDepth is 0 the function is guaranteed
- *   to return a non-NULL x265_api pointer, from the system default
- *   libx265 */
+ *   to return a non-NULL x265_api pointer, from the linked libx265.
+ *
+ *   If the requested bitDepth is not supported by the linked libx265,
+ *   it will attempt to dynamically bind x265_api_get() from a shared
+ *   library with an appropriate name:
+ *     8bit:  libx265_main.so
+ *     10bit: libx265_main10.so
+ *   Obviously the shared library file extension is platform specific */
 const x265_api* x265_api_get(int bitDepth);
 
 
 #ifdef __cplusplus
x265_1.6.tar.gz/source/x265cli.h -> x265_1.7.tar.gz/source/x265cli.h Changed
 
@@ -30,7 +30,7 @@
 namespace x265 {
 #endif
 
-static const char short_options[] = "o:p:f:F:r:I:i:b:s:t:q:m:hwV?";
+static const char short_options[] = "o:D:P:p:f:F:r:I:i:b:s:t:q:m:hwV?";
 static const struct option long_options[] =
 {
     { "help",                 no_argument, NULL, 'h' },
@@ -47,16 +47,19 @@
     { "no-pme",               no_argument, NULL, 0 },
     { "pme",                  no_argument, NULL, 0 },
     { "log-level",      required_argument, NULL, 0 },
-    { "profile",        required_argument, NULL, 0 },
+    { "profile",        required_argument, NULL, 'P' },
     { "level-idc",      required_argument, NULL, 0 },
     { "high-tier",            no_argument, NULL, 0 },
     { "no-high-tier",         no_argument, NULL, 0 },
+    { "allow-non-conformance",no_argument, NULL, 0 },
+    { "no-allow-non-conformance",no_argument, NULL, 0 },
     { "csv",            required_argument, NULL, 0 },
     { "no-cu-stats",          no_argument, NULL, 0 },
     { "cu-stats",             no_argument, NULL, 0 },
     { "y4m",                  no_argument, NULL, 0 },
     { "no-progress",          no_argument, NULL, 0 },
     { "output",         required_argument, NULL, 'o' },
+    { "output-depth",   required_argument, NULL, 'D' },
     { "input",          required_argument, NULL, 0 },
     { "input-depth",    required_argument, NULL, 0 },
     { "input-res",      required_argument, NULL, 0 },
@@ -181,6 +184,8 @@
     { "colormatrix",    required_argument, NULL, 0 },
     { "chromaloc",      required_argument, NULL, 0 },
     { "crop-rect",      required_argument, NULL, 0 },
+    { "master-display", required_argument, NULL, 0 },
+    { "max-cll",        required_argument, NULL, 0 },
     { "no-dither",            no_argument, NULL, 0 },
     { "dither",               no_argument, NULL, 0 },
     { "no-repeat-headers",    no_argument, NULL, 0 },
@@ -205,6 +210,8 @@
     { "strict-cbr",           no_argument, NULL, 0 },
     { "temporal-layers",      no_argument, NULL, 0 },
     { "no-temporal-layers",   no_argument, NULL, 0 },
+    { "qg-size",        required_argument, NULL, 0 },
+    { "recon-y4m-exec", required_argument, NULL, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
@@ -236,6 +243,7 @@
     H0("-V/--version                     Show version info and exit\n");
     H0("\nOutput Options:\n");
     H0("-o/--output <filename>           Bitstream output file name\n");
+    H0("-D/--output-depth 8|10           Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth);
     H0("   --log-level <string>          Logging level: none error warning info debug full. Default %s\n", x265::logLevelNames[param->logLevel + 1]);
     H0("   --no-progress                 Disable CLI progress reports\n");
     H0("   --[no-]cu-stats               Enable logging stats about distribution of cu across all modes. Default %s\n",OPT(param->bLogCuStats));
@@ -255,9 +263,10 @@
     H0("   --[no-]ssim                   Enable reporting SSIM metric scores. Default %s\n", OPT(param->bEnableSsim));
     H0("   --[no-]psnr                   Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr));
     H0("\nProfile, Level, Tier:\n");
-    H0("   --profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
+    H0("-P/--profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
     H0("   --level-idc <integer|float>   Force a minimum required decoder level (as '5.0' or '50')\n");
     H0("   --[no-]high-tier              If a decoder level is specified, this modifier selects High tier of that level\n");
+    H0("   --[no-]allow-non-conformance  Allow the encoder to generate profile NONE bitstreams. Default %s\n", OPT(param->bAllowNonConformance));
     H0("\nThreading, performance:\n");
     H0("   --pools <integer,...>         Comma separated thread count per thread pool (pool per NUMA node)\n");
     H0("                                 '-' implies no threads on node, '+' implies one thread per core on node\n");
@@ -352,12 +361,14 @@
     H0("   --analysis-file <filename>    Specify file name used for either dumping or reading analysis data.\n");
     H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance. Default %d\n", param->rc.aqMode);
     H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
+    H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize);
     H0("   --[no-]cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
     H1("   --ipratio <float>             QP factor between I and P. Default %.2f\n", param->rc.ipFactor);
     H1("   --pbratio <float>             QP factor between P and B. Default %.2f\n", param->rc.pbFactor);
     H1("   --qcomp <float>               Weight given to predicted complexity. Default %.2f\n", param->rc.qCompress);
-    H1("   --cbqpoffs <integer>          Chroma Cb QP Offset. Default %d\n", param->cbQpOffset);
-    H1("   --crqpoffs <integer>          Chroma Cr QP Offset. Default %d\n", param->crQpOffset);
+    H1("   --qpstep <integer>            The maximum single adjustment in QP allowed to rate control. Default %d\n", param->rc.qpStep);
+    H1("   --cbqpoffs <integer>          Chroma Cb QP Offset [-12..12]. Default %d\n", param->cbQpOffset);
+    H1("   --crqpoffs <integer>          Chroma Cr QP Offset [-12..12]. Default %d\n", param->crQpOffset);
     H1("   --scaling-list <string>       Specify a file containing HM style quant scaling lists or 'default' or 'off'. Default: off\n");
     H1("   --lambda-file <string>        Specify a file containing replacement values for the lambda tables\n");
     H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");
@@ -384,6 +395,9 @@
     H1("   --colormatrix <string>        Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,\n");
     H1("                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
     H1("   --chromaloc <integer>         Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
+    H0("   --master-display <string>     SMPTE ST 2086 master display color volume info SEI (HDR)\n");
+    H0("                                    format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n");
+    H0("   --max-cll <string>            Emit content light level info SEI as \"cll,fall\" (HDR)\n");
     H0("\nBitstream options:\n");
     H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
     H0("   --[no-]info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
@@ -394,6 +408,7 @@
     H1("\nReconstructed video options (debugging):\n");
     H1("-r/--recon <filename>            Reconstructed raw image YUV or Y4M output file name\n");
     H1("   --recon-depth <integer>       Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n");
+    H1("   --recon-y4m-exec <string>     pipe reconstructed frames to Y4M viewer, ex:\"ffplay -i pipe:0 -autoexit\"\n");
     H1("\nExecutable return codes:\n");
     H1("    0 - encode successful\n");
     H1("    1 - unable to parse command line\n");
Request History

Aloysius created request almost 10 years ago

Updated to 1.7

scarabeus accepted request almost 10 years ago

Thanks for the bump