Overview
Submit package home:Aloysius:branches:Essentials / x265 to package Essentials / x265
x265.changes
Changed
-------------------------------------------------------------------
+Fri May 29 09:11:02 UTC 2015 - aloisio@gmx.com
+
+- soname bump to 59
+- Update to version 1.7
+  * large amount of assembly code optimizations
+  * some preliminary support for high dynamic range content
+  * improvements for multi-library support
+  * some new quality features
+    (full documentation at: http://x265.readthedocs.org/en/1.7)
+  * This release simplifies the multi-library support introduced
+    in version 1.6. Any libx265 can now forward API requests to
+    other installed libx265 libraries (by name) so applications
+    like ffmpeg and the x265 CLI can select between 8bit and 10bit
+    encodes at runtime without the need of a shim library or
+    library load path hacks. See --output-depth, and
+    http://x265.readthedocs.org/en/1.7/api.html#multi-library-interface
+  * For quality, x265 now allows you to configure the quantization
+    group size smaller than the CTU size (for finer grained AQ
+    adjustments). See --qg-size.
+  * x265 now supports limited mid-encode reconfigure via a new public
+    method: x265_encoder_reconfig()
+  * For HDR, x265 now supports signaling the SMPTE 2084 color transfer
+    function, the SMPTE 2086 mastering display color primaries, and the
+    content light levels. See --master-display, --max-cll
+  * x265 will no longer emit any non-conformant bitstreams unless
+    --allow-non-conformance is specified.
+  * The x265 CLI now supports a simple encode preview feature. See
+    --recon-y4m-exec.
+  * The AnnexB NAL headers can now be configured off, via x265_param.bAnnexB
+    This is not configurable via the CLI because it is a function of the
+    muxer being used, and the CLI only supports raw output files. See
+    --annexb
+  Misc:
+  * --lossless encodes are now signaled as level 8.5
+  * --profile now has a -P short option
+  * The regression scripts used by x265 are now public, and can be found at:
+    https://bitbucket.org/sborho/test-harness
+  * x265's cmake scripts now support PGO builds, the test-harness can be
+    used to drive the profile-guided build process.
+
+-------------------------------------------------------------------
Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com

- soname bumped to 51
x265.spec
Changed
# based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/

Name:           x265
-%define soname  51
+%define soname  59
%define libname lib%{name}
%define libsoname %{libname}-%{soname}
-Version:        1.6
+Version:        1.7
Release:        0
License:        GPL-2.0+
Summary:        A free h265/HEVC encoder - encoder binary
baselibs.conf
Changed
-libx265-51
+libx265-59
x265_1.6.tar.gz/.hg_archival.txt -> x265_1.7.tar.gz/.hg_archival.txt
Changed
repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
+node: 8425278def1edf0931dc33fc518e1950063e76b0
branch: stable
-tag: 1.6
+tag: 1.7
x265_1.6.tar.gz/.hgtags -> x265_1.7.tar.gz/.hgtags
Changed
c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
+cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
x265_1.6.tar.gz/doc/reST/api.rst -> x265_1.7.tar.gz/doc/reST/api.rst
Changed
 * how x265_encoder_open has changed the parameters.
 * note that the data accessible through pointers in the returned param struct
 * (e.g. filenames) should not be modified by the calling application. */
-	void x265_encoder_parameters(x265_encoder *, x265_param *);
-
+	void x265_encoder_parameters(x265_encoder *, x265_param *);
+
+**x265_encoder_reconfig()** may be used to reconfigure encoder parameters mid-encode::
+
+	/* x265_encoder_reconfig:
+	 *      used to modify encoder parameters.
+	 *      various parameters from x265_param are copied.
+	 *      this takes effect immediately, on whichever frame is encoded next;
+	 *      returns 0 on success, negative on parameter validation error.
+	 *
+	 *      not all parameters can be changed; see the actual function for a
+	 *      detailed breakdown. since not all parameters can be changed, moving
+	 *      from preset to preset may not always fully copy all relevant parameters,
+	 *      but should still work usably in practice. however, more so than for
+	 *      other presets, many of the speed shortcuts used in ultrafast cannot be
+	 *      switched out of; using reconfig to switch between ultrafast and other
+	 *      presets is not recommended without a more fine-grained breakdown of
+	 *      parameters to take this into account. */
+	int x265_encoder_reconfig(x265_encoder *, x265_param *);
+
Pictures
========

Multi-library Interface
=======================

-If your application might want to make a runtime selection between among
+If your application might want to make a runtime selection between
a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
want to use the multi-library interface.

	 * libx265 */
	const x265_api* x265_api_get(int bitDepth);

-The general idea is to request the API for the bitDepth you would prefer
-the encoder to use (8 or 10), and if that returns NULL you request the
-API for bitDepth=0, which returns the system default libx265.
-
Note that using this multi-library API in your application is only the
-first step. Next your application must dynamically link to libx265 and
-then you must build and install a multi-lib configuration of libx265,
-which includes 8bpp and 16bpp builds of libx265 and a shim library which
-forwards x265_api_get() calls to the appropriate library using dynamic
-loading and binding.
+first step.
+
+Your application must link to one build of libx265 (statically or
+dynamically) and this linked version of libx265 will support one
+bit-depth (8 or 10 bits).
+
+Your application must now request the API for the bitDepth you would
+prefer the encoder to use (8 or 10). If the requested bitdepth is zero,
+or if it matches the bitdepth of the system default libx265 (the
+currently linked library), then this library will be used for encode.
+If you request a different bit-depth, the linked libx265 will attempt
+to dynamically bind a shared library with a name appropriate for the
+requested bit-depth:
+
+	8-bit:  libx265_main.dll
+	10-bit: libx265_main10.dll
+
+	(the shared library extension is obviously platform specific. On
+	Linux it is .so while on Mac it is .dylib)
+
+For example on Windows, one could package together an x265.exe
+statically linked against the 8bpp libx265 together with a
+libx265_main10.dll in the same folder, and this executable would be able
+to encode main and main10 bitstreams.
+
+On Linux, x265 packagers could install 8bpp static and shared libraries
+under the name libx265 (so all applications link against 8bpp libx265)
+and then also install libx265_main10.so (symlinked to its numbered solib).
+Thus applications which use x265_api_get() will be able to generate main
+or main10 bitstreams.
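The name-based forwarding described above can be sketched as a small lookup. `chooseApiLibrary` and its return convention are illustrative assumptions for this sketch, not part of the real libx265 API; the real `x265_api_get()` additionally requires the bound library to expose the same API version (X265_BUILD) as the linked one:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the decision x265_api_get() makes: given the
// requested bit depth and the bit depth the linked libx265 was built
// with, decide which library should service the request. An empty
// string here means "use the linked (system default) libx265".
std::string chooseApiLibrary(int requestedDepth, int linkedDepth)
{
    // bitDepth=0, or a depth matching the linked build, stays in-process
    if (requestedDepth == 0 || requestedDepth == linkedDepth)
        return "";
    if (requestedDepth == 8)
        return "libx265_main";   // plus .so/.dll/.dylib per platform
    if (requestedDepth == 10)
        return "libx265_main10";
    return "";                   // unsupported depth: fall back to linked
}
```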
x265_1.6.tar.gz/doc/reST/cli.rst -> x265_1.7.tar.gz/doc/reST/cli.rst
Changed
	handled implicitly.

	One may also directly supply the CPU capability bitmap as an integer.
+
+	Note that by specifying this option you are overriding x265's CPU
+	detection and it is possible to do this wrong. You can cause encoder
+	crashes by specifying SIMD architectures which are not supported on
+	your CPU.
+
+	Default: auto-detected SIMD architectures

.. option:: --frame-threads, -F <integer>

	Over-allocation of frame threads will not improve performance, it
	will generally just increase memory use.

-	**Values:** any value between 8 and 16. Default is 0, auto-detect
+	**Values:** any value between 0 and 16. Default is 0, auto-detect

.. option:: --pools <string>, --numa-pools <string>

	their node, they will not be allowed to migrate between nodes, but they
	will be allowed to move between CPU cores within their node.

-	If the three pool features: :option:`--wpp` :option:`--pmode` and
-	:option:`--pme` are all disabled, then :option:`--pools` is ignored
-	and no thread pools are created.
+	If the four pool features: :option:`--wpp`, :option:`--pmode`,
+	:option:`--pme` and :option:`--lookahead-slices` are all disabled,
+	then :option:`--pools` is ignored and no thread pools are created.

-	If "none" is specified, then all three of the thread pool features are
+	If "none" is specified, then all four of the thread pool features are
	implicitly disabled.

	Multiple thread pools will be allocated for any NUMA node with more than

	:option:`--frame-threads`. The pools are used for WPP and for
	distributed analysis and motion search.

+	On Windows, the native APIs offer sufficient functionality to
+	discover the NUMA topology and enforce the thread affinity that
+	libx265 needs (so long as you have not chosen to target XP or
+	Vista), but on POSIX systems it relies on libnuma for this
+	functionality. If your target POSIX system is single socket, then
+	building without libnuma is a perfectly reasonable option, as it
+	will have no effect on the runtime behavior. On a multiple-socket
+	system, a POSIX build of libx265 without libnuma will be less work
+	efficient. See :ref:`thread pools <pools>` for more detail.
+
	Default "", one thread is allocated per detected hardware thread
	(logical CPU cores) and one thread pool per NUMA node.

+	Note that the string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms.
+
.. option:: --wpp, --no-wpp

	Enable Wavefront Parallel Processing. The encoder may begin encoding

	**CLI ONLY**

+.. option:: --output-depth, -D 8|10
+
+	Bitdepth of output HEVC bitstream, which is also the internal bit
+	depth of the encoder. If the requested bit depth is not the bit
+	depth of the linked libx265, it will attempt to bind libx265_main
+	for an 8bit encoder, or libx265_main10 for a 10bit encoder, with the
+	same API version as the linked libx265.
+
+	**CLI ONLY**
+
Profile, Level, Tier
====================

-.. option:: --profile <string>
+.. option:: --profile, -P <string>

	Enforce the requirements of the specified profile, ensuring the
	output stream will be decodable by a decoder which supports that

	times 10, for example level **5.1** is specified as "5.1" or "51",
	and level **5.0** is specified as "5.0" or "50".

-	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2
+	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2, 8.5

.. option:: --high-tier, --no-high-tier

	HEVC specification. If x265 detects that the total reference count
	is greater than 8, it will issue a warning that the resulting stream
	is non-compliant and it signals the stream as profile NONE and level
-	NONE but still allows the encode to continue. Compliant HEVC
+	NONE and will abort the encode unless
+	:option:`--allow-non-conformance` is specified. Compliant HEVC
	decoders may refuse to decode such streams.

	Default 3

+.. option:: --allow-non-conformance, --no-allow-non-conformance
+
+	Allow libx265 to generate a bitstream with profile and level NONE.
+	By default it will abort any encode which does not meet strict level
+	compliance. The most likely causes for non-conformance are
+	:option:`--ctu` being too small, :option:`--ref` being too high,
+	or the bitrate or resolution being out of specification.
+
+	Default: disabled
+
.. note::
	:option:`--profile`, :option:`--level-idc`, and
	:option:`--high-tier` are only intended for use when you are

	limitations and must constrain the bitstream within those limits.
	Specifying a profile or level may lower the encode quality
	parameters to meet those requirements but it will never raise
-	them.
+	them. It may enable VBV constraints on a CRF encode.

Mode decision / Analysis
========================

	**Range of values:** 0.0 to 3.0

+.. option:: --qg-size <64|32|16>
+
+	Enable adaptive quantization for sub-CTUs. This parameter specifies
+	the minimum CU size at which QP can be adjusted, i.e. the
+	Quantization Group size. Allowed values are 64, 32 and 16, provided
+	the value falls within the inclusive range [minCUSize, maxCUSize].
+	Experimental.
+	Default: same as maxCUSize
+
.. option:: --cutree, --no-cutree

	Enable the use of lookahead's lowres motion vector fields to

.. option:: --strict-cbr, --no-strict-cbr

	Enables stricter conditions to control bitrate deviance from the
-	target bitrate in CBR mode. Bitrate adherence is prioritised
+	target bitrate in ABR mode. Bit rate adherence is prioritised
	over quality. Rate tolerance is reduced to 50%. Default disabled.

	This option is for use-cases which require the final average bitrate
-	to be within very strict limits of the target - preventing overshoots
-	completely, and achieve bitrates within 5% of target bitrate,
+	to be within very strict limits of the target; preventing overshoots,
+	while keeping the bit rate within 5% of the target setting,
	especially in short segment encodes. Typically, the encoder stays
	conservative, waiting until there is enough feedback in terms of
	encoded frames to control QP. strict-cbr allows the encoder to be

	lookahead). Default value is 0.6. Increasing it to 1 will
	effectively generate CQP

-.. option:: --qstep <integer>
+.. option:: --qpstep <integer>

	The maximum single adjustment in QP allowed to rate control. Default
	4

	specification for a description of these values. Default undefined
	(not signaled)

+.. option:: --master-display <string>
+
+	SMPTE ST 2086 mastering display color volume SEI info, specified as
+	a string which is parsed when the stream header SEI are emitted. The
+	string format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
+	where %hu are unsigned 16bit integers and %u are unsigned 32bit
+	integers. The SEI includes X,Y display primaries for RGB channels,
+	white point X,Y and max,min luminance values. (HDR)
+
+	Example for P65D3 1000-nits:
+
+		G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
+
+	Note that this string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms. No default.
+
+.. option:: --max-cll <string>
+
+	Maximum content light level and maximum frame average light level as
+	required by the Consumer Electronics Association 861.3 specification.
+
+	Specified as a string which is parsed when the stream header SEI are
+	emitted. The string format is "%hu,%hu" where %hu are unsigned 16bit
+	integers. The first value is the max content light level (or 0 if no
+	maximum is indicated), the second value is the maximum picture
+	average light level (or 0). (HDR)
+
+	Note that this string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms. No default.
+
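The two HDR string formats above map directly onto `sscanf` conversion specifiers. The following is an illustrative sketch only (not x265's actual parser); `MasteringDisplay`, `parseMasterDisplay` and `parseMaxCll` are names invented for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical holder for the SMPTE ST 2086 values carried by --master-display.
struct MasteringDisplay
{
    uint16_t gx, gy, bx, by, rx, ry, wx, wy; // RGB primaries and white point (X,Y)
    uint32_t maxLuma, minLuma;               // max,min display luminance
};

// Parse the documented "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)" format.
bool parseMasterDisplay(const char* s, MasteringDisplay* md)
{
    return sscanf(s, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
                  &md->gx, &md->gy, &md->bx, &md->by, &md->rx, &md->ry,
                  &md->wx, &md->wy, &md->maxLuma, &md->minLuma) == 10;
}

// Parse the documented --max-cll "%hu,%hu" format (MaxCLL, MaxFALL).
bool parseMaxCll(const char* s, uint16_t* maxCll, uint16_t* maxFall)
{
    return sscanf(s, "%hu,%hu", maxCll, maxFall) == 2;
}
```

The parenthesized grouping makes the primaries unambiguous, which is also why the string must be quoted on most shells.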
Bitstream options
=================

+.. option:: --annexb, --no-annexb
+
+	If enabled, x265 will produce Annex B bitstream format, which places
+	start codes before NAL. If disabled, x265 will produce file format,
+	which places length before NAL. The x265 CLI will choose the right
+	option based on the output format. Default enabled.
+
+	**API ONLY**
+
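The two framings differ only in what precedes each NAL unit. A sketch under stated assumptions (a 4-byte start code and a 4-byte big-endian length prefix, the common choices; this is illustrative byte layout, not x265 code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Annex B framing: prepend the start code 00 00 00 01 to the NAL payload.
std::vector<uint8_t> frameAnnexB(const std::vector<uint8_t>& nal)
{
    std::vector<uint8_t> out = { 0x00, 0x00, 0x00, 0x01 };
    out.insert(out.end(), nal.begin(), nal.end());
    return out;
}

// "File format" framing (as used by mp4-style muxers): prepend the NAL
// length as a 4-byte big-endian integer instead of a start code.
std::vector<uint8_t> frameLengthPrefixed(const std::vector<uint8_t>& nal)
{
    uint32_t n = (uint32_t)nal.size();
    std::vector<uint8_t> out = { uint8_t(n >> 24), uint8_t(n >> 16),
                                 uint8_t(n >> 8),  uint8_t(n) };
    out.insert(out.end(), nal.begin(), nal.end());
    return out;
}
```

This is why the choice belongs to the muxer: a raw `.hevc` file needs start codes to be parseable, while a container already records NAL boundaries.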
.. option:: --repeat-headers, --no-repeat-headers

	If enabled, x265 will emit VPS, SPS, and PPS headers with every

	Enable a temporal sub layer. All referenced I/P/B frames are in the
	base layer and all unreferenced B frames are placed in a temporal
-	sublayer. A decoder may chose to drop the sublayer and only decode
-	and display the base layer slices.
+	enhancement layer. A decoder may choose to drop the enhancement layer
+	and only decode and display the base layer slices.

	If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes`
	3 then the two layers evenly split the frame rate, with a cadence of

	**CLI ONLY**

+.. option:: --recon-y4m-exec <string>
+
+	If you have an application which can play a Y4MPEG stream received
+	on stdin, the x265 CLI can feed it reconstructed pictures in display
+	order. The pictures will have no timing info, obviously, so the
+	picture timing will be determined primarily by encoding elapsed time
+	and latencies, but it can be useful to preview the pictures being
+	output by the encoder to validate input settings and rate control
+	parameters.
+
+	Example command for ffplay (assuming it is in your PATH):
+
+		--recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+
+	**CLI ONLY**
+
.. vim: noet
x265_1.6.tar.gz/doc/reST/threading.rst -> x265_1.7.tar.gz/doc/reST/threading.rst
Changed
Threading
*********

+.. _pools:
+
Thread Pools
============

expected to drop that job so the worker thread may go back to the pool
and find more work.

+On Windows, the native APIs offer sufficient functionality to discover
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder on the threads of a
+single socket, and so you incur a heavier context switching cost.
+
Wavefront Parallel Processing
=============================

lowres cost analysis to worker threads. It will use bonded task groups
to perform batches of frame cost estimates, and it may optionally use
bonded task groups to measure single frame cost estimates using slices.
+(see :option:`--lookahead-slices`)

The function slicetypeDecide() itself is also performed by a worker
thread if your encoder has a thread pool, else it runs within the
x265_1.6.tar.gz/readme.rst -> x265_1.7.tar.gz/readme.rst
Changed
=================

| **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
-| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_
+| **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_
| **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_

`x265 <https://www.videolan.org/developers/x265.html>`_ is an open
x265_1.6.tar.gz/source/CMakeLists.txt -> x265_1.7.tar.gz/source/CMakeLists.txt
Changed
mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)

# X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 51)
+set(X265_BUILD 59)
configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
               "${PROJECT_BINARY_DIR}/x265.def")
configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"

    if(LIBRT)
        list(APPEND PLATFORM_LIBS rt)
    endif()
+    find_library(LIBDL dl)
+    if(LIBDL)
+        list(APPEND PLATFORM_LIBS dl)
+    endif()
    find_package(Numa)
    if(NUMA_FOUND)
-        list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY})
+        link_directories(${NUMA_LIBRARY_DIR})
+        list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
        check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
        if(NUMA_V2)
            add_definitions(-DHAVE_LIBNUMA)
            message(STATUS "libnuma found, building with support for NUMA nodes")
-            list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY})
-            link_directories(${NUMA_LIBRARY_DIR})
+            list(APPEND PLATFORM_LIBS numa)
            include_directories(${NUMA_INCLUDE_DIR})
        endif()
    endif()

if(CMAKE_GENERATOR STREQUAL "Xcode")
    set(XCODE 1)
endif()
-if (APPLE)
+if(APPLE)
    add_definitions(-DMACOS)
endif()

        add_definitions(-static)
        list(APPEND LINKER_OPTIONS "-static")
    endif(STATIC_LINK_CRT)
+    check_cxx_compiler_flag(-Wno-strict-overflow CC_HAS_NO_STRICT_OVERFLOW)
    check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING)
    check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS)
    if (CC_HAS_NO_ARRAY_BOUNDS)

    endif()
endif(WARNINGS_AS_ERRORS)

-if (WIN32)
+if(WIN32)
    # Visual leak detector
    find_package(VLD QUIET)
    if(VLD_FOUND)

        list(APPEND PLATFORM_LIBS ${VLD_LIBRARIES})
        link_directories(${VLD_LIBRARY_DIRS})
    endif()
-    option(WINXP_SUPPORT "Make binaries compatible with Windows XP" OFF)
+    option(WINXP_SUPPORT "Make binaries compatible with Windows XP and Vista" OFF)
    if(WINXP_SUPPORT)
        # force use of workarounds for CONDITION_VARIABLE and atomic
        # intrinsics introduced after XP
-        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP)
-    endif()
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP -D_WIN32_WINNT_WIN7=0x0601)
+    else(WINXP_SUPPORT)
+        # default to targeting Windows 7 for the NUMA APIs
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WIN7)
+    endif(WINXP_SUPPORT)
endif()

include(version) # determine X265_VERSION and X265_LATEST_TAG

# Main CLI application
option(ENABLE_CLI "Build standalone CLI application" ON)
if(ENABLE_CLI)
-    file(GLOB InputFiles input/*.cpp input/*.h)
-    file(GLOB OutputFiles output/*.cpp output/*.h)
+    file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
+    file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
+                          output/yuv.cpp output/y4m.cpp # recon
+                          output/raw.cpp)               # muxers
    file(GLOB FilterFiles filters/*.cpp filters/*.h)
    source_group(input FILES ${InputFiles})
    source_group(output FILES ${OutputFiles})
x265_1.6.tar.gz/source/common/common.cpp -> x265_1.7.tar.gz/source/common/common.cpp
Changed
    return (x265_exp2_lut[i & 63] + 256) << (i >> 6) >> 8;
}

-void x265_log(const x265_param *param, int level, const char *fmt, ...)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...)
{
    if (param && level > param->logLevel)
        return;
-    const char *log_level;
+    const int bufferSize = 4096;
+    char buffer[bufferSize];
+    int p = 0;
+    const char* log_level;
    switch (level)
    {
    case X265_LOG_ERROR:

        break;
    }

-    fprintf(stderr, "x265 [%s]: ", log_level);
+    if (caller)
+        p += sprintf(buffer, "%-4s [%s]: ", caller, log_level);
    va_list arg;
    va_start(arg, fmt);
-    vfprintf(stderr, fmt, arg);
+    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
    va_end(arg);
+    fputs(buffer, stderr);
}

double x265_ssim2dB(double ssim)
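The new `general_log()` formats the whole line into one buffer before writing, so a log line from one thread cannot interleave with another's. A standalone sketch of that pattern (the names here are illustrative, not the libx265 symbols; the real function writes with `fputs` instead of returning a string):

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Format a "caller [level]: message" log line into a single buffer:
// snprintf the prefix first, then append the user message with vsnprintf
// at offset p so the result can be emitted in one write.
static std::string formatLogLine(const char* caller, const char* level,
                                 const char* fmt, ...)
{
    const int bufferSize = 4096;
    char buffer[bufferSize];
    int p = 0;
    if (caller)
        p += snprintf(buffer, bufferSize, "%-4s [%s]: ", caller, level);

    va_list arg;
    va_start(arg, fmt);
    vsnprintf(buffer + p, bufferSize - p, fmt, arg); // truncates, never overflows
    va_end(arg);
    return std::string(buffer);
}
```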
x265_1.6.tar.gz/source/common/common.h -> x265_1.7.tar.gz/source/common/common.h
Changed

/* outside x265 namespace, but prefixed. defined in common.cpp */
int64_t x265_mdate(void);
-void x265_log(const x265_param *param, int level, const char *fmt, ...);
+#define x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
int x265_exp2fix8(double x);

double x265_ssim2dB(double ssim);
x265_1.6.tar.gz/source/common/constants.cpp -> x265_1.7.tar.gz/source/common/constants.cpp
Changed
      4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31, 36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
};

-const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4] =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
{
    { 0, 4, 1, 8, 5, 2, 12, 9, 6, 3, 13, 10, 7, 14, 11, 15 },
    { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
x265_1.6.tar.gz/source/common/contexts.h -> x265_1.7.tar.gz/source/common/contexts.h
Changed
// private namespace

extern const uint32_t g_entropyBits[128];
+extern const uint32_t g_entropyStateBits[128];
extern const uint8_t g_nextState[128][2];

#define sbacGetMps(S) ((S) & 1)
x265_1.6.tar.gz/source/common/cudata.cpp -> x265_1.7.tar.gz/source/common/cudata.cpp
Changed
}

// initialize Sub partition
-void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp)
{
    m_absIdxInCTU = cuGeom.absPartIdx;
    m_encData = ctu.m_encData;

    m_cuAboveRight = ctu.m_cuAboveRight;
    X265_CHECK(m_numPartitions == cuGeom.numPartitions, "initSubCU() size mismatch\n");

-    /* sequential memsets */
-    m_partSet((uint8_t*)m_qp, (uint8_t)ctu.m_qp[0]);
+    m_partSet((uint8_t*)m_qp, (uint8_t)qp);
+
    m_partSet(m_log2CUSize, (uint8_t)cuGeom.log2CUSize);
    m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
    m_partSet(m_tqBypass, (uint8_t)m_encData->m_param->bLossless);

    }
}

+/* Clip motion vector to within slightly padded boundary of picture (the
+ * MV may reference a block that is completely within the padded area).
+ * Note this function is unaware of how much of this picture is actually
+ * available for use (re: frame parallelism) */
void CUData::clipMv(MV& outMV) const
{
    const uint32_t mvshift = 2;

    uint32_t blockSize = 1 << log2CUSize;
    uint32_t sbWidth = 1 << (g_log2Size[maxCUSize] - log2CUSize);
    int32_t lastLevelFlag = log2CUSize == g_log2Size[minCUSize];
+
    for (uint32_t sbY = 0; sbY < sbWidth; sbY++)
    {
        for (uint32_t sbX = 0; sbX < sbWidth; sbX++)
Changed
20
1
2
uint32_t childOffset; // offset of the first child CU from current CU
3
uint32_t absPartIdx; // Part index of this CU in terms of 4x4 blocks.
4
uint32_t numPartitions; // Number of 4x4 blocks in the CU
5
- uint32_t depth; // depth of this CU relative from CTU
6
uint32_t flags; // CU flags.
7
+ uint32_t depth; // depth of this CU relative from CTU
8
};
9
10
struct MVField
11
12
static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
13
14
void initCTU(const Frame& frame, uint32_t cuAddr, int qp);
15
- void initSubCU(const CUData& ctu, const CUGeom& cuGeom);
16
+ void initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp);
17
void initLosslessCU(const CUData& cu, const CUGeom& cuGeom);
18
19
void copyPartFrom(const CUData& cu, const CUGeom& childGeom, uint32_t subPartIdx);
20
x265_1.6.tar.gz/source/common/dct.cpp -> x265_1.7.tar.gz/source/common/dct.cpp
Changed
    }
}

-int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
{
    memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
    memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));

    return scanPosLast - 1;
}

+uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+{
+    int n;
+
+    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    X265_CHECK(n >= 0, "non-zero coeff scan failure!\n");
+
+    uint32_t lastNZPosInCG = (uint32_t)n;
+
+    for (n = 0;; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    uint32_t firstNZPosInCG = (uint32_t)n;
+
+    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+}
+
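The function above packs the last non-zero scan position into the high 16 bits of the return value and the first into the low 16 bits. A self-contained sketch of the same scan over a 4x4 coefficient group (using the diagonal order from `g_scan4x4`; `findFirstLast` is a simplified stand-in, and unlike the real function it tolerates an all-zero block instead of asserting):

```cpp
#include <cassert>
#include <cstdint>

static const int CG_SIZE = 4; // MLS_CG_SIZE in x265

// Walk the 16-entry scan table from both ends and return
// (lastNZPosInCG << 16) | firstNZPosInCG, as findPosFirstLast_c does.
// scanTbl maps scan position -> raster index within the 4x4 group.
uint32_t findFirstLast(const int16_t* coeff, intptr_t trSize, const uint16_t scanTbl[16])
{
    int last = 15, first = 0;
    while (last >= 0 &&
           !coeff[(scanTbl[last] / CG_SIZE) * trSize + scanTbl[last] % CG_SIZE])
        last--;
    while (first < 16 &&
           !coeff[(scanTbl[first] / CG_SIZE) * trSize + scanTbl[first] % CG_SIZE])
        first++;
    return ((uint32_t)last << 16) | (uint32_t)first;
}
```

Packing both positions into one return value lets the entropy coder fetch them with a single call and two shifts.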
} // closing - anonymous file-static namespace

namespace x265 {

    p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
    p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;

-    p.findPosLast = findPosLast_c;
+    p.scanPosLast = scanPosLast_c;
+    p.findPosFirstLast = findPosFirstLast_c;
}
}
x265_1.6.tar.gz/source/common/frame.cpp -> x265_1.7.tar.gz/source/common/frame.cpp
Changed
Frame::Frame()
{
    m_bChromaExtended = false;
+    m_lowresInit = false;
    m_reconRowCount.set(0);
    m_countRefEncoders = 0;
    m_encData = NULL;
    m_reconPic = NULL;
    m_next = NULL;
    m_prev = NULL;
+    m_param = NULL;
    memset(&m_lowres, 0, sizeof(m_lowres));
}

bool Frame::create(x265_param *param)
{
    m_fencPic = new PicYuv;
+    m_param = param;

    return m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
           m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode);
x265_1.6.tar.gz/source/common/frame.h -> x265_1.7.tar.gz/source/common/frame.h
Changed
    void*          m_userData;        // user provided pointer passed in with this picture

    Lowres         m_lowres;
+    bool           m_lowresInit;      // lowres init complete (pre-analysis)
    bool           m_bChromaExtended; // orig chroma planes motion extended for weight analysis

    /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */

    Frame*         m_next;            // PicList doubly linked list pointers
    Frame*         m_prev;
-
+    x265_param*    m_param;           // Points to the latest param set for the frame.
    x265_analysis_data m_analysisData;
    Frame();
x265_1.6.tar.gz/source/common/framedata.h -> x265_1.7.tar.gz/source/common/framedata.h
Changed
    uint32_t numEncodedCUs;    /* ctuAddr of last encoded CTU in row */
    uint32_t encodedBits;      /* sum of 'totalBits' of encoded CTUs */
    uint32_t satdForVbv;       /* sum of lowres (estimated) costs for entire row */
+    uint32_t intraSatdForVbv;  /* sum of lowres (estimated) intra costs for entire row */
    uint32_t diagSatd;
    uint32_t diagIntraSatd;
    double   diagQp;
x265_1.6.tar.gz/source/common/ipfilter.cpp -> x265_1.7.tar.gz/source/common/ipfilter.cpp
Changed
#endif

namespace {
-template<int dstStride, int width, int height>
-void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
-{
- int shift = IF_INTERNAL_PREC - X265_DEPTH;
- int row, col;
-
- for (row = 0; row < height; row++)
- {
- for (col = 0; col < width; col++)
- {
- int16_t val = src[col] << shift;
- dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
- }
-
- src += srcStride;
- dst += dstStride;
- }
-}
-
-template<int dstStride>
-void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+template<int width, int height>
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
{
int shift = IF_INTERNAL_PREC - X265_DEPTH;
int row, col;

p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
- p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;

#define CHROMA_422(W, H) \
p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \

p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
- p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;

#define CHROMA_444(W, H) \
p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \

p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \
p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \
p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
- p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>;
+ p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;

#define LUMA(W, H) \
p.pu[LUMA_ ## W ## x ## H].luma_hpp = interp_horiz_pp_c<8, W, H>; \

p.pu[LUMA_ ## W ## x ## H].luma_vsp = interp_vert_sp_c<8, W, H>; \
p.pu[LUMA_ ## W ## x ## H].luma_vss = interp_vert_ss_c<8, W, H>; \
p.pu[LUMA_ ## W ## x ## H].luma_hvpp = interp_hv_pp_c<8, W, H>; \
- p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
+ p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;

void setupFilterPrimitives_c(EncoderPrimitives& p)
{

CHROMA_422(4, 8);
CHROMA_422(4, 4);
+ CHROMA_422(2, 4);
CHROMA_422(2, 8);
CHROMA_422(8, 16);
CHROMA_422(8, 8);

CHROMA_444(48, 64);
CHROMA_444(64, 16);
CHROMA_444(16, 64);
- p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-
- p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
- p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
- p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;

p.extendRowBorder = extendCURowColBorder;
}
x265_1.6.tar.gz/source/common/loopfilter.cpp -> x265_1.7.tar.gz/source/common/loopfilter.cpp
Changed
dst[x] = signOf(src1[x] - src2[x]);
}

-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t signLeft)
+void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
{
- int x;
- int8_t signRight;
+ int x, y;
+ int8_t signRight, signLeft0;
int8_t edgeType;

- for (x = 0; x < width; x++)
+ for (y = 0; y < 2; y++)
{
- signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
- edgeType = signRight + signLeft + 2;
- signLeft = -signRight;
- rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+ signLeft0 = signLeft[y];
+ for (x = 0; x < width; x++)
+ {
+ signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+ edgeType = signRight + signLeft0 + 2;
+ signLeft0 = -signRight;
+ rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+ }
+ rec += stride;
}
}

}
}

+void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+{
+ int x, y;
+ int8_t signDown;
+ int edgeType;
+
+ for (y = 0; y < 2; y++)
+ {
+ for (x = 0; x < width; x++)
+ {
+ signDown = signOf(rec[x] - rec[x + stride]);
+ edgeType = signDown + upBuff1[x] + 2;
+ upBuff1[x] = -signDown;
+ rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+ }
+ rec += stride;
+ }
+}
+
void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
{
int x;

{
p.saoCuOrgE0 = processSaoCUE0;
p.saoCuOrgE1 = processSaoCUE1;
- p.saoCuOrgE2 = processSaoCUE2;
- p.saoCuOrgE3 = processSaoCUE3;
+ p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
+ p.saoCuOrgE2[0] = processSaoCUE2;
+ p.saoCuOrgE2[1] = processSaoCUE2;
+ p.saoCuOrgE3[0] = processSaoCUE3;
+ p.saoCuOrgE3[1] = processSaoCUE3;
p.saoCuOrgB0 = processSaoCUB0;
p.sign = calSign;
}
x265_1.6.tar.gz/source/common/param.cpp -> x265_1.7.tar.gz/source/common/param.cpp
Changed
extern "C"
void x265_param_free(x265_param* p)
{
- return x265_free(p);
+ x265_free(p);
}

extern "C"

param->levelIdc = 0;
param->bHighTier = 0;
param->interlaceMode = 0;
+ param->bAnnexB = 1;
param->bRepeatHeaders = 0;
param->bEnableAccessUnitDelimiters = 0;
param->bEmitHRDSEI = 0;

param->rc.zones = NULL;
param->rc.bEnableSlowFirstPass = 0;
param->rc.bStrictCbr = 0;
+ param->rc.qgSize = 64; /* Same as maxCUSize */

/* Video Usability Information (VUI) */
param->vui.aspectRatioIdc = 0;

param->rc.aqStrength = 0.0;
param->rc.aqMode = X265_AQ_NONE;
param->rc.cuTree = 0;
+ param->rc.qgSize = 32;
param->bEnableFastIntra = 1;
}
else if (!strcmp(preset, "superfast"))

param->rc.aqStrength = 0.0;
param->rc.aqMode = X265_AQ_NONE;
param->rc.cuTree = 0;
+ param->rc.qgSize = 32;
param->bEnableSAO = 0;
param->bEnableFastIntra = 1;
}

param->rdLevel = 2;
param->maxNumReferences = 1;
param->rc.cuTree = 0;
+ param->rc.qgSize = 32;
param->bEnableFastIntra = 1;
}
else if (!strcmp(preset, "faster"))

p->levelIdc = atoi(value);
}
OPT("high-tier") p->bHighTier = atobool(value);
+ OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
OPT2("log-level", "log")
{
p->logLevel = atoi(value);

}
}
OPT("cu-stats") p->bLogCuStats = atobool(value);
+ OPT("annexb") p->bAnnexB = atobool(value);
OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
OPT("wpp") p->bEnableWavefront = atobool(value);
OPT("ctu") p->maxCUSize = (uint32_t)atoi(value);

OPT2("pools", "numa-pools") p->numaPools = strdup(value);
OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
OPT("analysis-file") p->analysisFileName = strdup(value);
+ OPT("qg-size") p->rc.qgSize = atoi(value);
+ OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
+ OPT("max-cll") p->contentLightLevelInfo = strdup(value);
else
return X265_PARAM_BAD_NAME;
#undef OPT

uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize];
uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize];

- if (g_ctuSizeConfigured || ATOMIC_INC(&g_ctuSizeConfigured) > 1)
+ if (ATOMIC_INC(&g_ctuSizeConfigured) > 1)
{
if (g_maxCUSize != param->maxCUSize)
{

x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);

+ if (param->rc.aqMode)
+ x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree : %d / %0.1f / %d / %d\n", param->rc.aqMode,
+ param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+
if (param->bLossless)
x265_log(param, X265_LOG_INFO, "Rate Control : Lossless\n");
else switch (param->rc.rateControlMode)
{
case X265_RC_ABR:
- x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : ABR-%d kbps / %0.1f / %d\n", param->rc.bitrate,
- param->rc.aqStrength, param->rc.cuTree);
- break;
+ x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : ABR-%d kbps / %0.2f\n", param->rc.bitrate, param->rc.qCompress); break;
case X265_RC_CQP:
- x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CQP-%d / %0.1f / %d\n", param->rc.qp, param->rc.aqStrength,
- param->rc.cuTree);
- break;
+ x265_log(param, X265_LOG_INFO, "Rate Control : CQP-%d\n", param->rc.qp); break;
case X265_RC_CRF:
- x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CRF-%0.1f / %0.1f / %d\n", param->rc.rfConstant,
- param->rc.aqStrength, param->rc.cuTree);
- break;
+ x265_log(param, X265_LOG_INFO, "Rate Control / qCompress : CRF-%0.1f / %0.2f\n", param->rc.rfConstant, param->rc.qCompress); break;
}

if (param->rc.vbvBufferSize)

fflush(stderr);
}

+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam)
+{
+ if (!param || !reconfiguredParam)
+ return;
+
+ x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n");
+
+ char buf[80] = { 0 };
+ char tmp[40];
+#define TOOLCMP(COND1, COND2, STR, VAL) if (COND1 != COND2) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
+ TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, "ref=%d", reconfiguredParam->maxNumReferences);
+ TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "max-tu-size=%d", reconfiguredParam->maxTUSize);
+ TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "merange=%d", reconfiguredParam->searchRange);
+ TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= %d", reconfiguredParam->subpelRefine);
+ TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", reconfiguredParam->rdLevel);
+ TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", reconfiguredParam->psyRd);
+ TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", reconfiguredParam->rdoqLevel);
+ TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", reconfiguredParam->psyRdoq);
+ TOOLCMP(param->noiseReductionIntra, reconfiguredParam->noiseReductionIntra, "nr-intra=%d", reconfiguredParam->noiseReductionIntra);
+ TOOLCMP(param->noiseReductionInter, reconfiguredParam->noiseReductionInter, "nr-inter=%d", reconfiguredParam->noiseReductionInter);
+ TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast);
+ TOOLCMP(param->bEnableSignHiding, reconfiguredParam->bEnableSignHiding, "signhide=%d", reconfiguredParam->bEnableSignHiding);
+ TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, "fast-intra=%d", reconfiguredParam->bEnableFastIntra);
+ if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != reconfiguredParam->deblockingFilterBetaOffset
+ || param->deblockingFilterTCOffset != reconfiguredParam->deblockingFilterTCOffset))
+ {
+ sprintf(tmp, "deblock(tC=%d:B=%d)", param->deblockingFilterTCOffset, param->deblockingFilterBetaOffset);
+ appendtool(param, buf, sizeof(buf), tmp);
+ }
+ else
+ TOOLCMP(param->bEnableLoopFilter, reconfiguredParam->bEnableLoopFilter, "deblock=%d", reconfiguredParam->bEnableLoopFilter);
+
+ TOOLCMP(param->bEnableTemporalMvp, reconfiguredParam->bEnableTemporalMvp, "tmvp=%d", reconfiguredParam->bEnableTemporalMvp);
+ TOOLCMP(param->bEnableEarlySkip, reconfiguredParam->bEnableEarlySkip, "early-skip=%d", reconfiguredParam->bEnableEarlySkip);
+ x265_log(param, X265_LOG_INFO, "tools:%s\n", buf);
+}
+
char *x265_param2string(x265_param* p)
{
char *buf, *s;
x265_1.6.tar.gz/source/common/param.h -> x265_1.7.tar.gz/source/common/param.h
Changed
int x265_check_params(x265_param *param);
int x265_set_globals(x265_param *param);
void x265_print_params(x265_param *param);
+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam);
void x265_param_apply_fastfirstpass(x265_param *p);
char* x265_param2string(x265_param *param);
int x265_atoi(const char *str, bool& bError);
x265_1.6.tar.gz/source/common/picyuv.cpp -> x265_1.7.tar.gz/source/common/picyuv.cpp
Changed
for (int r = 0; r < height; r++)
{
- for (int c = 0; c < width; c++)
- yPixel[c] = (pixel)yChar[c];
+ memcpy(yPixel, yChar, width * sizeof(pixel));

yPixel += m_stride;
yChar += pic.stride[0] / sizeof(*yChar);

for (int r = 0; r < height >> m_vChromaShift; r++)
{
- for (int c = 0; c < width >> m_hChromaShift; c++)
- {
- uPixel[c] = (pixel)uChar[c];
- vPixel[c] = (pixel)vChar[c];
- }
+ memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+ memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));

uPixel += m_strideC;
vPixel += m_strideC;
x265_1.6.tar.gz/source/common/pixel.cpp -> x265_1.7.tar.gz/source/common/pixel.cpp
Changed
}
}

-void scale1D_128to64(pixel *dst, const pixel *src, intptr_t /*stride*/)
+void scale1D_128to64(pixel *dst, const pixel *src)
{
int x;
const pixel* src1 = src;
x265_1.6.tar.gz/source/common/predict.cpp -> x265_1.7.tar.gz/source/common/predict.cpp
Changed
void Predict::predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const
{
int16_t* dst = dstSYuv.getLumaAddr(pu.puAbsPartIdx);
- int dstStride = dstSYuv.m_size;
+ intptr_t dstStride = dstSYuv.m_size;

intptr_t srcStride = refPic.m_stride;
intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;

X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n");

if (!(yFrac | xFrac))
- primitives.luma_p2s(src, srcStride, dst, pu.width, pu.height);
+ primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
else if (!yFrac)
primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
else if (!xFrac)

int partEnum = partitionFromSizes(pu.width, pu.height);

uint32_t cxWidth = pu.width >> m_hChromaShift;
- uint32_t cxHeight = pu.height >> m_vChromaShift;

- X265_CHECK(((cxWidth | cxHeight) % 2) == 0, "chroma block size expected to be multiple of 2\n");
+ X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n");

if (!(yFrac | xFrac))
{
- primitives.chroma[m_csp].p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
- primitives.chroma[m_csp].p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
+ primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
+ primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
}
else if (!yFrac)
{

const pixel refSample = *pAdiLineNext;
// Pad unavailable samples with new value
int nextOrTop = X265_MIN(next, leftUnits);
+
// fill left column
+#if HIGH_BIT_DEPTH
while (curr < nextOrTop)
{
for (int i = 0; i < unitHeight; i++)

adi += unitWidth;
curr++;
}
+#else
+ X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
+ if (curr < nextOrTop)
+ {
+ const int fillSize = unitHeight * (nextOrTop - curr);
+ memset(adi, refSample, fillSize * sizeof(pixel));
+ curr = nextOrTop;
+ adi += fillSize;
+ }
+
+ if (curr < next)
+ {
+ const int fillSize = unitWidth * (next - curr);
+ memset(adi, refSample, fillSize * sizeof(pixel));
+ curr = next;
+ adi += fillSize;
+ }
+#endif
}

// pad all other reference samples.
x265_1.6.tar.gz/source/common/primitives.cpp -> x265_1.7.tar.gz/source/common/primitives.cpp
Changed
/* alias chroma 4:4:4 from luma primitives (all but chroma filters) */

- p.chroma[X265_CSP_I444].p2s = p.luma_p2s;
p.chroma[X265_CSP_I444].cu[BLOCK_4x4].sa8d = NULL;

for (int i = 0; i < NUM_PU_SIZES; i++)

p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
p.chroma[X265_CSP_I444].pu[i].addAvg = p.pu[i].addAvg;
p.chroma[X265_CSP_I444].pu[i].satd = p.pu[i].satd;
- p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
+ p.chroma[X265_CSP_I444].pu[i].p2s = p.pu[i].convert_p2s;
}

for (int i = 0; i < NUM_CU_SIZES; i++)
x265_1.6.tar.gz/source/common/primitives.h -> x265_1.7.tar.gz/source/common/primitives.h
Changed
typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
-typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
+typedef void (*scale1D_t)(pixel* dst, const pixel* src);
+typedef void (*scale2D_t)(pixel* dst, const pixel* src, intptr_t stride);
typedef void (*downscale_t)(const pixel* src0, pixel* dstf, pixel* dsth, pixel* dstv, pixel* dstc,
intptr_t src_stride, intptr_t dst_stride, int width, int height);
typedef void (*extendCURowBorder_t)(pixel* txt, intptr_t stride, int width, int height, int marginX);

typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);

typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);

typedef void (*pixelavg_pp_t)(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int weight);
typedef void (*addAvg_t)(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);

-typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t signLeft);
+typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t* signLeft, intptr_t stride);
typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);

typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);

-typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);

/* Function pointers to optimized encoder primitives. Each pointer can reference
 * either an assembly routine, a SIMD intrinsic primitive, or a C function */

addAvg_t addAvg; // bidir motion compensation, uses 16bit values

copy_pp_t copy_pp;
- filter_p2s_t filter_p2s;
+ filter_p2s_t convert_p2s;
}
pu[NUM_PU_SIZES];

dequant_scaling_t dequant_scaling;
dequant_normal_t dequant_normal;
denoiseDct_t denoiseDct;
- scale_t scale1D_128to64;
- scale_t scale2D_64to32;
+ scale1D_t scale1D_128to64;
+ scale2D_t scale2D_64to32;

ssim_4x4x2_core_t ssim_4x4x2_core;
ssim_end4_t ssim_end_4;

sign_t sign;
saoCuOrgE0_t saoCuOrgE0;
- saoCuOrgE1_t saoCuOrgE1;
- saoCuOrgE2_t saoCuOrgE2;
- saoCuOrgE3_t saoCuOrgE3;
+
+ /* To avoid the overhead in avx2 optimization in handling width=16, SAO_E0_1 is split
+ * into two parts: saoCuOrgE1, saoCuOrgE1_2Rows */
+ saoCuOrgE1_t saoCuOrgE1, saoCuOrgE1_2Rows;
+
+ // saoCuOrgE2[0] is used for width<=16 and saoCuOrgE2[1] is used for width > 16.
+ saoCuOrgE2_t saoCuOrgE2[2];
+
+ /* In avx2 optimization, two rows cannot be handled simultaneously since it requires
+ * a pixel from the previous row. So, saoCuOrgE3[0] is used for width<=16 and
+ * saoCuOrgE3[1] is used for width > 16. */
+ saoCuOrgE3_t saoCuOrgE3[2];
saoCuOrgB0_t saoCuOrgB0;

downscale_t frameInitLowres;

weightp_sp_t weight_sp;
weightp_pp_t weight_pp;

- filter_p2s_wxh_t luma_p2s;

- findPosLast_t findPosLast;
+ scanPosLast_t scanPosLast;
+ findPosFirstLast_t findPosFirstLast;

/* There is one set of chroma primitives per color space. An encoder will
 * have just a single color space and thus it will only ever use one entry

filter_hps_t filter_hps;
addAvg_t addAvg;
copy_pp_t copy_pp;
- filter_p2s_t chroma_p2s;
+ filter_p2s_t p2s;

}
pu[NUM_PU_SIZES];

}
cu[NUM_CU_SIZES];

- filter_p2s_wxh_t p2s; // takes width/height as arguments
}
chroma[X265_CSP_COUNT];
};
x265_1.6.tar.gz/source/common/quant.cpp -> x265_1.7.tar.gz/source/common/quant.cpp
Changed
704
1
2
{
3
m_entropyCoder = &entropy;
4
m_rdoqLevel = rdoqLevel;
5
- m_psyRdoqScale = (int64_t)(psyScale * 256.0);
6
+ m_psyRdoqScale = (int32_t)(psyScale * 256.0);
7
+ X265_CHECK((psyScale * 256.0) < (double)MAX_INT, "psyScale value too large\n");
8
m_scalingList = &scalingList;
9
m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
10
m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
11
12
X265_FREE(m_fencShortBuf);
13
}
14
15
-void Quant::setQPforQuant(const CUData& cu)
16
+void Quant::setQPforQuant(const CUData& ctu, int qp)
17
{
18
- m_tqBypass = !!cu.m_tqBypass[0];
19
+ m_tqBypass = !!ctu.m_tqBypass[0];
20
if (m_tqBypass)
21
return;
22
- m_nr = m_frameNr ? &m_frameNr[cu.m_encData->m_frameEncoderID] : NULL;
23
- int qpy = cu.m_qp[0];
24
- m_qpParam[TEXT_LUMA].setQpParam(qpy + QP_BD_OFFSET);
25
- setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, cu.m_chromaFormat);
26
- setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, cu.m_chromaFormat);
27
+ m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL;
28
+ m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET);
29
+ setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat);
30
+ setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat);
31
}
32
33
void Quant::setChromaQP(int qpin, TextType ttype, int chFmt)
34
35
{
36
int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
37
int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
38
+ const uint32_t usePsyMask = usePsy ? -1 : 0;
39
40
X265_CHECK(scalingListType < 6, "scaling list type out of range\n");
41
42
43
X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
44
if (!numSig)
45
return 0;
46
+
47
uint32_t trSize = 1 << log2TrSize;
48
int64_t lambda2 = m_qpParam[ttype].lambda2;
49
- int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
50
+ const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda);
51
52
/* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4)
53
* scale applied that must be removed during unquant. Note that in real dequant there is clipping
54
55
#define UNQUANT(lvl) (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift)
56
#define SIGCOST(bits) ((lambda2 * (bits)) >> 8)
57
#define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits))
58
-#define PSYVALUE(rec) ((psyScale * (rec)) >> (16 - scaleBits))
59
+#define PSYVALUE(rec) ((psyScale * (rec)) >> (2 * transformShift + 1))
60
61
int64_t costCoeff[32 * 32]; /* d*d + lambda * bits */
62
int64_t costUncoded[32 * 32]; /* d*d + lambda * 0 */
63
64
int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */
65
uint64_t sigCoeffGroupFlag64 = 0;
66
67
- uint32_t ctxSet = 0;
68
- int c1 = 1;
69
- int c2 = 0;
70
- uint32_t goRiceParam = 0;
71
- uint32_t c1Idx = 0;
72
- uint32_t c2Idx = 0;
73
- int cgLastScanPos = -1;
74
- int lastScanPos = -1;
75
const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */
76
bool bIsLuma = ttype == TEXT_LUMA;
77
78
79
TUEntropyCodingParameters codeParams;
80
cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma);
81
const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2);
82
+ const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE);
83
+
84
+ uint8_t coeffNum[MLS_GRP_NUM]; // value range[0, 16]
85
+ uint16_t coeffSign[MLS_GRP_NUM]; // bit mask map for non-zero coeff sign
86
+ uint16_t coeffFlag[MLS_GRP_NUM]; // bit mask map for non-zero coeff
87
+
88
+#if CHECKED_BUILD || _DEBUG
89
+ // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group
90
+ memset(coeffNum, 0, sizeof(coeffNum));
91
+ memset(coeffSign, 0, sizeof(coeffNum));
92
+ memset(coeffFlag, 0, sizeof(coeffNum));
93
+#endif
94
+ const int lastScanPos = primitives.scanPosLast(codeParams.scan, dstCoeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
95
+ const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
96
+
97
98
/* TODO: update bit estimates if dirty */
99
EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac;
100
101
- uint32_t scanPos;
102
- coeffGroupRDStats cgRdStats;
103
+ uint32_t scanPos = 0;
104
+ uint32_t c1 = 1;
105
+
106
+ // process trail all zero Coeff Group
107
+
108
+ /* coefficients after lastNZ have no distortion signal cost */
109
+ const int zeroCG = cgNum - 1 - cgLastScanPos;
110
+ memset(&costCoeff[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
111
+ memset(&costSig[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
112
+
113
+ /* sum zero coeff (uncodec) cost */
114
+
115
+ // TODO: does we need these cost?
116
+ if (usePsyMask)
117
+ {
118
+ for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
119
+ {
120
+ X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
121
+
122
+ uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
123
+ uint32_t blkPos = codeParams.scan[scanPosBase];
124
+
125
+ // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
126
+ for (int y = 0; y < MLS_CG_SIZE; y++)
127
+ {
128
+ for (int x = 0; x < MLS_CG_SIZE; x++)
129
+ {
130
+ int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */
131
+ int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
132
+
133
+ costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
134
+
135
+ /* when no residual coefficient is coded, predicted coef == recon coef */
136
+ costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
137
+
138
+ totalUncodedCost += costUncoded[blkPos + x];
139
+ totalRdCost += costUncoded[blkPos + x];
140
+ }
141
+ blkPos += trSize;
142
+ }
143
+ }
144
+ }
145
+ else
146
+ {
147
+ // non-psy path
148
+ for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
149
+ {
150
+ X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
151
+
152
+ uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
153
+ uint32_t blkPos = codeParams.scan[scanPosBase];
154
+
155
+ for (int y = 0; y < MLS_CG_SIZE; y++)
156
+ {
157
+ for (int x = 0; x < MLS_CG_SIZE; x++)
158
+ {
159
+ int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */
160
+ costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
161
+
162
+ totalUncodedCost += costUncoded[blkPos + x];
163
+ totalRdCost += costUncoded[blkPos + x];
164
+ }
165
+ blkPos += trSize;
166
+ }
167
+ }
168
+ }
169
+
170
+ static const uint8_t table_cnt[5][SCAN_SET_SIZE] =
171
+ {
172
+ // patternSigCtx = 0
173
+ {
174
+ 2, 1, 1, 0,
175
+ 1, 1, 0, 0,
176
+ 1, 0, 0, 0,
177
+ 0, 0, 0, 0,
178
+ },
179
+ // patternSigCtx = 1
180
+ {
181
+ 2, 2, 2, 2,
182
+ 1, 1, 1, 1,
183
+ 0, 0, 0, 0,
184
+ 0, 0, 0, 0,
185
+ },
186
+ // patternSigCtx = 2
187
+ {
188
+ 2, 1, 0, 0,
189
+ 2, 1, 0, 0,
190
+ 2, 1, 0, 0,
191
+ 2, 1, 0, 0,
192
+ },
193
+ // patternSigCtx = 3
194
+ {
195
+ 2, 2, 2, 2,
196
+ 2, 2, 2, 2,
197
+ 2, 2, 2, 2,
198
+ 2, 2, 2, 2,
199
+ },
200
+ // 4x4
201
+ {
202
+ 0, 1, 4, 5,
203
+ 2, 3, 4, 5,
204
+ 6, 6, 8, 8,
205
+ 7, 7, 8, 8
206
+ }
207
+ };
208
209
/* iterate over coding groups in reverse scan order */
210
- for (int cgScanPos = cgNum - 1; cgScanPos >= 0; cgScanPos--)
211
+ for (int cgScanPos = cgLastScanPos; cgScanPos >= 0; cgScanPos--)
212
{
213
+ uint32_t ctxSet = (cgScanPos && bIsLuma) ? 2 : 0;
214
const uint32_t cgBlkPos = codeParams.scanCG[cgScanPos];
215
const uint32_t cgPosY = cgBlkPos >> codeParams.log2TrSizeCG;
216
const uint32_t cgPosX = cgBlkPos - (cgPosY << codeParams.log2TrSizeCG);
217
const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos);
218
- memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
219
+ const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
220
+ const int ctxSigOffset = codeParams.firstSignificanceMapContext + (cgScanPos && bIsLuma ? 3 : 0);
+
+ if (c1 == 0)
+ ctxSet++;
+ c1 = 1;
- const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+ if (cgScanPos && (coeffNum[cgScanPos] == 0))
+ {
+ // TODO: do we need the zero-coeff cost?
+ const uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+ uint32_t blkPos = codeParams.scan[scanPosBase];

+ if (usePsyMask)
+ {
+ // TODO: we can't SIMD optimize this because PSYVALUE needs a 64-bit multiply; converting to double could be faster via FMA
+ for (int y = 0; y < MLS_CG_SIZE; y++)
+ {
+ for (int x = 0; x < MLS_CG_SIZE; x++)
+ {
+ int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */
+ int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT */
+
+ costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+ /* when no residual coefficient is coded, predicted coef == recon coef */
+ costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
+
+ totalUncodedCost += costUncoded[blkPos + x];
+ totalRdCost += costUncoded[blkPos + x];
+
+ const uint32_t scanPosOffset = y * MLS_CG_SIZE + x;
+ const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
+ X265_CHECK(trSize > 4, "trSize check failure\n");
+ X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");
+
+ costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+ costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x];
+ sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+ }
+ blkPos += trSize;
+ }
+ }
+ else
+ {
+ // non-psy path
+ for (int y = 0; y < MLS_CG_SIZE; y++)
+ {
+ for (int x = 0; x < MLS_CG_SIZE; x++)
+ {
+ int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */
+ costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+ totalUncodedCost += costUncoded[blkPos + x];
+ totalRdCost += costUncoded[blkPos + x];
+
+ const uint32_t scanPosOffset = y * MLS_CG_SIZE + x;
+ const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
+ X265_CHECK(trSize > 4, "trSize check failure\n");
+ X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, codeParams.scan[scanPosBase + scanPosOffset], bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");
+
+ costSig[scanPosBase + scanPosOffset] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+ costCoeff[scanPosBase + scanPosOffset] = costUncoded[blkPos + x];
+ sigRateDelta[blkPos + x] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+ }
+ blkPos += trSize;
+ }
+ }
+
+ /* there were no coded coefficients in this coefficient group */
+ {
+ uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
+ costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
+ totalRdCost += costCoeffGroupSig[cgScanPos]; /* add cost of 0 bit in significant CG bitmap */
+ }
+ continue;
+ }
+
+ coeffGroupRDStats cgRdStats;
+ memset(&cgRdStats, 0, sizeof(coeffGroupRDStats));
+
+ uint32_t subFlagMask = coeffFlag[cgScanPos];
+ int c2 = 0;
+ uint32_t goRiceParam = 0;
+ uint32_t c1Idx = 0;
+ uint32_t c2Idx = 0;
/* iterate over coefficients in each group in reverse scan order */
for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
{
scanPos = (cgScanPos << MLS_CG_SIZE) + scanPosinCG;
uint32_t blkPos = codeParams.scan[scanPos];
- uint16_t maxAbsLevel = (int16_t)abs(dstCoeff[blkPos]); /* abs(quantized coeff) */
+ uint32_t maxAbsLevel = abs(dstCoeff[blkPos]); /* abs(quantized coeff) */
int signCoef = m_resiDctCoeff[blkPos]; /* pre-quantization DCT coeff */
int predictedCoef = m_fencDctCoeff[blkPos] - signCoef; /* predicted DCT = source DCT - residual DCT */

* FIX15 nature of the CABAC cost tables minus the forward transform scale */

/* cost of not coding this coefficient (all distortion, no signal bits) */
- costUncoded[scanPos] = (int64_t)(signCoef * signCoef) << scaleBits;
- if (usePsy && blkPos)
+ costUncoded[blkPos] = ((int64_t)signCoef * signCoef) << scaleBits;
+ X265_CHECK((!!scanPos ^ !!blkPos) == 0, "failed on (blkPos=0 && scanPos!=0)\n");
+ if (usePsyMask & scanPos)
/* when no residual coefficient is coded, predicted coef == recon coef */
- costUncoded[scanPos] -= PSYVALUE(predictedCoef);
+ costUncoded[blkPos] -= PSYVALUE(predictedCoef);

- totalUncodedCost += costUncoded[scanPos];
+ totalUncodedCost += costUncoded[blkPos];

- if (maxAbsLevel && lastScanPos < 0)
- {
- /* remember the first non-zero coef found in this reverse scan as the last pos */
- lastScanPos = scanPos;
- ctxSet = (scanPos < SCAN_SET_SIZE || !bIsLuma) ? 0 : 2;
- cgLastScanPos = cgScanPos;
- }
+ // coefficient level estimation
+ const int* greaterOneBits = estBitsSbac.greaterOneBits[4 * ctxSet + c1];
+ const uint32_t ctxSig = (blkPos == 0) ? 0 : table_cnt[(trSize == 4) ? 4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset;
+ X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n");

- if (lastScanPos < 0)
+ // before finding the last non-zero coeff
+ if (scanPos > (uint32_t)lastScanPos)
{
/* coefficients after lastNZ have no distortion signal cost */
costCoeff[scanPos] = 0;

/* No non-zero coefficient yet found, but this does not mean
* there is no uncoded-cost for this coefficient. Pre-
* quantization the coefficient may have been non-zero */
- totalRdCost += costUncoded[scanPos];
+ totalRdCost += costUncoded[blkPos];
+ }
+ else if (!(subFlagMask & 1))
+ {
+ // fast zero coeff path
+ /* set default costs to uncoded costs */
+ costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+ costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos];
+ sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+ totalRdCost += costCoeff[scanPos];
+ rateIncUp[blkPos] = greaterOneBits[0];
+
+ subFlagMask >>= 1;
}
else
{
+ subFlagMask >>= 1;
+
const uint32_t c1c2Idx = ((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)) + (((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1) * 2;
const uint32_t baseLevel = ((uint32_t)0xD9 >> (c1c2Idx * 2)) & 3; // {1, 2, 1, 3}

X265_CHECK((int)baseLevel == ((c1Idx < C1FLAG_NUMBER) ? (2 + (c2Idx == 0)) : 1), "scan validation 3\n");

// coefficient level estimation
- const uint32_t oneCtx = 4 * ctxSet + c1;
- const uint32_t absCtx = ctxSet + c2;
- const int* greaterOneBits = estBitsSbac.greaterOneBits[oneCtx];
- const int* levelAbsBits = estBitsSbac.levelAbsBits[absCtx];
+ const int* levelAbsBits = estBitsSbac.levelAbsBits[ctxSet + c2];

- uint16_t level = 0;
+ uint32_t level = 0;
uint32_t sigCoefBits = 0;
costCoeff[scanPos] = MAX_INT64;

sigRateDelta[blkPos] = 0;
else
{
- const uint32_t ctxSig = getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext);
if (maxAbsLevel < 3)
{
/* set default costs to uncoded costs */
- costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[ctxSig][0]);
- costCoeff[scanPos] = costUncoded[scanPos] + costSig[scanPos];
+ costSig[scanPos] = SIGCOST(estBitsSbac.significantBits[0][ctxSig]);
+ costCoeff[scanPos] = costUncoded[blkPos] + costSig[scanPos];
}
- sigRateDelta[blkPos] = estBitsSbac.significantBits[ctxSig][1] - estBitsSbac.significantBits[ctxSig][0];
- sigCoefBits = estBitsSbac.significantBits[ctxSig][1];
+ sigRateDelta[blkPos] = estBitsSbac.significantBits[1][ctxSig] - estBitsSbac.significantBits[0][ctxSig];
+ sigCoefBits = estBitsSbac.significantBits[1][ctxSig];
}
- if (maxAbsLevel)
+
+ // NOTE: X265_MAX(maxAbsLevel - 1, 1) ==> (X>=2 -> X-1), (X<2 -> 1) | (0 < X < 2 ==> X=1)
+ if (maxAbsLevel == 1)
{
- uint16_t minAbsLevel = X265_MAX(maxAbsLevel - 1, 1);
- for (uint16_t lvl = maxAbsLevel; lvl >= minAbsLevel; lvl--)
+ uint32_t levelBits = (c1c2Idx & 1) ? greaterOneBits[0] + IEP_RATE : ((1 + goRiceParam) << 15) + IEP_RATE;
+ X265_CHECK(levelBits == getICRateCost(1, 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE, "levelBits mistake\n");
+
+ int unquantAbsLevel = UNQUANT(1);
+ int d = abs(signCoef) - unquantAbsLevel;
+ int64_t curCost = RDCOST(d, sigCoefBits + levelBits);
+
+ /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
+ if (usePsyMask & scanPos)
{
- uint32_t levelBits = getICRateCost(lvl, lvl - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
+ int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef));
+ curCost -= PSYVALUE(reconCoef);
+ }

- int unquantAbsLevel = UNQUANT(lvl);
- int d = abs(signCoef) - unquantAbsLevel;
- int64_t curCost = RDCOST(d, sigCoefBits + levelBits);
+ if (curCost < costCoeff[scanPos])
+ {
+ level = 1;
+ costCoeff[scanPos] = curCost;
+ costSig[scanPos] = SIGCOST(sigCoefBits);
+ }
+ }
+ else if (maxAbsLevel)
+ {
+ uint32_t levelBits0 = getICRateCost(maxAbsLevel, maxAbsLevel - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;
+ uint32_t levelBits1 = getICRateCost(maxAbsLevel - 1, maxAbsLevel - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE;

- /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
- if (usePsy && blkPos)
- {
- int reconCoef = abs(unquantAbsLevel + SIGN(predictedCoef, signCoef));
- curCost -= PSYVALUE(reconCoef);
- }
+ int unquantAbsLevel0 = UNQUANT(maxAbsLevel);
+ int d0 = abs(signCoef) - unquantAbsLevel0;
+ int64_t curCost0 = RDCOST(d0, sigCoefBits + levelBits0);

- if (curCost < costCoeff[scanPos])
- {
- level = lvl;
- costCoeff[scanPos] = curCost;
- costSig[scanPos] = SIGCOST(sigCoefBits);
- }
+ int unquantAbsLevel1 = UNQUANT(maxAbsLevel - 1);
+ int d1 = abs(signCoef) - unquantAbsLevel1;
+ int64_t curCost1 = RDCOST(d1, sigCoefBits + levelBits1);
+
+ /* Psy RDOQ: bias in favor of higher AC coefficients in the reconstructed frame */
+ if (usePsyMask & scanPos)
+ {
+ int reconCoef;
+ reconCoef = abs(unquantAbsLevel0 + SIGN(predictedCoef, signCoef));
+ curCost0 -= PSYVALUE(reconCoef);
+
+ reconCoef = abs(unquantAbsLevel1 + SIGN(predictedCoef, signCoef));
+ curCost1 -= PSYVALUE(reconCoef);
+ }
+ if (curCost0 < costCoeff[scanPos])
+ {
+ level = maxAbsLevel;
+ costCoeff[scanPos] = curCost0;
+ costSig[scanPos] = SIGCOST(sigCoefBits);
+ }
+ if (curCost1 < costCoeff[scanPos])
+ {
+ level = maxAbsLevel - 1;
+ costCoeff[scanPos] = curCost1;
+ costSig[scanPos] = SIGCOST(sigCoefBits);
}
}

- dstCoeff[blkPos] = level;
+ dstCoeff[blkPos] = (int16_t)level;
totalRdCost += costCoeff[scanPos];

/* record costs for sign-hiding performed at the end */
- if (level)
+ if ((cu.m_slice->m_pps->bSignHideEnabled ? ~0 : 0) & level)
{
const int32_t diff0 = level - 1 - baseLevel;
const int32_t diff2 = level + 1 - baseLevel;

else if ((c1 < 3) && (c1 > 0) && level)
c1++;

- /* context set update */
- if (!(scanPos % SCAN_SET_SIZE) && scanPos)
+ if (dstCoeff[blkPos])
{
- c2 = 0;
- goRiceParam = 0;
-
- c1Idx = 0;
- c2Idx = 0;
- ctxSet = (scanPos == SCAN_SET_SIZE || !bIsLuma) ? 0 : 2;
- X265_CHECK(c1 >= 0, "c1 is negative\n");
- ctxSet -= ((int32_t)(c1 - 1) >> 31);
- c1 = 1;
+ sigCoeffGroupFlag64 |= cgBlkPosMask;
+ cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos];
+ cgRdStats.uncodedDist += costUncoded[blkPos];
+ cgRdStats.nnzBeforePos0 += scanPosinCG;
}
}

cgRdStats.sigCost += costSig[scanPos];
- if (!scanPosinCG)
- cgRdStats.sigCost0 = costSig[scanPos];
-
- if (dstCoeff[blkPos])
- {
- sigCoeffGroupFlag64 |= cgBlkPosMask;
- cgRdStats.codedLevelAndDist += costCoeff[scanPos] - costSig[scanPos];
- cgRdStats.uncodedDist += costUncoded[scanPos];
- cgRdStats.nnzBeforePos0 += scanPosinCG;
- }
} /* end for (scanPosinCG) */

+ X265_CHECK((cgScanPos << MLS_CG_SIZE) == (int)scanPos, "scanPos mistake\n");
+ cgRdStats.sigCost0 = costSig[scanPos];
+
costCoeffGroupSig[cgScanPos] = 0;

- if (cgLastScanPos < 0)
- {
- /* nothing to do at this point */
- }
- else if (!cgScanPos || cgScanPos == cgLastScanPos)
+ /* nothing to do in this case */
+ X265_CHECK(cgLastScanPos >= 0, "cgLastScanPos check failure\n");
+
+ if (!cgScanPos || cgScanPos == cgLastScanPos)
{
/* coeff group 0 is implied to be present, no signal cost */
/* coeff group with last NZ is implied to be present, handled below */

* of the significant coefficient group flag and evaluate whether the RD cost of the
* coded group is more than the RD cost of the uncoded group */

- uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+ uint32_t sigCtx = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);

int64_t costZeroCG = totalRdCost + SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);
costZeroCG += cgRdStats.uncodedDist; /* add distortion for resetting non-zero levels to zero levels */

costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][0]);

/* reset all coeffs to 0. UNCODE THIS COEFF GROUP! */
- for (int scanPosinCG = cgSize - 1; scanPosinCG >= 0; scanPosinCG--)
- {
- scanPos = cgScanPos * cgSize + scanPosinCG;
- uint32_t blkPos = codeParams.scan[scanPos];
- if (dstCoeff[blkPos])
- {
- costCoeff[scanPos] = costUncoded[scanPos];
- costSig[scanPos] = 0;
- }
- dstCoeff[blkPos] = 0;
- }
+ const uint32_t blkPos = codeParams.scan[cgScanPos * cgSize];
+ memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
}
}
else
{
/* there were no coded coefficients in this coefficient group */
- uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codeParams.log2TrSizeCG);
+ uint32_t ctxSig = getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride);
costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[ctxSig][0]);
totalRdCost += costCoeffGroupSig[cgScanPos]; /* add cost of 0 bit in significant CG bitmap */
totalRdCost -= cgRdStats.sigCost; /* remove cost of significant coefficient bitmap */

* cost of signaling it as not-significant */
uint32_t blkPos = codeParams.scan[scanPos];
if (dstCoeff[blkPos])
{
// Calculates the cost of signaling the last significant coefficient in the block
uint32_t pos[2] = { (blkPos & (trSize - 1)), (blkPos >> log2TrSize) };
if (codeParams.scanType == SCAN_VER)
}

totalRdCost -= costCoeff[scanPos];
- totalRdCost += costUncoded[scanPos];
+ totalRdCost += costUncoded[blkPos];
}
else
totalRdCost -= costSig[scanPos];

dstCoeff[blkPos] = (int16_t)((level ^ mask) - mask);
}

+ // Average 49.62 pixels
/* clean uncoded coefficients */
- for (int pos = bestLastIdx; pos <= lastScanPos; pos++)
+ for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++)
+ {
dstCoeff[codeParams.scan[pos]] = 0;
+ }
+ for (int pos = (bestLastIdx & ~(SCAN_SET_SIZE - 1)) + SCAN_SET_SIZE; pos <= lastScanPos; pos += SCAN_SET_SIZE)
+ {
+ const uint32_t blkPos = codeParams.scan[pos];
+ memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+ memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
+ }

/* rate-distortion based sign-hiding */
if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2)
{
+ const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE;
int lastCG = true;
- for (int subSet = cgLastScanPos; subSet >= 0; subSet--)
+ for (int subSet = realLastScanPos; subSet >= 0; subSet--)
{
int subPos = subSet << LOG2_SCAN_SET_SIZE;
int n;

- /* measure distance between first and last non-zero coef in this
- * coding group */
- for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
- if (dstCoeff[codeParams.scan[n + subPos]])
- break;
- if (n < 0)
+ if (!(sigCoeffGroupFlag64 & (1ULL << codeParams.scanCG[subSet])))
continue;

- int lastNZPosInCG = n;
-
- for (n = 0;; n++)
- if (dstCoeff[codeParams.scan[n + subPos]])
- break;
+ /* measure distance between first and last non-zero coef in this
+ * coding group */
+ const uint32_t posFirstLast = primitives.findPosFirstLast(&dstCoeff[codeParams.scan[subPos]], trSize, g_scan4x4[codeParams.scanType]);
+ int firstNZPosInCG = (uint16_t)posFirstLast;
+ int lastNZPosInCG = posFirstLast >> 16;

- int firstNZPosInCG = n;

if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD)
{

return numSig;
}

-/* Pattern decision for context derivation process of significant_coeff_flag */
-uint32_t Quant::calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
-{
- if (!log2TrSizeCG)
- return 0;
-
- const uint32_t trSizeCG = 1 << log2TrSizeCG;
- X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
- const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
- const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
- const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
- const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
-
- return sigRight + sigLower;
-}
-
/* Context derivation process of coeff_abs_significant_flag */
uint32_t Quant::getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma,
uint32_t firstSignificanceMapContext)

return (bIsLuma && (posX | posY) >= 4) ? 3 + offset : offset;
}

-/* Context derivation process of coeff_abs_significant_flag */
-uint32_t Quant::getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG)
-{
- const uint32_t trSizeCG = 1 << log2TrSizeCG;
-
- const uint32_t sigPos = (uint32_t)(cgGroupMask >> (1 + (cgPosY << log2TrSizeCG) + cgPosX));
- const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
- const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
-
- return (sigRight | sigLower) & 1;
-}
x265_1.6.tar.gz/source/common/quant.h -> x265_1.7.tar.gz/source/common/quant.h
int per;
int qp;
int64_t lambda2; /* FIX8 */
- int64_t lambda; /* FIX8 */
+ int32_t lambda; /* FIX8, dynamic range is 18-bits in 8bpp and 20-bits in 16bpp */

QpParam() : qp(MAX_INT) {}

per = qpScaled / 6;
qp = qpScaled;
lambda2 = (int64_t)(x265_lambda2_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
- lambda = (int64_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+ lambda = (int32_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+ X265_CHECK((x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5) < (double)MAX_INT, "x265_lambda_tab[] value too large\n");
}
}
};

QpParam m_qpParam[3];

int m_rdoqLevel;
- int64_t m_psyRdoqScale;
+ int32_t m_psyRdoqScale; // dynamic range [0,50] * 256 = 14-bits
int16_t* m_resiDctCoeff;
int16_t* m_fencDctCoeff;
int16_t* m_fencShortBuf;

bool allocNoiseReduction(const x265_param& param);

/* CU setup */
- void setQPforQuant(const CUData& cu);
+ void setQPforQuant(const CUData& ctu, int qp);

uint32_t transformNxN(const CUData& cu, const pixel* fenc, uint32_t fencStride, const int16_t* residual, uint32_t resiStride, coeff_t* coeff,
uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip);

void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig);

+ /* Pattern decision for context derivation process of significant_coeff_flag */
+ static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+ {
+ if (trSizeCG == 1)
+ return 0;
+
+ X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
+ X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+ // NOTE: cgBlkPos+1 may exceed 63, which is an invalid shift amount,
+ // but in that case both cgPosX and cgPosY equal (trSizeCG - 1),
+ // so sigRight and sigLower are masked to zero and the final result is still correct
+ const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid
+
+ // TODO: instruction BT is faster, but _bittest64 still generates 'BT m, r' in VS2012
+ const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
+ const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
+ return sigRight + sigLower;
+ }
+
+ /* Context derivation process of coeff_abs_significant_flag */
+ static uint32_t getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+ {
+ X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+ // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx
+ const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid
+ const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+ const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+
+ return (sigRight | sigLower) & 1;
+ }
+
/* static methods shared with entropy.cpp */
- static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);
static uint32_t getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma, uint32_t firstSignificanceMapContext);
- static uint32_t getSigCoeffGroupCtxInc(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);

protected:
x265_1.6.tar.gz/source/common/slice.h -> x265_1.7.tar.gz/source/common/slice.h
LEVEL6 = 180,
LEVEL6_1 = 183,
LEVEL6_2 = 186,
+ LEVEL8_5 = 255,
};
}
x265_1.6.tar.gz/source/common/threading.h -> x265_1.7.tar.gz/source/common/threading.h
LeaveCriticalSection(&m_cs);
}

+ void poke(void)
+ {
+ /* awaken all waiting threads, but make no change */
+ EnterCriticalSection(&m_cs);
+ WakeAllConditionVariable(&m_cv);
+ LeaveCriticalSection(&m_cs);
+ }
+
void incr()
{
EnterCriticalSection(&m_cs);

pthread_mutex_unlock(&m_mutex);
}

+ void poke(void)
+ {
+ /* awaken all waiting threads, but make no change */
+ pthread_mutex_lock(&m_mutex);
+ pthread_cond_broadcast(&m_cond);
+ pthread_mutex_unlock(&m_mutex);
+ }
+
void incr()
{
pthread_mutex_lock(&m_mutex);
x265_1.6.tar.gz/source/common/threadpool.cpp -> x265_1.7.tar.gz/source/common/threadpool.cpp
int cpuCount = getCpuCount();
bool bNumaSupport = false;

-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
bNumaSupport = true;
#elif HAVE_LIBNUMA
bNumaSupport = numa_available() >= 0;

for (int i = 0; i < cpuCount; i++)
{
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
UCHAR node;
if (GetNumaProcessorNode((UCHAR)i, &node))
- cpusPerNode[X265_MIN(node, MAX_NODE_NUM)]++;
+ cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++;
else
#elif HAVE_LIBNUMA
if (bNumaSupport >= 0)

/* limit nodes based on param->numaPools */
if (p->numaPools && *p->numaPools)
{
- char *nodeStr = p->numaPools;
+ const char *nodeStr = p->numaPools;
for (int i = 0; i < numNumaNodes; i++)
{
if (!*nodeStr)

return true;
}

-void ThreadPool::stop()
+void ThreadPool::stopWorkers()
{
if (m_workers)
{

/* static */
void ThreadPool::setThreadNodeAffinity(int numaNode)
{
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
GROUP_AFFINITY groupAffinity;
if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))
{

/* static */
int ThreadPool::getNumaNodeCount()
{
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
ULONG num = 1;
if (GetNumaHighestNodeNumber(&num))
num++;
x265_1.6.tar.gz/source/common/threadpool.h -> x265_1.7.tar.gz/source/common/threadpool.h
bool create(int numThreads, int maxProviders, int node);
bool start();
- void stop();
+ void stopWorkers();
void setCurrentThreadAffinity();
int tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
int tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.7.tar.gz/source/common/x86/asm-primitives.cpp
#error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF"
#endif

+#if X86_64
+ p.scanPosLast = x265_scanPosLast_x64;
+#endif
+
if (cpuMask & X265_CPU_SSE2)
{
/* We do not differentiate CPUs which support MMX and not SSE2. We only check

PIXEL_AVG_W4(mmx2);
LUMA_VAR(sse2);

- p.luma_p2s = x265_luma_p2s_sse2;
- p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_sse2;
- p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_sse2;

ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);

ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
ALL_LUMA_TU_S(transpose, transpose, sse2);

- p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
- p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
- p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
- p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
-
- p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
- p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
- p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
- p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+ ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
+ ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);
+
+ p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
+ p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
+ p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2;
+ p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2;
+ p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2;
+ p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
+ p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
+ p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+ p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
+ p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
+ p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
+ p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
+ p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
+ p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
+ p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
+ p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
+ p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
+ p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
+ p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
+ p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
+ p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
+ p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
+ p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
+ p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
+ p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
+ p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
+ p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
+ p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
+ p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
+ p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
+ p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
+ p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;

p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);

p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
+
+ p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3;
+ p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3;
+ p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3;
+ p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+ p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+ p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+ p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+ p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+ p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+ p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+ p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+ p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+ p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+ p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
+ p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
+ p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
138
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
139
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
140
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
141
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_ssse3;
142
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
143
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
144
+ p.findPosFirstLast = x265_findPosFirstLast_ssse3;
145
}
146
if (cpuMask & X265_CPU_SSE4)
{

ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4);
ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
}
if (cpuMask & X265_CPU_AVX)
{

}
if (cpuMask & X265_CPU_AVX2)
{
+ p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
+
+ p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
+ p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
+ p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
+ p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
+
+ p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+ p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+ p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+ p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+ p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+
+ p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
+ p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2;
+ p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+ p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2;
+ p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+ p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2;


p.dequant_normal = x265_dequant_normal_avx2;

p.scale1D_128to64 = x265_scale1D_128to64_avx2;
+ p.scale2D_64to32 = x265_scale2D_64to32_avx2;
// p.weight_pp = x265_weight_pp_avx2; fails tests

p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2;

ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2);
ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2);
ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2);
+
+ p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
+ p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+ p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2;
+
+ p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
+ p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+ p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2;
+
+ p.pu[LUMA_16x4].sad = x265_pixel_sad_16x4_avx2;
+ p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_avx2;
+ p.pu[LUMA_16x12].sad = x265_pixel_sad_16x12_avx2;
+ p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_avx2;
+ p.pu[LUMA_16x32].sad = x265_pixel_sad_16x32_avx2;
+
+ p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_avx2;
+ p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_avx2;
+ p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_avx2;
+ p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_avx2;
+ p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_avx2;
+ p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_avx2;
+ p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2;
+ p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2;
+ p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2;
+ p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2;
+ p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2;
+ p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2;
+ p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2;
+ p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
+ p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
+ p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2;
+ p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2;
+
+ p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_avx2;
+ p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2;
+ p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2;
+
+ if (cpuMask & X265_CPU_BMI2)
+ p.scanPosLast = x265_scanPosLast_avx2_bmi2;
}
}
#else // if HIGH_BIT_DEPTH

void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 8bpp
{
+#if X86_64
+ p.scanPosLast = x265_scanPosLast_x64;
+#endif
+
if (cpuMask & X265_CPU_SSE2)
{
/* We do not differentiate CPUs which support MMX and not SSE2. We only check

CHROMA_420_VSP_FILTERS(_sse2);
CHROMA_422_VSP_FILTERS(_sse2);
CHROMA_444_VSP_FILTERS(_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = x265_interp_4tap_vert_pp_2x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = x265_interp_4tap_vert_pp_4x32_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
+#if X86_64
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = x265_interp_4tap_vert_pp_6x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+#endif
+
+ ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2);
+ p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_sse2;
+ ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2);
+ p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_sse2;
+ p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse3;

//p.frameInitLowres = x265_frame_init_lowres_core_mmx2;
p.frameInitLowres = x265_frame_init_lowres_core_sse2;

ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2);

- p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
- p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
- p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
- p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
-
- p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
- p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
- p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
- p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+ ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
+ ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);

p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;

p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+ p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
+ p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
+ p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
+ p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
+ p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
+ p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
+ p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
+ p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
+ p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
+ p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
+ p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
+ p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
+ p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
+ p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
+ p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
+ p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
+ p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
+ p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
+ p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
+ p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
+ p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
+ p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
+ p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
+ p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
+
+ p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse2;

p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;


p.planecopy_sp = x265_downShift_16_sse2;
}
+ if (cpuMask & X265_CPU_SSE3)
+ {
+ ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
+ ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
+ ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
+ }
if (cpuMask & X265_CPU_SSSE3)
{
p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_ssse3;

ASSIGN_SSE_PP(ssse3);
p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_ssse3;
p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_ssse3;
- p.pu[LUMA_4x4].filter_p2s = x265_pixelToShort_4x4_ssse3;
- p.pu[LUMA_4x8].filter_p2s = x265_pixelToShort_4x8_ssse3;
- p.pu[LUMA_4x16].filter_p2s = x265_pixelToShort_4x16_ssse3;
- p.pu[LUMA_8x4].filter_p2s = x265_pixelToShort_8x4_ssse3;
- p.pu[LUMA_8x8].filter_p2s = x265_pixelToShort_8x8_ssse3;
- p.pu[LUMA_8x16].filter_p2s = x265_pixelToShort_8x16_ssse3;
- p.pu[LUMA_8x32].filter_p2s = x265_pixelToShort_8x32_ssse3;
- p.pu[LUMA_16x4].filter_p2s = x265_pixelToShort_16x4_ssse3;
- p.pu[LUMA_16x8].filter_p2s = x265_pixelToShort_16x8_ssse3;
- p.pu[LUMA_16x12].filter_p2s = x265_pixelToShort_16x12_ssse3;
- p.pu[LUMA_16x16].filter_p2s = x265_pixelToShort_16x16_ssse3;
- p.pu[LUMA_16x32].filter_p2s = x265_pixelToShort_16x32_ssse3;
- p.pu[LUMA_16x64].filter_p2s = x265_pixelToShort_16x64_ssse3;
- p.pu[LUMA_32x8].filter_p2s = x265_pixelToShort_32x8_ssse3;
- p.pu[LUMA_32x16].filter_p2s = x265_pixelToShort_32x16_ssse3;
- p.pu[LUMA_32x24].filter_p2s = x265_pixelToShort_32x24_ssse3;
- p.pu[LUMA_32x32].filter_p2s = x265_pixelToShort_32x32_ssse3;
- p.pu[LUMA_32x64].filter_p2s = x265_pixelToShort_32x64_ssse3;
- p.pu[LUMA_64x16].filter_p2s = x265_pixelToShort_64x16_ssse3;
- p.pu[LUMA_64x32].filter_p2s = x265_pixelToShort_64x32_ssse3;
- p.pu[LUMA_64x48].filter_p2s = x265_pixelToShort_64x48_ssse3;
- p.pu[LUMA_64x64].filter_p2s = x265_pixelToShort_64x64_ssse3;
-
- p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_ssse3;
- p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_ssse3;

p.dst4x4 = x265_dst4_ssse3;
p.cu[BLOCK_8x8].idct = x265_idct8_ssse3;

ALL_LUMA_TU(count_nonzero, count_nonzero, ssse3);

+ // MUST be done after LUMA_FILTERS() to overwrite default version
+ p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
+
p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
p.scale1D_128to64 = x265_scale1D_128to64_ssse3;
p.scale2D_64to32 = x265_scale2D_64to32_ssse3;
+
+ p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+ p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+ p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+ p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+ p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+ p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+ p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+ p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+ p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+ p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+ p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
+ p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
+ p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
+ p.findPosFirstLast = x265_findPosFirstLast_ssse3;
}
if (cpuMask & X265_CPU_SSE4)
{
p.sign = x265_calSign_sse4;
p.saoCuOrgE0 = x265_saoCuOrgE0_sse4;
p.saoCuOrgE1 = x265_saoCuOrgE1_sse4;
- p.saoCuOrgE2 = x265_saoCuOrgE2_sse4;
- p.saoCuOrgE3 = x265_saoCuOrgE3_sse4;
+ p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_sse4;
+ p.saoCuOrgE2[0] = x265_saoCuOrgE2_sse4;
+ p.saoCuOrgE2[1] = x265_saoCuOrgE2_sse4;
+ p.saoCuOrgE3[0] = x265_saoCuOrgE3_sse4;
+ p.saoCuOrgE3[1] = x265_saoCuOrgE3_sse4;
p.saoCuOrgB0 = x265_saoCuOrgB0_sse4;

LUMA_ADDAVG(sse4);

CHROMA_444_VSP_FILTERS_SSE4(_sse4);

// MUST be done after LUMA_FILTERS() to overwrite default version
- p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse4;
+ p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;

LUMA_CU_BLOCKCOPY(ps, sse4);
CHROMA_420_CU_BLOCKCOPY(ps, sse4);

p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4;
p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4;

+ p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_sse4;
+ p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_sse4;
+ p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_sse4;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_sse4;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
+
#if X86_64
ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);

p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx;
p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx;
p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = x265_pixel_satd_16x32_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = x265_pixel_satd_32x64_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = x265_pixel_satd_16x16_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = x265_pixel_satd_32x32_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = x265_pixel_satd_16x64_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = x265_pixel_satd_16x8_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = x265_pixel_satd_32x16_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = x265_pixel_satd_8x4_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = x265_pixel_satd_8x16_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = x265_pixel_satd_8x8_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = x265_pixel_satd_8x32_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = x265_pixel_satd_4x8_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = x265_pixel_satd_4x16_avx;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = x265_pixel_satd_4x4_avx;
ALL_LUMA_PU(satd, pixel_satd, avx);
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = x265_pixel_satd_4x4_avx;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = x265_pixel_satd_8x8_avx;

p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = x265_pixel_satd_32x8_avx;
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = x265_pixel_satd_8x32_avx;
ASSIGN_SA8D(avx);
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = x265_pixel_sa8d_32x32_avx;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = x265_pixel_sa8d_16x16_avx;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = x265_pixel_sa8d_8x8_avx;
+ p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = x265_pixel_satd_4x4_avx;
ASSIGN_SSE_PP(avx);
p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = x265_pixel_ssd_8x8_avx;
ASSIGN_SSE_SS(avx);

p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+ p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;

p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;

#if X86_64
585
if (cpuMask & X265_CPU_AVX2)
586
{
587
+ p.planecopy_sp = x265_downShift_16_avx2;
588
+
589
+ p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_avx2;
590
+
591
+ p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_avx2;
592
+ p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_avx2;
593
+
594
+ p.idst4x4 = x265_idst4_avx2;
595
+ p.dst4x4 = x265_dst4_avx2;
596
+ p.scale2D_64to32 = x265_scale2D_64to32_avx2;
597
+ p.saoCuOrgE0 = x265_saoCuOrgE0_avx2;
598
+ p.saoCuOrgE1 = x265_saoCuOrgE1_avx2;
599
+ p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_avx2;
600
+ p.saoCuOrgE2[0] = x265_saoCuOrgE2_avx2;
601
+ p.saoCuOrgE2[1] = x265_saoCuOrgE2_32_avx2;
602
+ p.saoCuOrgE3[0] = x265_saoCuOrgE3_avx2;
603
+ p.saoCuOrgE3[1] = x265_saoCuOrgE3_32_avx2;
604
+ p.saoCuOrgB0 = x265_saoCuOrgB0_avx2;
605
+ p.sign = x265_calSign_avx2;
606
+
607
p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_avx2;
608
p.cu[BLOCK_8x8].psy_cost_ss = x265_psyCost_ss_8x8_avx2;
609
p.cu[BLOCK_16x16].psy_cost_ss = x265_psyCost_ss_16x16_avx2;
610
611
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = x265_addAvg_8x8_avx2;
612
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = x265_addAvg_8x16_avx2;
613
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = x265_addAvg_8x32_avx2;
614
-
615
p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = x265_addAvg_12x16_avx2;
616
-
617
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = x265_addAvg_16x4_avx2;
618
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = x265_addAvg_16x8_avx2;
619
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = x265_addAvg_16x12_avx2;
620
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = x265_addAvg_16x16_avx2;
621
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = x265_addAvg_16x32_avx2;
622
-
623
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = x265_addAvg_32x8_avx2;
624
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = x265_addAvg_32x16_avx2;
625
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = x265_addAvg_32x24_avx2;
626
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = x265_addAvg_32x32_avx2;
627
628
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = x265_addAvg_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = x265_addAvg_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = x265_addAvg_8x12_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = x265_addAvg_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = x265_addAvg_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = x265_addAvg_8x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = x265_addAvg_12x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = x265_addAvg_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = x265_addAvg_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = x265_addAvg_16x24_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = x265_addAvg_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = x265_addAvg_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = x265_addAvg_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = x265_addAvg_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = x265_addAvg_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = x265_addAvg_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = x265_addAvg_32x64_avx2;
+
p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2;
p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2;

p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2;
p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2;

p.pu[LUMA_16x4].pixelavg_pp = x265_pixel_avg_16x4_avx2;
p.pu[LUMA_16x8].pixelavg_pp = x265_pixel_avg_16x8_avx2;

p.pu[LUMA_8x16].satd = x265_pixel_satd_8x16_avx2;
p.pu[LUMA_8x8].satd = x265_pixel_satd_8x8_avx2;

+ p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
+ p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+ p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+ p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
+ p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+ p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+ p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+ p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+ p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+ p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
+ p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
+ p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
+ p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
+ p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
+
p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2;
p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2;
p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2;

p.scale1D_128to64 = x265_scale1D_128to64_avx2;
p.weight_pp = x265_weight_pp_avx2;
+ p.weight_sp = x265_weight_sp_avx2;

// intra_pred functions
+ p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_avx2;
+ p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_avx2;
+ p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_avx2;
+ p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_avx2;
+ p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_avx2;
+ p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_avx2;
+ p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_avx2;
+ p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_avx2;
+ p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_avx2;
+ p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_avx2;
+ p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_avx2;
+ p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_avx2;
+ p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_avx2;
+ p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_avx2;
+ p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_19_avx2;
+ p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_20_avx2;
+ p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_21_avx2;
+ p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_22_avx2;
+ p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_23_avx2;
+ p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_24_avx2;
+ p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_25_avx2;
+ p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_27_avx2;
+ p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_28_avx2;
+ p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_29_avx2;
+ p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_30_avx2;
+ p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_31_avx2;
+ p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_32_avx2;
+ p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_33_avx2;
p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2;
p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2;
p.cu[BLOCK_8x8].intra_pred[4] = x265_intra_pred_ang8_4_avx2;

p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2;
p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
+ p.cu[BLOCK_8x8].intra_pred[13] = x265_intra_pred_ang8_13_avx2;
+ p.cu[BLOCK_8x8].intra_pred[20] = x265_intra_pred_ang8_20_avx2;
+ p.cu[BLOCK_8x8].intra_pred[21] = x265_intra_pred_ang8_21_avx2;
+ p.cu[BLOCK_8x8].intra_pred[22] = x265_intra_pred_ang8_22_avx2;
+ p.cu[BLOCK_8x8].intra_pred[23] = x265_intra_pred_ang8_23_avx2;
+ p.cu[BLOCK_8x8].intra_pred[14] = x265_intra_pred_ang8_14_avx2;
+ p.cu[BLOCK_8x8].intra_pred[15] = x265_intra_pred_ang8_15_avx2;
+ p.cu[BLOCK_8x8].intra_pred[16] = x265_intra_pred_ang8_16_avx2;
+ p.cu[BLOCK_16x16].intra_pred[3] = x265_intra_pred_ang16_3_avx2;
+ p.cu[BLOCK_16x16].intra_pred[4] = x265_intra_pred_ang16_4_avx2;
+ p.cu[BLOCK_16x16].intra_pred[5] = x265_intra_pred_ang16_5_avx2;
+ p.cu[BLOCK_16x16].intra_pred[6] = x265_intra_pred_ang16_6_avx2;
+ p.cu[BLOCK_16x16].intra_pred[7] = x265_intra_pred_ang16_7_avx2;
+ p.cu[BLOCK_16x16].intra_pred[8] = x265_intra_pred_ang16_8_avx2;
+ p.cu[BLOCK_16x16].intra_pred[9] = x265_intra_pred_ang16_9_avx2;
+ p.cu[BLOCK_16x16].intra_pred[12] = x265_intra_pred_ang16_12_avx2;
+ p.cu[BLOCK_16x16].intra_pred[11] = x265_intra_pred_ang16_11_avx2;
+ p.cu[BLOCK_16x16].intra_pred[13] = x265_intra_pred_ang16_13_avx2;
p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;
p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2;
p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2;

p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2;
p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
+ p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
+ p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2;
+ p.cu[BLOCK_32x32].intra_pred[24] = x265_intra_pred_ang32_24_avx2;
+ p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2;
+ p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2;
+ p.cu[BLOCK_32x32].intra_pred[21] = x265_intra_pred_ang32_21_avx2;
+ p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2;
+
+ // all_angs primitives
+ p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_avx2;

// copy_sp primitives
p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;

p.pu[LUMA_64x48].luma_hps = x265_interp_8tap_horiz_ps_64x48_avx2;
p.pu[LUMA_64x32].luma_hps = x265_interp_8tap_horiz_ps_64x32_avx2;
p.pu[LUMA_64x16].luma_hps = x265_interp_8tap_horiz_ps_64x16_avx2;
+ p.pu[LUMA_12x16].luma_hps = x265_interp_8tap_horiz_ps_12x16_avx2;
+ p.pu[LUMA_24x32].luma_hps = x265_interp_8tap_horiz_ps_24x32_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = x265_interp_4tap_horiz_pp_6x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = x265_interp_4tap_horiz_pp_6x16_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2;

+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2;

p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2;

- if ((cpuMask & X265_CPU_BMI1) && (cpuMask & X265_CPU_BMI2))
- p.findPosLast = x265_findPosLast_x64;
+ //i422 for chroma_vss
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vss = x265_interp_4tap_vert_ss_2x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = x265_interp_4tap_vert_ss_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = x265_interp_4tap_vert_ss_8x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = x265_interp_4tap_vert_ss_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = x265_interp_4tap_vert_ss_8x12_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = x265_interp_4tap_vert_ss_6x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vss = x265_interp_4tap_vert_ss_2x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = x265_interp_4tap_vert_ss_16x24_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = x265_interp_4tap_vert_ss_12x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = x265_interp_4tap_vert_ss_4x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2;
+
+ //i444 for chroma_vss
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = x265_interp_4tap_vert_ss_64x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = x265_interp_4tap_vert_ss_16x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = x265_interp_4tap_vert_ss_16x12_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = x265_interp_4tap_vert_ss_12x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = x265_interp_4tap_vert_ss_16x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = x265_interp_4tap_vert_ss_32x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = x265_interp_4tap_vert_ss_64x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = x265_interp_4tap_vert_ss_64x48_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = x265_interp_4tap_vert_ss_48x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = x265_interp_4tap_vert_ss_64x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2;
+
+ p.pu[LUMA_16x16].luma_hvpp = x265_interp_8tap_hv_pp_16x16_avx2;
+
+ p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2;
+ p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2;
+ p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2;
+ p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2;
+ p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2;
+ p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2;
+ p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2;
+ p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2;
+ p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2;
+ p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2;
+ p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2;
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2;
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2;
+
+ //i422 for chroma_hpp
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = x265_interp_4tap_horiz_pp_12x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = x265_interp_4tap_horiz_pp_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = x265_interp_4tap_horiz_pp_8x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = x265_interp_4tap_horiz_pp_8x12_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = x265_interp_4tap_horiz_pp_16x24_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = x265_interp_4tap_horiz_pp_32x48_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hpp = x265_interp_4tap_horiz_pp_2x8_avx2;
+
+ //i444 filters hpp
+
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = x265_interp_4tap_horiz_pp_16x12_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = x265_interp_4tap_horiz_pp_16x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = x265_interp_4tap_horiz_pp_12x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = x265_interp_4tap_horiz_pp_24x32_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = x265_interp_4tap_horiz_pp_64x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = x265_interp_4tap_horiz_pp_64x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = x265_interp_4tap_horiz_pp_64x48_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = x265_interp_4tap_horiz_pp_64x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = x265_interp_4tap_horiz_pp_48x64_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = x265_interp_4tap_horiz_ps_8x64_avx2; //adding macro call
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = x265_interp_4tap_horiz_ps_8x12_avx2; //adding macro call
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2; //adding macro call
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = x265_interp_4tap_horiz_ps_16x24_avx2; //adding macro call
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = x265_interp_4tap_horiz_ps_32x48_avx2;
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hps = x265_interp_4tap_horiz_ps_2x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = x265_interp_4tap_horiz_ps_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hps = x265_interp_4tap_horiz_ps_2x16_avx2;
+
+ //i444 chroma_hps
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = x265_interp_4tap_horiz_ps_64x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = x265_interp_4tap_horiz_ps_64x48_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = x265_interp_4tap_horiz_ps_64x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = x265_interp_4tap_horiz_ps_64x64_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = x265_interp_4tap_horiz_ps_16x12_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = x265_interp_4tap_horiz_ps_48x64_avx2;
+
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2;
+
+ //i422 for chroma_vsp
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vsp = x265_interp_4tap_vert_sp_2x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = x265_interp_4tap_vert_sp_24x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = x265_interp_4tap_vert_sp_8x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = x265_interp_4tap_vert_sp_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = x265_interp_4tap_vert_sp_8x12_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = x265_interp_4tap_vert_sp_6x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vsp = x265_interp_4tap_vert_sp_2x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = x265_interp_4tap_vert_sp_16x24_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = x265_interp_4tap_vert_sp_12x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = x265_interp_4tap_vert_sp_4x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2;
+
+ //i444 for chroma_vsp
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = x265_interp_4tap_vert_sp_64x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = x265_interp_4tap_vert_sp_16x12_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = x265_interp_4tap_vert_sp_12x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = x265_interp_4tap_vert_sp_16x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = x265_interp_4tap_vert_sp_32x24_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = x265_interp_4tap_vert_sp_32x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = x265_interp_4tap_vert_sp_64x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = x265_interp_4tap_vert_sp_64x48_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = x265_interp_4tap_vert_sp_48x64_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = x265_interp_4tap_vert_sp_64x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2;
+
+ //i422 for chroma_vps
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vps = x265_interp_4tap_vert_ps_2x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = x265_interp_4tap_vert_ps_8x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = x265_interp_4tap_vert_ps_32x48_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = x265_interp_4tap_vert_ps_12x32_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = x265_interp_4tap_vert_ps_8x12_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2;
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = x265_interp_4tap_vert_ps_16x24_avx2;
+
+ //i444 for chroma_vps
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2;
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2;
1086
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2;
1087
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = x265_interp_4tap_vert_ps_16x12_avx2;
1088
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = x265_interp_4tap_vert_ps_12x16_avx2;
1089
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = x265_interp_4tap_vert_ps_16x4_avx2;
1090
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2;
1091
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = x265_interp_4tap_vert_ps_32x24_avx2;
1092
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = x265_interp_4tap_vert_ps_24x32_avx2;
1093
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = x265_interp_4tap_vert_ps_32x8_avx2;
1094
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2;
1095
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2;
1096
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2;
1097
+
1098
+ //i422 for chroma_vpp
1099
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
1100
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
1101
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2;
1102
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
1103
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_avx2;
1104
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
1105
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2;
1106
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2;
1107
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2;
1108
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2;
1109
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
1110
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
1111
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2;
1112
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2;
1113
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_avx2;
1114
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
1115
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = x265_interp_4tap_vert_pp_32x48_avx2;
1116
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = x265_interp_4tap_vert_pp_12x32_avx2;
1117
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_avx2;
1118
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2;
1119
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = x265_interp_4tap_vert_pp_16x24_avx2;
1120
+
1121
+ //i444 for chroma_vpp
1122
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2;
1123
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2;
1124
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2;
1125
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2;
1126
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2;
1127
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2;
1128
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2;
1129
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2;
1130
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2;
1131
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2;
1132
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = x265_interp_4tap_vert_pp_16x12_avx2;
1133
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = x265_interp_4tap_vert_pp_12x16_avx2;
1134
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = x265_interp_4tap_vert_pp_16x4_avx2;
1135
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2;
1136
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = x265_interp_4tap_vert_pp_32x24_avx2;
1137
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = x265_interp_4tap_vert_pp_24x32_avx2;
1138
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = x265_interp_4tap_vert_pp_32x8_avx2;
1139
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2;
1140
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2;
1141
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2;
1142
+
1143
+ if (cpuMask & X265_CPU_BMI2)
1144
+ p.scanPosLast = x265_scanPosLast_avx2_bmi2;
1145
}
1146
#endif
1147
}
1148
x265_1.6.tar.gz/source/common/x86/const-a.asm -> x265_1.7.tar.gz/source/common/x86/const-a.asm
Changed
SECTION_RODATA 32

-const pb_1, times 32 db 1
+;; 8-bit constants

-const hsub_mul, times 16 db 1, -1
-const pw_1, times 16 dw 1
-const pw_16, times 16 dw 16
-const pw_32, times 16 dw 32
-const pw_128, times 16 dw 128
-const pw_256, times 16 dw 256
-const pw_257, times 16 dw 257
-const pw_512, times 16 dw 512
-const pw_1023, times 8 dw 1023
-ALIGN 32
-const pw_1024, times 16 dw 1024
-const pw_4096, times 16 dw 4096
-const pw_00ff, times 16 dw 0x00ff
-ALIGN 32
-const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1)
-const deinterleave_shufd, dd 0,4,1,5,2,6,3,7
-const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
-const pb_unpackbd2, times 2 db 4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7
-const pb_unpackwq1, db 0,1,0,1,0,1,0,1,2,3,2,3,2,3,2,3
-const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7
-const pw_swap, times 2 db 6,7,4,5,2,3,0,1
+const pb_0, times 16 db 0
+const pb_1, times 32 db 1
+const pb_2, times 32 db 2
+const pb_3, times 16 db 3
+const pb_4, times 32 db 4
+const pb_8, times 32 db 8
+const pb_15, times 32 db 15
+const pb_16, times 32 db 16
+const pb_32, times 32 db 32
+const pb_64, times 32 db 64
+const pb_128, times 16 db 128
+const pb_a1, times 16 db 0xa1

-const pb_2, times 32 db 2
-const pb_4, times 32 db 4
-const pb_16, times 32 db 16
-const pb_64, times 32 db 64
-const pb_01, times 8 db 0,1
-const pb_0, times 16 db 0
-const pb_a1, times 16 db 0xa1
-const pb_3, times 16 db 3
-const pb_8, times 32 db 8
-const pb_32, times 32 db 32
-const pb_128, times 16 db 128
-const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6
+const pb_01, times 8 db 0, 1
+const hsub_mul, times 16 db 1, -1
+const pw_swap, times 2 db 6, 7, 4, 5, 2, 3, 0, 1
+const pb_unpackbd1, times 2 db 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
+const pb_unpackbd2, times 2 db 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7
+const pb_unpackwq1, times 1 db 0, 1, 0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3, 2, 3
+const pb_unpackwq2, times 1 db 4, 5, 4, 5, 4, 5, 4, 5, 6, 7, 6, 7, 6, 7, 6, 7
+const pb_shuf8x8c, times 1 db 0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6
+const pb_movemask, times 16 db 0x00
+ times 16 db 0xFF
+const pb_0000000000000F0F, times 2 db 0xff, 0x00
+ times 12 db 0x00
+const pb_000000000000000F, db 0xff
+ times 15 db 0x00

-const pw_0_15, times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
-const pw_2, times 8 dw 2
-const pw_m2, times 8 dw -2
-const pw_4, times 8 dw 4
-const pw_8, times 8 dw 8
-const pw_64, times 8 dw 64
-const pw_256, times 8 dw 256
-const pw_32_0, times 4 dw 32,
- times 4 dw 0
-const pw_2000, times 16 dw 0x2000
-const pw_8000, times 8 dw 0x8000
-const pw_3fff, times 8 dw 0x3fff
-const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1
-const pw_ppmmppmm, dw 1,1,-1,-1,1,1,-1,-1
-const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
-const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
-const pd_1, times 8 dd 1
-const pd_2, times 8 dd 2
-const pd_4, times 4 dd 4
-const pd_8, times 4 dd 8
-const pd_16, times 4 dd 16
-const pd_32, times 4 dd 32
-const pd_64, times 4 dd 64
-const pd_128, times 4 dd 128
-const pd_256, times 4 dd 256
-const pd_512, times 4 dd 512
-const pd_1024, times 4 dd 1024
-const pd_2048, times 4 dd 2048
-const pd_ffff, times 4 dd 0xffff
-const pd_32767, times 4 dd 32767
-const pd_n32768, times 4 dd 0xffff8000
-const pw_ff00, times 8 dw 0xff00
+;; 16-bit constants

-const multi_2Row, dw 1, 2, 3, 4, 1, 2, 3, 4
-const multiL, dw 1, 2, 3, 4, 5, 6, 7, 8
-const multiH, dw 9, 10, 11, 12, 13, 14, 15, 16
-const multiH2, dw 17, 18, 19, 20, 21, 22, 23, 24
-const multiH3, dw 25, 26, 27, 28, 29, 30, 31, 32
+const pw_1, times 16 dw 1
+const pw_2, times 8 dw 2
+const pw_m2, times 8 dw -2
+const pw_4, times 8 dw 4
+const pw_8, times 8 dw 8
+const pw_16, times 16 dw 16
+const pw_15, times 16 dw 15
+const pw_31, times 16 dw 31
+const pw_32, times 16 dw 32
+const pw_64, times 8 dw 64
+const pw_128, times 16 dw 128
+const pw_256, times 16 dw 256
+const pw_257, times 16 dw 257
+const pw_512, times 16 dw 512
+const pw_1023, times 8 dw 1023
+const pw_1024, times 16 dw 1024
+const pw_4096, times 16 dw 4096
+const pw_00ff, times 16 dw 0x00ff
+const pw_ff00, times 8 dw 0xff00
+const pw_2000, times 16 dw 0x2000
+const pw_8000, times 8 dw 0x8000
+const pw_3fff, times 8 dw 0x3fff
+const pw_32_0, times 4 dw 32,
+ times 4 dw 0
+const pw_pixel_max, times 16 dw ((1 << BIT_DEPTH)-1)
+
+const pw_0_15, times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
+const pw_ppppmmmm, times 1 dw 1, 1, 1, 1, -1, -1, -1, -1
+const pw_ppmmppmm, times 1 dw 1, 1, -1, -1, 1, 1, -1, -1
+const pw_pmpmpmpm, times 1 dw 1, -1, 1, -1, 1, -1, 1, -1
+const pw_pmmpzzzz, times 1 dw 1, -1, -1, 1, 0, 0, 0, 0
+const multi_2Row, times 1 dw 1, 2, 3, 4, 1, 2, 3, 4
+const multiH, times 1 dw 9, 10, 11, 12, 13, 14, 15, 16
+const multiH3, times 1 dw 25, 26, 27, 28, 29, 30, 31, 32
+const multiL, times 1 dw 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+const multiH2, times 1 dw 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
+const pw_planar16_mul, times 1 dw 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+const pw_planar32_mul, times 1 dw 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16
+const pw_FFFFFFFFFFFFFFF0, dw 0x00
+ times 7 dw 0xff
+
+
+;; 32-bit constants
+
+const pd_1, times 8 dd 1
+const pd_2, times 8 dd 2
+const pd_4, times 4 dd 4
+const pd_8, times 4 dd 8
+const pd_16, times 4 dd 16
+const pd_32, times 4 dd 32
+const pd_64, times 4 dd 64
+const pd_128, times 4 dd 128
+const pd_256, times 4 dd 256
+const pd_512, times 4 dd 512
+const pd_1024, times 4 dd 1024
+const pd_2048, times 4 dd 2048
+const pd_ffff, times 4 dd 0xffff
+const pd_32767, times 4 dd 32767
+const pd_n32768, times 4 dd 0xffff8000
+
+const trans8_shuf, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7
+const deinterleave_shufd, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7

const popcnt_table
%assign x 0
x265_1.6.tar.gz/source/common/x86/dct8.asm -> x265_1.7.tar.gz/source/common/x86/dct8.asm
Changed
times 2 dw 84, -29, -74, 55
times 2 dw 55, -84, 74, -29

+pw_dst4_tab: times 4 dw 29, 55, 74, 84
+ times 4 dw 74, 74, 0, -74
+ times 4 dw 84, -29, -74, 55
+ times 4 dw 55, -84, 74, -29
+
tab_idst4: times 4 dw 29, +84
times 4 dw +74, +55
times 4 dw 55, -29

times 4 dw 84, +55
times 4 dw -74, -29

+pw_idst4_tab: times 4 dw 29, 84
+ times 4 dw 55, -29
+ times 4 dw 74, 55
+ times 4 dw 74, -84
+ times 4 dw 74, -74
+ times 4 dw 84, 55
+ times 4 dw 0, 74
+ times 4 dw -74, -29
+pb_idst4_shuf: times 2 db 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15
+
tab_dct8_1: times 2 dw 89, 50, 75, 18
times 2 dw 75, -89, -18, -50
times 2 dw 50, 18, -89, 75

cextern pd_1024
cextern pd_2048
cextern pw_ppppmmmm
-
+cextern trans8_shuf
;------------------------------------------------------
;void dct4(const int16_t* src, int16_t* dst, intptr_t srcStride)
;------------------------------------------------------

RET

+;------------------------------------------------------------------
+;void dst4(const int16_t* src, int16_t* dst, intptr_t srcStride)
+;------------------------------------------------------------------
+INIT_YMM avx2
+cglobal dst4, 3, 4, 6
+%if BIT_DEPTH == 8
+ %define DST_SHIFT 1
+ vpbroadcastd m5, [pd_1]
+%elif BIT_DEPTH == 10
+ %define DST_SHIFT 3
+ vpbroadcastd m5, [pd_4]
+%endif
+ mova m4, [trans8_shuf]
+ add r2d, r2d
+ lea r3, [pw_dst4_tab]
+
+ movq xm0, [r0 + 0 * r2]
+ movhps xm0, [r0 + 1 * r2]
+ lea r0, [r0 + 2 * r2]
+ movq xm1, [r0]
+ movhps xm1, [r0 + r2]
+
+ vinserti128 m0, m0, xm1, 1 ; m0 = src[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+
+ pmaddwd m2, m0, [r3 + 0 * 32]
+ pmaddwd m1, m0, [r3 + 1 * 32]
+ phaddd m2, m1
+ paddd m2, m5
+ psrad m2, DST_SHIFT
+ pmaddwd m3, m0, [r3 + 2 * 32]
+ pmaddwd m1, m0, [r3 + 3 * 32]
+ phaddd m3, m1
+ paddd m3, m5
+ psrad m3, DST_SHIFT
+ packssdw m2, m3
+ vpermd m2, m4, m2
+
+ vpbroadcastd m5, [pd_128]
+ pmaddwd m0, m2, [r3 + 0 * 32]
+ pmaddwd m1, m2, [r3 + 1 * 32]
+ phaddd m0, m1
+ paddd m0, m5
+ psrad m0, 8
+ pmaddwd m3, m2, [r3 + 2 * 32]
+ pmaddwd m2, m2, [r3 + 3 * 32]
+ phaddd m3, m2
+ paddd m3, m5
+ psrad m3, 8
+ packssdw m0, m3
+ vpermd m0, m4, m0
+ movu [r1], m0
+ RET
+
;-------------------------------------------------------
;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
;-------------------------------------------------------

movhps [r1 + r2], m1
RET

+;-----------------------------------------------------------------
+;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
+;-----------------------------------------------------------------
+INIT_YMM avx2
+cglobal idst4, 3, 4, 6
+%if BIT_DEPTH == 8
+ vpbroadcastd m4, [pd_2048]
+ %define IDCT4_SHIFT 12
+%elif BIT_DEPTH == 10
+ vpbroadcastd m4, [pd_512]
+ %define IDCT4_SHIFT 10
+%else
+ %error Unsupported BIT_DEPTH!
+%endif
+ add r2d, r2d
+ lea r3, [pw_idst4_tab]
+
+ movu xm0, [r0 + 0 * 16]
+ movu xm1, [r0 + 1 * 16]
+
+ punpcklwd m2, m0, m1
+ punpckhwd m0, m1
+
+ vinserti128 m2, m2, xm2, 1
+ vinserti128 m0, m0, xm0, 1
+
+ vpbroadcastd m5, [pd_64]
+ pmaddwd m1, m2, [r3 + 0 * 32]
+ pmaddwd m3, m0, [r3 + 1 * 32]
+ paddd m1, m3
+ paddd m1, m5
+ psrad m1, 7
+ pmaddwd m3, m2, [r3 + 2 * 32]
+ pmaddwd m0, [r3 + 3 * 32]
+ paddd m3, m0
+ paddd m3, m5
+ psrad m3, 7
+
+ packssdw m0, m1, m3
+ pshufb m0, [pb_idst4_shuf]
+ vpermq m1, m0, 11101110b
+
+ punpcklwd m2, m0, m1
+ punpckhwd m0, m1
+ punpcklwd m1, m2, m0
+ punpckhwd m2, m0
+
+ vpermq m1, m1, 01000100b
+ vpermq m2, m2, 01000100b
+
+ pmaddwd m0, m1, [r3 + 0 * 32]
+ pmaddwd m3, m2, [r3 + 1 * 32]
+ paddd m0, m3
+ paddd m0, m4
+ psrad m0, IDCT4_SHIFT
+ pmaddwd m3, m1, [r3 + 2 * 32]
+ pmaddwd m2, m2, [r3 + 3 * 32]
+ paddd m3, m2
+ paddd m3, m4
+ psrad m3, IDCT4_SHIFT
+
+ packssdw m0, m3
+ pshufb m1, m0, [pb_idst4_shuf]
+ vpermq m0, m1, 11101110b
+
+ punpcklwd m2, m1, m0
+ movq [r1 + 0 * r2], xm2
+ movhps [r1 + 1 * r2], xm2
+
+ punpckhwd m1, m0
+ movq [r1 + 2 * r2], xm1
+ lea r1, [r1 + 2 * r2]
+ movhps [r1 + r2], xm1
+ RET
+
;-------------------------------------------------------
; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride)
;-------------------------------------------------------
x265_1.6.tar.gz/source/common/x86/dct8.h -> x265_1.7.tar.gz/source/common/x86/dct8.h
Changed
void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void x265_dst4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
void x265_dct8_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);

void x265_dct32_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);

void x265_idst4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
+void x265_idst4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
void x265_idct4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
void x265_idct4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
void x265_idct8_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
x265_1.6.tar.gz/source/common/x86/intrapred.h -> x265_1.7.tar.gz/source/common/x86/intrapred.h
Changed
void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
+void x265_intra_pred_dc32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);

void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);

void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
void x265_intra_pred_planar32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);

#define DECL_ANG(bsize, mode, cpu) \
void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);

DECL_ANG(4, 7, sse2);
DECL_ANG(4, 8, sse2);
DECL_ANG(4, 9, sse2);
+DECL_ANG(4, 10, sse2);
+DECL_ANG(4, 11, sse2);
+DECL_ANG(4, 12, sse2);
+DECL_ANG(4, 13, sse2);
+DECL_ANG(4, 14, sse2);
+DECL_ANG(4, 15, sse2);
+DECL_ANG(4, 16, sse2);
+DECL_ANG(4, 17, sse2);
+DECL_ANG(4, 18, sse2);
+DECL_ANG(4, 26, sse2);

DECL_ANG(4, 2, ssse3);
DECL_ANG(4, 3, sse4);

DECL_ANG(32, 33, sse4);

#undef DECL_ANG
+void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_17_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_19_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);

void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);

void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_all_angs_pred_32x32_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
+void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
#endif // ifndef X265_INTRAPRED_H
x265_1.6.tar.gz/source/common/x86/intrapred16.asm -> x265_1.7.tar.gz/source/common/x86/intrapred16.asm
Changed
%endrep
RET

+;-----------------------------------------------------------------------------------------
+; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
+;-----------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_ang4_2, 3,5,4
+ lea r4, [r2 + 4]
+ add r2, 20
+ cmp r3m, byte 34
+ cmove r2, r4
+
+ add r1, r1
+ movu m0, [r2]
+ movh [r0], m0
+ psrldq m0, 2
+ movh [r0 + r1], m0
+ psrldq m0, 2
+ movh [r0 + r1 * 2], m0
+ lea r1, [r1 * 3]
+ psrldq m0, 2
+ movh [r0 + r1], m0
+ RET
+
+cglobal intra_pred_ang4_3, 3,5,8
+ mov r4d, 2
+ cmp r3m, byte 33
+ mov r3d, 18
+ cmove r3d, r4d
+
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
+
+ mova m2, m0
+ psrldq m0, 2
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
+ mova m3, m0
+ psrldq m0, 2
+ punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2]
+ mova m4, m0
+ psrldq m0, 2
+ punpcklwd m4, m0 ; [7 6 6 5 5 4 4 3]
+ mova m5, m0
+ psrldq m0, 2
+ punpcklwd m5, m0 ; [8 7 7 6 6 5 5 4]
+
+
+ lea r3, [ang_table + 20 * 16]
+ mova m0, [r3 + 6 * 16] ; [26]
+ mova m1, [r3] ; [20]
+ mova m6, [r3 - 6 * 16] ; [14]
+ mova m7, [r3 - 12 * 16] ; [ 8]
+ jmp .do_filter4x4
+
+
+ALIGN 16
+.do_filter4x4:
+ lea r4, [pd_16]
+ pmaddwd m2, m0
+ paddd m2, [r4]
+ psrld m2, 5
+
+ pmaddwd m3, m1
+ paddd m3, [r4]
+ psrld m3, 5
+ packssdw m2, m3
+
+ pmaddwd m4, m6
+ paddd m4, [r4]
+ psrld m4, 5
+
+ pmaddwd m5, m7
+ paddd m5, [r4]
+ psrld m5, 5
+ packssdw m4, m5
+
+ jz .store
+
+ ; transpose 4x4
+ punpckhwd m0, m2, m4
+ punpcklwd m2, m4
+ punpckhwd m4, m2, m0
+ punpcklwd m2, m0
+
+.store:
+ add r1, r1
+ movh [r0], m2
+ movhps [r0 + r1], m2
+ movh [r0 + r1 * 2], m4
+ lea r1, [r1 * 3]
+ movhps [r0 + r1], m4
+ RET
+
+cglobal intra_pred_ang4_4, 3,5,8
+ mov r4d, 2
+ cmp r3m, byte 32
+ mov r3d, 18
+ cmove r3d, r4d
+
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
+ mova m2, m0
+ psrldq m0, 2
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
+ mova m3, m0
+ psrldq m0, 2
+ punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2]
+ mova m4, m3
+ mova m5, m0
+ psrldq m0, 2
+ punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3]
+
+ lea r3, [ang_table + 18 * 16]
+ mova m0, [r3 + 3 * 16] ; [21]
+ mova m1, [r3 - 8 * 16] ; [10]
+ mova m6, [r3 + 13 * 16] ; [31]
+ mova m7, [r3 + 2 * 16] ; [20]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_5, 3,5,8
+ mov r4d, 2
+ cmp r3m, byte 31
+ mov r3d, 18
+ cmove r3d, r4d
+
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
+ mova m2, m0
+ psrldq m0, 2
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
+ mova m3, m0
+ psrldq m0, 2
+ punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2]
+ mova m4, m3
+ mova m5, m0
+ psrldq m0, 2
+ punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3]
+
+ lea r3, [ang_table + 10 * 16]
+ mova m0, [r3 + 7 * 16] ; [17]
+ mova m1, [r3 - 8 * 16] ; [ 2]
+ mova m6, [r3 + 9 * 16] ; [19]
+ mova m7, [r3 - 6 * 16] ; [ 4]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_6, 3,5,8
+ mov r4d, 2
+ cmp r3m, byte 30
+ mov r3d, 18
+ cmove r3d, r4d
+
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
+ mova m2, m0
+ psrldq m0, 2
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
+ mova m3, m2
+ mova m4, m0
+ psrldq m0, 2
+ punpcklwd m4, m0 ; [6 5 5 4 4 3 3 2]
+ mova m5, m4
+
+ lea r3, [ang_table + 19 * 16]
+ mova m0, [r3 - 6 * 16] ; [13]
+ mova m1, [r3 + 7 * 16] ; [26]
+ mova m6, [r3 - 12 * 16] ; [ 7]
+ mova m7, [r3 + 1 * 16] ; [20]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
167
+
168
+cglobal intra_pred_ang4_7, 3,5,8
169
+ mov r4d, 2
170
+ cmp r3m, byte 29
171
+ mov r3d, 18
172
+ cmove r3d, r4d
173
+
174
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
175
+ mova m2, m0
176
+ psrldq m0, 2
177
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
178
+ mova m3, m2
179
+ mova m4, m2
180
+ mova m5, m0
181
+ psrldq m0, 2
182
+ punpcklwd m5, m0 ; [6 5 5 4 4 3 3 2]
183
+
184
+ lea r3, [ang_table + 20 * 16]
185
+ mova m0, [r3 - 11 * 16] ; [ 9]
186
+ mova m1, [r3 - 2 * 16] ; [18]
187
+ mova m6, [r3 + 7 * 16] ; [27]
188
+ mova m7, [r3 - 16 * 16] ; [ 4]
189
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
190
+
191
+cglobal intra_pred_ang4_8, 3,5,8
192
+ mov r4d, 2
193
+ cmp r3m, byte 28
194
+ mov r3d, 18
195
+ cmove r3d, r4d
196
+
197
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
198
+ mova m2, m0
199
+ psrldq m0, 2
200
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
201
+ mova m3, m2
202
+ mova m4, m2
203
+ mova m5, m2
204
+
205
+ lea r3, [ang_table + 13 * 16]
206
+ mova m0, [r3 - 8 * 16] ; [ 5]
207
+ mova m1, [r3 - 3 * 16] ; [10]
208
+ mova m6, [r3 + 2 * 16] ; [15]
209
+ mova m7, [r3 + 7 * 16] ; [20]
210
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
211
+
212
+cglobal intra_pred_ang4_9, 3,5,8
213
+ mov r4d, 2
214
+ cmp r3m, byte 27
215
+ mov r3d, 18
216
+ cmove r3d, r4d
217
+
218
+ movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
219
+ mova m2, m0
220
+ psrldq m0, 2
221
+ punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1]
222
+ mova m3, m2
223
+ mova m4, m2
224
+ mova m5, m2
225
+
226
+ lea r3, [ang_table + 4 * 16]
227
+ mova m0, [r3 - 2 * 16] ; [ 2]
228
+ mova m1, [r3 - 0 * 16] ; [ 4]
229
+ mova m6, [r3 + 2 * 16] ; [ 6]
230
+ mova m7, [r3 + 4 * 16] ; [ 8]
231
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
232
+
233
+cglobal intra_pred_ang4_10, 3,3,3
234
+ movh m0, [r2 + 18] ; [4 3 2 1]
235
+
236
+ punpcklwd m0, m0 ;[4 4 3 3 2 2 1 1]
237
+ pshufd m1, m0, 0xFA
238
+ add r1, r1
239
+ pshufd m0, m0, 0x50
240
+ movhps [r0 + r1], m0
241
+ movh [r0 + r1 * 2], m1
242
+ lea r1, [r1 * 3]
243
+ movhps [r0 + r1], m1
244
+
245
+ cmp r4m, byte 0
246
+ jz .quit
247
+
248
+ ; filter
249
+ movd m2, [r2] ; [7 6 5 4 3 2 1 0]
250
+ pshuflw m2, m2, 0x00
251
+ movh m1, [r2 + 2]
252
+ psubw m1, m2
253
+ psraw m1, 1
254
+ paddw m0, m1
255
+ pxor m1, m1
256
+ pmaxsw m0, m1
257
+ pminsw m0, [pw_1023]
258
+.quit:
259
+ movh [r0], m0
260
+ RET
261
+
262
+cglobal intra_pred_ang4_26, 3,3,3
263
+ movh m0, [r2 + 2] ; [8 7 6 5 4 3 2 1]
264
+ add r1d, r1d
265
+ ; store
266
+ movh [r0], m0
267
+ movh [r0 + r1], m0
268
+ movh [r0 + r1 * 2], m0
269
+ lea r3, [r1 * 3]
270
+ movh [r0 + r3], m0
271
+
272
+ ; filter
273
+ cmp r4m, byte 0
274
+ jz .quit
275
+
276
+ pshuflw m0, m0, 0x00
277
+ movd m2, [r2]
278
+ pshuflw m2, m2, 0x00
279
+ movh m1, [r2 + 18]
280
+ psubw m1, m2
281
+ psraw m1, 1
282
+ paddw m0, m1
283
+ pxor m1, m1
284
+ pmaxsw m0, m1
285
+ pminsw m0, [pw_1023]
286
+
287
+ movh r2, m0
288
+ mov [r0], r2w
289
+ shr r2, 16
290
+ mov [r0 + r1], r2w
291
+ shr r2, 16
292
+ mov [r0 + r1 * 2], r2w
293
+ shr r2, 16
294
+ mov [r0 + r3], r2w
295
+.quit:
296
+ RET
297
+
298
+cglobal intra_pred_ang4_11, 3,5,8
299
+ xor r4d, r4d
300
+ cmp r3m, byte 25
301
+ mov r3d, 16
302
+ cmove r3d, r4d
303
+
304
+ movh m1, [r2 + r3 + 2] ; [x x x 4 3 2 1 0]
305
+ movh m2, [r2 - 6]
306
+ punpcklqdq m2, m1
307
+ psrldq m2, 6
308
+ punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0]
309
+ mova m3, m2
310
+ mova m4, m2
311
+ mova m5, m2
312
+
313
+ lea r3, [ang_table + 24 * 16]
314
+ mova m0, [r3 + 6 * 16] ; [24]
315
+ mova m1, [r3 + 4 * 16] ; [26]
316
+ mova m6, [r3 + 2 * 16] ; [28]
317
+ mova m7, [r3 + 0 * 16] ; [30]
318
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
319
+
320
+cglobal intra_pred_ang4_12, 3,5,8
321
+ xor r4d, r4d
322
+ cmp r3m, byte 24
323
+ mov r3d, 16
324
+ cmove r3d, r4d
325
+
326
+ movh m1, [r2 + r3 + 2]
327
+ movh m2, [r2 - 6]
328
+ punpcklqdq m2, m1
329
+ psrldq m2, 6
330
+ punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0]
331
+ mova m3, m2
332
+ mova m4, m2
333
+ mova m5, m2
334
+
335
+ lea r3, [ang_table + 20 * 16]
336
+ mova m0, [r3 + 7 * 16] ; [27]
337
+ mova m1, [r3 + 2 * 16] ; [22]
338
+ mova m6, [r3 - 3 * 16] ; [17]
339
+ mova m7, [r3 - 8 * 16] ; [12]
340
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
341
+
342
+cglobal intra_pred_ang4_13, 3,5,8
343
+ xor r4d, r4d
344
+ cmp r3m, byte 23
345
+ mov r3d, 16
346
+ jz .next
347
+ xchg r3d, r4d
348
+.next:
349
+ movd m5, [r2 + r3 + 6]
350
+ movd m2, [r2 - 2]
351
+ movh m0, [r2 + r4 + 2]
352
+ punpcklwd m5, m2
353
+ punpcklqdq m5, m0
354
+ psrldq m5, 4
355
+ mova m2, m5
356
+ psrldq m2, 2
357
+ punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x]
358
+ punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0]
359
+ mova m3, m2
360
+ mova m4, m2
361
+
362
+ lea r3, [ang_table + 21 * 16]
363
+ mova m0, [r3 + 2 * 16] ; [23]
364
+ mova m1, [r3 - 7 * 16] ; [14]
365
+ mova m6, [r3 - 16 * 16] ; [ 5]
366
+ mova m7, [r3 + 7 * 16] ; [28]
367
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
368
+
369
+cglobal intra_pred_ang4_14, 3,5,8
370
+ xor r4d, r4d
371
+ cmp r3m, byte 22
372
+ mov r3d, 16
373
+ jz .next
374
+ xchg r3d, r4d
375
+.next:
376
+ movd m5, [r2 + r3 + 2]
377
+ movd m2, [r2 - 2]
378
+ movh m0, [r2 + r4 + 2]
379
+ punpcklwd m5, m2
380
+ punpcklqdq m5, m0
381
+ psrldq m5, 4
382
+ mova m2, m5
383
+ psrldq m2, 2
384
+ punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x]
385
+ punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0]
386
+ mova m3, m2
387
+ mova m4, m5
388
+
389
+ lea r3, [ang_table + 19 * 16]
390
+ mova m0, [r3 + 0 * 16] ; [19]
391
+ mova m1, [r3 - 13 * 16] ; [ 6]
392
+ mova m6, [r3 + 6 * 16] ; [25]
393
+ mova m7, [r3 - 7 * 16] ; [12]
394
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
395
+
396
+cglobal intra_pred_ang4_15, 3,5,8
397
+ xor r4d, r4d
398
+ cmp r3m, byte 21
399
+ mov r3d, 16
400
+ jz .next
401
+ xchg r3d, r4d
402
+.next:
403
+ movd m4, [r2] ;[x x x A]
404
+ movh m5, [r2 + r3 + 4] ;[x C x B]
405
+ movh m0, [r2 + r4 + 2] ;[4 3 2 1]
406
+ pshuflw m5, m5, 0x22 ;[B C B C]
407
+ punpcklqdq m5, m4 ;[x x x A B C B C]
408
+ psrldq m5, 2 ;[x x x x A B C B]
409
+ punpcklqdq m5, m0
410
+ psrldq m5, 2
411
+ mova m2, m5
412
+ mova m3, m5
413
+ psrldq m2, 4
414
+ psrldq m3, 2
415
+ punpcklwd m5, m3 ; [2 1 1 0 0 x x y]
416
+ punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x]
417
+ punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0]
418
+ mova m4, m3
419
+
420
+ lea r3, [ang_table + 23 * 16]
421
+ mova m0, [r3 - 8 * 16] ; [15]
422
+ mova m1, [r3 + 7 * 16] ; [30]
423
+ mova m6, [r3 - 10 * 16] ; [13]
424
+ mova m7, [r3 + 5 * 16] ; [28]
425
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
426
+
427
+cglobal intra_pred_ang4_16, 3,5,8
428
+ xor r4d, r4d
429
+ cmp r3m, byte 20
430
+ mov r3d, 16
431
+ jz .next
432
+ xchg r3d, r4d
433
+.next:
434
+ movd m4, [r2] ;[x x x A]
435
+ movd m5, [r2 + r3 + 4] ;[x x C B]
436
+ movh m0, [r2 + r4 + 2] ;[4 3 2 1]
437
+ punpcklwd m5, m4 ;[x C A B]
438
+ pshuflw m5, m5, 0x4A ;[A B C C]
439
+ punpcklqdq m5, m0 ;[4 3 2 1 A B C C]
440
+ psrldq m5, 2
441
+ mova m2, m5
442
+ mova m3, m5
443
+ psrldq m2, 4
444
+ psrldq m3, 2
445
+ punpcklwd m5, m3 ; [2 1 1 0 0 x x y]
446
+ punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x]
447
+ punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0]
448
+ mova m4, m3
449
+
450
+ lea r3, [ang_table + 19 * 16]
451
+ mova m0, [r3 - 8 * 16] ; [11]
452
+ mova m1, [r3 + 3 * 16] ; [22]
453
+ mova m6, [r3 - 18 * 16] ; [ 1]
454
+ mova m7, [r3 - 7 * 16] ; [12]
455
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
456
+
457
+cglobal intra_pred_ang4_17, 3,5,8
458
+ xor r4d, r4d
459
+ cmp r3m, byte 19
460
+ mov r3d, 16
461
+ jz .next
462
+ xchg r3d, r4d
463
+.next:
464
+ movd m4, [r2]
465
+ movh m5, [r2 + r3 + 2] ;[D x C B]
466
+ pshuflw m5, m5, 0x1F ;[B C D D]
467
+ punpcklqdq m5, m4 ;[x x x A B C D D]
468
+ psrldq m5, 2 ;[x x x x A B C D]
469
+ movhps m5, [r2 + r4 + 2]
470
+
471
+ mova m4, m5
472
+ psrldq m4, 2
473
+ punpcklwd m5, m4
474
+ mova m3, m4
475
+ psrldq m3, 2
476
+ punpcklwd m4, m3
477
+ mova m2, m3
478
+ psrldq m2, 2
479
+ punpcklwd m3, m2
480
+ mova m1, m2
481
+ psrldq m1, 2
482
+ punpcklwd m2, m1
483
+
484
+ lea r3, [ang_table + 14 * 16]
485
+ mova m0, [r3 - 8 * 16] ; [ 6]
486
+ mova m1, [r3 - 2 * 16] ; [12]
487
+ mova m6, [r3 + 4 * 16] ; [18]
488
+ mova m7, [r3 + 10 * 16] ; [24]
489
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
490
+
491
+cglobal intra_pred_ang4_18, 3,3,1
492
+ movh m0, [r2 + 16]
493
+ pinsrw m0, [r2], 0
494
+ pshuflw m0, m0, q0123
495
+ movhps m0, [r2 + 2]
496
+ add r1, r1
497
+ lea r2, [r1 * 3]
498
+ movh [r0 + r2], m0
499
+ psrldq m0, 2
500
+ movh [r0 + r1 * 2], m0
501
+ psrldq m0, 2
502
+ movh [r0 + r1], m0
503
+ psrldq m0, 2
504
+ movh [r0], m0
505
+ RET
506
+
507
;-----------------------------------------------------------------------------------
508
; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
509
;-----------------------------------------------------------------------------------
510
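All of the SSE2 `intra_pred_ang4_*` kernels above implement the same HEVC angular interpolation: each destination pixel blends two neighbouring reference samples with 5-bit fractional weights, which is exactly what the `pmaddwd` / `paddd m, [pd_16]` / `psrld m, 5` sequence computes. A scalar sketch (the function and buffer layout here are illustrative, not x265's exact API):

```python
def intra_pred_ang4(ref, mode_angle):
    """Scalar sketch of 4x4 angular intra prediction.

    ref: flat list of reference samples (ref[0] aligned with the block edge).
    mode_angle: the HEVC intraPredAngle for the mode (e.g. 26 for mode 3,
    whose per-row weights 26/20/14/8 match the [26],[20],[14],[8] constants
    loaded from ang_table above).  Each pixel is:
        pred = ((32 - frac) * ref[i] + frac * ref[i+1] + 16) >> 5
    """
    dst = [[0] * 4 for _ in range(4)]
    for y in range(4):
        pos = (y + 1) * mode_angle        # fixed-point position, 1/32 units
        idx, frac = pos >> 5, pos & 31    # integer / fractional offsets
        for x in range(4):
            a, b = ref[idx + x], ref[idx + x + 1]
            dst[y][x] = ((32 - frac) * a + frac * b + 16) >> 5
    return dst
```

With `mode_angle = 32` (the pure-diagonal modes 2 and 34) `frac` is always 0 and the kernel degenerates to the plain row shifts seen in `intra_pred_ang4_2`.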
x265_1.6.tar.gz/source/common/x86/intrapred8.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8.asm
Changed
4717
1
2
SECTION_RODATA 32
3
4
intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
5
+intra_pred_shuff_15_0: times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
6
7
pb_0_8 times 8 db 0, 8
8
pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8
9
10
c_mode16_18: db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
11
12
ALIGN 32
13
-trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
14
c_ang8_src1_9_2_10: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
15
c_ang8_26_20: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
16
c_ang8_src3_11_4_12: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
17
18
db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
19
db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
20
21
+ALIGN 32
22
+c_ang16_mode_11: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
23
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
24
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
25
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
26
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
27
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
28
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
29
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
30
+
31
+
32
+ALIGN 32
33
+c_ang16_mode_12: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
34
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
35
+ db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
36
+ db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
37
+ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
38
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
39
+ db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
40
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
41
+
42
+
43
+ALIGN 32
44
+c_ang16_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
45
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
46
+ db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
47
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
48
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
49
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
50
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
51
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
52
53
ALIGN 32
54
c_ang16_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
55
56
db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
57
db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
58
59
+ALIGN 32
60
+c_ang16_mode_9: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
61
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
62
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
63
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
64
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
65
+ db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
66
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
67
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
68
69
ALIGN 32
70
c_ang16_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
71
72
ALIGN 32
73
intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
74
75
+ALIGN 32
76
+c_ang16_mode_8: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
77
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
78
+ db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
79
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
80
+ db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
81
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
82
+ db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
83
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
84
85
ALIGN 32
86
c_ang16_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
87
88
db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
89
db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
90
91
+ALIGN 32
92
+c_ang16_mode_7: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
93
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
94
+ db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
95
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
96
+ db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
97
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
98
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
99
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
100
101
ALIGN 32
102
c_ang16_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
103
104
db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
105
106
107
+
108
+ALIGN 32
109
+c_ang16_mode_6: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
110
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
111
+ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
112
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
113
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
114
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
115
+ db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
116
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
117
+
118
ALIGN 32
119
c_ang16_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
120
db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
121
122
db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
123
db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
124
125
+
126
+ALIGN 32
127
+c_ang16_mode_5: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
128
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
129
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
130
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
131
+ db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
132
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
133
+ db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
134
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
135
+
136
ALIGN 32
137
c_ang16_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
138
db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
139
140
db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
141
142
ALIGN 32
143
+c_ang16_mode_4: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
144
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
145
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
146
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
147
+ db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
148
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
149
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
150
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
151
+
152
+ALIGN 32
153
c_ang16_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
154
db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
155
db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
156
157
db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
158
159
ALIGN 32
160
+c_ang16_mode_3: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
161
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
162
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
163
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
164
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
165
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
166
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
167
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
168
+
169
+ALIGN 32
170
c_ang16_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
171
db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
172
db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
173
174
db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
175
db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
176
177
+
178
+ALIGN 32
179
+c_ang32_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
180
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
181
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
182
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
183
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
184
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
185
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
186
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
187
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
188
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
189
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
190
+ db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
191
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
192
+ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
193
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
194
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
195
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
196
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
197
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
198
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
199
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
200
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
201
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
202
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
203
+ db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
204
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
205
+ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
206
+
207
+
208
+
209
+ALIGN 32
210
+c_ang32_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
211
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
212
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
213
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
214
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
215
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
216
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
217
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
218
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
219
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
220
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
221
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
222
+ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
223
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
224
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
225
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+
+ALIGN 32
+c_ang32_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+ db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+ db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+ db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+ db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang32_mode_23: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+ db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+ db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+ db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+ db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+ db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+ db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+ db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+ db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+ db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang32_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+ db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+ db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+ db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+ db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+ db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+ db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+ db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+ db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+ db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+ db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+c_ang32_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+ db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+ db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+ db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+ db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+ db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5
+ db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+ db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+ db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+ db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+ db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+ db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+ db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+ db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+ db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+intra_pred_shuff_0_4: times 4 db 0, 1, 1, 2, 2, 3, 3, 4
+intra_pred4_shuff1: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5
+intra_pred4_shuff2: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+intra_pred4_shuff31: db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+intra_pred4_shuff33: db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+intra_pred4_shuff3: db 8, 9, 9, 10, 10, 11, 11, 12, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff4: db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff5: db 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14, 11, 12, 12, 13, 13, 14, 14, 15
+intra_pred4_shuff6: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14, 10, 11, 11, 12, 12, 13, 13, 14
+intra_pred4_shuff7: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 10, 11, 11, 12, 12, 13, 13, 14
+intra_pred4_shuff9: db 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13, 9, 10, 10, 11, 11, 12, 12, 13
+intra_pred4_shuff12: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12
+intra_pred4_shuff13: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11
+intra_pred4_shuff14: db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11
+intra_pred4_shuff15: db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10
+intra_pred4_shuff16: db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10
+intra_pred4_shuff17: db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9
+intra_pred4_shuff19: db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1
+intra_pred4_shuff20: db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2
+intra_pred4_shuff21: db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2
+intra_pred4_shuff22: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3
+intra_pred4_shuff23: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3
+
+c_ang4_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_5: db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_6: db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_7: db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_8: db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang4_mode_9: db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang4_mode_11: db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_12: db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_14: db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_15: db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang4_mode_16: db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_17: db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_19: db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+c_ang4_mode_20: db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_23: db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang4_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+c_ang4_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+
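; Each row of the c_ang tables above stores byte pairs (32 - fract, fract) so that
; pmaddubsw can blend two neighbouring reference pixels in one multiply-accumulate.
; A sketch of how a 4x4 table can be derived from the HEVC intraPredAngle values
; (the ANGLES map and function name below are illustrative, not part of x265):

```python
# Illustrative generator for the per-row (32 - fract, fract) weight pairs that
# the c_ang4_mode_NN tables store for the vertical angular modes.

# HEVC intraPredAngle values for the positive vertical modes 27..33
ANGLES = {27: 2, 28: 5, 29: 9, 30: 13, 31: 17, 32: 21, 33: 26}

def ang4_weight_pairs(mode):
    """For each row y of a 4x4 block, fract = (y * angle) & 31 and the
    stored pair is (32 - fract, fract), matching pmaddubsw operand order."""
    angle = ANGLES[mode]
    return [(32 - ((y * angle) & 31), (y * angle) & 31) for y in range(1, 5)]

print(ang4_weight_pairs(27))  # [(30, 2), (28, 4), (26, 6), (24, 8)]
```

; The first result reproduces the four repeated pairs of c_ang4_mode_27 above.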
ALIGN 32
;; (blkSize - 1 - x)
pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0
pw_planar32_L: dw 31, 30, 29, 28, 27, 26, 25, 24
pw_planar32_H: dw 23, 22, 21, 20, 19, 18, 17, 16
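; The pw_planar* constants hold the (blkSize - 1 - x) horizontal weights of the
; HEVC planar predictor. A scalar sketch of the computation these word weights
; vectorize, assuming the standard planar formula (helper name is illustrative):

```python
# Scalar model of HEVC planar prediction for an nb x nb block; for nb = 4 the
# per-column weights (nb - 1 - x) are exactly the pw_planar4_0 table: 3, 2, 1, 0.

def planar_pred(above, left, nb):
    """above/left each hold nb + 1 neighbour samples; index nb is the
    top-right / bottom-left sample used as the far interpolation anchor."""
    shift = nb.bit_length()  # log2(nb) + 1 for power-of-two nb
    out = []
    for y in range(nb):
        row = []
        for x in range(nb):
            h = (nb - 1 - x) * left[y] + (x + 1) * above[nb]   # horizontal ramp
            v = (nb - 1 - y) * above[x] + (y + 1) * left[nb]   # vertical ramp
            row.append((h + v + nb) >> shift)                  # rounded average
        out.append(row)
    return out
```

; With flat neighbours the predictor reproduces the flat value, which is a quick
; sanity check for the weight layout.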
+ALIGN 32
+c_ang8_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+ db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+ db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_14: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+ db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+ db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+ db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_15: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+ db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+ db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang8_mode_20: db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+ db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+ db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+ db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
const ang_table
%assign x 0
cextern pw_4
cextern pw_8
cextern pw_16
+cextern pw_15
+cextern pw_31
cextern pw_32
cextern pw_257
+cextern pw_512
cextern pw_1024
cextern pw_4096
cextern pw_00ff
cextern multiH2
cextern multiH3
cextern multi_2Row
+cextern trans8_shuf
+cextern pw_planar16_mul
+cextern pw_planar32_mul
;---------------------------------------------------------------------------------------------
; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
;-----------------------------------------------------------------------------------------
INIT_XMM sse2
-cglobal intra_pred_ang4_2, 3,5,3
+cglobal intra_pred_ang4_2, 3,5,1
lea r4, [r2 + 2]
add r2, 10
cmp r3m, byte 34
movh m0, [r2]
movd [r0], m0
- mova m1, m0
- psrldq m1, 1
- movd [r0 + r1], m1
- mova m2, m0
- psrldq m2, 2
- movd [r0 + r1 * 2], m2
+ psrldq m0, 1
+ movd [r0 + r1], m0
+ psrldq m0, 1
+ movd [r0 + r1 * 2], m0
lea r1, [r1 * 3]
- psrldq m0, 3
+ psrldq m0, 1
movd [r0 + r1], m0
RET
INIT_XMM sse2
cglobal intra_pred_ang4_3, 3,5,8
- mov r4, 1
+ mov r4d, 1
cmp r3m, byte 33
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
mova m1, m0
ALIGN 16
.do_filter4x4:
pxor m1, m1
- pxor m3, m3
punpckhbw m3, m0
psrlw m3, 8
pmaddwd m3, m5
packssdw m0, m3
paddw m0, [pw_16]
psraw m0, 5
- pxor m3, m3
punpckhbw m3, m2
psrlw m3, 8
pmaddwd m3, m7
.store:
packuswb m0, m2
movd [r0], m0
- pshufd m0, m0, 0x39
+ psrldq m0, 4
movd [r0 + r1], m0
- pshufd m0, m0, 0x39
+ psrldq m0, 4
movd [r0 + r1 * 2], m0
lea r1, [r1 * 3]
- pshufd m0, m0, 0x39
+ psrldq m0, 4
movd [r0 + r1], m0
RET
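; The mode-specific entry points below all prepare interleaved source registers
; plus weight vectors and jump into .do_filter4x4 above, which evaluates the
; standard two-tap HEVC angular interpolation. In scalar form (illustrative
; helper, not x265 API):

```python
# Scalar model of the per-pixel arithmetic done by .do_filter4x4:
#   dst = (ref[i] * (32 - fract) + ref[i + 1] * fract + 16) >> 5

def angular_filter(ref, fract):
    """Interpolate between each pair of consecutive reference samples
    with 1/32-pel weights and round-to-nearest via the +16 bias."""
    return [(ref[i] * (32 - fract) + ref[i + 1] * fract + 16) >> 5
            for i in range(len(ref) - 1)]

print(angular_filter([10, 20, 30, 40, 50], 8))  # [13, 23, 33, 43]
```

; fract = 0 degenerates to a plain copy, which is why the tables end rows with
; the pair (32, 0).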
cglobal intra_pred_ang4_4, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 32
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
+ punpcklbw m0, m0
+ psrldq m0, 1
+ mova m2, m0
+ psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
- mova m1, m0
- psrldq m1, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
- mova m3, m0
- psrldq m3, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3]
- punpcklqdq m0, m1
- punpcklqdq m2, m1, m3
+ psrldq m1, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3]
+ punpcklqdq m0, m2
+ punpcklqdq m2, m1
lea r3, [pw_ang_table + 18 * 16]
mova m4, [r3 + 3 * 16] ; [21]
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
cglobal intra_pred_ang4_5, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 31
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
- mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
- mova m1, m0
- psrldq m1, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+ punpcklbw m0, m0 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+ psrldq m0, 1
+ mova m2, m0
+ psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
mova m3, m0
psrldq m3, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3]
- punpcklqdq m0, m1
- punpcklqdq m2, m1, m3
+ punpcklqdq m0, m2
+ punpcklqdq m2, m3
lea r3, [pw_ang_table + 10 * 16]
mova m4, [r3 + 7 * 16] ; [17]
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
cglobal intra_pred_ang4_6, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 30
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
- mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+ punpcklbw m0, m0
+ psrldq m0, 1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
mova m2, m0
- psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
+ psrldq m2, 2 ; [x x x 8 8 7 7 6 6 5 5 4 4 3 3 2]
punpcklqdq m0, m0
punpcklqdq m2, m2
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
cglobal intra_pred_ang4_7, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 29
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
- mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
- mova m3, m0
- psrldq m3, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
- punpcklqdq m2, m0, m3
+ punpcklbw m0, m0
+ psrldq m0, 1
+ mova m2, m0
+ psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2]
punpcklqdq m0, m0
+ punpcklqdq m2, m2
+ movhlps m2, m0
lea r3, [pw_ang_table + 20 * 16]
mova m4, [r3 - 11 * 16] ; [ 9]
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
cglobal intra_pred_ang4_8, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 28
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
- mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+ punpcklbw m0, m0
+ psrldq m0, 1
punpcklqdq m0, m0
mova m2, m0
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
cglobal intra_pred_ang4_9, 3,5,8
- xor r4, r4
- inc r4
+ xor r4d, r4d
+ inc r4d
cmp r3m, byte 27
- mov r3, 9
- cmove r3, r4
+ mov r3d, 9
+ cmove r3d, r4d
movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1]
- mova m1, m0
- psrldq m1, 1 ; [x 8 7 6 5 4 3 2]
- punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1]
+ punpcklbw m0, m0
+ psrldq m0, 1 ; [x 8 7 6 5 4 3 2]
punpcklqdq m0, m0
mova m2, m0
mova m7, [r3 + 4 * 16] ; [ 8]
jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+cglobal intra_pred_ang4_10, 3,5,4
+ movd m0, [r2 + 9] ; [8 7 6 5 4 3 2 1]
+ punpcklbw m0, m0
+ punpcklwd m0, m0
+ pshufd m1, m0, 1
+ movhlps m2, m0
+ pshufd m3, m0, 3
+ movd [r0 + r1], m1
+ movd [r0 + r1 * 2], m2
+ lea r1, [r1 * 3]
+ movd [r0 + r1], m3
+ cmp r4m, byte 0
+ jz .quit
+
+ ; filter
+ pxor m3, m3
+ punpcklbw m0, m3
+ movh m1, [r2] ; [4 3 2 1 0]
+ punpcklbw m1, m3
+ pshuflw m2, m1, 0x00
+ psrldq m1, 2
+ psubw m1, m2
+ psraw m1, 1
+ paddw m0, m1
+ packuswb m0, m0
+
+.quit:
+ movd [r0], m0
+ RET
+
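; intra_pred_ang4_10 above is pure horizontal prediction with an optional DC-style
; edge filter on the first row. A scalar sketch of the same behaviour, assuming
; the x265 srcPix layout srcPix[0] = top-left, srcPix[1..4] = above row,
; srcPix[9..12] = left column (the helper below is hypothetical, not x265 API):

```python
# Scalar model of mode-10 (horizontal) 4x4 prediction with bFilter semantics.

def clip8(v):
    """Clamp to the 8-bit pixel range, like packuswb does."""
    return max(0, min(255, v))

def pred_ang4_10(src, bFilter):
    above, left = src[1:5], src[9:13]
    dst = [[left[y]] * 4 for y in range(4)]   # each row copies its left sample
    if bFilter:                               # edge filter applied to row 0 only
        top_left = src[0]
        dst[0] = [clip8(left[0] + ((above[x] - top_left) >> 1)) for x in range(4)]
    return dst
```

; The filter matches the asm sequence: broadcast topLeft, subtract it from the
; above row, arithmetic shift right by one (psraw), add to left[0], saturate.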
+cglobal intra_pred_ang4_26, 3,4,4
+ movd m0, [r2 + 1] ; [8 7 6 5 4 3 2 1]
+
+ ; store
+ movd [r0], m0
+ movd [r0 + r1], m0
+ movd [r0 + r1 * 2], m0
+ lea r3, [r1 * 3]
+ movd [r0 + r3], m0
+
+ ; filter
+ cmp r4m, byte 0
+ jz .quit
+
+ pxor m3, m3
+ punpcklbw m0, m3
+ pshuflw m0, m0, 0x00
+ movd m2, [r2]
+ punpcklbw m2, m3
+ pshuflw m2, m2, 0x00
+ movd m1, [r2 + 9]
+ punpcklbw m1, m3
+ psubw m1, m2
+ psraw m1, 1
+ paddw m0, m1
+ packuswb m0, m0
+
+ movd r2, m0
+ mov [r0], r2b
+ shr r2, 8
+ mov [r0 + r1], r2b
+ shr r2, 8
+ mov [r0 + r1 * 2], r2b
+ shr r2, 8
+ mov [r0 + r3], r2b
+
+.quit:
+ RET
+
+cglobal intra_pred_ang4_11, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 25
+ mov r3d, 8
+ cmove r3d, r4d
+
+ movd m1, [r2 + r3 + 1] ;[4 3 2 1]
+ movh m0, [r2 - 7] ;[A x x x x x x x]
+ punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1]
+ punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x]]
740
+ psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A]
+ punpcklqdq m0, m0
+ mova m2, m0
+
+ lea r3, [pw_ang_table + 24 * 16]
+
+ mova m4, [r3 + 6 * 16] ; [24]
+ mova m5, [r3 + 4 * 16] ; [26]
+ mova m6, [r3 + 2 * 16] ; [28]
+ mova m7, [r3 + 0 * 16] ; [30]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_12, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 24
+ mov r3d, 8
+ cmove r3d, r4d
+
+ movd m1, [r2 + r3 + 1]
+ movh m0, [r2 - 7]
+ punpcklbw m1, m1
+ punpcklqdq m0, m1
+ psrldq m0, 7
+ punpcklqdq m0, m0
+ mova m2, m0
+
+ lea r3, [pw_ang_table + 20 * 16]
+ mova m4, [r3 + 7 * 16] ; [27]
+ mova m5, [r3 + 2 * 16] ; [22]
+ mova m6, [r3 - 3 * 16] ; [17]
+ mova m7, [r3 - 8 * 16] ; [12]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_13, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 23
+ mov r3d, 8
+ jz .next
+ xchg r3d, r4d
+
+.next:
+ movd m1, [r2 - 1] ;[x x A x]
+ movd m2, [r2 + r4 + 1] ;[4 3 2 1]
+ movd m0, [r2 + r3 + 3] ;[x x B x]
+ punpcklbw m0, m1 ;[x x x x A B x x]
+ punpckldq m0, m2 ;[4 3 2 1 A B x x]
+ psrldq m0, 2 ;[x x 4 3 2 1 A B]
+ punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B]
+ mova m1, m0
+ psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A]
+ psrldq m1, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B]
+ movh m2, m0
+ punpcklqdq m0, m0
+ punpcklqdq m2, m1
+
+ lea r3, [pw_ang_table + 21 * 16]
+ mova m4, [r3 + 2 * 16] ; [23]
+ mova m5, [r3 - 7 * 16] ; [14]
+ mova m6, [r3 - 16 * 16] ; [ 5]
+ mova m7, [r3 + 7 * 16] ; [28]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_14, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 22
+ mov r3d, 8
+ jz .next
+ xchg r3d, r4d
+
+.next:
+ movd m1, [r2 - 1] ;[x x A x]
+ movd m0, [r2 + r3 + 1] ;[x x B x]
+ punpcklbw m0, m1 ;[A B x x]
+ movd m1, [r2 + r4 + 1] ;[4 3 2 1]
+ punpckldq m0, m1 ;[4 3 2 1 A B x x]
+ psrldq m0, 2 ;[x x 4 3 2 1 A B]
+ punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B]
+ mova m2, m0
+ psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A]
+ psrldq m2, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B]
+ punpcklqdq m0, m0
+ punpcklqdq m2, m2
+
+ lea r3, [pw_ang_table + 19 * 16]
+ mova m4, [r3 + 0 * 16] ; [19]
+ mova m5, [r3 - 13 * 16] ; [ 6]
+ mova m6, [r3 + 6 * 16] ; [25]
+ mova m7, [r3 - 7 * 16] ; [12]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_15, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 21
+ mov r3d, 8
+ jz .next
+ xchg r3d, r4d
+
+.next:
+ movd m0, [r2] ;[x x x A]
+ movd m1, [r2 + r3 + 2] ;[x x x B]
+ punpcklbw m1, m0 ;[x x A B]
+ movd m0, [r2 + r3 + 3] ;[x x C x]
+ punpcklwd m0, m1 ;[A B C x]
+ movd m1, [r2 + r4 + 1] ;[4 3 2 1]
+ punpckldq m0, m1 ;[4 3 2 1 A B C x]
+ psrldq m0, 1 ;[x 4 3 2 1 A B C]
+ punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C]
+ psrldq m0, 1
+ movh m1, m0
+ psrldq m0, 2
+ movh m2, m0
+ psrldq m0, 2
+ punpcklqdq m0, m2
+ punpcklqdq m2, m1
+
+ lea r3, [pw_ang_table + 23 * 16]
+ mova m4, [r3 - 8 * 16] ; [15]
+ mova m5, [r3 + 7 * 16] ; [30]
+ mova m6, [r3 - 10 * 16] ; [13]
+ mova m7, [r3 + 5 * 16] ; [28]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_16, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 20
+ mov r3d, 8
+ jz .next
+ xchg r3d, r4d
+
+.next:
+ movd m2, [r2] ;[x x x A]
+ movd m1, [r2 + r3 + 2] ;[x x x B]
+ punpcklbw m1, m2 ;[x x A B]
+ movh m0, [r2 + r3 + 2] ;[x x C x]
+ punpcklwd m0, m1 ;[A B C x]
+ movd m1, [r2 + r4 + 1] ;[4 3 2 1]
+ punpckldq m0, m1 ;[4 3 2 1 A B C x]
+ psrldq m0, 1 ;[x 4 3 2 1 A B C]
+ punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C]
+ psrldq m0, 1
+ movh m1, m0
+ psrldq m0, 2
+ movh m2, m0
+ psrldq m0, 2
+ punpcklqdq m0, m2
+ punpcklqdq m2, m1
+
+ lea r3, [pw_ang_table + 19 * 16]
+ mova m4, [r3 - 8 * 16] ; [11]
+ mova m5, [r3 + 3 * 16] ; [22]
+ mova m6, [r3 - 18 * 16] ; [ 1]
+ mova m7, [r3 - 7 * 16] ; [12]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_17, 3,5,8
+ xor r4d, r4d
+ cmp r3m, byte 19
+ mov r3d, 8
+ jz .next
+ xchg r3d, r4d
+
+.next:
+ movd m2, [r2] ;[x x x A]
+ movd m3, [r2 + r3 + 1] ;[x x x B]
+ movd m4, [r2 + r3 + 2] ;[x x x C]
+ movd m0, [r2 + r3 + 4] ;[x x x D]
+ punpcklbw m3, m2 ;[x x A B]
+ punpcklbw m0, m4 ;[x x C D]
+ punpcklwd m0, m3 ;[A B C D]
+ movd m1, [r2 + r4 + 1] ;[4 3 2 1]
+ punpckldq m0, m1 ;[4 3 2 1 A B C D]
+ punpcklbw m0, m0 ;[4 4 3 3 2 2 1 1 A A B B C C D D]
+ psrldq m0, 1
+ movh m1, m0
+ psrldq m0, 2
+ movh m2, m0
+ punpcklqdq m2, m1
+ psrldq m0, 2
+ movh m1, m0
+ psrldq m0, 2
+ punpcklqdq m0, m1
+
+ lea r3, [pw_ang_table + 14 * 16]
+ mova m4, [r3 - 8 * 16] ; [ 6]
+ mova m5, [r3 - 2 * 16] ; [12]
+ mova m6, [r3 + 4 * 16] ; [18]
+ mova m7, [r3 + 10 * 16] ; [24]
+ jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
928
+
929
+cglobal intra_pred_ang4_18, 3,4,2
930
+ mov r3d, [r2 + 8]
931
+ mov r3b, byte [r2]
932
+ bswap r3d
933
+ movd m0, r3d
934
+
935
+ movd m1, [r2 + 1]
936
+ punpckldq m0, m1
937
+ lea r3, [r1 * 3]
938
+ movd [r0 + r3], m0
939
+ psrldq m0, 1
940
+ movd [r0 + r1 * 2], m0
941
+ psrldq m0, 1
942
+ movd [r0 + r1], m0
943
+ psrldq m0, 1
944
+ movd [r0], m0
945
+ RET
946
+
;---------------------------------------------------------------------------------------------
; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
;---------------------------------------------------------------------------------------------
    RET
+;---------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
+;---------------------------------------------------------------------------------------------
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal intra_pred_dc32, 3, 4, 3
+ lea r3, [r1 * 3]
+ pxor m0, m0
+ movu m1, [r2 + 1]
+ movu m2, [r2 + 65]
+ psadbw m1, m0
+ psadbw m2, m0
+ paddw m1, m2
+ vextracti128 xm2, m1, 1
+ paddw m1, m2
+ pshufd m2, m1, 2
+ paddw m1, m2
+
+ pmulhrsw m1, [pw_512] ; sum = (sum + 32) / 64
+ vpbroadcastb m1, xm1 ; m1 = byte [dc_val ...]
+
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ lea r0, [r0 + 4 * r1]
+ movu [r0 + r1 * 0], m1
+ movu [r0 + r1 * 1], m1
+ movu [r0 + r1 * 2], m1
+ movu [r0 + r3 * 1], m1
+ RET
+%endif ;; ARCH_X86_64 == 1
+
;---------------------------------------------------------------------------------------
; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
;---------------------------------------------------------------------------------------
;---------------------------------------------------------------------------------------
; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
;---------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_planar16, 3,3,6
+ vpbroadcastw m3, [r2 + 17]
+ mova m5, [pw_00ff]
+ vpbroadcastw m4, [r2 + 49]
+ mova m0, [pw_planar16_mul]
+ pmovzxbw m2, [r2 + 1]
+ pand m3, m5 ; v_topRight
+ pand m4, m5 ; v_bottomLeft
+
+ pmullw m3, [multiL] ; (x + 1) * topRight
+ pmullw m1, m2, [pw_15] ; (blkSize - 1 - y) * above[x]
+ paddw m3, [pw_16]
+ paddw m3, m4
+ paddw m3, m1
+ psubw m4, m2
+ add r2, 33
+
+%macro INTRA_PRED_PLANAR16_AVX2 1
+ vpbroadcastw m1, [r2 + %1]
+ vpsrlw m2, m1, 8
+ pand m1, m5
+
+ pmullw m1, m0
+ pmullw m2, m0
+ paddw m1, m3
+ paddw m3, m4
+ psraw m1, 5
+ paddw m2, m3
+ psraw m2, 5
+ paddw m3, m4
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0], xm1
+ vextracti128 [r0 + r1], m1, 1
+ lea r0, [r0 + r1 * 2]
+%endmacro
+ INTRA_PRED_PLANAR16_AVX2 0
+ INTRA_PRED_PLANAR16_AVX2 2
+ INTRA_PRED_PLANAR16_AVX2 4
+ INTRA_PRED_PLANAR16_AVX2 6
+ INTRA_PRED_PLANAR16_AVX2 8
+ INTRA_PRED_PLANAR16_AVX2 10
+ INTRA_PRED_PLANAR16_AVX2 12
+ INTRA_PRED_PLANAR16_AVX2 14
+%undef INTRA_PRED_PLANAR16_AVX2
+ RET
+
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
INIT_XMM sse4
%if ARCH_X86_64 == 1
cglobal intra_pred_planar32, 3,4,12
    jnz .loop
    RET
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal intra_pred_planar32, 3,4,11
+ mova m6, [pw_00ff]
+ vpbroadcastw m3, [r2 + 33] ; topRight = above[32]
+ vpbroadcastw m2, [r2 + 97] ; bottomLeft = left[32]
+ pand m3, m6
+ pand m2, m6
+
+ pmullw m0, m3, [multiL] ; (x + 1) * topRight
+ pmullw m3, [multiH2] ; (x + 1) * topRight
+
+ paddw m0, m2
+ paddw m3, m2
+ paddw m0, [pw_32]
+ paddw m3, [pw_32]
+
+ pmovzxbw m4, [r2 + 1]
+ pmovzxbw m1, [r2 + 17]
+ pmullw m5, m4, [pw_31]
+ paddw m0, m5
+ psubw m5, m2, m4
+ psubw m2, m1
+ pmullw m1, [pw_31]
+ paddw m3, m1
+ mova m1, m5
+
+ add r2, 65 ; (2 * blkSize + 1)
+ mova m9, [pw_planar32_mul]
+ mova m10, [pw_planar16_mul]
+
+%macro INTRA_PRED_PLANAR32_AVX2 0
+ vpbroadcastw m4, [r2]
+ vpsrlw m7, m4, 8
+ pand m4, m6
+
+ pmullw m5, m4, m9
+ pmullw m4, m4, m10
+ paddw m5, m0
+ paddw m4, m3
+ paddw m0, m1
+ paddw m3, m2
+ psraw m5, 6
+ psraw m4, 6
+ packuswb m5, m4
+ pmullw m8, m7, m9
+ pmullw m7, m7, m10
+ vpermq m5, m5, 11011000b
+ paddw m8, m0
+ paddw m7, m3
+ paddw m0, m1
+ paddw m3, m2
+ psraw m8, 6
+ psraw m7, 6
+ packuswb m8, m7
+ add r2, 2
+ vpermq m8, m8, 11011000b
+
+ movu [r0], m5
+ movu [r0 + r1], m8
+ lea r0, [r0 + r1 * 2]
+%endmacro
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+ INTRA_PRED_PLANAR32_AVX2
+%undef INTRA_PRED_PLANAR32_AVX2
+ RET
+%endif ;; ARCH_X86_64 == 1
+
;-----------------------------------------------------------------------------------------
; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
;-----------------------------------------------------------------------------------------
    RET
+INIT_YMM avx2
+cglobal intra_pred_ang32_18, 4, 4, 3
+ movu m0, [r2]
+ movu xm1, [r2 + 1 + 64]
+ pshufb xm1, [intra_pred_shuff_15_0]
+ mova xm2, xm0
+ vinserti128 m1, m1, xm2, 1
+
+ lea r3, [r1 * 3]
+
+ movu [r0], m0
+ palignr m2, m0, m1, 15
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 14
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 13
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 12
+ movu [r0], m2
+ palignr m2, m0, m1, 11
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 10
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 9
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 8
+ movu [r0], m2
+ palignr m2, m0, m1, 7
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 6
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 5
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m0, m1, 4
+ movu [r0], m2
+ palignr m2, m0, m1, 3
+ movu [r0 + r1], m2
+ palignr m2, m0, m1, 2
+ movu [r0 + r1 * 2], m2
+ palignr m2, m0, m1, 1
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ movu [r0], m1
+
+ movu xm0, [r2 + 64 + 17]
+ pshufb xm0, [intra_pred_shuff_15_0]
+ vinserti128 m0, m0, xm1, 1
+
+ palignr m2, m1, m0, 15
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 14
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 13
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 12
+ movu [r0], m2
+ palignr m2, m1, m0, 11
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 10
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 9
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 8
+ movu [r0], m2
+ palignr m2, m1, m0, 7
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 6
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 5
+ movu [r0 + r3], m2
+
+ lea r0, [r0 + r1 * 4]
+ palignr m2, m1, m0, 4
+ movu [r0], m2
+ palignr m2, m1, m0, 3
+ movu [r0 + r1], m2
+ palignr m2, m1, m0, 2
+ movu [r0 + r1 * 2], m2
+ palignr m2, m1, m0, 1
+ movu [r0 + r3], m2
+ RET
+
INIT_XMM sse4
cglobal intra_pred_ang32_18, 4,5,5
    movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
    movhps [r0 + r3], xm2
    RET
+INIT_YMM avx2
+cglobal intra_pred_ang8_15, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2 + 16]
+ pinsrb xm5, [r2], 0
+ lea r5, [intra_pred_shuff_0_8]
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_15]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 4], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 6], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 8], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+ packuswb m1, m2
+ packuswb m4, m0
+
+ vperm2i128 m2, m1, m4, 00100000b
+ vperm2i128 m1, m1, m4, 00110001b
+ punpcklbw m4, m2, m1
+ punpckhbw m2, m1
+ punpcklwd m1, m4, m2
+ punpckhwd m4, m2
+ mova m0, [trans8_shuf]
+ vpermd m1, m0, m1
+ vpermd m4, m0, m4
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ movhps [r0 + r1], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ movhps [r0 + r1], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_16, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2 + 16]
+ pinsrb xm5, [r2], 0
+ lea r5, [intra_pred_shuff_0_8]
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_20]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 3], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 5], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 6], 0
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 8], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+
+ packuswb m1, m2
+ packuswb m4, m0
+
+ vperm2i128 m2, m1, m4, 00100000b
+ vperm2i128 m1, m1, m4, 00110001b
+ punpcklbw m4, m2, m1
+ punpckhbw m2, m1
+ punpcklwd m1, m4, m2
+ punpckhwd m4, m2
+ mova m0, [trans8_shuf]
+ vpermd m1, m0, m1
+ vpermd m4, m0, m4
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ movhps [r0 + r1], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ movhps [r0 + r1], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_20, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2]
+ lea r5, [intra_pred_shuff_0_8]
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_20]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 3 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 5 + 16], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 6 + 16], 0
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 8 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+
+ packuswb m1, m2
+ packuswb m4, m0
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm1
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm4
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_21, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2]
+ lea r5, [intra_pred_shuff_0_8]
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_15]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 4 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 6 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ mova xm0, xm5
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 8 + 16], 0
+ vinserti128 m0, m0, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+ packuswb m1, m2
+ packuswb m4, m0
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm1
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm4
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_22, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2]
+ lea r5, [intra_pred_shuff_0_8]
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_14]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2 + 16], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 5 + 16], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 7 + 16], 0
+ pshufb xm5, [r5]
+ vinserti128 m0, m0, xm5, 1
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+ packuswb m1, m2
+ packuswb m4, m0
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm1
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + r1], xm2
+ movhps [r0 + 2 * r1], xm4
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_14, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2 + 16]
+ pinsrb xm5, [r2], 0
+ lea r5, [intra_pred_shuff_0_8]
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_14]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 2], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 5], 0
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 7], 0
+ pshufb xm5, [r5]
+ vinserti128 m0, m0, xm5, 1
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+ packuswb m1, m2
+ packuswb m4, m0
+
+ vperm2i128 m2, m1, m4, 00100000b
+ vperm2i128 m1, m1, m4, 00110001b
+ punpcklbw m4, m2, m1
+ punpckhbw m2, m1
+ punpcklwd m1, m4, m2
+ punpckhwd m4, m2
+ mova m0, [trans8_shuf]
+ vpermd m1, m0, m1
+ vpermd m4, m0, m4
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ movhps [r0 + r1], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ movhps [r0 + r1], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang8_13, 3, 6, 6
+ mova m3, [pw_1024]
+ movu xm5, [r2 + 16]
+ pinsrb xm5, [r2], 0
+ lea r5, [intra_pred_shuff_0_8]
+ vinserti128 m0, m5, xm5, 1
+ pshufb m0, [r5]
+
+ lea r4, [c_ang8_mode_13]
+ pmaddubsw m1, m0, [r4]
+ pmulhrsw m1, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 4], 0
+ pshufb xm4, xm5, [r5]
+ vinserti128 m0, m0, xm4, 1
+ pmaddubsw m2, m0, [r4 + mmsize]
+ pmulhrsw m2, m3
+ vinserti128 m0, m0, xm4, 0
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
+ pmulhrsw m4, m3
+ pslldq xm5, 1
+ pinsrb xm5, [r2 + 7], 0
+ pshufb xm5, [r5]
+ vinserti128 m0, m0, xm5, 1
+ pmaddubsw m0, [r4 + 3 * mmsize]
+ pmulhrsw m0, m3
+ packuswb m1, m2
+ packuswb m4, m0
+
+ vperm2i128 m2, m1, m4, 00100000b
+ vperm2i128 m1, m1, m4, 00110001b
+ punpcklbw m4, m2, m1
+ punpckhbw m2, m1
+ punpcklwd m1, m4, m2
+ punpckhwd m4, m2
+ mova m0, [trans8_shuf]
+ vpermd m1, m0, m1
+ vpermd m4, m0, m4
+
+ lea r3, [3 * r1]
+ movq [r0], xm1
+ movhps [r0 + r1], xm1
+ vextracti128 xm2, m1, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ lea r0, [r0 + 4 * r1]
+ movq [r0], xm4
+ movhps [r0 + r1], xm4
+ vextracti128 xm2, m4, 1
+ movq [r0 + 2 * r1], xm2
+ movhps [r0 + r3], xm2
+ RET
+
1665
+
1666
+INIT_YMM avx2
1667
+cglobal intra_pred_ang8_23, 3, 6, 6
1668
+ mova m3, [pw_1024]
1669
+ movu xm5, [r2]
1670
+ lea r5, [intra_pred_shuff_0_8]
1671
+ vinserti128 m0, m5, xm5, 1
1672
+ pshufb m0, [r5]
1673
+
1674
+ lea r4, [c_ang8_mode_13]
1675
+ pmaddubsw m1, m0, [r4]
1676
+ pmulhrsw m1, m3
1677
+ pslldq xm5, 1
1678
+ pinsrb xm5, [r2 + 4 + 16], 0
1679
+ pshufb xm4, xm5, [r5]
1680
+ vinserti128 m0, m0, xm4, 1
1681
+ pmaddubsw m2, m0, [r4 + mmsize]
1682
+ pmulhrsw m2, m3
1683
+ vinserti128 m0, m0, xm4, 0
1684
+ pmaddubsw m4, m0, [r4 + 2 * mmsize]
1685
+ pmulhrsw m4, m3
1686
+ pslldq xm5, 1
1687
+ pinsrb xm5, [r2 + 7 + 16], 0
1688
+ pshufb xm5, [r5]
1689
+ vinserti128 m0, m0, xm5, 1
1690
+ pmaddubsw m0, [r4 + 3 * mmsize]
1691
+ pmulhrsw m0, m3
1692
+
1693
+ packuswb m1, m2
1694
+ packuswb m4, m0
1695
+
1696
+ lea r3, [3 * r1]
1697
+ movq [r0], xm1
1698
+ vextracti128 xm2, m1, 1
1699
+ movq [r0 + r1], xm2
1700
+ movhps [r0 + 2 * r1], xm1
1701
+ movhps [r0 + r3], xm2
1702
+ lea r0, [r0 + 4 * r1]
1703
+ movq [r0], xm4
1704
+ vextracti128 xm2, m4, 1
1705
+ movq [r0 + r1], xm2
1706
+ movhps [r0 + 2 * r1], xm4
1707
+ movhps [r0 + r3], xm2
1708
+ RET
1709
1710
INIT_YMM avx2
1711
cglobal intra_pred_ang8_12, 3, 5, 5
1712
1713
movu [%2], xm3
1714
%endmacro
1715
1716
+%if ARCH_X86_64 == 1
+%macro INTRA_PRED_TRANS_STORE_16x16 0
+ punpcklbw m8, m0, m1
+ punpckhbw m0, m1
+
+ punpcklbw m1, m2, m3
+ punpckhbw m2, m3
+
+ punpcklbw m3, m4, m5
+ punpckhbw m4, m5
+
+ punpcklbw m5, m6, m7
+ punpckhbw m6, m7
+
+ punpcklwd m7, m8, m1
+ punpckhwd m8, m1
+
+ punpcklwd m1, m3, m5
+ punpckhwd m3, m5
+
+ punpcklwd m5, m0, m2
+ punpckhwd m0, m2
+
+ punpcklwd m2, m4, m6
+ punpckhwd m4, m6
+
+ punpckldq m6, m7, m1
+ punpckhdq m7, m1
+
+ punpckldq m1, m8, m3
+ punpckhdq m8, m3
+
+ punpckldq m3, m5, m2
+ punpckhdq m5, m2
+
+ punpckldq m2, m0, m4
+ punpckhdq m0, m4
+
+ vpermq m6, m6, 0xD8
+ vpermq m7, m7, 0xD8
+ vpermq m1, m1, 0xD8
+ vpermq m8, m8, 0xD8
+ vpermq m3, m3, 0xD8
+ vpermq m5, m5, 0xD8
+ vpermq m2, m2, 0xD8
+ vpermq m0, m0, 0xD8
+
+ movu [r0], xm6
+ vextracti128 xm4, m6, 1
+ movu [r0 + r1], xm4
+
+ movu [r0 + 2 * r1], xm7
+ vextracti128 xm4, m7, 1
+ movu [r0 + r3], xm4
+
+ lea r0, [r0 + 4 * r1]
+
+ movu [r0], xm1
+ vextracti128 xm4, m1, 1
+ movu [r0 + r1], xm4
+
+ movu [r0 + 2 * r1], xm8
+ vextracti128 xm4, m8, 1
+ movu [r0 + r3], xm4
+
+ lea r0, [r0 + 4 * r1]
+
+ movu [r0], xm3
+ vextracti128 xm4, m3, 1
+ movu [r0 + r1], xm4
+
+ movu [r0 + 2 * r1], xm5
+ vextracti128 xm4, m5, 1
+ movu [r0 + r3], xm4
+
+ lea r0, [r0 + 4 * r1]
+
+ movu [r0], xm2
+ vextracti128 xm4, m2, 1
+ movu [r0 + r1], xm4
+
+ movu [r0 + 2 * r1], xm0
+ vextracti128 xm4, m0, 1
+ movu [r0 + r3], xm4
+%endmacro
+
+%macro INTRA_PRED_ANG16_CAL_ROW 3
+ pmaddubsw %1, m9, [r4 + (%3 * mmsize)]
+ pmulhrsw %1, m11
+ pmaddubsw %2, m10, [r4 + (%3 * mmsize)]
+ pmulhrsw %2, m11
+ packuswb %1, %2
+%endmacro
+
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_12, 3, 6, 13
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 32]
+ pinsrb xm9, [r2], 0
+ pslldq xm7, xm9, 1
+ pinsrb xm7, [r2 + 6], 0
+ vinserti128 m9, m9, xm7, 1
+ pshufb m9, [r5]
+
+ movu xm12, [r2 + 6 + 32]
+
+ psrldq xm10, xm12, 2
+ psrldq xm8, xm12, 1
+ vinserti128 m10, m10, xm8, 1
+ pshufb m10, [r5]
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_12]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ pslldq xm7, 1
+ pinsrb xm7, [r2 + 13], 0
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ mova xm8, xm12
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ movu xm9, [r2 + 31]
+ pinsrb xm9, [r2 + 6], 0
+ pinsrb xm9, [r2 + 0], 1
+ pshufb xm9, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ psrldq xm10, xm12, 1
+ vinserti128 m10, m10, xm12, 1
+ pshufb m10, [r5]
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_13, 3, 6, 14
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm13, [r2 + 32]
+ pinsrb xm13, [r2], 0
+ pslldq xm7, xm13, 2
+ pinsrb xm7, [r2 + 7], 0
+ pinsrb xm7, [r2 + 4], 1
+ vinserti128 m9, m13, xm7, 1
+ pshufb m9, [r5]
+
+ movu xm12, [r2 + 4 + 32]
+
+ psrldq xm10, xm12, 4
+ psrldq xm8, xm12, 2
+ vinserti128 m10, m10, xm8, 1
+ pshufb m10, [r5]
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_13]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+ pslldq xm7, 1
+ pinsrb xm7, [r2 + 11], 0
+ pshufb xm2, xm7, [r5]
+ vinserti128 m9, m9, xm2, 1
+
+ psrldq xm8, xm12, 1
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+ pslldq xm13, 1
+ pinsrb xm13, [r2 + 4], 0
+ pshufb xm3, xm13, [r5]
+ vinserti128 m9, m9, xm3, 0
+
+ psrldq xm8, xm12, 3
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ pslldq xm7, 1
+ pinsrb xm7, [r2 + 14], 0
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ mova xm8, xm12
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ pslldq xm13, 1
+ pinsrb xm13, [r2 + 7], 0
+ pshufb xm13, [r5]
+ vinserti128 m9, m9, xm13, 0
+
+ psrldq xm12, 2
+ pshufb xm12, [r5]
+ vinserti128 m10, m10, xm12, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_11, 3, 5, 12
+ mova m11, [pw_1024]
+
+ movu xm9, [r2 + 32]
+ pinsrb xm9, [r2], 0
+ pshufb xm9, [intra_pred_shuff_0_8]
+ vinserti128 m9, m9, xm9, 1
+
+ vbroadcasti128 m10, [r2 + 8 + 32]
+ pshufb m10, [intra_pred_shuff_0_8]
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_11]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_3, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 8 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 16 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_3]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+ movu xm9, [r2 + 2 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 10 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 9 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 17 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+ movu xm7, [r2 + 3 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 11 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+ movu xm9, [r2 + 4 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 12 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 10 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 18 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ movu xm9, [r2 + 5 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 13 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 11 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 19 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+ movu xm7, [r2 + 12 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 20 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ movu xm9, [r2 + 6 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 14 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 13 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 21 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm9, [r2 + 7 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 15 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 14 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 22 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_4, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 6 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 14 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_4]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+ movu xm9, [r2 + 2 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 10 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 7 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 15 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+ movu xm7, [r2 + 8 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 16 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+ movu xm7, [r2 + 3 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 11 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ movu xm9, [r2 + 4 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 12 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 9 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 17 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+ movu xm7, [r2 + 10 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 18 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ movu xm7, [r2 + 5 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 13 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm9, [r2 + 6 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 14 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 11 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 19 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_5, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 5 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 13 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_5]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+ movu xm9, [r2 + 2 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 10 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 6 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 14 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+ movu xm9, [r2 + 3 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 11 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 7 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 15 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+
+ movu xm9, [r2 + 4 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 12 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 8 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 16 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm9, [r2 + 5 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 13 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 9 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 17 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_6, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 4 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 12 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_6]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+
+ movu xm7, [r2 + 5 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 13 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+ movu xm7, [r2 + 2 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 10 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ movu xm9, [r2 + 3 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 11 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 6 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 14 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ movu xm7, [r2 + 7 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 15 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm7, [r2 + 4 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 12 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_7, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 3 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 11 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_7]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+
+ movu xm7, [r2 + 4 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 12 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+
+ movu xm7, [r2 + 2 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 10 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ movu xm7, [r2 + 5 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 13 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm7, [r2 + 3 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 0
+
+ movu xm8, [r2 + 11 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_8, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ movu xm9, [r2 + 1 + 32]
+ pshufb xm9, [r5]
+ movu xm10, [r2 + 9 + 32]
+ pshufb xm10, [r5]
+
+ movu xm7, [r2 + 2 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm8, [r2 + 10 + 32]
+ pshufb xm8, [r5]
+ vinserti128 m10, m10, xm8, 1
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_8]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ movu xm4, [r2 + 3 + 32]
+ pshufb xm4, [r5]
+ vinserti128 m9, m9, xm4, 1
+
+ movu xm5, [r2 + 11 + 32]
+ pshufb xm5, [r5]
+ vinserti128 m10, m10, xm5, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+
+ vinserti128 m9, m9, xm7, 0
+ vinserti128 m10, m10, xm8, 0
+
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang16_9, 3, 6, 12
+ mova m11, [pw_1024]
+ lea r5, [intra_pred_shuff_0_8]
+
+ vbroadcasti128 m9, [r2 + 1 + 32]
+ pshufb m9, [r5]
+ vbroadcasti128 m10, [r2 + 9 + 32]
+ pshufb m10, [r5]
+
+ lea r3, [3 * r1]
+ lea r4, [c_ang16_mode_9]
+
+ INTRA_PRED_ANG16_CAL_ROW m0, m1, 0
+ INTRA_PRED_ANG16_CAL_ROW m1, m2, 1
+ INTRA_PRED_ANG16_CAL_ROW m2, m3, 2
+ INTRA_PRED_ANG16_CAL_ROW m3, m4, 3
+
+ add r4, 4 * mmsize
+
+ INTRA_PRED_ANG16_CAL_ROW m4, m5, 0
+ INTRA_PRED_ANG16_CAL_ROW m5, m6, 1
+ INTRA_PRED_ANG16_CAL_ROW m6, m7, 2
+
+ movu xm7, [r2 + 2 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m9, m9, xm7, 1
+
+ movu xm7, [r2 + 10 + 32]
+ pshufb xm7, [r5]
+ vinserti128 m10, m10, xm7, 1
+
+ INTRA_PRED_ANG16_CAL_ROW m7, m8, 3
+
+ ; transpose and store
+ INTRA_PRED_TRANS_STORE_16x16
+ RET
+%endif
+
INIT_YMM avx2
cglobal intra_pred_ang16_25, 3, 5, 5
mova m0, [pw_1024]
vpermq m6, m6, 11011000b
movu [r0 + r3], m6
RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_33, 3, 5, 11
+ mova m0, [pw_1024]
+ mova m1, [intra_pred_shuff_0_8]
+ lea r3, [3 * r1]
+ lea r4, [c_ang32_mode_33]
+
+ ;row [0]
+ vbroadcasti128 m2, [r2 + 1]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 9]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 17]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 25]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row [1]
+ vbroadcasti128 m2, [r2 + 2]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 10]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 18]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 26]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row [2]
+ vbroadcasti128 m2, [r2 + 3]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 11]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 19]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 27]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [3]
+ vbroadcasti128 m2, [r2 + 4]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 12]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 20]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 28]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [4, 5]
+ vbroadcasti128 m2, [r2 + 5]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 13]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 21]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 29]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row [6]
+ vbroadcasti128 m2, [r2 + 6]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 14]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 22]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 30]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [7]
+ vbroadcasti128 m2, [r2 + 7]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 15]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 23]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 31]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [8]
+ vbroadcasti128 m2, [r2 + 8]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 16]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 24]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 32]
+ pshufb m5, m1
+
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row [9, 10]
+ vbroadcasti128 m2, [r2 + 9]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 17]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 25]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 33]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row [11]
+ vbroadcasti128 m2, [r2 + 10]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 18]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 26]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 34]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [12]
+ vbroadcasti128 m2, [r2 + 11]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 19]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 27]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 35]
+ pshufb m5, m1
+
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row [13]
+ vbroadcasti128 m2, [r2 + 12]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 20]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 28]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 36]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row [14]
+ vbroadcasti128 m2, [r2 + 13]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 21]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 29]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 37]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [15, 16]
+ vbroadcasti128 m2, [r2 + 14]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 22]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 30]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 38]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row [17]
+ vbroadcasti128 m2, [r2 + 15]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 23]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 31]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 39]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row [18]
+ vbroadcasti128 m2, [r2 + 16]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 24]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 32]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 40]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [19]
+ vbroadcasti128 m2, [r2 + 17]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 25]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 33]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 41]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [20, 21]
+ vbroadcasti128 m2, [r2 + 18]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 26]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 34]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 42]
+ pshufb m5, m1
+
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row [22]
+ vbroadcasti128 m2, [r2 + 19]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 27]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 35]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 43]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [23]
+ vbroadcasti128 m2, [r2 + 20]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 28]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 36]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 44]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [24]
+ vbroadcasti128 m2, [r2 + 21]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 29]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 37]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 45]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row [25, 26]
+ vbroadcasti128 m2, [r2 + 22]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 30]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 38]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 46]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row [27]
+ vbroadcasti128 m2, [r2 + 23]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 31]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 39]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 47]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row [28]
+ vbroadcasti128 m2, [r2 + 24]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 32]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 40]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 48]
+ pshufb m5, m1
+
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row [29]
+ vbroadcasti128 m2, [r2 + 25]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 33]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 41]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 49]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row [30]
+ vbroadcasti128 m2, [r2 + 26]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 34]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 42]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 50]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row [31]
+ vbroadcasti128 m2, [r2 + 27]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 35]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 43]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 51]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_25, 3, 5, 11
+ mova m0, [pw_1024]
+ mova m1, [intra_pred_shuff_0_8]
+ lea r3, [3 * r1]
+ lea r4, [c_ang32_mode_25]
+
+ ;row [0, 1]
+ vbroadcasti128 m2, [r2 + 0]
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 8]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 16]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 24]
+ pshufb m5, m1
+
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[2, 3]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[4, 5]
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[6, 7]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[8, 9]
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[10, 11]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[12, 13]
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[14, 15]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[16, 17]
+ movu xm2, [r2 - 1]
+ pinsrb xm2, [r2 + 80], 0
+ vinserti128 m2, m2, xm2, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 7]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 15]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 23]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[18, 19]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[20, 21]
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[22, 23]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[24, 25]
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[26, 27]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[28, 29]
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[30, 31]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+ RET
+
+INIT_YMM avx2
3259
+cglobal intra_pred_ang32_24, 3, 5, 12
3260
+ mova m0, [pw_1024]
3261
+ mova m1, [intra_pred_shuff_0_8]
3262
+ lea r3, [3 * r1]
3263
+ lea r4, [c_ang32_mode_24]
3264
+
3265
+ ;row[0, 1]
3266
+ vbroadcasti128 m11, [r2 + 0]
3267
+ pshufb m2, m11, m1
3268
+ vbroadcasti128 m3, [r2 + 8]
3269
+ pshufb m3, m1
3270
+ vbroadcasti128 m4, [r2 + 16]
3271
+ pshufb m4, m1
3272
+ vbroadcasti128 m5, [r2 + 24]
3273
+ pshufb m5, m1
3274
+
3275
+ mova m10, [r4 + 0 * mmsize]
3276
+
3277
+ INTRA_PRED_ANG32_CAL_ROW
3278
+ movu [r0], m7
3279
+ movu [r0 + r1], m6
3280
+
3281
+ ;row[2, 3]
3282
+ mova m10, [r4 + 1 * mmsize]
3283
+
3284
+ INTRA_PRED_ANG32_CAL_ROW
3285
+ movu [r0 + 2 * r1], m7
3286
+ movu [r0 + r3], m6
3287
+
3288
+ ;row[4, 5]
3289
+ mova m10, [r4 + 2 * mmsize]
3290
+ lea r0, [r0 + 4 * r1]
3291
+
3292
+ INTRA_PRED_ANG32_CAL_ROW
3293
+ movu [r0], m7
3294
+ movu [r0 + r1], m6
3295
+
3296
+ ;row[6, 7]
3297
+ pslldq xm11, 1
3298
+ pinsrb xm11, [r2 + 70], 0
3299
+ vinserti128 m2, m11, xm11, 1
3300
+ pshufb m2, m1
3301
+ vbroadcasti128 m3, [r2 + 7]
3302
+ pshufb m3, m1
3303
+ vbroadcasti128 m4, [r2 + 15]
3304
+ pshufb m4, m1
3305
+ vbroadcasti128 m5, [r2 + 23]
3306
+ pshufb m5, m1
3307
+
3308
+ mova m10, [r4 + 3 * mmsize]
3309
+
3310
+ INTRA_PRED_ANG32_CAL_ROW
3311
+ movu [r0 + 2 * r1], m7
3312
+ movu [r0 + r3], m6
3313
+
3314
+ ;row[8, 9]
3315
+ add r4, 4 * mmsize
3316
+ lea r0, [r0 + 4 * r1]
3317
+ mova m10, [r4 + 0 * mmsize]
3318
+
3319
+ INTRA_PRED_ANG32_CAL_ROW
3320
+ movu [r0], m7
3321
+ movu [r0 + r1], m6
3322
+
3323
+ ;row[10, 11]
3324
+ mova m10, [r4 + 1 * mmsize]
3325
+
3326
+ INTRA_PRED_ANG32_CAL_ROW
3327
+ movu [r0 + 2 * r1], m7
3328
+ movu [r0 + r3], m6
+
+ ;row[12, 13]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 77], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 6]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 14]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 22]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[14, 15]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[16, 17]
+ add r4, 4 * mmsize
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[18]
+ mova m10, [r4 + 1 * mmsize]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, m10
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, m10
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row[19, 20]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 83], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 5]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 13]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 21]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[21, 22]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[23, 24]
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[25, 26]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 90], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 4]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 12]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 20]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[27, 28]
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[29, 30]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;[row 31]
+ mova m10, [r4 + 4 * mmsize]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, m10
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, m10
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_23, 3, 5, 12
+ mova m0, [pw_1024]
+ mova m1, [intra_pred_shuff_0_8]
+ lea r3, [3 * r1]
+ lea r4, [c_ang32_mode_23]
+
+ ;row[0, 1]
+ vbroadcasti128 m11, [r2 + 0]
+ pshufb m2, m11, m1
+ vbroadcasti128 m3, [r2 + 8]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 16]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 24]
+ pshufb m5, m1
+
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[2]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row[3, 4]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 68], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 7]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 15]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 23]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[5, 6]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[7, 8]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 71], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 6]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 14]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 22]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[9]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row[10, 11]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 75], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 5]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 13]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 21]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[12, 13]
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[14, 15]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 78], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 4]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 12]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 20]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[16]
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row[17, 18]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 82], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 3]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 11]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 19]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[19, 20]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[21, 22]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 85], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 2]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 10]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 18]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[23]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row[24, 25]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 89], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 1]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 9]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 17]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[26, 27]
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[28, 29]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 92], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 0]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 8]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 16]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[30, 31]
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_22, 3, 5, 13
+ mova m0, [pw_1024]
+ mova m1, [intra_pred_shuff_0_8]
+ lea r3, [3 * r1]
+ lea r4, [c_ang32_mode_22]
+
+ ;row[0, 1]
+ vbroadcasti128 m11, [r2 + 0]
+ pshufb m2, m11, m1
+ vbroadcasti128 m3, [r2 + 8]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 16]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 24]
+ pshufb m5, m1
+
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[2, 3]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 66], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 7]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 15]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 23]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[4, 5]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 69], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 6]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 14]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 22]
+ pshufb m5, m1
+
+ lea r0, [r0 + 4 * r1]
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[6]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row[7, 8]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 71], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 5]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 13]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 21]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[9, 10]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 74], 0
+ vinserti128 m2, m11, xm11, 1
+ vinserti128 m2, m2, xm2, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 4]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 12]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 20]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[11]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row[12, 13]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 76], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 3]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 11]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 19]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[14, 15]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 79], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 2]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 10]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 18]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[16]
+ lea r0, [r0 + 4 * r1]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 1 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 1 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row[17, 18]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 81], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 1]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 9]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 17]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[19, 20]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 84], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m12, [r2 + 0]
+ pshufb m3, m12, m1
+ vbroadcasti128 m4, [r2 + 8]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 16]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[21]
+ add r4, 4 * mmsize
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r1], m6
+
+ ;row[22, 23]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 86], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 66], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 7]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 15]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[24, 25]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 89], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 69], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 6]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 14]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+ lea r0, [r0 + 4 * r1]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[26]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 3 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 3 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + 2 * r1], m6
+
+ ;row[27, 28]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 91], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 71], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 5]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 13]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[29, 30]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 94], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 74], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 4]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 12]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[31]
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 2 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 2 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang32_21, 3, 5, 13
+ mova m0, [pw_1024]
+ mova m1, [intra_pred_shuff_0_8]
+ lea r3, [3 * r1]
+ lea r4, [c_ang32_mode_21]
+
+ ;row[0]
+ vbroadcasti128 m11, [r2 + 0]
+ pshufb m2, m11, m1
+ vbroadcasti128 m3, [r2 + 8]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 16]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 24]
+ pshufb m5, m1
+
+ vperm2i128 m6, m2, m3, 00100000b
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0], m6
+
+ ;row[1, 2]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 66], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 7]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 15]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 23]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[3, 4]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 68], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 6]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 14]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 22]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[5, 6]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 70], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 5]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 13]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 21]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[7, 8]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 72], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 4]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 12]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 20]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[9, 10]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 73], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 3]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 11]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 19]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[11, 12]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 75], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 2]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 10]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 18]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r3], m7
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m6
+
+ ;row[13, 14]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 77], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m3, [r2 + 1]
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 9]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 17]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + r1], m7
+ movu [r0 + 2 * r1], m6
+
+ ;row[15]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 79], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ vbroadcasti128 m12, [r2 + 0]
+ pshufb m3, m12, m1
+ vbroadcasti128 m4, [r2 + 8]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 16]
+ pshufb m5, m1
+ vperm2i128 m6, m2, m3, 00100000b
+ add r4, 4 * mmsize
+ pmaddubsw m6, [r4 + 0 * mmsize]
+ pmulhrsw m6, m0
+ vperm2i128 m7, m4, m5, 00100000b
+ pmaddubsw m7, [r4 + 0 * mmsize]
+ pmulhrsw m7, m0
+ packuswb m6, m7
+ vpermq m6, m6, 11011000b
+ movu [r0 + r3], m6
+
+ ;row[16, 17]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 81], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 66], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 7]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 15]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[18, 19]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 83], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 68], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 6]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 14]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[20, 21]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 85], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 70], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 5]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 13]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[22, 23]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 87], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 72], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 4]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 12]
+ pshufb m5, m1
+
+ add r4, 4 * mmsize
+ mova m10, [r4 + 0 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[24, 25]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 88], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 73], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 3]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 11]
+ pshufb m5, m1
+
+ mova m10, [r4 + 1 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[26, 27]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 90], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 75], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 2]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 10]
+ pshufb m5, m1
+
+ mova m10, [r4 + 2 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+
+ ;row[28, 29]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 92], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 77], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 1]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 9]
+ pshufb m5, m1
+
+ mova m10, [r4 + 3 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ lea r0, [r0 + 4 * r1]
+ movu [r0], m7
+ movu [r0 + r1], m6
+
+ ;row[30, 31]
+ pslldq xm11, 1
+ pinsrb xm11, [r2 + 94], 0
+ vinserti128 m2, m11, xm11, 1
+ pshufb m2, m1
+ pslldq xm12, 1
+ pinsrb xm12, [r2 + 79], 0
+ vinserti128 m3, m12, xm12, 1
+ pshufb m3, m1
+ vbroadcasti128 m4, [r2 + 0]
+ pshufb m4, m1
+ vbroadcasti128 m5, [r2 + 8]
+ pshufb m5, m1
+
+ mova m10, [r4 + 4 * mmsize]
+
+ INTRA_PRED_ANG32_CAL_ROW
+ movu [r0 + 2 * r1], m7
+ movu [r0 + r3], m6
+ RET
%endif
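Every kernel in this diff computes the HEVC angular interpolation pred = ((32 - f)*a + f*b + 16) >> 5: pmaddubsw forms the weighted sum from byte pairs with per-row (32 - f, f) weights, and pmulhrsw against pw_1024 performs the +16 round and >>5 shift in a single instruction. A minimal scalar sketch of that identity (illustrative only, not part of x265; `angular_sample` and its inputs are hypothetical names standing in for the neighbour samples and the per-row fractional offset):

```python
def pmulhrsw(x: int, y: int) -> int:
    """Per-lane model of the SSSE3/AVX2 pmulhrsw instruction:
    (x*y + 2^14) >> 15, i.e. a rounded high multiply."""
    return (x * y + (1 << 14)) >> 15

def angular_sample(a: int, b: int, frac: int) -> int:
    """One output pixel of the angular filter, as the asm computes it."""
    # pmaddubsw step: (32 - frac)*a + frac*b on unsigned byte inputs
    # (max 32*255 = 8160, so no 16-bit overflow)
    acc = (32 - frac) * a + frac * b
    # pmulhrsw with pw_1024 == (acc*1024 + 2^14) >> 15 == (acc + 16) >> 5
    return pmulhrsw(acc, 1024)

def angular_ref(a: int, b: int, frac: int) -> int:
    """Reference formula from the HEVC angular prediction equations."""
    return ((32 - frac) * a + frac * b + 16) >> 5
```

Because acc*1024 + 2^14 = 1024*(acc + 16), the pmulhrsw result equals the spec's (acc + 16) >> 5 exactly for all 8-bit inputs, which is why the kernels need no separate add/shift pair.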
+%macro INTRA_PRED_STORE_4x4 0
4389
+ movd [r0], xm0
4390
+ pextrd [r0 + r1], xm0, 1
4391
+ vextracti128 xm0, m0, 1
4392
+ lea r0, [r0 + 2 * r1]
4393
+ movd [r0], xm0
4394
+ pextrd [r0 + r1], xm0, 1
4395
+%endmacro
4396
+
4397
+%macro INTRA_PRED_TRANS_STORE_4x4 0
4398
+ vpermq m0, m0, 00001000b
4399
+ pshufb m0, [c_trans_4x4]
4400
+
4401
+ ;store
4402
+ movd [r0], xm0
4403
+ pextrd [r0 + r1], xm0, 1
4404
+ lea r0, [r0 + 2 * r1]
4405
+ pextrd [r0], xm0, 2
4406
+ pextrd [r0 + r1], xm0, 3
4407
+%endmacro
4408
+
4409
+INIT_YMM avx2
4410
+cglobal intra_pred_ang4_27, 3, 3, 1
4411
+ vbroadcasti128 m0, [r2 + 1]
4412
+ pshufb m0, [intra_pred_shuff_0_4]
4413
+ pmaddubsw m0, [c_ang4_mode_27]
4414
+ pmulhrsw m0, [pw_1024]
4415
+ packuswb m0, m0
4416
+
4417
+ INTRA_PRED_STORE_4x4
4418
+ RET
4419
+
4420
+INIT_YMM avx2
4421
+cglobal intra_pred_ang4_28, 3, 3, 1
4422
+ vbroadcasti128 m0, [r2 + 1]
4423
+ pshufb m0, [intra_pred_shuff_0_4]
4424
+ pmaddubsw m0, [c_ang4_mode_28]
4425
+ pmulhrsw m0, [pw_1024]
4426
+ packuswb m0, m0
4427
+
4428
+ INTRA_PRED_STORE_4x4
4429
+ RET
4430
+
4431
+INIT_YMM avx2
4432
+cglobal intra_pred_ang4_29, 3, 3, 1
4433
+ vbroadcasti128 m0, [r2 + 1]
4434
+ pshufb m0, [intra_pred4_shuff1]
4435
+ pmaddubsw m0, [c_ang4_mode_29]
4436
+ pmulhrsw m0, [pw_1024]
4437
+ packuswb m0, m0
4438
+
4439
+ INTRA_PRED_STORE_4x4
4440
+ RET
4441
+
4442
+INIT_YMM avx2
4443
+cglobal intra_pred_ang4_30, 3, 3, 1
4444
+ vbroadcasti128 m0, [r2 + 1]
4445
+ pshufb m0, [intra_pred4_shuff2]
4446
+ pmaddubsw m0, [c_ang4_mode_30]
4447
+ pmulhrsw m0, [pw_1024]
4448
+ packuswb m0, m0
4449
+
4450
+ INTRA_PRED_STORE_4x4
4451
+ RET
4452
+
4453
+INIT_YMM avx2
4454
+cglobal intra_pred_ang4_31, 3, 3, 1
4455
+ vbroadcasti128 m0, [r2 + 1]
4456
+ pshufb m0, [intra_pred4_shuff31]
4457
+ pmaddubsw m0, [c_ang4_mode_31]
4458
+ pmulhrsw m0, [pw_1024]
4459
+ packuswb m0, m0
4460
+
4461
+ INTRA_PRED_STORE_4x4
4462
+ RET
4463
+
4464
+INIT_YMM avx2
4465
+cglobal intra_pred_ang4_32, 3, 3, 1
4466
+ vbroadcasti128 m0, [r2 + 1]
4467
+ pshufb m0, [intra_pred4_shuff31]
4468
+ pmaddubsw m0, [c_ang4_mode_32]
4469
+ pmulhrsw m0, [pw_1024]
4470
+ packuswb m0, m0
4471
+
4472
+ INTRA_PRED_STORE_4x4
4473
+ RET
4474
+
4475
+INIT_YMM avx2
4476
+cglobal intra_pred_ang4_33, 3, 3, 1
4477
+ vbroadcasti128 m0, [r2 + 1]
4478
+ pshufb m0, [intra_pred4_shuff33]
4479
+ pmaddubsw m0, [c_ang4_mode_33]
4480
+ pmulhrsw m0, [pw_1024]
4481
+ packuswb m0, m0
4482
+
4483
+ INTRA_PRED_STORE_4x4
4484
+ RET
4485
+
4486
+
4487
+INIT_YMM avx2
4488
+cglobal intra_pred_ang4_3, 3, 3, 1
4489
+ vbroadcasti128 m0, [r2 + 1]
4490
+ pshufb m0, [intra_pred4_shuff3]
4491
+ pmaddubsw m0, [c_ang4_mode_33]
4492
+ pmulhrsw m0, [pw_1024]
4493
+ packuswb m0, m0
4494
+
4495
+ INTRA_PRED_TRANS_STORE_4x4
4496
+ RET
4497
+
4498
+INIT_YMM avx2
4499
+cglobal intra_pred_ang4_4, 3, 3, 1
4500
+ vbroadcasti128 m0, [r2]
4501
+ pshufb m0, [intra_pred4_shuff5]
4502
+ pmaddubsw m0, [c_ang4_mode_32]
4503
+ pmulhrsw m0, [pw_1024]
4504
+ packuswb m0, m0
4505
+
4506
+ INTRA_PRED_TRANS_STORE_4x4
4507
+ RET
4508
+
4509
+INIT_YMM avx2
4510
+cglobal intra_pred_ang4_5, 3, 3, 1
4511
+ vbroadcasti128 m0, [r2]
4512
+ pshufb m0, [intra_pred4_shuff5]
4513
+ pmaddubsw m0, [c_ang4_mode_5]
4514
+ pmulhrsw m0, [pw_1024]
4515
+ packuswb m0, m0
4516
+
4517
+ INTRA_PRED_TRANS_STORE_4x4
4518
+ RET
4519
+
4520
+INIT_YMM avx2
4521
+cglobal intra_pred_ang4_6, 3, 3, 1
4522
+ vbroadcasti128 m0, [r2]
4523
+ pshufb m0, [intra_pred4_shuff6]
4524
+ pmaddubsw m0, [c_ang4_mode_6]
4525
+ pmulhrsw m0, [pw_1024]
4526
+ packuswb m0, m0
4527
+
4528
+ INTRA_PRED_TRANS_STORE_4x4
4529
+ RET
4530
+
4531
+INIT_YMM avx2
4532
+cglobal intra_pred_ang4_7, 3, 3, 1
4533
+ vbroadcasti128 m0, [r2]
4534
+ pshufb m0, [intra_pred4_shuff7]
4535
+ pmaddubsw m0, [c_ang4_mode_7]
4536
+ pmulhrsw m0, [pw_1024]
4537
+ packuswb m0, m0
4538
+
4539
+ INTRA_PRED_TRANS_STORE_4x4
4540
+ RET
4541
+
4542
+INIT_YMM avx2
4543
+cglobal intra_pred_ang4_8, 3, 3, 1
4544
+ vbroadcasti128 m0, [r2]
4545
+ pshufb m0, [intra_pred4_shuff9]
4546
+ pmaddubsw m0, [c_ang4_mode_8]
4547
+ pmulhrsw m0, [pw_1024]
4548
+ packuswb m0, m0
4549
+
4550
+ INTRA_PRED_TRANS_STORE_4x4
4551
+ RET
4552
+
4553
+INIT_YMM avx2
4554
+cglobal intra_pred_ang4_9, 3, 3, 1
4555
+ vbroadcasti128 m0, [r2]
4556
+ pshufb m0, [intra_pred4_shuff9]
4557
+ pmaddubsw m0, [c_ang4_mode_9]
4558
+ pmulhrsw m0, [pw_1024]
4559
+ packuswb m0, m0
4560
+
4561
+ INTRA_PRED_TRANS_STORE_4x4
4562
+ RET
4563
+
4564
+INIT_YMM avx2
4565
+cglobal intra_pred_ang4_11, 3, 3, 1
4566
+ vbroadcasti128 m0, [r2]
4567
+ pshufb m0, [intra_pred4_shuff12]
4568
+ pmaddubsw m0, [c_ang4_mode_11]
4569
+ pmulhrsw m0, [pw_1024]
4570
+ packuswb m0, m0
4571
+
4572
+ INTRA_PRED_TRANS_STORE_4x4
4573
+ RET
4574
+
4575
+INIT_YMM avx2
4576
+cglobal intra_pred_ang4_12, 3, 3, 1
4577
+ vbroadcasti128 m0, [r2]
4578
+ pshufb m0, [intra_pred4_shuff12]
4579
+ pmaddubsw m0, [c_ang4_mode_12]
4580
+ pmulhrsw m0, [pw_1024]
4581
+ packuswb m0, m0
4582
+
4583
+ INTRA_PRED_TRANS_STORE_4x4
4584
+ RET
4585
+
4586
+INIT_YMM avx2
4587
+cglobal intra_pred_ang4_13, 3, 3, 1
4588
+ vbroadcasti128 m0, [r2]
4589
+ pshufb m0, [intra_pred4_shuff13]
4590
+ pmaddubsw m0, [c_ang4_mode_13]
4591
+ pmulhrsw m0, [pw_1024]
4592
+ packuswb m0, m0
4593
+
4594
+ INTRA_PRED_TRANS_STORE_4x4
4595
+ RET
4596
+
4597
+INIT_YMM avx2
4598
+cglobal intra_pred_ang4_14, 3, 3, 1
4599
+ vbroadcasti128 m0, [r2]
4600
+ pshufb m0, [intra_pred4_shuff14]
4601
+ pmaddubsw m0, [c_ang4_mode_14]
4602
+ pmulhrsw m0, [pw_1024]
4603
+ packuswb m0, m0
+
+ INTRA_PRED_TRANS_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_15, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff15]
+ pmaddubsw m0, [c_ang4_mode_15]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_TRANS_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_16, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff16]
+ pmaddubsw m0, [c_ang4_mode_16]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_TRANS_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_17, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff17]
+ pmaddubsw m0, [c_ang4_mode_17]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_TRANS_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_19, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff19]
+ pmaddubsw m0, [c_ang4_mode_19]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_20, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff20]
+ pmaddubsw m0, [c_ang4_mode_20]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_21, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff21]
+ pmaddubsw m0, [c_ang4_mode_21]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_22, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff22]
+ pmaddubsw m0, [c_ang4_mode_22]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_23, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred4_shuff23]
+ pmaddubsw m0, [c_ang4_mode_23]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_24, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred_shuff_0_4]
+ pmaddubsw m0, [c_ang4_mode_24]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
+
+INIT_YMM avx2
+cglobal intra_pred_ang4_25, 3, 3, 1
+ vbroadcasti128 m0, [r2]
+ pshufb m0, [intra_pred_shuff_0_4]
+ pmaddubsw m0, [c_ang4_mode_25]
+ pmulhrsw m0, [pw_1024]
+ packuswb m0, m0
+
+ INTRA_PRED_STORE_4x4
+ RET
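Each of the AVX2 `intra_pred_ang4_*` kernels above follows the same pattern: gather neighbours with `pshufb`, apply the two HEVC angular weights with `pmaddubsw`, then round with `pmulhrsw` against `pw_1024`. That final step works because `pmulhrsw` computes `(a*b + 2^14) >> 15`, which for `b = 1024` is exactly the HEVC angular-prediction rounding `(a + 16) >> 5`. A minimal Python model of that identity:

```python
def pmulhrsw(a, b):
    # Per-word model of the SSSE3 pmulhrsw instruction:
    # signed multiply, add the rounding bit, keep the high bits.
    return (a * b + (1 << 14)) >> 15

# With b = 1024 this collapses to the HEVC intra-prediction
# rounding (a + 16) >> 5 used by every kernel above.
for a in range(-32768, 32768):
    assert pmulhrsw(a, 1024) == (a + 16) >> 5
```

Folding the rounding into the multiply this way saves the explicit add-16/shift-right-5 pair that the SSE2 paths below still perform with `paddw`/`psraw`.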
x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8_allangs.asm
Changed
;* Copyright (C) 2013 x265 project
;*
;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
-;* Praveen Tiwari <praveen@multicorewareinc.com>
+;* Praveen Tiwari <praveen@multicorewareinc.com>
;*
;* This program is free software; you can redistribute it and/or modify
;* it under the terms of the GNU General Public License as published by
SECTION_RODATA 32
+all_ang4_shuff: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+ db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+ db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+ db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
+ db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12
+ db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11
+ db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11
+ db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10
+ db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10
+ db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9
+ db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0
+ db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1
+ db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2
+ db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3
+ db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+ db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
+ db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+ db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+ db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6
+ db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+ db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8
+ db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8
+
+all_ang4: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+ db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+ db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+ db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+ db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+ db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+ db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+ db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+ db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+ db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+ db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+ db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+ db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+ db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+ db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+ db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+ db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+ db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+ db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+ db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+ db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+ db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+ db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+ db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+ db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+ db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+ db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+ db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+
+
SECTION .text
; global constant
; common constant with intrapred8.asm
cextern ang_table
+cextern pw_ang_table
cextern tab_S1
cextern tab_S2
cextern tab_Si
+cextern pw_16
+cextern pb_000000000000000F
+cextern pb_0000000000000F0F
+cextern pw_FFFFFFFFFFFFFFF0
;-----------------------------------------------------------------------------
palignr m4, m2, m1, 14
movu [r0 + 2111 * 16], m4
RET
+
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal all_angs_pred_4x4, 4, 4, 6
+
+ mova m5, [pw_1024]
+ lea r2, [all_ang4]
+ lea r3, [all_ang4_shuff]
+
+; mode 2
+
+ vbroadcasti128 m0, [r1 + 9]
+ mova xm1, xm0
+ psrldq xm1, 1
+ pshufb xm1, [r3]
+ movu [r0], xm1
+
+; mode 3
+
+ pshufb m1, m0, [r3 + 1 * mmsize]
+ pmaddubsw m1, [r2]
+ pmulhrsw m1, m5
+
+; mode 4
+
+ pshufb m2, m0, [r3 + 2 * mmsize]
+ pmaddubsw m2, [r2 + 1 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (3 - 2) * 16], m1
+
+; mode 5
+
+ pshufb m1, m0, [r3 + 2 * mmsize]
+ pmaddubsw m1, [r2 + 2 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 6
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 3 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (5 - 2) * 16], m1
+
+ add r3, 4 * mmsize
+ add r2, 4 * mmsize
+
+; mode 7
+
+ pshufb m1, m0, [r3 + 0 * mmsize]
+ pmaddubsw m1, [r2 + 0 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 8
+
+ pshufb m2, m0, [r3 + 1 * mmsize]
+ pmaddubsw m2, [r2 + 1 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (7 - 2) * 16], m1
+
+; mode 9
+
+ pshufb m1, m0, [r3 + 1 * mmsize]
+ pmaddubsw m1, [r2 + 2 * mmsize]
+ pmulhrsw m1, m5
+ packuswb m1, m1
+ vpermq m1, m1, 11011000b
+ movu [r0 + (9 - 2) * 16], xm1
+
+; mode 10
+
+ pshufb xm1, xm0, [r3 + 2 * mmsize]
+ movu [r0 + (10 - 2) * 16], xm1
+
+ pxor xm1, xm1
+ movd xm2, [r1 + 1]
+ pshufd xm3, xm2, 0
+ punpcklbw xm3, xm1
+ pinsrb xm2, [r1], 0
+ pshufb xm4, xm2, xm1
+ punpcklbw xm4, xm1
+ psubw xm3, xm4
+ psraw xm3, 1
+ pshufb xm4, xm0, xm1
+ punpcklbw xm4, xm1
+ paddw xm3, xm4
+ packuswb xm3, xm1
+
+ pextrb [r0 + 128], xm3, 0
+ pextrb [r0 + 132], xm3, 1
+ pextrb [r0 + 136], xm3, 2
+ pextrb [r0 + 140], xm3, 3
+
+; mode 11
+
+ vbroadcasti128 m0, [r1]
+ pshufb m1, m0, [r3 + 3 * mmsize]
+ pmaddubsw m1, [r2 + 3 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 12
+
+ add r2, 4 * mmsize
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 0 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (11 - 2) * 16], m1
+
+; mode 13
+
+ add r3, 4 * mmsize
+
+ pshufb m1, m0, [r3 + 0 * mmsize]
+ pmaddubsw m1, [r2 + 1 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 14
+
+ pshufb m2, m0, [r3 + 1 * mmsize]
+ pmaddubsw m2, [r2 + 2 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (13 - 2) * 16], m1
+
+; mode 15
+
+ pshufb m1, m0, [r3 + 2 * mmsize]
+ pmaddubsw m1, [r2 + 3 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 16
+
+ add r2, 4 * mmsize
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 0 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (15 - 2) * 16], m1
+
+; mode 17
+
+ add r3, 4 * mmsize
+
+ pshufb m1, m0, [r3 + 0 * mmsize]
+ pmaddubsw m1, [r2 + 1 * mmsize]
+ pmulhrsw m1, m5
+ packuswb m1, m1
+ vpermq m1, m1, 11011000b
+
+; mode 18
+
+ pshufb m2, m0, [r3 + 1 * mmsize]
+ vinserti128 m1, m1, xm2, 1
+ movu [r0 + (17 - 2) * 16], m1
+
+; mode 19
+
+ pshufb m1, m0, [r3 + 2 * mmsize]
+ pmaddubsw m1, [r2 + 2 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 20
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 3 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (19 - 2) * 16], m1
+
+; mode 21
+
+ add r2, 4 * mmsize
+ add r3, 4 * mmsize
+
+ pshufb m1, m0, [r3 + 0 * mmsize]
+ pmaddubsw m1, [r2 + 0 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 22
+
+ pshufb m2, m0, [r3 + 1 * mmsize]
+ pmaddubsw m2, [r2 + 1 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (21 - 2) * 16], m1
+
+; mode 23
+
+ pshufb m1, m0, [r3 + 2 * mmsize]
+ pmaddubsw m1, [r2 + 2 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 24
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 3 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (23 - 2) * 16], m1
+
+; mode 25
+
+ add r2, 4 * mmsize
+
+ pshufb m1, m0, [r3 + 3 * mmsize]
+ pmaddubsw m1, [r2 + 0 * mmsize]
+ pmulhrsw m1, m5
+ packuswb m1, m1
+ vpermq m1, m1, 11011000b
+ movu [r0 + (25 - 2) * 16], xm1
+
+; mode 26
+
+ add r3, 4 * mmsize
+
+ pshufb xm1, xm0, [r3 + 0 * mmsize]
+ movu [r0 + (26 - 2) * 16], xm1
+
+ pxor xm1, xm1
+ movd xm2, [r1 + 9]
+ pshufd xm3, xm2, 0
+ punpcklbw xm3, xm1
+ pinsrb xm4, [r1 + 0], 0
+ pshufb xm4, xm1
+ punpcklbw xm4, xm1
+ psubw xm3, xm4
+ psraw xm3, 1
+ psrldq xm2, xm0, 1
+ pshufb xm2, xm1
+ punpcklbw xm2, xm1
+ paddw xm3, xm2
+ packuswb xm3, xm1
+
+ pextrb [r0 + 384], xm3, 0
+ pextrb [r0 + 388], xm3, 1
+ pextrb [r0 + 392], xm3, 2
+ pextrb [r0 + 396], xm3, 3
+
+; mode 27
+
+ pshufb m1, m0, [r3 + 1 * mmsize]
+ pmaddubsw m1, [r2 + 1 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 28
+
+ pshufb m2, m0, [r3 + 1 * mmsize]
+ pmaddubsw m2, [r2 + 2 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (27 - 2) * 16], m1
+
+; mode 29
+
+ pshufb m1, m0, [r3 + 2 * mmsize]
+ pmaddubsw m1, [r2 + 3 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 30
+
+ add r2, 4 * mmsize
+
+ pshufb m2, m0, [r3 + 3 * mmsize]
+ pmaddubsw m2, [r2 + 0 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (29 - 2) * 16], m1
+
+; mode 31
+
+ add r3, 4 * mmsize
+
+ pshufb m1, m0, [r3 + 0 * mmsize]
+ pmaddubsw m1, [r2 + 1 * mmsize]
+ pmulhrsw m1, m5
+
+; mode 32
+
+ pshufb m2, m0, [r3 + 0 * mmsize]
+ pmaddubsw m2, [r2 + 2 * mmsize]
+ pmulhrsw m2, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+ movu [r0 + (31 - 2) * 16], m1
+
+; mode 33
+
+ pshufb m1, m0, [r3 + 1 * mmsize]
+ pmaddubsw m1, [r2 + 3 * mmsize]
+ pmulhrsw m1, m5
+ packuswb m1, m2
+ vpermq m1, m1, 11011000b
+
+; mode 34
+
+ pshufb m0, [r3 + 2 * mmsize]
+ vinserti128 m1, m1, xm0, 1
+ movu [r0 + (33 - 2) * 16], m1
+ RET
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal all_angs_pred_4x4, 4, 4, 8
+
+; mode 2
+
+ movh m6, [r1 + 9]
+ mova m2, m6
+ psrldq m2, 1
+ movd [r0], m2 ;byte[A, B, C, D]
+ psrldq m2, 1
+ movd [r0 + 4], m2 ;byte[B, C, D, E]
+ psrldq m2, 1
+ movd [r0 + 8], m2 ;byte[C, D, E, F]
+ psrldq m2, 1
+ movd [r0 + 12], m2 ;byte[D, E, F, G]
+
+; mode 10/26
+
+ pxor m7, m7
+ pshufd m5, m6, 0
+ mova [r0 + 128], m5 ;mode 10 byte[9, A, B, C, 9, A, B, C, 9, A, B, C, 9, A, B, C]
+
+ movd m4, [r1 + 1]
+ pshufd m4, m4, 0
+ mova [r0 + 384], m4 ;mode 26 byte[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
+
+ movd m1, [r1]
+ punpcklbw m1, m7
+ pshuflw m1, m1, 0x00
+ punpcklqdq m1, m1 ;m1 = byte[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+
+ punpckldq m4, m5
+ punpcklbw m4, m7 ;m4 = word[1, 2, 3, 4, 9, A, B, C]
+ pshuflw m2, m4, 0x00
+ pshufhw m2, m2, 0x00 ;m2 = word[1, 1, 1, 1, 9, 9, 9, 9]
+
+ psubw m4, m1
+ psraw m4, 1
+
+ pshufd m2, m2, q1032 ;m2 = word[9, 9, 9, 9, 1, 1, 1, 1]
+ paddw m4, m2
+ packuswb m4, m4
+
+%if ARCH_X86_64
+ movq r2, m4
+
+ mov [r0 + 128], r2b ;mode 10
+ shr r2, 8
+ mov [r0 + 132], r2b
+ shr r2, 8
+ mov [r0 + 136], r2b
+ shr r2, 8
+ mov [r0 + 140], r2b
+ shr r2, 8
+ mov [r0 + 384], r2b ;mode 26
+ shr r2d, 8
+ mov [r0 + 388], r2b
+ shr r2d, 8
+ mov [r0 + 392], r2b
+ shr r2d, 8
+ mov [r0 + 396], r2b
+
+%else
+ movd r2d, m4
+
+ mov [r0 + 128], r2b ;mode 10
+ shr r2d, 8
+ mov [r0 + 132], r2b
+ shr r2d, 8
+ mov [r0 + 136], r2b
+ shr r2d, 8
+ mov [r0 + 140], r2b
+
+ psrldq m4, 4
+ movd r2d, m4
+
+ mov [r0 + 384], r2b ;mode 26
+ shr r2d, 8
+ mov [r0 + 388], r2b
+ shr r2d, 8
+ mov [r0 + 392], r2b
+ shr r2d, 8
+ mov [r0 + 396], r2b
+%endif
+
+; mode 3
+
+ mova m2, [pw_16]
+ lea r3, [pw_ang_table + 7 * 16]
+ lea r2, [pw_ang_table + 23 * 16]
+ punpcklbw m6, m6
+ psrldq m6, 1
+ movh m1, m6
+ psrldq m6, 2
+ movh m0, m6
+ psrldq m6, 2
+ movh m3, m6
+ psrldq m6, 2
+ punpcklbw m1, m7 ;m1 = word[9, A, A, B, B, C, C, D]
+ punpcklbw m0, m7 ;m0 = word[A, B, B, C, C, D, D, E]
+ punpcklbw m3, m7 ;m3 = word[B, C, C, D, D, E, E, F]
+ punpcklbw m6, m7 ;m6 = word[C, D, D, E, E, F, F, G]
+
+ mova m7, [r2 - 3 * 16]
+
+ pmaddwd m5, m1, [r2 + 3 * 16]
+ pmaddwd m4, m0, m7
+
+ packssdw m5, m4
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m3, [r3 + 7 * 16]
+ pmaddwd m6, [r3 + 1 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 16], m5
+ movd [r0 + 68], m5 ;mode 6 row 1
+ psrldq m5, 4
+ movd [r0 + 76], m5 ;mode 6 row 3
+
+; mode 4
+
+ pmaddwd m4, m0, [r2 + 8 * 16]
+ pmaddwd m6, m3, m7
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m5, m1, [r2 - 2 * 16]
+ pmaddwd m6, m0, [r3 + 3 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ packuswb m5, m4
+ mova [r0 + 32], m5
+
+; mode 5
+
+ pmaddwd m5, m1, [r2 - 6 * 16]
+ pmaddwd m6, m0, [r3 - 5 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m0, [r2 - 4 * 16]
+ pmaddwd m3, [r3 - 3 * 16]
+
+ packssdw m4, m3
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 48], m5
+
+; mode 6
+
+ pmaddwd m5, m1, [r3 + 6 * 16]
+ pmaddwd m6, m0, [r3 + 0 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ packuswb m5, m6
+ movd [r0 + 64], m5
+ psrldq m5, 4
+ movd [r0 + 72], m5
+
+; mode 7
+
+ pmaddwd m5, m1, [r3 + 2 * 16]
+ pmaddwd m6, m1, [r2 - 5 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ mova m3, [r2 + 4 * 16]
+ pmaddwd m4, m1, m3
+ pmaddwd m0, [r3 - 3 * 16]
+
+ packssdw m4, m0
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 80], m5
+
+; mode 8
+
+ mova m0, [r3 - 2 * 16]
+ pmaddwd m5, m1, m0
+ pmaddwd m6, m1, [r3 + 3 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m1, [r3 + 8 * 16]
+ pmaddwd m7, m1
+
+ packssdw m4, m7
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 96], m5
+
+; mode 9
+
+ pmaddwd m5, m1, [r3 - 5 * 16]
+ pmaddwd m6, m1, [r3 - 3 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m1, [r3 - 1 * 16]
+ pmaddwd m6, m1, [r3 + 1 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 112], m5
+
+; mode 11
+
+ movd m5, [r1]
+ punpcklwd m5, m1
+ pand m5, [pb_0000000000000F0F]
+ pslldq m1, 4
+ por m1, m5 ;m1 = word[0, 9, 9, A, A, B, B, C]
+
+ pmaddwd m5, m1, [r2 + 7 * 16]
+ pmaddwd m6, m1, [r2 + 5 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m1, [r2 + 3 * 16]
+ pmaddwd m6, m1, [r2 + 1 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 144], m5
+
+; mode 12
+
+ pmaddwd m3, m1
+ pmaddwd m6, m1, [r2 - 1 * 16]
+
+ packssdw m3, m6
+ paddw m3, m2
+ psraw m3, 5
+
+ pmaddwd m4, m1, [r2 - 6 * 16]
+ pmaddwd m6, m1, [r3 + 5 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m3, m4
+ mova [r0 + 160], m3
+
+; mode 13
+
+ mova m3, m1
+ movd m7, [r1 + 4]
+ punpcklwd m7, m1
+ pand m7, [pb_0000000000000F0F]
+ pslldq m3, 4
+ por m3, m7 ;m3 = word[4, 0, 0, 9, 9, A, A, B]
+
+ pmaddwd m5, m1, [r2 + 0 * 16]
+ pmaddwd m6, m1, [r3 + 7 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m1, m0
+ pmaddwd m6, m3, [r2 + 5 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 176], m5
+
+; mode 14
+
+ pmaddwd m5, m1, [r2 - 4 * 16]
+ pmaddwd m6, m1, [r3 - 1 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ movd m6, [r1 + 2]
+ pand m3, [pw_FFFFFFFFFFFFFFF0]
+ pand m6, [pb_000000000000000F]
+ por m3, m6 ;m3 = word[2, 0, 0, 9, 9, A, A, B]
+
+ pmaddwd m4, m3, [r2 + 2 * 16]
+ pmaddwd m6, m3, [r3 + 5 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 192], m5
+ psrldq m5, 4
+ movd [r0 + 240], m5 ;mode 17 row 0
+
+; mode 15
+
+ pmaddwd m5, m1, [r3 + 8 * 16]
+ pmaddwd m6, m3, [r2 + 7 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m6, m3, [r3 + 6 * 16]
+
+ mova m0, m3
+ punpcklwd m7, m3
+ pslldq m0, 4
+ pand m7, [pb_0000000000000F0F]
+ por m0, m7 ;m0 = word[4, 2, 2, 0, 0, 9, 9, A]
+
+ pmaddwd m4, m0, [r2 + 5 * 16]
+
+ packssdw m6, m4
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m5, m6
+ mova [r0 + 208], m5
+
+; mode 16
+
+ pmaddwd m5, m1, [r3 + 4 * 16]
+ pmaddwd m6, m3, [r2 - 1 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m3, [r3 - 6 * 16]
+
+ movd m6, [r1 + 3]
+ pand m0, [pw_FFFFFFFFFFFFFFF0]
+ pand m6, [pb_000000000000000F]
+ por m0, m6 ;m0 = word[3, 2, 2, 0, 0, 9, 9, A]
+
+ pmaddwd m0, [r3 + 5 * 16]
+ packssdw m3, m0
+ paddw m3, m2
+ psraw m3, 5
+
+ packuswb m5, m3
+ mova [r0 + 224], m5
+
+; mode 17
+
+ movd m4, [r1 + 1]
+ punpcklwd m4, m1
+ pand m4, [pb_0000000000000F0F]
+ pslldq m1, 4
+ por m1, m4 ;m1 = word[1, 0, 0, 9, 9, A, A, B]
+
+ pmaddwd m6, m1, [r3 + 5 * 16]
+
+ packssdw m6, m6
+ paddw m6, m2
+ psraw m6, 5
+
+ movd m5, [r1 + 2]
+ punpcklwd m5, m1
+ pand m5, [pb_0000000000000F0F]
+ pslldq m1, 4
+ por m1, m5 ;m1 = word[2, 1, 1, 0, 0, 9, 9, A]
+
+ pmaddwd m4, m1, [r2 - 5 * 16]
+
+ punpcklwd m7, m1
+ pand m7, [pb_0000000000000F0F]
+ pslldq m1, 4
+ por m1, m7 ;m1 = word[4, 2, 2, 1, 1, 0, 0, 9]
+
+ pmaddwd m1, [r2 + 1 * 16]
+ packssdw m4, m1
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m6, m4
+ movd [r0 + 244], m6
+ psrldq m6, 8
+ movh [r0 + 248], m6
+
+; mode 18
+
+ movh m1, [r1]
+ movd [r0 + 256], m1 ;byte[0, 1, 2, 3]
+
+ movh m3, [r1 + 2]
+ punpcklqdq m3, m1
+ psrldq m3, 7
+ movd [r0 + 260], m3 ;byte[2, 1, 0, 9]
+
+ movh m4, [r1 + 3]
+ punpcklqdq m4, m3
+ psrldq m4, 7
+ movd [r0 + 264], m4 ;byte[1, 0, 9, A]
+
+ movh m0, [r1 + 4]
+ punpcklqdq m0, m4
+ psrldq m0, 7
+ movd [r0 + 268], m0 ;byte[0, 9, A, B]
+
+; mode 19
+
+ pxor m7, m7
+ punpcklbw m4, m3
+ punpcklbw m3, m1
+ punpcklbw m1, m1
+ punpcklbw m4, m7 ;m4 = word[A, 9, 9, 0, 0, 1, 1, 2]
+ punpcklbw m3, m7 ;m3 = word[9, 0, 0, 1, 1, 2, 2, 3]
+ psrldq m1, 1
+ punpcklbw m1, m7 ;m1 = word[0, 1, 1, 2, 2, 3, 3, 4]
+
+ pmaddwd m6, m1, [r3 - 1 * 16]
+ pmaddwd m7, m3, [r3 + 5 * 16]
+
+ packssdw m6, m7
+ paddw m6, m2
+ psraw m6, 5
+
+ pmaddwd m5, m4, [r2 - 5 * 16]
+
+ movd m7, [r1 + 12]
+ punpcklwd m7, m4
+ pand m7, [pb_0000000000000F0F]
+ pslldq m4, 4
+ por m4, m7 ;m4 = word[C, A, A, 9, 9, 0, 0, 1]
+
+ pmaddwd m4, [r2 + 1 * 16]
+ packssdw m5, m4
+ paddw m5, m2
+ psraw m5, 5
+
+ packuswb m6, m5
+ mova [r0 + 272], m6
+ movd [r0 + 324], m6 ;mode 22 row 1
+
+; mode 20
+
+ pmaddwd m5, m1, [r3 + 4 * 16]
+
+ movd m4, [r1 + 10]
+ pand m3, [pw_FFFFFFFFFFFFFFF0]
+ pand m4, [pb_000000000000000F]
+ por m3, m4 ;m3 = word[A, 0, 0, 1, 1, 2, 2, 3]
+
+ pmaddwd m6, m3, [r2 - 1 * 16]
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ pmaddwd m4, m3, [r3 - 6 * 16]
+
+ punpcklwd m0, m3
+ pand m0, [pb_0000000000000F0F]
+ mova m6, m3
+ pslldq m6, 4
+ por m0, m6 ;m0 = word[B, A, A, 0, 0, 1, 1, 2]
+
+ pmaddwd m6, m0, [r3 + 5 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ packuswb m5, m4
+ mova [r0 + 288], m5
+
+; mode 21
+
+ pmaddwd m4, m1, [r3 + 8 * 16]
+ pmaddwd m6, m3, [r2 + 7 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m5, m3, [r3 + 6 * 16]
+
+ pand m0, [pw_FFFFFFFFFFFFFFF0]
+ pand m7, [pb_000000000000000F]
+ por m0, m7 ;m0 = word[C, A, A, 0, 0, 1, 1, 2]
+
+ pmaddwd m0, [r2 + 5 * 16]
+ packssdw m5, m0
+ paddw m5, m2
+ psraw m5, 5
+
+ packuswb m4, m5
+ mova [r0 + 304], m4
+
+; mode 22
+
+ pmaddwd m4, m1, [r2 - 4 * 16]
+ packssdw m4, m4
+ paddw m4, m2
+ psraw m4, 5
+
+ mova m0, [r3 + 5 * 16]
+ pmaddwd m5, m3, [r2 + 2 * 16]
+ pmaddwd m6, m3, m0
+
+ packssdw m5, m6
+ paddw m5, m2
+ psraw m5, 5
+
+ packuswb m4, m5
+ movd [r0 + 320], m4
+ psrldq m4, 8
+ movh [r0 + 328], m4
+
+; mode 23
+
+ pmaddwd m4, m1, [r2 + 0 * 16]
+ pmaddwd m5, m1, [r3 + 7 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r3 - 2 * 16]
+
+ pand m3, [pw_FFFFFFFFFFFFFFF0]
+ por m3, m7 ;m3 = word[C, 0, 0, 1, 1, 2, 2, 3]
+
+ pmaddwd m3, [r2 + 5 * 16]
+ packssdw m6, m3
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 336], m4
+
+; mode 24
+
+ pmaddwd m4, m1, [r2 + 4 * 16]
+ pmaddwd m5, m1, [r2 - 1 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r2 - 6 * 16]
+ pmaddwd m0, m1
+
+ packssdw m6, m0
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 352], m4
+
+; mode 25
+
+ pmaddwd m4, m1, [r2 + 7 * 16]
+ pmaddwd m5, m1, [r2 + 5 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r2 + 3 * 16]
+ pmaddwd m1, [r2 + 1 * 16]
+
+ packssdw m6, m1
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 368], m4
+
+; mode 27
+
+ movh m0, [r1 + 1]
+ pxor m7, m7
+ punpcklbw m0, m0
+ psrldq m0, 1
+ movh m1, m0
+ psrldq m0, 2
+ movh m3, m0
+ psrldq m0, 2
+ punpcklbw m1, m7 ;m1 = word[1, 2, 2, 3, 3, 4, 4, 5]
+ punpcklbw m3, m7 ;m3 = word[2, 3, 3, 4, 4, 5, 5, 6]
+ punpcklbw m0, m7 ;m0 = word[3, 4, 4, 5, 5, 6, 6, 7]
+
+ mova m7, [r3 - 3 * 16]
+
+ pmaddwd m4, m1, [r3 - 5 * 16]
+ pmaddwd m5, m1, m7
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r3 - 1 * 16]
+ pmaddwd m5, m1, [r3 + 1 * 16]
+
+ packssdw m6, m5
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 400], m4
+
+; mode 28
+
+ pmaddwd m4, m1, [r3 - 2 * 16]
+ pmaddwd m5, m1, [r3 + 3 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r3 + 8 * 16]
+ pmaddwd m5, m1, [r2 - 3 * 16]
+
+ packssdw m6, m5
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 416], m4
+
+; mode 29
+
+ pmaddwd m4, m1, [r3 + 2 * 16]
+ pmaddwd m6, m1, [r2 - 5 * 16]
+
+ packssdw m4, m6
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m1, [r2 + 4 * 16]
+ pmaddwd m5, m3, m7
+
+ packssdw m6, m5
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 432], m4
+
+; mode 30
+
+ pmaddwd m4, m1, [r3 + 6 * 16]
+ pmaddwd m5, m1, [r2 + 3 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m3, [r3 + 0 * 16]
+ pmaddwd m5, m3, [r2 - 3 * 16]
+
+ packssdw m6, m5
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 448], m4
+ psrldq m4, 4
+ movh [r0 + 496], m4 ;mode 33 row 0
+ psrldq m4, 8
+ movd [r0 + 500], m4 ;mode 33 row 1
+
+; mode 31
+
+ pmaddwd m4, m1, [r2 - 6 * 16]
+ pmaddwd m5, m3, [r3 - 5 * 16]
+
+ packssdw m4, m5
+ paddw m4, m2
+ psraw m4, 5
+
+ pmaddwd m6, m3, [r2 - 4 * 16]
+ pmaddwd m7, m0
+
+ packssdw m6, m7
+ paddw m6, m2
+ psraw m6, 5
+
+ packuswb m4, m6
+ mova [r0 + 464], m4
+
+; mode 32
+
+ pmaddwd m1, [r2 - 2 * 16]
+ pmaddwd m5, m3, [r3 + 3 * 16]
+
+ packssdw m1, m5
+ paddw m1, m2
+ psraw m1, 5
+
+ pmaddwd m3, [r2 + 8 * 16]
+ pmaddwd m5, m0, [r2 - 3 * 16]
+ packssdw m3, m5
+ paddw m3, m2
+ psraw m3, 5
+
+ packuswb m1, m3
+ mova [r0 + 480], m1
+
+; mode 33
+
+ pmaddwd m0, [r3 + 7 * 16]
+ pxor m7, m7
+ movh m4, [r1 + 4]
+ punpcklbw m4, m4
+ psrldq m4, 1
+ punpcklbw m4, m7
+
+ pmaddwd m4, [r3 + 1 * 16]
+
+ packssdw m0, m4
+ paddw m0, m2
+ psraw m0, 5
+
+ packuswb m0, m0
+ movh [r0 + 504], m0
+
+; mode 34
+
+ movh m7, [r1 + 2]
+ movd [r0 + 512], m7 ;byte[2, 3, 4, 5]
+
+ psrldq m7, 1
+ movd [r0 + 516], m7 ;byte[3, 4, 5, 6]
+
+ psrldq m7, 1
+ movd [r0 + 520], m7 ;byte[4, 5, 6, 7]
+
+ psrldq m7, 1
+ movd [r0 + 524], m7 ;byte[5, 6, 7, 8]
+
RET
x265_1.6.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter16.asm
Changed
times 8 dw 58, -10
times 8 dw 4, -1

+const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7
+
SECTION .text
cextern pd_32
cextern pw_pixel_max
cextern pd_n32768
+cextern pw_2000

;------------------------------------------------------------------------------------------------------------
; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

FILTER_VER_LUMA_SS 64, 16
FILTER_VER_LUMA_SS 16, 64

-;--------------------------------------------------------------------------------------------------
-; void filterConvertPelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
-;--------------------------------------------------------------------------------------------------
-INIT_XMM sse2
-cglobal luma_p2s, 3, 7, 5
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_2xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_2x%1, 3, 6, 2
+ add r1d, r1d
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]

- add r1, r1
+ ; load constant
+ mova m1, [pw_2000]

- ; load width and height
- mov r3d, r3m
- mov r4d, r4m
+%rep %1/4
+ movd m0, [r0]
+ movhps m0, [r0 + r1]
+ psllw m0, 4
+ psubw m0, m1
+
+ movd [r2 + r3 * 0], m0
+ pextrd [r2 + r3 * 1], m0, 2
+
+ movd m0, [r0 + r1 * 2]
+ movhps m0, [r0 + r4]
+ psllw m0, 4
+ psubw m0, m1
+
+ movd [r2 + r3 * 2], m0
+ pextrd [r2 + r5], m0, 2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+%endrep
+ RET
+%endmacro
+P2S_H_2xN 4
+P2S_H_2xN 8
+P2S_H_2xN 16
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_4xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_4x%1, 3, 6, 2
+ add r1d, r1d
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ lea r5, [r1 * 3]

 ; load constant
- mova m4, [tab_c_n8192]
+ mova m1, [pw_2000]

-.loopH:
+%rep %1/4
+ movh m0, [r0]
+ movhps m0, [r0 + r1]
+ psllw m0, 4
+ psubw m0, m1
+ movh [r2 + r3 * 0], m0
+ movhps [r2 + r3 * 1], m0
+
+ movh m0, [r0 + r1 * 2]
+ movhps m0, [r0 + r5]
+ psllw m0, 4
+ psubw m0, m1
+ movh [r2 + r3 * 2], m0
+ movhps [r2 + r4], m0

- xor r5d, r5d
-.loopW:
- lea r6, [r0 + r5 * 2]
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+%endrep
+ RET
+%endmacro
+P2S_H_4xN 4
+P2S_H_4xN 8
+P2S_H_4xN 16
+P2S_H_4xN 32
114
- movu m0, [r6]
115
- psllw m0, 4
116
- paddw m0, m4
117
+;-----------------------------------------------------------------------------
118
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
119
+;-----------------------------------------------------------------------------
120
+INIT_XMM ssse3
121
+cglobal filterPixelToShort_4x2, 3, 4, 1
122
+ add r1d, r1d
123
+ mov r3d, r3m
124
+ add r3d, r3d
125
126
- movu m1, [r6 + r1]
127
- psllw m1, 4
128
- paddw m1, m4
129
+ movh m0, [r0]
130
+ movhps m0, [r0 + r1]
131
+ psllw m0, 4
132
+ psubw m0, [pw_2000]
133
+ movh [r2 + r3 * 0], m0
134
+ movhps [r2 + r3 * 1], m0
135
136
- movu m2, [r6 + r1 * 2]
137
- psllw m2, 4
138
- paddw m2, m4
139
-
140
- lea r6, [r6 + r1 * 2]
141
- movu m3, [r6 + r1]
142
- psllw m3, 4
143
- paddw m3, m4
144
+ RET
145
146
- add r5, 8
147
- cmp r5, r3
148
- jg .width4
149
- movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
150
- movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
151
- movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
152
- movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
153
- je .nextH
154
- jmp .loopW
155
+;-----------------------------------------------------------------------------
156
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
157
+;-----------------------------------------------------------------------------
158
+%macro P2S_H_6xN 1
159
+INIT_XMM sse4
160
+cglobal filterPixelToShort_6x%1, 3, 7, 3
161
+ add r1d, r1d
162
+ mov r3d, r3m
163
+ add r3d, r3d
164
+ lea r4, [r3 * 3]
165
+ lea r5, [r1 * 3]
166
167
-.width4:
168
- movh [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
169
- movh [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
170
- movh [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
171
- movh [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
172
+ ; load height
173
+ mov r6d, %1/4
174
175
-.nextH:
176
- lea r0, [r0 + r1 * 4]
177
- add r2, FENC_STRIDE * 8
178
+ ; load constant
179
+ mova m2, [pw_2000]
180
181
- sub r4d, 4
182
- jnz .loopH
183
+.loop
184
+ movu m0, [r0]
185
+ movu m1, [r0 + r1]
186
+ psllw m0, 4
187
+ psubw m0, m2
188
+ psllw m1, 4
189
+ psubw m1, m2
190
+
191
+ movh [r2 + r3 * 0], m0
192
+ pextrd [r2 + r3 * 0 + 8], m0, 2
193
+ movh [r2 + r3 * 1], m1
194
+ pextrd [r2 + r3 * 1 + 8], m1, 2
195
+
196
+ movu m0, [r0 + r1 * 2]
197
+ movu m1, [r0 + r5]
198
+ psllw m0, 4
199
+ psubw m0, m2
200
+ psllw m1, 4
201
+ psubw m1, m2
202
+
203
+ movh [r2 + r3 * 2], m0
204
+ pextrd [r2 + r3 * 2 + 8], m0, 2
205
+ movh [r2 + r4], m1
206
+ pextrd [r2 + r4 + 8], m1, 2
207
+
208
+ lea r0, [r0 + r1 * 4]
209
+ lea r2, [r2 + r3 * 4]
210
+
211
+ dec r6d
212
+ jnz .loop
213
+ RET
214
+%endmacro
215
+P2S_H_6xN 8
216
+P2S_H_6xN 16
217
+
218
+;-----------------------------------------------------------------------------
219
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
220
+;-----------------------------------------------------------------------------
221
+%macro P2S_H_8xN 1
222
+INIT_XMM ssse3
223
+cglobal filterPixelToShort_8x%1, 3, 7, 2
224
+ add r1d, r1d
225
+ mov r3d, r3m
226
+ add r3d, r3d
227
+ lea r4, [r3 * 3]
228
+ lea r5, [r1 * 3]
229
+
230
+ ; load height
231
+ mov r6d, %1/4
232
+
233
+ ; load constant
234
+ mova m1, [pw_2000]
235
+
236
+.loop
237
+ movu m0, [r0]
238
+ psllw m0, 4
239
+ psubw m0, m1
240
+ movu [r2 + r3 * 0], m0
241
+
242
+ movu m0, [r0 + r1]
243
+ psllw m0, 4
244
+ psubw m0, m1
245
+ movu [r2 + r3 * 1], m0
246
+
247
+ movu m0, [r0 + r1 * 2]
248
+ psllw m0, 4
249
+ psubw m0, m1
250
+ movu [r2 + r3 * 2], m0
251
+
252
+ movu m0, [r0 + r5]
253
+ psllw m0, 4
254
+ psubw m0, m1
255
+ movu [r2 + r4], m0
256
+
257
+ lea r0, [r0 + r1 * 4]
258
+ lea r2, [r2 + r3 * 4]
259
+
260
+ dec r6d
261
+ jnz .loop
262
+ RET
263
+%endmacro
264
+P2S_H_8xN 8
265
+P2S_H_8xN 4
266
+P2S_H_8xN 16
267
+P2S_H_8xN 32
268
+P2S_H_8xN 12
269
+P2S_H_8xN 64
270
+
271
+;-----------------------------------------------------------------------------
272
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
273
+;-----------------------------------------------------------------------------
274
+INIT_XMM ssse3
275
+cglobal filterPixelToShort_8x2, 3, 4, 2
276
+ add r1d, r1d
277
+ mov r3d, r3m
278
+ add r3d, r3d
279
+
280
+ movu m0, [r0]
281
+ movu m1, [r0 + r1]
282
+
283
+ psllw m0, 4
284
+ psubw m0, [pw_2000]
285
+ psllw m1, 4
286
+ psubw m1, [pw_2000]
287
+
288
+ movu [r2 + r3 * 0], m0
289
+ movu [r2 + r3 * 1], m1
290
+
291
+ RET
292
+
293
+;-----------------------------------------------------------------------------
294
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
295
+;-----------------------------------------------------------------------------
296
+INIT_XMM ssse3
297
+cglobal filterPixelToShort_8x6, 3, 7, 4
298
+ add r1d, r1d
299
+ mov r3d, r3m
300
+ add r3d, r3d
301
+ lea r4, [r1 * 3]
302
+ lea r5, [r1 * 5]
303
+ lea r6, [r3 * 3]
304
+
305
+ ; load constant
306
+ mova m3, [pw_2000]
307
+
308
+ movu m0, [r0]
309
+ movu m1, [r0 + r1]
310
+ movu m2, [r0 + r1 * 2]
311
+
312
+ psllw m0, 4
313
+ psubw m0, m3
314
+ psllw m1, 4
315
+ psubw m1, m3
316
+ psllw m2, 4
317
+ psubw m2, m3
318
+
319
+ movu [r2 + r3 * 0], m0
320
+ movu [r2 + r3 * 1], m1
321
+ movu [r2 + r3 * 2], m2
322
+
323
+ movu m0, [r0 + r4]
324
+ movu m1, [r0 + r1 * 4]
325
+ movu m2, [r0 + r5 ]
326
+
327
+ psllw m0, 4
328
+ psubw m0, m3
329
+ psllw m1, 4
330
+ psubw m1, m3
331
+ psllw m2, 4
332
+ psubw m2, m3
333
+
334
+ movu [r2 + r6], m0
335
+ movu [r2 + r3 * 4], m1
336
+ lea r2, [r2 + r3 * 4]
337
+ movu [r2 + r3], m2
338
+
339
+ RET
340
+
341
+;-----------------------------------------------------------------------------
342
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
343
+;-----------------------------------------------------------------------------
344
+%macro P2S_H_16xN 1
345
+INIT_XMM ssse3
346
+cglobal filterPixelToShort_16x%1, 3, 7, 3
347
+ add r1d, r1d
348
+ mov r3d, r3m
349
+ add r3d, r3d
350
+ lea r4, [r3 * 3]
351
+ lea r5, [r1 * 3]
352
+
353
+ ; load height
354
+ mov r6d, %1/4
355
+
356
+ ; load constant
357
+ mova m2, [pw_2000]
358
+
359
+.loop
360
+ movu m0, [r0]
361
+ movu m1, [r0 + r1]
362
+ psllw m0, 4
363
+ psubw m0, m2
364
+ psllw m1, 4
365
+ psubw m1, m2
366
+
367
+ movu [r2 + r3 * 0], m0
368
+ movu [r2 + r3 * 1], m1
369
+
370
+ movu m0, [r0 + r1 * 2]
371
+ movu m1, [r0 + r5]
372
+ psllw m0, 4
373
+ psubw m0, m2
374
+ psllw m1, 4
375
+ psubw m1, m2
376
+
377
+ movu [r2 + r3 * 2], m0
378
+ movu [r2 + r4], m1
379
+
380
+ movu m0, [r0 + 16]
381
+ movu m1, [r0 + r1 + 16]
382
+ psllw m0, 4
383
+ psubw m0, m2
384
+ psllw m1, 4
385
+ psubw m1, m2
386
+
387
+ movu [r2 + r3 * 0 + 16], m0
388
+ movu [r2 + r3 * 1 + 16], m1
389
+
390
+ movu m0, [r0 + r1 * 2 + 16]
391
+ movu m1, [r0 + r5 + 16]
392
+ psllw m0, 4
393
+ psubw m0, m2
394
+ psllw m1, 4
395
+ psubw m1, m2
396
+
397
+ movu [r2 + r3 * 2 + 16], m0
398
+ movu [r2 + r4 + 16], m1
399
+
400
+ lea r0, [r0 + r1 * 4]
401
+ lea r2, [r2 + r3 * 4]
402
+
403
+ dec r6d
404
+ jnz .loop
405
+ RET
406
+%endmacro
407
+P2S_H_16xN 16
408
+P2S_H_16xN 4
409
+P2S_H_16xN 8
410
+P2S_H_16xN 12
411
+P2S_H_16xN 32
412
+P2S_H_16xN 64
413
+P2S_H_16xN 24
414
+
415
+;-----------------------------------------------------------------------------
416
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
417
+;-----------------------------------------------------------------------------
418
+%macro P2S_H_16xN_avx2 1
419
+INIT_YMM avx2
420
+cglobal filterPixelToShort_16x%1, 3, 7, 3
421
+ add r1d, r1d
422
+ mov r3d, r3m
423
+ add r3d, r3d
424
+ lea r4, [r3 * 3]
425
+ lea r5, [r1 * 3]
426
+
427
+ ; load height
428
+ mov r6d, %1/4
429
+
430
+ ; load constant
431
+ mova m2, [pw_2000]
432
+
433
+.loop
434
+ movu m0, [r0]
435
+ movu m1, [r0 + r1]
436
+ psllw m0, 4
437
+ psubw m0, m2
438
+ psllw m1, 4
439
+ psubw m1, m2
440
+
441
+ movu [r2 + r3 * 0], m0
442
+ movu [r2 + r3 * 1], m1
443
+
444
+ movu m0, [r0 + r1 * 2]
445
+ movu m1, [r0 + r5]
446
+ psllw m0, 4
447
+ psubw m0, m2
448
+ psllw m1, 4
449
+ psubw m1, m2
450
+
451
+ movu [r2 + r3 * 2], m0
452
+ movu [r2 + r4], m1
453
+
454
+ lea r0, [r0 + r1 * 4]
455
+ lea r2, [r2 + r3 * 4]
456
+
457
+ dec r6d
458
+ jnz .loop
459
+ RET
460
+%endmacro
461
+P2S_H_16xN_avx2 16
462
+P2S_H_16xN_avx2 4
463
+P2S_H_16xN_avx2 8
464
+P2S_H_16xN_avx2 12
465
+P2S_H_16xN_avx2 32
466
+P2S_H_16xN_avx2 64
467
+P2S_H_16xN_avx2 24
468
+
469
+;-----------------------------------------------------------------------------
470
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
471
+;-----------------------------------------------------------------------------
472
+%macro P2S_H_32xN 1
473
+INIT_XMM ssse3
474
+cglobal filterPixelToShort_32x%1, 3, 7, 5
475
+ add r1d, r1d
476
+ mov r3d, r3m
477
+ add r3d, r3d
478
+ lea r4, [r3 * 3]
479
+ lea r5, [r1 * 3]
480
+
481
+ ; load height
482
+ mov r6d, %1/4
483
+
484
+ ; load constant
485
+ mova m4, [pw_2000]
486
+
487
+.loop
488
+ movu m0, [r0]
489
+ movu m1, [r0 + r1]
490
+ movu m2, [r0 + r1 * 2]
491
+ movu m3, [r0 + r5]
492
+ psllw m0, 4
493
+ psubw m0, m4
494
+ psllw m1, 4
495
+ psubw m1, m4
496
+ psllw m2, 4
497
+ psubw m2, m4
498
+ psllw m3, 4
499
+ psubw m3, m4
500
+
501
+ movu [r2 + r3 * 0], m0
502
+ movu [r2 + r3 * 1], m1
503
+ movu [r2 + r3 * 2], m2
504
+ movu [r2 + r4], m3
505
+
506
+ movu m0, [r0 + 16]
507
+ movu m1, [r0 + r1 + 16]
508
+ movu m2, [r0 + r1 * 2 + 16]
509
+ movu m3, [r0 + r5 + 16]
510
+ psllw m0, 4
511
+ psubw m0, m4
512
+ psllw m1, 4
513
+ psubw m1, m4
514
+ psllw m2, 4
515
+ psubw m2, m4
516
+ psllw m3, 4
517
+ psubw m3, m4
518
+
519
+ movu [r2 + r3 * 0 + 16], m0
520
+ movu [r2 + r3 * 1 + 16], m1
521
+ movu [r2 + r3 * 2 + 16], m2
522
+ movu [r2 + r4 + 16], m3
523
+
524
+ movu m0, [r0 + 32]
525
+ movu m1, [r0 + r1 + 32]
526
+ movu m2, [r0 + r1 * 2 + 32]
527
+ movu m3, [r0 + r5 + 32]
528
+ psllw m0, 4
529
+ psubw m0, m4
530
+ psllw m1, 4
531
+ psubw m1, m4
532
+ psllw m2, 4
533
+ psubw m2, m4
534
+ psllw m3, 4
535
+ psubw m3, m4
536
+
537
+ movu [r2 + r3 * 0 + 32], m0
538
+ movu [r2 + r3 * 1 + 32], m1
539
+ movu [r2 + r3 * 2 + 32], m2
540
+ movu [r2 + r4 + 32], m3
541
+
542
+ movu m0, [r0 + 48]
543
+ movu m1, [r0 + r1 + 48]
544
+ movu m2, [r0 + r1 * 2 + 48]
545
+ movu m3, [r0 + r5 + 48]
546
+ psllw m0, 4
547
+ psubw m0, m4
548
+ psllw m1, 4
549
+ psubw m1, m4
550
+ psllw m2, 4
551
+ psubw m2, m4
552
+ psllw m3, 4
553
+ psubw m3, m4
554
+
555
+ movu [r2 + r3 * 0 + 48], m0
556
+ movu [r2 + r3 * 1 + 48], m1
557
+ movu [r2 + r3 * 2 + 48], m2
558
+ movu [r2 + r4 + 48], m3
559
560
+ lea r0, [r0 + r1 * 4]
561
+ lea r2, [r2 + r3 * 4]
562
+
563
+ dec r6d
564
+ jnz .loop
565
+ RET
566
+%endmacro
567
+P2S_H_32xN 32
568
+P2S_H_32xN 8
569
+P2S_H_32xN 16
570
+P2S_H_32xN 24
571
+P2S_H_32xN 64
572
+P2S_H_32xN 48
573
+
574
+;-----------------------------------------------------------------------------
575
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
576
+;-----------------------------------------------------------------------------
577
+%macro P2S_H_32xN_avx2 1
578
+INIT_YMM avx2
579
+cglobal filterPixelToShort_32x%1, 3, 7, 3
580
+ add r1d, r1d
581
+ mov r3d, r3m
582
+ add r3d, r3d
583
+ lea r4, [r3 * 3]
584
+ lea r5, [r1 * 3]
585
+
586
+ ; load height
587
+ mov r6d, %1/4
588
+
589
+ ; load constant
590
+ mova m2, [pw_2000]
591
+
592
+.loop
593
+ movu m0, [r0]
594
+ movu m1, [r0 + r1]
595
+ psllw m0, 4
596
+ psubw m0, m2
597
+ psllw m1, 4
598
+ psubw m1, m2
599
+
600
+ movu [r2 + r3 * 0], m0
601
+ movu [r2 + r3 * 1], m1
602
+
603
+ movu m0, [r0 + r1 * 2]
604
+ movu m1, [r0 + r5]
605
+ psllw m0, 4
606
+ psubw m0, m2
607
+ psllw m1, 4
608
+ psubw m1, m2
609
+
610
+ movu [r2 + r3 * 2], m0
611
+ movu [r2 + r4], m1
612
+
613
+ movu m0, [r0 + 32]
614
+ movu m1, [r0 + r1 + 32]
615
+ psllw m0, 4
616
+ psubw m0, m2
617
+ psllw m1, 4
618
+ psubw m1, m2
619
+
620
+ movu [r2 + r3 * 0 + 32], m0
621
+ movu [r2 + r3 * 1 + 32], m1
622
+
623
+ movu m0, [r0 + r1 * 2 + 32]
624
+ movu m1, [r0 + r5 + 32]
625
+ psllw m0, 4
626
+ psubw m0, m2
627
+ psllw m1, 4
628
+ psubw m1, m2
629
+
630
+ movu [r2 + r3 * 2 + 32], m0
631
+ movu [r2 + r4 + 32], m1
632
+
633
+ lea r0, [r0 + r1 * 4]
634
+ lea r2, [r2 + r3 * 4]
635
+
636
+ dec r6d
637
+ jnz .loop
638
+ RET
639
+%endmacro
640
+P2S_H_32xN_avx2 32
641
+P2S_H_32xN_avx2 8
642
+P2S_H_32xN_avx2 16
643
+P2S_H_32xN_avx2 24
644
+P2S_H_32xN_avx2 64
645
+P2S_H_32xN_avx2 48
646
+
647
+;-----------------------------------------------------------------------------
648
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
649
+;-----------------------------------------------------------------------------
650
+%macro P2S_H_64xN 1
651
+INIT_XMM ssse3
652
+cglobal filterPixelToShort_64x%1, 3, 7, 5
653
+ add r1d, r1d
654
+ mov r3d, r3m
655
+ add r3d, r3d
656
+ lea r4, [r3 * 3]
657
+ lea r5, [r1 * 3]
658
+
659
+ ; load height
660
+ mov r6d, %1/4
661
+
662
+ ; load constant
663
+ mova m4, [pw_2000]
664
+
665
+.loop
666
+ movu m0, [r0]
667
+ movu m1, [r0 + r1]
668
+ movu m2, [r0 + r1 * 2]
669
+ movu m3, [r0 + r5]
670
+ psllw m0, 4
671
+ psubw m0, m4
672
+ psllw m1, 4
673
+ psubw m1, m4
674
+ psllw m2, 4
675
+ psubw m2, m4
676
+ psllw m3, 4
677
+ psubw m3, m4
678
+
679
+ movu [r2 + r3 * 0], m0
680
+ movu [r2 + r3 * 1], m1
681
+ movu [r2 + r3 * 2], m2
682
+ movu [r2 + r4], m3
683
+
684
+ movu m0, [r0 + 16]
685
+ movu m1, [r0 + r1 + 16]
686
+ movu m2, [r0 + r1 * 2 + 16]
687
+ movu m3, [r0 + r5 + 16]
688
+ psllw m0, 4
689
+ psubw m0, m4
690
+ psllw m1, 4
691
+ psubw m1, m4
692
+ psllw m2, 4
693
+ psubw m2, m4
694
+ psllw m3, 4
695
+ psubw m3, m4
696
+
697
+ movu [r2 + r3 * 0 + 16], m0
698
+ movu [r2 + r3 * 1 + 16], m1
699
+ movu [r2 + r3 * 2 + 16], m2
700
+ movu [r2 + r4 + 16], m3
701
+
702
+ movu m0, [r0 + 32]
703
+ movu m1, [r0 + r1 + 32]
704
+ movu m2, [r0 + r1 * 2 + 32]
705
+ movu m3, [r0 + r5 + 32]
706
+ psllw m0, 4
707
+ psubw m0, m4
708
+ psllw m1, 4
709
+ psubw m1, m4
710
+ psllw m2, 4
711
+ psubw m2, m4
712
+ psllw m3, 4
713
+ psubw m3, m4
714
+
715
+ movu [r2 + r3 * 0 + 32], m0
716
+ movu [r2 + r3 * 1 + 32], m1
717
+ movu [r2 + r3 * 2 + 32], m2
718
+ movu [r2 + r4 + 32], m3
719
+
720
+ movu m0, [r0 + 48]
721
+ movu m1, [r0 + r1 + 48]
722
+ movu m2, [r0 + r1 * 2 + 48]
723
+ movu m3, [r0 + r5 + 48]
724
+ psllw m0, 4
725
+ psubw m0, m4
726
+ psllw m1, 4
727
+ psubw m1, m4
728
+ psllw m2, 4
729
+ psubw m2, m4
730
+ psllw m3, 4
731
+ psubw m3, m4
732
+
733
+ movu [r2 + r3 * 0 + 48], m0
734
+ movu [r2 + r3 * 1 + 48], m1
735
+ movu [r2 + r3 * 2 + 48], m2
736
+ movu [r2 + r4 + 48], m3
737
+
738
+ movu m0, [r0 + 64]
739
+ movu m1, [r0 + r1 + 64]
740
+ movu m2, [r0 + r1 * 2 + 64]
741
+ movu m3, [r0 + r5 + 64]
742
+ psllw m0, 4
743
+ psubw m0, m4
744
+ psllw m1, 4
745
+ psubw m1, m4
746
+ psllw m2, 4
747
+ psubw m2, m4
748
+ psllw m3, 4
749
+ psubw m3, m4
750
+
751
+ movu [r2 + r3 * 0 + 64], m0
752
+ movu [r2 + r3 * 1 + 64], m1
753
+ movu [r2 + r3 * 2 + 64], m2
754
+ movu [r2 + r4 + 64], m3
755
+
756
+ movu m0, [r0 + 80]
757
+ movu m1, [r0 + r1 + 80]
758
+ movu m2, [r0 + r1 * 2 + 80]
759
+ movu m3, [r0 + r5 + 80]
760
+ psllw m0, 4
761
+ psubw m0, m4
762
+ psllw m1, 4
763
+ psubw m1, m4
764
+ psllw m2, 4
765
+ psubw m2, m4
766
+ psllw m3, 4
767
+ psubw m3, m4
768
+
769
+ movu [r2 + r3 * 0 + 80], m0
770
+ movu [r2 + r3 * 1 + 80], m1
771
+ movu [r2 + r3 * 2 + 80], m2
772
+ movu [r2 + r4 + 80], m3
773
+
774
+ movu m0, [r0 + 96]
775
+ movu m1, [r0 + r1 + 96]
776
+ movu m2, [r0 + r1 * 2 + 96]
777
+ movu m3, [r0 + r5 + 96]
778
+ psllw m0, 4
779
+ psubw m0, m4
780
+ psllw m1, 4
781
+ psubw m1, m4
782
+ psllw m2, 4
783
+ psubw m2, m4
784
+ psllw m3, 4
785
+ psubw m3, m4
786
+
787
+ movu [r2 + r3 * 0 + 96], m0
788
+ movu [r2 + r3 * 1 + 96], m1
789
+ movu [r2 + r3 * 2 + 96], m2
790
+ movu [r2 + r4 + 96], m3
791
+
792
+ movu m0, [r0 + 112]
793
+ movu m1, [r0 + r1 + 112]
794
+ movu m2, [r0 + r1 * 2 + 112]
795
+ movu m3, [r0 + r5 + 112]
796
+ psllw m0, 4
797
+ psubw m0, m4
798
+ psllw m1, 4
799
+ psubw m1, m4
800
+ psllw m2, 4
801
+ psubw m2, m4
802
+ psllw m3, 4
803
+ psubw m3, m4
804
+
805
+ movu [r2 + r3 * 0 + 112], m0
806
+ movu [r2 + r3 * 1 + 112], m1
807
+ movu [r2 + r3 * 2 + 112], m2
808
+ movu [r2 + r4 + 112], m3
809
+
810
+ lea r0, [r0 + r1 * 4]
811
+ lea r2, [r2 + r3 * 4]
812
+
813
+ dec r6d
814
+ jnz .loop
815
+ RET
816
+%endmacro
817
+P2S_H_64xN 64
818
+P2S_H_64xN 16
819
+P2S_H_64xN 32
820
+P2S_H_64xN 48
821
+
822
+;-----------------------------------------------------------------------------
823
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
824
+;-----------------------------------------------------------------------------
825
+%macro P2S_H_64xN_avx2 1
826
+INIT_YMM avx2
827
+cglobal filterPixelToShort_64x%1, 3, 7, 3
828
+ add r1d, r1d
829
+ mov r3d, r3m
830
+ add r3d, r3d
831
+ lea r4, [r3 * 3]
832
+ lea r5, [r1 * 3]
833
+
834
+ ; load height
835
+ mov r6d, %1/4
836
+
837
+ ; load constant
838
+ mova m2, [pw_2000]
839
+
840
+.loop
841
+ movu m0, [r0]
842
+ movu m1, [r0 + r1]
843
+ psllw m0, 4
844
+ psubw m0, m2
845
+ psllw m1, 4
846
+ psubw m1, m2
847
+
848
+ movu [r2 + r3 * 0], m0
849
+ movu [r2 + r3 * 1], m1
850
+
851
+ movu m0, [r0 + r1 * 2]
852
+ movu m1, [r0 + r5]
853
+ psllw m0, 4
854
+ psubw m0, m2
855
+ psllw m1, 4
856
+ psubw m1, m2
857
+
858
+ movu [r2 + r3 * 2], m0
859
+ movu [r2 + r4], m1
860
+
861
+ movu m0, [r0 + 32]
862
+ movu m1, [r0 + r1 + 32]
863
+ psllw m0, 4
864
+ psubw m0, m2
865
+ psllw m1, 4
866
+ psubw m1, m2
867
+
868
+ movu [r2 + r3 * 0 + 32], m0
869
+ movu [r2 + r3 * 1 + 32], m1
870
+
871
+ movu m0, [r0 + r1 * 2 + 32]
872
+ movu m1, [r0 + r5 + 32]
873
+ psllw m0, 4
874
+ psubw m0, m2
875
+ psllw m1, 4
876
+ psubw m1, m2
877
+
878
+ movu [r2 + r3 * 2 + 32], m0
879
+ movu [r2 + r4 + 32], m1
880
+
881
+ movu m0, [r0 + 64]
882
+ movu m1, [r0 + r1 + 64]
883
+ psllw m0, 4
884
+ psubw m0, m2
885
+ psllw m1, 4
886
+ psubw m1, m2
887
+
888
+ movu [r2 + r3 * 0 + 64], m0
889
+ movu [r2 + r3 * 1 + 64], m1
890
+
891
+ movu m0, [r0 + r1 * 2 + 64]
892
+ movu m1, [r0 + r5 + 64]
893
+ psllw m0, 4
894
+ psubw m0, m2
895
+ psllw m1, 4
896
+ psubw m1, m2
897
+
898
+ movu [r2 + r3 * 2 + 64], m0
899
+ movu [r2 + r4 + 64], m1
900
+
901
+ movu m0, [r0 + 96]
902
+ movu m1, [r0 + r1 + 96]
903
+ psllw m0, 4
904
+ psubw m0, m2
905
+ psllw m1, 4
906
+ psubw m1, m2
907
+
908
+ movu [r2 + r3 * 0 + 96], m0
909
+ movu [r2 + r3 * 1 + 96], m1
910
+
911
+ movu m0, [r0 + r1 * 2 + 96]
912
+ movu m1, [r0 + r5 + 96]
913
+ psllw m0, 4
914
+ psubw m0, m2
915
+ psllw m1, 4
916
+ psubw m1, m2
917
+
918
+ movu [r2 + r3 * 2 + 96], m0
919
+ movu [r2 + r4 + 96], m1
920
+
921
+ lea r0, [r0 + r1 * 4]
922
+ lea r2, [r2 + r3 * 4]
923
+
924
+ dec r6d
925
+ jnz .loop
926
+ RET
927
+%endmacro
928
+P2S_H_64xN_avx2 64
929
+P2S_H_64xN_avx2 16
930
+P2S_H_64xN_avx2 32
931
+P2S_H_64xN_avx2 48
932
+
933
+;-----------------------------------------------------------------------------
934
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
935
+;-----------------------------------------------------------------------------
936
+%macro P2S_H_24xN 1
937
+INIT_XMM ssse3
938
+cglobal filterPixelToShort_24x%1, 3, 7, 5
939
+ add r1d, r1d
940
+ mov r3d, r3m
941
+ add r3d, r3d
942
+ lea r4, [r3 * 3]
943
+ lea r5, [r1 * 3]
944
+
945
+ ; load height
946
+ mov r6d, %1/4
947
+
948
+ ; load constant
949
+ mova m4, [pw_2000]
950
+
951
+.loop
952
+ movu m0, [r0]
953
+ movu m1, [r0 + r1]
954
+ movu m2, [r0 + r1 * 2]
955
+ movu m3, [r0 + r5]
956
+ psllw m0, 4
957
+ psubw m0, m4
958
+ psllw m1, 4
959
+ psubw m1, m4
960
+ psllw m2, 4
961
+ psubw m2, m4
962
+ psllw m3, 4
963
+ psubw m3, m4
964
+
965
+ movu [r2 + r3 * 0], m0
966
+ movu [r2 + r3 * 1], m1
967
+ movu [r2 + r3 * 2], m2
968
+ movu [r2 + r4], m3
969
+
970
+ movu m0, [r0 + 16]
971
+ movu m1, [r0 + r1 + 16]
972
+ movu m2, [r0 + r1 * 2 + 16]
973
+ movu m3, [r0 + r5 + 16]
974
+ psllw m0, 4
975
+ psubw m0, m4
976
+ psllw m1, 4
977
+ psubw m1, m4
978
+ psllw m2, 4
979
+ psubw m2, m4
980
+ psllw m3, 4
981
+ psubw m3, m4
982
+
983
+ movu [r2 + r3 * 0 + 16], m0
984
+ movu [r2 + r3 * 1 + 16], m1
985
+ movu [r2 + r3 * 2 + 16], m2
986
+ movu [r2 + r4 + 16], m3
987
+
988
+ movu m0, [r0 + 32]
989
+ movu m1, [r0 + r1 + 32]
990
+ movu m2, [r0 + r1 * 2 + 32]
991
+ movu m3, [r0 + r5 + 32]
992
+ psllw m0, 4
993
+ psubw m0, m4
994
+ psllw m1, 4
995
+ psubw m1, m4
996
+ psllw m2, 4
997
+ psubw m2, m4
998
+ psllw m3, 4
999
+ psubw m3, m4
1000
+
1001
+ movu [r2 + r3 * 0 + 32], m0
1002
+ movu [r2 + r3 * 1 + 32], m1
1003
+ movu [r2 + r3 * 2 + 32], m2
1004
+ movu [r2 + r4 + 32], m3
1005
+
1006
+ lea r0, [r0 + r1 * 4]
1007
+ lea r2, [r2 + r3 * 4]
1008
+
1009
+ dec r6d
1010
+ jnz .loop
1011
+ RET
1012
+%endmacro
1013
+P2S_H_24xN 32
1014
+P2S_H_24xN 64
1015
+
1016
+;-----------------------------------------------------------------------------
1017
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1018
+;-----------------------------------------------------------------------------
1019
+%macro P2S_H_24xN_avx2 1
1020
+INIT_YMM avx2
1021
+cglobal filterPixelToShort_24x%1, 3, 7, 3
1022
+ add r1d, r1d
1023
+ mov r3d, r3m
1024
+ add r3d, r3d
1025
+ lea r4, [r3 * 3]
1026
+ lea r5, [r1 * 3]
1027
+
1028
+ ; load height
1029
+ mov r6d, %1/4
1030
+
1031
+ ; load constant
1032
+ mova m2, [pw_2000]
1033
+
1034
+.loop
1035
+ movu m0, [r0]
1036
+ movu m1, [r0 + 32]
1037
+ psllw m0, 4
1038
+ psubw m0, m2
1039
+ psllw m1, 4
1040
+ psubw m1, m2
1041
+ movu [r2 + r3 * 0], m0
1042
+ movu [r2 + r3 * 0 + 32], xm1
1043
+
1044
+ movu m0, [r0 + r1]
1045
+ movu m1, [r0 + r1 + 32]
1046
+ psllw m0, 4
1047
+ psubw m0, m2
1048
+ psllw m1, 4
1049
+ psubw m1, m2
1050
+ movu [r2 + r3 * 1], m0
1051
+ movu [r2 + r3 * 1 + 32], xm1
1052
+
1053
+ movu m0, [r0 + r1 * 2]
1054
+ movu m1, [r0 + r1 * 2 + 32]
1055
+ psllw m0, 4
1056
+ psubw m0, m2
1057
+ psllw m1, 4
1058
+ psubw m1, m2
1059
+ movu [r2 + r3 * 2], m0
1060
+ movu [r2 + r3 * 2 + 32], xm1
1061
+
1062
+ movu m0, [r0 + r5]
1063
+ movu m1, [r0 + r5 + 32]
1064
+ psllw m0, 4
1065
+ psubw m0, m2
1066
+ psllw m1, 4
1067
+ psubw m1, m2
1068
+ movu [r2 + r4], m0
1069
+ movu [r2 + r4 + 32], xm1
1070
+
1071
+ lea r0, [r0 + r1 * 4]
1072
+ lea r2, [r2 + r3 * 4]
1073
+
1074
+ dec r6d
1075
+ jnz .loop
1076
+ RET
1077
+%endmacro
1078
+P2S_H_24xN_avx2 32
1079
+P2S_H_24xN_avx2 64
1080
+
1081
+;-----------------------------------------------------------------------------
1082
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1083
+;-----------------------------------------------------------------------------
1084
+%macro P2S_H_12xN 1
1085
+INIT_XMM ssse3
1086
+cglobal filterPixelToShort_12x%1, 3, 7, 3
1087
+ add r1d, r1d
1088
+ mov r3d, r3m
1089
+ add r3d, r3d
1090
+ lea r4, [r3 * 3]
1091
+ lea r5, [r1 * 3]
1092
+
1093
+ ; load height
1094
+ mov r6d, %1/4
1095
+
1096
+ ; load constant
1097
+ mova m2, [pw_2000]
1098
+
1099
+.loop
1100
+ movu m0, [r0]
1101
+ movu m1, [r0 + r1]
1102
+ psllw m0, 4
1103
+ psubw m0, m2
1104
+ psllw m1, 4
1105
+ psubw m1, m2
1106
+
1107
+ movu [r2 + r3 * 0], m0
1108
+ movu [r2 + r3 * 1], m1
1109
+
1110
+ movu m0, [r0 + r1 * 2]
1111
+ movu m1, [r0 + r5]
1112
+ psllw m0, 4
1113
+ psubw m0, m2
1114
+ psllw m1, 4
1115
+ psubw m1, m2
1116
+
1117
+ movu [r2 + r3 * 2], m0
1118
+ movu [r2 + r4], m1
1119
+
1120
+ movh m0, [r0 + 16]
1121
+ movhps m0, [r0 + r1 + 16]
1122
+ psllw m0, 4
1123
+ psubw m0, m2
1124
+
1125
+ movh [r2 + r3 * 0 + 16], m0
1126
+ movhps [r2 + r3 * 1 + 16], m0
1127
+
1128
+ movh m0, [r0 + r1 * 2 + 16]
1129
+ movhps m0, [r0 + r5 + 16]
1130
+ psllw m0, 4
1131
+ psubw m0, m2
1132
+
1133
+ movh [r2 + r3 * 2 + 16], m0
1134
+ movhps [r2 + r4 + 16], m0
1135
+
1136
+ lea r0, [r0 + r1 * 4]
1137
+ lea r2, [r2 + r3 * 4]
1138
+
1139
+ dec r6d
1140
+ jnz .loop
1141
+ RET
1142
+%endmacro
1143
+P2S_H_12xN 16
1144
+P2S_H_12xN 32
1145
+
1146
+;-----------------------------------------------------------------------------
1147
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
1148
+;-----------------------------------------------------------------------------
1149
+INIT_XMM ssse3
1150
+cglobal filterPixelToShort_48x64, 3, 7, 5
1151
+ add r1d, r1d
1152
+ mov r3d, r3m
1153
+ add r3d, r3d
1154
+ lea r4, [r3 * 3]
1155
+ lea r5, [r1 * 3]
1156
+
1157
+ ; load height
1158
+ mov r6d, 16
1159
+
1160
+ ; load constant
1161
+ mova m4, [pw_2000]
1162
+
1163
+.loop
1164
+ movu m0, [r0]
1165
+ movu m1, [r0 + r1]
1166
+ movu m2, [r0 + r1 * 2]
1167
+ movu m3, [r0 + r5]
1168
+ psllw m0, 4
1169
+ psubw m0, m4
1170
+ psllw m1, 4
1171
+ psubw m1, m4
1172
+ psllw m2, 4
1173
+ psubw m2, m4
1174
+ psllw m3, 4
1175
+ psubw m3, m4
1176
+
1177
+ movu [r2 + r3 * 0], m0
1178
+ movu [r2 + r3 * 1], m1
1179
+ movu [r2 + r3 * 2], m2
1180
+ movu [r2 + r4], m3
1181
+
1182
+ movu m0, [r0 + 16]
1183
+ movu m1, [r0 + r1 + 16]
1184
+ movu m2, [r0 + r1 * 2 + 16]
1185
+ movu m3, [r0 + r5 + 16]
1186
+ psllw m0, 4
1187
+ psubw m0, m4
1188
+ psllw m1, 4
1189
+ psubw m1, m4
1190
+ psllw m2, 4
1191
+ psubw m2, m4
1192
+ psllw m3, 4
1193
+ psubw m3, m4
1194
+
1195
+ movu [r2 + r3 * 0 + 16], m0
1196
+ movu [r2 + r3 * 1 + 16], m1
1197
+ movu [r2 + r3 * 2 + 16], m2
1198
+ movu [r2 + r4 + 16], m3
1199
+
1200
+ movu m0, [r0 + 32]
1201
+ movu m1, [r0 + r1 + 32]
1202
+ movu m2, [r0 + r1 * 2 + 32]
1203
+ movu m3, [r0 + r5 + 32]
1204
+ psllw m0, 4
1205
+ psubw m0, m4
1206
+ psllw m1, 4
1207
+ psubw m1, m4
1208
+ psllw m2, 4
1209
+ psubw m2, m4
1210
+ psllw m3, 4
1211
+ psubw m3, m4
1212
+
1213
+ movu [r2 + r3 * 0 + 32], m0
1214
+ movu [r2 + r3 * 1 + 32], m1
1215
+ movu [r2 + r3 * 2 + 32], m2
1216
+ movu [r2 + r4 + 32], m3
1217
+
1218
+ movu m0, [r0 + 48]
1219
+ movu m1, [r0 + r1 + 48]
1220
+ movu m2, [r0 + r1 * 2 + 48]
1221
+ movu m3, [r0 + r5 + 48]
1222
+ psllw m0, 4
1223
+ psubw m0, m4
1224
+ psllw m1, 4
1225
+ psubw m1, m4
1226
+ psllw m2, 4
1227
+ psubw m2, m4
1228
+ psllw m3, 4
1229
+ psubw m3, m4
1230
+
1231
+ movu [r2 + r3 * 0 + 48], m0
1232
+ movu [r2 + r3 * 1 + 48], m1
1233
+ movu [r2 + r3 * 2 + 48], m2
1234
+ movu [r2 + r4 + 48], m3
1235
+
1236
+ movu m0, [r0 + 64]
1237
+ movu m1, [r0 + r1 + 64]
1238
+ movu m2, [r0 + r1 * 2 + 64]
1239
+ movu m3, [r0 + r5 + 64]
1240
+ psllw m0, 4
1241
+ psubw m0, m4
1242
+ psllw m1, 4
1243
+ psubw m1, m4
1244
+ psllw m2, 4
1245
+ psubw m2, m4
1246
+ psllw m3, 4
1247
+ psubw m3, m4
1248
+
1249
+ movu [r2 + r3 * 0 + 64], m0
1250
+ movu [r2 + r3 * 1 + 64], m1
1251
+ movu [r2 + r3 * 2 + 64], m2
1252
+ movu [r2 + r4 + 64], m3
1253
+
1254
+ movu m0, [r0 + 80]
1255
+ movu m1, [r0 + r1 + 80]
1256
+ movu m2, [r0 + r1 * 2 + 80]
1257
+ movu m3, [r0 + r5 + 80]
1258
+ psllw m0, 4
1259
+ psubw m0, m4
1260
+ psllw m1, 4
1261
+ psubw m1, m4
1262
+ psllw m2, 4
1263
+ psubw m2, m4
1264
+ psllw m3, 4
1265
+ psubw m3, m4
1266
+
1267
+ movu [r2 + r3 * 0 + 80], m0
1268
+ movu [r2 + r3 * 1 + 80], m1
1269
+ movu [r2 + r3 * 2 + 80], m2
1270
+ movu [r2 + r4 + 80], m3
1271
+
1272
+ lea r0, [r0 + r1 * 4]
1273
+ lea r2, [r2 + r3 * 4]
1274
+
1275
+ dec r6d
1276
+ jnz .loop
1277
RET
1278
+
1279
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal filterPixelToShort_48x64, 3, 7, 4
+ add r1d, r1d
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ lea r5, [r1 * 3]
+
+ ; load height
+ mov r6d, 16
+
+ ; load constant
+ mova m3, [pw_2000]
+
+.loop:
+ movu m0, [r0]
+ movu m1, [r0 + 32]
+ movu m2, [r0 + 64]
+ psllw m0, 4
+ psubw m0, m3
+ psllw m1, 4
+ psubw m1, m3
+ psllw m2, 4
+ psubw m2, m3
+ movu [r2 + r3 * 0], m0
+ movu [r2 + r3 * 0 + 32], m1
+ movu [r2 + r3 * 0 + 64], m2
+
+ movu m0, [r0 + r1]
+ movu m1, [r0 + r1 + 32]
+ movu m2, [r0 + r1 + 64]
+ psllw m0, 4
+ psubw m0, m3
+ psllw m1, 4
+ psubw m1, m3
+ psllw m2, 4
+ psubw m2, m3
+ movu [r2 + r3 * 1], m0
+ movu [r2 + r3 * 1 + 32], m1
+ movu [r2 + r3 * 1 + 64], m2
+
+ movu m0, [r0 + r1 * 2]
+ movu m1, [r0 + r1 * 2 + 32]
+ movu m2, [r0 + r1 * 2 + 64]
+ psllw m0, 4
+ psubw m0, m3
+ psllw m1, 4
+ psubw m1, m3
+ psllw m2, 4
+ psubw m2, m3
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r3 * 2 + 32], m1
+ movu [r2 + r3 * 2 + 64], m2
+
+ movu m0, [r0 + r5]
+ movu m1, [r0 + r5 + 32]
+ movu m2, [r0 + r5 + 64]
+ psllw m0, 4
+ psubw m0, m3
+ psllw m1, 4
+ psubw m1, m3
+ psllw m2, 4
+ psubw m2, m3
+ movu [r2 + r4], m0
+ movu [r2 + r4 + 32], m1
+ movu [r2 + r4 + 64], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r6d
+ jnz .loop
+ RET
+
+
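The filterPixelToShort kernels above are a straight data conversion: each pixel is shifted left by 4 and the pw_2000 constant (0x2000 = 8192) is subtracted. A scalar sketch of that computation, assuming 10-bit input (so the shift of 4 and the 8192 offset match the constants used above); the function name `filterPixelToShort_c` is my own label for the sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the filterPixelToShort AVX2 kernels above:
 * dst[x] = (src[x] << 4) - 8192, i.e. psllw 4 followed by psubw pw_2000.
 * Assumption: 10-bit pixels stored as uint16_t. */
static void filterPixelToShort_c(const uint16_t *src, intptr_t srcStride,
                                 int16_t *dst, intptr_t dstStride,
                                 int width, int height)
{
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
            dst[col] = (int16_t)((src[col] << 4) - 8192);
        src += srcStride;
        dst += dstStride;
    }
}
```

The assembly merely unrolls this four rows at a time and 48 or 64 pixels per row.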
+;-----------------------------------------------------------------------------------------------------------------------------
+;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+
+%macro IPFILTER_LUMA_PS_4xN_AVX2 1
+INIT_YMM avx2
+%if ARCH_X86_64 == 1
+cglobal interp_8tap_horiz_ps_4x%1, 6,8,7
+ mov r5d, r5m
+ mov r4d, r4m
+ add r1d, r1d
+ add r3d, r3d
+%ifdef PIC
+
+ lea r6, [tab_LumaCoeff]
+ lea r4, [r4 * 8]
+ vbroadcasti128 m0, [r6 + r4 * 2]
+
+%else
+ lea r4, [r4 * 8]
+ vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2]
+%endif
+
+ vbroadcasti128 m2, [pd_n32768]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - pd_n32768 (rounding offset)
+
+ sub r0, 6
+ test r5d, r5d
+ mov r7d, %1 ; loop count variable - height
+ jz .preloop
+ lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride
+ sub r0, r6 ; r0(src) -= 3 * srcStride
+ add r7d, 6 ; row extension: need N - 1 = 7 extra rows; 6 are handled in the loop, the last one after it
+
+.preloop:
+ lea r6, [r3 * 3]
+.loop:
+ ; Row 0
+ movu xm3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ movu xm4, [r0 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ vinserti128 m3, m3, xm4, 1
+ movu xm4, [r0 + 4]
+ movu xm5, [r0 + 6]
+ vinserti128 m4, m4, xm5, 1
+ pmaddwd m3, m0
+ pmaddwd m4, m0
+ phaddd m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]
+
+ ; Row 1
+ movu xm4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ movu xm5, [r0 + r1 + 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ vinserti128 m4, m4, xm5, 1
+ movu xm5, [r0 + r1 + 4]
+ movu xm6, [r0 + r1 + 6]
+ vinserti128 m5, m5, xm6, 1
+ pmaddwd m4, m0
+ pmaddwd m5, m0
+ phaddd m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
+ phaddd m3, m4 ; all rows and columns completed
+
+ mova m5, [interp8_hps_shuf]
+ vpermd m3, m5, m3
+ paddd m3, m2
+ vextracti128 xm4, m3, 1
+ psrad xm3, 2
+ psrad xm4, 2
+ packssdw xm3, xm3
+ packssdw xm4, xm4
+
+ movq [r2], xm3 ; row 0
+ movq [r2 + r3], xm4 ; row 1
+ lea r0, [r0 + r1 * 2] ; advance src by two rows
+ lea r2, [r2 + r3 * 2] ; advance dst by two rows
+
+ sub r7d, 2
+ jg .loop
+ test r5d, r5d
+ jz .end
+
+ ; Row 10
+ movu xm3, [r0]
+ movu xm4, [r0 + 2]
+ vinserti128 m3, m3, xm4, 1
+ movu xm4, [r0 + 4]
+ movu xm5, [r0 + 6]
+ vinserti128 m4, m4, xm5, 1
+ pmaddwd m3, m0
+ pmaddwd m4, m0
+ phaddd m3, m4
+
+ ; Row 11
+ phaddd m3, m4 ; all rows and columns completed
+
+ mova m5, [interp8_hps_shuf]
+ vpermd m3, m5, m3
+ paddd m3, m2
+ vextracti128 xm4, m3, 1
+ psrad xm3, 2
+ psrad xm4, 2
+ packssdw xm3, xm3
+ packssdw xm4, xm4
+
+ movq [r2], xm3 ; last row
+.end:
+ RET
+%endif
+%endmacro
+
+ IPFILTER_LUMA_PS_4xN_AVX2 4
+ IPFILTER_LUMA_PS_4xN_AVX2 8
+ IPFILTER_LUMA_PS_4xN_AVX2 16
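The "ps" (pixel-to-short) path above multiplies 8 taps per output sample (pmaddwd/phaddd), adds pd_n32768 and arithmetic-shifts right by 2. A scalar sketch of that computation; this assumes 10-bit input, for which the -32768 offset and shift of 2 match the constants above, and the function and parameter names are my own labels, not x265's API:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of interp_8tap_horiz_ps_4xN above (assumption: 10-bit
 * pixels, so dst[x] = (sum - 32768) >> 2 matches pd_n32768 / psrad 2).
 * isRowExt mirrors the asm: start 3 rows early and emit N - 1 extra rows. */
static void interp_8tap_horiz_ps_sketch(const uint16_t *src, intptr_t srcStride,
                                        int16_t *dst, intptr_t dstStride,
                                        int width, int height,
                                        const int16_t coeff[8], int isRowExt)
{
    if (isRowExt)
    {
        src -= 3 * srcStride;   /* (N / 2 - 1) * srcStride, as in the asm */
        height += 7;            /* blkheight += N - 1 */
    }
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = 0;
            for (int k = 0; k < 8; k++)          /* 8-tap window centred on col */
                sum += coeff[k] * src[col + k - 3];
            dst[col] = (int16_t)((sum - 32768) >> 2);
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

The AVX2 code computes two output rows per iteration and reorders the packed sums with interp8_hps_shuf, but the arithmetic per sample is the same.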
x265_1.6.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter8.asm
Changed
%include "x86util.asm"

SECTION_RODATA 32
-tab_Tm: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
- db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
- db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
+const tab_Tm, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+ db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
+ db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14

-ALIGN 32
const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15

-ALIGN 32
const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9
 times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13

-ALIGN 32
const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4
 dd 2, 3, 3, 4, 4, 5, 5, 6

-ALIGN 32
const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
 times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10
 times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12
 times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14

-ALIGN 32
-tab_Lm: db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8
- db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10
- db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12
- db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14
-
-tab_Vm: db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
- db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
-
-tab_Cm: db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3
-
-tab_c_526336: times 4 dd 8192*64+2048
-
-pd_526336: times 8 dd 8192*64+2048
-
-tab_ChromaCoeff: db 0, 64, 0, 0
- db -2, 58, 10, -2
- db -4, 54, 16, -2
- db -6, 46, 28, -4
- db -4, 36, 36, -4
- db -4, 28, 46, -6
- db -2, 16, 54, -4
- db -2, 10, 58, -2
-ALIGN 32
-tab_ChromaCoeff_V: times 8 db 0, 64
- times 8 db 0, 0
+const tab_Lm, db 0, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8
+ db 2, 3, 4, 5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 8, 9, 10
+ db 4, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12
+ db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14

- times 8 db -2, 58
- times 8 db 10, -2
+const tab_Vm, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+ db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3

- times 8 db -4, 54
- times 8 db 16, -2
+const tab_Cm, db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3

- times 8 db -6, 46
- times 8 db 28, -4
+const pd_526336, times 8 dd 8192*64+2048

- times 8 db -4, 36
- times 8 db 36, -4
+const tab_ChromaCoeff, db 0, 64, 0, 0
+ db -2, 58, 10, -2
+ db -4, 54, 16, -2
+ db -6, 46, 28, -4
+ db -4, 36, 36, -4
+ db -4, 28, 46, -6
+ db -2, 16, 54, -4
+ db -2, 10, 58, -2

- times 8 db -4, 28
- times 8 db 46, -6
+const tabw_ChromaCoeff, dw 0, 64, 0, 0
+ dw -2, 58, 10, -2
+ dw -4, 54, 16, -2
+ dw -6, 46, 28, -4
+ dw -4, 36, 36, -4
+ dw -4, 28, 46, -6
+ dw -2, 16, 54, -4
+ dw -2, 10, 58, -2

- times 8 db -2, 16
- times 8 db 54, -4
+const tab_ChromaCoeff_V, times 8 db 0, 64
+ times 8 db 0, 0

- times 8 db -2, 10
- times 8 db 58, -2
+ times 8 db -2, 58
+ times 8 db 10, -2

-tab_ChromaCoeffV: times 4 dw 0, 64
- times 4 dw 0, 0
+ times 8 db -4, 54
+ times 8 db 16, -2

- times 4 dw -2, 58
- times 4 dw 10, -2
+ times 8 db -6, 46
+ times 8 db 28, -4

- times 4 dw -4, 54
- times 4 dw 16, -2
+ times 8 db -4, 36
+ times 8 db 36, -4

- times 4 dw -6, 46
- times 4 dw 28, -4
+ times 8 db -4, 28
+ times 8 db 46, -6

- times 4 dw -4, 36
- times 4 dw 36, -4
+ times 8 db -2, 16
+ times 8 db 54, -4

- times 4 dw -4, 28
- times 4 dw 46, -6
+ times 8 db -2, 10
+ times 8 db 58, -2

- times 4 dw -2, 16
- times 4 dw 54, -4
+const tab_ChromaCoeffV, times 4 dw 0, 64
+ times 4 dw 0, 0

- times 4 dw -2, 10
- times 4 dw 58, -2
+ times 4 dw -2, 58
+ times 4 dw 10, -2

-ALIGN 32
-pw_ChromaCoeffV: times 8 dw 0, 64
- times 8 dw 0, 0
+ times 4 dw -4, 54
+ times 4 dw 16, -2

- times 8 dw -2, 58
- times 8 dw 10, -2
+ times 4 dw -6, 46
+ times 4 dw 28, -4

- times 8 dw -4, 54
- times 8 dw 16, -2
+ times 4 dw -4, 36
+ times 4 dw 36, -4

- times 8 dw -6, 46
- times 8 dw 28, -4
-
- times 8 dw -4, 36
- times 8 dw 36, -4
-
- times 8 dw -4, 28
- times 8 dw 46, -6
-
- times 8 dw -2, 16
- times 8 dw 54, -4
-
- times 8 dw -2, 10
- times 8 dw 58, -2
-
-tab_LumaCoeff: db 0, 0, 0, 64, 0, 0, 0, 0
- db -1, 4, -10, 58, 17, -5, 1, 0
- db -1, 4, -11, 40, 40, -11, 4, -1
- db 0, 1, -5, 17, 58, -10, 4, -1
-
-tab_LumaCoeffV: times 4 dw 0, 0
- times 4 dw 0, 64
- times 4 dw 0, 0
- times 4 dw 0, 0
-
- times 4 dw -1, 4
- times 4 dw -10, 58
- times 4 dw 17, -5
- times 4 dw 1, 0
-
- times 4 dw -1, 4
- times 4 dw -11, 40
- times 4 dw 40, -11
- times 4 dw 4, -1
-
- times 4 dw 0, 1
- times 4 dw -5, 17
- times 4 dw 58, -10
- times 4 dw 4, -1
+ times 4 dw -4, 28
+ times 4 dw 46, -6

-ALIGN 32
-pw_LumaCoeffVer: times 8 dw 0, 0
- times 8 dw 0, 64
- times 8 dw 0, 0
- times 8 dw 0, 0
-
- times 8 dw -1, 4
- times 8 dw -10, 58
- times 8 dw 17, -5
- times 8 dw 1, 0
-
- times 8 dw -1, 4
- times 8 dw -11, 40
- times 8 dw 40, -11
- times 8 dw 4, -1
-
- times 8 dw 0, 1
- times 8 dw -5, 17
- times 8 dw 58, -10
- times 8 dw 4, -1
-
-pb_LumaCoeffVer: times 16 db 0, 0
- times 16 db 0, 64
- times 16 db 0, 0
- times 16 db 0, 0
-
- times 16 db -1, 4
- times 16 db -10, 58
- times 16 db 17, -5
- times 16 db 1, 0
-
- times 16 db -1, 4
- times 16 db -11, 40
- times 16 db 40, -11
- times 16 db 4, -1
-
- times 16 db 0, 1
- times 16 db -5, 17
- times 16 db 58, -10
- times 16 db 4, -1
-
-tab_LumaCoeffVer: times 8 db 0, 0
- times 8 db 0, 64
- times 8 db 0, 0
- times 8 db 0, 0
-
- times 8 db -1, 4
- times 8 db -10, 58
- times 8 db 17, -5
- times 8 db 1, 0
-
- times 8 db -1, 4
- times 8 db -11, 40
- times 8 db 40, -11
- times 8 db 4, -1
-
- times 8 db 0, 1
- times 8 db -5, 17
- times 8 db 58, -10
- times 8 db 4, -1
+ times 4 dw -2, 16
+ times 4 dw 54, -4

-ALIGN 32
-tab_LumaCoeffVer_32: times 16 db 0, 0
- times 16 db 0, 64
- times 16 db 0, 0
- times 16 db 0, 0
-
- times 16 db -1, 4
- times 16 db -10, 58
- times 16 db 17, -5
- times 16 db 1, 0
-
- times 16 db -1, 4
- times 16 db -11, 40
- times 16 db 40, -11
- times 16 db 4, -1
-
- times 16 db 0, 1
- times 16 db -5, 17
- times 16 db 58, -10
- times 16 db 4, -1
+ times 4 dw -2, 10
+ times 4 dw 58, -2

-ALIGN 32
-tab_ChromaCoeffVer_32: times 16 db 0, 64
- times 16 db 0, 0
+const pw_ChromaCoeffV, times 8 dw 0, 64
+ times 8 dw 0, 0
+
+ times 8 dw -2, 58
+ times 8 dw 10, -2
+
+ times 8 dw -4, 54
+ times 8 dw 16, -2
+
+ times 8 dw -6, 46
+ times 8 dw 28, -4
+
+ times 8 dw -4, 36
+ times 8 dw 36, -4
+
+ times 8 dw -4, 28
+ times 8 dw 46, -6
+
+ times 8 dw -2, 16
+ times 8 dw 54, -4
+
+ times 8 dw -2, 10
+ times 8 dw 58, -2
+
+const tab_LumaCoeff, db 0, 0, 0, 64, 0, 0, 0, 0
+ db -1, 4, -10, 58, 17, -5, 1, 0
+ db -1, 4, -11, 40, 40, -11, 4, -1
+ db 0, 1, -5, 17, 58, -10, 4, -1
+
+const tabw_LumaCoeff, dw 0, 0, 0, 64, 0, 0, 0, 0
+ dw -1, 4, -10, 58, 17, -5, 1, 0
+ dw -1, 4, -11, 40, 40, -11, 4, -1
+ dw 0, 1, -5, 17, 58, -10, 4, -1
+
+const tab_LumaCoeffV, times 4 dw 0, 0
+ times 4 dw 0, 64
+ times 4 dw 0, 0
+ times 4 dw 0, 0
+
+ times 4 dw -1, 4
+ times 4 dw -10, 58
+ times 4 dw 17, -5
+ times 4 dw 1, 0
+
+ times 4 dw -1, 4
+ times 4 dw -11, 40
+ times 4 dw 40, -11
+ times 4 dw 4, -1
+
+ times 4 dw 0, 1
+ times 4 dw -5, 17
+ times 4 dw 58, -10
+ times 4 dw 4, -1
+
+const pw_LumaCoeffVer, times 8 dw 0, 0
+ times 8 dw 0, 64
+ times 8 dw 0, 0
+ times 8 dw 0, 0
+
+ times 8 dw -1, 4
+ times 8 dw -10, 58
+ times 8 dw 17, -5
+ times 8 dw 1, 0

- times 16 db -2, 58
- times 16 db 10, -2
+ times 8 dw -1, 4
+ times 8 dw -11, 40
+ times 8 dw 40, -11
+ times 8 dw 4, -1

- times 16 db -4, 54
- times 16 db 16, -2
+ times 8 dw 0, 1
+ times 8 dw -5, 17
+ times 8 dw 58, -10
+ times 8 dw 4, -1

- times 16 db -6, 46
- times 16 db 28, -4
+const pb_LumaCoeffVer, times 16 db 0, 0
+ times 16 db 0, 64
+ times 16 db 0, 0
+ times 16 db 0, 0

- times 16 db -4, 36
- times 16 db 36, -4
+ times 16 db -1, 4
+ times 16 db -10, 58
+ times 16 db 17, -5
+ times 16 db 1, 0

- times 16 db -4, 28
- times 16 db 46, -6
+ times 16 db -1, 4
+ times 16 db -11, 40
+ times 16 db 40, -11
+ times 16 db 4, -1

- times 16 db -2, 16
- times 16 db 54, -4
+ times 16 db 0, 1
+ times 16 db -5, 17
+ times 16 db 58, -10
+ times 16 db 4, -1

- times 16 db -2, 10
- times 16 db 58, -2
+const tab_LumaCoeffVer, times 8 db 0, 0
+ times 8 db 0, 64
+ times 8 db 0, 0
+ times 8 db 0, 0

-tab_c_64_n64: times 8 db 64, -64
+ times 8 db -1, 4
+ times 8 db -10, 58
+ times 8 db 17, -5
+ times 8 db 1, 0
+
+ times 8 db -1, 4
+ times 8 db -11, 40
+ times 8 db 40, -11
+ times 8 db 4, -1
+
+ times 8 db 0, 1
+ times 8 db -5, 17
+ times 8 db 58, -10
+ times 8 db 4, -1
+
+const tab_LumaCoeffVer_32, times 16 db 0, 0
+ times 16 db 0, 64
+ times 16 db 0, 0
+ times 16 db 0, 0
+
+ times 16 db -1, 4
+ times 16 db -10, 58
+ times 16 db 17, -5
+ times 16 db 1, 0
+
+ times 16 db -1, 4
+ times 16 db -11, 40
+ times 16 db 40, -11
+ times 16 db 4, -1
+
+ times 16 db 0, 1
+ times 16 db -5, 17
+ times 16 db 58, -10
+ times 16 db 4, -1
+
+const tab_ChromaCoeffVer_32, times 16 db 0, 64
+ times 16 db 0, 0
+
+ times 16 db -2, 58
+ times 16 db 10, -2
+
+ times 16 db -4, 54
+ times 16 db 16, -2
+
+ times 16 db -6, 46
+ times 16 db 28, -4
+
+ times 16 db -4, 36
+ times 16 db 36, -4
+
+ times 16 db -4, 28
+ times 16 db 46, -6
+
+ times 16 db -2, 16
+ times 16 db 54, -4
+
+ times 16 db -2, 10
+ times 16 db 58, -2
+
+const tab_c_64_n64, times 8 db 64, -64

const interp4_shuf, times 2 db 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15

-ALIGN 32
-interp4_horiz_shuf1: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
- db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
+const interp4_horiz_shuf1, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+ db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14

-ALIGN 32
-interp4_hpp_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
+const interp4_hpp_shuf, times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12

-ALIGN 32
-interp8_hps_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
+const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7

ALIGN 32
interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12

cextern pb_128
cextern pw_1
+cextern pw_32
cextern pw_512
cextern pw_2000

+%macro FILTER_H4_w2_2_sse2 0
+ pxor m3, m3
+ movd m0, [srcq - 1]
+ movd m2, [srcq]
+ punpckldq m0, m2
+ punpcklbw m0, m3
+ movd m1, [srcq + srcstrideq - 1]
+ movd m2, [srcq + srcstrideq]
+ punpckldq m1, m2
+ punpcklbw m1, m3
+ pmaddwd m0, m4
+ pmaddwd m1, m4
+ packssdw m0, m1
+ pshuflw m1, m0, q2301
+ pshufhw m1, m1, q2301
+ paddw m0, m1
+ psrld m0, 16
+ packssdw m0, m0
+ paddw m0, m5
+ psraw m0, 6
+ packuswb m0, m0
+ movd r4, m0
+ mov [dstq], r4w
+ shr r4, 16
+ mov [dstq + dststrideq], r4w
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x4, 4, 6, 6, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m5, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+ FILTER_H4_w2_2_sse2
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+ FILTER_H4_w2_2_sse2
+
+ RET
+
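The 4-tap "pp" kernels round with pw_32, shift right by 6 and saturate to 8 bits (packuswb). A scalar sketch of that computation, using the coefficient rows from tab_ChromaCoeff above; the function and array names are my own labels for the sketch, not x265's API:

```c
#include <stdint.h>
#include <stddef.h>

/* Coefficient rows as declared in tab_ChromaCoeff / tabw_ChromaCoeff above. */
static const int16_t chromaCoeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
};

/* Scalar sketch of the 4-tap chroma "pp" horizontal filter:
 * dst[x] = clip8((sum + 32) >> 6), matching paddw pw_32 / psraw 6 / packuswb. */
static void interp_4tap_horiz_pp_sketch(const uint8_t *src, intptr_t srcStride,
                                        uint8_t *dst, intptr_t dstStride,
                                        int width, int height, int coeffIdx)
{
    const int16_t *c = chromaCoeff[coeffIdx];
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = c[0] * src[col - 1] + c[1] * src[col]
                    + c[2] * src[col + 1] + c[3] * src[col + 2];
            int val = (sum + 32) >> 6;
            if (val < 0)   val = 0;     /* packuswb saturation */
            if (val > 255) val = 255;
            dst[col] = (uint8_t)val;
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

Note that every coefficient row sums to 64, so a flat input region passes through unchanged after the (sum + 32) >> 6 rounding.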
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m5, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 4
+ FILTER_H4_w2_2_sse2
+%if x < 4
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m5, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 8
+ FILTER_H4_w2_2_sse2
+%if x < 8
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+%macro FILTER_H4_w4_2_sse2 0
+ pxor m5, m5
+ movd m0, [srcq - 1]
+ movd m6, [srcq]
+ punpckldq m0, m6
+ punpcklbw m0, m5
+ movd m1, [srcq + 1]
+ movd m6, [srcq + 2]
+ punpckldq m1, m6
+ punpcklbw m1, m5
+ movd m2, [srcq + srcstrideq - 1]
+ movd m6, [srcq + srcstrideq]
+ punpckldq m2, m6
+ punpcklbw m2, m5
+ movd m3, [srcq + srcstrideq + 1]
+ movd m6, [srcq + srcstrideq + 2]
+ punpckldq m3, m6
+ punpcklbw m3, m5
+ pmaddwd m0, m4
+ pmaddwd m1, m4
+ pmaddwd m2, m4
+ pmaddwd m3, m4
+ packssdw m0, m1
+ packssdw m2, m3
+ pshuflw m1, m0, q2301
+ pshufhw m1, m1, q2301
+ pshuflw m3, m2, q2301
+ pshufhw m3, m3, q2301
+ paddw m0, m1
+ paddw m2, m3
+ psrld m0, 16
+ psrld m2, 16
+ packssdw m0, m2
+ paddw m0, m7
+ psraw m0, 6
+ packuswb m0, m2
+ movd [dstq], m0
+ psrldq m0, 4
+ movd [dstq + dststrideq], m0
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x2, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+ FILTER_H4_w4_2_sse2
+
+ RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x4, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+ FILTER_H4_w4_2_sse2
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+ FILTER_H4_w4_2_sse2
+
+ RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x8, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 4
+ FILTER_H4_w4_2_sse2
+%if x < 4
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x16, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 8
+ FILTER_H4_w4_2_sse2
+%if x < 8
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_4x32, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m4, [r5 + r4 * 8]
+%else
+ movddup m4, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep 16
+ FILTER_H4_w4_2_sse2
+%if x < 16
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
%macro FILTER_H4_w2_2 3
 movh %2, [srcq - 1]
 pshufb %2, %2, Tm0

 mov [dstq + dststrideq], r4w
%endmacro

+%macro FILTER_H4_w6_sse2 0
+ pxor m4, m4
+ movh m0, [srcq - 1]
+ movh m5, [srcq]
+ punpckldq m0, m5
+ movhlps m2, m0
+ punpcklbw m0, m4
+ punpcklbw m2, m4
+ movd m1, [srcq + 1]
+ movd m5, [srcq + 2]
+ punpckldq m1, m5
+ punpcklbw m1, m4
+ pmaddwd m0, m6
+ pmaddwd m1, m6
+ pmaddwd m2, m6
+ packssdw m0, m1
+ packssdw m2, m2
+ pshuflw m1, m0, q2301
+ pshufhw m1, m1, q2301
+ pshuflw m3, m2, q2301
+ paddw m0, m1
+ paddw m2, m3
+ psrld m0, 16
+ psrld m2, 16
+ packssdw m0, m2
+ paddw m0, m7
+ psraw m0, 6
+ packuswb m0, m0
+ movd [dstq], m0
+ pextrw r4d, m0, 2
+ mov [dstq + 4], r4w
+%endmacro
+
+%macro FILH4W8_sse2 1
+ movh m0, [srcq - 1 + %1]
+ movh m5, [srcq + %1]
+ punpckldq m0, m5
+ movhlps m2, m0
+ punpcklbw m0, m4
+ punpcklbw m2, m4
+ movh m1, [srcq + 1 + %1]
+ movh m5, [srcq + 2 + %1]
+ punpckldq m1, m5
+ movhlps m3, m1
+ punpcklbw m1, m4
+ punpcklbw m3, m4
+ pmaddwd m0, m6
+ pmaddwd m1, m6
+ pmaddwd m2, m6
+ pmaddwd m3, m6
+ packssdw m0, m1
+ packssdw m2, m3
+ pshuflw m1, m0, q2301
+ pshufhw m1, m1, q2301
+ pshuflw m3, m2, q2301
+ pshufhw m3, m3, q2301
+ paddw m0, m1
+ paddw m2, m3
+ psrld m0, 16
+ psrld m2, 16
+ packssdw m0, m2
+ paddw m0, m7
+ psraw m0, 6
+ packuswb m0, m0
+ movh [dstq + %1], m0
+%endmacro
+
+%macro FILTER_H4_w8_sse2 0
+ FILH4W8_sse2 0
+%endmacro
+
+%macro FILTER_H4_w12_sse2 0
+ FILH4W8_sse2 0
+ movd m1, [srcq - 1 + 8]
+ movd m3, [srcq + 8]
+ punpckldq m1, m3
+ punpcklbw m1, m4
+ movd m2, [srcq + 1 + 8]
+ movd m3, [srcq + 2 + 8]
+ punpckldq m2, m3
+ punpcklbw m2, m4
+ pmaddwd m1, m6
+ pmaddwd m2, m6
+ packssdw m1, m2
+ pshuflw m2, m1, q2301
+ pshufhw m2, m2, q2301
+ paddw m1, m2
+ psrld m1, 16
+ packssdw m1, m1
+ paddw m1, m7
+ psraw m1, 6
+ packuswb m1, m1
+ movd [dstq + 8], m1
+%endmacro
+
+%macro FILTER_H4_w16_sse2 0
+ FILH4W8_sse2 0
+ FILH4W8_sse2 8
+%endmacro
+
+%macro FILTER_H4_w24_sse2 0
+ FILH4W8_sse2 0
+ FILH4W8_sse2 8
+ FILH4W8_sse2 16
+%endmacro
+
+%macro FILTER_H4_w32_sse2 0
+ FILH4W8_sse2 0
+ FILH4W8_sse2 8
+ FILH4W8_sse2 16
+ FILH4W8_sse2 24
+%endmacro
+
+%macro FILTER_H4_w48_sse2 0
+ FILH4W8_sse2 0
+ FILH4W8_sse2 8
+ FILH4W8_sse2 16
+ FILH4W8_sse2 24
+ FILH4W8_sse2 32
+ FILH4W8_sse2 40
+%endmacro
+
+%macro FILTER_H4_w64_sse2 0
+ FILH4W8_sse2 0
+ FILH4W8_sse2 8
+ FILH4W8_sse2 16
+ FILH4W8_sse2 24
+ FILH4W8_sse2 32
+ FILH4W8_sse2 40
+ FILH4W8_sse2 48
+ FILH4W8_sse2 56
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro IPFILTER_CHROMA_sse3 2
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+ pxor m4, m4
+
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m6, [r5 + r4 * 8]
+%else
+ movddup m6, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep %2
+ FILTER_H4_w%1_sse2
+%if x < %2
+ add srcq, srcstrideq
+ add dstq, dststrideq
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+%endmacro
+
+ IPFILTER_CHROMA_sse3 6, 8
+ IPFILTER_CHROMA_sse3 8, 2
+ IPFILTER_CHROMA_sse3 8, 4
+ IPFILTER_CHROMA_sse3 8, 6
+ IPFILTER_CHROMA_sse3 8, 8
+ IPFILTER_CHROMA_sse3 8, 16
+ IPFILTER_CHROMA_sse3 8, 32
+ IPFILTER_CHROMA_sse3 12, 16
+
+ IPFILTER_CHROMA_sse3 6, 16
+ IPFILTER_CHROMA_sse3 8, 12
+ IPFILTER_CHROMA_sse3 8, 64
+ IPFILTER_CHROMA_sse3 12, 32
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro IPFILTER_CHROMA_W_sse3 2
+INIT_XMM sse3
+cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride
+ mov r4d, r4m
+ mova m7, [pw_32]
+ pxor m4, m4
+%ifdef PIC
+ lea r5, [tabw_ChromaCoeff]
+ movddup m6, [r5 + r4 * 8]
+%else
+ movddup m6, [tabw_ChromaCoeff + r4 * 8]
+%endif
+
+%assign x 1
+%rep %2
+ FILTER_H4_w%1_sse2
+%if x < %2
+ add srcq, srcstrideq
+ add dstq, dststrideq
+%endif
+%assign x x+1
+%endrep
+
+ RET
+
+%endmacro
+
+ IPFILTER_CHROMA_W_sse3 16, 4
+ IPFILTER_CHROMA_W_sse3 16, 8
+ IPFILTER_CHROMA_W_sse3 16, 12
+ IPFILTER_CHROMA_W_sse3 16, 16
+ IPFILTER_CHROMA_W_sse3 16, 32
+ IPFILTER_CHROMA_W_sse3 32, 8
+ IPFILTER_CHROMA_W_sse3 32, 16
+ IPFILTER_CHROMA_W_sse3 32, 24
+ IPFILTER_CHROMA_W_sse3 24, 32
+ IPFILTER_CHROMA_W_sse3 32, 32
+
+ IPFILTER_CHROMA_W_sse3 16, 24
+ IPFILTER_CHROMA_W_sse3 16, 64
+ IPFILTER_CHROMA_W_sse3 32, 48
+ IPFILTER_CHROMA_W_sse3 24, 64
+ IPFILTER_CHROMA_W_sse3 32, 64
+
+ IPFILTER_CHROMA_W_sse3 64, 64
+ IPFILTER_CHROMA_W_sse3 64, 32
+ IPFILTER_CHROMA_W_sse3 64, 48
+ IPFILTER_CHROMA_W_sse3 48, 64
+ IPFILTER_CHROMA_W_sse3 64, 16
+
+%macro FILTER_H8_W8_sse2 0
+ movh m1, [r0 + x - 3]
+ movh m4, [r0 + x - 2]
+ punpcklbw m1, m6
+ punpcklbw m4, m6
+ movh m5, [r0 + x - 1]
+ movh m0, [r0 + x]
+ punpcklbw m5, m6
+ punpcklbw m0, m6
+ pmaddwd m1, m3
+ pmaddwd m4, m3
+ pmaddwd m5, m3
+ pmaddwd m0, m3
+ packssdw m1, m4
+ packssdw m5, m0
+ pshuflw m4, m1, q2301
+ pshufhw m4, m4, q2301
+ pshuflw m0, m5, q2301
+ pshufhw m0, m0, q2301
+ paddw m1, m4
+ paddw m5, m0
+ psrldq m1, 2
+ psrldq m5, 2
+ pshufd m1, m1, q3120
+ pshufd m5, m5, q3120
+ punpcklqdq m1, m5
+ movh m7, [r0 + x + 1]
+ movh m4, [r0 + x + 2]
+ punpcklbw m7, m6
+ punpcklbw m4, m6
+ movh m5, [r0 + x + 3]
+ movh m0, [r0 + x + 4]
+ punpcklbw m5, m6
+ punpcklbw m0, m6
+ pmaddwd m7, m3
+ pmaddwd m4, m3
+ pmaddwd m5, m3
+ pmaddwd m0, m3
+ packssdw m7, m4
+ packssdw m5, m0
+ pshuflw m4, m7, q2301
+ pshufhw m4, m4, q2301
+ pshuflw m0, m5, q2301
+ pshufhw m0, m0, q2301
+ paddw m7, m4
+ paddw m5, m0
+ psrldq m7, 2
+ psrldq m5, 2
+ pshufd m7, m7, q3120
+ pshufd m5, m5, q3120
+ punpcklqdq m7, m5
+ pshuflw m4, m1, q2301
+ pshufhw m4, m4, q2301
+ pshuflw m0, m7, q2301
+ pshufhw m0, m0, q2301
+ paddw m1, m4
+ paddw m7, m0
+ psrldq m1, 2
+ psrldq m7, 2
+ pshufd m1, m1, q3120
+ pshufd m7, m7, q3120
+ punpcklqdq m1, m7
+%endmacro
+
+%macro FILTER_H8_W4_sse2 0
+ movh m1, [r0 + x - 3]
+ movh m0, [r0 + x - 2]
+ punpcklbw m1, m6
+ punpcklbw m0, m6
+ movh m4, [r0 + x - 1]
+ movh m5, [r0 + x]
+ punpcklbw m4, m6
+ punpcklbw m5, m6
+ pmaddwd m1, m3
+ pmaddwd m0, m3
+ pmaddwd m4, m3
+ pmaddwd m5, m3
+ packssdw m1, m0
+ packssdw m4, m5
+ pshuflw m0, m1, q2301
+ pshufhw m0, m0, q2301
+ pshuflw m5, m4, q2301
+ pshufhw m5, m5, q2301
+ paddw m1, m0
+ paddw m4, m5
+ psrldq m1, 2
+ psrldq m4, 2
+ pshufd m1, m1, q3120
+ pshufd m4, m4, q3120
+ punpcklqdq m1, m4
+ pshuflw m0, m1, q2301
+ pshufhw m0, m0, q2301
+ paddw m1, m0
+ psrldq m1, 2
+ pshufd m1, m1, q3120
+%endmacro
+
+;----------------------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;----------------------------------------------------------------------------------------------------------------------------
+%macro IPFILTER_LUMA_sse2 3
+INIT_XMM sse2
+cglobal interp_8tap_horiz_%3_%1x%2, 4,6,8
+ mov r4d, r4m
+ add r4d, r4d
+ pxor m6, m6
+
+%ifidn %3, ps
+ add r3d, r3d
+ cmp r5m, byte 0
+%endif
+
+%ifdef PIC
+ lea r5, [tabw_LumaCoeff]
+ movu m3, [r5 + r4 * 8]
+%else
+ movu m3, [tabw_LumaCoeff + r4 * 8]
+%endif
+
+ mov r4d, %2
+
+%ifidn %3, pp
+ mova m2, [pw_32]
+%else
+ mova m2, [pw_2000]
+ je .loopH
+ lea r5, [r1 + 2 * r1]
+ sub r0, r5
+ add r4d, 7
+%endif
+
+.loopH:
+%assign x 0
+%rep %1 / 8
+ FILTER_H8_W8_sse2
+ %ifidn %3, pp
+ paddw m1, m2
+ psraw m1, 6
+ packuswb m1, m1
+ movh [r2 + x], m1
+ %else
+ psubw m1, m2
+ movu [r2 + 2 * x], m1
+ %endif
+%assign x x+8
+%endrep
+
+%rep (%1 % 8) / 4
+ FILTER_H8_W4_sse2
+ %ifidn %3, pp
+ paddw m1, m2
+ psraw m1, 6
+ packuswb m1, m1
+ movd [r2 + x], m1
+ %else
+ psubw m1, m2
+ movh [r2 + 2 * x], m1
+ %endif
+%endrep
+
+ add r0, r1
+ add r2, r3
+
+ dec r4d
+ jnz .loopH
+ RET
+
+%endmacro
+
+;--------------------------------------------------------------------------------------------------------------
+; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;--------------------------------------------------------------------------------------------------------------
+ IPFILTER_LUMA_sse2 4, 4, pp
+ IPFILTER_LUMA_sse2 4, 8, pp
+ IPFILTER_LUMA_sse2 8, 4, pp
1172
+ IPFILTER_LUMA_sse2 8, 8, pp
1173
+ IPFILTER_LUMA_sse2 16, 16, pp
1174
+ IPFILTER_LUMA_sse2 16, 8, pp
1175
+ IPFILTER_LUMA_sse2 8, 16, pp
1176
+ IPFILTER_LUMA_sse2 16, 12, pp
1177
+ IPFILTER_LUMA_sse2 12, 16, pp
1178
+ IPFILTER_LUMA_sse2 16, 4, pp
1179
+ IPFILTER_LUMA_sse2 4, 16, pp
1180
+ IPFILTER_LUMA_sse2 32, 32, pp
1181
+ IPFILTER_LUMA_sse2 32, 16, pp
1182
+ IPFILTER_LUMA_sse2 16, 32, pp
1183
+ IPFILTER_LUMA_sse2 32, 24, pp
1184
+ IPFILTER_LUMA_sse2 24, 32, pp
1185
+ IPFILTER_LUMA_sse2 32, 8, pp
1186
+ IPFILTER_LUMA_sse2 8, 32, pp
1187
+ IPFILTER_LUMA_sse2 64, 64, pp
1188
+ IPFILTER_LUMA_sse2 64, 32, pp
1189
+ IPFILTER_LUMA_sse2 32, 64, pp
1190
+ IPFILTER_LUMA_sse2 64, 48, pp
1191
+ IPFILTER_LUMA_sse2 48, 64, pp
1192
+ IPFILTER_LUMA_sse2 64, 16, pp
1193
+ IPFILTER_LUMA_sse2 16, 64, pp
1194
+
1195
+;----------------------------------------------------------------------------------------------------------------------------
1196
+; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
1197
+;----------------------------------------------------------------------------------------------------------------------------
1198
+ IPFILTER_LUMA_sse2 4, 4, ps
1199
+ IPFILTER_LUMA_sse2 8, 8, ps
1200
+ IPFILTER_LUMA_sse2 8, 4, ps
1201
+ IPFILTER_LUMA_sse2 4, 8, ps
1202
+ IPFILTER_LUMA_sse2 16, 16, ps
1203
+ IPFILTER_LUMA_sse2 16, 8, ps
1204
+ IPFILTER_LUMA_sse2 8, 16, ps
1205
+ IPFILTER_LUMA_sse2 16, 12, ps
1206
+ IPFILTER_LUMA_sse2 12, 16, ps
1207
+ IPFILTER_LUMA_sse2 16, 4, ps
1208
+ IPFILTER_LUMA_sse2 4, 16, ps
1209
+ IPFILTER_LUMA_sse2 32, 32, ps
1210
+ IPFILTER_LUMA_sse2 32, 16, ps
1211
+ IPFILTER_LUMA_sse2 16, 32, ps
1212
+ IPFILTER_LUMA_sse2 32, 24, ps
1213
+ IPFILTER_LUMA_sse2 24, 32, ps
1214
+ IPFILTER_LUMA_sse2 32, 8, ps
1215
+ IPFILTER_LUMA_sse2 8, 32, ps
1216
+ IPFILTER_LUMA_sse2 64, 64, ps
1217
+ IPFILTER_LUMA_sse2 64, 32, ps
1218
+ IPFILTER_LUMA_sse2 32, 64, ps
1219
+ IPFILTER_LUMA_sse2 64, 48, ps
1220
+ IPFILTER_LUMA_sse2 48, 64, ps
1221
+ IPFILTER_LUMA_sse2 64, 16, ps
1222
+ IPFILTER_LUMA_sse2 16, 64, ps
1223
+
1224
+%macro WORD_TO_DOUBLE 1
1225
+%if ARCH_X86_64
1226
+ punpcklbw %1, m8
1227
+%else
1228
+ punpcklbw %1, %1
1229
+ psrlw %1, 8
1230
+%endif
1231
+%endmacro
1232
+
1233
+;-----------------------------------------------------------------------------
1234
+; void interp_4tap_vert_pp_2xn(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1235
+;-----------------------------------------------------------------------------
1236
+%macro FILTER_V4_W2_H4_sse2 1
1237
+INIT_XMM sse2
1238
+%if ARCH_X86_64
1239
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 9
1240
+ pxor m8, m8
1241
+%else
1242
+cglobal interp_4tap_vert_pp_2x%1, 4, 6, 8
1243
+%endif
1244
+ mov r4d, r4m
1245
+ sub r0, r1
1246
+
1247
+%ifdef PIC
1248
+ lea r5, [tabw_ChromaCoeff]
1249
+ movh m0, [r5 + r4 * 8]
1250
+%else
1251
+ movh m0, [tabw_ChromaCoeff + r4 * 8]
1252
+%endif
1253
+
1254
+ punpcklqdq m0, m0
1255
+ mova m1, [pw_32]
1256
+ lea r5, [3 * r1]
1257
+
1258
+%assign x 1
1259
+%rep %1/4
1260
+ movd m2, [r0]
1261
+ movd m3, [r0 + r1]
1262
+ movd m4, [r0 + 2 * r1]
1263
+ movd m5, [r0 + r5]
1264
+
1265
+ punpcklbw m2, m3
1266
+ punpcklbw m6, m4, m5
1267
+ punpcklwd m2, m6
1268
+
1269
+ WORD_TO_DOUBLE m2
1270
+ pmaddwd m2, m0
1271
+
1272
+ lea r0, [r0 + 4 * r1]
1273
+ movd m6, [r0]
1274
+
1275
+ punpcklbw m3, m4
1276
+ punpcklbw m7, m5, m6
1277
+ punpcklwd m3, m7
1278
+
1279
+ WORD_TO_DOUBLE m3
1280
+ pmaddwd m3, m0
1281
+
1282
+ packssdw m2, m3
1283
+ pshuflw m3, m2, q2301
1284
+ pshufhw m3, m3, q2301
1285
+ paddw m2, m3
1286
+ psrld m2, 16
1287
+
1288
+ movd m7, [r0 + r1]
1289
+
1290
+ punpcklbw m4, m5
1291
+ punpcklbw m3, m6, m7
1292
+ punpcklwd m4, m3
1293
+
1294
+ WORD_TO_DOUBLE m4
1295
+ pmaddwd m4, m0
1296
+
1297
+ movd m3, [r0 + 2 * r1]
1298
+
1299
+ punpcklbw m5, m6
1300
+ punpcklbw m7, m3
1301
+ punpcklwd m5, m7
1302
+
1303
+ WORD_TO_DOUBLE m5
1304
+ pmaddwd m5, m0
1305
+
1306
+ packssdw m4, m5
1307
+ pshuflw m5, m4, q2301
1308
+ pshufhw m5, m5, q2301
1309
+ paddw m4, m5
1310
+ psrld m4, 16
1311
+
1312
+ packssdw m2, m4
1313
+ paddw m2, m1
1314
+ psraw m2, 6
1315
+ packuswb m2, m2
1316
+
1317
+%if ARCH_X86_64
1318
+ movq r4, m2
1319
+ mov [r2], r4w
1320
+ shr r4, 16
1321
+ mov [r2 + r3], r4w
1322
+ lea r2, [r2 + 2 * r3]
1323
+ shr r4, 16
1324
+ mov [r2], r4w
1325
+ shr r4, 16
1326
+ mov [r2 + r3], r4w
1327
+%else
1328
+ movd r4, m2
1329
+ mov [r2], r4w
1330
+ shr r4, 16
1331
+ mov [r2 + r3], r4w
1332
+ lea r2, [r2 + 2 * r3]
1333
+ psrldq m2, 4
1334
+ movd r4, m2
1335
+ mov [r2], r4w
1336
+ shr r4, 16
1337
+ mov [r2 + r3], r4w
1338
+%endif
1339
+
1340
+%if x < %1/4
1341
+ lea r2, [r2 + 2 * r3]
1342
+%endif
1343
+%assign x x+1
1344
+%endrep
1345
+ RET
1346
+
1347
+%endmacro
1348
+
1349
+ FILTER_V4_W2_H4_sse2 4
1350
+ FILTER_V4_W2_H4_sse2 8
1351
+ FILTER_V4_W2_H4_sse2 16
1352
+
1353
+;-----------------------------------------------------------------------------
1354
+; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1355
+;-----------------------------------------------------------------------------
1356
+INIT_XMM sse2
1357
+cglobal interp_4tap_vert_pp_4x2, 4, 6, 8
1358
+
1359
+ mov r4d, r4m
1360
+ sub r0, r1
1361
+ pxor m7, m7
1362
+
1363
+%ifdef PIC
1364
+ lea r5, [tabw_ChromaCoeff]
1365
+ movh m0, [r5 + r4 * 8]
1366
+%else
1367
+ movh m0, [tabw_ChromaCoeff + r4 * 8]
1368
+%endif
1369
+
1370
+ lea r5, [r0 + 2 * r1]
1371
+ punpcklqdq m0, m0
1372
+ movd m2, [r0]
1373
+ movd m3, [r0 + r1]
1374
+ movd m4, [r5]
1375
+ movd m5, [r5 + r1]
1376
+
1377
+ punpcklbw m2, m3
1378
+ punpcklbw m1, m4, m5
1379
+ punpcklwd m2, m1
1380
+
1381
+ movhlps m6, m2
1382
+ punpcklbw m2, m7
1383
+ punpcklbw m6, m7
1384
+ pmaddwd m2, m0
1385
+ pmaddwd m6, m0
1386
+ packssdw m2, m6
1387
+
1388
+ movd m1, [r0 + 4 * r1]
1389
+
1390
+ punpcklbw m3, m4
1391
+ punpcklbw m5, m1
1392
+ punpcklwd m3, m5
1393
+
1394
+ movhlps m6, m3
1395
+ punpcklbw m3, m7
1396
+ punpcklbw m6, m7
1397
+ pmaddwd m3, m0
1398
+ pmaddwd m6, m0
1399
+ packssdw m3, m6
1400
+
1401
+ pshuflw m4, m2, q2301
1402
+ pshufhw m4, m4, q2301
1403
+ paddw m2, m4
1404
+ pshuflw m5, m3, q2301
1405
+ pshufhw m5, m5, q2301
1406
+ paddw m3, m5
1407
+ psrld m2, 16
1408
+ psrld m3, 16
1409
+ packssdw m2, m3
1410
+
1411
+ paddw m2, [pw_32]
1412
+ psraw m2, 6
1413
+ packuswb m2, m2
1414
+
1415
+ movd [r2], m2
1416
+ psrldq m2, 4
1417
+ movd [r2 + r3], m2
1418
+ RET
1419
+
1420
+;-----------------------------------------------------------------------------
1421
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1422
+;-----------------------------------------------------------------------------
1423
+%macro FILTER_V4_W4_H4_sse2 1
1424
+INIT_XMM sse2
1425
+%if ARCH_X86_64
1426
+cglobal interp_4tap_vert_pp_4x%1, 4, 6, 9
1427
+ pxor m8, m8
1428
+%else
1429
+cglobal interp_4tap_vert_pp_4x%1, 4, 6, 8
1430
+%endif
1431
+
1432
+ mov r4d, r4m
1433
+ sub r0, r1
1434
+
1435
+%ifdef PIC
1436
+ lea r5, [tabw_ChromaCoeff]
1437
+ movh m0, [r5 + r4 * 8]
1438
+%else
1439
+ movh m0, [tabw_ChromaCoeff + r4 * 8]
1440
+%endif
1441
+
1442
+ mova m1, [pw_32]
1443
+ lea r5, [3 * r1]
1444
+ punpcklqdq m0, m0
1445
+
1446
+%assign x 1
1447
+%rep %1/4
1448
+ movd m2, [r0]
1449
+ movd m3, [r0 + r1]
1450
+ movd m4, [r0 + 2 * r1]
1451
+ movd m5, [r0 + r5]
1452
+
1453
+ punpcklbw m2, m3
1454
+ punpcklbw m6, m4, m5
1455
+ punpcklwd m2, m6
1456
+
1457
+ movhlps m6, m2
1458
+ WORD_TO_DOUBLE m2
1459
+ WORD_TO_DOUBLE m6
1460
+ pmaddwd m2, m0
1461
+ pmaddwd m6, m0
1462
+ packssdw m2, m6
1463
+
1464
+ lea r0, [r0 + 4 * r1]
1465
+ movd m6, [r0]
1466
+
1467
+ punpcklbw m3, m4
1468
+ punpcklbw m7, m5, m6
1469
+ punpcklwd m3, m7
1470
+
1471
+ movhlps m7, m3
1472
+ WORD_TO_DOUBLE m3
1473
+ WORD_TO_DOUBLE m7
1474
+ pmaddwd m3, m0
1475
+ pmaddwd m7, m0
1476
+ packssdw m3, m7
1477
+
1478
+ pshuflw m7, m2, q2301
1479
+ pshufhw m7, m7, q2301
1480
+ paddw m2, m7
1481
+ pshuflw m7, m3, q2301
1482
+ pshufhw m7, m7, q2301
1483
+ paddw m3, m7
1484
+ psrld m2, 16
1485
+ psrld m3, 16
1486
+ packssdw m2, m3
1487
+
1488
+ paddw m2, m1
1489
+ psraw m2, 6
1490
+
1491
+ movd m7, [r0 + r1]
1492
+
1493
+ punpcklbw m4, m5
1494
+ punpcklbw m3, m6, m7
1495
+ punpcklwd m4, m3
1496
+
1497
+ movhlps m3, m4
1498
+ WORD_TO_DOUBLE m4
1499
+ WORD_TO_DOUBLE m3
1500
+ pmaddwd m4, m0
1501
+ pmaddwd m3, m0
1502
+ packssdw m4, m3
1503
+
1504
+ movd m3, [r0 + 2 * r1]
1505
+
1506
+ punpcklbw m5, m6
1507
+ punpcklbw m7, m3
1508
+ punpcklwd m5, m7
1509
+
1510
+ movhlps m3, m5
1511
+ WORD_TO_DOUBLE m5
1512
+ WORD_TO_DOUBLE m3
1513
+ pmaddwd m5, m0
1514
+ pmaddwd m3, m0
1515
+ packssdw m5, m3
1516
+
1517
+ pshuflw m7, m4, q2301
1518
+ pshufhw m7, m7, q2301
1519
+ paddw m4, m7
1520
+ pshuflw m7, m5, q2301
1521
+ pshufhw m7, m7, q2301
1522
+ paddw m5, m7
1523
+ psrld m4, 16
1524
+ psrld m5, 16
1525
+ packssdw m4, m5
1526
+
1527
+ paddw m4, m1
1528
+ psraw m4, 6
1529
+ packuswb m2, m4
1530
+
1531
+ movd [r2], m2
1532
+ psrldq m2, 4
1533
+ movd [r2 + r3], m2
1534
+ lea r2, [r2 + 2 * r3]
1535
+ psrldq m2, 4
1536
+ movd [r2], m2
1537
+ psrldq m2, 4
1538
+ movd [r2 + r3], m2
1539
+
1540
+%if x < %1/4
1541
+ lea r2, [r2 + 2 * r3]
1542
+%endif
1543
+%assign x x+1
1544
+%endrep
1545
+ RET
1546
+%endmacro
1547
+
1548
+ FILTER_V4_W4_H4_sse2 4
1549
+ FILTER_V4_W4_H4_sse2 8
1550
+ FILTER_V4_W4_H4_sse2 16
1551
+ FILTER_V4_W4_H4_sse2 32
1552
+
1553
+;-----------------------------------------------------------------------------
1554
+;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1555
+;-----------------------------------------------------------------------------
1556
+%macro FILTER_V4_W6_H4_sse2 1
1557
+INIT_XMM sse2
1558
+cglobal interp_4tap_vert_pp_6x%1, 4, 7, 10
1559
+
1560
+ mov r4d, r4m
1561
+ sub r0, r1
1562
+ shl r4d, 5
1563
+ pxor m9, m9
1564
+
1565
+%ifdef PIC
1566
+ lea r5, [tab_ChromaCoeffV]
1567
+ mova m6, [r5 + r4]
1568
+ mova m5, [r5 + r4 + 16]
1569
+%else
1570
+ mova m6, [tab_ChromaCoeffV + r4]
1571
+ mova m5, [tab_ChromaCoeffV + r4 + 16]
1572
+%endif
1573
+
1574
+ mova m4, [pw_32]
1575
+ lea r5, [3 * r1]
1576
+
1577
+%assign x 1
1578
+%rep %1/4
1579
+ movq m0, [r0]
1580
+ movq m1, [r0 + r1]
1581
+ movq m2, [r0 + 2 * r1]
1582
+ movq m3, [r0 + r5]
1583
+
1584
+ punpcklbw m0, m1
1585
+ punpcklbw m1, m2
1586
+ punpcklbw m2, m3
1587
+
1588
+ movhlps m7, m0
1589
+ punpcklbw m0, m9
1590
+ punpcklbw m7, m9
1591
+ pmaddwd m0, m6
1592
+ pmaddwd m7, m6
1593
+ packssdw m0, m7
1594
+
1595
+ movhlps m8, m2
1596
+ movq m7, m2
1597
+ punpcklbw m8, m9
1598
+ punpcklbw m7, m9
1599
+ pmaddwd m8, m5
1600
+ pmaddwd m7, m5
1601
+ packssdw m7, m8
1602
+
1603
+ paddw m0, m7
1604
+
1605
+ paddw m0, m4
1606
+ psraw m0, 6
1607
+ packuswb m0, m0
1608
+ movd [r2], m0
1609
+ pextrw r6d, m0, 2
1610
+ mov [r2 + 4], r6w
1611
+
1612
+ lea r0, [r0 + 4 * r1]
1613
+
1614
+ movq m0, [r0]
1615
+ punpcklbw m3, m0
1616
+
1617
+ movhlps m8, m1
1618
+ punpcklbw m1, m9
1619
+ punpcklbw m8, m9
1620
+ pmaddwd m1, m6
1621
+ pmaddwd m8, m6
1622
+ packssdw m1, m8
1623
+
1624
+ movhlps m8, m3
1625
+ movq m7, m3
1626
+ punpcklbw m8, m9
1627
+ punpcklbw m7, m9
1628
+ pmaddwd m8, m5
1629
+ pmaddwd m7, m5
1630
+ packssdw m7, m8
1631
+
1632
+ paddw m1, m7
1633
+
1634
+ paddw m1, m4
1635
+ psraw m1, 6
1636
+ packuswb m1, m1
1637
+ movd [r2 + r3], m1
1638
+ pextrw r6d, m1, 2
1639
+ mov [r2 + r3 + 4], r6w
1640
+ movq m1, [r0 + r1]
1641
+ punpcklbw m7, m0, m1
1642
+
1643
+ movhlps m8, m2
1644
+ punpcklbw m2, m9
1645
+ punpcklbw m8, m9
1646
+ pmaddwd m2, m6
1647
+ pmaddwd m8, m6
1648
+ packssdw m2, m8
1649
+
1650
+ movhlps m8, m7
1651
+ punpcklbw m7, m9
1652
+ punpcklbw m8, m9
1653
+ pmaddwd m7, m5
1654
+ pmaddwd m8, m5
1655
+ packssdw m7, m8
1656
+
1657
+ paddw m2, m7
1658
+
1659
+ paddw m2, m4
1660
+ psraw m2, 6
1661
+ packuswb m2, m2
1662
+ lea r2, [r2 + 2 * r3]
1663
+ movd [r2], m2
1664
+ pextrw r6d, m2, 2
1665
+ mov [r2 + 4], r6w
1666
+
1667
+ movq m2, [r0 + 2 * r1]
1668
+ punpcklbw m1, m2
1669
+
1670
+ movhlps m8, m3
1671
+ punpcklbw m3, m9
1672
+ punpcklbw m8, m9
1673
+ pmaddwd m3, m6
1674
+ pmaddwd m8, m6
1675
+ packssdw m3, m8
1676
+
1677
+ movhlps m8, m1
1678
+ punpcklbw m1, m9
1679
+ punpcklbw m8, m9
1680
+ pmaddwd m1, m5
1681
+ pmaddwd m8, m5
1682
+ packssdw m1, m8
1683
+
1684
+ paddw m3, m1
1685
+
1686
+ paddw m3, m4
1687
+ psraw m3, 6
1688
+ packuswb m3, m3
1689
+
1690
+ movd [r2 + r3], m3
1691
+ pextrw r6d, m3, 2
1692
+ mov [r2 + r3 + 4], r6w
1693
+
1694
+%if x < %1/4
1695
+ lea r2, [r2 + 2 * r3]
1696
+%endif
1697
+%assign x x+1
1698
+%endrep
1699
+ RET
1700
+
1701
+%endmacro
1702
+
1703
+%if ARCH_X86_64
1704
+ FILTER_V4_W6_H4_sse2 8
1705
+ FILTER_V4_W6_H4_sse2 16
1706
+%endif
1707
+
1708
+;-----------------------------------------------------------------------------
1709
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1710
+;-----------------------------------------------------------------------------
1711
+%macro FILTER_V4_W8_sse2 1
1712
+INIT_XMM sse2
1713
+cglobal interp_4tap_vert_pp_8x%1, 4, 7, 12
1714
+
1715
+ mov r4d, r4m
1716
+ sub r0, r1
1717
+ shl r4d, 5
1718
+ pxor m9, m9
1719
+ mova m4, [pw_32]
1720
+
1721
+%ifdef PIC
1722
+ lea r6, [tab_ChromaCoeffV]
1723
+ mova m6, [r6 + r4]
1724
+ mova m5, [r6 + r4 + 16]
1725
+%else
1726
+ mova m6, [tab_ChromaCoeffV + r4]
1727
+ mova m5, [tab_ChromaCoeffV + r4 + 16]
1728
+%endif
1729
+
1730
+ movq m0, [r0]
1731
+ movq m1, [r0 + r1]
1732
+ movq m2, [r0 + 2 * r1]
1733
+ lea r5, [r0 + 2 * r1]
1734
+ movq m3, [r5 + r1]
1735
+
1736
+ punpcklbw m0, m1
1737
+ punpcklbw m7, m2, m3
1738
+
1739
+ movhlps m8, m0
1740
+ punpcklbw m0, m9
1741
+ punpcklbw m8, m9
1742
+ pmaddwd m0, m6
1743
+ pmaddwd m8, m6
1744
+ packssdw m0, m8
1745
+
1746
+ movhlps m8, m7
1747
+ punpcklbw m7, m9
1748
+ punpcklbw m8, m9
1749
+ pmaddwd m7, m5
1750
+ pmaddwd m8, m5
1751
+ packssdw m7, m8
1752
+
1753
+ paddw m0, m7
1754
+
1755
+ paddw m0, m4
1756
+ psraw m0, 6
1757
+
1758
+ movq m11, [r0 + 4 * r1]
1759
+
1760
+ punpcklbw m1, m2
1761
+ punpcklbw m7, m3, m11
1762
+
1763
+ movhlps m8, m1
1764
+ punpcklbw m1, m9
1765
+ punpcklbw m8, m9
1766
+ pmaddwd m1, m6
1767
+ pmaddwd m8, m6
1768
+ packssdw m1, m8
1769
+
1770
+ movhlps m8, m7
1771
+ punpcklbw m7, m9
1772
+ punpcklbw m8, m9
1773
+ pmaddwd m7, m5
1774
+ pmaddwd m8, m5
1775
+ packssdw m7, m8
1776
+
1777
+ paddw m1, m7
1778
+
1779
+ paddw m1, m4
1780
+ psraw m1, 6
1781
+ packuswb m1, m0
1782
+
1783
+ movhps [r2], m1
1784
+ movh [r2 + r3], m1
1785
+%if %1 == 2 ;end of 8x2
1786
+ RET
1787
+
1788
+%else
1789
+ lea r6, [r0 + 4 * r1]
1790
+ movq m1, [r6 + r1]
1791
+
1792
+ punpcklbw m2, m3
1793
+ punpcklbw m7, m11, m1
1794
+
1795
+ movhlps m8, m2
1796
+ punpcklbw m2, m9
1797
+ punpcklbw m8, m9
1798
+ pmaddwd m2, m6
1799
+ pmaddwd m8, m6
1800
+ packssdw m2, m8
1801
+
1802
+ movhlps m8, m7
1803
+ punpcklbw m7, m9
1804
+ punpcklbw m8, m9
1805
+ pmaddwd m7, m5
1806
+ pmaddwd m8, m5
1807
+ packssdw m7, m8
1808
+
1809
+ paddw m2, m7
1810
+
1811
+ paddw m2, m4
1812
+ psraw m2, 6
1813
+
1814
+ movq m10, [r6 + 2 * r1]
1815
+
1816
+ punpcklbw m3, m11
1817
+ punpcklbw m7, m1, m10
1818
+
1819
+ movhlps m8, m3
1820
+ punpcklbw m3, m9
1821
+ punpcklbw m8, m9
1822
+ pmaddwd m3, m6
1823
+ pmaddwd m8, m6
1824
+ packssdw m3, m8
1825
+
1826
+ movhlps m8, m7
1827
+ punpcklbw m7, m9
1828
+ punpcklbw m8, m9
1829
+ pmaddwd m7, m5
1830
+ pmaddwd m8, m5
1831
+ packssdw m7, m8
1832
+
1833
+ paddw m3, m7
1834
+
1835
+ paddw m3, m4
1836
+ psraw m3, 6
1837
+ packuswb m3, m2
1838
+
1839
+ movhps [r2 + 2 * r3], m3
1840
+ lea r5, [r2 + 2 * r3]
1841
+ movh [r5 + r3], m3
1842
+%if %1 == 4 ;end of 8x4
1843
+ RET
1844
+
1845
+%else
1846
+ lea r6, [r6 + 2 * r1]
1847
+ movq m3, [r6 + r1]
1848
+
1849
+ punpcklbw m11, m1
1850
+ punpcklbw m7, m10, m3
1851
+
1852
+ movhlps m8, m11
1853
+ punpcklbw m11, m9
1854
+ punpcklbw m8, m9
1855
+ pmaddwd m11, m6
1856
+ pmaddwd m8, m6
1857
+ packssdw m11, m8
1858
+
1859
+ movhlps m8, m7
1860
+ punpcklbw m7, m9
1861
+ punpcklbw m8, m9
1862
+ pmaddwd m7, m5
1863
+ pmaddwd m8, m5
1864
+ packssdw m7, m8
1865
+
1866
+ paddw m11, m7
1867
+
1868
+ paddw m11, m4
1869
+ psraw m11, 6
1870
+
1871
+ movq m7, [r0 + 8 * r1]
1872
+
1873
+ punpcklbw m1, m10
1874
+ punpcklbw m3, m7
1875
+
1876
+ movhlps m8, m1
1877
+ punpcklbw m1, m9
1878
+ punpcklbw m8, m9
1879
+ pmaddwd m1, m6
1880
+ pmaddwd m8, m6
1881
+ packssdw m1, m8
1882
+
1883
+ movhlps m8, m3
1884
+ punpcklbw m3, m9
1885
+ punpcklbw m8, m9
1886
+ pmaddwd m3, m5
1887
+ pmaddwd m8, m5
1888
+ packssdw m3, m8
1889
+
1890
+ paddw m1, m3
1891
+
1892
+ paddw m1, m4
1893
+ psraw m1, 6
1894
+ packuswb m1, m11
1895
+
1896
+ movhps [r2 + 4 * r3], m1
1897
+ lea r5, [r2 + 4 * r3]
1898
+ movh [r5 + r3], m1
1899
+%if %1 == 6
1900
+ RET
1901
+
1902
+%else
1903
+ %error INVALID macro argument, only 2, 4 or 6!
1904
+%endif
1905
+%endif
1906
+%endif
1907
+%endmacro
1908
+
1909
+%if ARCH_X86_64
1910
+ FILTER_V4_W8_sse2 2
1911
+ FILTER_V4_W8_sse2 4
1912
+ FILTER_V4_W8_sse2 6
1913
+%endif
1914
+
1915
+;-----------------------------------------------------------------------------
1916
+; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
1917
+;-----------------------------------------------------------------------------
1918
+%macro FILTER_V4_W8_H8_H16_H32_sse2 2
1919
+INIT_XMM sse2
1920
+cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 11
1921
+
1922
+ mov r4d, r4m
1923
+ sub r0, r1
1924
+ shl r4d, 5
1925
+ pxor m9, m9
1926
+
1927
+%ifdef PIC
1928
+ lea r5, [tab_ChromaCoeffV]
1929
+ mova m6, [r5 + r4]
1930
+ mova m5, [r5 + r4 + 16]
1931
+%else
+ mova m6, [tab_ChromaCoeffV + r4]
+ mova m5, [tab_ChromaCoeffV + r4 + 16]
+%endif
+
+ mova m4, [pw_32]
+ lea r5, [r1 * 3]
+
+%assign x 1
+%rep %2/4
+ movq m0, [r0]
+ movq m1, [r0 + r1]
+ movq m2, [r0 + 2 * r1]
+ movq m3, [r0 + r5]
+
+ punpcklbw m0, m1
+ punpcklbw m1, m2
+ punpcklbw m2, m3
+
+ movhlps m7, m0
+ punpcklbw m0, m9
+ punpcklbw m7, m9
+ pmaddwd m0, m6
+ pmaddwd m7, m6
+ packssdw m0, m7
+
+ movhlps m8, m2
+ movq m7, m2
+ punpcklbw m8, m9
+ punpcklbw m7, m9
+ pmaddwd m8, m5
+ pmaddwd m7, m5
+ packssdw m7, m8
+
+ paddw m0, m7
+ paddw m0, m4
+ psraw m0, 6
+
+ lea r0, [r0 + 4 * r1]
+ movq m10, [r0]
+ punpcklbw m3, m10
+
+ movhlps m8, m1
+ punpcklbw m1, m9
+ punpcklbw m8, m9
+ pmaddwd m1, m6
+ pmaddwd m8, m6
+ packssdw m1, m8
+
+ movhlps m8, m3
+ movq m7, m3
+ punpcklbw m8, m9
+ punpcklbw m7, m9
+ pmaddwd m8, m5
+ pmaddwd m7, m5
+ packssdw m7, m8
+
+ paddw m1, m7
+ paddw m1, m4
+ psraw m1, 6
+
+ packuswb m0, m1
+ movh [r2], m0
+ movhps [r2 + r3], m0
+
+ movq m1, [r0 + r1]
+ punpcklbw m10, m1
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m6
+ pmaddwd m8, m6
+ packssdw m2, m8
+
+ movhlps m8, m10
+ punpcklbw m10, m9
+ punpcklbw m8, m9
+ pmaddwd m10, m5
+ pmaddwd m8, m5
+ packssdw m10, m8
+
+ paddw m2, m10
+ paddw m2, m4
+ psraw m2, 6
+
+ movq m7, [r0 + 2 * r1]
+ punpcklbw m1, m7
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m6
+ pmaddwd m8, m6
+ packssdw m3, m8
+
+ movhlps m8, m1
+ punpcklbw m1, m9
+ punpcklbw m8, m9
+ pmaddwd m1, m5
+ pmaddwd m8, m5
+ packssdw m1, m8
+
+ paddw m3, m1
+ paddw m3, m4
+ psraw m3, 6
+
+ packuswb m2, m3
+ lea r2, [r2 + 2 * r3]
+ movh [r2], m2
+ movhps [r2 + r3], m2
+%if x < %2/4
+ lea r2, [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+ RET
2047
+%endmacro
2048
+
2049
+%if ARCH_X86_64
2050
+ FILTER_V4_W8_H8_H16_H32_sse2 8, 8
2051
+ FILTER_V4_W8_H8_H16_H32_sse2 8, 16
2052
+ FILTER_V4_W8_H8_H16_H32_sse2 8, 32
2053
+
2054
+ FILTER_V4_W8_H8_H16_H32_sse2 8, 12
2055
+ FILTER_V4_W8_H8_H16_H32_sse2 8, 64
2056
+%endif
2057
+
2058
;-----------------------------------------------------------------------------
2059
; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2060
;-----------------------------------------------------------------------------
2061
2062
%define t1 m1
2063
%define t0 m0
2064
2065
-mov r4d, r4m
2066
+ mov r4d, r4m
2067
2068
%ifdef PIC
2069
-lea r5, [tab_ChromaCoeff]
2070
-movd coef2, [r5 + r4 * 4]
2071
+ lea r5, [tab_ChromaCoeff]
2072
+ movd coef2, [r5 + r4 * 4]
2073
%else
2074
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2075
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2076
%endif
2077
2078
-pshufd coef2, coef2, 0
2079
-mova t2, [pw_512]
2080
-mova Tm0, [tab_Tm]
2081
+ pshufd coef2, coef2, 0
2082
+ mova t2, [pw_512]
2083
+ mova Tm0, [tab_Tm]
2084
2085
%rep 2
2086
-FILTER_H4_w2_2 t0, t1, t2
2087
-lea srcq, [srcq + srcstrideq * 2]
2088
-lea dstq, [dstq + dststrideq * 2]
2089
+ FILTER_H4_w2_2 t0, t1, t2
2090
+ lea srcq, [srcq + srcstrideq * 2]
2091
+ lea dstq, [dstq + dststrideq * 2]
2092
%endrep
2093
2094
-RET
2095
+ RET
2096
2097
;-----------------------------------------------------------------------------
2098
; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2099
2100
%define t1 m1
2101
%define t0 m0
2102
2103
-mov r4d, r4m
2104
+ mov r4d, r4m
2105
2106
%ifdef PIC
2107
-lea r5, [tab_ChromaCoeff]
2108
-movd coef2, [r5 + r4 * 4]
2109
+ lea r5, [tab_ChromaCoeff]
2110
+ movd coef2, [r5 + r4 * 4]
2111
%else
2112
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2113
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2114
%endif
2115
2116
-pshufd coef2, coef2, 0
2117
-mova t2, [pw_512]
2118
-mova Tm0, [tab_Tm]
2119
+ pshufd coef2, coef2, 0
2120
+ mova t2, [pw_512]
2121
+ mova Tm0, [tab_Tm]
2122
2123
%rep 4
2124
-FILTER_H4_w2_2 t0, t1, t2
2125
-lea srcq, [srcq + srcstrideq * 2]
2126
-lea dstq, [dstq + dststrideq * 2]
2127
+ FILTER_H4_w2_2 t0, t1, t2
2128
+ lea srcq, [srcq + srcstrideq * 2]
2129
+ lea dstq, [dstq + dststrideq * 2]
2130
%endrep
2131
2132
-RET
2133
+ RET
2134
2135
;-----------------------------------------------------------------------------
2136
; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2137
2138
%define t1 m1
2139
%define t0 m0
2140
2141
-mov r4d, r4m
2142
+ mov r4d, r4m
2143
2144
%ifdef PIC
2145
-lea r5, [tab_ChromaCoeff]
2146
-movd coef2, [r5 + r4 * 4]
2147
+ lea r5, [tab_ChromaCoeff]
2148
+ movd coef2, [r5 + r4 * 4]
2149
%else
2150
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2151
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2152
%endif
2153
2154
-pshufd coef2, coef2, 0
2155
-mova t2, [pw_512]
2156
-mova Tm0, [tab_Tm]
2157
+ pshufd coef2, coef2, 0
2158
+ mova t2, [pw_512]
2159
+ mova Tm0, [tab_Tm]
2160
2161
-mov r5d, 16/2
2162
+ mov r5d, 16/2
2163
2164
.loop:
2165
-FILTER_H4_w2_2 t0, t1, t2
2166
-lea srcq, [srcq + srcstrideq * 2]
2167
-lea dstq, [dstq + dststrideq * 2]
2168
-dec r5d
2169
-jnz .loop
2170
+ FILTER_H4_w2_2 t0, t1, t2
2171
+ lea srcq, [srcq + srcstrideq * 2]
2172
+ lea dstq, [dstq + dststrideq * 2]
2173
+ dec r5d
2174
+ jnz .loop
2175
2176
-RET
2177
+ RET
2178
2179
%macro FILTER_H4_w4_2 3
2180
movh %2, [srcq - 1]
2181
2182
%define t1 m1
2183
%define t0 m0
2184
2185
-mov r4d, r4m
2186
+ mov r4d, r4m
2187
2188
%ifdef PIC
2189
-lea r5, [tab_ChromaCoeff]
2190
-movd coef2, [r5 + r4 * 4]
2191
+ lea r5, [tab_ChromaCoeff]
2192
+ movd coef2, [r5 + r4 * 4]
2193
%else
2194
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2195
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2196
%endif
2197
2198
-pshufd coef2, coef2, 0
2199
-mova t2, [pw_512]
2200
-mova Tm0, [tab_Tm]
2201
+ pshufd coef2, coef2, 0
2202
+ mova t2, [pw_512]
2203
+ mova Tm0, [tab_Tm]
2204
2205
-FILTER_H4_w4_2 t0, t1, t2
2206
+ FILTER_H4_w4_2 t0, t1, t2
2207
2208
-RET
2209
+ RET
2210
2211
;-----------------------------------------------------------------------------
2212
; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2213
2214
%define t1 m1
2215
%define t0 m0
2216
2217
-mov r4d, r4m
2218
+ mov r4d, r4m
2219
2220
%ifdef PIC
2221
-lea r5, [tab_ChromaCoeff]
2222
-movd coef2, [r5 + r4 * 4]
2223
+ lea r5, [tab_ChromaCoeff]
2224
+ movd coef2, [r5 + r4 * 4]
2225
%else
2226
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2227
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2228
%endif
2229
2230
-pshufd coef2, coef2, 0
2231
-mova t2, [pw_512]
2232
-mova Tm0, [tab_Tm]
2233
+ pshufd coef2, coef2, 0
2234
+ mova t2, [pw_512]
2235
+ mova Tm0, [tab_Tm]
2236
2237
%rep 2
2238
-FILTER_H4_w4_2 t0, t1, t2
2239
-lea srcq, [srcq + srcstrideq * 2]
2240
-lea dstq, [dstq + dststrideq * 2]
2241
+ FILTER_H4_w4_2 t0, t1, t2
2242
+ lea srcq, [srcq + srcstrideq * 2]
2243
+ lea dstq, [dstq + dststrideq * 2]
2244
%endrep
2245
2246
-RET
2247
+ RET
2248
2249
;-----------------------------------------------------------------------------
2250
; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2251
2252
%define t1 m1
2253
%define t0 m0
2254
2255
-mov r4d, r4m
2256
+ mov r4d, r4m
2257
2258
%ifdef PIC
2259
-lea r5, [tab_ChromaCoeff]
2260
-movd coef2, [r5 + r4 * 4]
2261
+ lea r5, [tab_ChromaCoeff]
2262
+ movd coef2, [r5 + r4 * 4]
2263
%else
2264
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2265
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2266
%endif
2267
2268
-pshufd coef2, coef2, 0
2269
-mova t2, [pw_512]
2270
-mova Tm0, [tab_Tm]
2271
+ pshufd coef2, coef2, 0
2272
+ mova t2, [pw_512]
2273
+ mova Tm0, [tab_Tm]
2274
2275
%rep 4
2276
-FILTER_H4_w4_2 t0, t1, t2
2277
-lea srcq, [srcq + srcstrideq * 2]
2278
-lea dstq, [dstq + dststrideq * 2]
2279
+ FILTER_H4_w4_2 t0, t1, t2
2280
+ lea srcq, [srcq + srcstrideq * 2]
2281
+ lea dstq, [dstq + dststrideq * 2]
2282
%endrep
2283
2284
-RET
2285
+ RET
2286
2287
;-----------------------------------------------------------------------------
2288
; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2289
2290
%define t1 m1
2291
%define t0 m0
2292
2293
-mov r4d, r4m
2294
+ mov r4d, r4m
2295
2296
%ifdef PIC
2297
-lea r5, [tab_ChromaCoeff]
2298
-movd coef2, [r5 + r4 * 4]
2299
+ lea r5, [tab_ChromaCoeff]
2300
+ movd coef2, [r5 + r4 * 4]
2301
%else
2302
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2303
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2304
%endif
2305
2306
-pshufd coef2, coef2, 0
2307
-mova t2, [pw_512]
2308
-mova Tm0, [tab_Tm]
2309
+ pshufd coef2, coef2, 0
2310
+ mova t2, [pw_512]
2311
+ mova Tm0, [tab_Tm]
2312
2313
%rep 8
2314
-FILTER_H4_w4_2 t0, t1, t2
2315
-lea srcq, [srcq + srcstrideq * 2]
2316
-lea dstq, [dstq + dststrideq * 2]
2317
+ FILTER_H4_w4_2 t0, t1, t2
2318
+ lea srcq, [srcq + srcstrideq * 2]
2319
+ lea dstq, [dstq + dststrideq * 2]
2320
%endrep
2321
2322
-RET
2323
+ RET
2324
2325
;-----------------------------------------------------------------------------
2326
; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
2327
2328
%define t1 m1
2329
%define t0 m0
2330
2331
-mov r4d, r4m
2332
+ mov r4d, r4m
2333
2334
%ifdef PIC
2335
-lea r5, [tab_ChromaCoeff]
2336
-movd coef2, [r5 + r4 * 4]
2337
+ lea r5, [tab_ChromaCoeff]
2338
+ movd coef2, [r5 + r4 * 4]
2339
%else
2340
-movd coef2, [tab_ChromaCoeff + r4 * 4]
2341
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
2342
%endif
2343
2344
-pshufd coef2, coef2, 0
2345
-mova t2, [pw_512]
2346
-mova Tm0, [tab_Tm]
2347
+ pshufd coef2, coef2, 0
2348
+ mova t2, [pw_512]
2349
+ mova Tm0, [tab_Tm]
2350
2351
-mov r5d, 32/2
+ mov r5d, 32/2

.loop:
-FILTER_H4_w4_2 t0, t1, t2
-lea srcq, [srcq + srcstrideq * 2]
-lea dstq, [dstq + dststrideq * 2]
-dec r5d
-jnz .loop
+ FILTER_H4_w4_2 t0, t1, t2
+ lea srcq, [srcq + srcstrideq * 2]
+ lea dstq, [dstq + dststrideq * 2]
+ dec r5d
+ jnz .loop

-RET
+ RET

ALIGN 32
const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7

%define t1 m1
%define t0 m0

-mov r4d, r4m
+ mov r4d, r4m

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd coef2, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd coef2, [r5 + r4 * 4]
%else
-movd coef2, [tab_ChromaCoeff + r4 * 4]
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
%endif

-mov r5d, %2
+ mov r5d, %2

-pshufd coef2, coef2, 0
-mova t2, [pw_512]
-mova Tm0, [tab_Tm]
-mova Tm1, [tab_Tm + 16]
+ pshufd coef2, coef2, 0
+ mova t2, [pw_512]
+ mova Tm0, [tab_Tm]
+ mova Tm1, [tab_Tm + 16]

.loop:
-FILTER_H4_w%1 t0, t1, t2
-add srcq, srcstrideq
-add dstq, dststrideq
-
-dec r5d
-jnz .loop
-
-RET
+ FILTER_H4_w%1 t0, t1, t2
+ add srcq, srcstrideq
+ add dstq, dststrideq
+
+ dec r5d
+ jnz .loop
+
+ RET
%endmacro


-IPFILTER_CHROMA 6, 8
-IPFILTER_CHROMA 8, 2
-IPFILTER_CHROMA 8, 4
-IPFILTER_CHROMA 8, 6
-IPFILTER_CHROMA 8, 8
-IPFILTER_CHROMA 8, 16
-IPFILTER_CHROMA 8, 32
-IPFILTER_CHROMA 12, 16
-
-IPFILTER_CHROMA 6, 16
-IPFILTER_CHROMA 8, 12
-IPFILTER_CHROMA 8, 64
-IPFILTER_CHROMA 12, 32
+ IPFILTER_CHROMA 6, 8
+ IPFILTER_CHROMA 8, 2
+ IPFILTER_CHROMA 8, 4
+ IPFILTER_CHROMA 8, 6
+ IPFILTER_CHROMA 8, 8
+ IPFILTER_CHROMA 8, 16
+ IPFILTER_CHROMA 8, 32
+ IPFILTER_CHROMA 12, 16
+
+ IPFILTER_CHROMA 6, 16
+ IPFILTER_CHROMA 8, 12
+ IPFILTER_CHROMA 8, 64
+ IPFILTER_CHROMA 12, 32

;-----------------------------------------------------------------------------
; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

%define t1 m1
%define t0 m0

-mov r4d, r4m
+ mov r4d, r4m

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd coef2, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd coef2, [r5 + r4 * 4]
%else
-movd coef2, [tab_ChromaCoeff + r4 * 4]
+ movd coef2, [tab_ChromaCoeff + r4 * 4]
%endif

-mov r5d, %2
+ mov r5d, %2

-pshufd coef2, coef2, 0
-mova t2, [pw_512]
-mova Tm0, [tab_Tm]
-mova Tm1, [tab_Tm + 16]
+ pshufd coef2, coef2, 0
+ mova t2, [pw_512]
+ mova Tm0, [tab_Tm]
+ mova Tm1, [tab_Tm + 16]

.loop:
-FILTER_H4_w%1 t0, t1, t2, t3
-add srcq, srcstrideq
-add dstq, dststrideq
-
-dec r5d
-jnz .loop
-
-RET
-%endmacro
-
-IPFILTER_CHROMA_W 16, 4
-IPFILTER_CHROMA_W 16, 8
-IPFILTER_CHROMA_W 16, 12
-IPFILTER_CHROMA_W 16, 16
-IPFILTER_CHROMA_W 16, 32
-IPFILTER_CHROMA_W 32, 8
-IPFILTER_CHROMA_W 32, 16
-IPFILTER_CHROMA_W 32, 24
-IPFILTER_CHROMA_W 24, 32
-IPFILTER_CHROMA_W 32, 32
-
-IPFILTER_CHROMA_W 16, 24
-IPFILTER_CHROMA_W 16, 64
-IPFILTER_CHROMA_W 32, 48
-IPFILTER_CHROMA_W 24, 64
-IPFILTER_CHROMA_W 32, 64
-
-IPFILTER_CHROMA_W 64, 64
-IPFILTER_CHROMA_W 64, 32
-IPFILTER_CHROMA_W 64, 48
-IPFILTER_CHROMA_W 48, 64
-IPFILTER_CHROMA_W 64, 16
+ FILTER_H4_w%1 t0, t1, t2, t3
+ add srcq, srcstrideq
+ add dstq, dststrideq
+
+ dec r5d
+ jnz .loop
+
+ RET
+%endmacro
+
+ IPFILTER_CHROMA_W 16, 4
+ IPFILTER_CHROMA_W 16, 8
+ IPFILTER_CHROMA_W 16, 12
+ IPFILTER_CHROMA_W 16, 16
+ IPFILTER_CHROMA_W 16, 32
+ IPFILTER_CHROMA_W 32, 8
+ IPFILTER_CHROMA_W 32, 16
+ IPFILTER_CHROMA_W 32, 24
+ IPFILTER_CHROMA_W 24, 32
+ IPFILTER_CHROMA_W 32, 32
+
+ IPFILTER_CHROMA_W 16, 24
+ IPFILTER_CHROMA_W 16, 64
+ IPFILTER_CHROMA_W 32, 48
+ IPFILTER_CHROMA_W 24, 64
+ IPFILTER_CHROMA_W 32, 64
+
+ IPFILTER_CHROMA_W 64, 64
+ IPFILTER_CHROMA_W 64, 32
+ IPFILTER_CHROMA_W 64, 48
+ IPFILTER_CHROMA_W 48, 64
+ IPFILTER_CHROMA_W 64, 16


%macro FILTER_H8_W8 7-8 ; t0, t1, t2, t3, coef, c512, src, dst

%endif
punpcklqdq m3, m3

-%ifidn %3, pp
+%ifidn %3, pp
mova m2, [pw_512]
%else
mova m2, [pw_2000]

.loopH:
xor r5, r5
%rep %1 / 8
- %ifidn %3, pp
+ %ifidn %3, pp
FILTER_H8_W8 m0, m1, m4, m5, m3, m2, [r0 - 3 + r5], [r2 + r5]
%else
FILTER_H8_W8 m0, m1, m4, m5, m3, UNUSED, [r0 - 3 + r5]

%rep (%1 % 8) / 4
FILTER_H8_W4 m0, m1
- %ifidn %3, pp
+ %ifidn %3, pp
pmulhrsw m1, m2
packuswb m1, m1
movd [r2 + r5], m1

%endif
%endmacro

-FILTER_HORIZ_LUMA_AVX2_4xN 8
-FILTER_HORIZ_LUMA_AVX2_4xN 16
+ FILTER_HORIZ_LUMA_AVX2_4xN 8
+ FILTER_HORIZ_LUMA_AVX2_4xN 16

INIT_YMM avx2
cglobal interp_8tap_horiz_pp_8x4, 4, 6, 7

RET
%endmacro

-IPFILTER_LUMA_AVX2_8xN 8, 8
-IPFILTER_LUMA_AVX2_8xN 8, 16
-IPFILTER_LUMA_AVX2_8xN 8, 32
+ IPFILTER_LUMA_AVX2_8xN 8, 8
+ IPFILTER_LUMA_AVX2_8xN 8, 16
+ IPFILTER_LUMA_AVX2_8xN 8, 32

%macro IPFILTER_LUMA_AVX2 2
INIT_YMM avx2

pmaddubsw m5, m1
paddw m4, m5
pmaddwd m4, m7
- vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0
+ vbroadcasti128 m5, [r0 + 8] ; second 8 elements in Row0
pshufb m6, m5, m3
pshufb m5, [tab_Tm]
pmaddubsw m5, m0

pmaddubsw m5, m1
paddw m2, m5
pmaddwd m2, m7
- vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0
+ vbroadcasti128 m5, [r0 + r1 + 8] ; second 8 elements in Row0
pshufb m6, m5, m3
pshufb m5, [tab_Tm]
pmaddubsw m5, m0

jnz .loop
RET

-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_pp_4x4, 4,6,6
mov r4d, r4m


pextrd [r2+r0], xm3, 3
RET

-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_pp_2x4, 4, 6, 3
mov r4d, r4m


pextrw [r2 + r4], xm1, 3
RET

-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6
mov r4d, r4m


IPFILTER_LUMA_AVX2 16, 4
IPFILTER_LUMA_AVX2 16, 8
- IPFILTER_LUMA_AVX2 16, 12
+ IPFILTER_LUMA_AVX2 16, 12
IPFILTER_LUMA_AVX2 16, 16
IPFILTER_LUMA_AVX2 16, 32
IPFILTER_LUMA_AVX2 16, 64

RET

;-----------------------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_ps_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------;
+%macro IPFILTER_CHROMA_HPS_64xN 1
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_64x%1, 4,7,6
+ mov r4d, r4m
+ mov r5d, r5m
+ add r3d, r3d
+
+%ifdef PIC
+ lea r6, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r6 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ vbroadcasti128 m2, [pw_1]
+ vbroadcasti128 m5, [pw_2000]
+ mova m1, [tab_Tm]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+ mov r6d, %1
+ dec r0
+ test r5d, r5d
+ je .loop
+ sub r0 , r1
+ add r6d , 3
+
+.loop
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2], m3
+
+ vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2 + 32], m3
+
+ vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2 + 64], m3
+
+ vbroadcasti128 m3, [r0 + 48] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 56] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2 + 96], m3
+
+ add r2, r3
+ add r0, r1
+ dec r6d
+ jnz .loop
+ RET
+%endmacro
+
+ IPFILTER_CHROMA_HPS_64xN 64
+ IPFILTER_CHROMA_HPS_64xN 32
+ IPFILTER_CHROMA_HPS_64xN 48
+ IPFILTER_CHROMA_HPS_64xN 16
+
+;-----------------------------------------------------------------------------------------------------------------------------
;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------


pshufb m4, m1
pmaddubsw m4, m0
phaddw m4, m4 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
- phaddw m3, m4
+ phaddw m3, m4

vpermd m3, m5, m3 ; m5 don't broken in above
psubw m3, m2

lea r2, [r2 + r3 * 2] ; first loop dst ->5th row(i.e 4)
sub r5d, 2
jg .loop
- jz .end
+ jz .end

; last row
movu xm1, [r0]

%endif
%endmacro ; IPFILTER_LUMA_PS_8xN_AVX2

-IPFILTER_LUMA_PS_8xN_AVX2 4
-IPFILTER_LUMA_PS_8xN_AVX2 8
-IPFILTER_LUMA_PS_8xN_AVX2 16
-IPFILTER_LUMA_PS_8xN_AVX2 32
+ IPFILTER_LUMA_PS_8xN_AVX2 4
+ IPFILTER_LUMA_PS_8xN_AVX2 8
+ IPFILTER_LUMA_PS_8xN_AVX2 16
+ IPFILTER_LUMA_PS_8xN_AVX2 32


%macro IPFILTER_LUMA_PS_16x_AVX2 2

dec r9d
jnz .label

-RET
+ RET
%endif
%endmacro


-IPFILTER_LUMA_PS_16x_AVX2 16 , 16
-IPFILTER_LUMA_PS_16x_AVX2 16 , 8
-IPFILTER_LUMA_PS_16x_AVX2 16 , 12
-IPFILTER_LUMA_PS_16x_AVX2 16 , 4
-IPFILTER_LUMA_PS_16x_AVX2 16 , 32
-IPFILTER_LUMA_PS_16x_AVX2 16 , 64
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 16
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 8
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 12
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 4
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 32
+ IPFILTER_LUMA_PS_16x_AVX2 16 , 64


;--------------------------------------------------------------------------------------------------------------

RET
%endmacro

-IPFILTER_LUMA_PP_W8 8, 4
-IPFILTER_LUMA_PP_W8 8, 8
-IPFILTER_LUMA_PP_W8 8, 16
-IPFILTER_LUMA_PP_W8 8, 32
-IPFILTER_LUMA_PP_W8 16, 4
-IPFILTER_LUMA_PP_W8 16, 8
-IPFILTER_LUMA_PP_W8 16, 12
-IPFILTER_LUMA_PP_W8 16, 16
-IPFILTER_LUMA_PP_W8 16, 32
-IPFILTER_LUMA_PP_W8 16, 64
-IPFILTER_LUMA_PP_W8 24, 32
-IPFILTER_LUMA_PP_W8 32, 8
-IPFILTER_LUMA_PP_W8 32, 16
-IPFILTER_LUMA_PP_W8 32, 24
-IPFILTER_LUMA_PP_W8 32, 32
-IPFILTER_LUMA_PP_W8 32, 64
-IPFILTER_LUMA_PP_W8 48, 64
-IPFILTER_LUMA_PP_W8 64, 16
-IPFILTER_LUMA_PP_W8 64, 32
-IPFILTER_LUMA_PP_W8 64, 48
-IPFILTER_LUMA_PP_W8 64, 64
+ IPFILTER_LUMA_PP_W8 8, 4
+ IPFILTER_LUMA_PP_W8 8, 8
+ IPFILTER_LUMA_PP_W8 8, 16
+ IPFILTER_LUMA_PP_W8 8, 32
+ IPFILTER_LUMA_PP_W8 16, 4
+ IPFILTER_LUMA_PP_W8 16, 8
+ IPFILTER_LUMA_PP_W8 16, 12
+ IPFILTER_LUMA_PP_W8 16, 16
+ IPFILTER_LUMA_PP_W8 16, 32
+ IPFILTER_LUMA_PP_W8 16, 64
+ IPFILTER_LUMA_PP_W8 24, 32
+ IPFILTER_LUMA_PP_W8 32, 8
+ IPFILTER_LUMA_PP_W8 32, 16
+ IPFILTER_LUMA_PP_W8 32, 24
+ IPFILTER_LUMA_PP_W8 32, 32
+ IPFILTER_LUMA_PP_W8 32, 64
+ IPFILTER_LUMA_PP_W8 48, 64
+ IPFILTER_LUMA_PP_W8 64, 16
+ IPFILTER_LUMA_PP_W8 64, 32
+ IPFILTER_LUMA_PP_W8 64, 48
+ IPFILTER_LUMA_PP_W8 64, 64

;----------------------------------------------------------------------------------------------------------------------------
; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt)


; Round and Saturate
%macro FILTER_HV8_END 4 ; output in [1, 3]
- paddd %1, [tab_c_526336]
- paddd %2, [tab_c_526336]
- paddd %3, [tab_c_526336]
- paddd %4, [tab_c_526336]
+ paddd %1, [pd_526336]
+ paddd %2, [pd_526336]
+ paddd %3, [pd_526336]
+ paddd %4, [pd_526336]
psrad %1, 12
psrad %2, 12
psrad %3, 12

;-----------------------------------------------------------------------------
; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
;-----------------------------------------------------------------------------
-INIT_XMM sse4
+INIT_XMM ssse3
cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16
%define coef m7
%define stk_buf rsp

RET

;-----------------------------------------------------------------------------
+; void interp_8tap_hv_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
+;-----------------------------------------------------------------------------
+INIT_XMM sse3
+cglobal interp_8tap_hv_pp_8x8, 4, 7, 8, 0-15*16
+ mov r4d, r4m
+ mov r5d, r5m
+ add r4d, r4d
+ pxor m6, m6
+
+%ifdef PIC
+ lea r6, [tabw_LumaCoeff]
+ mova m3, [r6 + r4 * 8]
+%else
+ mova m3, [tabw_LumaCoeff + r4 * 8]
+%endif
+
+ ; move to row -3
+ lea r6, [r1 + r1 * 2]
+ sub r0, r6
+
+ mov r4, rsp
+
+%assign x 0 ;needed for FILTER_H8_W8_sse2 macro
+%assign y 1
+%rep 15
+ FILTER_H8_W8_sse2
+ psubw m1, [pw_2000]
+ mova [r4], m1
+
+%if y < 15
+ add r0, r1
+ add r4, 16
+%endif
+%assign y y+1
+%endrep
+
+ ; ready to phase V
+ ; Here all of mN is free
+
+ ; load coeff table
+ shl r5, 6
+ lea r6, [tab_LumaCoeffV]
+ lea r5, [r5 + r6]
+
+ ; load intermedia buffer
+ mov r0, rsp
+
+ ; register mapping
+ ; r0 - src
+ ; r5 - coeff
+
+ ; let's go
+%assign y 1
+%rep 4
+ FILTER_HV8_START m1, m2, m3, m4, m0, 0, 0
+ FILTER_HV8_MID m6, m2, m3, m4, m0, m1, m7, m5, 3, 1
+ FILTER_HV8_MID m5, m6, m3, m4, m0, m1, m7, m2, 5, 2
+ FILTER_HV8_MID m6, m5, m3, m4, m0, m1, m7, m2, 7, 3
+ FILTER_HV8_END m3, m0, m4, m1
+
+ movh [r2], m3
+ movhps [r2 + r3], m3
+
+%if y < 4
+ lea r0, [r0 + 16 * 2]
+ lea r2, [r2 + r3 * 2]
+%endif
+%assign y y+1
+%endrep
+ RET
+
+;-----------------------------------------------------------------------------
;void interp_4tap_vert_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal interp_4tap_vert_pp_2x4, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif
-lea r4, [r1 * 3]
-lea r5, [r0 + 4 * r1]
-pshufb m0, [tab_Cm]
-mova m1, [pw_512]
+ lea r4, [r1 * 3]
+ lea r5, [r0 + 4 * r1]
+ pshufb m0, [tab_Cm]
+ mova m1, [pw_512]

-movd m2, [r0]
-movd m3, [r0 + r1]
-movd m4, [r0 + 2 * r1]
-movd m5, [r0 + r4]
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ movd m5, [r0 + r4]

-punpcklbw m2, m3
-punpcklbw m6, m4, m5
-punpcklbw m2, m6
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6

-pmaddubsw m2, m0
+ pmaddubsw m2, m0

-movd m6, [r5]
+ movd m6, [r5]

-punpcklbw m3, m4
-punpcklbw m7, m5, m6
-punpcklbw m3, m7
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7

-pmaddubsw m3, m0
+ pmaddubsw m3, m0

-phaddw m2, m3
+ phaddw m2, m3

-pmulhrsw m2, m1
+ pmulhrsw m2, m1

-movd m7, [r5 + r1]
+ movd m7, [r5 + r1]

-punpcklbw m4, m5
-punpcklbw m3, m6, m7
-punpcklbw m4, m3
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m7
+ punpcklbw m4, m3

-pmaddubsw m4, m0
+ pmaddubsw m4, m0

-movd m3, [r5 + 2 * r1]
+ movd m3, [r5 + 2 * r1]

-punpcklbw m5, m6
-punpcklbw m7, m3
-punpcklbw m5, m7
+ punpcklbw m5, m6
+ punpcklbw m7, m3
+ punpcklbw m5, m7

-pmaddubsw m5, m0
+ pmaddubsw m5, m0

-phaddw m4, m5
+ phaddw m4, m5

-pmulhrsw m4, m1
-packuswb m2, m4
+ pmulhrsw m4, m1
+ packuswb m2, m4

-pextrw [r2], m2, 0
-pextrw [r2 + r3], m2, 2
-lea r2, [r2 + 2 * r3]
-pextrw [r2], m2, 4
-pextrw [r2 + r3], m2, 6
+ pextrw [r2], m2, 0
+ pextrw [r2 + r3], m2, 2
+ lea r2, [r2 + 2 * r3]
+ pextrw [r2], m2, 4
+ pextrw [r2 + r3], m2, 6

-RET
+ RET

%macro FILTER_VER_CHROMA_AVX2_2x4 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_2x4 pp
-FILTER_VER_CHROMA_AVX2_2x4 ps
+ FILTER_VER_CHROMA_AVX2_2x4 pp
+ FILTER_VER_CHROMA_AVX2_2x4 ps

%macro FILTER_VER_CHROMA_AVX2_2x8 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_2x8 pp
-FILTER_VER_CHROMA_AVX2_2x8 ps
+ FILTER_VER_CHROMA_AVX2_2x8 pp
+ FILTER_VER_CHROMA_AVX2_2x8 ps

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_2x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m0, [tab_Cm]
+ pshufb m0, [tab_Cm]

-mova m1, [pw_512]
+ mova m1, [pw_512]

-mov r4d, %2
-lea r5, [3 * r1]
+ mov r4d, %2
+ lea r5, [3 * r1]

.loop:
-movd m2, [r0]
-movd m3, [r0 + r1]
-movd m4, [r0 + 2 * r1]
-movd m5, [r0 + r5]
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ movd m5, [r0 + r5]

-punpcklbw m2, m3
-punpcklbw m6, m4, m5
-punpcklbw m2, m6
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6

-pmaddubsw m2, m0
+ pmaddubsw m2, m0

-lea r0, [r0 + 4 * r1]
-movd m6, [r0]
+ lea r0, [r0 + 4 * r1]
+ movd m6, [r0]

-punpcklbw m3, m4
-punpcklbw m7, m5, m6
-punpcklbw m3, m7
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7

-pmaddubsw m3, m0
+ pmaddubsw m3, m0

-phaddw m2, m3
+ phaddw m2, m3

-pmulhrsw m2, m1
+ pmulhrsw m2, m1

-movd m7, [r0 + r1]
+ movd m7, [r0 + r1]

-punpcklbw m4, m5
-punpcklbw m3, m6, m7
-punpcklbw m4, m3
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m7
+ punpcklbw m4, m3

-pmaddubsw m4, m0
+ pmaddubsw m4, m0

-movd m3, [r0 + 2 * r1]
+ movd m3, [r0 + 2 * r1]

-punpcklbw m5, m6
-punpcklbw m7, m3
-punpcklbw m5, m7
+ punpcklbw m5, m6
+ punpcklbw m7, m3
+ punpcklbw m5, m7

-pmaddubsw m5, m0
+ pmaddubsw m5, m0

-phaddw m4, m5
+ phaddw m4, m5

-pmulhrsw m4, m1
-packuswb m2, m4
+ pmulhrsw m4, m1
+ packuswb m2, m4

-pextrw [r2], m2, 0
-pextrw [r2 + r3], m2, 2
-lea r2, [r2 + 2 * r3]
-pextrw [r2], m2, 4
-pextrw [r2 + r3], m2, 6
+ pextrw [r2], m2, 0
+ pextrw [r2 + r3], m2, 2
+ lea r2, [r2 + 2 * r3]
+ pextrw [r2], m2, 4
+ pextrw [r2 + r3], m2, 6

-lea r2, [r2 + 2 * r3]
+ lea r2, [r2 + 2 * r3]

-sub r4, 4
-jnz .loop
-RET
+ sub r4, 4
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W2_H4 2, 8
+ FILTER_V4_W2_H4 2, 8

-FILTER_V4_W2_H4 2, 16
+ FILTER_V4_W2_H4 2, 16

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_4x2, 4, 6, 6

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m0, [tab_Cm]
-lea r5, [r0 + 2 * r1]
+ pshufb m0, [tab_Cm]
+ lea r5, [r0 + 2 * r1]

-movd m2, [r0]
-movd m3, [r0 + r1]
-movd m4, [r5]
-movd m5, [r5 + r1]
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r5]
+ movd m5, [r5 + r1]

-punpcklbw m2, m3
-punpcklbw m1, m4, m5
-punpcklbw m2, m1
+ punpcklbw m2, m3
+ punpcklbw m1, m4, m5
+ punpcklbw m2, m1

-pmaddubsw m2, m0
+ pmaddubsw m2, m0

-movd m1, [r0 + 4 * r1]
+ movd m1, [r0 + 4 * r1]

-punpcklbw m3, m4
-punpcklbw m5, m1
-punpcklbw m3, m5
+ punpcklbw m3, m4
+ punpcklbw m5, m1
+ punpcklbw m3, m5

-pmaddubsw m3, m0
+ pmaddubsw m3, m0

-phaddw m2, m3
+ phaddw m2, m3

-pmulhrsw m2, [pw_512]
-packuswb m2, m2
-movd [r2], m2
-pextrd [r2 + r3], m2, 1
+ pmulhrsw m2, [pw_512]
+ packuswb m2, m2
+ movd [r2], m2
+ pextrd [r2 + r3], m2, 1

-RET
+ RET

%macro FILTER_VER_CHROMA_AVX2_4x2 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_4x2 pp
-FILTER_VER_CHROMA_AVX2_4x2 ps
+ FILTER_VER_CHROMA_AVX2_4x2 pp
+ FILTER_VER_CHROMA_AVX2_4x2 ps

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_4x4, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m0, [tab_Cm]
-mova m1, [pw_512]
-lea r5, [r0 + 4 * r1]
-lea r4, [r1 * 3]
+ pshufb m0, [tab_Cm]
+ mova m1, [pw_512]
+ lea r5, [r0 + 4 * r1]
+ lea r4, [r1 * 3]

-movd m2, [r0]
-movd m3, [r0 + r1]
-movd m4, [r0 + 2 * r1]
-movd m5, [r0 + r4]
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ movd m5, [r0 + r4]

-punpcklbw m2, m3
-punpcklbw m6, m4, m5
-punpcklbw m2, m6
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6

-pmaddubsw m2, m0
+ pmaddubsw m2, m0

-movd m6, [r5]
+ movd m6, [r5]

-punpcklbw m3, m4
-punpcklbw m7, m5, m6
-punpcklbw m3, m7
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7

-pmaddubsw m3, m0
+ pmaddubsw m3, m0

-phaddw m2, m3
+ phaddw m2, m3

-pmulhrsw m2, m1
+ pmulhrsw m2, m1

-movd m7, [r5 + r1]
+ movd m7, [r5 + r1]

-punpcklbw m4, m5
-punpcklbw m3, m6, m7
-punpcklbw m4, m3
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m7
+ punpcklbw m4, m3

-pmaddubsw m4, m0
+ pmaddubsw m4, m0

-movd m3, [r5 + 2 * r1]
+ movd m3, [r5 + 2 * r1]

-punpcklbw m5, m6
-punpcklbw m7, m3
-punpcklbw m5, m7
+ punpcklbw m5, m6
+ punpcklbw m7, m3
+ punpcklbw m5, m7

-pmaddubsw m5, m0
+ pmaddubsw m5, m0

-phaddw m4, m5
+ phaddw m4, m5

-pmulhrsw m4, m1
+ pmulhrsw m4, m1

-packuswb m2, m4
-movd [r2], m2
-pextrd [r2 + r3], m2, 1
-lea r2, [r2 + 2 * r3]
-pextrd [r2], m2, 2
-pextrd [r2 + r3], m2, 3
-RET
+ packuswb m2, m4
+ movd [r2], m2
+ pextrd [r2 + r3], m2, 1
+ lea r2, [r2 + 2 * r3]
+ pextrd [r2], m2, 2
+ pextrd [r2 + r3], m2, 3
+ RET
%macro FILTER_VER_CHROMA_AVX2_4x4 1
INIT_YMM avx2
cglobal interp_4tap_vert_%1_4x4, 4, 6, 3

%endif
RET
%endmacro
-FILTER_VER_CHROMA_AVX2_4x4 pp
-FILTER_VER_CHROMA_AVX2_4x4 ps
+ FILTER_VER_CHROMA_AVX2_4x4 pp
+ FILTER_VER_CHROMA_AVX2_4x4 ps

%macro FILTER_VER_CHROMA_AVX2_4x8 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_4x8 pp
-FILTER_VER_CHROMA_AVX2_4x8 ps
+ FILTER_VER_CHROMA_AVX2_4x8 pp
+ FILTER_VER_CHROMA_AVX2_4x8 ps

%macro FILTER_VER_CHROMA_AVX2_4x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_AVX2_4x16 pp
-FILTER_VER_CHROMA_AVX2_4x16 ps
+ FILTER_VER_CHROMA_AVX2_4x16 pp
+ FILTER_VER_CHROMA_AVX2_4x16 ps

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m0, [tab_Cm]
+ pshufb m0, [tab_Cm]

-mova m1, [pw_512]
+ mova m1, [pw_512]

-mov r4d, %2
+ mov r4d, %2

-lea r5, [3 * r1]
+ lea r5, [3 * r1]

.loop:
-movd m2, [r0]
-movd m3, [r0 + r1]
-movd m4, [r0 + 2 * r1]
-movd m5, [r0 + r5]
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ movd m4, [r0 + 2 * r1]
+ movd m5, [r0 + r5]

-punpcklbw m2, m3
-punpcklbw m6, m4, m5
-punpcklbw m2, m6
+ punpcklbw m2, m3
+ punpcklbw m6, m4, m5
+ punpcklbw m2, m6

-pmaddubsw m2, m0
+ pmaddubsw m2, m0

-lea r0, [r0 + 4 * r1]
-movd m6, [r0]
+ lea r0, [r0 + 4 * r1]
+ movd m6, [r0]

-punpcklbw m3, m4
-punpcklbw m7, m5, m6
-punpcklbw m3, m7
+ punpcklbw m3, m4
+ punpcklbw m7, m5, m6
+ punpcklbw m3, m7

-pmaddubsw m3, m0
+ pmaddubsw m3, m0

-phaddw m2, m3
+ phaddw m2, m3

-pmulhrsw m2, m1
+ pmulhrsw m2, m1

-movd m7, [r0 + r1]
+ movd m7, [r0 + r1]

-punpcklbw m4, m5
-punpcklbw m3, m6, m7
-punpcklbw m4, m3
+ punpcklbw m4, m5
+ punpcklbw m3, m6, m7
+ punpcklbw m4, m3

-pmaddubsw m4, m0
+ pmaddubsw m4, m0

-movd m3, [r0 + 2 * r1]
+ movd m3, [r0 + 2 * r1]

-punpcklbw m5, m6
-punpcklbw m7, m3
-punpcklbw m5, m7
+ punpcklbw m5, m6
+ punpcklbw m7, m3
+ punpcklbw m5, m7

-pmaddubsw m5, m0
+ pmaddubsw m5, m0

-phaddw m4, m5
+ phaddw m4, m5

-pmulhrsw m4, m1
-packuswb m2, m4
-movd [r2], m2
-pextrd [r2 + r3], m2, 1
-lea r2, [r2 + 2 * r3]
-pextrd [r2], m2, 2
-pextrd [r2 + r3], m2, 3
+ pmulhrsw m4, m1
+ packuswb m2, m4
+ movd [r2], m2
+ pextrd [r2 + r3], m2, 1
+ lea r2, [r2 + 2 * r3]
+ pextrd [r2], m2, 2
+ pextrd [r2 + r3], m2, 3

-lea r2, [r2 + 2 * r3]
+ lea r2, [r2 + 2 * r3]

-sub r4, 4
-jnz .loop
-RET
+ sub r4, 4
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W4_H4 4, 8
-FILTER_V4_W4_H4 4, 16
+ FILTER_V4_W4_H4 4, 8
+ FILTER_V4_W4_H4 4, 16

-FILTER_V4_W4_H4 4, 32
+ FILTER_V4_W4_H4 4, 32

%macro FILTER_V4_W8_H2 0
-punpcklbw m1, m2
-punpcklbw m7, m3, m0
+ punpcklbw m1, m2
+ punpcklbw m7, m3, m0

-pmaddubsw m1, m6
-pmaddubsw m7, m5
+ pmaddubsw m1, m6
+ pmaddubsw m7, m5

-paddw m1, m7
+ paddw m1, m7

-pmulhrsw m1, m4
-packuswb m1, m1
+ pmulhrsw m1, m4
+ packuswb m1, m1
%endmacro

%macro FILTER_V4_W8_H3 0
-punpcklbw m2, m3
-punpcklbw m7, m0, m1
+ punpcklbw m2, m3
+ punpcklbw m7, m0, m1

-pmaddubsw m2, m6
-pmaddubsw m7, m5
+ pmaddubsw m2, m6
+ pmaddubsw m7, m5

-paddw m2, m7
+ paddw m2, m7

-pmulhrsw m2, m4
-packuswb m2, m2
+ pmulhrsw m2, m4
+ packuswb m2, m2
%endmacro

%macro FILTER_V4_W8_H4 0
-punpcklbw m3, m0
-punpcklbw m7, m1, m2
+ punpcklbw m3, m0
+ punpcklbw m7, m1, m2

-pmaddubsw m3, m6
-pmaddubsw m7, m5
+ pmaddubsw m3, m6
+ pmaddubsw m7, m5

-paddw m3, m7
+ paddw m3, m7

-pmulhrsw m3, m4
-packuswb m3, m3
+ pmulhrsw m3, m4
+ packuswb m3, m3
%endmacro

%macro FILTER_V4_W8_H5 0
-punpcklbw m0, m1
-punpcklbw m7, m2, m3
+ punpcklbw m0, m1
+ punpcklbw m7, m2, m3

-pmaddubsw m0, m6
-pmaddubsw m7, m5
+ pmaddubsw m0, m6
+ pmaddubsw m7, m5

-paddw m0, m7
+ paddw m0, m7

-pmulhrsw m0, m4
-packuswb m0, m0
+ pmulhrsw m0, m4
+ packuswb m0, m0
%endmacro

%macro FILTER_V4_W8_8x2 2
-FILTER_V4_W8 %1, %2
-movq m0, [r0 + 4 * r1]
+ FILTER_V4_W8 %1, %2
+ movq m0, [r0 + 4 * r1]

-FILTER_V4_W8_H2
+ FILTER_V4_W8_H2

-movh [r2 + r3], m1
+ movh [r2 + r3], m1
%endmacro

%macro FILTER_V4_W8_8x4 2
-FILTER_V4_W8_8x2 %1, %2
+ FILTER_V4_W8_8x2 %1, %2
;8x3
-lea r6, [r0 + 4 * r1]
-movq m1, [r6 + r1]
+ lea r6, [r0 + 4 * r1]
+ movq m1, [r6 + r1]

-FILTER_V4_W8_H3
+ FILTER_V4_W8_H3

-movh [r2 + 2 * r3], m2
+ movh [r2 + 2 * r3], m2

;8x4
-movq m2, [r6 + 2 * r1]
+ movq m2, [r6 + 2 * r1]

-FILTER_V4_W8_H4
+ FILTER_V4_W8_H4

-lea r5, [r2 + 2 * r3]
-movh [r5 + r3], m3
+ lea r5, [r2 + 2 * r3]
+ movh [r5 + r3], m3
%endmacro

%macro FILTER_V4_W8_8x6 2
-FILTER_V4_W8_8x4 %1, %2
+ FILTER_V4_W8_8x4 %1, %2
;8x5
-lea r6, [r6 + 2 * r1]
-movq m3, [r6 + r1]
+ lea r6, [r6 + 2 * r1]
+ movq m3, [r6 + r1]

-FILTER_V4_W8_H5
+ FILTER_V4_W8_H5

-movh [r2 + 4 * r3], m0
+ movh [r2 + 4 * r3], m0

;8x6
-movq m0, [r0 + 8 * r1]
+ movq m0, [r0 + 8 * r1]

-FILTER_V4_W8_H2
+ FILTER_V4_W8_H2

-lea r5, [r2 + 4 * r3]
-movh [r5 + r3], m1
+ lea r5, [r2 + 4 * r3]
+ movh [r5 + r3], m1
%endmacro

;-----------------------------------------------------------------------------

INIT_XMM sse4
cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8

-mov r4d, r4m
+ mov r4d, r4m

-sub r0, r1
-movq m0, [r0]
-movq m1, [r0 + r1]
-movq m2, [r0 + 2 * r1]
-lea r5, [r0 + 2 * r1]
-movq m3, [r5 + r1]
+ sub r0, r1
+ movq m0, [r0]
+ movq m1, [r0 + r1]
+ movq m2, [r0 + 2 * r1]
+ lea r5, [r0 + 2 * r1]
+ movq m3, [r5 + r1]

-punpcklbw m0, m1
-punpcklbw m4, m2, m3
+ punpcklbw m0, m1
+ punpcklbw m4, m2, m3

%ifdef PIC
-lea r6, [tab_ChromaCoeff]
-movd m5, [r6 + r4 * 4]
+ lea r6, [tab_ChromaCoeff]
+ movd m5, [r6 + r4 * 4]
%else
-movd m5, [tab_ChromaCoeff + r4 * 4]
+ movd m5, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m6, m5, [tab_Vm]
-pmaddubsw m0, m6
+ pshufb m6, m5, [tab_Vm]
+ pmaddubsw m0, m6

-pshufb m5, [tab_Vm + 16]
-pmaddubsw m4, m5
+ pshufb m5, [tab_Vm + 16]
+ pmaddubsw m4, m5

-paddw m0, m4
+ paddw m0, m4

-mova m4, [pw_512]
+ mova m4, [pw_512]

-pmulhrsw m0, m4
-packuswb m0, m0
-movh [r2], m0
+ pmulhrsw m0, m4
+ packuswb m0, m0
+ movh [r2], m0
%endmacro

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_8x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x2 8, 2
+ FILTER_V4_W8_8x2 8, 2

-RET
+ RET

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x4 8, 4
+ FILTER_V4_W8_8x4 8, 4

-RET
+ RET

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_8x6(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
-FILTER_V4_W8_8x6 8, 6
+ FILTER_V4_W8_8x6 8, 6

-RET
+ RET

;-------------------------------------------------------------------------------------------------------------
; void interp_4tap_vert_ps_4x2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_ps_4x2, 4, 6, 6

-mov r4d, r4m
3850
-sub r0, r1
3851
-add r3d, r3d
3852
+ mov r4d, r4m
3853
+ sub r0, r1
3854
+ add r3d, r3d
3855
3856
%ifdef PIC
3857
-lea r5, [tab_ChromaCoeff]
3858
-movd m0, [r5 + r4 * 4]
3859
+ lea r5, [tab_ChromaCoeff]
3860
+ movd m0, [r5 + r4 * 4]
3861
%else
3862
-movd m0, [tab_ChromaCoeff + r4 * 4]
3863
+ movd m0, [tab_ChromaCoeff + r4 * 4]
3864
%endif
3865
3866
-pshufb m0, [tab_Cm]
3867
+ pshufb m0, [tab_Cm]
3868
3869
-movd m2, [r0]
3870
-movd m3, [r0 + r1]
3871
-lea r5, [r0 + 2 * r1]
3872
-movd m4, [r5]
3873
-movd m5, [r5 + r1]
3874
+ movd m2, [r0]
3875
+ movd m3, [r0 + r1]
3876
+ lea r5, [r0 + 2 * r1]
3877
+ movd m4, [r5]
3878
+ movd m5, [r5 + r1]
3879
3880
-punpcklbw m2, m3
3881
-punpcklbw m1, m4, m5
3882
-punpcklbw m2, m1
3883
+ punpcklbw m2, m3
3884
+ punpcklbw m1, m4, m5
3885
+ punpcklbw m2, m1
3886
3887
-pmaddubsw m2, m0
3888
+ pmaddubsw m2, m0
3889
3890
-movd m1, [r0 + 4 * r1]
3891
+ movd m1, [r0 + 4 * r1]
3892
3893
-punpcklbw m3, m4
3894
-punpcklbw m5, m1
3895
-punpcklbw m3, m5
3896
+ punpcklbw m3, m4
3897
+ punpcklbw m5, m1
3898
+ punpcklbw m3, m5
3899
3900
-pmaddubsw m3, m0
3901
+ pmaddubsw m3, m0
3902
3903
-phaddw m2, m3
3904
+ phaddw m2, m3
3905
3906
-psubw m2, [pw_2000]
3907
-movh [r2], m2
3908
-movhps [r2 + r3], m2
3909
+ psubw m2, [pw_2000]
3910
+ movh [r2], m2
3911
+ movhps [r2 + r3], m2
3912
3913
-RET
3914
+ RET
3915
3916
;-------------------------------------------------------------------------------------------------------------
3917
; void interp_4tap_vert_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3918
3919
RET
3920
%endmacro
3921
3922
-FILTER_V_PS_W4_H4 4, 8
3923
-FILTER_V_PS_W4_H4 4, 16
3924
+ FILTER_V_PS_W4_H4 4, 8
3925
+ FILTER_V_PS_W4_H4 4, 16
3926
3927
-FILTER_V_PS_W4_H4 4, 32
3928
+ FILTER_V_PS_W4_H4 4, 32
3929
3930
;--------------------------------------------------------------------------------------------------------------
3931
; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3932
3933
RET
3934
%endmacro
3935
3936
-FILTER_V_PS_W8_H8_H16_H2 8, 2
3937
-FILTER_V_PS_W8_H8_H16_H2 8, 4
3938
-FILTER_V_PS_W8_H8_H16_H2 8, 6
3939
+ FILTER_V_PS_W8_H8_H16_H2 8, 2
3940
+ FILTER_V_PS_W8_H8_H16_H2 8, 4
3941
+ FILTER_V_PS_W8_H8_H16_H2 8, 6
3942
3943
-FILTER_V_PS_W8_H8_H16_H2 8, 12
3944
-FILTER_V_PS_W8_H8_H16_H2 8, 64
3945
+ FILTER_V_PS_W8_H8_H16_H2 8, 12
3946
+ FILTER_V_PS_W8_H8_H16_H2 8, 64
3947
3948
;--------------------------------------------------------------------------------------------------------------
3949
; void interp_4tap_vert_ps_8x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3950
3951
RET
3952
%endmacro
3953
3954
-FILTER_V_PS_W8_H8_H16_H32 8, 8
3955
-FILTER_V_PS_W8_H8_H16_H32 8, 16
3956
-FILTER_V_PS_W8_H8_H16_H32 8, 32
3957
+ FILTER_V_PS_W8_H8_H16_H32 8, 8
3958
+ FILTER_V_PS_W8_H8_H16_H32 8, 16
3959
+ FILTER_V_PS_W8_H8_H16_H32 8, 32
3960
3961
;------------------------------------------------------------------------------------------------------------
3962
;void interp_4tap_vert_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3963
3964
RET
3965
%endmacro
3966
3967
-FILTER_V_PS_W6 6, 8
3968
-FILTER_V_PS_W6 6, 16
3969
+ FILTER_V_PS_W6 6, 8
3970
+ FILTER_V_PS_W6 6, 16
3971
3972
;---------------------------------------------------------------------------------------------------------------
3973
; void interp_4tap_vert_ps_12x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3974
3975
RET
3976
%endmacro
3977
3978
-FILTER_V_PS_W12 12, 16
3979
-FILTER_V_PS_W12 12, 32
3980
+ FILTER_V_PS_W12 12, 16
3981
+ FILTER_V_PS_W12 12, 32
3982
3983
;---------------------------------------------------------------------------------------------------------------
3984
; void interp_4tap_vert_ps_16x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
3985
3986
RET
3987
%endmacro
3988
3989
-FILTER_V_PS_W16 16, 4
3990
-FILTER_V_PS_W16 16, 8
3991
-FILTER_V_PS_W16 16, 12
3992
-FILTER_V_PS_W16 16, 16
3993
-FILTER_V_PS_W16 16, 32
3994
+ FILTER_V_PS_W16 16, 4
3995
+ FILTER_V_PS_W16 16, 8
3996
+ FILTER_V_PS_W16 16, 12
3997
+ FILTER_V_PS_W16 16, 16
3998
+ FILTER_V_PS_W16 16, 32
3999
4000
-FILTER_V_PS_W16 16, 24
4001
-FILTER_V_PS_W16 16, 64
4002
+ FILTER_V_PS_W16 16, 24
4003
+ FILTER_V_PS_W16 16, 64
4004
4005
;--------------------------------------------------------------------------------------------------------------
4006
;void interp_4tap_vert_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
4007
4008
RET
4009
%endmacro
4010
4011
-FILTER_V4_PS_W24 24, 32
4012
+ FILTER_V4_PS_W24 24, 32
4013
4014
-FILTER_V4_PS_W24 24, 64
4015
+ FILTER_V4_PS_W24 24, 64
4016
4017
;---------------------------------------------------------------------------------------------------------------
4018
; void interp_4tap_vert_ps_32x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
4019
4020
RET
4021
%endmacro
4022
4023
-FILTER_V_PS_W32 32, 8
4024
-FILTER_V_PS_W32 32, 16
4025
-FILTER_V_PS_W32 32, 24
4026
-FILTER_V_PS_W32 32, 32
4027
+ FILTER_V_PS_W32 32, 8
4028
+ FILTER_V_PS_W32 32, 16
4029
+ FILTER_V_PS_W32 32, 24
4030
+ FILTER_V_PS_W32 32, 32
4031
4032
-FILTER_V_PS_W32 32, 48
4033
-FILTER_V_PS_W32 32, 64
4034
+ FILTER_V_PS_W32 32, 48
4035
+ FILTER_V_PS_W32 32, 64
4036
4037
;-----------------------------------------------------------------------------
4038
; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
4039
4040
INIT_XMM sse4
4041
cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8
4042
4043
-mov r4d, r4m
4044
-sub r0, r1
4045
+ mov r4d, r4m
4046
+ sub r0, r1
4047
4048
%ifdef PIC
4049
-lea r5, [tab_ChromaCoeff]
4050
-movd m5, [r5 + r4 * 4]
4051
+ lea r5, [tab_ChromaCoeff]
4052
+ movd m5, [r5 + r4 * 4]
4053
%else
4054
-movd m5, [tab_ChromaCoeff + r4 * 4]
4055
+ movd m5, [tab_ChromaCoeff + r4 * 4]
4056
%endif
4057
4058
-pshufb m6, m5, [tab_Vm]
4059
-pshufb m5, [tab_Vm + 16]
4060
-mova m4, [pw_512]
4061
-lea r5, [r1 * 3]
4062
+ pshufb m6, m5, [tab_Vm]
4063
+ pshufb m5, [tab_Vm + 16]
4064
+ mova m4, [pw_512]
4065
+ lea r5, [r1 * 3]
4066
4067
-mov r4d, %2
4068
+ mov r4d, %2
4069
4070
.loop:
4071
-movq m0, [r0]
4072
-movq m1, [r0 + r1]
4073
-movq m2, [r0 + 2 * r1]
4074
-movq m3, [r0 + r5]
4075
+ movq m0, [r0]
4076
+ movq m1, [r0 + r1]
4077
+ movq m2, [r0 + 2 * r1]
4078
+ movq m3, [r0 + r5]
4079
4080
-punpcklbw m0, m1
4081
-punpcklbw m1, m2
4082
-punpcklbw m2, m3
4083
+ punpcklbw m0, m1
4084
+ punpcklbw m1, m2
4085
+ punpcklbw m2, m3
4086
4087
-pmaddubsw m0, m6
4088
-pmaddubsw m7, m2, m5
4089
+ pmaddubsw m0, m6
4090
+ pmaddubsw m7, m2, m5
4091
4092
-paddw m0, m7
4093
+ paddw m0, m7
4094
4095
-pmulhrsw m0, m4
4096
-packuswb m0, m0
4097
-movh [r2], m0
4098
+ pmulhrsw m0, m4
4099
+ packuswb m0, m0
4100
+ movh [r2], m0
4101
4102
-lea r0, [r0 + 4 * r1]
4103
-movq m0, [r0]
4104
+ lea r0, [r0 + 4 * r1]
4105
+ movq m0, [r0]
4106
4107
-punpcklbw m3, m0
4108
+ punpcklbw m3, m0
4109
4110
-pmaddubsw m1, m6
4111
-pmaddubsw m7, m3, m5
4112
+ pmaddubsw m1, m6
4113
+ pmaddubsw m7, m3, m5
4114
4115
-paddw m1, m7
4116
+ paddw m1, m7
4117
4118
-pmulhrsw m1, m4
4119
-packuswb m1, m1
4120
-movh [r2 + r3], m1
4121
+ pmulhrsw m1, m4
4122
+ packuswb m1, m1
4123
+ movh [r2 + r3], m1
4124
4125
-movq m1, [r0 + r1]
4126
+ movq m1, [r0 + r1]
4127
4128
-punpcklbw m0, m1
4129
+ punpcklbw m0, m1
4130
4131
-pmaddubsw m2, m6
4132
-pmaddubsw m0, m5
4133
+ pmaddubsw m2, m6
4134
+ pmaddubsw m0, m5
4135
4136
-paddw m2, m0
4137
+ paddw m2, m0
4138
4139
-pmulhrsw m2, m4
4140
+ pmulhrsw m2, m4
4141
4142
-movq m7, [r0 + 2 * r1]
4143
-punpcklbw m1, m7
4144
+ movq m7, [r0 + 2 * r1]
4145
+ punpcklbw m1, m7
4146
4147
-pmaddubsw m3, m6
4148
-pmaddubsw m1, m5
4149
+ pmaddubsw m3, m6
4150
+ pmaddubsw m1, m5
4151
4152
-paddw m3, m1
4153
+ paddw m3, m1
4154
4155
-pmulhrsw m3, m4
4156
-packuswb m2, m3
4157
+ pmulhrsw m3, m4
4158
+ packuswb m2, m3
4159
4160
-lea r2, [r2 + 2 * r3]
4161
-movh [r2], m2
4162
-movhps [r2 + r3], m2
4163
+ lea r2, [r2 + 2 * r3]
4164
+ movh [r2], m2
4165
+ movhps [r2 + r3], m2
4166
4167
-lea r2, [r2 + 2 * r3]
4168
+ lea r2, [r2 + 2 * r3]
4169
4170
-sub r4, 4
4171
-jnz .loop
4172
-RET
4173
+ sub r4, 4
4174
+ jnz .loop
4175
+ RET
4176
%endmacro
4177
4178
-FILTER_V4_W8_H8_H16_H32 8, 8
4179
-FILTER_V4_W8_H8_H16_H32 8, 16
4180
-FILTER_V4_W8_H8_H16_H32 8, 32
4181
+ FILTER_V4_W8_H8_H16_H32 8, 8
4182
+ FILTER_V4_W8_H8_H16_H32 8, 16
4183
+ FILTER_V4_W8_H8_H16_H32 8, 32
4184
4185
-FILTER_V4_W8_H8_H16_H32 8, 12
4186
-FILTER_V4_W8_H8_H16_H32 8, 64
4187
+ FILTER_V4_W8_H8_H16_H32 8, 12
4188
+ FILTER_V4_W8_H8_H16_H32 8, 64
4189
%macro PROCESS_CHROMA_AVX2_W8_8R 0
movq xm1, [r0] ; m1 = row 0

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x8 pp
-FILTER_VER_CHROMA_AVX2_8x8 ps
+ FILTER_VER_CHROMA_AVX2_8x8 pp
+ FILTER_VER_CHROMA_AVX2_8x8 ps

%macro FILTER_VER_CHROMA_AVX2_8x6 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x6 pp
-FILTER_VER_CHROMA_AVX2_8x6 ps
+ FILTER_VER_CHROMA_AVX2_8x6 pp
+ FILTER_VER_CHROMA_AVX2_8x6 ps

%macro PROCESS_CHROMA_AVX2_W8_16R 1
movq xm1, [r0] ; m1 = row 0

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x16 pp
-FILTER_VER_CHROMA_AVX2_8x16 ps
+ FILTER_VER_CHROMA_AVX2_8x16 pp
+ FILTER_VER_CHROMA_AVX2_8x16 ps
+
+%macro FILTER_VER_CHROMA_AVX2_8x12 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_8x12, 4, 7, 8
+ mov r4d, r4m
+ shl r4d, 6
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffVer_32]
+ add r5, r4
+%else
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+ lea r4, [r1 * 3]
+ sub r0, r1
+%ifidn %1, pp
+ mova m7, [pw_512]
+%else
+ add r3d, r3d
+ mova m7, [pw_2000]
+%endif
+ lea r6, [r3 * 3]
+ movq xm1, [r0] ; m1 = row 0
+ movq xm2, [r0 + r1] ; m2 = row 1
+ punpcklbw xm1, xm2
+ movq xm3, [r0 + r1 * 2] ; m3 = row 2
+ punpcklbw xm2, xm3
+ vinserti128 m5, m1, xm2, 1
+ pmaddubsw m5, [r5]
+ movq xm4, [r0 + r4] ; m4 = row 3
+ punpcklbw xm3, xm4
+ lea r0, [r0 + r1 * 4]
+ movq xm1, [r0] ; m1 = row 4
+ punpcklbw xm4, xm1
+ vinserti128 m2, m3, xm4, 1
+ pmaddubsw m0, m2, [r5 + 1 * mmsize]
+ paddw m5, m0
+ pmaddubsw m2, [r5]
+ movq xm3, [r0 + r1] ; m3 = row 5
+ punpcklbw xm1, xm3
+ movq xm4, [r0 + r1 * 2] ; m4 = row 6
+ punpcklbw xm3, xm4
+ vinserti128 m1, m1, xm3, 1
+ pmaddubsw m0, m1, [r5 + 1 * mmsize]
+ paddw m2, m0
+ pmaddubsw m1, [r5]
+ movq xm3, [r0 + r4] ; m3 = row 7
+ punpcklbw xm4, xm3
+ lea r0, [r0 + r1 * 4]
+ movq xm0, [r0] ; m0 = row 8
+ punpcklbw xm3, xm0
+ vinserti128 m4, m4, xm3, 1
+ pmaddubsw m3, m4, [r5 + 1 * mmsize]
+ paddw m1, m3
+ pmaddubsw m4, [r5]
+ movq xm3, [r0 + r1] ; m3 = row 9
+ punpcklbw xm0, xm3
+ movq xm6, [r0 + r1 * 2] ; m6 = row 10
+ punpcklbw xm3, xm6
+ vinserti128 m0, m0, xm3, 1
+ pmaddubsw m3, m0, [r5 + 1 * mmsize]
+ paddw m4, m3
+ pmaddubsw m0, [r5]
+%ifidn %1, pp
+ pmulhrsw m5, m7 ; m5 = word: row 0, row 1
+ pmulhrsw m2, m7 ; m2 = word: row 2, row 3
+ pmulhrsw m1, m7 ; m1 = word: row 4, row 5
+ pmulhrsw m4, m7 ; m4 = word: row 6, row 7
+ packuswb m5, m2
+ packuswb m1, m4
+ vextracti128 xm2, m5, 1
+ vextracti128 xm4, m1, 1
+ movq [r2], xm5
+ movq [r2 + r3], xm2
+ movhps [r2 + r3 * 2], xm5
+ movhps [r2 + r6], xm2
+ lea r2, [r2 + r3 * 4]
+ movq [r2], xm1
+ movq [r2 + r3], xm4
+ movhps [r2 + r3 * 2], xm1
+ movhps [r2 + r6], xm4
+%else
+ psubw m5, m7 ; m5 = word: row 0, row 1
+ psubw m2, m7 ; m2 = word: row 2, row 3
+ psubw m1, m7 ; m1 = word: row 4, row 5
+ psubw m4, m7 ; m4 = word: row 6, row 7
+ vextracti128 xm3, m5, 1
+ movu [r2], xm5
+ movu [r2 + r3], xm3
+ vextracti128 xm3, m2, 1
+ movu [r2 + r3 * 2], xm2
+ movu [r2 + r6], xm3
+ lea r2, [r2 + r3 * 4]
+ vextracti128 xm5, m1, 1
+ vextracti128 xm3, m4, 1
+ movu [r2], xm1
+ movu [r2 + r3], xm5
+ movu [r2 + r3 * 2], xm4
+ movu [r2 + r6], xm3
+%endif
+ movq xm3, [r0 + r4] ; m3 = row 11
+ punpcklbw xm6, xm3
+ lea r0, [r0 + r1 * 4]
+ movq xm5, [r0] ; m5 = row 12
+ punpcklbw xm3, xm5
+ vinserti128 m6, m6, xm3, 1
+ pmaddubsw m3, m6, [r5 + 1 * mmsize]
+ paddw m0, m3
+ pmaddubsw m6, [r5]
+ movq xm3, [r0 + r1] ; m3 = row 13
+ punpcklbw xm5, xm3
+ movq xm2, [r0 + r1 * 2] ; m2 = row 14
+ punpcklbw xm3, xm2
+ vinserti128 m5, m5, xm3, 1
+ pmaddubsw m3, m5, [r5 + 1 * mmsize]
+ paddw m6, m3
+ lea r2, [r2 + r3 * 4]
+%ifidn %1, pp
+ pmulhrsw m0, m7 ; m0 = word: row 8, row 9
+ pmulhrsw m6, m7 ; m6 = word: row 10, row 11
+ packuswb m0, m6
+ vextracti128 xm6, m0, 1
+ movq [r2], xm0
+ movq [r2 + r3], xm6
+ movhps [r2 + r3 * 2], xm0
+ movhps [r2 + r6], xm6
+%else
+ psubw m0, m7 ; m0 = word: row 8, row 9
+ psubw m6, m7 ; m6 = word: row 10, row 11
+ vextracti128 xm1, m0, 1
+ vextracti128 xm3, m6, 1
+ movu [r2], xm0
+ movu [r2 + r3], xm1
+ movu [r2 + r3 * 2], xm6
+ movu [r2 + r6], xm3
+%endif
+ RET
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_8x12 pp
+ FILTER_VER_CHROMA_AVX2_8x12 ps

-%macro FILTER_VER_CHROMA_AVX2_8x32 1
+%macro FILTER_VER_CHROMA_AVX2_8xN 2
INIT_YMM avx2
-cglobal interp_4tap_vert_%1_8x32, 4, 7, 8
+cglobal interp_4tap_vert_%1_8x%2, 4, 7, 8
mov r4d, r4m
shl r4d, 6


mova m7, [pw_2000]
%endif
lea r6, [r3 * 3]
-%rep 2
+%rep %2 / 16
PROCESS_CHROMA_AVX2_W8_16R %1
lea r2, [r2 + r3 * 4]
%endrep
RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x32 pp
-FILTER_VER_CHROMA_AVX2_8x32 ps
+ FILTER_VER_CHROMA_AVX2_8xN pp, 32
+ FILTER_VER_CHROMA_AVX2_8xN ps, 32
+ FILTER_VER_CHROMA_AVX2_8xN pp, 64
+ FILTER_VER_CHROMA_AVX2_8xN ps, 64

%macro PROCESS_CHROMA_AVX2_W8_4R 0
movq xm1, [r0] ; m1 = row 0

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x4 pp
-FILTER_VER_CHROMA_AVX2_8x4 ps
+ FILTER_VER_CHROMA_AVX2_8x4 pp
+ FILTER_VER_CHROMA_AVX2_8x4 ps

%macro FILTER_VER_CHROMA_AVX2_8x2 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_8x2 pp
-FILTER_VER_CHROMA_AVX2_8x2 ps
+ FILTER_VER_CHROMA_AVX2_8x2 pp
+ FILTER_VER_CHROMA_AVX2_8x2 ps

%macro FILTER_VER_CHROMA_AVX2_6x8 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_6x8 pp
-FILTER_VER_CHROMA_AVX2_6x8 ps
+ FILTER_VER_CHROMA_AVX2_6x8 pp
+ FILTER_VER_CHROMA_AVX2_6x8 ps

;-----------------------------------------------------------------------------
;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_6x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m5, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m5, [r5 + r4 * 4]
%else
-movd m5, [tab_ChromaCoeff + r4 * 4]
+ movd m5, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m6, m5, [tab_Vm]
-pshufb m5, [tab_Vm + 16]
-mova m4, [pw_512]
+ pshufb m6, m5, [tab_Vm]
+ pshufb m5, [tab_Vm + 16]
+ mova m4, [pw_512]

-mov r4d, %2
-lea r5, [3 * r1]
+ mov r4d, %2
+ lea r5, [3 * r1]

.loop:
-movq m0, [r0]
-movq m1, [r0 + r1]
-movq m2, [r0 + 2 * r1]
-movq m3, [r0 + r5]
+ movq m0, [r0]
+ movq m1, [r0 + r1]
+ movq m2, [r0 + 2 * r1]
+ movq m3, [r0 + r5]

-punpcklbw m0, m1
-punpcklbw m1, m2
-punpcklbw m2, m3
+ punpcklbw m0, m1
+ punpcklbw m1, m2
+ punpcklbw m2, m3

-pmaddubsw m0, m6
-pmaddubsw m7, m2, m5
+ pmaddubsw m0, m6
+ pmaddubsw m7, m2, m5

-paddw m0, m7
+ paddw m0, m7

-pmulhrsw m0, m4
-packuswb m0, m0
-movd [r2], m0
-pextrw [r2 + 4], m0, 2
+ pmulhrsw m0, m4
+ packuswb m0, m0
+ movd [r2], m0
+ pextrw [r2 + 4], m0, 2

-lea r0, [r0 + 4 * r1]
+ lea r0, [r0 + 4 * r1]

-movq m0, [r0]
-punpcklbw m3, m0
+ movq m0, [r0]
+ punpcklbw m3, m0

-pmaddubsw m1, m6
-pmaddubsw m7, m3, m5
+ pmaddubsw m1, m6
+ pmaddubsw m7, m3, m5

-paddw m1, m7
+ paddw m1, m7

-pmulhrsw m1, m4
-packuswb m1, m1
-movd [r2 + r3], m1
-pextrw [r2 + r3 + 4], m1, 2
+ pmulhrsw m1, m4
+ packuswb m1, m1
+ movd [r2 + r3], m1
+ pextrw [r2 + r3 + 4], m1, 2

-movq m1, [r0 + r1]
-punpcklbw m7, m0, m1
+ movq m1, [r0 + r1]
+ punpcklbw m7, m0, m1

-pmaddubsw m2, m6
-pmaddubsw m7, m5
+ pmaddubsw m2, m6
+ pmaddubsw m7, m5

-paddw m2, m7
+ paddw m2, m7

-pmulhrsw m2, m4
-packuswb m2, m2
-lea r2, [r2 + 2 * r3]
-movd [r2], m2
-pextrw [r2 + 4], m2, 2
+ pmulhrsw m2, m4
+ packuswb m2, m2
+ lea r2, [r2 + 2 * r3]
+ movd [r2], m2
+ pextrw [r2 + 4], m2, 2

-movq m2, [r0 + 2 * r1]
-punpcklbw m1, m2
+ movq m2, [r0 + 2 * r1]
+ punpcklbw m1, m2

-pmaddubsw m3, m6
-pmaddubsw m1, m5
+ pmaddubsw m3, m6
+ pmaddubsw m1, m5

-paddw m3, m1
+ paddw m3, m1

-pmulhrsw m3, m4
-packuswb m3, m3
+ pmulhrsw m3, m4
+ packuswb m3, m3

-movd [r2 + r3], m3
-pextrw [r2 + r3 + 4], m3, 2
+ movd [r2 + r3], m3
+ pextrw [r2 + r3 + 4], m3, 2

-lea r2, [r2 + 2 * r3]
+ lea r2, [r2 + 2 * r3]

-sub r4, 4
-jnz .loop
-RET
+ sub r4, 4
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W6_H4 6, 8
+ FILTER_V4_W6_H4 6, 8

-FILTER_V4_W6_H4 6, 16
+ FILTER_V4_W6_H4 6, 16

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_12x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m1, m0, [tab_Vm]
-pshufb m0, [tab_Vm + 16]
+ pshufb m1, m0, [tab_Vm]
+ pshufb m0, [tab_Vm + 16]

-mov r4d, %2
+ mov r4d, %2

.loop:
-movu m2, [r0]
-movu m3, [r0 + r1]
+ movu m2, [r0]
+ movu m3, [r0 + r1]

-punpcklbw m4, m2, m3
-punpckhbw m2, m3
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3

-pmaddubsw m4, m1
-pmaddubsw m2, m1
+ pmaddubsw m4, m1
+ pmaddubsw m2, m1

-lea r0, [r0 + 2 * r1]
-movu m5, [r0]
-movu m7, [r0 + r1]
+ lea r0, [r0 + 2 * r1]
+ movu m5, [r0]
+ movu m7, [r0 + r1]

-punpcklbw m6, m5, m7
-pmaddubsw m6, m0
-paddw m4, m6
+ punpcklbw m6, m5, m7
+ pmaddubsw m6, m0
+ paddw m4, m6

-punpckhbw m6, m5, m7
-pmaddubsw m6, m0
-paddw m2, m6
+ punpckhbw m6, m5, m7
+ pmaddubsw m6, m0
+ paddw m2, m6

-mova m6, [pw_512]
+ mova m6, [pw_512]

-pmulhrsw m4, m6
-pmulhrsw m2, m6
+ pmulhrsw m4, m6
+ pmulhrsw m2, m6

-packuswb m4, m2
+ packuswb m4, m2

-movh [r2], m4
-pextrd [r2 + 8], m4, 2
+ movh [r2], m4
+ pextrd [r2 + 8], m4, 2

-punpcklbw m4, m3, m5
-punpckhbw m3, m5
+ punpcklbw m4, m3, m5
+ punpckhbw m3, m5

-pmaddubsw m4, m1
-pmaddubsw m3, m1
+ pmaddubsw m4, m1
+ pmaddubsw m3, m1

-movu m5, [r0 + 2 * r1]
+ movu m5, [r0 + 2 * r1]

-punpcklbw m2, m7, m5
-punpckhbw m7, m5
+ punpcklbw m2, m7, m5
+ punpckhbw m7, m5

-pmaddubsw m2, m0
-pmaddubsw m7, m0
+ pmaddubsw m2, m0
+ pmaddubsw m7, m0

-paddw m4, m2
-paddw m3, m7
+ paddw m4, m2
+ paddw m3, m7

-pmulhrsw m4, m6
-pmulhrsw m3, m6
+ pmulhrsw m4, m6
+ pmulhrsw m3, m6

-packuswb m4, m3
+ packuswb m4, m3

-movh [r2 + r3], m4
-pextrd [r2 + r3 + 8], m4, 2
+ movh [r2 + r3], m4
+ pextrd [r2 + r3 + 8], m4, 2

-lea r2, [r2 + 2 * r3]
+ lea r2, [r2 + 2 * r3]

-sub r4, 2
-jnz .loop
-RET
+ sub r4, 2
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W12_H2 12, 16
+ FILTER_V4_W12_H2 12, 16

-FILTER_V4_W12_H2 12, 32
+ FILTER_V4_W12_H2 12, 32

;-----------------------------------------------------------------------------
4722
; void interp_4tap_vert_pp_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
4723
4724
INIT_XMM sse4
4725
cglobal interp_4tap_vert_pp_16x%2, 4, 6, 8
4726
4727
-mov r4d, r4m
4728
-sub r0, r1
4729
+ mov r4d, r4m
4730
+ sub r0, r1
4731
4732
%ifdef PIC
4733
-lea r5, [tab_ChromaCoeff]
4734
-movd m0, [r5 + r4 * 4]
4735
+ lea r5, [tab_ChromaCoeff]
4736
+ movd m0, [r5 + r4 * 4]
4737
%else
4738
-movd m0, [tab_ChromaCoeff + r4 * 4]
4739
+ movd m0, [tab_ChromaCoeff + r4 * 4]
4740
%endif
4741
4742
-pshufb m1, m0, [tab_Vm]
4743
-pshufb m0, [tab_Vm + 16]
4744
+ pshufb m1, m0, [tab_Vm]
4745
+ pshufb m0, [tab_Vm + 16]
4746
4747
-mov r4d, %2/2
4748
+ mov r4d, %2/2
4749
4750
.loop:
4751
-movu m2, [r0]
4752
-movu m3, [r0 + r1]
4753
+ movu m2, [r0]
4754
+ movu m3, [r0 + r1]
4755
4756
-punpcklbw m4, m2, m3
4757
-punpckhbw m2, m3
4758
+ punpcklbw m4, m2, m3
4759
+ punpckhbw m2, m3
4760
4761
-pmaddubsw m4, m1
4762
-pmaddubsw m2, m1
4763
+ pmaddubsw m4, m1
4764
+ pmaddubsw m2, m1
4765
4766
-lea r0, [r0 + 2 * r1]
4767
-movu m5, [r0]
4768
-movu m6, [r0 + r1]
4769
+ lea r0, [r0 + 2 * r1]
4770
+ movu m5, [r0]
4771
+ movu m6, [r0 + r1]
4772
4773
-punpckhbw m7, m5, m6
4774
-pmaddubsw m7, m0
4775
-paddw m2, m7
4776
+ punpckhbw m7, m5, m6
4777
+ pmaddubsw m7, m0
4778
+ paddw m2, m7
4779
4780
-punpcklbw m7, m5, m6
4781
-pmaddubsw m7, m0
4782
-paddw m4, m7
4783
+ punpcklbw m7, m5, m6
4784
+ pmaddubsw m7, m0
4785
+ paddw m4, m7
4786
4787
-mova m7, [pw_512]
4788
+ mova m7, [pw_512]
4789
4790
-pmulhrsw m4, m7
4791
-pmulhrsw m2, m7
4792
+ pmulhrsw m4, m7
4793
+ pmulhrsw m2, m7
4794
4795
-packuswb m4, m2
4796
+ packuswb m4, m2
4797
4798
-movu [r2], m4
4799
+ movu [r2], m4
4800
4801
-punpcklbw m4, m3, m5
4802
-punpckhbw m3, m5
4803
+ punpcklbw m4, m3, m5
4804
+ punpckhbw m3, m5
4805
4806
-pmaddubsw m4, m1
4807
-pmaddubsw m3, m1
4808
+ pmaddubsw m4, m1
4809
+ pmaddubsw m3, m1
4810
4811
-movu m5, [r0 + 2 * r1]
4812
+ movu m5, [r0 + 2 * r1]
4813
4814
-punpcklbw m2, m6, m5
4815
-punpckhbw m6, m5
4816
+ punpcklbw m2, m6, m5
4817
+ punpckhbw m6, m5
4818
4819
-pmaddubsw m2, m0
4820
-pmaddubsw m6, m0
4821
+ pmaddubsw m2, m0
4822
+ pmaddubsw m6, m0
4823
4824
-paddw m4, m2
4825
-paddw m3, m6
4826
+ paddw m4, m2
4827
+ paddw m3, m6
4828
4829
-pmulhrsw m4, m7
4830
-pmulhrsw m3, m7
4831
+ pmulhrsw m4, m7
4832
+ pmulhrsw m3, m7
4833
4834
-packuswb m4, m3
4835
+ packuswb m4, m3
4836
4837
-movu [r2 + r3], m4
4838
+ movu [r2 + r3], m4
4839
4840
-lea r2, [r2 + 2 * r3]
4841
+ lea r2, [r2 + 2 * r3]
4842
4843
-dec r4d
4844
-jnz .loop
4845
-RET
4846
+ dec r4d
4847
+ jnz .loop
4848
+ RET
4849
%endmacro
4850
4851
-FILTER_V4_W16_H2 16, 4
4852
-FILTER_V4_W16_H2 16, 8
4853
-FILTER_V4_W16_H2 16, 12
4854
-FILTER_V4_W16_H2 16, 16
4855
-FILTER_V4_W16_H2 16, 32
4856
+ FILTER_V4_W16_H2 16, 4
4857
+ FILTER_V4_W16_H2 16, 8
4858
+ FILTER_V4_W16_H2 16, 12
4859
+ FILTER_V4_W16_H2 16, 16
4860
+ FILTER_V4_W16_H2 16, 32
4861
4862
-FILTER_V4_W16_H2 16, 24
4863
-FILTER_V4_W16_H2 16, 64
4864
+ FILTER_V4_W16_H2 16, 24
4865
+ FILTER_V4_W16_H2 16, 64
4866
4867
%macro FILTER_VER_CHROMA_AVX2_16x16 1
4868
INIT_YMM avx2
4869
4870
%endif
4871
%endmacro
4872
4873
-FILTER_VER_CHROMA_AVX2_16x16 pp
4874
-FILTER_VER_CHROMA_AVX2_16x16 ps
4875
+ FILTER_VER_CHROMA_AVX2_16x16 pp
4876
+ FILTER_VER_CHROMA_AVX2_16x16 ps
4877
%macro FILTER_VER_CHROMA_AVX2_16x8 1
4878
INIT_YMM avx2
4879
cglobal interp_4tap_vert_%1_16x8, 4, 7, 7
4880
4881
RET
4882
%endmacro
4883
4884
-FILTER_VER_CHROMA_AVX2_16x8 pp
4885
-FILTER_VER_CHROMA_AVX2_16x8 ps
4886
+ FILTER_VER_CHROMA_AVX2_16x8 pp
4887
+ FILTER_VER_CHROMA_AVX2_16x8 ps
4888
4889
%macro FILTER_VER_CHROMA_AVX2_16x12 1
4890
INIT_YMM avx2
4891
4892
%endif
4893
%endmacro
4894
4895
-FILTER_VER_CHROMA_AVX2_16x12 pp
4896
-FILTER_VER_CHROMA_AVX2_16x12 ps
4897
+ FILTER_VER_CHROMA_AVX2_16x12 pp
4898
+ FILTER_VER_CHROMA_AVX2_16x12 ps
4899
4900
-%macro FILTER_VER_CHROMA_AVX2_16x32 1
4901
-INIT_YMM avx2
4902
+%macro FILTER_VER_CHROMA_AVX2_16xN 2
4903
%if ARCH_X86_64 == 1
4904
-cglobal interp_4tap_vert_%1_16x32, 4, 8, 8
4905
+INIT_YMM avx2
4906
+cglobal interp_4tap_vert_%1_16x%2, 4, 8, 8
4907
mov r4d, r4m
4908
shl r4d, 6
4909
4910
4911
mova m7, [pw_2000]
4912
%endif
4913
lea r6, [r3 * 3]
4914
- mov r7d, 2
4915
+ mov r7d, %2 / 16
4916
.loopH:
4917
movu xm0, [r0]
4918
vinserti128 m0, m0, [r0 + r1 * 2], 1
4919
4920
%endif
4921
%endmacro
4922
4923
-FILTER_VER_CHROMA_AVX2_16x32 pp
4924
-FILTER_VER_CHROMA_AVX2_16x32 ps
4925
+ FILTER_VER_CHROMA_AVX2_16xN pp, 32
4926
+ FILTER_VER_CHROMA_AVX2_16xN ps, 32
4927
+ FILTER_VER_CHROMA_AVX2_16xN pp, 64
4928
+ FILTER_VER_CHROMA_AVX2_16xN ps, 64
4929
+
4930
+%macro FILTER_VER_CHROMA_AVX2_16x24 1
4931
+%if ARCH_X86_64 == 1
4932
+INIT_YMM avx2
4933
+cglobal interp_4tap_vert_%1_16x24, 4, 6, 15
4934
+ mov r4d, r4m
4935
+ shl r4d, 6
4936
+
4937
+%ifdef PIC
4938
+ lea r5, [tab_ChromaCoeffVer_32]
4939
+ add r5, r4
4940
+%else
4941
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
4942
+%endif
4943
+
4944
+ mova m12, [r5]
4945
+ mova m13, [r5 + mmsize]
4946
+ lea r4, [r1 * 3]
4947
+ sub r0, r1
4948
+%ifidn %1,pp
4949
+ mova m14, [pw_512]
4950
+%else
4951
+ add r3d, r3d
4952
+ vbroadcasti128 m14, [pw_2000]
4953
+%endif
4954
+ lea r5, [r3 * 3]
4955
+
4956
+ movu xm0, [r0] ; m0 = row 0
4957
+ movu xm1, [r0 + r1] ; m1 = row 1
4958
+ punpckhbw xm2, xm0, xm1
4959
+ punpcklbw xm0, xm1
4960
+ vinserti128 m0, m0, xm2, 1
4961
+ pmaddubsw m0, m12
4962
+ movu xm2, [r0 + r1 * 2] ; m2 = row 2
4963
+ punpckhbw xm3, xm1, xm2
4964
+ punpcklbw xm1, xm2
4965
+ vinserti128 m1, m1, xm3, 1
4966
+ pmaddubsw m1, m12
4967
+ movu xm3, [r0 + r4] ; m3 = row 3
4968
+ punpckhbw xm4, xm2, xm3
4969
+ punpcklbw xm2, xm3
4970
+ vinserti128 m2, m2, xm4, 1
4971
+ pmaddubsw m4, m2, m13
4972
+ paddw m0, m4
4973
+ pmaddubsw m2, m12
4974
+ lea r0, [r0 + r1 * 4]
4975
+ movu xm4, [r0] ; m4 = row 4
4976
+ punpckhbw xm5, xm3, xm4
4977
+ punpcklbw xm3, xm4
4978
+ vinserti128 m3, m3, xm5, 1
4979
+ pmaddubsw m5, m3, m13
4980
+ paddw m1, m5
4981
+ pmaddubsw m3, m12
4982
+ movu xm5, [r0 + r1] ; m5 = row 5
4983
+ punpckhbw xm6, xm4, xm5
4984
+ punpcklbw xm4, xm5
4985
+ vinserti128 m4, m4, xm6, 1
4986
+ pmaddubsw m6, m4, m13
4987
+ paddw m2, m6
4988
+ pmaddubsw m4, m12
4989
+ movu xm6, [r0 + r1 * 2] ; m6 = row 6
4990
+ punpckhbw xm7, xm5, xm6
4991
+ punpcklbw xm5, xm6
4992
+ vinserti128 m5, m5, xm7, 1
4993
+ pmaddubsw m7, m5, m13
4994
+ paddw m3, m7
4995
+ pmaddubsw m5, m12
4996
+ movu xm7, [r0 + r4] ; m7 = row 7
+ punpckhbw xm8, xm6, xm7
+ punpcklbw xm6, xm7
+ vinserti128 m6, m6, xm8, 1
+ pmaddubsw m8, m6, m13
+ paddw m4, m8
+ pmaddubsw m6, m12
+ lea r0, [r0 + r1 * 4]
+ movu xm8, [r0] ; m8 = row 8
+ punpckhbw xm9, xm7, xm8
+ punpcklbw xm7, xm8
+ vinserti128 m7, m7, xm9, 1
+ pmaddubsw m9, m7, m13
+ paddw m5, m9
+ pmaddubsw m7, m12
+ movu xm9, [r0 + r1] ; m9 = row 9
+ punpckhbw xm10, xm8, xm9
+ punpcklbw xm8, xm9
+ vinserti128 m8, m8, xm10, 1
+ pmaddubsw m10, m8, m13
+ paddw m6, m10
+ pmaddubsw m8, m12
+ movu xm10, [r0 + r1 * 2] ; m10 = row 10
+ punpckhbw xm11, xm9, xm10
+ punpcklbw xm9, xm10
+ vinserti128 m9, m9, xm11, 1
+ pmaddubsw m11, m9, m13
+ paddw m7, m11
+ pmaddubsw m9, m12
+
+%ifidn %1,pp
+ pmulhrsw m0, m14 ; m0 = word: row 0
+ pmulhrsw m1, m14 ; m1 = word: row 1
+ pmulhrsw m2, m14 ; m2 = word: row 2
+ pmulhrsw m3, m14 ; m3 = word: row 3
+ pmulhrsw m4, m14 ; m4 = word: row 4
+ pmulhrsw m5, m14 ; m5 = word: row 5
+ pmulhrsw m6, m14 ; m6 = word: row 6
+ pmulhrsw m7, m14 ; m7 = word: row 7
+ packuswb m0, m1
+ packuswb m2, m3
+ packuswb m4, m5
+ packuswb m6, m7
+ vpermq m0, m0, q3120
+ vpermq m2, m2, q3120
+ vpermq m4, m4, q3120
+ vpermq m6, m6, q3120
+ vextracti128 xm1, m0, 1
+ vextracti128 xm3, m2, 1
+ vextracti128 xm5, m4, 1
+ vextracti128 xm7, m6, 1
+ movu [r2], xm0
+ movu [r2 + r3], xm1
+ movu [r2 + r3 * 2], xm2
+ movu [r2 + r5], xm3
+ lea r2, [r2 + r3 * 4]
+ movu [r2], xm4
+ movu [r2 + r3], xm5
+ movu [r2 + r3 * 2], xm6
+ movu [r2 + r5], xm7
+%else
+ psubw m0, m14 ; m0 = word: row 0
+ psubw m1, m14 ; m1 = word: row 1
+ psubw m2, m14 ; m2 = word: row 2
+ psubw m3, m14 ; m3 = word: row 3
+ psubw m4, m14 ; m4 = word: row 4
+ psubw m5, m14 ; m5 = word: row 5
+ psubw m6, m14 ; m6 = word: row 6
+ psubw m7, m14 ; m7 = word: row 7
+ movu [r2], m0
+ movu [r2 + r3], m1
+ movu [r2 + r3 * 2], m2
+ movu [r2 + r5], m3
+ lea r2, [r2 + r3 * 4]
+ movu [r2], m4
+ movu [r2 + r3], m5
+ movu [r2 + r3 * 2], m6
+ movu [r2 + r5], m7
+%endif
+ lea r2, [r2 + r3 * 4]
+
+ movu xm11, [r0 + r4] ; m11 = row 11
+ punpckhbw xm6, xm10, xm11
+ punpcklbw xm10, xm11
+ vinserti128 m10, m10, xm6, 1
+ pmaddubsw m6, m10, m13
+ paddw m8, m6
+ pmaddubsw m10, m12
+ lea r0, [r0 + r1 * 4]
+ movu xm6, [r0] ; m6 = row 12
+ punpckhbw xm7, xm11, xm6
+ punpcklbw xm11, xm6
+ vinserti128 m11, m11, xm7, 1
+ pmaddubsw m7, m11, m13
+ paddw m9, m7
+ pmaddubsw m11, m12
+
+ movu xm7, [r0 + r1] ; m7 = row 13
+ punpckhbw xm0, xm6, xm7
+ punpcklbw xm6, xm7
+ vinserti128 m6, m6, xm0, 1
+ pmaddubsw m0, m6, m13
+ paddw m10, m0
+ pmaddubsw m6, m12
+ movu xm0, [r0 + r1 * 2] ; m0 = row 14
+ punpckhbw xm1, xm7, xm0
+ punpcklbw xm7, xm0
+ vinserti128 m7, m7, xm1, 1
+ pmaddubsw m1, m7, m13
+ paddw m11, m1
+ pmaddubsw m7, m12
+ movu xm1, [r0 + r4] ; m1 = row 15
+ punpckhbw xm2, xm0, xm1
+ punpcklbw xm0, xm1
+ vinserti128 m0, m0, xm2, 1
+ pmaddubsw m2, m0, m13
+ paddw m6, m2
+ pmaddubsw m0, m12
+ lea r0, [r0 + r1 * 4]
+ movu xm2, [r0] ; m2 = row 16
+ punpckhbw xm3, xm1, xm2
+ punpcklbw xm1, xm2
+ vinserti128 m1, m1, xm3, 1
+ pmaddubsw m3, m1, m13
+ paddw m7, m3
+ pmaddubsw m1, m12
+ movu xm3, [r0 + r1] ; m3 = row 17
+ punpckhbw xm4, xm2, xm3
+ punpcklbw xm2, xm3
+ vinserti128 m2, m2, xm4, 1
+ pmaddubsw m4, m2, m13
+ paddw m0, m4
+ pmaddubsw m2, m12
+ movu xm4, [r0 + r1 * 2] ; m4 = row 18
+ punpckhbw xm5, xm3, xm4
+ punpcklbw xm3, xm4
+ vinserti128 m3, m3, xm5, 1
+ pmaddubsw m5, m3, m13
+ paddw m1, m5
+ pmaddubsw m3, m12
+
+%ifidn %1,pp
+ pmulhrsw m8, m14 ; m8 = word: row 8
+ pmulhrsw m9, m14 ; m9 = word: row 9
+ pmulhrsw m10, m14 ; m10 = word: row 10
+ pmulhrsw m11, m14 ; m11 = word: row 11
+ pmulhrsw m6, m14 ; m6 = word: row 12
+ pmulhrsw m7, m14 ; m7 = word: row 13
+ pmulhrsw m0, m14 ; m0 = word: row 14
+ pmulhrsw m1, m14 ; m1 = word: row 15
+ packuswb m8, m9
+ packuswb m10, m11
+ packuswb m6, m7
+ packuswb m0, m1
+ vpermq m8, m8, q3120
+ vpermq m10, m10, q3120
+ vpermq m6, m6, q3120
+ vpermq m0, m0, q3120
+ vextracti128 xm9, m8, 1
+ vextracti128 xm11, m10, 1
+ vextracti128 xm7, m6, 1
+ vextracti128 xm1, m0, 1
+ movu [r2], xm8
+ movu [r2 + r3], xm9
+ movu [r2 + r3 * 2], xm10
+ movu [r2 + r5], xm11
+ lea r2, [r2 + r3 * 4]
+ movu [r2], xm6
+ movu [r2 + r3], xm7
+ movu [r2 + r3 * 2], xm0
+ movu [r2 + r5], xm1
+%else
+ psubw m8, m14 ; m8 = word: row 8
+ psubw m9, m14 ; m9 = word: row 9
+ psubw m10, m14 ; m10 = word: row 10
+ psubw m11, m14 ; m11 = word: row 11
+ psubw m6, m14 ; m6 = word: row 12
+ psubw m7, m14 ; m7 = word: row 13
+ psubw m0, m14 ; m0 = word: row 14
+ psubw m1, m14 ; m1 = word: row 15
+ movu [r2], m8
+ movu [r2 + r3], m9
+ movu [r2 + r3 * 2], m10
+ movu [r2 + r5], m11
+ lea r2, [r2 + r3 * 4]
+ movu [r2], m6
+ movu [r2 + r3], m7
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r5], m1
+%endif
+ lea r2, [r2 + r3 * 4]
+
+ movu xm5, [r0 + r4] ; m5 = row 19
+ punpckhbw xm6, xm4, xm5
+ punpcklbw xm4, xm5
+ vinserti128 m4, m4, xm6, 1
+ pmaddubsw m6, m4, m13
+ paddw m2, m6
+ pmaddubsw m4, m12
+ lea r0, [r0 + r1 * 4]
+ movu xm6, [r0] ; m6 = row 20
+ punpckhbw xm7, xm5, xm6
+ punpcklbw xm5, xm6
+ vinserti128 m5, m5, xm7, 1
+ pmaddubsw m7, m5, m13
+ paddw m3, m7
+ pmaddubsw m5, m12
+ movu xm7, [r0 + r1] ; m7 = row 21
+ punpckhbw xm0, xm6, xm7
+ punpcklbw xm6, xm7
+ vinserti128 m6, m6, xm0, 1
+ pmaddubsw m0, m6, m13
+ paddw m4, m0
+ pmaddubsw m6, m12
+ movu xm0, [r0 + r1 * 2] ; m0 = row 22
+ punpckhbw xm1, xm7, xm0
+ punpcklbw xm7, xm0
+ vinserti128 m7, m7, xm1, 1
+ pmaddubsw m1, m7, m13
+ paddw m5, m1
+ pmaddubsw m7, m12
+ movu xm1, [r0 + r4] ; m1 = row 23
+ punpckhbw xm8, xm0, xm1
+ punpcklbw xm0, xm1
+ vinserti128 m0, m0, xm8, 1
+ pmaddubsw m8, m0, m13
+ paddw m6, m8
+ pmaddubsw m0, m12
+ lea r0, [r0 + r1 * 4]
+ movu xm8, [r0] ; m8 = row 24
+ punpckhbw xm9, xm1, xm8
+ punpcklbw xm1, xm8
+ vinserti128 m1, m1, xm9, 1
+ pmaddubsw m9, m1, m13
+ paddw m7, m9
+ pmaddubsw m1, m12
+ movu xm9, [r0 + r1] ; m9 = row 25
+ punpckhbw xm10, xm8, xm9
+ punpcklbw xm8, xm9
+ vinserti128 m8, m8, xm10, 1
+ pmaddubsw m8, m13
+ paddw m0, m8
+ movu xm10, [r0 + r1 * 2] ; m10 = row 26
+ punpckhbw xm11, xm9, xm10
+ punpcklbw xm9, xm10
+ vinserti128 m9, m9, xm11, 1
+ pmaddubsw m9, m13
+ paddw m1, m9
+
+%ifidn %1,pp
+ pmulhrsw m2, m14 ; m2 = word: row 16
+ pmulhrsw m3, m14 ; m3 = word: row 17
+ pmulhrsw m4, m14 ; m4 = word: row 18
+ pmulhrsw m5, m14 ; m5 = word: row 19
+ pmulhrsw m6, m14 ; m6 = word: row 20
+ pmulhrsw m7, m14 ; m7 = word: row 21
+ pmulhrsw m0, m14 ; m0 = word: row 22
+ pmulhrsw m1, m14 ; m1 = word: row 23
+ packuswb m2, m3
+ packuswb m4, m5
+ packuswb m6, m7
+ packuswb m0, m1
+ vpermq m2, m2, q3120
+ vpermq m4, m4, q3120
+ vpermq m6, m6, q3120
+ vpermq m0, m0, q3120
+ vextracti128 xm3, m2, 1
+ vextracti128 xm5, m4, 1
+ vextracti128 xm7, m6, 1
+ vextracti128 xm1, m0, 1
+ movu [r2], xm2
+ movu [r2 + r3], xm3
+ movu [r2 + r3 * 2], xm4
+ movu [r2 + r5], xm5
+ lea r2, [r2 + r3 * 4]
+ movu [r2], xm6
+ movu [r2 + r3], xm7
+ movu [r2 + r3 * 2], xm0
+ movu [r2 + r5], xm1
+%else
+ psubw m2, m14 ; m2 = word: row 16
+ psubw m3, m14 ; m3 = word: row 17
+ psubw m4, m14 ; m4 = word: row 18
+ psubw m5, m14 ; m5 = word: row 19
+ psubw m6, m14 ; m6 = word: row 20
+ psubw m7, m14 ; m7 = word: row 21
+ psubw m0, m14 ; m0 = word: row 22
+ psubw m1, m14 ; m1 = word: row 23
+ movu [r2], m2
+ movu [r2 + r3], m3
+ movu [r2 + r3 * 2], m4
+ movu [r2 + r5], m5
+ lea r2, [r2 + r3 * 4]
+ movu [r2], m6
+ movu [r2 + r3], m7
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r5], m1
+%endif
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_16x24 pp
+ FILTER_VER_CHROMA_AVX2_16x24 ps

%macro FILTER_VER_CHROMA_AVX2_24x32 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_AVX2_24x32 pp
-FILTER_VER_CHROMA_AVX2_24x32 ps
+ FILTER_VER_CHROMA_AVX2_24x32 pp
+ FILTER_VER_CHROMA_AVX2_24x32 ps

%macro FILTER_VER_CHROMA_AVX2_16x4 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_AVX2_16x4 pp
-FILTER_VER_CHROMA_AVX2_16x4 ps
+ FILTER_VER_CHROMA_AVX2_16x4 pp
+ FILTER_VER_CHROMA_AVX2_16x4 ps

-%macro FILTER_VER_CHROMA_AVX2_12x16 1
+%macro FILTER_VER_CHROMA_AVX2_12xN 2
INIT_YMM avx2
-cglobal interp_4tap_vert_%1_12x16, 4, 7, 8
+cglobal interp_4tap_vert_%1_12x%2, 4, 7, 8
mov r4d, r4m
shl r4d, 6


vbroadcasti128 m7, [pw_2000]
%endif
lea r6, [r3 * 3]
-
+%rep %2 / 16
movu xm0, [r0] ; m0 = row 0
movu xm1, [r0 + r1] ; m1 = row 1
punpckhbw xm2, xm0, xm1

vextracti128 xm5, m5, 1
movq [r2 + r6 + 16], xm5
%endif
+ lea r2, [r2 + r3 * 4]
+%endrep
RET
%endmacro

-FILTER_VER_CHROMA_AVX2_12x16 pp
-FILTER_VER_CHROMA_AVX2_12x16 ps
+ FILTER_VER_CHROMA_AVX2_12xN pp, 16
+ FILTER_VER_CHROMA_AVX2_12xN ps, 16
+ FILTER_VER_CHROMA_AVX2_12xN pp, 32
+ FILTER_VER_CHROMA_AVX2_12xN ps, 32
;-----------------------------------------------------------------------------
;void interp_4tap_vert_pp_24x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_24x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m1, m0, [tab_Vm]
-pshufb m0, [tab_Vm + 16]
+ pshufb m1, m0, [tab_Vm]
+ pshufb m0, [tab_Vm + 16]

-mov r4d, %2
+ mov r4d, %2

.loop:
-movu m2, [r0]
-movu m3, [r0 + r1]
+ movu m2, [r0]
+ movu m3, [r0 + r1]

-punpcklbw m4, m2, m3
-punpckhbw m2, m3
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3

-pmaddubsw m4, m1
-pmaddubsw m2, m1
+ pmaddubsw m4, m1
+ pmaddubsw m2, m1

-lea r5, [r0 + 2 * r1]
-movu m5, [r5]
-movu m7, [r5 + r1]
+ lea r5, [r0 + 2 * r1]
+ movu m5, [r5]
+ movu m7, [r5 + r1]

-punpcklbw m6, m5, m7
-pmaddubsw m6, m0
-paddw m4, m6
+ punpcklbw m6, m5, m7
+ pmaddubsw m6, m0
+ paddw m4, m6

-punpckhbw m6, m5, m7
-pmaddubsw m6, m0
-paddw m2, m6
+ punpckhbw m6, m5, m7
+ pmaddubsw m6, m0
+ paddw m2, m6

-mova m6, [pw_512]
+ mova m6, [pw_512]

-pmulhrsw m4, m6
-pmulhrsw m2, m6
+ pmulhrsw m4, m6
+ pmulhrsw m2, m6

-packuswb m4, m2
+ packuswb m4, m2

-movu [r2], m4
+ movu [r2], m4

-punpcklbw m4, m3, m5
-punpckhbw m3, m5
+ punpcklbw m4, m3, m5
+ punpckhbw m3, m5

-pmaddubsw m4, m1
-pmaddubsw m3, m1
+ pmaddubsw m4, m1
+ pmaddubsw m3, m1

-movu m2, [r5 + 2 * r1]
+ movu m2, [r5 + 2 * r1]

-punpcklbw m5, m7, m2
-punpckhbw m7, m2
+ punpcklbw m5, m7, m2
+ punpckhbw m7, m2

-pmaddubsw m5, m0
-pmaddubsw m7, m0
+ pmaddubsw m5, m0
+ pmaddubsw m7, m0

-paddw m4, m5
-paddw m3, m7
+ paddw m4, m5
+ paddw m3, m7

-pmulhrsw m4, m6
-pmulhrsw m3, m6
+ pmulhrsw m4, m6
+ pmulhrsw m3, m6

-packuswb m4, m3
+ packuswb m4, m3

-movu [r2 + r3], m4
+ movu [r2 + r3], m4

-movq m2, [r0 + 16]
-movq m3, [r0 + r1 + 16]
-movq m4, [r5 + 16]
-movq m5, [r5 + r1 + 16]
+ movq m2, [r0 + 16]
+ movq m3, [r0 + r1 + 16]
+ movq m4, [r5 + 16]
+ movq m5, [r5 + r1 + 16]

-punpcklbw m2, m3
-punpcklbw m4, m5
+ punpcklbw m2, m3
+ punpcklbw m4, m5

-pmaddubsw m2, m1
-pmaddubsw m4, m0
+ pmaddubsw m2, m1
+ pmaddubsw m4, m0

-paddw m2, m4
+ paddw m2, m4

-pmulhrsw m2, m6
+ pmulhrsw m2, m6

-movq m3, [r0 + r1 + 16]
-movq m4, [r5 + 16]
-movq m5, [r5 + r1 + 16]
-movq m7, [r5 + 2 * r1 + 16]
+ movq m3, [r0 + r1 + 16]
+ movq m4, [r5 + 16]
+ movq m5, [r5 + r1 + 16]
+ movq m7, [r5 + 2 * r1 + 16]

-punpcklbw m3, m4
-punpcklbw m5, m7
+ punpcklbw m3, m4
+ punpcklbw m5, m7

-pmaddubsw m3, m1
-pmaddubsw m5, m0
+ pmaddubsw m3, m1
+ pmaddubsw m5, m0

-paddw m3, m5
+ paddw m3, m5

-pmulhrsw m3, m6
-packuswb m2, m3
+ pmulhrsw m3, m6
+ packuswb m2, m3

-movh [r2 + 16], m2
-movhps [r2 + r3 + 16], m2
+ movh [r2 + 16], m2
+ movhps [r2 + r3 + 16], m2

-mov r0, r5
-lea r2, [r2 + 2 * r3]
+ mov r0, r5
+ lea r2, [r2 + 2 * r3]

-sub r4, 2
-jnz .loop
-RET
+ sub r4, 2
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W24 24, 32
+ FILTER_V4_W24 24, 32

-FILTER_V4_W24 24, 64
+ FILTER_V4_W24 24, 64

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m1, m0, [tab_Vm]
-pshufb m0, [tab_Vm + 16]
+ pshufb m1, m0, [tab_Vm]
+ pshufb m0, [tab_Vm + 16]

-mova m7, [pw_512]
+ mova m7, [pw_512]

-mov r4d, %2
+ mov r4d, %2

.loop:
-movu m2, [r0]
-movu m3, [r0 + r1]
+ movu m2, [r0]
+ movu m3, [r0 + r1]

-punpcklbw m4, m2, m3
-punpckhbw m2, m3
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3

-pmaddubsw m4, m1
-pmaddubsw m2, m1
+ pmaddubsw m4, m1
+ pmaddubsw m2, m1

-lea r5, [r0 + 2 * r1]
-movu m3, [r5]
-movu m5, [r5 + r1]
+ lea r5, [r0 + 2 * r1]
+ movu m3, [r5]
+ movu m5, [r5 + r1]

-punpcklbw m6, m3, m5
-punpckhbw m3, m5
+ punpcklbw m6, m3, m5
+ punpckhbw m3, m5

-pmaddubsw m6, m0
-pmaddubsw m3, m0
+ pmaddubsw m6, m0
+ pmaddubsw m3, m0

-paddw m4, m6
-paddw m2, m3
+ paddw m4, m6
+ paddw m2, m3

-pmulhrsw m4, m7
-pmulhrsw m2, m7
+ pmulhrsw m4, m7
+ pmulhrsw m2, m7

-packuswb m4, m2
+ packuswb m4, m2

-movu [r2], m4
+ movu [r2], m4

-movu m2, [r0 + 16]
-movu m3, [r0 + r1 + 16]
+ movu m2, [r0 + 16]
+ movu m3, [r0 + r1 + 16]

-punpcklbw m4, m2, m3
-punpckhbw m2, m3
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3

-pmaddubsw m4, m1
-pmaddubsw m2, m1
+ pmaddubsw m4, m1
+ pmaddubsw m2, m1

-movu m3, [r5 + 16]
-movu m5, [r5 + r1 + 16]
+ movu m3, [r5 + 16]
+ movu m5, [r5 + r1 + 16]

-punpcklbw m6, m3, m5
-punpckhbw m3, m5
+ punpcklbw m6, m3, m5
+ punpckhbw m3, m5

-pmaddubsw m6, m0
-pmaddubsw m3, m0
+ pmaddubsw m6, m0
+ pmaddubsw m3, m0

-paddw m4, m6
-paddw m2, m3
+ paddw m4, m6
+ paddw m2, m3

-pmulhrsw m4, m7
-pmulhrsw m2, m7
+ pmulhrsw m4, m7
+ pmulhrsw m2, m7

-packuswb m4, m2
+ packuswb m4, m2

-movu [r2 + 16], m4
+ movu [r2 + 16], m4

-lea r0, [r0 + r1]
-lea r2, [r2 + r3]
+ lea r0, [r0 + r1]
+ lea r2, [r2 + r3]

-dec r4
-jnz .loop
-RET
+ dec r4
+ jnz .loop
+ RET
%endmacro

-FILTER_V4_W32 32, 8
-FILTER_V4_W32 32, 16
-FILTER_V4_W32 32, 24
-FILTER_V4_W32 32, 32
+ FILTER_V4_W32 32, 8
+ FILTER_V4_W32 32, 16
+ FILTER_V4_W32 32, 24
+ FILTER_V4_W32 32, 32

-FILTER_V4_W32 32, 48
-FILTER_V4_W32 32, 64
+ FILTER_V4_W32 32, 48
+ FILTER_V4_W32 32, 64

%macro FILTER_VER_CHROMA_AVX2_32xN 2
-INIT_YMM avx2
%if ARCH_X86_64 == 1
+INIT_YMM avx2
cglobal interp_4tap_vert_%1_32x%2, 4, 7, 13
mov r4d, r4m
shl r4d, 6

%endif
%endmacro

-FILTER_VER_CHROMA_AVX2_32xN pp, 32
-FILTER_VER_CHROMA_AVX2_32xN pp, 24
-FILTER_VER_CHROMA_AVX2_32xN pp, 16
-FILTER_VER_CHROMA_AVX2_32xN pp, 8
-FILTER_VER_CHROMA_AVX2_32xN ps, 32
-FILTER_VER_CHROMA_AVX2_32xN ps, 24
-FILTER_VER_CHROMA_AVX2_32xN ps, 16
-FILTER_VER_CHROMA_AVX2_32xN ps, 8
+ FILTER_VER_CHROMA_AVX2_32xN pp, 64
+ FILTER_VER_CHROMA_AVX2_32xN pp, 48
+ FILTER_VER_CHROMA_AVX2_32xN pp, 32
+ FILTER_VER_CHROMA_AVX2_32xN pp, 24
+ FILTER_VER_CHROMA_AVX2_32xN pp, 16
+ FILTER_VER_CHROMA_AVX2_32xN pp, 8
+ FILTER_VER_CHROMA_AVX2_32xN ps, 64
+ FILTER_VER_CHROMA_AVX2_32xN ps, 48
+ FILTER_VER_CHROMA_AVX2_32xN ps, 32
+ FILTER_VER_CHROMA_AVX2_32xN ps, 24
+ FILTER_VER_CHROMA_AVX2_32xN ps, 16
+ FILTER_VER_CHROMA_AVX2_32xN ps, 8

;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

INIT_XMM sse4
cglobal interp_4tap_vert_pp_%1x%2, 4, 7, 8

-mov r4d, r4m
-sub r0, r1
+ mov r4d, r4m
+ sub r0, r1

%ifdef PIC
-lea r5, [tab_ChromaCoeff]
-movd m0, [r5 + r4 * 4]
+ lea r5, [tab_ChromaCoeff]
+ movd m0, [r5 + r4 * 4]
%else
-movd m0, [tab_ChromaCoeff + r4 * 4]
+ movd m0, [tab_ChromaCoeff + r4 * 4]
%endif

-pshufb m1, m0, [tab_Vm]
-pshufb m0, [tab_Vm + 16]
+ pshufb m1, m0, [tab_Vm]
+ pshufb m0, [tab_Vm + 16]

-mov r4d, %2/2
+ mov r4d, %2/2

.loop:

-mov r6d, %1/16
+ mov r6d, %1/16

.loopW:

-movu m2, [r0]
-movu m3, [r0 + r1]
+ movu m2, [r0]
+ movu m3, [r0 + r1]
+
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3
+
+ pmaddubsw m4, m1
+ pmaddubsw m2, m1
+
+ lea r5, [r0 + 2 * r1]
+ movu m5, [r5]
+ movu m6, [r5 + r1]

-punpcklbw m4, m2, m3
-punpckhbw m2, m3
+ punpckhbw m7, m5, m6
+ pmaddubsw m7, m0
+ paddw m2, m7

-pmaddubsw m4, m1
-pmaddubsw m2, m1
+ punpcklbw m7, m5, m6
+ pmaddubsw m7, m0
+ paddw m4, m7

-lea r5, [r0 + 2 * r1]
-movu m5, [r5]
-movu m6, [r5 + r1]
+ mova m7, [pw_512]

-punpckhbw m7, m5, m6
-pmaddubsw m7, m0
-paddw m2, m7
+ pmulhrsw m4, m7
+ pmulhrsw m2, m7

-punpcklbw m7, m5, m6
-pmaddubsw m7, m0
-paddw m4, m7
+ packuswb m4, m2

-mova m7, [pw_512]
+ movu [r2], m4

-pmulhrsw m4, m7
-pmulhrsw m2, m7
+ punpcklbw m4, m3, m5
+ punpckhbw m3, m5

-packuswb m4, m2
+ pmaddubsw m4, m1
+ pmaddubsw m3, m1

-movu [r2], m4
+ movu m5, [r5 + 2 * r1]

-punpcklbw m4, m3, m5
-punpckhbw m3, m5
+ punpcklbw m2, m6, m5
+ punpckhbw m6, m5

-pmaddubsw m4, m1
-pmaddubsw m3, m1
+ pmaddubsw m2, m0
+ pmaddubsw m6, m0

-movu m5, [r5 + 2 * r1]
+ paddw m4, m2
+ paddw m3, m6

-punpcklbw m2, m6, m5
-punpckhbw m6, m5
+ pmulhrsw m4, m7
+ pmulhrsw m3, m7

-pmaddubsw m2, m0
-pmaddubsw m6, m0
+ packuswb m4, m3

-paddw m4, m2
-paddw m3, m6
+ movu [r2 + r3], m4

-pmulhrsw m4, m7
-pmulhrsw m3, m7
+ add r0, 16
+ add r2, 16
+ dec r6d
+ jnz .loopW

-packuswb m4, m3
+ lea r0, [r0 + r1 * 2 - %1]
+ lea r2, [r2 + r3 * 2 - %1]

-movu [r2 + r3], m4
+ dec r4d
+ jnz .loop
+ RET
+%endmacro

-add r0, 16
-add r2, 16
-dec r6d
-jnz .loopW
+ FILTER_V4_W16n_H2 64, 64
+ FILTER_V4_W16n_H2 64, 32
+ FILTER_V4_W16n_H2 64, 48
+ FILTER_V4_W16n_H2 48, 64
+ FILTER_V4_W16n_H2 64, 16

-lea r0, [r0 + r1 * 2 - %1]
-lea r2, [r2 + r3 * 2 - %1]
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_2xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_2x%1, 3, 4, 3
+ mov r3d, r3m
+ add r3d, r3d

-dec r4d
-jnz .loop
-RET
+ ; load constant
+ mova m1, [pb_128]
+ mova m2, [tab_c_64_n64]
+
+%rep %1/2
+ movd m0, [r0]
+ pinsrd m0, [r0 + r1], 1
+ punpcklbw m0, m1
+ pmaddubsw m0, m2
+
+ movd [r2 + r3 * 0], m0
+ pextrd [r2 + r3 * 1], m0, 2
+
+ lea r0, [r0 + r1 * 2]
+ lea r2, [r2 + r3 * 2]
+%endrep
+ RET
%endmacro
+ P2S_H_2xN 4
+ P2S_H_2xN 8
+ P2S_H_2xN 16

-FILTER_V4_W16n_H2 64, 64
-FILTER_V4_W16n_H2 64, 32
-FILTER_V4_W16n_H2 64, 48
-FILTER_V4_W16n_H2 48, 64
-FILTER_V4_W16n_H2 64, 16
;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
;-----------------------------------------------------------------------------
-%macro PIXEL_WH_4xN 2
-INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+%macro P2S_H_4xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_4x%1, 3, 6, 4
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ lea r5, [r1 * 3]
+
+ ; load constant
+ mova m2, [pb_128]
+ mova m3, [tab_c_64_n64]
+
+%assign x 0
+%rep %1/4
+ movd m0, [r0]
+ pinsrd m0, [r0 + r1], 1
+ punpcklbw m0, m2
+ pmaddubsw m0, m3
+
+ movd m1, [r0 + r1 * 2]
+ pinsrd m1, [r0 + r5], 1
+ punpcklbw m1, m2
+ pmaddubsw m1, m3
+
+ movq [r2 + r3 * 0], m0
+ movq [r2 + r3 * 2], m1
+ movhps [r2 + r3 * 1], m0
+ movhps [r2 + r4], m1
+%assign x x+1
+%if (x != %1/4)
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+%endif
+%endrep
+ RET
+%endmacro
+ P2S_H_4xN 4
+ P2S_H_4xN 8
+ P2S_H_4xN 16
+ P2S_H_4xN 32
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_6xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_6x%1, 3, 7, 6
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+
+ ; load height
+ mov r6d, %1/4

- ; load width and height
- mov r3d, %1
- mov r4d, %2
; load constant
mova m4, [pb_128]
mova m5, [tab_c_64_n64]
-.loopH:
- xor r5d, r5d

-.loopW:
- mov r6, r0
- movh m0, [r6]
+.loop:
+ movh m0, [r0]
punpcklbw m0, m4
pmaddubsw m0, m5

- movh m1, [r6 + r1]
+ movh m1, [r0 + r1]
punpcklbw m1, m4
pmaddubsw m1, m5

- movh m2, [r6 + r1 * 2]
+ movh m2, [r0 + r1 * 2]
punpcklbw m2, m4
pmaddubsw m2, m5

- lea r6, [r6 + r1 * 2]
- movh m3, [r6 + r1]
+ movh m3, [r0 + r4]
punpcklbw m3, m4
pmaddubsw m3, m5

- add r5, 8
- cmp r5, r3
- jg .width4
- movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
- movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
- movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
- movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
- je .nextH
- jmp .loopW
-
-.width4:
- movh [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
- movh [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
- movh [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
- movh [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
+ movh [r2 + r3 * 0], m0
+ pextrd [r2 + r3 * 0 + 8], m0, 2
+ movh [r2 + r3 * 1], m1
+ pextrd [r2 + r3 * 1 + 8], m1, 2
+ movh [r2 + r3 * 2], m2
+ pextrd [r2 + r3 * 2 + 8], m2, 2
+ movh [r2 + r5], m3
+ pextrd [r2 + r5 + 8], m3, 2

-.nextH:
lea r0, [r0 + r1 * 4]
- add r2, FENC_STRIDE * 8
+ lea r2, [r2 + r3 * 4]

- sub r4d, 4
- jnz .loopH
+ dec r6d
+ jnz .loop
RET
%endmacro
-PIXEL_WH_4xN 4, 4
-PIXEL_WH_4xN 4, 8
-PIXEL_WH_4xN 4, 16
+ P2S_H_6xN 8
+ P2S_H_6xN 16

;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
;-----------------------------------------------------------------------------
-%macro PIXEL_WH_8xN 2
+%macro P2S_H_8xN 1
INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+cglobal filterPixelToShort_8x%1, 3, 7, 6
+ mov r3d, r3m
+ add r3d, r3d
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]

- ; load width and height
- mov r3d, %1
- mov r4d, %2
+ ; load height
+ mov r4d, %1/4

; load constant
mova m4, [pb_128]
mova m5, [tab_c_64_n64]

-.loopH
- xor r5d, r5d
-.loopW
- lea r6, [r0 + r5]
-
- movh m0, [r6]
+.loop
+ movh m0, [r0]
punpcklbw m0, m4
pmaddubsw m0, m5

- movh m1, [r6 + r1]
+ movh m1, [r0 + r1]
punpcklbw m1, m4
pmaddubsw m1, m5

- movh m2, [r6 + r1 * 2]
+ movh m2, [r0 + r1 * 2]
punpcklbw m2, m4
pmaddubsw m2, m5

- lea r6, [r6 + r1 * 2]
- movh m3, [r6 + r1]
+ movh m3, [r0 + r5]
punpcklbw m3, m4
pmaddubsw m3, m5

- add r5, 8
- cmp r5, r3
-
- movu [r2 + FENC_STRIDE * 0], m0
- movu [r2 + FENC_STRIDE * 2], m1
- movu [r2 + FENC_STRIDE * 4], m2
- movu [r2 + FENC_STRIDE * 6], m3
-
- je .nextH
- jmp .loopW
-
+ movu [r2 + r3 * 0], m0
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 2], m2
+ movu [r2 + r6 ], m3

-.nextH:
lea r0, [r0 + r1 * 4]
- add r2, FENC_STRIDE * 8
+ lea r2, [r2 + r3 * 4]

- sub r4d, 4
- jnz .loopH
+ dec r4d
+ jnz .loop
RET
%endmacro
-PIXEL_WH_8xN 8, 8
-PIXEL_WH_8xN 8, 4
-PIXEL_WH_8xN 8, 16
-PIXEL_WH_8xN 8, 32
+ P2S_H_8xN 8
+ P2S_H_8xN 4
+ P2S_H_8xN 16
+ P2S_H_8xN 32
+ P2S_H_8xN 12
+ P2S_H_8xN 64
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_8x6, 3, 7, 5
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r1 * 5]
+ lea r6, [r3 * 3]
+
+ ; load constant
+ mova m3, [pb_128]
+ mova m4, [tab_c_64_n64]
+
+ movh m0, [r0]
+ punpcklbw m0, m3
+ pmaddubsw m0, m4
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m3
+ pmaddubsw m1, m4
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4

+ movu [r2 + r3 * 0], m0
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 2], m2
+
+ movh m0, [r0 + r4]
+ punpcklbw m0, m3
+ pmaddubsw m0, m4
+
+ movh m1, [r0 + r1 * 4]
+ punpcklbw m1, m3
+ pmaddubsw m1, m4
+
+ movh m2, [r0 + r5]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4
+
+ movu [r2 + r6 ], m0
+ movu [r2 + r3 * 4], m1
+ lea r2, [r2 + r3 * 4]
+ movu [r2 + r3], m2
+
+ RET

;-----------------------------------------------------------------------------
6202
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
6203
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
6204
;-----------------------------------------------------------------------------
6205
-%macro PIXEL_WH_16xN 2
6206
+%macro P2S_H_16xN 1
6207
INIT_XMM ssse3
6208
-cglobal pixelToShort_%1x%2, 3, 7, 6
6209
+cglobal filterPixelToShort_16x%1, 3, 7, 6
6210
+ mov r3d, r3m
6211
+ add r3d, r3d
6212
+ lea r4, [r3 * 3]
6213
+ lea r5, [r1 * 3]
6214
6215
- ; load width and height
6216
- mov r3d, %1
6217
- mov r4d, %2
6218
+ ; load height
6219
+ mov r6d, %1/4
6220
6221
; load constant
6222
mova m4, [pb_128]
6223
mova m5, [tab_c_64_n64]
6224
6225
-.loopH:
6226
- xor r5d, r5d
6227
-.loopW:
6228
- lea r6, [r0 + r5]
6229
-
6230
- movh m0, [r6]
6231
+.loop:
6232
+ movh m0, [r0]
6233
punpcklbw m0, m4
6234
pmaddubsw m0, m5
6235
6236
- movh m1, [r6 + r1]
6237
+ movh m1, [r0 + r1]
6238
punpcklbw m1, m4
6239
pmaddubsw m1, m5
6240
6241
- movh m2, [r6 + r1 * 2]
6242
+ movh m2, [r0 + r1 * 2]
6243
punpcklbw m2, m4
6244
pmaddubsw m2, m5
6245
6246
- lea r6, [r6 + r1 * 2]
6247
- movh m3, [r6 + r1]
6248
+ movh m3, [r0 + r5]
6249
punpcklbw m3, m4
6250
pmaddubsw m3, m5
6251
6252
- add r5, 8
6253
- cmp r5, r3
6254
+ movu [r2 + r3 * 0], m0
6255
+ movu [r2 + r3 * 1], m1
6256
+ movu [r2 + r3 * 2], m2
6257
+ movu [r2 + r4], m3
6258
6259
- movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
6260
- movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
6261
- movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
6262
- movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
6263
- je .nextH
6264
- jmp .loopW
6265
+ lea r0, [r0 + 8]
6266
6267
+ movh m0, [r0]
6268
+ punpcklbw m0, m4
6269
+ pmaddubsw m0, m5
6270
6271
-.nextH:
6272
- lea r0, [r0 + r1 * 4]
6273
- add r2, FENC_STRIDE * 8
6274
+ movh m1, [r0 + r1]
6275
+ punpcklbw m1, m4
6276
+ pmaddubsw m1, m5
6277
6278
- sub r4d, 4
6279
- jnz .loopH
6280
+ movh m2, [r0 + r1 * 2]
6281
+ punpcklbw m2, m4
6282
+ pmaddubsw m2, m5
6283
+
6284
+ movh m3, [r0 + r5]
6285
+ punpcklbw m3, m4
6286
+ pmaddubsw m3, m5
6287
+
6288
+ movu [r2 + r3 * 0 + 16], m0
6289
+ movu [r2 + r3 * 1 + 16], m1
6290
+ movu [r2 + r3 * 2 + 16], m2
6291
+ movu [r2 + r4 + 16], m3
6292
6293
+ lea r0, [r0 + r1 * 4 - 8]
6294
+ lea r2, [r2 + r3 * 4]
6295
+
6296
+ dec r6d
6297
+ jnz .loop
6298
RET
6299
%endmacro
6300
-PIXEL_WH_16xN 16, 16
6301
-PIXEL_WH_16xN 16, 8
6302
-PIXEL_WH_16xN 16, 4
6303
-PIXEL_WH_16xN 16, 12
6304
-PIXEL_WH_16xN 16, 32
6305
-PIXEL_WH_16xN 16, 64
6306
+ P2S_H_16xN 16
6307
+ P2S_H_16xN 4
6308
+ P2S_H_16xN 8
6309
+ P2S_H_16xN 12
6310
+ P2S_H_16xN 32
6311
+ P2S_H_16xN 64
6312
+ P2S_H_16xN 24
6313
6314
;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
;-----------------------------------------------------------------------------
-%macro PIXEL_WH_32xN 2
+%macro P2S_H_32xN 1
INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+cglobal filterPixelToShort_32x%1, 3, 7, 6
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ lea r5, [r1 * 3]

- ; load width and height
- mov r3d, %1
- mov r4d, %2
+ ; load height
+ mov r6d, %1/4

; load constant
mova m4, [pb_128]
mova m5, [tab_c_64_n64]

-.loopH:
- xor r5d, r5d
-.loopW:
- lea r6, [r0 + r5]
+.loop:
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0], m0
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 2], m2
+ movu [r2 + r4], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5

- movh m0, [r6]
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 16], m0
+ movu [r2 + r3 * 1 + 16], m1
+ movu [r2 + r3 * 2 + 16], m2
+ movu [r2 + r4 + 16], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
punpcklbw m0, m4
pmaddubsw m0, m5

- movh m1, [r6 + r1]
+ movh m1, [r0 + r1]
punpcklbw m1, m4
pmaddubsw m1, m5

- movh m2, [r6 + r1 * 2]
+ movh m2, [r0 + r1 * 2]
punpcklbw m2, m4
pmaddubsw m2, m5

- lea r6, [r6 + r1 * 2]
- movh m3, [r6 + r1]
+ movh m3, [r0 + r5]
punpcklbw m3, m4
pmaddubsw m3, m5

- add r5, 8
- cmp r5, r3
+ movu [r2 + r3 * 0 + 32], m0
+ movu [r2 + r3 * 1 + 32], m1
+ movu [r2 + r3 * 2 + 32], m2
+ movu [r2 + r4 + 32], m3

- movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
- movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
- movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
- movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
- je .nextH
- jmp .loopW
+ lea r0, [r0 + 8]

+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5

-.nextH:
- lea r0, [r0 + r1 * 4]
- add r2, FENC_STRIDE * 8
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5

- sub r4d, 4
- jnz .loopH
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 48], m0
+ movu [r2 + r3 * 1 + 48], m1
+ movu [r2 + r3 * 2 + 48], m2
+ movu [r2 + r4 + 48], m3
+
+ lea r0, [r0 + r1 * 4 - 24]
+ lea r2, [r2 + r3 * 4]

+ dec r6d
+ jnz .loop
RET
%endmacro
-PIXEL_WH_32xN 32, 32
-PIXEL_WH_32xN 32, 8
-PIXEL_WH_32xN 32, 16
-PIXEL_WH_32xN 32, 24
-PIXEL_WH_32xN 32, 64
+ P2S_H_32xN 32
+ P2S_H_32xN 8
+ P2S_H_32xN 16
+ P2S_H_32xN 24
+ P2S_H_32xN 64
+ P2S_H_32xN 48

;-----------------------------------------------------------------------------
-; void pixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
;-----------------------------------------------------------------------------
-%macro PIXEL_WH_64xN 2
+%macro P2S_H_32xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_32x%1, 3, 7, 3
+ mov r3d, r3m
+ add r3d, r3d
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]
+
+ ; load height
+ mov r4d, %1/4
+
+ ; load constant
+ vpbroadcastd m2, [pw_2000]
+
+.loop:
+ pmovzxbw m0, [r0 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + 1 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psubw m0, m2
+ psubw m1, m2
+ movu [r2 + 0 * mmsize], m0
+ movu [r2 + 1 * mmsize], m1
+
+ pmovzxbw m0, [r0 + r1 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 + 1 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psubw m0, m2
+ psubw m1, m2
+ movu [r2 + r3 + 0 * mmsize], m0
+ movu [r2 + r3 + 1 * mmsize], m1
+
+ pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psubw m0, m2
+ psubw m1, m2
+ movu [r2 + r3 * 2 + 0 * mmsize], m0
+ movu [r2 + r3 * 2 + 1 * mmsize], m1
+
+ pmovzxbw m0, [r0 + r5 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r5 + 1 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psubw m0, m2
+ psubw m1, m2
+ movu [r2 + r6 + 0 * mmsize], m0
+ movu [r2 + r6 + 1 * mmsize], m1
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r4d
+ jnz .loop
+ RET
+%endmacro
+ P2S_H_32xN_avx2 32
+ P2S_H_32xN_avx2 8
+ P2S_H_32xN_avx2 16
+ P2S_H_32xN_avx2 24
+ P2S_H_32xN_avx2 64
+ P2S_H_32xN_avx2 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_64xN 1
INIT_XMM ssse3
-cglobal pixelToShort_%1x%2, 3, 7, 6
+cglobal filterPixelToShort_64x%1, 3, 7, 6
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ lea r5, [r1 * 3]

- ; load width and height
- mov r3d, %1
- mov r4d, %2
+ ; load height
+ mov r6d, %1/4

; load constant
mova m4, [pb_128]
mova m5, [tab_c_64_n64]

-.loopH:
- xor r5d, r5d
-.loopW:
- lea r6, [r0 + r5]
+.loop:
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5

- movh m0, [r6]
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0], m0
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 2], m2
+ movu [r2 + r4], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
punpcklbw m0, m4
pmaddubsw m0, m5

- movh m1, [r6 + r1]
+ movh m1, [r0 + r1]
punpcklbw m1, m4
pmaddubsw m1, m5

- movh m2, [r6 + r1 * 2]
+ movh m2, [r0 + r1 * 2]
punpcklbw m2, m4
pmaddubsw m2, m5

- lea r6, [r6 + r1 * 2]
- movh m3, [r6 + r1]
+ movh m3, [r0 + r5]
punpcklbw m3, m4
pmaddubsw m3, m5

- add r5, 8
- cmp r5, r3
+ movu [r2 + r3 * 0 + 16], m0
+ movu [r2 + r3 * 1 + 16], m1
+ movu [r2 + r3 * 2 + 16], m2
+ movu [r2 + r4 + 16], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5

- movu [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
- movu [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
- movu [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
- movu [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
- je .nextH
- jmp .loopW
+ movu [r2 + r3 * 0 + 32], m0
+ movu [r2 + r3 * 1 + 32], m1
+ movu [r2 + r3 * 2 + 32], m2
+ movu [r2 + r4 + 32], m3

+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 48], m0
+ movu [r2 + r3 * 1 + 48], m1
+ movu [r2 + r3 * 2 + 48], m2
+ movu [r2 + r4 + 48], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 64], m0
+ movu [r2 + r3 * 1 + 64], m1
+ movu [r2 + r3 * 2 + 64], m2
+ movu [r2 + r4 + 64], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 80], m0
+ movu [r2 + r3 * 1 + 80], m1
+ movu [r2 + r3 * 2 + 80], m2
+ movu [r2 + r4 + 80], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 96], m0
+ movu [r2 + r3 * 1 + 96], m1
+ movu [r2 + r3 * 2 + 96], m2
+ movu [r2 + r4 + 96], m3
+
+ lea r0, [r0 + 8]
+
+ movh m0, [r0]
+ punpcklbw m0, m4
+ pmaddubsw m0, m5
+
+ movh m1, [r0 + r1]
+ punpcklbw m1, m4
+ pmaddubsw m1, m5
+
+ movh m2, [r0 + r1 * 2]
+ punpcklbw m2, m4
+ pmaddubsw m2, m5
+
+ movh m3, [r0 + r5]
+ punpcklbw m3, m4
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0 + 112], m0
+ movu [r2 + r3 * 1 + 112], m1
+ movu [r2 + r3 * 2 + 112], m2
+ movu [r2 + r4 + 112], m3
+
+ lea r0, [r0 + r1 * 4 - 56]
+ lea r2, [r2 + r3 * 4]
+
+ dec r6d
+ jnz .loop
+ RET
+%endmacro
+ P2S_H_64xN 64
+ P2S_H_64xN 16
+ P2S_H_64xN 32
+ P2S_H_64xN 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_64xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_64x%1, 3, 7, 5
+ mov r3d, r3m
+ add r3d, r3d
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]
+
+ ; load height
+ mov r4d, %1/4
+
+ ; load constant
+ vpbroadcastd m4, [pw_2000]
+
+.loop:
+ pmovzxbw m0, [r0 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + 2 * mmsize/2]
+ pmovzxbw m3, [r0 + 3 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psllw m3, 6
+ psubw m0, m4
+ psubw m1, m4
+ psubw m2, m4
+ psubw m3, m4
+
+ movu [r2 + 0 * mmsize], m0
+ movu [r2 + 1 * mmsize], m1
+ movu [r2 + 2 * mmsize], m2
+ movu [r2 + 3 * mmsize], m3
+
+ pmovzxbw m0, [r0 + r1 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r1 + 2 * mmsize/2]
+ pmovzxbw m3, [r0 + r1 + 3 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psllw m3, 6
+ psubw m0, m4
+ psubw m1, m4
+ psubw m2, m4
+ psubw m3, m4
+
+ movu [r2 + r3 + 0 * mmsize], m0
+ movu [r2 + r3 + 1 * mmsize], m1
+ movu [r2 + r3 + 2 * mmsize], m2
+ movu [r2 + r3 + 3 * mmsize], m3
+
+ pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r1 * 2 + 2 * mmsize/2]
+ pmovzxbw m3, [r0 + r1 * 2 + 3 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psllw m3, 6
+ psubw m0, m4
+ psubw m1, m4
+ psubw m2, m4
+ psubw m3, m4
+
+ movu [r2 + r3 * 2 + 0 * mmsize], m0
+ movu [r2 + r3 * 2 + 1 * mmsize], m1
+ movu [r2 + r3 * 2 + 2 * mmsize], m2
+ movu [r2 + r3 * 2 + 3 * mmsize], m3
+
+ pmovzxbw m0, [r0 + r5 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r5 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r5 + 2 * mmsize/2]
+ pmovzxbw m3, [r0 + r5 + 3 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psllw m3, 6
+ psubw m0, m4
+ psubw m1, m4
+ psubw m2, m4
+ psubw m3, m4
+
+ movu [r2 + r6 + 0 * mmsize], m0
+ movu [r2 + r6 + 1 * mmsize], m1
+ movu [r2 + r6 + 2 * mmsize], m2
+ movu [r2 + r6 + 3 * mmsize], m3

-.nextH:
lea r0, [r0 + r1 * 4]
- add r2, FENC_STRIDE * 8
+ lea r2, [r2 + r3 * 4]

- sub r4d, 4
- jnz .loopH
+ dec r4d
+ jnz .loop
+ RET
+%endmacro
+ P2S_H_64xN_avx2 64
+ P2S_H_64xN_avx2 16
+ P2S_H_64xN_avx2 32
+ P2S_H_64xN_avx2 48
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_12xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_12x%1, 3, 7, 6
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r6, [r3 * 3]
+ mov r5d, %1/4

+ ; load constant
+ mova m4, [pb_128]
+ mova m5, [tab_c_64_n64]
+
+.loop:
+ movu m0, [r0]
+ punpcklbw m1, m0, m4
+ punpckhbw m0, m4
+ pmaddubsw m0, m5
+ pmaddubsw m1, m5
+
+ movu m2, [r0 + r1]
+ punpcklbw m3, m2, m4
+ punpckhbw m2, m4
+ pmaddubsw m2, m5
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 0], m1
+ movu [r2 + r3 * 1], m3
+
+ movh [r2 + r3 * 0 + 16], m0
+ movh [r2 + r3 * 1 + 16], m2
+
+ movu m0, [r0 + r1 * 2]
+ punpcklbw m1, m0, m4
+ punpckhbw m0, m4
+ pmaddubsw m0, m5
+ pmaddubsw m1, m5
+
+ movu m2, [r0 + r4]
+ punpcklbw m3, m2, m4
+ punpckhbw m2, m4
+ pmaddubsw m2, m5
+ pmaddubsw m3, m5
+
+ movu [r2 + r3 * 2], m1
+ movu [r2 + r6], m3
+
+ movh [r2 + r3 * 2 + 16], m0
+ movh [r2 + r6 + 16], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r5d
+ jnz .loop
+ RET
+%endmacro
+ P2S_H_12xN 16
+ P2S_H_12xN 32
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_24xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_24x%1, 3, 7, 5
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+ mov r6d, %1/4
+
+ ; load constant
+ mova m3, [pb_128]
+ mova m4, [tab_c_64_n64]
+
+.loop:
+ movu m0, [r0]
+ punpcklbw m1, m0, m3
+ punpckhbw m0, m3
+ pmaddubsw m0, m4
+ pmaddubsw m1, m4
+
+ movu m2, [r0 + 16]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4
+
+ movu [r2 + r3 * 0], m1
+ movu [r2 + r3 * 0 + 16], m0
+ movu [r2 + r3 * 0 + 32], m2
+
+ movu m0, [r0 + r1]
+ punpcklbw m1, m0, m3
+ punpckhbw m0, m3
+ pmaddubsw m0, m4
+ pmaddubsw m1, m4
+
+ movu m2, [r0 + r1 + 16]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4
+
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 1 + 16], m0
+ movu [r2 + r3 * 1 + 32], m2
+
+ movu m0, [r0 + r1 * 2]
+ punpcklbw m1, m0, m3
+ punpckhbw m0, m3
+ pmaddubsw m0, m4
+ pmaddubsw m1, m4
+
+ movu m2, [r0 + r1 * 2 + 16]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4
+
+ movu [r2 + r3 * 2], m1
+ movu [r2 + r3 * 2 + 16], m0
+ movu [r2 + r3 * 2 + 32], m2
+
+ movu m0, [r0 + r4]
+ punpcklbw m1, m0, m3
+ punpckhbw m0, m3
+ pmaddubsw m0, m4
+ pmaddubsw m1, m4
+
+ movu m2, [r0 + r4 + 16]
+ punpcklbw m2, m3
+ pmaddubsw m2, m4
+ movu [r2 + r5], m1
+ movu [r2 + r5 + 16], m0
+ movu [r2 + r5 + 32], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r6d
+ jnz .loop
RET
%endmacro
-PIXEL_WH_64xN 64, 64
-PIXEL_WH_64xN 64, 16
-PIXEL_WH_64xN 64, 32
-PIXEL_WH_64xN 64, 48
+ P2S_H_24xN 32
+ P2S_H_24xN 64
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_24xN_avx2 1
+INIT_YMM avx2
+cglobal filterPixelToShort_24x%1, 3, 7, 4
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+ mov r6d, %1/4
+
+ ; load constant
+ vpbroadcastd m1, [pw_2000]
+ vpbroadcastd m2, [pb_128]
+ vpbroadcastd m3, [tab_c_64_n64]
+
+.loop:
+ pmovzxbw m0, [r0]
+ psllw m0, 6
+ psubw m0, m1
+ movu [r2], m0
+
+ movu m0, [r0 + mmsize/2]
+ punpcklbw m0, m2
+ pmaddubsw m0, m3
+ movu [r2 + r3 * 0 + mmsize], xm0
+
+ pmovzxbw m0, [r0 + r1]
+ psllw m0, 6
+ psubw m0, m1
+ movu [r2 + r3], m0
+
+ movu m0, [r0 + r1 + mmsize/2]
+ punpcklbw m0, m2
+ pmaddubsw m0, m3
+ movu [r2 + r3 * 1 + mmsize], xm0
+
+ pmovzxbw m0, [r0 + r1 * 2]
+ psllw m0, 6
+ psubw m0, m1
+ movu [r2 + r3 * 2], m0
+
+ movu m0, [r0 + r1 * 2 + mmsize/2]
+ punpcklbw m0, m2
+ pmaddubsw m0, m3
+ movu [r2 + r3 * 2 + mmsize], xm0
+
+ pmovzxbw m0, [r0 + r4]
+ psllw m0, 6
+ psubw m0, m1
+ movu [r2 + r5], m0
+
+ movu m0, [r0 + r4 + mmsize/2]
+ punpcklbw m0, m2
+ pmaddubsw m0, m3
+ movu [r2 + r5 + mmsize], xm0
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r6d
+ jnz .loop
+ RET
+%endmacro
+ P2S_H_24xN_avx2 32
+ P2S_H_24xN_avx2 64
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_48x64, 3, 7, 4
+ mov r3d, r3m
+ add r3d, r3d
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+ mov r6d, 16
+
+ ; load constant
+ mova m2, [pb_128]
+ mova m3, [tab_c_64_n64]
+
+.loop:
+ movu m0, [r0]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 0], m1
+ movu [r2 + r3 * 0 + 16], m0
+
+ movu m0, [r0 + 16]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 0 + 32], m1
+ movu [r2 + r3 * 0 + 48], m0
+
+ movu m0, [r0 + 32]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 0 + 64], m1
+ movu [r2 + r3 * 0 + 80], m0
+
+ movu m0, [r0 + r1]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 1], m1
+ movu [r2 + r3 * 1 + 16], m0
+
+ movu m0, [r0 + r1 + 16]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 1 + 32], m1
+ movu [r2 + r3 * 1 + 48], m0
+
+ movu m0, [r0 + r1 + 32]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 1 + 64], m1
+ movu [r2 + r3 * 1 + 80], m0
+
+ movu m0, [r0 + r1 * 2]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 2], m1
+ movu [r2 + r3 * 2 + 16], m0
+
+ movu m0, [r0 + r1 * 2 + 16]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 2 + 32], m1
+ movu [r2 + r3 * 2 + 48], m0
+
+ movu m0, [r0 + r1 * 2 + 32]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r3 * 2 + 64], m1
+ movu [r2 + r3 * 2 + 80], m0
+
+ movu m0, [r0 + r4]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r5], m1
+ movu [r2 + r5 + 16], m0
+
+ movu m0, [r0 + r4 + 16]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r5 + 32], m1
+ movu [r2 + r5 + 48], m0
+
+ movu m0, [r0 + r4 + 32]
+ punpcklbw m1, m0, m2
+ punpckhbw m0, m2
+ pmaddubsw m0, m3
+ pmaddubsw m1, m3
+
+ movu [r2 + r5 + 64], m1
+ movu [r2 + r5 + 80], m0
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r6d
+ jnz .loop
+ RET
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal filterPixelToShort_48x64, 3,7,4
+ mov r3d, r3m
+ add r3d, r3d
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]
+
+ ; load height
+ mov r4d, 64/4
+
+ ; load constant
+ vpbroadcastd m3, [pw_2000]
+
+ ; unroll(1) only, since that is the best choice for 48x64
+.loop:
+ pmovzxbw m0, [r0 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + 2 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psubw m0, m3
+ psubw m1, m3
+ psubw m2, m3
+ movu [r2 + 0 * mmsize], m0
+ movu [r2 + 1 * mmsize], m1
+ movu [r2 + 2 * mmsize], m2
+
+ pmovzxbw m0, [r0 + r1 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r1 + 2 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psubw m0, m3
+ psubw m1, m3
+ psubw m2, m3
+ movu [r2 + r3 + 0 * mmsize], m0
+ movu [r2 + r3 + 1 * mmsize], m1
+ movu [r2 + r3 + 2 * mmsize], m2
+
+ pmovzxbw m0, [r0 + r1 * 2 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r1 * 2 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r1 * 2 + 2 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psubw m0, m3
+ psubw m1, m3
+ psubw m2, m3
+ movu [r2 + r3 * 2 + 0 * mmsize], m0
+ movu [r2 + r3 * 2 + 1 * mmsize], m1
+ movu [r2 + r3 * 2 + 2 * mmsize], m2
+
+ pmovzxbw m0, [r0 + r5 + 0 * mmsize/2]
+ pmovzxbw m1, [r0 + r5 + 1 * mmsize/2]
+ pmovzxbw m2, [r0 + r5 + 2 * mmsize/2]
+ psllw m0, 6
+ psllw m1, 6
+ psllw m2, 6
+ psubw m0, m3
+ psubw m1, m3
+ psubw m2, m3
+ movu [r2 + r6 + 0 * mmsize], m0
+ movu [r2 + r6 + 1 * mmsize], m1
+ movu [r2 + r6 + 2 * mmsize], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ dec r4d
+ jnz .loop
+ RET
+

%macro PROCESS_LUMA_W4_4R 0
7307
movd m0, [r0]
7308
7309
;-------------------------------------------------------------------------------------------------------------
7310
; void interp_8tap_vert_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7311
;-------------------------------------------------------------------------------------------------------------
7312
-FILTER_VER_LUMA_4xN 4, 4, pp
7313
+ FILTER_VER_LUMA_4xN 4, 4, pp
7314
7315
;-------------------------------------------------------------------------------------------------------------
7316
; void interp_8tap_vert_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7317
;-------------------------------------------------------------------------------------------------------------
7318
-FILTER_VER_LUMA_4xN 4, 8, pp
7319
-FILTER_VER_LUMA_AVX2_4xN 4, 8, pp
7320
+ FILTER_VER_LUMA_4xN 4, 8, pp
7321
+ FILTER_VER_LUMA_AVX2_4xN 4, 8, pp
7322
7323
;-------------------------------------------------------------------------------------------------------------
7324
; void interp_8tap_vert_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7325
;-------------------------------------------------------------------------------------------------------------
7326
-FILTER_VER_LUMA_4xN 4, 16, pp
7327
-FILTER_VER_LUMA_AVX2_4xN 4, 16, pp
7328
+ FILTER_VER_LUMA_4xN 4, 16, pp
7329
+ FILTER_VER_LUMA_AVX2_4xN 4, 16, pp
7330
7331
;-------------------------------------------------------------------------------------------------------------
7332
; void interp_8tap_vert_ps_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7333
;-------------------------------------------------------------------------------------------------------------
7334
-FILTER_VER_LUMA_4xN 4, 4, ps
7335
+ FILTER_VER_LUMA_4xN 4, 4, ps
7336
7337
;-------------------------------------------------------------------------------------------------------------
7338
; void interp_8tap_vert_ps_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7339
;-------------------------------------------------------------------------------------------------------------
7340
-FILTER_VER_LUMA_4xN 4, 8, ps
7341
-FILTER_VER_LUMA_AVX2_4xN 4, 8, ps
7342
+ FILTER_VER_LUMA_4xN 4, 8, ps
7343
+ FILTER_VER_LUMA_AVX2_4xN 4, 8, ps
7344
7345
;-------------------------------------------------------------------------------------------------------------
7346
; void interp_8tap_vert_ps_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7347
;-------------------------------------------------------------------------------------------------------------
7348
-FILTER_VER_LUMA_4xN 4, 16, ps
7349
-FILTER_VER_LUMA_AVX2_4xN 4, 16, ps
7350
+ FILTER_VER_LUMA_4xN 4, 16, ps
7351
+ FILTER_VER_LUMA_AVX2_4xN 4, 16, ps
7352
7353
%macro PROCESS_LUMA_AVX2_W8_8R 0
7354
movq xm1, [r0] ; m1 = row 0
7355
7356
;-------------------------------------------------------------------------------------------------------------
7357
; void interp_8tap_vert_pp_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7358
;-------------------------------------------------------------------------------------------------------------
7359
-FILTER_VER_LUMA_8xN 8, 4, pp
7360
-FILTER_VER_LUMA_AVX2_8x4 pp
7361
+ FILTER_VER_LUMA_8xN 8, 4, pp
7362
+ FILTER_VER_LUMA_AVX2_8x4 pp
7363
7364
;-------------------------------------------------------------------------------------------------------------
7365
; void interp_8tap_vert_pp_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7366
;-------------------------------------------------------------------------------------------------------------
7367
-FILTER_VER_LUMA_8xN 8, 8, pp
7368
-FILTER_VER_LUMA_AVX2_8x8 pp
7369
+ FILTER_VER_LUMA_8xN 8, 8, pp
7370
+ FILTER_VER_LUMA_AVX2_8x8 pp
7371
7372
;-------------------------------------------------------------------------------------------------------------
7373
; void interp_8tap_vert_pp_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7374
;-------------------------------------------------------------------------------------------------------------
7375
-FILTER_VER_LUMA_8xN 8, 16, pp
7376
-FILTER_VER_LUMA_AVX2_8xN 8, 16, pp
7377
+ FILTER_VER_LUMA_8xN 8, 16, pp
7378
+ FILTER_VER_LUMA_AVX2_8xN 8, 16, pp
7379
7380
;-------------------------------------------------------------------------------------------------------------
7381
; void interp_8tap_vert_pp_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7382
;-------------------------------------------------------------------------------------------------------------
7383
-FILTER_VER_LUMA_8xN 8, 32, pp
7384
-FILTER_VER_LUMA_AVX2_8xN 8, 32, pp
7385
+ FILTER_VER_LUMA_8xN 8, 32, pp
7386
+ FILTER_VER_LUMA_AVX2_8xN 8, 32, pp
7387
7388
;-------------------------------------------------------------------------------------------------------------
7389
; void interp_8tap_vert_ps_8x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7390
;-------------------------------------------------------------------------------------------------------------
7391
-FILTER_VER_LUMA_8xN 8, 4, ps
7392
-FILTER_VER_LUMA_AVX2_8x4 ps
7393
+ FILTER_VER_LUMA_8xN 8, 4, ps
7394
+ FILTER_VER_LUMA_AVX2_8x4 ps
7395
7396
;-------------------------------------------------------------------------------------------------------------
7397
; void interp_8tap_vert_ps_8x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7398
;-------------------------------------------------------------------------------------------------------------
7399
-FILTER_VER_LUMA_8xN 8, 8, ps
7400
-FILTER_VER_LUMA_AVX2_8x8 ps
7401
+ FILTER_VER_LUMA_8xN 8, 8, ps
7402
+ FILTER_VER_LUMA_AVX2_8x8 ps
7403
7404
;-------------------------------------------------------------------------------------------------------------
7405
; void interp_8tap_vert_ps_8x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7406
;-------------------------------------------------------------------------------------------------------------
7407
-FILTER_VER_LUMA_8xN 8, 16, ps
7408
-FILTER_VER_LUMA_AVX2_8xN 8, 16, ps
7409
+ FILTER_VER_LUMA_8xN 8, 16, ps
7410
+ FILTER_VER_LUMA_AVX2_8xN 8, 16, ps
7411
7412
;-------------------------------------------------------------------------------------------------------------
7413
; void interp_8tap_vert_ps_8x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7414
;-------------------------------------------------------------------------------------------------------------
7415
-FILTER_VER_LUMA_8xN 8, 32, ps
7416
-FILTER_VER_LUMA_AVX2_8xN 8, 32, ps
7417
+ FILTER_VER_LUMA_8xN 8, 32, ps
7418
+ FILTER_VER_LUMA_AVX2_8xN 8, 32, ps
7419
7420
;-------------------------------------------------------------------------------------------------------------
7421
; void interp_8tap_vert_%3_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7422
7423
7424
lea r5, [8 * r1 - 8]
7425
sub r0, r5
7426
-%ifidn %3,pp
7427
+%ifidn %3,pp
7428
add r2, 8
7429
%else
7430
add r2, 16
7431
7432
;-------------------------------------------------------------------------------------------------------------
7433
; void interp_8tap_vert_pp_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7434
;-------------------------------------------------------------------------------------------------------------
7435
-FILTER_VER_LUMA_12xN 12, 16, pp
7436
+ FILTER_VER_LUMA_12xN 12, 16, pp
7437
7438
;-------------------------------------------------------------------------------------------------------------
7439
; void interp_8tap_vert_ps_12x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
7440
;-------------------------------------------------------------------------------------------------------------
7441
-FILTER_VER_LUMA_12xN 12, 16, ps
7442
+ FILTER_VER_LUMA_12xN 12, 16, ps
7443
7444
%macro FILTER_VER_LUMA_AVX2_12x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_12x16 pp
-FILTER_VER_LUMA_AVX2_12x16 ps
+ FILTER_VER_LUMA_AVX2_12x16 pp
+ FILTER_VER_LUMA_AVX2_12x16 ps

%macro FILTER_VER_LUMA_AVX2_16x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_16x16 pp
-FILTER_VER_LUMA_AVX2_16x16 ps
+ FILTER_VER_LUMA_AVX2_16x16 pp
+ FILTER_VER_LUMA_AVX2_16x16 ps

%macro FILTER_VER_LUMA_AVX2_16x12 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_16x12 pp
-FILTER_VER_LUMA_AVX2_16x12 ps
+ FILTER_VER_LUMA_AVX2_16x12 pp
+ FILTER_VER_LUMA_AVX2_16x12 ps

%macro FILTER_VER_LUMA_AVX2_16x8 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_16x8 pp
-FILTER_VER_LUMA_AVX2_16x8 ps
+ FILTER_VER_LUMA_AVX2_16x8 pp
+ FILTER_VER_LUMA_AVX2_16x8 ps

%macro FILTER_VER_LUMA_AVX2_16x4 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_16x4 pp
-FILTER_VER_LUMA_AVX2_16x4 ps
+ FILTER_VER_LUMA_AVX2_16x4 pp
+ FILTER_VER_LUMA_AVX2_16x4 ps
%macro FILTER_VER_LUMA_AVX2_16xN 3
INIT_YMM avx2
%if ARCH_X86_64 == 1

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_16xN 16, 32, pp
-FILTER_VER_LUMA_AVX2_16xN 16, 64, pp
-FILTER_VER_LUMA_AVX2_16xN 16, 32, ps
-FILTER_VER_LUMA_AVX2_16xN 16, 64, ps
+ FILTER_VER_LUMA_AVX2_16xN 16, 32, pp
+ FILTER_VER_LUMA_AVX2_16xN 16, 64, pp
+ FILTER_VER_LUMA_AVX2_16xN 16, 32, ps
+ FILTER_VER_LUMA_AVX2_16xN 16, 64, ps

%macro PROCESS_LUMA_AVX2_W16_16R 1
    movu        xm0, [r0]                       ; m0 = row 0

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_24x32 pp
-FILTER_VER_LUMA_AVX2_24x32 ps
+ FILTER_VER_LUMA_AVX2_24x32 pp
+ FILTER_VER_LUMA_AVX2_24x32 ps

%macro FILTER_VER_LUMA_AVX2_32xN 3
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_32xN 32, 32, pp
-FILTER_VER_LUMA_AVX2_32xN 32, 64, pp
-FILTER_VER_LUMA_AVX2_32xN 32, 32, ps
-FILTER_VER_LUMA_AVX2_32xN 32, 64, ps
+ FILTER_VER_LUMA_AVX2_32xN 32, 32, pp
+ FILTER_VER_LUMA_AVX2_32xN 32, 64, pp
+ FILTER_VER_LUMA_AVX2_32xN 32, 32, ps
+ FILTER_VER_LUMA_AVX2_32xN 32, 64, ps

%macro FILTER_VER_LUMA_AVX2_32x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_32x16 pp
-FILTER_VER_LUMA_AVX2_32x16 ps
-
+ FILTER_VER_LUMA_AVX2_32x16 pp
+ FILTER_VER_LUMA_AVX2_32x16 ps
+
%macro FILTER_VER_LUMA_AVX2_32x24 1
INIT_YMM avx2
%if ARCH_X86_64 == 1

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_32x24 pp
-FILTER_VER_LUMA_AVX2_32x24 ps
+ FILTER_VER_LUMA_AVX2_32x24 pp
+ FILTER_VER_LUMA_AVX2_32x24 ps

%macro FILTER_VER_LUMA_AVX2_32x8 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_32x8 pp
-FILTER_VER_LUMA_AVX2_32x8 ps
+ FILTER_VER_LUMA_AVX2_32x8 pp
+ FILTER_VER_LUMA_AVX2_32x8 ps

%macro FILTER_VER_LUMA_AVX2_48x64 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_48x64 pp
-FILTER_VER_LUMA_AVX2_48x64 ps
+ FILTER_VER_LUMA_AVX2_48x64 pp
+ FILTER_VER_LUMA_AVX2_48x64 ps

%macro FILTER_VER_LUMA_AVX2_64xN 3
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_64xN 64, 32, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 48, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 64, pp
-FILTER_VER_LUMA_AVX2_64xN 64, 32, ps
-FILTER_VER_LUMA_AVX2_64xN 64, 48, ps
-FILTER_VER_LUMA_AVX2_64xN 64, 64, ps
+ FILTER_VER_LUMA_AVX2_64xN 64, 32, pp
+ FILTER_VER_LUMA_AVX2_64xN 64, 48, pp
+ FILTER_VER_LUMA_AVX2_64xN 64, 64, pp
+ FILTER_VER_LUMA_AVX2_64xN 64, 32, ps
+ FILTER_VER_LUMA_AVX2_64xN 64, 48, ps
+ FILTER_VER_LUMA_AVX2_64xN 64, 64, ps

%macro FILTER_VER_LUMA_AVX2_64x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_64x16 pp
-FILTER_VER_LUMA_AVX2_64x16 ps
+ FILTER_VER_LUMA_AVX2_64x16 pp
+ FILTER_VER_LUMA_AVX2_64x16 ps

;-------------------------------------------------------------------------------------------------------------
; void interp_8tap_vert_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

    RET
%endmacro

-FILTER_VER_LUMA 16, 4, pp
-FILTER_VER_LUMA 16, 8, pp
-FILTER_VER_LUMA 16, 12, pp
-FILTER_VER_LUMA 16, 16, pp
-FILTER_VER_LUMA 16, 32, pp
-FILTER_VER_LUMA 16, 64, pp
-FILTER_VER_LUMA 24, 32, pp
-FILTER_VER_LUMA 32, 8, pp
-FILTER_VER_LUMA 32, 16, pp
-FILTER_VER_LUMA 32, 24, pp
-FILTER_VER_LUMA 32, 32, pp
-FILTER_VER_LUMA 32, 64, pp
-FILTER_VER_LUMA 48, 64, pp
-FILTER_VER_LUMA 64, 16, pp
-FILTER_VER_LUMA 64, 32, pp
-FILTER_VER_LUMA 64, 48, pp
-FILTER_VER_LUMA 64, 64, pp
-
-FILTER_VER_LUMA 16, 4, ps
-FILTER_VER_LUMA 16, 8, ps
-FILTER_VER_LUMA 16, 12, ps
-FILTER_VER_LUMA 16, 16, ps
-FILTER_VER_LUMA 16, 32, ps
-FILTER_VER_LUMA 16, 64, ps
-FILTER_VER_LUMA 24, 32, ps
-FILTER_VER_LUMA 32, 8, ps
-FILTER_VER_LUMA 32, 16, ps
-FILTER_VER_LUMA 32, 24, ps
-FILTER_VER_LUMA 32, 32, ps
-FILTER_VER_LUMA 32, 64, ps
-FILTER_VER_LUMA 48, 64, ps
-FILTER_VER_LUMA 64, 16, ps
-FILTER_VER_LUMA 64, 32, ps
-FILTER_VER_LUMA 64, 48, ps
-FILTER_VER_LUMA 64, 64, ps
+ FILTER_VER_LUMA 16, 4, pp
+ FILTER_VER_LUMA 16, 8, pp
+ FILTER_VER_LUMA 16, 12, pp
+ FILTER_VER_LUMA 16, 16, pp
+ FILTER_VER_LUMA 16, 32, pp
+ FILTER_VER_LUMA 16, 64, pp
+ FILTER_VER_LUMA 24, 32, pp
+ FILTER_VER_LUMA 32, 8, pp
+ FILTER_VER_LUMA 32, 16, pp
+ FILTER_VER_LUMA 32, 24, pp
+ FILTER_VER_LUMA 32, 32, pp
+ FILTER_VER_LUMA 32, 64, pp
+ FILTER_VER_LUMA 48, 64, pp
+ FILTER_VER_LUMA 64, 16, pp
+ FILTER_VER_LUMA 64, 32, pp
+ FILTER_VER_LUMA 64, 48, pp
+ FILTER_VER_LUMA 64, 64, pp
+
+ FILTER_VER_LUMA 16, 4, ps
+ FILTER_VER_LUMA 16, 8, ps
+ FILTER_VER_LUMA 16, 12, ps
+ FILTER_VER_LUMA 16, 16, ps
+ FILTER_VER_LUMA 16, 32, ps
+ FILTER_VER_LUMA 16, 64, ps
+ FILTER_VER_LUMA 24, 32, ps
+ FILTER_VER_LUMA 32, 8, ps
+ FILTER_VER_LUMA 32, 16, ps
+ FILTER_VER_LUMA 32, 24, ps
+ FILTER_VER_LUMA 32, 32, ps
+ FILTER_VER_LUMA 32, 64, ps
+ FILTER_VER_LUMA 48, 64, ps
+ FILTER_VER_LUMA 64, 16, ps
+ FILTER_VER_LUMA 64, 32, ps
+ FILTER_VER_LUMA 64, 48, ps
+ FILTER_VER_LUMA 64, 64, ps

%macro PROCESS_LUMA_SP_W4_4R 0
    movq        m0, [r0]

    lea         r6, [tab_LumaCoeffV + r4]
%endif

-    mova        m7, [tab_c_526336]
+    mova        m7, [pd_526336]

    mov         dword [rsp], %2/4
.loopH:

    FILTER_VER_LUMA_SP 64, 16
    FILTER_VER_LUMA_SP 16, 64

-; TODO: combin of U and V is more performance, but need more register
-; TODO: use two path for height alignment to 4 and otherwise may improvement 10% performance, but code is more complex, so I disable it
-INIT_XMM ssse3
-cglobal chroma_p2s, 3, 7, 4
-
-    ; load width and height
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal filterPixelToShort_4x2, 3, 4, 3
    mov         r3d, r3m
-    mov         r4d, r4m
+    add         r3d, r3d

    ; load constant
-    mova        m2, [pb_128]
-    mova        m3, [tab_c_64_n64]
+    mova        m1, [pb_128]
+    mova        m2, [tab_c_64_n64]

-.loopH:
-
-    xor         r5d, r5d
-.loopW:
-    lea         r6, [r0 + r5]
+    movd        m0, [r0]
+    pinsrd      m0, [r0 + r1], 1
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2

-    movh        m0, [r6]
-    punpcklbw   m0, m2
-    pmaddubsw   m0, m3
+    movq        [r2 + r3 * 0], m0
+    movhps      [r2 + r3 * 1], m0

-    movh        m1, [r6 + r1]
-    punpcklbw   m1, m2
-    pmaddubsw   m1, m3
+    RET

-    add         r5d, 8
-    cmp         r5d, r3d
-    lea         r6, [r2 + r5 * 2]
-    jg          .width4
-    movu        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movu        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
-    je          .nextH
-    jmp         .loopW
-
-.width4:
-    test        r3d, 4
-    jz          .width2
-    test        r3d, 2
-    movh        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movh        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
-    lea         r6, [r6 + 8]
-    pshufd      m0, m0, 2
-    pshufd      m1, m1, 2
-    jz          .nextH
-
-.width2:
-    movd        [r6 + FENC_STRIDE / 2 * 0 - 16], m0
-    movd        [r6 + FENC_STRIDE / 2 * 2 - 16], m1
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_8x2, 3, 4, 3
+    mov         r3d, r3m
+    add         r3d, r3d

-.nextH:
-    lea         r0, [r0 + r1 * 2]
-    add         r2, FENC_STRIDE / 2 * 4
+    ; load constant
+    mova        m1, [pb_128]
+    mova        m2, [tab_c_64_n64]

-    sub         r4d, 2
-    jnz         .loopH
+    movh        m0, [r0]
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2
+    movu        [r2 + r3 * 0], m0
+
+    movh        m0, [r0 + r1]
+    punpcklbw   m0, m1
+    pmaddubsw   m0, m2
+    movu        [r2 + r3 * 1], m0

    RET


    lea         r6, [tab_ChromaCoeffV + r4]
%endif

-    mova        m6, [tab_c_526336]
+    mova        m6, [pd_526336]

    mov         dword [rsp], %2/4


    lea         r5, [tab_ChromaCoeffV + r4]
%endif

-    mova        m5, [tab_c_526336]
+    mova        m5, [pd_526336]

    mov         r4d, (%2/4)


    RET
%endmacro

-FILTER_VER_CHROMA_SP_W2_4R 2, 4
-FILTER_VER_CHROMA_SP_W2_4R 2, 8
+ FILTER_VER_CHROMA_SP_W2_4R 2, 4
+ FILTER_VER_CHROMA_SP_W2_4R 2, 8

-FILTER_VER_CHROMA_SP_W2_4R 2, 16
+ FILTER_VER_CHROMA_SP_W2_4R 2, 16

;--------------------------------------------------------------------------------------------------------------
; void interp_4tap_vert_sp_4x2(int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)

    lea         r5, [tab_ChromaCoeffV + r4]
%endif

-    mova        m4, [tab_c_526336]
+    mova        m4, [pd_526336]

    movq        m0, [r0]
    movq        m1, [r0 + r1]

    lea         r6, [tab_ChromaCoeffV + r4]
%endif

-    mova        m6, [tab_c_526336]
+    mova        m6, [pd_526336]

    mov         r4d, %2/4


    RET
%endmacro

-FILTER_VER_CHROMA_SP_W6_H4 6, 8
+ FILTER_VER_CHROMA_SP_W6_H4 6, 8

-FILTER_VER_CHROMA_SP_W6_H4 6, 16
+ FILTER_VER_CHROMA_SP_W6_H4 6, 16

%macro PROCESS_CHROMA_SP_W8_2R 0
    movu        m1, [r0]

    lea         r5, [tab_ChromaCoeffV + r4]
%endif

-    mova        m7, [tab_c_526336]
+    mova        m7, [pd_526336]

    mov         r4d, %2/2
.loopH:

    RET
%endmacro

-FILTER_VER_CHROMA_SP_W8_H2 8, 2
-FILTER_VER_CHROMA_SP_W8_H2 8, 4
-FILTER_VER_CHROMA_SP_W8_H2 8, 6
-FILTER_VER_CHROMA_SP_W8_H2 8, 8
-FILTER_VER_CHROMA_SP_W8_H2 8, 16
-FILTER_VER_CHROMA_SP_W8_H2 8, 32
+ FILTER_VER_CHROMA_SP_W8_H2 8, 2
+ FILTER_VER_CHROMA_SP_W8_H2 8, 4
+ FILTER_VER_CHROMA_SP_W8_H2 8, 6
+ FILTER_VER_CHROMA_SP_W8_H2 8, 8
+ FILTER_VER_CHROMA_SP_W8_H2 8, 16
+ FILTER_VER_CHROMA_SP_W8_H2 8, 32

-FILTER_VER_CHROMA_SP_W8_H2 8, 12
-FILTER_VER_CHROMA_SP_W8_H2 8, 64
+ FILTER_VER_CHROMA_SP_W8_H2 8, 12
+ FILTER_VER_CHROMA_SP_W8_H2 8, 64


;-----------------------------------------------------------------------------------------------------------------------------

    RET
%endmacro

-FILTER_HORIZ_CHROMA_2xN 2, 4
-FILTER_HORIZ_CHROMA_2xN 2, 8
+ FILTER_HORIZ_CHROMA_2xN 2, 4
+ FILTER_HORIZ_CHROMA_2xN 2, 8

-FILTER_HORIZ_CHROMA_2xN 2, 16
+ FILTER_HORIZ_CHROMA_2xN 2, 16

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_4x%2(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)

    RET
%endmacro

-FILTER_HORIZ_CHROMA_4xN 4, 2
-FILTER_HORIZ_CHROMA_4xN 4, 4
-FILTER_HORIZ_CHROMA_4xN 4, 8
-FILTER_HORIZ_CHROMA_4xN 4, 16
+ FILTER_HORIZ_CHROMA_4xN 4, 2
+ FILTER_HORIZ_CHROMA_4xN 4, 4
+ FILTER_HORIZ_CHROMA_4xN 4, 8
+ FILTER_HORIZ_CHROMA_4xN 4, 16

-FILTER_HORIZ_CHROMA_4xN 4, 32
+ FILTER_HORIZ_CHROMA_4xN 4, 32

%macro PROCESS_CHROMA_W6 3
    movu        %1, [srcq]

    RET
%endmacro

-FILTER_HORIZ_CHROMA 6, 8
-FILTER_HORIZ_CHROMA 12, 16
+ FILTER_HORIZ_CHROMA 6, 8
+ FILTER_HORIZ_CHROMA 12, 16

-FILTER_HORIZ_CHROMA 6, 16
-FILTER_HORIZ_CHROMA 12, 32
+ FILTER_HORIZ_CHROMA 6, 16
+ FILTER_HORIZ_CHROMA 12, 32

%macro PROCESS_CHROMA_W8 3
    movu        %1, [srcq]

    RET
%endmacro

-FILTER_HORIZ_CHROMA_8xN 8, 2
-FILTER_HORIZ_CHROMA_8xN 8, 4
-FILTER_HORIZ_CHROMA_8xN 8, 6
-FILTER_HORIZ_CHROMA_8xN 8, 8
-FILTER_HORIZ_CHROMA_8xN 8, 16
-FILTER_HORIZ_CHROMA_8xN 8, 32
+ FILTER_HORIZ_CHROMA_8xN 8, 2
+ FILTER_HORIZ_CHROMA_8xN 8, 4
+ FILTER_HORIZ_CHROMA_8xN 8, 6
+ FILTER_HORIZ_CHROMA_8xN 8, 8
+ FILTER_HORIZ_CHROMA_8xN 8, 16
+ FILTER_HORIZ_CHROMA_8xN 8, 32

-FILTER_HORIZ_CHROMA_8xN 8, 12
-FILTER_HORIZ_CHROMA_8xN 8, 64
+ FILTER_HORIZ_CHROMA_8xN 8, 12
+ FILTER_HORIZ_CHROMA_8xN 8, 64

%macro PROCESS_CHROMA_W16 4
    movu        %1, [srcq]

    RET
%endmacro

-FILTER_HORIZ_CHROMA_WxN 16, 4
-FILTER_HORIZ_CHROMA_WxN 16, 8
-FILTER_HORIZ_CHROMA_WxN 16, 12
-FILTER_HORIZ_CHROMA_WxN 16, 16
-FILTER_HORIZ_CHROMA_WxN 16, 32
-FILTER_HORIZ_CHROMA_WxN 24, 32
-FILTER_HORIZ_CHROMA_WxN 32, 8
-FILTER_HORIZ_CHROMA_WxN 32, 16
-FILTER_HORIZ_CHROMA_WxN 32, 24
-FILTER_HORIZ_CHROMA_WxN 32, 32
-
-FILTER_HORIZ_CHROMA_WxN 16, 24
-FILTER_HORIZ_CHROMA_WxN 16, 64
-FILTER_HORIZ_CHROMA_WxN 24, 64
-FILTER_HORIZ_CHROMA_WxN 32, 48
-FILTER_HORIZ_CHROMA_WxN 32, 64
-
-FILTER_HORIZ_CHROMA_WxN 64, 64
-FILTER_HORIZ_CHROMA_WxN 64, 32
-FILTER_HORIZ_CHROMA_WxN 64, 48
-FILTER_HORIZ_CHROMA_WxN 48, 64
-FILTER_HORIZ_CHROMA_WxN 64, 16
+ FILTER_HORIZ_CHROMA_WxN 16, 4
+ FILTER_HORIZ_CHROMA_WxN 16, 8
+ FILTER_HORIZ_CHROMA_WxN 16, 12
+ FILTER_HORIZ_CHROMA_WxN 16, 16
+ FILTER_HORIZ_CHROMA_WxN 16, 32
+ FILTER_HORIZ_CHROMA_WxN 24, 32
+ FILTER_HORIZ_CHROMA_WxN 32, 8
+ FILTER_HORIZ_CHROMA_WxN 32, 16
+ FILTER_HORIZ_CHROMA_WxN 32, 24
+ FILTER_HORIZ_CHROMA_WxN 32, 32
+
+ FILTER_HORIZ_CHROMA_WxN 16, 24
+ FILTER_HORIZ_CHROMA_WxN 16, 64
+ FILTER_HORIZ_CHROMA_WxN 24, 64
+ FILTER_HORIZ_CHROMA_WxN 32, 48
+ FILTER_HORIZ_CHROMA_WxN 32, 64
+
+ FILTER_HORIZ_CHROMA_WxN 64, 64
+ FILTER_HORIZ_CHROMA_WxN 64, 32
+ FILTER_HORIZ_CHROMA_WxN 64, 48
+ FILTER_HORIZ_CHROMA_WxN 48, 64
+ FILTER_HORIZ_CHROMA_WxN 64, 16


;---------------------------------------------------------------------------------------------------------------

    RET
%endmacro

-FILTER_V_PS_W16n 64, 64
-FILTER_V_PS_W16n 64, 32
-FILTER_V_PS_W16n 64, 48
-FILTER_V_PS_W16n 48, 64
-FILTER_V_PS_W16n 64, 16
+ FILTER_V_PS_W16n 64, 64
+ FILTER_V_PS_W16n 64, 32
+ FILTER_V_PS_W16n 64, 48
+ FILTER_V_PS_W16n 48, 64
+ FILTER_V_PS_W16n 64, 16


;------------------------------------------------------------------------------------------------------------

    dec         r4d
    jnz         .loop

-RET
+    RET
%endmacro

-FILTER_V_PS_W2 2, 8
+ FILTER_V_PS_W2 2, 8

-FILTER_V_PS_W2 2, 16
+ FILTER_V_PS_W2 2, 16

;-----------------------------------------------------------------------------------------------------------------
; void interp_4tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_4x4 sp
-FILTER_VER_CHROMA_S_AVX2_4x4 ss
+ FILTER_VER_CHROMA_S_AVX2_4x4 sp
+ FILTER_VER_CHROMA_S_AVX2_4x4 ss

%macro FILTER_VER_CHROMA_S_AVX2_4x8 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_4x8 sp
-FILTER_VER_CHROMA_S_AVX2_4x8 ss
+ FILTER_VER_CHROMA_S_AVX2_4x8 sp
+ FILTER_VER_CHROMA_S_AVX2_4x8 ss

%macro PROCESS_CHROMA_AVX2_W4_16R 1
    movq        xm0, [r0]

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_4x16 sp
-FILTER_VER_CHROMA_S_AVX2_4x16 ss
+ FILTER_VER_CHROMA_S_AVX2_4x16 sp
+ FILTER_VER_CHROMA_S_AVX2_4x16 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_4x32 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_4x32, 4, 7, 8
+    mov         r4d, r4m
+    shl         r4d, 6
+    add         r1d, r1d
+    sub         r0, r1
+
+%ifdef PIC
+    lea         r5, [pw_ChromaCoeffV]
+    add         r5, r4
+%else
+    lea         r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea         r4, [r1 * 3]
+%ifidn %1,sp
+    mova        m7, [pd_526336]
+%else
+    add         r3d, r3d
+%endif
+    lea         r6, [r3 * 3]
+%rep 2
+    PROCESS_CHROMA_AVX2_W4_16R %1
+    lea         r2, [r2 + r3 * 4]
+%endrep
+    RET
+%endmacro
+
+ FILTER_VER_CHROMA_S_AVX2_4x32 sp
+ FILTER_VER_CHROMA_S_AVX2_4x32 ss

%macro FILTER_VER_CHROMA_S_AVX2_4x2 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_4x2 sp
-FILTER_VER_CHROMA_S_AVX2_4x2 ss
+ FILTER_VER_CHROMA_S_AVX2_4x2 sp
+ FILTER_VER_CHROMA_S_AVX2_4x2 ss

%macro FILTER_VER_CHROMA_S_AVX2_2x4 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_2x4 sp
-FILTER_VER_CHROMA_S_AVX2_2x4 ss
+ FILTER_VER_CHROMA_S_AVX2_2x4 sp
+ FILTER_VER_CHROMA_S_AVX2_2x4 ss

%macro FILTER_VER_CHROMA_S_AVX2_8x8 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_8x8 sp
-FILTER_VER_CHROMA_S_AVX2_8x8 ss
+ FILTER_VER_CHROMA_S_AVX2_8x8 sp
+ FILTER_VER_CHROMA_S_AVX2_8x8 ss

%macro PROCESS_CHROMA_S_AVX2_W8_16R 1
    movu        xm0, [r0]                       ; m0 = row 0

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16
-FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32
-FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16
-FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 16
+ FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx16 sp, 64
+ FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 16
+ FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx16 ss, 64

%macro FILTER_VER_CHROMA_S_AVX2_NxN 3
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp
-FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp
-FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp
-FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss
-FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss
-FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 16, 32, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 24, 32, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 32, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 48, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 16, 64, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 24, 64, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 32, 64, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, sp
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 64, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 32, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 64, 48, ss
+ FILTER_VER_CHROMA_S_AVX2_NxN 48, 64, ss

%macro PROCESS_CHROMA_S_AVX2_W8_4R 1
    movu        xm0, [r0]                       ; m0 = row 0

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_8x4 sp
-FILTER_VER_CHROMA_S_AVX2_8x4 ss
+ FILTER_VER_CHROMA_S_AVX2_8x4 sp
+ FILTER_VER_CHROMA_S_AVX2_8x4 ss

%macro FILTER_VER_CHROMA_S_AVX2_12x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_12x16 sp
-FILTER_VER_CHROMA_S_AVX2_12x16 ss
+ FILTER_VER_CHROMA_S_AVX2_12x16 sp
+ FILTER_VER_CHROMA_S_AVX2_12x16 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_12x32 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_12x32, 4, 9, 10
+    mov         r4d, r4m
+    shl         r4d, 6
+    add         r1d, r1d
+
+%ifdef PIC
+    lea         r5, [pw_ChromaCoeffV]
+    add         r5, r4
+%else
+    lea         r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea         r4, [r1 * 3]
+    sub         r0, r1
+%ifidn %1, sp
+    mova        m9, [pd_526336]
+%else
+    add         r3d, r3d
+%endif
+    lea         r6, [r3 * 3]
+%rep 2
+    PROCESS_CHROMA_S_AVX2_W8_16R %1
+%ifidn %1, sp
+    add         r2, 8
+%else
+    add         r2, 16
+%endif
+    add         r0, 16
+    mova        m7, m9
+    PROCESS_CHROMA_AVX2_W4_16R %1
+    sub         r0, 16
+%ifidn %1, sp
+    lea         r2, [r2 + r3 * 4 - 8]
+%else
+    lea         r2, [r2 + r3 * 4 - 16]
+%endif
+%endrep
+    RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_S_AVX2_12x32 sp
+ FILTER_VER_CHROMA_S_AVX2_12x32 ss

%macro FILTER_VER_CHROMA_S_AVX2_16x12 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_16x12 sp
-FILTER_VER_CHROMA_S_AVX2_16x12 ss
+ FILTER_VER_CHROMA_S_AVX2_16x12 sp
+ FILTER_VER_CHROMA_S_AVX2_16x12 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_8x12 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_8x12, 4, 7, 9
+    mov         r4d, r4m
+    shl         r4d, 6
+    add         r1d, r1d
+
+%ifdef PIC
+    lea         r5, [pw_ChromaCoeffV]
+    add         r5, r4
+%else
+    lea         r5, [pw_ChromaCoeffV + r4]
+%endif
+
+    lea         r4, [r1 * 3]
+    sub         r0, r1
+%ifidn %1,sp
+    mova        m8, [pd_526336]
+%else
+    add         r3d, r3d
+%endif
+    lea         r6, [r3 * 3]
+    movu        xm0, [r0]                       ; m0 = row 0
+    movu        xm1, [r0 + r1]                  ; m1 = row 1
+    punpckhwd   xm2, xm0, xm1
+    punpcklwd   xm0, xm1
+    vinserti128 m0, m0, xm2, 1
+    pmaddwd     m0, [r5]
+    movu        xm2, [r0 + r1 * 2]              ; m2 = row 2
+    punpckhwd   xm3, xm1, xm2
+    punpcklwd   xm1, xm2
+    vinserti128 m1, m1, xm3, 1
+    pmaddwd     m1, [r5]
+    movu        xm3, [r0 + r4]                  ; m3 = row 3
+    punpckhwd   xm4, xm2, xm3
+    punpcklwd   xm2, xm3
+    vinserti128 m2, m2, xm4, 1
+    pmaddwd     m4, m2, [r5 + 1 * mmsize]
+    paddd       m0, m4
+    pmaddwd     m2, [r5]
+    lea         r0, [r0 + r1 * 4]
+    movu        xm4, [r0]                       ; m4 = row 4
+    punpckhwd   xm5, xm3, xm4
+    punpcklwd   xm3, xm4
+    vinserti128 m3, m3, xm5, 1
+    pmaddwd     m5, m3, [r5 + 1 * mmsize]
+    paddd       m1, m5
+    pmaddwd     m3, [r5]
+%ifidn %1,sp
+    paddd       m0, m8
+    paddd       m1, m8
+    psrad       m0, 12
+    psrad       m1, 12
+%else
+    psrad       m0, 6
+    psrad       m1, 6
+%endif
+    packssdw    m0, m1
+
+    movu        xm5, [r0 + r1]                  ; m5 = row 5
+    punpckhwd   xm6, xm4, xm5
+    punpcklwd   xm4, xm5
+    vinserti128 m4, m4, xm6, 1
+    pmaddwd     m6, m4, [r5 + 1 * mmsize]
+    paddd       m2, m6
+    pmaddwd     m4, [r5]
+    movu        xm6, [r0 + r1 * 2]              ; m6 = row 6
+    punpckhwd   xm1, xm5, xm6
+    punpcklwd   xm5, xm6
+    vinserti128 m5, m5, xm1, 1
+    pmaddwd     m1, m5, [r5 + 1 * mmsize]
+    pmaddwd     m5, [r5]
+    paddd       m3, m1
+%ifidn %1,sp
+    paddd       m2, m8
+    paddd       m3, m8
+    psrad       m2, 12
+    psrad       m3, 12
+%else
+    psrad       m2, 6
+    psrad       m3, 6
+%endif
+    packssdw    m2, m3
+%ifidn %1,sp
+    packuswb    m0, m2
+    mova        m3, [interp8_hps_shuf]
+    vpermd      m0, m3, m0
+    vextracti128 xm2, m0, 1
+    movq        [r2], xm0
+    movhps      [r2 + r3], xm0
+    movq        [r2 + r3 * 2], xm2
+    movhps      [r2 + r6], xm2
+%else
+    vpermq      m0, m0, 11011000b
+    vpermq      m2, m2, 11011000b
+    movu        [r2], xm0
+    vextracti128 xm0, m0, 1
+    vextracti128 xm3, m2, 1
+    movu        [r2 + r3], xm0
+    movu        [r2 + r3 * 2], xm2
+    movu        [r2 + r6], xm3
+%endif
+    lea         r2, [r2 + r3 * 4]
+
+    movu        xm1, [r0 + r4]                  ; m1 = row 7
+    punpckhwd   xm0, xm6, xm1
+    punpcklwd   xm6, xm1
+    vinserti128 m6, m6, xm0, 1
+    pmaddwd     m0, m6, [r5 + 1 * mmsize]
+    pmaddwd     m6, [r5]
+    paddd       m4, m0
+    lea         r0, [r0 + r1 * 4]
+    movu        xm0, [r0]                       ; m0 = row 8
+    punpckhwd   xm2, xm1, xm0
+    punpcklwd   xm1, xm0
+    vinserti128 m1, m1, xm2, 1
+    pmaddwd     m2, m1, [r5 + 1 * mmsize]
+    pmaddwd     m1, [r5]
+    paddd       m5, m2
+%ifidn %1,sp
+    paddd       m4, m8
+    paddd       m5, m8
+    psrad       m4, 12
+    psrad       m5, 12
+%else
+    psrad       m4, 6
+    psrad       m5, 6
+%endif
+    packssdw    m4, m5
+
+    movu        xm2, [r0 + r1]                  ; m2 = row 9
+    punpckhwd   xm5, xm0, xm2
+    punpcklwd   xm0, xm2
+    vinserti128 m0, m0, xm5, 1
+    pmaddwd     m5, m0, [r5 + 1 * mmsize]
+    paddd       m6, m5
+    pmaddwd     m0, [r5]
+    movu        xm5, [r0 + r1 * 2]              ; m5 = row 10
+    punpckhwd   xm7, xm2, xm5
+    punpcklwd   xm2, xm5
+    vinserti128 m2, m2, xm7, 1
+    pmaddwd     m7, m2, [r5 + 1 * mmsize]
+    paddd       m1, m7
+    pmaddwd     m2, [r5]
+
+%ifidn %1,sp
+    paddd       m6, m8
+    paddd       m1, m8
+    psrad       m6, 12
+    psrad       m1, 12
+%else
+    psrad       m6, 6
+    psrad       m1, 6
+%endif
+    packssdw    m6, m1
+%ifidn %1,sp
+    packuswb    m4, m6
+    vpermd      m4, m3, m4
+    vextracti128 xm6, m4, 1
+    movq        [r2], xm4
+    movhps      [r2 + r3], xm4
+    movq        [r2 + r3 * 2], xm6
+    movhps      [r2 + r6], xm6
+%else
+    vpermq      m4, m4, 11011000b
+    vpermq      m6, m6, 11011000b
+    vextracti128 xm7, m4, 1
+    vextracti128 xm1, m6, 1
+    movu        [r2], xm4
+    movu        [r2 + r3], xm7
+    movu        [r2 + r3 * 2], xm6
+    movu        [r2 + r6], xm1
+%endif
+    lea         r2, [r2 + r3 * 4]
+
+    movu        xm7, [r0 + r4]                  ; m7 = row 11
+    punpckhwd   xm1, xm5, xm7
+    punpcklwd   xm5, xm7
+    vinserti128 m5, m5, xm1, 1
+    pmaddwd     m1, m5, [r5 + 1 * mmsize]
+    paddd       m0, m1
+    pmaddwd     m5, [r5]
+    lea         r0, [r0 + r1 * 4]
+    movu        xm1, [r0]                       ; m1 = row 12
+    punpckhwd   xm4, xm7, xm1
+    punpcklwd   xm7, xm1
+    vinserti128 m7, m7, xm4, 1
+    pmaddwd     m4, m7, [r5 + 1 * mmsize]
+    paddd       m2, m4
+    pmaddwd     m7, [r5]
+%ifidn %1,sp
+    paddd       m0, m8
+    paddd       m2, m8
+    psrad       m0, 12
+    psrad       m2, 12
+%else
+    psrad       m0, 6
+    psrad       m2, 6
+%endif
+    packssdw    m0, m2
+
+    movu        xm4, [r0 + r1]                  ; m4 = row 13
+    punpckhwd   xm2, xm1, xm4
+    punpcklwd   xm1, xm4
+    vinserti128 m1, m1, xm2, 1
+    pmaddwd     m1, [r5 + 1 * mmsize]
+    paddd       m5, m1
+    movu        xm2, [r0 + r1 * 2]              ; m2 = row 14
+    punpckhwd   xm6, xm4, xm2
+    punpcklwd   xm4, xm2
+    vinserti128 m4, m4, xm6, 1
+    pmaddwd     m4, [r5 + 1 * mmsize]
+    paddd       m7, m4
+%ifidn %1,sp
+    paddd       m5, m8
+    paddd       m7, m8
+    psrad       m5, 12
+    psrad       m7, 12
+%else
+    psrad       m5, 6
+    psrad       m7, 6
+%endif
+    packssdw    m5, m7
+%ifidn %1,sp
+    packuswb    m0, m5
+    vpermd      m0, m3, m0
+    vextracti128 xm5, m0, 1
+    movq        [r2], xm0
+    movhps      [r2 + r3], xm0
+    movq        [r2 + r3 * 2], xm5
+    movhps      [r2 + r6], xm5
+%else
+    vpermq      m0, m0, 11011000b
+    vpermq      m5, m5, 11011000b
+    vextracti128 xm7, m0, 1
+    vextracti128 xm6, m5, 1
+    movu        [r2], xm0
+    movu        [r2 + r3], xm7
+    movu        [r2 + r3 * 2], xm5
+    movu        [r2 + r6], xm6
+%endif
+    RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_S_AVX2_8x12 sp
+ FILTER_VER_CHROMA_S_AVX2_8x12 ss

%macro FILTER_VER_CHROMA_S_AVX2_16x4 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_16x4 sp
-FILTER_VER_CHROMA_S_AVX2_16x4 ss
+ FILTER_VER_CHROMA_S_AVX2_16x4 sp
+ FILTER_VER_CHROMA_S_AVX2_16x4 ss

%macro PROCESS_CHROMA_S_AVX2_W8_8R 1
    movu        xm0, [r0]                       ; m0 = row 0

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32
-FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16
-FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32
-FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16
+ FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx8 sp, 16
+ FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx8 ss, 16

%macro FILTER_VER_CHROMA_S_AVX2_8x2 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_8x2 sp
-FILTER_VER_CHROMA_S_AVX2_8x2 ss
+ FILTER_VER_CHROMA_S_AVX2_8x2 sp
+ FILTER_VER_CHROMA_S_AVX2_8x2 ss

%macro FILTER_VER_CHROMA_S_AVX2_8x6 1
INIT_YMM avx2

    RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_8x6 sp
-FILTER_VER_CHROMA_S_AVX2_8x6 ss
+ FILTER_VER_CHROMA_S_AVX2_8x6 sp
+ FILTER_VER_CHROMA_S_AVX2_8x6 ss

%macro FILTER_VER_CHROMA_S_AVX2_8xN 2
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_8xN sp, 16
-FILTER_VER_CHROMA_S_AVX2_8xN sp, 32
-FILTER_VER_CHROMA_S_AVX2_8xN ss, 16
-FILTER_VER_CHROMA_S_AVX2_8xN ss, 32
+ FILTER_VER_CHROMA_S_AVX2_8xN sp, 16
+ FILTER_VER_CHROMA_S_AVX2_8xN sp, 32
+ FILTER_VER_CHROMA_S_AVX2_8xN sp, 64
+ FILTER_VER_CHROMA_S_AVX2_8xN ss, 16
+ FILTER_VER_CHROMA_S_AVX2_8xN ss, 32
+ FILTER_VER_CHROMA_S_AVX2_8xN ss, 64

-%macro FILTER_VER_CHROMA_S_AVX2_32x24 1
-INIT_YMM avx2
+%macro FILTER_VER_CHROMA_S_AVX2_Nx24 2
%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_32x24, 4, 10, 10
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_%2x24, 4, 10, 10
    mov         r4d, r4m
    shl         r4d, 6
    add         r1d, r1d

    add         r3d, r3d
%endif
    lea         r6, [r3 * 3]
-    mov         r9d, 4
+    mov         r9d, %2 / 8
.loopW:
    PROCESS_CHROMA_S_AVX2_W8_16R %1
%ifidn %1,sp

    dec         r9d
    jnz         .loopW
%ifidn %1,sp
-    lea         r2, [r8 + r3 * 4 - 24]
+    lea         r2, [r8 + r3 * 4 - %2 + 8]
%else
-    lea         r2, [r8 + r3 * 4 - 48]
+    lea         r2, [r8 + r3 * 4 - 2 * %2 + 16]
%endif
-    lea         r0, [r7 - 48]
+    lea         r0, [r7 - 2 * %2 + 16]
    mova        m7, m9
-    mov         r9d, 4
+    mov         r9d, %2 / 8
.loop:
    PROCESS_CHROMA_S_AVX2_W8_8R %1
%ifidn %1,sp

%endif
%endmacro

-FILTER_VER_CHROMA_S_AVX2_32x24 sp
-FILTER_VER_CHROMA_S_AVX2_32x24 ss
+ FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx24 sp, 16
+ FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 32
+ FILTER_VER_CHROMA_S_AVX2_Nx24 ss, 16

%macro FILTER_VER_CHROMA_S_AVX2_2x8 1
8644
INIT_YMM avx2
8645
8646
RET
8647
%endmacro
8648
8649
-FILTER_VER_CHROMA_S_AVX2_2x8 sp
8650
-FILTER_VER_CHROMA_S_AVX2_2x8 ss
8651
+ FILTER_VER_CHROMA_S_AVX2_2x8 sp
8652
+ FILTER_VER_CHROMA_S_AVX2_2x8 ss
8653
+
8654
+%macro FILTER_VER_CHROMA_S_AVX2_2x16 1
8655
+%if ARCH_X86_64 == 1
8656
+INIT_YMM avx2
8657
+cglobal interp_4tap_vert_%1_2x16, 4, 6, 9
8658
+ mov r4d, r4m
8659
+ shl r4d, 6
8660
+ add r1d, r1d
8661
+ sub r0, r1
8662
+
8663
+%ifdef PIC
8664
+ lea r5, [pw_ChromaCoeffV]
8665
+ add r5, r4
8666
+%else
8667
+ lea r5, [pw_ChromaCoeffV + r4]
8668
+%endif
8669
+
8670
+ lea r4, [r1 * 3]
8671
+%ifidn %1,sp
8672
+ mova m6, [pd_526336]
8673
+%else
8674
+ add r3d, r3d
8675
+%endif
8676
+ movd xm0, [r0]
8677
+ movd xm1, [r0 + r1]
8678
+ punpcklwd xm0, xm1
8679
+ movd xm2, [r0 + r1 * 2]
8680
+ punpcklwd xm1, xm2
8681
+ punpcklqdq xm0, xm1 ; m0 = [2 1 1 0]
8682
+ movd xm3, [r0 + r4]
8683
+ punpcklwd xm2, xm3
8684
+ lea r0, [r0 + 4 * r1]
8685
+ movd xm4, [r0]
8686
+ punpcklwd xm3, xm4
8687
+ punpcklqdq xm2, xm3 ; m2 = [4 3 3 2]
8688
+ vinserti128 m0, m0, xm2, 1 ; m0 = [4 3 3 2 2 1 1 0]
8689
+ movd xm1, [r0 + r1]
8690
+ punpcklwd xm4, xm1
8691
+ movd xm3, [r0 + r1 * 2]
8692
+ punpcklwd xm1, xm3
8693
+ punpcklqdq xm4, xm1 ; m4 = [6 5 5 4]
8694
+ vinserti128 m2, m2, xm4, 1 ; m2 = [6 5 5 4 4 3 3 2]
8695
+ pmaddwd m0, [r5]
8696
+ pmaddwd m2, [r5 + 1 * mmsize]
8697
+ paddd m0, m2
8698
+ movd xm1, [r0 + r4]
8699
+ punpcklwd xm3, xm1
8700
+ lea r0, [r0 + 4 * r1]
8701
+ movd xm2, [r0]
8702
+ punpcklwd xm1, xm2
8703
+ punpcklqdq xm3, xm1 ; m3 = [8 7 7 6]
8704
+ vinserti128 m4, m4, xm3, 1 ; m4 = [8 7 7 6 6 5 5 4]
8705
+ movd xm1, [r0 + r1]
8706
+ punpcklwd xm2, xm1
8707
+ movd xm5, [r0 + r1 * 2]
8708
+ punpcklwd xm1, xm5
8709
+ punpcklqdq xm2, xm1 ; m2 = [10 9 9 8]
8710
+ vinserti128 m3, m3, xm2, 1 ; m3 = [10 9 9 8 8 7 7 6]
8711
+ pmaddwd m4, [r5]
8712
+ pmaddwd m3, [r5 + 1 * mmsize]
8713
+ paddd m4, m3
+ movd xm1, [r0 + r4]
+ punpcklwd xm5, xm1
+ lea r0, [r0 + 4 * r1]
+ movd xm3, [r0]
+ punpcklwd xm1, xm3
+ punpcklqdq xm5, xm1 ; m5 = [12 11 11 10]
+ vinserti128 m2, m2, xm5, 1 ; m2 = [12 11 11 10 10 9 9 8]
+ movd xm1, [r0 + r1]
+ punpcklwd xm3, xm1
+ movd xm7, [r0 + r1 * 2]
+ punpcklwd xm1, xm7
+ punpcklqdq xm3, xm1 ; m3 = [14 13 13 12]
+ vinserti128 m5, m5, xm3, 1 ; m5 = [14 13 13 12 12 11 11 10]
+ pmaddwd m2, [r5]
+ pmaddwd m5, [r5 + 1 * mmsize]
+ paddd m2, m5
+ movd xm5, [r0 + r4]
+ punpcklwd xm7, xm5
+ lea r0, [r0 + 4 * r1]
+ movd xm1, [r0]
+ punpcklwd xm5, xm1
+ punpcklqdq xm7, xm5 ; m7 = [16 15 15 14]
+ vinserti128 m3, m3, xm7, 1 ; m3 = [16 15 15 14 14 13 13 12]
+ movd xm5, [r0 + r1]
+ punpcklwd xm1, xm5
+ movd xm8, [r0 + r1 * 2]
+ punpcklwd xm5, xm8
+ punpcklqdq xm1, xm5 ; m1 = [18 17 17 16]
+ vinserti128 m7, m7, xm1, 1 ; m7 = [18 17 17 16 16 15 15 14]
+ pmaddwd m3, [r5]
+ pmaddwd m7, [r5 + 1 * mmsize]
+ paddd m3, m7
+%ifidn %1,sp
+ paddd m0, m6
+ paddd m4, m6
+ paddd m2, m6
+ paddd m3, m6
+ psrad m0, 12
+ psrad m4, 12
+ psrad m2, 12
+ psrad m3, 12
+%else
+ psrad m0, 6
+ psrad m4, 6
+ psrad m2, 6
+ psrad m3, 6
+%endif
+ packssdw m0, m4
+ packssdw m2, m3
8763
+ lea r4, [r3 * 3]
+%ifidn %1,sp
+ packuswb m0, m2
+ vextracti128 xm2, m0, 1
+ pextrw [r2], xm0, 0
+ pextrw [r2 + r3], xm0, 1
+ pextrw [r2 + 2 * r3], xm2, 0
+ pextrw [r2 + r4], xm2, 1
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm0, 2
+ pextrw [r2 + r3], xm0, 3
+ pextrw [r2 + 2 * r3], xm2, 2
+ pextrw [r2 + r4], xm2, 3
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm0, 4
+ pextrw [r2 + r3], xm0, 5
+ pextrw [r2 + 2 * r3], xm2, 4
+ pextrw [r2 + r4], xm2, 5
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm0, 6
+ pextrw [r2 + r3], xm0, 7
+ pextrw [r2 + 2 * r3], xm2, 6
+ pextrw [r2 + r4], xm2, 7
+%else
+ vextracti128 xm4, m0, 1
+ vextracti128 xm3, m2, 1
+ movd [r2], xm0
+ pextrd [r2 + r3], xm0, 1
+ movd [r2 + 2 * r3], xm4
+ pextrd [r2 + r4], xm4, 1
+ lea r2, [r2 + r3 * 4]
+ pextrd [r2], xm0, 2
+ pextrd [r2 + r3], xm0, 3
+ pextrd [r2 + 2 * r3], xm4, 2
+ pextrd [r2 + r4], xm4, 3
+ lea r2, [r2 + r3 * 4]
+ movd [r2], xm2
+ pextrd [r2 + r3], xm2, 1
+ movd [r2 + 2 * r3], xm3
+ pextrd [r2 + r4], xm3, 1
+ lea r2, [r2 + r3 * 4]
+ pextrd [r2], xm2, 2
+ pextrd [r2 + r3], xm2, 3
+ pextrd [r2 + 2 * r3], xm3, 2
+ pextrd [r2 + r4], xm3, 3
+%endif
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_S_AVX2_2x16 sp
+ FILTER_VER_CHROMA_S_AVX2_2x16 ss
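; Throughout these vertical chroma macros, the sp variant adds pd_526336 and
; shifts right by 12 before clamping to 8-bit pixels, while the ss variant only
; shifts right by 6 and keeps 16-bit samples. A minimal Python sketch of that
; final scaling, as read from the code shown here (the bias decomposition is an
; assumption inferred from the constant's value, not taken from x265 sources):

```python
# 526336 = (8192 << 6) + (1 << 11): a bias that undoes the internal -8192
# sample offset after the 6-bit coefficient gain, plus half of 1 << 12 for
# round-to-nearest before the 12-bit shift (assumption from the value).
PD_526336 = (8192 << 6) + (1 << 11)

def scale_sp(acc):
    # sp path: paddd pd_526336 / psrad 12 / packuswb -> unsigned 8-bit pixel
    return max(0, min(255, (acc + PD_526336) >> 12))

def scale_ss(acc):
    # ss path: psrad 6, result stays a signed 16-bit intermediate sample
    return acc >> 6
```

; This is why the sp builds load pd_526336 into a spare register up front while
; the ss builds skip it and instead double r3d (the stride) for 16-bit output.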

%macro FILTER_VER_CHROMA_S_AVX2_6x8 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_CHROMA_S_AVX2_6x8 sp
-FILTER_VER_CHROMA_S_AVX2_6x8 ss
+ FILTER_VER_CHROMA_S_AVX2_6x8 sp
+ FILTER_VER_CHROMA_S_AVX2_6x8 ss
+
+%macro FILTER_VER_CHROMA_S_AVX2_6x16 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_6x16, 4, 7, 9
+ mov r4d, r4m
+ shl r4d, 6
+ add r1d, r1d
+
+%ifdef PIC
+ lea r5, [pw_ChromaCoeffV]
+ add r5, r4
+%else
+ lea r5, [pw_ChromaCoeffV + r4]
+%endif
+
+ lea r4, [r1 * 3]
+ sub r0, r1
+%ifidn %1,sp
+ mova m8, [pd_526336]
+%else
+ add r3d, r3d
+%endif
+ lea r6, [r3 * 3]
+ movu xm0, [r0] ; m0 = row 0
+ movu xm1, [r0 + r1] ; m1 = row 1
+ punpckhwd xm2, xm0, xm1
+ punpcklwd xm0, xm1
+ vinserti128 m0, m0, xm2, 1
+ pmaddwd m0, [r5]
+ movu xm2, [r0 + r1 * 2] ; m2 = row 2
+ punpckhwd xm3, xm1, xm2
+ punpcklwd xm1, xm2
+ vinserti128 m1, m1, xm3, 1
+ pmaddwd m1, [r5]
+ movu xm3, [r0 + r4] ; m3 = row 3
+ punpckhwd xm4, xm2, xm3
+ punpcklwd xm2, xm3
+ vinserti128 m2, m2, xm4, 1
+ pmaddwd m4, m2, [r5 + 1 * mmsize]
+ paddd m0, m4
+ pmaddwd m2, [r5]
+ lea r0, [r0 + r1 * 4]
+ movu xm4, [r0] ; m4 = row 4
+ punpckhwd xm5, xm3, xm4
+ punpcklwd xm3, xm4
+ vinserti128 m3, m3, xm5, 1
+ pmaddwd m5, m3, [r5 + 1 * mmsize]
+ paddd m1, m5
+ pmaddwd m3, [r5]
+%ifidn %1,sp
+ paddd m0, m8
+ paddd m1, m8
+ psrad m0, 12
+ psrad m1, 12
+%else
+ psrad m0, 6
+ psrad m1, 6
+%endif
+ packssdw m0, m1
+
+ movu xm5, [r0 + r1] ; m5 = row 5
+ punpckhwd xm6, xm4, xm5
+ punpcklwd xm4, xm5
+ vinserti128 m4, m4, xm6, 1
+ pmaddwd m6, m4, [r5 + 1 * mmsize]
+ paddd m2, m6
+ pmaddwd m4, [r5]
+ movu xm6, [r0 + r1 * 2] ; m6 = row 6
+ punpckhwd xm1, xm5, xm6
+ punpcklwd xm5, xm6
+ vinserti128 m5, m5, xm1, 1
+ pmaddwd m1, m5, [r5 + 1 * mmsize]
+ pmaddwd m5, [r5]
+ paddd m3, m1
+%ifidn %1,sp
+ paddd m2, m8
+ paddd m3, m8
+ psrad m2, 12
+ psrad m3, 12
+%else
+ psrad m2, 6
+ psrad m3, 6
+%endif
+ packssdw m2, m3
+%ifidn %1,sp
+ packuswb m0, m2
+ vextracti128 xm2, m0, 1
+ movd [r2], xm0
+ pextrw [r2 + 4], xm2, 0
+ pextrd [r2 + r3], xm0, 1
+ pextrw [r2 + r3 + 4], xm2, 2
+ pextrd [r2 + r3 * 2], xm0, 2
+ pextrw [r2 + r3 * 2 + 4], xm2, 4
+ pextrd [r2 + r6], xm0, 3
+ pextrw [r2 + r6 + 4], xm2, 6
+%else
+ movq [r2], xm0
+ movhps [r2 + r3], xm0
+ movq [r2 + r3 * 2], xm2
+ movhps [r2 + r6], xm2
+ vextracti128 xm0, m0, 1
+ vextracti128 xm3, m2, 1
+ movd [r2 + 8], xm0
+ pextrd [r2 + r3 + 8], xm0, 2
+ movd [r2 + r3 * 2 + 8], xm3
+ pextrd [r2 + r6 + 8], xm3, 2
+%endif
+ lea r2, [r2 + r3 * 4]
+ movu xm1, [r0 + r4] ; m1 = row 7
+ punpckhwd xm0, xm6, xm1
+ punpcklwd xm6, xm1
+ vinserti128 m6, m6, xm0, 1
+ pmaddwd m0, m6, [r5 + 1 * mmsize]
+ pmaddwd m6, [r5]
+ paddd m4, m0
+ lea r0, [r0 + r1 * 4]
+ movu xm0, [r0] ; m0 = row 8
+ punpckhwd xm2, xm1, xm0
+ punpcklwd xm1, xm0
+ vinserti128 m1, m1, xm2, 1
+ pmaddwd m2, m1, [r5 + 1 * mmsize]
+ pmaddwd m1, [r5]
+ paddd m5, m2
+%ifidn %1,sp
+ paddd m4, m8
+ paddd m5, m8
+ psrad m4, 12
+ psrad m5, 12
+%else
+ psrad m4, 6
+ psrad m5, 6
+%endif
+ packssdw m4, m5
+
+ movu xm2, [r0 + r1] ; m2 = row 9
+ punpckhwd xm5, xm0, xm2
+ punpcklwd xm0, xm2
+ vinserti128 m0, m0, xm5, 1
+ pmaddwd m5, m0, [r5 + 1 * mmsize]
+ paddd m6, m5
+ pmaddwd m0, [r5]
+ movu xm5, [r0 + r1 * 2] ; m5 = row 10
+ punpckhwd xm7, xm2, xm5
+ punpcklwd xm2, xm5
+ vinserti128 m2, m2, xm7, 1
+ pmaddwd m7, m2, [r5 + 1 * mmsize]
+ paddd m1, m7
+ pmaddwd m2, [r5]
+
+%ifidn %1,sp
+ paddd m6, m8
+ paddd m1, m8
+ psrad m6, 12
+ psrad m1, 12
+%else
+ psrad m6, 6
+ psrad m1, 6
+%endif
+ packssdw m6, m1
+%ifidn %1,sp
+ packuswb m4, m6
+ vextracti128 xm6, m4, 1
+ movd [r2], xm4
+ pextrw [r2 + 4], xm6, 0
+ pextrd [r2 + r3], xm4, 1
+ pextrw [r2 + r3 + 4], xm6, 2
+ pextrd [r2 + r3 * 2], xm4, 2
+ pextrw [r2 + r3 * 2 + 4], xm6, 4
+ pextrd [r2 + r6], xm4, 3
+ pextrw [r2 + r6 + 4], xm6, 6
+%else
+ movq [r2], xm4
+ movhps [r2 + r3], xm4
+ movq [r2 + r3 * 2], xm6
+ movhps [r2 + r6], xm6
+ vextracti128 xm4, m4, 1
+ vextracti128 xm1, m6, 1
+ movd [r2 + 8], xm4
+ pextrd [r2 + r3 + 8], xm4, 2
+ movd [r2 + r3 * 2 + 8], xm1
+ pextrd [r2 + r6 + 8], xm1, 2
+%endif
+ lea r2, [r2 + r3 * 4]
+ movu xm7, [r0 + r4] ; m7 = row 11
+ punpckhwd xm1, xm5, xm7
+ punpcklwd xm5, xm7
+ vinserti128 m5, m5, xm1, 1
+ pmaddwd m1, m5, [r5 + 1 * mmsize]
+ paddd m0, m1
+ pmaddwd m5, [r5]
+ lea r0, [r0 + r1 * 4]
+ movu xm1, [r0] ; m1 = row 12
+ punpckhwd xm4, xm7, xm1
+ punpcklwd xm7, xm1
+ vinserti128 m7, m7, xm4, 1
+ pmaddwd m4, m7, [r5 + 1 * mmsize]
+ paddd m2, m4
+ pmaddwd m7, [r5]
+%ifidn %1,sp
+ paddd m0, m8
+ paddd m2, m8
+ psrad m0, 12
+ psrad m2, 12
+%else
+ psrad m0, 6
+ psrad m2, 6
+%endif
+ packssdw m0, m2
+
+ movu xm4, [r0 + r1] ; m4 = row 13
+ punpckhwd xm2, xm1, xm4
+ punpcklwd xm1, xm4
+ vinserti128 m1, m1, xm2, 1
+ pmaddwd m2, m1, [r5 + 1 * mmsize]
+ paddd m5, m2
+ pmaddwd m1, [r5]
+ movu xm2, [r0 + r1 * 2] ; m2 = row 14
+ punpckhwd xm6, xm4, xm2
+ punpcklwd xm4, xm2
+ vinserti128 m4, m4, xm6, 1
+ pmaddwd m6, m4, [r5 + 1 * mmsize]
+ paddd m7, m6
+ pmaddwd m4, [r5]
+%ifidn %1,sp
+ paddd m5, m8
+ paddd m7, m8
+ psrad m5, 12
+ psrad m7, 12
+%else
+ psrad m5, 6
+ psrad m7, 6
+%endif
+ packssdw m5, m7
+%ifidn %1,sp
+ packuswb m0, m5
+ vextracti128 xm5, m0, 1
+ movd [r2], xm0
+ pextrw [r2 + 4], xm5, 0
+ pextrd [r2 + r3], xm0, 1
+ pextrw [r2 + r3 + 4], xm5, 2
+ pextrd [r2 + r3 * 2], xm0, 2
+ pextrw [r2 + r3 * 2 + 4], xm5, 4
+ pextrd [r2 + r6], xm0, 3
+ pextrw [r2 + r6 + 4], xm5, 6
+%else
+ movq [r2], xm0
+ movhps [r2 + r3], xm0
+ movq [r2 + r3 * 2], xm5
+ movhps [r2 + r6], xm5
+ vextracti128 xm0, m0, 1
+ vextracti128 xm7, m5, 1
+ movd [r2 + 8], xm0
+ pextrd [r2 + r3 + 8], xm0, 2
+ movd [r2 + r3 * 2 + 8], xm7
+ pextrd [r2 + r6 + 8], xm7, 2
+%endif
+ lea r2, [r2 + r3 * 4]
+
+ movu xm6, [r0 + r4] ; m6 = row 15
+ punpckhwd xm5, xm2, xm6
+ punpcklwd xm2, xm6
+ vinserti128 m2, m2, xm5, 1
+ pmaddwd m5, m2, [r5 + 1 * mmsize]
+ paddd m1, m5
+ pmaddwd m2, [r5]
+ lea r0, [r0 + r1 * 4]
+ movu xm0, [r0] ; m0 = row 16
+ punpckhwd xm5, xm6, xm0
+ punpcklwd xm6, xm0
+ vinserti128 m6, m6, xm5, 1
+ pmaddwd m5, m6, [r5 + 1 * mmsize]
+ paddd m4, m5
+ pmaddwd m6, [r5]
+%ifidn %1,sp
+ paddd m1, m8
+ paddd m4, m8
+ psrad m1, 12
+ psrad m4, 12
+%else
+ psrad m1, 6
+ psrad m4, 6
+%endif
+ packssdw m1, m4
+
+ movu xm5, [r0 + r1] ; m5 = row 17
+ punpckhwd xm4, xm0, xm5
+ punpcklwd xm0, xm5
+ vinserti128 m0, m0, xm4, 1
+ pmaddwd m0, [r5 + 1 * mmsize]
+ paddd m2, m0
+ movu xm4, [r0 + r1 * 2] ; m4 = row 18
+ punpckhwd xm0, xm5, xm4
+ punpcklwd xm5, xm4
+ vinserti128 m5, m5, xm0, 1
+ pmaddwd m5, [r5 + 1 * mmsize]
+ paddd m6, m5
+%ifidn %1,sp
+ paddd m2, m8
+ paddd m6, m8
+ psrad m2, 12
+ psrad m6, 12
+%else
+ psrad m2, 6
+ psrad m6, 6
+%endif
+ packssdw m2, m6
+%ifidn %1,sp
+ packuswb m1, m2
+ vextracti128 xm2, m1, 1
+ movd [r2], xm1
+ pextrw [r2 + 4], xm2, 0
+ pextrd [r2 + r3], xm1, 1
+ pextrw [r2 + r3 + 4], xm2, 2
+ pextrd [r2 + r3 * 2], xm1, 2
+ pextrw [r2 + r3 * 2 + 4], xm2, 4
+ pextrd [r2 + r6], xm1, 3
+ pextrw [r2 + r6 + 4], xm2, 6
+%else
+ movq [r2], xm1
+ movhps [r2 + r3], xm1
+ movq [r2 + r3 * 2], xm2
+ movhps [r2 + r6], xm2
+ vextracti128 xm4, m1, 1
+ vextracti128 xm6, m2, 1
+ movd [r2 + 8], xm4
+ pextrd [r2 + r3 + 8], xm4, 2
+ movd [r2 + r3 * 2 + 8], xm6
+ pextrd [r2 + r6 + 8], xm6, 2
+%endif
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_S_AVX2_6x16 sp
+ FILTER_VER_CHROMA_S_AVX2_6x16 ss

;---------------------------------------------------------------------------------------------------------------------
; void interp_4tap_vertical_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)

RET
%endmacro

-FILTER_VER_CHROMA_SS_W2_4R 2, 4
-FILTER_VER_CHROMA_SS_W2_4R 2, 8
+ FILTER_VER_CHROMA_SS_W2_4R 2, 4
+ FILTER_VER_CHROMA_SS_W2_4R 2, 8

-FILTER_VER_CHROMA_SS_W2_4R 2, 16
+ FILTER_VER_CHROMA_SS_W2_4R 2, 16

;---------------------------------------------------------------------------------------------------------------
; void interp_4tap_vert_ss_4x2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)

RET
%endmacro

-FILTER_VER_CHROMA_SS_W6_H4 6, 8
+ FILTER_VER_CHROMA_SS_W6_H4 6, 8

-FILTER_VER_CHROMA_SS_W6_H4 6, 16
+ FILTER_VER_CHROMA_SS_W6_H4 6, 16

;----------------------------------------------------------------------------------------------------------------

RET
%endmacro

-FILTER_VER_CHROMA_SS_W8_H2 8, 2
-FILTER_VER_CHROMA_SS_W8_H2 8, 4
-FILTER_VER_CHROMA_SS_W8_H2 8, 6
-FILTER_VER_CHROMA_SS_W8_H2 8, 8
-FILTER_VER_CHROMA_SS_W8_H2 8, 16
-FILTER_VER_CHROMA_SS_W8_H2 8, 32
+ FILTER_VER_CHROMA_SS_W8_H2 8, 2
+ FILTER_VER_CHROMA_SS_W8_H2 8, 4
+ FILTER_VER_CHROMA_SS_W8_H2 8, 6
+ FILTER_VER_CHROMA_SS_W8_H2 8, 8
+ FILTER_VER_CHROMA_SS_W8_H2 8, 16
+ FILTER_VER_CHROMA_SS_W8_H2 8, 32

-FILTER_VER_CHROMA_SS_W8_H2 8, 12
-FILTER_VER_CHROMA_SS_W8_H2 8, 64
+ FILTER_VER_CHROMA_SS_W8_H2 8, 12
+ FILTER_VER_CHROMA_SS_W8_H2 8, 64

;-----------------------------------------------------------------------------------------------------------------
; void interp_8tap_vert_ss_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)

RET
%endmacro

-FILTER_VER_LUMA_AVX2_4x4 sp
-FILTER_VER_LUMA_AVX2_4x4 ss
+ FILTER_VER_LUMA_AVX2_4x4 sp
+ FILTER_VER_LUMA_AVX2_4x4 ss

%macro FILTER_VER_LUMA_AVX2_4x8 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_LUMA_AVX2_4x8 sp
-FILTER_VER_LUMA_AVX2_4x8 ss
+ FILTER_VER_LUMA_AVX2_4x8 sp
+ FILTER_VER_LUMA_AVX2_4x8 ss

%macro PROCESS_LUMA_AVX2_W4_16R 1
movq xm0, [r0]

RET
%endmacro

-FILTER_VER_LUMA_AVX2_4x16 sp
-FILTER_VER_LUMA_AVX2_4x16 ss
+ FILTER_VER_LUMA_AVX2_4x16 sp
+ FILTER_VER_LUMA_AVX2_4x16 ss

%macro FILTER_VER_LUMA_S_AVX2_8x8 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_S_AVX2_8x8 sp
-FILTER_VER_LUMA_S_AVX2_8x8 ss
+ FILTER_VER_LUMA_S_AVX2_8x8 sp
+ FILTER_VER_LUMA_S_AVX2_8x8 ss

%macro FILTER_VER_LUMA_S_AVX2_8xN 2
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_S_AVX2_8xN sp, 16
-FILTER_VER_LUMA_S_AVX2_8xN sp, 32
-FILTER_VER_LUMA_S_AVX2_8xN ss, 16
-FILTER_VER_LUMA_S_AVX2_8xN ss, 32
+ FILTER_VER_LUMA_S_AVX2_8xN sp, 16
+ FILTER_VER_LUMA_S_AVX2_8xN sp, 32
+ FILTER_VER_LUMA_S_AVX2_8xN ss, 16
+ FILTER_VER_LUMA_S_AVX2_8xN ss, 32

%macro PROCESS_LUMA_S_AVX2_W8_4R 1
movu xm0, [r0] ; m0 = row 0

RET
%endmacro

-FILTER_VER_LUMA_S_AVX2_8x4 sp
-FILTER_VER_LUMA_S_AVX2_8x4 ss
+ FILTER_VER_LUMA_S_AVX2_8x4 sp
+ FILTER_VER_LUMA_S_AVX2_8x4 ss

%macro PROCESS_LUMA_AVX2_W8_16R 1
movu xm0, [r0] ; m0 = row 0

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_Nx16 sp, 16
-FILTER_VER_LUMA_AVX2_Nx16 sp, 32
-FILTER_VER_LUMA_AVX2_Nx16 sp, 64
-FILTER_VER_LUMA_AVX2_Nx16 ss, 16
-FILTER_VER_LUMA_AVX2_Nx16 ss, 32
-FILTER_VER_LUMA_AVX2_Nx16 ss, 64
+ FILTER_VER_LUMA_AVX2_Nx16 sp, 16
+ FILTER_VER_LUMA_AVX2_Nx16 sp, 32
+ FILTER_VER_LUMA_AVX2_Nx16 sp, 64
+ FILTER_VER_LUMA_AVX2_Nx16 ss, 16
+ FILTER_VER_LUMA_AVX2_Nx16 ss, 32
+ FILTER_VER_LUMA_AVX2_Nx16 ss, 64

%macro FILTER_VER_LUMA_AVX2_NxN 3
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_NxN 16, 32, sp
-FILTER_VER_LUMA_AVX2_NxN 16, 64, sp
-FILTER_VER_LUMA_AVX2_NxN 24, 32, sp
-FILTER_VER_LUMA_AVX2_NxN 32, 32, sp
-FILTER_VER_LUMA_AVX2_NxN 32, 64, sp
-FILTER_VER_LUMA_AVX2_NxN 48, 64, sp
-FILTER_VER_LUMA_AVX2_NxN 64, 32, sp
-FILTER_VER_LUMA_AVX2_NxN 64, 48, sp
-FILTER_VER_LUMA_AVX2_NxN 64, 64, sp
-FILTER_VER_LUMA_AVX2_NxN 16, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 16, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 24, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 32, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 32, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 48, 64, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 32, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 48, ss
-FILTER_VER_LUMA_AVX2_NxN 64, 64, ss
+ FILTER_VER_LUMA_AVX2_NxN 16, 32, sp
+ FILTER_VER_LUMA_AVX2_NxN 16, 64, sp
+ FILTER_VER_LUMA_AVX2_NxN 24, 32, sp
+ FILTER_VER_LUMA_AVX2_NxN 32, 32, sp
+ FILTER_VER_LUMA_AVX2_NxN 32, 64, sp
+ FILTER_VER_LUMA_AVX2_NxN 48, 64, sp
+ FILTER_VER_LUMA_AVX2_NxN 64, 32, sp
+ FILTER_VER_LUMA_AVX2_NxN 64, 48, sp
+ FILTER_VER_LUMA_AVX2_NxN 64, 64, sp
+ FILTER_VER_LUMA_AVX2_NxN 16, 32, ss
+ FILTER_VER_LUMA_AVX2_NxN 16, 64, ss
+ FILTER_VER_LUMA_AVX2_NxN 24, 32, ss
+ FILTER_VER_LUMA_AVX2_NxN 32, 32, ss
+ FILTER_VER_LUMA_AVX2_NxN 32, 64, ss
+ FILTER_VER_LUMA_AVX2_NxN 48, 64, ss
+ FILTER_VER_LUMA_AVX2_NxN 64, 32, ss
+ FILTER_VER_LUMA_AVX2_NxN 64, 48, ss
+ FILTER_VER_LUMA_AVX2_NxN 64, 64, ss

%macro FILTER_VER_LUMA_S_AVX2_12x16 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_S_AVX2_12x16 sp
-FILTER_VER_LUMA_S_AVX2_12x16 ss
+ FILTER_VER_LUMA_S_AVX2_12x16 sp
+ FILTER_VER_LUMA_S_AVX2_12x16 ss

%macro FILTER_VER_LUMA_S_AVX2_16x12 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_S_AVX2_16x12 sp
-FILTER_VER_LUMA_S_AVX2_16x12 ss
+ FILTER_VER_LUMA_S_AVX2_16x12 sp
+ FILTER_VER_LUMA_S_AVX2_16x12 ss

%macro FILTER_VER_LUMA_S_AVX2_16x4 1
INIT_YMM avx2

RET
%endmacro

-FILTER_VER_LUMA_S_AVX2_16x4 sp
-FILTER_VER_LUMA_S_AVX2_16x4 ss
+ FILTER_VER_LUMA_S_AVX2_16x4 sp
+ FILTER_VER_LUMA_S_AVX2_16x4 ss

%macro PROCESS_LUMA_S_AVX2_W8_8R 1
movu xm0, [r0] ; m0 = row 0

%endif
%endmacro

-FILTER_VER_LUMA_AVX2_Nx8 sp, 32
-FILTER_VER_LUMA_AVX2_Nx8 sp, 16
-FILTER_VER_LUMA_AVX2_Nx8 ss, 32
-FILTER_VER_LUMA_AVX2_Nx8 ss, 16
+ FILTER_VER_LUMA_AVX2_Nx8 sp, 32
+ FILTER_VER_LUMA_AVX2_Nx8 sp, 16
+ FILTER_VER_LUMA_AVX2_Nx8 ss, 32
+ FILTER_VER_LUMA_AVX2_Nx8 ss, 16

%macro FILTER_VER_LUMA_S_AVX2_32x24 1
INIT_YMM avx2

%endif
%endmacro

-FILTER_VER_LUMA_S_AVX2_32x24 sp
-FILTER_VER_LUMA_S_AVX2_32x24 ss
+ FILTER_VER_LUMA_S_AVX2_32x24 sp
+ FILTER_VER_LUMA_S_AVX2_32x24 ss

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_32x32, 4,7,6
mov r4d, r4m
mov r5d, r5m

add r0, r1
dec r6d
jnz .loop
- RET
+ RET

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_16x16(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_16x16, 4,7,6
mov r4d, r4m
mov r5d, r5m

add r0, r1
dec r6d
jnz .loop
- RET
+ RET

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_16xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------
%macro IPFILTER_CHROMA_PS_16xN_AVX2 2
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6
mov r4d, r4m
mov r5d, r5m

IPFILTER_CHROMA_PS_16xN_AVX2 16 , 12
IPFILTER_CHROMA_PS_16xN_AVX2 16 , 8
IPFILTER_CHROMA_PS_16xN_AVX2 16 , 4
+ IPFILTER_CHROMA_PS_16xN_AVX2 16 , 24
+ IPFILTER_CHROMA_PS_16xN_AVX2 16 , 64

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_32xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------
%macro IPFILTER_CHROMA_PS_32xN_AVX2 2
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_%1x%2, 4,7,6
mov r4d, r4m
mov r5d, r5m

RET
%endmacro

-IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16
-IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24
-IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8
+ IPFILTER_CHROMA_PS_32xN_AVX2 32 , 16
+ IPFILTER_CHROMA_PS_32xN_AVX2 32 , 24
+ IPFILTER_CHROMA_PS_32xN_AVX2 32 , 8
+ IPFILTER_CHROMA_PS_32xN_AVX2 32 , 64
+ IPFILTER_CHROMA_PS_32xN_AVX2 32 , 48
;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_4x4(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_4x4, 4,7,5
mov r4d, r4m
mov r5d, r5m

lea r2, [r2 + r3 * 2]
movhps [r2], xm3
.end
- RET
+ RET

cglobal interp_4tap_horiz_ps_4x2, 4,7,5
mov r4d, r4m

lea r2, [r2 + r3 * 2]
movhps [r2], xm3
.end
- RET
+ RET

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------;
%macro IPFILTER_CHROMA_PS_4xN_AVX2 2
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_%1x%2, 4,7,5
mov r4d, r4m
mov r5d, r5m

lea r2, [r2 + r3 * 2]
movhps [r2], xm3
.end
-RET
+ RET
%endmacro

IPFILTER_CHROMA_PS_4xN_AVX2 4 , 8

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_8x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_8x8, 4,7,6
mov r4d, r4m
mov r5d, r5m

vpermq m3, m3, 11011000b
movu [r2], xm3
.end
- RET
+ RET

-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_pp_4x2, 4,6,4
mov r4d, r4m
%ifdef PIC

RET
%endmacro

-IPFILTER_CHROMA_PP_32xN_AVX2 32, 16
-IPFILTER_CHROMA_PP_32xN_AVX2 32, 24
-IPFILTER_CHROMA_PP_32xN_AVX2 32, 8
+ IPFILTER_CHROMA_PP_32xN_AVX2 32, 16
+ IPFILTER_CHROMA_PP_32xN_AVX2 32, 24
+ IPFILTER_CHROMA_PP_32xN_AVX2 32, 8
+ IPFILTER_CHROMA_PP_32xN_AVX2 32, 64
+ IPFILTER_CHROMA_PP_32xN_AVX2 32, 48

;-------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_pp_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx

RET
%endmacro

-IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16
-IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32
-IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4
+ IPFILTER_CHROMA_PP_8xN_AVX2 8 , 16
+ IPFILTER_CHROMA_PP_8xN_AVX2 8 , 32
+ IPFILTER_CHROMA_PP_8xN_AVX2 8 , 4
+ IPFILTER_CHROMA_PP_8xN_AVX2 8 , 64
+ IPFILTER_CHROMA_PP_8xN_AVX2 8 , 12

;-------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_pp_4xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
;-------------------------------------------------------------------------------------------------------------
%macro IPFILTER_CHROMA_PP_4xN_AVX2 2
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_pp_%1x%2, 4,6,6
mov r4d, r4m

RET
%endmacro

-IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8
-IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16
+ IPFILTER_CHROMA_PP_4xN_AVX2 4 , 8
+ IPFILTER_CHROMA_PP_4xN_AVX2 4 , 16

%macro IPFILTER_LUMA_PS_32xN_AVX2 2
INIT_YMM avx2

RET
%endmacro

-IPFILTER_LUMA_PS_32xN_AVX2 32 , 32
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 16
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 24
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 8
-IPFILTER_LUMA_PS_32xN_AVX2 32 , 64
+ IPFILTER_LUMA_PS_32xN_AVX2 32 , 32
+ IPFILTER_LUMA_PS_32xN_AVX2 32 , 16
+ IPFILTER_LUMA_PS_32xN_AVX2 32 , 24
+ IPFILTER_LUMA_PS_32xN_AVX2 32 , 8
+ IPFILTER_LUMA_PS_32xN_AVX2 32 , 64

INIT_YMM avx2
cglobal interp_8tap_horiz_ps_48x64, 4, 7, 8

RET
%endmacro

-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12
-IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 8
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 32
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 12
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 4
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 64
+ IPFILTER_CHROMA_PP_16xN_AVX2 16 , 24

%macro IPFILTER_LUMA_PS_64xN_AVX2 1
INIT_YMM avx2

RET
%endmacro

-IPFILTER_LUMA_PS_64xN_AVX2 64
-IPFILTER_LUMA_PS_64xN_AVX2 48
-IPFILTER_LUMA_PS_64xN_AVX2 32
-IPFILTER_LUMA_PS_64xN_AVX2 16
+ IPFILTER_LUMA_PS_64xN_AVX2 64
+ IPFILTER_LUMA_PS_64xN_AVX2 48
+ IPFILTER_LUMA_PS_64xN_AVX2 32
+ IPFILTER_LUMA_PS_64xN_AVX2 16

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_8xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------
%macro IPFILTER_CHROMA_PS_8xN_AVX2 1
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_8x%1, 4,7,6
mov r4d, r4m
mov r5d, r5m

vpermq m3, m3, 11011000b
movu [r2], xm3
.end
- RET
+ RET
%endmacro

IPFILTER_CHROMA_PS_8xN_AVX2 2

IPFILTER_CHROMA_PS_8xN_AVX2 16
IPFILTER_CHROMA_PS_8xN_AVX2 6
IPFILTER_CHROMA_PS_8xN_AVX2 4
+ IPFILTER_CHROMA_PS_8xN_AVX2 12
+ IPFILTER_CHROMA_PS_8xN_AVX2 64

INIT_YMM avx2
cglobal interp_4tap_horiz_ps_2x4, 4, 7, 3

movhps xm2, [r0 + r6]

vinserti128 m1, m1, xm2, 1
- pshufb m1, [interp4_hps_shuf]
+ pshufb m1, [interp4_hpp_shuf]
pmaddubsw m1, m0
pmaddwd m1, [pw_1]
vextracti128 xm2, m1, 1

movhps xm1, [r0 + r1]
movq xm2, [r0 + r1 * 2]
vinserti128 m1, m1, xm2, 1
- pshufb m1, [interp4_hps_shuf]
+ pshufb m1, [interp4_hpp_shuf]
pmaddubsw m1, m0
pmaddwd m1, [pw_1]
vextracti128 xm2, m1, 1

sub r0, r1

.label
- mova m4, [interp4_hps_shuf]
+ mova m4, [interp4_hpp_shuf]
mova m5, [pw_1]
dec r0
lea r4, [r1 * 3]

;-----------------------------------------------------------------------------------------------------------------------------
; void interp_4tap_horiz_ps_6x8(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
;-----------------------------------------------------------------------------------------------------------------------------;
-INIT_YMM avx2
+INIT_YMM avx2
cglobal interp_4tap_horiz_ps_6x8, 4,7,6
mov r4d, r4m
mov r5d, r5m

movd [r2+8], xm4
.end
RET
+
+INIT_YMM avx2
+cglobal interp_8tap_horiz_ps_12x16, 6, 7, 8
+ mov r5d, r5m
+ mov r4d, r4m
+%ifdef PIC
+ lea r6, [tab_LumaCoeff]
+ vpbroadcastq m0, [r6 + r4 * 8]
+%else
+ vpbroadcastq m0, [tab_LumaCoeff + r4 * 8]
+%endif
+ mova m6, [tab_Lm + 32]
+ mova m1, [tab_Lm]
+ add r3d, r3d
+ vbroadcasti128 m2, [pw_2000]
+ mov r4d, 16
+ vbroadcasti128 m7, [pw_1]
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - pw_2000
+
+ mova m5, [interp8_hps_shuf]
+ sub r0, 3
+ test r5d, r5d
+ jz .loop
+ lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride
+ sub r0, r6 ; r0(src)-r6
+ add r4d, 7
+.loop
+
+ ; Row 0
+
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m3, m6
+ pshufb m3, m1 ; shuffled based on the col order tab_Lm
+ pmaddubsw m3, m0
+ pmaddubsw m4, m0
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m7
+ packssdw m4, m4
+
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+
+ vpermd m3, m5, m3
+ psubw m3, m2
+
+ vextracti128 xm4, m3, 1
+ movu [r2], xm3 ;row 0
+ movq [r2 + 16], xm4 ;row 1
+
+ add r0, r1
+ add r2, r3
+ dec r4d
+ jnz .loop
+ RET
+
+INIT_YMM avx2
+cglobal interp_8tap_horiz_ps_24x32, 4, 7, 8
+ mov r5d, r5m
+ mov r4d, r4m
+%ifdef PIC
+ lea r6, [tab_LumaCoeff]
+ vpbroadcastq m0, [r6 + r4 * 8]
+%else
+ vpbroadcastq m0, [tab_LumaCoeff + r4 * 8]
+%endif
+ mova m6, [tab_Lm + 32]
+ mova m1, [tab_Lm]
+ mov r4d, 32 ;height
+ add r3d, r3d
+ vbroadcasti128 m2, [pw_2000]
+ vbroadcasti128 m7, [pw_1]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 , m6 - shuffle order table
+ ; m2 - pw_2000
+
+ sub r0, 3
+ test r5d, r5d
+ jz .label
+ lea r6, [r1 * 3] ; r6 = (N / 2 - 1) * srcStride
+ sub r0, r6 ; r0(src)-r6
+ add r4d, 7 ; blkheight += N - 1 (7 - 1 = 6 ; since the last one row not in loop)
+
+.label
+ lea r6, [interp8_hps_shuf]
+.loop
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m3, m6 ; row 0 (col 4 to 7)
+ pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)
+ pmaddubsw m3, m0
+ pmaddubsw m4, m0
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m5, m4, m6 ;row 1 (col 4 to 7)
+ pshufb m4, m1 ;row 1 (col 0 to 3)
+ pmaddubsw m4, m0
+ pmaddubsw m5, m0
+ pmaddwd m4, m7
+ pmaddwd m5, m7
+ packssdw m4, m5
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+ mova m5, [r6]
+ vpermd m3, m5, m3
+ psubw m3, m2
+ movu [r2], m3 ;row 0
+
+ vbroadcasti128 m3, [r0 + 16]
+ pshufb m4, m3, m6
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddubsw m4, m0
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+ pmaddwd m3, m7
+ pmaddwd m4, m7
+ packssdw m3, m4
+ mova m4, [r6]
+ vpermd m3, m4, m3
+ psubw m3, m2
+ movu [r2 + 32], xm3 ;row 0
+
+ add r0, r1
+ add r2, r3
+ dec r4d
+ jnz .loop
+ RET
+
+;-----------------------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_ps_24x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_24x32, 4,7,6
+ mov r4d, r4m
+ mov r5d, r5m
+ add r3d, r3d
+%ifdef PIC
+ lea r6, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r6 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+ vbroadcasti128 m2, [pw_1]
+ vbroadcasti128 m5, [pw_2000]
+ mova m1, [tab_Tm]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+ mov r6d, 32
+ dec r0
+ test r5d, r5d
+ je .loop
+ sub r0 , r1
+ add r6d , 3
+
+.loop
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2], m3
+
+ vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ packssdw m3, m3
+ psubw m3, m5
+ vpermq m3, m3, 11011000b
+ movu [r2 + 32], xm3
+
+ add r2, r3
+ add r0, r1
+ dec r6d
+ jnz .loop
+ RET
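The `_ps` kernels above all follow the same pattern: a 4-tap horizontal convolution whose intermediate result is kept at 16-bit precision after subtracting the `pw_2000` (8192) offset. As a reading aid, here is a hedged scalar C sketch of that operation; the function name, parameter layout, and the inline offset constant are illustrative and mirror the assembly, not the exact x265 C reference implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of a 4-tap horizontal "ps" (pixel-to-short) chroma filter:
 * convolve each output sample with 4 signed coefficients, then subtract the
 * 8192 offset (the pw_2000 constant in the asm) so the value fits a signed
 * 16-bit intermediate. The `dec r0` in the asm corresponds to the x+k-1
 * indexing here (taps centered one pixel to the left). */
static void interp_4tap_horiz_ps_scalar(const uint8_t *src, intptr_t srcStride,
                                        int16_t *dst, intptr_t dstStride,
                                        const int8_t coeff[4],
                                        int width, int height)
{
    const int offset = 8192; /* matches psubw m3, m5 with m5 = pw_2000 */
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 4; k++)
                sum += coeff[k] * src[x + k - 1];
            dst[x] = (int16_t)(sum - offset);
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With the identity coefficient set `{0, 64, 0, 0}` each output is simply `64 * src[x] - 8192`, which makes the offset's role in the intermediate format easy to see.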
+
+;-----------------------------------------------------------------------------------------------------------------------
+; macro FILTER_H8_W8_16N_AVX2
+;-----------------------------------------------------------------------------------------------------------------------
+%macro FILTER_H8_W8_16N_AVX2 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m3, m6 ; row 0 (col 4 to 7)
+ pshufb m3, m1 ; shuffled based on the col order tab_Lm row 0 (col 0 to 3)
+ pmaddubsw m3, m0
+ pmaddubsw m4, m0
+ pmaddwd m3, m2
+ pmaddwd m4, m2
+ packssdw m3, m4 ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]
+
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m5, m4, m6 ;row 1 (col 4 to 7)
+ pshufb m4, m1 ;row 1 (col 0 to 3)
+ pmaddubsw m4, m0
+ pmaddubsw m5, m0
+ pmaddwd m4, m2
+ pmaddwd m5, m2
+ packssdw m4, m5 ; DWORD [R3D R3C R2D R2C R3B R3A R2B R2A]
+
+ pmaddwd m3, m2
+ pmaddwd m4, m2
+ packssdw m3, m4 ; all rows and cols completed
+
+ mova m5, [interp8_hps_shuf]
+ vpermd m3, m5, m3
+ psubw m3, m8
+
+ vextracti128 xm4, m3, 1
+ mova [r4], xm3
+ mova [r4 + 16], xm4
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void interp_8tap_hv_pp_16x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+%if ARCH_X86_64 == 1
+cglobal interp_8tap_hv_pp_16x16, 4, 10, 15, 0-31*32
+%define stk_buf1 rsp
+ mov r4d, r4m
+ mov r5d, r5m
+%ifdef PIC
+ lea r6, [tab_LumaCoeff]
+ vpbroadcastq m0, [r6 + r4 * 8]
+%else
+ vpbroadcastq m0, [tab_LumaCoeff + r4 * 8]
+%endif
+
+ xor r6, r6
+ mov r4, rsp
+ mova m6, [tab_Lm + 32]
+ mova m1, [tab_Lm]
+ mov r8, 16 ;height
+ vbroadcasti128 m8, [pw_2000]
+ vbroadcasti128 m2, [pw_1]
+ sub r0, 3
+ lea r7, [r1 * 3] ; r7 = (N / 2 - 1) * srcStride
+ sub r0, r7 ; r0(src)-r7
+ add r8, 7
+
+.loopH:
+ FILTER_H8_W8_16N_AVX2
+ add r0, r1
+ add r4, 32
+ inc r6
+ cmp r6, 16+7
+ jnz .loopH
+
+; vertical phase
+ xor r6, r6
+ xor r1, r1
+.loopV:
+
+;load necessary variables
+ mov r4d, r5d ;coeff here for vertical is r5m
+ shl r4d, 7
+ mov r1d, 16
+ add r1d, r1d
+
+ ; load intermediate buffer
+ mov r0, stk_buf1
+
+ ; register mapping
+ ; r0 - src
+ ; r5 - coeff
+ ; r6 - loop_i
+
+; load coeff table
+%ifdef PIC
+ lea r5, [pw_LumaCoeffVer]
+ add r5, r4
+%else
+ lea r5, [pw_LumaCoeffVer + r4]
+%endif
+
+ lea r4, [r1*3]
+ mova m14, [pd_526336]
+ lea r6, [r3 * 3]
+ mov r9d, 16 / 8
+
+.loopW:
+ PROCESS_LUMA_AVX2_W8_16R sp
+ add r2, 8
+ add r0, 16
+ dec r9d
+ jnz .loopW
+ RET
+%endif
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_12x32, 4, 6, 7
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r5 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ mova m6, [pw_512]
+ mova m1, [interp4_horiz_shuf1]
+ vpbroadcastd m2, [pw_1]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+
+ dec r0
+ mov r4d, 16
+
+.loop:
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ pmulhrsw m3, m6
+
+ ; Row 1
+ vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ vbroadcasti128 m5, [r0 + r1 + 4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m5, m1
+ pmaddubsw m5, m0
+ pmaddwd m5, m2
+ packssdw m4, m5
+ pmulhrsw m4, m6
+
+ packuswb m3, m4
+ vpermq m3, m3, 11011000b
+
+ vextracti128 xm4, m3, 1
+ movq [r2], xm3
+ pextrd [r2+8], xm3, 2
+ movq [r2 + r3], xm4
+ pextrd [r2 + r3 + 8], xm4, 2
+ lea r2, [r2 + r3 * 2]
+ lea r0, [r0 + r1 * 2]
+ dec r4d
+ jnz .loop
+ RET
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_24x64, 4,6,7
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r5 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ mova m1, [interp4_horiz_shuf1]
+ vpbroadcastd m2, [pw_1]
+ mova m6, [pw_512]
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+
+ dec r0
+ mov r4d, 64
+
+.loop:
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 4]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ pmulhrsw m3, m6
+
+ vbroadcasti128 m4, [r0 + 16]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ vbroadcasti128 m5, [r0 + 20]
+ pshufb m5, m1
+ pmaddubsw m5, m0
+ pmaddwd m5, m2
+ packssdw m4, m5
+ pmulhrsw m4, m6
+
+ packuswb m3, m4
+ vpermq m3, m3, 11011000b
+
+ vextracti128 xm4, m3, 1
+ movu [r2], xm3
+ movq [r2 + 16], xm4
+ add r2, r3
+ add r0, r1
+ dec r4d
+ jnz .loop
+ RET
+
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r5 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ mova m4, [interp4_hpp_shuf]
+ mova m5, [pw_1]
+ dec r0
+ lea r4, [r1 * 3]
+ movq xm1, [r0]
+ movhps xm1, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m1, m1, xm2, 1
+ lea r0, [r0 + r1 * 4]
+ movq xm3, [r0]
+ movhps xm3, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m3, m3, xm2, 1
+
+ pshufb m1, m4
+ pshufb m3, m4
+ pmaddubsw m1, m0
+ pmaddubsw m3, m0
+ pmaddwd m1, m5
+ pmaddwd m3, m5
+ packssdw m1, m3
+ pmulhrsw m1, [pw_512]
+ vextracti128 xm2, m1, 1
+ packuswb xm1, xm2
+
+ lea r4, [r3 * 3]
+ pextrw [r2], xm1, 0
+ pextrw [r2 + r3], xm1, 1
+ pextrw [r2 + r3 * 2], xm1, 4
+ pextrw [r2 + r4], xm1, 5
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm1, 2
+ pextrw [r2 + r3], xm1, 3
+ pextrw [r2 + r3 * 2], xm1, 6
+ pextrw [r2 + r4], xm1, 7
+ lea r2, [r2 + r3 * 4]
+ lea r0, [r0 + r1 * 4]
+
+ lea r4, [r1 * 3]
+ movq xm1, [r0]
+ movhps xm1, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m1, m1, xm2, 1
+ lea r0, [r0 + r1 * 4]
+ movq xm3, [r0]
+ movhps xm3, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m3, m3, xm2, 1
+
+ pshufb m1, m4
+ pshufb m3, m4
+ pmaddubsw m1, m0
+ pmaddubsw m3, m0
+ pmaddwd m1, m5
+ pmaddwd m3, m5
+ packssdw m1, m3
+ pmulhrsw m1, [pw_512]
+ vextracti128 xm2, m1, 1
+ packuswb xm1, xm2
+
+ lea r4, [r3 * 3]
+ pextrw [r2], xm1, 0
+ pextrw [r2 + r3], xm1, 1
+ pextrw [r2 + r3 * 2], xm1, 4
+ pextrw [r2 + r4], xm1, 5
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm1, 2
+ pextrw [r2 + r3], xm1, 3
+ pextrw [r2 + r3 * 2], xm1, 6
+ pextrw [r2 + r4], xm1, 7
+ RET
+
+;-------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
+;-------------------------------------------------------------------------------------------------------------
+%macro IPFILTER_CHROMA_PP_64xN_AVX2 1
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_64x%1, 4,6,7
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r5 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ mova m1, [interp4_horiz_shuf1]
+ vpbroadcastd m2, [pw_1]
+ mova m6, [pw_512]
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+
+ dec r0
+ mov r4d, %1
+
+.loop:
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 4]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ pmulhrsw m3, m6
+
+ vbroadcasti128 m4, [r0 + 16]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ vbroadcasti128 m5, [r0 + 20]
+ pshufb m5, m1
+ pmaddubsw m5, m0
+ pmaddwd m5, m2
+ packssdw m4, m5
+ pmulhrsw m4, m6
+ packuswb m3, m4
+ vpermq m3, m3, 11011000b
+ movu [r2], m3
+
+ vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 36]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ pmulhrsw m3, m6
+
+ vbroadcasti128 m4, [r0 + 48]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ vbroadcasti128 m5, [r0 + 52]
+ pshufb m5, m1
+ pmaddubsw m5, m0
+ pmaddwd m5, m2
+ packssdw m4, m5
+ pmulhrsw m4, m6
+ packuswb m3, m4
+ vpermq m3, m3, 11011000b
+ movu [r2 + 32], m3
+
+ add r2, r3
+ add r0, r1
+ dec r4d
+ jnz .loop
+ RET
+%endmacro
+
+ IPFILTER_CHROMA_PP_64xN_AVX2 64
+ IPFILTER_CHROMA_PP_64xN_AVX2 32
+ IPFILTER_CHROMA_PP_64xN_AVX2 48
+ IPFILTER_CHROMA_PP_64xN_AVX2 16
+
10310
+;-------------------------------------------------------------------------------------------------------------
10311
+; void interp_4tap_horiz_pp_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx
10312
+;-------------------------------------------------------------------------------------------------------------
10313
+INIT_YMM avx2
10314
+cglobal interp_4tap_horiz_pp_48x64, 4,6,7
10315
+ mov r4d, r4m
10316
+
10317
+%ifdef PIC
10318
+ lea r5, [tab_ChromaCoeff]
10319
+ vpbroadcastd m0, [r5 + r4 * 4]
10320
+%else
10321
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
10322
+%endif
10323
+
10324
+ mova m1, [interp4_horiz_shuf1]
10325
+ vpbroadcastd m2, [pw_1]
10326
+ mova m6, [pw_512]
10327
+ ; register map
10328
+ ; m0 - interpolate coeff
10329
+ ; m1 - shuffle order table
10330
+ ; m2 - constant word 1
10331
+
10332
+ dec r0
10333
+ mov r4d, 64
10334
+
10335
+.loop:
10336
+ ; Row 0
10337
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10338
+ pshufb m3, m1
10339
+ pmaddubsw m3, m0
10340
+ pmaddwd m3, m2
10341
+ vbroadcasti128 m4, [r0 + 4]
10342
+ pshufb m4, m1
10343
+ pmaddubsw m4, m0
10344
+ pmaddwd m4, m2
10345
+ packssdw m3, m4
10346
+ pmulhrsw m3, m6
10347
+
10348
+ vbroadcasti128 m4, [r0 + 16]
10349
+ pshufb m4, m1
10350
+ pmaddubsw m4, m0
10351
+ pmaddwd m4, m2
10352
+ vbroadcasti128 m5, [r0 + 20]
10353
+ pshufb m5, m1
10354
+ pmaddubsw m5, m0
10355
+ pmaddwd m5, m2
10356
+ packssdw m4, m5
10357
+ pmulhrsw m4, m6
10358
+
10359
+ packuswb m3, m4
10360
+ vpermq m3, m3, q3120
10361
+
10362
+ movu [r2], m3
10363
+
10364
+ vbroadcasti128 m3, [r0 + mmsize] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
10365
+ pshufb m3, m1
10366
+ pmaddubsw m3, m0
10367
+ pmaddwd m3, m2
10368
+ vbroadcasti128 m4, [r0 + mmsize + 4]
10369
+ pshufb m4, m1
10370
+ pmaddubsw m4, m0
10371
+ pmaddwd m4, m2
10372
+ packssdw m3, m4
10373
+ pmulhrsw m3, m6
10374
+
10375
+ vbroadcasti128 m4, [r0 + mmsize + 16]
10376
+ pshufb m4, m1
10377
+ pmaddubsw m4, m0
10378
+ pmaddwd m4, m2
10379
+ vbroadcasti128 m5, [r0 + mmsize + 20]
10380
+ pshufb m5, m1
10381
+ pmaddubsw m5, m0
10382
+ pmaddwd m5, m2
10383
+ packssdw m4, m5
10384
+ pmulhrsw m4, m6
10385
+
10386
+ packuswb m3, m4
10387
+ vpermq m3, m3, q3120
10388
+ movu [r2 + mmsize], xm3
10389
+
10390
+ add r2, r3
10391
+ add r0, r1
10392
+ dec r4d
10393
+ jnz .loop
10394
+ RET
10395
+
10396
+;-----------------------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_ps_48x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_48x64, 4,7,6
+ mov r4d, r4m
+ mov r5d, r5m
+ add r3d, r3d
+
+%ifdef PIC
+ lea r6, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r6 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ vbroadcasti128 m2, [pw_1]
+ vbroadcasti128 m5, [pw_2000]
+ mova m1, [tab_Tm]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+ mov r6d, 64
+ dec r0
+ test r5d, r5d
+ je .loop
+ sub r0 , r1
+ add r6d , 3
+
+.loop
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, q3120
+ movu [r2], m3
+
+ vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, q3120
+ movu [r2 + 32], m3
+
+ vbroadcasti128 m3, [r0 + 32] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 40] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, q3120
+ movu [r2 + 64], m3
+
+ add r2, r3
+ add r0, r1
+ dec r6d
+ jnz .loop
+ RET
+
+;-----------------------------------------------------------------------------------------------------------------------------
+; void interp_4tap_horiz_ps_24x64(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt)
+;-----------------------------------------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_24x64, 4,7,6
+ mov r4d, r4m
+ mov r5d, r5m
+ add r3d, r3d
+%ifdef PIC
+ lea r6, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r6 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+ vbroadcasti128 m2, [pw_1]
+ vbroadcasti128 m5, [pw_2000]
+ mova m1, [tab_Tm]
+
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+ mov r6d, 64
+ dec r0
+ test r5d, r5d
+ je .loop
+ sub r0 , r1
+ add r6d , 3
+
+.loop
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ psubw m3, m5
+ vpermq m3, m3, q3120
+ movu [r2], m3
+
+ vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+ packssdw m3, m3
+ psubw m3, m5
+ vpermq m3, m3, q3120
+ movu [r2 + 32], xm3
+
+ add r2, r3
+ add r0, r1
+ dec r6d
+ jnz .loop
+ RET
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_ps_2x16, 4, 7, 7
+ mov r4d, r4m
+ mov r5d, r5m
+ add r3d, r3d
+
+%ifdef PIC
+ lea r6, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r6 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+ vbroadcasti128 m6, [pw_2000]
+ test r5d, r5d
+ jz .label
+ sub r0, r1
+
+.label
+ mova m4, [interp4_hps_shuf]
+ mova m5, [pw_1]
+ dec r0
+ lea r4, [r1 * 3]
+ movq xm1, [r0] ;row 0
+ movhps xm1, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m1, m1, xm2, 1
+ lea r0, [r0 + r1 * 4]
+ movq xm3, [r0]
+ movhps xm3, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m3, m3, xm2, 1
+
+ pshufb m1, m4
+ pshufb m3, m4
+ pmaddubsw m1, m0
+ pmaddubsw m3, m0
+ pmaddwd m1, m5
+ pmaddwd m3, m5
+ packssdw m1, m3
+ psubw m1, m6
+
+ lea r4, [r3 * 3]
+ vextracti128 xm2, m1, 1
+
+ movd [r2], xm1
+ pextrd [r2 + r3], xm1, 1
+ movd [r2 + r3 * 2], xm2
+ pextrd [r2 + r4], xm2, 1
+ lea r2, [r2 + r3 * 4]
+ pextrd [r2], xm1, 2
+ pextrd [r2 + r3], xm1, 3
+ pextrd [r2 + r3 * 2], xm2, 2
+ pextrd [r2 + r4], xm2, 3
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+ lea r4, [r1 * 3]
+ movq xm1, [r0]
+ movhps xm1, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m1, m1, xm2, 1
+ lea r0, [r0 + r1 * 4]
+ movq xm3, [r0]
+ movhps xm3, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ movhps xm2, [r0 + r4]
+ vinserti128 m3, m3, xm2, 1
+
+ pshufb m1, m4
+ pshufb m3, m4
+ pmaddubsw m1, m0
+ pmaddubsw m3, m0
+ pmaddwd m1, m5
+ pmaddwd m3, m5
+ packssdw m1, m3
+ psubw m1, m6
+
+ lea r4, [r3 * 3]
+ vextracti128 xm2, m1, 1
+
+ movd [r2], xm1
+ pextrd [r2 + r3], xm1, 1
+ movd [r2 + r3 * 2], xm2
+ pextrd [r2 + r4], xm2, 1
+ lea r2, [r2 + r3 * 4]
+ pextrd [r2], xm1, 2
+ pextrd [r2 + r3], xm1, 3
+ pextrd [r2 + r3 * 2], xm2, 2
+ pextrd [r2 + r4], xm2, 3
+
+ test r5d, r5d
+ jz .end
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+ movq xm1, [r0]
+ movhps xm1, [r0 + r1]
+ movq xm2, [r0 + r1 * 2]
+ vinserti128 m1, m1, xm2, 1
+ pshufb m1, m4
+ pmaddubsw m1, m0
+ pmaddwd m1, m5
+ packssdw m1, m1
+ psubw m1, m6
+ vextracti128 xm2, m1, 1
+
+ movd [r2], xm1
+ pextrd [r2 + r3], xm1, 1
+ movd [r2 + r3 * 2], xm2
+.end
+ RET
+
+INIT_YMM avx2
+cglobal interp_4tap_horiz_pp_6x16, 4, 6, 7
+ mov r4d, r4m
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeff]
+ vpbroadcastd m0, [r5 + r4 * 4]
+%else
+ vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4]
+%endif
+
+ mova m1, [tab_Tm]
+ mova m2, [pw_1]
+ mova m6, [pw_512]
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+ ; register map
+ ; m0 - interpolate coeff
+ ; m1 - shuffle order table
+ ; m2 - constant word 1
+
+ dec r0
+%rep 4
+ ; Row 0
+ vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m3, m1
+ pmaddubsw m3, m0
+ pmaddwd m3, m2
+
+ ; Row 1
+ vbroadcasti128 m4, [r0 + r1] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+ packssdw m3, m4
+ pmulhrsw m3, m6
+
+ ; Row 2
+ vbroadcasti128 m4, [r0 + r1 * 2] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m4, m1
+ pmaddubsw m4, m0
+ pmaddwd m4, m2
+
+ ; Row 3
+ vbroadcasti128 m5, [r0 + r4] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+ pshufb m5, m1
+ pmaddubsw m5, m0
+ pmaddwd m5, m2
+ packssdw m4, m5
+ pmulhrsw m4, m6
+
+ packuswb m3, m4
+ vextracti128 xm4, m3, 1
+ movd [r2], xm3
+ pextrw [r2 + 4], xm4, 0
+ pextrd [r2 + r3], xm3, 1
+ pextrw [r2 + r3 + 4], xm4, 2
+ pextrd [r2 + r3 * 2], xm3, 2
+ pextrw [r2 + r3 * 2 + 4], xm4, 4
+ pextrd [r2 + r5], xm3, 3
+ pextrw [r2 + r5 + 4], xm4, 6
+ lea r2, [r2 + r3 * 4]
+ lea r0, [r0 + r1 * 4]
+%endrep
+ RET
x265_1.6.tar.gz/source/common/x86/ipfilter8.h -> x265_1.7.tar.gz/source/common/x86/ipfilter8.h
Changed
SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 16, cpu); \
SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu)

-void x265_chroma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-void x265_luma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+void x265_filterPixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
50
+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
51
+ void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
52
+
53
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
54
+ SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
55
+ SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
56
+ SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
57
+
58
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
59
+ SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
60
+ SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
61
+ SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
62
+
63
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
64
+ SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu) \
65
+ SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
66
+ SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
67
+ SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
68
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
69
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
70
+ SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
71
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
72
+
73
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
74
+ SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
75
+ SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu) \
76
+ SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu);
77
+
78
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
79
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 4, cpu); \
80
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
81
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 12, cpu); \
82
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
83
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
84
+ SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
85
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
86
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
87
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
88
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
89
+
90
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
91
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
92
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
93
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
94
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
95
+ SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
96
+ SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
97
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
98
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
99
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
100
+ SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
101
CHROMA_420_VERT_FILTERS(_sse2);
CHROMA_420_HORIZ_FILTERS(_sse4);
CHROMA_420_VERT_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);

CHROMA_422_VERT_FILTERS(_sse2);
CHROMA_422_HORIZ_FILTERS(_sse4);
CHROMA_422_VERT_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);

CHROMA_444_VERT_FILTERS(_sse2);
CHROMA_444_HORIZ_FILTERS(_sse4);

    SETUP_CHROMA_SS_FUNC_DEF(64, 16, cpu); \
    SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu);

+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
+    void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
+
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
+
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
+
CHROMA_420_FILTERS(_sse4);
CHROMA_420_FILTERS(_avx2);
CHROMA_420_SP_FILTERS(_sse2);

CHROMA_420_SS_FILTERS_SSE4(_sse4);
CHROMA_420_SS_FILTERS(_avx2);
CHROMA_420_SS_FILTERS_SSE4(_avx2);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);

CHROMA_422_FILTERS(_sse4);
CHROMA_422_FILTERS(_avx2);
CHROMA_422_SP_FILTERS(_sse2);
+CHROMA_422_SP_FILTERS(_avx2);
CHROMA_422_SP_FILTERS_SSE4(_sse4);
+CHROMA_422_SP_FILTERS_SSE4(_avx2);
CHROMA_422_SS_FILTERS(_sse2);
+CHROMA_422_SS_FILTERS(_avx2);
CHROMA_422_SS_FILTERS_SSE4(_sse4);
+CHROMA_422_SS_FILTERS_SSE4(_avx2);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
+void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);

CHROMA_444_FILTERS(_sse4);
CHROMA_444_SP_FILTERS(_sse4);
CHROMA_444_SS_FILTERS(_sse2);
-
-void x265_chroma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+CHROMA_444_FILTERS(_avx2);
+CHROMA_444_SP_FILTERS(_avx2);
+CHROMA_444_SS_FILTERS(_avx2);

#undef SETUP_CHROMA_FUNC_DEF
#undef SETUP_CHROMA_SP_FUNC_DEF

LUMA_FILTERS(_avx2);
LUMA_SP_FILTERS(_avx2);
LUMA_SS_FILTERS(_avx2);
-void x265_interp_8tap_hv_pp_8x8_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_pixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
-void x265_pixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_interp_8tap_hv_pp_8x8_ssse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
+void x265_interp_8tap_hv_pp_16x16_avx2(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
+void x265_filterPixelToShort_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x8_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x16_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_interp_4tap_horiz_pp_2x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_2x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_2x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_4x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_6x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_6x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x6_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_8x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_12x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_12x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_16x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_24x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_24x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_32x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_48x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_horiz_pp_64x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_8x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_12x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x12_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_16x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_24x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x24_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_32x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_48x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x48_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_pp_64x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_8tap_horiz_ps_4x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_4x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_4x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_8x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_12x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x12_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_16x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_24x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_32x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_32x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_32x24_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_32x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_32x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_48x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_64x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_64x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_64x48_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_horiz_ps_64x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
+void x265_interp_8tap_hv_pp_8x8_sse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
+void x265_interp_4tap_vert_pp_2x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_2x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_2x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_4x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_4x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+#ifdef X86_64
+void x265_interp_4tap_vert_pp_6x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_6x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x6_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x12_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_pp_8x64_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
+#endif
#undef LUMA_FILTERS
#undef LUMA_SP_FILTERS
#undef LUMA_SS_FILTERS
x265_1.6.tar.gz/source/common/x86/loopfilter.asm -> x265_1.7.tar.gz/source/common/x86/loopfilter.asm
Changed

%include "x86inc.asm"

SECTION_RODATA 32
-pb_31:      times 16 db 31
-pb_15:      times 16 db 15
+pb_31:      times 32 db 31
+pb_15:      times 32 db 15
+pb_movemask_32:  times 32 db 0x00
+                 times 32 db 0xFF

SECTION .text
cextern pb_1
cextern pb_128
cextern pb_2
cextern pw_2
+cextern pb_movemask


;============================================================================================================
-; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t signLeft)
+; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
;============================================================================================================
INIT_XMM sse4
-cglobal saoCuOrgE0, 4, 4, 8, rec, offsetEo, lcuWidth, signLeft
+cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride

-    neg         r3                   ; r3 = -signLeft
-    movzx       r3d, r3b
-    movd        m0, r3d
-    mova        m4, [pb_128]         ; m4 = [80]
-    pxor        m5, m5               ; m5 = 0
-    movu        m6, [r1]             ; m6 = offsetEo
+    mov         r4d, r4m
+    mova        m4, [pb_128]         ; m4 = [80]
+    pxor        m5, m5               ; m5 = 0
+    movu        m6, [r1]             ; m6 = offsetEo
+
+    movzx       r1d, byte [r3]
+    inc         r3
+    neg         r1b
+    movd        m0, r1d
+    lea         r1, [r0 + r4]
+    mov         r4d, r2d

.loop:
-    movu        m7, [r0]             ; m1 = rec[x]
+    movu        m7, [r0]             ; m7 = rec[x]
    movu        m2, [r0 + 1]         ; m2 = rec[x+1]

    pxor        m1, m7, m4

    pxor        m0, m0
    palignr     m0, m2, 15
    paddb       m2, m3
-    paddb       m2, [pb_2]           ; m1 = uiEdgeType
+    paddb       m2, [pb_2]           ; m2 = uiEdgeType
    pshufb      m3, m6, m2
    pmovzxbw    m2, m7               ; rec
    punpckhbw   m7, m5

    add         r0q, 16
    sub         r2d, 16
    jnz         .loop
+
+    movzx       r3d, byte [r3]
+    neg         r3b
+    movd        m0, r3d
+.loopH:
+    movu        m7, [r1]             ; m7 = rec[x]
+    movu        m2, [r1 + 1]         ; m2 = rec[x+1]
+
+    pxor        m1, m7, m4
+    pxor        m3, m2, m4
+    pcmpgtb     m2, m1, m3
+    pcmpgtb     m3, m1
+    pand        m2, [pb_1]
+    por         m2, m3
+
+    pslldq      m3, m2, 1
+    por         m3, m0
+
+    psignb      m3, m4               ; m3 = signLeft
+    pxor        m0, m0
+    palignr     m0, m2, 15
+    paddb       m2, m3
+    paddb       m2, [pb_2]           ; m2 = uiEdgeType
+    pshufb      m3, m6, m2
+    pmovzxbw    m2, m7               ; rec
+    punpckhbw   m7, m5
+    pmovsxbw    m1, m3               ; offsetEo
+    punpckhbw   m3, m3
+    psraw       m3, 8
+    paddw       m2, m1
+    paddw       m7, m3
+    packuswb    m2, m7
+    movu        [r1], m2
+
+    add         r1q, 16
+    sub         r4d, 16
+    jnz         .loopH
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
+
+    mov         r4d, r4m
+    vbroadcasti128  m4, [pb_128]     ; m4 = [80]
+    vbroadcasti128  m6, [r1]         ; m6 = offsetEo
+    movzx       r1d, byte [r3]
+    neg         r1b
+    movd        xm0, r1d
+    movzx       r1d, byte [r3 + 1]
+    neg         r1b
+    movd        xm1, r1d
+    vinserti128 m0, m0, xm1, 1
+
+.loop:
+    movu        xm5, [r0]            ; xm5 = rec[x]
+    movu        xm2, [r0 + 1]        ; xm2 = rec[x + 1]
+    vinserti128 m5, m5, [r0 + r4], 1
+    vinserti128 m2, m2, [r0 + r4 + 1], 1
+
+    pxor        m1, m5, m4
+    pxor        m3, m2, m4
+    pcmpgtb     m2, m1, m3
+    pcmpgtb     m3, m1
+    pand        m2, [pb_1]
+    por         m2, m3
+
+    pslldq      m3, m2, 1
+    por         m3, m0
+
+    psignb      m3, m4               ; m3 = signLeft
+    pxor        m0, m0
+    palignr     m0, m2, 15
+    paddb       m2, m3
+    paddb       m2, [pb_2]           ; m2 = uiEdgeType
+    pshufb      m3, m6, m2
+    pmovzxbw    m2, xm5              ; rec
+    vextracti128 xm5, m5, 1
+    pmovzxbw    m5, xm5
+    pmovsxbw    m1, xm3              ; offsetEo
+    vextracti128 xm3, m3, 1
+    pmovsxbw    m3, xm3
+    paddw       m2, m1
+    paddw       m5, m3
+    packuswb    m2, m5
+    vpermq      m2, m2, 11011000b
+    movu        [r0], xm2
+    vextracti128 [r0 + r4], m2, 1
+
+    add         r0q, 16
+    sub         r2d, 16
+    jnz         .loop
    RET

;==================================================================================================

    mov         r3d, r3m
    mov         r4d, r4m
    pxor        m0, m0               ; m0 = 0
-    movu        m6, [pb_2]           ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    mova        m6, [pb_2]           ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
    mova        m7, [pb_128]
    shr         r4d, 4
-  .loop
-    movu        m1, [r0]             ; m1 = pRec[x]
-    movu        m2, [r0 + r3]        ; m2 = pRec[x + iStride]
-
-    pxor        m3, m1, m7
-    pxor        m4, m2, m7
-    pcmpgtb     m2, m3, m4
-    pcmpgtb     m4, m3
-    pand        m2, [pb_1]
-    por         m2, m4
-
-    movu        m3, [r1]             ; m3 = m_iUpBuff1
-
-    paddb       m3, m2
-    paddb       m3, m6
-
-    movu        m4, [r2]             ; m4 = m_iOffsetEo
-    pshufb      m5, m4, m3
-
-    psubb       m3, m0, m2
-    movu        [r1], m3
-
-    pmovzxbw    m2, m1
-    punpckhbw   m1, m0
-    pmovsxbw    m3, m5
-    punpckhbw   m5, m5
-    psraw       m5, 8
-
-    paddw       m2, m3
-    paddw       m1, m5
-    packuswb    m2, m1
-    movu        [r0], m2
-
-    add         r0, 16
-    add         r1, 16
-    dec         r4d
-    jnz         .loop
+.loop
+    movu        m1, [r0]             ; m1 = pRec[x]
+    movu        m2, [r0 + r3]        ; m2 = pRec[x + iStride]
+
+    pxor        m3, m1, m7
+    pxor        m4, m2, m7
+    pcmpgtb     m2, m3, m4
+    pcmpgtb     m4, m3
+    pand        m2, [pb_1]
+    por         m2, m4
+
+    movu        m3, [r1]             ; m3 = m_iUpBuff1
+
+    paddb       m3, m2
+    paddb       m3, m6
+
+    movu        m4, [r2]             ; m4 = m_iOffsetEo
+    pshufb      m5, m4, m3
+
+    psubb       m3, m0, m2
+    movu        [r1], m3
+
+    pmovzxbw    m2, m1
+    punpckhbw   m1, m0
+    pmovsxbw    m3, m5
+    punpckhbw   m5, m5
+    psraw       m5, 8
+
+    paddw       m2, m3
+    paddw       m1, m5
+    packuswb    m2, m1
+    movu        [r0], m2
+
+    add         r0, 16
+    add         r1, 16
+    dec         r4d
+    jnz         .loop
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE1, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
+    mov         r3d, r3m
+    mov         r4d, r4m
+    movu        xm0, [r2]            ; xm0 = m_iOffsetEo
+    mova        xm6, [pb_2]          ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    mova        xm7, [pb_128]
+    shr         r4d, 4
+.loop
+    movu        xm1, [r0]            ; xm1 = pRec[x]
+    movu        xm2, [r0 + r3]       ; xm2 = pRec[x + iStride]
+
+    pxor        xm3, xm1, xm7
+    pxor        xm4, xm2, xm7
+    pcmpgtb     xm2, xm3, xm4
+    pcmpgtb     xm4, xm3
+    pand        xm2, [pb_1]
+    por         xm2, xm4
+
+    movu        xm3, [r1]            ; xm3 = m_iUpBuff1
263
+
264
+ paddb xm3, xm2
265
+ paddb xm3, xm6
266
+
267
+ pshufb xm5, xm0, xm3
268
+ pxor xm4, xm4
269
+ psubb xm3, xm4, xm2
270
+ movu [r1], xm3
271
+
272
+ pmovzxbw m2, xm1
273
+ pmovsxbw m3, xm5
274
+
275
+ paddw m2, m3
276
+ vextracti128 xm3, m2, 1
277
+ packuswb xm2, xm3
278
+ movu [r0], xm2
279
+
280
+ add r0, 16
281
+ add r1, 16
282
+ dec r4d
283
+ jnz .loop
284
+ RET
285
+
286
+;========================================================================================================
287
+; void saoCuOrgE1_2Rows(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
288
+;========================================================================================================
289
+INIT_XMM sse4
290
+cglobal saoCuOrgE1_2Rows, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
291
+ mov r3d, r3m
292
+ mov r4d, r4m
293
+ pxor m0, m0 ; m0 = 0
294
+ mova m7, [pb_128]
295
+ shr r4d, 4
296
+.loop
297
+ movu m1, [r0] ; m1 = pRec[x]
298
+ movu m2, [r0 + r3] ; m2 = pRec[x + iStride]
299
+
300
+ pxor m3, m1, m7
301
+ pxor m4, m2, m7
302
+ pcmpgtb m6, m3, m4
303
+ pcmpgtb m5, m4, m3
304
+ pand m6, [pb_1]
305
+ por m6, m5
306
+
307
+ movu m5, [r0 + r3 * 2]
308
+ pxor m3, m5, m7
309
+ pcmpgtb m5, m4, m3
310
+ pcmpgtb m3, m4
311
+ pand m5, [pb_1]
312
+ por m5, m3
313
+
314
+ movu m3, [r1] ; m3 = m_iUpBuff1
315
+ paddb m3, m6
316
+ paddb m3, [pb_2]
317
+
318
+ movu m4, [r2] ; m4 = m_iOffsetEo
319
+ pshufb m4, m3
320
+
321
+ psubb m3, m0, m6
322
+ movu [r1], m3
323
+
324
+ pmovzxbw m6, m1
325
+ punpckhbw m1, m0
326
+ pmovsxbw m3, m4
327
+ punpckhbw m4, m4
328
+ psraw m4, 8
329
+
330
+ paddw m6, m3
331
+ paddw m1, m4
332
+ packuswb m6, m1
333
+ movu [r0], m6
334
+
335
+ movu m3, [r1] ; m3 = m_iUpBuff1
336
+ paddb m3, m5
337
+ paddb m3, [pb_2]
338
+
339
+ movu m4, [r2] ; m4 = m_iOffsetEo
340
+ pshufb m4, m3
341
+ psubb m3, m0, m5
342
+ movu [r1], m3
343
+
344
+ pmovzxbw m5, m2
345
+ punpckhbw m2, m0
346
+ pmovsxbw m3, m4
347
+ punpckhbw m4, m4
348
+ psraw m4, 8
349
+
350
+ paddw m5, m3
351
+ paddw m2, m4
352
+ packuswb m5, m2
353
+ movu [r0 + r3], m5
354
+
355
+ add r0, 16
356
+ add r1, 16
357
+ dec r4d
358
+ jnz .loop
359
+ RET
360
+
361
+INIT_YMM avx2
362
+cglobal saoCuOrgE1_2Rows, 3, 5, 7, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
363
+ mov r3d, r3m
364
+ mov r4d, r4m
365
+ pxor m0, m0 ; m0 = 0
366
+ vbroadcasti128 m5, [pb_128]
367
+ vbroadcasti128 m6, [r2] ; m6 = m_iOffsetEo
368
+ shr r4d, 4
369
+.loop
370
+ movu xm1, [r0] ; m1 = pRec[x]
371
+ movu xm2, [r0 + r3] ; m2 = pRec[x + iStride]
372
+ vinserti128 m1, m1, xm2, 1
373
+ vinserti128 m2, m2, [r0 + r3 * 2], 1
374
+
375
+ pxor m3, m1, m5
376
+ pxor m4, m2, m5
377
+ pcmpgtb m2, m3, m4
378
+ pcmpgtb m4, m3
379
+ pand m2, [pb_1]
380
+ por m2, m4
381
+
382
+ movu xm3, [r1] ; xm3 = m_iUpBuff1
383
+ psubb m4, m0, m2
384
+ vinserti128 m3, m3, xm4, 1
385
+ paddb m3, m2
386
+ paddb m3, [pb_2]
387
+ pshufb m2, m6, m3
388
+ vextracti128 [r1], m4, 1
389
+
390
+ pmovzxbw m4, xm1
391
+ vextracti128 xm3, m1, 1
392
+ pmovzxbw m3, xm3
393
+ pmovsxbw m1, xm2
394
+ vextracti128 xm2, m2, 1
395
+ pmovsxbw m2, xm2
396
+
397
+ paddw m4, m1
398
+ paddw m3, m2
399
+ packuswb m4, m3
400
+ vpermq m4, m4, 11011000b
401
+ movu [r0], xm4
402
+ vextracti128 [r0 + r3], m4, 1
403
+
404
+ add r0, 16
405
+ add r1, 16
406
+ dec r4d
407
+ jnz .loop
408
RET
409
410
;======================================================================================================================================================
411
; void saoCuOrgE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int lcuWidth, intptr_t stride)
412
;======================================================================================================================================================
413
INIT_XMM sse4
414
-cglobal saoCuOrgE2, 5, 7, 8, rec, bufft, buff1, offsetEo, lcuWidth
415
-
416
- mov r6, 16
417
+cglobal saoCuOrgE2, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
418
+ mov r4d, r4m
419
mov r5d, r5m
420
pxor m0, m0 ; m0 = 0
421
mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
422
mova m7, [pb_128]
423
- shr r4d, 4
424
- inc r1q
425
-
426
- .loop
427
- movu m1, [r0] ; m1 = rec[x]
428
- movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1]
429
- pxor m3, m1, m7
430
- pxor m4, m2, m7
431
- pcmpgtb m2, m3, m4
432
- pcmpgtb m4, m3
433
- pand m2, [pb_1]
434
- por m2, m4
435
- movu m3, [r2] ; m3 = buff1
436
-
437
- paddb m3, m2
438
- paddb m3, m6 ; m3 = edgeType
439
-
440
- movu m4, [r3] ; m4 = offsetEo
441
- pshufb m4, m3
442
-
443
- psubb m3, m0, m2
444
- movu [r1], m3
445
-
446
- pmovzxbw m2, m1
447
- punpckhbw m1, m0
448
- pmovsxbw m3, m4
449
- punpckhbw m4, m4
450
- psraw m4, 8
451
-
452
- paddw m2, m3
453
- paddw m1, m4
454
- packuswb m2, m1
455
- movu [r0], m2
456
-
457
- add r0, r6
458
- add r1, r6
459
- add r2, r6
460
- dec r4d
461
- jnz .loop
462
+ inc r1
463
+ movh m5, [r0 + r4]
464
+ movhps m5, [r1 + r4]
465
+
466
+.loop
467
+ movu m1, [r0] ; m1 = rec[x]
468
+ movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1]
469
+ pxor m3, m1, m7
470
+ pxor m4, m2, m7
471
+ pcmpgtb m2, m3, m4
472
+ pcmpgtb m4, m3
473
+ pand m2, [pb_1]
474
+ por m2, m4
475
+ movu m3, [r2] ; m3 = buff1
476
+
477
+ paddb m3, m2
478
+ paddb m3, m6 ; m3 = edgeType
479
+
480
+ movu m4, [r3] ; m4 = offsetEo
481
+ pshufb m4, m3
482
+
483
+ psubb m3, m0, m2
484
+ movu [r1], m3
485
+
486
+ pmovzxbw m2, m1
487
+ punpckhbw m1, m0
488
+ pmovsxbw m3, m4
489
+ punpckhbw m4, m4
490
+ psraw m4, 8
491
+
492
+ paddw m2, m3
493
+ paddw m1, m4
494
+ packuswb m2, m1
495
+ movu [r0], m2
496
+
497
+ add r0, 16
498
+ add r1, 16
499
+ add r2, 16
500
+ sub r4, 16
501
+ jg .loop
502
+
503
+ movh [r0 + r4], m5
504
+ movhps [r1 + r4], m5
505
+ RET
506
+
507
+INIT_YMM avx2
508
+cglobal saoCuOrgE2, 5, 6, 7, rec, bufft, buff1, offsetEo, lcuWidth
509
+ mov r4d, r4m
510
+ mov r5d, r5m
511
+ pxor xm0, xm0 ; xm0 = 0
512
+ mova xm5, [pb_128]
513
+ inc r1
514
+ movq xm6, [r0 + r4]
515
+ movhps xm6, [r1 + r4]
516
+
517
+ movu xm1, [r0] ; xm1 = rec[x]
518
+ movu xm2, [r0 + r5 + 1] ; xm2 = rec[x + stride + 1]
519
+ pxor xm3, xm1, xm5
520
+ pxor xm4, xm2, xm5
521
+ pcmpgtb xm2, xm3, xm4
522
+ pcmpgtb xm4, xm3
523
+ pand xm2, [pb_1]
524
+ por xm2, xm4
525
+ movu xm3, [r2] ; xm3 = buff1
526
+
527
+ paddb xm3, xm2
528
+ paddb xm3, [pb_2] ; xm3 = edgeType
529
+
530
+ movu xm4, [r3] ; xm4 = offsetEo
531
+ pshufb xm4, xm3
532
+
533
+ psubb xm3, xm0, xm2
534
+ movu [r1], xm3
535
+
536
+ pmovzxbw m2, xm1
537
+ pmovsxbw m3, xm4
538
+
539
+ paddw m2, m3
540
+ vextracti128 xm3, m2, 1
541
+ packuswb xm2, xm3
542
+ movu [r0], xm2
543
+
544
+ movq [r0 + r4], xm6
545
+ movhps [r1 + r4], xm6
546
+ RET
547
+
548
+INIT_YMM avx2
549
+cglobal saoCuOrgE2_32, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
550
+ mov r4d, r4m
551
+ mov r5d, r5m
552
+ pxor m0, m0 ; m0 = 0
553
+ vbroadcasti128 m7, [pb_128]
554
+ vbroadcasti128 m5, [r3] ; m5 = offsetEo
555
+ inc r1
556
+ movq xm6, [r0 + r4]
557
+ movhps xm6, [r1 + r4]
558
+
559
+.loop:
560
+ movu m1, [r0] ; m1 = rec[x]
561
+ movu m2, [r0 + r5 + 1] ; m2 = rec[x + stride + 1]
562
+ pxor m3, m1, m7
563
+ pxor m4, m2, m7
564
+ pcmpgtb m2, m3, m4
565
+ pcmpgtb m4, m3
566
+ pand m2, [pb_1]
567
+ por m2, m4
568
+ movu m3, [r2] ; m3 = buff1
569
+
570
+ paddb m3, m2
571
+ paddb m3, [pb_2] ; m3 = edgeType
572
+
573
+ pshufb m4, m5, m3
574
+
575
+ psubb m3, m0, m2
576
+ movu [r1], m3
577
+
578
+ pmovzxbw m2, xm1
579
+ vextracti128 xm1, m1, 1
580
+ pmovzxbw m1, xm1
581
+ pmovsxbw m3, xm4
582
+ vextracti128 xm4, m4, 1
583
+ pmovsxbw m4, xm4
584
+
585
+ paddw m2, m3
586
+ paddw m1, m4
587
+ packuswb m2, m1
588
+ vpermq m2, m2, 11011000b
589
+ movu [r0], m2
590
+
591
+ add r0, 32
592
+ add r1, 32
593
+ add r2, 32
594
+ sub r4, 32
595
+ jg .loop
596
+
597
+ movq [r0 + r4], xm6
598
+ movhps [r1 + r4], xm6
599
RET
600
601
;=======================================================================================================
602
;void saoCuOrgE3(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX)
603
;=======================================================================================================
604
INIT_XMM sse4
605
-cglobal saoCuOrgE3, 3, 7, 8
606
+cglobal saoCuOrgE3, 3,6,8
607
mov r3d, r3m
608
mov r4d, r4m
609
mov r5d, r5m
610
611
- mov r6d, r5d
612
- sub r6d, r4d
613
+ ; save the last 2 pixels for the startX=1 / left_endX=15 cases
614
+ movh m7, [r0 + r5]
615
+ movhps m7, [r1 + r5 - 1]
616
617
+ ; move to startX+1
618
inc r4d
619
add r0, r4
620
add r1, r4
621
- movh m7, [r0 + r6 - 1]
622
- mov r6, [r1 + r6 - 2]
623
+ sub r5d, r4d
624
pxor m0, m0 ; m0 = 0
625
movu m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
626
627
628
packuswb m2, m1
629
movu [r0], m2
630
631
- sub r5d, 16
632
- jle .end
633
+ add r0, 16
634
+ add r1, 16
635
636
- lea r0, [r0 + 16]
637
- lea r1, [r1 + 16]
638
+ sub r5, 16
639
+ jg .loop
640
641
- jnz .loop
642
+ ; restore last pixels (up to 2)
643
+ movh [r0 + r5], m7
644
+ movhps [r1 + r5 - 1], m7
645
+ RET
646
647
-.end:
648
- js .skip
649
- sub r0, r4
650
- sub r1, r4
651
- movh [r0 + 16], m7
652
- mov [r1 + 15], r6
653
- jmp .quit
654
+INIT_YMM avx2
655
+cglobal saoCuOrgE3, 3, 6, 8
656
+ mov r3d, r3m
657
+ mov r4d, r4m
658
+ mov r5d, r5m
659
+
660
+ ; save latest 2 pixels for case startX=1 or left_endX=15
661
+ movq xm7, [r0 + r5]
662
+ movhps xm7, [r1 + r5 - 1]
663
+
664
+ ; move to startX+1
665
+ inc r4d
666
+ add r0, r4
667
+ add r1, r4
668
+ sub r5d, r4d
669
+ pxor xm0, xm0 ; xm0 = 0
670
+ mova xm6, [pb_2] ; xm6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
671
+ movu xm5, [r2] ; xm5 = m_iOffsetEo
672
+
673
+.loop:
674
+ movu xm1, [r0] ; xm1 = pRec[x]
675
+ movu xm2, [r0 + r3] ; xm2 = pRec[x + iStride]
676
+
677
+ psubusb xm3, xm2, xm1
678
+ psubusb xm4, xm1, xm2
679
+ pcmpeqb xm3, xm0
680
+ pcmpeqb xm4, xm0
681
+ pcmpeqb xm2, xm1
682
+
683
+ pabsb xm3, xm3
684
+ por xm4, xm3
685
+ pandn xm2, xm4 ; xm2 = iSignDown
686
+
687
+ movu xm3, [r1] ; xm3 = m_iUpBuff1
688
+
689
+ paddb xm3, xm2
690
+ paddb xm3, xm6 ; xm3 = uiEdgeType
691
+
692
+ pshufb xm4, xm5, xm3
693
+
694
+ psubb xm3, xm0, xm2
695
+ movu [r1 - 1], xm3
696
+
697
+ pmovzxbw m2, xm1
698
+ pmovsxbw m3, xm4
699
+
700
+ paddw m2, m3
701
+ vextracti128 xm3, m2, 1
702
+ packuswb xm2, xm3
703
+ movu [r0], xm2
704
+
705
+ add r0, 16
706
+ add r1, 16
707
+
708
+ sub r5, 16
709
+ jg .loop
710
+
711
+ ; restore last pixels (up to 2)
712
+ movq [r0 + r5], xm7
713
+ movhps [r1 + r5 - 1], xm7
714
+ RET
715
+
716
+INIT_YMM avx2
717
+cglobal saoCuOrgE3_32, 3, 6, 8
718
+ mov r3d, r3m
719
+ mov r4d, r4m
720
+ mov r5d, r5m
721
+
722
+ ; save latest 2 pixels for case startX=1 or left_endX=15
723
+ movq xm7, [r0 + r5]
724
+ movhps xm7, [r1 + r5 - 1]
725
+
726
+ ; move to startX+1
727
+ inc r4d
728
+ add r0, r4
729
+ add r1, r4
730
+ sub r5d, r4d
731
+ pxor m0, m0 ; m0 = 0
732
+ mova m6, [pb_2] ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
733
+ vbroadcasti128 m5, [r2] ; m5 = m_iOffsetEo
734
+
735
+.loop:
736
+ movu m1, [r0] ; m1 = pRec[x]
737
+ movu m2, [r0 + r3] ; m2 = pRec[x + iStride]
738
+
739
+ psubusb m3, m2, m1
740
+ psubusb m4, m1, m2
741
+ pcmpeqb m3, m0
742
+ pcmpeqb m4, m0
743
+ pcmpeqb m2, m1
744
+
745
+ pabsb m3, m3
746
+ por m4, m3
747
+ pandn m2, m4 ; m2 = iSignDown
748
+
749
+ movu m3, [r1] ; m3 = m_iUpBuff1
750
+
751
+ paddb m3, m2
752
+ paddb m3, m6 ; m3 = uiEdgeType
753
+
754
+ pshufb m4, m5, m3
755
+
756
+ psubb m3, m0, m2
757
+ movu [r1 - 1], m3
758
+
759
+ pmovzxbw m2, xm1
760
+ vextracti128 xm1, m1, 1
761
+ pmovzxbw m1, xm1
762
+ pmovsxbw m3, xm4
763
+ vextracti128 xm4, m4, 1
764
+ pmovsxbw m4, xm4
765
766
-.skip:
767
- sub r0, r4
768
- sub r1, r4
769
- movh [r0 + 15], m7
770
- mov [r1 + 14], r6
771
+ paddw m2, m3
772
+ paddw m1, m4
773
+ packuswb m2, m1
774
+ vpermq m2, m2, 11011000b
775
+ movu [r0], m2
776
777
-.quit:
778
+ add r0, 32
779
+ add r1, 32
780
+ sub r5, 32
781
+ jg .loop
782
783
+ ; restore last pixels (up to 2)
784
+ movq [r0 + r5], xm7
785
+ movhps [r1 + r5 - 1], xm7
786
RET
787
788
;=====================================================================================
789
790
jnz .loopH
791
RET
792
793
+INIT_YMM avx2
794
+cglobal saoCuOrgB0, 4, 7, 8
795
+
796
+ mov r3d, r3m
797
+ mov r4d, r4m
798
+ mova m7, [pb_31]
799
+ vbroadcasti128 m3, [r1 + 0] ; offset[0-15]
800
+ vbroadcasti128 m4, [r1 + 16] ; offset[16-31]
801
+ lea r6, [r4 * 2]
802
+ sub r6d, r2d
803
+ shr r2d, 4
804
+ mov r1d, r3d
805
+ shr r3d, 1
806
+.loopH
807
+ mov r5d, r2d
808
+.loopW
809
+ movu xm2, [r0] ; m2 = [rec]
810
+ vinserti128 m2, m2, [r0 + r4], 1
811
+ psrlw m1, m2, 3
812
+ pand m1, m7 ; m1 = [index]
813
+ pcmpgtb m0, m1, [pb_15] ; m0 = [mask]
814
+
815
+ pshufb m6, m3, m1
816
+ pshufb m5, m4, m1
817
+
818
+ pblendvb m6, m6, m5, m0 ; NOTE: don't use the 3-operand form, the x264 macro has a bug!
819
+
820
+ pmovzxbw m1, xm2 ; rec
821
+ vextracti128 xm2, m2, 1
822
+ pmovzxbw m2, xm2
823
+ pmovsxbw m0, xm6 ; offset
824
+ vextracti128 xm6, m6, 1
825
+ pmovsxbw m6, xm6
826
+
827
+ paddw m1, m0
828
+ paddw m2, m6
829
+ packuswb m1, m2
830
+ vpermq m1, m1, 11011000b
831
+
832
+ movu [r0], xm1
833
+ vextracti128 [r0 + r4], m1, 1
834
+ add r0, 16
835
+ dec r5d
836
+ jnz .loopW
837
+
838
+ add r0, r6
839
+ dec r3d
840
+ jnz .loopH
841
+ test r1b, 1
842
+ jz .end
843
+ mov r5d, r2d
844
+.loopW1
845
+ movu xm2, [r0] ; m2 = [rec]
846
+ psrlw xm1, xm2, 3
847
+ pand xm1, xm7 ; m1 = [index]
848
+ pcmpgtb xm0, xm1, [pb_15] ; m0 = [mask]
849
+
850
+ pshufb xm6, xm3, xm1
851
+ pshufb xm5, xm4, xm1
852
+
853
+ pblendvb xm6, xm6, xm5, xm0 ; NOTE: don't use the 3-operand form, the x264 macro has a bug!
854
+
855
+ pmovzxbw m1, xm2 ; rec
856
+ pmovsxbw m0, xm6 ; offset
857
+
858
+ paddw m1, m0
859
+ vextracti128 xm0, m1, 1
860
+ packuswb xm1, xm0
861
+
862
+ movu [r0], xm1
863
+ add r0, 16
864
+ dec r5d
865
+ jnz .loopW1
866
+.end
867
+ RET
868
+
869
;============================================================================================================
870
-; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int endX)
871
+; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width)
872
;============================================================================================================
873
INIT_XMM sse4
874
-cglobal calSign, 4, 5, 7
875
+cglobal calSign, 4,5,6
876
+ mova m0, [pb_128]
877
+ mova m1, [pb_1]
878
879
- mov r4, 16
880
- mova m1, [pb_128]
881
- mova m0, [pb_1]
882
- shr r3d, 4
883
-.loop
884
- movu m2, [r1] ; m2 = pRec[x]
885
- movu m3, [r2] ; m3 = pTmpU[x]
886
+ sub r1, r0
887
+ sub r2, r0
888
889
- pxor m4, m2, m1
890
- pxor m5, m3, m1
891
- pcmpgtb m6, m4, m5
892
- pcmpgtb m5, m4
893
- pand m6, m0
894
- por m6, m5
895
+ mov r4d, r3d
896
+ shr r3d, 4
897
+ jz .next
898
+.loop:
899
+ movu m2, [r0 + r1] ; m2 = pRec[x]
900
+ movu m3, [r0 + r2] ; m3 = pTmpU[x]
901
+ pxor m4, m2, m0
902
+ pxor m3, m0
903
+ pcmpgtb m5, m4, m3
904
+ pcmpgtb m3, m4
905
+ pand m5, m1
906
+ por m5, m3
907
+ movu [r0], m5
908
+
909
+ add r0, 16
910
+ dec r3d
911
+ jnz .loop
912
913
- movu [r0], m6
914
+ ; process partial
915
+.next:
916
+ and r4d, 15
917
+ jz .end
918
+
919
+ movu m2, [r0 + r1] ; m2 = pRec[x]
920
+ movu m3, [r0 + r2] ; m3 = pTmpU[x]
921
+ pxor m4, m2, m0
922
+ pxor m3, m0
923
+ pcmpgtb m5, m4, m3
924
+ pcmpgtb m3, m4
925
+ pand m5, m1
926
+ por m5, m3
927
+
928
+ lea r3, [pb_movemask + 16]
929
+ sub r3, r4
930
+ movu xmm0, [r3]
931
+ movu m3, [r0]
932
+ pblendvb m5, m5, m3, xmm0
933
+ movu [r0], m5
934
935
- add r0, r4
936
- add r1, r4
937
- add r2, r4
938
- dec r3d
939
- jnz .loop
940
+.end:
941
+ RET
942
+
943
+INIT_YMM avx2
944
+cglobal calSign, 4, 5, 6
945
+ vbroadcasti128 m0, [pb_128]
946
+ mova m1, [pb_1]
947
+
948
+ sub r1, r0
949
+ sub r2, r0
950
+
951
+ mov r4d, r3d
952
+ shr r3d, 5
953
+ jz .next
954
+.loop:
955
+ movu m2, [r0 + r1] ; m2 = pRec[x]
956
+ movu m3, [r0 + r2] ; m3 = pTmpU[x]
957
+ pxor m4, m2, m0
958
+ pxor m3, m0
959
+ pcmpgtb m5, m4, m3
960
+ pcmpgtb m3, m4
961
+ pand m5, m1
962
+ por m5, m3
963
+ movu [r0], m5
964
+
965
+ add r0, mmsize
966
+ dec r3d
967
+ jnz .loop
968
+
969
+ ; process partial
970
+.next:
971
+ and r4d, 31
972
+ jz .end
973
+
974
+ movu m2, [r0 + r1] ; m2 = pRec[x]
975
+ movu m3, [r0 + r2] ; m3 = pTmpU[x]
976
+ pxor m4, m2, m0
977
+ pxor m3, m0
978
+ pcmpgtb m5, m4, m3
979
+ pcmpgtb m3, m4
980
+ pand m5, m1
981
+ por m5, m3
982
+
983
+ lea r3, [pb_movemask_32 + 32]
984
+ sub r3, r4
985
+ movu m0, [r3]
986
+ movu m3, [r0]
987
+ pblendvb m5, m5, m3, m0
988
+ movu [r0], m5
989
+
990
+.end:
991
RET
992
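The calSign kernels above compute, for each pixel, the sign of the difference between two rows (the xor-with-128 trick converts unsigned bytes to signed so pcmpgtb works, then the pb_1 mask and por merge the two comparisons into -1/0/1). A minimal scalar sketch of that behavior, assuming it matches what the SSE4/AVX2 code vectorizes:

```c
#include <stdint.h>
#include <assert.h>  /* for the self-check below */

/* sign(a) as -1, 0, or 1 */
static inline int8_t signOf(int a) { return (int8_t)((a > 0) - (a < 0)); }

/* dst[x] = sign(src1[x] - src2[x]) for x in [0, width) */
void calSign_c(int8_t *dst, const uint8_t *src1, const uint8_t *src2, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = signOf(src1[x] - src2[x]);
}
```

The vector versions process 16 (SSE4) or 32 (AVX2) pixels per iteration and use a pb_movemask blend to handle the partial tail instead of a scalar loop.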
x265_1.6.tar.gz/source/common/x86/loopfilter.h -> x265_1.7.tar.gz/source/common/x86/loopfilter.h
1
2
#ifndef X265_LOOPFILTER_H
3
#define X265_LOOPFILTER_H
4
5
-void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t signLeft);
6
+void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
7
+void x265_saoCuOrgE0_avx2(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
8
void x265_saoCuOrgE1_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
9
+void x265_saoCuOrgE1_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
10
+void x265_saoCuOrgE1_2Rows_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
11
+void x265_saoCuOrgE1_2Rows_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
12
void x265_saoCuOrgE2_sse4(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
13
+void x265_saoCuOrgE2_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
14
+void x265_saoCuOrgE2_32_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
15
void x265_saoCuOrgE3_sse4(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
16
+void x265_saoCuOrgE3_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
17
+void x265_saoCuOrgE3_32_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
18
void x265_saoCuOrgB0_sse4(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
19
+void x265_saoCuOrgB0_avx2(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
20
void x265_calSign_sse4(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
21
+void x265_calSign_avx2(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
22
23
#endif // ifndef X265_LOOPFILTER_H
24
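The updated saoCuOrgE0 prototype now takes a per-row signLeft pointer and a stride, since the kernel processes two rows per call. A hedged scalar sketch of the E0 (horizontal edge offset) classification these kernels vectorize; the 8-bit clip and the two-row loop are assumptions made for illustration, not the x265 source:

```c
#include <stdint.h>
#include <assert.h>  /* for the self-check below */

static inline int signOf2(int a) { return (a > 0) - (a < 0); }
static inline uint8_t clip8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

void saoCuOrgE0_c(uint8_t *rec, const int8_t *offsetEo, int width,
                  int8_t *signLeft, intptr_t stride)
{
    for (int y = 0; y < 2; y++)             /* the asm handles two rows per call */
    {
        int sLeft = signLeft[y];
        for (int x = 0; x < width; x++)
        {
            int sRight   = signOf2(rec[x] - rec[x + 1]);
            int edgeType = sRight + sLeft + 2;   /* index into offsetEo[0..4] */
            sLeft = -sRight;                     /* right sign becomes next left sign */
            rec[x] = clip8(rec[x] + offsetEo[edgeType]);
        }
        rec += stride;
    }
}
```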
x265_1.6.tar.gz/source/common/x86/mc-a.asm -> x265_1.7.tar.gz/source/common/x86/mc-a.asm
1
2
3
ADDAVG_W8_H4_AVX2 4
4
ADDAVG_W8_H4_AVX2 8
5
+ADDAVG_W8_H4_AVX2 12
6
ADDAVG_W8_H4_AVX2 16
7
ADDAVG_W8_H4_AVX2 32
8
+ADDAVG_W8_H4_AVX2 64
9
10
%macro ADDAVG_W12_H4_AVX2 1
11
INIT_YMM avx2
12
13
%endmacro
14
15
ADDAVG_W12_H4_AVX2 16
16
+ADDAVG_W12_H4_AVX2 32
17
18
%macro ADDAVG_W16_H4_AVX2 1
19
INIT_YMM avx2
20
21
ADDAVG_W16_H4_AVX2 8
22
ADDAVG_W16_H4_AVX2 12
23
ADDAVG_W16_H4_AVX2 16
24
+ADDAVG_W16_H4_AVX2 24
25
ADDAVG_W16_H4_AVX2 32
26
ADDAVG_W16_H4_AVX2 64
27
28
29
%endmacro
30
31
ADDAVG_W24_H2_AVX2 32
32
+ADDAVG_W24_H2_AVX2 64
33
34
%macro ADDAVG_W32_H2_AVX2 1
35
INIT_YMM avx2
36
37
ADDAVG_W32_H2_AVX2 16
38
ADDAVG_W32_H2_AVX2 24
39
ADDAVG_W32_H2_AVX2 32
40
+ADDAVG_W32_H2_AVX2 48
41
ADDAVG_W32_H2_AVX2 64
42
43
%macro ADDAVG_W64_H2_AVX2 1
44
x265_1.6.tar.gz/source/common/x86/pixel-a.asm -> x265_1.7.tar.gz/source/common/x86/pixel-a.asm
1
2
.end:
3
RET
4
5
+; Input 16bpp, Output 8bpp
6
+;-------------------------------------------------------------------------------------------------------------------------------------
7
+;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
8
+;-------------------------------------------------------------------------------------------------------------------------------------
9
+INIT_YMM avx2
10
+cglobal downShift_16, 6,7,3
11
+ movd xm0, r6m ; m0 = shift
12
+ add r1d, r1d
13
+ dec r5d
14
+.loopH:
15
+ xor r6, r6
16
+.loopW:
17
+ movu m1, [r0 + r6 * 2 + 0]
18
+ movu m2, [r0 + r6 * 2 + 32]
19
+ vpsrlw m1, xm0
20
+ vpsrlw m2, xm0
21
+ packuswb m1, m2
22
+ vpermq m1, m1, 11011000b
23
+ movu [r2 + r6], m1
24
+
25
+ add r6d, mmsize
26
+ cmp r6d, r4d
27
+ jl .loopW
28
+
29
+ ; move to next row
30
+ add r0, r1
31
+ add r2, r3
32
+ dec r5d
33
+ jnz .loopH
34
+
35
+; processing last row of every frame [To handle width which not a multiple of 32]
36
+ mov r6d, r4d
37
+ and r4d, 31
38
+ shr r6d, 5
39
+
40
+.loop32:
41
+ movu m1, [r0]
42
+ movu m2, [r0 + 32]
43
+ psrlw m1, xm0
44
+ psrlw m2, xm0
45
+ packuswb m1, m2
46
+ vpermq m1, m1, 11011000b
47
+ movu [r2], m1
48
+
49
+ add r0, 2*mmsize
50
+ add r2, mmsize
51
+ dec r6d
52
+ jnz .loop32
53
+
54
+ cmp r4d, 16
55
+ jl .process8
56
+ movu m1, [r0]
57
+ psrlw m1, xm0
58
+ packuswb m1, m1
59
+ vpermq m1, m1, 10001000b
60
+ movu [r2], xm1
61
+
62
+ add r0, mmsize
63
+ add r2, 16
64
+ sub r4d, 16
65
+ jz .end
66
+
67
+.process8:
68
+ cmp r4d, 8
69
+ jl .process4
70
+ movu m1, [r0]
71
+ psrlw m1, xm0
72
+ packuswb m1, m1
73
+ movq [r2], xm1
74
+
75
+ add r0, 16
76
+ add r2, 8
77
+ sub r4d, 8
78
+ jz .end
79
+
80
+.process4:
81
+ cmp r4d, 4
82
+ jl .process2
83
+ movq xm1,[r0]
84
+ psrlw m1, xm0
85
+ packuswb m1, m1
86
+ movd [r2], xm1
87
+
88
+ add r0, 8
89
+ add r2, 4
90
+ sub r4d, 4
91
+ jz .end
92
+
93
+.process2:
94
+ cmp r4d, 2
95
+ jl .process1
96
+ movd xm1, [r0]
97
+ psrlw m1, xm0
98
+ packuswb m1, m1
99
+ movd r6d, xm1
100
+ mov [r2], r6w
101
+
102
+ add r0, 4
103
+ add r2, 2
104
+ sub r4d, 2
105
+ jz .end
106
+
107
+.process1:
108
+ movd xm1, [r0]
109
+ psrlw m1, xm0
110
+ packuswb m1, m1
111
+ movd r3d, xm1
112
+ mov [r2], r3b
113
+.end:
114
+ RET
115
+
116
; Input 8bpp, Output 16bpp
117
;---------------------------------------------------------------------------------------------------------------------
118
;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift)
119
120
mov rsp, r5
121
RET
122
%endif
123
+
124
+;;---------------------------------------------------------------
125
+;; SATD AVX2
126
+;; int pixel_satd(const pixel*, intptr_t, const pixel*, intptr_t)
127
+;;---------------------------------------------------------------
128
+;; r0 - pix0
129
+;; r1 - pix0Stride
130
+;; r2 - pix1
131
+;; r3 - pix1Stride
132
+
133
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
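The calc_satd helpers below accumulate, per 4x4 sub-block, the sum of absolute values of the 2-D Hadamard transform of the residual. This scalar 4x4 sketch illustrates that computation; it is an illustration only, and the final normalization (x264-style kernels typically halve the sum) may differ from what the AVX2 code returns:

```c
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>  /* for the self-check below */

int satd_4x4_c(const uint8_t *pix0, intptr_t s0, const uint8_t *pix1, intptr_t s1)
{
    int d[4][4], tmp[4][4], sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = pix0[i * s0 + j] - pix1[i * s1 + j];
    for (int i = 0; i < 4; i++)            /* horizontal butterflies */
    {
        int a = d[i][0] + d[i][1], b = d[i][0] - d[i][1];
        int c = d[i][2] + d[i][3], e = d[i][2] - d[i][3];
        tmp[i][0] = a + c; tmp[i][1] = b + e;
        tmp[i][2] = a - c; tmp[i][3] = b - e;
    }
    for (int j = 0; j < 4; j++)            /* vertical butterflies + abs-sum */
    {
        int a = tmp[0][j] + tmp[1][j], b = tmp[0][j] - tmp[1][j];
        int c = tmp[2][j] + tmp[3][j], e = tmp[2][j] - tmp[3][j];
        sum += abs(a + c) + abs(b + e) + abs(a - c) + abs(b - e);
    }
    return sum;
}
```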
134
+INIT_YMM avx2
135
+cglobal calc_satd_16x8 ; function to compute satd cost for 16 columns, 8 rows
136
+ pxor m6, m6
137
+ vbroadcasti128 m0, [r0]
138
+ vbroadcasti128 m4, [r2]
139
+ vbroadcasti128 m1, [r0 + r1]
140
+ vbroadcasti128 m5, [r2 + r3]
141
+ pmaddubsw m4, m7
142
+ pmaddubsw m0, m7
143
+ pmaddubsw m5, m7
144
+ pmaddubsw m1, m7
145
+ psubw m0, m4
146
+ psubw m1, m5
147
+ vbroadcasti128 m2, [r0 + r1 * 2]
148
+ vbroadcasti128 m4, [r2 + r3 * 2]
149
+ vbroadcasti128 m3, [r0 + r4]
150
+ vbroadcasti128 m5, [r2 + r5]
151
+ pmaddubsw m4, m7
152
+ pmaddubsw m2, m7
153
+ pmaddubsw m5, m7
154
+ pmaddubsw m3, m7
155
+ psubw m2, m4
156
+ psubw m3, m5
157
+ lea r0, [r0 + r1 * 4]
158
+ lea r2, [r2 + r3 * 4]
159
+ paddw m4, m0, m1
160
+ psubw m1, m1, m0
161
+ paddw m0, m2, m3
162
+ psubw m3, m2
163
+ paddw m2, m4, m0
164
+ psubw m0, m4
165
+ paddw m4, m1, m3
166
+ psubw m3, m1
167
+ pabsw m2, m2
168
+ pabsw m0, m0
169
+ pabsw m4, m4
170
+ pabsw m3, m3
171
+ pblendw m1, m2, m0, 10101010b
172
+ pslld m0, 16
173
+ psrld m2, 16
174
+ por m0, m2
175
+ pmaxsw m1, m0
176
+ paddw m6, m1
177
+ pblendw m2, m4, m3, 10101010b
178
+ pslld m3, 16
179
+ psrld m4, 16
180
+ por m3, m4
181
+ pmaxsw m2, m3
182
+ paddw m6, m2
183
+ vbroadcasti128 m1, [r0]
184
+ vbroadcasti128 m4, [r2]
185
+ vbroadcasti128 m2, [r0 + r1]
186
+ vbroadcasti128 m5, [r2 + r3]
187
+ pmaddubsw m4, m7
188
+ pmaddubsw m1, m7
189
+ pmaddubsw m5, m7
190
+ pmaddubsw m2, m7
191
+ psubw m1, m4
192
+ psubw m2, m5
193
+ vbroadcasti128 m0, [r0 + r1 * 2]
194
+ vbroadcasti128 m4, [r2 + r3 * 2]
195
+ vbroadcasti128 m3, [r0 + r4]
196
+ vbroadcasti128 m5, [r2 + r5]
197
+ lea r0, [r0 + r1 * 4]
198
+ lea r2, [r2 + r3 * 4]
199
+ pmaddubsw m4, m7
200
+ pmaddubsw m0, m7
201
+ pmaddubsw m5, m7
202
+ pmaddubsw m3, m7
203
+ psubw m0, m4
204
+ psubw m3, m5
205
+ paddw m4, m1, m2
206
+ psubw m2, m1
207
+ paddw m1, m0, m3
208
+ psubw m3, m0
209
+ paddw m0, m4, m1
210
+ psubw m1, m4
211
+ paddw m4, m2, m3
212
+ psubw m3, m2
213
+ pabsw m0, m0
214
+ pabsw m1, m1
215
+ pabsw m4, m4
216
+ pabsw m3, m3
217
+ pblendw m2, m0, m1, 10101010b
218
+ pslld m1, 16
219
+ psrld m0, 16
220
+ por m1, m0
221
+ pmaxsw m2, m1
222
+ paddw m6, m2
223
+ pblendw m0, m4, m3, 10101010b
224
+ pslld m3, 16
225
+ psrld m4, 16
226
+ por m3, m4
227
+ pmaxsw m0, m3
228
+ paddw m6, m0
229
+ vextracti128 xm0, m6, 1
230
+ pmovzxwd m6, xm6
231
+ pmovzxwd m0, xm0
232
+ paddd m8, m6
233
+ paddd m9, m0
234
+ ret
235
+
236
+cglobal calc_satd_16x4 ; function to compute satd cost for 16 columns, 4 rows
237
+ pxor m6, m6
238
+ vbroadcasti128 m0, [r0]
239
+ vbroadcasti128 m4, [r2]
240
+ vbroadcasti128 m1, [r0 + r1]
241
+ vbroadcasti128 m5, [r2 + r3]
242
+ pmaddubsw m4, m7
243
+ pmaddubsw m0, m7
244
+ pmaddubsw m5, m7
245
+ pmaddubsw m1, m7
246
+ psubw m0, m4
247
+ psubw m1, m5
248
+ vbroadcasti128 m2, [r0 + r1 * 2]
249
+ vbroadcasti128 m4, [r2 + r3 * 2]
250
+ vbroadcasti128 m3, [r0 + r4]
251
+ vbroadcasti128 m5, [r2 + r5]
252
+ pmaddubsw m4, m7
253
+ pmaddubsw m2, m7
254
+ pmaddubsw m5, m7
255
+ pmaddubsw m3, m7
256
+ psubw m2, m4
257
+ psubw m3, m5
258
+ paddw m4, m0, m1
259
+ psubw m1, m1, m0
260
+ paddw m0, m2, m3
261
+ psubw m3, m2
262
+ paddw m2, m4, m0
263
+ psubw m0, m4
264
+ paddw m4, m1, m3
265
+ psubw m3, m1
266
+ pabsw m2, m2
267
+ pabsw m0, m0
268
+ pabsw m4, m4
269
+ pabsw m3, m3
270
+ pblendw m1, m2, m0, 10101010b
271
+ pslld m0, 16
272
+ psrld m2, 16
273
+ por m0, m2
274
+ pmaxsw m1, m0
275
+ paddw m6, m1
276
+ pblendw m2, m4, m3, 10101010b
277
+ pslld m3, 16
278
+ psrld m4, 16
279
+ por m3, m4
280
+ pmaxsw m2, m3
281
+ paddw m6, m2
282
+ vextracti128 xm0, m6, 1
283
+ pmovzxwd m6, xm6
284
+ pmovzxwd m0, xm0
285
+ paddd m8, m6
286
+ paddd m9, m0
287
+ ret
288
+
289
+cglobal pixel_satd_16x4, 4,6,10 ; if WIN64 && cpuflag(avx2)
290
+ mova m7, [hmul_16p]
291
+ lea r4, [3 * r1]
292
+ lea r5, [3 * r3]
293
+ pxor m8, m8
294
+ pxor m9, m9
295
+
296
+ call calc_satd_16x4
297
+
298
+ paddd m8, m9
299
+ vextracti128 xm0, m8, 1
300
+ paddd xm0, xm8
301
+ movhlps xm1, xm0
302
+ paddd xm0, xm1
303
+ pshuflw xm1, xm0, q0032
304
+ paddd xm0, xm1
305
+ movd eax, xm0
306
+ RET
307
+
308
+cglobal pixel_satd_16x12, 4,6,10 ; if WIN64 && cpuflag(avx2)
309
+ mova m7, [hmul_16p]
310
+ lea r4, [3 * r1]
311
+ lea r5, [3 * r3]
312
+ pxor m8, m8
313
+ pxor m9, m9
314
+
315
+ call calc_satd_16x8
316
+ call calc_satd_16x4
317
+
318
+ paddd m8, m9
319
+ vextracti128 xm0, m8, 1
320
+ paddd xm0, xm8
321
+ movhlps xm1, xm0
322
+ paddd xm0, xm1
323
+ pshuflw xm1, xm0, q0032
324
+ paddd xm0, xm1
325
+ movd eax, xm0
326
+ RET
327
+
328
+cglobal pixel_satd_16x32, 4,6,10 ; if WIN64 && cpuflag(avx2)
329
+ mova m7, [hmul_16p]
330
+ lea r4, [3 * r1]
331
+ lea r5, [3 * r3]
332
+ pxor m8, m8
333
+ pxor m9, m9
334
+
335
+ call calc_satd_16x8
336
+ call calc_satd_16x8
337
+ call calc_satd_16x8
338
+ call calc_satd_16x8
339
+
340
+ paddd m8, m9
341
+ vextracti128 xm0, m8, 1
342
+ paddd xm0, xm8
343
+ movhlps xm1, xm0
344
+ paddd xm0, xm1
345
+ pshuflw xm1, xm0, q0032
346
+ paddd xm0, xm1
347
+ movd eax, xm0
348
+ RET
349
+
350
+cglobal pixel_satd_16x64, 4,6,10 ; if WIN64 && cpuflag(avx2)
351
+ mova m7, [hmul_16p]
352
+ lea r4, [3 * r1]
353
+ lea r5, [3 * r3]
354
+ pxor m8, m8
355
+ pxor m9, m9
356
+
357
+ call calc_satd_16x8
358
+ call calc_satd_16x8
359
+ call calc_satd_16x8
360
+ call calc_satd_16x8
361
+ call calc_satd_16x8
362
+ call calc_satd_16x8
363
+ call calc_satd_16x8
364
+ call calc_satd_16x8
365
+
366
+ paddd m8, m9
367
+ vextracti128 xm0, m8, 1
368
+ paddd xm0, xm8
369
+ movhlps xm1, xm0
370
+ paddd xm0, xm1
371
+ pshuflw xm1, xm0, q0032
372
+ paddd xm0, xm1
373
+ movd eax, xm0
374
+ RET
375
+
376
+cglobal pixel_satd_32x8, 4,8,10 ; if WIN64 && cpuflag(avx2)
377
+ mova m7, [hmul_16p]
378
+ lea r4, [3 * r1]
379
+ lea r5, [3 * r3]
380
+ pxor m8, m8
381
+ pxor m9, m9
382
+ mov r6, r0
383
+ mov r7, r2
384
+
385
+ call calc_satd_16x8
386
+
387
+ lea r0, [r6 + 16]
388
+ lea r2, [r7 + 16]
389
+
390
+ call calc_satd_16x8
391
+
392
+ paddd m8, m9
393
+ vextracti128 xm0, m8, 1
394
+ paddd xm0, xm8
395
+ movhlps xm1, xm0
396
+ paddd xm0, xm1
397
+ pshuflw xm1, xm0, q0032
398
+ paddd xm0, xm1
399
+ movd eax, xm0
400
+ RET
401
+
402
+cglobal pixel_satd_32x16, 4,8,10 ; if WIN64 && cpuflag(avx2)
403
+ mova m7, [hmul_16p]
404
+ lea r4, [3 * r1]
405
+ lea r5, [3 * r3]
406
+ pxor m8, m8
407
+ pxor m9, m9
408
+ mov r6, r0
409
+ mov r7, r2
410
+
411
+ call calc_satd_16x8
412
+ call calc_satd_16x8
413
+
414
+ lea r0, [r6 + 16]
415
+ lea r2, [r7 + 16]
416
+
417
+ call calc_satd_16x8
418
+ call calc_satd_16x8
419
+
420
+ paddd m8, m9
421
+ vextracti128 xm0, m8, 1
422
+ paddd xm0, xm8
423
+ movhlps xm1, xm0
424
+ paddd xm0, xm1
425
+ pshuflw xm1, xm0, q0032
426
+ paddd xm0, xm1
427
+ movd eax, xm0
428
+ RET
429
+
430
+cglobal pixel_satd_32x24, 4,8,10 ; if WIN64 && cpuflag(avx2)
431
+ mova m7, [hmul_16p]
432
+ lea r4, [3 * r1]
433
+ lea r5, [3 * r3]
434
+ pxor m8, m8
435
+ pxor m9, m9
436
+ mov r6, r0
437
+ mov r7, r2
438
+
439
+ call calc_satd_16x8
440
+ call calc_satd_16x8
441
+ call calc_satd_16x8
442
+
443
+ lea r0, [r6 + 16]
444
+ lea r2, [r7 + 16]
445
+
446
+ call calc_satd_16x8
447
+ call calc_satd_16x8
448
+ call calc_satd_16x8
449
+
450
+ paddd m8, m9
451
+ vextracti128 xm0, m8, 1
452
+ paddd xm0, xm8
453
+ movhlps xm1, xm0
454
+ paddd xm0, xm1
455
+ pshuflw xm1, xm0, q0032
456
+ paddd xm0, xm1
457
+ movd eax, xm0
458
+ RET
459
+
460
+cglobal pixel_satd_32x32, 4,8,10 ; if WIN64 && cpuflag(avx2)
461
+ mova m7, [hmul_16p]
462
+ lea r4, [3 * r1]
463
+ lea r5, [3 * r3]
464
+ pxor m8, m8
465
+ pxor m9, m9
466
+ mov r6, r0
467
+ mov r7, r2
468
+
469
+ call calc_satd_16x8
470
+ call calc_satd_16x8
471
+ call calc_satd_16x8
472
+ call calc_satd_16x8
473
+
474
+ lea r0, [r6 + 16]
475
+ lea r2, [r7 + 16]
476
+
477
+ call calc_satd_16x8
478
+ call calc_satd_16x8
479
+ call calc_satd_16x8
480
+ call calc_satd_16x8
481
+
482
+ paddd m8, m9
483
+ vextracti128 xm0, m8, 1
484
+ paddd xm0, xm8
485
+ movhlps xm1, xm0
486
+ paddd xm0, xm1
487
+ pshuflw xm1, xm0, q0032
488
+ paddd xm0, xm1
489
+ movd eax, xm0
490
+ RET
491
+
492
+cglobal pixel_satd_32x64, 4,8,10 ; if WIN64 && cpuflag(avx2)
493
+ mova m7, [hmul_16p]
494
+ lea r4, [3 * r1]
495
+ lea r5, [3 * r3]
496
+ pxor m8, m8
497
+ pxor m9, m9
498
+ mov r6, r0
499
+ mov r7, r2
500
+
501
+ call calc_satd_16x8
502
+ call calc_satd_16x8
503
+ call calc_satd_16x8
504
+ call calc_satd_16x8
505
+ call calc_satd_16x8
506
+ call calc_satd_16x8
507
+ call calc_satd_16x8
508
+ call calc_satd_16x8
509
+
510
+ lea r0, [r6 + 16]
511
+ lea r2, [r7 + 16]
512
+
513
+ call calc_satd_16x8
514
+ call calc_satd_16x8
515
+ call calc_satd_16x8
516
+ call calc_satd_16x8
517
+ call calc_satd_16x8
518
+ call calc_satd_16x8
519
+ call calc_satd_16x8
520
+ call calc_satd_16x8
521
+
522
+ paddd m8, m9
523
+ vextracti128 xm0, m8, 1
524
+ paddd xm0, xm8
525
+ movhlps xm1, xm0
526
+ paddd xm0, xm1
527
+ pshuflw xm1, xm0, q0032
528
+ paddd xm0, xm1
529
+ movd eax, xm0
530
+ RET
531
+
532
+cglobal pixel_satd_48x64, 4,8,10 ; if WIN64 && cpuflag(avx2)
533
+ mova m7, [hmul_16p]
534
+ lea r4, [3 * r1]
535
+ lea r5, [3 * r3]
536
+ pxor m8, m8
537
+ pxor m9, m9
538
+ mov r6, r0
539
+ mov r7, r2
540
+
541
+ call calc_satd_16x8
542
+ call calc_satd_16x8
543
+ call calc_satd_16x8
544
+ call calc_satd_16x8
545
+ call calc_satd_16x8
546
+ call calc_satd_16x8
547
+ call calc_satd_16x8
548
+ call calc_satd_16x8
549
+ lea r0, [r6 + 16]
550
+ lea r2, [r7 + 16]
551
+ call calc_satd_16x8
552
+ call calc_satd_16x8
553
+ call calc_satd_16x8
554
+ call calc_satd_16x8
555
+ call calc_satd_16x8
556
+ call calc_satd_16x8
557
+ call calc_satd_16x8
558
+ call calc_satd_16x8
559
+ lea r0, [r6 + 32]
560
+ lea r2, [r7 + 32]
561
+ call calc_satd_16x8
562
+ call calc_satd_16x8
563
+ call calc_satd_16x8
564
+ call calc_satd_16x8
565
+ call calc_satd_16x8
566
+ call calc_satd_16x8
567
+ call calc_satd_16x8
568
+ call calc_satd_16x8
569
+
570
+ paddd m8, m9
571
+ vextracti128 xm0, m8, 1
572
+ paddd xm0, xm8
573
+ movhlps xm1, xm0
574
+ paddd xm0, xm1
575
+ pshuflw xm1, xm0, q0032
576
+ paddd xm0, xm1
577
+ movd eax, xm0
578
+ RET
579
+
580
+cglobal pixel_satd_64x16, 4,8,10 ; if WIN64 && cpuflag(avx2)
581
+ mova m7, [hmul_16p]
582
+ lea r4, [3 * r1]
583
+ lea r5, [3 * r3]
584
+ pxor m8, m8
585
+ pxor m9, m9
586
+ mov r6, r0
587
+ mov r7, r2
588
+
589
+ call calc_satd_16x8
590
+ call calc_satd_16x8
591
+ lea r0, [r6 + 16]
592
+ lea r2, [r7 + 16]
593
+ call calc_satd_16x8
594
+ call calc_satd_16x8
595
+ lea r0, [r6 + 32]
596
+ lea r2, [r7 + 32]
597
+ call calc_satd_16x8
598
+ call calc_satd_16x8
599
+ lea r0, [r6 + 48]
600
+ lea r2, [r7 + 48]
601
+ call calc_satd_16x8
602
+ call calc_satd_16x8
603
+
604
+ paddd m8, m9
605
+ vextracti128 xm0, m8, 1
606
+ paddd xm0, xm8
607
+ movhlps xm1, xm0
608
+ paddd xm0, xm1
609
+ pshuflw xm1, xm0, q0032
610
+ paddd xm0, xm1
611
+ movd eax, xm0
612
+ RET
613
+
614
+cglobal pixel_satd_64x32, 4,8,10 ; if WIN64 && cpuflag(avx2)
615
+ mova m7, [hmul_16p]
616
+ lea r4, [3 * r1]
617
+ lea r5, [3 * r3]
618
+ pxor m8, m8
619
+ pxor m9, m9
620
+ mov r6, r0
621
+ mov r7, r2
622
+
623
+ call calc_satd_16x8
624
+ call calc_satd_16x8
625
+ call calc_satd_16x8
626
+ call calc_satd_16x8
627
+ lea r0, [r6 + 16]
628
+ lea r2, [r7 + 16]
629
+ call calc_satd_16x8
630
+ call calc_satd_16x8
631
+ call calc_satd_16x8
632
+ call calc_satd_16x8
633
+ lea r0, [r6 + 32]
634
+ lea r2, [r7 + 32]
635
+ call calc_satd_16x8
636
+ call calc_satd_16x8
637
+ call calc_satd_16x8
638
+ call calc_satd_16x8
639
+ lea r0, [r6 + 48]
640
+ lea r2, [r7 + 48]
641
+ call calc_satd_16x8
642
+ call calc_satd_16x8
643
+ call calc_satd_16x8
644
+ call calc_satd_16x8
645
+
646
+ paddd m8, m9
647
+ vextracti128 xm0, m8, 1
648
+ paddd xm0, xm8
649
+ movhlps xm1, xm0
650
+ paddd xm0, xm1
651
+ pshuflw xm1, xm0, q0032
652
+ paddd xm0, xm1
653
+ movd eax, xm0
654
+ RET
655
+
656
+cglobal pixel_satd_64x48, 4,8,10 ; if WIN64 && cpuflag(avx2)
657
+ mova m7, [hmul_16p]
658
+ lea r4, [3 * r1]
659
+ lea r5, [3 * r3]
660
+ pxor m8, m8
661
+ pxor m9, m9
662
+ mov r6, r0
663
+ mov r7, r2
664
+
665
+ call calc_satd_16x8
666
+ call calc_satd_16x8
667
+ call calc_satd_16x8
668
+ call calc_satd_16x8
669
+ call calc_satd_16x8
670
+ call calc_satd_16x8
671
+ lea r0, [r6 + 16]
672
+ lea r2, [r7 + 16]
673
+ call calc_satd_16x8
674
+ call calc_satd_16x8
675
+ call calc_satd_16x8
676
+ call calc_satd_16x8
677
+ call calc_satd_16x8
678
+ call calc_satd_16x8
679
+ lea r0, [r6 + 32]
680
+ lea r2, [r7 + 32]
681
+ call calc_satd_16x8
682
+ call calc_satd_16x8
683
+ call calc_satd_16x8
684
+ call calc_satd_16x8
685
+ call calc_satd_16x8
686
+ call calc_satd_16x8
687
+ lea r0, [r6 + 48]
688
+ lea r2, [r7 + 48]
689
+ call calc_satd_16x8
690
+ call calc_satd_16x8
691
+ call calc_satd_16x8
692
+ call calc_satd_16x8
693
+ call calc_satd_16x8
694
+ call calc_satd_16x8
695
+
696
+ paddd m8, m9
697
+ vextracti128 xm0, m8, 1
698
+ paddd xm0, xm8
699
+ movhlps xm1, xm0
700
+ paddd xm0, xm1
701
+ pshuflw xm1, xm0, q0032
702
+ paddd xm0, xm1
703
+ movd eax, xm0
704
+ RET
705
+
706
+cglobal pixel_satd_64x64, 4,8,10 ; if WIN64 && cpuflag(avx2)
707
+ mova m7, [hmul_16p]
708
+ lea r4, [3 * r1]
709
+ lea r5, [3 * r3]
710
+ pxor m8, m8
711
+ pxor m9, m9
712
+ mov r6, r0
713
+ mov r7, r2
714
+
715
+ call calc_satd_16x8
716
+ call calc_satd_16x8
717
+ call calc_satd_16x8
718
+ call calc_satd_16x8
719
+ call calc_satd_16x8
720
+ call calc_satd_16x8
721
+ call calc_satd_16x8
722
+ call calc_satd_16x8
723
+ lea r0, [r6 + 16]
724
+ lea r2, [r7 + 16]
725
+ call calc_satd_16x8
726
+ call calc_satd_16x8
727
+ call calc_satd_16x8
728
+ call calc_satd_16x8
729
+ call calc_satd_16x8
730
+ call calc_satd_16x8
731
+ call calc_satd_16x8
732
+ call calc_satd_16x8
733
+ lea r0, [r6 + 32]
734
+ lea r2, [r7 + 32]
735
+ call calc_satd_16x8
736
+ call calc_satd_16x8
737
+ call calc_satd_16x8
738
+ call calc_satd_16x8
739
+ call calc_satd_16x8
740
+ call calc_satd_16x8
741
+ call calc_satd_16x8
742
+ call calc_satd_16x8
743
+ lea r0, [r6 + 48]
744
+ lea r2, [r7 + 48]
745
+ call calc_satd_16x8
746
+ call calc_satd_16x8
747
+ call calc_satd_16x8
748
+ call calc_satd_16x8
749
+ call calc_satd_16x8
750
+ call calc_satd_16x8
751
+ call calc_satd_16x8
752
+ call calc_satd_16x8
753
+
754
+ paddd m8, m9
755
+ vextracti128 xm0, m8, 1
756
+ paddd xm0, xm8
757
+ movhlps xm1, xm0
758
+ paddd xm0, xm1
759
+ pshuflw xm1, xm0, q0032
760
+ paddd xm0, xm1
761
+ movd eax, xm0
762
+ RET
763
+%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
764
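The epilogue repeated in each pixel_satd_NxM kernel above (paddd m8, m9 / vextracti128 / movhlps / pshuflw q0032 / movd eax) is a horizontal sum of the eight 32-bit lanes of the YMM accumulator. A minimal Python sketch of the lane arithmetic (the function name is illustrative, not part of x265):

```python
def hsum_ymm_dwords(lanes):
    """Model of the YMM horizontal-add epilogue used by the satd kernels.

    lanes: 8 dword values (lane 0 first), i.e. one YMM register.
    vextracti128 + paddd folds the high 128 bits onto the low 128,
    movhlps + paddd folds dword pairs, pshuflw(q0032) + paddd folds
    the last pair; movd eax then reads lane 0.
    """
    assert len(lanes) == 8
    x = [lanes[i] + lanes[i + 4] for i in range(4)]  # vextracti128 + paddd
    x = [x[0] + x[2], x[1] + x[3]]                   # movhlps + paddd
    return x[0] + x[1]                               # pshuflw + paddd, movd
```

The same five-instruction tail appears verbatim at the end of every block size, which is why only the calc_satd_16x8 call counts differ between the kernels.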
+
765
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
766
+INIT_YMM avx2
767
+cglobal calc_satd_16x8 ; function to compute satd cost for 16 columns, 8 rows
768
+ ; rows 0-3
769
+ movu m0, [r0]
770
+ movu m4, [r2]
771
+ psubw m0, m4
772
+ movu m1, [r0 + r1]
773
+ movu m5, [r2 + r3]
774
+ psubw m1, m5
775
+ movu m2, [r0 + r1 * 2]
776
+ movu m4, [r2 + r3 * 2]
777
+ psubw m2, m4
778
+ movu m3, [r0 + r4]
779
+ movu m5, [r2 + r5]
780
+ psubw m3, m5
781
+ lea r0, [r0 + r1 * 4]
782
+ lea r2, [r2 + r3 * 4]
783
+ paddw m4, m0, m1
784
+ psubw m1, m0
785
+ paddw m0, m2, m3
786
+ psubw m3, m2
787
+ punpckhwd m2, m4, m1
788
+ punpcklwd m4, m1
789
+ punpckhwd m1, m0, m3
790
+ punpcklwd m0, m3
791
+ paddw m3, m4, m0
792
+ psubw m0, m4
793
+ paddw m4, m2, m1
794
+ psubw m1, m2
795
+ punpckhdq m2, m3, m0
796
+ punpckldq m3, m0
797
+ paddw m0, m3, m2
798
+ psubw m2, m3
799
+ punpckhdq m3, m4, m1
800
+ punpckldq m4, m1
801
+ paddw m1, m4, m3
802
+ psubw m3, m4
803
+ punpckhqdq m4, m0, m1
804
+ punpcklqdq m0, m1
805
+ pabsw m0, m0
806
+ pabsw m4, m4
807
+ pmaxsw m0, m0, m4
808
+ punpckhqdq m1, m2, m3
809
+ punpcklqdq m2, m3
810
+ pabsw m2, m2
811
+ pabsw m1, m1
812
+ pmaxsw m2, m1
813
+ pxor m7, m7
814
+ mova m1, m0
815
+ punpcklwd m1, m7
816
+ paddd m6, m1
817
+ mova m1, m0
818
+ punpckhwd m1, m7
819
+ paddd m6, m1
820
+ pxor m7, m7
821
+ mova m1, m2
822
+ punpcklwd m1, m7
823
+ paddd m6, m1
824
+ mova m1, m2
825
+ punpckhwd m1, m7
826
+ paddd m6, m1
827
+ ; rows 4-7
828
+ movu m0, [r0]
829
+ movu m4, [r2]
830
+ psubw m0, m4
831
+ movu m1, [r0 + r1]
832
+ movu m5, [r2 + r3]
833
+ psubw m1, m5
834
+ movu m2, [r0 + r1 * 2]
835
+ movu m4, [r2 + r3 * 2]
836
+ psubw m2, m4
837
+ movu m3, [r0 + r4]
838
+ movu m5, [r2 + r5]
839
+ psubw m3, m5
840
+ lea r0, [r0 + r1 * 4]
841
+ lea r2, [r2 + r3 * 4]
842
+ paddw m4, m0, m1
843
+ psubw m1, m0
844
+ paddw m0, m2, m3
845
+ psubw m3, m2
846
+ punpckhwd m2, m4, m1
847
+ punpcklwd m4, m1
848
+ punpckhwd m1, m0, m3
849
+ punpcklwd m0, m3
850
+ paddw m3, m4, m0
851
+ psubw m0, m4
852
+ paddw m4, m2, m1
853
+ psubw m1, m2
854
+ punpckhdq m2, m3, m0
855
+ punpckldq m3, m0
856
+ paddw m0, m3, m2
857
+ psubw m2, m3
858
+ punpckhdq m3, m4, m1
859
+ punpckldq m4, m1
860
+ paddw m1, m4, m3
861
+ psubw m3, m4
862
+ punpckhqdq m4, m0, m1
863
+ punpcklqdq m0, m1
864
+ pabsw m0, m0
865
+ pabsw m4, m4
866
+ pmaxsw m0, m0, m4
867
+ punpckhqdq m1, m2, m3
868
+ punpcklqdq m2, m3
869
+ pabsw m2, m2
870
+ pabsw m1, m1
871
+ pmaxsw m2, m1
872
+ pxor m7, m7
873
+ mova m1, m0
874
+ punpcklwd m1, m7
875
+ paddd m6, m1
876
+ mova m1, m0
877
+ punpckhwd m1, m7
878
+ paddd m6, m1
879
+ pxor m7, m7
880
+ mova m1, m2
881
+ punpcklwd m1, m7
882
+ paddd m6, m1
883
+ mova m1, m2
884
+ punpckhwd m1, m7
885
+ paddd m6, m1
886
+ ret
887
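The paddw/psubw butterfly and punpck interleave sequence in calc_satd_16x8 is a vectorized Hadamard-based SATD over the word differences. As a scalar reference for the underlying primitive, here is a sketch of a 4x4 SATD in Python (following the common x264/x265 convention of halving the transformed sum; the exact scaling of the assembly kernels may differ per block primitive):

```python
def satd_4x4(pix1, pix2):
    """Reference SATD of a 4x4 block: sum of absolute values of the
    2-D Hadamard transform of the pixel differences, halved (the usual
    x264/x265 reporting convention -- a sketch, not the asm's exact path)."""
    d = [[pix1[i][j] - pix2[i][j] for j in range(4)] for i in range(4)]

    def hadamard4(a):
        # butterfly stage, same add/sub pattern as the paddw/psubw pairs
        s01, d01 = a[0] + a[1], a[0] - a[1]
        s23, d23 = a[2] + a[3], a[2] - a[3]
        return [s01 + s23, d01 + d23, s01 - s23, d01 - d23]

    rows = [hadamard4(r) for r in d]                                  # horizontal pass
    cols = [hadamard4([rows[i][j] for i in range(4)]) for j in range(4)]  # vertical pass
    return sum(abs(v) for c in cols for v in c) >> 1
```

The assembly computes the same butterflies on 16 columns at once, with pabsw/pmaxsw standing in for the absolute-value step and the punpck instructions performing the row/column transposes in-register.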
+
888
+cglobal calc_satd_16x4 ; function to compute satd cost for 16 columns, 4 rows
889
+ ; rows 0-3
890
+ movu m0, [r0]
891
+ movu m4, [r2]
892
+ psubw m0, m4
893
+ movu m1, [r0 + r1]
894
+ movu m5, [r2 + r3]
895
+ psubw m1, m5
896
+ movu m2, [r0 + r1 * 2]
897
+ movu m4, [r2 + r3 * 2]
898
+ psubw m2, m4
899
+ movu m3, [r0 + r4]
900
+ movu m5, [r2 + r5]
901
+ psubw m3, m5
902
+ lea r0, [r0 + r1 * 4]
903
+ lea r2, [r2 + r3 * 4]
904
+ paddw m4, m0, m1
905
+ psubw m1, m0
906
+ paddw m0, m2, m3
907
+ psubw m3, m2
908
+ punpckhwd m2, m4, m1
909
+ punpcklwd m4, m1
910
+ punpckhwd m1, m0, m3
911
+ punpcklwd m0, m3
912
+ paddw m3, m4, m0
913
+ psubw m0, m4
914
+ paddw m4, m2, m1
915
+ psubw m1, m2
916
+ punpckhdq m2, m3, m0
917
+ punpckldq m3, m0
918
+ paddw m0, m3, m2
919
+ psubw m2, m3
920
+ punpckhdq m3, m4, m1
921
+ punpckldq m4, m1
922
+ paddw m1, m4, m3
923
+ psubw m3, m4
924
+ punpckhqdq m4, m0, m1
925
+ punpcklqdq m0, m1
926
+ pabsw m0, m0
927
+ pabsw m4, m4
928
+ pmaxsw m0, m0, m4
929
+ punpckhqdq m1, m2, m3
930
+ punpcklqdq m2, m3
931
+ pabsw m2, m2
932
+ pabsw m1, m1
933
+ pmaxsw m2, m1
934
+ pxor m7, m7
935
+ mova m1, m0
936
+ punpcklwd m1, m7
937
+ paddd m6, m1
938
+ mova m1, m0
939
+ punpckhwd m1, m7
940
+ paddd m6, m1
941
+ pxor m7, m7
942
+ mova m1, m2
943
+ punpcklwd m1, m7
944
+ paddd m6, m1
945
+ mova m1, m2
946
+ punpckhwd m1, m7
947
+ paddd m6, m1
948
+ ret
949
+
950
+cglobal pixel_satd_16x4, 4,6,8
951
+ add r1d, r1d
952
+ add r3d, r3d
953
+ lea r4, [3 * r1]
954
+ lea r5, [3 * r3]
955
+ pxor m6, m6
956
+
957
+ call calc_satd_16x4
958
+
959
+ vextracti128 xm7, m6, 1
960
+ paddd xm6, xm7
961
+ pxor xm7, xm7
962
+ movhlps xm7, xm6
963
+ paddd xm6, xm7
964
+ pshufd xm7, xm6, 1
965
+ paddd xm6, xm7
966
+ movd eax, xm6
967
+ RET
968
+
969
+cglobal pixel_satd_16x8, 4,6,8
970
+ add r1d, r1d
971
+ add r3d, r3d
972
+ lea r4, [3 * r1]
973
+ lea r5, [3 * r3]
974
+ pxor m6, m6
975
+
976
+ call calc_satd_16x8
977
+
978
+ vextracti128 xm7, m6, 1
979
+ paddd xm6, xm7
980
+ pxor xm7, xm7
981
+ movhlps xm7, xm6
982
+ paddd xm6, xm7
983
+ pshufd xm7, xm6, 1
984
+ paddd xm6, xm7
985
+ movd eax, xm6
986
+ RET
987
+
988
+cglobal pixel_satd_16x12, 4,6,8
989
+ add r1d, r1d
990
+ add r3d, r3d
991
+ lea r4, [3 * r1]
992
+ lea r5, [3 * r3]
993
+ pxor m6, m6
994
+
995
+ call calc_satd_16x8
996
+ call calc_satd_16x4
997
+
998
+ vextracti128 xm7, m6, 1
999
+ paddd xm6, xm7
1000
+ pxor xm7, xm7
1001
+ movhlps xm7, xm6
1002
+ paddd xm6, xm7
1003
+ pshufd xm7, xm6, 1
1004
+ paddd xm6, xm7
1005
+ movd eax, xm6
1006
+ RET
1007
+
1008
+cglobal pixel_satd_16x16, 4,6,8
1009
+ add r1d, r1d
1010
+ add r3d, r3d
1011
+ lea r4, [3 * r1]
1012
+ lea r5, [3 * r3]
1013
+ pxor m6, m6
1014
+
1015
+ call calc_satd_16x8
1016
+ call calc_satd_16x8
1017
+
1018
+ vextracti128 xm7, m6, 1
1019
+ paddd xm6, xm7
1020
+ pxor xm7, xm7
1021
+ movhlps xm7, xm6
1022
+ paddd xm6, xm7
1023
+ pshufd xm7, xm6, 1
1024
+ paddd xm6, xm7
1025
+ movd eax, xm6
1026
+ RET
1027
+
1028
+cglobal pixel_satd_16x32, 4,6,8
1029
+ add r1d, r1d
1030
+ add r3d, r3d
1031
+ lea r4, [3 * r1]
1032
+ lea r5, [3 * r3]
1033
+ pxor m6, m6
1034
+
1035
+ call calc_satd_16x8
1036
+ call calc_satd_16x8
1037
+ call calc_satd_16x8
1038
+ call calc_satd_16x8
1039
+
1040
+ vextracti128 xm7, m6, 1
1041
+ paddd xm6, xm7
1042
+ pxor xm7, xm7
1043
+ movhlps xm7, xm6
1044
+ paddd xm6, xm7
1045
+ pshufd xm7, xm6, 1
1046
+ paddd xm6, xm7
1047
+ movd eax, xm6
1048
+ RET
1049
+
1050
+cglobal pixel_satd_16x64, 4,6,8
1051
+ add r1d, r1d
1052
+ add r3d, r3d
1053
+ lea r4, [3 * r1]
1054
+ lea r5, [3 * r3]
1055
+ pxor m6, m6
1056
+
1057
+ call calc_satd_16x8
1058
+ call calc_satd_16x8
1059
+ call calc_satd_16x8
1060
+ call calc_satd_16x8
1061
+ call calc_satd_16x8
1062
+ call calc_satd_16x8
1063
+ call calc_satd_16x8
1064
+ call calc_satd_16x8
1065
+
1066
+ vextracti128 xm7, m6, 1
1067
+ paddd xm6, xm7
1068
+ pxor xm7, xm7
1069
+ movhlps xm7, xm6
1070
+ paddd xm6, xm7
1071
+ pshufd xm7, xm6, 1
1072
+ paddd xm6, xm7
1073
+ movd eax, xm6
1074
+ RET
1075
+
1076
+cglobal pixel_satd_32x8, 4,8,8
1077
+ add r1d, r1d
1078
+ add r3d, r3d
1079
+ lea r4, [3 * r1]
1080
+ lea r5, [3 * r3]
1081
+ pxor m6, m6
1082
+ mov r6, r0
1083
+ mov r7, r2
1084
+
1085
+ call calc_satd_16x8
1086
+
1087
+ lea r0, [r6 + 32]
1088
+ lea r2, [r7 + 32]
1089
+
1090
+ call calc_satd_16x8
1091
+
1092
+ vextracti128 xm7, m6, 1
1093
+ paddd xm6, xm7
1094
+ pxor xm7, xm7
1095
+ movhlps xm7, xm6
1096
+ paddd xm6, xm7
1097
+ pshufd xm7, xm6, 1
1098
+ paddd xm6, xm7
1099
+ movd eax, xm6
1100
+ RET
1101
+
1102
+cglobal pixel_satd_32x16, 4,8,8
1103
+ add r1d, r1d
1104
+ add r3d, r3d
1105
+ lea r4, [3 * r1]
1106
+ lea r5, [3 * r3]
1107
+ pxor m6, m6
1108
+ mov r6, r0
1109
+ mov r7, r2
1110
+
1111
+ call calc_satd_16x8
1112
+ call calc_satd_16x8
1113
+
1114
+ lea r0, [r6 + 32]
1115
+ lea r2, [r7 + 32]
1116
+
1117
+ call calc_satd_16x8
1118
+ call calc_satd_16x8
1119
+
1120
+ vextracti128 xm7, m6, 1
1121
+ paddd xm6, xm7
1122
+ pxor xm7, xm7
1123
+ movhlps xm7, xm6
1124
+ paddd xm6, xm7
1125
+ pshufd xm7, xm6, 1
1126
+ paddd xm6, xm7
1127
+ movd eax, xm6
1128
+ RET
1129
+
1130
+cglobal pixel_satd_32x24, 4,8,8
1131
+ add r1d, r1d
1132
+ add r3d, r3d
1133
+ lea r4, [3 * r1]
1134
+ lea r5, [3 * r3]
1135
+ pxor m6, m6
1136
+ mov r6, r0
1137
+ mov r7, r2
1138
+
1139
+ call calc_satd_16x8
1140
+ call calc_satd_16x8
1141
+ call calc_satd_16x8
1142
+
1143
+ lea r0, [r6 + 32]
1144
+ lea r2, [r7 + 32]
1145
+
1146
+ call calc_satd_16x8
1147
+ call calc_satd_16x8
1148
+ call calc_satd_16x8
1149
+
1150
+ vextracti128 xm7, m6, 1
1151
+ paddd xm6, xm7
1152
+ pxor xm7, xm7
1153
+ movhlps xm7, xm6
1154
+ paddd xm6, xm7
1155
+ pshufd xm7, xm6, 1
1156
+ paddd xm6, xm7
1157
+ movd eax, xm6
1158
+ RET
1159
+
1160
+cglobal pixel_satd_32x32, 4,8,8
1161
+ add r1d, r1d
1162
+ add r3d, r3d
1163
+ lea r4, [3 * r1]
1164
+ lea r5, [3 * r3]
1165
+ pxor m6, m6
1166
+ mov r6, r0
1167
+ mov r7, r2
1168
+
1169
+ call calc_satd_16x8
1170
+ call calc_satd_16x8
1171
+ call calc_satd_16x8
1172
+ call calc_satd_16x8
1173
+
1174
+ lea r0, [r6 + 32]
1175
+ lea r2, [r7 + 32]
1176
+
1177
+ call calc_satd_16x8
1178
+ call calc_satd_16x8
1179
+ call calc_satd_16x8
1180
+ call calc_satd_16x8
1181
+
1182
+ vextracti128 xm7, m6, 1
1183
+ paddd xm6, xm7
1184
+ pxor xm7, xm7
1185
+ movhlps xm7, xm6
1186
+ paddd xm6, xm7
1187
+ pshufd xm7, xm6, 1
1188
+ paddd xm6, xm7
1189
+ movd eax, xm6
1190
+ RET
1191
+
1192
+cglobal pixel_satd_32x64, 4,8,8
1193
+ add r1d, r1d
1194
+ add r3d, r3d
1195
+ lea r4, [3 * r1]
1196
+ lea r5, [3 * r3]
1197
+ pxor m6, m6
1198
+ mov r6, r0
1199
+ mov r7, r2
1200
+
1201
+ call calc_satd_16x8
1202
+ call calc_satd_16x8
1203
+ call calc_satd_16x8
1204
+ call calc_satd_16x8
1205
+ call calc_satd_16x8
1206
+ call calc_satd_16x8
1207
+ call calc_satd_16x8
1208
+ call calc_satd_16x8
1209
+
1210
+ lea r0, [r6 + 32]
1211
+ lea r2, [r7 + 32]
1212
+
1213
+ call calc_satd_16x8
1214
+ call calc_satd_16x8
1215
+ call calc_satd_16x8
1216
+ call calc_satd_16x8
1217
+ call calc_satd_16x8
1218
+ call calc_satd_16x8
1219
+ call calc_satd_16x8
1220
+ call calc_satd_16x8
1221
+
1222
+ vextracti128 xm7, m6, 1
1223
+ paddd xm6, xm7
1224
+ pxor xm7, xm7
1225
+ movhlps xm7, xm6
1226
+ paddd xm6, xm7
1227
+ pshufd xm7, xm6, 1
1228
+ paddd xm6, xm7
1229
+ movd eax, xm6
1230
+ RET
1231
+
1232
+cglobal pixel_satd_48x64, 4,8,8
1233
+ add r1d, r1d
1234
+ add r3d, r3d
1235
+ lea r4, [3 * r1]
1236
+ lea r5, [3 * r3]
1237
+ pxor m6, m6
1238
+ mov r6, r0
1239
+ mov r7, r2
1240
+
1241
+ call calc_satd_16x8
1242
+ call calc_satd_16x8
1243
+ call calc_satd_16x8
1244
+ call calc_satd_16x8
1245
+ call calc_satd_16x8
1246
+ call calc_satd_16x8
1247
+ call calc_satd_16x8
1248
+ call calc_satd_16x8
1249
+
1250
+ lea r0, [r6 + 32]
1251
+ lea r2, [r7 + 32]
1252
+
1253
+ call calc_satd_16x8
1254
+ call calc_satd_16x8
1255
+ call calc_satd_16x8
1256
+ call calc_satd_16x8
1257
+ call calc_satd_16x8
1258
+ call calc_satd_16x8
1259
+ call calc_satd_16x8
1260
+ call calc_satd_16x8
1261
+
1262
+ lea r0, [r6 + 64]
1263
+ lea r2, [r7 + 64]
1264
+
1265
+ call calc_satd_16x8
1266
+ call calc_satd_16x8
1267
+ call calc_satd_16x8
1268
+ call calc_satd_16x8
1269
+ call calc_satd_16x8
1270
+ call calc_satd_16x8
1271
+ call calc_satd_16x8
1272
+ call calc_satd_16x8
1273
+
1274
+ vextracti128 xm7, m6, 1
1275
+ paddd xm6, xm7
1276
+ pxor xm7, xm7
1277
+ movhlps xm7, xm6
1278
+ paddd xm6, xm7
1279
+ pshufd xm7, xm6, 1
1280
+ paddd xm6, xm7
1281
+ movd eax, xm6
1282
+ RET
1283
+
1284
+cglobal pixel_satd_64x16, 4,8,8
1285
+ add r1d, r1d
1286
+ add r3d, r3d
1287
+ lea r4, [3 * r1]
1288
+ lea r5, [3 * r3]
1289
+ pxor m6, m6
1290
+ mov r6, r0
1291
+ mov r7, r2
1292
+
1293
+ call calc_satd_16x8
1294
+ call calc_satd_16x8
1295
+
1296
+ lea r0, [r6 + 32]
1297
+ lea r2, [r7 + 32]
1298
+
1299
+ call calc_satd_16x8
1300
+ call calc_satd_16x8
1301
+
1302
+ lea r0, [r6 + 64]
1303
+ lea r2, [r7 + 64]
1304
+
1305
+ call calc_satd_16x8
1306
+ call calc_satd_16x8
1307
+
1308
+ lea r0, [r6 + 96]
1309
+ lea r2, [r7 + 96]
1310
+
1311
+ call calc_satd_16x8
1312
+ call calc_satd_16x8
1313
+
1314
+ vextracti128 xm7, m6, 1
1315
+ paddd xm6, xm7
1316
+ pxor xm7, xm7
1317
+ movhlps xm7, xm6
1318
+ paddd xm6, xm7
1319
+ pshufd xm7, xm6, 1
1320
+ paddd xm6, xm7
1321
+ movd eax, xm6
1322
+ RET
1323
+
1324
+cglobal pixel_satd_64x32, 4,8,8
1325
+ add r1d, r1d
1326
+ add r3d, r3d
1327
+ lea r4, [3 * r1]
1328
+ lea r5, [3 * r3]
1329
+ pxor m6, m6
1330
+ mov r6, r0
1331
+ mov r7, r2
1332
+
1333
+ call calc_satd_16x8
1334
+ call calc_satd_16x8
1335
+ call calc_satd_16x8
1336
+ call calc_satd_16x8
1337
+
1338
+ lea r0, [r6 + 32]
1339
+ lea r2, [r7 + 32]
1340
+
1341
+ call calc_satd_16x8
1342
+ call calc_satd_16x8
1343
+ call calc_satd_16x8
1344
+ call calc_satd_16x8
1345
+
1346
+ lea r0, [r6 + 64]
1347
+ lea r2, [r7 + 64]
1348
+
1349
+ call calc_satd_16x8
1350
+ call calc_satd_16x8
1351
+ call calc_satd_16x8
1352
+ call calc_satd_16x8
1353
+
1354
+ lea r0, [r6 + 96]
1355
+ lea r2, [r7 + 96]
1356
+
1357
+ call calc_satd_16x8
1358
+ call calc_satd_16x8
1359
+ call calc_satd_16x8
1360
+ call calc_satd_16x8
1361
+
1362
+ vextracti128 xm7, m6, 1
1363
+ paddd xm6, xm7
1364
+ pxor xm7, xm7
1365
+ movhlps xm7, xm6
1366
+ paddd xm6, xm7
1367
+ pshufd xm7, xm6, 1
1368
+ paddd xm6, xm7
1369
+ movd eax, xm6
1370
+ RET
1371
+
1372
+cglobal pixel_satd_64x48, 4,8,8
1373
+ add r1d, r1d
1374
+ add r3d, r3d
1375
+ lea r4, [3 * r1]
1376
+ lea r5, [3 * r3]
1377
+ pxor m6, m6
1378
+ mov r6, r0
1379
+ mov r7, r2
1380
+
1381
+ call calc_satd_16x8
1382
+ call calc_satd_16x8
1383
+ call calc_satd_16x8
1384
+ call calc_satd_16x8
1385
+ call calc_satd_16x8
1386
+ call calc_satd_16x8
1387
+
1388
+ lea r0, [r6 + 32]
1389
+ lea r2, [r7 + 32]
1390
+
1391
+ call calc_satd_16x8
1392
+ call calc_satd_16x8
1393
+ call calc_satd_16x8
1394
+ call calc_satd_16x8
1395
+ call calc_satd_16x8
1396
+ call calc_satd_16x8
1397
+
1398
+ lea r0, [r6 + 64]
1399
+ lea r2, [r7 + 64]
1400
+
1401
+ call calc_satd_16x8
1402
+ call calc_satd_16x8
1403
+ call calc_satd_16x8
1404
+ call calc_satd_16x8
1405
+ call calc_satd_16x8
1406
+ call calc_satd_16x8
1407
+
1408
+ lea r0, [r6 + 96]
1409
+ lea r2, [r7 + 96]
1410
+
1411
+ call calc_satd_16x8
1412
+ call calc_satd_16x8
1413
+ call calc_satd_16x8
1414
+ call calc_satd_16x8
1415
+ call calc_satd_16x8
1416
+ call calc_satd_16x8
1417
+
1418
+ vextracti128 xm7, m6, 1
1419
+ paddd xm6, xm7
1420
+ pxor xm7, xm7
1421
+ movhlps xm7, xm6
1422
+ paddd xm6, xm7
1423
+ pshufd xm7, xm6, 1
1424
+ paddd xm6, xm7
1425
+ movd eax, xm6
1426
+ RET
1427
+
1428
+cglobal pixel_satd_64x64, 4,8,8
1429
+ add r1d, r1d
1430
+ add r3d, r3d
1431
+ lea r4, [3 * r1]
1432
+ lea r5, [3 * r3]
1433
+ pxor m6, m6
1434
+ mov r6, r0
1435
+ mov r7, r2
1436
+
1437
+ call calc_satd_16x8
1438
+ call calc_satd_16x8
1439
+ call calc_satd_16x8
1440
+ call calc_satd_16x8
1441
+ call calc_satd_16x8
1442
+ call calc_satd_16x8
1443
+ call calc_satd_16x8
1444
+ call calc_satd_16x8
1445
+
1446
+ lea r0, [r6 + 32]
1447
+ lea r2, [r7 + 32]
1448
+
1449
+ call calc_satd_16x8
1450
+ call calc_satd_16x8
1451
+ call calc_satd_16x8
1452
+ call calc_satd_16x8
1453
+ call calc_satd_16x8
1454
+ call calc_satd_16x8
1455
+ call calc_satd_16x8
1456
+ call calc_satd_16x8
1457
+
1458
+ lea r0, [r6 + 64]
1459
+ lea r2, [r7 + 64]
1460
+
1461
+ call calc_satd_16x8
1462
+ call calc_satd_16x8
1463
+ call calc_satd_16x8
1464
+ call calc_satd_16x8
1465
+ call calc_satd_16x8
1466
+ call calc_satd_16x8
1467
+ call calc_satd_16x8
1468
+ call calc_satd_16x8
1469
+
1470
+ lea r0, [r6 + 96]
1471
+ lea r2, [r7 + 96]
1472
+
1473
+ call calc_satd_16x8
1474
+ call calc_satd_16x8
1475
+ call calc_satd_16x8
1476
+ call calc_satd_16x8
1477
+ call calc_satd_16x8
1478
+ call calc_satd_16x8
1479
+ call calc_satd_16x8
1480
+ call calc_satd_16x8
1481
+
1482
+ vextracti128 xm7, m6, 1
1483
+ paddd xm6, xm7
1484
+ pxor xm7, xm7
1485
+ movhlps xm7, xm6
1486
+ paddd xm6, xm7
1487
+ pshufd xm7, xm6, 1
1488
+ paddd xm6, xm7
1489
+ movd eax, xm6
1490
+ RET
1491
+%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
1492
x265_1.6.tar.gz/source/common/x86/pixel-util.h -> x265_1.7.tar.gz/source/common/x86/pixel-util.h
Changed
33
1
2
float x265_pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width);
3
float x265_pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width);
4
5
-void x265_scale1D_128to64_ssse3(pixel*, const pixel*, intptr_t);
6
-void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t);
7
+void x265_scale1D_128to64_ssse3(pixel*, const pixel*);
8
+void x265_scale1D_128to64_avx2(pixel*, const pixel*);
9
void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t);
10
+void x265_scale2D_64to32_avx2(pixel*, const pixel*, intptr_t);
11
12
-int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
13
+int x265_scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
14
+int x265_scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
15
+uint32_t x265_findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
16
17
#define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \
18
void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
19
- void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* scr1, intptr_t srcStride0, intptr_t srcStride1);
20
+ void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1);
21
22
#define CHROMA_420_PIXELSUB_DEF(cpu) \
23
SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 4, cpu); \
24
25
26
#define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \
27
void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
28
- void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* scr1, intptr_t srcStride0, intptr_t srcStride1);
29
+ void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1);
30
31
#define LUMA_PIXELSUB_DEF(cpu) \
32
SETUP_LUMA_PIXELSUB_PS_FUNC(8, 8, cpu); \
33
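The new x265_scanPosLast_* prototypes above locate the last significant (non-zero) coefficient along a scan order. A scalar sketch of that core idea (the real routines also pack the per-coefficient-group coeffSign/coeffFlag/coeffNum outputs, which are omitted here):

```python
def scan_pos_last(scan, coeff):
    """Sketch of the scanPosLast idea: walk the scan table and return
    the scan position of the last non-zero coefficient, or -1 if the
    block is empty. scan maps scan position -> raster position."""
    last = -1
    for pos, raster in enumerate(scan):
        if coeff[raster] != 0:
            last = pos
    return last
```

The AVX2/BMI2 variant vectorizes the non-zero test and uses bit-manipulation instructions to find the highest set position, but the returned scan position is the same.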
x265_1.6.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.7.tar.gz/source/common/x86/pixel-util8.asm
Changed
1000
1
2
ssim_c1: times 4 dd 416 ; .01*.01*255*255*64
3
ssim_c2: times 4 dd 235963 ; .03*.03*255*255*64*63
4
%endif
5
-mask_ff: times 16 db 0xff
6
- times 16 db 0
7
-deinterleave_shuf: db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
8
-deinterleave_word_shuf: db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
9
-hmul_16p: times 16 db 1
10
- times 8 db 1, -1
11
-hmulw_16p: times 8 dw 1
12
- times 4 dw 1, -1
13
14
-trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
15
+mask_ff: times 16 db 0xff
16
+ times 16 db 0
17
+deinterleave_shuf: times 2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
18
+deinterleave_word_shuf: times 2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
19
+hmul_16p: times 16 db 1
20
+ times 8 db 1, -1
21
+hmulw_16p: times 8 dw 1
22
+ times 4 dw 1, -1
23
+
24
+trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
25
26
SECTION .text
27
28
29
cextern pb_2
30
cextern pb_4
31
cextern pb_8
32
+cextern pb_15
33
cextern pb_16
34
cextern pb_32
35
cextern pb_64
36
37
38
%if ARCH_X86_64 == 1
39
INIT_YMM avx2
40
-cglobal quant, 5,5,10
41
+cglobal quant, 5,6,9
42
; fill qbits
43
movd xm4, r4d ; m4 = qbits
44
45
46
; fill offset
47
vpbroadcastd m5, r5m ; m5 = add
48
49
- vpbroadcastw m9, [pw_1] ; m9 = word [1]
50
+ lea r5, [pw_1]
51
52
mov r4d, r6m
53
shr r4d, 4
54
55
56
; count non-zero coeff
57
    ; TODO: popcnt is faster, but some CPUs don't support it
58
- pminuw m2, m9
59
+ pminuw m2, [r5]
60
paddw m7, m2
61
62
add r0, mmsize
63
64
mov r6d, r6m
65
shl r6d, 16
66
or r6d, r5d ; assuming both (w0<<6) and round are using maximum of 16 bits each.
67
- movd xm0, r6d
68
- pshufd xm0, xm0, 0 ; m0 = [w0<<6, round]
69
- vinserti128 m0, m0, xm0, 1 ; document says (pshufd + vinserti128) can be replaced with vpbroadcastd m0, xm0, but having build problem, need to investigate
70
+
71
+ vpbroadcastd m0, r6d
72
73
movd xm1, r7m
74
vpbroadcastd m2, r8m
75
76
dec r5d
77
jnz .loopH
78
RET
79
+
80
+%if ARCH_X86_64
81
+INIT_YMM avx2
82
+cglobal weight_sp, 6, 9, 7
83
+ mov r7d, r7m
84
+ shl r7d, 16
85
+ or r7d, r6m
86
+ vpbroadcastd m0, r7d ; m0 = times 8 dw w0, round
87
+ movd xm1, r8m ; m1 = [shift]
88
+ vpbroadcastd m2, r9m ; m2 = times 16 dw offset
89
+ vpbroadcastw m3, [pw_1]
90
+ vpbroadcastw m4, [pw_2000]
91
+
92
+ add r2d, r2d ; 2 * srcstride
93
+
94
+ mov r7, r0
95
+ mov r8, r1
96
+.loopH:
97
+ mov r6d, r4d ; width
98
+
99
+ ; save old src and dst
100
+ mov r0, r7 ; src
101
+ mov r1, r8 ; dst
102
+.loopW:
103
+ movu m5, [r0]
104
+ paddw m5, m4
105
+
106
+ punpcklwd m6, m5, m3
107
+ pmaddwd m6, m0
108
+ psrad m6, xm1
109
+ paddd m6, m2
110
+
111
+ punpckhwd m5, m3
112
+ pmaddwd m5, m0
113
+ psrad m5, xm1
114
+ paddd m5, m2
115
+
116
+ packssdw m6, m5
117
+ packuswb m6, m6
118
+ vpermq m6, m6, 10001000b
119
+
120
+ sub r6d, 16
121
+ jl .width8
122
+ movu [r1], xm6
123
+ je .nextH
124
+ add r0, 32
125
+ add r1, 16
126
+ jmp .loopW
127
+
128
+.width8:
129
+ add r6d, 16
130
+ cmp r6d, 8
131
+ jl .width4
132
+ movq [r1], xm6
133
+ je .nextH
134
+ psrldq m6, 8
135
+ sub r6d, 8
136
+ add r1, 8
137
+
138
+.width4:
139
+ cmp r6d, 4
140
+ jl .width2
141
+ movd [r1], xm6
142
+ je .nextH
143
+ add r1, 4
144
+ pshufd m6, m6, 1
145
+
146
+.width2:
147
+ pextrw [r1], xm6, 0
148
+
149
+.nextH:
150
+ lea r7, [r7 + r2]
151
+ lea r8, [r8 + r3]
152
+
153
+ dec r5d
154
+ jnz .loopH
155
+ RET
156
+%endif
157
%endif ; end of (HIGH_BIT_DEPTH == 0)
158
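The weight_sp loop above applies weighted prediction to 16-bit intermediate samples. A scalar model of the per-sample arithmetic it implements (a sketch, assuming the usual x265 convention that the pw_2000 constant re-biases the intermediate by 0x2000):

```python
def weight_sp_sample(src, w0, round_, shift, offset):
    """Scalar model of one weight_sp sample: re-bias the 16-bit
    intermediate by 0x2000 (the paddw m5, m4 step), weight and round
    via pmaddwd with the packed (w0, round) pair, shift, add offset,
    and clip to 8 bits (packssdw/packuswb)."""
    v = ((src + 0x2000) * w0 + round_) >> shift
    v += offset
    return max(0, min(255, v))
```

In the assembly, the multiply-and-round is fused into one pmaddwd: the samples are interleaved with pw_1 so each dword product is x*w0 + 1*round.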
159
160
161
RET
162
%endif
163
164
+;-----------------------------------------------------------------
165
+; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride)
166
+;-----------------------------------------------------------------
167
+%if HIGH_BIT_DEPTH
168
+INIT_YMM avx2
169
+cglobal scale2D_64to32, 3, 4, 5, dest, src, stride
170
+ mov r3d, 32
171
+ add r2d, r2d
172
+ mova m4, [pw_2000]
173
+
174
+.loop:
175
+ movu m0, [r1]
176
+ movu m1, [r1 + 1 * mmsize]
177
+ movu m2, [r1 + r2]
178
+ movu m3, [r1 + r2 + 1 * mmsize]
179
+
180
+ paddw m0, m2
181
+ paddw m1, m3
182
+ phaddw m0, m1
183
+
184
+ pmulhrsw m0, m4
185
+ vpermq m0, m0, q3120
186
+ movu [r0], m0
187
+
188
+ movu m0, [r1 + 2 * mmsize]
189
+ movu m1, [r1 + 3 * mmsize]
190
+ movu m2, [r1 + r2 + 2 * mmsize]
+ movu m3, [r1 + r2 + 3 * mmsize]
+
+ paddw m0, m2
+ paddw m1, m3
+ phaddw m0, m1
+
+ pmulhrsw m0, m4
+ vpermq m0, m0, q3120
+ movu [r0 + mmsize], m0
+
+ add r0, 64
+ lea r1, [r1 + 2 * r2]
+ dec r3d
+ jnz .loop
+ RET
+%else
+
+INIT_YMM avx2
+cglobal scale2D_64to32, 3, 5, 8, dest, src, stride
+ mov r3d, 16
+ mova m7, [deinterleave_shuf]
+.loop:
+ movu m0, [r1] ; i
+ lea r4, [r1 + r2 * 2]
+ psrlw m1, m0, 8 ; j
+ movu m2, [r1 + r2] ; k
+ psrlw m3, m2, 8 ; l
+
+ pxor m4, m0, m1 ; i^j
+ pxor m5, m2, m3 ; k^l
+ por m4, m5 ; ij|kl
+
+ pavgb m0, m1 ; s
+ pavgb m2, m3 ; t
+ mova m5, m0
+ pavgb m0, m2 ; (s+t+1)/2
+ pxor m5, m2 ; s^t
+ pand m4, m5 ; (ij|kl)&st
+ pand m4, [pb_1]
+ psubb m0, m4 ; Result
+
+ movu m1, [r1 + 32] ; i
+ psrlw m2, m1, 8 ; j
+ movu m3, [r1 + r2 + 32] ; k
+ psrlw m4, m3, 8 ; l
+
+ pxor m5, m1, m2 ; i^j
+ pxor m6, m3, m4 ; k^l
+ por m5, m6 ; ij|kl
+
+ pavgb m1, m2 ; s
+ pavgb m3, m4 ; t
+ mova m6, m1
+ pavgb m1, m3 ; (s+t+1)/2
+ pxor m6, m3 ; s^t
+ pand m5, m6 ; (ij|kl)&st
+ pand m5, [pb_1]
+ psubb m1, m5 ; Result
+
+ pshufb m0, m0, m7
+ pshufb m1, m1, m7
+
+ punpcklqdq m0, m1
+ vpermq m0, m0, 11011000b
+ movu [r0], m0
+
+ add r0, 32
+
+ movu m0, [r4] ; i
+ psrlw m1, m0, 8 ; j
+ movu m2, [r4 + r2] ; k
+ psrlw m3, m2, 8 ; l
+
+ pxor m4, m0, m1 ; i^j
+ pxor m5, m2, m3 ; k^l
+ por m4, m5 ; ij|kl
+
+ pavgb m0, m1 ; s
+ pavgb m2, m3 ; t
+ mova m5, m0
+ pavgb m0, m2 ; (s+t+1)/2
+ pxor m5, m2 ; s^t
+ pand m4, m5 ; (ij|kl)&st
+ pand m4, [pb_1]
+ psubb m0, m4 ; Result
+
+ movu m1, [r4 + 32] ; i
+ psrlw m2, m1, 8 ; j
+ movu m3, [r4 + r2 + 32] ; k
+ psrlw m4, m3, 8 ; l
+
+ pxor m5, m1, m2 ; i^j
+ pxor m6, m3, m4 ; k^l
+ por m5, m6 ; ij|kl
+
+ pavgb m1, m2 ; s
+ pavgb m3, m4 ; t
+ mova m6, m1
+ pavgb m1, m3 ; (s+t+1)/2
+ pxor m6, m3 ; s^t
+ pand m5, m6 ; (ij|kl)&st
+ pand m5, [pb_1]
+ psubb m1, m5 ; Result
+
+ pshufb m0, m0, m7
+ pshufb m1, m1, m7
+
+ punpcklqdq m0, m1
+ vpermq m0, m0, 11011000b
+ movu [r0], m0
+
+ lea r1, [r1 + 4 * r2]
+ add r0, 32
+ dec r3d
+ jnz .loop
+ RET
+%endif

;-----------------------------------------------------------------------------
; void pixel_sub_ps_4x4(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);

;-----------------------------------------------------------------------------
; void pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%macro PIXELSUB_PS_W16_H4_avx2 1
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_16x%1, 6, 9, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+ add r1d, r1d
+ add r4d, r4d
+ add r5d, r5d
+ lea r6, [r1 * 3]
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+
+%rep %1/4
+ movu m0, [r2]
+ movu m1, [r3]
+ movu m2, [r2 + r4]
+ movu m3, [r3 + r5]
+
+ psubw m0, m1
+ psubw m2, m3
+
+ movu [r0], m0
+ movu [r0 + r1], m2
+
+ movu m0, [r2 + r4 * 2]
+ movu m1, [r3 + r5 * 2]
+ movu m2, [r2 + r7]
+ movu m3, [r3 + r8]
+
+ psubw m0, m1
+ psubw m2, m3
+
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r6], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+%endrep
+ RET
+%endif
+%endmacro
+PIXELSUB_PS_W16_H4_avx2 16
+PIXELSUB_PS_W16_H4_avx2 32
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_16x16(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
+%macro PIXELSUB_PS_W16_H8_avx2 2
+%if ARCH_X86_64
INIT_YMM avx2
-cglobal pixel_sub_ps_16x16, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+cglobal pixel_sub_ps_16x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
 add r1, r1
 lea r6, [r1 * 3]
+ mov r7d, %2/8

-%rep 4
+ lea r9, [r4 * 3]
+ lea r8, [r5 * 3]
+
+.loop
 pmovzxbw m0, [r2]
 pmovzxbw m1, [r3]
 pmovzxbw m2, [r2 + r4]
 pmovzxbw m3, [r3 + r5]
- lea r2, [r2 + r4 * 2]
- lea r3, [r3 + r5 * 2]

 psubw m0, m1
 psubw m2, m3

 movu [r0], m0
 movu [r0 + r1], m2

+ pmovzxbw m0, [r2 + 2 * r4]
+ pmovzxbw m1, [r3 + 2 * r5]
+ pmovzxbw m2, [r2 + r9]
+ pmovzxbw m3, [r3 + r8]
+
+ psubw m0, m1
+ psubw m2, m3
+
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r6], m2
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+
 pmovzxbw m0, [r2]
 pmovzxbw m1, [r3]
 pmovzxbw m2, [r2 + r4]

 psubw m0, m1
 psubw m2, m3

+ movu [r0], m0
+ movu [r0 + r1], m2
+
+ pmovzxbw m0, [r2 + 2 * r4]
+ pmovzxbw m1, [r3 + 2 * r5]
+ pmovzxbw m2, [r2 + r9]
+ pmovzxbw m3, [r3 + r8]
+
+ psubw m0, m1
+ psubw m2, m3
+
 movu [r0 + r1 * 2], m0
 movu [r0 + r6], m2

 lea r0, [r0 + r1 * 4]
- lea r2, [r2 + r4 * 2]
- lea r3, [r3 + r5 * 2]
-%endrep
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+
+ dec r7d
+ jnz .loop
 RET
+%endif
+%endmacro
+
+PIXELSUB_PS_W16_H8_avx2 16, 16
+PIXELSUB_PS_W16_H8_avx2 16, 32
+%endif
+
;-----------------------------------------------------------------------------
; void pixel_sub_ps_32x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------

;-----------------------------------------------------------------------------
; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%macro PIXELSUB_PS_W32_H4_avx2 1
+%if ARCH_X86_64
INIT_YMM avx2
-cglobal pixel_sub_ps_32x32, 6, 7, 4, dest, deststride, src0, src1, srcstride0, srcstride1
- mov r6d, 4
- add r1, r1
+cglobal pixel_sub_ps_32x%1, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+ add r1d, r1d
+ add r4d, r4d
+ add r5d, r5d
+ mov r9d, %1/4
+ lea r6, [r1 * 3]
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+
+.loop
+ movu m0, [r2]
+ movu m1, [r2 + 32]
+ movu m2, [r3]
+ movu m3, [r3 + 32]
+ psubw m0, m2
+ psubw m1, m3
+
+ movu [r0], m0
+ movu [r0 + 32], m1
+
+ movu m0, [r2 + r4]
+ movu m1, [r2 + r4 + 32]
+ movu m2, [r3 + r5]
+ movu m3, [r3 + r5 + 32]
+ psubw m0, m2
+ psubw m1, m3
+
+ movu [r0 + r1], m0
+ movu [r0 + r1 + 32], m1
+
+ movu m0, [r2 + r4 * 2]
+ movu m1, [r2 + r4 * 2 + 32]
+ movu m2, [r3 + r5 * 2]
+ movu m3, [r3 + r5 * 2 + 32]
+ psubw m0, m2
+ psubw m1, m3
+
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r1 * 2 + 32], m1
+
+ movu m0, [r2 + r7]
+ movu m1, [r2 + r7 + 32]
+ movu m2, [r3 + r8]
+ movu m3, [r3 + r8 + 32]
+ psubw m0, m2
+ psubw m1, m3
+
+ movu [r0 + r6], m0
+ movu [r0 + r6 + 32], m1
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ dec r9d
+ jnz .loop
+ RET
+%endif
+%endmacro
+PIXELSUB_PS_W32_H4_avx2 32
+PIXELSUB_PS_W32_H4_avx2 64
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
+%macro PIXELSUB_PS_W32_H8_avx2 2
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_32x%2, 6, 10, 4, dest, deststride, src0, src1, srcstride0, srcstride1
+ mov r6d, %2/8
+ add r1, r1
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+ lea r9, [r1 * 3]

.loop:
 pmovzxbw m0, [r2]

 movu [r0 + r1], m0
 movu [r0 + r1 + 32], m1

- add r2, r4
- add r3, r5
-
- pmovzxbw m0, [r2 + r4]
- pmovzxbw m1, [r2 + r4 + 16]
- pmovzxbw m2, [r3 + r5]
- pmovzxbw m3, [r3 + r5 + 16]
+ pmovzxbw m0, [r2 + 2 * r4]
+ pmovzxbw m1, [r2 + 2 * r4 + 16]
+ pmovzxbw m2, [r3 + 2 * r5]
+ pmovzxbw m3, [r3 + 2 * r5 + 16]

 psubw m0, m2
 psubw m1, m3
- lea r0, [r0 + r1 * 2]

- movu [r0 ], m0
- movu [r0 + 32], m1
-
- add r2, r4
- add r3, r5
+ movu [r0 + r1 * 2 ], m0
+ movu [r0 + r1 * 2 + 32], m1

- pmovzxbw m0, [r2 + r4]
- pmovzxbw m1, [r2 + r4 + 16]
- pmovzxbw m2, [r3 + r5]
- pmovzxbw m3, [r3 + r5 + 16]
+ pmovzxbw m0, [r2 + r7]
+ pmovzxbw m1, [r2 + r7 + 16]
+ pmovzxbw m2, [r3 + r8]
+ pmovzxbw m3, [r3 + r8 + 16]

 psubw m0, m2
 psubw m1, m3
- add r0, r1

- movu [r0 ], m0
- movu [r0 + 32], m1
+ movu [r0 + r9], m0
+ movu [r0 + r9 + 32], m1

- add r2, r4
- add r3, r5
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ lea r0, [r0 + r1 * 4]

- pmovzxbw m0, [r2 + r4]
- pmovzxbw m1, [r2 + r4 + 16]
- pmovzxbw m2, [r3 + r5]
- pmovzxbw m3, [r3 + r5 + 16]
+ pmovzxbw m0, [r2]
+ pmovzxbw m1, [r2 + 16]
+ pmovzxbw m2, [r3]
+ pmovzxbw m3, [r3 + 16]

 psubw m0, m2
 psubw m1, m3
- add r0, r1

 movu [r0 ], m0
 movu [r0 + 32], m1

- add r2, r4
- add r3, r5
-
 pmovzxbw m0, [r2 + r4]
 pmovzxbw m1, [r2 + r4 + 16]
 pmovzxbw m2, [r3 + r5]

 psubw m0, m2
 psubw m1, m3
- add r0, r1

- movu [r0 ], m0
- movu [r0 + 32], m1
+ movu [r0 + r1], m0
+ movu [r0 + r1 + 32], m1

- add r2, r4
- add r3, r5
-
- pmovzxbw m0, [r2 + r4]
- pmovzxbw m1, [r2 + r4 + 16]
- pmovzxbw m2, [r3 + r5]
- pmovzxbw m3, [r3 + r5 + 16]
+ pmovzxbw m0, [r2 + 2 * r4]
+ pmovzxbw m1, [r2 + 2 * r4 + 16]
+ pmovzxbw m2, [r3 + 2 * r5]
+ pmovzxbw m3, [r3 + 2 * r5 + 16]

 psubw m0, m2
 psubw m1, m3
- add r0, r1

- movu [r0 ], m0
- movu [r0 + 32], m1
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r1 * 2 + 32], m1

- add r2, r4
- add r3, r5
-
- pmovzxbw m0, [r2 + r4]
- pmovzxbw m1, [r2 + r4 + 16]
- pmovzxbw m2, [r3 + r5]
- pmovzxbw m3, [r3 + r5 + 16]
+ pmovzxbw m0, [r2 + r7]
+ pmovzxbw m1, [r2 + r7 + 16]
+ pmovzxbw m2, [r3 + r8]
+ pmovzxbw m3, [r3 + r8 + 16]

 psubw m0, m2
 psubw m1, m3
- add r0, r1

- movu [r0 ], m0
- movu [r0 + 32], m1
+ movu [r0 + r9], m0
+ movu [r0 + r9 + 32], m1

- lea r0, [r0 + r1]
- lea r2, [r2 + r4 * 2]
- lea r3, [r3 + r5 * 2]
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]

 dec r6d
 jnz .loop
 RET
+%endif
+%endmacro
+
+PIXELSUB_PS_W32_H8_avx2 32, 32
+PIXELSUB_PS_W32_H8_avx2 32, 64
+%endif

;-----------------------------------------------------------------------------
; void pixel_sub_ps_64x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);

;-----------------------------------------------------------------------------
; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_sub_ps_64x64, 6, 10, 8, dest, deststride, src0, src1, srcstride0, srcstride1
+ add r1d, r1d
+ add r4d, r4d
+ add r5d, r5d
+ mov r9d, 16
+ lea r6, [r1 * 3]
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+
+.loop
+ movu m0, [r2]
+ movu m1, [r2 + 32]
+ movu m2, [r2 + 64]
+ movu m3, [r2 + 96]
+ movu m4, [r3]
+ movu m5, [r3 + 32]
+ movu m6, [r3 + 64]
+ movu m7, [r3 + 96]
+ psubw m0, m4
+ psubw m1, m5
+ psubw m2, m6
+ psubw m3, m7
+
+ movu [r0], m0
+ movu [r0 + 32], m1
+ movu [r0 + 64], m2
+ movu [r0 + 96], m3
+
+ movu m0, [r2 + r4]
+ movu m1, [r2 + r4 + 32]
+ movu m2, [r2 + r4 + 64]
+ movu m3, [r2 + r4 + 96]
+ movu m4, [r3 + r5]
+ movu m5, [r3 + r5 + 32]
+ movu m6, [r3 + r5 + 64]
+ movu m7, [r3 + r5 + 96]
+ psubw m0, m4
+ psubw m1, m5
+ psubw m2, m6
+ psubw m3, m7
+
+ movu [r0 + r1], m0
+ movu [r0 + r1 + 32], m1
+ movu [r0 + r1 + 64], m2
+ movu [r0 + r1 + 96], m3
+
+ movu m0, [r2 + r4 * 2]
+ movu m1, [r2 + r4 * 2 + 32]
+ movu m2, [r2 + r4 * 2 + 64]
+ movu m3, [r2 + r4 * 2 + 96]
+ movu m4, [r3 + r5 * 2]
+ movu m5, [r3 + r5 * 2 + 32]
+ movu m6, [r3 + r5 * 2 + 64]
+ movu m7, [r3 + r5 * 2 + 96]
+ psubw m0, m4
+ psubw m1, m5
+ psubw m2, m6
+ psubw m3, m7
+
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r1 * 2 + 32], m1
+ movu [r0 + r1 * 2 + 64], m2
+ movu [r0 + r1 * 2 + 96], m3
+
+ movu m0, [r2 + r7]
+ movu m1, [r2 + r7 + 32]
+ movu m2, [r2 + r7 + 64]
+ movu m3, [r2 + r7 + 96]
+ movu m4, [r3 + r8]
+ movu m5, [r3 + r8 + 32]
+ movu m6, [r3 + r8 + 64]
+ movu m7, [r3 + r8 + 96]
+ psubw m0, m4
+ psubw m1, m5
+ psubw m2, m6
+ psubw m3, m7
+
+ movu [r0 + r6], m0
+ movu [r0 + r6 + 32], m1
+ movu [r0 + r6 + 64], m2
+ movu [r0 + r6 + 96], m3
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ dec r9d
+ jnz .loop
+ RET
+%endif
+%else
+;-----------------------------------------------------------------------------
+; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1);
+;-----------------------------------------------------------------------------
INIT_YMM avx2
cglobal pixel_sub_ps_64x64, 6, 7, 8, dest, deststride, src0, src1, srcstride0, srcstride1
 mov r6d, 16

 dec r6d
 jnz .loop
 RET
-
+%endif
;=============================================================================
; variance
;=============================================================================

 RET
%endmacro

-;int x265_test_func(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+;int scanPosLast(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize)
;{
; int scanPosLast = 0;
; do

;}

%if ARCH_X86_64 == 1
+INIT_XMM avx2,bmi2
+cglobal scanPosLast, 7,11,6
+ ; convert unit of Stride(trSize) to int16_t
+ mov r7d, r7m
+ add r7d, r7d
+
+ ; loading scan table and convert to Byte
+ mova m0, [r6]
+ packuswb m0, [r6 + mmsize]
+ pxor m1, m0, [pb_15]
+
+ ; clear CG count
+ xor r9d, r9d
+
+ ; m0 - Zigzag scan table
+ ; m1 - revert order scan table
+ ; m4 - zero
+ ; m5 - ones
+
+ pxor m4, m4
+ pcmpeqb m5, m5
+ lea r8d, [r7d * 3]
+
+.loop:
+ ; position of current CG
+ movzx r6d, word [r0]
+ lea r6, [r6 * 2 + r1]
+ add r0, 16 * 2
+
+ ; loading current CG
+ movh m2, [r6]
+ movhps m2, [r6 + r7]
+ movh m3, [r6 + r7 * 2]
+ movhps m3, [r6 + r8]
+ packsswb m2, m3
+
+ ; Zigzag
+ pshufb m3, m2, m0
+ pshufb m2, m1
+
+ ; get sign
+ pmovmskb r6d, m3
+ pcmpeqb m3, m4
+ pmovmskb r10d, m3
+ not r10d
+ pext r6d, r6d, r10d
+ mov [r2 + r9 * 2], r6w
+
+ ; get non-zero flag
+ ; TODO: reuse above result with reorder
+ pcmpeqb m2, m4
+ pxor m2, m5
+ pmovmskb r6d, m2
+ mov [r3 + r9 * 2], r6w
+
+ ; get non-zero number, POPCNT is faster
+ pabsb m2, m2
+ psadbw m2, m4
+ movhlps m3, m2
+ paddd m2, m3
+ movd r6d, m2
+ mov [r4 + r9], r6b
+
+ inc r9d
+ sub r5d, r6d
+ jg .loop
+
+ ; fixup last CG non-zero flag
+ dec r9d
+ movzx r0d, word [r3 + r9 * 2]
+;%if cpuflag(bmi1) ; 2uops?
+; tzcnt r1d, r0d
+;%else
+ bsf r1d, r0d
+;%endif
+ shrx r0d, r0d, r1d
+ mov [r3 + r9 * 2], r0w
+
+ ; get last pos
+ mov eax, r9d
+ shl eax, 4
+ xor r1d, 15
+ add eax, r1d
+ RET
+
+
+; t3 must be ecx, since it's used for shift.
+%if WIN64
+ DECLARE_REG_TMP 3,1,2,0
+%elif ARCH_X86_64
+ DECLARE_REG_TMP 0,1,2,3
+%else ; X86_32
+ %error Unsupported platform X86_32
+%endif
INIT_CPUFLAGS
-cglobal findPosLast_x64, 5,12
+cglobal scanPosLast_x64, 5,12
+ mov r10, r3mp
+ movifnidn t0, r0mp
 mov r5d, r5m
 xor r11d, r11d ; cgIdx
 xor r7d, r7d ; tmp for non-zero flag

.loop:
 xor r8d, r8d ; coeffSign[]
 xor r9d, r9d ; coeffFlag[]
- xor r10d, r10d ; coeffNum[]
+ xor t3d, t3d ; coeffNum[]

%assign x 0
%rep 16
- movzx r6d, word [r0 + x * 2]
- movsx r6d, word [r1 + r6 * 2]
+ movzx r6d, word [t0 + x * 2]
+ movsx r6d, word [t1 + r6 * 2]
 test r6d, r6d
 setnz r7b
 shr r6d, 31
- shlx r6d, r6d, r10d
+ shl r6d, t3b
 or r8d, r6d
 lea r9, [r9 * 2 + r7]
- add r10d, r7d
+ add t3d, r7d
%assign x x+1
%endrep

; store latest group data
- mov [r2 + r11 * 2], r8w
- mov [r3 + r11 * 2], r9w
- mov [r4 + r11], r10b
+ mov [t2 + r11 * 2], r8w
+ mov [r10 + r11 * 2], r9w
+ mov [r4 + r11], t3b
 inc r11d

- add r0, 16 * 2
- sub r5d, r10d
+ add t0, 16 * 2
+ sub r5d, t3d
 jnz .loop

; store group data
- tzcnt r6d, r9d
- shrx r9d, r9d, r6d
- mov [r3 + (r11 - 1) * 2], r9w
+ bsf t3d, r9d
+ shr r9d, t3b
+ mov [r10 + (r11 - 1) * 2], r9w

; get posLast
 shl r11d, 4
- sub r11d, r6d
+ sub r11d, t3d
 lea eax, [r11d - 1]
 RET
%endif
+
+
+;-----------------------------------------------------------------------------
+; uint32_t[last first] findPosFirstAndLast(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal findPosFirstLast, 3,3,3
+ ; convert stride to int16_t
+ add r1d, r1d
+
+ ; loading scan table and convert to Byte
+ mova m0, [r2]
+ packuswb m0, [r2 + mmsize]
+
+ ; loading 16 of coeff
+ movh m1, [r0]
+ movhps m1, [r0 + r1]
+ movh m2, [r0 + r1 * 2]
+ lea r1, [r1 * 3]
+ movhps m2, [r0 + r1]
+ packsswb m1, m2
+
+ ; get non-zero mask
+ pxor m2, m2
+ pcmpeqb m1, m2
+
+ ; reorder by Zigzag scan
+ pshufb m1, m0
+
+ ; get First and Last pos
+ xor eax, eax
+ pmovmskb r0d, m1
+ not r0w
+ bsr r1w, r0w
+ bsf ax, r0w
+ shl r1d, 16
+ or eax, r1d
+ RET
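The SSSE3 findPosFirstLast routine added above packs the last and first non-zero scan positions of a 4x4 coefficient group into one 32-bit value, per the prototype in its comment. The following scalar C sketch illustrates that packing convention; the function name `findPosFirstLast_c`, the scan-table layout (scan index to raster index within the group), and the all-zero fallback are assumptions for illustration, not the x265 source.

```c
#include <stdint.h>

/* Hypothetical scalar reference: scan a 4x4 coefficient group in the order
 * given by scanTbl and return (lastNZ << 16) | firstNZ, as the assembly
 * packs it via bsr/bsf on the non-zero byte mask. Assumes scanTbl[i] is the
 * raster position (0..15) of scan index i inside the group, and dstCoeff is
 * addressed with row stride trSize. Callers are expected to pass a group
 * with at least one non-zero coefficient (bsf/bsr on zero are undefined). */
static uint32_t findPosFirstLast_c(const int16_t *dstCoeff, intptr_t trSize,
                                   const uint16_t scanTbl[16])
{
    uint32_t first = 16, last = 0;
    for (int i = 0; i < 16; i++)
    {
        uint16_t raster = scanTbl[i];
        int16_t coeff = dstCoeff[(raster >> 2) * trSize + (raster & 3)];
        if (coeff)
        {
            if (first == 16)
                first = (uint32_t)i;
            last = (uint32_t)i;
        }
    }
    return (last << 16) | first;
}
```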
x265_1.6.tar.gz/source/common/x86/pixel.h -> x265_1.7.tar.gz/source/common/x86/pixel.h
Changed
ADDAVG(addAvg_32x48)

void x265_downShift_16_sse2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
+void x265_downShift_16_avx2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
void x265_upShift_8_sse4(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
int x265_psyCost_pp_4x4_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
int x265_psyCost_pp_8x8_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);

void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_16x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_32x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);

void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_16x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_32x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);

int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);

int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+void x265_weight_sp_avx2(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);

#undef DECL_PIXELS
#undef DECL_HEVC_SSD
x265_1.6.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.7.tar.gz/source/common/x86/pixeladd8.asm
Changed

 jnz .loop
 RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W16_H4 16, 16
+PIXEL_ADD_PS_W16_H4 16, 32

+;-----------------------------------------------------------------------------
+; void pixel_add_ps_16x16(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W16_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
INIT_YMM avx2
-cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
- mov r6d, %2/4
+cglobal pixel_add_ps_16x%1, 6, 10, 4, dest, destride, src0, scr1, srcStride0, srcStride1
+ mova m3, [pw_pixel_max]
+ pxor m2, m2
+ mov r6d, %1/4
+ add r4d, r4d
+ add r5d, r5d
+ add r1d, r1d
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+ lea r9, [r1 * 3]
+
+.loop:
+ movu m0, [r2]
+ movu m1, [r3]
+ paddw m0, m1
+ CLIPW m0, m2, m3
+ movu [r0], m0
+
+ movu m0, [r2 + r4]
+ movu m1, [r3 + r5]
+ paddw m0, m1
+ CLIPW m0, m2, m3
+ movu [r0 + r1], m0
+
+ movu m0, [r2 + r4 * 2]
+ movu m1, [r3 + r5 * 2]
+ paddw m0, m1
+ CLIPW m0, m2, m3
+ movu [r0 + r1 * 2], m0
+
+ movu m0, [r2 + r7]
+ movu m1, [r3 + r8]
+ paddw m0, m1
+ CLIPW m0, m2, m3
+ movu [r0 + r9], m0
+
+ dec r6d
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ jnz .loop
+ RET
+%endif
+%else
+INIT_YMM avx2
+cglobal pixel_add_ps_16x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+ mov r6d, %1/4
 add r5, r5
.loop:

%endif
%endmacro

-PIXEL_ADD_PS_W16_H4 16, 16
-PIXEL_ADD_PS_W16_H4 16, 32
+PIXEL_ADD_PS_W16_H4_avx2 16
+PIXEL_ADD_PS_W16_H4_avx2 32

;-----------------------------------------------------------------------------

 jnz .loop
 RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W32_H2 32, 32
+PIXEL_ADD_PS_W32_H2 32, 64

+;-----------------------------------------------------------------------------
+; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W32_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
INIT_YMM avx2
-cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
- mov r6d, %2/4
+cglobal pixel_add_ps_32x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1
+ mova m5, [pw_pixel_max]
+ pxor m4, m4
+ mov r6d, %1/4
+ add r4d, r4d
+ add r5d, r5d
+ add r1d, r1d
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+ lea r9, [r1 * 3]
+
+.loop:
+ movu m0, [r2]
+ movu m2, [r2 + 32]
+ movu m1, [r3]
+ movu m3, [r3 + 32]
+ paddw m0, m1
+ paddw m2, m3
+ CLIPW2 m0, m2, m4, m5
+
+ movu [r0], m0
+ movu [r0 + 32], m2
+
+ movu m0, [r2 + r4]
+ movu m2, [r2 + r4 + 32]
+ movu m1, [r3 + r5]
+ movu m3, [r3 + r5 + 32]
+ paddw m0, m1
+ paddw m2, m3
+ CLIPW2 m0, m2, m4, m5
+
+ movu [r0 + r1], m0
+ movu [r0 + r1 + 32], m2
+
+ movu m0, [r2 + r4 * 2]
+ movu m2, [r2 + r4 * 2 + 32]
+ movu m1, [r3 + r5 * 2]
+ movu m3, [r3 + r5 * 2 + 32]
+ paddw m0, m1
+ paddw m2, m3
+ CLIPW2 m0, m2, m4, m5
+
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r1 * 2 + 32], m2
+
+ movu m0, [r2 + r7]
+ movu m2, [r2 + r7 + 32]
+ movu m1, [r3 + r8]
+ movu m3, [r3 + r8 + 32]
+ paddw m0, m1
+ paddw m2, m3
+ CLIPW2 m0, m2, m4, m5
+
+ movu [r0 + r9], m0
+ movu [r0 + r9 + 32], m2
+
+ dec r6d
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ jnz .loop
+ RET
+%endif
+%else
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_add_ps_32x%1, 6, 10, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+ mov r6d, %1/4
 add r5, r5
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+ lea r9, [r1 * 3]
.loop:
 pmovzxbw m0, [r2] ; first half of row 0 of src0
 pmovzxbw m1, [r2 + 16] ; second half of row 0 of src0

 vpermq m0, m0, 11011000b
 movu [r0 + r1], m0 ; row 1 of dst

- lea r2, [r2 + r4 * 2]
- lea r3, [r3 + r5 * 2]
- lea r0, [r0 + r1 * 2]
-
- pmovzxbw m0, [r2] ; first half of row 2 of src0
- pmovzxbw m1, [r2 + 16] ; second half of row 2 of src0
- movu m2, [r3] ; first half of row 2 of src1
- movu m3, [r3 + 32] ; second half of row 2 of src1
+ pmovzxbw m0, [r2 + r4 * 2] ; first half of row 2 of src0
+ pmovzxbw m1, [r2 + r4 * 2 + 16] ; second half of row 2 of src0
+ movu m2, [r3 + r5 * 2] ; first half of row 2 of src1
+ movu m3, [r3 + r5 * 2 + 32] ; second half of row 2 of src1

 paddw m0, m2
 paddw m1, m3
 packuswb m0, m1
 vpermq m0, m0, 11011000b
- movu [r0], m0 ; row 2 of dst
+ movu [r0 + r1 * 2], m0 ; row 2 of dst

- pmovzxbw m0, [r2 + r4] ; first half of row 3 of src0
- pmovzxbw m1, [r2 + r4 + 16] ; second half of row 3 of src0
- movu m2, [r3 + r5] ; first half of row 3 of src1
- movu m3, [r3 + r5 + 32] ; second half of row 3 of src1
+ pmovzxbw m0, [r2 + r7] ; first half of row 3 of src0
+ pmovzxbw m1, [r2 + r7 + 16] ; second half of row 3 of src0
+ movu m2, [r3 + r8] ; first half of row 3 of src1
+ movu m3, [r3 + r8 + 32] ; second half of row 3 of src1

 paddw m0, m2
 paddw m1, m3
 packuswb m0, m1
 vpermq m0, m0, 11011000b
- movu [r0 + r1], m0 ; row 3 of dst
+ movu [r0 + r9], m0 ; row 3 of dst

- lea r2, [r2 + r4 * 2]
- lea r3, [r3 + r5 * 2]
- lea r0, [r0 + r1 * 2]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ lea r0, [r0 + r1 * 4]

 dec r6d
 jnz .loop
 RET
%endif
+%endif
%endmacro

-PIXEL_ADD_PS_W32_H2 32, 32
-PIXEL_ADD_PS_W32_H2 32, 64
+PIXEL_ADD_PS_W32_H4_avx2 32
+PIXEL_ADD_PS_W32_H4_avx2 64

;-----------------------------------------------------------------------------

 jnz .loop
 RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W64_H2 64, 64

+;-----------------------------------------------------------------------------
+; void pixel_add_ps_64x64(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
INIT_YMM avx2
-cglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
- mov r6d, %2/2
+cglobal pixel_add_ps_64x64, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1
+ mova m5, [pw_pixel_max]
+ pxor m4, m4
+ mov r6d, 16
+ add r4d, r4d
+ add r5d, r5d
+ add r1d, r1d
+ lea r7, [r4 * 3]
+ lea r8, [r5 * 3]
+ lea r9, [r1 * 3]
+
+.loop:
+ movu m0, [r2]
+ movu m1, [r2 + 32]
+ movu m2, [r3]
+ movu m3, [r3 + 32]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0], m0
+ movu [r0 + 32], m1
+
+ movu m0, [r2 + 64]
+ movu m1, [r2 + 96]
+ movu m2, [r3 + 64]
+ movu m3, [r3 + 96]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + 64], m0
+ movu [r0 + 96], m1
+
+ movu m0, [r2 + r4]
+ movu m1, [r2 + r4 + 32]
+ movu m2, [r3 + r5]
+ movu m3, [r3 + r5 + 32]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r1], m0
+ movu [r0 + r1 + 32], m1
+
+ movu m0, [r2 + r4 + 64]
+ movu m1, [r2 + r4 + 96]
+ movu m2, [r3 + r5 + 64]
+ movu m3, [r3 + r5 + 96]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r1 + 64], m0
+ movu [r0 + r1 + 96], m1
+
+ movu m0, [r2 + r4 * 2]
+ movu m1, [r2 + r4 * 2 + 32]
+ movu m2, [r3 + r5 * 2]
+ movu m3, [r3 + r5 * 2 + 32]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r1 * 2], m0
+ movu [r0 + r1 * 2 + 32], m1
+
+ movu m0, [r2 + r4 * 2 + 64]
+ movu m1, [r2 + r4 * 2 + 96]
+ movu m2, [r3 + r5 * 2 + 64]
+ movu m3, [r3 + r5 * 2 + 96]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r1 * 2 + 64], m0
+ movu [r0 + r1 * 2 + 96], m1
+
+ movu m0, [r2 + r7]
+ movu m1, [r2 + r7 + 32]
+ movu m2, [r3 + r8]
+ movu m3, [r3 + r8 + 32]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r9], m0
+ movu [r0 + r9 + 32], m1
+
+ movu m0, [r2 + r7 + 64]
+ movu m1, [r2 + r7 + 96]
+ movu m2, [r3 + r8 + 64]
+ movu m3, [r3 + r8 + 96]
+ paddw m0, m2
+ paddw m1, m3
+
+ CLIPW2 m0, m1, m4, m5
+ movu [r0 + r9 + 64], m0
+ movu [r0 + r9 + 96], m1
+
+ dec r6d
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r4 * 4]
+ lea r3, [r3 + r5 * 4]
+ jnz .loop
+ RET
+%endif
+%else
+INIT_YMM avx2
+cglobal pixel_add_ps_64x64, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+ mov r6d, 32
 add r5, r5
.loop:
 pmovzxbw m0, [r2] ; first 16 of row 0 of src0

 RET

%endif
-%endmacro
-
-PIXEL_ADD_PS_W64_H2 64, 64
x265_1.6.tar.gz/source/common/x86/sad-a.asm -> x265_1.7.tar.gz/source/common/x86/sad-a.asm
Changed
 RET

INIT_YMM avx2
-cglobal pixel_sad_32x24, 4,5,6
+cglobal pixel_sad_32x24, 4,7,6
 xorps m0, m0
 xorps m5, m5
 mov r4d, 6
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]
.loop
 movu m1, [r0] ; row 0 of pix0
 movu m2, [r2] ; row 0 of pix1

 paddd m0, m1
 paddd m5, m3

- lea r2, [r2 + 2 * r3]
- lea r0, [r0 + 2 * r1]
-
- movu m1, [r0] ; row 2 of pix0
- movu m2, [r2] ; row 2 of pix1
- movu m3, [r0 + r1] ; row 3 of pix0
- movu m4, [r2 + r3] ; row 3 of pix1
+ movu m1, [r0 + 2 * r1] ; row 2 of pix0
+ movu m2, [r2 + 2 * r3] ; row 2 of pix1
+ movu m3, [r0 + r5] ; row 3 of pix0
+ movu m4, [r2 + r6] ; row 3 of pix1

 psadbw m1, m2
 psadbw m3, m4
 paddd m0, m1
 paddd m5, m3

- lea r2, [r2 + 2 * r3]
- lea r0, [r0 + 2 * r1]
+ lea r2, [r2 + 4 * r3]
+ lea r0, [r0 + 4 * r1]

 dec r4d
 jnz .loop

 RET

INIT_YMM avx2
-cglobal pixel_sad_64x48, 4,5,6
+cglobal pixel_sad_64x48, 4,7,6
 xorps m0, m0
 xorps m5, m5
- mov r4d, 24
+ mov r4d, 12
+ lea r5, [r1 * 3]
+ lea r6, [r3 * 3]
.loop
 movu m1, [r0] ; first 32 of row 0 of pix0
 movu m2, [r2] ; first 32 of row 0 of pix1

 paddd m0, m1
 paddd m5, m3

- lea r2, [r2 + 2 * r3]
- lea r0, [r0 + 2 * r1]
+ movu m1, [r0 + 2 * r1] ; first 32 of row 2 of pix0
+ movu m2, [r2 + 2 * r3] ; first 32 of row 2 of pix1
+ movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0
+ movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1
+
+ psadbw m1, m2
+ psadbw m3, m4
+ paddd m0, m1
+ paddd m5, m3
+
+ movu m1, [r0 + r5] ; first 32 of row 3 of pix0
+ movu m2, [r2 + r6] ; first 32 of row 3 of pix1
+ movu m3, [r0 + 32 + r5] ; second 32 of row 3 of pix0
+ movu m4, [r2 + 32 + r6] ; second 32 of row 3 of pix1
+
79
+ psadbw m1, m2
80
+ psadbw m3, m4
81
+ paddd m0, m1
82
+ paddd m5, m3
83
+
84
+ lea r2, [r2 + 4 * r3]
85
+ lea r0, [r0 + 4 * r1]
86
87
dec r4d
88
jnz .loop
89
90
RET
91
92
INIT_YMM avx2
93
-cglobal pixel_sad_64x64, 4,5,6
94
+cglobal pixel_sad_64x64, 4,7,6
95
xorps m0, m0
96
xorps m5, m5
97
mov r4d, 8
98
+ lea r5, [r1 * 3]
99
+ lea r6, [r3 * 3]
100
.loop
101
movu m1, [r0] ; first 32 of row 0 of pix0
102
movu m2, [r2] ; first 32 of row 0 of pix1
103
104
paddd m0, m1
105
paddd m5, m3
106
107
- lea r2, [r2 + 2 * r3]
108
- lea r0, [r0 + 2 * r1]
109
-
110
- movu m1, [r0] ; first 32 of row 2 of pix0
111
- movu m2, [r2] ; first 32 of row 2 of pix1
112
- movu m3, [r0 + 32] ; second 32 of row 2 of pix0
113
- movu m4, [r2 + 32] ; second 32 of row 2 of pix1
114
+ movu m1, [r0 + 2 * r1] ; first 32 of row 2 of pix0
115
+ movu m2, [r2 + 2 * r3] ; first 32 of row 2 of pix1
116
+ movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0
117
+ movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1
118
119
psadbw m1, m2
120
psadbw m3, m4
121
paddd m0, m1
122
paddd m5, m3
123
124
- movu m1, [r0 + r1] ; first 32 of row 3 of pix0
125
- movu m2, [r2 + r3] ; first 32 of row 3 of pix1
126
- movu m3, [r0 + 32 + r1] ; second 32 of row 3 of pix0
127
- movu m4, [r2 + 32 + r3] ; second 32 of row 3 of pix1
128
+ movu m1, [r0 + r5] ; first 32 of row 3 of pix0
129
+ movu m2, [r2 + r6] ; first 32 of row 3 of pix1
130
+ movu m3, [r0 + 32 + r5] ; second 32 of row 3 of pix0
131
+ movu m4, [r2 + 32 + r6] ; second 32 of row 3 of pix1
132
133
psadbw m1, m2
134
psadbw m3, m4
135
paddd m0, m1
136
paddd m5, m3
137
138
- lea r2, [r2 + 2 * r3]
139
- lea r0, [r0 + 2 * r1]
140
+ lea r2, [r2 + 4 * r3]
141
+ lea r0, [r0 + 4 * r1]
142
143
movu m1, [r0] ; first 32 of row 4 of pix0
144
movu m2, [r2] ; first 32 of row 4 of pix1
145
146
paddd m0, m1
147
paddd m5, m3
148
149
- lea r2, [r2 + 2 * r3]
150
- lea r0, [r0 + 2 * r1]
151
-
152
- movu m1, [r0] ; first 32 of row 6 of pix0
153
- movu m2, [r2] ; first 32 of row 6 of pix1
154
- movu m3, [r0 + 32] ; second 32 of row 6 of pix0
155
- movu m4, [r2 + 32] ; second 32 of row 6 of pix1
156
+ movu m1, [r0 + 2 * r1] ; first 32 of row 6 of pix0
157
+ movu m2, [r2 + 2 * r3] ; first 32 of row 6 of pix1
158
+ movu m3, [r0 + 2 * r1 + 32] ; second 32 of row 6 of pix0
159
+ movu m4, [r2 + 2 * r3 + 32] ; second 32 of row 6 of pix1
160
161
psadbw m1, m2
162
psadbw m3, m4
163
paddd m0, m1
164
paddd m5, m3
165
166
- movu m1, [r0 + r1] ; first 32 of row 7 of pix0
167
- movu m2, [r2 + r3] ; first 32 of row 7 of pix1
168
- movu m3, [r0 + 32 + r1] ; second 32 of row 7 of pix0
169
- movu m4, [r2 + 32 + r3] ; second 32 of row 7 of pix1
170
+ movu m1, [r0 + r5] ; first 32 of row 7 of pix0
171
+ movu m2, [r2 + r6] ; first 32 of row 7 of pix1
172
+ movu m3, [r0 + 32 + r5] ; second 32 of row 7 of pix0
173
+ movu m4, [r2 + 32 + r6] ; second 32 of row 7 of pix1
174
175
psadbw m1, m2
176
psadbw m3, m4
177
paddd m0, m1
178
paddd m5, m3
179
180
- lea r2, [r2 + 2 * r3]
181
- lea r0, [r0 + 2 * r1]
182
+ lea r2, [r2 + 4 * r3]
183
+ lea r0, [r0 + 4 * r1]
184
185
dec r4d
186
jnz .loop
187
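For readers not fluent in the assembly above: these kernels compute a plain sum of absolute differences (SAD) between two pixel blocks. The 1.7 change unrolls four rows per loop iteration, precomputing `3 * stride` into r5/r6 (hence the register-count bump in the `cglobal` lines from `4,5,6` to `4,7,6`) and advancing both pointers by `4 * stride` once per iteration instead of twice by `2 * stride`. A minimal scalar sketch of the same computation (a hypothetical reference helper, not x265 code):

```python
def sad_block(pix0, pix1, stride0, stride1, width, height):
    """Scalar reference for pixel_sad_WxH: sum of absolute differences.

    pix0/pix1 are flat sequences of pixel values; strides are row pitches.
    height is assumed to be a multiple of 4, mirroring the 4-row unroll
    of the AVX2 loops above.
    """
    total = 0
    for y in range(0, height, 4):      # one asm loop iteration = 4 rows
        for dy in range(4):            # rows y .. y+3 (y+3 uses the 3*stride register)
            row0 = (y + dy) * stride0
            row1 = (y + dy) * stride1
            for x in range(width):
                total += abs(pix0[row0 + x] - pix1[row1 + x])
    return total
```

With identical blocks the result is 0; with a constant difference of 3 over a 4x4 block it is 16 * 3 = 48, which is the invariant the vectorized code must preserve regardless of how many rows are folded into one iteration.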
x265_1.6.tar.gz/source/common/x86/sad16-a.asm -> x265_1.7.tar.gz/source/common/x86/sad16-a.asm
Changed
 ABSW2 m3, m4, m3, m4, m7, m5
 paddw m1, m2
 paddw m3, m4
- paddw m3, m1
- pmaddwd m3, [pw_1]
- paddd m0, m3
+ paddw m0, m1
+ paddw m0, m3
%else
 movu m1, [r2]
 movu m2, [r2+2*r3]

 ABSW2 m1, m2, m1, m2, m3, m4
 lea r0, [r0+4*r1]
 lea r2, [r2+4*r3]
- paddw m2, m1
- pmaddwd m2, [pw_1]
- paddd m0, m2
+ paddw m0, m1
+ paddw m0, m2
%endif
%endmacro

-;-----------------------------------------------------------------------------
-; int pixel_sad_NxM( uint16_t *, intptr_t, uint16_t *, intptr_t )
-;-----------------------------------------------------------------------------
+%macro SAD_INC_2ROW_Nx64 1
+%if 2*%1 > mmsize
+ movu m1, [r2 + 0]
+ movu m2, [r2 + 16]
+ movu m3, [r2 + 2 * r3 + 0]
+ movu m4, [r2 + 2 * r3 + 16]
+ psubw m1, [r0 + 0]
+ psubw m2, [r0 + 16]
+ psubw m3, [r0 + 2 * r1 + 0]
+ psubw m4, [r0 + 2 * r1 + 16]
+ ABSW2 m1, m2, m1, m2, m5, m6
+ lea r0, [r0 + 4 * r1]
+ lea r2, [r2 + 4 * r3]
+ ABSW2 m3, m4, m3, m4, m7, m5
+ paddw m1, m2
+ paddw m3, m4
+ paddw m0, m1
+ paddw m8, m3
+%else
+ movu m1, [r2]
+ movu m2, [r2 + 2 * r3]
+ psubw m1, [r0]
+ psubw m2, [r0 + 2 * r1]
+ ABSW2 m1, m2, m1, m2, m3, m4
+ lea r0, [r0 + 4 * r1]
+ lea r2, [r2 + 4 * r3]
+ paddw m0, m1
+ paddw m8, m2
+%endif
+%endmacro
+
+;-----------------------------------------------------------------------------
+; int pixel_sad_NxM(uint16_t *, intptr_t, uint16_t *, intptr_t)
+;-----------------------------------------------------------------------------
%macro SAD 2
cglobal pixel_sad_%1x%2, 4,5-(%2&4/4),8*(%1/mmsize)
 pxor m0, m0

 dec r4d
 jg .loop
%endif
+%if %2 == 32
+ HADDUWD m0, m1
+ HADDD m0, m1
+%else
+ HADDW m0, m1
+%endif
+ movd eax, xm0
+ RET
+%endmacro

+;-----------------------------------------------------------------------------
+; int pixel_sad_Nx64(uint16_t *, intptr_t, uint16_t *, intptr_t)
+;-----------------------------------------------------------------------------
+%macro SAD_Nx64 1
+cglobal pixel_sad_%1x64, 4,5-(64&4/4), 9
+ pxor m0, m0
+ pxor m8, m8
+ mov r4d, 64 / 2
+.loop:
+ SAD_INC_2ROW_Nx64 %1
+ dec r4d
+ jg .loop
+
+ HADDUWD m0, m1
+ HADDUWD m8, m1
 HADDD m0, m1
+ HADDD m8, m1
+ paddd m0, m8
+
 movd eax, xm0
 RET
%endmacro

SAD 16, 12
SAD 16, 16
SAD 16, 32
-SAD 16, 64
+SAD_Nx64 16

INIT_XMM sse2
SAD 8, 4

SAD 8, 16
SAD 8, 32

+INIT_YMM avx2
+SAD 16, 4
+SAD 16, 8
+SAD 16, 12
+SAD 16, 16
+SAD 16, 32
+
;------------------------------------------------------------------
; int pixel_sad_32xN( uint16_t *, intptr_t, uint16_t *, intptr_t )
;------------------------------------------------------------------

%endif
 movd eax, xm0
 RET
-
;-----------------------------------------------------------------------------
; void pixel_sad_xN_WxH( uint16_t *fenc, uint16_t *pix0, uint16_t *pix1,
; uint16_t *pix2, intptr_t i_stride, int scores[3] )
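The new `SAD_Nx64` path above exists because the high-bit-depth (uint16_t) kernels accumulate absolute differences in 16-bit word lanes: for 64-row blocks a single word accumulator runs out of headroom, so the diff splits the running sum across two registers (m0 and m8, 32 rows each), widens both with `HADDUWD` before the horizontal `HADDD`, and only then adds them. A rough sketch of the overflow arithmetic behind that choice (a worst-case estimate under stated assumptions, not x265 code):

```python
def max_safe_rows(bit_depth):
    """Rows a single 16-bit word lane can accumulate without overflowing,
    assuming the worst-case per-pixel absolute difference for this bit depth
    (one pixel contribution per lane per row)."""
    max_pixel_diff = (1 << bit_depth) - 1   # e.g. 1023 for 10-bit input
    lane_capacity = (1 << 16) - 1           # a uint16 accumulator lane
    return lane_capacity // max_pixel_diff
```

For 10-bit content this bound is 64 rows, i.e. a 64-row block leaves essentially no margin in one accumulator; splitting across two accumulators keeps each lane comfortably in range, and the unsigned widening before the final add preserves the full 32-bit total.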
x265_1.6.tar.gz/source/common/x86/x86inc.asm -> x265_1.7.tar.gz/source/common/x86/x86inc.asm
Changed
%define mangle(x) x
%endif

-%macro SECTION_RODATA 0-1 16
+%macro SECTION_RODATA 0-1 32
SECTION .rodata align=%1
%endmacro

%else
global %1
%endif
+ ALIGN 32
%1: %2
%endmacro
x265_1.6.tar.gz/source/encoder/CMakeLists.txt -> x265_1.7.tar.gz/source/encoder/CMakeLists.txt
Changed
# vim: syntax=cmake

if(GCC)
- add_definitions(-Wno-uninitialized)
+ add_definitions(-Wno-uninitialized)
+ if(CC_HAS_NO_STRICT_OVERFLOW)
+ # GCC 4.9.2 gives warnings we know we can ignore in this file
+ set_source_files_properties(slicetype.cpp PROPERTIES COMPILE_FLAGS -Wno-strict-overflow)
+ endif(CC_HAS_NO_STRICT_OVERFLOW)
endif()
if(MSVC)
add_definitions(/wd4701) # potentially uninitialized local variable 'foo' used
x265_1.6.tar.gz/source/encoder/analysis.cpp -> x265_1.7.tar.gz/source/encoder/analysis.cpp
Changed
for (uint32_t i = 0; i <= g_maxCUDepth; i++)
for (uint32_t j = 0; j < MAX_PRED_TYPES; j++)
m_modeDepth[i].pred[j].invalidate();
-#endif
invalidateContexts(0);
- m_quant.setQPforQuant(ctu);
+#endif
+
+ int qp = setLambdaFromQP(ctu, m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, cuGeom) : m_slice->m_sliceQp);
+ ctu.setQPSubParts((int8_t)qp, 0, 0);
+
m_rqt[0].cur.load(initialContext);
m_modeDepth[0].fencYuv.copyFromPicYuv(*m_frame->m_fencPic, ctu.m_cuAddr, 0);

if (m_param->analysisMode)
{
if (m_slice->m_sliceType == I_SLICE)
- m_reuseIntraDataCTU = (analysis_intra_data *)m_frame->m_analysisData.intraData;
+ m_reuseIntraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
else
{
int numPredDir = m_slice->isInterP() ? 1 : 2;
- m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData;
+ m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS];
}

uint32_t zOrder = 0;
if (m_slice->m_sliceType == I_SLICE)
{
- compressIntraCU(ctu, cuGeom, zOrder);
+ compressIntraCU(ctu, cuGeom, zOrder, qp);
if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.intraData)
{
- CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+ CUData* bestCU = &m_modeDepth[0].bestMode->cu;
memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition);
memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition);

* they are available for intra predictions */
m_modeDepth[0].fencYuv.copyToPicYuv(*m_frame->m_reconPic, ctu.m_cuAddr, 0);

- compressInterCU_rd0_4(ctu, cuGeom);
+ compressInterCU_rd0_4(ctu, cuGeom, qp);

/* generate residual for entire CTU at once and copy to reconPic */
encodeResidue(ctu, cuGeom);
}
else if (m_param->bDistributeModeAnalysis && m_param->rdLevel >= 2)
- compressInterCU_dist(ctu, cuGeom);
+ compressInterCU_dist(ctu, cuGeom, qp);
else if (m_param->rdLevel <= 4)
- compressInterCU_rd0_4(ctu, cuGeom);
+ compressInterCU_rd0_4(ctu, cuGeom, qp);
else
{
- compressInterCU_rd5_6(ctu, cuGeom, zOrder);
+ compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp);
if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData)
{
- CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+ CUData* bestCU = &m_modeDepth[0].bestMode->cu;
memcpy(&m_reuseInterDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
memcpy(&m_reuseInterDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_predMode, sizeof(uint8_t) * numPartition);
}

return;
else if (md.bestMode->cu.isIntra(0))
{
+ m_quant.m_tqBypass = true;
md.pred[PRED_LOSSLESS].initCosts();
md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0];
uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir;
checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL);
checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+ m_quant.m_tqBypass = false;
}
else
{
+ m_quant.m_tqBypass = true;
md.pred[PRED_LOSSLESS].initCosts();
md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv);
encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom);
checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+ m_quant.m_tqBypass = false;
}
}

-void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder)
+void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder, int32_t qp)
{
uint32_t depth = cuGeom.depth;
ModeDepth& md = m_modeDepth[depth];

if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx)
{
- m_quant.setQPforQuant(parentCTU);
-
PartSize size = (PartSize)reusePartSizes[zOrder];
Mode& mode = size == SIZE_2Nx2N ? md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN];
- mode.cu.initSubCU(parentCTU, cuGeom);
+ mode.cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]);
checkBestMode(mode, depth);

}
else if (mightNotSplit)
{
- m_quant.setQPforQuant(parentCTU);
-
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
checkBestMode(md.pred[PRED_INTRA], depth);

if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
{
- md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
checkBestMode(md.pred[PRED_INTRA_NxN], depth);
}

Mode* splitPred = &md.pred[PRED_SPLIT];
splitPred->initCosts();
CUData* splitCU = &splitPred->cu;
- splitCU->initSubCU(parentCTU, cuGeom);
+ splitCU->initSubCU(parentCTU, cuGeom, qp);

uint32_t nextDepth = depth + 1;
ModeDepth& nd = m_modeDepth[nextDepth];
invalidateContexts(nextDepth);
Entropy* nextContext = &m_rqt[depth].cur;
+ int32_t nextQP = qp;

for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
{

{
m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
m_rqt[nextDepth].cur.load(*nextContext);
- compressIntraCU(parentCTU, childGeom, zOrder);
+
+ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+ nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+ compressIntraCU(parentCTU, childGeom, zOrder, nextQP);

// Save best CU and pred data for this sub CU
splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);

else
updateModeCost(*splitPred);

- checkDQPForSplitPred(splitPred->cu, cuGeom);
+ checkDQPForSplitPred(*splitPred, cuGeom);
checkBestMode(*splitPred, depth);
}

}

ModeDepth& md = m_modeDepth[pmode.cuGeom.depth];
- bool bMergeOnly = pmode.cuGeom.log2CUSize == 6;

/* setup slave Analysis */
if (&slave != this)
{
slave.m_slice = m_slice;
slave.m_frame = m_frame;
- slave.setQP(*m_slice, m_rdCost.m_qp);
+ slave.m_param = m_param;
+ slave.setLambdaFromQP(md.pred[PRED_2Nx2N].cu, m_rdCost.m_qp);
slave.invalidateContexts(0);
-
- if (m_param->rdLevel >= 5)
- {
- slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
- slave.m_quant.setQPforQuant(md.pred[PRED_2Nx2N].cu);
- }
+ slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
}

-
/* perform Mode task, repeat until no more work is available */
do
{

switch (pmode.modes[task])
{
case PRED_INTRA:
- if (&slave != this)
- slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
slave.checkIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom);
if (m_param->rdLevel > 2)
slave.encodeIntraInInter(md.pred[PRED_INTRA], pmode.cuGeom);

break;

case PRED_2Nx2N:
- slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, false);
+ slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N);
md.pred[PRED_BIDIR].rdCost = MAX_INT64;
if (m_slice->m_sliceType == B_SLICE)
{

break;

case PRED_Nx2N:
- slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, false);
+ slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N);
break;

case PRED_2NxN:
- slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, false);
+ slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN);
break;

case PRED_2NxnU:
- slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, bMergeOnly);
+ slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU);
break;

case PRED_2NxnD:
- slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, bMergeOnly);
+ slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD);
break;

case PRED_nLx2N:
- slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, bMergeOnly);
+ slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N);
break;

case PRED_nRx2N:
- slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, bMergeOnly);
+ slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N);
break;

default:

while (task >= 0);
}

-void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom)
+void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
uint32_t depth = cuGeom.depth;
uint32_t cuAddr = parentCTU.m_cuAddr;

if (mightNotSplit && depth >= minDepth)
{
- int bTryAmp = m_slice->m_sps->maxAMPDepth > depth && (cuGeom.log2CUSize < 6 || m_param->rdLevel > 4);
+ int bTryAmp = m_slice->m_sps->maxAMPDepth > depth;
int bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;

PMODE pmode(*this, cuGeom);

/* Initialize all prediction CUs based on parentCTU */
- md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
- md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
+ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
if (bTryIntra)
{
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3 && m_param->rdLevel >= 5)
- md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
pmode.modes[pmode.m_jobTotal++] = PRED_INTRA;
}
- md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N;
- md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2Nx2N;
+ md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
if (m_param->bEnableRectInter)
{
- md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxN;
- md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N;
+ md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxN;
+ md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_Nx2N;
}
if (bTryAmp)
{
- md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU;
- md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD;
- md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N;
- md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N;
+ md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnU;
+ md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_2NxnD;
+ md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nLx2N;
+ md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N;
}

pmode.tryBondPeers(*m_frame->m_encData->m_jobProvider, pmode.m_jobTotal);

if (md.bestMode->rdCost == MAX_INT64 && !bTryIntra)
{
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
checkBestMode(md.pred[PRED_INTRA], depth);

Mode* splitPred = &md.pred[PRED_SPLIT];
splitPred->initCosts();
CUData* splitCU = &splitPred->cu;
- splitCU->initSubCU(parentCTU, cuGeom);
+ splitCU->initSubCU(parentCTU, cuGeom, qp);

uint32_t nextDepth = depth + 1;
ModeDepth& nd = m_modeDepth[nextDepth];
invalidateContexts(nextDepth);
Entropy* nextContext = &m_rqt[depth].cur;
+ int nextQP = qp;

for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
{

{
m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
m_rqt[nextDepth].cur.load(*nextContext);
- compressInterCU_dist(parentCTU, childGeom);
+
+ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+ nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+ compressInterCU_dist(parentCTU, childGeom, nextQP);

// Save best CU and pred data for this sub CU
splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);

else
updateModeCost(*splitPred);

- checkDQPForSplitPred(splitPred->cu, cuGeom);
+ checkDQPForSplitPred(*splitPred, cuGeom);
checkBestMode(*splitPred, depth);
}

md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
}

-void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom)
+void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
uint32_t depth = cuGeom.depth;
uint32_t cuAddr = parentCTU.m_cuAddr;

bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;

/* Compute Merge Cost */
- md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
- md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
+ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);

bool earlyskip = false;

if (!earlyskip)
{
- md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N);

if (m_slice->m_sliceType == B_SLICE)
{
- md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom);
}

Mode *bestInter = &md.pred[PRED_2Nx2N];
if (m_param->bEnableRectInter)
{
- md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N);
if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_Nx2N];

- md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN);
if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_2NxN];
}

- if (m_slice->m_sps->maxAMPDepth > depth && cuGeom.log2CUSize < 6)
+ if (m_slice->m_sps->maxAMPDepth > depth)
{
bool bHor = false, bVer = false;
if (bestInter->cu.m_partSize[0] == SIZE_2NxN)

if (bHor)
{
- md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU);
if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_2NxnU];

- md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD);
if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_2NxnD];
}
if (bVer)
{
- md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N);
if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_nLx2N];

- md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N);
if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
bestInter = &md.pred[PRED_nRx2N];

if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) ||
md.bestMode->sa8dCost == MAX_INT64)
{
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
checkBestMode(md.pred[PRED_INTRA], depth);

if (bTryIntra || md.bestMode->sa8dCost == MAX_INT64)
{
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost)
md.bestMode = &md.pred[PRED_INTRA];

{
/* generate recon pixels with no rate distortion considerations */
CUData& cu = md.bestMode->cu;
- m_quant.setQPforQuant(cu);

uint32_t tuDepthRange[2];
cu.getInterTUQtDepthRange(tuDepthRange, 0);

{
/* generate recon pixels with no rate distortion considerations */
CUData& cu = md.bestMode->cu;
- m_quant.setQPforQuant(cu);

uint32_t tuDepthRange[2];
cu.getIntraTUQtDepthRange(tuDepthRange, 0);

Mode* splitPred = &md.pred[PRED_SPLIT];
splitPred->initCosts();
CUData* splitCU = &splitPred->cu;
- splitCU->initSubCU(parentCTU, cuGeom);
+ splitCU->initSubCU(parentCTU, cuGeom, qp);

uint32_t nextDepth = depth + 1;
ModeDepth& nd = m_modeDepth[nextDepth];
invalidateContexts(nextDepth);
Entropy* nextContext = &m_rqt[depth].cur;
+ int nextQP = qp;

for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
{

{
m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
m_rqt[nextDepth].cur.load(*nextContext);
- compressInterCU_rd0_4(parentCTU, childGeom);
+
+ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+ nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+ compressInterCU_rd0_4(parentCTU, childGeom, nextQP);

// Save best CU and pred data for this sub CU
splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);

else if (splitPred->sa8dCost < md.bestMode->sa8dCost)
md.bestMode = splitPred;

- checkDQPForSplitPred(md.bestMode->cu, cuGeom);
+ checkDQPForSplitPred(*md.bestMode, cuGeom);
}
if (mightNotSplit)
{

md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
}

-void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder)
+void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp)
{
uint32_t depth = cuGeom.depth;
ModeDepth& md = m_modeDepth[depth];

uint8_t* reuseModes = &m_reuseInterDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx && reuseModes[zOrder] == MODE_SKIP)
{
- md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
- md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
+ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, true);

if (m_bTryLossless)

if (mightNotSplit)
{
- md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom);
- md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
+ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false);
bool earlySkip = m_param->bEnableEarlySkip && md.bestMode && !md.bestMode->cu.getQtRootCbf(0);

if (!earlySkip)
{
- md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, false);
+ md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N);
checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth);

if (m_slice->m_sliceType == B_SLICE)
{
- md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom);
if (md.pred[PRED_BIDIR].sa8dCost < MAX_INT64)
{

if (m_param->bEnableRectInter)
{
- md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, false);
+ md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N);
checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth);

- md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, false);
+ md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN);
checkBestMode(md.pred[PRED_2NxN], cuGeom.depth);
}

// Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N)
if (m_slice->m_sps->maxAMPDepth > depth)
{
- bool bMergeOnly = cuGeom.log2CUSize == 6;
-
bool bHor = false, bVer = false;
if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN)
bHor = true;

if (bHor)
{
- md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, bMergeOnly);
+ md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU);
checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth);

- md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, bMergeOnly);
+ md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD);
checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth);
}
if (bVer)
{
- md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, bMergeOnly);
+ md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N);
checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth);

- md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom);
- checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, bMergeOnly);
+ md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+ checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N);
checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth);
}
}

if (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames)
{
- md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
checkBestMode(md.pred[PRED_INTRA], depth);

if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
{
- md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+ md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
checkBestMode(md.pred[PRED_INTRA_NxN], depth);
}

Mode* splitPred = &md.pred[PRED_SPLIT];
splitPred->initCosts();
CUData* splitCU = &splitPred->cu;
- splitCU->initSubCU(parentCTU, cuGeom);
+ splitCU->initSubCU(parentCTU, cuGeom, qp);

uint32_t nextDepth = depth + 1;
ModeDepth& nd = m_modeDepth[nextDepth];
invalidateContexts(nextDepth);
Entropy* nextContext = &m_rqt[depth].cur;
+ int nextQP = qp;

for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
{

{
m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
m_rqt[nextDepth].cur.load(*nextContext);
- compressInterCU_rd5_6(parentCTU, childGeom, zOrder);
+
+ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+ nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+ compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP);

// Save best CU and pred data for this sub CU
splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);

else
updateModeCost(*splitPred);

- checkDQPForSplitPred(splitPred->cu, cuGeom);
+ checkDQPForSplitPred(*splitPred, cuGeom);
checkBestMode(*splitPred, depth);
}

md.bestMode->cu.setPUMv(1, candMvField[bestSadCand][1].mv, 0, 0);
md.bestMode->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0);
664
md.bestMode->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0);
665
- checkDQP(md.bestMode->cu, cuGeom);
666
+ checkDQP(*md.bestMode, cuGeom);
667
X265_CHECK(md.bestMode->ok(), "Merge mode not ok\n");
668
}
669
670
671
bestPred->cu.setPUMv(1, candMvField[bestCand][1].mv, 0, 0);
672
bestPred->cu.setPURefIdx(0, (int8_t)candMvField[bestCand][0].refIdx, 0, 0);
673
bestPred->cu.setPURefIdx(1, (int8_t)candMvField[bestCand][1].refIdx, 0, 0);
674
- checkDQP(bestPred->cu, cuGeom);
675
+ checkDQP(*bestPred, cuGeom);
676
X265_CHECK(bestPred->ok(), "merge mode is not ok");
677
}
678
679
680
}
681
}
682
683
- predInterSearch(interMode, cuGeom, false, m_bChromaSa8d);
684
+ predInterSearch(interMode, cuGeom, m_bChromaSa8d);
685
686
/* predInterSearch sets interMode.sa8dBits */
687
const Yuv& fencYuv = *interMode.fencYuv;
688
689
}
690
}
691
692
-void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly)
693
+void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize)
694
{
695
interMode.initCosts();
696
interMode.cu.setPartSizeSubParts(partSize);
697
698
}
699
}
700
701
- predInterSearch(interMode, cuGeom, bMergeOnly, true);
702
+ predInterSearch(interMode, cuGeom, true);
703
704
/* predInterSearch sets interMode.sa8dBits, but this is ignored */
705
encodeResAndCalcRdInterCU(interMode, cuGeom);
706
707
uint32_t zcost = zsa8d + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1);
708
709
/* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */
710
- checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvp0, mvpIdx0, bits0, zcost);
711
- checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvp1, mvpIdx1, bits1, zcost);
712
+ mvp0 = checkBestMVP(inter2Nx2N.amvpCand[0][ref0], mvzero, mvpIdx0, bits0, zcost);
713
+ mvp1 = checkBestMVP(inter2Nx2N.amvpCand[1][ref1], mvzero, mvpIdx1, bits1, zcost);
714
715
uint32_t zbits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
716
zcost = zsa8d + m_rdCost.getCost(zbits);
717
718
CUData& cu = bestMode->cu;
719
720
cu.copyFromPic(ctu, cuGeom);
721
- m_quant.setQPforQuant(cu);
722
723
Yuv& fencYuv = m_modeDepth[cuGeom.depth].fencYuv;
724
if (cuGeom.depth)
725
726
return false;
727
}
728
729
-int Analysis::calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom)
730
+int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom)
731
{
732
- uint32_t ctuAddr = ctu.m_cuAddr;
733
FrameData& curEncData = *m_frame->m_encData;
734
- double qp = curEncData.m_cuStat[ctuAddr].baseQp;
735
-
736
- uint32_t width = m_frame->m_fencPic->m_picWidth;
737
- uint32_t height = m_frame->m_fencPic->m_picHeight;
738
- uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
739
- uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
740
- uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
741
- uint32_t blockSize = g_maxCUSize >> cuGeom.depth;
742
- double qp_offset = 0;
743
- uint32_t cnt = 0;
744
- uint32_t idx;
745
+ double qp = curEncData.m_cuStat[ctu.m_cuAddr].baseQp;
746
747
/* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */
748
bool isReferenced = IS_REFERENCED(m_frame);
749
double *qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset;
750
-
751
- for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16)
752
+ if (qpoffs)
753
{
754
- for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16)
755
+ uint32_t width = m_frame->m_fencPic->m_picWidth;
756
+ uint32_t height = m_frame->m_fencPic->m_picHeight;
757
+ uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
758
+ uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
759
+ uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
760
+ uint32_t blockSize = g_maxCUSize >> cuGeom.depth;
761
+ double qp_offset = 0;
762
+ uint32_t cnt = 0;
763
+ uint32_t idx;
764
+
765
+ for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += 16)
766
{
767
- idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16);
768
- qp_offset += qpoffs[idx];
769
- cnt++;
770
+ for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += 16)
771
+ {
772
+ idx = ((block_yy / 16) * (maxCols)) + (block_xx / 16);
773
+ qp_offset += qpoffs[idx];
774
+ cnt++;
775
+ }
776
}
777
+
778
+ qp_offset /= cnt;
779
+ qp += qp_offset;
780
}
781
782
- qp_offset /= cnt;
783
- qp += qp_offset;
784
return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5));
785
}
786
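The reworked calculateQpforCuSize in the hunk above now averages the per-16x16 lowres offsets (cuTree or AQ) covering the CU only when an offset array exists, then clips the result to the legal QP range. A minimal sketch of that averaging, with simplified hypothetical names (cuQP, qpOffsets) standing in for the x265 member fields:

```cpp
#include <algorithm>
#include <cstdint>

// Average the per-16x16-block QP offsets covering a CU, mirroring the
// guarded loop in calculateQpforCuSize. qpOffsets is laid out row-major,
// one entry per 16x16 lowres block; may be NULL when AQ/cuTree is off.
int cuQP(double baseQp, const double* qpOffsets,
         uint32_t blockX, uint32_t blockY, uint32_t blockSize,
         uint32_t picWidth, uint32_t picHeight)
{
    double qp = baseQp;
    if (qpOffsets)
    {
        uint32_t maxCols = (picWidth + 15) / 16;
        double sum = 0;
        uint32_t cnt = 0;
        for (uint32_t y = blockY; y < blockY + blockSize && y < picHeight; y += 16)
            for (uint32_t x = blockX; x < blockX + blockSize && x < picWidth; x += 16)
            {
                sum += qpOffsets[(y / 16) * maxCols + (x / 16)];
                cnt++;
            }
        qp += sum / cnt;   // cnt >= 1: the CU covers at least one block
    }
    // clip as x265_clip3(QP_MIN, QP_MAX_MAX, ...) does; 0..69 assumed here
    return std::min(std::max((int)(qp + 0.5), 0), 69);
}
```

The `if (qpOffsets)` guard matches the new `if (qpoffs)` check in 1.7, which avoids the division-by-zero/NULL-deref path the 1.6 code had when no offset array was allocated.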
x265_1.6.tar.gz/source/encoder/analysis.h -> x265_1.7.tar.gz/source/encoder/analysis.h
Changed
uint32_t* m_reuseBestMergeCand;

/* full analysis for an I-slice CU */
- void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+ void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);

/* full analysis for a P or B slice CU */
- void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom);
- void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom);
- void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+ void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+ void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+ void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);

/* measure merge and skip */
void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);

/* measure inter options */
void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
- void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly);
+ void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);

void checkBidir2Nx2N(Mode& inter2Nx2N, Mode& bidir2Nx2N, const CUGeom& cuGeom);

/* generate residual and recon pixels for an entire CTU recursively (RD0) */
void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);

- int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom);
+ int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom);

/* check whether current mode is the new best */
inline void checkBestMode(Mode& mode, uint32_t depth)
x265_1.6.tar.gz/source/encoder/api.cpp -> x265_1.7.tar.gz/source/encoder/api.cpp
Changed
if (!p)
return NULL;

- x265_param *param = X265_MALLOC(x265_param, 1);
- if (!param)
- return NULL;
+ Encoder* encoder = NULL;
+ x265_param* param = x265_param_alloc();
+ x265_param* latestParam = x265_param_alloc();
+ if (!param || !latestParam)
+ goto fail;

memcpy(param, p, sizeof(x265_param));
x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str);

x265_setup_primitives(param, param->cpuid);

if (x265_check_params(param))
- return NULL;
+ goto fail;

if (x265_set_globals(param))
- return NULL;
+ goto fail;

- Encoder *encoder = new Encoder;
+ encoder = new Encoder;
if (!param->rc.bEnableSlowFirstPass)
x265_param_apply_fastfirstpass(param);

// may change params for auto-detect, etc
encoder->configure(param);
-
// may change rate control and CPB params
if (!enforceLevel(*param, encoder->m_vps))
- {
- delete encoder;
- return NULL;
- }
+ goto fail;

// will detect and set profile/tier/level in VPS
determineLevel(*param, encoder->m_vps);

- encoder->create();
- if (encoder->m_aborted)
+ if (!param->bAllowNonConformance && encoder->m_vps.ptl.profileIdc == Profile::NONE)
{
- delete encoder;
- return NULL;
+ x265_log(param, X265_LOG_INFO, "non-conformant bitstreams not allowed (--allow-non-conformance)\n");
+ goto fail;
}

- x265_print_params(param);
+ encoder->create();
+ encoder->m_latestParam = latestParam;
+ memcpy(latestParam, param, sizeof(x265_param));
+ if (encoder->m_aborted)
+ goto fail;

+ x265_print_params(param);
return encoder;
+
+fail:
+ delete encoder;
+ x265_param_free(param);
+ x265_param_free(latestParam);
+ return NULL;
}

extern "C"

}

extern "C"
+int x265_encoder_reconfig(x265_encoder* enc, x265_param* param_in)
+{
+ if (!enc || !param_in)
+ return -1;
+
+ x265_param save;
+ Encoder* encoder = static_cast<Encoder*>(enc);
+ memcpy(&save, encoder->m_latestParam, sizeof(x265_param));
+ int ret = encoder->reconfigureParam(encoder->m_latestParam, param_in);
+ if (ret)
+ /* reconfigure failed, recover saved param set */
+ memcpy(encoder->m_latestParam, &save, sizeof(x265_param));
+ else
+ {
+ encoder->m_reconfigured = true;
+ x265_print_reconfigured_params(&save, encoder->m_latestParam);
+ }
+ return ret;
+}
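The new x265_encoder_reconfig above follows a snapshot/rollback pattern: copy the live parameter set aside, attempt the update, and restore the snapshot if validation fails. A self-contained sketch of that pattern, with a hypothetical Params struct and tryApply() standing in for x265_param and reconfigureParam():

```cpp
#include <cstring>

// Hypothetical stand-in for x265_param with just two reconfigurable fields.
struct Params { int rdLevel; int psyRd; };

// Apply the request, then validate, mirroring how reconfigureParam() copies
// fields and only afterwards runs x265_check_params(). Nonzero = failure.
int tryApply(Params* live, const Params* req)
{
    live->rdLevel = req->rdLevel;
    live->psyRd = req->psyRd;
    return (live->rdLevel < 0 || live->rdLevel > 6) ? -1 : 0;
}

// Snapshot/rollback wrapper, shaped like x265_encoder_reconfig().
int reconfig(Params* live, const Params* req)
{
    Params save;
    std::memcpy(&save, live, sizeof(Params));
    int ret = tryApply(live, req);
    if (ret)
        std::memcpy(live, &save, sizeof(Params)); // recover saved param set
    return ret;
}
```

Because validation runs after the fields are copied, the rollback memcpy is what guarantees a failed request leaves the encoder's parameters untouched.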
+
+extern "C"
int x265_encoder_encode(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out)
{
if (!enc)

{
Encoder *encoder = static_cast<Encoder*>(enc);

- encoder->stop();
+ encoder->stopJobs();
encoder->printSummary();
encoder->destroy();
delete encoder;
+ ATOMIC_DEC(&g_ctuSizeConfigured);
}
}

extern "C"
void x265_cleanup(void)
{
- BitCost::destroy();
- CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
- g_ctuSizeConfigured = 0;
+ if (!g_ctuSizeConfigured)
+ {
+ BitCost::destroy();
+ CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
+ }
}

extern "C"

&x265_picture_init,
&x265_encoder_open,
&x265_encoder_parameters,
+ &x265_encoder_reconfig,
&x265_encoder_headers,
&x265_encoder_encode,
&x265_encoder_get_stats,

x265_max_bit_depth,
};

+typedef const x265_api* (*api_get_func)(int bitDepth);
+
+#define xstr(s) str(s)
+#define str(s) #s
+
+#if _WIN32
+#define ext ".dll"
+#elif MACOS
+#include <dlfcn.h>
+#define ext ".dylib"
+#else
+#include <dlfcn.h>
+#define ext ".so"
+#endif
+
extern "C"
const x265_api* x265_api_get(int bitDepth)
{
if (bitDepth && bitDepth != X265_DEPTH)
- return NULL;
+ {
+ const char* libname = NULL;
+ const char* method = "x265_api_get_" xstr(X265_BUILD);
+
+ if (bitDepth == 12)
+ libname = "libx265_main12" ext;
+ else if (bitDepth == 10)
+ libname = "libx265_main10" ext;
+ else if (bitDepth == 8)
+ libname = "libx265_main" ext;
+ else
+ return NULL;
+
+ const x265_api* api = NULL;
+
+#if _WIN32
+ HMODULE h = LoadLibraryA(libname);
+ if (h)
+ {
+ api_get_func get = (api_get_func)GetProcAddress(h, method);
+ if (get)
+ api = get(0);
+ }
+#else
+ void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
+ if (h)
+ {
+ api_get_func get = (api_get_func)dlsym(h, method);
+ if (get)
+ api = get(0);
+ }
+#endif
+
+ if (api && bitDepth != api->max_bit_depth)
+ {
+ x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth);
+ return NULL;
+ }
+
+ return api;
+ }

return &libapi;
}
x265_1.6.tar.gz/source/encoder/encoder.cpp -> x265_1.7.tar.gz/source/encoder/encoder.cpp
Changed
Encoder::Encoder()
{
m_aborted = false;
+ m_reconfigured = false;
m_encodedFrameNum = 0;
m_pocLast = -1;
m_curEncoder = 0;

m_outputCount = 0;
m_csvfpt = NULL;
m_param = NULL;
+ m_latestParam = NULL;
m_cuOffsetY = NULL;
m_cuOffsetC = NULL;
m_buOffsetY = NULL;

bool allowPools = !p->numaPools || strcmp(p->numaPools, "none");

// Trim the thread pool if --wpp, --pme, and --pmode are disabled
- if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation)
+ if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices)
allowPools = false;

if (!p->frameNumThreads)

x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pme disabled\n");
if (p->bDistributeModeAnalysis)
x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pmode disabled\n");
+ if (p->lookaheadSlices)
+ x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --lookahead-slices disabled\n");

// disable all pool features if the thread pool is disabled or unusable.
- p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0;
+ p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0;
}

char buf[128];

x265_log(p, X265_LOG_INFO, "frame threads / pool features : %d / %s\n", p->frameNumThreads, buf);

for (int i = 0; i < m_param->frameNumThreads; i++)
+ {
m_frameEncoder[i] = new FrameEncoder;
+ m_frameEncoder[i]->m_nalList.m_annexB = !!m_param->bAnnexB;
+ }

if (m_numPools)
{

m_aborted |= parseLambdaFile(m_param);

m_encodeStartTime = x265_mdate();
+
+ m_nalList.m_annexB = !!m_param->bAnnexB;
}

-void Encoder::stop()
+void Encoder::stopJobs()
{
if (m_rateControl)
m_rateControl->terminate(); // unblock all blocked RC calls

if (m_lookahead)
- m_lookahead->stop();
+ m_lookahead->stopJobs();

for (int i = 0; i < m_param->frameNumThreads; i++)
{

}

if (m_threadPool)
- m_threadPool->stop();
+ m_threadPool->stopWorkers();
}

void Encoder::destroy()

if (m_param)
{
- free((void*)m_param->rc.lambdaFileName); // allocs by strdup
- free(m_param->rc.statFileName);
- free(m_param->analysisFileName);
- free((void*)m_param->scalingLists);
- free(m_param->csvfn);
- free(m_param->numaPools);
+ /* release string arguments that were strdup'd */
+ free((char*)m_param->rc.lambdaFileName);
+ free((char*)m_param->rc.statFileName);
+ free((char*)m_param->analysisFileName);
+ free((char*)m_param->scalingLists);
+ free((char*)m_param->csvfn);
+ free((char*)m_param->numaPools);
+ free((char*)m_param->masteringDisplayColorVolume);
+ free((char*)m_param->contentLightLevelInfo);

- X265_FREE(m_param);
+ x265_param_free(m_param);
}
+
+ x265_param_free(m_latestParam);
}

void Encoder::updateVbvPlan(RateControl* rc)

if (m_dpb->m_freeList.empty())
{
inFrame = new Frame;
- if (inFrame->create(m_param))
+ x265_param* p = m_reconfigured? m_latestParam : m_param;
+ if (inFrame->create(p))
{
/* the first PicYuv created is asked to generate the CU and block unit offset
* arrays which are then shared with all subsequent PicYuv (orig and recon)

}
}
else
+ {
inFrame = m_dpb->m_freeList.popBack();
+ inFrame->m_lowresInit = false;
+ }

/* Copy input picture into a Frame and PicYuv, send to lookahead */
inFrame->m_fencPic->copyFromPicture(*pic_in, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);

inFrame->m_userData = pic_in->userData;
inFrame->m_pts = pic_in->pts;
inFrame->m_forceqp = pic_in->forceqp;
+ inFrame->m_param = m_reconfigured ? m_latestParam : m_param;

if (m_pocLast == 0)
m_firstPts = inFrame->m_pts;

return ret;
}

+int Encoder::reconfigureParam(x265_param* encParam, x265_param* param)
+{
+ encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers
+ encParam->bEnableLoopFilter = param->bEnableLoopFilter;
+ encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset;
+ encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset;
+ encParam->bEnableFastIntra = param->bEnableFastIntra;
+ encParam->bEnableEarlySkip = param->bEnableEarlySkip;
+ encParam->bEnableTemporalMvp = param->bEnableTemporalMvp;
+ /* Scratch buffer prevents me_range from being increased for esa/tesa
+ if (param->searchMethod < X265_FULL_SEARCH || param->searchMethod < encParam->searchRange)
+ encParam->searchRange = param->searchRange; */
+ encParam->noiseReductionInter = param->noiseReductionInter;
+ encParam->noiseReductionIntra = param->noiseReductionIntra;
+ /* We can't switch out of subme=0 during encoding. */
+ if (encParam->subpelRefine)
+ encParam->subpelRefine = param->subpelRefine;
+ encParam->rdoqLevel = param->rdoqLevel;
+ encParam->rdLevel = param->rdLevel;
+ encParam->bEnableTSkipFast = param->bEnableTSkipFast;
+ encParam->psyRd = param->psyRd;
+ encParam->psyRdoq = param->psyRdoq;
+ encParam->bEnableSignHiding = param->bEnableSignHiding;
+ encParam->bEnableFastIntra = param->bEnableFastIntra;
+ encParam->maxTUSize = param->maxTUSize;
+ return x265_check_params(encParam);
+}
+
void EncStats::addPsnr(double psnrY, double psnrU, double psnrV)
{
m_psnrSumY += psnrY;

bs.writeByteAlignment();
list.serialize(NAL_UNIT_PPS, bs);

+ if (m_param->masteringDisplayColorVolume)
+ {
+ SEIMasteringDisplayColorVolume mdsei;
+ if (mdsei.parse(m_param->masteringDisplayColorVolume))
+ {
+ bs.resetBits();
+ mdsei.write(bs, m_sps);
+ bs.writeByteAlignment();
+ list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+ }
+ else
+ x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n");
+ }
+
+ if (m_param->contentLightLevelInfo)
+ {
+ SEIContentLightLevel cllsei;
+ if (cllsei.parse(m_param->contentLightLevelInfo))
+ {
+ bs.resetBits();
+ cllsei.write(bs, m_sps);
+ bs.writeByteAlignment();
+ list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+ }
+ else
+ x265_log(m_param, X265_LOG_WARNING, "unable to parse content light level info\n");
+ }
+
if (m_param->bEmitInfoSEI)
{
char *opts = x265_param2string(m_param);

if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv))
{
pps->bUseDQP = true;
- pps->maxCuDQPDepth = 0; /* TODO: make configurable? */
+ pps->maxCuDQPDepth = g_log2Size[m_param->maxCUSize] - g_log2Size[m_param->rc.qgSize];
+ X265_CHECK(pps->maxCuDQPDepth <= 2, "max CU DQP depth cannot be greater than 2\n");
}
else
{

p->analysisMode = X265_ANALYSIS_OFF;
x265_log(p, X265_LOG_WARNING, "Analysis save and load mode not supported for distributed mode analysis\n");
}
+
+ bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
+ if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv))
+ {
+ if (p->rc.qgSize < X265_MAX(16, p->minCUSize))
+ {
+ p->rc.qgSize = X265_MAX(16, p->minCUSize);
+ x265_log(p, X265_LOG_WARNING, "QGSize should be greater than or equal to 16 and minCUSize, setting QGSize = %d\n", p->rc.qgSize);
+ }
+ if (p->rc.qgSize > p->maxCUSize)
+ {
+ p->rc.qgSize = p->maxCUSize;
+ x265_log(p, X265_LOG_WARNING, "QGSize should be less than or equal to maxCUSize, setting QGSize = %d\n", p->rc.qgSize);
+ }
+ }
+ else
+ m_param->rc.qgSize = p->maxCUSize;
}

void Encoder::allocAnalysis(x265_analysis_data* analysis)
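The hunks above implement the new --qg-size feature: configure() clamps rc.qgSize into [max(16, minCUSize), maxCUSize], and the PPS then derives maxCuDQPDepth as the log2 difference between the CTU size and the quantization-group size. A minimal sketch of both calculations (clampQgSize, maxCuDQPDepth, and ilog2 are illustrative helpers, not x265 names):

```cpp
#include <algorithm>
#include <cstdint>

// Integer log2 for power-of-two block sizes (stand-in for g_log2Size[]).
static uint32_t ilog2(uint32_t v) { uint32_t r = 0; while (v >>= 1) r++; return r; }

// Clamp the quantization-group size as Encoder::configure() now does.
uint32_t clampQgSize(uint32_t qgSize, uint32_t minCUSize, uint32_t maxCUSize)
{
    qgSize = std::max(qgSize, std::max(16u, minCUSize));
    return std::min(qgSize, maxCUSize);
}

// PPS max CU DQP depth: how many quadtree levels below the CTU may carry
// their own delta-QP. With a 64-pixel CTU this is 0..2, matching X265_CHECK.
uint32_t maxCuDQPDepth(uint32_t maxCUSize, uint32_t qgSize)
{
    return ilog2(maxCUSize) - ilog2(qgSize);
}
```

For example, a 64-pixel CTU with --qg-size 16 gives depth 2, so AQ/cuTree offsets can vary per 16x16 quantization group instead of only per CTU as in 1.6's hardcoded `maxCuDQPDepth = 0`.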
x265_1.6.tar.gz/source/encoder/encoder.h -> x265_1.7.tar.gz/source/encoder/encoder.h
Changed
uint32_t m_numDelayedPic;

x265_param* m_param;
+ x265_param* m_latestParam;
RateControl* m_rateControl;
Lookahead* m_lookahead;
Window m_conformanceWindow;

bool m_bZeroLatency; // x265_encoder_encode() returns NALs for the input picture, zero lag
bool m_aborted; // fatal error detected
+ bool m_reconfigured; // reconfigure of encoder detected

Encoder();
~Encoder() {}

void create();
- void stop();
+ void stopJobs();
void destroy();

int encode(const x265_picture* pic, x265_picture *pic_out);

+ int reconfigureParam(x265_param* encParam, x265_param* param);
+
void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs);

void fetchStats(x265_stats* stats, size_t statsSizeBytes);
x265_1.6.tar.gz/source/encoder/entropy.cpp -> x265_1.7.tar.gz/source/encoder/entropy.cpp
Changed
if (ctu.isSkipped(absPartIdx))
{
codeMergeIndex(ctu, absPartIdx);
- finishCU(ctu, absPartIdx, depth);
+ finishCU(ctu, absPartIdx, depth, bEncodeDQP);
return;
}
codePredMode(ctu.m_predMode[absPartIdx]);

codeCoeff(ctu, absPartIdx, bEncodeDQP, tuDepthRange);

// --- write terminating bit ---
- finishCU(ctu, absPartIdx, depth);
+ finishCU(ctu, absPartIdx, depth, bEncodeDQP);
}

/* Return bit count of signaling inter mode */

}

/* finish encoding a cu and handle end-of-slice conditions */
-void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth)
+void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bCodeDQP)
{
const Slice* slice = ctu.m_slice;
uint32_t realEndAddress = slice->m_endCUAddr;

bool granularityBoundary = (((rpelx & granularityMask) == 0 || (rpelx == slice->m_sps->picWidthInLumaSamples )) &&
((bpely & granularityMask) == 0 || (bpely == slice->m_sps->picHeightInLumaSamples)));

+ if (slice->m_pps->bUseDQP)
+ const_cast<CUData&>(ctu).setQPSubParts(bCodeDQP ? ctu.getRefQP(absPartIdx) : ctu.m_qp[absPartIdx], absPartIdx, depth);
+
if (granularityBoundary)
{
// Encode slice finish

{
length = 0;
codeNumber = (codeNumber >> absGoRice) - COEF_REMAIN_BIN_REDUCTION;
- if (codeNumber != 0)
{
unsigned long idx;
CLZ(idx, codeNumber + 1);
length = idx;
+ X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
codeNumber -= (1 << idx) - 1;
}
codeNumber = (codeNumber << absGoRice) + codeRemain;

//const uint32_t maskPosXY = ((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1;
X265_CHECK((uint32_t)((1 << (log2TrSize - MLS_CG_LOG2_SIZE)) - 1) == (((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1), "maskPosXY fault\n");

- scanPosLast = primitives.findPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig);
+ scanPosLast = primitives.scanPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codingParameters.scanType], trSize);
posLast = codingParameters.scan[scanPosLast];

const int lastScanSet = scanPosLast >> MLS_CG_SIZE;

uint8_t * const baseCoeffGroupCtx = &m_contextState[OFF_SIG_CG_FLAG_CTX + (bIsLuma ? 0 : NUM_SIG_CG_FLAG_CTX)];
uint8_t * const baseCtx = bIsLuma ? &m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
uint32_t c1 = 1;
- uint32_t goRiceParam = 0;
int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1;
int absCoeff[1 << MLS_CG_SIZE];
int numNonZero = 1;

const uint32_t subCoeffFlag = coeffFlag[subSet];
uint32_t scanFlagMask = subCoeffFlag;
int subPosBase = subSet << MLS_CG_SIZE;
- goRiceParam = 0;

if (subSet == lastScanSet)
{

else
{
uint32_t sigCoeffGroup = ((sigCoeffGroupFlag64 & cgBlkPosMask) != 0);
- uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+ uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
encodeBin(sigCoeffGroup, baseCoeffGroupCtx[ctxSig]);
}

if (sigCoeffGroupFlag64 & cgBlkPosMask)
{
X265_CHECK((log2TrSize != 2) || (log2TrSize == 2 && subSet == 0), "log2TrSize and subSet mistake!\n");
- const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+ const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+ const uint32_t posOffset = (bIsLuma && subSet) ? 3 : 0;

static const uint8_t ctxIndMap4x4[16] =
{

7, 7, 8, 8
};
// NOTE: [patternSigCtx][posXinSubset][posYinSubset]
- static const uint8_t table_cnt[4][4][4] =
+ static const uint8_t table_cnt[4][SCAN_SET_SIZE] =
{
// patternSigCtx = 0
{
- { 2, 1, 1, 0 },
- { 1, 1, 0, 0 },
- { 1, 0, 0, 0 },
- { 0, 0, 0, 0 },
+ 2, 1, 1, 0,
+ 1, 1, 0, 0,
+ 1, 0, 0, 0,
+ 0, 0, 0, 0,
},
// patternSigCtx = 1
{
- { 2, 1, 0, 0 },
- { 2, 1, 0, 0 },
- { 2, 1, 0, 0 },
- { 2, 1, 0, 0 },
+ 2, 2, 2, 2,
+ 1, 1, 1, 1,
+ 0, 0, 0, 0,
+ 0, 0, 0, 0,
},
// patternSigCtx = 2
{
- { 2, 2, 2, 2 },
- { 1, 1, 1, 1 },
- { 0, 0, 0, 0 },
- { 0, 0, 0, 0 },
+ 2, 1, 0, 0,
+ 2, 1, 0, 0,
+ 2, 1, 0, 0,
+ 2, 1, 0, 0,
},
// patternSigCtx = 3
{
- { 2, 2, 2, 2 },
- { 2, 2, 2, 2 },
- { 2, 2, 2, 2 },
- { 2, 2, 2, 2 },
+ 2, 2, 2, 2,
+ 2, 2, 2, 2,
+ 2, 2, 2, 2,
+ 2, 2, 2, 2,
}
};
148
+ const int offset = codingParameters.firstSignificanceMapContext;
149
+ ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]);
150
+ // TODO: accelerate by PABSW
151
+ const uint32_t blkPosBase = codingParameters.scan[subPosBase];
152
+ for (int i = 0; i < MLS_CG_SIZE; i++)
153
+ {
154
+ tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]);
155
+ tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]);
156
+ tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]);
157
+ tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]);
158
+ }
159
+
160
if (m_bitIf)
161
{
162
if (log2TrSize == 2)
163
164
uint32_t blkPos, sig, ctxSig;
165
for (; scanPosSigOff >= 0; scanPosSigOff--)
166
{
167
- blkPos = codingParameters.scan[subPosBase + scanPosSigOff];
168
+ blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
169
sig = scanFlagMask & 1;
170
scanFlagMask >>= 1;
171
- X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
172
+ X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
173
{
174
ctxSig = ctxIndMap4x4[blkPos];
175
X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
176
encodeBin(sig, baseCtx[ctxSig]);
177
}
178
- absCoeff[numNonZero] = int(abs(coeff[blkPos]));
179
+ absCoeff[numNonZero] = tmpCoeff[blkPos];
180
numNonZero += sig;
181
}
182
}
183
184
{
185
X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n");
186
187
- const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx];
188
- const int offset = codingParameters.firstSignificanceMapContext;
189
- const uint32_t lumaMask = bIsLuma ? ~0 : 0;
190
- static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C};
191
- const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask;
192
+ const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx];
193
194
uint32_t blkPos, sig, ctxSig;
195
for (; scanPosSigOff >= 0; scanPosSigOff--)
196
{
197
- blkPos = codingParameters.scan[subPosBase + scanPosSigOff];
198
- X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n");
199
+ blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
200
const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
201
sig = scanFlagMask & 1;
202
scanFlagMask >>= 1;
203
- X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
204
+ X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
205
if (scanPosSigOff != 0 || subSet == 0 || numNonZero)
206
{
207
- const uint32_t posY = blkPos >> log2TrSize;
208
- const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0;
209
-
210
- const uint32_t posXinSubset = blkPos & 3;
211
- const uint32_t posYinSubset = posY & 3;
212
- const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset;
213
+ const uint32_t cnt = tabSigCtx[blkPos] + offset;
214
ctxSig = (cnt + posOffset) & posZeroMask;
215
216
- X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
217
+ X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
218
encodeBin(sig, baseCtx[ctxSig]);
219
}
220
- absCoeff[numNonZero] = int(abs(coeff[blkPos]));
221
+ absCoeff[numNonZero] = tmpCoeff[blkPos];
222
numNonZero += sig;
223
}
224
}
225
226
uint32_t blkPos, sig, ctxSig;
227
for (; scanPosSigOff >= 0; scanPosSigOff--)
228
{
229
- blkPos = codingParameters.scan[subPosBase + scanPosSigOff];
230
+ blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
231
sig = scanFlagMask & 1;
232
scanFlagMask >>= 1;
233
- X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
234
+ X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
235
{
236
ctxSig = ctxIndMap4x4[blkPos];
237
- X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
238
+ X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
239
//encodeBin(sig, baseCtx[ctxSig]);
240
const uint32_t mstate = baseCtx[ctxSig];
241
- baseCtx[ctxSig] = sbacNext(mstate, sig);
242
- sum += sbacGetEntropyBits(mstate, sig);
243
+ const uint32_t mps = mstate & 1;
244
+ const uint32_t stateBits = g_entropyStateBits[mstate ^ sig];
245
+ uint32_t nextState = (stateBits >> 23) + mps;
246
+ if ((mstate ^ sig) == 1)
247
+ nextState = sig;
248
+ X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n");
249
+ X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n");
250
+ baseCtx[ctxSig] = (uint8_t)nextState;
251
+ sum += stateBits;
252
}
253
- absCoeff[numNonZero] = int(abs(coeff[blkPos]));
254
+ absCoeff[numNonZero] = tmpCoeff[blkPos];
255
numNonZero += sig;
256
}
257
} // end of 4x4
258
259
{
260
X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n");
261
262
- const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx];
263
- const int offset = codingParameters.firstSignificanceMapContext;
264
- const uint32_t lumaMask = bIsLuma ? ~0 : 0;
265
- static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C};
266
- const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask;
267
+ const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx];
268
269
uint32_t blkPos, sig, ctxSig;
270
for (; scanPosSigOff >= 0; scanPosSigOff--)
271
{
272
- blkPos = codingParameters.scan[subPosBase + scanPosSigOff];
273
- X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n");
274
+ blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
275
const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
276
sig = scanFlagMask & 1;
277
scanFlagMask >>= 1;
278
- X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
279
+ X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
280
if (scanPosSigOff != 0 || subSet == 0 || numNonZero)
281
{
282
- const uint32_t posY = blkPos >> log2TrSize;
283
- const uint32_t posOffset = (blkPos & posGT4Mask) ? 3 : 0;
284
-
285
- const uint32_t posXinSubset = blkPos & 3;
286
- const uint32_t posYinSubset = posY & 3;
287
- const uint32_t cnt = tabSigCtx[posXinSubset][posYinSubset] + offset;
288
+ const uint32_t cnt = tabSigCtx[blkPos] + offset;
289
ctxSig = (cnt + posOffset) & posZeroMask;
290
291
- X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
292
+ X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
293
//encodeBin(sig, baseCtx[ctxSig]);
294
const uint32_t mstate = baseCtx[ctxSig];
295
- baseCtx[ctxSig] = sbacNext(mstate, sig);
296
- sum += sbacGetEntropyBits(mstate, sig);
297
+ const uint32_t mps = mstate & 1;
298
+ const uint32_t stateBits = g_entropyStateBits[mstate ^ sig];
299
+ uint32_t nextState = (stateBits >> 23) + mps;
300
+ if ((mstate ^ sig) == 1)
301
+ nextState = sig;
302
+ X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n");
303
+ X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n");
304
+ baseCtx[ctxSig] = (uint8_t)nextState;
305
+ sum += stateBits;
306
}
307
- absCoeff[numNonZero] = int(abs(coeff[blkPos]));
308
+ absCoeff[numNonZero] = tmpCoeff[blkPos];
309
numNonZero += sig;
310
}
311
} // end of non 4x4 path
312
+ sum &= 0xFFFFFF;
313
314
// update RD cost
315
m_fracBits += sum;
316
317
if (!c1)
318
{
319
baseCtxMod = bIsLuma ? &m_contextState[OFF_ABS_FLAG_CTX + ctxSet] : &m_contextState[OFF_ABS_FLAG_CTX + NUM_ABS_FLAG_CTX_LUMA + ctxSet];
320
- if (firstC2FlagIdx != -1)
321
- {
322
- uint32_t symbol = absCoeff[firstC2FlagIdx] > 2;
323
- encodeBin(symbol, baseCtxMod[0]);
324
- }
325
+
326
+ X265_CHECK((firstC2FlagIdx != -1), "firstC2FlagIdx check failure\n");
327
+ uint32_t symbol = absCoeff[firstC2FlagIdx] > 2;
328
+ encodeBin(symbol, baseCtxMod[0]);
329
}
330
331
const int hiddenShift = (bHideFirstSign && signHidden) ? 1 : 0;
332
encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift);
333
334
- int firstCoeff2 = 1;
335
if (!c1 || numNonZero > C1FLAG_NUMBER)
336
{
337
- for (int idx = 0; idx < numNonZero; idx++)
338
+ uint32_t goRiceParam = 0;
339
+ int firstCoeff2 = 1;
340
+ uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel
341
+
342
+ if (!m_bitIf)
343
{
344
- int baseLevel = (idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1;
345
+ // FastRd path
346
+ for (int idx = 0; idx < numNonZero; idx++)
347
+ {
348
+ int baseLevel = (baseLevelN & 3) | firstCoeff2;
349
+ X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n");
350
+ baseLevelN >>= 2;
351
+ int codeNumber = absCoeff[idx] - baseLevel;
352
353
- if (absCoeff[idx] >= baseLevel)
354
+ if (codeNumber >= 0)
355
+ {
356
+ //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
357
+ uint32_t length = 0;
358
+
359
+ codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION;
360
+ if (codeNumber >= 0)
361
+ {
362
+ {
363
+ unsigned long cidx;
364
+ CLZ(cidx, codeNumber + 1);
365
+ length = cidx;
366
+ }
367
+ X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
368
+
369
+ codeNumber = (length + length);
370
+ }
371
+ m_fracBits += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber) << 15;
372
+
373
+ if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
374
+ goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
375
+ X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n");
376
+ }
377
+ if (absCoeff[idx] >= 2)
378
+ firstCoeff2 = 0;
379
+ }
380
+ }
381
+ else
382
+ {
383
+ // Standard path
384
+ for (int idx = 0; idx < numNonZero; idx++)
385
{
386
- writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
387
- if (absCoeff[idx] > 3 * (1 << goRiceParam))
388
- goRiceParam = std::min<uint32_t>(goRiceParam + 1, 4);
389
+ int baseLevel = (baseLevelN & 3) | firstCoeff2;
390
+ X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n");
391
+ baseLevelN >>= 2;
392
+
393
+ if (absCoeff[idx] >= baseLevel)
394
+ {
395
+ writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
396
+ if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
397
+ goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
398
+ X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n");
399
+ }
400
+ if (absCoeff[idx] >= 2)
401
+ firstCoeff2 = 0;
402
}
403
- if (absCoeff[idx] >= 2)
404
- firstCoeff2 = 0;
405
}
406
}
407
}
408
409
if (bIsLuma)
410
{
411
for (uint32_t bin = 0; bin < 2; bin++)
412
- estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin);
413
+ estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin);
414
415
for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++)
416
for (uint32_t bin = 0; bin < 2; bin++)
417
- estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin);
418
+ estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin);
419
}
420
else
421
{
422
for (uint32_t bin = 0; bin < 2; bin++)
423
- estBitsSbac.significantBits[0][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin);
424
+ estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin);
425
426
for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++)
427
for (uint32_t bin = 0; bin < 2; bin++)
428
- estBitsSbac.significantBits[ctxIdx][bin] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin);
429
+ estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin);
430
}
431
432
int blkSizeOffset = bIsLuma ? ((log2TrSize - 2) * 3 + ((log2TrSize - 1) >> 2)) : NUM_CTX_LAST_FLAG_XY_LUMA;
433
434
0x0050c, 0x29bab, 0x004c1, 0x2a674, 0x004a7, 0x2aa5e, 0x0046f, 0x2b32f, 0x0041f, 0x2c0ad, 0x003e7, 0x2ca8d, 0x003ba, 0x2d323, 0x0010c, 0x3bfbb
435
};
436
437
+// [8 24] --> [stateMPS BitCost], [stateLPS BitCost]
438
+const uint32_t g_entropyStateBits[128] =
439
+{
440
+ // Corrected table, most notably for last state
441
+ 0x01007b23, 0x000085f9, 0x020074a0, 0x00008cbc, 0x03006ee4, 0x01009354, 0x040067f4, 0x02009c1b,
442
+ 0x050060b0, 0x0200a62a, 0x06005a9c, 0x0400af5b, 0x0700548d, 0x0400b955, 0x08004f56, 0x0500c2a9,
443
+ 0x09004a87, 0x0600cbf7, 0x0a0045d6, 0x0700d5c3, 0x0b004144, 0x0800e01b, 0x0c003d88, 0x0900e937,
444
+ 0x0d0039e0, 0x0900f2cd, 0x0e003663, 0x0b00fc9e, 0x0f003347, 0x0b010600, 0x10003050, 0x0c010f95,
445
+ 0x11002d4d, 0x0d011a02, 0x12002ad3, 0x0d012333, 0x1300286e, 0x0f012cad, 0x14002604, 0x0f0136df,
446
+ 0x15002425, 0x10013f48, 0x160021f4, 0x100149c4, 0x1700203e, 0x1201527b, 0x18001e4d, 0x12015d00,
447
+ 0x19001c99, 0x130166de, 0x1a001b18, 0x13017017, 0x1b0019a5, 0x15017988, 0x1c001841, 0x15018327,
448
+ 0x1d0016df, 0x16018d50, 0x1e0015d9, 0x16019547, 0x1f00147c, 0x1701a083, 0x2000138e, 0x1801a8a3,
449
+ 0x21001251, 0x1801b418, 0x22001166, 0x1901bd27, 0x23001068, 0x1a01c77b, 0x24000f7f, 0x1a01d18e,
450
+ 0x25000eda, 0x1b01d91a, 0x26000e19, 0x1b01e254, 0x27000d4f, 0x1c01ec9a, 0x28000c90, 0x1d01f6e0,
451
+ 0x29000c01, 0x1d01fef8, 0x2a000b5f, 0x1e0208b1, 0x2b000ab6, 0x1e021362, 0x2c000a15, 0x1e021e46,
452
+ 0x2d000988, 0x1f02285d, 0x2e000934, 0x20022ea8, 0x2f0008a8, 0x200239b2, 0x3000081d, 0x21024577,
453
+ 0x310007c9, 0x21024ce6, 0x32000763, 0x21025663, 0x33000710, 0x22025e8f, 0x340006a0, 0x22026a26,
454
+ 0x35000672, 0x23026f23, 0x360005e8, 0x23027ef8, 0x370005ba, 0x230284b5, 0x3800055e, 0x24029057,
455
+ 0x3900050c, 0x24029bab, 0x3a0004c1, 0x2402a674, 0x3b0004a7, 0x2502aa5e, 0x3c00046f, 0x2502b32f,
456
+ 0x3d00041f, 0x2502c0ad, 0x3e0003e7, 0x2602ca8d, 0x3e0003ba, 0x2602d323, 0x3f00010c, 0x3f03bfbb,
457
+};
458
+
459
const uint8_t g_nextState[128][2] =
460
{
461
{ 2, 1 }, { 0, 3 }, { 4, 0 }, { 1, 5 }, { 6, 2 }, { 3, 7 }, { 8, 4 }, { 5, 9 },
462
x265_1.6.tar.gz/source/encoder/entropy.h -> x265_1.7.tar.gz/source/encoder/entropy.h
Changed
36
1
2
struct EstBitsSbac
3
{
4
int significantCoeffGroupBits[NUM_SIG_CG_FLAG_CTX][2];
5
- int significantBits[NUM_SIG_FLAG_CTX][2];
6
+ int significantBits[2][NUM_SIG_FLAG_CTX];
7
int lastBits[2][10];
8
int greaterOneBits[NUM_ONE_FLAG_CTX][2];
9
int levelAbsBits[NUM_ABS_FLAG_CTX][2];
10
11
inline void codeQtCbfChroma(uint32_t cbf, uint32_t tuDepth) { encodeBin(cbf, m_contextState[OFF_QT_CBF_CTX + 2 + tuDepth]); }
12
inline void codeQtRootCbf(uint32_t cbf) { encodeBin(cbf, m_contextState[OFF_QT_ROOT_CBF_CTX]); }
13
inline void codeTransformSkipFlags(uint32_t transformSkip, TextType ttype) { encodeBin(transformSkip, m_contextState[OFF_TRANSFORMSKIP_FLAG_CTX + (ttype ? NUM_TRANSFORMSKIP_FLAG_CTX : 0)]); }
14
-
15
+ void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
16
void codeSaoOffset(const SaoCtuParam& ctuParam, int plane);
17
18
/* RDO functions */
19
20
}
21
22
void encodeCU(const CUData& ctu, const CUGeom &cuGeom, uint32_t absPartIdx, uint32_t depth, bool& bEncodeDQP);
23
- void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth);
24
+ void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bEncodeDQP);
25
26
void writeOut();
27
28
29
30
void codeSaoMaxUvlc(uint32_t code, uint32_t maxSymbol);
31
32
- void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
33
void codeLastSignificantXY(uint32_t posx, uint32_t posy, uint32_t log2TrSize, bool bIsLuma, uint32_t scanIdx);
34
35
void encodeTransform(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize,
36
x265_1.6.tar.gz/source/encoder/frameencoder.cpp -> x265_1.7.tar.gz/source/encoder/frameencoder.cpp
Changed
268
1
2
{
3
m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime;
4
m_frame = curFrame;
5
+ m_param = curFrame->m_param;
6
m_sliceType = curFrame->m_lowres.sliceType;
7
curFrame->m_encData->m_frameEncoderID = m_jpId;
8
curFrame->m_encData->m_jobProvider = this;
9
10
uint32_t row = (uint32_t)intRow;
11
CTURow& curRow = m_rows[row];
12
13
+ tld.analysis.m_param = m_param;
14
if (m_param->bEnableWavefront)
15
{
16
ScopedLock self(curRow.lock);
17
18
const uint32_t lineStartCUAddr = row * numCols;
19
bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
20
21
+ /* These store the count of inter, intra and skip cus within quad tree structure of each CTU */
22
+ uint32_t qTreeInterCnt[NUM_CU_DEPTH];
23
+ uint32_t qTreeIntraCnt[NUM_CU_DEPTH];
24
+ uint32_t qTreeSkipCnt[NUM_CU_DEPTH];
25
+ for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
26
+ qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
27
+
28
while (curRow.completed < numCols)
29
{
30
ProfileScopeEvent(encodeCTU);
31
32
curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc);
33
}
34
35
+ FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr];
36
if (row >= col && row && m_vbvResetTriggerRow != intRow)
37
- curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
38
+ cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
39
else
40
- curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_rowStat[row].diagQp;
41
- }
42
- else
43
- curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
44
+ cuStat.baseQp = curEncData.m_rowStat[row].diagQp;
45
+
46
+ /* TODO: use defines from slicetype.h for lowres block size */
47
+ uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
48
+ uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
49
+ uint32_t noOfBlocks = g_maxCUSize / 16;
50
+ uint32_t block_y = (cuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
51
+ uint32_t block_x = (cuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
52
+
53
+ cuStat.vbvCost = 0;
54
+ cuStat.intraVbvCost = 0;
55
+ for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++)
56
+ {
57
+ uint32_t idx = block_x + (block_y * maxBlockCols);
58
59
- if (m_param->rc.aqMode || bIsVbv)
60
- {
61
- int qp = calcQpForCu(cuAddr, curEncData.m_cuStat[cuAddr].baseQp);
62
- tld.analysis.setQP(*slice, qp);
63
- qp = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
64
- ctu->setQPSubParts((int8_t)qp, 0, 0);
65
- curEncData.m_rowStat[row].sumQpAq += qp;
66
+ for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++, idx++)
67
+ {
68
+ cuStat.vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK;
69
+ cuStat.intraVbvCost += m_frame->m_lowres.intraCost[idx];
70
+ }
71
+ }
72
}
73
else
74
- tld.analysis.setQP(*slice, slice->m_sliceQp);
75
+ curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
76
77
if (m_param->bEnableWavefront && !col && row)
78
{
79
80
curRow.completed++;
81
82
if (m_param->bLogCuStats || m_param->rc.bStatWrite)
83
- collectCTUStatistics(*ctu);
84
+ curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, qTreeInterCnt, qTreeIntraCnt, qTreeSkipCnt);
85
+ else if (m_param->rc.aqMode)
86
+ curEncData.m_rowStat[row].sumQpAq += calcCTUQP(*ctu);
87
88
// copy no. of intra, inter Cu cnt per row into frame stats for 2 pass
89
if (m_param->rc.bStatWrite)
90
91
curRow.rowStats.mvBits += best.mvBits;
92
curRow.rowStats.coeffBits += best.coeffBits;
93
curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits);
94
- StatisticLog* log = &m_sliceTypeLog[slice->m_sliceType];
95
96
for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
97
{
98
/* 1 << shift == number of 8x8 blocks at current depth */
99
int shift = 2 * (g_maxCUDepth - depth);
100
- curRow.rowStats.iCuCnt += log->qTreeIntraCnt[depth] << shift;
101
- curRow.rowStats.pCuCnt += log->qTreeInterCnt[depth] << shift;
102
- curRow.rowStats.skipCuCnt += log->qTreeSkipCnt[depth] << shift;
103
+ curRow.rowStats.iCuCnt += qTreeIntraCnt[depth] << shift;
104
+ curRow.rowStats.pCuCnt += qTreeInterCnt[depth] << shift;
105
+ curRow.rowStats.skipCuCnt += qTreeSkipCnt[depth] << shift;
106
107
// clear the row cu data from thread local object
108
- log->qTreeIntraCnt[depth] = log->qTreeInterCnt[depth] = log->qTreeSkipCnt[depth] = 0;
109
+ qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
110
}
111
}
112
113
114
}
115
}
116
117
+ tld.analysis.m_param = NULL;
118
curRow.busy = false;
119
120
if (ATOMIC_INC(&m_completionCount) == 2 * (int)m_numRows)
121
m_completionEvent.trigger();
122
}
123
124
-void FrameEncoder::collectCTUStatistics(CUData& ctu)
125
+/* collect statistics about CU coding decisions, return total QP */
126
+int FrameEncoder::collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt)
127
{
128
StatisticLog* log = &m_sliceTypeLog[ctu.m_slice->m_sliceType];
129
+ int totQP = 0;
130
131
if (ctu.m_slice->m_sliceType == I_SLICE)
132
{
133
134
135
log->totalCu++;
136
log->cntIntra[depth]++;
137
- log->qTreeIntraCnt[depth]++;
138
+ qtreeIntraCnt[depth]++;
139
+ totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
140
141
if (ctu.m_predMode[absPartIdx] == MODE_NONE)
142
{
143
log->totalCu--;
144
log->cntIntra[depth]--;
145
- log->qTreeIntraCnt[depth]--;
146
+ qtreeIntraCnt[depth]--;
147
}
148
else if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
149
{
150
151
152
log->totalCu++;
153
log->cntTotalCu[depth]++;
154
+ totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
155
156
if (ctu.m_predMode[absPartIdx] == MODE_NONE)
157
{
158
159
{
160
log->totalCu--;
161
log->cntSkipCu[depth]++;
162
- log->qTreeSkipCnt[depth]++;
163
+ qtreeSkipCnt[depth]++;
164
}
165
else if (ctu.isInter(absPartIdx))
166
{
167
log->cntInter[depth]++;
168
- log->qTreeInterCnt[depth]++;
169
+ qtreeInterCnt[depth]++;
170
171
if (ctu.m_partSize[absPartIdx] < AMP_ID)
172
log->cuInterDistribution[depth][ctu.m_partSize[absPartIdx]]++;
173
174
else if (ctu.isIntra(absPartIdx))
175
{
176
log->cntIntra[depth]++;
177
- log->qTreeIntraCnt[depth]++;
178
+ qtreeIntraCnt[depth]++;
179
180
if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
181
{
182
X265_CHECK(ctu.m_log2CUSize[absPartIdx] == 3 && ctu.m_slice->m_sps->quadtreeTULog2MinSize < 3, "Intra NxN found at improbable depth\n");
183
log->cntIntraNxN++;
184
+ log->cntIntra[depth]--;
185
/* TODO: log intra modes at absPartIdx +0 to +3 */
186
}
187
else if (ctu.m_lumaIntraDir[absPartIdx] > 1)
188
189
}
190
}
191
}
192
+
193
+ return totQP;
194
+}
195
+
196
+/* iterate over coded CUs and determine total QP */
197
+int FrameEncoder::calcCTUQP(const CUData& ctu)
198
+{
199
+ int totQP = 0;
200
+ uint32_t depth = 0, numParts = ctu.m_numPartitions;
201
+
202
+ for (uint32_t absPartIdx = 0; absPartIdx < ctu.m_numPartitions; absPartIdx += numParts)
203
+ {
204
+ depth = ctu.m_cuDepth[absPartIdx];
205
+ numParts = ctu.m_numPartitions >> (depth * 2);
206
+ totQP += ctu.m_qp[absPartIdx] * numParts;
207
+ }
208
+ return totQP;
209
}
210
211
/* DCT-domain noise reduction / adaptive deadzone from libavcodec */
212
213
}
214
}
215
216
-int FrameEncoder::calcQpForCu(uint32_t ctuAddr, double baseQp)
217
-{
218
- x265_emms();
219
- double qp = baseQp;
220
-
221
- FrameData& curEncData = *m_frame->m_encData;
222
- /* clear cuCostsForVbv from when vbv row reset was triggered */
223
- bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
224
- if (bIsVbv)
225
- {
226
- curEncData.m_cuStat[ctuAddr].vbvCost = 0;
227
- curEncData.m_cuStat[ctuAddr].intraVbvCost = 0;
228
- }
229
-
230
- /* Derive qpOffet for each CU by averaging offsets for all 16x16 blocks in the cu. */
231
- double qp_offset = 0;
232
- uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
233
- uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
234
- uint32_t noOfBlocks = g_maxCUSize / 16;
235
- uint32_t block_y = (ctuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
236
- uint32_t block_x = (ctuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
237
-
238
- /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */
239
- bool isReferenced = IS_REFERENCED(m_frame);
240
- double *qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset;
241
-
242
- uint32_t cnt = 0, idx = 0;
243
- for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++)
244
- {
245
- for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++)
246
- {
247
- idx = block_x + w + (block_y * maxBlockCols);
248
- if (m_param->rc.aqMode)
249
- qp_offset += qpoffs[idx];
250
- if (bIsVbv)
251
- {
252
- curEncData.m_cuStat[ctuAddr].vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK;
253
- curEncData.m_cuStat[ctuAddr].intraVbvCost += m_frame->m_lowres.intraCost[idx];
254
- }
255
- cnt++;
256
- }
257
- }
258
-
259
- qp_offset /= cnt;
260
- qp += qp_offset;
261
-
262
- return x265_clip3(QP_MIN, QP_MAX_MAX, (int)(qp + 0.5));
263
-}
264
-
265
Frame *FrameEncoder::getEncodedPicture(NALList& output)
266
{
267
if (m_frame)
268
x265_1.6.tar.gz/source/encoder/frameencoder.h -> x265_1.7.tar.gz/source/encoder/frameencoder.h
Changed
24
1
2
uint64_t cntTotalCu[4];
3
uint64_t totalCu;
4
5
- /* These states store the count of inter,intra and skip ctus within quad tree structure of each CU */
6
- uint32_t qTreeInterCnt[4];
7
- uint32_t qTreeIntraCnt[4];
8
- uint32_t qTreeSkipCnt[4];
9
-
10
StatisticLog()
11
{
12
memset(this, 0, sizeof(StatisticLog));
13
14
void encodeSlice();
15
16
void threadMain();
17
- int calcQpForCu(uint32_t cuAddr, double baseQp);
18
- void collectCTUStatistics(CUData& ctu);
19
+ int collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt);
20
+ int calcCTUQP(const CUData& ctu);
21
void noiseReductionUpdate();
22
23
/* Called by WaveFront::findJob() */
24
x265_1.6.tar.gz/source/encoder/level.cpp -> x265_1.7.tar.gz/source/encoder/level.cpp
Changed
138
1
2
{ 35651584, 1069547520, 60000, 240000, 60000, 240000, 8, Level::LEVEL6, "6", 60 },
3
{ 35651584, 2139095040, 120000, 480000, 120000, 480000, 8, Level::LEVEL6_1, "6.1", 61 },
4
{ 35651584, 4278190080U, 240000, 800000, 240000, 800000, 6, Level::LEVEL6_2, "6.2", 62 },
5
+ { MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, 1, Level::LEVEL8_5, "8.5", 85 },
6
};
7
8
/* determine minimum decoder level required to decode the described video */
9
void determineLevel(const x265_param &param, VPS& vps)
10
{
11
vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
12
- if (param.bLossless)
13
- vps.ptl.profileIdc = Profile::NONE;
14
- else if (param.internalCsp == X265_CSP_I420)
15
+ if (param.internalCsp == X265_CSP_I420)
16
{
17
if (param.internalBitDepth == 8)
18
{
19
20
21
const size_t NumLevels = sizeof(levels) / sizeof(levels[0]);
22
uint32_t i;
23
- for (i = 0; i < NumLevels; i++)
24
+ if (param.bLossless)
25
+ {
26
+ i = 13;
27
+ vps.ptl.minCrForLevel = 1;
28
+ vps.ptl.maxLumaSrForLevel = MAX_UINT;
29
+ vps.ptl.levelIdc = Level::LEVEL8_5;
30
+ vps.ptl.tierFlag = Level::MAIN;
31
+ }
32
+ else for (i = 0; i < NumLevels; i++)
33
{
34
if (lumaSamples > levels[i].maxLumaSamples)
35
continue;
36
37
extern "C"
38
int x265_param_apply_profile(x265_param *param, const char *profile)
39
{
40
- if (!profile)
41
+ if (!param || !profile)
42
return 0;
43
- if (!strcmp(profile, "main"))
44
- {
45
- /* SPSs shall have chroma_format_idc equal to 1 only */
46
- param->internalCsp = X265_CSP_I420;
47
48
#if HIGH_BIT_DEPTH
49
- /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
50
- x265_log(param, X265_LOG_ERROR, "Main profile not supported, compiled for Main10.\n");
51
+ if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || !strcmp(profile, "main444-8"))
52
+ {
53
+ x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main10.\n", profile);
54
return -1;
55
-#endif
56
}
57
- else if (!strcmp(profile, "main10"))
58
+#else
59
+ if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10"))
60
{
61
- /* SPSs shall have chroma_format_idc equal to 1 only */
62
- param->internalCsp = X265_CSP_I420;
63
-
64
- /* SPSs shall have bit_depth_luma_minus8 in the range of 0 to 2, inclusive
65
- * this covers all builds of x265, currently */
66
+ x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main.\n", profile);
67
+ return -1;
68
+ }
69
+#endif
70
+
71
+ if (!strcmp(profile, "main"))
72
+ {
73
+ if (!(param->internalCsp & X265_CSP_I420))
74
+ {
75
+ x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
76
+ profile, x265_source_csp_names[param->internalCsp]);
77
+ return -1;
78
+ }
79
}
80
else if (!strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp"))
81
{
82
- /* SPSs shall have chroma_format_idc equal to 1 only */
83
- param->internalCsp = X265_CSP_I420;
84
+ if (!(param->internalCsp & X265_CSP_I420))
85
+ {
86
+ x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
87
+ profile, x265_source_csp_names[param->internalCsp]);
88
+ return -1;
89
+ }
90
91
/* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */
92
param->maxNumReferences = 1;
93
94
param->rc.cuTree = 0;
95
param->bEnableWeightedPred = 0;
96
param->bEnableWeightedBiPred = 0;
97
-
98
-#if HIGH_BIT_DEPTH
99
- /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
100
- x265_log(param, X265_LOG_ERROR, "Mainstillpicture profile not supported, compiled for Main10.\n");
101
- return -1;
102
-#endif
103
+ }
104
+ else if (!strcmp(profile, "main10"))
105
+ {
106
+ if (!(param->internalCsp & X265_CSP_I420))
107
+ {
108
+ x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
109
+ profile, x265_source_csp_names[param->internalCsp]);
110
+ return -1;
111
+ }
112
}
113
else if (!strcmp(profile, "main422-10"))
114
- param->internalCsp = X265_CSP_I422;
115
- else if (!strcmp(profile, "main444-8"))
116
{
117
- param->internalCsp = X265_CSP_I444;
118
-#if HIGH_BIT_DEPTH
119
- x265_log(param, X265_LOG_ERROR, "Main 4:4:4 8 profile not supported, compiled for Main10.\n");
120
- return -1;
121
-#endif
122
+ if (!(param->internalCsp & (X265_CSP_I420 | X265_CSP_I422)))
123
+ {
124
+ x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
125
+ profile, x265_source_csp_names[param->internalCsp]);
126
+ return -1;
127
+ }
128
+ }
129
+ else if (!strcmp(profile, "main444-8") || !strcmp(profile, "main444-10"))
130
+ {
131
+ /* any color space allowed */
132
}
133
- else if (!strcmp(profile, "main444-10"))
134
- param->internalCsp = X265_CSP_I444;
135
else
136
{
137
x265_log(param, X265_LOG_ERROR, "unknown profile <%s>\n", profile);
138
x265_1.6.tar.gz/source/encoder/motion.cpp -> x265_1.7.tar.gz/source/encoder/motion.cpp
Changed
77
1
2
pix_base + (m1x) + (m1y) * stride, \
3
pix_base + (m2x) + (m2y) * stride, \
4
stride, costs); \
5
- (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
6
- (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
7
- (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
8
+ const uint16_t *base_mvx = &m_cost_mvx[(bmv.x + (m0x)) << 2]; \
9
+ const uint16_t *base_mvy = &m_cost_mvy[(bmv.y + (m0y)) << 2]; \
10
+ X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]), "mvcost() check failure\n"); \
11
+ X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]), "mvcost() check failure\n"); \
12
+ X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]), "mvcost() check failure\n"); \
13
+ (costs)[0] += (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]); \
14
+ (costs)[1] += (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]); \
15
+ (costs)[2] += (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]); \
16
}
17
18
#define COST_MV_PT_DIST_X4(m0x, m0y, p0, d0, m1x, m1y, p1, d1, m2x, m2y, p2, d2, m3x, m3y, p3, d3) \
19
20
fref + (m2x) + (m2y) * stride, \
21
fref + (m3x) + (m3y) * stride, \
22
stride, costs); \
23
- costs[0] += mvcost(MV(m0x, m0y) << 2); \
24
- costs[1] += mvcost(MV(m1x, m1y) << 2); \
25
- costs[2] += mvcost(MV(m2x, m2y) << 2); \
26
- costs[3] += mvcost(MV(m3x, m3y) << 2); \
27
+ (costs)[0] += mvcost(MV(m0x, m0y) << 2); \
28
+ (costs)[1] += mvcost(MV(m1x, m1y) << 2); \
29
+ (costs)[2] += mvcost(MV(m2x, m2y) << 2); \
30
+ (costs)[3] += mvcost(MV(m3x, m3y) << 2); \
31
COPY4_IF_LT(bcost, costs[0], bmv, MV(m0x, m0y), bPointNr, p0, bDistance, d0); \
32
COPY4_IF_LT(bcost, costs[1], bmv, MV(m1x, m1y), bPointNr, p1, bDistance, d1); \
33
COPY4_IF_LT(bcost, costs[2], bmv, MV(m2x, m2y), bPointNr, p2, bDistance, d2); \
34
35
pix_base + (m2x) + (m2y) * stride, \
36
pix_base + (m3x) + (m3y) * stride, \
37
stride, costs); \
38
- costs[0] += mvcost((omv + MV(m0x, m0y)) << 2); \
39
- costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \
40
- costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \
41
- costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \
42
+ const uint16_t *base_mvx = &m_cost_mvx[(omv.x << 2)]; \
43
+ const uint16_t *base_mvy = &m_cost_mvy[(omv.y << 2)]; \
44
+ X265_CHECK(mvcost((omv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
45
+ X265_CHECK(mvcost((omv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
46
+ X265_CHECK(mvcost((omv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+ X265_CHECK(mvcost((omv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+ costs[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+ costs[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+ costs[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+ costs[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
pix_base + (m2x) + (m2y) * stride, \
pix_base + (m3x) + (m3y) * stride, \
stride, costs); \
- (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
- (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
- (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
- (costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \
+ /* TODO: use restrict keyword in ICL */ \
+ const uint16_t *base_mvx = &m_cost_mvx[(bmv.x << 2)]; \
+ const uint16_t *base_mvy = &m_cost_mvy[(bmv.y << 2)]; \
+ X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
+ X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
+ X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+ X265_CHECK(mvcost((bmv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+ (costs)[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+ (costs)[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+ (costs)[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+ (costs)[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
}

#define DIA1_ITER(mx, my) \
x265_1.6.tar.gz/source/encoder/nal.cpp -> x265_1.7.tar.gz/source/encoder/nal.cpp
Changed
, m_extraBuffer(NULL)
, m_extraOccupancy(0)
, m_extraAllocSize(0)
+ , m_annexB(true)
{}

void NALList::takeContents(NALList& other)

uint8_t *out = m_buffer + m_occupancy;
uint32_t bytes = 0;

- if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
+ if (!m_annexB)
+ {
+ /* Will write size later */
+ bytes += 4;
+ }
+ else if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
{
memcpy(out, startCodePrefix, 4);
bytes += 4;

* to 0x03 is appended to the end of the data. */
if (!out[bytes - 1])
out[bytes++] = 0x03;
+
+ if (!m_annexB)
+ {
+ uint32_t dataSize = bytes - 4;
+ out[0] = (uint8_t)(dataSize >> 24);
+ out[1] = (uint8_t)(dataSize >> 16);
+ out[2] = (uint8_t)(dataSize >> 8);
+ out[3] = (uint8_t)dataSize;
+ }
+
m_occupancy += bytes;

X265_CHECK(m_numNal < (uint32_t)MAX_NAL_UNITS, "NAL count overflow\n");
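The non-Annex-B path above reserves four bytes up front, then back-fills a 32-bit big-endian length once the escaped payload has been written. A minimal standalone sketch of that prefix layout (the function name is illustrative, not x265's):

```cpp
#include <cassert>
#include <cstdint>

// Back-fill a 32-bit big-endian NAL length prefix, as the m_annexB == false
// path does after emulation-prevention escaping. dataSize excludes the
// four prefix bytes themselves.
static void writeSizePrefix(uint8_t* out, uint32_t dataSize)
{
    out[0] = (uint8_t)(dataSize >> 24);
    out[1] = (uint8_t)(dataSize >> 16);
    out[2] = (uint8_t)(dataSize >> 8);
    out[3] = (uint8_t)dataSize;
}
```

This length-prefixed framing is what MP4-style muxers expect, which is why `--annexb` is an API flag rather than a CLI option.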
x265_1.6.tar.gz/source/encoder/nal.h -> x265_1.7.tar.gz/source/encoder/nal.h
Changed
uint8_t* m_extraBuffer;
uint32_t m_extraOccupancy;
uint32_t m_extraAllocSize;
+ bool m_annexB;

NALList();
~NALList() { X265_FREE(m_buffer); X265_FREE(m_extraBuffer); }
x265_1.6.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.7.tar.gz/source/encoder/ratecontrol.cpp
Changed
}
}

- /* qstep - value set as encoder specific */
+ /* qpstep - value set as encoder specific */
m_lstep = pow(2, m_param->rc.qpStep / 6.0);

for (int i = 0; i < 2; i++)

m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm;

/* Frame Predictors and Row predictors used in vbv */
- for (int i = 0; i < 5; i++)
+ for (int i = 0; i < 4; i++)
{
- m_pred[i].coeff = 1.5;
+ m_pred[i].coeff = 1.0;
m_pred[i].count = 1.0;
m_pred[i].decay = 0.5;
m_pred[i].offset = 0.0;
}
- m_pred[0].coeff = 1.0;
+ m_pred[0].coeff = m_pred[3].coeff = 0.75;
+ if (m_param->rc.qCompress >= 0.8) // when tuned for grain
+ {
+ m_pred[1].coeff = 0.75;
+ m_pred[0].coeff = m_pred[3].coeff = 0.50;
+ }
if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead))
{
/* If the user hasn't defined the stat filename, use the default value */

m_curSlice = curEncData.m_slice;
m_sliceType = m_curSlice->m_sliceType;
rce->sliceType = m_sliceType;
+ if (!m_2pass)
+ rce->keptAsRef = IS_REFERENCED(curFrame);
+ m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType);
rce->poc = m_curSlice->m_poc;
if (m_param->rc.bStatRead)
{

m_lastQScaleFor[m_sliceType] = x265_qp2qScale(rce->qpaRc);
if (rce->poc == 0)
m_lastQScaleFor[P_SLICE] = m_lastQScaleFor[m_sliceType] * fabs(m_param->rc.ipFactor);
- rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], m_qp, (double)m_currentSatd);
+ rce->frameSizePlanned = predictSize(&m_pred[m_predType], m_qp, (double)m_currentSatd);
}
}
m_framesDone++;

m_accumPQp += m_qp;
}

+int RateControl::getPredictorType(int lowresSliceType, int sliceType)
+{
+ /* Use a different predictor for B Ref and B frames for vbv frame size predictions */
+ if (lowresSliceType == X265_TYPE_BREF)
+ return 3;
+ return sliceType;
+}
+
double RateControl::getDiffLimitedQScale(RateControlEntry *rce, double q)
{
// force I/B quants as a function of P quants

q += m_pbOffset;

double qScale = x265_qp2qScale(q);
+ rce->qpNoVbv = q;
double lmin = 0, lmax = 0;
if (m_isVbv)
{

qScale = x265_clip3(lmin, lmax, qScale);
q = x265_qScale2qp(qScale);
}
- rce->qpNoVbv = q;
if (!m_2pass)
{
qScale = clipQscale(curFrame, rce, qScale);
/* clip qp to permissible range after vbv-lookahead estimation to avoid possible
* mispredictions by initial frame size predictors */
- if (m_pred[m_sliceType].count == 1)
+ if (m_pred[m_predType].count == 1)
qScale = x265_clip3(lmin, lmax, qScale);
m_lastQScaleFor[m_sliceType] = qScale;
- rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], qScale, (double)m_currentSatd);
+ rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd);
}
else
rce->frameSizePlanned = qScale2bits(rce, qScale);

q = clipQscale(curFrame, rce, q);
/* clip qp to permissible range after vbv-lookahead estimation to avoid possible
* mispredictions by initial frame size predictors */
- if (!m_2pass && m_isVbv && m_pred[m_sliceType].count == 1)
+ if (!m_2pass && m_isVbv && m_pred[m_predType].count == 1)
q = x265_clip3(lqmin, lqmax, q);
}
m_lastQScaleFor[m_sliceType] = q;

if (m_2pass && m_isVbv)
rce->frameSizePlanned = qScale2bits(rce, q);
else
- rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+ rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);

/* Always use up the whole VBV in this case. */
if (m_singleFrameVbv)

{
double frameQ[3];
double curBits;
- curBits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+ curBits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
double bufferFillCur = m_bufferFill - curBits;
double targetFill;
double totalDuration = m_frameDuration;

bufferFillCur += wantedFrameSize;
int64_t satd = curFrame->m_lowres.plannedSatd[j] >> (X265_DEPTH - 8);
type = IS_X265_TYPE_I(type) ? I_SLICE : IS_X265_TYPE_B(type) ? B_SLICE : P_SLICE;
- curBits = predictSize(&m_pred[type], frameQ[type], (double)satd);
+ int predType = getPredictorType(curFrame->m_lowres.plannedType[j], type);
+ curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd);
bufferFillCur -= curBits;
}

}
// Now a hard threshold to make sure the frame fits in VBV.
// This one is mostly for I-frames.
- double bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+ double bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);

// For small VBVs, allow the frame to use up the entire VBV.
double maxFillFactor;

bits *= qf;
if (bits < m_bufferRate / minFillFactor)
q *= bits * minFillFactor / m_bufferRate;
- bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+ bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
}

q = X265_MAX(q0, q);
}

/* Apply MinCR restrictions */
- double pbits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+ double pbits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
if (pbits > rce->frameSizeMaximum)
q *= pbits / rce->frameSizeMaximum;
-
- if (!m_isCbr || (m_isAbr && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2))
+ /* To detect frames that are more complex in SATD costs compared to prev window, yet
+ * lookahead vbv reduces its qscale by half its value. Be on safer side and avoid drastic
+ * qscale reductions for frames high in complexity */
+ bool mispredCheck = rce->movingAvgSum && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2;
+ if (!m_isCbr || (m_isAbr && mispredCheck))
q = X265_MAX(q0, q);

if (m_rateFactorMaxIncrement)

if (satdCostForPendingCus > 0)
{
double pred_s = predictSize(rce->rowPred[0], qScale, satdCostForPendingCus);
- uint32_t refRowSatdCost = 0, refRowBits = 0, intraCost = 0;
+ uint32_t refRowSatdCost = 0, refRowBits = 0, intraCostForPendingCus = 0;
double refQScale = 0;

if (picType != I_SLICE)
{
FrameData& refEncData = *refFrame->m_encData;
uint32_t endCuAddr = maxCols * (row + 1);
- for (uint32_t cuAddr = curEncData.m_rowStat[row].numEncodedCUs + 1; cuAddr < endCuAddr; cuAddr++)
+ uint32_t startCuAddr = curEncData.m_rowStat[row].numEncodedCUs;
+ if (startCuAddr)
{
- refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
- refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
- intraCost += curEncData.m_cuStat[cuAddr].intraVbvCost;
+ for (uint32_t cuAddr = startCuAddr + 1 ; cuAddr < endCuAddr; cuAddr++)
+ {
+ refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
+ refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
+ }
+ }
+ else
+ {
+ refRowBits = refEncData.m_rowStat[row].encodedBits;
+ refRowSatdCost = refEncData.m_rowStat[row].satdForVbv;
}

refRowSatdCost >>= X265_DEPTH - 8;

if (picType == I_SLICE || qScale >= refQScale)
{
if (picType == P_SLICE
- && !refFrame
+ && refFrame
&& refFrame->m_encData->m_slice->m_sliceType == picType
&& refQScale > 0
&& refRowSatdCost > 0)

}
else if (picType == P_SLICE)
{
+ intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd;
/* Our QP is lower than the reference! */
- double pred_intra = predictSize(rce->rowPred[1], qScale, intraCost);
+ double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus);
/* Sum: better to overestimate than underestimate by using only one of the two predictors. */
totalSatdBits += (int32_t)(pred_intra + pred_s);
}

void RateControl::updateVbv(int64_t bits, RateControlEntry* rce)
{
+ int predType = rce->sliceType;
+ predType = rce->sliceType == B_SLICE && rce->keptAsRef ? 3 : predType;
if (rce->lastSatd >= m_ncu)
- updatePredictor(&m_pred[rce->sliceType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits);
+ updatePredictor(&m_pred[predType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits);
if (!m_isVbv)
return;

{
if (m_isVbv)
{
+ /* determine avg QP decided by VBV rate control */
for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
curEncData.m_avgQpRc += curEncData.m_rowStat[i].sumQpRc;

curEncData.m_avgQpRc /= slice->m_sps->numCUsInFrame;
rce->qpaRc = curEncData.m_avgQpRc;
-
- // copy avg RC qp to m_avgQpAq. To print out the correct qp when aq/cutree is disabled.
- curEncData.m_avgQpAq = curEncData.m_avgQpRc;
}

if (m_param->rc.aqMode)
{
+ /* determine actual avg encoded QP, after AQ/cutree adjustments */
for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
curEncData.m_avgQpAq += curEncData.m_rowStat[i].sumQpAq;

- curEncData.m_avgQpAq /= slice->m_sps->numCUsInFrame;
+ curEncData.m_avgQpAq /= (slice->m_sps->numCUsInFrame * NUM_4x4_PARTITIONS);
}
+ else
+ curEncData.m_avgQpAq = curEncData.m_avgQpRc;
}

// Write frame stats into the stats file if 2 pass is enabled.

{
m_finalFrameCount = count;
/* unblock waiting threads */
- m_startEndOrder.set(m_startEndOrder.get());
+ m_startEndOrder.poke();
}

/* called when the encoder is closing, and no more frames will be output.

{
m_bTerminated = true;
/* unblock waiting threads */
- m_startEndOrder.set(m_startEndOrder.get());
+ m_startEndOrder.poke();
}

void RateControl::destroy()
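The new getPredictorType() above gives referenced B-frames their own VBV frame-size predictor (slot 3) while I/P/B slices keep their slice-type slots. A self-contained sketch of that mapping (the constant values here are illustrative stand-ins, not x265's actual enum values):

```cpp
#include <cassert>

// Illustrative stand-ins; x265's real constants live in x265.h / slice headers.
enum SliceType { B_SLICE = 0, P_SLICE = 1, I_SLICE = 2 };
const int TYPE_BREF = 100; // stand-in for X265_TYPE_BREF

// Mirror of the predictor selection: a B-ref frame uses predictor slot 3,
// everything else indexes m_pred[] by its slice type.
static int getPredictorType(int lowresSliceType, int sliceType)
{
    return (lowresSliceType == TYPE_BREF) ? 3 : sliceType;
}
```

Separating B-ref from plain B matters because referenced B-frames are typically coded at lower QP and larger size, so sharing one predictor would bias both estimates.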
x265_1.6.tar.gz/source/encoder/ratecontrol.h -> x265_1.7.tar.gz/source/encoder/ratecontrol.h
Changed
double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */
double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */

- Predictor m_pred[5];
- Predictor m_predBfromP;
-
+ Predictor m_pred[4]; /* Slice predictors to predict bits for each Slice type - I,P,Bref and B */
int64_t m_leadingNoBSatd;
+ int m_predType; /* Type of slice predictors to be used - depends on the slice type */
double m_ipOffset;
double m_pbOffset;
int64_t m_bframeBits;

double tuneAbrQScaleFromFeedback(double qScale);
void accumPQpUpdate();

+ int getPredictorType(int lowresSliceType, int sliceType);
void updateVbv(int64_t bits, RateControlEntry* rce);
void updatePredictor(Predictor *p, double q, double var, double bits);
double clipQscale(Frame* pic, RateControlEntry* rce, double q);
x265_1.6.tar.gz/source/encoder/rdcost.h -> x265_1.7.tar.gz/source/encoder/rdcost.h
Changed
uint32_t m_chromaDistWeight[2];
uint32_t m_psyRdBase;
uint32_t m_psyRd;
- int m_qp;
+ int m_qp; /* QP used to configure lambda, may be higher than QP_MAX_SPEC but <= QP_MAX_MAX */

void setPsyRdScale(double scale) { m_psyRdBase = (uint32_t)floor(65536.0 * scale * 0.33); }

void setQP(const Slice& slice, int qp)
{
+ x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
m_qp = qp;
+ setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);

/* Scale PSY RD factor by a slice type factor */
static const uint32_t psyScaleFix8[3] = { 300, 256, 96 }; /* B, P, I */

}

int qpCb, qpCr;
- setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);
if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
- qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
+ {
+ qpCb = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[0])];
+ qpCr = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[1])];
+ }
else
- qpCb = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
+ {
+ qpCb = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[0]);
+ qpCr = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[1]);
+ }
+
int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET);
uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
m_chromaDistWeight[0] = lambdaOffset;

- if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
- qpCr = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
- else
- qpCr = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET);
lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
m_chromaDistWeight[1] = lambdaOffset;
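The rdcost.h hunk above moves the clamp so that QP plus chroma offset is clipped into the table's valid range before indexing g_chromaScale, rather than clipping the looked-up value (which cannot repair an out-of-range index). A small sketch of the fixed ordering, with a dummy table and stand-in limits:

```cpp
#include <algorithm>
#include <cassert>

// Dummy stand-ins for x265's QP limits; the real table is g_chromaScale.
const int QP_MIN = 0, QP_MAX_MAX = 69;

static int clip3(int lo, int hi, int v) { return std::min(hi, std::max(lo, v)); }

// Clamp the index first, then look up -- the 1.7 ordering. Reading
// table[qp + offset] before clamping (the 1.6 code) can index out of bounds.
static int chromaQP(const int* table, int qp, int offset)
{
    return table[clip3(QP_MIN, QP_MAX_MAX, qp + offset)];
}
```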
x265_1.6.tar.gz/source/encoder/sao.cpp -> x265_1.7.tar.gz/source/encoder/sao.cpp
Changed
pixel* tmpL;
pixel* tmpU;

- int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1;
+ int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1, signLeft1[2];
int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1;

memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */

{
case SAO_EO_0: // dir: -
{
- pixel firstPxl = 0, lastPxl = 0;
+ pixel firstPxl = 0, lastPxl = 0, row1FirstPxl = 0, row1LastPxl = 0;
startX = !lpelx;
endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
if (ctuWidth & 15)

}
else
{
- for (y = 0; y < ctuHeight; y++)
+ for (y = 0; y < ctuHeight; y += 2)
{
- int signLeft = signOf(rec[startX] - tmpL[y]);
+ signLeft1[0] = signOf(rec[startX] - tmpL[y]);
+ signLeft1[1] = signOf(rec[stride + startX] - tmpL[y + 1]);

if (!lpelx)
+ {
firstPxl = rec[0];
+ row1FirstPxl = rec[stride];
+ }

if (rpelx == picWidth)
+ {
lastPxl = rec[ctuWidth - 1];
+ row1LastPxl = rec[stride + ctuWidth - 1];
+ }

- primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, (int8_t)signLeft);
+ primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, signLeft1, stride);

if (!lpelx)
+ {
rec[0] = firstPxl;
+ rec[stride] = row1FirstPxl;
+ }

if (rpelx == picWidth)
+ {
rec[ctuWidth - 1] = lastPxl;
+ rec[stride + ctuWidth - 1] = row1LastPxl;
+ }

- rec += stride;
+ rec += 2 * stride;
}
}
break;

{
primitives.sign(upBuff1, rec, tmpU, ctuWidth);

- for (y = startY; y < endY; y++)
+ int diff = (endY - startY) % 2;
+ for (y = startY; y < endY - diff; y += 2)
{
- primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
- rec += stride;
+ primitives.saoCuOrgE1_2Rows(rec, upBuff1, m_offsetEo, stride, ctuWidth);
+ rec += 2 * stride;
}
+ if (diff & 1)
+ primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
}

break;

for (y = startY; y < endY; y++)
{
int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]);
- pixel firstPxl = rec[0]; // copy first Pxl
- pixel lastPxl = rec[ctuWidth - 1];
- int8_t one = upBufft[1];
- int8_t two = upBufft[endX + 1];

- primitives.saoCuOrgE2(rec, upBufft, upBuff1, m_offsetEo, ctuWidth, stride);
- if (!lpelx)
- {
- rec[0] = firstPxl;
- upBufft[1] = one;
- }
-
- if (rpelx == picWidth)
- {
- rec[ctuWidth - 1] = lastPxl;
- upBufft[endX + 1] = two;
- }
+ primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, m_offsetEo, endX - startX, stride);

upBufft[startX] = iSignDown2;

upBuff1[x - 1] = -signDown;
rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]];

- primitives.saoCuOrgE3(rec, upBuff1, m_offsetEo, stride - 1, startX, endX);
+ primitives.saoCuOrgE3[endX > 16](rec, upBuff1, m_offsetEo, stride - 1, startX, endX);

upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]);

rec += stride;
}

- if (!(ctuWidth & 15))
- primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);
- else
- {
- for (x = 0; x < ctuWidth; x++)
- upBuff1[x] = signOf(rec[x] - rec[x - stride]);
- }
+ primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);

for (y = startY; y < endY; y++)
{

rec += stride;
}

- for (x = startX; x < endX; x++)
- upBuff1[x] = signOf(rec[x] - rec[x - stride - 1]);
+ primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX));

for (y = startY; y < endY; y++)
{

rec += stride;
}

- for (x = startX - 1; x < endX; x++)
- upBuff1[x] = signOf(rec[x] - rec[x - stride + 1]);
+ primitives.sign(&upBuff1[startX - 1], &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1));

for (y = startY; y < endY; y++)
{
x265_1.6.tar.gz/source/encoder/search.cpp -> x265_1.7.tar.gz/source/encoder/search.cpp
Changed
X265_FREE(m_tsRecon);
}

-void Search::setQP(const Slice& slice, int qp)
+int Search::setLambdaFromQP(const CUData& ctu, int qp)
{
- x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
+ X265_CHECK(qp >= QP_MIN && qp <= QP_MAX_MAX, "QP used for lambda is out of range\n");
+
m_me.setQP(qp);
- m_rdCost.setQP(slice, qp);
+ m_rdCost.setQP(*m_slice, qp);
+
+ int quantQP = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
+ m_quant.setQPforQuant(ctu, quantQP);
+ return quantQP;
}

#if CHECKED_BUILD || _DEBUG

intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size);
}
updateModeCost(intraMode);
- checkDQP(cu, cuGeom);
+ checkDQP(intraMode, cuGeom);
}

/* Note that this function does not save the best intra prediction, it must

pixel nScale[129];
intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
- primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+ primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);

// we do not estimate filtering for downscaled samples
- for (int x = 1; x < 65; x++)
- {
- intraNeighbourBuf[0][x] = nScale[x]; // Top pixel
- intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
- intraNeighbourBuf[1][x] = nScale[x]; // Top pixel
- intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
- }
+ memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel)); // Top & Left pixels
+ memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));

scaleTuSize = 32;
scaleStride = 32;

X265_CHECK(cu.m_partSize[0] == SIZE_2Nx2N, "encodeIntraInInter does not expect NxN intra\n");
X265_CHECK(!m_slice->isIntra(), "encodeIntraInInter does not expect to be used in I slices\n");

- m_quant.setQPforQuant(cu);
-
uint32_t tuDepthRange[2];
cu.getIntraTUQtDepthRange(tuDepthRange, 0);


m_entropyCoder.store(intraMode.contexts);
updateModeCost(intraMode);
- checkDQP(intraMode.cu, cuGeom);
+ checkDQP(intraMode, cuGeom);
}

uint32_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes)

pixel nScale[129];
intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
- primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+ primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);

- // TO DO: primitive
- for (int x = 1; x < 65; x++)
- {
- intraNeighbourBuf[0][x] = nScale[x]; // Top pixel
- intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
- intraNeighbourBuf[1][x] = nScale[x]; // Top pixel
- intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
- }
+ memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));
+ memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));

scaleTuSize = 32;
scaleStride = 32;

return outCost;
}

+/* Pick between the two AMVP candidates which is the best one to use as
+ * MVP for the motion search, based on SAD cost */
+int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref)
+{
+ if (amvp[0] == amvp[1])
+ return 0;
+
+ Yuv& tmpPredYuv = m_rqt[cu.m_cuDepth[0]].tmpPredYuv;
+ uint32_t costs[AMVP_NUM_CANDS];
+
+ for (int i = 0; i < AMVP_NUM_CANDS; i++)
+ {
+ MV mvCand = amvp[i];
+
+ // NOTE: skip mvCand if Y is > merange and -FN>1
+ if (m_bFrameParallel && (mvCand.y >= (m_param->searchRange + 1) * 4))
+ costs[i] = m_me.COST_MAX;
+ else
+ {
+ cu.clipMv(mvCand);
+ predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
+ costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+ }
+ }
+
+ return costs[0] <= costs[1] ? 0 : 1;
+}
+
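The new selectMVP() above factors out a pattern that 1.6 repeated three times: compute the SAD of the motion-compensated prediction for each AMVP candidate and keep the cheaper one, with ties and skipped candidates going to index 0. A reduced sketch of that decision rule, where the SAD costs are placeholder inputs rather than real predictions:

```cpp
#include <cassert>
#include <cstdint>

// Placeholder for m_me.COST_MAX: the cost assigned to a candidate skipped
// under frame-parallelism (MV.y beyond the search range).
const uint32_t COST_MAX = UINT32_MAX;

// Pick between two candidate costs the way selectMVP() does:
// index 0 wins ties, so the first candidate is preferred when equal.
static int pickCandidate(uint32_t cost0, uint32_t cost1)
{
    return cost0 <= cost1 ? 0 : 1;
}
```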
void Search::PME::processTasks(int workerThreadId)
{
#if DETAILED_CU_STATS

/* Setup slave Search instance for ME for master's CU */
if (&slave != this)
{
- slave.setQP(*m_slice, m_rdCost.m_qp);
slave.m_slice = m_slice;
slave.m_frame = m_frame;
-
+ slave.m_param = m_param;
+ slave.setLambdaFromQP(pme.mode.cu, m_rdCost.m_qp);
slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height);
}

do
{
if (meId < m_slice->m_numRefIdx[0])
- slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 0, meId);
+ slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, meId);
else
- slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);
+ slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);

meId = -1;
pme.m_lock.acquire();

while (meId >= 0);
}

-void Search::singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu,
- int part, int list, int ref)
+void Search::singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref)
{
uint32_t bits = master.m_listSelBits[list] + MVP_IDX_BITS;
bits += getTUBits(ref, m_slice->m_numRefIdx[list]);

- MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
- int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
-
- int mvpIdx = 0;
- int merange = m_param->searchRange;
MotionData* bestME = interMode.bestME[part];

- if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
- {
- uint32_t bestCost = MAX_INT;
- for (int i = 0; i < AMVP_NUM_CANDS; i++)
- {
- MV mvCand = interMode.amvpCand[list][ref][i];
-
- // NOTE: skip mvCand if Y is > merange and -FN>1
- if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
- continue;
-
- interMode.cu.clipMv(mvCand);
-
- Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
- predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
- uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+ MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
+ int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);

- if (bestCost > cost)
- {
- bestCost = cost;
- mvpIdx = i;
- }
- }
- }
+ const MV* amvp = interMode.amvpCand[list][ref];
+ int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref);
+ MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];

- MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
- setSearchRange(interMode.cu, mvp, merange, mvmin, mvmax);
+ setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax);

- int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+ int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);

/* Get total cost of partition, but only include MV bit cost once */
bits += m_me.bitcost(outmv);
uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);

- /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
- checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+ /* Refine MVP selection, updates: mvpIdx, bits, cost */
+ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);

/* tie goes to the smallest ref ID, just like --no-pme */
ScopedLock _lock(master.m_meLock);

}

/* find the best inter prediction for each PU of specified mode */
-void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChromaSA8D)
+void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC)
{
ProfileCUScope(interMode.cu, motionEstimationElapsedTime, countMotionEstimate);

Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;

MergeData merge;
- uint32_t mrgCost;
memset(&merge, 0, sizeof(merge));

for (int puIdx = 0; puIdx < numPart; puIdx++)

m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height);

/* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */
- if (cu.m_partSize[0] != SIZE_2Nx2N)
- {
- mrgCost = mergeEstimation(cu, cuGeom, pu, puIdx, merge);
-
- if (bMergeOnly && mrgCost != MAX_UINT)
- {
- cu.m_mergeFlag[pu.puAbsPartIdx] = true;
- cu.m_mvpIdx[0][pu.puAbsPartIdx] = merge.index; // merge candidate ID is stored in L0 MVP idx
- cu.setPUInterDir(merge.dir, pu.puAbsPartIdx, puIdx);
- cu.setPUMv(0, merge.mvField[0].mv, pu.puAbsPartIdx, puIdx);
- cu.setPURefIdx(0, merge.mvField[0].refIdx, pu.puAbsPartIdx, puIdx);
- cu.setPUMv(1, merge.mvField[1].mv, pu.puAbsPartIdx, puIdx);
- cu.setPURefIdx(1, merge.mvField[1].refIdx, pu.puAbsPartIdx, puIdx);
- totalmebits += merge.bits;
-
- motionCompensation(cu, pu, *predYuv, true, bChromaSA8D);
- continue;
- }
- }
- else
- mrgCost = MAX_UINT;
+ uint32_t mrgCost = numPart == 1 ? MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge);

bestME[0].cost = MAX_UINT;
bestME[1].cost = MAX_UINT;

int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);

- // Pick the best possible MVP from AMVP candidates based on least residual
- int mvpIdx = 0;
- int merange = m_param->searchRange;
+ const MV* amvp = interMode.amvpCand[list][ref];
+ int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
+ MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];

- if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
- {
- uint32_t bestCost = MAX_INT;
- for (int i = 0; i < AMVP_NUM_CANDS; i++)
- {
- MV mvCand = interMode.amvpCand[list][ref][i];
-
- // NOTE: skip mvCand if Y is > merange and -FN>1
- if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
- continue;
-
- cu.clipMv(mvCand);
- predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand);
- uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
-
- if (bestCost > cost)
- {
- bestCost = cost;
- mvpIdx = i;
- }
- }
- }
-
- MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-
- int satdCost;
- setSearchRange(cu, mvp, merange, mvmin, mvmax);
- satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+ setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
+ int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);

/* Get total cost of partition, but only include MV bit cost once */
bits += m_me.bitcost(outmv);
uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);

- /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
- checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+ /* Refine MVP selection, updates: mvpIdx, bits, cost */
+ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);

if (cost < bestME[list].cost)
{

{
processPME(pme, *this);

- singleMotionEstimation(*this, interMode, cuGeom, pu, puIdx, 0, 0); /* L0-0 */
+ singleMotionEstimation(*this, interMode, pu, puIdx, 0, 0); /* L0-0 */

bDoUnidir = false;

int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);

- // Pick the best possible MVP from AMVP candidates based on least residual
- int mvpIdx = 0;
- int merange = m_param->searchRange;
-
- if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
- {
- uint32_t bestCost = MAX_INT;
- for (int i = 0; i < AMVP_NUM_CANDS; i++)
- {
- MV mvCand = interMode.amvpCand[list][ref][i];
-
- // NOTE: skip mvCand if Y is > merange and -FN>1
- if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
- continue;
+ const MV* amvp = interMode.amvpCand[list][ref];
+ int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
+ MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];

- cu.clipMv(mvCand);
- predInterLumaPixel(pu, tmpPredYuv, *slice->m_refPicList[list][ref]->m_reconPic, mvCand);
- uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
-
- if (bestCost > cost)
- {
- bestCost = cost;
- mvpIdx = i;
- }
- }
- }
-
- MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-
- setSearchRange(cu, mvp, merange, mvmin, mvmax);
- int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
+ setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
+ int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv);

/* Get total cost of partition, but only include MV bit cost once */
bits += m_me.bitcost(outmv);
uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits);

- /* Refine MVP selection, updates: mvp, mvpIdx, bits, cost */
- checkBestMVP(interMode.amvpCand[list][ref], outmv, mvp, mvpIdx, bits, cost);
+ /* Refine MVP selection, updates: mvpIdx, bits, cost */
+ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);

if (cost < bestME[list].cost)
{

uint32_t cost = satdCost + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1);

/* refine MVP selection for zero mv, updates: mvp, mvpidx, bits, cost */
- checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvp0, mvpIdx0, bits0, cost);
- checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvp1, mvpIdx1, bits1, cost);
+ mvp0 = checkBestMVP(interMode.amvpCand[0][bestME[0].ref], mvzero, mvpIdx0, bits0, cost);
+ mvp1 = checkBestMVP(interMode.amvpCand[1][bestME[1].ref], mvzero, mvpIdx1, bits1, cost);

if (cost < bidirCost)
{

totalmebits += bestME[1].bits;
}

- motionCompensation(cu, pu, *predYuv, true, bChromaSA8D);
+ motionCompensation(cu, pu, *predYuv, true, bChromaMC);
}
X265_CHECK(interMode.ok(), "inter mode is not ok");
interMode.sa8dBits += totalmebits;

}

/* Check if using an alternative MVP would result in a smaller MVD + signal bits */
-void Search::checkBestMVP(MV* amvpCand, MV mv, MV& mvPred, int& outMvpIdx, uint32_t& outBits, uint32_t& outCost) const
+const MV& Search::checkBestMVP(const MV* amvpCand, const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const
{
- X265_CHECK(amvpCand[outMvpIdx] == mvPred, "checkBestMVP: unexpected mvPred\n");
402
-
403
- int mvpIdx = !outMvpIdx;
404
- MV mvp = amvpCand[mvpIdx];
405
- int diffBits = m_me.bitcost(mv, mvp) - m_me.bitcost(mv, mvPred);
406
+ int diffBits = m_me.bitcost(mv, amvpCand[!mvpIdx]) - m_me.bitcost(mv, amvpCand[mvpIdx]);
407
if (diffBits < 0)
408
{
409
- outMvpIdx = mvpIdx;
410
- mvPred = mvp;
411
+ mvpIdx = !mvpIdx;
412
uint32_t origOutBits = outBits;
413
outBits = origOutBits + diffBits;
414
outCost = (outCost - m_rdCost.getCost(origOutBits)) + m_rdCost.getCost(outBits);
415
}
416
+ return amvpCand[mvpIdx];
417
}
418
419
-void Search::setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const
420
+void Search::setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const
421
{
422
- cu.clipMv(mvp);
423
-
424
MV dist((int16_t)merange << 2, (int16_t)merange << 2);
425
mvmin = mvp - dist;
426
mvmax = mvp + dist;
427
428
uint32_t log2CUSize = cuGeom.log2CUSize;
429
int sizeIdx = log2CUSize - 2;
430
431
- uint32_t tqBypass = cu.m_tqBypass[0];
432
- m_quant.setQPforQuant(interMode.cu);
433
-
434
resiYuv->subtract(*fencYuv, *predYuv, log2CUSize);
435
436
uint32_t tuDepthRange[2];
437
438
Cost costs;
439
estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
440
441
+ uint32_t tqBypass = cu.m_tqBypass[0];
442
if (!tqBypass)
443
{
444
uint32_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
445
446
interMode.coeffBits = coeffBits;
447
interMode.mvBits = bits - coeffBits;
448
updateModeCost(interMode);
449
- checkDQP(interMode.cu, cuGeom);
450
+ checkDQP(interMode, cuGeom);
451
}
452
453
void Search::residualTransformQuantInter(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, const uint32_t depthRange[2])
454
455
}
456
}
457
458
-void Search::checkDQP(CUData& cu, const CUGeom& cuGeom)
459
+void Search::checkDQP(Mode& mode, const CUGeom& cuGeom)
460
{
461
+ CUData& cu = mode.cu;
462
if (cu.m_slice->m_pps->bUseDQP && cuGeom.depth <= cu.m_slice->m_pps->maxCuDQPDepth)
463
{
464
if (cu.getQtRootCbf(0))
465
{
466
- /* When analysing RDO with DQP bits, the entropy encoder should add the cost of DQP bits here
467
- * i.e Encode QP */
468
+ if (m_param->rdLevel >= 3)
469
+ {
470
+ mode.contexts.resetBits();
471
+ mode.contexts.codeDeltaQP(cu, 0);
472
+ uint32_t bits = mode.contexts.getNumberOfWrittenBits();
473
+ mode.mvBits += bits;
474
+ mode.totalBits += bits;
475
+ updateModeCost(mode);
476
+ }
477
+ else if (m_param->rdLevel <= 1)
478
+ {
479
+ mode.sa8dBits++;
480
+ mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits);
481
+ }
482
+ else
483
+ {
484
+ mode.mvBits++;
485
+ mode.totalBits++;
486
+ updateModeCost(mode);
487
+ }
488
}
489
else
490
cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth);
491
}
492
}
493
494
-void Search::checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom)
495
+void Search::checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom)
496
{
497
+ CUData& cu = mode.cu;
498
+
499
if ((cuGeom.depth == cu.m_slice->m_pps->maxCuDQPDepth) && cu.m_slice->m_pps->bUseDQP)
500
{
501
bool hasResidual = false;
502
503
}
504
}
505
if (hasResidual)
506
- /* TODO: Encode QP, and recalculate RD cost of splitPred */
507
+ {
508
+ if (m_param->rdLevel >= 3)
509
+ {
510
+ mode.contexts.resetBits();
511
+ mode.contexts.codeDeltaQP(cu, 0);
512
+ uint32_t bits = mode.contexts.getNumberOfWrittenBits();
513
+ mode.mvBits += bits;
514
+ mode.totalBits += bits;
515
+ updateModeCost(mode);
516
+ }
517
+ else if (m_param->rdLevel <= 1)
518
+ {
519
+ mode.sa8dBits++;
520
+ mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits);
521
+ }
522
+ else
523
+ {
524
+ mode.mvBits++;
525
+ mode.totalBits++;
526
+ updateModeCost(mode);
527
+ }
528
/* For all zero CBF sub-CUs, reset QP to RefQP (so that deltaQP is not signalled).
529
When the non-zero CBF sub-CU is found, stop */
530
cu.setQPSubCUs(cu.getRefQP(0), 0, cuGeom.depth);
531
+ }
532
else
533
/* No residual within this CU or subCU, so reset QP to RefQP */
534
cu.setQPSubParts(cu.getRefQP(0), 0, cuGeom.depth);
535
x265_1.6.tar.gz/source/encoder/search.h -> x265_1.7.tar.gz/source/encoder/search.h
Changed
51
1
2
~Search();
3
4
bool initSearch(const x265_param& param, ScalingList& scalingList);
5
- void setQP(const Slice& slice, int qp);
6
+ int setLambdaFromQP(const CUData& ctu, int qp); /* returns real quant QP in valid spec range */
7
8
// mark temp RD entropy contexts as uninitialized; useful for finding loads without stores
9
void invalidateContexts(int fromDepth);
10
11
void encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom);
12
13
// estimation inter prediction (non-skip)
14
- void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChroma);
15
+ void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC);
16
17
// encode residual and compute rd-cost for inter mode
18
void encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom);
19
20
void getBestIntraModeChroma(Mode& intraMode, const CUGeom& cuGeom);
21
22
/* update CBF flags and QP values to be internally consistent */
23
- void checkDQP(CUData& cu, const CUGeom& cuGeom);
24
- void checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom);
25
+ void checkDQP(Mode& mode, const CUGeom& cuGeom);
26
+ void checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom);
27
28
class PME : public BondedTaskGroup
29
{
30
31
};
32
33
void processPME(PME& pme, Search& slave);
34
- void singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu, int part, int list, int ref);
35
+ void singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref);
36
37
protected:
38
39
40
};
41
42
/* inter/ME helper functions */
43
- void checkBestMVP(MV* amvpCand, MV cMv, MV& mvPred, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
44
- void setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const;
45
+ int selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref);
46
+ const MV& checkBestMVP(const MV amvpCand[2], const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
47
+ void setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const;
48
uint32_t mergeEstimation(CUData& cu, const CUGeom& cuGeom, const PredictionUnit& pu, int puIdx, MergeData& m);
49
static void getBlkBits(PartSize cuMode, bool bPSlice, int puIdx, uint32_t lastMode, uint32_t blockBit[3]);
50
51
x265_1.6.tar.gz/source/encoder/sei.h -> x265_1.7.tar.gz/source/encoder/sei.h
Changed
84
1
2
DECODED_PICTURE_HASH = 132,
3
SCALABLE_NESTING = 133,
4
REGION_REFRESH_INFO = 134,
5
+ MASTERING_DISPLAY_INFO = 137,
6
+ CONTENT_LIGHT_LEVEL_INFO = 144,
7
};
8
9
virtual PayloadType payloadType() const = 0;
10
11
}
12
};
13
14
+class SEIMasteringDisplayColorVolume : public SEI
15
+{
16
+public:
17
+
18
+ uint16_t displayPrimaryX[3];
19
+ uint16_t displayPrimaryY[3];
20
+ uint16_t whitePointX, whitePointY;
21
+ uint32_t maxDisplayMasteringLuminance;
22
+ uint32_t minDisplayMasteringLuminance;
23
+
24
+ PayloadType payloadType() const { return MASTERING_DISPLAY_INFO; }
25
+
26
+ bool parse(const char* value)
27
+ {
28
+ return sscanf(value, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
29
+ &displayPrimaryX[0], &displayPrimaryY[0],
30
+ &displayPrimaryX[1], &displayPrimaryY[1],
31
+ &displayPrimaryX[2], &displayPrimaryY[2],
32
+ &whitePointX, &whitePointY,
33
+ &maxDisplayMasteringLuminance, &minDisplayMasteringLuminance) == 10;
34
+ }
35
+
36
+ void write(Bitstream& bs, const SPS&)
37
+ {
38
+ m_bitIf = &bs;
39
+
40
+ WRITE_CODE(MASTERING_DISPLAY_INFO, 8, "payload_type");
41
+ WRITE_CODE(8 * 2 + 2 * 4, 8, "payload_size");
42
+
43
+ for (uint32_t i = 0; i < 3; i++)
44
+ {
45
+ WRITE_CODE(displayPrimaryX[i], 16, "display_primaries_x[ c ]");
46
+ WRITE_CODE(displayPrimaryY[i], 16, "display_primaries_y[ c ]");
47
+ }
48
+ WRITE_CODE(whitePointX, 16, "white_point_x");
49
+ WRITE_CODE(whitePointY, 16, "white_point_y");
50
+ WRITE_CODE(maxDisplayMasteringLuminance, 32, "max_display_mastering_luminance");
51
+ WRITE_CODE(minDisplayMasteringLuminance, 32, "min_display_mastering_luminance");
52
+ }
53
+};
54
+
55
+class SEIContentLightLevel : public SEI
56
+{
57
+public:
58
+
59
+ uint16_t max_content_light_level;
60
+ uint16_t max_pic_average_light_level;
61
+
62
+ PayloadType payloadType() const { return CONTENT_LIGHT_LEVEL_INFO; }
63
+
64
+ bool parse(const char* value)
65
+ {
66
+ return sscanf(value, "%hu,%hu",
67
+ &max_content_light_level, &max_pic_average_light_level) == 2;
68
+ }
69
+
70
+ void write(Bitstream& bs, const SPS&)
71
+ {
72
+ m_bitIf = &bs;
73
+
74
+ WRITE_CODE(CONTENT_LIGHT_LEVEL_INFO, 8, "payload_type");
75
+ WRITE_CODE(4, 8, "payload_size");
76
+ WRITE_CODE(max_content_light_level, 16, "max_content_light_level");
77
+ WRITE_CODE(max_pic_average_light_level, 16, "max_pic_average_light_level");
78
+ }
79
+};
80
+
81
class SEIDecodedPictureHash : public SEI
82
{
83
public:
84
x265_1.6.tar.gz/source/encoder/slicetype.cpp -> x265_1.7.tar.gz/source/encoder/slicetype.cpp
Changed
341
1
2
3
namespace {
4
5
-inline int16_t median(int16_t a, int16_t b, int16_t c)
6
-{
7
- int16_t t = (a - b) & ((a - b) >> 31);
8
-
9
- a -= t;
10
- b += t;
11
- b -= (b - c) & ((b - c) >> 31);
12
- b += (a - b) & ((a - b) >> 31);
13
- return b;
14
-}
15
-
16
-inline void median_mv(MV &dst, MV a, MV b, MV c)
17
-{
18
- dst.x = median(a.x, b.x, c.x);
19
- dst.y = median(a.y, b.y, c.y);
20
-}
21
-
22
/* Compute variance to derive AC energy of each block */
23
inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane)
24
{
25
26
m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_8x8Width - 2) * (m_8x8Height - 2) : m_8x8Width * m_8x8Height;
27
28
m_lastKeyframe = -m_param->keyframeMax;
29
- memset(m_preframes, 0, sizeof(m_preframes));
30
- m_preTotal = m_preAcquired = m_preCompleted = 0;
31
m_sliceTypeBusy = false;
32
m_fullQueueSize = X265_MAX(1, m_param->lookaheadDepth);
33
m_bAdaptiveQuant = m_param->rc.aqMode || m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred;
34
35
return m_tld && m_scratch;
36
}
37
38
-void Lookahead::stop()
39
+void Lookahead::stopJobs()
40
{
41
if (m_pool && !m_inputQueue.empty())
42
{
43
- m_preLookaheadLock.acquire();
44
+ m_inputLock.acquire();
45
m_isActive = false;
46
bool wait = m_outputSignalRequired = m_sliceTypeBusy;
47
- m_preLookaheadLock.release();
48
+ m_inputLock.release();
49
50
if (wait)
51
m_outputSignal.wait();
52
53
m_filled = true; /* full capacity plus mini-gop lag */
54
}
55
56
- m_preLookaheadLock.acquire();
57
-
58
m_inputLock.acquire();
59
m_inputQueue.pushBack(curFrame);
60
- m_inputLock.release();
61
-
62
- m_preframes[m_preTotal++] = &curFrame;
63
- X265_CHECK(m_preTotal <= X265_LOOKAHEAD_MAX, "prelookahead overflow\n");
64
-
65
- m_preLookaheadLock.release();
66
-
67
- if (m_pool)
68
+ if (m_pool && m_inputQueue.size() >= m_fullQueueSize)
69
tryWakeOne();
70
+ m_inputLock.release();
71
}
72
73
/* Called by API thread */
74
75
m_filled = true;
76
}
77
78
-void Lookahead::findJob(int workerThreadID)
79
+void Lookahead::findJob(int /*workerThreadID*/)
80
{
81
- Frame* preFrame;
82
- bool doDecide;
83
-
84
- if (!m_isActive)
85
- return;
86
-
87
- int tld = workerThreadID;
88
- if (workerThreadID < 0)
89
- tld = m_pool ? m_pool->m_numWorkers : 0;
90
+ bool doDecide;
91
92
- m_preLookaheadLock.acquire();
93
- do
94
- {
95
- preFrame = NULL;
96
- doDecide = false;
97
+ m_inputLock.acquire();
98
+ if (m_inputQueue.size() >= m_fullQueueSize && !m_sliceTypeBusy && m_isActive)
99
+ doDecide = m_sliceTypeBusy = true;
100
+ else
101
+ doDecide = m_helpWanted = false;
102
+ m_inputLock.release();
103
104
- if (m_preTotal > m_preAcquired)
105
- preFrame = m_preframes[m_preAcquired++];
106
- else
107
- {
108
- if (m_preTotal == m_preCompleted)
109
- m_preAcquired = m_preTotal = m_preCompleted = 0;
110
-
111
- /* the worker thread that performs the last pre-lookahead will generally get to run
112
- * slicetypeDecide() */
113
- m_inputLock.acquire();
114
- if (!m_sliceTypeBusy && !m_preTotal && m_inputQueue.size() >= m_fullQueueSize && m_isActive)
115
- doDecide = m_sliceTypeBusy = true;
116
- else
117
- m_helpWanted = false;
118
- m_inputLock.release();
119
- }
120
- m_preLookaheadLock.release();
121
+ if (!doDecide)
122
+ return;
123
124
- if (preFrame)
125
- {
126
- ProfileLookaheadTime(m_preLookaheadElapsedTime, m_countPreLookahead);
127
- ProfileScopeEvent(prelookahead);
128
-
129
- preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
130
- if (m_param->rc.bStatRead && m_param->rc.cuTree && IS_REFERENCED(preFrame))
131
- /* cu-tree offsets were read from stats file */;
132
- else if (m_bAdaptiveQuant)
133
- m_tld[tld].calcAdaptiveQuantFrame(preFrame, m_param);
134
- m_tld[tld].lowresIntraEstimate(preFrame->m_lowres);
135
-
136
- m_preLookaheadLock.acquire(); /* re-acquire for next pass */
137
- m_preCompleted++;
138
- }
139
- else if (doDecide)
140
- {
141
- ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
142
- ProfileScopeEvent(slicetypeDecideEV);
143
+ ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
144
+ ProfileScopeEvent(slicetypeDecideEV);
145
146
- slicetypeDecide();
147
+ slicetypeDecide();
148
149
- m_preLookaheadLock.acquire(); /* re-acquire for next pass */
150
- if (m_outputSignalRequired)
151
- {
152
- m_outputSignal.trigger();
153
- m_outputSignalRequired = false;
154
- }
155
- m_sliceTypeBusy = false;
156
- }
157
+ m_inputLock.acquire();
158
+ if (m_outputSignalRequired)
159
+ {
160
+ m_outputSignal.trigger();
161
+ m_outputSignalRequired = false;
162
}
163
- while (preFrame || doDecide);
164
+ m_sliceTypeBusy = false;
165
+ m_inputLock.release();
166
}
167
168
/* Called by API thread */
169
170
if (out)
171
return out;
172
173
- /* process all pending pre-lookahead frames and run slicetypeDecide() if
174
- * necessary */
175
- findJob(-1);
176
+ findJob(-1); /* run slicetypeDecide() if necessary */
177
178
- m_preLookaheadLock.acquire();
179
- bool wait = m_outputSignalRequired = m_sliceTypeBusy || m_preTotal;
180
- m_preLookaheadLock.release();
181
+ m_inputLock.acquire();
182
+ bool wait = m_outputSignalRequired = m_sliceTypeBusy;
183
+ m_inputLock.release();
184
185
if (wait)
186
m_outputSignal.wait();
187
188
{
189
/* aggregate lowres row satds to CTU resolution */
190
curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCosts[b - p0][p1 - b];
191
- uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0;
192
+ uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0;
193
uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE);
194
uint32_t numCuInHeight = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize;
195
uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height;
196
197
lowresRow = row * scale;
198
for (uint32_t cnt = 0; cnt < scale && lowresRow < heightInLowresCu; lowresRow++, cnt++)
199
{
200
- sum = 0;
201
+ sum = 0; intraSum = 0;
202
lowresCuIdx = lowresRow * widthInLowresCu;
203
for (lowresCol = 0; lowresCol < widthInLowresCu; lowresCol++, lowresCuIdx++)
204
{
205
206
}
207
curFrame->m_lowres.lowresCostForRc[lowresCuIdx] = lowresCuCost;
208
sum += lowresCuCost;
209
+ intraSum += curFrame->m_lowres.intraCost[lowresCuIdx];
210
}
211
curFrame->m_encData->m_rowStat[row].satdForVbv += sum;
212
+ curFrame->m_encData->m_rowStat[row].intraSatdForVbv += intraSum;
213
}
214
}
215
}
216
}
217
218
+void PreLookaheadGroup::processTasks(int workerThreadID)
219
+{
220
+ if (workerThreadID < 0)
221
+ workerThreadID = m_lookahead.m_pool ? m_lookahead.m_pool->m_numWorkers : 0;
222
+ LookaheadTLD& tld = m_lookahead.m_tld[workerThreadID];
223
+
224
+ m_lock.acquire();
225
+ while (m_jobAcquired < m_jobTotal)
226
+ {
227
+ Frame* preFrame = m_preframes[m_jobAcquired++];
228
+ ProfileLookaheadTime(m_lookahead.m_preLookaheadElapsedTime, m_lookahead.m_countPreLookahead);
229
+ ProfileScopeEvent(prelookahead);
230
+ m_lock.release();
231
+
232
+ preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
233
+ if (m_lookahead.m_param->rc.bStatRead && m_lookahead.m_param->rc.cuTree && IS_REFERENCED(preFrame))
234
+ /* cu-tree offsets were read from stats file */;
235
+ else if (m_lookahead.m_bAdaptiveQuant)
236
+ tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param);
237
+ tld.lowresIntraEstimate(preFrame->m_lowres);
238
+ preFrame->m_lowresInit = true;
239
+
240
+ m_lock.acquire();
241
+ }
242
+ m_lock.release();
243
+}
244
+
245
/* called by API thread or worker thread with inputQueueLock acquired */
246
void Lookahead::slicetypeDecide()
247
{
248
- Lowres *frames[X265_LOOKAHEAD_MAX];
249
- Frame *list[X265_LOOKAHEAD_MAX];
250
- int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX);
251
+ PreLookaheadGroup pre(*this);
252
253
+ Lowres* frames[X265_LOOKAHEAD_MAX + X265_BFRAME_MAX + 4];
254
+ Frame* list[X265_BFRAME_MAX + 4];
255
memset(frames, 0, sizeof(frames));
256
memset(list, 0, sizeof(list));
257
+ int maxSearch = X265_MIN(m_param->lookaheadDepth, X265_LOOKAHEAD_MAX);
258
+ maxSearch = X265_MAX(1, maxSearch);
259
+
260
{
261
ScopedLock lock(m_inputLock);
262
+
263
Frame *curFrame = m_inputQueue.first();
264
int j;
265
for (j = 0; j < m_param->bframes + 2; j++)
266
267
{
268
if (!curFrame) break;
269
frames[j + 1] = &curFrame->m_lowres;
270
- X265_CHECK(curFrame->m_lowres.costEst[0][0] > 0, "prelookahead not completed for input picture\n");
271
+
272
+ if (!curFrame->m_lowresInit)
273
+ pre.m_preframes[pre.m_jobTotal++] = curFrame;
274
+
275
curFrame = curFrame->m_next;
276
}
277
278
maxSearch = j;
279
}
280
281
+ /* perform pre-analysis on frames which need it, using a bonded task group */
282
+ if (pre.m_jobTotal)
283
+ {
284
+ if (m_pool)
285
+ pre.tryBondPeers(*m_pool, pre.m_jobTotal);
286
+ pre.processTasks(-1);
287
+ pre.waitForExit();
288
+ }
289
+
290
if (m_lastNonB && !m_param->rc.bStatRead &&
291
((m_param->bFrameAdaptive && m_param->bframes) ||
292
m_param->rc.cuTree || m_param->scenecutThreshold ||
293
294
295
int numc = 0;
296
MV mvc[4], mvp;
297
-
298
MV* fencMV = &fenc->lowresMvs[i][listDist[i]][cuXY];
299
+ ReferencePlanes* fref = i ? fref1 : wfref0;
300
301
/* Reverse-order MV prediction */
302
- mvc[0] = 0;
303
- mvc[2] = 0;
304
#define MVC(mv) mvc[numc++] = mv;
305
if (cuX < widthInCU - 1)
306
MVC(fencMV[1]);
307
308
MVC(fencMV[widthInCU + 1]);
309
}
310
#undef MVC
311
- if (numc <= 1)
312
- mvp = mvc[0];
313
+
314
+ if (!numc)
315
+ mvp = 0;
316
else
317
- median_mv(mvp, mvc[0], mvc[1], mvc[2]);
318
+ {
319
+ ALIGN_VAR_32(pixel, subpelbuf[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]);
320
+ int mvpcost = MotionEstimate::COST_MAX;
321
+
322
+ /* measure SATD cost of each neighbor MV (estimating merge analysis)
323
+ * and use the lowest cost MV as MVP (estimating AMVP). Since all
324
+ * mvc[] candidates are measured here, none are passed to motionEstimate */
325
+ for (int idx = 0; idx < numc; idx++)
326
+ {
327
+ intptr_t stride = X265_LOWRES_CU_SIZE;
328
+ pixel *src = fref->lowresMC(pelOffset, mvc[idx], subpelbuf, stride);
329
+ int cost = tld.me.bufSATD(src, stride);
330
+ COPY2_IF_LT(mvpcost, cost, mvp, mvc[idx]);
331
+ }
332
+ }
333
334
- fencCost = tld.me.motionEstimate(i ? fref1 : wfref0, mvmin, mvmax, mvp, numc, mvc, s_merange, *fencMV);
335
+ /* ME will never return a cost larger than the cost @MVP, so we do not
336
+ * have to check that ME cost is more than the estimated merge cost */
337
+ fencCost = tld.me.motionEstimate(fref, mvmin, mvmax, mvp, 0, NULL, s_merange, *fencMV);
338
COPY2_IF_LT(bcost, fencCost, listused, i + 1);
339
}
340
341
x265_1.6.tar.gz/source/encoder/slicetype.h -> x265_1.7.tar.gz/source/encoder/slicetype.h
Changed
50
1
2
Lock m_outputLock;
3
4
/* pre-lookahead */
5
- Frame* m_preframes[X265_LOOKAHEAD_MAX];
6
- int m_preTotal, m_preAcquired, m_preCompleted;
7
int m_fullQueueSize;
8
bool m_isActive;
9
bool m_sliceTypeBusy;
10
11
bool m_outputSignalRequired;
12
bool m_bBatchMotionSearch;
13
bool m_bBatchFrameCosts;
14
- Lock m_preLookaheadLock;
15
Event m_outputSignal;
16
17
LookaheadTLD* m_tld;
18
19
20
bool create();
21
void destroy();
22
- void stop();
23
+ void stopJobs();
24
25
void addPicture(Frame&, int sliceType);
26
void flush();
27
28
int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b);
29
};
30
31
+class PreLookaheadGroup : public BondedTaskGroup
32
+{
33
+public:
34
+
35
+ Frame* m_preframes[X265_LOOKAHEAD_MAX];
36
+ Lookahead& m_lookahead;
37
+
38
+ PreLookaheadGroup(Lookahead& l) : m_lookahead(l) {}
39
+
40
+ void processTasks(int workerThreadID);
41
+
42
+protected:
43
+
44
+ PreLookaheadGroup& operator=(const PreLookaheadGroup&);
45
+};
46
+
47
class CostEstimateGroup : public BondedTaskGroup
48
{
49
public:
50
x265_1.6.tar.gz/source/input/input.cpp -> x265_1.7.tar.gz/source/input/input.cpp
Changed
10
1
2
3
using namespace x265;
4
5
-Input* Input::open(InputFileInfo& info, bool bForceY4m)
6
+InputFile* InputFile::open(InputFileInfo& info, bool bForceY4m)
7
{
8
const char * s = strrchr(info.filename, '.');
9
10
x265_1.6.tar.gz/source/input/input.h -> x265_1.7.tar.gz/source/input/input.h
Changed
31
1
2
int sarWidth;
3
int sarHeight;
4
int frameCount;
5
+ int timebaseNum;
6
+ int timebaseDenom;
7
8
/* user supplied */
9
int skipFrames;
10
const char *filename;
11
};
12
13
-class Input
14
+class InputFile
15
{
16
protected:
17
18
- virtual ~Input() {}
19
+ virtual ~InputFile() {}
20
21
public:
22
23
- Input() {}
24
+ InputFile() {}
25
26
- static Input* open(InputFileInfo& info, bool bForceY4m);
27
+ static InputFile* open(InputFileInfo& info, bool bForceY4m);
28
29
virtual void startReader() = 0;
30
31
x265_1.6.tar.gz/source/input/y4m.cpp -> x265_1.7.tar.gz/source/input/y4m.cpp
Changed
29
1
2
for (int i = 0; i < QUEUE_SIZE; i++)
3
buf[i] = NULL;
4
5
- readCount.set(0);
6
- writeCount.set(0);
7
-
8
threadActive = false;
9
colorSpace = info.csp;
10
sarWidth = info.sarWidth;
11
12
void Y4MInput::release()
13
{
14
threadActive = false;
15
- readCount.set(readCount.get()); // unblock file reader
16
+ readCount.poke();
17
stop();
18
delete this;
19
}
20
21
while (threadActive);
22
23
threadActive = false;
24
- writeCount.set(writeCount.get()); // unblock readPicture
25
+ writeCount.poke();
26
}
27
28
bool Y4MInput::populateFrameQueue()
29
x265_1.6.tar.gz/source/input/y4m.h -> x265_1.7.tar.gz/source/input/y4m.h
Changed
10
1
2
namespace x265 {
3
// x265 private namespace
4
5
-class Y4MInput : public Input, public Thread
6
+class Y4MInput : public InputFile, public Thread
7
{
8
protected:
9
10
x265_1.6.tar.gz/source/input/yuv.cpp -> x265_1.7.tar.gz/source/input/yuv.cpp
Changed
28
1
2
for (int i = 0; i < QUEUE_SIZE; i++)
3
buf[i] = NULL;
4
5
- readCount.set(0);
6
- writeCount.set(0);
7
depth = info.depth;
8
width = info.width;
9
height = info.height;
10
11
void YUVInput::release()
12
{
13
threadActive = false;
14
- readCount.set(readCount.get()); // unblock read thread
15
+ readCount.poke();
16
stop();
17
delete this;
18
}
19
20
}
21
22
threadActive = false;
23
- writeCount.set(writeCount.get()); // unblock readPicture
24
+ writeCount.poke();
25
}
26
27
bool YUVInput::populateFrameQueue()
28
x265_1.6.tar.gz/source/input/yuv.h -> x265_1.7.tar.gz/source/input/yuv.h
Changed
10
1
2
namespace x265 {
3
// private x265 namespace
4
5
-class YUVInput : public Input, public Thread
6
+class YUVInput : public InputFile, public Thread
7
{
8
protected:
9
10
x265_1.6.tar.gz/source/output/output.cpp -> x265_1.7.tar.gz/source/output/output.cpp
Changed
33
1
2
/*****************************************************************************
3
- * Copyright (C) 2013 x265 project
4
+ * Copyright (C) 2013-2015 x265 project
5
*
6
* Authors: Steve Borho <steve@borho.org>
7
+ * Xinyue Lu <i@7086.in>
8
*
9
* This program is free software; you can redistribute it and/or modify
10
* it under the terms of the GNU General Public License as published by
11
12
#include "yuv.h"
13
#include "y4m.h"
14
15
+#include "raw.h"
16
+
17
using namespace x265;
18
19
-Output* Output::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
20
+ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
21
{
22
const char * s = strrchr(fname, '.');
23
24
25
else
26
return new YUVOutput(fname, width, height, bitdepth, csp);
27
}
28
+
29
+OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo)
30
+{
31
+ return new RAWOutput(fname, inputInfo);
32
+}
33
x265_1.6.tar.gz/source/output/output.h -> x265_1.7.tar.gz/source/output/output.h
Changed
76
1
2
/*****************************************************************************
3
- * Copyright (C) 2013 x265 project
4
+ * Copyright (C) 2013-2015 x265 project
5
*
6
* Authors: Steve Borho <steve@borho.org>
7
+ * Xinyue Lu <i@7086.in>
8
*
9
* This program is free software; you can redistribute it and/or modify
10
* it under the terms of the GNU General Public License as published by
11
12
#define X265_OUTPUT_H
13
14
#include "x265.h"
15
+#include "input/input.h"
16
17
namespace x265 {
18
// private x265 namespace
19
20
-class Output
21
+class ReconFile
22
{
23
protected:
24
25
- virtual ~Output() {}
26
+ virtual ~ReconFile() {}
27
28
public:
29
30
- Output() {}
31
+ ReconFile() {}
32
33
- static Output* open(const char *fname, int width, int height, uint32_t bitdepth,
34
- uint32_t fpsNum, uint32_t fpsDenom, int csp);
35
+ static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth,
36
+ uint32_t fpsNum, uint32_t fpsDenom, int csp);
37
38
virtual bool isFail() const = 0;
39
40
41
42
virtual const char *getName() const = 0;
43
};
44
+
45
+class OutputFile
46
+{
47
+protected:
48
+
49
+ virtual ~OutputFile() {}
50
+
51
+public:
52
+
53
+ OutputFile() {}
54
+
55
+ static OutputFile* open(const char* fname, InputFileInfo& inputInfo);
56
+
57
+ virtual bool isFail() const = 0;
58
+
59
+ virtual bool needPTS() const = 0;
60
+
61
+ virtual void release() = 0;
62
+
63
+ virtual const char* getName() const = 0;
64
+
65
+ virtual void setParam(x265_param* param) = 0;
66
+
67
+ virtual int writeHeaders(const x265_nal* nal, uint32_t nalcount) = 0;
68
+
69
+ virtual int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture& pic) = 0;
70
+
71
+ virtual void closeFile(int64_t largest_pts, int64_t second_largest_pts) = 0;
72
+};
73
}
74
75
#endif // ifndef X265_OUTPUT_H
76
x265_1.7.tar.gz/source/output/raw.cpp
Added
82
1
2
+/*****************************************************************************
3
+ * Copyright (C) 2013-2015 x265 project
4
+ *
5
+ * Authors: Steve Borho <steve@borho.org>
6
+ * Xinyue Lu <i@7086.in>
7
+ *
8
+ * This program is free software; you can redistribute it and/or modify
9
+ * it under the terms of the GNU General Public License as published by
10
+ * the Free Software Foundation; either version 2 of the License, or
11
+ * (at your option) any later version.
12
+ *
13
+ * This program is distributed in the hope that it will be useful,
14
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
15
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16
+ * GNU General Public License for more details.
17
+ *
18
+ * You should have received a copy of the GNU General Public License
19
+ * along with this program; if not, write to the Free Software
20
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
21
+ *
22
+ * This program is also available under a commercial proprietary license.
23
+ * For more information, contact us at license @ x265.com.
24
+ *****************************************************************************/
25
+
26
+#include "raw.h"
27
+
28
+using namespace x265;
29
+using namespace std;
30
+
31
+RAWOutput::RAWOutput(const char* fname, InputFileInfo&)
32
+{
33
+ b_fail = false;
34
+ if (!strcmp(fname, "-"))
35
+ {
36
+ ofs = &cout;
37
+ return;
38
+ }
39
+ ofs = new ofstream(fname, ios::binary | ios::out);
40
+ if (ofs->fail())
41
+ b_fail = true;
42
+}
43
+
44
+void RAWOutput::setParam(x265_param* param)
45
+{
46
+ param->bAnnexB = true;
47
+}
48
+
49
+int RAWOutput::writeHeaders(const x265_nal* nal, uint32_t nalcount)
50
+{
51
+ uint32_t bytes = 0;
52
+
53
+ for (uint32_t i = 0; i < nalcount; i++)
54
+ {
55
+ ofs->write((const char*)nal->payload, nal->sizeBytes);
56
+ bytes += nal->sizeBytes;
57
+ nal++;
58
+ }
59
+
60
+ return bytes;
61
+}
62
+
63
+int RAWOutput::writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&)
64
+{
65
+ uint32_t bytes = 0;
66
+
67
+ for (uint32_t i = 0; i < nalcount; i++)
68
+ {
69
+ ofs->write((const char*)nal->payload, nal->sizeBytes);
70
+ bytes += nal->sizeBytes;
71
+ nal++;
72
+ }
73
+
74
+ return bytes;
75
+}
76
+
77
+void RAWOutput::closeFile(int64_t, int64_t)
78
+{
79
+ if (ofs != &cout)
80
+ delete ofs;
81
+}
82
x265_1.7.tar.gz/source/output/raw.h
Added
+/*****************************************************************************
+ * Copyright (C) 2013-2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_HEVC_RAW_H
+#define X265_HEVC_RAW_H
+
+#include "output.h"
+#include "common.h"
+#include <fstream>
+#include <iostream>
+
+namespace x265 {
+class RAWOutput : public OutputFile
+{
+protected:
+
+    std::ostream* ofs;
+
+    bool b_fail;
+
+public:
+
+    RAWOutput(const char* fname, InputFileInfo&);
+
+    bool isFail() const { return b_fail; }
+
+    bool needPTS() const { return false; }
+
+    void release() { delete this; }
+
+    const char* getName() const { return "raw"; }
+
+    void setParam(x265_param* param);
+
+    int writeHeaders(const x265_nal* nal, uint32_t nalcount);
+
+    int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&);
+
+    void closeFile(int64_t largest_pts, int64_t second_largest_pts);
+};
+}
+
+#endif // ifndef X265_HEVC_RAW_H
x265_1.7.tar.gz/source/output/reconplay.cpp
Added
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "common.h"
+#include "reconplay.h"
+
+#include <signal.h>
+
+using namespace x265;
+
+#if _WIN32
+#define popen  _popen
+#define pclose _pclose
+#define pipemode "wb"
+#else
+#define pipemode "w"
+#endif
+
+bool ReconPlay::pipeValid;
+
+#ifndef _WIN32
+static void sigpipe_handler(int)
+{
+    if (ReconPlay::pipeValid)
+        general_log(NULL, "exec", X265_LOG_ERROR, "pipe closed\n");
+    ReconPlay::pipeValid = false;
+}
+#endif
+
+ReconPlay::ReconPlay(const char* commandLine, x265_param& param)
+{
+#ifndef _WIN32
+    if (signal(SIGPIPE, sigpipe_handler) == SIG_ERR)
+        general_log(&param, "exec", X265_LOG_ERROR, "Unable to register SIGPIPE handler: %s\n", strerror(errno));
+#endif
+
+    width = param.sourceWidth;
+    height = param.sourceHeight;
+    colorSpace = param.internalCsp;
+
+    frameSize = 0;
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+        frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i]));
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+    {
+        poc[i] = -1;
+        CHECKED_MALLOC(frameData[i], pixel, frameSize);
+    }
+
+    outputPipe = popen(commandLine, pipemode);
+    if (outputPipe)
+    {
+        const char* csp = (colorSpace >= X265_CSP_I444) ? "444" : (colorSpace >= X265_CSP_I422) ? "422" : "420";
+        const char* depth = (param.internalBitDepth == 10) ? "p10" : "";
+
+        fprintf(outputPipe, "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n", width, height, param.fpsNum, param.fpsDenom, csp, depth);
+
+        pipeValid = true;
+        threadActive = true;
+        start();
+        return;
+    }
+    else
+        general_log(&param, "exec", X265_LOG_ERROR, "popen(%s) failed\n", commandLine);
+
+fail:
+    threadActive = false;
+}
+
+ReconPlay::~ReconPlay()
+{
+    if (threadActive)
+    {
+        threadActive = false;
+        writeCount.poke();
+        stop();
+    }
+
+    if (outputPipe)
+        pclose(outputPipe);
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+        X265_FREE(frameData[i]);
+}
+
+bool ReconPlay::writePicture(const x265_picture& pic)
+{
+    if (!threadActive || !pipeValid)
+        return false;
+
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = pic.poc % RECON_BUF_SIZE;
+
+    /* TODO: it's probably better to drop recon pictures when the ring buffer is
+     * backed up on the display app */
+    while (written - read > RECON_BUF_SIZE - 2 || poc[currentCursor] != -1)
+    {
+        read = readCount.waitForChange(read);
+        if (!threadActive)
+            return false;
+    }
+
+    X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
+    X265_CHECK(pic.bitDepth == X265_DEPTH, "invalid bit depth\n");
+
+    pixel* buf = frameData[currentCursor];
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+    {
+        char* src = (char*)pic.planes[i];
+        int pwidth = width >> x265_cli_csps[colorSpace].width[i];
+
+        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+        {
+            memcpy(buf, src, pwidth * sizeof(pixel));
+            src += pic.stride[i];
+            buf += pwidth;
+        }
+    }
+
+    poc[currentCursor] = pic.poc;
+    writeCount.incr();
+
+    return true;
+}
+
+void ReconPlay::threadMain()
+{
+    THREAD_NAME("ReconPlayOutput", 0);
+
+    do
+    {
+        /* extract the next output picture in display order and write to pipe */
+        if (!outputFrame())
+            break;
+    }
+    while (threadActive);
+
+    threadActive = false;
+    readCount.poke();
+}
+
+bool ReconPlay::outputFrame()
+{
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = read % RECON_BUF_SIZE;
+
+    while (poc[currentCursor] != read)
+    {
+        written = writeCount.waitForChange(written);
+        if (!threadActive)
+            return false;
+    }
+
+    char* buf = (char*)frameData[currentCursor];
+    intptr_t remainSize = frameSize * sizeof(pixel);
+
+    fprintf(outputPipe, "FRAME\n");
+    while (remainSize > 0)
+    {
+        intptr_t retCount = (intptr_t)fwrite(buf, sizeof(char), remainSize, outputPipe);
+
+        if (retCount < 0 || !pipeValid)
+            /* pipe failure, stop writing and start dropping recon pictures */
+            return false;
+
+        buf += retCount;
+        remainSize -= retCount;
+    }
+
+    poc[currentCursor] = -1;
+    readCount.incr();
+    return true;
+}
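The constructor and outputFrame() above speak the Y4M streaming format: one `YUV4MPEG2` signature line describing geometry, frame rate, interlacing and chroma subsampling, then a bare `FRAME` line ahead of each raw picture. A minimal sketch of that header line, with an illustrative helper name that is not part of x265:

```cpp
#include <cstdio>
#include <string>

// Illustrative helper (not the x265 API): format the Y4M stream header that
// ReconPlay writes to the previewer pipe. The "p10" suffix marks 10-bit
// samples, matching the internalBitDepth check in the constructor above.
static std::string y4mHeader(int w, int h, int fpsNum, int fpsDen,
                             const char* csp, bool tenBit)
{
    char buf[128];
    snprintf(buf, sizeof(buf), "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n",
             w, h, fpsNum, fpsDen, csp, tenBit ? "p10" : "");
    return buf;
}
```

For a 1080p25 10-bit 4:2:0 encode this yields `YUV4MPEG2 W1920 H1080 F25:1 Ip C420p10`, which is exactly what a `--recon-y4m-exec` previewer such as ffplay expects on stdin.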
x265_1.7.tar.gz/source/output/reconplay.h
Added
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_RECONPLAY_H
+#define X265_RECONPLAY_H
+
+#include "x265.h"
+#include "threading.h"
+#include <cstdio>
+
+namespace x265 {
+// private x265 namespace
+
+class ReconPlay : public Thread
+{
+public:
+
+    ReconPlay(const char* commandLine, x265_param& param);
+
+    virtual ~ReconPlay();
+
+    bool writePicture(const x265_picture& pic);
+
+    static bool pipeValid;
+
+protected:
+
+    enum { RECON_BUF_SIZE = 40 };
+
+    FILE*  outputPipe;   /* The output pipe for player */
+    size_t frameSize;    /* size of one frame in pixels */
+    bool   threadActive; /* worker thread is active */
+    int    width;        /* width of frame */
+    int    height;       /* height of frame */
+    int    colorSpace;   /* color space of frame */
+
+    int    poc[RECON_BUF_SIZE];
+    pixel* frameData[RECON_BUF_SIZE];
+
+    /* Note that the class uses read and write counters to signal that reads and
+     * writes have occurred in the ring buffer, but writes into the buffer
+     * happen in decode order and the reader must check that the POC it next
+     * needs to send to the pipe is in fact present. The counters are used to
+     * prevent the writer from getting too far ahead of the reader */
+    ThreadSafeInteger readCount;
+    ThreadSafeInteger writeCount;
+
+    void threadMain();
+    bool outputFrame();
+};
+}
+
+#endif // ifndef X265_RECONPLAY_H
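The read/write counter scheme documented in the header's comment block can be reduced to a single-threaded miniature. Everything here is illustrative (the names, the buffer size, and the blocking waits turned into `tryWrite`/`tryRead` booleans); the real class uses `ThreadSafeInteger` waits between two threads:

```cpp
#include <array>

// Miniature sketch of ReconPlay's ring-buffer bookkeeping (hypothetical names,
// not the x265 types). Pictures arrive in encode order but must leave in
// display (POC) order; each slot is keyed by poc % SIZE, and poc == -1 marks
// an empty slot.
constexpr int SIZE = 8;

struct MiniRecon
{
    std::array<int, SIZE> poc;
    int readCount = 0;   // next POC the reader will emit
    int writeCount = 0;  // number of pictures written so far

    MiniRecon() { poc.fill(-1); }

    bool tryWrite(int picPoc)
    {
        int cursor = picPoc % SIZE;
        // writer must not lap the reader or reuse an occupied slot
        if (writeCount - readCount > SIZE - 2 || poc[cursor] != -1)
            return false;
        poc[cursor] = picPoc;
        writeCount++;
        return true;
    }

    bool tryRead(int* outPoc)
    {
        int cursor = readCount % SIZE;
        if (poc[cursor] != readCount) // next display-order POC not present yet
            return false;
        *outPoc = poc[cursor];
        poc[cursor] = -1;
        readCount++;
        return true;
    }
};
```

The `-1` sentinel plays the same role as in ReconPlay: a slot becomes reusable only after the reader drains it, so encode-order arrivals are serialized back into display order.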
x265_1.6.tar.gz/source/output/y4m.h -> x265_1.7.tar.gz/source/output/y4m.h
Changed
namespace x265 {
// private x265 namespace

-class Y4MOutput : public Output
+class Y4MOutput : public ReconFile
{
protected:
x265_1.6.tar.gz/source/output/yuv.h -> x265_1.7.tar.gz/source/output/yuv.h
Changed
namespace x265 {
// private x265 namespace

-class YUVOutput : public Output
+class YUVOutput : public ReconFile
{
protected:
x265_1.6.tar.gz/source/test/ipfilterharness.cpp -> x265_1.7.tar.gz/source/test/ipfilterharness.cpp
Changed
    }
}

-bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp)
-{
-    intptr_t rand_srcStride;
-    int min_size = isChroma ? 2 : 4;
-    int max_size = isChroma ? (MAX_CU_SIZE >> 1) : MAX_CU_SIZE;
-
-    if (isChroma && (csp == X265_CSP_I444))
-    {
-        min_size = 4;
-        max_size = MAX_CU_SIZE;
-    }
-
-    for (int i = 0; i < ITERS; i++)
-    {
-        int index = i % TEST_CASES;
-        int rand_height = (int16_t)rand() % 100;
-        int rand_width = (int16_t)rand() % 100;
-
-        rand_srcStride = rand_width + rand() % 100;
-        if (rand_srcStride < rand_width)
-            rand_srcStride = rand_width;
-
-        rand_width &= ~(min_size - 1);
-        rand_width = x265_clip3(min_size, max_size, rand_width);
-
-        rand_height &= ~(min_size - 1);
-        rand_height = x265_clip3(min_size, max_size, rand_height);
-
-        ref(pixel_test_buff[index],
-            rand_srcStride,
-            IPF_C_output_s,
-            rand_width,
-            rand_height);
-
-        checked(opt, pixel_test_buff[index],
-                rand_srcStride,
-                IPF_vec_output_s,
-                rand_width,
-                rand_height);
-
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
-            return false;
-
-        reportfail();
-    }
-
-    return true;
-}
-
bool IPFilterHarness::check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt)
{
    intptr_t rand_srcStride, rand_dstStride;

    {
        intptr_t rand_srcStride = rand() % 100;
        int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;

-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);

-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);

-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
            return false;

        reportfail();

    {
        intptr_t rand_srcStride = rand() % 100;
        int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;

-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);

-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);

-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
            return false;

        reportfail();

bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
{
-    if (opt.luma_p2s)
-    {
-        // last parameter does not matter in case of luma
-        if (!check_IPFilter_primitive(ref.luma_p2s, opt.luma_p2s, 0, 1))
-        {
-            printf("luma_p2s failed\n");
-            return false;
-        }
-    }

    for (int value = 0; value < NUM_PU_SIZES; value++)
    {

                return false;
            }
        }
-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
        {
-            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s))
+            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s))
            {
-                printf("filter_p2s[%s]", lumaPartStr[value]);
+                printf("convert_p2s[%s]", lumaPartStr[value]);
                return false;
            }
        }

    for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
    {
-        if (opt.chroma[csp].p2s)
-        {
-            if (!check_IPFilter_primitive(ref.chroma[csp].p2s, opt.chroma[csp].p2s, 1, csp))
-            {
-                printf("chroma_p2s[%s]", x265_source_csp_names[csp]);
-                return false;
-            }
-        }
        for (int value = 0; value < NUM_PU_SIZES; value++)
        {
            if (opt.chroma[csp].pu[value].filter_hpp)

                    return false;
                }
            }
-            if (opt.chroma[csp].pu[value].chroma_p2s)
+            if (opt.chroma[csp].pu[value].p2s)
            {
-                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s))
+                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s))
                {
                    printf("chroma_p2s[%s]", chromaPartStr[csp][value]);
                    return false;

void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
{
-    int height = 64;
-    int width = 64;
    int16_t srcStride = 96;
    int16_t dstStride = 96;
    int maxVerticalfilterHalfDistance = 3;

-    if (opt.luma_p2s)
-    {
-        printf("luma_p2s\t");
-        REPORT_SPEEDUP(opt.luma_p2s, ref.luma_p2s,
-                       pixel_buff, srcStride, IPF_vec_output_s, width, height);
-    }
-
    for (int value = 0; value < NUM_PU_SIZES; value++)
    {
        if (opt.pu[value].luma_hpp)

                pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
        }

-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
        {
-            printf("filter_p2s [%s]\t", lumaPartStr[value]);
-            REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s);
+            printf("convert_p2s[%s]\t", lumaPartStr[value]);
+            REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s,
+                           pixel_buff, srcStride,
+                           IPF_vec_output_s, dstStride);
        }
    }

    for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
    {
        printf("= Color Space %s =\n", x265_source_csp_names[csp]);
-        if (opt.chroma[csp].p2s)
-        {
-            printf("chroma_p2s\t");
-            REPORT_SPEEDUP(opt.chroma[csp].p2s, ref.chroma[csp].p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s, width, height);
-        }
        for (int value = 0; value < NUM_PU_SIZES; value++)
        {
            if (opt.chroma[csp].pu[value].filter_hpp)

                    short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
                    IPF_vec_output_s, dstStride, 1);
            }
-
-            if (opt.chroma[csp].pu[value].chroma_p2s)
+            if (opt.chroma[csp].pu[value].p2s)
            {
                printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]);
-                REPORT_SPEEDUP(opt.chroma[csp].pu[value].chroma_p2s, ref.chroma[csp].pu[value].chroma_p2s,
-                               pixel_buff, srcStride,
-                               IPF_vec_output_s);
+                REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s, ref.chroma[csp].pu[value].p2s,
+                               pixel_buff, srcStride, IPF_vec_output_s, dstStride);
            }
        }
    }
}
x265_1.6.tar.gz/source/test/ipfilterharness.h -> x265_1.7.tar.gz/source/test/ipfilterharness.h
Changed
    pixel pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
    int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];

-    bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp);
    bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
    bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
    bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt);
x265_1.6.tar.gz/source/test/pixelharness.cpp -> x265_1.7.tar.gz/source/test/pixelharness.cpp
Changed
497
1
2
return true;
3
}
4
5
-bool PixelHarness::check_scale_pp(scale_t ref, scale_t opt)
6
+bool PixelHarness::check_scale1D_pp(scale1D_t ref, scale1D_t opt)
7
+{
8
+ ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
9
+ ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
10
+
11
+ memset(ref_dest, 0, sizeof(ref_dest));
12
+ memset(opt_dest, 0, sizeof(opt_dest));
13
+
14
+ int j = 0;
15
+ for (int i = 0; i < ITERS; i++)
16
+ {
17
+ int index = i % TEST_CASES;
18
+ checked(opt, opt_dest, pixel_test_buff[index] + j);
19
+ ref(ref_dest, pixel_test_buff[index] + j);
20
+
21
+ if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
22
+ return false;
23
+
24
+ reportfail();
25
+ j += INCR;
26
+ }
27
+
28
+ return true;
29
+}
30
+
31
+bool PixelHarness::check_scale2D_pp(scale2D_t ref, scale2D_t opt)
32
{
33
ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
34
ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
35
36
37
bool PixelHarness::check_calSign(sign_t ref, sign_t opt)
38
{
39
- ALIGN_VAR_16(int8_t, ref_dest[64 * 64]);
40
- ALIGN_VAR_16(int8_t, opt_dest[64 * 64]);
41
+ ALIGN_VAR_16(int8_t, ref_dest[64 * 2]);
42
+ ALIGN_VAR_16(int8_t, opt_dest[64 * 2]);
43
44
memset(ref_dest, 0xCD, sizeof(ref_dest));
45
memset(opt_dest, 0xCD, sizeof(opt_dest));
46
47
48
for (int i = 0; i < ITERS; i++)
49
{
50
- int width = 16 * (rand() % 4 + 1);
51
+ int width = (rand() % 64) + 1;
52
53
ref(ref_dest, pbuf2 + j, pbuf3 + j, width);
54
checked(opt, opt_dest, pbuf2 + j, pbuf3 + j, width);
55
56
- if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int8_t)))
57
+ if (memcmp(ref_dest, opt_dest, sizeof(ref_dest)))
58
return false;
59
60
reportfail();
61
62
for (int i = 0; i < ITERS; i++)
63
{
64
int width = 16 * (rand() % 4 + 1);
65
- int8_t sign = rand() % 3;
66
- if (sign == 2)
67
- sign = -1;
68
+ int stride = width + 1;
69
70
- ref(ref_dest, psbuf1 + j, width, sign);
71
- checked(opt, opt_dest, psbuf1 + j, width, sign);
72
+ ref(ref_dest, psbuf1 + j, width, psbuf2 + j, stride);
73
+ checked(opt, opt_dest, psbuf1 + j, width, psbuf5 + j, stride);
74
75
if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
76
return false;
77
78
return true;
79
}
80
81
-bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt)
82
+bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref[2], saoCuOrgE2_t opt[2])
83
+{
84
+ ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
85
+ ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
86
+
87
+ memset(ref_dest, 0xCD, sizeof(ref_dest));
88
+ memset(opt_dest, 0xCD, sizeof(opt_dest));
89
+
90
+ for (int id = 0; id < 2; id++)
91
+ {
92
+ int j = 0;
93
+ if (opt[id])
94
+ {
95
+ for (int i = 0; i < ITERS; i++)
96
+ {
97
+ int width = 16 * (1 << (id * (rand() % 2 + 1))) - (rand() % 2);
98
+ int stride = width + 1;
99
+
100
+ ref[width > 16](ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
101
+ checked(opt[width > 16], opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
102
+
103
+ if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
104
+ return false;
105
+
106
+ if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
107
+ return false;
108
+
109
+ reportfail();
110
+ j += INCR;
111
+ }
112
+ }
113
+ }
114
+
115
+ return true;
116
+}
117
+
118
+bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
119
{
120
ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
121
ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
122
123
124
for (int i = 0; i < ITERS; i++)
125
{
126
- int width = 16 * (rand() % 4 + 1);
127
- int stride = width + 1;
128
-
129
- ref(ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
130
- checked(opt, opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
131
+ int stride = 16 * (rand() % 4 + 1);
132
+ int start = rand() % 2;
133
+ int end = 16 - rand() % 2;
134
135
- if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
136
- return false;
137
+ ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
138
+ checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
139
140
- if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
141
+ if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)) || memcmp(psbuf2, psbuf5, BUFFSIZE))
142
return false;
143
144
reportfail();
145
146
return true;
147
}
148
149
-bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
150
+bool PixelHarness::check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
151
{
152
ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
153
ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
154
155
156
for (int i = 0; i < ITERS; i++)
157
{
158
- int stride = 16 * (rand() % 4 + 1);
159
+ int stride = 32 * (rand() % 2 + 1);
160
int start = rand() % 2;
161
- int end = (16 * (rand() % 4 + 1)) - rand() % 2;
162
+ int end = (32 * (rand() % 2 + 1)) - rand() % 2;
163
164
ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
165
checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
166
167
168
memset(ref_dest, 0xCD, sizeof(ref_dest));
169
memset(opt_dest, 0xCD, sizeof(opt_dest));
170
-
171
- int width = 16 + rand() % 48;
172
- int height = 16 + rand() % 48;
173
+ int width = 32 + rand() % 32;
174
+ int height = 32 + rand() % 32;
175
intptr_t srcStride = 64;
176
intptr_t dstStride = width;
177
int j = 0;
178
179
for (int i = 0; i < ITERS; i++)
180
{
181
int width = 16 * (rand() % 4 + 1);
182
- int height = rand() % 64 +1;
183
- int stride = rand() % 65;
184
+ int height = rand() % 63 + 2;
185
+ int stride = width;
186
187
ref(ref_dest, psbuf1 + j, width, height, stride);
188
checked(opt, opt_dest, psbuf1 + j, width, height, stride);
189
190
return true;
191
}
192
193
-bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt)
194
+bool PixelHarness::check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt)
195
{
196
ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
197
uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM]; // value range[0, 16]
198
199
for (int i = 0; i < 32 * 32; i++)
200
{
201
ref_src[i] = rand() & SHORT_MAX;
202
+
203
+ // more zero coeff
204
+ if (ref_src[i] < SHORT_MAX * 2 / 3)
205
+ ref_src[i] = 0;
206
+
207
+ // more negtive
208
+ if ((rand() % 10) < 8)
209
+ ref_src[i] *= -1;
210
totalCoeffs += (ref_src[i] != 0);
211
}
212
213
214
for (int j = 0; j < 1 << (2 * (rand_scan_size + 2)); j++)
215
rand_numCoeff += (ref_src[i + j] != 0);
216
217
+ // at least one coeff in transform block
218
+ if (rand_numCoeff == 0)
219
+ {
220
+ ref_src[i + (1 << (2 * (rand_scan_size + 2))) - 1] = -1;
221
+ rand_numCoeff = 1;
222
+ }
223
+
224
+ const int trSize = (1 << (rand_scan_size + 2));
225
const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size];
226
+ const uint16_t* const scanTblCG4x4 = g_scan4x4[rand_scan_size <= (MDCS_LOG2_MAX_SIZE - 2) ? rand_scan_type : SCAN_DIAG];
227
228
- int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff);
229
- int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff);
230
+ int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff, scanTblCG4x4, trSize);
231
+ int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff, scanTblCG4x4, trSize);
232
233
if (ref_scanPos != opt_scanPos)
234
return false;
235
236
rand_numCoeff -= ref_coeffNum[j];
237
}
238
239
+ if (rand_numCoeff != 0)
240
+ return false;
241
+
242
+ reportfail();
243
+ }
244
+
245
+ return true;
246
+}
247
+
248
+bool PixelHarness::check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt)
249
+{
250
+ ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
251
+
252
+ for (int i = 0; i < 32 * 32; i++)
253
+ {
254
+ ref_src[i] = rand() & SHORT_MAX;
255
+ }
256
+
257
+ // extra test area all of 0x1234
258
+ for (int i = 0; i < ITERS * 2; i++)
259
+ {
260
+ ref_src[32 * 32 + i] = 0x1234;
261
+ }
262
+
263
+ for (int i = 0; i < ITERS; i++)
264
+ {
265
+ int rand_scan_type = rand() % NUM_SCAN_TYPE;
266
+ int rand_scan_size = (rand() % NUM_SCAN_SIZE) + 2;
267
+ coeff_t *rand_src = ref_src + i;
268
+
269
+ const uint16_t* const scanTbl = g_scan4x4[rand_scan_type];
270
+
271
+ int j;
272
+ for (j = 0; j < SCAN_SET_SIZE; j++)
273
+ {
274
+ const uint32_t idxY = j / MLS_CG_SIZE;
275
+ const uint32_t idxX = j % MLS_CG_SIZE;
276
+ if (rand_src[idxY * rand_scan_size + idxX]) break;
277
+ }
278
+
279
+ // fill one coeff when all coeff group are zero
280
+ if (j >= SCAN_SET_SIZE)
281
+ rand_src[0] = 0x0BAD;
282
+
283
+ uint32_t ref_scanPos = ref(rand_src, (1 << rand_scan_size), scanTbl);
284
+ uint32_t opt_scanPos = (int)checked(opt, rand_src, (1 << rand_scan_size), scanTbl);
285
+
286
+ if (ref_scanPos != opt_scanPos)
287
+ return false;
288
+
289
reportfail();
290
}
291
292
293
return false;
294
}
295
}
296
+ if (opt.chroma[i].cu[part].sa8d)
297
+ {
298
+ if (!check_pixelcmp(ref.chroma[i].cu[part].sa8d, opt.chroma[i].cu[part].sa8d))
299
+ {
300
+ printf("chroma_sa8d[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]);
301
+ return false;
302
+ }
303
+ }
304
}
305
}
306
307
308
309
if (opt.scale1D_128to64)
310
{
311
- if (!check_scale_pp(ref.scale1D_128to64, opt.scale1D_128to64))
312
+ if (!check_scale1D_pp(ref.scale1D_128to64, opt.scale1D_128to64))
313
{
314
printf("scale1D_128to64 failed!\n");
315
return false;
316
317
318
if (opt.scale2D_64to32)
319
{
320
- if (!check_scale_pp(ref.scale2D_64to32, opt.scale2D_64to32))
321
+ if (!check_scale2D_pp(ref.scale2D_64to32, opt.scale2D_64to32))
322
{
323
printf("scale2D_64to32 failed!\n");
324
return false;
325
326
}
327
}
328
329
- if (opt.saoCuOrgE2)
330
+ if (opt.saoCuOrgE1_2Rows)
331
+ {
332
+ if (!check_saoCuOrgE1_t(ref.saoCuOrgE1_2Rows, opt.saoCuOrgE1_2Rows))
333
+ {
334
+ printf("SAO_EO_1_2Rows failed\n");
335
+ return false;
336
+ }
337
+ }
338
+
339
+ if (opt.saoCuOrgE2[0] || opt.saoCuOrgE2[1])
340
+ {
341
+ saoCuOrgE2_t ref1[] = { ref.saoCuOrgE2[0], ref.saoCuOrgE2[1] };
342
+ saoCuOrgE2_t opt1[] = { opt.saoCuOrgE2[0], opt.saoCuOrgE2[1] };
343
+
344
+ if (!check_saoCuOrgE2_t(ref1, opt1))
345
+ {
346
+ printf("SAO_EO_2[0] && SAO_EO_2[1] failed\n");
347
+ return false;
348
+ }
349
+ }
350
+
351
+ if (opt.saoCuOrgE3[0])
352
{
353
- if (!check_saoCuOrgE2_t(ref.saoCuOrgE2, opt.saoCuOrgE2))
354
+ if (!check_saoCuOrgE3_t(ref.saoCuOrgE3[0], opt.saoCuOrgE3[0]))
355
{
356
- printf("SAO_EO_2 failed\n");
357
+ printf("SAO_EO_3[0] failed\n");
358
return false;
359
}
360
}
361
362
- if (opt.saoCuOrgE3)
363
+ if (opt.saoCuOrgE3[1])
364
{
365
- if (!check_saoCuOrgE3_t(ref.saoCuOrgE3, opt.saoCuOrgE3))
366
+ if (!check_saoCuOrgE3_32_t(ref.saoCuOrgE3[1], opt.saoCuOrgE3[1]))
367
{
368
- printf("SAO_EO_3 failed\n");
369
+ printf("SAO_EO_3[1] failed\n");
370
return false;
371
}
372
}
373
374
}
375
}
376
377
- if (opt.findPosLast)
378
+ if (opt.scanPosLast)
379
{
380
- if (!check_findPosLast(ref.findPosLast, opt.findPosLast))
381
+ if (!check_scanPosLast(ref.scanPosLast, opt.scanPosLast))
382
{
383
- printf("findPosLast failed!\n");
384
+ printf("scanPosLast failed!\n");
385
+ return false;
386
+ }
387
+ }
388
+
389
+ if (opt.findPosFirstLast)
390
+ {
391
+ if (!check_findPosFirstLast(ref.findPosFirstLast, opt.findPosFirstLast))
392
+ {
393
+ printf("findPosFirstLast failed!\n");
394
return false;
395
}
396
}
397
398
HEADER("[%s] add_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
399
REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps, ref.chroma[i].cu[part].add_ps, pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE);
400
}
401
+ if (opt.chroma[i].cu[part].sa8d)
402
+ {
403
+ HEADER("[%s] sa8d[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
404
+ REPORT_SPEEDUP(opt.chroma[i].cu[part].sa8d, ref.chroma[i].cu[part].sa8d, pbuf1, STRIDE, pbuf2, STRIDE);
405
+ }
406
}
407
}
408
409
410
if (opt.scale1D_128to64)
411
{
412
HEADER0("scale1D_128to64");
413
- REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1, 64);
414
+ REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1);
415
}
416
417
if (opt.scale2D_64to32)
418
419
if (opt.saoCuOrgE0)
420
{
421
HEADER0("SAO_EO_0");
422
- REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, 1);
423
+ REPORT_SPEEDUP(opt.saoCuOrgE0, ref.saoCuOrgE0, pbuf1, psbuf1, 64, psbuf2, 64);
424
}
425
426
if (opt.saoCuOrgE1)
427
428
REPORT_SPEEDUP(opt.saoCuOrgE1, ref.saoCuOrgE1, pbuf1, psbuf2, psbuf1, 64, 64);
429
}
430
431
- if (opt.saoCuOrgE2)
432
+ if (opt.saoCuOrgE1_2Rows)
433
{
434
- HEADER0("SAO_EO_2");
435
- REPORT_SPEEDUP(opt.saoCuOrgE2, ref.saoCuOrgE2, pbuf1, psbuf1, psbuf2, psbuf3, 64, 64);
436
+ HEADER0("SAO_EO_1_2Rows");
437
+ REPORT_SPEEDUP(opt.saoCuOrgE1_2Rows, ref.saoCuOrgE1_2Rows, pbuf1, psbuf2, psbuf1, 64, 64);
438
}
439
440
- if (opt.saoCuOrgE3)
441
+ if (opt.saoCuOrgE2[0])
442
{
443
- HEADER0("SAO_EO_3");
444
- REPORT_SPEEDUP(opt.saoCuOrgE3, ref.saoCuOrgE3, pbuf1, psbuf2, psbuf1, 64, 0, 64);
445
+ HEADER0("SAO_EO_2[0]");
446
+ REPORT_SPEEDUP(opt.saoCuOrgE2[0], ref.saoCuOrgE2[0], pbuf1, psbuf1, psbuf2, psbuf3, 16, 64);
447
+ }
448
+
449
+ if (opt.saoCuOrgE2[1])
450
+ {
451
+ HEADER0("SAO_EO_2[1]");
452
+ REPORT_SPEEDUP(opt.saoCuOrgE2[1], ref.saoCuOrgE2[1], pbuf1, psbuf1, psbuf2, psbuf3, 64, 64);
453
+ }
454
+
455
+ if (opt.saoCuOrgE3[0])
456
+ {
457
+ HEADER0("SAO_EO_3[0]");
458
+ REPORT_SPEEDUP(opt.saoCuOrgE3[0], ref.saoCuOrgE3[0], pbuf1, psbuf2, psbuf1, 64, 0, 16);
459
+ }
460
+
461
+ if (opt.saoCuOrgE3[1])
462
+ {
463
+ HEADER0("SAO_EO_3[1]");
464
+ REPORT_SPEEDUP(opt.saoCuOrgE3[1], ref.saoCuOrgE3[1], pbuf1, psbuf2, psbuf1, 64, 0, 64);
465
}
466
467
if (opt.saoCuOrgB0)
468
469
REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80);
470
}
471
472
- if (opt.findPosLast)
473
+ if (opt.scanPosLast)
474
{
475
- HEADER0("findPosLast");
476
+ HEADER0("scanPosLast");
477
coeff_t coefBuf[32 * 32];
478
memset(coefBuf, 0, sizeof(coefBuf));
479
memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t));
480
- REPORT_SPEEDUP(opt.findPosLast, ref.findPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32);
481
+ REPORT_SPEEDUP(opt.scanPosLast, ref.scanPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32, g_scan4x4[SCAN_DIAG], 32);
482
+ }
483
+
484
+ if (opt.findPosFirstLast)
485
+ {
486
+ HEADER0("findPosFirstLast");
487
+ coeff_t coefBuf[32 * MLS_CG_SIZE];
488
+ memset(coefBuf, 0, sizeof(coefBuf));
489
+ // no CG may be all zeros!
490
+ coefBuf[3 + 0 * 32] = 0x0BAD;
491
+ coefBuf[3 + 1 * 32] = 0x0BAD;
492
+ coefBuf[3 + 2 * 32] = 0x0BAD;
493
+ coefBuf[3 + 3 * 32] = 0x0BAD;
494
+ REPORT_SPEEDUP(opt.findPosFirstLast, ref.findPosFirstLast, coefBuf, 32, g_scan4x4[SCAN_DIAG]);
495
}
496
}
497
x265_1.6.tar.gz/source/test/pixelharness.h -> x265_1.7.tar.gz/source/test/pixelharness.h
Changed
bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt);
bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt);
bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt);
-bool check_scale_pp(scale_t ref, scale_t opt);
+bool check_scale1D_pp(scale1D_t ref, scale1D_t opt);
+bool check_scale2D_pp(scale2D_t ref, scale2D_t opt);
bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt);
bool check_calresidual(calcresidual_t ref, calcresidual_t opt);

bool check_addAvg(addAvg_t, addAvg_t);
bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt);
bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt);
-bool check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt);
+bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
bool check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
+bool check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
bool check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt);
bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);

bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
bool check_calSign(sign_t ref, sign_t opt);
-bool check_findPosLast(findPosLast_t ref, findPosLast_t opt);
+bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
+bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);

public:
x265_1.6.tar.gz/source/test/rate-control-tests.txt -> x265_1.7.tar.gz/source/test/rate-control-tests.txt
Changed
72
1
2
-# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
3
-
4
-# This test is listed first since it currently reproduces bugs
5
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
6
-
7
-# VBV tests, non-deterministic so testing for correctness and bitrate
8
-# fluctuations - up to 1% bitrate fluctuation is allowed between runs
9
-RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
10
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
11
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
12
-112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
13
-112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
14
-112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
15
-112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
16
-112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
17
-112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
18
-112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
19
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
20
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
21
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
22
-big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
23
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
24
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
25
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
26
-big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
27
-
28
-# multi-pass rate control tests
29
-big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
30
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
31
-112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
32
-112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
33
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
34
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
35
-RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
36
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
37
+
38
+#These tests should yeild deterministic results
39
+# This test is listed first since it currently reproduces bugs
40
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
41
+fire_1920x1080_30.yuv, --preset slow --bitrate 2000 --tune zero-latency
42
+
43
+
44
+# VBV tests, non-deterministic so testing for correctness and bitrate
45
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
46
+night_cars_1920x1080_30.yuv,--preset medium --crf 25 --vbv-bufsize 5000 --vbv-maxrate 5000 -F6 --crf-max 34 --crf-min 22
47
+ducks_take_off_420_720p50.y4m,--preset slow --bitrate 1600 --vbv-bufsize 1600 --vbv-maxrate 1600 --strict-cbr --aq-mode 2 --aq-strength 0.5
48
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryslow --bitrate 4000 --vbv-bufsize 3000 --vbv-maxrate 4000 --tune grain
49
+fire_1920x1080_30.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud --pmode --tune ssim
50
+112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
51
+Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
52
+Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
53
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
54
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
55
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000 --tune grain
56
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
57
+sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd --crf-max 30
58
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
59
+
60
+
61
+
62
+# multi-pass rate control tests
63
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
64
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
65
+112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
66
+pine_tree_1920x1080_30.yuv,--preset veryfast --crf 12 --pass 1 -F4,--preset faster --bitrate 4000 --pass 2 -F4
67
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafast --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 8000 --strict-cbr -F4 --pass 1, --tune grain --preset ultrafast --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 8000 -F4 --pass 2
68
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
69
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
70
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500 --vbv-bufsize 700 --pass 2 -F4
71
+
72
x265_1.6.tar.gz/source/test/regression-tests.txt -> x265_1.7.tar.gz/source/test/regression-tests.txt
Changed
64
1
2
# not auto-detected.
3
4
BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
5
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
6
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 32
7
BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
8
-BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
9
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16
10
BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
11
BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
12
BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
13
14
CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
15
CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
16
CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
17
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
18
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
19
CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
20
CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
21
CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
22
23
CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
24
CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
25
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
26
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
27
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
28
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32
29
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
30
DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
31
DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
32
DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
33
34
Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
35
KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
36
KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
37
-KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
38
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16
39
KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
40
NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
41
NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
42
-News-4k.y4m,--preset medium --tune ssim --no-sao
43
+News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 32
44
News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
45
OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
46
OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
47
48
parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
49
silent_cif_420.y4m,--preset medium --me full --rect --amp
50
silent_cif_420.y4m,--preset superfast --weightp --rect
51
-silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
52
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao --qg-size 16
53
vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
54
-vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
55
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
56
vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
57
washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
58
washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
59
-washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
60
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32
61
washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
62
washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
63
washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
64
x265_1.6.tar.gz/source/test/smoke-tests.txt -> x265_1.7.tar.gz/source/test/smoke-tests.txt
Changed
# List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness

+# consider VBV tests a failure if new bitrate is more than 5% different
+# from the old bitrate
+# vbv-tolerance = 0.05
+
big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
-washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme --qg-size 16
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
-old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
x265_1.6.tar.gz/source/test/testbench.cpp -> x265_1.7.tar.gz/source/test/testbench.cpp
Changed
{ "AVX", X265_CPU_AVX },
{ "XOP", X265_CPU_XOP },
{ "AVX2", X265_CPU_AVX2 },
+ { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
{ "", 0 },
};
x265_1.6.tar.gz/source/x265.cpp -> x265_1.7.tar.gz/source/x265.cpp
Changed
556
1
2
3
#include "input/input.h"
4
#include "output/output.h"
5
+#include "output/reconplay.h"
6
#include "filters/filters.h"
7
#include "common.h"
8
#include "param.h"
9
10
#include <string>
11
#include <ostream>
12
#include <fstream>
13
+#include <queue>
14
15
+#define CONSOLE_TITLE_SIZE 200
16
#ifdef _WIN32
17
#include <windows.h>
18
+static char orgConsoleTitle[CONSOLE_TITLE_SIZE] = "";
19
#else
20
#define GetConsoleTitle(t, n)
21
#define SetConsoleTitle(t)
22
+#define SetThreadExecutionState(es)
23
#endif
24
25
using namespace x265;
26
27
28
struct CLIOptions
29
{
30
- Input* input;
31
- Output* recon;
32
- std::fstream bitstreamFile;
33
+ InputFile* input;
34
+ ReconFile* recon;
35
+ OutputFile* output;
36
+ FILE* qpfile;
37
+ const char* reconPlayCmd;
38
+ const x265_api* api;
39
+ x265_param* param;
40
bool bProgress;
41
bool bForceY4m;
42
bool bDither;
43
-
44
uint32_t seek; // number of frames to skip from the beginning
45
uint32_t framesToBeEncoded; // number of frames to encode
46
uint64_t totalbytes;
47
- size_t analysisRecordSize; // number of bytes read from or dumped into file
48
- int analysisHeaderSize;
49
-
50
int64_t startTime;
51
int64_t prevUpdateTime;
52
- float frameRate;
53
- FILE* qpfile;
54
- FILE* analysisFile;
55
56
/* in microseconds */
57
static const int UPDATE_INTERVAL = 250000;
58
59
CLIOptions()
60
{
61
- frameRate = 0.f;
62
input = NULL;
63
recon = NULL;
64
+ output = NULL;
65
+ qpfile = NULL;
66
+ reconPlayCmd = NULL;
67
+ api = NULL;
68
+ param = NULL;
69
framesToBeEncoded = seek = 0;
70
totalbytes = 0;
71
bProgress = true;
72
73
startTime = x265_mdate();
74
prevUpdateTime = 0;
75
bDither = false;
76
- qpfile = NULL;
77
- analysisFile = NULL;
78
- analysisRecordSize = 0;
79
- analysisHeaderSize = 0;
80
}
81
82
void destroy();
83
- void writeNALs(const x265_nal* nal, uint32_t nalcount);
84
- void printStatus(uint32_t frameNum, x265_param *param);
85
- bool parse(int argc, char **argv, x265_param* param);
86
+ void printStatus(uint32_t frameNum);
87
+ bool parse(int argc, char **argv);
88
bool parseQPFile(x265_picture &pic_org);
89
- bool validateFanout(x265_param*);
90
};
91
92
void CLIOptions::destroy()
93
94
if (qpfile)
95
fclose(qpfile);
96
qpfile = NULL;
97
- if (analysisFile)
98
- fclose(analysisFile);
99
- analysisFile = NULL;
100
+ if (output)
101
+ output->release();
102
+ output = NULL;
103
}
104
105
-void CLIOptions::writeNALs(const x265_nal* nal, uint32_t nalcount)
106
-{
107
- ProfileScopeEvent(bitstreamWrite);
108
- for (uint32_t i = 0; i < nalcount; i++)
109
- {
110
- bitstreamFile.write((const char*)nal->payload, nal->sizeBytes);
111
- totalbytes += nal->sizeBytes;
112
- nal++;
113
- }
114
-}
115
-
116
-void CLIOptions::printStatus(uint32_t frameNum, x265_param *param)
117
+void CLIOptions::printStatus(uint32_t frameNum)
118
{
119
char buf[200];
120
int64_t time = x265_mdate();
121
122
prevUpdateTime = time;
123
}
124
125
-bool CLIOptions::parse(int argc, char **argv, x265_param* param)
126
+bool CLIOptions::parse(int argc, char **argv)
127
{
128
bool bError = 0;
129
int help = 0;
130
int inputBitDepth = 8;
131
+ int outputBitDepth = 0;
132
int reconFileBitDepth = 0;
133
const char *inputfn = NULL;
134
const char *reconfn = NULL;
135
- const char *bitstreamfn = NULL;
136
+ const char *outputfn = NULL;
137
const char *preset = NULL;
138
const char *tune = NULL;
139
const char *profile = NULL;
140
141
int c = getopt_long(argc, argv, short_options, long_options, NULL);
142
if (c == -1)
143
break;
144
- if (c == 'p')
145
+ else if (c == 'p')
146
preset = optarg;
147
- if (c == 't')
148
+ else if (c == 't')
149
tune = optarg;
150
+ else if (c == 'D')
151
+ outputBitDepth = atoi(optarg);
152
else if (c == '?')
153
showHelp(param);
154
}
155
156
- if (x265_param_default_preset(param, preset, tune) < 0)
157
+ api = x265_api_get(outputBitDepth);
158
+ if (!api)
159
+ {
160
+ x265_log(NULL, X265_LOG_WARNING, "falling back to default bit-depth\n");
161
+ api = x265_api_get(0);
162
+ }
163
+
164
+ param = api->param_alloc();
165
+ if (!param)
166
+ {
167
+ x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n");
168
+ return true;
169
+ }
170
+
171
+ if (api->param_default_preset(param, preset, tune) < 0)
172
{
173
x265_log(NULL, X265_LOG_ERROR, "preset or tune unrecognized\n");
174
return true;
175
176
int long_options_index = -1;
177
int c = getopt_long(argc, argv, short_options, long_options, &long_options_index);
178
if (c == -1)
179
- {
180
break;
181
- }
182
183
switch (c)
184
{
185
186
OPT2("frame-skip", "seek") this->seek = (uint32_t)x265_atoi(optarg, bError);
187
OPT("frames") this->framesToBeEncoded = (uint32_t)x265_atoi(optarg, bError);
188
OPT("no-progress") this->bProgress = false;
189
- OPT("output") bitstreamfn = optarg;
190
+ OPT("output") outputfn = optarg;
191
OPT("input") inputfn = optarg;
192
OPT("recon") reconfn = optarg;
193
OPT("input-depth") inputBitDepth = (uint32_t)x265_atoi(optarg, bError);
194
195
OPT("profile") profile = optarg; /* handled last */
196
OPT("preset") /* handled above */;
197
OPT("tune") /* handled above */;
198
+ OPT("output-depth") /* handled above */;
199
+ OPT("recon-y4m-exec") reconPlayCmd = optarg;
200
OPT("qpfile")
201
{
202
this->qpfile = fopen(optarg, "rb");
203
if (!this->qpfile)
204
{
205
- x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file \n", optarg);
206
+ x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg);
207
return false;
208
}
209
}
210
else
211
- bError |= !!x265_param_parse(param, long_options[long_options_index].name, optarg);
212
+ bError |= !!api->param_parse(param, long_options[long_options_index].name, optarg);
213
214
if (bError)
215
{
216
217
218
if (optind < argc && !inputfn)
219
inputfn = argv[optind++];
220
- if (optind < argc && !bitstreamfn)
221
- bitstreamfn = argv[optind++];
222
+ if (optind < argc && !outputfn)
223
+ outputfn = argv[optind++];
224
if (optind < argc)
225
{
226
x265_log(param, X265_LOG_WARNING, "extra unused command arguments given <%s>\n", argv[optind]);
227
228
if (argc <= 1 || help)
229
showHelp(param);
230
231
- if (inputfn == NULL || bitstreamfn == NULL)
232
+ if (inputfn == NULL || outputfn == NULL)
233
{
234
x265_log(param, X265_LOG_ERROR, "input or output file not specified, try -V for help\n");
235
return true;
236
}
237
238
- if (param->internalBitDepth != x265_max_bit_depth)
239
+ if (param->internalBitDepth != api->max_bit_depth)
240
{
241
- x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", x265_max_bit_depth);
242
+ x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", api->max_bit_depth);
243
return true;
244
}
245
246
247
info.frameCount = 0;
248
getParamAspectRatio(param, info.sarWidth, info.sarHeight);
249
250
- this->input = Input::open(info, this->bForceY4m);
251
+ this->input = InputFile::open(info, this->bForceY4m);
252
if (!this->input || this->input->isFail())
253
{
254
x265_log(param, X265_LOG_ERROR, "unable to open input file <%s>\n", inputfn);
255
256
this->framesToBeEncoded = info.frameCount - seek;
257
param->totalFrames = this->framesToBeEncoded;
258
259
- if (x265_param_apply_profile(param, profile))
260
+ /* Force CFR until we have support for VFR */
261
+ info.timebaseNum = param->fpsDenom;
262
+ info.timebaseDenom = param->fpsNum;
263
+
264
+ if (api->param_apply_profile(param, profile))
265
return true;
266
267
if (param->logLevel >= X265_LOG_INFO)
268
269
else
270
sprintf(buf + p, " frames %u - %d of %d", this->seek, this->seek + this->framesToBeEncoded - 1, info.frameCount);
271
272
- fprintf(stderr, "%s [info]: %s\n", input->getName(), buf);
273
+ general_log(param, input->getName(), X265_LOG_INFO, "%s\n", buf);
274
}
275
276
this->input->startReader();
277
278
{
279
if (reconFileBitDepth == 0)
280
reconFileBitDepth = param->internalBitDepth;
281
- this->recon = Output::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth,
282
- param->fpsNum, param->fpsDenom, param->internalCsp);
283
+ this->recon = ReconFile::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth,
284
+ param->fpsNum, param->fpsDenom, param->internalCsp);
285
if (this->recon->isFail())
286
{
287
- x265_log(param, X265_LOG_WARNING, "unable to write reconstruction file\n");
288
+ x265_log(param, X265_LOG_WARNING, "unable to write reconstructed outputs file\n");
289
this->recon->release();
290
this->recon = 0;
291
}
292
else
293
- fprintf(stderr, "%s [info]: reconstructed images %dx%d fps %d/%d %s\n", this->recon->getName(),
294
+ general_log(param, this->recon->getName(), X265_LOG_INFO,
295
+ "reconstructed images %dx%d fps %d/%d %s\n",
296
param->sourceWidth, param->sourceHeight, param->fpsNum, param->fpsDenom,
297
x265_source_csp_names[param->internalCsp]);
298
}
299
300
- this->bitstreamFile.open(bitstreamfn, std::fstream::binary | std::fstream::out);
301
- if (!this->bitstreamFile)
302
+ this->output = OutputFile::open(outputfn, info);
303
+ if (this->output->isFail())
304
{
305
- x265_log(NULL, X265_LOG_ERROR, "failed to open bitstream file <%s> for writing\n", bitstreamfn);
306
+ x265_log(param, X265_LOG_ERROR, "failed to open output file <%s> for writing\n", outputfn);
307
return true;
308
}
309
+ general_log(param, this->output->getName(), X265_LOG_INFO, "output file: %s\n", outputfn);
310
return false;
311
}
312
313
314
PROFILE_INIT();
315
THREAD_NAME("API", 0);
316
317
- x265_param *param = x265_param_alloc();
318
+ GetConsoleTitle(orgConsoleTitle, CONSOLE_TITLE_SIZE);
319
+ SetThreadExecutionState(ES_CONTINUOUS | ES_SYSTEM_REQUIRED | ES_AWAYMODE_REQUIRED);
320
+
321
+ ReconPlay* reconPlay = NULL;
322
CLIOptions cliopt;
323
324
- if (cliopt.parse(argc, argv, param))
325
+ if (cliopt.parse(argc, argv))
326
{
327
cliopt.destroy();
328
- x265_param_free(param);
329
+ if (cliopt.api)
330
+ cliopt.api->param_free(cliopt.param);
331
exit(1);
332
}
333
334
- x265_encoder *encoder = x265_encoder_open(param);
335
+ x265_param* param = cliopt.param;
336
+ const x265_api* api = cliopt.api;
337
+
338
+ /* This allows muxers to modify bitstream format */
339
+ cliopt.output->setParam(param);
340
+
341
+ if (cliopt.reconPlayCmd)
342
+ reconPlay = new ReconPlay(cliopt.reconPlayCmd, *param);
343
+
344
+ /* note: we could try to acquire a different libx265 API here based on
345
+ * the profile found during option parsing, but it must be done before
346
+ * opening an encoder */
347
+
348
+ x265_encoder *encoder = api->encoder_open(param);
349
if (!encoder)
350
{
351
x265_log(param, X265_LOG_ERROR, "failed to open encoder\n");
352
cliopt.destroy();
353
- x265_param_free(param);
354
- x265_cleanup();
355
+ api->param_free(param);
356
+ api->cleanup();
357
exit(2);
358
}
359
360
/* get the encoder parameters post-initialization */
361
- x265_encoder_parameters(encoder, param);
362
+ api->encoder_parameters(encoder, param);
363
364
/* Control-C handler */
365
if (signal(SIGINT, sigint_handler) == SIG_ERR)
366
367
x265_picture pic_orig, pic_out;
368
x265_picture *pic_in = &pic_orig;
369
/* Allocate recon picture if analysisMode is enabled */
370
- x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode) ? &pic_out : NULL;
371
+ std::priority_queue<int64_t>* pts_queue = cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL;
372
+ x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode || pts_queue || reconPlay) ? &pic_out : NULL;
373
uint32_t inFrameCount = 0;
374
uint32_t outFrameCount = 0;
375
x265_nal *p_nal;
376
377
378
if (!param->bRepeatHeaders)
379
{
380
- if (x265_encoder_headers(encoder, &p_nal, &nal) < 0)
381
+ if (api->encoder_headers(encoder, &p_nal, &nal) < 0)
382
{
383
x265_log(param, X265_LOG_ERROR, "Failure generating stream headers\n");
384
ret = 3;
385
goto fail;
386
}
387
else
388
- cliopt.writeNALs(p_nal, nal);
389
+ cliopt.totalbytes += cliopt.output->writeHeaders(p_nal, nal);
390
}
391
392
- x265_picture_init(param, pic_in);
393
+ api->picture_init(param, pic_in);
394
395
if (cliopt.bDither)
396
{
397
398
399
if (pic_in)
400
{
401
- if (pic_in->bitDepth > X265_DEPTH && cliopt.bDither)
402
+ if (pic_in->bitDepth > param->internalBitDepth && cliopt.bDither)
403
{
404
- ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, X265_DEPTH);
405
- pic_in->bitDepth = X265_DEPTH;
406
+ ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, param->internalBitDepth);
407
+ pic_in->bitDepth = param->internalBitDepth;
408
}
409
+ /* Overwrite PTS */
410
+ pic_in->pts = pic_in->poc;
411
}
412
413
- int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon);
414
+ int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, pic_in, pic_recon);
415
if (numEncoded < 0)
416
{
417
b_ctrl_c = 1;
418
ret = 4;
419
break;
420
}
421
+
422
+ if (reconPlay && numEncoded)
423
+ reconPlay->writePicture(*pic_recon);
424
+
425
outFrameCount += numEncoded;
426
427
if (numEncoded && pic_recon && cliopt.recon)
428
cliopt.recon->writePicture(pic_out);
429
if (nal)
430
- cliopt.writeNALs(p_nal, nal);
431
+ {
432
+ cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out);
433
+ if (pts_queue)
434
+ {
435
+ pts_queue->push(-pic_out.pts);
436
+ if (pts_queue->size() > 2)
437
+ pts_queue->pop();
438
+ }
439
+ }
440
441
- cliopt.printStatus(outFrameCount, param);
442
+ cliopt.printStatus(outFrameCount);
443
}
444
445
/* Flush the encoder */
446
while (!b_ctrl_c)
447
{
448
- int numEncoded = x265_encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon);
449
+ int numEncoded = api->encoder_encode(encoder, &p_nal, &nal, NULL, pic_recon);
450
if (numEncoded < 0)
451
{
452
ret = 4;
453
break;
454
}
455
+
456
+ if (reconPlay && numEncoded)
457
+ reconPlay->writePicture(*pic_recon);
458
+
459
outFrameCount += numEncoded;
460
if (numEncoded && pic_recon && cliopt.recon)
461
cliopt.recon->writePicture(pic_out);
462
if (nal)
463
- cliopt.writeNALs(p_nal, nal);
464
+ {
465
+            cliopt.totalbytes += cliopt.output->writeFrame(p_nal, nal, pic_out);
+            if (pts_queue)
+            {
+                pts_queue->push(-pic_out.pts);
+                if (pts_queue->size() > 2)
+                    pts_queue->pop();
+            }
+        }

-        cliopt.printStatus(outFrameCount, param);
+        cliopt.printStatus(outFrameCount);

         if (!numEncoded)
             break;

     fprintf(stderr, "%*s\r", 80, " ");

 fail:

-    x265_encoder_get_stats(encoder, &stats, sizeof(stats));
+
+    delete reconPlay;
+
+    api->encoder_get_stats(encoder, &stats, sizeof(stats));
     if (param->csvfn && !b_ctrl_c)
-        x265_encoder_log(encoder, argc, argv);
-    x265_encoder_close(encoder);
-    cliopt.bitstreamFile.close();
+        api->encoder_log(encoder, argc, argv);
+    api->encoder_close(encoder);
+
+    int64_t second_largest_pts = 0;
+    int64_t largest_pts = 0;
+    if (pts_queue && pts_queue->size() >= 2)
+    {
+        second_largest_pts = -pts_queue->top();
+        pts_queue->pop();
+        largest_pts = -pts_queue->top();
+        pts_queue->pop();
+        delete pts_queue;
+        pts_queue = NULL;
+    }
+    cliopt.output->closeFile(largest_pts, second_largest_pts);

     if (b_ctrl_c)
-        fprintf(stderr, "aborted at input frame %d, output frame %d\n",
-            cliopt.seek + inFrameCount, stats.encodedPictureCount);
+        general_log(param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d\n",
+            cliopt.seek + inFrameCount, stats.encodedPictureCount);

     if (stats.encodedPictureCount)
     {
-        printf("\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount,
-            stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);
+        char buffer[4096];
+        int p = sprintf(buffer, "\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount,
+            stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);

         if (param->bEnablePsnr)
-            printf(", Global PSNR: %.3f", stats.globalPsnr);
+            p += sprintf(buffer + p, ", Global PSNR: %.3f", stats.globalPsnr);

         if (param->bEnableSsim)
-            printf(", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim));
+            p += sprintf(buffer + p, ", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim));

-        printf("\n");
+        sprintf(buffer + p, "\n");
+        general_log(param, NULL, X265_LOG_INFO, buffer);
     }
     else
     {
-        printf("\nencoded 0 frames\n");
+        general_log(param, NULL, X265_LOG_INFO, "\nencoded 0 frames\n");
     }

-    x265_cleanup(); /* Free library singletons */
+    api->cleanup(); /* Free library singletons */

     cliopt.destroy();

-    x265_param_free(param);
+    api->param_free(param);

     X265_FREE(errorBuf);

+    SetConsoleTitle(orgConsoleTitle);
+    SetThreadExecutionState(ES_CONTINUOUS);
+
 #if HAVE_VLD
     assert(VLDReportLeaks() == 0);
 #endif
x265_1.6.tar.gz/source/x265.def.in -> x265_1.7.tar.gz/source/x265.def.in
Changed
 x265_build_info_str
 x265_encoder_headers
 x265_encoder_parameters
+x265_encoder_reconfig
 x265_encoder_encode
 x265_encoder_get_stats
 x265_encoder_log
x265_1.6.tar.gz/source/x265.h -> x265_1.7.tar.gz/source/x265.h
Changed
     *
     * Frame encoders are distributed between the available thread pools, and
     * the encoder will never generate more thread pools than frameNumThreads */
-    char* numaPools;
+    const char* numaPools;

    /* Enable wavefront parallel processing, greatly increases parallelism for
     * less than 1% compression efficiency loss. Requires a thread pool, enabled

     * order. Otherwise the encoder will emit per-stream statistics into the log
     * file when x265_encoder_log is called (presumably at the end of the
     * encode) */
-    char* csvfn;
+    const char* csvfn;

    /*== Internal Picture Specification ==*/

     * performance. Value must be between 1 and 16, default is 3 */
    int maxNumReferences;

+   /* Allow libx265 to emit HEVC bitstreams which do not meet strict level
+    * requirements. Defaults to false */
+   int bAllowNonConformance;
+
    /*== Bitstream Options ==*/

    /* Flag indicating whether VPS, SPS and PPS headers should be output with
     * each keyframe. Default false */
    int bRepeatHeaders;

+   /* Flag indicating whether the encoder should generate start codes (Annex B
+    * format) or length (file format) before NAL units. Default true, Annex B.
+    * Muxers should set this to the correct value */
+   int bAnnexB;
+
    /* Flag indicating whether the encoder should emit an Access Unit Delimiter
     * NAL at the start of every access unit. Default false */
    int bEnableAccessUnitDelimiters;

    int analysisMode;

    /* Filename for analysisMode save/load. Default name is "x265_analysis.dat" */
-   char* analysisFileName;
+   const char* analysisFileName;

    /*== Rate Control ==*/

    /* Filename of the 2pass output/input stats file, if unspecified the
     * encoder will default to using x265_2pass.log */
-   char* statFileName;
+   const char* statFileName;

    /* temporally blur quants */
    double qblur;

    /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise
     * quality to maintain bitrate adherence */
    int bStrictCbr;
+
+   /* Enable adaptive quantization at CU granularity. This parameter specifies
+    * the minimum CU size at which QP can be adjusted, i.e. Quantization Group
+    * (QG) size. Allowed values are 64, 32, 16 provided it falls within the
+    * inclusuve range [maxCUSize, minCUSize]. Experimental, default: maxCUSize*/
+   uint32_t qgSize;
    } rc;

    /*== Video Usability Information ==*/

     * conformance cropping window to further crop the displayed window */
    int defDispWinBottomOffset;
    } vui;
+
+   /* SMPTE ST 2086 mastering display color volume SEI info, specified as a
+    * string which is parsed when the stream header SEI are emitted. The string
+    * format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)" where %hu
+    * are unsigned 16bit integers and %u are unsigned 32bit integers. The SEI
+    * includes X,Y display primaries for RGB channels, white point X,Y and
+    * max,min luminance values. */
+   const char* masteringDisplayColorVolume;
+
+   /* Content light level info SEI, specified as a string which is parsed when
+    * the stream header SEI are emitted. The string format is "%hu,%hu" where
+    * %hu are unsigned 16bit integers. The first value is the max content light
+    * level (or 0 if no maximum is indicated), the second value is the maximum
+    * picture average light level (or 0). */
+   const char* contentLightLevelInfo;
+
} x265_param;
94
/* x265_param_alloc:
95
96
void x265_picture_init(x265_param *param, x265_picture *pic);
97
98
/* x265_max_bit_depth:
99
- * Specifies the maximum number of bits per pixel that x265 can input. This
100
- * is also the max bit depth that x265 encodes in. When x265_max_bit_depth
101
- * is 8, the internal and input bit depths must be 8. When
102
- * x265_max_bit_depth is 12, the internal and input bit depths can be
103
- * either 8, 10, or 12. Note that the internal bit depth must be the same
104
- * for all encoders allocated in the same process. */
105
+ * Specifies the numer of bits per pixel that x265 uses internally to
106
+ * represent a pixel, and the bit depth of the output bitstream.
107
+ * param->internalBitDepth must be set to this value. x265_max_bit_depth
108
+ * will be 8 for default builds, 10 for HIGH_BIT_DEPTH builds. */
109
X265_API extern const int x265_max_bit_depth;
110
111
/* x265_version_str:
112
113
* Once flushing has begun, all subsequent calls must pass pic_in as NULL. */
114
int x265_encoder_encode(x265_encoder *encoder, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out);
115
116
+/* x265_encoder_reconfig:
117
+ * various parameters from x265_param are copied.
118
+ * this takes effect immediately, on whichever frame is encoded next;
119
+ * returns 0 on success, negative on parameter validation error.
120
+ *
121
+ * not all parameters can be changed; see the actual function for a
122
+ * detailed breakdown. since not all parameters can be changed, moving
123
+ * from preset to preset may not always fully copy all relevant parameters,
124
+ * but should still work usably in practice. however, more so than for
125
+ * other presets, many of the speed shortcuts used in ultrafast cannot be
126
+ * switched out of; using reconfig to switch between ultrafast and other
127
+ * presets is not recommended without a more fine-grained breakdown of
128
+ * parameters to take this into account. */
129
+int x265_encoder_reconfig(x265_encoder *, x265_param *);
130
+
131
/* x265_encoder_get_stats:
 *      returns encoder statistics */
void x265_encoder_get_stats(x265_encoder *encoder, x265_stats *, uint32_t statsSizeBytes);

    void          (*picture_init)(x265_param*, x265_picture*);
    x265_encoder* (*encoder_open)(x265_param*);
    void          (*encoder_parameters)(x265_encoder*, x265_param*);
+   int           (*encoder_reconfig)(x265_encoder*, x265_param*);
    int           (*encoder_headers)(x265_encoder*, x265_nal**, uint32_t*);
    int           (*encoder_encode)(x265_encoder*, x265_nal**, uint32_t*, x265_picture*, x265_picture*);
    void          (*encoder_get_stats)(x265_encoder*, x265_stats*, uint32_t);

 *      Retrieve the programming interface for a linked x265 library.
 *      May return NULL if no library is available that supports the
 *      requested bit depth. If bitDepth is 0 the function is guarunteed
- *      to return a non-NULL x265_api pointer, from the system default
- *      libx265 */
+ *      to return a non-NULL x265_api pointer, from the linked libx265.
+ *
+ *      If the requested bitDepth is not supported by the linked libx265,
+ *      it will attempt to dynamically bind x265_api_get() from a shared
+ *      library with an appropriate name:
+ *          8bit:  libx265_main.so
+ *          10bit: libx265_main10.so
+ *      Obviously the shared library file extension is platform specific */
const x265_api* x265_api_get(int bitDepth);

#ifdef __cplusplus
x265_1.6.tar.gz/source/x265cli.h -> x265_1.7.tar.gz/source/x265cli.h
Changed
namespace x265 {
#endif

-static const char short_options[] = "o:p:f:F:r:I:i:b:s:t:q:m:hwV?";
+static const char short_options[] = "o:D:P:p:f:F:r:I:i:b:s:t:q:m:hwV?";
static const struct option long_options[] =
{
    { "help",                 no_argument, NULL, 'h' },

    { "no-pme",               no_argument, NULL, 0 },
    { "pme",                  no_argument, NULL, 0 },
    { "log-level",      required_argument, NULL, 0 },
-   { "profile",        required_argument, NULL, 0 },
+   { "profile",        required_argument, NULL, 'P' },
    { "level-idc",      required_argument, NULL, 0 },
    { "high-tier",            no_argument, NULL, 0 },
    { "no-high-tier",         no_argument, NULL, 0 },
+   { "allow-non-conformance",no_argument, NULL, 0 },
+   { "no-allow-non-conformance",no_argument, NULL, 0 },
    { "csv",            required_argument, NULL, 0 },
    { "no-cu-stats",          no_argument, NULL, 0 },
    { "cu-stats",             no_argument, NULL, 0 },
    { "y4m",                  no_argument, NULL, 0 },
    { "no-progress",          no_argument, NULL, 0 },
    { "output",         required_argument, NULL, 'o' },
+   { "output-depth",   required_argument, NULL, 'D' },
    { "input",          required_argument, NULL, 0 },
    { "input-depth",    required_argument, NULL, 0 },
    { "input-res",      required_argument, NULL, 0 },

    { "colormatrix",    required_argument, NULL, 0 },
    { "chromaloc",      required_argument, NULL, 0 },
    { "crop-rect",      required_argument, NULL, 0 },
+   { "master-display", required_argument, NULL, 0 },
+   { "max-cll",        required_argument, NULL, 0 },
    { "no-dither",            no_argument, NULL, 0 },
    { "dither",               no_argument, NULL, 0 },
    { "no-repeat-headers",    no_argument, NULL, 0 },

    { "strict-cbr",           no_argument, NULL, 0 },
    { "temporal-layers",      no_argument, NULL, 0 },
    { "no-temporal-layers",   no_argument, NULL, 0 },
+   { "qg-size",        required_argument, NULL, 0 },
+   { "recon-y4m-exec", required_argument, NULL, 0 },
    { 0, 0, 0, 0 },
    { 0, 0, 0, 0 },
    { 0, 0, 0, 0 },

    H0("-V/--version                     Show version info and exit\n");
    H0("\nOutput Options:\n");
    H0("-o/--output <filename>           Bitstream output file name\n");
+   H0("-D/--output-depth 8|10           Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth);
    H0("   --log-level <string>          Logging level: none error warning info debug full. Default %s\n", x265::logLevelNames[param->logLevel + 1]);
    H0("   --no-progress                 Disable CLI progress reports\n");
    H0("   --[no-]cu-stats               Enable logging stats about distribution of cu across all modes. Default %s\n",OPT(param->bLogCuStats));

    H0("   --[no-]ssim                   Enable reporting SSIM metric scores. Default %s\n", OPT(param->bEnableSsim));
    H0("   --[no-]psnr                   Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr));
    H0("\nProfile, Level, Tier:\n");
-   H0("   --profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
+   H0("-P/--profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
    H0("   --level-idc <integer|float>   Force a minimum required decoder level (as '5.0' or '50')\n");
    H0("   --[no-]high-tier              If a decoder level is specified, this modifier selects High tier of that level\n");
+   H0("   --[no-]allow-non-conformance  Allow the encoder to generate profile NONE bitstreams. Default %s\n", OPT(param->bAllowNonConformance));
    H0("\nThreading, performance:\n");
    H0("   --pools <integer,...>         Comma separated thread count per thread pool (pool per NUMA node)\n");
    H0("                                 '-' implies no threads on node, '+' implies one thread per core on node\n");

    H0("   --analysis-file <filename>    Specify file name used for either dumping or reading analysis data.\n");
    H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance. Default %d\n", param->rc.aqMode);
    H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
+   H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize);
    H0("   --[no-]cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
    H1("   --ipratio <float>             QP factor between I and P. Default %.2f\n", param->rc.ipFactor);
    H1("   --pbratio <float>             QP factor between P and B. Default %.2f\n", param->rc.pbFactor);
    H1("   --qcomp <float>               Weight given to predicted complexity. Default %.2f\n", param->rc.qCompress);
-   H1("   --cbqpoffs <integer>          Chroma Cb QP Offset. Default %d\n", param->cbQpOffset);
-   H1("   --crqpoffs <integer>          Chroma Cr QP Offset. Default %d\n", param->crQpOffset);
+   H1("   --qpstep <integer>            The maximum single adjustment in QP allowed to rate control. Default %d\n", param->rc.qpStep);
+   H1("   --cbqpoffs <integer>          Chroma Cb QP Offset [-12..12]. Default %d\n", param->cbQpOffset);
+   H1("   --crqpoffs <integer>          Chroma Cr QP Offset [-12..12]. Default %d\n", param->crQpOffset);
    H1("   --scaling-list <string>       Specify a file containing HM style quant scaling lists or 'default' or 'off'. Default: off\n");
    H1("   --lambda-file <string>        Specify a file containing replacement values for the lambda tables\n");
    H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");

    H1("   --colormatrix <string>        Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,\n");
    H1("                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
    H1("   --chromaloc <integer>         Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
+   H0("   --master-display <string>     SMPTE ST 2086 master display color volume info SEI (HDR)\n");
+   H0("                                    format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n");
+   H0("   --max-cll <string>            Emit content light level info SEI as \"cll,fall\" (HDR)\n");
    H0("\nBitstream options:\n");
    H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
    H0("   --[no-]info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));

    H1("\nReconstructed video options (debugging):\n");
    H1("-r/--recon <filename>            Reconstructed raw image YUV or Y4M output file name\n");
    H1("   --recon-depth <integer>       Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n");
+   H1("   --recon-y4m-exec <string>     pipe reconstructed frames to Y4M viewer, ex:\"ffplay -i pipe:0 -autoexit\"\n");
    H1("\nExecutable return codes:\n");
    H1("        0 - encode successful\n");
    H1("        1 - unable to parse command line\n");
Request History
Aloysius created request almost 10 years ago
Updated to 1.7
scarabeus accepted request almost 10 years ago
Thanks for the bump