Changes of Revision 30

x265.changes Changed
 
1
@@ -1,4 +1,66 @@
2
 -------------------------------------------------------------------
3
+Tue Oct  9 20:03:53 UTC 2018 - aloisio@gmx.com
4
+
5
+- Update to version 2.9
6
+  New features:
7
+  * Support for chunked encoding
8
+    + :option:`--chunk-start` and :option:`--chunk-end`
9
+    + Frames preceding first frame of chunk in display order
10
+      will be encoded; however, they will be discarded in the
11
+      bitstream.
12
+    + Frames following last frame of the chunk in display order
13
+      will be used in taking lookahead decisions, but they will
14
+      not be encoded.
15
+    + This feature can be enabled only in closed GOP structures.
16
+      Default disabled.
17
+  * Support for HDR10+ version 1 SEI messages.
18
+  Encoder enhancements:
19
+  * Create API function for allocating and freeing
20
+    x265_analysis_data.
21
+  * CEA 608/708 support: Read SEI messages from text file and
22
+    encode it using userSEI message.
23
+  Bug fixes:
24
+  * Disable noise reduction when vbv is enabled.
25
+  * Support minLuma and maxLuma values changed by the
26
+    commandline.
27
+  version 2.8
28
+  New features:
29
+  * :option:`--asm avx512` is used to enable AVX-512 in x265.
30
+    Default disabled.  
31
+    + For 4K main10 high-quality encoding, we are seeing good
32
+      gains; for other resolutions and presets, we don't
33
+      recommend using this setting for now.
34
+  * :option:`--dynamic-refine` dynamically switches between
35
+    different inter refine levels. Default disabled.
36
+    + It is recommended to use :option:`--refine-intra` 4 with
37
+      dynamic refinement for a better trade-off between encode
38
+      efficiency and performance than using static refinement.
39
+  * :option:`--single-sei`
40
+    + Encode SEI messages in a single NAL unit instead of
41
+      multiple NAL units. Default disabled.
42
+  * :option:`--max-ausize-factor` controls the maximum AU size
43
+    defined in HEVC specification.
44
+    + It represents the percentage of maximum AU size used.
45
+      Default is 1.
46
+  * VMAF (Video Multi-Method Assessment Fusion)
47
+    + Added VMAF support for objective quality measurement of a
48
+      video sequence.
49
+    + Enable cmake option ENABLE_LIBVMAF to report per frame and
50
+      aggregate VMAF score. The frame level VMAF score does not
51
+      include temporal scores.
52
+    + This is supported only on linux for now.
53
+  Encoder enhancements:
54
+  * Introduced refine-intra level 4 to improve quality.
55
+  * Support for HLG-graded content and pic_struct in SEI message.
56
+  Bug Fixes:
57
+  * Fix 32 bit build error (using CMAKE GUI) in Linux.
58
+  * Fix 32 bit build error for asm primitives.
59
+  * Fix build error on mac OS.
60
+  * Fix VBV Lookahead in analysis load to achieve target bitrate.
61
+
62
+- Added x265-fix_enable512.patch
63
+
64
+-------------------------------------------------------------------
65
 Fri May  4 22:21:57 UTC 2018 - zaitor@opensuse.org
66
 
67
 - Build with nasm >= 2.13 for openSUSE Leap 42.3 and SLE-12, since
68
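The chunked-encoding feature summarized in the changelog above maps directly onto the public API; a minimal sketch in C++ (it assumes only the chunkStart/chunkEnd fields that the param.cpp hunk further down initializes, plus standard x265 API calls; the chunk boundaries 100/199 are illustrative):

    #include <x265.h>

    /* Sketch: encode frames 100-199 of a longer sequence as one chunk.
     * Frames fed before the chunk are used for context but dropped from the
     * bitstream; frames after it only feed the lookahead. Closed GOPs required. */
    x265_param* p = x265_param_alloc();
    x265_param_default_preset(p, "medium", NULL);
    p->bOpenGOP   = 0;      /* chunked encoding works only with closed GOP structures */
    p->chunkStart = 100;    /* illustrative values; 0 means disabled */
    p->chunkEnd   = 199;
    x265_encoder* enc = x265_encoder_open(p);
    /* ... feed pictures with x265_encoder_encode() as usual ... */
    x265_encoder_close(enc);
    x265_param_free(p);
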
x265.spec Changed
83
 
1
@@ -1,10 +1,10 @@
2
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
3
 
4
 Name:           x265
5
-%define soname  151
6
+%define soname  165
7
 %define libname lib%{name}
8
 %define libsoname %{libname}-%{soname}
9
-Version:        2.7
10
+Version:        2.9
11
 Release:        0
12
 License:        GPL-2.0+
13
 Summary:        A free h265/HEVC encoder - encoder binary
14
@@ -13,17 +13,15 @@
15
 Source0:        https://bitbucket.org/multicoreware/x265/downloads/%{name}_%{version}.tar.gz
16
 Patch0:         arm.patch
17
 Patch1:         x265.pkgconfig.patch
18
+Patch2:         x265-fix_enable512.patch
19
 BuildRequires:  gcc
20
 BuildRequires:  gcc-c++
21
 BuildRequires:  cmake >= 2.8.8
22
 BuildRequires:  pkg-config
23
 BuildRequires:  nasm >= 2.13
24
-%if 0%{?suse_version} > 1310
25
 %ifarch x86_64
26
 BuildRequires:  libnuma-devel >= 2.0.9
27
 %endif
28
-%endif
29
-BuildRoot:      %{_tmppath}/%{name}-%{version}-build
30
 
31
 %description
32
 x265 is a free library for encoding next-generation H265/HEVC video
33
@@ -47,18 +45,19 @@
34
 
35
 %description -n %{libname}-devel
36
 x265 is a free library for encoding next-generation H265/HEVC video
37
-streams. 
38
+streams.
39
 
40
 %prep
41
 %setup -q -n %{name}_%{version}
42
 %patch0 -p1
43
 %patch1 -p1
44
+%patch2 -p1
45
 
46
 sed -i -e "s/0.0/%{soname}.0/g" source/cmake/version.cmake
47
 
48
 
49
 %build
50
-%if 0%{?suse_version} < 1330
51
+%if 0%{?suse_version} < 1500
52
 cd source
53
 %else
54
 %define __builddir ./source/build
55
@@ -68,7 +67,7 @@
56
 make %{?_smp_mflags}
57
 
58
 %install
59
-%if 0%{?suse_version} < 1330
60
+%if 0%{?suse_version} < 1500
61
 cd source
62
 %endif
63
 %cmake_install
64
@@ -79,15 +78,14 @@
65
 %postun -n %{libsoname} -p /sbin/ldconfig
66
 
67
 %files -n %{libsoname}
68
-%defattr(0644,root,root)
69
 %{_libdir}/%{libname}.so.%{soname}*
70
 
71
-%files 
72
-%defattr(0755,root,root)
73
+%files
74
 %{_bindir}/%{name}
75
 
76
 %files -n %{libname}-devel
77
-%defattr(0644,root,root)
78
+%license COPYING
79
+%doc readme.rst
80
 %{_includedir}/%{name}.h
81
 %{_includedir}/%{name}_config.h
82
 %{_libdir}/pkgconfig/%{name}.pc
83
x265-fix_enable512.patch Added
27
 
1
@@ -0,0 +1,25 @@
2
+--- a/source/common/cpu.cpp
3
++++ b/source/common/cpu.cpp
4
+@@ -110,6 +110,11 @@ const cpu_name_t cpu_names[] =
5
+     { "", 0 },
6
+ };
7
+ 
8
++bool detect512()
9
++{
10
++    return(enable512);
11
++}
12
++
13
+ #if X265_ARCH_X86
14
+ 
15
+ extern "C" {
16
+@@ -123,10 +128,6 @@ uint64_t PFX(cpu_xgetbv)(int xcr);
17
+ #pragma warning(disable: 4309) // truncation of constant value
18
+ #endif
19
+ 
20
+-bool detect512()
21
+-{
22
+-    return(enable512);
23
+-}
24
+ uint32_t cpu_detect(bool benableavx512 )
25
+ {
26
+ 
27
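The patch above presumably just moves detect512() ahead of the x86-only region so the symbol exists on all builds; a sketch of how the encoder's internals consume the pair of entry points, based on the cpu.cpp and param.cpp hunks later in this diff (variable names illustrative):

    /* cpu_detect(true) probes for AVX-512 and, when it is both requested and
     * supported, sets the file-scope enable512 flag that detect512() reports. */
    uint32_t cpuid = X265_NS::cpu_detect(true);   /* true == AVX-512 probing requested */
    bool useAvx512 = (cpuid & X265_CPU_AVX512) && X265_NS::detect512();
    /* useAvx512 would then gate selection of the ZMM / 64-byte-aligned kernels */
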
x265_2.7.tar.gz/.hg_archival.txt -> x265_2.9.tar.gz/.hg_archival.txt Changed
8
 
1
@@ -1,4 +1,4 @@
2
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
3
-node: e41a9bf2bac4a7af2bec2bbadf91e63752d320ef
4
+node: f9681d731f2e56c2ca185cec10daece5939bee07
5
 branch: stable
6
-tag: 2.7
7
+tag: 2.9
8
x265_2.7.tar.gz/.hgtags -> x265_2.9.tar.gz/.hgtags Changed
7
 
1
@@ -25,3 +25,5 @@
2
 e7a4dd48293b7956d4a20df257d23904cc78e376 2.4
3
 64b2d0bf45a52511e57a6b7299160b961ca3d51c 2.5
4
 0e9ea76945c89962cd46cee6537586e2054b2935 2.6
5
+e41a9bf2bac4a7af2bec2bbadf91e63752d320ef 2.7
6
+a158a3a029663133455268e2a63ae6b0af2df720 2.8
7
x265_2.7.tar.gz/doc/reST/api.rst -> x265_2.9.tar.gz/doc/reST/api.rst Changed
51
 
1
@@ -223,6 +223,18 @@
2
      *     returns negative on error, 0 access unit were output.*/
3
      int x265_set_analysis_data(x265_encoder *encoder, x265_analysis_data *analysis_data, int poc, uint32_t cuBytes);
4
 
5
+**x265_alloc_analysis_data()** may be used to allocate memory for the x265_analysis_data::
6
+
7
+    /* x265_alloc_analysis_data:
8
+     *     Allocate memory for the x265_analysis_data object's internal structures. */
9
+     void x265_alloc_analysis_data(x265_param *param, x265_analysis_data* analysis);
10
+
11
+**x265_free_analysis_data()** may be used to free memory for the x265_analysis_data::
12
+
13
+    /* x265_free_analysis_data:
14
+     *    Free the allocated memory for x265_analysis_data object's internal structures. */
15
+     void x265_free_analysis_data(x265_param *param, x265_analysis_data* analysis);
16
+
17
 Pictures
18
 ========
19
 
20
@@ -398,7 +410,30 @@
21
     *     release library static allocations, reset configured CTU size */
22
    void x265_cleanup(void);
23
 
24
+VMAF (Video Multi-Method Assessment Fusion)
25
+=============================================
26
+
27
+If you set the ENABLE_LIBVMAF cmake option to ON, then x265 will report per-frame
28
+and aggregate VMAF scores for the given input and dump the scores in a csv file.
29
+The user also needs to specify :option:`--recon` on the command line to get the VMAF scores::
30
+ 
31
+    /* x265_calculate_vmafscore:
32
+     *    returns VMAF score for the input video.
33
+     *    This api must be called only after encoding was done. */
34
+    double x265_calculate_vmafscore(x265_param*, x265_vmaf_data*);
35
+
36
+    /* x265_calculate_vmaf_framelevelscore:
37
+     *    returns VMAF score for each frame in a given input video. The frame level VMAF score does not include temporal scores. */
38
+    double x265_calculate_vmaf_framelevelscore(x265_vmaf_framedata*);
39
+    
40
+.. Note::
41
 
42
+    When setting ENABLE_LIBVMAF cmake option to ON, it is recommended to
43
+    also set ENABLE_SHARED to OFF to prevent build problems.  
44
+    We only need the static library from these builds.
45
+    
46
+    Binaries built for Windows will not have VMAF support.
47
+      
48
 Multi-library Interface
49
 =======================
50
 
51
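A minimal C++ usage sketch for the two new allocation calls documented in the api.rst hunk above, assuming only the signatures shown there; the field setup an application would normally perform on x265_analysis_data before encoding is elided:

    #include <cstring>
    #include <x265.h>

    x265_param* p = x265_param_alloc();
    x265_param_default(p);
    x265_analysis_data analysis;
    memset(&analysis, 0, sizeof(analysis));
    x265_alloc_analysis_data(p, &analysis);  /* allocates the internal intra/inter buffers */
    /* ... hand &analysis to x265_encoder_encode() / x265_set_analysis_data() ... */
    x265_free_analysis_data(p, &analysis);   /* releases the same internal buffers */
    x265_param_free(p);
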
x265_2.7.tar.gz/doc/reST/cli.rst -> x265_2.9.tar.gz/doc/reST/cli.rst Changed
201
 
1
@@ -52,7 +52,7 @@
2
    2. unable to open encoder
3
    3. unable to generate stream headers
4
    4. encoder abort
5
-   
6
+
7
 Logging/Statistic Options
8
 =========================
9
 
10
@@ -104,6 +104,8 @@
11
    **BufferFill** Bits available for the next frame. Includes bits carried
12
    over from the current frame.
13
    
14
+   **BufferFillFinal** Buffer bits available after removing the frame from the CPB.
15
+   
16
    **Latency** Latency in terms of number of frames between when the frame 
17
    was given in and when the frame is given out.
18
    
19
@@ -183,11 +185,11 @@
20
    
21
 .. option:: --csv-log-level <integer>
22
 
23
-    Controls the level of detail (and size) of --csv log files
24
-       
25
-    0. summary **(default)**
26
-    1. frame level logging
27
-    2. frame level logging with performance statistics
28
+   Controls the level of detail (and size) of --csv log files
29
+
30
+   0. summary **(default)**
31
+   1. frame level logging
32
+   2. frame level logging with performance statistics
33
 
34
 .. option:: --ssim, --no-ssim
35
 
36
@@ -254,7 +256,7 @@
37
    "*"       - same as default
38
    "none"    - no thread pools are created, only frame parallelism possible
39
    "-"       - same as "none"
40
-   "10"      - allocate one pool, using up to 10 cores on node 0
41
+   "10"      - allocate one pool, using up to 10 cores on all available nodes
42
    "-,+"     - allocate one pool, using all cores on node 1
43
    "+,-,+"   - allocate one pool, using only cores on nodes 0 and 2
44
    "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2
45
@@ -535,6 +537,20 @@
46
 
47
    **CLI ONLY**
48
 
49
+.. option:: --chunk-start <integer>
50
+
51
+   First frame of the chunk. Frames preceding this in display order will
52
+   be encoded, however, they will be discarded in the bitstream. This
53
+   feature can be enabled only in closed GOP structures.
54
+   Default 0 (disabled).
55
+   
56
+.. option:: --chunk-end <integer>
57
+
58
+   Last frame of the chunk. Frames following this in display order will be
59
+   used in taking lookahead decisions, but they will not be encoded.
60
+   This feature can be enabled only in closed GOP structures.
61
+   Default 0 (disabled).
62
+
63
 Profile, Level, Tier
64
 ====================
65
 
66
@@ -646,9 +662,9 @@
67
     encoding options, the encoder will attempt to modify/set the right 
68
     encode specifications. If the encoder is unable to do so, this option
69
     will be turned OFF. Highly experimental.
70
-   
71
+
72
     Default: disabled
73
-   
74
+
75
 .. note::
76
 
77
    :option:`--profile`, :option:`--level-idc`, and
78
@@ -773,7 +789,7 @@
79
    Default 3.
80
 
81
 .. option:: --limit-modes, --no-limit-modes
82
-    
83
+
84
    When enabled, limit-modes will limit modes analyzed for each CU using cost 
85
    metrics from the 4 sub-CUs. When multiple inter modes like :option:`--rect`
86
    and/or :option:`--amp` are enabled, this feature will use motion cost 
87
@@ -820,6 +836,11 @@
88
 
89
    Default: enabled, disabled for :option:`--tune grain`
90
 
91
+.. option:: --splitrd-skip, --no-splitrd-skip
92
+
93
+   Enable skipping split RD analysis when the sum of split CU rdCost is larger than one
94
+   split CU rdCost for Intra CU. Default disabled.
95
+
96
 .. option:: --fast-intra, --no-fast-intra
97
 
98
    Perform an initial scan of every fifth intra angular mode, then
99
@@ -888,35 +909,36 @@
100
 
101
    Note that --analysis-reuse-level must be paired with analysis-reuse-mode.
102
 
103
-    +--------------+------------------------------------------+
104
-    | Level        | Description                              |
105
-    +==============+==========================================+
106
-    | 1            | Lookahead information                    |
107
-    +--------------+------------------------------------------+
108
-    | 2 to 4       | Level 1 + intra/inter modes, ref's       |
109
-    +--------------+------------------------------------------+
110
-    | 5,6 and 9    | Level 2 + rect-amp                       |
111
-    +--------------+------------------------------------------+
112
-    | 7            | Level 5 + AVC size CU refinement         |
113
-    +--------------+------------------------------------------+
114
-    | 8            | Level 5 + AVC size Full CU analysis-info |
115
-    +--------------+------------------------------------------+
116
-    | 10           | Level 5 + Full CU analysis-info          |
117
-    +--------------+------------------------------------------+
118
+   +--------------+------------------------------------------+
119
+   | Level        | Description                              |
120
+   +==============+==========================================+
121
+   | 1            | Lookahead information                    |
122
+   +--------------+------------------------------------------+
123
+   | 2 to 4       | Level 1 + intra/inter modes, ref's       |
124
+   +--------------+------------------------------------------+
125
+   | 5 and 6      | Level 2 + rect-amp                       |
126
+   +--------------+------------------------------------------+
127
+   | 7            | Level 5 + AVC size CU refinement         |
128
+   +--------------+------------------------------------------+
129
+   | 8 and 9      | Level 5 + AVC size Full CU analysis-info |
130
+   +--------------+------------------------------------------+
131
+   | 10           | Level 5 + Full CU analysis-info          |
132
+   +--------------+------------------------------------------+
133
 
134
 .. option:: --refine-mv-type <string>
135
 
136
-    Reuse MV information received through API call. Currently receives information for AVC size and the accepted 
137
-    string input is "avc". Default is disabled.
138
+   Reuse MV information received through API call. Currently receives information for AVC size and the accepted 
139
+   string input is "avc". Default is disabled.
140
 
141
 .. option:: --scale-factor
142
 
143
-       Factor by which input video is scaled down for analysis save mode.
144
-       This option should be coupled with analysis-reuse-mode option, --analysis-reuse-level 10.
145
-       The ctu size of load should be double the size of save. Default 0.
146
+   Factor by which input video is scaled down for analysis save mode.
147
+   This option should be coupled with analysis-reuse-mode option, 
148
+   --analysis-reuse-level 10. The ctu size of load can either be the 
149
+   same as that of save or double the size of save. Default 0.
150
+
151
+.. option:: --refine-intra <0..4>
152
 
153
-.. option:: --refine-intra <0..3>
154
-   
155
    Enables refinement of intra blocks in current encode. 
156
    
157
    Level 0 - Forces both mode and depth from the save encode.
158
@@ -931,8 +953,10 @@
159
    
160
    Level 3 - Perform analysis of intra modes for depth reused from first encode.
161
    
162
-   Default 0.
163
+   Level 4 - Does not reuse any analysis information - redo analysis for the intra block.
164
    
165
+   Default 0.
166
+
167
 .. option:: --refine-inter <0..3>
168
 
169
    Enables refinement of inter blocks in current encode. 
170
@@ -954,11 +978,17 @@
171
    
172
    Default 0.
173
 
174
+.. option:: --dynamic-refine, --no-dynamic-refine
175
+
176
+   Dynamically switches :option:`--refine-inter` levels 0-3 based on the content and 
177
+   the encoder settings. It is recommended to use :option:`--refine-intra` 4 with dynamic 
178
+   refinement. Default disabled.
179
+
180
 .. option:: --refine-mv
181
    
182
    Enables refinement of motion vector for scaled video. Evaluates the best 
183
    motion vector by searching the surrounding eight integer and subpel pixel
184
-    positions.
185
+   positions.
186
 
187
 Options which affect the transform unit quad-tree, sometimes referred to
188
 as the residual quad-tree (RQT).
189
@@ -1094,9 +1124,9 @@
190
    quad-tree begins at the same depth of the coded tree unit, but if the
191
    maximum TU size is smaller than the CU size then transform QT begins 
192
    at the depth of the max-tu-size. Default: 32.
193
-   
194
+
195
 .. option:: --dynamic-rd <0..4>
196
-   
197
+
198
    Increases the RD level at points where quality drops due to VBV rate 
199
    control enforcement. The number of CUs for which the RD is reconfigured 
200
    is determined based on the strength. Strength 1 gives the best FPS, 
201
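The pairing recommended in the cli.rst hunk above, :option:`--dynamic-refine` together with :option:`--refine-intra` 4, can also be expressed through the parameter struct; a C++ sketch using the field names from the param.cpp hunk later in this diff (p is an x265_param*, analysis-load setup omitted):

    p->bDynamicRefine = 1;  /* switch refine-inter levels 0-3 per block */
    p->intraRefine    = 4;  /* level 4: reuse nothing, redo intra analysis */
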
x265_2.7.tar.gz/doc/reST/presets.rst -> x265_2.9.tar.gz/doc/reST/presets.rst Changed
13
 
1
@@ -156,7 +156,10 @@
2
 that strictly minimises QP fluctuations across frames, while still allowing 
3
 the encoder to hit bitrate targets and VBV buffer limits (with a slightly 
4
 higher margin of error than normal). It is highly recommended that this 
5
-algorithm is used only through the :option:`--tune` *grain* feature.
6
+algorithm is used only through the :option:`--tune` *grain* feature. 
7
+Overriding the `--tune` *grain* settings might result in grain strobing, especially
8
+when enabling features like :option:`--aq-mode` and :option:`--cutree` that modify
9
+per-block QPs within a given frame.
10
 
11
 Fast Decode
12
 ~~~~~~~~~~~
13
x265_2.7.tar.gz/doc/reST/releasenotes.rst -> x265_2.9.tar.gz/doc/reST/releasenotes.rst Changed
71
 
1
@@ -2,6 +2,69 @@
2
 Release Notes
3
 *************
4
 
5
+Version 2.9
6
+===========
7
+
8
+Release date - 05/10/2018
9
+
10
+New features
11
+-------------
12
+1. Support for chunked encoding
13
+
14
+   :option:`--chunk-start` and :option:`--chunk-end`
15
+   Frames preceding first frame of chunk in display order will be encoded; however, they will be discarded in the bitstream.
16
+   Frames following last frame of the chunk in display order will be used in taking lookahead decisions, but they will not be encoded.
17
+   This feature can be enabled only in closed GOP structures. Default disabled.
18
+
19
+2. Support for HDR10+ version 1 SEI messages.
20
+
21
+Encoder enhancements
22
+--------------------
23
+1. Create API function for allocating and freeing x265_analysis_data.
24
+2. CEA 608/708 support: Read SEI messages from text file and encode it using userSEI message.
25
+
26
+Bug fixes
27
+---------
28
+1. Disable noise reduction when vbv is enabled.
29
+2. Support minLuma and maxLuma values changed by the commandline.
30
+
31
+Version 2.8
32
+===========
33
+
34
+Release date - 21/05/2018
35
+
36
+New features
37
+-------------
38
+1. :option:`--asm avx512` is used to enable AVX-512 in x265. Default disabled.
39
+    For 4K main10 high-quality encoding, we are seeing good gains; for other resolutions and presets, we don't recommend using this setting for now.
40
+
41
+2. :option:`--dynamic-refine` dynamically switches between different inter refine levels. Default disabled.
42
+    It is recommended to use :option:`--refine-intra` 4 with dynamic refinement for a better trade-off between encode efficiency and performance than using static refinement.
43
+
44
+3. :option:`--single-sei`
45
+    Encode SEI messages in a single NAL unit instead of multiple NAL units. Default disabled. 
46
+
47
+4. :option:`--max-ausize-factor` controls the maximum AU size defined in HEVC specification.
48
+    It represents the percentage of maximum AU size used. Default is 1. 
49
+     
50
+5. VMAF (Video Multi-Method Assessment Fusion)
51
+   Added VMAF support for objective quality measurement of a video sequence. 
52
+   Enable cmake option ENABLE_LIBVMAF to report per frame and aggregate VMAF score. The frame level VMAF score does not include temporal scores.
53
+   This is supported only on linux for now.
54
+ 
55
+Encoder enhancements
56
+--------------------
57
+1. Introduced refine-intra level 4 to improve quality. 
58
+2. Support for HLG-graded content and pic_struct in SEI message.
59
+
60
+Bug Fixes
61
+---------
62
+1. Fix 32 bit build error (using CMAKE GUI) in Linux.
63
+2. Fix 32 bit build error for asm primitives.
64
+3. Fix build error on mac OS.
65
+4. Fix VBV Lookahead in analysis load to achieve target bitrate.
66
+
67
+
68
 Version 2.7
69
 ===========
70
 
71
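For the AVX-512 switch introduced in 2.8 and listed above, the same request can be made programmatically through x265_param_parse(); it goes through the OPT("asm") handler shown in the param.cpp hunk below, which only warns when the CPU lacks AVX-512. A C++ sketch (p is an x265_param*, error handling abbreviated):

    #include <cstdio>
    #include <x265.h>

    if (x265_param_parse(p, "asm", "avx512") == 0 && !(p->cpuid & X265_CPU_AVX512))
        fprintf(stderr, "AVX-512 requested but not usable on this CPU; "
                        "x265 falls back to the best supported ISA\n");
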
x265_2.7.tar.gz/source/CMakeLists.txt -> x265_2.9.tar.gz/source/CMakeLists.txt Changed
57
 
1
@@ -29,7 +29,7 @@
2
 option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
3
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
4
 # X265_BUILD must be incremented each time the public API is changed
5
-set(X265_BUILD 151)
6
+set(X265_BUILD 165)
7
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
8
                "${PROJECT_BINARY_DIR}/x265.def")
9
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
10
@@ -48,12 +48,12 @@
11
 if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1")
12
     set(X86 1)
13
     add_definitions(-DX265_ARCH_X86=1)
14
-    if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
15
+    if(CMAKE_CXX_FLAGS STREQUAL "-m32")
16
+        message(STATUS "Detected x86 target processor")
17
+    elseif("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
18
         set(X64 1)
19
         add_definitions(-DX86_64=1)
20
         message(STATUS "Detected x86_64 target processor")
21
-    else()
22
-        message(STATUS "Detected x86 target processor")
23
     endif()
24
 elseif(POWERMATCH GREATER "-1")
25
     message(STATUS "Detected POWER target processor")
26
@@ -109,6 +109,11 @@
27
     if(NO_ATOMICS)
28
         add_definitions(-DNO_ATOMICS=1)
29
     endif(NO_ATOMICS)
30
+    find_library(VMAF vmaf)
31
+    option(ENABLE_LIBVMAF "Enable VMAF" OFF)
32
+    if(ENABLE_LIBVMAF)
33
+        add_definitions(-DENABLE_LIBVMAF)
34
+    endif()
35
 endif(UNIX)
36
 
37
 if(X64 AND NOT WIN32)
38
@@ -536,6 +541,9 @@
39
 if(EXTRA_LIB)
40
     target_link_libraries(x265-static ${EXTRA_LIB})
41
 endif()
42
+if(ENABLE_LIBVMAF)
43
+    target_link_libraries(x265-static ${VMAF})
44
+endif()
45
 install(TARGETS x265-static
46
     LIBRARY DESTINATION ${LIB_INSTALL_DIR}
47
     ARCHIVE DESTINATION ${LIB_INSTALL_DIR})
48
@@ -546,7 +554,7 @@
49
         ARCHIVE DESTINATION ${LIB_INSTALL_DIR})
50
 endif()
51
 install(FILES x265.h "${PROJECT_BINARY_DIR}/x265_config.h" DESTINATION include)
52
-if(WIN32)
53
+if((WIN32 AND ENABLE_CLI) OR (WIN32 AND ENABLE_SHARED))
54
     if(MSVC_IDE)
55
         install(FILES "${PROJECT_BINARY_DIR}/Debug/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug)
56
         install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo)
57
x265_2.7.tar.gz/source/common/common.cpp -> x265_2.9.tar.gz/source/common/common.cpp Changed
10
 
1
@@ -54,7 +54,7 @@
2
 #endif
3
 }
4
 
5
-#define X265_ALIGNBYTES 32
6
+#define X265_ALIGNBYTES 64
7
 
8
 #if _WIN32
9
 #if defined(__MINGW32__) && !defined(__MINGW64_VERSION_MAJOR)
10
x265_2.7.tar.gz/source/common/common.h -> x265_2.9.tar.gz/source/common/common.h Changed
26
 
1
@@ -75,6 +75,7 @@
2
 #define ALIGN_VAR_8(T, var)  T var __attribute__((aligned(8)))
3
 #define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16)))
4
 #define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32)))
5
+#define ALIGN_VAR_64(T, var) T var __attribute__((aligned(64)))
6
 #if defined(__MINGW32__)
7
 #define fseeko fseeko64
8
 #define ftello ftello64
9
@@ -85,6 +86,7 @@
10
 #define ALIGN_VAR_8(T, var)  __declspec(align(8)) T var
11
 #define ALIGN_VAR_16(T, var) __declspec(align(16)) T var
12
 #define ALIGN_VAR_32(T, var) __declspec(align(32)) T var
13
+#define ALIGN_VAR_64(T, var) __declspec(align(64)) T var
14
 #define fseeko _fseeki64
15
 #define ftello _ftelli64
16
 #endif // if defined(__GNUC__)
17
@@ -330,6 +332,8 @@
18
 #define START_CODE_OVERHEAD 3 
19
 #define FILLER_OVERHEAD (NAL_TYPE_OVERHEAD + START_CODE_OVERHEAD + 1)
20
 
21
+#define MAX_NUM_DYN_REFINE          (NUM_CU_DEPTH * X265_REFINE_INTER_LEVELS)
22
+
23
 namespace X265_NS {
24
 
25
 enum { SAO_NUM_OFFSET = 4 };
26
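The new 64-byte variant added in the common.h hunk above follows the existing ALIGN_VAR_8/16/32 pattern; a one-line sketch of the kind of declaration it enables (buffer name and size illustrative):

    /* 64-byte alignment so the buffer can be accessed with full-width ZMM loads;
     * expands to __attribute__((aligned(64))) or __declspec(align(64)) as above. */
    ALIGN_VAR_64(int16_t, coeffBuf[32 * 32]);
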
x265_2.7.tar.gz/source/common/cpu.cpp -> x265_2.9.tar.gz/source/common/cpu.cpp Changed
200
 
1
@@ -58,10 +58,11 @@
2
 #endif // if X265_ARCH_ARM
3
 
4
 namespace X265_NS {
5
+static bool enable512 = false;
6
 const cpu_name_t cpu_names[] =
7
 {
8
 #if X265_ARCH_X86
9
-#define MMX2 X265_CPU_MMX | X265_CPU_MMX2 | X265_CPU_CMOV
10
+#define MMX2 X265_CPU_MMX | X265_CPU_MMX2
11
     { "MMX2",        MMX2 },
12
     { "MMXEXT",      MMX2 },
13
     { "SSE",         MMX2 | X265_CPU_SSE },
14
@@ -84,13 +85,13 @@
15
     { "BMI2",        AVX | X265_CPU_LZCNT | X265_CPU_BMI1 | X265_CPU_BMI2 },
16
 #define AVX2 AVX | X265_CPU_FMA3 | X265_CPU_LZCNT | X265_CPU_BMI1 | X265_CPU_BMI2 | X265_CPU_AVX2
17
     { "AVX2", AVX2},
18
+    { "AVX512", AVX2 | X265_CPU_AVX512 },
19
 #undef AVX2
20
 #undef AVX
21
 #undef SSE2
22
 #undef MMX2
23
     { "Cache32",         X265_CPU_CACHELINE_32 },
24
     { "Cache64",         X265_CPU_CACHELINE_64 },
25
-    { "SlowCTZ",         X265_CPU_SLOW_CTZ },
26
     { "SlowAtom",        X265_CPU_SLOW_ATOM },
27
     { "SlowPshufb",      X265_CPU_SLOW_PSHUFB },
28
     { "SlowPalignr",     X265_CPU_SLOW_PALIGNR },
29
@@ -115,28 +116,32 @@
30
 /* cpu-a.asm */
31
 int PFX(cpu_cpuid_test)(void);
32
 void PFX(cpu_cpuid)(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
33
-void PFX(cpu_xgetbv)(uint32_t op, uint32_t *eax, uint32_t *edx);
34
+uint64_t PFX(cpu_xgetbv)(int xcr);
35
 }
36
 
37
 #if defined(_MSC_VER)
38
 #pragma warning(disable: 4309) // truncation of constant value
39
 #endif
40
 
41
-uint32_t cpu_detect(void)
42
+bool detect512()
43
+{
44
+    return(enable512);
45
+}
46
+uint32_t cpu_detect(bool benableavx512 )
47
 {
48
-    uint32_t cpu = 0;
49
 
50
+    uint32_t cpu = 0; 
51
     uint32_t eax, ebx, ecx, edx;
52
     uint32_t vendor[4] = { 0 };
53
     uint32_t max_extended_cap, max_basic_cap;
54
+    uint64_t xcr0 = 0;
55
 
56
 #if !X86_64
57
     if (!PFX(cpu_cpuid_test)())
58
         return 0;
59
 #endif
60
 
61
-    PFX(cpu_cpuid)(0, &eax, vendor + 0, vendor + 2, vendor + 1);
62
-    max_basic_cap = eax;
63
+    PFX(cpu_cpuid)(0, &max_basic_cap, vendor + 0, vendor + 2, vendor + 1);
64
     if (max_basic_cap == 0)
65
         return 0;
66
 
67
@@ -147,27 +152,24 @@
68
         return cpu;
69
     if (edx & 0x02000000)
70
         cpu |= X265_CPU_MMX2 | X265_CPU_SSE;
71
-    if (edx & 0x00008000)
72
-        cpu |= X265_CPU_CMOV;
73
-    else
74
-        return cpu;
75
     if (edx & 0x04000000)
76
         cpu |= X265_CPU_SSE2;
77
     if (ecx & 0x00000001)
78
         cpu |= X265_CPU_SSE3;
79
     if (ecx & 0x00000200)
80
-        cpu |= X265_CPU_SSSE3;
81
+        cpu |= X265_CPU_SSSE3 | X265_CPU_SSE2_IS_FAST;
82
     if (ecx & 0x00080000)
83
         cpu |= X265_CPU_SSE4;
84
     if (ecx & 0x00100000)
85
         cpu |= X265_CPU_SSE42;
86
-    /* Check OXSAVE and AVX bits */
87
-    if ((ecx & 0x18000000) == 0x18000000)
88
+
89
+    if (ecx & 0x08000000) /* XGETBV supported and XSAVE enabled by OS */
90
     {
91
         /* Check for OS support */
92
-        PFX(cpu_xgetbv)(0, &eax, &edx);
93
-        if ((eax & 0x6) == 0x6)
94
+        xcr0 = PFX(cpu_xgetbv)(0);
95
+        if ((xcr0 & 0x6) == 0x6) /* XMM/YMM state */
96
         {
97
+            if (ecx & 0x10000000)
98
             cpu |= X265_CPU_AVX;
99
             if (ecx & 0x00001000)
100
                 cpu |= X265_CPU_FMA3;
101
@@ -178,19 +180,29 @@
102
     {
103
         PFX(cpu_cpuid)(7, &eax, &ebx, &ecx, &edx);
104
         /* AVX2 requires OS support, but BMI1/2 don't. */
105
-        if ((cpu & X265_CPU_AVX) && (ebx & 0x00000020))
106
-            cpu |= X265_CPU_AVX2;
107
         if (ebx & 0x00000008)
108
-        {
109
             cpu |= X265_CPU_BMI1;
110
-            if (ebx & 0x00000100)
111
-                cpu |= X265_CPU_BMI2;
112
+        if (ebx & 0x00000100)
113
+            cpu |= X265_CPU_BMI2;
114
+
115
+        if ((xcr0 & 0x6) == 0x6) /* XMM/YMM state */
116
+        {
117
+            if (ebx & 0x00000020)
118
+                cpu |= X265_CPU_AVX2;
119
+            if (benableavx512)
120
+            {
121
+                if ((xcr0 & 0xE0) == 0xE0) /* OPMASK/ZMM state */
122
+                {
123
+                    if ((ebx & 0xD0030000) == 0xD0030000)
124
+                    {
125
+                        cpu |= X265_CPU_AVX512;
126
+                        enable512 = true;
127
+                    }
128
+                }
129
+            }
130
         }
131
     }
132
 
133
-    if (cpu & X265_CPU_SSSE3)
134
-        cpu |= X265_CPU_SSE2_IS_FAST;
135
-
136
     PFX(cpu_cpuid)(0x80000000, &eax, &ebx, &ecx, &edx);
137
     max_extended_cap = eax;
138
 
139
@@ -230,8 +242,6 @@
140
         {
141
             if (edx & 0x00400000)
142
                 cpu |= X265_CPU_MMX2;
143
-            if (!(cpu & X265_CPU_LZCNT))
144
-                cpu |= X265_CPU_SLOW_CTZ;
145
             if ((cpu & X265_CPU_SSE2) && !(cpu & X265_CPU_SSE2_IS_FAST))
146
                 cpu |= X265_CPU_SSE2_IS_SLOW; /* AMD CPUs come in two types: terrible at SSE and great at it */
147
         }
148
@@ -244,19 +254,10 @@
149
         int model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
150
         if (family == 6)
151
         {
152
-            /* 6/9 (pentium-m "banias"), 6/13 (pentium-m "dothan"), and 6/14 (core1 "yonah")
153
-             * theoretically support sse2, but it's significantly slower than mmx for
154
-             * almost all of x264's functions, so let's just pretend they don't. */
155
-            if (model == 9 || model == 13 || model == 14)
156
-            {
157
-                cpu &= ~(X265_CPU_SSE2 | X265_CPU_SSE3);
158
-                X265_CHECK(!(cpu & (X265_CPU_SSSE3 | X265_CPU_SSE4)), "unexpected CPU ID %d\n", cpu);
159
-            }
160
             /* Detect Atom CPU */
161
-            else if (model == 28)
162
+            if (model == 28)
163
             {
164
                 cpu |= X265_CPU_SLOW_ATOM;
165
-                cpu |= X265_CPU_SLOW_CTZ;
166
                 cpu |= X265_CPU_SLOW_PSHUFB;
167
             }
168
 
169
@@ -328,7 +329,7 @@
170
 int PFX(cpu_fast_neon_mrc_test)(void);
171
 }
172
 
173
-uint32_t cpu_detect(void)
174
+uint32_t cpu_detect(bool benableavx512)
175
 {
176
     int flags = 0;
177
 
178
@@ -371,7 +372,7 @@
179
 
180
 #elif X265_ARCH_POWER8
181
 
182
-uint32_t cpu_detect(void)
183
+uint32_t cpu_detect(bool benableavx512)
184
 {
185
 #if HAVE_ALTIVEC
186
     return X265_CPU_ALTIVEC;
187
@@ -382,10 +383,11 @@
188
 
189
 #else // if X265_ARCH_POWER8
190
 
191
-uint32_t cpu_detect(void)
192
+uint32_t cpu_detect(bool benableavx512)
193
 {
194
     return 0;
195
 }
196
 
197
 #endif // if X265_ARCH_X86
198
 }
199
+
200
x265_2.7.tar.gz/source/common/cpu.h -> x265_2.9.tar.gz/source/common/cpu.h Changed
19
 
1
@@ -26,7 +26,6 @@
2
 #define X265_CPU_H
3
 
4
 #include "common.h"
5
-
6
 /* All assembly functions are prefixed with X265_NS (macro expanded) */
7
 #define PFX3(prefix, name) prefix ## _ ## name
8
 #define PFX2(prefix, name) PFX3(prefix, name)
9
@@ -50,7 +49,8 @@
10
 #endif
11
 
12
 namespace X265_NS {
13
-uint32_t cpu_detect(void);
14
+uint32_t cpu_detect(bool);
15
+bool detect512();
16
 
17
 struct cpu_name_t
18
 {
19
x265_2.7.tar.gz/source/common/cudata.cpp -> x265_2.9.tar.gz/source/common/cudata.cpp Changed
29
 
1
@@ -1626,11 +1626,6 @@
2
                 dir |= (1 << list);
3
                 candMvField[count][list].mv = colmv;
4
                 candMvField[count][list].refIdx = refIdx;
5
-                if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && m_log2CUSize[0] < 4)
6
-                {
7
-                    MV dist(MAX_MV, MAX_MV);
8
-                    candMvField[count][list].mv = dist;
9
-                }
10
             }
11
         }
12
 
13
@@ -1790,14 +1785,7 @@
14
 
15
             int curRefPOC = m_slice->m_refPOCList[picList][refIdx];
16
             int curPOC = m_slice->m_poc;
17
-
18
-            if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && (m_log2CUSize[0] < 4))
19
-            {
20
-                MV dist(MAX_MV, MAX_MV);
21
-                pmv[numMvc++] = amvpCand[num++] = dist;
22
-            }
23
-            else
24
-                pmv[numMvc++] = amvpCand[num++] = scaleMvByPOCDist(neighbours[MD_COLLOCATED].mv[picList], curPOC, curRefPOC, colPOC, colRefPOC);
25
+            pmv[numMvc++] = amvpCand[num++] = scaleMvByPOCDist(neighbours[MD_COLLOCATED].mv[picList], curPOC, curRefPOC, colPOC, colRefPOC);
26
         }
27
     }
28
 
29
x265_2.7.tar.gz/source/common/cudata.h -> x265_2.9.tar.gz/source/common/cudata.h Changed
27
 
1
@@ -224,6 +224,11 @@
2
     uint64_t      m_fAc_den[3];
3
     uint64_t      m_fDc_den[3];
4
 
5
+    /* Feature values per CTU for dynamic refinement */
6
+    uint64_t*       m_collectCURd;
7
+    uint32_t*       m_collectCUVariance;
8
+    uint32_t*       m_collectCUCount;
9
+
10
     CUData();
11
 
12
     void     initialize(const CUDataMemPool& dataPool, uint32_t depth, const x265_param& param, int instance);
13
@@ -348,8 +353,12 @@
14
     coeff_t* trCoeffMemBlock;
15
     MV*      mvMemBlock;
16
     sse_t*   distortionMemBlock;
17
+    uint64_t* dynRefineRdBlock;
18
+    uint32_t* dynRefCntBlock;
19
+    uint32_t* dynRefVarBlock;
20
 
21
-    CUDataMemPool() { charMemBlock = NULL; trCoeffMemBlock = NULL; mvMemBlock = NULL; distortionMemBlock = NULL; }
22
+    CUDataMemPool() { charMemBlock = NULL; trCoeffMemBlock = NULL; mvMemBlock = NULL; distortionMemBlock = NULL; 
23
+                      dynRefineRdBlock = NULL; dynRefCntBlock = NULL; dynRefVarBlock = NULL;}
24
 
25
     bool create(uint32_t depth, uint32_t csp, uint32_t numInstances, const x265_param& param)
26
     {
27
x265_2.7.tar.gz/source/common/dct.cpp -> x265_2.9.tar.gz/source/common/dct.cpp Changed
130
 
1
@@ -980,19 +980,110 @@
2
             sum += sbacGetEntropyBits(mstate, firstC2Flag);
3
         }
4
     }
5
-
6
     return (sum & 0x00FFFFFF) + (c1 << 26) + (firstC2Idx << 28);
7
 }
8
+template<int log2TrSize>
9
+static void nonPsyRdoQuant_c(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos)
10
+{
11
+    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
12
+    const int scaleBits = SCALE_BITS - 2 * transformShift;
13
+    const uint32_t trSize = 1 << log2TrSize;
14
+
15
+    for (int y = 0; y < MLS_CG_SIZE; y++)
16
+    {
17
+        for (int x = 0; x < MLS_CG_SIZE; x++)
18
+        {
19
+             int64_t signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
20
+             costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits));
21
+             *totalUncodedCost += costUncoded[blkPos + x];
22
+             *totalRdCost += costUncoded[blkPos + x];
23
+        }
24
+        blkPos += trSize;
25
+    }
26
+}
27
+template<int log2TrSize>
28
+static void psyRdoQuant_c(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos)
29
+{
30
+    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
31
+    const int scaleBits = SCALE_BITS - 2 * transformShift;
32
+    const uint32_t trSize = 1 << log2TrSize;
33
+    int max = X265_MAX(0, (2 * transformShift + 1));
34
+
35
+    for (int y = 0; y < MLS_CG_SIZE; y++)
36
+    {
37
+        for (int x = 0; x < MLS_CG_SIZE; x++)
38
+        {
39
+            int64_t signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
40
+            int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
41
+
42
+            costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits));
43
+
44
+            /* when no residual coefficient is coded, predicted coef == recon coef */
45
+            costUncoded[blkPos + x] -= static_cast<int64_t>((double)(((*psyScale) * predictedCoef) >> max));
46
+
47
+            *totalUncodedCost += costUncoded[blkPos + x];
48
+            *totalRdCost += costUncoded[blkPos + x];
49
+        }
50
+        blkPos += trSize;
51
+    }
52
+}
53
+template<int log2TrSize>
54
+static void psyRdoQuant_c_1(int16_t *m_resiDctCoeff, /*int16_t  *m_fencDctCoeff, */ int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, /* int64_t *psyScale,*/ uint32_t blkPos)
55
+{
56
+   const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
57
+   const int scaleBits = SCALE_BITS - 2 * transformShift;
58
+   const uint32_t trSize = 1 << log2TrSize;
59
+
60
+   for (int y = 0; y < MLS_CG_SIZE; y++)
61
+   {
62
+       for (int x = 0; x < MLS_CG_SIZE; x++)
63
+       {
64
+           int64_t signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
65
+           costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits));
66
+           *totalUncodedCost += costUncoded[blkPos + x];
67
+           *totalRdCost += costUncoded[blkPos + x];
68
+       }
69
+       blkPos += trSize;
70
+   }
71
+}
72
+template<int log2TrSize>
73
+static void psyRdoQuant_c_2(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos)
74
+{
75
+   const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
76
+
77
+   const uint32_t trSize = 1 << log2TrSize;
78
+   int max = X265_MAX(0, (2 * transformShift + 1));
79
+
80
+   for (int y = 0; y < MLS_CG_SIZE; y++)
81
+   {
82
+       for (int x = 0; x < MLS_CG_SIZE; x++)
83
+       {
84
+           int64_t signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
85
+           int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
86
+           costUncoded[blkPos + x] -= static_cast<int64_t>((double)(((*psyScale) * predictedCoef) >> max));
87
+           *totalUncodedCost += costUncoded[blkPos + x];
88
+           *totalRdCost += costUncoded[blkPos + x];
89
+       }
90
+       blkPos += trSize;
91
+   }
92
+}
93
 
94
 namespace X265_NS {
95
 // x265 private namespace
96
-
97
 void setupDCTPrimitives_c(EncoderPrimitives& p)
98
 {
99
     p.dequant_scaling = dequant_scaling_c;
100
     p.dequant_normal = dequant_normal_c;
101
     p.quant = quant_c;
102
     p.nquant = nquant_c;
103
+    p.cu[BLOCK_4x4].nonPsyRdoQuant   = nonPsyRdoQuant_c<2>;
104
+    p.cu[BLOCK_8x8].nonPsyRdoQuant   = nonPsyRdoQuant_c<3>;
105
+    p.cu[BLOCK_16x16].nonPsyRdoQuant = nonPsyRdoQuant_c<4>;
106
+    p.cu[BLOCK_32x32].nonPsyRdoQuant = nonPsyRdoQuant_c<5>;
107
+    p.cu[BLOCK_4x4].psyRdoQuant = psyRdoQuant_c<2>;
108
+    p.cu[BLOCK_8x8].psyRdoQuant = psyRdoQuant_c<3>;
109
+    p.cu[BLOCK_16x16].psyRdoQuant = psyRdoQuant_c<4>;
110
+    p.cu[BLOCK_32x32].psyRdoQuant = psyRdoQuant_c<5>;
111
     p.dst4x4 = dst4_c;
112
     p.cu[BLOCK_4x4].dct   = dct4_c;
113
     p.cu[BLOCK_8x8].dct   = dct8_c;
114
@@ -1013,7 +1104,14 @@
115
     p.cu[BLOCK_8x8].copy_cnt   = copy_count<8>;
116
     p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
117
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
118
-
119
+   p.cu[BLOCK_4x4].psyRdoQuant_1p = psyRdoQuant_c_1<2>;
120
+   p.cu[BLOCK_4x4].psyRdoQuant_2p = psyRdoQuant_c_2<2>;
121
+   p.cu[BLOCK_8x8].psyRdoQuant_1p = psyRdoQuant_c_1<3>;
122
+   p.cu[BLOCK_8x8].psyRdoQuant_2p = psyRdoQuant_c_2<3>;
123
+   p.cu[BLOCK_16x16].psyRdoQuant_1p = psyRdoQuant_c_1<4>;
124
+   p.cu[BLOCK_16x16].psyRdoQuant_2p = psyRdoQuant_c_2<4>;
125
+   p.cu[BLOCK_32x32].psyRdoQuant_1p = psyRdoQuant_c_1<5>;
126
+   p.cu[BLOCK_32x32].psyRdoQuant_2p = psyRdoQuant_c_2<5>;
127
     p.scanPosLast = scanPosLast_c;
128
     p.findPosFirstLast = findPosFirstLast_c;
129
     p.costCoeffNxN = costCoeffNxN_c;
130
x265_2.7.tar.gz/source/common/frame.cpp -> x265_2.9.tar.gz/source/common/frame.cpp Changed
56
 
1
@@ -53,6 +53,7 @@
2
     m_addOnDepth = NULL;
3
     m_addOnCtuInfo = NULL;
4
     m_addOnPrevChange = NULL;
5
+    m_classifyFrame = false;
6
 }
7
 
8
 bool Frame::create(x265_param *param, float* quantOffsets)
9
@@ -82,10 +83,18 @@
10
         m_analysisData.wt = NULL;
11
         m_analysisData.intraData = NULL;
12
         m_analysisData.interData = NULL;
13
-        m_analysis2Pass.analysisFramedata = NULL;
14
+        m_analysisData.distortionData = NULL;
15
     }
16
 
17
-    if (m_fencPic->create(param, !!m_param->bCopyPicToFrame) && m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode || !!param->bAQMotion, param->rc.qgSize))
18
+    if (param->bDynamicRefine)
19
+    {
20
+        int size = m_param->maxCUDepth * X265_REFINE_INTER_LEVELS;
21
+        CHECKED_MALLOC_ZERO(m_classifyRd, uint64_t, size);
22
+        CHECKED_MALLOC_ZERO(m_classifyVariance, uint64_t, size);
23
+        CHECKED_MALLOC_ZERO(m_classifyCount, uint32_t, size);
24
+    }
25
+
26
+    if (m_fencPic->create(param, !!m_param->bCopyPicToFrame) && m_lowres.create(param, m_fencPic, param->rc.qgSize))
27
     {
28
         X265_CHECK((m_reconColCount == NULL), "m_reconColCount was initialized");
29
         m_numRows = (m_fencPic->m_picHeight + param->maxCUSize - 1)  / param->maxCUSize;
30
@@ -94,11 +103,8 @@
31
 
32
         if (quantOffsets)
33
         {
34
-            int32_t cuCount;
35
-            if (param->rc.qgSize == 8)
36
-                cuCount = m_lowres.maxBlocksInRowFullRes * m_lowres.maxBlocksInColFullRes;
37
-            else
38
-                cuCount = m_lowres.maxBlocksInRow * m_lowres.maxBlocksInCol;
39
+            int32_t cuCount = (param->rc.qgSize == 8) ? m_lowres.maxBlocksInRowFullRes * m_lowres.maxBlocksInColFullRes :
40
+                                                        m_lowres.maxBlocksInRow * m_lowres.maxBlocksInCol;
41
             m_quantOffsets = new float[cuCount];
42
         }
43
         return true;
44
@@ -226,4 +232,11 @@
45
     }
46
     m_lowres.destroy();
47
     X265_FREE(m_rcData);
48
+
49
+    if (m_param->bDynamicRefine)
50
+    {
51
+        X265_FREE_ZERO(m_classifyRd);
52
+        X265_FREE_ZERO(m_classifyVariance);
53
+        X265_FREE_ZERO(m_classifyCount);
54
+    }
55
 }
56
x265_2.7.tar.gz/source/common/frame.h -> x265_2.9.tar.gz/source/common/frame.h Changed
24
 
1
@@ -109,7 +109,6 @@
2
     Frame*                 m_prev;
3
     x265_param*            m_param;              // Points to the latest param set for the frame.
4
     x265_analysis_data     m_analysisData;
5
-    x265_analysis_2Pass    m_analysis2Pass;
6
     RcStats*               m_rcData;
7
 
8
     Event                  m_copyMVType;
9
@@ -122,6 +121,14 @@
10
     uint8_t**              m_addOnDepth;
11
     uint8_t**              m_addOnCtuInfo;
12
     int**                  m_addOnPrevChange;
13
+
14
+    /* Average feature values of frames being considered for classification */
15
+    uint64_t*              m_classifyRd;
16
+    uint64_t*              m_classifyVariance;
17
+    uint32_t*              m_classifyCount;
18
+
19
+    bool                   m_classifyFrame;
20
+
21
     Frame();
22
 
23
     bool create(x265_param *param, float* quantOffsets);
24
x265_2.7.tar.gz/source/common/framedata.cpp -> x265_2.9.tar.gz/source/common/framedata.cpp Changed
53
 
1
@@ -41,9 +41,25 @@
2
     if (param.rc.bStatWrite)
3
         m_spsrps = const_cast<RPS*>(sps.spsrps);
4
     bool isallocated = m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame, param);
5
+    if (m_param->bDynamicRefine)
6
+    {
7
+        CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefineRdBlock, uint64_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame);
8
+        CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefCntBlock, uint32_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame);
9
+        CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefVarBlock, uint32_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame);
10
+    }
11
     if (isallocated)
12
+    {
13
         for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
14
+        {
15
+            if (m_param->bDynamicRefine)
16
+            {
17
+                m_picCTU[ctuAddr].m_collectCURd = m_cuMemPool.dynRefineRdBlock + (ctuAddr * MAX_NUM_DYN_REFINE);
18
+                m_picCTU[ctuAddr].m_collectCUVariance = m_cuMemPool.dynRefVarBlock + (ctuAddr * MAX_NUM_DYN_REFINE);
19
+                m_picCTU[ctuAddr].m_collectCUCount = m_cuMemPool.dynRefCntBlock + (ctuAddr * MAX_NUM_DYN_REFINE);
20
+            }
21
             m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param, ctuAddr);
22
+        }
23
+    }
24
     else
25
         return false;
26
     CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame);
27
@@ -65,6 +81,12 @@
28
 {
29
     memset(m_cuStat, 0, sps.numCUsInFrame * sizeof(*m_cuStat));
30
     memset(m_rowStat, 0, sps.numCuInHeight * sizeof(*m_rowStat));
31
+    if (m_param->bDynamicRefine)
32
+    {
33
+        memset(m_picCTU->m_collectCURd, 0, MAX_NUM_DYN_REFINE * sizeof(uint64_t));
34
+        memset(m_picCTU->m_collectCUVariance, 0, MAX_NUM_DYN_REFINE * sizeof(uint32_t));
35
+        memset(m_picCTU->m_collectCUCount, 0, MAX_NUM_DYN_REFINE * sizeof(uint32_t));
36
+    }
37
 }
38
 
39
 void FrameData::destroy()
40
@@ -75,6 +97,12 @@
41
 
42
     m_cuMemPool.destroy();
43
 
44
+    if (m_param->bDynamicRefine)
45
+    {
46
+        X265_FREE(m_cuMemPool.dynRefineRdBlock);
47
+        X265_FREE(m_cuMemPool.dynRefCntBlock);
48
+        X265_FREE(m_cuMemPool.dynRefVarBlock);
49
+    }
50
     X265_FREE(m_cuStat);
51
     X265_FREE(m_rowStat);
52
     for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
53
x265_2.7.tar.gz/source/common/framedata.h -> x265_2.9.tar.gz/source/common/framedata.h Changed
61
 
1
@@ -88,6 +88,11 @@
2
     uint64_t    cntInterPu[NUM_CU_DEPTH][INTER_MODES - 1];
3
     uint64_t    cntMergePu[NUM_CU_DEPTH][INTER_MODES - 1];
4
 
5
+    /* Feature values per row for dynamic refinement */
6
+    uint64_t       rowRdDyn[MAX_NUM_DYN_REFINE];
7
+    uint32_t       rowVarDyn[MAX_NUM_DYN_REFINE];
8
+    uint32_t       rowCntDyn[MAX_NUM_DYN_REFINE];
9
+
10
     FrameStats()
11
     {
12
         memset(this, 0, sizeof(FrameStats));
13
@@ -174,47 +179,5 @@
14
     inline CUData* getPicCTU(uint32_t ctuAddr) { return &m_picCTU[ctuAddr]; }
15
 };
16
 
17
-/* Stores intra analysis data for a single frame. This struct needs better packing */
18
-struct analysis_intra_data
19
-{
20
-    uint8_t*  depth;
21
-    uint8_t*  modes;
22
-    char*     partSizes;
23
-    uint8_t*  chromaModes;
24
-};
25
-
26
-/* Stores inter analysis data for a single frame */
27
-struct analysis_inter_data
28
-{
29
-    int32_t*    ref;
30
-    uint8_t*    depth;
31
-    uint8_t*    modes;
32
-    uint8_t*    partSize;
33
-    uint8_t*    mergeFlag;
34
-    uint8_t*    interDir;
35
-    uint8_t*    mvpIdx[2];
36
-    int8_t*     refIdx[2];
37
-    MV*         mv[2];
38
-   int64_t*     sadCost;
39
-};
40
-
41
-struct analysis2PassFrameData
42
-{
43
-    uint8_t*      depth;
44
-    MV*           m_mv[2];
45
-    int*          mvpIdx[2];
46
-    int32_t*      ref[2];
47
-    uint8_t*      modes;
48
-    sse_t*        distortion;
49
-    sse_t*        ctuDistortion;
50
-    double*       scaledDistortion;
51
-    double        averageDistortion;
52
-    double        sdDistortion;
53
-    uint32_t      highDistortionCtuCount;
54
-    uint32_t      lowDistortionCtuCount;
55
-    double*       offset;
56
-    double*       threshold;
57
-};
58
-
59
 }
60
 #endif // ifndef X265_FRAMEDATA_H
61
x265_2.7.tar.gz/source/common/ipfilter.cpp -> x265_2.9.tar.gz/source/common/ipfilter.cpp Changed
41
 
1
@@ -379,7 +379,8 @@
2
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
3
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
4
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
5
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
6
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\
7
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>;
8
 
9
 #define CHROMA_422(W, H) \
10
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
11
@@ -388,7 +389,8 @@
12
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
13
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
14
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
15
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
16
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\
17
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>;
18
 
19
 #define CHROMA_444(W, H) \
20
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
21
@@ -397,7 +399,8 @@
22
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
23
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
24
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
25
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
26
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\
27
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>;
28
 
29
 #define LUMA(W, H) \
30
     p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
31
@@ -407,7 +410,8 @@
32
     p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
33
     p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
34
     p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>; \
35
-    p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;
36
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\
37
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s[ALIGNED] = filterPixelToShort_c<W, H>;
38
 
39
 void setupFilterPrimitives_c(EncoderPrimitives& p)
40
 {
41
x265_2.7.tar.gz/source/common/lowres.cpp -> x265_2.9.tar.gz/source/common/lowres.cpp Changed
66
 
1
@@ -27,10 +27,10 @@
2
 
3
 using namespace X265_NS;
4
 
5
-bool Lowres::create(PicYuv *origPic, int _bframes, bool bAQEnabled, uint32_t qgSize)
6
+bool Lowres::create(x265_param* param, PicYuv *origPic, uint32_t qgSize)
7
 {
8
     isLowres = true;
9
-    bframes = _bframes;
10
+    bframes = param->bframes;
11
     width = origPic->m_picWidth / 2;
12
     lines = origPic->m_picHeight / 2;
13
     lumaStride = width + 2 * origPic->m_lumaMarginX;
14
@@ -41,11 +41,7 @@
15
     maxBlocksInRowFullRes = maxBlocksInRow * 2;
16
     maxBlocksInColFullRes = maxBlocksInCol * 2;
17
     int cuCount = maxBlocksInRow * maxBlocksInCol;
18
-    int cuCountFullRes;
19
-    if (qgSize == 8)
20
-        cuCountFullRes = maxBlocksInRowFullRes * maxBlocksInColFullRes;
21
-    else
22
-        cuCountFullRes = cuCount;
23
+    int cuCountFullRes = (qgSize > 8) ? cuCount : cuCount << 2;
24
 
25
     /* rounding the width to multiple of lowres CU size */
26
     width = maxBlocksInRow * X265_LOWRES_CU_SIZE;
27
@@ -53,16 +49,18 @@
28
 
29
     size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY);
30
     size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX;
31
-    if (bAQEnabled)
32
+    if (!!param->rc.aqMode)
33
     {
34
         CHECKED_MALLOC_ZERO(qpAqOffset, double, cuCountFullRes);
35
-        CHECKED_MALLOC_ZERO(qpAqMotionOffset, double, cuCountFullRes);
36
         CHECKED_MALLOC_ZERO(invQscaleFactor, int, cuCountFullRes);
37
         CHECKED_MALLOC_ZERO(qpCuTreeOffset, double, cuCountFullRes);
38
-        CHECKED_MALLOC_ZERO(blockVariance, uint32_t, cuCountFullRes);
39
         if (qgSize == 8)
40
             CHECKED_MALLOC_ZERO(invQscaleFactor8x8, int, cuCount);
41
     }
42
+    if (origPic->m_param->bAQMotion)
43
+        CHECKED_MALLOC_ZERO(qpAqMotionOffset, double, cuCountFullRes);
44
+    if (origPic->m_param->bDynamicRefine)
45
+        CHECKED_MALLOC_ZERO(blockVariance, uint32_t, cuCountFullRes);
46
     CHECKED_MALLOC(propagateCost, uint16_t, cuCount);
47
 
48
     /* allocate lowres buffers */
49
@@ -126,14 +124,13 @@
50
         X265_FREE(lowresMvCosts[1][i]);
51
     }
52
     X265_FREE(qpAqOffset);
53
-    X265_FREE(qpAqMotionOffset);
54
     X265_FREE(invQscaleFactor);
55
     X265_FREE(qpCuTreeOffset);
56
     X265_FREE(propagateCost);
57
-    X265_FREE(blockVariance);
58
     X265_FREE(invQscaleFactor8x8);
59
+    X265_FREE(qpAqMotionOffset);
60
+    X265_FREE(blockVariance);
61
 }
62
-
63
 // (re) initialize lowres state
64
 void Lowres::init(PicYuv *origPic, int poc)
65
 {
66
x265_2.7.tar.gz/source/common/lowres.h -> x265_2.9.tar.gz/source/common/lowres.h Changed
35
 
1
@@ -69,7 +69,7 @@
2
             int qmvy = qmv.y + (qmv.y & 1);
3
             int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1);
4
             pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride;
5
-            primitives.pu[LUMA_8x8].pixelavg_pp(buf, outstride, frefA, lumaStride, frefB, lumaStride, 32);
6
+            primitives.pu[LUMA_8x8].pixelavg_pp[(outstride % 64 == 0) && (lumaStride % 64 == 0)](buf, outstride, frefA, lumaStride, frefB, lumaStride, 32);
7
             return buf;
8
         }
9
         else
10
@@ -91,7 +91,7 @@
11
             int qmvy = qmv.y + (qmv.y & 1);
12
             int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1);
13
             pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride;
14
-            primitives.pu[LUMA_8x8].pixelavg_pp(subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32);
15
+            primitives.pu[LUMA_8x8].pixelavg_pp[NONALIGNED](subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32);
16
             return comp(fenc, FENC_STRIDE, subpelbuf, 8);
17
         }
18
         else
19
@@ -152,14 +152,12 @@
20
     uint32_t* blockVariance;
21
     uint64_t  wp_ssd[3];       // This is different than SSDY, this is sum(pixel^2) - sum(pixel)^2 for entire frame
22
     uint64_t  wp_sum[3];
23
-    uint64_t  frameVariance;
24
 
25
     /* cutree intermediate data */
26
     uint16_t* propagateCost;
27
     double    weightedCostDelta[X265_BFRAME_MAX + 2];
28
     ReferencePlanes weightedRef[X265_BFRAME_MAX + 2];
29
-
30
-    bool create(PicYuv *origPic, int _bframes, bool bAqEnabled, uint32_t qgSize);
31
+    bool create(x265_param* param, PicYuv *origPic, uint32_t qgSize);
32
     void destroy();
33
     void init(PicYuv *origPic, int poc);
34
 };
35
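The boolean index used in the lowres.h hunk above is the new aligned/non-aligned kernel selector: judging from that call, index 1 (true) picks the aligned variant and is taken only when every stride involved is a multiple of 64 bytes. A sketch of the call-site pattern (pointer and stride names illustrative):

    bool canUseAligned = (dstStride % 64 == 0) && (lumaStride % 64 == 0);
    primitives.pu[LUMA_8x8].pixelavg_pp[canUseAligned](dst, dstStride,
                                                       ref0, lumaStride,
                                                       ref1, lumaStride, 32);
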
x265_2.7.tar.gz/source/common/param.cpp -> x265_2.9.tar.gz/source/common/param.cpp Changed
201
 
1
@@ -105,7 +105,7 @@
2
     memset(param, 0, sizeof(x265_param));
3
 
4
     /* Applying default values to all elements in the param structure */
5
-    param->cpuid = X265_NS::cpu_detect();
6
+    param->cpuid = X265_NS::cpu_detect(false);
7
     param->bEnableWavefront = 1;
8
     param->frameNumThreads = 0;
9
 
10
@@ -133,6 +133,7 @@
11
     param->bEmitHRDSEI = 0;
12
     param->bEmitInfoSEI = 1;
13
     param->bEmitHDRSEI = 0;
14
+    param->bEmitIDRRecoverySEI = 0;
15
 
16
     /* CU definitions */
17
     param->maxCUSize = 64;
18
@@ -155,6 +156,9 @@
19
     param->lookaheadThreads = 0;
20
     param->scenecutBias = 5.0;
21
     param->radl = 0;
22
+    param->chunkStart = 0;
23
+    param->chunkEnd = 0;
24
+
25
     /* Intra Coding Tools */
26
     param->bEnableConstrainedIntra = 0;
27
     param->bEnableStrongIntraSmoothing = 1;
28
@@ -192,6 +196,7 @@
29
     param->bEnableSAO = 1;
30
     param->bSaoNonDeblocked = 0;
31
     param->bLimitSAO = 0;
32
+
33
     /* Coding Quality */
34
     param->cbQpOffset = 0;
35
     param->crQpOffset = 0;
36
@@ -289,16 +294,24 @@
37
     param->scaleFactor = 0;
38
     param->intraRefine = 0;
39
     param->interRefine = 0;
40
+    param->bDynamicRefine = 0;
41
     param->mvRefine = 0;
42
     param->bUseAnalysisFile = 1;
43
     param->csvfpt = NULL;
44
     param->forceFlush = 0;
45
     param->bDisableLookahead = 0;
46
     param->bCopyPicToFrame = 1;
47
+    param->maxAUSizeFactor = 1;
48
+    param->naluFile = NULL;
49
 
50
     /* DCT Approximations */
51
     param->bLowPassDct = 0;
52
     param->bMVType = 0;
53
+    param->bSingleSeiNal = 0;
54
+
55
+    /* SEI messages */
56
+    param->preferredTransferCharacteristics = -1;
57
+    param->pictureStructure = -1;
58
 }
59
 
60
 int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
61
@@ -606,10 +619,26 @@
62
     if (0) ;
63
     OPT("asm")
64
     {
65
+#if X265_ARCH_X86
66
+        if (!strcasecmp(value, "avx512"))
67
+        {
68
+            p->cpuid = X265_NS::cpu_detect(true);
69
+            if (!(p->cpuid & X265_CPU_AVX512))
70
+                x265_log(p, X265_LOG_WARNING, "AVX512 is not supported\n");
71
+        }
72
+        else
73
+        {
74
+            if (bValueWasNull)
75
+                p->cpuid = atobool(value);
76
+            else
77
+                p->cpuid = parseCpuName(value, bError, false);
78
+        }
79
+#else
80
         if (bValueWasNull)
81
             p->cpuid = atobool(value);
82
         else
83
-            p->cpuid = parseCpuName(value, bError);
84
+            p->cpuid = parseCpuName(value, bError, false);
85
+#endif
86
     }
87
     OPT("fps")
88
     {
89
@@ -981,6 +1010,7 @@
90
         OPT("limit-sao") p->bLimitSAO = atobool(value);
91
         OPT("dhdr10-info") p->toneMapFile = strdup(value);
92
         OPT("dhdr10-opt") p->bDhdr10opt = atobool(value);
93
+        OPT("idr-recovery-sei") p->bEmitIDRRecoverySEI = atobool(value);
94
         OPT("const-vbv") p->rc.bEnableConstVbv = atobool(value);
95
         OPT("ctu-info") p->bCTUInfo = atoi(value);
96
         OPT("scale-factor") p->scaleFactor = atoi(value);
97
@@ -989,7 +1019,7 @@
98
         OPT("refine-mv")p->mvRefine = atobool(value);
99
         OPT("force-flush")p->forceFlush = atoi(value);
100
         OPT("splitrd-skip") p->bEnableSplitRdSkip = atobool(value);
101
-       OPT("lowpass-dct") p->bLowPassDct = atobool(value);
102
+        OPT("lowpass-dct") p->bLowPassDct = atobool(value);
103
         OPT("vbv-end") p->vbvBufferEnd = atof(value);
104
         OPT("vbv-end-fr-adj") p->vbvEndFrameAdjust = atof(value);
105
         OPT("copy-pic") p->bCopyPicToFrame = atobool(value);
106
@@ -1007,11 +1037,19 @@
107
             {
108
                 bError = true;
109
             }
110
-         }
111
+        }
112
         OPT("gop-lookahead") p->gopLookahead = atoi(value);
113
         OPT("analysis-save") p->analysisSave = strdup(value);
114
         OPT("analysis-load") p->analysisLoad = strdup(value);
115
         OPT("radl") p->radl = atoi(value);
116
+        OPT("max-ausize-factor") p->maxAUSizeFactor = atof(value);
117
+        OPT("dynamic-refine") p->bDynamicRefine = atobool(value);
118
+        OPT("single-sei") p->bSingleSeiNal = atobool(value);
119
+        OPT("atc-sei") p->preferredTransferCharacteristics = atoi(value);
120
+        OPT("pic-struct") p->pictureStructure = atoi(value);
121
+        OPT("chunk-start") p->chunkStart = atoi(value);
122
+        OPT("chunk-end") p->chunkEnd = atoi(value);
123
+        OPT("nalu-file") p->naluFile = strdup(value);
124
         else
125
             return X265_PARAM_BAD_NAME;
126
     }
127
@@ -1054,7 +1092,7 @@
128
  *   false || no  - disabled
129
  *   integer bitmap value
130
  *   comma separated list of SIMD names, eg: SSE4.1,XOP */
131
-int parseCpuName(const char* value, bool& bError)
132
+int parseCpuName(const char* value, bool& bError, bool bEnableavx512)
133
 {
134
     if (!value)
135
     {
136
@@ -1065,7 +1103,7 @@
137
     if (isdigit(value[0]))
138
         cpu = x265_atoi(value, bError);
139
     else
140
-        cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? X265_NS::cpu_detect() : 0;
141
+        cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? X265_NS::cpu_detect(bEnableavx512) : 0;
142
 
143
     if (bError)
144
     {
145
@@ -1365,8 +1403,10 @@
146
         "Supported values for bCTUInfo are 0, 1, 2, 4, 6");
147
     CHECK(param->interRefine > 3 || param->interRefine < 0,
148
         "Invalid refine-inter value, refine-inter levels 0 to 3 supported");
149
-    CHECK(param->intraRefine > 3 || param->intraRefine < 0,
150
+    CHECK(param->intraRefine > 4 || param->intraRefine < 0,
151
         "Invalid refine-intra value, refine-intra levels 0 to 3 supported");
152
+    CHECK(param->maxAUSizeFactor < 0.5 || param->maxAUSizeFactor > 1.0,
153
+        "Supported factor for controlling max AU size is from 0.5 to 1");
154
 #if !X86_64
155
     CHECK(param->searchMethod == X265_SEA && (param->sourceWidth > 840 || param->sourceHeight > 480),
156
         "SEA motion search does not support resolutions greater than 480p in 32 bit build");
157
@@ -1375,6 +1415,21 @@
158
     if (param->masteringDisplayColorVolume || param->maxFALL || param->maxCLL)
159
         param->bEmitHDRSEI = 1;
160
 
161
+    bool isSingleSEI = (param->bRepeatHeaders
162
+                     || param->bEmitHRDSEI
163
+                     || param->bEmitInfoSEI
164
+                     || param->bEmitHDRSEI
165
+                     || param->bEmitIDRRecoverySEI
166
+                   || !!param->interlaceMode
167
+                     || param->preferredTransferCharacteristics > 1
168
+                     || param->toneMapFile
169
+                     || param->naluFile);
170
+
171
+    if (!isSingleSEI && param->bSingleSeiNal)
172
+    {
173
+        param->bSingleSeiNal = 0;
174
+        x265_log(param, X265_LOG_WARNING, "None of the SEI messages are enabled. Disabling Single SEI NAL\n");
175
+    }
176
     return check_failed;
177
 }
178
 
179
@@ -1504,6 +1559,7 @@
180
     TOOLVAL(param->bCTUInfo, "ctu-info=%d");
181
     if (param->bMVType == AVC_INFO)
182
         TOOLOPT(param->bMVType, "refine-mv-type=avc");
183
+    TOOLOPT(param->bDynamicRefine, "dynamic-refine");
184
     if (param->maxSlices > 1)
185
         TOOLVAL(param->maxSlices, "slices=%d");
186
     if (param->bEnableLoopFilter)
187
@@ -1520,6 +1576,7 @@
188
     TOOLOPT(!param->bSaoNonDeblocked && param->bEnableSAO, "sao");
189
     TOOLOPT(param->rc.bStatWrite, "stats-write");
190
     TOOLOPT(param->rc.bStatRead,  "stats-read");
191
+    TOOLOPT(param->bSingleSeiNal, "single-sei");
192
 #if ENABLE_HDR10_PLUS
193
     TOOLOPT(param->toneMapFile != NULL, "dhdr10-info");
194
 #endif
195
@@ -1560,6 +1617,10 @@
196
     s += sprintf(s, " input-res=%dx%d", p->sourceWidth - padx, p->sourceHeight - pady);
197
     s += sprintf(s, " interlace=%d", p->interlaceMode);
198
     s += sprintf(s, " total-frames=%d", p->totalFrames);
199
+    if (p->chunkStart)
200
+        s += sprintf(s, " chunk-start=%d", p->chunkStart);
201
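
Note: the OPT("asm") hunk above is the only place AVX-512 gets enabled: the literal value "avx512" re-runs CPU detection with the new bEnableavx512 flag set and warns if the capability bit does not come back, while every other value keeps AVX-512 masked off. A self-contained sketch of that control flow, with cpu_detect/parseCpuName replaced by toy stand-ins and an illustrative flag value (assumes a POSIX strcasecmp):

    #include <cstdio>
    #include <strings.h>

    // Illustrative stand-ins for X265_NS::cpu_detect, parseCpuName and
    // X265_CPU_AVX512; the real values come from x265's CPUID code.
    enum { X265_CPU_AVX512 = 1 << 20 };

    static int cpu_detect(bool enableAvx512)
    {
        int mask = 0xff;                     // pretend baseline SIMD levels
        if (enableAvx512)
            mask |= X265_CPU_AVX512;         // only reported when explicitly allowed
        return mask;
    }

    static int parseCpuName(const char*, bool& bError, bool bEnableAvx512)
    {
        bError = false;
        return cpu_detect(bEnableAvx512);    // simplified "auto"-style handling
    }

    // Mirrors the branch structure of OPT("asm") in the hunk above.
    static int parseAsmOption(const char* value)
    {
        bool bError = false;
        if (!strcasecmp(value, "avx512"))
        {
            int cpuid = cpu_detect(true);
            if (!(cpuid & X265_CPU_AVX512))
                std::printf("warning: AVX512 is not supported\n");
            return cpuid;
        }
        return parseCpuName(value, bError, false);
    }

    int main()
    {
        std::printf("--asm avx512 -> 0x%x\n", parseAsmOption("avx512"));
        std::printf("--asm auto   -> 0x%x\n", parseAsmOption("auto"));
        return 0;
    }
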
x265_2.7.tar.gz/source/common/param.h -> x265_2.9.tar.gz/source/common/param.h Changed
10
 
1
@@ -33,7 +33,7 @@
2
 char* x265_param2string(x265_param *param, int padx, int pady);
3
 int   x265_atoi(const char *str, bool& bError);
4
 double x265_atof(const char *str, bool& bError);
5
-int   parseCpuName(const char *value, bool& bError);
6
+int   parseCpuName(const char *value, bool& bError, bool bEnableavx512);
7
 void  setParamAspectRatio(x265_param *p, int width, int height);
8
 void  getParamAspectRatio(x265_param *p, int& width, int& height);
9
 bool  parseLambdaFile(x265_param *param);
10
x265_2.7.tar.gz/source/common/picyuv.cpp -> x265_2.9.tar.gz/source/common/picyuv.cpp Changed
21
 
1
@@ -358,6 +358,19 @@
2
     pixel *uPic = m_picOrg[1];
3
     pixel *vPic = m_picOrg[2];
4
 
5
+    if(param.minLuma != 0 || param.maxLuma != PIXEL_MAX)
6
+    {
7
+        for (int r = 0; r < height; r++)
8
+        {
9
+            for (int c = 0; c < width; c++)
10
+            {
11
+                yPic[c] = X265_MIN(yPic[c], (pixel)param.maxLuma);
12
+                yPic[c] = X265_MAX(yPic[c], (pixel)param.minLuma);
13
+            }
14
+            yPic += m_stride;
15
+        }
16
+    }
17
+    yPic = m_picOrg[0];
18
     if (param.csvLogLevel >= 2 || param.maxCLL || param.maxFALL)
19
     {
20
         for (int r = 0; r < height; r++)
21
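
Note: the picyuv.cpp hunk above clips every luma sample into [minLuma, maxLuma] before the frame is analysed, walking the padded plane row by row. A small stand-alone C++ equivalent of that clamp (pixel is taken as 8-bit here; main10 builds use 16-bit samples):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    typedef uint8_t pixel;

    // Same per-row walk as the hunk: clamp each sample to [minLuma, maxLuma],
    // stepping by the plane stride, which may be wider than the picture.
    static void clampLumaPlane(pixel* yPic, intptr_t stride, int width, int height,
                               int minLuma, int maxLuma)
    {
        for (int r = 0; r < height; r++)
        {
            for (int c = 0; c < width; c++)
                yPic[c] = (pixel)std::min(std::max((int)yPic[c], minLuma), maxLuma);
            yPic += stride;
        }
    }

    int main()
    {
        const int width = 4, height = 2;
        const intptr_t stride = 8;                     // padded stride
        std::vector<pixel> plane(stride * height, 0);
        plane[0] = 10; plane[1] = 200; plane[stride] = 255;

        clampLumaPlane(plane.data(), stride, width, height, 16, 235);
        std::printf("%d %d %d\n", plane[0], plane[1], plane[stride]);  // 16 200 235
        return 0;
    }
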
x265_2.7.tar.gz/source/common/picyuv.h -> x265_2.9.tar.gz/source/common/picyuv.h Changed
9
 
1
@@ -72,6 +72,7 @@
2
     pixel   m_maxChromaVLevel;
3
     pixel   m_minChromaVLevel;
4
     double  m_avgChromaVLevel;
5
+    double  m_vmafScore;
6
     x265_param *m_param;
7
 
8
     PicYuv();
9
x265_2.7.tar.gz/source/common/pixel.cpp -> x265_2.9.tar.gz/source/common/pixel.cpp Changed
102
 
1
@@ -922,7 +922,7 @@
2
 static void cuTreeFix8Pack(uint16_t *dst, double *src, int count)
3
 {
4
     for (int i = 0; i < count; i++)
5
-        dst[i] = (uint16_t)(src[i] * 256.0);
6
+        dst[i] = (uint16_t)(int16_t)(src[i] * 256.0);
7
 }
8
 
9
 static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count)
10
@@ -986,28 +986,34 @@
11
 {
12
 #define LUMA_PU(W, H) \
13
     p.pu[LUMA_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \
14
-    p.pu[LUMA_ ## W ## x ## H].addAvg = addAvg<W, H>; \
15
+    p.pu[LUMA_ ## W ## x ## H].addAvg[NONALIGNED] = addAvg<W, H>; \
16
+    p.pu[LUMA_ ## W ## x ## H].addAvg[ALIGNED] = addAvg<W, H>; \
17
     p.pu[LUMA_ ## W ## x ## H].sad = sad<W, H>; \
18
     p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad_x3<W, H>; \
19
     p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad_x4<W, H>; \
20
-    p.pu[LUMA_ ## W ## x ## H].pixelavg_pp = pixelavg_pp<W, H>;
21
-
22
+    p.pu[LUMA_ ## W ## x ## H].pixelavg_pp[NONALIGNED] = pixelavg_pp<W, H>; \
23
+    p.pu[LUMA_ ## W ## x ## H].pixelavg_pp[ALIGNED] = pixelavg_pp<W, H>;
24
 #define LUMA_CU(W, H) \
25
     p.cu[BLOCK_ ## W ## x ## H].sub_ps        = pixel_sub_ps_c<W, H>; \
26
-    p.cu[BLOCK_ ## W ## x ## H].add_ps        = pixel_add_ps_c<W, H>; \
27
+    p.cu[BLOCK_ ## W ## x ## H].add_ps[NONALIGNED]    = pixel_add_ps_c<W, H>; \
28
+    p.cu[BLOCK_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>; \
29
     p.cu[BLOCK_ ## W ## x ## H].copy_sp       = blockcopy_sp_c<W, H>; \
30
     p.cu[BLOCK_ ## W ## x ## H].copy_ps       = blockcopy_ps_c<W, H>; \
31
     p.cu[BLOCK_ ## W ## x ## H].copy_ss       = blockcopy_ss_c<W, H>; \
32
-    p.cu[BLOCK_ ## W ## x ## H].blockfill_s   = blockfill_s_c<W>;  \
33
+    p.cu[BLOCK_ ## W ## x ## H].blockfill_s[NONALIGNED] = blockfill_s_c<W>;  \
34
+    p.cu[BLOCK_ ## W ## x ## H].blockfill_s[ALIGNED]    = blockfill_s_c<W>;  \
35
     p.cu[BLOCK_ ## W ## x ## H].cpy2Dto1D_shl = cpy2Dto1D_shl<W>; \
36
     p.cu[BLOCK_ ## W ## x ## H].cpy2Dto1D_shr = cpy2Dto1D_shr<W>; \
37
-    p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl = cpy1Dto2D_shl<W>; \
38
+    p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl[NONALIGNED] = cpy1Dto2D_shl<W>; \
39
+    p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl[ALIGNED] = cpy1Dto2D_shl<W>; \
40
     p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shr = cpy1Dto2D_shr<W>; \
41
     p.cu[BLOCK_ ## W ## x ## H].psy_cost_pp   = psyCost_pp<BLOCK_ ## W ## x ## H>; \
42
     p.cu[BLOCK_ ## W ## x ## H].transpose     = transpose<W>; \
43
-    p.cu[BLOCK_ ## W ## x ## H].ssd_s         = pixel_ssd_s_c<W>; \
44
+    p.cu[BLOCK_ ## W ## x ## H].ssd_s[NONALIGNED]         = pixel_ssd_s_c<W>; \
45
+    p.cu[BLOCK_ ## W ## x ## H].ssd_s[ALIGNED] = pixel_ssd_s_c<W>; \
46
     p.cu[BLOCK_ ## W ## x ## H].var           = pixel_var<W>; \
47
-    p.cu[BLOCK_ ## W ## x ## H].calcresidual  = getResidual<W>; \
48
+    p.cu[BLOCK_ ## W ## x ## H].calcresidual[NONALIGNED]  = getResidual<W>; \
49
+    p.cu[BLOCK_ ## W ## x ## H].calcresidual[ALIGNED]     = getResidual<W>; \
50
     p.cu[BLOCK_ ## W ## x ## H].sse_pp        = sse<W, H, pixel, pixel>; \
51
     p.cu[BLOCK_ ## W ## x ## H].sse_ss        = sse<W, H, int16_t, int16_t>;
52
 
53
@@ -1102,7 +1108,8 @@
54
     p.cu[BLOCK_64x64].sa8d = sa8d16<64, 64>;
55
 
56
 #define CHROMA_PU_420(W, H) \
57
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg  = addAvg<W, H>;         \
58
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg[NONALIGNED]  = addAvg<W, H>;         \
59
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg[ALIGNED]  = addAvg<W, H>;         \
60
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \
61
 
62
     CHROMA_PU_420(2, 2);
63
@@ -1165,7 +1172,8 @@
64
     p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].copy_ps = blockcopy_ps_c<W, H>; \
65
     p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].copy_ss = blockcopy_ss_c<W, H>; \
66
     p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].sub_ps = pixel_sub_ps_c<W, H>;  \
67
-    p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].add_ps = pixel_add_ps_c<W, H>;
68
+    p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].add_ps[NONALIGNED] = pixel_add_ps_c<W, H>; \
69
+    p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>;
70
 
71
     CHROMA_CU_420(2, 2)
72
     CHROMA_CU_420(4, 4)
73
@@ -1179,7 +1187,8 @@
74
     p.chroma[X265_CSP_I420].cu[BLOCK_64x64].sa8d = sa8d16<32, 32>;
75
 
76
 #define CHROMA_PU_422(W, H) \
77
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg  = addAvg<W, H>;         \
78
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg[NONALIGNED]  = addAvg<W, H>;         \
79
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg[ALIGNED]  = addAvg<W, H>;         \
80
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \
81
 
82
     CHROMA_PU_422(2, 4);
83
@@ -1242,7 +1251,8 @@
84
     p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].copy_ps = blockcopy_ps_c<W, H>; \
85
     p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].copy_ss = blockcopy_ss_c<W, H>; \
86
     p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].sub_ps = pixel_sub_ps_c<W, H>; \
87
-    p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps = pixel_add_ps_c<W, H>;
88
+    p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps[NONALIGNED] = pixel_add_ps_c<W, H>; \
89
+    p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>;
90
 
91
     CHROMA_CU_422(2, 4)
92
     CHROMA_CU_422(4, 8)
93
@@ -1258,7 +1268,7 @@
94
     p.weight_pp = weight_pp_c;
95
     p.weight_sp = weight_sp_c;
96
 
97
-    p.scale1D_128to64 = scale1D_128to64;
98
+    p.scale1D_128to64[NONALIGNED] = p.scale1D_128to64[ALIGNED] = scale1D_128to64;
99
     p.scale2D_64to32 = scale2D_64to32;
100
     p.frameInitLowres = frame_init_lowres_core;
101
     p.ssim_4x4x2_core = ssim_4x4x2_core;
102
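
Note: the cuTreeFix8Pack hunk above only changes the store cast, but the extra step matters: cuTree QP offsets can be negative, and converting a negative double directly to uint16_t is undefined behaviour in C++. Casting through int16_t first keeps the conversion defined and stores the value as two's-complement 8.8 fixed point. A tiny worked example (the unpack line reflects the matching cuTreeFix8Unpack behaviour as understood here; it is not shown in the hunk):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        double qpOffset = -1.5;                          // offsets may be negative

        // New packing path: double -> int16_t -> uint16_t.
        // -1.5 * 256 == -384, stored as 0xFE80 (two's-complement fix8.8).
        uint16_t packed = (uint16_t)(int16_t)(qpOffset * 256.0);

        // The old (uint16_t)(qpOffset * 256.0) converts a negative double to an
        // unsigned type, which is undefined behaviour.

        double unpacked = (double)(int16_t)packed / 256.0;   // assumed unpack step
        std::printf("packed=0x%04x unpacked=%.2f\n", packed, unpacked);
        return 0;
    }
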
x265_2.7.tar.gz/source/common/predict.cpp -> x265_2.9.tar.gz/source/common/predict.cpp Changed
72
 
1
@@ -91,7 +91,7 @@
2
         MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
3
         cu.clipMv(mv0);
4
 
5
-        if (cu.m_slice->m_pps->bUseWeightPred && wp0->bPresentFlag)
6
+        if (cu.m_slice->m_pps->bUseWeightPred && wp0->wtPresent)
7
         {
8
             for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
9
             {
10
@@ -133,7 +133,7 @@
11
             pwp0 = refIdx0 >= 0 ? cu.m_slice->m_weightPredTable[0][refIdx0] : NULL;
12
             pwp1 = refIdx1 >= 0 ? cu.m_slice->m_weightPredTable[1][refIdx1] : NULL;
13
 
14
-            if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
15
+            if (pwp0 && pwp1 && (pwp0->wtPresent || pwp1->wtPresent))
16
             {
17
                 /* biprediction weighting */
18
                 for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
19
@@ -183,7 +183,7 @@
20
                 predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
21
             }
22
 
23
-            if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
24
+            if (pwp0 && pwp1 && (pwp0->wtPresent || pwp1->wtPresent))
25
                 addWeightBi(pu, predYuv, m_predShortYuv[0], m_predShortYuv[1], wv0, wv1, bLuma, bChroma);
26
             else
27
                 predYuv.addAvg(m_predShortYuv[0], m_predShortYuv[1], pu.puAbsPartIdx, pu.width, pu.height, bLuma, bChroma);
28
@@ -193,7 +193,7 @@
29
             MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
30
             cu.clipMv(mv0);
31
 
32
-            if (pwp0 && pwp0->bPresentFlag)
33
+            if (pwp0 && pwp0->wtPresent)
34
             {
35
                 ShortYuv& shortYuv = m_predShortYuv[0];
36
 
37
@@ -220,7 +220,7 @@
38
             /* uniprediction to L1 */
39
             X265_CHECK(refIdx1 >= 0, "refidx1 was not positive\n");
40
 
41
-            if (pwp1 && pwp1->bPresentFlag)
42
+            if (pwp1 && pwp1->wtPresent)
43
             {
44
                 ShortYuv& shortYuv = m_predShortYuv[0];
45
 
46
@@ -283,7 +283,11 @@
47
     int yFrac = mv.y & 3;
48
 
49
     if (!(yFrac | xFrac))
50
-        primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
51
+    {
52
+        bool srcbufferAlignCheck = (refPic.m_cuOffsetY[pu.ctuAddr] + refPic.m_buOffsetY[pu.cuAbsPartIdx + pu.puAbsPartIdx] + srcOffset) % 64 == 0;
53
+        bool dstbufferAlignCheck = (dstSYuv.getAddrOffset(pu.puAbsPartIdx, dstSYuv.m_size) % 64) == 0;
54
+        primitives.pu[partEnum].convert_p2s[srcStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheck && dstbufferAlignCheck](src, srcStride, dst, dstStride);
55
+    }
56
     else if (!yFrac)
57
         primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
58
     else if (!xFrac)
59
@@ -375,8 +379,10 @@
60
 
61
     if (!(yFrac | xFrac))
62
     {
63
-        primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
64
-        primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
65
+        bool srcbufferAlignCheckC = (refPic.m_cuOffsetC[pu.ctuAddr] + refPic.m_buOffsetC[pu.cuAbsPartIdx + pu.puAbsPartIdx] + refOffset) % 64 == 0;
66
+        bool dstbufferAlignCheckC = dstSYuv.getChromaAddrOffset(pu.puAbsPartIdx) % 64 == 0;
67
+        primitives.chroma[m_csp].pu[partEnum].p2s[refStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheckC && dstbufferAlignCheckC](refCb, refStride, dstCb, dstStride);
68
+        primitives.chroma[m_csp].pu[partEnum].p2s[refStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheckC && dstbufferAlignCheckC](refCr, refStride, dstCr, dstStride);
69
     }
70
     else if (!yFrac)
71
     {
72
x265_2.7.tar.gz/source/common/primitives.cpp -> x265_2.9.tar.gz/source/common/primitives.cpp Changed
25
 
1
@@ -114,9 +114,11 @@
2
     for (int i = 0; i < NUM_PU_SIZES; i++)
3
     {
4
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
5
-        p.chroma[X265_CSP_I444].pu[i].addAvg  = p.pu[i].addAvg;
6
+        p.chroma[X265_CSP_I444].pu[i].addAvg[NONALIGNED]  = p.pu[i].addAvg[NONALIGNED];
7
+        p.chroma[X265_CSP_I444].pu[i].addAvg[ALIGNED] = p.pu[i].addAvg[ALIGNED];
8
         p.chroma[X265_CSP_I444].pu[i].satd    = p.pu[i].satd;
9
-        p.chroma[X265_CSP_I444].pu[i].p2s     = p.pu[i].convert_p2s;
10
+        p.chroma[X265_CSP_I444].pu[i].p2s[NONALIGNED]     = p.pu[i].convert_p2s[NONALIGNED];
11
+        p.chroma[X265_CSP_I444].pu[i].p2s[ALIGNED] = p.pu[i].convert_p2s[ALIGNED];
12
     }
13
 
14
     for (int i = 0; i < NUM_CU_SIZES; i++)
15
@@ -124,7 +126,8 @@
16
         p.chroma[X265_CSP_I444].cu[i].sa8d    = p.cu[i].sa8d;
17
         p.chroma[X265_CSP_I444].cu[i].sse_pp  = p.cu[i].sse_pp;
18
         p.chroma[X265_CSP_I444].cu[i].sub_ps  = p.cu[i].sub_ps;
19
-        p.chroma[X265_CSP_I444].cu[i].add_ps  = p.cu[i].add_ps;
20
+        p.chroma[X265_CSP_I444].cu[i].add_ps[NONALIGNED]  = p.cu[i].add_ps[NONALIGNED];
21
+        p.chroma[X265_CSP_I444].cu[i].add_ps[ALIGNED] = p.cu[i].add_ps[ALIGNED];
22
         p.chroma[X265_CSP_I444].cu[i].copy_ps = p.cu[i].copy_ps;
23
         p.chroma[X265_CSP_I444].cu[i].copy_sp = p.cu[i].copy_sp;
24
         p.chroma[X265_CSP_I444].cu[i].copy_ss = p.cu[i].copy_ss;
25
x265_2.7.tar.gz/source/common/primitives.h -> x265_2.9.tar.gz/source/common/primitives.h Changed
117
 
1
@@ -62,6 +62,13 @@
2
     NUM_CU_SIZES
3
 };
4
 
5
+enum AlignPrimitive
6
+{
7
+    NONALIGNED,
8
+    ALIGNED,
9
+    NUM_ALIGNMENT_TYPES
10
+};
11
+
12
 enum { NUM_TR_SIZE = 4 }; // TU are 4x4, 8x8, 16x16, and 32x32
13
 
14
 
15
@@ -216,7 +223,10 @@
16
 
17
 typedef void (*integralv_t)(uint32_t *sum, intptr_t stride);
18
 typedef void (*integralh_t)(uint32_t *sum, pixel *pix, intptr_t stride);
19
-
20
+typedef void(*nonPsyRdoQuant_t)(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos);
21
+typedef void(*psyRdoQuant_t)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos);
22
+typedef void(*psyRdoQuant_t1)(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost,uint32_t blkPos);
23
+typedef void(*psyRdoQuant_t2)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos);
24
 /* Function pointers to optimized encoder primitives. Each pointer can reference
25
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
26
 struct EncoderPrimitives
27
@@ -242,12 +252,10 @@
28
         filter_sp_t    luma_vsp;
29
         filter_ss_t    luma_vss;
30
         filter_hv_pp_t luma_hvpp;   // combines hps + vsp
31
-
32
-        pixelavg_pp_t  pixelavg_pp; // quick bidir using pixels (borrowed from x264)
33
-        addAvg_t       addAvg;      // bidir motion compensation, uses 16bit values
34
-
35
+        pixelavg_pp_t  pixelavg_pp[NUM_ALIGNMENT_TYPES]; // quick bidir using pixels (borrowed from x264)
36
+        addAvg_t       addAvg[NUM_ALIGNMENT_TYPES];      // bidir motion compensation, uses 16bit values
37
         copy_pp_t      copy_pp;
38
-        filter_p2s_t   convert_p2s;
39
+        filter_p2s_t   convert_p2s[NUM_ALIGNMENT_TYPES];
40
     }
41
     pu[NUM_PU_SIZES];
42
 
43
@@ -265,17 +273,16 @@
44
         dct_t           standard_dct;   // original dct function, used by lowpass_dct
45
         dct_t           lowpass_dct;    // lowpass dct approximation
46
 
47
-        calcresidual_t  calcresidual;
48
+        calcresidual_t  calcresidual[NUM_ALIGNMENT_TYPES];
49
         pixel_sub_ps_t  sub_ps;
50
-        pixel_add_ps_t  add_ps;
51
-        blockfill_s_t   blockfill_s;   // block fill, for DC transforms
52
+        pixel_add_ps_t  add_ps[NUM_ALIGNMENT_TYPES];
53
+        blockfill_s_t   blockfill_s[NUM_ALIGNMENT_TYPES];   // block fill, for DC transforms
54
         copy_cnt_t      copy_cnt;      // copy coeff while counting non-zero
55
         count_nonzero_t count_nonzero;
56
         cpy2Dto1D_shl_t cpy2Dto1D_shl;
57
         cpy2Dto1D_shr_t cpy2Dto1D_shr;
58
-        cpy1Dto2D_shl_t cpy1Dto2D_shl;
59
+        cpy1Dto2D_shl_t cpy1Dto2D_shl[NUM_ALIGNMENT_TYPES];
60
         cpy1Dto2D_shr_t cpy1Dto2D_shr;
61
-
62
         copy_sp_t       copy_sp;
63
         copy_ps_t       copy_ps;
64
         copy_ss_t       copy_ss;
65
@@ -286,16 +293,18 @@
66
         pixel_sse_t     sse_pp;        // Sum of Square Error (pixel, pixel) fenc alignment not assumed
67
         pixel_sse_ss_t  sse_ss;        // Sum of Square Error (short, short) fenc alignment not assumed
68
         pixelcmp_t      psy_cost_pp;   // difference in AC energy between two pixel blocks
69
-        pixel_ssd_s_t   ssd_s;         // Sum of Square Error (residual coeff to self)
70
+        pixel_ssd_s_t   ssd_s[NUM_ALIGNMENT_TYPES];         // Sum of Square Error (residual coeff to self)
71
         pixelcmp_t      sa8d;          // Sum of Transformed Differences (8x8 Hadamard), uses satd for 4x4 intra TU
72
-
73
         transpose_t     transpose;     // transpose pixel block; for use with intra all-angs
74
         intra_allangs_t intra_pred_allangs;
75
         intra_filter_t  intra_filter;
76
         intra_pred_t    intra_pred[NUM_INTRA_MODE];
77
+        nonPsyRdoQuant_t nonPsyRdoQuant;
78
+        psyRdoQuant_t    psyRdoQuant;
79
+       psyRdoQuant_t1   psyRdoQuant_1p;
80
+       psyRdoQuant_t2   psyRdoQuant_2p;
81
     }
82
     cu[NUM_CU_SIZES];
83
-
84
     /* These remaining primitives work on either fixed block sizes or take
85
      * block dimensions as arguments and thus do not belong in either the PU or
86
      * the CU arrays */
87
@@ -307,7 +316,7 @@
88
     dequant_scaling_t     dequant_scaling;
89
     dequant_normal_t      dequant_normal;
90
     denoiseDct_t          denoiseDct;
91
-    scale1D_t             scale1D_128to64;
92
+    scale1D_t             scale1D_128to64[NUM_ALIGNMENT_TYPES];
93
     scale2D_t             scale2D_64to32;
94
 
95
     ssim_4x4x2_core_t     ssim_4x4x2_core;
96
@@ -384,9 +393,9 @@
97
             filter_ss_t  filter_vss;
98
             filter_pp_t  filter_hpp;
99
             filter_hps_t filter_hps;
100
-            addAvg_t     addAvg;
101
+            addAvg_t     addAvg[NUM_ALIGNMENT_TYPES];
102
             copy_pp_t    copy_pp;
103
-            filter_p2s_t p2s;
104
+            filter_p2s_t p2s[NUM_ALIGNMENT_TYPES];
105
 
106
         }
107
         pu[NUM_PU_SIZES];
108
@@ -397,7 +406,7 @@
109
             pixelcmp_t     sa8d;    // if chroma CU is not multiple of 8x8, will use satd
110
             pixel_sse_t    sse_pp;
111
             pixel_sub_ps_t sub_ps;
112
-            pixel_add_ps_t add_ps;
113
+            pixel_add_ps_t add_ps[NUM_ALIGNMENT_TYPES];
114
 
115
             copy_ps_t      copy_ps;
116
             copy_sp_t      copy_sp;
117
x265_2.7.tar.gz/source/common/quant.cpp -> x265_2.9.tar.gz/source/common/quant.cpp Changed
163
 
1
@@ -560,13 +560,11 @@
2
                             uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig)
3
 {
4
     const uint32_t sizeIdx = log2TrSize - 2;
5
-
6
     if (cu.m_tqBypass[0])
7
     {
8
-        primitives.cu[sizeIdx].cpy1Dto2D_shl(residual, coeff, resiStride, 0);
9
+        primitives.cu[sizeIdx].cpy1Dto2D_shl[resiStride % 64 == 0](residual, coeff, resiStride, 0);
10
         return;
11
     }
12
-
13
     // Values need to pass as input parameter in dequant
14
     int rem = m_qpParam[ttype].rem;
15
     int per = m_qpParam[ttype].per;
16
@@ -595,7 +593,7 @@
17
         if (transformShift > 0)
18
             primitives.cu[sizeIdx].cpy1Dto2D_shr(residual, m_resiDctCoeff, resiStride, transformShift);
19
         else
20
-            primitives.cu[sizeIdx].cpy1Dto2D_shl(residual, m_resiDctCoeff, resiStride, -transformShift);
21
+            primitives.cu[sizeIdx].cpy1Dto2D_shl[resiStride % 64 == 0](residual, m_resiDctCoeff, resiStride, -transformShift);
22
 #endif
23
     }
24
     else
25
@@ -611,7 +609,7 @@
26
             const int add_2nd = 1 << (shift_2nd - 1);
27
 
28
             int dc_val = (((m_resiDctCoeff[0] * (64 >> 6) + add_1st) >> shift_1st) * (64 >> 3) + add_2nd) >> shift_2nd;
29
-            primitives.cu[sizeIdx].blockfill_s(residual, resiStride, (int16_t)dc_val);
30
+            primitives.cu[sizeIdx].blockfill_s[resiStride % 64 == 0](residual, resiStride, (int16_t)dc_val);
31
             return;
32
         }
33
 
34
@@ -644,11 +642,9 @@
35
     X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
36
     if (!numSig)
37
         return 0;
38
-
39
     const uint32_t trSize = 1 << log2TrSize;
40
     int64_t lambda2 = m_qpParam[ttype].lambda2;
41
-    const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda);
42
-
43
+    int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda);
44
     /* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4)
45
      * scale applied that must be removed during unquant. Note that in real dequant there is clipping
46
      * at several stages. We skip the clipping for simplicity when measuring RD cost */
47
@@ -725,27 +721,15 @@
48
         for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
49
         {
50
             X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
51
-
52
             uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
53
             uint32_t blkPos      = codeParams.scan[scanPosBase];
54
-
55
-            // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
56
-            for (int y = 0; y < MLS_CG_SIZE; y++)
57
+            bool enable512 = detect512();
58
+            if (enable512)
59
+                primitives.cu[log2TrSize - 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos);
60
+            else
61
             {
62
-                for (int x = 0; x < MLS_CG_SIZE; x++)
63
-                {
64
-                    int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
65
-                    int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
66
-
67
-                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
68
-
69
-                    /* when no residual coefficient is coded, predicted coef == recon coef */
70
-                    costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
71
-
72
-                    totalUncodedCost += costUncoded[blkPos + x];
73
-                    totalRdCost += costUncoded[blkPos + x];
74
-                }
75
-                blkPos += trSize;
76
+                primitives.cu[log2TrSize - 2].psyRdoQuant_1p(m_resiDctCoeff,  costUncoded, &totalUncodedCost, &totalRdCost,blkPos);
77
+                primitives.cu[log2TrSize - 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos);
78
             }
79
         }
80
     }
81
@@ -755,25 +739,11 @@
82
         for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
83
         {
84
             X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
85
-
86
             uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
87
             uint32_t blkPos      = codeParams.scan[scanPosBase];
88
-
89
-            for (int y = 0; y < MLS_CG_SIZE; y++)
90
-            {
91
-                for (int x = 0; x < MLS_CG_SIZE; x++)
92
-                {
93
-                    int signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
94
-                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
95
-
96
-                    totalUncodedCost += costUncoded[blkPos + x];
97
-                    totalRdCost += costUncoded[blkPos + x];
98
-                }
99
-                blkPos += trSize;
100
-            }
101
+            primitives.cu[log2TrSize - 2].nonPsyRdoQuant(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos);
102
         }
103
     }
104
-
105
     static const uint8_t table_cnt[5][SCAN_SET_SIZE] =
106
     {
107
         // patternSigCtx = 0
108
@@ -833,25 +803,22 @@
109
             // TODO: does we need zero-coeff cost?
110
             const uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
111
             uint32_t blkPos = codeParams.scan[scanPosBase];
112
-
113
             if (usePsyMask)
114
             {
115
-                // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
116
+                bool enable512 = detect512();
117
+
118
+                if (enable512)
119
+                    primitives.cu[log2TrSize - 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos);
120
+                else
121
+                {
122
+                    primitives.cu[log2TrSize - 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos);
123
+                    primitives.cu[log2TrSize - 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos);
124
+                }
125
+                blkPos = codeParams.scan[scanPosBase];
126
                 for (int y = 0; y < MLS_CG_SIZE; y++)
127
                 {
128
                     for (int x = 0; x < MLS_CG_SIZE; x++)
129
                     {
130
-                        int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
131
-                        int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
132
-
133
-                        costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
134
-
135
-                        /* when no residual coefficient is coded, predicted coef == recon coef */
136
-                        costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
137
-
138
-                        totalUncodedCost += costUncoded[blkPos + x];
139
-                        totalRdCost += costUncoded[blkPos + x];
140
-
141
                         const uint32_t scanPosOffset =  y * MLS_CG_SIZE + x;
142
                         const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
143
                         X265_CHECK(trSize > 4, "trSize check failure\n");
144
@@ -867,16 +834,12 @@
145
             else
146
             {
147
                 // non-psy path
148
+                primitives.cu[log2TrSize - 2].nonPsyRdoQuant(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos);
149
+                blkPos = codeParams.scan[scanPosBase];
150
                 for (int y = 0; y < MLS_CG_SIZE; y++)
151
                 {
152
                     for (int x = 0; x < MLS_CG_SIZE; x++)
153
                     {
154
-                        int signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
155
-                        costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
156
-
157
-                        totalUncodedCost += costUncoded[blkPos + x];
158
-                        totalRdCost += costUncoded[blkPos + x];
159
-
160
                         const uint32_t scanPosOffset =  y * MLS_CG_SIZE + x;
161
                         const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset;
162
                         X265_CHECK(trSize > 4, "trSize check failure\n");
163
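
Note: the quant.cpp hunks above replace the inline 4x4 coefficient-group loops with the new nonPsyRdoQuant/psyRdoQuant primitives (the single AVX-512 kernel when detect512() reports support, otherwise the split _1p/_2p kernels). For reference, a scalar sketch of the non-psy cost accumulation those primitives encapsulate, reconstructed from the removed loop; the real primitive is instantiated per transform size, so trSize and scaleBits are passed explicitly only in this sketch:

    #include <cstdint>
    #include <cstdio>

    enum { MLS_CG_SIZE = 4 };   // 4x4 coefficient group

    static void nonPsyRdoQuant_c(const int16_t* resiDctCoeff, int64_t* costUncoded,
                                 int64_t* totalUncodedCost, int64_t* totalRdCost,
                                 uint32_t blkPos, uint32_t trSize, int scaleBits)
    {
        for (int y = 0; y < MLS_CG_SIZE; y++)
        {
            for (int x = 0; x < MLS_CG_SIZE; x++)
            {
                int signCoef = resiDctCoeff[blkPos + x];   /* pre-quantization DCT coeff */
                costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
                *totalUncodedCost += costUncoded[blkPos + x];
                *totalRdCost      += costUncoded[blkPos + x];
            }
            blkPos += trSize;
        }
    }

    int main()
    {
        const uint32_t trSize = 8;
        int16_t coeffs[trSize * trSize] = { 3, -2, 1 };    // toy residual DCT coeffs
        int64_t costUncoded[trSize * trSize] = { 0 };
        int64_t totalUncoded = 0, totalRd = 0;

        nonPsyRdoQuant_c(coeffs, costUncoded, &totalUncoded, &totalRd,
                         /*blkPos*/ 0, trSize, /*scaleBits*/ 6);
        std::printf("totalUncodedCost=%lld\n", (long long)totalUncoded);  // (9+4+1) << 6
        return 0;
    }
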
x265_2.7.tar.gz/source/common/slice.cpp -> x265_2.9.tar.gz/source/common/slice.cpp Changed
10
 
1
@@ -138,7 +138,7 @@
2
             for (int yuv = 0; yuv < 3; yuv++)
3
             {
4
                 WeightParam& wp = m_weightPredTable[l][i][yuv];
5
-                wp.bPresentFlag = false;
6
+                wp.wtPresent = 0;
7
                 wp.log2WeightDenom = 0;
8
                 wp.inputWeight = 1;
9
                 wp.inputOffset = 0;
10
x265_2.7.tar.gz/source/common/slice.h -> x265_2.9.tar.gz/source/common/slice.h Changed
37
 
1
@@ -298,7 +298,7 @@
2
     uint32_t log2WeightDenom;
3
     int      inputWeight;
4
     int      inputOffset;
5
-    bool     bPresentFlag;
6
+    int      wtPresent;
7
 
8
     /* makes a non-h265 weight (i.e. fix7), into an h265 weight */
9
     void setFromWeightAndOffset(int w, int o, int denom, bool bNormalize)
10
@@ -321,7 +321,7 @@
11
         (w).inputWeight = (s); \
12
         (w).log2WeightDenom = (d); \
13
         (w).inputOffset = (o); \
14
-        (w).bPresentFlag = (b); \
15
+        (w).wtPresent = (b); \
16
     }
17
 
18
 class Slice
19
@@ -385,14 +385,14 @@
20
     bool getRapPicFlag() const
21
     {
22
         return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL
23
+            || m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP
24
             || m_nalUnitType == NAL_UNIT_CODED_SLICE_CRA;
25
     }
26
-
27
     bool getIdrPicFlag() const
28
     {
29
-        return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL;
30
+        return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL
31
+            || m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP;
32
     }
33
-
34
     bool isIRAP() const   { return m_nalUnitType >= 16 && m_nalUnitType <= 23; }
35
 
36
     bool isIntra()  const { return m_sliceType == I_SLICE; }
37
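
Note: with the slice.h hunk above, both IDR NAL flavours (IDR_W_RADL and IDR_N_LP) now satisfy getIdrPicFlag() and getRapPicFlag(). A minimal sketch of the predicates with the HEVC NAL type values written out (values per the HEVC spec; only the ones used here are listed):

    #include <cstdio>

    enum NalUnitType
    {
        NAL_UNIT_CODED_SLICE_IDR_W_RADL = 19,
        NAL_UNIT_CODED_SLICE_IDR_N_LP   = 20,
        NAL_UNIT_CODED_SLICE_CRA        = 21,
    };

    // Mirrors the updated predicates: both IDR flavours count as IDR and RAP.
    static bool isIdr(int naluType)
    {
        return naluType == NAL_UNIT_CODED_SLICE_IDR_W_RADL
            || naluType == NAL_UNIT_CODED_SLICE_IDR_N_LP;
    }

    static bool isIrap(int naluType) { return naluType >= 16 && naluType <= 23; }

    int main()
    {
        std::printf("IDR_N_LP: idr=%d irap=%d\n", isIdr(20), isIrap(20));
        std::printf("CRA:      idr=%d irap=%d\n", isIdr(21), isIrap(21));
        return 0;
    }
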
x265_2.7.tar.gz/source/common/x86/asm-primitives.cpp -> x265_2.9.tar.gz/source/common/x86/asm-primitives.cpp Changed
201
 
1
@@ -404,36 +404,58 @@
2
     p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = PFX(pixel_sa8d_8x16_ ## cpu); \
3
     p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = PFX(pixel_sa8d_16x32_ ## cpu); \
4
     p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = PFX(pixel_sa8d_32x64_ ## cpu)
5
-
6
 #define PIXEL_AVG(cpu) \
7
-    p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_ ## cpu); \
8
-    p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_ ## cpu); \
9
-    p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_ ## cpu); \
10
-    p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_ ## cpu); \
11
-    p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_ ## cpu); \
12
-    p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_ ## cpu); \
13
-    p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_ ## cpu); \
14
-    p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_ ## cpu); \
15
-    p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_ ## cpu); \
16
-    p.pu[LUMA_32x8].pixelavg_pp  = PFX(pixel_avg_32x8_ ## cpu); \
17
-    p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_24x32_ ## cpu); \
18
-    p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_ ## cpu); \
19
-    p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_ ## cpu); \
20
-    p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_ ## cpu); \
21
-    p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_ ## cpu); \
22
-    p.pu[LUMA_16x8].pixelavg_pp  = PFX(pixel_avg_16x8_ ## cpu); \
23
-    p.pu[LUMA_16x4].pixelavg_pp  = PFX(pixel_avg_16x4_ ## cpu); \
24
-    p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_ ## cpu); \
25
-    p.pu[LUMA_8x32].pixelavg_pp  = PFX(pixel_avg_8x32_ ## cpu); \
26
-    p.pu[LUMA_8x16].pixelavg_pp  = PFX(pixel_avg_8x16_ ## cpu); \
27
-    p.pu[LUMA_8x8].pixelavg_pp   = PFX(pixel_avg_8x8_ ## cpu); \
28
-    p.pu[LUMA_8x4].pixelavg_pp   = PFX(pixel_avg_8x4_ ## cpu);
29
-
30
+    p.pu[LUMA_64x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x64_ ## cpu); \
31
+    p.pu[LUMA_64x48].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x48_ ## cpu); \
32
+    p.pu[LUMA_64x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x32_ ## cpu); \
33
+    p.pu[LUMA_64x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x16_ ## cpu); \
34
+    p.pu[LUMA_48x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_48x64_ ## cpu); \
35
+    p.pu[LUMA_32x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x64_ ## cpu); \
36
+    p.pu[LUMA_32x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x32_ ## cpu); \
37
+    p.pu[LUMA_32x24].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x24_ ## cpu); \
38
+    p.pu[LUMA_32x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x16_ ## cpu); \
39
+    p.pu[LUMA_32x8].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_32x8_ ## cpu); \
40
+    p.pu[LUMA_24x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_24x32_ ## cpu); \
41
+    p.pu[LUMA_16x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x64_ ## cpu); \
42
+    p.pu[LUMA_16x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x32_ ## cpu); \
43
+    p.pu[LUMA_16x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x16_ ## cpu); \
44
+    p.pu[LUMA_16x12].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x12_ ## cpu); \
45
+    p.pu[LUMA_16x8].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_16x8_ ## cpu); \
46
+    p.pu[LUMA_16x4].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_16x4_ ## cpu); \
47
+    p.pu[LUMA_12x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_12x16_ ## cpu); \
48
+    p.pu[LUMA_8x32].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_8x32_ ## cpu); \
49
+    p.pu[LUMA_8x16].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_8x16_ ## cpu); \
50
+    p.pu[LUMA_8x8].pixelavg_pp[NONALIGNED]   = PFX(pixel_avg_8x8_ ## cpu); \
51
+    p.pu[LUMA_8x4].pixelavg_pp[NONALIGNED]   = PFX(pixel_avg_8x4_ ## cpu); \
52
+    p.pu[LUMA_64x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x64_ ## cpu); \
53
+    p.pu[LUMA_64x48].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x48_ ## cpu); \
54
+    p.pu[LUMA_64x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x32_ ## cpu); \
55
+    p.pu[LUMA_64x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x16_ ## cpu); \
56
+    p.pu[LUMA_48x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_48x64_ ## cpu); \
57
+    p.pu[LUMA_32x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x64_ ## cpu); \
58
+    p.pu[LUMA_32x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x32_ ## cpu); \
59
+    p.pu[LUMA_32x24].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x24_ ## cpu); \
60
+    p.pu[LUMA_32x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x16_ ## cpu); \
61
+    p.pu[LUMA_32x8].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_32x8_ ## cpu); \
62
+    p.pu[LUMA_24x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_24x32_ ## cpu); \
63
+    p.pu[LUMA_16x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x64_ ## cpu); \
64
+    p.pu[LUMA_16x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x32_ ## cpu); \
65
+    p.pu[LUMA_16x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x16_ ## cpu); \
66
+    p.pu[LUMA_16x12].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x12_ ## cpu); \
67
+    p.pu[LUMA_16x8].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_16x8_ ## cpu); \
68
+    p.pu[LUMA_16x4].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_16x4_ ## cpu); \
69
+    p.pu[LUMA_12x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_12x16_ ## cpu); \
70
+    p.pu[LUMA_8x32].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_8x32_ ## cpu); \
71
+    p.pu[LUMA_8x16].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_8x16_ ## cpu); \
72
+    p.pu[LUMA_8x8].pixelavg_pp[ALIGNED]   = PFX(pixel_avg_8x8_ ## cpu); \
73
+    p.pu[LUMA_8x4].pixelavg_pp[ALIGNED]   = PFX(pixel_avg_8x4_ ## cpu);
74
 #define PIXEL_AVG_W4(cpu) \
75
-    p.pu[LUMA_4x4].pixelavg_pp  = PFX(pixel_avg_4x4_ ## cpu); \
76
-    p.pu[LUMA_4x8].pixelavg_pp  = PFX(pixel_avg_4x8_ ## cpu); \
77
-    p.pu[LUMA_4x16].pixelavg_pp = PFX(pixel_avg_4x16_ ## cpu);
78
-
79
+    p.pu[LUMA_4x4].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_4x4_ ## cpu); \
80
+    p.pu[LUMA_4x8].pixelavg_pp[NONALIGNED]  = PFX(pixel_avg_4x8_ ## cpu); \
81
+    p.pu[LUMA_4x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_4x16_ ## cpu); \
82
+    p.pu[LUMA_4x4].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_4x4_ ## cpu); \
83
+    p.pu[LUMA_4x8].pixelavg_pp[ALIGNED]  = PFX(pixel_avg_4x8_ ## cpu); \
84
+    p.pu[LUMA_4x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_4x16_ ## cpu);
85
 #define CHROMA_420_FILTERS(cpu) \
86
     ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
87
     ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, cpu); \
88
@@ -633,23 +655,32 @@
89
 
90
 #define LUMA_PIXELSUB(cpu) \
91
     p.cu[BLOCK_4x4].sub_ps = PFX(pixel_sub_ps_4x4_ ## cpu); \
92
-    p.cu[BLOCK_4x4].add_ps = PFX(pixel_add_ps_4x4_ ## cpu); \
93
+    p.cu[BLOCK_4x4].add_ps[NONALIGNED] = PFX(pixel_add_ps_4x4_ ## cpu); \
94
+    p.cu[BLOCK_4x4].add_ps[ALIGNED] = PFX(pixel_add_ps_4x4_ ## cpu); \
95
     ALL_LUMA_CU(sub_ps, pixel_sub_ps, cpu); \
96
-    ALL_LUMA_CU(add_ps, pixel_add_ps, cpu);
97
+    ALL_LUMA_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \
98
+    ALL_LUMA_CU(add_ps[ALIGNED], pixel_add_ps, cpu);
99
 
100
 #define CHROMA_420_PIXELSUB_PS(cpu) \
101
     ALL_CHROMA_420_CU(sub_ps, pixel_sub_ps, cpu); \
102
-    ALL_CHROMA_420_CU(add_ps, pixel_add_ps, cpu);
103
+    ALL_CHROMA_420_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \
104
+    ALL_CHROMA_420_CU(add_ps[ALIGNED], pixel_add_ps, cpu);
105
 
106
 #define CHROMA_422_PIXELSUB_PS(cpu) \
107
     ALL_CHROMA_422_CU(sub_ps, pixel_sub_ps, cpu); \
108
-    ALL_CHROMA_422_CU(add_ps, pixel_add_ps, cpu);
109
+    ALL_CHROMA_422_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \
110
+    ALL_CHROMA_422_CU(add_ps[ALIGNED], pixel_add_ps, cpu);
111
 
112
 #define LUMA_VAR(cpu)          ALL_LUMA_CU(var, pixel_var, cpu)
113
 
114
-#define LUMA_ADDAVG(cpu)       ALL_LUMA_PU(addAvg, addAvg, cpu); p.pu[LUMA_4x4].addAvg = PFX(addAvg_4x4_ ## cpu)
115
-#define CHROMA_420_ADDAVG(cpu) ALL_CHROMA_420_PU(addAvg, addAvg, cpu);
116
-#define CHROMA_422_ADDAVG(cpu) ALL_CHROMA_422_PU(addAvg, addAvg, cpu);
117
+#define LUMA_ADDAVG(cpu)       ALL_LUMA_PU(addAvg[NONALIGNED], addAvg, cpu); \
118
+                               p.pu[LUMA_4x4].addAvg[NONALIGNED] = PFX(addAvg_4x4_ ## cpu); \
119
+                               ALL_LUMA_PU(addAvg[ALIGNED], addAvg, cpu); \
120
+                               p.pu[LUMA_4x4].addAvg[ALIGNED] = PFX(addAvg_4x4_ ## cpu)
121
+#define CHROMA_420_ADDAVG(cpu) ALL_CHROMA_420_PU(addAvg[NONALIGNED], addAvg, cpu); \
122
+                               ALL_CHROMA_420_PU(addAvg[ALIGNED], addAvg, cpu)
123
+#define CHROMA_422_ADDAVG(cpu) ALL_CHROMA_422_PU(addAvg[NONALIGNED], addAvg, cpu); \
124
+                               ALL_CHROMA_422_PU(addAvg[ALIGNED], addAvg, cpu)
125
 
126
 #define SETUP_INTRA_ANG_COMMON(mode, fno, cpu) \
127
     p.cu[BLOCK_4x4].intra_pred[mode] = PFX(intra_pred_ang4_ ## fno ## _ ## cpu); \
128
@@ -855,6 +886,10 @@
129
     ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
130
     ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, cpu);
131
 
132
+#define ASSIGN2(func, fname) \
133
+    func[ALIGNED] = PFX(fname); \
134
+    func[NONALIGNED] = PFX(fname)
135
+
136
 namespace X265_NS {
137
 // private x265 namespace
138
 
139
@@ -873,10 +908,6 @@
140
 
141
 void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // Main10
142
 {
143
-#if !defined(X86_64)
144
-#error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF"
145
-#endif
146
-
147
 #if X86_64
148
     p.scanPosLast = PFX(scanPosLast_x64);
149
 #endif
150
@@ -937,35 +968,69 @@
151
         CHROMA_422_VERT_FILTERS(_sse2);
152
         CHROMA_444_VERT_FILTERS(sse2);
153
 
154
+#if X86_64
155
         ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2);
156
         p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse2);
157
         ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2);
158
         p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_sse2);
159
         ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, sse2);
160
         ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sse2);
161
+#endif
162
 
163
         p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_sse2);
164
         p.ssim_end_4 = PFX(pixel_ssim_end4_sse2);
165
-        PIXEL_AVG(sse2);
166
+        ASSIGN2(p.pu[LUMA_64x64].pixelavg_pp, pixel_avg_64x64_sse2);
167
+        ASSIGN2(p.pu[LUMA_64x48].pixelavg_pp, pixel_avg_64x48_sse2);
168
+        ASSIGN2(p.pu[LUMA_64x32].pixelavg_pp, pixel_avg_64x32_sse2);
169
+        ASSIGN2(p.pu[LUMA_64x16].pixelavg_pp, pixel_avg_64x16_sse2);
170
+        ASSIGN2(p.pu[LUMA_48x64].pixelavg_pp, pixel_avg_48x64_sse2);
171
+        ASSIGN2(p.pu[LUMA_32x64].pixelavg_pp, pixel_avg_32x64_sse2);
172
+        ASSIGN2(p.pu[LUMA_32x32].pixelavg_pp, pixel_avg_32x32_sse2);
173
+        ASSIGN2(p.pu[LUMA_32x24].pixelavg_pp, pixel_avg_32x24_sse2);
174
+        ASSIGN2(p.pu[LUMA_32x16].pixelavg_pp, pixel_avg_32x16_sse2);
175
+        ASSIGN2(p.pu[LUMA_32x8].pixelavg_pp, pixel_avg_32x8_sse2);
176
+        ASSIGN2(p.pu[LUMA_24x32].pixelavg_pp, pixel_avg_24x32_sse2);
177
+        ASSIGN2(p.pu[LUMA_16x64].pixelavg_pp, pixel_avg_16x64_sse2);
178
+        ASSIGN2(p.pu[LUMA_16x32].pixelavg_pp, pixel_avg_16x32_sse2);
179
+        ASSIGN2(p.pu[LUMA_16x16].pixelavg_pp, pixel_avg_16x16_sse2);
180
+        ASSIGN2(p.pu[LUMA_16x12].pixelavg_pp, pixel_avg_16x12_sse2);
181
+        ASSIGN2(p.pu[LUMA_16x8].pixelavg_pp, pixel_avg_16x8_sse2);
182
+        ASSIGN2(p.pu[LUMA_16x4].pixelavg_pp, pixel_avg_16x4_sse2);
183
+        ASSIGN2(p.pu[LUMA_12x16].pixelavg_pp, pixel_avg_12x16_sse2);
184
+#if X86_64
185
+        ASSIGN2(p.pu[LUMA_8x32].pixelavg_pp, pixel_avg_8x32_sse2);
186
+        ASSIGN2(p.pu[LUMA_8x16].pixelavg_pp, pixel_avg_8x16_sse2);
187
+        ASSIGN2(p.pu[LUMA_8x8].pixelavg_pp, pixel_avg_8x8_sse2);
188
+        ASSIGN2(p.pu[LUMA_8x4].pixelavg_pp, pixel_avg_8x4_sse2);
189
+#endif
190
         PIXEL_AVG_W4(mmx2);
191
         LUMA_VAR(sse2);
192
 
193
 
194
-        ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
195
+        ALL_LUMA_TU(blockfill_s[ALIGNED], blockfill_s, sse2);
196
+        ALL_LUMA_TU(blockfill_s[NONALIGNED], blockfill_s, sse2);
197
         ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
198
-        ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, sse2);
199
+        ALL_LUMA_TU_S(cpy1Dto2D_shl[ALIGNED], cpy1Dto2D_shl_, sse2);
200
+        ALL_LUMA_TU_S(cpy1Dto2D_shl[NONALIGNED], cpy1Dto2D_shl_, sse2);
201
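
Note: asm-primitives.cpp now routes many assignments through the small ASSIGN2 helper defined earlier in this diff, so one statement fills both alignment slots of a primitive. A condensed, compilable sketch of how that macro expands (PFX normally applies x265's symbol prefix to an assembly kernel; here both are stand-ins):

    #include <cstdio>

    enum AlignPrimitive { NONALIGNED, ALIGNED, NUM_ALIGNMENT_TYPES };
    typedef void (*prim_fn)();

    // Stand-ins for the PFX() name-mangling macro and one asm kernel.
    #define PFX(name) x265_ ## name
    static void PFX(pixel_avg_64x64_sse2)() { std::printf("pixel_avg_64x64_sse2\n"); }

    // Same shape as the ASSIGN2 macro in the hunk above: point the ALIGNED and
    // NONALIGNED slots at the same kernel.
    #define ASSIGN2(func, fname) \
        func[ALIGNED] = PFX(fname); \
        func[NONALIGNED] = PFX(fname)

    int main()
    {
        prim_fn pixelavg_pp[NUM_ALIGNMENT_TYPES] = { nullptr, nullptr };
        ASSIGN2(pixelavg_pp, pixel_avg_64x64_sse2);
        pixelavg_pp[NONALIGNED]();
        pixelavg_pp[ALIGNED]();
        return 0;
    }
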
x265_2.7.tar.gz/source/common/x86/blockcopy8.asm -> x265_2.9.tar.gz/source/common/x86/blockcopy8.asm Changed
201
 
1
@@ -26,7 +26,10 @@
2
 %include "x86inc.asm"
3
 %include "x86util.asm"
4
 
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
+
8
+ALIGN 64
9
+const shuf1_avx512,  dq 0, 2, 4, 6, 1, 3, 5, 7
10
 
11
 cextern pb_4
12
 cextern pb_1
13
@@ -1103,6 +1106,82 @@
14
 BLOCKCOPY_PP_W64_H4_avx 64, 48
15
 BLOCKCOPY_PP_W64_H4_avx 64, 64
16
 
17
+;----------------------------------------------------------------------------------------------
18
+; blockcopy_pp avx512 code start
19
+;----------------------------------------------------------------------------------------------
20
+%macro PROCESS_BLOCKCOPY_PP_64X4_avx512 0
21
+movu    m0, [r2]
22
+movu    m1, [r2 + r3]
23
+movu    m2, [r2 + 2 * r3]
24
+movu    m3, [r2 + r4]
25
+
26
+movu    [r0] , m0
27
+movu    [r0 + r1] , m1
28
+movu    [r0 + 2 * r1]  , m2
29
+movu    [r0 + r5] , m3
30
+%endmacro
31
+
32
+%macro PROCESS_BLOCKCOPY_PP_32X4_avx512 0
33
+movu           ym0, [r2]
34
+vinserti32x8   m0,  [r2 + r3],     1
35
+movu           ym1, [r2 + 2 * r3]
36
+vinserti32x8   m1,  [r2 + r4],     1
37
+
38
+movu           [r0] ,              ym0
39
+vextracti32x8  [r0 + r1] ,         m0,    1
40
+movu           [r0 + 2 * r1]  ,    ym1
41
+vextracti32x8  [r0 + r5] ,         m1,    1
42
+%endmacro
43
+
44
+;----------------------------------------------------------------------------------------------
45
+; void blockcopy_pp_64x%1(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
46
+;----------------------------------------------------------------------------------------------
47
+%macro BLOCKCOPY_PP_W64_H4_avx512 1
48
+INIT_ZMM avx512
49
+cglobal blockcopy_pp_64x%1, 4, 6, 4
50
+lea    r4,  [3 * r3]
51
+lea    r5,  [3 * r1]
52
+
53
+%rep %1/4 - 1
54
+PROCESS_BLOCKCOPY_PP_64X4_avx512
55
+lea     r2, [r2 + 4 * r3]
56
+lea     r0, [r0 + 4 * r1] 
57
+%endrep
58
+
59
+PROCESS_BLOCKCOPY_PP_64X4_avx512
60
+RET
61
+%endmacro
62
+
63
+BLOCKCOPY_PP_W64_H4_avx512 16
64
+BLOCKCOPY_PP_W64_H4_avx512 32
65
+BLOCKCOPY_PP_W64_H4_avx512 48
66
+BLOCKCOPY_PP_W64_H4_avx512 64
67
+
68
+%macro BLOCKCOPY_PP_W32_H4_avx512 1
69
+INIT_ZMM avx512
70
+cglobal blockcopy_pp_32x%1, 4, 6, 2
71
+    lea    r4,  [3 * r3]
72
+    lea    r5,  [3 * r1]
73
+
74
+%rep %1/4 - 1
75
+    PROCESS_BLOCKCOPY_PP_32X4_avx512
76
+    lea     r2, [r2 + 4 * r3]
77
+    lea     r0, [r0 + 4 * r1] 
78
+%endrep
79
+    PROCESS_BLOCKCOPY_PP_32X4_avx512
80
+    RET
81
+%endmacro
82
+
83
+BLOCKCOPY_PP_W32_H4_avx512 8
84
+BLOCKCOPY_PP_W32_H4_avx512 16
85
+BLOCKCOPY_PP_W32_H4_avx512 24
86
+BLOCKCOPY_PP_W32_H4_avx512 32
87
+BLOCKCOPY_PP_W32_H4_avx512 48
88
+BLOCKCOPY_PP_W32_H4_avx512 64
89
+;----------------------------------------------------------------------------------------------
90
+; blockcopy_pp avx512 code end
91
+;----------------------------------------------------------------------------------------------
92
+
93
 ;-----------------------------------------------------------------------------
94
 ; void blockcopy_sp_2x4(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)
95
 ;-----------------------------------------------------------------------------
96
@@ -2121,6 +2200,86 @@
97
 
98
 BLOCKCOPY_SP_W64_H4_avx2 64, 64
99
 
100
+%macro PROCESS_BLOCKCOPY_SP_64x4_AVX512 0
101
+    movu               m0,             [r2]
102
+    movu               m1,             [r2 + 64]
103
+    movu               m2,             [r2 + r3]
104
+    movu               m3,             [r2 + r3 + 64]
105
+
106
+    packuswb           m0,             m1
107
+    packuswb           m2,             m3
108
+    vpermq             m0,             m4,         m0
109
+    vpermq             m2,             m4,         m2
110
+    movu               [r0],           m0
111
+    movu               [r0 + r1],      m2
112
+
113
+    movu               m0,             [r2 + 2 * r3]
114
+    movu               m1,             [r2 + 2 * r3 + 64]
115
+    movu               m2,             [r2 + r4]
116
+    movu               m3,             [r2 + r4 + 64]
117
+
118
+    packuswb           m0,             m1
119
+    packuswb           m2,             m3
120
+    vpermq             m0,             m4,         m0
121
+    vpermq             m2,             m4,         m2
122
+    movu               [r0 + 2 * r1],  m0
123
+    movu               [r0 + r5],      m2
124
+%endmacro
125
+
126
+%macro PROCESS_BLOCKCOPY_SP_32x4_AVX512 0
127
+    movu               m0,             [r2]
128
+    movu               m1,             [r2 + r3]
129
+    movu               m2,             [r2 + 2 * r3]
130
+    movu               m3,             [r2 + r4]
131
+
132
+    packuswb           m0,             m1
133
+    packuswb           m2,             m3
134
+    vpermq             m0,             m4,         m0
135
+    vpermq             m2,             m4,         m2
136
+    movu               [r0],           ym0
137
+    vextracti32x8      [r0 + r1],      m0,         1
138
+    movu               [r0 + 2 * r1],  ym2
139
+    vextracti32x8      [r0 + r5],      m2,         1
140
+%endmacro
141
+
142
+;-----------------------------------------------------------------------------
143
+; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride)
144
+;-----------------------------------------------------------------------------
145
+INIT_ZMM avx512
146
+cglobal blockcopy_sp_64x64, 4, 6, 5
147
+    mova   m4, [shuf1_avx512]
148
+    add    r3,  r3
149
+    lea    r4,  [3 * r3]
150
+    lea    r5,  [3 * r1]
151
+
152
+%rep 15
153
+    PROCESS_BLOCKCOPY_SP_64x4_AVX512
154
+    lea    r0, [r0 + 4 * r1]
155
+    lea    r2, [r2 + 4 * r3]
156
+%endrep
157
+    PROCESS_BLOCKCOPY_SP_64x4_AVX512
158
+    RET
159
+
160
+%macro BLOCKCOPY_SP_32xN_AVX512 1
161
+INIT_ZMM avx512
162
+cglobal blockcopy_sp_32x%1, 4, 6, 5
163
+    mova   m4, [shuf1_avx512]
164
+    add    r3,  r3
165
+    lea    r4,  [3 * r3]
166
+    lea    r5,  [3 * r1]
167
+
168
+%rep %1/4 - 1
169
+    PROCESS_BLOCKCOPY_SP_32x4_AVX512
170
+    lea    r0, [r0 + 4 * r1]
171
+    lea    r2, [r2 + 4 * r3]
172
+%endrep
173
+    PROCESS_BLOCKCOPY_SP_32x4_AVX512
174
+    RET
175
+%endmacro
176
+
177
+BLOCKCOPY_SP_32xN_AVX512 32
178
+BLOCKCOPY_SP_32xN_AVX512 64
179
+
180
 ;-----------------------------------------------------------------------------
181
 ; void blockfill_s_4x4(int16_t* dst, intptr_t dstride, int16_t val)
182
 ;-----------------------------------------------------------------------------
183
@@ -2396,6 +2555,43 @@
184
 movu       [r0 + r3 + 32], m0
185
 RET
186
 
187
+;--------------------------------------------------------------------
188
+; void blockfill_s_32x32(int16_t* dst, intptr_t dstride, int16_t val)
189
+;--------------------------------------------------------------------
190
+INIT_ZMM avx512
191
+cglobal blockfill_s_32x32, 3, 4, 1
192
+add          r1, r1
193
+lea          r3, [3 * r1]
194
+movd         xm0, r2d
195
+vpbroadcastw m0, xm0
196
+
197
+%rep 8
198
+movu       [r0], m0
199
+movu       [r0 + r1], m0
200
+movu       [r0 + 2 * r1], m0
201
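
Note: the new AVX-512 blockcopy_pp kernels above copy 32- and 64-pixel-wide blocks four rows per iteration using 512-bit loads and stores; functionally they are plain strided 2-D copies, per the signature quoted in the asm comments. A scalar C++ reference of that behaviour (pixel is assumed 8-bit here; the real kernels are generated per fixed block size rather than taking width and height as arguments):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    typedef uint8_t pixel;

    // void blockcopy_pp_WxH(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
    static void blockcopy_pp_c(pixel* dst, intptr_t dstStride,
                               const pixel* src, intptr_t srcStride,
                               int width, int height)
    {
        for (int y = 0; y < height; y++)
        {
            std::memcpy(dst, src, width * sizeof(pixel));   // one row per iteration
            dst += dstStride;
            src += srcStride;
        }
    }

    int main()
    {
        const int w = 64, h = 64;
        std::vector<pixel> src(128 * h, 7), dst(96 * h, 0);  // padded strides
        blockcopy_pp_c(dst.data(), 96, src.data(), 128, w, h);
        std::printf("%d %d\n", dst[0], dst[96 * (h - 1) + w - 1]);  // 7 7
        return 0;
    }
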
x265_2.7.tar.gz/source/common/x86/blockcopy8.h -> x265_2.9.tar.gz/source/common/x86/blockcopy8.h Changed
51
 
1
@@ -28,37 +28,48 @@
2
 FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
3
 FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
4
 FUNCDEF_TU_S(void, cpy2Dto1D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
5
+FUNCDEF_TU_S(void, cpy2Dto1D_shl, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
6
 
7
 FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
8
 FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
9
 FUNCDEF_TU_S(void, cpy2Dto1D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
10
+FUNCDEF_TU_S(void, cpy2Dto1D_shr, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
11
 
12
 FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
13
 FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
14
 FUNCDEF_TU_S(void, cpy1Dto2D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
15
-
16
+FUNCDEF_TU_S(void, cpy1Dto2D_shl, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
17
+FUNCDEF_TU_S(void, cpy1Dto2D_shl_aligned, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
18
 FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
19
 FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
20
 FUNCDEF_TU_S(void, cpy1Dto2D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
21
+FUNCDEF_TU_S(void, cpy1Dto2D_shr, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
22
 
23
 FUNCDEF_TU_S(uint32_t, copy_cnt, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride);
24
 FUNCDEF_TU_S(uint32_t, copy_cnt, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride);
25
 FUNCDEF_TU_S(uint32_t, copy_cnt, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride);
26
+FUNCDEF_TU_S(uint32_t, copy_cnt, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride);
27
 
28
 FUNCDEF_TU(void, blockfill_s, sse2, int16_t* dst, intptr_t dstride, int16_t val);
29
 FUNCDEF_TU(void, blockfill_s, avx2, int16_t* dst, intptr_t dstride, int16_t val);
30
+FUNCDEF_TU(void, blockfill_s, avx512, int16_t* dst, intptr_t dstride, int16_t val);
31
+FUNCDEF_TU(void, blockfill_s_aligned, avx512, int16_t* dst, intptr_t dstride, int16_t val);
32
 
33
 FUNCDEF_CHROMA_PU(void, blockcopy_ss, sse2, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
34
 FUNCDEF_CHROMA_PU(void, blockcopy_ss, avx, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
35
+FUNCDEF_CHROMA_PU(void, blockcopy_ss, avx512, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
36
 
37
 FUNCDEF_CHROMA_PU(void, blockcopy_pp, sse2, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
38
 FUNCDEF_CHROMA_PU(void, blockcopy_pp, avx, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
39
+FUNCDEF_CHROMA_PU(void, blockcopy_pp, avx512, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
40
 
41
 FUNCDEF_PU(void, blockcopy_sp, sse2, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
42
 FUNCDEF_PU(void, blockcopy_sp, sse4, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
43
 FUNCDEF_PU(void, blockcopy_sp, avx2, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
44
+FUNCDEF_PU(void, blockcopy_sp, avx512, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
45
 FUNCDEF_PU(void, blockcopy_ps, sse2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
46
 FUNCDEF_PU(void, blockcopy_ps, sse4, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
47
 FUNCDEF_PU(void, blockcopy_ps, avx2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
48
+FUNCDEF_PU(void, blockcopy_ps, avx512, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride);
49
 
50
 #endif // ifndef X265_I386_PIXEL_H
51
x265_2.7.tar.gz/source/common/x86/const-a.asm -> x265_2.9.tar.gz/source/common/x86/const-a.asm Changed
10
 
1
@@ -28,7 +28,7 @@
2
 
3
 %include "x86inc.asm"
4
 
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
 
8
 ;; 8-bit constants
9
 
10
x265_2.7.tar.gz/source/common/x86/cpu-a.asm -> x265_2.9.tar.gz/source/common/x86/cpu-a.asm Changed
46
 
1
@@ -54,18 +54,16 @@
2
     RET
3
 
4
 ;-----------------------------------------------------------------------------
5
-; void cpu_xgetbv( int op, int *eax, int *edx )
6
+; uint64_t cpu_xgetbv( int xcr )
7
 ;-----------------------------------------------------------------------------
8
-cglobal cpu_xgetbv, 3,7
9
-    push  r2
10
-    push  r1
11
-    mov  ecx, r0d
12
+cglobal cpu_xgetbv
13
+    movifnidn ecx, r0m
14
     xgetbv
15
-    pop   r4
16
-    mov [r4], eax
17
-    pop   r4
18
-    mov [r4], edx
19
-    RET
20
+%if ARCH_X86_64
21
+    shl       rdx, 32
22
+    or        rax, rdx
23
+%endif
24
+    ret
25
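
The rewritten routine changes the calling convention from filling two out-pointers to returning the whole edx:eax pair as a single 64-bit value, which is what the shl/or pair does on x86-64. A hedged C equivalent using GCC/Clang inline assembly (function name is illustrative; a real build might prefer the _xgetbv intrinsic where the compiler provides it):

    #include <stdint.h>

    /* Read an extended control register; xcr = 0 (XCR0) reports which SIMD
     * register state the OS saves, which is what AVX/AVX-512 detection needs.
     * xgetbv returns the low half in eax and the high half in edx. */
    static uint64_t cpu_xgetbv_ref(uint32_t xcr)
    {
        uint32_t eax, edx;
        __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(xcr));
        return ((uint64_t)edx << 32) | eax;
    }
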
 
26
 %if ARCH_X86_64
27
 
28
@@ -78,7 +76,7 @@
29
 %if WIN64
30
     sub  rsp, 32 ; shadow space
31
 %endif
32
-    and  rsp, ~31
33
+    and  rsp, ~(STACK_ALIGNMENT - 1)
34
     mov  rax, r0
35
     mov   r0, r1
36
     mov   r1, r2
37
@@ -119,7 +117,7 @@
38
     push ebp
39
     mov  ebp, esp
40
     sub  esp, 12
41
-    and  esp, ~31
42
+    and  esp, ~(STACK_ALIGNMENT - 1)
43
     mov  ecx, [ebp+8]
44
     mov  edx, [ebp+12]
45
     mov  [esp], edx
46
x265_2.7.tar.gz/source/common/x86/dct8.asm -> x265_2.9.tar.gz/source/common/x86/dct8.asm Changed
201
 
1
@@ -28,7 +28,89 @@
2
 
3
 %include "x86inc.asm"
4
 %include "x86util.asm"
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
+
8
+tab_dct32:      dw 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64
9
+                dw 90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13,  4, -4, -13, -22, -31, -38, -46, -54, -61, -67, -73, -78, -82, -85, -88, -90, -90
10
+                dw 90, 87, 80, 70, 57, 43, 25,  9, -9, -25, -43, -57, -70, -80, -87, -90, -90, -87, -80, -70, -57, -43, -25, -9,  9, 25, 43, 57, 70, 80, 87, 90
11
+                dw 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13, 13, 38, 61, 78, 88, 90, 85, 73, 54, 31,  4, -22, -46, -67, -82, -90
12
+                dw 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89, 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89
13
+                dw 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22, -22, -61, -85, -90, -73, -38,  4, 46, 78, 90, 82, 54, 13, -31, -67, -88
14
+                dw 87, 57,  9, -43, -80, -90, -70, -25, 25, 70, 90, 80, 43, -9, -57, -87, -87, -57, -9, 43, 80, 90, 70, 25, -25, -70, -90, -80, -43,  9, 57, 87
15
+                dw 85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31, 31, 78, 90, 61,  4, -54, -88, -82, -38, 22, 73, 90, 67, 13, -46, -85
16
+                dw 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83
17
+                dw 82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67,  4, 73, 88, 38, -38, -88, -73, -4, 67, 90, 46, -31, -85, -78, -13, 61, 90, 54, -22, -82
18
+                dw 80,  9, -70, -87, -25, 57, 90, 43, -43, -90, -57, 25, 87, 70, -9, -80, -80, -9, 70, 87, 25, -57, -90, -43, 43, 90, 57, -25, -87, -70,  9, 80
19
+                dw 78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46, 46, 90, 38, -54, -90, -31, 61, 88, 22, -67, -85, -13, 73, 82,  4, -78
20
+                dw 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75, 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75
21
+                dw 73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54, -54, -85,  4, 88, 46, -61, -82, 13, 90, 38, -67, -78, 22, 90, 31, -73
22
+                dw 70, -43, -87,  9, 90, 25, -80, -57, 57, 80, -25, -90, -9, 87, 43, -70, -70, 43, 87, -9, -90, -25, 80, 57, -57, -80, 25, 90,  9, -87, -43, 70
23
+                dw 67, -54, -78, 38, 85, -22, -90,  4, 90, 13, -88, -31, 82, 46, -73, -61, 61, 73, -46, -82, 31, 88, -13, -90, -4, 90, 22, -85, -38, 78, 54, -67
24
+                dw 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64
25
+                dw 61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67, -67, -54, 78, 38, -85, -22, 90,  4, -90, 13, 88, -31, -82, 46, 73, -61
26
+                dw 57, -80, -25, 90, -9, -87, 43, 70, -70, -43, 87,  9, -90, 25, 80, -57, -57, 80, 25, -90,  9, 87, -43, -70, 70, 43, -87, -9, 90, -25, -80, 57
27
+                dw 54, -85, -4, 88, -46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73, 73, 31, -90, 22, 78, -67, -38, 90, -13, -82, 61, 46, -88,  4, 85, -54
28
+                dw 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50, 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50
29
+                dw 46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82,  4, 78, -78, -4, 82, -73, -13, 85, -67, -22, 88, -61, -31, 90, -54, -38, 90, -46
30
+                dw 43, -90, 57, 25, -87, 70,  9, -80, 80, -9, -70, 87, -25, -57, 90, -43, -43, 90, -57, -25, 87, -70, -9, 80, -80,  9, 70, -87, 25, 57, -90, 43
31
+                dw 38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82, 82, -22, -54, 90, -61, -13, 78, -85, 31, 46, -90, 67,  4, -73, 88, -38
32
+                dw 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36
33
+                dw 31, -78, 90, -61,  4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85, -85, 46, 13, -67, 90, -73, 22, 38, -82, 88, -54, -4, 61, -90, 78, -31
34
+                dw 25, -70, 90, -80, 43,  9, -57, 87, -87, 57, -9, -43, 80, -90, 70, -25, -25, 70, -90, 80, -43, -9, 57, -87, 87, -57,  9, 43, -80, 90, -70, 25
35
+                dw 22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88, 88, -67, 31, 13, -54, 82, -90, 78, -46,  4, 38, -73, 90, -85, 61, -22
36
+                dw 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18, 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18
37
+                dw 13, -38, 61, -78, 88, -90, 85, -73, 54, -31,  4, 22, -46, 67, -82, 90, -90, 82, -67, 46, -22, -4, 31, -54, 73, -85, 90, -88, 78, -61, 38, -13
38
+                dw  9, -25, 43, -57, 70, -80, 87, -90, 90, -87, 80, -70, 57, -43, 25, -9, -9, 25, -43, 57, -70, 80, -87, 90, -90, 87, -80, 70, -57, 43, -25,  9
39
+                dw  4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90, 90, -90, 88, -85, 82, -78, 73, -67, 61, -54, 46, -38, 31, -22, 13, -4
40
+tab_dct16:      dw 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64
41
+                dw 90, 87, 80, 70, 57, 43, 25,  9, -9, -25, -43, -57, -70, -80, -87, -90
42
+                dw 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89
43
+                dw 87, 57,  9, -43, -80, -90, -70, -25, 25, 70, 90, 80, 43, -9, -57, -87
44
+                dw 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83
45
+                dw 80,  9, -70, -87, -25, 57, 90, 43, -43, -90, -57, 25, 87, 70, -9, -80
46
+                dw 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75
47
+                dw 70, -43, -87,  9, 90, 25, -80, -57, 57, 80, -25, -90, -9, 87, 43, -70
48
+                dw 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64
49
+                dw 57, -80, -25, 90, -9, -87, 43, 70, -70, -43, 87,  9, -90, 25, 80, -57
50
+                dw 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50
51
+                dw 43, -90, 57, 25, -87, 70,  9, -80, 80, -9, -70, 87, -25, -57, 90, -43
52
+                dw 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36
53
+                dw 25, -70, 90, -80, 43,  9, -57, 87, -87, 57, -9, -43, 80, -90, 70, -25
54
+                dw 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18
55
+                dw 9, -25, 43, -57, 70, -80, 87, -90, 90, -87, 80, -70, 57, -43, 25, -9
56
+
57
+dct16_shuf_AVX512:  dq 0, 1, 8, 9, 4, 5, 12, 13
58
+dct16_shuf1_AVX512: dq 2, 3, 10, 11, 6, 7, 14, 15
59
+dct16_shuf3_AVX512: dq 0, 1, 4, 5, 8, 9, 12, 13
60
+dct16_shuf4_AVX512: dq 2, 3, 6, 7, 10, 11, 14, 15
61
+dct16_shuf2_AVX512: dd 0, 4, 8, 12, 2, 6, 10, 14, 16, 20, 24, 28, 18, 22, 26, 30
62
+
63
+dct8_shuf5_AVX512: dq 0, 2, 4, 6, 1, 3, 5, 7
64
+dct8_shuf6_AVX512: dq 0, 2, 4, 6, 1, 3, 5, 7
65
+dct8_shuf8_AVX512: dd 0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
66
+dct8_shuf4_AVX512: times 2 dd 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
67
+dct16_shuf7_AVX512: dd 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
68
+dct16_shuf9_AVX512: dd 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15
69
+
70
+dct32_shuf_AVX512:  dd 0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20 , 21, 24, 25, 28, 29
71
+dct32_shuf4_AVX512: times 2 dd 0, 4, 8, 12, 0, 4, 8, 12
72
+dct32_shuf5_AVX512: dd 0, 0, 0, 0, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0
73
+dct32_shuf6_AVX512: dd 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 0, 0, 0, 0
74
+dct32_shuf7_AVX512: dd 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1
75
+dct32_shuf8_AVX512: dd -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
76
+dct16_shuf5_AVX512: dw 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27, 4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31
77
+dct16_shuf6_AVX512: dw 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
78
+dct16_shuf8_AVX512: dw 20, 0, 4, 2, 28, 8, 6, 10, 22, 16, 12, 18, 30, 24, 14, 26
79
+
80
+dct8_shuf7_AVX512: dw 0, 2, 16, 18, 8, 10, 24, 26, 4, 6, 20, 22, 12, 14, 28, 30
81
+dct8_shuf9_AVX512: times 2 dw 0, 8, 16, 24, 4, 12, 20, 28
82
+dct32_shuf1_AVX512: dw 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16
83
+dct32_shuf2_AVX512: dw 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23, 15, 14, 13, 12, 11, 10, 9, 8, 31, 30, 29, 28, 27, 26, 25, 24
84
+dct32_shuf3_AVX512: times 2 dw 0, 8, 16, 24, 2, 10, 18, 26
85
+
86
+dct8_shuf:         times 2 db 6, 7, 4, 5, 2, 3, 0, 1, 14, 15, 12, 13, 10, 11, 8, 9
87
+dct8_shuf_AVX512:  times 2 db 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11
88
+
89
 tab_dct8:       dw 64, 64, 64, 64, 64, 64, 64, 64
90
                 dw 89, 75, 50, 18, -18, -50, -75, -89
91
                 dw 83, 36, -36, -83, -83, -36, 36, 83
92
@@ -38,7 +120,10 @@
93
                 dw 36, -83, 83, -36, -36, 83, -83, 36
94
                 dw 18, -50, 75, -89, 89, -75, 50, -18
95
 
96
-dct8_shuf:      times 2 db 6, 7, 4, 5, 2, 3, 0, 1, 14, 15, 12, 13, 10, 11, 8, 9
97
+tab_dct8_avx512: dw 64, 64, 64, 64, 89, 75, 50, 18
98
+                 dw 83, 36, -36, -83, 75, -18, -89, -50
99
+                 dw 64, -64, -64, 64, 50, -89, 18, 75
100
+                 dw 36, -83, 83, -36, 18, -50, 75, -89
101
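
tab_dct8 holds the rows of the 8-point core transform matrix, and tab_dct8_avx512 is a rearranged copy of the same coefficients laid out for the ZMM kernels; the vector code evaluates the products with pmaddwd, but one 1-D pass is simply a matrix product. A scalar sketch under those assumptions (the rows elided by the diff context follow the standard HEVC pattern visible in the other tables, and the shift/rounding depend on the pass and bit depth, so they are left as a parameter):

    #include <stdint.h>

    static const int16_t dct8_coeff[8][8] = {
        { 64,  64,  64,  64,  64,  64,  64,  64 },
        { 89,  75,  50,  18, -18, -50, -75, -89 },
        { 83,  36, -36, -83, -83, -36,  36,  83 },
        { 75, -18, -89, -50,  50,  89,  18, -75 },
        { 64, -64, -64,  64,  64, -64, -64,  64 },
        { 50, -89,  18,  75, -75, -18,  89, -50 },
        { 36, -83,  83, -36, -36,  83, -83,  36 },
        { 18, -50,  75, -89,  89, -75,  50, -18 },
    };

    /* One 1-D pass: dst[k] = (rounding + sum_n coeff[k][n] * src[n]) >> shift.
     * The full 8x8 transform applies this once over rows, once over columns. */
    static void dct8_1d_ref(const int16_t src[8], int16_t dst[8], int shift)
    {
        for (int k = 0; k < 8; k++)
        {
            int sum = 1 << (shift - 1);      /* rounding term */
            for (int n = 0; n < 8; n++)
                sum += dct8_coeff[k][n] * src[n];
            dst[k] = (int16_t)(sum >> shift);
        }
    }
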
 
102
 tab_dct16_1:    dw 64, 64, 64, 64, 64, 64, 64, 64
103
                 dw 90, 87, 80, 70, 57, 43, 25,  9
104
@@ -57,7 +142,6 @@
105
                 dw 18, -50, 75, -89, 89, -75, 50, -18
106
                 dw  9, -25, 43, -57, 70, -80, 87, -90
107
 
108
-
109
 tab_dct16_2:    dw 64, 64, 64, 64, 64, 64, 64, 64
110
                 dw -9, -25, -43, -57, -70, -80, -87, -90
111
                 dw -89, -75, -50, -18, 18, 50, 75, 89
112
@@ -155,12 +239,34 @@
113
                 times 4 dw 50, -89, 18, 75
114
                 times 4 dw 18, -50, 75, -89
115
 
116
+avx512_idct8_1:   times 8 dw 64, 83, 64, 36
117
+                  times 8 dw 64, 36, -64, -83
118
+                  times 8 dw 64, -36, -64, 83
119
+                  times 8 dw 64, -83, 64, -36
120
+
121
+avx512_idct8_2:   times 8 dw 89, 75, 50, 18
122
+                  times 8 dw 75, -18, -89, -50
123
+                  times 8 dw 50, -89, 18, 75
124
+                  times 8 dw 18, -50, 75, -89
125
+
126
+avx512_idct8_3:   dw 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36
127
+                  dw 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83
128
+                  dw 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83
129
+                  dw -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36
130
+                  dw 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89
131
+                  dw 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75
132
+                  dw 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50
133
+                  dw -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89
134
+
135
 idct8_shuf1:    dd 0, 2, 4, 6, 1, 3, 5, 7
136
 
137
 const idct8_shuf2,    times 2 db 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15
138
 
139
 idct8_shuf3:    times 2 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3
140
 
141
+
142
+idct8_avx512_shuf3:    times 4 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3
143
+
144
 tab_idct16_1:   dw 90, 87, 80, 70, 57, 43, 25, 9
145
                 dw 87, 57, 9, -43, -80, -90, -70, -25
146
                 dw 80, 9, -70, -87, -25, 57, 90, 43
147
@@ -182,6 +288,31 @@
148
 idct16_shuff:   dd 0, 4, 2, 6, 1, 5, 3, 7
149
 
150
 idct16_shuff1:  dd 2, 6, 0, 4, 3, 7, 1, 5
151
+idct16_shuff2:  dw 0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30
152
+idct16_shuff3:  dw 1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31
153
+idct16_shuff4:  dd 0, 8, 2, 10, 4, 12, 6, 14
154
+idct16_shuff5:  dd 1, 9, 3, 11, 5, 13, 7, 15
155
+
156
+
157
+tab_AVX512_idct16_1:   dw 90, 87, 80, 70, 57, 43, 25, 9, 90, 87, 80, 70, 57, 43, 25, 9, 80, 9, -70, -87, -25, 57, 90, 43, 80, 9, -70, -87, -25, 57, 90, 43
158
+                       dw 87, 57, 9, -43, -80, -90, -70, -25, 87, 57, 9, -43, -80, -90, -70, -25, 70, -43, -87, 9, 90, 25, -80, -57, 70, -43, -87, 9, 90, 25, -80, -57
159
+                       dw 57, -80, -25, 90, -9, -87, 43, 70, 57, -80, -25, 90, -9, -87, 43, 70, 25, -70, 90, -80, 43, 9, -57, 87, 25, -70, 90, -80, 43, 9, -57, 87
160
+                       dw 43, -90, 57, 25, -87, 70, 9, -80, 43, -90, 57, 25, -87, 70, 9, -80, 9, -25, 43, -57, 70, -80, 87, -90, 9, -25, 43, -57, 70, -80, 87, -90
161
+
162
+tab_AVX512_idct16_2:   dw 64, 89, 83, 75, 64, 50, 36, 18, 64, 89, 83, 75, 64, 50, 36, 18, 64, 50, -36, -89, -64, 18, 83, 75, 64, 50, -36, -89, -64, 18, 83, 75
163
+                       dw 64, 75, 36, -18, -64, -89, -83, -50, 64, 75, 36, -18, -64, -89, -83, -50, 64, 18, -83, -50, 64, 75, -36, -89, 64, 18, -83, -50, 64, 75, -36, -89
164
+                       dw 64, -18, -83, 50, 64, -75, -36, 89, 64, -18, -83, 50, 64, -75, -36, 89, 64, -75, 36, 18, -64, 89, -83, 50, 64, -75, 36, 18, -64, 89, -83, 50
165
+                       dw 64, -50, -36, 89, -64, -18, 83, -75, 64, -50, -36, 89, -64, -18, 83, -75, 64, -89, 83, -75, 64, -50, 36, -18, 64, -89, 83, -75, 64, -50, 36, -18
166
+
167
+idct16_AVX512_shuff:   dd 0, 4, 2, 6, 1, 5, 3, 7, 8, 12, 10, 14, 9, 13, 11, 15
168
+
169
+idct16_AVX512_shuff1:  dd 2, 6, 0, 4, 3, 7, 1, 5, 10, 14, 8, 12, 11, 15, 9, 13
170
+
171
+idct16_AVX512_shuff2:   dq 0, 1, 8, 9, 4, 5, 12, 13
172
+idct16_AVX512_shuff3:   dq 2, 3, 10, 11, 6, 7, 14, 15
173
+idct16_AVX512_shuff4:   dq 4, 5, 12, 13, 0, 1, 8, 9
174
+idct16_AVX512_shuff5:   dq 6, 7, 14, 15, 2, 3, 10, 11
175
+idct16_AVX512_shuff6:   times 4 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
176
 
177
 tab_idct32_1:   dw 90 ,90 ,88 ,85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4
178
                 dw 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13
179
@@ -237,6 +368,71 @@
180
                 dw 64, -87, 75, -57, 36, -9, -18, 43, -64, 80, -89, 90, -83, 70, -50, 25
181
                 dw 64, -90, 89, -87, 83, -80, 75, -70, 64, -57, 50, -43, 36, -25, 18, -9
182
 
183
+
184
+tab_idct32_AVX512_1:   dw 90 ,90 ,88 ,85, 82, 78, 73, 67, 90 ,90 ,88 ,85, 82, 78, 73, 67, 90, 82, 67, 46, 22, -4, -31, -54, 90, 82, 67, 46, 22, -4, -31, -54
185
+                       dw 61, 54, 46, 38, 31, 22, 13, 4, 61, 54, 46, 38, 31, 22, 13, 4, -73, -85, -90, -88, -78, -61, -38, -13, -73, -85, -90, -88, -78, -61, -38, -13
186
+                       dw 88, 67, 31, -13, -54, -82, -90, -78, 88, 67, 31, -13, -54, -82, -90, -78, 85, 46, -13, -67, -90, -73, -22, 38, 85, 46, -13, -67, -90, -73, -22, 38
187
+                       dw -46, -4, 38, 73, 90, 85, 61, 22, -46, -4, 38, 73, 90, 85, 61, 22, 82, 88, 54, -4, -61, -90, -78, -31, 82, 88, 54, -4, -61, -90, -78, -31
188
+                       dw 82, 22, -54, -90, -61, 13, 78, 85, 82, 22, -54, -90, -61, 13, 78, 85, 78, -4, -82, -73, 13, 85, 67, -22, 78, -4, -82, -73, 13, 85, 67, -22
189
+                       dw 31, -46, -90, -67, 4, 73, 88, 38, 31, -46, -90, -67, 4, 73, 88, 38, -88, -61, 31, 90, 54, -38, -90, -46, -88, -61, 31, 90, 54, -38, -90, -46
190
+                       dw 73, -31, -90, -22, 78, 67, -38, -90, 73, -31, -90, -22, 78, 67, -38, -90, 67, -54, -78, 38, 85, -22, -90, 4, 67, -54, -78, 38, 85, -22, -90, 4
191
+                       dw -13, 82, 61, -46, -88, -4, 85, 54, -13, 82, 61, -46, -88, -4, 85, 54, 90, 13, -88, -31, 82, 46, -73, -61, 90, 13, -88, -31, 82, 46, -73, -61
192
+
193
+tab_idct32_AVX512_5:   dw 4, -13, 22, -31, 38, -46, 54, -61, 4, -13, 22, -31, 38, -46, 54, -61, 13, -38, 61, -78, 88, -90, 85, -73, 13, -38, 61, -78, 88, -90, 85, -73
194
+                       dw 67, -73, 78, -82, 85, -88, 90, -90, 67, -73, 78, -82, 85, -88, 90, -90, 54, -31, 4, 22, -46, 67, -82, 90, 54, -31, 4, 22, -46, 67, -82, 90
195
+                       dw 22, -61, 85, -90, 73, -38, -4, 46, 22, -61, 85, -90, 73, -38, -4, 46, 31, -78, 90, -61, 4, 54, -88, 82, 31, -78, 90, -61, 4, 54, -88, 82
196
+                       dw -78, 90, -82, 54, -13, -31, 67, -88, -78, 90, -82, 54, -13, -31, 67, -88, -38, -22, 73, -90, 67, -13, -46, 85, -38, -22, 73, -90, 67, -13, -46, 85
197
+                       dw 38, -88, 73, -4, -67, 90, -46, -31, 38, -88, 73, -4, -67, 90, -46, -31, 46, -90, 38, 54, -90, 31, 61, -88, 46, -90, 38, 54, -90, 31, 61, -88
198
+                       dw 85, -78, 13, 61, -90, 54, 22, -82, 85, -78, 13, 61, -90, 54, 22, -82, 22, 67, -85, 13, 73, -82, 4, 78, 22, 67, -85, 13, 73, -82, 4, 78
199
+                       dw 54, -85, -4, 88, -46, -61, 82, 13, 54, -85, -4, 88, -46, -61, 82, 13, 61, -73, -46, 82, 31, -88, -13, 90, 61, -73, -46, 82, 31, -88, -13, 90
200
+                       dw -90, 38, 67, -78, -22, 90, -31, -73, -90, 38, 67, -78, -22, 90, -31, -73, -4, -90, 22, 85, -38, -78, 54, 67, -4, -90, 22, 85, -38, -78, 54, 67
201
x265_2.7.tar.gz/source/common/x86/dct8.h -> x265_2.9.tar.gz/source/common/x86/dct8.h Changed
26
 
1
@@ -34,6 +34,11 @@
2
 FUNCDEF_TU_S2(void, idct, ssse3, const int16_t* src, int16_t* dst, intptr_t dstStride);
3
 FUNCDEF_TU_S2(void, idct, sse4, const int16_t* src, int16_t* dst, intptr_t dstStride);
4
 FUNCDEF_TU_S2(void, idct, avx2, const int16_t* src, int16_t* dst, intptr_t dstStride);
5
+FUNCDEF_TU_S2(void, nonPsyRdoQuant, avx512, int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos);
6
+FUNCDEF_TU_S2(void, psyRdoQuant, avx512, int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost, int64_t *psyScale, uint32_t blkPos);
7
+FUNCDEF_TU_S2(void, nonPsyRdoQuant, avx2, int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos);
8
+FUNCDEF_TU_S2(void, psyRdoQuant_1p, avx2, int16_t* m_resiDctCoeff,  int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost,  uint32_t blkPos);
9
+FUNCDEF_TU_S2(void, psyRdoQuant_2p, avx2, int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost, int64_t *psyScale, uint32_t blkPos);
10
 
11
 void PFX(dst4_ssse3)(const int16_t* src, int16_t* dst, intptr_t srcStride);
12
 void PFX(dst4_sse2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
13
@@ -42,5 +47,11 @@
14
 void PFX(idst4_avx2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
15
 void PFX(denoise_dct_sse4)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
16
 void PFX(denoise_dct_avx2)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
17
-
18
+void PFX(denoise_dct_avx512)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
19
+void PFX(dct8_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride);
20
+void PFX(idct8_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride);
21
+void PFX(idct16_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride);
22
+void PFX(idct32_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride);
23
+void PFX(dct32_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride);
24
+void PFX(dct16_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride);
25
 #endif // ifndef X265_DCT8_H
26
x265_2.7.tar.gz/source/common/x86/h-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/h-ipfilter16.asm Changed
201
 
1
@@ -47,7 +47,7 @@
2
 
3
 h_pd_524800:        times 8 dd 524800
4
                                     
5
-tab_LumaCoeff:    dw   0, 0,  0,  64,  0,   0,  0,  0
6
+h_tab_LumaCoeff:    dw   0, 0,  0,  64,  0,   0,  0,  0
7
                   dw  -1, 4, -10, 58,  17, -5,  1,  0
8
                   dw  -1, 4, -11, 40,  40, -11, 4, -1
9
                   dw   0, 1, -5,  17,  58, -10, 4, -1
10
@@ -79,8 +79,13 @@
11
                             db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 
12
 
13
 const interp8_hpp_shuf_new, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9
14
-                            db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13                         
15
-                            
16
+                            db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13
17
+
18
+ALIGN 64
19
+interp8_hpp_shuf1_load_avx512: times 4 db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9
20
+interp8_hpp_shuf2_load_avx512: times 4 db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13
21
+interp8_hpp_shuf1_store_avx512: times 4 db 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15
22
+
23
 SECTION .text
24
 cextern pd_8
25
 cextern pd_32
26
@@ -207,10 +212,10 @@
27
     add         r3d,    r3d
28
 
29
 %ifdef PIC
30
-    lea         r6,     [tab_LumaCoeff]
31
+    lea         r6,     [h_tab_LumaCoeff]
32
     mova        m0,     [r6 + r4]
33
 %else
34
-    mova        m0,     [tab_LumaCoeff + r4]
35
+    mova        m0,     [h_tab_LumaCoeff + r4]
36
 %endif
37
 
38
 %ifidn %3, pp
39
@@ -285,7 +290,8 @@
40
 ;------------------------------------------------------------------------------------------------------------
41
 ; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx
42
 ;------------------------------------------------------------------------------------------------------------
43
-    FILTER_HOR_LUMA_sse2 4, 4, pp
44
+%if ARCH_X86_64
45
+   FILTER_HOR_LUMA_sse2 4, 4, pp
46
     FILTER_HOR_LUMA_sse2 4, 8, pp
47
     FILTER_HOR_LUMA_sse2 4, 16, pp
48
     FILTER_HOR_LUMA_sse2 8, 4, pp
49
@@ -339,6 +345,7 @@
50
     FILTER_HOR_LUMA_sse2 64, 32, ps
51
     FILTER_HOR_LUMA_sse2 64, 48, ps
52
     FILTER_HOR_LUMA_sse2 64, 64, ps
53
+%endif
54
 
55
 ;-----------------------------------------------------------------------------
56
 ; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
57
@@ -625,10 +632,10 @@
58
     add         r3, r3
59
 
60
 %ifdef PIC
61
-    lea         r6, [tab_LumaCoeff]
62
+    lea         r6, [h_tab_LumaCoeff]
63
     mova        m0, [r6 + r4]
64
 %else
65
-    mova        m0, [tab_LumaCoeff + r4]
66
+    mova        m0, [h_tab_LumaCoeff + r4]
67
 %endif
68
 
69
 %ifidn %3, pp
70
@@ -712,10 +719,10 @@
71
     shl         r4d, 4
72
 
73
 %ifdef PIC
74
-    lea         r6, [tab_LumaCoeff]
75
+    lea         r6, [h_tab_LumaCoeff]
76
     mova        m0, [r6 + r4]
77
 %else
78
-    mova        m0, [tab_LumaCoeff + r4]
79
+    mova        m0, [h_tab_LumaCoeff + r4]
80
 %endif
81
 
82
 %ifidn %3, pp
83
@@ -815,10 +822,10 @@
84
     shl         r4d, 4
85
 
86
 %ifdef PIC
87
-    lea         r6, [tab_LumaCoeff]
88
+    lea         r6, [h_tab_LumaCoeff]
89
     mova        m0, [r6 + r4]
90
 %else
91
-    mova        m0, [tab_LumaCoeff + r4]
92
+    mova        m0, [h_tab_LumaCoeff + r4]
93
 %endif
94
 %ifidn %3, pp
95
     mova        m1, [INTERP_OFFSET_PP]
96
@@ -936,10 +943,10 @@
97
     shl         r4d, 4
98
 
99
 %ifdef PIC
100
-    lea         r6, [tab_LumaCoeff]
101
+    lea         r6, [h_tab_LumaCoeff]
102
     mova        m0, [r6 + r4]
103
 %else
104
-    mova        m0, [tab_LumaCoeff + r4]
105
+    mova        m0, [h_tab_LumaCoeff + r4]
106
 %endif
107
 
108
 %ifidn %3, pp
109
@@ -1132,10 +1139,10 @@
110
     shl         r4d, 4
111
 
112
 %ifdef PIC
113
-    lea         r6, [tab_LumaCoeff]
114
+    lea         r6, [h_tab_LumaCoeff]
115
     mova        m0, [r6 + r4]
116
 %else
117
-    mova        m0, [tab_LumaCoeff + r4]
118
+    mova        m0, [h_tab_LumaCoeff + r4]
119
 %endif
120
 %ifidn %3, pp
121
     mova        m1, [pd_32]
122
@@ -1307,12 +1314,12 @@
123
     mov              r4d, r4m
124
     shl              r4d, 4
125
 %ifdef PIC
126
-    lea              r5, [tab_LumaCoeff]
127
+    lea              r5, [h_tab_LumaCoeff]
128
     vpbroadcastq     m0, [r5 + r4]
129
     vpbroadcastq     m1, [r5 + r4 + 8]
130
 %else
131
-    vpbroadcastq     m0, [tab_LumaCoeff + r4]
132
-    vpbroadcastq     m1, [tab_LumaCoeff + r4 + 8]
133
+    vpbroadcastq     m0, [h_tab_LumaCoeff + r4]
134
+    vpbroadcastq     m1, [h_tab_LumaCoeff + r4 + 8]
135
 %endif
136
     lea              r6, [pw_pixel_max]
137
     mova             m3, [interp8_hpp_shuf]
138
@@ -1376,302 +1383,352 @@
139
 ;-------------------------------------------------------------------------------------------------------------
140
 ; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx
141
 ;-------------------------------------------------------------------------------------------------------------
142
-%macro FILTER_HOR_LUMA_W8 1
143
+%macro PROCESS_IPFILTER_LUMA_PP_8x2_AVX2 0
144
+    movu            xm7,        [r0]
145
+    movu            xm8,        [r0 + 8]
146
+    vinserti128     m7,        m7,        [r0 + r1],          1
147
+    vinserti128     m8,        m8,        [r0 + r1 + 8],      1
148
+    pshufb          m10,       m7,        m14
149
+    pshufb          m7,                   m13
150
+    pshufb          m11,       m8,        m14
151
+    pshufb          m8,                   m13
152
+
153
+    pmaddwd         m7,        m0
154
+    pmaddwd         m10,       m1
155
+    paddd           m7,        m10
156
+    pmaddwd         m10,       m11,       m3
157
+    pmaddwd         m9,        m8,        m2
158
+    paddd           m10,       m9
159
+    paddd           m7,        m10
160
+    paddd           m7,        m4
161
+    psrad           m7,        INTERP_SHIFT_PP
162
+
163
+    movu            xm9,        [r0 + 16]
164
+    vinserti128     m9,        m9,        [r0 + r1 + 16],      1
165
+    pshufb          m10,       m9,        m14
166
+    pshufb          m9,                   m13
167
+    pmaddwd         m8,        m0
168
+    pmaddwd         m11,       m1
169
+    paddd           m8,        m11
170
+    pmaddwd         m10,       m3
171
+    pmaddwd         m9,        m2
172
+    paddd           m9,        m10
173
+    paddd           m8,        m9
174
+    paddd           m8,        m4
175
+    psrad           m8,        INTERP_SHIFT_PP
176
+
177
+    packusdw        m7,        m8
178
+    pshufb          m7,        m12
179
+    CLIPW           m7,        m5,         m6
180
+    movu            [r2],      xm7
181
+    vextracti128    [r2 + r3], m7,         1
182
+%endmacro
183
+
184
+%macro IPFILTER_LUMA_AVX2_8xN 1
185
 INIT_YMM avx2
186
-cglobal interp_8tap_horiz_pp_8x%1, 4,6,8
187
-    add              r1d, r1d
188
-    add              r3d, r3d
189
-    sub              r0, 6
190
-    mov              r4d, r4m
191
-    shl              r4d, 4
192
+cglobal interp_8tap_horiz_pp_8x%1, 5,6,15
193
+    shl              r1d,        1
194
+    shl              r3d,        1
195
+    sub              r0,         6
196
+    mov              r4d,        r4m
197
+    shl              r4d,        4
198
+
199
 %ifdef PIC
200
-    lea              r5, [tab_LumaCoeff]
201
x265_2.7.tar.gz/source/common/x86/h4-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/h4-ipfilter16.asm Changed
201
 
1
@@ -52,7 +52,7 @@
2
 
3
 tab_Tm16:         db 0, 1, 2, 3, 4,  5,  6, 7, 2, 3, 4,  5, 6, 7, 8, 9
4
 
5
-tab_ChromaCoeff:  dw  0, 64,  0,  0
6
+h4_tab_ChromaCoeff:  dw  0, 64,  0,  0
7
                   dw -2, 58, 10, -2
8
                   dw -4, 54, 16, -2
9
                   dw -6, 46, 28, -4
10
@@ -279,10 +279,10 @@
11
     add         r4d,    r4d
12
 
13
 %ifdef PIC
14
-    lea         r6,     [tab_ChromaCoeff]
15
+    lea         r6,     [h4_tab_ChromaCoeff]
16
     movddup     m0,     [r6 + r4 * 4]
17
 %else
18
-    movddup     m0,     [tab_ChromaCoeff + r4 * 4]
19
+    movddup     m0,     [h4_tab_ChromaCoeff + r4 * 4]
20
 %endif
21
 
22
 %ifidn %3, ps
23
@@ -377,6 +377,7 @@
24
 ; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
25
 ;-----------------------------------------------------------------------------
26
 
27
+%if ARCH_X86_64
28
 FILTER_HOR_CHROMA_sse3 2, 4, pp
29
 FILTER_HOR_CHROMA_sse3 2, 8, pp
30
 FILTER_HOR_CHROMA_sse3 2, 16, pp
31
@@ -462,6 +463,7 @@
32
 FILTER_HOR_CHROMA_sse3 64, 32, ps
33
 FILTER_HOR_CHROMA_sse3 64, 48, ps
34
 FILTER_HOR_CHROMA_sse3 64, 64, ps
35
+%endif
36
 
37
 %macro FILTER_W2_2 1
38
     movu        m3,         [r0]
39
@@ -530,10 +532,10 @@
40
     add         r4d,      r4d
41
 
42
 %ifdef PIC
43
-    lea         r%6,      [tab_ChromaCoeff]
44
+    lea         r%6,      [h4_tab_ChromaCoeff]
45
     movh        m0,       [r%6 + r4 * 4]
46
 %else
47
-    movh        m0,       [tab_ChromaCoeff + r4 * 4]
48
+    movh        m0,       [h4_tab_ChromaCoeff + r4 * 4]
49
 %endif
50
 
51
     punpcklqdq  m0,       m0
52
@@ -1129,10 +1131,10 @@
53
     add         r4d,        r4d
54
 
55
 %ifdef PIC
56
-    lea         r%4,       [tab_ChromaCoeff]
57
+    lea         r%4,       [h4_tab_ChromaCoeff]
58
     movh        m0,       [r%4 + r4 * 4]
59
 %else
60
-    movh        m0,       [tab_ChromaCoeff + r4 * 4]
61
+    movh        m0,       [h4_tab_ChromaCoeff + r4 * 4]
62
 %endif
63
 
64
     punpcklqdq  m0,       m0
65
@@ -1246,10 +1248,10 @@
66
     sub             r0, 2
67
     mov             r4d, r4m
68
 %ifdef PIC
69
-    lea             r5, [tab_ChromaCoeff]
70
+    lea             r5, [h4_tab_ChromaCoeff]
71
     vpbroadcastq    m0, [r5 + r4 * 8]
72
 %else
73
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
74
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
75
 %endif
76
     mova            m1, [h4_interp8_hpp_shuf]
77
     vpbroadcastd    m2, [pd_32]
78
@@ -1314,10 +1316,10 @@
79
     sub             r0, 2
80
     mov             r4d, r4m
81
 %ifdef PIC
82
-    lea             r5, [tab_ChromaCoeff]
83
+    lea             r5, [h4_tab_ChromaCoeff]
84
     vpbroadcastq    m0, [r5 + r4 * 8]
85
 %else
86
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
87
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
88
 %endif
89
     mova            m1, [h4_interp8_hpp_shuf]
90
     vpbroadcastd    m2, [pd_32]
91
@@ -1370,10 +1372,10 @@
92
     sub             r0, 2
93
     mov             r4d, r4m
94
 %ifdef PIC
95
-    lea             r5, [tab_ChromaCoeff]
96
+    lea             r5, [h4_tab_ChromaCoeff]
97
     vpbroadcastq    m0, [r5 + r4 * 8]
98
 %else
99
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
100
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
101
 %endif
102
     mova            m1, [h4_interp8_hpp_shuf]
103
     vpbroadcastd    m2, [pd_32]
104
@@ -1432,10 +1434,10 @@
105
     sub             r0, 2
106
     mov             r4d, r4m
107
 %ifdef PIC
108
-    lea             r5, [tab_ChromaCoeff]
109
+    lea             r5, [h4_tab_ChromaCoeff]
110
     vpbroadcastq    m0, [r5 + r4 * 8]
111
 %else
112
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
113
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
114
 %endif
115
     mova            m1, [h4_interp8_hpp_shuf]
116
     vpbroadcastd    m2, [pd_32]
117
@@ -1504,10 +1506,10 @@
118
     sub             r0, 2
119
     mov             r4d, r4m
120
 %ifdef PIC
121
-    lea             r5, [tab_ChromaCoeff]
122
+    lea             r5, [h4_tab_ChromaCoeff]
123
     vpbroadcastq    m0, [r5 + r4 * 8]
124
 %else
125
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
126
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
127
 %endif
128
     mova            m1, [h4_interp8_hpp_shuf]
129
     vpbroadcastd    m2, [pd_32]
130
@@ -1579,10 +1581,10 @@
131
     sub             r0, 2
132
     mov             r4d, r4m
133
 %ifdef PIC
134
-    lea             r5, [tab_ChromaCoeff]
135
+    lea             r5, [h4_tab_ChromaCoeff]
136
     vpbroadcastq    m0, [r5 + r4 * 8]
137
 %else
138
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
139
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
140
 %endif
141
     mova            m1, [h4_interp8_hpp_shuf]
142
     vpbroadcastd    m2, [pd_32]
143
@@ -1655,10 +1657,10 @@
144
     sub             r0, 2
145
     mov             r4d, r4m
146
 %ifdef PIC
147
-    lea             r5, [tab_ChromaCoeff]
148
+    lea             r5, [h4_tab_ChromaCoeff]
149
     vpbroadcastq    m0, [r5 + r4 * 8]
150
 %else
151
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
152
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
153
 %endif
154
     mova            m1, [h4_interp8_hpp_shuf]
155
     vpbroadcastd    m2, [pd_32]
156
@@ -1724,10 +1726,10 @@
157
     sub             r0, 2
158
     mov             r4d, r4m
159
 %ifdef PIC
160
-    lea             r5, [tab_ChromaCoeff]
161
+    lea             r5, [h4_tab_ChromaCoeff]
162
     vpbroadcastq    m0, [r5 + r4 * 8]
163
 %else
164
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
165
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
166
 %endif
167
     mova            m1, [h4_interp8_hpp_shuf]
168
     vpbroadcastd    m2, [pd_32]
169
@@ -1804,10 +1806,10 @@
170
     sub             r0, 2
171
     mov             r4d, r4m
172
 %ifdef PIC
173
-    lea             r5, [tab_ChromaCoeff]
174
+    lea             r5, [h4_tab_ChromaCoeff]
175
     vpbroadcastq    m0, [r5 + r4 * 8]
176
 %else
177
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
178
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
179
 %endif
180
     mova            m1, [h4_interp8_hpp_shuf]
181
     vpbroadcastd    m2, [pd_32]
182
@@ -1872,10 +1874,10 @@
183
     sub             r0, 2
184
     mov             r4d, r4m
185
 %ifdef PIC
186
-    lea             r5, [tab_ChromaCoeff]
187
+    lea             r5, [h4_tab_ChromaCoeff]
188
     vpbroadcastq    m0, [r5 + r4 * 8]
189
 %else
190
-    vpbroadcastq    m0, [tab_ChromaCoeff + r4 * 8]
191
+    vpbroadcastq    m0, [h4_tab_ChromaCoeff + r4 * 8]
192
 %endif
193
     mova            m1, [h4_interp8_hpp_shuf]
194
     vpbroadcastd    m2, [pd_32]
195
@@ -1934,10 +1936,10 @@
196
     mov                 r5d, r5m
197
 
198
 %ifdef PIC
199
-    lea                 r6, [tab_ChromaCoeff]
200
+    lea                 r6, [h4_tab_ChromaCoeff]
201
x265_2.7.tar.gz/source/common/x86/intrapred.h -> x265_2.9.tar.gz/source/common/x86/intrapred.h Changed
19
 
1
@@ -76,7 +76,7 @@
2
 FUNCDEF_TU_S2(void, intra_pred_dc, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
3
 FUNCDEF_TU_S2(void, intra_pred_dc, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
4
 FUNCDEF_TU_S2(void, intra_pred_dc, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
5
-
6
+FUNCDEF_TU_S2(void, intra_pred_dc, avx512, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
7
 FUNCDEF_TU_S2(void, intra_pred_planar, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
8
 FUNCDEF_TU_S2(void, intra_pred_planar, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
9
 FUNCDEF_TU_S2(void, intra_pred_planar, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
10
@@ -85,7 +85,7 @@
11
 DECL_ALL(ssse3);
12
 DECL_ALL(sse4);
13
 DECL_ALL(avx2);
14
-
15
+DECL_ALL(avx512);
16
 #undef DECL_ALL
17
 #undef DECL_ANGS
18
 #undef DECL_ANG
19
x265_2.7.tar.gz/source/common/x86/intrapred16.asm -> x265_2.9.tar.gz/source/common/x86/intrapred16.asm Changed
201
 
1
@@ -71,7 +71,7 @@
2
 const pw_ang8_16,                   db  0,  0,  0,  0,  0,  0, 12, 13, 10, 11,  6,  7,  4,  5,  0,  1
3
 const pw_ang8_17,                   db  0,  0, 14, 15, 12, 13, 10, 11,  8,  9,  4,  5,  2,  3,  0,  1
4
 const pw_swap16,            times 2 db 14, 15, 12, 13, 10, 11,  8,  9,  6,  7,  4,  5,  2,  3,  0,  1
5
-
6
+const pw_swap16_avx512,     times 4 db 14, 15, 12, 13, 10, 11,  8,  9,  6,  7,  4,  5,  2,  3,  0,  1
7
 const pw_ang16_13,                  db 14, 15,  8,  9,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
8
 const pw_ang16_16,                  db  0,  0,  0,  0,  0,  0, 10, 11,  8,  9,  6,  7,  2,  3,  0,  1
9
 
10
@@ -196,6 +196,7 @@
11
 ;-----------------------------------------------------------------------------------
12
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
13
 ;-----------------------------------------------------------------------------------
14
+%if ARCH_X86_64
15
 INIT_XMM sse2
16
 cglobal intra_pred_dc8, 5, 8, 2
17
     movu            m0,            [r2 + 34]
18
@@ -275,10 +276,13 @@
19
     mov             [r0 + r7],     r3w
20
 .end:
21
     RET
22
+%endif
23
 
24
 ;-------------------------------------------------------------------------------------------------------
25
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter)
26
 ;-------------------------------------------------------------------------------------------------------
27
+%if ARCH_X86_64
28
+;This code is meant for 64 bit architecture
29
 INIT_XMM sse2
30
 cglobal intra_pred_dc16, 5, 10, 4
31
     lea             r3,                  [r2 + 66]
32
@@ -410,6 +414,7 @@
33
     mov             [r9 + r1 * 8],       r3w
34
 .end:
35
     RET
36
+%endif
37
 
38
 ;-------------------------------------------------------------------------------------------
39
 ; void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int filter)
40
@@ -474,6 +479,7 @@
41
 ;-------------------------------------------------------------------------------------------------------
42
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter)
43
 ;-------------------------------------------------------------------------------------------------------
44
+%if ARCH_X86_64
45
 INIT_YMM avx2
46
 cglobal intra_pred_dc16, 3, 9, 4
47
     mov             r3d,                 r4m
48
@@ -682,6 +688,68 @@
49
     movu            [r0 + r2 * 1 +  0], m0
50
     movu            [r0 + r2 * 1 + mmsize], m0
51
     RET
52
+INIT_ZMM avx512
53
+cglobal intra_pred_dc32, 3,3,2
54
+    add              r2, 2
55
+    add             r1d, r1d
56
+    movu             m0, [r2]
57
+    movu             m1, [r2 + 2 * mmsize]
58
+    paddw            m0, m1
59
+    vextracti32x8   ym1, m0, 1
60
+    paddw           ym0, ym1
61
+    vextracti32x4   xm1, m0, 1
62
+    paddw           xm0, xm1
63
+    pmaddwd         xm0, [pw_1]
64
+    movhlps         xm1, xm0
65
+    paddd           xm0, xm1
66
+    vpsrldq         xm1, xm0, 4
67
+    paddd           xm0, xm1
68
+    paddd           xm0, [pd_32]                        ; sum = sum + 32
69
+    psrld           xm0, 6                              ; sum = sum / 64
70
+    vpbroadcastw     m0, xm0
71
+    lea              r2, [r1 * 3]
72
+    ; store DC 32x32
73
+    movu            [r0 + r1 * 0 +  0], m0
74
+    movu            [r0 + r1 * 1 +  0], m0
75
+    movu            [r0 + r1 * 2 +  0], m0
76
+    movu            [r0 + r2 * 1 +  0], m0
77
+    lea             r0, [r0 + r1 * 4]
78
+    movu            [r0 + r1 * 0 +  0], m0
79
+    movu            [r0 + r1 * 1 +  0], m0
80
+    movu            [r0 + r1 * 2 +  0], m0
81
+    movu            [r0 + r2 * 1 +  0], m0
82
+    lea             r0, [r0 + r1 * 4]
83
+    movu            [r0 + r1 * 0 +  0], m0
84
+    movu            [r0 + r1 * 1 +  0], m0
85
+    movu            [r0 + r1 * 2 +  0], m0
86
+    movu            [r0 + r2 * 1 +  0], m0
87
+    lea             r0, [r0 + r1 * 4]
88
+    movu            [r0 + r1 * 0 +  0], m0
89
+    movu            [r0 + r1 * 1 +  0], m0
90
+    movu            [r0 + r1 * 2 +  0], m0
91
+    movu            [r0 + r2 * 1 +  0], m0
92
+    lea             r0, [r0 + r1 * 4]
93
+    movu            [r0 + r1 * 0 +  0], m0
94
+    movu            [r0 + r1 * 1 +  0], m0
95
+    movu            [r0 + r1 * 2 +  0], m0
96
+    movu            [r0 + r2 * 1 +  0], m0
97
+    lea             r0, [r0 + r1 * 4]
98
+    movu            [r0 + r1 * 0 +  0], m0
99
+    movu            [r0 + r1 * 1 +  0], m0
100
+    movu            [r0 + r1 * 2 +  0], m0
101
+    movu            [r0 + r2 * 1 +  0], m0
102
+    lea             r0, [r0 + r1 * 4]
103
+    movu            [r0 + r1 * 0 +  0], m0
104
+    movu            [r0 + r1 * 1 +  0], m0
105
+    movu            [r0 + r1 * 2 +  0], m0
106
+    movu            [r0 + r2 * 1 +  0], m0
107
+    lea             r0, [r0 + r1 * 4]
108
+    movu            [r0 + r1 * 0 +  0], m0
109
+    movu            [r0 + r1 * 1 +  0], m0
110
+    movu            [r0 + r1 * 2 +  0], m0
111
+    movu            [r0 + r2 * 1 +  0], m0
112
+    RET
113
+%endif
114
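
The ZMM routine above computes the DC predictor as the rounded mean of 64 reference samples (paddd with pd_32, psrld by 6) and then splats it over the 32x32 block. A scalar sketch of the filter-free case; the neighbour offsets (above samples at srcPix + 1, left samples at srcPix + 65) are inferred from the loads above and should be read as an assumption rather than a documented layout:

    #include <stdint.h>

    typedef uint16_t pixel;  /* intrapred16.asm targets the high-bit-depth build */

    /* DC prediction without edge filtering: average the 32 above and 32 left
     * reference samples (+32 for rounding, >>6), then fill the block. */
    static void intra_pred_dc32_ref(pixel* dst, intptr_t dstStride, const pixel* srcPix)
    {
        uint32_t sum = 32;                   /* rounding, as pd_32 above */
        for (int i = 0; i < 32; i++)
            sum += srcPix[1 + i] + srcPix[65 + i];

        pixel dc = (pixel)(sum >> 6);
        for (int y = 0; y < 32; y++)
            for (int x = 0; x < 32; x++)
                dst[y * dstStride + x] = dc;
    }
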
 
115
 ;---------------------------------------------------------------------------------------
116
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
117
@@ -1104,6 +1172,7 @@
118
 ;---------------------------------------------------------------------------------------
119
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
120
 ;---------------------------------------------------------------------------------------
121
+%if ARCH_X86_64
122
 INIT_XMM sse2
123
 cglobal intra_pred_planar32, 3,3,16
124
     movd            m3, [r2 + 66]               ; topRight   = above[32]
125
@@ -1209,7 +1278,7 @@
126
 %endrep
127
     RET
128
 %endif
129
-
130
+%endif
131
 ;---------------------------------------------------------------------------------------
132
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
133
 ;---------------------------------------------------------------------------------------
134
@@ -2063,6 +2132,7 @@
135
     STORE_4x4
136
     RET
137
 
138
+%if ARCH_X86_64
139
 cglobal intra_pred_ang4_26, 3,3,3
140
     movh        m0,             [r2 + 2] ;[8 7 6 5 4 3 2 1]
141
     add         r1d,            r1d
142
@@ -2098,6 +2168,7 @@
143
     mov         [r0 + r3],      r2w
144
 .quit:
145
     RET
146
+%endif
147
 
148
 cglobal intra_pred_ang4_27, 3,3,5
149
     movu        m0, [r2 + 2]            ;[8 7 6 5 4 3 2 1]
150
@@ -11054,35 +11125,35 @@
151
 
152
 %macro TRANSPOSE_STORE_AVX2 11
153
     jnz             .skip%11
154
-    punpckhwd       m%9,  m%1,  m%2
155
-    punpcklwd       m%1,  m%2
156
-    punpckhwd       m%2,  m%3,  m%4
157
-    punpcklwd       m%3,  m%4
158
-
159
-    punpckldq       m%4,  m%1,  m%3
160
-    punpckhdq       m%1,  m%3
161
-    punpckldq       m%3,  m%9,  m%2
162
-    punpckhdq       m%9,  m%2
163
-
164
-    punpckhwd       m%10, m%5,  m%6
165
-    punpcklwd       m%5,  m%6
166
-    punpckhwd       m%6,  m%7,  m%8
167
-    punpcklwd       m%7,  m%8
168
-
169
-    punpckldq       m%8,  m%5,  m%7
170
-    punpckhdq       m%5,  m%7
171
-    punpckldq       m%7,  m%10, m%6
172
-    punpckhdq       m%10, m%6
173
-
174
-    punpcklqdq      m%6,  m%4,  m%8
175
-    punpckhqdq      m%2,  m%4,  m%8
176
-    punpcklqdq      m%4,  m%1,  m%5
177
-    punpckhqdq      m%8,  m%1,  m%5
178
-
179
-    punpcklqdq      m%1,  m%3,  m%7
180
-    punpckhqdq      m%5,  m%3,  m%7
181
-    punpcklqdq      m%3,  m%9,  m%10
182
-    punpckhqdq      m%7,  m%9,  m%10
183
+    punpckhwd       ym%9,  ym%1,  ym%2
184
+    punpcklwd       ym%1,  ym%2
185
+    punpckhwd       ym%2,  ym%3,  ym%4
186
+    punpcklwd       ym%3,  ym%4
187
+
188
+    punpckldq       ym%4,  ym%1,  ym%3
189
+    punpckhdq       ym%1,  ym%3
190
+    punpckldq       ym%3,  ym%9,  ym%2
191
+    punpckhdq       ym%9,  ym%2
192
+
193
+    punpckhwd       ym%10, ym%5,  ym%6
194
+    punpcklwd       ym%5,  ym%6
195
+    punpckhwd       ym%6,  ym%7,  ym%8
196
+    punpcklwd       ym%7,  ym%8
197
+
198
+    punpckldq       ym%8,  ym%5,  ym%7
199
+    punpckhdq       ym%5,  ym%7
200
+    punpckldq       ym%7,  ym%10, ym%6
201
x265_2.7.tar.gz/source/common/x86/ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/ipfilter16.asm Changed
201
 
1
@@ -45,12 +45,33 @@
2
 %endif
3
 
4
 
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
 
8
 tab_c_524800:     times 4 dd 524800
9
 tab_c_n8192:      times 8 dw -8192
10
 pd_524800:        times 8 dd 524800
11
 
12
+tab_ChromaCoeff:  dw  0, 64,  0,  0
13
+                  dw -2, 58, 10, -2
14
+                  dw -4, 54, 16, -2
15
+                  dw -6, 46, 28, -4
16
+                  dw -4, 36, 36, -4
17
+                  dw -4, 28, 46, -6
18
+                  dw -2, 16, 54, -4
19
+                  dw -2, 10, 58, -2
20
+               
21
+tab_LumaCoeff:    dw   0, 0,  0,  64,  0,   0,  0,  0
22
+                  dw  -1, 4, -10, 58,  17, -5,  1,  0
23
+                  dw  -1, 4, -11, 40,  40, -11, 4, -1
24
+                  dw   0, 1, -5,  17,  58, -10, 4, -1
25
+
26
+ALIGN 64
27
+tab_LumaCoeffH_avx512:
28
+                  times 4 dw  0, 0,  0,  64,  0,   0,  0,  0
29
+                  times 4 dw  -1, 4, -10, 58,  17, -5,  1,  0
30
+                  times 4 dw  -1, 4, -11, 40,  40, -11, 4, -1
31
+                  times 4 dw   0, 1, -5,  17,  58, -10, 4, -1
32
+
33
 ALIGN 32
34
 tab_LumaCoeffV:   times 4 dw 0, 0
35
                   times 4 dw 0, 64
36
@@ -71,6 +92,7 @@
37
                   times 4 dw -5, 17
38
                   times 4 dw 58, -10
39
                   times 4 dw 4, -1
40
+
41
 ALIGN 32
42
 tab_LumaCoeffVer: times 8 dw 0, 0
43
                   times 8 dw 0, 64
44
@@ -91,7 +113,62 @@
45
                   times 8 dw -5, 17
46
                   times 8 dw 58, -10
47
                   times 8 dw 4, -1
48
-
49
+                 
50
+ALIGN 64
51
+const tab_ChromaCoeffV_avx512,  times 16 dw 0, 64
52
+                                times 16 dw 0, 0
53
+
54
+                                times 16 dw -2, 58
55
+                                times 16 dw 10, -2
56
+
57
+                                times 16 dw -4, 54
58
+                                times 16 dw 16, -2
59
+
60
+                                times 16 dw -6, 46
61
+                                times 16 dw 28, -4
62
+
63
+                                times 16 dw -4, 36
64
+                                times 16 dw 36, -4
65
+
66
+                                times 16 dw -4, 28
67
+                                times 16 dw 46, -6
68
+
69
+                                times 16 dw -2, 16
70
+                                times 16 dw 54, -4
71
+
72
+                                times 16 dw -2, 10
73
+                                times 16 dw 58, -2
74
+
75
+ALIGN 64
76
+tab_LumaCoeffVer_avx512: times 16 dw 0, 0
77
+                         times 16 dw 0, 64
78
+                         times 16 dw 0, 0
79
+                         times 16 dw 0, 0
80
+
81
+                         times 16 dw -1, 4
82
+                         times 16 dw -10, 58
83
+                         times 16 dw 17, -5
84
+                         times 16 dw 1, 0
85
+
86
+                         times 16 dw -1, 4
87
+                         times 16 dw -11, 40
88
+                         times 16 dw 40, -11
89
+                         times 16 dw 4, -1
90
+
91
+                         times 16 dw 0, 1
92
+                         times 16 dw -5, 17
93
+                         times 16 dw 58, -10
94
+                         times 16 dw 4, -1
95
+
96
+ALIGN 64
97
+const interp8_hpp_shuf1_load_avx512, times 4 db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9
98
+
99
+ALIGN 64
100
+const interp8_hpp_shuf2_load_avx512, times 4 db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13
101
+
102
+ALIGN 64
103
+const interp8_hpp_shuf1_store_avx512, times 4 db 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15
104
+ 
105
 SECTION .text
106
 cextern pd_8
107
 cextern pd_32
108
@@ -246,6 +323,7 @@
109
 ;-------------------------------------------------------------------------------------------------------------
110
 ; void interp_8tap_vert_pp_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
111
 ;-------------------------------------------------------------------------------------------------------------
112
+%if ARCH_X86_64
113
     FILTER_VER_LUMA_sse2 pp, 4, 4
114
     FILTER_VER_LUMA_sse2 pp, 8, 8
115
     FILTER_VER_LUMA_sse2 pp, 8, 4
116
@@ -300,7 +378,570 @@
117
     FILTER_VER_LUMA_sse2 ps, 48, 64
118
     FILTER_VER_LUMA_sse2 ps, 64, 16
119
     FILTER_VER_LUMA_sse2 ps, 16, 64
120
+%endif
121
+
122
+;-----------------------------------------------------------------------------
123
+;p2s and p2s_aligned avx512 code start
124
+;-----------------------------------------------------------------------------
125
+%macro P2S_64x4_AVX512 0
126
+    movu       m0, [r0]
127
+    movu       m1, [r0 + r1]
128
+    movu       m2, [r0 + r1 * 2]
129
+    movu       m3, [r0 + r5]
130
+    psllw      m0, (14 - BIT_DEPTH)
131
+    psllw      m1, (14 - BIT_DEPTH)
132
+    psllw      m2, (14 - BIT_DEPTH)
133
+    psllw      m3, (14 - BIT_DEPTH)
134
+    psubw      m0, m4
135
+    psubw      m1, m4
136
+    psubw      m2, m4
137
+    psubw      m3, m4
138
+    movu       [r2], m0
139
+    movu       [r2 + r3], m1
140
+    movu       [r2 + r3 * 2], m2
141
+    movu       [r2 + r4], m3
142
+
143
+    movu       m0, [r0 + mmsize]
144
+    movu       m1, [r0 + r1 + mmsize]
145
+    movu       m2, [r0 + r1 * 2 + mmsize]
146
+    movu       m3, [r0 + r5 + mmsize]
147
+    psllw      m0, (14 - BIT_DEPTH)
148
+    psllw      m1, (14 - BIT_DEPTH)
149
+    psllw      m2, (14 - BIT_DEPTH)
150
+    psllw      m3, (14 - BIT_DEPTH)
151
+    psubw      m0, m4
152
+    psubw      m1, m4
153
+    psubw      m2, m4
154
+    psubw      m3, m4
155
+    movu       [r2 + mmsize], m0
156
+    movu       [r2 + r3 + mmsize], m1
157
+    movu       [r2 + r3 * 2 + mmsize], m2
158
+    movu       [r2 + r4 + mmsize], m3
159
+%endmacro
160
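
The P2S macros convert pixels into the signed 14-bit intermediate format the interpolation filters operate on: shift up by (14 - BIT_DEPTH) and subtract the offset held in m4. A scalar sketch; the 8192 offset is the usual x265 internal offset and is an assumption here, since m4 is loaded outside this snippet:

    #include <stdint.h>

    #define BIT_DEPTH 10             /* ipfilter16.asm serves the 10/12-bit builds */
    #define P2S_SHIFT (14 - BIT_DEPTH)
    #define P2S_OFFS  8192           /* assumed value of the constant in m4 */

    /* Rescale each pixel into the signed 14-bit intermediate range. */
    static void p2s_ref(const uint16_t* src, intptr_t srcStride,
                        int16_t* dst, intptr_t dstStride, int width, int height)
    {
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = (int16_t)((src[x] << P2S_SHIFT) - P2S_OFFS);
            src += srcStride;
            dst += dstStride;
        }
    }
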
+
161
+%macro P2S_ALIGNED_64x4_AVX512 0
162
+    mova       m0, [r0]
163
+    mova       m1, [r0 + r1]
164
+    mova       m2, [r0 + r1 * 2]
165
+    mova       m3, [r0 + r5]
166
+    psllw      m0, (14 - BIT_DEPTH)
167
+    psllw      m1, (14 - BIT_DEPTH)
168
+    psllw      m2, (14 - BIT_DEPTH)
169
+    psllw      m3, (14 - BIT_DEPTH)
170
+    psubw      m0, m4
171
+    psubw      m1, m4
172
+    psubw      m2, m4
173
+    psubw      m3, m4
174
+    mova       [r2], m0
175
+    mova       [r2 + r3], m1
176
+    mova       [r2 + r3 * 2], m2
177
+    mova       [r2 + r4], m3
178
+
179
+    mova       m0, [r0 + mmsize]
180
+    mova       m1, [r0 + r1 + mmsize]
181
+    mova       m2, [r0 + r1 * 2 + mmsize]
182
+    mova       m3, [r0 + r5 + mmsize]
183
+    psllw      m0, (14 - BIT_DEPTH)
184
+    psllw      m1, (14 - BIT_DEPTH)
185
+    psllw      m2, (14 - BIT_DEPTH)
186
+    psllw      m3, (14 - BIT_DEPTH)
187
+    psubw      m0, m4
188
+    psubw      m1, m4
189
+    psubw      m2, m4
190
+    psubw      m3, m4
191
+    mova       [r2 + mmsize], m0
192
+    mova       [r2 + r3 + mmsize], m1
193
+    mova       [r2 + r3 * 2 + mmsize], m2
194
+    mova       [r2 + r4 + mmsize], m3
195
+%endmacro
196
+
197
+%macro P2S_32x4_AVX512 0
198
+    movu       m0, [r0]
199
+    movu       m1, [r0 + r1]
200
+    movu       m2, [r0 + r1 * 2]
201
x265_2.7.tar.gz/source/common/x86/ipfilter8.asm -> x265_2.9.tar.gz/source/common/x86/ipfilter8.asm Changed
201
 
1
@@ -26,7 +26,7 @@
2
 %include "x86inc.asm"
3
 %include "x86util.asm"
4
 
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
 const tab_Tm,    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
8
                  db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
9
                  db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
10
@@ -43,6 +43,15 @@
11
 
12
 const pd_526336, times 8 dd 8192*64+2048
13
 
14
+const tab_ChromaCoeff, db  0, 64,  0,  0
15
+                       db -2, 58, 10, -2
16
+                       db -4, 54, 16, -2
17
+                       db -6, 46, 28, -4
18
+                       db -4, 36, 36, -4
19
+                       db -4, 28, 46, -6
20
+                       db -2, 16, 54, -4
21
+                       db -2, 10, 58, -2
22
+
23
 const tab_LumaCoeff,   db   0, 0,  0,  64,  0,   0,  0,  0
24
                        db  -1, 4, -10, 58,  17, -5,  1,  0
25
                        db  -1, 4, -11, 40,  40, -11, 4, -1
26
@@ -133,12 +142,115 @@
27
                             times 16 db 58, -10
28
                             times 16 db 4, -1
29
 
30
+ALIGN 64
31
+const tab_ChromaCoeffVer_32_avx512,     times 32 db 0, 64
32
+                                        times 32 db 0, 0
33
+
34
+                                        times 32 db -2, 58
35
+                                        times 32 db 10, -2
36
+
37
+                                        times 32 db -4, 54
38
+                                        times 32 db 16, -2
39
+
40
+                                        times 32 db -6, 46
41
+                                        times 32 db 28, -4
42
+
43
+                                        times 32 db -4, 36
44
+                                        times 32 db 36, -4
45
+
46
+                                        times 32 db -4, 28
47
+                                        times 32 db 46, -6
48
+
49
+                                        times 32 db -2, 16
50
+                                        times 32 db 54, -4
51
+
52
+                                        times 32 db -2, 10
53
+                                        times 32 db 58, -2
54
+
55
+ALIGN 64
56
+const pw_ChromaCoeffVer_32_avx512,      times 16 dw 0, 64
57
+                                        times 16 dw 0, 0
58
+
59
+                                        times 16 dw -2, 58
60
+                                        times 16 dw 10, -2
61
+
62
+                                        times 16 dw -4, 54
63
+                                        times 16 dw 16, -2
64
+
65
+                                        times 16 dw -6, 46
66
+                                        times 16 dw 28, -4
67
+
68
+                                        times 16 dw -4, 36
69
+                                        times 16 dw 36, -4
70
+
71
+                                        times 16 dw -4, 28
72
+                                        times 16 dw 46, -6
73
+
74
+                                        times 16 dw -2, 16
75
+                                        times 16 dw 54, -4
76
+
77
+                                        times 16 dw -2, 10
78
+                                        times 16 dw 58, -2
79
+
80
+ALIGN 64
81
+const pw_LumaCoeffVer_avx512,           times 16 dw 0, 0
82
+                                        times 16 dw 0, 64
83
+                                        times 16 dw 0, 0
84
+                                        times 16 dw 0, 0
85
+
86
+                                        times 16 dw -1, 4
87
+                                        times 16 dw -10, 58
88
+                                        times 16 dw 17, -5
89
+                                        times 16 dw 1, 0
90
+
91
+                                        times 16 dw -1, 4
92
+                                        times 16 dw -11, 40
93
+                                        times 16 dw 40, -11
94
+                                        times 16 dw 4, -1
95
+
96
+                                        times 16 dw 0, 1
97
+                                        times 16 dw -5, 17
98
+                                        times 16 dw 58, -10
99
+                                        times 16 dw 4, -1
100
+
101
+ALIGN 64
102
+const tab_LumaCoeffVer_32_avx512,       times 32 db 0, 0
103
+                                        times 32 db 0, 64
104
+                                        times 32 db 0, 0
105
+                                        times 32 db 0, 0
106
+
107
+                                        times 32 db -1, 4
108
+                                        times 32 db -10, 58
109
+                                        times 32 db 17, -5
110
+                                        times 32 db 1, 0
111
+
112
+                                        times 32 db -1, 4
113
+                                        times 32 db -11, 40
114
+                                        times 32 db 40, -11
115
+                                        times 32 db 4, -1
116
+
117
+                                        times 32 db 0, 1
118
+                                        times 32 db -5, 17
119
+                                        times 32 db 58, -10
120
+                                        times 32 db 4, -1
121
+
122
 const tab_c_64_n64, times 8 db 64, -64
123
 
124
 const interp8_hps_shuf,     dd 0, 4, 1, 5, 2, 6, 3, 7
125
 
126
-SECTION .text
127
+const interp4_horiz_shuf_load1_avx512,  times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
128
+const interp4_horiz_shuf_load2_avx512,  times 2 db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
129
+const interp4_horiz_shuf_load3_avx512,  times 2 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
130
+
131
+ALIGN 64
132
+interp4_vps_store1_avx512:   dq 0, 1, 8, 9, 2, 3, 10, 11
133
+interp4_vps_store2_avx512:   dq 4, 5, 12, 13, 6, 7, 14, 15
134
+const interp4_hps_shuf_avx512,  dq 0, 4, 1, 5, 2, 6, 3, 7
135
+const interp4_hps_store_16xN_avx512,  dq 0, 2, 1, 3, 4, 6, 5, 7
136
+const interp8_hps_store_avx512,  dq 0, 1, 4, 5, 2, 3, 6, 7
137
+const interp8_vsp_store_avx512,  dq 0, 2, 4, 6, 1, 3, 5, 7
138
 
139
+SECTION .text
140
 cextern pb_128
141
 cextern pw_1
142
 cextern pw_32
143
@@ -1954,6 +2066,276 @@
144
     P2S_H_32xN_avx2 48
145
 
146
 ;-----------------------------------------------------------------------------
147
+;p2s and p2s_aligned 32xN avx512 code start
148
+;-----------------------------------------------------------------------------
149
+
150
+%macro PROCESS_P2S_32x4_AVX512 0
151
+    pmovzxbw    m0, [r0]
152
+    pmovzxbw    m1, [r0 + r1]
153
+    pmovzxbw    m2, [r0 + r1 * 2]
154
+    pmovzxbw    m3, [r0 + r5]
155
+
156
+    psllw       m0, 6
157
+    psllw       m1, 6
158
+    psllw       m2, 6
159
+    psllw       m3, 6
160
+    psubw       m0, m4
161
+    psubw       m1, m4
162
+    psubw       m2, m4
163
+    psubw       m3, m4
164
+
165
+    movu        [r2],           m0
166
+    movu        [r2 + r3],      m1
167
+    movu        [r2 + r3 * 2],  m2
168
+    movu        [r2 + r6],      m3
169
+%endmacro
170
+
171
+;-----------------------------------------------------------------------------
172
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride)
173
+;-----------------------------------------------------------------------------
174
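All of the p2s kernels in this patch implement the conversion declared in this prototype: promote pixels to the 14-bit internal precision and re-centre them around zero. A scalar sketch of my reading of the macros (shift left, then subtract the pw_2000 constant, 0x2000 = 8192); treat the constant names as assumptions:

    #include <stdint.h>

    /* Scalar model of filterPixelToShort for the 8-bit path shown here.
     * The high-bit-depth variant earlier in this patch shifts by 14 - BIT_DEPTH
     * instead of 6; both subtract 8192 (pw_2000). */
    static void filter_pixel_to_short_c(const uint8_t *src, intptr_t srcStride,
                                        int16_t *dst, intptr_t dstStride,
                                        int width, int height)
    {
        const int shift  = 6;      /* 14 - 8 for an 8-bit build */
        const int offset = 8192;   /* the pw_2000 constant      */
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
                dst[x] = (int16_t)((src[x] << shift) - offset);
    }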
+%if ARCH_X86_64
175
+INIT_ZMM avx512
176
+cglobal filterPixelToShort_32x8, 3, 7, 5
177
+    mov         r3d, r3m
178
+    add         r3d, r3d
179
+    lea         r5, [r1 * 3]
180
+    lea         r6, [r3 * 3]
181
+
182
+    ; load constant
183
+    vpbroadcastd      m4, [pw_2000]
184
+
185
+    PROCESS_P2S_32x4_AVX512
186
+    lea         r0, [r0 + r1 * 4]
187
+    lea         r2, [r2 + r3 * 4]
188
+    PROCESS_P2S_32x4_AVX512
189
+    RET
190
+
191
+INIT_ZMM avx512
192
+cglobal filterPixelToShort_32x16, 3, 7, 5
193
+    mov         r3d, r3m
194
+    add         r3d, r3d
195
+    lea         r5, [r1 * 3]
196
+    lea         r6, [r3 * 3]
197
+
198
+    ; load constant
199
+    vpbroadcastd      m4, [pw_2000]
200
+
201
x265_2.7.tar.gz/source/common/x86/ipfilter8.h -> x265_2.9.tar.gz/source/common/x86/ipfilter8.h Changed
16
 
1
@@ -33,6 +33,7 @@
2
     FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
3
     FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); \
4
     FUNCDEF_CHROMA_PU(void, filterPixelToShort, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
5
+    FUNCDEF_CHROMA_PU(void, filterPixelToShort_aligned, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
6
     FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
7
     FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
8
     FUNCDEF_CHROMA_PU(void, interp_4tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
9
@@ -45,5 +46,6 @@
10
 SETUP_FUNC_DEF(sse3);
11
 SETUP_FUNC_DEF(sse4);
12
 SETUP_FUNC_DEF(avx2);
13
+SETUP_FUNC_DEF(avx512);
14
 
15
 #endif // ifndef X265_IPFILTER8_H
16
x265_2.7.tar.gz/source/common/x86/loopfilter.asm -> x265_2.9.tar.gz/source/common/x86/loopfilter.asm Changed
50
 
1
@@ -58,6 +58,7 @@
2
 ;============================================================================================================
3
 INIT_XMM sse4
4
 %if HIGH_BIT_DEPTH
5
+%if ARCH_X86_64
6
 cglobal saoCuOrgE0, 4,5,9
7
     mov         r4d, r4m
8
     movh        m6,  [r1]
9
@@ -157,7 +158,7 @@
10
     sub         r4d, 16
11
     jnz        .loopH
12
     RET
13
-
14
+%endif
15
 %else ; HIGH_BIT_DEPTH == 1
16
 
17
 cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
18
@@ -249,6 +250,7 @@
19
 
20
 INIT_YMM avx2
21
 %if HIGH_BIT_DEPTH
22
+%if ARCH_X86_64
23
 cglobal saoCuOrgE0, 4,4,9
24
     vbroadcasti128  m6, [r1]
25
     movzx           r1d, byte [r3]
26
@@ -308,6 +310,7 @@
27
     dec             r2d
28
     jnz             .loop
29
     RET
30
+%endif
31
 %else ; HIGH_BIT_DEPTH
32
 cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
33
 
34
@@ -1655,6 +1658,7 @@
35
     RET
36
 %endif
37
 
38
+%if ARCH_X86_64
39
 INIT_YMM avx2
40
 %if HIGH_BIT_DEPTH
41
 cglobal saoCuOrgB0, 5,7,8
42
@@ -1814,6 +1818,7 @@
43
 .end:
44
     RET
45
 %endif
46
+%endif
47
 
48
 ;============================================================================================================
49
 ; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width)
50
x265_2.7.tar.gz/source/common/x86/mc-a.asm -> x265_2.9.tar.gz/source/common/x86/mc-a.asm Changed
201
 
1
@@ -46,13 +46,10 @@
2
    %error Unsupported bit depth!
    %error Unsupported bit depth!
3
 %endif
4
 
5
-SECTION_RODATA 32
6
+SECTION_RODATA 64
7
 
8
-ch_shuf: times 2 db 0,2,2,4,4,6,6,8,1,3,3,5,5,7,7,9
9
-ch_shuf_adj: times 8 db 0
10
-             times 8 db 2
11
-             times 8 db 4
12
-             times 8 db 6
13
+ALIGN 64
14
+const shuf_avx512,  dq 0, 2, 4, 6, 1, 3, 5, 7
15
 
16
 SECTION .text
17
 
18
@@ -1037,6 +1034,7 @@
19
 ;------------------------------------------------------------------------------
20
 ; avx2 asm for addAvg high_bit_depth
21
 ;------------------------------------------------------------------------------
22
+%if ARCH_X86_64
23
 INIT_YMM avx2
24
 cglobal addAvg_8x2, 6,6,2, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
25
     movu        xm0,         [r0]
26
@@ -1114,6 +1112,7 @@
27
     movu        [r2],        xm0
28
     movu        [r2 + r5],   xm2
29
     RET
30
+%endif
31
 
32
 %macro ADDAVG_W8_H4_AVX2 1
33
 cglobal addAvg_8x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
34
@@ -1168,13 +1167,16 @@
35
     RET
36
 %endmacro
37
 
38
+%if ARCH_X86_64
39
 ADDAVG_W8_H4_AVX2 4
40
 ADDAVG_W8_H4_AVX2 8
41
 ADDAVG_W8_H4_AVX2 12
42
 ADDAVG_W8_H4_AVX2 16
43
 ADDAVG_W8_H4_AVX2 32
44
 ADDAVG_W8_H4_AVX2 64
45
+%endif
46
 
47
+%if ARCH_X86_64
48
 cglobal addAvg_12x16, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
49
     mova           m4,             [pw_ %+ ADDAVG_ROUND]
50
     mova           m5,             [pw_pixel_max]
51
@@ -1258,6 +1260,7 @@
52
     dec            r6d
53
     jnz            .loop
54
     RET
55
+%endif
56
 
57
 %macro ADDAVG_W16_H4_AVX2 1
58
 cglobal addAvg_16x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
59
@@ -1299,6 +1302,7 @@
60
     RET
61
 %endmacro
62
 
63
+%if ARCH_X86_64
64
 ADDAVG_W16_H4_AVX2 4
65
 ADDAVG_W16_H4_AVX2 8
66
 ADDAVG_W16_H4_AVX2 12
67
@@ -1306,7 +1310,9 @@
68
 ADDAVG_W16_H4_AVX2 24
69
 ADDAVG_W16_H4_AVX2 32
70
 ADDAVG_W16_H4_AVX2 64
71
+%endif
72
 
73
+%if ARCH_X86_64
74
 cglobal addAvg_24x32, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
75
     mova        m4,              [pw_ %+ ADDAVG_ROUND]
76
     mova        m5,              [pw_pixel_max]
77
@@ -1418,6 +1424,7 @@
78
     dec         r6d
79
     jnz         .loop
80
     RET
81
+%endif
82
 
83
 %macro ADDAVG_W32_H2_AVX2 1
84
 cglobal addAvg_32x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
85
@@ -1477,13 +1484,16 @@
86
     RET
87
 %endmacro
88
 
89
+%if ARCH_X86_64
90
 ADDAVG_W32_H2_AVX2 8
91
 ADDAVG_W32_H2_AVX2 16
92
 ADDAVG_W32_H2_AVX2 24
93
 ADDAVG_W32_H2_AVX2 32
94
 ADDAVG_W32_H2_AVX2 48
95
 ADDAVG_W32_H2_AVX2 64
96
+%endif
97
 
98
+%if ARCH_X86_64
99
 cglobal addAvg_48x64, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
100
     mova        m4,              [pw_ %+ ADDAVG_ROUND]
101
     mova        m5,              [pw_pixel_max]
102
@@ -1557,6 +1567,7 @@
103
     dec         r6d
104
     jnz        .loop
105
     RET
106
+%endif
107
 
108
 %macro ADDAVG_W64_H1_AVX2 1
109
 cglobal addAvg_64x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
110
@@ -1652,10 +1663,729 @@
111
     RET
112
 %endmacro
113
 
114
+%if ARCH_X86_64
115
 ADDAVG_W64_H1_AVX2 16
116
 ADDAVG_W64_H1_AVX2 32
117
 ADDAVG_W64_H1_AVX2 48
118
 ADDAVG_W64_H1_AVX2 64
119
+%endif
120
+;-----------------------------------------------------------------------------
121
+;addAvg avx512 high bit depth code start
122
+;-----------------------------------------------------------------------------
123
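The addAvg kernels average two 16-bit intermediate (p2s-domain) blocks back into clipped pixels. A rough scalar sketch of what I believe the macros compute follows; the clamp bounds in m2/m5 and the rounding factors folded into m3/m4 are set up outside this excerpt, so the constants below are assumptions based on the usual x265 definitions (internal precision 14, p2s offset 8192):

    #include <stdint.h>

    /* Illustrative scalar model of addAvg for a high-bit-depth build
     * (assumption: pixel == uint16_t, X265_DEPTH == 10). */
    static void add_avg_c(const int16_t *src0, const int16_t *src1, uint16_t *dst,
                          intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                          int width, int height)
    {
        const int depth  = 10;
        const int shift  = 14 + 1 - depth;                 /* back out the p2s headroom */
        const int offset = (1 << (shift - 1)) + 2 * 8192;  /* rounding + 2x p2s offset  */
        const int maxval = (1 << depth) - 1;
        for (int y = 0; y < height; y++, src0 += src0Stride, src1 += src1Stride, dst += dstStride)
            for (int x = 0; x < width; x++) {
                int v = (src0[x] + src1[x] + offset) >> shift;
                dst[x] = (uint16_t)(v < 0 ? 0 : v > maxval ? maxval : v);
            }
    }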
+%macro PROCESS_ADDAVG_16x4_HBD_AVX512 0
124
+    movu              ym0,              [r0]
125
+    vinserti32x8       m0,              [r0 + r3], 1
126
+    movu              ym1,              [r1]
127
+    vinserti32x8       m1,              [r1 + r4], 1
128
+
129
+    paddw              m0,              m1
130
+    pmulhrsw           m0,              m3
131
+    paddw              m0,              m4
132
+    pmaxsw             m0,              m2
133
+    pminsw             m0,              m5
134
+
135
+    movu             [r2],              ym0
136
+    vextracti32x8    [r2 + r5],         m0, 1
137
+
138
+    movu              ym0,              [r0 + 2 * r3]
139
+    vinserti32x8       m0,              [r0 + r6], 1
140
+    movu              ym1,              [r1 + 2 * r4]
141
+    vinserti32x8       m1,              [r1 + r7], 1
142
+
143
+    paddw              m0,              m1
144
+    pmulhrsw           m0,              m3
145
+    paddw              m0,              m4
146
+    pmaxsw             m0,              m2
147
+    pminsw             m0,              m5
148
+
149
+    movu             [r2 + 2 * r5],    ym0
150
+    vextracti32x8    [r2 + r8],         m0, 1
151
+%endmacro
152
+
153
+%macro PROCESS_ADDAVG_32x4_HBD_AVX512 0
154
+    movu        m0,              [r0]
155
+    movu        m1,              [r1]
156
+    paddw       m0,              m1
157
+    pmulhrsw    m0,              m3
158
+    paddw       m0,              m4
159
+    pmaxsw      m0,              m2
160
+    pminsw      m0,              m5
161
+    movu        [r2],            m0
162
+
163
+    movu        m0,              [r0 + r3]
164
+    movu        m1,              [r1 + r4]
165
+    paddw       m0,              m1
166
+    pmulhrsw    m0,              m3
167
+    paddw       m0,              m4
168
+    pmaxsw      m0,              m2
169
+    pminsw      m0,              m5
170
+    movu        [r2 + r5],       m0
171
+
172
+    movu        m0,              [r0 + 2 * r3]
173
+    movu        m1,              [r1 + 2 * r4]
174
+    paddw       m0,              m1
175
+    pmulhrsw    m0,              m3
176
+    paddw       m0,              m4
177
+    pmaxsw      m0,              m2
178
+    pminsw      m0,              m5
179
+    movu        [r2 + 2 * r5],   m0
180
+
181
+    movu        m0,              [r0 + r6]
182
+    movu        m1,              [r1 + r7]
183
+    paddw       m0,              m1
184
+    pmulhrsw    m0,              m3
185
+    paddw       m0,              m4
186
+    pmaxsw      m0,              m2
187
+    pminsw      m0,              m5
188
+    movu        [r2 + r8],       m0
189
+%endmacro
190
+
191
+%macro PROCESS_ADDAVG_64x4_HBD_AVX512 0
192
+    movu        m0,              [r0]
193
+    movu        m1,              [r1]
194
+    paddw       m0,              m1
195
+    pmulhrsw    m0,              m3
196
+    paddw       m0,              m4
197
+    pmaxsw      m0,              m2
198
+    pminsw      m0,              m5
199
+    movu        [r2],            m0
200
+
201
x265_2.7.tar.gz/source/common/x86/pixel-a.asm -> x265_2.9.tar.gz/source/common/x86/pixel-a.asm Changed
201
 
1
@@ -45,6 +45,9 @@
2
            times 2 dw 1, -1
3
            times 4 dw 1
4
            times 2 dw 1, -1
5
+psy_pp_shuff1:   dq 0, 1, 8, 9, 4, 5, 12, 13
6
+psy_pp_shuff2:   dq 2, 3, 10, 11, 6, 7, 14, 15
7
+psy_pp_shuff3:   dq 0, 0, 8, 8, 1, 1, 9, 9
8
 
9
 ALIGN 32
10
 transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14
11
@@ -8145,6 +8148,243 @@
12
 %endif ; ARCH_X86_64=1
13
 %endif ; HIGH_BIT_DEPTH
14
 
15
+%macro SATD_AVX512_LOAD4 2 ; size, opmask
16
+    vpbroadcast%1 m0, [r0]
17
+    vpbroadcast%1 m0 {%2}, [r0+2*r1]
18
+    vpbroadcast%1 m2, [r2]
19
+    vpbroadcast%1 m2 {%2}, [r2+2*r3]
20
+    add           r0, r1
21
+    add           r2, r3
22
+    vpbroadcast%1 m1, [r0]
23
+    vpbroadcast%1 m1 {%2}, [r0+2*r1]
24
+    vpbroadcast%1 m3, [r2]
25
+    vpbroadcast%1 m3 {%2}, [r2+2*r3]
26
+%endmacro
27
+
28
+%macro SATD_AVX512_LOAD8 5 ; size, halfreg, opmask1, opmask2, opmask3
29
+    vpbroadcast%1 %{2}0, [r0]
30
+    vpbroadcast%1 %{2}0 {%3}, [r0+2*r1]
31
+    vpbroadcast%1 %{2}2, [r2]
32
+    vpbroadcast%1 %{2}2 {%3}, [r2+2*r3]
33
+    vpbroadcast%1    m0 {%4}, [r0+4*r1]
34
+    vpbroadcast%1    m2 {%4}, [r2+4*r3]
35
+    vpbroadcast%1    m0 {%5}, [r0+2*r4]
36
+    vpbroadcast%1    m2 {%5}, [r2+2*r5]
37
+    vpbroadcast%1 %{2}1, [r0+r1]
38
+    vpbroadcast%1 %{2}1 {%3}, [r0+r4]
39
+    vpbroadcast%1 %{2}3, [r2+r3]
40
+    vpbroadcast%1 %{2}3 {%3}, [r2+r5]
41
+    lea              r0, [r0+4*r1]
42
+    lea              r2, [r2+4*r3]
43
+    vpbroadcast%1    m1 {%4}, [r0+r1]
44
+    vpbroadcast%1    m3 {%4}, [r2+r3]
45
+    vpbroadcast%1    m1 {%5}, [r0+r4]
46
+    vpbroadcast%1    m3 {%5}, [r2+r5]
47
+%endmacro
48
+
49
+%macro SATD_AVX512_PACKED 0
50
+    DIFF_SUMSUB_SSSE3 0, 2, 1, 3, 4
51
+    SUMSUB_BA      w, 0, 1, 2
52
+    SBUTTERFLY   qdq, 0, 1, 2
53
+    SUMSUB_BA      w, 0, 1, 2
54
+    HMAXABSW2         0, 1, 2, 3
55
+%endmacro
56
+
57
+%macro SATD_AVX512_END 0-1 0 ; sa8d
58
+    paddw          m0 {k1}{z}, m1 ; zero-extend to dwords
59
+%if ARCH_X86_64
60
+%if mmsize == 64
61
+    vextracti32x8 ym1, m0, 1
62
+    paddd         ym0, ym1
63
+%endif
64
+%if mmsize >= 32
65
+    vextracti128  xm1, ym0, 1
66
+    paddd        xmm0, xm0, xm1
67
+%endif
68
+    punpckhqdq   xmm1, xmm0, xmm0
69
+    paddd        xmm0, xmm1
70
+    movq          rax, xmm0
71
+    rorx          rdx, rax, 32
72
+%if %1
73
+    lea           eax, [rax+rdx+1]
74
+    shr           eax, 1
75
+%else
76
+    add           eax, edx
77
+%endif
78
+%else
79
+    HADDD          m0, m1
80
+    movd          eax, xm0
81
+%if %1
82
+    inc           eax
83
+    shr           eax, 1
84
+%endif
85
+%endif
86
+    RET
87
+%endmacro
88
+
89
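The opmask loads above (SATD_AVX512_LOAD4/LOAD8) pack several rows of the block into one ZMM register so a single Hadamard pass covers multiple sub-blocks at once. The quantity being measured is ordinary SATD; for comparison, a scalar 4x4 reference is sketched below, using the usual x264/x265 normalization (half the summed magnitudes of the 2-D Hadamard of the differences), which I take as an assumption here:

    #include <stdint.h>
    #include <stdlib.h>

    /* Scalar 4x4 SATD: 2-D Hadamard transform of the pixel differences,
     * summed in absolute value and halved (assumption: 8-bit pixels). */
    static int satd_4x4_c(const uint8_t *p1, intptr_t s1, const uint8_t *p2, intptr_t s2)
    {
        int d[4][4], t[4][4], sum = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                d[y][x] = p1[y * s1 + x] - p2[y * s2 + x];
        for (int y = 0; y < 4; y++) {               /* horizontal butterflies */
            int a0 = d[y][0] + d[y][1], a1 = d[y][0] - d[y][1];
            int a2 = d[y][2] + d[y][3], a3 = d[y][2] - d[y][3];
            t[y][0] = a0 + a2; t[y][2] = a0 - a2;
            t[y][1] = a1 + a3; t[y][3] = a1 - a3;
        }
        for (int x = 0; x < 4; x++) {               /* vertical butterflies   */
            int a0 = t[0][x] + t[1][x], a1 = t[0][x] - t[1][x];
            int a2 = t[2][x] + t[3][x], a3 = t[2][x] - t[3][x];
            sum += abs(a0 + a2) + abs(a0 - a2) + abs(a1 + a3) + abs(a1 - a3);
        }
        return sum >> 1;
    }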
+%macro HMAXABSW2 4 ; a, b, tmp1, tmp2
90
+    pabsw     m%1, m%1
91
+    pabsw     m%2, m%2
92
+    psrldq    m%3, m%1, 2
93
+    psrld     m%4, m%2, 16
94
+    pmaxsw    m%1, m%3
95
+    pmaxsw    m%2, m%4
96
+%endmacro
97
+%if HIGH_BIT_DEPTH==0
98
+INIT_ZMM avx512
99
+cglobal pixel_satd_16x8_internal
100
+    vbroadcasti64x4 m6, [hmul_16p]
101
+    kxnorb           k2, k2, k2
102
+    mov             r4d, 0x55555555
103
+    knotw            k2, k2
104
+    kmovd            k1, r4d
105
+    lea              r4, [3*r1]
106
+    lea              r5, [3*r3]
107
+satd_16x8_avx512:
108
+    vbroadcasti128  ym0,      [r0]
109
+    vbroadcasti32x4  m0 {k2}, [r0+4*r1] ; 0 0 4 4
110
+    vbroadcasti128  ym4,      [r2]
111
+    vbroadcasti32x4  m4 {k2}, [r2+4*r3]
112
+    vbroadcasti128  ym2,      [r0+2*r1]
113
+    vbroadcasti32x4  m2 {k2}, [r0+2*r4] ; 2 2 6 6
114
+    vbroadcasti128  ym5,      [r2+2*r3]
115
+    vbroadcasti32x4  m5 {k2}, [r2+2*r5]
116
+    DIFF_SUMSUB_SSSE3 0, 4, 2, 5, 6
117
+    vbroadcasti128  ym1,      [r0+r1]
118
+    vbroadcasti128  ym4,      [r2+r3]
119
+    vbroadcasti128  ym3,      [r0+r4]
120
+    vbroadcasti128  ym5,      [r2+r5]
121
+    lea              r0, [r0+4*r1]
122
+    lea              r2, [r2+4*r3]
123
+    vbroadcasti32x4  m1 {k2}, [r0+r1] ; 1 1 5 5
124
+    vbroadcasti32x4  m4 {k2}, [r2+r3]
125
+    vbroadcasti32x4  m3 {k2}, [r0+r4] ; 3 3 7 7
126
+    vbroadcasti32x4  m5 {k2}, [r2+r5]
127
+    DIFF_SUMSUB_SSSE3 1, 4, 3, 5, 6
128
+    HADAMARD4_V       0, 1, 2, 3, 4
129
+    HMAXABSW2         0, 2, 4, 5
130
+    HMAXABSW2         1, 3, 4, 5
131
+    paddw            m4, m0, m2 ; m1
132
+    paddw            m2, m1, m3 ; m0
133
+    ret
134
+
135
+cglobal pixel_satd_8x8_internal
136
+    vbroadcasti64x4 m4, [hmul_16p]
137
+    mov     r4d, 0x55555555
138
+    kmovd    k1, r4d   ; 01010101
139
+    kshiftlb k2, k1, 5 ; 10100000
140
+    kshiftlb k3, k1, 4 ; 01010000
141
+    lea      r4, [3*r1]
142
+    lea      r5, [3*r3]
143
+satd_8x8_avx512:
144
+    SATD_AVX512_LOAD8 q, ym, k1, k2, k3 ; 2 0 2 0 6 4 6 4
145
+    SATD_AVX512_PACKED                  ; 3 1 3 1 7 5 7 5
146
+    ret
147
+
148
+cglobal pixel_satd_16x8, 4,6
149
+    call pixel_satd_16x8_internal_avx512
150
+    jmp satd_zmm_avx512_end
151
+
152
+cglobal pixel_satd_16x16, 4,6
153
+    call pixel_satd_16x8_internal_avx512
154
+    lea      r0, [r0+4*r1]
155
+    lea      r2, [r2+4*r3]
156
+    paddw    m7, m0, m1
157
+    call satd_16x8_avx512
158
+    paddw    m1, m7
159
+    jmp satd_zmm_avx512_end
160
+
161
+cglobal pixel_satd_8x8, 4,6
162
+    call pixel_satd_8x8_internal_avx512
163
+satd_zmm_avx512_end:
164
+    SATD_AVX512_END
165
+
166
+cglobal pixel_satd_8x16, 4,6
167
+    call pixel_satd_8x8_internal_avx512
168
+    lea      r0, [r0+4*r1]
169
+    lea      r2, [r2+4*r3]
170
+    paddw    m5, m0, m1
171
+    call satd_8x8_avx512
172
+    paddw    m1, m5
173
+    jmp satd_zmm_avx512_end
174
+
175
+INIT_YMM avx512
176
+cglobal pixel_satd_4x8_internal
177
+    vbroadcasti128 m4, [hmul_4p]
178
+    mov     r4d, 0x55550c
179
+    kmovd    k2, r4d   ; 00001100
180
+    kshiftlb k3, k2, 2 ; 00110000
181
+    kshiftlb k4, k2, 4 ; 11000000
182
+    kshiftrd k1, k2, 8 ; 01010101
183
+    lea      r4, [3*r1]
184
+    lea      r5, [3*r3]
185
+satd_4x8_avx512:
186
+    SATD_AVX512_LOAD8 d, xm, k2, k3, k4 ; 0 0 2 2 4 4 6 6
187
+satd_ymm_avx512:                        ; 1 1 3 3 5 5 7 7
188
+    SATD_AVX512_PACKED
189
+    ret
190
+
191
+cglobal pixel_satd_8x4, 4,5
192
+    mova     m4, [hmul_16p]
193
+    mov     r4d, 0x5555
194
+    kmovw    k1, r4d
195
+    SATD_AVX512_LOAD4 q, k1 ; 2 0 2 0
196
+    call satd_ymm_avx512    ; 3 1 3 1
197
+    jmp satd_ymm_avx512_end2
198
+
199
+cglobal pixel_satd_4x8, 4,6
200
+    call pixel_satd_4x8_internal_avx512
201
x265_2.7.tar.gz/source/common/x86/pixel-util.h -> x265_2.9.tar.gz/source/common/x86/pixel-util.h Changed
33
 
1
@@ -27,6 +27,7 @@
2
 
3
 #define DEFINE_UTILS(cpu) \
4
     FUNCDEF_TU_S2(void, getResidual, cpu, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); \
5
+    FUNCDEF_TU_S2(void, getResidual_aligned, cpu, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); \
6
     FUNCDEF_TU_S2(void, transpose, cpu, pixel* dest, const pixel* src, intptr_t stride); \
7
     FUNCDEF_TU(int, count_nonzero, cpu, const int16_t* quantCoeff); \
8
     uint32_t PFX(quant_ ## cpu(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)); \
9
@@ -36,6 +37,7 @@
10
     void PFX(weight_pp_ ## cpu(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)); \
11
     void PFX(weight_sp_ ## cpu(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset)); \
12
     void PFX(scale1D_128to64_ ## cpu(pixel*, const pixel*)); \
13
+    void PFX(scale1D_128to64_aligned_ ## cpu(pixel*, const pixel*)); \
14
     void PFX(scale2D_64to32_ ## cpu(pixel*, const pixel*, intptr_t)); \
15
     uint32_t PFX(costCoeffRemain_ ## cpu(uint16_t *absCoeff, int numNonZero, int idx)); \
16
     uint32_t PFX(costC1C2Flag_sse2(uint16_t *absCoeff, intptr_t numNonZero, uint8_t *baseCtxMod, intptr_t ctxOffset)); \
17
@@ -44,6 +46,7 @@
18
 DEFINE_UTILS(ssse3);
19
 DEFINE_UTILS(sse4);
20
 DEFINE_UTILS(avx2);
21
+DEFINE_UTILS(avx512);
22
 
23
 #undef DEFINE_UTILS
24
 
25
@@ -58,4 +61,7 @@
26
 uint32_t PFX(costCoeffNxN_sse4(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase));
27
 uint32_t PFX(costCoeffNxN_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase));
28
 
29
+int  PFX(count_nonzero_16x16_avx512(const int16_t* quantCoeff));
30
+int  PFX(count_nonzero_32x32_avx512(const int16_t* quantCoeff));
31
+
32
 #endif // ifndef X265_PIXEL_UTIL_H
33
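The two count_nonzero entry points declared above have a simple job: count the nonzero quantized coefficients in a 16x16 or 32x32 block. A scalar equivalent (my sketch, with numCoeff set to 256 or 1024 respectively):

    #include <stdint.h>

    /* Scalar equivalent of count_nonzero_16x16 / count_nonzero_32x32. */
    static int count_nonzero_c(const int16_t *quantCoeff, int numCoeff)
    {
        int count = 0;
        for (int i = 0; i < numCoeff; i++)
            count += (quantCoeff[i] != 0);
        return count;
    }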
x265_2.7.tar.gz/source/common/x86/pixel-util8.asm -> x265_2.9.tar.gz/source/common/x86/pixel-util8.asm Changed
201
 
1
@@ -4,6 +4,7 @@
2
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
3
 ;*          Nabajit Deka <nabajit@multicorewareinc.com>
4
 ;*          Rajesh Paulraj <rajesh@multicorewareinc.com>
5
+;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
6
 ;*
7
 ;* This program is free software; you can redistribute it and/or modify
8
 ;* it under the terms of the GNU General Public License as published by
9
@@ -26,7 +27,13 @@
10
 %include "x86inc.asm"
11
 %include "x86util.asm"
12
 
13
-SECTION_RODATA 32
14
+SECTION_RODATA 64
15
+
16
+var_shuf_avx512: db 0,-1, 1,-1, 2,-1, 3,-1, 4,-1, 5,-1, 6,-1, 7,-1
17
+                 db 8,-1, 9,-1,10,-1,11,-1,12,-1,13,-1,14,-1,15,-1
18
+ALIGN 64
19
+const dequant_shuf1_avx512,  dq 0, 2, 4, 6, 1, 3, 5, 7
20
+const dequant_shuf2_avx512,  dq 0, 4, 1, 5, 2, 6, 3, 7
21
 
22
 %if BIT_DEPTH == 12
23
 ssim_c1:   times 4 dd 107321.76    ; .01*.01*4095*4095*64
24
@@ -552,6 +559,262 @@
25
 %endrep
26
     RET
27
 %endif
28
+
29
+%macro PROCESS_GETRESIDUAL32_W4_HBD_AVX512 0
30
+    movu        m0, [r0]
31
+    movu        m1, [r0 + r3]
32
+    movu        m2, [r0 + r3 * 2]
33
+    movu        m3, [r0 + r4]
34
+    lea         r0, [r0 + r3 * 4]
35
+
36
+    movu        m4, [r1]
37
+    movu        m5, [r1 + r3]
38
+    movu        m6, [r1 + r3 * 2]
39
+    movu        m7, [r1 + r4]
40
+    lea         r1, [r1 + r3 * 4]
41
+
42
+    psubw       m0, m4
43
+    psubw       m1, m5
44
+    psubw       m2, m6
45
+    psubw       m3, m7
46
+
47
+    movu        [r2], m0
48
+    movu        [r2 + r3], m1
49
+    movu        [r2 + r3 * 2], m2
50
+    movu        [r2 + r4], m3
51
+    lea         r2, [r2 + r3 * 4]
52
+%endmacro
53
+
54
+%macro PROCESS_GETRESIDUAL32_W4_HBD_AVX512_END 0
55
+    movu        m0, [r0]
56
+    movu        m1, [r0 + r3]
57
+    movu        m2, [r0 + r3 * 2]
58
+    movu        m3, [r0 + r4]
59
+
60
+    movu        m4, [r1]
61
+    movu        m5, [r1 + r3]
62
+    movu        m6, [r1 + r3 * 2]
63
+    movu        m7, [r1 + r4]
64
+
65
+    psubw       m0, m4
66
+    psubw       m1, m5
67
+    psubw       m2, m6
68
+    psubw       m3, m7
69
+
70
+    movu        [r2], m0
71
+    movu        [r2 + r3], m1
72
+    movu        [r2 + r3 * 2], m2
73
+    movu        [r2 + r4], m3
74
+%endmacro
75
+
76
+%macro PROCESS_GETRESIDUAL32_W4_AVX512 0
77
+    pmovzxbw    m0, [r0]
78
+    pmovzxbw    m1, [r0 + r3]
79
+    pmovzxbw    m2, [r0 + r3 * 2]
80
+    pmovzxbw    m3, [r0 + r4]
81
+    lea         r0, [r0 + r3 * 4]
82
+
83
+    pmovzxbw    m4, [r1]
84
+    pmovzxbw    m5, [r1 + r3]
85
+    pmovzxbw    m6, [r1 + r3 * 2]
86
+    pmovzxbw    m7, [r1 + r4]
87
+    lea         r1, [r1 + r3 * 4]
88
+
89
+    psubw       m0, m4
90
+    psubw       m1, m5
91
+    psubw       m2, m6
92
+    psubw       m3, m7
93
+
94
+    movu        [r2], m0
95
+    movu        [r2 + r3 * 2], m1
96
+    lea         r2, [r2 + r3 * 4]
97
+    movu        [r2], m2
98
+    movu        [r2 + r3 * 2], m3
99
+    lea         r2, [r2 + r3 * 4]
100
+%endmacro
101
+
102
+%macro PROCESS_GETRESIDUAL32_W4_AVX512_END 0
103
+    pmovzxbw    m0, [r0]
104
+    pmovzxbw    m1, [r0 + r3]
105
+    pmovzxbw    m2, [r0 + r3 * 2]
106
+    pmovzxbw    m3, [r0 + r4]
107
+
108
+    pmovzxbw    m4, [r1]
109
+    pmovzxbw    m5, [r1 + r3]
110
+    pmovzxbw    m6, [r1 + r3 * 2]
111
+    pmovzxbw    m7, [r1 + r4]
112
+
113
+    psubw       m0, m4
114
+    psubw       m1, m5
115
+    psubw       m2, m6
116
+    psubw       m3, m7
117
+
118
+    movu        [r2], m0
119
+    movu        [r2 + r3 * 2], m1
120
+    lea         r2, [r2 + r3 * 4]
121
+    movu        [r2], m2
122
+    movu        [r2 + r3 * 2], m3
123
+%endmacro
124
+
125
+
126
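The PROCESS_GETRESIDUAL32 macros above, and the getResidual32 entry points that follow, all compute the same thing: the per-pixel difference between the original and predicted 32x32 blocks, widened to int16. A scalar sketch (assuming pixel is uint8_t in the non-HBD path; x265 passes one stride shared by all three buffers):

    #include <stdint.h>

    /* Scalar model of getResidual32: residual = source - prediction, per pixel. */
    static void get_residual_c(const uint8_t *fenc, const uint8_t *pred,
                               int16_t *residual, intptr_t stride, int blockSize)
    {
        for (int y = 0; y < blockSize; y++, fenc += stride, pred += stride, residual += stride)
            for (int x = 0; x < blockSize; x++)
                residual[x] = (int16_t)(fenc[x] - pred[x]);
    }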
+%if HIGH_BIT_DEPTH
127
+INIT_ZMM avx512
128
+cglobal getResidual32, 4,5,8
129
+    add         r3, r3
130
+    lea         r4, [r3 * 3]
131
+
132
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
133
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
134
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
135
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
136
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
137
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
138
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512
139
+    PROCESS_GETRESIDUAL32_W4_HBD_AVX512_END
140
+    RET
141
+%else
142
+INIT_ZMM avx512
143
+cglobal getResidual32, 4,5,8
144
+    lea         r4, [r3 * 3]
145
+
146
+    PROCESS_GETRESIDUAL32_W4_AVX512
147
+    PROCESS_GETRESIDUAL32_W4_AVX512
148
+    PROCESS_GETRESIDUAL32_W4_AVX512
149
+    PROCESS_GETRESIDUAL32_W4_AVX512
150
+    PROCESS_GETRESIDUAL32_W4_AVX512
151
+    PROCESS_GETRESIDUAL32_W4_AVX512
152
+    PROCESS_GETRESIDUAL32_W4_AVX512
153
+    PROCESS_GETRESIDUAL32_W4_AVX512_END
154
+    RET
155
+%endif
156
+
157
+%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 0
158
+    movu        m0, [r0]
159
+    movu        m1, [r0 + r3]
160
+    movu        m2, [r0 + r3 * 2]
161
+    movu        m3, [r0 + r4]
162
+    lea         r0, [r0 + r3 * 4]
163
+
164
+    movu        m4, [r1]
165
+    movu        m5, [r1 + r3]
166
+    movu        m6, [r1 + r3 * 2]
167
+    movu        m7, [r1 + r4]
168
+    lea         r1, [r1 + r3 * 4]
169
+
170
+    psubw       m0, m4
171
+    psubw       m1, m5
172
+    psubw       m2, m6
173
+    psubw       m3, m7
174
+
175
+    movu        [r2], m0
176
+    movu        [r2 + r3], m1
177
+    movu        [r2 + r3 * 2], m2
178
+    movu        [r2 + r4], m3
179
+    lea         r2, [r2 + r3 * 4]
180
+%endmacro
181
+
182
+%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512_END 0
183
+    movu        m0, [r0]
184
+    movu        m1, [r0 + r3]
185
+    movu        m2, [r0 + r3 * 2]
186
+    movu        m3, [r0 + r4]
187
+
188
+    movu        m4, [r1]
189
+    movu        m5, [r1 + r3]
190
+    movu        m6, [r1 + r3 * 2]
191
+    movu        m7, [r1 + r4]
192
+
193
+    psubw       m0, m4
194
+    psubw       m1, m5
195
+    psubw       m2, m6
196
+    psubw       m3, m7
197
+
198
+    movu        [r2], m0
199
+    movu        [r2 + r3], m1
200
+    movu        [r2 + r3 * 2], m2
201
x265_2.7.tar.gz/source/common/x86/pixel.h -> x265_2.9.tar.gz/source/common/x86/pixel.h Changed
37
 
1
@@ -34,6 +34,7 @@
2
 void PFX(downShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
3
 void PFX(upShift_16_sse2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
4
 void PFX(upShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
5
+void PFX(upShift_16_avx512)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
6
 void PFX(upShift_8_sse4)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
7
 void PFX(upShift_8_avx2)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
8
 pixel PFX(planeClipAndMax_avx2)(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix);
9
@@ -44,14 +45,19 @@
10
     FUNCDEF_PU(void, pixel_sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
11
     FUNCDEF_PU(void, pixel_sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
12
     FUNCDEF_PU(void, pixel_avg, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \
13
+    FUNCDEF_PU(void, pixel_avg_aligned, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \
14
     FUNCDEF_PU(void, pixel_add_ps, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \
15
+    FUNCDEF_PU(void, pixel_add_ps_aligned, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \
16
     FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \
17
     FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
18
     FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
19
     FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \
20
     FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
21
+    FUNCDEF_CHROMA_PU(void, addAvg_aligned, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
22
     FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
23
+    FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \
24
     FUNCDEF_TU_S(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
25
+    FUNCDEF_TU_S(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \
26
     FUNCDEF_TU(uint64_t, pixel_var, cpu, const pixel*, intptr_t); \
27
     FUNCDEF_TU(int, psyCost_pp, cpu, const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); \
28
     FUNCDEF_TU(int, psyCost_ss, cpu, const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride)
29
@@ -65,6 +71,7 @@
30
 DECL_PIXELS(avx);
31
 DECL_PIXELS(xop);
32
 DECL_PIXELS(avx2);
33
+DECL_PIXELS(avx512);
34
 
35
 #undef DECL_PIXELS
36
 
37
x265_2.7.tar.gz/source/common/x86/pixeladd8.asm -> x265_2.9.tar.gz/source/common/x86/pixeladd8.asm Changed
201
 
1
@@ -24,11 +24,11 @@
2
 
3
 %include "x86inc.asm"
4
 %include "x86util.asm"
5
+SECTION_RODATA 64
6
 
7
-SECTION_RODATA 32
8
-
9
+ALIGN 64
10
+const store_shuf1_avx512,  dq 0, 2, 4, 6, 1, 3, 5, 7
11
 SECTION .text
12
-
13
 cextern pw_pixel_max
14
 
15
 ;-----------------------------------------------------------------------------
16
@@ -768,7 +768,6 @@
17
 PIXEL_ADD_PS_W32_H4_avx2 32
18
 PIXEL_ADD_PS_W32_H4_avx2 64
19
 
20
-
21
 ;-----------------------------------------------------------------------------
22
 ; void pixel_add_ps_64x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
23
 ;-----------------------------------------------------------------------------
24
@@ -1145,3 +1144,505 @@
25
     RET
26
 
27
 %endif
28
+
29
+;-----------------------------------------------------------------------------
30
+; pixel_add_ps avx512 code start
31
+;-----------------------------------------------------------------------------
32
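pixel_add_ps is the inverse of getResidual: add an int16 residual block back onto the prediction and clip to the pixel range. A scalar sketch for the 8-bit path (the high-bit-depth macros below clip against pixel_max via CLIPW2 instead of relying on saturating byte packing):

    #include <stdint.h>

    /* Scalar model of pixel_add_ps (8-bit path): dst = clip(pred + residual). */
    static void pixel_add_ps_c(uint8_t *dst, intptr_t dstStride,
                               const uint8_t *pred, const int16_t *residual,
                               intptr_t predStride, intptr_t resStride,
                               int width, int height)
    {
        for (int y = 0; y < height; y++, dst += dstStride, pred += predStride, residual += resStride)
            for (int x = 0; x < width; x++) {
                int v = pred[x] + residual[x];
                dst[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
            }
    }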
+%macro PROCESS_ADD_PS_64x4_AVX512 0
33
+    pmovzxbw    m0,         [r2]
34
+    pmovzxbw    m1,         [r2 + mmsize/2]
35
+    movu        m2,         [r3]
36
+    movu        m3,         [r3 + mmsize]
37
+    paddw       m0,         m2
38
+    paddw       m1,         m3
39
+    packuswb    m0,         m1
40
+    vpermq      m0,         m4,      m0
41
+    movu        [r0],       m0
42
+    pmovzxbw    m0,         [r2 + r4]
43
+    pmovzxbw    m1,         [r2 + r4 + mmsize/2]
44
+    movu        m2,         [r3 + r5]
45
+    movu        m3,         [r3 + r5 + mmsize]
46
+    paddw       m0,         m2
47
+    paddw       m1,         m3
48
+    packuswb    m0,         m1
49
+    vpermq      m0,         m4,      m0
50
+    movu        [r0 + r1],  m0
51
+    pmovzxbw    m0,         [r2 + 2 * r4]
52
+    pmovzxbw    m1,         [r2 + 2 * r4 + mmsize/2]
53
+    movu        m2,         [r3 + 2 * r5]
54
+    movu        m3,         [r3 + 2 * r5 + mmsize]
55
+    paddw       m0,         m2
56
+    paddw       m1,         m3
57
+    packuswb    m0,         m1
58
+    vpermq      m0,         m4,      m0
59
+    movu        [r0 + 2 * r1],       m0
60
+
61
+    pmovzxbw    m0,         [r2 + r7]
62
+    pmovzxbw    m1,         [r2 + r7 + mmsize/2]
63
+    movu        m2,         [r3 + r8]
64
+    movu        m3,         [r3 + r8 + mmsize]
65
+    paddw       m0,         m2
66
+    paddw       m1,         m3
67
+    packuswb    m0,         m1
68
+    vpermq      m0,         m4,      m0
69
+    movu        [r0 + r6],       m0
70
+%endmacro
71
+
72
+%macro PROCESS_ADD_PS_64x4_HBD_AVX512 0
73
+    movu    m0,     [r2]
74
+    movu    m1,     [r2 + mmsize]
75
+    movu    m2,     [r3]
76
+    movu    m3,     [r3 + mmsize]
77
+    paddw   m0,     m2
78
+    paddw   m1,     m3
79
+
80
+    CLIPW2  m0, m1, m4, m5
81
+    movu    [r0],                m0
82
+    movu    [r0 + mmsize],       m1
83
+
84
+    movu    m0,     [r2 + r4]
85
+    movu    m1,     [r2 + r4 + mmsize]
86
+    movu    m2,     [r3 + r5]
87
+    movu    m3,     [r3 + r5 + mmsize]
88
+    paddw   m0,     m2
89
+    paddw   m1,     m3
90
+
91
+    CLIPW2  m0, m1, m4, m5
92
+    movu    [r0 + r1],           m0
93
+    movu    [r0 + r1 + mmsize],  m1
94
+
95
+    movu    m0,     [r2 + r4 * 2]
96
+    movu    m1,     [r2 + r4 * 2 + mmsize]
97
+    movu    m2,     [r3 + r5 * 2]
98
+    movu    m3,     [r3 + r5 * 2 + mmsize]
99
+    paddw   m0,     m2
100
+    paddw   m1,     m3
101
+
102
+    CLIPW2  m0, m1, m4, m5
103
+    movu    [r0 + r1 * 2],           m0
104
+    movu    [r0 + r1 * 2 + mmsize],  m1
105
+
106
+    movu    m0,     [r2 + r6]
107
+    movu    m1,     [r2 + r6 + mmsize]
108
+    movu    m2,     [r3 + r7]
109
+    movu    m3,     [r3 + r7 + mmsize]
110
+    paddw   m0,     m2
111
+    paddw   m1,     m3
112
+
113
+    CLIPW2  m0, m1, m4, m5
114
+    movu    [r0 + r8],               m0
115
+    movu    [r0 + r8 + mmsize],      m1
116
+%endmacro
117
+
118
+%macro PROCESS_ADD_PS_64x4_ALIGNED_AVX512 0
119
+    pmovzxbw    m0,         [r2]
120
+    pmovzxbw    m1,         [r2 + mmsize/2]
121
+    mova        m2,         [r3]
122
+    mova        m3,         [r3 + mmsize]
123
+    paddw       m0,         m2
124
+    paddw       m1,         m3
125
+    packuswb    m0,         m1
126
+    vpermq      m0,         m4,      m0
127
+    mova        [r0],       m0
128
+    pmovzxbw    m0,         [r2 + r4]
129
+    pmovzxbw    m1,         [r2 + r4 + mmsize/2]
130
+    mova        m2,         [r3 + r5]
131
+    mova        m3,         [r3 + r5 + mmsize]
132
+    paddw       m0,         m2
133
+    paddw       m1,         m3
134
+    packuswb    m0,         m1
135
+    vpermq      m0,         m4,      m0
136
+    mova        [r0 + r1],  m0
137
+    pmovzxbw    m0,         [r2 + 2 * r4]
138
+    pmovzxbw    m1,         [r2 + 2 * r4 + mmsize/2]
139
+    mova        m2,         [r3 + 2 * r5]
140
+    mova        m3,         [r3 + 2 * r5 + mmsize]
141
+    paddw       m0,         m2
142
+    paddw       m1,         m3
143
+    packuswb    m0,         m1
144
+    vpermq      m0,         m4,      m0
145
+    mova        [r0 + 2 * r1],       m0
146
+
147
+    pmovzxbw    m0,         [r2 + r7]
148
+    pmovzxbw    m1,         [r2 + r7 + mmsize/2]
149
+    mova        m2,         [r3 + r8]
150
+    mova        m3,         [r3 + r8 + mmsize]
151
+    paddw       m0,         m2
152
+    paddw       m1,         m3
153
+    packuswb    m0,         m1
154
+    vpermq      m0,         m4,      m0
155
+    mova        [r0 + r6],       m0
156
+%endmacro
157
+
158
+%macro PROCESS_ADD_PS_64x4_HBD_ALIGNED_AVX512 0
159
+    mova    m0,     [r2]
160
+    mova    m1,     [r2 + mmsize]
161
+    mova    m2,     [r3]
162
+    mova    m3,     [r3 + mmsize]
163
+    paddw   m0,     m2
164
+    paddw   m1,     m3
165
+
166
+    CLIPW2  m0, m1, m4, m5
167
+    mova    [r0],                m0
168
+    mova    [r0 + mmsize],       m1
169
+
170
+    mova    m0,     [r2 + r4]
171
+    mova    m1,     [r2 + r4 + mmsize]
172
+    mova    m2,     [r3 + r5]
173
+    mova    m3,     [r3 + r5 + mmsize]
174
+    paddw   m0,     m2
175
+    paddw   m1,     m3
176
+
177
+    CLIPW2  m0, m1, m4, m5
178
+    mova    [r0 + r1],           m0
179
+    mova    [r0 + r1 + mmsize],  m1
180
+
181
+    mova    m0,     [r2 + r4 * 2]
182
+    mova    m1,     [r2 + r4 * 2 + mmsize]
183
+    mova    m2,     [r3 + r5 * 2]
184
+    mova    m3,     [r3 + r5 * 2 + mmsize]
185
+    paddw   m0,     m2
186
+    paddw   m1,     m3
187
+
188
+    CLIPW2  m0, m1, m4, m5
189
+    mova    [r0 + r1 * 2],           m0
190
+    mova    [r0 + r1 * 2 + mmsize],  m1
191
+
192
+    mova    m0,     [r2 + r6]
193
+    mova    m1,     [r2 + r6 + mmsize]
194
+    mova    m2,     [r3 + r7]
195
+    mova    m3,     [r3 + r7 + mmsize]
196
+    paddw   m0,     m2
197
+    paddw   m1,     m3
198
+
199
+    CLIPW2  m0, m1, m4, m5
200
+    mova    [r0 + r8],               m0
201
x265_2.7.tar.gz/source/common/x86/sad-a.asm -> x265_2.9.tar.gz/source/common/x86/sad-a.asm Changed
201
 
1
@@ -378,111 +378,63 @@
2
     lea     r0,  [r0 + r1]
3
 %endmacro
4
 
5
-%macro SAD_W16 0
6
-;-----------------------------------------------------------------------------
7
-; int pixel_sad_16x16( uint8_t *, intptr_t, uint8_t *, intptr_t )
8
-;-----------------------------------------------------------------------------
9
-cglobal pixel_sad_16x16, 4,4,8
10
-    movu    m0, [r2]
11
-    movu    m1, [r2+r3]
12
-    lea     r2, [r2+2*r3]
13
-    movu    m2, [r2]
14
-    movu    m3, [r2+r3]
15
-    lea     r2, [r2+2*r3]
16
-    psadbw  m0, [r0]
17
-    psadbw  m1, [r0+r1]
18
-    lea     r0, [r0+2*r1]
19
-    movu    m4, [r2]
20
-    paddw   m0, m1
21
-    psadbw  m2, [r0]
22
-    psadbw  m3, [r0+r1]
23
-    lea     r0, [r0+2*r1]
24
-    movu    m5, [r2+r3]
25
-    lea     r2, [r2+2*r3]
26
-    paddw   m2, m3
27
-    movu    m6, [r2]
28
-    movu    m7, [r2+r3]
29
-    lea     r2, [r2+2*r3]
30
-    paddw   m0, m2
31
-    psadbw  m4, [r0]
32
-    psadbw  m5, [r0+r1]
33
-    lea     r0, [r0+2*r1]
34
-    movu    m1, [r2]
35
-    paddw   m4, m5
36
-    psadbw  m6, [r0]
37
-    psadbw  m7, [r0+r1]
38
-    lea     r0, [r0+2*r1]
39
-    movu    m2, [r2+r3]
40
-    lea     r2, [r2+2*r3]
41
-    paddw   m6, m7
42
-    movu    m3, [r2]
43
-    paddw   m0, m4
44
-    movu    m4, [r2+r3]
45
-    lea     r2, [r2+2*r3]
46
-    paddw   m0, m6
47
-    psadbw  m1, [r0]
48
-    psadbw  m2, [r0+r1]
49
-    lea     r0, [r0+2*r1]
50
-    movu    m5, [r2]
51
-    paddw   m1, m2
52
-    psadbw  m3, [r0]
53
-    psadbw  m4, [r0+r1]
54
-    lea     r0, [r0+2*r1]
55
-    movu    m6, [r2+r3]
56
-    lea     r2, [r2+2*r3]
57
-    paddw   m3, m4
58
-    movu    m7, [r2]
59
-    paddw   m0, m1
60
-    movu    m1, [r2+r3]
61
-    paddw   m0, m3
62
-    psadbw  m5, [r0]
63
-    psadbw  m6, [r0+r1]
64
-    lea     r0, [r0+2*r1]
65
-    paddw   m5, m6
66
-    psadbw  m7, [r0]
67
-    psadbw  m1, [r0+r1]
68
-    paddw   m7, m1
69
-    paddw   m0, m5
70
-    paddw   m0, m7
71
-    SAD_END_SSE2
72
+%macro SAD_W16 1 ; h
73
+cglobal pixel_sad_16x%1, 4,4
74
+%ifidn cpuname, sse2
75
+.skip_prologue:
76
+%endif
77
+%assign %%i 0
78
+%if ARCH_X86_64
79
+    lea  r6, [3*r1] ; r6 results in fewer REX prefixes than r4 and both are volatile
80
+    lea  r5, [3*r3]
81
+%rep %1/4
82
+    movu     m1, [r2]
83
+    psadbw   m1, [r0]
84
+    movu     m3, [r2+r3]
85
+    psadbw   m3, [r0+r1]
86
+    movu     m2, [r2+2*r3]
87
+    psadbw   m2, [r0+2*r1]
88
+    movu     m4, [r2+r5]
89
+    psadbw   m4, [r0+r6]
90
+%if %%i != %1/4-1
91
+    lea      r2, [r2+4*r3]
92
+    lea      r0, [r0+4*r1]
93
+%endif
94
+    paddw    m1, m3
95
+    paddw    m2, m4
96
+    ACCUM paddw, 0, 1, %%i
97
+    paddw    m0, m2
98
+    %assign %%i %%i+1
99
+%endrep
100
+%else     ; The cost of having to save and restore registers on x86-32
101
+%rep %1/2 ; nullifies the benefit of having 3*stride in registers.
102
+    movu     m1, [r2]
103
+    psadbw   m1, [r0]
104
+    movu     m2, [r2+r3]
105
+    psadbw   m2, [r0+r1]
106
+%if %%i != %1/2-1
107
+    lea      r2, [r2+2*r3]
108
+    lea      r0, [r0+2*r1]
109
+%endif
110
+    ACCUM paddw, 0, 1, %%i
111
+    paddw    m0, m2
112
+    %assign %%i %%i+1
113
+%endrep
114
+%endif
115
+     SAD_END_SSE2
116
+ %endmacro
117
 
118
-;-----------------------------------------------------------------------------
119
-; int pixel_sad_16x8( uint8_t *, intptr_t, uint8_t *, intptr_t )
120
-;-----------------------------------------------------------------------------
121
-cglobal pixel_sad_16x8, 4,4
122
-    movu    m0, [r2]
123
-    movu    m2, [r2+r3]
124
-    lea     r2, [r2+2*r3]
125
-    movu    m3, [r2]
126
-    movu    m4, [r2+r3]
127
-    psadbw  m0, [r0]
128
-    psadbw  m2, [r0+r1]
129
-    lea     r0, [r0+2*r1]
130
-    psadbw  m3, [r0]
131
-    psadbw  m4, [r0+r1]
132
-    lea     r0, [r0+2*r1]
133
-    lea     r2, [r2+2*r3]
134
-    paddw   m0, m2
135
-    paddw   m3, m4
136
-    paddw   m0, m3
137
-    movu    m1, [r2]
138
-    movu    m2, [r2+r3]
139
-    lea     r2, [r2+2*r3]
140
-    movu    m3, [r2]
141
-    movu    m4, [r2+r3]
142
-    psadbw  m1, [r0]
143
-    psadbw  m2, [r0+r1]
144
-    lea     r0, [r0+2*r1]
145
-    psadbw  m3, [r0]
146
-    psadbw  m4, [r0+r1]
147
-    lea     r0, [r0+2*r1]
148
-    lea     r2, [r2+2*r3]
149
-    paddw   m1, m2
150
-    paddw   m3, m4
151
-    paddw   m0, m1
152
-    paddw   m0, m3
153
-    SAD_END_SSE2
154
+INIT_XMM sse2
155
+SAD_W16 8
156
+SAD_W16 16
157
+INIT_XMM sse3
158
+SAD_W16 8
159
+SAD_W16 16
160
+INIT_XMM sse2, aligned
161
+SAD_W16 8
162
+SAD_W16 16
163
 
164
+%macro SAD_Wx 0
165
 ;-----------------------------------------------------------------------------
166
 ; int pixel_sad_16x12( uint8_t *, intptr_t, uint8_t *, intptr_t )
167
 ;-----------------------------------------------------------------------------
168
@@ -808,11 +760,11 @@
169
 %endmacro
170
 
171
 INIT_XMM sse2
172
-SAD_W16
173
+SAD_Wx
174
 INIT_XMM sse3
175
-SAD_W16
176
+SAD_Wx
177
 INIT_XMM sse2, aligned
178
-SAD_W16
179
+SAD_Wx
180
 
181
 %macro SAD_INC_4x8P_SSE 1
182
     movq    m1, [r0]
183
@@ -841,7 +793,132 @@
184
     SAD_INC_4x8P_SSE 1
185
     SAD_INC_4x8P_SSE 1
186
     SAD_END_SSE2
187
+
188
+%macro SAD_W48_AVX512 3 ; w, h, d/q
189
+cglobal pixel_sad_%1x%2, 4,4
190
+    kxnorb        k1, k1, k1
191
+    kaddb         k1, k1, k1
192
+%assign %%i 0
193
+%if ARCH_X86_64 && %2 != 4
194
+    lea           r6, [3*r1]
195
+    lea           r5, [3*r3]
196
+%rep %2/4
197
+    mov%3         m1,      [r0]
198
+    vpbroadcast%3 m1 {k1}, [r0+r1]
199
+    mov%3         m3,      [r2]
200
+    vpbroadcast%3 m3 {k1}, [r2+r3]
201
x265_2.7.tar.gz/source/common/x86/sad16-a.asm -> x265_2.9.tar.gz/source/common/x86/sad16-a.asm Changed
201
 
1
@@ -1155,6 +1155,565 @@
2
 SAD_12  12, 16
3
 
4
 
5
+%macro PROCESS_SAD_64x8_AVX512 0
6
+    movu    m1, [r2]
7
+    movu    m2, [r2 + mmsize]
8
+    movu    m3, [r2 + r3]
9
+    movu    m4, [r2 + r3 + mmsize]
10
+    psubw   m1, [r0]
11
+    psubw   m2, [r0 + mmsize]
12
+    psubw   m3, [r0 + r1]
13
+    psubw   m4, [r0 + r1 + mmsize]
14
+    pabsw   m1, m1
15
+    pabsw   m2, m2
16
+    pabsw   m3, m3
17
+    pabsw   m4, m4
18
+    paddw   m1, m2
19
+    paddw   m3, m4
20
+    paddw   m5, m1, m3
21
+
22
+    movu    m1, [r2 + 2 * r3]
23
+    movu    m2, [r2 + 2 * r3 + mmsize]
24
+    movu    m3, [r2 + r5]
25
+    movu    m4, [r2 + r5 + mmsize]
26
+    psubw   m1, [r0 + 2 * r1]
27
+    psubw   m2, [r0 + 2 * r1 + mmsize]
28
+    psubw   m3, [r0 + r4]
29
+    psubw   m4, [r0 + r4 + mmsize]
30
+    pabsw   m1, m1
31
+    pabsw   m2, m2
32
+    pabsw   m3, m3
33
+    pabsw   m4, m4
34
+    paddw   m1, m2
35
+    paddw   m3, m4
36
+    paddw   m1, m3
37
+
38
+    lea     r0, [r0 + 4 * r1]
39
+    lea     r2, [r2 + 4 * r3]
40
+
41
+    pmaddwd m5, m6
42
+    paddd   m0, m5
43
+    pmaddwd m1, m6
44
+    paddd   m0, m1
45
+
46
+    movu    m1, [r2]
47
+    movu    m2, [r2 + mmsize]
48
+    movu    m3, [r2 + r3]
49
+    movu    m4, [r2 + r3 + mmsize]
50
+    psubw   m1, [r0]
51
+    psubw   m2, [r0 + mmsize]
52
+    psubw   m3, [r0 + r1]
53
+    psubw   m4, [r0 + r1 + mmsize]
54
+    pabsw   m1, m1
55
+    pabsw   m2, m2
56
+    pabsw   m3, m3
57
+    pabsw   m4, m4
58
+    paddw   m1, m2
59
+    paddw   m3, m4
60
+    paddw   m5, m1, m3
61
+
62
+    movu    m1, [r2 + 2 * r3]
63
+    movu    m2, [r2 + 2 * r3 + mmsize]
64
+    movu    m3, [r2 + r5]
65
+    movu    m4, [r2 + r5 + mmsize]
66
+    psubw   m1, [r0 + 2 * r1]
67
+    psubw   m2, [r0 + 2 * r1 + mmsize]
68
+    psubw   m3, [r0 + r4]
69
+    psubw   m4, [r0 + r4 + mmsize]
70
+    pabsw   m1, m1
71
+    pabsw   m2, m2
72
+    pabsw   m3, m3
73
+    pabsw   m4, m4
74
+    paddw   m1, m2
75
+    paddw   m3, m4
76
+    paddw   m1, m3
77
+
78
+    pmaddwd m5, m6
79
+    paddd   m0, m5
80
+    pmaddwd m1, m6
81
+    paddd   m0, m1
82
+%endmacro
83
+
84
+
85
+%macro PROCESS_SAD_32x8_AVX512 0
86
+    movu    m1, [r2]
87
+    movu    m2, [r2 + r3]
88
+    movu    m3, [r2 + 2 * r3]
89
+    movu    m4, [r2 + r5]
90
+    psubw   m1, [r0]
91
+    psubw   m2, [r0 + r1]
92
+    psubw   m3, [r0 + 2 * r1]
93
+    psubw   m4, [r0 + r4]
94
+    pabsw   m1, m1
95
+    pabsw   m2, m2
96
+    pabsw   m3, m3
97
+    pabsw   m4, m4
98
+    paddw   m1, m2
99
+    paddw   m3, m4
100
+    paddw   m5, m1, m3
101
+
102
+    lea     r0, [r0 + 4 * r1]
103
+    lea     r2, [r2 + 4 * r3]
104
+
105
+    movu    m1, [r2]
106
+    movu    m2, [r2 + r3]
107
+    movu    m3, [r2 + 2 * r3]
108
+    movu    m4, [r2 + r5]
109
+    psubw   m1, [r0]
110
+    psubw   m2, [r0 + r1]
111
+    psubw   m3, [r0 + 2 * r1]
112
+    psubw   m4, [r0 + r4]
113
+    pabsw   m1, m1
114
+    pabsw   m2, m2
115
+    pabsw   m3, m3
116
+    pabsw   m4, m4
117
+    paddw   m1, m2
118
+    paddw   m3, m4
119
+    paddw   m1, m3
120
+
121
+    pmaddwd m5, m6
122
+    paddd   m0, m5
123
+    pmaddwd m1, m6
124
+    paddd   m0, m1
125
+%endmacro
126
+
127
+%macro PROCESS_SAD_16x8_AVX512 0
128
+    movu            ym1, [r2]
129
+    vinserti64x4     m1, [r2 + r3],  1
130
+    movu            ym2, [r2 + 2 * r3]
131
+    vinserti64x4     m2, [r2 + r5],  1
132
+    movu            ym3, [r0]
133
+    vinserti64x4     m3, [r0 + r1],  1
134
+    movu            ym4, [r0 + 2 * r1]
135
+    vinserti64x4     m4, [r0 + r4],  1
136
+
137
+    psubw   m1, m3
138
+    psubw   m2, m4
139
+    pabsw   m1, m1
140
+    pabsw   m2, m2
141
+    paddw   m5, m1, m2
142
+
143
+    lea     r0, [r0 + 4 * r1]
144
+    lea     r2, [r2 + 4 * r3]
145
+
146
+    movu            ym1, [r2]
147
+    vinserti64x4     m1, [r2 + r3],  1
148
+    movu            ym2, [r2 + 2 * r3]
149
+    vinserti64x4     m2, [r2 + r5],  1
150
+    movu            ym3, [r0]
151
+    vinserti64x4     m3, [r0 + r1],  1
152
+    movu            ym4, [r0 + 2 * r1]
153
+    vinserti64x4     m4, [r0 + r4],  1
154
+
155
+    psubw   m1, m3
156
+    psubw   m2, m4
157
+    pabsw   m1, m1
158
+    pabsw   m2, m2
159
+    paddw   m1, m2
160
+
161
+    pmaddwd m5, m6
162
+    paddd   m0, m5
163
+    pmaddwd m1, m6
164
+    paddd   m0, m1
165
+%endmacro
166
+
167
+%macro PROCESS_SAD_AVX512_END 0
168
+    vextracti32x8  ym1, m0, 1
169
+    paddd          ym0, ym1
170
+    vextracti64x2  xm1, m0, 1
171
+    paddd          xm0, xm1
172
+    pshufd         xm1, xm0, 00001110b
173
+    paddd          xm0, xm1
174
+    pshufd         xm1, xm0, 00000001b
175
+    paddd          xm0, xm1
176
+    movd           eax, xm0
177
+%endmacro
178
+
179
+;-----------------------------------------------------------------------------
180
+; int pixel_sad_64x%1( uint16_t *, intptr_t, uint16_t *, intptr_t )
181
+;-----------------------------------------------------------------------------
182
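These AVX-512 kernels accumulate 16-bit absolute differences and then widen them with pmaddwd against pw_1; the quantity computed is the ordinary SAD over 16-bit pixels, as in this scalar sketch (strides here are in elements, whereas the asm doubles them into byte strides up front):

    #include <stdint.h>

    /* Scalar model of pixel_sad_WxH for high-bit-depth builds (uint16_t pixels). */
    static int sad_hbd_c(const uint16_t *pix1, intptr_t stride1,
                         const uint16_t *pix2, intptr_t stride2, int width, int height)
    {
        int sum = 0;
        for (int y = 0; y < height; y++, pix1 += stride1, pix2 += stride2)
            for (int x = 0; x < width; x++) {
                int d = pix1[x] - pix2[x];
                sum += d < 0 ? -d : d;
            }
        return sum;
    }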
+%if ARCH_X86_64
183
+INIT_ZMM avx512
184
+cglobal pixel_sad_64x16, 4,6,7
185
+    pxor    m0, m0
186
+
187
+    vbroadcasti32x8 m6, [pw_1]
188
+
189
+    add     r3d, r3d
190
+    add     r1d, r1d
191
+    lea     r4d, [r1 * 3]
192
+    lea     r5d, [r3 * 3]
193
+
194
+    PROCESS_SAD_64x8_AVX512
195
+    lea            r2, [r2 + 4 * r3]
196
+    lea            r0, [r0 + 4 * r1]
197
+    PROCESS_SAD_64x8_AVX512
198
+    PROCESS_SAD_AVX512_END
199
+    RET
200
+
201
x265_2.7.tar.gz/source/common/x86/ssd-a.asm -> x265_2.9.tar.gz/source/common/x86/ssd-a.asm Changed
201
 
1
@@ -141,6 +141,8 @@
2
 
3
 ; Function to find ssd for 32x16 block, sse2, 12 bit depth
4
 ; Defined separately to be called from SSD_ONE_32 macro
5
+%if ARCH_X86_64
6
+;This code is written for 64-bit architectures
7
 INIT_XMM sse2
8
 cglobal ssd_ss_32x16
9
     pxor        m8, m8
10
@@ -180,8 +182,10 @@
11
     paddq       m4, m5
12
     paddq       m9, m4
13
     ret
14
+%endif
15
 
16
 %macro SSD_ONE_32 0
17
+%if ARCH_X86_64
18
 cglobal pixel_ssd_ss_32x64, 4,7,10
19
     add         r1d, r1d
20
     add         r3d, r3d
21
@@ -193,7 +197,9 @@
22
     call        ssd_ss_32x16
23
     movq        rax, m9
24
     RET
25
+%endif
26
 %endmacro
27
+
28
 %macro SSD_ONE_SS_32 0
29
 cglobal pixel_ssd_ss_32x32, 4,5,8
30
     add         r1d, r1d
31
@@ -554,6 +560,7 @@
32
     RET
33
 %endmacro
34
 
35
+%if ARCH_X86_64
36
 INIT_YMM avx2
37
 cglobal pixel_ssd_16x16, 4,7,3
38
     FIX_STRIDES r1, r3
39
@@ -697,6 +704,108 @@
40
     movq            rax, xm3
41
     RET
42
 
43
+INIT_ZMM avx512
44
+cglobal pixel_ssd_32x2
45
+    pxor            m0, m0
46
+    movu            m1, [r0]
47
+    psubw           m1, [r2]
48
+    pmaddwd         m1, m1
49
+    paddd           m0, m1
50
+    movu            m1, [r0 + r1]
51
+    psubw           m1, [r2 + r3]
52
+    pmaddwd         m1, m1
53
+    paddd           m0, m1
54
+    lea             r0, [r0 + r1 * 2]
55
+    lea             r2, [r2 + r3 * 2]
56
+
57
+    mova            m1, m0
58
+    pxor            m2, m2
59
+    punpckldq       m0, m2
60
+    punpckhdq       m1, m2
61
+
62
+    paddq           m3, m0
63
+    paddq           m3, m1
64
+ret
65
+
66
+INIT_ZMM avx512
67
+cglobal pixel_ssd_32x32, 4,5,5
68
+    shl             r1d, 1
69
+    shl             r3d, 1
70
+    pxor            m3, m3
71
+    mov             r4, 16
72
+.iterate:
73
+    call            pixel_ssd_32x2
74
+    dec             r4d
75
+    jne             .iterate
76
+
77
+    vextracti32x8   ym4, m3, 1
78
+    paddq           ym3, ym4
79
+    vextracti32x4   xm4, m3, 1
80
+    paddq           xm3, xm4
81
+    movhlps         xm4, xm3
82
+    paddq           xm3, xm4
83
+    movq            rax, xm3
84
+RET
85
+
86
+INIT_ZMM avx512
87
+cglobal pixel_ssd_32x64, 4,5,5
88
+    shl             r1d, 1
89
+    shl             r3d, 1
90
+    pxor            m3, m3
91
+    mov             r4, 32
92
+.iterate:
93
+    call            pixel_ssd_32x2
94
+    dec             r4d
95
+    jne             .iterate
96
+
97
+    vextracti32x8   ym4, m3, 1
98
+    paddq           ym3, ym4
99
+    vextracti32x4   xm4, m3, 1
100
+    paddq           xm3, xm4
101
+    movhlps         xm4, xm3
102
+    paddq           xm3, xm4
103
+    movq            rax, xm3
104
+RET
105
+
106
+INIT_ZMM avx512
107
+cglobal pixel_ssd_64x64, 4,5,5
108
+    FIX_STRIDES     r1, r3
109
+    mov             r4d, 64
110
+    pxor            m3, m3
111
+
112
+.loop:
113
+    pxor            m0, m0
114
+    movu            m1, [r0]
115
+    psubw           m1, [r2]
116
+    pmaddwd         m1, m1
117
+    paddd           m0, m1
118
+    movu            m1, [r0 + mmsize]
119
+    psubw           m1, [r2 + mmsize]
120
+    pmaddwd         m1, m1
121
+    paddd           m0, m1
122
+
123
+    lea             r0, [r0 + r1]
124
+    lea             r2, [r2 + r3]
125
+
126
+    mova            m1, m0
127
+    pxor            m2, m2
128
+    punpckldq       m0, m2
129
+    punpckhdq       m1, m2
130
+    paddq           m3, m0
131
+    paddq           m3, m1
132
+
133
+    dec             r4d
134
+    jg              .loop
135
+
136
+    vextracti32x8   ym4, m3, 1
137
+    paddq           ym3, ym4
138
+    vextracti32x4   xm4, m3, 1
139
+    paddq           xm3, xm4
140
+    movhlps         xm4, xm3
141
+    paddq           xm3, xm4
142
+    movq            rax, xm3
143
+    RET
144
+%endif
145
 INIT_MMX mmx2
146
 SSD_ONE     4,  4
147
 SSD_ONE     4,  8
148
@@ -726,7 +835,9 @@
149
 %if BIT_DEPTH <= 10
150
     SSD_ONE    32, 64
151
     SSD_ONE    32, 32
152
+%if ARCH_X86_64
153
     SSD_TWO    64, 64
154
+%endif
155
 %else
156
     SSD_ONE_32
157
     SSD_ONE_SS_32
158
@@ -1377,7 +1488,126 @@
159
     HADDD       m2, m0
160
     movd        eax, xm2
161
     RET
162
+;-----------------------------------------------------------------------------
163
+; ssd_ss avx512 code start
164
+;-----------------------------------------------------------------------------
165
+%if ARCH_X86_64
166
+%macro PROCESS_SSD_SS_64x4_AVX512 0
167
+    movu        m0, [r0]
168
+    movu        m1, [r0 + mmsize]
169
+    movu        m2, [r0 + r1]
170
+    movu        m3, [r0 + r1 + mmsize]
171
+    movu        m4, [r2]
172
+    movu        m5, [r2 + mmsize]
173
+    movu        m6, [r2 + r3]
174
+    movu        m7, [r2 + r3 + mmsize]
175
+
176
+    psubw       m0, m4
177
+    psubw       m1, m5
178
+    psubw       m2, m6
179
+    psubw       m3, m7
180
+    pmaddwd     m0, m0
181
+    pmaddwd     m1, m1
182
+    pmaddwd     m2, m2
183
+    pmaddwd     m3, m3
184
+    paddd       m8, m0
185
+    paddd       m8, m1
186
+    paddd       m8, m2
187
+    paddd       m8, m3
188
 
189
+    movu        m0, [r0 + 2 * r1]
190
+    movu        m1, [r0 + 2 * r1 + mmsize]
191
+    movu        m2, [r0 + r5]
192
+    movu        m3, [r0 + r5 + mmsize]
193
+    movu        m4, [r2 + 2 * r3]
194
+    movu        m5, [r2 + 2 * r3 + mmsize]
195
+    movu        m6, [r2 + r6]
196
+    movu        m7, [r2 + r6 + mmsize]
197
+
198
+    psubw       m0, m4
199
+    psubw       m1, m5
200
+    psubw       m2, m6
201
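The new pixel_ssd_32xN / pixel_ssd_64x64 AVX-512 kernels above share one pattern: psubw + pmaddwd accumulate squared 16-bit differences as dwords, and after each one- or two-row batch the dwords are widened to qwords (punpckldq/punpckhdq against zero, then paddq into m3) so the running total cannot overflow 32 bits; the tail folds zmm -> ymm -> xmm -> rax. A scalar sketch of the same accumulation, with illustrative names only:

    #include <cstdint>

    // Scalar equivalent of the pixel_ssd_* kernels: per-row 32-bit partial
    // sums folded into a 64-bit total, mirroring the widening in the asm.
    static uint64_t ssd_u16(const uint16_t* pix1, intptr_t stride1,
                            const uint16_t* pix2, intptr_t stride2,
                            int width, int height)
    {
        uint64_t acc = 0;                              // qword accumulator, like m3
        for (int y = 0; y < height; y++)
        {
            uint32_t row = 0;                          // dword partial sums, like m0
            for (int x = 0; x < width; x++)
            {
                int d = (int)pix1[x] - (int)pix2[x];   // psubw
                row += (uint32_t)(d * d);              // pmaddwd + paddd
            }
            acc += row;                                // widen before it can wrap
            pix1 += stride1;
            pix2 += stride2;
        }
        return acc;                                    // zmm -> ymm -> xmm -> rax
    }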
x265_2.7.tar.gz/source/common/x86/v4-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/v4-ipfilter16.asm Changed
17
 
1
@@ -2931,6 +2931,7 @@
2
     RET
3
 %endmacro
4
 
5
+%if ARCH_X86_64
6
 FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6
7
 FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, INTERP_SHIFT_PS
8
 FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, INTERP_SHIFT_SP
9
@@ -2939,6 +2940,7 @@
10
 FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, INTERP_SHIFT_PS
11
 FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, INTERP_SHIFT_SP
12
 FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6
13
+%endif
14
 
15
 %macro FILTER_VER_CHROMA_AVX2_8x8 3
16
 INIT_YMM avx2
17
x265_2.7.tar.gz/source/common/x86/v4-ipfilter8.asm -> x265_2.9.tar.gz/source/common/x86/v4-ipfilter8.asm Changed
201
 
1
@@ -43,7 +43,7 @@
2
 const v4_interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4
3
                          dd 2, 3, 3, 4, 4, 5, 5, 6
4
 
5
-const tab_ChromaCoeff, db  0, 64,  0,  0
6
+const v4_tab_ChromaCoeff, db  0, 64,  0,  0
7
                        db -2, 58, 10, -2
8
                        db -4, 54, 16, -2
9
                        db -6, 46, 28, -4
10
@@ -1031,8 +1031,8 @@
11
     mova        m6,        [r5 + r4]
12
     mova        m5,        [r5 + r4 + 16]
13
 %else
14
-    mova        m6,        [tab_ChromaCoeff + r4]
15
-    mova        m5,        [tab_ChromaCoeff + r4 + 16]
16
+    mova        m6,        [v4_tab_ChromaCoeff + r4]
17
+    mova        m5,        [v4_tab_ChromaCoeff + r4 + 16]
18
 %endif
19
 
20
 %ifidn %1,pp
21
@@ -2114,10 +2114,10 @@
22
     sub         r0,        r1
23
 
24
 %ifdef PIC
25
-    lea         r5,        [tab_ChromaCoeff]
26
+    lea         r5,        [v4_tab_ChromaCoeff]
27
     movd        m0,        [r5 + r4 * 4]
28
 %else
29
-    movd        m0,        [tab_ChromaCoeff + r4 * 4]
30
+    movd        m0,        [v4_tab_ChromaCoeff + r4 * 4]
31
 %endif
32
     lea         r4,        [r1 * 3]
33
     lea         r5,        [r0 + 4 * r1]
34
@@ -2430,10 +2430,10 @@
35
     sub         r0,        r1
36
 
37
 %ifdef PIC
38
-    lea         r5,        [tab_ChromaCoeff]
39
+    lea         r5,        [v4_tab_ChromaCoeff]
40
     movd        m0,        [r5 + r4 * 4]
41
 %else
42
-    movd        m0,        [tab_ChromaCoeff + r4 * 4]
43
+    movd        m0,        [v4_tab_ChromaCoeff + r4 * 4]
44
 %endif
45
 
46
     pshufb      m0,        [tab_Cm]
47
@@ -2515,10 +2515,10 @@
48
     sub         r0,        r1
49
 
50
 %ifdef PIC
51
-    lea         r5,        [tab_ChromaCoeff]
52
+    lea         r5,        [v4_tab_ChromaCoeff]
53
     movd        m0,        [r5 + r4 * 4]
54
 %else
55
-    movd        m0,        [tab_ChromaCoeff + r4 * 4]
56
+    movd        m0,        [v4_tab_ChromaCoeff + r4 * 4]
57
 %endif
58
 
59
     pshufb      m0,        [tab_Cm]
60
@@ -2611,10 +2611,10 @@
61
     sub         r0,        r1
62
 
63
 %ifdef PIC
64
-    lea         r5,        [tab_ChromaCoeff]
65
+    lea         r5,        [v4_tab_ChromaCoeff]
66
     movd        m0,        [r5 + r4 * 4]
67
 %else
68
-    movd        m0,        [tab_ChromaCoeff + r4 * 4]
69
+    movd        m0,        [v4_tab_ChromaCoeff + r4 * 4]
70
 %endif
71
 
72
     pshufb      m0,        [tab_Cm]
73
@@ -2984,10 +2984,10 @@
74
     sub         r0,        r1
75
 
76
 %ifdef PIC
77
-    lea         r5,        [tab_ChromaCoeff]
78
+    lea         r5,        [v4_tab_ChromaCoeff]
79
     movd        m0,        [r5 + r4 * 4]
80
 %else
81
-    movd        m0,        [tab_ChromaCoeff + r4 * 4]
82
+    movd        m0,        [v4_tab_ChromaCoeff + r4 * 4]
83
 %endif
84
 
85
     pshufb      m0,        [tab_Cm]
86
@@ -3180,10 +3180,10 @@
87
     punpcklbw   m4,        m2,          m3
88
 
89
 %ifdef PIC
90
-    lea         r6,        [tab_ChromaCoeff]
91
+    lea         r6,        [v4_tab_ChromaCoeff]
92
     movd        m5,        [r6 + r4 * 4]
93
 %else
94
-    movd        m5,        [tab_ChromaCoeff + r4 * 4]
95
+    movd        m5,        [v4_tab_ChromaCoeff + r4 * 4]
96
 %endif
97
 
98
     pshufb      m6,        m5,       [tab_Vm]
99
@@ -3233,10 +3233,10 @@
100
     add         r3d, r3d
101
 
102
 %ifdef PIC
103
-    lea         r5, [tab_ChromaCoeff]
104
+    lea         r5, [v4_tab_ChromaCoeff]
105
     movd        m0, [r5 + r4 * 4]
106
 %else
107
-    movd        m0, [tab_ChromaCoeff + r4 * 4]
108
+    movd        m0, [v4_tab_ChromaCoeff + r4 * 4]
109
 %endif
110
 
111
     pshufb      m0, [tab_Cm]
112
@@ -3280,10 +3280,10 @@
113
     add        r3d, r3d
114
 
115
 %ifdef PIC
116
-    lea        r5, [tab_ChromaCoeff]
117
+    lea        r5, [v4_tab_ChromaCoeff]
118
     movd       m0, [r5 + r4 * 4]
119
 %else
120
-    movd       m0, [tab_ChromaCoeff + r4 * 4]
121
+    movd       m0, [v4_tab_ChromaCoeff + r4 * 4]
122
 %endif
123
 
124
     pshufb     m0, [tab_Cm]
125
@@ -3355,10 +3355,10 @@
126
     add        r3d, r3d
127
 
128
 %ifdef PIC
129
-    lea        r5, [tab_ChromaCoeff]
130
+    lea        r5, [v4_tab_ChromaCoeff]
131
     movd       m0, [r5 + r4 * 4]
132
 %else
133
-    movd       m0, [tab_ChromaCoeff + r4 * 4]
134
+    movd       m0, [v4_tab_ChromaCoeff + r4 * 4]
135
 %endif
136
 
137
     pshufb     m0, [tab_Cm]
138
@@ -3442,10 +3442,10 @@
139
     add        r3d, r3d
140
 
141
 %ifdef PIC
142
-    lea        r5, [tab_ChromaCoeff]
143
+    lea        r5, [v4_tab_ChromaCoeff]
144
     movd       m5, [r5 + r4 * 4]
145
 %else
146
-    movd       m5, [tab_ChromaCoeff + r4 * 4]
147
+    movd       m5, [v4_tab_ChromaCoeff + r4 * 4]
148
 %endif
149
 
150
     pshufb     m6, m5, [tab_Vm]
151
@@ -3513,10 +3513,10 @@
152
     add        r3d, r3d
153
 
154
 %ifdef PIC
155
-    lea        r5, [tab_ChromaCoeff]
156
+    lea        r5, [v4_tab_ChromaCoeff]
157
     movd       m5, [r5 + r4 * 4]
158
 %else
159
-    movd       m5, [tab_ChromaCoeff + r4 * 4]
160
+    movd       m5, [v4_tab_ChromaCoeff + r4 * 4]
161
 %endif
162
 
163
     pshufb     m6, m5, [tab_Vm]
164
@@ -3605,10 +3605,10 @@
165
     add        r3d, r3d
166
 
167
 %ifdef PIC
168
-    lea        r5, [tab_ChromaCoeff]
169
+    lea        r5, [v4_tab_ChromaCoeff]
170
     movd       m5, [r5 + r4 * 4]
171
 %else
172
-    movd       m5, [tab_ChromaCoeff + r4 * 4]
173
+    movd       m5, [v4_tab_ChromaCoeff + r4 * 4]
174
 %endif
175
 
176
     pshufb     m6, m5, [tab_Vm]
177
@@ -3700,10 +3700,10 @@
178
     add        r3d, r3d
179
 
180
 %ifdef PIC
181
-    lea        r5, [tab_ChromaCoeff]
182
+    lea        r5, [v4_tab_ChromaCoeff]
183
     movd       m0, [r5 + r4 * 4]
184
 %else
185
-    movd       m0, [tab_ChromaCoeff + r4 * 4]
186
+    movd       m0, [v4_tab_ChromaCoeff + r4 * 4]
187
 %endif
188
 
189
     pshufb     m1, m0, [tab_Vm]
190
@@ -3786,10 +3786,10 @@
191
     add        r3d, r3d
192
 
193
 %ifdef PIC
194
-    lea        r5, [tab_ChromaCoeff]
195
+    lea        r5, [v4_tab_ChromaCoeff]
196
     movd       m0, [r5 + r4 * 4]
197
 %else
198
-    movd       m0, [tab_ChromaCoeff + r4 * 4]
199
+    movd       m0, [v4_tab_ChromaCoeff + r4 * 4]
200
 %endif
201
x265_2.7.tar.gz/source/common/x86/x86inc.asm -> x265_2.9.tar.gz/source/common/x86/x86inc.asm Changed
201
 
1
@@ -82,7 +82,13 @@
2
 %endif
3
 
4
 %macro SECTION_RODATA 0-1 32
5
-    SECTION .rodata align=%1
6
+    %ifidn __OUTPUT_FORMAT__,win32
7
+        SECTION .rdata align=%1
8
+    %elif WIN64
9
+        SECTION .rdata align=%1
10
+    %else
11
+        SECTION .rodata align=%1
12
+    %endif
13
 %endmacro
14
 
15
 %if WIN64
16
@@ -325,6 +331,8 @@
17
 %endmacro
18
 
19
 %define required_stack_alignment ((mmsize + 15) & ~15)
20
+%define vzeroupper_required (mmsize > 16 && (ARCH_X86_64 == 0 || xmm_regs_used > 16 || notcpuflag(avx512)))
21
+%define high_mm_regs (16*cpuflag(avx512))
22
 
23
 %macro ALLOC_STACK 1-2 0 ; stack_size, n_xmm_regs (for win64 only)
24
     %ifnum %1
25
@@ -438,15 +446,16 @@
26
 
27
 %macro WIN64_PUSH_XMM 0
28
     ; Use the shadow space to store XMM6 and XMM7, the rest needs stack space allocated.
29
-    %if xmm_regs_used > 6
30
+    %if xmm_regs_used > 6 + high_mm_regs
31
         movaps [rstk + stack_offset +  8], xmm6
32
     %endif
33
-    %if xmm_regs_used > 7
34
+    %if xmm_regs_used > 7 + high_mm_regs
35
         movaps [rstk + stack_offset + 24], xmm7
36
     %endif
37
-    %if xmm_regs_used > 8
38
+    %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8
39
+    %if %%xmm_regs_on_stack > 0
40
         %assign %%i 8
41
-        %rep xmm_regs_used-8
42
+        %rep %%xmm_regs_on_stack
43
             movaps [rsp + (%%i-8)*16 + stack_size + 32], xmm %+ %%i
44
             %assign %%i %%i+1
45
         %endrep
46
@@ -455,8 +464,9 @@
47
 
48
 %macro WIN64_SPILL_XMM 1
49
     %assign xmm_regs_used %1
50
-    ASSERT xmm_regs_used <= 16
51
-    %if xmm_regs_used > 8
52
+    ASSERT xmm_regs_used <= 16 + high_mm_regs
53
+    %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8
54
+    %if %%xmm_regs_on_stack > 0
55
         ; Allocate stack space for callee-saved xmm registers plus shadow space and align the stack.
56
         %assign %%pad (xmm_regs_used-8)*16 + 32
57
         %assign stack_size_padded %%pad + ((-%%pad-stack_offset-gprsize) & (STACK_ALIGNMENT-1))
58
@@ -467,9 +477,10 @@
59
 
60
 %macro WIN64_RESTORE_XMM_INTERNAL 0
61
     %assign %%pad_size 0
62
-    %if xmm_regs_used > 8
63
-        %assign %%i xmm_regs_used
64
-        %rep xmm_regs_used-8
65
+    %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8
66
+    %if %%xmm_regs_on_stack > 0
67
+        %assign %%i xmm_regs_used - high_mm_regs
68
+        %rep %%xmm_regs_on_stack
69
             %assign %%i %%i-1
70
             movaps xmm %+ %%i, [rsp + (%%i-8)*16 + stack_size + 32]
71
         %endrep
72
@@ -482,10 +493,10 @@
73
             %assign %%pad_size stack_size_padded
74
         %endif
75
     %endif
76
-    %if xmm_regs_used > 7
77
+    %if xmm_regs_used > 7 + high_mm_regs
78
         movaps xmm7, [rsp + stack_offset - %%pad_size + 24]
79
     %endif
80
-    %if xmm_regs_used > 6
81
+    %if xmm_regs_used > 6 + high_mm_regs
82
         movaps xmm6, [rsp + stack_offset - %%pad_size +  8]
83
     %endif
84
 %endmacro
85
@@ -497,12 +508,12 @@
86
     %assign xmm_regs_used 0
87
 %endmacro
88
 
89
-%define has_epilogue regs_used > 7 || xmm_regs_used > 6 || mmsize == 32 || stack_size > 0
90
+%define has_epilogue regs_used > 7 || stack_size > 0 || vzeroupper_required || xmm_regs_used > 6 + high_mm_regs
91
 
92
 %macro RET 0
93
     WIN64_RESTORE_XMM_INTERNAL
94
     POP_IF_USED 14, 13, 12, 11, 10, 9, 8, 7
95
-    %if mmsize == 32
96
+    %if vzeroupper_required
97
         vzeroupper
98
     %endif
99
     AUTO_REP_RET
100
@@ -526,9 +537,10 @@
101
 DECLARE_REG 13, R12, 64
102
 DECLARE_REG 14, R13, 72
103
 
104
-%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, [stack_size,] arg_names...
105
+%macro PROLOGUE 2-5+ 0; #args, #regs, #xmm_regs, [stack_size,] arg_names...
106
     %assign num_args %1
107
     %assign regs_used %2
108
+    %assign xmm_regs_used %3
109
     ASSERT regs_used >= num_args
110
     SETUP_STACK_POINTER %4
111
     ASSERT regs_used <= 15
112
@@ -538,7 +550,7 @@
113
     DEFINE_ARGS_INTERNAL %0, %4, %5
114
 %endmacro
115
 
116
-%define has_epilogue regs_used > 9 || mmsize == 32 || stack_size > 0
117
+%define has_epilogue regs_used > 9 || stack_size > 0 || vzeroupper_required
118
 
119
 %macro RET 0
120
     %if stack_size_padded > 0
121
@@ -549,7 +561,7 @@
122
         %endif
123
     %endif
124
     POP_IF_USED 14, 13, 12, 11, 10, 9
125
-    %if mmsize == 32
126
+    %if vzeroupper_required
127
         vzeroupper
128
     %endif
129
     AUTO_REP_RET
130
@@ -594,7 +606,7 @@
131
     DEFINE_ARGS_INTERNAL %0, %4, %5
132
 %endmacro
133
 
134
-%define has_epilogue regs_used > 3 || mmsize == 32 || stack_size > 0
135
+%define has_epilogue regs_used > 3 || stack_size > 0 || vzeroupper_required
136
 
137
 %macro RET 0
138
     %if stack_size_padded > 0
139
@@ -605,7 +617,7 @@
140
         %endif
141
     %endif
142
     POP_IF_USED 6, 5, 4, 3
143
-    %if mmsize == 32
144
+    %if vzeroupper_required
145
         vzeroupper
146
     %endif
147
     AUTO_REP_RET
148
@@ -710,12 +722,22 @@
149
     %assign stack_offset 0      ; stack pointer offset relative to the return address
150
     %assign stack_size 0        ; amount of stack space that can be freely used inside a function
151
     %assign stack_size_padded 0 ; total amount of allocated stack space, including space for callee-saved xmm registers on WIN64 and alignment padding
152
-    %assign xmm_regs_used 0     ; number of XMM registers requested, used for dealing with callee-saved registers on WIN64
153
+    %assign xmm_regs_used 0     ; number of XMM registers requested, used for dealing with callee-saved registers on WIN64 and vzeroupper
154
     %ifnidn %3, ""
155
         PROLOGUE %3
156
     %endif
157
 %endmacro
158
 
159
+; Create a global symbol from a local label with the correct name mangling and type
160
+%macro cglobal_label 1
161
+    %if FORMAT_ELF
162
+        global current_function %+ %1:function hidden
163
+    %else
164
+        global current_function %+ %1
165
+    %endif
166
+    %1:
167
+%endmacro
168
+
169
 %macro cextern 1
170
     %xdefine %1 mangle(private_prefix %+ _ %+ %1)
171
     CAT_XDEFINE cglobaled_, %1, 1
172
@@ -768,10 +790,10 @@
173
 %assign cpuflags_bmi1     (1<<16)| cpuflags_avx | cpuflags_lzcnt
174
 %assign cpuflags_bmi2     (1<<17)| cpuflags_bmi1
175
 %assign cpuflags_avx2     (1<<18)| cpuflags_fma3 | cpuflags_bmi2
176
+%assign cpuflags_avx512   (1<<19)| cpuflags_avx2 ; F, CD, BW, DQ, VL
177
 
178
-%assign cpuflags_cache32  (1<<19)
179
-%assign cpuflags_cache64  (1<<20)
180
-%assign cpuflags_slowctz  (1<<21)
181
+%assign cpuflags_cache32  (1<<20)
182
+%assign cpuflags_cache64  (1<<21)
183
 %assign cpuflags_aligned  (1<<22) ; not a cpu feature, but a function variant
184
 %assign cpuflags_atom     (1<<23)
185
 
186
@@ -829,11 +851,12 @@
187
     %endif
188
 %endmacro
189
 
190
-; Merge mmx and sse*
191
+; Merge mmx and sse*, and avx*
192
 ; m# is a simd register of the currently selected size
193
 ; xm# is the corresponding xmm register if mmsize >= 16, otherwise the same as m#
194
 ; ym# is the corresponding ymm register if mmsize >= 32, otherwise the same as m#
195
-; (All 3 remain in sync through SWAP.)
196
+; zm# is the corresponding zmm register if mmsize >= 64, otherwise the same as m#
197
+; (All 4 remain in sync through SWAP.)
198
 
199
 %macro CAT_XDEFINE 3
200
     %xdefine %1%2 %3
201
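The x86inc.asm changes above teach the framework about AVX-512: cpuflags_avx512 implies AVX2, and two new predicates, high_mm_regs and vzeroupper_required, relax the Win64 xmm spill code and the RET-time vzeroupper. A rough restatement of those predicates as plain functions, assuming an illustrative struct that is not part of x265:

    #include <algorithm>

    struct AsmFunctionTraits
    {
        bool is64bit;        // ARCH_X86_64
        bool avx512;         // function assembled with cpuflag(avx512)
        int  mmsize;         // vector width in bytes: 8, 16, 32 or 64
        int  xmmRegsUsed;    // registers requested via PROLOGUE / WIN64_SPILL_XMM
    };

    // high_mm_regs: xmm16-xmm31 exist only with AVX-512 and are volatile in
    // the Win64 ABI, so they never need to be saved.
    static int highMmRegs(const AsmFunctionTraits& t) { return t.avx512 ? 16 : 0; }

    // Callee-saved xmm registers that must be spilled on Win64
    // (xmm6/xmm7 use the shadow space, the rest need stack slots).
    static int xmmRegsOnStack(const AsmFunctionTraits& t)
    {
        return std::max(0, t.xmmRegsUsed - highMmRegs(t) - 8);
    }

    // vzeroupper_required: wide vector code still needs vzeroupper unless it
    // is a 64-bit AVX-512 build staying within 16 requested registers, which
    // the framework can then keep in zmm16-zmm31, leaving ymm0-ymm15 clean.
    static bool vzeroupperRequired(const AsmFunctionTraits& t)
    {
        return t.mmsize > 16 && (!t.is64bit || t.xmmRegsUsed > 16 || !t.avx512);
    }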
x265_2.7.tar.gz/source/common/x86/x86util.asm -> x265_2.9.tar.gz/source/common/x86/x86util.asm Changed
101
 
1
@@ -299,32 +299,44 @@
2
     pminsw %2, %4
3
 %endmacro
4
 
5
+%macro MOVHL 2 ; dst, src
6
+%ifidn %1, %2
7
+    punpckhqdq %1, %2
8
+%elif cpuflag(avx)
9
+    punpckhqdq %1, %2, %2
10
+%elif cpuflag(sse4)
11
+    pshufd     %1, %2, q3232 ; pshufd is slow on some older CPUs, so only use it on more modern ones
12
+%else
13
+    movhlps    %1, %2        ; may cause an int/float domain transition and has a dependency on dst
14
+%endif
15
+%endmacro
16
+
17
 %macro HADDD 2 ; sum junk
18
-%if sizeof%1 == 32
19
-%define %2 xmm%2
20
-    vextracti128 %2, %1, 1
21
-%define %1 xmm%1
22
-    paddd   %1, %2
23
+%if sizeof%1 >= 64
24
+    vextracti32x8 ymm%2, zmm%1, 1
25
+    paddd         ymm%1, ymm%2
26
 %endif
27
-%if mmsize >= 16
28
-%if cpuflag(xop) && sizeof%1 == 16
29
-    vphadddq %1, %1
30
+%if sizeof%1 >= 32
31
+    vextracti128  xmm%2, ymm%1, 1
32
+    paddd         xmm%1, xmm%2
33
+%endif
34
+%if sizeof%1 >= 16
35
+    MOVHL         xmm%2, xmm%1
36
+    paddd         xmm%1, xmm%2
37
 %endif
38
-    movhlps %2, %1
39
-    paddd   %1, %2
40
+%if cpuflag(xop) && sizeof%1 == 16
41
+    vphadddq xmm%1, xmm%1
42
 %endif
43
 %if notcpuflag(xop)
44
-    PSHUFLW %2, %1, q0032
45
-    paddd   %1, %2
46
+    PSHUFLW xmm%2, xmm%1, q1032
47
+    paddd   xmm%1, xmm%2
48
 %endif
49
-%undef %1
50
-%undef %2
51
 %endmacro
52
 
53
 %macro HADDW 2 ; reg, tmp
54
 %if cpuflag(xop) && sizeof%1 == 16
55
     vphaddwq  %1, %1
56
-    movhlps   %2, %1
57
+    MOVHL     %2, %1
58
     paddd     %1, %2
59
 %else
60
     pmaddwd %1, [pw_1]
61
@@ -346,7 +358,7 @@
62
 %macro HADDUW 2
63
 %if cpuflag(xop) && sizeof%1 == 16
64
     vphadduwq %1, %1
65
-    movhlps   %2, %1
66
+    MOVHL     %2, %1
67
     paddd     %1, %2
68
 %else
69
     HADDUWD   %1, %2
70
@@ -739,25 +751,25 @@
71
 %if %6 ; %5 aligned?
72
     mova       %1, %4
73
     psubw      %1, %5
74
+%elif cpuflag(avx)
75
+    movu       %1, %4
76
+    psubw      %1, %5
77
 %else
78
     movu       %1, %4
79
     movu       %2, %5
80
     psubw      %1, %2
81
 %endif
82
 %else ; !HIGH_BIT_DEPTH
83
-%ifidn %3, none
84
     movh       %1, %4
85
     movh       %2, %5
86
+%ifidn %3, none
87
     punpcklbw  %1, %2
88
     punpcklbw  %2, %2
89
-    psubw      %1, %2
90
 %else
91
-    movh       %1, %4
92
     punpcklbw  %1, %3
93
-    movh       %2, %5
94
     punpcklbw  %2, %3
95
-    psubw      %1, %2
96
 %endif
97
+    psubw      %1, %2
98
 %endif ; HIGH_BIT_DEPTH
99
 %endmacro
100
 
101
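The reworked HADDD/MOVHL macros above extend the horizontal dword add to ymm and zmm inputs by repeatedly splitting the vector in half and adding the halves, so an N-lane sum costs log2(N) additions. The same reduction written out in scalar form (an array stands in for the register lanes; names are illustrative):

    #include <cstdint>
    #include <cstddef>

    // Horizontal add of up to 16 dword lanes (a zmm register), halving the
    // width at each step: vextracti32x8 / vextracti128 / MOVHL / PSHUFLW.
    static uint32_t haddd(const uint32_t* lanes, size_t n)  // n: power of two, <= 16
    {
        uint32_t tmp[16];
        for (size_t i = 0; i < n; i++)
            tmp[i] = lanes[i];
        for (size_t half = n / 2; half >= 1; half /= 2)
            for (size_t i = 0; i < half; i++)
                tmp[i] += tmp[i + half];                     // paddd of the two halves
        return tmp[0];
    }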
x265_2.7.tar.gz/source/common/yuv.cpp -> x265_2.9.tar.gz/source/common/yuv.cpp Changed
39
 
1
@@ -170,11 +170,14 @@
2
 
3
 void Yuv::addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL, int picCsp)
4
 {
5
-    primitives.cu[log2SizeL - 2].add_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size);
6
+    primitives.cu[log2SizeL - 2].add_ps[(m_size % 64 == 0) && (srcYuv0.m_size % 64 == 0) && (srcYuv1.m_size % 64 == 0)](m_buf[0],
7
+                                         m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size);
8
     if (m_csp != X265_CSP_I400 && picCsp != X265_CSP_I400)
9
     {
10
-        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
11
-        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
12
+        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps[(m_csize % 64 == 0) && (srcYuv0.m_csize % 64 ==0) && (srcYuv1.m_csize % 64 == 0)](m_buf[1],
13
+                                                           m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
14
+        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps[(m_csize % 64 == 0) && (srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0)](m_buf[2],
15
+                                                           m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
16
     }
17
     if (picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)
18
     {
19
@@ -192,7 +195,7 @@
20
         const int16_t* srcY0 = srcYuv0.getLumaAddr(absPartIdx);
21
         const int16_t* srcY1 = srcYuv1.getLumaAddr(absPartIdx);
22
         pixel* dstY = getLumaAddr(absPartIdx);
23
-        primitives.pu[part].addAvg(srcY0, srcY1, dstY, srcYuv0.m_size, srcYuv1.m_size, m_size);
24
+        primitives.pu[part].addAvg[(srcYuv0.m_size % 64 == 0) && (srcYuv1.m_size % 64 == 0) && (m_size % 64 == 0)](srcY0, srcY1, dstY, srcYuv0.m_size, srcYuv1.m_size, m_size);
25
     }
26
     if (bChroma)
27
     {
28
@@ -202,8 +205,8 @@
29
         const int16_t* srcV1 = srcYuv1.getCrAddr(absPartIdx);
30
         pixel* dstU = getCbAddr(absPartIdx);
31
         pixel* dstV = getCrAddr(absPartIdx);
32
-        primitives.chroma[m_csp].pu[part].addAvg(srcU0, srcU1, dstU, srcYuv0.m_csize, srcYuv1.m_csize, m_csize);
33
-        primitives.chroma[m_csp].pu[part].addAvg(srcV0, srcV1, dstV, srcYuv0.m_csize, srcYuv1.m_csize, m_csize);
34
+        primitives.chroma[m_csp].pu[part].addAvg[(srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0) && (m_csize % 64 == 0)](srcU0, srcU1, dstU, srcYuv0.m_csize, srcYuv1.m_csize, m_csize);
35
+        primitives.chroma[m_csp].pu[part].addAvg[(srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0) && (m_csize % 64 == 0)](srcV0, srcV1, dstV, srcYuv0.m_csize, srcYuv1.m_csize, m_csize);
36
     }
37
 }
38
 
39
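In yuv.cpp the add_ps and addAvg primitives are now two-entry tables indexed by an alignment test: entry 1 (the aligned fast path) is selected only when every stride involved is a multiple of 64, presumably so buffers with odd strides still fall back to the unaligned kernel. A simplified sketch of that selection pattern, with stand-in types that are not the real x265 primitive tables:

    #include <cstdint>

    typedef void (*add_ps_t)(uint16_t* dst, intptr_t dstStride,
                             const uint16_t* src0, const int16_t* src1,
                             intptr_t srcStride0, intptr_t srcStride1);

    struct CuPrimitives
    {
        add_ps_t add_ps[2];   // [0] unaligned-safe, [1] 64-byte-aligned fast path
    };

    static void addClipExample(const CuPrimitives& p,
                               uint16_t* dst, intptr_t dstStride,
                               const uint16_t* src0, const int16_t* src1,
                               intptr_t s0, intptr_t s1)
    {
        // same predicate as the yuv.cpp change: all strides 64-byte multiples
        int aligned = (dstStride % 64 == 0) && (s0 % 64 == 0) && (s1 % 64 == 0);
        p.add_ps[aligned](dst, dstStride, src0, src1, s0, s1);
    }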
x265_2.7.tar.gz/source/common/yuv.h -> x265_2.9.tar.gz/source/common/yuv.h Changed
9
 
1
@@ -38,7 +38,6 @@
2
 class Yuv
3
 {
4
 public:
5
-
6
     pixel*   m_buf[3];
7
 
8
     uint32_t m_size;
9
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp -> x265_2.9.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp Changed
28
 
1
@@ -34,6 +34,7 @@
2
 const std::string BezierCurveNames::NumberOfAnchors = std::string("NumberOfAnchors");
3
 const std::string BezierCurveNames::KneePointX = std::string("KneePointX");
4
 const std::string BezierCurveNames::KneePointY = std::string("KneePointY");
5
+const std::string BezierCurveNames::AnchorsTag = std::string("Anchors");
6
 const std::string BezierCurveNames::Anchors[] = {std::string("Anchor0"),
7
                                                  std::string("Anchor1"),
8
                                                  std::string("Anchor2"),
9
@@ -69,6 +70,8 @@
10
 
11
 const std::string PercentileNames::TagName = std::string("PercentileLuminance");
12
 const std::string PercentileNames::NumberOfPercentiles = std::string("NumberOfPercentiles");
13
+const std::string PercentileNames::DistributionIndex = std::string("DistributionIndex");
14
+const std::string PercentileNames::DistributionValues = std::string("DistributionValues");
15
 const std::string PercentileNames::PercentilePercentageValue[] = {std::string("PercentilePercentage0"),
16
                                                                   std::string("PercentilePercentage1"),
17
                                                                   std::string("PercentilePercentage2"),
18
@@ -104,7 +107,9 @@
19
 
20
 
21
 const std::string LuminanceNames::TagName = std::string("LuminanceParameters");
22
+const std::string LuminanceNames::LlcTagName = std::string("LuminanceDistributions");
23
 const std::string LuminanceNames::AverageRGB = std::string("AverageRGB");
24
+const std::string LuminanceNames::MaxSCL = std::string("MaxScl");
25
 const std::string LuminanceNames::MaxSCL0 = std::string("MaxScl0");
26
 const std::string LuminanceNames::MaxSCL1 = std::string("MaxScl1");
27
 const std::string LuminanceNames::MaxSCL2 = std::string("MaxScl2");
28
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h -> x265_2.9.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h Changed
28
 
1
@@ -48,6 +48,7 @@
2
         static const std::string NumberOfAnchors;
3
         static const std::string KneePointX;
4
         static const std::string KneePointY;
5
+        static const std::string AnchorsTag;
6
         static const std::string Anchors[14];
7
     };
8
     //Ellipse Selection Data
9
@@ -79,6 +80,8 @@
10
         public:
11
         static const std::string TagName;
12
         static const std::string NumberOfPercentiles;
13
+        static const std::string DistributionIndex;
14
+        static const std::string DistributionValues;
15
         static const std::string PercentilePercentageValue[15];
16
         static const std::string PercentileLuminanceValue[15];
17
     };
18
@@ -87,7 +90,9 @@
19
     {
20
         public:
21
         static const std::string TagName;
22
+        static const std::string LlcTagName;
23
         static const std::string AverageRGB;
24
+        static const std::string MaxSCL;
25
         static const std::string MaxSCL0;
26
         static const std::string MaxSCL1;
27
         static const std::string MaxSCL2;
28
x265_2.7.tar.gz/source/dynamicHDR10/metadataFromJson.cpp -> x265_2.9.tar.gz/source/dynamicHDR10/metadataFromJson.cpp Changed
201
 
1
@@ -46,89 +46,133 @@
2
     int mCurrentStreamBit;
3
     int mCurrentStreamByte;
4
 
5
-    bool luminanceParamFromJson(const Json &data, LuminanceParameters &obj)
6
+    bool luminanceParamFromJson(const Json &data, LuminanceParameters &obj, const JsonType jsonType)
7
     {
8
         JsonObject lumJsonData = data.object_items();
9
         if(!lumJsonData.empty())
10
         {
11
-            JsonObject percentileData = lumJsonData[PercentileNames::TagName].object_items();
12
-            obj.order = percentileData[PercentileNames::NumberOfPercentiles].int_value();
13
-
14
-            obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value());
15
-            obj.maxRLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL0].number_value());
16
-            obj.maxGLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL1].number_value());
17
-            obj.maxBLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL2].number_value());
18
-
19
-            if(!percentileData.empty())
20
-            {
21
-                obj.percentiles.resize(obj.order);
22
-                for(int i = 0; i < obj.order; ++i)
23
-                {
24
-                    std::string percentileTag = PercentileNames::TagName;
25
-                    percentileTag += std::to_string(i);
26
-                    obj.percentiles[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value());
27
-                }
28
-            }
29
-
30
-            return true;
31
-        }
32
-        return false;
33
-    }
34
-
35
-    bool percentagesFromJson(const Json &data, std::vector<unsigned int> &percentages)
36
-    {
37
-        JsonObject jsonData = data.object_items();
38
-        if(!jsonData.empty())
39
-        {
40
-            JsonObject percentileData = jsonData[PercentileNames::TagName].object_items();
41
-            int order = percentileData[PercentileNames::NumberOfPercentiles].int_value();
42
-
43
-            percentages.resize(order);
44
-            for(int i = 0; i < order; ++i)
45
-            {
46
-                std::string percentileTag = PercentileNames::PercentilePercentageValue[i];
47
-                percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value());
48
-            }
49
-
50
-            return true;
51
-        }
52
+           switch(jsonType)
53
+           {
54
+               case LEGACY:
55
+               {
56
+                   obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value());
57
+                   obj.maxRLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL0].number_value());
58
+                   obj.maxGLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL1].number_value());
59
+                   obj.maxBLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL2].number_value());
60
+
61
+                   JsonObject percentileData = lumJsonData[PercentileNames::TagName].object_items();
62
+                   obj.order = percentileData[PercentileNames::NumberOfPercentiles].int_value();
63
+                   if(!percentileData.empty())
64
+                   {
65
+                       obj.percentiles.resize(obj.order);
66
+                       for(int i = 0; i < obj.order; ++i)
67
+                       {
68
+                           std::string percentileTag = PercentileNames::TagName;
69
+                           percentileTag += std::to_string(i);
70
+                           obj.percentiles[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value());
71
+                       }
72
+                   }
73
+                   return true;
74
+               } break;
75
+               case LLC:
76
+               {
77
+                   obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value());
78
+                   JsonArray maxScl = lumJsonData[LuminanceNames::MaxSCL].array_items();
79
+                   obj.maxRLuminance = static_cast<float>(maxScl[0].number_value());
80
+                   obj.maxGLuminance = static_cast<float>(maxScl[1].number_value());
81
+                   obj.maxBLuminance = static_cast<float>(maxScl[2].number_value());
82
+
83
+                   JsonObject percentileData = lumJsonData[LuminanceNames::LlcTagName].object_items();
84
+                   if(!percentileData.empty())
85
+                   {
86
+                       JsonArray distributionValues = percentileData[PercentileNames::DistributionValues].array_items();
87
+                       obj.order = static_cast<int>(distributionValues.size());
88
+                       obj.percentiles.resize(obj.order);
89
+                       for(int i = 0; i < obj.order; ++i)
90
+                       {
91
+                           obj.percentiles[i] = static_cast<unsigned int>(distributionValues[i].int_value());
92
+                       }
93
+                   }
94
+                   return true;
95
+               } break;
96
+           }
97
+       }
98
         return false;
99
     }
100
 
101
-    bool percentagesFromJson(const Json &data, unsigned int *percentages)
102
+    bool percentagesFromJson(const Json &data, std::vector<unsigned int> &percentages, const JsonType jsonType)
103
     {
104
         JsonObject jsonData = data.object_items();
105
         if(!jsonData.empty())
106
         {
107
-            JsonObject percentileData = jsonData[PercentileNames::TagName].object_items();
108
-            int order = percentileData[PercentileNames::NumberOfPercentiles].int_value();
109
-
110
-            for(int i = 0; i < order; ++i)
111
-            {
112
-                std::string percentileTag = PercentileNames::PercentilePercentageValue[i];
113
-                percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value());
114
-            }
115
+           switch(jsonType)
116
+           {
117
+               case LEGACY:
118
+               {
119
+                   JsonObject percentileData = jsonData[PercentileNames::TagName].object_items();
120
+                   int order = percentileData[PercentileNames::NumberOfPercentiles].int_value();
121
+                   percentages.resize(order);
122
+                   for(int i = 0; i < order; ++i)
123
+                   {
124
+                       std::string percentileTag = PercentileNames::PercentilePercentageValue[i];
125
+                       percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value());
126
+                   }
127
+                   return true;
128
+               } break;
129
+               case LLC:
130
+               {
131
+                   JsonObject percentileData = jsonData[LuminanceNames::LlcTagName].object_items();
132
+                   if(!percentileData.empty())
133
+                   {
134
+                       JsonArray percentageValues = percentileData[PercentileNames::DistributionIndex].array_items();
135
+                       int order = static_cast<int>(percentageValues.size());
136
+                       percentages.resize(order);
137
+                       for(int i = 0; i < order; ++i)
138
+                       {
139
+                           percentages[i] = static_cast<unsigned int>(percentageValues[i].int_value());
140
+                       }
141
+                   } 
142
+                   return true;
143
+               } break;
144
+           }
145
 
146
-            return true;
147
         }
148
         return false;
149
     }
150
 
151
-    bool bezierCurveFromJson(const Json &data, BezierCurveData &obj)
152
+    bool bezierCurveFromJson(const Json &data, BezierCurveData &obj, const JsonType jsonType)
153
     {
154
         JsonObject jsonData = data.object_items();
155
         if(!jsonData.empty())
156
         {
157
-            obj.order = jsonData[BezierCurveNames::NumberOfAnchors].int_value();
158
-            obj.coeff.resize(obj.order);
159
-            obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value();
160
-            obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value();
161
-            for(int i = 0; i < obj.order; ++i)
162
-            {
163
-                obj.coeff[i] = jsonData[BezierCurveNames::Anchors[i]].int_value();
164
-            }
165
-
166
-            return true;
167
+           switch(jsonType)
168
+           {
169
+               case LEGACY:
170
+               {
171
+                   obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value();
172
+                   obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value();
173
+                   obj.order = jsonData[BezierCurveNames::NumberOfAnchors].int_value();
174
+                   obj.coeff.resize(obj.order);
175
+                   for(int i = 0; i < obj.order; ++i)
176
+                   {
177
+                       obj.coeff[i] = jsonData[BezierCurveNames::Anchors[i]].int_value();
178
+                   }
179
+                   return true;    
180
+               } break;
181
+               case LLC:
182
+               {
183
+                   obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value();
184
+                   obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value();
185
+                   JsonArray anchorValues = data[BezierCurveNames::AnchorsTag].array_items();
186
+                   obj.order = static_cast<int>(anchorValues.size());
187
+                   obj.coeff.resize(obj.order);
188
+                   for(int i = 0; i < obj.order; ++i)
189
+                   {
190
+                       obj.coeff[i] = anchorValues[i].int_value();
191
+                   }
192
+                   return true;
193
+               } break;
194
+           }
195
         }
196
         return false;
197
     }
198
@@ -162,9 +206,7 @@
199
     void setPayloadSize(uint8_t *dataStream, int positionOnStream, int payload)
200
     {
201
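metadataFromJson.cpp now parses two HDR10+ JSON layouts, selected by the new JsonType argument: LEGACY keeps scalar MaxScl0..MaxScl2 keys plus numbered percentile keys, while LLC groups the same data under MaxScl and LuminanceDistributions arrays. Two made-up documents of roughly the shape the parser reads (field values are invented purely for illustration):

    // Shapes inferred from the key names used above; not authoritative samples.
    static const char* kLegacyLuminanceExample = R"({
      "AverageRGB": 401.0,
      "MaxScl0": 1100.0, "MaxScl1": 980.0, "MaxScl2": 1200.0,
      "PercentileLuminance": { "NumberOfPercentiles": 2,
                               "PercentileLuminance0": 17,
                               "PercentileLuminance1": 64 }
    })";

    static const char* kLlcLuminanceExample = R"({
      "AverageRGB": 401.0,
      "MaxScl": [1100.0, 980.0, 1200.0],
      "LuminanceDistributions": { "DistributionIndex":  [1, 5, 10],
                                  "DistributionValues": [17, 64, 195] }
    })";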
x265_2.7.tar.gz/source/dynamicHDR10/metadataFromJson.h -> x265_2.9.tar.gz/source/dynamicHDR10/metadataFromJson.h Changed
31
 
1
@@ -26,7 +26,7 @@
2
 #define METADATAFROMJSON_H
3
 
4
 #include<stdint.h>
5
-#include "string"
6
+#include<cstring>
7
 #include "JsonHelper.h"
8
 
9
 class metadataFromJson
10
@@ -36,6 +36,11 @@
11
     metadataFromJson();
12
     ~metadataFromJson();
13
 
14
+   enum JsonType{
15
+       LEGACY,
16
+       LLC
17
+   };
18
+       
19
 
20
     /**
21
      * @brief frameMetadataFromJson: Generates a sigle frame metadata array from Json file with all
22
@@ -98,7 +103,7 @@
23
 
24
     class DynamicMetaIO;
25
     DynamicMetaIO *mPimpl;
26
-    void fillMetadataArray(const JsonArray &fileData, int frame, uint8_t *&metadata);
27
+    void fillMetadataArray(const JsonArray &fileData, int frame, const JsonType jsonType, uint8_t *&metadata);
28
 };
29
 
30
 #endif // METADATAFROMJSON_H
31
x265_2.7.tar.gz/source/encoder/analysis.cpp -> x265_2.9.tar.gz/source/encoder/analysis.cpp Changed
201
 
1
@@ -37,7 +37,7 @@
2
 using namespace X265_NS;
3
 
4
 /* An explanation of rate distortion levels (--rd-level)
5
- * 
6
+ *
7
  * rd-level 0 generates no recon per CU (NO RDO or Quant)
8
  *
9
  *   sa8d selection between merge / skip / inter / intra and split
10
@@ -187,27 +187,24 @@
11
         for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
12
             ctu.m_log2CUSize[i] = (uint8_t)m_param->maxLog2CUSize - ctu.m_cuDepth[i];
13
     }
14
-    if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead)
15
+    if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && (m_slice->m_sliceType != I_SLICE))
16
     {
17
-        m_multipassAnalysis = (analysis2PassFrameData*)m_frame->m_analysis2Pass.analysisFramedata;
18
-        m_multipassDepth = &m_multipassAnalysis->depth[ctu.m_cuAddr * ctu.m_numPartitions];
19
-        if (m_slice->m_sliceType != I_SLICE)
20
+        int numPredDir = m_slice->isInterP() ? 1 : 2;
21
+        m_reuseInterDataCTU = m_frame->m_analysisData.interData;
22
+        for (int dir = 0; dir < numPredDir; dir++)
23
         {
24
-            int numPredDir = m_slice->isInterP() ? 1 : 2;
25
-            for (int dir = 0; dir < numPredDir; dir++)
26
-            {
27
-                m_multipassMv[dir] = &m_multipassAnalysis->m_mv[dir][ctu.m_cuAddr * ctu.m_numPartitions];
28
-                m_multipassMvpIdx[dir] = &m_multipassAnalysis->mvpIdx[dir][ctu.m_cuAddr * ctu.m_numPartitions];
29
-                m_multipassRef[dir] = &m_multipassAnalysis->ref[dir][ctu.m_cuAddr * ctu.m_numPartitions];
30
-            }
31
-            m_multipassModes = &m_multipassAnalysis->modes[ctu.m_cuAddr * ctu.m_numPartitions];
32
+            m_reuseMv[dir] = &m_reuseInterDataCTU->mv[dir][ctu.m_cuAddr * ctu.m_numPartitions];
33
+            m_reuseMvpIdx[dir] = &m_reuseInterDataCTU->mvpIdx[dir][ctu.m_cuAddr * ctu.m_numPartitions];
34
         }
35
+        m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * ctu.m_numPartitions];
36
+        m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions];
37
+        m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions];
38
     }
39
-
40
+    
41
     if ((m_param->analysisSave || m_param->analysisLoad) && m_slice->m_sliceType != I_SLICE && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel < 10)
42
     {
43
         int numPredDir = m_slice->isInterP() ? 1 : 2;
44
-        m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
45
+        m_reuseInterDataCTU = m_frame->m_analysisData.interData;
46
         m_reuseRef = &m_reuseInterDataCTU->ref [ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
47
         m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions];
48
         m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions];
49
@@ -224,7 +221,7 @@
50
 
51
     if (m_slice->m_sliceType == I_SLICE)
52
     {
53
-        analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
54
+        x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData;
55
         if (m_param->analysisLoad && m_param->analysisReuseLevel > 1)
56
         {
57
             memcpy(ctu.m_cuDepth, &intraDataCTU->depth[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition);
58
@@ -243,7 +240,7 @@
59
 
60
         if (bCopyAnalysis)
61
         {
62
-            analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
63
+            x265_analysis_inter_data* interDataCTU = m_frame->m_analysisData.interData;
64
             int posCTU = ctu.m_cuAddr * numPartition;
65
             memcpy(ctu.m_cuDepth, &interDataCTU->depth[posCTU], sizeof(uint8_t) * numPartition);
66
             memcpy(ctu.m_predMode, &interDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition);
67
@@ -253,7 +250,7 @@
68
 
69
             if ((m_slice->m_sliceType == P_SLICE || m_param->bIntraInBFrames) && !m_param->bMVType)
70
             {
71
-                analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
72
+                x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData;
73
                 memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition);
74
                 memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[posCTU], sizeof(uint8_t) * numPartition);
75
             }
76
@@ -279,14 +276,14 @@
77
         }
78
         else if ((m_param->analysisLoad && m_param->analysisReuseLevel == 10) || ((m_param->bMVType == AVC_INFO) && m_param->analysisReuseLevel >= 7 && ctu.m_numPartitions <= 16))
79
         {
80
-            analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
81
+            x265_analysis_inter_data* interDataCTU = m_frame->m_analysisData.interData;
82
             int posCTU = ctu.m_cuAddr * numPartition;
83
             memcpy(ctu.m_cuDepth, &interDataCTU->depth[posCTU], sizeof(uint8_t) * numPartition);
84
             memcpy(ctu.m_predMode, &interDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition);
85
             memcpy(ctu.m_partSize, &interDataCTU->partSize[posCTU], sizeof(uint8_t) * numPartition);
86
             if ((m_slice->m_sliceType == P_SLICE || m_param->bIntraInBFrames) && !(m_param->bMVType == AVC_INFO))
87
             {
88
-                analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
89
+                x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData;
90
                 memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition);
91
                 memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[posCTU], sizeof(uint8_t) * numPartition);
92
             }
93
@@ -518,19 +515,20 @@
94
     bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
95
     bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
96
 
97
-    bool bAlreadyDecided = parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX;
98
-    bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth;
99
+    bool bAlreadyDecided = m_param->intraRefine != 4 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX;
100
+    bool bDecidedDepth = m_param->intraRefine != 4 && parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth;
101
     int split = 0;
102
-    if (m_param->intraRefine)
103
+    if (m_param->intraRefine && m_param->intraRefine != 4)
104
     {
105
-        split = ((cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1)) && bDecidedDepth);
106
+        split = m_param->scaleFactor && bDecidedDepth && (!mightNotSplit || 
107
+            ((cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1))));
108
         if (cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize]) && !bDecidedDepth)
109
             bAlreadyDecided = false;
110
     }
111
 
112
     if (bAlreadyDecided)
113
     {
114
-        if (bDecidedDepth)
115
+        if (bDecidedDepth && mightNotSplit)
116
         {
117
             Mode& mode = md.pred[0];
118
             md.bestMode = &mode;
119
@@ -1184,7 +1182,7 @@
120
 
121
         if (m_evaluateInter)
122
         {
123
-            if (m_param->interRefine == 2)
124
+            if (m_refineLevel == 2)
125
             {
126
                 if (parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP)
127
                     skipModes = true;
128
@@ -1283,11 +1281,11 @@
129
                 }
130
             }
131
         }
132
-        if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis)
133
+        if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU)
134
         {
135
-            if (mightNotSplit && depth == m_multipassDepth[cuGeom.absPartIdx])
136
+            if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx])
137
             {
138
-                if (m_multipassModes[cuGeom.absPartIdx] == MODE_SKIP)
139
+                if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP)
140
                 {
141
                     md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
142
                     md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
143
@@ -1307,7 +1305,7 @@
144
             md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
145
             checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
146
             if (m_param->rdLevel)
147
-                skipModes = (m_param->bEnableEarlySkip || m_param->interRefine == 2)
148
+                skipModes = (m_param->bEnableEarlySkip || m_refineLevel == 2)
149
                 && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth
150
         }
151
         if (md.bestMode && m_param->bEnableRecursionSkip && !bCtuInfoCheck && !(m_param->bMVType && m_param->analysisReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1])))
152
@@ -1874,7 +1872,7 @@
153
 
154
         if (m_evaluateInter)
155
         {
156
-            if (m_param->interRefine == 2)
157
+            if (m_refineLevel == 2)
158
             {
159
                 if (parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP)
160
                     skipModes = true;
161
@@ -1976,11 +1974,11 @@
162
             }
163
         }
164
 
165
-        if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis)
166
+        if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU)
167
         {
168
-            if (mightNotSplit && depth == m_multipassDepth[cuGeom.absPartIdx])
169
+            if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx])
170
             {
171
-                if (m_multipassModes[cuGeom.absPartIdx] == MODE_SKIP)
172
+                if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP)
173
                 {
174
                     md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
175
                     md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
176
@@ -2004,7 +2002,7 @@
177
             md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
178
             md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
179
             checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
180
-            skipModes = (m_param->bEnableEarlySkip || m_param->interRefine == 2) &&
181
+            skipModes = (m_param->bEnableEarlySkip || m_refineLevel == 2) &&
182
                 md.bestMode && !md.bestMode->cu.getQtRootCbf(0);
183
             refMasks[0] = allSplitRefs;
184
             md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
185
@@ -2413,9 +2411,18 @@
186
     bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
187
     bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth;
188
 
189
-    int split = (m_param->interRefine && cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1) && bDecidedDepth);
190
+    TrainingData td;
191
+    td.init(parentCTU, cuGeom);
192
 
193
-    if (bDecidedDepth)
194
+    if (!m_param->bDynamicRefine)
195
+        m_refineLevel = m_param->interRefine;
196
+    else
197
+        m_refineLevel = m_frame->m_classifyFrame ? 1 : 3;
198
+    int split = (m_param->scaleFactor && bDecidedDepth && (!mightNotSplit || 
199
+        (m_refineLevel && cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1))));
200
+    td.split = split;
201
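The analysis.cpp changes above replace the old m_multipass* pointers with the shared m_reuse* pointers and wire in dynamic refinement: when bDynamicRefine is set, the effective inter refine level no longer comes from interRefine but from whether the frame already carries classification data. A small restatement of that selection, with illustrative names; the surrounding classify/train machinery is omitted:

    // Mirrors the m_refineLevel assignment above.
    struct RefineParams { bool bDynamicRefine; int interRefine; };

    static int chooseRefineLevel(const RefineParams& p, bool classifyFrame)
    {
        if (!p.bDynamicRefine)
            return p.interRefine;      // static: use the configured inter refine level
        // dynamic refinement: level 1 once m_frame->m_classifyFrame is set,
        // level 3 otherwise
        return classifyFrame ? 1 : 3;
    }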
x265_2.7.tar.gz/source/encoder/analysis.h -> x265_2.9.tar.gz/source/encoder/analysis.h Changed
69
 
1
@@ -123,27 +123,42 @@
2
 
3
 protected:
4
     /* Analysis data for save/load mode, writes/reads data based on absPartIdx */
5
-    analysis_inter_data* m_reuseInterDataCTU;
6
-    int32_t*             m_reuseRef;
7
-    uint8_t*             m_reuseDepth;
8
-    uint8_t*             m_reuseModes;
9
-    uint8_t*             m_reusePartSize;
10
-    uint8_t*             m_reuseMergeFlag;
11
+    x265_analysis_inter_data*  m_reuseInterDataCTU;
12
+    int32_t*                   m_reuseRef;
13
+    uint8_t*                   m_reuseDepth;
14
+    uint8_t*                   m_reuseModes;
15
+    uint8_t*                   m_reusePartSize;
16
+    uint8_t*                   m_reuseMergeFlag;
17
+    x265_analysis_MV*          m_reuseMv[2];
18
+    uint8_t*             m_reuseMvpIdx[2];
19
 
20
     uint32_t             m_splitRefIdx[4];
21
     uint64_t*            cacheCost;
22
 
23
-
24
-    analysis2PassFrameData* m_multipassAnalysis;
25
-    uint8_t*                m_multipassDepth;
26
-    MV*                     m_multipassMv[2];
27
-    int*                    m_multipassMvpIdx[2];
28
-    int32_t*                m_multipassRef[2];
29
-    uint8_t*                m_multipassModes;
30
-
31
     uint8_t                 m_evaluateInter;
32
+    int32_t                 m_refineLevel;
33
+
34
     uint8_t*                m_additionalCtuInfo;
35
     int*                    m_prevCtuInfoChange;
36
+
37
+    struct TrainingData
38
+    {
39
+        uint32_t cuVariance;
40
+        uint8_t predMode;
41
+        uint8_t partSize;
42
+        uint8_t mergeFlag;
43
+        int split;
44
+
45
+        void init(const CUData& parentCTU, const CUGeom& cuGeom)
46
+        {
47
+            cuVariance = 0;
48
+            predMode = parentCTU.m_predMode[cuGeom.absPartIdx];
49
+            partSize = parentCTU.m_partSize[cuGeom.absPartIdx];
50
+            mergeFlag = parentCTU.m_mergeFlag[cuGeom.absPartIdx];
51
+            split = 0;
52
+        }
53
+    };
54
+
55
     /* refine RD based on QP for rd-levels 5 and 6 */
56
     void qprdRefine(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t lqp);
57
 
58
@@ -182,6 +197,10 @@
59
     void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);
60
 
61
     int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, int32_t complexCheck = 0, double baseQP = -1);
62
+    uint32_t calculateCUVariance(const CUData& ctu, const CUGeom& cuGeom);
63
+
64
+    void classifyCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData);
65
+    void trainCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData);
66
 
67
     void calculateNormFactor(CUData& ctu, int qp);
68
     void normFactor(const pixel* src, uint32_t blockSize, CUData& ctu, int qp, TextType ttype);
69
x265_2.7.tar.gz/source/encoder/api.cpp -> x265_2.9.tar.gz/source/encoder/api.cpp Changed
201
 
1
@@ -31,6 +31,10 @@
2
 #include "nal.h"
3
 #include "bitcost.h"
4
 
5
+#if ENABLE_LIBVMAF
6
+#include "libvmaf.h"
7
+#endif
8
+
9
 /* multilib namespace reflectors */
10
 #if LINKED_8BIT
11
 namespace x265_8bit {
12
@@ -274,10 +278,10 @@
13
         pic_in->analysisData.wt = NULL;
14
         pic_in->analysisData.intraData = NULL;
15
         pic_in->analysisData.interData = NULL;
16
-        pic_in->analysis2Pass.analysisFramedata = NULL;
17
+        pic_in->analysisData.distortionData = NULL;
18
     }
19
 
20
-    if (pp_nal && numEncoded > 0)
21
+    if (pp_nal && numEncoded > 0 && encoder->m_outputCount >= encoder->m_latestParam->chunkStart)
22
     {
23
         *pp_nal = &encoder->m_nalList.m_nal[0];
24
         if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal;
25
@@ -285,7 +289,7 @@
26
     else if (pi_nal)
27
         *pi_nal = 0;
28
 
29
-    if (numEncoded && encoder->m_param->csvLogLevel)
30
+    if (numEncoded && encoder->m_param->csvLogLevel && encoder->m_outputCount >= encoder->m_latestParam->chunkStart)
31
         x265_csvlog_frame(encoder->m_param, pic_out);
32
 
33
     if (numEncoded < 0)
34
@@ -302,13 +306,34 @@
35
         encoder->fetchStats(outputStats, statsSizeBytes);
36
     }
37
 }
38
+#if ENABLE_LIBVMAF
39
+void x265_vmaf_encoder_log(x265_encoder* enc, int argc, char **argv, x265_param *param, x265_vmaf_data *vmafdata)
40
+{
41
+    if (enc)
42
+    {
43
+        Encoder *encoder = static_cast<Encoder*>(enc);
44
+        x265_stats stats;       
45
+        stats.aggregateVmafScore = x265_calculate_vmafscore(param, vmafdata);
46
+        if(vmafdata->reference_file)
47
+            fclose(vmafdata->reference_file);
48
+        if(vmafdata->distorted_file)
49
+            fclose(vmafdata->distorted_file);
50
+        if(vmafdata)
51
+            x265_free(vmafdata);
52
+        encoder->fetchStats(&stats, sizeof(stats));
53
+        int padx = encoder->m_sps.conformanceWindow.rightOffset;
54
+        int pady = encoder->m_sps.conformanceWindow.bottomOffset;
55
+        x265_csvlog_encode(encoder->m_param, &stats, padx, pady, argc, argv);
56
+    }
57
+}
58
+#endif
59
 
60
 void x265_encoder_log(x265_encoder* enc, int argc, char **argv)
61
 {
62
     if (enc)
63
     {
64
         Encoder *encoder = static_cast<Encoder*>(enc);
65
-        x265_stats stats;
66
+        x265_stats stats;       
67
         encoder->fetchStats(&stats, sizeof(stats));
68
         int padx = encoder->m_sps.conformanceWindow.rightOffset;
69
         int pady = encoder->m_sps.conformanceWindow.bottomOffset;
70
@@ -378,6 +403,181 @@
71
     return -1;
72
 }
73
 
74
+void x265_alloc_analysis_data(x265_param *param, x265_analysis_data* analysis)
75
+{
76
+    x265_analysis_inter_data *interData = analysis->interData = NULL;
77
+    x265_analysis_intra_data *intraData = analysis->intraData = NULL;
78
+    x265_analysis_distortion_data *distortionData = analysis->distortionData = NULL;
79
+    bool isVbv = param->rc.vbvMaxBitrate > 0 && param->rc.vbvBufferSize > 0;
80
+    int numDir = 2; //irrespective of P or B slices set direction as 2
81
+    uint32_t numPlanes = param->internalCsp == X265_CSP_I400 ? 1 : 3;
82
+
83
+#if X265_DEPTH < 10 && (LINKED_10BIT || LINKED_12BIT)
84
+    uint32_t numCUs_sse_t = param->internalBitDepth > 8 ? analysis->numCUsInFrame << 1 : analysis->numCUsInFrame;
85
+#elif X265_DEPTH >= 10 && LINKED_8BIT
86
+    uint32_t numCUs_sse_t = param->internalBitDepth > 8 ? analysis->numCUsInFrame : (analysis->numCUsInFrame + 1U) >> 1;
87
+#else
88
+    uint32_t numCUs_sse_t = analysis->numCUsInFrame;
89
+#endif
90
+
91
+    //Allocate memory for distortionData pointer
92
+    CHECKED_MALLOC_ZERO(distortionData, x265_analysis_distortion_data, 1);
93
+    CHECKED_MALLOC_ZERO(distortionData->distortion, sse_t, analysis->numPartitions * numCUs_sse_t);
94
+    if (param->rc.bStatRead)
95
+    {
96
+        CHECKED_MALLOC_ZERO(distortionData->ctuDistortion, sse_t, numCUs_sse_t);
97
+        CHECKED_MALLOC_ZERO(distortionData->scaledDistortion, double, analysis->numCUsInFrame);
98
+        CHECKED_MALLOC_ZERO(distortionData->offset, double, analysis->numCUsInFrame);
99
+        CHECKED_MALLOC_ZERO(distortionData->threshold, double, analysis->numCUsInFrame);
100
+    }
101
+    analysis->distortionData = distortionData;
102
+
103
+    if (param->bDisableLookahead && isVbv)
104
+    {
105
+        CHECKED_MALLOC_ZERO(analysis->lookahead.intraSatdForVbv, uint32_t, analysis->numCuInHeight);
106
+        CHECKED_MALLOC_ZERO(analysis->lookahead.satdForVbv, uint32_t, analysis->numCuInHeight);
107
+        CHECKED_MALLOC_ZERO(analysis->lookahead.intraVbvCost, uint32_t, analysis->numCUsInFrame);
108
+        CHECKED_MALLOC_ZERO(analysis->lookahead.vbvCost, uint32_t, analysis->numCUsInFrame);
109
+    }
110
+
111
+    //Allocate memory for weightParam pointer
112
+    if (!(param->bMVType == AVC_INFO))
113
+        CHECKED_MALLOC_ZERO(analysis->wt, x265_weight_param, numPlanes * numDir);
114
+
115
+    if (param->analysisReuseLevel < 2)
116
+        return;
117
+
118
+    //Allocate memory for intraData pointer
119
+    CHECKED_MALLOC_ZERO(intraData, x265_analysis_intra_data, 1);
120
+    CHECKED_MALLOC(intraData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
121
+    CHECKED_MALLOC(intraData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
122
+    CHECKED_MALLOC(intraData->partSizes, char, analysis->numPartitions * analysis->numCUsInFrame);
123
+    CHECKED_MALLOC(intraData->chromaModes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
124
+    analysis->intraData = intraData;
125
+
126
+    //Allocate memory for interData pointer based on ReuseLevels
127
+    CHECKED_MALLOC_ZERO(interData, x265_analysis_inter_data, 1);
128
+    CHECKED_MALLOC(interData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
129
+    CHECKED_MALLOC(interData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
130
+
131
+    CHECKED_MALLOC_ZERO(interData->mvpIdx[0], uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
132
+    CHECKED_MALLOC_ZERO(interData->mvpIdx[1], uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
133
+    CHECKED_MALLOC_ZERO(interData->mv[0], x265_analysis_MV, analysis->numPartitions * analysis->numCUsInFrame);
134
+    CHECKED_MALLOC_ZERO(interData->mv[1], x265_analysis_MV, analysis->numPartitions * analysis->numCUsInFrame);
135
+
136
+    if (param->analysisReuseLevel > 4)
137
+    {
138
+        CHECKED_MALLOC(interData->partSize, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
139
+        CHECKED_MALLOC_ZERO(interData->mergeFlag, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
140
+    }
141
+    if (param->analysisReuseLevel >= 7)
142
+    {
143
+        CHECKED_MALLOC(interData->interDir, uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
144
+        CHECKED_MALLOC(interData->sadCost, int64_t, analysis->numPartitions * analysis->numCUsInFrame);
145
+        for (int dir = 0; dir < numDir; dir++)
146
+        {
147
+            CHECKED_MALLOC(interData->refIdx[dir], int8_t, analysis->numPartitions * analysis->numCUsInFrame);
148
+            CHECKED_MALLOC_ZERO(analysis->modeFlag[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame);
149
+        }
150
+    }
151
+    else
152
+    {
153
+        if (param->analysisMultiPassRefine || param->analysisMultiPassDistortion){
154
+            CHECKED_MALLOC_ZERO(interData->ref, int32_t, 2 * analysis->numPartitions * analysis->numCUsInFrame);
155
+        }
156
+        else
157
+            CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir);
158
+    }
159
+    analysis->interData = interData;
160
+
161
+    return;
162
+
163
+fail:
164
+    x265_free_analysis_data(param, analysis);
165
+}
166
+
167
+void x265_free_analysis_data(x265_param *param, x265_analysis_data* analysis)
168
+{
169
+    bool isVbv = param->rc.vbvMaxBitrate > 0 && param->rc.vbvBufferSize > 0;
170
+
171
+    //Free memory for Lookahead pointers
172
+    if (param->bDisableLookahead && isVbv)
173
+    {
174
+        X265_FREE(analysis->lookahead.satdForVbv);
175
+        X265_FREE(analysis->lookahead.intraSatdForVbv);
176
+        X265_FREE(analysis->lookahead.vbvCost);
177
+        X265_FREE(analysis->lookahead.intraVbvCost);
178
+    }
179
+
180
+    //Free memory for distortionData pointers
181
+    if (analysis->distortionData)
182
+    {
183
+        X265_FREE((analysis->distortionData)->distortion);
184
+        if (param->rc.bStatRead)
185
+        {
186
+            X265_FREE((analysis->distortionData)->ctuDistortion);
187
+            X265_FREE((analysis->distortionData)->scaledDistortion);
188
+            X265_FREE((analysis->distortionData)->offset);
189
+            X265_FREE((analysis->distortionData)->threshold);
190
+        }
191
+        X265_FREE(analysis->distortionData);
192
+    }
193
+
194
+    /* Early exit freeing weights alone if level is 1 (when there is no analysis inter/intra) */
195
+    if (analysis->wt && !(param->bMVType == AVC_INFO))
196
+        X265_FREE(analysis->wt);
197
+
198
+    if (param->analysisReuseLevel < 2)
199
+        return;
200
+
201
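The api.cpp hunk above introduces x265_alloc_analysis_data() and x265_free_analysis_data() as the allocation path for x265_analysis_data (the "API function for allocating and freeing x265_analysis_data" item in the 2.9 changelog). A minimal sketch of how a caller might drive the pair, assuming the two helpers are exported through x265.h and that the geometry fields used in the hunk (numCUsInFrame, numCuInHeight, numPartitions) are the ones a caller must seed:

    #include <cstring>
    #include "x265.h"   // assumed to declare the two helpers shown above

    static void analysisBufferRoundTrip(int width, int height)
    {
        x265_param* param = x265_param_alloc();
        x265_param_default(param);
        param->sourceWidth  = width;
        param->sourceHeight = height;
        param->analysisReuseLevel = 10;            // request the full inter/intra buffers

        uint32_t ctu = param->maxCUSize;           // 64 unless overridden
        x265_analysis_data analysis;
        memset(&analysis, 0, sizeof(analysis));
        analysis.numCuInHeight = (height + ctu - 1) / ctu;
        analysis.numCUsInFrame = ((width + ctu - 1) / ctu) * analysis.numCuInHeight;
        analysis.numPartitions = (ctu >> 2) * (ctu >> 2);   // 4x4 partitions per CTU

        x265_alloc_analysis_data(param, &analysis);
        /* ... hand the buffers to x265_picture::analysisData for analysis save/load ... */
        x265_free_analysis_data(param, &analysis);
        x265_param_free(param);
    }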
x265_2.7.tar.gz/source/encoder/dpb.cpp -> x265_2.9.tar.gz/source/encoder/dpb.cpp Changed
35
 
1
@@ -131,9 +131,8 @@
2
     int pocCurr = slice->m_poc;
3
     int type = newFrame->m_lowres.sliceType;
4
     bool bIsKeyFrame = newFrame->m_lowres.bKeyframe;
5
-
6
     slice->m_nalUnitType = getNalUnitType(pocCurr, bIsKeyFrame);
7
-    if (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL)
8
+    if (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP)
9
         m_lastIDR = pocCurr;
10
     slice->m_lastIDR = m_lastIDR;
11
     slice->m_sliceType = IS_X265_TYPE_B(type) ? B_SLICE : (type == X265_TYPE_P) ? P_SLICE : I_SLICE;
12
@@ -250,7 +249,7 @@
13
 /* Marking reference pictures when an IDR/CRA is encountered. */
14
 void DPB::decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType)
15
 {
16
-    if (nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL)
17
+    if (nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL || nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP)
18
     {
19
         /* If the nal_unit_type is IDR, all pictures in the reference picture
20
          * list are marked as "unused for reference" */
21
@@ -326,11 +325,9 @@
22
 NalUnitType DPB::getNalUnitType(int curPOC, bool bIsKeyFrame)
23
 {
24
     if (!curPOC)
25
-        return NAL_UNIT_CODED_SLICE_IDR_W_RADL;
26
-
27
+        return NAL_UNIT_CODED_SLICE_IDR_N_LP;
28
     if (bIsKeyFrame)
29
-        return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : NAL_UNIT_CODED_SLICE_IDR_W_RADL;
30
-
31
+        return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP;
32
     if (m_pocCRA && curPOC < m_pocCRA)
33
         // All leading pictures are being marked as TFD pictures here since
34
         // current encoder uses all reference pictures while encoding leading
35
x265_2.7.tar.gz/source/encoder/dpb.h -> x265_2.9.tar.gz/source/encoder/dpb.h Changed
17
 
1
@@ -40,6 +40,7 @@
2
     int                m_lastIDR;
3
     int                m_pocCRA;
4
     int                m_bOpenGOP;
5
+    int                m_bhasLeadingPicture;
6
     bool               m_bRefreshPending;
7
     bool               m_bTemporalSublayer;
8
     PicList            m_picList;
9
@@ -50,6 +51,7 @@
10
     {
11
         m_lastIDR = 0;
12
         m_pocCRA = 0;
13
+        m_bhasLeadingPicture = param->radl;
14
         m_bRefreshPending = false;
15
         m_frameDataFreeList = NULL;
16
         m_bOpenGOP = param->bOpenGOP;
17
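Taken together, the dpb.cpp and dpb.h hunks above make closed-GOP keyframes IDR_N_LP unless leading pictures were requested (m_bhasLeadingPicture is seeded from param->radl). A condensed restatement of the keyframe branch of getNalUnitType(), as a sketch rather than code from the patch:

    enum KeyframeNal { IDR_N_LP, IDR_W_RADL, CRA };

    // Keyframe path only; non-keyframe pictures are unaffected by this change.
    static KeyframeNal keyframeNalType(int poc, bool openGop, bool hasLeadingPictures)
    {
        if (poc == 0)
            return IDR_N_LP;                 // the first picture never has leading pictures
        if (openGop)
            return CRA;                      // open-GOP keyframes stay CRA
        return hasLeadingPictures ? IDR_W_RADL : IDR_N_LP;
    }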
x265_2.7.tar.gz/source/encoder/encoder.cpp -> x265_2.9.tar.gz/source/encoder/encoder.cpp Changed
201
 
1
@@ -79,6 +79,7 @@
2
     m_threadPool = NULL;
3
     m_analysisFileIn = NULL;
4
     m_analysisFileOut = NULL;
5
+    m_naluFile = NULL;
6
     m_offsetEmergency = NULL;
7
     m_iFrameNum = 0;
8
     m_iPPSQpMinus26 = 0;
9
@@ -96,6 +97,8 @@
10
 #endif
11
 
12
     m_prevTonemapPayload.payload = NULL;
13
+    m_startPoint = 0;
14
+    m_saveCTUSize = 0;
15
 }
16
 inline char *strcatFilename(const char *input, const char *suffix)
17
 {
18
@@ -337,10 +340,12 @@
19
 
20
     if (m_param->bEmitHRDSEI)
21
         m_rateControl->initHRD(m_sps);
22
+
23
     if (!m_rateControl->init(m_sps))
24
         m_aborted = true;
25
     if (!m_lookahead->create())
26
         m_aborted = true;
27
+
28
     initRefIdx();
29
     if (m_param->analysisSave && m_param->bUseAnalysisFile)
30
     {
31
@@ -408,10 +413,35 @@
32
 
33
     m_emitCLLSEI = p->maxCLL || p->maxFALL;
34
 
35
+    if (m_param->naluFile)
36
+    {
37
+        m_naluFile = x265_fopen(m_param->naluFile, "r");
38
+        if (!m_naluFile)
39
+        {
40
+            x265_log_file(NULL, X265_LOG_ERROR, "%s file not found or Failed to open\n", m_param->naluFile);
41
+            m_aborted = true;
42
+        }
43
+        else
44
+             m_enableNal = 1;
45
+    }
46
+    else
47
+         m_enableNal = 0;
48
+
49
 #if ENABLE_HDR10_PLUS
50
     if (m_bToneMap)
51
         m_numCimInfo = m_hdr10plus_api->hdr10plus_json_to_movie_cim(m_param->toneMapFile, m_cim);
52
 #endif
53
+    if (m_param->bDynamicRefine)
54
+    {
55
+        /* Allocate memory for 1 GOP and reuse it for the subsequent GOPs */
56
+        int size = (m_param->keyframeMax + m_param->lookaheadDepth) * m_param->maxCUDepth * X265_REFINE_INTER_LEVELS;
57
+        CHECKED_MALLOC_ZERO(m_variance, uint64_t, size);
58
+        CHECKED_MALLOC_ZERO(m_rdCost, uint64_t, size);
59
+        CHECKED_MALLOC_ZERO(m_trainingCount, uint32_t, size);
60
+        return;
61
+    fail:
62
+        m_aborted = true;
63
+    }
64
 }
65
 
66
 void Encoder::stopJobs()
67
@@ -516,8 +546,8 @@
68
         curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions;
69
         int num16x16inCUWidth = m_param->maxCUSize >> 4;
70
         uint32_t ctuAddr, offset, cuPos;
71
-        analysis_intra_data * intraData = (analysis_intra_data *)curFrame->m_analysisData.intraData;
72
-        analysis_intra_data * srcIntraData = (analysis_intra_data *)analysis_data->intraData;
73
+        x265_analysis_intra_data * intraData = curFrame->m_analysisData.intraData;
74
+        x265_analysis_intra_data * srcIntraData = analysis_data->intraData;
75
         for (int i = 0; i < mbImageHeight; i++)
76
         {
77
             for (int j = 0; j < mbImageWidth; j++)
78
@@ -546,8 +576,8 @@
79
         curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions;
80
         int num16x16inCUWidth = m_param->maxCUSize >> 4;
81
         uint32_t ctuAddr, offset, cuPos;
82
-        analysis_inter_data * interData = (analysis_inter_data *)curFrame->m_analysisData.interData;
83
-        analysis_inter_data * srcInterData = (analysis_inter_data*)analysis_data->interData;
84
+        x265_analysis_inter_data * interData = curFrame->m_analysisData.interData;
85
+        x265_analysis_inter_data * srcInterData = analysis_data->interData;
86
         for (int i = 0; i < mbImageHeight; i++)
87
         {
88
             for (int j = 0; j < mbImageWidth; j++)
89
@@ -611,7 +641,7 @@
90
         curFrame->m_analysisData = (*analysis_data);
91
         curFrame->m_analysisData.numCUsInFrame = widthInCU * heightInCU;
92
         curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions;
93
-        allocAnalysis(&curFrame->m_analysisData);
94
+        x265_alloc_analysis_data(m_param, &curFrame->m_analysisData);
95
         if (m_param->maxCUSize == 16)
96
         {
97
             if (analysis_data->sliceType == X265_TYPE_IDR || analysis_data->sliceType == X265_TYPE_I)
98
@@ -622,8 +652,8 @@
99
 
100
                 curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions;
101
                 size_t count = 0;
102
-                analysis_intra_data * currIntraData = (analysis_intra_data *)curFrame->m_analysisData.intraData;
103
-                analysis_intra_data * intraData = (analysis_intra_data *)analysis_data->intraData;
104
+                x265_analysis_intra_data * currIntraData = curFrame->m_analysisData.intraData;
105
+                x265_analysis_intra_data * intraData = analysis_data->intraData;
106
                 for (uint32_t d = 0; d < cuBytes; d++)
107
                 {
108
                     int bytes = curFrame->m_analysisData.numPartitions >> ((intraData)->depth[d] * 2);
109
@@ -643,14 +673,14 @@
110
 
111
                 curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions;
112
                 size_t count = 0;
113
-                analysis_inter_data * currInterData = (analysis_inter_data *)curFrame->m_analysisData.interData;
114
-                analysis_inter_data * interData = (analysis_inter_data *)analysis_data->interData;
115
+                x265_analysis_inter_data * currInterData = curFrame->m_analysisData.interData;
116
+                x265_analysis_inter_data * interData = analysis_data->interData;
117
                 for (uint32_t d = 0; d < cuBytes; d++)
118
                 {
119
                     int bytes = curFrame->m_analysisData.numPartitions >> ((interData)->depth[d] * 2);
120
                     memset(&(currInterData)->depth[count], (interData)->depth[d], bytes);
121
                     memset(&(currInterData)->modes[count], (interData)->modes[d], bytes);
122
-                    memcpy(&(currInterData)->sadCost[count], &((analysis_inter_data*)analysis_data->interData)->sadCost[d], bytes);
123
+                    memcpy(&(currInterData)->sadCost[count], &(analysis_data->interData)->sadCost[d], bytes);
124
                     if (m_param->analysisReuseLevel > 4)
125
                     {
126
                         memset(&(currInterData)->partSize[count], (interData)->partSize[d], bytes);
127
@@ -697,7 +727,13 @@
128
     if (m_bToneMap)
129
         m_hdr10plus_api->hdr10plus_clear_movie(m_cim, m_numCimInfo);
130
 #endif
131
-        
132
+
133
+    if (m_param->bDynamicRefine)
134
+    {
135
+        X265_FREE(m_variance);
136
+        X265_FREE(m_rdCost);
137
+        X265_FREE(m_trainingCount);
138
+    }
139
     if (m_exportedPic)
140
     {
141
         ATOMIC_DEC(&m_exportedPic->m_countRefEncoders);
142
@@ -761,6 +797,8 @@
143
         }
144
         X265_FREE(temp);
145
      }
146
+    if (m_naluFile)
147
+        fclose(m_naluFile);
148
     if (m_param)
149
     {
150
         if (m_param->csvfpt)
151
@@ -837,6 +875,77 @@
152
     }
153
 }
154
 
155
+void Encoder::copyUserSEIMessages(Frame *frame, const x265_picture* pic_in)
156
+{
157
+    x265_sei_payload toneMap;
158
+    toneMap.payload = NULL;
159
+    int toneMapPayload = 0;
160
+
161
+#if ENABLE_HDR10_PLUS
162
+    if (m_bToneMap)
163
+    {
164
+        int currentPOC = m_pocLast;
165
+        if (currentPOC < m_numCimInfo)
166
+        {
167
+            int32_t i = 0;
168
+            toneMap.payloadSize = 0;
169
+            while (m_cim[currentPOC][i] == 0xFF)
170
+                toneMap.payloadSize += m_cim[currentPOC][i++];
171
+            toneMap.payloadSize += m_cim[currentPOC][i];
172
+
173
+            toneMap.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * toneMap.payloadSize);
174
+            toneMap.payloadType = USER_DATA_REGISTERED_ITU_T_T35;
175
+            memcpy(toneMap.payload, &m_cim[currentPOC][i + 1], toneMap.payloadSize);
176
+            toneMapPayload = 1;
177
+        }
178
+    }
179
+#endif
180
+    /* seiMsg will contain SEI messages specified in a fixed file format in POC order.
181
+    * Format of the file : <POC><space><PREFIX><space><NAL UNIT TYPE>/<SEI TYPE><space><SEI Payload> */
182
+    x265_sei_payload seiMsg;
183
+    seiMsg.payload = NULL;
184
+    int userPayload = 0;
185
+    if (m_enableNal)
186
+    {
187
+        readUserSeiFile(seiMsg, m_pocLast);
188
+        if (seiMsg.payload)
189
+            userPayload = 1;;
190
+    }
191
+
192
+    int numPayloads = pic_in->userSEI.numPayloads + toneMapPayload + userPayload;
193
+    frame->m_userSEI.numPayloads = numPayloads;
194
+
195
+    if (frame->m_userSEI.numPayloads)
196
+    {
197
+        if (!frame->m_userSEI.payloads)
198
+        {
199
+            frame->m_userSEI.payloads = new x265_sei_payload[numPayloads];
200
+            for (int i = 0; i < numPayloads; i++)
201
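copyUserSEIMessages() above only documents the per-line layout of the text file supplied via param->naluFile (<POC> <PREFIX> <NAL UNIT TYPE>/<SEI TYPE> <SEI Payload>); the reader itself, readUserSeiFile(), falls outside this excerpt. A rough sketch of parsing that layout, where the field handling is an illustrative assumption and not the reference implementation:

    #include <sstream>
    #include <string>

    struct UserSeiLine
    {
        int poc;
        std::string prefix;     // e.g. "PREFIX"
        std::string type;       // the "<NAL UNIT TYPE>/<SEI TYPE>" token
        std::string payload;    // rest of the line, passed through as-is
    };

    static bool parseUserSeiLine(const std::string& line, UserSeiLine& out)
    {
        std::istringstream iss(line);
        if (!(iss >> out.poc >> out.prefix >> out.type))
            return false;                   // malformed or empty line
        std::getline(iss, out.payload);
        if (!out.payload.empty() && out.payload[0] == ' ')
            out.payload.erase(0, 1);        // drop the single separating space
        return true;
    }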
x265_2.7.tar.gz/source/encoder/encoder.h -> x265_2.9.tar.gz/source/encoder/encoder.h Changed
121
 
1
@@ -90,6 +90,43 @@
2
     RPSListNode* prior;
3
 };
4
 
5
+struct cuLocation
6
+{
7
+    bool skipWidth;
8
+    bool skipHeight;
9
+    uint32_t heightInCU;
10
+    uint32_t widthInCU;
11
+    uint32_t oddRowIndex;
12
+    uint32_t evenRowIndex;
13
+    uint32_t switchCondition;
14
+
15
+    void init(x265_param* param)
16
+    {
17
+        skipHeight = false;
18
+        skipWidth = false;
19
+        heightInCU = (param->sourceHeight + param->maxCUSize - 1) >> param->maxLog2CUSize;
20
+        widthInCU = (param->sourceWidth + param->maxCUSize - 1) >> param->maxLog2CUSize;
21
+        evenRowIndex = 0;
22
+        oddRowIndex = param->num4x4Partitions * widthInCU;
23
+        switchCondition = 0; // To switch between odd and even rows
24
+    }
25
+};
26
+
27
+struct puOrientation
28
+{
29
+    bool isVert;
30
+    bool isRect;
31
+    bool isAmp;
32
+
33
+    void init()
34
+    {
35
+        isRect = false;
36
+        isAmp = false;
37
+        isVert = false;
38
+    }
39
+};
40
+
41
+
42
 class FrameEncoder;
43
 class DPB;
44
 class Lookahead;
45
@@ -132,6 +169,7 @@
46
     Frame*             m_exportedPic;
47
     FILE*              m_analysisFileIn;
48
     FILE*              m_analysisFileOut;
49
+    FILE*              m_naluFile;
50
     x265_param*        m_param;
51
     x265_param*        m_latestParam;     // Holds latest param during a reconfigure
52
     RateControl*       m_rateControl;
53
@@ -175,6 +213,7 @@
54
     double                m_cR;
55
 
56
     int                     m_bToneMap; // Enables tone-mapping
57
+    int                     m_enableNal;
58
 
59
 #ifdef ENABLE_HDR10_PLUS
60
     const hdr10plus_api     *m_hdr10plus_api;
61
@@ -184,6 +223,15 @@
62
 
63
     x265_sei_payload        m_prevTonemapPayload;
64
 
65
+    /* Collect frame level feature data */
66
+    uint64_t*               m_rdCost;
67
+    uint64_t*               m_variance;
68
+    uint32_t*               m_trainingCount;
69
+    int32_t                 m_startPoint;
70
+    Lock                    m_dynamicRefineLock;
71
+
72
+    bool                    m_saveCTUSize;
73
+
74
     Encoder();
75
     ~Encoder()
76
     {
77
@@ -227,21 +275,26 @@
78
 
79
     void updateVbvPlan(RateControl* rc);
80
 
81
-    void allocAnalysis(x265_analysis_data* analysis);
82
+    void readAnalysisFile(x265_analysis_data* analysis, int poc, int sliceType);
83
+
84
+    void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn, int paramBytes);
85
 
86
-    void freeAnalysis(x265_analysis_data* analysis);
87
+    void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn, int paramBytes, cuLocation cuLoc);
88
 
89
-    void allocAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType);
90
+    int getCUIndex(cuLocation* cuLoc, uint32_t* count, int bytes, int flag);
91
 
92
-    void freeAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType);
93
+    int getPuShape(puOrientation* puOrient, int partSize, int numCTU);
94
 
95
-    void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn);
96
+    void writeAnalysisFile(x265_analysis_data* analysis, FrameData &curEncData);
97
+
98
+    void writeAnalysisFileRefine(x265_analysis_data* analysis, FrameData &curEncData);
99
 
100
-    void writeAnalysisFile(x265_analysis_data* pic, FrameData &curEncData);
101
-    void readAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, int poc, int sliceType);
102
-    void writeAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, FrameData &curEncData, int slicetype);
103
     void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, x265_frame_stats* frameStats, int inPoc);
104
 
105
+    int validateAnalysisData(x265_analysis_data* analysis, int readWriteFlag);
106
+
107
+    void readUserSeiFile(x265_sei_payload& seiMsg, int poc);
108
+
109
     void calcRefreshInterval(Frame* frameEnc);
110
 
111
     void initRefIdx();
112
@@ -249,6 +302,8 @@
113
     void updateRefIdx();
114
     bool computeSPSRPSIndex();
115
 
116
+    void copyUserSEIMessages(Frame *frame, const x265_picture* pic_in);
117
+
118
 protected:
119
 
120
     void initVPS(VPS *vps);
121
x265_2.7.tar.gz/source/encoder/entropy.cpp -> x265_2.9.tar.gz/source/encoder/entropy.cpp Changed
40
 
1
@@ -1369,8 +1369,8 @@
2
                     }
3
                     bDenomCoded = true;
4
                 }
5
-                WRITE_FLAG(wp[0].bPresentFlag, "luma_weight_lX_flag");
6
-                totalSignalledWeightFlags += wp[0].bPresentFlag;
7
+                WRITE_FLAG(!!wp[0].wtPresent, "luma_weight_lX_flag");
8
+                totalSignalledWeightFlags += wp[0].wtPresent;
9
             }
10
 
11
             if (bChroma)
12
@@ -1378,15 +1378,15 @@
13
                 for (int ref = 0; ref < slice.m_numRefIdx[list]; ref++)
14
                 {
15
                     wp = slice.m_weightPredTable[list][ref];
16
-                    WRITE_FLAG(wp[1].bPresentFlag, "chroma_weight_lX_flag");
17
-                    totalSignalledWeightFlags += 2 * wp[1].bPresentFlag;
18
+                    WRITE_FLAG(!!wp[1].wtPresent, "chroma_weight_lX_flag");
19
+                    totalSignalledWeightFlags += 2 * wp[1].wtPresent;
20
                 }
21
             }
22
 
23
             for (int ref = 0; ref < slice.m_numRefIdx[list]; ref++)
24
             {
25
                 wp = slice.m_weightPredTable[list][ref];
26
-                if (wp[0].bPresentFlag)
27
+                if (wp[0].wtPresent)
28
                 {
29
                     int deltaWeight = (wp[0].inputWeight - (1 << wp[0].log2WeightDenom));
30
                     WRITE_SVLC(deltaWeight, "delta_luma_weight_lX");
31
@@ -1395,7 +1395,7 @@
32
 
33
                 if (bChroma)
34
                 {
35
-                    if (wp[1].bPresentFlag)
36
+                    if (wp[1].wtPresent)
37
                     {
38
                         for (int plane = 1; plane < 3; plane++)
39
                         {
40
x265_2.7.tar.gz/source/encoder/frameencoder.cpp -> x265_2.9.tar.gz/source/encoder/frameencoder.cpp Changed
201
 
1
@@ -179,7 +179,7 @@
2
         ok &= m_rce.picTimingSEI && m_rce.hrdTiming;
3
     }
4
 
5
-    if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize)
6
+    if (m_param->noiseReductionIntra || m_param->noiseReductionInter)
7
         m_nr = X265_MALLOC(NoiseReduction, 1);
8
     if (m_nr)
9
         memset(m_nr, 0, sizeof(NoiseReduction));
10
@@ -365,6 +365,65 @@
11
     return length;
12
 }
13
 
14
+bool FrameEncoder::writeToneMapInfo(x265_sei_payload *payload)
15
+{
16
+    bool payloadChange = false;
17
+    if (m_top->m_prevTonemapPayload.payload != NULL && payload->payloadSize == m_top->m_prevTonemapPayload.payloadSize)
18
+    {
19
+        if (memcmp(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize) != 0)
20
+            payloadChange = true;
21
+    }
22
+    else
23
+    {
24
+        payloadChange = true;
25
+        if (m_top->m_prevTonemapPayload.payload != NULL)
26
+            x265_free(m_top->m_prevTonemapPayload.payload);
27
+        m_top->m_prevTonemapPayload.payload = (uint8_t*)x265_malloc(sizeof(uint8_t)* payload->payloadSize);
28
+    }
29
+
30
+    if (payloadChange)
31
+    {
32
+        m_top->m_prevTonemapPayload.payloadType = payload->payloadType;
33
+        m_top->m_prevTonemapPayload.payloadSize = payload->payloadSize;
34
+        memcpy(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize);
35
+    }
36
+
37
+    bool isIDR = m_frame->m_lowres.sliceType == X265_TYPE_IDR;
38
+    return (payloadChange || isIDR);
39
+}
40
+
41
+void FrameEncoder::writeTrailingSEIMessages()
42
+{
43
+    Slice* slice = m_frame->m_encData->m_slice;
44
+    int planes = (m_param->internalCsp != X265_CSP_I400) ? 3 : 1;
45
+    int32_t payloadSize = 0;
46
+
47
+    if (m_param->decodedPictureHashSEI == 1)
48
+    {
49
+        m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::MD5;
50
+        for (int i = 0; i < planes; i++)
51
+            MD5Final(&m_seiReconPictureDigest.m_state[i], m_seiReconPictureDigest.m_digest[i]);
52
+        payloadSize = 1 + 16 * planes;
53
+    }
54
+    else if (m_param->decodedPictureHashSEI == 2)
55
+    {
56
+        m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CRC;
57
+        for (int i = 0; i < planes; i++)
58
+            crcFinish(m_seiReconPictureDigest.m_crc[i], m_seiReconPictureDigest.m_digest[i]);
59
+        payloadSize = 1 + 2 * planes;
60
+    }
61
+    else if (m_param->decodedPictureHashSEI == 3)
62
+    {
63
+        m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CHECKSUM;
64
+        for (int i = 0; i < planes; i++)
65
+            checksumFinish(m_seiReconPictureDigest.m_checksum[i], m_seiReconPictureDigest.m_digest[i]);
66
+        payloadSize = 1 + 4 * planes;
67
+    }
68
+
69
+    m_seiReconPictureDigest.setSize(payloadSize);
70
+    m_seiReconPictureDigest.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_SUFFIX_SEI, m_nalList, false);
71
+}
72
+
73
 void FrameEncoder::compressFrame()
74
 {
75
     ProfileScopeEvent(frameThread);
76
@@ -393,6 +452,7 @@
77
      * not repeating headers (since AUD is supposed to be the first NAL in the access
78
      * unit) */
79
     Slice* slice = m_frame->m_encData->m_slice;
80
+
81
     if (m_param->bEnableAccessUnitDelimiters && (m_frame->m_poc || m_param->bRepeatHeaders))
82
     {
83
         m_bs.resetBits();
84
@@ -400,6 +460,8 @@
85
         m_entropyCoder.codeAUD(*slice);
86
         m_bs.writeByteAlignment();
87
         m_nalList.serialize(NAL_UNIT_ACCESS_UNIT_DELIMITER, m_bs);
88
+        if (m_param->bSingleSeiNal)
89
+            m_bs.resetBits();
90
     }
91
     if (m_frame->m_lowres.bKeyframe && m_param->bRepeatHeaders)
92
     {
93
@@ -459,9 +521,7 @@
94
                 wa.waitForExit();
95
             else
96
                 weightAnalyse(*slice, *m_frame, *m_param);
97
-
98
         }
99
-
100
     }
101
     else
102
         slice->disableWeights();
103
@@ -475,7 +535,7 @@
104
         for (int ref = 0; ref < slice->m_numRefIdx[l]; ref++)
105
         {
106
             WeightParam *w = NULL;
107
-            if ((bUseWeightP || bUseWeightB) && slice->m_weightPredTable[l][ref][0].bPresentFlag)
108
+            if ((bUseWeightP || bUseWeightB) && slice->m_weightPredTable[l][ref][0].wtPresent)
109
                 w = slice->m_weightPredTable[l][ref];
110
             slice->m_refReconPicList[l][ref] = slice->m_refFrameList[l][ref]->m_reconPic;
111
             m_mref[l][ref].init(slice->m_refReconPicList[l][ref], w, *m_param);
112
@@ -496,41 +556,6 @@
113
 
114
     /* Get the QP for this frame from rate control. This call may block until
115
      * frames ahead of it in encode order have called rateControlEnd() */
116
-    m_rce.encodeOrder = m_frame->m_encodeOrder;
117
-    bool payloadChange = false;
118
-    bool writeSei = true;
119
-    if (m_param->bDhdr10opt)
120
-    {
121
-        for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++)
122
-        {
123
-            x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i];
124
-            if(payload->payloadType == USER_DATA_REGISTERED_ITU_T_T35)
125
-            {
126
-                if (m_top->m_prevTonemapPayload.payload != NULL && payload->payloadSize == m_top->m_prevTonemapPayload.payloadSize)
127
-                {
128
-                    if (memcmp(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize) != 0)
129
-                        payloadChange = true;
130
-                }
131
-                else
132
-                {
133
-                    payloadChange = true;
134
-                    if (m_top->m_prevTonemapPayload.payload != NULL)
135
-                        x265_free(m_top->m_prevTonemapPayload.payload);
136
-                    m_top->m_prevTonemapPayload.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * payload->payloadSize);
137
-                }
138
-
139
-                if (payloadChange)
140
-                {
141
-                    m_top->m_prevTonemapPayload.payloadType = payload->payloadType;
142
-                    m_top->m_prevTonemapPayload.payloadSize = payload->payloadSize;
143
-                    memcpy(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize);
144
-                }
145
-
146
-                bool isIDR = m_frame->m_lowres.sliceType == X265_TYPE_IDR;
147
-                writeSei = payloadChange || isIDR;
148
-            }
149
-        }
150
-    }
151
     int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top);
152
     m_rce.newQp = qp;
153
 
154
@@ -594,7 +619,6 @@
155
 
156
     /* reset entropy coders and compute slice id */
157
     m_entropyCoder.load(m_initSliceContext);
158
-   
159
     for (uint32_t sliceId = 0; sliceId < m_param->maxSlices; sliceId++)   
160
         for (uint32_t row = m_sliceBaseRow[sliceId]; row < m_sliceBaseRow[sliceId + 1]; row++)
161
             m_rows[row].init(m_initSliceContext, sliceId);   
162
@@ -620,6 +644,7 @@
163
             m_outStreams[i].resetBits();
164
     }
165
 
166
+    m_rce.encodeOrder = m_frame->m_encodeOrder;
167
     int prevBPSEI = m_rce.encodeOrder ? m_top->m_lastBPSEI : 0;
168
 
169
     if (m_frame->m_lowres.bKeyframe)
170
@@ -632,18 +657,22 @@
171
             bpSei->m_auCpbRemovalDelayDelta = 1;
172
             bpSei->m_cpbDelayOffset = 0;
173
             bpSei->m_dpbDelayOffset = 0;
174
-
175
             // hrdFullness() calculates the initial CPB removal delay and offset
176
             m_top->m_rateControl->hrdFullness(bpSei);
177
-
178
-            m_bs.resetBits();
179
-            bpSei->write(m_bs, *slice->m_sps);
180
-            m_bs.writeByteAlignment();
181
-
182
-            m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs);
183
+            bpSei->writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal);
184
 
185
             m_top->m_lastBPSEI = m_rce.encodeOrder;
186
         }
187
+
188
+        if (m_frame->m_lowres.sliceType == X265_TYPE_IDR && m_param->bEmitIDRRecoverySEI)
189
+        {
190
+            /* Recovery Point SEI require the SPS to be "activated" */
191
+            SEIRecoveryPoint sei;
192
+            sei.m_recoveryPocCnt = 0;
193
+            sei.m_exactMatchingFlag = true;
194
+            sei.m_brokenLinkFlag = false;
195
+            sei.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal);
196
+        }
197
     }
198
 
199
     if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode))
200
@@ -660,8 +689,10 @@
201
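writeTrailingSEIMessages() above fixes the suffix-SEI payload size at one method byte plus a per-plane digest: 16 bytes for MD5, 2 for CRC, 4 for checksum. The same arithmetic as a standalone check, with 3 planes for chroma formats and 1 for 4:0:0:

    // decodedPictureHashSEI: 1 = MD5, 2 = CRC, 3 = checksum (as in the hunk above)
    constexpr int hashSeiPayloadSize(int hashType, int planes)
    {
        return 1 + planes * (hashType == 1 ? 16 : hashType == 2 ? 2 : 4);
    }

    static_assert(hashSeiPayloadSize(1, 3) == 49, "MD5 digest, 4:2:0");
    static_assert(hashSeiPayloadSize(2, 3) == 7,  "CRC, 4:2:0");
    static_assert(hashSeiPayloadSize(3, 1) == 5,  "checksum, 4:0:0");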
x265_2.7.tar.gz/source/encoder/frameencoder.h -> x265_2.9.tar.gz/source/encoder/frameencoder.h Changed
42
 
1
@@ -129,6 +129,8 @@
2
     /* blocks until worker thread is done, returns access unit */
3
     Frame *getEncodedPicture(NALList& list);
4
 
5
+    void initDecodedPictureHashSEI(int row, int cuAddr, int height);
6
+
7
     Event                    m_enable;
8
     Event                    m_done;
9
     Event                    m_completionEvent;
10
@@ -161,9 +163,6 @@
11
     double                   m_ssim;
12
     uint64_t                 m_accessUnitBits;
13
     uint32_t                 m_ssimCnt;
14
-    MD5Context               m_state[3];
15
-    uint32_t                 m_crc[3];
16
-    uint32_t                 m_checksum[3];
17
 
18
     volatile int             m_activeWorkerCount;        // count of workers currently encoding or filtering CTUs
19
     volatile int             m_totalActiveWorkerCount;   // sum of m_activeWorkerCount sampled at end of each CTU
20
@@ -230,6 +229,8 @@
21
     void threadMain();
22
     int  collectCTUStatistics(const CUData& ctu, FrameStats* frameLog);
23
     void noiseReductionUpdate();
24
+    void writeTrailingSEIMessages();
25
+    bool writeToneMapInfo(x265_sei_payload *payload);
26
 
27
     /* Called by WaveFront::findJob() */
28
     virtual void processRow(int row, int threadId);
29
@@ -239,6 +240,12 @@
30
     void enqueueRowFilter(int row)  { WaveFront::enqueueRow(row * 2 + 1); }
31
     void enableRowEncoder(int row)  { WaveFront::enableRow(row * 2 + 0); }
32
     void enableRowFilter(int row)   { WaveFront::enableRow(row * 2 + 1); }
33
+#if ENABLE_LIBVMAF
34
+    void vmafFrameLevelScore();
35
+#endif
36
+    void collectDynDataFrame();
37
+    void computeAvgTrainingData();
38
+    void collectDynDataRow(CUData& ctu, FrameStats* rowStats);    
39
 };
40
 }
41
 
42
x265_2.7.tar.gz/source/encoder/framefilter.cpp -> x265_2.9.tar.gz/source/encoder/framefilter.cpp Changed
82
 
1
@@ -712,78 +712,8 @@
2
 
3
     if (m_param->maxSlices == 1)
4
     {
5
-        if (m_param->decodedPictureHashSEI == 1)
6
-        {
7
-            uint32_t height = m_parallelFilter[row].getCUHeight();
8
-            uint32_t width = reconPic->m_picWidth;
9
-            intptr_t stride = reconPic->m_stride;
10
-
11
-            if (!row)
12
-                MD5Init(&m_frameEncoder->m_state[0]);
13
-
14
-            updateMD5Plane(m_frameEncoder->m_state[0], reconPic->getLumaAddr(cuAddr), width, height, stride);
15
-            if (m_param->internalCsp != X265_CSP_I400)
16
-            {
17
-                if (!row)
18
-                {
19
-                    MD5Init(&m_frameEncoder->m_state[1]);
20
-                    MD5Init(&m_frameEncoder->m_state[2]);
21
-                }
22
-
23
-                width >>= m_hChromaShift;
24
-                height >>= m_vChromaShift;
25
-                stride = reconPic->m_strideC;
26
-
27
-                updateMD5Plane(m_frameEncoder->m_state[1], reconPic->getCbAddr(cuAddr), width, height, stride);
28
-                updateMD5Plane(m_frameEncoder->m_state[2], reconPic->getCrAddr(cuAddr), width, height, stride);
29
-            }
30
-        }
31
-        else if (m_param->decodedPictureHashSEI == 2)
32
-        {
33
-            uint32_t height = m_parallelFilter[row].getCUHeight();
34
-            uint32_t width = reconPic->m_picWidth;
35
-            intptr_t stride = reconPic->m_stride;
36
-
37
-            if (!row)
38
-                m_frameEncoder->m_crc[0] = 0xffff;
39
-
40
-            updateCRC(reconPic->getLumaAddr(cuAddr), m_frameEncoder->m_crc[0], height, width, stride);
41
-            if (m_param->internalCsp != X265_CSP_I400)
42
-            {
43
-                width >>= m_hChromaShift;
44
-                height >>= m_vChromaShift;
45
-                stride = reconPic->m_strideC;
46
-                m_frameEncoder->m_crc[1] = m_frameEncoder->m_crc[2] = 0xffff;
47
-
48
-                updateCRC(reconPic->getCbAddr(cuAddr), m_frameEncoder->m_crc[1], height, width, stride);
49
-                updateCRC(reconPic->getCrAddr(cuAddr), m_frameEncoder->m_crc[2], height, width, stride);
50
-            }
51
-        }
52
-        else if (m_param->decodedPictureHashSEI == 3)
53
-        {
54
-            uint32_t width = reconPic->m_picWidth;
55
-            uint32_t height = m_parallelFilter[row].getCUHeight();
56
-            intptr_t stride = reconPic->m_stride;
57
-            uint32_t cuHeight = m_param->maxCUSize;
58
-
59
-            if (!row)
60
-                m_frameEncoder->m_checksum[0] = 0;
61
-
62
-            updateChecksum(reconPic->m_picOrg[0], m_frameEncoder->m_checksum[0], height, width, stride, row, cuHeight);
63
-            if (m_param->internalCsp != X265_CSP_I400)
64
-            {
65
-                width >>= m_hChromaShift;
66
-                height >>= m_vChromaShift;
67
-                stride = reconPic->m_strideC;
68
-                cuHeight >>= m_vChromaShift;
69
-
70
-                if (!row)
71
-                    m_frameEncoder->m_checksum[1] = m_frameEncoder->m_checksum[2] = 0;
72
-
73
-                updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, cuHeight);
74
-                updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight);
75
-            }
76
-        }
77
+        uint32_t height = m_parallelFilter[row].getCUHeight();
78
+        m_frameEncoder->initDecodedPictureHashSEI(row, cuAddr, height);
79
     } // end of (m_param->maxSlices == 1)
80
 
81
     if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows)
82
x265_2.7.tar.gz/source/encoder/ratecontrol.cpp -> x265_2.9.tar.gz/source/encoder/ratecontrol.cpp Changed
114
 
1
@@ -1282,6 +1282,12 @@
2
         m_predictedBits = m_totalBits;
3
         updateVbvPlan(enc);
4
         rce->bufferFill = m_bufferFill;
5
+        rce->vbvEndAdj = false;
6
+        if (m_param->vbvBufferEnd && rce->encodeOrder >= m_param->vbvEndFrameAdjust * m_param->totalFrames)
7
+        {
8
+            rce->vbvEndAdj = true;
9
+            rce->targetFill = 0;
10
+        }
11
 
12
         int mincr = enc->m_vps.ptl.minCrForLevel;
13
         /* Profiles above Main10 don't require maxAU size check, so just set the maximum to a large value. */
14
@@ -1290,7 +1296,7 @@
15
         else
16
         {
17
             /* The spec has a special case for the first frame. */
18
-            if (rce->encodeOrder == 0)
19
+            if (curFrame->m_lowres.bKeyframe)
20
             {
21
                 /* 1.5 * (Max( PicSizeInSamplesY, fR * MaxLumaSr) + MaxLumaSr * (AuCpbRemovalTime[ 0 ] -AuNominalRemovalTime[ 0 ])) ? MinCr */
22
                 double fr = 1. / 300;
23
@@ -1302,6 +1308,7 @@
24
                 /* 1.5 * MaxLumaSr * (AuCpbRemovalTime[ n ] - AuCpbRemovalTime[ n - 1 ]) / MinCr */
25
                 rce->frameSizeMaximum = 8 * 1.5 * enc->m_vps.ptl.maxLumaSrForLevel * m_frameDuration / mincr;
26
             }
27
+            rce->frameSizeMaximum *= m_param->maxAUSizeFactor;
28
         }
29
     }
30
     if (!m_isAbr && m_2pass && m_param->rc.rateControlMode == X265_RC_CRF)
31
@@ -2172,12 +2179,12 @@
32
                     curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd);
33
                     bufferFillCur -= curBits;
34
                 }
35
-                if (m_param->vbvBufferEnd && rce->encodeOrder >= m_param->vbvEndFrameAdjust * m_param->totalFrames)
36
+                if (rce->vbvEndAdj)
37
                 {
38
                     bool loopBreak = false;
39
                     double bufferDiff = m_param->vbvBufferEnd - (m_bufferFill / m_bufferSize);
40
-                    targetFill = m_bufferFill + m_bufferSize * (bufferDiff / (m_param->totalFrames - rce->encodeOrder));
41
-                    if (bufferFillCur < targetFill)
42
+                    rce->targetFill = m_bufferFill + m_bufferSize * (bufferDiff / (m_param->totalFrames - rce->encodeOrder));
43
+                    if (bufferFillCur < rce->targetFill)
44
                     {
45
                         q *= 1.01;
46
                         loopTerminate |= 1;
47
@@ -2420,6 +2427,7 @@
48
         double rcTol = bufferLeftPlanned / m_param->frameNumThreads * m_rateTolerance;
49
         int32_t encodedBitsSoFar = 0;
50
         double accFrameBits = predictRowsSizeSum(curFrame, rce, qpVbv, encodedBitsSoFar);
51
+        double vbvEndBias = 0.95;
52
 
53
         /* * Don't increase the row QPs until a sufficent amount of the bits of
54
          * the frame have been processed, in case a flat area at the top of the
55
@@ -2441,7 +2449,8 @@
56
         while (qpVbv < qpMax
57
                && (((accFrameBits > rce->frameSizePlanned + rcTol) ||
58
                    (rce->bufferFill - accFrameBits < bufferLeftPlanned * 0.5) ||
59
-                   (accFrameBits > rce->frameSizePlanned && qpVbv < rce->qpNoVbv))
60
+                   (accFrameBits > rce->frameSizePlanned && qpVbv < rce->qpNoVbv) ||
61
+                   (rce->vbvEndAdj && ((rce->bufferFill - accFrameBits) < (rce->targetFill * vbvEndBias))))
62
                    && (!m_param->rc.bStrictCbr ? 1 : abrOvershoot > 0.1)))
63
         {
64
             qpVbv += stepSize;
65
@@ -2452,7 +2461,8 @@
66
         while (qpVbv > qpMin
67
                && (qpVbv > curEncData.m_rowStat[0].rowQp || m_singleFrameVbv)
68
                && (((accFrameBits < rce->frameSizePlanned * 0.8f && qpVbv <= prevRowQp)
69
-                   || accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1)
70
+                   || accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1
71
+                   || (rce->vbvEndAdj && ((rce->bufferFill - accFrameBits) > (rce->targetFill * vbvEndBias))))
72
                    && (!m_param->rc.bStrictCbr ? 1 : abrOvershoot < 0)))
73
         {
74
             qpVbv -= stepSize;
75
@@ -2630,8 +2640,9 @@
76
     FrameData& curEncData = *curFrame->m_encData;
77
     int64_t actualBits = bits;
78
     Slice *slice = curEncData.m_slice;
79
+    bool bEnableDistOffset = m_param->analysisMultiPassDistortion && m_param->rc.bStatRead;
80
 
81
-    if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion)
82
+    if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion || bEnableDistOffset)
83
     {
84
         if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF))
85
         {
86
@@ -2645,10 +2656,10 @@
87
             rce->qpaRc = curEncData.m_avgQpRc;
88
         }
89
 
90
-        if (m_param->rc.aqMode || m_param->bAQMotion)
91
+        if (m_param->rc.aqMode || m_param->bAQMotion || bEnableDistOffset)
92
         {
93
             double avgQpAq = 0;
94
-            /* determine actual avg encoded QP, after AQ/cutree adjustments */
95
+            /* determine actual avg encoded QP, after AQ/cutree/distortion adjustments */
96
             for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
97
                 avgQpAq += curEncData.m_rowStat[i].sumQpAq;
98
 
99
@@ -2792,12 +2803,8 @@
100
 /* called to write out the rate control frame stats info in multipass encodes */
101
 int RateControl::writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce)
102
 {
103
-    FrameData& curEncData = *curFrame->m_encData;
104
-    int ncu;
105
-    if (m_param->rc.qgSize == 8)
106
-        ncu = m_ncu * 4;
107
-    else
108
-        ncu = m_ncu;
109
+    FrameData& curEncData = *curFrame->m_encData;    
110
+    int ncu = (m_param->rc.qgSize == 8) ? m_ncu * 4 : m_ncu;
111
     char cType = rce->sliceType == I_SLICE ? (curFrame->m_lowres.sliceType == X265_TYPE_IDR ? 'I' : 'i')
112
         : rce->sliceType == P_SLICE ? 'P'
113
         : IS_REFERENCED(curFrame) ? 'B' : 'b';
114
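To make the end-of-encode VBV adjustment above concrete: once encodeOrder passes vbvEndFrameAdjust * totalFrames, each frame receives a targetFill that walks the buffer linearly toward the requested final fullness, and the row-QP loops then steer (bufferFill - accFrameBits) toward 95% of that value (vbvEndBias). A small standalone calculation with made-up numbers; the variable names mirror the rate-control hunk, the values are illustrative only:

    #include <cstdio>

    int main()
    {
        double bufferSize   = 10000000.0;  // VBV buffer size, bits
        double bufferFill   = 4000000.0;   // current fullness: 40%
        double vbvBufferEnd = 0.9;         // requested fullness at the end of the encode
        int    totalFrames  = 1000;
        int    encodeOrder  = 900;         // already past vbvEndFrameAdjust * totalFrames

        double bufferDiff = vbvBufferEnd - bufferFill / bufferSize;            // 0.5
        double targetFill = bufferFill
            + bufferSize * (bufferDiff / (totalFrames - encodeOrder));         // 4,050,000 bits
        printf("targetFill for this frame: %.0f bits\n", targetFill);
        return 0;
    }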
x265_2.7.tar.gz/source/encoder/ratecontrol.h -> x265_2.9.tar.gz/source/encoder/ratecontrol.h Changed
10
 
1
@@ -82,6 +82,8 @@
2
     double  rowCplxrSum;
3
     double  qpNoVbv;
4
     double  bufferFill;
5
+    double  targetFill;
6
+    bool    vbvEndAdj;
7
     double  frameDuration;
8
     double  clippedDuration;
9
     double  frameSizeEstimated; /* hold frameSize, updated from cu level vbv rc */
10
x265_2.7.tar.gz/source/encoder/reference.cpp -> x265_2.9.tar.gz/source/encoder/reference.cpp Changed
24
 
1
@@ -89,7 +89,7 @@
2
                 cuHeight >>= reconPic->m_vChromaShift;
3
             }
4
 
5
-            if (wp[c].bPresentFlag)
6
+            if (wp[c].wtPresent)
7
             {
8
                 if (!weightBuffer[c])
9
                 {
10
@@ -155,12 +155,10 @@
11
 
12
         const pixel* src = reconPic->m_picOrg[c] + numWeightedRows * cuHeight * stride;
13
         pixel* dst = fpelPlane[c] + numWeightedRows * cuHeight * stride;
14
-
15
         // Computing weighted CU rows
16
         int correction = IF_INTERNAL_PREC - X265_DEPTH; // intermediate interpolation depth
17
-        int padwidth = (width + 15) & ~15;              // weightp assembly needs even 16 byte widths
18
+        int padwidth = (width + 31) & ~31;              // weightp assembly needs even 32 byte widths
19
         primitives.weight_pp(src, dst, stride, padwidth, height, w[c].weight, w[c].round << correction, w[c].shift + correction, w[c].offset);
20
-
21
         // Extending Left & Right
22
         primitives.extendRowBorder(dst, stride, width, height, marginX);
23
 
24
x265_2.7.tar.gz/source/encoder/search.cpp -> x265_2.9.tar.gz/source/encoder/search.cpp Changed
201
 
1
@@ -82,7 +82,7 @@
2
     m_me.init(param.internalCsp);
3
 
4
     bool ok = m_quant.init(param.psyRdoq, scalingList, m_entropyCoder);
5
-    if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize)
6
+    if (m_param->noiseReductionIntra || m_param->noiseReductionInter )
7
         ok &= m_quant.allocNoiseReduction(param);
8
 
9
     ok &= Predict::allocBuffers(param.internalCsp); /* sets m_hChromaShift & m_vChromaShift */
10
@@ -354,14 +354,17 @@
11
         // store original entropy coding status
12
         if (bEnableRDOQ)
13
             m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true);
14
-
15
-        primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride);
16
+        primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
17
 
18
         uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false);
19
         if (numSig)
20
         {
21
             m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig);
22
-            primitives.cu[sizeIdx].add_ps(reconQt, reconQtStride, pred, residual, stride, stride);
23
+            bool reconQtYuvAlign = m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
24
+            bool predAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
25
+            bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
26
+            bool bufferAlignCheck = (reconQtStride % 64 == 0) && (stride % 64 == 0) && reconQtYuvAlign && predAlign && residualAlign;
27
+            primitives.cu[sizeIdx].add_ps[bufferAlignCheck](reconQt, reconQtStride, pred, residual, stride, stride);
28
         }
29
         else
30
             // no coded residual, recon = pred
31
@@ -559,15 +562,19 @@
32
 
33
         coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffY);
34
         pixel*   tmpRecon = (useTSkip ? m_tsRecon : reconQt);
35
+        bool tmpReconAlign = (useTSkip ? 1 : (m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, m_rqt[qtLayer].reconQtYuv.m_size) % 64 == 0));
36
         uint32_t tmpReconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride);
37
 
38
-        primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride);
39
+        primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
40
 
41
         uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSize, TEXT_LUMA, absPartIdx, useTSkip);
42
         if (numSig)
43
         {
44
             m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSize, TEXT_LUMA, true, useTSkip, numSig);
45
-            primitives.cu[sizeIdx].add_ps(tmpRecon, tmpReconStride, pred, residual, stride, stride);
46
+            bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, m_rqt[cuGeom.depth].tmpResiYuv.m_size) % 64 == 0;
47
+            bool predAlign = predYuv->getAddrOffset(absPartIdx, predYuv->m_size) % 64 == 0;
48
+            bool bufferAlignCheck = (stride % 64 == 0) && (tmpReconStride % 64 == 0) && tmpReconAlign && residualAlign && predAlign;
49
+            primitives.cu[sizeIdx].add_ps[bufferAlignCheck](tmpRecon, tmpReconStride, pred, residual, stride, stride);
50
         }
51
         else if (useTSkip)
52
         {
53
@@ -714,7 +721,7 @@
54
         coeff_t* coeffY       = cu.m_trCoeff[0] + coeffOffsetY;
55
 
56
         uint32_t sizeIdx   = log2TrSize - 2;
57
-        primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride);
58
+        primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
59
 
60
         PicYuv*  reconPic = m_frame->m_reconPic;
61
         pixel*   picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
62
@@ -724,7 +731,11 @@
63
         if (numSig)
64
         {
65
             m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig);
66
-            primitives.cu[sizeIdx].add_ps(picReconY, picStride, pred, residual, stride, stride);
67
+            bool picReconYAlign = (reconPic->m_cuOffsetY[cu.m_cuAddr] + reconPic->m_buOffsetY[cuGeom.absPartIdx + absPartIdx]) % 64 == 0;
68
+            bool predAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
69
+            bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, m_rqt[cuGeom.depth].tmpResiYuv.m_size)% 64 == 0;
70
+            bool bufferAlignCheck = (picStride % 64 == 0) && (stride % 64 == 0) && picReconYAlign && predAlign && residualAlign;
71
+            primitives.cu[sizeIdx].add_ps[bufferAlignCheck](picReconY, picStride, pred, residual, stride, stride);
72
             cu.setCbfSubParts(1 << tuDepth, TEXT_LUMA, absPartIdx, fullDepth);
73
         }
74
         else
75
@@ -893,12 +904,17 @@
76
             predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC);
77
             cu.setTransformSkipPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep);
78
 
79
-            primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride);
80
+            primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
81
+
82
             uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false);
83
             if (numSig)
84
             {
85
                 m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig);
86
-                primitives.cu[sizeIdxC].add_ps(reconQt, reconQtStride, pred, residual, stride, stride);
87
+                bool reconQtAlign = m_rqt[qtLayer].reconQtYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
88
+                bool predAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
89
+                bool residualAlign = resiYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
90
+                bool bufferAlignCheck = reconQtAlign && predAlign && residualAlign && (reconQtStride % 64 == 0) && (stride % 64 == 0);
91
+                primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](reconQt, reconQtStride, pred, residual, stride, stride);
92
                 cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep);
93
             }
94
             else
95
@@ -992,13 +1008,17 @@
96
                 pixel*   recon = (useTSkip ? m_tsRecon : reconQt);
97
                 uint32_t reconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride);
98
 
99
-                primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride);
100
+                primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
101
 
102
                 uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSizeC, ttype, absPartIdxC, useTSkip);
103
                 if (numSig)
104
                 {
105
                     m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSizeC, ttype, true, useTSkip, numSig);
106
-                    primitives.cu[sizeIdxC].add_ps(recon, reconStride, pred, residual, stride, stride);
107
+                    bool reconAlign = (useTSkip ? 1 : m_rqt[qtLayer].reconQtYuv.getChromaAddrOffset(absPartIdxC)) % 64 == 0;
108
+                    bool predYuvAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
109
+                    bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
110
+                    bool bufferAlignCheck = reconAlign && predYuvAlign && residualAlign && (reconStride % 64 == 0) && (stride % 64 == 0);
111
+                    primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](recon, reconStride, pred, residual, stride, stride);
112
                     cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep);
113
                 }
114
                 else if (useTSkip)
115
@@ -1183,12 +1203,17 @@
116
 
117
             X265_CHECK(!cu.m_transformSkip[ttype][0], "transform skip not supported at low RD levels\n");
118
 
119
-            primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride);
120
+            primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
121
+
122
             uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false);
123
             if (numSig)
124
             {
125
                 m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig);
126
-                primitives.cu[sizeIdxC].add_ps(picReconC, picStride, pred, residual, stride, stride);
127
+                bool picReconCAlign = (reconPic->m_cuOffsetC[cu.m_cuAddr] + reconPic->m_buOffsetC[cuGeom.absPartIdx + absPartIdxC]) % 64 == 0;
128
+                bool predAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0;
129
+                bool residualAlign = resiYuv.getChromaAddrOffset(absPartIdxC)% 64 == 0;
130
+                bool bufferAlignCheck = picReconCAlign && predAlign && residualAlign && (picStride % 64 == 0) && (stride % 64 == 0);
131
+                primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](picReconC, picStride, pred, residual, stride, stride);
132
                 cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep);
133
             }
134
             else
135
@@ -1304,7 +1329,7 @@
136
 
137
         pixel nScale[129];
138
         intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
139
-        primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);
140
+        primitives.scale1D_128to64[NONALIGNED](nScale + 1, intraNeighbourBuf[0] + 1);
141
 
142
         // we do not estimate filtering for downscaled samples
143
         memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));   // Top & Left pixels
144
@@ -2107,18 +2132,24 @@
145
         bestME[list].mvCost  = mvCost;
146
     }
147
 }
148
-
149
-void Search::searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv)
150
+void Search::searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv, MV mvp, int numMvc, MV* mvc)
151
 {
152
     CUData& cu = interMode.cu;
153
     const Slice *slice = m_slice;
154
-    MV mv = cu.m_mv[list][pu.puAbsPartIdx];
155
+    MV mv;
156
+    if (m_param->interRefine == 1)
157
+        mv = mvp;
158
+    else
159
+        mv = cu.m_mv[list][pu.puAbsPartIdx];
160
     cu.clipMv(mv);
161
     MV mvmin, mvmax;
162
     setSearchRange(cu, mv, m_param->searchRange, mvmin, mvmax);
163
-    m_me.refineMV(&slice->m_mref[list][ref], mvmin, mvmax, mv, outmv);
164
+    if (m_param->interRefine == 1)
165
+        m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mv, numMvc, mvc, m_param->searchRange, outmv, m_param->maxSlices,
166
+        m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
167
+    else
168
+        m_me.refineMV(&slice->m_mref[list][ref], mvmin, mvmax, mv, outmv);
169
 }
170
-
171
 /* find the best inter prediction for each PU of specified mode */
172
 void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t refMasks[2])
173
 {
174
@@ -2138,20 +2169,29 @@
175
     int      totalmebits = 0;
176
     MV       mvzero(0, 0);
177
     Yuv&     tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
178
-
179
     MergeData merge;
180
     memset(&merge, 0, sizeof(merge));
181
-
182
+    bool useAsMVP = false;
183
     for (int puIdx = 0; puIdx < numPart; puIdx++)
184
     {
185
         MotionData* bestME = interMode.bestME[puIdx];
186
         PredictionUnit pu(cu, cuGeom, puIdx);
187
-
188
         m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine, bChromaMC);
189
-
190
+        useAsMVP = false;
191
+        x265_analysis_inter_data* interDataCTU = NULL;
192
+        int cuIdx;
193
+        cuIdx = (interMode.cu.m_cuAddr * m_param->num4x4Partitions) + cuGeom.absPartIdx;
194
+        if (m_param->analysisReuseLevel == 10 && m_param->interRefine > 1)
195
+        {
196
+            interDataCTU = m_frame->m_analysisData.interData;
197
+            if ((cu.m_predMode[pu.puAbsPartIdx] == interDataCTU->modes[cuIdx + pu.puAbsPartIdx])
198
+                && (cu.m_partSize[pu.puAbsPartIdx] == interDataCTU->partSize[cuIdx + pu.puAbsPartIdx])
199
+                && !(interDataCTU->mergeFlag[cuIdx + puIdx])
200
+                && (cu.m_cuDepth[0] == interDataCTU->depth[cuIdx]))
201
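
The search.cpp hunks above all follow one pattern: primitives that were plain function pointers (add_ps, calcresidual, convert_p2s, scale1D_128to64) become two-entry tables indexed by a run-time alignment check, so a kernel that assumes 64-byte-aligned buffers and strides (the AVX-512 path) is only selected when every operand qualifies. A minimal sketch of that dispatch, using illustrative names and a simplified signature rather than the real x265 types:

    #include <cstdint>
    #include <cstddef>

    enum { NONALIGNED = 0, ALIGNED = 1 };

    // Hypothetical primitive signature; the real pixel_add_ps_t in x265 differs.
    typedef void (*add_ps_t)(int16_t* dst, intptr_t dstStride,
                             const int16_t* pred, const int16_t* resi,
                             intptr_t predStride, intptr_t resiStride);

    struct CuPrimitives
    {
        add_ps_t add_ps[2];   // [NONALIGNED] = generic kernel, [ALIGNED] = 64-byte kernel
    };

    // Only take the aligned entry when every pointer and stride is a multiple
    // of 64, mirroring the bufferAlignCheck expressions built in the diff.
    inline bool aligned64(const void* a, const void* b, const void* c,
                          intptr_t s1, intptr_t s2)
    {
        return ((uintptr_t)a % 64 == 0) && ((uintptr_t)b % 64 == 0) &&
               ((uintptr_t)c % 64 == 0) && (s1 % 64 == 0) && (s2 % 64 == 0);
    }
    // usage: prim.add_ps[aligned64(dst, pred, resi, dstStride, stride)](...);
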
x265_2.7.tar.gz/source/encoder/search.h -> x265_2.9.tar.gz/source/encoder/search.h Changed
11
 
1
@@ -310,8 +310,7 @@
2
 
3
     // estimation inter prediction (non-skip)
4
     void     predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t masks[2]);
5
-
6
-    void     searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv);
7
+    void     searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv, MV mvp, int numMvc, MV* mvc);
8
     // encode residual and compute rd-cost for inter mode
9
     void     encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom);
10
     void     encodeResAndCalcRdSkipCU(Mode& interMode);
11
x265_2.7.tar.gz/source/encoder/sei.cpp -> x265_2.9.tar.gz/source/encoder/sei.cpp Changed
141
 
1
@@ -35,45 +35,40 @@
2
 };
3
 
4
 /* marshal a single SEI message sei, storing the marshalled representation
5
- * in bitstream bs */
6
-void SEI::write(Bitstream& bs, const SPS& sps)
7
+* in bitstream bs */
8
+void SEI::writeSEImessages(Bitstream& bs, const SPS& sps, NalUnitType nalUnitType, NALList& list, int isNested)
9
 {
10
-    uint32_t type = m_payloadType;
11
+    if (!isNested)
12
+        bs.resetBits();
13
+
14
+    BitCounter counter;
15
+    m_bitIf = &counter;
16
+    writeSEI(sps);
17
+    /* count the size of the payload and return the size in bits */
18
+    X265_CHECK(0 == (counter.getNumberOfWrittenBits() & 7), "payload unaligned\n");
19
+    uint32_t payloadData = counter.getNumberOfWrittenBits() >> 3;
20
+
21
+    // set bitstream
22
     m_bitIf = &bs;
23
-    BitCounter count;
24
-    bool hrdTypes = (m_payloadType == ACTIVE_PARAMETER_SETS || m_payloadType == PICTURE_TIMING || m_payloadType == BUFFERING_PERIOD);
25
-    if (hrdTypes)
26
-    {
27
-        m_bitIf = &count;
28
-        /* virtual writeSEI method, write to bit counter to determine size */
29
-        writeSEI(sps);
30
-        m_bitIf = &bs;
31
-        uint32_t payloadType = m_payloadType;
32
-        for (; payloadType >= 0xff; payloadType -= 0xff)
33
-            WRITE_CODE(0xff, 8, "payload_type");
34
-    }
35
-    WRITE_CODE(type, 8, "payload_type");
36
-    uint32_t payloadSize;
37
-    if (hrdTypes || m_payloadType == USER_DATA_UNREGISTERED || m_payloadType == USER_DATA_REGISTERED_ITU_T_T35)
38
+
39
+    uint32_t payloadType = m_payloadType;
40
+    for (; payloadType >= 0xff; payloadType -= 0xff)
41
+        WRITE_CODE(0xff, 8, "payload_type");
42
+    WRITE_CODE(payloadType, 8, "payload_type");
43
+
44
+    uint32_t payloadSize = payloadData;
45
+    for (; payloadSize >= 0xff; payloadSize -= 0xff)
46
+        WRITE_CODE(0xff, 8, "payload_size");
47
+    WRITE_CODE(payloadSize, 8, "payload_size");
48
+
49
+    // virtual writeSEI method, write to bs 
50
+    writeSEI(sps);
51
+
52
+    if (!isNested)
53
     {
54
-        if (hrdTypes)
55
-        {
56
-            X265_CHECK(0 == (count.getNumberOfWrittenBits() & 7), "payload unaligned\n");
57
-            payloadSize = count.getNumberOfWrittenBits() >> 3;
58
-        }
59
-        else if (m_payloadType == USER_DATA_UNREGISTERED)
60
-            payloadSize = m_payloadSize + 16;
61
-        else
62
-            payloadSize = m_payloadSize;
63
-
64
-        for (; payloadSize >= 0xff; payloadSize -= 0xff)
65
-            WRITE_CODE(0xff, 8, "payload_size");
66
-        WRITE_CODE(payloadSize, 8, "payload_size");
67
+        bs.writeByteAlignment();
68
+        list.serialize(nalUnitType, bs);
69
     }
70
-    else
71
-        WRITE_CODE(m_payloadSize, 8, "payload_size");
72
-    /* virtual writeSEI method, write to bs */
73
-    writeSEI(sps);
74
 }
75
 
76
 void SEI::writeByteAlign()
77
@@ -93,3 +88,63 @@
78
 {
79
     m_payloadSize = size;
80
 }
81
+
82
+/* charSet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" */
83
+
84
+char* SEI::base64Decode(char encodedString[], int base64EncodeLength)
85
+{
86
+    char* decodedString;
87
+    decodedString = (char*)malloc(sizeof(char) * ((base64EncodeLength / 4) * 3));
88
+    int i, j, k = 0;
89
+    // stores the bitstream
90
+    int bitstream = 0;
91
+    // countBits stores current number of bits in bitstream
92
+    int countBits = 0;
93
+    // Take 4 characters of encodedString at a time; find the position of each encoded character in charSet and store it in bitstream
94
+    for (i = 0; i < base64EncodeLength; i += 4)
95
+    {
96
+        bitstream = 0, countBits = 0;
97
+        for (j = 0; j < 4; j++)
98
+        {
99
+            // make space for 6 bits
100
+            if (encodedString[i + j] != '=')
101
+            {
102
+                bitstream = bitstream << 6;
103
+                countBits += 6;
104
+            }
105
+            // Find the position of the encoded character in charSet and OR its 6-bit value into bitstream
106
+
107
+            if (encodedString[i + j] >= 'A' && encodedString[i + j] <= 'Z')
108
+                bitstream = bitstream | (encodedString[i + j] - 'A');
109
+
110
+            else if (encodedString[i + j] >= 'a' && encodedString[i + j] <= 'z')
111
+                bitstream = bitstream | (encodedString[i + j] - 'a' + 26);
112
+            
113
+            else if (encodedString[i + j] >= '0' && encodedString[i + j] <= '9')
114
+                bitstream = bitstream | (encodedString[i + j] - '0' + 52);
115
+            
116
+            // '+' occurs in 62nd position in charSet
117
+            else if (encodedString[i + j] == '+')
118
+                bitstream = bitstream | 62;
119
+            
120
+            // '/' occurs in 63rd position in charSet
121
+            else if (encodedString[i + j] == '/')
122
+                bitstream = bitstream | 63;
123
+            
124
+            // '=' padding: drop the bits that were appended during encoding
125
+            else
126
+            {
127
+                bitstream = bitstream >> 2;
128
+                countBits -= 2;
129
+            }
130
+        }
131
+    
132
+        while (countBits != 0)
133
+        {
134
+            countBits -= 8;
135
+            decodedString[k++] = (bitstream >> countBits) & 255;
136
+        }
137
+    }
138
+    return decodedString;
139
+}
140
+
141
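
In the rewritten writeSEImessages() above, the payload is first measured with a BitCounter, then payload_type and payload_size are emitted with the HEVC byte escape: one 0xff byte for each full 255 contained in the value, followed by the remainder as the final byte. A stand-alone sketch of that escape (the function name and byte vector are illustrative stand-ins for the WRITE_CODE(x, 8, ...) calls in the diff):

    #include <cstdint>
    #include <vector>

    // Emit an SEI payload_type or payload_size value: a 0xff byte for each
    // full 255 contained in the value, then the remainder as the last byte.
    static void writeSeiFFEscaped(std::vector<uint8_t>& out, uint32_t value)
    {
        for (; value >= 0xff; value -= 0xff)
            out.push_back(0xff);           // ff_byte
        out.push_back((uint8_t)value);     // final byte, always < 0xff
    }
    // e.g. 300 is emitted as the two bytes 0xff 0x2d (255 + 45).
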
x265_2.7.tar.gz/source/encoder/sei.h -> x265_2.9.tar.gz/source/encoder/sei.h Changed
135
 
1
@@ -27,6 +27,8 @@
2
 #include "common.h"
3
 #include "bitstream.h"
4
 #include "slice.h"
5
+#include "nal.h"
6
+#include "md5.h"
7
 
8
 namespace X265_NS {
9
 // private namespace
10
@@ -34,11 +36,11 @@
11
 class SEI : public SyntaxElementWriter
12
 {
13
 public:
14
-    /* SEI users call write() to marshal an SEI to a bitstream.
15
-     * The write() method calls writeSEI() which encodes the header */
16
-    void write(Bitstream& bs, const SPS& sps);
17
-
18
+    /* SEI users call writeSEImessages() to marshal an SEI to a bitstream.
19
+    * The writeSEImessages() method calls writeSEI() which encodes the header */
20
+    void writeSEImessages(Bitstream& bs, const SPS& sps, NalUnitType nalUnitType, NALList& list, int isNested);
21
     void setSize(uint32_t size);
22
+    static char* base64Decode(char encodedString[], int base64EncodeLength);
23
     virtual ~SEI() {}
24
 protected:
25
     SEIPayloadType  m_payloadType;
26
@@ -47,6 +49,32 @@
27
     void writeByteAlign();
28
 };
29
 
30
+//seongnam.oh@samsung.com :: for the Creative Intent Meta Data Encoding
31
+class SEIuserDataRegistered : public SEI
32
+{
33
+public:
34
+    SEIuserDataRegistered()
35
+    {
36
+        m_payloadType = USER_DATA_REGISTERED_ITU_T_T35;
37
+        m_payloadSize = 0;
38
+    }
39
+
40
+    uint8_t *m_userData;
41
+
42
+    // daniel.vt@samsung.com :: for the Creative Intent Meta Data Encoding ( seongnam.oh@samsung.com )
43
+    void writeSEI(const SPS&)
44
+    {
45
+        if (!m_userData)
46
+            return;
47
+
48
+        uint32_t i = 0;
49
+        for (; i < m_payloadSize; ++i)
50
+            WRITE_CODE(m_userData[i], 8, "creative_intent_metadata");
51
+    }
52
+};
53
+
54
+static const uint32_t ISO_IEC_11578_LEN = 16;
55
+
56
 class SEIuserDataUnregistered : public SEI
57
 {
58
 public:
59
@@ -55,11 +83,11 @@
60
         m_payloadType = USER_DATA_UNREGISTERED;
61
         m_payloadSize = 0;
62
     }
63
-    static const uint8_t m_uuid_iso_iec_11578[16];
64
+    static const uint8_t m_uuid_iso_iec_11578[ISO_IEC_11578_LEN];
65
     uint8_t *m_userData;
66
     void writeSEI(const SPS&)
67
     {
68
-        for (uint32_t i = 0; i < 16; i++)
69
+        for (uint32_t i = 0; i < ISO_IEC_11578_LEN; i++)
70
             WRITE_CODE(m_uuid_iso_iec_11578[i], 8, "sei.uuid_iso_iec_11578[i]");
71
         for (uint32_t i = 0; i < m_payloadSize; i++)
72
             WRITE_CODE(m_userData[i], 8, "user_data");
73
@@ -133,7 +161,12 @@
74
         CRC,
75
         CHECKSUM,
76
     } m_method;
77
-    uint8_t m_digest[3][16];
78
+
79
+    MD5Context m_state[3];
80
+    uint32_t   m_crc[3];
81
+    uint32_t   m_checksum[3];
82
+    uint8_t    m_digest[3][16];
83
+
84
     void writeSEI(const SPS& sps)
85
     {
86
         int planes = (sps.chromaFormatIdc != X265_CSP_I400) ? 3 : 1;
87
@@ -253,6 +286,11 @@
88
 class SEIRecoveryPoint : public SEI
89
 {
90
 public:
91
+    SEIRecoveryPoint()
92
+    {
93
+        m_payloadType = RECOVERY_POINT;
94
+        m_payloadSize = 0;
95
+    }
96
     int  m_recoveryPocCnt;
97
     bool m_exactMatchingFlag;
98
     bool m_brokenLinkFlag;
99
@@ -266,28 +304,22 @@
100
     }
101
 };
102
 
103
-//seongnam.oh@samsung.com :: for the Creative Intent Meta Data Encoding
104
-class SEICreativeIntentMeta : public SEI
105
+class SEIAlternativeTC : public SEI
106
 {
107
 public:
108
-    SEICreativeIntentMeta()
109
+    int m_preferredTransferCharacteristics;
110
+    SEIAlternativeTC()
111
     {
112
-        m_payloadType = USER_DATA_REGISTERED_ITU_T_T35;
113
+        m_payloadType = ALTERNATIVE_TRANSFER_CHARACTERISTICS;
114
         m_payloadSize = 0;
115
+        m_preferredTransferCharacteristics = -1;
116
     }
117
 
118
-    uint8_t *m_payload;
119
-
120
-    // daniel.vt@samsung.com :: for the Creative Intent Meta Data Encoding ( seongnam.oh@samsung.com )
121
     void writeSEI(const SPS&)
122
     {
123
-        if (!m_payload)
124
-            return;
125
-
126
-        uint32_t i = 0;
127
-        for (; i < m_payloadSize; ++i)
128
-            WRITE_CODE(m_payload[i], 8, "creative_intent_metadata");
129
+        WRITE_CODE(m_preferredTransferCharacteristics, 8, "Preferred transfer characteristics");
130
     }
131
 };
132
+
133
 }
134
 #endif // ifndef X265_SEI_H
135
x265_2.7.tar.gz/source/encoder/slicetype.cpp -> x265_2.9.tar.gz/source/encoder/slicetype.cpp Changed
201
 
1
@@ -150,20 +150,14 @@
2
         curFrame->m_lowres.wp_sum[y] = 0;
3
     }
4
 
5
-    /* Calculate Qp offset for each 16x16 or 8x8 block in the frame */
6
-    int blockXY = 0;
7
-    int blockX = 0, blockY = 0;
8
-    double strength = 0.f;
9
+    /* Calculate Qp offset for each 16x16 or 8x8 block in the frame */    
10
     if ((param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0) || (param->rc.bStatRead && param->rc.cuTree && IS_REFERENCED(curFrame)))
11
     {
12
-        /* Need to init it anyways for CU tree */
13
-        int cuCount = blockCount;
14
-
15
         if (param->rc.aqMode && param->rc.aqStrength == 0)
16
         {
17
             if (quantOffsets)
18
             {
19
-                for (int cuxy = 0; cuxy < cuCount; cuxy++)
20
+                for (int cuxy = 0; cuxy < blockCount; cuxy++)
21
                 {
22
                     curFrame->m_lowres.qpCuTreeOffset[cuxy] = curFrame->m_lowres.qpAqOffset[cuxy] = quantOffsets[cuxy];
23
                     curFrame->m_lowres.invQscaleFactor[cuxy] = x265_exp2fix8(curFrame->m_lowres.qpCuTreeOffset[cuxy]);
24
@@ -171,61 +165,55 @@
25
             }
26
             else
27
             {
28
-                memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double));
29
-                memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double));
30
-                for (int cuxy = 0; cuxy < cuCount; cuxy++)
31
-                    curFrame->m_lowres.invQscaleFactor[cuxy] = 256;
32
+               memset(curFrame->m_lowres.qpCuTreeOffset, 0, blockCount * sizeof(double));
33
+               memset(curFrame->m_lowres.qpAqOffset, 0, blockCount * sizeof(double));
34
+               for (int cuxy = 0; cuxy < blockCount; cuxy++)
35
+                   curFrame->m_lowres.invQscaleFactor[cuxy] = 256;
36
             }
37
         }
38
 
39
-        /* Need variance data for weighted prediction */
40
+        /* Need variance data for weighted prediction and dynamic refinement */
41
         if (param->bEnableWeightedPred || param->bEnableWeightedBiPred)
42
         {
43
-            for (blockY = 0; blockY < maxRow; blockY += loopIncr)
44
-                for (blockX = 0; blockX < maxCol; blockX += loopIncr)
45
-                    acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize);
46
+            for (int blockY = 0; blockY < maxRow; blockY += loopIncr)
47
+                for (int blockX = 0; blockX < maxCol; blockX += loopIncr)                
48
+                    acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize);                
49
         }
50
     }
51
     else
52
     {
53
-        blockXY = 0;
54
-        double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0;
55
-        double bias_strength = 0.f;
56
+        int blockXY = 0;
57
+        double avg_adj_pow2 = 0.f, avg_adj = 0.f, qp_adj = 0.f;
58
+        double bias_strength = 0.f, strength = 0.f;
59
         if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE || param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED)
60
         {
61
-            double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8)));
62
-            curFrame->m_lowres.frameVariance = 0;
63
-            uint64_t rowVariance = 0;
64
-            for (blockY = 0; blockY < maxRow; blockY += loopIncr)
65
-            {
66
-                rowVariance = 0;
67
-                for (blockX = 0; blockX < maxCol; blockX += loopIncr)
68
-                {
69
-                    uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize);
70
-                    curFrame->m_lowres.blockVariance[blockXY] = energy;
71
-                    rowVariance += energy;
72
+            double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8)));            
73
+            
74
+            for (int blockY = 0; blockY < maxRow; blockY += loopIncr)
75
+            {                
76
+                for (int blockX = 0; blockX < maxCol; blockX += loopIncr)
77
+                {
78
+                    uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize);                    
79
                     qp_adj = pow(energy * bit_depth_correction + 1, 0.1);
80
                     curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj;
81
                     avg_adj += qp_adj;
82
                     avg_adj_pow2 += qp_adj * qp_adj;
83
                     blockXY++;
84
                 }
85
-                curFrame->m_lowres.frameVariance += (rowVariance / maxCol);
86
             }
87
-            curFrame->m_lowres.frameVariance /= maxRow;
88
             avg_adj /= blockCount;
89
             avg_adj_pow2 /= blockCount;
90
             strength = param->rc.aqStrength * avg_adj;
91
-            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (modeTwoConst)) / avg_adj;
92
+            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - modeTwoConst) / avg_adj;
93
             bias_strength = param->rc.aqStrength;
94
         }
95
         else
96
             strength = param->rc.aqStrength * 1.0397f;
97
 
98
         blockXY = 0;
99
-        for (blockY = 0; blockY < maxRow; blockY += loopIncr)
100
+        for (int blockY = 0; blockY < maxRow; blockY += loopIncr)
101
         {
102
-            for (blockX = 0; blockX < maxCol; blockX += loopIncr)
103
+            for (int blockX = 0; blockX < maxCol; blockX += loopIncr)
104
             {
105
                 if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED)
106
                 {
107
@@ -240,7 +228,7 @@
108
                 else
109
                 {
110
                     uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp,param->rc.qgSize);
111
-                    qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (modeOneConst + 2 * (X265_DEPTH - 8)));
112
+                    qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (modeOneConst + 2 * (X265_DEPTH - 8)));                    
113
                 }
114
 
115
                 if (param->bHDROpt)
116
@@ -308,6 +296,17 @@
117
             curFrame->m_lowres.wp_ssd[i] = ssd - (sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]);
118
         }
119
     }
120
+
121
+    if (param->bDynamicRefine)
122
+    {
123
+        int blockXY = 0;
124
+        for (int blockY = 0; blockY < maxRow; blockY += loopIncr)
125
+            for (int blockX = 0; blockX < maxCol; blockX += loopIncr)
126
+            {
127
+                curFrame->m_lowres.blockVariance[blockXY] = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize);
128
+                blockXY++;
129
+            }
130
+    }
131
 }
132
 
133
 void LookaheadTLD::lowresIntraEstimate(Lowres& fenc, uint32_t qgSize)
134
@@ -426,7 +425,7 @@
135
     pixel *src = ref.fpelPlane[0];
136
     intptr_t stride = fenc.lumaStride;
137
 
138
-    if (wp.bPresentFlag)
139
+    if (wp.wtPresent)
140
     {
141
         int offset = wp.inputOffset << (X265_DEPTH - 8);
142
         int scale = wp.inputWeight;
143
@@ -480,7 +479,7 @@
144
     int deltaIndex = fenc.frameNum - ref.frameNum;
145
 
146
     WeightParam wp;
147
-    wp.bPresentFlag = false;
148
+    wp.wtPresent = 0;
149
 
150
     if (!wbuffer[0])
151
     {
152
@@ -1078,85 +1077,97 @@
153
     }
154
 
155
     int bframes, brefs;
156
-    for (bframes = 0, brefs = 0;; bframes++)
157
+    if (!m_param->analysisLoad)
158
     {
159
-        Lowres& frm = list[bframes]->m_lowres;
160
-
161
-        if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid)
162
+        for (bframes = 0, brefs = 0;; bframes++)
163
         {
164
-            frm.sliceType = X265_TYPE_B;
165
-            x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid\n",
166
-                     frm.frameNum);
167
-        }
168
+            Lowres& frm = list[bframes]->m_lowres;
169
 
170
-        /* pyramid with multiple B-refs needs a big enough dpb that the preceding P-frame stays available.
171
-         * smaller dpb could be supported by smart enough use of mmco, but it's easier just to forbid it. */
172
-        else if (frm.sliceType == X265_TYPE_BREF && m_param->bBPyramid && brefs &&
173
-                 m_param->maxNumReferences <= (brefs + 3))
174
-        {
175
-            frm.sliceType = X265_TYPE_B;
176
-            x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid and %d reference frames\n",
177
-                     frm.sliceType, m_param->maxNumReferences);
178
-        }
179
-        if ((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax &&
180
-            (!m_extendGopBoundary || frm.frameNum - m_lastKeyframe >= m_param->keyframeMax + m_param->gopLookahead))
181
-        {
182
-            if (frm.sliceType == X265_TYPE_AUTO || frm.sliceType == X265_TYPE_I)
183
-                frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR;
184
-            bool warn = frm.sliceType != X265_TYPE_IDR;
185
-            if (warn && m_param->bOpenGOP)
186
-                warn &= frm.sliceType != X265_TYPE_I;
187
-            if (warn)
188
+            if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid)
189
             {
190
-                x265_log(m_param, X265_LOG_WARNING, "specified frame type (%d) at %d is not compatible with keyframe interval\n",
191
-                         frm.sliceType, frm.frameNum);
192
-                frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR;
193
+                frm.sliceType = X265_TYPE_B;
194
+                x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid\n",
195
+                    frm.frameNum);
196
             }
197
-        }
198
-        if (frm.sliceType == X265_TYPE_I && frm.frameNum - m_lastKeyframe >= m_param->keyframeMin)
199
-        {
200
-            if (m_param->bOpenGOP)
201
x265_2.7.tar.gz/source/encoder/weightPrediction.cpp -> x265_2.9.tar.gz/source/encoder/weightPrediction.cpp Changed
56
 
1
@@ -184,8 +184,7 @@
2
         int denom = w->log2WeightDenom;
3
         int round = denom ? 1 << (denom - 1) : 0;
4
         int correction = IF_INTERNAL_PREC - X265_DEPTH; /* intermediate interpolation depth */
5
-        int pwidth = ((width + 15) >> 4) << 4;
6
-
7
+        int pwidth = ((width + 31) >> 5) << 5;
8
         primitives.weight_pp(ref, weightTemp, stride, pwidth, height,
9
                              weight, round << correction, denom + correction, offset);
10
         ref = weightTemp;
11
@@ -294,7 +293,7 @@
12
         for (int plane = 0; plane < (param.internalCsp != X265_CSP_I400 ? 3 : 1); plane++)
13
         {
14
             denom = plane ? chromaDenom : lumaDenom;
15
-            if (plane && !weights[0].bPresentFlag)
16
+            if (plane && !weights[0].wtPresent)
17
                 break;
18
 
19
             /* Early termination */
20
@@ -477,12 +476,12 @@
21
             }
22
         }
23
 
24
-        if (weights[0].bPresentFlag)
25
+        if (weights[0].wtPresent)
26
         {
27
             // Make sure both chroma channels match
28
-            if (weights[1].bPresentFlag != weights[2].bPresentFlag)
29
+            if (weights[1].wtPresent != weights[2].wtPresent)
30
             {
31
-                if (weights[1].bPresentFlag)
32
+                if (weights[1].wtPresent)
33
                     weights[2] = weights[1];
34
                 else
35
                     weights[1] = weights[2];
36
@@ -516,15 +515,15 @@
37
         for (int list = 0; list < numPredDir; list++)
38
         {
39
             WeightParam* w = &wp[list][0][0];
40
-            if (w[0].bPresentFlag || w[1].bPresentFlag || w[2].bPresentFlag)
41
+            if (w[0].wtPresent || w[1].wtPresent || w[2].wtPresent)
42
             {
43
                 bWeighted = true;
44
                 p += sprintf(buf + p, " [L%d:R0 ", list);
45
-                if (w[0].bPresentFlag)
46
+                if (w[0].wtPresent)
47
                     p += sprintf(buf + p, "Y{%d/%d%+d}", w[0].inputWeight, 1 << w[0].log2WeightDenom, w[0].inputOffset);
48
-                if (w[1].bPresentFlag)
49
+                if (w[1].wtPresent)
50
                     p += sprintf(buf + p, "U{%d/%d%+d}", w[1].inputWeight, 1 << w[1].log2WeightDenom, w[1].inputOffset);
51
-                if (w[2].bPresentFlag)
52
+                if (w[2].wtPresent)
53
                     p += sprintf(buf + p, "V{%d/%d%+d}", w[2].inputWeight, 1 << w[2].log2WeightDenom, w[2].inputOffset);
54
                 p += sprintf(buf + p, "]");
55
             }
56
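
The first weightPrediction.cpp hunk widens the padded width passed to weight_pp from the next multiple of 16 to the next multiple of 32; ((width + 31) >> 5) << 5 is the usual round-up-to-a-power-of-two-multiple idiom. A tiny self-contained check of that arithmetic:

    #include <cassert>

    // Round n up to the next multiple of 2^k (k = 4 gives the old 16-pixel
    // padding, k = 5 the new 32-pixel padding used for pwidth).
    static inline int roundUpPow2(int n, int k) { return ((n + (1 << k) - 1) >> k) << k; }

    int main()
    {
        assert(roundUpPow2(33, 5) == 64);  // 33 -> next multiple of 32
        assert(roundUpPow2(64, 5) == 64);  // already aligned, unchanged
        assert(roundUpPow2(33, 4) == 48);  // old behaviour: next multiple of 16
        return 0;
    }
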
x265_2.7.tar.gz/source/test/ipfilterharness.cpp -> x265_2.9.tar.gz/source/test/ipfilterharness.cpp Changed
201
 
1
@@ -489,6 +489,26 @@
2
     return true;
3
 }
4
 
5
+bool IPFilterHarness::check_IPFilterLumaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt)
6
+{
7
+    for (int i = 0; i < TEST_CASES; i++)
8
+    {
9
+        int index = i % TEST_CASES;
10
+        intptr_t rand_srcStride[] = { 128, 192, 256, 512 };
11
+        intptr_t dstStride[] = { 192, 256, 512, 576 };
12
+        for (int p = 0; p < 4; p++)
13
+        {
14
+            ref(pixel_test_buff[index], rand_srcStride[p], IPF_C_output_s, dstStride[p]);
15
+            checked(opt, pixel_test_buff[index] + (64 * i), rand_srcStride[p], IPF_vec_output_s, dstStride[p]);
16
+            if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
17
+                return false;
18
+        }
19
+        reportfail();
20
+    }
21
+
22
+    return true;
23
+}
24
+
25
 bool IPFilterHarness::check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt)
26
 {
27
     for (int i = 0; i < ITERS; i++)
28
@@ -510,6 +530,29 @@
29
     return true;
30
 }
31
 
32
+bool IPFilterHarness::check_IPFilterChromaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt)
33
+{
34
+    for (int i = 0; i < TEST_CASES; i++)
35
+    {
36
+        int index = i % TEST_CASES;
37
+        intptr_t rand_srcStride[] = { 128, 192, 256, 512};
38
+        intptr_t dstStride[] = { 192, 256, 512, 576 };
39
+
40
+        for (int p = 0; p < 4; p++)
41
+        {
42
+            ref(pixel_test_buff[index], rand_srcStride[p], IPF_C_output_s, dstStride[p]);
43
+
44
+            checked(opt, pixel_test_buff[index], rand_srcStride[p], IPF_vec_output_s, dstStride[p]);
45
+
46
+            if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
47
+                return false;
48
+        }
49
+        reportfail();
50
+    }
51
+
52
+    return true;
53
+}
54
+
55
 bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
56
 {
57
 
58
@@ -571,14 +614,22 @@
59
                 return false;
60
             }
61
         }
62
-        if (opt.pu[value].convert_p2s)
63
+        if (opt.pu[value].convert_p2s[NONALIGNED])
64
         {
65
-            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s))
66
+            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s[NONALIGNED], opt.pu[value].convert_p2s[NONALIGNED]))
67
             {
68
                 printf("convert_p2s[%s]", lumaPartStr[value]);
69
                 return false;
70
             }
71
         }
72
+        if (opt.pu[value].convert_p2s[ALIGNED])
73
+        {
74
+            if (!check_IPFilterLumaP2S_aligned_primitive(ref.pu[value].convert_p2s[ALIGNED], opt.pu[value].convert_p2s[ALIGNED]))
75
+            {
76
+                printf("convert_p2s_aligned[%s]", lumaPartStr[value]);
77
+                return false;
78
+            }
79
+        }
80
     }
81
 
82
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
83
@@ -633,9 +684,17 @@
84
                     return false;
85
                 }
86
             }
87
-            if (opt.chroma[csp].pu[value].p2s)
88
+            if (opt.chroma[csp].pu[value].p2s[ALIGNED])
89
+            {
90
+                if (!check_IPFilterChromaP2S_aligned_primitive(ref.chroma[csp].pu[value].p2s[ALIGNED], opt.chroma[csp].pu[value].p2s[ALIGNED]))
91
+                {
92
+                    printf("chroma_p2s_aligned[%s]", chromaPartStr[csp][value]);
93
+                    return false;
94
+                }
95
+            }
96
+            if (opt.chroma[csp].pu[value].p2s[NONALIGNED])
97
             {
98
-                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s))
99
+                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s[NONALIGNED], opt.chroma[csp].pu[value].p2s[NONALIGNED]))
100
                 {
101
                     printf("chroma_p2s[%s]", chromaPartStr[csp][value]);
102
                     return false;
103
@@ -649,8 +708,8 @@
104
 
105
 void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
106
 {
107
-    int16_t srcStride = 96;
108
-    int16_t dstStride = 96;
109
+    int16_t srcStride = 192;  /* Multiple of 64 */
110
+    int16_t dstStride = 192;
111
     int maxVerticalfilterHalfDistance = 3;
112
 
113
     for (int value = 0; value < NUM_PU_SIZES; value++)
114
@@ -659,62 +718,70 @@
115
         {
116
             printf("luma_hpp[%s]\t", lumaPartStr[value]);
117
             REPORT_SPEEDUP(opt.pu[value].luma_hpp, ref.pu[value].luma_hpp,
118
-                           pixel_buff + srcStride, srcStride, IPF_vec_output_p, dstStride, 1);
119
+                pixel_buff + srcStride, srcStride, IPF_vec_output_p, dstStride, 1);
120
         }
121
 
122
         if (opt.pu[value].luma_hps)
123
         {
124
             printf("luma_hps[%s]\t", lumaPartStr[value]);
125
             REPORT_SPEEDUP(opt.pu[value].luma_hps, ref.pu[value].luma_hps,
126
-                           pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
127
-                           IPF_vec_output_s, dstStride, 1, 1);
128
+                pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
129
+                IPF_vec_output_s, dstStride, 1, 1);
130
         }
131
 
132
         if (opt.pu[value].luma_vpp)
133
         {
134
             printf("luma_vpp[%s]\t", lumaPartStr[value]);
135
             REPORT_SPEEDUP(opt.pu[value].luma_vpp, ref.pu[value].luma_vpp,
136
-                           pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
137
-                           IPF_vec_output_p, dstStride, 1);
138
+                pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
139
+                IPF_vec_output_p, dstStride, 1);
140
         }
141
 
142
         if (opt.pu[value].luma_vps)
143
         {
144
             printf("luma_vps[%s]\t", lumaPartStr[value]);
145
             REPORT_SPEEDUP(opt.pu[value].luma_vps, ref.pu[value].luma_vps,
146
-                           pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
147
-                           IPF_vec_output_s, dstStride, 1);
148
+                pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
149
+                IPF_vec_output_s, dstStride, 1);
150
         }
151
 
152
         if (opt.pu[value].luma_vsp)
153
         {
154
             printf("luma_vsp[%s]\t", lumaPartStr[value]);
155
             REPORT_SPEEDUP(opt.pu[value].luma_vsp, ref.pu[value].luma_vsp,
156
-                           short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
157
-                           IPF_vec_output_p, dstStride, 1);
158
+                short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
159
+                IPF_vec_output_p, dstStride, 1);
160
         }
161
 
162
         if (opt.pu[value].luma_vss)
163
         {
164
             printf("luma_vss[%s]\t", lumaPartStr[value]);
165
             REPORT_SPEEDUP(opt.pu[value].luma_vss, ref.pu[value].luma_vss,
166
-                           short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
167
-                           IPF_vec_output_s, dstStride, 1);
168
+                short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
169
+                IPF_vec_output_s, dstStride, 1);
170
         }
171
 
172
         if (opt.pu[value].luma_hvpp)
173
         {
174
             printf("luma_hv [%s]\t", lumaPartStr[value]);
175
             REPORT_SPEEDUP(opt.pu[value].luma_hvpp, ref.pu[value].luma_hvpp,
176
-                           pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
177
+                pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
178
         }
179
 
180
-        if (opt.pu[value].convert_p2s)
181
+        if (opt.pu[value].convert_p2s[NONALIGNED])
182
         {
183
             printf("convert_p2s[%s]\t", lumaPartStr[value]);
184
-            REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s,
185
-                               pixel_buff, srcStride,
186
-                               IPF_vec_output_s, dstStride);
187
+            REPORT_SPEEDUP(opt.pu[value].convert_p2s[NONALIGNED], ref.pu[value].convert_p2s[NONALIGNED],
188
+                pixel_buff, srcStride,
189
+                IPF_vec_output_s, dstStride);
190
+        }
191
+
192
+        if (opt.pu[value].convert_p2s[ALIGNED])
193
+        {
194
+            printf("convert_p2s_aligned[%s]\t", lumaPartStr[value]);
195
+            REPORT_SPEEDUP(opt.pu[value].convert_p2s[ALIGNED], ref.pu[value].convert_p2s[ALIGNED],
196
+                pixel_buff, srcStride,
197
+                IPF_vec_output_s, dstStride);
198
         }
199
     }
200
 
201
x265_2.7.tar.gz/source/test/ipfilterharness.h -> x265_2.9.tar.gz/source/test/ipfilterharness.h Changed
35
 
1
@@ -40,15 +40,15 @@
2
     enum { TEST_CASES = 3 };
3
     enum { SMAX = 1 << 12 };
4
     enum { SMIN = (unsigned)-1 << 12 };
5
-    ALIGN_VAR_32(pixel, pixel_buff[TEST_BUF_SIZE]);
6
-    int16_t short_buff[TEST_BUF_SIZE];
7
-    int16_t IPF_vec_output_s[TEST_BUF_SIZE];
8
-    int16_t IPF_C_output_s[TEST_BUF_SIZE];
9
-    pixel   IPF_vec_output_p[TEST_BUF_SIZE];
10
-    pixel   IPF_C_output_p[TEST_BUF_SIZE];
11
+    ALIGN_VAR_64(pixel, pixel_buff[TEST_BUF_SIZE]);
12
+    ALIGN_VAR_64(int16_t, short_buff[TEST_BUF_SIZE]);
13
+    ALIGN_VAR_64(int16_t, IPF_vec_output_s[TEST_BUF_SIZE]);
14
+    ALIGN_VAR_64(int16_t, IPF_C_output_s[TEST_BUF_SIZE]);
15
+    ALIGN_VAR_64(pixel,   IPF_vec_output_p[TEST_BUF_SIZE]);
16
+    ALIGN_VAR_64(pixel,   IPF_C_output_p[TEST_BUF_SIZE]);
17
 
18
-    pixel   pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
19
-    int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
20
+    ALIGN_VAR_64(pixel,   pixel_test_buff[TEST_CASES][TEST_BUF_SIZE]);
21
+    ALIGN_VAR_64(int16_t, short_test_buff[TEST_CASES][TEST_BUF_SIZE]);
22
 
23
     bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
24
     bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
25
@@ -62,7 +62,9 @@
26
     bool check_IPFilterLuma_ss_primitive(filter_ss_t ref, filter_ss_t opt);
27
     bool check_IPFilterLumaHV_primitive(filter_hv_pp_t ref, filter_hv_pp_t opt);
28
     bool check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt);
29
+    bool check_IPFilterLumaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt);
30
     bool check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt);
31
+    bool check_IPFilterChromaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt);
32
 
33
 public:
34
 
35
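
The ipfilterharness.h buffers above move from 32-byte to 64-byte alignment so the new AVX-512 code paths, whose ZMM loads and stores span 64 bytes, can be exercised by the ALIGNED primitive variants. ALIGN_VAR_64 is x265's compiler-specific aligned-declaration macro; the standard C++ alignas keyword expresses roughly the same requirement, as in this illustrative snippet:

    #include <cstdint>

    // 64 bytes is the natural load/store width of an AVX-512 ZMM register, so
    // buffers fed to the ALIGNED primitive variants are declared with at least
    // that alignment.
    alignas(64) static int16_t short_buf[64 * 64];
    alignas(64) static uint8_t pixel_buf[64 * 64];
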
x265_2.7.tar.gz/source/test/mbdstharness.cpp -> x265_2.9.tar.gz/source/test/mbdstharness.cpp Changed
201
 
1
@@ -61,16 +61,17 @@
2
     for (int i = 0; i < TEST_BUF_SIZE; i++)
3
     {
4
         short_test_buff[0][i]    = (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX);
5
+        short_test_buff1[0][i]   = (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX);
6
         int_test_buff[0][i]      = rand() % PIXEL_MAX;
7
         int_idct_test_buff[0][i] = (rand() % (SHORT_MAX - SHORT_MIN)) - SHORT_MAX;
8
         short_denoise_test_buff1[0][i] = short_denoise_test_buff2[0][i] = (rand() & SHORT_MAX) - (rand() & SHORT_MAX);
9
-
10
         short_test_buff[1][i]    = -PIXEL_MAX;
11
+        short_test_buff1[1][i]   = -PIXEL_MAX;
12
         int_test_buff[1][i]      = -PIXEL_MAX;
13
         int_idct_test_buff[1][i] = SHORT_MIN;
14
         short_denoise_test_buff1[1][i] = short_denoise_test_buff2[1][i] = -SHORT_MAX;
15
-
16
         short_test_buff[2][i]    = PIXEL_MAX;
17
+        short_test_buff1[2][i]   = PIXEL_MAX;
18
         int_test_buff[2][i]      = PIXEL_MAX;
19
         int_idct_test_buff[2][i] = SHORT_MAX;
20
         short_denoise_test_buff1[2][i] = short_denoise_test_buff2[2][i] = SHORT_MAX;
21
@@ -252,12 +253,10 @@
22
 bool MBDstHarness::check_nquant_primitive(nquant_t ref, nquant_t opt)
23
 {
24
     int j = 0;
25
-
26
     for (int i = 0; i < ITERS; i++)
27
     {
28
-        int width = (rand() % 4 + 1) * 4;
29
+        int width = 1 << (rand() % 4 + 2);
30
         int height = width;
31
-
32
         uint32_t optReturnValue = 0;
33
         uint32_t refReturnValue = 0;
34
 
35
@@ -281,6 +280,136 @@
36
         reportfail();
37
         j += INCR;
38
     }
39
+    return true;
40
+}
41
+
42
+bool MBDstHarness::check_nonPsyRdoQuant_primitive(nonPsyRdoQuant_t ref, nonPsyRdoQuant_t opt)
43
+{
44
+    int j = 0;
45
+    int trSize[4] = { 16, 64, 256, 1024 };
46
+
47
+    ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]);
48
+    ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]);
49
+
50
+    for (int i = 0; i < ITERS; i++)
51
+    {
52
+        int64_t totalRdCostRef = rand();
53
+        int64_t totalUncodedCostRef = rand();
54
+        int64_t totalRdCostOpt = totalRdCostRef;
55
+        int64_t totalUncodedCostOpt = totalUncodedCostRef;
56
+
57
+        int index = rand() % 4;
58
+        uint32_t blkPos = trSize[index];
59
+        int cmp_size = 4 * MAX_TU_SIZE;
60
+
61
+        memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
62
+        memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
63
+
64
+        int index1 = rand() % TEST_CASES;
65
+
66
+        ref(short_test_buff[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, blkPos);
67
+        checked(opt, short_test_buff[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, blkPos);
68
+
69
+        if (memcmp(ref_dest, opt_dest, cmp_size))
70
+            return false;
71
+
72
+        if (totalUncodedCostRef != totalUncodedCostOpt)
73
+            return false;
74
+
75
+        if (totalRdCostRef != totalRdCostOpt)
76
+            return false;
77
+
78
+        reportfail();
79
+        j += INCR;
80
+    }
81
+
82
+    return true;
83
+}
84
+bool MBDstHarness::check_psyRdoQuant_primitive(psyRdoQuant_t ref, psyRdoQuant_t opt)
85
+{
86
+    int j = 0;
87
+    int trSize[4] = { 16, 64, 256, 1024 };
88
+
89
+    ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]);
90
+    ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]);
91
+
92
+    for (int i = 0; i < ITERS; i++)
93
+    {
94
+        int64_t totalRdCostRef = rand();
95
+        int64_t totalUncodedCostRef = rand();
96
+        int64_t totalRdCostOpt = totalRdCostRef;
97
+        int64_t totalUncodedCostOpt = totalUncodedCostRef;
98
+        int64_t *psyScale = X265_MALLOC(int64_t, 1);
99
+        *psyScale = rand();
100
+
101
+        int index = rand() % 4;
102
+        uint32_t blkPos = trSize[index];
103
+        int cmp_size = 4 * MAX_TU_SIZE;
104
+
105
+        memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
106
+        memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
107
+
108
+        int index1 = rand() % TEST_CASES;
109
+
110
+        ref(short_test_buff[index1] + j, short_test_buff1[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, psyScale, blkPos);
111
+        checked(opt, short_test_buff[index1] + j, short_test_buff1[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, psyScale, blkPos);
112
+
113
+        X265_FREE(psyScale);
114
+        if (memcmp(ref_dest, opt_dest, cmp_size))
115
+            return false;
116
+
117
+        if (totalUncodedCostRef != totalUncodedCostOpt)
118
+            return false;
119
+
120
+        if (totalRdCostRef != totalRdCostOpt)
121
+            return false;
122
+
123
+        reportfail();
124
+        j += INCR;
125
+    }
126
+
127
+    return true;
128
+}
129
+bool MBDstHarness::check_psyRdoQuant_primitive_avx2(psyRdoQuant_t1 ref, psyRdoQuant_t1 opt)
130
+{
131
+    int j = 0;
132
+    int trSize[4] = { 16, 64, 256, 1024 };
133
+
134
+    ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]);
135
+    ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]);
136
+
137
+    for (int i = 0; i < ITERS; i++)
138
+    {
139
+        int64_t totalRdCostRef = rand();
140
+        int64_t totalUncodedCostRef = rand();
141
+        int64_t totalRdCostOpt = totalRdCostRef;
142
+        int64_t totalUncodedCostOpt = totalUncodedCostRef;
143
+
144
+        int index = rand() % 4;
145
+        uint32_t blkPos =  trSize[index];
146
+        int cmp_size = 4 * MAX_TU_SIZE;
147
+
148
+        memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
149
+        memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t));
150
+
151
+        int index1 = rand() % TEST_CASES;
152
+
153
+        ref(short_test_buff[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, blkPos);
154
+        checked(opt, short_test_buff[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, blkPos);
155
+
156
+        
157
+        if (memcmp(ref_dest, opt_dest, cmp_size))
158
+            return false;
159
+
160
+        if (totalUncodedCostRef != totalUncodedCostOpt)
161
+            return false;
162
+
163
+        if (totalRdCostRef != totalRdCostOpt)
164
+            return false;
165
+
166
+        reportfail();
167
+        j += INCR;
168
+    }
169
 
170
     return true;
171
 }
172
@@ -420,6 +549,40 @@
173
             return false;
174
         }
175
     }
176
+
177
+    for (int i = 0; i < NUM_TR_SIZE; i++)
178
+    {
179
+        if (opt.cu[i].nonPsyRdoQuant)
180
+        {
181
+            if (!check_nonPsyRdoQuant_primitive(ref.cu[i].nonPsyRdoQuant, opt.cu[i].nonPsyRdoQuant))
182
+            {
183
+                printf("nonPsyRdoQuant[%dx%d]: Failed!\n", 4 << i, 4 << i);
184
+                return false;
185
+            }
186
+        }
187
+    }
188
+    for (int i = 0; i < NUM_TR_SIZE; i++)
189
+    {
190
+        if (opt.cu[i].psyRdoQuant)
191
+        {
192
+            if (!check_psyRdoQuant_primitive(ref.cu[i].psyRdoQuant, opt.cu[i].psyRdoQuant))
193
+            {
194
+                printf("psyRdoQuant[%dx%d]: Failed!\n", 4 << i, 4 << i);
195
+                return false;
196
+            }
197
+        }
198
+    }
199
+    for (int i = 0; i < NUM_TR_SIZE; i++)
200
+    {
201
x265_2.7.tar.gz/source/test/mbdstharness.h -> x265_2.9.tar.gz/source/test/mbdstharness.h Changed
32
 
1
@@ -51,26 +51,27 @@
2
     int     mintbuf2[MAX_TU_SIZE];
3
     int     mintbuf3[MAX_TU_SIZE];
4
     int     mintbuf4[MAX_TU_SIZE];
5
-
6
     int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
7
+    int16_t short_test_buff1[TEST_CASES][TEST_BUF_SIZE];
8
     int     int_test_buff[TEST_CASES][TEST_BUF_SIZE];
9
     int     int_idct_test_buff[TEST_CASES][TEST_BUF_SIZE];
10
-
11
     uint32_t mubuf1[MAX_TU_SIZE];
12
     uint32_t mubuf2[MAX_TU_SIZE];
13
     uint16_t mushortbuf1[MAX_TU_SIZE];
14
 
15
     int16_t short_denoise_test_buff1[TEST_CASES][TEST_BUF_SIZE];
16
     int16_t short_denoise_test_buff2[TEST_CASES][TEST_BUF_SIZE];
17
-
18
     bool check_dequant_primitive(dequant_scaling_t ref, dequant_scaling_t opt);
19
     bool check_dequant_primitive(dequant_normal_t ref, dequant_normal_t opt);
20
+    bool check_nonPsyRdoQuant_primitive(nonPsyRdoQuant_t ref, nonPsyRdoQuant_t opt);
21
+    bool check_psyRdoQuant_primitive(psyRdoQuant_t ref, psyRdoQuant_t opt);
22
     bool check_quant_primitive(quant_t ref, quant_t opt);
23
     bool check_nquant_primitive(nquant_t ref, nquant_t opt);
24
     bool check_dct_primitive(dct_t ref, dct_t opt, intptr_t width);
25
     bool check_idct_primitive(idct_t ref, idct_t opt, intptr_t width);
26
     bool check_count_nonzero_primitive(count_nonzero_t ref, count_nonzero_t opt);
27
     bool check_denoise_dct_primitive(denoiseDct_t ref, denoiseDct_t opt);
28
+    bool check_psyRdoQuant_primitive_avx2(psyRdoQuant_t1 ref, psyRdoQuant_t1 opt);
29
 
30
 public:
31
 
32
x265_2.7.tar.gz/source/test/pixelharness.cpp -> x265_2.9.tar.gz/source/test/pixelharness.cpp Changed
201
 
1
@@ -226,6 +226,31 @@
2
     return true;
3
 }
4
 
5
+bool PixelHarness::check_calresidual_aligned(calcresidual_t ref, calcresidual_t opt)
6
+{
7
+    ALIGN_VAR_16(int16_t, ref_dest[64 * 64]);
8
+    ALIGN_VAR_16(int16_t, opt_dest[64 * 64]);
9
+    memset(ref_dest, 0, 64 * 64 * sizeof(int16_t));
10
+    memset(opt_dest, 0, 64 * 64 * sizeof(int16_t));
11
+
12
+    int j = 0;
13
+    intptr_t stride = STRIDE;
14
+    for (int i = 0; i < ITERS; i++)
15
+    {
16
+        int index = i % TEST_CASES;
17
+        checked(opt, pbuf1 + j, pixel_test_buff[index] + j, opt_dest, stride);
18
+        ref(pbuf1 + j, pixel_test_buff[index] + j, ref_dest, stride);
19
+
20
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t)))
21
+            return false;
22
+
23
+        reportfail();
24
+        j += INCR;
25
+    }
26
+
27
+    return true;
28
+}
29
+
30
 bool PixelHarness::check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt)
31
 {
32
     int j = 0;
33
@@ -242,10 +267,27 @@
34
         reportfail();
35
         j += INCR;
36
     }
37
-
38
     return true;
39
 }
40
+bool PixelHarness::check_ssd_s_aligned(pixel_ssd_s_t ref, pixel_ssd_s_t opt)
41
+{
42
+    int j = 0;
43
+    for (int i = 0; i < ITERS; i++)
44
+    {
45
+        // NOTE: stride must be a multiple of 16, because the minimum block is 4x4
46
+        int stride = STRIDE;
47
+        sse_t cres = ref(sbuf1 + j, stride);
48
+        sse_t vres = (sse_t)checked(opt, sbuf1 + j, (intptr_t)stride);
49
+
50
+        if (cres != vres)
51
+            return false;
52
+
53
+        reportfail();
54
+        j += INCR+32;
55
+    }
56
 
57
+    return true;
58
+}
59
 bool PixelHarness::check_weightp(weightp_sp_t ref, weightp_sp_t opt)
60
 {
61
     ALIGN_VAR_16(pixel, ref_dest[64 * (64 + 1)]);
62
@@ -290,7 +332,11 @@
63
     memset(ref_dest, 0, 64 * 64 * sizeof(pixel));
64
     memset(opt_dest, 0, 64 * 64 * sizeof(pixel));
65
     int j = 0;
66
+    bool enableavx512 = true;
67
     int width = 16 * (rand() % 4 + 1);
68
+    int cpuid = X265_NS::cpu_detect(enableavx512);
69
+    if (cpuid & X265_CPU_AVX512)
70
+        width = 32 * (rand() % 2 + 1);
71
     int height = 8;
72
     int w0 = rand() % 128;
73
     int shift = rand() % 8; // maximum is 7, see setFromWeightAndOffset()
74
@@ -441,12 +487,10 @@
75
 
76
     return true;
77
 }
78
-
79
 bool PixelHarness::check_cpy1Dto2D_shl_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt)
80
 {
81
-    ALIGN_VAR_16(int16_t, ref_dest[64 * 64]);
82
-    ALIGN_VAR_16(int16_t, opt_dest[64 * 64]);
83
-
84
+    ALIGN_VAR_64(int16_t, ref_dest[64 * 64]);
85
+    ALIGN_VAR_64(int16_t, opt_dest[64 * 64]);
86
     memset(ref_dest, 0xCD, sizeof(ref_dest));
87
     memset(opt_dest, 0xCD, sizeof(opt_dest));
88
 
89
@@ -469,6 +513,33 @@
90
 
91
     return true;
92
 }
93
+bool PixelHarness::check_cpy1Dto2D_shl_aligned_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt)
94
+{
95
+    ALIGN_VAR_64(int16_t, ref_dest[64 * 64]);
96
+    ALIGN_VAR_64(int16_t, opt_dest[64 * 64]);
97
+
98
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
99
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
100
+
101
+    int j = 0;
102
+    intptr_t stride = STRIDE;
103
+    for (int i = 0; i < ITERS; i++)
104
+    {
105
+        int shift = (rand() % 7 + 1);
106
+
107
+        int index = i % TEST_CASES;
108
+        checked(opt, opt_dest, short_test_buff[index] + j, stride, shift);
109
+        ref(ref_dest, short_test_buff[index] + j, stride, shift);
110
+
111
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t)))
112
+            return false;
113
+
114
+        reportfail();
115
+        j += INCR + 32;
116
+    }
117
+
118
+    return true;
119
+}
120
 
121
 bool PixelHarness::check_cpy1Dto2D_shr_t(cpy1Dto2D_shr_t ref, cpy1Dto2D_shr_t opt)
122
 {
123
@@ -497,11 +568,37 @@
124
 
125
     return true;
126
 }
127
-
128
 bool PixelHarness::check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt)
129
 {
130
-    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
131
-    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
132
+    ALIGN_VAR_64(pixel, ref_dest[64 * 64]);
133
+    ALIGN_VAR_64(pixel, opt_dest[64 * 64]);
134
+    int j = 0;
135
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
136
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
137
+
138
+    intptr_t stride = STRIDE;
139
+    for (int i = 0; i < ITERS; i++)
140
+    {
141
+        int index1 = rand() % TEST_CASES;
142
+        int index2 = rand() % TEST_CASES;
143
+        checked(ref, ref_dest, stride, pixel_test_buff[index1] + j,
144
+                stride, pixel_test_buff[index2] + j, stride, 32);
145
+        opt(opt_dest, stride, pixel_test_buff[index1] + j,
146
+            stride, pixel_test_buff[index2] + j, stride, 32);
147
+
148
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
149
+            return false;
150
+
151
+        reportfail();
152
+        j += INCR;
153
+    }
154
+
155
+    return true;
156
+}
157
+bool PixelHarness::check_pixelavg_pp_aligned(pixelavg_pp_t ref, pixelavg_pp_t opt)
158
+{
159
+    ALIGN_VAR_64(pixel, ref_dest[64 * 64]);
160
+    ALIGN_VAR_64(pixel, opt_dest[64 * 64]);
161
 
162
     int j = 0;
163
 
164
@@ -522,7 +619,7 @@
165
             return false;
166
 
167
         reportfail();
168
-        j += INCR;
169
+        j += INCR + 32;
170
     }
171
 
172
     return true;
173
@@ -642,8 +739,33 @@
174
 
175
 bool PixelHarness::check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt)
176
 {
177
-    ALIGN_VAR_16(int16_t, ref_dest[64 * 64]);
178
-    ALIGN_VAR_16(int16_t, opt_dest[64 * 64]);
179
+    ALIGN_VAR_64(int16_t, ref_dest[64 * 64]);
180
+    ALIGN_VAR_64(int16_t, opt_dest[64 * 64]);
181
+
182
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
183
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
184
+
185
+    intptr_t stride = 64;
186
+    for (int i = 0; i < ITERS; i++)
187
+    {
188
+        int16_t value = (rand() % SHORT_MAX) + 1;
189
+
190
+        checked(opt, opt_dest, stride, value);
191
+        ref(ref_dest, stride, value);
192
+
193
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t)))
194
+            return false;
195
+
196
+        reportfail();
197
+    }
198
+
199
+    return true;
200
+}
201
x265_2.7.tar.gz/source/test/pixelharness.h -> x265_2.9.tar.gz/source/test/pixelharness.h Changed
89
 
1
@@ -44,30 +44,30 @@
2
     enum { RMAX = PIXEL_MAX - PIXEL_MIN }; //The maximum value obtained by subtracting pixel values (residual max)
3
     enum { RMIN = PIXEL_MIN - PIXEL_MAX }; //The minimum value obtained by subtracting pixel values (residual min)
4
 
5
-    ALIGN_VAR_32(pixel, pbuf1[BUFFSIZE]);
6
-    pixel    pbuf2[BUFFSIZE];
7
-    pixel    pbuf3[BUFFSIZE];
8
-    pixel    pbuf4[BUFFSIZE];
9
-    int      ibuf1[BUFFSIZE];
10
-    int8_t   psbuf1[BUFFSIZE];
11
-    int8_t   psbuf2[BUFFSIZE];
12
-    int8_t   psbuf3[BUFFSIZE];
13
-    int8_t   psbuf4[BUFFSIZE];
14
-    int8_t   psbuf5[BUFFSIZE];
15
+    ALIGN_VAR_64(pixel, pbuf1[BUFFSIZE]);
16
+    ALIGN_VAR_64(pixel,    pbuf2[BUFFSIZE]);
17
+    ALIGN_VAR_64(pixel,    pbuf3[BUFFSIZE]);
18
+    ALIGN_VAR_64(pixel,    pbuf4[BUFFSIZE]);
19
+    ALIGN_VAR_64(int,      ibuf1[BUFFSIZE]);
20
+    ALIGN_VAR_64(int8_t,   psbuf1[BUFFSIZE]);
21
+    ALIGN_VAR_64(int8_t,   psbuf2[BUFFSIZE]);
22
+    ALIGN_VAR_64(int8_t,   psbuf3[BUFFSIZE]);
23
+    ALIGN_VAR_64(int8_t,   psbuf4[BUFFSIZE]);
24
+    ALIGN_VAR_64(int8_t,   psbuf5[BUFFSIZE]);
25
 
26
-    int16_t  sbuf1[BUFFSIZE];
27
-    int16_t  sbuf2[BUFFSIZE];
28
-    int16_t  sbuf3[BUFFSIZE];
29
+    ALIGN_VAR_64(int16_t,  sbuf1[BUFFSIZE]);
30
+    ALIGN_VAR_64(int16_t,  sbuf2[BUFFSIZE]);
31
+    ALIGN_VAR_64(int16_t,  sbuf3[BUFFSIZE]);
32
 
33
-    pixel    pixel_test_buff[TEST_CASES][BUFFSIZE];
34
-    int16_t  short_test_buff[TEST_CASES][BUFFSIZE];
35
-    int16_t  short_test_buff1[TEST_CASES][BUFFSIZE];
36
-    int16_t  short_test_buff2[TEST_CASES][BUFFSIZE];
37
-    int      int_test_buff[TEST_CASES][BUFFSIZE];
38
-    uint16_t ushort_test_buff[TEST_CASES][BUFFSIZE];
39
-    uint8_t  uchar_test_buff[TEST_CASES][BUFFSIZE];
40
-    double   double_test_buff[TEST_CASES][BUFFSIZE];
41
-    int16_t  residual_test_buff[TEST_CASES][BUFFSIZE];
42
+    ALIGN_VAR_64(pixel,    pixel_test_buff[TEST_CASES][BUFFSIZE]);
43
+    ALIGN_VAR_64(int16_t,  short_test_buff[TEST_CASES][BUFFSIZE]);
44
+    ALIGN_VAR_64(int16_t,  short_test_buff1[TEST_CASES][BUFFSIZE]);
45
+    ALIGN_VAR_64(int16_t,  short_test_buff2[TEST_CASES][BUFFSIZE]);
46
+    ALIGN_VAR_64(int,      int_test_buff[TEST_CASES][BUFFSIZE]);
47
+    ALIGN_VAR_64(uint16_t, ushort_test_buff[TEST_CASES][BUFFSIZE]);
48
+    ALIGN_VAR_64(uint8_t,  uchar_test_buff[TEST_CASES][BUFFSIZE]);
49
+    ALIGN_VAR_64(double,   double_test_buff[TEST_CASES][BUFFSIZE]);
50
+    ALIGN_VAR_64(int16_t,  residual_test_buff[TEST_CASES][BUFFSIZE]);
51
 
52
     bool check_pixelcmp(pixelcmp_t ref, pixelcmp_t opt);
53
     bool check_pixel_sse(pixel_sse_t ref, pixel_sse_t opt);
54
@@ -79,13 +79,19 @@
55
     bool check_copy_ps(copy_ps_t ref, copy_ps_t opt);
56
     bool check_copy_ss(copy_ss_t ref, copy_ss_t opt);
57
     bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt);
58
+    bool check_pixelavg_pp_aligned(pixelavg_pp_t ref, pixelavg_pp_t opt);
59
     bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt);
60
     bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt);
61
+    bool check_pixel_add_ps_aligned(pixel_add_ps_t ref, pixel_add_ps_t opt);
62
     bool check_scale1D_pp(scale1D_t ref, scale1D_t opt);
63
+    bool check_scale1D_pp_aligned(scale1D_t ref, scale1D_t opt);
64
     bool check_scale2D_pp(scale2D_t ref, scale2D_t opt);
65
     bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
66
+    bool check_ssd_s_aligned(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
67
     bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt);
68
+    bool check_blockfill_s_aligned(blockfill_s_t ref, blockfill_s_t opt);
69
     bool check_calresidual(calcresidual_t ref, calcresidual_t opt);
70
+    bool check_calresidual_aligned(calcresidual_t ref, calcresidual_t opt);
71
     bool check_transpose(transpose_t ref, transpose_t opt);
72
     bool check_weightp(weightp_pp_t ref, weightp_pp_t opt);
73
     bool check_weightp(weightp_sp_t ref, weightp_sp_t opt);
74
@@ -93,12 +99,14 @@
75
     bool check_cpy2Dto1D_shl_t(cpy2Dto1D_shl_t ref, cpy2Dto1D_shl_t opt);
76
     bool check_cpy2Dto1D_shr_t(cpy2Dto1D_shr_t ref, cpy2Dto1D_shr_t opt);
77
     bool check_cpy1Dto2D_shl_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt);
78
+    bool check_cpy1Dto2D_shl_aligned_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt);
79
     bool check_cpy1Dto2D_shr_t(cpy1Dto2D_shr_t ref, cpy1Dto2D_shr_t opt);
80
     bool check_copy_cnt_t(copy_cnt_t ref, copy_cnt_t opt);
81
     bool check_pixel_var(var_t ref, var_t opt);
82
     bool check_ssim_4x4x2_core(ssim_4x4x2_core_t ref, ssim_4x4x2_core_t opt);
83
     bool check_ssim_end(ssim_end4_t ref, ssim_end4_t opt);
84
     bool check_addAvg(addAvg_t, addAvg_t);
85
+    bool check_addAvg_aligned(addAvg_t, addAvg_t);
86
     bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt);
87
     bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt);
88
     bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
89
x265_2.7.tar.gz/source/test/regression-tests.txt -> x265_2.9.tar.gz/source/test/regression-tests.txt Changed
63
 
1
@@ -23,12 +23,12 @@
2
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4
3
 BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0
4
 BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3
5
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --bitrate 7000 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --bitrate 7000  --tskip-fast --limit-tu 2
6
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --crf 18 --tskip-fast --limit-tu 2
7
 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
8
 Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
9
 Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
10
 Coastguard-4k.y4m,--preset superfast --tune grain --pme --aq-strength 2 --merange 190
11
-Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000::--preset veryfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000
12
+Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --qp 35::--preset veryfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --qp 35
13
 Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh --slices 2
14
 Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1
15
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
16
@@ -69,12 +69,11 @@
17
 KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes --limit-tu 1
18
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
19
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2
20
-NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-save x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000::--preset slow --no-cutree --analysis-load x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000
21
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-save x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 --vbv-maxrate 9000 --vbv-bufsize 9000::--preset slow --no-cutree --analysis-load x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 --vbv-maxrate 9000 --vbv-bufsize 9000
22
 News-4k.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000::--preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000
23
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
24
 News-4k.y4m,--preset superfast --slices 4 --aq-mode 0 
25
 News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16
26
-News-4k.y4m,--preset slower --opt-cu-delta-qp
27
 News-4k.y4m,--preset veryslow --no-rskip
28
 News-4k.y4m,--preset veryslow --pme --crf 40
29
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
30
@@ -104,7 +103,6 @@
31
 city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2
32
 city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock
33
 city_4cif_60fps.y4m,--preset slower --scaling-list default
34
-city_4cif_60fps.y4m,--preset veryslow --opt-cu-delta-qp
35
 city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra --limit-refs 0
36
 ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1
37
 ducks_take_off_444_720p50.y4m,--preset superfast --weightp --limit-refs 2
38
@@ -151,7 +149,7 @@
39
 Kimono1_1920x1080_24_400.yuv,--preset veryslow --crf 4 --cu-lossless --slices 2 --limit-refs 3 --limit-modes
40
 Kimono1_1920x1080_24_400.yuv,--preset placebo --ctu 32 --max-tu-size 8 --limit-tu 2
41
 big_buck_bunny_360p24.y4m, --keyint 60 --min-keyint 40 --gop-lookahead 14
42
-BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --radl 2
43
+BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --radl 2 --vbv-maxrate 5000 --vbv-bufsize 5000
44
 
45
 # Main12 intraCost overflow bug test
46
 720p50_parkrun_ter.y4m,--preset medium
47
@@ -167,4 +165,15 @@
48
 #low-pass dct test
49
 720p50_parkrun_ter.y4m,--preset medium --lowpass-dct
50
 
51
+#scaled save/load test
52
+crowd_run_1080p50.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat  --analysis-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat  --analysis-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 
53
+crowd_run_1080p50.y4m,--preset superfast --no-cutree --analysis-save x265_analysis.dat  --analysis-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --no-cutree --analysis-load x265_analysis.dat  --analysis-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 
54
+crowd_run_1080p50.y4m,--preset fast --no-cutree --analysis-save x265_analysis.dat  --analysis-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --no-cutree --analysis-load x265_analysis.dat  --analysis-reuse-level 5 --scale-factor 2 --qp 18
55
+crowd_run_1080p50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat  --analysis-reuse-level 10 --scale-factor 2 --bitrate 5000  --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3
56
+RaceHorses_416x240_30.y4m,--preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-reuse-level 10 --scale-factor 2 --crf 22  --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat  --analysis-save x265_analysis_2.dat --analysis-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat  --analysis-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2
57
+ElFunete_960x540_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-reuse-level 10 --analysis-save elfuente_960x540.dat --scale-factor 2::ElFunete_1920x1080_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-reuse-level 10 --analysis-save elfuente_1920x1080.dat --limit-tu 0 --scale-factor 2 --analysis-load elfuente_960x540.dat --refine-intra 4 --refine-inter 2::ElFuente_3840x2160_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune=psnr --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 --analysis-reuse-level 10 --limit-tu 0 --scale-factor 2 --analysis-load elfuente_1920x1080.dat --refine-intra 4 --refine-inter 2
58
+
59
+#segment encoding
60
+BasketballDrive_1920x1080_50.y4m, --preset ultrafast --no-open-gop --chunk-start 100 --chunk-end 200
61
+
62
 # vim: tw=200
63
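
The new "#segment encoding" test above drives chunked encoding from the CLI (--chunk-start/--chunk-end paired with --no-open-gop). Below is a rough sketch of the same configuration through the public API; it assumes the long-option names from x265cli.h are accepted by x265_param_parse(), which is how most CLI options are routed, and the frame numbers are simply those used in the test line:

    #include "x265.h"

    // Configure a 100..200 segment encode, mirroring the regression test above.
    static x265_param* make_chunk_param()
    {
        x265_param* p = x265_param_alloc();
        if (!p || x265_param_default_preset(p, "ultrafast", NULL) < 0)
            return NULL;
        x265_param_parse(p, "no-open-gop", NULL);   // the test pairs chunk options with a closed GOP
        x265_param_parse(p, "chunk-start", "100");  // first frame of the chunk
        x265_param_parse(p, "chunk-end", "200");    // last frame of the chunk
        return p;
    }
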
x265_2.7.tar.gz/source/test/smoke-tests.txt -> x265_2.9.tar.gz/source/test/smoke-tests.txt Changed
10
 
1
@@ -13,7 +13,7 @@
2
 old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
3
 old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
4
 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --max-tu-size 8
5
-RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1 --opt-cu-delta-qp
6
+RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
7
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
8
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16 --tu-inter-depth 2 --limit-tu 3
9
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
10
x265_2.7.tar.gz/source/test/testbench.cpp -> x265_2.9.tar.gz/source/test/testbench.cpp Changed
28
 
1
@@ -96,7 +96,8 @@
2
 
3
 int main(int argc, char *argv[])
4
 {
5
-    int cpuid = X265_NS::cpu_detect();
6
+    bool enableavx512 = true;
7
+    int cpuid = X265_NS::cpu_detect(enableavx512);
8
     const char *testname = 0;
9
 
10
     if (!(argc & 1))
11
@@ -117,7 +118,7 @@
12
         if (!strncmp(name, "cpuid", strlen(name)))
13
         {
14
             bool bError = false;
15
-            cpuid = parseCpuName(value, bError);
16
+            cpuid = parseCpuName(value, bError, enableavx512);
17
             if (bError)
18
             {
19
                 printf("Invalid CPU name: %s\n", value);
20
@@ -169,6 +170,7 @@
21
         { "XOP", X265_CPU_XOP },
22
         { "AVX2", X265_CPU_AVX2 },
23
         { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
24
+        { "AVX512", X265_CPU_AVX512 },
25
         { "ARMv6", X265_CPU_ARMV6 },
26
         { "NEON", X265_CPU_NEON },
27
         { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
28
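
The testbench change above makes AVX-512 detection explicit: cpu_detect() now takes a flag, and "AVX512" can be named on the cpuid command line. A minimal sketch of how a caller might use the resulting mask, using only identifiers visible in this diff (the declaration of cpu_detect() lives in the encoder's internal headers):

    #include <cstdio>

    void report_avx512()
    {
        bool enableavx512 = true;                         // opt in to AVX-512 detection
        int cpuid = X265_NS::cpu_detect(enableavx512);    // same call the testbench now makes
        if (cpuid & X265_CPU_AVX512)
            printf("AVX-512 primitives will be tested\n");
        else
            printf("AVX-512 unavailable or disabled\n");
    }
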
x265_2.7.tar.gz/source/test/testharness.h -> x265_2.9.tar.gz/source/test/testharness.h Changed
19
 
1
@@ -72,7 +72,7 @@
2
 #include <x86intrin.h>
3
 #elif ( !defined(__APPLE__) && defined (__GNUC__) && defined(__ARM_NEON__))
4
 #include <arm_neon.h>
5
-#elif defined(__GNUC__)
6
+#elif defined(__GNUC__) && (!defined(__clang__) || __clang_major__ < 4)
7
 /* fallback for older GCC/MinGW */
8
 static inline uint32_t __rdtsc(void)
9
 {
10
@@ -91,7 +91,7 @@
11
 }
12
 #endif // ifdef _MSC_VER
13
 
14
-#define BENCH_RUNS 1000
15
+#define BENCH_RUNS 2000
16
 
17
 // Adapted from checkasm.c, runs each optimized primitive four times, measures rdtsc
18
 // and discards invalid times.  Repeats 1000 times to get a good average.  Then measures
19
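
BENCH_RUNS above only scales how many samples the timing loop takes; the measurement idea is the one described in the adjoining comment. A simplified sketch of that idea follows (not the actual harness macro; it assumes the __rdtsc() shim defined earlier in this header, and omits the outlier rejection the real harness performs):

    #include <cstdint>

    // Time 'f(args...)' the way the harness does: four back-to-back calls per
    // sample, cycle counts taken with __rdtsc(), averaged over many runs.
    template <typename Func, typename... Args>
    static uint64_t bench_cycles(Func f, int runs, Args... args)
    {
        uint64_t total = 0;
        for (int i = 0; i < runs; i++)
        {
            uint64_t t0 = __rdtsc();
            f(args...);
            f(args...);
            f(args...);
            f(args...);
            total += __rdtsc() - t0;
        }
        return total / (uint64_t)runs;   // average cycles per batch of four calls
    }
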
x265_2.7.tar.gz/source/x265.cpp -> x265_2.9.tar.gz/source/x265.cpp Changed
121
 
1
@@ -75,6 +75,7 @@
2
     const char* reconPlayCmd;
3
     const x265_api* api;
4
     x265_param* param;
5
+    x265_vmaf_data* vmafData;
6
     bool bProgress;
7
     bool bForceY4m;
8
     bool bDither;
9
@@ -96,6 +97,7 @@
10
         reconPlayCmd = NULL;
11
         api = NULL;
12
         param = NULL;
13
+        vmafData = NULL;
14
         framesToBeEncoded = seek = 0;
15
         totalbytes = 0;
16
         bProgress = true;
17
@@ -142,7 +144,7 @@
18
     {
19
         int eta = (int)(elapsed * (framesToBeEncoded - frameNum) / ((int64_t)frameNum * 1000000));
20
         sprintf(buf, "x265 [%.1f%%] %d/%d frames, %.2f fps, %.2f kb/s, eta %d:%02d:%02d",
21
-                100. * frameNum / framesToBeEncoded, frameNum, framesToBeEncoded, fps, bitrate,
22
+            100. * frameNum / (param->chunkEnd ? param->chunkEnd : param->totalFrames), frameNum, (param->chunkEnd ? param->chunkEnd : param->totalFrames), fps, bitrate,
23
                 eta / 3600, (eta / 60) % 60, eta % 60);
24
     }
25
     else
26
@@ -216,6 +218,14 @@
27
         x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n");
28
         return true;
29
     }
30
+#if ENABLE_LIBVMAF
31
+    vmafData = (x265_vmaf_data*)x265_malloc(sizeof(x265_vmaf_data));
32
+    if(!vmafData)
33
+    {
34
+        x265_log(NULL, X265_LOG_ERROR, "vmaf data alloc failed\n");
35
+        return true;
36
+    }
37
+#endif
38
 
39
     if (api->param_default_preset(param, preset, tune) < 0)
40
     {
41
@@ -363,6 +373,7 @@
42
     info.frameCount = 0;
43
     getParamAspectRatio(param, info.sarWidth, info.sarHeight);
44
 
45
+
46
     this->input = InputFile::open(info, this->bForceY4m);
47
     if (!this->input || this->input->isFail())
48
     {
49
@@ -392,7 +403,7 @@
50
     if (this->framesToBeEncoded == 0 && info.frameCount > (int)seek)
51
         this->framesToBeEncoded = info.frameCount - seek;
52
     param->totalFrames = this->framesToBeEncoded;
53
-
54
+    
55
     /* Force CFR until we have support for VFR */
56
     info.timebaseNum = param->fpsDenom;
57
     info.timebaseDenom = param->fpsNum;
58
@@ -439,7 +450,30 @@
59
                     param->sourceWidth, param->sourceHeight, param->fpsNum, param->fpsDenom,
60
                     x265_source_csp_names[param->internalCsp]);
61
     }
62
+#if ENABLE_LIBVMAF
63
+    if (!reconfn)
64
+    {
65
+        x265_log(param, X265_LOG_ERROR, "recon file must be specified to get VMAF score, try --help for help\n");
66
+        return true;
67
+    }
68
+    const char *str = strrchr(info.filename, '.');
69
 
70
+    if (!strcmp(str, ".y4m"))
71
+    {
72
+        x265_log(param, X265_LOG_ERROR, "VMAF supports YUV file format only.\n");
73
+        return true; 
74
+    }
75
+    if(param->internalCsp == X265_CSP_I420 || param->internalCsp == X265_CSP_I422 || param->internalCsp == X265_CSP_I444)
76
+    {
77
+        vmafData->reference_file = x265_fopen(inputfn, "rb");
78
+        vmafData->distorted_file = x265_fopen(reconfn, "rb");
79
+    }
80
+    else
81
+    {
82
+        x265_log(param, X265_LOG_ERROR, "VMAF will support only yuv420p, yu422p, yu444p, yuv420p10le, yuv422p10le, yuv444p10le formats.\n");
83
+        return true;
84
+    }
85
+#endif
86
     this->output = OutputFile::open(outputfn, info);
87
     if (this->output->isFail())
88
     {
89
@@ -555,7 +589,9 @@
90
 
91
     x265_param* param = cliopt.param;
92
     const x265_api* api = cliopt.api;
93
-
94
+#if ENABLE_LIBVMAF
95
+    x265_vmaf_data* vmafdata = cliopt.vmafData;
96
+#endif
97
     /* This allows muxers to modify bitstream format */
98
     cliopt.output->setParam(param);
99
 
100
@@ -712,7 +748,7 @@
101
         if (!numEncoded)
102
             break;
103
     }
104
-
105
+  
106
     /* clear progress report */
107
     if (cliopt.bProgress)
108
         fprintf(stderr, "%*s\r", 80, " ");
109
@@ -723,7 +759,11 @@
110
 
111
     api->encoder_get_stats(encoder, &stats, sizeof(stats));
112
     if (param->csvfn && !b_ctrl_c)
113
+#if ENABLE_LIBVMAF
114
+        api->vmaf_encoder_log(encoder, argc, argv, param, vmafdata);
115
+#else
116
         api->encoder_log(encoder, argc, argv);
117
+#endif
118
     api->encoder_close(encoder);
119
 
120
     int64_t second_largest_pts = 0;
121
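
Taken together, the x265.cpp changes above wire VMAF scoring into the CLI: an x265_vmaf_data is allocated next to the param, pointed at the raw source and the reconstruction, and handed to vmaf_encoder_log() instead of encoder_log() at the end of the run. A condensed sketch of that flow (file names are invented for the example; error handling omitted):

    #if ENABLE_LIBVMAF
        x265_vmaf_data* vmafData = (x265_vmaf_data*)x265_malloc(sizeof(x265_vmaf_data));
        vmafData->reference_file = x265_fopen("source_1920x1080.yuv", "rb");  // original input (raw YUV only)
        vmafData->distorted_file = x265_fopen("recon_1920x1080.yuv", "rb");   // the --recon output
    #endif

        /* ... encode loop ... */

    #if ENABLE_LIBVMAF
        api->vmaf_encoder_log(encoder, argc, argv, param, vmafData);          // CSV log including VMAF scores
    #else
        api->encoder_log(encoder, argc, argv);
    #endif
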
x265_2.7.tar.gz/source/x265.h -> x265_2.9.tar.gz/source/x265.h Changed
201
 
1
@@ -31,6 +31,10 @@
2
 extern "C" {
3
 #endif
4
 
5
+#if _MSC_VER
6
+#pragma warning(disable: 4201) // non-standard extension used (nameless struct/union)
7
+#endif
8
+
9
 /* x265_encoder:
10
  *      opaque handler for encoder */
11
 typedef struct x265_encoder x265_encoder;
12
@@ -105,25 +109,107 @@
13
     int       lastMiniGopBFrame;
14
     int       plannedType[X265_LOOKAHEAD_MAX + 1];
15
     int64_t   dts;
16
+    int64_t   reorderedPts;
17
 } x265_lookahead_data;
18
 
19
+typedef struct x265_analysis_validate
20
+{
21
+    int     maxNumReferences;
22
+    int     analysisReuseLevel;
23
+    int     sourceWidth;
24
+    int     sourceHeight;
25
+    int     keyframeMax;
26
+    int     keyframeMin;
27
+    int     openGOP;
28
+    int     bframes;
29
+    int     bPyramid;
30
+    int     maxCUSize;
31
+    int     minCUSize;
32
+    int     intraRefresh;
33
+    int     lookaheadDepth;
34
+    int     chunkStart;
35
+    int     chunkEnd;
36
+}x265_analysis_validate;
37
+
38
+/* Stores intra analysis data for a single frame. This struct needs better packing */
39
+typedef struct x265_analysis_intra_data
40
+{
41
+    uint8_t*  depth;
42
+    uint8_t*  modes;
43
+    char*     partSizes;
44
+    uint8_t*  chromaModes;
45
+}x265_analysis_intra_data;
46
+
47
+typedef struct x265_analysis_MV
48
+{
49
+    union{
50
+        struct { int16_t x, y; };
51
+
52
+        int32_t word;
53
+    };
54
+}x265_analysis_MV;
55
+
56
+/* Stores inter analysis data for a single frame */
57
+typedef struct x265_analysis_inter_data
58
+{
59
+    int32_t*    ref;
60
+    uint8_t*    depth;
61
+    uint8_t*    modes;
62
+    uint8_t*    partSize;
63
+    uint8_t*    mergeFlag;
64
+    uint8_t*    interDir;
65
+    uint8_t*    mvpIdx[2];
66
+    int8_t*     refIdx[2];
67
+    x265_analysis_MV*         mv[2];
68
+    int64_t*     sadCost;
69
+}x265_analysis_inter_data;
70
+
71
+typedef struct x265_weight_param
72
+{
73
+    uint32_t log2WeightDenom;
74
+    int      inputWeight;
75
+    int      inputOffset;
76
+    int      wtPresent;
77
+}x265_weight_param;
78
+
79
+#if X265_DEPTH < 10
80
+typedef uint32_t sse_t;
81
+#else
82
+typedef uint64_t sse_t;
83
+#endif
84
+
85
+typedef struct x265_analysis_distortion_data
86
+{
87
+    sse_t*        distortion;
88
+    sse_t*        ctuDistortion;
89
+    double*       scaledDistortion;
90
+    double        averageDistortion;
91
+    double        sdDistortion;
92
+    uint32_t      highDistortionCtuCount;
93
+    uint32_t      lowDistortionCtuCount;
94
+    double*       offset;
95
+    double*       threshold;
96
+}x265_analysis_distortion_data;
97
+
98
 /* Stores all analysis data for a single frame */
99
 typedef struct x265_analysis_data
100
 {
101
-    int64_t          satdCost;
102
-    uint32_t         frameRecordSize;
103
-    uint32_t         poc;
104
-    uint32_t         sliceType;
105
-    uint32_t         numCUsInFrame;
106
-    uint32_t         numPartitions;
107
-    uint32_t         depthBytes;
108
-    int              bScenecut;
109
-    void*            wt;
110
-    void*            interData;
111
-    void*            intraData;
112
-    uint32_t         numCuInHeight;
113
-    x265_lookahead_data lookahead;
114
-    uint8_t*         modeFlag[2];
115
+    int64_t                           satdCost;
116
+    uint32_t                          frameRecordSize;
117
+    uint32_t                          poc;
118
+    uint32_t                          sliceType;
119
+    uint32_t                          numCUsInFrame;
120
+    uint32_t                          numPartitions;
121
+    uint32_t                          depthBytes;
122
+    int                               bScenecut;
123
+    x265_weight_param*                wt;
124
+    x265_analysis_inter_data*         interData;
125
+    x265_analysis_intra_data*         intraData;
126
+    uint32_t                          numCuInHeight;
127
+    x265_lookahead_data               lookahead;
128
+    uint8_t*                          modeFlag[2];
129
+    x265_analysis_validate            saveParam;
130
+    x265_analysis_distortion_data*    distortionData;
131
 } x265_analysis_data;
132
 
133
 /* cu statistics */
134
@@ -152,14 +238,6 @@
135
     /* All the above values will add up to 100%. */
136
 } x265_pu_stats;
137
 
138
-
139
-typedef struct x265_analysis_2Pass
140
-{
141
-    uint32_t      poc;
142
-    uint32_t      frameRecordSize;
143
-    void*         analysisFramedata;
144
-}x265_analysis_2Pass;
145
-
146
 /* Frame level statistics */
147
 typedef struct x265_frame_stats
148
 {
149
@@ -208,6 +286,8 @@
150
     x265_cu_stats    cuStats;
151
     x265_pu_stats    puStats;
152
     double           totalFrameTime;
153
+    double           vmafFrameScore;
154
+    double           bufferFillFinal;
155
 } x265_frame_stats;
156
 
157
 typedef struct x265_ctu_info_t
158
@@ -264,6 +344,7 @@
159
     REGION_REFRESH_INFO                  = 134,
160
     MASTERING_DISPLAY_INFO               = 137,
161
     CONTENT_LIGHT_LEVEL_INFO             = 144,
162
+    ALTERNATIVE_TRANSFER_CHARACTERISTICS = 147,
163
 } SEIPayloadType;
164
 
165
 typedef struct x265_sei_payload
166
@@ -362,7 +443,8 @@
167
 
168
     int    height;
169
 
170
-    x265_analysis_2Pass analysis2Pass;
171
+    // pts is reordered in the order of encoding.
172
+    int64_t reorderedPts;
173
 } x265_picture;
174
 
175
 typedef enum
176
@@ -378,39 +460,38 @@
177
 /* CPU flags */
178
 
179
 /* x86 */
180
-#define X265_CPU_CMOV            0x0000001
181
-#define X265_CPU_MMX             0x0000002
182
-#define X265_CPU_MMX2            0x0000004  /* MMX2 aka MMXEXT aka ISSE */
183
+#define X265_CPU_MMX             (1 << 0)
184
+#define X265_CPU_MMX2            (1 << 1)  /* MMX2 aka MMXEXT aka ISSE */
185
 #define X265_CPU_MMXEXT          X265_CPU_MMX2
186
-#define X265_CPU_SSE             0x0000008
187
-#define X265_CPU_SSE2            0x0000010
188
-#define X265_CPU_SSE3            0x0000020
189
-#define X265_CPU_SSSE3           0x0000040
190
-#define X265_CPU_SSE4            0x0000080  /* SSE4.1 */
191
-#define X265_CPU_SSE42           0x0000100  /* SSE4.2 */
192
-#define X265_CPU_LZCNT           0x0000200  /* Phenom support for "leading zero count" instruction. */
193
-#define X265_CPU_AVX             0x0000400  /* AVX support: requires OS support even if YMM registers aren't used. */
194
-#define X265_CPU_XOP             0x0000800  /* AMD XOP */
195
-#define X265_CPU_FMA4            0x0001000  /* AMD FMA4 */
196
-#define X265_CPU_AVX2            0x0002000  /* AVX2 */
197
-#define X265_CPU_FMA3            0x0004000  /* Intel FMA3 */
198
-#define X265_CPU_BMI1            0x0008000  /* BMI1 */
199
-#define X265_CPU_BMI2            0x0010000  /* BMI2 */
200
+#define X265_CPU_SSE             (1 << 2)
201
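
The x265.h rework above replaces the old void* members of x265_analysis_data with typed pointers (x265_analysis_inter_data, x265_analysis_intra_data, x265_weight_param, x265_analysis_distortion_data) and adds the x265_analysis_MV union. A small sketch of what that buys API users, assuming the analysis buffers have already been filled by a save/load encode; the loop bound is illustrative, and real consumers size these arrays from numCUsInFrame, numPartitions and depthBytes:

    #include <cstdio>
    #include "x265.h"

    static void print_l0_mvs(const x265_analysis_data* a)
    {
        const x265_analysis_inter_data* inter = a->interData;   // typed in 2.9, void* in 2.7
        if (!inter || !inter->mv[0])
            return;
        for (uint32_t i = 0; i < a->numPartitions; i++)
        {
            x265_analysis_MV m = inter->mv[0][i];
            // x/y overlay the packed 32-bit 'word' member of the union.
            printf("part %u: L0 mv = (%d, %d)\n", i, m.x, m.y);
        }
    }
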
x265_2.7.tar.gz/source/x265cli.h -> x265_2.9.tar.gz/source/x265cli.h Changed
104
 
1
@@ -152,6 +152,8 @@
2
     { "vbv-init",       required_argument, NULL, 0 },
3
     { "vbv-end",        required_argument, NULL, 0 },
4
     { "vbv-end-fr-adj", required_argument, NULL, 0 },
5
+    { "chunk-start",    required_argument, NULL, 0 },
6
+    { "chunk-end",      required_argument, NULL, 0 },
7
     { "bitrate",        required_argument, NULL, 0 },
8
     { "qp",             required_argument, NULL, 'q' },
9
     { "aq-mode",        required_argument, NULL, 0 },
10
@@ -263,6 +265,8 @@
11
     { "scale-factor",   required_argument, NULL, 0 },
12
     { "refine-intra",   required_argument, NULL, 0 },
13
     { "refine-inter",   required_argument, NULL, 0 },
14
+    { "dynamic-refine",       no_argument, NULL, 0 },
15
+    { "no-dynamic-refine",    no_argument, NULL, 0 },
16
     { "strict-cbr",           no_argument, NULL, 0 },
17
     { "temporal-layers",      no_argument, NULL, 0 },
18
     { "no-temporal-layers",   no_argument, NULL, 0 },
19
@@ -293,6 +297,14 @@
20
     { "refine-mv-type", required_argument, NULL, 0 },
21
     { "copy-pic",             no_argument, NULL, 0 },
22
     { "no-copy-pic",          no_argument, NULL, 0 },
23
+    { "max-ausize-factor", required_argument, NULL, 0 },
24
+    { "idr-recovery-sei",     no_argument, NULL, 0 },
25
+    { "no-idr-recovery-sei",  no_argument, NULL, 0 },
26
+    { "single-sei", no_argument, NULL, 0 },
27
+    { "no-single-sei", no_argument, NULL, 0 },
28
+    { "atc-sei", required_argument, NULL, 0 },
29
+    { "pic-struct", required_argument, NULL, 0 },
30
+    { "nalu-file", required_argument, NULL, 0 },
31
     { 0, 0, 0, 0 },
32
     { 0, 0, 0, 0 },
33
     { 0, 0, 0, 0 },
34
@@ -343,6 +355,7 @@
35
     H0("   --dhdr10-info <filename>      JSON file containing the Creative Intent Metadata to be encoded as Dynamic Tone Mapping\n");
36
     H0("   --[no-]dhdr10-opt             Insert tone mapping SEI only for IDR frames and when the tone mapping information changes. Default disabled\n");
37
 #endif
38
+    H0("   --nalu-file <filename>        Text file containing SEI messages in the following format : <POC><space><PREFIX><space><NAL UNIT TYPE>/<SEI TYPE><space><SEI Payload>\n");
39
     H0("-f/--frames <integer>            Maximum number of frames to encode. Default all\n");
40
     H0("   --seek <integer>              First frame to encode\n");
41
     H1("   --[no-]interlace <bff|tff>    Indicate input pictures are interlace fields in temporal order. Default progressive\n");
42
@@ -389,7 +402,7 @@
43
     H0("   --[no-]early-skip             Enable early SKIP detection. Default %s\n", OPT(param->bEnableEarlySkip));
44
     H0("   --[no-]rskip                  Enable early exit from recursion. Default %s\n", OPT(param->bEnableRecursionSkip));
45
     H1("   --[no-]tskip-fast             Enable fast intra transform skipping. Default %s\n", OPT(param->bEnableTSkipFast));
46
-    H1("   --[no-]splitrd-skip           Enable skipping split RD analysis when sum of split CU rdCost larger than none split CU rdCost for Intra CU. Default %s\n", OPT(param->bEnableSplitRdSkip));
47
+    H1("   --[no-]splitrd-skip           Enable skipping split RD analysis when sum of split CU rdCost larger than one split CU rdCost for Intra CU. Default %s\n", OPT(param->bEnableSplitRdSkip));
48
     H1("   --nr-intra <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in intra CUs. Default 0\n");
49
     H1("   --nr-inter <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in inter CUs. Default 0\n");
50
     H0("   --ctu-info <integer>          Enable receiving ctu information asynchronously and determine reaction to the CTU information (0, 1, 2, 4, 6) Default 0\n"
51
@@ -459,6 +472,8 @@
52
     H0("   --vbv-init <float>            Initial VBV buffer occupancy (fraction of bufsize or in kbits). Default %.2f\n", param->rc.vbvBufferInit);
53
     H0("   --vbv-end <float>             Final VBV buffer emptiness (fraction of bufsize or in kbits). Default 0 (disabled)\n");
54
     H0("   --vbv-end-fr-adj <float>      Frame from which qp has to be adjusted to achieve final decode buffer emptiness. Default 0\n");
55
+    H0("   --chunk-start <integer>       First frame of the chunk. Default 0 (disabled)\n");
56
+    H0("   --chunk-end <integer>         Last frame of the chunk. Default 0 (disabled)\n");
57
     H0("   --pass                        Multi pass rate control.\n"
58
        "                                   - 1 : First pass, creates stats file\n"
59
        "                                   - 2 : Last pass, does not overwrite stats file\n"
60
@@ -475,11 +490,12 @@
61
     H0("   --analysis-reuse-level <1..10>      Level of analysis reuse indicates amount of info stored/reused in save/load mode, 1:least..10:most. Default %d\n", param->analysisReuseLevel);
62
     H0("   --refine-mv-type <string>     Reuse MV information received through API call. Supported option is avc. Default disabled - %d\n", param->bMVType);
63
     H0("   --scale-factor <int>          Specify factor by which input video is scaled down for analysis save mode. Default %d\n", param->scaleFactor);
64
-    H0("   --refine-intra <0..3>         Enable intra refinement for encode that uses analysis-load.\n"
65
+    H0("   --refine-intra <0..4>         Enable intra refinement for encode that uses analysis-load.\n"
66
         "                                    - 0 : Forces both mode and depth from the save encode.\n"
67
         "                                    - 1 : Functionality of (0) + evaluate all intra modes at min-cu-size's depth when current depth is one smaller than min-cu-size's depth.\n"
68
         "                                    - 2 : Functionality of (1) + irrespective of size evaluate all angular modes when the save encode decides the best mode as angular.\n"
69
         "                                    - 3 : Functionality of (1) + irrespective of size evaluate all intra modes.\n"
70
+        "                                    - 4 : Re-evaluate all intra blocks, does not reuse data from save encode.\n"
71
         "                                Default:%d\n", param->intraRefine);
72
     H0("   --refine-inter <0..3>         Enable inter refinement for encode that uses analysis-load.\n"
73
         "                                    - 0 : Forces both mode and depth from the save encode.\n"
74
@@ -488,6 +504,7 @@
75
         "                                    - 2 : Functionality of (1) + irrespective of size restrict the modes evaluated when specific modes are decided as the best mode by the save encode.\n"
76
         "                                    - 3 : Functionality of (1) + irrespective of size evaluate all inter modes.\n"
77
         "                                Default:%d\n", param->interRefine);
78
+    H0("   --[no-]dynamic-refine         Dynamically changes refine-inter level for each CU. Default %s\n", OPT(param->bDynamicRefine));
79
     H0("   --[no-]refine-mv              Enable mv refinement for load mode. Default %s\n", OPT(param->mvRefine));
80
     H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance 3:auto variance with bias to dark scenes. Default %d\n", param->rc.aqMode);
81
     H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
82
@@ -515,6 +532,8 @@
83
     H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");
84
     H1("                                 Blank lines and lines starting with hash(#) are ignored\n");
85
     H1("                                 Comma is considered to be white-space\n");
86
+    H0("   --max-ausize-factor <float>   This value controls the maximum AU size defined in specification.\n");
87
+    H0("                                 It represents the percentage of maximum AU size used. Default %.1f\n", param->maxAUSizeFactor);
88
     H0("\nLoop filters (deblock and SAO):\n");
89
     H0("   --[no-]deblock                Enable Deblocking Loop Filter, optionally specify tC:Beta offsets Default %s\n", OPT(param->bEnableLoopFilter));
90
     H0("   --[no-]sao                    Enable Sample Adaptive Offset. Default %s\n", OPT(param->bEnableSAO));
91
@@ -548,9 +567,12 @@
92
     H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
93
     H0("   --[no-]info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
94
     H0("   --[no-]hrd                    Enable HRD parameters signaling. Default %s\n", OPT(param->bEmitHRDSEI));
95
+    H0("   --[no-]idr-recovery-sei      Emit recovery point infor SEI at each IDR frame \n");
96
     H0("   --[no-]temporal-layers        Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers));
97
     H0("   --[no-]aud                    Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
98
     H1("   --hash <integer>              Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
99
+    H0("   --atc-sei <integer>           Emit the alternative transfer characteristics SEI message where the integer is the preferred transfer characteristics. Default disabled\n");
100
+    H0("   --pic-struct <integer>        Set the picture structure and emits it in the picture timing SEI message. Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.\n");
101
     H0("   --log2-max-poc-lsb <integer>  Maximum of the picture order count\n");
102
     H0("   --[no-]vui-timing-info        Emit VUI timing information in the bistream. Default %s\n", OPT(param->bEmitVUITimingInfo));
103
     H0("   --[no-]vui-hrd-info           Emit VUI HRD information in the bistream. Default %s\n", OPT(param->bEmitVUIHRDInfo));
104
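
The help text above documents the new load-mode refinement controls (refine-intra level 4 and --[no-]dynamic-refine) as well as --max-ausize-factor. For completeness, a sketch of setting them through the API instead of the CLI; it assumes these long-option names are routed through x265_param_parse(), and the values are only examples:

    #include "x265.h"

    static void enable_dynamic_refine(x265_param* p)
    {
        // Takes effect in an encode configured to load analysis data.
        x265_param_parse(p, "analysis-reuse-level", "10");
        x265_param_parse(p, "refine-intra", "4");          // level 4: re-evaluate all intra blocks
        x265_param_parse(p, "dynamic-refine", NULL);       // boolean option, NULL enables it
        x265_param_parse(p, "max-ausize-factor", "0.5");   // portion of the maximum AU size allowed
    }
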