Overview

Request 6063 (accepted)

No description set
Submit package Staging / x265 to package Essentials / x265

x265.changes Changed

@@ -1,4 +1,53 @@
 -------------------------------------------------------------------
+Thu Jun 13 05:58:19 UTC 2024 - Luigi Baldoni <aloisio@gmx.com>
+
+- Update to version 3.6
+  New features:
+  * Segment based Ratecontrol (SBRC) feature
+  * Motion-Compensated Spatio-Temporal Filtering
+  * Scene-cut aware qp - BBAQ (Bidirectional Boundary Aware
+    Quantization)
+  * Histogram-Based Scene Change Detection
+  * Film-Grain characteristics as a SEI message to support Film
+    Grain Synthesis(FGS)
+  * Add temporal layer implementation(Hierarchical B-frame
+    implementation)
+  Enhancements to existing features:
+  * Added Dolby Vision 8.4 Profile Support
+  API changes:
+  * Add Segment based Ratecontrol(SBRC) feature: "--no-sbrc".
+  * Add command line parameter for mcstf feature: "--no-mctf".
+  * Add command line parameters for the scene cut aware qp
+    feature: "--scenecut-aware-qp" and "--masking-strength".
+  * Add command line parameters for Histogram-Based Scene Change
+    Detection: "--hist-scenecut".
+  * Add film grain characteristics as a SEI message to the
+    bitstream: "--film-grain <filename>"
+  * cli: add new option --cra-nal (Force nal type to CRA to all
+    frames expect for the first frame, works only with keyint 1)
+  Optimizations:
+  * ARM64 NEON optimizations:- Several time-consuming C
+    functions have been optimized for the targeted platform -
+    aarch64. The overall performance increased by around 20%.
+  * SVE/SVE2 optimizations
+  Bug fixes:
+  * Linux bug to utilize all the cores
+  * Crash with hist-scenecut build when source resolution is not
+    multiple of minCuSize
+  * 32bit and 64bit builds generation for ARM
+  * bugs in zonefile feature (Reflect Zonefile Parameters inside
+    Lookahead, extra IDR issue, Avg I Slice QP value issue etc..)
+  * Add x86 ASM implementation for subsampling luma
+  * Fix for abrladder segfault with load reuse level 1
+  * Reorder miniGOP based on temporal layer hierarchy and add
+    support for more B frame
+  * Add MacOS aarch64 build support
+  * Fix boundary condition issue for Gaussian filter
+- Drop arm.patch and replace it with 0001-Fix-arm-flags.patch
+  and 0004-Do-not-build-with-assembly-support-on-arm.patch
+  (courtesy of Debian)
+
+-------------------------------------------------------------------
 Wed May 19 13:21:09 UTC 2021 - Luigi Baldoni <aloisio@gmx.com>
 
 - Build libx265_main10 and libx265_main12 unconditionally and
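
As an illustration only (not part of this request): the new 3.6 options named in the changelog above could be exercised roughly as follows. Option spellings are taken from the changelog and the cli.rst changes further down; the file names are placeholders, and the packaged x265 --fullhelp output remains the authoritative reference.

    # illustrative invocation; input.yuv and grain.bin are placeholder files
    x265 --input input.yuv --input-res 1920x1080 --fps 25 \
         --hist-scenecut --sbrc --mcstf \
         --film-grain grain.bin \
         --output out.hevc
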
x265.spec Changed

@@ -1,7 +1,7 @@
 #
 # spec file for package x265
 #
-# Copyright (c) 2021 Packman Team <packman@links2linux.de>
+# Copyright (c) 2024 Packman Team <packman@links2linux.de>
 # Copyright (c) 2014 Torsten Gruner <t.gruner@katodev.de>
 #
 # All modifications and additions to the file contributed by third parties
@@ -17,21 +17,22 @@
 #


-%define sover   199
+%define sover   209
 %define libname lib%{name}
 %define libsoname %{libname}-%{sover}
-%define uver    3_5
+%define uver    3_6
 Name:           x265
-Version:        3.5
+Version:        3.6
 Release:        0
 Summary:        A free h265/HEVC encoder - encoder binary
 License:        GPL-2.0-or-later
 Group:          Productivity/Multimedia/Video/Editors and Convertors
 URL:            https://bitbucket.org/multicoreware/x265_git
 Source0:        https://bitbucket.org/multicoreware/x265_git/downloads/%{name}_%{version}.tar.gz
-Patch0:         arm.patch
 Patch1:         x265.pkgconfig.patch
 Patch2:         x265-fix_enable512.patch
+Patch3:         0001-Fix-arm-flags.patch
+Patch4:         0004-Do-not-build-with-assembly-support-on-arm.patch
 BuildRequires:  cmake >= 2.8.8
 BuildRequires:  gcc-c++
 BuildRequires:  nasm >= 2.13
@@ -130,6 +131,8 @@
 %cmake_install
 find %{buildroot} -type f -name "*.a" -delete -print0

+%check
+
 %post -n %{libsoname} -p /sbin/ldconfig
 %postun -n %{libsoname} -p /sbin/ldconfig

0001-Fix-arm-flags.patch Added
41
 
1
@@ -0,0 +1,39 @@
2
+From: Sebastian Ramacher <sramacher@debian.org>
3
+Date: Sun, 21 Jun 2020 17:54:56 +0200
4
+Subject: Fix arm* flags
5
+
6
+---
7
+ source/CMakeLists.txt | 7 ++-----
8
+ 1 file changed, 2 insertions(+), 5 deletions(-)
9
+
10
+diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
11
+index ab5ddfe..eb9b19b 100755
12
+--- a/source/CMakeLists.txt
13
++++ b/source/CMakeLists.txt
14
+@@ -253,10 +253,7 @@ if(GCC)
15
+     elseif(ARM)
16
+         find_package(Neon)
17
+         if(CPU_HAS_NEON)
18
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
19
+             add_definitions(-DHAVE_NEON)
20
+-        else()
21
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
22
+         endif()
23
+     endif()
24
+   if(ARM64 OR CROSS_COMPILE_ARM64)
25
+@@ -265,13 +262,13 @@ if(GCC)
26
+         find_package(SVE2)
27
+         if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2)
28
+             message(STATUS "Found SVE2")
29
+-          set(ARM_ARGS -O3 -march=armv8-a+sve2 -fPIC -flax-vector-conversions)
30
++          set(ARM_ARGS -fPIC -flax-vector-conversions)
31
+             add_definitions(-DHAVE_SVE2)
32
+             add_definitions(-DHAVE_SVE)
33
+             add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE2
34
+         elseif(CPU_HAS_SVE OR CROSS_COMPILE_SVE)
35
+             message(STATUS "Found SVE")
36
+-          set(ARM_ARGS -O3 -march=armv8-a+sve -fPIC -flax-vector-conversions)
37
++          set(ARM_ARGS -fPIC -flax-vector-conversions)
38
+             add_definitions(-DHAVE_SVE)
39
+             add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE
40
+         elseif(CPU_HAS_NEON)
41
0004-Do-not-build-with-assembly-support-on-arm.patch Added
30
 
1
@@ -0,0 +1,28 @@
2
+From: Sebastian Ramacher <sramacher@debian.org>
3
+Date: Fri, 31 May 2024 23:38:23 +0200
4
+Subject: Do not build with assembly support on arm*
5
+
6
+---
7
+ source/CMakeLists.txt | 9 ---------
8
+ 1 file changed, 9 deletions(-)
9
+
10
+diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
11
+index 672cc2d..f112330 100755
12
+--- a/source/CMakeLists.txt
13
++++ b/source/CMakeLists.txt
14
+@@ -73,15 +73,6 @@ elseif(POWERMATCH GREATER "-1")
15
+         add_definitions(-DPPC64=1)
16
+         message(STATUS "Detected POWER PPC64 target processor")
17
+     endif()
18
+-elseif(ARMMATCH GREATER "-1")
19
+-    if(CROSS_COMPILE_ARM)
20
+-        message(STATUS "Cross compiling for ARM arch")
21
+-    else()
22
+-        set(CROSS_COMPILE_ARM 0)
23
+-    endif()
24
+-  message(STATUS "Detected ARM target processor")
25
+-    set(ARM 1)
26
+-    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
27
+ elseif(ARM64MATCH GREATER "-1")
28
+     #if(CROSS_COMPILE_ARM64)
29
+         #message(STATUS "Cross compiling for ARM64 arch")
30
arm.patch Deleted
110
 
1
@@ -1,108 +0,0 @@
2
-Index: x265_3.4/source/CMakeLists.txt
3
-===================================================================
4
---- x265_3.4.orig/source/CMakeLists.txt
5
-+++ x265_3.4/source/CMakeLists.txt
6
-@@ -64,26 +64,26 @@ elseif(POWERMATCH GREATER "-1")
7
-         add_definitions(-DPPC64=1)
8
-         message(STATUS "Detected POWER PPC64 target processor")
9
-     endif()
10
--elseif(ARMMATCH GREATER "-1")
11
--    if(CROSS_COMPILE_ARM)
12
--        message(STATUS "Cross compiling for ARM arch")
13
--    else()
14
--        set(CROSS_COMPILE_ARM 0)
15
--    endif()
16
--    set(ARM 1)
17
--    if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
18
--        message(STATUS "Detected ARM64 target processor")
19
--        set(ARM64 1)
20
--        add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0)
21
--    else()
22
--        message(STATUS "Detected ARM target processor")
23
--        add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1)
24
--    endif()
25
-+elseif(${SYSPROC} MATCHES "armv5.*")
26
-+    message(STATUS "Detected ARMV5 system processor")
27
-+    set(ARMV5 1)
28
-+    add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=0 -DHAVE_NEON=0)
29
-+elseif(${SYSPROC} STREQUAL "armv6l")
30
-+    message(STATUS "Detected ARMV6 system processor")
31
-+    set(ARMV6 1)
32
-+    add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1 -DHAVE_NEON=0)
33
-+elseif(${SYSPROC} STREQUAL "armv7l")
34
-+    message(STATUS "Detected ARMV7 system processor")
35
-+    set(ARMV7 1)
36
-+    add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1 -DHAVE_NEON=0)
37
-+elseif(${SYSPROC} STREQUAL "aarch64")
38
-+    message(STATUS "Detected AArch64 system processor")
39
-+    set(ARMV7 1)
40
-+    add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0 -DHAVE_NEON=0)
41
- else()
42
-     message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown")
43
-     message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}")
44
- endif()
45
--
46
- if(UNIX)
47
-     list(APPEND PLATFORM_LIBS pthread)
48
-     find_library(LIBRT rt)
49
-@@ -238,28 +238,9 @@ if(GCC)
50
-             endif()
51
-         endif()
52
-     endif()
53
--    if(ARM AND CROSS_COMPILE_ARM)
54
--        if(ARM64)
55
--            set(ARM_ARGS -fPIC)
56
--        else()
57
--            set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC)
58
--        endif()
59
--        message(STATUS "cross compile arm")
60
--    elseif(ARM)
61
--        if(ARM64)
62
--            set(ARM_ARGS -fPIC)
63
--            add_definitions(-DHAVE_NEON)
64
--        else()
65
--            find_package(Neon)
66
--            if(CPU_HAS_NEON)
67
--                set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
68
--                add_definitions(-DHAVE_NEON)
69
--            else()
70
--                set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
71
--            endif()
72
--        endif()
73
-+    if(ARMV7)
74
-+        add_definitions(-fPIC)
75
-     endif()
76
--    add_definitions(${ARM_ARGS})
77
-     if(FPROFILE_GENERATE)
78
-         if(INTEL_CXX)
79
-             add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}")
80
-Index: x265_3.4/source/common/cpu.cpp
81
-===================================================================
82
---- x265_3.4.orig/source/common/cpu.cpp
83
-+++ x265_3.4/source/common/cpu.cpp
84
-@@ -39,7 +39,7 @@
85
- #include <machine/cpu.h>
86
- #endif
87
- 
88
--#if X265_ARCH_ARM && !defined(HAVE_NEON)
89
-+#if X265_ARCH_ARM && (!defined(HAVE_NEON) || HAVE_NEON==0)
90
- #include <signal.h>
91
- #include <setjmp.h>
92
- static sigjmp_buf jmpbuf;
93
-@@ -350,7 +350,6 @@ uint32_t cpu_detect(bool benableavx512)
94
-     }
95
- 
96
-     canjump = 1;
97
--    PFX(cpu_neon_test)();
98
-     canjump = 0;
99
-     signal(SIGILL, oldsig);
100
- #endif // if !HAVE_NEON
101
-@@ -366,7 +365,7 @@ uint32_t cpu_detect(bool benableavx512)
102
-     // which may result in incorrect detection and the counters stuck enabled.
103
-     // right now Apple does not seem to support performance counters for this test
104
- #ifndef __MACH__
105
--    flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
106
-+    //flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
107
- #endif
108
-     // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc)
109
- #elif X265_ARCH_ARM64
110
baselibs.conf Changed

@@ -1,1 +1,1 @@
-libx265-199
+libx265-209
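
The soname bump (199 to 209) in the spec file and baselibs.conf above means dependent packages will relink against the new library. A quick, illustrative way to confirm the rebuilt shared object carries the expected SONAME (the install path is an assumption for a typical 64-bit layout):

    # illustrative check; adjust the path to the actual build or install location
    readelf -d /usr/lib64/libx265.so.209 | grep SONAME
    # expected output: Library soname: [libx265.so.209]
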
x265_3.5.tar.gz/source/common/aarch64/ipfilter8.S Deleted
416
 
1
@@ -1,414 +0,0 @@
2
-/*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
- *
5
- * Authors: Yimeng Su <yimeng.su@huawei.com>
6
- *
7
- * This program is free software; you can redistribute it and/or modify
8
- * it under the terms of the GNU General Public License as published by
9
- * the Free Software Foundation; either version 2 of the License, or
10
- * (at your option) any later version.
11
- *
12
- * This program is distributed in the hope that it will be useful,
13
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
- * GNU General Public License for more details.
16
- *
17
- * You should have received a copy of the GNU General Public License
18
- * along with this program; if not, write to the Free Software
19
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
- *
21
- * This program is also available under a commercial proprietary license.
22
- * For more information, contact us at license @ x265.com.
23
- *****************************************************************************/
24
-
25
-#include "asm.S"
26
-
27
-.section .rodata
28
-
29
-.align 4
30
-
31
-.text
32
-
33
-
34
-
35
-.macro qpel_filter_0_32b
36
-    movi            v24.8h, #64
37
-    uxtl            v19.8h, v5.8b
38
-    smull           v17.4s, v19.4h, v24.4h
39
-    smull2          v18.4s, v19.8h, v24.8h
40
-.endm
41
-
42
-.macro qpel_filter_1_32b
43
-    movi            v16.8h, #58
44
-    uxtl            v19.8h, v5.8b
45
-    smull           v17.4s, v19.4h, v16.4h
46
-    smull2          v18.4s, v19.8h, v16.8h
47
-
48
-    movi            v24.8h, #10
49
-    uxtl            v21.8h, v1.8b
50
-    smull           v19.4s, v21.4h, v24.4h
51
-    smull2          v20.4s, v21.8h, v24.8h
52
-
53
-    movi            v16.8h, #17
54
-    uxtl            v23.8h, v2.8b
55
-    smull           v21.4s, v23.4h, v16.4h
56
-    smull2          v22.4s, v23.8h, v16.8h
57
-
58
-    movi            v24.8h, #5
59
-    uxtl            v1.8h, v6.8b
60
-    smull           v23.4s, v1.4h, v24.4h
61
-    smull2          v16.4s, v1.8h, v24.8h
62
-
63
-    sub             v17.4s, v17.4s, v19.4s
64
-    sub             v18.4s, v18.4s, v20.4s
65
-
66
-    uxtl            v1.8h, v4.8b
67
-    sshll           v19.4s, v1.4h, #2
68
-    sshll2          v20.4s, v1.8h, #2
69
-
70
-    add             v17.4s, v17.4s, v21.4s
71
-    add             v18.4s, v18.4s, v22.4s
72
-
73
-    uxtl            v1.8h, v0.8b
74
-    uxtl            v2.8h, v3.8b
75
-    ssubl           v21.4s, v2.4h, v1.4h
76
-    ssubl2          v22.4s, v2.8h, v1.8h
77
-
78
-    add             v17.4s, v17.4s, v19.4s
79
-    add             v18.4s, v18.4s, v20.4s
80
-    sub             v21.4s, v21.4s, v23.4s
81
-    sub             v22.4s, v22.4s, v16.4s
82
-    add             v17.4s, v17.4s, v21.4s
83
-    add             v18.4s, v18.4s, v22.4s
84
-.endm
85
-
86
-.macro qpel_filter_2_32b
87
-    movi            v16.4s, #11
88
-    uxtl            v19.8h, v5.8b
89
-    uxtl            v20.8h, v2.8b
90
-    saddl           v17.4s, v19.4h, v20.4h
91
-    saddl2          v18.4s, v19.8h, v20.8h
92
-
93
-    uxtl            v21.8h, v1.8b
94
-    uxtl            v22.8h, v6.8b
95
-    saddl           v19.4s, v21.4h, v22.4h
96
-    saddl2          v20.4s, v21.8h, v22.8h
97
-
98
-    mul             v19.4s, v19.4s, v16.4s
99
-    mul             v20.4s, v20.4s, v16.4s
100
-
101
-    movi            v16.4s, #40
102
-    mul             v17.4s, v17.4s, v16.4s
103
-    mul             v18.4s, v18.4s, v16.4s
104
-
105
-    uxtl            v21.8h, v4.8b
106
-    uxtl            v22.8h, v3.8b
107
-    saddl           v23.4s, v21.4h, v22.4h
108
-    saddl2          v16.4s, v21.8h, v22.8h
109
-
110
-    uxtl            v1.8h, v0.8b
111
-    uxtl            v2.8h, v7.8b
112
-    saddl           v21.4s, v1.4h, v2.4h
113
-    saddl2          v22.4s, v1.8h, v2.8h
114
-
115
-    shl             v23.4s, v23.4s, #2
116
-    shl             v16.4s, v16.4s, #2
117
-
118
-    add             v19.4s, v19.4s, v21.4s
119
-    add             v20.4s, v20.4s, v22.4s
120
-    add             v17.4s, v17.4s, v23.4s
121
-    add             v18.4s, v18.4s, v16.4s
122
-    sub             v17.4s, v17.4s, v19.4s
123
-    sub             v18.4s, v18.4s, v20.4s
124
-.endm
125
-
126
-.macro qpel_filter_3_32b
127
-    movi            v16.8h, #17
128
-    movi            v24.8h, #5
129
-
130
-    uxtl            v19.8h, v5.8b
131
-    smull           v17.4s, v19.4h, v16.4h
132
-    smull2          v18.4s, v19.8h, v16.8h
133
-
134
-    uxtl            v21.8h, v1.8b
135
-    smull           v19.4s, v21.4h, v24.4h
136
-    smull2          v20.4s, v21.8h, v24.8h
137
-
138
-    movi            v16.8h, #58
139
-    uxtl            v23.8h, v2.8b
140
-    smull           v21.4s, v23.4h, v16.4h
141
-    smull2          v22.4s, v23.8h, v16.8h
142
-
143
-    movi            v24.8h, #10
144
-    uxtl            v1.8h, v6.8b
145
-    smull           v23.4s, v1.4h, v24.4h
146
-    smull2          v16.4s, v1.8h, v24.8h
147
-
148
-    sub             v17.4s, v17.4s, v19.4s
149
-    sub             v18.4s, v18.4s, v20.4s
150
-
151
-    uxtl            v1.8h, v3.8b
152
-    sshll           v19.4s, v1.4h, #2
153
-    sshll2          v20.4s, v1.8h, #2
154
-
155
-    add             v17.4s, v17.4s, v21.4s
156
-    add             v18.4s, v18.4s, v22.4s
157
-
158
-    uxtl            v1.8h, v4.8b
159
-    uxtl            v2.8h, v7.8b
160
-    ssubl           v21.4s, v1.4h, v2.4h
161
-    ssubl2          v22.4s, v1.8h, v2.8h
162
-
163
-    add             v17.4s, v17.4s, v19.4s
164
-    add             v18.4s, v18.4s, v20.4s
165
-    sub             v21.4s, v21.4s, v23.4s
166
-    sub             v22.4s, v22.4s, v16.4s
167
-    add             v17.4s, v17.4s, v21.4s
168
-    add             v18.4s, v18.4s, v22.4s
169
-.endm
170
-
171
-
172
-
173
-
174
-.macro vextin8
175
-    ld1             {v3.16b}, x11, #16
176
-    mov             v7.d0, v3.d1
177
-    ext             v0.8b, v3.8b, v7.8b, #1
178
-    ext             v4.8b, v3.8b, v7.8b, #2
179
-    ext             v1.8b, v3.8b, v7.8b, #3
180
-    ext             v5.8b, v3.8b, v7.8b, #4
181
-    ext             v2.8b, v3.8b, v7.8b, #5
182
-    ext             v6.8b, v3.8b, v7.8b, #6
183
-    ext             v3.8b, v3.8b, v7.8b, #7
184
-.endm
185
-
186
-
187
-
188
-// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
189
-.macro HPS_FILTER a b filterhps
190
-    mov             w12, #8192
191
-    mov             w6, w10
192
-    sub             x3, x3, #\a
193
-    lsl             x3, x3, #1
194
-    mov             w9, #\a
195
-    cmp             w9, #4
196
-    b.eq            14f
197
-    cmp             w9, #12
198
-    b.eq            15f
199
-    b               7f
200
-14:
201
-    HPS_FILTER_4 \a \b \filterhps
202
-    b               10f
203
-15:
204
-    HPS_FILTER_12 \a \b \filterhps
205
-    b               10f
206
-7:
207
-    cmp             w5, #0
208
-    b.eq            8f
209
-    cmp             w5, #1
210
-    b.eq            9f
211
-8:
212
-loop1_hps_\filterhps\()_\a\()x\b\()_rowext0:
213
-    mov             w7, #\a
214
-    lsr             w7, w7, #3
215
-    mov             x11, x0
216
-    sub             x11, x11, #4
217
-loop2_hps_\filterhps\()_\a\()x\b\()_rowext0:
218
-    vextin8
219
-    \filterhps
220
-    dup             v16.4s, w12
221
-    sub             v17.4s, v17.4s, v16.4s
222
-    sub             v18.4s, v18.4s, v16.4s
223
-    xtn             v0.4h, v17.4s
224
-    xtn2            v0.8h, v18.4s
225
-    st1             {v0.8h}, x2, #16
226
-    subs            w7, w7, #1
227
-    sub             x11, x11, #8
228
-    b.ne            loop2_hps_\filterhps\()_\a\()x\b\()_rowext0
229
-    subs            w6, w6, #1
230
-    add             x0, x0, x1
231
-    add             x2, x2, x3
232
-    b.ne            loop1_hps_\filterhps\()_\a\()x\b\()_rowext0
233
-    b               10f
234
-9:
235
-loop3_hps_\filterhps\()_\a\()x\b\()_rowext1:
236
-    mov             w7, #\a
237
-    lsr             w7, w7, #3
238
-    mov             x11, x0
239
-    sub             x11, x11, #4
240
-loop4_hps_\filterhps\()_\a\()x\b\()_rowext1:
241
-    vextin8
242
-    \filterhps
243
-    dup             v16.4s, w12
244
-    sub             v17.4s, v17.4s, v16.4s
245
-    sub             v18.4s, v18.4s, v16.4s
246
-    xtn             v0.4h, v17.4s
247
-    xtn2            v0.8h, v18.4s
248
-    st1             {v0.8h}, x2, #16
249
-    subs            w7, w7, #1
250
-    sub             x11, x11, #8
251
-    b.ne            loop4_hps_\filterhps\()_\a\()x\b\()_rowext1
252
-    subs            w6, w6, #1
253
-    add             x0, x0, x1
254
-    add             x2, x2, x3
255
-    b.ne            loop3_hps_\filterhps\()_\a\()x\b\()_rowext1
256
-10:
257
-.endm
258
-
259
-.macro HPS_FILTER_4 w h filterhps
260
-    cmp             w5, #0
261
-    b.eq            11f
262
-    cmp             w5, #1
263
-    b.eq            12f
264
-11:
265
-loop4_hps_\filterhps\()_\w\()x\h\()_rowext0:
266
-    mov             x11, x0
267
-    sub             x11, x11, #4
268
-    vextin8
269
-    \filterhps
270
-    dup             v16.4s, w12
271
-    sub             v17.4s, v17.4s, v16.4s
272
-    xtn             v0.4h, v17.4s
273
-    st1             {v0.4h}, x2, #8
274
-    sub             x11, x11, #8
275
-    subs            w6, w6, #1
276
-    add             x0, x0, x1
277
-    add             x2, x2, x3
278
-    b.ne            loop4_hps_\filterhps\()_\w\()x\h\()_rowext0
279
-    b               13f
280
-12:
281
-loop5_hps_\filterhps\()_\w\()x\h\()_rowext1:
282
-    mov             x11, x0
283
-    sub             x11, x11, #4
284
-    vextin8
285
-    \filterhps
286
-    dup             v16.4s, w12
287
-    sub             v17.4s, v17.4s, v16.4s
288
-    xtn             v0.4h, v17.4s
289
-    st1             {v0.4h}, x2, #8
290
-    sub             x11, x11, #8
291
-    subs            w6, w6, #1
292
-    add             x0, x0, x1
293
-    add             x2, x2, x3
294
-    b.ne            loop5_hps_\filterhps\()_\w\()x\h\()_rowext1
295
-13:
296
-.endm
297
-
298
-.macro HPS_FILTER_12 w h filterhps
299
-    cmp             w5, #0
300
-    b.eq            14f
301
-    cmp             w5, #1
302
-    b.eq            15f
303
-14:
304
-loop12_hps_\filterhps\()_\w\()x\h\()_rowext0:
305
-    mov             x11, x0
306
-    sub             x11, x11, #4
307
-    vextin8
308
-    \filterhps
309
-    dup             v16.4s, w12
310
-    sub             v17.4s, v17.4s, v16.4s
311
-    sub             v18.4s, v18.4s, v16.4s
312
-    xtn             v0.4h, v17.4s
313
-    xtn2            v0.8h, v18.4s
314
-    st1             {v0.8h}, x2, #16
315
-    sub             x11, x11, #8
316
-
317
-    vextin8
318
-    \filterhps
319
-    dup             v16.4s, w12
320
-    sub             v17.4s, v17.4s, v16.4s
321
-    xtn             v0.4h, v17.4s
322
-    st1             {v0.4h}, x2, #8
323
-    add             x2, x2, x3
324
-    subs            w6, w6, #1
325
-    add             x0, x0, x1
326
-    b.ne            loop12_hps_\filterhps\()_\w\()x\h\()_rowext0
327
-    b               16f
328
-15:
329
-loop12_hps_\filterhps\()_\w\()x\h\()_rowext1:
330
-    mov             x11, x0
331
-    sub             x11, x11, #4
332
-    vextin8
333
-    \filterhps
334
-    dup             v16.4s, w12
335
-    sub             v17.4s, v17.4s, v16.4s
336
-    sub             v18.4s, v18.4s, v16.4s
337
-    xtn             v0.4h, v17.4s
338
-    xtn2            v0.8h, v18.4s
339
-    st1             {v0.8h}, x2, #16
340
-    sub             x11, x11, #8
341
-
342
-    vextin8
343
-    \filterhps
344
-    dup             v16.4s, w12
345
-    sub             v17.4s, v17.4s, v16.4s
346
-    xtn             v0.4h, v17.4s
347
-    st1             {v0.4h}, x2, #8
348
-    add             x2, x2, x3
349
-    subs            w6, w6, #1
350
-    add             x0, x0, x1
351
-    b.ne            loop12_hps_\filterhps\()_\w\()x\h\()_rowext1
352
-16:
353
-.endm
354
-
355
-// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
356
-.macro LUMA_HPS w h
357
-function x265_interp_8tap_horiz_ps_\w\()x\h\()_neon
358
-    mov             w10, #\h
359
-    cmp             w5, #0
360
-    b.eq            6f
361
-    sub             x0, x0, x1, lsl #2
362
-
363
-    add             x0, x0, x1
364
-    add             w10, w10, #7
365
-6:
366
-    cmp             w4, #0
367
-    b.eq            0f
368
-    cmp             w4, #1
369
-    b.eq            1f
370
-    cmp             w4, #2
371
-    b.eq            2f
372
-    cmp             w4, #3
373
-    b.eq            3f
374
-0:
375
-    HPS_FILTER  \w \h qpel_filter_0_32b
376
-    b               5f
377
-1:
378
-    HPS_FILTER  \w \h qpel_filter_1_32b
379
-    b               5f
380
-2:
381
-    HPS_FILTER  \w \h qpel_filter_2_32b
382
-    b               5f
383
-3:
384
-    HPS_FILTER  \w \h qpel_filter_3_32b
385
-    b               5f
386
-5:
387
-    ret
388
-endfunc
389
-.endm
390
-
391
-LUMA_HPS    4 4
392
-LUMA_HPS    4 8
393
-LUMA_HPS    4 16
394
-LUMA_HPS    8 4
395
-LUMA_HPS    8 8
396
-LUMA_HPS    8 16
397
-LUMA_HPS    8 32
398
-LUMA_HPS    12 16
399
-LUMA_HPS    16 4
400
-LUMA_HPS    16 8
401
-LUMA_HPS    16 12
402
-LUMA_HPS    16 16
403
-LUMA_HPS    16 32
404
-LUMA_HPS    16 64
405
-LUMA_HPS    24 32
406
-LUMA_HPS    32 8
407
-LUMA_HPS    32 16
408
-LUMA_HPS    32 24
409
-LUMA_HPS    32 32
410
-LUMA_HPS    32 64
411
-LUMA_HPS    48 64
412
-LUMA_HPS    64 16
413
-LUMA_HPS    64 32
414
-LUMA_HPS    64 48
415
-LUMA_HPS    64 64
416
x265_3.5.tar.gz/source/common/aarch64/ipfilter8.h Deleted
57
 
1
@@ -1,55 +0,0 @@
2
-/*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
- *
5
- * Authors: Yimeng Su <yimeng.su@huawei.com>
6
- *
7
- * This program is free software; you can redistribute it and/or modify
8
- * it under the terms of the GNU General Public License as published by
9
- * the Free Software Foundation; either version 2 of the License, or
10
- * (at your option) any later version.
11
- *
12
- * This program is distributed in the hope that it will be useful,
13
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
- * GNU General Public License for more details.
16
- *
17
- * You should have received a copy of the GNU General Public License
18
- * along with this program; if not, write to the Free Software
19
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
- *
21
- * This program is also available under a commercial proprietary license.
22
- * For more information, contact us at license @ x265.com.
23
- *****************************************************************************/
24
-
25
-#ifndef X265_IPFILTER8_AARCH64_H
26
-#define X265_IPFILTER8_AARCH64_H
27
-
28
-
29
-void x265_interp_8tap_horiz_ps_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
30
-void x265_interp_8tap_horiz_ps_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
31
-void x265_interp_8tap_horiz_ps_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
32
-void x265_interp_8tap_horiz_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
33
-void x265_interp_8tap_horiz_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
34
-void x265_interp_8tap_horiz_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
35
-void x265_interp_8tap_horiz_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
36
-void x265_interp_8tap_horiz_ps_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
37
-void x265_interp_8tap_horiz_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
38
-void x265_interp_8tap_horiz_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
39
-void x265_interp_8tap_horiz_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
40
-void x265_interp_8tap_horiz_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
41
-void x265_interp_8tap_horiz_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
42
-void x265_interp_8tap_horiz_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
43
-void x265_interp_8tap_horiz_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
44
-void x265_interp_8tap_horiz_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
45
-void x265_interp_8tap_horiz_ps_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
46
-void x265_interp_8tap_horiz_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
47
-void x265_interp_8tap_horiz_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
48
-void x265_interp_8tap_horiz_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
49
-void x265_interp_8tap_horiz_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
50
-void x265_interp_8tap_horiz_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
51
-void x265_interp_8tap_horiz_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
52
-void x265_interp_8tap_horiz_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
53
-void x265_interp_8tap_horiz_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
54
-
55
-
56
-#endif // ifndef X265_IPFILTER8_AARCH64_H
57
x265_3.5.tar.gz/source/common/aarch64/pixel-util.h Deleted
42
 
1
@@ -1,40 +0,0 @@
2
-/*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
- *
5
- * Authors: Yimeng Su <yimeng.su@huawei.com>
6
- *          Hongbin Liu <liuhongbin1@huawei.com>
7
- *
8
- * This program is free software; you can redistribute it and/or modify
9
- * it under the terms of the GNU General Public License as published by
10
- * the Free Software Foundation; either version 2 of the License, or
11
- * (at your option) any later version.
12
- *
13
- * This program is distributed in the hope that it will be useful,
14
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
15
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
16
- * GNU General Public License for more details.
17
- *
18
- * You should have received a copy of the GNU General Public License
19
- * along with this program; if not, write to the Free Software
20
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
21
- *
22
- * This program is also available under a commercial proprietary license.
23
- * For more information, contact us at license @ x265.com.
24
- *****************************************************************************/
25
-
26
-#ifndef X265_PIXEL_UTIL_AARCH64_H
27
-#define X265_PIXEL_UTIL_AARCH64_H
28
-
29
-int x265_pixel_satd_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
30
-int x265_pixel_satd_4x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
31
-int x265_pixel_satd_4x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
32
-int x265_pixel_satd_4x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
33
-int x265_pixel_satd_8x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
34
-int x265_pixel_satd_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
35
-int x265_pixel_satd_12x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
36
-int x265_pixel_satd_12x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
37
-
38
-uint32_t x265_quant_neon(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff);
39
-int PFX(psyCost_4x4_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
40
-
41
-#endif // ifndef X265_PIXEL_UTIL_AARCH64_H
42
x265_3.5.tar.gz/source/common/aarch64/pixel.h Deleted
107
 
1
@@ -1,105 +0,0 @@
2
-/*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
- *
5
- * Authors: Hongbin Liu <liuhongbin1@huawei.com>
6
- *
7
- * This program is free software; you can redistribute it and/or modify
8
- * it under the terms of the GNU General Public License as published by
9
- * the Free Software Foundation; either version 2 of the License, or
10
- * (at your option) any later version.
11
- *
12
- * This program is distributed in the hope that it will be useful,
13
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
- * GNU General Public License for more details.
16
- *
17
- * You should have received a copy of the GNU General Public License
18
- * along with this program; if not, write to the Free Software
19
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
- *
21
- * This program is also available under a commercial proprietary license.
22
- * For more information, contact us at license @ x265.com.
23
- *****************************************************************************/
24
-
25
-#ifndef X265_I386_PIXEL_AARCH64_H
26
-#define X265_I386_PIXEL_AARCH64_H
27
-
28
-void x265_pixel_avg_pp_4x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
29
-void x265_pixel_avg_pp_4x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
30
-void x265_pixel_avg_pp_4x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
31
-void x265_pixel_avg_pp_8x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
32
-void x265_pixel_avg_pp_8x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
33
-void x265_pixel_avg_pp_8x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
34
-void x265_pixel_avg_pp_8x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
35
-void x265_pixel_avg_pp_12x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
36
-void x265_pixel_avg_pp_16x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
37
-void x265_pixel_avg_pp_16x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
38
-void x265_pixel_avg_pp_16x12_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
39
-void x265_pixel_avg_pp_16x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
40
-void x265_pixel_avg_pp_16x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
41
-void x265_pixel_avg_pp_16x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
42
-void x265_pixel_avg_pp_24x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
43
-void x265_pixel_avg_pp_32x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
44
-void x265_pixel_avg_pp_32x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
45
-void x265_pixel_avg_pp_32x24_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
46
-void x265_pixel_avg_pp_32x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
47
-void x265_pixel_avg_pp_32x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
48
-void x265_pixel_avg_pp_48x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
49
-void x265_pixel_avg_pp_64x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
50
-void x265_pixel_avg_pp_64x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
51
-void x265_pixel_avg_pp_64x48_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
52
-void x265_pixel_avg_pp_64x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
53
-
54
-void x265_sad_x3_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
55
-void x265_sad_x3_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
56
-void x265_sad_x3_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
57
-void x265_sad_x3_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
58
-void x265_sad_x3_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
59
-void x265_sad_x3_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
60
-void x265_sad_x3_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
61
-void x265_sad_x3_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
62
-void x265_sad_x3_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
63
-void x265_sad_x3_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
64
-void x265_sad_x3_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
65
-void x265_sad_x3_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
66
-void x265_sad_x3_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
67
-void x265_sad_x3_16x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
68
-void x265_sad_x3_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
69
-void x265_sad_x3_32x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
70
-void x265_sad_x3_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
71
-void x265_sad_x3_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
72
-void x265_sad_x3_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
73
-void x265_sad_x3_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
74
-void x265_sad_x3_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
75
-void x265_sad_x3_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
76
-void x265_sad_x3_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
77
-void x265_sad_x3_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
78
-void x265_sad_x3_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
79
-
80
-void x265_sad_x4_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
81
-void x265_sad_x4_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
82
-void x265_sad_x4_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
83
-void x265_sad_x4_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
84
-void x265_sad_x4_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
85
-void x265_sad_x4_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
86
-void x265_sad_x4_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
87
-void x265_sad_x4_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
88
-void x265_sad_x4_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
89
-void x265_sad_x4_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
90
-void x265_sad_x4_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
91
-void x265_sad_x4_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
92
-void x265_sad_x4_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
93
-void x265_sad_x4_16x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
94
-void x265_sad_x4_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
95
-void x265_sad_x4_32x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
96
-void x265_sad_x4_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
97
-void x265_sad_x4_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
98
-void x265_sad_x4_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
99
-void x265_sad_x4_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
100
-void x265_sad_x4_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
101
-void x265_sad_x4_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
102
-void x265_sad_x4_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
103
-void x265_sad_x4_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
104
-void x265_sad_x4_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
105
-
106
-#endif // ifndef X265_I386_PIXEL_AARCH64_H
107
x265_3.6.tar.gz/.gitignore Added
38
 
1
@@ -0,0 +1,36 @@
2
+# Prerequisites
3
+*.d
4
+
5
+# Compiled Object files
6
+*.slo
7
+*.lo
8
+*.o
9
+*.obj
10
+
11
+# Precompiled Headers
12
+*.gch
13
+*.pch
14
+
15
+# Compiled Dynamic libraries
16
+*.so
17
+*.dylib
18
+*.dll
19
+
20
+# Fortran module files
21
+*.mod
22
+*.smod
23
+
24
+# Compiled Static libraries
25
+*.lai
26
+*.la
27
+*.a
28
+*.lib
29
+
30
+# Executables
31
+*.exe
32
+*.out
33
+*.app
34
+
35
+# Build directory
36
+build/
37
+
38
x265_3.5.tar.gz/build/README.txt -> x265_3.6.tar.gz/build/README.txt Changed

@@ -6,6 +6,9 @@
 
 Note: MSVC12 requires cmake 2.8.11 or later
 
+Note: When the SVE/SVE2 instruction set of Arm AArch64 architecture is to be used, the GCC10.x and onwards must
+      be installed in order to compile x265.
+
 
 = Optional Prerequisites =
 
@@ -88,3 +91,25 @@
 building out of a Mercurial source repository.  If you are building out of
 a release source package, the version will not change.  If Mercurial is not
 found, the version will be "unknown".
+
+= Build Instructions for cross-compilation for Arm AArch64 Targets=
+
+When the target platform is based on Arm AArch64 architecture, the x265 can be
+built in x86 platforms. However, the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER
+enviroment variables should be set to point to the cross compilers of the
+appropriate gcc. For example:
+
+1. export CMAKE_C_COMPILER=aarch64-unknown-linux-gnu-gcc
+2. export CMAKE_CXX_COMPILER=aarch64-unknown-linux-gnu-g++
+
+The default ones are aarch64-linux-gnu-gcc and aarch64-linux-gnu-g++.
+Then, the normal building process can be followed.
+
+Moreover, if the target platform supports SVE or SVE2 instruction set, the
+CROSS_COMPILE_SVE or CROSS_COMPILE_SVE2 environment variables should be set
+to true, respectively. For example:
+
+1. export CROSS_COMPILE_SVE2=true
+2. export CROSS_COMPILE_SVE=true
+
+Then, the normal building process can be followed.
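
Putting the added README instructions together, a cross-compile session might look roughly like the following sketch; the toolchain triplet and the SVE2 setting are assumptions that depend on the installed cross toolchain and the target CPU.

    # illustrative cross-compile flow based on the README additions above
    export CMAKE_C_COMPILER=aarch64-unknown-linux-gnu-gcc
    export CMAKE_CXX_COMPILER=aarch64-unknown-linux-gnu-g++
    export CROSS_COMPILE_SVE2=true        # only if the target CPU supports SVE2
    cd build/aarch64-linux
    cmake -DCMAKE_TOOLCHAIN_FILE="crosscompile.cmake" -G "Unix Makefiles" ../../source
    make
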
x265_3.6.tar.gz/build/aarch64-darwin Added
2
 
1
+(directory)
2
x265_3.6.tar.gz/build/aarch64-darwin/crosscompile.cmake Added
25
 
1
@@ -0,0 +1,23 @@
2
+# CMake toolchain file for cross compiling x265 for aarch64
3
+# This feature is only supported as experimental. Use with caution.
4
+# Please report bugs on bitbucket
5
+# Run cmake with: cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source && ccmake ../../source
6
+
7
+set(CROSS_COMPILE_ARM64 1)
8
+set(CMAKE_SYSTEM_NAME Darwin)
9
+set(CMAKE_SYSTEM_PROCESSOR aarch64)
10
+
11
+# specify the cross compiler
12
+set(CMAKE_C_COMPILER gcc-12)
13
+set(CMAKE_CXX_COMPILER g++-12)
14
+
15
+# specify the target environment
16
+SET(CMAKE_FIND_ROOT_PATH  /opt/homebrew/bin/)
17
+
18
+# specify whether SVE/SVE2 is supported by the target platform
19
+if(DEFINED ENV{CROSS_COMPILE_SVE2})
20
+    set(CROSS_COMPILE_SVE2 1)
21
+elseif(DEFINED ENV{CROSS_COMPILE_SVE})
22
+    set(CROSS_COMPILE_SVE 1)
23
+endif()
24
+
25
x265_3.6.tar.gz/build/aarch64-darwin/make-Makefiles.bash Added
6
 
1
@@ -0,0 +1,4 @@
2
+#!/bin/bash
3
+# Run this from within a bash shell
4
+
5
+cmake -DCMAKE_TOOLCHAIN_FILE="crosscompile.cmake" -G "Unix Makefiles" ../../source && ccmake ../../source
6
x265_3.5.tar.gz/build/aarch64-linux/crosscompile.cmake -> x265_3.6.tar.gz/build/aarch64-linux/crosscompile.cmake Changed

@@ -3,13 +3,29 @@
 # Please report bugs on bitbucket
 # Run cmake with: cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source && ccmake ../../source
 
-set(CROSS_COMPILE_ARM 1)
+set(CROSS_COMPILE_ARM64 1)
 set(CMAKE_SYSTEM_NAME Linux)
 set(CMAKE_SYSTEM_PROCESSOR aarch64)
 
 # specify the cross compiler
-set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
-set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
+if(DEFINED ENV{CMAKE_C_COMPILER})
+    set(CMAKE_C_COMPILER $ENV{CMAKE_C_COMPILER})
+else()
+    set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
+endif()
+if(DEFINED ENV{CMAKE_CXX_COMPILER})
+    set(CMAKE_CXX_COMPILER $ENV{CMAKE_CXX_COMPILER})
+else()
+    set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
+endif()
 
 # specify the target environment
 SET(CMAKE_FIND_ROOT_PATH  /usr/aarch64-linux-gnu)
 
+
+# specify whether SVE/SVE2 is supported by the target platform
+if(DEFINED ENV{CROSS_COMPILE_SVE2})
+    set(CROSS_COMPILE_SVE2 1)
+elseif(DEFINED ENV{CROSS_COMPILE_SVE})
+    set(CROSS_COMPILE_SVE 1)
+endif()
x265_3.5.tar.gz/build/arm-linux/make-Makefiles.bash -> x265_3.6.tar.gz/build/arm-linux/make-Makefiles.bash Changed
7
 
1
@@ -1,4 +1,4 @@
2
 #!/bin/bash
3
 # Run this from within a bash shell
4
 
5
-cmake -G "Unix Makefiles" ../../source && ccmake ../../source
6
+cmake -DCMAKE_TOOLCHAIN_FILE="crosscompile.cmake" -G "Unix Makefiles" ../../source && ccmake ../../source
7
x265_3.5.tar.gz/doc/reST/cli.rst -> x265_3.6.tar.gz/doc/reST/cli.rst Changed
405
 
1
@@ -632,9 +632,8 @@
2
    auto-detection by the encoder. If specified, the encoder will
3
    attempt to bring the encode specifications within that specified
4
    level. If the encoder is unable to reach the level it issues a
5
-   warning and aborts the encode. If the requested requirement level is
6
-   higher than the actual level, the actual requirement level is
7
-   signaled.
8
+   warning and aborts the encode. The requested level will be signaled 
9
+   in the bitstream even if it is higher than the actual level.
10
 
11
    Beware, specifying a decoder level will force the encoder to enable
12
    VBV for constant rate factor encodes, which may introduce
13
@@ -714,11 +713,8 @@
14
    (main, main10, etc). Second, an encoder is created from this
15
    x265_param instance and the :option:`--level-idc` and
16
    :option:`--high-tier` parameters are used to reduce bitrate or other
17
-   features in order to enforce the target level. Finally, the encoder
18
-   re-examines the final set of parameters and detects the actual
19
-   minimum decoder requirement level and this is what is signaled in
20
-   the bitstream headers. The detected decoder level will only use High
21
-   tier if the user specified a High tier level.
22
+   features in order to enforce the target level. The detected decoder level
23
+   will only use High tier if the user specified a High tier level.
24
 
25
    The signaled profile will be determined by the encoder's internal
26
    bitdepth and input color space. If :option:`--keyint` is 0 or 1,
27
@@ -961,21 +957,21 @@
28
    Note that :option:`--analysis-save-reuse-level` and :option:`--analysis-load-reuse-level` must be paired
29
    with :option:`--analysis-save` and :option:`--analysis-load` respectively.
30
 
31
-   +--------------+------------------------------------------+
32
-   | Level        | Description                              |
33
-   +==============+==========================================+
34
-   | 1            | Lookahead information                    |
35
-   +--------------+------------------------------------------+
36
-   | 2 to 4       | Level 1 + intra/inter modes, ref's       |
37
-   +--------------+------------------------------------------+
38
-   | 5 and 6      | Level 2 + rect-amp                       |
39
-   +--------------+------------------------------------------+
40
-   | 7            | Level 5 + AVC size CU refinement         |
41
-   +--------------+------------------------------------------+
42
-   | 8 and 9      | Level 5 + AVC size Full CU analysis-info |
43
-   +--------------+------------------------------------------+
44
-   | 10           | Level 5 + Full CU analysis-info          |
45
-   +--------------+------------------------------------------+
46
+   +--------------+---------------------------------------------------+
47
+   | Level        | Description                                       |
48
+   +==============+===================================================+
49
+   | 1            | Lookahead information                             |
50
+   +--------------+---------------------------------------------------+
51
+   | 2 to 4       | Level 1 + intra/inter modes, depth, ref's, cutree |
52
+   +--------------+---------------------------------------------------+
53
+   | 5 and 6      | Level 2 + rect-amp                                |
54
+   +--------------+---------------------------------------------------+
55
+   | 7            | Level 5 + AVC size CU refinement                  |
56
+   +--------------+---------------------------------------------------+
57
+   | 8 and 9      | Level 5 + AVC size Full CU analysis-info          |
58
+   +--------------+---------------------------------------------------+
59
+   | 10           | Level 5 + Full CU analysis-info                   |
60
+   +--------------+---------------------------------------------------+
61
 
62
 .. option:: --refine-mv-type <string>
63
 
64
@@ -1332,6 +1328,11 @@
65
    Search range for HME level 0, 1 and 2.
66
    The Search Range for each HME level must be between 0 and 32768(excluding).
67
    Default search range is 16,32,48 for level 0,1,2 respectively.
68
+   
69
+.. option:: --mcstf, --no-mcstf
70
+
71
+    Enable Motion Compensated Temporal filtering.
72
+   Default: disabled
73
 
74
 Spatial/intra options
75
 =====================
76
@@ -1473,17 +1474,9 @@
77
 
78
 .. option:: --hist-scenecut, --no-hist-scenecut
79
 
80
-   Indicates that scenecuts need to be detected using luma edge and chroma histograms.
81
-   :option:`--hist-scenecut` enables scenecut detection using the histograms and disables the default scene cut algorithm.
82
-   :option:`--no-hist-scenecut` disables histogram based scenecut algorithm.
83
-   
84
-.. option:: --hist-threshold <0.0..1.0>
85
-
86
-   This value represents the threshold for normalized SAD of edge histograms used in scenecut detection.
87
-   This requires :option:`--hist-scenecut` to be enabled. For example, a value of 0.2 indicates that a frame with normalized SAD value 
88
-   greater than 0.2 against the previous frame as scenecut. 
89
-   Increasing the threshold reduces the number of scenecuts detected.
90
-   Default 0.03.
91
+   Detect scenecuts based on the histogram, intensity and variance of the picture.
92
+   :option:`--hist-scenecut` enables and :option:`--no-hist-scenecut` disables
93
+   histogram-based scenecut detection.
94
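+   For example, a sketch that replaces the default scenecut algorithm with the histogram-based
+   one (file names are placeholders)::
+
+      x265 --input source.y4m --hist-scenecut --output out.hevc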
    
95
 .. option:: --radl <integer>
96
    
97
@@ -1766,6 +1759,12 @@
98
    Default 1.0.
99
    **Range of values:** 0.0 to 3.0
100
 
101
+.. option:: --sbrc, --no-sbrc
102
+
103
+   Enable or disable Segment Based Rate Control (SBRC). The segment duration depends on the
104
+   keyframe interval specified; if unspecified, the default keyframe interval is used.
105
+   Default: disabled.
106
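+   For example, a sketch that enables SBRC together with an explicit keyframe interval
+   (file names are placeholders)::
+
+      x265 --input source.y4m --sbrc --keyint 250 --output out.hevc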
+
107
 .. option:: --hevc-aq
108
 
109
    Enable adaptive quantization
110
@@ -1976,12 +1975,18 @@
111
    
112
    **CLI ONLY**
113
 
114
+.. option:: --scenecut-qp-config <filename>
115
+
116
+   Specify a text file which contains the scenecut aware QP options.
117
+   The options include :option:`--scenecut-aware-qp` and :option:`--masking-strength`
118
+
119
+   **CLI ONLY**
120
+
121
 .. option:: --scenecut-aware-qp <integer>
122
 
123
    It reduces the bits spent on the inter-frames within the scenecut window
124
    before and after a scenecut by increasing their QP in ratecontrol pass2 algorithm
125
-   without any deterioration in visual quality. If a scenecut falls within the window,
126
-   the QP of the inter-frames after this scenecut will not be modified.
127
+   without any deterioration in visual quality.
128
    :option:`--scenecut-aware-qp` works only with --pass 2. Default 0.
129
 
130
    +-------+---------------------------------------------------------------+
131
@@ -2006,48 +2011,83 @@
132
    for the QP increment for inter-frames when :option:`--scenecut-aware-qp`
133
    is enabled.
134
 
135
-   When :option:`--scenecut-aware-qp` is::
136
+   When :option:`--scenecut-aware-qp` is:
137
+
138
    * 1 (Forward masking):
139
-   --masking-strength <fwdWindow,fwdRefQPDelta,fwdNonRefQPDelta>
140
+   --masking-strength <fwdMaxWindow,fwdRefQPDelta,fwdNonRefQPDelta>
141
+   or 
142
+   --masking-strength <fwdWindow1,fwdRefQPDelta1,fwdNonRefQPDelta1,fwdWindow2,fwdRefQPDelta2,fwdNonRefQPDelta2,
143
+                       fwdWindow3,fwdRefQPDelta3,fwdNonRefQPDelta3,fwdWindow4,fwdRefQPDelta4,fwdNonRefQPDelta4,
144
+                       fwdWindow5,fwdRefQPDelta5,fwdNonRefQPDelta5,fwdWindow6,fwdRefQPDelta6,fwdNonRefQPDelta6>
145
    * 2 (Backward masking):
146
-   --masking-strength <bwdWindow,bwdRefQPDelta,bwdNonRefQPDelta>
147
+   --masking-strength <bwdMaxWindow,bwdRefQPDelta,bwdNonRefQPDelta>
148
+   or 
149
+   --masking-strength <bwdWindow1,bwdRefQPDelta1,bwdNonRefQPDelta1,bwdWindow2,bwdRefQPDelta2,bwdNonRefQPDelta2,
150
+                       bwdWindow3,bwdRefQPDelta3,bwdNonRefQPDelta3,bwdWindow4,bwdRefQPDelta4,bwdNonRefQPDelta4,
151
+                       bwdWindow5,bwdRefQPDelta5,bwdNonRefQPDelta5,bwdWindow6,bwdRefQPDelta6,bwdNonRefQPDelta6>
152
    * 3 (Bi-directional masking):
153
-   --masking-strength <fwdWindow,fwdRefQPDelta,fwdNonRefQPDelta,bwdWindow,bwdRefQPDelta,bwdNonRefQPDelta>
154
+   --masking-strength <fwdMaxWindow,fwdRefQPDelta,fwdNonRefQPDelta,bwdMaxWindow,bwdRefQPDelta,bwdNonRefQPDelta>
155
+   or 
156
+   --masking-strength <fwdWindow1,fwdRefQPDelta1,fwdNonRefQPDelta1,fwdWindow2,fwdRefQPDelta2,fwdNonRefQPDelta2,
157
+                       fwdWindow3,fwdRefQPDelta3,fwdNonRefQPDelta3,fwdWindow4,fwdRefQPDelta4,fwdNonRefQPDelta4,
158
+                       fwdWindow5,fwdRefQPDelta5,fwdNonRefQPDelta5,fwdWindow6,fwdRefQPDelta6,fwdNonRefQPDelta6,
159
+                       bwdWindow1,bwdRefQPDelta1,bwdNonRefQPDelta1,bwdWindow2,bwdRefQPDelta2,bwdNonRefQPDelta2,
160
+                       bwdWindow3,bwdRefQPDelta3,bwdNonRefQPDelta3,bwdWindow4,bwdRefQPDelta4,bwdNonRefQPDelta4,
161
+                       bwdWindow5,bwdRefQPDelta5,bwdNonRefQPDelta5,bwdWindow6,bwdRefQPDelta6,bwdNonRefQPDelta6>
162
 
163
    +-----------------+---------------------------------------------------------------+
164
    | Parameter       | Description                                                   |
165
    +=================+===============================================================+
166
-   | fwdWindow       | The duration(in milliseconds) for which there is a reduction  |
167
-   |                 | in the bits spent on the inter-frames after a scenecut by     |
168
-   |                 | increasing their QP. Default 500ms.                           |
169
-   |                 | **Range of values:** 0 to 1000                                |
170
+   | fwdMaxWindow    | The maximum duration(in milliseconds) for which there is a    |
171
+   |                 | reduction in the bits spent on the inter-frames after a       |
172
+   |                 | scenecut by increasing their QP. Default 500ms.               |
173
+   |                 | **Range of values:** 0 to 2000                                |
174
+   +-----------------+---------------------------------------------------------------+
175
+   | fwdWindow       | The duration of a sub-window(in milliseconds) for which there |
176
+   |                 | is a reduction in the bits spent on the inter-frames after a  |
177
+   |                 | scenecut by increasing their QP. Default 500ms.               |
178
+   |                 | **Range of values:** 0 to 2000                                |
179
    +-----------------+---------------------------------------------------------------+
180
    | fwdRefQPDelta   | The offset by which QP is incremented for inter-frames        |
181
    |                 | after a scenecut. Default 5.                                  |
182
-   |                 | **Range of values:** 0 to 10                                  |
183
+   |                 | **Range of values:** 0 to 20                                  |
184
    +-----------------+---------------------------------------------------------------+
185
    | fwdNonRefQPDelta| The offset by which QP is incremented for non-referenced      |
186
    |                 | inter-frames after a scenecut. The offset is computed from    |
187
    |                 | fwdRefQPDelta when it is not explicitly specified.            |
188
-   |                 | **Range of values:** 0 to 10                                  |
189
+   |                 | **Range of values:** 0 to 20                                  |
190
+   +-----------------+---------------------------------------------------------------+
191
+   | bwdMaxWindow    | The maximum duration(in milliseconds) for which there is a    |
192
+   |                 | reduction in the bits spent on the inter-frames before a      |
193
+   |                 | scenecut by increasing their QP. Default 100ms.               |
194
+   |                 | **Range of values:** 0 to 2000                                |
195
    +-----------------+---------------------------------------------------------------+
196
-   | bwdWindow       | The duration(in milliseconds) for which there is a reduction  |
197
-   |                 | in the bits spent on the inter-frames before a scenecut by    |
198
-   |                 | increasing their QP. Default 100ms.                           |
199
-   |                 | **Range of values:** 0 to 1000                                |
200
+   | bwdWindow       | The duration of a sub-window(in milliseconds) for which there |
201
+   |                 | is a reduction in the bits spent on the inter-frames before a |
202
+   |                 | scenecut by increasing their QP. Default 100ms.               |
203
+   |                 | **Range of values:** 0 to 2000                                |
204
    +-----------------+---------------------------------------------------------------+
205
    | bwdRefQPDelta   | The offset by which QP is incremented for inter-frames        |
206
    |                 | before a scenecut. The offset is computed from                |
207
    |                 | fwdRefQPDelta when it is not explicitly specified.            |
208
-   |                 | **Range of values:** 0 to 10                                  |
209
+   |                 | **Range of values:** 0 to 20                                  |
210
    +-----------------+---------------------------------------------------------------+
211
    | bwdNonRefQPDelta| The offset by which QP is incremented for non-referenced      |
212
    |                 | inter-frames before a scenecut. The offset is computed from   |
213
    |                 | bwdRefQPDelta when it is not explicitly specified.            |
214
-   |                 | **Range of values:** 0 to 10                                  |
215
+   |                 | **Range of values:** 0 to 20                                  |
216
    +-----------------+---------------------------------------------------------------+
217
 
218
-   **CLI ONLY**
219
+   The value for the :option:`--masking-strength` parameter can be specified in different ways:
220
+   1. If only --scenecut-aware-qp is specified and --masking-strength is not, the default offset and window size values are used.
221
+   2. If --masking-strength is given in the first (max-window) format shown above, the window, refQpDelta and nonRefQpDelta values supplied by the user are applied to window 1 and the offsets for the remaining windows are derived with a 15% difference between windows.
222
+   3. If --masking-strength is given in the second (per-window) format shown above, the window, refQpDelta and nonRefQpDelta values supplied by the user for each window from 1 to 6 are used directly. Note: this format can be used to specify zero offsets for any particular window.
223
+
224
+   Sample config file (forward masking)::
225
+
226
+      --scenecut-aware-qp 1 --masking-strength 1000,8,12
227
+   
228
+   The above sample config file is available on `the downloads page <https://bitbucket.org/multicoreware/x265_git/downloads/scenecut_qp_config.txt>`_.
229
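+   For example, a sketch of a second-pass invocation that reads these options from the downloaded
+   config file (input/output names are placeholders)::
+
+      x265 --input source.y4m --pass 2 --scenecut-qp-config scenecut_qp_config.txt --output out.hevc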
 
230
 .. option:: --vbv-live-multi-pass, --no-vbv-live-multi-pass
231
 
232
@@ -2057,6 +2097,14 @@
233
    rate control mode.
234
 
235
    Default disabled. **Experimental feature**
236
+   
237
+
238
+.. option:: bEncFocusedFramesOnly
239
+
240
+   Used to trigger encoding of selected GOPs only; disabled by default.
241
+   
242
+   **API ONLY**
243
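+   A minimal sketch of setting this flag through the C API, assuming the field is exposed on
+   ``x265_param`` under the name shown above::
+
+      x265_param *param = x265_param_alloc();
+      x265_param_default_preset(param, "medium", NULL);
+      param->bEncFocusedFramesOnly = 1;  /* encode only the selected GOPs */
+      x265_encoder *encoder = x265_encoder_open(param);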
+   
244
 
245
 Quantization Options
246
 ====================
247
@@ -2427,6 +2475,81 @@
248
    Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.
249
    Required for HLG (Hybrid Log Gamma) signaling. Not signaled by default.
250
 
251
+.. option:: --video-signal-type-preset <string>
252
+
253
+   Specify combinations of color primaries, transfer characteristics, color matrix,
254
+   range of luma and chroma signals, and chroma sample location.
255
+   String format: <system-id>:<color-volume>
256
+   
257
+   This option takes precedence over the individual VUI parameters. If an individual VUI option
258
+   specified together with this one conflicts with the values implied by the system-id
259
+   or color-volume, the individual option is discarded.
260
+
261
+   system-id options and their corresponding values:
262
+   +----------------+---------------------------------------------------------------+
263
+   | system-id      | Value                                                         |
264
+   +================+===============================================================+
265
+   | BT601_525      | --colorprim smpte170m --transfer smpte170m                    |
266
+   |                | --colormatrix smpte170m --range limited --chromaloc 0         |
267
+   +----------------+---------------------------------------------------------------+
268
+   | BT601_626      | --colorprim bt470bg --transfer smpte170m --colormatrix bt470bg|
269
+   |                | --range limited --chromaloc 0                                 |
270
+   +----------------+---------------------------------------------------------------+
271
+   | BT709_YCC      | --colorprim bt709 --transfer bt709 --colormatrix bt709        |
272
+   |                | --range limited --chromaloc 0                                 |
273
+   +----------------+---------------------------------------------------------------+
274
+   | BT709_RGB      | --colorprim bt709 --transfer bt709 --colormatrix gbr          |
275
+   |                | --range limited                                               |
276
+   +----------------+---------------------------------------------------------------+
277
+   | BT2020_YCC_NCL | --colorprim bt2020 --transfer bt2020-10 --colormatrix bt709   |
278
+   |                | --range limited --chromaloc 2                                 |
279
+   +----------------+---------------------------------------------------------------+
280
+   | BT2020_RGB     | --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc|
281
+   |                | --range limited                                               |
282
+   +----------------+---------------------------------------------------------------+
283
+   | BT2100_PQ_YCC  | --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc|
284
+   |                | --range limited --chromaloc 2                                 |
285
+   +----------------+---------------------------------------------------------------+
286
+   | BT2100_PQ_ICTCP| --colorprim bt2020 --transfer smpte2084 --colormatrix ictcp   |
287
+   |                | --range limited --chromaloc 2                                 |
288
+   +----------------+---------------------------------------------------------------+
289
+   | BT2100_PQ_RGB  | --colorprim bt2020 --transfer smpte2084 --colormatrix gbr     |
290
+   |                | --range limited                                               |
291
+   +----------------+---------------------------------------------------------------+
292
+   | BT2100_HLG_YCC | --colorprim bt2020 --transfer arib-std-b67                    |
293
+   |                | --colormatrix bt2020nc --range limited --chromaloc 2          |
294
+   +----------------+---------------------------------------------------------------+
295
+   | BT2100_HLG_RGB | --colorprim bt2020 --transfer arib-std-b67 --colormatrix gbr  |
296
+   |                | --range limited                                               |
297
+   +----------------+---------------------------------------------------------------+
298
+   | FR709_RGB      | --colorprim bt709 --transfer bt709 --colormatrix gbr          |
299
+   |                | --range full                                                  |
300
+   +----------------+---------------------------------------------------------------+
301
+   | FR2020_RGB     | --colorprim bt2020 --transfer bt2020-10 --colormatrix gbr     |
302
+   |                | --range full                                                  |
303
+   +----------------+---------------------------------------------------------------+
304
+   | FRP3D65_YCC    | --colorprim smpte432 --transfer bt709 --colormatrix smpte170m |
305
+   |                | --range full --chromaloc 1                                    |
306
+   +----------------+---------------------------------------------------------------+
307
+
308
+   color-volume options and their corresponding values:
309
+   +----------------+---------------------------------------------------------------+
310
+   | color-volume   | Value                                                         |
311
+   +================+===============================================================+
312
+   | P3D65x1000n0005| --master-display G(13250,34500)B(7500,3000)R(34000,16000)     |
313
+   |                |                  WP(15635,16450)L(10000000,5)                 |
314
+   +----------------+---------------------------------------------------------------+
315
+   | P3D65x4000n005 | --master-display G(13250,34500)B(7500,3000)R(34000,16000)     |
316
+   |                |                  WP(15635,16450)L(40000000,50)                |
317
+   +----------------+---------------------------------------------------------------+
318
+   | BT2100x108n0005| --master-display G(8500,39850)B(6550,2300)R(34000,146000)     |
319
+   |                |                  WP(15635,16450)L(10000000,1)                 |
320
+   +----------------+---------------------------------------------------------------+
321
+
322
+   Note: The color-volume options can be used only with the system-id options BT2100_PQ_YCC,
323
+   BT2100_PQ_ICTCP, and BT2100_PQ_RGB; they are incompatible with the other system-id options.
324
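+   For example, a sketch of signalling a PQ (SMPTE ST 2084) stream mastered on a P3 D65,
+   1000-nit display::
+
+      --video-signal-type-preset BT2100_PQ_YCC:P3D65x1000n0005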
+
325
+
326
 Bitstream options
327
 =================
328
 
329
@@ -2454,6 +2577,16 @@
330
    the very first AUD will be skipped since it cannot be placed at the
331
    start of the access unit, where it belongs. Default disabled
332
 
333
+.. option:: --eob, --no-eob
334
+
335
+   Emit an end of bitstream NAL unit at the end of the bitstream.
336
+   Default disabled
337
+
338
+.. option:: --eos, --no-eos
339
+
340
+   Emit an end of sequence NAL unit at the end of every coded
341
+   video sequence. Default disabled
342
+
343
 .. option:: --hrd, --no-hrd
344
 
345
    Enable the signaling of HRD parameters to the decoder. The HRD
346
@@ -2480,7 +2613,7 @@
347
     The value is specified as a float or as an integer with the profile times 10,
348
     for example profile 5 is specified as "5" or "5.0" or "50".
349
     
350
-    Currently only profile 5, profile 8.1 and profile 8.2 enabled, Default 0 (disabled)
351
+    Currently only profile 5, profile 8.1, profile 8.2 and profile 8.4 are enabled. Default 0 (disabled)
352
 
353
 .. option:: --dolby-vision-rpu <filename>
354
 
355
@@ -2509,17 +2642,26 @@
356
    2. CRC
357
    3. Checksum
358
 
359
-.. option:: --temporal-layers,--no-temporal-layers
360
+.. option:: --temporal-layers <integer>
361
 
362
-   Enable a temporal sub layer. All referenced I/P/B frames are in the
363
-   base layer and all unreferenced B frames are placed in a temporal
364
-   enhancement layer. A decoder may choose to drop the enhancement layer 
365
-   and only decode and display the base layer slices.
366
-   
367
-   If used with a fixed GOP (:option:`--b-adapt` 0) and :option:`--bframes`
368
-   3 then the two layers evenly split the frame rate, with a cadence of
369
-   PbBbP. You probably also want :option:`--no-scenecut` and a keyframe
370
-   interval that is a multiple of 4.
371
+   Enable the specified number of temporal sub-layers. For any frame in layer N,
372
+   all referenced frames are in layer N or N-1. A decoder may choose to drop the enhancement layers
373
+   and decode and display only the base layer slices. The allowed number of temporal sub-layers
374
+   is 2 to 5 (2 and 5 inclusive).
375
+
376
+   When enabled, temporal layers 3 through 5 configure a fixed miniGOP with the number of bframes
377
+   shown below, unless the miniGOP size is modified by lookahead decisions. Temporal layer 2 is a
378
+   special case that places all reference frames in the base layer and non-reference frames in the
379
+   enhancement layer, without any constraint on the number of bframes. Default disabled.
380
+   +----------------+--------+
381
+   | temporal layer | bframes|
382
+   +================+========+
383
+   | 3              | 3      |
384
+   +----------------+--------+
385
+   | 4              | 7      |
386
+   +----------------+--------+
387
+   | 5              | 15     |
388
+   +----------------+--------+
389
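+   For example, a sketch of an invocation enabling three temporal sub-layers
+   (file names are placeholders)::
+
+      x265 --input source.y4m --temporal-layers 3 --output out.hevc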
 
390
 .. option:: --log2-max-poc-lsb <integer>
391
 
392
@@ -2564,6 +2706,12 @@
393
    Emit SEI messages in a single NAL unit instead of multiple NALs. Default disabled.
394
    When HRD SEI is enabled the HM decoder will throw a warning.
395
 
396
+.. option:: --film-grain <filename>
397
+
398
+    Specify a file containing the film grain model characteristics, which are signalled as an SEI message to support Film Grain Synthesis (FGS).
399
+
400
+    **CLI ONLY**
401
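+    For example, a sketch passing a film grain characteristics file (the file name is a placeholder)::
+
+       x265 --input source.y4m --film-grain film_grain.cfg --output out.hevc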
+
402
 DCT Approximations
403
 =================
404
 
405
x265_3.5.tar.gz/doc/reST/introduction.rst -> x265_3.6.tar.gz/doc/reST/introduction.rst Changed
9
 
1
@@ -77,6 +77,6 @@
2
 to start is with the `Motion Picture Experts Group - Licensing Authority
3
 - HEVC Licensing Program <http://www.mpegla.com/main/PID/HEVC/default.aspx>`_.
4
 
5
-x265 is a registered trademark of MulticoreWare, Inc.  The x265 logo is
6
+x265 is a registered trademark of MulticoreWare, Inc.  The X265 logo is
7
 a trademark of MulticoreWare, and may only be used with explicit written
8
 permission.  All rights reserved.
9
x265_3.5.tar.gz/doc/reST/releasenotes.rst -> x265_3.6.tar.gz/doc/reST/releasenotes.rst Changed
55
 
1
@@ -2,6 +2,53 @@
2
 Release Notes
3
 *************
4
 
5
+Version 3.6
6
+===========
7
+
8
+Release date - 4th April, 2024.
9
+
10
+New features
11
+------------
12
+1. Segment based Ratecontrol (SBRC) feature
13
+2. Motion-Compensated Spatio-Temporal Filtering
14
+3. Scene-cut aware qp - BBAQ (Bidirectional Boundary Aware Quantization)
15
+4. Histogram-Based Scene Change Detection
16
+5. Film-Grain characteristics as a SEI message to support Film Grain Synthesis(FGS)
17
+6. Add temporal layer implementation(Hierarchical B-frame implementation)
18
+ 
19
+Enhancements to existing features
20
+---------------------------------
21
+1. Added Dolby Vision 8.4 Profile Support
22
+
23
+
24
+API changes
25
+-----------
26
+1. Add Segment based Ratecontrol(SBRC) feature: "--no-sbrc".
27
+2. Add command line parameter for mcstf feature: "--no-mctf".
28
+3. Add command line parameters for the scene cut aware qp feature: "--scenecut-aware-qp" and "--masking-strength".
29
+4. Add command line parameters for Histogram-Based Scene Change Detection: "--hist-scenecut".
30
+5. Add film grain characteristics as a SEI message to the bitstream: "--film-grain <filename>"
31
+6. cli: add new option --cra-nal (Force nal type to CRA for all frames except the first frame; works only with keyint 1)
32
+
33
+Optimizations
34
+---------------------
35
+ARM64 NEON optimizations: several time-consuming C functions have been optimized for the targeted platform (aarch64). The overall performance increased by around 20%.
36
+SVE/SVE2 optimizations
37
+
38
+
39
+Bug fixes
40
+---------
41
+1. Linux bug to utilize all the cores
42
+2. Crash with hist-scenecut build when source resolution is not multiple of minCuSize
43
+3. 32bit and 64bit builds generation for ARM
44
+4. Bugs in the zonefile feature (reflect zonefile parameters inside lookahead, extra IDR issue, Avg I Slice QP value issue, etc.)
45
+5. Add x86 ASM implementation for subsampling luma 
46
+6. Fix for abrladder segfault with load reuse level 1 
47
+7. Reorder miniGOP based on temporal layer hierarchy and add support for more B frame 
48
+8. Add MacOS aarch64 build support 
49
+9. Fix boundary condition issue for Gaussian filter
50
+
51
+
52
 Version 3.5
53
 ===========
54
 
55
x265_3.5.tar.gz/readme.rst -> x265_3.6.tar.gz/readme.rst Changed
10
 
1
@@ -2,7 +2,7 @@
2
 x265 HEVC Encoder
3
 =================
4
 
5
-| **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
6
+| **Read:** | Online `documentation <http://x265.readthedocs.org/en/master/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265_git/wiki/>`_
7
 | **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_ 
8
 | **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
9
 
10
x265_3.5.tar.gz/source/CMakeLists.txt -> x265_3.6.tar.gz/source/CMakeLists.txt Changed
232
 
1
@@ -29,7 +29,7 @@
2
 option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
3
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
4
 # X265_BUILD must be incremented each time the public API is changed
5
-set(X265_BUILD 199)
6
+set(X265_BUILD 209)
7
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
8
                "${PROJECT_BINARY_DIR}/x265.def")
9
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
10
@@ -38,14 +38,20 @@
11
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
12
 
13
 # System architecture detection
14
-string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC)
15
+if (APPLE AND CMAKE_OSX_ARCHITECTURES)
16
+    string(TOLOWER "${CMAKE_OSX_ARCHITECTURES}" SYSPROC)
17
+else()
18
+    string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC)
19
+endif()
20
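+# For example (a sketch), an Apple arm64 build could be configured with:
+#   cmake -DCMAKE_OSX_ARCHITECTURES=arm64 ../source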
 set(X86_ALIASES x86 i386 i686 x86_64 amd64)
21
-set(ARM_ALIASES armv6l armv7l aarch64)
22
+set(ARM_ALIASES armv6l armv7l)
23
+set(ARM64_ALIASES arm64 arm64e aarch64)
24
 list(FIND X86_ALIASES "${SYSPROC}" X86MATCH)
25
 list(FIND ARM_ALIASES "${SYSPROC}" ARMMATCH)
26
-set(POWER_ALIASES ppc64 ppc64le)
27
+list(FIND ARM64_ALIASES "${SYSPROC}" ARM64MATCH)
28
+set(POWER_ALIASES powerpc64 powerpc64le ppc64 ppc64le)
29
 list(FIND POWER_ALIASES "${SYSPROC}" POWERMATCH)
30
-if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1")
31
+if(X86MATCH GREATER "-1")
32
     set(X86 1)
33
     add_definitions(-DX265_ARCH_X86=1)
34
     if(CMAKE_CXX_FLAGS STREQUAL "-m32")
35
@@ -70,15 +76,18 @@
36
     else()
37
         set(CROSS_COMPILE_ARM 0)
38
     endif()
39
+   message(STATUS "Detected ARM target processor")
40
     set(ARM 1)
41
-    if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
42
-        message(STATUS "Detected ARM64 target processor")
43
-        set(ARM64 1)
44
-        add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0)
45
-    else()
46
-        message(STATUS "Detected ARM target processor")
47
-        add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1)
48
-    endif()
49
+    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
50
+elseif(ARM64MATCH GREATER "-1")
51
+    #if(CROSS_COMPILE_ARM64)
52
+        #message(STATUS "Cross compiling for ARM64 arch")
53
+    #else()
54
+        #set(CROSS_COMPILE_ARM64 0)
55
+    #endif()
56
+    message(STATUS "Detected ARM64 target processor")
57
+    set(ARM64 1)
58
+    add_definitions(-DX265_ARCH_ARM64=1 -DHAVE_NEON)
59
 else()
60
     message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown")
61
     message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}")
62
@@ -239,26 +248,43 @@
63
         endif()
64
     endif()
65
     if(ARM AND CROSS_COMPILE_ARM)
66
-        if(ARM64)
67
-            set(ARM_ARGS -fPIC)
68
-        else()
69
-            set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC)
70
-        endif()
71
         message(STATUS "cross compile arm")
72
+       set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC)
73
     elseif(ARM)
74
-        if(ARM64)
75
-            set(ARM_ARGS -fPIC)
76
+        find_package(Neon)
77
+        if(CPU_HAS_NEON)
78
+            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
79
             add_definitions(-DHAVE_NEON)
80
         else()
81
-            find_package(Neon)
82
-            if(CPU_HAS_NEON)
83
-                set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
84
-                add_definitions(-DHAVE_NEON)
85
-            else()
86
-                set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
87
-            endif()
88
+            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
89
         endif()
90
     endif()
91
+   if(ARM64 OR CROSS_COMPILE_ARM64)
92
+        find_package(Neon)
93
+        find_package(SVE)
94
+        find_package(SVE2)
95
+        if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2)
96
+            message(STATUS "Found SVE2")
97
+           set(ARM_ARGS -O3 -march=armv8-a+sve2 -fPIC -flax-vector-conversions)
98
+            add_definitions(-DHAVE_SVE2)
99
+            add_definitions(-DHAVE_SVE)
100
+            add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that uses SVE2
101
+        elseif(CPU_HAS_SVE OR CROSS_COMPILE_SVE)
102
+            message(STATUS "Found SVE")
103
+           set(ARM_ARGS -O3 -march=armv8-a+sve -fPIC -flax-vector-conversions)
104
+            add_definitions(-DHAVE_SVE)
105
+            add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that uses SVE
106
+        elseif(CPU_HAS_NEON)
107
+            message(STATUS "Found NEON")
108
+            set(ARM_ARGS -fPIC -flax-vector-conversions)
109
+            add_definitions(-DHAVE_NEON)
110
+        else()
111
+            set(ARM_ARGS -fPIC -flax-vector-conversions)
112
+        endif()        
113
+    endif()
114
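+    # A sketch of forcing the SVE2 path when cross compiling (the toolchain file name is a placeholder):
+    #   cmake -DCROSS_COMPILE_ARM64=1 -DCROSS_COMPILE_SVE2=1 -DCMAKE_TOOLCHAIN_FILE=aarch64-toolchain.cmake ../source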
+   if(ENABLE_PIC)
115
+   list(APPEND ARM_ARGS -DPIC)
116
+   endif()
117
     add_definitions(${ARM_ARGS})
118
     if(FPROFILE_GENERATE)
119
         if(INTEL_CXX)
120
@@ -350,7 +376,7 @@
121
 endif(GCC)
122
 
123
 find_package(Nasm)
124
-if(ARM OR CROSS_COMPILE_ARM)
125
+if(ARM OR CROSS_COMPILE_ARM OR ARM64 OR CROSS_COMPILE_ARM64)
126
     option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON)
127
 elseif(NASM_FOUND AND X86)
128
     if (NASM_VERSION_STRING VERSION_LESS "2.13.0")
129
@@ -384,7 +410,7 @@
130
 endif(EXTRA_LIB)
131
 mark_as_advanced(EXTRA_LIB EXTRA_LINK_FLAGS)
132
 
133
-if(X64)
134
+if(X64 OR ARM64 OR PPC64)
135
     # NOTE: We only officially support high-bit-depth compiles of x265
136
     # on 64bit architectures. Main10 plus large resolution plus slow
137
     # preset plus 32bit address space usually means malloc failure.  You
138
@@ -393,7 +419,7 @@
139
     # license" so to speak.  If it breaks you get to keep both halves.
140
     # You will need to disable assembly manually.
141
     option(HIGH_BIT_DEPTH "Store pixel samples as 16bit values (Main10/Main12)" OFF)
142
-endif(X64)
143
+endif(X64 OR ARM64 OR PPC64)
144
 if(HIGH_BIT_DEPTH)
145
     option(MAIN12 "Support Main12 instead of Main10" OFF)
146
     if(MAIN12)
147
@@ -440,6 +466,18 @@
148
 endif()
149
 add_definitions(-DX265_NS=${X265_NS})
150
 
151
+if(ARM64)
152
+  if(HIGH_BIT_DEPTH)
153
+    if(MAIN12)
154
+      list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=12 -DX265_NS=${X265_NS})
155
+    else()
156
+      list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS})
157
+    endif()
158
+  else()
159
+    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS})
160
+  endif()
161
+endif(ARM64)
162
+
163
 option(WARNINGS_AS_ERRORS "Stop compiles on first warning" OFF)
164
 if(WARNINGS_AS_ERRORS)
165
     if(GCC)
166
@@ -536,11 +574,7 @@
167
     # compile ARM arch asm files here
168
         enable_language(ASM)
169
         foreach(ASM ${ARM_ASMS})
170
-            if(ARM64)
171
-                set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM})
172
-            else()
173
-                set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/arm/${ASM})
174
-            endif()
175
+           set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/arm/${ASM})
176
             list(APPEND ASM_SRCS ${ASM_SRC})
177
             list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
178
             add_custom_command(
179
@@ -549,6 +583,52 @@
180
                 ARGS ${ARM_ARGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
181
                 DEPENDS ${ASM_SRC})
182
         endforeach()
183
+   elseif(ARM64 OR CROSS_COMPILE_ARM64)
184
+    # compile ARM64 arch asm files here
185
+        enable_language(ASM)
186
+        foreach(ASM ${ARM_ASMS})
187
+            set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM})
188
+            list(APPEND ASM_SRCS ${ASM_SRC})
189
+            list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
190
+            add_custom_command(
191
+                OUTPUT ${ASM}.${SUFFIX}
192
+                COMMAND ${CMAKE_CXX_COMPILER}
193
+                ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
194
+                DEPENDS ${ASM_SRC})
195
+        endforeach()
196
+        if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2)
197
+            foreach(ASM ${ARM_ASMS_SVE})
198
+                set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM})
199
+                list(APPEND ASM_SRCS ${ASM_SRC})
200
+                list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
201
+                add_custom_command(
202
+                    OUTPUT ${ASM}.${SUFFIX}
203
+                    COMMAND ${CMAKE_CXX_COMPILER}
204
+                    ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
205
+                    DEPENDS ${ASM_SRC})
206
+            endforeach()
207
+            foreach(ASM ${ARM_ASMS_SVE2})
208
+                set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM})
209
+                list(APPEND ASM_SRCS ${ASM_SRC})
210
+                list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
211
+                add_custom_command(
212
+                    OUTPUT ${ASM}.${SUFFIX}
213
+                    COMMAND ${CMAKE_CXX_COMPILER}
214
+                    ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
215
+                    DEPENDS ${ASM_SRC})
216
+            endforeach()
217
+        elseif(CPU_HAS_SVE OR CROSS_COMPILE_SVE)
218
+            foreach(ASM ${ARM_ASMS_SVE})
219
+                set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM})
220
+                list(APPEND ASM_SRCS ${ASM_SRC})
221
+                list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
222
+                add_custom_command(
223
+                    OUTPUT ${ASM}.${SUFFIX}
224
+                    COMMAND ${CMAKE_CXX_COMPILER}
225
+                    ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
226
+                    DEPENDS ${ASM_SRC})
227
+            endforeach()
228
+        endif()
229
     elseif(X86)
230
     # compile X86 arch asm files here
231
         foreach(ASM ${MSVC_ASMS})
232
x265_3.5.tar.gz/source/abrEncApp.cpp -> x265_3.6.tar.gz/source/abrEncApp.cpp Changed
2220
 
1
@@ -1,1111 +1,1111 @@
2
-/*****************************************************************************
3
-* Copyright (C) 2013-2020 MulticoreWare, Inc
4
-*
5
-* Authors: Pooja Venkatesan <pooja@multicorewareinc.com>
6
-*          Aruna Matheswaran <aruna@multicorewareinc.com>
7
-*
8
-* This program is free software; you can redistribute it and/or modify
9
-* it under the terms of the GNU General Public License as published by
10
-* the Free Software Foundation; either version 2 of the License, or
11
-* (at your option) any later version.
12
-*
13
-* This program is distributed in the hope that it will be useful,
14
-* but WITHOUT ANY WARRANTY; without even the implied warranty of
15
-* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
16
-* GNU General Public License for more details.
17
-*
18
-* You should have received a copy of the GNU General Public License
19
-* along with this program; if not, write to the Free Software
20
-* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
21
-*
22
-* This program is also available under a commercial proprietary license.
23
-* For more information, contact us at license @ x265.com.
24
-*****************************************************************************/
25
-
26
-#include "abrEncApp.h"
27
-#include "mv.h"
28
-#include "slice.h"
29
-#include "param.h"
30
-
31
-#include <signal.h>
32
-#include <errno.h>
33
-
34
-#include <queue>
35
-
36
-using namespace X265_NS;
37
-
38
-/* Ctrl-C handler */
39
-static volatile sig_atomic_t b_ctrl_c /* = 0 */;
40
-static void sigint_handler(int)
41
-{
42
-    b_ctrl_c = 1;
43
-}
44
-
45
-namespace X265_NS {
46
-    // private namespace
47
-#define X265_INPUT_QUEUE_SIZE 250
48
-
49
-    AbrEncoder::AbrEncoder(CLIOptions cliopt[], uint8_t numEncodes, int &ret)
50
-    {
51
-        m_numEncodes = numEncodes;
52
-        m_numActiveEncodes.set(numEncodes);
53
-        m_queueSize = (numEncodes > 1) ? X265_INPUT_QUEUE_SIZE : 1;
54
-        m_passEnc = X265_MALLOC(PassEncoder*, m_numEncodes);
55
-
56
-        for (uint8_t i = 0; i < m_numEncodes; i++)
57
-        {
58
-            m_passEnc[i] = new PassEncoder(i, cliopt[i], this);
59
-            if (!m_passEnc[i])
60
-            {
61
-                x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for passEncoder\n");
62
-                ret = 4;
63
-            }
64
-            m_passEnc[i]->init(ret);
65
-        }
66
-
67
-        if (!allocBuffers())
68
-        {
69
-            x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for buffers\n");
70
-            ret = 4;
71
-        }
72
-
73
-        /* start passEncoder worker threads */
74
-        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
75
-            m_passEnc[pass]->startThreads();
76
-    }
77
-
78
-    bool AbrEncoder::allocBuffers()
79
-    {
80
-        m_inputPicBuffer = X265_MALLOC(x265_picture**, m_numEncodes);
81
-        m_analysisBuffer = X265_MALLOC(x265_analysis_data*, m_numEncodes);
82
-
83
-        m_picWriteCnt = new ThreadSafeInteger[m_numEncodes];
84
-        m_picReadCnt = new ThreadSafeInteger[m_numEncodes];
85
-        m_analysisWriteCnt = new ThreadSafeInteger[m_numEncodes];
86
-        m_analysisReadCnt = new ThreadSafeInteger[m_numEncodes];
87
-
88
-        m_picIdxReadCnt = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
89
-        m_analysisWrite = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
90
-        m_analysisRead = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
91
-        m_readFlag = X265_MALLOC(int*, m_numEncodes);
92
-
93
-        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
94
-        {
95
-            m_inputPicBuffer[pass] = X265_MALLOC(x265_picture*, m_queueSize);
96
-            for (uint32_t idx = 0; idx < m_queueSize; idx++)
97
-            {
98
-                m_inputPicBuffer[pass][idx] = x265_picture_alloc();
99
-                x265_picture_init(m_passEnc[pass]->m_param, m_inputPicBuffer[pass][idx]);
100
-            }
101
-
102
-            CHECKED_MALLOC_ZERO(m_analysisBuffer[pass], x265_analysis_data, m_queueSize);
103
-            m_picIdxReadCnt[pass] = new ThreadSafeInteger[m_queueSize];
104
-            m_analysisWrite[pass] = new ThreadSafeInteger[m_queueSize];
105
-            m_analysisRead[pass] = new ThreadSafeInteger[m_queueSize];
106
-            m_readFlag[pass] = X265_MALLOC(int, m_queueSize);
107
-        }
108
-        return true;
109
-    fail:
110
-        return false;
111
-    }
112
-
113
-    void AbrEncoder::destroy()
114
-    {
115
-        x265_cleanup(); /* Free library singletons */
116
-        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
117
-        {
118
-            for (uint32_t index = 0; index < m_queueSize; index++)
119
-            {
120
-                X265_FREE(m_inputPicBuffer[pass][index]->planes[0]);
121
-                x265_picture_free(m_inputPicBuffer[pass][index]);
122
-            }
123
-
124
-            X265_FREE(m_inputPicBuffer[pass]);
125
-            X265_FREE(m_analysisBuffer[pass]);
126
-            X265_FREE(m_readFlag[pass]);
127
-            delete[] m_picIdxReadCnt[pass];
128
-            delete[] m_analysisWrite[pass];
129
-            delete[] m_analysisRead[pass];
130
-            m_passEnc[pass]->destroy();
131
-            delete m_passEnc[pass];
132
-        }
133
-        X265_FREE(m_inputPicBuffer);
134
-        X265_FREE(m_analysisBuffer);
135
-        X265_FREE(m_readFlag);
136
-
137
-        delete[] m_picWriteCnt;
138
-        delete[] m_picReadCnt;
139
-        delete[] m_analysisWriteCnt;
140
-        delete[] m_analysisReadCnt;
141
-
142
-        X265_FREE(m_picIdxReadCnt);
143
-        X265_FREE(m_analysisWrite);
144
-        X265_FREE(m_analysisRead);
145
-
146
-        X265_FREE(m_passEnc);
147
-    }
148
-
149
-    PassEncoder::PassEncoder(uint32_t id, CLIOptions cliopt, AbrEncoder *parent)
150
-    {
151
-        m_id = id;
152
-        m_cliopt = cliopt;
153
-        m_parent = parent;
154
-        if(!(m_cliopt.enableScaler && m_id))
155
-            m_input = m_cliopt.input;
156
-        m_param = cliopt.param;
157
-        m_inputOver = false;
158
-        m_lastIdx = -1;
159
-        m_encoder = NULL;
160
-        m_scaler = NULL;
161
-        m_reader = NULL;
162
-        m_ret = 0;
163
-    }
164
-
165
-    int PassEncoder::init(int &result)
166
-    {
167
-        if (m_parent->m_numEncodes > 1)
168
-            setReuseLevel();
169
-                
170
-        if (!(m_cliopt.enableScaler && m_id))
171
-            m_reader = new Reader(m_id, this);
172
-        else
173
-        {
174
-            VideoDesc *src = NULL, *dst = NULL;
175
-            dst = new VideoDesc(m_param->sourceWidth, m_param->sourceHeight, m_param->internalCsp, m_param->internalBitDepth);
176
-            int dstW = m_parent->m_passEnc[m_id - 1]->m_param->sourceWidth;
177
-            int dstH = m_parent->m_passEnc[m_id - 1]->m_param->sourceHeight;
178
-            src = new VideoDesc(dstW, dstH, m_param->internalCsp, m_param->internalBitDepth);
179
-            if (src != NULL && dst != NULL)
180
-            {
181
-                m_scaler = new Scaler(0, 1, m_id, src, dst, this);
182
-                if (!m_scaler)
183
-                {
184
-                    x265_log(m_param, X265_LOG_ERROR, "\n MALLOC failure in Scaler");
185
-                    result = 4;
186
-                }
187
-            }
188
-        }
189
-
190
-        /* note: we could try to acquire a different libx265 API here based on
191
-        * the profile found during option parsing, but it must be done before
192
-        * opening an encoder */
193
-
194
-        if (m_param)
195
-            m_encoder = m_cliopt.api->encoder_open(m_param);
196
-        if (!m_encoder)
197
-        {
198
-            x265_log(NULL, X265_LOG_ERROR, "x265_encoder_open() failed for Enc, \n");
199
-            m_ret = 2;
200
-            return -1;
201
-        }
202
-
203
-        /* get the encoder parameters post-initialization */
204
-        m_cliopt.api->encoder_parameters(m_encoder, m_param);
205
-
206
-        return 1;
207
-    }
208
-
209
-    void PassEncoder::setReuseLevel()
210
-    {
211
-        uint32_t r, padh = 0, padw = 0;
212
-
213
-        m_param->confWinBottomOffset = m_param->confWinRightOffset = 0;
214
-
215
-        m_param->analysisLoadReuseLevel = m_cliopt.loadLevel;
216
-        m_param->analysisSaveReuseLevel = m_cliopt.saveLevel;
217
-        m_param->analysisSave = m_cliopt.saveLevel ? "save.dat" : NULL;
218
-        m_param->analysisLoad = m_cliopt.loadLevel ? "load.dat" : NULL;
219
-        m_param->bUseAnalysisFile = 0;
220
-
221
-        if (m_cliopt.loadLevel)
222
-        {
223
-            x265_param *refParam = m_parent->m_passEnc[m_cliopt.refId]->m_param;
224
-
225
-            if (m_param->sourceHeight == (refParam->sourceHeight - refParam->confWinBottomOffset) &&
226
-                m_param->sourceWidth == (refParam->sourceWidth - refParam->confWinRightOffset))
227
-            {
228
-                m_parent->m_passEnc[m_id]->m_param->confWinBottomOffset = refParam->confWinBottomOffset;
229
-                m_parent->m_passEnc[m_id]->m_param->confWinRightOffset = refParam->confWinRightOffset;
230
-            }
231
-            else
232
-            {
233
-                int srcH = refParam->sourceHeight - refParam->confWinBottomOffset;
234
-                int srcW = refParam->sourceWidth - refParam->confWinRightOffset;
235
-
236
-                double scaleFactorH = double(m_param->sourceHeight / srcH);
237
-                double scaleFactorW = double(m_param->sourceWidth / srcW);
238
-
239
-                int absScaleFactorH = (int)(10 * scaleFactorH + 0.5);
240
-                int absScaleFactorW = (int)(10 * scaleFactorW + 0.5);
241
-
242
-                if (absScaleFactorH == 20 && absScaleFactorW == 20)
243
-                {
244
-                    m_param->scaleFactor = 2;
245
-
246
-                    m_parent->m_passEnc[m_id]->m_param->confWinBottomOffset = refParam->confWinBottomOffset * 2;
247
-                    m_parent->m_passEnc[m_id]->m_param->confWinRightOffset = refParam->confWinRightOffset * 2;
248
-
249
-                }
250
-            }
251
-        }
252
-
253
-        int h = m_param->sourceHeight + m_param->confWinBottomOffset;
254
-        int w = m_param->sourceWidth + m_param->confWinRightOffset;
255
-        if (h & (m_param->minCUSize - 1))
256
-        {
257
-            r = h & (m_param->minCUSize - 1);
258
-            padh = m_param->minCUSize - r;
259
-            m_param->confWinBottomOffset += padh;
260
-
261
-        }
262
-
263
-        if (w & (m_param->minCUSize - 1))
264
-        {
265
-            r = w & (m_param->minCUSize - 1);
266
-            padw = m_param->minCUSize - r;
267
-            m_param->confWinRightOffset += padw;
268
-        }
269
-    }
270
-
271
-    void PassEncoder::startThreads()
272
-    {
273
-        /* Start slave worker threads */
274
-        m_threadActive = true;
275
-        start();
276
-        /* Start reader threads*/
277
-        if (m_reader != NULL)
278
-        {
279
-            m_reader->m_threadActive = true;
280
-            m_reader->start();
281
-        }
282
-        /* Start scaling worker threads */
283
-        if (m_scaler != NULL)
284
-        {
285
-            m_scaler->m_threadActive = true;
286
-            m_scaler->start();
287
-        }
288
-    }
289
-
290
-    void PassEncoder::copyInfo(x265_analysis_data * src)
291
-    {
292
-
293
-        uint32_t written = m_parent->m_analysisWriteCnt[m_id].get();
294
-
295
-        int index = written % m_parent->m_queueSize;
296
-        //If all streams have read analysis data, reuse that position in Queue
297
-
298
-        int read = m_parent->m_analysisRead[m_id][index].get();
299
-        int write = m_parent->m_analysisWrite[m_id][index].get();
300
-
301
-        int overwrite = written / m_parent->m_queueSize;
302
-        bool emptyIdxFound = 0;
303
-        while (!emptyIdxFound && overwrite)
304
-        {
305
-            for (uint32_t i = 0; i < m_parent->m_queueSize; i++)
306
-            {
307
-                read = m_parent->m_analysisRead[m_id][i].get();
308
-                write = m_parent->m_analysisWrite[m_id][i].get();
309
-                write *= m_cliopt.numRefs;
310
-
311
-                if (read == write)
312
-                {
313
-                    index = i;
314
-                    emptyIdxFound = 1;
315
-                }
316
-            }
317
-        }
318
-
319
-        x265_analysis_data *m_analysisInfo = &m_parent->m_analysisBuffer[m_id][index];
320
-
321
-        x265_free_analysis_data(m_param, m_analysisInfo);
322
-        memcpy(m_analysisInfo, src, sizeof(x265_analysis_data));
323
-        x265_alloc_analysis_data(m_param, m_analysisInfo);
324
-
325
-        bool isVbv = m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate;
326
-        if (m_param->bDisableLookahead && isVbv)
327
-        {
328
-            memcpy(m_analysisInfo->lookahead.intraSatdForVbv, src->lookahead.intraSatdForVbv, src->numCuInHeight * sizeof(uint32_t));
329
-            memcpy(m_analysisInfo->lookahead.satdForVbv, src->lookahead.satdForVbv, src->numCuInHeight * sizeof(uint32_t));
330
-            memcpy(m_analysisInfo->lookahead.intraVbvCost, src->lookahead.intraVbvCost, src->numCUsInFrame * sizeof(uint32_t));
331
-            memcpy(m_analysisInfo->lookahead.vbvCost, src->lookahead.vbvCost, src->numCUsInFrame * sizeof(uint32_t));
332
-        }
333
-
334
-        if (src->sliceType == X265_TYPE_IDR || src->sliceType == X265_TYPE_I)
335
-        {
336
-            if (m_param->analysisSaveReuseLevel < 2)
337
-                goto ret;
338
-            x265_analysis_intra_data *intraDst, *intraSrc;
339
-            intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData;
340
-            intraSrc = (x265_analysis_intra_data*)src->intraData;
341
-            memcpy(intraDst->depth, intraSrc->depth, sizeof(uint8_t) * src->depthBytes);
342
-            memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numCUsInFrame * src->numPartitions);
343
-            memcpy(intraDst->partSizes, intraSrc->partSizes, sizeof(char) * src->depthBytes);
344
-            memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes);
345
-            if (m_param->rc.cuTree)
346
-                memcpy(intraDst->cuQPOff, intraSrc->cuQPOff, sizeof(int8_t) * src->depthBytes);
347
-        }
348
-        else
349
-        {
350
-            bool bIntraInInter = (src->sliceType == X265_TYPE_P || m_param->bIntraInBFrames);
351
-            int numDir = src->sliceType == X265_TYPE_P ? 1 : 2;
352
-            memcpy(m_analysisInfo->wt, src->wt, sizeof(WeightParam) * 3 * numDir);
353
-            if (m_param->analysisSaveReuseLevel < 2)
354
-                goto ret;
355
-            x265_analysis_inter_data *interDst, *interSrc;
356
-            interDst = (x265_analysis_inter_data*)m_analysisInfo->interData;
357
-            interSrc = (x265_analysis_inter_data*)src->interData;
358
-            memcpy(interDst->depth, interSrc->depth, sizeof(uint8_t) * src->depthBytes);
359
-            memcpy(interDst->modes, interSrc->modes, sizeof(uint8_t) * src->depthBytes);
360
-            if (m_param->rc.cuTree)
361
-                memcpy(interDst->cuQPOff, interSrc->cuQPOff, sizeof(int8_t) * src->depthBytes);
362
-            if (m_param->analysisSaveReuseLevel > 4)
363
-            {
364
-                memcpy(interDst->partSize, interSrc->partSize, sizeof(uint8_t) * src->depthBytes);
365
-                memcpy(interDst->mergeFlag, interSrc->mergeFlag, sizeof(uint8_t) * src->depthBytes);
366
-                if (m_param->analysisSaveReuseLevel == 10)
367
-                {
368
-                    memcpy(interDst->interDir, interSrc->interDir, sizeof(uint8_t) * src->depthBytes);
369
-                    for (int dir = 0; dir < numDir; dir++)
370
-                    {
371
-                        memcpy(interDst->mvpIdx[dir], interSrc->mvpIdx[dir], sizeof(uint8_t) * src->depthBytes);
372
-                        memcpy(interDst->refIdx[dir], interSrc->refIdx[dir], sizeof(int8_t) * src->depthBytes);
373
-                        memcpy(interDst->mv[dir], interSrc->mv[dir], sizeof(MV) * src->depthBytes);
374
-                    }
375
-                    if (bIntraInInter)
376
-                    {
377
-                        x265_analysis_intra_data *intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData;
378
-                        x265_analysis_intra_data *intraSrc = (x265_analysis_intra_data*)src->intraData;
379
-                        memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numPartitions * src->numCUsInFrame);
380
-                        memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes);
381
-                    }
382
-               }
383
-            }
384
-            if (m_param->analysisSaveReuseLevel != 10)
385
-                memcpy(interDst->ref, interSrc->ref, sizeof(int32_t) * src->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir);
386
-        }
387
-
388
-ret:
389
-        //increment analysis Write counter 
390
-        m_parent->m_analysisWriteCnt[m_id].incr();
391
-        m_parent->m_analysisWrite[m_id][index].incr();
392
-        return;
393
-    }
394
-
395
-
396
-    bool PassEncoder::readPicture(x265_picture *dstPic)
397
-    {
398
-        /*Check and wait if there any input frames to read*/
399
-        int ipread = m_parent->m_picReadCnt[m_id].get();
400
-        int ipwrite = m_parent->m_picWriteCnt[m_id].get();
401
-
402
-        bool isAbrLoad = m_cliopt.loadLevel && (m_parent->m_numEncodes > 1);
403
-        while (!m_inputOver && (ipread == ipwrite))
404
-        {
405
-            ipwrite = m_parent->m_picWriteCnt[m_id].waitForChange(ipwrite);
406
-        }
407
-
408
-        if (m_threadActive && ipread < ipwrite)
409
-        {
410
-            /*Get input index to read from inputQueue. If doesn't need analysis info, it need not wait to fetch poc from analysisQueue*/
411
-            int readPos = ipread % m_parent->m_queueSize;
412
-            x265_analysis_data* analysisData = 0;
413
-
414
-            if (isAbrLoad)
415
-            {
416
-                /*If stream is master of each slave pass, then fetch analysis data from prev pass*/
417
-                int analysisQId = m_cliopt.refId;
418
-                /*Check and wait if there any analysis Data to read*/
419
-                int analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].get();
420
-                int written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
421
-                int analysisRead = m_parent->m_analysisReadCnt[analysisQId].get();
422
-                
423
-                while (m_threadActive && written == analysisRead)
424
-                {
425
-                    analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].waitForChange(analysisWrite);
426
-                    written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
427
-                }
428
-
429
-                if (analysisRead < written)
430
-                {
431
-                    int analysisIdx = 0;
432
-                    if (!m_param->bDisableLookahead)
433
-                    {
434
-                        bool analysisdRead = false;
435
-                        while ((analysisRead < written) && !analysisdRead)
436
-                        {
437
-                            while (analysisWrite < ipread)
438
-                            {
439
-                                analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].waitForChange(analysisWrite);
440
-                                written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
441
-                            }
442
-                            for (uint32_t i = 0; i < m_parent->m_queueSize; i++)
443
-                            {
444
-                                analysisData = &m_parent->m_analysisBuffer[analysisQId][i];
445
-                                int read = m_parent->m_analysisRead[analysisQId][i].get();
446
-                                int write = m_parent->m_analysisWrite[analysisQId][i].get() * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
447
-                                if ((analysisData->poc == (uint32_t)(ipread)) && (read < write))
448
-                                {
449
-                                    analysisIdx = i;
450
-                                    analysisdRead = true;
451
-                                    break;
452
-                                }
453
-                            }
454
-                        }
455
-                    }
456
-                    else
457
-                    {
458
-                        analysisIdx = analysisRead % m_parent->m_queueSize;
459
-                        analysisData = &m_parent->m_analysisBuffer[analysisQId][analysisIdx];
460
-                        readPos = analysisData->poc % m_parent->m_queueSize;
461
-                        while ((ipwrite < readPos) || ((ipwrite - 1) < (int)analysisData->poc))
462
-                        {
463
-                            ipwrite = m_parent->m_picWriteCnt[m_id].waitForChange(ipwrite);
464
-                        }
465
-                    }
466
-
467
-                    m_lastIdx = analysisIdx;
468
-                }
469
-                else
470
-                    return false;
471
-            }
472
-
473
-
474
-            x265_picture *srcPic = (x265_picture*)(m_parent->m_inputPicBuffer[m_id][readPos]);
475
-
476
-            x265_picture *pic = (x265_picture*)(dstPic);
477
-            pic->colorSpace = srcPic->colorSpace;
478
-            pic->bitDepth = srcPic->bitDepth;
479
-            pic->framesize = srcPic->framesize;
480
-            pic->height = srcPic->height;
481
-            pic->pts = srcPic->pts;
482
-            pic->dts = srcPic->dts;
483
-            pic->reorderedPts = srcPic->reorderedPts;
484
-            pic->width = srcPic->width;
485
-            pic->analysisData = srcPic->analysisData;
486
-            pic->userSEI = srcPic->userSEI;
487
-            pic->stride[0] = srcPic->stride[0];
488
-            pic->stride[1] = srcPic->stride[1];
489
-            pic->stride[2] = srcPic->stride[2];
490
-            pic->planes[0] = srcPic->planes[0];
491
-            pic->planes[1] = srcPic->planes[1];
492
-            pic->planes[2] = srcPic->planes[2];
493
-            if (isAbrLoad)
494
-                pic->analysisData = *analysisData;
495
-            return true;
496
-        }
497
-        else
498
-            return false;
499
-    }
500
-
501
-    void PassEncoder::threadMain()
502
-    {
503
+/*****************************************************************************
504
+* Copyright (C) 2013-2020 MulticoreWare, Inc
505
+*
506
+* Authors: Pooja Venkatesan <pooja@multicorewareinc.com>
507
+*          Aruna Matheswaran <aruna@multicorewareinc.com>
508
+*
509
+* This program is free software; you can redistribute it and/or modify
510
+* it under the terms of the GNU General Public License as published by
511
+* the Free Software Foundation; either version 2 of the License, or
512
+* (at your option) any later version.
513
+*
514
+* This program is distributed in the hope that it will be useful,
515
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
516
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
517
+* GNU General Public License for more details.
518
+*
519
+* You should have received a copy of the GNU General Public License
520
+* along with this program; if not, write to the Free Software
521
+* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
522
+*
523
+* This program is also available under a commercial proprietary license.
524
+* For more information, contact us at license @ x265.com.
525
+*****************************************************************************/
526
+
527
+#include "abrEncApp.h"
528
+#include "mv.h"
529
+#include "slice.h"
530
+#include "param.h"
531
+
532
+#include <signal.h>
533
+#include <errno.h>
534
+
535
+#include <queue>
536
+
537
+using namespace X265_NS;
538
+
539
+/* Ctrl-C handler */
540
+static volatile sig_atomic_t b_ctrl_c /* = 0 */;
541
+static void sigint_handler(int)
542
+{
543
+    b_ctrl_c = 1;
544
+}
545
+
546
+namespace X265_NS {
547
+    // private namespace
548
+#define X265_INPUT_QUEUE_SIZE 250
549
+
550
+    AbrEncoder::AbrEncoder(CLIOptions cliopt[], uint8_t numEncodes, int &ret)
551
+    {
552
+        m_numEncodes = numEncodes;
553
+        m_numActiveEncodes.set(numEncodes);
554
+        m_queueSize = (numEncodes > 1) ? X265_INPUT_QUEUE_SIZE : 1;
555
+        m_passEnc = X265_MALLOC(PassEncoder*, m_numEncodes);
556
+
557
+        for (uint8_t i = 0; i < m_numEncodes; i++)
558
+        {
559
+            m_passEnc[i] = new PassEncoder(i, cliopt[i], this);
560
+            if (!m_passEnc[i])
561
+            {
562
+                x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for passEncoder\n");
563
+                ret = 4;
564
+            }
565
+            m_passEnc[i]->init(ret);
566
+        }
567
+
568
+        if (!allocBuffers())
569
+        {
570
+            x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for buffers\n");
571
+            ret = 4;
572
+        }
573
+
574
+        /* start passEncoder worker threads */
575
+        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
576
+            m_passEnc[pass]->startThreads();
577
+    }
578
+
579
+    bool AbrEncoder::allocBuffers()
580
+    {
581
+        m_inputPicBuffer = X265_MALLOC(x265_picture**, m_numEncodes);
582
+        m_analysisBuffer = X265_MALLOC(x265_analysis_data*, m_numEncodes);
583
+
584
+        m_picWriteCnt = new ThreadSafeInteger[m_numEncodes];
585
+        m_picReadCnt = new ThreadSafeInteger[m_numEncodes];
586
+        m_analysisWriteCnt = new ThreadSafeInteger[m_numEncodes];
587
+        m_analysisReadCnt = new ThreadSafeInteger[m_numEncodes];
588
+
589
+        m_picIdxReadCnt = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
590
+        m_analysisWrite = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
591
+        m_analysisRead = X265_MALLOC(ThreadSafeInteger*, m_numEncodes);
592
+        m_readFlag = X265_MALLOC(int*, m_numEncodes);
593
+
594
+        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
595
+        {
596
+            m_inputPicBuffer[pass] = X265_MALLOC(x265_picture*, m_queueSize);
597
+            for (uint32_t idx = 0; idx < m_queueSize; idx++)
598
+            {
599
+                m_inputPicBuffer[pass][idx] = x265_picture_alloc();
600
+                x265_picture_init(m_passEnc[pass]->m_param, m_inputPicBuffer[pass][idx]);
601
+            }
602
+
603
+            CHECKED_MALLOC_ZERO(m_analysisBuffer[pass], x265_analysis_data, m_queueSize);
604
+            m_picIdxReadCnt[pass] = new ThreadSafeInteger[m_queueSize];
605
+            m_analysisWrite[pass] = new ThreadSafeInteger[m_queueSize];
606
+            m_analysisRead[pass] = new ThreadSafeInteger[m_queueSize];
607
+            m_readFlag[pass] = X265_MALLOC(int, m_queueSize);
608
+        }
609
+        return true;
610
+    fail:
611
+        return false;
612
+    }
613
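
AbrEncoder::allocBuffers() above sets up, per encoder pass, a ring of m_queueSize input pictures plus an analysis ring, each guarded by ThreadSafeInteger write/read counters. Below is a minimal sketch of the handshake those counters implement, reusing only the get()/incr()/waitForChange() calls visible in this file; it is illustrative and not part of the upstream sources.

    // Producer side: fill the next slot, then publish it by bumping the write counter.
    static void publishPicture(ThreadSafeInteger& writeCnt, x265_picture** ring,
                               uint32_t qSize, x265_picture* pic)
    {
        uint32_t w = writeCnt.get();
        ring[w % qSize] = pic;
        writeCnt.incr();              // wakes any thread blocked in waitForChange()
    }

    // Consumer side: block until the producer is ahead, then take the oldest slot.
    static x265_picture* fetchPicture(ThreadSafeInteger& writeCnt, ThreadSafeInteger& readCnt,
                                      x265_picture** ring, uint32_t qSize)
    {
        int r = readCnt.get();
        int w = writeCnt.get();
        while (r == w)
            w = writeCnt.waitForChange(w);   // sleep until the counter moves
        x265_picture* pic = ring[r % qSize];
        readCnt.incr();
        return pic;
    }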
+
614
+    void AbrEncoder::destroy()
615
+    {
616
+        x265_cleanup(); /* Free library singletons */
617
+        for (uint8_t pass = 0; pass < m_numEncodes; pass++)
618
+        {
619
+            for (uint32_t index = 0; index < m_queueSize; index++)
620
+            {
621
+                X265_FREE(m_inputPicBuffer[pass][index]->planes[0]);
622
+                x265_picture_free(m_inputPicBuffer[pass][index]);
623
+            }
624
+
625
+            X265_FREE(m_inputPicBuffer[pass]);
626
+            X265_FREE(m_analysisBuffer[pass]);
627
+            X265_FREE(m_readFlag[pass]);
628
+            delete[] m_picIdxReadCnt[pass];
629
+            delete[] m_analysisWrite[pass];
630
+            delete[] m_analysisRead[pass];
631
+            m_passEnc[pass]->destroy();
632
+            delete m_passEnc[pass];
633
+        }
634
+        X265_FREE(m_inputPicBuffer);
635
+        X265_FREE(m_analysisBuffer);
636
+        X265_FREE(m_readFlag);
637
+
638
+        delete[] m_picWriteCnt;
639
+        delete[] m_picReadCnt;
640
+        delete[] m_analysisWriteCnt;
641
+        delete[] m_analysisReadCnt;
642
+
643
+        X265_FREE(m_picIdxReadCnt);
644
+        X265_FREE(m_analysisWrite);
645
+        X265_FREE(m_analysisRead);
646
+
647
+        X265_FREE(m_passEnc);
648
+    }
649
+
650
+    PassEncoder::PassEncoder(uint32_t id, CLIOptions cliopt, AbrEncoder *parent)
651
+    {
652
+        m_id = id;
653
+        m_cliopt = cliopt;
654
+        m_parent = parent;
655
+        if(!(m_cliopt.enableScaler && m_id))
656
+            m_input = m_cliopt.input;
657
+        m_param = cliopt.param;
658
+        m_inputOver = false;
659
+        m_lastIdx = -1;
660
+        m_encoder = NULL;
661
+        m_scaler = NULL;
662
+        m_reader = NULL;
663
+        m_ret = 0;
664
+    }
665
+
666
+    int PassEncoder::init(int &result)
667
+    {
668
+        if (m_parent->m_numEncodes > 1)
669
+            setReuseLevel();
670
+                
671
+        if (!(m_cliopt.enableScaler && m_id))
672
+            m_reader = new Reader(m_id, this);
673
+        else
674
+        {
675
+            VideoDesc *src = NULL, *dst = NULL;
676
+            dst = new VideoDesc(m_param->sourceWidth, m_param->sourceHeight, m_param->internalCsp, m_param->internalBitDepth);
677
+            int dstW = m_parent->m_passEnc[m_id - 1]->m_param->sourceWidth;
678
+            int dstH = m_parent->m_passEnc[m_id - 1]->m_param->sourceHeight;
679
+            src = new VideoDesc(dstW, dstH, m_param->internalCsp, m_param->internalBitDepth);
680
+            if (src != NULL && dst != NULL)
681
+            {
682
+                m_scaler = new Scaler(0, 1, m_id, src, dst, this);
683
+                if (!m_scaler)
684
+                {
685
+                    x265_log(m_param, X265_LOG_ERROR, "\n MALLOC failure in Scaler");
686
+                    result = 4;
687
+                }
688
+            }
689
+        }
690
+
691
+        if (m_cliopt.zoneFile)
692
+        {
693
+            if (!m_cliopt.parseZoneFile())
694
+            {
695
+                x265_log(NULL, X265_LOG_ERROR, "Unable to parse zonefile in %s\n");
696
+                fclose(m_cliopt.zoneFile);
697
+                m_cliopt.zoneFile = NULL;
698
+            }
699
+        }
700
+
701
+        /* note: we could try to acquire a different libx265 API here based on
702
+        * the profile found during option parsing, but it must be done before
703
+        * opening an encoder */
704
+
705
+        if (m_param)
706
+            m_encoder = m_cliopt.api->encoder_open(m_param);
707
+        if (!m_encoder)
708
+        {
709
+            x265_log(NULL, X265_LOG_ERROR, "x265_encoder_open() failed for Enc, \n");
710
+            m_ret = 2;
711
+            return -1;
712
+        }
713
+
714
+        /* get the encoder parameters post-initialization */
715
+        m_cliopt.api->encoder_parameters(m_encoder, m_param);
716
+
717
+        return 1;
718
+    }
719
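
PassEncoder::init() opens its encoder through the dispatch table in m_cliopt.api. For reference, the equivalent open/parameters sequence through the plain public API looks roughly like the sketch below; the preset, resolution and frame-rate values are placeholders, not taken from this patch.

    #include <x265.h>

    static x265_encoder* openEncoderSketch(int width, int height)
    {
        x265_param* p = x265_param_alloc();
        x265_param_default_preset(p, "medium", NULL);
        p->sourceWidth  = width;
        p->sourceHeight = height;
        p->fpsNum   = 25;
        p->fpsDenom = 1;
        x265_encoder* enc = x265_encoder_open(p);
        if (enc)
            x265_encoder_parameters(enc, p);   /* read back the post-init parameter set */
        /* p stays owned by the caller; release it with x265_param_free()
         * after x265_encoder_close(enc). */
        return enc;
    }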
+
720
+    void PassEncoder::setReuseLevel()
721
+    {
722
+        uint32_t r, padh = 0, padw = 0;
723
+
724
+        m_param->confWinBottomOffset = m_param->confWinRightOffset = 0;
725
+
726
+        m_param->analysisLoadReuseLevel = m_cliopt.loadLevel;
727
+        m_param->analysisSaveReuseLevel = m_cliopt.saveLevel;
728
+        m_param->analysisSave = m_cliopt.saveLevel ? "save.dat" : NULL;
729
+        m_param->analysisLoad = m_cliopt.loadLevel ? "load.dat" : NULL;
730
+        m_param->bUseAnalysisFile = 0;
731
+
732
+        if (m_cliopt.loadLevel)
733
+        {
734
+            x265_param *refParam = m_parent->m_passEnc[m_cliopt.refId]->m_param;
735
+
736
+            if (m_param->sourceHeight == (refParam->sourceHeight - refParam->confWinBottomOffset) &&
737
+                m_param->sourceWidth == (refParam->sourceWidth - refParam->confWinRightOffset))
738
+            {
739
+                m_parent->m_passEnc[m_id]->m_param->confWinBottomOffset = refParam->confWinBottomOffset;
740
+                m_parent->m_passEnc[m_id]->m_param->confWinRightOffset = refParam->confWinRightOffset;
741
+            }
742
+            else
743
+            {
744
+                int srcH = refParam->sourceHeight - refParam->confWinBottomOffset;
745
+                int srcW = refParam->sourceWidth - refParam->confWinRightOffset;
746
+
747
+                double scaleFactorH = double(m_param->sourceHeight / srcH);
748
+                double scaleFactorW = double(m_param->sourceWidth / srcW);
749
+
750
+                int absScaleFactorH = (int)(10 * scaleFactorH + 0.5);
751
+                int absScaleFactorW = (int)(10 * scaleFactorW + 0.5);
752
+
753
+                if (absScaleFactorH == 20 && absScaleFactorW == 20)
754
+                {
755
+                    m_param->scaleFactor = 2;
756
+
757
+                    m_parent->m_passEnc[m_id]->m_param->confWinBottomOffset = refParam->confWinBottomOffset * 2;
758
+                    m_parent->m_passEnc[m_id]->m_param->confWinRightOffset = refParam->confWinRightOffset * 2;
759
+
760
+                }
761
+            }
762
+        }
763
+
764
+        int h = m_param->sourceHeight + m_param->confWinBottomOffset;
765
+        int w = m_param->sourceWidth + m_param->confWinRightOffset;
766
+        if (h & (m_param->minCUSize - 1))
767
+        {
768
+            r = h & (m_param->minCUSize - 1);
769
+            padh = m_param->minCUSize - r;
770
+            m_param->confWinBottomOffset += padh;
771
+
772
+        }
773
+
774
+        if (w & (m_param->minCUSize - 1))
775
+        {
776
+            r = w & (m_param->minCUSize - 1);
777
+            padw = m_param->minCUSize - r;
778
+            m_param->confWinRightOffset += padw;
779
+        }
780
+    }
781
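
The tail of setReuseLevel() pads the conformance window so the coded dimensions stay multiples of minCUSize. A worked example of that arithmetic (values chosen for illustration):

    // For a 1920x1080 source with minCUSize = 16:
    int h = 1080, minCUSize = 16;
    int r    = h & (minCUSize - 1);   // 1080 & 15 == 8
    int padh = minCUSize - r;         // 16 - 8  == 8 extra rows
    // confWinBottomOffset grows by 8, so the coded height becomes 1088.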
+
782
+    void PassEncoder::startThreads()
783
+    {
784
+        /* Start slave worker threads */
785
+        m_threadActive = true;
786
+        start();
787
+        /* Start reader threads*/
788
+        if (m_reader != NULL)
789
+        {
790
+            m_reader->m_threadActive = true;
791
+            m_reader->start();
792
+        }
793
+        /* Start scaling worker threads */
794
+        if (m_scaler != NULL)
795
+        {
796
+            m_scaler->m_threadActive = true;
797
+            m_scaler->start();
798
+        }
799
+    }
800
+
801
+    void PassEncoder::copyInfo(x265_analysis_data * src)
802
+    {
803
+
804
+        uint32_t written = m_parent->m_analysisWriteCnt[m_id].get();
805
+
806
+        int index = written % m_parent->m_queueSize;
807
+        //If all streams have read analysis data, reuse that position in Queue
808
+
809
+        int read = m_parent->m_analysisRead[m_id][index].get();
810
+        int write = m_parent->m_analysisWrite[m_id][index].get();
811
+
812
+        int overwrite = written / m_parent->m_queueSize;
813
+        bool emptyIdxFound = 0;
814
+        while (!emptyIdxFound && overwrite)
815
+        {
816
+            for (uint32_t i = 0; i < m_parent->m_queueSize; i++)
817
+            {
818
+                read = m_parent->m_analysisRead[m_id][i].get();
819
+                write = m_parent->m_analysisWrite[m_id][i].get();
820
+                write *= m_cliopt.numRefs;
821
+
822
+                if (read == write)
823
+                {
824
+                    index = i;
825
+                    emptyIdxFound = 1;
826
+                }
827
+            }
828
+        }
829
+
830
+        x265_analysis_data *m_analysisInfo = &m_parent->m_analysisBuffer[m_id][index];
831
+
832
+        x265_free_analysis_data(m_param, m_analysisInfo);
833
+        memcpy(m_analysisInfo, src, sizeof(x265_analysis_data));
834
+        x265_alloc_analysis_data(m_param, m_analysisInfo);
835
+
836
+        bool isVbv = m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate;
837
+        if (m_param->bDisableLookahead && isVbv)
838
+        {
839
+            memcpy(m_analysisInfo->lookahead.intraSatdForVbv, src->lookahead.intraSatdForVbv, src->numCuInHeight * sizeof(uint32_t));
840
+            memcpy(m_analysisInfo->lookahead.satdForVbv, src->lookahead.satdForVbv, src->numCuInHeight * sizeof(uint32_t));
841
+            memcpy(m_analysisInfo->lookahead.intraVbvCost, src->lookahead.intraVbvCost, src->numCUsInFrame * sizeof(uint32_t));
842
+            memcpy(m_analysisInfo->lookahead.vbvCost, src->lookahead.vbvCost, src->numCUsInFrame * sizeof(uint32_t));
843
+        }
844
+
845
+        if (src->sliceType == X265_TYPE_IDR || src->sliceType == X265_TYPE_I)
846
+        {
847
+            if (m_param->analysisSaveReuseLevel < 2)
848
+                goto ret;
849
+            x265_analysis_intra_data *intraDst, *intraSrc;
850
+            intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData;
851
+            intraSrc = (x265_analysis_intra_data*)src->intraData;
852
+            memcpy(intraDst->depth, intraSrc->depth, sizeof(uint8_t) * src->depthBytes);
853
+            memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numCUsInFrame * src->numPartitions);
854
+            memcpy(intraDst->partSizes, intraSrc->partSizes, sizeof(char) * src->depthBytes);
855
+            memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes);
856
+            if (m_param->rc.cuTree)
857
+                memcpy(intraDst->cuQPOff, intraSrc->cuQPOff, sizeof(int8_t) * src->depthBytes);
858
+        }
859
+        else
860
+        {
861
+            bool bIntraInInter = (src->sliceType == X265_TYPE_P || m_param->bIntraInBFrames);
862
+            int numDir = src->sliceType == X265_TYPE_P ? 1 : 2;
863
+            memcpy(m_analysisInfo->wt, src->wt, sizeof(WeightParam) * 3 * numDir);
864
+            if (m_param->analysisSaveReuseLevel < 2)
865
+                goto ret;
866
+            x265_analysis_inter_data *interDst, *interSrc;
867
+            interDst = (x265_analysis_inter_data*)m_analysisInfo->interData;
868
+            interSrc = (x265_analysis_inter_data*)src->interData;
869
+            memcpy(interDst->depth, interSrc->depth, sizeof(uint8_t) * src->depthBytes);
870
+            memcpy(interDst->modes, interSrc->modes, sizeof(uint8_t) * src->depthBytes);
871
+            if (m_param->rc.cuTree)
872
+                memcpy(interDst->cuQPOff, interSrc->cuQPOff, sizeof(int8_t) * src->depthBytes);
873
+            if (m_param->analysisSaveReuseLevel > 4)
874
+            {
875
+                memcpy(interDst->partSize, interSrc->partSize, sizeof(uint8_t) * src->depthBytes);
876
+                memcpy(interDst->mergeFlag, interSrc->mergeFlag, sizeof(uint8_t) * src->depthBytes);
877
+                if (m_param->analysisSaveReuseLevel == 10)
878
+                {
879
+                    memcpy(interDst->interDir, interSrc->interDir, sizeof(uint8_t) * src->depthBytes);
880
+                    for (int dir = 0; dir < numDir; dir++)
881
+                    {
882
+                        memcpy(interDst->mvpIdx[dir], interSrc->mvpIdx[dir], sizeof(uint8_t) * src->depthBytes);
883
+                        memcpy(interDst->refIdx[dir], interSrc->refIdx[dir], sizeof(int8_t) * src->depthBytes);
884
+                        memcpy(interDst->mv[dir], interSrc->mv[dir], sizeof(MV) * src->depthBytes);
885
+                    }
886
+                    if (bIntraInInter)
887
+                    {
888
+                        x265_analysis_intra_data *intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData;
889
+                        x265_analysis_intra_data *intraSrc = (x265_analysis_intra_data*)src->intraData;
890
+                        memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numPartitions * src->numCUsInFrame);
891
+                        memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes);
892
+                    }
893
+               }
894
+            }
895
+            if (m_param->analysisSaveReuseLevel != 10)
896
+                memcpy(interDst->ref, interSrc->ref, sizeof(int32_t) * src->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir);
897
+        }
898
+
899
+ret:
900
+        //increment analysis Write counter 
901
+        m_parent->m_analysisWriteCnt[m_id].incr();
902
+        m_parent->m_analysisWrite[m_id][index].incr();
903
+        return;
904
+    }
905
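
What copyInfo() shares with dependent passes is gated by analysisSaveReuseLevel; a rough summary of the branches above, as comments (not exhaustive, inferred from this function only):

    // level 1    : slice type and weight params (inter slices only)
    // levels 2-4 : + per-CU depth, (chroma) modes, partition sizes, cuQPOff with cu-tree
    // levels 5-9 : + partSize and mergeFlag
    // level 10   : + interDir, mvpIdx, refIdx and MVs (the ref list is copied only below 10)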
+
906
+
907
+    bool PassEncoder::readPicture(x265_picture *dstPic)
908
+    {
909
+        /*Check and wait if there any input frames to read*/
910
+        int ipread = m_parent->m_picReadCnt[m_id].get();
911
+        int ipwrite = m_parent->m_picWriteCnt[m_id].get();
912
+
913
+        bool isAbrLoad = m_cliopt.loadLevel && (m_parent->m_numEncodes > 1);
914
+        while (!m_inputOver && (ipread == ipwrite))
915
+        {
916
+            ipwrite = m_parent->m_picWriteCnt[m_id].waitForChange(ipwrite);
917
+        }
918
+
919
+        if (m_threadActive && ipread < ipwrite)
920
+        {
921
+            /*Get input index to read from inputQueue. If doesn't need analysis info, it need not wait to fetch poc from analysisQueue*/
922
+            int readPos = ipread % m_parent->m_queueSize;
923
+            x265_analysis_data* analysisData = 0;
924
+
925
+            if (isAbrLoad)
926
+            {
927
+                /*If stream is master of each slave pass, then fetch analysis data from prev pass*/
928
+                int analysisQId = m_cliopt.refId;
929
+                /*Check and wait if there any analysis Data to read*/
930
+                int analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].get();
931
+                int written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
932
+                int analysisRead = m_parent->m_analysisReadCnt[analysisQId].get();
933
+                
934
+                while (m_threadActive && written == analysisRead)
935
+                {
936
+                    analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].waitForChange(analysisWrite);
937
+                    written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
938
+                }
939
+
940
+                if (analysisRead < written)
941
+                {
942
+                    int analysisIdx = 0;
943
+                    if (!m_param->bDisableLookahead)
944
+                    {
945
+                        bool analysisdRead = false;
946
+                        while ((analysisRead < written) && !analysisdRead)
947
+                        {
948
+                            while (analysisWrite < ipread)
949
+                            {
950
+                                analysisWrite = m_parent->m_analysisWriteCnt[analysisQId].waitForChange(analysisWrite);
951
+                                written = analysisWrite * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
952
+                            }
953
+                            for (uint32_t i = 0; i < m_parent->m_queueSize; i++)
954
+                            {
955
+                                analysisData = &m_parent->m_analysisBuffer[analysisQId][i];
956
+                                int read = m_parent->m_analysisRead[analysisQId][i].get();
957
+                                int write = m_parent->m_analysisWrite[analysisQId][i].get() * m_parent->m_passEnc[analysisQId]->m_cliopt.numRefs;
958
+                                if ((analysisData->poc == (uint32_t)(ipread)) && (read < write))
959
+                                {
960
+                                    analysisIdx = i;
961
+                                    analysisdRead = true;
962
+                                    break;
963
+                                }
964
+                            }
965
+                        }
966
+                    }
967
+                    else
968
+                    {
969
+                        analysisIdx = analysisRead % m_parent->m_queueSize;
970
+                        analysisData = &m_parent->m_analysisBuffer[analysisQId][analysisIdx];
971
+                        readPos = analysisData->poc % m_parent->m_queueSize;
972
+                        while ((ipwrite < readPos) || ((ipwrite - 1) < (int)analysisData->poc))
973
+                        {
974
+                            ipwrite = m_parent->m_picWriteCnt[m_id].waitForChange(ipwrite);
975
+                        }
976
+                    }
977
+
978
+                    m_lastIdx = analysisIdx;
979
+                }
980
+                else
981
+                    return false;
982
+            }
983
+
984
+
985
+            x265_picture *srcPic = (x265_picture*)(m_parent->m_inputPicBuffer[m_id][readPos]);
986
+
987
+            x265_picture *pic = (x265_picture*)(dstPic);
988
+            pic->colorSpace = srcPic->colorSpace;
989
+            pic->bitDepth = srcPic->bitDepth;
990
+            pic->framesize = srcPic->framesize;
991
+            pic->height = srcPic->height;
992
+            pic->pts = srcPic->pts;
993
+            pic->dts = srcPic->dts;
994
+            pic->reorderedPts = srcPic->reorderedPts;
995
+            pic->width = srcPic->width;
996
+            pic->analysisData = srcPic->analysisData;
997
+            pic->userSEI = srcPic->userSEI;
998
+            pic->stride[0] = srcPic->stride[0];
999
+            pic->stride[1] = srcPic->stride[1];
1000
+            pic->stride[2] = srcPic->stride[2];
1001
+            pic->planes[0] = srcPic->planes[0];
1002
+            pic->planes[1] = srcPic->planes[1];
1003
+            pic->planes[2] = srcPic->planes[2];
1004
+            if (isAbrLoad)
1005
+                pic->analysisData = *analysisData;
1006
+            return true;
1007
+        }
1008
+        else
1009
+            return false;
1010
+    }
1011
+
1012
+    void PassEncoder::threadMain()
1013
+    {
1014
         THREAD_NAME("PassEncoder", m_id);
1015
 
1016
         while (m_threadActive)
1017
         {
1018
-
1019
-#if ENABLE_LIBVMAF
1020
-            x265_vmaf_data* vmafdata = m_cliopt.vmafData;
1021
-#endif
1022
-            /* This allows muxers to modify bitstream format */
1023
-            m_cliopt.output->setParam(m_param);
1024
-            const x265_api* api = m_cliopt.api;
1025
-            ReconPlay* reconPlay = NULL;
1026
-            if (m_cliopt.reconPlayCmd)
1027
-                reconPlay = new ReconPlay(m_cliopt.reconPlayCmd, *m_param);
1028
-            char* profileName = m_cliopt.encName ? m_cliopt.encName : (char *)"x265";
1029
-
1030
-            if (m_cliopt.zoneFile)
1031
-            {
1032
-                if (!m_cliopt.parseZoneFile())
1033
-                {
1034
-                    x265_log(NULL, X265_LOG_ERROR, "Unable to parse zonefile in %s\n", profileName);
1035
-                    fclose(m_cliopt.zoneFile);
1036
-                    m_cliopt.zoneFile = NULL;
1037
-                }
1038
-            }
1039
-
1040
-            if (signal(SIGINT, sigint_handler) == SIG_ERR)
1041
-                x265_log(m_param, X265_LOG_ERROR, "Unable to register CTRL+C handler: %s in %s\n",
1042
-                    strerror(errno), profileName);
1043
-
1044
-            x265_picture pic_orig, pic_out;
1045
-            x265_picture *pic_in = &pic_orig;
1046
-            /* Allocate recon picture if analysis save/load is enabled */
1047
-            std::priority_queue<int64_t>* pts_queue = m_cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL;
1048
-            x265_picture *pic_recon = (m_cliopt.recon || m_param->analysisSave || m_param->analysisLoad || pts_queue || reconPlay || m_param->csvLogLevel) ? &pic_out : NULL;
1049
-            uint32_t inFrameCount = 0;
1050
-            uint32_t outFrameCount = 0;
1051
-            x265_nal *p_nal;
1052
-            x265_stats stats;
1053
-            uint32_t nal;
1054
-            int16_t *errorBuf = NULL;
1055
-            bool bDolbyVisionRPU = false;
1056
-            uint8_t *rpuPayload = NULL;
1057
-            int inputPicNum = 1;
1058
-            x265_picture picField1, picField2;
1059
-            x265_analysis_data* analysisInfo = (x265_analysis_data*)(&pic_out.analysisData);
1060
-            bool isAbrSave = m_cliopt.saveLevel && (m_parent->m_numEncodes > 1);
1061
-
1062
-            if (!m_param->bRepeatHeaders && !m_param->bEnableSvtHevc)
1063
-            {
1064
-                if (api->encoder_headers(m_encoder, &p_nal, &nal) < 0)
1065
-                {
1066
-                    x265_log(m_param, X265_LOG_ERROR, "Failure generating stream headers in %s\n", profileName);
1067
-                    m_ret = 3;
1068
-                    goto fail;
1069
-                }
1070
-                else
1071
-                    m_cliopt.totalbytes += m_cliopt.output->writeHeaders(p_nal, nal);
1072
-            }
1073
-
1074
-            if (m_param->bField && m_param->interlaceMode)
1075
-            {
1076
-                api->picture_init(m_param, &picField1);
1077
-                api->picture_init(m_param, &picField2);
1078
-                // return back the original height of input
1079
-                m_param->sourceHeight *= 2;
1080
-                api->picture_init(m_param, &pic_orig);
1081
-            }
1082
-            else
1083
-                api->picture_init(m_param, &pic_orig);
1084
-
1085
-            if (m_param->dolbyProfile && m_cliopt.dolbyVisionRpu)
1086
-            {
1087
-                rpuPayload = X265_MALLOC(uint8_t, 1024);
1088
-                pic_in->rpu.payload = rpuPayload;
1089
-                if (pic_in->rpu.payload)
1090
-                    bDolbyVisionRPU = true;
1091
-            }
1092
-
1093
-            if (m_cliopt.bDither)
1094
-            {
1095
-                errorBuf = X265_MALLOC(int16_t, m_param->sourceWidth + 1);
1096
-                if (errorBuf)
1097
-                    memset(errorBuf, 0, (m_param->sourceWidth + 1) * sizeof(int16_t));
1098
-                else
1099
-                    m_cliopt.bDither = false;
1100
-            }
1101
-
1102
-            // main encoder loop
1103
-            while (pic_in && !b_ctrl_c)
1104
-            {
1105
-                pic_orig.poc = (m_param->bField && m_param->interlaceMode) ? inFrameCount * 2 : inFrameCount;
1106
-                if (m_cliopt.qpfile)
1107
-                {
1108
-                    if (!m_cliopt.parseQPFile(pic_orig))
1109
-                    {
1110
-                        x265_log(NULL, X265_LOG_ERROR, "can't parse qpfile for frame %d in %s\n",
1111
-                            pic_in->poc, profileName);
1112
-                        fclose(m_cliopt.qpfile);
1113
-                        m_cliopt.qpfile = NULL;
1114
-                    }
1115
-                }
1116
-
1117
-                if (m_cliopt.framesToBeEncoded && inFrameCount >= m_cliopt.framesToBeEncoded)
1118
-                    pic_in = NULL;
1119
-                else if (readPicture(pic_in))
1120
-                    inFrameCount++;
1121
-                else
1122
-                    pic_in = NULL;
1123
-
1124
-                if (pic_in)
1125
-                {
1126
-                    if (pic_in->bitDepth > m_param->internalBitDepth && m_cliopt.bDither)
1127
-                    {
1128
-                        x265_dither_image(pic_in, m_cliopt.input->getWidth(), m_cliopt.input->getHeight(), errorBuf, m_param->internalBitDepth);
1129
-                        pic_in->bitDepth = m_param->internalBitDepth;
1130
-                    }
1131
-                    /* Overwrite PTS */
1132
-                    pic_in->pts = pic_in->poc;
1133
-
1134
-                    // convert to field
1135
-                    if (m_param->bField && m_param->interlaceMode)
1136
-                    {
1137
-                        int height = pic_in->height >> 1;
1138
-
1139
-                        int static bCreated = 0;
1140
-                        if (bCreated == 0)
1141
-                        {
1142
-                            bCreated = 1;
1143
-                            inputPicNum = 2;
1144
-                            picField1.fieldNum = 1;
1145
-                            picField2.fieldNum = 2;
1146
-
1147
-                            picField1.bitDepth = picField2.bitDepth = pic_in->bitDepth;
1148
-                            picField1.colorSpace = picField2.colorSpace = pic_in->colorSpace;
1149
-                            picField1.height = picField2.height = pic_in->height >> 1;
1150
-                            picField1.framesize = picField2.framesize = pic_in->framesize >> 1;
1151
-
1152
-                            size_t fieldFrameSize = (size_t)pic_in->framesize >> 1;
1153
-                            char* field1Buf = X265_MALLOC(char, fieldFrameSize);
1154
-                            char* field2Buf = X265_MALLOC(char, fieldFrameSize);
1155
-
1156
-                            int stride = picField1.stride0 = picField2.stride0 = pic_in->stride0;
1157
-                            uint64_t framesize = stride * (height >> x265_cli_cspspic_in->colorSpace.height0);
1158
-                            picField1.planes0 = field1Buf;
1159
-                            picField2.planes0 = field2Buf;
1160
-                            for (int i = 1; i < x265_cli_cspspic_in->colorSpace.planes; i++)
1161
-                            {
1162
-                                picField1.planesi = field1Buf + framesize;
1163
-                                picField2.planesi = field2Buf + framesize;
1164
-
1165
-                                stride = picField1.stridei = picField2.stridei = pic_in->stridei;
1166
-                                framesize += (stride * (height >> x265_cli_cspspic_in->colorSpace.heighti));
1167
-                            }
1168
-                            assert(framesize == picField1.framesize);
1169
-                        }
1170
-
1171
-                        picField1.pts = picField1.poc = pic_in->poc;
1172
-                        picField2.pts = picField2.poc = pic_in->poc + 1;
1173
-
1174
-                        picField1.userSEI = picField2.userSEI = pic_in->userSEI;
1175
-
1176
-                        //if (pic_in->userData)
1177
-                        //{
1178
-                        //    // Have to handle userData here
1179
-                        //}
1180
-
1181
-                        if (pic_in->framesize)
1182
-                        {
1183
-                            for (int i = 0; i < x265_cli_cspspic_in->colorSpace.planes; i++)
1184
-                            {
1185
-                                char* srcP1 = (char*)pic_in->planesi;
1186
-                                char* srcP2 = (char*)pic_in->planesi + pic_in->stridei;
1187
-                                char* p1 = (char*)picField1.planesi;
1188
-                                char* p2 = (char*)picField2.planesi;
1189
-
1190
-                                int stride = picField1.stridei;
1191
-
1192
-                                for (int y = 0; y < (height >> x265_cli_cspspic_in->colorSpace.heighti); y++)
1193
-                                {
1194
-                                    memcpy(p1, srcP1, stride);
1195
-                                    memcpy(p2, srcP2, stride);
1196
-                                    srcP1 += 2 * stride;
1197
-                                    srcP2 += 2 * stride;
1198
-                                    p1 += stride;
1199
-                                    p2 += stride;
1200
-                                }
1201
-                            }
1202
-                        }
1203
-                    }
1204
-
1205
-                    if (bDolbyVisionRPU)
1206
-                    {
1207
-                        if (m_param->bField && m_param->interlaceMode)
1208
-                        {
1209
-                            if (m_cliopt.rpuParser(&picField1) > 0)
1210
-                                goto fail;
1211
-                            if (m_cliopt.rpuParser(&picField2) > 0)
1212
-                                goto fail;
1213
-                        }
1214
-                        else
1215
-                        {
1216
-                            if (m_cliopt.rpuParser(pic_in) > 0)
1217
-                                goto fail;
1218
-                        }
1219
-                    }
1220
-                }
1221
-
1222
-                for (int inputNum = 0; inputNum < inputPicNum; inputNum++)
1223
-                {
1224
-                    x265_picture *picInput = NULL;
1225
-                    if (inputPicNum == 2)
1226
-                        picInput = pic_in ? (inputNum ? &picField2 : &picField1) : NULL;
1227
-                    else
1228
-                        picInput = pic_in;
1229
-
1230
-                    int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, picInput, pic_recon);
1231
-
1232
-                    int idx = (inFrameCount - 1) % m_parent->m_queueSize;
1233
-                    m_parent->m_picIdxReadCntm_ididx.incr();
1234
-                    m_parent->m_picReadCntm_id.incr();
1235
-                    if (m_cliopt.loadLevel && picInput)
1236
-                    {
1237
-                        m_parent->m_analysisReadCntm_cliopt.refId.incr();
1238
-                        m_parent->m_analysisReadm_cliopt.refIdm_lastIdx.incr();
1239
-                    }
1240
-
1241
-                    if (numEncoded < 0)
1242
-                    {
1243
-                        b_ctrl_c = 1;
1244
-                        m_ret = 4;
1245
-                        break;
1246
-                    }
1247
-
1248
-                    if (reconPlay && numEncoded)
1249
-                        reconPlay->writePicture(*pic_recon);
1250
-
1251
-                    outFrameCount += numEncoded;
1252
-
1253
-                    if (isAbrSave && numEncoded)
1254
-                    {
1255
-                        copyInfo(analysisInfo);
1256
-                    }
1257
-
1258
-                    if (numEncoded && pic_recon && m_cliopt.recon)
1259
-                        m_cliopt.recon->writePicture(pic_out);
1260
-                    if (nal)
1261
-                    {
1262
-                        m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out);
1263
-                        if (pts_queue)
1264
-                        {
1265
-                            pts_queue->push(-pic_out.pts);
1266
-                            if (pts_queue->size() > 2)
1267
-                                pts_queue->pop();
1268
-                        }
1269
-                    }
1270
-                    m_cliopt.printStatus(outFrameCount);
1271
-                }
1272
-            }
1273
-
1274
-            /* Flush the encoder */
1275
-            while (!b_ctrl_c)
1276
-            {
1277
-                int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, NULL, pic_recon);
1278
-                if (numEncoded < 0)
1279
-                {
1280
-                    m_ret = 4;
1281
-                    break;
1282
-                }
1283
-
1284
-                if (reconPlay && numEncoded)
1285
-                    reconPlay->writePicture(*pic_recon);
1286
-
1287
-                outFrameCount += numEncoded;
1288
-                if (isAbrSave && numEncoded)
1289
-                {
1290
-                    copyInfo(analysisInfo);
1291
-                }
1292
-
1293
-                if (numEncoded && pic_recon && m_cliopt.recon)
1294
-                    m_cliopt.recon->writePicture(pic_out);
1295
-                if (nal)
1296
-                {
1297
-                    m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out);
1298
-                    if (pts_queue)
1299
-                    {
1300
-                        pts_queue->push(-pic_out.pts);
1301
-                        if (pts_queue->size() > 2)
1302
-                            pts_queue->pop();
1303
-                    }
1304
-                }
1305
-
1306
-                m_cliopt.printStatus(outFrameCount);
1307
-
1308
-                if (!numEncoded)
1309
-                    break;
1310
-            }
1311
-
1312
-            if (bDolbyVisionRPU)
1313
-            {
1314
-                if (fgetc(m_cliopt.dolbyVisionRpu) != EOF)
1315
-                    x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU count is greater than frame count in %s\n",
1316
-                        profileName);
1317
-                x265_log(NULL, X265_LOG_INFO, "VES muxing with Dolby Vision RPU file successful in %s\n",
1318
-                    profileName);
1319
-            }
1320
-
1321
-            /* clear progress report */
1322
-            if (m_cliopt.bProgress)
1323
-                fprintf(stderr, "%*s\r", 80, " ");
1324
-
1325
-        fail:
1326
-
1327
-            delete reconPlay;
1328
-
1329
-            api->encoder_get_stats(m_encoder, &stats, sizeof(stats));
1330
-            if (m_param->csvfn && !b_ctrl_c)
1331
-#if ENABLE_LIBVMAF
1332
-                api->vmaf_encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString, m_cliopt.param, vmafdata);
1333
-#else
1334
-                api->encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString);
1335
-#endif
1336
-            api->encoder_close(m_encoder);
1337
-
1338
-            int64_t second_largest_pts = 0;
1339
-            int64_t largest_pts = 0;
1340
-            if (pts_queue && pts_queue->size() >= 2)
1341
-            {
1342
-                second_largest_pts = -pts_queue->top();
1343
-                pts_queue->pop();
1344
-                largest_pts = -pts_queue->top();
1345
-                pts_queue->pop();
1346
-                delete pts_queue;
1347
-                pts_queue = NULL;
1348
-            }
1349
-            m_cliopt.output->closeFile(largest_pts, second_largest_pts);
1350
-
1351
-            if (b_ctrl_c)
1352
-                general_log(m_param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d in %s\n",
1353
-                    m_cliopt.seek + inFrameCount, stats.encodedPictureCount, profileName);
1354
-
1355
-            api->param_free(m_param);
1356
-
1357
-            X265_FREE(errorBuf);
1358
-            X265_FREE(rpuPayload);
1359
-
1360
-            m_threadActive = false;
1361
-            m_parent->m_numActiveEncodes.decr();
1362
-        }
1363
-    }
1364
-
1365
-    void PassEncoder::destroy()
1366
-    {
1367
-        stop();
1368
-        if (m_reader)
1369
-        {
1370
-            m_reader->stop();
1371
-            delete m_reader;
1372
-        }
1373
-        else
1374
-        {
1375
-            m_scaler->stop();
1376
-            m_scaler->destroy();
1377
-            delete m_scaler;
1378
-        }
1379
-    }
1380
-
1381
-    Scaler::Scaler(int threadId, int threadNum, int id, VideoDesc *src, VideoDesc *dst, PassEncoder *parentEnc)
1382
-    {
1383
-        m_parentEnc = parentEnc;
1384
-        m_id = id;
1385
-        m_srcFormat = src;
1386
-        m_dstFormat = dst;
1387
-        m_threadActive = false;
1388
-        m_scaleFrameSize = 0;
1389
-        m_filterManager = NULL;
1390
-        m_threadId = threadId;
1391
-        m_threadTotal = threadNum;
1392
-
1393
-        int csp = dst->m_csp;
1394
-        uint32_t pixelbytes = dst->m_inputDepth > 8 ? 2 : 1;
1395
-        for (int i = 0; i < x265_cli_cspscsp.planes; i++)
1396
-        {
1397
-            int w = dst->m_width >> x265_cli_cspscsp.widthi;
1398
-            int h = dst->m_height >> x265_cli_cspscsp.heighti;
1399
-            m_scalePlanesi = w * h * pixelbytes;
1400
-            m_scaleFrameSize += m_scalePlanesi;
1401
-        }
1402
-
1403
-        if (src->m_height != dst->m_height || src->m_width != dst->m_width)
1404
-        {
1405
-            m_filterManager = new ScalerFilterManager;
1406
-            m_filterManager->init(4, m_srcFormat, m_dstFormat);
1407
-        }
1408
-    }
1409
-
1410
-    bool Scaler::scalePic(x265_picture * destination, x265_picture * source)
1411
-    {
1412
-        if (!destination || !source)
1413
-            return false;
1414
-        x265_param* param = m_parentEnc->m_param;
1415
-        int pixelBytes = m_dstFormat->m_inputDepth > 8 ? 2 : 1;
1416
-        if (m_srcFormat->m_height != m_dstFormat->m_height || m_srcFormat->m_width != m_dstFormat->m_width)
1417
-        {
1418
-            void **srcPlane = NULL, **dstPlane = NULL;
1419
-            int srcStride3, dstStride3;
1420
-            destination->bitDepth = source->bitDepth;
1421
-            destination->colorSpace = source->colorSpace;
1422
-            destination->pts = source->pts;
1423
-            destination->dts = source->dts;
1424
-            destination->reorderedPts = source->reorderedPts;
1425
-            destination->poc = source->poc;
1426
-            destination->userSEI = source->userSEI;
1427
-            srcPlane = source->planes;
1428
-            dstPlane = destination->planes;
1429
-            srcStride0 = source->stride0;
1430
-            destination->stride0 = m_dstFormat->m_width * pixelBytes;
1431
-            dstStride0 = destination->stride0;
1432
-            if (param->internalCsp != X265_CSP_I400)
1433
-            {
1434
-                srcStride1 = source->stride1;
1435
-                srcStride2 = source->stride2;
1436
-                destination->stride1 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width1;
1437
-                destination->stride2 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width2;
1438
-                dstStride1 = destination->stride1;
1439
-                dstStride2 = destination->stride2;
1440
-            }
1441
-            if (m_scaleFrameSize)
1442
-            {
1443
-                m_filterManager->scale_pic(srcPlane, dstPlane, srcStride, dstStride);
1444
-                return true;
1445
-            }
1446
-            else
1447
-                x265_log(param, X265_LOG_INFO, "Empty frame received\n");
1448
-        }
1449
-        return false;
1450
-    }
1451
-
1452
-    void Scaler::threadMain()
1453
-    {
1454
-        THREAD_NAME("Scaler", m_id);
1455
-
1456
-        /* unscaled picture is stored in the last index */
1457
-        uint32_t srcId = m_id - 1;
1458
-        int QDepth = m_parentEnc->m_parent->m_queueSize;
1459
-        while (!m_parentEnc->m_inputOver)
1460
-        {
1461
-
1462
-            uint32_t scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get();
1463
-
1464
-            if (m_parentEnc->m_cliopt.framesToBeEncoded && scaledWritten >= m_parentEnc->m_cliopt.framesToBeEncoded)
1465
-                break;
1466
-
1467
-            if (m_threadTotal > 1 && (m_threadId != scaledWritten % m_threadTotal))
1468
-            {
1469
-                continue;
1470
-            }
1471
-            uint32_t written = m_parentEnc->m_parent->m_picWriteCntsrcId.get();
1472
-
1473
-            /*If all the input pictures are scaled by the current scale worker thread wait for input pictures*/
1474
-            while (m_threadActive && (scaledWritten == written)) {
1475
-                written = m_parentEnc->m_parent->m_picWriteCntsrcId.waitForChange(written);
1476
-            }
1477
-
1478
-            if (m_threadActive && scaledWritten < written)
1479
-            {
1480
-
1481
-                int scaledWriteIdx = scaledWritten % QDepth;
1482
-                int overWritePicBuffer = scaledWritten / QDepth;
1483
-                int read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.get();
1484
-
1485
-                while (overWritePicBuffer && read < overWritePicBuffer)
1486
-                {
1487
-                    read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.waitForChange(read);
1488
-                }
1489
-
1490
-                if (!m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx)
1491
-                {
1492
-                    int framesize = 0;
1493
-                    int planesize3;
1494
-                    int csp = m_dstFormat->m_csp;
1495
-                    int stride3;
1496
-                    stride0 = m_dstFormat->m_width;
1497
-                    stride1 = stride0 >> x265_cli_cspscsp.width1;
1498
-                    stride2 = stride0 >> x265_cli_cspscsp.width2;
1499
-                    for (int i = 0; i < x265_cli_cspscsp.planes; i++)
1500
-                    {
1501
-                        uint32_t h = m_dstFormat->m_height >> x265_cli_cspscsp.heighti;
1502
-                        planesizei = h * stridei;
1503
-                        framesize += planesizei;
1504
-                    }
1505
-
1506
-                    m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx = x265_picture_alloc();
1507
-                    x265_picture_init(m_parentEnc->m_param, m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx);
1508
-
1509
-                    ((x265_picture*)m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth)->framesize = framesize;
1510
-                    for (int32_t j = 0; j < x265_cli_cspscsp.planes; j++)
1511
-                    {
1512
-                        m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth->planesj = X265_MALLOC(char, planesizej);
1513
-                    }
1514
-                }
1515
-
1516
-                x265_picture *srcPic = m_parentEnc->m_parent->m_inputPicBuffersrcIdscaledWritten % QDepth;
1517
-                x265_picture* destPic = m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx;
1518
-
1519
-                // Enqueue this picture up with the current encoder so that it will asynchronously encode
1520
-                if (!scalePic(destPic, srcPic))
1521
-                    x265_log(NULL, X265_LOG_ERROR, "Unable to copy scaled input picture to input queue \n");
1522
-                else
1523
-                    m_parentEnc->m_parent->m_picWriteCntm_id.incr();
1524
-                m_scaledWriteCnt.incr();
1525
-                m_parentEnc->m_parent->m_picIdxReadCntsrcIdscaledWriteIdx.incr();
1526
-            }
1527
-            if (m_threadTotal > 1)
1528
-            {
1529
-                written = m_parentEnc->m_parent->m_picWriteCntsrcId.get();
1530
-                int totalWrite = written / m_threadTotal;
1531
-                if (written % m_threadTotal > m_threadId)
1532
-                    totalWrite++;
1533
-                if (totalWrite == m_scaledWriteCnt.get())
1534
-                {
1535
-                    m_parentEnc->m_parent->m_picWriteCntsrcId.poke();
1536
-                    m_parentEnc->m_parent->m_picWriteCntm_id.poke();
1537
-                    break;
1538
-                }
1539
-            }
1540
-            else
1541
-            {
1542
-                /* Once end of video is reached and all frames are scaled, release wait on picwritecount */
1543
-                scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get();
1544
-                written = m_parentEnc->m_parent->m_picWriteCntsrcId.get();
1545
-                if (written == scaledWritten)
1546
-                {
1547
-                    m_parentEnc->m_parent->m_picWriteCntsrcId.poke();
1548
-                    m_parentEnc->m_parent->m_picWriteCntm_id.poke();
1549
-                    break;
1550
-                }
1551
-            }
1552
-
1553
-        }
1554
-        m_threadActive = false;
1555
-        destroy();
1556
-    }
1557
-
1558
-    Reader::Reader(int id, PassEncoder *parentEnc)
1559
-    {
1560
-        m_parentEnc = parentEnc;
1561
-        m_id = id;
1562
-        m_input = parentEnc->m_input;
1563
-    }
1564
-
1565
-    void Reader::threadMain()
1566
-    {
1567
-        THREAD_NAME("Reader", m_id);
1568
-
1569
-        int QDepth = m_parentEnc->m_parent->m_queueSize;
1570
-        x265_picture* src = x265_picture_alloc();
1571
-        x265_picture_init(m_parentEnc->m_param, src);
1572
-
1573
-        while (m_threadActive)
1574
-        {
1575
-            uint32_t written = m_parentEnc->m_parent->m_picWriteCntm_id.get();
1576
-            uint32_t writeIdx = written % QDepth;
1577
-            uint32_t read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.get();
1578
-            uint32_t overWritePicBuffer = written / QDepth;
1579
-
1580
-            if (m_parentEnc->m_cliopt.framesToBeEncoded && written >= m_parentEnc->m_cliopt.framesToBeEncoded)
1581
-                break;
1582
-
1583
-            while (overWritePicBuffer && read < overWritePicBuffer)
1584
-            {
1585
-                read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.waitForChange(read);
1586
-            }
1587
-
1588
-            x265_picture* dest = m_parentEnc->m_parent->m_inputPicBufferm_idwriteIdx;
1589
-            if (m_input->readPicture(*src))
1590
-            {
1591
-                dest->poc = src->poc;
1592
-                dest->pts = src->pts;
1593
-                dest->userSEI = src->userSEI;
1594
-                dest->bitDepth = src->bitDepth;
1595
-                dest->framesize = src->framesize;
1596
-                dest->height = src->height;
1597
-                dest->width = src->width;
1598
-                dest->colorSpace = src->colorSpace;
1599
-                dest->userSEI = src->userSEI;
1600
-                dest->rpu.payload = src->rpu.payload;
1601
-                dest->picStruct = src->picStruct;
1602
-                dest->stride0 = src->stride0;
1603
-                dest->stride1 = src->stride1;
1604
-                dest->stride2 = src->stride2;
1605
-
1606
-                if (!dest->planes0)
1607
-                    dest->planes0 = X265_MALLOC(char, dest->framesize);
1608
-
1609
-                memcpy(dest->planes0, src->planes0, src->framesize * sizeof(char));
1610
-                dest->planes1 = (char*)dest->planes0 + src->stride0 * src->height;
1611
-                dest->planes2 = (char*)dest->planes1 + src->stride1 * (src->height >> x265_cli_cspssrc->colorSpace.height1);
1612
-                m_parentEnc->m_parent->m_picWriteCntm_id.incr();
1613
-            }
1614
-            else
1615
-            {
1616
-                m_threadActive = false;
1617
-                m_parentEnc->m_inputOver = true;
1618
-                m_parentEnc->m_parent->m_picWriteCntm_id.poke();
1619
-            }
1620
-        }
1621
-        x265_picture_free(src);
1622
-    }
1623
-}
1624
+
1625
+#if ENABLE_LIBVMAF
1626
+            x265_vmaf_data* vmafdata = m_cliopt.vmafData;
1627
+#endif
1628
+            /* This allows muxers to modify bitstream format */
1629
+            m_cliopt.output->setParam(m_param);
1630
+            const x265_api* api = m_cliopt.api;
1631
+            ReconPlay* reconPlay = NULL;
1632
+            if (m_cliopt.reconPlayCmd)
1633
+                reconPlay = new ReconPlay(m_cliopt.reconPlayCmd, *m_param);
1634
+            char* profileName = m_cliopt.encName ? m_cliopt.encName : (char *)"x265";
1635
+
1636
+            if (signal(SIGINT, sigint_handler) == SIG_ERR)
1637
+                x265_log(m_param, X265_LOG_ERROR, "Unable to register CTRL+C handler: %s in %s\n",
1638
+                    strerror(errno), profileName);
1639
+
1640
+            x265_picture pic_orig, pic_out;
1641
+            x265_picture *pic_in = &pic_orig;
1642
+            /* Allocate recon picture if analysis save/load is enabled */
1643
+            std::priority_queue<int64_t>* pts_queue = m_cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL;
1644
+            x265_picture *pic_recon = (m_cliopt.recon || m_param->analysisSave || m_param->analysisLoad || pts_queue || reconPlay || m_param->csvLogLevel) ? &pic_out : NULL;
1645
+            uint32_t inFrameCount = 0;
1646
+            uint32_t outFrameCount = 0;
1647
+            x265_nal *p_nal;
1648
+            x265_stats stats;
1649
+            uint32_t nal;
1650
+            int16_t *errorBuf = NULL;
1651
+            bool bDolbyVisionRPU = false;
1652
+            uint8_t *rpuPayload = NULL;
1653
+            int inputPicNum = 1;
1654
+            x265_picture picField1, picField2;
1655
+            x265_analysis_data* analysisInfo = (x265_analysis_data*)(&pic_out.analysisData);
1656
+            bool isAbrSave = m_cliopt.saveLevel && (m_parent->m_numEncodes > 1);
1657
+
1658
+            if (!m_param->bRepeatHeaders && !m_param->bEnableSvtHevc)
1659
+            {
1660
+                if (api->encoder_headers(m_encoder, &p_nal, &nal) < 0)
1661
+                {
1662
+                    x265_log(m_param, X265_LOG_ERROR, "Failure generating stream headers in %s\n", profileName);
1663
+                    m_ret = 3;
1664
+                    goto fail;
1665
+                }
1666
+                else
1667
+                    m_cliopt.totalbytes += m_cliopt.output->writeHeaders(p_nal, nal);
1668
+            }
1669
+
1670
+            if (m_param->bField && m_param->interlaceMode)
1671
+            {
1672
+                api->picture_init(m_param, &picField1);
1673
+                api->picture_init(m_param, &picField2);
1674
+                // return back the original height of input
1675
+                m_param->sourceHeight *= 2;
1676
+                api->picture_init(m_param, &pic_orig);
1677
+            }
1678
+            else
1679
+                api->picture_init(m_param, &pic_orig);
1680
+
1681
+            if (m_param->dolbyProfile && m_cliopt.dolbyVisionRpu)
1682
+            {
1683
+                rpuPayload = X265_MALLOC(uint8_t, 1024);
1684
+                pic_in->rpu.payload = rpuPayload;
1685
+                if (pic_in->rpu.payload)
1686
+                    bDolbyVisionRPU = true;
1687
+            }
1688
+
1689
+            if (m_cliopt.bDither)
1690
+            {
1691
+                errorBuf = X265_MALLOC(int16_t, m_param->sourceWidth + 1);
1692
+                if (errorBuf)
1693
+                    memset(errorBuf, 0, (m_param->sourceWidth + 1) * sizeof(int16_t));
1694
+                else
1695
+                    m_cliopt.bDither = false;
1696
+            }
1697
+
1698
+            // main encoder loop
1699
+            while (pic_in && !b_ctrl_c)
1700
+            {
1701
+                pic_orig.poc = (m_param->bField && m_param->interlaceMode) ? inFrameCount * 2 : inFrameCount;
1702
+                if (m_cliopt.qpfile)
1703
+                {
1704
+                    if (!m_cliopt.parseQPFile(pic_orig))
1705
+                    {
1706
+                        x265_log(NULL, X265_LOG_ERROR, "can't parse qpfile for frame %d in %s\n",
1707
+                            pic_in->poc, profileName);
1708
+                        fclose(m_cliopt.qpfile);
1709
+                        m_cliopt.qpfile = NULL;
1710
+                    }
1711
+                }
1712
+
1713
+                if (m_cliopt.framesToBeEncoded && inFrameCount >= m_cliopt.framesToBeEncoded)
1714
+                    pic_in = NULL;
1715
+                else if (readPicture(pic_in))
1716
+                    inFrameCount++;
1717
+                else
1718
+                    pic_in = NULL;
1719
+
1720
+                if (pic_in)
1721
+                {
1722
+                    if (pic_in->bitDepth > m_param->internalBitDepth && m_cliopt.bDither)
1723
+                    {
1724
+                        x265_dither_image(pic_in, m_cliopt.input->getWidth(), m_cliopt.input->getHeight(), errorBuf, m_param->internalBitDepth);
1725
+                        pic_in->bitDepth = m_param->internalBitDepth;
1726
+                    }
1727
+                    /* Overwrite PTS */
1728
+                    pic_in->pts = pic_in->poc;
1729
+
1730
+                    // convert to field
1731
+                    if (m_param->bField && m_param->interlaceMode)
1732
+                    {
1733
+                        int height = pic_in->height >> 1;
1734
+
1735
+                        int static bCreated = 0;
1736
+                        if (bCreated == 0)
1737
+                        {
1738
+                            bCreated = 1;
1739
+                            inputPicNum = 2;
1740
+                            picField1.fieldNum = 1;
1741
+                            picField2.fieldNum = 2;
1742
+
1743
+                            picField1.bitDepth = picField2.bitDepth = pic_in->bitDepth;
1744
+                            picField1.colorSpace = picField2.colorSpace = pic_in->colorSpace;
1745
+                            picField1.height = picField2.height = pic_in->height >> 1;
1746
+                            picField1.framesize = picField2.framesize = pic_in->framesize >> 1;
1747
+
1748
+                            size_t fieldFrameSize = (size_t)pic_in->framesize >> 1;
1749
+                            char* field1Buf = X265_MALLOC(char, fieldFrameSize);
1750
+                            char* field2Buf = X265_MALLOC(char, fieldFrameSize);
1751
+
1752
+                            int stride = picField1.stride[0] = picField2.stride[0] = pic_in->stride[0];
1753
+                            uint64_t framesize = stride * (height >> x265_cli_csps[pic_in->colorSpace].height[0]);
1754
+                            picField1.planes[0] = field1Buf;
1755
+                            picField2.planes[0] = field2Buf;
1756
+                            for (int i = 1; i < x265_cli_csps[pic_in->colorSpace].planes; i++)
1757
+                            {
1758
+                                picField1.planes[i] = field1Buf + framesize;
1759
+                                picField2.planes[i] = field2Buf + framesize;
1760
+
1761
+                                stride = picField1.stride[i] = picField2.stride[i] = pic_in->stride[i];
1762
+                                framesize += (stride * (height >> x265_cli_csps[pic_in->colorSpace].height[i]));
1763
+                            }
1764
+                            assert(framesize == picField1.framesize);
1765
+                        }
1766
+
1767
+                        picField1.pts = picField1.poc = pic_in->poc;
1768
+                        picField2.pts = picField2.poc = pic_in->poc + 1;
1769
+
1770
+                        picField1.userSEI = picField2.userSEI = pic_in->userSEI;
1771
+
1772
+                        //if (pic_in->userData)
1773
+                        //{
1774
+                        //    // Have to handle userData here
1775
+                        //}
1776
+
1777
+                        if (pic_in->framesize)
1778
+                        {
1779
+                            for (int i = 0; i < x265_cli_csps[pic_in->colorSpace].planes; i++)
1780
+                            {
1781
+                                char* srcP1 = (char*)pic_in->planes[i];
1782
+                                char* srcP2 = (char*)pic_in->planes[i] + pic_in->stride[i];
1783
+                                char* p1 = (char*)picField1.planes[i];
1784
+                                char* p2 = (char*)picField2.planes[i];
1785
+
1786
+                                int stride = picField1.stride[i];
1787
+
1788
+                                for (int y = 0; y < (height >> x265_cli_csps[pic_in->colorSpace].height[i]); y++)
1789
+                                {
1790
+                                    memcpy(p1, srcP1, stride);
1791
+                                    memcpy(p2, srcP2, stride);
1792
+                                    srcP1 += 2 * stride;
1793
+                                    srcP2 += 2 * stride;
1794
+                                    p1 += stride;
1795
+                                    p2 += stride;
1796
+                                }
1797
+                            }
1798
+                        }
1799
+                    }
1800
+
1801
+                    if (bDolbyVisionRPU)
1802
+                    {
1803
+                        if (m_param->bField && m_param->interlaceMode)
1804
+                        {
1805
+                            if (m_cliopt.rpuParser(&picField1) > 0)
1806
+                                goto fail;
1807
+                            if (m_cliopt.rpuParser(&picField2) > 0)
1808
+                                goto fail;
1809
+                        }
1810
+                        else
1811
+                        {
1812
+                            if (m_cliopt.rpuParser(pic_in) > 0)
1813
+                                goto fail;
1814
+                        }
1815
+                    }
1816
+                }
1817
+
1818
+                for (int inputNum = 0; inputNum < inputPicNum; inputNum++)
1819
+                {
1820
+                    x265_picture *picInput = NULL;
1821
+                    if (inputPicNum == 2)
1822
+                        picInput = pic_in ? (inputNum ? &picField2 : &picField1) : NULL;
1823
+                    else
1824
+                        picInput = pic_in;
1825
+
1826
+                    int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, picInput, pic_recon);
1827
+
1828
+                    int idx = (inFrameCount - 1) % m_parent->m_queueSize;
1829
+                    m_parent->m_picIdxReadCnt[m_id][idx].incr();
1830
+                    m_parent->m_picReadCnt[m_id].incr();
1831
+                    if (m_cliopt.loadLevel && picInput)
1832
+                    {
1833
+                        m_parent->m_analysisReadCnt[m_cliopt.refId].incr();
1834
+                        m_parent->m_analysisRead[m_cliopt.refId][m_lastIdx].incr();
1835
+                    }
1836
+
1837
+                    if (numEncoded < 0)
1838
+                    {
1839
+                        b_ctrl_c = 1;
1840
+                        m_ret = 4;
1841
+                        break;
1842
+                    }
1843
+
1844
+                    if (reconPlay && numEncoded)
1845
+                        reconPlay->writePicture(*pic_recon);
1846
+
1847
+                    outFrameCount += numEncoded;
1848
+
1849
+                    if (isAbrSave && numEncoded)
1850
+                    {
1851
+                        copyInfo(analysisInfo);
1852
+                    }
1853
+
1854
+                    if (numEncoded && pic_recon && m_cliopt.recon)
1855
+                        m_cliopt.recon->writePicture(pic_out);
1856
+                    if (nal)
1857
+                    {
1858
+                        m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out);
1859
+                        if (pts_queue)
1860
+                        {
1861
+                            pts_queue->push(-pic_out.pts);
1862
+                            if (pts_queue->size() > 2)
1863
+                                pts_queue->pop();
1864
+                        }
1865
+                    }
1866
+                    m_cliopt.printStatus(outFrameCount);
1867
+                }
1868
+            }
1869
+
1870
+            /* Flush the encoder */
1871
+            while (!b_ctrl_c)
1872
+            {
1873
+                int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, NULL, pic_recon);
1874
+                if (numEncoded < 0)
1875
+                {
1876
+                    m_ret = 4;
1877
+                    break;
1878
+                }
1879
+
1880
+                if (reconPlay && numEncoded)
1881
+                    reconPlay->writePicture(*pic_recon);
1882
+
1883
+                outFrameCount += numEncoded;
1884
+                if (isAbrSave && numEncoded)
1885
+                {
1886
+                    copyInfo(analysisInfo);
1887
+                }
1888
+
1889
+                if (numEncoded && pic_recon && m_cliopt.recon)
1890
+                    m_cliopt.recon->writePicture(pic_out);
1891
+                if (nal)
1892
+                {
1893
+                    m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out);
1894
+                    if (pts_queue)
1895
+                    {
1896
+                        pts_queue->push(-pic_out.pts);
1897
+                        if (pts_queue->size() > 2)
1898
+                            pts_queue->pop();
1899
+                    }
1900
+                }
1901
+
1902
+                m_cliopt.printStatus(outFrameCount);
1903
+
1904
+                if (!numEncoded)
1905
+                    break;
1906
+            }
1907
+
1908
+            if (bDolbyVisionRPU)
1909
+            {
1910
+                if (fgetc(m_cliopt.dolbyVisionRpu) != EOF)
1911
+                    x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU count is greater than frame count in %s\n",
1912
+                        profileName);
1913
+                x265_log(NULL, X265_LOG_INFO, "VES muxing with Dolby Vision RPU file successful in %s\n",
1914
+                    profileName);
1915
+            }
1916
+
1917
+            /* clear progress report */
1918
+            if (m_cliopt.bProgress)
1919
+                fprintf(stderr, "%*s\r", 80, " ");
1920
+
1921
+        fail:
1922
+
1923
+            delete reconPlay;
1924
+
1925
+            api->encoder_get_stats(m_encoder, &stats, sizeof(stats));
1926
+            if (m_param->csvfn && !b_ctrl_c)
1927
+#if ENABLE_LIBVMAF
1928
+                api->vmaf_encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString, m_cliopt.param, vmafdata);
1929
+#else
1930
+                api->encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString);
1931
+#endif
1932
+            api->encoder_close(m_encoder);
1933
+
1934
+            int64_t second_largest_pts = 0;
1935
+            int64_t largest_pts = 0;
1936
+            if (pts_queue && pts_queue->size() >= 2)
1937
+            {
1938
+                second_largest_pts = -pts_queue->top();
1939
+                pts_queue->pop();
1940
+                largest_pts = -pts_queue->top();
1941
+                pts_queue->pop();
1942
+                delete pts_queue;
1943
+                pts_queue = NULL;
1944
+            }
1945
+            m_cliopt.output->closeFile(largest_pts, second_largest_pts);
1946
+
1947
+            if (b_ctrl_c)
1948
+                general_log(m_param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d in %s\n",
1949
+                    m_cliopt.seek + inFrameCount, stats.encodedPictureCount, profileName);
1950
+
1951
+            api->param_free(m_param);
1952
+
1953
+            X265_FREE(errorBuf);
1954
+            X265_FREE(rpuPayload);
1955
+
1956
+            m_threadActive = false;
1957
+            m_parent->m_numActiveEncodes.decr();
1958
+        }
1959
+    }
1960
+
1961
+    void PassEncoder::destroy()
1962
+    {
1963
+        stop();
1964
+        if (m_reader)
1965
+        {
1966
+            m_reader->stop();
1967
+            delete m_reader;
1968
+        }
1969
+        else
1970
+        {
1971
+            m_scaler->stop();
1972
+            m_scaler->destroy();
1973
+            delete m_scaler;
1974
+        }
1975
+    }
1976
+
1977
+    Scaler::Scaler(int threadId, int threadNum, int id, VideoDesc *src, VideoDesc *dst, PassEncoder *parentEnc)
1978
+    {
1979
+        m_parentEnc = parentEnc;
1980
+        m_id = id;
1981
+        m_srcFormat = src;
1982
+        m_dstFormat = dst;
1983
+        m_threadActive = false;
1984
+        m_scaleFrameSize = 0;
1985
+        m_filterManager = NULL;
1986
+        m_threadId = threadId;
1987
+        m_threadTotal = threadNum;
1988
+
1989
+        int csp = dst->m_csp;
1990
+        uint32_t pixelbytes = dst->m_inputDepth > 8 ? 2 : 1;
1991
+        for (int i = 0; i < x265_cli_csps[csp].planes; i++)
1992
+        {
1993
+            int w = dst->m_width >> x265_cli_csps[csp].width[i];
1994
+            int h = dst->m_height >> x265_cli_csps[csp].height[i];
1995
+            m_scalePlanes[i] = w * h * pixelbytes;
1996
+            m_scaleFrameSize += m_scalePlanes[i];
1997
+        }
1998
+
1999
+        if (src->m_height != dst->m_height || src->m_width != dst->m_width)
2000
+        {
2001
+            m_filterManager = new ScalerFilterManager;
2002
+            m_filterManager->init(4, m_srcFormat, m_dstFormat);
2003
+        }
2004
+    }
2005
+
2006
+    bool Scaler::scalePic(x265_picture * destination, x265_picture * source)
2007
+    {
2008
+        if (!destination || !source)
2009
+            return false;
2010
+        x265_param* param = m_parentEnc->m_param;
2011
+        int pixelBytes = m_dstFormat->m_inputDepth > 8 ? 2 : 1;
2012
+        if (m_srcFormat->m_height != m_dstFormat->m_height || m_srcFormat->m_width != m_dstFormat->m_width)
2013
+        {
2014
+            void **srcPlane = NULL, **dstPlane = NULL;
2015
+            int srcStride[3], dstStride[3];
2016
+            destination->bitDepth = source->bitDepth;
2017
+            destination->colorSpace = source->colorSpace;
2018
+            destination->pts = source->pts;
2019
+            destination->dts = source->dts;
2020
+            destination->reorderedPts = source->reorderedPts;
2021
+            destination->poc = source->poc;
2022
+            destination->userSEI = source->userSEI;
2023
+            srcPlane = source->planes;
2024
+            dstPlane = destination->planes;
2025
+            srcStride[0] = source->stride[0];
2026
+            destination->stride[0] = m_dstFormat->m_width * pixelBytes;
2027
+            dstStride[0] = destination->stride[0];
2028
+            if (param->internalCsp != X265_CSP_I400)
2029
+            {
2030
+                srcStride[1] = source->stride[1];
2031
+                srcStride[2] = source->stride[2];
2032
+                destination->stride[1] = destination->stride[0] >> x265_cli_csps[param->internalCsp].width[1];
2033
+                destination->stride[2] = destination->stride[0] >> x265_cli_csps[param->internalCsp].width[2];
2034
+                dstStride[1] = destination->stride[1];
2035
+                dstStride[2] = destination->stride[2];
2036
+            }
2037
+            if (m_scaleFrameSize)
2038
+            {
2039
+                m_filterManager->scale_pic(srcPlane, dstPlane, srcStride, dstStride);
2040
+                return true;
2041
+            }
2042
+            else
2043
+                x265_log(param, X265_LOG_INFO, "Empty frame received\n");
2044
+        }
2045
+        return false;
2046
+    }
2047
+
2048
+    void Scaler::threadMain()
2049
+    {
2050
+        THREAD_NAME("Scaler", m_id);
2051
+
2052
+        /* unscaled picture is stored in the last index */
2053
+        uint32_t srcId = m_id - 1;
2054
+        int QDepth = m_parentEnc->m_parent->m_queueSize;
2055
+        while (!m_parentEnc->m_inputOver)
2056
+        {
2057
+
2058
+            uint32_t scaledWritten = m_parentEnc->m_parent->m_picWriteCnt[m_id].get();
2059
+
2060
+            if (m_parentEnc->m_cliopt.framesToBeEncoded && scaledWritten >= m_parentEnc->m_cliopt.framesToBeEncoded)
2061
+                break;
2062
+
2063
+            if (m_threadTotal > 1 && (m_threadId != scaledWritten % m_threadTotal))
2064
+            {
2065
+                continue;
2066
+            }
2067
+            uint32_t written = m_parentEnc->m_parent->m_picWriteCnt[srcId].get();
2068
+
2069
+            /*If all the input pictures are scaled by the current scale worker thread wait for input pictures*/
2070
+            while (m_threadActive && (scaledWritten == written)) {
2071
+                written = m_parentEnc->m_parent->m_picWriteCnt[srcId].waitForChange(written);
2072
+            }
2073
+
2074
+            if (m_threadActive && scaledWritten < written)
2075
+            {
2076
+
2077
+                int scaledWriteIdx = scaledWritten % QDepth;
2078
+                int overWritePicBuffer = scaledWritten / QDepth;
2079
+                int read = m_parentEnc->m_parent->m_picIdxReadCnt[m_id][scaledWriteIdx].get();
2080
+
2081
+                while (overWritePicBuffer && read < overWritePicBuffer)
2082
+                {
2083
+                    read = m_parentEnc->m_parent->m_picIdxReadCnt[m_id][scaledWriteIdx].waitForChange(read);
2084
+                }
2085
+
2086
+                if (!m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWriteIdx])
2087
+                {
2088
+                    int framesize = 0;
2089
+                    int planesize[3];
2090
+                    int csp = m_dstFormat->m_csp;
2091
+                    int stride[3];
2092
+                    stride[0] = m_dstFormat->m_width;
2093
+                    stride[1] = stride[0] >> x265_cli_csps[csp].width[1];
2094
+                    stride[2] = stride[0] >> x265_cli_csps[csp].width[2];
2095
+                    for (int i = 0; i < x265_cli_csps[csp].planes; i++)
2096
+                    {
2097
+                        uint32_t h = m_dstFormat->m_height >> x265_cli_csps[csp].height[i];
2098
+                        planesize[i] = h * stride[i];
2099
+                        framesize += planesize[i];
2100
+                    }
2101
+
2102
+                    m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWriteIdx] = x265_picture_alloc();
2103
+                    x265_picture_init(m_parentEnc->m_param, m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWriteIdx]);
2104
+
2105
+                    ((x265_picture*)m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWritten % QDepth])->framesize = framesize;
2106
+                    for (int32_t j = 0; j < x265_cli_csps[csp].planes; j++)
2107
+                    {
2108
+                        m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWritten % QDepth]->planes[j] = X265_MALLOC(char, planesize[j]);
2109
+                    }
2110
+                }
2111
+
2112
+                x265_picture *srcPic = m_parentEnc->m_parent->m_inputPicBuffer[srcId][scaledWritten % QDepth];
2113
+                x265_picture* destPic = m_parentEnc->m_parent->m_inputPicBuffer[m_id][scaledWriteIdx];
2114
+
2115
+                // Enqueue this picture up with the current encoder so that it will asynchronously encode
2116
+                if (!scalePic(destPic, srcPic))
2117
+                    x265_log(NULL, X265_LOG_ERROR, "Unable to copy scaled input picture to input queue \n");
2118
+                else
2119
+                    m_parentEnc->m_parent->m_picWriteCnt[m_id].incr();
2120
+                m_scaledWriteCnt.incr();
2121
+                m_parentEnc->m_parent->m_picIdxReadCnt[srcId][scaledWriteIdx].incr();
2122
+            }
2123
+            if (m_threadTotal > 1)
2124
+            {
2125
+                written = m_parentEnc->m_parent->m_picWriteCnt[srcId].get();
2126
+                int totalWrite = written / m_threadTotal;
2127
+                if (written % m_threadTotal > m_threadId)
2128
+                    totalWrite++;
2129
+                if (totalWrite == m_scaledWriteCnt.get())
2130
+                {
2131
+                    m_parentEnc->m_parent->m_picWriteCnt[srcId].poke();
2132
+                    m_parentEnc->m_parent->m_picWriteCnt[m_id].poke();
2133
+                    break;
2134
+                }
2135
+            }
2136
+            else
2137
+            {
2138
+                /* Once end of video is reached and all frames are scaled, release wait on picwritecount */
2139
+                scaledWritten = m_parentEnc->m_parent->m_picWriteCnt[m_id].get();
2140
+                written = m_parentEnc->m_parent->m_picWriteCnt[srcId].get();
2141
+                if (written == scaledWritten)
2142
+                {
2143
+                    m_parentEnc->m_parent->m_picWriteCnt[srcId].poke();
2144
+                    m_parentEnc->m_parent->m_picWriteCnt[m_id].poke();
2145
+                    break;
2146
+                }
2147
+            }
2148
+
2149
+        }
2150
+        m_threadActive = false;
2151
+        destroy();
2152
+    }
2153
+
2154
+    Reader::Reader(int id, PassEncoder *parentEnc)
2155
+    {
2156
+        m_parentEnc = parentEnc;
2157
+        m_id = id;
2158
+        m_input = parentEnc->m_input;
2159
+    }
2160
+
2161
+    void Reader::threadMain()
2162
+    {
2163
+        THREAD_NAME("Reader", m_id);
2164
+
2165
+        int QDepth = m_parentEnc->m_parent->m_queueSize;
2166
+        x265_picture* src = x265_picture_alloc();
2167
+        x265_picture_init(m_parentEnc->m_param, src);
2168
+
2169
+        while (m_threadActive)
2170
+        {
2171
+            uint32_t written = m_parentEnc->m_parent->m_picWriteCnt[m_id].get();
2172
+            uint32_t writeIdx = written % QDepth;
2173
+            uint32_t read = m_parentEnc->m_parent->m_picIdxReadCnt[m_id][writeIdx].get();
2174
+            uint32_t overWritePicBuffer = written / QDepth;
2175
+
2176
+            if (m_parentEnc->m_cliopt.framesToBeEncoded && written >= m_parentEnc->m_cliopt.framesToBeEncoded)
2177
+                break;
2178
+
2179
+            while (overWritePicBuffer && read < overWritePicBuffer)
2180
+            {
2181
+                read = m_parentEnc->m_parent->m_picIdxReadCnt[m_id][writeIdx].waitForChange(read);
2182
+            }
2183
+
2184
+            x265_picture* dest = m_parentEnc->m_parent->m_inputPicBuffer[m_id][writeIdx];
2185
+            if (m_input->readPicture(*src))
2186
+            {
2187
+                dest->poc = src->poc;
2188
+                dest->pts = src->pts;
2189
+                dest->userSEI = src->userSEI;
2190
+                dest->bitDepth = src->bitDepth;
2191
+                dest->framesize = src->framesize;
2192
+                dest->height = src->height;
2193
+                dest->width = src->width;
2194
+                dest->colorSpace = src->colorSpace;
2195
+                dest->userSEI = src->userSEI;
2196
+                dest->rpu.payload = src->rpu.payload;
2197
+                dest->picStruct = src->picStruct;
2198
+                dest->stride[0] = src->stride[0];
2199
+                dest->stride[1] = src->stride[1];
2200
+                dest->stride[2] = src->stride[2];
2201
+
2202
+                if (!dest->planes[0])
2203
+                    dest->planes[0] = X265_MALLOC(char, dest->framesize);
2204
+
2205
+                memcpy(dest->planes[0], src->planes[0], src->framesize * sizeof(char));
2206
+                dest->planes[1] = (char*)dest->planes[0] + src->stride[0] * src->height;
2207
+                dest->planes[2] = (char*)dest->planes[1] + src->stride[1] * (src->height >> x265_cli_csps[src->colorSpace].height[1]);
2208
+                m_parentEnc->m_parent->m_picWriteCntm_id.incr();
2209
+            }
2210
+            else
2211
+            {
2212
+                m_threadActive = false;
2213
+                m_parentEnc->m_inputOver = true;
2214
+                m_parentEnc->m_parent->m_picWriteCntm_id.poke();
2215
+            }
2216
+        }
2217
+        x265_picture_free(src);
2218
+    }
2219
+}
2220
x265_3.5.tar.gz/source/abrEncApp.h -> x265_3.6.tar.gz/source/abrEncApp.h Changed
9
 
1
@@ -91,6 +91,7 @@
2
         FILE*    m_qpfile;
3
         FILE*    m_zoneFile;
4
         FILE*    m_dolbyVisionRpu;/* File containing Dolby Vision BL RPU metadata */
5
+        FILE*    m_scenecutAwareQpConfig;
6
 
7
         int m_ret;
8
 
9
x265_3.5.tar.gz/source/cmake/FindNeon.cmake -> x265_3.6.tar.gz/source/cmake/FindNeon.cmake Changed
27
 
1
@@ -1,10 +1,21 @@
2
 include(FindPackageHandleStandardArgs)
3
 
4
 # Check the version of neon supported by the ARM CPU
5
-execute_process(COMMAND cat /proc/cpuinfo | grep Features | grep neon
6
-                OUTPUT_VARIABLE neon_version
7
-                ERROR_QUIET
8
-                OUTPUT_STRIP_TRAILING_WHITESPACE)
9
+if(APPLE)
10
+    execute_process(COMMAND sysctl -a
11
+                    COMMAND grep "hw.optional.neon: 1"
12
+                    OUTPUT_VARIABLE neon_version
13
+                    ERROR_QUIET
14
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
15
+else()
16
+    execute_process(COMMAND cat /proc/cpuinfo
17
+                    COMMAND grep Features
18
+                    COMMAND grep neon
19
+                    OUTPUT_VARIABLE neon_version
20
+                    ERROR_QUIET
21
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
22
+endif()
23
+
24
 if(neon_version)
25
     set(CPU_HAS_NEON 1)
26
 endif()
27
x265_3.6.tar.gz/source/cmake/FindSVE.cmake Added
23
 
1
@@ -0,0 +1,21 @@
2
+include(FindPackageHandleStandardArgs)
3
+
4
+# Check the version of SVE supported by the ARM CPU
5
+if(APPLE)
6
+    execute_process(COMMAND sysctl -a
7
+                    COMMAND grep "hw.optional.sve: 1"
8
+                    OUTPUT_VARIABLE sve_version
9
+                    ERROR_QUIET
10
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
11
+else()
12
+    execute_process(COMMAND cat /proc/cpuinfo
13
+                    COMMAND grep Features
14
+                    COMMAND grep -e "sve$" -e "sve[[:space:]]"
15
+                    OUTPUT_VARIABLE sve_version
16
+                    ERROR_QUIET
17
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
18
+endif()
19
+
20
+if(sve_version)
21
+    set(CPU_HAS_SVE 1)
22
+endif()
23
x265_3.6.tar.gz/source/cmake/FindSVE2.cmake Added
24
 
1
@@ -0,0 +1,22 @@
2
+include(FindPackageHandleStandardArgs)
3
+
4
+# Check the version of SVE2 supported by the ARM CPU
5
+if(APPLE)
6
+    execute_process(COMMAND sysctl -a
7
+                    COMMAND grep "hw.optional.sve2: 1"
8
+                    OUTPUT_VARIABLE sve2_version
9
+                    ERROR_QUIET
10
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
11
+else()
12
+    execute_process(COMMAND cat /proc/cpuinfo
13
+                    COMMAND grep Features
14
+                    COMMAND grep sve2
15
+                    OUTPUT_VARIABLE sve2_version
16
+                    ERROR_QUIET
17
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
18
+endif()
19
+
20
+if(sve2_version)
21
+    set(CPU_HAS_SVE 1)
22
+    set(CPU_HAS_SVE2 1)
23
+endif()
24
x265_3.5.tar.gz/source/common/CMakeLists.txt -> x265_3.6.tar.gz/source/common/CMakeLists.txt Changed
76
 
1
@@ -84,35 +84,42 @@
2
 endif(ENABLE_ASSEMBLY AND X86)
3
 
4
 if(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
5
-    if(ARM64)
6
-        if(GCC AND (CMAKE_CXX_FLAGS_RELEASE MATCHES "-O3"))
7
-            message(STATUS "Detected CXX compiler using -O3 optimization level")
8
-            add_definitions(-DAUTO_VECTORIZE=1)
9
-        endif()
10
-        set(C_SRCS asm-primitives.cpp pixel.h ipfilter8.h)
11
-
12
-        # add ARM assembly/intrinsic files here
13
-        set(A_SRCS asm.S mc-a.S sad-a.S pixel-util.S ipfilter8.S)
14
-        set(VEC_PRIMITIVES)
15
+    set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h)
16
 
17
-        set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources")
18
-        foreach(SRC ${C_SRCS})
19
-            set(ASM_PRIMITIVES ${ASM_PRIMITIVES} aarch64/${SRC})
20
-        endforeach()
21
-    else()
22
-        set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h)
23
+    # add ARM assembly/intrinsic files here
24
+    set(A_SRCS asm.S cpu-a.S mc-a.S sad-a.S pixel-util.S ssd-a.S blockcopy8.S ipfilter8.S dct-a.S)
25
+    set(VEC_PRIMITIVES)
26
 
27
-        # add ARM assembly/intrinsic files here
28
-        set(A_SRCS asm.S cpu-a.S mc-a.S sad-a.S pixel-util.S ssd-a.S blockcopy8.S ipfilter8.S dct-a.S)
29
-        set(VEC_PRIMITIVES)
30
+    set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources")
31
+    foreach(SRC ${C_SRCS})
32
+        set(ASM_PRIMITIVES ${ASM_PRIMITIVES} arm/${SRC})
33
+    endforeach()
34
+    source_group(Assembly FILES ${ASM_PRIMITIVES})
35
+endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
36
 
37
-        set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources")
38
-        foreach(SRC ${C_SRCS})
39
-            set(ASM_PRIMITIVES ${ASM_PRIMITIVES} arm/${SRC})
40
-        endforeach()
41
+if(ENABLE_ASSEMBLY AND (ARM64 OR CROSS_COMPILE_ARM64))
42
+    if(GCC AND (CMAKE_CXX_FLAGS_RELEASE MATCHES "-O3"))
43
+        message(STATUS "Detected CXX compiler using -O3 optimization level")
44
+        add_definitions(-DAUTO_VECTORIZE=1)
45
     endif()
46
+
47
+    set(C_SRCS asm-primitives.cpp pixel-prim.h pixel-prim.cpp filter-prim.h filter-prim.cpp dct-prim.h dct-prim.cpp loopfilter-prim.cpp loopfilter-prim.h intrapred-prim.cpp arm64-utils.cpp arm64-utils.h fun-decls.h)
48
+    enable_language(ASM)
49
+
50
+    # add ARM assembly/intrinsic files here
51
+    set(A_SRCS asm.S mc-a.S mc-a-common.S sad-a.S sad-a-common.S pixel-util.S pixel-util-common.S p2s.S p2s-common.S ipfilter.S ipfilter-common.S blockcopy8.S blockcopy8-common.S ssd-a.S ssd-a-common.S)
52
+    set(A_SRCS_SVE asm-sve.S blockcopy8-sve.S p2s-sve.S pixel-util-sve.S ssd-a-sve.S)
53
+    set(A_SRCS_SVE2 mc-a-sve2.S sad-a-sve2.S pixel-util-sve2.S ipfilter-sve2.S ssd-a-sve2.S)
54
+    set(VEC_PRIMITIVES)
55
+
56
+    set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources")
57
+    set(ARM_ASMS_SVE "${A_SRCS_SVE}" CACHE INTERNAL "ARM Assembly Sources that use SVE instruction set")
58
+    set(ARM_ASMS_SVE2 "${A_SRCS_SVE2}" CACHE INTERNAL "ARM Assembly Sources that use SVE2 instruction set")
59
+    foreach(SRC ${C_SRCS})
60
+        set(ASM_PRIMITIVES ${ASM_PRIMITIVES} aarch64/${SRC})
61
+    endforeach()
62
     source_group(Assembly FILES ${ASM_PRIMITIVES})
63
-endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
64
+endif(ENABLE_ASSEMBLY AND (ARM64 OR CROSS_COMPILE_ARM64))
65
 
66
 if(POWER)
67
     set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS -DX265_VERSION=${X265_VERSION})
68
@@ -169,4 +176,6 @@
69
     scalinglist.cpp scalinglist.h
70
     quant.cpp quant.h contexts.h
71
     deblock.cpp deblock.h
72
-    scaler.cpp scaler.h)
73
+    scaler.cpp scaler.h
74
+    ringmem.cpp ringmem.h
75
+    temporalfilter.cpp temporalfilter.h)
76
x265_3.6.tar.gz/source/common/aarch64/arm64-utils.cpp Added
302
 
1
@@ -0,0 +1,300 @@
2
+#include "common.h"
3
+#include "x265.h"
4
+#include "arm64-utils.h"
5
+#include <arm_neon.h>
6
+
7
+#define COPY_16(d,s) *(uint8x16_t *)(d) = *(uint8x16_t *)(s)
8
+namespace X265_NS
9
+{
10
+
11
+
12
+
13
+void transpose8x8(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride)
14
+{
15
+    uint8x8_t a0, a1, a2, a3, a4, a5, a6, a7;
16
+    uint8x8_t b0, b1, b2, b3, b4, b5, b6, b7;
17
+
18
+    a0 = *(uint8x8_t *)(src + 0 * sstride);
19
+    a1 = *(uint8x8_t *)(src + 1 * sstride);
20
+    a2 = *(uint8x8_t *)(src + 2 * sstride);
21
+    a3 = *(uint8x8_t *)(src + 3 * sstride);
22
+    a4 = *(uint8x8_t *)(src + 4 * sstride);
23
+    a5 = *(uint8x8_t *)(src + 5 * sstride);
24
+    a6 = *(uint8x8_t *)(src + 6 * sstride);
25
+    a7 = *(uint8x8_t *)(src + 7 * sstride);
26
+
27
+    b0 = vtrn1_u32(a0, a4);
28
+    b1 = vtrn1_u32(a1, a5);
29
+    b2 = vtrn1_u32(a2, a6);
30
+    b3 = vtrn1_u32(a3, a7);
31
+    b4 = vtrn2_u32(a0, a4);
32
+    b5 = vtrn2_u32(a1, a5);
33
+    b6 = vtrn2_u32(a2, a6);
34
+    b7 = vtrn2_u32(a3, a7);
35
+
36
+    a0 = vtrn1_u16(b0, b2);
37
+    a1 = vtrn1_u16(b1, b3);
38
+    a2 = vtrn2_u16(b0, b2);
39
+    a3 = vtrn2_u16(b1, b3);
40
+    a4 = vtrn1_u16(b4, b6);
41
+    a5 = vtrn1_u16(b5, b7);
42
+    a6 = vtrn2_u16(b4, b6);
43
+    a7 = vtrn2_u16(b5, b7);
44
+
45
+    b0 = vtrn1_u8(a0, a1);
46
+    b1 = vtrn2_u8(a0, a1);
47
+    b2 = vtrn1_u8(a2, a3);
48
+    b3 = vtrn2_u8(a2, a3);
49
+    b4 = vtrn1_u8(a4, a5);
50
+    b5 = vtrn2_u8(a4, a5);
51
+    b6 = vtrn1_u8(a6, a7);
52
+    b7 = vtrn2_u8(a6, a7);
53
+
54
+    *(uint8x8_t *)(dst + 0 * dstride) = b0;
55
+    *(uint8x8_t *)(dst + 1 * dstride) = b1;
56
+    *(uint8x8_t *)(dst + 2 * dstride) = b2;
57
+    *(uint8x8_t *)(dst + 3 * dstride) = b3;
58
+    *(uint8x8_t *)(dst + 4 * dstride) = b4;
59
+    *(uint8x8_t *)(dst + 5 * dstride) = b5;
60
+    *(uint8x8_t *)(dst + 6 * dstride) = b6;
61
+    *(uint8x8_t *)(dst + 7 * dstride) = b7;
62
+}
63
+
64
+
65
+
66
+
67
+
68
+
69
+void transpose16x16(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride)
70
+{
71
+    uint16x8_t a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aA, aB, aC, aD, aE, aF;
72
+    uint16x8_t b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, bA, bB, bC, bD, bE, bF;
73
+    uint16x8_t c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, cA, cB, cC, cD, cE, cF;
74
+    uint16x8_t d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, dA, dB, dC, dD, dE, dF;
75
+
76
+    a0 = *(uint16x8_t *)(src + 0 * sstride);
77
+    a1 = *(uint16x8_t *)(src + 1 * sstride);
78
+    a2 = *(uint16x8_t *)(src + 2 * sstride);
79
+    a3 = *(uint16x8_t *)(src + 3 * sstride);
80
+    a4 = *(uint16x8_t *)(src + 4 * sstride);
81
+    a5 = *(uint16x8_t *)(src + 5 * sstride);
82
+    a6 = *(uint16x8_t *)(src + 6 * sstride);
83
+    a7 = *(uint16x8_t *)(src + 7 * sstride);
84
+    a8 = *(uint16x8_t *)(src + 8 * sstride);
85
+    a9 = *(uint16x8_t *)(src + 9 * sstride);
86
+    aA = *(uint16x8_t *)(src + 10 * sstride);
87
+    aB = *(uint16x8_t *)(src + 11 * sstride);
88
+    aC = *(uint16x8_t *)(src + 12 * sstride);
89
+    aD = *(uint16x8_t *)(src + 13 * sstride);
90
+    aE = *(uint16x8_t *)(src + 14 * sstride);
91
+    aF = *(uint16x8_t *)(src + 15 * sstride);
92
+
93
+    b0 = vtrn1q_u64(a0, a8);
94
+    b1 = vtrn1q_u64(a1, a9);
95
+    b2 = vtrn1q_u64(a2, aA);
96
+    b3 = vtrn1q_u64(a3, aB);
97
+    b4 = vtrn1q_u64(a4, aC);
98
+    b5 = vtrn1q_u64(a5, aD);
99
+    b6 = vtrn1q_u64(a6, aE);
100
+    b7 = vtrn1q_u64(a7, aF);
101
+    b8 = vtrn2q_u64(a0, a8);
102
+    b9 = vtrn2q_u64(a1, a9);
103
+    bA = vtrn2q_u64(a2, aA);
104
+    bB = vtrn2q_u64(a3, aB);
105
+    bC = vtrn2q_u64(a4, aC);
106
+    bD = vtrn2q_u64(a5, aD);
107
+    bE = vtrn2q_u64(a6, aE);
108
+    bF = vtrn2q_u64(a7, aF);
109
+
110
+    c0 = vtrn1q_u32(b0, b4);
111
+    c1 = vtrn1q_u32(b1, b5);
112
+    c2 = vtrn1q_u32(b2, b6);
113
+    c3 = vtrn1q_u32(b3, b7);
114
+    c4 = vtrn2q_u32(b0, b4);
115
+    c5 = vtrn2q_u32(b1, b5);
116
+    c6 = vtrn2q_u32(b2, b6);
117
+    c7 = vtrn2q_u32(b3, b7);
118
+    c8 = vtrn1q_u32(b8, bC);
119
+    c9 = vtrn1q_u32(b9, bD);
120
+    cA = vtrn1q_u32(bA, bE);
121
+    cB = vtrn1q_u32(bB, bF);
122
+    cC = vtrn2q_u32(b8, bC);
123
+    cD = vtrn2q_u32(b9, bD);
124
+    cE = vtrn2q_u32(bA, bE);
125
+    cF = vtrn2q_u32(bB, bF);
126
+
127
+    d0 = vtrn1q_u16(c0, c2);
128
+    d1 = vtrn1q_u16(c1, c3);
129
+    d2 = vtrn2q_u16(c0, c2);
130
+    d3 = vtrn2q_u16(c1, c3);
131
+    d4 = vtrn1q_u16(c4, c6);
132
+    d5 = vtrn1q_u16(c5, c7);
133
+    d6 = vtrn2q_u16(c4, c6);
134
+    d7 = vtrn2q_u16(c5, c7);
135
+    d8 = vtrn1q_u16(c8, cA);
136
+    d9 = vtrn1q_u16(c9, cB);
137
+    dA = vtrn2q_u16(c8, cA);
138
+    dB = vtrn2q_u16(c9, cB);
139
+    dC = vtrn1q_u16(cC, cE);
140
+    dD = vtrn1q_u16(cD, cF);
141
+    dE = vtrn2q_u16(cC, cE);
142
+    dF = vtrn2q_u16(cD, cF);
143
+
144
+    *(uint16x8_t *)(dst + 0 * dstride)  = vtrn1q_u8(d0, d1);
145
+    *(uint16x8_t *)(dst + 1 * dstride)  = vtrn2q_u8(d0, d1);
146
+    *(uint16x8_t *)(dst + 2 * dstride)  = vtrn1q_u8(d2, d3);
147
+    *(uint16x8_t *)(dst + 3 * dstride)  = vtrn2q_u8(d2, d3);
148
+    *(uint16x8_t *)(dst + 4 * dstride)  = vtrn1q_u8(d4, d5);
149
+    *(uint16x8_t *)(dst + 5 * dstride)  = vtrn2q_u8(d4, d5);
150
+    *(uint16x8_t *)(dst + 6 * dstride)  = vtrn1q_u8(d6, d7);
151
+    *(uint16x8_t *)(dst + 7 * dstride)  = vtrn2q_u8(d6, d7);
152
+    *(uint16x8_t *)(dst + 8 * dstride)  = vtrn1q_u8(d8, d9);
153
+    *(uint16x8_t *)(dst + 9 * dstride)  = vtrn2q_u8(d8, d9);
154
+    *(uint16x8_t *)(dst + 10 * dstride)  = vtrn1q_u8(dA, dB);
155
+    *(uint16x8_t *)(dst + 11 * dstride)  = vtrn2q_u8(dA, dB);
156
+    *(uint16x8_t *)(dst + 12 * dstride)  = vtrn1q_u8(dC, dD);
157
+    *(uint16x8_t *)(dst + 13 * dstride)  = vtrn2q_u8(dC, dD);
158
+    *(uint16x8_t *)(dst + 14 * dstride)  = vtrn1q_u8(dE, dF);
159
+    *(uint16x8_t *)(dst + 15 * dstride)  = vtrn2q_u8(dE, dF);
160
+
161
+
162
+}
163
+
164
+
165
+void transpose32x32(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride)
166
+{
167
+    //assumption: there is no partial overlap
168
+    transpose16x16(dst, src, dstride, sstride);
169
+    transpose16x16(dst + 16 * dstride + 16, src + 16 * sstride + 16, dstride, sstride);
170
+    if (dst == src)
171
+    {
172
+        uint8_t tmp[16 * 16] __attribute__((aligned(64)));
173
+        transpose16x16(tmp, src + 16, 16, sstride);
174
+        transpose16x16(dst + 16, src + 16 * sstride, dstride, sstride);
175
+        for (int i = 0; i < 16; i++)
176
+        {
177
+            COPY_16(dst + (16 + i)*dstride, tmp + 16 * i);
178
+        }
179
+    }
180
+    else
181
+    {
182
+        transpose16x16(dst + 16 * dstride, src + 16, dstride, sstride);
183
+        transpose16x16(dst + 16, src + 16 * sstride, dstride, sstride);
184
+    }
185
+
186
+}
187
+
188
+
189
+
190
+void transpose8x8(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride)
191
+{
192
+    uint16x8_t a0, a1, a2, a3, a4, a5, a6, a7;
193
+    uint16x8_t b0, b1, b2, b3, b4, b5, b6, b7;
194
+
195
+    a0 = *(uint16x8_t *)(src + 0 * sstride);
196
+    a1 = *(uint16x8_t *)(src + 1 * sstride);
197
+    a2 = *(uint16x8_t *)(src + 2 * sstride);
198
+    a3 = *(uint16x8_t *)(src + 3 * sstride);
199
+    a4 = *(uint16x8_t *)(src + 4 * sstride);
200
+    a5 = *(uint16x8_t *)(src + 5 * sstride);
201
+    a6 = *(uint16x8_t *)(src + 6 * sstride);
202
+    a7 = *(uint16x8_t *)(src + 7 * sstride);
203
+
204
+    b0 = vtrn1q_u64(a0, a4);
205
+    b1 = vtrn1q_u64(a1, a5);
206
+    b2 = vtrn1q_u64(a2, a6);
207
+    b3 = vtrn1q_u64(a3, a7);
208
+    b4 = vtrn2q_u64(a0, a4);
209
+    b5 = vtrn2q_u64(a1, a5);
210
+    b6 = vtrn2q_u64(a2, a6);
211
+    b7 = vtrn2q_u64(a3, a7);
212
+
213
+    a0 = vtrn1q_u32(b0, b2);
214
+    a1 = vtrn1q_u32(b1, b3);
215
+    a2 = vtrn2q_u32(b0, b2);
216
+    a3 = vtrn2q_u32(b1, b3);
217
+    a4 = vtrn1q_u32(b4, b6);
218
+    a5 = vtrn1q_u32(b5, b7);
219
+    a6 = vtrn2q_u32(b4, b6);
220
+    a7 = vtrn2q_u32(b5, b7);
221
+
222
+    b0 = vtrn1q_u16(a0, a1);
223
+    b1 = vtrn2q_u16(a0, a1);
224
+    b2 = vtrn1q_u16(a2, a3);
225
+    b3 = vtrn2q_u16(a2, a3);
226
+    b4 = vtrn1q_u16(a4, a5);
227
+    b5 = vtrn2q_u16(a4, a5);
228
+    b6 = vtrn1q_u16(a6, a7);
229
+    b7 = vtrn2q_u16(a6, a7);
230
+
231
+    *(uint16x8_t *)(dst + 0 * dstride) = b0;
232
+    *(uint16x8_t *)(dst + 1 * dstride) = b1;
233
+    *(uint16x8_t *)(dst + 2 * dstride) = b2;
234
+    *(uint16x8_t *)(dst + 3 * dstride) = b3;
235
+    *(uint16x8_t *)(dst + 4 * dstride) = b4;
236
+    *(uint16x8_t *)(dst + 5 * dstride) = b5;
237
+    *(uint16x8_t *)(dst + 6 * dstride) = b6;
238
+    *(uint16x8_t *)(dst + 7 * dstride) = b7;
239
+}
240
+
241
+void transpose16x16(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride)
242
+{
243
+    //assumption: there is no partial overlap
244
+    transpose8x8(dst, src, dstride, sstride);
245
+    transpose8x8(dst + 8 * dstride + 8, src + 8 * sstride + 8, dstride, sstride);
246
+
247
+    if (dst == src)
248
+    {
249
+        uint16_t tmp[8 * 8];
250
+        transpose8x8(tmp, src + 8, 8, sstride);
251
+        transpose8x8(dst + 8, src + 8 * sstride, dstride, sstride);
252
+        for (int i = 0; i < 8; i++)
253
+        {
254
+            COPY_16(dst + (8 + i)*dstride, tmp + 8 * i);
255
+        }
256
+    }
257
+    else
258
+    {
259
+        transpose8x8(dst + 8 * dstride, src + 8, dstride, sstride);
260
+        transpose8x8(dst + 8, src + 8 * sstride, dstride, sstride);
261
+    }
262
+
263
+}
264
+
265
+
266
+
267
+void transpose32x32(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride)
268
+{
269
+    //assumption: there is no partial overlap
270
+    for (int i = 0; i < 4; i++)
271
+    {
272
+        transpose8x8(dst + i * 8 * (1 + dstride), src + i * 8 * (1 + sstride), dstride, sstride);
273
+        for (int j = i + 1; j < 4; j++)
274
+        {
275
+            if (dst == src)
276
+            {
277
+                uint16_t tmp[8 * 8] __attribute__((aligned(64)));
278
+                transpose8x8(tmp, src + 8 * i + 8 * j * sstride, 8, sstride);
279
+                transpose8x8(dst + 8 * i + 8 * j * dstride, src + 8 * j + 8 * i * sstride, dstride, sstride);
280
+                for (int k = 0; k < 8; k++)
281
+                {
282
+                    COPY_16(dst + 8 * j + (8 * i + k)*dstride, tmp + 8 * k);
283
+                }
284
+            }
285
+            else
286
+            {
287
+                transpose8x8(dst + 8 * (j + i * dstride), src + 8 * (i + j * sstride), dstride, sstride);
288
+                transpose8x8(dst + 8 * (i + j * dstride), src + 8 * (j + i * sstride), dstride, sstride);
289
+            }
290
+
291
+        }
292
+    }
293
+}
294
+
295
+
296
+
297
+
298
+}
299
+
300
+
301
+
302
x265_3.6.tar.gz/source/common/aarch64/arm64-utils.h Added
17
 
1
@@ -0,0 +1,15 @@
2
+#ifndef __ARM64_UTILS_H__
3
+#define __ARM64_UTILS_H__
4
+
5
+
6
+namespace X265_NS
7
+{
8
+void transpose8x8(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
9
+void transpose16x16(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
10
+void transpose32x32(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
11
+void transpose8x8(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
12
+void transpose16x16(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
13
+void transpose32x32(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
14
+}
15
+
16
+#endif
17
x265_3.5.tar.gz/source/common/aarch64/asm-primitives.cpp -> x265_3.6.tar.gz/source/common/aarch64/asm-primitives.cpp Changed
2102
 
1
@@ -3,6 +3,7 @@
2
  *
3
  * Authors: Hongbin Liu <liuhongbin1@huawei.com>
4
  *          Yimeng Su <yimeng.su@huawei.com>
5
+ *          Sebastian Pop <spop@amazon.com>
6
  *
7
  * This program is free software; you can redistribute it and/or modify
8
  * it under the terms of the GNU General Public License as published by
9
@@ -22,11 +23,659 @@
10
  * For more information, contact us at license @ x265.com.
11
  *****************************************************************************/
12
 
13
+
14
 #include "common.h"
15
 #include "primitives.h"
16
 #include "x265.h"
17
 #include "cpu.h"
18
 
19
+extern "C" {
20
+#include "fun-decls.h"
21
+}
22
+
23
+#define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \
24
+    p.cu[BLOCK_4x4].prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
25
+    p.cu[BLOCK_8x8].prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
26
+    p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \
27
+    p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \
28
+    p.cu[BLOCK_64x64].prim = fncdef PFX(fname ## _64x64_ ## cpu)
29
+#define LUMA_TU_TYPED_NEON(prim, fncdef, fname) \
30
+    p.cu[BLOCK_4x4].prim   = fncdef PFX(fname ## _4x4_ ## neon); \
31
+    p.cu[BLOCK_8x8].prim   = fncdef PFX(fname ## _8x8_ ## neon); \
32
+    p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## neon); \
33
+    p.cu[BLOCK_64x64].prim = fncdef PFX(fname ## _64x64_ ## neon)
34
+#define LUMA_TU_TYPED_CAN_USE_SVE(prim, fncdef, fname) \
35
+    p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## sve)
36
+#define ALL_LUMA_TU(prim, fname, cpu)      ALL_LUMA_TU_TYPED(prim, , fname, cpu)
37
+#define LUMA_TU_NEON(prim, fname)      LUMA_TU_TYPED_NEON(prim, , fname)
38
+#define LUMA_TU_CAN_USE_SVE(prim, fname)      LUMA_TU_TYPED_CAN_USE_SVE(prim, , fname)
39
+
40
+#define ALL_LUMA_PU_TYPED(prim, fncdef, fname, cpu) \
41
+    p.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
42
+    p.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
43
+    p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
44
+    p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
45
+    p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \
46
+    p.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
47
+    p.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
48
+    p.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
49
+    p.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
50
+    p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
51
+    p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
52
+    p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \
53
+    p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \
54
+    p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
55
+    p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
56
+    p.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
57
+    p.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu); \
58
+    p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
59
+    p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
60
+    p.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
61
+    p.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu); \
62
+    p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \
63
+    p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \
64
+    p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \
65
+    p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu)
66
+#define LUMA_PU_TYPED_MULTIPLE_ARCHS_1(prim, fncdef, fname, cpu) \
67
+    p.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
68
+    p.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
69
+    p.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu)
70
+#define LUMA_PU_TYPED_MULTIPLE_ARCHS_2(prim, fncdef, fname, cpu) \
71
+    p.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
72
+    p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
73
+    p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
74
+    p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \
75
+    p.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
76
+    p.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
77
+    p.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
78
+    p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
79
+    p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
80
+    p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \
81
+    p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \
82
+    p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
83
+    p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
84
+    p.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
85
+    p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
86
+    p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
87
+    p.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
88
+    p.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu); \
89
+    p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \
90
+    p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \
91
+    p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \
92
+    p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu)
93
+#define LUMA_PU_TYPED_NEON_1(prim, fncdef, fname) \
94
+    p.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
95
+    p.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
96
+    p.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon); \
97
+    p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## neon); \
98
+    p.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## neon); \
99
+    p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \
100
+    p.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## neon); \
101
+    p.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## neon); \
102
+    p.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## neon); \
103
+    p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \
104
+    p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \
105
+    p.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## neon); \
106
+    p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## neon); \
107
+    p.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## neon); \
108
+    p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## neon); \
109
+    p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon)
110
+#define LUMA_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname) \
111
+    p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \
112
+    p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## sve); \
113
+    p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve); \
114
+    p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## sve); \
115
+    p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve); \
116
+    p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## sve); \
117
+    p.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## sve); \
118
+    p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## sve); \
119
+    p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## sve)
120
+#define LUMA_PU_TYPED_NEON_2(prim, fncdef, fname) \
121
+    p.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
122
+    p.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## neon); \
123
+    p.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
124
+    p.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## neon); \
125
+    p.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## neon); \
126
+    p.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## neon); \
127
+    p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \
128
+    p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \
129
+    p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \
130
+    p.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## neon); \
131
+    p.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon); \
132
+    p.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## neon); \
133
+    p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon)
134
+#define LUMA_PU_TYPED_MULTIPLE_ARCHS_3(prim, fncdef, fname, cpu) \
135
+    p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
136
+    p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \
137
+    p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
138
+    p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \
139
+    p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \
140
+    p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
141
+    p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
142
+    p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
143
+    p.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
144
+    p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \
145
+    p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \
146
+    p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu)
147
+#define LUMA_PU_TYPED_NEON_3(prim, fncdef, fname) \
148
+    p.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
149
+    p.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
150
+    p.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon)
151
+#define LUMA_PU_TYPED_CAN_USE_SVE2(prim, fncdef, fname) \
152
+    p.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## sve2); \
153
+    p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## sve2); \
154
+    p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve2); \
155
+    p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## sve2); \
156
+    p.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## sve2); \
157
+    p.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## sve2); \
158
+    p.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## sve2); \
159
+    p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## sve2); \
160
+    p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve2); \
161
+    p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## sve2); \
162
+    p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve2); \
163
+    p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## sve2); \
164
+    p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## sve2); \
165
+    p.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## sve2); \
166
+    p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## sve2); \
167
+    p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## sve2); \
168
+    p.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## sve2); \
169
+    p.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## sve2); \
170
+    p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## sve2); \
171
+    p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## sve2); \
172
+    p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## sve2); \
173
+    p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## sve2)
174
+#define LUMA_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef) \
175
+    p.puLUMA_4x4.prim   = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \
176
+    p.puLUMA_8x8.prim   = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \
177
+    p.puLUMA_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \
178
+    p.puLUMA_8x4.prim   = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \
179
+    p.puLUMA_4x8.prim   = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \
180
+    p.puLUMA_16x8.prim  = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \
181
+    p.puLUMA_8x16.prim  = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \
182
+    p.puLUMA_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \
183
+    p.puLUMA_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \
184
+    p.puLUMA_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \
185
+    p.puLUMA_16x4.prim  = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \
186
+    p.puLUMA_4x16.prim  = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \
187
+    p.puLUMA_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \
188
+    p.puLUMA_8x32.prim  = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \
189
+    p.puLUMA_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon)
190
+#define LUMA_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \
191
+    p.puLUMA_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \
192
+    p.puLUMA_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \
193
+    p.puLUMA_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \
194
+    p.puLUMA_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \
195
+    p.puLUMA_32x8.prim  = fncdef PFX(filterPixelToShort ## _32x8_ ## sve); \
196
+    p.puLUMA_64x64.prim = fncdef PFX(filterPixelToShort ## _64x64_ ## sve); \
197
+    p.puLUMA_64x32.prim = fncdef PFX(filterPixelToShort ## _64x32_ ## sve); \
198
+    p.puLUMA_64x48.prim = fncdef PFX(filterPixelToShort ## _64x48_ ## sve); \
199
+    p.puLUMA_64x16.prim = fncdef PFX(filterPixelToShort ## _64x16_ ## sve); \
200
+    p.puLUMA_48x64.prim = fncdef PFX(filterPixelToShort ## _48x64_ ## sve)
201
+#define ALL_LUMA_PU(prim, fname, cpu) ALL_LUMA_PU_TYPED(prim, , fname, cpu)
202
+#define LUMA_PU_MULTIPLE_ARCHS_1(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_1(prim, , fname, cpu)
203
+#define LUMA_PU_MULTIPLE_ARCHS_2(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_2(prim, , fname, cpu)
204
+#define LUMA_PU_NEON_1(prim, fname) LUMA_PU_TYPED_NEON_1(prim, , fname)
205
+#define LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) LUMA_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname)
206
+#define LUMA_PU_NEON_2(prim, fname) LUMA_PU_TYPED_NEON_2(prim, , fname)
207
+#define LUMA_PU_MULTIPLE_ARCHS_3(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_3(prim, , fname, cpu)
208
+#define LUMA_PU_NEON_3(prim, fname) LUMA_PU_TYPED_NEON_3(prim, , fname)
209
+#define LUMA_PU_CAN_USE_SVE2(prim, fname) LUMA_PU_TYPED_CAN_USE_SVE2(prim, , fname)
210
+#define LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) LUMA_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, )
211
+#define LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) LUMA_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, )
212
+
213
+
214
+#define ALL_LUMA_PU_T(prim, fname) \
215
+    p.puLUMA_4x4.prim   = fname<LUMA_4x4>; \
216
+    p.puLUMA_8x8.prim   = fname<LUMA_8x8>; \
217
+    p.puLUMA_16x16.prim = fname<LUMA_16x16>; \
218
+    p.puLUMA_32x32.prim = fname<LUMA_32x32>; \
219
+    p.puLUMA_64x64.prim = fname<LUMA_64x64>; \
220
+    p.puLUMA_8x4.prim   = fname<LUMA_8x4>; \
221
+    p.puLUMA_4x8.prim   = fname<LUMA_4x8>; \
222
+    p.puLUMA_16x8.prim  = fname<LUMA_16x8>; \
223
+    p.puLUMA_8x16.prim  = fname<LUMA_8x16>; \
224
+    p.puLUMA_16x32.prim = fname<LUMA_16x32>; \
225
+    p.puLUMA_32x16.prim = fname<LUMA_32x16>; \
226
+    p.puLUMA_64x32.prim = fname<LUMA_64x32>; \
227
+    p.puLUMA_32x64.prim = fname<LUMA_32x64>; \
228
+    p.puLUMA_16x12.prim = fname<LUMA_16x12>; \
229
+    p.puLUMA_12x16.prim = fname<LUMA_12x16>; \
230
+    p.puLUMA_16x4.prim  = fname<LUMA_16x4>; \
231
+    p.puLUMA_4x16.prim  = fname<LUMA_4x16>; \
232
+    p.puLUMA_32x24.prim = fname<LUMA_32x24>; \
233
+    p.puLUMA_24x32.prim = fname<LUMA_24x32>; \
234
+    p.puLUMA_32x8.prim  = fname<LUMA_32x8>; \
235
+    p.puLUMA_8x32.prim  = fname<LUMA_8x32>; \
236
+    p.puLUMA_64x48.prim = fname<LUMA_64x48>; \
237
+    p.puLUMA_48x64.prim = fname<LUMA_48x64>; \
238
+    p.puLUMA_64x16.prim = fname<LUMA_64x16>; \
239
+    p.puLUMA_16x64.prim = fname<LUMA_16x64>
240
+
241
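(Editorial note, not part of the patch.) ALL_LUMA_PU_T is the one variant that does not point at assembly: each slot receives an instantiation of a C++ function template (fname<LUMA_WxH>), which is how the mixed C/asm interp_8tap_hv_pp_cpu further down is hooked up for every luma partition. A hedged sketch of the same idea with illustrative names follows.

    // Illustrative partition enum and table; not the x265 definitions.
    enum MyPartition { MY_LUMA_8x8, MY_LUMA_16x16, MY_NUM_PARTITIONS };

    struct MySlot { void (*hv_filter)(const unsigned char *, unsigned char *); };

    // One template, instantiated per partition size, so the size is a
    // compile-time constant inside the function body.
    template<int size>
    void my_hv_filter(const unsigned char *src, unsigned char *dst)
    {
        (void)src; (void)dst; // real code would filter a size-specific block here
    }

    void setupTemplateSlots(MySlot (&p)[MY_NUM_PARTITIONS])
    {
        p[MY_LUMA_8x8].hv_filter   = my_hv_filter<MY_LUMA_8x8>;   // cf. fname<LUMA_8x8>
        p[MY_LUMA_16x16].hv_filter = my_hv_filter<MY_LUMA_16x16>; // cf. fname<LUMA_16x16>
    }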
+#define ALL_CHROMA_420_PU_TYPED(prim, fncdef, fname, cpu)               \
242
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
243
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
244
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
245
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
246
+    p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim   = fncdef PFX(fname ## _4x2_ ## cpu); \
247
+    p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim   = fncdef PFX(fname ## _2x4_ ## cpu); \
248
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
249
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
250
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
251
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
252
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
253
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
254
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim   = fncdef PFX(fname ## _8x6_ ## cpu); \
255
+    p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim   = fncdef PFX(fname ## _6x8_ ## cpu); \
256
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim   = fncdef PFX(fname ## _8x2_ ## cpu); \
257
+    p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim   = fncdef PFX(fname ## _2x8_ ## cpu); \
258
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
259
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
260
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
261
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu); \
262
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
263
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
264
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
265
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu)
266
+#define CHROMA_420_PU_TYPED_NEON_1(prim, fncdef, fname)               \
267
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
268
+    p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim   = fncdef PFX(fname ## _4x2_ ## neon); \
269
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
270
+    p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim   = fncdef PFX(fname ## _6x8_ ## neon); \
271
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## neon); \
272
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon); \
273
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## neon); \
274
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## neon); \
275
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim  = fncdef PFX(fname ## _32x8_ ## neon); \
276
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim  = fncdef PFX(fname ## _8x32_ ## neon); \
277
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim   = fncdef PFX(fname ## _8x8_ ## neon); \
278
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \
279
+    p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim   = fncdef PFX(fname ## _2x4_ ## neon); \
280
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim   = fncdef PFX(fname ## _8x4_ ## neon); \
281
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim  = fncdef PFX(fname ## _16x8_ ## neon); \
282
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim  = fncdef PFX(fname ## _8x16_ ## neon); \
283
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \
284
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim   = fncdef PFX(fname ## _8x6_ ## neon); \
285
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim   = fncdef PFX(fname ## _8x2_ ## neon); \
286
+    p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim   = fncdef PFX(fname ## _2x8_ ## neon); \
287
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \
288
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim  = fncdef PFX(fname ## _16x4_ ## neon)
289
+#define CHROMA_420_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname)               \
290
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \
291
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve)
292
+#define CHROMA_420_PU_TYPED_NEON_2(prim, fncdef, fname)               \
293
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
294
+    p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim   = fncdef PFX(fname ## _4x2_ ## neon); \
295
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
296
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon)
297
+#define CHROMA_420_PU_TYPED_MULTIPLE_ARCHS(prim, fncdef, fname, cpu)               \
298
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
299
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
300
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
301
+    p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim   = fncdef PFX(fname ## _2x4_ ## cpu); \
302
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
303
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
304
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
305
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
306
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
307
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim   = fncdef PFX(fname ## _8x6_ ## cpu); \
308
+    p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim   = fncdef PFX(fname ## _6x8_ ## cpu); \
309
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim   = fncdef PFX(fname ## _8x2_ ## cpu); \
310
+    p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim   = fncdef PFX(fname ## _2x8_ ## cpu); \
311
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
312
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
313
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
314
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
315
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
316
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
317
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu)
318
+#define CHROMA_420_PU_TYPED_FILTER_PIXEL_TO_SHORT_NEON(prim, fncdef)               \
319
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim   = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \
320
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim   = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \
321
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \
322
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim   = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \
323
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim   = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \
324
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim  = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \
325
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim  = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \
326
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \
327
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim   = fncdef PFX(filterPixelToShort ## _8x6_ ## neon); \
328
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim   = fncdef PFX(filterPixelToShort ## _8x2_ ## neon); \
329
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \
330
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \
331
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim  = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \
332
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim  = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \
333
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \
334
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim  = fncdef PFX(filterPixelToShort ## _8x32_ ## neon)
335
+#define CHROMA_420_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef)               \
336
+    p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim   = fncdef PFX(filterPixelToShort ## _2x4_ ## sve); \
337
+    p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim   = fncdef PFX(filterPixelToShort ## _2x8_ ## sve); \
338
+    p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim   = fncdef PFX(filterPixelToShort ## _6x8_ ## sve); \
339
+    p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim   = fncdef PFX(filterPixelToShort ## _4x2_ ## sve); \
340
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \
341
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \
342
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \
343
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim  = fncdef PFX(filterPixelToShort ## _32x8_ ## sve)
344
+#define ALL_CHROMA_420_PU(prim, fname, cpu) ALL_CHROMA_420_PU_TYPED(prim, , fname, cpu)
345
+#define CHROMA_420_PU_NEON_1(prim, fname) CHROMA_420_PU_TYPED_NEON_1(prim, , fname)
346
+#define CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) CHROMA_420_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname)
347
+#define CHROMA_420_PU_NEON_2(prim, fname) CHROMA_420_PU_TYPED_NEON_2(prim, , fname)
348
+#define CHROMA_420_PU_MULTIPLE_ARCHS(prim, fname, cpu) CHROMA_420_PU_TYPED_MULTIPLE_ARCHS(prim, , fname, cpu)
349
+#define CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(prim) CHROMA_420_PU_TYPED_FILTER_PIXEL_TO_SHORT_NEON(prim, )
350
+#define CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_420_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, )
351
+
352
+
353
+#define ALL_CHROMA_420_4x4_PU_TYPED(prim, fncdef, fname, cpu) \
354
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
355
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim   = fncdef PFX(fname ## _8x2_ ## cpu); \
356
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
357
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
358
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
359
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
360
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim   = fncdef PFX(fname ## _8x6_ ## cpu); \
361
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
362
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
363
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
364
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
365
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
366
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
367
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
368
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
369
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu); \
370
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
371
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
372
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
373
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu)
374
+#define ALL_CHROMA_420_4x4_PU(prim, fname, cpu) ALL_CHROMA_420_4x4_PU_TYPED(prim, , fname, cpu)
375
+
376
+#define ALL_CHROMA_422_PU_TYPED(prim, fncdef, fname, cpu)               \
377
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
378
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
379
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
380
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \
381
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
382
+    p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim   = fncdef PFX(fname ## _2x8_ ## cpu); \
383
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
384
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu); \
385
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
386
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu); \
387
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
388
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu); \
389
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim  = fncdef PFX(fname ## _8x12_ ## cpu); \
390
+    p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim  = fncdef PFX(fname ## _6x16_ ## cpu); \
391
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
392
+    p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim  = fncdef PFX(fname ## _2x16_ ## cpu); \
393
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## cpu); \
394
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## cpu); \
395
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
396
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim  = fncdef PFX(fname ## _4x32_ ## cpu); \
397
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## cpu); \
398
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## cpu); \
399
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
400
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim  = fncdef PFX(fname ## _8x64_ ## cpu)
401
+#define CHROMA_422_PU_TYPED_NEON_1(prim, fncdef, fname)               \
402
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
403
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
404
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon); \
405
+    p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim  = fncdef PFX(fname ## _6x16_ ## neon); \
406
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## neon); \
407
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim  = fncdef PFX(fname ## _4x32_ ## neon); \
408
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim  = fncdef PFX(fname ## _8x16_ ## neon); \
409
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \
410
+    p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim   = fncdef PFX(fname ## _2x8_ ## neon); \
411
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim   = fncdef PFX(fname ## _8x8_ ## neon); \
412
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \
413
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim  = fncdef PFX(fname ## _8x32_ ## neon); \
414
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon); \
415
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim  = fncdef PFX(fname ## _8x12_ ## neon); \
416
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim   = fncdef PFX(fname ## _8x4_ ## neon); \
417
+    p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim  = fncdef PFX(fname ## _2x16_ ## neon); \
418
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## neon); \
419
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim  = fncdef PFX(fname ## _16x8_ ## neon); \
420
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## neon); \
421
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim  = fncdef PFX(fname ## _8x64_ ## neon)
422
+#define CHROMA_422_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname)               \
423
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve); \
424
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \
425
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## sve); \
426
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve)
427
+#define CHROMA_422_PU_TYPED_NEON_2(prim, fncdef, fname)               \
428
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim   = fncdef PFX(fname ## _4x8_ ## neon); \
429
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim   = fncdef PFX(fname ## _4x4_ ## neon); \
430
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim  = fncdef PFX(fname ## _4x16_ ## neon); \
431
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim  = fncdef PFX(fname ## _4x32_ ## neon)
432
+#define CHROMA_422_PU_TYPED_CAN_USE_SVE2(prim, fncdef, fname)               \
433
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim  = fncdef PFX(fname ## _8x16_ ## sve2); \
434
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## sve2); \
435
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve2); \
436
+    p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim   = fncdef PFX(fname ## _2x8_ ## sve2); \
437
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim   = fncdef PFX(fname ## _8x8_ ## sve2); \
438
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## sve2); \
439
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim  = fncdef PFX(fname ## _8x32_ ## sve2); \
440
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve2); \
441
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## sve2); \
442
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim  = fncdef PFX(fname ## _8x12_ ## sve2); \
443
+    p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim  = fncdef PFX(fname ## _6x16_ ## sve2); \
444
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim   = fncdef PFX(fname ## _8x4_ ## sve2); \
445
+    p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim  = fncdef PFX(fname ## _2x16_ ## sve2); \
446
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## sve2); \
447
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## sve2); \
448
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim  = fncdef PFX(fname ## _16x8_ ## sve2); \
449
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## sve2); \
450
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## sve2); \
451
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve2); \
452
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim  = fncdef PFX(fname ## _8x64_ ## sve2)
453
+#define CHROMA_422_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef)               \
454
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim   = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \
455
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim  = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \
456
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \
457
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim   = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \
458
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim   = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \
459
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim  = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \
460
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \
461
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim  = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \
462
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon); \
463
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim  = fncdef PFX(filterPixelToShort ## _8x12_ ## neon); \
464
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim   = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \
465
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(filterPixelToShort ## _16x24_ ## neon); \
466
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(filterPixelToShort ## _12x32_ ## neon); \
467
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim  = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \
468
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim  = fncdef PFX(filterPixelToShort ## _4x32_ ## neon); \
469
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(filterPixelToShort ## _24x64_ ## neon); \
470
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim  = fncdef PFX(filterPixelToShort ## _8x64_ ## neon)
471
+#define CHROMA_422_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef)               \
472
+    p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim   = fncdef PFX(filterPixelToShort ## _2x8_ ## sve); \
473
+    p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim  = fncdef PFX(filterPixelToShort ## _2x16_ ## sve); \
474
+    p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim  = fncdef PFX(filterPixelToShort ## _6x16_ ## sve); \
475
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \
476
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \
477
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(filterPixelToShort ## _32x48_ ## sve); \
478
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve)
479
+#define ALL_CHROMA_422_PU(prim, fname, cpu) ALL_CHROMA_422_PU_TYPED(prim, , fname, cpu)
480
+#define CHROMA_422_PU_NEON_1(prim, fname) CHROMA_422_PU_TYPED_NEON_1(prim, , fname)
481
+#define CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) CHROMA_422_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname)
482
+#define CHROMA_422_PU_NEON_2(prim, fname) CHROMA_422_PU_TYPED_NEON_2(prim, , fname)
483
+#define CHROMA_422_PU_CAN_USE_SVE2(prim, fname) CHROMA_422_PU_TYPED_CAN_USE_SVE2(prim, , fname)
484
+#define CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) CHROMA_422_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, )
485
+#define CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_422_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, )
486
+
487
+#define ALL_CHROMA_444_PU_TYPED(prim, fncdef, fname, cpu) \
488
+    p.chromaX265_CSP_I444.puLUMA_4x4.prim   = fncdef PFX(fname ## _4x4_ ## cpu); \
489
+    p.chromaX265_CSP_I444.puLUMA_8x8.prim   = fncdef PFX(fname ## _8x8_ ## cpu); \
490
+    p.chromaX265_CSP_I444.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \
491
+    p.chromaX265_CSP_I444.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \
492
+    p.chromaX265_CSP_I444.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \
493
+    p.chromaX265_CSP_I444.puLUMA_8x4.prim   = fncdef PFX(fname ## _8x4_ ## cpu); \
494
+    p.chromaX265_CSP_I444.puLUMA_4x8.prim   = fncdef PFX(fname ## _4x8_ ## cpu); \
495
+    p.chromaX265_CSP_I444.puLUMA_16x8.prim  = fncdef PFX(fname ## _16x8_ ## cpu); \
496
+    p.chromaX265_CSP_I444.puLUMA_8x16.prim  = fncdef PFX(fname ## _8x16_ ## cpu); \
497
+    p.chromaX265_CSP_I444.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \
498
+    p.chromaX265_CSP_I444.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \
499
+    p.chromaX265_CSP_I444.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \
500
+    p.chromaX265_CSP_I444.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \
501
+    p.chromaX265_CSP_I444.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \
502
+    p.chromaX265_CSP_I444.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \
503
+    p.chromaX265_CSP_I444.puLUMA_16x4.prim  = fncdef PFX(fname ## _16x4_ ## cpu); \
504
+    p.chromaX265_CSP_I444.puLUMA_4x16.prim  = fncdef PFX(fname ## _4x16_ ## cpu); \
505
+    p.chromaX265_CSP_I444.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \
506
+    p.chromaX265_CSP_I444.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \
507
+    p.chromaX265_CSP_I444.puLUMA_32x8.prim  = fncdef PFX(fname ## _32x8_ ## cpu); \
508
+    p.chromaX265_CSP_I444.puLUMA_8x32.prim  = fncdef PFX(fname ## _8x32_ ## cpu); \
509
+    p.chromaX265_CSP_I444.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \
510
+    p.chromaX265_CSP_I444.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \
511
+    p.chromaX265_CSP_I444.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \
512
+    p.chromaX265_CSP_I444.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu)
513
+#define CHROMA_444_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef) \
514
+    p.chromaX265_CSP_I444.puLUMA_4x4.prim   = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \
515
+    p.chromaX265_CSP_I444.puLUMA_8x8.prim   = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \
516
+    p.chromaX265_CSP_I444.puLUMA_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \
517
+    p.chromaX265_CSP_I444.puLUMA_8x4.prim   = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \
518
+    p.chromaX265_CSP_I444.puLUMA_4x8.prim   = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \
519
+    p.chromaX265_CSP_I444.puLUMA_16x8.prim  = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \
520
+    p.chromaX265_CSP_I444.puLUMA_8x16.prim  = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \
521
+    p.chromaX265_CSP_I444.puLUMA_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \
522
+    p.chromaX265_CSP_I444.puLUMA_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \
523
+    p.chromaX265_CSP_I444.puLUMA_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \
524
+    p.chromaX265_CSP_I444.puLUMA_16x4.prim  = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \
525
+    p.chromaX265_CSP_I444.puLUMA_4x16.prim  = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \
526
+    p.chromaX265_CSP_I444.puLUMA_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \
527
+    p.chromaX265_CSP_I444.puLUMA_8x32.prim  = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \
528
+    p.chromaX265_CSP_I444.puLUMA_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon)
529
+#define CHROMA_444_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \
530
+    p.chromaX265_CSP_I444.puLUMA_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \
531
+    p.chromaX265_CSP_I444.puLUMA_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \
532
+    p.chromaX265_CSP_I444.puLUMA_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \
533
+    p.chromaX265_CSP_I444.puLUMA_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \
534
+    p.chromaX265_CSP_I444.puLUMA_32x8.prim  = fncdef PFX(filterPixelToShort ## _32x8_ ## sve); \
535
+    p.chromaX265_CSP_I444.puLUMA_64x64.prim = fncdef PFX(filterPixelToShort ## _64x64_ ## sve); \
536
+    p.chromaX265_CSP_I444.puLUMA_64x32.prim = fncdef PFX(filterPixelToShort ## _64x32_ ## sve); \
537
+    p.chromaX265_CSP_I444.puLUMA_64x48.prim = fncdef PFX(filterPixelToShort ## _64x48_ ## sve); \
538
+    p.chromaX265_CSP_I444.puLUMA_64x16.prim = fncdef PFX(filterPixelToShort ## _64x16_ ## sve); \
539
+    p.chromaX265_CSP_I444.puLUMA_48x64.prim = fncdef PFX(filterPixelToShort ## _48x64_ ## sve)
540
+#define ALL_CHROMA_444_PU(prim, fname, cpu) ALL_CHROMA_444_PU_TYPED(prim, , fname, cpu)
541
+#define CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) CHROMA_444_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, )
542
+#define CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_444_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, )
543
+
544
+#define ALL_CHROMA_420_VERT_FILTERS(cpu)                             \
545
+    ALL_CHROMA_420_4x4_PU(filter_vpp, interp_4tap_vert_pp, cpu); \
546
+    ALL_CHROMA_420_4x4_PU(filter_vps, interp_4tap_vert_ps, cpu); \
547
+    ALL_CHROMA_420_4x4_PU(filter_vsp, interp_4tap_vert_sp, cpu); \
548
+    ALL_CHROMA_420_4x4_PU(filter_vss, interp_4tap_vert_ss, cpu)
549
+
550
+#define CHROMA_420_VERT_FILTERS_NEON()                             \
551
+    ALL_CHROMA_420_4x4_PU(filter_vsp, interp_4tap_vert_sp, neon)
552
+
553
+#define CHROMA_420_VERT_FILTERS_CAN_USE_SVE2()                             \
554
+    ALL_CHROMA_420_4x4_PU(filter_vpp, interp_4tap_vert_pp, sve2); \
555
+    ALL_CHROMA_420_4x4_PU(filter_vps, interp_4tap_vert_ps, sve2); \
556
+    ALL_CHROMA_420_4x4_PU(filter_vss, interp_4tap_vert_ss, sve2)
557
+
558
+#define SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(W, H) \
559
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## _ ## neon)
560
+
561
+#define SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(W, H, cpu) \
562
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = PFX(interp_4tap_vert_pp_ ## W ## x ## H ## _ ## cpu); \
563
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = PFX(interp_4tap_vert_ps_ ## W ## x ## H ## _ ## cpu); \
564
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## _ ## cpu)
565
+
566
+#define CHROMA_422_VERT_FILTERS_NEON() \
567
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 8); \
568
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 16); \
569
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 8); \
570
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 16); \
571
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 12); \
572
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 4); \
573
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 32); \
574
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 16); \
575
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 32); \
576
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 24); \
577
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(12, 32); \
578
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 8); \
579
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 32); \
580
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 64); \
581
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 32); \
582
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 64); \
583
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 48); \
584
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(24, 64); \
585
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 16); \
586
+    SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 64)
587
+
588
+#define CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(cpu) \
589
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 8, cpu); \
590
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 16, cpu); \
591
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 8, cpu); \
592
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 16, cpu); \
593
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 12, cpu); \
594
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 4, cpu); \
595
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 32, cpu); \
596
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 16, cpu); \
597
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 32, cpu); \
598
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 24, cpu); \
599
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(12, 32, cpu); \
600
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 8, cpu); \
601
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 32, cpu); \
602
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 64, cpu); \
603
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 32, cpu); \
604
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 64, cpu); \
605
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 48, cpu); \
606
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(24, 64, cpu); \
607
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 16, cpu); \
608
+    SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 64, cpu)
609
+
610
+#define ALL_CHROMA_444_VERT_FILTERS(cpu) \
611
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, cpu); \
612
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, cpu); \
613
+    ALL_CHROMA_444_PU(filter_vsp, interp_4tap_vert_sp, cpu); \
614
+    ALL_CHROMA_444_PU(filter_vss, interp_4tap_vert_ss, cpu)
615
+
616
+#define CHROMA_444_VERT_FILTERS_NEON() \
617
+    ALL_CHROMA_444_PU(filter_vsp, interp_4tap_vert_sp, neon)
618
+
619
+#define CHROMA_444_VERT_FILTERS_CAN_USE_SVE2() \
620
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, sve2); \
621
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, sve2); \
622
+    ALL_CHROMA_444_PU(filter_vss, interp_4tap_vert_ss, sve2)
623
+
624
+#define ALL_CHROMA_420_FILTERS(cpu)                               \
625
+    ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
626
+    ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, cpu); \
627
+    ALL_CHROMA_420_PU(filter_vpp, interp_4tap_vert_pp, cpu);  \
628
+    ALL_CHROMA_420_PU(filter_vps, interp_4tap_vert_ps, cpu)
629
+
630
+#define CHROMA_420_FILTERS_NEON()                               \
631
+    ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, neon); \
632
+    ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, neon)
633
+
634
+#define CHROMA_420_FILTERS_CAN_USE_SVE2()                               \
635
+    ALL_CHROMA_420_PU(filter_vpp, interp_4tap_vert_pp, sve2);  \
636
+    ALL_CHROMA_420_PU(filter_vps, interp_4tap_vert_ps, sve2)
637
+
638
+#define ALL_CHROMA_422_FILTERS(cpu) \
639
+    ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
640
+    ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, cpu); \
641
+    ALL_CHROMA_422_PU(filter_vpp, interp_4tap_vert_pp, cpu);  \
642
+    ALL_CHROMA_422_PU(filter_vps, interp_4tap_vert_ps, cpu)
643
+
644
+#define CHROMA_422_FILTERS_NEON() \
645
+    ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, neon); \
646
+    ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, neon)
647
+
648
+#define CHROMA_422_FILTERS_CAN_USE_SVE2() \
649
+    ALL_CHROMA_422_PU(filter_vpp, interp_4tap_vert_pp, sve2);  \
650
+    ALL_CHROMA_422_PU(filter_vps, interp_4tap_vert_ps, sve2)
651
+
652
+#define ALL_CHROMA_444_FILTERS(cpu) \
653
+    ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
654
+    ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, cpu); \
655
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, cpu);  \
656
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, cpu)
657
+
658
+#define CHROMA_444_FILTERS_NEON() \
659
+    ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, neon); \
660
+    ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, neon)
661
+
662
+#define CHROMA_444_FILTERS_CAN_USE_SVE2() \
663
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, sve2);  \
664
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, sve2)
665
+
666
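(Editorial note, not part of the patch.) The *_NEON, *_CAN_USE_SVE2 and *_SVE_FILTER_PIXEL_TO_SHORT splits above exist so that, at init time, the full Neon table can be installed first and only the block sizes that actually have SVE/SVE2 kernels are overwritten when the CPU reports those features. A minimal sketch of that override order is below; the feature bits and function names are illustrative, not the x265 cpu-mask API.

    #include <cstdint>

    struct MyFilterSlots { void (*vert_pp_8x8)(); void (*vert_pp_32x32)(); };

    extern "C" void my_vert_pp_8x8_neon()   {}
    extern "C" void my_vert_pp_32x32_neon() {}
    extern "C" void my_vert_pp_32x32_sve2() {} // only the wide block has an SVE2 kernel

    enum { MY_CPU_NEON = 1u << 0, MY_CPU_SVE2 = 1u << 1 };

    void setupMyFilters(MyFilterSlots &p, uint32_t cpuMask)
    {
        if (cpuMask & MY_CPU_NEON)
        {
            p.vert_pp_8x8   = my_vert_pp_8x8_neon;    // baseline: everything Neon
            p.vert_pp_32x32 = my_vert_pp_32x32_neon;
        }
        if (cpuMask & MY_CPU_SVE2)
        {
            p.vert_pp_32x32 = my_vert_pp_32x32_sve2;  // selective override
        }
    }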
 
667
 #if defined(__GNUC__)
668
 #define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
669
@@ -35,18 +684,19 @@
670
 #define GCC_4_9_0 40900
671
 #define GCC_5_1_0 50100
672
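(Editorial note, not part of the patch.) GCC_VERSION packs major/minor/patch into one integer so that compiler-version gates reduce to a plain comparison against constants like GCC_4_9_0 and GCC_5_1_0. A worked example, assuming a hypothetical GCC 8.3.0 build:

    // 8 * 10000 + 3 * 100 + 0 = 80300, which is >= GCC_5_1_0 (50100),
    // so any "GCC_VERSION < GCC_5_1_0" guard evaluates false.
    static_assert(8 * 10000 + 3 * 100 + 0 == 80300, "worked example");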
 
673
-extern "C" {
674
-#include "pixel.h"
675
-#include "pixel-util.h"
676
-#include "ipfilter8.h"
677
-}
678
+#include "pixel-prim.h"
679
+#include "filter-prim.h"
680
+#include "dct-prim.h"
681
+#include "loopfilter-prim.h"
682
+#include "intrapred-prim.h"
683
 
684
-namespace X265_NS {
685
+namespace X265_NS
686
+{
687
 // private x265 namespace
688
 
689
 
690
 template<int size>
691
-void interp_8tap_hv_pp_cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY)
692
+void interp_8tap_hv_pp_cpu(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
693
 {
694
     ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA - 1)]);
695
     const int halfFilterSize = NTAPS_LUMA >> 1;
696
@@ -56,164 +706,1259 @@
697
     primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * immedStride, immedStride, dst, dstStride, idxY);
698
 }
699
 
700
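(Editorial note, not part of the patch.) interp_8tap_hv_pp_cpu above is the classic separable two-pass interpolation: a horizontal pass writes 16-bit intermediates into a stack buffer with extra rows of vertical margin for the filter taps, then a vertical pass reads those intermediates back and produces pixels. A hedged sketch of that structure is below; the margin handling and the filters themselves are simplified, and all names are illustrative rather than the x265 primitives.

    #include <cstdint>
    #include <cstddef>

    namespace hv_sketch {

    const int kTaps  = 8;   // 8-tap luma filter, as with NTAPS_LUMA
    const int kBlock = 16;  // block size used by this sketch

    // Horizontal pass: widen/filter into a 16-bit intermediate buffer,
    // including (kTaps - 1) extra rows of vertical margin for the second pass.
    void horiz_ps(const uint8_t *src, std::ptrdiff_t srcStride,
                  int16_t *dst, std::ptrdiff_t dstStride, int rows)
    {
        for (int y = 0; y < rows; y++)
            for (int x = 0; x < kBlock; x++)
                dst[y * dstStride + x] = src[y * srcStride + x]; // real code filters here
    }

    // Vertical pass: filter the intermediates and clip back to pixels.
    void vert_sp(const int16_t *src, std::ptrdiff_t srcStride,
                 uint8_t *dst, std::ptrdiff_t dstStride)
    {
        for (int y = 0; y < kBlock; y++)
            for (int x = 0; x < kBlock; x++)
                dst[y * dstStride + x] = (uint8_t)src[y * srcStride + x];
    }

    void hv_pp(const uint8_t *src, std::ptrdiff_t srcStride,
               uint8_t *dst, std::ptrdiff_t dstStride)
    {
        const int half = kTaps >> 1;
        int16_t immed[kBlock * (kBlock + kTaps - 1)];
        // Start (half - 1) rows above the block so the vertical filter has its taps.
        horiz_ps(src - (half - 1) * srcStride, srcStride, immed, kBlock, kBlock + kTaps - 1);
        vert_sp(immed + (half - 1) * kBlock, kBlock, dst, dstStride);
    }

    } // namespace hv_sketch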
-
701
-/* Temporary workaround because luma_vsp assembly primitive has not been completed
702
- * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
703
- * Otherwise, segment fault occurs. */
704
-void setupAliasCPrimitives(EncoderPrimitives &cp, EncoderPrimitives &asmp, int cpuMask)
705
+void setupNeonPrimitives(EncoderPrimitives &p)
706
 {
707
-    if (cpuMask & X265_CPU_NEON)
708
-    {
709
-        asmp.puLUMA_8x4.luma_vsp   = cp.puLUMA_8x4.luma_vsp;
710
-        asmp.puLUMA_8x8.luma_vsp   = cp.puLUMA_8x8.luma_vsp;
711
-        asmp.puLUMA_8x16.luma_vsp  = cp.puLUMA_8x16.luma_vsp;
712
-        asmp.puLUMA_8x32.luma_vsp  = cp.puLUMA_8x32.luma_vsp;
713
-        asmp.puLUMA_12x16.luma_vsp = cp.puLUMA_12x16.luma_vsp;
714
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */
715
-        asmp.puLUMA_16x4.luma_vsp  = cp.puLUMA_16x4.luma_vsp;
716
-        asmp.puLUMA_16x8.luma_vsp  = cp.puLUMA_16x8.luma_vsp;
717
-        asmp.puLUMA_16x12.luma_vsp = cp.puLUMA_16x12.luma_vsp;
718
-        asmp.puLUMA_16x16.luma_vsp = cp.puLUMA_16x16.luma_vsp;
719
-        asmp.puLUMA_16x32.luma_vsp = cp.puLUMA_16x32.luma_vsp;
720
-        asmp.puLUMA_16x64.luma_vsp = cp.puLUMA_16x64.luma_vsp;
721
-        asmp.puLUMA_32x16.luma_vsp = cp.puLUMA_32x16.luma_vsp;
722
-        asmp.puLUMA_32x24.luma_vsp = cp.puLUMA_32x24.luma_vsp;
723
-        asmp.puLUMA_32x32.luma_vsp = cp.puLUMA_32x32.luma_vsp;
724
-        asmp.puLUMA_32x64.luma_vsp = cp.puLUMA_32x64.luma_vsp;
725
-        asmp.puLUMA_48x64.luma_vsp = cp.puLUMA_48x64.luma_vsp;
726
-        asmp.puLUMA_64x16.luma_vsp = cp.puLUMA_64x16.luma_vsp;
727
-        asmp.puLUMA_64x32.luma_vsp = cp.puLUMA_64x32.luma_vsp;
728
-        asmp.puLUMA_64x48.luma_vsp = cp.puLUMA_64x48.luma_vsp;
729
-        asmp.puLUMA_64x64.luma_vsp = cp.puLUMA_64x64.luma_vsp;    
730
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_4_9_0 /* gcc_version < gcc-4.9.0 */
731
-        asmp.puLUMA_4x4.luma_vsp   = cp.puLUMA_4x4.luma_vsp;
732
-        asmp.puLUMA_4x8.luma_vsp   = cp.puLUMA_4x8.luma_vsp;
733
-        asmp.puLUMA_4x16.luma_vsp  = cp.puLUMA_4x16.luma_vsp;
734
-        asmp.puLUMA_24x32.luma_vsp = cp.puLUMA_24x32.luma_vsp;
735
-        asmp.puLUMA_32x8.luma_vsp  = cp.puLUMA_32x8.luma_vsp;
736
+    setupPixelPrimitives_neon(p);
737
+    setupFilterPrimitives_neon(p);
738
+    setupDCTPrimitives_neon(p);
739
+    setupLoopFilterPrimitives_neon(p);
740
+    setupIntraPrimitives_neon(p);
741
+
742
+    ALL_CHROMA_420_PU(p2sNONALIGNED, filterPixelToShort, neon);
743
+    ALL_CHROMA_422_PU(p2sALIGNED, filterPixelToShort, neon);
744
+    ALL_CHROMA_444_PU(p2sALIGNED, filterPixelToShort, neon);
745
+    ALL_LUMA_PU(convert_p2sALIGNED, filterPixelToShort, neon);
746
+    ALL_CHROMA_420_PU(p2sALIGNED, filterPixelToShort, neon);
747
+    ALL_CHROMA_422_PU(p2sNONALIGNED, filterPixelToShort, neon);
748
+    ALL_CHROMA_444_PU(p2sNONALIGNED, filterPixelToShort, neon);
749
+    ALL_LUMA_PU(convert_p2sNONALIGNED, filterPixelToShort, neon);
750
+
751
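(Editorial note, not part of the patch.) Several primitives above appear twice, once with an ALIGNED and once with a NONALIGNED suffix: the table keeps separate slots so callers with known-aligned buffers can be routed to a faster kernel where one exists, and in this Neon setup both slots are simply wired to the same routine. A small illustrative sketch, with names that are not the x265 API:

    #include <cstdint>

    extern "C" void my_p2s_16x16_neon(const uint8_t *, int16_t *) {}

    struct MyP2SSlots
    {
        // [0] = NONALIGNED, [1] = ALIGNED, mirroring the two suffixes above.
        void (*p2s[2])(const uint8_t *, int16_t *);
    };

    void wireP2S(MyP2SSlots &s)
    {
        s.p2s[0] = my_p2s_16x16_neon; // NONALIGNED slot
        s.p2s[1] = my_p2s_16x16_neon; // ALIGNED slot shares the same Neon kernel
    }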
+#if !HIGH_BIT_DEPTH
752
+    ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, neon);
753
+    ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, neon);
754
+    ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, neon);
755
+    ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon);
756
+    ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon);
757
+    ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, neon);
758
+    ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu);
759
+    ALL_CHROMA_420_VERT_FILTERS(neon);
760
+    CHROMA_422_VERT_FILTERS_NEON();
761
+    CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(neon);
762
+    ALL_CHROMA_444_VERT_FILTERS(neon);
763
+    ALL_CHROMA_420_FILTERS(neon);
764
+    ALL_CHROMA_422_FILTERS(neon);
765
+    ALL_CHROMA_444_FILTERS(neon);
766
+
767
+    // Blockcopy_pp
768
+    ALL_LUMA_PU(copy_pp, blockcopy_pp, neon);
769
+    ALL_CHROMA_420_PU(copy_pp, blockcopy_pp, neon);
770
+    ALL_CHROMA_422_PU(copy_pp, blockcopy_pp, neon);
771
+    p.cuBLOCK_4x4.copy_pp   = PFX(blockcopy_pp_4x4_neon);
772
+    p.cuBLOCK_8x8.copy_pp   = PFX(blockcopy_pp_8x8_neon);
773
+    p.cuBLOCK_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
774
+    p.cuBLOCK_32x32.copy_pp = PFX(blockcopy_pp_32x32_neon);
775
+    p.cuBLOCK_64x64.copy_pp = PFX(blockcopy_pp_64x64_neon);
776
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon);
777
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon);
778
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
779
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_neon);
780
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon);
781
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon);
782
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon);
783
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_neon);
784
+
785
+#endif // !HIGH_BIT_DEPTH
786
+
787
+    // Blockcopy_ss
788
+    p.cuBLOCK_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
789
+    p.cuBLOCK_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
790
+    p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_neon);
791
+    p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_neon);
792
+    p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_neon);
793
+
794
+    // Blockcopy_ps
795
+    p.cuBLOCK_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
796
+    p.cuBLOCK_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
797
+    p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_neon);
798
+    p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_neon);
799
+    p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_neon);
800
+
801
+    // Blockcopy_sp
802
+    p.cuBLOCK_4x4.copy_sp   = PFX(blockcopy_sp_4x4_neon);
803
+    p.cuBLOCK_8x8.copy_sp   = PFX(blockcopy_sp_8x8_neon);
804
+    p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_neon);
805
+    p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_neon);
806
+    p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon);
807
+
808
+    // chroma blockcopy_ss
809
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
810
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
811
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_neon);
812
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_neon);
813
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss   = PFX(blockcopy_ss_4x8_neon);
814
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss  = PFX(blockcopy_ss_8x16_neon);
815
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_neon);
816
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_neon);
817
+
818
+    // chroma blockcopy_ps
819
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
820
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
821
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_neon);
822
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_neon);
823
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps   = PFX(blockcopy_ps_4x8_neon);
824
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps  = PFX(blockcopy_ps_8x16_neon);
825
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_neon);
826
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_neon);
827
+
828
+    // chroma blockcopy_sp
829
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp   = PFX(blockcopy_sp_4x4_neon);
830
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp   = PFX(blockcopy_sp_8x8_neon);
831
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_neon);
832
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_neon);
833
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp   = PFX(blockcopy_sp_4x8_neon);
834
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp  = PFX(blockcopy_sp_8x16_neon);
835
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_neon);
836
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_neon);
837
+
838
+    // Block_fill
839
+    ALL_LUMA_TU(blockfill_sALIGNED, blockfill_s, neon);
840
+    ALL_LUMA_TU(blockfill_sNONALIGNED, blockfill_s, neon);
841
+
842
+    // copy_count
843
+    p.cuBLOCK_4x4.copy_cnt     = PFX(copy_cnt_4_neon);
844
+    p.cuBLOCK_8x8.copy_cnt     = PFX(copy_cnt_8_neon);
845
+    p.cuBLOCK_16x16.copy_cnt   = PFX(copy_cnt_16_neon);
846
+    p.cuBLOCK_32x32.copy_cnt   = PFX(copy_cnt_32_neon);
847
+
848
+    // count nonzero
849
+    p.cuBLOCK_4x4.count_nonzero     = PFX(count_nonzero_4_neon);
850
+    p.cuBLOCK_8x8.count_nonzero     = PFX(count_nonzero_8_neon);
851
+    p.cuBLOCK_16x16.count_nonzero   = PFX(count_nonzero_16_neon);
852
+    p.cuBLOCK_32x32.count_nonzero   = PFX(count_nonzero_32_neon);
853
+
854
+    // cpy2Dto1D_shl
855
+    p.cuBLOCK_4x4.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_4x4_neon);
856
+    p.cuBLOCK_8x8.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_8x8_neon);
857
+    p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_neon);
858
+    p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_neon);
859
+    p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_neon);
860
+
861
+    // cpy2Dto1D_shr
862
+    p.cuBLOCK_4x4.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_4x4_neon);
863
+    p.cuBLOCK_8x8.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_8x8_neon);
864
+    p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_neon);
865
+    p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_neon);
866
+
867
+    // cpy1Dto2D_shl
868
+    p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_4x4_neon);
869
+    p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_8x8_neon);
870
+    p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_16x16_neon);
871
+    p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_32x32_neon);
872
+    p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_64x64_neon);
873
+
874
+    p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_4x4_neon);
875
+    p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_8x8_neon);
876
+    p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_neon);
877
+    p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_neon);
878
+    p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_neon);
879
+
880
+    // cpy1Dto2D_shr
881
+    p.cuBLOCK_4x4.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_4x4_neon);
882
+    p.cuBLOCK_8x8.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_8x8_neon);
883
+    p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_neon);
884
+    p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_neon);
885
+    p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_neon);
886
+
887
+#if !HIGH_BIT_DEPTH
888
+    // pixel_avg_pp
889
+    ALL_LUMA_PU(pixelavg_ppNONALIGNED, pixel_avg_pp, neon);
890
+    ALL_LUMA_PU(pixelavg_ppALIGNED, pixel_avg_pp, neon);
891
+
892
+    // addAvg
893
+    ALL_LUMA_PU(addAvgNONALIGNED, addAvg, neon);
894
+    ALL_LUMA_PU(addAvgALIGNED, addAvg, neon);
895
+    ALL_CHROMA_420_PU(addAvgNONALIGNED, addAvg, neon);
896
+    ALL_CHROMA_422_PU(addAvgNONALIGNED, addAvg, neon);
897
+    ALL_CHROMA_420_PU(addAvgALIGNED, addAvg, neon);
898
+    ALL_CHROMA_422_PU(addAvgALIGNED, addAvg, neon);
899
+
900
+    // sad
901
+    ALL_LUMA_PU(sad, pixel_sad, neon);
902
+    ALL_LUMA_PU(sad_x3, sad_x3, neon);
903
+    ALL_LUMA_PU(sad_x4, sad_x4, neon);
904
+
905
+    // sse_pp
906
+    p.cuBLOCK_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_neon);
907
+    p.cuBLOCK_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
908
+    p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
909
+    p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon);
910
+    p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_neon);
911
+
912
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_neon);
913
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
914
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
915
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon);
916
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp   = PFX(pixel_sse_pp_4x8_neon);
917
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp  = PFX(pixel_sse_pp_8x16_neon);
918
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon);
919
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_neon);
920
+
921
+    // sse_ss
922
+    p.cuBLOCK_4x4.sse_ss   = PFX(pixel_sse_ss_4x4_neon);
923
+    p.cuBLOCK_8x8.sse_ss   = PFX(pixel_sse_ss_8x8_neon);
924
+    p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_neon);
925
+    p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_neon);
926
+    p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_neon);
927
+
928
+    // ssd_s
929
+    p.cuBLOCK_4x4.ssd_sNONALIGNED   = PFX(pixel_ssd_s_4x4_neon);
930
+    p.cuBLOCK_8x8.ssd_sNONALIGNED   = PFX(pixel_ssd_s_8x8_neon);
931
+    p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_neon);
932
+    p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_neon);
933
+
934
+    p.cuBLOCK_4x4.ssd_sALIGNED   = PFX(pixel_ssd_s_4x4_neon);
935
+    p.cuBLOCK_8x8.ssd_sALIGNED   = PFX(pixel_ssd_s_8x8_neon);
936
+    p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_neon);
937
+    p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_neon);
938
+
939
+    // pixel_var
940
+    p.cuBLOCK_8x8.var   = PFX(pixel_var_8x8_neon);
941
+    p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_neon);
942
+    p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_neon);
943
+    p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_neon);
944
+
945
+    // calc_Residual
946
+    p.cuBLOCK_4x4.calcresidualNONALIGNED   = PFX(getResidual4_neon);
947
+    p.cuBLOCK_8x8.calcresidualNONALIGNED   = PFX(getResidual8_neon);
948
+    p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_neon);
949
+    p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_neon);
950
+
951
+    p.cuBLOCK_4x4.calcresidualALIGNED   = PFX(getResidual4_neon);
952
+    p.cuBLOCK_8x8.calcresidualALIGNED   = PFX(getResidual8_neon);
953
+    p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_neon);
954
+    p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_neon);
955
+
956
+    // pixel_sub_ps
957
+    p.cuBLOCK_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
958
+    p.cuBLOCK_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
959
+    p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
960
+    p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon);
961
+    p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_neon);
962
+
963
+    // chroma sub_ps
964
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
965
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
966
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
967
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon);
968
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps   = PFX(pixel_sub_ps_4x8_neon);
969
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps  = PFX(pixel_sub_ps_8x16_neon);
970
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon);
971
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_neon);
972
+
973
+    // pixel_add_ps
974
+    p.cuBLOCK_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_neon);
975
+    p.cuBLOCK_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_neon);
976
+    p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon);
977
+    p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon);
978
+    p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_neon);
979
+
980
+    p.cuBLOCK_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_neon);
981
+    p.cuBLOCK_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_neon);
982
+    p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon);
983
+    p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon);
984
+    p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_neon);
985
+
986
+    // chroma add_ps
987
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_neon);
988
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_neon);
989
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon);
990
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon);
991
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED   = PFX(pixel_add_ps_4x8_neon);
992
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED  = PFX(pixel_add_ps_8x16_neon);
993
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_neon);
994
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_neon);
995
+
996
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_neon);
997
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_neon);
998
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon);
999
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon);
1000
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED   = PFX(pixel_add_ps_4x8_neon);
1001
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED  = PFX(pixel_add_ps_8x16_neon);
1002
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_neon);
1003
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_neon);
1004
+
1005
+    //scale2D_64to32
1006
+    p.scale2D_64to32  = PFX(scale2D_64to32_neon);
1007
+
1008
+    // scale1D_128to64
1009
+    p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_neon);
1010
+    p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_neon);
1011
+
1012
+    // planecopy
1013
+    p.planecopy_cp = PFX(pixel_planecopy_cp_neon);
1014
+
1015
+    // satd
1016
+    ALL_LUMA_PU(satd, pixel_satd, neon);
1017
+
1018
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd   = PFX(pixel_satd_4x4_neon);
1019
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd   = PFX(pixel_satd_8x8_neon);
1020
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = PFX(pixel_satd_16x16_neon);
1021
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = PFX(pixel_satd_32x32_neon);
1022
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd   = PFX(pixel_satd_8x4_neon);
1023
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd   = PFX(pixel_satd_4x8_neon);
1024
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd  = PFX(pixel_satd_16x8_neon);
1025
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd  = PFX(pixel_satd_8x16_neon);
1026
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = PFX(pixel_satd_32x16_neon);
1027
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = PFX(pixel_satd_16x32_neon);
1028
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = PFX(pixel_satd_16x12_neon);
1029
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon);
1030
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd  = PFX(pixel_satd_16x4_neon);
1031
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd  = PFX(pixel_satd_4x16_neon);
1032
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = PFX(pixel_satd_32x24_neon);
1033
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = PFX(pixel_satd_24x32_neon);
1034
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd  = PFX(pixel_satd_32x8_neon);
1035
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd  = PFX(pixel_satd_8x32_neon);
1036
+
1037
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd   = PFX(pixel_satd_4x8_neon);
1038
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd  = PFX(pixel_satd_8x16_neon);
1039
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = PFX(pixel_satd_16x32_neon);
1040
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = PFX(pixel_satd_32x64_neon);
1041
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd   = PFX(pixel_satd_4x4_neon);
1042
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd   = PFX(pixel_satd_8x8_neon);
1043
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd  = PFX(pixel_satd_4x16_neon);
1044
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = PFX(pixel_satd_16x16_neon);
1045
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd  = PFX(pixel_satd_8x32_neon);
1046
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = PFX(pixel_satd_32x32_neon);
1047
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = PFX(pixel_satd_16x64_neon);
1048
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd  = PFX(pixel_satd_8x12_neon);
1049
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd   = PFX(pixel_satd_8x4_neon);
1050
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = PFX(pixel_satd_16x24_neon);
1051
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon);
1052
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd  = PFX(pixel_satd_16x8_neon);
1053
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd  = PFX(pixel_satd_4x32_neon);
1054
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = PFX(pixel_satd_32x48_neon);
1055
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = PFX(pixel_satd_24x64_neon);
1056
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = PFX(pixel_satd_32x16_neon);
1057
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd  = PFX(pixel_satd_8x64_neon);
1058
+
1059
+    // sa8d
1060
+    p.cuBLOCK_4x4.sa8d   = PFX(pixel_satd_4x4_neon);
1061
+    p.cuBLOCK_8x8.sa8d   = PFX(pixel_sa8d_8x8_neon);
1062
+    p.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon);
1063
+    p.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon);
1064
+    p.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon);
1065
+    p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = PFX(pixel_satd_4x4_neon);
1066
+    p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon);
1067
+    p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon);
1068
+    p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon);
1069
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sa8d = PFX(pixel_sa8d_8x16_neon);
1070
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sa8d = PFX(pixel_sa8d_16x32_neon);
1071
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sa8d = PFX(pixel_sa8d_32x64_neon);
1072
+
1073
+    // dequant_scaling
1074
+    p.dequant_scaling = PFX(dequant_scaling_neon);
1075
+    p.dequant_normal  = PFX(dequant_normal_neon);
1076
+
1077
+    // ssim_4x4x2_core
1078
+    p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_neon);
1079
+
1080
+    // ssimDist
1081
+    p.cuBLOCK_4x4.ssimDist = PFX(ssimDist4_neon);
1082
+    p.cuBLOCK_8x8.ssimDist = PFX(ssimDist8_neon);
1083
+    p.cuBLOCK_16x16.ssimDist = PFX(ssimDist16_neon);
1084
+    p.cuBLOCK_32x32.ssimDist = PFX(ssimDist32_neon);
1085
+    p.cuBLOCK_64x64.ssimDist = PFX(ssimDist64_neon);
1086
+
1087
+    // normFact
1088
+    p.cuBLOCK_8x8.normFact = PFX(normFact8_neon);
1089
+    p.cuBLOCK_16x16.normFact = PFX(normFact16_neon);
1090
+    p.cuBLOCK_32x32.normFact = PFX(normFact32_neon);
1091
+    p.cuBLOCK_64x64.normFact = PFX(normFact64_neon);
1092
+
1093
+    // psy_cost_pp
1094
+    p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon);
1095
+
1096
+    p.weight_pp = PFX(weight_pp_neon);
1097
+#if !defined(__APPLE__)
1098
+    p.scanPosLast = PFX(scanPosLast_neon);
1099
 #endif
1100
+    p.costCoeffNxN = PFX(costCoeffNxN_neon);
1101
 #endif
1102
-    }
1103
-}
1104
 
1105
+    // quant
1106
+    p.quant = PFX(quant_neon);
1107
+    p.nquant = PFX(nquant_neon);
1108
+}
1109
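The assignments above fill x265's EncoderPrimitives function-pointer table with AArch64 NEON kernels: each entry maps a partition or CU size to an assembly routine wrapped in the PFX() macro, and in this hunk the ALIGNED and NONALIGNED slots are both pointed at the same NEON implementation. The following is only a minimal sketch of that dispatch pattern; the struct layout, enum values, PFX expansion, and kernel bodies are simplified placeholders, not the actual x265 definitions.

#include <cstdint>
#include <cstdio>

// Stand-in for x265's PFX() macro, assumed here to prepend an internal
// symbol prefix to a kernel name; the real expansion differs.
#define PFX(fn) demo_##fn

// One primitive signature: SATD cost of two pixel blocks.
typedef int (*satd_t)(const uint8_t* a, intptr_t strideA,
                      const uint8_t* b, intptr_t strideB);

enum { BLOCK_4x4, NUM_BLOCKS };          // simplified block-size index
enum { NONALIGNED, ALIGNED, NUM_ALIGN }; // aligned/unaligned variants

struct EncoderPrimitivesSketch
{
    struct CUPrimitives { satd_t satd[NUM_ALIGN]; };
    CUPrimitives cu[NUM_BLOCKS];
};

// Portable C fallback and a stand-in for a NEON kernel.
static int demo_satd_4x4_c(const uint8_t*, intptr_t, const uint8_t*, intptr_t)    { return 1; }
static int demo_satd_4x4_neon(const uint8_t*, intptr_t, const uint8_t*, intptr_t) { return 2; }

// C fallbacks are registered first ...
static void setupCPrimitives(EncoderPrimitivesSketch& p)
{
    p.cu[BLOCK_4x4].satd[NONALIGNED] = PFX(satd_4x4_c);
    p.cu[BLOCK_4x4].satd[ALIGNED]    = PFX(satd_4x4_c);
}

// ... then overwritten with NEON kernels, reusing the same routine for both
// alignment slots, mirroring the registrations in the hunk above.
static void setupNeonPrimitivesSketch(EncoderPrimitivesSketch& p)
{
    p.cu[BLOCK_4x4].satd[NONALIGNED] = PFX(satd_4x4_neon);
    p.cu[BLOCK_4x4].satd[ALIGNED]    = PFX(satd_4x4_neon);
}

int main()
{
    EncoderPrimitivesSketch p;
    setupCPrimitives(p);
    setupNeonPrimitivesSketch(p);
    uint8_t blk[16] = {0};
    std::printf("%d\n", p.cu[BLOCK_4x4].satd[ALIGNED](blk, 4, blk, 4)); // prints 2
    return 0;
}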
 
1110
-void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) 
1111
+#if defined(HAVE_SVE2) || defined(HAVE_SVE)
1112
+void setupSvePrimitives(EncoderPrimitives &p)
1113
 {
1114
-    if (cpuMask & X265_CPU_NEON)
1115
-    {
1116
-        p.puLUMA_4x4.satd   = PFX(pixel_satd_4x4_neon);
1117
-        p.puLUMA_4x8.satd   = PFX(pixel_satd_4x8_neon);
1118
-        p.puLUMA_4x16.satd  = PFX(pixel_satd_4x16_neon);
1119
-        p.puLUMA_8x4.satd   = PFX(pixel_satd_8x4_neon);
1120
-        p.puLUMA_8x8.satd   = PFX(pixel_satd_8x8_neon);
1121
-        p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon);
1122
-        
1123
-        p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd    = PFX(pixel_satd_4x4_neon);
1124
-        p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd    = PFX(pixel_satd_4x8_neon);
1125
-        p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd   = PFX(pixel_satd_4x16_neon);
1126
-        p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd    = PFX(pixel_satd_8x4_neon);
1127
-        p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd    = PFX(pixel_satd_8x8_neon);
1128
-        p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd  = PFX(pixel_satd_12x16_neon);
1129
-        
1130
-        p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd    = PFX(pixel_satd_4x4_neon);
1131
-        p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd    = PFX(pixel_satd_4x8_neon);
1132
-        p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd   = PFX(pixel_satd_4x16_neon);
1133
-        p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd   = PFX(pixel_satd_4x32_neon);
1134
-        p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd    = PFX(pixel_satd_8x4_neon);
1135
-        p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd    = PFX(pixel_satd_8x8_neon);
1136
-        p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd  = PFX(pixel_satd_12x32_neon);
1137
-
1138
-        p.puLUMA_4x4.pixelavg_ppNONALIGNED   = PFX(pixel_avg_pp_4x4_neon);
1139
-        p.puLUMA_4x8.pixelavg_ppNONALIGNED   = PFX(pixel_avg_pp_4x8_neon);
1140
-        p.puLUMA_4x16.pixelavg_ppNONALIGNED  = PFX(pixel_avg_pp_4x16_neon);
1141
-        p.puLUMA_8x4.pixelavg_ppNONALIGNED   = PFX(pixel_avg_pp_8x4_neon);
1142
-        p.puLUMA_8x8.pixelavg_ppNONALIGNED   = PFX(pixel_avg_pp_8x8_neon);
1143
-        p.puLUMA_8x16.pixelavg_ppNONALIGNED  = PFX(pixel_avg_pp_8x16_neon);
1144
-        p.puLUMA_8x32.pixelavg_ppNONALIGNED  = PFX(pixel_avg_pp_8x32_neon);
1145
-
1146
-        p.puLUMA_4x4.pixelavg_ppALIGNED   = PFX(pixel_avg_pp_4x4_neon);
1147
-        p.puLUMA_4x8.pixelavg_ppALIGNED   = PFX(pixel_avg_pp_4x8_neon);
1148
-        p.puLUMA_4x16.pixelavg_ppALIGNED  = PFX(pixel_avg_pp_4x16_neon);
1149
-        p.puLUMA_8x4.pixelavg_ppALIGNED   = PFX(pixel_avg_pp_8x4_neon);
1150
-        p.puLUMA_8x8.pixelavg_ppALIGNED   = PFX(pixel_avg_pp_8x8_neon);
1151
-        p.puLUMA_8x16.pixelavg_ppALIGNED  = PFX(pixel_avg_pp_8x16_neon);
1152
-        p.puLUMA_8x32.pixelavg_ppALIGNED  = PFX(pixel_avg_pp_8x32_neon);
1153
-
1154
-        p.puLUMA_8x4.sad_x3   = PFX(sad_x3_8x4_neon);
1155
-        p.puLUMA_8x8.sad_x3   = PFX(sad_x3_8x8_neon);
1156
-        p.puLUMA_8x16.sad_x3  = PFX(sad_x3_8x16_neon);
1157
-        p.puLUMA_8x32.sad_x3  = PFX(sad_x3_8x32_neon);
1158
-
1159
-        p.puLUMA_8x4.sad_x4   = PFX(sad_x4_8x4_neon);
1160
-        p.puLUMA_8x8.sad_x4   = PFX(sad_x4_8x8_neon);
1161
-        p.puLUMA_8x16.sad_x4  = PFX(sad_x4_8x16_neon);
1162
-        p.puLUMA_8x32.sad_x4  = PFX(sad_x4_8x32_neon);
1163
-
1164
-        // quant
1165
-        p.quant = PFX(quant_neon);
1166
-        // luma_hps
1167
-        p.puLUMA_4x4.luma_hps   = PFX(interp_8tap_horiz_ps_4x4_neon);
1168
-        p.puLUMA_4x8.luma_hps   = PFX(interp_8tap_horiz_ps_4x8_neon);
1169
-        p.puLUMA_4x16.luma_hps  = PFX(interp_8tap_horiz_ps_4x16_neon);
1170
-        p.puLUMA_8x4.luma_hps   = PFX(interp_8tap_horiz_ps_8x4_neon);
1171
-        p.puLUMA_8x8.luma_hps   = PFX(interp_8tap_horiz_ps_8x8_neon);
1172
-        p.puLUMA_8x16.luma_hps  = PFX(interp_8tap_horiz_ps_8x16_neon);
1173
-        p.puLUMA_8x32.luma_hps  = PFX(interp_8tap_horiz_ps_8x32_neon);
1174
-        p.puLUMA_12x16.luma_hps = PFX(interp_8tap_horiz_ps_12x16_neon);
1175
-        p.puLUMA_24x32.luma_hps = PFX(interp_8tap_horiz_ps_24x32_neon);
1176
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */
1177
-        p.puLUMA_16x4.luma_hps  = PFX(interp_8tap_horiz_ps_16x4_neon);
1178
-        p.puLUMA_16x8.luma_hps  = PFX(interp_8tap_horiz_ps_16x8_neon);
1179
-        p.puLUMA_16x12.luma_hps = PFX(interp_8tap_horiz_ps_16x12_neon);
1180
-        p.puLUMA_16x16.luma_hps = PFX(interp_8tap_horiz_ps_16x16_neon);
1181
-        p.puLUMA_16x32.luma_hps = PFX(interp_8tap_horiz_ps_16x32_neon);
1182
-        p.puLUMA_16x64.luma_hps = PFX(interp_8tap_horiz_ps_16x64_neon);
1183
-        p.puLUMA_32x8.luma_hps  = PFX(interp_8tap_horiz_ps_32x8_neon);
1184
-        p.puLUMA_32x16.luma_hps = PFX(interp_8tap_horiz_ps_32x16_neon);
1185
-        p.puLUMA_32x24.luma_hps = PFX(interp_8tap_horiz_ps_32x24_neon);
1186
-        p.puLUMA_32x32.luma_hps = PFX(interp_8tap_horiz_ps_32x32_neon);
1187
-        p.puLUMA_32x64.luma_hps = PFX(interp_8tap_horiz_ps_32x64_neon);
1188
-        p.puLUMA_48x64.luma_hps = PFX(interp_8tap_horiz_ps_48x64_neon);
1189
-        p.puLUMA_64x16.luma_hps = PFX(interp_8tap_horiz_ps_64x16_neon);
1190
-        p.puLUMA_64x32.luma_hps = PFX(interp_8tap_horiz_ps_64x32_neon);
1191
-        p.puLUMA_64x48.luma_hps = PFX(interp_8tap_horiz_ps_64x48_neon);
1192
-        p.puLUMA_64x64.luma_hps = PFX(interp_8tap_horiz_ps_64x64_neon);
1193
-#endif
1194
-
1195
-        p.puLUMA_8x4.luma_hvpp   =  interp_8tap_hv_pp_cpu<LUMA_8x4>;
1196
-        p.puLUMA_8x8.luma_hvpp   =  interp_8tap_hv_pp_cpu<LUMA_8x8>;
1197
-        p.puLUMA_8x16.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_8x16>;
1198
-        p.puLUMA_8x32.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_8x32>;
1199
-        p.puLUMA_12x16.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_12x16>;
1200
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */
1201
-        p.puLUMA_16x4.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_16x4>;
1202
-        p.puLUMA_16x8.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_16x8>;
1203
-        p.puLUMA_16x12.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_16x12>;
1204
-        p.puLUMA_16x16.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_16x16>;
1205
-        p.puLUMA_16x32.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_16x32>;
1206
-        p.puLUMA_16x64.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_16x64>;
1207
-        p.puLUMA_32x16.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_32x16>;
1208
-        p.puLUMA_32x24.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_32x24>;
1209
-        p.puLUMA_32x32.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_32x32>;
1210
-        p.puLUMA_32x64.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_32x64>;
1211
-        p.puLUMA_48x64.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_48x64>;
1212
-        p.puLUMA_64x16.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_64x16>;
1213
-        p.puLUMA_64x32.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_64x32>;
1214
-        p.puLUMA_64x48.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_64x48>;
1215
-        p.puLUMA_64x64.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_64x64>;
1216
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_4_9_0 /* gcc_version < gcc-4.9.0 */
1217
-        p.puLUMA_4x4.luma_hvpp   =  interp_8tap_hv_pp_cpu<LUMA_4x4>;
1218
-        p.puLUMA_4x8.luma_hvpp   =  interp_8tap_hv_pp_cpu<LUMA_4x8>;
1219
-        p.puLUMA_4x16.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_4x16>;
1220
-        p.puLUMA_24x32.luma_hvpp =  interp_8tap_hv_pp_cpu<LUMA_24x32>;
1221
-        p.puLUMA_32x8.luma_hvpp  =  interp_8tap_hv_pp_cpu<LUMA_32x8>;
1222
+    // When these primitives are implemented with the SVE/SVE2 instruction set,
1223
+    // change the following definitions to point to the SVE/SVE2 implementations.
1224
+    setupPixelPrimitives_neon(p);
1225
+    setupFilterPrimitives_neon(p);
1226
+    setupDCTPrimitives_neon(p);
1227
+    setupLoopFilterPrimitives_neon(p);
1228
+    setupIntraPrimitives_neon(p);
1229
+
1230
+    CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sNONALIGNED);
1231
+    CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1232
+    CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1233
+    CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1234
+    CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1235
+    CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1236
+    LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED);
1237
+    LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED);
1238
+    CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sALIGNED);
1239
+    CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1240
+    CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1241
+    CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1242
+    CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1243
+    CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1244
+    LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED);
1245
+    LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED);
1246
+
1247
+#if !HIGH_BIT_DEPTH
1248
+    ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, neon);
1249
+    ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, neon);
1250
+    ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, neon);
1251
+    ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon);
1252
+    ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon);
1253
+    ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, neon);
1254
+    ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu);
1255
+    ALL_CHROMA_420_VERT_FILTERS(neon);
1256
+    CHROMA_422_VERT_FILTERS_NEON();
1257
+    CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(neon);
1258
+    ALL_CHROMA_444_VERT_FILTERS(neon);
1259
+    ALL_CHROMA_420_FILTERS(neon);
1260
+    ALL_CHROMA_422_FILTERS(neon);
1261
+    ALL_CHROMA_444_FILTERS(neon);
1262
+
1263
+
1264
+    // Blockcopy_pp
1265
+    LUMA_PU_NEON_1(copy_pp, blockcopy_pp);
1266
+    LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1267
+    CHROMA_420_PU_NEON_1(copy_pp, blockcopy_pp);
1268
+    CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1269
+    CHROMA_422_PU_NEON_1(copy_pp, blockcopy_pp);
1270
+    CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1271
+    p.cuBLOCK_4x4.copy_pp   = PFX(blockcopy_pp_4x4_neon);
1272
+    p.cuBLOCK_8x8.copy_pp   = PFX(blockcopy_pp_8x8_neon);
1273
+    p.cuBLOCK_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
1274
+    p.cuBLOCK_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve);
1275
+    p.cuBLOCK_64x64.copy_pp = PFX(blockcopy_pp_64x64_sve);
1276
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon);
1277
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon);
1278
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
1279
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve);
1280
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon);
1281
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon);
1282
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon);
1283
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_sve);
1284
+
1285
+#endif // !HIGH_BIT_DEPTH
1286
+
1287
+    // Blockcopy_ss
1288
+    p.cuBLOCK_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
1289
+    p.cuBLOCK_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
1290
+    p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve);
1291
+    p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve);
1292
+    p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_sve);
1293
+
1294
+    // Blockcopy_ps
1295
+    p.cuBLOCK_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
1296
+    p.cuBLOCK_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
1297
+    p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve);
1298
+    p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve);
1299
+    p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_sve);
1300
+
1301
+    // Blockcopy_sp
1302
+    p.cuBLOCK_4x4.copy_sp   = PFX(blockcopy_sp_4x4_sve);
1303
+    p.cuBLOCK_8x8.copy_sp   = PFX(blockcopy_sp_8x8_sve);
1304
+    p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve);
1305
+    p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve);
1306
+    p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon);
1307
+
1308
+    // chroma blockcopy_ss
1309
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
1310
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
1311
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve);
1312
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve);
1313
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss   = PFX(blockcopy_ss_4x8_neon);
1314
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss  = PFX(blockcopy_ss_8x16_neon);
1315
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_sve);
1316
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_sve);
1317
+
1318
+    // chroma blockcopy_ps
1319
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
1320
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
1321
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve);
1322
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve);
1323
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps   = PFX(blockcopy_ps_4x8_sve);
1324
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps  = PFX(blockcopy_ps_8x16_sve);
1325
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_sve);
1326
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_sve);
1327
+
1328
+    // chroma blockcopy_sp
1329
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp   = PFX(blockcopy_sp_4x4_sve);
1330
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp   = PFX(blockcopy_sp_8x8_sve);
1331
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve);
1332
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve);
1333
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp   = PFX(blockcopy_sp_4x8_sve);
1334
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp  = PFX(blockcopy_sp_8x16_sve);
1335
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_sve);
1336
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_sve);
1337
+
1338
+    // Block_fill
1339
+    LUMA_TU_NEON(blockfill_sALIGNED, blockfill_s);
1340
+    LUMA_TU_CAN_USE_SVE(blockfill_sALIGNED, blockfill_s);
1341
+    LUMA_TU_NEON(blockfill_sNONALIGNED, blockfill_s);
1342
+    LUMA_TU_CAN_USE_SVE(blockfill_sNONALIGNED, blockfill_s);
1343
+
1344
+    // copy_count
1345
+    p.cuBLOCK_4x4.copy_cnt     = PFX(copy_cnt_4_neon);
1346
+    p.cuBLOCK_8x8.copy_cnt     = PFX(copy_cnt_8_neon);
1347
+    p.cuBLOCK_16x16.copy_cnt   = PFX(copy_cnt_16_neon);
1348
+    p.cuBLOCK_32x32.copy_cnt   = PFX(copy_cnt_32_neon);
1349
+
1350
+    // count nonzero
1351
+    p.cuBLOCK_4x4.count_nonzero     = PFX(count_nonzero_4_neon);
1352
+    p.cuBLOCK_8x8.count_nonzero     = PFX(count_nonzero_8_neon);
1353
+    p.cuBLOCK_16x16.count_nonzero   = PFX(count_nonzero_16_neon);
1354
+    p.cuBLOCK_32x32.count_nonzero   = PFX(count_nonzero_32_neon);
1355
+
1356
+    // cpy2Dto1D_shl
1357
+    p.cuBLOCK_4x4.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_4x4_neon);
1358
+    p.cuBLOCK_8x8.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_8x8_neon);
1359
+    p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_sve);
1360
+    p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_sve);
1361
+    p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_sve);
1362
+
1363
+    // cpy2Dto1D_shr
1364
+    p.cuBLOCK_4x4.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_4x4_neon);
1365
+    p.cuBLOCK_8x8.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_8x8_neon);
1366
+    p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_sve);
1367
+    p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_sve);
1368
+
1369
+    // cpy1Dto2D_shl
1370
+    p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_4x4_neon);
1371
+    p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_8x8_neon);
1372
+    p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_16x16_sve);
1373
+    p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_32x32_sve);
1374
+    p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_64x64_sve);
1375
+
1376
+    p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_4x4_neon);
1377
+    p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_8x8_neon);
1378
+    p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_sve);
1379
+    p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_sve);
1380
+    p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_sve);
1381
+
1382
+    // cpy1Dto2D_shr
1383
+    p.cuBLOCK_4x4.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_4x4_neon);
1384
+    p.cuBLOCK_8x8.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_8x8_neon);
1385
+    p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_sve);
1386
+    p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_sve);
1387
+    p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_sve);
1388
+
1389
+#if !HIGH_BIT_DEPTH
1390
+    // pixel_avg_pp
1391
+    ALL_LUMA_PU(pixelavg_ppNONALIGNED, pixel_avg_pp, neon);
1392
+    ALL_LUMA_PU(pixelavg_ppALIGNED, pixel_avg_pp, neon);
1393
+
1394
+    // addAvg
1395
+    ALL_LUMA_PU(addAvgNONALIGNED, addAvg, neon);
1396
+    ALL_LUMA_PU(addAvgALIGNED, addAvg, neon);
1397
+    ALL_CHROMA_420_PU(addAvgNONALIGNED, addAvg, neon);
1398
+    ALL_CHROMA_422_PU(addAvgNONALIGNED, addAvg, neon);
1399
+    ALL_CHROMA_420_PU(addAvgALIGNED, addAvg, neon);
1400
+    ALL_CHROMA_422_PU(addAvgALIGNED, addAvg, neon);
1401
+
1402
+    // sad
1403
+    ALL_LUMA_PU(sad, pixel_sad, neon);
1404
+    ALL_LUMA_PU(sad_x3, sad_x3, neon);
1405
+    ALL_LUMA_PU(sad_x4, sad_x4, neon);
1406
+
1407
+    // sse_pp
1408
+    p.cuBLOCK_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_sve);
1409
+    p.cuBLOCK_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
1410
+    p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
1411
+    p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon);
1412
+    p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_neon);
1413
+
1414
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_sve);
1415
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
1416
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
1417
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon);
1418
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp   = PFX(pixel_sse_pp_4x8_sve);
1419
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp  = PFX(pixel_sse_pp_8x16_neon);
1420
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon);
1421
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_neon);
1422
+
1423
+    // sse_ss
1424
+    p.cuBLOCK_4x4.sse_ss   = PFX(pixel_sse_ss_4x4_neon);
1425
+    p.cuBLOCK_8x8.sse_ss   = PFX(pixel_sse_ss_8x8_neon);
1426
+    p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_neon);
1427
+    p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_neon);
1428
+    p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_neon);
1429
+
1430
+    // ssd_s
1431
+    p.cuBLOCK_4x4.ssd_sNONALIGNED   = PFX(pixel_ssd_s_4x4_neon);
1432
+    p.cuBLOCK_8x8.ssd_sNONALIGNED   = PFX(pixel_ssd_s_8x8_neon);
1433
+    p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_neon);
1434
+    p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_neon);
1435
+
1436
+    p.cuBLOCK_4x4.ssd_sALIGNED   = PFX(pixel_ssd_s_4x4_neon);
1437
+    p.cuBLOCK_8x8.ssd_sALIGNED   = PFX(pixel_ssd_s_8x8_neon);
1438
+    p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_neon);
1439
+    p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_neon);
1440
+
1441
+    // pixel_var
1442
+    p.cuBLOCK_8x8.var   = PFX(pixel_var_8x8_neon);
1443
+    p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_neon);
1444
+    p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_neon);
1445
+    p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_neon);
1446
+
1447
+    // calc_Residual
1448
+    p.cuBLOCK_4x4.calcresidualNONALIGNED   = PFX(getResidual4_neon);
1449
+    p.cuBLOCK_8x8.calcresidualNONALIGNED   = PFX(getResidual8_neon);
1450
+    p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_neon);
1451
+    p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_neon);
1452
+
1453
+    p.cuBLOCK_4x4.calcresidualALIGNED   = PFX(getResidual4_neon);
1454
+    p.cuBLOCK_8x8.calcresidualALIGNED   = PFX(getResidual8_neon);
1455
+    p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_neon);
1456
+    p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_neon);
1457
+
1458
+    // pixel_sub_ps
1459
+    p.cuBLOCK_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
1460
+    p.cuBLOCK_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
1461
+    p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
1462
+    p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon);
1463
+    p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_neon);
1464
+
1465
+    // chroma sub_ps
1466
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
1467
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
1468
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
1469
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon);
1470
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps   = PFX(pixel_sub_ps_4x8_neon);
1471
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps  = PFX(pixel_sub_ps_8x16_sve);
1472
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon);
1473
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_neon);
1474
+
1475
+    // pixel_add_ps
1476
+    p.cuBLOCK_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_neon);
1477
+    p.cuBLOCK_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_neon);
1478
+    p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon);
1479
+    p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon);
1480
+    p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_neon);
1481
+
1482
+    p.cuBLOCK_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_neon);
1483
+    p.cuBLOCK_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_neon);
1484
+    p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon);
1485
+    p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon);
1486
+    p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_neon);
1487
+
1488
+    // chroma add_ps
1489
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_neon);
1490
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_neon);
1491
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon);
1492
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon);
1493
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED   = PFX(pixel_add_ps_4x8_neon);
1494
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED  = PFX(pixel_add_ps_8x16_neon);
1495
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_neon);
1496
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_neon);
1497
+
1498
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_neon);
1499
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_neon);
1500
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon);
1501
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon);
1502
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED   = PFX(pixel_add_ps_4x8_neon);
1503
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED  = PFX(pixel_add_ps_8x16_neon);
1504
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_neon);
1505
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_neon);
1506
+
1507
+    //scale2D_64to32
1508
+    p.scale2D_64to32  = PFX(scale2D_64to32_neon);
1509
+
1510
+    // scale1D_128to64
1511
+    p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_neon);
1512
+    p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_neon);
1513
+
1514
+    // planecopy
1515
+    p.planecopy_cp = PFX(pixel_planecopy_cp_neon);
1516
+
1517
+    // satd
1518
+    p.puLUMA_4x4.satd   = PFX(pixel_satd_4x4_sve);
1519
+    p.puLUMA_8x8.satd   = PFX(pixel_satd_8x8_neon);
1520
+    p.puLUMA_16x16.satd = PFX(pixel_satd_16x16_neon);
1521
+    p.puLUMA_32x32.satd = PFX(pixel_satd_32x32_sve);
1522
+    p.puLUMA_64x64.satd = PFX(pixel_satd_64x64_neon);
1523
+    p.puLUMA_8x4.satd   = PFX(pixel_satd_8x4_sve);
1524
+    p.puLUMA_4x8.satd   = PFX(pixel_satd_4x8_neon);
1525
+    p.puLUMA_16x8.satd  = PFX(pixel_satd_16x8_neon);
1526
+    p.puLUMA_8x16.satd  = PFX(pixel_satd_8x16_neon);
1527
+    p.puLUMA_16x32.satd = PFX(pixel_satd_16x32_neon);
1528
+    p.puLUMA_32x16.satd = PFX(pixel_satd_32x16_sve);
1529
+    p.puLUMA_64x32.satd = PFX(pixel_satd_64x32_neon);
1530
+    p.puLUMA_32x64.satd = PFX(pixel_satd_32x64_neon);
1531
+    p.puLUMA_16x12.satd = PFX(pixel_satd_16x12_neon);
1532
+    p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon);
1533
+    p.puLUMA_16x4.satd  = PFX(pixel_satd_16x4_neon);
1534
+    p.puLUMA_4x16.satd  = PFX(pixel_satd_4x16_neon);
1535
+    p.puLUMA_32x24.satd = PFX(pixel_satd_32x24_neon);
1536
+    p.puLUMA_24x32.satd = PFX(pixel_satd_24x32_neon);
1537
+    p.puLUMA_32x8.satd  = PFX(pixel_satd_32x8_neon);
1538
+    p.puLUMA_8x32.satd  = PFX(pixel_satd_8x32_neon);
1539
+    p.puLUMA_64x48.satd = PFX(pixel_satd_64x48_sve);
1540
+    p.puLUMA_48x64.satd = PFX(pixel_satd_48x64_neon);
1541
+    p.puLUMA_64x16.satd = PFX(pixel_satd_64x16_neon);
1542
+    p.puLUMA_16x64.satd = PFX(pixel_satd_16x64_neon);
1543
+
1544
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd   = PFX(pixel_satd_4x4_sve);
1545
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd   = PFX(pixel_satd_8x8_neon);
1546
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = PFX(pixel_satd_16x16_neon);
1547
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = PFX(pixel_satd_32x32_neon);
1548
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd   = PFX(pixel_satd_8x4_sve);
1549
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd   = PFX(pixel_satd_4x8_neon);
1550
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd  = PFX(pixel_satd_16x8_neon);
1551
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd  = PFX(pixel_satd_8x16_neon);
1552
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = PFX(pixel_satd_32x16_neon);
1553
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = PFX(pixel_satd_16x32_neon);
1554
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = PFX(pixel_satd_16x12_neon);
1555
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon);
1556
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd  = PFX(pixel_satd_16x4_neon);
1557
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd  = PFX(pixel_satd_4x16_neon);
1558
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = PFX(pixel_satd_32x24_neon);
1559
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = PFX(pixel_satd_24x32_neon);
1560
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd  = PFX(pixel_satd_32x8_neon);
1561
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd  = PFX(pixel_satd_8x32_neon);
1562
+
1563
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd   = PFX(pixel_satd_4x8_neon);
1564
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd  = PFX(pixel_satd_8x16_neon);
1565
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = PFX(pixel_satd_16x32_neon);
1566
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = PFX(pixel_satd_32x64_neon);
1567
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd   = PFX(pixel_satd_4x4_sve);
1568
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd   = PFX(pixel_satd_8x8_neon);
1569
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd  = PFX(pixel_satd_4x16_neon);
1570
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = PFX(pixel_satd_16x16_neon);
1571
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd  = PFX(pixel_satd_8x32_neon);
1572
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = PFX(pixel_satd_32x32_neon);
1573
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = PFX(pixel_satd_16x64_neon);
1574
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd  = PFX(pixel_satd_8x12_sve);
1575
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd   = PFX(pixel_satd_8x4_sve);
1576
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = PFX(pixel_satd_16x24_neon);
1577
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon);
1578
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd  = PFX(pixel_satd_16x8_neon);
1579
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd  = PFX(pixel_satd_4x32_neon);
1580
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = PFX(pixel_satd_32x48_neon);
1581
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = PFX(pixel_satd_24x64_neon);
1582
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = PFX(pixel_satd_32x16_neon);
1583
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd  = PFX(pixel_satd_8x64_neon);
1584
+
1585
+    // sa8d
1586
+    p.cuBLOCK_4x4.sa8d   = PFX(pixel_satd_4x4_sve);
1587
+    p.cuBLOCK_8x8.sa8d   = PFX(pixel_sa8d_8x8_neon);
1588
+    p.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon);
1589
+    p.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon);
1590
+    p.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon);
1591
+    p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = PFX(pixel_satd_4x4_sve);
1592
+    p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon);
1593
+    p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon);
1594
+    p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon);
1595
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sa8d = PFX(pixel_sa8d_8x16_neon);
1596
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sa8d = PFX(pixel_sa8d_16x32_neon);
1597
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sa8d = PFX(pixel_sa8d_32x64_neon);
1598
+
1599
+    // dequant_scaling
1600
+    p.dequant_scaling = PFX(dequant_scaling_neon);
1601
+    p.dequant_normal  = PFX(dequant_normal_neon);
1602
+
1603
+    // ssim_4x4x2_core
1604
+    p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_neon);
1605
+
1606
+    // ssimDist
1607
+    p.cuBLOCK_4x4.ssimDist = PFX(ssimDist4_neon);
1608
+    p.cuBLOCK_8x8.ssimDist = PFX(ssimDist8_neon);
1609
+    p.cuBLOCK_16x16.ssimDist = PFX(ssimDist16_neon);
1610
+    p.cuBLOCK_32x32.ssimDist = PFX(ssimDist32_neon);
1611
+    p.cuBLOCK_64x64.ssimDist = PFX(ssimDist64_neon);
1612
+
1613
+    // normFact
1614
+    p.cuBLOCK_8x8.normFact = PFX(normFact8_neon);
1615
+    p.cuBLOCK_16x16.normFact = PFX(normFact16_neon);
1616
+    p.cuBLOCK_32x32.normFact = PFX(normFact32_neon);
1617
+    p.cuBLOCK_64x64.normFact = PFX(normFact64_neon);
1618
+
1619
+    // psy_cost_pp
1620
+    p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon);
1621
+
1622
+    p.weight_pp = PFX(weight_pp_neon);
1623
+#if !defined(__APPLE__)
1624
+    p.scanPosLast = PFX(scanPosLast_neon);
1625
+#endif
1626
+    p.costCoeffNxN = PFX(costCoeffNxN_neon);
1627
 #endif
1628
+
1629
+    // quant
1630
+    p.quant = PFX(quant_sve);
1631
+    p.nquant = PFX(nquant_neon);
1632
+}
1633
 #endif
1634
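setupSvePrimitives() above still registers mostly NEON kernels, plus a handful of _sve routines such as the blockcopy, satd and quant variants, until dedicated SVE implementations exist. The patch moves the per-size registrations out of the old setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) body and into these per-ISA setup functions, so a runtime dispatcher over them would plausibly look like the sketch below. Only setupSvePrimitives, setupSve2Primitives and X265_CPU_NEON appear in the diff itself; the SVE/SVE2 flag names and bit values and the NEON setup function name are assumptions.

// Hypothetical cpuMask dispatch over the per-ISA setup routines introduced by
// this patch; flag values and the NEON setup name are assumed for the sketch.
struct EncoderPrimitives;                       // opaque here; defined by x265

void setupNeonPrimitives(EncoderPrimitives &p); // assumed name of the NEON setup
void setupSvePrimitives(EncoderPrimitives &p);  // added by this patch
void setupSve2Primitives(EncoderPrimitives &p); // added by this patch

enum CpuFlagsSketch
{
    X265_CPU_NEON = 1 << 0,                     // values assumed for the sketch
    X265_CPU_SVE  = 1 << 1,
    X265_CPU_SVE2 = 1 << 2
};

void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask)
{
#if defined(HAVE_SVE2)
    if (cpuMask & X265_CPU_SVE2) { setupSve2Primitives(p); return; }
#endif
#if defined(HAVE_SVE2) || defined(HAVE_SVE)
    if (cpuMask & X265_CPU_SVE)  { setupSvePrimitives(p);  return; }
#endif
    if (cpuMask & X265_CPU_NEON)
        setupNeonPrimitives(p);                 // NEON baseline for AArch64
}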
 
1635
+#if defined(HAVE_SVE2)
1636
+void setupSve2Primitives(EncoderPrimitives &p)
1637
+{
1638
+    // When these primitives are implemented with the SVE/SVE2 instruction set,
1639
+    // change the following definitions to point to the SVE/SVE2 implementations.
1640
+    setupPixelPrimitives_neon(p);
1641
+    setupFilterPrimitives_neon(p);
1642
+    setupDCTPrimitives_neon(p);
1643
+    setupLoopFilterPrimitives_neon(p);
1644
+    setupIntraPrimitives_neon(p);
1645
+
1646
+    CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sNONALIGNED);
1647
+    CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1648
+    CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1649
+    CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1650
+    CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1651
+    CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1652
+    LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED);
1653
+    LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED);
1654
+    CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sALIGNED);
1655
+    CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED);
1656
+    CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1657
+    CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1658
+    CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1659
+    CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED);
1660
+    LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED);
1661
+    LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED);
1662
+
1663
 #if !HIGH_BIT_DEPTH
1664
-        p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon);
1665
+    LUMA_PU_MULTIPLE_ARCHS_1(luma_vpp, interp_8tap_vert_pp, neon);
1666
+    LUMA_PU_MULTIPLE_ARCHS_2(luma_vpp, interp_8tap_vert_pp, sve2);
1667
+    LUMA_PU_MULTIPLE_ARCHS_1(luma_vsp, interp_8tap_vert_sp, sve2);
1668
+    LUMA_PU_MULTIPLE_ARCHS_2(luma_vsp, interp_8tap_vert_sp, neon);
1669
+    ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sve2);
1670
+    ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon);
1671
+    ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon);
1672
+    ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, sve2);
1673
+    ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu);
1674
+    CHROMA_420_VERT_FILTERS_NEON();
1675
+    CHROMA_420_VERT_FILTERS_CAN_USE_SVE2();
1676
+    CHROMA_422_VERT_FILTERS_NEON();
1677
+    CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(sve2);
1678
+    CHROMA_444_VERT_FILTERS_NEON();
1679
+    CHROMA_444_VERT_FILTERS_CAN_USE_SVE2();
1680
+    CHROMA_420_FILTERS_NEON();
1681
+    CHROMA_420_FILTERS_CAN_USE_SVE2();
1682
+    CHROMA_422_FILTERS_NEON();
1683
+    CHROMA_422_FILTERS_CAN_USE_SVE2();
1684
+    CHROMA_444_FILTERS_NEON();
1685
+    CHROMA_444_FILTERS_CAN_USE_SVE2();
1686
+
1687
+    // Blockcopy_pp
1688
+    LUMA_PU_NEON_1(copy_pp, blockcopy_pp);
1689
+    LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1690
+    CHROMA_420_PU_NEON_1(copy_pp, blockcopy_pp);
1691
+    CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1692
+    CHROMA_422_PU_NEON_1(copy_pp, blockcopy_pp);
1693
+    CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp);
1694
+    p.cuBLOCK_4x4.copy_pp   = PFX(blockcopy_pp_4x4_neon);
1695
+    p.cuBLOCK_8x8.copy_pp   = PFX(blockcopy_pp_8x8_neon);
1696
+    p.cuBLOCK_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
1697
+    p.cuBLOCK_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve);
1698
+    p.cuBLOCK_64x64.copy_pp = PFX(blockcopy_pp_64x64_sve);
1699
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon);
1700
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon);
1701
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon);
1702
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve);
1703
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon);
1704
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon);
1705
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon);
1706
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_sve);
1707
+
1708
 #endif // !HIGH_BIT_DEPTH
1709
 
1710
+    // Blockcopy_ss
1711
+    p.cuBLOCK_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
1712
+    p.cuBLOCK_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
1713
+    p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve);
1714
+    p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve);
1715
+    p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_sve);
1716
+
1717
+    // Blockcopy_ps
1718
+    p.cuBLOCK_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
1719
+    p.cuBLOCK_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
1720
+    p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve);
1721
+    p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve);
1722
+    p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_sve);
1723
+
1724
+    // Blockcopy_sp
1725
+    p.cuBLOCK_4x4.copy_sp   = PFX(blockcopy_sp_4x4_sve);
1726
+    p.cuBLOCK_8x8.copy_sp   = PFX(blockcopy_sp_8x8_sve);
1727
+    p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve);
1728
+    p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve);
1729
+    p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon);
1730
+
1731
+    // chroma blockcopy_ss
1732
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss   = PFX(blockcopy_ss_4x4_neon);
1733
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss   = PFX(blockcopy_ss_8x8_neon);
1734
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve);
1735
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve);
1736
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss   = PFX(blockcopy_ss_4x8_neon);
1737
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss  = PFX(blockcopy_ss_8x16_neon);
1738
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_sve);
1739
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_sve);
1740
+
1741
+    // chroma blockcopy_ps
1742
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps   = PFX(blockcopy_ps_4x4_neon);
1743
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps   = PFX(blockcopy_ps_8x8_neon);
1744
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve);
1745
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve);
1746
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps   = PFX(blockcopy_ps_4x8_sve);
1747
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps  = PFX(blockcopy_ps_8x16_sve);
1748
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_sve);
1749
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_sve);
1750
+
1751
+    // chroma blockcopy_sp
1752
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp   = PFX(blockcopy_sp_4x4_sve);
1753
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp   = PFX(blockcopy_sp_8x8_sve);
1754
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve);
1755
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve);
1756
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp   = PFX(blockcopy_sp_4x8_sve);
1757
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp  = PFX(blockcopy_sp_8x16_sve);
1758
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_sve);
1759
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_sve);
1760
+
1761
+    // Block_fill
1762
+    LUMA_TU_NEON(blockfill_sALIGNED, blockfill_s);
1763
+    LUMA_TU_CAN_USE_SVE(blockfill_sALIGNED, blockfill_s);
1764
+    LUMA_TU_NEON(blockfill_sNONALIGNED, blockfill_s);
1765
+    LUMA_TU_CAN_USE_SVE(blockfill_sNONALIGNED, blockfill_s);
1766
+
1767
+    // copy_count
1768
+    p.cuBLOCK_4x4.copy_cnt     = PFX(copy_cnt_4_neon);
1769
+    p.cuBLOCK_8x8.copy_cnt     = PFX(copy_cnt_8_neon);
1770
+    p.cuBLOCK_16x16.copy_cnt   = PFX(copy_cnt_16_neon);
1771
+    p.cuBLOCK_32x32.copy_cnt   = PFX(copy_cnt_32_neon);
1772
+
1773
+    // count nonzero
1774
+    p.cuBLOCK_4x4.count_nonzero     = PFX(count_nonzero_4_neon);
1775
+    p.cuBLOCK_8x8.count_nonzero     = PFX(count_nonzero_8_neon);
1776
+    p.cuBLOCK_16x16.count_nonzero   = PFX(count_nonzero_16_neon);
1777
+    p.cuBLOCK_32x32.count_nonzero   = PFX(count_nonzero_32_neon);
1778
+
1779
+    // cpy2Dto1D_shl
1780
+    p.cuBLOCK_4x4.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_4x4_neon);
1781
+    p.cuBLOCK_8x8.cpy2Dto1D_shl   = PFX(cpy2Dto1D_shl_8x8_neon);
1782
+    p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_sve);
1783
+    p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_sve);
1784
+    p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_sve);
1785
+
1786
+    // cpy2Dto1D_shr
1787
+    p.cuBLOCK_4x4.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_4x4_neon);
1788
+    p.cuBLOCK_8x8.cpy2Dto1D_shr   = PFX(cpy2Dto1D_shr_8x8_neon);
1789
+    p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_sve);
1790
+    p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_sve);
1791
+
1792
+    // cpy1Dto2D_shl
1793
+    p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_4x4_neon);
1794
+    p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED      = PFX(cpy1Dto2D_shl_8x8_neon);
1795
+    p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_16x16_sve);
1796
+    p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_32x32_sve);
1797
+    p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED    = PFX(cpy1Dto2D_shl_64x64_sve);
1798
+
1799
+    p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_4x4_neon);
1800
+    p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED   = PFX(cpy1Dto2D_shl_8x8_neon);
1801
+    p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_sve);
1802
+    p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_sve);
1803
+    p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_sve);
1804
+
1805
+    // cpy1Dto2D_shr
1806
+    p.cuBLOCK_4x4.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_4x4_neon);
1807
+    p.cuBLOCK_8x8.cpy1Dto2D_shr   = PFX(cpy1Dto2D_shr_8x8_neon);
1808
+    p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_sve);
1809
+    p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_sve);
1810
+    p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_sve);
1811
+
1812
+#if !HIGH_BIT_DEPTH
1813
+    // pixel_avg_pp
1814
+    LUMA_PU_NEON_2(pixelavg_ppNONALIGNED, pixel_avg_pp);
1815
+    LUMA_PU_MULTIPLE_ARCHS_3(pixelavg_ppNONALIGNED, pixel_avg_pp, sve2);
1816
+    LUMA_PU_NEON_2(pixelavg_ppALIGNED, pixel_avg_pp);
1817
+    LUMA_PU_MULTIPLE_ARCHS_3(pixelavg_ppALIGNED, pixel_avg_pp, sve2);
1818
+
1819
+    // addAvg
1820
+    LUMA_PU_NEON_3(addAvgNONALIGNED, addAvg);
1821
+    LUMA_PU_CAN_USE_SVE2(addAvgNONALIGNED, addAvg);
1822
+    LUMA_PU_NEON_3(addAvgALIGNED, addAvg);
1823
+    LUMA_PU_CAN_USE_SVE2(addAvgALIGNED, addAvg);
1824
+    CHROMA_420_PU_NEON_2(addAvgNONALIGNED, addAvg);
1825
+    CHROMA_420_PU_MULTIPLE_ARCHS(addAvgNONALIGNED, addAvg, sve2);
1826
+    CHROMA_420_PU_NEON_2(addAvgALIGNED, addAvg);
1827
+    CHROMA_420_PU_MULTIPLE_ARCHS(addAvgALIGNED, addAvg, sve2);
1828
+    CHROMA_422_PU_NEON_2(addAvgNONALIGNED, addAvg);
1829
+    CHROMA_422_PU_CAN_USE_SVE2(addAvgNONALIGNED, addAvg);
1830
+    CHROMA_422_PU_NEON_2(addAvgALIGNED, addAvg);
1831
+    CHROMA_422_PU_CAN_USE_SVE2(addAvgALIGNED, addAvg);
1832
+
1833
+    // sad
1834
+    ALL_LUMA_PU(sad, pixel_sad, sve2);
1835
+    ALL_LUMA_PU(sad_x3, sad_x3, sve2);
1836
+    ALL_LUMA_PU(sad_x4, sad_x4, sve2);
1837
+
1838
+    // sse_pp
1839
+    p.cuBLOCK_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_sve);
1840
+    p.cuBLOCK_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
1841
+    p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
1842
+    p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_sve2);
1843
+    p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_sve2);
1844
+
1845
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp   = PFX(pixel_sse_pp_4x4_sve);
1846
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp   = PFX(pixel_sse_pp_8x8_neon);
1847
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon);
1848
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_sve2);
1849
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp   = PFX(pixel_sse_pp_4x8_sve);
1850
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp  = PFX(pixel_sse_pp_8x16_neon);
1851
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon);
1852
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_sve2);
1853
+
1854
+    // sse_ss
1855
+    p.cuBLOCK_4x4.sse_ss   = PFX(pixel_sse_ss_4x4_sve2);
1856
+    p.cuBLOCK_8x8.sse_ss   = PFX(pixel_sse_ss_8x8_sve2);
1857
+    p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_sve2);
1858
+    p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_sve2);
1859
+    p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_sve2);
1860
+
1861
+    // ssd_s
1862
+    p.cuBLOCK_4x4.ssd_sNONALIGNED   = PFX(pixel_ssd_s_4x4_sve2);
1863
+    p.cuBLOCK_8x8.ssd_sNONALIGNED   = PFX(pixel_ssd_s_8x8_sve2);
1864
+    p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_sve2);
1865
+    p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_sve2);
1866
+
1867
+    p.cuBLOCK_4x4.ssd_sALIGNED   = PFX(pixel_ssd_s_4x4_sve2);
1868
+    p.cuBLOCK_8x8.ssd_sALIGNED   = PFX(pixel_ssd_s_8x8_sve2);
1869
+    p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_sve2);
1870
+    p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_sve2);
1871
+
1872
+    // pixel_var
1873
+    p.cuBLOCK_8x8.var   = PFX(pixel_var_8x8_sve2);
1874
+    p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_sve2);
1875
+    p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_sve2);
1876
+    p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_sve2);
1877
+
1878
+    // calc_Residual
1879
+    p.cuBLOCK_4x4.calcresidualNONALIGNED   = PFX(getResidual4_neon);
1880
+    p.cuBLOCK_8x8.calcresidualNONALIGNED   = PFX(getResidual8_neon);
1881
+    p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_sve2);
1882
+    p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_sve2);
1883
+
1884
+    p.cuBLOCK_4x4.calcresidualALIGNED   = PFX(getResidual4_neon);
1885
+    p.cuBLOCK_8x8.calcresidualALIGNED   = PFX(getResidual8_neon);
1886
+    p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_sve2);
1887
+    p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_sve2);
1888
+
1889
+    // pixel_sub_ps
1890
+    p.cuBLOCK_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
1891
+    p.cuBLOCK_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
1892
+    p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
1893
+    p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_sve2);
1894
+    p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_sve2);
1895
+
1896
+    // chroma sub_ps
1897
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps   = PFX(pixel_sub_ps_4x4_neon);
1898
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps   = PFX(pixel_sub_ps_8x8_neon);
1899
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon);
1900
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_sve2);
1901
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps   = PFX(pixel_sub_ps_4x8_neon);
1902
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps  = PFX(pixel_sub_ps_8x16_sve);
1903
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon);
1904
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_sve2);
1905
+
1906
+    // pixel_add_ps
1907
+    p.cuBLOCK_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_sve2);
1908
+    p.cuBLOCK_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_sve2);
1909
+    p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_sve2);
1910
+    p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_sve2);
1911
+    p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_sve2);
1912
+
1913
+    p.cuBLOCK_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_sve2);
1914
+    p.cuBLOCK_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_sve2);
1915
+    p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_sve2);
1916
+    p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_sve2);
1917
+    p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_sve2);
1918
+
1919
+    // chroma add_ps
1920
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED   = PFX(pixel_add_ps_4x4_sve2);
1921
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED   = PFX(pixel_add_ps_8x8_sve2);
1922
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_sve2);
1923
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_sve2);
1924
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED   = PFX(pixel_add_ps_4x8_sve2);
1925
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED  = PFX(pixel_add_ps_8x16_sve2);
1926
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_sve2);
1927
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_sve2);
1928
+
1929
+    p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED   = PFX(pixel_add_ps_4x4_sve2);
1930
+    p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED   = PFX(pixel_add_ps_8x8_sve2);
1931
+    p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_sve2);
1932
+    p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_sve2);
1933
+    p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED   = PFX(pixel_add_ps_4x8_sve2);
1934
+    p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED  = PFX(pixel_add_ps_8x16_sve2);
1935
+    p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_sve2);
1936
+    p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_sve2);
1937
+
1938
+    //scale2D_64to32
1939
+    p.scale2D_64to32  = PFX(scale2D_64to32_neon);
1940
+
1941
+    // scale1D_128to64
1942
+    p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_sve2);
1943
+    p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_sve2);
1944
+
1945
+    // planecopy
1946
+    p.planecopy_cp = PFX(pixel_planecopy_cp_neon);
1947
+
1948
+    // satd
1949
+    p.puLUMA_4x4.satd   = PFX(pixel_satd_4x4_sve);
1950
+    p.puLUMA_8x8.satd   = PFX(pixel_satd_8x8_neon);
1951
+    p.puLUMA_16x16.satd = PFX(pixel_satd_16x16_neon);
1952
+    p.puLUMA_32x32.satd = PFX(pixel_satd_32x32_sve);
1953
+    p.puLUMA_64x64.satd = PFX(pixel_satd_64x64_neon);
1954
+    p.puLUMA_8x4.satd   = PFX(pixel_satd_8x4_sve);
1955
+    p.puLUMA_4x8.satd   = PFX(pixel_satd_4x8_neon);
1956
+    p.puLUMA_16x8.satd  = PFX(pixel_satd_16x8_neon);
1957
+    p.puLUMA_8x16.satd  = PFX(pixel_satd_8x16_neon);
1958
+    p.puLUMA_16x32.satd = PFX(pixel_satd_16x32_neon);
1959
+    p.puLUMA_32x16.satd = PFX(pixel_satd_32x16_sve);
1960
+    p.puLUMA_64x32.satd = PFX(pixel_satd_64x32_neon);
1961
+    p.puLUMA_32x64.satd = PFX(pixel_satd_32x64_neon);
1962
+    p.puLUMA_16x12.satd = PFX(pixel_satd_16x12_neon);
1963
+    p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon);
1964
+    p.puLUMA_16x4.satd  = PFX(pixel_satd_16x4_neon);
1965
+    p.puLUMA_4x16.satd  = PFX(pixel_satd_4x16_neon);
1966
+    p.puLUMA_32x24.satd = PFX(pixel_satd_32x24_neon);
1967
+    p.puLUMA_24x32.satd = PFX(pixel_satd_24x32_neon);
1968
+    p.puLUMA_32x8.satd  = PFX(pixel_satd_32x8_neon);
1969
+    p.puLUMA_8x32.satd  = PFX(pixel_satd_8x32_neon);
1970
+    p.puLUMA_64x48.satd = PFX(pixel_satd_64x48_sve);
1971
+    p.puLUMA_48x64.satd = PFX(pixel_satd_48x64_neon);
1972
+    p.puLUMA_64x16.satd = PFX(pixel_satd_64x16_neon);
1973
+    p.puLUMA_16x64.satd = PFX(pixel_satd_16x64_neon);
1974
+
1975
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd   = PFX(pixel_satd_4x4_sve);
1976
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd   = PFX(pixel_satd_8x8_neon);
1977
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_neon);
1978
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_neon);
1979
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd   = PFX(pixel_satd_8x4_sve);
1980
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd   = PFX(pixel_satd_4x8_neon);
1981
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd  = PFX(pixel_satd_16x8_neon);
1982
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd  = PFX(pixel_satd_8x16_neon);
1983
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_neon);
1984
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = PFX(pixel_satd_16x32_neon);
1985
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = PFX(pixel_satd_16x12_neon);
1986
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = PFX(pixel_satd_12x16_neon);
1987
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd  = PFX(pixel_satd_16x4_neon);
1988
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd  = PFX(pixel_satd_4x16_neon);
1989
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = PFX(pixel_satd_32x24_neon);
1990
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = PFX(pixel_satd_24x32_neon);
1991
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd  = PFX(pixel_satd_32x8_neon);
1992
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd  = PFX(pixel_satd_8x32_neon);
1993
+
1994
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd   = PFX(pixel_satd_4x8_neon);
1995
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd  = PFX(pixel_satd_8x16_neon);
1996
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = PFX(pixel_satd_16x32_neon);
1997
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_neon);
1998
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd   = PFX(pixel_satd_4x4_sve);
1999
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd   = PFX(pixel_satd_8x8_neon);
2000
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd  = PFX(pixel_satd_4x16_neon);
2001
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_neon);
2002
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd  = PFX(pixel_satd_8x32_neon);
2003
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_neon);
2004
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = PFX(pixel_satd_16x64_neon);
2005
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd  = PFX(pixel_satd_8x12_sve);
2006
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd   = PFX(pixel_satd_8x4_sve);
2007
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = PFX(pixel_satd_16x24_neon);
2008
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = PFX(pixel_satd_12x32_neon);
2009
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd  = PFX(pixel_satd_16x8_neon);
2010
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd  = PFX(pixel_satd_4x32_neon);
2011
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = PFX(pixel_satd_32x48_neon);
2012
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = PFX(pixel_satd_24x64_neon);
2013
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_neon);
2014
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd  = PFX(pixel_satd_8x64_neon);
2015
+
2016
+    // sa8d
2017
+    p.cu[BLOCK_4x4].sa8d   = PFX(pixel_satd_4x4_sve);
2018
+    p.cu[BLOCK_8x8].sa8d   = PFX(pixel_sa8d_8x8_neon);
2019
+    p.cu[BLOCK_16x16].sa8d = PFX(pixel_sa8d_16x16_neon);
2020
+    p.cu[BLOCK_32x32].sa8d = PFX(pixel_sa8d_32x32_neon);
2021
+    p.cu[BLOCK_64x64].sa8d = PFX(pixel_sa8d_64x64_neon);
2022
+    p.chroma[X265_CSP_I420].cu[BLOCK_8x8].sa8d = PFX(pixel_satd_4x4_sve);
2023
+    p.chroma[X265_CSP_I420].cu[BLOCK_16x16].sa8d = PFX(pixel_sa8d_16x16_neon);
2024
+    p.chroma[X265_CSP_I420].cu[BLOCK_32x32].sa8d = PFX(pixel_sa8d_32x32_neon);
2025
+    p.chroma[X265_CSP_I420].cu[BLOCK_64x64].sa8d = PFX(pixel_sa8d_64x64_neon);
2026
+    p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = PFX(pixel_sa8d_8x16_neon);
2027
+    p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = PFX(pixel_sa8d_16x32_neon);
2028
+    p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = PFX(pixel_sa8d_32x64_neon);
2029
+
2030
+    // dequant_scaling
2031
+    p.dequant_scaling = PFX(dequant_scaling_sve2);
2032
+    p.dequant_normal  = PFX(dequant_normal_sve2);
2033
+
2034
+    // ssim_4x4x2_core
2035
+    p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_sve2);
2036
+
2037
+    // ssimDist
2038
+    p.cu[BLOCK_4x4].ssimDist = PFX(ssimDist4_sve2);
2039
+    p.cu[BLOCK_8x8].ssimDist = PFX(ssimDist8_sve2);
2040
+    p.cu[BLOCK_16x16].ssimDist = PFX(ssimDist16_sve2);
2041
+    p.cu[BLOCK_32x32].ssimDist = PFX(ssimDist32_sve2);
2042
+    p.cu[BLOCK_64x64].ssimDist = PFX(ssimDist64_sve2);
2043
+
2044
+    // normFact
2045
+    p.cu[BLOCK_8x8].normFact = PFX(normFact8_sve2);
2046
+    p.cu[BLOCK_16x16].normFact = PFX(normFact16_sve2);
2047
+    p.cu[BLOCK_32x32].normFact = PFX(normFact32_sve2);
2048
+    p.cu[BLOCK_64x64].normFact = PFX(normFact64_sve2);
2049
+
2050
+    // psy_cost_pp
2051
+    p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_4x4_neon);
2052
+
2053
+    p.weight_pp = PFX(weight_pp_neon);
2054
+#if !defined(__APPLE__)
2055
+    p.scanPosLast = PFX(scanPosLast_neon);
2056
+#endif
2057
+    p.costCoeffNxN = PFX(costCoeffNxN_neon);
2058
+#endif
2059
+
2060
+    // quant
2061
+    p.quant = PFX(quant_sve);
2062
+    p.nquant = PFX(nquant_neon);
2063
+}
2064
+#endif
2065
+
2066
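For reference, the satd bindings above compute the sum of absolute transformed differences: the residual between the two blocks is passed through a Hadamard transform (4x4 here, 8x8 for the sa8d bindings) and the absolute coefficients are summed. A scalar sketch of the 4x4 case, assuming 8-bit pixels and the conventional halving of the raw transform sum (both are illustration assumptions, not taken from this diff):

#include <cstdint>
#include <cstdlib>

static int satd_4x4_c(const uint8_t* pix1, intptr_t stride1,
                      const uint8_t* pix2, intptr_t stride2)
{
    int d[4][4];
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = pix1[y * stride1 + x] - pix2[y * stride2 + x];

    // 4-point Hadamard butterflies, first along rows, then along columns.
    for (int y = 0; y < 4; y++)
    {
        int s0 = d[y][0] + d[y][1], s1 = d[y][2] + d[y][3];
        int t0 = d[y][0] - d[y][1], t1 = d[y][2] - d[y][3];
        d[y][0] = s0 + s1; d[y][1] = s0 - s1;
        d[y][2] = t0 + t1; d[y][3] = t0 - t1;
    }
    int sum = 0;
    for (int x = 0; x < 4; x++)
    {
        int s0 = d[0][x] + d[1][x], s1 = d[2][x] + d[3][x];
        int t0 = d[0][x] - d[1][x], t1 = d[2][x] - d[3][x];
        sum += std::abs(s0 + s1) + std::abs(s0 - s1)
             + std::abs(t0 + t1) + std::abs(t0 - t1);
    }
    return sum >> 1;   // conventional halving of the raw transform sum
}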
+void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask)
2067
+{
2068
+
2069
+#ifdef HAVE_SVE2
2070
+    if (cpuMask & X265_CPU_SVE2)
2071
+    {
2072
+        setupSve2Primitives(p);
2073
     }
2074
+    else if (cpuMask & X265_CPU_SVE)
2075
+    {
2076
+        setupSvePrimitives(p);
2077
+    }
2078
+    else if (cpuMask & X265_CPU_NEON)
2079
+    {
2080
+        setupNeonPrimitives(p);
2081
+    }
2082
+
2083
+#elif defined(HAVE_SVE)
2084
+    if (cpuMask & X265_CPU_SVE)
2085
+    {
2086
+        setupSvePrimitives(p);
2087
+    }
2088
+    else if (cpuMask & X265_CPU_NEON)
2089
+    {
2090
+        setupNeonPrimitives(p);
2091
+    }
2092
+
2093
+#else
2094
+    if (cpuMask & X265_CPU_NEON)
2095
+    {
2096
+        setupNeonPrimitives(p);
2097
+    }
2098
+#endif
2099
+
2100
 }
2101
 } // namespace X265_NS
2102
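setupAssemblyPrimitives() above overwrites the portable primitive table in strict priority order: SVE2 bindings win over SVE, which win over plain NEON, and the HAVE_SVE2/HAVE_SVE guards keep the SVE paths out of builds whose assembler cannot emit them. A minimal sketch of the same cascade, using hypothetical flag values and stub setup functions rather than the real x265 symbols:

#include <cstdio>

// Hypothetical CPU feature bits, for illustration only.
enum { CPU_NEON = 1 << 0, CPU_SVE = 1 << 1, CPU_SVE2 = 1 << 2 };

struct Primitives { const char* satd_impl; };

static void setupNeon(Primitives& p) { p.satd_impl = "neon"; }
static void setupSve(Primitives& p)  { p.satd_impl = "sve";  }
static void setupSve2(Primitives& p) { p.satd_impl = "sve2"; }

// Mirrors the if / else-if chain above: the widest supported extension wins.
static void setupAsm(Primitives& p, int cpuMask)
{
    if (cpuMask & CPU_SVE2)
        setupSve2(p);
    else if (cpuMask & CPU_SVE)
        setupSve(p);
    else if (cpuMask & CPU_NEON)
        setupNeon(p);
}

int main()
{
    Primitives p{"c"};
    setupAsm(p, CPU_NEON | CPU_SVE);
    std::printf("selected: %s\n", p.satd_impl);  // prints "selected: sve"
    return 0;
}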
x265_3.6.tar.gz/source/common/aarch64/asm-sve.S Added
41
 
1
@@ -0,0 +1,39 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm.S"
26
+
27
+.arch armv8-a+sve
28
+
29
+.macro ABS2_SVE a b c
30
+    abs             \a, \c\()/m, \a
31
+    abs             \b, \c\()/m, \b
32
+.endm
33
+
34
+.macro ABS8_SVE z0, z1, z2, z3, z4, z5, z6, z7, p0
35
+    ABS2_SVE        \z0, \z1, p0
36
+    ABS2_SVE        \z2, \z3, p0
37
+    ABS2_SVE        \z4, \z5, p0
38
+    ABS2_SVE        \z6, \z7, p0
39
+.endm
40
+
41
x265_3.5.tar.gz/source/common/aarch64/asm.S -> x265_3.6.tar.gz/source/common/aarch64/asm.S Changed
173
 
1
@@ -1,7 +1,8 @@
2
 /*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
+ * Copyright (C) 2020-2021 MulticoreWare, Inc
5
  *
6
  * Authors: Hongbin Liu <liuhongbin1@huawei.com>
7
+ *          Sebastian Pop <spop@amazon.com>
8
  *
9
  * This program is free software; you can redistribute it and/or modify
10
  * it under the terms of the GNU General Public License as published by
11
@@ -21,34 +22,74 @@
12
  * For more information, contact us at license @ x265.com.
13
  *****************************************************************************/
14
 
15
+#ifndef ASM_S_  // #include guards
16
+#define ASM_S_
17
+
18
 .arch           armv8-a
19
 
20
+#define PFX3(prefix, name) prefix ## _ ## name
21
+#define PFX2(prefix, name) PFX3(prefix, name)
22
+#define PFX(name)          PFX2(X265_NS, name)
23
+
24
+#ifdef __APPLE__
25
+#define PREFIX 1
26
+#endif
27
+
28
 #ifdef PREFIX
29
 #define EXTERN_ASM _
30
+#define HAVE_AS_FUNC 0
31
+#elif defined __clang__
32
+#define EXTERN_ASM
33
+#define HAVE_AS_FUNC 0
34
+#define PREFIX 1
35
 #else
36
 #define EXTERN_ASM
37
+#define HAVE_AS_FUNC 1
38
 #endif
39
 
40
 #ifdef __ELF__
41
 #define ELF
42
 #else
43
+#ifdef PREFIX
44
+#define ELF #
45
+#else
46
 #define ELF @
47
 #endif
48
-
49
-#define HAVE_AS_FUNC 1
50
+#endif
51
 
52
 #if HAVE_AS_FUNC
53
 #define FUNC
54
 #else
55
+#ifdef PREFIX
56
+#define FUNC #
57
+#else
58
 #define FUNC @
59
 #endif
60
+#endif
61
+
62
+#define GLUE(a, b) a ## b
63
+#define JOIN(a, b) GLUE(a, b)
64
+
65
+#define PFX_C(name)        JOIN(JOIN(JOIN(EXTERN_ASM, X265_NS), _), name)
66
+
67
+#ifdef __APPLE__
68
+.macro endfunc
69
+ELF .size \name, . - \name
70
+FUNC .endfunc
71
+.endm
72
+#endif
73
 
74
 .macro function name, export=1
75
+#ifdef __APPLE__
76
+    .global \name
77
+    endfunc
78
+#else
79
     .macro endfunc
80
 ELF     .size   \name, . - \name
81
 FUNC    .endfunc
82
         .purgem endfunc
83
     .endm
84
+#endif
85
         .align  2
86
 .if \export == 1
87
         .global EXTERN_ASM\name
88
@@ -64,6 +105,83 @@
89
 .endif
90
 .endm
91
 
92
+.macro  const   name, align=2
93
+    .macro endconst
94
+ELF     .size   \name, . - \name
95
+        .purgem endconst
96
+    .endm
97
+#ifdef __MACH__
98
+    .const_data
99
+#else
100
+    .section .rodata
101
+#endif
102
+    .align          \align
103
+\name:
104
+.endm
105
+
106
+.macro  movrel rd, val, offset=0
107
+#if defined(__APPLE__)
108
+  .if \offset < 0
109
+        adrp            \rd, \val@PAGE
110
+        add             \rd, \rd, \val@PAGEOFF
111
+        sub             \rd, \rd, -(\offset)
112
+  .else
113
+        adrp            \rd, \val+(\offset)@PAGE
114
+        add             \rd, \rd, \val+(\offset)@PAGEOFF
115
+  .endif
116
+#elif defined(PIC) && defined(_WIN32)
117
+  .if \offset < 0
118
+        adrp            \rd, \val
119
+        add             \rd, \rd, :lo12:\val
120
+        sub             \rd, \rd, -(\offset)
121
+  .else
122
+        adrp            \rd, \val+(\offset)
123
+        add             \rd, \rd, :lo12:\val+(\offset)
124
+  .endif
125
+#else
126
+        adrp            \rd, \val+(\offset)
127
+        add             \rd, \rd, :lo12:\val+(\offset)
128
+#endif
129
+.endm
130
 
131
 #define FENC_STRIDE 64
132
 #define FDEC_STRIDE 32
133
+
134
+.macro SUMSUB_AB sum, diff, a, b
135
+    add             \sum,  \a, \b
136
+    sub             \diff, \a, \b
137
+.endm
138
+
139
+.macro SUMSUB_ABCD s1, d1, s2, d2, a, b, c, d
140
+    SUMSUB_AB       \s1, \d1, \a, \b
141
+    SUMSUB_AB       \s2, \d2, \c, \d
142
+.endm
143
+
144
+.macro HADAMARD4_V r1, r2, r3, r4, t1, t2, t3, t4
145
+    SUMSUB_ABCD     \t1, \t2, \t3, \t4, \r1, \r2, \r3, \r4
146
+    SUMSUB_ABCD     \r1, \r3, \r2, \r4, \t1, \t3, \t2, \t4
147
+.endm
148
+
149
+.macro ABS2 a b
150
+    abs             \a, \a
151
+    abs             \b, \b
152
+.endm
153
+
154
+.macro ABS8 v0, v1, v2, v3, v4, v5, v6, v7
155
+    ABS2            \v0, \v1
156
+    ABS2            \v2, \v3
157
+    ABS2            \v4, \v5
158
+    ABS2            \v6, \v7
159
+.endm
160
+
161
+.macro vtrn t1, t2, s1, s2
162
+    trn1            \t1, \s1, \s2
163
+    trn2            \t2, \s1, \s2
164
+.endm
165
+
166
+.macro trn4 t1, t2, t3, t4, s1, s2, s3, s4
167
+    vtrn            \t1, \t2, \s1, \s2
168
+    vtrn            \t3, \t4, \s3, \s4
169
+.endm
170
+
171
+#endif
172
\ No newline at end of file
173
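The PFX macros added to asm.S above mangle every assembly entry point with the build's namespace so the C++ side can link against it; assuming X265_NS is x265 (the usual single-library default), PFX(pixel_satd_4x4_sve) becomes the symbol x265_pixel_satd_4x4_sve. A preprocessor-only illustration:

#include <cstdio>

// Same token-pasting scheme as asm.S; X265_NS = x265 is an assumption here.
#define X265_NS x265
#define PFX3(prefix, name) prefix ## _ ## name
#define PFX2(prefix, name) PFX3(prefix, name)
#define PFX(name)          PFX2(X265_NS, name)

// Stringize the pasted token to show the final symbol name.
#define STR2(x) #x
#define STR(x)  STR2(x)

int main()
{
    std::puts(STR(PFX(pixel_satd_4x4_sve)));  // prints: x265_pixel_satd_4x4_sve
    return 0;
}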
x265_3.6.tar.gz/source/common/aarch64/blockcopy8-common.S Added
56
 
1
@@ -0,0 +1,54 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using NEON instruction set
26
+// that are also used by the SVE2 functions
27
+
28
+#include "asm.S"
29
+
30
+.arch           armv8-a
31
+
32
+// void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
33
+.macro cpy1Dto2D_shr_start
34
+    add             x2, x2, x2
35
+    dup             v0.8h, w3
36
+    cmeq            v1.8h, v1.8h, v1.8h
37
+    sshl            v1.8h, v1.8h, v0.8h
38
+    sri             v1.8h, v1.8h, #1
39
+    neg             v0.8h, v0.8h
40
+.endm
41
+
42
+.macro cpy2Dto1D_shr_start
43
+    add             x2, x2, x2
44
+    dup             v0.8h, w3
45
+    cmeq            v1.8h, v1.8h, v1.8h
46
+    sshl            v1.8h, v1.8h, v0.8h
47
+    sri             v1.8h, v1.8h, #1
48
+    neg             v0.8h, v0.8h
49
+.endm
50
+
51
+const xtn_xtn2_table, align=4
52
+.byte    0, 2, 4, 6, 8, 10, 12, 14
53
+.byte    16, 18, 20, 22, 24, 26, 28, 30
54
+endconst
55
+
56
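The two *_shr_start macros above build the constants for a rounded right shift without touching memory: cmeq fills v1 with -1, the sshl/sri pair turns that into -(1 << (shift - 1)), and neg makes the shift count negative so a later sshl acts as an arithmetic right shift. Per coefficient, the copy loops that use these macros therefore compute the value sketched below (a scalar restatement, assuming shift > 0):

#include <cstdint>

// Scalar equivalent of the NEON shr copy loops: subtracting the negative
// rounding constant adds 1 << (shift - 1) before the arithmetic shift.
static inline int16_t copy_shr_one(int16_t coeff, int shift)
{
    int16_t neg_round = (int16_t)-(1 << (shift - 1));   // what v1 holds
    return (int16_t)((coeff - neg_round) >> shift);     // sub, then sshl by -shift
}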
x265_3.6.tar.gz/source/common/aarch64/blockcopy8-sve.S Added
1418
 
1
@@ -0,0 +1,1416 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "blockcopy8-common.S"
27
+
28
+.arch armv8-a+sve
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+/* void blockcopy_sp(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
41
+ *
42
+ * r0   - a
43
+ * r1   - stridea
44
+ * r2   - b
45
+ * r3   - strideb */
46
+
47
+function PFX(blockcopy_sp_4x4_sve)
48
+    ptrue           p0.h, vl4
49
+.rept 2
50
+    ld1h            {z0.h}, p0/z, [x2]
51
+    add             x2, x2, x3, lsl #1
52
+    st1b            {z0.h}, p0, [x0]
53
+    add             x0, x0, x1
54
+    ld1h            {z1.h}, p0/z, [x2]
55
+    add             x2, x2, x3, lsl #1
56
+    st1b            {z1.h}, p0, [x0]
57
+    add             x0, x0, x1
58
+.endr
59
+    ret
60
+endfunc
61
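As the register comment above documents, blockcopy_sp copies a block of int16_t coefficients into a pixel plane; the ld1h/st1b pairs perform the narrowing store. A scalar sketch of the operation, assuming 8-bit pixels and illustrative bx/by size parameters:

#include <cstdint>

// Narrowing copy: each int16_t source value is stored as one byte,
// matching the ld1h (load halfwords) / st1b (store low bytes) pairs.
static void blockcopy_sp_c(uint8_t* a, intptr_t stridea,
                           const int16_t* b, intptr_t strideb,
                           int bx, int by)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            a[x] = (uint8_t)b[x];
        a += stridea;
        b += strideb;
    }
}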
+
62
+function PFX(blockcopy_sp_8x8_sve)
63
+    ptrue           p0.h, vl8
64
+.rept 4
65
+    ld1h            {z0.h}, p0/z, [x2]
66
+    add             x2, x2, x3, lsl #1
67
+    st1b            {z0.h}, p0, [x0]
68
+    add            x0, x0, x1
69
+    ld1h            {z1.h}, p0/z, [x2]
70
+    add             x2, x2, x3, lsl #1
71
+    st1b            {z1.h}, p0, [x0]
72
+    add            x0, x0, x1
73
+.endr
74
+    ret
75
+endfunc
76
+
77
+function PFX(blockcopy_sp_16x16_sve)
78
+    rdvl            x9, #1
79
+    cmp             x9, #16
80
+    bgt             .vl_gt_16_blockcopy_sp_16_16
81
+    lsl             x3, x3, #1
82
+    movrel          x11, xtn_xtn2_table
83
+    ld1             {v31.16b}, x11
84
+.rept 8
85
+    ld1             {v0.8h-v1.8h}, x2, x3
86
+    ld1             {v2.8h-v3.8h}, x2, x3
87
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
88
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
89
+    st1             {v0.16b}, x0, x1
90
+    st1             {v1.16b}, x0, x1
91
+.endr
92
+    ret
93
+.vl_gt_16_blockcopy_sp_16_16:
94
+    ptrue           p0.h, vl16
95
+.rept 8
96
+    ld1h            {z0.h}, p0/z, x2
97
+    st1b            {z0.h}, p0, x0
98
+    add             x2, x2, x3, lsl #1
99
+    add             x0, x0, x1
100
+    ld1h            {z1.h}, p0/z, x2
101
+    st1b            {z1.h}, p0, x0
102
+    add             x2, x2, x3, lsl #1
103
+    add             x0, x0, x1
104
+.endr
105
+    ret
106
+endfunc
107
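blockcopy_sp_16x16_sve and the larger routines that follow open with rdvl x9, #1 and a compare against 16: the SVE vector length in bytes selects either the 128-bit path or a wider predicated path. The same quantity is available to C/C++ through the ACLE intrinsic svcntb(), as in this sketch (assumes a compiler with SVE support, e.g. -march=armv8-a+sve):

#include <arm_sve.h>
#include <cstdint>

// Pick a code path by the hardware vector length, mirroring the
// "rdvl x9, #1; cmp x9, #16; bgt ..." prologue of the SVE routines.
static int blockcopy_path_for_vl(void)
{
    uint64_t vl_bytes = svcntb();   // SVE vector length in bytes (rdvl #1)
    if (vl_bytes > 48) return 3;    // 512-bit or wider vectors: widest path
    if (vl_bytes > 16) return 2;    // 256- or 384-bit vectors
    return 1;                       // 128-bit vectors: NEON-sized path
}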
+
108
+function PFX(blockcopy_sp_32x32_sve)
109
+    mov             w12, #4
110
+    rdvl            x9, #1
111
+    cmp             x9, #16
112
+    bgt             .vl_gt_16_blockcopy_sp_32_32
113
+    lsl             x3, x3, #1
114
+    movrel          x11, xtn_xtn2_table
115
+    ld1             {v31.16b}, x11
116
+.loop_csp32_sve:
117
+    sub             w12, w12, #1
118
+.rept 4
119
+    ld1             {v0.8h-v3.8h}, x2, x3
120
+    ld1             {v4.8h-v7.8h}, x2, x3
121
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
122
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
123
+    tbl             v2.16b, {v4.16b,v5.16b}, v31.16b
124
+    tbl             v3.16b, {v6.16b,v7.16b}, v31.16b
125
+    st1             {v0.16b-v1.16b}, x0, x1
126
+    st1             {v2.16b-v3.16b}, x0, x1
127
+.endr
128
+    cbnz            w12, .loop_csp32_sve
129
+    ret
130
+.vl_gt_16_blockcopy_sp_32_32:
131
+    cmp             x9, #48
132
+    bgt             .vl_gt_48_blockcopy_sp_32_32
133
+    ptrue           p0.h, vl16
134
+.vl_gt_16_loop_csp32_sve:
135
+    sub             w12, w12, #1
136
+.rept 4
137
+    ld1h            {z0.h}, p0/z, x2
138
+    ld1h            {z1.h}, p0/z, x2, #1, mul vl
139
+    st1b            {z0.h}, p0, x0
140
+    st1b            {z1.h}, p0, x0, #1, mul vl
141
+    add             x2, x2, x3, lsl #1
142
+    add             x0, x0, x1
143
+    ld1h            {z2.h}, p0/z, x2
144
+    ld1h            {z3.h}, p0/z, x2, #1, mul vl
145
+    st1b            {z2.h}, p0, x0
146
+    st1b            {z3.h}, p0, x0, #1, mul vl
147
+    add             x2, x2, x3, lsl #1
148
+    add             x0, x0, x1
149
+.endr
150
+    cbnz            w12, .vl_gt_16_loop_csp32_sve
151
+    ret
152
+.vl_gt_48_blockcopy_sp_32_32:
153
+    ptrue           p0.h, vl32
154
+.vl_gt_48_loop_csp32_sve:
155
+    sub             w12, w12, #1
156
+.rept 4
157
+    ld1h            {z0.h}, p0/z, x2
158
+    st1b            {z0.h}, p0, x0
159
+    add             x2, x2, x3, lsl #1
160
+    add             x0, x0, x1
161
+    ld1h            {z1.h}, p0/z, x2
162
+    st1b            {z1.h}, p0, x0
163
+    add             x2, x2, x3, lsl #1
164
+    add             x0, x0, x1
165
+.endr
166
+    cbnz            w12, .vl_gt_48_loop_csp32_sve
167
+    ret
168
+endfunc
169
+
170
+function PFX(blockcopy_ps_16x16_sve)
171
+    rdvl            x9, #1
172
+    cmp             x9, #16
173
+    bgt             .vl_gt_16_blockcopy_ps_16_16
174
+    lsl             x1, x1, #1
175
+.rept 8
176
+    ld1             {v4.16b}, x2, x3
177
+    ld1             {v5.16b}, x2, x3
178
+    uxtl            v0.8h, v4.8b
179
+    uxtl2           v1.8h, v4.16b
180
+    uxtl            v2.8h, v5.8b
181
+    uxtl2           v3.8h, v5.16b
182
+    st1             {v0.8h-v1.8h}, x0, x1
183
+    st1             {v2.8h-v3.8h}, x0, x1
184
+.endr
185
+    ret
186
+.vl_gt_16_blockcopy_ps_16_16:
187
+    ptrue           p0.b, vl32
188
+.rept 16
189
+    ld1b            {z1.h}, p0/z, x2
190
+    st1h            {z1.h}, p0, x0
191
+    add             x0, x0, x1, lsl #1
192
+    add             x2, x2, x3
193
+.endr
194
+    ret
195
+endfunc
196
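blockcopy_ps is the inverse of blockcopy_sp: each pixel is widened to an int16_t, which is what the uxtl/uxtl2 instructions and the ld1b-into-halfword-lanes loads above express. A scalar sketch, again assuming 8-bit pixels and illustrative bx/by parameters:

#include <cstdint>

// Widening copy: each 8-bit pixel becomes one int16_t destination value.
static void blockcopy_ps_c(int16_t* a, intptr_t stridea,
                           const uint8_t* b, intptr_t strideb,
                           int bx, int by)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            a[x] = (int16_t)b[x];
        a += stridea;
        b += strideb;
    }
}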
+
197
+function PFX(blockcopy_ps_32x32_sve)
198
+    rdvl            x9, #1
199
+    cmp             x9, #16
200
+    bgt             .vl_gt_16_blockcopy_ps_32_32
201
+    lsl             x1, x1, #1
202
+    mov             w12, #4
203
+.loop_cps32_sve:
204
+    sub             w12, w12, #1
205
+.rept 4
206
+    ld1             {v16.16b-v17.16b}, x2, x3
207
+    ld1             {v18.16b-v19.16b}, x2, x3
208
+    uxtl            v0.8h, v16.8b
209
+    uxtl2           v1.8h, v16.16b
210
+    uxtl            v2.8h, v17.8b
211
+    uxtl2           v3.8h, v17.16b
212
+    uxtl            v4.8h, v18.8b
213
+    uxtl2           v5.8h, v18.16b
214
+    uxtl            v6.8h, v19.8b
215
+    uxtl2           v7.8h, v19.16b
216
+    st1             {v0.8h-v3.8h}, x0, x1
217
+    st1             {v4.8h-v7.8h}, x0, x1
218
+.endr
219
+    cbnz            w12, .loop_cps32_sve
220
+    ret
221
+.vl_gt_16_blockcopy_ps_32_32:
222
+    cmp             x9, #48
223
+    bgt             .vl_gt_48_blockcopy_ps_32_32
224
+    ptrue           p0.b, vl32
225
+.rept 32
226
+    ld1b            {z2.h}, p0/z, x2
227
+    ld1b            {z3.h}, p0/z, x2, #1, mul vl
228
+    st1h            {z2.h}, p0, x0
229
+    st1h            {z3.h}, p0, x0, #1, mul vl
230
+    add             x0, x0, x1, lsl #1
231
+    add             x2, x2, x3
232
+.endr
233
+    ret
234
+.vl_gt_48_blockcopy_ps_32_32:
235
+    ptrue           p0.b, vl64
236
+.rept 32
237
+    ld1b            {z2.h}, p0/z, x2
238
+    st1h            {z2.h}, p0, x0
239
+    add             x0, x0, x1, lsl #1
240
+    add             x2, x2, x3
241
+.endr
242
+    ret
243
+endfunc
244
+
245
+function PFX(blockcopy_ps_64x64_sve)
246
+    rdvl            x9, #1
247
+    cmp             x9, #16
248
+    bgt             .vl_gt_16_blockcopy_ps_64_64
249
+    lsl             x1, x1, #1
250
+    sub             x1, x1, #64
251
+    mov             w12, #16
252
+.loop_cps64_sve:
253
+    sub             w12, w12, #1
254
+.rept 4
255
+    ld1             {v16.16b-v19.16b}, x2, x3
256
+    uxtl            v0.8h, v16.8b
257
+    uxtl2           v1.8h, v16.16b
258
+    uxtl            v2.8h, v17.8b
259
+    uxtl2           v3.8h, v17.16b
260
+    uxtl            v4.8h, v18.8b
261
+    uxtl2           v5.8h, v18.16b
262
+    uxtl            v6.8h, v19.8b
263
+    uxtl2           v7.8h, v19.16b
264
+    st1             {v0.8h-v3.8h}, x0, #64
265
+    st1             {v4.8h-v7.8h}, x0, x1
266
+.endr
267
+    cbnz            w12, .loop_cps64_sve
268
+    ret
269
+.vl_gt_16_blockcopy_ps_64_64:
270
+    cmp             x9, #48
271
+    bgt             .vl_gt_48_blockcopy_ps_64_64
272
+    ptrue           p0.b, vl32
273
+.rept 64
274
+    ld1b            {z4.h}, p0/z, x2
275
+    ld1b            {z5.h}, p0/z, x2, #1, mul vl
276
+    ld1b            {z6.h}, p0/z, x2, #2, mul vl
277
+    ld1b            {z7.h}, p0/z, x2, #3, mul vl
278
+    st1h            {z4.h}, p0, x0
279
+    st1h            {z5.h}, p0, x0, #1, mul vl
280
+    st1h            {z6.h}, p0, x0, #2, mul vl
281
+    st1h            {z7.h}, p0, x0, #3, mul vl
282
+    add             x0, x0, x1, lsl #1
283
+    add             x2, x2, x3
284
+.endr
285
+    ret
286
+.vl_gt_48_blockcopy_ps_64_64:
287
+    cmp             x9, #112
288
+    bgt             .vl_gt_112_blockcopy_ps_64_64
289
+    ptrue           p0.b, vl64
290
+.rept 64
291
+    ld1b            {z4.h}, p0/z, x2
292
+    ld1b            {z5.h}, p0/z, x2, #1, mul vl
293
+    st1h            {z4.h}, p0, x0
294
+    st1h            {z5.h}, p0, x0, #1, mul vl
295
+    add             x0, x0, x1, lsl #1
296
+    add             x2, x2, x3
297
+.endr
298
+    ret
299
+.vl_gt_112_blockcopy_ps_64_64:
300
+    ptrue           p0.b, vl128
301
+.rept 64
302
+    ld1b            {z4.h}, p0/z, x2
303
+    st1h            {z4.h}, p0, x0
304
+    add             x0, x0, x1, lsl #1
305
+    add             x2, x2, x3
306
+.endr
307
+    ret
308
+
309
+endfunc
310
+
311
+function PFX(blockcopy_ss_16x16_sve)
312
+    rdvl            x9, #1
313
+    cmp             x9, #16
314
+    bgt             .vl_gt_16_blockcopy_ss_16_16
315
+    lsl             x1, x1, #1
316
+    lsl             x3, x3, #1
317
+.rept 8
318
+    ld1             {v0.8h-v1.8h}, x2, x3
319
+    ld1             {v2.8h-v3.8h}, x2, x3
320
+    st1             {v0.8h-v1.8h}, x0, x1
321
+    st1             {v2.8h-v3.8h}, x0, x1
322
+.endr
323
+    ret
324
+.vl_gt_16_blockcopy_ss_16_16:
325
+    ptrue           p0.h, vl16
326
+.rept 16
327
+    ld1h            {z0.h}, p0/z, x2
328
+    st1h            {z0.h}, p0, x0
329
+    add             x2, x2, x3, lsl #1
330
+    add             x0, x0, x1, lsl #1
331
+.endr
332
+    ret
333
+endfunc
334
+
335
+function PFX(blockcopy_ss_32x32_sve)
336
+    rdvl            x9, #1
337
+    cmp             x9, #16
338
+    bgt             .vl_gt_16_blockcopy_ss_32_32
339
+    lsl             x1, x1, #1
340
+    lsl             x3, x3, #1
341
+    mov             w12, #4
342
+.loop_css32_sve:
343
+    sub             w12, w12, #1
344
+.rept 8
345
+    ld1             {v0.8h-v3.8h}, x2, x3
346
+    st1             {v0.8h-v3.8h}, x0, x1
347
+.endr
348
+    cbnz            w12, .loop_css32_sve
349
+    ret
350
+.vl_gt_16_blockcopy_ss_32_32:
351
+    cmp             x9, #48
352
+    bgt             .vl_gt_48_blockcopy_ss_32_32
353
+    ptrue           p0.h, vl16
354
+.rept 32
355
+    ld1h            {z0.h}, p0/z, x2
356
+    ld1h            {z1.h}, p0/z, x2, #1, mul vl
357
+    st1h            {z0.h}, p0, x0
358
+    st1h            {z1.h}, p0, x0, #1, mul vl
359
+    add             x2, x2, x3, lsl #1
360
+    add             x0, x0, x1, lsl #1
361
+.endr
362
+    ret
363
+.vl_gt_48_blockcopy_ss_32_32:
364
+    ptrue           p0.h, vl32
365
+.rept 32
366
+    ld1h            {z0.h}, p0/z, x2
367
+    st1h            {z0.h}, p0, x0
368
+    add             x2, x2, x3, lsl #1
369
+    add             x0, x0, x1, lsl #1
370
+.endr
371
+    ret
372
+endfunc
373
+
374
+function PFX(blockcopy_ss_64x64_sve)
375
+    rdvl            x9, #1
376
+    cmp             x9, #16
377
+    bgt             .vl_gt_16_blockcopy_ss_64_64
378
+    lsl             x1, x1, #1
379
+    sub             x1, x1, #64
380
+    lsl             x3, x3, #1
381
+    sub             x3, x3, #64
382
+    mov             w12, #8
383
+.loop_css64_sve:
384
+    sub             w12, w12, #1
385
+.rept 8
386
+    ld1             {v0.8h-v3.8h}, x2, #64
387
+    ld1             {v4.8h-v7.8h}, x2, x3
388
+    st1             {v0.8h-v3.8h}, x0, #64
389
+    st1             {v4.8h-v7.8h}, x0, x1
390
+.endr
391
+    cbnz            w12, .loop_css64_sve
392
+    ret
393
+.vl_gt_16_blockcopy_ss_64_64:
394
+    cmp             x9, #48
395
+    bgt             .vl_gt_48_blockcopy_ss_64_64
396
+    mov             w12, #8
397
+    ptrue           p0.b, vl32
398
+.vl_gt_16_loop_css64_sve:
399
+    sub             w12, w12, #1
400
+.rept 8
401
+    ld1b            {z0.b}, p0/z, x2
402
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
403
+    ld1b            {z2.b}, p0/z, x2, #2, mul vl
404
+    ld1b            {z3.b}, p0/z, x2, #3, mul vl
405
+    st1b            {z0.b}, p0, x0
406
+    st1b            {z1.b}, p0, x0, #1, mul vl
407
+    st1b            {z2.b}, p0, x0, #2, mul vl
408
+    st1b            {z3.b}, p0, x0, #3, mul vl
409
+    add             x2, x2, x3, lsl #1
410
+    add             x0, x0, x1, lsl #1
411
+.endr
412
+    cbnz            w12, .vl_gt_16_loop_css64_sve
413
+    ret
414
+.vl_gt_48_blockcopy_ss_64_64:
415
+    cmp             x9, #112
416
+    bgt             .vl_gt_112_blockcopy_ss_64_64
417
+    mov             w12, #8
418
+    ptrue           p0.b, vl64
419
+.vl_gt_48_loop_css64_sve:
420
+    sub             w12, w12, #1
421
+.rept 8
422
+    ld1b            {z0.b}, p0/z, x2
423
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
424
+    st1b            {z0.b}, p0, x0
425
+    st1b            {z1.b}, p0, x0, #1, mul vl
426
+    add             x2, x2, x3, lsl #1
427
+    add             x0, x0, x1, lsl #1
428
+.endr
429
+    cbnz            w12, .vl_gt_48_loop_css64_sve
430
+    ret
431
+.vl_gt_112_blockcopy_ss_64_64:
432
+    mov             w12, #8
433
+    ptrue           p0.b, vl128
434
+.vl_gt_112_loop_css64_sve:
435
+    sub             w12, w12, #1
436
+.rept 8
437
+    ld1b            {z0.b}, p0/z, x2
438
+    st1b            {z0.b}, p0, x0
439
+    add             x2, x2, x3, lsl #1
440
+    add             x0, x0, x1, lsl #1
441
+.endr
442
+    cbnz            w12, .vl_gt_112_loop_css64_sve
443
+    ret
444
+endfunc
445
+
446
+/******** Chroma blockcopy********/
447
+function PFX(blockcopy_ss_16x32_sve)
448
+    rdvl            x9, #1
449
+    cmp             x9, #16
450
+    bgt             .vl_gt_16_blockcopy_ss_16_32
451
+    lsl             x1, x1, #1
452
+    lsl             x3, x3, #1
453
+.rept 16
454
+    ld1             {v0.8h-v1.8h}, x2, x3
455
+    ld1             {v2.8h-v3.8h}, x2, x3
456
+    st1             {v0.8h-v1.8h}, x0, x1
457
+    st1             {v2.8h-v3.8h}, x0, x1
458
+.endr
459
+    ret
460
+.vl_gt_16_blockcopy_ss_16_32:
461
+    ptrue           p0.h, vl16
462
+.rept 32
463
+    ld1h            {z0.h}, p0/z, x2
464
+    st1h            {z0.h}, p0, x0
465
+    add             x2, x2, x3, lsl #1
466
+    add             x0, x0, x1, lsl #1
467
+.endr
468
+    ret
469
+endfunc
470
+
471
+function PFX(blockcopy_ss_32x64_sve)
472
+    rdvl            x9, #1
473
+    cmp             x9, #16
474
+    bgt             .vl_gt_16_blockcopy_ss_32_64
475
+    lsl             x1, x1, #1
476
+    lsl             x3, x3, #1
477
+    mov             w12, #8
478
+.loop_css32x64_sve:
479
+    sub             w12, w12, #1
480
+.rept 8
481
+    ld1             {v0.8h-v3.8h}, x2, x3
482
+    st1             {v0.8h-v3.8h}, x0, x1
483
+.endr
484
+    cbnz            w12, .loop_css32x64_sve
485
+    ret
486
+.vl_gt_16_blockcopy_ss_32_64:
487
+    cmp             x9, #48
488
+    bgt             .vl_gt_48_blockcopy_ss_32_64
489
+    mov             w12, #8
490
+    ptrue           p0.b, vl32
491
+.vl_gt_32_loop_css32x64_sve:
492
+    sub             w12, w12, #1
493
+.rept 8
494
+    ld1b            {z0.b}, p0/z, x2
495
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
496
+    st1b            {z0.b}, p0, x0
497
+    st1b            {z1.b}, p0, x0, #1, mul vl
498
+    add             x2, x2, x3, lsl #1
499
+    add             x0, x0, x1, lsl #1
500
+.endr
501
+    cbnz            w12, .vl_gt_32_loop_css32x64_sve
502
+    ret
503
+.vl_gt_48_blockcopy_ss_32_64:
504
+    mov             w12, #8
505
+    ptrue           p0.b, vl64
506
+.vl_gt_48_loop_css32x64_sve:
507
+    sub             w12, w12, #1
508
+.rept 8
509
+    ld1b            {z0.b}, p0/z, x2
510
+    st1b            {z0.b}, p0, x0
511
+    add             x2, x2, x3, lsl #1
512
+    add             x0, x0, x1, lsl #1
513
+.endr
514
+    cbnz            w12, .vl_gt_48_loop_css32x64_sve
515
+    ret
516
+endfunc
517
+
518
+// chroma blockcopy_ps
519
+function PFX(blockcopy_ps_4x8_sve)
520
+    ptrue           p0.h, vl4
521
+.rept 8
522
+    ld1b            {z0.h}, p0/z, x2
523
+    st1h            {z0.h}, p0, x0
524
+    add             x0, x0, x1, lsl #1
525
+    add             x2, x2, x3
526
+.endr
527
+    ret
528
+endfunc
529
+
530
+function PFX(blockcopy_ps_8x16_sve)
531
+    ptrue           p0.h, vl8
532
+.rept 16
533
+    ld1b            {z0.h}, p0/z, x2
534
+    st1h            {z0.h}, p0, x0
535
+    add             x0, x0, x1, lsl #1
536
+    add             x2, x2, x3
537
+.endr
538
+    ret
539
+endfunc
540
+
541
+function PFX(blockcopy_ps_16x32_sve)
542
+    rdvl            x9, #1
543
+    cmp             x9, #16
544
+    bgt             .vl_gt_16_blockcopy_ps_16_32
545
+    lsl             x1, x1, #1
546
+.rept 16
547
+    ld1             {v4.16b}, x2, x3
548
+    ld1             {v5.16b}, x2, x3
549
+    uxtl            v0.8h, v4.8b
550
+    uxtl2           v1.8h, v4.16b
551
+    uxtl            v2.8h, v5.8b
552
+    uxtl2           v3.8h, v5.16b
553
+    st1             {v0.8h-v1.8h}, x0, x1
554
+    st1             {v2.8h-v3.8h}, x0, x1
555
+.endr
556
+    ret
557
+.vl_gt_16_blockcopy_ps_16_32:
558
+    ptrue           p0.b, vl32
559
+.rept 32
560
+    ld1b            {z1.h}, p0/z, x2
561
+    st1h            {z1.h}, p0, x0
562
+    add             x0, x0, x1, lsl #1
563
+    add             x2, x2, x3
564
+.endr
565
+    ret
566
+endfunc
567
+
568
+function PFX(blockcopy_ps_32x64_sve)
569
+    rdvl            x9, #1
570
+    cmp             x9, #16
571
+    bgt             .vl_gt_16_blockcopy_ps_32_64
572
+    lsl             x1, x1, #1
573
+    mov             w12, #8
574
+.loop_cps32x64_sve:
575
+    sub             w12, w12, #1
576
+.rept 4
577
+    ld1             {v16.16b-v17.16b}, x2, x3
578
+    ld1             {v18.16b-v19.16b}, x2, x3
579
+    uxtl            v0.8h, v16.8b
580
+    uxtl2           v1.8h, v16.16b
581
+    uxtl            v2.8h, v17.8b
582
+    uxtl2           v3.8h, v17.16b
583
+    uxtl            v4.8h, v18.8b
584
+    uxtl2           v5.8h, v18.16b
585
+    uxtl            v6.8h, v19.8b
586
+    uxtl2           v7.8h, v19.16b
587
+    st1             {v0.8h-v3.8h}, x0, x1
588
+    st1             {v4.8h-v7.8h}, x0, x1
589
+.endr
590
+    cbnz            w12, .loop_cps32x64_sve
591
+    ret
592
+.vl_gt_16_blockcopy_ps_32_64:
593
+    cmp             x9, #48
594
+    bgt             .vl_gt_48_blockcopy_ps_32_64
595
+    ptrue           p0.b, vl32
596
+.rept 64
597
+    ld1b            {z2.h}, p0/z, x2
598
+    ld1b            {z3.h}, p0/z, x2, #1, mul vl
599
+    st1h            {z2.h}, p0, x0
600
+    st1h            {z3.h}, p0, x0, #1, mul vl
601
+    add             x0, x0, x1, lsl #1
602
+    add             x2, x2, x3
603
+.endr
604
+    ret
605
+.vl_gt_48_blockcopy_ps_32_64:
606
+    ptrue           p0.b, vl64
607
+.rept 64
608
+    ld1b            {z2.h}, p0/z, x2
609
+    st1h            {z2.h}, p0, x0
610
+    add             x0, x0, x1, lsl #1
611
+    add             x2, x2, x3
612
+.endr
613
+    ret
614
+endfunc
615
+
616
+// chroma blockcopy_sp
617
+function PFX(blockcopy_sp_4x8_sve)
618
+    ptrue           p0.h, vl4
619
+.rept 8
620
+    ld1h            {z0.h}, p0/z, x2
621
+    st1b            {z0.h}, p0, x0
622
+    add             x2, x2, x3, lsl #1
623
+    add             x0, x0, x1
624
+.endr
625
+    ret
626
+endfunc
627
+
628
+function PFX(blockcopy_sp_8x16_sve)
629
+    ptrue           p0.h, vl8
630
+.rept 16
631
+    ld1h            {z0.h}, p0/z, x2
632
+    st1b            {z0.h}, p0, x0
633
+    add             x2, x2, x3, lsl #1
634
+    add             x0, x0, x1
635
+.endr
636
+    ret
637
+endfunc
638
+
639
+function PFX(blockcopy_sp_16x32_sve)
640
+    rdvl            x9, #1
641
+    cmp             x9, #16
642
+    bgt             .vl_gt_16_blockcopy_sp_16_32
643
+    ptrue           p0.h, vl8
644
+.rept 32
645
+    ld1h            {z0.h}, p0/z, x2
646
+    ld1h            {z1.h}, p0/z, x2, #1, mul vl
647
+    st1b            {z0.h}, p0, x0
648
+    st1b            {z1.h}, p0, x0, #1, mul vl
649
+    add             x2, x2, x3, lsl #1
650
+    add             x0, x0, x1
651
+.endr
652
+    ret
653
+.vl_gt_16_blockcopy_sp_16_32:
654
+    ptrue           p0.h, vl16
655
+.rept 32
656
+    ld1h            {z0.h}, p0/z, x2
657
+    st1b            {z0.h}, p0, x0
658
+    add             x2, x2, x3, lsl #1
659
+    add             x0, x0, x1
660
+.endr
661
+    ret
662
+endfunc
663
+
664
+function PFX(blockcopy_sp_32x64_sve)
665
+    rdvl            x9, #1
666
+    cmp             x9, #16
667
+    bgt             .vl_gt_16_blockcopy_sp_32_64
668
+    ptrue           p0.h, vl8
669
+.rept 64
670
+    ld1h            {z0.h}, p0/z, x2
671
+    ld1h            {z1.h}, p0/z, x2, #1, mul vl
672
+    ld1h            {z2.h}, p0/z, x2, #2, mul vl
673
+    ld1h            {z3.h}, p0/z, x2, #3, mul vl
674
+    st1b            {z0.h}, p0, x0
675
+    st1b            {z1.h}, p0, x0, #1, mul vl
676
+    st1b            {z2.h}, p0, x0, #2, mul vl
677
+    st1b            {z3.h}, p0, x0, #3, mul vl
678
+    add             x2, x2, x3, lsl #1
679
+    add             x0, x0, x1
680
+.endr
681
+    ret
682
+.vl_gt_16_blockcopy_sp_32_64:
683
+    cmp             x9, #48
684
+    bgt             .vl_gt_48_blockcopy_sp_32_64
685
+    ptrue           p0.h, vl16
686
+.rept 64
687
+    ld1h            {z0.h}, p0/z, x2
688
+    ld1h            {z1.h}, p0/z, x2, #1, mul vl
689
+    st1b            {z0.h}, p0, x0
690
+    st1b            {z1.h}, p0, x0, #1, mul vl
691
+    add             x2, x2, x3, lsl #1
692
+    add             x0, x0, x1
693
+.endr
694
+    ret
695
+.vl_gt_48_blockcopy_sp_32_64:
696
+    ptrue           p0.h, vl32
697
+.rept 64
698
+    ld1h            {z0.h}, p0/z, x2
699
+    st1b            {z0.h}, p0, x0
700
+    add             x2, x2, x3, lsl #1
701
+    add             x0, x0, x1
702
+.endr
703
+    ret
704
+endfunc
705
+
706
+/* blockcopy_pp(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) */
707
+
708
+function PFX(blockcopy_pp_32x8_sve)
709
+    rdvl            x9, #1
710
+    cmp             x9, #16
711
+    bgt             .vl_gt_16_blockcopy_pp_32_8
712
+.rept 8
713
+    ld1             {v0.16b-v1.16b}, x2, x3
714
+    st1             {v0.16b-v1.16b}, x0, x1
715
+.endr
716
+    ret
717
+.vl_gt_16_blockcopy_pp_32_8:
718
+    ptrue           p0.b, vl32
719
+.rept 8
720
+    ld1b            {z0.b}, p0/z, x2
721
+    st1b            {z0.b}, p0, x0
722
+    add             x2, x2, x3
723
+    add             x0, x0, x1
724
+.endr
725
+    ret
726
+endfunc
727
+
728
+.macro blockcopy_pp_32xN_sve h
729
+function PFX(blockcopy_pp_32x\h\()_sve)
730
+    mov             w12, #\h / 8
731
+    rdvl            x9, #1
732
+    cmp             x9, #16
733
+    bgt             .vl_gt_16_blockcopy_pp_32xN_\h
734
+.loop_sve_32x\h\():
735
+    sub             w12, w12, #1
736
+.rept 8
737
+    ld1             {v0.16b-v1.16b}, x2, x3
738
+    st1             {v0.16b-v1.16b}, x0, x1
739
+.endr
740
+    cbnz            w12, .loop_sve_32x\h
741
+    ret
742
+.vl_gt_16_blockcopy_pp_32xN_\h:
743
+    ptrue           p0.b, vl32
744
+.L_gt_16_blockcopy_pp_32xN_\h:
745
+    sub             w12, w12, #1
746
+.rept 8
747
+    ld1b            {z0.b}, p0/z, x2
748
+    st1b            {z0.b}, p0, x0
749
+    add             x2, x2, x3
750
+    add             x0, x0, x1
751
+.endr
752
+    cbnz            w12, .L_gt_16_blockcopy_pp_32xN_\h
753
+    ret
754
+endfunc
755
+.endm
756
+
757
+blockcopy_pp_32xN_sve 16
758
+blockcopy_pp_32xN_sve 24
759
+blockcopy_pp_32xN_sve 32
760
+blockcopy_pp_32xN_sve 64
761
+blockcopy_pp_32xN_sve 48
762
+
763
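The blockcopy_pp_32xN_sve macro above stamps out one function per block height, folding the height into the outer loop count (\h / 8 iterations of eight rows each). Conceptually it plays the same role as a height-templated copy in C++, sketched here with hypothetical names:

#include <cstdint>
#include <cstring>

// Height-templated 32-pixel-wide copy: one instantiation per height, much
// like the assembler macro's expansions blockcopy_pp_32x16_sve ... 32x64_sve.
template <int H>
void blockcopy_pp_32xH(uint8_t* dst, intptr_t dstStride,
                       const uint8_t* src, intptr_t srcStride)
{
    for (int y = 0; y < H; y++)
    {
        std::memcpy(dst, src, 32);   // one 32-pixel row
        dst += dstStride;
        src += srcStride;
    }
}

// Usage corresponding to the instantiations above:
//   blockcopy_pp_32xH<16>(dst, dstride, src, sstride);
//   blockcopy_pp_32xH<64>(dst, dstride, src, sstride);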
+.macro blockcopy_pp_64xN_sve h
764
+function PFX(blockcopy_pp_64x\h\()_sve)
765
+    mov             w12, #\h / 4
766
+    rdvl            x9, #1
767
+    cmp             x9, #16
768
+    bgt             .vl_gt_16_blockcopy_pp_64xN_\h
769
+.loop_sve_64x\h\():
770
+    sub             w12, w12, #1
771
+.rept 4
772
+    ld1             {v0.16b-v3.16b}, x2, x3
773
+    st1             {v0.16b-v3.16b}, x0, x1
774
+.endr
775
+    cbnz            w12, .loop_sve_64x\h
776
+    ret
777
+.vl_gt_16_blockcopy_pp_64xN_\h:
778
+    cmp             x9, #48
779
+    bgt             .vl_gt_48_blockcopy_pp_64xN_\h
780
+    ptrue           p0.b, vl32
781
+.L_le_32_blockcopy_pp_64xN_\h:
782
+    sub             w12, w12, #1
783
+.rept 4
784
+    ld1b            {z0.b}, p0/z, x2
785
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
786
+    st1b            {z0.b}, p0, x0
787
+    st1b            {z1.b}, p0, x0, #1, mul vl
788
+    add             x2, x2, x3
789
+    add             x0, x0, x1
790
+.endr
791
+    cbnz            w12, .L_le_32_blockcopy_pp_64xN_\h
792
+    ret
793
+.vl_gt_48_blockcopy_pp_64xN_\h:
794
+    ptrue           p0.b, vl64
795
+.L_blockcopy_pp_64xN_\h:
796
+    sub             w12, w12, #1
797
+.rept 4
798
+    ld1b            {z0.b}, p0/z, x2
799
+    st1b            {z0.b}, p0, x0
800
+    add             x2, x2, x3
801
+    add             x0, x0, x1
802
+.endr
803
+    cbnz            w12, .L_blockcopy_pp_64xN_\h
804
+    ret
805
+endfunc
806
+.endm
807
+
808
+blockcopy_pp_64xN_sve 16
809
+blockcopy_pp_64xN_sve 32
810
+blockcopy_pp_64xN_sve 48
811
+blockcopy_pp_64xN_sve 64
812
+
813
+function PFX(blockfill_s_32x32_sve)
814
+    rdvl            x9, #1
815
+    cmp             x9, #16
816
+    bgt             .vl_gt_16_blockfill_s_32_32
817
+    dup             v0.8h, w2
818
+    mov             v1.16b, v0.16b
819
+    mov             v2.16b, v0.16b
820
+    mov             v3.16b, v0.16b
821
+    lsl             x1, x1, #1
822
+.rept 32
823
+    st1             {v0.8h-v3.8h}, x0, x1
824
+.endr
825
+    ret
826
+.vl_gt_16_blockfill_s_32_32:
827
+    cmp             x9, #48
828
+    bgt             .vl_gt_48_blockfill_s_32_32
829
+    dup             z0.h, w2
830
+    ptrue           p0.h, vl16
831
+.rept 32
832
+    st1h            {z0.h}, p0, x0
833
+    st1h            {z0.h}, p0, x0, #1, mul vl
834
+    add             x0, x0, x1, lsl #1
835
+.endr
836
+    ret
837
+.vl_gt_48_blockfill_s_32_32:
838
+    dup             z0.h, w2
839
+    ptrue           p0.h, vl32
840
+.rept 32
841
+    st1h            {z0.h}, p0, x0
842
+    add             x0, x0, x1, lsl #1
843
+.endr
844
+    ret
845
+endfunc
846
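blockfill_s_32x32 writes one 16-bit value across a strided 32x32 block; the three paths above differ only in how many lanes each store covers. A scalar sketch of what is computed (the stride is in int16_t units):

#include <cstdint>

// Fill every position of a 32x32 int16_t block with the same value.
static void blockfill_s_32x32_c(int16_t* dst, intptr_t dstride, int16_t val)
{
    for (int y = 0; y < 32; y++)
    {
        for (int x = 0; x < 32; x++)
            dst[x] = val;
        dst += dstride;
    }
}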
+
847
+// void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
848
+.macro cpy2Dto1D_shl_start_sve
849
+    add             x2, x2, x2
850
+    mov             z0.h, w3
851
+.endm
852
+
853
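cpy2Dto1D_shl gathers a strided 2-D coefficient block into a contiguous 1-D buffer while shifting each value left; the add x2, x2, x2 in the start macro converts the element stride to bytes. A scalar sketch, with the block size as an illustrative parameter (the real primitives are fixed-size 16/32/64 specializations):

#include <cstdint>

// Scalar equivalent of cpy2Dto1D_shl: read a strided 2-D coefficient block,
// shift each value left, and write it to a contiguous 1-D buffer.
static void cpy2Dto1D_shl_c(int16_t* dst, const int16_t* src,
                            intptr_t srcStride, int shift, int size)
{
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            dst[x] = (int16_t)(src[x] << shift);
        src += srcStride;
        dst += size;
    }
}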
+function PFX(cpy2Dto1D_shl_16x16_sve)
854
+    dup             z0.h, w3
855
+    rdvl            x9, #1
856
+    cmp             x9, #16
857
+    bgt             .vl_gt_16_cpy2Dto1D_shl_16x16
858
+    cpy2Dto1D_shl_start_sve
859
+    mov             w12, #4
860
+.loop_cpy2Dto1D_shl_16_sve:
861
+    sub             w12, w12, #1
862
+.rept 4
863
+    ld1             {v2.16b-v3.16b}, x1, x2
864
+    sshl            v2.8h, v2.8h, v0.8h
865
+    sshl            v3.8h, v3.8h, v0.8h
866
+    st1             {v2.16b-v3.16b}, x0, #32
867
+.endr
868
+    cbnz            w12, .loop_cpy2Dto1D_shl_16_sve
869
+    ret
870
+.vl_gt_16_cpy2Dto1D_shl_16x16:
871
+    ptrue           p0.h, vl16
872
+.rept 16
873
+    ld1h            {z1.h}, p0/z, x1
874
+    lsl             z1.h, p0/m, z1.h, z0.h
875
+    st1h            {z1.h}, p0, x0
876
+    add             x1, x1, x2, lsl #1
877
+    add             x0, x0, #32
878
+.endr
879
+    ret
880
+endfunc
881
+
882
+function PFX(cpy2Dto1D_shl_32x32_sve)
883
+    dup             z0.h, w3
884
+    rdvl            x9, #1
885
+    cmp             x9, #16
886
+    bgt             .vl_gt_16_cpy2Dto1D_shl_32x32
887
+    cpy2Dto1D_shl_start_sve
888
+    mov             w12, #16
889
+.loop_cpy2Dto1D_shl_32_sve:
890
+    sub             w12, w12, #1
891
+.rept 2
892
+    ld1             {v2.16b-v5.16b}, x1, x2
893
+    sshl            v2.8h, v2.8h, v0.8h
894
+    sshl            v3.8h, v3.8h, v0.8h
895
+    sshl            v4.8h, v4.8h, v0.8h
896
+    sshl            v5.8h, v5.8h, v0.8h
897
+    st1             {v2.16b-v5.16b}, x0, #64
898
+.endr
899
+    cbnz            w12, .loop_cpy2Dto1D_shl_32_sve
900
+    ret
901
+.vl_gt_16_cpy2Dto1D_shl_32x32:
902
+    cmp             x9, #48
903
+    bgt             .vl_gt_48_cpy2Dto1D_shl_32x32
904
+    ptrue           p0.h, vl16
905
+.rept 32
906
+    ld1h            {z1.h}, p0/z, x1
907
+    ld1h            {z2.h}, p0/z, x1, #1, mul vl
908
+    lsl             z1.h, p0/m, z1.h, z0.h
909
+    lsl             z2.h, p0/m, z2.h, z0.h
910
+    st1h            {z1.h}, p0, x0
911
+    st1h            {z2.h}, p0, x0, #1, mul vl
912
+    add             x1, x1, x2, lsl #1
913
+    add             x0, x0, #64
914
+.endr
915
+    ret
916
+.vl_gt_48_cpy2Dto1D_shl_32x32:
917
+    ptrue           p0.h, vl32
918
+.rept 32
919
+    ld1h            {z1.h}, p0/z, x1
920
+    lsl             z1.h, p0/m, z1.h, z0.h
921
+    st1h            {z1.h}, p0, x0
922
+    add             x1, x1, x2, lsl #1
923
+    add             x0, x0, #64
924
+.endr
925
+    ret
926
+endfunc
927
+
928
+function PFX(cpy2Dto1D_shl_64x64_sve)
929
+    rdvl            x9, #1
930
+    cmp             x9, #16
931
+    bgt             .vl_gt_16_cpy2Dto1D_shl_64x64
932
+    cpy2Dto1D_shl_start_sve
933
+    mov             w12, #32
934
+    sub             x2, x2, #64
935
+.loop_cpy2Dto1D_shl_64_sve:
936
+    sub             w12, w12, #1
937
+.rept 2
938
+    ld1             {v2.16b-v5.16b}, x1, #64
939
+    ld1             {v16.16b-v19.16b}, x1, x2
940
+    sshl            v2.8h, v2.8h, v0.8h
941
+    sshl            v3.8h, v3.8h, v0.8h
942
+    sshl            v4.8h, v4.8h, v0.8h
943
+    sshl            v5.8h, v5.8h, v0.8h
944
+    sshl            v16.8h, v16.8h, v0.8h
945
+    sshl            v17.8h, v17.8h, v0.8h
946
+    sshl            v18.8h, v18.8h, v0.8h
947
+    sshl            v19.8h, v19.8h, v0.8h
948
+    st1             {v2.16b-v5.16b}, x0, #64
949
+    st1             {v16.16b-v19.16b}, x0, #64
950
+.endr
951
+    cbnz            w12, .loop_cpy2Dto1D_shl_64_sve
952
+    ret
953
+.vl_gt_16_cpy2Dto1D_shl_64x64:
954
+    dup             z0.h, w3
955
+    mov             x8, #64
956
+    mov             w12, #64
957
+.L_init_cpy2Dto1D_shl_64x64:
958
+    sub             w12, w12, 1
959
+    mov             x9, #0
960
+    whilelt         p0.h, x9, x8
961
+.L_cpy2Dto1D_shl_64x64:
962
+    ld1h            {z1.h}, p0/z, x1, x9, lsl #1
963
+    lsl             z1.h, p0/m, z1.h, z0.h
964
+    st1h            {z1.h}, p0, x0, x9, lsl #1
965
+    inch            x9
966
+    whilelt         p0.h, x9, x8
967
+    b.first         .L_cpy2Dto1D_shl_64x64
968
+    add             x1, x1, x2, lsl #1
969
+    addvl           x0, x0, #1
970
+    cbnz            w12, .L_init_cpy2Dto1D_shl_64x64
971
+    ret
972
+endfunc
973
+
974
+// void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
975
+
976
+function PFX(cpy2Dto1D_shr_4x4_sve)
977
+    dup             z0.h, w3
978
+    sub             w4, w3, #1
979
+    dup             z1.h, w4
980
+    ptrue           p0.h, vl8
981
+    mov             z2.h, #1
982
+    lsl             z2.h, p0/m, z2.h, z1.h
983
+    lsl             x2, x2, #1
984
+    index           z3.d, #0, x2
985
+    index           z4.d, #0, #8
986
+.rept 2
987
+    ld1d            {z5.d}, p0/z, x1, z3.d
988
+    add             x1, x1, x2, lsl #1
989
+    add             z5.h, p0/m, z5.h, z2.h
990
+    asr             z5.h, p0/m, z5.h, z0.h
991
+    st1d            {z5.d}, p0, x0, z4.d
992
+    add             x0, x0, #16
993
+.endr
994
+    ret
995
+endfunc
996
+
997
+function PFX(cpy2Dto1D_shr_8x8_sve)
998
+    dup             z0.h, w3
999
+    sub             w4, w3, #1
1000
+    dup             z1.h, w4
1001
+    ptrue           p0.h, vl8
1002
+    mov             z2.h, #1
1003
+    lsl             z2.h, p0/m, z2.h, z1.h
1004
+.rept 8
1005
+    ld1d            {z5.d}, p0/z, x1
1006
+    add             x1, x1, x2, lsl #1
1007
+    add             z5.h, p0/m, z5.h, z2.h
1008
+    asr             z5.h, p0/m, z5.h, z0.h
1009
+    st1d            {z5.d}, p0, x0
1010
+    add             x0, x0, #16
1011
+.endr
1012
+    ret
1013
+endfunc
1014
+
1015
+function PFX(cpy2Dto1D_shr_16x16_sve)
1016
+    dup             z0.h, w3
1017
+    sub             w4, w3, #1
1018
+    dup             z1.h, w4
1019
+    rdvl            x9, #1
1020
+    cmp             x9, #16
1021
+    bgt             .vl_gt_16_cpy2Dto1D_shr_16x16
1022
+    ptrue           p0.h, vl8
1023
+    mov             z2.h, #1
1024
+    lsl             z2.h, p0/m, z2.h, z1.h
1025
+.rept 16
1026
+    ld1d            {z5.d}, p0/z, x1
1027
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1028
+    add             x1, x1, x2, lsl #1
1029
+    add             z5.h, p0/m, z5.h, z2.h
1030
+    add             z6.h, p0/m, z6.h, z2.h
1031
+    asr             z5.h, p0/m, z5.h, z0.h
1032
+    asr             z6.h, p0/m, z6.h, z0.h
1033
+    st1d            {z5.d}, p0, x0
1034
+    st1d            {z6.d}, p0, x0, #1, mul vl
1035
+    add             x0, x0, #32
1036
+.endr
1037
+    ret
1038
+.vl_gt_16_cpy2Dto1D_shr_16x16:
1039
+    ptrue           p0.h, vl16
1040
+    mov             z2.h, #1
1041
+    lsl             z2.h, p0/m, z2.h, z1.h
1042
+.rept 16
1043
+    ld1d            {z5.d}, p0/z, x1
1044
+    add             x1, x1, x2, lsl #1
1045
+    add             z5.h, p0/m, z5.h, z2.h
1046
+    asr             z5.h, p0/m, z5.h, z0.h
1047
+    st1d            {z5.d}, p0, x0
1048
+    add             x0, x0, #32
1049
+.endr
1050
+    ret
1051
+endfunc
1052
+
1053
+function PFX(cpy2Dto1D_shr_32x32_sve)
1054
+    rdvl            x9, #1
1055
+    cmp             x9, #16
1056
+    bgt             .vl_gt_16_cpy2Dto1D_shr_32x32
1057
+    cpy2Dto1D_shr_start
1058
+    mov             w12, #16
1059
+.loop_cpy2Dto1D_shr_32_sve:
1060
+    sub             w12, w12, #1
1061
+.rept 2
1062
+    ld1             {v2.8h-v5.8h}, x1, x2
1063
+    sub             v2.8h, v2.8h, v1.8h
1064
+    sub             v3.8h, v3.8h, v1.8h
1065
+    sub             v4.8h, v4.8h, v1.8h
1066
+    sub             v5.8h, v5.8h, v1.8h
1067
+    sshl            v2.8h, v2.8h, v0.8h
1068
+    sshl            v3.8h, v3.8h, v0.8h
1069
+    sshl            v4.8h, v4.8h, v0.8h
1070
+    sshl            v5.8h, v5.8h, v0.8h
1071
+    st1             {v2.8h-v5.8h}, x0, #64
1072
+.endr
1073
+    cbnz            w12, .loop_cpy2Dto1D_shr_32_sve
1074
+    ret
1075
+.vl_gt_16_cpy2Dto1D_shr_32x32:
1076
+    dup             z0.h, w3
1077
+    sub             w4, w3, #1
1078
+    dup             z1.h, w4
1079
+    cmp             x9, #48
1080
+    bgt             .vl_gt_48_cpy2Dto1D_shr_32x32
1081
+    ptrue           p0.h, vl16
1082
+    mov             z2.h, #1
1083
+    lsl             z2.h, p0/m, z2.h, z1.h
1084
+.rept 32
1085
+    ld1d            {z5.d}, p0/z, x1
1086
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1087
+    add             x1, x1, x2, lsl #1
1088
+    add             z5.h, p0/m, z5.h, z2.h
1089
+    add             z6.h, p0/m, z6.h, z2.h
1090
+    asr             z5.h, p0/m, z5.h, z0.h
1091
+    asr             z6.h, p0/m, z6.h, z0.h
1092
+    st1d            {z5.d}, p0, x0
1093
+    st1d            {z6.d}, p0, x0, #1, mul vl
1094
+    add             x0, x0, #64
1095
+.endr
1096
+    ret
1097
+.vl_gt_48_cpy2Dto1D_shr_32x32:
1098
+    ptrue           p0.h, vl32
1099
+    mov             z2.h, #1
1100
+    lsl             z2.h, p0/m, z2.h, z1.h
1101
+.rept 32
1102
+    ld1d            {z5.d}, p0/z, x1
1103
+    add             x1, x1, x2, lsl #1
1104
+    add             z5.h, p0/m, z5.h, z2.h
1105
+    asr             z5.h, p0/m, z5.h, z0.h
1106
+    st1d            {z5.d}, p0, x0
1107
+    add             x0, x0, #64
1108
+.endr
1109
+    ret
1110
+endfunc
1111
+
1112
+// void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
1113
+
1114
+function PFX(cpy1Dto2D_shl_16x16_sve)
1115
+    dup             z0.h, w3
1116
+    rdvl            x9, #1
1117
+    cmp             x9, #16
1118
+    bgt             .vl_gt_16_cpy1Dto2D_shl_16x16
1119
+    ptrue           p0.h, vl8
1120
+.rept 16
1121
+    ld1h            {z1.h}, p0/z, x1
1122
+    ld1h            {z2.h}, p0/z, x1, #1, mul vl
1123
+    lsl             z1.h, p0/m, z1.h, z0.h
1124
+    lsl             z2.h, p0/m, z2.h, z0.h
1125
+    st1h            {z1.h}, p0, x0
1126
+    st1h            {z2.h}, p0, x0, #1, mul vl
1127
+    add             x1, x1, #32
1128
+    add             x0, x0, x2, lsl #1
1129
+.endr
1130
+    ret
1131
+.vl_gt_16_cpy1Dto2D_shl_16x16:
1132
+    ptrue           p0.h, vl16
1133
+.rept 16
1134
+    ld1h            {z1.h}, p0/z, x1
1135
+    lsl             z1.h, p0/m, z1.h, z0.h
1136
+    st1h            {z1.h}, p0, x0
1137
+    add             x1, x1, #32
1138
+    add             x0, x0, x2, lsl #1
1139
+.endr
1140
+    ret
1141
+endfunc
1142
+
1143
+function PFX(cpy1Dto2D_shl_32x32_sve)
1144
+    dup             z0.h, w3
1145
+    rdvl            x9, #1
1146
+    cmp             x9, #16
1147
+    bgt             .vl_gt_16_cpy1Dto2D_shl_32x32
1148
+    ptrue           p0.h, vl8
1149
+.rept 32
1150
+    ld1h            {z1.h}, p0/z, x1
1151
+    ld1h            {z2.h}, p0/z, x1, #1, mul vl
1152
+    ld1h            {z3.h}, p0/z, x1, #2, mul vl
1153
+    ld1h            {z4.h}, p0/z, x1, #3, mul vl
1154
+    lsl             z1.h, p0/m, z1.h, z0.h
1155
+    lsl             z2.h, p0/m, z2.h, z0.h
1156
+    lsl             z3.h, p0/m, z3.h, z0.h
1157
+    lsl             z4.h, p0/m, z4.h, z0.h
1158
+    st1h            {z1.h}, p0, x0
1159
+    st1h            {z2.h}, p0, x0, #1, mul vl
1160
+    st1h            {z3.h}, p0, x0, #2, mul vl
1161
+    st1h            {z4.h}, p0, x0, #3, mul vl
1162
+    add             x1, x1, #64
1163
+    add             x0, x0, x2, lsl #1
1164
+.endr
1165
+    ret
1166
+.vl_gt_16_cpy1Dto2D_shl_32x32:
1167
+    cmp             x9, #48
1168
+    bgt             .vl_gt_48_cpy1Dto2D_shl_32x32
1169
+    ptrue           p0.h, vl16
1170
+.rept 32
1171
+    ld1h            {z1.h}, p0/z, x1
1172
+    ld1h            {z2.h}, p0/z, x1, #1, mul vl
1173
+    lsl             z1.h, p0/m, z1.h, z0.h
1174
+    lsl             z2.h, p0/m, z2.h, z0.h
1175
+    st1h            {z1.h}, p0, x0
1176
+    st1h            {z2.h}, p0, x0, #1, mul vl
1177
+    add             x1, x1, #64
1178
+    add             x0, x0, x2, lsl #1
1179
+.endr
1180
+    ret
1181
+.vl_gt_48_cpy1Dto2D_shl_32x32:
1182
+    ptrue           p0.h, vl32
1183
+.rept 32
1184
+    ld1h            {z1.h}, p0/z, x1
1185
+    lsl             z1.h, p0/m, z1.h, z0.h
1186
+    st1h            {z1.h}, p0, x0
1187
+    add             x1, x1, #64
1188
+    add             x0, x0, x2, lsl #1
1189
+.endr
1190
+    ret
1191
+endfunc
1192
+
1193
+function PFX(cpy1Dto2D_shl_64x64_sve)
1194
+    dup             z0.h, w3
1195
+    mov             x8, #64
1196
+    mov             w12, #64
1197
+.L_init_cpy1Dto2D_shl_64x64:
1198
+    sub             w12, w12, 1
1199
+    mov             x9, #0
1200
+    whilelt         p0.h, x9, x8
1201
+.L_cpy1Dto2D_shl_64x64:
1202
+    ld1h            {z1.h}, p0/z, x1, x9, lsl #1
1203
+    lsl             z1.h, p0/m, z1.h, z0.h
1204
+    st1h            {z1.h}, p0, x0, x9, lsl #1
1205
+    inch            x9
1206
+    whilelt         p0.h, x9, x8
1207
+    b.first         .L_cpy1Dto2D_shl_64x64
1208
+    addvl           x1, x1, #1
1209
+    add             x0, x0, x2, lsl #1
1210
+    cbnz            w12, .L_init_cpy1Dto2D_shl_64x64
1211
+    ret
1212
+endfunc
1213
+
1214
+// void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
1215
+
1216
+function PFX(cpy1Dto2D_shr_16x16_sve)
1217
+    rdvl            x9, #1
1218
+    cmp             x9, #16
1219
+    bgt             .vl_gt_16_cpy1Dto2D_shr_16x16
1220
+    cpy1Dto2D_shr_start
1221
+    mov             w12, #4
1222
+.loop_cpy1Dto2D_shr_16:
1223
+    sub             w12, w12, #1
1224
+.rept 4
1225
+    ld1             {v2.8h-v3.8h}, x1, #32
1226
+    sub             v2.8h, v2.8h, v1.8h
1227
+    sub             v3.8h, v3.8h, v1.8h
1228
+    sshl            v2.8h, v2.8h, v0.8h
1229
+    sshl            v3.8h, v3.8h, v0.8h
1230
+    st1             {v2.8h-v3.8h}, x0, x2
1231
+.endr
1232
+    cbnz            w12, .loop_cpy1Dto2D_shr_16
1233
+    ret
1234
+.vl_gt_16_cpy1Dto2D_shr_16x16:
1235
+    dup             z0.h, w3
1236
+    sub             w4, w3, #1
1237
+    dup             z1.h, w4
1238
+    ptrue           p0.h, vl16
1239
+    mov             z2.h, #1
1240
+    lsl             z2.h, p0/m, z2.h, z1.h
1241
+.rept 16
1242
+    ld1d            {z5.d}, p0/z, x1
1243
+    add             x1, x1, #32
1244
+    add             z5.h, p0/m, z5.h, z2.h
1245
+    asr             z5.h, p0/m, z5.h, z0.h
1246
+    st1d            {z5.d}, p0, x0
1247
+    add             x0, x0, x2, lsl #1
1248
+.endr
1249
+    ret
1250
+endfunc
1251
+
1252
+function PFX(cpy1Dto2D_shr_32x32_sve)
1253
+    rdvl            x9, #1
1254
+    cmp             x9, #16
1255
+    bgt             .vl_gt_16_cpy1Dto2D_shr_32x32
1256
+    cpy1Dto2D_shr_start
1257
+    mov             w12, #16
1258
+.loop_cpy1Dto2D_shr_32_sve:
1259
+    sub             w12, w12, #1
1260
+.rept 2
1261
+    ld1             {v2.16b-v5.16b}, x1, #64
1262
+    sub             v2.8h, v2.8h, v1.8h
1263
+    sub             v3.8h, v3.8h, v1.8h
1264
+    sub             v4.8h, v4.8h, v1.8h
1265
+    sub             v5.8h, v5.8h, v1.8h
1266
+    sshl            v2.8h, v2.8h, v0.8h
1267
+    sshl            v3.8h, v3.8h, v0.8h
1268
+    sshl            v4.8h, v4.8h, v0.8h
1269
+    sshl            v5.8h, v5.8h, v0.8h
1270
+    st1             {v2.16b-v5.16b}, x0, x2
1271
+.endr
1272
+    cbnz            w12, .loop_cpy1Dto2D_shr_32_sve
1273
+    ret
1274
+.vl_gt_16_cpy1Dto2D_shr_32x32:
1275
+    dup             z0.h, w3
1276
+    sub             w4, w3, #1
1277
+    dup             z1.h, w4
1278
+    cmp             x9, #48
1279
+    bgt             .vl_gt_48_cpy2Dto1D_shr_32x32
1280
+    ptrue           p0.h, vl16
1281
+    mov             z2.h, #1
1282
+    lsl             z2.h, p0/m, z2.h, z1.h
1283
+.rept 32
1284
+    ld1d            {z5.d}, p0/z, x1
1285
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1286
+    add             x1, x1, #64
1287
+    add             z5.h, p0/m, z5.h, z2.h
1288
+    add             z6.h, p0/m, z6.h, z2.h
1289
+    asr             z5.h, p0/m, z5.h, z0.h
1290
+    asr             z6.h, p0/m, z6.h, z0.h
1291
+    st1d            {z5.d}, p0, x0
1292
+    st1d            {z6.d}, p0, x0, #1, mul vl
1293
+    add             x0, x0, x2, lsl #1
1294
+.endr
1295
+    ret
1296
+.vl_gt_48_cpy1Dto2D_shr_32x32:
1297
+    ptrue           p0.h, vl32
1298
+    mov             z2.h, #1
1299
+    lsl             z2.h, p0/m, z2.h, z1.h
1300
+.rept 32
1301
+    ld1d            {z5.d}, p0/z, x1
1302
+    add             x1, x1, #64
1303
+    add             z5.h, p0/m, z5.h, z2.h
1304
+    asr             z5.h, p0/m, z5.h, z0.h
1305
+    st1d            {z5.d}, p0, x0
1306
+    add             x0, x0, x2, lsl #1
1307
+.endr
1308
+    ret
1309
+endfunc
1310
+
1311
+function PFX(cpy1Dto2D_shr_64x64_sve)
1312
+    dup             z0.h, w3
1313
+    sub             w4, w3, #1
1314
+    dup             z1.h, w4
1315
+    rdvl            x9, #1
1316
+    cmp             x9, #16
1317
+    bgt             .vl_gt_16_cpy1Dto2D_shr_64x64
1318
+    ptrue           p0.h, vl8
1319
+    mov             z2.h, #1
1320
+    lsl             z2.h, p0/m, z2.h, z1.h
1321
+.rept 128
1322
+    ld1d            {z5.d}, p0/z, x1
1323
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1324
+    ld1d            {z7.d}, p0/z, x1, #2, mul vl
1325
+    ld1d            {z8.d}, p0/z, x1, #3, mul vl
1326
+    ld1d            {z9.d}, p0/z, x1, #4, mul vl
1327
+    ld1d            {z10.d}, p0/z, x1, #5, mul vl
1328
+    ld1d            {z11.d}, p0/z, x1, #6, mul vl
1329
+    ld1d            {z12.d}, p0/z, x1, #7, mul vl
1330
+    add             x1, x1, #128
1331
+    add             z5.h, p0/m, z5.h, z2.h
1332
+    add             z6.h, p0/m, z6.h, z2.h
1333
+    add             z7.h, p0/m, z7.h, z2.h
1334
+    add             z8.h, p0/m, z8.h, z2.h
1335
+    add             z9.h, p0/m, z9.h, z2.h
1336
+    add             z10.h, p0/m, z10.h, z2.h
1337
+    add             z11.h, p0/m, z11.h, z2.h
1338
+    add             z12.h, p0/m, z12.h, z2.h
1339
+    asr             z5.h, p0/m, z5.h, z0.h
1340
+    asr             z6.h, p0/m, z6.h, z0.h
1341
+    asr             z7.h, p0/m, z7.h, z0.h
1342
+    asr             z8.h, p0/m, z8.h, z0.h
1343
+    asr             z9.h, p0/m, z9.h, z0.h
1344
+    asr             z10.h, p0/m, z10.h, z0.h
1345
+    asr             z11.h, p0/m, z11.h, z0.h
1346
+    asr             z12.h, p0/m, z12.h, z0.h
1347
+    st1d            {z5.d}, p0, x0
1348
+    st1d            {z6.d}, p0, x0, #1, mul vl
1349
+    st1d            {z7.d}, p0, x0, #2, mul vl
1350
+    st1d            {z8.d}, p0, x0, #3, mul vl
1351
+    st1d            {z9.d}, p0, x0, #4, mul vl
1352
+    st1d            {z10.d}, p0, x0, #5, mul vl
1353
+    st1d            {z11.d}, p0, x0, #6, mul vl
1354
+    st1d            {z12.d}, p0, x0, #7, mul vl
1355
+    add             x0, x0, x2, lsl #1
1356
+.endr
1357
+    ret
1358
+.vl_gt_16_cpy1Dto2D_shr_64x64:
1359
+    cmp             x9, #48
1360
+    bgt             .vl_gt_48_cpy1Dto2D_shr_64x64
1361
+    ptrue           p0.h, vl16
1362
+    mov             z2.h, #1
1363
+    lsl             z2.h, p0/m, z2.h, z1.h
1364
+.rept 128
1365
+    ld1d            {z5.d}, p0/z, x1
1366
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1367
+    ld1d            {z7.d}, p0/z, x1, #2, mul vl
1368
+    ld1d            {z8.d}, p0/z, x1, #3, mul vl
1369
+    add             x1, x1, #128
1370
+    add             z5.h, p0/m, z5.h, z2.h
1371
+    add             z6.h, p0/m, z6.h, z2.h
1372
+    add             z7.h, p0/m, z7.h, z2.h
1373
+    add             z8.h, p0/m, z8.h, z2.h
1374
+    asr             z5.h, p0/m, z5.h, z0.h
1375
+    asr             z6.h, p0/m, z6.h, z0.h
1376
+    asr             z7.h, p0/m, z7.h, z0.h
1377
+    asr             z8.h, p0/m, z8.h, z0.h
1378
+    st1d            {z5.d}, p0, x0
1379
+    st1d            {z6.d}, p0, x0, #1, mul vl
1380
+    st1d            {z7.d}, p0, x0, #2, mul vl
1381
+    st1d            {z8.d}, p0, x0, #3, mul vl
1382
+    add             x0, x0, x2, lsl #1
1383
+.endr
1384
+    ret
1385
+.vl_gt_48_cpy1Dto2D_shr_64x64:
1386
+    cmp             x9, #112
1387
+    bgt             .vl_gt_112_cpy1Dto2D_shr_64x64
1388
+    ptrue           p0.h, vl32
1389
+    mov             z2.h, #1
1390
+    lsl             z2.h, p0/m, z2.h, z1.h
1391
+.rept 128
1392
+    ld1d            {z5.d}, p0/z, x1
1393
+    ld1d            {z6.d}, p0/z, x1, #1, mul vl
1394
+    add             x1, x1, #128
1395
+    add             z5.h, p0/m, z5.h, z2.h
1396
+    add             z6.h, p0/m, z6.h, z2.h
1397
+    asr             z5.h, p0/m, z5.h, z0.h
1398
+    asr             z6.h, p0/m, z6.h, z0.h
1399
+    st1d            {z5.d}, p0, x0
1400
+    st1d            {z6.d}, p0, x0, #1, mul vl
1401
+    add             x0, x0, x2, lsl #1
1402
+.endr
1403
+    ret
1404
+.vl_gt_112_cpy1Dto2D_shr_64x64:
1405
+    ptrue           p0.h, vl64
1406
+    mov             z2.h, #1
1407
+    lsl             z2.h, p0/m, z2.h, z1.h
1408
+.rept 128
1409
+    ld1d            {z5.d}, p0/z, x1
1410
+    add             x1, x1, #128
1411
+    add             z5.h, p0/m, z5.h, z2.h
1412
+    asr             z5.h, p0/m, z5.h, z0.h
1413
+    st1d            {z5.d}, p0, x0
1414
+    add             x0, x0, x2, lsl #1
1415
+.endr
1416
+    ret
1417
+endfunc
1418
x265_3.6.tar.gz/source/common/aarch64/blockcopy8.S Added
1301
 
1
@@ -0,0 +1,1299 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2021 MulticoreWare, Inc
4
+ *
5
+ * Authors: Sebastian Pop <spop@amazon.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm.S"
26
+#include "blockcopy8-common.S"
27
+
28
+#ifdef __APPLE__
29
+.section __RODATA,__rodata
30
+#else
31
+.section .rodata
32
+#endif
33
+
34
+.align 4
35
+
36
+.text
37
+
38
+/* void blockcopy_sp(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
39
+ *
40
+ * r0   - a
41
+ * r1   - stridea
42
+ * r2   - b
43
+ * r3   - strideb */
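As a rough scalar picture of the blockcopy_sp kernels that follow, assuming pixel is uint8_t (8-bit build) and N is the block size of each specialization; sketch only:

    #include <stdint.h>
    typedef uint8_t pixel;  // assumption: 8-bit pixel build

    // Narrow an N x N block of int16_t samples back to pixels.
    template<int N>
    static void blockcopy_sp_ref(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
    {
        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++)
                a[x] = (pixel)b[x];   // truncating narrow, mirroring the xtn/tbl in the assembly
            a += stridea;
            b += strideb;
        }
    }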
44
+function PFX(blockcopy_sp_4x4_neon)
45
+    lsl             x3, x3, #1
46
+.rept 2
47
+    ld1             {v0.8h}, x2, x3
48
+    ld1             {v1.8h}, x2, x3
49
+    xtn             v0.8b, v0.8h
50
+    xtn             v1.8b, v1.8h
51
+    st1             {v0.s}0, x0, x1
52
+    st1             {v1.s}0, x0, x1
53
+.endr
54
+    ret
55
+endfunc
56
+
57
+function PFX(blockcopy_sp_8x8_neon)
58
+    lsl             x3, x3, #1
59
+.rept 4
60
+    ld1             {v0.8h}, x2, x3
61
+    ld1             {v1.8h}, x2, x3
62
+    xtn             v0.8b, v0.8h
63
+    xtn             v1.8b, v1.8h
64
+    st1             {v0.d}0, x0, x1
65
+    st1             {v1.d}0, x0, x1
66
+.endr
67
+    ret
68
+endfunc
69
+
70
+function PFX(blockcopy_sp_16x16_neon)
71
+    lsl             x3, x3, #1
72
+    movrel          x11, xtn_xtn2_table
73
+    ld1             {v31.16b}, x11
74
+.rept 8
75
+    ld1             {v0.8h-v1.8h}, x2, x3
76
+    ld1             {v2.8h-v3.8h}, x2, x3
77
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
78
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
79
+    st1             {v0.16b}, x0, x1
80
+    st1             {v1.16b}, x0, x1
81
+.endr
82
+    ret
83
+endfunc
84
+
85
+function PFX(blockcopy_sp_32x32_neon)
86
+    mov             w12, #4
87
+    lsl             x3, x3, #1
88
+    movrel          x11, xtn_xtn2_table
89
+    ld1             {v31.16b}, x11
90
+.loop_csp32:
91
+    sub             w12, w12, #1
92
+.rept 4
93
+    ld1             {v0.8h-v3.8h}, x2, x3
94
+    ld1             {v4.8h-v7.8h}, x2, x3
95
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
96
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
97
+    tbl             v2.16b, {v4.16b,v5.16b}, v31.16b
98
+    tbl             v3.16b, {v6.16b,v7.16b}, v31.16b
99
+    st1             {v0.16b-v1.16b}, x0, x1
100
+    st1             {v2.16b-v3.16b}, x0, x1
101
+.endr
102
+    cbnz            w12, .loop_csp32
103
+    ret
104
+endfunc
105
+
106
+function PFX(blockcopy_sp_64x64_neon)
107
+    mov             w12, #16
108
+    lsl             x3, x3, #1
109
+    sub             x3, x3, #64
110
+    movrel          x11, xtn_xtn2_table
111
+    ld1             {v31.16b}, x11
112
+.loop_csp64:
113
+    sub             w12, w12, #1
114
+.rept 4
115
+    ld1             {v0.8h-v3.8h}, x2, #64
116
+    ld1             {v4.8h-v7.8h}, x2, x3
117
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
118
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
119
+    tbl             v2.16b, {v4.16b,v5.16b}, v31.16b
120
+    tbl             v3.16b, {v6.16b,v7.16b}, v31.16b
121
+    st1             {v0.16b-v3.16b}, x0, x1
122
+.endr
123
+    cbnz            w12, .loop_csp64
124
+    ret
125
+endfunc
126
+
127
+// void blockcopy_ps(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb)
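The ps (pixel to short) direction widens instead of narrowing; a hedged scalar sketch under the same assumptions (pixel = uint8_t, N fixed per specialization):

    #include <stdint.h>
    typedef uint8_t pixel;  // assumption: 8-bit pixel build

    template<int N>
    static void blockcopy_ps_ref(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb)
    {
        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++)
                a[x] = (int16_t)b[x];  // zero-extend pixels, as uxtl/uxtl2 do below
            a += stridea;
            b += strideb;
        }
    }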
128
+function PFX(blockcopy_ps_4x4_neon)
129
+    lsl             x1, x1, #1
130
+.rept 2
131
+    ld1             {v0.8b}, x2, x3
132
+    ld1             {v1.8b}, x2, x3
133
+    uxtl            v0.8h, v0.8b
134
+    uxtl            v1.8h, v1.8b
135
+    st1             {v0.4h}, x0, x1
136
+    st1             {v1.4h}, x0, x1
137
+.endr
138
+    ret
139
+endfunc
140
+
141
+function PFX(blockcopy_ps_8x8_neon)
142
+    lsl             x1, x1, #1
143
+.rept 4
144
+    ld1             {v0.8b}, x2, x3
145
+    ld1             {v1.8b}, x2, x3
146
+    uxtl            v0.8h, v0.8b
147
+    uxtl            v1.8h, v1.8b
148
+    st1             {v0.8h}, x0, x1
149
+    st1             {v1.8h}, x0, x1
150
+.endr
151
+    ret
152
+endfunc
153
+
154
+function PFX(blockcopy_ps_16x16_neon)
155
+    lsl             x1, x1, #1
156
+.rept 8
157
+    ld1             {v4.16b}, x2, x3
158
+    ld1             {v5.16b}, x2, x3
159
+    uxtl            v0.8h, v4.8b
160
+    uxtl2           v1.8h, v4.16b
161
+    uxtl            v2.8h, v5.8b
162
+    uxtl2           v3.8h, v5.16b
163
+    st1             {v0.8h-v1.8h}, x0, x1
164
+    st1             {v2.8h-v3.8h}, x0, x1
165
+.endr
166
+    ret
167
+endfunc
168
+
169
+function PFX(blockcopy_ps_32x32_neon)
170
+    lsl             x1, x1, #1
171
+    mov             w12, #4
172
+.loop_cps32:
173
+    sub             w12, w12, #1
174
+.rept 4
175
+    ld1             {v16.16b-v17.16b}, x2, x3
176
+    ld1             {v18.16b-v19.16b}, x2, x3
177
+    uxtl            v0.8h, v16.8b
178
+    uxtl2           v1.8h, v16.16b
179
+    uxtl            v2.8h, v17.8b
180
+    uxtl2           v3.8h, v17.16b
181
+    uxtl            v4.8h, v18.8b
182
+    uxtl2           v5.8h, v18.16b
183
+    uxtl            v6.8h, v19.8b
184
+    uxtl2           v7.8h, v19.16b
185
+    st1             {v0.8h-v3.8h}, x0, x1
186
+    st1             {v4.8h-v7.8h}, x0, x1
187
+.endr
188
+    cbnz            w12, .loop_cps32
189
+    ret
190
+endfunc
191
+
192
+function PFX(blockcopy_ps_64x64_neon)
193
+    lsl             x1, x1, #1
194
+    sub             x1, x1, #64
195
+    mov             w12, #16
196
+.loop_cps64:
197
+    sub             w12, w12, #1
198
+.rept 4
199
+    ld1             {v16.16b-v19.16b}, x2, x3
200
+    uxtl            v0.8h, v16.8b
201
+    uxtl2           v1.8h, v16.16b
202
+    uxtl            v2.8h, v17.8b
203
+    uxtl2           v3.8h, v17.16b
204
+    uxtl            v4.8h, v18.8b
205
+    uxtl2           v5.8h, v18.16b
206
+    uxtl            v6.8h, v19.8b
207
+    uxtl2           v7.8h, v19.16b
208
+    st1             {v0.8h-v3.8h}, x0, #64
209
+    st1             {v4.8h-v7.8h}, x0, x1
210
+.endr
211
+    cbnz            w12, .loop_cps64
212
+    ret
213
+endfunc
214
+
215
+// void x265_blockcopy_ss(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
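The short-to-short variant is a plain row-wise copy; sketched here with the same per-size template assumption:

    #include <stdint.h>
    #include <string.h>

    template<int N>
    static void blockcopy_ss_ref(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
    {
        for (int y = 0; y < N; y++)
        {
            memcpy(a, b, N * sizeof(int16_t));  // rows copied verbatim
            a += stridea;
            b += strideb;
        }
    }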
216
+function PFX(blockcopy_ss_4x4_neon)
217
+    lsl             x1, x1, #1
218
+    lsl             x3, x3, #1
219
+.rept 2
220
+    ld1             {v0.8b}, x2, x3
221
+    ld1             {v1.8b}, x2, x3
222
+    st1             {v0.8b}, x0, x1
223
+    st1             {v1.8b}, x0, x1
224
+.endr
225
+    ret
226
+endfunc
227
+
228
+function PFX(blockcopy_ss_8x8_neon)
229
+    lsl             x1, x1, #1
230
+    lsl             x3, x3, #1
231
+.rept 4
232
+    ld1             {v0.8h}, x2, x3
233
+    ld1             {v1.8h}, x2, x3
234
+    st1             {v0.8h}, x0, x1
235
+    st1             {v1.8h}, x0, x1
236
+.endr
237
+    ret
238
+endfunc
239
+
240
+function PFX(blockcopy_ss_16x16_neon)
241
+    lsl             x1, x1, #1
242
+    lsl             x3, x3, #1
243
+.rept 8
244
+    ld1             {v0.8h-v1.8h}, x2, x3
245
+    ld1             {v2.8h-v3.8h}, x2, x3
246
+    st1             {v0.8h-v1.8h}, x0, x1
247
+    st1             {v2.8h-v3.8h}, x0, x1
248
+.endr
249
+    ret
250
+endfunc
251
+
252
+function PFX(blockcopy_ss_32x32_neon)
253
+    lsl             x1, x1, #1
254
+    lsl             x3, x3, #1
255
+    mov             w12, #4
256
+.loop_css32:
257
+    sub             w12, w12, #1
258
+.rept 8
259
+    ld1             {v0.8h-v3.8h}, x2, x3
260
+    st1             {v0.8h-v3.8h}, x0, x1
261
+.endr
262
+    cbnz            w12, .loop_css32
263
+    ret
264
+endfunc
265
+
266
+function PFX(blockcopy_ss_64x64_neon)
267
+    lsl             x1, x1, #1
268
+    sub             x1, x1, #64
269
+    lsl             x3, x3, #1
270
+    sub             x3, x3, #64
271
+    mov             w12, #8
272
+.loop_css64:
273
+    sub             w12, w12, #1
274
+.rept 8
275
+    ld1             {v0.8h-v3.8h}, x2, #64
276
+    ld1             {v4.8h-v7.8h}, x2, x3
277
+    st1             {v0.8h-v3.8h}, x0, #64
278
+    st1             {v4.8h-v7.8h}, x0, x1
279
+.endr
280
+    cbnz            w12, .loop_css64
281
+    ret
282
+endfunc
283
+
284
+/******** Chroma blockcopy********/
285
+function PFX(blockcopy_ss_4x8_neon)
286
+    lsl             x1, x1, #1
287
+    lsl             x3, x3, #1
288
+.rept 4
289
+    ld1             {v0.8b}, x2, x3
290
+    ld1             {v1.8b}, x2, x3
291
+    st1             {v0.8b}, x0, x1
292
+    st1             {v1.8b}, x0, x1
293
+.endr
294
+    ret
295
+endfunc
296
+
297
+function PFX(blockcopy_ss_8x16_neon)
298
+    lsl             x1, x1, #1
299
+    lsl             x3, x3, #1
300
+.rept 8
301
+    ld1             {v0.8h}, x2, x3
302
+    ld1             {v1.8h}, x2, x3
303
+    st1             {v0.8h}, x0, x1
304
+    st1             {v1.8h}, x0, x1
305
+.endr
306
+    ret
307
+endfunc
308
+
309
+function PFX(blockcopy_ss_16x32_neon)
310
+    lsl             x1, x1, #1
311
+    lsl             x3, x3, #1
312
+.rept 16
313
+    ld1             {v0.8h-v1.8h}, x2, x3
314
+    ld1             {v2.8h-v3.8h}, x2, x3
315
+    st1             {v0.8h-v1.8h}, x0, x1
316
+    st1             {v2.8h-v3.8h}, x0, x1
317
+.endr
318
+    ret
319
+endfunc
320
+
321
+function PFX(blockcopy_ss_32x64_neon)
322
+    lsl             x1, x1, #1
323
+    lsl             x3, x3, #1
324
+    mov             w12, #8
325
+.loop_css32x64:
326
+    sub             w12, w12, #1
327
+.rept 8
328
+    ld1             {v0.8h-v3.8h}, x2, x3
329
+    st1             {v0.8h-v3.8h}, x0, x1
330
+.endr
331
+    cbnz            w12, .loop_css32x64
332
+    ret
333
+endfunc
334
+
335
+// chroma blockcopy_ps
336
+function PFX(blockcopy_ps_4x8_neon)
337
+    lsl             x1, x1, #1
338
+.rept 4
339
+    ld1             {v0.8b}, x2, x3
340
+    ld1             {v1.8b}, x2, x3
341
+    uxtl            v0.8h, v0.8b
342
+    uxtl            v1.8h, v1.8b
343
+    st1             {v0.4h}, x0, x1
344
+    st1             {v1.4h}, x0, x1
345
+.endr
346
+    ret
347
+endfunc
348
+
349
+function PFX(blockcopy_ps_8x16_neon)
350
+    lsl             x1, x1, #1
351
+.rept 8
352
+    ld1             {v0.8b}, x2, x3
353
+    ld1             {v1.8b}, x2, x3
354
+    uxtl            v0.8h, v0.8b
355
+    uxtl            v1.8h, v1.8b
356
+    st1             {v0.8h}, x0, x1
357
+    st1             {v1.8h}, x0, x1
358
+.endr
359
+    ret
360
+endfunc
361
+
362
+function PFX(blockcopy_ps_16x32_neon)
363
+    lsl             x1, x1, #1
364
+.rept 16
365
+    ld1             {v4.16b}, x2, x3
366
+    ld1             {v5.16b}, x2, x3
367
+    uxtl            v0.8h, v4.8b
368
+    uxtl2           v1.8h, v4.16b
369
+    uxtl            v2.8h, v5.8b
370
+    uxtl2           v3.8h, v5.16b
371
+    st1             {v0.8h-v1.8h}, x0, x1
372
+    st1             {v2.8h-v3.8h}, x0, x1
373
+.endr
374
+    ret
375
+endfunc
376
+
377
+function PFX(blockcopy_ps_32x64_neon)
378
+    lsl             x1, x1, #1
379
+    mov             w12, #8
380
+.loop_cps32x64:
381
+    sub             w12, w12, #1
382
+.rept 4
383
+    ld1             {v16.16b-v17.16b}, x2, x3
384
+    ld1             {v18.16b-v19.16b}, x2, x3
385
+    uxtl            v0.8h, v16.8b
386
+    uxtl2           v1.8h, v16.16b
387
+    uxtl            v2.8h, v17.8b
388
+    uxtl2           v3.8h, v17.16b
389
+    uxtl            v4.8h, v18.8b
390
+    uxtl2           v5.8h, v18.16b
391
+    uxtl            v6.8h, v19.8b
392
+    uxtl2           v7.8h, v19.16b
393
+    st1             {v0.8h-v3.8h}, x0, x1
394
+    st1             {v4.8h-v7.8h}, x0, x1
395
+.endr
396
+    cbnz            w12, .loop_cps32x64
397
+    ret
398
+endfunc
399
+
400
+// chroma blockcopy_sp
401
+function PFX(blockcopy_sp_4x8_neon)
402
+    lsl             x3, x3, #1
403
+.rept 4
404
+    ld1             {v0.8h}, x2, x3
405
+    ld1             {v1.8h}, x2, x3
406
+    xtn             v0.8b, v0.8h
407
+    xtn             v1.8b, v1.8h
408
+    st1             {v0.s}0, x0, x1
409
+    st1             {v1.s}0, x0, x1
410
+.endr
411
+    ret
412
+endfunc
413
+
414
+function PFX(blockcopy_sp_8x16_neon)
415
+    lsl             x3, x3, #1
416
+.rept 8
417
+    ld1             {v0.8h}, x2, x3
418
+    ld1             {v1.8h}, x2, x3
419
+    xtn             v0.8b, v0.8h
420
+    xtn             v1.8b, v1.8h
421
+    st1             {v0.d}0, x0, x1
422
+    st1             {v1.d}0, x0, x1
423
+.endr
424
+    ret
425
+endfunc
426
+
427
+function PFX(blockcopy_sp_16x32_neon)
428
+    lsl             x3, x3, #1
429
+    movrel          x11, xtn_xtn2_table
430
+    ld1             {v31.16b}, x11
431
+.rept 16
432
+    ld1             {v0.8h-v1.8h}, x2, x3
433
+    ld1             {v2.8h-v3.8h}, x2, x3
434
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
435
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
436
+    st1             {v0.16b}, x0, x1
437
+    st1             {v1.16b}, x0, x1
438
+.endr
439
+    ret
440
+endfunc
441
+
442
+function PFX(blockcopy_sp_32x64_neon)
443
+    mov             w12, #8
444
+    lsl             x3, x3, #1
445
+    movrel          x11, xtn_xtn2_table
446
+    ld1             {v31.16b}, x11
447
+.loop_csp32x64:
448
+    sub             w12, w12, #1
449
+.rept 4
450
+    ld1             {v0.8h-v3.8h}, x2, x3
451
+    ld1             {v4.8h-v7.8h}, x2, x3
452
+    tbl             v0.16b, {v0.16b,v1.16b}, v31.16b
453
+    tbl             v1.16b, {v2.16b,v3.16b}, v31.16b
454
+    tbl             v2.16b, {v4.16b,v5.16b}, v31.16b
455
+    tbl             v3.16b, {v6.16b,v7.16b}, v31.16b
456
+    st1             {v0.16b-v1.16b}, x0, x1
457
+    st1             {v2.16b-v3.16b}, x0, x1
458
+.endr
459
+    cbnz            w12, .loop_csp32x64
460
+    ret
461
+endfunc
462
+
463
+/* blockcopy_pp(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) */
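The pixel-to-pixel copies below also cover non-square shapes (2x4, 12x16, 24x32, ...), so a sketch takes both dimensions as parameters; illustration only:

    #include <stdint.h>
    #include <string.h>
    typedef uint8_t pixel;  // assumption: 8-bit pixel build

    template<int W, int H>
    static void blockcopy_pp_ref(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
    {
        for (int y = 0; y < H; y++)
        {
            memcpy(dst, src, W * sizeof(pixel));  // copy one row of W pixels
            dst += dstStride;
            src += srcStride;
        }
    }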
464
+
465
+function PFX(blockcopy_pp_2x4_neon)
466
+    ldrh            w9, x2
467
+    add             x4, x1, x1
468
+    add             x14, x3, x3
469
+    strh            w9, x0
470
+    ldrh            w10, x2, x3
471
+    add             x5, x4, x1
472
+    add             x15, x14, x3
473
+    strh            w10, x0, x1
474
+    ldrh            w11, x2, x14
475
+    strh            w11, x0, x4
476
+    ldrh            w12, x2, x15
477
+    strh            w12, x0, x5
478
+    ret
479
+endfunc
480
+
481
+.macro blockcopy_pp_2xN_neon h
482
+function PFX(blockcopy_pp_2x\h\()_neon)
483
+    add             x4, x1, x1
484
+    add             x5, x4, x1
485
+    add             x6, x5, x1
486
+
487
+    add             x14, x3, x3
488
+    add             x15, x14, x3
489
+    add             x16, x15, x3
490
+
491
+.rept \h / 4
492
+    ldrh            w9, x2
493
+    strh            w9, x0
494
+    ldrh            w10, x2, x3
495
+    strh            w10, x0, x1
496
+    ldrh            w11, x2, x14
497
+    strh            w11, x0, x4
498
+    ldrh            w12, x2, x15
499
+    strh            w12, x0, x5
500
+    add             x2, x2, x16
501
+    add             x0, x0, x6
502
+.endr
503
+    ret
504
+endfunc
505
+.endm
506
+
507
+blockcopy_pp_2xN_neon 8
508
+blockcopy_pp_2xN_neon 16
509
+
510
+function PFX(blockcopy_pp_4x2_neon)
511
+    ldr             w9, x2
512
+    str             w9, x0
513
+    ldr             w10, x2, x3
514
+    str             w10, x0, x1
515
+    ret
516
+endfunc
517
+
518
+function PFX(blockcopy_pp_4x4_neon)
519
+    ldr             w9, x2
520
+    add             x4, x1, x1
521
+    add             x14, x3, x3
522
+    str             w9, x0
523
+    ldr             w10, x2, x3
524
+    add             x5, x4, x1
525
+    add             x15, x14, x3
526
+    str             w10, x0, x1
527
+    ldr             w11, x2, x14
528
+    str             w11, x0, x4
529
+    ldr             w12, x2, x15
530
+    str             w12, x0, x5
531
+    ret
532
+endfunc
533
+
534
+.macro blockcopy_pp_4xN_neon h
535
+function PFX(blockcopy_pp_4x\h\()_neon)
536
+    add             x4, x1, x1
537
+    add             x5, x4, x1
538
+    add             x6, x5, x1
539
+
540
+    add             x14, x3, x3
541
+    add             x15, x14, x3
542
+    add             x16, x15, x3
543
+
544
+.rept \h / 4
545
+    ldr             w9, x2
546
+    str             w9, x0
547
+    ldr             w10, x2, x3
548
+    str             w10, x0, x1
549
+    ldr             w11, x2, x14
550
+    str             w11, x0, x4
551
+    ldr             w12, x2, x15
552
+    str             w12, x0, x5
553
+    add             x2, x2, x16
554
+    add             x0, x0, x6
555
+.endr
556
+    ret
557
+endfunc
558
+.endm
559
+
560
+blockcopy_pp_4xN_neon 8
561
+blockcopy_pp_4xN_neon 16
562
+blockcopy_pp_4xN_neon 32
563
+
564
+.macro blockcopy_pp_6xN_neon h
565
+function PFX(blockcopy_pp_6x\h\()_neon)
566
+    sub             x1, x1, #4
567
+.rept \h
568
+    ld1             {v0.8b}, x2, x3
569
+    st1             {v0.s}0, x0, #4
570
+    st1             {v0.h}2, x0, x1
571
+.endr
572
+    ret
573
+endfunc
574
+.endm
575
+
576
+blockcopy_pp_6xN_neon 8
577
+blockcopy_pp_6xN_neon 16
578
+
579
+.macro blockcopy_pp_8xN_neon h
580
+function PFX(blockcopy_pp_8x\h\()_neon)
581
+.rept \h
582
+    ld1             {v0.4h}, x2, x3
583
+    st1             {v0.4h}, x0, x1
584
+.endr
585
+    ret
586
+endfunc
587
+.endm
588
+
589
+blockcopy_pp_8xN_neon 2
590
+blockcopy_pp_8xN_neon 4
591
+blockcopy_pp_8xN_neon 6
592
+blockcopy_pp_8xN_neon 8
593
+blockcopy_pp_8xN_neon 12
594
+blockcopy_pp_8xN_neon 16
595
+blockcopy_pp_8xN_neon 32
596
+
597
+function PFX(blockcopy_pp_8x64_neon)
598
+    mov             w12, #4
599
+.loop_pp_8x64:
600
+    sub             w12, w12, #1
601
+.rept 16
602
+    ld1             {v0.4h}, x2, x3
603
+    st1             {v0.4h}, x0, x1
604
+.endr
605
+    cbnz            w12, .loop_pp_8x64
606
+    ret
607
+endfunc
608
+
609
+.macro blockcopy_pp_16xN_neon h
610
+function PFX(blockcopy_pp_16x\h\()_neon)
611
+.rept \h
612
+    ld1             {v0.8h}, x2, x3
613
+    st1             {v0.8h}, x0, x1
614
+.endr
615
+    ret
616
+endfunc
617
+.endm
618
+
619
+blockcopy_pp_16xN_neon 4
620
+blockcopy_pp_16xN_neon 8
621
+blockcopy_pp_16xN_neon 12
622
+blockcopy_pp_16xN_neon 16
623
+
624
+.macro blockcopy_pp_16xN1_neon h
625
+function PFX(blockcopy_pp_16x\h\()_neon)
626
+    mov             w12, #\h / 8
627
+.loop_16x\h\():
628
+.rept 8
629
+    ld1             {v0.8h}, x2, x3
630
+    st1             {v0.8h}, x0, x1
631
+.endr
632
+    sub             w12, w12, #1
633
+    cbnz            w12, .loop_16x\h
634
+    ret
635
+endfunc
636
+.endm
637
+
638
+blockcopy_pp_16xN1_neon 24
639
+blockcopy_pp_16xN1_neon 32
640
+blockcopy_pp_16xN1_neon 64
641
+
642
+function PFX(blockcopy_pp_12x16_neon)
643
+    sub             x1, x1, #8
644
+.rept 16
645
+    ld1             {v0.16b}, x2, x3
646
+    str             d0, x0, #8
647
+    st1             {v0.s}2, x0, x1
648
+.endr
649
+    ret
650
+endfunc
651
+
652
+function PFX(blockcopy_pp_12x32_neon)
653
+    sub             x1, x1, #8
654
+    mov             w12, #4
655
+.loop_pp_12x32:
656
+    sub             w12, w12, #1
657
+.rept 8
658
+    ld1             {v0.16b}, x2, x3
659
+    str             d0, x0, #8
660
+    st1             {v0.s}2, x0, x1
661
+.endr
662
+    cbnz            w12, .loop_pp_12x32
663
+    ret
664
+endfunc
665
+
666
+function PFX(blockcopy_pp_24x32_neon)
667
+    mov             w12, #4
668
+.loop_24x32:
669
+    sub             w12, w12, #1
670
+.rept 8
671
+    ld1             {v0.8b-v2.8b}, x2, x3
672
+    st1             {v0.8b-v2.8b}, x0, x1
673
+.endr
674
+    cbnz            w12, .loop_24x32
675
+    ret
676
+endfunc
677
+
678
+function PFX(blockcopy_pp_24x64_neon)
679
+    mov             w12, #4
680
+.loop_24x64:
681
+    sub             w12, w12, #1
682
+.rept 16
683
+    ld1             {v0.8b-v2.8b}, x2, x3
684
+    st1             {v0.8b-v2.8b}, x0, x1
685
+.endr
686
+    cbnz            w12, .loop_24x64
687
+    ret
688
+endfunc
689
+
690
+function PFX(blockcopy_pp_32x8_neon)
691
+.rept 8
692
+    ld1             {v0.16b-v1.16b}, x2, x3
693
+    st1             {v0.16b-v1.16b}, x0, x1
694
+.endr
695
+    ret
696
+endfunc
697
+
698
+.macro blockcopy_pp_32xN_neon h
699
+function PFX(blockcopy_pp_32x\h\()_neon)
700
+    mov             w12, #\h / 8
701
+.loop_32x\h\():
702
+    sub             w12, w12, #1
703
+.rept 8
704
+    ld1             {v0.16b-v1.16b}, x2, x3
705
+    st1             {v0.16b-v1.16b}, x0, x1
706
+.endr
707
+    cbnz            w12, .loop_32x\h
708
+    ret
709
+endfunc
710
+.endm
711
+
712
+blockcopy_pp_32xN_neon 16
713
+blockcopy_pp_32xN_neon 24
714
+blockcopy_pp_32xN_neon 32
715
+blockcopy_pp_32xN_neon 64
716
+blockcopy_pp_32xN_neon 48
717
+
718
+function PFX(blockcopy_pp_48x64_neon)
719
+    mov             w12, #8
720
+.loop_48x64:
721
+    sub             w12, w12, #1
722
+.rept 8
723
+    ld1             {v0.16b-v2.16b}, x2, x3
724
+    st1             {v0.16b-v2.16b}, x0, x1
725
+.endr
726
+    cbnz            w12, .loop_48x64
727
+    ret
728
+endfunc
729
+
730
+.macro blockcopy_pp_64xN_neon h
731
+function PFX(blockcopy_pp_64x\h\()_neon)
732
+    mov             w12, #\h / 4
733
+.loop_64x\h\():
734
+    sub             w12, w12, #1
735
+.rept 4
736
+    ld1             {v0.16b-v3.16b}, x2, x3
737
+    st1             {v0.16b-v3.16b}, x0, x1
738
+.endr
739
+    cbnz            w12, .loop_64x\h
740
+    ret
741
+endfunc
742
+.endm
743
+
744
+blockcopy_pp_64xN_neon 16
745
+blockcopy_pp_64xN_neon 32
746
+blockcopy_pp_64xN_neon 48
747
+blockcopy_pp_64xN_neon 64
748
+
749
+// void x265_blockfill_s_neon(int16_t* dst, intptr_t dstride, int16_t val)
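A scalar sketch of blockfill_s, assuming N x N blocks of int16_t and a stride counted in elements; the vector code broadcasts val with dup and stores whole rows:

    #include <stdint.h>

    template<int N>
    static void blockfill_s_ref(int16_t* dst, intptr_t dstride, int16_t val)
    {
        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++)
                dst[x] = val;   // fill the row with the broadcast value
            dst += dstride;
        }
    }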
750
+function PFX(blockfill_s_4x4_neon)
751
+    dup             v0.4h, w2
752
+    lsl             x1, x1, #1
753
+.rept 4
754
+    st1             {v0.4h}, x0, x1
755
+.endr
756
+    ret
757
+endfunc
758
+
759
+function PFX(blockfill_s_8x8_neon)
760
+    dup             v0.8h, w2
761
+    lsl             x1, x1, #1
762
+.rept 8
763
+    st1             {v0.8h}, x0, x1
764
+.endr
765
+    ret
766
+endfunc
767
+
768
+function PFX(blockfill_s_16x16_neon)
769
+    dup             v0.8h, w2
770
+    mov             v1.16b, v0.16b
771
+    lsl             x1, x1, #1
772
+.rept 16
773
+    stp             q0, q1, x0
774
+    add             x0, x0, x1
775
+.endr
776
+    ret
777
+endfunc
778
+
779
+function PFX(blockfill_s_32x32_neon)
780
+    dup             v0.8h, w2
781
+    mov             v1.16b, v0.16b
782
+    mov             v2.16b, v0.16b
783
+    mov             v3.16b, v0.16b
784
+    lsl             x1, x1, #1
785
+.rept 32
786
+    st1             {v0.8h-v3.8h}, x0, x1
787
+.endr
788
+    ret
789
+endfunc
790
+
791
+function PFX(blockfill_s_64x64_neon)
792
+    dup             v0.8h, w2
793
+    mov             v1.16b, v0.16b
794
+    mov             v2.16b, v0.16b
795
+    mov             v3.16b, v0.16b
796
+    lsl             x1, x1, #1
797
+    sub             x1, x1, #64
798
+.rept 64
799
+    st1             {v0.8h-v3.8h}, x0, #64
800
+    st1             {v0.8h-v3.8h}, x0, x1
801
+.endr
802
+    ret
803
+endfunc
804
+
805
+// uint32_t copy_count(int16_t* coeff, const int16_t* residual, intptr_t resiStride)
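copy_count copies an N x N residual block into the packed coefficient buffer and returns how many entries are non-zero; the NEON versions reach the same count by comparing lanes against zero. A scalar sketch matching the tail loop of copy_count_neon in dct-prim.cpp further down:

    #include <stdint.h>

    template<int N>
    static uint32_t copy_count_ref(int16_t* coeff, const int16_t* residual, intptr_t resiStride)
    {
        uint32_t numSig = 0;   // number of non-zero coefficients copied
        for (int k = 0; k < N; k++)
        {
            for (int j = 0; j < N; j++)
            {
                coeff[j] = residual[j];
                numSig += (residual[j] != 0);
            }
            residual += resiStride;
            coeff += N;
        }
        return numSig;
    }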
806
+function PFX(copy_cnt_4_neon)
807
+    lsl             x2, x2, #1
808
+    movi            v4.8b, #0
809
+.rept 2
810
+    ld1             {v0.8b}, x1, x2
811
+    ld1             {v1.8b}, x1, x2
812
+    stp             d0, d1, x0, #16
813
+    cmeq            v0.4h, v0.4h, #0
814
+    cmeq            v1.4h, v1.4h, #0
815
+    add             v4.4h, v4.4h, v0.4h
816
+    add             v4.4h, v4.4h, v1.4h
817
+.endr
818
+    saddlv          s4, v4.4h
819
+    fmov            w12, s4
820
+    add             w0, w12, #16
821
+    ret
822
+endfunc
823
+
824
+function PFX(copy_cnt_8_neon)
825
+    lsl             x2, x2, #1
826
+    movi            v4.8b, #0
827
+.rept 4
828
+    ld1             {v0.16b}, x1, x2
829
+    ld1             {v1.16b}, x1, x2
830
+    stp             q0, q1, x0, #32
831
+    cmeq            v0.8h, v0.8h, #0
832
+    cmeq            v1.8h, v1.8h, #0
833
+    add             v4.8h, v4.8h, v0.8h
834
+    add             v4.8h, v4.8h, v1.8h
835
+.endr
836
+    saddlv          s4, v4.8h
837
+    fmov            w12, s4
838
+    add             w0, w12, #64
839
+    ret
840
+endfunc
841
+
842
+function PFX(copy_cnt_16_neon)
843
+    lsl             x2, x2, #1
844
+    movi            v4.8b, #0
845
+.rept 16
846
+    ld1             {v0.16b-v1.16b}, x1, x2
847
+    st1             {v0.16b-v1.16b}, x0, #32
848
+    cmeq            v0.8h, v0.8h, #0
849
+    cmeq            v1.8h, v1.8h, #0
850
+    add             v4.8h, v4.8h, v0.8h
851
+    add             v4.8h, v4.8h, v1.8h
852
+.endr
853
+    saddlv          s4, v4.8h
854
+    fmov            w12, s4
855
+    add             w0, w12, #256
856
+    ret
857
+endfunc
858
+
859
+function PFX(copy_cnt_32_neon)
860
+    lsl             x2, x2, #1
861
+    movi            v4.8b, #0
862
+.rept 32
863
+    ld1             {v0.16b-v3.16b}, x1, x2
864
+    st1             {v0.16b-v3.16b}, x0, #64
865
+    cmeq            v0.8h, v0.8h, #0
866
+    cmeq            v1.8h, v1.8h, #0
867
+    cmeq            v2.8h, v2.8h, #0
868
+    cmeq            v3.8h, v3.8h, #0
869
+    add             v0.8h, v0.8h, v1.8h
870
+    add             v2.8h, v2.8h, v3.8h
871
+    add             v4.8h, v4.8h, v0.8h
872
+    add             v4.8h, v4.8h, v2.8h
873
+.endr
874
+    saddlv          s4, v4.8h
875
+    fmov            w12, s4
876
+    add             w0, w12, #1024
877
+    ret
878
+endfunc
879
+
880
+// int  count_nonzero_c(const int16_t* quantCoeff)
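count_nonzero counts the non-zero quantized coefficients of a trSize x trSize block; the vector versions arrive at the same count via lane-wise compares. A scalar sketch of the semantics:

    #include <stdint.h>

    template<int trSize>
    static int count_nonzero_ref(const int16_t* quantCoeff)
    {
        int count = 0;
        for (int i = 0; i < trSize * trSize; i++)
            count += (quantCoeff[i] != 0);
        return count;
    }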
881
+function PFX(count_nonzero_4_neon)
882
+    movi            v16.16b, #1
883
+    movi            v17.16b, #0
884
+    trn1            v16.16b, v16.16b, v17.16b
885
+    ldp             q0, q1, x0
886
+    cmhi            v0.8h, v0.8h, v17.8h
887
+    cmhi            v1.8h, v1.8h, v17.8h
888
+    and             v0.16b, v0.16b, v16.16b
889
+    and             v1.16b, v1.16b, v16.16b
890
+    add             v0.8h, v0.8h, v1.8h
891
+    uaddlv          s0, v0.8h
892
+    fmov            w0, s0
893
+    ret
894
+endfunc
895
+
896
+.macro COUNT_NONZERO_8
897
+    ld1             {v0.16b-v3.16b}, x0, #64
898
+    ld1             {v4.16b-v7.16b}, x0, #64
899
+    cmhi            v0.8h, v0.8h, v17.8h
900
+    cmhi            v1.8h, v1.8h, v17.8h
901
+    cmhi            v2.8h, v2.8h, v17.8h
902
+    cmhi            v3.8h, v3.8h, v17.8h
903
+    cmhi            v4.8h, v4.8h, v17.8h
904
+    cmhi            v5.8h, v5.8h, v17.8h
905
+    cmhi            v6.8h, v6.8h, v17.8h
906
+    cmhi            v7.8h, v7.8h, v17.8h
907
+    and             v0.16b, v0.16b, v16.16b
908
+    and             v1.16b, v1.16b, v16.16b
909
+    and             v2.16b, v2.16b, v16.16b
910
+    and             v3.16b, v3.16b, v16.16b
911
+    and             v4.16b, v4.16b, v16.16b
912
+    and             v5.16b, v5.16b, v16.16b
913
+    and             v6.16b, v6.16b, v16.16b
914
+    and             v7.16b, v7.16b, v16.16b
915
+    add             v0.8h, v0.8h, v1.8h
916
+    add             v2.8h, v2.8h, v3.8h
917
+    add             v4.8h, v4.8h, v5.8h
918
+    add             v6.8h, v6.8h, v7.8h
919
+    add             v0.8h, v0.8h, v2.8h
920
+    add             v4.8h, v4.8h, v6.8h
921
+    add             v0.8h, v0.8h, v4.8h
922
+.endm
923
+
924
+function PFX(count_nonzero_8_neon)
925
+    movi            v16.16b, #1
926
+    movi            v17.16b, #0
927
+    trn1            v16.16b, v16.16b, v17.16b
928
+    COUNT_NONZERO_8
929
+    uaddlv          s0, v0.8h
930
+    fmov            w0, s0
931
+    ret
932
+endfunc
933
+
934
+function PFX(count_nonzero_16_neon)
935
+    movi            v16.16b, #1
936
+    movi            v17.16b, #0
937
+    trn1            v16.16b, v16.16b, v17.16b
938
+    movi            v18.16b, #0
939
+.rept 4
940
+    COUNT_NONZERO_8
941
+    add             v18.16b, v18.16b, v0.16b
942
+.endr
943
+    uaddlv          s0, v18.8h
944
+    fmov            w0, s0
945
+    ret
946
+endfunc
947
+
948
+function PFX(count_nonzero_32_neon)
949
+    movi            v16.16b, #1
950
+    movi            v17.16b, #0
951
+    trn1            v16.16b, v16.16b, v17.16b
952
+    movi            v18.16b, #0
953
+    mov             w12, #16
954
+.loop_count_nonzero_32:
955
+    sub             w12, w12, #1
956
+    COUNT_NONZERO_8
957
+    add             v18.16b, v18.16b, v0.16b
958
+    cbnz            w12, .loop_count_nonzero_32
959
+
960
+    uaddlv          s0, v18.8h
961
+    fmov            w0, s0
962
+    ret
963
+endfunc
964
+
965
+// void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
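A scalar sketch of cpy2Dto1D_shl: gather an N x N strided block into a packed buffer while shifting left (srcStride in int16_t units, N fixed per specialization); illustration only:

    #include <stdint.h>

    template<int N>
    static void cpy2Dto1D_shl_ref(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
    {
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
                dst[j] = (int16_t)(src[j] << shift);
            src += srcStride;  // walk the strided source
            dst += N;          // pack destination rows back to back
        }
    }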
966
+.macro cpy2Dto1D_shl_start
967
+    add             x2, x2, x2
968
+    dup             v0.8h, w3
969
+.endm
970
+
971
+function PFX(cpy2Dto1D_shl_4x4_neon)
972
+    cpy2Dto1D_shl_start
973
+    ld1             {v2.d}0, x1, x2
974
+    ld1             {v2.d}1, x1, x2
975
+    ld1             {v3.d}0, x1, x2
976
+    ld1             {v3.d}1, x1, x2
977
+    sshl            v2.8h, v2.8h, v0.8h
978
+    sshl            v3.8h, v3.8h, v0.8h
979
+    st1             {v2.16b-v3.16b}, x0
980
+    ret
981
+endfunc
982
+
983
+function PFX(cpy2Dto1D_shl_8x8_neon)
984
+    cpy2Dto1D_shl_start
985
+.rept 4
986
+    ld1             {v2.16b}, x1, x2
987
+    ld1             {v3.16b}, x1, x2
988
+    sshl            v2.8h, v2.8h, v0.8h
989
+    sshl            v3.8h, v3.8h, v0.8h
990
+    st1             {v2.16b-v3.16b}, x0, #32
991
+.endr
992
+    ret
993
+endfunc
994
+
995
+function PFX(cpy2Dto1D_shl_16x16_neon)
996
+    cpy2Dto1D_shl_start
997
+    mov             w12, #4
998
+.loop_cpy2Dto1D_shl_16:
999
+    sub             w12, w12, #1
1000
+.rept 4
1001
+    ld1             {v2.16b-v3.16b}, x1, x2
1002
+    sshl            v2.8h, v2.8h, v0.8h
1003
+    sshl            v3.8h, v3.8h, v0.8h
1004
+    st1             {v2.16b-v3.16b}, x0, #32
1005
+.endr
1006
+    cbnz            w12, .loop_cpy2Dto1D_shl_16
1007
+    ret
1008
+endfunc
1009
+
1010
+function PFX(cpy2Dto1D_shl_32x32_neon)
1011
+    cpy2Dto1D_shl_start
1012
+    mov             w12, #16
1013
+.loop_cpy2Dto1D_shl_32:
1014
+    sub             w12, w12, #1
1015
+.rept 2
1016
+    ld1             {v2.16b-v5.16b}, x1, x2
1017
+    sshl            v2.8h, v2.8h, v0.8h
1018
+    sshl            v3.8h, v3.8h, v0.8h
1019
+    sshl            v4.8h, v4.8h, v0.8h
1020
+    sshl            v5.8h, v5.8h, v0.8h
1021
+    st1             {v2.16b-v5.16b}, x0, #64
1022
+.endr
1023
+    cbnz            w12, .loop_cpy2Dto1D_shl_32
1024
+    ret
1025
+endfunc
1026
+
1027
+function PFX(cpy2Dto1D_shl_64x64_neon)
1028
+    cpy2Dto1D_shl_start
1029
+    mov             w12, #32
1030
+    sub             x2, x2, #64
1031
+.loop_cpy2Dto1D_shl_64:
1032
+    sub             w12, w12, #1
1033
+.rept 2
1034
+    ld1             {v2.16b-v5.16b}, x1, #64
1035
+    ld1             {v16.16b-v19.16b}, x1, x2
1036
+    sshl            v2.8h, v2.8h, v0.8h
1037
+    sshl            v3.8h, v3.8h, v0.8h
1038
+    sshl            v4.8h, v4.8h, v0.8h
1039
+    sshl            v5.8h, v5.8h, v0.8h
1040
+    sshl            v16.8h, v16.8h, v0.8h
1041
+    sshl            v17.8h, v17.8h, v0.8h
1042
+    sshl            v18.8h, v18.8h, v0.8h
1043
+    sshl            v19.8h, v19.8h, v0.8h
1044
+    st1             {v2.16b-v5.16b}, x0, #64
1045
+    st1             {v16.16b-v19.16b}, x0, #64
1046
+.endr
1047
+    cbnz            w12, .loop_cpy2Dto1D_shl_64
1048
+    ret
1049
+endfunc
1050
+
1051
+// void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
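The _shr variant is the same gather but with a rounding right shift instead of a left shift; a sketch assuming shift > 0:

    #include <stdint.h>

    template<int N>
    static void cpy2Dto1D_shr_ref(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
    {
        const int16_t round = (int16_t)(1 << (shift - 1));
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
                dst[j] = (int16_t)((src[j] + round) >> shift);
            src += srcStride;
            dst += N;
        }
    }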
1052
+function PFX(cpy2Dto1D_shr_4x4_neon)
1053
+    cpy2Dto1D_shr_start
1054
+    ld1             {v2.d}0, x1, x2
1055
+    ld1             {v2.d}1, x1, x2
1056
+    ld1             {v3.d}0, x1, x2
1057
+    ld1             {v3.d}1, x1, x2
1058
+    sub             v2.8h, v2.8h, v1.8h
1059
+    sub             v3.8h, v3.8h, v1.8h
1060
+    sshl            v2.8h, v2.8h, v0.8h
1061
+    sshl            v3.8h, v3.8h, v0.8h
1062
+    stp             q2, q3, x0
1063
+    ret
1064
+endfunc
1065
+
1066
+function PFX(cpy2Dto1D_shr_8x8_neon)
1067
+    cpy2Dto1D_shr_start
1068
+.rept 4
1069
+    ld1             {v2.16b}, x1, x2
1070
+    ld1             {v3.16b}, x1, x2
1071
+    sub             v2.8h, v2.8h, v1.8h
1072
+    sub             v3.8h, v3.8h, v1.8h
1073
+    sshl            v2.8h, v2.8h, v0.8h
1074
+    sshl            v3.8h, v3.8h, v0.8h
1075
+    stp             q2, q3, x0, #32
1076
+.endr
1077
+    ret
1078
+endfunc
1079
+
1080
+function PFX(cpy2Dto1D_shr_16x16_neon)
1081
+    cpy2Dto1D_shr_start
1082
+    mov             w12, #4
1083
+.loop_cpy2Dto1D_shr_16:
1084
+    sub             w12, w12, #1
1085
+.rept 4
1086
+    ld1             {v2.8h-v3.8h}, x1, x2
1087
+    sub             v2.8h, v2.8h, v1.8h
1088
+    sub             v3.8h, v3.8h, v1.8h
1089
+    sshl            v2.8h, v2.8h, v0.8h
1090
+    sshl            v3.8h, v3.8h, v0.8h
1091
+    st1             {v2.8h-v3.8h}, x0, #32
1092
+.endr
1093
+    cbnz            w12, .loop_cpy2Dto1D_shr_16
1094
+    ret
1095
+endfunc
1096
+
1097
+function PFX(cpy2Dto1D_shr_32x32_neon)
1098
+    cpy2Dto1D_shr_start
1099
+    mov             w12, #16
1100
+.loop_cpy2Dto1D_shr_32:
1101
+    sub             w12, w12, #1
1102
+.rept 2
1103
+    ld1             {v2.8h-v5.8h}, x1, x2
1104
+    sub             v2.8h, v2.8h, v1.8h
1105
+    sub             v3.8h, v3.8h, v1.8h
1106
+    sub             v4.8h, v4.8h, v1.8h
1107
+    sub             v5.8h, v5.8h, v1.8h
1108
+    sshl            v2.8h, v2.8h, v0.8h
1109
+    sshl            v3.8h, v3.8h, v0.8h
1110
+    sshl            v4.8h, v4.8h, v0.8h
1111
+    sshl            v5.8h, v5.8h, v0.8h
1112
+    st1             {v2.8h-v5.8h}, x0, #64
1113
+.endr
1114
+    cbnz            w12, .loop_cpy2Dto1D_shr_32
1115
+    ret
1116
+endfunc
1117
+
1118
+// void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
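cpy1Dto2D_shl is the opposite layout of cpy2Dto1D_shl: scatter a packed buffer into a strided block while shifting left; sketch only, same assumptions as the earlier sketches:

    #include <stdint.h>

    template<int N>
    static void cpy1Dto2D_shl_ref(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
    {
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
                dst[j] = (int16_t)(src[j] << shift);
            src += N;          // packed source rows
            dst += dstStride;  // strided destination rows
        }
    }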
1119
+.macro cpy1Dto2D_shl_start
1120
+    add             x2, x2, x2
1121
+    dup             v0.8h, w3
1122
+.endm
1123
+
1124
+function PFX(cpy1Dto2D_shl_4x4_neon)
1125
+    cpy1Dto2D_shl_start
1126
+    ld1             {v2.16b-v3.16b}, x1
1127
+    sshl            v2.8h, v2.8h, v0.8h
1128
+    sshl            v3.8h, v3.8h, v0.8h
1129
+    st1             {v2.d}0, x0, x2
1130
+    st1             {v2.d}1, x0, x2
1131
+    st1             {v3.d}0, x0, x2
1132
+    st1             {v3.d}1, x0, x2
1133
+    ret
1134
+endfunc
1135
+
1136
+function PFX(cpy1Dto2D_shl_8x8_neon)
1137
+    cpy1Dto2D_shl_start
1138
+.rept 4
1139
+    ld1             {v2.16b-v3.16b}, x1, #32
1140
+    sshl            v2.8h, v2.8h, v0.8h
1141
+    sshl            v3.8h, v3.8h, v0.8h
1142
+    st1             {v2.16b}, x0, x2
1143
+    st1             {v3.16b}, x0, x2
1144
+.endr
1145
+    ret
1146
+endfunc
1147
+
1148
+function PFX(cpy1Dto2D_shl_16x16_neon)
1149
+    cpy1Dto2D_shl_start
1150
+    mov             w12, #4
1151
+.loop_cpy1Dto2D_shl_16:
1152
+    sub             w12, w12, #1
1153
+.rept 4
1154
+    ld1             {v2.16b-v3.16b}, x1, #32
1155
+    sshl            v2.8h, v2.8h, v0.8h
1156
+    sshl            v3.8h, v3.8h, v0.8h
1157
+    st1             {v2.16b-v3.16b}, x0, x2
1158
+.endr
1159
+    cbnz            w12, .loop_cpy1Dto2D_shl_16
1160
+    ret
1161
+endfunc
1162
+
1163
+function PFX(cpy1Dto2D_shl_32x32_neon)
1164
+    cpy1Dto2D_shl_start
1165
+    mov             w12, #16
1166
+.loop_cpy1Dto2D_shl_32:
1167
+    sub             w12, w12, #1
1168
+.rept 2
1169
+    ld1             {v2.16b-v5.16b}, x1, #64
1170
+    sshl            v2.8h, v2.8h, v0.8h
1171
+    sshl            v3.8h, v3.8h, v0.8h
1172
+    sshl            v4.8h, v4.8h, v0.8h
1173
+    sshl            v5.8h, v5.8h, v0.8h
1174
+    st1             {v2.16b-v5.16b}, x0, x2
1175
+.endr
1176
+    cbnz            w12, .loop_cpy1Dto2D_shl_32
1177
+    ret
1178
+endfunc
1179
+
1180
+function PFX(cpy1Dto2D_shl_64x64_neon)
1181
+    cpy1Dto2D_shl_start
1182
+    mov             w12, #32
1183
+    sub             x2, x2, #64
1184
+.loop_cpy1Dto2D_shl_64:
1185
+    sub             w12, w12, #1
1186
+.rept 2
1187
+    ld1             {v2.16b-v5.16b}, x1, #64
1188
+    ld1             {v16.16b-v19.16b}, x1, #64
1189
+    sshl            v2.8h, v2.8h, v0.8h
1190
+    sshl            v3.8h, v3.8h, v0.8h
1191
+    sshl            v4.8h, v4.8h, v0.8h
1192
+    sshl            v5.8h, v5.8h, v0.8h
1193
+    sshl            v16.8h, v16.8h, v0.8h
1194
+    sshl            v17.8h, v17.8h, v0.8h
1195
+    sshl            v18.8h, v18.8h, v0.8h
1196
+    sshl            v19.8h, v19.8h, v0.8h
1197
+    st1             {v2.16b-v5.16b}, x0, #64
1198
+    st1             {v16.16b-v19.16b}, x0, x2
1199
+.endr
1200
+    cbnz            w12, .loop_cpy1Dto2D_shl_64
1201
+    ret
1202
+endfunc
1203
+
1204
+function PFX(cpy1Dto2D_shr_4x4_neon)
1205
+    cpy1Dto2D_shr_start
1206
+    ld1             {v2.16b-v3.16b}, x1
1207
+    sub             v2.8h, v2.8h, v1.8h
1208
+    sub             v3.8h, v3.8h, v1.8h
1209
+    sshl            v2.8h, v2.8h, v0.8h
1210
+    sshl            v3.8h, v3.8h, v0.8h
1211
+    st1             {v2.d}0, x0, x2
1212
+    st1             {v2.d}1, x0, x2
1213
+    st1             {v3.d}0, x0, x2
1214
+    st1             {v3.d}1, x0, x2
1215
+    ret
1216
+endfunc
1217
+
1218
+function PFX(cpy1Dto2D_shr_8x8_neon)
1219
+    cpy1Dto2D_shr_start
1220
+.rept 4
1221
+    ld1             {v2.16b-v3.16b}, x1, #32
1222
+    sub             v2.8h, v2.8h, v1.8h
1223
+    sub             v3.8h, v3.8h, v1.8h
1224
+    sshl            v2.8h, v2.8h, v0.8h
1225
+    sshl            v3.8h, v3.8h, v0.8h
1226
+    st1             {v2.16b}, x0, x2
1227
+    st1             {v3.16b}, x0, x2
1228
+.endr
1229
+    ret
1230
+endfunc
1231
+
1232
+function PFX(cpy1Dto2D_shr_16x16_neon)
1233
+    cpy1Dto2D_shr_start
1234
+    mov             w12, #4
1235
+.loop_cpy1Dto2D_shr_16:
1236
+    sub             w12, w12, #1
1237
+.rept 4
1238
+    ld1             {v2.8h-v3.8h}, x1, #32
1239
+    sub             v2.8h, v2.8h, v1.8h
1240
+    sub             v3.8h, v3.8h, v1.8h
1241
+    sshl            v2.8h, v2.8h, v0.8h
1242
+    sshl            v3.8h, v3.8h, v0.8h
1243
+    st1             {v2.8h-v3.8h}, x0, x2
1244
+.endr
1245
+    cbnz            w12, .loop_cpy1Dto2D_shr_16
1246
+    ret
1247
+endfunc
1248
+
1249
+function PFX(cpy1Dto2D_shr_32x32_neon)
1250
+    cpy1Dto2D_shr_start
1251
+    mov             w12, #16
1252
+.loop_cpy1Dto2D_shr_32:
1253
+    sub             w12, w12, #1
1254
+.rept 2
1255
+    ld1             {v2.16b-v5.16b}, x1, #64
1256
+    sub             v2.8h, v2.8h, v1.8h
1257
+    sub             v3.8h, v3.8h, v1.8h
1258
+    sub             v4.8h, v4.8h, v1.8h
1259
+    sub             v5.8h, v5.8h, v1.8h
1260
+    sshl            v2.8h, v2.8h, v0.8h
1261
+    sshl            v3.8h, v3.8h, v0.8h
1262
+    sshl            v4.8h, v4.8h, v0.8h
1263
+    sshl            v5.8h, v5.8h, v0.8h
1264
+    st1             {v2.16b-v5.16b}, x0, x2
1265
+.endr
1266
+    cbnz            w12, .loop_cpy1Dto2D_shr_32
1267
+    ret
1268
+endfunc
1269
+
1270
+function PFX(cpy1Dto2D_shr_64x64_neon)
1271
+    cpy1Dto2D_shr_start
1272
+    mov             w12, #32
1273
+    sub             x2, x2, #64
1274
+.loop_cpy1Dto2D_shr_64:
1275
+    sub             w12, w12, #1
1276
+.rept 2
1277
+    ld1             {v2.16b-v5.16b}, x1, #64
1278
+    ld1             {v16.16b-v19.16b}, x1, #64
1279
+    sub             v2.8h, v2.8h, v1.8h
1280
+    sub             v3.8h, v3.8h, v1.8h
1281
+    sub             v4.8h, v4.8h, v1.8h
1282
+    sub             v5.8h, v5.8h, v1.8h
1283
+    sub             v16.8h, v16.8h, v1.8h
1284
+    sub             v17.8h, v17.8h, v1.8h
1285
+    sub             v18.8h, v18.8h, v1.8h
1286
+    sub             v19.8h, v19.8h, v1.8h
1287
+    sshl            v2.8h, v2.8h, v0.8h
1288
+    sshl            v3.8h, v3.8h, v0.8h
1289
+    sshl            v4.8h, v4.8h, v0.8h
1290
+    sshl            v5.8h, v5.8h, v0.8h
1291
+    sshl            v16.8h, v16.8h, v0.8h
1292
+    sshl            v17.8h, v17.8h, v0.8h
1293
+    sshl            v18.8h, v18.8h, v0.8h
1294
+    sshl            v19.8h, v19.8h, v0.8h
1295
+    st1             {v2.16b-v5.16b}, x0, #64
1296
+    st1             {v16.16b-v19.16b}, x0, x2
1297
+.endr
1298
+    cbnz            w12, .loop_cpy1Dto2D_shr_64
1299
+    ret
1300
+endfunc
1301
x265_3.6.tar.gz/source/common/aarch64/dct-prim.cpp Added
950
 
1
@@ -0,0 +1,948 @@
2
+#include "dct-prim.h"
3
+
4
+
5
+#if HAVE_NEON
6
+
7
+#include <arm_neon.h>
8
+
9
+
10
+namespace
11
+{
12
+using namespace X265_NS;
13
+
14
+
15
+static int16x8_t rev16(const int16x8_t a)
16
+{
17
+    static const int8x16_t tbl = {14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1};
18
+    return vqtbx1q_u8(a, a, tbl);
19
+}
20
+
21
+static int32x4_t rev32(const int32x4_t a)
22
+{
23
+    static const int8x16_t tbl = {12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3};
24
+    return vqtbx1q_u8(a, a, tbl);
25
+}
26
+
27
+static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3)
28
+{
29
+    int16x4_t s0, s1, s2, s3;
30
+    s0 = vtrn1_s32(x0, x2);
31
+    s1 = vtrn1_s32(x1, x3);
32
+    s2 = vtrn2_s32(x0, x2);
33
+    s3 = vtrn2_s32(x1, x3);
34
+
35
+    x0 = vtrn1_s16(s0, s1);
36
+    x1 = vtrn2_s16(s0, s1);
37
+    x2 = vtrn1_s16(s2, s3);
38
+    x3 = vtrn2_s16(s2, s3);
39
+}
40
+
41
+
42
+
43
+static int scanPosLast_opt(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag,
44
+                           uint8_t *coeffNum, int numSig, const uint16_t * /*scanCG4x4*/, const int /*trSize*/)
45
+{
46
+
47
+    // This is an optimized function for scanPosLast, which removes the rmw dependency, once integrated into mainline x265, should replace reference implementation
48
+    // For clarity, left the original reference code in comments
49
+    int scanPosLast = 0;
50
+
51
+    uint16_t cSign = 0;
52
+    uint16_t cFlag = 0;
53
+    uint8_t cNum = 0;
54
+
55
+    uint32_t prevcgIdx = 0;
56
+    do
57
+    {
58
+        const uint32_t cgIdx = (uint32_t)scanPosLast >> MLS_CG_SIZE;
59
+
60
+        const uint32_t posLast = scan[scanPosLast];
61
+
62
+        const int curCoeff = coeff[posLast];
63
+        const uint32_t isNZCoeff = (curCoeff != 0);
64
+        /*
65
+        NOTE: the new algorithm is complicated, so I keep reference code here
66
+        uint32_t posy   = posLast >> log2TrSize;
67
+        uint32_t posx   = posLast - (posy << log2TrSize);
68
+        uint32_t blkIdx0 = ((posy >> MLS_CG_LOG2_SIZE) << codingParameters.log2TrSizeCG) + (posx >> MLS_CG_LOG2_SIZE);
69
+        const uint32_t blkIdx = ((posLast >> (2 * MLS_CG_LOG2_SIZE)) & ~maskPosXY) + ((posLast >> MLS_CG_LOG2_SIZE) & maskPosXY);
70
+        sigCoeffGroupFlag64 |= ((uint64_t)isNZCoeff << blkIdx);
71
+        */
72
+
73
+        // get L1 sig map
74
+        numSig -= isNZCoeff;
75
+
76
+        if (scanPosLast % (1 << MLS_CG_SIZE) == 0)
77
+        {
78
+            coeffSign[prevcgIdx] = cSign;
79
+            coeffFlag[prevcgIdx] = cFlag;
80
+            coeffNum[prevcgIdx] = cNum;
81
+            cSign = 0;
82
+            cFlag = 0;
83
+            cNum = 0;
84
+        }
85
+        // TODO: optimize by instruction BTS
86
+        cSign += (uint16_t)(((curCoeff < 0) ? 1 : 0) << cNum);
87
+        cFlag = (cFlag << 1) + (uint16_t)isNZCoeff;
88
+        cNum += (uint8_t)isNZCoeff;
89
+        prevcgIdx = cgIdx;
90
+        scanPosLast++;
91
+    }
92
+    while (numSig > 0);
93
+
94
+    coeffSign[prevcgIdx] = cSign;
95
+    coeffFlag[prevcgIdx] = cFlag;
96
+    coeffNum[prevcgIdx] = cNum;
97
+    return scanPosLast - 1;
98
+}
99
+
100
+
101
+#if (MLS_CG_SIZE == 4)
102
+template<int log2TrSize>
103
+static void nonPsyRdoQuant_neon(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost,
104
+                                int64_t *totalRdCost, uint32_t blkPos)
105
+{
106
+    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH -
107
+                               log2TrSize; /* Represents scaling through forward transform */
108
+    const int scaleBits = SCALE_BITS - 2 * transformShift;
109
+    const uint32_t trSize = 1 << log2TrSize;
110
+
111
+    int64x2_t vcost_sum_0 = vdupq_n_s64(0);
112
+    int64x2_t vcost_sum_1 = vdupq_n_s64(0);
113
+    for (int y = 0; y < MLS_CG_SIZE; y++)
114
+    {
115
+        int16x4_t in = *(int16x4_t *)&m_resiDctCoeff[blkPos];
116
+        int32x4_t mul = vmull_s16(in, in);
117
+        int64x2_t cost0, cost1;
118
+        cost0 = vshll_n_s32(vget_low_s32(mul), scaleBits);
119
+        cost1 = vshll_high_n_s32(mul, scaleBits);
120
+        *(int64x2_t *)&costUncoded[blkPos + 0] = cost0;
121
+        *(int64x2_t *)&costUncoded[blkPos + 2] = cost1;
122
+        vcost_sum_0 = vaddq_s64(vcost_sum_0, cost0);
123
+        vcost_sum_1 = vaddq_s64(vcost_sum_1, cost1);
124
+        blkPos += trSize;
125
+    }
126
+    int64_t sum = vaddvq_s64(vaddq_s64(vcost_sum_0, vcost_sum_1));
127
+    *totalUncodedCost += sum;
128
+    *totalRdCost += sum;
129
+}
130
+
131
+template<int log2TrSize>
132
+static void psyRdoQuant_neon(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded,
133
+                             int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos)
134
+{
135
+    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH -
136
+                               log2TrSize; /* Represents scaling through forward transform */
137
+    const int scaleBits = SCALE_BITS - 2 * transformShift;
138
+    const uint32_t trSize = 1 << log2TrSize;
139
+    //using preprocessor to bypass clang bug
140
+    const int max = X265_MAX(0, (2 * transformShift + 1));
141
+
142
+    int64x2_t vcost_sum_0 = vdupq_n_s64(0);
143
+    int64x2_t vcost_sum_1 = vdupq_n_s64(0);
144
+    int32x4_t vpsy = vdupq_n_s32(*psyScale);
145
+    for (int y = 0; y < MLS_CG_SIZE; y++)
146
+    {
147
+        int32x4_t signCoef = vmovl_s16(*(int16x4_t *)&m_resiDctCoeff[blkPos]);
148
+        int32x4_t predictedCoef = vsubq_s32(vmovl_s16(*(int16x4_t *)&m_fencDctCoeff[blkPos]), signCoef);
149
+        int64x2_t cost0, cost1;
150
+        cost0 = vmull_s32(vget_low_s32(signCoef), vget_low_s32(signCoef));
151
+        cost1 = vmull_high_s32(signCoef, signCoef);
152
+        cost0 = vshlq_n_s64(cost0, scaleBits);
153
+        cost1 = vshlq_n_s64(cost1, scaleBits);
154
+        int64x2_t neg0 = vmull_s32(vget_low_s32(predictedCoef), vget_low_s32(vpsy));
155
+        int64x2_t neg1 = vmull_high_s32(predictedCoef, vpsy);
156
+        if (max > 0)
157
+        {
158
+            int64x2_t shift = vdupq_n_s64(-max);
159
+            neg0 = vshlq_s64(neg0, shift);
160
+            neg1 = vshlq_s64(neg1, shift);
161
+        }
162
+        cost0 = vsubq_s64(cost0, neg0);
163
+        cost1 = vsubq_s64(cost1, neg1);
164
+        *(int64x2_t *)&costUncoded[blkPos + 0] = cost0;
165
+        *(int64x2_t *)&costUncoded[blkPos + 2] = cost1;
166
+        vcost_sum_0 = vaddq_s64(vcost_sum_0, cost0);
167
+        vcost_sum_1 = vaddq_s64(vcost_sum_1, cost1);
168
+
169
+        blkPos += trSize;
170
+    }
171
+    int64_t sum = vaddvq_s64(vaddq_s64(vcost_sum_0, vcost_sum_1));
172
+    *totalUncodedCost += sum;
173
+    *totalRdCost += sum;
174
+}
175
+
176
+#else
177
+#error "MLS_CG_SIZE must be 4 for neon version"
178
+#endif
179
+
180
+
181
+
182
+template<int trSize>
183
+int  count_nonzero_neon(const int16_t *quantCoeff)
184
+{
185
+    X265_CHECK(((intptr_t)quantCoeff & 15) == 0, "quant buffer not aligned\n");
186
+    int count = 0;
187
+    int16x8_t vcount = vdupq_n_s16(0);
188
+    const int numCoeff = trSize * trSize;
189
+    int i = 0;
190
+    for (; (i + 8) <= numCoeff; i += 8)
191
+    {
192
+        int16x8_t in = *(int16x8_t *)&quantCoeff[i];
193
+        vcount = vaddq_s16(vcount, vtstq_s16(in, in));
194
+    }
195
+    for (; i < numCoeff; i++)
196
+    {
197
+        count += quantCoeff[i] != 0;
198
+    }
199
+
200
+    return count - vaddvq_s16(vcount);
201
+}
202
+
203
+template<int trSize>
204
+uint32_t copy_count_neon(int16_t *coeff, const int16_t *residual, intptr_t resiStride)
205
+{
206
+    uint32_t numSig = 0;
207
+    int16x8_t vcount = vdupq_n_s16(0);
208
+    for (int k = 0; k < trSize; k++)
209
+    {
210
+        int j = 0;
211
+        for (; (j + 8) <= trSize; j += 8)
212
+        {
213
+            int16x8_t in = *(int16x8_t *)&residual[j];
214
+            *(int16x8_t *)&coeff[j] = in;
215
+            vcount = vaddq_s16(vcount, vtstq_s16(in, in));
216
+        }
217
+        for (; j < trSize; j++)
218
+        {
219
+            coeff[j] = residual[j];
220
+            numSig += (residual[j] != 0);
221
+        }
222
+        residual += resiStride;
223
+        coeff += trSize;
224
+    }
225
+
226
+    return numSig - vaddvq_s16(vcount);
227
+}
228
+
229
+
230
+static void partialButterfly16(const int16_t *src, int16_t *dst, int shift, int line)
231
+{
232
+    int j, k;
233
+    int32x4_t E[2], O[2];
234
+    int32x4_t EE, EO;
235
+    int32x2_t EEE, EEO;
236
+    const int add = 1 << (shift - 1);
237
+    const int32x4_t _vadd = {add, 0};
238
+
239
+    for (j = 0; j < line; j++)
240
+    {
241
+        int16x8_t in0 = *(int16x8_t *)src;
242
+        int16x8_t in1 = rev16(*(int16x8_t *)&src[8]);
243
+
244
+        E[0] = vaddl_s16(vget_low_s16(in0), vget_low_s16(in1));
245
+        O[0] = vsubl_s16(vget_low_s16(in0), vget_low_s16(in1));
246
+        E[1] = vaddl_high_s16(in0, in1);
247
+        O[1] = vsubl_high_s16(in0, in1);
248
+
249
+        for (k = 1; k < 16; k += 2)
250
+        {
251
+            int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t16[k][0]);
252
+            int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t16[k][4]);
253
+
254
+            int32x4_t res = _vadd;
255
+            res = vmlaq_s32(res, c0, O[0]);
256
+            res = vmlaq_s32(res, c1, O[1]);
257
+            dst[k * line] = (int16_t)(vaddvq_s32(res) >> shift);
258
+        }
259
+
260
+        /* EE and EO */
261
+        EE = vaddq_s32(E[0], rev32(E[1]));
262
+        EO = vsubq_s32(E[0], rev32(E[1]));
263
+
264
+        for (k = 2; k < 16; k += 4)
265
+        {
266
+            int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t16[k][0]);
267
+            int32x4_t res = _vadd;
268
+            res = vmlaq_s32(res, c0, EO);
269
+            dst[k * line] = (int16_t)(vaddvq_s32(res) >> shift);
270
+        }
271
+
272
+        /* EEE and EEO */
273
+        EEE[0] = EE[0] + EE[3];
274
+        EEO[0] = EE[0] - EE[3];
275
+        EEE[1] = EE[1] + EE[2];
276
+        EEO[1] = EE[1] - EE[2];
277
+
278
+        dst[0] = (int16_t)((g_t16[0][0] * EEE[0] + g_t16[0][1] * EEE[1] + add) >> shift);
279
+        dst[8 * line] = (int16_t)((g_t16[8][0] * EEE[0] + g_t16[8][1] * EEE[1] + add) >> shift);
280
+        dst[4 * line] = (int16_t)((g_t16[4][0] * EEO[0] + g_t16[4][1] * EEO[1] + add) >> shift);
281
+        dst[12 * line] = (int16_t)((g_t16[12][0] * EEO[0] + g_t16[12][1] * EEO[1] + add) >> shift);
282
+
283
+
284
+        src += 16;
285
+        dst++;
286
+    }
287
+}
288
+
289
+
290
+static void partialButterfly32(const int16_t *src, int16_t *dst, int shift, int line)
291
+{
292
+    int j, k;
293
+    const int add = 1 << (shift - 1);
294
+
295
+
296
+    for (j = 0; j < line; j++)
297
+    {
298
+        int32x4_t VE[4], VO0, VO1, VO2, VO3;
299
+        int32x4_t VEE[2], VEO[2];
300
+        int32x4_t VEEE, VEEO;
301
+        int EEEE[2], EEEO[2];
302
+
303
+        int16x8x4_t inputs;
304
+        inputs = *(int16x8x4_t *)&src[0];
305
+        int16x8x4_t in_rev;
306
+
307
+        in_rev.val[1] = rev16(inputs.val[2]);
308
+        in_rev.val[0] = rev16(inputs.val[3]);
309
+
310
+        VE[0] = vaddl_s16(vget_low_s16(inputs.val[0]), vget_low_s16(in_rev.val[0]));
311
+        VE[1] = vaddl_high_s16(inputs.val[0], in_rev.val[0]);
312
+        VO0 = vsubl_s16(vget_low_s16(inputs.val[0]), vget_low_s16(in_rev.val[0]));
313
+        VO1 = vsubl_high_s16(inputs.val[0], in_rev.val[0]);
314
+        VE[2] = vaddl_s16(vget_low_s16(inputs.val[1]), vget_low_s16(in_rev.val[1]));
315
+        VE[3] = vaddl_high_s16(inputs.val[1], in_rev.val[1]);
316
+        VO2 = vsubl_s16(vget_low_s16(inputs.val[1]), vget_low_s16(in_rev.val[1]));
317
+        VO3 = vsubl_high_s16(inputs.val[1], in_rev.val[1]);
318
+
319
+        for (k = 1; k < 32; k += 2)
320
+        {
321
+            int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t32k0);
322
+            int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t32k4);
323
+            int32x4_t c2 = vmovl_s16(*(int16x4_t *)&g_t32k8);
324
+            int32x4_t c3 = vmovl_s16(*(int16x4_t *)&g_t32k12);
325
+            int32x4_t s = vmulq_s32(c0, VO0);
326
+            s = vmlaq_s32(s, c1, VO1);
327
+            s = vmlaq_s32(s, c2, VO2);
328
+            s = vmlaq_s32(s, c3, VO3);
329
+
330
+            dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift);
331
+
332
+        }
333
+
334
+        int32x4_t rev_VE2;
335
+
336
+
337
+        rev_VE0 = rev32(VE3);
338
+        rev_VE1 = rev32(VE2);
339
+
340
+        /* EE and EO */
341
+        for (k = 0; k < 2; k++)
342
+        {
343
+            VEEk = vaddq_s32(VEk, rev_VEk);
344
+            VEOk = vsubq_s32(VEk, rev_VEk);
345
+        }
346
+        for (k = 2; k < 32; k += 4)
347
+        {
348
+            int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t32k0);
349
+            int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t32k4);
350
+            int32x4_t s = vmulq_s32(c0, VEO0);
351
+            s = vmlaq_s32(s, c1, VEO1);
352
+
353
+            dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift);
354
+
355
+        }
356
+
357
+        int32x4_t tmp = rev32(VEE1);
358
+        VEEE = vaddq_s32(VEE0, tmp);
359
+        VEEO = vsubq_s32(VEE0, tmp);
360
+        for (k = 4; k < 32; k += 8)
361
+        {
362
+            int32x4_t c = vmovl_s16(*(int16x4_t *)&g_t32k0);
363
+            int32x4_t s = vmulq_s32(c, VEEO);
364
+
365
+            dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift);
366
+        }
367
+
368
+        /* EEEE and EEEO */
369
+        EEEE[0] = VEEE[0] + VEEE[3];
370
+        EEEO[0] = VEEE[0] - VEEE[3];
371
+        EEEE[1] = VEEE[1] + VEEE[2];
372
+        EEEO[1] = VEEE[1] - VEEE[2];
373
+
374
+        dst[0] = (int16_t)((g_t32[0][0] * EEEE[0] + g_t32[0][1] * EEEE[1] + add) >> shift);
375
+        dst[16 * line] = (int16_t)((g_t32[16][0] * EEEE[0] + g_t32[16][1] * EEEE[1] + add) >> shift);
376
+        dst[8 * line] = (int16_t)((g_t32[8][0] * EEEO[0] + g_t32[8][1] * EEEO[1] + add) >> shift);
377
+        dst[24 * line] = (int16_t)((g_t32[24][0] * EEEO[0] + g_t32[24][1] * EEEO[1] + add) >> shift);
378
+
379
+
380
+
381
+        src += 32;
382
+        dst++;
383
+    }
384
+}
385
+
386
+static void partialButterfly8(const int16_t *src, int16_t *dst, int shift, int line)
387
+{
388
+    int j, k;
389
+    int E[4], O[4];
390
+    int EE[2], EO[2];
391
+    int add = 1 << (shift - 1);
392
+
393
+    for (j = 0; j < line; j++)
394
+    {
395
+        /* E and O*/
396
+        for (k = 0; k < 4; k++)
397
+        {
398
+            E[k] = src[k] + src[7 - k];
399
+            O[k] = src[k] - src[7 - k];
400
+        }
401
+
402
+        /* EE and EO */
403
+        EE[0] = E[0] + E[3];
404
+        EO[0] = E[0] - E[3];
405
+        EE[1] = E[1] + E[2];
406
+        EO[1] = E[1] - E[2];
407
+
408
+        dst[0] = (int16_t)((g_t8[0][0] * EE[0] + g_t8[0][1] * EE[1] + add) >> shift);
409
+        dst[4 * line] = (int16_t)((g_t8[4][0] * EE[0] + g_t8[4][1] * EE[1] + add) >> shift);
410
+        dst[2 * line] = (int16_t)((g_t8[2][0] * EO[0] + g_t8[2][1] * EO[1] + add) >> shift);
411
+        dst[6 * line] = (int16_t)((g_t8[6][0] * EO[0] + g_t8[6][1] * EO[1] + add) >> shift);
412
+
413
+        dst[line] = (int16_t)((g_t8[1][0] * O[0] + g_t8[1][1] * O[1] + g_t8[1][2] * O[2] + g_t8[1][3] * O[3] + add) >> shift);
414
+        dst[3 * line] = (int16_t)((g_t8[3][0] * O[0] + g_t8[3][1] * O[1] + g_t8[3][2] * O[2] + g_t8[3][3] * O[3] + add) >>
415
+                                  shift);
416
+        dst[5 * line] = (int16_t)((g_t8[5][0] * O[0] + g_t8[5][1] * O[1] + g_t8[5][2] * O[2] + g_t8[5][3] * O[3] + add) >>
417
+                                  shift);
418
+        dst[7 * line] = (int16_t)((g_t8[7][0] * O[0] + g_t8[7][1] * O[1] + g_t8[7][2] * O[2] + g_t8[7][3] * O[3] + add) >>
419
+                                  shift);
420
+
421
+        src += 8;
422
+        dst++;
423
+    }
424
+}
425
+
426
+static void partialButterflyInverse4(const int16_t *src, int16_t *dst, int shift, int line)
427
+{
428
+    int j;
429
+    int E[2], O[2];
430
+    int add = 1 << (shift - 1);
431
+
432
+    for (j = 0; j < line; j++)
433
+    {
434
+        /* Utilizing symmetry properties to the maximum to minimize the number of multiplications */
435
+        O[0] = g_t4[1][0] * src[line] + g_t4[3][0] * src[3 * line];
436
+        O[1] = g_t4[1][1] * src[line] + g_t4[3][1] * src[3 * line];
437
+        E[0] = g_t4[0][0] * src[0] + g_t4[2][0] * src[2 * line];
438
+        E[1] = g_t4[0][1] * src[0] + g_t4[2][1] * src[2 * line];
439
+
440
+        /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */
441
+        dst[0] = (int16_t)(x265_clip3(-32768, 32767, (E[0] + O[0] + add) >> shift));
442
+        dst[1] = (int16_t)(x265_clip3(-32768, 32767, (E[1] + O[1] + add) >> shift));
443
+        dst[2] = (int16_t)(x265_clip3(-32768, 32767, (E[1] - O[1] + add) >> shift));
444
+        dst[3] = (int16_t)(x265_clip3(-32768, 32767, (E[0] - O[0] + add) >> shift));
445
+
446
+        src++;
447
+        dst += 4;
448
+    }
449
+}
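The "symmetry" the comments refer to is the usual even/odd butterfly: the four outputs come in mirrored pairs, dst[0]/dst[3] = E[0] +/- O[0] and dst[1]/dst[2] = E[1] +/- O[1], so only the four E/O products are needed instead of a full 4x4 multiply. A small self-contained check against the direct matrix form, assuming the standard HEVC 4-point basis for g_t4 (the real table lives elsewhere in the x265 sources):

#include <cassert>

int main()
{
    const int g_t4[4][4] = { { 64,  64,  64,  64 },
                             { 83,  36, -36, -83 },
                             { 64, -64, -64,  64 },
                             { 36, -83,  83, -36 } };   // assumed HEVC 4-point DCT basis
    const int src[4] = { 100, -7, 13, 5 };              // one column of coefficients (line == 1)

    // Butterfly form, as computed above before rounding and clipping.
    int O0 = g_t4[1][0] * src[1] + g_t4[3][0] * src[3];
    int O1 = g_t4[1][1] * src[1] + g_t4[3][1] * src[3];
    int E0 = g_t4[0][0] * src[0] + g_t4[2][0] * src[2];
    int E1 = g_t4[0][1] * src[0] + g_t4[2][1] * src[2];
    const int bf[4] = { E0 + O0, E1 + O1, E1 - O1, E0 - O0 };

    // Direct (transposed-matrix) form: dst[n] = sum over k of g_t4[k][n] * src[k].
    for (int n = 0; n < 4; n++)
    {
        int direct = 0;
        for (int k = 0; k < 4; k++)
            direct += g_t4[k][n] * src[k];
        assert(direct == bf[n]);
    }
    return 0;
}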
450
+
451
+
452
+
453
+static void partialButterflyInverse16_neon(const int16_t *src, int16_t *orig_dst, int shift, int line)
454
+{
455
+#define FMAK(x,l) sl = vmlal_lane_s16(sl,*(int16x4_t*)&src(x)*line,*(int16x4_t *)&g_t16xk,l)
456
+#define MULK(x,l) vmull_lane_s16(*(int16x4_t*)&srcx*line,*(int16x4_t *)&g_t16xk,l);
457
+#define ODD3_15(k) FMAK(3,k);FMAK(5,k);FMAK(7,k);FMAK(9,k);FMAK(11,k);FMAK(13,k);FMAK(15,k);
458
+#define EVEN6_14_STEP4(k) FMAK(6,k);FMAK(10,k);FMAK(14,k);
459
+
460
+
461
+    int j, k;
462
+    int32x4_t E8, O8;
463
+    int32x4_t EE4, EO4;
464
+    int32x4_t EEE2, EEO2;
465
+    const int add = 1 << (shift - 1);
466
+
467
+
468
+#pragma unroll(4)
469
+    for (j = 0; j < line; j += 4)
470
+    {
471
+        /* Utilizing symmetry properties to the maximum to minimize the number of multiplications */
472
+
473
+#pragma unroll(2)
474
+        for (k = 0; k < 2; k++)
475
+        {
476
+            int32x4_t s;
477
+            s = vmull_s16(vdup_n_s16(g_t164k), *(int16x4_t *)&src4 * line);;
478
+            EEOk = vmlal_s16(s, vdup_n_s16(g_t1612k), *(int16x4_t *)&src(12) * line);
479
+            s = vmull_s16(vdup_n_s16(g_t160k), *(int16x4_t *)&src0 * line);;
480
+            EEEk = vmlal_s16(s, vdup_n_s16(g_t168k), *(int16x4_t *)&src(8) * line);
481
+        }
482
+
483
+        /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */
484
+        EE0 = vaddq_s32(EEE0 , EEO0);
485
+        EE2 = vsubq_s32(EEE1 , EEO1);
486
+        EE1 = vaddq_s32(EEE1 , EEO1);
487
+        EE3 = vsubq_s32(EEE0 , EEO0);
488
+
489
+
490
+#pragma unroll(1)
491
+        for (k = 0; k < 4; k += 4)
492
+        {
493
+            int32x4_t s4;
494
+            s0 = MULK(2, 0);
495
+            s1 = MULK(2, 1);
496
+            s2 = MULK(2, 2);
497
+            s3 = MULK(2, 3);
498
+
499
+            EVEN6_14_STEP4(0);
500
+            EVEN6_14_STEP4(1);
501
+            EVEN6_14_STEP4(2);
502
+            EVEN6_14_STEP4(3);
503
+
504
+            EOk = s0;
505
+            EOk + 1 = s1;
506
+            EOk + 2 = s2;
507
+            EOk + 3 = s3;
508
+        }
509
+
510
+
511
+
512
+        static const int32x4_t min = vdupq_n_s32(-32768);
513
+        static const int32x4_t max = vdupq_n_s32(32767);
514
+        const int32x4_t minus_shift = vdupq_n_s32(-shift);
515
+
516
+#pragma unroll(4)
517
+        for (k = 0; k < 4; k++)
518
+        {
519
+            Ek = vaddq_s32(EEk , EOk);
520
+            Ek + 4 = vsubq_s32(EE3 - k , EO3 - k);
521
+        }
522
+
523
+#pragma unroll(2)
524
+        for (k = 0; k < 8; k += 4)
525
+        {
526
+            int32x4_t s4;
527
+            s0 = MULK(1, 0);
528
+            s1 = MULK(1, 1);
529
+            s2 = MULK(1, 2);
530
+            s3 = MULK(1, 3);
531
+            ODD3_15(0);
532
+            ODD3_15(1);
533
+            ODD3_15(2);
534
+            ODD3_15(3);
535
+            Ok = s0;
536
+            Ok + 1 = s1;
537
+            Ok + 2 = s2;
538
+            Ok + 3 = s3;
539
+            int32x4_t t;
540
+            int16x4_t x0, x1, x2, x3;
541
+
542
+            Ek = vaddq_s32(vdupq_n_s32(add), Ek);
543
+            t = vaddq_s32(Ek, Ok);
544
+            t = vshlq_s32(t, minus_shift);
545
+            t = vmaxq_s32(t, min);
546
+            t = vminq_s32(t, max);
547
+            x0 = vmovn_s32(t);
548
+
549
+            Ek + 1 = vaddq_s32(vdupq_n_s32(add), Ek + 1);
550
+            t = vaddq_s32(Ek + 1, Ok + 1);
551
+            t = vshlq_s32(t, minus_shift);
552
+            t = vmaxq_s32(t, min);
553
+            t = vminq_s32(t, max);
554
+            x1 = vmovn_s32(t);
555
+
556
+            Ek + 2 = vaddq_s32(vdupq_n_s32(add), Ek + 2);
557
+            t = vaddq_s32(Ek + 2, Ok + 2);
558
+            t = vshlq_s32(t, minus_shift);
559
+            t = vmaxq_s32(t, min);
560
+            t = vminq_s32(t, max);
561
+            x2 = vmovn_s32(t);
562
+
563
+            Ek + 3 = vaddq_s32(vdupq_n_s32(add), Ek + 3);
564
+            t = vaddq_s32(Ek + 3, Ok + 3);
565
+            t = vshlq_s32(t, minus_shift);
566
+            t = vmaxq_s32(t, min);
567
+            t = vminq_s32(t, max);
568
+            x3 = vmovn_s32(t);
569
+
570
+            transpose_4x4x16(x0, x1, x2, x3);
571
+            *(int16x4_t *)&orig_dst0 * 16 + k = x0;
572
+            *(int16x4_t *)&orig_dst1 * 16 + k = x1;
573
+            *(int16x4_t *)&orig_dst2 * 16 + k = x2;
574
+            *(int16x4_t *)&orig_dst3 * 16 + k = x3;
575
+        }
576
+
577
+
578
+#pragma unroll(2)
579
+        for (k = 0; k < 8; k += 4)
580
+        {
581
+            int32x4_t t;
582
+            int16x4_t x0, x1, x2, x3;
583
+
584
+            t = vsubq_s32(E7 - k, O7 - k);
585
+            t = vshlq_s32(t, minus_shift);
586
+            t = vmaxq_s32(t, min);
587
+            t = vminq_s32(t, max);
588
+            x0 = vmovn_s32(t);
589
+
590
+            t = vsubq_s32(E6 - k, O6 - k);
591
+            t = vshlq_s32(t, minus_shift);
592
+            t = vmaxq_s32(t, min);
593
+            t = vminq_s32(t, max);
594
+            x1 = vmovn_s32(t);
595
+
596
+            t = vsubq_s32(E5 - k, O5 - k);
597
+
598
+            t = vshlq_s32(t, minus_shift);
599
+            t = vmaxq_s32(t, min);
600
+            t = vminq_s32(t, max);
601
+            x2 = vmovn_s32(t);
602
+
603
+            t = vsubq_s32(E4 - k, O4 - k);
604
+            t = vshlq_s32(t, minus_shift);
605
+            t = vmaxq_s32(t, min);
606
+            t = vminq_s32(t, max);
607
+            x3 = vmovn_s32(t);
608
+
609
+            transpose_4x4x16(x0, x1, x2, x3);
610
+            *(int16x4_t *)&orig_dst0 * 16 + k + 8 = x0;
611
+            *(int16x4_t *)&orig_dst1 * 16 + k + 8 = x1;
612
+            *(int16x4_t *)&orig_dst2 * 16 + k + 8 = x2;
613
+            *(int16x4_t *)&orig_dst3 * 16 + k + 8 = x3;
614
+        }
615
+        orig_dst += 4 * 16;
616
+        src += 4;
617
+    }
618
+
619
+#undef MUL
620
+#undef FMA
621
+#undef FMAK
622
+#undef MULK
623
+#undef ODD3_15
624
+#undef EVEN6_14_STEP4
625
+
626
+
627
+}
628
+
629
+
630
+
631
+static void partialButterflyInverse32_neon(const int16_t *src, int16_t *orig_dst, int shift, int line)
632
+{
633
+#define MUL(x) vmull_s16(vdup_n_s16(g_t32xk),*(int16x4_t*)&srcx*line);
634
+#define FMA(x) s = vmlal_s16(s,vdup_n_s16(g_t32xk),*(int16x4_t*)&src(x)*line)
635
+#define FMAK(x,l) sl = vmlal_lane_s16(sl,*(int16x4_t*)&src(x)*line,*(int16x4_t *)&g_t32xk,l)
636
+#define MULK(x,l) vmull_lane_s16(*(int16x4_t*)&srcx*line,*(int16x4_t *)&g_t32xk,l);
637
+#define ODD31(k) FMAK(3,k);FMAK(5,k);FMAK(7,k);FMAK(9,k);FMAK(11,k);FMAK(13,k);FMAK(15,k);FMAK(17,k);FMAK(19,k);FMAK(21,k);FMAK(23,k);FMAK(25,k);FMAK(27,k);FMAK(29,k);FMAK(31,k);
638
+
639
+#define ODD15(k) FMAK(6,k);FMAK(10,k);FMAK(14,k);FMAK(18,k);FMAK(22,k);FMAK(26,k);FMAK(30,k);
640
+#define ODD7(k) FMAK(12,k);FMAK(20,k);FMAK(28,k);
641
+
642
+
643
+    int j, k;
644
+    int32x4_t E16, O16;
645
+    int32x4_t EE8, EO8;
646
+    int32x4_t EEE4, EEO4;
647
+    int32x4_t EEEE2, EEEO2;
648
+    int16x4_t dst32;
649
+    int add = 1 << (shift - 1);
650
+
651
+#pragma unroll (8)
652
+    for (j = 0; j < line; j += 4)
653
+    {
654
+#pragma unroll (4)
655
+        for (k = 0; k < 16; k += 4)
656
+        {
657
+            int32x4_t s4;
658
+            s0 = MULK(1, 0);
659
+            s1 = MULK(1, 1);
660
+            s2 = MULK(1, 2);
661
+            s3 = MULK(1, 3);
662
+            ODD31(0);
663
+            ODD31(1);
664
+            ODD31(2);
665
+            ODD31(3);
666
+            Ok = s0;
667
+            Ok + 1 = s1;
668
+            Ok + 2 = s2;
669
+            Ok + 3 = s3;
670
+
671
+
672
+        }
673
+
674
+
675
+#pragma unroll (2)
676
+        for (k = 0; k < 8; k += 4)
677
+        {
678
+            int32x4_t s4;
679
+            s0 = MULK(2, 0);
680
+            s1 = MULK(2, 1);
681
+            s2 = MULK(2, 2);
682
+            s3 = MULK(2, 3);
683
+
684
+            ODD15(0);
685
+            ODD15(1);
686
+            ODD15(2);
687
+            ODD15(3);
688
+
689
+            EOk = s0;
690
+            EOk + 1 = s1;
691
+            EOk + 2 = s2;
692
+            EOk + 3 = s3;
693
+        }
694
+
695
+
696
+        for (k = 0; k < 4; k += 4)
697
+        {
698
+            int32x4_t s4;
699
+            s0 = MULK(4, 0);
700
+            s1 = MULK(4, 1);
701
+            s2 = MULK(4, 2);
702
+            s3 = MULK(4, 3);
703
+
704
+            ODD7(0);
705
+            ODD7(1);
706
+            ODD7(2);
707
+            ODD7(3);
708
+
709
+            EEOk = s0;
710
+            EEOk + 1 = s1;
711
+            EEOk + 2 = s2;
712
+            EEOk + 3 = s3;
713
+        }
714
+
715
+#pragma unroll (2)
716
+        for (k = 0; k < 2; k++)
717
+        {
718
+            int32x4_t s;
719
+            s = MUL(8);
720
+            EEEOk = FMA(24);
721
+            s = MUL(0);
722
+            EEEEk = FMA(16);
723
+        }
724
+        /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */
725
+        EEE0 = vaddq_s32(EEEE0, EEEO0);
726
+        EEE3 = vsubq_s32(EEEE0, EEEO0);
727
+        EEE1 = vaddq_s32(EEEE1, EEEO1);
728
+        EEE2 = vsubq_s32(EEEE1, EEEO1);
729
+
730
+#pragma unroll (4)
731
+        for (k = 0; k < 4; k++)
732
+        {
733
+            EEk = vaddq_s32(EEEk, EEOk);
734
+            EEk + 4 = vsubq_s32((EEE3 - k), (EEO3 - k));
735
+        }
736
+
737
+#pragma unroll (8)
738
+        for (k = 0; k < 8; k++)
739
+        {
740
+            Ek = vaddq_s32(EEk, EOk);
741
+            Ek + 8 = vsubq_s32((EE7 - k), (EO7 - k));
742
+        }
743
+
744
+        static const int32x4_t min = vdupq_n_s32(-32768);
745
+        static const int32x4_t max = vdupq_n_s32(32767);
746
+
747
+
748
+
749
+#pragma unroll (16)
750
+        for (k = 0; k < 16; k++)
751
+        {
752
+            int32x4_t adde = vaddq_s32(vdupq_n_s32(add), Ek);
753
+            int32x4_t s = vaddq_s32(adde, Ok);
754
+            s = vshlq_s32(s, vdupq_n_s32(-shift));
755
+            s = vmaxq_s32(s, min);
756
+            s = vminq_s32(s, max);
757
+
758
+
759
+
760
+            dstk = vmovn_s32(s);
761
+            adde = vaddq_s32(vdupq_n_s32(add), (E15 - k));
762
+            s  = vsubq_s32(adde, (O15 - k));
763
+            s = vshlq_s32(s, vdupq_n_s32(-shift));
764
+            s = vmaxq_s32(s, min);
765
+            s = vminq_s32(s, max);
766
+
767
+            dstk + 16 = vmovn_s32(s);
768
+        }
769
+
770
+
771
+#pragma unroll (8)
772
+        for (k = 0; k < 32; k += 4)
773
+        {
774
+            int16x4_t x0 = dstk + 0;
775
+            int16x4_t x1 = dstk + 1;
776
+            int16x4_t x2 = dstk + 2;
777
+            int16x4_t x3 = dstk + 3;
778
+            transpose_4x4x16(x0, x1, x2, x3);
779
+            *(int16x4_t *)&orig_dst0 * 32 + k = x0;
780
+            *(int16x4_t *)&orig_dst1 * 32 + k = x1;
781
+            *(int16x4_t *)&orig_dst2 * 32 + k = x2;
782
+            *(int16x4_t *)&orig_dst3 * 32 + k = x3;
783
+        }
784
+        orig_dst += 4 * 32;
785
+        src += 4;
786
+    }
787
+#undef MUL
788
+#undef FMA
789
+#undef FMAK
790
+#undef MULK
791
+#undef ODD31
792
+#undef ODD15
793
+#undef ODD7
794
+
795
+}
796
+
797
+
798
+static void dct8_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
799
+{
800
+    const int shift_1st = 2 + X265_DEPTH - 8;
801
+    const int shift_2nd = 9;
802
+
803
+    ALIGN_VAR_32(int16_t, coef[8 * 8]);
804
+    ALIGN_VAR_32(int16_t, block[8 * 8]);
805
+
806
+    for (int i = 0; i < 8; i++)
807
+    {
808
+        memcpy(&block[i * 8], &src[i * srcStride], 8 * sizeof(int16_t));
809
+    }
810
+
811
+    partialButterfly8(block, coef, shift_1st, 8);
812
+    partialButterfly8(coef, dst, shift_2nd, 8);
813
+}
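The two partialButterfly8 passes implement the separable 2-D transform: rows first, then columns, each followed by its own rounding shift. The shift pattern generalises across the sizes used below (shift_1st grows with the log2 of the transform size and with bit depth, shift_2nd only with the size). A tiny helper, for illustration only, that reproduces the constants used in dct8_neon, dct16_neon and dct32_neon:

#include <cstdio>

// shift_1st = log2(N) - 1 + (bitDepth - 8), shift_2nd = log2(N) + 6
static void printDctShifts(int log2TrSize, int bitDepth)
{
    const int shift1 = log2TrSize - 1 + (bitDepth - 8);
    const int shift2 = log2TrSize + 6;
    std::printf("N=%d depth=%d: shift_1st=%d shift_2nd=%d\n",
                1 << log2TrSize, bitDepth, shift1, shift2);
}

int main()
{
    printDctShifts(3, 8);    // 8x8,   8-bit  -> 2 and 9  (as in dct8_neon)
    printDctShifts(4, 10);   // 16x16, 10-bit -> 5 and 10 (as in dct16_neon)
    printDctShifts(5, 8);    // 32x32, 8-bit  -> 4 and 11 (as in dct32_neon)
    return 0;
}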
814
+
815
+static void dct16_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
816
+{
817
+    const int shift_1st = 3 + X265_DEPTH - 8;
818
+    const int shift_2nd = 10;
819
+
820
+    ALIGN_VAR_32(int16_t, coef[16 * 16]);
821
+    ALIGN_VAR_32(int16_t, block[16 * 16]);
822
+
823
+    for (int i = 0; i < 16; i++)
824
+    {
825
+        memcpy(&block[i * 16], &src[i * srcStride], 16 * sizeof(int16_t));
826
+    }
827
+
828
+    partialButterfly16(block, coef, shift_1st, 16);
829
+    partialButterfly16(coef, dst, shift_2nd, 16);
830
+}
831
+
832
+static void dct32_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
833
+{
834
+    const int shift_1st = 4 + X265_DEPTH - 8;
835
+    const int shift_2nd = 11;
836
+
837
+    ALIGN_VAR_32(int16_t, coef[32 * 32]);
838
+    ALIGN_VAR_32(int16_t, block[32 * 32]);
839
+
840
+    for (int i = 0; i < 32; i++)
841
+    {
842
+        memcpy(&block[i * 32], &src[i * srcStride], 32 * sizeof(int16_t));
843
+    }
844
+
845
+    partialButterfly32(block, coef, shift_1st, 32);
846
+    partialButterfly32(coef, dst, shift_2nd, 32);
847
+}
848
+
849
+static void idct4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
850
+{
851
+    const int shift_1st = 7;
852
+    const int shift_2nd = 12 - (X265_DEPTH - 8);
853
+
854
+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
855
+    ALIGN_VAR_32(int16_t, block[4 * 4]);
856
+
857
+    partialButterflyInverse4(src, coef, shift_1st, 4); // Forward DST BY FAST ALGORITHM, block input, coef output
858
+    partialButterflyInverse4(coef, block, shift_2nd, 4); // Forward DST BY FAST ALGORITHM, coef input, coeff output
859
+
860
+    for (int i = 0; i < 4; i++)
861
+    {
862
+        memcpy(&dst[i * dstStride], &block[i * 4], 4 * sizeof(int16_t));
863
+    }
864
+}
865
+
866
+static void idct16_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
867
+{
868
+    const int shift_1st = 7;
869
+    const int shift_2nd = 12 - (X265_DEPTH - 8);
870
+
871
+    ALIGN_VAR_32(int16_t, coef[16 * 16]);
872
+    ALIGN_VAR_32(int16_t, block[16 * 16]);
873
+
874
+    partialButterflyInverse16_neon(src, coef, shift_1st, 16);
875
+    partialButterflyInverse16_neon(coef, block, shift_2nd, 16);
876
+
877
+    for (int i = 0; i < 16; i++)
878
+    {
879
+        memcpy(&dst[i * dstStride], &block[i * 16], 16 * sizeof(int16_t));
880
+    }
881
+}
882
+
883
+static void idct32_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
884
+{
885
+    const int shift_1st = 7;
886
+    const int shift_2nd = 12 - (X265_DEPTH - 8);
887
+
888
+    ALIGN_VAR_32(int16_t, coef[32 * 32]);
889
+    ALIGN_VAR_32(int16_t, block[32 * 32]);
890
+
891
+    partialButterflyInverse32_neon(src, coef, shift_1st, 32);
892
+    partialButterflyInverse32_neon(coef, block, shift_2nd, 32);
893
+
894
+    for (int i = 0; i < 32; i++)
895
+    {
896
+        memcpy(&dst[i * dstStride], &block[i * 32], 32 * sizeof(int16_t));
897
+    }
898
+}
899
+
900
+
901
+
902
+}
903
+
904
+namespace X265_NS
905
+{
906
+// x265 private namespace
907
+void setupDCTPrimitives_neon(EncoderPrimitives &p)
908
+{
909
+    p.cu[BLOCK_4x4].nonPsyRdoQuant   = nonPsyRdoQuant_neon<2>;
910
+    p.cu[BLOCK_8x8].nonPsyRdoQuant   = nonPsyRdoQuant_neon<3>;
911
+    p.cu[BLOCK_16x16].nonPsyRdoQuant = nonPsyRdoQuant_neon<4>;
912
+    p.cu[BLOCK_32x32].nonPsyRdoQuant = nonPsyRdoQuant_neon<5>;
913
+    p.cu[BLOCK_4x4].psyRdoQuant = psyRdoQuant_neon<2>;
914
+    p.cu[BLOCK_8x8].psyRdoQuant = psyRdoQuant_neon<3>;
915
+    p.cu[BLOCK_16x16].psyRdoQuant = psyRdoQuant_neon<4>;
916
+    p.cu[BLOCK_32x32].psyRdoQuant = psyRdoQuant_neon<5>;
917
+    p.cu[BLOCK_8x8].dct   = dct8_neon;
918
+    p.cu[BLOCK_16x16].dct = dct16_neon;
919
+    p.cu[BLOCK_32x32].dct = dct32_neon;
920
+    p.cu[BLOCK_4x4].idct   = idct4_neon;
921
+    p.cu[BLOCK_16x16].idct = idct16_neon;
922
+    p.cu[BLOCK_32x32].idct = idct32_neon;
923
+    p.cu[BLOCK_4x4].count_nonzero = count_nonzero_neon<4>;
924
+    p.cu[BLOCK_8x8].count_nonzero = count_nonzero_neon<8>;
925
+    p.cu[BLOCK_16x16].count_nonzero = count_nonzero_neon<16>;
926
+    p.cu[BLOCK_32x32].count_nonzero = count_nonzero_neon<32>;
927
+
928
+    p.cu[BLOCK_4x4].copy_cnt   = copy_count_neon<4>;
929
+    p.cu[BLOCK_8x8].copy_cnt   = copy_count_neon<8>;
930
+    p.cu[BLOCK_16x16].copy_cnt = copy_count_neon<16>;
931
+    p.cu[BLOCK_32x32].copy_cnt = copy_count_neon<32>;
932
+    p.cu[BLOCK_4x4].psyRdoQuant_1p = nonPsyRdoQuant_neon<2>;
933
+    p.cu[BLOCK_4x4].psyRdoQuant_2p = psyRdoQuant_neon<2>;
934
+    p.cu[BLOCK_8x8].psyRdoQuant_1p = nonPsyRdoQuant_neon<3>;
935
+    p.cu[BLOCK_8x8].psyRdoQuant_2p = psyRdoQuant_neon<3>;
936
+    p.cu[BLOCK_16x16].psyRdoQuant_1p = nonPsyRdoQuant_neon<4>;
937
+    p.cu[BLOCK_16x16].psyRdoQuant_2p = psyRdoQuant_neon<4>;
938
+    p.cu[BLOCK_32x32].psyRdoQuant_1p = nonPsyRdoQuant_neon<5>;
939
+    p.cu[BLOCK_32x32].psyRdoQuant_2p = psyRdoQuant_neon<5>;
940
+
941
+    p.scanPosLast  = scanPosLast_opt;
942
+
943
+}
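setupDCTPrimitives_neon only installs function pointers; callers never reference the NEON symbols directly. A minimal, self-contained model of that dispatch pattern (the struct and function names below are illustrative, not x265's actual EncoderPrimitives layout):

#include <cstdint>
#include <cstdio>

typedef void (*dct_t)(const int16_t *src, int16_t *dst, intptr_t srcStride);

static void dct8_c(const int16_t *, int16_t *, intptr_t)         { std::puts("C path"); }
static void dct8_neon_stub(const int16_t *, int16_t *, intptr_t) { std::puts("NEON path"); }

struct PrimitiveTable { dct_t dct8x8; };

int main()
{
    PrimitiveTable p = { dct8_c };   // C fallbacks are installed first
#if defined(__aarch64__)
    p.dct8x8 = dct8_neon_stub;       // then the NEON setup overrides them, as above
#endif
    int16_t in[64] = { 0 }, out[64] = { 0 };
    p.dct8x8(in, out, 8);            // call site is identical either way
    return 0;
}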
944
+
945
+};
946
+
947
+
948
+
949
+#endif
950
x265_3.6.tar.gz/source/common/aarch64/dct-prim.h Added
21
 
1
@@ -0,0 +1,19 @@
2
+#ifndef __DCT_PRIM_NEON_H__
3
+#define __DCT_PRIM_NEON_H__
4
+
5
+
6
+#include "common.h"
7
+#include "primitives.h"
8
+#include "contexts.h"   // costCoeffNxN_c
9
+#include "threading.h"  // CLZ
10
+
11
+namespace X265_NS
12
+{
13
+// x265 private namespace
14
+void setupDCTPrimitives_neon(EncoderPrimitives &p);
15
+};
16
+
17
+
18
+
19
+#endif
20
+
21
x265_3.6.tar.gz/source/common/aarch64/filter-prim.cpp Added
997
 
1
@@ -0,0 +1,995 @@
2
+#if HAVE_NEON
3
+
4
+#include "filter-prim.h"
5
+#include <arm_neon.h>
6
+
7
+namespace
8
+{
9
+
10
+using namespace X265_NS;
11
+
12
+
13
+template<int width, int height>
14
+void filterPixelToShort_neon(const pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
15
+{
16
+    const int shift = IF_INTERNAL_PREC - X265_DEPTH;
17
+    int row, col;
18
+    const int16x8_t off = vdupq_n_s16(IF_INTERNAL_OFFS);
19
+    for (row = 0; row < height; row++)
20
+    {
21
+
22
+        for (col = 0; col < width; col += 8)
23
+        {
24
+            int16x8_t in;
25
+
26
+#if HIGH_BIT_DEPTH
27
+            in = *(int16x8_t *)&src[col];
28
+#else
29
+            in = vmovl_u8(*(uint8x8_t *)&src[col]);
30
+#endif
31
+
32
+            int16x8_t tmp = vshlq_n_s16(in, shift);
33
+            tmp = vsubq_s16(tmp, off);
34
+            *(int16x8_t *)&dst[col] = tmp;
35
+
36
+        }
37
+
38
+        src += srcStride;
39
+        dst += dstStride;
40
+    }
41
+}
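Per lane, the conversion above promotes pixels into the interpolation filter's internal 16-bit domain: shift up by IF_INTERNAL_PREC - X265_DEPTH bits, then re-centre around zero with IF_INTERNAL_OFFS. A scalar model, assuming the usual 8-bit input and 14-bit internal precision:

#include <cstdint>

static inline int16_t pixelToShortModel(uint8_t px)
{
    const int internalPrec = 14;                         // IF_INTERNAL_PREC in x265
    const int internalOffs = 1 << (internalPrec - 1);    // IF_INTERNAL_OFFS == 8192
    const int bitDepth     = 8;                          // X265_DEPTH for an 8-bit build
    const int shift        = internalPrec - bitDepth;    // 6
    return (int16_t)((px << shift) - internalOffs);
}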
42
+
43
+
44
+template<int N, int width, int height>
45
+void interp_horiz_pp_neon(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
46
+{
47
+    const int16_t *coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
48
+    int headRoom = IF_FILTER_PREC;
49
+    int offset = (1 << (headRoom - 1));
50
+    uint16_t maxVal = (1 << X265_DEPTH) - 1;
51
+    int cStride = 1;
52
+
53
+    src -= (N / 2 - 1) * cStride;
54
+    int16x8_t vc;
55
+    vc = *(int16x8_t *)coeff;
56
+    int16x4_t low_vc = vget_low_s16(vc);
57
+    int16x4_t high_vc = vget_high_s16(vc);
58
+
59
+    const int32x4_t voffset = vdupq_n_s32(offset);
60
+    const int32x4_t vhr = vdupq_n_s32(-headRoom);
61
+
62
+    int row, col;
63
+    for (row = 0; row < height; row++)
64
+    {
65
+        for (col = 0; col < width; col += 8)
66
+        {
67
+            int32x4_t vsum1, vsum2;
68
+
69
+            int16x8_t inputN;
70
+
71
+            for (int i = 0; i < N; i++)
72
+            {
73
+#if HIGH_BIT_DEPTH
74
+                inputi = *(int16x8_t *)&srccol + i;
75
+#else
76
+                inputi = vmovl_u8(*(uint8x8_t *)&srccol + i);
77
+#endif
78
+            }
79
+            vsum1 = voffset;
80
+            vsum2 = voffset;
81
+
82
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input0), low_vc, 0);
83
+            vsum2 = vmlal_high_lane_s16(vsum2, input0, low_vc, 0);
84
+
85
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1);
86
+            vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1);
87
+
88
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2);
89
+            vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2);
90
+
91
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3);
92
+            vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3);
93
+
94
+            if (N == 8)
95
+            {
96
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0);
97
+                vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0);
98
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1);
99
+                vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1);
100
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2);
101
+                vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2);
102
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3);
103
+                vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3);
104
+
105
+            }
106
+
107
+            vsum1 = vshlq_s32(vsum1, vhr);
108
+            vsum2 = vshlq_s32(vsum2, vhr);
109
+
110
+            int16x8_t vsum = vuzp1q_s16(vsum1, vsum2);
111
+            vsum = vminq_s16(vsum, vdupq_n_s16(maxVal));
112
+            vsum = vmaxq_s16(vsum, vdupq_n_s16(0));
113
+#if HIGH_BIT_DEPTH
114
+            *(int16x8_t *)&dstcol = vsum;
115
+#else
116
+            uint8x16_t usum = vuzp1q_u8(vsum, vsum);
117
+            *(uint8x8_t *)&dstcol = vget_low_u8(usum);
118
+#endif
119
+
120
+        }
121
+
122
+        src += srcStride;
123
+        dst += dstStride;
124
+    }
125
+}
126
+
127
+#if HIGH_BIT_DEPTH
128
+
129
+template<int N, int width, int height>
130
+void interp_horiz_ps_neon(const uint16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx,
131
+                          int isRowExt)
132
+{
133
+    const int16_t *coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
134
+    const int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
135
+    const int shift = IF_FILTER_PREC - headRoom;
136
+    const int offset = (unsigned) - IF_INTERNAL_OFFS << shift;
137
+
138
+    int blkheight = height;
139
+    src -= N / 2 - 1;
140
+
141
+    if (isRowExt)
142
+    {
143
+        src -= (N / 2 - 1) * srcStride;
144
+        blkheight += N - 1;
145
+    }
146
+    int16x8_t vc3 = vld1q_s16(coeff);
147
+    const int32x4_t voffset = vdupq_n_s32(offset);
148
+    const int32x4_t vhr = vdupq_n_s32(-shift);
149
+
150
+    int row, col;
151
+    for (row = 0; row < blkheight; row++)
152
+    {
153
+        for (col = 0; col < width; col += 8)
154
+        {
155
+            int32x4_t vsum, vsum2;
156
+
157
+            int16x8_t inputN;
158
+            for (int i = 0; i < N; i++)
159
+            {
160
+                inputi = vld1q_s16((int16_t *)&srccol + i);
161
+            }
162
+
163
+            vsum = voffset;
164
+            vsum2 = voffset;
165
+
166
+            vsum = vmlal_lane_s16(vsum, vget_low_u16(input0), vget_low_s16(vc3), 0);
167
+            vsum2 = vmlal_high_lane_s16(vsum2, input0, vget_low_s16(vc3), 0);
168
+
169
+            vsum = vmlal_lane_s16(vsum, vget_low_u16(input1), vget_low_s16(vc3), 1);
170
+            vsum2 = vmlal_high_lane_s16(vsum2, input1, vget_low_s16(vc3), 1);
171
+
172
+            vsum = vmlal_lane_s16(vsum, vget_low_u16(input2), vget_low_s16(vc3), 2);
173
+            vsum2 = vmlal_high_lane_s16(vsum2, input2, vget_low_s16(vc3), 2);
174
+
175
+            vsum = vmlal_lane_s16(vsum, vget_low_u16(input3), vget_low_s16(vc3), 3);
176
+            vsum2 = vmlal_high_lane_s16(vsum2, input3, vget_low_s16(vc3), 3);
177
+
178
+            if (N == 8)
179
+            {
180
+                vsum = vmlal_lane_s16(vsum, vget_low_s16(input4), vget_high_s16(vc3), 0);
181
+                vsum2 = vmlal_high_lane_s16(vsum2, input4, vget_high_s16(vc3), 0);
182
+
183
+                vsum = vmlal_lane_s16(vsum, vget_low_s16(input5), vget_high_s16(vc3), 1);
184
+                vsum2 = vmlal_high_lane_s16(vsum2, input5, vget_high_s16(vc3), 1);
185
+
186
+                vsum = vmlal_lane_s16(vsum, vget_low_s16(input6), vget_high_s16(vc3), 2);
187
+                vsum2 = vmlal_high_lane_s16(vsum2, input6, vget_high_s16(vc3), 2);
188
+
189
+                vsum = vmlal_lane_s16(vsum, vget_low_s16(input7), vget_high_s16(vc3), 3);
190
+                vsum2 = vmlal_high_lane_s16(vsum2, input7, vget_high_s16(vc3), 3);
191
+            }
192
+
193
+            vsum = vshlq_s32(vsum, vhr);
194
+            vsum2 = vshlq_s32(vsum2, vhr);
195
+            *(int16x4_t *)&dstcol = vmovn_u32(vsum);
196
+            *(int16x4_t *)&dstcol+4 = vmovn_u32(vsum2);
197
+        }
198
+
199
+        src += srcStride;
200
+        dst += dstStride;
201
+    }
202
+}
203
+
204
+
205
+#else
206
+
207
+template<int N, int width, int height>
208
+void interp_horiz_ps_neon(const uint8_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx,
209
+                          int isRowExt)
210
+{
211
+    const int16_t *coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
212
+    const int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
213
+    const int shift = IF_FILTER_PREC - headRoom;
214
+    const int offset = (unsigned) - IF_INTERNAL_OFFS << shift;
215
+
216
+    int blkheight = height;
217
+    src -= N / 2 - 1;
218
+
219
+    if (isRowExt)
220
+    {
221
+        src -= (N / 2 - 1) * srcStride;
222
+        blkheight += N - 1;
223
+    }
224
+    int16x8_t vc;
225
+    vc = *(int16x8_t *)coeff;
226
+
227
+    const int16x8_t voffset = vdupq_n_s16(offset);
228
+    const int16x8_t vhr = vdupq_n_s16(-shift);
229
+
230
+    int row, col;
231
+    for (row = 0; row < blkheight; row++)
232
+    {
233
+        for (col = 0; col < width; col += 8)
234
+        {
235
+            int16x8_t vsum;
236
+
237
+            int16x8_t inputN;
238
+
239
+            for (int i = 0; i < N; i++)
240
+            {
241
+                inputi = vmovl_u8(*(uint8x8_t *)&srccol + i);
242
+            }
243
+            vsum = voffset;
244
+            vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0);
245
+            vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1);
246
+            vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2);
247
+            vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3);
248
+
249
+
250
+            if (N == 8)
251
+            {
252
+                vsum = vmlaq_laneq_s16(vsum, (input4), vc, 4);
253
+                vsum = vmlaq_laneq_s16(vsum, (input5), vc, 5);
254
+                vsum = vmlaq_laneq_s16(vsum, (input6), vc, 6);
255
+                vsum = vmlaq_laneq_s16(vsum, (input7), vc, 7);
256
+
257
+            }
258
+
259
+            vsum = vshlq_s16(vsum, vhr);
260
+            *(int16x8_t *)&dstcol = vsum;
261
+        }
262
+
263
+        src += srcStride;
264
+        dst += dstStride;
265
+    }
266
+}
267
+
268
+#endif
269
+
270
+
271
+template<int N, int width, int height>
272
+void interp_vert_ss_neon(const int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
273
+{
274
+    const int16_t *c = (N == 8 ? g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]);
275
+    int shift = IF_FILTER_PREC;
276
+    src -= (N / 2 - 1) * srcStride;
277
+    int16x8_t vc;
278
+    vc = *(int16x8_t *)c;
279
+    int16x4_t low_vc = vget_low_s16(vc);
280
+    int16x4_t high_vc = vget_high_s16(vc);
281
+
282
+    const int32x4_t vhr = vdupq_n_s32(-shift);
283
+
284
+    int row, col;
285
+    for (row = 0; row < height; row++)
286
+    {
287
+        for (col = 0; col < width; col += 8)
288
+        {
289
+            int32x4_t vsum1, vsum2;
290
+
291
+            int16x8_t inputN;
292
+
293
+            for (int i = 0; i < N; i++)
294
+            {
295
+                inputi = *(int16x8_t *)&srccol + i * srcStride;
296
+            }
297
+
298
+            vsum1 = vmull_lane_s16(vget_low_s16(input0), low_vc, 0);
299
+            vsum2 = vmull_high_lane_s16(input0, low_vc, 0);
300
+
301
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1);
302
+            vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1);
303
+
304
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2);
305
+            vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2);
306
+
307
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3);
308
+            vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3);
309
+
310
+            if (N == 8)
311
+            {
312
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0);
313
+                vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0);
314
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1);
315
+                vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1);
316
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2);
317
+                vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2);
318
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3);
319
+                vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3);
320
+
321
+            }
322
+
323
+            vsum1 = vshlq_s32(vsum1, vhr);
324
+            vsum2 = vshlq_s32(vsum2, vhr);
325
+
326
+            int16x8_t vsum = vuzp1q_s16(vsum1, vsum2);
327
+            *(int16x8_t *)&dstcol = vsum;
328
+        }
329
+
330
+        src += srcStride;
331
+        dst += dstStride;
332
+    }
333
+
334
+}
335
+
336
+
337
+#if HIGH_BIT_DEPTH
338
+
339
+template<int N, int width, int height>
340
+void interp_vert_pp_neon(const uint16_t *src, intptr_t srcStride, uint16_t *dst, intptr_t dstStride, int coeffIdx)
341
+{
342
+
343
+    const int16_t *c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
344
+    int shift = IF_FILTER_PREC;
345
+    int offset = 1 << (shift - 1);
346
+    const uint16_t maxVal = (1 << X265_DEPTH) - 1;
347
+
348
+    src -= (N / 2 - 1) * srcStride;
349
+    int16x8_t vc;
350
+    vc = *(int16x8_t *)c;
351
+    int32x4_t low_vc = vmovl_s16(vget_low_s16(vc));
352
+    int32x4_t high_vc = vmovl_s16(vget_high_s16(vc));
353
+
354
+    const int32x4_t voffset = vdupq_n_s32(offset);
355
+    const int32x4_t vhr = vdupq_n_s32(-shift);
356
+
357
+    int row, col;
358
+    for (row = 0; row < height; row++)
359
+    {
360
+        for (col = 0; col < width; col += 4)
361
+        {
362
+            int32x4_t vsum;
363
+
364
+            int32x4_t inputN;
365
+
366
+            for (int i = 0; i < N; i++)
367
+            {
368
+                inputi = vmovl_u16(*(uint16x4_t *)&srccol + i * srcStride);
369
+            }
370
+            vsum = voffset;
371
+
372
+            vsum = vmlaq_laneq_s32(vsum, (input0), low_vc, 0);
373
+            vsum = vmlaq_laneq_s32(vsum, (input1), low_vc, 1);
374
+            vsum = vmlaq_laneq_s32(vsum, (input2), low_vc, 2);
375
+            vsum = vmlaq_laneq_s32(vsum, (input3), low_vc, 3);
376
+
377
+            if (N == 8)
378
+            {
379
+                vsum = vmlaq_laneq_s32(vsum, (input4), high_vc, 0);
380
+                vsum = vmlaq_laneq_s32(vsum, (input5), high_vc, 1);
381
+                vsum = vmlaq_laneq_s32(vsum, (input6), high_vc, 2);
382
+                vsum = vmlaq_laneq_s32(vsum, (input7), high_vc, 3);
383
+            }
384
+
385
+            vsum = vshlq_s32(vsum, vhr);
386
+            vsum = vminq_s32(vsum, vdupq_n_s32(maxVal));
387
+            vsum = vmaxq_s32(vsum, vdupq_n_s32(0));
388
+            *(uint16x4_t *)&dstcol = vmovn_u32(vsum);
389
+        }
390
+        src += srcStride;
391
+        dst += dstStride;
392
+    }
393
+}
394
+
395
+
396
+
397
+
398
+#else
399
+
400
+template<int N, int width, int height>
401
+void interp_vert_pp_neon(const uint8_t *src, intptr_t srcStride, uint8_t *dst, intptr_t dstStride, int coeffIdx)
402
+{
403
+
404
+    const int16_t *c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
405
+    int shift = IF_FILTER_PREC;
406
+    int offset = 1 << (shift - 1);
407
+    const uint16_t maxVal = (1 << X265_DEPTH) - 1;
408
+
409
+    src -= (N / 2 - 1) * srcStride;
410
+    int16x8_t vc;
411
+    vc = *(int16x8_t *)c;
412
+
413
+    const int16x8_t voffset = vdupq_n_s16(offset);
414
+    const int16x8_t vhr = vdupq_n_s16(-shift);
415
+
416
+    int row, col;
417
+    for (row = 0; row < height; row++)
418
+    {
419
+        for (col = 0; col < width; col += 8)
420
+        {
421
+            int16x8_t vsum;
422
+
423
+            int16x8_t inputN;
424
+
425
+            for (int i = 0; i < N; i++)
426
+            {
427
+                inputi = vmovl_u8(*(uint8x8_t *)&srccol + i * srcStride);
428
+            }
429
+            vsum = voffset;
430
+
431
+            vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0);
432
+            vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1);
433
+            vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2);
434
+            vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3);
435
+
436
+            if (N == 8)
437
+            {
438
+                vsum = vmlaq_laneq_s16(vsum, (input4), vc, 4);
439
+                vsum = vmlaq_laneq_s16(vsum, (input5), vc, 5);
440
+                vsum = vmlaq_laneq_s16(vsum, (input6), vc, 6);
441
+                vsum = vmlaq_laneq_s16(vsum, (input7), vc, 7);
442
+
443
+            }
444
+
445
+            vsum = vshlq_s16(vsum, vhr);
446
+
447
+            vsum = vminq_s16(vsum, vdupq_n_s16(maxVal));
448
+            vsum = vmaxq_s16(vsum, vdupq_n_s16(0));
449
+            uint8x16_t usum = vuzp1q_u8(vsum, vsum);
450
+            *(uint8x8_t *)&dstcol = vget_low_u8(usum);
451
+
452
+        }
453
+
454
+        src += srcStride;
455
+        dst += dstStride;
456
+    }
457
+}
458
+
459
+
460
+#endif
461
+
462
+
463
+#if HIGH_BIT_DEPTH
464
+
465
+template<int N, int width, int height>
466
+void interp_vert_ps_neon(const uint16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
467
+{
468
+    const int16_t *c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
469
+    int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
470
+    int shift = IF_FILTER_PREC - headRoom;
471
+    int offset = (unsigned) - IF_INTERNAL_OFFS << shift;
472
+    src -= (N / 2 - 1) * srcStride;
473
+
474
+    int16x8_t vc;
475
+    vc = *(int16x8_t *)c;
476
+    int32x4_t low_vc = vmovl_s16(vget_low_s16(vc));
477
+    int32x4_t high_vc = vmovl_s16(vget_high_s16(vc));
478
+
479
+    const int32x4_t voffset = vdupq_n_s32(offset);
480
+    const int32x4_t vhr = vdupq_n_s32(-shift);
481
+
482
+    int row, col;
483
+    for (row = 0; row < height; row++)
484
+    {
485
+        for (col = 0; col < width; col += 4)
486
+        {
487
+            int16x8_t vsum;
488
+
489
+            int16x8_t inputN;
490
+
491
+            for (int i = 0; i < N; i++)
492
+            {
493
+                inputi = vmovl_u16(*(uint16x4_t *)&srccol + i * srcStride);
494
+            }
495
+            vsum = voffset;
496
+
497
+            vsum = vmlaq_laneq_s32(vsum, (input0), low_vc, 0);
498
+            vsum = vmlaq_laneq_s32(vsum, (input1), low_vc, 1);
499
+            vsum = vmlaq_laneq_s32(vsum, (input2), low_vc, 2);
500
+            vsum = vmlaq_laneq_s32(vsum, (input3), low_vc, 3);
501
+
502
+            if (N == 8)
503
+            {
504
+                int16x8_t  vsum1 = vmulq_laneq_s32((input4), high_vc, 0);
505
+                vsum1 = vmlaq_laneq_s32(vsum1, (input5), high_vc, 1);
506
+                vsum1 = vmlaq_laneq_s32(vsum1, (input6), high_vc, 2);
507
+                vsum1 = vmlaq_laneq_s32(vsum1, (input7), high_vc, 3);
508
+                vsum = vaddq_s32(vsum, vsum1);
509
+            }
510
+
511
+            vsum = vshlq_s32(vsum, vhr);
512
+
513
+            *(uint16x4_t *)&dstcol = vmovn_s32(vsum);
514
+        }
515
+
516
+        src += srcStride;
517
+        dst += dstStride;
518
+    }
519
+}
520
+
521
+#else
522
+
523
+template<int N, int width, int height>
524
+void interp_vert_ps_neon(const uint8_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx)
525
+{
526
+    const int16_t *c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx];
527
+    int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
528
+    int shift = IF_FILTER_PREC - headRoom;
529
+    int offset = (unsigned) - IF_INTERNAL_OFFS << shift;
530
+    src -= (N / 2 - 1) * srcStride;
531
+
532
+    int16x8_t vc;
533
+    vc = *(int16x8_t *)c;
534
+
535
+    const int16x8_t voffset = vdupq_n_s16(offset);
536
+    const int16x8_t vhr = vdupq_n_s16(-shift);
537
+
538
+    int row, col;
539
+    for (row = 0; row < height; row++)
540
+    {
541
+        for (col = 0; col < width; col += 8)
542
+        {
543
+            int16x8_t vsum;
544
+
545
+            int16x8_t inputN;
546
+
547
+            for (int i = 0; i < N; i++)
548
+            {
549
+                inputi = vmovl_u8(*(uint8x8_t *)&srccol + i * srcStride);
550
+            }
551
+            vsum = voffset;
552
+
553
+            vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0);
554
+            vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1);
555
+            vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2);
556
+            vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3);
557
+
558
+            if (N == 8)
559
+            {
560
+                int16x8_t  vsum1 = vmulq_laneq_s16((input4), vc, 4);
561
+                vsum1 = vmlaq_laneq_s16(vsum1, (input5), vc, 5);
562
+                vsum1 = vmlaq_laneq_s16(vsum1, (input6), vc, 6);
563
+                vsum1 = vmlaq_laneq_s16(vsum1, (input7), vc, 7);
564
+                vsum = vaddq_s16(vsum, vsum1);
565
+            }
566
+
567
+            vsum = vshlq_s32(vsum, vhr);
568
+            *(int16x8_t *)&dstcol = vsum;
569
+        }
570
+
571
+        src += srcStride;
572
+        dst += dstStride;
573
+    }
574
+}
575
+
576
+#endif
577
+
578
+
579
+
580
+template<int N, int width, int height>
581
+void interp_vert_sp_neon(const int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
582
+{
583
+    int headRoom = IF_INTERNAL_PREC - X265_DEPTH;
584
+    int shift = IF_FILTER_PREC + headRoom;
585
+    int offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC);
586
+    uint16_t maxVal = (1 << X265_DEPTH) - 1;
587
+    const int16_t *coeff = (N == 8 ? g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]);
588
+
589
+    src -= (N / 2 - 1) * srcStride;
590
+
591
+    int16x8_t vc;
592
+    vc = *(int16x8_t *)coeff;
593
+    int16x4_t low_vc = vget_low_s16(vc);
594
+    int16x4_t high_vc = vget_high_s16(vc);
595
+
596
+    const int32x4_t voffset = vdupq_n_s32(offset);
597
+    const int32x4_t vhr = vdupq_n_s32(-shift);
598
+
599
+    int row, col;
600
+    for (row = 0; row < height; row++)
601
+    {
602
+        for (col = 0; col < width; col += 8)
603
+        {
604
+            int32x4_t vsum1, vsum2;
605
+
606
+            int16x8_t inputN;
607
+
608
+            for (int i = 0; i < N; i++)
609
+            {
610
+                inputi = *(int16x8_t *)&srccol + i * srcStride;
611
+            }
612
+            vsum1 = voffset;
613
+            vsum2 = voffset;
614
+
615
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input0), low_vc, 0);
616
+            vsum2 = vmlal_high_lane_s16(vsum2, input0, low_vc, 0);
617
+
618
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1);
619
+            vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1);
620
+
621
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2);
622
+            vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2);
623
+
624
+            vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3);
625
+            vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3);
626
+
627
+            if (N == 8)
628
+            {
629
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0);
630
+                vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0);
631
+
632
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1);
633
+                vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1);
634
+
635
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2);
636
+                vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2);
637
+
638
+                vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3);
639
+                vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3);
640
+            }
641
+
642
+            vsum1 = vshlq_s32(vsum1, vhr);
643
+            vsum2 = vshlq_s32(vsum2, vhr);
644
+
645
+            int16x8_t vsum = vuzp1q_s16(vsum1, vsum2);
646
+            vsum = vminq_s16(vsum, vdupq_n_s16(maxVal));
647
+            vsum = vmaxq_s16(vsum, vdupq_n_s16(0));
648
+#if HIGH_BIT_DEPTH
649
+            *(int16x8_t *)&dstcol = vsum;
650
+#else
651
+            uint8x16_t usum = vuzp1q_u8(vsum, vsum);
652
+            *(uint8x8_t *)&dstcol = vget_low_u8(usum);
653
+#endif
654
+
655
+        }
656
+
657
+        src += srcStride;
658
+        dst += dstStride;
659
+    }
660
+}
661
+
662
+
663
+
664
+
665
+
666
+
667
+template<int N, int width, int height>
668
+void interp_hv_pp_neon(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
669
+{
670
+    ALIGN_VAR_32(int16_t, immedwidth * (height + N - 1));
671
+
672
+    interp_horiz_ps_neon<N, width, height>(src, srcStride, immed, width, idxX, 1);
673
+    interp_vert_sp_neon<N, width, height>(immed + (N / 2 - 1) * width, width, dst, dstStride, idxY);
674
+}
675
+
676
+
677
+
678
+}
679
+
680
+
681
+
682
+
683
+namespace X265_NS
684
+{
685
+#if defined(__APPLE__)
686
+#define CHROMA_420(W, H) \
687
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
688
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
689
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
690
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;  \
691
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>;  \
692
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
693
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
694
+    
695
+#define CHROMA_FILTER_420(W, H) \
696
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>;
697
+    
698
+#else // defined(__APPLE__)
699
+#define CHROMA_420(W, H) \
700
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \
701
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
702
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
703
+    
704
+#define CHROMA_FILTER_420(W, H) \
705
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
706
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \
707
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
708
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
709
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;
710
+#endif // defined(__APPLE__)
711
+
712
+#if defined(__APPLE__)
713
+#define CHROMA_422(W, H) \
714
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
715
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
716
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
717
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;  \
718
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>;  \
719
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
720
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
721
+    
722
+#define CHROMA_FILTER_422(W, H) \
723
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>;
724
+    
725
+#else // defined(__APPLE__)
726
+#define CHROMA_422(W, H) \
727
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \
728
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
729
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
730
+    
731
+#define CHROMA_FILTER_422(W, H) \
732
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
733
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \
734
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
735
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
736
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;
737
+#endif // defined(__APPLE__)
738
+
739
+#if defined(__APPLE__)
740
+#define CHROMA_444(W, H) \
741
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
742
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
743
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
744
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;  \
745
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>;  \
746
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
747
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
748
+
749
+#define CHROMA_FILTER_444(W, H) \
750
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>;
751
+    
752
+#else // defined(__APPLE__)
753
+#define CHROMA_444(W, H) \
754
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
755
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>;
756
+    
757
+#define CHROMA_FILTER_444(W, H) \
758
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \
759
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \
760
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>;  \
761
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>;  \
762
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>;  \
763
+    p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>;
764
+#endif // defined(__APPLE__)
765
+
766
+#if defined(__APPLE__)
767
+#define LUMA(W, H) \
768
+    p.puLUMA_ ## W ## x ## H.luma_hpp     = interp_horiz_pp_neon<8, W, H>; \
769
+    p.puLUMA_ ## W ## x ## H.luma_vpp     = interp_vert_pp_neon<8, W, H>;  \
770
+    p.puLUMA_ ## W ## x ## H.luma_vps     = interp_vert_ps_neon<8, W, H>;  \
771
+    p.puLUMA_ ## W ## x ## H.luma_vsp     = interp_vert_sp_neon<8, W, H>;  \
772
+    p.puLUMA_ ## W ## x ## H.luma_vss     = interp_vert_ss_neon<8, W, H>;  \
773
+    p.puLUMA_ ## W ## x ## H.luma_hvpp    = interp_hv_pp_neon<8, W, H>; \
774
+    p.puLUMA_ ## W ## x ## H.convert_p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
775
+    p.puLUMA_ ## W ## x ## H.convert_p2sALIGNED = filterPixelToShort_neon<W, H>;
776
+    
777
+#else // defined(__APPLE__)
778
+#define LUMA(W, H) \
779
+    p.puLUMA_ ## W ## x ## H.luma_vss     = interp_vert_ss_neon<8, W, H>;  \
780
+    p.puLUMA_ ## W ## x ## H.convert_p2sNONALIGNED = filterPixelToShort_neon<W, H>;\
781
+    p.puLUMA_ ## W ## x ## H.convert_p2sALIGNED = filterPixelToShort_neon<W, H>;
782
+    
783
+#define LUMA_FILTER(W, H) \
784
+    p.puLUMA_ ## W ## x ## H.luma_hpp     = interp_horiz_pp_neon<8, W, H>; \
785
+    p.puLUMA_ ## W ## x ## H.luma_vpp     = interp_vert_pp_neon<8, W, H>;  \
786
+    p.puLUMA_ ## W ## x ## H.luma_vps     = interp_vert_ps_neon<8, W, H>;  \
787
+    p.puLUMA_ ## W ## x ## H.luma_vsp     = interp_vert_sp_neon<8, W, H>;  \
788
+    p.puLUMA_ ## W ## x ## H.luma_hvpp    = interp_hv_pp_neon<8, W, H>;
789
+#endif // defined(__APPLE__)
790
+
791
+void setupFilterPrimitives_neon(EncoderPrimitives &p)
792
+{
793
+
794
+    // All neon functions assume width of multiple of 8, (2,4,12 variants are not optimized)
795
+
796
+    LUMA(8, 8);
797
+    LUMA(8, 4);
798
+    LUMA(16, 16);
799
+    CHROMA_420(8,  8);
800
+    LUMA(16,  8);
801
+    CHROMA_420(8,  4);
802
+    LUMA(8, 16);
803
+    LUMA(16, 12);
804
+    CHROMA_420(8,  6);
805
+    LUMA(16,  4);
806
+    CHROMA_420(8,  2);
807
+    LUMA(32, 32);
808
+    CHROMA_420(16, 16);
809
+    LUMA(32, 16);
810
+    CHROMA_420(16, 8);
811
+    LUMA(16, 32);
812
+    CHROMA_420(8,  16);
813
+    LUMA(32, 24);
814
+    CHROMA_420(16, 12);
815
+    LUMA(24, 32);
816
+    LUMA(32,  8);
817
+    CHROMA_420(16, 4);
818
+    LUMA(8, 32);
819
+    LUMA(64, 64);
820
+    CHROMA_420(32, 32);
821
+    LUMA(64, 32);
822
+    CHROMA_420(32, 16);
823
+    LUMA(32, 64);
824
+    CHROMA_420(16, 32);
825
+    LUMA(64, 48);
826
+    CHROMA_420(32, 24);
827
+    LUMA(48, 64);
828
+    CHROMA_420(24, 32);
829
+    LUMA(64, 16);
830
+    CHROMA_420(32, 8);
831
+    LUMA(16, 64);
832
+    CHROMA_420(8,  32);
833
+    CHROMA_422(8,  16);
834
+    CHROMA_422(8,  8);
835
+    CHROMA_422(8,  12);
836
+    CHROMA_422(8,  4);
837
+    CHROMA_422(16, 32);
838
+    CHROMA_422(16, 16);
839
+    CHROMA_422(8,  32);
840
+    CHROMA_422(16, 24);
841
+    CHROMA_422(16, 8);
842
+    CHROMA_422(32, 64);
843
+    CHROMA_422(32, 32);
844
+    CHROMA_422(16, 64);
845
+    CHROMA_422(32, 48);
846
+    CHROMA_422(24, 64);
847
+    CHROMA_422(32, 16);
848
+    CHROMA_422(8,  64);
849
+    CHROMA_444(8,  8);
850
+    CHROMA_444(8,  4);
851
+    CHROMA_444(16, 16);
852
+    CHROMA_444(16, 8);
853
+    CHROMA_444(8,  16);
854
+    CHROMA_444(16, 12);
855
+    CHROMA_444(16, 4);
856
+    CHROMA_444(32, 32);
857
+    CHROMA_444(32, 16);
858
+    CHROMA_444(16, 32);
859
+    CHROMA_444(32, 24);
860
+    CHROMA_444(24, 32);
861
+    CHROMA_444(32, 8);
862
+    CHROMA_444(8,  32);
863
+    CHROMA_444(64, 64);
864
+    CHROMA_444(64, 32);
865
+    CHROMA_444(32, 64);
866
+    CHROMA_444(64, 48);
867
+    CHROMA_444(48, 64);
868
+    CHROMA_444(64, 16);
869
+    CHROMA_444(16, 64);
870
+
871
+#if defined(__APPLE__) || HIGH_BIT_DEPTH
872
+    p.pu[LUMA_8x4].luma_hps     = interp_horiz_ps_neon<8, 8, 4>;
873
+    p.pu[LUMA_8x8].luma_hps     = interp_horiz_ps_neon<8, 8, 8>;
874
+    p.pu[LUMA_8x16].luma_hps     = interp_horiz_ps_neon<8, 8, 16>;
875
+    p.pu[LUMA_8x32].luma_hps     = interp_horiz_ps_neon<8, 8, 32>;
876
+#endif // HIGH_BIT_DEPTH
877
+
878
+#if !defined(__APPLE__) && HIGH_BIT_DEPTH
879
+    p.pu[LUMA_24x32].luma_hps     = interp_horiz_ps_neon<8, 24, 32>;
880
+#endif // !defined(__APPLE__)
881
+
882
+#if !defined(__APPLE__)
883
+    p.pu[LUMA_32x8].luma_hpp      = interp_horiz_pp_neon<8, 32, 8>;
884
+    p.pu[LUMA_32x16].luma_hpp     = interp_horiz_pp_neon<8, 32, 16>;
885
+    p.pu[LUMA_32x24].luma_hpp     = interp_horiz_pp_neon<8, 32, 24>;
886
+    p.pu[LUMA_32x32].luma_hpp     = interp_horiz_pp_neon<8, 32, 32>;
887
+    p.pu[LUMA_32x64].luma_hpp     = interp_horiz_pp_neon<8, 32, 64>;
888
+    p.pu[LUMA_48x64].luma_hpp     = interp_horiz_pp_neon<8, 48, 64>;
889
+    p.pu[LUMA_64x16].luma_hpp     = interp_horiz_pp_neon<8, 64, 16>;
890
+    p.pu[LUMA_64x32].luma_hpp     = interp_horiz_pp_neon<8, 64, 32>;
891
+    p.pu[LUMA_64x48].luma_hpp     = interp_horiz_pp_neon<8, 64, 48>;
892
+    p.pu[LUMA_64x64].luma_hpp     = interp_horiz_pp_neon<8, 64, 64>;
893
+
894
+    LUMA_FILTER(8, 4);
895
+    LUMA_FILTER(8, 8);
896
+    LUMA_FILTER(8, 16);
897
+    LUMA_FILTER(8, 32);
898
+    LUMA_FILTER(24, 32);
899
+
900
+    LUMA_FILTER(16, 32);
901
+    LUMA_FILTER(32, 16);
902
+    LUMA_FILTER(32, 24);
903
+    LUMA_FILTER(32, 32);
904
+    LUMA_FILTER(32, 64);
905
+    LUMA_FILTER(48, 64);
906
+    LUMA_FILTER(64, 32);
907
+    LUMA_FILTER(64, 48);
908
+    LUMA_FILTER(64, 64);
909
+    
910
+    CHROMA_FILTER_420(24, 32);
911
+    
912
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = interp_horiz_pp_neon<4, 32, 8>;
913
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = interp_horiz_pp_neon<4, 32, 16>;
914
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = interp_horiz_pp_neon<4, 32, 24>;
915
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = interp_horiz_pp_neon<4, 32, 32>;
916
+    
917
+    CHROMA_FILTER_422(24, 64);
918
+    
919
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = interp_horiz_pp_neon<4, 32, 16>;
920
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = interp_horiz_pp_neon<4, 32, 32>;
921
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = interp_horiz_pp_neon<4, 32, 48>;
922
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = interp_horiz_pp_neon<4, 32, 64>;
923
+    
924
+    CHROMA_FILTER_444(24, 32);
925
+    
926
+    p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp  = interp_horiz_pp_neon<4, 32, 8>;
927
+    p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = interp_horiz_pp_neon<4, 32, 16>;
928
+    p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = interp_horiz_pp_neon<4, 32, 24>;
929
+    p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = interp_horiz_pp_neon<4, 32, 32>;
930
+    p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = interp_horiz_pp_neon<4, 32, 64>;
931
+    p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = interp_horiz_pp_neon<4, 48, 64>;
932
+    p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = interp_horiz_pp_neon<4, 64, 16>;
933
+    p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = interp_horiz_pp_neon<4, 64, 32>;
934
+    p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = interp_horiz_pp_neon<4, 64, 48>;
935
+    p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = interp_horiz_pp_neon<4, 64, 64>;
936
+    
937
+    p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss  = interp_vert_ss_neon<4, 16, 4>;
938
+    p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss  = interp_vert_ss_neon<4, 16, 8>;
939
+    p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = interp_vert_ss_neon<4, 16, 12>;
940
+    p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = interp_vert_ss_neon<4, 16, 16>;
941
+    p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = interp_vert_ss_neon<4, 16, 32>;
942
+    p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = interp_vert_ss_neon<4, 16, 64>;
943
+    p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss  = interp_vert_ss_neon<4, 32, 8>;
944
+    p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = interp_vert_ss_neon<4, 32, 16>;
945
+    p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = interp_vert_ss_neon<4, 32, 24>;
946
+    p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = interp_vert_ss_neon<4, 32, 32>;
947
+    p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = interp_vert_ss_neon<4, 32, 64>;
948
+#endif // !defined(__APPLE__)
949
+
950
+    CHROMA_FILTER_420(8, 2);
951
+    CHROMA_FILTER_420(8, 4);
952
+    CHROMA_FILTER_420(8, 6);
953
+    CHROMA_FILTER_420(8, 8);
954
+    CHROMA_FILTER_420(8, 16);
955
+    CHROMA_FILTER_420(8, 32);
956
+    
957
+    CHROMA_FILTER_422(8, 4);
958
+    CHROMA_FILTER_422(8, 8);
959
+    CHROMA_FILTER_422(8, 12);
960
+    CHROMA_FILTER_422(8, 16);
961
+    CHROMA_FILTER_422(8, 32);
962
+    CHROMA_FILTER_422(8, 64);
963
+    
964
+    CHROMA_FILTER_444(8, 4);
965
+    CHROMA_FILTER_444(8, 8);
966
+    CHROMA_FILTER_444(8, 16);
967
+    CHROMA_FILTER_444(8, 32);
968
+    
969
+#if defined(__APPLE__)
970
+    CHROMA_FILTER_420(16, 4);
971
+    CHROMA_FILTER_420(16, 8);
972
+    CHROMA_FILTER_420(16, 12);
973
+    CHROMA_FILTER_420(16, 16);
974
+    CHROMA_FILTER_420(16, 32);
975
+
976
+    CHROMA_FILTER_422(16, 8);
977
+    CHROMA_FILTER_422(16, 16);
978
+    CHROMA_FILTER_422(16, 24);
979
+    CHROMA_FILTER_422(16, 32);
980
+    CHROMA_FILTER_422(16, 64);
981
+    
982
+    CHROMA_FILTER_444(16, 4);
983
+    CHROMA_FILTER_444(16, 8);
984
+    CHROMA_FILTER_444(16, 12);
985
+    CHROMA_FILTER_444(16, 16);
986
+    CHROMA_FILTER_444(16, 32);
987
+    CHROMA_FILTER_444(16, 64);
988
+#endif // defined(__APPLE__)
989
+}
990
+
991
+};
992
+
993
+
994
+#endif
995
+
996
+
997
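Note on the block above: the LUMA()/CHROMA_*() macros are only shorthand for filling the EncoderPrimitives function table. As a minimal sketch of the expansion, assuming the non-Apple definition of LUMA shown earlier, LUMA(16, 16) becomes:

    p.pu[LUMA_16x16].luma_vss     = interp_vert_ss_neon<8, 16, 16>;
    p.pu[LUMA_16x16].convert_p2s[NONALIGNED] = filterPixelToShort_neon<16, 16>;
    p.pu[LUMA_16x16].convert_p2s[ALIGNED] = filterPixelToShort_neon<16, 16>;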
x265_3.6.tar.gz/source/common/aarch64/filter-prim.h Added
23
 
1
@@ -0,0 +1,21 @@
2
+#ifndef _FILTER_PRIM_ARM64_H__
3
+#define _FILTER_PRIM_ARM64_H__
4
+
5
+
6
+#include "common.h"
7
+#include "slicetype.h"      // LOWRES_COST_MASK
8
+#include "primitives.h"
9
+#include "x265.h"
10
+
11
+
12
+namespace X265_NS
13
+{
14
+
15
+
16
+void setupFilterPrimitives_neon(EncoderPrimitives &p);
17
+
18
+};
19
+
20
+
21
+#endif
22
+
23
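The header above only exposes the single setup entry point; a minimal usage sketch (initNeonFilters is a hypothetical wrapper, the real call site in the encoder is not part of this diff):

    #include "filter-prim.h"

    void initNeonFilters(X265_NS::EncoderPrimitives &p)
    {
        X265_NS::setupFilterPrimitives_neon(p);   // fill in the NEON filter entries
    }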
x265_3.6.tar.gz/source/common/aarch64/fun-decls.h Added
258
 
1
@@ -0,0 +1,256 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2021 MulticoreWare, Inc
4
+ *
5
+ * Authors: Sebastian Pop <spop@amazon.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#define FUNCDEF_TU(ret, name, cpu, ...) \
26
+    ret PFX(name ## _4x4_ ## cpu(__VA_ARGS__)); \
27
+    ret PFX(name ## _8x8_ ## cpu(__VA_ARGS__)); \
28
+    ret PFX(name ## _16x16_ ## cpu(__VA_ARGS__)); \
29
+    ret PFX(name ## _32x32_ ## cpu(__VA_ARGS__)); \
30
+    ret PFX(name ## _64x64_ ## cpu(__VA_ARGS__))
31
+
32
+#define FUNCDEF_TU_S(ret, name, cpu, ...) \
33
+    ret PFX(name ## _4_ ## cpu(__VA_ARGS__)); \
34
+    ret PFX(name ## _8_ ## cpu(__VA_ARGS__)); \
35
+    ret PFX(name ## _16_ ## cpu(__VA_ARGS__)); \
36
+    ret PFX(name ## _32_ ## cpu(__VA_ARGS__)); \
37
+    ret PFX(name ## _64_ ## cpu(__VA_ARGS__))
38
+
39
+#define FUNCDEF_TU_S2(ret, name, cpu, ...) \
40
+    ret PFX(name ## 4_ ## cpu(__VA_ARGS__)); \
41
+    ret PFX(name ## 8_ ## cpu(__VA_ARGS__)); \
42
+    ret PFX(name ## 16_ ## cpu(__VA_ARGS__)); \
43
+    ret PFX(name ## 32_ ## cpu(__VA_ARGS__)); \
44
+    ret PFX(name ## 64_ ## cpu(__VA_ARGS__))
45
+
46
+#define FUNCDEF_PU(ret, name, cpu, ...) \
47
+    ret PFX(name ## _4x4_   ## cpu)(__VA_ARGS__); \
48
+    ret PFX(name ## _8x8_   ## cpu)(__VA_ARGS__); \
49
+    ret PFX(name ## _16x16_ ## cpu)(__VA_ARGS__); \
50
+    ret PFX(name ## _32x32_ ## cpu)(__VA_ARGS__); \
51
+    ret PFX(name ## _64x64_ ## cpu)(__VA_ARGS__); \
52
+    ret PFX(name ## _8x4_   ## cpu)(__VA_ARGS__); \
53
+    ret PFX(name ## _4x8_   ## cpu)(__VA_ARGS__); \
54
+    ret PFX(name ## _16x8_  ## cpu)(__VA_ARGS__); \
55
+    ret PFX(name ## _8x16_  ## cpu)(__VA_ARGS__); \
56
+    ret PFX(name ## _16x32_ ## cpu)(__VA_ARGS__); \
57
+    ret PFX(name ## _32x16_ ## cpu)(__VA_ARGS__); \
58
+    ret PFX(name ## _64x32_ ## cpu)(__VA_ARGS__); \
59
+    ret PFX(name ## _32x64_ ## cpu)(__VA_ARGS__); \
60
+    ret PFX(name ## _16x12_ ## cpu)(__VA_ARGS__); \
61
+    ret PFX(name ## _12x16_ ## cpu)(__VA_ARGS__); \
62
+    ret PFX(name ## _16x4_  ## cpu)(__VA_ARGS__); \
63
+    ret PFX(name ## _4x16_  ## cpu)(__VA_ARGS__); \
64
+    ret PFX(name ## _32x24_ ## cpu)(__VA_ARGS__); \
65
+    ret PFX(name ## _24x32_ ## cpu)(__VA_ARGS__); \
66
+    ret PFX(name ## _32x8_  ## cpu)(__VA_ARGS__); \
67
+    ret PFX(name ## _8x32_  ## cpu)(__VA_ARGS__); \
68
+    ret PFX(name ## _64x48_ ## cpu)(__VA_ARGS__); \
69
+    ret PFX(name ## _48x64_ ## cpu)(__VA_ARGS__); \
70
+    ret PFX(name ## _64x16_ ## cpu)(__VA_ARGS__); \
71
+    ret PFX(name ## _16x64_ ## cpu)(__VA_ARGS__)
72
+
73
+#define FUNCDEF_CHROMA_PU(ret, name, cpu, ...) \
74
+    FUNCDEF_PU(ret, name, cpu, __VA_ARGS__); \
75
+    ret PFX(name ## _4x2_ ## cpu)(__VA_ARGS__); \
76
+    ret PFX(name ## _4x4_ ## cpu)(__VA_ARGS__); \
77
+    ret PFX(name ## _2x4_ ## cpu)(__VA_ARGS__); \
78
+    ret PFX(name ## _8x2_ ## cpu)(__VA_ARGS__); \
79
+    ret PFX(name ## _2x8_ ## cpu)(__VA_ARGS__); \
80
+    ret PFX(name ## _8x6_ ## cpu)(__VA_ARGS__); \
81
+    ret PFX(name ## _6x8_ ## cpu)(__VA_ARGS__); \
82
+    ret PFX(name ## _8x12_ ## cpu)(__VA_ARGS__); \
83
+    ret PFX(name ## _12x8_ ## cpu)(__VA_ARGS__); \
84
+    ret PFX(name ## _6x16_ ## cpu)(__VA_ARGS__); \
85
+    ret PFX(name ## _16x6_ ## cpu)(__VA_ARGS__); \
86
+    ret PFX(name ## _2x16_ ## cpu)(__VA_ARGS__); \
87
+    ret PFX(name ## _16x2_ ## cpu)(__VA_ARGS__); \
88
+    ret PFX(name ## _4x12_ ## cpu)(__VA_ARGS__); \
89
+    ret PFX(name ## _12x4_ ## cpu)(__VA_ARGS__); \
90
+    ret PFX(name ## _32x12_ ## cpu)(__VA_ARGS__); \
91
+    ret PFX(name ## _12x32_ ## cpu)(__VA_ARGS__); \
92
+    ret PFX(name ## _32x4_ ## cpu)(__VA_ARGS__); \
93
+    ret PFX(name ## _4x32_ ## cpu)(__VA_ARGS__); \
94
+    ret PFX(name ## _32x48_ ## cpu)(__VA_ARGS__); \
95
+    ret PFX(name ## _48x32_ ## cpu)(__VA_ARGS__); \
96
+    ret PFX(name ## _16x24_ ## cpu)(__VA_ARGS__); \
97
+    ret PFX(name ## _24x16_ ## cpu)(__VA_ARGS__); \
98
+    ret PFX(name ## _8x64_ ## cpu)(__VA_ARGS__); \
99
+    ret PFX(name ## _64x8_ ## cpu)(__VA_ARGS__); \
100
+    ret PFX(name ## _64x24_ ## cpu)(__VA_ARGS__); \
101
+    ret PFX(name ## _24x64_ ## cpu)(__VA_ARGS__);
102
+
103
+#define DECLS(cpu) \
104
+    FUNCDEF_TU(void, cpy2Dto1D_shl, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \
105
+    FUNCDEF_TU(void, cpy2Dto1D_shr, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \
106
+    FUNCDEF_TU(void, cpy1Dto2D_shl, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \
107
+    FUNCDEF_TU(void, cpy1Dto2D_shl_aligned, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \
108
+    FUNCDEF_TU(void, cpy1Dto2D_shr, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \
109
+    FUNCDEF_TU_S(uint32_t, copy_cnt, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride); \
110
+    FUNCDEF_TU_S(int, count_nonzero, cpu, const int16_t* quantCoeff); \
111
+    FUNCDEF_TU(void, blockfill_s, cpu, int16_t* dst, intptr_t dstride, int16_t val); \
112
+    FUNCDEF_TU(void, blockfill_s_aligned, cpu, int16_t* dst, intptr_t dstride, int16_t val); \
113
+    FUNCDEF_CHROMA_PU(void, blockcopy_ss, cpu, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); \
114
+    FUNCDEF_CHROMA_PU(void, blockcopy_pp, cpu, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); \
115
+    FUNCDEF_PU(void, blockcopy_sp, cpu, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); \
116
+    FUNCDEF_PU(void, blockcopy_ps, cpu, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); \
117
+    FUNCDEF_PU(void, interp_8tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
118
+    FUNCDEF_PU(void, interp_8tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
119
+    FUNCDEF_PU(void, interp_8tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
120
+    FUNCDEF_PU(void, interp_8tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
121
+    FUNCDEF_PU(void, interp_8tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
122
+    FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
123
+    FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); \
124
+    FUNCDEF_CHROMA_PU(void, filterPixelToShort, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
125
+    FUNCDEF_CHROMA_PU(void, filterPixelToShort_aligned, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
126
+    FUNCDEF_CHROMA_PU(void, interp_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
127
+    FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
128
+    FUNCDEF_CHROMA_PU(void, interp_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
129
+    FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
130
+    FUNCDEF_CHROMA_PU(void, interp_4tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
131
+    FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
132
+    FUNCDEF_CHROMA_PU(void, interp_4tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
133
+    FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
134
+    FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
135
+    FUNCDEF_CHROMA_PU(void, addAvg_aligned, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
136
+    FUNCDEF_PU(void, pixel_avg_pp, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \
137
+    FUNCDEF_PU(void, pixel_avg_pp_aligned, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \
138
+    FUNCDEF_PU(void, sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
139
+    FUNCDEF_PU(void, sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
140
+    FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
141
+    FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
142
+    FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \
143
+    FUNCDEF_TU_S(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
144
+    FUNCDEF_TU_S(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \
145
+    FUNCDEF_PU(sse_t, pixel_sse_pp, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
146
+    FUNCDEF_CHROMA_PU(sse_t, pixel_sse_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \
147
+    FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \
148
+    FUNCDEF_PU(void, pixel_add_ps, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \
149
+    FUNCDEF_PU(void, pixel_add_ps_aligned, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \
150
+    FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
151
+    FUNCDEF_TU_S2(void, ssimDist, cpu, const pixel *fenc, uint32_t fStride, const pixel *recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k); \
152
+    FUNCDEF_TU_S2(void, normFact, cpu, const pixel *src, uint32_t blockSize, int shift, uint64_t *z_k)
153
+
154
+DECLS(neon);
155
+DECLS(sve);
156
+DECLS(sve2);
157
+
158
+
159
+void x265_pixel_planecopy_cp_neon(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
160
+
161
+uint64_t x265_pixel_var_8x8_neon(const pixel* pix, intptr_t stride);
162
+uint64_t x265_pixel_var_16x16_neon(const pixel* pix, intptr_t stride);
163
+uint64_t x265_pixel_var_32x32_neon(const pixel* pix, intptr_t stride);
164
+uint64_t x265_pixel_var_64x64_neon(const pixel* pix, intptr_t stride);
165
+
166
+void x265_getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
167
+void x265_getResidual8_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
168
+void x265_getResidual16_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
169
+void x265_getResidual32_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
170
+
171
+void x265_scale1D_128to64_neon(pixel *dst, const pixel *src);
172
+void x265_scale2D_64to32_neon(pixel* dst, const pixel* src, intptr_t stride);
173
+
174
+int x265_pixel_satd_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
175
+int x265_pixel_satd_4x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
176
+int x265_pixel_satd_4x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
177
+int x265_pixel_satd_4x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
178
+int x265_pixel_satd_8x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
179
+int x265_pixel_satd_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
180
+int x265_pixel_satd_8x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
181
+int x265_pixel_satd_8x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
182
+int x265_pixel_satd_8x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
183
+int x265_pixel_satd_8x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
184
+int x265_pixel_satd_12x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
185
+int x265_pixel_satd_12x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
186
+int x265_pixel_satd_16x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
187
+int x265_pixel_satd_16x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
188
+int x265_pixel_satd_16x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
189
+int x265_pixel_satd_16x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
190
+int x265_pixel_satd_16x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
191
+int x265_pixel_satd_16x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
192
+int x265_pixel_satd_16x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
193
+int x265_pixel_satd_24x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
194
+int x265_pixel_satd_24x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
195
+int x265_pixel_satd_32x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
196
+int x265_pixel_satd_32x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
197
+int x265_pixel_satd_32x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
198
+int x265_pixel_satd_32x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
199
+int x265_pixel_satd_32x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
200
+int x265_pixel_satd_32x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
201
+int x265_pixel_satd_48x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
202
+int x265_pixel_satd_64x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
203
+int x265_pixel_satd_64x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
204
+int x265_pixel_satd_64x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
205
+int x265_pixel_satd_64x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
206
+
207
+int x265_pixel_sa8d_8x8_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
208
+int x265_pixel_sa8d_8x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
209
+int x265_pixel_sa8d_16x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
210
+int x265_pixel_sa8d_16x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
211
+int x265_pixel_sa8d_32x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
212
+int x265_pixel_sa8d_32x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
213
+int x265_pixel_sa8d_64x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
214
+
215
+uint32_t PFX(quant_neon)(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff);
216
+uint32_t PFX(nquant_neon)(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
217
+
218
+void x265_dequant_scaling_neon(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift);
219
+void x265_dequant_normal_neon(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
220
+
221
+void x265_ssim_4x4x2_core_neon(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24);
222
+
223
+int PFX(psyCost_4x4_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
224
+int PFX(psyCost_8x8_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
225
+void PFX(weight_pp_neon)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
226
+void PFX(weight_sp_neon)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
227
+int PFX(scanPosLast_neon)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
228
+uint32_t PFX(costCoeffNxN_neon)(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase);
229
+
230
+uint64_t x265_pixel_var_8x8_sve2(const pixel* pix, intptr_t stride);
231
+uint64_t x265_pixel_var_16x16_sve2(const pixel* pix, intptr_t stride);
232
+uint64_t x265_pixel_var_32x32_sve2(const pixel* pix, intptr_t stride);
233
+uint64_t x265_pixel_var_64x64_sve2(const pixel* pix, intptr_t stride);
234
+
235
+void x265_getResidual16_sve2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
236
+void x265_getResidual32_sve2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
237
+
238
+void x265_scale1D_128to64_sve2(pixel *dst, const pixel *src);
239
+void x265_scale2D_64to32_sve2(pixel* dst, const pixel* src, intptr_t stride);
240
+
241
+int x265_pixel_satd_4x4_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
242
+int x265_pixel_satd_8x4_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
243
+int x265_pixel_satd_8x12_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
244
+int x265_pixel_satd_32x16_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
245
+int x265_pixel_satd_32x32_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
246
+int x265_pixel_satd_64x48_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
247
+
248
+uint32_t PFX(quant_sve)(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff);
249
+
250
+void x265_dequant_scaling_sve2(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift);
251
+void x265_dequant_normal_sve2(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
252
+
253
+void x265_ssim_4x4x2_core_sve2(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24);
254
+
255
+int PFX(psyCost_8x8_sve2)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
256
+void PFX(weight_sp_sve2)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
257
+int PFX(scanPosLast_sve2)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
258
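The FUNCDEF_* macros above only stamp out one declaration per block size for the assembly symbols. As a sketch, assuming PFX() resolves to the usual x265_ symbol prefix, FUNCDEF_TU(void, cpy2Dto1D_shl, neon, ...) declares:

    void x265_cpy2Dto1D_shl_4x4_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
    void x265_cpy2Dto1D_shl_8x8_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
    // ...and the 16x16, 32x32 and 64x64 variants, matching the exported assembly symbols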
x265_3.6.tar.gz/source/common/aarch64/intrapred-prim.cpp Added
267
 
1
@@ -0,0 +1,265 @@
2
+#include "common.h"
3
+#include "primitives.h"
4
+
5
+
6
+#if 1
7
+#include "arm64-utils.h"
8
+#include <arm_neon.h>
9
+
10
+using namespace X265_NS;
11
+
12
+namespace
13
+{
14
+
15
+
16
+
17
+template<int width>
18
+void intra_pred_ang_neon(pixel *dst, intptr_t dstStride, const pixel *srcPix0, int dirMode, int bFilter)
19
+{
20
+    int width2 = width << 1;
21
+    // Flip the neighbours in the horizontal case.
22
+    int horMode = dirMode < 18;
23
+    pixel neighbourBuf[129];
24
+    const pixel *srcPix = srcPix0;
25
+
26
+    if (horMode)
27
+    {
28
+        neighbourBuf[0] = srcPix[0];
29
+        //for (int i = 0; i < width << 1; i++)
30
+        //{
31
+        //    neighbourBuf[1 + i] = srcPix[width2 + 1 + i];
32
+        //    neighbourBuf[width2 + 1 + i] = srcPix[1 + i];
33
+        //}
34
+        memcpy(&neighbourBuf[1], &srcPix[width2 + 1], sizeof(pixel) * (width << 1));
35
+        memcpy(&neighbourBuf[width2 + 1], &srcPix[1], sizeof(pixel) * (width << 1));
36
+        srcPix = neighbourBuf;
37
+    }
38
+
39
+    // Intra prediction angle and inverse angle tables.
40
+    const int8_t angleTable[17] = { -32, -26, -21, -17, -13, -9, -5, -2, 0, 2, 5, 9, 13, 17, 21, 26, 32 };
41
+    const int16_t invAngleTable[8] = { 4096, 1638, 910, 630, 482, 390, 315, 256 };
42
+
43
+    // Get the prediction angle.
44
+    int angleOffset = horMode ? 10 - dirMode : dirMode - 26;
45
+    int angle = angleTable[8 + angleOffset];
46
+
47
+    // Vertical Prediction.
48
+    if (!angle)
49
+    {
50
+        for (int y = 0; y < width; y++)
51
+        {
52
+            memcpy(&dst[y * dstStride], srcPix + 1, sizeof(pixel)*width);
53
+        }
54
+        if (bFilter)
55
+        {
56
+            int topLeft = srcPix[0], top = srcPix[1];
57
+            for (int y = 0; y < width; y++)
58
+            {
59
+                dst[y * dstStride] = x265_clip((int16_t)(top + ((srcPix[width2 + 1 + y] - topLeft) >> 1)));
60
+            }
61
+        }
62
+    }
63
+    else // Angular prediction.
64
+    {
65
+        // Get the reference pixels. The reference base is the first pixel to the top (neighbourBuf[1]).
66
+        pixel refBuf[64];
67
+        const pixel *ref;
68
+
69
+        // Use the projected left neighbours and the top neighbours.
70
+        if (angle < 0)
71
+        {
72
+            // Number of neighbours projected.
73
+            int nbProjected = -((width * angle) >> 5) - 1;
74
+            pixel *ref_pix = refBuf + nbProjected + 1;
75
+
76
+            // Project the neighbours.
77
+            int invAngle = invAngleTable[- angleOffset - 1];
78
+            int invAngleSum = 128;
79
+            for (int i = 0; i < nbProjected; i++)
80
+            {
81
+                invAngleSum += invAngle;
82
+                ref_pix[- 2 - i] = srcPix[width2 + (invAngleSum >> 8)];
83
+            }
84
+
85
+            // Copy the top-left and top pixels.
86
+            //for (int i = 0; i < width + 1; i++)
87
+            //ref_pix[-1 + i] = srcPix[i];
88
+
89
+            memcpy(&ref_pix[-1], srcPix, (width + 1)*sizeof(pixel));
90
+            ref = ref_pix;
91
+        }
92
+        else // Use the top and top-right neighbours.
93
+        {
94
+            ref = srcPix + 1;
95
+        }
96
+
97
+        // Pass every row.
98
+        int angleSum = 0;
99
+        for (int y = 0; y < width; y++)
100
+        {
101
+            angleSum += angle;
102
+            int offset = angleSum >> 5;
103
+            int fraction = angleSum & 31;
104
+
105
+            if (fraction) // Interpolate
106
+            {
107
+                if (width >= 8 && sizeof(pixel) == 1)
108
+                {
109
+                    const int16x8_t f0 = vdupq_n_s16(32 - fraction);
110
+                    const int16x8_t f1 = vdupq_n_s16(fraction);
111
+                    for (int x = 0; x < width; x += 8)
112
+                    {
113
+                        uint8x8_t in0 = *(uint8x8_t *)&ref[offset + x];
114
+                        uint8x8_t in1 = *(uint8x8_t *)&ref[offset + x + 1];
115
+                        int16x8_t lo = vmlaq_s16(vdupq_n_s16(16), vmovl_u8(in0), f0);
116
+                        lo = vmlaq_s16(lo, vmovl_u8(in1), f1);
117
+                        lo = vshrq_n_s16(lo, 5);
118
+                        *(uint8x8_t *)&dst[y * dstStride + x] = vmovn_u16(lo);
119
+                    }
120
+                }
121
+                else if (width >= 4 && sizeof(pixel) == 2)
122
+                {
123
+                    const int32x4_t f0 = vdupq_n_s32(32 - fraction);
124
+                    const int32x4_t f1 = vdupq_n_s32(fraction);
125
+                    for (int x = 0; x < width; x += 4)
126
+                    {
127
+                        uint16x4_t in0 = *(uint16x4_t *)&ref[offset + x];
128
+                        uint16x4_t in1 = *(uint16x4_t *)&ref[offset + x + 1];
129
+                        int32x4_t lo = vmlaq_s32(vdupq_n_s32(16), vmovl_u16(in0), f0);
130
+                        lo = vmlaq_s32(lo, vmovl_u16(in1), f1);
131
+                        lo = vshrq_n_s32(lo, 5);
132
+                        *(uint16x4_t *)&dst[y * dstStride + x] = vmovn_u32(lo);
133
+                    }
134
+                }
135
+                else
136
+                {
137
+                    for (int x = 0; x < width; x++)
138
+                    {
139
+                        dst[y * dstStride + x] = (pixel)(((32 - fraction) * ref[offset + x] + fraction * ref[offset + x + 1] + 16) >> 5);
140
+                    }
141
+                }
142
+            }
143
+            else // Copy.
144
+            {
145
+                memcpy(&dst[y * dstStride], &ref[offset], sizeof(pixel)*width);
146
+            }
147
+        }
148
+    }
149
+
150
+    // Flip for horizontal.
151
+    if (horMode)
152
+    {
153
+        if (width == 8)
154
+        {
155
+            transpose8x8(dst, dst, dstStride, dstStride);
156
+        }
157
+        else if (width == 16)
158
+        {
159
+            transpose16x16(dst, dst, dstStride, dstStride);
160
+        }
161
+        else if (width == 32)
162
+        {
163
+            transpose32x32(dst, dst, dstStride, dstStride);
164
+        }
165
+        else
166
+        {
167
+            for (int y = 0; y < width - 1; y++)
168
+            {
169
+                for (int x = y + 1; x < width; x++)
170
+                {
171
+                    pixel tmp                = dst[y * dstStride + x];
172
+                    dst[y * dstStride + x] = dst[x * dstStride + y];
173
+                    dst[x * dstStride + y] = tmp;
174
+                }
175
+            }
176
+        }
177
+    }
178
+}
179
+
180
+template<int log2Size>
181
+void all_angs_pred_neon(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
182
+{
183
+    const int size = 1 << log2Size;
184
+    for (int mode = 2; mode <= 34; mode++)
185
+    {
186
+        pixel *srcPix  = (g_intraFilterFlags[mode] & size ? filtPix  : refPix);
187
+        pixel *out = dest + ((mode - 2) << (log2Size * 2));
188
+
189
+        intra_pred_ang_neon<size>(out, size, srcPix, mode, bLuma);
190
+
191
+        // Optimize code don't flip buffer
192
+        bool modeHor = (mode < 18);
193
+
194
+        // transpose the block if this is a horizontal mode
195
+        if (modeHor)
196
+        {
197
+            if (size == 8)
198
+            {
199
+                transpose8x8(out, out, size, size);
200
+            }
201
+            else if (size == 16)
202
+            {
203
+                transpose16x16(out, out, size, size);
204
+            }
205
+            else if (size == 32)
206
+            {
207
+                transpose32x32(out, out, size, size);
208
+            }
209
+            else
210
+            {
211
+                for (int k = 0; k < size - 1; k++)
212
+                {
213
+                    for (int l = k + 1; l < size; l++)
214
+                    {
215
+                        pixel tmp           = out[k * size + l];
216
+                        out[k * size + l] = out[l * size + k];
217
+                        out[l * size + k] = tmp;
218
+                    }
219
+                }
220
+            }
221
+        }
222
+    }
223
+}
224
+}
225
+
226
+namespace X265_NS
227
+{
228
+// x265 private namespace
229
+
230
+void setupIntraPrimitives_neon(EncoderPrimitives &p)
231
+{
232
+    for (int i = 2; i < NUM_INTRA_MODE; i++)
233
+    {
234
+        p.cu[BLOCK_8x8].intra_pred[i] = intra_pred_ang_neon<8>;
235
+        p.cu[BLOCK_16x16].intra_pred[i] = intra_pred_ang_neon<16>;
236
+        p.cu[BLOCK_32x32].intra_pred[i] = intra_pred_ang_neon<32>;
237
+    }
238
+    p.cu[BLOCK_4x4].intra_pred[2] = intra_pred_ang_neon<4>;
239
+    p.cu[BLOCK_4x4].intra_pred[10] = intra_pred_ang_neon<4>;
240
+    p.cu[BLOCK_4x4].intra_pred[18] = intra_pred_ang_neon<4>;
241
+    p.cu[BLOCK_4x4].intra_pred[26] = intra_pred_ang_neon<4>;
242
+    p.cu[BLOCK_4x4].intra_pred[34] = intra_pred_ang_neon<4>;
243
+
244
+    p.cu[BLOCK_4x4].intra_pred_allangs = all_angs_pred_neon<2>;
245
+    p.cu[BLOCK_8x8].intra_pred_allangs = all_angs_pred_neon<3>;
246
+    p.cu[BLOCK_16x16].intra_pred_allangs = all_angs_pred_neon<4>;
247
+    p.cu[BLOCK_32x32].intra_pred_allangs = all_angs_pred_neon<5>;
248
+}
249
+
250
+}
251
+
252
+
253
+
254
+#else
255
+
256
+namespace X265_NS
257
+{
258
+// x265 private namespace
259
+void setupIntraPrimitives_neon(EncoderPrimitives &p)
260
+{}
261
+}
262
+
263
+#endif
264
+
265
+
266
+
267
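The NEON paths in the angular predictor above vectorise the same two-tap blend that the scalar fallback uses, (((32 - fraction) * ref[offset + x] + fraction * ref[offset + x + 1] + 16) >> 5). A standalone scalar sketch of that blend (blendRef is a hypothetical helper, not part of the source):

    // Weighted average of two neighbouring reference samples at 1/32-sample precision.
    static inline pixel blendRef(pixel r0, pixel r1, int fraction)
    {
        return (pixel)(((32 - fraction) * r0 + fraction * r1 + 16) >> 5);
    }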
x265_3.6.tar.gz/source/common/aarch64/intrapred-prim.h Added
17
 
1
@@ -0,0 +1,15 @@
2
+#ifndef INTRAPRED_PRIM_H__
3
+
4
+#if defined(__aarch64__)
5
+
6
+namespace X265_NS
7
+{
8
+// x265 private namespace
9
+
10
+void setupIntraPrimitives_neon(EncoderPrimitives &p);
11
+}
12
+
13
+#endif
14
+
15
+#endif
16
+
17
x265_3.6.tar.gz/source/common/aarch64/ipfilter-common.S Added
1438
 
1
@@ -0,0 +1,1436 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using NEON instruction set
26
+// that are also used by the SVE2 functions
27
+
28
+// Macros below follow these conventions:
29
+// - input data in registers: v0, v1, v2, v3, v4, v5, v6, v7
30
+// - constants in registers: v24, v25, v26, v27, v31
31
+// - temporary registers: v16, v17, v18, v19, v20, v21, v22, v23, v28, v29, v30.
32
+// - _32b macros output a result in v17.4s
33
+// - _64b and _32b_1 macros output results in v17.4s, v18.4s
34
+
35
+#include "asm.S"
36
+
37
+.arch           armv8-a
38
+
39
+#ifdef __APPLE__
40
+.section __RODATA,__rodata
41
+#else
42
+.section .rodata
43
+#endif
44
+
45
+.align 4
46
+
47
+.macro vextin8 v
48
+    ldp             d6, d7, [x11], #16
49
+.if \v == 0
50
+    // qpel_filter_0 only uses values in v3
51
+    ext             v3.8b, v6.8b, v7.8b, #4
52
+.else
53
+.if \v != 3
54
+    ext             v0.8b, v6.8b, v7.8b, #1
55
+.endif
56
+    ext             v1.8b, v6.8b, v7.8b, #2
57
+    ext             v2.8b, v6.8b, v7.8b, #3
58
+    ext             v3.8b, v6.8b, v7.8b, #4
59
+    ext             v4.8b, v6.8b, v7.8b, #5
60
+    ext             v5.8b, v6.8b, v7.8b, #6
61
+    ext             v6.8b, v6.8b, v7.8b, #7
62
+.endif
63
+.endm
64
+
65
+.macro vextin8_64 v
66
+    ldp             q6, q7, [x11], #32
67
+.if \v == 0
68
+    // qpel_filter_0 only uses values in v3
69
+    ext             v3.16b, v6.16b, v7.16b, #4
70
+.else
71
+.if \v != 3
72
+    // qpel_filter_3 does not use values in v0
73
+    ext             v0.16b, v6.16b, v7.16b, #1
74
+.endif
75
+    ext             v1.16b, v6.16b, v7.16b, #2
76
+    ext             v2.16b, v6.16b, v7.16b, #3
77
+    ext             v3.16b, v6.16b, v7.16b, #4
78
+    ext             v4.16b, v6.16b, v7.16b, #5
79
+    ext             v5.16b, v6.16b, v7.16b, #6
80
+.if \v == 1
81
+    ext             v6.16b, v6.16b, v7.16b, #7
82
+    // qpel_filter_1 does not use v7
83
+.else
84
+    ext             v16.16b, v6.16b, v7.16b, #7
85
+    ext             v7.16b, v6.16b, v7.16b, #8
86
+    mov             v6.16b, v16.16b
87
+.endif
88
+.endif
89
+.endm
90
+
91
+.macro vextin8_chroma v
92
+    ldp             d6, d7, [x11], #16
93
+.if \v == 0
94
+    // qpel_filter_chroma_0 only uses values in v1
95
+    ext             v1.8b, v6.8b, v7.8b, #2
96
+.else
97
+    ext             v0.8b, v6.8b, v7.8b, #1
98
+    ext             v1.8b, v6.8b, v7.8b, #2
99
+    ext             v2.8b, v6.8b, v7.8b, #3
100
+    ext             v3.8b, v6.8b, v7.8b, #4
101
+.endif
102
+.endm
103
+
104
+.macro vextin8_chroma_64 v
105
+    ldp             q16, q17, [x11], #32
106
+.if \v == 0
107
+    // qpel_filter_chroma_0 only uses values in v1
108
+    ext             v1.16b, v16.16b, v17.16b, #2
109
+.else
110
+    ext             v0.16b, v16.16b, v17.16b, #1
111
+    ext             v1.16b, v16.16b, v17.16b, #2
112
+    ext             v2.16b, v16.16b, v17.16b, #3
113
+    ext             v3.16b, v16.16b, v17.16b, #4
114
+.endif
115
+.endm
116
+
117
+.macro qpel_load_32b v
118
+.if \v == 0
119
+    add             x6, x6, x11       // do not load 3 values that are not used in qpel_filter_0
120
+    ld1             {v3.8b}, [x6], x1
121
+.elseif \v == 1 || \v == 2 || \v == 3
122
+.if \v != 3                           // not used in qpel_filter_3
123
+    ld1             {v0.8b}, [x6], x1
124
+.else
125
+    add             x6, x6, x1
126
+.endif
127
+    ld1             {v1.8b}, [x6], x1
128
+    ld1             {v2.8b}, [x6], x1
129
+    ld1             {v3.8b}, [x6], x1
130
+    ld1             {v4.8b}, [x6], x1
131
+    ld1             {v5.8b}, [x6], x1
132
+.if \v != 1                           // not used in qpel_filter_1
133
+    ld1             {v6.8b}, [x6], x1
134
+    ld1             {v7.8b}, [x6]
135
+.else
136
+    ld1             {v6.8b}, [x6]
137
+.endif
138
+.endif
139
+.endm
140
+
141
+.macro qpel_load_64b v
142
+.if \v == 0
143
+    add             x6, x6, x11       // do not load 3 values that are not used in qpel_filter_0
144
+    ld1             {v3.16b}, [x6], x1
145
+.elseif \v == 1 || \v == 2 || \v == 3
146
+.if \v != 3                           // not used in qpel_filter_3
147
+    ld1             {v0.16b}, [x6], x1
148
+.else
149
+    add             x6, x6, x1
150
+.endif
151
+    ld1             {v1.16b}, [x6], x1
152
+    ld1             {v2.16b}, [x6], x1
153
+    ld1             {v3.16b}, [x6], x1
154
+    ld1             {v4.16b}, [x6], x1
155
+    ld1             {v5.16b}, [x6], x1
156
+.if \v != 1                           // not used in qpel_filter_1
157
+    ld1             {v6.16b}, [x6], x1
158
+    ld1             {v7.16b}, [x6]
159
+.else
160
+    ld1             {v6.16b}, [x6]
161
+.endif
162
+.endif
163
+.endm
164
+
165
+.macro qpel_chroma_load_32b v
166
+.if \v == 0
167
+    // qpel_filter_chroma_0 only uses values in v1
168
+    add             x6, x6, x1
169
+    ldr             d1, [x6]
170
+.else
171
+    ld1             {v0.8b}, [x6], x1
172
+    ld1             {v1.8b}, [x6], x1
173
+    ld1             {v2.8b}, [x6], x1
174
+    ld1             {v3.8b}, [x6]
175
+.endif
176
+.endm
177
+
178
+.macro qpel_chroma_load_64b v
179
+.if \v == 0
180
+    // qpel_filter_chroma_0 only uses values in v1
181
+    add             x6, x6, x1
182
+    ldr             q1, [x6]
183
+.else
184
+    ld1             {v0.16b}, [x6], x1
185
+    ld1             {v1.16b}, [x6], x1
186
+    ld1             {v2.16b}, [x6], x1
187
+    ld1             {v3.16b}, [x6]
188
+.endif
189
+.endm
190
+
191
+//          a, b,   c,  d,  e,   f, g,  h
192
+// .hword   0, 0,   0, 64,  0,   0, 0,  0
193
+.macro qpel_start_0
194
+    movi            v24.16b, #64
195
+.endm
196
+
197
+.macro qpel_filter_0_32b
198
+    umull           v17.8h, v3.8b, v24.8b    // 64*d
199
+.endm
200
+
201
+.macro qpel_filter_0_64b
202
+    qpel_filter_0_32b
203
+    umull2          v18.8h, v3.16b, v24.16b  // 64*d
204
+.endm
205
+
206
+.macro qpel_start_0_1
207
+    movi            v24.8h, #64
208
+.endm
209
+
210
+.macro qpel_filter_0_32b_1
211
+    smull           v17.4s, v3.4h, v24.4h    // 64*d0
212
+    smull2          v18.4s, v3.8h, v24.8h    // 64*d1
213
+.endm
214
+
215
+//          a, b,   c,  d,  e,   f, g,  h
216
+// .hword  -1, 4, -10, 58, 17,  -5, 1,  0
217
+.macro qpel_start_1
218
+    movi            v24.16b, #58
219
+    movi            v25.16b, #10
220
+    movi            v26.16b, #17
221
+    movi            v27.16b, #5
222
+.endm
223
+
224
+.macro qpel_filter_1_32b
225
+    umull           v19.8h, v2.8b, v25.8b  // c*10
226
+    umull           v17.8h, v3.8b, v24.8b  // d*58
227
+    umull           v21.8h, v4.8b, v26.8b  // e*17
228
+    umull           v23.8h, v5.8b, v27.8b  // f*5
229
+    sub             v17.8h, v17.8h, v19.8h // d*58 - c*10
230
+    ushll           v18.8h, v1.8b, #2      // b*4
231
+    add             v17.8h, v17.8h, v21.8h // d*58 - c*10 + e*17
232
+    usubl           v21.8h, v6.8b, v0.8b   // g - a
233
+    add             v17.8h, v17.8h, v18.8h // d*58 - c*10 + e*17 + b*4
234
+    sub             v21.8h, v21.8h, v23.8h // g - a - f*5
235
+    add             v17.8h, v17.8h, v21.8h // d*58 - c*10 + e*17 + b*4 + g - a - f*5
236
+.endm
237
+
238
+.macro qpel_filter_1_64b
239
+    qpel_filter_1_32b
240
+    umull2          v20.8h, v2.16b, v25.16b  // c*10
241
+    umull2          v18.8h, v3.16b, v24.16b  // d*58
242
+    umull2          v21.8h, v4.16b, v26.16b  // e*17
243
+    umull2          v23.8h, v5.16b, v27.16b  // f*5
244
+    sub             v18.8h, v18.8h, v20.8h   // d*58 - c*10
245
+    ushll2          v28.8h, v1.16b, #2       // b*4
246
+    add             v18.8h, v18.8h, v21.8h   // d*58 - c*10 + e*17
247
+    usubl2          v21.8h, v6.16b, v0.16b   // g - a
248
+    add             v18.8h, v18.8h, v28.8h   // d*58 - c*10 + e*17 + b*4
249
+    sub             v21.8h, v21.8h, v23.8h   // g - a - f*5
250
+    add             v18.8h, v18.8h, v21.8h   // d*58 - c*10 + e*17 + b*4 + g - a - f*5
251
+.endm
252
+
253
+.macro qpel_start_1_1
254
+    movi            v24.8h, #58
255
+    movi            v25.8h, #10
256
+    movi            v26.8h, #17
257
+    movi            v27.8h, #5
258
+.endm
259
+
260
+.macro qpel_filter_1_32b_1
261
+    smull           v17.4s, v3.4h, v24.4h    // 58 * d0
262
+    smull2          v18.4s, v3.8h, v24.8h    // 58 * d1
263
+    smull           v19.4s, v2.4h, v25.4h    // 10 * c0
264
+    smull2          v20.4s, v2.8h, v25.8h    // 10 * c1
265
+    smull           v21.4s, v4.4h, v26.4h    // 17 * e0
266
+    smull2          v22.4s, v4.8h, v26.8h    // 17 * e1
267
+    smull           v23.4s, v5.4h, v27.4h    //  5 * f0
268
+    smull2          v16.4s, v5.8h, v27.8h    //  5 * f1
269
+    sub             v17.4s, v17.4s, v19.4s   // 58 * d0 - 10 * c0
270
+    sub             v18.4s, v18.4s, v20.4s   // 58 * d1 - 10 * c1
271
+    sshll           v19.4s, v1.4h, #2        // 4 * b0
272
+    sshll2          v20.4s, v1.8h, #2        // 4 * b1
273
+    add             v17.4s, v17.4s, v21.4s   // 58 * d0 - 10 * c0 + 17 * e0
274
+    add             v18.4s, v18.4s, v22.4s   // 58 * d1 - 10 * c1 + 17 * e1
275
+    ssubl           v21.4s, v6.4h, v0.4h     // g0 - a0
276
+    ssubl2          v22.4s, v6.8h, v0.8h     // g1 - a1
277
+    add             v17.4s, v17.4s, v19.4s   // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0
278
+    add             v18.4s, v18.4s, v20.4s   // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1
279
+    sub             v21.4s, v21.4s, v23.4s   // g0 - a0 - 5 * f0
280
+    sub             v22.4s, v22.4s, v16.4s   // g1 - a1 - 5 * f1
281
+    add             v17.4s, v17.4s, v21.4s   // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + g0 - a0 - 5 * f0
282
+    add             v18.4s, v18.4s, v22.4s   // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + g1 - a1 - 5 * f1
283
+.endm
284
+
285
+//          a, b,   c,  d,  e,   f, g,  h
286
+// .hword  -1, 4, -11, 40, 40, -11, 4, -1
287
+.macro qpel_start_2
288
+    movi            v24.8h, #11
289
+    movi            v25.8h, #40
290
+.endm
291
+
292
+.macro qpel_filter_2_32b
293
+    uaddl           v17.8h, v3.8b, v4.8b     // d + e
294
+    uaddl           v19.8h, v2.8b, v5.8b     // c + f
295
+    uaddl           v23.8h, v1.8b, v6.8b     // b + g
296
+    uaddl           v21.8h, v0.8b, v7.8b     // a + h
297
+    mul             v17.8h, v17.8h, v25.8h   // 40 * (d + e)
298
+    mul             v19.8h, v19.8h, v24.8h   // 11 * (c + f)
299
+    shl             v23.8h, v23.8h, #2       // (b + g) * 4
300
+    add             v19.8h, v19.8h, v21.8h   // 11 * (c + f) + a + h
301
+    add             v17.8h, v17.8h, v23.8h   // 40 * (d + e) + (b + g) * 4
302
+    sub             v17.8h, v17.8h, v19.8h   // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h
303
+.endm
304
+
305
+.macro qpel_filter_2_64b
306
+    qpel_filter_2_32b
307
+    uaddl2          v27.8h, v3.16b, v4.16b   // d + e
308
+    uaddl2          v16.8h, v2.16b, v5.16b   // c + f
309
+    uaddl2          v23.8h, v1.16b, v6.16b   // b + g
310
+    uaddl2          v21.8h, v0.16b, v7.16b   // a + h
311
+    mul             v27.8h, v27.8h, v25.8h   // 40 * (d + e)
312
+    mul             v16.8h, v16.8h, v24.8h   // 11 * (c + f)
313
+    shl             v23.8h, v23.8h, #2       // (b + g) * 4
314
+    add             v16.8h, v16.8h, v21.8h   // 11 * (c + f) + a + h
315
+    add             v27.8h, v27.8h, v23.8h   // 40 * (d + e) + (b + g) * 4
316
+    sub             v18.8h, v27.8h, v16.8h   // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h
317
+.endm
318
+
319
+.macro qpel_start_2_1
320
+    movi            v24.4s, #11
321
+    movi            v25.4s, #40
322
+.endm
323
+
324
+.macro qpel_filter_2_32b_1
325
+    saddl           v17.4s, v3.4h, v4.4h     // d0 + e0
326
+    saddl2          v18.4s, v3.8h, v4.8h     // d1 + e1
327
+    saddl           v19.4s, v2.4h, v5.4h     // c0 + f0
328
+    saddl2          v20.4s, v2.8h, v5.8h     // c1 + f1
329
+    mul             v19.4s, v19.4s, v24.4s   // 11 * (c0 + f0)
330
+    mul             v20.4s, v20.4s, v24.4s   // 11 * (c1 + f1)
331
+    saddl           v23.4s, v1.4h, v6.4h     // b0 + g0
332
+    mul             v17.4s, v17.4s, v25.4s   // 40 * (d0 + e0)
333
+    mul             v18.4s, v18.4s, v25.4s   // 40 * (d1 + e1)
334
+    saddl2          v16.4s, v1.8h, v6.8h     // b1 + g1
335
+    saddl           v21.4s, v0.4h, v7.4h     // a0 + h0
336
+    saddl2          v22.4s, v0.8h, v7.8h     // a1 + h1
337
+    shl             v23.4s, v23.4s, #2       // 4*(b0+g0)
338
+    shl             v16.4s, v16.4s, #2       // 4*(b1+g1)
339
+    add             v19.4s, v19.4s, v21.4s   // 11 * (c0 + f0) + a0 + h0
340
+    add             v20.4s, v20.4s, v22.4s   // 11 * (c1 + f1) + a1 + h1
341
+    add             v17.4s, v17.4s, v23.4s   // 40 * (d0 + e0) + 4*(b0+g0)
342
+    add             v18.4s, v18.4s, v16.4s   // 40 * (d1 + e1) + 4*(b1+g1)
343
+    sub             v17.4s, v17.4s, v19.4s   // 40 * (d0 + e0) + 4*(b0+g0) - (11 * (c0 + f0) + a0 + h0)
344
+    sub             v18.4s, v18.4s, v20.4s   // 40 * (d1 + e1) + 4*(b1+g1) - (11 * (c1 + f1) + a1 + h1)
345
+.endm
346
+
347
+//          a, b,   c,  d,  e,   f, g,  h
348
+// .hword   0, 1,  -5, 17, 58, -10, 4, -1
349
+.macro qpel_start_3
350
+    movi            v24.16b, #17
351
+    movi            v25.16b, #5
352
+    movi            v26.16b, #58
353
+    movi            v27.16b, #10
354
+.endm
355
+
356
+.macro qpel_filter_3_32b
357
+    umull           v19.8h, v2.8b, v25.8b    // c * 5
358
+    umull           v17.8h, v3.8b, v24.8b    // d * 17
359
+    umull           v21.8h, v4.8b, v26.8b    // e * 58
360
+    umull           v23.8h, v5.8b, v27.8b    // f * 10
361
+    sub             v17.8h, v17.8h, v19.8h   // d * 17 - c * 5
362
+    ushll           v19.8h, v6.8b, #2        // g * 4
363
+    add             v17.8h, v17.8h, v21.8h   // d * 17 - c * 5 + e * 58
364
+    usubl           v21.8h, v1.8b, v7.8b     // b - h
365
+    add             v17.8h, v17.8h, v19.8h   // d * 17 - c * 5 + e * 58 + g * 4
366
+    sub             v21.8h, v21.8h, v23.8h   // b - h - f * 10
367
+    add             v17.8h, v17.8h, v21.8h   // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10
368
+.endm
369
+
370
+.macro qpel_filter_3_64b
371
+    qpel_filter_3_32b
372
+    umull2          v16.8h, v2.16b, v25.16b  // c * 5
373
+    umull2          v18.8h, v3.16b, v24.16b  // d * 17
374
+    umull2          v21.8h, v4.16b, v26.16b  // e * 58
375
+    umull2          v23.8h, v5.16b, v27.16b  // f * 10
376
+    sub             v18.8h, v18.8h, v16.8h   // d * 17 - c * 5
377
+    ushll2          v16.8h, v6.16b, #2       // g * 4
378
+    add             v18.8h, v18.8h, v21.8h   // d * 17 - c * 5 + e * 58
379
+    usubl2          v21.8h, v1.16b, v7.16b   // b - h
380
+    add             v18.8h, v18.8h, v16.8h   // d * 17 - c * 5 + e * 58 + g * 4
381
+    sub             v21.8h, v21.8h, v23.8h   // b - h - f * 10
382
+    add             v18.8h, v18.8h, v21.8h   // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10
383
+.endm
384
+
385
+.macro qpel_start_3_1
386
+    movi            v24.8h, #17
387
+    movi            v25.8h, #5
388
+    movi            v26.8h, #58
389
+    movi            v27.8h, #10
390
+.endm
391
+
392
+.macro qpel_filter_3_32b_1
393
+    smull           v17.4s, v3.4h, v24.4h    // 17 * d0
394
+    smull2          v18.4s, v3.8h, v24.8h    // 17 * d1
395
+    smull           v19.4s, v2.4h, v25.4h    //  5 * c0
396
+    smull2          v20.4s, v2.8h, v25.8h    //  5 * c1
397
+    smull           v21.4s, v4.4h, v26.4h    // 58 * e0
398
+    smull2          v22.4s, v4.8h, v26.8h    // 58 * e1
399
+    smull           v23.4s, v5.4h, v27.4h    // 10 * f0
400
+    smull2          v16.4s, v5.8h, v27.8h    // 10 * f1
401
+    sub             v17.4s, v17.4s, v19.4s   // 17 * d0 - 5 * c0
402
+    sub             v18.4s, v18.4s, v20.4s   // 17 * d1 - 5 * c1
403
+    sshll           v19.4s, v6.4h, #2        //  4 * g0
404
+    sshll2          v20.4s, v6.8h, #2        //  4 * g1
405
+    add             v17.4s, v17.4s, v21.4s   // 17 * d0 - 5 * c0 + 58 * e0
406
+    add             v18.4s, v18.4s, v22.4s   // 17 * d1 - 5 * c1 + 58 * e1
407
+    ssubl           v21.4s, v1.4h, v7.4h     // b0 - h0
408
+    ssubl2          v22.4s, v1.8h, v7.8h     // b1 - h1
409
+    add             v17.4s, v17.4s, v19.4s   // 17 * d0 - 5 * c0 + 58 * e0 + 4 * g0
410
+    add             v18.4s, v18.4s, v20.4s   // 17 * d1 - 5 * c1 + 58 * e1 + 4 * g1
411
+    sub             v21.4s, v21.4s, v23.4s   // b0 - h0 - 10 * f0
412
+    sub             v22.4s, v22.4s, v16.4s   // b1 - h1 - 10 * f1
413
+    add             v17.4s, v17.4s, v21.4s   // 17 * d0 - 5 * c0 + 58 * e0 + 4 * g0 + b0 - h0 - 10 * f0
414
+    add             v18.4s, v18.4s, v22.4s   // 17 * d1 - 5 * c1 + 58 * e1 + 4 * g1 + b1 - h1 - 10 * f1
415
+.endm
416
+
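The qpel_* macros above implement the four 8-tap HEVC luma interpolation filters; the taps for each fractional position are the ones listed in the a..h comments. As a reading aid for the vector code, here is a scalar C sketch of the same computation (table and function names are illustrative, not taken from the x265 sources):

    #include <stdint.h>
    #include <stddef.h>

    /* 8-tap luma taps, one row per coeffIdx 0-3 (row 0 is the copy filter). */
    static const int16_t luma_taps[4][8] = {
        {  0, 0,   0, 64,  0,   0, 0,  0 },
        { -1, 4, -10, 58, 17,  -5, 1,  0 },
        { -1, 4, -11, 40, 40, -11, 4, -1 },
        {  0, 1,  -5, 17, 58, -10, 4, -1 },
    };

    /* One raw filter sum: src points at sample 'a', step is the distance
       between taps (1 for horizontal filters, srcStride for vertical). */
    static int32_t luma_filter_sum(const uint8_t *src, ptrdiff_t step, int coeffIdx)
    {
        int32_t sum = 0;
        for (int t = 0; t < 8; t++)
            sum += luma_taps[coeffIdx][t] * src[t * step];
        return sum;   /* offset/shift/narrowing is done by the *_end macros below */
    }

The pp/ps/sp/ss variants further down differ only in how this raw sum is offset, shifted and narrowed.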
417
+.macro qpel_start_chroma_0
418
+    movi            v24.16b, #64
419
+.endm
420
+
421
+.macro qpel_filter_chroma_0_32b
422
+    umull           v17.8h, v1.8b, v24.8b    // 64*b
423
+.endm
424
+
425
+.macro qpel_filter_chroma_0_64b
426
+    umull           v17.8h, v1.8b, v24.8b    // 64*b
427
+    umull2          v18.8h, v1.16b, v24.16b  // 64*b
428
+.endm
429
+
430
+.macro qpel_start_chroma_0_1
431
+    movi            v24.8h, #64
432
+.endm
433
+
434
+.macro qpel_filter_chroma_0_32b_1
435
+    smull           v17.4s, v1.4h, v24.4h    // 64*b0
436
+    smull2          v18.4s, v1.8h, v24.8h    // 64*b1
437
+.endm
438
+
439
+.macro qpel_start_chroma_1
440
+    movi            v24.16b, #58
441
+    movi            v25.16b, #10
442
+.endm
443
+
444
+.macro qpel_filter_chroma_1_32b
445
+    umull           v17.8h, v1.8b, v24.8b    // 58 * b
446
+    umull           v19.8h, v2.8b, v25.8b    // 10 * c
447
+    uaddl           v22.8h, v0.8b, v3.8b     // a + d
448
+    shl             v22.8h, v22.8h, #1       // 2 * (a+d)
449
+    sub             v17.8h, v17.8h, v22.8h   // 58*b - 2*(a+d)
450
+    add             v17.8h, v17.8h, v19.8h   // 58*b-2*(a+d) + 10*c
451
+.endm
452
+
453
+.macro qpel_filter_chroma_1_64b
454
+    umull           v17.8h, v1.8b, v24.8b    // 58 * b
455
+    umull2          v18.8h, v1.16b, v24.16b  // 58 * b
456
+    umull           v19.8h, v2.8b, v25.8b    // 10 * c
457
+    umull2          v20.8h, v2.16b, v25.16b  // 10 * c
458
+    uaddl           v22.8h, v0.8b, v3.8b     // a + d
459
+    uaddl2          v23.8h, v0.16b, v3.16b   // a + d
460
+    shl             v22.8h, v22.8h, #1       // 2 * (a+d)
461
+    shl             v23.8h, v23.8h, #1       // 2 * (a+d)
462
+    sub             v17.8h, v17.8h, v22.8h   // 58*b - 2*(a+d)
463
+    sub             v18.8h, v18.8h, v23.8h   // 58*b - 2*(a+d)
464
+    add             v17.8h, v17.8h, v19.8h   // 58*b-2*(a+d) + 10*c
465
+    add             v18.8h, v18.8h, v20.8h   // 58*b-2*(a+d) + 10*c
466
+.endm
467
+
468
+.macro qpel_start_chroma_1_1
469
+    movi            v24.8h, #58
470
+    movi            v25.8h, #10
471
+.endm
472
+
473
+.macro qpel_filter_chroma_1_32b_1
474
+    smull           v17.4s, v1.4h, v24.4h    // 58 * b0
475
+    smull2          v18.4s, v1.8h, v24.8h    // 58 * b1
476
+    smull           v19.4s, v2.4h, v25.4h    // 10 * c0
477
+    smull2          v20.4s, v2.8h, v25.8h    // 10 * c1
478
+    add             v22.8h, v0.8h, v3.8h     // a + d
479
+    sshll           v21.4s, v22.4h, #1       // 2 * (a0+d0)
480
+    sshll2          v22.4s, v22.8h, #1       // 2 * (a1+d1)
481
+    sub             v17.4s, v17.4s, v21.4s   // 58*b0 - 2*(a0+d0)
482
+    sub             v18.4s, v18.4s, v22.4s   // 58*b1 - 2*(a1+d1)
483
+    add             v17.4s, v17.4s, v19.4s   // 58*b0-2*(a0+d0) + 10*c0
484
+    add             v18.4s, v18.4s, v20.4s   // 58*b1-2*(a1+d1) + 10*c1
485
+.endm
486
+
487
+.macro qpel_start_chroma_2
488
+    movi            v25.16b, #54
489
+.endm
490
+
491
+.macro qpel_filter_chroma_2_32b
492
+    umull           v17.8h, v1.8b, v25.8b    // 54 * b
493
+    ushll           v19.8h, v0.8b, #2        // 4 * a
494
+    ushll           v21.8h, v2.8b, #4        // 16 * c
495
+    ushll           v23.8h, v3.8b, #1        // 2 * d
496
+    add             v17.8h, v17.8h, v21.8h   // 54*b + 16*c
497
+    add             v19.8h, v19.8h, v23.8h   // 4*a + 2*d
498
+    sub             v17.8h, v17.8h, v19.8h   // 54*b+16*c - (4*a+2*d)
499
+.endm
500
+
501
+.macro qpel_filter_chroma_2_64b
502
+    umull           v17.8h, v1.8b, v25.8b    // 54 * b
503
+    umull2          v18.8h, v1.16b, v25.16b  // 54 * b
504
+    ushll           v19.8h, v0.8b, #2        // 4 * a
505
+    ushll2          v20.8h, v0.16b, #2       // 4 * a
506
+    ushll           v21.8h, v2.8b, #4        // 16 * c
507
+    ushll2          v22.8h, v2.16b, #4       // 16 * c
508
+    ushll           v23.8h, v3.8b, #1        // 2 * d
509
+    ushll2          v24.8h, v3.16b, #1       // 2 * d
510
+    add             v17.8h, v17.8h, v21.8h   // 54*b + 16*c
511
+    add             v18.8h, v18.8h, v22.8h   // 54*b + 16*c
512
+    add             v19.8h, v19.8h, v23.8h   // 4*a + 2*d
513
+    add             v20.8h, v20.8h, v24.8h   // 4*a + 2*d
514
+    sub             v17.8h, v17.8h, v19.8h   // 54*b+16*c - (4*a+2*d)
515
+    sub             v18.8h, v18.8h, v20.8h   // 54*b+16*c - (4*a+2*d)
516
+.endm
517
+
518
+.macro qpel_start_chroma_2_1
519
+    movi            v25.8h, #54
520
+.endm
521
+
522
+.macro qpel_filter_chroma_2_32b_1
523
+    smull           v17.4s, v1.4h, v25.4h    // 54 * b0
524
+    smull2          v18.4s, v1.8h, v25.8h    // 54 * b1
525
+    sshll           v19.4s, v0.4h, #2        // 4 * a0
526
+    sshll2          v20.4s, v0.8h, #2        // 4 * a1
527
+    sshll           v21.4s, v2.4h, #4        // 16 * c0
528
+    sshll2          v22.4s, v2.8h, #4        // 16 * c1
529
+    sshll           v23.4s, v3.4h, #1        // 2 * d0
530
+    sshll2          v24.4s, v3.8h, #1        // 2 * d1
531
+    add             v17.4s, v17.4s, v21.4s   // 54*b0 + 16*c0
532
+    add             v18.4s, v18.4s, v22.4s   // 54*b1 + 16*c1
533
+    add             v19.4s, v19.4s, v23.4s   // 4*a0 + 2*d0
534
+    add             v20.4s, v20.4s, v24.4s   // 4*a1 + 2*d1
535
+    sub             v17.4s, v17.4s, v19.4s   // 54*b0+16*c0 - (4*a0+2*d0)
536
+    sub             v18.4s, v18.4s, v20.4s   // 54*b1+16*c1 - (4*a1+2*d1)
537
+.endm
538
+
539
+.macro qpel_start_chroma_3
540
+    movi            v25.16b, #46
541
+    movi            v26.16b, #28
542
+    movi            v27.16b, #6
543
+.endm
544
+
545
+.macro qpel_filter_chroma_3_32b
546
+    umull           v17.8h, v1.8b, v25.8b    // 46 * b
547
+    umull           v19.8h, v2.8b, v26.8b    // 28 * c
548
+    ushll           v21.8h, v3.8b, #2        // 4 * d
549
+    umull           v23.8h, v0.8b, v27.8b    // 6 * a
550
+    add             v17.8h, v17.8h, v19.8h   // 46*b + 28*c
551
+    add             v21.8h, v21.8h, v23.8h   // 4*d + 6*a
552
+    sub             v17.8h, v17.8h, v21.8h   // 46*b+28*c - (4*d+6*a)
553
+.endm
554
+
555
+.macro qpel_filter_chroma_3_64b
556
+    umull           v17.8h, v1.8b, v25.8b    // 46 * b
557
+    umull2          v18.8h, v1.16b, v25.16b  // 46 * b
558
+    umull           v19.8h, v2.8b, v26.8b    // 28 * c
559
+    umull2          v20.8h, v2.16b, v26.16b  // 28 * c
560
+    ushll           v21.8h, v3.8b, #2        // 4 * d
561
+    ushll2          v22.8h, v3.16b, #2       // 4 * d
562
+    umull           v23.8h, v0.8b, v27.8b    // 6 * a
563
+    umull2          v24.8h, v0.16b, v27.16b  // 6 * a
564
+    add             v17.8h, v17.8h, v19.8h   // 46*b + 28*c
565
+    add             v18.8h, v18.8h, v20.8h   // 46*b + 28*c
566
+    add             v21.8h, v21.8h, v23.8h   // 4*d + 6*a
567
+    add             v22.8h, v22.8h, v24.8h   // 4*d + 6*a
568
+    sub             v17.8h, v17.8h, v21.8h   // 46*b+28*c - (4*d+6*a)
569
+    sub             v18.8h, v18.8h, v22.8h   // 46*b+28*c - (4*d+6*a)
570
+.endm
571
+
572
+.macro qpel_start_chroma_3_1
573
+    movi            v25.8h, #46
574
+    movi            v26.8h, #28
575
+    movi            v27.8h, #6
576
+.endm
577
+
578
+.macro qpel_filter_chroma_3_32b_1
579
+    smull           v17.4s, v1.4h, v25.4h    // 46 * b0
580
+    smull2          v18.4s, v1.8h, v25.8h    // 46 * b1
581
+    smull           v19.4s, v2.4h, v26.4h    // 28 * c0
582
+    smull2          v20.4s, v2.8h, v26.8h    // 28 * c1
583
+    sshll           v21.4s, v3.4h, #2        // 4 * d0
584
+    sshll2          v22.4s, v3.8h, #2        // 4 * d1
585
+    smull           v23.4s, v0.4h, v27.4h    // 6 * a0
586
+    smull2          v24.4s, v0.8h, v27.8h    // 6 * a1
587
+    add             v17.4s, v17.4s, v19.4s   // 46*b0 + 28*c0
588
+    add             v18.4s, v18.4s, v20.4s   // 46*b1 + 28*c1
589
+    add             v21.4s, v21.4s, v23.4s   // 4*d0 + 6*a0
590
+    add             v22.4s, v22.4s, v24.4s   // 4*d1 + 6*a1
591
+    sub             v17.4s, v17.4s, v21.4s   // 46*b0+28*c0 - (4*d0+6*a0)
592
+    sub             v18.4s, v18.4s, v22.4s   // 46*b1+28*c1 - (4*d1+6*a1)
593
+.endm
594
+
595
+.macro qpel_start_chroma_4
596
+    movi            v24.8h, #36
597
+.endm
598
+
599
+.macro qpel_filter_chroma_4_32b
600
+    uaddl           v20.8h, v0.8b, v3.8b     // a + d
601
+    uaddl           v17.8h, v1.8b, v2.8b     // b + c
602
+    shl             v20.8h, v20.8h, #2       // 4 * (a+d)
603
+    mul             v17.8h, v17.8h, v24.8h   // 36 * (b+c)
604
+    sub             v17.8h, v17.8h, v20.8h   // 36*(b+c) - 4*(a+d)
605
+.endm
606
+
607
+.macro qpel_filter_chroma_4_64b
608
+    uaddl           v20.8h, v0.8b, v3.8b     // a + d
609
+    uaddl2          v21.8h, v0.16b, v3.16b   // a + d
610
+    uaddl           v17.8h, v1.8b, v2.8b     // b + c
611
+    uaddl2          v18.8h, v1.16b, v2.16b   // b + c
612
+    shl             v20.8h, v20.8h, #2       // 4 * (a+d)
613
+    shl             v21.8h, v21.8h, #2       // 4 * (a+d)
614
+    mul             v17.8h, v17.8h, v24.8h   // 36 * (b+c)
615
+    mul             v18.8h, v18.8h, v24.8h   // 36 * (b+c)
616
+    sub             v17.8h, v17.8h, v20.8h   // 36*(b+c) - 4*(a+d)
617
+    sub             v18.8h, v18.8h, v21.8h   // 36*(b+c) - 4*(a+d)
618
+.endm
619
+
620
+.macro qpel_start_chroma_4_1
621
+    movi            v24.8h, #36
622
+.endm
623
+
624
+.macro qpel_filter_chroma_4_32b_1
625
+    add             v20.8h, v0.8h, v3.8h     // a + d
626
+    add             v21.8h, v1.8h, v2.8h     // b + c
627
+    smull           v17.4s, v21.4h, v24.4h   // 36 * (b0+c0)
628
+    smull2          v18.4s, v21.8h, v24.8h   // 36 * (b1+c1)
629
+    sshll           v21.4s, v20.4h, #2       // 4 * (a0+d0)
630
+    sshll2          v22.4s, v20.8h, #2       // 4 * (a1+d1)
631
+    sub             v17.4s, v17.4s, v21.4s   // 36*(b0+c0) - 4*(a0+d0)
632
+    sub             v18.4s, v18.4s, v22.4s   // 36*(b1+c1) - 4*(a1+d1)
633
+.endm
634
+
635
+.macro qpel_start_chroma_5
636
+    movi            v25.16b, #28
637
+    movi            v26.16b, #46
638
+    movi            v27.16b, #6
639
+.endm
640
+
641
+.macro qpel_filter_chroma_5_32b
642
+    umull           v17.8h, v1.8b, v25.8b    // 28 * b
643
+    umull           v19.8h, v2.8b, v26.8b    // 46 * c
644
+    ushll           v21.8h, v0.8b, #2        // 4 * a
645
+    umull           v23.8h, v3.8b, v27.8b    // 6 * d
646
+    add             v17.8h, v17.8h, v19.8h   // 28*b + 46*c
647
+    add             v21.8h, v21.8h, v23.8h   // 4*a + 6*d
648
+    sub             v17.8h, v17.8h, v21.8h   // 28*b+46*c - (4*a+6*d)
649
+.endm
650
+
651
+.macro qpel_filter_chroma_5_64b
652
+    umull           v17.8h, v1.8b, v25.8b    // 28 * b
653
+    umull2          v18.8h, v1.16b, v25.16b  // 28 * b
654
+    umull           v19.8h, v2.8b, v26.8b    // 46 * c
655
+    umull2          v20.8h, v2.16b, v26.16b  // 46 * c
656
+    ushll           v21.8h, v0.8b, #2        // 4 * a
657
+    ushll2          v22.8h, v0.16b, #2       // 4 * a
658
+    umull           v23.8h, v3.8b, v27.8b    // 6 * d
659
+    umull2          v24.8h, v3.16b, v27.16b  // 6 * d
660
+    add             v17.8h, v17.8h, v19.8h   // 28*b + 46*c
661
+    add             v18.8h, v18.8h, v20.8h   // 28*b + 46*c
662
+    add             v21.8h, v21.8h, v23.8h   // 4*a + 6*d
663
+    add             v22.8h, v22.8h, v24.8h   // 4*a + 6*d
664
+    sub             v17.8h, v17.8h, v21.8h   // 28*b+46*c - (4*a+6*d)
665
+    sub             v18.8h, v18.8h, v22.8h   // 28*b+46*c - (4*a+6*d)
666
+.endm
667
+
668
+.macro qpel_start_chroma_5_1
669
+    movi            v25.8h, #28
670
+    movi            v26.8h, #46
671
+    movi            v27.8h, #6
672
+.endm
673
+
674
+.macro qpel_filter_chroma_5_32b_1
675
+    smull           v17.4s, v1.4h, v25.4h    // 28 * b0
676
+    smull2          v18.4s, v1.8h, v25.8h    // 28 * b1
677
+    smull           v19.4s, v2.4h, v26.4h    // 46 * c0
678
+    smull2          v20.4s, v2.8h, v26.8h    // 46 * c1
679
+    sshll           v21.4s, v0.4h, #2        // 4 * a0
680
+    sshll2          v22.4s, v0.8h, #2        // 4 * a1
681
+    smull           v23.4s, v3.4h, v27.4h    // 6 * d0
682
+    smull2          v24.4s, v3.8h, v27.8h    // 6 * d1
683
+    add             v17.4s, v17.4s, v19.4s   // 28*b0 + 46*c0
684
+    add             v18.4s, v18.4s, v20.4s   // 28*b1 + 46*c1
685
+    add             v21.4s, v21.4s, v23.4s   // 4*a0 + 6*d0
686
+    add             v22.4s, v22.4s, v24.4s   // 4*a1 + 6*d1
687
+    sub             v17.4s, v17.4s, v21.4s   // 28*b0+46*c0 - (4*a0+6*d0)
688
+    sub             v18.4s, v18.4s, v22.4s   // 28*b1+46*c1 - (4*a1+6*d1)
689
+.endm
690
+
691
+.macro qpel_start_chroma_6
692
+    movi            v25.16b, #54
693
+.endm
694
+
695
+.macro qpel_filter_chroma_6_32b
696
+    umull           v17.8h, v2.8b, v25.8b    // 54 * c
697
+    ushll           v19.8h, v0.8b, #1        // 2 * a
698
+    ushll           v21.8h, v1.8b, #4        // 16 * b
699
+    ushll           v23.8h, v3.8b, #2        // 4 * d
700
+    add             v17.8h, v17.8h, v21.8h   // 54*c + 16*b
701
+    add             v19.8h, v19.8h, v23.8h   // 2*a + 4*d
702
+    sub             v17.8h, v17.8h, v19.8h   // 54*c+16*b - (2*a+4*d)
703
+.endm
704
+
705
+.macro qpel_filter_chroma_6_64b
706
+    umull           v17.8h, v2.8b, v25.8b    // 54 * c
707
+    umull2          v18.8h, v2.16b, v25.16b  // 54 * c
708
+    ushll           v19.8h, v0.8b, #1        // 2 * a
709
+    ushll2          v20.8h, v0.16b, #1       // 2 * a
710
+    ushll           v21.8h, v1.8b, #4        // 16 * b
711
+    ushll2          v22.8h, v1.16b, #4       // 16 * b
712
+    ushll           v23.8h, v3.8b, #2        // 4 * d
713
+    ushll2          v24.8h, v3.16b, #2       // 4 * d
714
+    add             v17.8h, v17.8h, v21.8h   // 54*c + 16*b
715
+    add             v18.8h, v18.8h, v22.8h   // 54*c + 16*b
716
+    add             v19.8h, v19.8h, v23.8h   // 2*a + 4*d
717
+    add             v20.8h, v20.8h, v24.8h   // 2*a + 4*d
718
+    sub             v17.8h, v17.8h, v19.8h   // 54*c+16*b - (2*a+4*d)
719
+    sub             v18.8h, v18.8h, v20.8h   // 54*c+16*b - (2*a+4*d)
720
+.endm
721
+
722
+.macro qpel_start_chroma_6_1
723
+    movi            v25.8h, #54
724
+.endm
725
+
726
+.macro qpel_filter_chroma_6_32b_1
727
+    smull           v17.4s, v2.4h, v25.4h    // 54 * c0
728
+    smull2          v18.4s, v2.8h, v25.8h    // 54 * c1
729
+    sshll           v19.4s, v0.4h, #1        // 2 * a0
730
+    sshll2          v20.4s, v0.8h, #1        // 2 * a1
731
+    sshll           v21.4s, v1.4h, #4        // 16 * b0
732
+    sshll2          v22.4s, v1.8h, #4        // 16 * b1
733
+    sshll           v23.4s, v3.4h, #2        // 4 * d0
734
+    sshll2          v24.4s, v3.8h, #2        // 4 * d1
735
+    add             v17.4s, v17.4s, v21.4s   // 54*c0 + 16*b0
736
+    add             v18.4s, v18.4s, v22.4s   // 54*c1 + 16*b1
737
+    add             v19.4s, v19.4s, v23.4s   // 2*a0 + 4*d0
738
+    add             v20.4s, v20.4s, v24.4s   // 2*a1 + 4*d1
739
+    sub             v17.4s, v17.4s, v19.4s   // 54*c0+16*b0 - (2*a0+4*d0)
740
+    sub             v18.4s, v18.4s, v20.4s   // 54*c1+16*b1 - (2*a1+4*d1)
741
+.endm
742
+
743
+.macro qpel_start_chroma_7
744
+    movi            v24.16b, #58
745
+    movi            v25.16b, #10
746
+.endm
747
+
748
+.macro qpel_filter_chroma_7_32b
749
+    uaddl           v20.8h, v0.8b, v3.8b     // a + d
750
+    umull           v17.8h, v2.8b, v24.8b    // 58 * c
751
+    shl             v20.8h, v20.8h, #1       // 2 * (a+d)
752
+    umull           v19.8h, v1.8b, v25.8b    // 10 * b
753
+    sub             v17.8h, v17.8h, v20.8h   // 58*c - 2*(a+d)
754
+    add             v17.8h, v17.8h, v19.8h   // 58*c-2*(a+d) + 10*b
755
+.endm
756
+
757
+.macro qpel_filter_chroma_7_64b
758
+    uaddl           v20.8h, v0.8b, v3.8b     // a + d
759
+    uaddl2          v21.8h, v0.16b, v3.16b   // a + d
760
+    umull           v17.8h, v2.8b, v24.8b    // 58 * c
761
+    umull2          v18.8h, v2.16b, v24.16b  // 58 * c
762
+    shl             v20.8h, v20.8h, #1       // 2 * (a+d)
763
+    shl             v21.8h, v21.8h, #1       // 2 * (a+d)
764
+    umull           v22.8h, v1.8b, v25.8b    // 10 * b
765
+    umull2          v23.8h, v1.16b, v25.16b  // 10 * b
766
+    sub             v17.8h, v17.8h, v20.8h   // 58*c - 2*(a+d)
767
+    sub             v18.8h, v18.8h, v21.8h   // 58*c - 2*(a+d)
768
+    add             v17.8h, v17.8h, v22.8h   // 58*c-2*(a+d) + 10*b
769
+    add             v18.8h, v18.8h, v23.8h   // 58*c-2*(a+d) + 10*b
770
+.endm
771
+
772
+.macro qpel_start_chroma_7_1
773
+    movi            v24.8h, #58
774
+    movi            v25.8h, #10
775
+.endm
776
+
777
+.macro qpel_filter_chroma_7_32b_1
778
+    add             v20.8h, v0.8h, v3.8h     // a + d
779
+    smull           v17.4s, v2.4h, v24.4h    // 58 * c0
780
+    smull2          v18.4s, v2.8h, v24.8h    // 58 * c1
781
+    sshll           v21.4s, v20.4h, #1       // 2 * (a0+d0)
782
+    sshll2          v22.4s, v20.8h, #1       // 2 * (a1+d1)
783
+    smull           v19.4s, v1.4h, v25.4h    // 10 * b0
784
+    smull2          v20.4s, v1.8h, v25.8h    // 10 * b1
785
+    sub             v17.4s, v17.4s, v21.4s   // 58*c0 - 2*(a0+d0)
786
+    sub             v18.4s, v18.4s, v22.4s   // 58*c1 - 2*(a1+d1)
787
+    add             v17.4s, v17.4s, v19.4s   // 58*c0-2*(a0+d0) + 10*b0
788
+    add             v18.4s, v18.4s, v20.4s   // 58*c1-2*(a1+d1) + 10*b1
789
+.endm
790
+
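The qpel_*_chroma_* groups above cover the eight 4-tap chroma fractional positions. Reading the constants back out of the macro arithmetic gives the following tap table (a sketch; the array name is illustrative):

    /* 4-tap chroma taps (a, b, c, d) per fractional position 0-7,
       as used by the qpel_filter_chroma_N macros above. */
    static const int16_t chroma_taps[8][4] = {
        {  0, 64,  0,  0 },   /* 64*b                        */
        { -2, 58, 10, -2 },   /* 58*b - 2*(a+d) + 10*c       */
        { -4, 54, 16, -2 },   /* 54*b + 16*c - (4*a + 2*d)   */
        { -6, 46, 28, -4 },   /* 46*b + 28*c - (6*a + 4*d)   */
        { -4, 36, 36, -4 },   /* 36*(b+c) - 4*(a+d)          */
        { -4, 28, 46, -6 },   /* 28*b + 46*c - (4*a + 6*d)   */
        { -2, 16, 54, -4 },   /* 16*b + 54*c - (2*a + 4*d)   */
        { -2, 10, 58, -2 },   /* 10*b + 58*c - 2*(a+d)       */
    };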
791
+.macro vpp_end
792
+    add             v17.8h, v17.8h, v31.8h
793
+    sqshrun         v17.8b, v17.8h, #6
794
+.endm
795
+
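vpp_end is the pp (pixel to pixel) finishing step: the callers below load v31 with 32, so the pair of instructions is a rounded 6-bit shift with unsigned saturation back to 8-bit pixels. In scalar terms (8-bit build assumed, continuing the sketches above):

    /* Equivalent of "add v17, v31" (v31 = 32) followed by "sqshrun #6". */
    static inline uint8_t finish_pp(int32_t sum)
    {
        int32_t v = (sum + 32) >> 6;
        return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }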
796
+.macro FILTER_LUMA_VPP w, h, v
797
+    lsl             x10, x1, #2      // x10 = 4 * x1
798
+    sub             x11, x10, x1     // x11 = 3 * x1
799
+    sub             x0, x0, x11      // src -= (8 / 2 - 1) * srcStride
800
+    mov             x5, #\h
801
+    mov             w12, #32
802
+    dup             v31.8h, w12
803
+    qpel_start_\v
804
+.loop_luma_vpp_\v\()_\w\()x\h:
805
+    mov             x7, x2
806
+    mov             x9, #0
807
+.loop_luma_vpp_w8_\v\()_\w\()x\h:
808
+    add             x6, x0, x9
809
+.if \w == 8 || \w == 24
810
+    qpel_load_32b \v
811
+    qpel_filter_\v\()_32b
812
+    vpp_end
813
+    str             d17, [x7], #8
814
+    add             x9, x9, #8
815
+.elseif \w == 12
816
+    qpel_load_32b \v
817
+    qpel_filter_\v\()_32b
818
+    vpp_end
819
+    str             d17, [x7], #8
820
+    add             x6, x0, #8
821
+    qpel_load_32b \v
822
+    qpel_filter_\v\()_32b
823
+    vpp_end
824
+    fmov            w6, s17
825
+    str             w6, [x7], #4
826
+    add             x9, x9, #12
827
+.else
828
+    qpel_load_64b \v
829
+    qpel_filter_\v\()_64b
830
+    vpp_end
831
+    add             v18.8h, v18.8h, v31.8h
832
+    sqshrun2        v17.16b, v18.8h, #6
833
+    str             q17, [x7], #16
834
+    add             x9, x9, #16
835
+.endif
836
+    cmp             x9, #\w
837
+    blt             .loop_luma_vpp_w8_\v\()_\w\()x\h
838
+    add             x0, x0, x1
839
+    add             x2, x2, x3
840
+    sub             x5, x5, #1
841
+    cbnz            x5, .loop_luma_vpp_\v\()_\w\()x\h
842
+    ret
843
+.endm
844
+
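FILTER_LUMA_VPP expands one function body per block size and fractional position: it rewinds src by three rows, then walks the block row by row, storing 8 pixels per inner step for the 8/24-wide cases, 8 + 4 for width 12, and 16 otherwise. A scalar model of the whole kernel, reusing luma_taps, luma_filter_sum and finish_pp from the sketches above (the name is illustrative; the real entry points follow the interp_vert_pp_c signature quoted later in this diff):

    typedef uint8_t pixel;

    static void interp_vert_pp_ref(const pixel *src, intptr_t srcStride,
                                   pixel *dst, intptr_t dstStride,
                                   int coeffIdx, int width, int height)
    {
        src -= 3 * srcStride;     /* sub x0, x0, x11 with x11 = 3 * srcStride */
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = finish_pp(luma_filter_sum(src + x, srcStride, coeffIdx));
            src += srcStride;
            dst += dstStride;
        }
    }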
845
+.macro vps_end
846
+    sub             v17.8h, v17.8h, v31.8h
847
+.endm
848
+
849
+.macro FILTER_VPS w, h, v
850
+    lsl             x3, x3, #1
851
+    lsl             x10, x1, #2      // x10 = 4 * x1
852
+    sub             x11, x10, x1     // x11 = 3 * x1
853
+    sub             x0, x0, x11      // src -= (8 / 2 - 1) * srcStride
854
+    mov             x5, #\h
855
+    mov             w12, #8192
856
+    dup             v31.8h, w12
857
+    qpel_start_\v
858
+.loop_ps_\v\()_\w\()x\h:
859
+    mov             x7, x2
860
+    mov             x9, #0
861
+.loop_ps_w8_\v\()_\w\()x\h:
862
+    add             x6, x0, x9
863
+.if \w == 8 || \w == 24
864
+    qpel_load_32b \v
865
+    qpel_filter_\v\()_32b
866
+    vps_end
867
+    str             q17, [x7], #16
868
+    add             x9, x9, #8
869
+.elseif \w == 12
870
+    qpel_load_32b \v
871
+    qpel_filter_\v\()_32b
872
+    vps_end
873
+    str             q17, [x7], #16
874
+    add             x6, x0, #8
875
+    qpel_load_32b \v
876
+    qpel_filter_\v\()_32b
877
+    vps_end
878
+    str             d17, [x7], #8
879
+    add             x9, x9, #12
880
+.else
881
+    qpel_load_64b \v
882
+    qpel_filter_\v\()_64b
883
+    vps_end
884
+    sub             v18.8h, v18.8h, v31.8h
885
+    stp             q17, q18, [x7], #32
886
+    add             x9, x9, #16
887
+.endif
888
+    cmp             x9, #\w
889
+    blt             .loop_ps_w8_\v\()_\w\()x\h
890
+    add             x0, x0, x1
891
+    add             x2, x2, x3
892
+    sub             x5, x5, #1
893
+    cbnz            x5, .loop_ps_\v\()_\w\()x\h
894
+    ret
895
+.endm
896
+
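vps_end and FILTER_VPS form the ps (pixel to short) variant: the destination stride is doubled because the output is int16_t, and instead of rounding and clipping, v31 = 8192 is subtracted from the raw filter sum (for 8-bit builds this offset appears to correspond to x265's internal IF_INTERNAL_OFFS). Scalar form, continuing the sketches above:

    /* Equivalent of "sub v17, v31" with v31 = 8192. */
    static inline int16_t finish_ps(int32_t sum)
    {
        return (int16_t)(sum - 8192);
    }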
897
+.macro vsp_end
898
+    add             v17.4s, v17.4s, v31.4s
899
+    add             v18.4s, v18.4s, v31.4s
900
+    sqshrun         v17.4h, v17.4s, #12
901
+    sqshrun2        v17.8h, v18.4s, #12
902
+    sqxtun          v17.8b, v17.8h
903
+.endm
904
+
905
+.macro FILTER_VSP w, h, v
906
+    lsl             x1, x1, #1
907
+    lsl             x10, x1, #2      // x10 = 4 * x1
908
+    sub             x11, x10, x1     // x11 = 3 * x1
909
+    sub             x0, x0, x11
910
+    mov             x5, #\h
911
+    mov             w12, #1
912
+    lsl             w12, w12, #19
913
+    add             w12, w12, #2048
914
+    dup             v31.4s, w12
915
+    mov             x12, #\w
916
+    lsl             x12, x12, #1
917
+    qpel_start_\v\()_1
918
+.loop_luma_vsp_\v\()_\w\()x\h:
919
+    mov             x7, x2
920
+    mov             x9, #0
921
+.loop_luma_vsp_w8_\v\()_\w\()x\h:
922
+    add             x6, x0, x9
923
+    qpel_load_64b \v
924
+    qpel_filter_\v\()_32b_1
925
+    vsp_end
926
+    str             d17, [x7], #8
927
+    add             x9, x9, #16
928
+.if \w == 12
929
+    add             x6, x0, #16
930
+    qpel_load_64b \v
931
+    qpel_filter_\v\()_32b_1
932
+    vsp_end
933
+    str             s17, [x7], #4
934
+    add             x9, x9, #8
935
+.endif
936
+    cmp             x9, x12
937
+    blt             .loop_luma_vsp_w8_\v\()_\w\()x\h
938
+    add             x0, x0, x1
939
+    add             x2, x2, x3
940
+    sub             x5, x5, #1
941
+    cbnz            x5, .loop_luma_vsp_\v\()_\w\()x\h
942
+    ret
943
+.endm
944
+
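vsp_end and FILTER_VSP are the sp (short to pixel) variant: the input rows are int16_t (source stride doubled), accumulation is 32-bit, and the finishing step adds (1 << 19) + 2048 before a saturating 12-bit narrowing shift back to 8-bit pixels; the 1 << 19 term cancels the 8192 offset carried by the ps intermediates once it has been multiplied by the filter gain of 64, and the 2048 is the rounding term for the shift. For in-range sums:

    /* Equivalent of "add v31" (v31 = (1 << 19) + 2048), "sqshrun #12", "sqxtun". */
    static inline uint8_t finish_sp(int32_t sum)
    {
        int32_t v = (sum + (1 << 19) + 2048) >> 12;
        return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }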
945
+.macro vss_end
946
+    sshr            v17.4s, v17.4s, #6
947
+    sshr            v18.4s, v18.4s, #6
948
+    uzp1            v17.8h, v17.8h, v18.8h
949
+.endm
950
+
951
+.macro FILTER_VSS w, h, v
952
+    lsl             x1, x1, #1
953
+    lsl             x10, x1, #2      // x10 = 4 * x1
954
+    sub             x11, x10, x1     // x11 = 3 * x1
955
+    sub             x0, x0, x11
956
+    lsl             x3, x3, #1
957
+    mov             x5, #\h
958
+    mov             x12, #\w
959
+    lsl             x12, x12, #1
960
+    qpel_start_\v\()_1
961
+.loop_luma_vss_\v\()_\w\()x\h:
962
+    mov             x7, x2
963
+    mov             x9, #0
964
+.loop_luma_vss_w8_\v\()_\w\()x\h:
965
+    add             x6, x0, x9
966
+    qpel_load_64b \v
967
+    qpel_filter_\v\()_32b_1
968
+    vss_end
969
+.if \w == 4
970
+    str             s17, [x7], #4
971
+    add             x9, x9, #4
972
+.else
973
+    str             q17, [x7], #16
974
+    add             x9, x9, #16
975
+.if \w == 12
976
+    add             x6, x0, x9
977
+    qpel_load_64b \v
978
+    qpel_filter_\v\()_32b_1
979
+    vss_end
980
+    str             d17, [x7], #8
981
+    add             x9, x9, #8
982
+.endif
983
+.endif
984
+    cmp             x9, x12
985
+    blt             .loop_luma_vss_w8_\v\()_\w\()x\h
986
+    add             x0, x0, x1
987
+    add             x2, x2, x3
988
+    sub             x5, x5, #1
989
+    cbnz            x5, .loop_luma_vss_\v\()_\w\()x\h
990
+    ret
991
+.endm
992
+
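vss_end and FILTER_VSS are the ss (short to short) variant: both strides are doubled and the finishing step is a plain arithmetic shift right by 6, with uzp1 repacking the two 32-bit halves into one vector of 16-bit results; there is no offset and no clipping:

    /* Equivalent of "sshr #6" followed by the uzp1 repack. */
    static inline int16_t finish_ss(int32_t sum)
    {
        return (int16_t)(sum >> 6);
    }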
993
+.macro hpp_end
994
+    add             v17.8h, v17.8h, v31.8h
995
+    sqshrun         v17.8b, v17.8h, #6
996
+.endm
997
+
998
+.macro FILTER_HPP w, h, v
999
+    mov             w6, #\h
1000
+    sub             x3, x3, #\w
1001
+    mov             w12, #32
1002
+    dup             v31.8h, w12
1003
+    qpel_start_\v
1004
+.if \w == 4
1005
+.rept \h
1006
+    mov             x11, x0
1007
+    sub             x11, x11, #4
1008
+    vextin8 \v
1009
+    qpel_filter_\v\()_32b
1010
+    hpp_end
1011
+    str             s17, [x2], #4
1012
+    add             x0, x0, x1
1013
+    add             x2, x2, x3
1014
+.endr
1015
+    ret
1016
+.else
1017
+.loop1_hpp_\v\()_\w\()x\h:
1018
+    mov             x7, #\w
1019
+    mov             x11, x0
1020
+    sub             x11, x11, #4
1021
+.loop2_hpp_\v\()_\w\()x\h:
1022
+    vextin8 \v
1023
+    qpel_filter_\v\()_32b
1024
+    hpp_end
1025
+    str             d17, [x2], #8
1026
+    sub             x11, x11, #8
1027
+    sub             x7, x7, #8
1028
+.if \w == 12
1029
+    vextin8 \v
1030
+    qpel_filter_\v\()_32b
1031
+    hpp_end
1032
+    str             s17, [x2], #4
1033
+    sub             x7, x7, #4
1034
+.endif
1035
+    cbnz            x7, .loop2_hpp_\v\()_\w\()x\h
1036
+    sub             x6, x6, #1
1037
+    add             x0, x0, x1
1038
+    add             x2, x2, x3
1039
+    cbnz            x6, .loop1_hpp_\v\()_\w\()x\h
1040
+    ret
1041
+.endif
1042
+.endm
1043
+
1044
+.macro hps_end
1045
+    sub             v17.8h, v17.8h, v31.8h
1046
+.endm
1047
+
1048
+.macro FILTER_HPS w, h, v
1049
+    sub             x3, x3, #\w
1050
+    lsl             x3, x3, #1
1051
+    mov             w12, #8192
1052
+    dup             v31.8h, w12
1053
+    qpel_start_\v
1054
+.if \w == 4
1055
+.loop_hps_\v\()_\w\()x\h\():
1056
+    mov             x11, x0
1057
+    sub             x11, x11, #4
1058
+    vextin8 \v
1059
+    qpel_filter_\v\()_32b
1060
+    hps_end
1061
+    str             d17, [x2], #8
1062
+    sub             w6, w6, #1
1063
+    add             x0, x0, x1
1064
+    add             x2, x2, x3
1065
+    cbnz            w6, .loop_hps_\v\()_\w\()x\h
1066
+    ret
1067
+.else
1068
+.loop1_hps_\v\()_\w\()x\h\():
1069
+    mov             w7, #\w
1070
+    mov             x11, x0
1071
+    sub             x11, x11, #4
1072
+.loop2_hps_\v\()_\w\()x\h\():
1073
+.if \w == 8 || \w == 12 || \w == 24
1074
+    vextin8 \v
1075
+    qpel_filter_\v\()_32b
1076
+    hps_end
1077
+    str             q17, [x2], #16
1078
+    sub             w7, w7, #8
1079
+    sub             x11, x11, #8
1080
+.if \w == 12
1081
+    vextin8 \v
1082
+    qpel_filter_\v\()_32b
1083
+    hps_end
1084
+    str             d17, [x2], #8
1085
+    sub             w7, w7, #4
1086
+.endif
1087
+.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64
1088
+    vextin8_64 \v
1089
+    qpel_filter_\v\()_64b
1090
+    hps_end
1091
+    sub             v18.8h, v18.8h, v31.8h
1092
+    stp             q17, q18, [x2], #32
1093
+    sub             w7, w7, #16
1094
+    sub             x11, x11, #16
1095
+.endif
1096
+    cbnz            w7, .loop2_hps_\v\()_\w\()x\h
1097
+    sub             w6, w6, #1
1098
+    add             x0, x0, x1
1099
+    add             x2, x2, x3
1100
+    cbnz            w6, .loop1_hps_\v\()_\w\()x\h
1101
+    ret
1102
+.endif
1103
+.endm
1104
+
1105
+.macro FILTER_CHROMA_VPP w, h, v
1106
+    qpel_start_chroma_\v
1107
+    mov             w12, #32
1108
+    dup             v31.8h, w12
1109
+    sub             x0, x0, x1
1110
+    mov             x5, #\h
1111
+.loop_chroma_vpp_\v\()_\w\()x\h:
1112
+    mov             x7, x2
1113
+    mov             x9, #0
1114
+.loop_chroma_vpp_w8_\v\()_\w\()x\h:
1115
+    add             x6, x0, x9
1116
+    qpel_chroma_load_32b \v
1117
+    qpel_filter_chroma_\v\()_32b
1118
+    vpp_end
1119
+    add             x9, x9, #8
1120
+.if \w == 2
1121
+    fmov            w12, s17
1122
+    strh            w12, [x7], #2
1123
+.elseif \w == 4
1124
+    str             s17, [x7], #4
1125
+.elseif \w == 6
1126
+    str             s17, [x7], #4
1127
+    umov            w12, v17.h[2]
1128
+    strh            w12, [x7], #2
1129
+.elseif \w == 12
1130
+    str             d17, [x7], #8
1131
+    add             x6, x0, x9
1132
+    qpel_chroma_load_32b \v
1133
+    qpel_filter_chroma_\v\()_32b
1134
+    vpp_end
1135
+    str             s17, [x7], #4
1136
+    add             x9, x9, #8
1137
+.else
1138
+    str             d17, [x7], #8
1139
+.endif
1140
+    cmp             x9, #\w
1141
+    blt             .loop_chroma_vpp_w8_\v\()_\w\()x\h
1142
+    add             x0, x0, x1
1143
+    add             x2, x2, x3
1144
+    sub             x5, x5, #1
1145
+    cbnz            x5, .loop_chroma_vpp_\v\()_\w\()x\h
1146
+    ret
1147
+.endm
1148
+
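Because the chroma block widths include 2, 6 and 12, FILTER_CHROMA_VPP cannot always store whole 8-byte lanes: 2 pixels go out through strh, 4 through str s17, 6 as 4 + 2 (str s17 plus an strh of lane 2), and 12 as 8 + 4. A purely illustrative C sketch of that tail handling:

    #include <string.h>

    /* Store one row of 'width' result bytes in 8/4/2-byte steps. */
    static void store_partial_row(uint8_t *dst, const uint8_t *out, int width)
    {
        int x = 0;
        for (; x + 8 <= width; x += 8) memcpy(dst + x, out + x, 8);   /* str d17 */
        if (width - x >= 4) { memcpy(dst + x, out + x, 4); x += 4; }  /* str s17 */
        if (width - x >= 2) { memcpy(dst + x, out + x, 2); }          /* strh    */
    }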
1149
+.macro FILTER_CHROMA_VPS w, h, v
1150
+    qpel_start_chroma_\v
1151
+    mov             w12, #8192
1152
+    dup             v31.8h, w12
1153
+    lsl             x3, x3, #1
1154
+    sub             x0, x0, x1
1155
+    mov             x5, #\h
1156
+.loop_vps_\v\()_\w\()x\h:
1157
+    mov             x7, x2
1158
+    mov             x9, #0
1159
+.loop_vps_w8_\v\()_\w\()x\h:
1160
+    add             x6, x0, x9
1161
+    qpel_chroma_load_32b \v
1162
+    qpel_filter_chroma_\v\()_32b
1163
+    vps_end
1164
+    add             x9, x9, #8
1165
+.if \w == 2
1166
+    str             s17, [x7], #4
1167
+.elseif \w == 4
1168
+    str             d17, [x7], #8
1169
+.elseif \w == 6
1170
+    str             d17, [x7], #8
1171
+    st1             {v17.s}[2], [x7], #4
1172
+.elseif \w == 12
1173
+    str             q17, [x7], #16
1174
+    add             x6, x0, x9
1175
+    qpel_chroma_load_32b \v
1176
+    qpel_filter_chroma_\v\()_32b
1177
+    vps_end
1178
+    str             d17, [x7], #8
1179
+    add             x9, x9, #8
1180
+.else
1181
+    str             q17, [x7], #16
1182
+.endif
1183
+    cmp             x9, #\w
1184
+    blt             .loop_vps_w8_\v\()_\w\()x\h
1185
+
1186
+    add             x0, x0, x1
1187
+    add             x2, x2, x3
1188
+    sub             x5, x5, #1
1189
+    cbnz            x5, .loop_vps_\v\()_\w\()x\h
1190
+    ret
1191
+.endm
1192
+
1193
+.macro FILTER_CHROMA_VSP w, h, v
1194
+    lsl             x1, x1, #1
1195
+    sub             x0, x0, x1
1196
+    mov             x5, #\h
1197
+    mov             w12, #1
1198
+    lsl             w12, w12, #19
1199
+    add             w12, w12, #2048
1200
+    dup             v31.4s, w12
1201
+    mov             x12, #\w
1202
+    lsl             x12, x12, #1
1203
+    qpel_start_chroma_\v\()_1
1204
+.loop_vsp_\v\()_\w\()x\h:
1205
+    mov             x7, x2
1206
+    mov             x9, #0
1207
+.loop_vsp_w8_\v\()_\w\()x\h:
1208
+    add             x6, x0, x9
1209
+    qpel_chroma_load_64b \v
1210
+    qpel_filter_chroma_\v\()_32b_1
1211
+    vsp_end
1212
+    add             x9, x9, #16
1213
+.if \w == 4
1214
+    str             s17, [x7], #4
1215
+.elseif \w == 12
1216
+    str             d17, [x7], #8
1217
+    add             x6, x0, x9
1218
+    qpel_chroma_load_64b \v
1219
+    qpel_filter_chroma_\v\()_32b_1
1220
+    vsp_end
1221
+    str             s17, [x7], #4
1222
+    add             x9, x9, #8
1223
+.else
1224
+    str             d17, [x7], #8
1225
+.endif
1226
+    cmp             x9, x12
1227
+    blt             .loop_vsp_w8_\v\()_\w\()x\h
1228
+    add             x0, x0, x1
1229
+    add             x2, x2, x3
1230
+    sub             x5, x5, #1
1231
+    cbnz            x5, .loop_vsp_\v\()_\w\()x\h
1232
+    ret
1233
+.endm
1234
+
1235
+.macro FILTER_CHROMA_VSS w, h, v
1236
+    lsl             x1, x1, #1
1237
+    sub             x0, x0, x1
1238
+    lsl             x3, x3, #1
1239
+    mov             x5, #\h
1240
+    mov             x12, #\w
1241
+    lsl             x12, x12, #1
1242
+    qpel_start_chroma_\v\()_1
1243
+.loop_vss_\v\()_\w\()x\h:
1244
+    mov             x7, x2
1245
+    mov             x9, #0
1246
+.if \w == 4
1247
+.rept 2
1248
+    add             x6, x0, x9
1249
+    qpel_chroma_load_64b \v
1250
+    qpel_filter_chroma_\v\()_32b_1
1251
+    vss_end
1252
+    str             s17, [x7], #4
1253
+    add             x9, x9, #4
1254
+.endr
1255
+.else
1256
+.loop_vss_w8_\v\()_\w\()x\h:
1257
+    add             x6, x0, x9
1258
+    qpel_chroma_load_64b \v
1259
+    qpel_filter_chroma_\v\()_32b_1
1260
+    vss_end
1261
+    str             q17, [x7], #16
1262
+    add             x9, x9, #16
1263
+.if \w == 12
1264
+    add             x6, x0, x9
1265
+    qpel_chroma_load_64b \v
1266
+    qpel_filter_chroma_\v\()_32b_1
1267
+    vss_end
1268
+    str             d17, [x7], #8
1269
+    add             x9, x9, #8
1270
+.endif
1271
+    cmp             x9, x12
1272
+    blt             .loop_vss_w8_\v\()_\w\()x\h
1273
+.endif
1274
+    add             x0, x0, x1
1275
+    add             x2, x2, x3
1276
+    sub             x5, x5, #1
1277
+    cbnz            x5, .loop_vss_\v\()_\w\()x\h
1278
+    ret
1279
+.endm
1280
+
1281
+.macro FILTER_CHROMA_HPP w, h, v
1282
+    qpel_start_chroma_\v
1283
+    mov             w12, #32
1284
+    dup             v31.8h, w12
1285
+    mov             w6, #\h
1286
+    sub             x3, x3, #\w
1287
+.if \w == 2 || \w == 4 || \w == 6 || \w == 12
1288
+.loop4_chroma_hpp_\v\()_\w\()x\h:
1289
+    mov             x11, x0
1290
+    sub             x11, x11, #2
1291
+    vextin8_chroma \v
1292
+    qpel_filter_chroma_\v\()_32b
1293
+    hpp_end
1294
+.if \w == 2
1295
+    fmov            w12, s17
1296
+    strh            w12, [x2], #2
1297
+.elseif \w == 4
1298
+    str             s17, [x2], #4
1299
+.elseif \w == 6
1300
+    str             s17, [x2], #4
1301
+    umov            w12, v17.h[2]
1302
+    strh            w12, [x2], #2
1303
+.elseif \w == 12
1304
+    str             d17, [x2], #8
1305
+    sub             x11, x11, #8
1306
+    vextin8_chroma \v
1307
+    qpel_filter_chroma_\v\()_32b
1308
+    hpp_end
1309
+    str             s17, [x2], #4
1310
+.endif
1311
+    sub             w6, w6, #1
1312
+    add             x0, x0, x1
1313
+    add             x2, x2, x3
1314
+    cbnz            w6, .loop4_chroma_hpp_\v\()_\w\()x\h
1315
+    ret
1316
+.else
1317
+.loop2_chroma_hpp_\v\()_\w\()x\h:
1318
+    mov             x7, #\w
1319
+    lsr             x7, x7, #3
1320
+    mov             x11, x0
1321
+    sub             x11, x11, #2
1322
+.loop3_chroma_hpp_\v\()_\w\()x\h:
1323
+.if \w == 8 || \w == 24
1324
+    vextin8_chroma \v
1325
+    qpel_filter_chroma_\v\()_32b
1326
+    hpp_end
1327
+    str             d17, [x2], #8
1328
+    sub             x7, x7, #1
1329
+    sub             x11, x11, #8
1330
+.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64
1331
+    vextin8_chroma_64 \v
1332
+    qpel_filter_chroma_\v\()_64b
1333
+    hpp_end
1334
+    add             v18.8h, v18.8h, v31.8h
1335
+    sqshrun2        v17.16b, v18.8h, #6
1336
+    str             q17, [x2], #16
1337
+    sub             x7, x7, #2
1338
+    sub             x11, x11, #16
1339
+.endif
1340
+    cbnz            x7, .loop3_chroma_hpp_\v\()_\w\()x\h
1341
+    sub             w6, w6, #1
1342
+    add             x0, x0, x1
1343
+    add             x2, x2, x3
1344
+    cbnz            w6, .loop2_chroma_hpp_\v\()_\w\()x\h
1345
+    ret
1346
+.endif
1347
+.endm
1348
+
1349
+.macro CHROMA_HPS_2_4_6_12 w, v
1350
+    mov             x11, x0
1351
+    sub             x11, x11, #2
1352
+    vextin8_chroma \v
1353
+    qpel_filter_chroma_\v\()_32b
1354
+    hps_end
1355
+    sub             x11, x11, #8
1356
+.if \w == 2
1357
+    str             s17, [x2], #4
1358
+.elseif \w == 4
1359
+    str             d17, [x2], #8
1360
+.elseif \w == 6
1361
+    str             d17, [x2], #8
1362
+    st1             {v17.s}[2], [x2], #4
1363
+.elseif \w == 12
1364
+    str             q17, [x2], #16
1365
+    vextin8_chroma \v
1366
+    qpel_filter_chroma_\v\()_32b
1367
+    sub             v17.8h, v17.8h, v31.8h
1368
+    str             d17, [x2], #8
1369
+.endif
1370
+    add             x0, x0, x1
1371
+    add             x2, x2, x3
1372
+.endm
1373
+
1374
+.macro FILTER_CHROMA_HPS w, h, v
1375
+    qpel_start_chroma_\v
1376
+    mov             w12, #8192
1377
+    dup             v31.8h, w12
1378
+    sub             x3, x3, #\w
1379
+    lsl             x3, x3, #1
1380
+
1381
+.if \w == 2 || \w == 4 || \w == 6 || \w == 12
1382
+    cmp             x5, #0
1383
+    beq             0f
1384
+    sub             x0, x0, x1
1385
+.rept 3
1386
+    CHROMA_HPS_2_4_6_12 \w, \v
1387
+.endr
1388
+0:
1389
+.rept \h
1390
+    CHROMA_HPS_2_4_6_12 \w, \v
1391
+.endr
1392
+    ret
1393
+.else
1394
+    mov             w10, #\h
1395
+    cmp             x5, #0
1396
+    beq             9f
1397
+    sub             x0, x0, x1
1398
+    add             w10, w10, #3
1399
+9:
1400
+    mov             w6, w10
1401
+.loop1_chroma_hps_\v\()_\w\()x\h\():
1402
+    mov             x7, #\w
1403
+    lsr             x7, x7, #3
1404
+    mov             x11, x0
1405
+    sub             x11, x11, #2
1406
+.loop2_chroma_hps_\v\()_\w\()x\h\():
1407
+.if \w == 8 || \w == 24
1408
+    vextin8_chroma \v
1409
+    qpel_filter_chroma_\v\()_32b
1410
+    hps_end
1411
+    str             q17, [x2], #16
1412
+    sub             x7, x7, #1
1413
+    sub             x11, x11, #8
1414
+.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64
1415
+    vextin8_chroma_64 \v
1416
+    qpel_filter_chroma_\v\()_64b
1417
+    hps_end
1418
+    sub             v18.8h, v18.8h, v31.8h
1419
+    stp             q17, q18, [x2], #32
1420
+    sub             x7, x7, #2
1421
+    sub             x11, x11, #16
1422
+.endif
1423
+    cbnz            x7, .loop2_chroma_hps_\v\()_\w\()x\h\()
1424
+    sub             w6, w6, #1
1425
+    add             x0, x0, x1
1426
+    add             x2, x2, x3
1427
+    cbnz            w6, .loop1_chroma_hps_\v\()_\w\()x\h\()
1428
+    ret
1429
+.endif
1430
+.endm
1431
+
1432
+const g_lumaFilter, align=8
1433
+.word 0,0,0,0,0,0,64,64,0,0,0,0,0,0,0,0
1434
+.word -1,-1,4,4,-10,-10,58,58,17,17,-5,-5,1,1,0,0
1435
+.word -1,-1,4,4,-11,-11,40,40,40,40,-11,-11,4,4,-1,-1
1436
+.word 0,0,1,1,-5,-5,17,17,58,58,-10,-10,4,4,-1,-1
1437
+endconst
1438
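g_lumaFilter holds the same four tap rows as the filter comments earlier in this file, but with every coefficient stored twice (16 words, 64 bytes per row). That layout lets the SVE2 4xN kernels in the next file index a row as coeffIdx << 6 and broadcast one {c, c} pair per tap with a single ld1rd. The equivalent C data, for reference:

    /* Same contents as g_lumaFilter: row = coeffIdx, each tap duplicated. */
    static const int32_t g_lumaFilter_ref[4][16] = {
        {  0,  0, 0, 0,   0,   0, 64, 64,  0,  0,   0,   0, 0, 0,  0,  0 },
        { -1, -1, 4, 4, -10, -10, 58, 58, 17, 17,  -5,  -5, 1, 1,  0,  0 },
        { -1, -1, 4, 4, -11, -11, 40, 40, 40, 40, -11, -11, 4, 4, -1, -1 },
        {  0,  0, 1, 1,  -5,  -5, 17, 17, 58, 58, -10, -10, 4, 4, -1, -1 },
    };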
x265_3.6.tar.gz/source/common/aarch64/ipfilter-sve2.S Added
1284
 
1
@@ -0,0 +1,1282 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// Functions in this file:
26
+// ***** luma_vpp *****
27
+// ***** luma_vps *****
28
+// ***** luma_vsp *****
29
+// ***** luma_vss *****
30
+// ***** luma_hpp *****
31
+// ***** luma_hps *****
32
+// ***** chroma_vpp *****
33
+// ***** chroma_vps *****
34
+// ***** chroma_vsp *****
35
+// ***** chroma_vss *****
36
+// ***** chroma_hpp *****
37
+// ***** chroma_hps *****
38
+
39
+#include "asm-sve.S"
40
+#include "ipfilter-common.S"
41
+
42
+.arch armv8-a+sve2
43
+
44
+#ifdef __APPLE__
45
+.section __RODATA,__rodata
46
+#else
47
+.section .rodata
48
+#endif
49
+
50
+.align 4
51
+
52
+.text
53
+
54
+.macro qpel_load_32b_sve2 v
55
+.if \v == 0
56
+    add             x6, x6, x11       // do not load 3 values that are not used in qpel_filter_0
57
+    ld1b            {z3.h}, p0/z, [x6]
58
+    add             x6, x6, x1
59
+.elseif \v == 1 || \v == 2 || \v == 3
60
+.if \v != 3                           // not used in qpel_filter_3
61
+    ld1b            {z0.h}, p0/z, [x6]
62
+    add             x6, x6, x1
63
+.else
64
+    add             x6, x6, x1
65
+.endif
66
+    ld1b            {z1.h}, p0/z, [x6]
67
+    add             x6, x6, x1
68
+    ld1b            {z2.h}, p0/z, [x6]
69
+    add             x6, x6, x1
70
+    ld1b            {z3.h}, p0/z, [x6]
71
+    add             x6, x6, x1
72
+    ld1b            {z4.h}, p0/z, [x6]
73
+    add             x6, x6, x1
74
+    ld1b            {z5.h}, p0/z, [x6]
75
+    add             x6, x6, x1
76
+.if \v != 1                           // not used in qpel_filter_1
77
+    ld1b            {z6.h}, p0/z, [x6]
78
+    add             x6, x6, x1
79
+    ld1b            {z7.h}, p0/z, [x6]
80
+.else
81
+    ld1b            {z6.h}, p0/z, [x6]
82
+.endif
83
+.endif
84
+.endm
85
+
86
+.macro qpel_load_64b_sve2_gt_16 v
87
+.if \v == 0
88
+    add             x6, x6, x11       // do not load 3 values that are not used in qpel_filter_0
89
+    ld1b            {z3.h}, p2/z, [x6]
90
+    add             x6, x6, x1
91
+.elseif \v == 1 || \v == 2 || \v == 3
92
+.if \v != 3                           // not used in qpel_filter_3
93
+    ld1b            {z0.h}, p2/z, [x6]
94
+    add             x6, x6, x1
95
+.else
96
+    add             x6, x6, x1
97
+.endif
98
+    ld1b            {z1.h}, p2/z, [x6]
99
+    add             x6, x6, x1
100
+    ld1b            {z2.h}, p2/z, [x6]
101
+    add             x6, x6, x1
102
+    ld1b            {z3.h}, p2/z, [x6]
103
+    add             x6, x6, x1
104
+    ld1b            {z4.h}, p2/z, [x6]
105
+    add             x6, x6, x1
106
+    ld1b            {z5.h}, p2/z, [x6]
107
+    add             x6, x6, x1
108
+.if \v != 1                           // not used in qpel_filter_1
109
+    ld1b            {z6.h}, p2/z, [x6]
110
+    add             x6, x6, x1
111
+    ld1b            {z7.h}, p2/z, [x6]
112
+.else
113
+    ld1b            {z6.h}, p2/z, [x6]
114
+.endif
115
+.endif
116
+.endm
117
+
118
+.macro qpel_chroma_load_32b_sve2 v
119
+.if \v == 0
120
+    // qpel_filter_chroma_0 only uses values in v1
121
+    add             x6, x6, x1
122
+    ld1b            {z1.h}, p0/z, [x6]
123
+.else
124
+    ld1b            {z0.h}, p0/z, [x6]
125
+    add             x6, x6, x1
126
+    ld1b            {z1.h}, p0/z, [x6]
127
+    add             x6, x6, x1
128
+    ld1b            {z2.h}, p0/z, [x6]
129
+    add             x6, x6, x1
130
+    ld1b            {z3.h}, p0/z, [x6]
131
+.endif
132
+.endm
133
+
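The SVE2 load macros above use ld1b with .h-sized destination elements, so every source byte is zero-extended to a halfword as it is loaded; the filter macros can then work with plain 16-bit mul/mla instead of the NEON umull/umull2 widening pairs. The predicates chosen by the callers (p0 = vl8, p2 = vl16) decide how many lanes each load fills. A scalar model of one such load (names are illustrative, continuing the earlier sketches):

    /* Model of "ld1b {z0.h}, p0/z, [x6]" with an n-lane predicate. */
    static void ld1b_widen_to_h(uint16_t lanes[], const uint8_t *src, int n)
    {
        for (int i = 0; i < n; i++)
            lanes[i] = src[i];   /* zero-extend byte -> halfword on load */
    }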
134
+.macro qpel_start_sve2_0
135
+    mov             z24.h, #64
136
+.endm
137
+
138
+.macro qpel_filter_sve2_0_32b
139
+    mul             z17.h, z3.h, z24.h    // 64*d
140
+.endm
141
+
142
+.macro qpel_filter_sve2_0_64b
143
+    qpel_filter_sve2_0_32b
144
+    mul             z18.h, z11.h, z24.h
145
+.endm
146
+
147
+.macro qpel_start_sve2_1
148
+    mov             z24.h, #58
149
+    mov             z25.h, #10
150
+    mov             z26.h, #17
151
+    mov             z27.h, #5
152
+.endm
153
+
154
+.macro qpel_filter_sve2_1_32b
155
+    mul             z19.h, z2.h, z25.h  // c*10
156
+    mul             z17.h, z3.h, z24.h  // d*58
157
+    mul             z21.h, z4.h, z26.h  // e*17
158
+    mul             z23.h, z5.h, z27.h  // f*5
159
+    sub             z17.h, z17.h, z19.h // d*58 - c*10
160
+    lsl             z18.h, z1.h, #2      // b*4
161
+    add             z17.h, z17.h, z21.h // d*58 - c*10 + e*17
162
+    sub             z21.h, z6.h, z0.h   // g - a
163
+    add             z17.h, z17.h, z18.h // d*58 - c*10 + e*17 + b*4
164
+    sub             z21.h, z21.h, z23.h // g - a - f*5
165
+    add             z17.h, z17.h, z21.h // d*58 - c*10 + e*17 + b*4 + g - a - f*5
166
+.endm
167
+
168
+.macro qpel_filter_sve2_1_64b
169
+    qpel_filter_sve2_1_32b
170
+    mul             z20.h, z10.h, z25.h  // c*10
171
+    mul             z18.h, z11.h, z24.h  // d*58
172
+    mul             z21.h, z12.h, z26.h  // e*17
173
+    mul             z23.h, z13.h, z27.h  // f*5
174
+    sub             z18.h, z18.h, z20.h   // d*58 - c*10
175
+    lsl             z28.h, z30.h, #2       // b*4
176
+    add             z18.h, z18.h, z21.h   // d*58 - c*10 + e*17
177
+    sub             z21.h, z14.h, z29.h   // g - a
178
+    add             z18.h, z18.h, z28.h   // d*58 - c*10 + e*17 + b*4
179
+    sub             z21.h, z21.h, z23.h   // g - a - f*5
180
+    add             z18.h, z18.h, z21.h   // d*58 - c*10 + e*17 + b*4 + g - a - f*5
181
+.endm
182
+
183
+.macro qpel_start_sve2_2
184
+    mov             z24.h, #11
185
+    mov             z25.h, #40
186
+.endm
187
+
188
+.macro qpel_filter_sve2_2_32b
189
+    add             z17.h, z3.h, z4.h     // d + e
190
+    add             z19.h, z2.h, z5.h     // c + f
191
+    add             z23.h, z1.h, z6.h     // b + g
192
+    add             z21.h, z0.h, z7.h     // a + h
193
+    mul             z17.h, z17.h, z25.h   // 40 * (d + e)
194
+    mul             z19.h, z19.h, z24.h   // 11 * (c + f)
195
+    lsl             z23.h, z23.h, #2       // (b + g) * 4
196
+    add             z19.h, z19.h, z21.h   // 11 * (c + f) + a + h
197
+    add             z17.h, z17.h, z23.h   // 40 * (d + e) + (b + g) * 4
198
+    sub             z17.h, z17.h, z19.h   // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h
199
+.endm
200
+
201
+.macro qpel_filter_sve2_2_64b
202
+    qpel_filter_sve2_2_32b
203
+    add             z27.h, z11.h, z12.h   // d + e
204
+    add             z16.h, z10.h, z13.h   // c + f
205
+    add             z23.h, z30.h, z14.h   // b + g
206
+    add             z21.h, z29.h, z15.h   // a + h
207
+    mul             z27.h, z27.h, z25.h   // 40 * (d + e)
208
+    mul             z16.h, z16.h, z24.h   // 11 * (c + f)
209
+    lsl             z23.h, z23.h, #2       // (b + g) * 4
210
+    add             z16.h, z16.h, z21.h   // 11 * (c + f) + a + h
211
+    add             z27.h, z27.h, z23.h   // 40 * (d + e) + (b + g) * 4
212
+    sub             z18.h, z27.h, z16.h   // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h
213
+.endm
214
+
215
+.macro qpel_start_sve2_3
216
+    mov             z24.h, #17
217
+    mov             z25.h, #5
218
+    mov             z26.h, #58
219
+    mov             z27.h, #10
220
+.endm
221
+
222
+.macro qpel_filter_sve2_3_32b
223
+    mul             z19.h, z2.h, z25.h    // c * 5
224
+    mul             z17.h, z3.h, z24.h    // d * 17
225
+    mul             z21.h, z4.h, z26.h    // e * 58
226
+    mul             z23.h, z5.h, z27.h    // f * 10
227
+    sub             z17.h, z17.h, z19.h   // d * 17 - c * 5
228
+    lsl             z19.h, z6.h, #2        // g * 4
229
+    add             z17.h, z17.h, z21.h   // d * 17 - c * 5 + e * 58
230
+    sub             z21.h, z1.h, z7.h     // b - h
231
+    add             z17.h, z17.h, z19.h   // d * 17 - c * 5 + e * 58 + g * 4
232
+    sub             z21.h, z21.h, z23.h   // b - h - f * 10
233
+    add             z17.h, z17.h, z21.h   // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10
234
+.endm
235
+
236
+.macro qpel_filter_sve2_3_64b
237
+    qpel_filter_sve2_3_32b
238
+    mul             z16.h, z10.h, z25.h  // c * 5
239
+    mul             z18.h, z11.h, z24.h  // d * 17
240
+    mul             z21.h, z12.h, z26.h  // e * 58
241
+    mul             z23.h, z13.h, z27.h  // f * 10
242
+    sub             z18.h, z18.h, z16.h   // d * 17 - c * 5
243
+    lsl             z16.h, z14.h, #2       // g * 4
244
+    add             z18.h, z18.h, z21.h   // d * 17 - c * 5 + e * 58
245
+    sub             z21.h, z30.h, z15.h   // b - h
246
+    add             z18.h, z18.h, z16.h   // d * 17 - c * 5 + e * 58 + g * 4
247
+    sub             z21.h, z21.h, z23.h   // b - h - f * 10
248
+    add             z18.h, z18.h, z21.h   // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10
249
+.endm
250
+
251
+.macro qpel_start_chroma_sve2_0
252
+    mov             z29.h, #64
253
+.endm
254
+
255
+.macro qpel_filter_chroma_sve2_0_32b
256
+    mul             z17.h, z1.h, z29.h    // 64*b
257
+.endm
258
+
259
+.macro qpel_start_chroma_sve2_1
260
+    mov             z29.h, #58
261
+    mov             z30.h, #10
262
+.endm
263
+
264
+.macro qpel_filter_chroma_sve2_1_32b
265
+    mul             z17.h, z1.h, z29.h    // 58 * b
266
+    mul             z19.h, z2.h, z30.h    // 10 * c
267
+    add             z22.h, z0.h, z3.h     // a + d
268
+    lsl             z22.h, z22.h, #1       // 2 * (a+d)
269
+    sub             z17.h, z17.h, z22.h   // 58*b - 2*(a+d)
270
+    add             z17.h, z17.h, z19.h   // 58*b-2*(a+d) + 10*c
271
+.endm
272
+
273
+.macro qpel_start_chroma_sve2_2
274
+    mov             z30.h, #54
275
+.endm
276
+
277
+.macro qpel_filter_chroma_sve2_2_32b
278
+    mul             z17.h, z1.h, z30.h    // 54 * b
279
+    lsl             z19.h, z0.h, #2        // 4 * a
280
+    lsl             z21.h, z2.h, #4        // 16 * c
281
+    lsl             z23.h, z3.h, #1        // 2 * d
282
+    add             z17.h, z17.h, z21.h   // 54*b + 16*c
283
+    add             z19.h, z19.h, z23.h   // 4*a + 2*d
284
+    sub             z17.h, z17.h, z19.h   // 54*b+16*c - (4*a+2*d)
285
+.endm
286
+
287
+.macro qpel_start_chroma_sve2_3
288
+    mov             z28.h, #46
289
+    mov             z29.h, #28
290
+    mov             z30.h, #6
291
+.endm
292
+
293
+.macro qpel_filter_chroma_sve2_3_32b
294
+    mul             z17.h, z1.h, z28.h    // 46 * b
295
+    mul             z19.h, z2.h, z29.h    // 28 * c
296
+    lsl             z21.h, z3.h, #2        // 4 * d
297
+    mul             z23.h, z0.h, z30.h    // 6 * a
298
+    add             z17.h, z17.h, z19.h   // 46*b + 28*c
299
+    add             z21.h, z21.h, z23.h   // 4*d + 6*a
300
+    sub             z17.h, z17.h, z21.h   // 46*b+28*c - (4*d+6*a)
301
+.endm
302
+
303
+.macro qpel_start_chroma_sve2_4
304
+    mov             z29.h, #36
305
+.endm
306
+
307
+.macro qpel_filter_chroma_sve2_4_32b
308
+    add             z20.h, z0.h, z3.h     // a + d
309
+    add             z17.h, z1.h, z2.h     // b + c
310
+    lsl             z20.h, z20.h, #2       // 4 * (a+d)
311
+    mul             z17.h, z17.h, z29.h   // 36 * (b+c)
312
+    sub             z17.h, z17.h, z20.h   // 36*(b+c) - 4*(a+d)
313
+.endm
314
+
315
+.macro qpel_start_chroma_sve2_5
316
+    mov             z28.h, #28
317
+    mov             z29.h, #46
318
+    mov             z30.h, #6
319
+.endm
320
+
321
+.macro qpel_filter_chroma_sve2_5_32b
322
+    mul             z17.h, z1.h, z28.h    // 28 * b
323
+    mul             z19.h, z2.h, z29.h    // 46 * c
324
+    lsl             z21.h, z0.h, #2        // 4 * a
325
+    mul             z23.h, z3.h, z30.h    // 6 * d
326
+    add             z17.h, z17.h, z19.h   // 28*b + 46*c
327
+    add             z21.h, z21.h, z23.h   // 4*a + 6*d
328
+    sub             z17.h, z17.h, z21.h   // 28*b+46*c - (4*a+6*d)
329
+.endm
330
+
331
+.macro qpel_start_chroma_sve2_6
332
+    mov             z30.h, #54
333
+.endm
334
+
335
+.macro qpel_filter_chroma_sve2_6_32b
336
+    mul             z17.h, z2.h, z30.h    // 54 * c
337
+    lsl             z19.h, z0.h, #1        // 2 * a
338
+    lsl             z21.h, z1.h, #4        // 16 * b
339
+    lsl             z23.h, z3.h, #2        // 4 * d
340
+    add             z17.h, z17.h, z21.h   // 54*c + 16*b
341
+    add             z19.h, z19.h, z23.h   // 2*a + 4*d
342
+    sub             z17.h, z17.h, z19.h   // 54*c+16*b - (2*a+4*d)
343
+.endm
344
+
345
+.macro qpel_start_chroma_sve2_7
346
+    mov             z29.h, #58
347
+    mov             z30.h, #10
348
+.endm
349
+
350
+.macro qpel_filter_chroma_sve2_7_32b
351
+    add             z20.h, z0.h, z3.h     // a + d
352
+    mul             z17.h, z2.h, z29.h    // 58 * c
353
+    lsl             z20.h, z20.h, #1       // 2 * (a+d)
354
+    mul             z19.h, z1.h, z30.h    // 10 * b
355
+    sub             z17.h, z17.h, z20.h   // 58*c - 2*(a+d)
356
+    add             z17.h, z17.h, z19.h   // 58*c-2*(a+d) + 10*b
357
+.endm
358
+
359
+.macro vpp_end_sve2
360
+    add             z17.h, z17.h, z31.h
361
+    sqshrun         v17.8b, v17.8h, #6
362
+.endm
363
+
364
+.macro FILTER_LUMA_VPP_SVE2 w, h, v
365
+    lsl             x10, x1, #2      // x10 = 4 * x1
366
+    sub             x11, x10, x1     // x11 = 3 * x1
367
+    sub             x0, x0, x11      // src -= (8 / 2 - 1) * srcStride
368
+    mov             x5, #\h
369
+    mov             z31.h, #32
370
+    rdvl            x9, #1
371
+    cmp             x9, #16
372
+    bgt             .vl_gt_16_FILTER_LUMA_VPP_\v\()_\w\()x\h
373
+    qpel_start_\v
374
+.loop_luma_vpp_sve2_\v\()_\w\()x\h:
375
+    mov             x7, x2
376
+    mov             x9, #0
377
+.loop_luma_vpp_w8_sve2_\v\()_\w\()x\h:
378
+    add             x6, x0, x9
379
+.if \w == 8 || \w == 24
380
+    qpel_load_32b \v
381
+    qpel_filter_\v\()_32b
382
+    vpp_end
383
+    str             d17, [x7], #8
384
+    add             x9, x9, #8
385
+.elseif \w == 12
386
+    qpel_load_32b \v
387
+    qpel_filter_\v\()_32b
388
+    vpp_end
389
+    str             d17, [x7], #8
390
+    add             x6, x0, #8
391
+    qpel_load_32b \v
392
+    qpel_filter_\v\()_32b
393
+    vpp_end
394
+    fmov            w6, s17
395
+    str             w6, [x7], #4
396
+    add             x9, x9, #12
397
+.else
398
+    qpel_load_64b \v
399
+    qpel_filter_\v\()_64b
400
+    vpp_end
401
+    add             v18.8h, v18.8h, v31.8h
402
+    sqshrun2        v17.16b, v18.8h, #6
403
+    str             q17, [x7], #16
404
+    add             x9, x9, #16
405
+.endif
406
+    cmp             x9, #\w
407
+    blt             .loop_luma_vpp_w8_sve2_\v\()_\w\()x\h
408
+    add             x0, x0, x1
409
+    add             x2, x2, x3
410
+    sub             x5, x5, #1
411
+    cbnz            x5, .loop_luma_vpp_sve2_\v\()_\w\()x\h
412
+    ret
413
+.vl_gt_16_FILTER_LUMA_VPP_\v\()_\w\()x\h:
414
+    ptrue           p0.h, vl8
415
+    ptrue           p2.h, vl16
416
+    qpel_start_sve2_\v
417
+.gt_16_loop_luma_vpp_sve2_\v\()_\w\()x\h:
418
+    mov             x7, x2
419
+    mov             x9, #0
420
+.gt_16_loop_luma_vpp_w8_sve2_\v\()_\w\()x\h:
421
+    add             x6, x0, x9
422
+.if \w == 8 || \w == 24
423
+    qpel_load_32b_sve2 \v
424
+    qpel_filter_sve2_\v\()_32b
425
+    vpp_end_sve2
426
+    str             d17, [x7], #8
427
+    add             x9, x9, #8
428
+.elseif \w == 12
429
+    qpel_load_32b_sve2 \v
430
+    qpel_filter_sve2_\v\()_32b
431
+    vpp_end_sve2
432
+    str             d17, [x7], #8
433
+    add             x6, x0, #8
434
+    qpel_load_32b_sve2 \v
435
+    qpel_filter_sve2_\v\()_32b
436
+    vpp_end_sve2
437
+    fmov            w6, s17
438
+    str             w6, [x7], #4
439
+    add             x9, x9, #12
440
+.else
441
+    qpel_load_64b_sve2_gt_16 \v
442
+    qpel_filter_sve2_\v\()_32b
443
+    vpp_end_sve2
444
+    add             z18.h, z18.h, z31.h
445
+    sqshrun2        v17.16b, v18.8h, #6
446
+    str             q17, [x7], #16
447
+    add             x9, x9, #16
448
+.endif
449
+    cmp             x9, #\w
450
+    blt             .gt_16_loop_luma_vpp_w8_sve2_\v\()_\w\()x\h
451
+    add             x0, x0, x1
452
+    add             x2, x2, x3
453
+    sub             x5, x5, #1
454
+    cbnz            x5, .gt_16_loop_luma_vpp_sve2_\v\()_\w\()x\h
455
+    ret
456
+.endm
457
+
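FILTER_LUMA_VPP_SVE2 picks its code path at run time: rdvl x9, #1 returns the SVE vector length in bytes, and only when it is greater than 16 does the code branch to the .vl_gt_16 body that uses the predicated SVE2 loads; on 128-bit implementations it falls through to the original NEON macros. A sketch of that dispatch in C (the function names are stand-ins, not x265 API):

    #include <stdint.h>

    typedef void (*vpp_fn)(const uint8_t *src, intptr_t srcStride,
                           uint8_t *dst, intptr_t dstStride, int coeffIdx);

    /* vl_bytes plays the role of "rdvl x9, #1". */
    static vpp_fn select_vert_pp(int vl_bytes, vpp_fn neon_impl, vpp_fn sve2_impl)
    {
        return (vl_bytes > 16) ? sve2_impl : neon_impl;
    }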
458
+// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
459
+.macro LUMA_VPP_SVE2 w, h
460
+function x265_interp_8tap_vert_pp_\w\()x\h\()_sve2
461
+    cmp             x4, #0
462
+    b.eq            0f
463
+    cmp             x4, #1
464
+    b.eq            1f
465
+    cmp             x4, #2
466
+    b.eq            2f
467
+    cmp             x4, #3
468
+    b.eq            3f
469
+0:
470
+    FILTER_LUMA_VPP_SVE2 \w, \h, 0
471
+1:
472
+    FILTER_LUMA_VPP_SVE2 \w, \h, 1
473
+2:
474
+    FILTER_LUMA_VPP_SVE2 \w, \h, 2
475
+3:
476
+    FILTER_LUMA_VPP_SVE2 \w, \h, 3
477
+endfunc
478
+.endm
479
+
480
+LUMA_VPP_SVE2 8, 4
481
+LUMA_VPP_SVE2 8, 8
482
+LUMA_VPP_SVE2 8, 16
483
+LUMA_VPP_SVE2 8, 32
484
+LUMA_VPP_SVE2 12, 16
485
+LUMA_VPP_SVE2 16, 4
486
+LUMA_VPP_SVE2 16, 8
487
+LUMA_VPP_SVE2 16, 16
488
+LUMA_VPP_SVE2 16, 32
489
+LUMA_VPP_SVE2 16, 64
490
+LUMA_VPP_SVE2 16, 12
491
+LUMA_VPP_SVE2 24, 32
492
+LUMA_VPP_SVE2 32, 8
493
+LUMA_VPP_SVE2 32, 16
494
+LUMA_VPP_SVE2 32, 32
495
+LUMA_VPP_SVE2 32, 64
496
+LUMA_VPP_SVE2 32, 24
497
+LUMA_VPP_SVE2 48, 64
498
+LUMA_VPP_SVE2 64, 16
499
+LUMA_VPP_SVE2 64, 32
500
+LUMA_VPP_SVE2 64, 64
501
+LUMA_VPP_SVE2 64, 48
502
+
503
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
504
+.macro LUMA_VPS_4xN_SVE2 h
505
+function x265_interp_8tap_vert_ps_4x\h\()_sve2
506
+    lsl             x3, x3, #1
507
+    lsl             x5, x4, #6
508
+    lsl             x4, x1, #2
509
+    sub             x4, x4, x1
510
+    sub             x0, x0, x4
511
+
512
+    mov             z28.s, #8192
513
+    mov             x4, #\h
514
+    movrel          x12, g_lumaFilter
515
+    add             x12, x12, x5
516
+    ptrue           p0.s, vl4
517
+    ld1rd           {z16.d}, p0/z, [x12]
518
+    ld1rd           {z17.d}, p0/z, [x12, #8]
519
+    ld1rd           {z18.d}, p0/z, [x12, #16]
520
+    ld1rd           {z19.d}, p0/z, [x12, #24]
521
+    ld1rd           {z20.d}, p0/z, [x12, #32]
522
+    ld1rd           {z21.d}, p0/z, [x12, #40]
523
+    ld1rd           {z22.d}, p0/z, [x12, #48]
524
+    ld1rd           {z23.d}, p0/z, [x12, #56]
525
+
526
+.loop_vps_sve2_4x\h:
527
+    mov             x6, x0
528
+
529
+    ld1b            {z0.s}, p0/z, [x6]
530
+    add             x6, x6, x1
531
+    ld1b            {z1.s}, p0/z, [x6]
532
+    add             x6, x6, x1
533
+    ld1b            {z2.s}, p0/z, [x6]
534
+    add             x6, x6, x1
535
+    ld1b            {z3.s}, p0/z, [x6]
536
+    add             x6, x6, x1
537
+    ld1b            {z4.s}, p0/z, [x6]
538
+    add             x6, x6, x1
539
+    ld1b            {z5.s}, p0/z, [x6]
540
+    add             x6, x6, x1
541
+    ld1b            {z6.s}, p0/z, [x6]
542
+    add             x6, x6, x1
543
+    ld1b            {z7.s}, p0/z, [x6]
544
+    add             x6, x6, x1
545
+
546
+    mul             z0.s, z0.s, z16.s
547
+    mla             z0.s, p0/m, z1.s, z17.s
548
+    mla             z0.s, p0/m, z2.s, z18.s
549
+    mla             z0.s, p0/m, z3.s, z19.s
550
+    mla             z0.s, p0/m, z4.s, z20.s
551
+    mla             z0.s, p0/m, z5.s, z21.s
552
+    mla             z0.s, p0/m, z6.s, z22.s
553
+    mla             z0.s, p0/m, z7.s, z23.s
554
+
555
+    sub             z0.s, z0.s, z28.s
556
+    sqxtn           v0.4h, v0.4s
557
+    st1             {v0.8b}, [x2], x3
558
+
559
+    add             x0, x0, x1
560
+    sub             x4, x4, #1
561
+    cbnz            x4, .loop_vps_sve2_4x\h
562
+    ret
563
+endfunc
564
+.endm
565
+
566
+LUMA_VPS_4xN_SVE2 4
567
+LUMA_VPS_4xN_SVE2 8
568
+LUMA_VPS_4xN_SVE2 16
569
+
570
+// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
571
+.macro LUMA_VSP_4xN_SVE2 h
572
+function x265_interp_8tap_vert_sp_4x\h\()_sve2
573
+    lsl             x5, x4, #6
574
+    lsl             x1, x1, #1
575
+    lsl             x4, x1, #2
576
+    sub             x4, x4, x1
577
+    sub             x0, x0, x4
578
+
579
+    mov             w12, #1
580
+    lsl             w12, w12, #19
581
+    add             w12, w12, #2048
582
+    dup             v24.4s, w12
583
+    mov             x4, #\h
584
+    movrel          x12, g_lumaFilter
585
+    add             x12, x12, x5
586
+
587
+    ptrue           p0.s, vl4
588
+    ld1rd           {z16.d}, p0/z, [x12]
589
+    ld1rd           {z17.d}, p0/z, [x12, #8]
590
+    ld1rd           {z18.d}, p0/z, [x12, #16]
591
+    ld1rd           {z19.d}, p0/z, [x12, #24]
592
+    ld1rd           {z20.d}, p0/z, [x12, #32]
593
+    ld1rd           {z21.d}, p0/z, [x12, #40]
594
+    ld1rd           {z22.d}, p0/z, [x12, #48]
595
+    ld1rd           {z23.d}, p0/z, [x12, #56]
596
+
597
+.loop_vsp_sve2_4x\h:
598
+    mov             x6, x0
599
+
600
+    ld1             {v0.8b}, [x6], x1
601
+    ld1             {v1.8b}, [x6], x1
602
+    ld1             {v2.8b}, [x6], x1
603
+    ld1             {v3.8b}, [x6], x1
604
+    ld1             {v4.8b}, [x6], x1
605
+    ld1             {v5.8b}, [x6], x1
606
+    ld1             {v6.8b}, [x6], x1
607
+    ld1             {v7.8b}, [x6], x1
608
+
609
+    sunpklo         z0.s, z0.h
610
+    sunpklo         z1.s, z1.h
611
+    mul             z0.s, z0.s, z16.s
612
+    sunpklo         z2.s, z2.h
613
+    mla             z0.s, p0/m, z1.s, z17.s
614
+    sunpklo         z3.s, z3.h
615
+    mla             z0.s, p0/m, z2.s, z18.s
616
+    sunpklo         z4.s, z4.h
617
+    mla             z0.s, p0/m, z3.s, z19.s
618
+    sunpklo         z5.s, z5.h
619
+    mla             z0.s, p0/m, z4.s, z20.s
620
+    sunpklo         z6.s, z6.h
621
+    mla             z0.s, p0/m, z5.s, z21.s
622
+    sunpklo         z7.s, z7.h
623
+    mla             z0.s, p0/m, z6.s, z22.s
624
+
625
+    mla             z0.s, p0/m, z7.s, z23.s
626
+
627
+    add             z0.s, z0.s, z24.s
628
+    sqshrun         v0.4h, v0.4s, #12
629
+    sqxtun          v0.8b, v0.8h
630
+    st1             {v0.s}[0], [x2], x3
631
+
632
+    add             x0, x0, x1
633
+    sub             x4, x4, #1
634
+    cbnz            x4, .loop_vsp_sve2_4x\h
635
+    ret
636
+endfunc
637
+.endm
638
+
639
+LUMA_VSP_4xN_SVE2 4
640
+LUMA_VSP_4xN_SVE2 8
641
+LUMA_VSP_4xN_SVE2 16
642
+
643
+.macro vps_end_sve2
644
+    sub             z17.h, z17.h, z31.h
645
+.endm
646
+
647
+.macro FILTER_VPS_SVE2 w, h, v
648
+    lsl             x3, x3, #1
649
+    lsl             x10, x1, #2      // x10 = 4 * x1
650
+    sub             x11, x10, x1     // x11 = 3 * x1
651
+    sub             x0, x0, x11      // src -= (8 / 2 - 1) * srcStride
652
+    mov             x5, #\h
653
+    mov             z31.h, #8192
654
+    rdvl            x14, #1
655
+    cmp             x14, #16
656
+    bgt             .vl_gt_16_FILTER_VPS_\v\()_\w\()x\h
657
+    qpel_start_\v
658
+.loop_ps_sve2_\v\()_\w\()x\h:
659
+    mov             x7, x2
660
+    mov             x9, #0
661
+.loop_ps_w8_sve2_\v\()_\w\()x\h:
662
+    add             x6, x0, x9
663
+.if \w == 8 || \w == 24
664
+    qpel_load_32b \v
665
+    qpel_filter_\v\()_32b
666
+    vps_end
667
+    str             q17, [x7], #16
668
+    add             x9, x9, #8
669
+.elseif \w == 12
670
+    qpel_load_32b \v
671
+    qpel_filter_\v\()_32b
672
+    vps_end
673
+    str             q17, [x7], #16
674
+    add             x6, x0, #8
675
+    qpel_load_32b \v
676
+    qpel_filter_\v\()_32b
677
+    vps_end
678
+    str             d17, [x7], #8
679
+    add             x9, x9, #12
680
+.else
681
+    qpel_load_64b \v
682
+    qpel_filter_\v\()_64b
683
+    vps_end
684
+    sub             v18.8h, v18.8h, v31.8h
685
+    stp             q17, q18, [x7], #32
686
+    add             x9, x9, #16
687
+.endif
688
+    cmp             x9, #\w
689
+    blt             .loop_ps_w8_sve2_\v\()_\w\()x\h
690
+    add             x0, x0, x1
691
+    add             x2, x2, x3
692
+    sub             x5, x5, #1
693
+    cbnz            x5, .loop_ps_sve2_\v\()_\w\()x\h
694
+    ret
695
+.vl_gt_16_FILTER_VPS_\v\()_\w\()x\h:
696
+    ptrue           p0.h, vl8
697
+    ptrue           p2.h, vl16
698
+    qpel_start_sve2_\v
699
+.gt_16_loop_ps_sve2_\v\()_\w\()x\h:
700
+    mov             x7, x2
701
+    mov             x9, #0
702
+.gt_16_loop_ps_w8_sve2_\v\()_\w\()x\h:
703
+    add             x6, x0, x9
704
+.if \w == 8 || \w == 24
705
+    qpel_load_32b_sve2 \v
706
+    qpel_filter_sve2_\v\()_32b
707
+    vps_end_sve2
708
+    str             q17, [x7], #16
709
+    add             x9, x9, #8
710
+.elseif \w == 12
711
+    qpel_load_32b_sve2 \v
712
+    qpel_filter_sve2_\v\()_32b
713
+    vps_end_sve2
714
+    str             q17, [x7], #16
715
+    add             x6, x0, #8
716
+    qpel_load_32b_sve2 \v
717
+    qpel_filter_sve2_\v\()_32b
718
+    vps_end_sve2
719
+    str             d17, [x7], #8
720
+    add             x9, x9, #12
721
+.else
722
+    qpel_load_64b_sve2_gt_16 \v
723
+    qpel_filter_sve2_\v\()_32b
724
+    vps_end_sve2
725
+    sub             z18.h, z18.h, z31.h
726
+    stp             q17, q18, [x7], #32
727
+    add             x9, x9, #16
728
+.endif
729
+    cmp             x9, #\w
730
+    blt             .gt_16_loop_ps_w8_sve2_\v\()_\w\()x\h
731
+    add             x0, x0, x1
732
+    add             x2, x2, x3
733
+    sub             x5, x5, #1
734
+    cbnz            x5, .gt_16_loop_ps_sve2_\v\()_\w\()x\h
735
+    ret
736
+.endm
737
+
738
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
739
+.macro LUMA_VPS_SVE2 w, h
740
+function x265_interp_8tap_vert_ps_\w\()x\h\()_sve2
741
+    cmp             x4, #0
742
+    beq             0f
743
+    cmp             x4, #1
744
+    beq             1f
745
+    cmp             x4, #2
746
+    beq             2f
747
+    cmp             x4, #3
748
+    beq             3f
749
+0:
750
+    FILTER_VPS_SVE2 \w, \h, 0
751
+1:
752
+    FILTER_VPS_SVE2 \w, \h, 1
753
+2:
754
+    FILTER_VPS_SVE2 \w, \h, 2
755
+3:
756
+    FILTER_VPS_SVE2 \w, \h, 3
757
+endfunc
758
+.endm
759
+
760
+LUMA_VPS_SVE2 8, 4
761
+LUMA_VPS_SVE2 8, 8
762
+LUMA_VPS_SVE2 8, 16
763
+LUMA_VPS_SVE2 8, 32
764
+LUMA_VPS_SVE2 12, 16
765
+LUMA_VPS_SVE2 16, 4
766
+LUMA_VPS_SVE2 16, 8
767
+LUMA_VPS_SVE2 16, 16
768
+LUMA_VPS_SVE2 16, 32
769
+LUMA_VPS_SVE2 16, 64
770
+LUMA_VPS_SVE2 16, 12
771
+LUMA_VPS_SVE2 24, 32
772
+LUMA_VPS_SVE2 32, 8
773
+LUMA_VPS_SVE2 32, 16
774
+LUMA_VPS_SVE2 32, 32
775
+LUMA_VPS_SVE2 32, 64
776
+LUMA_VPS_SVE2 32, 24
777
+LUMA_VPS_SVE2 48, 64
778
+LUMA_VPS_SVE2 64, 16
779
+LUMA_VPS_SVE2 64, 32
780
+LUMA_VPS_SVE2 64, 64
781
+LUMA_VPS_SVE2 64, 48
782
+
783
+// ***** luma_vss *****
784
+.macro vss_end_sve2
785
+    asr             z17.s, z17.s, #6
786
+    asr             z18.s, z18.s, #6
787
+    uzp1            v17.8h, v17.8h, v18.8h
788
+.endm
789
+
790
+.macro FILTER_VSS_SVE2 w, h, v
791
+    lsl             x1, x1, #1
792
+    lsl             x10, x1, #2      // x10 = 4 * x1
793
+    sub             x11, x10, x1     // x11 = 3 * x1
794
+    sub             x0, x0, x11
795
+    lsl             x3, x3, #1
796
+    mov             x5, #\h
797
+    mov             x12, #\w
798
+    lsl             x12, x12, #1
799
+    qpel_start_\v\()_1
800
+.loop_luma_vss_sve2_\v\()_\w\()x\h:
801
+    mov             x7, x2
802
+    mov             x9, #0
803
+.loop_luma_vss_w8_sve2_\v\()_\w\()x\h:
804
+    add             x6, x0, x9
805
+    qpel_load_64b \v
806
+    qpel_filter_\v\()_32b_1
807
+    vss_end_sve2
808
+.if \w == 4
809
+    str             s17, [x7], #4
810
+    add             x9, x9, #4
811
+.else
812
+    str             q17, [x7], #16
813
+    add             x9, x9, #16
814
+.if \w == 12
815
+    add             x6, x0, x9
816
+    qpel_load_64b \v
817
+    qpel_filter_\v\()_32b_1
818
+    vss_end_sve2
819
+    str             d17, [x7], #8
820
+    add             x9, x9, #8
821
+.endif
822
+.endif
823
+    cmp             x9, x12
824
+    blt             .loop_luma_vss_w8_sve2_\v\()_\w\()x\h
825
+    add             x0, x0, x1
826
+    add             x2, x2, x3
827
+    sub             x5, x5, #1
828
+    cbnz            x5, .loop_luma_vss_sve2_\v\()_\w\()x\h
829
+    ret
830
+.endm
831
+
832
+// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
833
+.macro LUMA_VSS_SVE2 w, h
834
+function x265_interp_8tap_vert_ss_\w\()x\h\()_sve2
835
+    cmp             x4, #0
836
+    beq             0f
837
+    cmp             x4, #1
838
+    beq             1f
839
+    cmp             x4, #2
840
+    beq             2f
841
+    cmp             x4, #3
842
+    beq             3f
843
+0:
844
+    FILTER_VSS_SVE2 \w, \h, 0
845
+1:
846
+    FILTER_VSS_SVE2 \w, \h, 1
847
+2:
848
+    FILTER_VSS_SVE2 \w, \h, 2
849
+3:
850
+    FILTER_VSS_SVE2 \w, \h, 3
851
+endfunc
852
+.endm
853
+
854
+LUMA_VSS_SVE2 4, 4
855
+LUMA_VSS_SVE2 4, 8
856
+LUMA_VSS_SVE2 4, 16
857
+LUMA_VSS_SVE2 8, 4
858
+LUMA_VSS_SVE2 8, 8
859
+LUMA_VSS_SVE2 8, 16
860
+LUMA_VSS_SVE2 8, 32
861
+LUMA_VSS_SVE2 12, 16
862
+LUMA_VSS_SVE2 16, 4
863
+LUMA_VSS_SVE2 16, 8
864
+LUMA_VSS_SVE2 16, 16
865
+LUMA_VSS_SVE2 16, 32
866
+LUMA_VSS_SVE2 16, 64
867
+LUMA_VSS_SVE2 16, 12
868
+LUMA_VSS_SVE2 32, 8
869
+LUMA_VSS_SVE2 32, 16
870
+LUMA_VSS_SVE2 32, 32
871
+LUMA_VSS_SVE2 32, 64
872
+LUMA_VSS_SVE2 32, 24
873
+LUMA_VSS_SVE2 64, 16
874
+LUMA_VSS_SVE2 64, 32
875
+LUMA_VSS_SVE2 64, 64
876
+LUMA_VSS_SVE2 64, 48
877
+LUMA_VSS_SVE2 24, 32
878
+LUMA_VSS_SVE2 48, 64
879
+
880
+// ***** luma_hps *****
881
+
882
+.macro FILTER_CHROMA_VPP_SVE2 w, h, v
883
+    ptrue           p0.h, vl8
884
+    qpel_start_chroma_sve2_\v
885
+    mov             z31.h, #32
886
+    sub             x0, x0, x1
887
+    mov             x5, #\h
888
+.loop_chroma_vpp_sve2_\v\()_\w\()x\h:
889
+    mov             x7, x2
890
+    mov             x9, #0
891
+.loop_chroma_vpp_w8_sve2_\v\()_\w\()x\h:
892
+    add             x6, x0, x9
893
+    qpel_chroma_load_32b_sve2 \v
894
+    qpel_filter_chroma_sve2_\v\()_32b
895
+    vpp_end_sve2
896
+    add             x9, x9, #8
897
+.if \w == 2
898
+    fmov            w12, s17
899
+    strh            w12, [x7], #2
900
+.elseif \w == 4
901
+    str             s17, [x7], #4
902
+.elseif \w == 6
903
+    str             s17, [x7], #4
904
+    umov            w12, v17.h[2]
905
+    strh            w12, [x7], #2
906
+.elseif \w == 12
907
+    str             d17, [x7], #8
908
+    add             x6, x0, x9
909
+    qpel_chroma_load_32b_sve2 \v
910
+    qpel_filter_chroma_sve2_\v\()_32b
911
+    vpp_end_sve2
912
+    str             s17, [x7], #4
913
+    add             x9, x9, #8
914
+.else
915
+    str             d17, [x7], #8
916
+.endif
917
+    cmp             x9, #\w
918
+    blt             .loop_chroma_vpp_w8_sve2_\v\()_\w\()x\h
919
+    add             x0, x0, x1
920
+    add             x2, x2, x3
921
+    sub             x5, x5, #1
922
+    cbnz            x5, .loop_chroma_vpp_sve2_\v\()_\w\()x\h
923
+    ret
924
+.endm
925
+
926
+// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
927
+.macro CHROMA_VPP_SVE2 w, h
928
+function x265_interp_4tap_vert_pp_\w\()x\h\()_sve2
929
+    cmp             x4, #0
930
+    beq             0f
931
+    cmp             x4, #1
932
+    beq             1f
933
+    cmp             x4, #2
934
+    beq             2f
935
+    cmp             x4, #3
936
+    beq             3f
937
+    cmp             x4, #4
938
+    beq             4f
939
+    cmp             x4, #5
940
+    beq             5f
941
+    cmp             x4, #6
942
+    beq             6f
943
+    cmp             x4, #7
944
+    beq             7f
945
+0:
946
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 0
947
+1:
948
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 1
949
+2:
950
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 2
951
+3:
952
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 3
953
+4:
954
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 4
955
+5:
956
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 5
957
+6:
958
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 6
959
+7:
960
+    FILTER_CHROMA_VPP_SVE2  \w, \h, 7
961
+endfunc
962
+.endm
963
+
964
+CHROMA_VPP_SVE2 2, 4
965
+CHROMA_VPP_SVE2 2, 8
966
+CHROMA_VPP_SVE2 2, 16
967
+CHROMA_VPP_SVE2 4, 2
968
+CHROMA_VPP_SVE2 4, 4
969
+CHROMA_VPP_SVE2 4, 8
970
+CHROMA_VPP_SVE2 4, 16
971
+CHROMA_VPP_SVE2 4, 32
972
+CHROMA_VPP_SVE2 6, 8
973
+CHROMA_VPP_SVE2 6, 16
974
+CHROMA_VPP_SVE2 8, 2
975
+CHROMA_VPP_SVE2 8, 4
976
+CHROMA_VPP_SVE2 8, 6
977
+CHROMA_VPP_SVE2 8, 8
978
+CHROMA_VPP_SVE2 8, 16
979
+CHROMA_VPP_SVE2 8, 32
980
+CHROMA_VPP_SVE2 8, 12
981
+CHROMA_VPP_SVE2 8, 64
982
+CHROMA_VPP_SVE2 12, 16
983
+CHROMA_VPP_SVE2 12, 32
984
+CHROMA_VPP_SVE2 16, 4
985
+CHROMA_VPP_SVE2 16, 8
986
+CHROMA_VPP_SVE2 16, 12
987
+CHROMA_VPP_SVE2 16, 16
988
+CHROMA_VPP_SVE2 16, 32
989
+CHROMA_VPP_SVE2 16, 64
990
+CHROMA_VPP_SVE2 16, 24
991
+CHROMA_VPP_SVE2 32, 8
992
+CHROMA_VPP_SVE2 32, 16
993
+CHROMA_VPP_SVE2 32, 24
994
+CHROMA_VPP_SVE2 32, 32
995
+CHROMA_VPP_SVE2 32, 64
996
+CHROMA_VPP_SVE2 32, 48
997
+CHROMA_VPP_SVE2 24, 32
998
+CHROMA_VPP_SVE2 24, 64
999
+CHROMA_VPP_SVE2 64, 16
1000
+CHROMA_VPP_SVE2 64, 32
1001
+CHROMA_VPP_SVE2 64, 48
1002
+CHROMA_VPP_SVE2 64, 64
1003
+CHROMA_VPP_SVE2 48, 64
1004
+
1005
+.macro FILTER_CHROMA_VPS_SVE2 w, h, v
1006
+    ptrue           p0.h, vl8
1007
+    qpel_start_chroma_sve2_\v
1008
+    mov             z31.h, #8192
1009
+    lsl             x3, x3, #1
1010
+    sub             x0, x0, x1
1011
+    mov             x5, #\h
1012
+.loop_vps_sve2_\v\()_\w\()x\h:
1013
+    mov             x7, x2
1014
+    mov             x9, #0
1015
+.loop_vps_w8_sve2_\v\()_\w\()x\h:
1016
+    add             x6, x0, x9
1017
+    qpel_chroma_load_32b_sve2 \v
1018
+    qpel_filter_chroma_sve2_\v\()_32b
1019
+    vps_end_sve2
1020
+    add             x9, x9, #8
1021
+.if \w == 2
1022
+    str             s17, [x7], #4
1023
+.elseif \w == 4
1024
+    str             d17, [x7], #8
1025
+.elseif \w == 6
1026
+    str             d17, [x7], #8
1027
+    st1             {v17.s}[2], [x7], #4
1028
+.elseif \w == 12
1029
+    str             q17, [x7], #16
1030
+    add             x6, x0, x9
1031
+    qpel_chroma_load_32b_sve2 \v
1032
+    qpel_filter_chroma_sve2_\v\()_32b
1033
+    vps_end_sve2
1034
+    str             d17, [x7], #8
1035
+    add             x9, x9, #8
1036
+.else
1037
+    str             q17, [x7], #16
1038
+.endif
1039
+    cmp             x9, #\w
1040
+    blt             .loop_vps_w8_sve2_\v\()_\w\()x\h
1041
+
1042
+    add             x0, x0, x1
1043
+    add             x2, x2, x3
1044
+    sub             x5, x5, #1
1045
+    cbnz            x5, .loop_vps_sve2_\v\()_\w\()x\h
1046
+    ret
1047
+.endm
1048
+
1049
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
1050
+.macro CHROMA_VPS_SVE2 w, h
1051
+function x265_interp_4tap_vert_ps_\w\()x\h\()_sve2
1052
+    cmp             x4, #0
1053
+    beq             0f
1054
+    cmp             x4, #1
1055
+    beq             1f
1056
+    cmp             x4, #2
1057
+    beq             2f
1058
+    cmp             x4, #3
1059
+    beq             3f
1060
+    cmp             x4, #4
1061
+    beq             4f
1062
+    cmp             x4, #5
1063
+    beq             5f
1064
+    cmp             x4, #6
1065
+    beq             6f
1066
+    cmp             x4, #7
1067
+    beq             7f
1068
+0:
1069
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 0
1070
+1:
1071
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 1
1072
+2:
1073
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 2
1074
+3:
1075
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 3
1076
+4:
1077
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 4
1078
+5:
1079
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 5
1080
+6:
1081
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 6
1082
+7:
1083
+    FILTER_CHROMA_VPS_SVE2  \w, \h, 7
1084
+endfunc
1085
+.endm
1086
+
1087
+CHROMA_VPS_SVE2 2, 4
1088
+CHROMA_VPS_SVE2 2, 8
1089
+CHROMA_VPS_SVE2 2, 16
1090
+CHROMA_VPS_SVE2 4, 2
1091
+CHROMA_VPS_SVE2 4, 4
1092
+CHROMA_VPS_SVE2 4, 8
1093
+CHROMA_VPS_SVE2 4, 16
1094
+CHROMA_VPS_SVE2 4, 32
1095
+CHROMA_VPS_SVE2 6, 8
1096
+CHROMA_VPS_SVE2 6, 16
1097
+CHROMA_VPS_SVE2 8, 2
1098
+CHROMA_VPS_SVE2 8, 4
1099
+CHROMA_VPS_SVE2 8, 6
1100
+CHROMA_VPS_SVE2 8, 8
1101
+CHROMA_VPS_SVE2 8, 16
1102
+CHROMA_VPS_SVE2 8, 32
1103
+CHROMA_VPS_SVE2 8, 12
1104
+CHROMA_VPS_SVE2 8, 64
1105
+CHROMA_VPS_SVE2 12, 16
1106
+CHROMA_VPS_SVE2 12, 32
1107
+CHROMA_VPS_SVE2 16, 4
1108
+CHROMA_VPS_SVE2 16, 8
1109
+CHROMA_VPS_SVE2 16, 12
1110
+CHROMA_VPS_SVE2 16, 16
1111
+CHROMA_VPS_SVE2 16, 32
1112
+CHROMA_VPS_SVE2 16, 64
1113
+CHROMA_VPS_SVE2 16, 24
1114
+CHROMA_VPS_SVE2 32, 8
1115
+CHROMA_VPS_SVE2 32, 16
1116
+CHROMA_VPS_SVE2 32, 24
1117
+CHROMA_VPS_SVE2 32, 32
1118
+CHROMA_VPS_SVE2 32, 64
1119
+CHROMA_VPS_SVE2 32, 48
1120
+CHROMA_VPS_SVE2 24, 32
1121
+CHROMA_VPS_SVE2 24, 64
1122
+CHROMA_VPS_SVE2 64, 16
1123
+CHROMA_VPS_SVE2 64, 32
1124
+CHROMA_VPS_SVE2 64, 48
1125
+CHROMA_VPS_SVE2 64, 64
1126
+CHROMA_VPS_SVE2 48, 64
1127
+
1128
+.macro qpel_start_chroma_sve2_0_1
1129
+    mov             z24.h, #64
1130
+.endm
1131
+
1132
+.macro qpel_start_chroma_sve2_1_1
1133
+    mov             z24.h, #58
1134
+    mov             z25.h, #10
1135
+.endm
1136
+
1137
+.macro qpel_start_chroma_sve2_2_1
1138
+    mov             z25.h, #54
1139
+.endm
1140
+
1141
+.macro qpel_start_chroma_sve2_3_1
1142
+    mov             z25.h, #46
1143
+    mov             z26.h, #28
1144
+    mov             z27.h, #6
1145
+.endm
1146
+
1147
+.macro qpel_start_chroma_sve2_4_1
1148
+    mov             z24.h, #36
1149
+.endm
1150
+
1151
+.macro qpel_start_chroma_sve2_5_1
1152
+    mov             z25.h, #28
1153
+    mov             z26.h, #46
1154
+    mov             z27.h, #6
1155
+.endm
1156
+
1157
+.macro qpel_start_chroma_sve2_6_1
1158
+    mov             z25.h, #54
1159
+.endm
1160
+
1161
+.macro qpel_start_chroma_sve2_7_1
1162
+    mov             z24.h, #58
1163
+    mov             z25.h, #10
1164
+.endm
1165
+
1166
+.macro FILTER_CHROMA_VSS_SVE2 w, h, v
1167
+    lsl             x1, x1, #1
1168
+    sub             x0, x0, x1
1169
+    lsl             x3, x3, #1
1170
+    mov             x5, #\h
1171
+    mov             x12, #\w
1172
+    lsl             x12, x12, #1
1173
+    qpel_start_chroma_sve2_\v\()_1
1174
+.loop_vss_sve2_\v\()_\w\()x\h:
1175
+    mov             x7, x2
1176
+    mov             x9, #0
1177
+.if \w == 4
1178
+.rept 2
1179
+    add             x6, x0, x9
1180
+    qpel_chroma_load_64b \v
1181
+    qpel_filter_chroma_\v\()_32b_1
1182
+    vss_end_sve2
1183
+    str             s17, [x7], #4
1184
+    add             x9, x9, #4
1185
+.endr
1186
+.else
1187
+.loop_vss_w8_sve2_\v\()_\w\()x\h:
1188
+    add             x6, x0, x9
1189
+    qpel_chroma_load_64b \v
1190
+    qpel_filter_chroma_\v\()_32b_1
1191
+    vss_end_sve2
1192
+    str             q17, [x7], #16
1193
+    add             x9, x9, #16
1194
+.if \w == 12
1195
+    add             x6, x0, x9
1196
+    qpel_chroma_load_64b \v
1197
+    qpel_filter_chroma_\v\()_32b_1
1198
+    vss_end_sve2
1199
+    str             d17, [x7], #8
1200
+    add             x9, x9, #8
1201
+.endif
1202
+    cmp             x9, x12
1203
+    blt             .loop_vss_w8_sve2_\v\()_\w\()x\h
1204
+.endif
1205
+    add             x0, x0, x1
1206
+    add             x2, x2, x3
1207
+    sub             x5, x5, #1
1208
+    cbnz            x5, .loop_vss_sve2_\v\()_\w\()x\h
1209
+    ret
1210
+.endm
1211
+
1212
+// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
1213
+.macro CHROMA_VSS_SVE2 w, h
1214
+function x265_interp_4tap_vert_ss_\w\()x\h\()_sve2
1215
+    cmp             x4, #0
1216
+    beq             0f
1217
+    cmp             x4, #1
1218
+    beq             1f
1219
+    cmp             x4, #2
1220
+    beq             2f
1221
+    cmp             x4, #3
1222
+    beq             3f
1223
+    cmp             x4, #4
1224
+    beq             4f
1225
+    cmp             x4, #5
1226
+    beq             5f
1227
+    cmp             x4, #6
1228
+    beq             6f
1229
+    cmp             x4, #7
1230
+    beq             7f
1231
+0:
1232
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 0
1233
+1:
1234
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 1
1235
+2:
1236
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 2
1237
+3:
1238
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 3
1239
+4:
1240
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 4
1241
+5:
1242
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 5
1243
+6:
1244
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 6
1245
+7:
1246
+    FILTER_CHROMA_VSS_SVE2  \w, \h, 7
1247
+endfunc
1248
+.endm
1249
+
1250
+CHROMA_VSS_SVE2 4, 4
1251
+CHROMA_VSS_SVE2 4, 8
1252
+CHROMA_VSS_SVE2 4, 16
1253
+CHROMA_VSS_SVE2 4, 32
1254
+CHROMA_VSS_SVE2 8, 2
1255
+CHROMA_VSS_SVE2 8, 4
1256
+CHROMA_VSS_SVE2 8, 6
1257
+CHROMA_VSS_SVE2 8, 8
1258
+CHROMA_VSS_SVE2 8, 16
1259
+CHROMA_VSS_SVE2 8, 32
1260
+CHROMA_VSS_SVE2 8, 12
1261
+CHROMA_VSS_SVE2 8, 64
1262
+CHROMA_VSS_SVE2 12, 16
1263
+CHROMA_VSS_SVE2 12, 32
1264
+CHROMA_VSS_SVE2 16, 4
1265
+CHROMA_VSS_SVE2 16, 8
1266
+CHROMA_VSS_SVE2 16, 12
1267
+CHROMA_VSS_SVE2 16, 16
1268
+CHROMA_VSS_SVE2 16, 32
1269
+CHROMA_VSS_SVE2 16, 64
1270
+CHROMA_VSS_SVE2 16, 24
1271
+CHROMA_VSS_SVE2 32, 8
1272
+CHROMA_VSS_SVE2 32, 16
1273
+CHROMA_VSS_SVE2 32, 24
1274
+CHROMA_VSS_SVE2 32, 32
1275
+CHROMA_VSS_SVE2 32, 64
1276
+CHROMA_VSS_SVE2 32, 48
1277
+CHROMA_VSS_SVE2 24, 32
1278
+CHROMA_VSS_SVE2 24, 64
1279
+CHROMA_VSS_SVE2 64, 16
1280
+CHROMA_VSS_SVE2 64, 32
1281
+CHROMA_VSS_SVE2 64, 48
1282
+CHROMA_VSS_SVE2 64, 64
1283
+CHROMA_VSS_SVE2 48, 64
1284
x265_3.6.tar.gz/source/common/aarch64/ipfilter.S Added
1056
 
1
@@ -0,0 +1,1054 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2021 MulticoreWare, Inc
4
+ *
5
+ * Authors: Sebastian Pop <spop@amazon.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// Functions in this file:
26
+// ***** luma_vpp *****
27
+// ***** luma_vps *****
28
+// ***** luma_vsp *****
29
+// ***** luma_vss *****
30
+// ***** luma_hpp *****
31
+// ***** luma_hps *****
32
+// ***** chroma_vpp *****
33
+// ***** chroma_vps *****
34
+// ***** chroma_vsp *****
35
+// ***** chroma_vss *****
36
+// ***** chroma_hpp *****
37
+// ***** chroma_hps *****
38
+
39
+#include "asm.S"
40
+#include "ipfilter-common.S"
41
+
42
+#ifdef __APPLE__
43
+.section __RODATA,__rodata
44
+#else
45
+.section .rodata
46
+#endif
47
+
48
+.align 4
49
+
50
+.text
51
+
52
+// ***** luma_vpp *****
53
+// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
54
+.macro LUMA_VPP_4xN h
55
+function x265_interp_8tap_vert_pp_4x\h\()_neon
56
+    movrel          x10, g_luma_s16
57
+    sub             x0, x0, x1
58
+    sub             x0, x0, x1, lsl #1         // src -= 3 * srcStride
59
+    lsl             x4, x4, #4
60
+    ldr             q0, [x10, x4]            // q0 = luma interpolate coeff
61
+    dup             v24.8h, v0.h[0]
62
+    dup             v25.8h, v0.h[1]
63
+    trn1            v24.2d, v24.2d, v25.2d
64
+    dup             v26.8h, v0.h[2]
65
+    dup             v27.8h, v0.h[3]
66
+    trn1            v26.2d, v26.2d, v27.2d
67
+    dup             v28.8h, v0.h[4]
68
+    dup             v29.8h, v0.h[5]
69
+    trn1            v28.2d, v28.2d, v29.2d
70
+    dup             v30.8h, v0.h[6]
71
+    dup             v31.8h, v0.h[7]
72
+    trn1            v30.2d, v30.2d, v31.2d
73
+
74
+    // prepare to load 8 lines
75
+    ld1             {v0.s}[0], [x0], x1
76
+    ld1             {v0.s}[1], [x0], x1
77
+    ushll           v0.8h, v0.8b, #0
78
+    ld1             {v1.s}[0], [x0], x1
79
+    ld1             {v1.s}[1], [x0], x1
80
+    ushll           v1.8h, v1.8b, #0
81
+    ld1             {v2.s}[0], [x0], x1
82
+    ld1             {v2.s}[1], [x0], x1
83
+    ushll           v2.8h, v2.8b, #0
84
+    ld1             {v3.s}[0], [x0], x1
85
+    ld1             {v3.s}[1], [x0], x1
86
+    ushll           v3.8h, v3.8b, #0
87
+
88
+    mov             x9, #\h
89
+.loop_4x\h:
90
+    ld1             {v4.s}[0], [x0], x1
91
+    ld1             {v4.s}[1], [x0], x1
92
+    ushll           v4.8h, v4.8b, #0
93
+
94
+    // row0-1
95
+    mul             v16.8h, v0.8h, v24.8h
96
+    ext             v21.16b, v0.16b, v1.16b, #8
97
+    mul             v17.8h, v21.8h, v24.8h
98
+    mov             v0.16b, v1.16b
99
+
100
+    // row2-3
101
+    mla             v16.8h, v1.8h, v26.8h
102
+    ext             v21.16b, v1.16b, v2.16b, #8
103
+    mla             v17.8h, v21.8h, v26.8h
104
+    mov             v1.16b, v2.16b
105
+
106
+    // row4-5
107
+    mla             v16.8h, v2.8h, v28.8h
108
+    ext             v21.16b, v2.16b, v3.16b, #8
109
+    mla             v17.8h, v21.8h, v28.8h
110
+    mov             v2.16b, v3.16b
111
+
112
+    // row6-7
113
+    mla             v16.8h, v3.8h, v30.8h
114
+    ext             v21.16b, v3.16b, v4.16b, #8
115
+    mla             v17.8h, v21.8h, v30.8h
116
+    mov             v3.16b, v4.16b
117
+
118
+    // sum row0-7
119
+    trn1            v20.2d, v16.2d, v17.2d
120
+    trn2            v21.2d, v16.2d, v17.2d
121
+    add             v16.8h, v20.8h, v21.8h
122
+
123
+    sqrshrun        v16.8b,  v16.8h,  #6
124
+    st1             {v16.s}[0], [x2], x3
125
+    st1             {v16.s}[1], [x2], x3
126
+
127
+    sub             x9, x9, #2
128
+    cbnz            x9, .loop_4x\h
129
+    ret
130
+endfunc
131
+.endm
132
+
133
+LUMA_VPP_4xN 4
134
+LUMA_VPP_4xN 8
135
+LUMA_VPP_4xN 16
136
+
137
+// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
138
+.macro LUMA_VPP w, h
139
+function x265_interp_8tap_vert_pp_\w\()x\h\()_neon
140
+    cmp             x4, #0
141
+    b.eq            0f
142
+    cmp             x4, #1
143
+    b.eq            1f
144
+    cmp             x4, #2
145
+    b.eq            2f
146
+    cmp             x4, #3
147
+    b.eq            3f
148
+0:
149
+    FILTER_LUMA_VPP \w, \h, 0
150
+1:
151
+    FILTER_LUMA_VPP \w, \h, 1
152
+2:
153
+    FILTER_LUMA_VPP \w, \h, 2
154
+3:
155
+    FILTER_LUMA_VPP \w, \h, 3
156
+endfunc
157
+.endm
158
+
159
+LUMA_VPP 8, 4
160
+LUMA_VPP 8, 8
161
+LUMA_VPP 8, 16
162
+LUMA_VPP 8, 32
163
+LUMA_VPP 12, 16
164
+LUMA_VPP 16, 4
165
+LUMA_VPP 16, 8
166
+LUMA_VPP 16, 16
167
+LUMA_VPP 16, 32
168
+LUMA_VPP 16, 64
169
+LUMA_VPP 16, 12
170
+LUMA_VPP 24, 32
171
+LUMA_VPP 32, 8
172
+LUMA_VPP 32, 16
173
+LUMA_VPP 32, 32
174
+LUMA_VPP 32, 64
175
+LUMA_VPP 32, 24
176
+LUMA_VPP 48, 64
177
+LUMA_VPP 64, 16
178
+LUMA_VPP 64, 32
179
+LUMA_VPP 64, 64
180
+LUMA_VPP 64, 48
181
+
182
+// ***** luma_vps *****
183
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
184
+.macro LUMA_VPS_4xN h
185
+function x265_interp_8tap_vert_ps_4x\h\()_neon
186
+    lsl             x3, x3, #1
187
+    lsl             x5, x4, #6
188
+    lsl             x4, x1, #2
189
+    sub             x4, x4, x1
190
+    sub             x0, x0, x4
191
+
192
+    mov             w6, #8192
193
+    dup             v28.4s, w6
194
+    mov             x4, #\h
195
+    movrel          x12, g_lumaFilter
196
+    add             x12, x12, x5
197
+    ld1r            {v16.2d}, [x12], #8
198
+    ld1r            {v17.2d}, [x12], #8
199
+    ld1r            {v18.2d}, [x12], #8
200
+    ld1r            {v19.2d}, [x12], #8
201
+    ld1r            {v20.2d}, [x12], #8
202
+    ld1r            {v21.2d}, [x12], #8
203
+    ld1r            {v22.2d}, [x12], #8
204
+    ld1r            {v23.2d}, [x12], #8
205
+
206
+.loop_vps_4x\h:
207
+    mov             x6, x0
208
+
209
+    ld1             {v0.s}[0], [x6], x1
210
+    ld1             {v1.s}[0], [x6], x1
211
+    ld1             {v2.s}[0], [x6], x1
212
+    ld1             {v3.s}[0], [x6], x1
213
+    ld1             {v4.s}[0], [x6], x1
214
+    ld1             {v5.s}[0], [x6], x1
215
+    ld1             {v6.s}[0], [x6], x1
216
+    ld1             {v7.s}[0], [x6], x1
217
+    uxtl            v0.8h, v0.8b
218
+    uxtl            v0.4s, v0.4h
219
+
220
+    uxtl            v1.8h, v1.8b
221
+    uxtl            v1.4s, v1.4h
222
+    mul             v0.4s, v0.4s, v16.4s
223
+
224
+    uxtl            v2.8h, v2.8b
225
+    uxtl            v2.4s, v2.4h
226
+    mla             v0.4s, v1.4s, v17.4s
227
+
228
+    uxtl            v3.8h, v3.8b
229
+    uxtl            v3.4s, v3.4h
230
+    mla             v0.4s, v2.4s, v18.4s
231
+
232
+    uxtl            v4.8h, v4.8b
233
+    uxtl            v4.4s, v4.4h
234
+    mla             v0.4s, v3.4s, v19.4s
235
+
236
+    uxtl            v5.8h, v5.8b
237
+    uxtl            v5.4s, v5.4h
238
+    mla             v0.4s, v4.4s, v20.4s
239
+
240
+    uxtl            v6.8h, v6.8b
241
+    uxtl            v6.4s, v6.4h
242
+    mla             v0.4s, v5.4s, v21.4s
243
+
244
+    uxtl            v7.8h, v7.8b
245
+    uxtl            v7.4s, v7.4h
246
+    mla             v0.4s, v6.4s, v22.4s
247
+
248
+    mla             v0.4s, v7.4s, v23.4s
249
+
250
+    sub             v0.4s, v0.4s, v28.4s
251
+    sqxtn           v0.4h, v0.4s
252
+    st1             {v0.8b}, [x2], x3
253
+
254
+    add             x0, x0, x1
255
+    sub             x4, x4, #1
256
+    cbnz            x4, .loop_vps_4x\h
257
+    ret
258
+endfunc
259
+.endm
260
+
261
+LUMA_VPS_4xN 4
262
+LUMA_VPS_4xN 8
263
+LUMA_VPS_4xN 16
264
+
265
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
266
+.macro LUMA_VPS w, h
267
+function x265_interp_8tap_vert_ps_\w\()x\h\()_neon
268
+    cmp             x4, #0
269
+    beq             0f
270
+    cmp             x4, #1
271
+    beq             1f
272
+    cmp             x4, #2
273
+    beq             2f
274
+    cmp             x4, #3
275
+    beq             3f
276
+0:
277
+    FILTER_VPS \w, \h, 0
278
+1:
279
+    FILTER_VPS \w, \h, 1
280
+2:
281
+    FILTER_VPS \w, \h, 2
282
+3:
283
+    FILTER_VPS \w, \h, 3
284
+endfunc
285
+.endm
286
+
287
+LUMA_VPS 8, 4
288
+LUMA_VPS 8, 8
289
+LUMA_VPS 8, 16
290
+LUMA_VPS 8, 32
291
+LUMA_VPS 12, 16
292
+LUMA_VPS 16, 4
293
+LUMA_VPS 16, 8
294
+LUMA_VPS 16, 16
295
+LUMA_VPS 16, 32
296
+LUMA_VPS 16, 64
297
+LUMA_VPS 16, 12
298
+LUMA_VPS 24, 32
299
+LUMA_VPS 32, 8
300
+LUMA_VPS 32, 16
301
+LUMA_VPS 32, 32
302
+LUMA_VPS 32, 64
303
+LUMA_VPS 32, 24
304
+LUMA_VPS 48, 64
305
+LUMA_VPS 64, 16
306
+LUMA_VPS 64, 32
307
+LUMA_VPS 64, 64
308
+LUMA_VPS 64, 48
309
+
310
+// ***** luma_vsp *****
311
+// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
312
+.macro LUMA_VSP_4xN h
313
+function x265_interp_8tap_vert_sp_4x\h\()_neon
314
+    lsl             x5, x4, #6
315
+    lsl             x1, x1, #1
316
+    lsl             x4, x1, #2
317
+    sub             x4, x4, x1
318
+    sub             x0, x0, x4
319
+
320
+    mov             w12, #1
321
+    lsl             w12, w12, #19
322
+    add             w12, w12, #2048
323
+    dup             v24.4s, w12
324
+    mov             x4, #\h
325
+    movrel          x12, g_lumaFilter
326
+    add             x12, x12, x5
327
+    ld1r            {v16.2d}, [x12], #8
328
+    ld1r            {v17.2d}, [x12], #8
329
+    ld1r            {v18.2d}, [x12], #8
330
+    ld1r            {v19.2d}, [x12], #8
331
+    ld1r            {v20.2d}, [x12], #8
332
+    ld1r            {v21.2d}, [x12], #8
333
+    ld1r            {v22.2d}, [x12], #8
334
+    ld1r            {v23.2d}, [x12], #8
335
+.loop_vsp_4x\h:
336
+    mov             x6, x0
337
+
338
+    ld1             {v0.8b}, [x6], x1
339
+    ld1             {v1.8b}, [x6], x1
340
+    ld1             {v2.8b}, [x6], x1
341
+    ld1             {v3.8b}, [x6], x1
342
+    ld1             {v4.8b}, [x6], x1
343
+    ld1             {v5.8b}, [x6], x1
344
+    ld1             {v6.8b}, [x6], x1
345
+    ld1             {v7.8b}, [x6], x1
346
+
347
+    sshll           v0.4s, v0.4h, #0
348
+    sshll           v1.4s, v1.4h, #0
349
+    mul             v0.4s, v0.4s, v16.4s
350
+    sshll           v2.4s, v2.4h, #0
351
+    mla             v0.4s, v1.4s, v17.4s
352
+    sshll           v3.4s, v3.4h, #0
353
+    mla             v0.4s, v2.4s, v18.4s
354
+    sshll           v4.4s, v4.4h, #0
355
+    mla             v0.4s, v3.4s, v19.4s
356
+    sshll           v5.4s, v5.4h, #0
357
+    mla             v0.4s, v4.4s, v20.4s
358
+    sshll           v6.4s, v6.4h, #0
359
+    mla             v0.4s, v5.4s, v21.4s
360
+    sshll           v7.4s, v7.4h, #0
361
+    mla             v0.4s, v6.4s, v22.4s
362
+
363
+    mla             v0.4s, v7.4s, v23.4s
364
+
365
+    add             v0.4s, v0.4s, v24.4s
366
+    sqshrun         v0.4h, v0.4s, #12
367
+    sqxtun          v0.8b, v0.8h
368
+    st1             {v0.s}[0], [x2], x3
369
+
370
+    add             x0, x0, x1
371
+    sub             x4, x4, #1
372
+    cbnz            x4, .loop_vsp_4x\h
373
+    ret
374
+endfunc
375
+.endm
376
+
377
+LUMA_VSP_4xN 4
378
+LUMA_VSP_4xN 8
379
+LUMA_VSP_4xN 16
380
+
381
+// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
382
+.macro LUMA_VSP w, h
383
+function x265_interp_8tap_vert_sp_\w\()x\h\()_neon
384
+    cmp             x4, #0
385
+    beq             0f
386
+    cmp             x4, #1
387
+    beq             1f
388
+    cmp             x4, #2
389
+    beq             2f
390
+    cmp             x4, #3
391
+    beq             3f
392
+0:
393
+    FILTER_VSP \w, \h, 0
394
+1:
395
+    FILTER_VSP \w, \h, 1
396
+2:
397
+    FILTER_VSP \w, \h, 2
398
+3:
399
+    FILTER_VSP \w, \h, 3
400
+endfunc
401
+.endm
402
+
403
+LUMA_VSP 8, 4
404
+LUMA_VSP 8, 8
405
+LUMA_VSP 8, 16
406
+LUMA_VSP 8, 32
407
+LUMA_VSP 12, 16
408
+LUMA_VSP 16, 4
409
+LUMA_VSP 16, 8
410
+LUMA_VSP 16, 16
411
+LUMA_VSP 16, 32
412
+LUMA_VSP 16, 64
413
+LUMA_VSP 16, 12
414
+LUMA_VSP 32, 8
415
+LUMA_VSP 32, 16
416
+LUMA_VSP 32, 32
417
+LUMA_VSP 32, 64
418
+LUMA_VSP 32, 24
419
+LUMA_VSP 64, 16
420
+LUMA_VSP 64, 32
421
+LUMA_VSP 64, 64
422
+LUMA_VSP 64, 48
423
+LUMA_VSP 24, 32
424
+LUMA_VSP 48, 64
425
+
426
+// ***** luma_vss *****
427
+// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
428
+.macro LUMA_VSS w, h
429
+function x265_interp_8tap_vert_ss_\w\()x\h\()_neon
430
+    cmp             x4, #0
431
+    beq             0f
432
+    cmp             x4, #1
433
+    beq             1f
434
+    cmp             x4, #2
435
+    beq             2f
436
+    cmp             x4, #3
437
+    beq             3f
438
+0:
439
+    FILTER_VSS \w, \h, 0
440
+1:
441
+    FILTER_VSS \w, \h, 1
442
+2:
443
+    FILTER_VSS \w, \h, 2
444
+3:
445
+    FILTER_VSS \w, \h, 3
446
+endfunc
447
+.endm
448
+
449
+LUMA_VSS 4, 4
450
+LUMA_VSS 4, 8
451
+LUMA_VSS 4, 16
452
+LUMA_VSS 8, 4
453
+LUMA_VSS 8, 8
454
+LUMA_VSS 8, 16
455
+LUMA_VSS 8, 32
456
+LUMA_VSS 12, 16
457
+LUMA_VSS 16, 4
458
+LUMA_VSS 16, 8
459
+LUMA_VSS 16, 16
460
+LUMA_VSS 16, 32
461
+LUMA_VSS 16, 64
462
+LUMA_VSS 16, 12
463
+LUMA_VSS 32, 8
464
+LUMA_VSS 32, 16
465
+LUMA_VSS 32, 32
466
+LUMA_VSS 32, 64
467
+LUMA_VSS 32, 24
468
+LUMA_VSS 64, 16
469
+LUMA_VSS 64, 32
470
+LUMA_VSS 64, 64
471
+LUMA_VSS 64, 48
472
+LUMA_VSS 24, 32
473
+LUMA_VSS 48, 64
474
+
475
+// ***** luma_hpp *****
476
+// void interp_horiz_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
477
+.macro LUMA_HPP w, h
478
+function x265_interp_horiz_pp_\w\()x\h\()_neon
479
+    cmp             x4, #0
480
+    beq             0f
481
+    cmp             x4, #1
482
+    beq             1f
483
+    cmp             x4, #2
484
+    beq             2f
485
+    cmp             x4, #3
486
+    beq             3f
487
+0:
488
+    FILTER_HPP \w, \h, 0
489
+1:
490
+    FILTER_HPP \w, \h, 1
491
+2:
492
+    FILTER_HPP \w, \h, 2
493
+3:
494
+    FILTER_HPP \w, \h, 3
495
+endfunc
496
+.endm
497
+
498
+LUMA_HPP 4, 4
499
+LUMA_HPP 4, 8
500
+LUMA_HPP 4, 16
501
+LUMA_HPP 8, 4
502
+LUMA_HPP 8, 8
503
+LUMA_HPP 8, 16
504
+LUMA_HPP 8, 32
505
+LUMA_HPP 12, 16
506
+LUMA_HPP 16, 4
507
+LUMA_HPP 16, 8
508
+LUMA_HPP 16, 12
509
+LUMA_HPP 16, 16
510
+LUMA_HPP 16, 32
511
+LUMA_HPP 16, 64
512
+LUMA_HPP 24, 32
513
+LUMA_HPP 32, 8
514
+LUMA_HPP 32, 16
515
+LUMA_HPP 32, 24
516
+LUMA_HPP 32, 32
517
+LUMA_HPP 32, 64
518
+LUMA_HPP 48, 64
519
+LUMA_HPP 64, 16
520
+LUMA_HPP 64, 32
521
+LUMA_HPP 64, 48
522
+LUMA_HPP 64, 64
523
+
524
+// ***** luma_hps *****
525
+// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
526
+.macro LUMA_HPS w, h
527
+function x265_interp_horiz_ps_\w\()x\h\()_neon
528
+    mov             w10, #\h
529
+    cmp             w5, #0
530
+    b.eq            6f
531
+    sub             x0, x0, x1, lsl #2
532
+    add             x0, x0, x1
533
+    add             w10, w10, #7
534
+6:
535
+    mov             w6, w10
536
+    cmp             w4, #0
537
+    b.eq            0f
538
+    cmp             w4, #1
539
+    b.eq            1f
540
+    cmp             w4, #2
541
+    b.eq            2f
542
+    cmp             w4, #3
543
+    b.eq            3f
544
+0:
545
+    FILTER_HPS \w, \h, 0
546
+1:
547
+    FILTER_HPS \w, \h, 1
548
+2:
549
+    FILTER_HPS \w, \h, 2
550
+3:
551
+    FILTER_HPS \w, \h, 3
552
+endfunc
553
+.endm
554
+
555
+LUMA_HPS 4, 4
556
+LUMA_HPS 4, 8
557
+LUMA_HPS 4, 16
558
+LUMA_HPS 8, 4
559
+LUMA_HPS 8, 8
560
+LUMA_HPS 8, 16
561
+LUMA_HPS 8, 32
562
+LUMA_HPS 12, 16
563
+LUMA_HPS 16, 4
564
+LUMA_HPS 16, 8
565
+LUMA_HPS 16, 12
566
+LUMA_HPS 16, 16
567
+LUMA_HPS 16, 32
568
+LUMA_HPS 16, 64
569
+LUMA_HPS 24, 32
570
+LUMA_HPS 32, 8
571
+LUMA_HPS 32, 16
572
+LUMA_HPS 32, 24
573
+LUMA_HPS 32, 32
574
+LUMA_HPS 32, 64
575
+LUMA_HPS 48, 64
576
+LUMA_HPS 64, 16
577
+LUMA_HPS 64, 32
578
+LUMA_HPS 64, 48
579
+LUMA_HPS 64, 64
580
+
581
+// ***** chroma_vpp *****
582
+// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
583
+.macro CHROMA_VPP w, h
584
+function x265_interp_4tap_vert_pp_\w\()x\h\()_neon
585
+    cmp             x4, #0
586
+    beq             0f
587
+    cmp             x4, #1
588
+    beq             1f
589
+    cmp             x4, #2
590
+    beq             2f
591
+    cmp             x4, #3
592
+    beq             3f
593
+    cmp             x4, #4
594
+    beq             4f
595
+    cmp             x4, #5
596
+    beq             5f
597
+    cmp             x4, #6
598
+    beq             6f
599
+    cmp             x4, #7
600
+    beq             7f
601
+0:
602
+    FILTER_CHROMA_VPP  \w, \h, 0
603
+1:
604
+    FILTER_CHROMA_VPP  \w, \h, 1
605
+2:
606
+    FILTER_CHROMA_VPP  \w, \h, 2
607
+3:
608
+    FILTER_CHROMA_VPP  \w, \h, 3
609
+4:
610
+    FILTER_CHROMA_VPP  \w, \h, 4
611
+5:
612
+    FILTER_CHROMA_VPP  \w, \h, 5
613
+6:
614
+    FILTER_CHROMA_VPP  \w, \h, 6
615
+7:
616
+    FILTER_CHROMA_VPP  \w, \h, 7
617
+endfunc
618
+.endm
619
+
620
+CHROMA_VPP 2, 4
621
+CHROMA_VPP 2, 8
622
+CHROMA_VPP 2, 16
623
+CHROMA_VPP 4, 2
624
+CHROMA_VPP 4, 4
625
+CHROMA_VPP 4, 8
626
+CHROMA_VPP 4, 16
627
+CHROMA_VPP 4, 32
628
+CHROMA_VPP 6, 8
629
+CHROMA_VPP 6, 16
630
+CHROMA_VPP 8, 2
631
+CHROMA_VPP 8, 4
632
+CHROMA_VPP 8, 6
633
+CHROMA_VPP 8, 8
634
+CHROMA_VPP 8, 16
635
+CHROMA_VPP 8, 32
636
+CHROMA_VPP 8, 12
637
+CHROMA_VPP 8, 64
638
+CHROMA_VPP 12, 16
639
+CHROMA_VPP 12, 32
640
+CHROMA_VPP 16, 4
641
+CHROMA_VPP 16, 8
642
+CHROMA_VPP 16, 12
643
+CHROMA_VPP 16, 16
644
+CHROMA_VPP 16, 32
645
+CHROMA_VPP 16, 64
646
+CHROMA_VPP 16, 24
647
+CHROMA_VPP 32, 8
648
+CHROMA_VPP 32, 16
649
+CHROMA_VPP 32, 24
650
+CHROMA_VPP 32, 32
651
+CHROMA_VPP 32, 64
652
+CHROMA_VPP 32, 48
653
+CHROMA_VPP 24, 32
654
+CHROMA_VPP 24, 64
655
+CHROMA_VPP 64, 16
656
+CHROMA_VPP 64, 32
657
+CHROMA_VPP 64, 48
658
+CHROMA_VPP 64, 64
659
+CHROMA_VPP 48, 64
660
+
661
+// ***** chroma_vps *****
662
+// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
663
+.macro CHROMA_VPS w, h
664
+function x265_interp_4tap_vert_ps_\w\()x\h\()_neon
665
+    cmp             x4, #0
666
+    beq             0f
667
+    cmp             x4, #1
668
+    beq             1f
669
+    cmp             x4, #2
670
+    beq             2f
671
+    cmp             x4, #3
672
+    beq             3f
673
+    cmp             x4, #4
674
+    beq             4f
675
+    cmp             x4, #5
676
+    beq             5f
677
+    cmp             x4, #6
678
+    beq             6f
679
+    cmp             x4, #7
680
+    beq             7f
681
+0:
682
+    FILTER_CHROMA_VPS  \w, \h, 0
683
+1:
684
+    FILTER_CHROMA_VPS  \w, \h, 1
685
+2:
686
+    FILTER_CHROMA_VPS  \w, \h, 2
687
+3:
688
+    FILTER_CHROMA_VPS  \w, \h, 3
689
+4:
690
+    FILTER_CHROMA_VPS  \w, \h, 4
691
+5:
692
+    FILTER_CHROMA_VPS  \w, \h, 5
693
+6:
694
+    FILTER_CHROMA_VPS  \w, \h, 6
695
+7:
696
+    FILTER_CHROMA_VPS  \w, \h, 7
697
+endfunc
698
+.endm
699
+
700
+CHROMA_VPS 2, 4
701
+CHROMA_VPS 2, 8
702
+CHROMA_VPS 2, 16
703
+CHROMA_VPS 4, 2
704
+CHROMA_VPS 4, 4
705
+CHROMA_VPS 4, 8
706
+CHROMA_VPS 4, 16
707
+CHROMA_VPS 4, 32
708
+CHROMA_VPS 6, 8
709
+CHROMA_VPS 6, 16
710
+CHROMA_VPS 8, 2
711
+CHROMA_VPS 8, 4
712
+CHROMA_VPS 8, 6
713
+CHROMA_VPS 8, 8
714
+CHROMA_VPS 8, 16
715
+CHROMA_VPS 8, 32
716
+CHROMA_VPS 8, 12
717
+CHROMA_VPS 8, 64
718
+CHROMA_VPS 12, 16
719
+CHROMA_VPS 12, 32
720
+CHROMA_VPS 16, 4
721
+CHROMA_VPS 16, 8
722
+CHROMA_VPS 16, 12
723
+CHROMA_VPS 16, 16
724
+CHROMA_VPS 16, 32
725
+CHROMA_VPS 16, 64
726
+CHROMA_VPS 16, 24
727
+CHROMA_VPS 32, 8
728
+CHROMA_VPS 32, 16
729
+CHROMA_VPS 32, 24
730
+CHROMA_VPS 32, 32
731
+CHROMA_VPS 32, 64
732
+CHROMA_VPS 32, 48
733
+CHROMA_VPS 24, 32
734
+CHROMA_VPS 24, 64
735
+CHROMA_VPS 64, 16
736
+CHROMA_VPS 64, 32
737
+CHROMA_VPS 64, 48
738
+CHROMA_VPS 64, 64
739
+CHROMA_VPS 48, 64
740
+
741
+// ***** chroma_vsp *****
742
+// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
743
+.macro CHROMA_VSP w, h
744
+function x265_interp_4tap_vert_sp_\w\()x\h\()_neon
745
+    cmp             x4, #0
746
+    beq             0f
747
+    cmp             x4, #1
748
+    beq             1f
749
+    cmp             x4, #2
750
+    beq             2f
751
+    cmp             x4, #3
752
+    beq             3f
753
+    cmp             x4, #4
754
+    beq             4f
755
+    cmp             x4, #5
756
+    beq             5f
757
+    cmp             x4, #6
758
+    beq             6f
759
+    cmp             x4, #7
760
+    beq             7f
761
+0:
762
+    FILTER_CHROMA_VSP  \w, \h, 0
763
+1:
764
+    FILTER_CHROMA_VSP  \w, \h, 1
765
+2:
766
+    FILTER_CHROMA_VSP  \w, \h, 2
767
+3:
768
+    FILTER_CHROMA_VSP  \w, \h, 3
769
+4:
770
+    FILTER_CHROMA_VSP  \w, \h, 4
771
+5:
772
+    FILTER_CHROMA_VSP  \w, \h, 5
773
+6:
774
+    FILTER_CHROMA_VSP  \w, \h, 6
775
+7:
776
+    FILTER_CHROMA_VSP  \w, \h, 7
777
+endfunc
778
+.endm
779
+
780
+CHROMA_VSP 4, 4
781
+CHROMA_VSP 4, 8
782
+CHROMA_VSP 4, 16
783
+CHROMA_VSP 4, 32
784
+CHROMA_VSP 8, 2
785
+CHROMA_VSP 8, 4
786
+CHROMA_VSP 8, 6
787
+CHROMA_VSP 8, 8
788
+CHROMA_VSP 8, 16
789
+CHROMA_VSP 8, 32
790
+CHROMA_VSP 8, 12
791
+CHROMA_VSP 8, 64
792
+CHROMA_VSP 12, 16
793
+CHROMA_VSP 12, 32
794
+CHROMA_VSP 16, 4
795
+CHROMA_VSP 16, 8
796
+CHROMA_VSP 16, 12
797
+CHROMA_VSP 16, 16
798
+CHROMA_VSP 16, 32
799
+CHROMA_VSP 16, 64
800
+CHROMA_VSP 16, 24
801
+CHROMA_VSP 32, 8
802
+CHROMA_VSP 32, 16
803
+CHROMA_VSP 32, 24
804
+CHROMA_VSP 32, 32
805
+CHROMA_VSP 32, 64
806
+CHROMA_VSP 32, 48
807
+CHROMA_VSP 24, 32
808
+CHROMA_VSP 24, 64
809
+CHROMA_VSP 64, 16
810
+CHROMA_VSP 64, 32
811
+CHROMA_VSP 64, 48
812
+CHROMA_VSP 64, 64
813
+CHROMA_VSP 48, 64
814
+
815
+// ***** chroma_vss *****
816
+// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx)
817
+.macro CHROMA_VSS w, h
818
+function x265_interp_4tap_vert_ss_\w\()x\h\()_neon
819
+    cmp             x4, #0
820
+    beq             0f
821
+    cmp             x4, #1
822
+    beq             1f
823
+    cmp             x4, #2
824
+    beq             2f
825
+    cmp             x4, #3
826
+    beq             3f
827
+    cmp             x4, #4
828
+    beq             4f
829
+    cmp             x4, #5
830
+    beq             5f
831
+    cmp             x4, #6
832
+    beq             6f
833
+    cmp             x4, #7
834
+    beq             7f
835
+0:
836
+    FILTER_CHROMA_VSS  \w, \h, 0
837
+1:
838
+    FILTER_CHROMA_VSS  \w, \h, 1
839
+2:
840
+    FILTER_CHROMA_VSS  \w, \h, 2
841
+3:
842
+    FILTER_CHROMA_VSS  \w, \h, 3
843
+4:
844
+    FILTER_CHROMA_VSS  \w, \h, 4
845
+5:
846
+    FILTER_CHROMA_VSS  \w, \h, 5
847
+6:
848
+    FILTER_CHROMA_VSS  \w, \h, 6
849
+7:
850
+    FILTER_CHROMA_VSS  \w, \h, 7
851
+endfunc
852
+.endm
853
+
854
+CHROMA_VSS 4, 4
855
+CHROMA_VSS 4, 8
856
+CHROMA_VSS 4, 16
857
+CHROMA_VSS 4, 32
858
+CHROMA_VSS 8, 2
859
+CHROMA_VSS 8, 4
860
+CHROMA_VSS 8, 6
861
+CHROMA_VSS 8, 8
862
+CHROMA_VSS 8, 16
863
+CHROMA_VSS 8, 32
864
+CHROMA_VSS 8, 12
865
+CHROMA_VSS 8, 64
866
+CHROMA_VSS 12, 16
867
+CHROMA_VSS 12, 32
868
+CHROMA_VSS 16, 4
869
+CHROMA_VSS 16, 8
870
+CHROMA_VSS 16, 12
871
+CHROMA_VSS 16, 16
872
+CHROMA_VSS 16, 32
873
+CHROMA_VSS 16, 64
874
+CHROMA_VSS 16, 24
875
+CHROMA_VSS 32, 8
876
+CHROMA_VSS 32, 16
877
+CHROMA_VSS 32, 24
878
+CHROMA_VSS 32, 32
879
+CHROMA_VSS 32, 64
880
+CHROMA_VSS 32, 48
881
+CHROMA_VSS 24, 32
882
+CHROMA_VSS 24, 64
883
+CHROMA_VSS 64, 16
884
+CHROMA_VSS 64, 32
885
+CHROMA_VSS 64, 48
886
+CHROMA_VSS 64, 64
887
+CHROMA_VSS 48, 64
888
+
889
+// ***** chroma_hpp *****
890
+// void interp_horiz_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx)
891
+.macro CHROMA_HPP w, h
892
+function x265_interp_4tap_horiz_pp_\w\()x\h\()_neon
893
+    cmp             x4, #0
894
+    beq             0f
895
+    cmp             x4, #1
896
+    beq             1f
897
+    cmp             x4, #2
898
+    beq             2f
899
+    cmp             x4, #3
900
+    beq             3f
901
+    cmp             x4, #4
902
+    beq             4f
903
+    cmp             x4, #5
904
+    beq             5f
905
+    cmp             x4, #6
906
+    beq             6f
907
+    cmp             x4, #7
908
+    beq             7f
909
+0:
910
+    FILTER_CHROMA_HPP  \w, \h, 0
911
+1:
912
+    FILTER_CHROMA_HPP  \w, \h, 1
913
+2:
914
+    FILTER_CHROMA_HPP  \w, \h, 2
915
+3:
916
+    FILTER_CHROMA_HPP  \w, \h, 3
917
+4:
918
+    FILTER_CHROMA_HPP  \w, \h, 4
919
+5:
920
+    FILTER_CHROMA_HPP  \w, \h, 5
921
+6:
922
+    FILTER_CHROMA_HPP  \w, \h, 6
923
+7:
924
+    FILTER_CHROMA_HPP  \w, \h, 7
925
+endfunc
926
+.endm
927
+
928
+CHROMA_HPP 2, 4
929
+CHROMA_HPP 2, 8
930
+CHROMA_HPP 2, 16
931
+CHROMA_HPP 4, 2
932
+CHROMA_HPP 4, 4
933
+CHROMA_HPP 4, 8
934
+CHROMA_HPP 4, 16
935
+CHROMA_HPP 4, 32
936
+CHROMA_HPP 6, 8
937
+CHROMA_HPP 6, 16
938
+CHROMA_HPP 8, 2
939
+CHROMA_HPP 8, 4
940
+CHROMA_HPP 8, 6
941
+CHROMA_HPP 8, 8
942
+CHROMA_HPP 8, 12
943
+CHROMA_HPP 8, 16
944
+CHROMA_HPP 8, 32
945
+CHROMA_HPP 8, 64
946
+CHROMA_HPP 12, 16
947
+CHROMA_HPP 12, 32
948
+CHROMA_HPP 16, 4
949
+CHROMA_HPP 16, 8
950
+CHROMA_HPP 16, 12
951
+CHROMA_HPP 16, 16
952
+CHROMA_HPP 16, 24
953
+CHROMA_HPP 16, 32
954
+CHROMA_HPP 16, 64
955
+CHROMA_HPP 24, 32
956
+CHROMA_HPP 24, 64
957
+CHROMA_HPP 32, 8
958
+CHROMA_HPP 32, 16
959
+CHROMA_HPP 32, 24
960
+CHROMA_HPP 32, 32
961
+CHROMA_HPP 32, 48
962
+CHROMA_HPP 32, 64
963
+CHROMA_HPP 48, 64
964
+CHROMA_HPP 64, 16
965
+CHROMA_HPP 64, 32
966
+CHROMA_HPP 64, 48
967
+CHROMA_HPP 64, 64
968
+
969
+// ***** chroma_hps *****
970
+// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt)
971
+.macro CHROMA_HPS w, h
972
+function x265_interp_4tap_horiz_ps_\w\()x\h\()_neon
973
+    cmp             x4, #0
974
+    beq             0f
975
+    cmp             x4, #1
976
+    beq             1f
977
+    cmp             x4, #2
978
+    beq             2f
979
+    cmp             x4, #3
980
+    beq             3f
981
+    cmp             x4, #4
982
+    beq             4f
983
+    cmp             x4, #5
984
+    beq             5f
985
+    cmp             x4, #6
986
+    beq             6f
987
+    cmp             x4, #7
988
+    beq             7f
989
+0:
990
+    FILTER_CHROMA_HPS  \w, \h, 0
991
+1:
992
+    FILTER_CHROMA_HPS  \w, \h, 1
993
+2:
994
+    FILTER_CHROMA_HPS  \w, \h, 2
995
+3:
996
+    FILTER_CHROMA_HPS  \w, \h, 3
997
+4:
998
+    FILTER_CHROMA_HPS  \w, \h, 4
999
+5:
1000
+    FILTER_CHROMA_HPS  \w, \h, 5
1001
+6:
1002
+    FILTER_CHROMA_HPS  \w, \h, 6
1003
+7:
1004
+    FILTER_CHROMA_HPS  \w, \h, 7
1005
+endfunc
1006
+.endm
1007
+
1008
+CHROMA_HPS 2, 4
1009
+CHROMA_HPS 2, 8
1010
+CHROMA_HPS 2, 16
1011
+CHROMA_HPS 4, 2
1012
+CHROMA_HPS 4, 4
1013
+CHROMA_HPS 4, 8
1014
+CHROMA_HPS 4, 16
1015
+CHROMA_HPS 4, 32
1016
+CHROMA_HPS 6, 8
1017
+CHROMA_HPS 6, 16
1018
+CHROMA_HPS 8, 2
1019
+CHROMA_HPS 8, 4
1020
+CHROMA_HPS 8, 6
1021
+CHROMA_HPS 8, 8
1022
+CHROMA_HPS 8, 12
1023
+CHROMA_HPS 8, 16
1024
+CHROMA_HPS 8, 32
1025
+CHROMA_HPS 8, 64
1026
+CHROMA_HPS 12, 16
1027
+CHROMA_HPS 12, 32
1028
+CHROMA_HPS 16, 4
1029
+CHROMA_HPS 16, 8
1030
+CHROMA_HPS 16, 12
1031
+CHROMA_HPS 16, 16
1032
+CHROMA_HPS 16, 24
1033
+CHROMA_HPS 16, 32
1034
+CHROMA_HPS 16, 64
1035
+CHROMA_HPS 24, 32
1036
+CHROMA_HPS 24, 64
1037
+CHROMA_HPS 32, 8
1038
+CHROMA_HPS 32, 16
1039
+CHROMA_HPS 32, 24
1040
+CHROMA_HPS 32, 32
1041
+CHROMA_HPS 32, 48
1042
+CHROMA_HPS 32, 64
1043
+CHROMA_HPS 48, 64
1044
+CHROMA_HPS 64, 16
1045
+CHROMA_HPS 64, 32
1046
+CHROMA_HPS 64, 48
1047
+CHROMA_HPS 64, 64
1048
+
1049
+const g_luma_s16, align=8
1050
+//       a, b,   c,  d,  e,   f, g,  h
1051
+.hword   0, 0,   0, 64,  0,   0, 0,  0
1052
+.hword  -1, 4, -10, 58, 17,  -5, 1,  0
1053
+.hword  -1, 4, -11, 40, 40, -11, 4, -1
1054
+.hword   0, 1,  -5, 17, 58, -10, 4, -1
1055
+endconst
1056
x265_3.6.tar.gz/source/common/aarch64/loopfilter-prim.cpp Added
293
 
1
@@ -0,0 +1,291 @@
2
+#include "loopfilter-prim.h"
3
+
4
+#define PIXEL_MIN 0
5
+
6
+
7
+
8
+#if !(HIGH_BIT_DEPTH) && defined(HAVE_NEON)
9
+#include<arm_neon.h>
10
+
11
+namespace
12
+{
13
+
14
+
15
+/* get the sign of input variable (TODO: this is a dup, make common) */
16
+static inline int8_t signOf(int x)
17
+{
18
+    return (x >> 31) | ((int)((((uint32_t) - x)) >> 31));
19
+}
20
+
21
+static inline int8x8_t sign_diff_neon(const uint8x8_t in0, const uint8x8_t in1)
22
+{
23
+    int16x8_t in = vsubl_u8(in0, in1);
24
+    return vmovn_s16(vmaxq_s16(vminq_s16(in, vdupq_n_s16(1)), vdupq_n_s16(-1)));
25
+}
26
+
27
+static void calSign_neon(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
28
+{
29
+    int x = 0;
30
+    for (; (x + 8) <= endX; x += 8)
31
+    {
32
+        *(int8x8_t *)&dst[x] = sign_diff_neon(*(uint8x8_t *)&src1[x], *(uint8x8_t *)&src2[x]);
33
+    }
34
+
35
+    for (; x < endX; x++)
36
+    {
37
+        dst[x] = signOf(src1[x] - src2[x]);
38
+    }
39
+}
40
+
41
+static void processSaoCUE0_neon(pixel *rec, int8_t *offsetEo, int width, int8_t *signLeft, intptr_t stride)
42
+{
43
+
44
+
45
+    int y;
46
+    int8_t signRight, signLeft0;
47
+    int8_t edgeType;
48
+
49
+    for (y = 0; y < 2; y++)
50
+    {
51
+        signLeft0 = signLeft[y];
52
+        int x = 0;
53
+
54
+        if (width >= 8)
55
+        {
56
+            int8x8_t vsignRight;
57
+            int8x8x2_t shifter;
58
+            shifter.val[1][0] = signLeft0;
59
+            static const int8x8_t index = {8, 0, 1, 2, 3, 4, 5, 6};
60
+            int8x8_t tbl = *(int8x8_t *)offsetEo;
61
+            for (; (x + 8) <= width; x += 8)
62
+            {
63
+                uint8x8_t in = *(uint8x8_t *)&rec[x];
64
+                vsignRight = sign_diff_neon(in, *(uint8x8_t *)&rec[x + 1]);
65
+                shifter.val[0] = vneg_s8(vsignRight);
66
+                int8x8_t tmp = shifter.val[0];
67
+                int8x8_t edge = vtbl2_s8(shifter, index);
68
+                int8x8_t vedgeType = vadd_s8(vadd_s8(vsignRight, edge), vdup_n_s8(2));
69
+                shifter.val[1][0] = tmp[7];
70
+                int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
71
+                t1 = vaddw_u8(t1, in);
72
+                t1 = vmaxq_s16(t1, vdupq_n_s16(0));
73
+                t1 = vminq_s16(t1, vdupq_n_s16(255));
74
+                *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
75
+            }
76
+            signLeft0 = shifter.val[1][0];
77
+        }
78
+        for (; x < width; x++)
79
+        {
80
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
81
+            edgeType = signRight + signLeft0 + 2;
82
+            signLeft0 = -signRight;
83
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
84
+        }
85
+        rec += stride;
86
+    }
87
+}
88
+
89
+static void processSaoCUE1_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int width)
90
+{
91
+    int x = 0;
92
+    int8_t signDown;
93
+    int edgeType;
94
+
95
+    if (width >= 8)
96
+    {
97
+        int8x8_t tbl = *(int8x8_t *)offsetEo;
98
+        for (; (x + 8) <= width; x += 8)
99
+        {
100
+            uint8x8_t in0 = *(uint8x8_t *)&rec[x];
101
+            uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
102
+            int8x8_t vsignDown = sign_diff_neon(in0, in1);
103
+            int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
104
+            *(int8x8_t *)&upBuff1[x] = vneg_s8(vsignDown);
105
+            int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
106
+            t1 = vaddw_u8(t1, in0);
107
+            *(uint8x8_t *)&rec[x] = vqmovun_s16(t1);
108
+        }
109
+    }
110
+    for (; x < width; x++)
111
+    {
112
+        signDown = signOf(rec[x] - rec[x + stride]);
113
+        edgeType = signDown + upBuff1[x] + 2;
114
+        upBuff1[x] = -signDown;
115
+        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
116
+    }
117
+}
118
+
119
+static void processSaoCUE1_2Rows_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int width)
120
+{
121
+    int y;
122
+    int8_t signDown;
123
+    int edgeType;
124
+
125
+    for (y = 0; y < 2; y++)
126
+    {
127
+        int x = 0;
128
+        if (width >= 8)
129
+        {
130
+            int8x8_t tbl = *(int8x8_t *)offsetEo;
131
+            for (; (x + 8) <= width; x += 8)
132
+            {
133
+                uint8x8_t in0 = *(uint8x8_t *)&rec[x];
134
+                uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
135
+                int8x8_t vsignDown = sign_diff_neon(in0, in1);
136
+                int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
137
+                *(int8x8_t *)&upBuff1[x] = vneg_s8(vsignDown);
138
+                int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
139
+                t1 = vaddw_u8(t1, in0);
140
+                t1 = vmaxq_s16(t1, vdupq_n_s16(0));
141
+                t1 = vminq_s16(t1, vdupq_n_s16(255));
142
+                *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
143
+
144
+            }
145
+        }
146
+        for (; x < width; x++)
147
+        {
148
+            signDown = signOf(rec[x] - rec[x + stride]);
149
+            edgeType = signDown + upBuff1[x] + 2;
150
+            upBuff1[x] = -signDown;
151
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
152
+        }
153
+        rec += stride;
154
+    }
155
+}
156
+
157
+static void processSaoCUE2_neon(pixel *rec, int8_t *bufft, int8_t *buff1, int8_t *offsetEo, int width, intptr_t stride)
158
+{
159
+    int x;
160
+
161
+    if (abs(buff1 - bufft) < 16)
162
+    {
163
+        for (x = 0; x < width; x++)
164
+        {
165
+            int8_t signDown = signOf(rec[x] - rec[x + stride + 1]);
166
+            int edgeType = signDown + buff1[x] + 2;
167
+            bufft[x + 1] = -signDown;
168
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
169
+        }
170
+    }
171
+    else
172
+    {
173
+        int8x8_t tbl = *(int8x8_t *)offsetEo;
174
+        x = 0;
175
+        for (; (x + 8) <= width; x += 8)
176
+        {
177
+            uint8x8_t in0 = *(uint8x8_t *)&rec[x];
178
+            uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride + 1];
179
+            int8x8_t vsignDown = sign_diff_neon(in0, in1);
180
+            int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&buff1[x]), vdup_n_s8(2));
181
+            *(int8x8_t *)&bufft[x + 1] = vneg_s8(vsignDown);
182
+            int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
183
+            t1 = vaddw_u8(t1, in0);
184
+            t1 = vmaxq_s16(t1, vdupq_n_s16(0));
185
+            t1 = vminq_s16(t1, vdupq_n_s16(255));
186
+            *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
187
+        }
188
+        for (; x < width; x++)
189
+        {
190
+            int8_t signDown = signOf(rec[x] - rec[x + stride + 1]);
190
+            int edgeType = signDown + buff1[x] + 2;
191
+            bufft[x + 1] = -signDown;
192
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
194
+        }
195
+
196
+    }
197
+}
198
+
199
+
200
+static void processSaoCUE3_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX)
201
+{
202
+    int8_t signDown;
203
+    int8_t edgeType;
204
+    int8x8_t tbl = *(int8x8_t *)offsetEo;
205
+
206
+    int x = startX + 1;
207
+    for (; (x + 8) <= endX; x += 8)
208
+    {
209
+        uint8x8_t in0 = *(uint8x8_t *)&rec[x];
210
+        uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
211
+        int8x8_t vsignDown = sign_diff_neon(in0, in1);
212
+        int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
213
+        *(int8x8_t *)&upBuff1[x - 1] = vneg_s8(vsignDown);
214
+        int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
215
+        t1 = vaddw_u8(t1, in0);
216
+        t1 = vmaxq_s16(t1, vdupq_n_s16(0));
217
+        t1 = vminq_s16(t1, vdupq_n_s16(255));
218
+        *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
219
+
220
+    }
221
+    for (; x < endX; x++)
222
+    {
223
+        signDown = signOf(rec[x] - rec[x + stride]);
224
+        edgeType = signDown + upBuff1[x] + 2;
225
+        upBuff1[x - 1] = -signDown;
226
+        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
227
+    }
228
+}
229
+
230
+static void processSaoCUB0_neon(pixel *rec, const int8_t *offset, int ctuWidth, int ctuHeight, intptr_t stride)
231
+{
232
+#define SAO_BO_BITS 5
233
+    const int boShift = X265_DEPTH - SAO_BO_BITS;
234
+    int x, y;
235
+    int8x8x4_t table;
236
+    table = *(int8x8x4_t *)offset;
237
+
238
+    for (y = 0; y < ctuHeight; y++)
239
+    {
240
+
241
+        for (x = 0; (x + 8) <= ctuWidth; x += 8)
242
+        {
243
+            int8x8_t in = *(int8x8_t *)&rec[x];
244
+            int8x8_t offsets = vtbl4_s8(table, vshr_n_u8(in, boShift));
245
+            int16x8_t tmp = vmovl_s8(offsets);
246
+            tmp = vaddw_u8(tmp, in);
247
+            tmp = vmaxq_s16(tmp, vdupq_n_s16(0));
248
+            tmp = vminq_s16(tmp, vdupq_n_s16(255));
249
+            *(uint8x8_t *)&rec[x] = vmovn_u16(tmp);
250
+        }
251
+        for (; x < ctuWidth; x++)
252
+        {
253
+            rec[x] = x265_clip(rec[x] + offset[rec[x] >> boShift]);
254
+        }
255
+        rec += stride;
256
+    }
257
+}
258
+
259
+}
260
+
261
+
262
+
263
+namespace X265_NS
264
+{
265
+void setupLoopFilterPrimitives_neon(EncoderPrimitives &p)
266
+{
267
+    p.saoCuOrgE0 = processSaoCUE0_neon;
268
+    p.saoCuOrgE1 = processSaoCUE1_neon;
269
+    p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows_neon;
270
+    p.saoCuOrgE2[0] = processSaoCUE2_neon;
271
+    p.saoCuOrgE2[1] = processSaoCUE2_neon;
272
+    p.saoCuOrgE3[0] = processSaoCUE3_neon;
273
+    p.saoCuOrgE3[1] = processSaoCUE3_neon;
274
+    p.saoCuOrgB0 = processSaoCUB0_neon;
275
+    p.sign = calSign_neon;
276
+
277
+}
278
+
279
+
280
+#else //HIGH_BIT_DEPTH
281
+
282
+
283
+namespace X265_NS
284
+{
285
+void setupLoopFilterPrimitives_neon(EncoderPrimitives &)
286
+{
287
+}
288
+
289
+#endif
290
+
291
+
292
+}
293
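All of the edge-offset (E0..E3) routines above vectorize the same per-sample rule that their scalar tail loops spell out: classify the sample by the signs of its differences with two neighbours along the chosen direction, map that class to one of five offsets, add the offset and clip back to pixel range. A compact standalone sketch of that rule for the 8-bit case (names and the explicit clamp are illustrative assumptions, not the encoder's own helpers):

    #include <algorithm>
    #include <cstdint>

    static inline int sgn(int v) { return (v > 0) - (v < 0); }   // -1, 0 or +1

    // One SAO edge-offset sample: a and b are the two neighbours along the
    // filtering direction (horizontal, vertical or one of the diagonals).
    static uint8_t saoEdgeSample(uint8_t cur, uint8_t a, uint8_t b, const int8_t offsetEo[5])
    {
        int edgeType = sgn(cur - a) + sgn(cur - b) + 2;          // 0..4
        return (uint8_t)std::min(255, std::max(0, cur + offsetEo[edgeType]));
    }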
x265_3.6.tar.gz/source/common/aarch64/loopfilter-prim.h Added
18
 
1
@@ -0,0 +1,16 @@
2
+#ifndef _LOOPFILTER_NEON_H__
3
+#define _LOOPFILTER_NEON_H__
4
+
5
+#include "common.h"
6
+#include "primitives.h"
7
+
8
+#define PIXEL_MIN 0
9
+
10
+namespace X265_NS
11
+{
12
+void setupLoopFilterPrimitives_neon(EncoderPrimitives &p);
13
+
14
+};
15
+
16
+
17
+#endif
18
x265_3.6.tar.gz/source/common/aarch64/mc-a-common.S Added
50
 
1
@@ -0,0 +1,48 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using NEON instruction set
26
+// that are also used by the SVE2 functions
27
+
28
+.arch           armv8-a
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.macro addAvg_start
37
+    lsl             x3, x3, #1
38
+    lsl             x4, x4, #1
39
+    mov             w11, #0x40
40
+    dup             v30.16b, w11
41
+.endm
42
+
43
+.macro addavg_1 v0, v1
44
+    add             \v0\().8h, \v0\().8h, \v1\().8h
45
+    saddl           v16.4s, \v0\().4h, v30.4h
46
+    saddl2          v17.4s, \v0\().8h, v30.8h
47
+    shrn            \v0\().4h, v16.4s, #7
48
+    shrn2           \v0\().8h, v17.4s, #7
49
+.endm
50
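The addavg_1 macro operates on 16-bit intermediate samples: the two sources are summed, the constant prepared by addAvg_start (every byte 0x40, i.e. 0x4040 per halfword) folds the internal-precision offset and the rounding term into a single add, and the result is shifted right by 7 before the callers saturate it to 8-bit with sqxtun. A scalar sketch of that arithmetic for the 8-bit build (the split of 0x4040 into offset and rounding follows x265's usual internal-precision convention and is an assumption here, not something stated in this patch):

    #include <algorithm>
    #include <cstdint>

    // 8-bit build: the int16 sources carry the internal offset already
    // subtracted, so adding 0x4000 restores it before the rounded shift by 7.
    static uint8_t addAvgScalar(int16_t s0, int16_t s1)
    {
        int sum = s0 + s1 + 0x4000 /* 2 * internal offset */ + 0x40 /* rounding */;
        return (uint8_t)std::min(255, std::max(0, sum >> 7));
    }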
x265_3.6.tar.gz/source/common/aarch64/mc-a-sve2.S Added
926
 
1
@@ -0,0 +1,924 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "mc-a-common.S"
27
+
28
+.arch armv8-a+sve2
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+function PFX(pixel_avg_pp_12x16_sve2)
41
+    sub             x1, x1, #4
42
+    sub             x3, x3, #4
43
+    sub             x5, x5, #4
44
+    ptrue           p0.s, vl1
45
+    ptrue           p1.b, vl8
46
+    mov             x11, #4
47
+.rept 16
48
+    ld1w            {z0.s}, p0/z, x2
49
+    ld1b            {z1.b}, p1/z, x2, x11
50
+    ld1w            {z2.s}, p0/z, x4
51
+    ld1b            {z3.b}, p1/z, x4, x11
52
+    add             x2, x2, #4
53
+    add             x2, x2, x3
54
+    add             x4, x4, #4
55
+    add             x4, x4, x5
56
+    urhadd          z0.b, p1/m, z0.b, z2.b
57
+    urhadd          z1.b, p1/m, z1.b, z3.b
58
+    st1b            {z0.b}, p1, x0
59
+    st1b            {z1.b}, p1, x0, x11
60
+    add             x0, x0, #4
61
+    add             x0, x0, x1
62
+.endr
63
+    ret
64
+endfunc
65
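The pixel_avg_pp kernels in this file (and their NEON counterparts) are plain rounded averages: urhadd computes (a + b + 1) >> 1 per unsigned byte lane, so every block size evaluates the same per-pixel expression and only the load/store tiling changes with the vector length. A one-line scalar reference (illustrative only, not the encoder's own C primitive):

    #include <cstdint>

    // rounded average of two 8-bit samples, as urhadd does per lane
    static inline uint8_t avgPixel(uint8_t a, uint8_t b)
    {
        return (uint8_t)((a + b + 1) >> 1);
    }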
+
66
+function PFX(pixel_avg_pp_24x32_sve2)
67
+    mov             w12, #4
68
+    rdvl            x9, #1
69
+    cmp             x9, #16
70
+    bgt             .vl_gt_16_pixel_avg_pp_24x32
71
+    sub             x1, x1, #16
72
+    sub             x3, x3, #16
73
+    sub             x5, x5, #16
74
+.lpavg_24x32_sve2:
75
+    sub             w12, w12, #1
76
+.rept 8
77
+    ld1             {v0.16b}, x2, #16
78
+    ld1             {v1.8b}, x2, x3
79
+    ld1             {v2.16b}, x4, #16
80
+    ld1             {v3.8b}, x4, x5
81
+    urhadd          v0.16b, v0.16b, v2.16b
82
+    urhadd          v1.8b, v1.8b, v3.8b
83
+    st1             {v0.16b}, x0, #16
84
+    st1             {v1.8b}, x0, x1
85
+.endr
86
+    cbnz            w12, .lpavg_24x32_sve2
87
+    ret
88
+.vl_gt_16_pixel_avg_pp_24x32:
89
+    mov             x10, #24
90
+    mov             x11, #0
91
+    whilelt         p0.b, x11, x10
92
+.vl_gt_16_loop_pixel_avg_pp_24x32:
93
+    sub             w12, w12, #1
94
+.rept 8
95
+    ld1b            {z0.b}, p0/z, x2
96
+    ld1b            {z2.b}, p0/z, x4
97
+    add             x2, x2, x3
98
+    add             x4, x4, x5
99
+    urhadd          z0.b, p0/m, z0.b, z2.b
100
+    st1b            {z0.b}, p0, x0
101
+    add             x0, x0, x1
102
+.endr
103
+    cbnz            w12, .vl_gt_16_loop_pixel_avg_pp_24x32
104
+    ret
105
+endfunc
106
+
107
+.macro pixel_avg_pp_32xN_sve2 h
108
+function PFX(pixel_avg_pp_32x\h\()_sve2)
109
+    rdvl            x9, #1
110
+    cmp             x9, #16
111
+    bgt             .vl_gt_16_pixel_avg_pp_32_\h
112
+.rept \h
113
+    ld1             {v0.16b-v1.16b}, x2, x3
114
+    ld1             {v2.16b-v3.16b}, x4, x5
115
+    urhadd          v0.16b, v0.16b, v2.16b
116
+    urhadd          v1.16b, v1.16b, v3.16b
117
+    st1             {v0.16b-v1.16b}, x0, x1
118
+.endr
119
+    ret
120
+.vl_gt_16_pixel_avg_pp_32_\h:
121
+    ptrue           p0.b, vl32
122
+.rept \h
123
+    ld1b            {z0.b}, p0/z, x2
124
+    ld1b            {z2.b}, p0/z, x4
125
+    add             x2, x2, x3
126
+    add             x4, x4, x5
127
+    urhadd          z0.b, p0/m, z0.b, z2.b
128
+    st1b            {z0.b}, p0, x0
129
+    add             x0, x0, x1
130
+.endr
131
+    ret
132
+endfunc
133
+.endm
134
+
135
+pixel_avg_pp_32xN_sve2 8
136
+pixel_avg_pp_32xN_sve2 16
137
+pixel_avg_pp_32xN_sve2 24
138
+
139
+.macro pixel_avg_pp_32xN1_sve2 h
140
+function PFX(pixel_avg_pp_32x\h\()_sve2)
141
+    rdvl            x9, #1
142
+    cmp             x9, #16
143
+    bgt             .vl_gt_16_pixel_avg_pp_32xN1_\h
144
+    mov             w12, #\h / 8
145
+.lpavg_sve2_32x\h\():
146
+    sub             w12, w12, #1
147
+.rept 8
148
+    ld1             {v0.16b-v1.16b}, x2, x3
149
+    ld1             {v2.16b-v3.16b}, x4, x5
150
+    urhadd          v0.16b, v0.16b, v2.16b
151
+    urhadd          v1.16b, v1.16b, v3.16b
152
+    st1             {v0.16b-v1.16b}, x0, x1
153
+.endr
154
+    cbnz            w12, .lpavg_sve2_32x\h
155
+    ret
156
+.vl_gt_16_pixel_avg_pp_32xN1_\h:
157
+    ptrue           p0.b, vl32
158
+    mov             w12, #\h / 8
159
+.eq_32_loop_pixel_avg_pp_32xN1_\h\():
160
+    sub             w12, w12, #1
161
+.rept 8
162
+    ld1b            {z0.b}, p0/z, x2
163
+    ld1b            {z2.b}, p0/z, x4
164
+    add             x2, x2, x3
165
+    add             x4, x4, x5
166
+    urhadd          z0.b, p0/m, z0.b, z2.b
167
+    st1b            {z0.b}, p0, x0
168
+    add             x0, x0, x1
169
+.endr
170
+    cbnz            w12, .eq_32_loop_pixel_avg_pp_32xN1_\h
171
+    ret
172
+endfunc
173
+.endm
174
+
175
+pixel_avg_pp_32xN1_sve2 32
176
+pixel_avg_pp_32xN1_sve2 64
177
+
178
+function PFX(pixel_avg_pp_48x64_sve2)
179
+    rdvl            x9, #1
180
+    cmp             x9, #16
181
+    bgt             .vl_gt_16_pixel_avg_pp_48x64
182
+    mov             w12, #8
183
+.lpavg_48x64_sve2:
184
+    sub             w12, w12, #1
185
+.rept 8
186
+    ld1             {v0.16b-v2.16b}, x2, x3
187
+    ld1             {v3.16b-v5.16b}, x4, x5
188
+    urhadd          v0.16b, v0.16b, v3.16b
189
+    urhadd          v1.16b, v1.16b, v4.16b
190
+    urhadd          v2.16b, v2.16b, v5.16b
191
+    st1             {v0.16b-v2.16b}, x0, x1
192
+.endr
193
+    cbnz            w12, .lpavg_48x64_sve2
194
+    ret
195
+.vl_gt_16_pixel_avg_pp_48x64:
196
+    cmp             x9, #32
197
+    bgt             .vl_gt_32_pixel_avg_pp_48x64
198
+    ptrue           p0.b, vl32
199
+    ptrue           p1.b, vl16
200
+    mov             w12, #8
201
+.vl_eq_32_pixel_avg_pp_48x64:
202
+    sub             w12, w12, #1
203
+.rept 8
204
+    ld1b            {z0.b}, p0/z, x2
205
+    ld1b            {z1.b}, p1/z, x2, #1, mul vl
206
+    ld1b            {z2.b}, p0/z, x4
207
+    ld1b            {z3.b}, p1/z, x4, #1, mul vl
208
+    add             x2, x2, x3
209
+    add             x4, x4, x5
210
+    urhadd          z0.b, p0/m, z0.b, z2.b
211
+    urhadd          z1.b, p1/m, z1.b, z3.b
212
+    st1b            {z0.b}, p0, x0
213
+    st1b            {z1.b}, p1, x0, #1, mul vl
214
+    add             x0, x0, x1
215
+.endr
216
+    cbnz            w12, .vl_eq_32_pixel_avg_pp_48x64
217
+    ret
218
+.vl_gt_32_pixel_avg_pp_48x64:
219
+    mov             x10, #48
220
+    mov             x11, #0
221
+    whilelt         p0.b, x11, x10
222
+    mov             w12, #8
223
+.loop_gt_32_pixel_avg_pp_48x64:
224
+    sub             w12, w12, #1
225
+.rept 8
226
+    ld1b            {z0.b}, p0/z, x2
227
+    ld1b            {z2.b}, p0/z, x4
228
+    add             x2, x2, x3
229
+    add             x4, x4, x5
230
+    urhadd          z0.b, p0/m, z0.b, z2.b
231
+    st1b            {z0.b}, p0, x0
232
+    add             x0, x0, x1
233
+.endr
234
+    cbnz            w12, .loop_gt_32_pixel_avg_pp_48x64
235
+    ret
236
+endfunc
237
+
238
+.macro pixel_avg_pp_64xN_sve2 h
239
+function PFX(pixel_avg_pp_64x\h\()_sve2)
240
+    rdvl            x9, #1
241
+    cmp             x9, #16
242
+    bgt             .vl_gt_16_pixel_avg_pp_64x\h
243
+    mov             w12, #\h / 4
244
+.lpavg_sve2_64x\h\():
245
+    sub             w12, w12, #1
246
+.rept 4
247
+    ld1             {v0.16b-v3.16b}, x2, x3
248
+    ld1             {v4.16b-v7.16b}, x4, x5
249
+    urhadd          v0.16b, v0.16b, v4.16b
250
+    urhadd          v1.16b, v1.16b, v5.16b
251
+    urhadd          v2.16b, v2.16b, v6.16b
252
+    urhadd          v3.16b, v3.16b, v7.16b
253
+    st1             {v0.16b-v3.16b}, x0, x1
254
+.endr
255
+    cbnz            w12, .lpavg_sve2_64x\h
256
+    ret
257
+.vl_gt_16_pixel_avg_pp_64x\h\():
258
+    cmp             x9, #48
259
+    bgt             .vl_gt_48_pixel_avg_pp_64x\h
260
+    ptrue           p0.b, vl32
261
+    mov             w12, #\h / 4
262
+.vl_eq_32_pixel_avg_pp_64x\h\():
263
+    sub             w12, w12, #1
264
+.rept 4
265
+    ld1b            {z0.b}, p0/z, x2
266
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
267
+    ld1b            {z2.b}, p0/z, x4
268
+    ld1b            {z3.b}, p0/z, x4, #1, mul vl
269
+    add             x2, x2, x3
270
+    add             x4, x4, x5
271
+    urhadd          z0.b, p0/m, z0.b, z2.b
272
+    urhadd          z1.b, p0/m, z1.b, z3.b
273
+    st1b            {z0.b}, p0, x0
274
+    st1b            {z1.b}, p0, x0, #1, mul vl
275
+    add             x0, x0, x1
276
+.endr
277
+    cbnz            w12, .vl_eq_32_pixel_avg_pp_64x\h
278
+    ret
279
+.vl_gt_48_pixel_avg_pp_64x\h\():
280
+    ptrue           p0.b, vl64
281
+    mov             w12, #\h / 4
282
+.vl_eq_64_pixel_avg_pp_64x\h\():
283
+    sub             w12, w12, #1
284
+.rept 4
285
+    ld1b            {z0.b}, p0/z, x2
286
+    ld1b            {z2.b}, p0/z, x4
287
+    add             x2, x2, x3
288
+    add             x4, x4, x5
289
+    urhadd          z0.b, p0/m, z0.b, z2.b
290
+    st1b            {z0.b}, p0, x0
291
+    add             x0, x0, x1
292
+.endr
293
+    cbnz            w12, .vl_eq_64_pixel_avg_pp_64x\h
294
+    ret
295
+endfunc
296
+.endm
297
+
298
+pixel_avg_pp_64xN_sve2 16
299
+pixel_avg_pp_64xN_sve2 32
300
+pixel_avg_pp_64xN_sve2 48
301
+pixel_avg_pp_64xN_sve2 64
302
+
303
+// void addAvg(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride)
304
+
305
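The SVE2 addAvg paths below reach the same result as the NEON macros by a different route: the plain 16-bit sum is rounded, shifted by 7 and saturated to signed 8-bit with sqrshrnb, which stays in range because the sources still carry the negative internal offset, and the following add of #0x80 re-centres the value into the unsigned 0..255 range. A small scalar check of that equivalence (illustrative; it assumes the same offset convention as the NEON path and sums that fit in an int):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    static uint8_t addAvgNeonStyle(int16_t s0, int16_t s1)    // offset folded into one constant
    {
        return (uint8_t)std::min(255, std::max(0, (s0 + s1 + 0x4040) >> 7));
    }

    static uint8_t addAvgSve2Style(int16_t s0, int16_t s1)    // sqrshrnb #7, then add #0x80
    {
        int t = std::min(127, std::max(-128, (s0 + s1 + 0x40) >> 7));
        return (uint8_t)(t + 0x80);
    }

    int main()
    {
        for (int s0 = -8192; s0 <= 8191; s0 += 37)
            for (int s1 = -8192; s1 <= 8191; s1 += 41)
                assert(addAvgNeonStyle((int16_t)s0, (int16_t)s1) ==
                       addAvgSve2Style((int16_t)s0, (int16_t)s1));
        return 0;
    }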
+.macro addAvg_2xN_sve2 h
306
+function PFX(addAvg_2x\h\()_sve2)
307
+    ptrue           p0.s, vl2
308
+    ptrue           p1.h, vl4
309
+    ptrue           p2.h, vl2
310
+.rept \h / 2
311
+    ld1rw           {z0.s}, p0/z, x0
312
+    ld1rw           {z1.s}, p0/z, x1
313
+    add             x0, x0, x3, lsl #1
314
+    add             x1, x1, x4, lsl #1
315
+    ld1rw           {z2.s}, p0/z, x0
316
+    ld1rw           {z3.s}, p0/z, x1
317
+    add             x0, x0, x3, lsl #1
318
+    add             x1, x1, x4, lsl #1
319
+    add             z0.h, p1/m, z0.h, z1.h
320
+    add             z2.h, p1/m, z2.h, z3.h
321
+    sqrshrnb        z0.b, z0.h, #7
322
+    add             z0.b, z0.b, #0x80
323
+    sqrshrnb        z2.b, z2.h, #7
324
+    add             z2.b, z2.b, #0x80
325
+    st1b            {z0.h}, p2, x2
326
+    add             x2, x2, x5
327
+    st1b            {z2.h}, p2, x2
328
+    add             x2, x2, x5
329
+.endr
330
+    ret
331
+endfunc
332
+.endm
333
+
334
+addAvg_2xN_sve2 4
335
+addAvg_2xN_sve2 8
336
+addAvg_2xN_sve2 16
337
+
338
+.macro addAvg_6xN_sve2 h
339
+function PFX(addAvg_6x\h\()_sve2)
340
+    mov             w12, #\h / 2
341
+    ptrue           p0.b, vl16
342
+    ptrue           p2.h, vl6
343
+.loop_sve2_addavg_6x\h\():
344
+    sub             w12, w12, #1
345
+    ld1b            {z0.b}, p0/z, x0
346
+    ld1b            {z1.b}, p0/z, x1
347
+    add             x0, x0, x3, lsl #1
348
+    add             x1, x1, x4, lsl #1
349
+    ld1b            {z2.b}, p0/z, x0
350
+    ld1b            {z3.b}, p0/z, x1
351
+    add             x0, x0, x3, lsl #1
352
+    add             x1, x1, x4, lsl #1
353
+    add             z0.h, p0/m, z0.h, z1.h
354
+    add             z2.h, p0/m, z2.h, z3.h
355
+    sqrshrnb        z0.b, z0.h, #7
356
+    sqrshrnb        z2.b, z2.h, #7
357
+    add             z0.b, z0.b, #0x80
358
+    add             z2.b, z2.b, #0x80
359
+    st1b            {z0.h}, p2, x2
360
+    add             x2, x2, x5
361
+    st1b            {z2.h}, p2, x2
362
+    add             x2, x2, x5
363
+    cbnz            w12, .loop_sve2_addavg_6x\h
364
+    ret
365
+endfunc
366
+.endm
367
+
368
+addAvg_6xN_sve2 8
369
+addAvg_6xN_sve2 16
370
+
371
+.macro addAvg_8xN_sve2 h
372
+function PFX(addAvg_8x\h\()_sve2)
373
+    ptrue           p0.b, vl16
374
+.rept \h / 2
375
+    ld1b            {z0.b}, p0/z, x0
376
+    ld1b            {z1.b}, p0/z, x1
377
+    add             x0, x0, x3, lsl #1
378
+    add             x1, x1, x4, lsl #1
379
+    ld1b            {z2.b}, p0/z, x0
380
+    ld1b            {z3.b}, p0/z, x1
381
+    add             x0, x0, x3, lsl #1
382
+    add             x1, x1, x4, lsl #1
383
+    add             z0.h, p0/m, z0.h, z1.h
384
+    add             z2.h, p0/m, z2.h, z3.h
385
+    sqrshrnb        z0.b, z0.h, #7
386
+    add             z0.b, z0.b, #0x80
387
+    sqrshrnb        z2.b, z2.h, #7
388
+    add             z2.b, z2.b, #0x80
389
+    st1b            {z0.h}, p0, x2
390
+    add             x2, x2, x5
391
+    st1b            {z2.h}, p0, x2
392
+    add             x2, x2, x5
393
+.endr
394
+    ret
395
+endfunc
396
+.endm
397
+
398
+.macro addAvg_8xN1_sve2 h
399
+function PFX(addAvg_8x\h\()_sve2)
400
+    mov             w12, #\h / 2
401
+    ptrue           p0.b, vl16
402
+.loop_sve2_addavg_8x\h\():
403
+    sub             w12, w12, #1
404
+    ld1b            {z0.b}, p0/z, x0
405
+    ld1b            {z1.b}, p0/z, x1
406
+    add             x0, x0, x3, lsl #1
407
+    add             x1, x1, x4, lsl #1
408
+    ld1b            {z2.b}, p0/z, x0
409
+    ld1b            {z3.b}, p0/z, x1
410
+    add             x0, x0, x3, lsl #1
411
+    add             x1, x1, x4, lsl #1
412
+    add             z0.h, p0/m, z0.h, z1.h
413
+    add             z2.h, p0/m, z2.h, z3.h
414
+    sqrshrnb        z0.b, z0.h, #7
415
+    add             z0.b, z0.b, #0x80
416
+    sqrshrnb        z2.b, z2.h, #7
417
+    add             z2.b, z2.b, #0x80
418
+    st1b            {z0.h}, p0, x2
419
+    add             x2, x2, x5
420
+    st1b            {z2.h}, p0, x2
421
+    add             x2, x2, x5
422
+    cbnz            w12, .loop_sve2_addavg_8x\h
423
+    ret
424
+endfunc
425
+.endm
426
+
427
+addAvg_8xN_sve2 2
428
+addAvg_8xN_sve2 4
429
+addAvg_8xN_sve2 6
430
+addAvg_8xN_sve2 8
431
+addAvg_8xN_sve2 12
432
+addAvg_8xN_sve2 16
433
+addAvg_8xN1_sve2 32
434
+addAvg_8xN1_sve2 64
435
+
436
+.macro addAvg_12xN_sve2 h
437
+function PFX(addAvg_12x\h\()_sve2)
438
+    mov             w12, #\h
439
+    rdvl            x9, #1
440
+    cmp             x9, #16
441
+    bgt             .vl_gt_16_addAvg_12x\h
442
+    ptrue           p0.b, vl16
443
+    ptrue           p1.b, vl8
444
+.loop_sve2_addavg_12x\h\():
445
+    sub             w12, w12, #1
446
+    ld1b            {z0.b}, p0/z, x0
447
+    ld1b            {z1.b}, p0/z, x1
448
+    ld1b            {z2.b}, p1/z, x0, #1, mul vl
449
+    ld1b            {z3.b}, p1/z, x1, #1, mul vl
450
+    add             x0, x0, x3, lsl #1
451
+    add             x1, x1, x4, lsl #1
452
+    add             z0.h, p0/m, z0.h, z1.h
453
+    add             z2.h, p1/m, z2.h, z3.h
454
+    sqrshrnb        z0.b, z0.h, #7
455
+    add             z0.b, z0.b, #0x80
456
+    sqrshrnb        z2.b, z2.h, #7
457
+    add             z2.b, z2.b, #0x80
458
+    st1b            {z0.h}, p0, x2
459
+    st1b            {z2.h}, p1, x2, #1, mul vl
460
+    add             x2, x2, x5
461
+    cbnz            w12, .loop_sve2_addavg_12x\h
462
+    ret
463
+.vl_gt_16_addAvg_12x\h\():
464
+    mov             x10, #24
465
+    mov             x11, #0
466
+    whilelt         p0.b, x11, x10
467
+.loop_sve2_gt_16_addavg_12x\h\():
468
+    sub             w12, w12, #1
469
+    ld1b            {z0.b}, p0/z, x0
470
+    ld1b            {z1.b}, p0/z, x1
471
+    add             x0, x0, x3, lsl #1
472
+    add             x1, x1, x4, lsl #1
473
+    add             z0.h, p0/m, z0.h, z1.h
474
+    sqrshrnb        z0.b, z0.h, #7
475
+    add             z0.b, z0.b, #0x80
476
+    sqrshrnb        z2.b, z2.h, #7
477
+    add             z2.b, z2.b, #0x80
478
+    st1b            {z0.h}, p0, x2
479
+    add             x2, x2, x5
480
+    cbnz            w12, .loop_sve2_gt_16_addavg_12x\h
481
+    ret
482
+endfunc
483
+.endm
484
+
485
+addAvg_12xN_sve2 16
486
+addAvg_12xN_sve2 32
487
+
488
+.macro addAvg_16xN_sve2 h
489
+function PFX(addAvg_16x\h\()_sve2)
490
+    mov             w12, #\h
491
+    rdvl            x9, #1
492
+    cmp             x9, #16
493
+    bgt             .vl_gt_16_addAvg_16x\h
494
+    ptrue           p0.b, vl16
495
+.loop_eq_16_sve2_addavg_16x\h\():
496
+    sub             w12, w12, #1
497
+    ld1b            {z0.b}, p0/z, x0
498
+    ld1b            {z1.b}, p0/z, x1
499
+    ld1b            {z2.b}, p0/z, x0, #1, mul vl
500
+    ld1b            {z3.b}, p0/z, x1, #1, mul vl
501
+    add             x0, x0, x3, lsl #1
502
+    add             x1, x1, x4, lsl #1
503
+    add             z0.h, p0/m, z0.h, z1.h
504
+    add             z2.h, p0/m, z2.h, z3.h
505
+    sqrshrnb        z0.b, z0.h, #7
506
+    add             z0.b, z0.b, #0x80
507
+    sqrshrnb        z2.b, z2.h, #7
508
+    add             z2.b, z2.b, #0x80
509
+    st1b            {z0.h}, p0, x2
510
+    st1b            {z2.h}, p0, x2, #1, mul vl
511
+    add             x2, x2, x5
512
+    cbnz            w12, .loop_eq_16_sve2_addavg_16x\h
513
+    ret
514
+.vl_gt_16_addAvg_16x\h\():
515
+    cmp             x9, #32
516
+    bgt             .vl_gt_32_addAvg_16x\h
517
+    ptrue           p0.b, vl32
518
+.loop_gt_16_sve2_addavg_16x\h\():
519
+    sub             w12, w12, #1
520
+    ld1b            {z0.b}, p0/z, x0
521
+    ld1b            {z1.b}, p0/z, x1
522
+    add             x0, x0, x3, lsl #1
523
+    add             x1, x1, x4, lsl #1
524
+    add             z0.h, p0/m, z0.h, z1.h
525
+    sqrshrnb        z0.b, z0.h, #7
526
+    add             z0.b, z0.b, #0x80
527
+    st1b            {z0.h}, p1, x2
528
+    add             x2, x2, x5
529
+    cbnz            w12, .loop_gt_16_sve2_addavg_16x\h
530
+    ret
531
+.vl_gt_32_addAvg_16x\h\():
532
+    mov             x10, #48
533
+    mov             x11, #0
534
+    whilelt         p0.b, x11, x10
535
+.loop_gt_32_sve2_addavg_16x\h\():
536
+    sub             w12, w12, #1
537
+    ld1b            {z0.b}, p0/z, x0
538
+    add             x0, x0, x3, lsl #1
539
+    add             x1, x1, x4, lsl #1
540
+    add             z0.h, p0/m, z0.h, z1.h
541
+    sqrshrnb        z0.b, z0.h, #7
542
+    add             z0.b, z0.b, #0x80
543
+    st1b            {z0.h}, p0, x2
544
+    add             x2, x2, x5
545
+    cbnz            w12, .loop_gt_32_sve2_addavg_16x\h
546
+    ret
547
+endfunc
548
+.endm
549
+
550
+addAvg_16xN_sve2 4
551
+addAvg_16xN_sve2 8
552
+addAvg_16xN_sve2 12
553
+addAvg_16xN_sve2 16
554
+addAvg_16xN_sve2 24
555
+addAvg_16xN_sve2 32
556
+addAvg_16xN_sve2 64
557
+
558
+.macro addAvg_24xN_sve2 h
559
+function PFX(addAvg_24x\h\()_sve2)
560
+    mov             w12, #\h
561
+    rdvl            x9, #1
562
+    cmp             x9, #16
563
+    bgt             .vl_gt_16_addAvg_24x\h
564
+    addAvg_start
565
+.loop_eq_16_sve2_addavg_24x\h\():
566
+    sub             w12, w12, #1
567
+    ld1             {v0.16b-v2.16b}, x0, x3
568
+    ld1             {v3.16b-v5.16b}, x1, x4
569
+    addavg_1        v0, v3
570
+    addavg_1        v1, v4
571
+    addavg_1        v2, v5
572
+    sqxtun          v0.8b, v0.8h
573
+    sqxtun          v1.8b, v1.8h
574
+    sqxtun          v2.8b, v2.8h
575
+    st1             {v0.8b-v2.8b}, x2, x5
576
+    cbnz            w12, .loop_eq_16_sve2_addavg_24x\h
577
+    ret
578
+.vl_gt_16_addAvg_24x\h\():
579
+    cmp             x9, #48
580
+    bgt             .vl_gt_48_addAvg_24x\h
581
+    ptrue           p0.b, vl32
582
+    ptrue           p1.b, vl16
583
+.loop_gt_16_sve2_addavg_24x\h\():
584
+    sub             w12, w12, #1
585
+    ld1b            {z0.b}, p0/z, x0
586
+    ld1b            {z1.b}, p1/z, x0, #1, mul vl
587
+    ld1b            {z2.b}, p0/z, x1
588
+    ld1b            {z3.b}, p1/z, x1, #1, mul vl
589
+    add             x0, x0, x3, lsl #1
590
+    add             x1, x1, x4, lsl #1
591
+    add             z0.h, p0/m, z0.h, z2.h
592
+    add             z1.h, p1/m, z1.h, z3.h
593
+    sqrshrnb        z0.b, z0.h, #7
594
+    add             z0.b, z0.b, #0x80
595
+    sqrshrnb        z1.b, z1.h, #7
596
+    add             z1.b, z1.b, #0x80
597
+    st1b            {z0.h}, p0, x2
598
+    st1b            {z1.h}, p1, x2, #1, mul vl
599
+    add             x2, x2, x5
600
+    cbnz            w12, .loop_gt_16_sve2_addavg_24x\h
601
+    ret
602
+.vl_gt_48_addAvg_24x\h\():
603
+    mov             x10, #48
604
+    mov             x11, #0
605
+    whilelt         p0.b, x11, x10
606
+.loop_gt_48_sve2_addavg_24x\h\():
607
+    sub             w12, w12, #1
608
+    ld1b            {z0.b}, p0/z, x0
609
+    ld1b            {z2.b}, p0/z, x1
610
+    add             x0, x0, x3, lsl #1
611
+    add             x1, x1, x4, lsl #1
612
+    add             z0.h, p0/m, z0.h, z2.h
613
+    sqrshrnb        z0.b, z0.h, #7
614
+    add             z0.b, z0.b, #0x80
615
+    st1b            {z0.h}, p0, x2
616
+    add             x2, x2, x5
617
+    cbnz            w12, .loop_gt_48_sve2_addavg_24x\h
618
+    ret
619
+endfunc
620
+.endm
621
+
622
+addAvg_24xN_sve2 32
623
+addAvg_24xN_sve2 64
624
+
625
+.macro addAvg_32xN_sve2 h
626
+function PFX(addAvg_32x\h\()_sve2)
627
+    mov             w12, #\h
628
+    rdvl            x9, #1
629
+    cmp             x9, #16
630
+    bgt             .vl_gt_16_addAvg_32x\h
631
+    ptrue           p0.b, vl16
632
+.loop_eq_16_sve2_addavg_32x\h\():
633
+    sub             w12, w12, #1
634
+    ld1b            {z0.b}, p0/z, x0
635
+    ld1b            {z1.b}, p0/z, x0, #1, mul vl
636
+    ld1b            {z2.b}, p0/z, x0, #2, mul vl
637
+    ld1b            {z3.b}, p0/z, x0, #3, mul vl
638
+    ld1b            {z4.b}, p0/z, x1
639
+    ld1b            {z5.b}, p0/z, x1, #1, mul vl
640
+    ld1b            {z6.b}, p0/z, x1, #2, mul vl
641
+    ld1b            {z7.b}, p0/z, x1, #3, mul vl
642
+    add             x0, x0, x3, lsl #1
643
+    add             x1, x1, x4, lsl #1
644
+    add             z0.h, p0/m, z0.h, z4.h
645
+    add             z1.h, p0/m, z1.h, z5.h
646
+    add             z2.h, p0/m, z2.h, z6.h
647
+    add             z3.h, p0/m, z3.h, z7.h
648
+    sqrshrnb        z0.b, z0.h, #7
649
+    add             z0.b, z0.b, #0x80
650
+    sqrshrnb        z1.b, z1.h, #7
651
+    add             z1.b, z1.b, #0x80
652
+    sqrshrnb        z2.b, z2.h, #7
653
+    add             z2.b, z2.b, #0x80
654
+    sqrshrnb        z3.b, z3.h, #7
655
+    add             z3.b, z3.b, #0x80
656
+    st1b            {z0.h}, p0, x2
657
+    st1b            {z1.h}, p0, x2, #1, mul vl
658
+    st1b            {z2.h}, p0, x2, #2, mul vl
659
+    st1b            {z3.h}, p0, x2, #3, mul vl
660
+    add             x2, x2, x5
661
+    cbnz            w12, .loop_eq_16_sve2_addavg_32x\h
662
+    ret
663
+.vl_gt_16_addAvg_32x\h\():
664
+    cmp             x9, #48
665
+    bgt             .vl_gt_48_addAvg_32x\h
666
+    ptrue           p0.b, vl32
667
+.loop_gt_eq_32_sve2_addavg_32x\h\():
668
+    sub             w12, w12, #1
669
+    ld1b            {z0.b}, p0/z, x0
670
+    ld1b            {z1.b}, p0/z, x0, #1, mul vl
671
+    ld1b            {z2.b}, p0/z, x1
672
+    ld1b            {z3.b}, p0/z, x1, #1, mul vl
673
+    add             x0, x0, x3, lsl #1
674
+    add             x1, x1, x4, lsl #1
675
+    add             z0.h, p0/m, z0.h, z2.h
676
+    add             z1.h, p0/m, z1.h, z3.h
677
+    sqrshrnb        z0.b, z0.h, #7
678
+    add             z1.b, z1.b, #0x80
679
+    sqrshrnb        z1.b, z1.h, #7
680
+    add             z0.b, z0.b, #0x80
681
+    st1b            {z0.h}, p0, x2
682
+    st1b            {z1.h}, p0, x2, #1, mul vl
683
+    add             x2, x2, x5
684
+    cbnz            w12, .loop_gt_eq_32_sve2_addavg_32x\h
685
+    ret
686
+.vl_gt_48_addAvg_32x\h\():
687
+    ptrue           p0.b, vl64
688
+.loop_eq_64_sve2_addavg_32x\h\():
689
+    sub             w12, w12, #1
690
+    ld1b            {z0.b}, p0/z, x0
691
+    ld1b            {z1.b}, p0/z, x1
692
+    add             x0, x0, x3, lsl #1
693
+    add             x1, x1, x4, lsl #1
694
+    add             z0.h, p0/m, z0.h, z1.h
695
+    sqrshrnb        z0.b, z0.h, #7
696
+    add             z0.b, z0.b, #0x80
697
+    st1b            {z0.h}, p0, x2
698
+    add             x2, x2, x5
699
+    cbnz            w12, .loop_eq_64_sve2_addavg_32x\h
700
+    ret
701
+endfunc
702
+.endm
703
+
704
+addAvg_32xN_sve2 8
705
+addAvg_32xN_sve2 16
706
+addAvg_32xN_sve2 24
707
+addAvg_32xN_sve2 32
708
+addAvg_32xN_sve2 48
709
+addAvg_32xN_sve2 64
710
+
711
+function PFX(addAvg_48x64_sve2)
712
+    mov             w12, #64
713
+    rdvl            x9, #1
714
+    cmp             x9, #16
715
+    bgt             .vl_gt_16_addAvg_48x64
716
+    addAvg_start
717
+    sub             x3, x3, #64
718
+    sub             x4, x4, #64
719
+.loop_eq_16_sve2_addavg_48x64:
720
+    sub             w12, w12, #1
721
+    ld1             {v0.8h-v3.8h}, x0, #64
722
+    ld1             {v4.8h-v7.8h}, x1, #64
723
+    ld1             {v20.8h-v21.8h}, x0, x3
724
+    ld1             {v22.8h-v23.8h}, x1, x4
725
+    addavg_1        v0, v4
726
+    addavg_1        v1, v5
727
+    addavg_1        v2, v6
728
+    addavg_1        v3, v7
729
+    addavg_1        v20, v22
730
+    addavg_1        v21, v23
731
+    sqxtun          v0.8b, v0.8h
732
+    sqxtun2         v0.16b, v1.8h
733
+    sqxtun          v1.8b, v2.8h
734
+    sqxtun2         v1.16b, v3.8h
735
+    sqxtun          v2.8b, v20.8h
736
+    sqxtun2         v2.16b, v21.8h
737
+    st1             {v0.16b-v2.16b}, x2, x5
738
+    cbnz            w12, .loop_eq_16_sve2_addavg_48x64
739
+    ret
740
+.vl_gt_16_addAvg_48x64:
741
+    cmp             x9, #48
742
+    bgt             .vl_gt_48_addAvg_48x64
743
+    ptrue           p0.b, vl32
744
+.loop_gt_eq_32_sve2_addavg_48x64:
745
+    sub             w12, w12, #1
746
+    ld1b            {z0.b}, p0/z, x0
747
+    ld1b            {z1.b}, p0/z, x0, #1, mul vl
748
+    ld1b            {z2.b}, p0/z, x0, #2, mul vl
749
+    ld1b            {z4.b}, p0/z, x1
750
+    ld1b            {z5.b}, p0/z, x1, #1, mul vl
751
+    ld1b            {z6.b}, p0/z, x1, #2, mul vl
752
+    add             x0, x0, x3, lsl #1
753
+    add             x1, x1, x4, lsl #1
754
+    add             z0.h, p0/m, z0.h, z4.h
755
+    add             z1.h, p0/m, z1.h, z5.h
756
+    add             z2.h, p0/m, z2.h, z6.h
757
+    sqrshrnb        z0.b, z0.h, #7
758
+    add             z0.b, z0.b, #0x80
759
+    sqrshrnb        z1.b, z1.h, #7
760
+    add             z1.b, z1.b, #0x80
761
+    sqrshrnb        z2.b, z2.h, #7
762
+    add             z2.b, z2.b, #0x80
763
+    st1b            {z0.h}, p0, x2
764
+    st1b            {z1.h}, p0, x2, #1, mul vl
765
+    st1b            {z2.h}, p0, x2, #2, mul vl
766
+    add             x2, x2, x5
767
+    cbnz            w12, .loop_gt_eq_32_sve2_addavg_48x64
768
+    ret
769
+.vl_gt_48_addAvg_48x64:
770
+    cmp             x9, #112
771
+    bgt             .vl_gt_112_addAvg_48x64
772
+    ptrue           p0.b, vl64
773
+    ptrue           p1.b, vl32
774
+.loop_gt_48_sve2_addavg_48x64:
775
+    sub             w12, w12, #1
776
+    ld1b            {z0.b}, p0/z, x0
777
+    ld1b            {z1.b}, p1/z, x0, #1, mul vl
778
+    ld1b            {z4.b}, p0/z, x1
779
+    ld1b            {z5.b}, p1/z, x1, #1, mul vl
780
+    add             x0, x0, x3, lsl #1
781
+    add             x1, x1, x4, lsl #1
782
+    add             z0.h, p0/m, z0.h, z4.h
783
+    add             z1.h, p1/m, z1.h, z5.h
784
+    sqrshrnb        z0.b, z0.h, #7
785
+    add             z0.b, z0.b, #0x80
786
+    sqrshrnb        z1.b, z1.h, #7
787
+    add             z1.b, z1.b, #0x80
788
+    st1b            {z0.h}, p0, x2
789
+    st1b            {z1.h}, p1, x2, #1, mul vl
790
+    add             x2, x2, x5
791
+    cbnz            w12, .loop_gt_48_sve2_addavg_48x64
792
+    ret
793
+.vl_gt_112_addAvg_48x64:
794
+    mov             x10, #96
795
+    mov             x11, #0
796
+    whilelt         p0.b, x11, x10
797
+.loop_gt_112_sve2_addavg_48x64:
798
+    sub             w12, w12, #1
799
+    ld1b            {z0.b}, p0/z, x0
800
+    ld1b            {z4.b}, p0/z, x1
801
+    add             x0, x0, x3, lsl #1
802
+    add             x1, x1, x4, lsl #1
803
+    add             z0.h, p0/m, z0.h, z4.h
804
+    sqrshrnb        z0.b, z0.h, #7
805
+    add             z0.b, z0.b, #0x80
806
+    st1b            {z0.h}, p0, x2
807
+    add             x2, x2, x5
808
+    cbnz            w12, .loop_gt_112_sve2_addavg_48x64
809
+    ret
810
+endfunc
811
+
812
+.macro addAvg_64xN_sve2 h
813
+function PFX(addAvg_64x\h\()_sve2)
814
+    mov             w12, #\h
815
+    rdvl            x9, #1
816
+    cmp             x9, #16
817
+    bgt             .vl_gt_16_addAvg_64x\h
818
+    addAvg_start
819
+    sub             x3, x3, #64
820
+    sub             x4, x4, #64
821
+.loop_eq_16_sve2_addavg_64x\h\():
822
+    sub             w12, w12, #1
823
+    ld1             {v0.8h-v3.8h}, x0, #64
824
+    ld1             {v4.8h-v7.8h}, x1, #64
825
+    ld1             {v20.8h-v23.8h}, x0, x3
826
+    ld1             {v24.8h-v27.8h}, x1, x4
827
+    addavg_1        v0, v4
828
+    addavg_1        v1, v5
829
+    addavg_1        v2, v6
830
+    addavg_1        v3, v7
831
+    addavg_1        v20, v24
832
+    addavg_1        v21, v25
833
+    addavg_1        v22, v26
834
+    addavg_1        v23, v27
835
+    sqxtun          v0.8b, v0.8h
836
+    sqxtun2         v0.16b, v1.8h
837
+    sqxtun          v1.8b, v2.8h
838
+    sqxtun2         v1.16b, v3.8h
839
+    sqxtun          v2.8b, v20.8h
840
+    sqxtun2         v2.16b, v21.8h
841
+    sqxtun          v3.8b, v22.8h
842
+    sqxtun2         v3.16b, v23.8h
843
+    st1             {v0.16b-v3.16b}, x2, x5
844
+    cbnz            w12, .loop_eq_16_sve2_addavg_64x\h
845
+    ret
846
+.vl_gt_16_addAvg_64x\h\():
847
+    cmp             x9, #48
848
+    bgt             .vl_gt_48_addAvg_64x\h
849
+    ptrue           p0.b, vl32
850
+.loop_gt_eq_32_sve2_addavg_64x\h\():
851
+    sub             w12, w12, #1
852
+    ld1b            {z0.b}, p0/z, x0
853
+    ld1b            {z1.b}, p0/z, x0, #1, mul vl
854
+    ld1b            {z2.b}, p0/z, x0, #2, mul vl
855
+    ld1b            {z3.b}, p0/z, x0, #3, mul vl
856
+    ld1b            {z4.b}, p0/z, x1
857
+    ld1b            {z5.b}, p0/z, x1, #1, mul vl
858
+    ld1b            {z6.b}, p0/z, x1, #2, mul vl
859
+    ld1b            {z7.b}, p0/z, x1, #3, mul vl
860
+    add             x0, x0, x3, lsl #1
861
+    add             x1, x1, x4, lsl #1
862
+    add             z0.h, p0/m, z0.h, z4.h
863
+    add             z1.h, p0/m, z1.h, z5.h
864
+    add             z2.h, p0/m, z2.h, z6.h
865
+    add             z3.h, p0/m, z3.h, z7.h
866
+    sqrshrnb        z0.b, z0.h, #7
867
+    add             z0.b, z0.b, #0x80
868
+    sqrshrnb        z1.b, z1.h, #7
869
+    add             z1.b, z1.b, #0x80
870
+    sqrshrnb        z2.b, z2.h, #7
871
+    add             z2.b, z2.b, #0x80
872
+    sqrshrnb        z3.b, z3.h, #7
873
+    add             z3.b, z3.b, #0x80
874
+    st1b            {z0.h}, p0, x2
875
+    st1b            {z1.h}, p0, x2, #1, mul vl
876
+    st1b            {z2.h}, p0, x2, #2, mul vl
877
+    st1b            {z3.h}, p0, x2, #3, mul vl
878
+    add             x2, x2, x5
879
+    cbnz            w12, .loop_gt_eq_32_sve2_addavg_64x\h
880
+    ret
881
+.vl_gt_48_addAvg_64x\h\():
882
+    cmp             x9, #112
883
+    bgt             .vl_gt_112_addAvg_64x\h
884
+    ptrue           p0.b, vl64
885
+.loop_gt_eq_48_sve2_addavg_64x\h\():
886
+    sub             w12, w12, #1
887
+    ld1b            {z0.b}, p0/z, x0
888
+    ld1b            {z1.b}, p0/z, x0, #1, mul vl
889
+    ld1b            {z4.b}, p0/z, x1
890
+    ld1b            {z5.b}, p0/z, x1, #1, mul vl
891
+    add             x0, x0, x3, lsl #1
892
+    add             x1, x1, x4, lsl #1
893
+    add             z0.h, p0/m, z0.h, z4.h
894
+    add             z1.h, p0/m, z1.h, z5.h
895
+    sqrshrnb        z0.b, z0.h, #7
896
+    add             z0.b, z0.b, #0x80
897
+    sqrshrnb        z1.b, z1.h, #7
898
+    add             z1.b, z1.b, #0x80
899
+    st1b            {z0.h}, p0, x2
900
+    st1b            {z1.h}, p0, x2, #1, mul vl
901
+    add             x2, x2, x5
902
+    cbnz            w12, .loop_gt_eq_48_sve2_addavg_64x\h
903
+    ret
904
+.vl_gt_112_addAvg_64x\h\():
905
+    ptrue           p0.b, vl128
906
+.loop_gt_eq_128_sve2_addavg_64x\h\():
907
+    sub             w12, w12, #1
908
+    ld1b            {z0.b}, p0/z, x0
909
+    ld1b            {z4.b}, p0/z, x1
910
+    add             x0, x0, x3, lsl #1
911
+    add             x1, x1, x4, lsl #1
912
+    add             z0.h, p0/m, z0.h, z4.h
913
+    sqrshrnb        z0.b, z0.h, #7
914
+    add             z0.b, z0.b, #0x80
915
+    st1b            {z0.h}, p0, x2
916
+    add             x2, x2, x5
917
+    cbnz            w12, .loop_gt_eq_128_sve2_addavg_64x\h
918
+    ret
919
+endfunc
920
+.endm
921
+
922
+addAvg_64xN_sve2 16
923
+addAvg_64xN_sve2 32
924
+addAvg_64xN_sve2 48
925
+addAvg_64xN_sve2 64
926
x265_3.5.tar.gz/source/common/aarch64/mc-a.S -> x265_3.6.tar.gz/source/common/aarch64/mc-a.S Changed
534
 
1
@@ -1,7 +1,8 @@
2
 /*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
+ * Copyright (C) 2020-2021 MulticoreWare, Inc
5
  *
6
  * Authors: Hongbin Liu <liuhongbin1@huawei.com>
7
+ *          Sebastian Pop <spop@amazon.com>
8
  *
9
  * This program is free software; you can redistribute it and/or modify
10
  * it under the terms of the GNU General Public License as published by
11
@@ -22,15 +23,20 @@
12
  *****************************************************************************/
13
 
14
 #include "asm.S"
15
+#include "mc-a-common.S"
16
 
17
+#ifdef __APPLE__
18
+.section __RODATA,__rodata
19
+#else
20
 .section .rodata
21
+#endif
22
 
23
 .align 4
24
 
25
 .text
26
 
27
 .macro pixel_avg_pp_4xN_neon h
28
-function x265_pixel_avg_pp_4x\h\()_neon
29
+function PFX(pixel_avg_pp_4x\h\()_neon)
30
 .rept \h
31
     ld1             {v0.s}0, x2, x3
32
     ld1             {v1.s}0, x4, x5
33
@@ -46,7 +52,7 @@
34
 pixel_avg_pp_4xN_neon 16
35
 
36
 .macro pixel_avg_pp_8xN_neon h
37
-function x265_pixel_avg_pp_8x\h\()_neon
38
+function PFX(pixel_avg_pp_8x\h\()_neon)
39
 .rept \h
40
     ld1             {v0.8b}, x2, x3
41
     ld1             {v1.8b}, x4, x5
42
@@ -61,3 +67,491 @@
43
 pixel_avg_pp_8xN_neon 8
44
 pixel_avg_pp_8xN_neon 16
45
 pixel_avg_pp_8xN_neon 32
46
+
47
+function PFX(pixel_avg_pp_12x16_neon)
48
+    sub             x1, x1, #4
49
+    sub             x3, x3, #4
50
+    sub             x5, x5, #4
51
+.rept 16
52
+    ld1             {v0.s}0, x2, #4
53
+    ld1             {v1.8b}, x2, x3
54
+    ld1             {v2.s}0, x4, #4
55
+    ld1             {v3.8b}, x4, x5
56
+    urhadd          v4.8b, v0.8b, v2.8b
57
+    urhadd          v5.8b, v1.8b, v3.8b
58
+    st1             {v4.s}0, x0, #4
59
+    st1             {v5.8b}, x0, x1
60
+.endr
61
+    ret
62
+endfunc
63
+
64
+.macro pixel_avg_pp_16xN_neon h
65
+function PFX(pixel_avg_pp_16x\h\()_neon)
66
+.rept \h
67
+    ld1             {v0.16b}, x2, x3
68
+    ld1             {v1.16b}, x4, x5
69
+    urhadd          v2.16b, v0.16b, v1.16b
70
+    st1             {v2.16b}, x0, x1
71
+.endr
72
+    ret
73
+endfunc
74
+.endm
75
+
76
+pixel_avg_pp_16xN_neon 4
77
+pixel_avg_pp_16xN_neon 8
78
+pixel_avg_pp_16xN_neon 12
79
+pixel_avg_pp_16xN_neon 16
80
+pixel_avg_pp_16xN_neon 32
81
+
82
+function PFX(pixel_avg_pp_16x64_neon)
83
+    mov             w12, #8
84
+.lpavg_16x64:
85
+    sub             w12, w12, #1
86
+.rept 8
87
+    ld1             {v0.16b}, x2, x3
88
+    ld1             {v1.16b}, x4, x5
89
+    urhadd          v2.16b, v0.16b, v1.16b
90
+    st1             {v2.16b}, x0, x1
91
+.endr
92
+    cbnz            w12, .lpavg_16x64
93
+    ret
94
+endfunc
95
+
96
+function PFX(pixel_avg_pp_24x32_neon)
97
+    sub             x1, x1, #16
98
+    sub             x3, x3, #16
99
+    sub             x5, x5, #16
100
+    mov             w12, #4
101
+.lpavg_24x32:
102
+    sub             w12, w12, #1
103
+.rept 8
104
+    ld1             {v0.16b}, x2, #16
105
+    ld1             {v1.8b}, x2, x3
106
+    ld1             {v2.16b}, x4, #16
107
+    ld1             {v3.8b}, x4, x5
108
+    urhadd          v0.16b, v0.16b, v2.16b
109
+    urhadd          v1.8b, v1.8b, v3.8b
110
+    st1             {v0.16b}, x0, #16
111
+    st1             {v1.8b}, x0, x1
112
+.endr
113
+    cbnz            w12, .lpavg_24x32
114
+    ret
115
+endfunc
116
+
117
+.macro pixel_avg_pp_32xN_neon h
118
+function PFX(pixel_avg_pp_32x\h\()_neon)
119
+.rept \h
120
+    ld1             {v0.16b-v1.16b}, x2, x3
121
+    ld1             {v2.16b-v3.16b}, x4, x5
122
+    urhadd          v0.16b, v0.16b, v2.16b
123
+    urhadd          v1.16b, v1.16b, v3.16b
124
+    st1             {v0.16b-v1.16b}, x0, x1
125
+.endr
126
+    ret
127
+endfunc
128
+.endm
129
+
130
+pixel_avg_pp_32xN_neon 8
131
+pixel_avg_pp_32xN_neon 16
132
+pixel_avg_pp_32xN_neon 24
133
+
134
+.macro pixel_avg_pp_32xN1_neon h
135
+function PFX(pixel_avg_pp_32x\h\()_neon)
136
+    mov             w12, #\h / 8
137
+.lpavg_32x\h\():
138
+    sub             w12, w12, #1
139
+.rept 8
140
+    ld1             {v0.16b-v1.16b}, x2, x3
141
+    ld1             {v2.16b-v3.16b}, x4, x5
142
+    urhadd          v0.16b, v0.16b, v2.16b
143
+    urhadd          v1.16b, v1.16b, v3.16b
144
+    st1             {v0.16b-v1.16b}, x0, x1
145
+.endr
146
+    cbnz            w12, .lpavg_32x\h
147
+    ret
148
+endfunc
149
+.endm
150
+
151
+pixel_avg_pp_32xN1_neon 32
152
+pixel_avg_pp_32xN1_neon 64
153
+
154
+function PFX(pixel_avg_pp_48x64_neon)
155
+    mov             w12, #8
156
+.lpavg_48x64:
157
+    sub             w12, w12, #1
158
+.rept 8
159
+    ld1             {v0.16b-v2.16b}, x2, x3
160
+    ld1             {v3.16b-v5.16b}, x4, x5
161
+    urhadd          v0.16b, v0.16b, v3.16b
162
+    urhadd          v1.16b, v1.16b, v4.16b
163
+    urhadd          v2.16b, v2.16b, v5.16b
164
+    st1             {v0.16b-v2.16b}, x0, x1
165
+.endr
166
+    cbnz            w12, .lpavg_48x64
167
+    ret
168
+endfunc
169
+
170
+.macro pixel_avg_pp_64xN_neon h
171
+function PFX(pixel_avg_pp_64x\h\()_neon)
172
+    mov             w12, #\h / 4
173
+.lpavg_64x\h\():
174
+    sub             w12, w12, #1
175
+.rept 4
176
+    ld1             {v0.16b-v3.16b}, x2, x3
177
+    ld1             {v4.16b-v7.16b}, x4, x5
178
+    urhadd          v0.16b, v0.16b, v4.16b
179
+    urhadd          v1.16b, v1.16b, v5.16b
180
+    urhadd          v2.16b, v2.16b, v6.16b
181
+    urhadd          v3.16b, v3.16b, v7.16b
182
+    st1             {v0.16b-v3.16b}, x0, x1
183
+.endr
184
+    cbnz            w12, .lpavg_64x\h
185
+    ret
186
+endfunc
187
+.endm
188
+
189
+pixel_avg_pp_64xN_neon 16
190
+pixel_avg_pp_64xN_neon 32
191
+pixel_avg_pp_64xN_neon 48
192
+pixel_avg_pp_64xN_neon 64
193
+
194
+// void addAvg(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride)
195
+.macro addAvg_2xN h
196
+function PFX(addAvg_2x\h\()_neon)
197
+    addAvg_start
198
+.rept \h / 2
199
+    ldr             w10, x0
200
+    ldr             w11, x1
201
+    add             x0, x0, x3
202
+    add             x1, x1, x4
203
+    ldr             w12, x0
204
+    ldr             w13, x1
205
+    add             x0, x0, x3
206
+    add             x1, x1, x4
207
+    dup             v0.2s, w10
208
+    dup             v1.2s, w11
209
+    dup             v2.2s, w12
210
+    dup             v3.2s, w13
211
+    add             v0.4h, v0.4h, v1.4h
212
+    add             v2.4h, v2.4h, v3.4h
213
+    saddl           v0.4s, v0.4h, v30.4h
214
+    saddl           v2.4s, v2.4h, v30.4h
215
+    shrn            v0.4h, v0.4s, #7
216
+    shrn2           v0.8h, v2.4s, #7
217
+    sqxtun          v0.8b, v0.8h
218
+    st1             {v0.h}0, x2, x5
219
+    st1             {v0.h}2, x2, x5
220
+.endr
221
+    ret
222
+endfunc
223
+.endm
224
+
225
+addAvg_2xN 4
226
+addAvg_2xN 8
227
+addAvg_2xN 16
228
+
229
+.macro addAvg_4xN h
230
+function PFX(addAvg_4x\h\()_neon)
231
+    addAvg_start
232
+.rept \h / 2
233
+    ld1             {v0.8b}, x0, x3
234
+    ld1             {v1.8b}, x1, x4
235
+    ld1             {v2.8b}, x0, x3
236
+    ld1             {v3.8b}, x1, x4
237
+    add             v0.4h, v0.4h, v1.4h
238
+    add             v2.4h, v2.4h, v3.4h
239
+    saddl           v0.4s, v0.4h, v30.4h
240
+    saddl           v2.4s, v2.4h, v30.4h
241
+    shrn            v0.4h, v0.4s, #7
242
+    shrn2           v0.8h, v2.4s, #7
243
+    sqxtun          v0.8b, v0.8h
244
+    st1             {v0.s}0, x2, x5
245
+    st1             {v0.s}1, x2, x5
246
+.endr
247
+    ret
248
+endfunc
249
+.endm
250
+
251
+addAvg_4xN 2
252
+addAvg_4xN 4
253
+addAvg_4xN 8
254
+addAvg_4xN 16
255
+addAvg_4xN 32
256
+
257
+.macro addAvg_6xN h
258
+function PFX(addAvg_6x\h\()_neon)
259
+    addAvg_start
260
+    mov             w12, #\h / 2
261
+    sub             x5, x5, #4
262
+.loop_addavg_6x\h:
263
+    sub             w12, w12, #1
264
+    ld1             {v0.16b}, x0, x3
265
+    ld1             {v1.16b}, x1, x4
266
+    ld1             {v2.16b}, x0, x3
267
+    ld1             {v3.16b}, x1, x4
268
+    add             v0.8h, v0.8h, v1.8h
269
+    add             v2.8h, v2.8h, v3.8h
270
+    saddl           v16.4s, v0.4h, v30.4h
271
+    saddl2          v17.4s, v0.8h, v30.8h
272
+    saddl           v18.4s, v2.4h, v30.4h
273
+    saddl2          v19.4s, v2.8h, v30.8h
274
+    shrn            v0.4h, v16.4s, #7
275
+    shrn2           v0.8h, v17.4s, #7
276
+    shrn            v1.4h, v18.4s, #7
277
+    shrn2           v1.8h, v19.4s, #7
278
+    sqxtun          v0.8b, v0.8h
279
+    sqxtun          v1.8b, v1.8h
280
+    str             s0, x2, #4
281
+    st1             {v0.h}2, x2, x5
282
+    str             s1, x2, #4
283
+    st1             {v1.h}2, x2, x5
284
+    cbnz            w12, .loop_addavg_6x\h
285
+    ret
286
+endfunc
287
+.endm
288
+
289
+addAvg_6xN 8
290
+addAvg_6xN 16
291
+
292
+.macro addAvg_8xN h
293
+function PFX(addAvg_8x\h\()_neon)
294
+    addAvg_start
295
+.rept \h / 2
296
+    ld1             {v0.16b}, x0, x3
297
+    ld1             {v1.16b}, x1, x4
298
+    ld1             {v2.16b}, x0, x3
299
+    ld1             {v3.16b}, x1, x4
300
+    add             v0.8h, v0.8h, v1.8h
301
+    add             v2.8h, v2.8h, v3.8h
302
+    saddl           v16.4s, v0.4h, v30.4h
303
+    saddl2          v17.4s, v0.8h, v30.8h
304
+    saddl           v18.4s, v2.4h, v30.4h
305
+    saddl2          v19.4s, v2.8h, v30.8h
306
+    shrn            v0.4h, v16.4s, #7
307
+    shrn2           v0.8h, v17.4s, #7
308
+    shrn            v1.4h, v18.4s, #7
309
+    shrn2           v1.8h, v19.4s, #7
310
+    sqxtun          v0.8b, v0.8h
311
+    sqxtun          v1.8b, v1.8h
312
+    st1             {v0.8b}, x2, x5
313
+    st1             {v1.8b}, x2, x5
314
+.endr
315
+    ret
316
+endfunc
317
+.endm
318
+
319
+.macro addAvg_8xN1 h
320
+function PFX(addAvg_8x\h\()_neon)
321
+    addAvg_start
322
+    mov             w12, #\h / 2
323
+.loop_addavg_8x\h:
324
+    sub             w12, w12, #1
325
+    ld1             {v0.16b}, x0, x3
326
+    ld1             {v1.16b}, x1, x4
327
+    ld1             {v2.16b}, x0, x3
328
+    ld1             {v3.16b}, x1, x4
329
+    add             v0.8h, v0.8h, v1.8h
330
+    add             v2.8h, v2.8h, v3.8h
331
+    saddl           v16.4s, v0.4h, v30.4h
332
+    saddl2          v17.4s, v0.8h, v30.8h
333
+    saddl           v18.4s, v2.4h, v30.4h
334
+    saddl2          v19.4s, v2.8h, v30.8h
335
+    shrn            v0.4h, v16.4s, #7
336
+    shrn2           v0.8h, v17.4s, #7
337
+    shrn            v1.4h, v18.4s, #7
338
+    shrn2           v1.8h, v19.4s, #7
339
+    sqxtun          v0.8b, v0.8h
340
+    sqxtun          v1.8b, v1.8h
341
+    st1             {v0.8b}, x2, x5
342
+    st1             {v1.8b}, x2, x5
343
+    cbnz            w12, .loop_addavg_8x\h
344
+    ret
345
+endfunc
346
+.endm
347
+
348
+addAvg_8xN 2
349
+addAvg_8xN 4
350
+addAvg_8xN 6
351
+addAvg_8xN 8
352
+addAvg_8xN 12
353
+addAvg_8xN 16
354
+addAvg_8xN1 32
355
+addAvg_8xN1 64
356
+
357
+.macro addAvg_12xN h
358
+function PFX(addAvg_12x\h\()_neon)
359
+    addAvg_start
360
+    sub             x3, x3, #16
361
+    sub             x4, x4, #16
362
+    sub             x5, x5, #8
363
+    mov             w12, #\h
364
+.loop_addAvg_12X\h\():
365
+    sub             w12, w12, #1
366
+    ld1             {v0.16b}, x0, #16
367
+    ld1             {v1.16b}, x1, #16
368
+    ld1             {v2.8b}, x0, x3
369
+    ld1             {v3.8b}, x1, x4
370
+    add             v0.8h, v0.8h, v1.8h
371
+    add             v2.4h, v2.4h, v3.4h
372
+    saddl           v16.4s, v0.4h, v30.4h
373
+    saddl2          v17.4s, v0.8h, v30.8h
374
+    saddl           v18.4s, v2.4h, v30.4h
375
+    shrn            v0.4h, v16.4s, #7
376
+    shrn2           v0.8h, v17.4s, #7
377
+    shrn            v1.4h, v18.4s, #7
378
+    sqxtun          v0.8b, v0.8h
379
+    sqxtun          v1.8b, v1.8h
380
+    st1             {v0.8b}, x2, #8
381
+    st1             {v1.s}0, x2, x5
382
+    cbnz            w12, .loop_addAvg_12X\h
383
+    ret
384
+endfunc
385
+.endm
386
+
387
+addAvg_12xN 16
388
+addAvg_12xN 32
389
+
390
+.macro addAvg_16xN h
391
+function PFX(addAvg_16x\h\()_neon)
392
+    addAvg_start
393
+    mov             w12, #\h
394
+.loop_addavg_16x\h:
395
+    sub             w12, w12, #1
396
+    ld1             {v0.8h-v1.8h}, x0, x3
397
+    ld1             {v2.8h-v3.8h}, x1, x4
398
+    addavg_1        v0, v2
399
+    addavg_1        v1, v3
400
+    sqxtun          v0.8b, v0.8h
401
+    sqxtun2         v0.16b, v1.8h
402
+    st1             {v0.16b}, x2, x5
403
+    cbnz            w12, .loop_addavg_16x\h
404
+    ret
405
+endfunc
406
+.endm
407
+
408
+addAvg_16xN 4
409
+addAvg_16xN 8
410
+addAvg_16xN 12
411
+addAvg_16xN 16
412
+addAvg_16xN 24
413
+addAvg_16xN 32
414
+addAvg_16xN 64
415
+
416
+.macro addAvg_24xN h
417
+function PFX(addAvg_24x\h\()_neon)
418
+    addAvg_start
419
+    mov             w12, #\h
420
+.loop_addavg_24x\h\():
421
+    sub             w12, w12, #1
422
+    ld1             {v0.16b-v2.16b}, x0, x3
423
+    ld1             {v3.16b-v5.16b}, x1, x4
424
+    addavg_1        v0, v3
425
+    addavg_1        v1, v4
426
+    addavg_1        v2, v5
427
+    sqxtun          v0.8b, v0.8h
428
+    sqxtun          v1.8b, v1.8h
429
+    sqxtun          v2.8b, v2.8h
430
+    st1             {v0.8b-v2.8b}, x2, x5
431
+    cbnz            w12, .loop_addavg_24x\h
432
+    ret
433
+endfunc
434
+.endm
435
+
436
+addAvg_24xN 32
437
+addAvg_24xN 64
438
+
439
+.macro addAvg_32xN h
440
+function PFX(addAvg_32x\h\()_neon)
441
+    addAvg_start
442
+    mov             w12, #\h
443
+.loop_addavg_32x\h\():
444
+    sub             w12, w12, #1
445
+    ld1             {v0.8h-v3.8h}, x0, x3
446
+    ld1             {v4.8h-v7.8h}, x1, x4
447
+    addavg_1        v0, v4
448
+    addavg_1        v1, v5
449
+    addavg_1        v2, v6
450
+    addavg_1        v3, v7
451
+    sqxtun          v0.8b, v0.8h
452
+    sqxtun          v1.8b, v1.8h
453
+    sqxtun          v2.8b, v2.8h
454
+    sqxtun          v3.8b, v3.8h
455
+    st1             {v0.8b-v3.8b}, x2, x5
456
+    cbnz            w12, .loop_addavg_32x\h
457
+    ret
458
+endfunc
459
+.endm
460
+
461
+addAvg_32xN 8
462
+addAvg_32xN 16
463
+addAvg_32xN 24
464
+addAvg_32xN 32
465
+addAvg_32xN 48
466
+addAvg_32xN 64
467
+
468
+function PFX(addAvg_48x64_neon)
469
+    addAvg_start
470
+    sub             x3, x3, #64
471
+    sub             x4, x4, #64
472
+    mov             w12, #64
473
+.loop_addavg_48x64:
474
+    sub             w12, w12, #1
475
+    ld1             {v0.8h-v3.8h}, x0, #64
476
+    ld1             {v4.8h-v7.8h}, x1, #64
477
+    ld1             {v20.8h-v21.8h}, x0, x3
478
+    ld1             {v22.8h-v23.8h}, x1, x4
479
+    addavg_1        v0, v4
480
+    addavg_1        v1, v5
481
+    addavg_1        v2, v6
482
+    addavg_1        v3, v7
483
+    addavg_1        v20, v22
484
+    addavg_1        v21, v23
485
+    sqxtun          v0.8b, v0.8h
486
+    sqxtun2         v0.16b, v1.8h
487
+    sqxtun          v1.8b, v2.8h
488
+    sqxtun2         v1.16b, v3.8h
489
+    sqxtun          v2.8b, v20.8h
490
+    sqxtun2         v2.16b, v21.8h
491
+    st1             {v0.16b-v2.16b}, x2, x5
492
+    cbnz            w12, .loop_addavg_48x64
493
+    ret
494
+endfunc
495
+
496
+.macro addAvg_64xN h
497
+function PFX(addAvg_64x\h\()_neon)
498
+    addAvg_start
499
+    mov             w12, #\h
500
+    sub             x3, x3, #64
501
+    sub             x4, x4, #64
502
+.loop_addavg_64x\h\():
503
+    sub             w12, w12, #1
504
+    ld1             {v0.8h-v3.8h}, x0, #64
505
+    ld1             {v4.8h-v7.8h}, x1, #64
506
+    ld1             {v20.8h-v23.8h}, x0, x3
507
+    ld1             {v24.8h-v27.8h}, x1, x4
508
+    addavg_1        v0, v4
509
+    addavg_1        v1, v5
510
+    addavg_1        v2, v6
511
+    addavg_1        v3, v7
512
+    addavg_1        v20, v24
513
+    addavg_1        v21, v25
514
+    addavg_1        v22, v26
515
+    addavg_1        v23, v27
516
+    sqxtun          v0.8b, v0.8h
517
+    sqxtun2         v0.16b, v1.8h
518
+    sqxtun          v1.8b, v2.8h
519
+    sqxtun2         v1.16b, v3.8h
520
+    sqxtun          v2.8b, v20.8h
521
+    sqxtun2         v2.16b, v21.8h
522
+    sqxtun          v3.8b, v22.8h
523
+    sqxtun2         v3.16b, v23.8h
524
+    st1             {v0.16b-v3.16b}, x2, x5
525
+    cbnz            w12, .loop_addavg_64x\h
526
+    ret
527
+endfunc
528
+.endm
529
+
530
+addAvg_64xN 16
531
+addAvg_64xN 32
532
+addAvg_64xN 48
533
+addAvg_64xN 64
534
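For reference, the addAvg_WxH kernels above average two int16_t predictions from the interpolation domain back into pixels: in the 8-bit build the inputs are pre-scaled by 64 and offset by -8192, so the kernels add both offsets back plus a rounding term, shift right by 7 (the shrn #7 above) and saturate to 8 bits (sqxtun). A minimal scalar sketch, assuming x265's usual IF_INTERNAL scaling, with the constants restated here rather than taken from the patch:

    #include <stdint.h>

    static inline uint8_t clip_to_pixel(int v)
    {
        return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    // Scalar model of addAvg for the 8-bit build (illustrative sketch, not the shipped code).
    static void addAvg_ref(const int16_t* src0, const int16_t* src1, uint8_t* dst,
                           intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                           int width, int height)
    {
        const int shift  = 7;                              // IF_INTERNAL_PREC + 1 - bitDepth = 14 + 1 - 8
        const int offset = (1 << (shift - 1)) + 2 * 8192;  // rounding plus twice IF_INTERNAL_OFFS
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = clip_to_pixel((src0[x] + src1[x] + offset) >> shift);
            src0 += src0Stride;
            src1 += src1Stride;
            dst  += dstStride;
        }
    }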
x265_3.6.tar.gz/source/common/aarch64/p2s-common.S Added
104
 
1
@@ -0,0 +1,102 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using the NEON instruction set
26
+// that are also used by the SVE2 functions
27
+
28
+.arch           armv8-a
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+#if HIGH_BIT_DEPTH
39
+# if BIT_DEPTH == 10
40
+#  define P2S_SHIFT 4
41
+# elif BIT_DEPTH == 12
42
+#  define P2S_SHIFT 2
43
+# endif
44
+.macro p2s_start
45
+    add             x3, x3, x3
46
+    add             x1, x1, x1
47
+    movi            v31.8h, #0xe0, lsl #8
48
+.endm
49
+
50
+#else // if !HIGH_BIT_DEPTH
51
+# define P2S_SHIFT 6
52
+.macro p2s_start
53
+    add             x3, x3, x3
54
+    movi            v31.8h, #0xe0, lsl #8
55
+.endm
56
+#endif // HIGH_BIT_DEPTH
57
+
58
+.macro p2s_2x2
59
+#if HIGH_BIT_DEPTH
60
+    ld1             {v0.s}[0], [x0], x1
61
+    ld1             {v0.s}[1], [x0], x1
62
+    shl             v3.8h, v0.8h, #P2S_SHIFT
63
+#else
64
+    ldrh            w10, [x0]
65
+    add             x0, x0, x1
66
+    ldrh            w11, [x0]
67
+    orr             w10, w10, w11, lsl #16
68
+    add             x0, x0, x1
69
+    dup             v0.4s, w10
70
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
71
+#endif
72
+    add             v3.8h, v3.8h, v31.8h
73
+    st1             {v3.s}[0], [x2], x3
74
+    st1             {v3.s}[1], [x2], x3
75
+.endm
76
+
77
+.macro p2s_6x2
78
+#if HIGH_BIT_DEPTH
79
+    ld1             {v0.d}[0], [x0], #8
80
+    ld1             {v1.s}[0], [x0], x1
81
+    ld1             {v0.d}[1], [x0], #8
82
+    ld1             {v1.s}[1], [x0], x1
83
+    shl             v3.8h, v0.8h, #P2S_SHIFT
84
+    shl             v4.8h, v1.8h, #P2S_SHIFT
85
+#else
86
+    ldr             s0, [x0]
87
+    ldrh            w10, [x0, #4]
88
+    add             x0, x0, x1
89
+    ld1             {v0.s}[1], [x0]
90
+    ldrh            w11, [x0, #4]
91
+    add             x0, x0, x1
92
+    orr             w10, w10, w11, lsl #16
93
+    dup             v1.4s, w10
94
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
95
+    ushll           v4.8h, v1.8b, #P2S_SHIFT
96
+#endif
97
+    add             v3.8h, v3.8h, v31.8h
98
+    add             v4.8h, v4.8h, v31.8h
99
+    st1             {v3.d}[0], [x2], #8
100
+    st1             {v4.s}[0], [x2], x3
101
+    st1             {v3.d}[1], [x2], #8
102
+    st1             {v4.s}[1], [x2], x3
103
+.endm
104
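The p2s_2x2/p2s_6x2 macros above (and the filterPixelToShort functions built from them) convert pixels into x265's 14-bit interpolation domain: each sample is shifted left by P2S_SHIFT and re-centred by -8192, the constant kept in v31 (movi v31.8h, #0xe0, lsl #8 is 0xE000, i.e. -8192 as int16). A scalar sketch for the 8-bit build, with the constants restated here for illustration:

    #include <stdint.h>

    // Scalar model of filterPixelToShort (p2s) for the 8-bit build (sketch only).
    static void filterPixelToShort_ref(const uint8_t* src, intptr_t srcStride,
                                       int16_t* dst, intptr_t dstStride,
                                       int width, int height)
    {
        const int shift  = 6;      // IF_INTERNAL_PREC - bitDepth = 14 - 8
        const int offset = 8192;   // IF_INTERNAL_OFFS = 1 << (IF_INTERNAL_PREC - 1)
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = (int16_t)((src[x] << shift) - offset);
            src += srcStride;
            dst += dstStride;
        }
    }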
x265_3.6.tar.gz/source/common/aarch64/p2s-sve.S Added
447
 
1
@@ -0,0 +1,445 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "p2s-common.S"
27
+
28
+.arch armv8-a+sve
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+#if HIGH_BIT_DEPTH
41
+# if BIT_DEPTH == 10
42
+#  define P2S_SHIFT 4
43
+# elif BIT_DEPTH == 12
44
+#  define P2S_SHIFT 2
45
+# endif
46
+
47
+.macro p2s_start_sve
48
+    add             x3, x3, x3
49
+    add             x1, x1, x1
50
+    mov             z31.h, #0xe0, lsl #8
51
+.endm
52
+
53
+#else // if !HIGH_BIT_DEPTH
54
+# define P2S_SHIFT 6
55
+.macro p2s_start_sve
56
+    add             x3, x3, x3
57
+    mov             z31.h, #0xe0, lsl #8
58
+.endm
59
+
60
+#endif // HIGH_BIT_DEPTH
61
+
62
+// filterPixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
63
+.macro p2s_2xN_sve h
64
+function PFX(filterPixelToShort_2x\h\()_sve)
65
+    p2s_start_sve
66
+.rept \h / 2
67
+    p2s_2x2
68
+.endr
69
+    ret
70
+endfunc
71
+.endm
72
+
73
+p2s_2xN_sve 4
74
+p2s_2xN_sve 8
75
+p2s_2xN_sve 16
76
+
77
+.macro p2s_6xN_sve h
78
+function PFX(filterPixelToShort_6x\h\()_sve)
79
+    p2s_start_sve
80
+    sub             x3, x3, #8
81
+#if HIGH_BIT_DEPTH
82
+    sub             x1, x1, #8
83
+#endif
84
+.rept \h / 2
85
+    p2s_6x2
86
+.endr
87
+    ret
88
+endfunc
89
+.endm
90
+
91
+p2s_6xN_sve 8
92
+p2s_6xN_sve 16
93
+
94
+function PFX(filterPixelToShort_4x2_sve)
95
+    p2s_start_sve
96
+#if HIGH_BIT_DEPTH
97
+    ptrue           p0.h, vl8
98
+    index           z1.d, #0, x1
99
+    index           z2.d, #0, x3
100
+    ld1d            {z3.d}, p0/z, x0, z1.d
101
+    lsl             z3.h, p0/m, z3.h, #P2S_SHIFT
102
+    add             z3.h, p0/m, z3.h, z31.h
103
+    st1d            {z3.d}, p0, x2, z2.d
104
+#else
105
+    ptrue           p0.h, vl4
106
+    ld1b            {z0.h}, p0/z, x0
107
+    add             x0, x0, x1
108
+    ld1b            {z1.h}, p0/z, x0
109
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
110
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
111
+    add             z0.h, p0/m, z0.h, z31.h
112
+    add             z1.h, p0/m, z1.h, z31.h
113
+    st1h            {z0.h}, p0, x2
114
+    add             x2, x2, x3
115
+    st1h            {z1.h}, p0, x2
116
+#endif
117
+    ret
118
+endfunc
119
+
120
+
121
+.macro p2s_8xN_sve h
122
+function PFX(filterPixelToShort_8x\h\()_sve)
123
+    p2s_start_sve
124
+    ptrue           p0.h, vl8
125
+.rept \h
126
+#if HIGH_BIT_DEPTH
127
+    ld1d            {z0.d}, p0/z, x0
128
+    add             x0, x0, x1
129
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
130
+    add             z0.h, p0/m, z0.h, z31.h
131
+    st1h            {z0.h}, p0, x2
132
+    add             x2, x2, x3
133
+#else
134
+    ld1b            {z0.h}, p0/z, x0
135
+    add             x0, x0, x1
136
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
137
+    add             z0.h, p0/m, z0.h, z31.h
138
+    st1h            {z0.h}, p0, x2
139
+    add             x2, x2, x3
140
+#endif
141
+.endr
142
+    ret
143
+endfunc
144
+.endm
145
+
146
+p2s_8xN_sve 2
147
+
148
+.macro p2s_32xN_sve h
149
+function PFX(filterPixelToShort_32x\h\()_sve)
150
+#if HIGH_BIT_DEPTH
151
+    p2s_start_sve
152
+    rdvl            x9, #1
153
+    cmp             x9, #16
154
+    bgt             .vl_gt_16_filterPixelToShort_high_32x\h
155
+    ptrue           p0.h, vl8
156
+.rept \h
157
+    ld1h            {z0.h}, p0/z, x0
158
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
159
+    ld1h            {z2.h}, p0/z, x0, #2, mul vl
160
+    ld1h            {z3.h}, p0/z, x0, #3, mul vl
161
+    add             x0, x0, x1
162
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
163
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
164
+    lsl             z2.h, p0/m, z2.h, #P2S_SHIFT
165
+    lsl             z3.h, p0/m, z3.h, #P2S_SHIFT
166
+    add             z0.h, p0/m, z0.h, z31.h
167
+    add             z1.h, p0/m, z1.h, z31.h
168
+    add             z2.h, p0/m, z2.h, z31.h
169
+    add             z3.h, p0/m, z3.h, z31.h
170
+    st1h            {z0.h}, p0, x2
171
+    st1h            {z1.h}, p0, x2, #1, mul vl
172
+    st1h            {z2.h}, p0, x2, #2, mul vl
173
+    st1h            {z3.h}, p0, x2, #3, mul vl
174
+    add             x2, x2, x3
175
+.endr
176
+    ret
177
+.vl_gt_16_filterPixelToShort_high_32x\h\():
178
+    cmp             x9, #48
179
+    bgt             .vl_gt_48_filterPixelToShort_high_32x\h
180
+    ptrue           p0.h, vl16
181
+.rept \h
182
+    ld1h            {z0.h}, p0/z, x0
183
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
184
+    add             x0, x0, x1
185
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
186
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
187
+    add             z0.h, p0/m, z0.h, z31.h
188
+    add             z1.h, p0/m, z1.h, z31.h
189
+    st1h            {z0.h}, p0, x2
190
+    st1h            {z1.h}, p0, x2, #1, mul vl
191
+    add             x2, x2, x3
192
+.endr
193
+    ret
194
+.vl_gt_48_filterPixelToShort_high_32x\h\():
195
+    ptrue           p0.h, vl32
196
+.rept \h
197
+    ld1h            {z0.h}, p0/z, x0
198
+    add             x0, x0, x1
199
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
200
+    add             z0.h, p0/m, z0.h, z31.h
201
+    st1h            {z0.h}, p0, x2
202
+    add             x2, x2, x3
203
+.endr
204
+    ret
205
+#else
206
+    p2s_start
207
+    mov             x9, #\h
208
+.loop_filter_sve_P2S_32x\h:
209
+    sub             x9, x9, #1
210
+    ld1             {v0.16b-v1.16b}, x0, x1
211
+    ushll           v22.8h, v0.8b,  #P2S_SHIFT
212
+    ushll2          v23.8h, v0.16b, #P2S_SHIFT
213
+    ushll           v24.8h, v1.8b,  #P2S_SHIFT
214
+    ushll2          v25.8h, v1.16b, #P2S_SHIFT
215
+    add             v22.8h, v22.8h, v31.8h
216
+    add             v23.8h, v23.8h, v31.8h
217
+    add             v24.8h, v24.8h, v31.8h
218
+    add             v25.8h, v25.8h, v31.8h
219
+    st1             {v22.16b-v25.16b}, x2, x3
220
+    cbnz            x9, .loop_filter_sve_P2S_32x\h
221
+    ret
222
+#endif
223
+endfunc
224
+.endm
225
+
226
+p2s_32xN_sve 8
227
+p2s_32xN_sve 16
228
+p2s_32xN_sve 24
229
+p2s_32xN_sve 32
230
+p2s_32xN_sve 48
231
+p2s_32xN_sve 64
232
+
233
+.macro p2s_64xN_sve h
234
+function PFX(filterPixelToShort_64x\h\()_sve)
235
+#if HIGH_BIT_DEPTH
236
+    p2s_start_sve
237
+    rdvl            x9, #1
238
+    cmp             x9, #16
239
+    bgt             .vl_gt_16_filterPixelToShort_high_64x\h
240
+    ptrue           p0.h, vl8
241
+.rept \h
242
+    ld1h            {z0.h}, p0/z, x0
243
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
244
+    ld1h            {z2.h}, p0/z, x0, #2, mul vl
245
+    ld1h            {z3.h}, p0/z, x0, #3, mul vl
246
+    ld1h            {z4.h}, p0/z, x0, #4, mul vl
247
+    ld1h            {z5.h}, p0/z, x0, #5, mul vl
248
+    ld1h            {z6.h}, p0/z, x0, #6, mul vl
249
+    ld1h            {z7.h}, p0/z, x0, #7, mul vl
250
+    add             x0, x0, x1
251
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
252
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
253
+    lsl             z2.h, p0/m, z2.h, #P2S_SHIFT
254
+    lsl             z3.h, p0/m, z3.h, #P2S_SHIFT
255
+    lsl             z4.h, p0/m, z4.h, #P2S_SHIFT
256
+    lsl             z5.h, p0/m, z5.h, #P2S_SHIFT
257
+    lsl             z6.h, p0/m, z6.h, #P2S_SHIFT
258
+    lsl             z7.h, p0/m, z7.h, #P2S_SHIFT
259
+    add             z0.h, p0/m, z0.h, z31.h
260
+    add             z1.h, p0/m, z1.h, z31.h
261
+    add             z2.h, p0/m, z2.h, z31.h
262
+    add             z3.h, p0/m, z3.h, z31.h
263
+    add             z4.h, p0/m, z4.h, z31.h
264
+    add             z5.h, p0/m, z5.h, z31.h
265
+    add             z6.h, p0/m, z6.h, z31.h
266
+    add             z7.h, p0/m, z7.h, z31.h
267
+    st1h            {z0.h}, p0, x2
268
+    st1h            {z1.h}, p0, x2, #1, mul vl
269
+    st1h            {z2.h}, p0, x2, #2, mul vl
270
+    st1h            {z3.h}, p0, x2, #3, mul vl
271
+    st1h            {z4.h}, p0, x2, #4, mul vl
272
+    st1h            {z5.h}, p0, x2, #5, mul vl
273
+    st1h            {z6.h}, p0, x2, #6, mul vl
274
+    st1h            {z7.h}, p0, x2, #7, mul vl
275
+    add             x2, x2, x3
276
+.endr
277
+    ret
278
+.vl_gt_16_filterPixelToShort_high_64x\h\():
279
+    cmp             x9, #48
280
+    bgt             .vl_gt_48_filterPixelToShort_high_64x\h
281
+    ptrue           p0.h, vl16
282
+.rept \h
283
+    ld1h            {z0.h}, p0/z, x0
284
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
285
+    ld1h            {z2.h}, p0/z, x0, #2, mul vl
286
+    ld1h            {z3.h}, p0/z, x0, #3, mul vl
287
+    add             x0, x0, x1
288
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
289
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
290
+    lsl             z2.h, p0/m, z2.h, #P2S_SHIFT
291
+    lsl             z3.h, p0/m, z3.h, #P2S_SHIFT
292
+    add             z0.h, p0/m, z0.h, z31.h
293
+    add             z1.h, p0/m, z1.h, z31.h
294
+    add             z2.h, p0/m, z2.h, z31.h
295
+    add             z3.h, p0/m, z3.h, z31.h
296
+    st1h            {z0.h}, p0, x2
297
+    st1h            {z1.h}, p0, x2, #1, mul vl
298
+    st1h            {z2.h}, p0, x2, #2, mul vl
299
+    st1h            {z3.h}, p0, x2, #3, mul vl
300
+    add             x2, x2, x3
301
+.endr
302
+    ret
303
+.vl_gt_48_filterPixelToShort_high_64x\h\():
304
+    cmp             x9, #112
305
+    bgt             .vl_gt_112_filterPixelToShort_high_64x\h
306
+    ptrue           p0.h, vl32
307
+.rept \h
308
+    ld1h            {z0.h}, p0/z, x0
309
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
310
+    add             x0, x0, x1
311
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
312
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
313
+    add             z0.h, p0/m, z0.h, z31.h
314
+    add             z1.h, p0/m, z1.h, z31.h
315
+    st1h            {z0.h}, p0, x2
316
+    st1h            {z1.h}, p0, x2, #1, mul vl
317
+    add             x2, x2, x3
318
+.endr
319
+    ret
320
+.vl_gt_112_filterPixelToShort_high_64x\h\():
321
+    ptrue           p0.h, vl64
322
+.rept \h
323
+    ld1h            {z0.h}, p0/z, x0
324
+    add             x0, x0, x1
325
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
326
+    add             z0.h, p0/m, z0.h, z31.h
327
+    st1h            {z0.h}, p0, x2
328
+    add             x2, x2, x3
329
+.endr
330
+    ret
331
+#else
332
+    p2s_start
333
+    sub             x3, x3, #64
334
+    mov             x9, #\h
335
+.loop_filter_sve_P2S_64x\h:
336
+    sub             x9, x9, #1
337
+    ld1             {v0.16b-v3.16b}, x0, x1
338
+    ushll           v16.8h, v0.8b,  #P2S_SHIFT
339
+    ushll2          v17.8h, v0.16b, #P2S_SHIFT
340
+    ushll           v18.8h, v1.8b,  #P2S_SHIFT
341
+    ushll2          v19.8h, v1.16b, #P2S_SHIFT
342
+    ushll           v20.8h, v2.8b,  #P2S_SHIFT
343
+    ushll2          v21.8h, v2.16b, #P2S_SHIFT
344
+    ushll           v22.8h, v3.8b,  #P2S_SHIFT
345
+    ushll2          v23.8h, v3.16b, #P2S_SHIFT
346
+    add             v16.8h, v16.8h, v31.8h
347
+    add             v17.8h, v17.8h, v31.8h
348
+    add             v18.8h, v18.8h, v31.8h
349
+    add             v19.8h, v19.8h, v31.8h
350
+    add             v20.8h, v20.8h, v31.8h
351
+    add             v21.8h, v21.8h, v31.8h
352
+    add             v22.8h, v22.8h, v31.8h
353
+    add             v23.8h, v23.8h, v31.8h
354
+    st1             {v16.16b-v19.16b}, x2, #64
355
+    st1             {v20.16b-v23.16b}, x2, x3
356
+    cbnz            x9, .loop_filter_sve_P2S_64x\h
357
+    ret
358
+#endif
359
+endfunc
360
+.endm
361
+
362
+p2s_64xN_sve 16
363
+p2s_64xN_sve 32
364
+p2s_64xN_sve 48
365
+p2s_64xN_sve 64
366
+
367
+function PFX(filterPixelToShort_48x64_sve)
368
+#if HIGH_BIT_DEPTH
369
+    p2s_start_sve
370
+    rdvl            x9, #1
371
+    cmp             x9, #16
372
+    bgt             .vl_gt_16_filterPixelToShort_high_48x64
373
+    ptrue           p0.h, vl8
374
+.rept 64
375
+    ld1h            {z0.h}, p0/z, x0
376
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
377
+    ld1h            {z2.h}, p0/z, x0, #2, mul vl
378
+    ld1h            {z3.h}, p0/z, x0, #3, mul vl
379
+    ld1h            {z4.h}, p0/z, x0, #4, mul vl
380
+    ld1h            {z5.h}, p0/z, x0, #5, mul vl
381
+    add             x0, x0, x1
382
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
383
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
384
+    lsl             z2.h, p0/m, z2.h, #P2S_SHIFT
385
+    lsl             z3.h, p0/m, z3.h, #P2S_SHIFT
386
+    lsl             z4.h, p0/m, z4.h, #P2S_SHIFT
387
+    lsl             z5.h, p0/m, z5.h, #P2S_SHIFT
388
+    add             z0.h, p0/m, z0.h, z31.h
389
+    add             z1.h, p0/m, z1.h, z31.h
390
+    add             z2.h, p0/m, z2.h, z31.h
391
+    add             z3.h, p0/m, z3.h, z31.h
392
+    add             z4.h, p0/m, z4.h, z31.h
393
+    add             z5.h, p0/m, z5.h, z31.h
394
+    st1h            {z0.h}, p0, x2
395
+    st1h            {z1.h}, p0, x2, #1, mul vl
396
+    st1h            {z2.h}, p0, x2, #2, mul vl
397
+    st1h            {z3.h}, p0, x2, #3, mul vl
398
+    st1h            {z4.h}, p0, x2, #4, mul vl
399
+    st1h            {z5.h}, p0, x2, #5, mul vl
400
+    add             x2, x2, x3
401
+.endr
402
+    ret
403
+.vl_gt_16_filterPixelToShort_high_48x64:
404
+    ptrue           p0.h, vl16
405
+.rept 64
406
+    ld1h            {z0.h}, p0/z, x0
407
+    ld1h            {z1.h}, p0/z, x0, #1, mul vl
408
+    ld1h            {z2.h}, p0/z, x0, #2, mul vl
409
+    add             x0, x0, x1
410
+    lsl             z0.h, p0/m, z0.h, #P2S_SHIFT
411
+    lsl             z1.h, p0/m, z1.h, #P2S_SHIFT
412
+    lsl             z2.h, p0/m, z2.h, #P2S_SHIFT
413
+    add             z0.h, p0/m, z0.h, z31.h
414
+    add             z1.h, p0/m, z1.h, z31.h
415
+    add             z2.h, p0/m, z2.h, z31.h
416
+    st1h            {z0.h}, p0, x2
417
+    st1h            {z1.h}, p0, x2, #1, mul vl
418
+    st1h            {z2.h}, p0, x2, #2, mul vl
419
+    add             x2, x2, x3
420
+.endr
421
+    ret
422
+#else
423
+    p2s_start
424
+    sub             x3, x3, #64
425
+    mov             x9, #64
426
+.loop_filterP2S_sve_48x64:
427
+    sub            x9, x9, #1
428
+    ld1             {v0.16b-v2.16b}, x0, x1
429
+    ushll           v16.8h, v0.8b,  #P2S_SHIFT
430
+    ushll2          v17.8h, v0.16b, #P2S_SHIFT
431
+    ushll           v18.8h, v1.8b,  #P2S_SHIFT
432
+    ushll2          v19.8h, v1.16b, #P2S_SHIFT
433
+    ushll           v20.8h, v2.8b,  #P2S_SHIFT
434
+    ushll2          v21.8h, v2.16b, #P2S_SHIFT
435
+    add             v16.8h, v16.8h, v31.8h
436
+    add             v17.8h, v17.8h, v31.8h
437
+    add             v18.8h, v18.8h, v31.8h
438
+    add             v19.8h, v19.8h, v31.8h
439
+    add             v20.8h, v20.8h, v31.8h
440
+    add             v21.8h, v21.8h, v31.8h
441
+    st1             {v16.16b-v19.16b}, x2, #64
442
+    st1             {v20.16b-v21.16b}, x2, x3
443
+    cbnz            x9, .loop_filterP2S_sve_48x64
444
+    ret
445
+#endif
446
+endfunc
447
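The SVE variants above read the hardware vector length with rdvl and branch to a loop whose predicate matches it (ptrue p0.h with vl8, vl16, vl32 or vl64), so one binary covers 128- through 512-bit implementations. A hedged sketch of the same dispatch in C, assuming the ACLE intrinsic svcntb() for the vector length in bytes; the thresholds mirror the cmp #16 / #48 / #112 checks in the 64xN functions:

    #include <arm_sve.h>

    // Pick how many halfwords each predicated ld1h covers, matching the
    // rdvl/cmp/bgt ladder used by the filterPixelToShort_*_sve functions.
    static inline int p2s_halfwords_per_vector(void)
    {
        uint64_t vl_bytes = svcntb();      // 16 on a 128-bit part, 32 on 256-bit, ...
        if (vl_bytes <= 16)  return 8;     // ptrue p0.h, vl8  path
        if (vl_bytes <= 48)  return 16;    // ptrue p0.h, vl16 path
        if (vl_bytes <= 112) return 32;    // ptrue p0.h, vl32 path
        return 64;                         // ptrue p0.h, vl64 path
    }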
x265_3.6.tar.gz/source/common/aarch64/p2s.S Added
388
 
1
@@ -0,0 +1,386 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2021 MulticoreWare, Inc
4
+ *
5
+ * Authors: Sebastian Pop <spop@amazon.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm.S"
26
+#include "p2s-common.S"
27
+
28
+#ifdef __APPLE__
29
+.section __RODATA,__rodata
30
+#else
31
+.section .rodata
32
+#endif
33
+
34
+.align 4
35
+
36
+.text
37
+
38
+// filterPixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
39
+.macro p2s_2xN h
40
+function PFX(filterPixelToShort_2x\h\()_neon)
41
+    p2s_start
42
+.rept \h / 2
43
+    p2s_2x2
44
+.endr
45
+    ret
46
+endfunc
47
+.endm
48
+
49
+p2s_2xN 4
50
+p2s_2xN 8
51
+p2s_2xN 16
52
+
53
+.macro p2s_6xN h
54
+function PFX(filterPixelToShort_6x\h\()_neon)
55
+    p2s_start
56
+    sub             x3, x3, #8
57
+#if HIGH_BIT_DEPTH
58
+    sub             x1, x1, #8
59
+#endif
60
+.rept \h / 2
61
+    p2s_6x2
62
+.endr
63
+    ret
64
+endfunc
65
+.endm
66
+
67
+p2s_6xN 8
68
+p2s_6xN 16
69
+
70
+function PFX(filterPixelToShort_4x2_neon)
71
+    p2s_start
72
+#if HIGH_BIT_DEPTH
73
+    ld1             {v0.d}0, x0, x1
74
+    ld1             {v0.d}1, x0, x1
75
+    shl             v3.8h, v0.8h, #P2S_SHIFT
76
+#else
77
+    ld1             {v0.s}0, x0, x1
78
+    ld1             {v0.s}1, x0, x1
79
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
80
+#endif
81
+    add             v3.8h, v3.8h, v31.8h
82
+    st1             {v3.d}0, x2, x3
83
+    st1             {v3.d}1, x2, x3
84
+    ret
85
+endfunc
86
+
87
+function PFX(filterPixelToShort_4x4_neon)
88
+    p2s_start
89
+#if HIGH_BIT_DEPTH
90
+    ld1             {v0.d}0, x0, x1
91
+    ld1             {v0.d}1, x0, x1
92
+    shl             v3.8h, v0.8h, #P2S_SHIFT
93
+#else
94
+    ld1             {v0.s}0, x0, x1
95
+    ld1             {v0.s}1, x0, x1
96
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
97
+#endif
98
+    add             v3.8h, v3.8h, v31.8h
99
+    st1             {v3.d}0, x2, x3
100
+    st1             {v3.d}1, x2, x3
101
+#if HIGH_BIT_DEPTH
102
+    ld1             {v1.d}0, x0, x1
103
+    ld1             {v1.d}1, x0, x1
104
+    shl             v4.8h, v1.8h, #P2S_SHIFT
105
+#else
106
+    ld1             {v1.s}0, x0, x1
107
+    ld1             {v1.s}1, x0, x1
108
+    ushll           v4.8h, v1.8b, #P2S_SHIFT
109
+#endif
110
+    add             v4.8h, v4.8h, v31.8h
111
+    st1             {v4.d}0, x2, x3
112
+    st1             {v4.d}1, x2, x3
113
+    ret
114
+endfunc
115
+
116
+.macro p2s_4xN h
117
+function PFX(filterPixelToShort_4x\h\()_neon)
118
+    p2s_start
119
+.rept \h / 2
120
+#if HIGH_BIT_DEPTH
121
+    ld1             {v0.16b}, x0, x1
122
+    shl             v0.8h, v0.8h, #P2S_SHIFT
123
+#else
124
+    ld1             {v0.8b}, x0, x1
125
+    ushll           v0.8h, v0.8b, #P2S_SHIFT
126
+#endif
127
+    add             v2.4h, v0.4h, v31.4h
128
+    st1             {v2.4h}, x2, x3
129
+#if HIGH_BIT_DEPTH
130
+    ld1             {v1.16b}, x0, x1
131
+    shl             v1.8h, v1.8h, #P2S_SHIFT
132
+#else
133
+    ld1             {v1.8b}, x0, x1
134
+    ushll           v1.8h, v1.8b, #P2S_SHIFT
135
+#endif
136
+    add             v3.4h, v1.4h, v31.4h
137
+    st1             {v3.4h}, x2, x3
138
+.endr
139
+    ret
140
+endfunc
141
+.endm
142
+
143
+p2s_4xN 8
144
+p2s_4xN 16
145
+p2s_4xN 32
146
+
147
+.macro p2s_8xN h
148
+function PFX(filterPixelToShort_8x\h\()_neon)
149
+    p2s_start
150
+.rept \h / 2
151
+#if HIGH_BIT_DEPTH
152
+    ld1             {v0.16b}, x0, x1
153
+    ld1             {v1.16b}, x0, x1
154
+    shl             v0.8h, v0.8h, #P2S_SHIFT
155
+    shl             v1.8h, v1.8h, #P2S_SHIFT
156
+#else
157
+    ld1             {v0.8b}, x0, x1
158
+    ld1             {v1.8b}, x0, x1
159
+    ushll           v0.8h, v0.8b, #P2S_SHIFT
160
+    ushll           v1.8h, v1.8b, #P2S_SHIFT
161
+#endif
162
+    add             v2.8h, v0.8h, v31.8h
163
+    st1             {v2.8h}, x2, x3
164
+    add             v3.8h, v1.8h, v31.8h
165
+    st1             {v3.8h}, x2, x3
166
+.endr
167
+    ret
168
+endfunc
169
+.endm
170
+
171
+p2s_8xN 2
172
+p2s_8xN 4
173
+p2s_8xN 6
174
+p2s_8xN 8
175
+p2s_8xN 12
176
+p2s_8xN 16
177
+p2s_8xN 32
178
+p2s_8xN 64
179
+
180
+.macro p2s_12xN h
181
+function PFX(filterPixelToShort_12x\h\()_neon)
182
+    p2s_start
183
+    sub             x3, x3, #16
184
+.rept \h
185
+#if HIGH_BIT_DEPTH
186
+    ld1             {v0.16b-v1.16b}, x0, x1
187
+    shl             v2.8h, v0.8h, #P2S_SHIFT
188
+    shl             v3.8h, v1.8h, #P2S_SHIFT
189
+#else
190
+    ld1             {v0.16b}, x0, x1
191
+    ushll           v2.8h, v0.8b,  #P2S_SHIFT
192
+    ushll2          v3.8h, v0.16b, #P2S_SHIFT
193
+#endif
194
+    add             v2.8h, v2.8h, v31.8h
195
+    add             v3.8h, v3.8h, v31.8h
196
+    st1             {v2.16b}, x2, #16
197
+    st1             {v3.8b}, x2, x3
198
+.endr
199
+    ret
200
+endfunc
201
+.endm
202
+
203
+p2s_12xN 16
204
+p2s_12xN 32
205
+
206
+.macro p2s_16xN h
207
+function PFX(filterPixelToShort_16x\h\()_neon)
208
+    p2s_start
209
+.rept \h
210
+#if HIGH_BIT_DEPTH
211
+    ld1             {v0.16b-v1.16b}, x0, x1
212
+    shl             v2.8h, v0.8h, #P2S_SHIFT
213
+    shl             v3.8h, v1.8h, #P2S_SHIFT
214
+#else
215
+    ld1             {v0.16b}, x0, x1
216
+    ushll           v2.8h, v0.8b,  #P2S_SHIFT
217
+    ushll2          v3.8h, v0.16b, #P2S_SHIFT
218
+#endif
219
+    add             v2.8h, v2.8h, v31.8h
220
+    add             v3.8h, v3.8h, v31.8h
221
+    st1             {v2.16b-v3.16b}, x2, x3
222
+.endr
223
+    ret
224
+endfunc
225
+.endm
226
+
227
+p2s_16xN 4
228
+p2s_16xN 8
229
+p2s_16xN 12
230
+p2s_16xN 16
231
+p2s_16xN 24
232
+p2s_16xN 32
233
+p2s_16xN 64
234
+
235
+.macro p2s_24xN h
236
+function PFX(filterPixelToShort_24x\h\()_neon)
237
+    p2s_start
238
+.rept \h
239
+#if HIGH_BIT_DEPTH
240
+    ld1             {v0.16b-v2.16b}, x0, x1
241
+    shl             v3.8h, v0.8h, #P2S_SHIFT
242
+    shl             v4.8h, v1.8h, #P2S_SHIFT
243
+    shl             v5.8h, v2.8h, #P2S_SHIFT
244
+#else
245
+    ld1             {v0.8b-v2.8b}, x0, x1
246
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
247
+    ushll           v4.8h, v1.8b, #P2S_SHIFT
248
+    ushll           v5.8h, v2.8b, #P2S_SHIFT
249
+#endif
250
+    add             v3.8h, v3.8h, v31.8h
251
+    add             v4.8h, v4.8h, v31.8h
252
+    add             v5.8h, v5.8h, v31.8h
253
+    st1             {v3.16b-v5.16b}, x2, x3
254
+.endr
255
+    ret
256
+endfunc
257
+.endm
258
+
259
+p2s_24xN 32
260
+p2s_24xN 64
261
+
262
+.macro p2s_32xN h
263
+function PFX(filterPixelToShort_32x\h\()_neon)
264
+    p2s_start
265
+    mov             x9, #\h
266
+.loop_filterP2S_32x\h:
267
+    sub             x9, x9, #1
268
+#if HIGH_BIT_DEPTH
269
+    ld1             {v0.16b-v3.16b}, x0, x1
270
+    shl             v22.8h, v0.8h, #P2S_SHIFT
271
+    shl             v23.8h, v1.8h, #P2S_SHIFT
272
+    shl             v24.8h, v2.8h, #P2S_SHIFT
273
+    shl             v25.8h, v3.8h, #P2S_SHIFT
274
+#else
275
+    ld1             {v0.16b-v1.16b}, x0, x1
276
+    ushll           v22.8h, v0.8b,  #P2S_SHIFT
277
+    ushll2          v23.8h, v0.16b, #P2S_SHIFT
278
+    ushll           v24.8h, v1.8b,  #P2S_SHIFT
279
+    ushll2          v25.8h, v1.16b, #P2S_SHIFT
280
+#endif
281
+    add             v22.8h, v22.8h, v31.8h
282
+    add             v23.8h, v23.8h, v31.8h
283
+    add             v24.8h, v24.8h, v31.8h
284
+    add             v25.8h, v25.8h, v31.8h
285
+    st1             {v22.16b-v25.16b}, x2, x3
286
+    cbnz            x9, .loop_filterP2S_32x\h
287
+    ret
288
+endfunc
289
+.endm
290
+
291
+p2s_32xN 8
292
+p2s_32xN 16
293
+p2s_32xN 24
294
+p2s_32xN 32
295
+p2s_32xN 48
296
+p2s_32xN 64
297
+
298
+.macro p2s_64xN h
299
+function PFX(filterPixelToShort_64x\h\()_neon)
300
+    p2s_start
301
+#if HIGH_BIT_DEPTH
302
+    sub             x1, x1, #64
303
+#endif
304
+    sub             x3, x3, #64
305
+    mov             x9, #\h
306
+.loop_filterP2S_64x\h:
307
+    sub             x9, x9, #1
308
+#if HIGH_BIT_DEPTH
309
+    ld1             {v0.16b-v3.16b}, x0, #64
310
+    ld1             {v4.16b-v7.16b}, x0, x1
311
+    shl             v16.8h, v0.8h, #P2S_SHIFT
312
+    shl             v17.8h, v1.8h, #P2S_SHIFT
313
+    shl             v18.8h, v2.8h, #P2S_SHIFT
314
+    shl             v19.8h, v3.8h, #P2S_SHIFT
315
+    shl             v20.8h, v4.8h, #P2S_SHIFT
316
+    shl             v21.8h, v5.8h, #P2S_SHIFT
317
+    shl             v22.8h, v6.8h, #P2S_SHIFT
318
+    shl             v23.8h, v7.8h, #P2S_SHIFT
319
+#else
320
+    ld1             {v0.16b-v3.16b}, x0, x1
321
+    ushll           v16.8h, v0.8b,  #P2S_SHIFT
322
+    ushll2          v17.8h, v0.16b, #P2S_SHIFT
323
+    ushll           v18.8h, v1.8b,  #P2S_SHIFT
324
+    ushll2          v19.8h, v1.16b, #P2S_SHIFT
325
+    ushll           v20.8h, v2.8b,  #P2S_SHIFT
326
+    ushll2          v21.8h, v2.16b, #P2S_SHIFT
327
+    ushll           v22.8h, v3.8b,  #P2S_SHIFT
328
+    ushll2          v23.8h, v3.16b, #P2S_SHIFT
329
+#endif
330
+    add             v16.8h, v16.8h, v31.8h
331
+    add             v17.8h, v17.8h, v31.8h
332
+    add             v18.8h, v18.8h, v31.8h
333
+    add             v19.8h, v19.8h, v31.8h
334
+    add             v20.8h, v20.8h, v31.8h
335
+    add             v21.8h, v21.8h, v31.8h
336
+    add             v22.8h, v22.8h, v31.8h
337
+    add             v23.8h, v23.8h, v31.8h
338
+    st1             {v16.16b-v19.16b}, x2, #64
339
+    st1             {v20.16b-v23.16b}, x2, x3
340
+    cbnz            x9, .loop_filterP2S_64x\h
341
+    ret
342
+endfunc
343
+.endm
344
+
345
+p2s_64xN 16
346
+p2s_64xN 32
347
+p2s_64xN 48
348
+p2s_64xN 64
349
+
350
+function PFX(filterPixelToShort_48x64_neon)
351
+    p2s_start
352
+#if HIGH_BIT_DEPTH
353
+    sub             x1, x1, #64
354
+#endif
355
+    sub             x3, x3, #64
356
+    mov             x9, #64
357
+.loop_filterP2S_48x64:
358
+    sub            x9, x9, #1
359
+#if HIGH_BIT_DEPTH
360
+    ld1             {v0.16b-v3.16b}, x0, #64
361
+    ld1             {v4.16b-v5.16b}, x0, x1
362
+    shl             v16.8h, v0.8h, #P2S_SHIFT
363
+    shl             v17.8h, v1.8h, #P2S_SHIFT
364
+    shl             v18.8h, v2.8h, #P2S_SHIFT
365
+    shl             v19.8h, v3.8h, #P2S_SHIFT
366
+    shl             v20.8h, v4.8h, #P2S_SHIFT
367
+    shl             v21.8h, v5.8h, #P2S_SHIFT
368
+#else
369
+    ld1             {v0.16b-v2.16b}, x0, x1
370
+    ushll           v16.8h, v0.8b,  #P2S_SHIFT
371
+    ushll2          v17.8h, v0.16b, #P2S_SHIFT
372
+    ushll           v18.8h, v1.8b,  #P2S_SHIFT
373
+    ushll2          v19.8h, v1.16b, #P2S_SHIFT
374
+    ushll           v20.8h, v2.8b,  #P2S_SHIFT
375
+    ushll2          v21.8h, v2.16b, #P2S_SHIFT
376
+#endif
377
+    add             v16.8h, v16.8h, v31.8h
378
+    add             v17.8h, v17.8h, v31.8h
379
+    add             v18.8h, v18.8h, v31.8h
380
+    add             v19.8h, v19.8h, v31.8h
381
+    add             v20.8h, v20.8h, v31.8h
382
+    add             v21.8h, v21.8h, v31.8h
383
+    st1             {v16.16b-v19.16b}, x2, #64
384
+    st1             {v20.16b-v21.16b}, x2, x3
385
+    cbnz            x9, .loop_filterP2S_48x64
386
+    ret
387
+endfunc
388
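The pixel-prim.cpp file added below reimplements the SATD/SA8D cost primitives with NEON intrinsics. What those kernels compute is the sum of absolute values of a 2-D Hadamard transform of the residual; a scalar 4x4 sketch for reference (illustrative only, x265's own C primitive lives elsewhere and may differ in minor details):

    #include <stdint.h>
    #include <stdlib.h>

    // Scalar reference for a 4x4 SATD: residual, 4-point Hadamard along rows
    // and columns, then a sum of absolute coefficients (halved, x264-style).
    static int satd_4x4_ref(const uint8_t* pix1, intptr_t stride1,
                            const uint8_t* pix2, intptr_t stride2)
    {
        int d[4][4], m[4][4], sum = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                d[y][x] = pix1[y * stride1 + x] - pix2[y * stride2 + x];
        for (int y = 0; y < 4; y++)
        {
            int a0 = d[y][0] + d[y][1], a1 = d[y][0] - d[y][1];
            int a2 = d[y][2] + d[y][3], a3 = d[y][2] - d[y][3];
            m[y][0] = a0 + a2; m[y][2] = a0 - a2;
            m[y][1] = a1 + a3; m[y][3] = a1 - a3;
        }
        for (int x = 0; x < 4; x++)
        {
            int a0 = m[0][x] + m[1][x], a1 = m[0][x] - m[1][x];
            int a2 = m[2][x] + m[3][x], a3 = m[2][x] - m[3][x];
            sum += abs(a0 + a2) + abs(a0 - a2) + abs(a1 + a3) + abs(a1 - a3);
        }
        return sum >> 1;
    }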
x265_3.6.tar.gz/source/common/aarch64/pixel-prim.cpp Added
2061
 
1
@@ -0,0 +1,2059 @@
2
+#include "common.h"
3
+#include "slicetype.h"      // LOWRES_COST_MASK
4
+#include "primitives.h"
5
+#include "x265.h"
6
+
7
+#include "pixel-prim.h"
8
+#include "arm64-utils.h"
9
+#if HAVE_NEON
10
+
11
+#include <arm_neon.h>
12
+
13
+using namespace X265_NS;
14
+
15
+
16
+
17
+namespace
18
+{
19
+
20
+
21
+/* SATD SA8D variants - based on x264 */
22
+static inline void SUMSUB_AB(int16x8_t &sum, int16x8_t &sub, const int16x8_t a, const int16x8_t b)
23
+{
24
+    sum = vaddq_s16(a, b);
25
+    sub = vsubq_s16(a, b);
26
+}
27
+
28
+static inline void transpose_8h(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2)
29
+{
30
+    t1 = vtrn1q_s16(s1, s2);
31
+    t2 = vtrn2q_s16(s1, s2);
32
+}
33
+
34
+static inline void transpose_4s(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2)
35
+{
36
+    t1 = vtrn1q_s32(s1, s2);
37
+    t2 = vtrn2q_s32(s1, s2);
38
+}
39
+
40
+#if (X265_DEPTH <= 10)
41
+static inline void transpose_2d(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2)
42
+{
43
+    t1 = vtrn1q_s64(s1, s2);
44
+    t2 = vtrn2q_s64(s1, s2);
45
+}
46
+#endif
47
+
48
+
49
+static inline void SUMSUB_ABCD(int16x8_t &s1, int16x8_t &d1, int16x8_t &s2, int16x8_t &d2,
50
+                               int16x8_t a, int16x8_t  b, int16x8_t  c, int16x8_t  d)
51
+{
52
+    SUMSUB_AB(s1, d1, a, b);
53
+    SUMSUB_AB(s2, d2, c, d);
54
+}
55
+
56
+static inline void HADAMARD4_V(int16x8_t &r1, int16x8_t &r2, int16x8_t &r3, int16x8_t &r4,
57
+                               int16x8_t &t1, int16x8_t &t2, int16x8_t &t3, int16x8_t &t4)
58
+{
59
+    SUMSUB_ABCD(t1, t2, t3, t4, r1, r2, r3, r4);
60
+    SUMSUB_ABCD(r1, r3, r2, r4, t1, t3, t2, t4);
61
+}
62
+
63
+
64
+static int _satd_4x8_8x4_end_neon(int16x8_t v0, int16x8_t v1, int16x8_t v2, int16x8_t v3)
65
+
66
+{
67
+
68
+    int16x8_t v4, v5, v6, v7, v16, v17, v18, v19;
69
+
70
+
71
+    SUMSUB_AB(v16, v17, v0,  v1);
72
+    SUMSUB_AB(v18, v19, v2,  v3);
73
+
74
+    SUMSUB_AB(v4 , v6 , v16, v18);
75
+    SUMSUB_AB(v5 , v7 , v17, v19);
76
+
77
+    v0 = vtrn1q_s16(v4, v5);
78
+    v1 = vtrn2q_s16(v4, v5);
79
+    v2 = vtrn1q_s16(v6, v7);
80
+    v3 = vtrn2q_s16(v6, v7);
81
+
82
+    SUMSUB_AB(v16, v17, v0,  v1);
83
+    SUMSUB_AB(v18, v19, v2,  v3);
84
+
85
+    v0 = vtrn1q_s32(v16, v18);
86
+    v1 = vtrn2q_s32(v16, v18);
87
+    v2 = vtrn1q_s32(v17, v19);
88
+    v3 = vtrn2q_s32(v17, v19);
89
+
90
+    v0 = vabsq_s16(v0);
91
+    v1 = vabsq_s16(v1);
92
+    v2 = vabsq_s16(v2);
93
+    v3 = vabsq_s16(v3);
94
+
95
+    v0 = vmaxq_u16(v0, v1);
96
+    v1 = vmaxq_u16(v2, v3);
97
+
98
+    v0 = vaddq_u16(v0, v1);
99
+    return vaddlvq_u16(v0);
100
+}
101
+
102
+static inline int _satd_4x4_neon(int16x8_t v0, int16x8_t v1)
103
+{
104
+    int16x8_t v2, v3;
105
+    SUMSUB_AB(v2,  v3,  v0,  v1);
106
+
107
+    v0 = vzip1q_s64(v2, v3);
108
+    v1 = vzip2q_s64(v2, v3);
109
+    SUMSUB_AB(v2,  v3,  v0,  v1);
110
+
111
+    v0 = vtrn1q_s16(v2, v3);
112
+    v1 = vtrn2q_s16(v2, v3);
113
+    SUMSUB_AB(v2,  v3,  v0,  v1);
114
+
115
+    v0 = vtrn1q_s32(v2, v3);
116
+    v1 = vtrn2q_s32(v2, v3);
117
+
118
+    v0 = vabsq_s16(v0);
119
+    v1 = vabsq_s16(v1);
120
+    v0 = vmaxq_u16(v0, v1);
121
+
122
+    return vaddlvq_s16(v0);
123
+}
124
+
125
+static void _satd_8x4v_8x8h_neon(int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3, int16x8_t &v20,
126
+                                 int16x8_t &v21, int16x8_t &v22, int16x8_t &v23)
127
+{
128
+    int16x8_t v16, v17, v18, v19, v4, v5, v6, v7;
129
+
130
+    SUMSUB_AB(v16, v18, v0,  v2);
131
+    SUMSUB_AB(v17, v19, v1,  v3);
132
+
133
+    HADAMARD4_V(v20, v21, v22, v23, v0,  v1, v2, v3);
134
+
135
+    transpose_8h(v0,  v1,  v16, v17);
136
+    transpose_8h(v2,  v3,  v18, v19);
137
+    transpose_8h(v4,  v5,  v20, v21);
138
+    transpose_8h(v6,  v7,  v22, v23);
139
+
140
+    SUMSUB_AB(v16, v17, v0,  v1);
141
+    SUMSUB_AB(v18, v19, v2,  v3);
142
+    SUMSUB_AB(v20, v21, v4,  v5);
143
+    SUMSUB_AB(v22, v23, v6,  v7);
144
+
145
+    transpose_4s(v0,  v2,  v16, v18);
146
+    transpose_4s(v1,  v3,  v17, v19);
147
+    transpose_4s(v4,  v6,  v20, v22);
148
+    transpose_4s(v5,  v7,  v21, v23);
149
+
150
+    v0 = vabsq_s16(v0);
151
+    v1 = vabsq_s16(v1);
152
+    v2 = vabsq_s16(v2);
153
+    v3 = vabsq_s16(v3);
154
+    v4 = vabsq_s16(v4);
155
+    v5 = vabsq_s16(v5);
156
+    v6 = vabsq_s16(v6);
157
+    v7 = vabsq_s16(v7);
158
+
159
+    v0 = vmaxq_u16(v0, v2);
160
+    v1 = vmaxq_u16(v1, v3);
161
+    v2 = vmaxq_u16(v4, v6);
162
+    v3 = vmaxq_u16(v5, v7);
163
+
164
+}
165
+
166
+#if HIGH_BIT_DEPTH
167
+
168
+#if (X265_DEPTH > 10)
169
+static inline void transpose_2d(int32x4_t &t1, int32x4_t &t2, const int32x4_t s1, const int32x4_t s2)
170
+{
171
+    t1 = vtrn1q_s64(s1, s2);
172
+    t2 = vtrn2q_s64(s1, s2);
173
+}
174
+
175
+static inline void ISUMSUB_AB(int32x4_t &sum, int32x4_t &sub, const int32x4_t a, const int32x4_t b)
176
+{
177
+    sum = vaddq_s32(a, b);
178
+    sub = vsubq_s32(a, b);
179
+}
180
+
181
+static inline void ISUMSUB_AB_FROM_INT16(int32x4_t &suml, int32x4_t &sumh, int32x4_t &subl, int32x4_t &subh,
182
+        const int16x8_t a, const int16x8_t b)
183
+{
184
+    suml = vaddl_s16(vget_low_s16(a), vget_low_s16(b));
185
+    sumh = vaddl_high_s16(a, b);
186
+    subl = vsubl_s16(vget_low_s16(a), vget_low_s16(b));
187
+    subh = vsubl_high_s16(a, b);
188
+}
189
+
190
+#endif
191
+
192
+static inline void _sub_8x8_fly(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2,
193
+                                int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3,
194
+                                int16x8_t &v20, int16x8_t &v21, int16x8_t &v22, int16x8_t &v23)
195
+{
196
+    uint16x8_t r0, r1, r2, r3;
197
+    uint16x8_t t0, t1, t2, t3;
198
+    int16x8_t v16, v17;
199
+    int16x8_t v18, v19;
200
+
201
+    r0 = *(uint16x8_t *)(pix1 + 0 * stride_pix1);
202
+    r1 = *(uint16x8_t *)(pix1 + 1 * stride_pix1);
203
+    r2 = *(uint16x8_t *)(pix1 + 2 * stride_pix1);
204
+    r3 = *(uint16x8_t *)(pix1 + 3 * stride_pix1);
205
+
206
+    t0 = *(uint16x8_t *)(pix2 + 0 * stride_pix2);
207
+    t1 = *(uint16x8_t *)(pix2 + 1 * stride_pix2);
208
+    t2 = *(uint16x8_t *)(pix2 + 2 * stride_pix2);
209
+    t3 = *(uint16x8_t *)(pix2 + 3 * stride_pix2);
210
+
211
+    v16 = vsubq_u16(r0, t0);
212
+    v17 = vsubq_u16(r1, t1);
213
+    v18 = vsubq_u16(r2, t2);
214
+    v19 = vsubq_u16(r3, t3);
215
+
216
+    r0 = *(uint16x8_t *)(pix1 + 4 * stride_pix1);
217
+    r1 = *(uint16x8_t *)(pix1 + 5 * stride_pix1);
218
+    r2 = *(uint16x8_t *)(pix1 + 6 * stride_pix1);
219
+    r3 = *(uint16x8_t *)(pix1 + 7 * stride_pix1);
220
+
221
+    t0 = *(uint16x8_t *)(pix2 + 4 * stride_pix2);
222
+    t1 = *(uint16x8_t *)(pix2 + 5 * stride_pix2);
223
+    t2 = *(uint16x8_t *)(pix2 + 6 * stride_pix2);
224
+    t3 = *(uint16x8_t *)(pix2 + 7 * stride_pix2);
225
+
226
+    v20 = vsubq_u16(r0, t0);
227
+    v21 = vsubq_u16(r1, t1);
228
+    v22 = vsubq_u16(r2, t2);
229
+    v23 = vsubq_u16(r3, t3);
230
+
231
+    SUMSUB_AB(v0,  v1,  v16, v17);
232
+    SUMSUB_AB(v2,  v3,  v18, v19);
233
+
234
+}
235
+
236
+
237
+
238
+
239
+static void _satd_16x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2,
240
+                            int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3)
241
+{
242
+    uint8x16_t r0, r1, r2, r3;
243
+    uint8x16_t t0, t1, t2, t3;
244
+    int16x8_t v16, v17, v20, v21;
245
+    int16x8_t v18, v19, v22, v23;
246
+
247
+    r0 = *(int16x8_t *)(pix1 + 0 * stride_pix1);
248
+    r1 = *(int16x8_t *)(pix1 + 1 * stride_pix1);
249
+    r2 = *(int16x8_t *)(pix1 + 2 * stride_pix1);
250
+    r3 = *(int16x8_t *)(pix1 + 3 * stride_pix1);
251
+
252
+    t0 = *(int16x8_t *)(pix2 + 0 * stride_pix2);
253
+    t1 = *(int16x8_t *)(pix2 + 1 * stride_pix2);
254
+    t2 = *(int16x8_t *)(pix2 + 2 * stride_pix2);
255
+    t3 = *(int16x8_t *)(pix2 + 3 * stride_pix2);
256
+
257
+
258
+    v16 = vsubq_u16((r0), (t0));
259
+    v17 = vsubq_u16((r1), (t1));
260
+    v18 = vsubq_u16((r2), (t2));
261
+    v19 = vsubq_u16((r3), (t3));
262
+
263
+    r0 = *(int16x8_t *)(pix1 + 0 * stride_pix1 + 8);
264
+    r1 = *(int16x8_t *)(pix1 + 1 * stride_pix1 + 8);
265
+    r2 = *(int16x8_t *)(pix1 + 2 * stride_pix1 + 8);
266
+    r3 = *(int16x8_t *)(pix1 + 3 * stride_pix1 + 8);
267
+
268
+    t0 = *(int16x8_t *)(pix2 + 0 * stride_pix2 + 8);
269
+    t1 = *(int16x8_t *)(pix2 + 1 * stride_pix2 + 8);
270
+    t2 = *(int16x8_t *)(pix2 + 2 * stride_pix2 + 8);
271
+    t3 = *(int16x8_t *)(pix2 + 3 * stride_pix2 + 8);
272
+
273
+
274
+    v20 = vsubq_u16(r0, t0);
275
+    v21 = vsubq_u16(r1, t1);
276
+    v22 = vsubq_u16(r2, t2);
277
+    v23 = vsubq_u16(r3, t3);
278
+
279
+    SUMSUB_AB(v0,  v1,  v16, v17);
280
+    SUMSUB_AB(v2,  v3,  v18, v19);
281
+
282
+    _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23);
283
+
284
+}
285
+
286
+
287
+int pixel_satd_4x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2)
288
+{
289
+    uint64x2_t t0, t1, r0, r1;
290
+    t0[0] = *(uint64_t *)(pix1 + 0 * stride_pix1);
291
+    t1[0] = *(uint64_t *)(pix1 + 1 * stride_pix1);
292
+    t0[1] = *(uint64_t *)(pix1 + 2 * stride_pix1);
293
+    t1[1] = *(uint64_t *)(pix1 + 3 * stride_pix1);
294
+
295
+    r0[0] = *(uint64_t *)(pix2 + 0 * stride_pix1);
296
+    r1[0] = *(uint64_t *)(pix2 + 1 * stride_pix2);
297
+    r0[1] = *(uint64_t *)(pix2 + 2 * stride_pix2);
298
+    r1[1] = *(uint64_t *)(pix2 + 3 * stride_pix2);
299
+
300
+    return _satd_4x4_neon(vsubq_u16(t0, r0), vsubq_u16(r1, t1));
301
+}
302
+
303
+
304
+
305
+
306
+
307
+
308
+int pixel_satd_8x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2)
309
+{
310
+    uint16x8_t i0, i1, i2, i3, i4, i5, i6, i7;
311
+
312
+    i0 = *(uint16x8_t *)(pix1 + 0 * stride_pix1);
313
+    i1 = *(uint16x8_t *)(pix2 + 0 * stride_pix2);
314
+    i2 = *(uint16x8_t *)(pix1 + 1 * stride_pix1);
315
+    i3 = *(uint16x8_t *)(pix2 + 1 * stride_pix2);
316
+    i4 = *(uint16x8_t *)(pix1 + 2 * stride_pix1);
317
+    i5 = *(uint16x8_t *)(pix2 + 2 * stride_pix2);
318
+    i6 = *(uint16x8_t *)(pix1 + 3 * stride_pix1);
319
+    i7 = *(uint16x8_t *)(pix2 + 3 * stride_pix2);
320
+
321
+    int16x8_t v0 = vsubq_u16(i0, i1);
322
+    int16x8_t v1 = vsubq_u16(i2, i3);
323
+    int16x8_t v2 = vsubq_u16(i4, i5);
324
+    int16x8_t v3 = vsubq_u16(i6, i7);
325
+
326
+    return _satd_4x8_8x4_end_neon(v0, v1, v2, v3);
327
+}
328
+
329
+
330
+int pixel_satd_16x16_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2)
331
+{
332
+    int32x4_t v30 = vdupq_n_u32(0), v31 = vdupq_n_u32(0);
333
+    int16x8_t v0, v1, v2, v3;
334
+    for (int offset = 0; offset <= 12; offset += 4) {
335
+        _satd_16x4_neon(pix1 + offset * stride_pix1, stride_pix1, pix2 + offset * stride_pix2, stride_pix2, v0, v1, v2, v3);
336
+        v30 = vpadalq_u16(v30, v0);
337
+        v30 = vpadalq_u16(v30, v1);
338
+        v31 = vpadalq_u16(v31, v2);
339
+        v31 = vpadalq_u16(v31, v3);
340
+    }
341
+    return vaddvq_s32(vaddq_s32(v30, v31));
342
+
343
+}
344
+
345
+#else       //HIGH_BIT_DEPTH
346
+
347
+static void _satd_16x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2,
348
+                            int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3)
349
+{
350
+    uint8x16_t r0, r1, r2, r3;
351
+    uint8x16_t t0, t1, t2, t3;
352
+    int16x8_t v16, v17, v20, v21;
353
+    int16x8_t v18, v19, v22, v23;
354
+
355
+    r0 = *(uint8x16_t *)(pix1 + 0 * stride_pix1);
356
+    r1 = *(uint8x16_t *)(pix1 + 1 * stride_pix1);
357
+    r2 = *(uint8x16_t *)(pix1 + 2 * stride_pix1);
358
+    r3 = *(uint8x16_t *)(pix1 + 3 * stride_pix1);
359
+
360
+    t0 = *(uint8x16_t *)(pix2 + 0 * stride_pix2);
361
+    t1 = *(uint8x16_t *)(pix2 + 1 * stride_pix2);
362
+    t2 = *(uint8x16_t *)(pix2 + 2 * stride_pix2);
363
+    t3 = *(uint8x16_t *)(pix2 + 3 * stride_pix2);
364
+
365
+
366
+
367
+    v16 = vsubl_u8(vget_low_u8(r0), vget_low_u8(t0));
368
+    v20 = vsubl_high_u8(r0, t0);
369
+    v17 = vsubl_u8(vget_low_u8(r1), vget_low_u8(t1));
370
+    v21 = vsubl_high_u8(r1, t1);
371
+    v18 = vsubl_u8(vget_low_u8(r2), vget_low_u8(t2));
372
+    v22 = vsubl_high_u8(r2, t2);
373
+    v19 = vsubl_u8(vget_low_u8(r3), vget_low_u8(t3));
374
+    v23 = vsubl_high_u8(r3, t3);
375
+
376
+    SUMSUB_AB(v0,  v1,  v16, v17);
377
+    SUMSUB_AB(v2,  v3,  v18, v19);
378
+
379
+    _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23);
380
+
381
+}
382
+
383
+
384
+static inline void _sub_8x8_fly(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2,
385
+                                int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3,
386
+                                int16x8_t &v20, int16x8_t &v21, int16x8_t &v22, int16x8_t &v23)
387
+{
388
+    uint8x8_t r0, r1, r2, r3;
389
+    uint8x8_t t0, t1, t2, t3;
390
+    int16x8_t v16, v17;
391
+    int16x8_t v18, v19;
392
+
393
+    r0 = *(uint8x8_t *)(pix1 + 0 * stride_pix1);
394
+    r1 = *(uint8x8_t *)(pix1 + 1 * stride_pix1);
395
+    r2 = *(uint8x8_t *)(pix1 + 2 * stride_pix1);
396
+    r3 = *(uint8x8_t *)(pix1 + 3 * stride_pix1);
397
+
398
+    t0 = *(uint8x8_t *)(pix2 + 0 * stride_pix2);
399
+    t1 = *(uint8x8_t *)(pix2 + 1 * stride_pix2);
400
+    t2 = *(uint8x8_t *)(pix2 + 2 * stride_pix2);
401
+    t3 = *(uint8x8_t *)(pix2 + 3 * stride_pix2);
402
+
403
+    v16 = vsubl_u8(r0, t0);
404
+    v17 = vsubl_u8(r1, t1);
405
+    v18 = vsubl_u8(r2, t2);
406
+    v19 = vsubl_u8(r3, t3);
407
+
408
+    r0 = *(uint8x8_t *)(pix1 + 4 * stride_pix1);
409
+    r1 = *(uint8x8_t *)(pix1 + 5 * stride_pix1);
410
+    r2 = *(uint8x8_t *)(pix1 + 6 * stride_pix1);
411
+    r3 = *(uint8x8_t *)(pix1 + 7 * stride_pix1);
412
+
413
+    t0 = *(uint8x8_t *)(pix2 + 4 * stride_pix2);
414
+    t1 = *(uint8x8_t *)(pix2 + 5 * stride_pix2);
415
+    t2 = *(uint8x8_t *)(pix2 + 6 * stride_pix2);
416
+    t3 = *(uint8x8_t *)(pix2 + 7 * stride_pix2);
417
+
418
+    v20 = vsubl_u8(r0, t0);
419
+    v21 = vsubl_u8(r1, t1);
420
+    v22 = vsubl_u8(r2, t2);
421
+    v23 = vsubl_u8(r3, t3);
422
+
423
+
424
+    SUMSUB_AB(v0,  v1,  v16, v17);
425
+    SUMSUB_AB(v2,  v3,  v18, v19);
426
+
427
+}
428
+
429
+int pixel_satd_4x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2)
430
+{
431
+    uint32x2_t t0, t1, r0, r1;
432
+    t0[0] = *(uint32_t *)(pix1 + 0 * stride_pix1);
433
+    t1[0] = *(uint32_t *)(pix1 + 1 * stride_pix1);
434
+    t0[1] = *(uint32_t *)(pix1 + 2 * stride_pix1);
435
+    t1[1] = *(uint32_t *)(pix1 + 3 * stride_pix1);
436
+
437
+    r0[0] = *(uint32_t *)(pix2 + 0 * stride_pix1);
438
+    r1[0] = *(uint32_t *)(pix2 + 1 * stride_pix2);
439
+    r0[1] = *(uint32_t *)(pix2 + 2 * stride_pix2);
440
+    r1[1] = *(uint32_t *)(pix2 + 3 * stride_pix2);
441
+
442
+    return _satd_4x4_neon(vsubl_u8(t0, r0), vsubl_u8(r1, t1));
443
+}
444
+
445
+
446
+int pixel_satd_8x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2)
447
+{
448
+    uint8x8_t i0, i1, i2, i3, i4, i5, i6, i7;
449
+
450
+    i0 = *(uint8x8_t *)(pix1 + 0 * stride_pix1);
451
+    i1 = *(uint8x8_t *)(pix2 + 0 * stride_pix2);
452
+    i2 = *(uint8x8_t *)(pix1 + 1 * stride_pix1);
453
+    i3 = *(uint8x8_t *)(pix2 + 1 * stride_pix2);
454
+    i4 = *(uint8x8_t *)(pix1 + 2 * stride_pix1);
455
+    i5 = *(uint8x8_t *)(pix2 + 2 * stride_pix2);
456
+    i6 = *(uint8x8_t *)(pix1 + 3 * stride_pix1);
457
+    i7 = *(uint8x8_t *)(pix2 + 3 * stride_pix2);
458
+
459
+    int16x8_t v0 = vsubl_u8(i0, i1);
460
+    int16x8_t v1 = vsubl_u8(i2, i3);
461
+    int16x8_t v2 = vsubl_u8(i4, i5);
462
+    int16x8_t v3 = vsubl_u8(i6, i7);
463
+
464
+    return _satd_4x8_8x4_end_neon(v0, v1, v2, v3);
465
+}
466
+
467
+int pixel_satd_16x16_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2)
468
+{
469
+    int16x8_t v30, v31;
470
+    int16x8_t v0, v1, v2, v3;
471
+
472
+    _satd_16x4_neon(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3);
473
+    v30 = vaddq_s16(v0, v1);
474
+    v31 = vaddq_s16(v2, v3);
475
+
476
+    _satd_16x4_neon(pix1 + 4 * stride_pix1, stride_pix1, pix2 + 4 * stride_pix2, stride_pix2, v0, v1, v2, v3);
477
+    v0 = vaddq_s16(v0, v1);
478
+    v1 = vaddq_s16(v2, v3);
479
+    v30 = vaddq_s16(v30, v0);
480
+    v31 = vaddq_s16(v31, v1);
481
+
482
+    _satd_16x4_neon(pix1 + 8 * stride_pix1, stride_pix1, pix2 + 8 * stride_pix2, stride_pix2, v0, v1, v2, v3);
483
+    v0 = vaddq_s16(v0, v1);
484
+    v1 = vaddq_s16(v2, v3);
485
+    v30 = vaddq_s16(v30, v0);
486
+    v31 = vaddq_s16(v31, v1);
487
+
488
+    _satd_16x4_neon(pix1 + 12 * stride_pix1, stride_pix1, pix2 + 12 * stride_pix2, stride_pix2, v0, v1, v2, v3);
489
+    v0 = vaddq_s16(v0, v1);
490
+    v1 = vaddq_s16(v2, v3);
491
+    v30 = vaddq_s16(v30, v0);
492
+    v31 = vaddq_s16(v31, v1);
493
+
494
+    int32x4_t sum0 = vpaddlq_u16(v30);
495
+    int32x4_t sum1 = vpaddlq_u16(v31);
496
+    sum0 = vaddq_s32(sum0, sum1);
497
+    return vaddvq_s32(sum0);
498
+
499
+}
500
+#endif      //HIGH_BIT_DEPTH
501
+
502
+
503
+static inline void _sa8d_8x8_neon_end(int16x8_t &v0, int16x8_t &v1, int16x8_t v2, int16x8_t v3,
504
+                                      int16x8_t v20, int16x8_t v21, int16x8_t v22, int16x8_t v23)
505
+{
506
+    int16x8_t v16, v17, v18, v19;
507
+    int16x8_t v4, v5, v6, v7;
508
+
509
+    SUMSUB_AB(v16, v18, v0,  v2);
510
+    SUMSUB_AB(v17, v19, v1,  v3);
511
+
512
+    HADAMARD4_V(v20, v21, v22, v23, v0,  v1, v2, v3);
513
+
514
+    SUMSUB_AB(v0,  v16, v16, v20);
515
+    SUMSUB_AB(v1,  v17, v17, v21);
516
+    SUMSUB_AB(v2,  v18, v18, v22);
517
+    SUMSUB_AB(v3,  v19, v19, v23);
518
+
519
+    transpose_8h(v20, v21, v16, v17);
520
+    transpose_8h(v4,  v5,  v0,  v1);
521
+    transpose_8h(v22, v23, v18, v19);
522
+    transpose_8h(v6,  v7,  v2,  v3);
523
+
524
+#if (X265_DEPTH <= 10)
525
+
526
+    int16x8_t v24, v25;
527
+
528
+    SUMSUB_AB(v2,  v3,  v20, v21);
529
+    SUMSUB_AB(v24, v25, v4,  v5);
530
+    SUMSUB_AB(v0,  v1,  v22, v23);
531
+    SUMSUB_AB(v4,  v5,  v6,  v7);
532
+
533
+    transpose_4s(v20, v22, v2,  v0);
534
+    transpose_4s(v21, v23, v3,  v1);
535
+    transpose_4s(v16, v18, v24, v4);
536
+    transpose_4s(v17, v19, v25, v5);
537
+
538
+    SUMSUB_AB(v0,  v2,  v20, v22);
539
+    SUMSUB_AB(v1,  v3,  v21, v23);
540
+    SUMSUB_AB(v4,  v6,  v16, v18);
541
+    SUMSUB_AB(v5,  v7,  v17, v19);
542
+
543
+    transpose_2d(v16, v20,  v0,  v4);
544
+    transpose_2d(v17, v21,  v1,  v5);
545
+    transpose_2d(v18, v22,  v2,  v6);
546
+    transpose_2d(v19, v23,  v3,  v7);
547
+
548
+
549
+    v16 = vabsq_s16(v16);
550
+    v17 = vabsq_s16(v17);
551
+    v18 = vabsq_s16(v18);
552
+    v19 = vabsq_s16(v19);
553
+    v20 = vabsq_s16(v20);
554
+    v21 = vabsq_s16(v21);
555
+    v22 = vabsq_s16(v22);
556
+    v23 = vabsq_s16(v23);
557
+
558
+    v16 = vmaxq_u16(v16, v20);
559
+    v17 = vmaxq_u16(v17, v21);
560
+    v18 = vmaxq_u16(v18, v22);
561
+    v19 = vmaxq_u16(v19, v23);
562
+
563
+#if HIGH_BIT_DEPTH
564
+    v0 = vpaddlq_u16(v16);
565
+    v1 = vpaddlq_u16(v17);
566
+    v0 = vpadalq_u16(v0, v18);
567
+    v1 = vpadalq_u16(v1, v19);
568
+
569
+#else //HIGH_BIT_DEPTH
570
+
571
+    v0 = vaddq_u16(v16, v17);
572
+    v1 = vaddq_u16(v18, v19);
573
+
574
+#endif //HIGH_BIT_DEPTH
575
+
576
+#else // HIGH_BIT_DEPTH 12 bit only, switching math to int32; each int16x8 is up-converted to 2 int32x4 (low and high)
577
+
578
+    int32x4_t v2l, v2h, v3l, v3h, v24l, v24h, v25l, v25h, v0l, v0h, v1l, v1h;
579
+    int32x4_t v22l, v22h, v23l, v23h;
580
+    int32x4_t v4l, v4h, v5l, v5h;
581
+    int32x4_t v6l, v6h, v7l, v7h;
582
+    int32x4_t v16l, v16h, v17l, v17h;
583
+    int32x4_t v18l, v18h, v19l, v19h;
584
+    int32x4_t v20l, v20h, v21l, v21h;
585
+
586
+    ISUMSUB_AB_FROM_INT16(v2l, v2h, v3l, v3h, v20, v21);
587
+    ISUMSUB_AB_FROM_INT16(v24l, v24h, v25l, v25h, v4, v5);
588
+
589
+    v22l = vmovl_s16(vget_low_s16(v22));
590
+    v22h = vmovl_high_s16(v22);
591
+    v23l = vmovl_s16(vget_low_s16(v23));
592
+    v23h = vmovl_high_s16(v23);
593
+
594
+    ISUMSUB_AB(v0l,  v1l,  v22l, v23l);
595
+    ISUMSUB_AB(v0h,  v1h,  v22h, v23h);
596
+
597
+    v6l = vmovl_s16(vget_low_s16(v6));
598
+    v6h = vmovl_high_s16(v6);
599
+    v7l = vmovl_s16(vget_low_s16(v7));
600
+    v7h = vmovl_high_s16(v7);
601
+
602
+    ISUMSUB_AB(v4l,  v5l,  v6l,  v7l);
603
+    ISUMSUB_AB(v4h,  v5h,  v6h,  v7h);
604
+
605
+    transpose_2d(v20l, v22l, v2l,  v0l);
606
+    transpose_2d(v21l, v23l, v3l,  v1l);
607
+    transpose_2d(v16l, v18l, v24l, v4l);
608
+    transpose_2d(v17l, v19l, v25l, v5l);
609
+
610
+    transpose_2d(v20h, v22h, v2h,  v0h);
611
+    transpose_2d(v21h, v23h, v3h,  v1h);
612
+    transpose_2d(v16h, v18h, v24h, v4h);
613
+    transpose_2d(v17h, v19h, v25h, v5h);
614
+
615
+    ISUMSUB_AB(v0l,  v2l,  v20l, v22l);
616
+    ISUMSUB_AB(v1l,  v3l,  v21l, v23l);
617
+    ISUMSUB_AB(v4l,  v6l,  v16l, v18l);
618
+    ISUMSUB_AB(v5l,  v7l,  v17l, v19l);
619
+
620
+    ISUMSUB_AB(v0h,  v2h,  v20h, v22h);
621
+    ISUMSUB_AB(v1h,  v3h,  v21h, v23h);
622
+    ISUMSUB_AB(v4h,  v6h,  v16h, v18h);
623
+    ISUMSUB_AB(v5h,  v7h,  v17h, v19h);
624
+
625
+    v16l = v0l;
626
+    v16h = v4l;
627
+    v20l = v0h;
628
+    v20h = v4h;
629
+
630
+    v17l = v1l;
631
+    v17h = v5l;
632
+    v21l = v1h;
633
+    v21h = v5h;
634
+
635
+    v18l = v2l;
636
+    v18h = v6l;
637
+    v22l = v2h;
638
+    v22h = v6h;
639
+
640
+    v19l = v3l;
641
+    v19h = v7l;
642
+    v23l = v3h;
643
+    v23h = v7h;
644
+
645
+    v16l = vabsq_s32(v16l);
646
+    v17l = vabsq_s32(v17l);
647
+    v18l = vabsq_s32(v18l);
648
+    v19l = vabsq_s32(v19l);
649
+    v20l = vabsq_s32(v20l);
650
+    v21l = vabsq_s32(v21l);
651
+    v22l = vabsq_s32(v22l);
652
+    v23l = vabsq_s32(v23l);
653
+
654
+    v16h = vabsq_s32(v16h);
655
+    v17h = vabsq_s32(v17h);
656
+    v18h = vabsq_s32(v18h);
657
+    v19h = vabsq_s32(v19h);
658
+    v20h = vabsq_s32(v20h);
659
+    v21h = vabsq_s32(v21h);
660
+    v22h = vabsq_s32(v22h);
661
+    v23h = vabsq_s32(v23h);
662
+
663
+    v16l = vmaxq_u32(v16l, v20l);
664
+    v17l = vmaxq_u32(v17l, v21l);
665
+    v18l = vmaxq_u32(v18l, v22l);
666
+    v19l = vmaxq_u32(v19l, v23l);
667
+
668
+    v16h = vmaxq_u32(v16h, v20h);
669
+    v17h = vmaxq_u32(v17h, v21h);
670
+    v18h = vmaxq_u32(v18h, v22h);
671
+    v19h = vmaxq_u32(v19h, v23h);
672
+
673
+    v16l = vaddq_u32(v16l, v16h);
674
+    v17l = vaddq_u32(v17l, v17h);
675
+    v18l = vaddq_u32(v18l, v18h);
676
+    v19l = vaddq_u32(v19l, v19h);
677
+
678
+    v0 = vaddq_u32(v16l, v17l);
679
+    v1 = vaddq_u32(v18l, v19l);
680
+
681
+
682
+#endif
683
+
684
+}
685
+
686
+
687
+
688
+static inline void _satd_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2,
689
+                                  int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3)
690
+{
691
+
692
+    int16x8_t v20, v21, v22, v23;
693
+    _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23);
694
+    _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23);
695
+
696
+}
697
+
698
+
699
+
700
+int pixel_satd_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
701
+{
702
+    int16x8_t v30, v31;
703
+    int16x8_t v0, v1, v2, v3;
704
+
705
+    _satd_8x8_neon(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3);
706
+#if !(HIGH_BIT_DEPTH)
707
+    v30 = vaddq_u16(v0, v1);
708
+    v31 = vaddq_u16(v2, v3);
709
+
710
+    uint16x8_t sum = vaddq_u16(v30, v31);
711
+    return vaddvq_s32(vpaddlq_u16(sum));
712
+#else
713
+
714
+    v30 = vaddq_u16(v0, v1);
715
+    v31 = vaddq_u16(v2, v3);
716
+
717
+    int32x4_t sum = vpaddlq_u16(v30);
718
+    sum = vpadalq_u16(sum, v31);
719
+    return vaddvq_s32(sum);
720
+#endif
721
+}
722
+
723
+
724
+int pixel_sa8d_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
725
+{
726
+    int16x8_t v0, v1, v2, v3;
727
+    int16x8_t v20, v21, v22, v23;
728
+
729
+    _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23);
730
+    _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23);
731
+
732
+#if HIGH_BIT_DEPTH
733
+    int32x4_t s = vaddq_u32(v0, v1);
734
+    return (vaddvq_u32(s) + 1) >> 1;
735
+#else
736
+    return (vaddlvq_s16(vaddq_u16(v0, v1)) + 1) >> 1;
737
+#endif
738
+}
739
+
740
+
741
+
742
+
743
+
744
+int pixel_sa8d_16x16_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
745
+{
746
+    int16x8_t v0, v1, v2, v3;
747
+    int16x8_t v20, v21, v22, v23;
748
+    int32x4_t v30, v31;
749
+
750
+    _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23);
751
+    _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23);
752
+
753
+#if !(HIGH_BIT_DEPTH)
754
+    v30 = vpaddlq_u16(v0);
755
+    v31 = vpaddlq_u16(v1);
756
+#else
757
+    v30 = vaddq_s32(v0, v1);
758
+#endif
759
+
760
+    _sub_8x8_fly(pix1 + 8, stride_pix1, pix2 + 8, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23);
761
+    _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23);
762
+
763
+#if !(HIGH_BIT_DEPTH)
764
+    v30 = vpadalq_u16(v30, v0);
765
+    v31 = vpadalq_u16(v31, v1);
766
+#else
767
+    v31 = vaddq_s32(v0, v1);
768
+#endif
769
+
770
+
771
+    _sub_8x8_fly(pix1 + 8 * stride_pix1, stride_pix1, pix2 + 8 * stride_pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22,
772
+                 v23);
773
+    _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23);
774
+
775
+#if !(HIGH_BIT_DEPTH)
776
+    v30 = vpadalq_u16(v30, v0);
777
+    v31 = vpadalq_u16(v31, v1);
778
+#else
779
+    v30 = vaddq_s32(v30, v0);
780
+    v31 = vaddq_s32(v31, v1);
781
+#endif
782
+
783
+    _sub_8x8_fly(pix1 + 8 * stride_pix1 + 8, stride_pix1, pix2 + 8 * stride_pix2 + 8, stride_pix2, v0, v1, v2, v3, v20, v21,
784
+                 v22, v23);
785
+    _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23);
786
+
787
+#if !(HIGH_BIT_DEPTH)
788
+    v30 = vpadalq_u16(v30, v0);
789
+    v31 = vpadalq_u16(v31, v1);
790
+#else
791
+    v30 = vaddq_s32(v30, v0);
792
+    v31 = vaddq_s32(v31, v1);
793
+#endif
794
+
795
+    v30 = vaddq_u32(v30, v31);
796
+
797
+    return (vaddvq_u32(v30) + 1) >> 1;
798
+}
799
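For reference, the 16x16 routine above sums the un-rounded costs of its four 8x8 quadrants and applies the (+1) >> 1 normalisation once at the end. A scalar sketch of that structure (the 8x8 cost is taken as a function pointer purely to keep the sketch self-contained; it is not part of this patch):

    #include <cstdint>

    // Mirrors how pixel_sa8d_16x16_neon walks the four 8x8 quadrants and
    // rounds once at the end; sa8d8x8 must return the un-rounded 8x8 cost.
    static int sa8d_16x16_ref(const uint8_t *pix1, intptr_t s1,
                              const uint8_t *pix2, intptr_t s2,
                              int (*sa8d8x8)(const uint8_t *, intptr_t,
                                             const uint8_t *, intptr_t))
    {
        int sum = sa8d8x8(pix1,              s1, pix2,              s2)
                + sa8d8x8(pix1 + 8,          s1, pix2 + 8,          s2)
                + sa8d8x8(pix1 + 8 * s1,     s1, pix2 + 8 * s2,     s2)
                + sa8d8x8(pix1 + 8 * s1 + 8, s1, pix2 + 8 * s2 + 8, s2);
        return (sum + 1) >> 1;
    }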
+
800
+
801
+
802
+
803
+
804
+
805
+
806
+
807
+template<int size>
808
+void blockfill_s_neon(int16_t *dst, intptr_t dstride, int16_t val)
809
+{
810
+    for (int y = 0; y < size; y++)
811
+    {
812
+        int x = 0;
813
+        int16x8_t v = vdupq_n_s16(val);
814
+        for (; (x + 8) <= size; x += 8)
815
+        {
816
+            *(int16x8_t *)&dst[y * dstride + x] = v;
817
+        }
818
+        for (; x < size; x++)
819
+        {
820
+            dst[y * dstride + x] = val;
821
+        }
822
+    }
823
+}
824
+
825
+template<int lx, int ly>
826
+int sad_pp_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
827
+{
828
+    int sum = 0;
829
+
830
+
831
+    for (int y = 0; y < ly; y++)
832
+    {
833
+#if HIGH_BIT_DEPTH
834
+        int x = 0;
835
+        uint16x8_t vsum16_1 = vdupq_n_u16(0);
836
+        for (; (x + 8) <= lx; x += 8)
837
+        {
838
+            uint16x8_t p1 = *(uint16x8_t *)&pix1[x];
839
+            uint16x8_t p2 = *(uint16x8_t *)&pix2[x];
840
+            vsum16_1 = vabaq_s16(vsum16_1, p1, p2);
841
+
842
+        }
843
+        if (lx & 4)
844
+        {
845
+            uint16x4_t p1 = *(uint16x4_t *)&pix1[x];
846
+            uint16x4_t p2 = *(uint16x4_t *)&pix2[x];
847
+            sum += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p2));
848
+            x += 4;
849
+        }
850
+        if (lx >= 4)
851
+        {
852
+            sum += vaddlvq_s16(vsum16_1);
853
+        }
854
+
855
+#else
856
+
857
+        int x = 0;
858
+        uint16x8_t vsum16_1 = vdupq_n_u16(0);
859
+        uint16x8_t vsum16_2 = vdupq_n_u16(0);
860
+
861
+        for (; (x + 16) <= lx; x += 16)
862
+        {
863
+            uint8x16_t p1 = *(uint8x16_t *)&pix1[x];
864
+            uint8x16_t p2 = *(uint8x16_t *)&pix2[x];
865
+            vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p2));
866
+            vsum16_2 = vabal_high_u8(vsum16_2, p1, p2);
867
+        }
868
+        if (lx & 8)
869
+        {
870
+            uint8x8_t p1 = *(uint8x8_t *)&pix1[x];
871
+            uint8x8_t p2 = *(uint8x8_t *)&pix2[x];
872
+            vsum16_1 = vabal_u8(vsum16_1, p1, p2);
873
+            x += 8;
874
+        }
875
+        if (lx & 4)
876
+        {
877
+            uint32x2_t p1 = vdup_n_u32(0);
878
+            p1[0] = *(uint32_t *)&pix1[x];
879
+            uint32x2_t p2 = vdup_n_u32(0);
880
+            p2[0] = *(uint32_t *)&pix2[x];
881
+            vsum16_1 = vabal_u8(vsum16_1, p1, p2);
882
+            x += 4;
883
+        }
884
+        if (lx >= 16)
885
+        {
886
+            vsum16_1 = vaddq_u16(vsum16_1, vsum16_2);
887
+        }
888
+        if (lx >= 4)
889
+        {
890
+            sum += vaddvq_u16(vsum16_1);
891
+        }
892
+
893
+#endif
894
+        if (lx & 3) for (; x < lx; x++)
895
+            {
896
+                sum += abs(pix1[x] - pix2[x]);
897
+            }
898
+
899
+        pix1 += stride_pix1;
900
+        pix2 += stride_pix2;
901
+    }
902
+
903
+    return sum;
904
+}
905
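Whatever vector width the loops above pick for a given lx and bit depth, the value returned is the plain sum of absolute differences. A scalar sketch of the 8-bit case for comparison:

    #include <cstdint>
    #include <cstdlib>

    // Scalar sketch of what sad_pp_neon<lx, ly> computes on the 8-bit path.
    static int sad_ref(const uint8_t *pix1, intptr_t stride1,
                       const uint8_t *pix2, intptr_t stride2, int lx, int ly)
    {
        int sum = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
                sum += abs(pix1[x] - pix2[x]);
            pix1 += stride1;
            pix2 += stride2;
        }
        return sum;
    }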
+
906
+template<int lx, int ly>
907
+void sad_x3_neon(const pixel *pix1, const pixel *pix2, const pixel *pix3, const pixel *pix4, intptr_t frefstride,
908
+                 int32_t *res)
909
+{
910
+    res[0] = 0;
911
+    res[1] = 0;
912
+    res[2] = 0;
913
+    for (int y = 0; y < ly; y++)
914
+    {
915
+        int x = 0;
916
+        uint16x8_t vsum16_0 = vdupq_n_u16(0);
917
+        uint16x8_t vsum16_1 = vdupq_n_u16(0);
918
+        uint16x8_t vsum16_2 = vdupq_n_u16(0);
919
+#if HIGH_BIT_DEPTH
920
+        for (; (x + 8) <= lx; x += 8)
921
+        {
922
+            uint16x8_t p1 = *(uint16x8_t *)&pix1[x];
923
+            uint16x8_t p2 = *(uint16x8_t *)&pix2[x];
924
+            uint16x8_t p3 = *(uint16x8_t *)&pix3[x];
925
+            uint16x8_t p4 = *(uint16x8_t *)&pix4[x];
926
+            vsum16_0 = vabaq_s16(vsum16_0, p1, p2);
927
+            vsum16_1 = vabaq_s16(vsum16_1, p1, p3);
928
+            vsum16_2 = vabaq_s16(vsum16_2, p1, p4);
929
+
930
+        }
931
+        if (lx & 4)
932
+        {
933
+            uint16x4_t p1 = *(uint16x4_t *)&pix1[x];
934
+            uint16x4_t p2 = *(uint16x4_t *)&pix2[x];
935
+            uint16x4_t p3 = *(uint16x4_t *)&pix3[x];
936
+            uint16x4_t p4 = *(uint16x4_t *)&pix4[x];
937
+            res[0] += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p2));
938
+            res[1] += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p3));
939
+            res[2] += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p4));
940
+            x += 4;
941
+        }
942
+        if (lx >= 4)
943
+        {
944
+            res[0] += vaddlvq_s16(vsum16_0);
945
+            res[1] += vaddlvq_s16(vsum16_1);
946
+            res[2] += vaddlvq_s16(vsum16_2);
947
+        }
948
+#else
949
+
950
+        for (; (x + 16) <= lx; x += 16)
951
+        {
952
+            uint8x16_t p1 = *(uint8x16_t *)&pix1[x];
953
+            uint8x16_t p2 = *(uint8x16_t *)&pix2[x];
954
+            uint8x16_t p3 = *(uint8x16_t *)&pix3[x];
955
+            uint8x16_t p4 = *(uint8x16_t *)&pix4[x];
956
+            vsum16_0 = vabal_u8(vsum16_0, vget_low_u8(p1), vget_low_u8(p2));
957
+            vsum16_0 = vabal_high_u8(vsum16_0, p1, p2);
958
+            vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p3));
959
+            vsum16_1 = vabal_high_u8(vsum16_1, p1, p3);
960
+            vsum16_2 = vabal_u8(vsum16_2, vget_low_u8(p1), vget_low_u8(p4));
961
+            vsum16_2 = vabal_high_u8(vsum16_2, p1, p4);
962
+        }
963
+        if (lx & 8)
964
+        {
965
+            uint8x8_t p1 = *(uint8x8_t *)&pix1[x];
966
+            uint8x8_t p2 = *(uint8x8_t *)&pix2[x];
967
+            uint8x8_t p3 = *(uint8x8_t *)&pix3[x];
968
+            uint8x8_t p4 = *(uint8x8_t *)&pix4[x];
969
+            vsum16_0 = vabal_u8(vsum16_0, p1, p2);
970
+            vsum16_1 = vabal_u8(vsum16_1, p1, p3);
971
+            vsum16_2 = vabal_u8(vsum16_2, p1, p4);
972
+            x += 8;
973
+        }
974
+        if (lx & 4)
975
+        {
976
+            uint32x2_t p1 = vdup_n_u32(0);
977
+            p1[0] = *(uint32_t *)&pix1[x];
978
+            uint32x2_t p2 = vdup_n_u32(0);
979
+            p2[0] = *(uint32_t *)&pix2[x];
980
+            uint32x2_t p3 = vdup_n_u32(0);
981
+            p3[0] = *(uint32_t *)&pix3[x];
982
+            uint32x2_t p4 = vdup_n_u32(0);
983
+            p4[0] = *(uint32_t *)&pix4[x];
984
+            vsum16_0 = vabal_u8(vsum16_0, p1, p2);
985
+            vsum16_1 = vabal_u8(vsum16_1, p1, p3);
986
+            vsum16_2 = vabal_u8(vsum16_2, p1, p4);
987
+            x += 4;
988
+        }
989
+        if (lx >= 4)
990
+        {
991
+            res[0] += vaddvq_u16(vsum16_0);
992
+            res[1] += vaddvq_u16(vsum16_1);
993
+            res[2] += vaddvq_u16(vsum16_2);
994
+        }
995
+
996
+#endif
997
+        if (lx & 3) for (; x < lx; x++)
998
+            {
999
+                res[0] += abs(pix1[x] - pix2[x]);
1000
+                res[1] += abs(pix1[x] - pix3[x]);
1001
+                res[2] += abs(pix1[x] - pix4[x]);
1002
+            }
1003
+
1004
+        pix1 += FENC_STRIDE;
1005
+        pix2 += frefstride;
1006
+        pix3 += frefstride;
1007
+        pix4 += frefstride;
1008
+    }
1009
+}
1010
+
1011
+template<int lx, int ly>
1012
+void sad_x4_neon(const pixel *pix1, const pixel *pix2, const pixel *pix3, const pixel *pix4, const pixel *pix5,
1013
+                 intptr_t frefstride, int32_t *res)
1014
+{
1015
+    int32x4_t result = {0};
1016
+    for (int y = 0; y < ly; y++)
1017
+    {
1018
+        int x = 0;
1019
+        uint16x8_t vsum16_0 = vdupq_n_u16(0);
1020
+        uint16x8_t vsum16_1 = vdupq_n_u16(0);
1021
+        uint16x8_t vsum16_2 = vdupq_n_u16(0);
1022
+        uint16x8_t vsum16_3 = vdupq_n_u16(0);
1023
+#if HIGH_BIT_DEPTH
1024
+        for (; (x + 16) <= lx; x += 16)
1025
+        {
1026
+            uint16x8x2_t p1 = vld1q_u16_x2(&pix1[x]);
1027
+            uint16x8x2_t p2 = vld1q_u16_x2(&pix2[x]);
1028
+            uint16x8x2_t p3 = vld1q_u16_x2(&pix3[x]);
1029
+            uint16x8x2_t p4 = vld1q_u16_x2(&pix4[x]);
1030
+            uint16x8x2_t p5 = vld1q_u16_x2(&pix5[x]);
1031
+            vsum16_0 = vabaq_s16(vsum16_0, p1.val[0], p2.val[0]);
1032
+            vsum16_1 = vabaq_s16(vsum16_1, p1.val[0], p3.val[0]);
1033
+            vsum16_2 = vabaq_s16(vsum16_2, p1.val[0], p4.val[0]);
1034
+            vsum16_3 = vabaq_s16(vsum16_3, p1.val[0], p5.val[0]);
1035
+            vsum16_0 = vabaq_s16(vsum16_0, p1.val[1], p2.val[1]);
1036
+            vsum16_1 = vabaq_s16(vsum16_1, p1.val[1], p3.val[1]);
1037
+            vsum16_2 = vabaq_s16(vsum16_2, p1.val[1], p4.val[1]);
1038
+            vsum16_3 = vabaq_s16(vsum16_3, p1.val[1], p5.val[1]);
1039
+        }
1040
+        if (lx & 8)
1041
+        {
1042
+            uint16x8_t p1 = *(uint16x8_t *)&pix1[x];
1043
+            uint16x8_t p2 = *(uint16x8_t *)&pix2[x];
1044
+            uint16x8_t p3 = *(uint16x8_t *)&pix3[x];
1045
+            uint16x8_t p4 = *(uint16x8_t *)&pix4[x];
1046
+            uint16x8_t p5 = *(uint16x8_t *)&pix5[x];
1047
+            vsum16_0 = vabaq_s16(vsum16_0, p1, p2);
1048
+            vsum16_1 = vabaq_s16(vsum16_1, p1, p3);
1049
+            vsum16_2 = vabaq_s16(vsum16_2, p1, p4);
1050
+            vsum16_3 = vabaq_s16(vsum16_3, p1, p5);
1051
+            x += 8;
1052
+        }
1053
+        if (lx & 4)
1054
+        {
1055
+            /* This is equivalent to getting the absolute difference of pix1[x] with each of
1056
+             * pix2 - pix5, then summing across the vector (4 values each) and adding the
1057
+             * result to result. */
1058
+            uint16x8_t p1 = vreinterpretq_s16_u64(
1059
+                    vld1q_dup_u64((uint64_t *)&pix1[x]));
1060
+            uint16x8_t p2_3 = vcombine_s16(*(uint16x4_t *)&pix2[x], *(uint16x4_t *)&pix3[x]);
1061
+            uint16x8_t p4_5 = vcombine_s16(*(uint16x4_t *)&pix4[x], *(uint16x4_t *)&pix5[x]);
1062
+
1063
+            uint16x8_t a = vabdq_u16(p1, p2_3);
1064
+            uint16x8_t b = vabdq_u16(p1, p4_5);
1065
+
1066
+            result = vpadalq_s16(result, vpaddq_s16(a, b));
1067
+            x += 4;
1068
+        }
1069
+        if (lx >= 4)
1070
+        {
1071
+            /* This is equivalent to adding across each of the sum vectors and then adding
1072
+             * to result. */
1073
+            uint16x8_t a = vpaddq_s16(vsum16_0, vsum16_1);
1074
+            uint16x8_t b = vpaddq_s16(vsum16_2, vsum16_3);
1075
+            uint16x8_t c = vpaddq_s16(a, b);
1076
+            result = vpadalq_s16(result, c);
1077
+        }
1078
+
1079
+#else
1080
+
1081
+        for (; (x + 16) <= lx; x += 16)
1082
+        {
1083
+            uint8x16_t p1 = *(uint8x16_t *)&pix1[x];
1084
+            uint8x16_t p2 = *(uint8x16_t *)&pix2[x];
1085
+            uint8x16_t p3 = *(uint8x16_t *)&pix3[x];
1086
+            uint8x16_t p4 = *(uint8x16_t *)&pix4[x];
1087
+            uint8x16_t p5 = *(uint8x16_t *)&pix5[x];
1088
+            vsum16_0 = vabal_u8(vsum16_0, vget_low_u8(p1), vget_low_u8(p2));
1089
+            vsum16_0 = vabal_high_u8(vsum16_0, p1, p2);
1090
+            vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p3));
1091
+            vsum16_1 = vabal_high_u8(vsum16_1, p1, p3);
1092
+            vsum16_2 = vabal_u8(vsum16_2, vget_low_u8(p1), vget_low_u8(p4));
1093
+            vsum16_2 = vabal_high_u8(vsum16_2, p1, p4);
1094
+            vsum16_3 = vabal_u8(vsum16_3, vget_low_u8(p1), vget_low_u8(p5));
1095
+            vsum16_3 = vabal_high_u8(vsum16_3, p1, p5);
1096
+        }
1097
+        if (lx & 8)
1098
+        {
1099
+            uint8x8_t p1 = *(uint8x8_t *)&pix1[x];
1100
+            uint8x8_t p2 = *(uint8x8_t *)&pix2[x];
1101
+            uint8x8_t p3 = *(uint8x8_t *)&pix3[x];
1102
+            uint8x8_t p4 = *(uint8x8_t *)&pix4[x];
1103
+            uint8x8_t p5 = *(uint8x8_t *)&pix5[x];
1104
+            vsum16_0 = vabal_u8(vsum16_0, p1, p2);
1105
+            vsum16_1 = vabal_u8(vsum16_1, p1, p3);
1106
+            vsum16_2 = vabal_u8(vsum16_2, p1, p4);
1107
+            vsum16_3 = vabal_u8(vsum16_3, p1, p5);
1108
+            x += 8;
1109
+        }
1110
+        if (lx & 4)
1111
+        {
1112
+            uint8x16_t p1 = vreinterpretq_u32_u8(
1113
+                vld1q_dup_u32((uint32_t *)&pix1[x]));
1114
+
1115
+            uint32x4_t p_x4;
1116
+            p_x4 = vld1q_lane_u32((uint32_t *)&pix2[x], p_x4, 0);
1117
+            p_x4 = vld1q_lane_u32((uint32_t *)&pix3[x], p_x4, 1);
1118
+            p_x4 = vld1q_lane_u32((uint32_t *)&pix4[x], p_x4, 2);
1119
+            p_x4 = vld1q_lane_u32((uint32_t *)&pix5[x], p_x4, 3);
1120
+
1121
+            uint16x8_t sum = vabdl_u8(vget_low_u8(p1), vget_low_u8(p_x4));
1122
+            uint16x8_t sum2 = vabdl_high_u8(p1, p_x4);
1123
+
1124
+            uint16x8_t a = vpaddq_u16(sum, sum2);
1125
+            result = vpadalq_u16(result, a);
1126
+        }
1127
+        if (lx >= 4)
1128
+        {
1129
+            result[0] += vaddvq_u16(vsum16_0);
1130
+            result[1] += vaddvq_u16(vsum16_1);
1131
+            result[2] += vaddvq_u16(vsum16_2);
1132
+            result[3] += vaddvq_u16(vsum16_3);
1133
+        }
1134
+
1135
+#endif
1136
+        if (lx & 3) for (; x < lx; x++)
1137
+        {
1138
+            result[0] += abs(pix1[x] - pix2[x]);
1139
+            result[1] += abs(pix1[x] - pix3[x]);
1140
+            result[2] += abs(pix1[x] - pix4[x]);
1141
+            result[3] += abs(pix1[x] - pix5[x]);
1142
+        }
1143
+
1144
+        pix1 += FENC_STRIDE;
1145
+        pix2 += frefstride;
1146
+        pix3 += frefstride;
1147
+        pix4 += frefstride;
1148
+        pix5 += frefstride;
1149
+    }
1150
+    vst1q_s32(res, result);
1151
+}
1152
+
1153
+
1154
+template<int lx, int ly, class T1, class T2>
1155
+sse_t sse_neon(const T1 *pix1, intptr_t stride_pix1, const T2 *pix2, intptr_t stride_pix2)
1156
+{
1157
+    sse_t sum = 0;
1158
+
1159
+    int32x4_t vsum1 = vdupq_n_s32(0);
1160
+    int32x4_t vsum2 = vdupq_n_s32(0);
1161
+    for (int y = 0; y < ly; y++)
1162
+    {
1163
+        int x = 0;
1164
+        for (; (x + 8) <= lx; x += 8)
1165
+        {
1166
+            int16x8_t tmp;
1167
+            if (sizeof(T1) == 2 && sizeof(T2) == 2)
1168
+            {
1169
+                tmp = vsubq_s16(*(int16x8_t *)&pix1[x], *(int16x8_t *)&pix2[x]);
1170
+            }
1171
+            else if (sizeof(T1) == 1 && sizeof(T2) == 1)
1172
+            {
1173
+                tmp = vsubl_u8(*(uint8x8_t *)&pix1[x], *(uint8x8_t *)&pix2[x]);
1174
+            }
1175
+            else
1176
+            {
1177
+                X265_CHECK(false, "unsupported sse");
1178
+            }
1179
+            vsum1 = vmlal_s16(vsum1, vget_low_s16(tmp), vget_low_s16(tmp));
1180
+            vsum2 = vmlal_high_s16(vsum2, tmp, tmp);
1181
+        }
1182
+        for (; x < lx; x++)
1183
+        {
1184
+            int tmp = pix1[x] - pix2[x];
1185
+            sum += (tmp * tmp);
1186
+        }
1187
+
1188
+        if (sizeof(T1) == 2 && sizeof(T2) == 2)
1189
+        {
1190
+            int32x4_t vsum = vaddq_u32(vsum1, vsum2);
1191
+            sum += vaddvq_u32(vsum);
1192
+            vsum1 = vsum2 = vdupq_n_u16(0);
1193
+        }
1194
+
1195
+        pix1 += stride_pix1;
1196
+        pix2 += stride_pix2;
1197
+    }
1198
+    int32x4_t vsum = vaddq_u32(vsum1, vsum2);
1199
+
1200
+    return sum + vaddvq_u32(vsum);
1201
+}
1202
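As with the SAD helpers, the quantity computed is the ordinary sum of squared differences; only the widening strategy depends on the pixel types. A scalar sketch of the 8-bit pixel/pixel case:

    #include <cstdint>

    // Scalar sketch of sse_neon<lx, ly, pixel, pixel> on the 8-bit path.
    static uint64_t sse_ref(const uint8_t *pix1, intptr_t stride1,
                            const uint8_t *pix2, intptr_t stride2, int lx, int ly)
    {
        uint64_t sum = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
            {
                int d = pix1[x] - pix2[x];
                sum += (uint64_t)(d * d);
            }
            pix1 += stride1;
            pix2 += stride2;
        }
        return sum;
    }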
+
1203
+
1204
+template<int bx, int by>
1205
+void blockcopy_ps_neon(int16_t *a, intptr_t stridea, const pixel *b, intptr_t strideb)
1206
+{
1207
+    for (int y = 0; y < by; y++)
1208
+    {
1209
+        int x = 0;
1210
+        for (; (x + 8) <= bx; x += 8)
1211
+        {
1212
+#if HIGH_BIT_DEPTH
1213
+            *(int16x8_t *)&a[x] = *(int16x8_t *)&b[x];
1214
+#else
1215
+            *(int16x8_t *)&a[x] = vmovl_u8(*(int8x8_t *)&b[x]);
1216
+#endif
1217
+        }
1218
+        for (; x < bx; x++)
1219
+        {
1220
+            a[x] = (int16_t)b[x];
1221
+        }
1222
+
1223
+        a += stridea;
1224
+        b += strideb;
1225
+    }
1226
+}
1227
+
1228
+
1229
+template<int bx, int by>
1230
+void blockcopy_pp_neon(pixel *a, intptr_t stridea, const pixel *b, intptr_t strideb)
1231
+{
1232
+    for (int y = 0; y < by; y++)
1233
+    {
1234
+        int x = 0;
1235
+#if HIGH_BIT_DEPTH
1236
+        for (; (x + 8) <= bx; x += 8)
1237
+        {
1238
+            *(int16x8_t *)&a[x] = *(int16x8_t *)&b[x];
1239
+        }
1240
+        if (bx & 4)
1241
+        {
1242
+            *(uint64_t *)&a[x] = *(uint64_t *)&b[x];
1243
+            x += 4;
1244
+        }
1245
+#else
1246
+        for (; (x + 16) <= bx; x += 16)
1247
+        {
1248
+            *(uint8x16_t *)&a[x] = *(uint8x16_t *)&b[x];
1249
+        }
1250
+        if (bx & 8)
1251
+        {
1252
+            *(uint8x8_t *)&a[x] = *(uint8x8_t *)&b[x];
1253
+            x += 8;
1254
+        }
1255
+        if (bx & 4)
1256
+        {
1257
+            *(uint32_t *)&a[x] = *(uint32_t *)&b[x];
1258
+            x += 4;
1259
+        }
1260
+#endif
1261
+        for (; x < bx; x++)
1262
+        {
1263
+            a[x] = b[x];
1264
+        }
1265
+
1266
+        a += stridea;
1267
+        b += strideb;
1268
+    }
1269
+}
1270
+
1271
+
1272
+template<int bx, int by>
1273
+void pixel_sub_ps_neon(int16_t *a, intptr_t dstride, const pixel *b0, const pixel *b1, intptr_t sstride0,
1274
+                       intptr_t sstride1)
1275
+{
1276
+    for (int y = 0; y < by; y++)
1277
+    {
1278
+        int x = 0;
1279
+        for (; (x + 8) <= bx; x += 8)
1280
+        {
1281
+#if HIGH_BIT_DEPTH
1282
+            *(int16x8_t *)&a[x] = vsubq_s16(*(int16x8_t *)&b0[x], *(int16x8_t *)&b1[x]);
1283
+#else
1284
+            *(int16x8_t *)&a[x] = vsubl_u8(*(uint8x8_t *)&b0[x], *(uint8x8_t *)&b1[x]);
1285
+#endif
1286
+        }
1287
+        for (; x < bx; x++)
1288
+        {
1289
+            a[x] = (int16_t)(b0[x] - b1[x]);
1290
+        }
1291
+
1292
+        b0 += sstride0;
1293
+        b1 += sstride1;
1294
+        a += dstride;
1295
+    }
1296
+}
1297
+
1298
+template<int bx, int by>
1299
+void pixel_add_ps_neon(pixel *a, intptr_t dstride, const pixel *b0, const int16_t *b1, intptr_t sstride0,
1300
+                       intptr_t sstride1)
1301
+{
1302
+    for (int y = 0; y < by; y++)
1303
+    {
1304
+        int x = 0;
1305
+        for (; (x + 8) <= bx; x += 8)
1306
+        {
1307
+            int16x8_t t;
1308
+            int16x8_t b1e = *(int16x8_t *)&b1[x];
1309
+            int16x8_t b0e;
1310
+#if HIGH_BIT_DEPTH
1311
+            b0e = *(int16x8_t *)&b0[x];
1312
+            t = vaddq_s16(b0e, b1e);
1313
+            t = vminq_s16(t, vdupq_n_s16((1 << X265_DEPTH) - 1));
1314
+            t = vmaxq_s16(t, vdupq_n_s16(0));
1315
+            *(int16x8_t *)&a[x] = t;
1316
+#else
1317
+            b0e = vmovl_u8(*(uint8x8_t *)&b0[x]);
1318
+            t = vaddq_s16(b0e, b1e);
1319
+            *(uint8x8_t *)&a[x] = vqmovun_s16(t);
1320
+#endif
1321
+        }
1322
+        for (; x < bx; x++)
1323
+        {
1324
+            a[x] = (int16_t)x265_clip(b0[x] + b1[x]);
1325
+        }
1326
+
1327
+        b0 += sstride0;
1328
+        b1 += sstride1;
1329
+        a += dstride;
1330
+    }
1331
+}
1332
+
1333
+template<int bx, int by>
1334
+void addAvg_neon(const int16_t *src0, const int16_t *src1, pixel *dst, intptr_t src0Stride, intptr_t src1Stride,
1335
+                 intptr_t dstStride)
1336
+{
1337
+
1338
+    const int shiftNum = IF_INTERNAL_PREC + 1 - X265_DEPTH;
1339
+    const int offset = (1 << (shiftNum - 1)) + 2 * IF_INTERNAL_OFFS;
1340
+
1341
+    const int32x4_t addon = vdupq_n_s32(offset);
1342
+    for (int y = 0; y < by; y++)
1343
+    {
1344
+        int x = 0;
1345
+
1346
+        for (; (x + 8) <= bx; x += 8)
1347
+        {
1348
+            int16x8_t in0 = *(int16x8_t *)&src0[x];
1350
+            int16x8_t in1 = *(int16x8_t *)&src1[x];
1350
+            int32x4_t t1 = vaddl_s16(vget_low_s16(in0), vget_low_s16(in1));
1351
+            int32x4_t t2 = vaddl_high_s16(in0, in1);
1352
+            t1 = vaddq_s32(t1, addon);
1353
+            t2 = vaddq_s32(t2, addon);
1354
+            t1 = vshrq_n_s32(t1, shiftNum);
1355
+            t2 = vshrq_n_s32(t2, shiftNum);
1356
+            int16x8_t t = vuzp1q_s16(t1, t2);
1357
+#if HIGH_BIT_DEPTH
1358
+            t = vminq_s16(t, vdupq_n_s16((1 << X265_DEPTH) - 1));
1359
+            t = vmaxq_s16(t, vdupq_n_s16(0));
1360
+            *(int16x8_t *)&dst[x] = t;
1361
+#else
1362
+            *(uint8x8_t *)&dst[x] = vqmovun_s16(t);
1363
+#endif
1364
+        }
1365
+        for (; x < bx; x += 2)
1366
+        {
1367
+            dst[x + 0] = x265_clip((src0[x + 0] + src1[x + 0] + offset) >> shiftNum);
1368
+            dst[x + 1] = x265_clip((src0[x + 1] + src1[x + 1] + offset) >> shiftNum);
1369
+        }
1370
+
1371
+        src0 += src0Stride;
1372
+        src1 += src1Stride;
1373
+        dst  += dstStride;
1374
+    }
1375
+}
1376
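The rounding above comes from x265's 14-bit bi-prediction intermediate format. Assuming the usual constants IF_INTERNAL_PREC = 14 and IF_INTERNAL_OFFS = 1 << 13, the 8-bit case reduces to shiftNum = 7 and offset = 16448; a per-pixel sketch:

    #include <algorithm>
    #include <cstdint>

    // Scalar sketch of the addAvg rounding for X265_DEPTH == 8, assuming
    // IF_INTERNAL_PREC = 14 and IF_INTERNAL_OFFS = 1 << 13.
    static inline uint8_t add_avg_ref(int16_t s0, int16_t s1)
    {
        const int shiftNum = 14 + 1 - 8;                           // 7
        const int offset = (1 << (shiftNum - 1)) + 2 * (1 << 13);  // 16448
        int v = (s0 + s1 + offset) >> shiftNum;
        return (uint8_t)std::min(255, std::max(0, v));
    }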
+
1377
+template<int lx, int ly>
1378
+void pixelavg_pp_neon(pixel *dst, intptr_t dstride, const pixel *src0, intptr_t sstride0, const pixel *src1,
1379
+                      intptr_t sstride1, int)
1380
+{
1381
+    for (int y = 0; y < ly; y++)
1382
+    {
1383
+        int x = 0;
1384
+        for (; (x + 8) <= lx; x += 8)
1385
+        {
1386
+#if HIGH_BIT_DEPTH
1387
+            uint16x8_t in0 = *(uint16x8_t *)&src0[x];
1388
+            uint16x8_t in1 = *(uint16x8_t *)&src1[x];
1389
+            uint16x8_t t = vrhaddq_u16(in0, in1);
1390
+            *(uint16x8_t *)&dst[x] = t;
1391
+#else
1392
+            int16x8_t in0 = vmovl_u8(*(uint8x8_t *)&src0[x]);
1393
+            int16x8_t in1 = vmovl_u8(*(uint8x8_t *)&src1[x]);
1394
+            int16x8_t t = vrhaddq_s16(in0, in1);
1395
+            *(uint8x8_t *)&dst[x] = vmovn_u16(t);
1396
+#endif
1397
+        }
1398
+        for (; x < lx; x++)
1399
+        {
1400
+            dst[x] = (src0[x] + src1[x] + 1) >> 1;
1401
+        }
1402
+
1403
+        src0 += sstride0;
1404
+        src1 += sstride1;
1405
+        dst += dstride;
1406
+    }
1407
+}
1408
+
1409
+
1410
+template<int size>
1411
+void cpy1Dto2D_shl_neon(int16_t *dst, const int16_t *src, intptr_t dstStride, int shift)
1412
+{
1413
+    X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n");
1414
+    X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n");
1415
+    X265_CHECK(shift >= 0, "invalid shift\n");
1416
+
1417
+    for (int i = 0; i < size; i++)
1418
+    {
1419
+        int j = 0;
1420
+        for (; (j + 8) <= size; j += 8)
1421
+        {
1422
+            *(int16x8_t *)&dst[j] = vshlq_s16(*(int16x8_t *)&src[j], vdupq_n_s16(shift));
1423
+        }
1424
+        for (; j < size; j++)
1425
+        {
1426
+            dst[j] = src[j] << shift;
1427
+        }
1428
+        src += size;
1429
+        dst += dstStride;
1430
+    }
1431
+}
1432
+
1433
+
1434
+template<int size>
1435
+uint64_t pixel_var_neon(const uint8_t *pix, intptr_t i_stride)
1436
+{
1437
+    uint32_t sum = 0, sqr = 0;
1438
+
1439
+    int32x4_t vsqr = vdupq_n_s32(0);
1440
+    for (int y = 0; y < size; y++)
1441
+    {
1442
+        int x = 0;
1443
+        int16x8_t vsum = vdupq_n_s16(0);
1444
+        for (; (x + 8) <= size; x += 8)
1445
+        {
1446
+            int16x8_t in;
1447
+            in = vmovl_u8(*(uint8x8_t *)&pix[x]);
1448
+            vsum = vaddq_u16(vsum, in);
1449
+            vsqr = vmlal_s16(vsqr, vget_low_s16(in), vget_low_s16(in));
1450
+            vsqr = vmlal_high_s16(vsqr, in, in);
1451
+        }
1452
+        for (; x < size; x++)
1453
+        {
1454
+            sum += pix[x];
1455
+            sqr += pix[x] * pix[x];
1456
+        }
1457
+        sum += vaddvq_s16(vsum);
1458
+
1459
+        pix += i_stride;
1460
+    }
1461
+    sqr += vaddvq_u32(vsqr);
1462
+    return sum + ((uint64_t)sqr << 32);
1463
+}
1464
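The 64-bit return value packs the pixel sum into the low 32 bits and the sum of squares into the high 32 bits. A sketch of how a caller could unpack it into a variance (how x265 itself combines the two halves may differ):

    #include <cstdint>

    // Illustrative unpacking of pixel_var_neon<size>'s packed return value.
    static inline double variance_from_packed(uint64_t packed, int size)
    {
        uint32_t sum = (uint32_t)packed;
        uint32_t sqr = (uint32_t)(packed >> 32);
        double n = (double)size * (double)size;
        double mean = (double)sum / n;
        return (double)sqr / n - mean * mean;
    }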
+
1465
+template<int blockSize>
1466
+void getResidual_neon(const pixel *fenc, const pixel *pred, int16_t *residual, intptr_t stride)
1467
+{
1468
+    for (int y = 0; y < blockSize; y++)
1469
+    {
1470
+        int x = 0;
1471
+        for (; (x + 8) < blockSize; x += 8)
1472
+        {
1473
+            int16x8_t vfenc, vpred;
1474
+#if HIGH_BIT_DEPTH
1475
+            vfenc = *(int16x8_t *)&fenc[x];
1476
+            vpred = *(int16x8_t *)&pred[x];
1477
+#else
1478
+            vfenc = vmovl_u8(*(uint8x8_t *)&fenc[x]);
1479
+            vpred = vmovl_u8(*(uint8x8_t *)&pred[x]);
1480
+#endif
1481
+            *(int16x8_t *)&residual[x] = vsubq_s16(vfenc, vpred);
1482
+        }
1483
+        for (; x < blockSize; x++)
1484
+        {
1485
+            residual[x] = static_cast<int16_t>(fenc[x]) - static_cast<int16_t>(pred[x]);
1486
+        }
1487
+        fenc += stride;
1488
+        residual += stride;
1489
+        pred += stride;
1490
+    }
1491
+}
1492
+
1493
+template<int size>
1494
+int psyCost_pp_neon(const pixel *source, intptr_t sstride, const pixel *recon, intptr_t rstride)
1495
+{
1496
+    static pixel zeroBuf[8] /* = { 0 } */;
1497
+
1498
+    if (size)
1499
+    {
1500
+        int dim = 1 << (size + 2);
1501
+        uint32_t totEnergy = 0;
1502
+        for (int i = 0; i < dim; i += 8)
1503
+        {
1504
+            for (int j = 0; j < dim; j += 8)
1505
+            {
1506
+                /* AC energy, measured by sa8d (AC + DC) minus SAD (DC) */
1507
+                int sourceEnergy = pixel_sa8d_8x8_neon(source + i * sstride + j, sstride, zeroBuf, 0) -
1508
+                                   (sad_pp_neon<8, 8>(source + i * sstride + j, sstride, zeroBuf, 0) >> 2);
1509
+                int reconEnergy =  pixel_sa8d_8x8_neon(recon + i * rstride + j, rstride, zeroBuf, 0) -
1510
+                                   (sad_pp_neon<8, 8>(recon + i * rstride + j, rstride, zeroBuf, 0) >> 2);
1511
+
1512
+                totEnergy += abs(sourceEnergy - reconEnergy);
1513
+            }
1514
+        }
1515
+        return totEnergy;
1516
+    }
1517
+    else
1518
+    {
1519
+        /* 4x4 is too small for sa8d */
1520
+        int sourceEnergy = pixel_satd_4x4_neon(source, sstride, zeroBuf, 0) - (sad_pp_neon<4, 4>(source, sstride, zeroBuf,
1521
+                           0) >> 2);
1522
+        int reconEnergy = pixel_satd_4x4_neon(recon, rstride, zeroBuf, 0) - (sad_pp_neon<4, 4>(recon, rstride, zeroBuf,
1523
+                          0) >> 2);
1524
+        return abs(sourceEnergy - reconEnergy);
1525
+    }
1526
+}
1527
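In other words, each block's AC energy is estimated as its Hadamard cost against a zero block minus a scaled DC estimate from SAD, and the psy cost is the accumulated absolute difference between the source and reconstruction energies; schematically:

    // energy(B)      = sa8d(B, zero) - (sad(B, zero) >> 2)
    // psyCost(S, R)  = sum over co-located blocks of |energy(S_blk) - energy(R_blk)|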
+
1528
+
1529
+template<int w, int h>
1530
+// Calculate sa8d in blocks of 8x8
1531
+int sa8d8(const pixel *pix1, intptr_t i_pix1, const pixel *pix2, intptr_t i_pix2)
1532
+{
1533
+    int cost = 0;
1534
+
1535
+    for (int y = 0; y < h; y += 8)
1536
+        for (int x = 0; x < w; x += 8)
1537
+        {
1538
+            cost += pixel_sa8d_8x8_neon(pix1 + i_pix1 * y + x, i_pix1, pix2 + i_pix2 * y + x, i_pix2);
1539
+        }
1540
+
1541
+    return cost;
1542
+}
1543
+
1544
+template<int w, int h>
1545
+// Calculate sa8d in blocks of 16x16
1546
+int sa8d16(const pixel *pix1, intptr_t i_pix1, const pixel *pix2, intptr_t i_pix2)
1547
+{
1548
+    int cost = 0;
1549
+
1550
+    for (int y = 0; y < h; y += 16)
1551
+        for (int x = 0; x < w; x += 16)
1552
+        {
1553
+            cost += pixel_sa8d_16x16_neon(pix1 + i_pix1 * y + x, i_pix1, pix2 + i_pix2 * y + x, i_pix2);
1554
+        }
1555
+
1556
+    return cost;
1557
+}
1558
+
1559
+template<int size>
1560
+void cpy2Dto1D_shl_neon(int16_t *dst, const int16_t *src, intptr_t srcStride, int shift)
1561
+{
1562
+    X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n");
1563
+    X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n");
1564
+    X265_CHECK(shift >= 0, "invalid shift\n");
1565
+
1566
+    for (int i = 0; i < size; i++)
1567
+    {
1568
+        for (int j = 0; j < size; j++)
1569
+        {
1570
+            dst[j] = src[j] << shift;
1571
+        }
1572
+
1573
+        src += srcStride;
1574
+        dst += size;
1575
+    }
1576
+}
1577
+
1578
+
1579
+template<int w, int h>
1580
+// calculate satd in blocks of 4x4
1581
+int satd4_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
1582
+{
1583
+    int satd = 0;
1584
+
1585
+    for (int row = 0; row < h; row += 4)
1586
+        for (int col = 0; col < w; col += 4)
1587
+            satd += pixel_satd_4x4_neon(pix1 + row * stride_pix1 + col, stride_pix1,
1588
+                                        pix2 + row * stride_pix2 + col, stride_pix2);
1589
+
1590
+    return satd;
1591
+}
1592
+
1593
+template<int w, int h>
1594
+// calculate satd in blocks of 8x4
1595
+int satd8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2)
1596
+{
1597
+    int satd = 0;
1598
+
1599
+    if (((w | h) & 15) == 0)
1600
+    {
1601
+        for (int row = 0; row < h; row += 16)
1602
+            for (int col = 0; col < w; col += 16)
1603
+                satd += pixel_satd_16x16_neon(pix1 + row * stride_pix1 + col, stride_pix1,
1604
+                                              pix2 + row * stride_pix2 + col, stride_pix2);
1605
+
1606
+    }
1607
+    else if (((w | h) & 7) == 0)
1608
+    {
1609
+        for (int row = 0; row < h; row += 8)
1610
+            for (int col = 0; col < w; col += 8)
1611
+                satd += pixel_satd_8x8_neon(pix1 + row * stride_pix1 + col, stride_pix1,
1612
+                                            pix2 + row * stride_pix2 + col, stride_pix2);
1613
+
1614
+    }
1615
+    else
1616
+    {
1617
+        for (int row = 0; row < h; row += 4)
1618
+            for (int col = 0; col < w; col += 8)
1619
+                satd += pixel_satd_8x4_neon(pix1 + row * stride_pix1 + col, stride_pix1,
1620
+                                            pix2 + row * stride_pix2 + col, stride_pix2);
1621
+    }
1622
+
1623
+    return satd;
1624
+}
1625
+
1626
+
1627
+template<int blockSize>
1628
+void transpose_neon(pixel *dst, const pixel *src, intptr_t stride)
1629
+{
1630
+    for (int k = 0; k < blockSize; k++)
1631
+        for (int l = 0; l < blockSize; l++)
1632
+        {
1633
+            dst[k * blockSize + l] = src[l * stride + k];
1634
+        }
1635
+}
1636
+
1637
+
1638
+template<>
1639
+void transpose_neon<8>(pixel *dst, const pixel *src, intptr_t stride)
1640
+{
1641
+    transpose8x8(dst, src, 8, stride);
1642
+}
1643
+
1644
+template<>
1645
+void transpose_neon<16>(pixel *dst, const pixel *src, intptr_t stride)
1646
+{
1647
+    transpose16x16(dst, src, 16, stride);
1648
+}
1649
+
1650
+template<>
1651
+void transpose_neon<32>(pixel *dst, const pixel *src, intptr_t stride)
1652
+{
1653
+    transpose32x32(dst, src, 32, stride);
1654
+}
1655
+
1656
+
1657
+template<>
1658
+void transpose_neon<64>(pixel *dst, const pixel *src, intptr_t stride)
1659
+{
1660
+    transpose32x32(dst, src, 64, stride);
1661
+    transpose32x32(dst + 32 * 64 + 32, src + 32 * stride + 32, 64, stride);
1662
+    transpose32x32(dst + 32 * 64, src + 32, 64, stride);
1663
+    transpose32x32(dst + 32, src + 32 * stride, 64, stride);
1664
+}
1665
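The 64x64 specialisation builds the result from four 32x32 quadrant transposes, swapping the two off-diagonal quadrants. In (row, column) offsets, with dst using a fixed stride of 64 and src using 'stride':

    // dst(0, 0)   <- transpose of src(0, 0)     dst(0, 32)  <- transpose of src(32, 0)
    // dst(32, 0)  <- transpose of src(0, 32)    dst(32, 32) <- transpose of src(32, 32)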
+
1666
+
1667
+template<int size>
1668
+sse_t pixel_ssd_s_neon(const int16_t *a, intptr_t dstride)
1669
+{
1670
+    sse_t sum = 0;
1671
+
1672
+
1673
+    int32x4_t vsum = vdupq_n_s32(0);
1674
+
1675
+    for (int y = 0; y < size; y++)
1676
+    {
1677
+        int x = 0;
1678
+
1679
+        for (; (x + 8) <= size; x += 8)
1680
+        {
1681
+            int16x8_t in = *(int16x8_t *)&a[x];
1682
+            vsum = vmlal_s16(vsum, vget_low_s16(in), vget_low_s16(in));
1683
+            vsum = vmlal_high_s16(vsum, (in), (in));
1684
+        }
1685
+        for (; x < size; x++)
1686
+        {
1687
+            sum += a[x] * a[x];
1688
+        }
1689
+
1690
+        a += dstride;
1691
+    }
1692
+    return sum + vaddvq_s32(vsum);
1693
+}
1694
+
1695
+
1696
+};
1697
+
1698
+
1699
+
1700
+
1701
+namespace X265_NS
1702
+{
1703
+
1704
+
1705
+void setupPixelPrimitives_neon(EncoderPrimitives &p)
1706
+{
1707
+#define LUMA_PU(W, H) \
1708
+    p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1709
+    p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \
1710
+    p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \
1711
+    p.puLUMA_ ## W ## x ## H.sad = sad_pp_neon<W, H>; \
1712
+    p.puLUMA_ ## W ## x ## H.sad_x3 = sad_x3_neon<W, H>; \
1713
+    p.puLUMA_ ## W ## x ## H.sad_x4 = sad_x4_neon<W, H>; \
1714
+    p.puLUMA_ ## W ## x ## H.pixelavg_ppNONALIGNED = pixelavg_pp_neon<W, H>; \
1715
+    p.puLUMA_ ## W ## x ## H.pixelavg_ppALIGNED = pixelavg_pp_neon<W, H>;
1716
+
1717
+#if !(HIGH_BIT_DEPTH)
1718
+#define LUMA_PU_S(W, H) \
1719
+    p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1720
+    p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \
1721
+    p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>;
1722
+#else // !(HIGH_BIT_DEPTH)
1723
+#define LUMA_PU_S(W, H) \
1724
+    p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1725
+    p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \
1726
+    p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \
1727
+    p.puLUMA_ ## W ## x ## H.sad_x3 = sad_x3_neon<W, H>; \
1728
+    p.puLUMA_ ## W ## x ## H.sad_x4 = sad_x4_neon<W, H>; \
1729
+    p.puLUMA_ ## W ## x ## H.pixelavg_ppNONALIGNED = pixelavg_pp_neon<W, H>; \
1730
+    p.puLUMA_ ## W ## x ## H.pixelavg_ppALIGNED = pixelavg_pp_neon<W, H>;
1731
+#endif // !(HIGH_BIT_DEPTH)
1732
+
1733
+#define LUMA_CU(W, H) \
1734
+    p.cuBLOCK_ ## W ## x ## H.sub_ps        = pixel_sub_ps_neon<W, H>; \
1735
+    p.cuBLOCK_ ## W ## x ## H.add_psNONALIGNED    = pixel_add_ps_neon<W, H>; \
1736
+    p.cuBLOCK_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; \
1737
+    p.cuBLOCK_ ## W ## x ## H.copy_pp       = blockcopy_pp_neon<W, H>; \
1738
+    p.cuBLOCK_ ## W ## x ## H.copy_ps       = blockcopy_ps_neon<W, H>; \
1739
+    p.cuBLOCK_ ## W ## x ## H.copy_pp       = blockcopy_pp_neon<W, H>; \
1740
+    p.cuBLOCK_ ## W ## x ## H.cpy2Dto1D_shl = cpy2Dto1D_shl_neon<W>; \
1741
+    p.cuBLOCK_ ## W ## x ## H.cpy1Dto2D_shlNONALIGNED = cpy1Dto2D_shl_neon<W>; \
1742
+    p.cuBLOCK_ ## W ## x ## H.cpy1Dto2D_shlALIGNED = cpy1Dto2D_shl_neon<W>; \
1743
+    p.cuBLOCK_ ## W ## x ## H.psy_cost_pp   = psyCost_pp_neon<BLOCK_ ## W ## x ## H>; \
1744
+    p.cuBLOCK_ ## W ## x ## H.transpose     = transpose_neon<W>;
1745
+
1746
+
1747
+    LUMA_PU_S(4, 4);
1748
+    LUMA_PU_S(8, 8);
1749
+    LUMA_PU(16, 16);
1750
+    LUMA_PU(32, 32);
1751
+    LUMA_PU(64, 64);
1752
+    LUMA_PU_S(4, 8);
1753
+    LUMA_PU_S(8, 4);
1754
+    LUMA_PU(16,  8);
1755
+    LUMA_PU_S(8, 16);
1756
+    LUMA_PU(16, 12);
1757
+    LUMA_PU(12, 16);
1758
+    LUMA_PU(16,  4);
1759
+    LUMA_PU_S(4, 16);
1760
+    LUMA_PU(32, 16);
1761
+    LUMA_PU(16, 32);
1762
+    LUMA_PU(32, 24);
1763
+    LUMA_PU(24, 32);
1764
+    LUMA_PU(32,  8);
1765
+    LUMA_PU_S(8, 32);
1766
+    LUMA_PU(64, 32);
1767
+    LUMA_PU(32, 64);
1768
+    LUMA_PU(64, 48);
1769
+    LUMA_PU(48, 64);
1770
+    LUMA_PU(64, 16);
1771
+    LUMA_PU(16, 64);
1772
+    
1773
+#if defined(__APPLE__)
1774
+    p.puLUMA_4x4.sad = sad_pp_neon<4, 4>;
1775
+    p.puLUMA_4x8.sad = sad_pp_neon<4, 8>;
1776
+    p.puLUMA_4x16.sad = sad_pp_neon<4, 16>;
1777
+#endif // defined(__APPLE__)
1778
+    p.puLUMA_8x4.sad = sad_pp_neon<8, 4>;
1779
+    p.puLUMA_8x8.sad = sad_pp_neon<8, 8>;
1780
+    p.puLUMA_8x16.sad = sad_pp_neon<8, 16>;
1781
+    p.puLUMA_8x32.sad = sad_pp_neon<8, 32>;
1782
+
1783
+#if !(HIGH_BIT_DEPTH)
1784
+    p.puLUMA_4x4.sad_x3 = sad_x3_neon<4, 4>;
1785
+    p.puLUMA_4x4.sad_x4 = sad_x4_neon<4, 4>;
1786
+    p.puLUMA_4x8.sad_x3 = sad_x3_neon<4, 8>;
1787
+    p.puLUMA_4x8.sad_x4 = sad_x4_neon<4, 8>;
1788
+    p.puLUMA_4x16.sad_x3 = sad_x3_neon<4, 16>;
1789
+    p.puLUMA_4x16.sad_x4 = sad_x4_neon<4, 16>;
1790
+#endif // !(HIGH_BIT_DEPTH)
1791
+
1792
+    p.puLUMA_4x4.satd   = pixel_satd_4x4_neon;
1793
+    p.puLUMA_8x4.satd   = pixel_satd_8x4_neon;
1794
+    
1795
+    p.puLUMA_8x8.satd   = satd8_neon<8, 8>;
1796
+    p.puLUMA_16x16.satd = satd8_neon<16, 16>;
1797
+    p.puLUMA_16x8.satd  = satd8_neon<16, 8>;
1798
+    p.puLUMA_8x16.satd  = satd8_neon<8, 16>;
1799
+    p.puLUMA_16x12.satd = satd8_neon<16, 12>;
1800
+    p.puLUMA_16x4.satd  = satd8_neon<16, 4>;
1801
+    p.puLUMA_32x32.satd = satd8_neon<32, 32>;
1802
+    p.puLUMA_32x16.satd = satd8_neon<32, 16>;
1803
+    p.puLUMA_16x32.satd = satd8_neon<16, 32>;
1804
+    p.puLUMA_32x24.satd = satd8_neon<32, 24>;
1805
+    p.puLUMA_24x32.satd = satd8_neon<24, 32>;
1806
+    p.puLUMA_32x8.satd  = satd8_neon<32, 8>;
1807
+    p.puLUMA_8x32.satd  = satd8_neon<8, 32>;
1808
+    p.puLUMA_64x64.satd = satd8_neon<64, 64>;
1809
+    p.puLUMA_64x32.satd = satd8_neon<64, 32>;
1810
+    p.puLUMA_32x64.satd = satd8_neon<32, 64>;
1811
+    p.puLUMA_64x48.satd = satd8_neon<64, 48>;
1812
+    p.puLUMA_48x64.satd = satd8_neon<48, 64>;
1813
+    p.puLUMA_64x16.satd = satd8_neon<64, 16>;
1814
+    p.puLUMA_16x64.satd = satd8_neon<16, 64>;
1815
+
1816
+#if HIGH_BIT_DEPTH
1817
+    p.puLUMA_4x8.satd   = satd4_neon<4, 8>;
1818
+    p.puLUMA_4x16.satd  = satd4_neon<4, 16>;
1819
+#endif // HIGH_BIT_DEPTH
1820
+
1821
+#if !defined(__APPLE__) || HIGH_BIT_DEPTH
1822
+    p.puLUMA_12x16.satd = satd4_neon<12, 16>;
1823
+#endif // !defined(__APPLE__)
1824
+
1825
+
1826
+    LUMA_CU(4, 4);
1827
+    LUMA_CU(8, 8);
1828
+    LUMA_CU(16, 16);
1829
+    LUMA_CU(32, 32);
1830
+    LUMA_CU(64, 64);
1831
+    
1832
+#if !(HIGH_BIT_DEPTH)
1833
+    p.cuBLOCK_8x8.var   = pixel_var_neon<8>;
1834
+    p.cuBLOCK_16x16.var = pixel_var_neon<16>;
1835
+#if defined(__APPLE__)
1836
+    p.cuBLOCK_32x32.var   = pixel_var_neon<32>;
1837
+    p.cuBLOCK_64x64.var = pixel_var_neon<64>;
1838
+#endif // defined(__APPLE__)
1839
+#endif // !(HIGH_BIT_DEPTH)
1840
+
1841
+    p.cuBLOCK_16x16.blockfill_sNONALIGNED = blockfill_s_neon<16>; 
1842
+    p.cuBLOCK_16x16.blockfill_sALIGNED    = blockfill_s_neon<16>;
1843
+    p.cuBLOCK_32x32.blockfill_sNONALIGNED = blockfill_s_neon<32>; 
1844
+    p.cuBLOCK_32x32.blockfill_sALIGNED    = blockfill_s_neon<32>;
1845
+    p.cuBLOCK_64x64.blockfill_sNONALIGNED = blockfill_s_neon<64>; 
1846
+    p.cuBLOCK_64x64.blockfill_sALIGNED    = blockfill_s_neon<64>;
1847
+
1848
+
1849
+    p.cuBLOCK_4x4.calcresidualNONALIGNED    = getResidual_neon<4>;
1850
+    p.cuBLOCK_4x4.calcresidualALIGNED       = getResidual_neon<4>;
1851
+    p.cuBLOCK_8x8.calcresidualNONALIGNED    = getResidual_neon<8>;
1852
+    p.cuBLOCK_8x8.calcresidualALIGNED       = getResidual_neon<8>;
1853
+    p.cuBLOCK_16x16.calcresidualNONALIGNED  = getResidual_neon<16>;
1854
+    p.cuBLOCK_16x16.calcresidualALIGNED     = getResidual_neon<16>;
1855
+    
1856
+#if defined(__APPLE__)
1857
+    p.cuBLOCK_32x32.calcresidualNONALIGNED  = getResidual_neon<32>;
1858
+    p.cuBLOCK_32x32.calcresidualALIGNED     = getResidual_neon<32>;
1859
+#endif // defined(__APPLE__)
1860
+
1861
+    p.cuBLOCK_4x4.sa8d   = pixel_satd_4x4_neon;
1862
+    p.cuBLOCK_8x8.sa8d   = pixel_sa8d_8x8_neon;
1863
+    p.cuBLOCK_16x16.sa8d = pixel_sa8d_16x16_neon;
1864
+    p.cuBLOCK_32x32.sa8d = sa8d16<32, 32>;
1865
+    p.cuBLOCK_64x64.sa8d = sa8d16<64, 64>;
1866
+
1867
+
1868
+#define CHROMA_PU_420(W, H) \
1869
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.addAvgNONALIGNED  = addAvg_neon<W, H>;         \
1870
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.addAvgALIGNED  = addAvg_neon<W, H>;         \
1871
+    p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1872
+
1873
+
1874
+    CHROMA_PU_420(4, 4);
1875
+    CHROMA_PU_420(8, 8);
1876
+    CHROMA_PU_420(16, 16);
1877
+    CHROMA_PU_420(32, 32);
1878
+    CHROMA_PU_420(4, 2);
1879
+    CHROMA_PU_420(8, 4);
1880
+    CHROMA_PU_420(4, 8);
1881
+    CHROMA_PU_420(8, 6);
1882
+    CHROMA_PU_420(6, 8);
1883
+    CHROMA_PU_420(8, 2);
1884
+    CHROMA_PU_420(2, 8);
1885
+    CHROMA_PU_420(16, 8);
1886
+    CHROMA_PU_420(8,  16);
1887
+    CHROMA_PU_420(16, 12);
1888
+    CHROMA_PU_420(12, 16);
1889
+    CHROMA_PU_420(16, 4);
1890
+    CHROMA_PU_420(4,  16);
1891
+    CHROMA_PU_420(32, 16);
1892
+    CHROMA_PU_420(16, 32);
1893
+    CHROMA_PU_420(32, 24);
1894
+    CHROMA_PU_420(24, 32);
1895
+    CHROMA_PU_420(32, 8);
1896
+    CHROMA_PU_420(8,  32);
1897
+
1898
+
1899
+
1900
+    p.chromaX265_CSP_I420.puCHROMA_420_2x2.satd   = NULL;
1901
+    p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd   = pixel_satd_4x4_neon;
1902
+    p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd   = satd8_neon<8, 8>;
1903
+    p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = satd8_neon<16, 16>;
1904
+    p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = satd8_neon<32, 32>;
1905
+
1906
+    p.chromaX265_CSP_I420.puCHROMA_420_4x2.satd   = NULL;
1907
+    p.chromaX265_CSP_I420.puCHROMA_420_2x4.satd   = NULL;
1908
+    p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd   = pixel_satd_8x4_neon;
1909
+    p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd  = satd8_neon<16, 8>;
1910
+    p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd  = satd8_neon<8, 16>;
1911
+    p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = satd8_neon<32, 16>;
1912
+    p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = satd8_neon<16, 32>;
1913
+
1914
+    p.chromaX265_CSP_I420.puCHROMA_420_8x6.satd   = NULL;
1915
+    p.chromaX265_CSP_I420.puCHROMA_420_6x8.satd   = NULL;
1916
+    p.chromaX265_CSP_I420.puCHROMA_420_8x2.satd   = NULL;
1917
+    p.chromaX265_CSP_I420.puCHROMA_420_2x8.satd   = NULL;
1918
+    p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = satd4_neon<16, 12>;
1919
+    p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd  = satd4_neon<16, 4>;
1920
+    p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = satd8_neon<32, 24>;
1921
+    p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = satd8_neon<24, 32>;
1922
+    p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd  = satd8_neon<32, 8>;
1923
+    p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd  = satd8_neon<8, 32>;
1924
+    
1925
+#if HIGH_BIT_DEPTH
1926
+    p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd   = satd4_neon<4, 8>;
1927
+    p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd  = satd4_neon<4, 16>;
1928
+#endif // HIGH_BIT_DEPTH
1929
+
1930
+#if !defined(__APPLE__) || HIGH_BIT_DEPTH
1931
+    p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = satd4_neon<12, 16>;
1932
+#endif // !defined(__APPLE__)
1933
+
1934
+
1935
+#define CHROMA_CU_420(W, H) \
1936
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sse_pp  = sse_neon<W, H, pixel, pixel>; \
1937
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1938
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \
1939
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>;  \
1940
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \
1941
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>;
1942
+    
1943
+#define CHROMA_CU_S_420(W, H) \
1944
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1945
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \
1946
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>;  \
1947
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \
1948
+    p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>;
1949
+
1950
+
1951
+    CHROMA_CU_S_420(4, 4)
1952
+    CHROMA_CU_420(8, 8)
1953
+    CHROMA_CU_420(16, 16)
1954
+    CHROMA_CU_420(32, 32)
1955
+
1956
+
1957
+    p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d   = p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd;
1958
+    p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = sa8d8<8, 8>;
1959
+    p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = sa8d16<16, 16>;
1960
+    p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = sa8d16<32, 32>;
1961
+
1962
+
1963
+#define CHROMA_PU_422(W, H) \
1964
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.addAvgNONALIGNED  = addAvg_neon<W, H>;         \
1965
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.addAvgALIGNED  = addAvg_neon<W, H>;         \
1966
+    p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
1967
+
1968
+
1969
+    CHROMA_PU_422(4, 8);
1970
+    CHROMA_PU_422(8, 16);
1971
+    CHROMA_PU_422(16, 32);
1972
+    CHROMA_PU_422(32, 64);
1973
+    CHROMA_PU_422(4, 4);
1974
+    CHROMA_PU_422(2, 8);
1975
+    CHROMA_PU_422(8, 8);
1976
+    CHROMA_PU_422(4, 16);
1977
+    CHROMA_PU_422(8, 12);
1978
+    CHROMA_PU_422(6, 16);
1979
+    CHROMA_PU_422(8, 4);
1980
+    CHROMA_PU_422(2, 16);
1981
+    CHROMA_PU_422(16, 16);
1982
+    CHROMA_PU_422(8, 32);
1983
+    CHROMA_PU_422(16, 24);
1984
+    CHROMA_PU_422(12, 32);
1985
+    CHROMA_PU_422(16, 8);
1986
+    CHROMA_PU_422(4,  32);
1987
+    CHROMA_PU_422(32, 32);
1988
+    CHROMA_PU_422(16, 64);
1989
+    CHROMA_PU_422(32, 48);
1990
+    CHROMA_PU_422(24, 64);
1991
+    CHROMA_PU_422(32, 16);
1992
+    CHROMA_PU_422(8,  64);
1993
+
1994
+
1995
+    p.chromaX265_CSP_I422.puCHROMA_422_2x4.satd   = NULL;
1996
+    p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd  = satd8_neon<8, 16>;
1997
+    p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = satd8_neon<16, 32>;
1998
+    p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = satd8_neon<32, 64>;
1999
+    p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd   = pixel_satd_4x4_neon;
2000
+    p.chromaX265_CSP_I422.puCHROMA_422_2x8.satd   = NULL;
2001
+    p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd   = satd8_neon<8, 8>;
2002
+    p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = satd8_neon<16, 16>;
2003
+    p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd  = satd8_neon<8, 32>;
2004
+    p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = satd8_neon<32, 32>;
2005
+    p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = satd8_neon<16, 64>;
2006
+    p.chromaX265_CSP_I422.puCHROMA_422_6x16.satd  = NULL;
2007
+    p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd   = satd4_neon<8, 4>;
2008
+    p.chromaX265_CSP_I422.puCHROMA_422_2x16.satd  = NULL;
2009
+    p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd  = satd8_neon<16, 8>;
2010
+    p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = satd8_neon<32, 16>;
2011
+    
2012
+    p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd  = satd4_neon<8, 12>;
2013
+    p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd  = satd8_neon<8, 64>;
2014
+    p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = satd4_neon<12, 32>;
2015
+    p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = satd8_neon<16, 24>;
2016
+    p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = satd8_neon<24, 64>;
2017
+    p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = satd8_neon<32, 48>;
2018
+
2019
+#if HIGH_BIT_DEPTH
2020
+    p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd   = satd4_neon<4, 8>;
2021
+    p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd  = satd4_neon<4, 16>;
2022
+    p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd  = satd4_neon<4, 32>;
2023
+#endif // HIGH_BIT_DEPTH
2024
+
2025
+
2026
+#define CHROMA_CU_422(W, H) \
2027
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.sse_pp  = sse_neon<W, H, pixel, pixel>;  \
2028
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
2029
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \
2030
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>; \
2031
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \
2032
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>;
2033
+
2034
+#define CHROMA_CU_S_422(W, H) \
2035
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \
2036
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \
2037
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>; \
2038
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \
2039
+    p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>;
2040
+    
2041
+    
2042
+    CHROMA_CU_S_422(4, 8)
2043
+    CHROMA_CU_422(8, 16)
2044
+    CHROMA_CU_422(16, 32)
2045
+    CHROMA_CU_422(32, 64)
2046
+
2047
+    p.chromaX265_CSP_I422.cuBLOCK_8x8.sa8d   = p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd;
2048
+    p.chromaX265_CSP_I422.cuBLOCK_16x16.sa8d = sa8d8<8, 16>;
2049
+    p.chromaX265_CSP_I422.cuBLOCK_32x32.sa8d = sa8d16<16, 32>;
2050
+    p.chromaX265_CSP_I422.cuBLOCK_64x64.sa8d = sa8d16<32, 64>;
2051
+
2052
+
2053
+}
2054
+
2055
+
2056
+}
2057
+
2058
+
2059
+#endif
2060
+
2061
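Once setupPixelPrimitives_neon() has populated the function table, these routines are reached through the same EncoderPrimitives entries as the C and x86 versions. A hypothetical call-site sketch (the buffer names and reference stride are placeholders, not taken from this patch):

    // Illustrative only: score a 16x16 luma PU through the primitive table.
    static int lumaSad16x16(const X265_NS::EncoderPrimitives &p,
                            const pixel *fenc, const pixel *fref, intptr_t refStride)
    {
        return p.pu[LUMA_16x16].sad(fenc, FENC_STRIDE, fref, refStride);
    }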
x265_3.6.tar.gz/source/common/aarch64/pixel-prim.h Added
25
 
1
@@ -0,0 +1,23 @@
2
+#ifndef PIXEL_PRIM_NEON_H__
3
+#define PIXEL_PRIM_NEON_H__
4
+
5
+#include "common.h"
6
+#include "slicetype.h"      // LOWRES_COST_MASK
7
+#include "primitives.h"
8
+#include "x265.h"
9
+
10
+
11
+
12
+namespace X265_NS
13
+{
14
+
15
+
16
+
17
+void setupPixelPrimitives_neon(EncoderPrimitives &p);
18
+
19
+
20
+}
21
+
22
+
23
+#endif
24
+
25
x265_3.6.tar.gz/source/common/aarch64/pixel-util-common.S Added
86
 
1
@@ -0,0 +1,84 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using NEON instruction set
26
+// that are also used by the SVE2 functions
27
+
28
+.arch           armv8-a
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.macro pixel_var_start
39
+    movi            v0.16b, #0
40
+    movi            v1.16b, #0
41
+    movi            v2.16b, #0
42
+    movi            v3.16b, #0
43
+.endm
44
+
45
+.macro pixel_var_1 v
46
+    uaddw           v0.8h, v0.8h, \v\().8b
47
+    umull           v30.8h, \v\().8b, \v\().8b
48
+    uaddw2          v1.8h, v1.8h, \v\().16b
49
+    umull2          v31.8h, \v\().16b, \v\().16b
50
+    uadalp          v2.4s, v30.8h
51
+    uadalp          v3.4s, v31.8h
52
+.endm
53
+
54
+.macro pixel_var_end
55
+    uaddlv          s0, v0.8h
56
+    uaddlv          s1, v1.8h
57
+    add             v2.4s, v2.4s, v3.4s
58
+    fadd            s0, s0, s1
59
+    uaddlv          d2, v2.4s
60
+    fmov            w0, s0
61
+    fmov            x2, d2
62
+    orr             x0, x0, x2, lsl #32
63
+.endm
64
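The pixel_var_* macros above keep a running pixel sum in v0/v1 and a running sum of squares in v2/v3, then pack both into one 64-bit return value. A scalar C++ sketch of the same contract (the _sketch name is illustrative, not x265's):

    #include <cstdint>
    #include <cstddef>

    // Low 32 bits: sum of pixels; high 32 bits: sum of squared pixels,
    // matching the final "orr x0, x0, x2, lsl #32" in pixel_var_end.
    static uint64_t pixel_var_sketch(const uint8_t* pix, ptrdiff_t stride, int size)
    {
        uint32_t sum = 0, sqr = 0;
        for (int y = 0; y < size; y++, pix += stride)
            for (int x = 0; x < size; x++)
            {
                sum += pix[x];
                sqr += (uint32_t)pix[x] * pix[x];
            }
        return sum | ((uint64_t)sqr << 32);
    }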
+
65
+.macro ssimDist_start
66
+    movi            v0.16b, #0
67
+    movi            v1.16b, #0
68
+.endm
69
+
70
+.macro ssimDist_end
71
+    uaddlv          d0, v0.4s
72
+    uaddlv          d1, v1.4s
73
+    str             d0, [x6]
74
+    str             d1, [x4]
75
+.endm
76
+
77
+.macro normFact_start
78
+    movi            v0.16b, #0
79
+.endm
80
+
81
+.macro normFact_end
82
+    uaddlv          d0, v0.4s
83
+    str             d0, [x3]
84
+.endm
85
+
86
x265_3.6.tar.gz/source/common/aarch64/pixel-util-sve.S Added
375
 
1
@@ -0,0 +1,373 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "pixel-util-common.S"
27
+
28
+.arch armv8-a+sve
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+function PFX(pixel_sub_ps_8x16_sve)
41
+    lsl             x1, x1, #1
42
+    ptrue           p0.h, vl8
43
+.rept 8
44
+    ld1b            {z0.h}, p0/z, [x2]
45
+    ld1b            {z1.h}, p0/z, [x3]
46
+    add             x2, x2, x4
47
+    add             x3, x3, x5
48
+    ld1b            {z2.h}, p0/z, [x2]
49
+    ld1b            {z3.h}, p0/z, [x3]
50
+    add             x2, x2, x4
51
+    add             x3, x3, x5
52
+    sub             z4.h, z0.h, z1.h
53
+    sub             z5.h, z2.h, z3.h
54
+    st1             {v4.8h}, [x0], x1
55
+    st1             {v5.8h}, [x0], x1
56
+.endr
57
+    ret
58
+endfunc
59
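pixel_sub_ps produces int16 residuals (org minus pred); the destination stride is doubled up front because it is counted in halfwords. A scalar sketch under the shape of x265's primitive signature (names are illustrative):

    #include <cstdint>
    #include <cstddef>

    static void pixel_sub_ps_sketch(int16_t* dst, ptrdiff_t dstStride,
                                    const uint8_t* org, const uint8_t* pred,
                                    ptrdiff_t orgStride, ptrdiff_t predStride,
                                    int w, int h)       // 8x16 in the function above
    {
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
                dst[x] = (int16_t)(org[x] - pred[x]);   // widen; no clipping needed
            dst += dstStride;
            org += orgStride;
            pred += predStride;
        }
    }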
+
60
+//******* satd *******
61
+.macro satd_4x4_sve
62
+    ld1b            {z0.h}, p0/z, [x0]
63
+    ld1b            {z2.h}, p0/z, [x2]
64
+    add             x0, x0, x1
65
+    add             x2, x2, x3
66
+    ld1b            {z1.h}, p0/z, [x0]
67
+    ld1b            {z3.h}, p0/z, [x2]
68
+    add             x0, x0, x1
69
+    add             x2, x2, x3
70
+    ld1b            {z4.h}, p0/z, [x0]
71
+    ld1b            {z6.h}, p0/z, [x2]
72
+    add             x0, x0, x1
73
+    add             x2, x2, x3
74
+    ld1b            {z5.h}, p0/z, [x0]
75
+    ld1b            {z7.h}, p0/z, [x2]
76
+    add             x0, x0, x1
77
+    add             x2, x2, x3
78
+
79
+    sub             z0.h, z0.h, z2.h
80
+    sub             z1.h, z1.h, z3.h
81
+    sub             z2.h, z4.h, z6.h
82
+    sub             z3.h, z5.h, z7.h
83
+
84
+    add             z4.h, z0.h, z2.h
85
+    add             z5.h, z1.h, z3.h
86
+    sub             z6.h, z0.h, z2.h
87
+    sub             z7.h, z1.h, z3.h
88
+
89
+    add             z0.h, z4.h, z5.h
90
+    sub             z1.h, z4.h, z5.h
91
+
92
+    add             z2.h, z6.h, z7.h
93
+    sub             z3.h, z6.h, z7.h
94
+
95
+    trn1            z4.h, z0.h, z2.h
96
+    trn2            z5.h, z0.h, z2.h
97
+
98
+    trn1            z6.h, z1.h, z3.h
99
+    trn2            z7.h, z1.h, z3.h
100
+
101
+    add             z0.h, z4.h, z5.h
102
+    sub             z1.h, z4.h, z5.h
103
+
104
+    add             z2.h, z6.h, z7.h
105
+    sub             z3.h, z6.h, z7.h
106
+
107
+    trn1            z4.s, z0.s, z1.s
108
+    trn2            z5.s, z0.s, z1.s
109
+
110
+    trn1            z6.s, z2.s, z3.s
111
+    trn2            z7.s, z2.s, z3.s
112
+
113
+    abs             z4.h, p0/m, z4.h
114
+    abs             z5.h, p0/m, z5.h
115
+    abs             z6.h, p0/m, z6.h
116
+    abs             z7.h, p0/m, z7.h
117
+
118
+    smax            z4.h, p0/m, z4.h, z5.h
119
+    smax            z6.h, p0/m, z6.h, z7.h
120
+
121
+    add             z0.h, z4.h, z6.h
122
+
123
+    uaddlp          v0.2s, v0.4h
124
+    uaddlp          v0.1d, v0.2s
125
+.endm
126
+
127
+// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
128
+function PFX(pixel_satd_4x4_sve)
129
+    ptrue           p0.h, vl4
130
+    satd_4x4_sve
131
+    fmov            x0, d0
132
+    ret
133
+endfunc
134
+
135
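For orientation, a plain scalar version of the 4x4 SATD computed above: difference, 2-D 4-point Hadamard, sum of absolute values, halved. x265's actual C reference packs two sums per machine word, so treat this as a simplified sketch:

    #include <cstdint>
    #include <cstddef>
    #include <cstdlib>

    static int satd4x4_sketch(const uint8_t* p1, ptrdiff_t s1,
                              const uint8_t* p2, ptrdiff_t s2)
    {
        int m[4][4];
        for (int i = 0; i < 4; i++)            // residual block
            for (int j = 0; j < 4; j++)
                m[i][j] = p1[i * s1 + j] - p2[i * s2 + j];
        for (int i = 0; i < 4; i++)            // horizontal Hadamard pass
        {
            int a = m[i][0] + m[i][1], b = m[i][0] - m[i][1];
            int c = m[i][2] + m[i][3], d = m[i][2] - m[i][3];
            m[i][0] = a + c; m[i][1] = b + d;
            m[i][2] = a - c; m[i][3] = b - d;
        }
        int sum = 0;
        for (int j = 0; j < 4; j++)            // vertical pass, accumulate |coeff|
        {
            int a = m[0][j] + m[1][j], b = m[0][j] - m[1][j];
            int c = m[2][j] + m[3][j], d = m[2][j] - m[3][j];
            sum += abs(a + c) + abs(b + d) + abs(a - c) + abs(b - d);
        }
        return sum >> 1;                       // conventional SATD normalization
    }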
+function PFX(pixel_satd_8x4_sve)
136
+    ptrue           p0.h, vl4
137
+    mov             x4, x0
138
+    mov             x5, x2
139
+    satd_4x4_sve
140
+    add             x0, x4, #4
141
+    add             x2, x5, #4
142
+    umov            x6, v0.d[0]
143
+    satd_4x4_sve
144
+    umov            x0, v0.d[0]
145
+    add             x0, x0, x6
146
+    ret
147
+endfunc
148
+
149
+function PFX(pixel_satd_8x12_sve)
150
+    ptrue           p0.h, vl4
151
+    mov             x4, x0
152
+    mov             x5, x2
153
+    mov             x7, #0
154
+    satd_4x4_sve
155
+    umov            x6, v0.d[0]
156
+    add             x7, x7, x6
157
+    add             x0, x4, #4
158
+    add             x2, x5, #4
159
+    satd_4x4_sve
160
+    umov            x6, v0.d[0]
161
+    add             x7, x7, x6
162
+.rept 2
163
+    sub             x0, x0, #4
164
+    sub             x2, x2, #4
165
+    mov             x4, x0
166
+    mov             x5, x2
167
+    satd_4x4_sve
168
+    umov            x6, v0.d[0]
169
+    add             x7, x7, x6
170
+    add             x0, x4, #4
171
+    add             x2, x5, #4
172
+    satd_4x4_sve
173
+    umov            x6, v0.d[0]
174
+    add             x7, x7, x6
175
+.endr
176
+    mov             x0, x7
177
+    ret
178
+endfunc
179
+
180
+.macro LOAD_DIFF_16x4_sve v0 v1 v2 v3 v4 v5 v6 v7
181
+    mov             x11, #8 // in order to consider CPUs whose vector size is greater than 128 bits
182
+    ld1b            {z0.h}, p0/z, [x0]
183
+    ld1b            {z1.h}, p0/z, [x0, x11]
184
+    ld1b            {z2.h}, p0/z, [x2]
185
+    ld1b            {z3.h}, p0/z, [x2, x11]
186
+    add             x0, x0, x1
187
+    add             x2, x2, x3
188
+    ld1b            {z4.h}, p0/z, [x0]
189
+    ld1b            {z5.h}, p0/z, [x0, x11]
190
+    ld1b            {z6.h}, p0/z, [x2]
191
+    ld1b            {z7.h}, p0/z, [x2, x11]
192
+    add             x0, x0, x1
193
+    add             x2, x2, x3
194
+    ld1b            {z29.h}, p0/z, [x0]
195
+    ld1b            {z9.h}, p0/z, [x0, x11]
196
+    ld1b            {z10.h}, p0/z, [x2]
197
+    ld1b            {z11.h}, p0/z, [x2, x11]
198
+    add             x0, x0, x1
199
+    add             x2, x2, x3
200
+    ld1b            {z12.h}, p0/z, [x0]
201
+    ld1b            {z13.h}, p0/z, [x0, x11]
202
+    ld1b            {z14.h}, p0/z, [x2]
203
+    ld1b            {z15.h}, p0/z, [x2, x11]
204
+    add             x0, x0, x1
205
+    add             x2, x2, x3
206
+
207
+    sub             \v0\().h, z0.h, z2.h
208
+    sub             \v4\().h, z1.h, z3.h
209
+    sub             \v1\().h, z4.h, z6.h
210
+    sub             \v5\().h, z5.h, z7.h
211
+    sub             \v2\().h, z29.h, z10.h
212
+    sub             \v6\().h, z9.h, z11.h
213
+    sub             \v3\().h, z12.h, z14.h
214
+    sub             \v7\().h, z13.h, z15.h
215
+.endm
216
+
217
+// one vertical hadamard pass and two horizontal
218
+function PFX(satd_8x4v_8x8h_sve), export=0
219
+    HADAMARD4_V     z16.h, z18.h, z17.h, z19.h, z0.h, z2.h, z1.h, z3.h
220
+    HADAMARD4_V     z20.h, z21.h, z22.h, z23.h, z0.h, z1.h, z2.h, z3.h
221
+    trn4            z0.h, z1.h, z2.h, z3.h, z16.h, z17.h, z18.h, z19.h
222
+    trn4            z4.h, z5.h, z6.h, z7.h, z20.h, z21.h, z22.h, z23.h
223
+    SUMSUB_ABCD     z16.h, z17.h, z18.h, z19.h, z0.h, z1.h, z2.h, z3.h
224
+    SUMSUB_ABCD     z20.h, z21.h, z22.h, z23.h, z4.h, z5.h, z6.h, z7.h
225
+    trn4            z0.s, z2.s, z1.s, z3.s, z16.s, z18.s, z17.s, z19.s
226
+    trn4            z4.s, z6.s, z5.s, z7.s, z20.s, z22.s, z21.s, z23.s
227
+    ABS8_SVE        z0.h, z1.h, z2.h, z3.h, z4.h, z5.h, z6.h, z7.h, p0
228
+    smax            z0.h, p0/m, z0.h, z2.h
229
+    smax            z1.h, p0/m, z1.h, z3.h
230
+    smax            z4.h, p0/m, z4.h, z6.h
231
+    smax            z5.h, p0/m, z5.h, z7.h
232
+    ret
233
+endfunc
234
+
235
+function PFX(satd_16x4_sve), export=0
236
+    LOAD_DIFF_16x4_sve  z16, z17, z18, z19, z20, z21, z22, z23
237
+    b                    PFX(satd_8x4v_8x8h_sve)
238
+endfunc
239
+
240
+.macro pixel_satd_32x8_sve
241
+    mov             x4, x0
242
+    mov             x5, x2
243
+.rept 2
244
+    bl              PFX(satd_16x4_sve)
245
+    add             z30.h, z30.h, z0.h
246
+    add             z31.h, z31.h, z1.h
247
+    add             z30.h, z30.h, z4.h
248
+    add             z31.h, z31.h, z5.h
249
+.endr
250
+    add             x0, x4, #16
251
+    add             x2, x5, #16
252
+.rept 2
253
+    bl              PFX(satd_16x4_sve)
254
+    add             z30.h, z30.h, z0.h
255
+    add             z31.h, z31.h, z1.h
256
+    add             z30.h, z30.h, z4.h
257
+    add             z31.h, z31.h, z5.h
258
+.endr
259
+.endm
260
+
261
+.macro satd_32x16_sve
262
+    movi            v30.2d, #0
263
+    movi            v31.2d, #0
264
+    pixel_satd_32x8_sve
265
+    sub             x0, x0, #16
266
+    sub             x2, x2, #16
267
+    pixel_satd_32x8_sve
268
+    add             z0.h, z30.h, z31.h
269
+    uaddlv          s0, v0.8h
270
+    mov             w6, v0.s[0]
271
+.endm
272
+
273
+function PFX(pixel_satd_32x16_sve)
274
+    ptrue           p0.h, vl8
275
+    mov             x10, x30
276
+    satd_32x16_sve
277
+    mov             x0, x6
278
+    ret             x10
279
+endfunc
280
+
281
+function PFX(pixel_satd_32x32_sve)
282
+    ptrue           p0.h, vl8
283
+    mov             x10, x30
284
+    mov             x7, #0
285
+    satd_32x16_sve
286
+    sub             x0, x0, #16
287
+    sub             x2, x2, #16
288
+    add             x7, x7, x6
289
+    satd_32x16_sve
290
+    add             x0, x7, x6
291
+    ret             x10
292
+endfunc
293
+
294
+.macro satd_64x16_sve
295
+    mov             x8, x0
296
+    mov             x9, x2
297
+    satd_32x16_sve
298
+    add             x7, x7, x6
299
+    add             x0, x8, #32
300
+    add             x2, x9, #32
301
+    satd_32x16_sve
302
+    add             x7, x7, x6
303
+.endm
304
+
305
+function PFX(pixel_satd_64x48_sve)
306
+    ptrue           p0.h, vl8
307
+    mov             x10, x30
308
+    mov             x7, #0
309
+.rept 2
310
+    satd_64x16_sve
311
+    sub             x0, x0, #48
312
+    sub             x2, x2, #48
313
+.endr
314
+    satd_64x16_sve
315
+    mov             x0, x7
316
+    ret             x10
317
+endfunc
318
+
319
+/********* quant ***********/
320
+// uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
321
+// No need to fully use sve instructions for this function
322
+function PFX(quant_sve)
323
+    mov             w9, #1
324
+    lsl             w9, w9, w4
325
+    mov             z0.s, w9
326
+    neg             w9, w4
327
+    mov             z1.s, w9
328
+    add             w9, w9, #8
329
+    mov             z2.s, w9
330
+    mov             z3.s, w5
331
+
332
+    lsr             w6, w6, #2
333
+    eor             z4.d, z4.d, z4.d
334
+    eor             w10, w10, w10
335
+    eor             z17.d, z17.d, z17.d
336
+
337
+.loop_quant_sve:
338
+    ld1             {v18.4h}, [x0], #8
339
+    ld1             {v7.4s}, [x1], #16
340
+    sxtl            v6.4s, v18.4h
341
+
342
+    cmlt            v5.4s, v6.4s, #0
343
+
344
+    abs             v6.4s, v6.4s
345
+
346
+
347
+    mul             v6.4s, v6.4s, v7.4s
348
+
349
+    add             v7.4s, v6.4s, v3.4s
350
+    sshl            v7.4s, v7.4s, v1.4s
351
+
352
+    mls             v6.4s, v7.4s, v0.s[0]
353
+    sshl            v16.4s, v6.4s, v2.4s
354
+    st1             {v16.4s}, [x2], #16
355
+
356
+    // numsig
357
+    cmeq            v16.4s, v7.4s, v17.4s
358
+    add             v4.4s, v4.4s, v16.4s
359
+    add             w10, w10, #4
360
+
361
+    // level *= sign
362
+    eor             z16.d, z7.d, z5.d
363
+    sub             v16.4s, v16.4s, v5.4s
364
+    sqxtn           v5.4h, v16.4s
365
+    st1             {v5.4h}, [x3], #8
366
+
367
+    subs            w6, w6, #1
368
+    b.ne             .loop_quant_sve
369
+
370
+    addv            s4, v4.4s
371
+    mov             w9, v4.s[0]
372
+    add             w0, w10, w9
373
+    ret
374
+endfunc
375
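The scalar contract behind the quant kernel above, sketched from the signature in its comment (rounding and clipping follow the usual HEVC quantizer; this is an approximation for reading the assembly, not the verbatim x265 reference):

    #include <cstdint>
    #include <cstdlib>

    static uint32_t quant_sketch(const int16_t* coef, const int32_t* quantCoeff,
                                 int32_t* deltaU, int16_t* qCoef,
                                 int qBits, int add, int numCoeff)
    {
        uint32_t numSig = 0;                       // returned: count of nonzero levels
        for (int i = 0; i < numCoeff; i++)
        {
            int sign  = coef[i] < 0 ? -1 : 1;
            int64_t t = (int64_t)abs(coef[i]) * quantCoeff[i];
            int level = (int)((t + add) >> qBits);
            deltaU[i] = (int32_t)((t - ((int64_t)level << qBits)) >> (qBits - 8));
            if (level)
                numSig++;
            level *= sign;
            if (level < -32768) level = -32768;    // clip to int16 (sqxtn in the asm)
            if (level >  32767) level =  32767;
            qCoef[i] = (int16_t)level;
        }
        return numSig;
    }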
x265_3.6.tar.gz/source/common/aarch64/pixel-util-sve2.S Added
1688
 
1
@@ -0,0 +1,1686 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "pixel-util-common.S"
27
+
28
+.arch armv8-a+sve2
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+// uint64_t pixel_var(const pixel* pix, intptr_t i_stride)
41
+function PFX(pixel_var_8x8_sve2)
42
+    ptrue           p0.h, vl8
43
+    ld1b            {z0.h}, p0/z, [x0]
44
+    add             x0, x0, x1
45
+    mul             z31.h, z0.h, z0.h
46
+    uaddlp          v1.4s, v31.8h
47
+.rept 7
48
+    ld1b            {z4.h}, p0/z, [x0]
49
+    add             x0, x0, x1
50
+    add             z0.h, z0.h, z4.h
51
+    mul             z31.h, z4.h, z4.h
52
+    uadalp          z1.s, p0/m, z31.h
53
+.endr
54
+    uaddlv          s0, v0.8h
55
+    uaddlv          d1, v1.4s
56
+    fmov            w0, s0
57
+    fmov            x1, d1
58
+    orr             x0, x0, x1, lsl #32
59
+    ret
60
+endfunc
61
+
62
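Every SVE2 function below opens with the same runtime dispatch: "rdvl x9, #1" reads the vector length in bytes, then execution branches either to a 128-bit (NEON-shaped) path or to the ".vl_gt_16_*" and wider paths. The equivalent probe from C++, using the ACLE intrinsic svcntb() (illustrative; requires compiling for an SVE target):

    #include <arm_sve.h>

    // True when SVE registers are wider than NEON's 128 bits, i.e. the case
    // the ".vl_gt_16_*" labels handle.
    static bool sveWiderThanNeon()
    {
        return svcntb() > 16;   // svcntb() = bytes per vector, like "rdvl #1"
    }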
+function PFX(pixel_var_16x16_sve2)
63
+    rdvl            x9, #1
64
+    cmp             x9, #16
65
+    bgt             .vl_gt_16_pixel_var_16x16
66
+    pixel_var_start
67
+    mov             w12, #16
68
+.loop_var_16_sve2:
69
+    sub             w12, w12, #1
70
+    ld1             {v4.16b}, [x0], x1
71
+    pixel_var_1 v4
72
+    cbnz            w12, .loop_var_16_sve2
73
+    pixel_var_end
74
+    ret
75
+.vl_gt_16_pixel_var_16x16:
76
+    ptrue           p0.h, vl16
77
+    mov             z0.d, #0
+    mov             z1.d, #0                // z1 accumulates the squares below; it must start at zero
78
+.rept 16
79
+    ld1b            {z4.h}, p0/z, [x0]
80
+    add             x0, x0, x1
81
+    add             z0.h, z0.h, z4.h
82
+    mul             z30.h, z4.h, z4.h
83
+    uadalp          z1.s, p0/m, z30.h
84
+.endr
85
+    uaddv           d0, p0, z0.h
86
+    uaddv           d1, p0, z1.s
87
+    fmov            w0, s0
88
+    fmov            x1, d1
89
+    orr             x0, x0, x1, lsl #32
90
+    ret
91
+endfunc
92
+
93
+function PFX(pixel_var_32x32_sve2)
94
+    rdvl            x9, #1
95
+    cmp             x9, #16
96
+    bgt             .vl_gt_16_pixel_var_32x32
97
+    pixel_var_start
98
+    mov             w12, #32
99
+.loop_var_32_sve2:
100
+    sub             w12, w12, #1
101
+    ld1             {v4.16b-v5.16b}, x0, x1
102
+    pixel_var_1 v4
103
+    pixel_var_1 v5
104
+    cbnz            w12, .loop_var_32_sve2
105
+    pixel_var_end
106
+    ret
107
+.vl_gt_16_pixel_var_32x32:
108
+    cmp             x9, #48
109
+    bgt             .vl_gt_48_pixel_var_32x32
110
+    ptrue           p0.b, vl32
111
+    mov             z0.d, #0
112
+    mov             z1.d, #0
113
+.rept 32
114
+    ld1b            {z4.b}, p0/z, x0
115
+    add             x0, x0, x1
116
+    uaddwb          z0.h, z0.h, z4.b
117
+    uaddwt          z0.h, z0.h, z4.b
118
+    umullb          z28.h, z4.b, z4.b
119
+    umullt          z29.h, z4.b, z4.b
120
+    uadalp          z1.s, p0/m, z28.h
121
+    uadalp          z1.s, p0/m, z29.h
122
+.endr
123
+    uaddv           d0, p0, z0.h
124
+    uaddv           d1, p0, z1.s
125
+    fmov            w0, s0
126
+    fmov            x1, d1
127
+    orr             x0, x0, x1, lsl #32
128
+    ret
129
+.vl_gt_48_pixel_var_32x32:
130
+    ptrue           p0.h, vl32
131
+    mov             z0.d, #0
132
+    mov             z1.d, #0
133
+.rept 32
134
+    ld1b            {z4.h}, p0/z, x0
135
+    add             x0, x0, x1
136
+    add             z0.h, z0.h, z4.h
137
+    mul             z28.h, z4.h, z4.h
138
+    uadalp          z1.s, p0/m, z28.h
139
+.endr
140
+    uaddv           d0, p0, z0.h
141
+    uaddv           d1, p0, z1.s
142
+    fmov            w0, s0
143
+    fmov            x1, d1
144
+    orr             x0, x0, x1, lsl #32
145
+    ret
146
+endfunc
147
+
148
+function PFX(pixel_var_64x64_sve2)
149
+    rdvl            x9, #1
150
+    cmp             x9, #16
151
+    bgt             .vl_gt_16_pixel_var_64x64
152
+    pixel_var_start
153
+    mov             w12, #64
154
+.loop_var_64_sve2:
155
+    sub             w12, w12, #1
156
+    ld1             {v4.16b-v7.16b}, x0, x1
157
+    pixel_var_1 v4
158
+    pixel_var_1 v5
159
+    pixel_var_1 v6
160
+    pixel_var_1 v7
161
+    cbnz            w12, .loop_var_64_sve2
162
+    pixel_var_end
163
+    ret
164
+.vl_gt_16_pixel_var_64x64:
165
+    cmp             x9, #48
166
+    bgt             .vl_gt_48_pixel_var_64x64
167
+    ptrue           p0.b, vl32
168
+    mov             z0.d, #0
169
+    mov             z2.d, #0
170
+.rept 64
171
+    ld1b            {z4.b}, p0/z, x0
172
+    ld1b            {z5.b}, p0/z, x0, #1, mul vl
173
+    add             x0, x0, x1
174
+    uaddwb          z0.h, z0.h, z4.b
175
+    uaddwt          z0.h, z0.h, z4.b
176
+    uaddwb          z0.h, z0.h, z5.b
177
+    uaddwt          z0.h, z0.h, z5.b
178
+    umullb          z24.h, z4.b, z4.b
179
+    umullt          z25.h, z4.b, z4.b
180
+    umullb          z26.h, z5.b, z5.b
181
+    umullt          z27.h, z5.b, z5.b
182
+    uadalp          z2.s, p0/m, z24.h
183
+    uadalp          z2.s, p0/m, z25.h
184
+    uadalp          z2.s, p0/m, z26.h
185
+    uadalp          z2.s, p0/m, z27.h
186
+.endr
187
+    uaddv           d0, p0, z0.h
188
+    uaddv           d1, p0, z2.s
189
+    fmov            w0, s0
190
+    fmov            x1, d1
191
+    orr             x0, x0, x1, lsl #32
192
+    ret
193
+.vl_gt_48_pixel_var_64x64:
194
+    cmp             x9, #112
195
+    bgt             .vl_gt_112_pixel_var_64x64
196
+    ptrue           p0.b, vl64
197
+    mov             z0.d, #0
198
+    mov             z2.d, #0                // the loop accumulates squares into z2, not z1
199
+.rept 64
200
+    ld1b            {z4.b}, p0/z, x0
201
+    add             x0, x0, x1
202
+    uaddwb          z0.h, z0.h, z4.b
203
+    uaddwt          z0.h, z0.h, z4.b
204
+    umullb          z24.h, z4.b, z4.b
205
+    umullt          z25.h, z4.b, z4.b
206
+    uadalp          z2.s, p0/m, z24.h
207
+    uadalp          z2.s, p0/m, z25.h
208
+.endr
209
+    uaddv           d0, p0, z0.h
210
+    uaddv           d1, p0, z2.s
211
+    fmov            w0, s0
212
+    fmov            x1, d1
213
+    orr             x0, x0, x1, lsl #32
214
+    ret
215
+.vl_gt_112_pixel_var_64x64:
216
+    ptrue           p0.h, vl64
217
+    mov             z0.d, #0
218
+    mov             z1.d, #0
219
+.rept 64
220
+    ld1b            {z4.h}, p0/z, x0
221
+    add             x0, x0, x1
222
+    add             z0.h, z0.h, z4.h
223
+    mul             z24.h, z4.h, z4.h
224
+    uadalp          z1.s, p0/m, z24.h
225
+.endr
226
+    uaddv           d0, p0, z0.h
227
+    uaddv           d1, p0, z1.s
228
+    fmov            w0, s0
229
+    fmov            x1, d1
230
+    orr             x0, x0, x1, lsl #32
231
+    ret
232
+endfunc
233
+
234
+function PFX(getResidual16_sve2)
235
+    rdvl            x9, #1
236
+    cmp             x9, #16
237
+    bgt             .vl_gt_16_getResidual16
238
+    lsl             x4, x3, #1
239
+.rept 8
240
+    ld1             {v0.16b}, x0, x3
241
+    ld1             {v1.16b}, x1, x3
242
+    ld1             {v2.16b}, x0, x3
243
+    ld1             {v3.16b}, x1, x3
244
+    usubl           v4.8h, v0.8b, v1.8b
245
+    usubl2          v5.8h, v0.16b, v1.16b
246
+    usubl           v6.8h, v2.8b, v3.8b
247
+    usubl2          v7.8h, v2.16b, v3.16b
248
+    st1             {v4.8h-v5.8h}, x2, x4
249
+    st1             {v6.8h-v7.8h}, x2, x4
250
+.endr
251
+    ret
252
+.vl_gt_16_getResidual16:
253
+    ptrue           p0.h, vl16
254
+.rept 16
255
+    ld1b            {z0.h}, p0/z, x0
256
+    ld1b            {z2.h}, p0/z, x1
257
+    add             x0, x0, x3
258
+    add             x1, x1, x3
259
+    sub             z4.h, z0.h, z2.h
260
+    st1h            {z4.h}, p0, x2
261
+    add             x2, x2, x3, lsl #1
262
+.endr
263
+    ret
264
+endfunc
265
+
266
+function PFX(getResidual32_sve2)
267
+    rdvl            x9, #1
268
+    cmp             x9, #16
269
+    bgt             .vl_gt_16_getResidual32
270
+    lsl             x4, x3, #1
271
+    mov             w12, #4
272
+.loop_residual_32:
273
+    sub             w12, w12, #1
274
+.rept 4
275
+    ld1             {v0.16b-v1.16b}, x0, x3
276
+    ld1             {v2.16b-v3.16b}, x1, x3
277
+    ld1             {v4.16b-v5.16b}, x0, x3
278
+    ld1             {v6.16b-v7.16b}, x1, x3
279
+    usubl           v16.8h, v0.8b, v2.8b
280
+    usubl2          v17.8h, v0.16b, v2.16b
281
+    usubl           v18.8h, v1.8b, v3.8b
282
+    usubl2          v19.8h, v1.16b, v3.16b
283
+    usubl           v20.8h, v4.8b, v6.8b
284
+    usubl2          v21.8h, v4.16b, v6.16b
285
+    usubl           v22.8h, v5.8b, v7.8b
286
+    usubl2          v23.8h, v5.16b, v7.16b
287
+    st1             {v16.8h-v19.8h}, x2, x4
288
+    st1             {v20.8h-v23.8h}, x2, x4
289
+.endr
290
+    cbnz            w12, .loop_residual_32
291
+    ret
292
+.vl_gt_16_getResidual32:
293
+    cmp             x9, #48
294
+    bgt             .vl_gt_48_getResidual32
295
+    ptrue           p0.b, vl32
296
+.rept 32
297
+    ld1b            {z0.b}, p0/z, x0
298
+    ld1b            {z2.b}, p0/z, x1
299
+    add             x0, x0, x3
300
+    add             x1, x1, x3
301
+    usublb          z4.h, z0.b, z2.b
302
+    usublt          z5.h, z0.b, z2.b
303
+    st2h            {z4.h, z5.h}, p0, x2
304
+    add             x2, x2, x3, lsl #1
305
+.endr
306
+    ret
307
+.vl_gt_48_getResidual32:
308
+    ptrue           p0.h, vl32
309
+.rept 32
310
+    ld1b            {z0.h}, p0/z, x0
311
+    ld1b            {z4.h}, p0/z, x1
312
+    add             x0, x0, x3
313
+    add             x1, x1, x3
314
+    sub             z8.h, z0.h, z4.h
315
+    st1h            {z8.h}, p0, x2
316
+    add             x2, x2, x3, lsl #1
317
+.endr
318
+    ret
319
+endfunc
320
+
321
+function PFX(pixel_sub_ps_32x32_sve2)
322
+    rdvl            x9, #1
323
+    cmp             x9, #16
324
+    bgt             .vl_gt_16_pixel_sub_ps_32x32
325
+    lsl             x1, x1, #1
326
+    mov             w12, #4
327
+.loop_sub_ps_32_sve2:
328
+    sub             w12, w12, #1
329
+.rept 4
330
+    ld1             {v0.16b-v1.16b}, x2, x4
331
+    ld1             {v2.16b-v3.16b}, x3, x5
332
+    ld1             {v4.16b-v5.16b}, x2, x4
333
+    ld1             {v6.16b-v7.16b}, x3, x5
334
+    usubl           v16.8h, v0.8b, v2.8b
335
+    usubl2          v17.8h, v0.16b, v2.16b
336
+    usubl           v18.8h, v1.8b, v3.8b
337
+    usubl2          v19.8h, v1.16b, v3.16b
338
+    usubl           v20.8h, v4.8b, v6.8b
339
+    usubl2          v21.8h, v4.16b, v6.16b
340
+    usubl           v22.8h, v5.8b, v7.8b
341
+    usubl2          v23.8h, v5.16b, v7.16b
342
+    st1             {v16.8h-v19.8h}, x0, x1
343
+    st1             {v20.8h-v23.8h}, x0, x1
344
+.endr
345
+    cbnz            w12, .loop_sub_ps_32_sve2
346
+    ret
347
+.vl_gt_16_pixel_sub_ps_32x32:
348
+    cmp             x9, #48
349
+    bgt             .vl_gt_48_pixel_sub_ps_32x32
350
+    ptrue           p0.b, vl32
351
+    mov             w12, #8
352
+.vl_gt_16_loop_sub_ps_32_sve2:
353
+    sub             w12, w12, #1
354
+.rept 4
355
+    ld1b            {z0.b}, p0/z, x2
356
+    ld1b            {z2.b}, p0/z, x3
357
+    add             x2, x2, x4
358
+    add             x3, x3, x5
359
+    usublb          z16.h, z0.b, z2.b
360
+    usublt          z17.h, z0.b, z2.b
361
+    st2h            {z16.h, z17.h}, p0, x0
362
+    add             x0, x0, x1, lsl #1
363
+.endr
364
+    cbnz            w12, .vl_gt_16_loop_sub_ps_32_sve2
365
+    ret
366
+.vl_gt_48_pixel_sub_ps_32x32:
367
+    ptrue           p0.h, vl32
368
+    mov             w12, #8
369
+.vl_gt_48_loop_sub_ps_32_sve2:
370
+    sub             w12, w12, #1
371
+.rept 4
372
+    ld1b            {z0.h}, p0/z, x2
373
+    ld1b            {z4.h}, p0/z, x3
374
+    add             x2, x2, x4
375
+    add             x3, x3, x5
376
+    sub             z8.h, z0.h, z4.h
377
+    st1h            {z8.h}, p0, x0
378
+    add             x0, x0, x1, lsl #1
379
+.endr
380
+    cbnz            w12, .vl_gt_48_loop_sub_ps_32_sve2
381
+    ret
382
+endfunc
383
+
384
+function PFX(pixel_sub_ps_64x64_sve2)
385
+    rdvl            x9, #1
386
+    cmp             x9, #16
387
+    bgt             .vl_gt_16_pixel_sub_ps_64x64
388
+    lsl             x1, x1, #1
389
+    sub             x1, x1, #64
390
+    mov             w12, #16
391
+.loop_sub_ps_64_sve2:
392
+    sub             w12, w12, #1
393
+.rept 4
394
+    ld1             {v0.16b-v3.16b}, x2, x4
395
+    ld1             {v4.16b-v7.16b}, x3, x5
396
+    usubl           v16.8h, v0.8b, v4.8b
397
+    usubl2          v17.8h, v0.16b, v4.16b
398
+    usubl           v18.8h, v1.8b, v5.8b
399
+    usubl2          v19.8h, v1.16b, v5.16b
400
+    usubl           v20.8h, v2.8b, v6.8b
401
+    usubl2          v21.8h, v2.16b, v6.16b
402
+    usubl           v22.8h, v3.8b, v7.8b
403
+    usubl2          v23.8h, v3.16b, v7.16b
404
+    st1             {v16.8h-v19.8h}, x0, #64
405
+    st1             {v20.8h-v23.8h}, x0, x1
406
+.endr
407
+    cbnz            w12, .loop_sub_ps_64_sve2
408
+    ret
409
+.vl_gt_16_pixel_sub_ps_64x64:
410
+    cmp             x9, #48
411
+    bgt             .vl_gt_48_pixel_sub_ps_64x64
412
413
+    ptrue           p0.b, vl32
414
+    mov             w12, #16
415
+.vl_gt_16_loop_sub_ps_64_sve2:
416
+    sub             w12, w12, #1
417
+.rept 4
418
+    ld1b            {z0.b}, p0/z, x2
419
+    ld1b            {z1.b}, p0/z, x2, #1, mul vl
420
+    ld1b            {z4.b}, p0/z, x3
421
+    ld1b            {z5.b}, p0/z, x3, #1, mul vl
422
+    add             x2, x2, x4
423
+    add             x3, x3, x5
424
+    usublb          z16.h, z0.b, z4.b
425
+    usublt          z17.h, z0.b, z4.b
426
+    usublb          z18.h, z1.b, z5.b
427
+    usublt          z19.h, z1.b, z5.b
428
+    st2h            {z16.h, z17.h}, p0, x0
429
+    st2h            {z18.h, z19.h}, p0, x0, #2, mul vl
430
+    add             x0, x0, x1, lsl #1
431
+.endr
432
+    cbnz            w12, .vl_gt_16_loop_sub_ps_64_sve2
433
+    ret
434
+.vl_gt_48_pixel_sub_ps_64x64:
435
+    cmp             x9, #112
436
+    bgt             .vl_gt_112_pixel_sub_ps_64x64
437
+    ptrue           p0.b, vl64
438
+    mov             w12, #16
439
+.vl_gt_48_loop_sub_ps_64_sve2:
440
+    sub             w12, w12, #1
441
+.rept 4
442
+    ld1b            {z0.b}, p0/z, x2
443
+    ld1b            {z4.b}, p0/z, x3
444
+    add             x2, x2, x4
445
+    add             x3, x3, x5
446
+    usublb          z16.h, z0.b, z4.b
447
+    usublt          z17.h, z0.b, z4.b
448
+    st2h            {z16.h, z17.h}, p0, x0
449
+    add             x0, x0, x1, lsl #1
450
+.endr
451
+    cbnz            w12, .vl_gt_48_loop_sub_ps_64_sve2
452
+    ret
453
+.vl_gt_112_pixel_sub_ps_64x64:
454
+    ptrue           p0.h, vl64
455
+    mov             w12, #16
456
+.vl_gt_112_loop_sub_ps_64_sve2:
457
+    sub             w12, w12, #1
458
+.rept 4
459
+    ld1b            {z0.h}, p0/z, x2
460
+    ld1b            {z8.h}, p0/z, x3
461
+    add             x2, x2, x4
462
+    add             x3, x3, x5
463
+    sub             z16.h, z0.h, z8.h
464
+    st1h            {z16.h}, p0, x0
465
+    add             x0, x0, x1, lsl #1
466
+.endr
467
+    cbnz            w12, .vl_gt_112_loop_sub_ps_64_sve2
468
+    ret
469
+endfunc
470
+
471
+function PFX(pixel_sub_ps_32x64_sve2)
472
+    rdvl            x9, #1
473
+    cmp             x9, #16
474
+    bgt             .vl_gt_16_pixel_sub_ps_32x64
475
+    lsl             x1, x1, #1
476
+    mov             w12, #8
477
+.loop_sub_ps_32x64_sve2:
478
+    sub             w12, w12, #1
479
+.rept 4
480
+    ld1             {v0.16b-v1.16b}, x2, x4
481
+    ld1             {v2.16b-v3.16b}, x3, x5
482
+    ld1             {v4.16b-v5.16b}, x2, x4
483
+    ld1             {v6.16b-v7.16b}, x3, x5
484
+    usubl           v16.8h, v0.8b, v2.8b
485
+    usubl2          v17.8h, v0.16b, v2.16b
486
+    usubl           v18.8h, v1.8b, v3.8b
487
+    usubl2          v19.8h, v1.16b, v3.16b
488
+    usubl           v20.8h, v4.8b, v6.8b
489
+    usubl2          v21.8h, v4.16b, v6.16b
490
+    usubl           v22.8h, v5.8b, v7.8b
491
+    usubl2          v23.8h, v5.16b, v7.16b
492
+    st1             {v16.8h-v19.8h}, x0, x1
493
+    st1             {v20.8h-v23.8h}, x0, x1
494
+.endr
495
+    cbnz            w12, .loop_sub_ps_32x64_sve2
496
+    ret
497
+.vl_gt_16_pixel_sub_ps_32x64:
498
+    cmp             x9, #48
499
+    bgt             .vl_gt_48_pixel_sub_ps_32x64
500
+    ptrue           p0.b, vl32
501
+    mov             w12, #8
502
+.vl_gt_16_loop_sub_ps_32x64_sve2:
503
+    sub             w12, w12, #1
504
+.rept 8
505
+    ld1b            {z0.b}, p0/z, x2
506
+    ld1b            {z2.b}, p0/z, x3
507
+    add             x2, x2, x4
508
+    add             x3, x3, x5
509
+    usublb          z16.h, z0.b, z2.b
510
+    usublt          z17.h, z0.b, z2.b
511
+    st2h            {z16.h, z17.h}, p0, x0
512
+    add             x0, x0, x1, lsl #1
513
+.endr
514
+    cbnz            w12, .vl_gt_16_loop_sub_ps_32x64_sve2
515
+    ret
516
+.vl_gt_48_pixel_sub_ps_32x64:
517
+    ptrue           p0.h, vl32
518
+    mov             w12, #8
519
+.vl_gt_48_loop_sub_ps_32x64_sve2:
520
+    sub             w12, w12, #1
521
+.rept 8
522
+    ld1b            {z0.h}, p0/z, x2
523
+    ld1b            {z4.h}, p0/z, x3
524
+    add             x2, x2, x4
525
+    add             x3, x3, x5
526
+    sub             z8.h, z0.h, z4.h
527
+    st1h            {z8.h}, p0, x0
528
+    add             x0, x0, x1, lsl #1
529
+.endr
530
+    cbnz            w12, .vl_gt_48_loop_sub_ps_32x64_sve2
531
+    ret
532
+endfunc
533
+
534
+function PFX(pixel_add_ps_4x4_sve2)
535
+    ptrue           p0.h, vl8
536
+    ptrue           p1.h, vl4
537
+.rept 4
538
+    ld1b            {z0.h}, p0/z, [x2]
539
+    ld1h            {z2.h}, p1/z, [x3]
540
+    add             x2, x2, x4
541
+    add             x3, x3, x5, lsl #1
542
+    add             z4.h, z0.h, z2.h
543
+    sqxtunb         z4.b, z4.h
544
+    st1b            {z4.h}, p1, [x0]
545
+    add             x0, x0, x1
546
+.endr
547
+    ret
548
+endfunc
549
+
550
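pixel_add_ps is the reconstruction step: pred plus int16 residual, saturated back to the 8-bit pixel range (the sqxtunb above performs the unsigned saturating narrow). A scalar sketch with illustrative names:

    #include <cstdint>
    #include <cstddef>

    static void pixel_add_ps_sketch(uint8_t* dst, ptrdiff_t dstStride,
                                    const uint8_t* pred, const int16_t* resi,
                                    ptrdiff_t predStride, ptrdiff_t resiStride,
                                    int w, int h)
    {
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                int v = pred[x] + resi[x];
                dst[x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));  // saturate
            }
            dst  += dstStride;
            pred += predStride;
            resi += resiStride;
        }
    }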
+function PFX(pixel_add_ps_8x8_sve2)
551
+    ptrue           p0.h, vl8
552
+.rept 8
553
+    ld1b            {z0.h}, p0/z, x2
554
+    ld1h            {z2.h}, p0/z, x3
555
+    add             x2, x2, x4
556
+    add             x3, x3, x5, lsl #1
557
+    add             z4.h, z0.h, z2.h
558
+    sqxtunb         z4.b, z4.h
559
+    st1b            {z4.h}, p0, x0
560
+    add             x0, x0, x1
561
+.endr
562
+    ret
563
+endfunc
564
+
565
+.macro pixel_add_ps_16xN_sve2 h
566
+function PFX(pixel_add_ps_16x\h\()_sve2)
567
+    rdvl            x9, #1
568
+    cmp             x9, #16
569
+    bgt             .vl_gt_16_pixel_add_ps_16x\h
570
+    ptrue           p0.b, vl16
571
+.rept \h
572
+    ld1b            {z0.h}, p0/z, x2
573
+    ld1b            {z1.h}, p0/z, x2, #1, mul vl
574
+    ld1h            {z2.h}, p0/z, x3
575
+    ld1h            {z3.h}, p0/z, x3, #1, mul vl
576
+    add             x2, x2, x4
577
+    add             x3, x3, x5, lsl #1
578
+    add             z24.h, z0.h, z2.h
579
+    add             z25.h, z1.h, z3.h
580
+    sqxtunb         z6.b, z24.h
581
+    sqxtunb         z7.b, z25.h
582
+    st1b            {z6.h}, p0, x0
583
+    st1b            {z7.h}, p0, x0, #1, mul vl
584
+    add             x0, x0, x1
585
+.endr
586
+    ret
587
+.vl_gt_16_pixel_add_ps_16x\h\():
588
+    ptrue           p0.b, vl32
589
+.rept \h
590
+    ld1b            {z0.h}, p0/z, x2
591
+    ld1h            {z2.h}, p0/z, x3
592
+    add             x2, x2, x4
593
+    add             x3, x3, x5, lsl #1
594
+    add             z24.h, z0.h, z2.h
595
+    sqxtunb         z6.b, z24.h
596
+    st1b            {z6.h}, p0, x0
597
+    add             x0, x0, x1
598
+.endr
599
+    ret
600
+endfunc
601
+.endm
602
+
603
+pixel_add_ps_16xN_sve2 16
604
+pixel_add_ps_16xN_sve2 32
605
+
606
+.macro pixel_add_ps_32xN_sve2 h
607
+ function PFX(pixel_add_ps_32x\h\()_sve2)
608
+    rdvl            x9, #1
609
+    cmp             x9, #16
610
+    bgt             .vl_gt_16_pixel_add_ps_32x\h
611
+    lsl             x5, x5, #1
612
+    mov             w12, #\h / 4
613
+.loop_add_ps__sve2_32x\h\():
614
+    sub             w12, w12, #1
615
+.rept 4
616
+    ld1             {v0.16b-v1.16b}, x2, x4
617
+    ld1             {v16.8h-v19.8h}, x3, x5
618
+    uxtl            v4.8h, v0.8b
619
+    uxtl2           v5.8h, v0.16b
620
+    uxtl            v6.8h, v1.8b
621
+    uxtl2           v7.8h, v1.16b
622
+    add             v24.8h, v4.8h, v16.8h
623
+    add             v25.8h, v5.8h, v17.8h
624
+    add             v26.8h, v6.8h, v18.8h
625
+    add             v27.8h, v7.8h, v19.8h
626
+    sqxtun          v4.8b, v24.8h
627
+    sqxtun2         v4.16b, v25.8h
628
+    sqxtun          v5.8b, v26.8h
629
+    sqxtun2         v5.16b, v27.8h
630
+    st1             {v4.16b-v5.16b}, x0, x1
631
+.endr
632
+    cbnz            w12, .loop_add_ps__sve2_32x\h
633
+    ret
634
+.vl_gt_16_pixel_add_ps_32x\h\():
635
+    cmp             x9, #48
636
+    bgt             .vl_gt_48_pixel_add_ps_32x\h
637
+    ptrue           p0.b, vl32
638
+.rept \h
639
+    ld1b            {z0.h}, p0/z, x2
640
+    ld1b            {z1.h}, p0/z, x2, #1, mul vl
641
+    ld1h            {z4.h}, p0/z, x3
642
+    ld1h            {z5.h}, p0/z, x3, #1, mul vl
643
+    add             x2, x2, x4
644
+    add             x3, x3, x5, lsl #1
645
+    add             z24.h, z0.h, z4.h
646
+    add             z25.h, z1.h, z5.h
647
+    sqxtunb         z6.b, z24.h
648
+    sqxtunb         z7.b, z25.h
649
+    st1b            {z6.h}, p0, x0
650
+    st1b            {z7.h}, p0, x0, #1, mul vl
651
+    add             x0, x0, x1
652
+.endr
653
+    ret
654
+.vl_gt_48_pixel_add_ps_32x\h\():
655
+    ptrue           p0.b, vl64
656
+.rept \h
657
+    ld1b            {z0.h}, p0/z, x2
658
+    ld1h            {z4.h}, p0/z, x3
659
+    add             x2, x2, x4
660
+    add             x3, x3, x5, lsl #1
661
+    add             z24.h, z0.h, z4.h
662
+    sqxtunb         z6.b, z24.h
663
+    st1b            {z6.h}, p0, x0
664
+    add             x0, x0, x1
665
+.endr
666
+    ret
667
+endfunc
668
+.endm
669
+
670
+pixel_add_ps_32xN_sve2 32
671
+pixel_add_ps_32xN_sve2 64
672
+
673
+function PFX(pixel_add_ps_64x64_sve2)
674
+    rdvl            x9, #1
675
+    cmp             x9, #16
676
+    bgt             .vl_gt_16_pixel_add_ps_64x64
677
+    ptrue           p0.b, vl16
678
+.rept 64
679
+    ld1b            {z0.h}, p0/z, x2
680
+    ld1b            {z1.h}, p0/z, x2, #1, mul vl
681
+    ld1b            {z2.h}, p0/z, x2, #2, mul vl
682
+    ld1b            {z3.h}, p0/z, x2, #3, mul vl
683
+    ld1b            {z4.h}, p0/z, x2, #4 ,mul vl
684
+    ld1b            {z5.h}, p0/z, x2, #5, mul vl
685
+    ld1b            {z6.h}, p0/z, x2, #6, mul vl
686
+    ld1b            {z7.h}, p0/z, x2, #7, mul vl
687
+    ld1h            {z8.h}, p0/z, x3
688
+    ld1h            {z9.h}, p0/z, x3, #1, mul vl
689
+    ld1h            {z10.h}, p0/z, x3, #2, mul vl
690
+    ld1h            {z11.h}, p0/z, x3, #3, mul vl
691
+    ld1h            {z12.h}, p0/z, x3, #4, mul vl
692
+    ld1h            {z13.h}, p0/z, x3, #5, mul vl
693
+    ld1h            {z14.h}, p0/z, x3, #6, mul vl
694
+    ld1h            {z15.h}, p0/z, x3, #7, mul vl
695
+    add             x2, x2, x4
696
+    add             x3, x3, x5, lsl #1
697
+    add             z24.h, z0.h, z8.h
698
+    add             z25.h, z1.h, z9.h
699
+    add             z26.h, z2.h, z10.h
700
+    add             z27.h, z3.h, z11.h
701
+    add             z28.h, z4.h, z12.h
702
+    add             z29.h, z5.h, z13.h
703
+    add             z30.h, z6.h, z14.h
704
+    add             z31.h, z7.h, z15.h
705
+    sqxtunb         z6.b, z24.h
706
+    sqxtunb         z7.b, z25.h
707
+    sqxtunb         z8.b, z26.h
708
+    sqxtunb         z9.b, z27.h
709
+    sqxtunb         z10.b, z28.h
710
+    sqxtunb         z11.b, z29.h
711
+    sqxtunb         z12.b, z30.h
712
+    sqxtunb         z13.b, z31.h
713
+    st1b            {z6.h}, p0, x0
714
+    st1b            {z7.h}, p0, x0, #1, mul vl
715
+    st1b            {z8.h}, p0, x0, #2, mul vl
716
+    st1b            {z9.h}, p0, x0, #3, mul vl
717
+    st1b            {z10.h}, p0, x0, #4, mul vl
718
+    st1b            {z11.h}, p0, x0, #5, mul vl
719
+    st1b            {z12.h}, p0, x0, #6, mul vl
720
+    st1b            {z13.h}, p0, x0, #7, mul vl
721
+    add             x0, x0, x1
722
+.endr
723
+    ret
724
+.vl_gt_16_pixel_add_ps_64x64:
725
+    cmp             x9, #48
726
+    bgt             .vl_gt_48_pixel_add_ps_64x64
727
+    ptrue           p0.b, vl32
728
+.rept 64
729
+    ld1b            {z0.h}, p0/z, x2
730
+    ld1b            {z1.h}, p0/z, x2, #1, mul vl
731
+    ld1b            {z2.h}, p0/z, x2, #2, mul vl
732
+    ld1b            {z3.h}, p0/z, x2, #3, mul vl
733
+    ld1h            {z8.h}, p0/z, x3
734
+    ld1h            {z9.h}, p0/z, x3, #1, mul vl
735
+    ld1h            {z10.h}, p0/z, x3, #2, mul vl
736
+    ld1h            {z11.h}, p0/z, x3, #3, mul vl
737
+    add             x2, x2, x4
738
+    add             x3, x3, x5, lsl #1
739
+    add             z24.h, z0.h, z8.h
740
+    add             z25.h, z1.h, z9.h
741
+    add             z26.h, z2.h, z10.h
742
+    add             z27.h, z3.h, z11.h
743
+    sqxtunb         z6.b, z24.h
744
+    sqxtunb         z7.b, z25.h
745
+    sqxtunb         z8.b, z26.h
746
+    sqxtunb         z9.b, z27.h
747
+    st1b            {z6.h}, p0, x0
748
+    st1b            {z7.h}, p0, x0, #1, mul vl
749
+    st1b            {z8.h}, p0, x0, #2, mul vl
750
+    st1b            {z9.h}, p0, x0, #3, mul vl
751
+    add             x0, x0, x1
752
+.endr
753
+    ret
754
+.vl_gt_48_pixel_add_ps_64x64:
755
+    cmp             x9, #112
756
+    bgt             .vl_gt_112_pixel_add_ps_64x64
757
+    ptrue           p0.b, vl64
758
+.rept 64
759
+    ld1b            {z0.h}, p0/z, x2
760
+    ld1b            {z1.h}, p0/z, x2, #1, mul vl
761
+    ld1h            {z8.h}, p0/z, x3
762
+    ld1h            {z9.h}, p0/z, x3, #1, mul vl
763
+    add             x2, x2, x4
764
+    add             x3, x3, x5, lsl #1
765
+    add             z24.h, z0.h, z8.h
766
+    add             z25.h, z1.h, z9.h
767
+    sqxtunb         z6.b, z24.h
768
+    sqxtunb         z7.b, z25.h
769
+    st1b            {z6.h}, p0, x0
770
+    st1b            {z7.h}, p0, x0, #1, mul vl
771
+    add             x0, x0, x1
772
+.endr
773
+    ret
774
+.vl_gt_112_pixel_add_ps_64x64:
775
+    ptrue           p0.b, vl128
776
+.rept 64
777
+    ld1b            {z0.h}, p0/z, x2
778
+    ld1h            {z8.h}, p0/z, x3
779
+    add             x2, x2, x4
780
+    add             x3, x3, x5, lsl #1
781
+    add             z24.h, z0.h, z8.h
782
+    sqxtunb         z6.b, z24.h
783
+    st1b            {z6.h}, p0, x0
784
+    add             x0, x0, x1
785
+.endr
786
+    ret
787
+endfunc
788
+
789
+// Chroma add_ps
790
+function PFX(pixel_add_ps_4x8_sve2)
791
+    ptrue           p0.h,vl4
792
+.rept 8
793
+    ld1b            {z0.h}, p0/z, x2
794
+    ld1h            {z2.h}, p0/z, x3
795
+    add             x2, x2, x4
796
+    add             x3, x3, x5, lsl #1
797
+    add             z4.h, z0.h, z2.h
798
+    sqxtunb         z4.b, z4.h
799
+    st1b            {z4.h}, p0, x0
800
+    add             x0, x0, x1
801
+.endr
802
+    ret
803
+endfunc
804
+
805
+function PFX(pixel_add_ps_8x16_sve2)
806
+    ptrue           p0.h,vl8
807
+.rept 16
808
+    ld1b            {z0.h}, p0/z, x2
809
+    ld1h            {z2.h}, p0/z, x3
810
+    add             x2, x2, x4
811
+    add             x3, x3, x5, lsl #1
812
+    add             z4.h, z0.h, z2.h
813
+    sqxtunb         z4.b, z4.h
814
+    st1b            {z4.h}, p0, x0
815
+    add             x0, x0, x1
816
+.endr
817
+    ret
818
+endfunc
819
+
820
+// void scale1D_128to64(pixel *dst, const pixel *src)
821
+function PFX(scale1D_128to64_sve2)
822
+    rdvl            x9, #1
823
+    cmp             x9, #16
824
+    bgt             .vl_gt_16_scale1D_128to64
825
+    ptrue           p0.b, vl16
826
+.rept 2
827
+    ld2b            {z0.b, z1.b}, p0/z, x1
828
+    ld2b            {z2.b, z3.b}, p0/z, x1, #2, mul vl
829
+    ld2b            {z4.b, z5.b}, p0/z, x1, #4, mul vl
830
+    ld2b            {z6.b, z7.b}, p0/z, x1, #6, mul vl
831
+    add             x1, x1, #128
832
+    urhadd          z0.b, p0/m, z0.b, z1.b
833
+    urhadd          z2.b, p0/m, z2.b, z3.b
834
+    urhadd          z4.b, p0/m, z4.b, z5.b
835
+    urhadd          z6.b, p0/m, z6.b, z7.b
836
+    st1b            {z0.b}, p0, x0
837
+    st1b            {z2.b}, p0, x0, #1, mul vl
838
+    st1b            {z4.b}, p0, x0, #2, mul vl
839
+    st1b            {z6.b}, p0, x0, #3, mul vl
840
+    add             x0, x0, #64
841
+.endr
842
+    ret
843
+.vl_gt_16_scale1D_128to64:
844
+    cmp             x9, #48
845
+    bgt             .vl_gt_48_scale1D_128to64
846
+    ptrue           p0.b, vl32
847
+.rept 2
848
+    ld2b            {z0.b, z1.b}, p0/z, x1
849
+    ld2b            {z2.b, z3.b}, p0/z, x1, #2, mul vl
850
+    add             x1, x1, #128
851
+    urhadd          z0.b, p0/m, z0.b, z1.b
852
+    urhadd          z2.b, p0/m, z2.b, z3.b
853
+    st1b            {z0.b}, p0, x0
854
+    st1b            {z2.b}, p0, x0, #1, mul vl
855
+    add             x0, x0, #64
856
+.endr
857
+    ret
858
+.vl_gt_48_scale1D_128to64:
859
+    ptrue           p0.b, vl64
860
+.rept 2
861
+    ld2b            {z0.b, z1.b}, p0/z, x1
862
+    add             x1, x1, #128
863
+    urhadd          z0.b, p0/m, z0.b, z1.b
864
+    st1b            {z0.b}, p0, x0
865
+    add             x0, x0, #64
866
+.endr
867
+    ret
868
+endfunc
869
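scale1D_128to64 halves a 128-pixel row by rounding-averaging adjacent pairs; ld2b de-interleaves even/odd pixels and urhadd averages them with rounding. Scalar sketch:

    #include <cstdint>

    static void scale1D_128to64_sketch(uint8_t* dst, const uint8_t* src)
    {
        for (int i = 0; i < 64; i++)           // dst[i] = (src[2i] + src[2i+1] + 1) / 2
            dst[i] = (uint8_t)((src[2 * i] + src[2 * i + 1] + 1) >> 1);
    }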
+
870
+/***** dequant_scaling*****/
871
+// void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift)
872
+function PFX(dequant_scaling_sve2)
873
+    ptrue           p0.h, vl8
874
+    add             x5, x5, #4              // shift + 4
875
+    lsr             x3, x3, #3              // num / 8
876
+    cmp             x5, x4
877
+    blt             .dequant_skip_sve2
878
+
879
+    mov             x12, #1
880
+    sub             x6, x5, x4          // shift - per
881
+    sub             x6, x6, #1          // shift - per - 1
882
+    lsl             x6, x12, x6         // 1 << shift - per - 1 (add)
883
+    mov             z0.s, w6
884
+    sub             x7, x4, x5          // per - shift
885
+    mov             z3.s, w7
886
+
887
+.dequant_loop1_sve2:
888
+    ld1h            {z19.h}, p0/z, [x0]
889
+    ld1w            {z2.s}, p0/z, [x1]
890
+    add             x1, x1, #16
891
+    ld1w            {z20.s}, p0/z, [x1]
892
+    add             x0, x0, #16
893
+    add             x1, x1, #16
894
+
895
+    sub             x3, x3, #1
896
+    sunpklo         z1.s, z19.h
897
+    sunpkhi         z19.s, z19.h
898
+
899
+    mul             z1.s, z1.s, z2.s // quantCoef * deQuantCoef
900
+    mul             z19.s, z19.s, z20.s
901
+    add             z1.s, z1.s, z0.s // quantCoef * deQuantCoef + add
902
+    add             z19.s, z19.s, z0.s
903
+
904
+    // No equivalent instructions in SVE2 for sshl
905
+    // as sqshl has double latency
906
+    sshl            v1.4s, v1.4s, v3.4s
907
+    sshl            v19.4s, v19.4s, v3.4s
908
+
909
+    sqxtnb          z16.h, z1.s
910
+    sqxtnb          z17.h, z19.s
911
+    st1h            {z16.s}, p0, [x2]
912
+    st1h            {z17.s}, p0, [x2, #1, mul vl]
913
+    add             x2, x2, #16
914
+    cbnz            x3, .dequant_loop1_sve2
915
+    ret
916
+
917
+.dequant_skip_sve2:
918
+    sub             x6, x4, x5          // per - shift
919
+    mov             z0.h, w6
920
+
921
+.dequant_loop2_sve2:
922
+    ld1h            {z19.h}, p0/z, [x0]
923
+    ld1w            {z2.s}, p0/z, [x1]
924
+    add             x1, x1, #16
925
+    ld1w            {z20.s}, p0/z, [x1]
926
+    add             x0, x0, #16
927
+    add             x1, x1, #16
928
+
929
+
930
+    sub             x3, x3, #1
931
+    sunpklo         z1.s, z19.h
932
+    sunpkhi         z19.s, z19.h
933
+
934
+    mul             z1.s, z1.s, z2.s // quantCoef * deQuantCoef
935
+    mul             z19.s, z19.s, z20.s
936
+
937
+    // Keeping NEON instructions here in order to have
938
+    // one sqshl later
939
+    sqxtn           v16.4h, v1.4s       // x265_clip3
940
+    sqxtn2          v16.8h, v19.4s
941
+
942
+    sqshl           z16.h, p0/m, z16.h, z0.h // coefQ << per - shift
943
+    st1h            {z16.h}, p0, [x2]
944
+    add             x2, x2, #16
945
+    cbnz            x3, .dequant_loop2_sve2
946
+    ret
947
+endfunc
948
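The two branches of dequant_scaling mirror this scalar logic: with shift+4 greater than per, the coefficient product is rounded and shifted right; otherwise it is clipped and shifted left. A hedged scalar sketch:

    #include <cstdint>

    static int16_t clip16(int64_t v)           // x265_clip3(-32768, 32767, v)
    {
        return (int16_t)(v < -32768 ? -32768 : (v > 32767 ? 32767 : v));
    }

    static void dequant_scaling_sketch(const int16_t* quantCoef, const int32_t* deQuantCoef,
                                       int16_t* coef, int num, int per, int shift)
    {
        shift += 4;
        if (shift > per)
        {
            int64_t add = (int64_t)1 << (shift - per - 1);
            for (int n = 0; n < num; n++)
                coef[n] = clip16(((int64_t)quantCoef[n] * deQuantCoef[n] + add) >> (shift - per));
        }
        else
        {
            for (int n = 0; n < num; n++)
                coef[n] = clip16((int64_t)clip16((int64_t)quantCoef[n] * deQuantCoef[n]) << (per - shift));
        }
    }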
+
949
+// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)
950
+function PFX(dequant_normal_sve2)
951
+    lsr             w2, w2, #4              // num / 16
952
+    neg             w4, w4
953
+    mov             z0.h, w3
954
+    mov             z1.s, w4
955
+    rdvl            x9, #1
956
+    cmp             x9, #16
957
+    bgt             .vl_gt_16_dequant_normal_sve2
958
+.dqn_loop1_sve2:
959
+    ld1             {v2.8h, v3.8h}, [x0], #32
960
+    smull           v16.4s, v2.4h, v0.4h
961
+    smull2          v17.4s, v2.8h, v0.8h
962
+    smull           v18.4s, v3.4h, v0.4h
963
+    smull2          v19.4s, v3.8h, v0.8h
964
+
965
+    srshl           v16.4s, v16.4s, v1.4s
966
+    srshl           v17.4s, v17.4s, v1.4s
967
+    srshl           v18.4s, v18.4s, v1.4s
968
+    srshl           v19.4s, v19.4s, v1.4s
969
+
970
+    sqxtn           v2.4h, v16.4s
971
+    sqxtn2          v2.8h, v17.4s
972
+    sqxtn           v3.4h, v18.4s
973
+    sqxtn2          v3.8h, v19.4s
974
+
975
+    sub             w2, w2, #1
976
+    st1             {v2.8h, v3.8h}, [x1], #32
977
+    cbnz            w2, .dqn_loop1_sve2
978
+    ret
979
+.vl_gt_16_dequant_normal_sve2:
980
+    ptrue           p0.h, vl16
981
+.gt_16_dqn_loop1_sve2:
982
+    ld1h            {z2.h}, p0/z, [x0]
983
+    add             x0, x0, #32
984
+    smullb          z16.s, z2.h, z0.h
985
+    smullt          z17.s, z2.h, z0.h
986
+
987
+    srshl           z16.s, p0/m, z16.s, z1.s
988
+    srshl           z17.s, p0/m, z17.s, z1.s
989
+
990
+    sqxtnb          z2.h, z16.s
991
+    sqxtnt          z2.h, z17.s
992
+    
993
+    sub             w2, w2, #1
994
+    st1h            {z2.h}, p0, [x1]
995
+    add             x1, x1, #32
996
+    cbnz            w2, .gt_16_dqn_loop1_sve2
997
+    ret
998
+
999
+endfunc
1000
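dequant_normal is the flat-scaling variant: multiply by one scale, rounding-shift right (the srshl by a negated shift above), then saturate to int16 (sqxtn/sqxtnb). Scalar sketch:

    #include <cstdint>

    static void dequant_normal_sketch(const int16_t* quantCoef, int16_t* coef,
                                      int num, int scale, int shift)
    {
        int32_t add = 1 << (shift - 1);        // rounding term
        for (int n = 0; n < num; n++)
        {
            int32_t v = (quantCoef[n] * scale + add) >> shift;
            coef[n] = (int16_t)(v < -32768 ? -32768 : (v > 32767 ? 32767 : v));
        }
    }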
+
1001
+// void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24)
1002
+function PFX(ssim_4x4x2_core_sve2)
1003
+    ptrue           p0.b, vl16
1004
+    movi            v30.2d, #0
1005
+    movi            v31.2d, #0
1006
+
1007
+    ld1b            {z0.h}, p0/z, [x0]
1008
+    add             x0, x0, x1
1009
+    ld1b            {z1.h}, p0/z, [x0]
1010
+    add             x0, x0, x1
1011
+    ld1b            {z2.h}, p0/z, [x0]
1012
+    add             x0, x0, x1
1013
+    ld1b            {z3.h}, p0/z, [x0]
1014
+    add             x0, x0, x1
1015
+
1016
+    ld1b            {z4.h}, p0/z, [x2]
1017
+    add             x2, x2, x3
1018
+    ld1b            {z5.h}, p0/z, [x2]
1019
+    add             x2, x2, x3
1020
+    ld1b            {z6.h}, p0/z, [x2]
1021
+    add             x2, x2, x3
1022
+    ld1b            {z7.h}, p0/z, [x2]
1023
+    add             x2, x2, x3
1024
+
1025
+    mul             z16.h, z0.h, z0.h
1026
+    mul             z17.h, z1.h, z1.h
1027
+    mul             z18.h, z2.h, z2.h
1028
+    uaddlp          v30.4s, v16.8h
1029
+
1030
+    mul             z19.h, z3.h, z3.h
1031
+    mul             z20.h, z4.h, z4.h
1032
+    mul             z21.h, z5.h, z5.h
1033
+    uadalp          v30.4s, v17.8h
1034
+
1035
+    mul             z22.h, z6.h, z6.h
1036
+    mul             z23.h, z7.h, z7.h
1037
+    mul             z24.h, z0.h, z4.h
1038
+    uadalp          v30.4s, v18.8h
1039
+
1040
+    mul             z25.h, z1.h, z5.h
1041
+    mul             z26.h, z2.h, z6.h
1042
+    mul             z27.h, z3.h, z7.h
1043
+    uadalp          v30.4s, v19.8h
1044
+
1045
+    add             z28.h, z0.h, z1.h
1046
+    add             z29.h, z4.h, z5.h
1047
+    uadalp          v30.4s, v20.8h
1048
+    uaddlp          v31.4s, v24.8h
1049
+
1050
+    add             z28.h, z28.h, z2.h
1051
+    add             z29.h, z29.h, z6.h
1052
+    uadalp          v30.4s, v21.8h
1053
+    uadalp          v31.4s, v25.8h
1054
+
1055
+    add             z28.h, z28.h, z3.h
1056
+    add             z29.h, z29.h, z7.h
1057
+    uadalp          v30.4s, v22.8h
1058
+    uadalp          v31.4s, v26.8h
1059
+
1060
+    // Better use NEON instructions here
1061
+    uaddlp          v28.4s, v28.8h
1062
+    uaddlp          v29.4s, v29.8h
1063
+    uadalp          v30.4s, v23.8h
1064
+    uadalp          v31.4s, v27.8h
1065
+
1066
+    addp            v28.4s, v28.4s, v28.4s
1067
+    addp            v29.4s, v29.4s, v29.4s
1068
+    addp            v30.4s, v30.4s, v30.4s
1069
+    addp            v31.4s, v31.4s, v31.4s
1070
+
1071
+    st4             {v28.2s, v29.2s, v30.2s, v31.2s}, [x4]
1072
+    ret
1073
+endfunc
1074
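ssim_4x4x2_core gathers the four SSIM partial sums for two horizontally adjacent 4x4 blocks at once; the final st4 interleaves the sums of a, b, a^2+b^2 and a*b into the sums[2][4] output. Scalar sketch:

    #include <cstdint>
    #include <cstddef>

    static void ssim_4x4x2_core_sketch(const uint8_t* pix1, ptrdiff_t stride1,
                                       const uint8_t* pix2, ptrdiff_t stride2,
                                       int sums[2][4])
    {
        for (int z = 0; z < 2; z++)            // two adjacent 4x4 blocks
        {
            int s1 = 0, s2 = 0, ss = 0, s12 = 0;
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                {
                    int a = pix1[y * stride1 + x];
                    int b = pix2[y * stride2 + x];
                    s1  += a;
                    s2  += b;
                    ss  += a * a + b * b;
                    s12 += a * b;
                }
            sums[z][0] = s1; sums[z][1] = s2; sums[z][2] = ss; sums[z][3] = s12;
            pix1 += 4;
            pix2 += 4;
        }
    }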
+
1075
+// void ssimDist_c(const pixel* fenc, uint32_t fStride, const pixel* recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k)
1076
+.macro ssimDist_start_sve2
1077
+    mov             z0.d, #0
1078
+    mov             z1.d, #0
1079
+.endm
1080
+
1081
+.macro ssimDist_1_sve2  z0 z1 z2 z3
1082
+    sub             z16.s, \z0\().s, \z2\().s
1083
+    sub             z17.s, \z1\().s, \z3\().s
1084
+    mul             z18.s, \z0\().s, \z0\().s
1085
+    mul             z19.s, \z1\().s, \z1\().s
1086
+    mul             z20.s, z16.s, z16.s
1087
+    mul             z21.s, z17.s, z17.s
1088
+    add             z0.s, z0.s, z18.s
1089
+    add             z0.s, z0.s, z19.s
1090
+    add             z1.s, z1.s, z20.s
1091
+    add             z1.s, z1.s, z21.s
1092
+.endm
1093
+
1094
+.macro ssimDist_end_sve2
1095
+    uaddv           d0, p0, z0.s
1096
+    uaddv           d1, p0, z1.s
1097
+    str             d0, [x6]
1098
+    str             d1, [x4]
1099
+.endm
1100
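The ssimDist kernels accumulate two 64-bit sums per block: the source energy (sum of squared fenc pixels, stored through x6, the ac_k argument) and the source/recon SSD (stored through x4, ssBlock). A scalar sketch of that accumulation, ignoring the high-bit-depth shift argument:

    #include <cstdint>
    #include <cstddef>

    static void ssimDist_sketch(const uint8_t* fenc, ptrdiff_t fStride,
                                const uint8_t* recon, ptrdiff_t rStride,
                                int size, uint64_t* ssBlock, uint64_t* ac_k)
    {
        uint64_t ac = 0, ssd = 0;
        for (int y = 0; y < size; y++)
            for (int x = 0; x < size; x++)
            {
                int s = fenc[y * fStride + x];
                int r = recon[y * rStride + x];
                ac  += (uint64_t)(s * s);
                ssd += (uint64_t)((s - r) * (s - r));
            }
        *ac_k    = ac;    // "str d0, [x6]" in ssimDist_end
        *ssBlock = ssd;   // "str d1, [x4]"
    }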
+
1101
+function PFX(ssimDist4_sve2)
1102
+    ssimDist_start
1103
+    ptrue           p0.s, vl4
1104
+.rept 4
1105
+    ld1b            {z4.s}, p0/z, [x0]
1106
+    add             x0, x0, x1
1107
+    ld1b            {z5.s}, p0/z, [x2]
1108
+    add             x2, x2, x3
1109
+    sub             z2.s, z4.s, z5.s
1110
+    mul             z3.s, z4.s, z4.s
1111
+    mul             z2.s, z2.s, z2.s
1112
+    add             z0.s, z0.s, z3.s
1113
+    add             z1.s, z1.s, z2.s
1114
+.endr
1115
+    ssimDist_end
1116
+    ret
1117
+endfunc
1118
+
1119
+function PFX(ssimDist8_sve2)
1120
+    rdvl            x9, #1
1121
+    cmp             x9, #16
1122
+    bgt             .vl_gt_16_ssimDist8
1123
+    ssimDist_start
1124
+    ptrue           p0.s, vl4
1125
+.rept 8
1126
+    ld1b            {z4.s}, p0/z, [x0]
1127
+    ld1b            {z5.s}, p0/z, [x0, #1, mul vl]
1128
+    add             x0, x0, x1
1129
+    ld1b            {z6.s}, p0/z, [x2]
1130
+    ld1b            {z7.s}, p0/z, [x2, #1, mul vl]
1131
+    add             x2, x2, x3
1132
+    ssimDist_1_sve2 z4, z5, z6, z7
1133
+.endr
1134
+    ssimDist_end
1135
+    ret
1136
+.vl_gt_16_ssimDist8:
1137
+    ssimDist_start_sve2
1138
+    ptrue           p0.s, vl8
1139
+.rept 8
1140
+    ld1b            {z4.s}, p0/z, [x0]
1141
+    add             x0, x0, x1
1142
+    ld1b            {z6.s}, p0/z, [x2]
1143
+    add             x2, x2, x3
1144
+    sub             z20.s, z4.s, z6.s
1145
+    mul             z16.s, z4.s, z4.s
1146
+    mul             z18.s, z20.s, z20.s
1147
+    add             z0.s, z0.s, z16.s
1148
+    add             z1.s, z1.s, z18.s
1149
+.endr
1150
+    ssimDist_end_sve2
1151
+    ret
1152
+endfunc
1153
+
1154
+function PFX(ssimDist16_sve2)
1155
+    mov             w12, #16
1156
+    rdvl            x9, #1
1157
+    cmp             x9, #16
1158
+    bgt             .vl_gt_16_ssimDist16
1159
+    ssimDist_start
1160
+    ptrue           p0.s, vl4
1161
+.loop_ssimDist16_sve2:
1162
+    sub             w12, w12, #1
1163
+    ld1b            {z4.s}, p0/z, x0
1164
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1165
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1166
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1167
+    add             x0, x0, x1
1168
+    ld1b            {z8.s}, p0/z, x2
1169
+    ld1b            {z9.s}, p0/z, x2, #1, mul vl
1170
+    ld1b            {z10.s}, p0/z, x2, #2, mul vl
1171
+    ld1b            {z11.s}, p0/z, x2, #3, mul vl
1172
+    add             x2, x2, x3
1173
+    ssimDist_1_sve2 z4, z5, z8, z9
1174
+    ssimDist_1_sve2 z6, z7, z10, z11
1175
+    cbnz            w12, .loop_ssimDist16_sve2
1176
+    ssimDist_end
1177
+    ret
1178
+.vl_gt_16_ssimDist16:
1179
+    cmp             x9, #48
1180
+    bgt             .vl_gt_48_ssimDist16
1181
+    ssimDist_start_sve2
1182
+    ptrue           p0.s, vl8
1183
+.vl_gt_16_loop_ssimDist16_sve2:
1184
+    sub             w12, w12, #1
1185
+    ld1b            {z4.s}, p0/z, x0
1186
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1187
+    add             x0, x0, x1
1188
+    ld1b            {z8.s}, p0/z, x2
1189
+    ld1b            {z9.s}, p0/z, x2, #1, mul vl
1190
+    add             x2, x2, x3
1191
+    ssimDist_1_sve2 z4, z5, z8, z9
1192
+    cbnz            w12, .vl_gt_16_loop_ssimDist16_sve2
1193
+    ssimDist_end_sve2
1194
+    ret
1195
+.vl_gt_48_ssimDist16:
1196
+    ssimDist_start_sve2
1197
+    ptrue           p0.s, vl16
1198
+.vl_gt_48_loop_ssimDist16_sve2:
1199
+    sub             w12, w12, #1
1200
+    ld1b            {z4.s}, p0/z, x0
1201
+    add             x0, x0, x1
1202
+    ld1b            {z8.s}, p0/z, x2
1203
+    add             x2, x2, x3
1204
+    sub             z20.s, z4.s, z8.s
1205
+    mul             z16.s, z4.s, z4.s
1206
+    mul             z18.s, z20.s, z20.s
1207
+    add             z0.s, z0.s, z16.s
1208
+    add             z1.s, z1.s, z18.s
1209
+    cbnz            w12, .vl_gt_48_loop_ssimDist16_sve2
1210
+    ssimDist_end_sve2
1211
+    ret
1212
+endfunc
1213
+
1214
+function PFX(ssimDist32_sve2)
1215
+    mov             w12, #32
1216
+    rdvl            x9, #1
1217
+    cmp             x9, #16
1218
+    bgt             .vl_gt_16_ssimDist32
1219
+    ssimDist_start
1220
+    ptrue           p0.s, vl4
1221
+.loop_ssimDist32_sve2:
1222
+    sub             w12, w12, #1
1223
+    ld1b            {z2.s}, p0/z, x0
1224
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1225
+    ld1b            {z4.s}, p0/z, x0, #2, mul vl
1226
+    ld1b            {z5.s}, p0/z, x0, #3, mul vl
1227
+    ld1b            {z6.s}, p0/z, x0, #4, mul vl
1228
+    ld1b            {z7.s}, p0/z, x0, #5, mul vl
1229
+    ld1b            {z8.s}, p0/z, x0, #6, mul vl
1230
+    ld1b            {z9.s}, p0/z, x0, #7, mul vl
1231
+    add             x0, x0, x1
1232
+    ld1b            {z10.s}, p0/z, x2
1233
+    ld1b            {z11.s}, p0/z, x2, #1, mul vl
1234
+    ld1b            {z12.s}, p0/z, x2, #2, mul vl
1235
+    ld1b            {z13.s}, p0/z, x2, #3, mul vl
1236
+    ld1b            {z14.s}, p0/z, x2, #4, mul vl
1237
+    ld1b            {z15.s}, p0/z, x2, #5, mul vl
1238
+    ld1b            {z30.s}, p0/z, x2, #6, mul vl
1239
+    ld1b            {z31.s}, p0/z, x2, #7, mul vl
1240
+    add             x2, x2, x3
1241
+    ssimDist_1_sve2 z2, z3, z10, z11
1242
+    ssimDist_1_sve2 z4, z5, z12, z13
1243
+    ssimDist_1_sve2 z6, z7, z14, z15
1244
+    ssimDist_1_sve2 z8, z9, z30, z31
1245
+    cbnz            w12, .loop_ssimDist32_sve2
1246
+    ssimDist_end
1247
+    ret
1248
+.vl_gt_16_ssimDist32:
1249
+    cmp             x9, #48
1250
+    bgt             .vl_gt_48_ssimDist32
1251
+    ssimDist_start_sve2
1252
+    ptrue           p0.s, vl8
1253
+.vl_gt_16_loop_ssimDist32_sve2:
1254
+    sub             w12, w12, #1
1255
+    ld1b            {z2.s}, p0/z, x0
1256
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1257
+    ld1b            {z4.s}, p0/z, x0, #2, mul vl
1258
+    ld1b            {z5.s}, p0/z, x0, #3, mul vl
1259
+    add             x0, x0, x1
1260
+    ld1b            {z10.s}, p0/z, x2
1261
+    ld1b            {z11.s}, p0/z, x2, #1, mul vl
1262
+    ld1b            {z12.s}, p0/z, x2, #2, mul vl
1263
+    ld1b            {z13.s}, p0/z, x2, #3, mul vl
1264
+    add             x2, x2, x3
1265
+    ssimDist_1_sve2 z2, z3, z10, z11
1266
+    ssimDist_1_sve2 z4, z5, z12, z13
1267
+    cbnz            w12, .vl_gt_16_loop_ssimDist32_sve2
1268
+    ssimDist_end_sve2
1269
+    ret
1270
+.vl_gt_48_ssimDist32:
1271
+    cmp             x9, #112
1272
+    bgt             .vl_gt_112_ssimDist32
1273
+    ssimDist_start_sve2
1274
+    ptrue           p0.s, vl16
1275
+.vl_gt_48_loop_ssimDist32_sve2:
1276
+    sub             w12, w12, #1
1277
+    ld1b            {z2.s}, p0/z, x0
1278
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1279
+    add             x0, x0, x1
1280
+    ld1b            {z10.s}, p0/z, x2
1281
+    ld1b            {z11.s}, p0/z, x2, #1, mul vl
1282
+    add             x2, x2, x3
1283
+    ssimDist_1_sve2 z2, z3, z10, z11
1284
+    cbnz            w12, .vl_gt_48_loop_ssimDist32_sve2
1285
+    ssimDist_end_sve2
1286
+    ret
1287
+.vl_gt_112_ssimDist32:
1288
+    ssimDist_start_sve2
1289
+    ptrue           p0.s, vl32
1290
+.vl_gt_112_loop_ssimDist32_sve2:
1291
+    sub             w12, w12, #1
1292
+    ld1b            {z2.s}, p0/z, x0
1293
+    add             x0, x0, x1
1294
+    ld1b            {z10.s}, p0/z, x2
1295
+    add             x2, x2, x3
1296
+    sub             z20.s, z2.s, z10.s
1297
+    mul             z16.s, z2.s, z2.s
1298
+    mul             z18.s, z20.s, z20.s
1299
+    add             z0.s, z0.s, z16.s
1300
+    add             z1.s, z1.s, z18.s
1301
+    cbnz            w12, .vl_gt_112_loop_ssimDist32_sve2
1302
+    ssimDist_end_sve2
1303
+    ret
1304
+endfunc
1305
+
1306
+function PFX(ssimDist64_sve2)
1307
+    mov             w12, #64
1308
+    rdvl            x9, #1
1309
+    cmp             x9, #16
1310
+    bgt             .vl_gt_16_ssimDist64
1311
+    ssimDist_start
1312
+    ptrue           p0.s, vl4
1313
+.loop_ssimDist64_sve2:
1314
+    sub             w12, w12, #1
1315
+    ld1b            {z2.s}, p0/z, x0
1316
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1317
+    ld1b            {z4.s}, p0/z, x0, #2, mul vl
1318
+    ld1b            {z5.s}, p0/z, x0, #3, mul vl
1319
+    ld1b            {z6.s}, p0/z, x0, #4, mul vl
1320
+    ld1b            {z7.s}, p0/z, x0, #5, mul vl
1321
+    ld1b            {z8.s}, p0/z, x0, #6, mul vl
1322
+    ld1b            {z9.s}, p0/z, x0, #7, mul vl
1323
+    ld1b            {z23.s}, p0/z, x2
1324
+    ld1b            {z24.s}, p0/z, x2, #1, mul vl
1325
+    ld1b            {z25.s}, p0/z, x2, #2, mul vl
1326
+    ld1b            {z26.s}, p0/z, x2, #3, mul vl
1327
+    ld1b            {z27.s}, p0/z, x2, #4, mul vl
1328
+    ld1b            {z28.s}, p0/z, x2, #5, mul vl
1329
+    ld1b            {z29.s}, p0/z, x2, #6, mul vl
1330
+    ld1b            {z30.s}, p0/z, x2, #7, mul vl
1331
+    ssimDist_1_sve2 z2, z3, z23, z24
1332
+    ssimDist_1_sve2 z4, z5, z25, z26
1333
+    ssimDist_1_sve2 z6, z7, z27, z28
1334
+    ssimDist_1_sve2 z8, z9, z29, z30
1335
+    mov             x4, x0
1336
+    mov             x5, x2
1337
+    add             x4, x4, #32
1338
+    add             x5, x5, #32
1339
+    ld1b            {z2.s}, p0/z, x4
1340
+    ld1b            {z3.s}, p0/z, x4, #1, mul vl
1341
+    ld1b            {z4.s}, p0/z, x4, #2, mul vl
1342
+    ld1b            {z5.s}, p0/z, x4, #3, mul vl
1343
+    ld1b            {z6.s}, p0/z, x4, #4, mul vl
1344
+    ld1b            {z7.s}, p0/z, x4, #5, mul vl
1345
+    ld1b            {z8.s}, p0/z, x4, #6, mul vl
1346
+    ld1b            {z9.s}, p0/z, x4, #7, mul vl
1347
+    ld1b            {z23.s}, p0/z, x5
1348
+    ld1b            {z24.s}, p0/z, x5, #1, mul vl
1349
+    ld1b            {z25.s}, p0/z, x5, #2, mul vl
1350
+    ld1b            {z26.s}, p0/z, x5, #3, mul vl
1351
+    ld1b            {z27.s}, p0/z, x5, #4, mul vl
1352
+    ld1b            {z28.s}, p0/z, x5, #5, mul vl
1353
+    ld1b            {z29.s}, p0/z, x5, #6, mul vl
1354
+    ld1b            {z30.s}, p0/z, x5, #7, mul vl
1355
+    ssimDist_1_sve2 z2, z3, z23, z24
1356
+    ssimDist_1_sve2 z4, z5, z25, z26
1357
+    ssimDist_1_sve2 z6, z7, z27, z28
1358
+    ssimDist_1_sve2 z8, z9, z29, z30
1359
+    add             x0, x0, x1
1360
+    add             x2, x2, x3
1361
+    cbnz            w12, .loop_ssimDist64_sve2
1362
+    ssimDist_end
1363
+    ret
1364
+.vl_gt_16_ssimDist64:
1365
+    cmp             x9, #48
1366
+    bgt             .vl_gt_48_ssimDist64
1367
+    ssimDist_start_sve2
1368
+    ptrue           p0.s, vl8
1369
+.vl_gt_16_loop_ssimDist64_sve2:
1370
+    sub             w12, w12, #1
1371
+    ld1b            {z2.s}, p0/z, x0
1372
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1373
+    ld1b            {z4.s}, p0/z, x0, #2, mul vl
1374
+    ld1b            {z5.s}, p0/z, x0, #3, mul vl
1375
+    ld1b            {z6.s}, p0/z, x0, #4, mul vl
1376
+    ld1b            {z7.s}, p0/z, x0, #5, mul vl
1377
+    ld1b            {z8.s}, p0/z, x0, #6, mul vl
1378
+    ld1b            {z9.s}, p0/z, x0, #7, mul vl
1379
+    ld1b            {z23.s}, p0/z, x2
1380
+    ld1b            {z24.s}, p0/z, x2, #1, mul vl
1381
+    ld1b            {z25.s}, p0/z, x2, #2, mul vl
1382
+    ld1b            {z26.s}, p0/z, x2, #3, mul vl
1383
+    ld1b            {z27.s}, p0/z, x2, #4, mul vl
1384
+    ld1b            {z28.s}, p0/z, x2, #5, mul vl
1385
+    ld1b            {z29.s}, p0/z, x2, #6, mul vl
1386
+    ld1b            {z30.s}, p0/z, x2, #7, mul vl
1387
+    ssimDist_1_sve2 z2, z3, z23, z24
1388
+    ssimDist_1_sve2 z4, z5, z25, z26
1389
+    ssimDist_1_sve2 z6, z7, z27, z28
1390
+    ssimDist_1_sve2 z8, z9, z29, z30
1391
+    add             x0, x0, x1
1392
+    add             x2, x2, x3
1393
+    cbnz            w12, .vl_gt_16_loop_ssimDist64_sve2
1394
+    ssimDist_end_sve2
1395
+    ret
1396
+.vl_gt_48_ssimDist64:
1397
+    cmp             x9, #112
1398
+    bgt             .vl_gt_112_ssimDist64
1399
+    ssimDist_start_sve2
1400
+    ptrue           p0.s, vl16
1401
+.vl_gt_48_loop_ssimDist64_sve2:
1402
+    sub             w12, w12, #1
1403
+    ld1b            {z2.s}, p0/z, x0
1404
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1405
+    ld1b            {z4.s}, p0/z, x0, #2, mul vl
1406
+    ld1b            {z5.s}, p0/z, x0, #3, mul vl
1407
+    ld1b            {z23.s}, p0/z, x2
1408
+    ld1b            {z24.s}, p0/z, x2, #1, mul vl
1409
+    ld1b            {z25.s}, p0/z, x2, #2, mul vl
1410
+    ld1b            {z26.s}, p0/z, x2, #3, mul vl
1411
+    ssimDist_1_sve2 z2, z3, z23, z24
1412
+    ssimDist_1_sve2 z4, z5, z25, z26
1413
+    add             x0, x0, x1
1414
+    add             x2, x2, x3
1415
+    cbnz            w12, .vl_gt_48_loop_ssimDist64_sve2
1416
+    ssimDist_end_sve2
1417
+    ret
1418
+.vl_gt_112_ssimDist64:
1419
+    ssimDist_start_sve2
1420
+    ptrue           p0.s, vl32
1421
+.vl_gt_112_loop_ssimDist64_sve2:
1422
+    sub             w12, w12, #1
1423
+    ld1b            {z2.s}, p0/z, x0
1424
+    ld1b            {z3.s}, p0/z, x0, #1, mul vl
1425
+    ld1b            {z23.s}, p0/z, x2
1426
+    ld1b            {z24.s}, p0/z, x2, #1, mul vl
1427
+    ssimDist_1_sve2 z2, z3, z23, z24
1428
+    add             x0, x0, x1
1429
+    add             x2, x2, x3
1430
+    cbnz            w12, .vl_gt_112_loop_ssimDist64_sve2
1431
+    ssimDist_end_sve2
1432
+    ret
1433
+endfunc
1434
+
1435
+// void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
1436
+.macro normFact_start_sve2
1437
+    mov             z0.d, #0
1438
+.endm
1439
+
1440
+.macro normFact_1_sve2  z0, z1
1441
+    mul             z16.s, \z0\().s, \z0\().s
1442
+    mul             z17.s, \z1\().s, \z1\().s
1443
+    add             z0.s, z0.s, z16.s
1444
+    add             z0.s, z0.s, z17.s
1445
+.endm
1446
+
1447
+.macro normFact_end_sve2
1448
+    uaddv           d0, p0, z0.s
1449
+    str             d0, [x3]
1450
+.endm
1451
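The normFact_c prototype in the comment above describes what the normFactN kernels compute: the sum of squared source samples over a blockSize x blockSize block, written to *z_k. A scalar sketch follows; the shift argument is not referenced by the 8-bit paths shown here, so the sketch ignores it.

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of normFact: *z_k = sum(src[i]^2) over the block.
    static void normFact_ref(const pixel* src, uint32_t blockSize, int /*shift*/, uint64_t* z_k)
    {
        uint64_t acc = 0;
        for (uint32_t y = 0; y < blockSize; y++)
            for (uint32_t x = 0; x < blockSize; x++)
                acc += (uint64_t)src[y * blockSize + x] * src[y * blockSize + x];
        *z_k = acc;  // the assembly reduces z0 with uaddv and stores it through x3
    }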
+
1452
+function PFX(normFact8_sve2)
1453
+    rdvl            x9, #1
1454
+    cmp             x9, #16
1455
+    bgt             .vl_gt_16_normFact8
1456
+    normFact_start
1457
+    ptrue           p0.s, vl4
1458
+.rept 8
1459
+    ld1b            {z4.s}, p0/z, x0
1460
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1461
+    add             x0, x0, x1
1462
+    normFact_1_sve2 z4, z5
1463
+.endr
1464
+    normFact_end
1465
+    ret
1466
+.vl_gt_16_normFact8:
1467
+    normFact_start_sve2
1468
+    ptrue           p0.s, vl8
1469
+.rept 8
1470
+    ld1b            {z4.s}, p0/z, x0
1471
+    add             x0, x0, x1
1472
+    mul             z16.s, z4.s, z4.s
1473
+    add             z0.s, z0.s, z16.s
1474
+.endr
1475
+    normFact_end_sve2
1476
+    ret
1477
+endfunc
1478
+
1479
+function PFX(normFact16_sve2)
1480
+    mov             w12, #16
1481
+    rdvl            x9, #1
1482
+    cmp             x9, #16
1483
+    bgt             .vl_gt_16_normFact16
1484
+    normFact_start
1485
+    ptrue           p0.s, vl4
1486
+.loop_normFact16_sve2:
1487
+    sub             w12, w12, #1
1488
+    ld1b            {z4.s}, p0/z, x0
1489
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1490
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1491
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1492
+    add             x0, x0, x1
1493
+    normFact_1_sve2 z4, z5
1494
+    normFact_1_sve2 z6, z7
1495
+    cbnz            w12, .loop_normFact16_sve2
1496
+    normFact_end
1497
+    ret
1498
+.vl_gt_16_normFact16:
1499
+    cmp             x9, #48
1500
+    bgt             .vl_gt_48_normFact16
1501
+    normFact_start_sve2
1502
+    ptrue           p0.s, vl8
1503
+.vl_gt_16_loop_normFact16_sve2:
1504
+    sub             w12, w12, #1
1505
+    ld1b            {z4.s}, p0/z, x0
1506
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1507
+    add             x0, x0, x1
1508
+    normFact_1_sve2 z4, z5
1509
+    cbnz            w12, .vl_gt_16_loop_normFact16_sve2
1510
+    normFact_end_sve2
1511
+    ret
1512
+.vl_gt_48_normFact16:
1513
+    normFact_start_sve2
1514
+    ptrue           p0.s, vl16
1515
+.vl_gt_48_loop_normFact16_sve2:
1516
+    sub             w12, w12, #1
1517
+    ld1b            {z4.s}, p0/z, x0
1518
+    add             x0, x0, x1
1519
+    mul             z16.s, z4.s, z4.s
1520
+    add             z0.s, z0.s, z16.s
1521
+    cbnz            w12, .vl_gt_48_loop_normFact16_sve2
1522
+    normFact_end_sve2
1523
+    ret
1524
+endfunc
1525
+
1526
+function PFX(normFact32_sve2)
1527
+    mov             w12, #32
1528
+    rdvl            x9, #1
1529
+    cmp             x9, #16
1530
+    bgt             .vl_gt_16_normFact32
1531
+    normFact_start
1532
+    ptrue           p0.s, vl4
1533
+.loop_normFact32_sve2:
1534
+    sub             w12, w12, #1
1535
+    ld1b            {z4.s}, p0/z, x0
1536
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1537
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1538
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1539
+    ld1b            {z8.s}, p0/z, x0, #4, mul vl
1540
+    ld1b            {z9.s}, p0/z, x0, #5, mul vl
1541
+    ld1b            {z10.s}, p0/z, x0, #6, mul vl
1542
+    ld1b            {z11.s}, p0/z, x0, #7, mul vl
1543
+    add             x0, x0, x1
1544
+    normFact_1_sve2 z4, z5
1545
+    normFact_1_sve2 z6, z7
1546
+    normFact_1_sve2 z8, z9
1547
+    normFact_1_sve2 z10, z11
1548
+    cbnz            w12, .loop_normFact32_sve2
1549
+    normFact_end
1550
+    ret
1551
+.vl_gt_16_normFact32:
1552
+    cmp             x9, #48
1553
+    bgt             .vl_gt_48_normFact32
1554
+    normFact_start_sve2
1555
+    ptrue           p0.s, vl8
1556
+.vl_gt_16_loop_normFact32_sve2:
1557
+    sub             w12, w12, #1
1558
+    ld1b            {z4.s}, p0/z, x0
1559
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1560
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1561
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1562
+    add             x0, x0, x1
1563
+    normFact_1_sve2 z4, z5
1564
+    normFact_1_sve2 z6, z7
1565
+    cbnz            w12, .vl_gt_16_loop_normFact32_sve2
1566
+    normFact_end_sve2
1567
+    ret
1568
+.vl_gt_48_normFact32:
1569
+    cmp             x9, #112
1570
+    bgt             .vl_gt_112_normFact32
1571
+    normFact_start_sve2
1572
+    ptrue           p0.s, vl16
1573
+.vl_gt_48_loop_normFact32_sve2:
1574
+    sub             w12, w12, #1
1575
+    ld1b            {z4.s}, p0/z, x0
1576
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1577
+    add             x0, x0, x1
1578
+    normFact_1_sve2 z4, z5
1579
+    cbnz            w12, .vl_gt_48_loop_normFact32_sve2
1580
+    normFact_end_sve2
1581
+    ret
1582
+.vl_gt_112_normFact32:
1583
+    normFact_start_sve2
1584
+    ptrue           p0.s, vl32
1585
+.vl_gt_112_loop_normFact32_sve2:
1586
+    sub             w12, w12, #1
1587
+    ld1b            {z4.s}, p0/z, x0
1588
+    add             x0, x0, x1
1589
+    mul             z16.s, z4.s, z4.s
1590
+    add             z0.s, z0.s, z16.s
1591
+    cbnz            w12, .vl_gt_112_loop_normFact32_sve2
1592
+    normFact_end_sve2
1593
+    ret
1594
+endfunc
1595
+
1596
+function PFX(normFact64_sve2)
1597
+    mov             w12, #64
1598
+    rdvl            x9, #1
1599
+    cmp             x9, #16
1600
+    bgt             .vl_gt_16_normFact64
1601
+    normFact_start
1602
+    ptrue           p0.s, vl4
1603
+.loop_normFact64_sve2:
1604
+    sub             w12, w12, #1
1605
+    ld1b            {z4.s}, p0/z, x0
1606
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1607
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1608
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1609
+    ld1b            {z8.s}, p0/z, x0, #4, mul vl
1610
+    ld1b            {z9.s}, p0/z, x0, #5, mul vl
1611
+    ld1b            {z10.s}, p0/z, x0, #6, mul vl
1612
+    ld1b            {z11.s}, p0/z, x0, #7, mul vl
1613
+    normFact_1_sve2 z4, z5
1614
+    normFact_1_sve2 z6, z7
1615
+    normFact_1_sve2 z8, z9
1616
+    normFact_1_sve2 z10, z11
1617
+    mov             x2, x0
1618
+    add             x2, x2, #32
1619
+    ld1b            {z4.s}, p0/z, x2
1620
+    ld1b            {z5.s}, p0/z, x2, #1, mul vl
1621
+    ld1b            {z6.s}, p0/z, x2, #2, mul vl
1622
+    ld1b            {z7.s}, p0/z, x2, #3, mul vl
1623
+    ld1b            {z8.s}, p0/z, x2, #4, mul vl
1624
+    ld1b            {z9.s}, p0/z, x2, #5, mul vl
1625
+    ld1b            {z10.s}, p0/z, x2, #6, mul vl
1626
+    ld1b            {z11.s}, p0/z, x2, #7, mul vl
1627
+    normFact_1_sve2 z4, z5
1628
+    normFact_1_sve2 z6, z7
1629
+    normFact_1_sve2 z8, z9
1630
+    normFact_1_sve2 z10, z11
1631
+    add             x0, x0, x1
1632
+    cbnz            w12, .loop_normFact64_sve2
1633
+    normFact_end
1634
+    ret
1635
+.vl_gt_16_normFact64:
1636
+    cmp             x9, #48
1637
+    bgt             .vl_gt_48_normFact64
1638
+    normFact_start_sve2
1639
+    ptrue           p0.s, vl8
1640
+.vl_gt_16_loop_normFact64_sve2:
1641
+    sub             w12, w12, #1
1642
+    ld1b            {z4.s}, p0/z, x0
1643
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1644
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1645
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1646
+    ld1b            {z8.s}, p0/z, x0, #4, mul vl
1647
+    ld1b            {z9.s}, p0/z, x0, #5, mul vl
1648
+    ld1b            {z10.s}, p0/z, x0, #6, mul vl
1649
+    ld1b            {z11.s}, p0/z, x0, #7, mul vl
1650
+    normFact_1_sve2 z4, z5
1651
+    normFact_1_sve2 z6, z7
1652
+    normFact_1_sve2 z8, z9
1653
+    normFact_1_sve2 z10, z11
1654
+    add             x0, x0, x1
1655
+    cbnz            w12, .vl_gt_16_loop_normFact64_sve2
1656
+    normFact_end_sve2
1657
+    ret
1658
+.vl_gt_48_normFact64:
1659
+    cmp             x9, #112
1660
+    bgt             .vl_gt_112_normFact64
1661
+    normFact_start_sve2
1662
+    ptrue           p0.s, vl16
1663
+.vl_gt_48_loop_normFact64_sve2:
1664
+    sub             w12, w12, #1
1665
+    ld1b            {z4.s}, p0/z, x0
1666
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1667
+    ld1b            {z6.s}, p0/z, x0, #2, mul vl
1668
+    ld1b            {z7.s}, p0/z, x0, #3, mul vl
1669
+    normFact_1_sve2 z4, z5
1670
+    normFact_1_sve2 z6, z7
1671
+    add             x0, x0, x1
1672
+    cbnz            w12, .vl_gt_48_loop_normFact64_sve2
1673
+    normFact_end_sve2
1674
+    ret
1675
+.vl_gt_112_normFact64:
1676
+    normFact_start_sve2
1677
+    ptrue           p0.s, vl32
1678
+.vl_gt_112_loop_normFact64_sve2:
1679
+    sub             w12, w12, #1
1680
+    ld1b            {z4.s}, p0/z, x0
1681
+    ld1b            {z5.s}, p0/z, x0, #1, mul vl
1682
+    normFact_1_sve2 z4, z5
1683
+    add             x0, x0, x1
1684
+    cbnz            w12, .vl_gt_112_loop_normFact64_sve2
1685
+    normFact_end_sve2
1686
+    ret
1687
+endfunc
1688
x265_3.5.tar.gz/source/common/aarch64/pixel-util.S -> x265_3.6.tar.gz/source/common/aarch64/pixel-util.S Changed
2419
 
1
@@ -1,8 +1,9 @@
2
 /*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
+ * Copyright (C) 2020-2021 MulticoreWare, Inc
5
  *
6
  * Authors: Yimeng Su <yimeng.su@huawei.com>
7
  *          Hongbin Liu <liuhongbin1@huawei.com>
8
+ *          Sebastian Pop <spop@amazon.com>
9
  *
10
  * This program is free software; you can redistribute it and/or modify
11
  * it under the terms of the GNU General Public License as published by
12
@@ -23,13 +24,652 @@
13
  *****************************************************************************/
14
 
15
 #include "asm.S"
16
+#include "pixel-util-common.S"
17
 
18
+#ifdef __APPLE__
19
+.section __RODATA,__rodata
20
+#else
21
 .section .rodata
22
+#endif
23
 
24
 .align 4
25
 
26
 .text
27
 
28
+// uint64_t pixel_var(const pixel* pix, intptr_t i_stride)
29
+function PFX(pixel_var_8x8_neon)
30
+    ld1             {v4.8b}, [x0], x1        // pix[x]
31
+    uxtl            v0.8h, v4.8b             // sum = pix[x]
32
+    umull           v1.8h, v4.8b, v4.8b
33
+    uaddlp          v1.4s, v1.8h             // sqr = pix[x] * pix[x]
34
+
35
+.rept 7
36
+    ld1             {v4.8b}, [x0], x1        // pix[x]
37
+    umull           v31.8h, v4.8b, v4.8b
38
+    uaddw           v0.8h, v0.8h, v4.8b      // sum += pix[x]
39
+    uadalp          v1.4s, v31.8h            // sqr += pix[x] * pix[x]
40
+.endr
41
+    uaddlv          s0, v0.8h
42
+    uaddlv          d1, v1.4s
43
+    fmov            w0, s0
44
+    fmov            x1, d1
45
+    orr             x0, x0, x1, lsl #32      // return sum + ((uint64_t)sqr << 32);
46
+    ret
47
+endfunc
48
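pixel_var_8x8_neon returns two 32-bit sums packed into one 64-bit value, the pixel sum in the low half and the sum of squared pixels in the high half, exactly as the comment on the final orr instruction says. A scalar equivalent for the 8x8 case:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of pixel_var_8x8: low 32 bits = sum, high 32 bits = sum of squares.
    static uint64_t pixel_var_8x8_ref(const pixel* pix, intptr_t i_stride)
    {
        uint32_t sum = 0, sqr = 0;
        for (int y = 0; y < 8; y++)
        {
            for (int x = 0; x < 8; x++)
            {
                sum += pix[x];
                sqr += (uint32_t)pix[x] * pix[x];
            }
            pix += i_stride;
        }
        return sum + ((uint64_t)sqr << 32);
    }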
+
49
+function PFX(pixel_var_16x16_neon)
50
+    pixel_var_start
51
+    mov             w12, #16
52
+.loop_var_16:
53
+    sub             w12, w12, #1
54
+    ld1             {v4.16b}, [x0], x1
55
+    pixel_var_1 v4
56
+    cbnz            w12, .loop_var_16
57
+    pixel_var_end
58
+    ret
59
+endfunc
60
+
61
+function PFX(pixel_var_32x32_neon)
62
+    pixel_var_start
63
+    mov             w12, #32
64
+.loop_var_32:
65
+    sub             w12, w12, #1
66
+    ld1             {v4.16b-v5.16b}, [x0], x1
67
+    pixel_var_1 v4
68
+    pixel_var_1 v5
69
+    cbnz            w12, .loop_var_32
70
+    pixel_var_end
71
+    ret
72
+endfunc
73
+
74
+function PFX(pixel_var_64x64_neon)
75
+    pixel_var_start
76
+    mov             w12, #64
77
+.loop_var_64:
78
+    sub             w12, w12, #1
79
+    ld1             {v4.16b-v7.16b}, [x0], x1
80
+    pixel_var_1 v4
81
+    pixel_var_1 v5
82
+    pixel_var_1 v6
83
+    pixel_var_1 v7
84
+    cbnz            w12, .loop_var_64
85
+    pixel_var_end
86
+    ret
87
+endfunc
88
+
89
+// void getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride)
90
+function PFX(getResidual4_neon)
91
+    lsl             x4, x3, #1
92
+.rept 2
93
+    ld1             {v0.8b}, x0, x3
94
+    ld1             {v1.8b}, x1, x3
95
+    ld1             {v2.8b}, x0, x3
96
+    ld1             {v3.8b}, x1, x3
97
+    usubl           v4.8h, v0.8b, v1.8b
98
+    usubl           v5.8h, v2.8b, v3.8b
99
+    st1             {v4.8b}, x2, x4
100
+    st1             {v5.8b}, x2, x4
101
+.endr
102
+    ret
103
+endfunc
104
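getResidual widens and subtracts: every output int16_t is the source pixel minus the co-located prediction pixel, which the usubl/usubl2 instructions in these functions do eight or sixteen lanes at a time. A scalar sketch covering the 4/8/16/32 variants through an illustrative template parameter:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of getResidualN: residual[x] = fenc[x] - pred[x], widened to int16_t.
    // All three buffers share one stride, as in the prototype comment above.
    template <int N>
    static void getResidual_ref(const pixel* fenc, const pixel* pred,
                                int16_t* residual, intptr_t stride)
    {
        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++)
                residual[x] = (int16_t)(fenc[x] - pred[x]);
            fenc     += stride;
            pred     += stride;
            residual += stride;
        }
    }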
+
105
+function PFX(getResidual8_neon)
106
+    lsl             x4, x3, #1
107
+.rept 4
108
+    ld1             {v0.8b}, x0, x3
109
+    ld1             {v1.8b}, x1, x3
110
+    ld1             {v2.8b}, x0, x3
111
+    ld1             {v3.8b}, x1, x3
112
+    usubl           v4.8h, v0.8b, v1.8b
113
+    usubl           v5.8h, v2.8b, v3.8b
114
+    st1             {v4.16b}, x2, x4
115
+    st1             {v5.16b}, x2, x4
116
+.endr
117
+    ret
118
+endfunc
119
+
120
+function PFX(getResidual16_neon)
121
+    lsl             x4, x3, #1
122
+.rept 8
123
+    ld1             {v0.16b}, x0, x3
124
+    ld1             {v1.16b}, x1, x3
125
+    ld1             {v2.16b}, x0, x3
126
+    ld1             {v3.16b}, x1, x3
127
+    usubl           v4.8h, v0.8b, v1.8b
128
+    usubl2          v5.8h, v0.16b, v1.16b
129
+    usubl           v6.8h, v2.8b, v3.8b
130
+    usubl2          v7.8h, v2.16b, v3.16b
131
+    st1             {v4.8h-v5.8h}, x2, x4
132
+    st1             {v6.8h-v7.8h}, x2, x4
133
+.endr
134
+    ret
135
+endfunc
136
+
137
+function PFX(getResidual32_neon)
138
+    lsl             x4, x3, #1
139
+    mov             w12, #4
140
+.loop_residual_32:
141
+    sub             w12, w12, #1
142
+.rept 4
143
+    ld1             {v0.16b-v1.16b}, x0, x3
144
+    ld1             {v2.16b-v3.16b}, x1, x3
145
+    ld1             {v4.16b-v5.16b}, x0, x3
146
+    ld1             {v6.16b-v7.16b}, x1, x3
147
+    usubl           v16.8h, v0.8b, v2.8b
148
+    usubl2          v17.8h, v0.16b, v2.16b
149
+    usubl           v18.8h, v1.8b, v3.8b
150
+    usubl2          v19.8h, v1.16b, v3.16b
151
+    usubl           v20.8h, v4.8b, v6.8b
152
+    usubl2          v21.8h, v4.16b, v6.16b
153
+    usubl           v22.8h, v5.8b, v7.8b
154
+    usubl2          v23.8h, v5.16b, v7.16b
155
+    st1             {v16.8h-v19.8h}, x2, x4
156
+    st1             {v20.8h-v23.8h}, x2, x4
157
+.endr
158
+    cbnz            w12, .loop_residual_32
159
+    ret
160
+endfunc
161
+
162
+// void pixel_sub_ps_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1)
163
+function PFX(pixel_sub_ps_4x4_neon)
164
+    lsl             x1, x1, #1
165
+.rept 2
166
+    ld1             {v0.8b}, x2, x4
167
+    ld1             {v1.8b}, x3, x5
168
+    ld1             {v2.8b}, x2, x4
169
+    ld1             {v3.8b}, x3, x5
170
+    usubl           v4.8h, v0.8b, v1.8b
171
+    usubl           v5.8h, v2.8b, v3.8b
172
+    st1             {v4.4h}, x0, x1
173
+    st1             {v5.4h}, x0, x1
174
+.endr
175
+    ret
176
+endfunc
177
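pixel_sub_ps performs the same subtraction with separate source strides and a destination stride counted in int16_t elements; that is why each variant starts with lsl x1, x1, #1, turning the element stride into the byte offset the post-incremented stores need. A scalar sketch with the block size as illustrative template parameters:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of pixel_sub_ps: a[x] = b0[x] - b1[x] as int16_t.
    template <int W, int H>
    static void pixel_sub_ps_ref(int16_t* a, intptr_t dstride,
                                 const pixel* b0, const pixel* b1,
                                 intptr_t sstride0, intptr_t sstride1)
    {
        for (int y = 0; y < H; y++)
        {
            for (int x = 0; x < W; x++)
                a[x] = (int16_t)(b0[x] - b1[x]);
            a  += dstride;
            b0 += sstride0;
            b1 += sstride1;
        }
    }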
+
178
+function PFX(pixel_sub_ps_8x8_neon)
179
+    lsl             x1, x1, #1
180
+.rept 4
181
+    ld1             {v0.8b}, x2, x4
182
+    ld1             {v1.8b}, x3, x5
183
+    ld1             {v2.8b}, x2, x4
184
+    ld1             {v3.8b}, x3, x5
185
+    usubl           v4.8h, v0.8b, v1.8b
186
+    usubl           v5.8h, v2.8b, v3.8b
187
+    st1             {v4.8h}, x0, x1
188
+    st1             {v5.8h}, x0, x1
189
+.endr
190
+    ret
191
+endfunc
192
+
193
+function PFX(pixel_sub_ps_16x16_neon)
194
+    lsl             x1, x1, #1
195
+.rept 8
196
+    ld1             {v0.16b}, x2, x4
197
+    ld1             {v1.16b}, x3, x5
198
+    ld1             {v2.16b}, x2, x4
199
+    ld1             {v3.16b}, x3, x5
200
+    usubl           v4.8h, v0.8b, v1.8b
201
+    usubl2          v5.8h, v0.16b, v1.16b
202
+    usubl           v6.8h, v2.8b, v3.8b
203
+    usubl2          v7.8h, v2.16b, v3.16b
204
+    st1             {v4.8h-v5.8h}, x0, x1
205
+    st1             {v6.8h-v7.8h}, x0, x1
206
+.endr
207
+    ret
208
+endfunc
209
+
210
+function PFX(pixel_sub_ps_32x32_neon)
211
+    lsl             x1, x1, #1
212
+    mov             w12, #4
213
+.loop_sub_ps_32:
214
+    sub             w12, w12, #1
215
+.rept 4
216
+    ld1             {v0.16b-v1.16b}, x2, x4
217
+    ld1             {v2.16b-v3.16b}, x3, x5
218
+    ld1             {v4.16b-v5.16b}, x2, x4
219
+    ld1             {v6.16b-v7.16b}, x3, x5
220
+    usubl           v16.8h, v0.8b, v2.8b
221
+    usubl2          v17.8h, v0.16b, v2.16b
222
+    usubl           v18.8h, v1.8b, v3.8b
223
+    usubl2          v19.8h, v1.16b, v3.16b
224
+    usubl           v20.8h, v4.8b, v6.8b
225
+    usubl2          v21.8h, v4.16b, v6.16b
226
+    usubl           v22.8h, v5.8b, v7.8b
227
+    usubl2          v23.8h, v5.16b, v7.16b
228
+    st1             {v16.8h-v19.8h}, x0, x1
229
+    st1             {v20.8h-v23.8h}, x0, x1
230
+.endr
231
+    cbnz            w12, .loop_sub_ps_32
232
+    ret
233
+endfunc
234
+
235
+function PFX(pixel_sub_ps_64x64_neon)
236
+    lsl             x1, x1, #1
237
+    sub             x1, x1, #64
238
+    mov             w12, #16
239
+.loop_sub_ps_64:
240
+    sub             w12, w12, #1
241
+.rept 4
242
+    ld1             {v0.16b-v3.16b}, x2, x4
243
+    ld1             {v4.16b-v7.16b}, x3, x5
244
+    usubl           v16.8h, v0.8b, v4.8b
245
+    usubl2          v17.8h, v0.16b, v4.16b
246
+    usubl           v18.8h, v1.8b, v5.8b
247
+    usubl2          v19.8h, v1.16b, v5.16b
248
+    usubl           v20.8h, v2.8b, v6.8b
249
+    usubl2          v21.8h, v2.16b, v6.16b
250
+    usubl           v22.8h, v3.8b, v7.8b
251
+    usubl2          v23.8h, v3.16b, v7.16b
252
+    st1             {v16.8h-v19.8h}, x0, #64
253
+    st1             {v20.8h-v23.8h}, x0, x1
254
+.endr
255
+    cbnz            w12, .loop_sub_ps_64
256
+    ret
257
+endfunc
258
+
259
+// chroma sub_ps
260
+function PFX(pixel_sub_ps_4x8_neon)
261
+    lsl             x1, x1, #1
262
+.rept 4
263
+    ld1             {v0.8b}, x2, x4
264
+    ld1             {v1.8b}, x3, x5
265
+    ld1             {v2.8b}, x2, x4
266
+    ld1             {v3.8b}, x3, x5
267
+    usubl           v4.8h, v0.8b, v1.8b
268
+    usubl           v5.8h, v2.8b, v3.8b
269
+    st1             {v4.4h}, x0, x1
270
+    st1             {v5.4h}, x0, x1
271
+.endr
272
+    ret
273
+endfunc
274
+
275
+function PFX(pixel_sub_ps_8x16_neon)
276
+    lsl             x1, x1, #1
277
+.rept 8
278
+    ld1             {v0.8b}, x2, x4
279
+    ld1             {v1.8b}, x3, x5
280
+    ld1             {v2.8b}, x2, x4
281
+    ld1             {v3.8b}, x3, x5
282
+    usubl           v4.8h, v0.8b, v1.8b
283
+    usubl           v5.8h, v2.8b, v3.8b
284
+    st1             {v4.8h}, x0, x1
285
+    st1             {v5.8h}, x0, x1
286
+.endr
287
+    ret
288
+endfunc
289
+
290
+function PFX(pixel_sub_ps_16x32_neon)
291
+    lsl             x1, x1, #1
292
+.rept 16
293
+    ld1             {v0.16b}, x2, x4
294
+    ld1             {v1.16b}, x3, x5
295
+    ld1             {v2.16b}, x2, x4
296
+    ld1             {v3.16b}, x3, x5
297
+    usubl           v4.8h, v0.8b, v1.8b
298
+    usubl2          v5.8h, v0.16b, v1.16b
299
+    usubl           v6.8h, v2.8b, v3.8b
300
+    usubl2          v7.8h, v2.16b, v3.16b
301
+    st1             {v4.8h-v5.8h}, x0, x1
302
+    st1             {v6.8h-v7.8h}, x0, x1
303
+.endr
304
+    ret
305
+endfunc
306
+
307
+function PFX(pixel_sub_ps_32x64_neon)
308
+    lsl             x1, x1, #1
309
+    mov             w12, #8
310
+.loop_sub_ps_32x64:
311
+    sub             w12, w12, #1
312
+.rept 4
313
+    ld1             {v0.16b-v1.16b}, x2, x4
314
+    ld1             {v2.16b-v3.16b}, x3, x5
315
+    ld1             {v4.16b-v5.16b}, x2, x4
316
+    ld1             {v6.16b-v7.16b}, x3, x5
317
+    usubl           v16.8h, v0.8b, v2.8b
318
+    usubl2          v17.8h, v0.16b, v2.16b
319
+    usubl           v18.8h, v1.8b, v3.8b
320
+    usubl2          v19.8h, v1.16b, v3.16b
321
+    usubl           v20.8h, v4.8b, v6.8b
322
+    usubl2          v21.8h, v4.16b, v6.16b
323
+    usubl           v22.8h, v5.8b, v7.8b
324
+    usubl2          v23.8h, v5.16b, v7.16b
325
+    st1             {v16.8h-v19.8h}, x0, x1
326
+    st1             {v20.8h-v23.8h}, x0, x1
327
+.endr
328
+    cbnz            w12, .loop_sub_ps_32x64
329
+    ret
330
+endfunc
331
+
332
+// void x265_pixel_add_ps_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
333
+function PFX(pixel_add_ps_4x4_neon)
334
+    lsl             x5, x5, #1
335
+.rept 2
336
+    ld1             {v0.8b}, x2, x4
337
+    ld1             {v1.8b}, x2, x4
338
+    ld1             {v2.4h}, x3, x5
339
+    ld1             {v3.4h}, x3, x5
340
+    uxtl            v0.8h, v0.8b
341
+    uxtl            v1.8h, v1.8b
342
+    add             v4.8h, v0.8h, v2.8h
343
+    add             v5.8h, v1.8h, v3.8h
344
+    sqxtun          v4.8b, v4.8h
345
+    sqxtun          v5.8b, v5.8h
346
+    st1             {v4.s}0, x0, x1
347
+    st1             {v5.s}0, x0, x1
348
+.endr
349
+    ret
350
+endfunc
351
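pixel_add_ps is the inverse step: it adds an int16_t residual block to a pixel prediction and clamps the result back into the 8-bit pixel range, which is what the sqxtun saturating-narrow instructions above provide. A scalar sketch:

    #include <algorithm>
    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of pixel_add_ps: a[x] = clip(b0[x] + b1[x]) into 0..255.
    template <int W, int H>
    static void pixel_add_ps_ref(pixel* a, intptr_t dstride,
                                 const pixel* b0, const int16_t* b1,
                                 intptr_t sstride0, intptr_t sstride1)
    {
        for (int y = 0; y < H; y++)
        {
            for (int x = 0; x < W; x++)
                a[x] = (pixel)std::min(255, std::max(0, b0[x] + b1[x]));
            a  += dstride;
            b0 += sstride0;
            b1 += sstride1;
        }
    }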
+
352
+function PFX(pixel_add_ps_8x8_neon)
353
+    lsl             x5, x5, #1
354
+.rept 4
355
+    ld1             {v0.8b}, x2, x4
356
+    ld1             {v1.8b}, x2, x4
357
+    ld1             {v2.8h}, x3, x5
358
+    ld1             {v3.8h}, x3, x5
359
+    uxtl            v0.8h, v0.8b
360
+    uxtl            v1.8h, v1.8b
361
+    add             v4.8h, v0.8h, v2.8h
362
+    add             v5.8h, v1.8h, v3.8h
363
+    sqxtun          v4.8b, v4.8h
364
+    sqxtun          v5.8b, v5.8h
365
+    st1             {v4.8b}, x0, x1
366
+    st1             {v5.8b}, x0, x1
367
+.endr
368
+    ret
369
+endfunc
370
+
371
+.macro pixel_add_ps_16xN_neon h
372
+function PFX(pixel_add_ps_16x\h\()_neon)
373
+    lsl             x5, x5, #1
374
+    mov             w12, #\h / 8
375
+.loop_add_ps_16x\h\():
376
+    sub             w12, w12, #1
377
+.rept 4
378
+    ld1             {v0.16b}, x2, x4
379
+    ld1             {v1.16b}, x2, x4
380
+    ld1             {v16.8h-v17.8h}, x3, x5
381
+    ld1             {v18.8h-v19.8h}, x3, x5
382
+    uxtl            v4.8h, v0.8b
383
+    uxtl2           v5.8h, v0.16b
384
+    uxtl            v6.8h, v1.8b
385
+    uxtl2           v7.8h, v1.16b
386
+    add             v24.8h, v4.8h, v16.8h
387
+    add             v25.8h, v5.8h, v17.8h
388
+    add             v26.8h, v6.8h, v18.8h
389
+    add             v27.8h, v7.8h, v19.8h
390
+    sqxtun          v4.8b, v24.8h
391
+    sqxtun2         v4.16b, v25.8h
392
+    sqxtun          v5.8b, v26.8h
393
+    sqxtun2         v5.16b, v27.8h
394
+    st1             {v4.16b}, x0, x1
395
+    st1             {v5.16b}, x0, x1
396
+.endr
397
+    cbnz            w12, .loop_add_ps_16x\h
398
+    ret
399
+endfunc
400
+.endm
401
+
402
+pixel_add_ps_16xN_neon 16
403
+pixel_add_ps_16xN_neon 32
404
+
405
+.macro pixel_add_ps_32xN_neon h
406
+ function PFX(pixel_add_ps_32x\h\()_neon)
407
+    lsl             x5, x5, #1
408
+    mov             w12, #\h / 4
409
+.loop_add_ps_32x\h\():
410
+    sub             w12, w12, #1
411
+.rept 4
412
+    ld1             {v0.16b-v1.16b}, x2, x4
413
+    ld1             {v16.8h-v19.8h}, x3, x5
414
+    uxtl            v4.8h, v0.8b
415
+    uxtl2           v5.8h, v0.16b
416
+    uxtl            v6.8h, v1.8b
417
+    uxtl2           v7.8h, v1.16b
418
+    add             v24.8h, v4.8h, v16.8h
419
+    add             v25.8h, v5.8h, v17.8h
420
+    add             v26.8h, v6.8h, v18.8h
421
+    add             v27.8h, v7.8h, v19.8h
422
+    sqxtun          v4.8b, v24.8h
423
+    sqxtun2         v4.16b, v25.8h
424
+    sqxtun          v5.8b, v26.8h
425
+    sqxtun2         v5.16b, v27.8h
426
+    st1             {v4.16b-v5.16b}, x0, x1
427
+.endr
428
+    cbnz            w12, .loop_add_ps_32x\h
429
+    ret
430
+endfunc
431
+.endm
432
+
433
+pixel_add_ps_32xN_neon 32
434
+pixel_add_ps_32xN_neon 64
435
+
436
+function PFX(pixel_add_ps_64x64_neon)
437
+    lsl             x5, x5, #1
438
+    sub             x5, x5, #64
439
+    mov             w12, #32
440
+.loop_add_ps_64x64:
441
+    sub             w12, w12, #1
442
+.rept 2
443
+    ld1             {v0.16b-v3.16b}, x2, x4
444
+    ld1             {v16.8h-v19.8h}, x3, #64
445
+    ld1             {v20.8h-v23.8h}, x3, x5
446
+    uxtl            v4.8h, v0.8b
447
+    uxtl2           v5.8h, v0.16b
448
+    uxtl            v6.8h, v1.8b
449
+    uxtl2           v7.8h, v1.16b
450
+    uxtl            v24.8h, v2.8b
451
+    uxtl2           v25.8h, v2.16b
452
+    uxtl            v26.8h, v3.8b
453
+    uxtl2           v27.8h, v3.16b
454
+    add             v0.8h, v4.8h, v16.8h
455
+    add             v1.8h, v5.8h, v17.8h
456
+    add             v2.8h, v6.8h, v18.8h
457
+    add             v3.8h, v7.8h, v19.8h
458
+    add             v4.8h, v24.8h, v20.8h
459
+    add             v5.8h, v25.8h, v21.8h
460
+    add             v6.8h, v26.8h, v22.8h
461
+    add             v7.8h, v27.8h, v23.8h
462
+    sqxtun          v0.8b, v0.8h
463
+    sqxtun2         v0.16b, v1.8h
464
+    sqxtun          v1.8b, v2.8h
465
+    sqxtun2         v1.16b, v3.8h
466
+    sqxtun          v2.8b, v4.8h
467
+    sqxtun2         v2.16b, v5.8h
468
+    sqxtun          v3.8b, v6.8h
469
+    sqxtun2         v3.16b, v7.8h
470
+    st1             {v0.16b-v3.16b}, x0, x1
471
+.endr
472
+    cbnz            w12, .loop_add_ps_64x64
473
+    ret
474
+endfunc
475
+
476
+// Chroma add_ps
477
+function PFX(pixel_add_ps_4x8_neon)
478
+    lsl             x5, x5, #1
479
+.rept 4
480
+    ld1             {v0.8b}, x2, x4
481
+    ld1             {v1.8b}, x2, x4
482
+    ld1             {v2.4h}, x3, x5
483
+    ld1             {v3.4h}, x3, x5
484
+    uxtl            v0.8h, v0.8b
485
+    uxtl            v1.8h, v1.8b
486
+    add             v4.8h, v0.8h, v2.8h
487
+    add             v5.8h, v1.8h, v3.8h
488
+    sqxtun          v4.8b, v4.8h
489
+    sqxtun          v5.8b, v5.8h
490
+    st1             {v4.s}0, x0, x1
491
+    st1             {v5.s}0, x0, x1
492
+.endr
493
+    ret
494
+endfunc
495
+
496
+function PFX(pixel_add_ps_8x16_neon)
497
+    lsl             x5, x5, #1
498
+.rept 8
499
+    ld1             {v0.8b}, x2, x4
500
+    ld1             {v1.8b}, x2, x4
501
+    ld1             {v2.8h}, x3, x5
502
+    ld1             {v3.8h}, x3, x5
503
+    uxtl            v0.8h, v0.8b
504
+    uxtl            v1.8h, v1.8b
505
+    add             v4.8h, v0.8h, v2.8h
506
+    add             v5.8h, v1.8h, v3.8h
507
+    sqxtun          v4.8b, v4.8h
508
+    sqxtun          v5.8b, v5.8h
509
+    st1             {v4.8b}, x0, x1
510
+    st1             {v5.8b}, x0, x1
511
+.endr
512
+    ret
513
+endfunc
514
+
515
+// void scale1D_128to64(pixel *dst, const pixel *src)
516
+function PFX(scale1D_128to64_neon)
517
+.rept 2
518
+    ld2             {v0.16b, v1.16b}, x1, #32
519
+    ld2             {v2.16b, v3.16b}, x1, #32
520
+    ld2             {v4.16b, v5.16b}, x1, #32
521
+    ld2             {v6.16b, v7.16b}, x1, #32
522
+    urhadd          v0.16b, v0.16b, v1.16b
523
+    urhadd          v1.16b, v2.16b, v3.16b
524
+    urhadd          v2.16b, v4.16b, v5.16b
525
+    urhadd          v3.16b, v6.16b, v7.16b
526
+    st1             {v0.16b-v3.16b}, x0, #64
527
+.endr
528
+    ret
529
+endfunc
530
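scale1D_128to64 halves a 128-pixel row to 64 pixels by averaging adjacent pairs with rounding: ld2 de-interleaves even and odd pixels and urhadd is an unsigned rounding halving add. A scalar sketch:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of scale1D_128to64: dst[i] = (src[2i] + src[2i+1] + 1) >> 1.
    static void scale1D_128to64_ref(pixel* dst, const pixel* src)
    {
        for (int i = 0; i < 64; i++)
            dst[i] = (pixel)((src[2 * i] + src[2 * i + 1] + 1) >> 1);
    }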
+
531
+.macro scale2D_1  v0, v1
532
+    uaddlp          \v0\().8h, \v0\().16b
533
+    uaddlp          \v1\().8h, \v1\().16b
534
+    add             \v0\().8h, \v0\().8h, \v1\().8h
535
+.endm
536
+
537
+// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
538
+function PFX(scale2D_64to32_neon)
539
+    mov             w12, #32
540
+.loop_scale2D:
541
+    ld1             {v0.16b-v3.16b}, x1, x2
542
+    sub             w12, w12, #1
543
+    ld1             {v4.16b-v7.16b}, x1, x2
544
+    scale2D_1       v0, v4
545
+    scale2D_1       v1, v5
546
+    scale2D_1       v2, v6
547
+    scale2D_1       v3, v7
548
+    uqrshrn         v0.8b, v0.8h, #2
549
+    uqrshrn2        v0.16b, v1.8h, #2
550
+    uqrshrn         v1.8b, v2.8h, #2
551
+    uqrshrn2        v1.16b, v3.8h, #2
552
+    st1             {v0.16b-v1.16b}, x0, #32
553
+    cbnz            w12, .loop_scale2D
554
+    ret
555
+endfunc
556
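scale2D_64to32 is the two-dimensional version: each output pixel is the rounded average of a 2x2 input block. uaddlp sums horizontal pairs, the add merges two source rows, and uqrshrn #2 divides by four with rounding. A scalar sketch:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of scale2D_64to32: rounded 2x2 average into a contiguous 32x32 output.
    static void scale2D_64to32_ref(pixel* dst, const pixel* src, intptr_t stride)
    {
        for (int y = 0; y < 32; y++)
            for (int x = 0; x < 32; x++)
            {
                int sum = src[(2 * y) * stride + 2 * x]
                        + src[(2 * y) * stride + 2 * x + 1]
                        + src[(2 * y + 1) * stride + 2 * x]
                        + src[(2 * y + 1) * stride + 2 * x + 1];
                dst[y * 32 + x] = (pixel)((sum + 2) >> 2);
            }
    }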
+
557
+// void planecopy_cp_c(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift)
558
+function PFX(pixel_planecopy_cp_neon)
559
+    dup             v2.16b, w6
560
+    sub             x5, x5, #1
561
+.loop_h:
562
+    mov             x6, x0
563
+    mov             x12, x2
564
+    mov             x7, #0
565
+.loop_w:
566
+    ldr             q0, x6, #16
567
+    ushl            v0.16b, v0.16b, v2.16b
568
+    str             q0, x12, #16
569
+    add             x7, x7, #16
570
+    cmp             x7, x4
571
+    blt             .loop_w
572
+
573
+    add             x0, x0, x1
574
+    add             x2, x2, x3
575
+    sub             x5, x5, #1
576
+    cbnz            x5, .loop_h
577
+
578
+// handle last row
579
+    mov             x5, x4
580
+    lsr             x5, x5, #3
581
+.loopW8:
582
+    ldr             d0, x0, #8
583
+    ushl            v0.8b, v0.8b, v2.8b
584
+    str             d0, x2, #8
585
+    sub             x4, x4, #8
586
+    sub             x5, x5, #1
587
+    cbnz            x5, .loopW8
588
+
589
+    mov             x5, #8
590
+    sub             x5, x5, x4
591
+    sub             x0, x0, x5
592
+    sub             x2, x2, x5
593
+    ldr             d0, x0
594
+    ushl            v0.8b, v0.8b, v2.8b
595
+    str             d0, x2
596
+    ret
597
+endfunc
598
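planecopy_cp copies a uint8_t plane while applying a per-sample shift; the shift amount is broadcast from w6 and applied with ushl, and the code after the main loop handles the final row in smaller chunks. A scalar sketch, treating the shift as a left shift (an assumption of this sketch, consistent with ushl and a non-negative shift value):

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Scalar sketch of planecopy_cp: dst[x] = src[x] << shift, row by row.
    static void planecopy_cp_ref(const uint8_t* src, intptr_t srcStride,
                                 pixel* dst, intptr_t dstStride,
                                 int width, int height, int shift)
    {
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = (pixel)(src[x] << shift);
            src += srcStride;
            dst += dstStride;
        }
    }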
+
599
+//******* satd *******
600
+.macro satd_4x4_neon
601
+    ld1             {v0.s}0, x0, x1
602
+    ld1             {v0.s}1, x0, x1
603
+    ld1             {v1.s}0, x2, x3
604
+    ld1             {v1.s}1, x2, x3
605
+    ld1             {v2.s}0, x0, x1
606
+    ld1             {v2.s}1, x0, x1
607
+    ld1             {v3.s}0, x2, x3
608
+    ld1             {v3.s}1, x2, x3
609
+
610
+    usubl           v4.8h, v0.8b, v1.8b
611
+    usubl           v5.8h, v2.8b, v3.8b
612
+
613
+    add             v6.8h, v4.8h, v5.8h
614
+    sub             v7.8h, v4.8h, v5.8h
615
+
616
+    mov             v4.d0, v6.d1
617
+    add             v0.4h, v6.4h, v4.4h
618
+    sub             v2.4h, v6.4h, v4.4h
619
+
620
+    mov             v5.d0, v7.d1
621
+    add             v1.4h, v7.4h, v5.4h
622
+    sub             v3.4h, v7.4h, v5.4h
623
+
624
+    trn1            v4.4h, v0.4h, v1.4h
625
+    trn2            v5.4h, v0.4h, v1.4h
626
+
627
+    trn1            v6.4h, v2.4h, v3.4h
628
+    trn2            v7.4h, v2.4h, v3.4h
629
+
630
+    add             v0.4h, v4.4h, v5.4h
631
+    sub             v1.4h, v4.4h, v5.4h
632
+
633
+    add             v2.4h, v6.4h, v7.4h
634
+    sub             v3.4h, v6.4h, v7.4h
635
+
636
+    trn1            v4.2s, v0.2s, v1.2s
637
+    trn2            v5.2s, v0.2s, v1.2s
638
+
639
+    trn1            v6.2s, v2.2s, v3.2s
640
+    trn2            v7.2s, v2.2s, v3.2s
641
+
642
+    abs             v4.4h, v4.4h
643
+    abs             v5.4h, v5.4h
644
+    abs             v6.4h, v6.4h
645
+    abs             v7.4h, v7.4h
646
+
647
+    smax            v1.4h, v4.4h, v5.4h
648
+    smax            v2.4h, v6.4h, v7.4h
649
+
650
+    add             v0.4h, v1.4h, v2.4h
651
+    uaddlp          v0.2s, v0.4h
652
+    uaddlp          v0.1d, v0.2s
653
+.endm
654
+
655
+// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
656
+function PFX(pixel_satd_4x4_neon)
657
+    satd_4x4_neon
658
+    fmov            x0, d0
659
+    ret
660
+endfunc
661
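SATD is the sum of absolute values of a Hadamard-transformed difference block. The satd_4x4_neon macro above builds the 4x4 transform from add/sub and transpose steps and folds the absolute values with the usual smax shortcut, so its scale can differ from a textbook sum by a constant factor. A naive scalar sketch of the underlying transform-and-sum idea, without any final normalization:

    #include <cstdint>
    #include <cstdlib>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Naive SATD sketch for a 4x4 block: Hadamard rows, Hadamard columns, sum |.|.
    static int satd_4x4_ref(const pixel* pix1, intptr_t stride1,
                            const pixel* pix2, intptr_t stride2)
    {
        int d[4][4];
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                d[y][x] = pix1[y * stride1 + x] - pix2[y * stride2 + x];

        for (int y = 0; y < 4; y++)   // 1-D Hadamard on each row
        {
            int a0 = d[y][0] + d[y][1], a1 = d[y][0] - d[y][1];
            int a2 = d[y][2] + d[y][3], a3 = d[y][2] - d[y][3];
            d[y][0] = a0 + a2; d[y][2] = a0 - a2;
            d[y][1] = a1 + a3; d[y][3] = a1 - a3;
        }
        int sum = 0;
        for (int x = 0; x < 4; x++)   // 1-D Hadamard on each column, then accumulate
        {
            int a0 = d[0][x] + d[1][x], a1 = d[0][x] - d[1][x];
            int a2 = d[2][x] + d[3][x], a3 = d[2][x] - d[3][x];
            sum += std::abs(a0 + a2) + std::abs(a0 - a2)
                 + std::abs(a1 + a3) + std::abs(a1 - a3);
        }
        return sum;
    }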
+
662
 .macro x265_satd_4x8_8x4_end_neon
663
     add             v0.8h, v4.8h, v6.8h
664
     add             v1.8h, v5.8h, v7.8h
665
@@ -59,7 +699,7 @@
666
 .endm
667
 
668
 .macro pixel_satd_4x8_neon
669
-    ld1r             {v1.2s}, x2, x3
670
+    ld1r            {v1.2s}, x2, x3
671
     ld1r            {v0.2s}, x0, x1
672
     ld1r            {v3.2s}, x2, x3
673
     ld1r            {v2.2s}, x0, x1
674
@@ -82,129 +722,995 @@
675
     sub             v5.8h, v0.8h, v1.8h
676
     ld1             {v6.s}1, x0, x1
677
     usubl           v3.8h, v6.8b, v7.8b
678
-    add         v6.8h, v2.8h, v3.8h
679
-    sub         v7.8h, v2.8h, v3.8h
680
+    add             v6.8h, v2.8h, v3.8h
681
+    sub             v7.8h, v2.8h, v3.8h
682
     x265_satd_4x8_8x4_end_neon
683
 .endm
684
 
685
-// template<int w, int h>
686
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
687
-function x265_pixel_satd_4x8_neon
688
-    pixel_satd_4x8_neon
689
-    mov               w0, v0.s0
690
-    ret
691
+// template<int w, int h>
692
+// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
693
+function PFX(pixel_satd_4x8_neon)
694
+    pixel_satd_4x8_neon
695
+    mov             w0, v0.s0
696
+    ret
697
+endfunc
698
+
699
+function PFX(pixel_satd_4x16_neon)
700
+    mov             w4, #0
701
+    pixel_satd_4x8_neon
702
+    mov             w5, v0.s0
703
+    add             w4, w4, w5
704
+    pixel_satd_4x8_neon
705
+    mov             w5, v0.s0
706
+    add             w0, w5, w4
707
+    ret
708
+endfunc
709
+
710
+function PFX(pixel_satd_4x32_neon)
711
+    mov             w4, #0
712
+.rept 4
713
+    pixel_satd_4x8_neon
714
+    mov             w5, v0.s0
715
+    add             w4, w4, w5
716
+.endr
717
+    mov             w0, w4
718
+    ret
719
+endfunc
720
+
721
+function PFX(pixel_satd_12x16_neon)
722
+    mov             x4, x0
723
+    mov             x5, x2
724
+    mov             w7, #0
725
+    pixel_satd_4x8_neon
726
+    mov             w6, v0.s0
727
+    add             w7, w7, w6
728
+    pixel_satd_4x8_neon
729
+    mov             w6, v0.s0
730
+    add             w7, w7, w6
731
+
732
+    add             x0, x4, #4
733
+    add             x2, x5, #4
734
+    pixel_satd_4x8_neon
735
+    mov             w6, v0.s0
736
+    add             w7, w7, w6
737
+    pixel_satd_4x8_neon
738
+    mov             w6, v0.s0
739
+    add             w7, w7, w6
740
+
741
+    add             x0, x4, #8
742
+    add             x2, x5, #8
743
+    pixel_satd_4x8_neon
744
+    mov             w6, v0.s0
745
+    add             w7, w7, w6
746
+    pixel_satd_4x8_neon
747
+    mov             w6, v0.s0
748
+    add             w0, w7, w6
749
+    ret
750
+endfunc
751
+
752
+function PFX(pixel_satd_12x32_neon)
753
+    mov             x4, x0
754
+    mov             x5, x2
755
+    mov             w7, #0
756
+.rept 4
757
+    pixel_satd_4x8_neon
758
+    mov             w6, v0.s0
759
+    add             w7, w7, w6
760
+.endr
761
+
762
+    add             x0, x4, #4
763
+    add             x2, x5, #4
764
+.rept 4
765
+    pixel_satd_4x8_neon
766
+    mov             w6, v0.s0
767
+    add             w7, w7, w6
768
+.endr
769
+
770
+    add             x0, x4, #8
771
+    add             x2, x5, #8
772
+.rept 4
773
+    pixel_satd_4x8_neon
774
+    mov             w6, v0.s0
775
+    add             w7, w7, w6
776
+.endr
777
+
778
+    mov             w0, w7
779
+    ret
780
+endfunc
781
+
782
+function PFX(pixel_satd_8x4_neon)
783
+    mov             x4, x0
784
+    mov             x5, x2
785
+    satd_4x4_neon
786
+    add             x0, x4, #4
787
+    add             x2, x5, #4
788
+    umov            x6, v0.d0
789
+    satd_4x4_neon
790
+    umov            x0, v0.d0
791
+    add             x0, x0, x6
792
+    ret
793
+endfunc
794
+
795
+.macro LOAD_DIFF_8x4 v0 v1 v2 v3
796
+    ld1             {v0.8b}, x0, x1
797
+    ld1             {v1.8b}, x2, x3
798
+    ld1             {v2.8b}, x0, x1
799
+    ld1             {v3.8b}, x2, x3
800
+    ld1             {v4.8b}, x0, x1
801
+    ld1             {v5.8b}, x2, x3
802
+    ld1             {v6.8b}, x0, x1
803
+    ld1             {v7.8b}, x2, x3
804
+    usubl           \v0, v0.8b, v1.8b
805
+    usubl           \v1, v2.8b, v3.8b
806
+    usubl           \v2, v4.8b, v5.8b
807
+    usubl           \v3, v6.8b, v7.8b
808
+.endm
809
+
810
+.macro LOAD_DIFF_16x4 v0 v1 v2 v3 v4 v5 v6 v7
811
+    ld1             {v0.16b}, x0, x1
812
+    ld1             {v1.16b}, x2, x3
813
+    ld1             {v2.16b}, x0, x1
814
+    ld1             {v3.16b}, x2, x3
815
+    ld1             {v4.16b}, x0, x1
816
+    ld1             {v5.16b}, x2, x3
817
+    ld1             {v6.16b}, x0, x1
818
+    ld1             {v7.16b}, x2, x3
819
+    usubl           \v0, v0.8b, v1.8b
820
+    usubl           \v1, v2.8b, v3.8b
821
+    usubl           \v2, v4.8b, v5.8b
822
+    usubl           \v3, v6.8b, v7.8b
823
+    usubl2          \v4, v0.16b, v1.16b
824
+    usubl2          \v5, v2.16b, v3.16b
825
+    usubl2          \v6, v4.16b, v5.16b
826
+    usubl2          \v7, v6.16b, v7.16b
827
+.endm
828
+
829
+function PFX(satd_16x4_neon), export=0
830
+    LOAD_DIFF_16x4  v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h
831
+    b               PFX(satd_8x4v_8x8h_neon)
832
+endfunc
833
+
834
+function PFX(satd_8x8_neon), export=0
835
+    LOAD_DIFF_8x4   v16.8h, v17.8h, v18.8h, v19.8h
836
+    LOAD_DIFF_8x4   v20.8h, v21.8h, v22.8h, v23.8h
837
+    b               PFX(satd_8x4v_8x8h_neon)
838
+endfunc
839
+
840
+// one vertical hadamard pass and two horizontal
841
+function PFX(satd_8x4v_8x8h_neon), export=0
842
+    HADAMARD4_V     v16.8h, v18.8h, v17.8h, v19.8h, v0.8h, v2.8h, v1.8h, v3.8h
843
+    HADAMARD4_V     v20.8h, v21.8h, v22.8h, v23.8h, v0.8h, v1.8h, v2.8h, v3.8h
844
+    trn4            v0.8h, v1.8h, v2.8h, v3.8h, v16.8h, v17.8h, v18.8h, v19.8h
845
+    trn4            v4.8h, v5.8h, v6.8h, v7.8h, v20.8h, v21.8h, v22.8h, v23.8h
846
+    SUMSUB_ABCD     v16.8h, v17.8h, v18.8h, v19.8h, v0.8h, v1.8h, v2.8h, v3.8h
847
+    SUMSUB_ABCD     v20.8h, v21.8h, v22.8h, v23.8h, v4.8h, v5.8h, v6.8h, v7.8h
848
+    trn4            v0.4s, v2.4s, v1.4s, v3.4s, v16.4s, v18.4s, v17.4s, v19.4s
849
+    trn4            v4.4s, v6.4s, v5.4s, v7.4s, v20.4s, v22.4s, v21.4s, v23.4s
850
+    ABS8            v0.8h, v1.8h, v2.8h, v3.8h, v4.8h, v5.8h, v6.8h, v7.8h
851
+    smax            v0.8h, v0.8h, v2.8h
852
+    smax            v1.8h, v1.8h, v3.8h
853
+    smax            v2.8h, v4.8h, v6.8h
854
+    smax            v3.8h, v5.8h, v7.8h
855
+    ret
856
+endfunc
857
+
858
+function PFX(pixel_satd_8x8_neon)
859
+    mov             x10, x30
860
+    bl              PFX(satd_8x8_neon)
861
+    add             v0.8h, v0.8h, v1.8h
862
+    add             v1.8h, v2.8h, v3.8h
863
+    add             v0.8h, v0.8h, v1.8h
864
+    uaddlv          s0, v0.8h
865
+    mov             w0, v0.s0
866
+    ret             x10
867
+endfunc
868
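The larger SATD sizes are built from the 8x8 and 16x4 helpers: each call leaves per-lane partial sums in v0-v3, the callers that follow accumulate them into v30/v31, and a single uaddlv reduction at the end produces the scalar result. Conceptually the composition is just a tiling loop; the sketch below takes the tile kernel as a parameter because the real helper is internal to this assembly file:

    #include <cstdint>

    typedef uint8_t pixel;  // assuming the 8-bit pixel type

    // Illustrative composition of a W x H SATD from an 8x8 tile kernel.
    template <int W, int H>
    static int satd_tiled_ref(const pixel* pix1, intptr_t stride1,
                              const pixel* pix2, intptr_t stride2,
                              int (*satd8x8)(const pixel*, intptr_t, const pixel*, intptr_t))
    {
        int sum = 0;
        for (int y = 0; y < H; y += 8)
            for (int x = 0; x < W; x += 8)
                sum += satd8x8(pix1 + y * stride1 + x, stride1,
                               pix2 + y * stride2 + x, stride2);
        return sum;
    }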
+
869
+function PFX(pixel_satd_8x12_neon)
870
+    mov             x4, x0
871
+    mov             x5, x2
872
+    mov             x7, #0
873
+    satd_4x4_neon
874
+    umov            x6, v0.d0
875
+    add             x7, x7, x6
876
+    add             x0, x4, #4
877
+    add             x2, x5, #4
878
+    satd_4x4_neon
879
+    umov            x6, v0.d0
880
+    add             x7, x7, x6
881
+.rept 2
882
+    sub             x0, x0, #4
883
+    sub             x2, x2, #4
884
+    mov             x4, x0
885
+    mov             x5, x2
886
+    satd_4x4_neon
887
+    umov            x6, v0.d0
888
+    add             x7, x7, x6
889
+    add             x0, x4, #4
890
+    add             x2, x5, #4
891
+    satd_4x4_neon
892
+    umov            x6, v0.d0
893
+    add             x7, x7, x6
894
+.endr
895
+    mov             x0, x7
896
+    ret
897
+endfunc
898
+
899
+function PFX(pixel_satd_8x16_neon)
900
+    mov             x10, x30
901
+    bl              PFX(satd_8x8_neon)
902
+    add             v30.8h, v0.8h, v1.8h
903
+    add             v31.8h, v2.8h, v3.8h
904
+    bl              PFX(satd_8x8_neon)
905
+    add             v30.8h, v30.8h, v0.8h
906
+    add             v31.8h, v31.8h, v1.8h
907
+    add             v30.8h, v30.8h, v2.8h
908
+    add             v31.8h, v31.8h, v3.8h
909
+    add             v0.8h, v30.8h, v31.8h
910
+    uaddlv          s0, v0.8h
911
+    mov             w0, v0.s0
912
+    ret             x10
913
+endfunc
914
+
915
+function PFX(pixel_satd_8x32_neon)
916
+    mov             x10, x30
917
+    bl              PFX(satd_8x8_neon)
918
+    add             v30.8h, v0.8h, v1.8h
919
+    add             v31.8h, v2.8h, v3.8h
920
+.rept 3
921
+    bl              PFX(satd_8x8_neon)
922
+    add             v30.8h, v30.8h, v0.8h
923
+    add             v31.8h, v31.8h, v1.8h
924
+    add             v30.8h, v30.8h, v2.8h
925
+    add             v31.8h, v31.8h, v3.8h
926
+.endr
927
+    add             v0.8h, v30.8h, v31.8h
928
+    uaddlv          s0, v0.8h
929
+    mov             w0, v0.s0
930
+    ret             x10
931
+endfunc
932
+
933
+function PFX(pixel_satd_8x64_neon)
934
+    mov             x10, x30
935
+    bl              PFX(satd_8x8_neon)
936
+    add             v30.8h, v0.8h, v1.8h
937
+    add             v31.8h, v2.8h, v3.8h
938
+.rept 7
939
+    bl              PFX(satd_8x8_neon)
940
+    add             v30.8h, v30.8h, v0.8h
941
+    add             v31.8h, v31.8h, v1.8h
942
+    add             v30.8h, v30.8h, v2.8h
943
+    add             v31.8h, v31.8h, v3.8h
944
+.endr
945
+    add             v0.8h, v30.8h, v31.8h
946
+    uaddlv          s0, v0.8h
947
+    mov             w0, v0.s0
948
+    ret             x10
949
+endfunc
950
+
951
+function PFX(pixel_satd_16x4_neon)
952
+    mov             x10, x30
953
+    bl              PFX(satd_16x4_neon)
954
+    add             v30.8h, v0.8h, v1.8h
955
+    add             v31.8h, v2.8h, v3.8h
956
+    add             v0.8h, v30.8h, v31.8h
957
+    uaddlv          s0, v0.8h
958
+    mov             w0, v0.s0
959
+    ret             x10
960
+endfunc
961
+
962
+function PFX(pixel_satd_16x8_neon)
963
+    mov             x10, x30
964
+    bl              PFX(satd_16x4_neon)
965
+    add             v30.8h, v0.8h, v1.8h
966
+    add             v31.8h, v2.8h, v3.8h
967
+    bl              PFX(satd_16x4_neon)
968
+    add             v30.8h, v30.8h, v0.8h
969
+    add             v31.8h, v31.8h, v1.8h
970
+    add             v30.8h, v30.8h, v2.8h
971
+    add             v31.8h, v31.8h, v3.8h
972
+    add             v0.8h, v30.8h, v31.8h
973
+    uaddlv          s0, v0.8h
974
+    mov             w0, v0.s0
975
+    ret             x10
976
+endfunc
977
+
978
+function PFX(pixel_satd_16x12_neon)
979
+    mov             x10, x30
980
+    bl              PFX(satd_16x4_neon)
981
+    add             v30.8h, v0.8h, v1.8h
982
+    add             v31.8h, v2.8h, v3.8h
983
+.rept 2
984
+    bl              PFX(satd_16x4_neon)
985
+    add             v30.8h, v30.8h, v0.8h
986
+    add             v31.8h, v31.8h, v1.8h
987
+    add             v30.8h, v30.8h, v2.8h
988
+    add             v31.8h, v31.8h, v3.8h
989
+.endr
990
+    add             v0.8h, v30.8h, v31.8h
991
+    uaddlv          s0, v0.8h
992
+    mov             w0, v0.s0
993
+    ret             x10
994
+endfunc
995
+
996
+function PFX(pixel_satd_16x16_neon)
997
+    mov             x10, x30
998
+    bl              PFX(satd_16x4_neon)
999
+    add             v30.8h, v0.8h, v1.8h
1000
+    add             v31.8h, v2.8h, v3.8h
1001
+.rept 3
1002
+    bl              PFX(satd_16x4_neon)
1003
+    add             v30.8h, v30.8h, v0.8h
1004
+    add             v31.8h, v31.8h, v1.8h
1005
+    add             v30.8h, v30.8h, v2.8h
1006
+    add             v31.8h, v31.8h, v3.8h
1007
+.endr
1008
+    add             v0.8h, v30.8h, v31.8h
1009
+    uaddlv          s0, v0.8h
1010
+    mov             w0, v0.s0
1011
+    ret             x10
1012
+endfunc
1013
+
1014
+function PFX(pixel_satd_16x24_neon)
1015
+    mov             x10, x30
1016
+    bl              PFX(satd_16x4_neon)
1017
+    add             v30.8h, v0.8h, v1.8h
1018
+    add             v31.8h, v2.8h, v3.8h
1019
+.rept 5
1020
+    bl              PFX(satd_16x4_neon)
1021
+    add             v30.8h, v30.8h, v0.8h
1022
+    add             v31.8h, v31.8h, v1.8h
1023
+    add             v30.8h, v30.8h, v2.8h
1024
+    add             v31.8h, v31.8h, v3.8h
1025
+.endr
1026
+    add             v0.8h, v30.8h, v31.8h
1027
+    uaddlv          s0, v0.8h
1028
+    mov             w0, v0.s0
1029
+    ret             x10
1030
+endfunc
1031
+
1032
+.macro pixel_satd_16x32_neon
1033
+    bl              PFX(satd_16x4_neon)
1034
+    add             v30.8h, v0.8h, v1.8h
1035
+    add             v31.8h, v2.8h, v3.8h
1036
+.rept 7
1037
+    bl              PFX(satd_16x4_neon)
1038
+    add             v30.8h, v30.8h, v0.8h
1039
+    add             v31.8h, v31.8h, v1.8h
1040
+    add             v30.8h, v30.8h, v2.8h
1041
+    add             v31.8h, v31.8h, v3.8h
1042
+.endr
1043
+.endm
1044
+
1045
+function PFX(pixel_satd_16x32_neon)
1046
+    mov             x10, x30
1047
+    pixel_satd_16x32_neon
1048
+    add             v0.8h, v30.8h, v31.8h
1049
+    uaddlv          s0, v0.8h
1050
+    mov             w0, v0.s0
1051
+    ret             x10
1052
 endfunc
1053
 
1054
-// template<int w, int h>
1055
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1056
-function x265_pixel_satd_4x16_neon
1057
-    eor             w4, w4, w4
1058
-    pixel_satd_4x8_neon
1059
-    mov               w5, v0.s0
1060
-    add             w4, w4, w5
1061
-    pixel_satd_4x8_neon
1062
-    mov               w5, v0.s0
1063
-    add             w0, w5, w4
1064
-    ret
1065
+function PFX(pixel_satd_16x64_neon)
1066
+    mov             x10, x30
1067
+    bl              PFX(satd_16x4_neon)
1068
+    add             v30.8h, v0.8h, v1.8h
1069
+    add             v31.8h, v2.8h, v3.8h
1070
+.rept 15
1071
+    bl              PFX(satd_16x4_neon)
1072
+    add             v30.8h, v30.8h, v0.8h
1073
+    add             v31.8h, v31.8h, v1.8h
1074
+    add             v30.8h, v30.8h, v2.8h
1075
+    add             v31.8h, v31.8h, v3.8h
1076
+.endr
1077
+    add             v0.8h, v30.8h, v31.8h
1078
+    uaddlv          s0, v0.8h
1079
+    mov             w0, v0.s0
1080
+    ret             x10
1081
 endfunc
1082
 
1083
-// template<int w, int h>
1084
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1085
-function x265_pixel_satd_4x32_neon
1086
-    eor             w4, w4, w4
1087
+function PFX(pixel_satd_24x32_neon)
1088
+    mov             x10, x30
1089
+    mov             x7, #0
1090
+    mov             x4, x0
1091
+    mov             x5, x2
1092
+.rept 3
1093
+    movi            v30.8h, #0
1094
+    movi            v31.8h, #0
1095
 .rept 4
1096
-    pixel_satd_4x8_neon
1097
-    mov             w5, v0.s0
1098
-    add             w4, w4, w5
1099
+    bl              PFX(satd_8x8_neon)
1100
+    add             v30.8h, v30.8h, v0.8h
1101
+    add             v31.8h, v31.8h, v1.8h
1102
+    add             v30.8h, v30.8h, v2.8h
1103
+    add             v31.8h, v31.8h, v3.8h
1104
 .endr
1105
-    mov             w0, w4
1106
-    ret
1107
+    add             v0.8h, v30.8h, v31.8h
1108
+    uaddlv          s0, v0.8h
1109
+    mov             w6, v0.s0
1110
+    add             x7, x7, x6
1111
+    add             x4, x4, #8
1112
+    add             x5, x5, #8
1113
+    mov             x0, x4
1114
+    mov             x2, x5
1115
+.endr
1116
+    mov             x0, x7
1117
+    ret             x10
1118
 endfunc
1119
 
1120
-// template<int w, int h>
1121
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1122
-function x265_pixel_satd_12x16_neon
1123
+function PFX(pixel_satd_24x64_neon)
1124
+    mov             x10, x30
1125
+    mov             x7, #0
1126
     mov             x4, x0
1127
     mov             x5, x2
1128
-    eor             w7, w7, w7
1129
-    pixel_satd_4x8_neon
1130
+.rept 3
1131
+    movi            v30.8h, #0
1132
+    movi            v31.8h, #0
1133
+.rept 4
1134
+    bl              PFX(satd_8x8_neon)
1135
+    add             v30.8h, v30.8h, v0.8h
1136
+    add             v31.8h, v31.8h, v1.8h
1137
+    add             v30.8h, v30.8h, v2.8h
1138
+    add             v31.8h, v31.8h, v3.8h
1139
+.endr
1140
+    add             v0.8h, v30.8h, v31.8h
1141
+    uaddlv          s0, v0.8h
1142
     mov             w6, v0.s0
1143
-    add             w7, w7, w6
1144
-    pixel_satd_4x8_neon
1145
+    add             x7, x7, x6
1146
+    add             x4, x4, #8
1147
+    add             x5, x5, #8
1148
+    mov             x0, x4
1149
+    mov             x2, x5
1150
+.endr
1151
+    sub             x4, x4, #24
1152
+    sub             x5, x5, #24
1153
+    add             x0, x4, x1, lsl #5
1154
+    add             x2, x5, x3, lsl #5
1155
+    mov             x4, x0
1156
+    mov             x5, x2
1157
+.rept 3
1158
+    movi            v30.8h, #0
1159
+    movi            v31.8h, #0
1160
+.rept 4
1161
+    bl              PFX(satd_8x8_neon)
1162
+    add             v30.8h, v30.8h, v0.8h
1163
+    add             v31.8h, v31.8h, v1.8h
1164
+    add             v30.8h, v30.8h, v2.8h
1165
+    add             v31.8h, v31.8h, v3.8h
1166
+.endr
1167
+    add             v0.8h, v30.8h, v31.8h
1168
+    uaddlv          s0, v0.8h
1169
     mov             w6, v0.s0
1170
-    add             w7, w7, w6
1171
+    add             x7, x7, x6
1172
+    add             x4, x4, #8
1173
+    add             x5, x5, #8
1174
+    mov             x0, x4
1175
+    mov             x2, x5
1176
+.endr
1177
+    mov             x0, x7
1178
+    ret             x10
1179
+endfunc
1180
 
1181
-    add             x0, x4, #4
1182
-    add             x2, x5, #4
1183
-    pixel_satd_4x8_neon
1184
-    mov             w6, v0.s0
1185
-    add             w7, w7, w6
1186
-    pixel_satd_4x8_neon
1187
-    mov             w6, v0.s0
1188
-    add             w7, w7, w6
1189
+.macro pixel_satd_32x8
1190
+    mov             x4, x0
1191
+    mov             x5, x2
1192
+.rept 2
1193
+    bl              PFX(satd_16x4_neon)
1194
+    add             v30.8h, v30.8h, v0.8h
1195
+    add             v31.8h, v31.8h, v1.8h
1196
+    add             v30.8h, v30.8h, v2.8h
1197
+    add             v31.8h, v31.8h, v3.8h
1198
+.endr
1199
+    add             x0, x4, #16
1200
+    add             x2, x5, #16
1201
+.rept 2
1202
+    bl              PFX(satd_16x4_neon)
1203
+    add             v30.8h, v30.8h, v0.8h
1204
+    add             v31.8h, v31.8h, v1.8h
1205
+    add             v30.8h, v30.8h, v2.8h
1206
+    add             v31.8h, v31.8h, v3.8h
1207
+.endr
1208
+.endm
1209
 
1210
-    add             x0, x4, #8
1211
-    add             x2, x5, #8
1212
-    pixel_satd_4x8_neon
1213
-    mov             w6, v0.s0
1214
-    add             w7, w7, w6
1215
-    pixel_satd_4x8_neon
1216
+.macro satd_32x16_neon
1217
+    movi            v30.8h, #0
1218
+    movi            v31.8h, #0
1219
+    pixel_satd_32x8
1220
+    sub             x0, x0, #16
1221
+    sub             x2, x2, #16
1222
+    pixel_satd_32x8
1223
+    add             v0.8h, v30.8h, v31.8h
1224
+    uaddlv          s0, v0.8h
1225
     mov             w6, v0.s0
1226
-    add             w0, w7, w6
1227
-    ret
1228
-endfunc
1229
+.endm
1230
 
1231
-// template<int w, int h>
1232
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1233
-function x265_pixel_satd_12x32_neon
1234
+.macro satd_64x16_neon
1235
+    mov             x8, x0
1236
+    mov             x9, x2
1237
+    satd_32x16_neon
1238
+    add             x7, x7, x6
1239
+    add             x0, x8, #32
1240
+    add             x2, x9, #32
1241
+    satd_32x16_neon
1242
+    add             x7, x7, x6
1243
+.endm
1244
+
1245
+function PFX(pixel_satd_32x8_neon)
1246
+    mov             x10, x30
1247
+    mov             x7, #0
1248
     mov             x4, x0
1249
     mov             x5, x2
1250
-    eor             w7, w7, w7
1251
-.rept 4
1252
-    pixel_satd_4x8_neon
1253
-    mov             w6, v0.s0
1254
-    add             w7, w7, w6
1255
+    movi            v30.8h, #0
1256
+    movi            v31.8h, #0
1257
+    pixel_satd_32x8
1258
+    add             v0.8h, v30.8h, v31.8h
1259
+    uaddlv          s0, v0.8h
1260
+    mov             w0, v0.s0
1261
+    ret             x10
1262
+endfunc
1263
+
1264
+function PFX(pixel_satd_32x16_neon)
1265
+    mov             x10, x30
1266
+    satd_32x16_neon
1267
+    mov             x0, x6
1268
+    ret             x10
1269
+endfunc
1270
+
1271
+function PFX(pixel_satd_32x24_neon)
1272
+    mov             x10, x30
1273
+    satd_32x16_neon
1274
+    movi            v30.8h, #0
1275
+    movi            v31.8h, #0
1276
+    sub             x0, x0, #16
1277
+    sub             x2, x2, #16
1278
+    pixel_satd_32x8
1279
+    add             v0.8h, v30.8h, v31.8h
1280
+    uaddlv          s0, v0.8h
1281
+    mov             w0, v0.s0
1282
+    add             x0, x0, x6
1283
+    ret             x10
1284
+endfunc
1285
+
1286
+function PFX(pixel_satd_32x32_neon)
1287
+    mov             x10, x30
1288
+    mov             x7, #0
1289
+    satd_32x16_neon
1290
+    sub             x0, x0, #16
1291
+    sub             x2, x2, #16
1292
+    add             x7, x7, x6
1293
+    satd_32x16_neon
1294
+    add             x0, x7, x6
1295
+    ret             x10
1296
+endfunc
1297
+
1298
+function PFX(pixel_satd_32x48_neon)
1299
+    mov             x10, x30
1300
+    mov             x7, #0
1301
+.rept 2
1302
+    satd_32x16_neon
1303
+    sub             x0, x0, #16
1304
+    sub             x2, x2, #16
1305
+    add             x7, x7, x6
1306
 .endr
1307
+    satd_32x16_neon
1308
+    add             x0, x7, x6
1309
+    ret             x10
1310
+endfunc
1311
 
1312
-    add             x0, x4, #4
1313
-    add             x2, x5, #4
1314
-.rept 4
1315
-    pixel_satd_4x8_neon
1316
-    mov             w6, v0.s0
1317
-    add             w7, w7, w6
1318
+function PFX(pixel_satd_32x64_neon)
1319
+    mov             x10, x30
1320
+    mov             x7, #0
1321
+.rept 3
1322
+    satd_32x16_neon
1323
+    sub             x0, x0, #16
1324
+    sub             x2, x2, #16
1325
+    add             x7, x7, x6
1326
 .endr
1327
+    satd_32x16_neon
1328
+    add             x0, x7, x6
1329
+    ret             x10
1330
+endfunc
1331
 
1332
-    add             x0, x4, #8
1333
-    add             x2, x5, #8
1334
-.rept 4
1335
-    pixel_satd_4x8_neon
1336
-    mov             w6, v0.s0
1337
-    add             w7, w7, w6
1338
+function PFX(pixel_satd_64x16_neon)
1339
+    mov             x10, x30
1340
+    mov             x7, #0
1341
+    satd_64x16_neon
1342
+    mov             x0, x7
1343
+    ret             x10
1344
+endfunc
1345
+
1346
+function PFX(pixel_satd_64x32_neon)
1347
+    mov             x10, x30
1348
+    mov             x7, #0
1349
+    satd_64x16_neon
1350
+    sub             x0, x0, #48
1351
+    sub             x2, x2, #48
1352
+    satd_64x16_neon
1353
+    mov             x0, x7
1354
+    ret             x10
1355
+endfunc
1356
+
1357
+function PFX(pixel_satd_64x48_neon)
1358
+    mov             x10, x30
1359
+    mov             x7, #0
1360
+.rept 2
1361
+    satd_64x16_neon
1362
+    sub             x0, x0, #48
1363
+    sub             x2, x2, #48
1364
 .endr
1365
+    satd_64x16_neon
1366
+    mov             x0, x7
1367
+    ret             x10
1368
+endfunc
1369
 
1370
-    mov             w0, w7
1371
+function PFX(pixel_satd_64x64_neon)
1372
+    mov             x10, x30
1373
+    mov             x7, #0
1374
+.rept 3
1375
+    satd_64x16_neon
1376
+    sub             x0, x0, #48
1377
+    sub             x2, x2, #48
1378
+.endr
1379
+    satd_64x16_neon
1380
+    mov             x0, x7
1381
+    ret             x10
1382
+endfunc
1383
+
1384
+function PFX(pixel_satd_48x64_neon)
1385
+    mov             x10, x30
1386
+    mov             x7, #0
1387
+    mov             x8, x0
1388
+    mov             x9, x2
1389
+.rept 3
1390
+    satd_32x16_neon
1391
+    sub             x0, x0, #16
1392
+    sub             x2, x2, #16
1393
+    add             x7, x7, x6
1394
+.endr
1395
+    satd_32x16_neon
1396
+    add             x7, x7, x6
1397
+
1398
+    add             x0, x8, #32
1399
+    add             x2, x9, #32
1400
+    pixel_satd_16x32_neon
1401
+    add             v0.8h, v30.8h, v31.8h
1402
+    uaddlv          s0, v0.8h
1403
+    mov             w6, v0.s0
1404
+    add             x7, x7, x6
1405
+
1406
+    movi            v30.8h, #0
1407
+    movi            v31.8h, #0
1408
+    pixel_satd_16x32_neon
1409
+    add             v0.8h, v30.8h, v31.8h
1410
+    uaddlv          s0, v0.8h
1411
+    mov             w6, v0.s0
1412
+    add             x0, x7, x6
1413
+    ret             x10
1414
+endfunc
1415
+
1416
+function PFX(sa8d_8x8_neon), export=0
1417
+    LOAD_DIFF_8x4   v16.8h, v17.8h, v18.8h, v19.8h
1418
+    LOAD_DIFF_8x4   v20.8h, v21.8h, v22.8h, v23.8h
1419
+    HADAMARD4_V     v16.8h, v18.8h, v17.8h, v19.8h, v0.8h, v2.8h, v1.8h, v3.8h
1420
+    HADAMARD4_V     v20.8h, v21.8h, v22.8h, v23.8h, v0.8h, v1.8h, v2.8h, v3.8h
1421
+    SUMSUB_ABCD     v0.8h, v16.8h, v1.8h, v17.8h, v16.8h, v20.8h, v17.8h, v21.8h
1422
+    SUMSUB_ABCD     v2.8h, v18.8h, v3.8h, v19.8h, v18.8h, v22.8h, v19.8h, v23.8h
1423
+    trn4            v4.8h, v5.8h, v6.8h, v7.8h, v0.8h, v1.8h, v2.8h, v3.8h
1424
+    trn4            v20.8h, v21.8h, v22.8h, v23.8h, v16.8h, v17.8h, v18.8h, v19.8h
1425
+    SUMSUB_ABCD     v2.8h, v3.8h, v24.8h, v25.8h, v20.8h, v21.8h, v4.8h, v5.8h
1426
+    SUMSUB_ABCD     v0.8h, v1.8h, v4.8h, v5.8h, v22.8h, v23.8h, v6.8h, v7.8h
1427
+    trn4            v20.4s, v22.4s, v21.4s, v23.4s, v2.4s, v0.4s, v3.4s, v1.4s
1428
+    trn4            v16.4s, v18.4s, v17.4s, v19.4s, v24.4s, v4.4s, v25.4s, v5.4s
1429
+    SUMSUB_ABCD     v0.8h, v2.8h, v1.8h, v3.8h, v20.8h, v22.8h, v21.8h, v23.8h
1430
+    SUMSUB_ABCD     v4.8h, v6.8h, v5.8h, v7.8h, v16.8h, v18.8h, v17.8h, v19.8h
1431
+    trn4            v16.2d, v20.2d, v17.2d, v21.2d, v0.2d, v4.2d, v1.2d, v5.2d
1432
+    trn4            v18.2d, v22.2d, v19.2d, v23.2d, v2.2d, v6.2d, v3.2d, v7.2d
1433
+    ABS8            v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h
1434
+    smax            v16.8h, v16.8h, v20.8h
1435
+    smax            v17.8h, v17.8h, v21.8h
1436
+    smax            v18.8h, v18.8h, v22.8h
1437
+    smax            v19.8h, v19.8h, v23.8h
1438
+    add             v0.8h, v16.8h, v17.8h
1439
+    add             v1.8h, v18.8h, v19.8h
1440
     ret
1441
 endfunc
1442
 
1443
-// template<int w, int h>
1444
-// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1445
-function x265_pixel_satd_8x8_neon
1446
-    eor             w4, w4, w4
1447
-    mov             x6, x0
1448
-    mov             x7, x2
1449
-    pixel_satd_4x8_neon
1450
-    mov             w5, v0.s0
1451
-    add             w4, w4, w5
1452
-    add             x0, x6, #4
1453
-    add             x2, x7, #4
1454
-    pixel_satd_4x8_neon
1455
+function PFX(pixel_sa8d_8x8_neon)
1456
+    mov             x10, x30
1457
+    bl              PFX(sa8d_8x8_neon)
1458
+    add             v0.8h, v0.8h, v1.8h
1459
+    uaddlv          s0, v0.8h
1460
+    mov             w0, v0.s0
1461
+    add             w0, w0, #1
1462
+    lsr             w0, w0, #1
1463
+    ret             x10
1464
+endfunc
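+
+// Rough scalar view (assuming the usual sa8d definition; the C reference is
+// not part of this file): sa8d_8x8_neon returns the un-normalised sum of
+// absolute 8x8 Hadamard coefficients of the difference block in v0/v1, and
+// each caller applies the final rounding seen above as "add #1; lsr #1":
+//
+//   sa8d_8x8 = (sum_abs_hadamard_8x8(pix1 - pix2) + 1) >> 1;
+//
+// The larger block sizes below simply add up these rounded partial results.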
1465
+
1466
+function PFX(pixel_sa8d_8x16_neon)
1467
+    mov             x10, x30
1468
+    bl              PFX(sa8d_8x8_neon)
1469
+    add             v0.8h, v0.8h, v1.8h
1470
+    uaddlv          s0, v0.8h
1471
     mov             w5, v0.s0
1472
+    add             w5, w5, #1
1473
+    lsr             w5, w5, #1
1474
+    bl              PFX(sa8d_8x8_neon)
1475
+    add             v0.8h, v0.8h, v1.8h
1476
+    uaddlv          s0, v0.8h
1477
+    mov             w4, v0.s0
1478
+    add             w4, w4, #1
1479
+    lsr             w4, w4, #1
1480
+    add             w0, w4, w5
1481
+    ret             x10
1482
+endfunc
1483
+
1484
+.macro sa8d_16x16 reg
1485
+    bl              PFX(sa8d_8x8_neon)
1486
+    uaddlp          v30.4s, v0.8h
1487
+    uaddlp          v31.4s, v1.8h
1488
+    bl              PFX(sa8d_8x8_neon)
1489
+    uadalp          v30.4s, v0.8h
1490
+    uadalp          v31.4s, v1.8h
1491
+    sub             x0, x0, x1, lsl #4
1492
+    sub             x2, x2, x3, lsl #4
1493
+    add             x0, x0, #8
1494
+    add             x2, x2, #8
1495
+    bl              PFX(sa8d_8x8_neon)
1496
+    uadalp          v30.4s, v0.8h
1497
+    uadalp          v31.4s, v1.8h
1498
+    bl              PFX(sa8d_8x8_neon)
1499
+    uadalp          v30.4s, v0.8h
1500
+    uadalp          v31.4s, v1.8h
1501
+    add             v0.4s, v30.4s, v31.4s
1502
+    addv            s0, v0.4s
1503
+    mov             \reg, v0.s0
1504
+    add             \reg, \reg, #1
1505
+    lsr             \reg, \reg, #1
1506
+.endm
1507
+
1508
+function PFX(pixel_sa8d_16x16_neon)
1509
+    mov             x10, x30
1510
+    sa8d_16x16      w0
1511
+    ret             x10
1512
+endfunc
1513
+
1514
+function PFX(pixel_sa8d_16x32_neon)
1515
+    mov             x10, x30
1516
+    sa8d_16x16      w4
1517
+    sub             x0, x0, #8
1518
+    sub             x2, x2, #8
1519
+    sa8d_16x16      w5
1520
     add             w0, w4, w5
1521
+    ret             x10
1522
+endfunc
1523
+
1524
+function PFX(pixel_sa8d_32x32_neon)
1525
+    mov             x10, x30
1526
+    sa8d_16x16      w4
1527
+    sub             x0, x0, x1, lsl #4
1528
+    sub             x2, x2, x3, lsl #4
1529
+    add             x0, x0, #8
1530
+    add             x2, x2, #8
1531
+    sa8d_16x16      w5
1532
+    sub             x0, x0, #24
1533
+    sub             x2, x2, #24
1534
+    sa8d_16x16      w6
1535
+    sub             x0, x0, x1, lsl #4
1536
+    sub             x2, x2, x3, lsl #4
1537
+    add             x0, x0, #8
1538
+    add             x2, x2, #8
1539
+    sa8d_16x16      w7
1540
+    add             w4, w4, w5
1541
+    add             w6, w6, w7
1542
+    add             w0, w4, w6
1543
+    ret             x10
1544
+endfunc
1545
+
1546
+function PFX(pixel_sa8d_32x64_neon)
1547
+    mov             x10, x30
1548
+    mov             w11, #4
1549
+    mov             w9, #0
1550
+.loop_sa8d_32:
1551
+    sub             w11, w11, #1
1552
+    sa8d_16x16      w4
1553
+    sub             x0, x0, x1, lsl #4
1554
+    sub             x2, x2, x3, lsl #4
1555
+    add             x0, x0, #8
1556
+    add             x2, x2, #8
1557
+    sa8d_16x16      w5
1558
+    add             w4, w4, w5
1559
+    add             w9, w9, w4
1560
+    sub             x0, x0, #24
1561
+    sub             x2, x2, #24
1562
+    cbnz            w11, .loop_sa8d_32
1563
+    mov             w0, w9
1564
+    ret             x10
1565
+endfunc
1566
+
1567
+function PFX(pixel_sa8d_64x64_neon)
1568
+    mov             x10, x30
1569
+    mov             w11, #4
1570
+    mov             w9, #0
1571
+.loop_sa8d_64:
1572
+    sub             w11, w11, #1
1573
+    sa8d_16x16      w4
1574
+    sub             x0, x0, x1, lsl #4
1575
+    sub             x2, x2, x3, lsl #4
1576
+    add             x0, x0, #8
1577
+    add             x2, x2, #8
1578
+    sa8d_16x16      w5
1579
+    sub             x0, x0, x1, lsl #4
1580
+    sub             x2, x2, x3, lsl #4
1581
+    add             x0, x0, #8
1582
+    add             x2, x2, #8
1583
+    sa8d_16x16      w6
1584
+    sub             x0, x0, x1, lsl #4
1585
+    sub             x2, x2, x3, lsl #4
1586
+    add             x0, x0, #8
1587
+    add             x2, x2, #8
1588
+    sa8d_16x16      w7
1589
+    add             w4, w4, w5
1590
+    add             w6, w6, w7
1591
+    add             w8, w4, w6
1592
+    add             w9, w9, w8
1593
+
1594
+    sub             x0, x0, #56
1595
+    sub             x2, x2, #56
1596
+    cbnz            w11, .loop_sa8d_64
1597
+    mov             w0, w9
1598
+    ret             x10
1599
+endfunc
1600
+
1601
+/***** dequant_scaling*****/
1602
+// void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift)
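+//
+// Rough scalar equivalent of the two paths below (x265_clip3 is assumed to be
+// the usual clamp helper; it is not defined in this file):
+//
+//   shift += 4;
+//   if (shift >= per) {
+//       int add = 1 << (shift - per - 1);
+//       for (int n = 0; n < num; n++)
+//           coef[n] = (int16_t)x265_clip3(-32768, 32767,
+//                         (quantCoef[n] * deQuantCoef[n] + add) >> (shift - per));
+//   } else {
+//       for (int n = 0; n < num; n++)
+//           coef[n] = (int16_t)x265_clip3(-32768, 32767,
+//                         x265_clip3(-32768, 32767, quantCoef[n] * deQuantCoef[n]) << (per - shift));
+//   }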
1603
+function PFX(dequant_scaling_neon)
1604
+    add             x5, x5, #4              // shift + 4
1605
+    lsr             x3, x3, #3              // num / 8
1606
+    cmp             x5, x4
1607
+    blt             .dequant_skip
1608
+
1609
+    mov             x12, #1
1610
+    sub             x6, x5, x4          // shift - per
1611
+    sub             x6, x6, #1          // shift - per - 1
1612
+    lsl             x6, x12, x6         // 1 << shift - per - 1 (add)
1613
+    dup             v0.4s, w6
1614
+    sub             x7, x4, x5          // per - shift
1615
+    dup             v3.4s, w7
1616
+
1617
+.dequant_loop1:
1618
+    ld1             {v19.8h}, x0, #16 // quantCoef
1619
+    ld1             {v2.4s}, x1, #16  // deQuantCoef
1620
+    ld1             {v20.4s}, x1, #16
1621
+    sub             x3, x3, #1
1622
+    sxtl            v1.4s, v19.4h
1623
+    sxtl2           v19.4s, v19.8h
1624
+
1625
+    mul             v1.4s, v1.4s, v2.4s // quantCoef * deQuantCoef
1626
+    mul             v19.4s, v19.4s, v20.4s
1627
+    add             v1.4s, v1.4s, v0.4s // quantCoef * deQuantCoef + add
1628
+    add             v19.4s, v19.4s, v0.4s
1629
+
1630
+    sshl            v1.4s, v1.4s, v3.4s
1631
+    sshl            v19.4s, v19.4s, v3.4s
1632
+    sqxtn           v16.4h, v1.4s       // x265_clip3
1633
+    sqxtn2          v16.8h, v19.4s
1634
+    st1             {v16.8h}, x2, #16
1635
+    cbnz            x3, .dequant_loop1
1636
+    ret
1637
+
1638
+.dequant_skip:
1639
+    sub             x6, x4, x5          // per - shift
1640
+    dup             v0.8h, w6
1641
+
1642
+.dequant_loop2:
1643
+    ld1             {v19.8h}, x0, #16 // quantCoef
1644
+    ld1             {v2.4s}, x1, #16  // deQuantCoef
1645
+    ld1             {v20.4s}, x1, #16
1646
+    sub             x3, x3, #1
1647
+    sxtl            v1.4s, v19.4h
1648
+    sxtl2           v19.4s, v19.8h
1649
+
1650
+    mul             v1.4s, v1.4s, v2.4s // quantCoef * deQuantCoef
1651
+    mul             v19.4s, v19.4s, v20.4s
1652
+    sqxtn           v16.4h, v1.4s       // x265_clip3
1653
+    sqxtn2          v16.8h, v19.4s
1654
+
1655
+    sqshl           v16.8h, v16.8h, v0.8h // coefQ << per - shift
1656
+    st1             {v16.8h}, x2, #16
1657
+    cbnz            x3, .dequant_loop2
1658
+    ret
1659
+endfunc
1660
+
1661
+// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)
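+//
+// Roughly, per coefficient (the rounding right shift is done below with srshl
+// by -shift; x265_clip3 is the assumed clamp helper):
+//
+//   coef[n] = (int16_t)x265_clip3(-32768, 32767,
+//                 (quantCoef[n] * scale + (1 << (shift - 1))) >> shift);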
1662
+function PFX(dequant_normal_neon)
1663
+    lsr             w2, w2, #4              // num / 16
1664
+    neg             w4, w4
1665
+    dup             v0.8h, w3
1666
+    dup             v1.4s, w4
1667
+
1668
+.dqn_loop1:
1669
+    ld1             {v2.8h, v3.8h}, x0, #32
1670
+    smull           v16.4s, v2.4h, v0.4h
1671
+    smull2          v17.4s, v2.8h, v0.8h
1672
+    smull           v18.4s, v3.4h, v0.4h
1673
+    smull2          v19.4s, v3.8h, v0.8h
1674
+
1675
+    srshl           v16.4s, v16.4s, v1.4s
1676
+    srshl           v17.4s, v17.4s, v1.4s
1677
+    srshl           v18.4s, v18.4s, v1.4s
1678
+    srshl           v19.4s, v19.4s, v1.4s
1679
+
1680
+    sqxtn           v2.4h, v16.4s
1681
+    sqxtn2          v2.8h, v17.4s
1682
+    sqxtn           v3.4h, v18.4s
1683
+    sqxtn2          v3.8h, v19.4s
1684
+
1685
+    sub             w2, w2, #1
1686
+    st1             {v2.8h, v3.8h}, x1, #32
1687
+    cbnz            w2, .dqn_loop1
1688
+    ret
1689
+endfunc
1690
+
1691
+/********* ssim ***********/
1692
+// void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])
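+//
+// The routine produces, for two horizontally adjacent 4x4 blocks (z = 0, 1),
+// the four sums consumed by the SSIM formula - roughly:
+//
+//   s1  = sum(pix1);   s2  = sum(pix2);
+//   ss  = sum(pix1 * pix1) + sum(pix2 * pix2);
+//   s12 = sum(pix1 * pix2);
+//   sums[z][0] = s1; sums[z][1] = s2; sums[z][2] = ss; sums[z][3] = s12;
+//
+// Below, v28/v29/v30/v31 hold s1/s2/ss/s12 and the final st4 interleaves them
+// into the sums array.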
1693
+function PFX(ssim_4x4x2_core_neon)
1694
+    ld1             {v0.8b}, x0, x1
1695
+    ld1             {v1.8b}, x0, x1
1696
+    ld1             {v2.8b}, x0, x1
1697
+    ld1             {v3.8b}, x0, x1
1698
+
1699
+    ld1             {v4.8b}, x2, x3
1700
+    ld1             {v5.8b}, x2, x3
1701
+    ld1             {v6.8b}, x2, x3
1702
+    ld1             {v7.8b}, x2, x3
1703
+
1704
+    umull           v16.8h, v0.8b, v0.8b
1705
+    umull           v17.8h, v1.8b, v1.8b
1706
+    umull           v18.8h, v2.8b, v2.8b
1707
+    uaddlp          v30.4s, v16.8h
1708
+    umull           v19.8h, v3.8b, v3.8b
1709
+    umull           v20.8h, v4.8b, v4.8b
1710
+    umull           v21.8h, v5.8b, v5.8b
1711
+    uadalp          v30.4s, v17.8h
1712
+    umull           v22.8h, v6.8b, v6.8b
1713
+    umull           v23.8h, v7.8b, v7.8b
1714
+
1715
+    umull           v24.8h, v0.8b, v4.8b
1716
+    uadalp          v30.4s, v18.8h
1717
+    umull           v25.8h, v1.8b, v5.8b
1718
+    umull           v26.8h, v2.8b, v6.8b
1719
+    umull           v27.8h, v3.8b, v7.8b
1720
+    uadalp          v30.4s, v19.8h
1721
+
1722
+    uaddl           v28.8h, v0.8b, v1.8b
1723
+    uaddl           v29.8h, v4.8b, v5.8b
1724
+    uadalp          v30.4s, v20.8h
1725
+    uaddlp          v31.4s, v24.8h
1726
+
1727
+    uaddw           v28.8h, v28.8h, v2.8b
1728
+    uaddw           v29.8h, v29.8h, v6.8b
1729
+    uadalp          v30.4s, v21.8h
1730
+    uadalp          v31.4s, v25.8h
1731
+
1732
+    uaddw           v28.8h, v28.8h, v3.8b
1733
+    uaddw           v29.8h, v29.8h, v7.8b
1734
+    uadalp          v30.4s, v22.8h
1735
+    uadalp          v31.4s, v26.8h
1736
+
1737
+    uaddlp          v28.4s, v28.8h
1738
+    uaddlp          v29.4s, v29.8h
1739
+    uadalp          v30.4s, v23.8h
1740
+    uadalp          v31.4s, v27.8h
1741
+
1742
+    addp            v28.4s, v28.4s, v28.4s
1743
+    addp            v29.4s, v29.4s, v29.4s
1744
+    addp            v30.4s, v30.4s, v30.4s
1745
+    addp            v31.4s, v31.4s, v31.4s
1746
+
1747
+    st4             {v28.2s, v29.2s, v30.2s, v31.2s}, x4
1748
     ret
1749
 endfunc
1750
 
1751
 // int psyCost_pp(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride)
1752
-function x265_psyCost_4x4_neon
1753
+function PFX(psyCost_4x4_neon)
1754
     ld1r            {v4.2s}, x0, x1
1755
     ld1r            {v5.2s}, x0, x1
1756
     ld1             {v4.s}1, x0, x1
1757
@@ -286,7 +1792,7 @@
1758
 endfunc
1759
 
1760
 // uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
1761
-function x265_quant_neon
1762
+function PFX(quant_neon)
1763
     mov             w9, #1
1764
     lsl             w9, w9, w4
1765
     dup             v0.2s, w9
1766
@@ -341,79 +1847,597 @@
1767
     ret
1768
 endfunc
1769
 
1770
-.macro satd_4x4_neon
1771
-    ld1             {v1.s}0, x2, x3
1772
-    ld1             {v0.s}0, x0, x1
1773
-    ld1             {v3.s}0, x2, x3
1774
-    ld1             {v2.s}0, x0, x1
1775
+// uint32_t nquant_c(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff)
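+//
+// Rough scalar equivalent of the loop below (the abs-of-clipped value mirrors
+// the sqxtn + abs sequence; x265_clip3 is the assumed clamp helper):
+//
+//   uint32_t numSig = 0;
+//   for (int n = 0; n < numCoeff; n++) {
+//       int sign     = (coef[n] < 0) ? -1 : 1;
+//       int tmplevel = abs(coef[n]) * quantCoeff[n];
+//       int level    = (tmplevel + add) >> qBits;
+//       numSig      += (level != 0);
+//       qCoef[n]     = (int16_t)abs(x265_clip3(-32768, 32767, level * sign));
+//   }
+//   return numSig;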
1776
+function PFX(nquant_neon)
1777
+    neg             x12, x3
1778
+    dup             v0.4s, w12             // q0= -qbits
1779
+    dup             v1.4s, w4              // add
1780
 
1781
-    ld1             {v1.s}1, x2, x3
1782
-    ld1             {v0.s}1, x0, x1
1783
-    ld1             {v3.s}1, x2, x3
1784
-    ld1             {v2.s}1, x0, x1
1785
+    lsr             w5, w5, #2
1786
+    movi            v4.4s, #0              // v4= accumulate numsig
1787
+    mov             x4, #0
1788
+    movi            v22.4s, #0
1789
 
1790
-    usubl           v4.8h, v0.8b, v1.8b
1791
-    usubl           v5.8h, v2.8b, v3.8b
1792
+.loop_nquant:
1793
+    ld1             {v16.4h}, x0, #8
1794
+    sub             w5, w5, #1
1795
+    sxtl            v19.4s, v16.4h         // v19 = coef[blockpos]
1796
 
1797
-    add             v6.8h, v4.8h, v5.8h
1798
-    sub             v7.8h, v4.8h, v5.8h
1799
+    cmlt            v18.4s, v19.4s, #0     // v18 = sign
1800
 
1801
-    mov             v4.d0, v6.d1
1802
-    add             v0.8h, v6.8h, v4.8h
1803
-    sub             v2.8h, v6.8h, v4.8h
1804
+    abs             v19.4s, v19.4s         // v19 = level=abs(coefblockpos)
1805
+    ld1             {v20.4s}, x1, #16    // v20 = quantCoeffblockpos
1806
+    mul             v19.4s, v19.4s, v20.4s // v19 = tmplevel = abs(level) * quantCoeffblockpos;
1807
 
1808
-    mov             v5.d0, v7.d1
1809
-    add             v1.8h, v7.8h, v5.8h
1810
-    sub             v3.8h, v7.8h, v5.8h
1811
+    add             v20.4s, v19.4s, v1.4s  // v20 = tmplevel+add
1812
+    sshl            v20.4s, v20.4s, v0.4s  // v20 = level =(tmplevel+add) >> qbits
1813
 
1814
-    trn1            v4.4h, v0.4h, v1.4h
1815
-    trn2            v5.4h, v0.4h, v1.4h
1816
+    // numsig
1817
+    cmeq            v21.4s, v20.4s, v22.4s
1818
+    add             v4.4s, v4.4s, v21.4s
1819
+    add             x4, x4, #4
1820
 
1821
-    trn1            v6.4h, v2.4h, v3.4h
1822
-    trn2            v7.4h, v2.4h, v3.4h
1823
+    eor             v21.16b, v20.16b, v18.16b
1824
+    sub             v21.4s, v21.4s, v18.4s
1825
+    sqxtn           v16.4h, v21.4s
1826
+    abs             v17.4h, v16.4h
1827
+    st1             {v17.4h}, x2, #8
1828
 
1829
-    add             v0.4h, v4.4h, v5.4h
1830
-    sub             v1.4h, v4.4h, v5.4h
1831
+    cbnz            w5, .loop_nquant
1832
 
1833
-    add             v2.4h, v6.4h, v7.4h
1834
-    sub             v3.4h, v6.4h, v7.4h
1835
+    uaddlv          d4, v4.4s
1836
+    fmov            x12, d4
1837
+    add             x0, x4, x12
1838
+    ret
1839
+endfunc
1840
 
1841
-    trn1            v4.2s, v0.2s, v1.2s
1842
-    trn2            v5.2s, v0.2s, v1.2s
1843
+// void ssimDist_c(const pixel* fenc, uint32_t fStride, const pixel* recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k)
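+//
+// Per the ssimDist_1 macro below, each kernel accumulates (ignoring the
+// bit-depth shift handled by ssimDist_start/ssimDist_end, which are defined
+// elsewhere) the source energy and the distortion of the block:
+//
+//   ac_k    += fenc[x] * fenc[x];
+//   ssBlock += (fenc[x] - recon[x]) * (fenc[x] - recon[x]);
+//
+// summed over the whole NxN block given by the function name (4 ... 64).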
1844
+.macro ssimDist_1  v4 v5
1845
+    sub             v20.8h, \v4\().8h, \v5\().8h
1846
+    smull           v16.4s, \v4\().4h, \v4\().4h
1847
+    smull2          v17.4s, \v4\().8h, \v4\().8h
1848
+    smull           v18.4s, v20.4h, v20.4h
1849
+    smull2          v19.4s, v20.8h, v20.8h
1850
+    add             v0.4s, v0.4s, v16.4s
1851
+    add             v0.4s, v0.4s, v17.4s
1852
+    add             v1.4s, v1.4s, v18.4s
1853
+    add             v1.4s, v1.4s, v19.4s
1854
+.endm
1855
 
1856
-    trn1            v6.2s, v2.2s, v3.2s
1857
-    trn2            v7.2s, v2.2s, v3.2s
1858
+function PFX(ssimDist4_neon)
1859
+    ssimDist_start
1860
+.rept 4
1861
+    ld1             {v4.s}0, x0, x1
1862
+    ld1             {v5.s}0, x2, x3
1863
+    uxtl            v4.8h, v4.8b
1864
+    uxtl            v5.8h, v5.8b
1865
+    sub             v2.4h, v4.4h, v5.4h
1866
+    smull           v3.4s, v4.4h, v4.4h
1867
+    smull           v2.4s, v2.4h, v2.4h
1868
+    add             v0.4s, v0.4s, v3.4s
1869
+    add             v1.4s, v1.4s, v2.4s
1870
+.endr
1871
+    ssimDist_end
1872
+    ret
1873
+endfunc
1874
 
1875
-    abs             v4.4h, v4.4h
1876
-    abs             v5.4h, v5.4h
1877
-    abs             v6.4h, v6.4h
1878
-    abs             v7.4h, v7.4h
1879
+function PFX(ssimDist8_neon)
1880
+    ssimDist_start
1881
+.rept 8
1882
+    ld1             {v4.8b}, x0, x1
1883
+    ld1             {v5.8b}, x2, x3
1884
+    uxtl            v4.8h, v4.8b
1885
+    uxtl            v5.8h, v5.8b
1886
+    ssimDist_1      v4, v5
1887
+.endr
1888
+    ssimDist_end
1889
+    ret
1890
+endfunc
1891
 
1892
-    smax            v1.4h, v4.4h, v5.4h
1893
-    smax            v2.4h, v6.4h, v7.4h
1894
+function PFX(ssimDist16_neon)
1895
+    mov w12, #16
1896
+    ssimDist_start
1897
+.loop_ssimDist16:
1898
+    sub             w12, w12, #1
1899
+    ld1             {v4.16b}, x0, x1
1900
+    ld1             {v5.16b}, x2, x3
1901
+    uxtl            v6.8h, v4.8b
1902
+    uxtl            v7.8h, v5.8b
1903
+    uxtl2           v4.8h, v4.16b
1904
+    uxtl2           v5.8h, v5.16b
1905
+    ssimDist_1      v6, v7
1906
+    ssimDist_1      v4, v5
1907
+    cbnz            w12, .loop_ssimDist16
1908
+    ssimDist_end
1909
+    ret
1910
+endfunc
1911
 
1912
-    add             v0.4h, v1.4h, v2.4h
1913
-    uaddlp          v0.2s, v0.4h
1914
-    uaddlp          v0.1d, v0.2s
1915
+function PFX(ssimDist32_neon)
1916
+    mov w12, #32
1917
+    ssimDist_start
1918
+.loop_ssimDist32:
1919
+    sub             w12, w12, #1
1920
+    ld1             {v4.16b-v5.16b}, x0, x1
1921
+    ld1             {v6.16b-v7.16b}, x2, x3
1922
+    uxtl            v21.8h, v4.8b
1923
+    uxtl            v22.8h, v6.8b
1924
+    uxtl            v23.8h, v5.8b
1925
+    uxtl            v24.8h, v7.8b
1926
+    uxtl2           v25.8h, v4.16b
1927
+    uxtl2           v26.8h, v6.16b
1928
+    uxtl2           v27.8h, v5.16b
1929
+    uxtl2           v28.8h, v7.16b
1930
+    ssimDist_1      v21, v22
1931
+    ssimDist_1      v23, v24
1932
+    ssimDist_1      v25, v26
1933
+    ssimDist_1      v27, v28
1934
+    cbnz            w12, .loop_ssimDist32
1935
+    ssimDist_end
1936
+    ret
1937
+endfunc
1938
+
1939
+function PFX(ssimDist64_neon)
1940
+    mov w12, #64
1941
+    ssimDist_start
1942
+.loop_ssimDist64:
1943
+    sub             w12, w12, #1
1944
+    ld1             {v4.16b-v7.16b}, x0, x1
1945
+    ld1             {v16.16b-v19.16b}, x2, x3
1946
+    uxtl            v21.8h, v4.8b
1947
+    uxtl            v22.8h, v16.8b
1948
+    uxtl            v23.8h, v5.8b
1949
+    uxtl            v24.8h, v17.8b
1950
+    uxtl2           v25.8h, v4.16b
1951
+    uxtl2           v26.8h, v16.16b
1952
+    uxtl2           v27.8h, v5.16b
1953
+    uxtl2           v28.8h, v17.16b
1954
+    ssimDist_1      v21, v22
1955
+    ssimDist_1      v23, v24
1956
+    ssimDist_1      v25, v26
1957
+    ssimDist_1      v27, v28
1958
+    uxtl            v21.8h, v6.8b
1959
+    uxtl            v22.8h, v18.8b
1960
+    uxtl            v23.8h, v7.8b
1961
+    uxtl            v24.8h, v19.8b
1962
+    uxtl2           v25.8h, v6.16b
1963
+    uxtl2           v26.8h, v18.16b
1964
+    uxtl2           v27.8h, v7.16b
1965
+    uxtl2           v28.8h, v19.16b
1966
+    ssimDist_1      v21, v22
1967
+    ssimDist_1      v23, v24
1968
+    ssimDist_1      v25, v26
1969
+    ssimDist_1      v27, v28
1970
+    cbnz            w12, .loop_ssimDist64
1971
+    ssimDist_end
1972
+    ret
1973
+endfunc
1974
+
1975
+// void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
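+// Per the normFact_1 macro below, each kernel accumulates the source energy
+// (again ignoring the shift handled by normFact_start/normFact_end, defined
+// elsewhere):
+//
+//   z_k += src[x] * src[x];   // summed over the blockSize x blockSize block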
1976
+
1977
+.macro normFact_1  v4
1978
+    smull           v16.4s, \v4\().4h, \v4\().4h
1979
+    smull2          v17.4s, \v4\().8h, \v4\().8h
1980
+    add             v0.4s, v0.4s, v16.4s
1981
+    add             v0.4s, v0.4s, v17.4s
1982
 .endm
1983
 
1984
-// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
1985
-function x265_pixel_satd_4x4_neon
1986
-    satd_4x4_neon
1987
-    umov            x0, v0.d0
1988
+function PFX(normFact8_neon)
1989
+    normFact_start
1990
+.rept 8
1991
+    ld1             {v4.8b}, x0, x1
1992
+    uxtl            v4.8h, v4.8b
1993
+    normFact_1      v4
1994
+.endr
1995
+    normFact_end
1996
     ret
1997
 endfunc
1998
 
1999
-// int satd_8x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
2000
-function x265_pixel_satd_8x4_neon
2001
-    mov             x4, x0
2002
-    mov             x5, x2
2003
-    satd_4x4_neon
2004
-    add             x0, x4, #4
2005
-    add             x2, x5, #4
2006
-    umov            x6, v0.d0
2007
-    satd_4x4_neon
2008
-    umov            x0, v0.d0
2009
-    add             x0, x0, x6
2010
+function PFX(normFact16_neon)
2011
+    mov w12, #16
2012
+    normFact_start
2013
+.loop_normFact16:
2014
+    sub             w12, w12, #1
2015
+    ld1             {v4.16b}, x0, x1
2016
+    uxtl            v5.8h, v4.8b
2017
+    uxtl2           v4.8h, v4.16b
2018
+    normFact_1      v5
2019
+    normFact_1      v4
2020
+    cbnz            w12, .loop_normFact16
2021
+    normFact_end
2022
+    ret
2023
+endfunc
2024
+
2025
+function PFX(normFact32_neon)
2026
+    mov w12, #32
2027
+    normFact_start
2028
+.loop_normFact32:
2029
+    sub             w12, w12, #1
2030
+    ld1             {v4.16b-v5.16b}, x0, x1
2031
+    uxtl            v6.8h, v4.8b
2032
+    uxtl2           v4.8h, v4.16b
2033
+    uxtl            v7.8h, v5.8b
2034
+    uxtl2           v5.8h, v5.16b
2035
+    normFact_1      v4
2036
+    normFact_1      v5
2037
+    normFact_1      v6
2038
+    normFact_1      v7
2039
+    cbnz            w12, .loop_normFact32
2040
+    normFact_end
2041
+    ret
2042
+endfunc
2043
+
2044
+function PFX(normFact64_neon)
2045
+    mov w12, #64
2046
+    normFact_start
2047
+.loop_normFact64:
2048
+    sub             w12, w12, #1
2049
+    ld1             {v4.16b-v7.16b}, x0, x1
2050
+    uxtl            v26.8h, v4.8b
2051
+    uxtl2           v24.8h, v4.16b
2052
+    uxtl            v27.8h, v5.8b
2053
+    uxtl2           v25.8h, v5.16b
2054
+    normFact_1      v24
2055
+    normFact_1      v25
2056
+    normFact_1      v26
2057
+    normFact_1      v27
2058
+    uxtl            v26.8h, v6.8b
2059
+    uxtl2           v24.8h, v6.16b
2060
+    uxtl            v27.8h, v7.8b
2061
+    uxtl2           v25.8h, v7.16b
2062
+    normFact_1      v24
2063
+    normFact_1      v25
2064
+    normFact_1      v26
2065
+    normFact_1      v27
2066
+    cbnz            w12, .loop_normFact64
2067
+    normFact_end
2068
+    ret
2069
+endfunc
2070
+
2071
+// void weight_pp_c(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)
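+//
+// Rough scalar equivalent of all three paths below ("correction" is the
+// internal bit-depth correction, loaded as a constant 6 for this build):
+//
+//   for (int y = 0; y < height; y++, src += stride, dst += stride)
+//       for (int x = 0; x < width; x++)
+//           dst[x] = x265_clip(((src[x] * (w0 << correction) + round) >> shift) + offset);
+//
+// The fast paths fold "<< correction >> shift" into the multiplier when shift
+// only removes trailing zero bits of w0, and stay in 16-bit arithmetic when
+// the intermediate result cannot overflow 16 bits.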
2072
+function PFX(weight_pp_neon)
2073
+    sub             x2, x2, x3
2074
+    ldr             w9, sp              // offset
2075
+    lsl             w5, w5, #6            // w0 << correction
2076
+
2077
+    // count trailing zeros in w5 and compare against shift right amount.
2078
+    rbit            w10, w5
2079
+    clz             w10, w10
2080
+    cmp             w10, w7
2081
+    b.lt            .unfoldedShift
2082
+
2083
+    // shift right only removes trailing zeros: hoist LSR out of the loop.
2084
+    lsr             w10, w5, w7           // w0 << correction >> shift
2085
+    dup             v25.16b, w10
2086
+    lsr             w6, w6, w7            // round >> shift
2087
+    add             w6, w6, w9            // round >> shift + offset
2088
+    dup             v26.8h, w6
2089
+
2090
+    // Check arithmetic range.
2091
+    mov             w11, #255
2092
+    madd            w11, w11, w10, w6
2093
+    add             w11, w11, w9
2094
+    lsr             w11, w11, #16
2095
+    cbnz            w11, .widenTo32Bit
2096
+
2097
+    // 16-bit arithmetic is enough.
2098
+.loopHpp:
2099
+    mov             x12, x3
2100
+.loopWpp:
2101
+    ldr             q0, x0, #16
2102
+    sub             x12, x12, #16
2103
+    umull           v1.8h, v0.8b, v25.8b  // val *= w0 << correction >> shift
2104
+    umull2          v2.8h, v0.16b, v25.16b
2105
+    add             v1.8h, v1.8h, v26.8h  // val += round >> shift + offset
2106
+    add             v2.8h, v2.8h, v26.8h
2107
+    sqxtun          v0.8b, v1.8h          // val = x265_clip(val)
2108
+    sqxtun2         v0.16b, v2.8h
2109
+    str             q0, x1, #16
2110
+    cbnz            x12, .loopWpp
2111
+    add             x1, x1, x2
2112
+    add             x0, x0, x2
2113
+    sub             x4, x4, #1
2114
+    cbnz            x4, .loopHpp
2115
+    ret
2116
+
2117
+    // 32-bit arithmetic is needed.
2118
+.widenTo32Bit:
2119
+.loopHpp32:
2120
+    mov             x12, x3
2121
+.loopWpp32:
2122
+    ldr             d0, x0, #8
2123
+    sub             x12, x12, #8
2124
+    uxtl            v0.8h, v0.8b
2125
+    umull           v1.4s, v0.4h, v25.4h  // val *= w0 << correction >> shift
2126
+    umull2          v2.4s, v0.8h, v25.8h
2127
+    add             v1.4s, v1.4s, v26.4s  // val += round >> shift + offset
2128
+    add             v2.4s, v2.4s, v26.4s
2129
+    sqxtn           v0.4h, v1.4s          // val = x265_clip(val)
2130
+    sqxtn2          v0.8h, v2.4s
2131
+    sqxtun          v0.8b, v0.8h
2132
+    str             d0, x1, #8
2133
+    cbnz            x12, .loopWpp32
2134
+    add             x1, x1, x2
2135
+    add             x0, x0, x2
2136
+    sub             x4, x4, #1
2137
+    cbnz            x4, .loopHpp32
2138
+    ret
2139
+
2140
+    // The shift right cannot be moved out of the loop.
2141
+.unfoldedShift:
2142
+    dup             v25.8h, w5            // w0 << correction
2143
+    dup             v26.4s, w6            // round
2144
+    neg             w7, w7                // -shift
2145
+    dup             v27.4s, w7
2146
+    dup             v29.4s, w9            // offset
2147
+.loopHppUS:
2148
+    mov             x12, x3
2149
+.loopWppUS:
2150
+    ldr             d0, x0, #8
2151
+    sub             x12, x12, #8
2152
+    uxtl            v0.8h, v0.8b
2153
+    umull           v1.4s, v0.4h, v25.4h  // val *= w0
2154
+    umull2          v2.4s, v0.8h, v25.8h
2155
+    add             v1.4s, v1.4s, v26.4s  // val += round
2156
+    add             v2.4s, v2.4s, v26.4s
2157
+    sshl            v1.4s, v1.4s, v27.4s  // val >>= shift
2158
+    sshl            v2.4s, v2.4s, v27.4s
2159
+    add             v1.4s, v1.4s, v29.4s  // val += offset
2160
+    add             v2.4s, v2.4s, v29.4s
2161
+    sqxtn           v0.4h, v1.4s          // val = x265_clip(val)
2162
+    sqxtn2          v0.8h, v2.4s
2163
+    sqxtun          v0.8b, v0.8h
2164
+    str             d0, x1, #8
2165
+    cbnz            x12, .loopWppUS
2166
+    add             x1, x1, x2
2167
+    add             x0, x0, x2
2168
+    sub             x4, x4, #1
2169
+    cbnz            x4, .loopHppUS
2170
+    ret
2171
+endfunc
2172
+
2173
+// int scanPosLast(
2174
+//     const uint16_t *scan,      // x0
2175
+//     const coeff_t *coeff,      // x1
2176
+//     uint16_t *coeffSign,       // x2
2177
+//     uint16_t *coeffFlag,       // x3
2178
+//     uint8_t *coeffNum,         // x4
2179
+//     int numSig,                // x5
2180
+//     const uint16_t* scanCG4x4, // x6
2181
+//     const int trSize)          // x7
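+//
+// Rough scalar equivalent (per 4x4 coefficient group cg, following the usual
+// scanPosLast_c logic - an assumption, since the C reference lives elsewhere):
+//
+//   pos            = scan[i];
+//   sig            = (coeff[pos] != 0);
+//   coeffSign[cg] += (coeff[pos] < 0) << coeffNum[cg];  // sign bits of non-zero coeffs
+//   coeffFlag[cg]  = (coeffFlag[cg] << 1) | sig;        // significance bitmap
+//   coeffNum[cg]  += sig;
+//
+// iterated until numSig significant coefficients have been seen; the return
+// value is the scan position of the last significant coefficient.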
2182
+function PFX(scanPosLast_neon)
2183
+    // convert unit of Stride(trSize) to int16_t
2184
+    add             x7, x7, x7
2185
+
2186
+    // load scan table and convert to Byte
2187
+    ldp             q0, q1, x6
2188
+    xtn             v0.8b, v0.8h
2189
+    xtn2            v0.16b, v1.8h   // v0 - Zigzag scan table
2190
+
2191
+    movrel          x10, g_SPL_and_mask
2192
+    ldr             q28, x10      // v28 = mask for pmovmskb
2193
+    movi            v31.16b, #0     // v31 = {0, ..., 0}
2194
+    add             x10, x7, x7     // 2*x7
2195
+    add             x11, x10, x7    // 3*x7
2196
+    add             x9, x4, #1      // CG count
2197
+
2198
+.loop_spl:
2199
+    // position of current CG
2200
+    ldrh            w6, x0, #32
2201
+    add             x6, x1, x6, lsl #1
2202
+
2203
+    // loading current CG
2204
+    ldr             d2, x6
2205
+    ldr             d3, x6, x7
2206
+    ldr             d4, x6, x10
2207
+    ldr             d5, x6, x11
2208
+    mov             v2.d1, v3.d0
2209
+    mov             v4.d1, v5.d0
2210
+    sqxtn           v2.8b, v2.8h
2211
+    sqxtn2          v2.16b, v4.8h
2212
+
2213
+    // Zigzag
2214
+    tbl             v3.16b, {v2.16b}, v0.16b
2215
+
2216
+    // get sign
2217
+    cmhi            v5.16b, v3.16b, v31.16b   // v5 = non-zero
2218
+    cmlt            v3.16b, v3.16b, #0        // v3 = negative
2219
+
2220
+    // val - w13 = pmovmskb(v3)
2221
+    and             v3.16b, v3.16b, v28.16b
2222
+    mov             d4, v3.d1
2223
+    addv            b23, v3.8b
2224
+    addv            b24, v4.8b
2225
+    mov             v23.b1, v24.b0
2226
+    fmov            w13, s23
2227
+
2228
+    // mask - w15 = pmovmskb(v5)
2229
+    and             v5.16b, v5.16b, v28.16b
2230
+    mov             d6, v5.d1
2231
+    addv            b25, v5.8b
2232
+    addv            b26, v6.8b
2233
+    mov             v25.b1, v26.b0
2234
+    fmov            w15, s25
2235
+
2236
+    // coeffFlag = reverse_bit(w15) in 16-bit
2237
+    rbit            w12, w15
2238
+    lsr             w12, w12, #16
2239
+    fmov            s30, w12
2240
+    strh            w12, x3, #2
2241
+
2242
+    // accelerate by preparing w13 = w13 & w15
2243
+    and             w13, w13, w15
2244
+    mov             x14, xzr
2245
+.loop_spl_1:
2246
+    cbz             w15, .pext_end
2247
+    clz             w6, w15
2248
+    lsl             w13, w13, w6
2249
+    lsl             w15, w15, w6
2250
+    extr            w14, w14, w13, #31
2251
+    bfm             w15, wzr, #1, #0
2252
+    b               .loop_spl_1
2253
+.pext_end:
2254
+    strh            w14, x2, #2
2255
+
2256
+    // compute coeffNum = popcount(coeffFlag)
2257
+    cnt             v30.8b, v30.8b
2258
+    addp            v30.8b, v30.8b, v30.8b
2259
+    fmov            w6, s30
2260
+    sub             x5, x5, x6
2261
+    strb            w6, x4, #1
2262
+
2263
+    cbnz            x5, .loop_spl
2264
+
2265
+    // count trailing zeros
2266
+    rbit            w13, w12
2267
+    clz             w13, w13
2268
+    lsr             w12, w12, w13
2269
+    strh            w12, x3, #-2
2270
+
2271
+    // get last pos
2272
+    sub             x9, x4, x9
2273
+    lsl             x0, x9, #4
2274
+    eor             w13, w13, #15
2275
+    add             x0, x0, x13
2276
+    ret
2277
+endfunc
2278
+
2279
+// uint32_t costCoeffNxN(
2280
+//    uint16_t *scan,        // x0
2281
+//    coeff_t *coeff,        // x1
2282
+//    intptr_t trSize,       // x2
2283
+//    uint16_t *absCoeff,    // x3
2284
+//    uint8_t *tabSigCtx,    // x4
2285
+//    uint16_t scanFlagMask, // x5
2286
+//    uint8_t *baseCtx,      // x6
2287
+//    int offset,            // x7
2288
+//    int scanPosSigOff,     // sp
2289
+//    int subPosBase)        // sp + 8
2290
+function PFX(costCoeffNxN_neon)
2291
+    // abs(coeff)
2292
+    add             x2, x2, x2
2293
+    ld1             {v1.d}0, x1, x2
2294
+    ld1             {v1.d}1, x1, x2
2295
+    ld1             {v2.d}0, x1, x2
2296
+    ld1             {v2.d}1, x1, x2
2297
+    abs             v1.8h, v1.8h
2298
+    abs             v2.8h, v2.8h
2299
+
2300
+    // WARNING: beyond-bound read here!
2301
+    // loading scan table
2302
+    ldr             w2, sp
2303
+    eor             w15, w2, #15
2304
+    add             x1, x0, x15, lsl #1
2305
+    ldp             q20, q21, x1
2306
+    uzp1            v20.16b, v20.16b, v21.16b
2307
+    movi            v21.16b, #15
2308
+    eor             v0.16b, v20.16b, v21.16b
2309
+
2310
+    // reorder coeff
2311
+    uzp1           v22.16b, v1.16b, v2.16b
2312
+    uzp2           v23.16b, v1.16b, v2.16b
2313
+    tbl            v24.16b, {v22.16b}, v0.16b
2314
+    tbl            v25.16b, {v23.16b}, v0.16b
2315
+    zip1           v2.16b, v24.16b, v25.16b
2316
+    zip2           v3.16b, v24.16b, v25.16b
2317
+
2318
+    // loading tabSigCtx (+offset)
2319
+    ldr             q1, x4
2320
+    tbl             v1.16b, {v1.16b}, v0.16b
2321
+    dup             v4.16b, w7
2322
+    movi            v5.16b, #0
2323
+    tbl             v4.16b, {v4.16b}, v5.16b
2324
+    add             v1.16b, v1.16b, v4.16b
2325
+
2326
+    // register mapping
2327
+    // x0 - sum
2328
+    // x1 - entropyStateBits
2329
+    // v1 - sigCtx
2330
+    // {v3,v2} - abs(coeff)
2331
+    // x2 - scanPosSigOff
2332
+    // x3 - absCoeff
2333
+    // x4 - numNonZero
2334
+    // x5 - scanFlagMask
2335
+    // x6 - baseCtx
2336
+    mov             x0, #0
2337
+    movrel          x1, PFX_C(entropyStateBits)
2338
+    mov             x4, #0
2339
+    mov             x11, #0
2340
+    movi            v31.16b, #0
2341
+    cbz             x2, .idx_zero
2342
+.loop_ccnn:
2343
+//   {
2344
+//        const uint32_t cnt = tabSigCtx[blkPos] + offset + posOffset;
2345
+//        ctxSig = cnt & posZeroMask;
2346
+//        const uint32_t mstate = baseCtx[ctxSig];
2347
+//        const uint32_t mps = mstate & 1;
2348
+//        const uint32_t stateBits = x265_entropyStateBits[mstate ^ sig];
2349
+//        uint32_t nextState = (stateBits >> 24) + mps;
2350
+//        if ((mstate ^ sig) == 1)
2351
+//            nextState = sig;
2352
+//        baseCtx[ctxSig] = (uint8_t)nextState;
2353
+//        sum += stateBits;
2354
+//    }
2355
+//    absCoeff[numNonZero] = tmpCoeff[blkPos];
2356
+//    numNonZero += sig;
2357
+//    scanPosSigOff--;
2358
+
2359
+    add             x13, x3, x4, lsl #1
2360
+    sub             x2, x2, #1
2361
+    str             h2, x13             // absCoeff[numNonZero] = tmpCoeff[blkPos]
2362
+    fmov            w14, s1               // x14 = ctxSig
2363
+    uxtb            w14, w14
2364
+    ubfx            w11, w5, #0, #1       // x11 = sig
2365
+    lsr             x5, x5, #1
2366
+    add             x4, x4, x11           // numNonZero += sig
2367
+    ext             v1.16b, v1.16b, v31.16b, #1
2368
+    ext             v2.16b, v2.16b, v3.16b, #2
2369
+    ext             v3.16b, v3.16b, v31.16b, #2
2370
+    ldrb            w9, x6, x14         // mstate = baseCtx[ctxSig]
2371
+    and             w10, w9, #1           // mps = mstate & 1
2372
+    eor             w9, w9, w11           // x9 = mstate ^ sig
2373
+    add             x12, x1, x9, lsl #2
2374
+    ldr             w13, x12
2375
+    add             w0, w0, w13           // sum += x265_entropyStateBits[mstate ^ sig]
2376
+    ldrb            w13, x12, #3
2377
+    add             w10, w10, w13         // nextState = (stateBits >> 24) + mps
2378
+    cmp             w9, #1
2379
+    csel            w10, w11, w10, eq
2380
+    strb            w10, x6, x14
2381
+    cbnz            x2, .loop_ccnn
2382
+.idx_zero:
2383
+
2384
+    add             x13, x3, x4, lsl #1
2385
+    add             x4, x4, x15
2386
+    str             h2, x13              // absCoeff[numNonZero] = tmpCoeff[blkPos]
2387
+
2388
+    ldr             x9, sp, #8           // subPosBase
2389
+    uxth            w9, w9
2390
+    cmp             w9, #0
2391
+    cset            x2, eq
2392
+    add             x4, x4, x2
2393
+    cbz             x4, .exit_ccnn
2394
+
2395
+    sub             w2, w2, #1
2396
+    uxtb            w2, w2
2397
+    fmov            w3, s1
2398
+    and             w2, w2, w3
2399
+
2400
+    ldrb            w3, x6, x2         // mstate = baseCtx[ctxSig]
2401
+    eor             w4, w5, w3            // x5 = mstate ^ sig
2402
+    and             w3, w3, #1            // mps = mstate & 1
2403
+    add             x1, x1, x4, lsl #2
2404
+    ldr             w11, x1
2405
+    ldrb            w12, x1, #3
2406
+    add             w0, w0, w11           // sum += x265_entropyStateBits[mstate ^ sig]
2407
+    add             w3, w3, w12           // nextState = (stateBits >> 24) + mps
2408
+    cmp             w4, #1
2409
+    csel            w3, w5, w3, eq
2410
+    strb            w3, x6, x2
2411
+.exit_ccnn:
2412
+    ubfx            w0, w0, #0, #24
2413
     ret
2414
 endfunc
2415
+
2416
+const g_SPL_and_mask, align=8
2417
+.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80
2418
+endconst
2419
x265_3.6.tar.gz/source/common/aarch64/sad-a-common.S Added
516
 
1
@@ -0,0 +1,514 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using NEON instruction set
26
+// that are also used by the SVE2 functions
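+//
+// All of these macros implement plain sums of absolute differences; for a WxH
+// block the scalar equivalent is simply:
+//
+//   int sad(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2)
+//   {
+//       int sum = 0;
+//       for (int y = 0; y < H; y++, pix1 += stride1, pix2 += stride2)
+//           for (int x = 0; x < W; x++)
+//               sum += abs(pix1[x] - pix2[x]);
+//       return sum;
+//   }
+//
+// The NEON versions keep 16-bit accumulators (uabal/uabal2) and widen only in
+// the SAD_END_* reductions.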
27
+
28
+#include "asm.S"
29
+
30
+.arch           armv8-a
31
+
32
+#ifdef __APPLE__
33
+.section __RODATA,__rodata
34
+#else
35
+.section .rodata
36
+#endif
37
+
38
+.align 4
39
+
40
+.macro SAD_START_4 f
41
+    ld1             {v0.s}0, x0, x1
42
+    ld1             {v0.s}1, x0, x1
43
+    ld1             {v1.s}0, x2, x3
44
+    ld1             {v1.s}1, x2, x3
45
+    \f              v16.8h, v0.8b, v1.8b
46
+.endm
47
+
48
+.macro SAD_4 h
49
+.rept \h / 2 - 1
50
+    SAD_START_4 uabal
51
+.endr
52
+.endm
53
+
54
+.macro SAD_START_8 f
55
+    ld1             {v0.8b}, x0, x1
56
+    ld1             {v1.8b}, x2, x3
57
+    ld1             {v2.8b}, x0, x1
58
+    ld1             {v3.8b}, x2, x3
59
+    \f              v16.8h, v0.8b, v1.8b
60
+    \f              v17.8h, v2.8b, v3.8b
61
+.endm
62
+
63
+.macro SAD_8 h
64
+.rept \h / 2 - 1
65
+    SAD_START_8 uabal
66
+.endr
67
+.endm
68
+
69
+.macro SAD_START_16 f
70
+    ld1             {v0.16b}, x0, x1
71
+    ld1             {v1.16b}, x2, x3
72
+    ld1             {v2.16b}, x0, x1
73
+    ld1             {v3.16b}, x2, x3
74
+    \f              v16.8h, v0.8b, v1.8b
75
+    \f\()2          v17.8h, v0.16b, v1.16b
76
+    uabal           v16.8h, v2.8b, v3.8b
77
+    uabal2          v17.8h, v2.16b, v3.16b
78
+.endm
79
+
80
+.macro SAD_16 h
81
+.rept \h / 2 - 1
82
+    SAD_START_16 uabal
83
+.endr
84
+.endm
85
+
86
+.macro SAD_START_32
87
+    movi            v16.16b, #0
88
+    movi            v17.16b, #0
89
+    movi            v18.16b, #0
90
+    movi            v19.16b, #0
91
+.endm
92
+
93
+.macro SAD_32
94
+    ld1             {v0.16b-v1.16b}, x0, x1
95
+    ld1             {v2.16b-v3.16b}, x2, x3
96
+    ld1             {v4.16b-v5.16b}, x0, x1
97
+    ld1             {v6.16b-v7.16b}, x2, x3
98
+    uabal           v16.8h, v0.8b, v2.8b
99
+    uabal2          v17.8h, v0.16b, v2.16b
100
+    uabal           v18.8h, v1.8b, v3.8b
101
+    uabal2          v19.8h, v1.16b, v3.16b
102
+    uabal           v16.8h, v4.8b, v6.8b
103
+    uabal2          v17.8h, v4.16b, v6.16b
104
+    uabal           v18.8h, v5.8b, v7.8b
105
+    uabal2          v19.8h, v5.16b, v7.16b
106
+.endm
107
+
108
+.macro SAD_END_32
109
+    add             v16.8h, v16.8h, v17.8h
110
+    add             v17.8h, v18.8h, v19.8h
111
+    add             v16.8h, v16.8h, v17.8h
112
+    uaddlv          s0, v16.8h
113
+    fmov            w0, s0
114
+    ret
115
+.endm
116
+
117
+.macro SAD_START_64
118
+    movi            v16.16b, #0
119
+    movi            v17.16b, #0
120
+    movi            v18.16b, #0
121
+    movi            v19.16b, #0
122
+    movi            v20.16b, #0
123
+    movi            v21.16b, #0
124
+    movi            v22.16b, #0
125
+    movi            v23.16b, #0
126
+.endm
127
+
128
+.macro SAD_64
129
+    ld1             {v0.16b-v3.16b}, x0, x1
130
+    ld1             {v4.16b-v7.16b}, x2, x3
131
+    ld1             {v24.16b-v27.16b}, x0, x1
132
+    ld1             {v28.16b-v31.16b}, x2, x3
133
+    uabal           v16.8h, v0.8b, v4.8b
134
+    uabal2          v17.8h, v0.16b, v4.16b
135
+    uabal           v18.8h, v1.8b, v5.8b
136
+    uabal2          v19.8h, v1.16b, v5.16b
137
+    uabal           v20.8h, v2.8b, v6.8b
138
+    uabal2          v21.8h, v2.16b, v6.16b
139
+    uabal           v22.8h, v3.8b, v7.8b
140
+    uabal2          v23.8h, v3.16b, v7.16b
141
+
142
+    uabal           v16.8h, v24.8b, v28.8b
143
+    uabal2          v17.8h, v24.16b, v28.16b
144
+    uabal           v18.8h, v25.8b, v29.8b
145
+    uabal2          v19.8h, v25.16b, v29.16b
146
+    uabal           v20.8h, v26.8b, v30.8b
147
+    uabal2          v21.8h, v26.16b, v30.16b
148
+    uabal           v22.8h, v27.8b, v31.8b
149
+    uabal2          v23.8h, v27.16b, v31.16b
150
+.endm
151
+
152
+.macro SAD_END_64
153
+    add             v16.8h, v16.8h, v17.8h
154
+    add             v17.8h, v18.8h, v19.8h
155
+    add             v16.8h, v16.8h, v17.8h
156
+    uaddlp          v16.4s, v16.8h
157
+    add             v18.8h, v20.8h, v21.8h
158
+    add             v19.8h, v22.8h, v23.8h
159
+    add             v17.8h, v18.8h, v19.8h
160
+    uaddlp          v17.4s, v17.8h
161
+    add             v16.4s, v16.4s, v17.4s
162
+    uaddlv          d0, v16.4s
163
+    fmov            x0, d0
164
+    ret
165
+.endm
166
+
167
+.macro SAD_START_12
168
+    movrel          x12, sad12_mask
169
+    ld1             {v31.16b}, x12
170
+    movi            v16.16b, #0
171
+    movi            v17.16b, #0
172
+.endm
173
+
174
+.macro SAD_12
175
+    ld1             {v0.16b}, x0, x1
176
+    and             v0.16b, v0.16b, v31.16b
177
+    ld1             {v1.16b}, x2, x3
178
+    and             v1.16b, v1.16b, v31.16b
179
+    ld1             {v2.16b}, x0, x1
180
+    and             v2.16b, v2.16b, v31.16b
181
+    ld1             {v3.16b}, x2, x3
182
+    and             v3.16b, v3.16b, v31.16b
183
+    uabal           v16.8h, v0.8b, v1.8b
184
+    uabal2          v17.8h, v0.16b, v1.16b
185
+    uabal           v16.8h, v2.8b, v3.8b
186
+    uabal2          v17.8h, v2.16b, v3.16b
187
+.endm
188
+
189
+.macro SAD_END_12
190
+    add             v16.8h, v16.8h, v17.8h
191
+    uaddlv          s0, v16.8h
192
+    fmov            w0, s0
193
+    ret
194
+.endm
195
+
196
+.macro SAD_START_24
197
+    movi            v16.16b, #0
198
+    movi            v17.16b, #0
199
+    movi            v18.16b, #0
200
+    sub             x1, x1, #16
201
+    sub             x3, x3, #16
202
+.endm
203
+
204
+.macro SAD_24
205
+    ld1             {v0.16b}, x0, #16
206
+    ld1             {v1.8b}, x0, x1
207
+    ld1             {v2.16b}, x2, #16
208
+    ld1             {v3.8b}, x2, x3
209
+    ld1             {v4.16b}, x0, #16
210
+    ld1             {v5.8b}, x0, x1
211
+    ld1             {v6.16b}, x2, #16
212
+    ld1             {v7.8b}, x2, x3
213
+    uabal           v16.8h, v0.8b, v2.8b
214
+    uabal2          v17.8h, v0.16b, v2.16b
215
+    uabal           v18.8h, v1.8b, v3.8b
216
+    uabal           v16.8h, v4.8b, v6.8b
217
+    uabal2          v17.8h, v4.16b, v6.16b
218
+    uabal           v18.8h, v5.8b, v7.8b
219
+.endm
220
+
221
+.macro SAD_END_24
222
+    add             v16.8h, v16.8h, v17.8h
223
+    add             v16.8h, v16.8h, v18.8h
224
+    uaddlv          s0, v16.8h
225
+    fmov            w0, s0
226
+    ret
227
+.endm
228
+
229
+.macro SAD_START_48
230
+    movi            v16.16b, #0
231
+    movi            v17.16b, #0
232
+    movi            v18.16b, #0
233
+    movi            v19.16b, #0
234
+    movi            v20.16b, #0
235
+    movi            v21.16b, #0
236
+.endm
237
+
238
+.macro SAD_48
239
+    ld1             {v0.16b-v2.16b}, [x0], x1
240
+    ld1             {v4.16b-v6.16b}, [x2], x3
241
+    ld1             {v24.16b-v26.16b}, [x0], x1
242
+    ld1             {v28.16b-v30.16b}, [x2], x3
243
+    uabal           v16.8h, v0.8b, v4.8b
244
+    uabal2          v17.8h, v0.16b, v4.16b
245
+    uabal           v18.8h, v1.8b, v5.8b
246
+    uabal2          v19.8h, v1.16b, v5.16b
247
+    uabal           v20.8h, v2.8b, v6.8b
248
+    uabal2          v21.8h, v2.16b, v6.16b
249
+
250
+    uabal           v16.8h, v24.8b, v28.8b
251
+    uabal2          v17.8h, v24.16b, v28.16b
252
+    uabal           v18.8h, v25.8b, v29.8b
253
+    uabal2          v19.8h, v25.16b, v29.16b
254
+    uabal           v20.8h, v26.8b, v30.8b
255
+    uabal2          v21.8h, v26.16b, v30.16b
256
+.endm
257
+
258
+.macro SAD_END_48
259
+    add             v16.8h, v16.8h, v17.8h
260
+    add             v17.8h, v18.8h, v19.8h
261
+    add             v16.8h, v16.8h, v17.8h
262
+    uaddlv          s0, v16.8h
263
+    fmov            w0, s0
264
+    add             v18.8h, v20.8h, v21.8h
265
+    uaddlv          s1, v18.8h
266
+    fmov            w1, s1
267
+    add             w0, w0, w1
268
+    ret
269
+.endm
270
+
271
+.macro SAD_X_START_4 h, x, f
272
+    ld1             {v0.s}[0], [x0], x9
273
+    ld1             {v0.s}[1], [x0], x9
274
+    ld1             {v1.s}[0], [x1], x5
275
+    ld1             {v1.s}[1], [x1], x5
276
+    ld1             {v2.s}[0], [x2], x5
277
+    ld1             {v2.s}[1], [x2], x5
278
+    ld1             {v3.s}[0], [x3], x5
279
+    ld1             {v3.s}[1], [x3], x5
280
+    \f              v16.8h, v0.8b, v1.8b
281
+    \f              v17.8h, v0.8b, v2.8b
282
+    \f              v18.8h, v0.8b, v3.8b
283
+.if \x == 4
284
+    ld1             {v4.s}[0], [x4], x5
285
+    ld1             {v4.s}[1], [x4], x5
286
+    \f              v19.8h, v0.8b, v4.8b
287
+.endif
288
+.endm
289
+
290
+.macro SAD_X_4 h, x
291
+.rept \h/2 - 1
292
+    SAD_X_START_4 \h, \x, uabal
293
+.endr
294
+.endm
295
+
296
+.macro SAD_X_END_4 x
297
+    uaddlv          s0, v16.8h
298
+    uaddlv          s1, v17.8h
299
+    uaddlv          s2, v18.8h
300
+    stp             s0, s1, [x6]
301
+.if \x == 3
302
+    str             s2, [x6, #8]
303
+.elseif \x == 4
304
+    uaddlv          s3, v19.8h
305
+    stp             s2, s3, [x6, #8]
306
+.endif
307
+    ret
308
+.endm
309
+
310
+.macro SAD_X_START_8 h, x, f
311
+    ld1             {v0.8b}, [x0], x9
312
+    ld1             {v1.8b}, [x1], x5
313
+    ld1             {v2.8b}, [x2], x5
314
+    ld1             {v3.8b}, [x3], x5
315
+    \f              v16.8h, v0.8b, v1.8b
316
+    \f              v17.8h, v0.8b, v2.8b
317
+    \f              v18.8h, v0.8b, v3.8b
318
+.if \x == 4
319
+    ld1             {v4.8b}, [x4], x5
320
+    \f              v19.8h, v0.8b, v4.8b
321
+.endif
322
+.endm
323
+
324
+.macro SAD_X_8 h x
325
+.rept \h - 1
326
+    SAD_X_START_8 \h, \x, uabal
327
+.endr
328
+.endm
329
+
330
+.macro SAD_X_END_8 x
331
+    SAD_X_END_4 \x
332
+.endm
333
+
334
+.macro SAD_X_START_12 h, x, f
335
+    ld1             {v0.16b}, [x0], x9
336
+    and             v0.16b, v0.16b, v31.16b
337
+    ld1             {v1.16b}, [x1], x5
338
+    and             v1.16b, v1.16b, v31.16b
339
+    ld1             {v2.16b}, [x2], x5
340
+    and             v2.16b, v2.16b, v31.16b
341
+    ld1             {v3.16b}, [x3], x5
342
+    and             v3.16b, v3.16b, v31.16b
343
+    \f              v16.8h, v1.8b, v0.8b
344
+    \f\()2          v20.8h, v1.16b, v0.16b
345
+    \f              v17.8h, v2.8b, v0.8b
346
+    \f\()2          v21.8h, v2.16b, v0.16b
347
+    \f              v18.8h, v3.8b, v0.8b
348
+    \f\()2          v22.8h, v3.16b, v0.16b
349
+.if \x == 4
350
+    ld1             {v4.16b}, [x4], x5
351
+    and             v4.16b, v4.16b, v31.16b
352
+    \f              v19.8h, v4.8b, v0.8b
353
+    \f\()2          v23.8h, v4.16b, v0.16b
354
+.endif
355
+.endm
356
+
357
+.macro SAD_X_12 h x
358
+.rept \h - 1
359
+    SAD_X_START_12 \h, \x, uabal
360
+.endr
361
+.endm
362
+
363
+.macro SAD_X_END_12 x
364
+    SAD_X_END_16 \x
365
+.endm
366
+
367
+.macro SAD_X_START_16 h, x, f
368
+    ld1             {v0.16b}, [x0], x9
369
+    ld1             {v1.16b}, [x1], x5
370
+    ld1             {v2.16b}, [x2], x5
371
+    ld1             {v3.16b}, [x3], x5
372
+    \f              v16.8h, v1.8b, v0.8b
373
+    \f\()2          v20.8h, v1.16b, v0.16b
374
+    \f              v17.8h, v2.8b, v0.8b
375
+    \f\()2          v21.8h, v2.16b, v0.16b
376
+    \f              v18.8h, v3.8b, v0.8b
377
+    \f\()2          v22.8h, v3.16b, v0.16b
378
+.if \x == 4
379
+    ld1             {v4.16b}, [x4], x5
380
+    \f              v19.8h, v4.8b, v0.8b
381
+    \f\()2          v23.8h, v4.16b, v0.16b
382
+.endif
383
+.endm
384
+
385
+.macro SAD_X_16 h x
386
+.rept \h - 1
387
+    SAD_X_START_16 \h, \x, uabal
388
+.endr
389
+.endm
390
+
391
+.macro SAD_X_END_16 x
392
+    add             v16.8h, v16.8h, v20.8h
393
+    add             v17.8h, v17.8h, v21.8h
394
+    add             v18.8h, v18.8h, v22.8h
395
+.if \x == 4
396
+    add             v19.8h, v19.8h, v23.8h
397
+.endif
398
+
399
+    SAD_X_END_4 \x
400
+.endm
401
+
402
+.macro SAD_X_START_24 x
403
+    SAD_X_START_32 \x
404
+    sub             x5, x5, #16
405
+    sub             x9, x9, #16
406
+.endm
407
+
408
+.macro SAD_X_24 base v1 v2
409
+    ld1             {v0.16b}, [\base], #16
410
+    ld1             {v1.8b}, [\base], x5
411
+    uabal           \v1\().8h, v0.8b, v6.8b
412
+    uabal           \v1\().8h, v1.8b, v7.8b
413
+    uabal2          \v2\().8h, v0.16b, v6.16b
414
+.endm
415
+
416
+.macro SAD_X_END_24 x
417
+    SAD_X_END_16 \x
418
+.endm
419
+
420
+.macro SAD_X_START_32 x
421
+    movi v16.16b, #0
422
+    movi v17.16b, #0
423
+    movi v18.16b, #0
424
+    movi v20.16b, #0
425
+    movi v21.16b, #0
426
+    movi v22.16b, #0
427
+.if \x == 4
428
+    movi v19.16b, #0
429
+    movi v23.16b, #0
430
+.endif
431
+.endm
432
+
433
+.macro SAD_X_32 base v1 v2
434
+    ld1             {v0.16b-v1.16b}, [\base], x5
435
+    uabal           \v1\().8h, v0.8b, v6.8b
436
+    uabal           \v1\().8h, v1.8b, v7.8b
437
+    uabal2          \v2\().8h, v0.16b, v6.16b
438
+    uabal2          \v2\().8h, v1.16b, v7.16b
439
+.endm
440
+
441
+.macro SAD_X_END_32 x
442
+    SAD_X_END_16 \x
443
+.endm
444
+
445
+.macro SAD_X_START_48 x
446
+    SAD_X_START_32 \x
447
+.endm
448
+
449
+.macro SAD_X_48 x1 v1 v2
450
+    ld1             {v0.16b-v2.16b}, [\x1], x5
451
+    uabal           \v1\().8h, v0.8b, v4.8b
452
+    uabal           \v1\().8h, v1.8b, v5.8b
453
+    uabal           \v1\().8h, v2.8b, v6.8b
454
+    uabal2          \v2\().8h, v0.16b, v4.16b
455
+    uabal2          \v2\().8h, v1.16b, v5.16b
456
+    uabal2          \v2\().8h, v2.16b, v6.16b
457
+.endm
458
+
459
+.macro SAD_X_END_48 x
460
+    SAD_X_END_64 \x
461
+.endm
462
+
463
+.macro SAD_X_START_64 x
464
+    SAD_X_START_32 \x
465
+.endm
466
+
467
+.macro SAD_X_64 x1 v1 v2
468
+    ld1             {v0.16b-v3.16b}, [\x1], x5
469
+    uabal           \v1\().8h, v0.8b, v4.8b
470
+    uabal           \v1\().8h, v1.8b, v5.8b
471
+    uabal           \v1\().8h, v2.8b, v6.8b
472
+    uabal           \v1\().8h, v3.8b, v7.8b
473
+    uabal2          \v2\().8h, v0.16b, v4.16b
474
+    uabal2          \v2\().8h, v1.16b, v5.16b
475
+    uabal2          \v2\().8h, v2.16b, v6.16b
476
+    uabal2          \v2\().8h, v3.16b, v7.16b
477
+.endm
478
+
479
+.macro SAD_X_END_64 x
480
+    uaddlp          v16.4s, v16.8h
481
+    uaddlp          v17.4s, v17.8h
482
+    uaddlp          v18.4s, v18.8h
483
+    uaddlp          v20.4s, v20.8h
484
+    uaddlp          v21.4s, v21.8h
485
+    uaddlp          v22.4s, v22.8h
486
+    add             v16.4s, v16.4s, v20.4s
487
+    add             v17.4s, v17.4s, v21.4s
488
+    add             v18.4s, v18.4s, v22.4s
489
+    trn2            v20.2d, v16.2d, v16.2d
490
+    trn2            v21.2d, v17.2d, v17.2d
491
+    trn2            v22.2d, v18.2d, v18.2d
492
+    add             v16.2s, v16.2s, v20.2s
493
+    add             v17.2s, v17.2s, v21.2s
494
+    add             v18.2s, v18.2s, v22.2s
495
+    uaddlp          v16.1d, v16.2s
496
+    uaddlp          v17.1d, v17.2s
497
+    uaddlp          v18.1d, v18.2s
498
+    stp             s16, s17, [x6], #8
499
+.if \x == 3
500
+    str             s18, [x6]
501
+.elseif \x == 4
502
+    uaddlp          v19.4s, v19.8h
503
+    uaddlp          v23.4s, v23.8h
504
+    add             v19.4s, v19.4s, v23.4s
505
+    trn2            v23.2d, v19.2d, v19.2d
506
+    add             v19.2s, v19.2s, v23.2s
507
+    uaddlp          v19.1d, v19.2s
508
+    stp             s18, s19, [x6]
509
+.endif
510
+    ret
511
+.endm
512
+
513
+const sad12_mask, align=8
514
+.byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0
515
+endconst
516
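
The sad12_mask table above enables 12-pixel-wide SAD without partial
loads: both inputs are loaded 16 bytes at a time and AND-ed with the
mask, which zeroes bytes 12..15 in both operands, so those lanes
contribute |0 - 0| = 0 to the accumulator. A hedged C sketch of the idea
(it assumes both buffers may be over-read by 4 bytes, which holds for
x265's padded pixel planes):

    #include <stdint.h>

    static const uint8_t sad12_mask[16] = {
        255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 0, 0, 0, 0
    };

    /* One row of 12-wide SAD computed as a masked 16-wide SAD. */
    static uint32_t sad12_row_ref(const uint8_t *a, const uint8_t *b)
    {
        uint32_t sum = 0;
        for (int x = 0; x < 16; x++) {
            int va = a[x] & sad12_mask[x];
            int vb = b[x] & sad12_mask[x];
            sum += (uint32_t)(va > vb ? va - vb : vb - va);
        }
        return sum;   /* equals the SAD over bytes 0..11 only */
    }
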
x265_3.6.tar.gz/source/common/aarch64/sad-a-sve2.S Added
513
 
1
@@ -0,0 +1,511 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "sad-a-common.S"
27
+
28
+.arch armv8-a+sve2
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+.macro SAD_SVE2_16 h
41
+    mov             z16.d, #0
42
+    ptrue           p0.h, vl16
43
+.rept \h
44
+    ld1b            {z0.h}, p0/z, [x0]
45
+    ld1b            {z2.h}, p0/z, [x2]
46
+    add             x0, x0, x1
47
+    add             x2, x2, x3
48
+    uaba            z16.h, z0.h, z2.h
49
+.endr
50
+    uaddv           d0, p0, z16.h
51
+    fmov            w0, s0
52
+    ret
53
+.endm
54
+
55
+.macro SAD_SVE2_32 h
56
+    ptrue           p0.b, vl32
57
+.rept \h
58
+    ld1b            {z0.b}, p0/z, [x0]
59
+    ld1b            {z4.b}, p0/z, [x2]
60
+    add             x0, x0, x1
61
+    add             x2, x2, x3
62
+    uabalb          z16.h, z0.b, z4.b
63
+    uabalt          z16.h, z0.b, z4.b
64
+.endr
65
+    uaddv           d0, p0, z16.h
66
+    fmov            w0, s0
67
+    ret
68
+.endm
69
+
70
+.macro SAD_SVE2_64 h
71
+    cmp             x9, #48
72
+    bgt             .vl_gt_48_pixel_sad_64x\h
73
+    mov             z16.d, #0
74
+    mov             z17.d, #0
75
+    mov             z18.d, #0
76
+    mov             z19.d, #0
77
+    ptrue           p0.b, vl32
78
+.rept \h
79
+    ld1b            {z0.b}, p0/z, [x0]
80
+    ld1b            {z1.b}, p0/z, [x0, #1, mul vl]
81
+    ld1b            {z4.b}, p0/z, [x2]
82
+    ld1b            {z5.b}, p0/z, [x2, #1, mul vl]
83
+    add             x0, x0, x1
84
+    add             x2, x2, x3
85
+    uabalb          z16.h, z0.b, z4.b
86
+    uabalt          z17.h, z0.b, z4.b
87
+    uabalb          z18.h, z1.b, z5.b
88
+    uabalt          z19.h, z1.b, z5.b
89
+.endr
90
+    add             z16.h, z16.h, z17.h
91
+    add             z17.h, z18.h, z19.h
92
+    add             z16.h, z16.h, z17.h
93
+    uadalp          z24.s, p0/m, z16.h
94
+    uaddv           d5, p0, z24.s
95
+    fmov            x0, d5
96
+    ret
97
+.vl_gt_48_pixel_sad_64x\h\():
98
+    mov             z16.d, #0
99
+    mov             z17.d, #0
100
+    mov             z24.d, #0
101
+    ptrue           p0.b, vl64
102
+.rept \h
103
+    ld1b            {z0.b}, p0/z, x0
104
+    ld1b            {z4.b}, p0/z, x2
105
+    add             x0, x0, x1
106
+    add             x2, x2, x3
107
+    uabalb          z16.h, z0.b, z4.b
108
+    uabalt          z17.h, z0.b, z4.b
109
+.endr
110
+    add             z16.h, z16.h, z17.h
111
+    uadalp          z24.s, p0/m, z16.h
112
+    uaddv           d5, p0, z24.s
113
+    fmov            x0, d5
114
+    ret
115
+.endm
116
+
117
+.macro SAD_SVE2_24 h
118
+    mov             z16.d, #0
119
+    mov             x10, #24
120
+    mov             x11, #0
121
+    whilelt         p0.b, x11, x10
122
+.rept \h
123
+    ld1b            {z0.b}, p0/z, x0
124
+    ld1b            {z8.b}, p0/z, x2
125
+    add             x0, x0, x1
126
+    add             x2, x2, x3
127
+    uabalb          z16.h, z0.b, z8.b
128
+    uabalt          z16.h, z0.b, z8.b
129
+.endr
130
+    uaddv           d5, p0, z16.h
131
+    fmov            w0, s5
132
+    ret
133
+.endm
134
+
135
+.macro SAD_SVE2_48 h
136
+    cmp             x9, #48
137
+    bgt             .vl_gt_48_pixel_sad_48x\h
138
+    mov             z16.d, #0
139
+    mov             z17.d, #0
140
+    mov             z18.d, #0
141
+    mov             z19.d, #0
142
+    ptrue           p0.b, vl32
143
+    ptrue           p1.b, vl16
144
+.rept \h
145
+    ld1b            {z0.b}, p0/z, x0
146
+    ld1b            {z1.b}, p1/z, x0, #1, mul vl
147
+    ld1b            {z8.b}, p0/z, x2
148
+    ld1b            {z9.b}, p1/z, x2, #1, mul vl
149
+    add             x0, x0, x1
150
+    add             x2, x2, x3
151
+    uabalb          z16.h, z0.b, z8.b
152
+    uabalt          z17.h, z0.b, z8.b
153
+    uabalb          z18.h, z1.b, z9.b
154
+    uabalt          z19.h, z1.b, z9.b
155
+.endr
156
+    add             z16.h, z16.h, z17.h
157
+    add             z17.h, z18.h, z19.h
158
+    add             z16.h, z16.h, z17.h
159
+    uaddv           d5, p0, z16.h
160
+    fmov            w0, s5
161
+    ret
162
+.vl_gt_48_pixel_sad_48x\h\():
163
+    mov             z16.d, #0
164
+    mov             z17.d, #0
165
+    mov             x10, #48
166
+    mov             x11, #0
167
+    whilelt         p0.b, x11, x10
168
+.rept \h
169
+    ld1b            {z0.b}, p0/z, x0
170
+    ld1b            {z8.b}, p0/z, x2
171
+    add             x0, x0, x1
172
+    add             x2, x2, x3
173
+    uabalb          z16.h, z0.b, z8.b
174
+    uabalt          z17.h, z0.b, z8.b
175
+.endr
176
+    add             z16.h, z16.h, z17.h
177
+    uaddv           d5, p0, z16.h
178
+    fmov            w0, s5
179
+    ret
180
+.endm
181
+
182
+// Fully unrolled.
183
+.macro SAD_FUNC_SVE2 w, h
184
+function PFX(pixel_sad_\w\()x\h\()_sve2)
185
+    rdvl            x9, #1
186
+    cmp             x9, #16
187
+    bgt             .vl_gt_16_pixel_sad_\w\()x\h
188
+    SAD_START_\w uabdl
189
+    SAD_\w \h
190
+.if \w > 4
191
+    add             v16.8h, v16.8h, v17.8h
192
+.endif
193
+    uaddlv          s0, v16.8h
194
+    fmov            w0, s0
195
+    ret
196
+.vl_gt_16_pixel_sad_\w\()x\h\():
197
+.if \w == 4 || \w == 8 || \w == 12
198
+    SAD_START_\w uabdl
199
+    SAD_\w \h
200
+.if \w > 4
201
+    add             v16.8h, v16.8h, v17.8h
202
+.endif
203
+    uaddlv          s0, v16.8h
204
+    fmov            w0, s0
205
+    ret
206
+.else
207
+    SAD_SVE2_\w \h
208
+.endif
209
+endfunc
210
+.endm
211
+
212
+// Loop unrolled 4.
213
+.macro SAD_FUNC_LOOP_SVE2 w, h
214
+function PFX(pixel_sad_\w\()x\h\()_sve2)
215
+    rdvl            x9, #1
216
+    cmp             x9, #16
217
+    bgt             .vl_gt_16_pixel_sad_loop_\w\()x\h
218
+    SAD_START_\w
219
+
220
+    mov             w9, #\h/8
221
+.loop_sve2_\w\()x\h:
222
+    sub             w9, w9, #1
223
+.rept 4
224
+    SAD_\w
225
+.endr
226
+    cbnz            w9, .loop_sve2_\w\()x\h
227
+
228
+    SAD_END_\w
229
+
230
+.vl_gt_16_pixel_sad_loop_\w\()x\h\():
231
+.if \w == 4 || \w == 8 || \w == 12
232
+    SAD_START_\w
233
+
234
+    mov             w9, #\h/8
235
+.loop_sve2_loop_\w\()x\h:
236
+    sub             w9, w9, #1
237
+.rept 4
238
+    SAD_\w
239
+.endr
240
+    cbnz            w9, .loop_sve2_loop_\w\()x\h
241
+
242
+    SAD_END_\w
243
+.else
244
+    SAD_SVE2_\w \h
245
+.endif
246
+endfunc
247
+.endm
248
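
The two generator macros above start every _sve2 entry point with
"rdvl x9, #1", which reads the SVE vector length in bytes; when it is 16
(128-bit vectors) the function falls through to the plain NEON macros,
and the wider SVE2 code is only used on implementations with longer
vectors. A minimal C sketch of that dispatch using the ACLE intrinsic
svcntb() (the two kernel names are hypothetical stand-ins):

    #include <arm_sve.h>   /* requires -march=armv8-a+sve */
    #include <stdint.h>

    typedef uint32_t (*sad_fn)(const uint8_t *, intptr_t,
                               const uint8_t *, intptr_t);

    extern uint32_t sad_neon_kernel(const uint8_t *, intptr_t,
                                    const uint8_t *, intptr_t);
    extern uint32_t sad_sve2_kernel(const uint8_t *, intptr_t,
                                    const uint8_t *, intptr_t);

    static sad_fn select_sad(void)
    {
        /* svcntb() is the vector length in bytes, the same value that
         * "rdvl x9, #1" produces. */
        return (svcntb() <= 16) ? sad_neon_kernel : sad_sve2_kernel;
    }
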
+
249
+SAD_FUNC_SVE2  4,  4
250
+SAD_FUNC_SVE2  4,  8
251
+SAD_FUNC_SVE2  4,  16
252
+SAD_FUNC_SVE2  8,  4
253
+SAD_FUNC_SVE2  8,  8
254
+SAD_FUNC_SVE2  8,  16
255
+SAD_FUNC_SVE2  8,  32
256
+SAD_FUNC_SVE2  16, 4
257
+SAD_FUNC_SVE2  16, 8
258
+SAD_FUNC_SVE2  16, 12
259
+SAD_FUNC_SVE2  16, 16
260
+SAD_FUNC_SVE2  16, 32
261
+SAD_FUNC_SVE2  16, 64
262
+
263
+SAD_FUNC_LOOP_SVE2  32, 8
264
+SAD_FUNC_LOOP_SVE2  32, 16
265
+SAD_FUNC_LOOP_SVE2  32, 24
266
+SAD_FUNC_LOOP_SVE2  32, 32
267
+SAD_FUNC_LOOP_SVE2  32, 64
268
+SAD_FUNC_LOOP_SVE2  64, 16
269
+SAD_FUNC_LOOP_SVE2  64, 32
270
+SAD_FUNC_LOOP_SVE2  64, 48
271
+SAD_FUNC_LOOP_SVE2  64, 64
272
+SAD_FUNC_LOOP_SVE2  12, 16
273
+SAD_FUNC_LOOP_SVE2  24, 32
274
+SAD_FUNC_LOOP_SVE2  48, 64
275
+
276
+// SAD_X3 and SAD_X4 code start
277
+
278
+.macro SAD_X_SVE2_24_INNER_GT_16 base z
279
+    ld1b            {z4.b}, p0/z, [\base]
280
+    add             \base, \base, x5
281
+    uabalb          \z\().h, z4.b, z0.b
282
+    uabalt          \z\().h, z4.b, z0.b
283
+.endm
284
+
285
+.macro SAD_X_SVE2_24 h x
286
+    mov             z20.d, #0
287
+    mov             z21.d, #0
288
+    mov             z22.d, #0
289
+    mov             z23.d, #0
290
+    mov             x10, #24
291
+    mov             x11, #0
292
+    whilelt         p0.b, x11, x10
293
+.rept \h
294
+    ld1b            {z0.b}, p0/z, x0
295
+    add             x0, x0, x9
296
+    SAD_X_SVE2_24_INNER_GT_16 x1, z20
297
+    SAD_X_SVE2_24_INNER_GT_16 x2, z21
298
+    SAD_X_SVE2_24_INNER_GT_16 x3, z22
299
+.if \x == 4
300
+    SAD_X_SVE2_24_INNER_GT_16 x4, z23
301
+.endif
302
+.endr
303
+    uaddlv          s0, v20.8h
304
+    uaddlv          s1, v21.8h
305
+    uaddlv          s2, v22.8h
306
+    stp             s0, s1, [x6]
307
+.if \x == 3
308
+    str             s2, [x6, #8]
309
+.elseif \x == 4
310
+    uaddv           d0, p0, z20.h
311
+    uaddv           d1, p0, z21.h
312
+    uaddv           d2, p0, z22.h
313
+    stp             s2, s3, [x6, #8]
314
+.endif
315
+    ret
316
+.endm
317
+
318
+.macro SAD_X_SVE2_32_INNER_GT_16 base z
319
+    ld1b            {z4.b}, p0/z, [\base]
320
+    add             \base, \base, x5
321
+    uabalb          \z\().h, z4.b, z0.b
322
+    uabalt          \z\().h, z4.b, z0.b
323
+.endm
324
+
325
+.macro SAD_X_SVE2_32 h x
326
+    mov             z20.d, #0
327
+    mov             z21.d, #0
328
+    mov             z22.d, #0
329
+    mov             z23.d, #0
330
+    ptrue           p0.b, vl32
331
+.rept \h
332
+    ld1b            {z0.b}, p0/z, x0
333
+    add             x0, x0, x9
334
+    SAD_X_SVE2_32_INNER_GT_16 x1, z20
335
+    SAD_X_SVE2_32_INNER_GT_16 x2, z21
336
+    SAD_X_SVE2_32_INNER_GT_16 x3, z22
337
+.if \x == 4
338
+    SAD_X_SVE2_32_INNER_GT_16 x4, z23
339
+.endif
340
+.endr
341
+    uaddv           d0, p0, z20.h
342
+    uaddv           d1, p0, z21.h
343
+    uaddv           d2, p0, z22.h
344
+    stp             s0, s1, [x6]
345
+.if \x == 3
346
+    str             s2, [x6, #8]
347
+.elseif \x == 4
348
+    uaddv           d3, p0, z23.h
349
+    stp             s2, s3, [x6, #8]
350
+.endif
351
+    ret
352
+.endm
353
+
354
+// static void x264_pixel_sad_x3_##size(pixel *fenc, pixel *pix0, pixel *pix1, pixel *pix2, intptr_t i_stride, int scores[3])
356
+// static void x264_pixel_sad_x4_##size(pixel *fenc, pixel *pix0, pixel *pix1,pixel *pix2, pixel *pix3, intptr_t i_stride, int scores[4])
356
+.macro SAD_X_FUNC_SVE2 x, w, h
357
+function PFX(sad_x\x\()_\w\()x\h\()_sve2)
358
+    mov             x9, #FENC_STRIDE
359
+
360
+// Make function arguments for x == 3 look like x == 4.
361
+.if \x == 3
362
+    mov             x6, x5
363
+    mov             x5, x4
364
+.endif
365
+    rdvl            x11, #1
366
+    cmp             x11, #16
367
+    bgt             .vl_gt_16_sad_x\x\()_\w\()x\h
368
+.if \w == 12
369
+    movrel          x12, sad12_mask
370
+    ld1             {v31.16b}, [x12]
371
+.endif
372
+
373
+    SAD_X_START_\w \h, \x, uabdl
374
+    SAD_X_\w \h, \x
375
+    SAD_X_END_\w \x
376
+.vl_gt_16_sad_x\x\()_\w\()x\h\():
377
+.if \w == 24 || \w == 32
378
+    SAD_X_SVE2_\w \h, \x
379
+.else
380
+.if \w == 12
381
+    movrel          x12, sad12_mask
382
+    ld1             {v31.16b}, [x12]
383
+.endif
384
+
385
+    SAD_X_START_\w \h, \x, uabdl
386
+    SAD_X_\w \h, \x
387
+    SAD_X_END_\w \x
388
+.endif
389
+endfunc
390
+.endm
391
+
392
+.macro SAD_X_LOOP_SVE2 x, w, h
393
+function PFX(sad_x\x\()_\w\()x\h\()_sve2)
394
+    mov             x9, #FENC_STRIDE
395
+
396
+// Make function arguments for x == 3 look like x == 4.
397
+.if \x == 3
398
+    mov             x6, x5
399
+    mov             x5, x4
400
+.endif
401
+    rdvl            x11, #1
402
+    cmp             x11, #16
403
+    bgt             .vl_gt_16_sad_x_loop_\x\()_\w\()x\h
404
+    SAD_X_START_\w \x
405
+    mov             w12, #\h/4
406
+.loop_sad_sve2_x\x\()_\w\()x\h:
407
+    sub             w12, w12, #1
408
+ .rept 4
409
+  .if \w == 24
410
+    ld1             {v6.16b}, [x0], #16
411
+    ld1             {v7.8b}, [x0], x9
412
+  .elseif \w == 32
413
+    ld1             {v6.16b-v7.16b}, [x0], x9
414
+  .elseif \w == 48
415
+    ld1             {v4.16b-v6.16b}, [x0], x9
416
+  .elseif \w == 64
417
+    ld1             {v4.16b-v7.16b}, [x0], x9
418
+  .endif
419
+    SAD_X_\w x1, v16, v20
420
+    SAD_X_\w x2, v17, v21
421
+    SAD_X_\w x3, v18, v22
422
+  .if \x == 4
423
+    SAD_X_\w x4, v19, v23
424
+  .endif
425
+ .endr
426
+    cbnz            w12, .loop_sad_sve2_x\x\()_\w\()x\h
427
+    SAD_X_END_\w \x
428
+.vl_gt_16_sad_x_loop_\x\()_\w\()x\h\():
429
+.if \w == 24 || \w == 32
430
+    SAD_X_SVE2_\w \h, \x
431
+    ret
432
+.else
433
+    SAD_X_START_\w \x
434
+    mov             w12, #\h/4
435
+.loop_sad_sve2_gt_16_x\x\()_\w\()x\h:
436
+    sub             w12, w12, #1
437
+ .rept 4
438
+  .if \w == 24
439
+    ld1             {v6.16b}, [x0], #16
440
+    ld1             {v7.8b}, [x0], x9
441
+  .elseif \w == 32
442
+    ld1             {v6.16b-v7.16b}, [x0], x9
443
+  .elseif \w == 48
444
+    ld1             {v4.16b-v6.16b}, [x0], x9
445
+  .elseif \w == 64
446
+    ld1             {v4.16b-v7.16b}, [x0], x9
447
+  .endif
448
+    SAD_X_\w x1, v16, v20
449
+    SAD_X_\w x2, v17, v21
450
+    SAD_X_\w x3, v18, v22
451
+  .if \x == 4
452
+    SAD_X_\w x4, v19, v23
453
+  .endif
454
+ .endr
455
+    cbnz            w12, .loop_sad_sve2_gt_16_x\x\()_\w\()x\h
456
+    SAD_X_END_\w \x
457
+.endif
458
+endfunc
459
+.endm
460
+
461
+
462
+SAD_X_FUNC_SVE2  3, 4,  4
463
+SAD_X_FUNC_SVE2  3, 4,  8
464
+SAD_X_FUNC_SVE2  3, 4,  16
465
+SAD_X_FUNC_SVE2  3, 8,  4
466
+SAD_X_FUNC_SVE2  3, 8,  8
467
+SAD_X_FUNC_SVE2  3, 8,  16
468
+SAD_X_FUNC_SVE2  3, 8,  32
469
+SAD_X_FUNC_SVE2  3, 12, 16
470
+SAD_X_FUNC_SVE2  3, 16, 4
471
+SAD_X_FUNC_SVE2  3, 16, 8
472
+SAD_X_FUNC_SVE2  3, 16, 12
473
+SAD_X_FUNC_SVE2  3, 16, 16
474
+SAD_X_FUNC_SVE2  3, 16, 32
475
+SAD_X_FUNC_SVE2  3, 16, 64
476
+SAD_X_LOOP_SVE2  3, 24, 32
477
+SAD_X_LOOP_SVE2  3, 32, 8
478
+SAD_X_LOOP_SVE2  3, 32, 16
479
+SAD_X_LOOP_SVE2  3, 32, 24
480
+SAD_X_LOOP_SVE2  3, 32, 32
481
+SAD_X_LOOP_SVE2  3, 32, 64
482
+SAD_X_LOOP_SVE2  3, 48, 64
483
+SAD_X_LOOP_SVE2  3, 64, 16
484
+SAD_X_LOOP_SVE2  3, 64, 32
485
+SAD_X_LOOP_SVE2  3, 64, 48
486
+SAD_X_LOOP_SVE2  3, 64, 64
487
+
488
+SAD_X_FUNC_SVE2  4, 4,  4
489
+SAD_X_FUNC_SVE2  4, 4,  8
490
+SAD_X_FUNC_SVE2  4, 4,  16
491
+SAD_X_FUNC_SVE2  4, 8,  4
492
+SAD_X_FUNC_SVE2  4, 8,  8
493
+SAD_X_FUNC_SVE2  4, 8,  16
494
+SAD_X_FUNC_SVE2  4, 8,  32
495
+SAD_X_FUNC_SVE2  4, 12, 16
496
+SAD_X_FUNC_SVE2  4, 16, 4
497
+SAD_X_FUNC_SVE2  4, 16, 8
498
+SAD_X_FUNC_SVE2  4, 16, 12
499
+SAD_X_FUNC_SVE2  4, 16, 16
500
+SAD_X_FUNC_SVE2  4, 16, 32
501
+SAD_X_FUNC_SVE2  4, 16, 64
502
+SAD_X_LOOP_SVE2  4, 24, 32
503
+SAD_X_LOOP_SVE2  4, 32, 8
504
+SAD_X_LOOP_SVE2  4, 32, 16
505
+SAD_X_LOOP_SVE2  4, 32, 24
506
+SAD_X_LOOP_SVE2  4, 32, 32
507
+SAD_X_LOOP_SVE2  4, 32, 64
508
+SAD_X_LOOP_SVE2  4, 48, 64
509
+SAD_X_LOOP_SVE2  4, 64, 16
510
+SAD_X_LOOP_SVE2  4, 64, 32
511
+SAD_X_LOOP_SVE2  4, 64, 48
512
+SAD_X_LOOP_SVE2  4, 64, 64
513
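
A note on the predicated SVE2 loads used throughout this file: ld1b with
a /z predicate zeroes inactive lanes, and whilelt builds a predicate
covering exactly the block width (24 or 48 bytes), so odd widths need no
mask constant like the NEON sad12_mask. A hedged intrinsics sketch of one
24-wide accumulation step, assuming SVE2 is available (svabalb/svabalt
are the intrinsic forms of the uabalb/uabalt instructions used above):

    #include <arm_sve.h>   /* requires -march=armv8-a+sve2 */
    #include <stdint.h>

    static svuint16_t sad24_row(svuint16_t acc,
                                const uint8_t *a, const uint8_t *b)
    {
        svbool_t p = svwhilelt_b8((uint64_t)0, (uint64_t)24);
        svuint8_t va = svld1_u8(p, a);   /* lanes 24.. load as zero */
        svuint8_t vb = svld1_u8(p, b);
        acc = svabalb_u16(acc, va, vb);  /* abs-diff accumulate, even lanes */
        acc = svabalt_u16(acc, va, vb);  /* abs-diff accumulate, odd lanes */
        return acc;
    }
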
x265_3.5.tar.gz/source/common/aarch64/sad-a.S -> x265_3.6.tar.gz/source/common/aarch64/sad-a.S Changed
256
 
1
@@ -1,7 +1,8 @@
2
 /*****************************************************************************
3
- * Copyright (C) 2020 MulticoreWare, Inc
4
+ * Copyright (C) 2020-2021 MulticoreWare, Inc
5
  *
6
  * Authors: Hongbin Liu <liuhongbin1@huawei.com>
7
+ *          Sebastian Pop <spop@amazon.com>
8
  *
9
  * This program is free software; you can redistribute it and/or modify
10
  * it under the terms of the GNU General Public License as published by
11
@@ -22,84 +23,186 @@
12
  *****************************************************************************/
13
 
14
 #include "asm.S"
15
+#include "sad-a-common.S"
16
 
17
+#ifdef __APPLE__
18
+.section __RODATA,__rodata
19
+#else
20
 .section .rodata
21
+#endif
22
 
23
 .align 4
24
 
25
 .text
26
 
27
-.macro SAD_X_START_8 x
28
-    ld1             {v0.8b}, x0, x9
29
-.if \x == 3
30
-    ld1             {v1.8b}, x1, x4
31
-    ld1             {v2.8b}, x2, x4
32
-    ld1             {v3.8b}, x3, x4
33
-.elseif \x == 4
34
-    ld1             {v1.8b}, x1, x5
35
-    ld1             {v2.8b}, x2, x5
36
-    ld1             {v3.8b}, x3, x5
37
-    ld1             {v4.8b}, x4, x5
38
-.endif
39
-    uabdl           v16.8h, v0.8b, v1.8b
40
-    uabdl           v17.8h, v0.8b, v2.8b
41
-    uabdl           v18.8h, v0.8b, v3.8b
42
-.if \x == 4
43
-    uabdl           v19.8h, v0.8b, v4.8b
44
+// Fully unrolled.
45
+.macro SAD_FUNC w, h
46
+function PFX(pixel_sad_\w\()x\h\()_neon)
47
+    SAD_START_\w uabdl
48
+    SAD_\w \h
49
+.if \w > 4
50
+    add             v16.8h, v16.8h, v17.8h
51
 .endif
52
+    uaddlv          s0, v16.8h
53
+    fmov            w0, s0
54
+    ret
55
+endfunc
56
+.endm
57
+
58
+// Loop unrolled 4.
59
+.macro SAD_FUNC_LOOP w, h
60
+function PFX(pixel_sad_\w\()x\h\()_neon)
61
+    SAD_START_\w
62
+
63
+    mov             w9, #\h/8
64
+.loop_\w\()x\h:
65
+    sub             w9, w9, #1
66
+.rept 4
67
+    SAD_\w
68
+.endr
69
+    cbnz            w9, .loop_\w\()x\h
70
+
71
+    SAD_END_\w
72
+endfunc
73
 .endm
74
 
75
-.macro SAD_X_8 x
76
-    ld1             {v0.8b}, x0, x9
77
+SAD_FUNC  4,  4
78
+SAD_FUNC  4,  8
79
+SAD_FUNC  4,  16
80
+SAD_FUNC  8,  4
81
+SAD_FUNC  8,  8
82
+SAD_FUNC  8,  16
83
+SAD_FUNC  8,  32
84
+SAD_FUNC  16, 4
85
+SAD_FUNC  16, 8
86
+SAD_FUNC  16, 12
87
+SAD_FUNC  16, 16
88
+SAD_FUNC  16, 32
89
+SAD_FUNC  16, 64
90
+
91
+SAD_FUNC_LOOP  32, 8
92
+SAD_FUNC_LOOP  32, 16
93
+SAD_FUNC_LOOP  32, 24
94
+SAD_FUNC_LOOP  32, 32
95
+SAD_FUNC_LOOP  32, 64
96
+SAD_FUNC_LOOP  64, 16
97
+SAD_FUNC_LOOP  64, 32
98
+SAD_FUNC_LOOP  64, 48
99
+SAD_FUNC_LOOP  64, 64
100
+SAD_FUNC_LOOP  12, 16
101
+SAD_FUNC_LOOP  24, 32
102
+SAD_FUNC_LOOP  48, 64
103
+
104
+// SAD_X3 and SAD_X4 code start
105
+
106
+// static void x264_pixel_sad_x3_##size(pixel *fenc, pixel *pix0, pixel *pix1, pixel *pix2, intptr_t i_stride, int scores[3])
107
+// static void x264_pixel_sad_x4_##size(pixel *fenc, pixel *pix0, pixel *pix1,pixel *pix2, pixel *pix3, intptr_t i_stride, int scores[4])
108
+.macro SAD_X_FUNC x, w, h
109
+function PFX(sad_x\x\()_\w\()x\h\()_neon)
110
+    mov             x9, #FENC_STRIDE
111
+
112
+// Make function arguments for x == 3 look like x == 4.
113
 .if \x == 3
114
-    ld1             {v1.8b}, x1, x4
115
-    ld1             {v2.8b}, x2, x4
116
-    ld1             {v3.8b}, x3, x4
117
-.elseif \x == 4
118
-    ld1             {v1.8b}, x1, x5
119
-    ld1             {v2.8b}, x2, x5
120
-    ld1             {v3.8b}, x3, x5
121
-    ld1             {v4.8b}, x4, x5
122
+    mov             x6, x5
123
+    mov             x5, x4
124
 .endif
125
-    uabal           v16.8h, v0.8b, v1.8b
126
-    uabal           v17.8h, v0.8b, v2.8b
127
-    uabal           v18.8h, v0.8b, v3.8b
128
-.if \x == 4
129
-    uabal           v19.8h, v0.8b, v4.8b
130
+
131
+.if \w == 12
132
+    movrel          x12, sad12_mask
133
+    ld1             {v31.16b}, [x12]
134
 .endif
135
+
136
+    SAD_X_START_\w \h, \x, uabdl
137
+    SAD_X_\w \h, \x
138
+    SAD_X_END_\w \x
139
+endfunc
140
 .endm
141
 
142
-.macro SAD_X_8xN x, h
143
-function x265_sad_x\x\()_8x\h\()_neon
144
+.macro SAD_X_LOOP x, w, h
145
+function PFX(sad_x\x\()_\w\()x\h\()_neon)
146
     mov             x9, #FENC_STRIDE
147
-    SAD_X_START_8 \x
148
-.rept \h - 1
149
-    SAD_X_8 \x
150
-.endr
151
-    uaddlv          s0, v16.8h
152
-    uaddlv          s1, v17.8h
153
-    uaddlv          s2, v18.8h
154
-.if \x == 4
155
-    uaddlv          s3, v19.8h
156
-.endif
157
 
158
+// Make function arguments for x == 3 look like x == 4.
159
 .if \x == 3
160
-    stp             s0, s1, x5
161
-    str             s2, x5, #8
162
-.elseif \x == 4
163
-    stp             s0, s1, x6
164
-    stp             s2, s3, x6, #8
165
+    mov             x6, x5
166
+    mov             x5, x4
167
 .endif
168
-    ret
169
+    SAD_X_START_\w \x
170
+    mov             w12, #\h/4
171
+.loop_sad_x\x\()_\w\()x\h:
172
+    sub             w12, w12, #1
173
+ .rept 4
174
+  .if \w == 24
175
+    ld1             {v6.16b}, [x0], #16
176
+    ld1             {v7.8b}, [x0], x9
177
+  .elseif \w == 32
178
+    ld1             {v6.16b-v7.16b}, [x0], x9
179
+  .elseif \w == 48
180
+    ld1             {v4.16b-v6.16b}, [x0], x9
181
+  .elseif \w == 64
182
+    ld1             {v4.16b-v7.16b}, [x0], x9
183
+  .endif
184
+    SAD_X_\w x1, v16, v20
185
+    SAD_X_\w x2, v17, v21
186
+    SAD_X_\w x3, v18, v22
187
+  .if \x == 4
188
+    SAD_X_\w x4, v19, v23
189
+  .endif
190
+ .endr
191
+    cbnz            w12, .loop_sad_x\x\()_\w\()x\h
192
+    SAD_X_END_\w \x
193
 endfunc
194
 .endm
195
 
196
-SAD_X_8xN 3 4
197
-SAD_X_8xN 3 8
198
-SAD_X_8xN 3 16
199
-SAD_X_8xN 3 32
200
 
201
-SAD_X_8xN 4 4
202
-SAD_X_8xN 4 8
203
-SAD_X_8xN 4 16
204
-SAD_X_8xN 4 32
205
+SAD_X_FUNC  3, 4,  4
206
+SAD_X_FUNC  3, 4,  8
207
+SAD_X_FUNC  3, 4,  16
208
+SAD_X_FUNC  3, 8,  4
209
+SAD_X_FUNC  3, 8,  8
210
+SAD_X_FUNC  3, 8,  16
211
+SAD_X_FUNC  3, 8,  32
212
+SAD_X_FUNC  3, 12, 16
213
+SAD_X_FUNC  3, 16, 4
214
+SAD_X_FUNC  3, 16, 8
215
+SAD_X_FUNC  3, 16, 12
216
+SAD_X_FUNC  3, 16, 16
217
+SAD_X_FUNC  3, 16, 32
218
+SAD_X_FUNC  3, 16, 64
219
+SAD_X_LOOP  3, 24, 32
220
+SAD_X_LOOP  3, 32, 8
221
+SAD_X_LOOP  3, 32, 16
222
+SAD_X_LOOP  3, 32, 24
223
+SAD_X_LOOP  3, 32, 32
224
+SAD_X_LOOP  3, 32, 64
225
+SAD_X_LOOP  3, 48, 64
226
+SAD_X_LOOP  3, 64, 16
227
+SAD_X_LOOP  3, 64, 32
228
+SAD_X_LOOP  3, 64, 48
229
+SAD_X_LOOP  3, 64, 64
230
+
231
+SAD_X_FUNC  4, 4,  4
232
+SAD_X_FUNC  4, 4,  8
233
+SAD_X_FUNC  4, 4,  16
234
+SAD_X_FUNC  4, 8,  4
235
+SAD_X_FUNC  4, 8,  8
236
+SAD_X_FUNC  4, 8,  16
237
+SAD_X_FUNC  4, 8,  32
238
+SAD_X_FUNC  4, 12, 16
239
+SAD_X_FUNC  4, 16, 4
240
+SAD_X_FUNC  4, 16, 8
241
+SAD_X_FUNC  4, 16, 12
242
+SAD_X_FUNC  4, 16, 16
243
+SAD_X_FUNC  4, 16, 32
244
+SAD_X_FUNC  4, 16, 64
245
+SAD_X_LOOP  4, 24, 32
246
+SAD_X_LOOP  4, 32, 8
247
+SAD_X_LOOP  4, 32, 16
248
+SAD_X_LOOP  4, 32, 24
249
+SAD_X_LOOP  4, 32, 32
250
+SAD_X_LOOP  4, 32, 64
251
+SAD_X_LOOP  4, 48, 64
252
+SAD_X_LOOP  4, 64, 16
253
+SAD_X_LOOP  4, 64, 32
254
+SAD_X_LOOP  4, 64, 48
255
+SAD_X_LOOP  4, 64, 64
256
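
For context on the sad_x3/sad_x4 functions rewritten above: one encoder
block with the fixed stride FENC_STRIDE (64 in x265) is compared against
three or four candidate blocks sharing a single stride, and the resulting
SADs are stored through the results pointer. A scalar C sketch of the x4
variant (the function name is hypothetical):

    #include <stdint.h>

    enum { FENC_STRIDE = 64 };

    static void sad_x4_ref(int w, int h, const uint8_t *fenc,
                           const uint8_t *pix[4], intptr_t stride,
                           int32_t res[4])
    {
        for (int i = 0; i < 4; i++) {
            uint32_t sum = 0;
            const uint8_t *f = fenc, *p = pix[i];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    int d = f[x] - p[x];
                    sum += (uint32_t)(d < 0 ? -d : d);
                }
                f += FENC_STRIDE;
                p += stride;
            }
            res[i] = (int32_t)sum;
        }
    }

The x == 3 case reuses the same macro bodies: the "Make function
arguments for x == 3 look like x == 4" moves shift the stride and result
pointer into the registers the x == 4 layout expects.
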
x265_3.6.tar.gz/source/common/aarch64/ssd-a-common.S Added
39
 
1
@@ -0,0 +1,37 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+// This file contains the macros written using the NEON instruction set
26
+// that are also used by the SVE2 functions.
27
+
28
+#include "asm.S"
29
+
30
+.arch           armv8-a
31
+
32
+.macro ret_v0_w0
33
+    trn2            v1.2d, v0.2d, v0.2d
34
+    add             v0.2s, v0.2s, v1.2s
35
+    addp            v0.2s, v0.2s, v0.2s
36
+    fmov            w0, s0
37
+    ret
38
+.endm
39
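
The ret_v0_w0 helper above is a horizontal add of the four 32-bit lanes
of v0: trn2 copies the high 64 bits over the low ones, the first add
folds lanes {0,2} and {1,3}, addp folds the remaining pair, and fmov
moves lane 0 to w0. In scalar terms:

    #include <stdint.h>

    /* Scalar meaning of ret_v0_w0: w0 = v0[0] + v0[1] + v0[2] + v0[3]. */
    static uint32_t hadd_4s(const uint32_t v[4])
    {
        return (v[0] + v[2]) + (v[1] + v[3]);
    }
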
x265_3.6.tar.gz/source/common/aarch64/ssd-a-sve.S Added
80
 
1
@@ -0,0 +1,78 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+
27
+.arch armv8-a+sve
28
+
29
+#ifdef __APPLE__
30
+.section __RODATA,__rodata
31
+#else
32
+.section .rodata
33
+#endif
34
+
35
+.align 4
36
+
37
+.text
38
+
39
+function PFX(pixel_sse_pp_4x4_sve)
40
+    ptrue           p0.s, vl4
41
+    ld1b            {z0.s}, p0/z, [x0]
42
+    ld1b            {z17.s}, p0/z, [x2]
43
+    add             x0, x0, x1
44
+    add             x2, x2, x3
45
+    sub             z0.s, p0/m, z0.s, z17.s
46
+    mul             z0.s, p0/m, z0.s, z0.s
47
+.rept 3
48
+    ld1b            {z16.s}, p0/z, [x0]
49
+    ld1b            {z17.s}, p0/z, [x2]
50
+    add             x0, x0, x1
51
+    add             x2, x2, x3
52
+    sub             z16.s, p0/m, z16.s, z17.s
53
+    mla             z0.s, p0/m, z16.s, z16.s
54
+.endr
55
+    uaddv           d0, p0, z0.s
56
+    fmov            w0, s0
57
+    ret
58
+endfunc
59
+
60
+function PFX(pixel_sse_pp_4x8_sve)
61
+    ptrue           p0.s, vl4
62
+    ld1b            {z0.s}, p0/z, x0
63
+    ld1b            {z17.s}, p0/z, x2
64
+    add             x0, x0, x1
65
+    add             x2, x2, x3
66
+    sub             z0.s, p0/m, z0.s, z17.s
67
+    mul             z0.s, p0/m, z0.s, z0.s
68
+.rept 7
69
+    ld1b            {z16.s}, p0/z, [x0]
70
+    ld1b            {z17.s}, p0/z, [x2]
71
+    add             x0, x0, x1
72
+    add             x2, x2, x3
73
+    sub             z16.s, p0/m, z16.s, z17.s
74
+    mla             z0.s, p0/m, z16.s, z16.s
75
+.endr
76
+    uaddv           d0, p0, z0.s
77
+    fmov            w0, s0
78
+    ret
79
+endfunc
80
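
The pixel_sse_pp functions above compute SSE (sum of squared errors)
rather than SAD; note the ld1b {z0.s} loads, which widen each 8-bit
pixel to a 32-bit lane so the subtract and multiply-accumulate run at
element size .s. A scalar C sketch (the function name is hypothetical):

    #include <stdint.h>

    static uint32_t sse_pp_ref(int w, int h,
                               const uint8_t *pix1, intptr_t stride1,
                               const uint8_t *pix2, intptr_t stride2)
    {
        uint32_t sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int d = pix1[x] - pix2[x];
                sum += (uint32_t)(d * d);   /* squared error */
            }
            pix1 += stride1;
            pix2 += stride2;
        }
        return sum;
    }
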
x265_3.6.tar.gz/source/common/aarch64/ssd-a-sve2.S Added
889
 
1
@@ -0,0 +1,887 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
4
+ *
5
+ * Authors: David Chen <david.chen@myais.com.cn>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm-sve.S"
26
+#include "ssd-a-common.S"
27
+
28
+.arch armv8-a+sve2
29
+
30
+#ifdef __APPLE__
31
+.section __RODATA,__rodata
32
+#else
33
+.section .rodata
34
+#endif
35
+
36
+.align 4
37
+
38
+.text
39
+
40
+function PFX(pixel_sse_pp_32x32_sve2)
41
+    rdvl            x9, #1
42
+    cmp             x9, #16
43
+    bgt             .vl_gt_16_pixel_sse_pp_32x32
44
+    mov             w12, #8
45
+    movi            v0.16b, #0
46
+    movi            v1.16b, #0
47
+.loop_sse_pp_32_sve2:
48
+    sub             w12, w12, #1
49
+.rept 4
50
+    ld1             {v16.16b,v17.16b}, x0, x1
51
+    ld1             {v18.16b,v19.16b}, x2, x3
52
+    usubl           v2.8h, v16.8b, v18.8b
53
+    usubl2          v3.8h, v16.16b, v18.16b
54
+    usubl           v4.8h, v17.8b, v19.8b
55
+    usubl2          v5.8h, v17.16b, v19.16b
56
+    smlal           v0.4s, v2.4h, v2.4h
57
+    smlal2          v1.4s, v2.8h, v2.8h
58
+    smlal           v0.4s, v3.4h, v3.4h
59
+    smlal2          v1.4s, v3.8h, v3.8h
60
+    smlal           v0.4s, v4.4h, v4.4h
61
+    smlal2          v1.4s, v4.8h, v4.8h
62
+    smlal           v0.4s, v5.4h, v5.4h
63
+    smlal2          v1.4s, v5.8h, v5.8h
64
+.endr
65
+    cbnz            w12, .loop_sse_pp_32_sve2
66
+    add             v0.4s, v0.4s, v1.4s
67
+    ret_v0_w0
68
+.vl_gt_16_pixel_sse_pp_32x32:
69
+    ptrue           p0.b, vl32
70
+    ld1b            {z16.b}, p0/z, x0
71
+    ld1b            {z18.b}, p0/z, x2
72
+    add             x0, x0, x1
73
+    add             x2, x2, x3
74
+    usublb          z1.h, z16.b, z18.b
75
+    usublt          z2.h, z16.b, z18.b
76
+    smullb          z0.s, z1.h, z1.h
77
+    smlalt          z0.s, z1.h, z1.h
78
+    smlalb          z0.s, z2.h, z2.h
79
+    smlalt          z0.s, z2.h, z2.h
80
+.rept 31
81
+    ld1b            {z16.b}, p0/z, x0
82
+    ld1b            {z18.b}, p0/z, x2
83
+    add             x0, x0, x1
84
+    add             x2, x2, x3
85
+    usublb          z1.h, z16.b, z18.b
86
+    usublt          z2.h, z16.b, z18.b
87
+    smullb          z0.s, z1.h, z1.h
88
+    smlalt          z0.s, z1.h, z1.h
89
+    smlalb          z0.s, z2.h, z2.h
90
+    smlalt          z0.s, z2.h, z2.h
91
+.endr
92
+    uaddv           d3, p0, z0.s
93
+    fmov            w0, s3
94
+    ret
95
+endfunc
96
+
97
+function PFX(pixel_sse_pp_32x64_sve2)
98
+    rdvl            x9, #1
99
+    cmp             x9, #16
100
+    bgt             .vl_gt_16_pixel_sse_pp_32x64
101
+    ptrue           p0.b, vl16
102
+    ld1b            {z16.b}, p0/z, x0
103
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
104
+    ld1b            {z18.b}, p0/z, x2
105
+    ld1b            {z19.b}, p0/z, x2, #1, mul vl
106
+    add             x0, x0, x1
107
+    add             x2, x2, x3
108
+    usublb          z1.h, z16.b, z18.b
109
+    usublt          z2.h, z16.b, z18.b
110
+    usublb          z3.h, z17.b, z19.b
111
+    usublt          z4.h, z17.b, z19.b
112
+    smullb          z20.s, z1.h, z1.h
113
+    smullt          z21.s, z1.h, z1.h
114
+    smlalb          z20.s, z2.h, z2.h
115
+    smlalt          z21.s, z2.h, z2.h
116
+    smlalb          z20.s, z3.h, z3.h
117
+    smlalt          z21.s, z3.h, z3.h
118
+    smlalb          z20.s, z4.h, z4.h
119
+    smlalt          z21.s, z4.h, z4.h
120
+.rept 63
121
+    ld1b            {z16.b}, p0/z, x0
122
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
123
+    ld1b            {z18.b}, p0/z, x2
124
+    ld1b            {z19.b}, p0/z, x2, #1, mul vl
125
+    add             x0, x0, x1
126
+    add             x2, x2, x3
127
+    usublb          z1.h, z16.b, z18.b
128
+    usublt          z2.h, z16.b, z18.b
129
+    usublb          z3.h, z17.b, z19.b
130
+    usublt          z4.h, z17.b, z19.b
131
+    smlalb          z20.s, z1.h, z1.h
132
+    smlalt          z21.s, z1.h, z1.h
133
+    smlalb          z20.s, z2.h, z2.h
134
+    smlalt          z21.s, z2.h, z2.h
135
+    smlalb          z20.s, z3.h, z3.h
136
+    smlalt          z21.s, z3.h, z3.h
137
+    smlalb          z20.s, z4.h, z4.h
138
+    smlalt          z21.s, z4.h, z4.h
139
+.endr
140
+    uaddv           d3, p0, z20.s
141
+    fmov            w0, s3
142
+    uaddv           d4, p0, z21.s
143
+    fmov            w1, s4
144
+    add             w0, w0, w1
145
+    ret
146
+.vl_gt_16_pixel_sse_pp_32x64:
147
+    ptrue           p0.b, vl32
148
+    ld1b            {z16.b}, p0/z, x0
149
+    ld1b            {z18.b}, p0/z, x2
150
+    add             x0, x0, x1
151
+    add             x2, x2, x3
152
+    usublb          z1.h, z16.b, z18.b
153
+    usublt          z2.h, z16.b, z18.b
154
+    smullb          z20.s, z1.h, z1.h
155
+    smullt          z21.s, z1.h, z1.h
156
+    smlalb          z20.s, z2.h, z2.h
157
+    smlalt          z21.s, z2.h, z2.h
158
+.rept 63
159
+    ld1b            {z16.b}, p0/z, x0
160
+    ld1b            {z18.b}, p0/z, x2
161
+    add             x0, x0, x1
162
+    add             x2, x2, x3
163
+    usublb          z1.h, z16.b, z18.b
164
+    usublt          z2.h, z16.b, z18.b
165
+    smlalb          z20.s, z1.h, z1.h
166
+    smlalt          z21.s, z1.h, z1.h
167
+    smlalb          z20.s, z2.h, z2.h
168
+    smlalt          z21.s, z2.h, z2.h
169
+.endr
170
+    uaddv           d3, p0, z20.s
171
+    fmov            w0, s3
172
+    uaddv           d4, p0, z21.s
173
+    fmov            w1, s4
174
+    add             w0, w0, w1
175
+    ret
176
+endfunc
177
+
178
+function PFX(pixel_sse_pp_64x64_sve2)
179
+    rdvl            x9, #1
180
+    cmp             x9, #16
181
+    bgt             .vl_gt_16_pixel_sse_pp_64x64
182
+    mov             w12, #16
183
+    movi            v0.16b, #0
184
+    movi            v1.16b, #0
185
+
186
+.loop_sse_pp_64_sve2:
187
+    sub             w12, w12, #1
188
+.rept 4
189
+    ld1             {v16.16b-v19.16b}, x0, x1
190
+    ld1             {v20.16b-v23.16b}, x2, x3
191
+
192
+    usubl           v2.8h, v16.8b, v20.8b
193
+    usubl2          v3.8h, v16.16b, v20.16b
194
+    usubl           v4.8h, v17.8b, v21.8b
195
+    usubl2          v5.8h, v17.16b, v21.16b
196
+    smlal           v0.4s, v2.4h, v2.4h
197
+    smlal2          v1.4s, v2.8h, v2.8h
198
+    smlal           v0.4s, v3.4h, v3.4h
199
+    smlal2          v1.4s, v3.8h, v3.8h
200
+    smlal           v0.4s, v4.4h, v4.4h
201
+    smlal2          v1.4s, v4.8h, v4.8h
202
+    smlal           v0.4s, v5.4h, v5.4h
203
+    smlal2          v1.4s, v5.8h, v5.8h
204
+
205
+    usubl           v2.8h, v18.8b, v22.8b
206
+    usubl2          v3.8h, v18.16b, v22.16b
207
+    usubl           v4.8h, v19.8b, v23.8b
208
+    usubl2          v5.8h, v19.16b, v23.16b
209
+    smlal           v0.4s, v2.4h, v2.4h
210
+    smlal2          v1.4s, v2.8h, v2.8h
211
+    smlal           v0.4s, v3.4h, v3.4h
212
+    smlal2          v1.4s, v3.8h, v3.8h
213
+    smlal           v0.4s, v4.4h, v4.4h
214
+    smlal2          v1.4s, v4.8h, v4.8h
215
+    smlal           v0.4s, v5.4h, v5.4h
216
+    smlal2          v1.4s, v5.8h, v5.8h
217
+.endr
218
+    cbnz            w12, .loop_sse_pp_64_sve2
219
+    add             v0.4s, v0.4s, v1.4s
220
+    ret_v0_w0
221
+.vl_gt_16_pixel_sse_pp_64x64:
222
+    cmp             x9, #48
223
+    bgt             .vl_gt_48_pixel_sse_pp_64x64
224
+    ptrue           p0.b, vl32
225
+    ld1b            {z16.b}, p0/z, x0
226
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
227
+    ld1b            {z20.b}, p0/z, x2
228
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
229
+    add             x0, x0, x1
230
+    add             x2, x2, x3
231
+    usublb          z1.h, z16.b, z20.b
232
+    usublt          z2.h, z16.b, z20.b
233
+    usublb          z3.h, z17.b, z21.b
234
+    usublt          z4.h, z17.b, z21.b
235
+    smullb          z24.s, z1.h, z1.h
236
+    smullt          z25.s, z1.h, z1.h
237
+    smlalb          z24.s, z2.h, z2.h
238
+    smlalt          z25.s, z2.h, z2.h
239
+    smlalb          z24.s, z3.h, z3.h
240
+    smlalt          z25.s, z3.h, z3.h
241
+    smlalb          z24.s, z4.h, z4.h
242
+    smlalt          z25.s, z4.h, z4.h
243
+.rept 63
244
+    ld1b            {z16.b}, p0/z, x0
245
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
246
+    ld1b            {z20.b}, p0/z, x2
247
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
248
+    add             x0, x0, x1
249
+    add             x2, x2, x3
250
+    usublb          z1.h, z16.b, z20.b
251
+    usublt          z2.h, z16.b, z20.b
252
+    usublb          z3.h, z17.b, z21.b
253
+    usublt          z4.h, z17.b, z21.b
254
+    smlalb          z24.s, z1.h, z1.h
255
+    smlalt          z25.s, z1.h, z1.h
256
+    smlalb          z24.s, z2.h, z2.h
257
+    smlalt          z25.s, z2.h, z2.h
258
+    smlalb          z24.s, z3.h, z3.h
259
+    smlalt          z25.s, z3.h, z3.h
260
+    smlalb          z24.s, z4.h, z4.h
261
+    smlalt          z25.s, z4.h, z4.h
262
+.endr
263
+    uaddv           d3, p0, z24.s
264
+    fmov            w0, s3
265
+    uaddv           d4, p0, z25.s
266
+    fmov            w1, s4
267
+    add             w0, w0, w1
268
+    ret
269
+.vl_gt_48_pixel_sse_pp_64x64:
270
+    ptrue           p0.b, vl64
271
+    ld1b            {z16.b}, p0/z, x0
272
+    ld1b            {z20.b}, p0/z, x2
273
+    add             x0, x0, x1
274
+    add             x2, x2, x3
275
+    usublb          z1.h, z16.b, z20.b
276
+    usublt          z2.h, z16.b, z20.b
277
+    smullb          z24.s, z1.h, z1.h
278
+    smullt          z25.s, z1.h, z1.h
279
+    smlalb          z24.s, z2.h, z2.h
280
+    smlalt          z25.s, z2.h, z2.h
281
+.rept 63
282
+    ld1b            {z16.b}, p0/z, x0
283
+    ld1b            {z20.b}, p0/z, x2
284
+    add             x0, x0, x1
285
+    add             x2, x2, x3
286
+    usublb          z1.h, z16.b, z20.b
287
+    usublt          z2.h, z16.b, z20.b
288
+    smlalb          z24.s, z1.h, z1.h
289
+    smlalt          z25.s, z1.h, z1.h
290
+    smlalb          z24.s, z2.h, z2.h
291
+    smlalt          z25.s, z2.h, z2.h
292
+.endr
293
+    uaddv           d3, p0, z24.s
294
+    fmov            w0, s3
295
+    uaddv           d4, p0, z25.s
296
+    fmov            w1, s4
297
+    add             w0, w0, w1
298
+    ret
299
+endfunc
300
+
301
+function PFX(pixel_sse_ss_4x4_sve2)
302
+    ptrue           p0.b, vl8
303
+    ld1b            {z16.b}, p0/z, x0
304
+    ld1b            {z17.b}, p0/z, x2
305
+    add             x0, x0, x1, lsl #1
306
+    add             x2, x2, x3, lsl #1
307
+    sub             z1.h, z16.h, z17.h
308
+    smullb          z3.s, z1.h, z1.h
309
+    smullt          z4.s, z1.h, z1.h
310
+.rept 3
311
+    ld1b            {z16.b}, p0/z, x0
312
+    ld1b            {z17.b}, p0/z, x2
313
+    add             x0, x0, x1, lsl #1
314
+    add             x2, x2, x3, lsl #1
315
+    sub             z1.h, z16.h, z17.h
316
+    smlalb          z3.s, z1.h, z1.h
317
+    smlalt          z4.s, z1.h, z1.h
318
+.endr
319
+    uaddv           d3, p0, z3.s
320
+    fmov            w0, s3
321
+    uaddv           d4, p0, z4.s
322
+    fmov            w1, s4
323
+    add             w0, w0, w1
324
+    ret
325
+endfunc
326
+
327
+function PFX(pixel_sse_ss_8x8_sve2)
328
+    ptrue           p0.b, vl16
329
+    ld1b            {z16.b}, p0/z, x0
330
+    ld1b            {z17.b}, p0/z, x2
331
+    add             x0, x0, x1, lsl #1
332
+    add             x2, x2, x3, lsl #1
333
+    sub             z1.h, z16.h, z17.h
334
+    smullb          z3.s, z1.h, z1.h
335
+    smullt          z4.s, z1.h, z1.h
336
+.rept 7
337
+    ld1b            {z16.b}, p0/z, x0
338
+    ld1b            {z17.b}, p0/z, x2
339
+    add             x0, x0, x1, lsl #1
340
+    add             x2, x2, x3, lsl #1
341
+    sub             z1.h, z16.h, z17.h
342
+    smlalb          z3.s, z1.h, z1.h
343
+    smlalt          z4.s, z1.h, z1.h
344
+.endr
345
+    uaddv           d3, p0, z3.s
346
+    fmov            w0, s3
347
+    uaddv           d4, p0, z4.s
348
+    fmov            w1, s4
349
+    add             w0, w0, w1
350
+    ret
351
+endfunc
352
+
353
+function PFX(pixel_sse_ss_16x16_sve2)
354
+    rdvl            x9, #1
355
+    cmp             x9, #16
356
+    bgt             .vl_gt_16_pixel_sse_ss_16x16
357
+    ptrue           p0.b, vl16
358
+    ld1b            {z16.b}, p0/z, x0
359
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
360
+    ld1b            {z18.b}, p0/z, x2
361
+    ld1b            {z19.b}, p0/z, x2, #1, mul vl
362
+    add             x0, x0, x1, lsl #1
363
+    add             x2, x2, x3, lsl #1
364
+    sub             z1.h, z16.h, z18.h
365
+    sub             z2.h, z17.h, z19.h
366
+    smullb          z3.s, z1.h, z1.h
367
+    smullt          z4.s, z1.h, z1.h
368
+    smlalb          z3.s, z2.h, z2.h
369
+    smlalt          z4.s, z2.h, z2.h
370
+.rept 15
371
+    ld1b            {z16.b}, p0/z, x0
372
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
373
+    ld1b            {z18.b}, p0/z, x2
374
+    ld1b            {z19.b}, p0/z, x2, #1, mul vl
375
+    add             x0, x0, x1, lsl #1
376
+    add             x2, x2, x3, lsl #1
377
+    sub             z1.h, z16.h, z18.h
378
+    sub             z2.h, z17.h, z19.h
379
+    smlalb          z3.s, z1.h, z1.h
380
+    smlalt          z4.s, z1.h, z1.h
381
+    smlalb          z3.s, z2.h, z2.h
382
+    smlalt          z4.s, z2.h, z2.h
383
+.endr
384
+    uaddv           d3, p0, z3.s
385
+    fmov            w0, s3
386
+    uaddv           d4, p0, z4.s
387
+    fmov            w1, s4
388
+    add             w0, w0, w1
389
+    ret
390
+.vl_gt_16_pixel_sse_ss_16x16:
391
+    ptrue           p0.b, vl32
392
+    ld1b            {z16.b}, p0/z, x0
393
+    ld1b            {z18.b}, p0/z, x2
394
+    add             x0, x0, x1, lsl #1
395
+    add             x2, x2, x3, lsl #1
396
+    sub             z1.h, z16.h, z18.h
397
+    smullb          z3.s, z1.h, z1.h
398
+    smullt          z4.s, z1.h, z1.h
399
+.rept 15
400
+    ld1b            {z16.b}, p0/z, x0
401
+    ld1b            {z18.b}, p0/z, x2
402
+    add             x0, x0, x1, lsl #1
403
+    add             x2, x2, x3, lsl #1
404
+    sub             z1.h, z16.h, z18.h
405
+    smlalb          z3.s, z1.h, z1.h
406
+    smlalt          z4.s, z1.h, z1.h
407
+.endr
408
+    uaddv           d3, p0, z3.s
409
+    fmov            w0, s3
410
+    uaddv           d4, p0, z4.s
411
+    fmov            w1, s4
412
+    add             w0, w0, w1
413
+    ret
414
+endfunc
415
+
416
+function PFX(pixel_sse_ss_32x32_sve2)
417
+    rdvl            x9, #1
418
+    cmp             x9, #16
419
+    bgt             .vl_gt_16_pixel_sse_ss_32x32
420
+    ptrue           p0.b, vl16
421
+    ld1b            {z16.b}, p0/z, x0
422
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
423
+    ld1b            {z18.b}, p0/z, x0, #2, mul vl
424
+    ld1b            {z19.b}, p0/z, x0, #3, mul vl
425
+    ld1b            {z20.b}, p0/z, x2
426
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
427
+    ld1b            {z22.b}, p0/z, x2, #2, mul vl
428
+    ld1b            {z23.b}, p0/z, x2, #3, mul vl
429
+    add             x0, x0, x1, lsl #1
430
+    add             x2, x2, x3, lsl #1
431
+    sub             z1.h, z16.h, z20.h
432
+    sub             z2.h, z17.h, z21.h
433
+    sub             z3.h, z18.h, z22.h
434
+    sub             z4.h, z19.h, z23.h
435
+    smullb          z5.s, z1.h, z1.h
436
+    smullt          z6.s, z1.h, z1.h
437
+    smlalb          z5.s, z2.h, z2.h
438
+    smlalt          z6.s, z2.h, z2.h
439
+    smlalb          z5.s, z3.h, z3.h
440
+    smlalt          z6.s, z3.h, z3.h
441
+    smlalb          z5.s, z4.h, z4.h
442
+    smlalt          z6.s, z4.h, z4.h
443
+.rept 31
444
+    ld1b            {z16.b}, p0/z, x0
445
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
446
+    ld1b            {z18.b}, p0/z, x0, #2, mul vl
447
+    ld1b            {z19.b}, p0/z, x0, #3, mul vl
448
+    ld1b            {z20.b}, p0/z, x2
449
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
450
+    ld1b            {z22.b}, p0/z, x2, #2, mul vl
451
+    ld1b            {z23.b}, p0/z, x2, #3, mul vl
452
+    add             x0, x0, x1, lsl #1
453
+    add             x2, x2, x3, lsl #1
454
+    sub             z1.h, z16.h, z20.h
455
+    sub             z2.h, z17.h, z21.h
456
+    sub             z3.h, z18.h, z22.h
457
+    sub             z4.h, z19.h, z23.h
458
+    smlalb          z5.s, z1.h, z1.h
459
+    smlalt          z6.s, z1.h, z1.h
460
+    smlalb          z5.s, z2.h, z2.h
461
+    smlalt          z6.s, z2.h, z2.h
462
+    smlalb          z5.s, z3.h, z3.h
463
+    smlalt          z6.s, z3.h, z3.h
464
+    smlalb          z5.s, z4.h, z4.h
465
+    smlalt          z6.s, z4.h, z4.h
466
+.endr
467
+    uaddv           d3, p0, z5.s
468
+    fmov            w0, s3
469
+    uaddv           d4, p0, z6.s
470
+    fmov            w1, s4
471
+    add             w0, w0, w1
472
+    ret
473
+.vl_gt_16_pixel_sse_ss_32x32:
474
+    cmp             x9, #48
475
+    bgt             .vl_gt_48_pixel_sse_ss_32x32
476
+    ptrue           p0.b, vl32
477
+    ld1b            {z16.b}, p0/z, x0
478
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
479
+    ld1b            {z20.b}, p0/z, x2
480
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
481
+    add             x0, x0, x1, lsl #1
482
+    add             x2, x2, x3, lsl #1
483
+    sub             z1.h, z16.h, z20.h
484
+    sub             z2.h, z17.h, z21.h
485
+    smullb          z5.s, z1.h, z1.h
486
+    smullt          z6.s, z1.h, z1.h
487
+    smlalb          z5.s, z2.h, z2.h
488
+    smlalt          z6.s, z2.h, z2.h
489
+.rept 31
490
+    ld1b            {z16.b}, p0/z, x0
491
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
492
+    ld1b            {z20.b}, p0/z, x2
493
+    ld1b            {z21.b}, p0/z, x2, #1, mul vl
494
+    add             x0, x0, x1, lsl #1
495
+    add             x2, x2, x3, lsl #1
496
+    sub             z1.h, z16.h, z20.h
497
+    sub             z2.h, z17.h, z21.h
498
+    smlalb          z5.s, z1.h, z1.h
499
+    smlalt          z6.s, z1.h, z1.h
500
+    smlalb          z5.s, z2.h, z2.h
501
+    smlalt          z6.s, z2.h, z2.h
502
+.endr
503
+    uaddv           d3, p0, z5.s
504
+    fmov            w0, s3
505
+    uaddv           d4, p0, z6.s
506
+    fmov            w1, s4
507
+    add             w0, w0, w1
508
+    ret
509
+.vl_gt_48_pixel_sse_ss_32x32:
510
+    ptrue           p0.b, vl64
511
+    ld1b            {z16.b}, p0/z, x0
512
+    ld1b            {z20.b}, p0/z, x2
513
+    add             x0, x0, x1, lsl #1
514
+    add             x2, x2, x3, lsl #1
515
+    sub             z1.h, z16.h, z20.h
516
+    smullb          z5.s, z1.h, z1.h
517
+    smullt          z6.s, z1.h, z1.h
518
+.rept 31
519
+    ld1b            {z16.b}, p0/z, x0
520
+    ld1b            {z20.b}, p0/z, x2
521
+    add             x0, x0, x1, lsl #1
522
+    add             x2, x2, x3, lsl #1
523
+    sub             z1.h, z16.h, z20.h
524
+    smlalb          z5.s, z1.h, z1.h
525
+    smlalt          z6.s, z1.h, z1.h
526
+.endr
527
+    uaddv           d3, p0, z5.s
528
+    fmov            w0, s3
529
+    uaddv           d4, p0, z6.s
530
+    fmov            w1, s4
531
+    add             w0, w0, w1
532
+    ret
533
+endfunc
534
+
535
+function PFX(pixel_sse_ss_64x64_sve2)
536
+    rdvl            x9, #1
537
+    cmp             x9, #16
538
+    bgt             .vl_gt_16_pixel_sse_ss_64x64
539
+    ptrue           p0.b, vl16
540
+    ld1b            {z24.b}, p0/z, x0
541
+    ld1b            {z25.b}, p0/z, x0, #1, mul vl
542
+    ld1b            {z26.b}, p0/z, x0, #2, mul vl
543
+    ld1b            {z27.b}, p0/z, x0, #3, mul vl
544
+    ld1b            {z28.b}, p0/z, x2
545
+    ld1b            {z29.b}, p0/z, x2, #1, mul vl
546
+    ld1b            {z30.b}, p0/z, x2, #2, mul vl
547
+    ld1b            {z31.b}, p0/z, x2, #3, mul vl
548
+    sub             z0.h, z24.h, z28.h
549
+    sub             z1.h, z25.h, z29.h
550
+    sub             z2.h, z26.h, z30.h
551
+    sub             z3.h, z27.h, z31.h
552
+    smullb          z5.s, z0.h, z0.h
553
+    smullt          z6.s, z0.h, z0.h
554
+    smlalb          z5.s, z1.h, z1.h
555
+    smlalt          z6.s, z1.h, z1.h
556
+    smlalb          z5.s, z2.h, z2.h
557
+    smlalt          z6.s, z2.h, z2.h
558
+    smlalb          z5.s, z3.h, z3.h
559
+    smlalt          z6.s, z3.h, z3.h
560
+    ld1b            {z24.b}, p0/z, x0, #4, mul vl
561
+    ld1b            {z25.b}, p0/z, x0, #5, mul vl
562
+    ld1b            {z26.b}, p0/z, x0, #6, mul vl
563
+    ld1b            {z27.b}, p0/z, x0, #7, mul vl
564
+    ld1b            {z28.b}, p0/z, x2, #4, mul vl
565
+    ld1b            {z29.b}, p0/z, x2, #5, mul vl
566
+    ld1b            {z30.b}, p0/z, x2, #6, mul vl
567
+    ld1b            {z31.b}, p0/z, x2, #7, mul vl
568
+    sub             z0.h, z24.h, z28.h
569
+    sub             z1.h, z25.h, z29.h
570
+    sub             z2.h, z26.h, z30.h
571
+    sub             z3.h, z27.h, z31.h
572
+    smlalb          z5.s, z0.h, z0.h
573
+    smlalt          z6.s, z0.h, z0.h
574
+    smlalb          z5.s, z1.h, z1.h
575
+    smlalt          z6.s, z1.h, z1.h
576
+    smlalb          z5.s, z2.h, z2.h
577
+    smlalt          z6.s, z2.h, z2.h
578
+    smlalb          z5.s, z3.h, z3.h
579
+    smlalt          z6.s, z3.h, z3.h
580
+    add             x0, x0, x1, lsl #1
581
+    add             x2, x2, x3, lsl #1
582
+.rept 63
583
+    ld1b            {z24.b}, p0/z, x0
584
+    ld1b            {z25.b}, p0/z, x0, #1, mul vl
585
+    ld1b            {z26.b}, p0/z, x0, #2, mul vl
586
+    ld1b            {z27.b}, p0/z, x0, #3, mul vl
587
+    ld1b            {z28.b}, p0/z, x2
588
+    ld1b            {z29.b}, p0/z, x2, #1, mul vl
589
+    ld1b            {z30.b}, p0/z, x2, #2, mul vl
590
+    ld1b            {z31.b}, p0/z, x2, #3, mul vl
591
+    sub             z0.h, z24.h, z28.h
592
+    sub             z1.h, z25.h, z29.h
593
+    sub             z2.h, z26.h, z30.h
594
+    sub             z3.h, z27.h, z31.h
595
+    smlalb          z5.s, z0.h, z0.h
596
+    smlalt          z6.s, z0.h, z0.h
597
+    smlalb          z5.s, z1.h, z1.h
598
+    smlalt          z6.s, z1.h, z1.h
599
+    smlalb          z5.s, z2.h, z2.h
600
+    smlalt          z6.s, z2.h, z2.h
601
+    smlalb          z5.s, z3.h, z3.h
602
+    smlalt          z6.s, z3.h, z3.h
603
+    ld1b            {z24.b}, p0/z, x0, #4, mul vl
604
+    ld1b            {z25.b}, p0/z, x0, #5, mul vl
605
+    ld1b            {z26.b}, p0/z, x0, #6, mul vl
606
+    ld1b            {z27.b}, p0/z, x0, #7, mul vl
607
+    ld1b            {z28.b}, p0/z, x2, #4, mul vl
608
+    ld1b            {z29.b}, p0/z, x2, #5, mul vl
609
+    ld1b            {z30.b}, p0/z, x2, #6, mul vl
610
+    ld1b            {z31.b}, p0/z, x2, #7, mul vl
611
+    sub             z0.h, z24.h, z28.h
612
+    sub             z1.h, z25.h, z29.h
613
+    sub             z2.h, z26.h, z30.h
614
+    sub             z3.h, z27.h, z31.h
615
+    smlalb          z5.s, z0.h, z0.h
616
+    smlalt          z6.s, z0.h, z0.h
617
+    smlalb          z5.s, z1.h, z1.h
618
+    smlalt          z6.s, z1.h, z1.h
619
+    smlalb          z5.s, z2.h, z2.h
620
+    smlalt          z6.s, z2.h, z2.h
621
+    smlalb          z5.s, z3.h, z3.h
622
+    smlalt          z6.s, z3.h, z3.h
623
+    add             x0, x0, x1, lsl #1
624
+    add             x2, x2, x3, lsl #1
625
+.endr
626
+    uaddv           d3, p0, z5.s
627
+    fmov            w0, s3
628
+    uaddv           d4, p0, z6.s
629
+    fmov            w1, s4
630
+    add             w0, w0, w1
631
+    ret
632
+.vl_gt_16_pixel_sse_ss_64x64:
633
+    cmp             x9, #48
634
+    bgt             .vl_gt_48_pixel_sse_ss_64x64
635
+    ptrue           p0.b, vl32
636
+    ld1b            {z24.b}, p0/z, x0
637
+    ld1b            {z25.b}, p0/z, x0, #1, mul vl
638
+    ld1b            {z28.b}, p0/z, x2
639
+    ld1b            {z29.b}, p0/z, x2, #1, mul vl
640
+    sub             z0.h, z24.h, z28.h
641
+    sub             z1.h, z25.h, z29.h
642
+    smullb          z5.s, z0.h, z0.h
643
+    smullt          z6.s, z0.h, z0.h
644
+    smlalb          z5.s, z1.h, z1.h
645
+    smlalt          z6.s, z1.h, z1.h
646
+    ld1b            {z24.b}, p0/z, x0, #1, mul vl
647
+    ld1b            {z25.b}, p0/z, x0, #2, mul vl
648
+    ld1b            {z28.b}, p0/z, x2, #1, mul vl
649
+    ld1b            {z29.b}, p0/z, x2, #2, mul vl
650
+    sub             z0.h, z24.h, z28.h
651
+    sub             z1.h, z25.h, z29.h
652
+    smlalb          z5.s, z0.h, z0.h
653
+    smlalt          z6.s, z0.h, z0.h
654
+    smlalb          z5.s, z1.h, z1.h
655
+    smlalt          z6.s, z1.h, z1.h
656
+    add             x0, x0, x1, lsl #1
657
+    add             x2, x2, x3, lsl #1
658
+.rept 63
659
+    ld1b            {z24.b}, p0/z, x0
660
+    ld1b            {z25.b}, p0/z, x0, #1, mul vl
661
+    ld1b            {z28.b}, p0/z, x2
662
+    ld1b            {z29.b}, p0/z, x2, #1, mul vl
663
+    sub             z0.h, z24.h, z28.h
664
+    sub             z1.h, z25.h, z29.h
665
+    smlalb          z5.s, z0.h, z0.h
666
+    smlalt          z6.s, z0.h, z0.h
667
+    smlalb          z5.s, z1.h, z1.h
668
+    smlalt          z6.s, z1.h, z1.h
669
+    ld1b            {z24.b}, p0/z, x0, #1, mul vl
670
+    ld1b            {z25.b}, p0/z, x0, #2, mul vl
671
+    ld1b            {z28.b}, p0/z, x2, #1, mul vl
672
+    ld1b            {z29.b}, p0/z, x2, #2, mul vl
673
+    sub             z0.h, z24.h, z28.h
674
+    sub             z1.h, z25.h, z29.h
675
+    smlalb          z5.s, z0.h, z0.h
676
+    smlalt          z6.s, z0.h, z0.h
677
+    smlalb          z5.s, z1.h, z1.h
678
+    smlalt          z6.s, z1.h, z1.h
679
+    add             x0, x0, x1, lsl #1
680
+    add             x2, x2, x3, lsl #1
681
+.endr
682
+    uaddv           d3, p0, z5.s
683
+    fmov            w0, s3
684
+    uaddv           d4, p0, z6.s
685
+    fmov            w1, s4
686
+    add             w0, w0, w1
687
+    ret
688
+.vl_gt_48_pixel_sse_ss_64x64:
689
+    cmp             x9, #112
690
+    bgt             .vl_gt_112_pixel_sse_ss_64x64
691
+    ptrue           p0.b, vl64
692
+    ld1b            {z24.b}, p0/z, x0
693
+    ld1b            {z28.b}, p0/z, x2
694
+    sub             z0.h, z24.h, z28.h
695
+    smullb          z5.s, z0.h, z0.h
696
+    smullt          z6.s, z0.h, z0.h
697
+    ld1b            {z24.b}, p0/z, x0, #1, mul vl
698
+    ld1b            {z28.b}, p0/z, x2, #1, mul vl
699
+    sub             z0.h, z24.h, z28.h
700
+    smlalb          z5.s, z0.h, z0.h
701
+    smlalt          z6.s, z0.h, z0.h
702
+    add             x0, x0, x1, lsl #1
703
+    add             x2, x2, x3, lsl #1
704
+.rept 63
705
+    ld1b            {z24.b}, p0/z, x0
706
+    ld1b            {z28.b}, p0/z, x2
707
+    sub             z0.h, z24.h, z28.h
708
+    smlalb          z5.s, z0.h, z0.h
709
+    smlalt          z6.s, z0.h, z0.h
710
+    ld1b            {z24.b}, p0/z, x0, #1, mul vl
711
+    ld1b            {z28.b}, p0/z, x2, #1, mul vl
712
+    sub             z0.h, z24.h, z28.h
713
+    smlalb          z5.s, z0.h, z0.h
714
+    smlalt          z6.s, z0.h, z0.h
715
+    add             x0, x0, x1, lsl #1
716
+    add             x2, x2, x3, lsl #1
717
+.endr
718
+    uaddv           d3, p0, z5.s
719
+    fmov            w0, s3
720
+    uaddv           d4, p0, z6.s
721
+    fmov            w1, s4
722
+    add             w0, w0, w1
723
+    ret
724
+.vl_gt_112_pixel_sse_ss_64x64:
725
+    ptrue           p0.b, vl128
726
+    ld1b            {z24.b}, p0/z, x0
727
+    ld1b            {z28.b}, p0/z, x2
728
+    sub             z0.h, z24.h, z28.h
729
+    smullb          z5.s, z0.h, z0.h
730
+    smullt          z6.s, z0.h, z0.h
731
+    add             x0, x0, x1, lsl #1
732
+    add             x2, x2, x3, lsl #1
733
+.rept 63
734
+    ld1b            {z24.b}, p0/z, x0
735
+    ld1b            {z28.b}, p0/z, x2
736
+    sub             z0.h, z24.h, z28.h
737
+    smlalb          z5.s, z0.h, z0.h
738
+    smlalt          z6.s, z0.h, z0.h
739
+    add             x0, x0, x1, lsl #1
740
+    add             x2, x2, x3, lsl #1
741
+.endr
742
+    uaddv           d3, p0, z5.s
743
+    fmov            w0, s3
744
+    uaddv           d4, p0, z6.s
745
+    fmov            w1, s4
746
+    add             w0, w0, w1
747
+    ret
748
+endfunc
749
+
750
+function PFX(pixel_ssd_s_4x4_sve2)
751
+    ptrue           p0.b, vl8
752
+    ld1b            {z16.b}, p0/z, x0
753
+    add             x0, x0, x1, lsl #1
754
+    smullb          z0.s, z16.h, z16.h
755
+    smlalt          z0.s, z16.h, z16.h
756
+.rept 3
757
+    ld1b            {z16.b}, p0/z, x0
758
+    add             x0, x0, x1, lsl #1
759
+    smlalb          z0.s, z16.h, z16.h
760
+    smlalt          z0.s, z16.h, z16.h
761
+.endr
762
+    uaddv           d3, p0, z0.s
763
+    fmov            w0, s3
764
+    ret
765
+endfunc
766
+
767
+function PFX(pixel_ssd_s_8x8_sve2)
768
+    ptrue           p0.b, vl16
769
+    ld1b            {z16.b}, p0/z, x0
770
+    add             x0, x0, x1, lsl #1
771
+    smullb          z0.s, z16.h, z16.h
772
+    smlalt          z0.s, z16.h, z16.h
773
+.rept 7
774
+    ld1b            {z16.b}, p0/z, x0
775
+    add             x0, x0, x1, lsl #1
776
+    smlalb          z0.s, z16.h, z16.h
777
+    smlalt          z0.s, z16.h, z16.h
778
+.endr
779
+    uaddv           d3, p0, z0.s
780
+    fmov            w0, s3
781
+    ret
782
+endfunc
783
+
784
+function PFX(pixel_ssd_s_16x16_sve2)
785
+    rdvl            x9, #1
786
+    cmp             x9, #16
787
+    bgt             .vl_gt_16_pixel_ssd_s_16x16
788
+    add             x1, x1, x1
789
+    mov             w12, #4
790
+    movi            v0.16b, #0
791
+    movi            v1.16b, #0
792
+.loop_ssd_s_16_sve2:
793
+    sub             w12, w12, #1
794
+.rept 2
795
+    ld1             {v4.16b,v5.16b}, x0, x1
796
+    ld1             {v6.16b,v7.16b}, x0, x1
797
+    smlal           v0.4s, v4.4h, v4.4h
798
+    smlal2          v1.4s, v4.8h, v4.8h
799
+    smlal           v0.4s, v5.4h, v5.4h
800
+    smlal2          v1.4s, v5.8h, v5.8h
801
+    smlal           v0.4s, v6.4h, v6.4h
802
+    smlal2          v1.4s, v6.8h, v6.8h
803
+    smlal           v0.4s, v7.4h, v7.4h
804
+    smlal2          v1.4s, v7.8h, v7.8h
805
+.endr
806
+    cbnz            w12, .loop_ssd_s_16_sve2
807
+    add             v0.4s, v0.4s, v1.4s
808
+    ret_v0_w0
809
+.vl_gt_16_pixel_ssd_s_16x16:
810
+    ptrue           p0.b, vl32
811
+    ld1b            {z16.b}, p0/z, x0
812
+    add             x0, x0, x1, lsl #1
813
+    smullb          z0.s, z16.h, z16.h
814
+    smlalt          z0.s, z16.h, z16.h
815
+.rept 15
816
+    ld1b            {z16.b}, p0/z, x0
817
+    add             x0, x0, x1, lsl #1
818
+    smlalb          z0.s, z16.h, z16.h
819
+    smlalt          z0.s, z16.h, z16.h
820
+.endr
821
+    uaddv           d3, p0, z0.s
822
+    fmov            w0, s3
823
+    ret
824
+endfunc
825
+
826
+function PFX(pixel_ssd_s_32x32_sve2)
827
+    rdvl            x9, #1
828
+    cmp             x9, #16
829
+    bgt             .vl_gt_16_pixel_ssd_s_32x32
830
+    add             x1, x1, x1
831
+    mov             w12, #8
832
+    movi            v0.16b, #0
833
+    movi            v1.16b, #0
834
+.loop_ssd_s_32:
835
+    sub             w12, w12, #1
836
+.rept 4
837
+    ld1             {v4.16b-v7.16b}, x0, x1
838
+    smlal           v0.4s, v4.4h, v4.4h
839
+    smlal2          v1.4s, v4.8h, v4.8h
840
+    smlal           v0.4s, v5.4h, v5.4h
841
+    smlal2          v1.4s, v5.8h, v5.8h
842
+    smlal           v0.4s, v6.4h, v6.4h
843
+    smlal2          v1.4s, v6.8h, v6.8h
844
+    smlal           v0.4s, v7.4h, v7.4h
845
+    smlal2          v1.4s, v7.8h, v7.8h
846
+.endr
847
+    cbnz            w12, .loop_ssd_s_32
848
+    add             v0.4s, v0.4s, v1.4s
849
+    ret_v0_w0
850
+.vl_gt_16_pixel_ssd_s_32x32:
851
+    cmp             x9, #48
852
+    bgt             .vl_gt_48_pixel_ssd_s_32x32
853
+    ptrue           p0.b, vl32
854
+    ld1b            {z16.b}, p0/z, x0
855
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
856
+    add             x0, x0, x1, lsl #1
857
+    smullb          z0.s, z16.h, z16.h
858
+    smlalt          z0.s, z16.h, z16.h
859
+    smlalb          z0.s, z17.h, z17.h
860
+    smlalt          z0.s, z17.h, z17.h
861
+.rept 31
862
+    ld1b            {z16.b}, p0/z, x0
863
+    ld1b            {z17.b}, p0/z, x0, #1, mul vl
864
+    add             x0, x0, x1, lsl #1
865
+    smlalb          z0.s, z16.h, z16.h
866
+    smlalt          z0.s, z16.h, z16.h
867
+    smlalb          z0.s, z17.h, z17.h
868
+    smlalt          z0.s, z17.h, z17.h
869
+.endr
870
+    uaddv           d3, p0, z0.s
871
+    fmov            w0, s3
872
+    ret
873
+.vl_gt_48_pixel_ssd_s_32x32:
874
+    ptrue           p0.b, vl64
875
+    ld1b            {z16.b}, p0/z, x0
876
+    add             x0, x0, x1, lsl #1
877
+    smullb          z0.s, z16.h, z16.h
878
+    smlalt          z0.s, z16.h, z16.h
879
+.rept 31
880
+    ld1b            {z16.b}, p0/z, x0
881
+    add             x0, x0, x1, lsl #1
882
+    smlalb          z0.s, z16.h, z16.h
883
+    smlalt          z0.s, z16.h, z16.h
884
+.endr
885
+    uaddv           d3, p0, z0.s
886
+    fmov            w0, s3
887
+    ret
888
+endfunc
889
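
[Reviewer note] The pixel_sse_* and pixel_ssd_s_* kernels in the two files around this point all compute a plain sum of squared values (ssd_s) or squared differences (sse_pp/sse_ss) over a fixed block; the SVE2 variants only vary how many lanes each load covers, chosen from the runtime vector length read with rdvl. A minimal scalar C++ sketch of what a pixel_sse_ss kernel returns is below; the function name and the int16_t inputs are assumptions for illustration, not code from this patch.

    #include <cstdint>

    // Scalar reference for a sum of squared differences between two 16-bit
    // residual blocks; strides are in elements, matching the asm's
    // "add x0, x0, x1, lsl #1" row step (stride * sizeof(int16_t) bytes).
    static int sse_ss_ref(const int16_t* a, intptr_t strideA,
                          const int16_t* b, intptr_t strideB,
                          int width, int height)
    {
        int64_t sum = 0;
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int d = a[x] - b[x];
                sum += d * d;   // same products the smlalb/smlalt pairs accumulate
            }
            a += strideA;
            b += strideB;
        }
        return (int)sum;        // the asm folds its two partial accumulators into w0
    }
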
x265_3.6.tar.gz/source/common/aarch64/ssd-a.S Added
478
 
1
@@ -0,0 +1,476 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2021 MulticoreWare, Inc
4
+ *
5
+ * Authors: Sebastian Pop <spop@amazon.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com.
23
+ *****************************************************************************/
24
+
25
+#include "asm.S"
26
+#include "ssd-a-common.S"
27
+
28
+#ifdef __APPLE__
29
+.section __RODATA,__rodata
30
+#else
31
+.section .rodata
32
+#endif
33
+
34
+.align 4
35
+
36
+.text
37
+
38
+function PFX(pixel_sse_pp_4x4_neon)
39
+    ld1             {v16.s}0, x0, x1
40
+    ld1             {v17.s}0, x2, x3
41
+    ld1             {v18.s}0, x0, x1
42
+    ld1             {v19.s}0, x2, x3
43
+    ld1             {v20.s}0, x0, x1
44
+    ld1             {v21.s}0, x2, x3
45
+    ld1             {v22.s}0, x0, x1
46
+    ld1             {v23.s}0, x2, x3
47
+
48
+    usubl           v1.8h, v16.8b, v17.8b
49
+    usubl           v2.8h, v18.8b, v19.8b
50
+    usubl           v3.8h, v20.8b, v21.8b
51
+    usubl           v4.8h, v22.8b, v23.8b
52
+
53
+    smull           v0.4s, v1.4h, v1.4h
54
+    smlal           v0.4s, v2.4h, v2.4h
55
+    smlal           v0.4s, v3.4h, v3.4h
56
+    smlal           v0.4s, v4.4h, v4.4h
57
+    ret_v0_w0
58
+endfunc
59
+
60
+function PFX(pixel_sse_pp_4x8_neon)
61
+    ld1             {v16.s}0, x0, x1
62
+    ld1             {v17.s}0, x2, x3
63
+    usubl           v1.8h, v16.8b, v17.8b
64
+    ld1             {v16.s}0, x0, x1
65
+    ld1             {v17.s}0, x2, x3
66
+    smull           v0.4s, v1.4h, v1.4h
67
+.rept 6
68
+    usubl           v1.8h, v16.8b, v17.8b
69
+    ld1             {v16.s}0, x0, x1
70
+    smlal           v0.4s, v1.4h, v1.4h
71
+    ld1             {v17.s}0, x2, x3
72
+.endr
73
+    usubl           v1.8h, v16.8b, v17.8b
74
+    smlal           v0.4s, v1.4h, v1.4h
75
+    ret_v0_w0
76
+endfunc
77
+
78
+function PFX(pixel_sse_pp_8x8_neon)
79
+    ld1             {v16.8b}, x0, x1
80
+    ld1             {v17.8b}, x2, x3
81
+    usubl           v1.8h, v16.8b, v17.8b
82
+    ld1             {v16.8b}, x0, x1
83
+    smull           v0.4s, v1.4h, v1.4h
84
+    smlal2          v0.4s, v1.8h, v1.8h
85
+    ld1             {v17.8b}, x2, x3
86
+
87
+.rept 6
88
+    usubl           v1.8h, v16.8b, v17.8b
89
+    ld1             {v16.8b}, x0, x1
90
+    smlal           v0.4s, v1.4h, v1.4h
91
+    smlal2          v0.4s, v1.8h, v1.8h
92
+    ld1             {v17.8b}, x2, x3
93
+.endr
94
+    usubl           v1.8h, v16.8b, v17.8b
95
+    smlal           v0.4s, v1.4h, v1.4h
96
+    smlal2          v0.4s, v1.8h, v1.8h
97
+    ret_v0_w0
98
+endfunc
99
+
100
+function PFX(pixel_sse_pp_8x16_neon)
101
+    ld1             {v16.8b}, x0, x1
102
+    ld1             {v17.8b}, x2, x3
103
+    usubl           v1.8h, v16.8b, v17.8b
104
+    ld1             {v16.8b}, x0, x1
105
+    smull           v0.4s, v1.4h, v1.4h
106
+    smlal2          v0.4s, v1.8h, v1.8h
107
+    ld1             {v17.8b}, x2, x3
108
+
109
+.rept 14
110
+    usubl           v1.8h, v16.8b, v17.8b
111
+    ld1             {v16.8b}, x0, x1
112
+    smlal           v0.4s, v1.4h, v1.4h
113
+    smlal2          v0.4s, v1.8h, v1.8h
114
+    ld1             {v17.8b}, x2, x3
115
+.endr
116
+    usubl           v1.8h, v16.8b, v17.8b
117
+    smlal           v0.4s, v1.4h, v1.4h
118
+    smlal2          v0.4s, v1.8h, v1.8h
119
+    ret_v0_w0
120
+endfunc
121
+
122
+.macro sse_pp_16xN h
123
+function PFX(pixel_sse_pp_16x\h\()_neon)
124
+    ld1             {v16.16b}, x0, x1
125
+    ld1             {v17.16b}, x2, x3
126
+    usubl           v1.8h, v16.8b, v17.8b
127
+    usubl2          v2.8h, v16.16b, v17.16b
128
+    ld1             {v16.16b}, x0, x1
129
+    ld1             {v17.16b}, x2, x3
130
+    smull           v0.4s, v1.4h, v1.4h
131
+    smlal2          v0.4s, v1.8h, v1.8h
132
+    smlal           v0.4s, v2.4h, v2.4h
133
+    smlal2          v0.4s, v2.8h, v2.8h
134
+.rept \h - 2
135
+    usubl           v1.8h, v16.8b, v17.8b
136
+    usubl2          v2.8h, v16.16b, v17.16b
137
+    ld1             {v16.16b}, x0, x1
138
+    smlal           v0.4s, v1.4h, v1.4h
139
+    smlal2          v0.4s, v1.8h, v1.8h
140
+    ld1             {v17.16b}, x2, x3
141
+    smlal           v0.4s, v2.4h, v2.4h
142
+    smlal2          v0.4s, v2.8h, v2.8h
143
+.endr
144
+    usubl           v1.8h, v16.8b, v17.8b
145
+    usubl2          v2.8h, v16.16b, v17.16b
146
+    smlal           v0.4s, v1.4h, v1.4h
147
+    smlal2          v0.4s, v1.8h, v1.8h
148
+    smlal           v0.4s, v2.4h, v2.4h
149
+    smlal2          v0.4s, v2.8h, v2.8h
150
+    ret_v0_w0
151
+endfunc
152
+.endm
153
+
154
+sse_pp_16xN 16
155
+sse_pp_16xN 32
156
+
157
+function PFX(pixel_sse_pp_32x32_neon)
158
+    mov             w12, #8
159
+    movi            v0.16b, #0
160
+    movi            v1.16b, #0
161
+.loop_sse_pp_32:
162
+    sub             w12, w12, #1
163
+.rept 4
164
+    ld1             {v16.16b,v17.16b}, x0, x1
165
+    ld1             {v18.16b,v19.16b}, x2, x3
166
+    usubl           v2.8h, v16.8b, v18.8b
167
+    usubl2          v3.8h, v16.16b, v18.16b
168
+    usubl           v4.8h, v17.8b, v19.8b
169
+    usubl2          v5.8h, v17.16b, v19.16b
170
+    smlal           v0.4s, v2.4h, v2.4h
171
+    smlal2          v1.4s, v2.8h, v2.8h
172
+    smlal           v0.4s, v3.4h, v3.4h
173
+    smlal2          v1.4s, v3.8h, v3.8h
174
+    smlal           v0.4s, v4.4h, v4.4h
175
+    smlal2          v1.4s, v4.8h, v4.8h
176
+    smlal           v0.4s, v5.4h, v5.4h
177
+    smlal2          v1.4s, v5.8h, v5.8h
178
+.endr
179
+    cbnz            w12, .loop_sse_pp_32
180
+    add             v0.4s, v0.4s, v1.4s
181
+    ret_v0_w0
182
+endfunc
183
+
184
+function PFX(pixel_sse_pp_32x64_neon)
185
+    mov             w12, #16
186
+    movi            v0.16b, #0
187
+    movi            v1.16b, #0
188
+.loop_sse_pp_32x64:
189
+    sub             w12, w12, #1
190
+.rept 4
191
+    ld1             {v16.16b,v17.16b}, x0, x1
192
+    ld1             {v18.16b,v19.16b}, x2, x3
193
+    usubl           v2.8h, v16.8b, v18.8b
194
+    usubl2          v3.8h, v16.16b, v18.16b
195
+    usubl           v4.8h, v17.8b, v19.8b
196
+    usubl2          v5.8h, v17.16b, v19.16b
197
+    smlal           v0.4s, v2.4h, v2.4h
198
+    smlal2          v1.4s, v2.8h, v2.8h
199
+    smlal           v0.4s, v3.4h, v3.4h
200
+    smlal2          v1.4s, v3.8h, v3.8h
201
+    smlal           v0.4s, v4.4h, v4.4h
202
+    smlal2          v1.4s, v4.8h, v4.8h
203
+    smlal           v0.4s, v5.4h, v5.4h
204
+    smlal2          v1.4s, v5.8h, v5.8h
205
+.endr
206
+    cbnz            w12, .loop_sse_pp_32x64
207
+    add             v0.4s, v0.4s, v1.4s
208
+    ret_v0_w0
209
+endfunc
210
+
211
+function PFX(pixel_sse_pp_64x64_neon)
212
+    mov             w12, #16
213
+    movi            v0.16b, #0
214
+    movi            v1.16b, #0
215
+
216
+.loop_sse_pp_64:
217
+    sub             w12, w12, #1
218
+.rept 4
219
+    ld1             {v16.16b-v19.16b}, x0, x1
220
+    ld1             {v20.16b-v23.16b}, x2, x3
221
+
222
+    usubl           v2.8h, v16.8b, v20.8b
223
+    usubl2          v3.8h, v16.16b, v20.16b
224
+    usubl           v4.8h, v17.8b, v21.8b
225
+    usubl2          v5.8h, v17.16b, v21.16b
226
+    smlal           v0.4s, v2.4h, v2.4h
227
+    smlal2          v1.4s, v2.8h, v2.8h
228
+    smlal           v0.4s, v3.4h, v3.4h
229
+    smlal2          v1.4s, v3.8h, v3.8h
230
+    smlal           v0.4s, v4.4h, v4.4h
231
+    smlal2          v1.4s, v4.8h, v4.8h
232
+    smlal           v0.4s, v5.4h, v5.4h
233
+    smlal2          v1.4s, v5.8h, v5.8h
234
+
235
+    usubl           v2.8h, v18.8b, v22.8b
236
+    usubl2          v3.8h, v18.16b, v22.16b
237
+    usubl           v4.8h, v19.8b, v23.8b
238
+    usubl2          v5.8h, v19.16b, v23.16b
239
+    smlal           v0.4s, v2.4h, v2.4h
240
+    smlal2          v1.4s, v2.8h, v2.8h
241
+    smlal           v0.4s, v3.4h, v3.4h
242
+    smlal2          v1.4s, v3.8h, v3.8h
243
+    smlal           v0.4s, v4.4h, v4.4h
244
+    smlal2          v1.4s, v4.8h, v4.8h
245
+    smlal           v0.4s, v5.4h, v5.4h
246
+    smlal2          v1.4s, v5.8h, v5.8h
247
+.endr
248
+    cbnz            w12, .loop_sse_pp_64
249
+    add             v0.4s, v0.4s, v1.4s
250
+    ret_v0_w0
251
+endfunc
252
+
253
+function PFX(pixel_sse_ss_4x4_neon)
254
+    add             x1, x1, x1
255
+    add             x3, x3, x3
256
+    ld1             {v16.8b}, x0, x1
257
+    ld1             {v17.8b}, x2, x3
258
+    sub             v2.4h, v16.4h, v17.4h
259
+    ld1             {v16.8b}, x0, x1
260
+    ld1             {v17.8b}, x2, x3
261
+    smull           v0.4s, v2.4h, v2.4h
262
+    sub             v2.4h, v16.4h, v17.4h
263
+    ld1             {v16.8b}, x0, x1
264
+    ld1             {v17.8b}, x2, x3
265
+    smlal           v0.4s, v2.4h, v2.4h
266
+    sub             v2.4h, v16.4h, v17.4h
267
+    ld1             {v16.8b}, x0, x1
268
+    smlal           v0.4s, v2.4h, v2.4h
269
+    ld1             {v17.8b}, x2, x3
270
+    sub             v2.4h, v16.4h, v17.4h
271
+    smlal           v0.4s, v2.4h, v2.4h
272
+    ret_v0_w0
273
+endfunc
274
+
275
+function PFX(pixel_sse_ss_8x8_neon)
276
+    add             x1, x1, x1
277
+    add             x3, x3, x3
278
+    ld1             {v16.16b}, x0, x1
279
+    ld1             {v17.16b}, x2, x3
280
+    sub             v2.8h, v16.8h, v17.8h
281
+    ld1             {v16.16b}, x0, x1
282
+    ld1             {v17.16b}, x2, x3
283
+    smull           v0.4s, v2.4h, v2.4h
284
+    smull2          v1.4s, v2.8h, v2.8h
285
+    sub             v2.8h, v16.8h, v17.8h
286
+.rept 6
287
+    ld1             {v16.16b}, x0, x1
288
+    ld1             {v17.16b}, x2, x3
289
+    smlal           v0.4s, v2.4h, v2.4h
290
+    smlal2          v1.4s, v2.8h, v2.8h
291
+    sub             v2.8h, v16.8h, v17.8h
292
+.endr
293
+    smlal           v0.4s, v2.4h, v2.4h
294
+    smlal2          v1.4s, v2.8h, v2.8h
295
+    add             v0.4s, v0.4s, v1.4s
296
+    ret_v0_w0
297
+endfunc
298
+
299
+function PFX(pixel_sse_ss_16x16_neon)
300
+    add             x1, x1, x1
301
+    add             x3, x3, x3
302
+    mov             w12, #4
303
+    movi            v0.16b, #0
304
+    movi            v1.16b, #0
305
+.loop_sse_ss_16:
306
+    sub             w12, w12, #1
307
+.rept 4
308
+    ld1             {v16.16b, v17.16b}, x0, x1
309
+    ld1             {v18.16b, v19.16b}, x2, x3
310
+    sub             v2.8h, v16.8h, v18.8h
311
+    sub             v3.8h, v17.8h, v19.8h
312
+    smlal           v0.4s, v2.4h, v2.4h
313
+    smlal2          v1.4s, v2.8h, v2.8h
314
+    smlal           v0.4s, v3.4h, v3.4h
315
+    smlal2          v1.4s, v3.8h, v3.8h
316
+.endr
317
+    cbnz            w12, .loop_sse_ss_16
318
+    add             v0.4s, v0.4s, v1.4s
319
+    ret_v0_w0
320
+endfunc
321
+
322
+function PFX(pixel_sse_ss_32x32_neon)
323
+    add             x1, x1, x1
324
+    add             x3, x3, x3
325
+
326
+    mov             w12, #8
327
+    movi            v0.16b, #0
328
+    movi            v1.16b, #0
329
+.loop_sse_ss_32:
330
+    sub             w12, w12, #1
331
+.rept 4
332
+    ld1             {v16.16b-v19.16b}, x0, x1
333
+    ld1             {v20.16b-v23.16b}, x2, x3
334
+    sub             v2.8h, v16.8h, v20.8h
335
+    sub             v3.8h, v17.8h, v21.8h
336
+    sub             v4.8h, v18.8h, v22.8h
337
+    sub             v5.8h, v19.8h, v23.8h
338
+    smlal           v0.4s, v2.4h, v2.4h
339
+    smlal2          v1.4s, v2.8h, v2.8h
340
+    smlal           v0.4s, v3.4h, v3.4h
341
+    smlal2          v1.4s, v3.8h, v3.8h
342
+    smlal           v0.4s, v4.4h, v4.4h
343
+    smlal2          v1.4s, v4.8h, v4.8h
344
+    smlal           v0.4s, v5.4h, v5.4h
345
+    smlal2          v1.4s, v5.8h, v5.8h
346
+.endr
347
+    cbnz            w12, .loop_sse_ss_32
348
+    add             v0.4s, v0.4s, v1.4s
349
+    ret_v0_w0
350
+endfunc
351
+
352
+function PFX(pixel_sse_ss_64x64_neon)
353
+    add             x1, x1, x1
354
+    add             x3, x3, x3
355
+    sub             x1, x1, #64
356
+    sub             x3, x3, #64
357
+
358
+    mov             w12, #32
359
+    movi            v0.16b, #0
360
+    movi            v1.16b, #0
361
+.loop_sse_ss_64:
362
+    sub             w12, w12, #1
363
+.rept 2
364
+    ld1             {v16.16b-v19.16b}, x0, #64
365
+    ld1             {v20.16b-v23.16b}, x2, #64
366
+    sub             v2.8h, v16.8h, v20.8h
367
+    sub             v3.8h, v17.8h, v21.8h
368
+    sub             v4.8h, v18.8h, v22.8h
369
+    sub             v5.8h, v19.8h, v23.8h
370
+    ld1             {v16.16b-v19.16b}, x0, x1
371
+    ld1             {v20.16b-v23.16b}, x2, x3
372
+    smlal           v0.4s, v2.4h, v2.4h
373
+    smlal2          v1.4s, v2.8h, v2.8h
374
+    smlal           v0.4s, v3.4h, v3.4h
375
+    smlal2          v1.4s, v3.8h, v3.8h
376
+    smlal           v0.4s, v4.4h, v4.4h
377
+    smlal2          v1.4s, v4.8h, v4.8h
378
+    smlal           v0.4s, v5.4h, v5.4h
379
+    smlal2          v1.4s, v5.8h, v5.8h
380
+    sub             v2.8h, v16.8h, v20.8h
381
+    sub             v3.8h, v17.8h, v21.8h
382
+    sub             v4.8h, v18.8h, v22.8h
383
+    sub             v5.8h, v19.8h, v23.8h
384
+    smlal           v0.4s, v2.4h, v2.4h
385
+    smlal2          v1.4s, v2.8h, v2.8h
386
+    smlal           v0.4s, v3.4h, v3.4h
387
+    smlal2          v1.4s, v3.8h, v3.8h
388
+    smlal           v0.4s, v4.4h, v4.4h
389
+    smlal2          v1.4s, v4.8h, v4.8h
390
+    smlal           v0.4s, v5.4h, v5.4h
391
+    smlal2          v1.4s, v5.8h, v5.8h
392
+.endr
393
+    cbnz            w12, .loop_sse_ss_64
394
+    add             v0.4s, v0.4s, v1.4s
395
+    ret_v0_w0
396
+endfunc
397
+
398
+function PFX(pixel_ssd_s_4x4_neon)
399
+    add             x1, x1, x1
400
+    ld1             {v4.8b}, x0, x1
401
+    ld1             {v5.8b}, x0, x1
402
+    ld1             {v6.8b}, x0, x1
403
+    ld1             {v7.8b}, x0
404
+    smull           v0.4s, v4.4h, v4.4h
405
+    smull           v1.4s, v5.4h, v5.4h
406
+    smlal           v0.4s, v6.4h, v6.4h
407
+    smlal           v1.4s, v7.4h, v7.4h
408
+    add             v0.4s, v0.4s, v1.4s
409
+    ret_v0_w0
410
+endfunc
411
+
412
+function PFX(pixel_ssd_s_8x8_neon)
413
+    add             x1, x1, x1
414
+    ld1             {v4.16b}, x0, x1
415
+    ld1             {v5.16b}, x0, x1
416
+    smull           v0.4s, v4.4h, v4.4h
417
+    smull2          v1.4s, v4.8h, v4.8h
418
+    smlal           v0.4s, v5.4h, v5.4h
419
+    smlal2          v1.4s, v5.8h, v5.8h
420
+.rept 3
421
+    ld1             {v4.16b}, x0, x1
422
+    ld1             {v5.16b}, x0, x1
423
+    smlal           v0.4s, v4.4h, v4.4h
424
+    smlal2          v1.4s, v4.8h, v4.8h
425
+    smlal           v0.4s, v5.4h, v5.4h
426
+    smlal2          v1.4s, v5.8h, v5.8h
427
+.endr
428
+    add             v0.4s, v0.4s, v1.4s
429
+    ret_v0_w0
430
+endfunc
431
+
432
+function PFX(pixel_ssd_s_16x16_neon)
433
+    add             x1, x1, x1
434
+    mov             w12, #4
435
+    movi            v0.16b, #0
436
+    movi            v1.16b, #0
437
+.loop_ssd_s_16:
438
+    sub             w12, w12, #1
439
+.rept 2
440
+    ld1             {v4.16b,v5.16b}, x0, x1
441
+    ld1             {v6.16b,v7.16b}, x0, x1
442
+    smlal           v0.4s, v4.4h, v4.4h
443
+    smlal2          v1.4s, v4.8h, v4.8h
444
+    smlal           v0.4s, v5.4h, v5.4h
445
+    smlal2          v1.4s, v5.8h, v5.8h
446
+    smlal           v0.4s, v6.4h, v6.4h
447
+    smlal2          v1.4s, v6.8h, v6.8h
448
+    smlal           v0.4s, v7.4h, v7.4h
449
+    smlal2          v1.4s, v7.8h, v7.8h
450
+.endr
451
+    cbnz            w12, .loop_ssd_s_16
452
+    add             v0.4s, v0.4s, v1.4s
453
+    ret_v0_w0
454
+endfunc
455
+
456
+function PFX(pixel_ssd_s_32x32_neon)
457
+    add             x1, x1, x1
458
+    mov             w12, #8
459
+    movi            v0.16b, #0
460
+    movi            v1.16b, #0
461
+.loop_ssd_s_32:
462
+    sub             w12, w12, #1
463
+.rept 4
464
+    ld1             {v4.16b-v7.16b}, x0, x1
465
+    smlal           v0.4s, v4.4h, v4.4h
466
+    smlal2          v1.4s, v4.8h, v4.8h
467
+    smlal           v0.4s, v5.4h, v5.4h
468
+    smlal2          v1.4s, v5.8h, v5.8h
469
+    smlal           v0.4s, v6.4h, v6.4h
470
+    smlal2          v1.4s, v6.8h, v6.8h
471
+    smlal           v0.4s, v7.4h, v7.4h
472
+    smlal2          v1.4s, v7.8h, v7.8h
473
+.endr
474
+    cbnz            w12, .loop_ssd_s_32
475
+    add             v0.4s, v0.4s, v1.4s
476
+    ret_v0_w0
477
+endfunc
478
x265_3.5.tar.gz/source/common/common.h -> x265_3.6.tar.gz/source/common/common.h Changed
51
 
1
@@ -130,7 +130,6 @@
2
 typedef uint64_t pixel4;
3
 typedef int64_t  ssum2_t;
4
 #define SHIFT_TO_BITPLANE 9
5
-#define HISTOGRAM_BINS 1024
6
 #else
7
 typedef uint8_t  pixel;
8
 typedef uint16_t sum_t;
9
@@ -138,7 +137,6 @@
10
 typedef uint32_t pixel4;
11
 typedef int32_t  ssum2_t; // Signed sum
12
 #define SHIFT_TO_BITPLANE 7
13
-#define HISTOGRAM_BINS 256
14
 #endif // if HIGH_BIT_DEPTH
15
 
16
 #if X265_DEPTH < 10
17
@@ -162,6 +160,8 @@
18
 
19
 #define MIN_QPSCALE     0.21249999999999999
20
 #define MAX_MAX_QPSCALE 615.46574234477100
21
+#define FRAME_BRIGHTNESS_THRESHOLD  50.0 // Min % of pixels in a frame, that are above BRIGHTNESS_THRESHOLD for it to be considered a bright frame
22
+#define FRAME_EDGE_THRESHOLD  10.0 // Min % of edge pixels in a frame, for it to be considered to have high edge density
23
 
24
 
25
 template<typename T>
26
@@ -340,6 +340,9 @@
27
 #define FILLER_OVERHEAD (NAL_TYPE_OVERHEAD + START_CODE_OVERHEAD + 1)
28
 
29
 #define MAX_NUM_DYN_REFINE          (NUM_CU_DEPTH * X265_REFINE_INTER_LEVELS)
30
+#define X265_BYTE 8
31
+
32
+#define MAX_MCSTF_TEMPORAL_WINDOW_LENGTH 8
33
 
34
 namespace X265_NS {
35
 
36
@@ -434,6 +437,14 @@
37
 #define  x265_unlink(fileName) unlink(fileName)
38
 #define  x265_rename(oldName, newName) rename(oldName, newName)
39
 #endif
40
+/* Close a file */
41
+#define  x265_fclose(file) if (file != NULL) fclose(file); file=NULL;
42
+#define x265_fread(val, size, readSize, fileOffset,errorMessage)\
43
+    if (fread(val, size, readSize, fileOffset) != readSize)\
44
+    {\
45
+        x265_log(NULL, X265_LOG_ERROR, errorMessage); \
46
+        return; \
47
+    }
48
 int      x265_exp2fix8(double x);
49
 
50
 double   x265_ssim2dB(double ssim);
51
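
[Reviewer note] The new x265_fread macro above expands to an early return from the calling function when fread() comes back short, so it is only usable inside functions returning void; x265_fclose guards against NULL and resets the handle. A hypothetical caller (the function name, buffer and message are made up for illustration, assuming common.h is included) would look like:

    // Hypothetical use of the x265_fread / x265_fclose helpers.
    // On a short read the macro logs the message and returns from loadStats().
    static void loadStats(FILE* statsFile, uint16_t* propagateCost, size_t count)
    {
        x265_fread(propagateCost, sizeof(uint16_t), count, statsFile,
                   "Error reading propagate cost data\n");
        x265_fclose(statsFile);   // fclose() only if non-NULL, then set to NULL
    }
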
x265_3.5.tar.gz/source/common/cpu.cpp -> x265_3.6.tar.gz/source/common/cpu.cpp Changed
58
 
1
@@ -7,6 +7,8 @@
2
  *          Steve Borho <steve@borho.org>
3
  *          Hongbin Liu <liuhongbin1@huawei.com>
4
  *          Yimeng Su <yimeng.su@huawei.com>
5
+ *          Josh Dekker <josh@itanimul.li>
6
+ *          Jean-Baptiste Kempf <jb@videolan.org>
7
  *
8
  * This program is free software; you can redistribute it and/or modify
9
  * it under the terms of the GNU General Public License as published by
10
@@ -105,6 +107,14 @@
11
     { "NEON",            X265_CPU_NEON },
12
     { "FastNeonMRC",     X265_CPU_FAST_NEON_MRC },
13
 
14
+#elif X265_ARCH_ARM64
15
+    { "NEON",            X265_CPU_NEON },
16
+#if defined(HAVE_SVE)
17
+    { "SVE",            X265_CPU_SVE },
18
+#endif
19
+#if defined(HAVE_SVE2)
20
+    { "SVE2",            X265_CPU_SVE2 },
21
+#endif
22
 #elif X265_ARCH_POWER8
23
     { "Altivec",         X265_CPU_ALTIVEC },
24
 
25
@@ -369,12 +379,30 @@
26
     flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
27
 #endif
28
     // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc)
29
-#elif X265_ARCH_ARM64
30
-    flags |= X265_CPU_NEON;
31
 #endif // if HAVE_ARMV6
32
     return flags;
33
 }
34
 
35
+#elif X265_ARCH_ARM64
36
+
37
+uint32_t cpu_detect(bool benableavx512)
38
+{
39
+    int flags = 0;
40
+
41
+    #if defined(HAVE_SVE2)
42
+         flags |= X265_CPU_SVE2;
43
+         flags |= X265_CPU_SVE;
44
+         flags |= X265_CPU_NEON;
45
+    #elif defined(HAVE_SVE)
46
+         flags |= X265_CPU_SVE;
47
+         flags |= X265_CPU_NEON;
48
+    #elif HAVE_NEON
49
+         flags |= X265_CPU_NEON;
50
+    #endif
51
+        
52
+    return flags;
53
+}
54
+
55
 #elif X265_ARCH_POWER8
56
 
57
 uint32_t cpu_detect(bool benableavx512)
58
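
[Reviewer note] On AArch64 the new cpu_detect() derives its answer purely from compile-time flags (HAVE_SVE2 / HAVE_SVE / HAVE_NEON) rather than probing the CPU at run time, so an SVE2 build is assumed to run on SVE2 hardware. A dispatch on the returned mask would look roughly like the sketch below; setupARMPrimitives is a placeholder name and the X265_CPU_* bits come from x265.h, nothing here is added by the patch.

    // Illustrative dispatch on the mask returned by the ARM64 cpu_detect().
    static void setupARMPrimitives(uint32_t cpuMask)
    {
        if (cpuMask & X265_CPU_SVE2)
        {
            // an SVE2 build also reports X265_CPU_SVE and X265_CPU_NEON,
            // so the most specific kernels can simply be installed last
        }
        else if (cpuMask & X265_CPU_SVE)  { /* SVE-only kernels */ }
        else if (cpuMask & X265_CPU_NEON) { /* baseline NEON kernels */ }
    }
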
x265_3.5.tar.gz/source/common/frame.cpp -> x265_3.6.tar.gz/source/common/frame.cpp Changed
102
 
1
@@ -64,12 +64,40 @@
2
     m_edgeBitPlane = NULL;
3
     m_edgeBitPic = NULL;
4
     m_isInsideWindow = 0;
5
+
6
+    // mcstf
7
+    m_isSubSampled = NULL;
8
+    m_mcstf = NULL;
9
+    m_refPicCnt[0] = 0;
10
+    m_refPicCnt[1] = 0;
11
+    m_nextMCSTF = NULL;
12
+    m_prevMCSTF = NULL;
13
+
14
+    m_tempLayer = 0;
15
+    m_sameLayerRefPic = false;
16
 }
17
 
18
 bool Frame::create(x265_param *param, float* quantOffsets)
19
 {
20
     m_fencPic = new PicYuv;
21
     m_param = param;
22
+
23
+    if (m_param->bEnableTemporalFilter)
24
+    {
25
+        m_mcstf = new TemporalFilter;
26
+        m_mcstf->init(param);
27
+
28
+        m_fencPicSubsampled2 = new PicYuv;
29
+        m_fencPicSubsampled4 = new PicYuv;
30
+
31
+        if (!m_fencPicSubsampled2->createScaledPicYUV(param, 2))
32
+            return false;
33
+        if (!m_fencPicSubsampled4->createScaledPicYUV(param, 4))
34
+            return false;
35
+
36
+        CHECKED_MALLOC_ZERO(m_isSubSampled, int, 1);
37
+    }
38
+
39
     CHECKED_MALLOC_ZERO(m_rcData, RcStats, 1);
40
 
41
     if (param->bCTUInfo)
42
@@ -151,6 +179,22 @@
43
     return false;
44
 }
45
 
46
+bool Frame::createSubSample()
47
+{
48
+
49
+    m_fencPicSubsampled2 = new PicYuv;
50
+    m_fencPicSubsampled4 = new PicYuv;
51
+
52
+    if (!m_fencPicSubsampled2->createScaledPicYUV(m_param, 2))
53
+        return false;
54
+    if (!m_fencPicSubsampled4->createScaledPicYUV(m_param, 4))
55
+        return false;
56
+    CHECKED_MALLOC_ZERO(m_isSubSampled, int, 1);
57
+    return true;
58
+fail:
59
+    return false;
60
+}
61
+
62
 bool Frame::allocEncodeData(x265_param *param, const SPS& sps)
63
 {
64
     m_encData = new FrameData;
65
@@ -207,6 +251,26 @@
66
         m_fencPic = NULL;
67
     }
68
 
69
+    if (m_param->bEnableTemporalFilter)
70
+    {
71
+
72
+        if (m_fencPicSubsampled2)
73
+        {
74
+            m_fencPicSubsampled2->destroy();
75
+            delete m_fencPicSubsampled2;
76
+            m_fencPicSubsampled2 = NULL;
77
+        }
78
+
79
+        if (m_fencPicSubsampled4)
80
+        {
81
+            m_fencPicSubsampled4->destroy();
82
+            delete m_fencPicSubsampled4;
83
+            m_fencPicSubsampled4 = NULL;
84
+        }
85
+        delete m_mcstf;
86
+        X265_FREE(m_isSubSampled);
87
+    }
88
+
89
     if (m_reconPic)
90
     {
91
         m_reconPic->destroy();
92
@@ -267,7 +331,8 @@
93
         X265_FREE(m_addOnPrevChange);
94
         m_addOnPrevChange = NULL;
95
     }
96
-    m_lowres.destroy();
97
+
98
+    m_lowres.destroy(m_param);
99
     X265_FREE(m_rcData);
100
 
101
     if (m_param->bDynamicRefine)
102
x265_3.5.tar.gz/source/common/frame.h -> x265_3.6.tar.gz/source/common/frame.h Changed
60
 
1
@@ -28,6 +28,7 @@
2
 #include "common.h"
3
 #include "lowres.h"
4
 #include "threading.h"
5
+#include "temporalfilter.h"
6
 
7
 namespace X265_NS {
8
 // private namespace
9
@@ -70,6 +71,7 @@
10
     double   count[4];
11
     double   offset[4];
12
     double   bufferFillFinal;
13
+    int64_t  currentSatd;
14
 };
15
 
16
 class Frame
17
@@ -83,8 +85,12 @@
18
 
19
     /* Data associated with x265_picture */
20
     PicYuv*                m_fencPic;
21
+    PicYuv*                m_fencPicSubsampled2;
22
+    PicYuv*                m_fencPicSubsampled4;
23
+
24
     int                    m_poc;
25
     int                    m_encodeOrder;
26
+    int                    m_gopOffset;
27
     int64_t                m_pts;                // user provided presentation time stamp
28
     int64_t                m_reorderedPts;
29
     int64_t                m_dts;
30
@@ -132,6 +138,13 @@
31
     bool                   m_classifyFrame;
32
     int                    m_fieldNum;
33
 
34
+    /*MCSTF*/
35
+    TemporalFilter*        m_mcstf;
36
+    int                    m_refPicCnt[2];
37
+    Frame*                 m_nextMCSTF;           // PicList doubly linked list pointers
38
+    Frame*                 m_prevMCSTF;
39
+    int*                   m_isSubSampled;
40
+
41
     /* aq-mode 4 : Gaussian, edge and theta frames for edge information */
42
     pixel*                 m_edgePic;
43
     pixel*                 m_gaussianPic;
44
@@ -143,9 +156,15 @@
45
 
46
     int                    m_isInsideWindow;
47
 
48
+    /*Frame's temporal layer info*/
49
+    uint8_t                m_tempLayer;
50
+    int8_t                 m_gopId;
51
+    bool                   m_sameLayerRefPic;
52
+
53
     Frame();
54
 
55
     bool create(x265_param *param, float* quantOffsets);
56
+    bool createSubSample();
57
     bool allocEncodeData(x265_param *param, const SPS& sps);
58
     void reinit(const SPS& sps);
59
     void destroy();
60
x265_3.5.tar.gz/source/common/framedata.cpp -> x265_3.6.tar.gz/source/common/framedata.cpp Changed
10
 
1
@@ -62,7 +62,7 @@
2
     }
3
     else
4
         return false;
5
-    CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame);
6
+    CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame + 1);
7
     CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight);
8
     reinit(sps);
9
     
10
x265_3.5.tar.gz/source/common/lowres.cpp -> x265_3.6.tar.gz/source/common/lowres.cpp Changed
154
 
1
@@ -28,6 +28,28 @@
2
 
3
 using namespace X265_NS;
4
 
5
+/*
6
+ * Down Sample input picture
7
+ */
8
+static
9
+void frame_lowres_core(const pixel* src0, pixel* dst0,
10
+    intptr_t src_stride, intptr_t dst_stride, int width, int height)
11
+{
12
+    for (int y = 0; y < height; y++)
13
+    {
14
+        const pixel* src1 = src0 + src_stride;
15
+        for (int x = 0; x < width; x++)
16
+        {
17
+            // slower than naive bilinear, but matches asm
18
+#define FILTER(a, b, c, d) ((((a + b + 1) >> 1) + ((c + d + 1) >> 1) + 1) >> 1)
19
+            dst0[x] = FILTER(src0[2 * x], src1[2 * x], src0[2 * x + 1], src1[2 * x + 1]);
20
+#undef FILTER
21
+        }
22
+        src0 += src_stride * 2;
23
+        dst0 += dst_stride;
24
+    }
25
+}
26
+
27
 bool PicQPAdaptationLayer::create(uint32_t width, uint32_t height, uint32_t partWidth, uint32_t partHeight, uint32_t numAQPartInWidthExt, uint32_t numAQPartInHeightExt)
28
 {
29
     aqPartWidth = partWidth;
30
@@ -73,7 +95,7 @@
31
 
32
     size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY);
33
     size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX;
34
-    if (!!param->rc.aqMode || !!param->rc.hevcAq || !!param->bAQMotion)
35
+    if (!!param->rc.aqMode || !!param->rc.hevcAq || !!param->bAQMotion || !!param->bEnableWeightedPred || !!param->bEnableWeightedBiPred)
36
     {
37
         CHECKED_MALLOC_ZERO(qpAqOffset, double, cuCountFullRes);
38
         CHECKED_MALLOC_ZERO(invQscaleFactor, int, cuCountFullRes);
39
@@ -190,13 +212,45 @@
40
         }
41
     }
42
 
43
+    if (param->bHistBasedSceneCut)
44
+    {
45
+        quarterSampleLowResWidth = widthFullRes / 4;
46
+        quarterSampleLowResHeight = heightFullRes / 4;
47
+        quarterSampleLowResOriginX = 16;
48
+        quarterSampleLowResOriginY = 16;
49
+        quarterSampleLowResStrideY = quarterSampleLowResWidth + 2 * quarterSampleLowResOriginY;
50
+
51
+        size_t quarterSampleLowResPlanesize = quarterSampleLowResStrideY * (quarterSampleLowResHeight + 2 * quarterSampleLowResOriginX);
52
+        /* allocate quarter sampled lowres buffers */
53
+        CHECKED_MALLOC_ZERO(quarterSampleLowResBuffer, pixel, quarterSampleLowResPlanesize);
54
+
55
+        // Allocate memory for Histograms
56
+        picHistogram = X265_MALLOC(uint32_t***, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t***));
57
+        picHistogram[0] = X265_MALLOC(uint32_t**, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
58
+        for (uint32_t wd = 1; wd < NUMBER_OF_SEGMENTS_IN_WIDTH; wd++) {
59
+            picHistogram[wd] = picHistogram[0] + wd * NUMBER_OF_SEGMENTS_IN_HEIGHT;
60
+        }
61
+
62
+        for (uint32_t regionInPictureWidthIndex = 0; regionInPictureWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; regionInPictureWidthIndex++)
63
+        {
64
+            for (uint32_t regionInPictureHeightIndex = 0; regionInPictureHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; regionInPictureHeightIndex++)
65
+            {
66
+                picHistogram[regionInPictureWidthIndex][regionInPictureHeightIndex] = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH *sizeof(uint32_t*));
67
+                picHistogram[regionInPictureWidthIndex][regionInPictureHeightIndex][0] = X265_MALLOC(uint32_t, 3 * HISTOGRAM_NUMBER_OF_BINS * sizeof(uint32_t));
68
+                for (uint32_t wd = 1; wd < 3; wd++) {
69
+                    picHistogram[regionInPictureWidthIndex][regionInPictureHeightIndex][wd] = picHistogram[regionInPictureWidthIndex][regionInPictureHeightIndex][0] + wd * HISTOGRAM_NUMBER_OF_BINS;
70
+                }
71
+            }
72
+        }
73
+    }
74
+
75
     return true;
76
 
77
 fail:
78
     return false;
79
 }
80
 
81
-void Lowres::destroy()
82
+void Lowres::destroy(x265_param* param)
83
 {
84
     X265_FREE(buffer[0]);
85
     if(bEnableHME)
86
@@ -234,7 +288,8 @@
87
     X265_FREE(invQscaleFactor8x8);
88
     X265_FREE(edgeInclined);
89
     X265_FREE(qpAqMotionOffset);
90
-    X265_FREE(blockVariance);
91
+    if (param->bDynamicRefine || param->bEnableFades)
92
+        X265_FREE(blockVariance);
93
     if (maxAQDepth > 0)
94
     {
95
         for (uint32_t d = 0; d < 4; d++)
96
@@ -254,6 +309,29 @@
97
 
98
         delete pAQLayer;
99
     }
100
+
101
+    // Histograms
102
+    if (param->bHistBasedSceneCut)
103
+    {
104
+        for (uint32_t segmentInFrameWidthIdx = 0; segmentInFrameWidthIdx < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIdx++)
105
+        {
106
+            if (picHistogram[segmentInFrameWidthIdx])
107
+            {
108
+                for (uint32_t segmentInFrameHeightIdx = 0; segmentInFrameHeightIdx < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIdx++)
109
+                {
110
+                    if (picHistogram[segmentInFrameWidthIdx][segmentInFrameHeightIdx])
111
+                        X265_FREE(picHistogram[segmentInFrameWidthIdx][segmentInFrameHeightIdx][0]);
112
+                    X265_FREE(picHistogram[segmentInFrameWidthIdx][segmentInFrameHeightIdx]);
113
+                }
114
+            }
115
+        }
116
+        if (picHistogram)
117
+            X265_FREE(picHistogram[0]);
118
+        X265_FREE(picHistogram);
119
+
120
+        X265_FREE(quarterSampleLowResBuffer);
121
+
122
+    }
123
 }
124
 // (re) initialize lowres state
125
 void Lowres::init(PicYuv *origPic, int poc)
126
@@ -266,10 +344,6 @@
127
     indB = 0;
128
     memset(costEst, -1, sizeof(costEst));
129
     memset(weightedCostDelta, 0, sizeof(weightedCostDelta));
130
-    interPCostPercDiff = 0.0;
131
-    intraCostPercDiff = 0.0;
132
-    m_bIsMaxThres = false;
133
-    m_bIsHardScenecut = false;
134
 
135
     if (qpAqOffset && invQscaleFactor)
136
         memset(costEstAq, -1, sizeof(costEstAq));
137
@@ -314,4 +388,16 @@
138
     }
139
 
140
     fpelPlane[0] = lowresPlane[0];
141
+
142
+    if (origPic->m_param->bHistBasedSceneCut)
143
+    {
144
+        // Quarter Sampled Input Picture Formation
145
+        // TO DO: Replace with ASM function
146
+        frame_lowres_core(
147
+            lowresPlane[0],
148
+            quarterSampleLowResBuffer + quarterSampleLowResOriginX + quarterSampleLowResOriginY * quarterSampleLowResStrideY,
149
+            lumaStride,
150
+            quarterSampleLowResStrideY,
151
+            widthFullRes / 4, heightFullRes / 4);
152
+    }
153
 }
154
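
[Reviewer note] The frame_lowres_core() helper added at the top of this file builds the quarter-resolution plane that the histogram-based scene-cut analysis works on: every output pixel is a rounded average of a 2x2 input block, taken as two pairwise averages so the result bit-matches the existing downscale assembly. A small standalone check of that rounding (test values are arbitrary):

    #include <cassert>

    // Same rounding as the FILTER macro in frame_lowres_core(): average each
    // vertical pair first, then average the two intermediate results.
    static int filter2x2(int a, int b, int c, int d)
    {
        return (((a + b + 1) >> 1) + ((c + d + 1) >> 1) + 1) >> 1;
    }

    int main()
    {
        assert(filter2x2(10, 12, 11, 13) == 12);     // (11 + 12 + 1) >> 1
        assert(filter2x2(0, 0, 0, 0) == 0);
        assert(filter2x2(255, 255, 255, 255) == 255);
        return 0;
    }
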
x265_3.5.tar.gz/source/common/lowres.h -> x265_3.6.tar.gz/source/common/lowres.h Changed
73
 
1
@@ -32,6 +32,10 @@
2
 namespace X265_NS {
3
 // private namespace
4
 
5
+#define HISTOGRAM_NUMBER_OF_BINS         256
6
+#define NUMBER_OF_SEGMENTS_IN_WIDTH      4
7
+#define NUMBER_OF_SEGMENTS_IN_HEIGHT     4
8
+
9
 struct ReferencePlanes
10
 {
11
     ReferencePlanes() { memset(this, 0, sizeof(ReferencePlanes)); }
12
@@ -171,6 +175,7 @@
13
 
14
     int    frameNum;         // Presentation frame number
15
     int    sliceType;        // Slice type decided by lookahead
16
+    int    sliceTypeReq;     // Slice type required as per the QP file
17
     int    width;            // width of lowres frame in pixels
18
     int    lines;            // height of lowres frame in pixel lines
19
     int    leadingBframes;   // number of leading B frames for P or I
20
@@ -214,13 +219,13 @@
21
     double*   qpAqOffset;      // AQ QP offset values for each 16x16 CU
22
     double*   qpCuTreeOffset;  // cuTree QP offset values for each 16x16 CU
23
     double*   qpAqMotionOffset;
24
-    int*      invQscaleFactor; // qScale values for qp Aq Offsets
25
+    int*      invQscaleFactor;    // qScale values for qp Aq Offsets
26
     int*      invQscaleFactor8x8; // temporary buffer for qg-size 8
27
     uint32_t* blockVariance;
28
     uint64_t  wp_ssd[3];     // This is different than SSDY, this is sum(pixel^2) - sum(pixel)^2 for entire frame
29
     uint64_t  wp_sum[3];
30
     double    frameVariance;
31
-    int* edgeInclined;
32
+    int*      edgeInclined;
33
 
34
 
35
     /* cutree intermediate data */
36
@@ -230,18 +235,30 @@
37
     uint32_t heightFullRes;
38
     uint32_t m_maxCUSize;
39
     uint32_t m_qgSize;
40
-    
41
+
42
     uint16_t* propagateCost;
43
     double    weightedCostDelta[X265_BFRAME_MAX + 2];
44
     ReferencePlanes weightedRef[X265_BFRAME_MAX + 2];
45
+
46
     /* For hist-based scenecut */
47
-    bool   m_bIsMaxThres;
48
-    double interPCostPercDiff;
49
-    double intraCostPercDiff;
50
-    bool   m_bIsHardScenecut;
51
+    int          quarterSampleLowResWidth;     // width of 1/4 lowres frame in pixels
52
+    int          quarterSampleLowResHeight;    // height of 1/4 lowres frame in pixels
53
+    int          quarterSampleLowResStrideY;
54
+    int          quarterSampleLowResOriginX;
55
+    int          quarterSampleLowResOriginY;
56
+    pixel       *quarterSampleLowResBuffer;
57
+    bool         bHistScenecutAnalyzed;
58
+
59
+    uint16_t     picAvgVariance;
60
+    uint16_t     picAvgVarianceCb;
61
+    uint16_t     picAvgVarianceCr;
62
+
63
+    uint32_t ****picHistogram;
64
+    uint64_t     averageIntensityPerSegment[NUMBER_OF_SEGMENTS_IN_WIDTH][NUMBER_OF_SEGMENTS_IN_HEIGHT][3];
65
+    uint8_t      averageIntensity[3];
66
 
67
     bool create(x265_param* param, PicYuv *origPic, uint32_t qgSize);
68
-    void destroy();
69
+    void destroy(x265_param* param);
70
     void init(PicYuv *origPic, int poc);
71
 };
72
 }
73
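
[Reviewer note] The picHistogram member declared above is a four-level table indexed as [segment-x][segment-y][plane][bin]: 4x4 picture segments, 3 colour planes and 256 bins per plane (NUMBER_OF_SEGMENTS_IN_WIDTH/HEIGHT and HISTOGRAM_NUMBER_OF_BINS). Lowres::create() allocates the outer levels as flat arrays and patches up the row pointers. A hypothetical accumulation over one segment's luma plane, assuming that layout (everything except the member names is illustrative):

    #include <cstdint>

    // Fill one segment's luma histogram; plane index 0 is taken to be luma.
    static void addSegmentLuma(uint32_t**** picHistogram, int segX, int segY,
                               const uint8_t* pix, intptr_t stride, int w, int h)
    {
        uint32_t* lumaBins = picHistogram[segX][segY][0];  // HISTOGRAM_NUMBER_OF_BINS entries
        for (int y = 0; y < h; y++, pix += stride)
            for (int x = 0; x < w; x++)
                lumaBins[pix[x]]++;
    }
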
x265_3.5.tar.gz/source/common/mv.h -> x265_3.6.tar.gz/source/common/mv.h Changed
10
 
1
@@ -105,6 +105,8 @@
2
     {
3
         return x >= _min.x && x <= _max.x && y >= _min.y && y <= _max.y;
4
     }
5
+
6
+    void set(int32_t _x, int32_t _y) { x = _x; y = _y; }
7
 };
8
 }
9
 
10
x265_3.5.tar.gz/source/common/param.cpp -> x265_3.6.tar.gz/source/common/param.cpp Changed
668
 
1
@@ -145,6 +145,8 @@
2
     param->bAnnexB = 1;
3
     param->bRepeatHeaders = 0;
4
     param->bEnableAccessUnitDelimiters = 0;
5
+    param->bEnableEndOfBitstream = 0;
6
+    param->bEnableEndOfSequence = 0;
7
     param->bEmitHRDSEI = 0;
8
     param->bEmitInfoSEI = 1;
9
     param->bEmitHDRSEI = 0; /*Deprecated*/
10
@@ -163,12 +165,12 @@
11
     param->keyframeMax = 250;
12
     param->gopLookahead = 0;
13
     param->bOpenGOP = 1;
14
+   param->craNal = 0;
15
     param->bframes = 4;
16
     param->lookaheadDepth = 20;
17
     param->bFrameAdaptive = X265_B_ADAPT_TRELLIS;
18
     param->bBPyramid = 1;
19
     param->scenecutThreshold = 40; /* Magic number pulled in from x264 */
20
-    param->edgeTransitionThreshold = 0.03;
21
     param->bHistBasedSceneCut = 0;
22
     param->lookaheadSlices = 8;
23
     param->lookaheadThreads = 0;
24
@@ -179,12 +181,20 @@
25
     param->bEnableHRDConcatFlag = 0;
26
     param->bEnableFades = 0;
27
     param->bEnableSceneCutAwareQp = 0;
28
-    param->fwdScenecutWindow = 500;
29
-    param->fwdRefQpDelta = 5;
30
-    param->fwdNonRefQpDelta = param->fwdRefQpDelta + (SLICE_TYPE_DELTA * param->fwdRefQpDelta);
31
-    param->bwdScenecutWindow = 100;
32
-    param->bwdRefQpDelta = -1;
33
-    param->bwdNonRefQpDelta = -1;
34
+    param->fwdMaxScenecutWindow = 1200;
35
+    param->bwdMaxScenecutWindow = 600;
36
+    for (int i = 0; i < 6; i++)
37
+    {
38
+        int deltas[6] = { 5, 4, 3, 2, 1, 0 };
39
+
40
+        param->fwdScenecutWindow[i] = 200;
41
+        param->fwdRefQpDelta[i] = deltas[i];
42
+        param->fwdNonRefQpDelta[i] = param->fwdRefQpDelta[i] + (SLICE_TYPE_DELTA * param->fwdRefQpDelta[i]);
43
+
44
+        param->bwdScenecutWindow[i] = 100;
45
+        param->bwdRefQpDelta[i] = -1;
46
+        param->bwdNonRefQpDelta[i] = -1;
47
+    }
48
 
49
     /* Intra Coding Tools */
50
     param->bEnableConstrainedIntra = 0;
51
@@ -278,7 +288,10 @@
52
     param->rc.rfConstantMin = 0;
53
     param->rc.bStatRead = 0;
54
     param->rc.bStatWrite = 0;
55
+    param->rc.dataShareMode = X265_SHARE_MODE_FILE;
56
     param->rc.statFileName = NULL;
57
+    param->rc.sharedMemName = NULL;
58
+    param->rc.bEncFocusedFramesOnly = 0;
59
     param->rc.complexityBlur = 20;
60
     param->rc.qblur = 0.5;
61
     param->rc.zoneCount = 0;
62
@@ -321,6 +334,7 @@
63
     param->maxLuma = PIXEL_MAX;
64
     param->log2MaxPocLsb = 8;
65
     param->maxSlices = 1;
66
+    param->videoSignalTypePreset = NULL;
67
 
68
     /*Conformance window*/
69
     param->confWinRightOffset = 0;
70
@@ -373,10 +387,17 @@
71
     param->bEnableSvtHevc = 0;
72
     param->svtHevcParam = NULL;
73
 
74
+    /* MCSTF */
75
+    param->bEnableTemporalFilter = 0;
76
+    param->temporalFilterStrength = 0.95;
77
+
78
 #ifdef SVT_HEVC
79
     param->svtHevcParam = svtParam;
80
     svt_param_default(param);
81
 #endif
82
+    /* Film grain characteristics model filename */
83
+    param->filmGrain = NULL;
84
+    param->bEnableSBRC = 0;
85
 }
86
 
87
 int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
88
@@ -666,6 +687,46 @@
89
 #define atof(str) x265_atof(str, bError)
90
 #define atobool(str) (x265_atobool(str, bError))
91
 
92
+int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value)
93
+{
94
+    bool bError = false;
95
+    char nameBuf[64];
96
+    if (!name)
97
+        return X265_PARAM_BAD_NAME;
98
+    // skip -- prefix if provided
99
+    if (name[0] == '-' && name[1] == '-')
100
+        name += 2;
101
+    // s/_/-/g
102
+    if (strlen(name) + 1 < sizeof(nameBuf) && strchr(name, '_'))
103
+    {
104
+        char *c;
105
+        strcpy(nameBuf, name);
106
+        while ((c = strchr(nameBuf, '_')) != 0)
107
+            *c = '-';
108
+        name = nameBuf;
109
+    }
110
+    if (!value)
111
+        value = "true";
112
+    else if (value[0] == '=')
113
+        value++;
114
+#define OPT(STR) else if (!strcmp(name, STR))
115
+    if (0);
116
+    OPT("scenecut-aware-qp") p->bEnableSceneCutAwareQp = x265_atoi(value, bError);
117
+    OPT("masking-strength") bError = parseMaskingStrength(p, value);
118
+    else
119
+        return X265_PARAM_BAD_NAME;
120
+#undef OPT
121
+    return bError ? X265_PARAM_BAD_VALUE : 0;
122
+}
123
+
124
+
125
+/* internal versions of string-to-int with additional error checking */
126
+#undef atoi
127
+#undef atof
128
+#define atoi(str) x265_atoi(str, bError)
129
+#define atof(str) x265_atof(str, bError)
130
+#define atobool(str) (x265_atobool(str, bError))
131
+
132
 int x265_zone_param_parse(x265_param* p, const char* name, const char* value)
133
 {
134
     bool bError = false;
135
@@ -949,10 +1010,9 @@
136
        {
137
            bError = false;
138
            p->scenecutThreshold = atoi(value);
139
-           p->bHistBasedSceneCut = 0;
140
        }
141
     }
142
-    OPT("temporal-layers") p->bEnableTemporalSubLayers = atobool(value);
143
+    OPT("temporal-layers") p->bEnableTemporalSubLayers = atoi(value);
144
     OPT("keyint") p->keyframeMax = atoi(value);
145
     OPT("min-keyint") p->keyframeMin = atoi(value);
146
     OPT("rc-lookahead") p->lookaheadDepth = atoi(value);
147
@@ -1184,6 +1244,7 @@
148
         int pass = x265_clip3(0, 3, atoi(value));
149
         p->rc.bStatWrite = pass & 1;
150
         p->rc.bStatRead = pass & 2;
151
+        p->rc.dataShareMode = X265_SHARE_MODE_FILE;
152
     }
153
     OPT("stats") p->rc.statFileName = strdup(value);
154
     OPT("scaling-list") p->scalingLists = strdup(value);
155
@@ -1216,21 +1277,7 @@
156
         OPT("opt-ref-list-length-pps") p->bOptRefListLengthPPS = atobool(value);
157
         OPT("multi-pass-opt-rps") p->bMultiPassOptRPS = atobool(value);
158
         OPT("scenecut-bias") p->scenecutBias = atof(value);
159
-        OPT("hist-scenecut")
160
-        {
161
-            p->bHistBasedSceneCut = atobool(value);
162
-            if (bError)
163
-            {
164
-                bError = false;
165
-                p->bHistBasedSceneCut = 0;
166
-            }
167
-            if (p->bHistBasedSceneCut)
168
-            {
169
-                bError = false;
170
-                p->scenecutThreshold = 0;
171
-            }
172
-        }
173
-        OPT("hist-threshold") p->edgeTransitionThreshold = atof(value);
174
+        OPT("hist-scenecut") p->bHistBasedSceneCut = atobool(value);
175
         OPT("rskip-edge-threshold") p->edgeVarThreshold = atoi(value)/100.0f;
176
         OPT("lookahead-threads") p->lookaheadThreads = atoi(value);
177
         OPT("opt-cu-delta-qp") p->bOptCUDeltaQP = atobool(value);
178
@@ -1238,6 +1285,7 @@
179
         OPT("multi-pass-opt-distortion") p->analysisMultiPassDistortion = atobool(value);
180
         OPT("aq-motion") p->bAQMotion = atobool(value);
181
         OPT("dynamic-rd") p->dynamicRd = atof(value);
182
+       OPT("cra-nal") p->craNal = atobool(value);
183
         OPT("analysis-reuse-level")
184
         {
185
             p->analysisReuseLevel = atoi(value);
186
@@ -1348,71 +1396,7 @@
187
         }
188
         OPT("fades") p->bEnableFades = atobool(value);
189
         OPT("scenecut-aware-qp") p->bEnableSceneCutAwareQp = atoi(value);
190
-        OPT("masking-strength")
191
-        {
192
-            int window1;
193
-            double refQpDelta1, nonRefQpDelta1;
194
-
195
-            if (p->bEnableSceneCutAwareQp == FORWARD)
196
-            {
197
-                if (3 == sscanf(value, "%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1))
198
-                {
199
-                    if (window1 > 0)
200
-                        p->fwdScenecutWindow = window1;
201
-                    if (refQpDelta1 > 0)
202
-                        p->fwdRefQpDelta = refQpDelta1;
203
-                    if (nonRefQpDelta1 > 0)
204
-                        p->fwdNonRefQpDelta = nonRefQpDelta1;
205
-                }
206
-                else
207
-                {
208
-                    x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
209
-                    bError = true;
210
-                }
211
-            }
212
-            else if (p->bEnableSceneCutAwareQp == BACKWARD)
213
-            {
214
-                if (3 == sscanf(value, "%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1))
215
-                {
216
-                    if (window1 > 0)
217
-                        p->bwdScenecutWindow = window1;
218
-                    if (refQpDelta1 > 0)
219
-                        p->bwdRefQpDelta = refQpDelta1;
220
-                    if (nonRefQpDelta1 > 0)
221
-                        p->bwdNonRefQpDelta = nonRefQpDelta1;
222
-                }
223
-                else
224
-                {
225
-                    x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
226
-                    bError = true;
227
-                }
228
-            }
229
-            else if (p->bEnableSceneCutAwareQp == BI_DIRECTIONAL)
230
-            {
231
-                int window2;
232
-                double refQpDelta2, nonRefQpDelta2;
233
-                if (6 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1, &window2, &refQpDelta2, &nonRefQpDelta2))
234
-                {
235
-                    if (window1 > 0)
236
-                        p->fwdScenecutWindow = window1;
237
-                    if (refQpDelta1 > 0)
238
-                        p->fwdRefQpDelta = refQpDelta1;
239
-                    if (nonRefQpDelta1 > 0)
240
-                        p->fwdNonRefQpDelta = nonRefQpDelta1;
241
-                    if (window2 > 0)
242
-                        p->bwdScenecutWindow = window2;
243
-                    if (refQpDelta2 > 0)
244
-                        p->bwdRefQpDelta = refQpDelta2;
245
-                    if (nonRefQpDelta2 > 0)
246
-                        p->bwdNonRefQpDelta = nonRefQpDelta2;
247
-                }
248
-                else
249
-                {
250
-                    x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
251
-                    bError = true;
252
-                }
253
-            }
254
-        }
255
+        OPT("masking-strength") bError |= parseMaskingStrength(p, value);
256
         OPT("field") p->bField = atobool( value );
257
         OPT("cll") p->bEmitCLL = atobool(value);
258
         OPT("frame-dup") p->bEnableFrameDuplication = atobool(value);
259
@@ -1446,6 +1430,13 @@
260
         OPT("vbv-live-multi-pass") p->bliveVBV2pass = atobool(value);
261
         OPT("min-vbv-fullness") p->minVbvFullness = atof(value);
262
         OPT("max-vbv-fullness") p->maxVbvFullness = atof(value);
263
+        OPT("video-signal-type-preset") p->videoSignalTypePreset = strdup(value);
264
+        OPT("eob") p->bEnableEndOfBitstream = atobool(value);
265
+        OPT("eos") p->bEnableEndOfSequence = atobool(value);
266
+        /* Film grain characterstics model filename */
267
+        OPT("film-grain") p->filmGrain = (char* )value;
268
+        OPT("mcstf") p->bEnableTemporalFilter = atobool(value);
269
+        OPT("sbrc") p->bEnableSBRC = atobool(value);
270
         else
271
             return X265_PARAM_BAD_NAME;
272
     }
273
@@ -1761,8 +1752,6 @@
274
           "scenecutThreshold must be greater than 0");
275
     CHECK(param->scenecutBias < 0 || 100 < param->scenecutBias,
276
             "scenecut-bias must be between 0 and 100");
277
-    CHECK(param->edgeTransitionThreshold < 0.0 || 1.0 < param->edgeTransitionThreshold,
278
-            "hist-threshold must be between 0.0 and 1.0");
279
     CHECK(param->radl < 0 || param->radl > param->bframes,
280
           "radl must be between 0 and bframes");
281
     CHECK(param->rdPenalty < 0 || param->rdPenalty > 2,
282
@@ -1824,15 +1813,15 @@
283
         "Invalid refine-ctu-distortion value, must be either 0 or 1");
284
     CHECK(param->maxAUSizeFactor < 0.5 || param->maxAUSizeFactor > 1.0,
285
         "Supported factor for controlling max AU size is from 0.5 to 1");
286
-    CHECK((param->dolbyProfile != 0) && (param->dolbyProfile != 50) && (param->dolbyProfile != 81) && (param->dolbyProfile != 82),
287
-        "Unsupported Dolby Vision profile, only profile 5, profile 8.1 and profile 8.2 enabled");
288
+    CHECK((param->dolbyProfile != 0) && (param->dolbyProfile != 50) && (param->dolbyProfile != 81) && (param->dolbyProfile != 82) && (param->dolbyProfile != 84),
289
+        "Unsupported Dolby Vision profile, only profile 5, profile 8.1, profile 8.2 and profile 8.4 enabled");
290
     CHECK(param->dupThreshold < 1 || 99 < param->dupThreshold,
291
         "Invalid frame-duplication threshold. Value must be between 1 and 99.");
292
     if (param->dolbyProfile)
293
     {
294
         CHECK((param->rc.vbvMaxBitrate <= 0 || param->rc.vbvBufferSize <= 0), "Dolby Vision requires VBV settings to enable HRD.\n");
295
-        CHECK((param->internalBitDepth != 10), "Dolby Vision profile - 5, profile - 8.1 and profile - 8.2 is Main10 only\n");
296
-        CHECK((param->internalCsp != X265_CSP_I420), "Dolby Vision profile - 5, profile - 8.1 and profile - 8.2 requires YCbCr 4:2:0 color space\n");
297
+        CHECK((param->internalBitDepth != 10), "Dolby Vision profile - 5, profile - 8.1, profile - 8.2 and profile - 8.4 are Main10 only\n");
298
+        CHECK((param->internalCsp != X265_CSP_I420), "Dolby Vision profile - 5, profile - 8.1, profile - 8.2 and profile - 8.4 requires YCbCr 4:2:0 color space\n");
299
         if (param->dolbyProfile == 81)
300
             CHECK(!(param->masteringDisplayColorVolume), "Dolby Vision profile - 8.1 requires Mastering display color volume information\n");
301
     }
302
@@ -1854,19 +1843,22 @@
303
         {
304
             CHECK(param->bEnableSceneCutAwareQp < 0 || param->bEnableSceneCutAwareQp > 3,
305
             "Invalid masking direction. Value must be between 0 and 3(inclusive)");
306
-            CHECK(param->fwdScenecutWindow < 0 || param->fwdScenecutWindow > 1000,
307
-            "Invalid forward scenecut Window duration. Value must be between 0 and 1000(inclusive)");
308
-            CHECK(param->fwdRefQpDelta < 0 || param->fwdRefQpDelta > 10,
309
-            "Invalid fwdRefQpDelta value. Value must be between 0 and 10 (inclusive)");
310
-            CHECK(param->fwdNonRefQpDelta < 0 || param->fwdNonRefQpDelta > 10,
311
-            "Invalid fwdNonRefQpDelta value. Value must be between 0 and 10 (inclusive)");
312
-
313
-            CHECK(param->bwdScenecutWindow < 0 || param->bwdScenecutWindow > 1000,
314
-                "Invalid backward scenecut Window duration. Value must be between 0 and 1000(inclusive)");
315
-            CHECK(param->bwdRefQpDelta < -1 || param->bwdRefQpDelta > 10,
316
-                "Invalid bwdRefQpDelta value. Value must be between 0 and 10 (inclusive)");
317
-            CHECK(param->bwdNonRefQpDelta < -1 || param->bwdNonRefQpDelta > 10,
318
-                "Invalid bwdNonRefQpDelta value. Value must be between 0 and 10 (inclusive)");
319
+            for (int i = 0; i < 6; i++)
320
+            {
321
+                CHECK(param->fwdScenecutWindow[i] < 0 || param->fwdScenecutWindow[i] > 1000,
322
+                    "Invalid forward scenecut Window duration. Value must be between 0 and 1000(inclusive)");
323
+                CHECK(param->fwdRefQpDelta[i] < 0 || param->fwdRefQpDelta[i] > 20,
324
+                    "Invalid fwdRefQpDelta value. Value must be between 0 and 20 (inclusive)");
325
+                CHECK(param->fwdNonRefQpDelta[i] < 0 || param->fwdNonRefQpDelta[i] > 20,
326
+                    "Invalid fwdNonRefQpDelta value. Value must be between 0 and 20 (inclusive)");
327
+
328
+                CHECK(param->bwdScenecutWindow[i] < 0 || param->bwdScenecutWindow[i] > 1000,
329
+                    "Invalid backward scenecut Window duration. Value must be between 0 and 1000(inclusive)");
330
+                CHECK(param->bwdRefQpDelta[i] < -1 || param->bwdRefQpDelta[i] > 20,
331
+                    "Invalid bwdRefQpDelta value. Value must be between 0 and 20 (inclusive)");
332
+                CHECK(param->bwdNonRefQpDelta[i] < -1 || param->bwdNonRefQpDelta[i] > 20,
333
+                    "Invalid bwdNonRefQpDelta value. Value must be between 0 and 20 (inclusive)");
334
+            }
335
         }
336
     }
337
     if (param->bEnableHME)
338
@@ -1898,6 +1890,11 @@
339
         param->bSingleSeiNal = 0;
340
         x265_log(param, X265_LOG_WARNING, "None of the SEI messages are enabled. Disabling Single SEI NAL\n");
341
     }
342
+    if (param->bEnableTemporalFilter && (param->frameNumThreads > 1))
343
+    {
344
+        param->bEnableTemporalFilter = 0;
345
+        x265_log(param, X265_LOG_WARNING, "MCSTF can be enabled with frame thread = 1 only. Disabling MCSTF\n");
346
+    }
347
     CHECK(param->confWinRightOffset < 0, "Conformance Window Right Offset must be 0 or greater");
348
     CHECK(param->confWinBottomOffset < 0, "Conformance Window Bottom Offset must be 0 or greater");
349
     CHECK(param->decoderVbvMaxRate < 0, "Invalid Decoder Vbv Maxrate. Value can not be less than zero");
350
@@ -1910,6 +1907,7 @@
351
             x265_log(param, X265_LOG_WARNING, "Live VBV enabled without VBV settings.Disabling live VBV in 2 pass\n");
352
         }
353
     }
354
+    CHECK(param->rc.dataShareMode != X265_SHARE_MODE_FILE && param->rc.dataShareMode != X265_SHARE_MODE_SHAREDMEM, "Invalid data share mode. It must be one of the X265_DATA_SHARE_MODES enum values\n" );
355
     return check_failed;
356
 }
357
 
358
@@ -1970,8 +1968,8 @@
359
         x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / bias  : %d / %d / %d / %.2lf \n",
360
                  param->keyframeMin, param->keyframeMax, param->scenecutThreshold, param->scenecutBias * 100);
361
     else if (param->bHistBasedSceneCut && param->keyframeMax != INT_MAX) 
362
-        x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / edge threshold  : %d / %d / %d / %.2lf\n",
363
-                 param->keyframeMin, param->keyframeMax, param->bHistBasedSceneCut, param->edgeTransitionThreshold);
364
+        x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut  : %d / %d / %d\n",
365
+                 param->keyframeMin, param->keyframeMax, param->bHistBasedSceneCut);
366
     else if (param->keyframeMax == INT_MAX)
367
         x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut       : disabled\n");
368
 
369
@@ -2089,6 +2087,8 @@
370
         bufSize += strlen(p->numaPools);
371
     if (p->masteringDisplayColorVolume)
372
         bufSize += strlen(p->masteringDisplayColorVolume);
373
+    if (p->videoSignalTypePreset)
374
+        bufSize += strlen(p->videoSignalTypePreset);
375
 
376
     buf = s = X265_MALLOC(char, bufSize);
377
     if (!buf)
378
@@ -2126,10 +2126,12 @@
379
     BOOL(p->bRepeatHeaders, "repeat-headers");
380
     BOOL(p->bAnnexB, "annexb");
381
     BOOL(p->bEnableAccessUnitDelimiters, "aud");
382
+    BOOL(p->bEnableEndOfBitstream, "eob");
383
+    BOOL(p->bEnableEndOfSequence, "eos");
384
     BOOL(p->bEmitHRDSEI, "hrd");
385
     BOOL(p->bEmitInfoSEI, "info");
386
     s += sprintf(s, " hash=%d", p->decodedPictureHashSEI);
387
-    BOOL(p->bEnableTemporalSubLayers, "temporal-layers");
388
+    s += sprintf(s, " temporal-layers=%d", p->bEnableTemporalSubLayers);
389
     BOOL(p->bOpenGOP, "open-gop");
390
     s += sprintf(s, " min-keyint=%d", p->keyframeMin);
391
     s += sprintf(s, " keyint=%d", p->keyframeMax);
392
@@ -2141,7 +2143,7 @@
393
     s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth);
394
     s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices);
395
     s += sprintf(s, " scenecut=%d", p->scenecutThreshold);
396
-    s += sprintf(s, " hist-scenecut=%d", p->bHistBasedSceneCut);
397
+    BOOL(p->bHistBasedSceneCut, "hist-scenecut");
398
     s += sprintf(s, " radl=%d", p->radl);
399
     BOOL(p->bEnableHRDConcatFlag, "splice");
400
     BOOL(p->bIntraRefresh, "intra-refresh");
401
@@ -2295,7 +2297,6 @@
402
     BOOL(p->bOptRefListLengthPPS, "opt-ref-list-length-pps");
403
     BOOL(p->bMultiPassOptRPS, "multi-pass-opt-rps");
404
     s += sprintf(s, " scenecut-bias=%.2f", p->scenecutBias);
405
-    s += sprintf(s, " hist-threshold=%.2f", p->edgeTransitionThreshold);
406
     BOOL(p->bOptCUDeltaQP, "opt-cu-delta-qp");
407
     BOOL(p->bAQMotion, "aq-motion");
408
     BOOL(p->bEmitHDR10SEI, "hdr10");
409
@@ -2328,10 +2329,14 @@
410
     s += sprintf(s, " qp-adaptation-range=%.2f", p->rc.qpAdaptationRange);
411
     s += sprintf(s, " scenecut-aware-qp=%d", p->bEnableSceneCutAwareQp);
412
     if (p->bEnableSceneCutAwareQp)
413
-        s += sprintf(s, " fwd-scenecut-window=%d fwd-ref-qp-delta=%f fwd-nonref-qp-delta=%f bwd-scenecut-window=%d bwd-ref-qp-delta=%f bwd-nonref-qp-delta=%f", p->fwdScenecutWindow, p->fwdRefQpDelta, p->fwdNonRefQpDelta, p->bwdScenecutWindow, p->bwdRefQpDelta, p->bwdNonRefQpDelta);
414
+        s += sprintf(s, " fwd-scenecut-window=%d fwd-ref-qp-delta=%f fwd-nonref-qp-delta=%f bwd-scenecut-window=%d bwd-ref-qp-delta=%f bwd-nonref-qp-delta=%f", p->fwdMaxScenecutWindow, p->fwdRefQpDelta0, p->fwdNonRefQpDelta0, p->bwdMaxScenecutWindow, p->bwdRefQpDelta0, p->bwdNonRefQpDelta0);
415
     s += sprintf(s, "conformance-window-offsets right=%d bottom=%d", p->confWinRightOffset, p->confWinBottomOffset);
416
     s += sprintf(s, " decoder-max-rate=%d", p->decoderVbvMaxRate);
417
     BOOL(p->bliveVBV2pass, "vbv-live-multi-pass");
418
+    if (p->filmGrain)
419
+        s += sprintf(s, " film-grain=%s", p->filmGrain); // Film grain characteristics model filename
420
+    BOOL(p->bEnableTemporalFilter, "mcstf");
421
+    BOOL(p->bEnableSBRC, "sbrc");
422
 #undef BOOL
423
     return buf;
424
 }
425
@@ -2406,6 +2411,151 @@
426
     return false;
427
 }
428
 
429
+bool parseMaskingStrength(x265_param* p, const char* value)
430
+{
431
+    bool bError = false;
432
+    int window1[6];
433
+    double refQpDelta1[6], nonRefQpDelta1[6];
434
+    if (p->bEnableSceneCutAwareQp == FORWARD)
435
+    {
436
+        if (3 == sscanf(value, "%d,%lf,%lf", &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0]))
437
+        {
438
+            if (window1[0] > 0)
439
+                p->fwdMaxScenecutWindow = window1[0];
440
+            if (refQpDelta1[0] > 0)
441
+                p->fwdRefQpDelta[0] = refQpDelta1[0];
442
+            if (nonRefQpDelta1[0] > 0)
443
+                p->fwdNonRefQpDelta[0] = nonRefQpDelta1[0];
444
+
445
+            p->fwdScenecutWindow[0] = p->fwdMaxScenecutWindow / 6;
446
+            for (int i = 1; i < 6; i++)
447
+            {
448
+                p->fwdScenecutWindow[i] = p->fwdMaxScenecutWindow / 6;
449
+                p->fwdRefQpDelta[i] = p->fwdRefQpDelta[i - 1] - (0.15 * p->fwdRefQpDelta[i - 1]);
450
+                p->fwdNonRefQpDelta[i] = p->fwdNonRefQpDelta[i - 1] - (0.15 * p->fwdNonRefQpDelta[i - 1]);
451
+            }
452
+        }
453
+        else if (18 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf"
454
+            , &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0], &window1[1], &refQpDelta1[1], &nonRefQpDelta1[1]
455
+            , &window1[2], &refQpDelta1[2], &nonRefQpDelta1[2], &window1[3], &refQpDelta1[3], &nonRefQpDelta1[3]
456
+            , &window1[4], &refQpDelta1[4], &nonRefQpDelta1[4], &window1[5], &refQpDelta1[5], &nonRefQpDelta1[5]))
457
+        {
458
+            p->fwdMaxScenecutWindow = 0;
459
+            for (int i = 0; i < 6; i++)
460
+            {
461
+                p->fwdScenecutWindow[i] = window1[i];
462
+                p->fwdRefQpDelta[i] = refQpDelta1[i];
463
+                p->fwdNonRefQpDelta[i] = nonRefQpDelta1[i];
464
+                p->fwdMaxScenecutWindow += p->fwdScenecutWindow[i];
465
+            }
466
+        }
467
+        else
468
+        {
469
+            x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
470
+            bError = true;
471
+        }
472
+    }
473
+    else if (p->bEnableSceneCutAwareQp == BACKWARD)
474
+    {
475
+        if (3 == sscanf(value, "%d,%lf,%lf", &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0]))
476
+        {
477
+            if (window1[0] > 0)
478
+                p->bwdMaxScenecutWindow = window1[0];
479
+            if (refQpDelta1[0] > 0)
480
+                p->bwdRefQpDelta[0] = refQpDelta1[0];
481
+            if (nonRefQpDelta1[0] > 0)
482
+                p->bwdNonRefQpDelta[0] = nonRefQpDelta1[0];
483
+
484
+            p->bwdScenecutWindow[0] = p->bwdMaxScenecutWindow / 6;
485
+            for (int i = 1; i < 6; i++)
486
+            {
487
+                p->bwdScenecutWindow[i] = p->bwdMaxScenecutWindow / 6;
488
+                p->bwdRefQpDelta[i] = p->bwdRefQpDelta[i - 1] - (0.15 * p->bwdRefQpDelta[i - 1]);
489
+                p->bwdNonRefQpDelta[i] = p->bwdNonRefQpDelta[i - 1] - (0.15 * p->bwdNonRefQpDelta[i - 1]);
490
+            }
491
+        }
492
+        else if (18 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf"
493
+            , &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0], &window1[1], &refQpDelta1[1], &nonRefQpDelta1[1]
494
+            , &window1[2], &refQpDelta1[2], &nonRefQpDelta1[2], &window1[3], &refQpDelta1[3], &nonRefQpDelta1[3]
495
+            , &window1[4], &refQpDelta1[4], &nonRefQpDelta1[4], &window1[5], &refQpDelta1[5], &nonRefQpDelta1[5]))
496
+        {
497
+            p->bwdMaxScenecutWindow = 0;
498
+            for (int i = 0; i < 6; i++)
499
+            {
500
+                p->bwdScenecutWindow[i] = window1[i];
501
+                p->bwdRefQpDelta[i] = refQpDelta1[i];
502
+                p->bwdNonRefQpDelta[i] = nonRefQpDelta1[i];
503
+                p->bwdMaxScenecutWindow += p->bwdScenecutWindow[i];
504
+            }
505
+        }
506
+        else
507
+        {
508
+            x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
509
+            bError = true;
510
+        }
511
+    }
512
+    else if (p->bEnableSceneCutAwareQp == BI_DIRECTIONAL)
513
+    {
514
+        int window2[6];
515
+        double refQpDelta2[6], nonRefQpDelta2[6];
516
+        if (6 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf", &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0], &window2[0], &refQpDelta2[0], &nonRefQpDelta2[0]))
517
+        {
518
+            if (window1[0] > 0)
519
+                p->fwdMaxScenecutWindow = window1[0];
520
+            if (refQpDelta1[0] > 0)
521
+                p->fwdRefQpDelta[0] = refQpDelta1[0];
522
+            if (nonRefQpDelta1[0] > 0)
523
+                p->fwdNonRefQpDelta[0] = nonRefQpDelta1[0];
524
+            if (window2[0] > 0)
525
+                p->bwdMaxScenecutWindow = window2[0];
526
+            if (refQpDelta2[0] > 0)
527
+                p->bwdRefQpDelta[0] = refQpDelta2[0];
528
+            if (nonRefQpDelta2[0] > 0)
529
+                p->bwdNonRefQpDelta[0] = nonRefQpDelta2[0];
530
+
531
+            p->fwdScenecutWindow[0] = p->fwdMaxScenecutWindow / 6;
532
+            p->bwdScenecutWindow[0] = p->bwdMaxScenecutWindow / 6;
533
+            for (int i = 1; i < 6; i++)
534
+            {
535
+                p->fwdScenecutWindow[i] = p->fwdMaxScenecutWindow / 6;
536
+                p->bwdScenecutWindow[i] = p->bwdMaxScenecutWindow / 6;
537
+                p->fwdRefQpDelta[i] = p->fwdRefQpDelta[i - 1] - (0.15 * p->fwdRefQpDelta[i - 1]);
538
+                p->fwdNonRefQpDelta[i] = p->fwdNonRefQpDelta[i - 1] - (0.15 * p->fwdNonRefQpDelta[i - 1]);
539
+                p->bwdRefQpDelta[i] = p->bwdRefQpDelta[i - 1] - (0.15 * p->bwdRefQpDelta[i - 1]);
540
+                p->bwdNonRefQpDelta[i] = p->bwdNonRefQpDelta[i - 1] - (0.15 * p->bwdNonRefQpDelta[i - 1]);
541
+            }
542
+        }
543
+        else if (36 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf"
544
+            , &window1[0], &refQpDelta1[0], &nonRefQpDelta1[0], &window1[1], &refQpDelta1[1], &nonRefQpDelta1[1]
545
+            , &window1[2], &refQpDelta1[2], &nonRefQpDelta1[2], &window1[3], &refQpDelta1[3], &nonRefQpDelta1[3]
546
+            , &window1[4], &refQpDelta1[4], &nonRefQpDelta1[4], &window1[5], &refQpDelta1[5], &nonRefQpDelta1[5]
547
+            , &window2[0], &refQpDelta2[0], &nonRefQpDelta2[0], &window2[1], &refQpDelta2[1], &nonRefQpDelta2[1]
548
+            , &window2[2], &refQpDelta2[2], &nonRefQpDelta2[2], &window2[3], &refQpDelta2[3], &nonRefQpDelta2[3]
549
+            , &window2[4], &refQpDelta2[4], &nonRefQpDelta2[4], &window2[5], &refQpDelta2[5], &nonRefQpDelta2[5]))
550
+        {
551
+            p->fwdMaxScenecutWindow = 0;
552
+            p->bwdMaxScenecutWindow = 0;
553
+            for (int i = 0; i < 6; i++)
554
+            {
555
+                p->fwdScenecutWindow[i] = window1[i];
556
+                p->fwdRefQpDelta[i] = refQpDelta1[i];
557
+                p->fwdNonRefQpDelta[i] = nonRefQpDelta1[i];
558
+                p->bwdScenecutWindow[i] = window2[i];
559
+                p->bwdRefQpDelta[i] = refQpDelta2[i];
560
+                p->bwdNonRefQpDelta[i] = nonRefQpDelta2[i];
561
+                p->fwdMaxScenecutWindow += p->fwdScenecutWindow[i];
562
+                p->bwdMaxScenecutWindow += p->bwdScenecutWindow[i];
563
+            }
564
+        }
565
+        else
566
+        {
567
+            x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n");
568
+            bError = true;
569
+        }
570
+    }
571
+    return bError;
572
+}
573
+
574
 void x265_copy_params(x265_param* dst, x265_param* src)
575
 {
576
     dst->cpuid = src->cpuid;
577
@@ -2440,10 +2590,13 @@
578
     dst->bRepeatHeaders = src->bRepeatHeaders;
579
     dst->bAnnexB = src->bAnnexB;
580
     dst->bEnableAccessUnitDelimiters = src->bEnableAccessUnitDelimiters;
581
+    dst->bEnableEndOfBitstream = src->bEnableEndOfBitstream;
582
+    dst->bEnableEndOfSequence = src->bEnableEndOfSequence;
583
     dst->bEmitInfoSEI = src->bEmitInfoSEI;
584
     dst->decodedPictureHashSEI = src->decodedPictureHashSEI;
585
     dst->bEnableTemporalSubLayers = src->bEnableTemporalSubLayers;
586
     dst->bOpenGOP = src->bOpenGOP;
587
+   dst->craNal = src->craNal;
588
     dst->keyframeMax = src->keyframeMax;
589
     dst->keyframeMin = src->keyframeMin;
590
     dst->bframes = src->bframes;
591
@@ -2541,8 +2694,11 @@
592
     dst->rc.rfConstantMin = src->rc.rfConstantMin;
593
     dst->rc.bStatWrite = src->rc.bStatWrite;
594
     dst->rc.bStatRead = src->rc.bStatRead;
595
+    dst->rc.dataShareMode = src->rc.dataShareMode;
596
     if (src->rc.statFileName) dst->rc.statFileName=strdup(src->rc.statFileName);
597
     else dst->rc.statFileName = NULL;
598
+    if (src->rc.sharedMemName) dst->rc.sharedMemName = strdup(src->rc.sharedMemName);
599
+    else dst->rc.sharedMemName = NULL;
600
     dst->rc.qblur = src->rc.qblur;
601
     dst->rc.complexityBlur = src->rc.complexityBlur;
602
     dst->rc.bEnableSlowFirstPass = src->rc.bEnableSlowFirstPass;
603
@@ -2550,6 +2706,7 @@
604
     dst->rc.zonefileCount = src->rc.zonefileCount;
605
     dst->reconfigWindowSize = src->reconfigWindowSize;
606
     dst->bResetZoneConfig = src->bResetZoneConfig;
607
+    dst->bNoResetZoneConfig = src->bNoResetZoneConfig;
608
     dst->decoderVbvMaxRate = src->decoderVbvMaxRate;
609
 
610
     if (src->rc.zonefileCount && src->rc.zones && src->bResetZoneConfig)
611
@@ -2557,6 +2714,7 @@
612
         for (int i = 0; i < src->rc.zonefileCount; i++)
613
         {
614
             dst->rc.zones[i].startFrame = src->rc.zones[i].startFrame;
615
+            dst->rc.zones[0].keyframeMax = src->rc.zones[0].keyframeMax;
616
             memcpy(dst->rc.zones[i].zoneParam, src->rc.zones[i].zoneParam, sizeof(x265_param));
617
         }
618
     }
619
@@ -2621,7 +2779,6 @@
620
     dst->bOptRefListLengthPPS = src->bOptRefListLengthPPS;
621
     dst->bMultiPassOptRPS = src->bMultiPassOptRPS;
622
     dst->scenecutBias = src->scenecutBias;
623
-    dst->edgeTransitionThreshold = src->edgeTransitionThreshold;
624
     dst->gopLookahead = src->lookaheadDepth;
625
     dst->bOptCUDeltaQP = src->bOptCUDeltaQP;
626
     dst->analysisMultiPassDistortion = src->analysisMultiPassDistortion;
627
@@ -2682,20 +2839,33 @@
628
     dst->bEnableSvtHevc = src->bEnableSvtHevc;
629
     dst->bEnableFades = src->bEnableFades;
630
     dst->bEnableSceneCutAwareQp = src->bEnableSceneCutAwareQp;
631
-    dst->fwdScenecutWindow = src->fwdScenecutWindow;
632
-    dst->fwdRefQpDelta = src->fwdRefQpDelta;
633
-    dst->fwdNonRefQpDelta = src->fwdNonRefQpDelta;
634
-    dst->bwdScenecutWindow = src->bwdScenecutWindow;
635
-    dst->bwdRefQpDelta = src->bwdRefQpDelta;
636
-    dst->bwdNonRefQpDelta = src->bwdNonRefQpDelta;
637
+    dst->fwdMaxScenecutWindow = src->fwdMaxScenecutWindow;
638
+    dst->bwdMaxScenecutWindow = src->bwdMaxScenecutWindow;
639
+    for (int i = 0; i < 6; i++)
640
+    {
641
+        dst->fwdScenecutWindow[i] = src->fwdScenecutWindow[i];
642
+        dst->fwdRefQpDelta[i] = src->fwdRefQpDelta[i];
643
+        dst->fwdNonRefQpDelta[i] = src->fwdNonRefQpDelta[i];
644
+        dst->bwdScenecutWindow[i] = src->bwdScenecutWindow[i];
645
+        dst->bwdRefQpDelta[i] = src->bwdRefQpDelta[i];
646
+        dst->bwdNonRefQpDelta[i] = src->bwdNonRefQpDelta[i];
647
+    }
648
     dst->bField = src->bField;
649
-
650
+    dst->bEnableTemporalFilter = src->bEnableTemporalFilter;
651
+    dst->temporalFilterStrength = src->temporalFilterStrength;
652
     dst->confWinRightOffset = src->confWinRightOffset;
653
     dst->confWinBottomOffset = src->confWinBottomOffset;
654
     dst->bliveVBV2pass = src->bliveVBV2pass;
655
+
656
+    if (src->videoSignalTypePreset) dst->videoSignalTypePreset = strdup(src->videoSignalTypePreset);
657
+    else dst->videoSignalTypePreset = NULL;
658
 #ifdef SVT_HEVC
659
     memcpy(dst->svtHevcParam, src->svtHevcParam, sizeof(EB_H265_ENC_CONFIGURATION));
660
 #endif
661
+    /* Film grain */
662
+    if (src->filmGrain)
663
+        dst->filmGrain = src->filmGrain;
664
+    dst->bEnableSBRC = src->bEnableSBRC;
665
 }
666
 
667
 #ifdef SVT_HEVC
668
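The parseMaskingStrength() function added above accepts either one window,refQpDelta,nonRefQpDelta triple per direction or six explicit triples. A minimal standalone sketch of the three-value form (variable names are illustrative, not x265's): the single window is split into six equal sub-windows and both QP deltas decay by 15% per sub-window.

    #include <cstdio>

    int main()
    {
        const char* value = "800,5,6";           // window,refQpDelta,nonRefQpDelta
        int window = 0;
        double refQpDelta = 0, nonRefQpDelta = 0;
        if (sscanf(value, "%d,%lf,%lf", &window, &refQpDelta, &nonRefQpDelta) != 3)
            return 1;                            // incomplete triple is rejected, as in the hunk above

        int win[6];
        double refDelta[6], nonRefDelta[6];
        win[0] = window / 6;
        refDelta[0] = refQpDelta;
        nonRefDelta[0] = nonRefQpDelta;
        for (int i = 1; i < 6; i++)
        {
            win[i] = window / 6;                                          // equal sub-windows
            refDelta[i] = refDelta[i - 1] - 0.15 * refDelta[i - 1];       // 15% decay per step
            nonRefDelta[i] = nonRefDelta[i - 1] - 0.15 * nonRefDelta[i - 1];
        }
        for (int i = 0; i < 6; i++)
            printf("sub-window %d: %d frames, refQpDelta %.2f, nonRefQpDelta %.2f\n",
                   i, win[i], refDelta[i], nonRefDelta[i]);
        return 0;
    }
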
x265_3.5.tar.gz/source/common/param.h -> x265_3.6.tar.gz/source/common/param.h Changed
17
 
1
@@ -38,6 +38,7 @@
2
 void  getParamAspectRatio(x265_param *p, int& width, int& height);
3
 bool  parseLambdaFile(x265_param *param);
4
 void x265_copy_params(x265_param* dst, x265_param* src);
5
+bool parseMaskingStrength(x265_param* p, const char* value);
6
 
7
 /* this table is kept internal to avoid confusion, since log level indices start at -1 */
8
 static const char * const logLevelNames = { "none", "error", "warning", "info", "debug", "full", 0 };
9
@@ -52,6 +53,7 @@
10
 int x265_param_default_preset(x265_param *, const char *preset, const char *tune);
11
 int x265_param_apply_profile(x265_param *, const char *profile);
12
 int x265_param_parse(x265_param *p, const char *name, const char *value);
13
+int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value);
14
 int x265_zone_param_parse(x265_param* p, const char* name, const char* value);
15
 #define PARAM_NS X265_NS
16
 #endif
17
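For applications that drive libx265 through its public API, the options wired up in the param.cpp hunks can be set as in the minimal sketch below (this assumes building against the 3.6 headers and library, that x265_scenecut_aware_qp_param_parse declared above is visible to the caller, and that grain.fgc is only a placeholder file name).

    #include <x265.h>

    int main()
    {
        x265_param* param = x265_param_alloc();
        x265_param_default_preset(param, "medium", NULL);

        /* options introduced or changed in 3.6 (see the param.cpp hunks above) */
        x265_param_parse(param, "sbrc", "1");               /* segment based rate control */
        x265_param_parse(param, "mcstf", "1");              /* motion compensated spatio-temporal filter */
        x265_param_parse(param, "hist-scenecut", "1");      /* histogram based scene change detection */
        x265_param_parse(param, "temporal-layers", "3");    /* now an integer count, no longer a bool */
        x265_param_parse(param, "film-grain", "grain.fgc"); /* placeholder FGC model file name */

        /* dedicated entry point declared above for scenecut-aware QP options */
        x265_scenecut_aware_qp_param_parse(param, "scenecut-aware-qp", "1");
        x265_scenecut_aware_qp_param_parse(param, "masking-strength", "800,5,6");

        x265_param_free(param);
        return 0;
    }
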
x265_3.5.tar.gz/source/common/piclist.cpp -> x265_3.6.tar.gz/source/common/piclist.cpp Changed
134
 
1
@@ -45,6 +45,25 @@
2
     m_count++;
3
 }
4
 
5
+void PicList::pushFrontMCSTF(Frame& curFrame)
6
+{
7
+    X265_CHECK(!curFrame.m_nextMCSTF && !curFrame.m_nextMCSTF, "piclist: picture already in OPB list\n"); // ensure frame is not in a list
8
+    curFrame.m_nextMCSTF = m_start;
9
+    curFrame.m_prevMCSTF = NULL;
10
+
11
+    if (m_count)
12
+    {
13
+        m_start->m_prevMCSTF = &curFrame;
14
+        m_start = &curFrame;
15
+    }
16
+    else
17
+    {
18
+        m_start = m_end = &curFrame;
19
+    }
20
+    m_count++;
21
+
22
+}
23
+
24
 void PicList::pushBack(Frame& curFrame)
25
 {
26
     X265_CHECK(!curFrame.m_next && !curFrame.m_prev, "piclist: picture already in list\n"); // ensure frame is not in a list
27
@@ -63,6 +82,24 @@
28
     m_count++;
29
 }
30
 
31
+void PicList::pushBackMCSTF(Frame& curFrame)
32
+{
33
+    X265_CHECK(!curFrame.m_nextMCSTF && !curFrame.m_prevMCSTF, "piclist: picture already in OPB list\n"); // ensure frame is not in a list
34
+    curFrame.m_nextMCSTF = NULL;
35
+    curFrame.m_prevMCSTF = m_end;
36
+
37
+    if (m_count)
38
+    {
39
+        m_end->m_nextMCSTF = &curFrame;
40
+        m_end = &curFrame;
41
+    }
42
+    else
43
+    {
44
+        m_start = m_end = &curFrame;
45
+    }
46
+    m_count++;
47
+}
48
+
49
 Frame *PicList::popFront()
50
 {
51
     if (m_start)
52
@@ -94,6 +131,14 @@
53
     return curFrame;
54
 }
55
 
56
+Frame* PicList::getPOCMCSTF(int poc)
57
+{
58
+    Frame *curFrame = m_start;
59
+    while (curFrame && curFrame->m_poc != poc)
60
+        curFrame = curFrame->m_nextMCSTF;
61
+    return curFrame;
62
+}
63
+
64
 Frame *PicList::popBack()
65
 {
66
     if (m_end)
67
@@ -117,6 +162,29 @@
68
         return NULL;
69
 }
70
 
71
+Frame *PicList::popBackMCSTF()
72
+{
73
+    if (m_end)
74
+    {
75
+        Frame* temp = m_end;
76
+        m_count--;
77
+
78
+        if (m_count)
79
+        {
80
+            m_end = m_end->m_prevMCSTF;
81
+            m_end->m_nextMCSTF = NULL;
82
+        }
83
+        else
84
+        {
85
+            m_start = m_end = NULL;
86
+        }
87
+        temp->m_nextMCSTF = temp->m_prevMCSTF = NULL;
88
+        return temp;
89
+    }
90
+    else
91
+        return NULL;
92
+}
93
+
94
 Frame* PicList::getCurFrame(void)
95
 {
96
     Frame *curFrame = m_start;
97
@@ -158,3 +226,36 @@
98
 
99
     curFrame.m_next = curFrame.m_prev = NULL;
100
 }
101
+
102
+void PicList::removeMCSTF(Frame& curFrame)
103
+{
104
+#if _DEBUG
105
+    Frame *tmp = m_start;
106
+    while (tmp && tmp != &curFrame)
107
+    {
108
+        tmp = tmp->m_nextMCSTF;
109
+    }
110
+
111
+    X265_CHECK(tmp == &curFrame, "framelist: pic being removed was not in list\n"); // verify pic is in this list
112
+#endif
113
+
114
+    m_count--;
115
+    if (m_count)
116
+    {
117
+        if (m_start == &curFrame)
118
+            m_start = curFrame.m_nextMCSTF;
119
+        if (m_end == &curFrame)
120
+            m_end = curFrame.m_prevMCSTF;
121
+
122
+        if (curFrame.m_nextMCSTF)
123
+            curFrame.m_nextMCSTF->m_prevMCSTF = curFrame.m_prevMCSTF;
124
+        if (curFrame.m_prevMCSTF)
125
+            curFrame.m_prevMCSTF->m_nextMCSTF = curFrame.m_nextMCSTF;
126
+    }
127
+    else
128
+    {
129
+        m_start = m_end = NULL;
130
+    }
131
+
132
+    curFrame.m_nextMCSTF = curFrame.m_prevMCSTF = NULL;
133
+}
134
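The *MCSTF variants added above let a Frame sit in the ordinary picture list and in an MCSTF list at the same time, because each list walks its own pair of link pointers. A toy sketch of that idea with simplified types (not x265's Frame/PicList):

    #include <cstdio>

    struct Node
    {
        int   poc;
        Node* next;      Node* prev;        // links used by the "normal" list
        Node* nextMCSTF; Node* prevMCSTF;   // independent links used by the MCSTF list
        Node(int p) : poc(p), next(0), prev(0), nextMCSTF(0), prevMCSTF(0) {}
    };

    struct IntrusiveList
    {
        Node* head;
        Node* Node::*nextPtr;               // which link pair this list traverses
        Node* Node::*prevPtr;
        IntrusiveList(Node* Node::*n, Node* Node::*p) : head(0), nextPtr(n), prevPtr(p) {}

        void pushFront(Node& n)
        {
            n.*nextPtr = head;
            n.*prevPtr = 0;
            if (head)
                head->*prevPtr = &n;
            head = &n;
        }
    };

    int main()
    {
        Node a(1), b(2);
        IntrusiveList encodeList(&Node::next, &Node::prev);
        IntrusiveList mcstfList(&Node::nextMCSTF, &Node::prevMCSTF);
        encodeList.pushFront(a); encodeList.pushFront(b);    // both frames in encode order
        mcstfList.pushFront(a);                              // only one of them in the MCSTF list
        for (Node* n = encodeList.head; n; n = n->next)      printf("encode POC %d\n", n->poc);
        for (Node* n = mcstfList.head;  n; n = n->nextMCSTF) printf("mcstf  POC %d\n", n->poc);
        return 0;
    }
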
x265_3.5.tar.gz/source/common/piclist.h -> x265_3.6.tar.gz/source/common/piclist.h Changed
33
 
1
@@ -49,24 +49,31 @@
2
 
3
     /** Push picture to end of the list */
4
     void pushBack(Frame& pic);
5
+    void pushBackMCSTF(Frame& pic);
6
 
7
     /** Push picture to beginning of the list */
8
     void pushFront(Frame& pic);
9
+    void pushFrontMCSTF(Frame& pic);
10
 
11
     /** Pop picture from end of the list */
12
     Frame* popBack();
13
+    Frame* popBackMCSTF();
14
 
15
     /** Pop picture from beginning of the list */
16
     Frame* popFront();
17
 
18
     /** Find frame with specified POC */
19
     Frame* getPOC(int poc);
20
+    /* Find next MCSTF frame with specified POC */
21
+    Frame* getPOCMCSTF(int poc);
22
 
23
     /** Get the current Frame from the list **/
24
     Frame* getCurFrame(void);
25
 
26
     /** Remove picture from list */
27
     void remove(Frame& pic);
28
+    /* Remove MCSTF picture from list */
29
+    void removeMCSTF(Frame& pic);
30
 
31
     Frame* first()        { return m_start;   }
32
 
33
x265_3.5.tar.gz/source/common/picyuv.cpp -> x265_3.6.tar.gz/source/common/picyuv.cpp Changed
60
 
1
@@ -125,6 +125,58 @@
2
     return false;
3
 }
4
 
5
+/*Copy pixels from the picture buffer of a frame to picture buffer of another frame*/
6
+void PicYuv::copyFromFrame(PicYuv* source)
7
+{
8
+    uint32_t numCuInHeight = (m_picHeight + m_param->maxCUSize - 1) / m_param->maxCUSize;
9
+
10
+    int maxHeight = numCuInHeight * m_param->maxCUSize;
11
+    memcpy(m_picBuf[0], source->m_picBuf[0], sizeof(pixel)* m_stride * (maxHeight + (m_lumaMarginY * 2)));
12
+    m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX;
13
+
14
+    if (m_picCsp != X265_CSP_I400)
15
+    {
16
+        memcpy(m_picBuf[1], source->m_picBuf[1], sizeof(pixel)* m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
17
+        memcpy(m_picBuf[2], source->m_picBuf[2], sizeof(pixel)* m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
18
+
19
+        m_picOrg[1] = m_picBuf[1] + m_chromaMarginY * m_strideC + m_chromaMarginX;
20
+        m_picOrg[2] = m_picBuf[2] + m_chromaMarginY * m_strideC + m_chromaMarginX;
21
+    }
22
+    else
23
+    {
24
+        m_picBuf[1] = m_picBuf[2] = NULL;
25
+        m_picOrg[1] = m_picOrg[2] = NULL;
26
+    }
27
+}
28
+
29
+bool PicYuv::createScaledPicYUV(x265_param* param, uint8_t scaleFactor)
30
+{
31
+    m_param = param;
32
+    m_picWidth = m_param->sourceWidth / scaleFactor;
33
+    m_picHeight = m_param->sourceHeight / scaleFactor;
34
+
35
+    m_picCsp = m_param->internalCsp;
36
+    m_hChromaShift = CHROMA_H_SHIFT(m_picCsp);
37
+    m_vChromaShift = CHROMA_V_SHIFT(m_picCsp);
38
+
39
+    uint32_t numCuInWidth = (m_picWidth + param->maxCUSize - 1) / param->maxCUSize;
40
+    uint32_t numCuInHeight = (m_picHeight + param->maxCUSize - 1) / param->maxCUSize;
41
+
42
+    m_lumaMarginX = 128; // search margin for L0 and L1 ME in horizontal direction
43
+    m_lumaMarginY = 128; // search margin for L0 and L1 ME in vertical direction
44
+    m_stride = (numCuInWidth * param->maxCUSize) + (m_lumaMarginX << 1);
45
+
46
+    int maxHeight = numCuInHeight * param->maxCUSize;
47
+    CHECKED_MALLOC_ZERO(m_picBuf[0], pixel, m_stride * (maxHeight + (m_lumaMarginY * 2)));
48
+    m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX;
49
+    m_picBuf[1] = m_picBuf[2] = NULL;
50
+    m_picOrg[1] = m_picOrg[2] = NULL;
51
+    return true;
52
+
53
+fail:
54
+    return false;
55
+}
56
+
57
 int PicYuv::getLumaBufLen(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp)
58
 {
59
     m_picWidth = picWidth;
60
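To make the geometry in createScaledPicYUV() concrete, here is the same arithmetic worked for a 1920x1080 source, scale factor 2, 64-pixel CTUs and the 128-pixel ME margins hard-coded above (plain arithmetic, no x265 types):

    #include <cstdio>

    int main()
    {
        const int sourceWidth = 1920, sourceHeight = 1080;
        const int maxCUSize = 64, scaleFactor = 2;
        const int lumaMarginX = 128, lumaMarginY = 128;                // ME search margins, as above

        int picWidth  = sourceWidth  / scaleFactor;                    // 960
        int picHeight = sourceHeight / scaleFactor;                    // 540
        int numCuInWidth  = (picWidth  + maxCUSize - 1) / maxCUSize;   // 15
        int numCuInHeight = (picHeight + maxCUSize - 1) / maxCUSize;   // 9
        int stride    = numCuInWidth * maxCUSize + 2 * lumaMarginX;    // 1216
        int maxHeight = numCuInHeight * maxCUSize;                     // 576
        long long lumaPixels = (long long)stride * (maxHeight + 2 * lumaMarginY);

        printf("stride=%d rows=%d luma buffer=%lld pixels\n",
               stride, maxHeight + 2 * lumaMarginY, lumaPixels);       // 1216 x 832 = 1011712
        return 0;
    }
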
x265_3.5.tar.gz/source/common/picyuv.h -> x265_3.6.tar.gz/source/common/picyuv.h Changed
15
 
1
@@ -78,11 +78,13 @@
2
     PicYuv();
3
 
4
     bool  create(x265_param* param, bool picAlloc = true, pixel *pixelbuf = NULL);
5
+    bool  createScaledPicYUV(x265_param* param, uint8_t scaleFactor);
6
     bool  createOffsets(const SPS& sps);
7
     void  destroy();
8
     int   getLumaBufLen(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp);
9
 
10
     void  copyFromPicture(const x265_picture&, const x265_param& param, int padx, int pady);
11
+    void  copyFromFrame(PicYuv* source);
12
 
13
     intptr_t getChromaAddrOffset(uint32_t ctuAddr, uint32_t absPartIdx) const { return m_cuOffsetC[ctuAddr] + m_buOffsetC[absPartIdx]; }
14
 
15
x265_3.5.tar.gz/source/common/pixel.cpp -> x265_3.6.tar.gz/source/common/pixel.cpp Changed
51
 
1
@@ -266,7 +266,7 @@
2
 {
3
     int satd = 0;
4
 
5
-#if ENABLE_ASSEMBLY && X265_ARCH_ARM64
6
+#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 && !HIGH_BIT_DEPTH
7
     pixelcmp_t satd_4x4 = x265_pixel_satd_4x4_neon;
8
 #endif
9
 
10
@@ -284,7 +284,7 @@
11
 {
12
     int satd = 0;
13
 
14
-#if ENABLE_ASSEMBLY && X265_ARCH_ARM64
15
+#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 && !HIGH_BIT_DEPTH
16
     pixelcmp_t satd_8x4 = x265_pixel_satd_8x4_neon;
17
 #endif
18
 
19
@@ -627,6 +627,23 @@
20
     }
21
 }
22
 
23
+static
24
+void frame_subsample_luma(const pixel* src0, pixel* dst0, intptr_t src_stride, intptr_t dst_stride, int width, int height)
25
+{
26
+    for (int y = 0; y < height; y++, src0 += 2 * src_stride, dst0 += dst_stride)
27
+    {
28
+        const pixel *inRow = src0;
29
+        const pixel *inRowBelow = src0 + src_stride;
30
+        pixel *target = dst0;
31
+        for (int x = 0; x < width; x++)
32
+        {
33
+            target[x] = (((inRow[0] + inRowBelow[0] + 1) >> 1) + ((inRow[1] + inRowBelow[1] + 1) >> 1) + 1) >> 1;
34
+            inRow += 2;
35
+            inRowBelow += 2;
36
+        }
37
+    }
38
+}
39
+
40
 /* structural similarity metric */
41
 static void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])
42
 {
43
@@ -1355,5 +1372,7 @@
44
     p.cu[BLOCK_16x16].normFact = normFact_c;
45
     p.cu[BLOCK_32x32].normFact = normFact_c;
46
     p.cu[BLOCK_64x64].normFact = normFact_c;
47
+    /* SubSample Luma*/
48
+    p.frameSubSampleLuma = frame_subsample_luma;
49
 }
50
 }
51
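The C fallback frame_subsample_luma() added above halves a luma plane with a 2x2 box filter, rounding each of the two averaging stages. A self-contained sketch of the same arithmetic on 8-bit samples (ordinary C++, not the optimized x265 primitive):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static void subsampleLuma(const uint8_t* src, uint8_t* dst,
                              ptrdiff_t srcStride, ptrdiff_t dstStride,
                              int dstWidth, int dstHeight)
    {
        for (int y = 0; y < dstHeight; y++, src += 2 * srcStride, dst += dstStride)
        {
            const uint8_t* row      = src;
            const uint8_t* rowBelow = src + srcStride;
            for (int x = 0; x < dstWidth; x++, row += 2, rowBelow += 2)
            {
                // vertical average of each column pair, then horizontal average, rounding both stages
                dst[x] = (uint8_t)((((row[0] + rowBelow[0] + 1) >> 1) +
                                    ((row[1] + rowBelow[1] + 1) >> 1) + 1) >> 1);
            }
        }
    }

    int main()
    {
        const int w = 8, h = 8;
        std::vector<uint8_t> full(w * h, 100), half((w / 2) * (h / 2), 0);
        subsampleLuma(full.data(), half.data(), w, w / 2, w / 2, h / 2);
        printf("subsampled[0] = %d\n", half[0]);   // 100: a flat plane stays flat
        return 0;
    }
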
x265_3.5.tar.gz/source/common/ppc/intrapred_altivec.cpp -> x265_3.6.tar.gz/source/common/ppc/intrapred_altivec.cpp Changed
10
 
1
@@ -27,7 +27,7 @@
2
 #include <assert.h>
3
 #include <math.h>
4
 #include <cmath>
5
-#include <linux/types.h>
6
+#include <sys/types.h>
7
 #include <stdlib.h>
8
 #include <stdio.h>
9
 #include <stdint.h>
10
x265_3.5.tar.gz/source/common/primitives.h -> x265_3.6.tar.gz/source/common/primitives.h Changed
28
 
1
@@ -232,6 +232,8 @@
2
 typedef void(*psyRdoQuant_t2)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos);
3
 typedef void(*ssimDistortion_t)(const pixel *fenc, uint32_t fStride, const pixel *recon,  intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k);
4
 typedef void(*normFactor_t)(const pixel *src, uint32_t blockSize, int shift, uint64_t *z_k);
5
+/* SubSampling Luma */
6
+typedef void (*downscaleluma_t)(const pixel* src0, pixel* dstf, intptr_t src_stride, intptr_t dst_stride, int width, int height);
7
 /* Function pointers to optimized encoder primitives. Each pointer can reference
8
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
9
 struct EncoderPrimitives
10
@@ -353,6 +355,8 @@
11
 
12
     downscale_t           frameInitLowres;
13
     downscale_t           frameInitLowerRes;
14
+    /* Sub Sample Luma */
15
+    downscaleluma_t        frameSubSampleLuma;
16
     cutree_propagate_cost propagateCost;
17
     cutree_fix8_unpack    fix8Unpack;
18
     cutree_fix8_pack      fix8Pack;
19
@@ -488,7 +492,7 @@
20
 
21
 #if ENABLE_ASSEMBLY && X265_ARCH_ARM64
22
 extern "C" {
23
-#include "aarch64/pixel-util.h"
24
+#include "aarch64/fun-decls.h"
25
 }
26
 #endif
27
 
28
x265_3.6.tar.gz/source/common/ringmem.cpp Added
359
 
1
@@ -0,0 +1,357 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2013-2017 MulticoreWare, Inc
4
+ *
5
+ * Authors: liwei <liwei@multicorewareinc.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com
23
+ *****************************************************************************/
24
+
25
+#include "ringmem.h"
26
+
27
+#ifndef _WIN32
28
+#include <sys/mman.h>
29
+#endif ////< _WIN32
30
+
31
+#ifdef _WIN32
32
+#define X265_SHARED_MEM_NAME                    "Local\\_x265_shr_mem_"
33
+#define X265_SEMAPHORE_RINGMEM_WRITER_NAME     "_x265_semW_"
34
+#define X265_SEMAPHORE_RINGMEM_READER_NAME     "_x265_semR_"
35
+#else /* POSIX / pthreads */
36
+#define X265_SHARED_MEM_NAME                    "/tmp/_x265_shr_mem_"
37
+#define X265_SEMAPHORE_RINGMEM_WRITER_NAME     "/tmp/_x265_semW_"
38
+#define X265_SEMAPHORE_RINGMEM_READER_NAME     "/tmp/_x265_semR_"
39
+#endif
40
+
41
+#define RINGMEM_ALLIGNMENT                       64
42
+
43
+namespace X265_NS {
44
+    RingMem::RingMem() 
45
+        : m_initialized(false)
46
+        , m_protectRW(false)
47
+        , m_itemSize(0)
48
+        , m_itemCnt(0)
49
+        , m_dataPool(NULL)
50
+        , m_shrMem(NULL)
51
+#ifdef _WIN32
52
+        , m_handle(NULL)
53
+#else //_WIN32
54
+        , m_filepath(NULL)
55
+#endif //_WIN32
56
+        , m_writeSem(NULL)
57
+        , m_readSem(NULL)
58
+    {
59
+    }
60
+
61
+
62
+    RingMem::~RingMem()
63
+    {
64
+    }
65
+
66
+    bool RingMem::skipRead(int32_t cnt) {
67
+        if (!m_initialized)
68
+        {
69
+            return false;
70
+        }
71
+
72
+        if (m_protectRW)
73
+        {
74
+            for (int i = 0; i < cnt; i++)
75
+            {
76
+                m_readSem->take();
77
+            }
78
+        }
79
+        
80
+        ATOMIC_ADD(&m_shrMem->m_read, cnt);
81
+
82
+        if (m_protectRW)
83
+        {
84
+            m_writeSem->give(cnt);
85
+        }
86
+
87
+        return true;
88
+    }
89
+
90
+    bool RingMem::skipWrite(int32_t cnt) {
91
+        if (!m_initialized)
92
+        {
93
+            return false;
94
+        }
95
+
96
+        if (m_protectRW)
97
+        {
98
+            for (int i = 0; i < cnt; i++)
99
+            {
100
+                m_writeSem->take();
101
+            }
102
+        }
103
+
104
+        ATOMIC_ADD(&m_shrMem->m_write, cnt);
105
+
106
+        if (m_protectRW)
107
+        {
108
+            m_readSem->give(cnt);
109
+        }
110
+
111
+        return true;
112
+    }
113
+
114
+    ///< initialize
115
+    bool RingMem::init(int32_t itemSize, int32_t itemCnt, const char *name, bool protectRW)
116
+    {
117
+        ///< check parameters
118
+        if (itemSize <= 0 || itemCnt <= 0 || NULL == name)
119
+        {
120
+            ///< invalid parameters 
121
+            return false;
122
+        }
123
+
124
+        if (!m_initialized)
125
+        {
126
+            ///< formating names
127
+            char nameBuf[MAX_SHR_NAME_LEN] = { 0 };
128
+
129
+            ///< shared memory name
130
+            snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SHARED_MEM_NAME, name);
131
+
132
+            ///< create or open shared memory
133
+            bool newCreated = false;
134
+
135
+            ///< calculate the size of the shared memory
136
+            int32_t shrMemSize = (itemSize * itemCnt + sizeof(ShrMemCtrl) + RINGMEM_ALLIGNMENT - 1) & ~(RINGMEM_ALLIGNMENT - 1);
137
+
138
+#ifdef _WIN32
139
+            HANDLE h = OpenFileMappingA(FILE_MAP_WRITE | FILE_MAP_READ, FALSE, nameBuf);
140
+            if (!h)
141
+            {
142
+                h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, shrMemSize, nameBuf);
143
+
144
+                if (!h)
145
+                {
146
+                    return false;
147
+                }
148
+
149
+                newCreated = true;
150
+            }
151
+
152
+            void *pool = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 0);
153
+
154
+            ///< should not close the handle here, otherwise the OpenFileMapping would fail
155
+            //CloseHandle(h);
156
+            m_handle = h;
157
+
158
+            if (!pool)
159
+            {
160
+                return false;
161
+            }
162
+
163
+#else /* POSIX / pthreads */
164
+            mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH;
165
+            int flag = O_RDWR;
166
+            int shrfd = -1;
167
+            if ((shrfd = open(nameBuf, flag, mode)) < 0)
168
+            {
169
+                flag |= O_CREAT;
170
+                
171
+                shrfd = open(nameBuf, flag, mode);
172
+                if (shrfd < 0)
173
+                {
174
+                    return false;
175
+                }
176
+                newCreated = true;
177
+
178
+                lseek(shrfd, shrMemSize - 1, SEEK_SET);
179
+
180
+                if (-1 == write(shrfd, "\0", 1))
181
+                {
182
+                    close(shrfd);
183
+                    return false;
184
+                }
185
+
186
+                if (lseek(shrfd, 0, SEEK_END) < shrMemSize)
187
+                {
188
+                    close(shrfd);
189
+                    return false;
190
+                }
191
+            }
192
+
193
+            void *pool = mmap(0,
194
+                shrMemSize,
195
+                PROT_READ | PROT_WRITE,
196
+                MAP_SHARED,
197
+                shrfd,
198
+                0);
199
+
200
+            close(shrfd);
201
+            if (pool == MAP_FAILED)
202
+            {               
203
+                return false;
204
+            }
205
+
206
+            m_filepath = strdup(nameBuf);
207
+#endif ///< _WIN32
208
+
209
+            if (newCreated)
210
+            {
211
+                memset(pool, 0, shrMemSize);
212
+            }
213
+            
214
+            m_shrMem = reinterpret_cast<ShrMemCtrl *>(pool);
215
+            m_dataPool = reinterpret_cast<uint8_t *>(pool) + sizeof(ShrMemCtrl);
216
+            m_itemSize = itemSize;
217
+            m_itemCnt = itemCnt;
218
+            m_initialized = true;
219
+
220
+            if (protectRW)
221
+            {
222
+                m_protectRW = true;
223
+                m_writeSem = new NamedSemaphore();
224
+                if (!m_writeSem)
225
+                {
226
+                    release();
227
+                    return false;
228
+                }
229
+
230
+                ///< shared memory name
231
+                snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SEMAPHORE_RINGMEM_WRITER_NAME, name);
232
+                if (!m_writeSem->create(nameBuf, m_itemCnt, m_itemCnt))
233
+                {
234
+                    release();
235
+                    return false;
236
+                }
237
+
238
+                m_readSem = new NamedSemaphore();
239
+                if (!m_readSem)
240
+                {
241
+                    release();
242
+                    return false;
243
+                }
244
+
245
+                ///< shared memory name
246
+                snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SEMAPHORE_RINGMEM_READER_NAME, name);
247
+                if (!m_readSem->create(nameBuf, 0, m_itemCnt))
248
+                {
249
+                    release();
250
+                    return false;
251
+                }
252
+            }
253
+        }
254
+
255
+        return true;
256
+    }
257
+    ///< finalize
258
+    void RingMem::release()
259
+    {
260
+        if (m_initialized)
261
+        {
262
+            m_initialized = false;
263
+
264
+            if (m_shrMem)
265
+            {
266
+#ifdef _WIN32
267
+                UnmapViewOfFile(m_shrMem);
268
+                CloseHandle(m_handle);
269
+                m_handle = NULL;
270
+#else /* POSIX / pthreads */
271
+                int32_t shrMemSize = (m_itemSize * m_itemCnt + sizeof(ShrMemCtrl) + RINGMEM_ALLIGNMENT - 1) & (~RINGMEM_ALLIGNMENT - 1);
272
+                munmap(m_shrMem, shrMemSize);
273
+                unlink(m_filepath);
274
+                free(m_filepath);
275
+                m_filepath = NULL;
276
+#endif ///< _WIN32
277
+                m_shrMem = NULL;
278
+                m_dataPool = NULL;
279
+                m_itemSize = 0;
280
+                m_itemCnt = 0;
281
+            }
282
+            
283
+            if (m_protectRW)
284
+            {
285
+                m_protectRW = false;
286
+                if (m_writeSem)
287
+                {
288
+                    m_writeSem->release();
289
+
290
+                    delete m_writeSem;
291
+                    m_writeSem = NULL;
292
+                }
293
+
294
+                if (m_readSem)
295
+                {
296
+                    m_readSem->release();
297
+
298
+                    delete m_readSem;
299
+                    m_readSem = NULL;
300
+                }
301
+            }
302
+
303
+        }
304
+    }
305
+
306
+    ///< data read
307
+    bool RingMem::readNext(void* dst, fnRWSharedData callback)
308
+    {
309
+        if (!m_initialized || !callback || !dst)
310
+        {
311
+            return false;
312
+        }
313
+
314
+        if (m_protectRW)
315
+        {
316
+            if (!m_readSem->take())
317
+            {
318
+                return false;
319
+            }
320
+        }
321
+
322
+        int32_t index = ATOMIC_ADD(&m_shrMem->m_read, 1) % m_itemCnt;
323
+        (*callback)(dst, reinterpret_cast<uint8_t *>(m_dataPool) + index * m_itemSize, m_itemSize);
324
+
325
+        if (m_protectRW)
326
+        {
327
+            m_writeSem->give(1);
328
+        }
329
+
330
+        return true;
331
+    }
332
+    ///< data write
333
+    bool RingMem::writeData(void *data, fnRWSharedData callback)
334
+    {
335
+        if (!m_initialized || !data || !callback)
336
+        {
337
+            return false;
338
+        }
339
+
340
+        if (m_protectRW)
341
+        {
342
+            if (!m_writeSem->take())
343
+            {
344
+                return false;
345
+            }
346
+        }
347
+
348
+        int32_t index = ATOMIC_ADD(&m_shrMem->m_write, 1) % m_itemCnt;
349
+        (*callback)(reinterpret_cast<uint8_t *>(m_dataPool) + index * m_itemSize, data, m_itemSize);
350
+
351
+        if (m_protectRW)
352
+        {
353
+            m_readSem->give(1);
354
+        }
355
+
356
+        return true;
357
+    }
358
+}
359
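writeData() and readNext() above advance free-running write/read counters and select the slot with counter % itemCnt; the named semaphores only gate fullness and emptiness. The core indexing can be sketched in a single process, without shared memory or semaphores (illustrative only):

    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct Ring
    {
        int itemSize, itemCnt;
        int readIdx, writeIdx;
        std::vector<unsigned char> pool;

        Ring(int size, int cnt)
            : itemSize(size), itemCnt(cnt), readIdx(0), writeIdx(0), pool(size * cnt) {}

        bool write(const void* src)
        {
            if (writeIdx - readIdx >= itemCnt)        // full: RingMem would block on m_writeSem here
                return false;
            std::memcpy(&pool[(writeIdx++ % itemCnt) * itemSize], src, itemSize);
            return true;
        }
        bool read(void* dst)
        {
            if (readIdx == writeIdx)                  // empty: RingMem would block on m_readSem here
                return false;
            std::memcpy(dst, &pool[(readIdx++ % itemCnt) * itemSize], itemSize);
            return true;
        }
    };

    int main()
    {
        Ring ring(sizeof(int), 4);
        for (int i = 0; i < 6; i++)
            ring.write(&i);                           // the 5th and 6th writes are refused (ring full)
        int v;
        while (ring.read(&v))
            printf("%d\n", v);                        // prints 0 1 2 3
        return 0;
    }
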
x265_3.6.tar.gz/source/common/ringmem.h Added
92
 
1
@@ -0,0 +1,90 @@
2
+/*****************************************************************************
3
+ * Copyright (C) 2013-2017 MulticoreWare, Inc
4
+ *
5
+ * Authors: liwei <liwei@multicorewareinc.com>
6
+ *
7
+ * This program is free software; you can redistribute it and/or modify
8
+ * it under the terms of the GNU General Public License as published by
9
+ * the Free Software Foundation; either version 2 of the License, or
10
+ * (at your option) any later version.
11
+ *
12
+ * This program is distributed in the hope that it will be useful,
13
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+ * GNU General Public License for more details.
16
+ *
17
+ * You should have received a copy of the GNU General Public License
18
+ * along with this program; if not, write to the Free Software
19
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+ *
21
+ * This program is also available under a commercial proprietary license.
22
+ * For more information, contact us at license @ x265.com
23
+ *****************************************************************************/
24
+
25
+#ifndef X265_RINGMEM_H
26
+#define X265_RINGMEM_H
27
+
28
+#include "common.h"
29
+#include "threading.h"
30
+
31
+#if _MSC_VER
32
+#define snprintf _snprintf
33
+#define strdup _strdup
34
+#endif
35
+
36
+namespace X265_NS {
37
+
38
+#define MAX_SHR_NAME_LEN                         256
39
+
40
+    class RingMem {
41
+    public:
42
+        RingMem();
43
+        ~RingMem();
44
+
45
+        bool skipRead(int32_t cnt);
46
+
47
+        bool skipWrite(int32_t cnt);
48
+
49
+        ///< initialize
50
+        ///< protectRW: if use the semaphore the protect the write and read operation.
51
+        bool init(int32_t itemSize, int32_t itemCnt, const char *name, bool protectRW = false);
52
+        ///< finalize
53
+        void release();
54
+
55
+        typedef void(*fnRWSharedData)(void *dst, void *src, int32_t size);
56
+
57
+        ///< data read
58
+        bool readNext(void* dst, fnRWSharedData callback);
59
+        ///< data write
60
+        bool writeData(void *data, fnRWSharedData callback);
61
+
62
+    private:        
63
+        bool    m_initialized;
64
+        bool    m_protectRW;
65
+
66
+        int32_t m_itemSize;
67
+        int32_t m_itemCnt;
68
+        ///< data pool
69
+        void   *m_dataPool;
70
+        typedef struct {
71
+            ///< index to write
72
+            int32_t m_write;
73
+            ///< index to read
74
+            int32_t m_read;
75
+            
76
+        }ShrMemCtrl;
77
+
78
+        ShrMemCtrl *m_shrMem;
79
+#ifdef _WIN32
80
+        void       *m_handle;
81
+#else // _WIN32
82
+        char       *m_filepath;
83
+#endif // _WIN32
84
+
85
+        ///< Semaphores
86
+        NamedSemaphore *m_writeSem;
87
+        NamedSemaphore *m_readSem;
88
+    };
89
+};
90
+
91
+#endif // ifndef X265_RINGMEM_H
92
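A hypothetical caller of the RingMem class declared above (the payload struct and ring name are invented for illustration; only member functions from this header are used, and the copy callback matches the fnRWSharedData typedef):

    #include "ringmem.h"
    #include <cstring>

    using namespace X265_NS;

    struct StatItem { int poc; double qp; };

    /* matches RingMem::fnRWSharedData: raw byte copy between ring slot and caller */
    static void copyItem(void* dst, void* src, int32_t size)
    {
        memcpy(dst, src, size);
    }

    int main()
    {
        RingMem ring;
        if (!ring.init(sizeof(StatItem), 64, "demo_ring", true))  /* 64 slots, reader/writer semaphores */
            return 1;

        StatItem out = { 0, 32.0 };
        ring.writeData(&out, copyItem);      /* producer side */

        StatItem in = { 0, 0.0 };
        ring.readNext(&in, copyItem);        /* consumer side, possibly in another process */

        ring.release();
        return 0;
    }
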
x265_3.5.tar.gz/source/common/slice.h -> x265_3.6.tar.gz/source/common/slice.h Changed
35
 
1
@@ -156,9 +156,9 @@
2
     HRDInfo          hrdParameters;
3
     ProfileTierLevel ptl;
4
     uint32_t         maxTempSubLayers;
5
-    uint32_t         numReorderPics;
6
-    uint32_t         maxDecPicBuffering;
7
-    uint32_t         maxLatencyIncrease;
8
+    uint32_t         numReorderPics[MAX_T_LAYERS];
9
+    uint32_t         maxDecPicBuffering[MAX_T_LAYERS];
10
+    uint32_t         maxLatencyIncrease[MAX_T_LAYERS];
11
 };
12
 
13
 struct Window
14
@@ -235,9 +235,9 @@
15
     uint32_t maxAMPDepth;
16
 
17
     uint32_t maxTempSubLayers;   // max number of Temporal Sub layers
18
-    uint32_t maxDecPicBuffering; // these are dups of VPS values
19
-    uint32_t maxLatencyIncrease;
20
-    int      numReorderPics;
21
+    uint32_t maxDecPicBuffering[MAX_T_LAYERS]; // these are dups of VPS values
22
+    uint32_t maxLatencyIncrease[MAX_T_LAYERS];
23
+    int      numReorderPics[MAX_T_LAYERS];
24
 
25
     RPS      spsrps[MAX_NUM_SHORT_TERM_RPS];
26
     int      spsrpsNum;
27
@@ -363,6 +363,7 @@
28
     int         m_iNumRPSInSPS;
29
     const x265_param *m_param;
30
     int         m_fieldNum;
31
+    Frame*      m_mcstfRefFrameList[2][MAX_MCSTF_TEMPORAL_WINDOW_LENGTH];
32
 
33
     Slice()
34
     {
35
x265_3.6.tar.gz/source/common/temporalfilter.cpp Added
1019
 
1
@@ -0,0 +1,1017 @@
2
+/*****************************************************************************
3
+* Copyright (C) 2013-2021 MulticoreWare, Inc
4
+*
5
+ * Authors: Ashok Kumar Mishra <ashok@multicorewareinc.com>
6
+ *
7
+* This program is free software; you can redistribute it and/or modify
8
+* it under the terms of the GNU General Public License as published by
9
+* the Free Software Foundation; either version 2 of the License, or
10
+* (at your option) any later version.
11
+*
12
+* This program is distributed in the hope that it will be useful,
13
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+* GNU General Public License for more details.
16
+*
17
+* You should have received a copy of the GNU General Public License
18
+* along with this program; if not, write to the Free Software
19
+* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+*
21
+* This program is also available under a commercial proprietary license.
22
+* For more information, contact us at license @ x265.com.
23
+*****************************************************************************/
24
+#include "common.h"
25
+#include "temporalfilter.h"
26
+#include "primitives.h"
27
+
28
+#include "frame.h"
29
+#include "slice.h"
30
+#include "framedata.h"
31
+#include "analysis.h"
32
+
33
+using namespace X265_NS;
34
+
35
+void OrigPicBuffer::addPicture(Frame* inFrame)
36
+{
37
+    m_mcstfPicList.pushFrontMCSTF(*inFrame);
38
+}
39
+
40
+void OrigPicBuffer::addEncPicture(Frame* inFrame)
41
+{
42
+    m_mcstfOrigPicFreeList.pushFrontMCSTF(*inFrame);
43
+}
44
+
45
+void OrigPicBuffer::addEncPictureToPicList(Frame* inFrame)
46
+{
47
+    m_mcstfOrigPicList.pushFrontMCSTF(*inFrame);
48
+}
49
+
50
+OrigPicBuffer::~OrigPicBuffer()
51
+{
52
+    while (!m_mcstfOrigPicList.empty())
53
+    {
54
+        Frame* curFrame = m_mcstfOrigPicList.popBackMCSTF();
55
+        curFrame->destroy();
56
+        delete curFrame;
57
+    }
58
+
59
+    while (!m_mcstfOrigPicFreeList.empty())
60
+    {
61
+        Frame* curFrame = m_mcstfOrigPicFreeList.popBackMCSTF();
62
+        curFrame->destroy();
63
+        delete curFrame;
64
+    }
65
+}
66
+
67
+void OrigPicBuffer::setOrigPicList(Frame* inFrame, int frameCnt)
68
+{
69
+    Slice* slice = inFrame->m_encData->m_slice;
70
+    uint8_t j = 0;
71
+    for (int iterPOC = (inFrame->m_poc - inFrame->m_mcstf->m_range);
72
+        iterPOC <= (inFrame->m_poc + inFrame->m_mcstf->m_range); iterPOC++)
73
+    {
74
+        if (iterPOC != inFrame->m_poc)
75
+        {
76
+            if (iterPOC < 0)
77
+                continue;
78
+            if (iterPOC >= frameCnt)
79
+                break;
80
+
81
+            Frame *iterFrame = m_mcstfPicList.getPOCMCSTF(iterPOC);
82
+            X265_CHECK(iterFrame, "Reference frame not found in OPB");
83
+            if (iterFrame != NULL)
84
+            {
85
+                slice->m_mcstfRefFrameList[1][j] = iterFrame;
86
+                iterFrame->m_refPicCnt[1]--;
87
+            }
88
+
89
+            iterFrame = m_mcstfOrigPicList.getPOCMCSTF(iterPOC);
90
+            if (iterFrame != NULL)
91
+            {
92
+
93
+                slice->m_mcstfRefFrameList[1][j] = iterFrame;
94
+
95
+                iterFrame->m_refPicCnt[1]--;
96
+                Frame *cFrame = m_mcstfOrigPicList.getPOCMCSTF(inFrame->m_poc);
97
+                X265_CHECK(cFrame, "Reference frame not found in encoded OPB");
98
+                cFrame->m_refPicCnt[1]--;
99
+            }
100
+            j++;
101
+        }
102
+    }
103
+}
104
+
105
+void OrigPicBuffer::recycleOrigPicList()
106
+{
107
+    Frame *iterFrame = m_mcstfPicList.first();
108
+
109
+    while (iterFrame)
110
+    {
111
+        Frame *curFrame = iterFrame;
112
+        iterFrame = iterFrame->m_nextMCSTF;
113
+        if (!curFrame->m_refPicCnt[1])
114
+        {
115
+            m_mcstfPicList.removeMCSTF(*curFrame);
116
+            iterFrame = m_mcstfPicList.first();
117
+        }
118
+    }
119
+
120
+    iterFrame = m_mcstfOrigPicList.first();
121
+
122
+    while (iterFrame)
123
+    {
124
+        Frame *curFrame = iterFrame;
125
+        iterFrame = iterFrame->m_nextMCSTF;
126
+        if (!curFrame->m_refPicCnt[1])
127
+        {
128
+            m_mcstfOrigPicList.removeMCSTF(*curFrame);
129
+            *curFrame->m_isSubSampled = false;
130
+            m_mcstfOrigPicFreeList.pushFrontMCSTF(*curFrame);
131
+            iterFrame = m_mcstfOrigPicList.first();
132
+        }
133
+    }
134
+}
135
+
136
+void OrigPicBuffer::addPictureToFreelist(Frame* inFrame)
137
+{
138
+    m_mcstfOrigPicFreeList.pushBack(*inFrame);
139
+}
140
+
141
+TemporalFilter::TemporalFilter()
142
+{
143
+    m_sourceWidth = 0;
144
+    m_sourceHeight = 0,
145
+    m_QP = 0;
146
+    m_sliceTypeConfig = 3;
147
+    m_numRef = 0;
148
+    m_useSADinME = 1;
149
+
150
+    m_range = 2;
151
+    m_chromaFactor = 0.55;
152
+    m_sigmaMultiplier = 9.0;
153
+    m_sigmaZeroPoint = 10.0;
154
+    m_motionVectorFactor = 16;
155
+}
156
+
157
+void TemporalFilter::init(const x265_param* param)
158
+{
159
+    m_param = param;
160
+    m_bitDepth = param->internalBitDepth;
161
+    m_sourceWidth = param->sourceWidth;
162
+    m_sourceHeight = param->sourceHeight;
163
+    m_internalCsp = param->internalCsp;
164
+    m_numComponents = (m_internalCsp != X265_CSP_I400) ? MAX_NUM_COMPONENT : 1;
165
+
166
+    m_metld = new MotionEstimatorTLD;
167
+
168
+    predPUYuv.create(FENC_STRIDE, X265_CSP_I400);
169
+}
170
+
171
+int TemporalFilter::createRefPicInfo(TemporalFilterRefPicInfo* refFrame, x265_param* param)
172
+{
173
+    CHECKED_MALLOC_ZERO(refFrame->mvs, MV, sizeof(MV)* ((m_sourceWidth ) / 4) * ((m_sourceHeight ) / 4));
174
+    refFrame->mvsStride = m_sourceWidth / 4;
175
+    CHECKED_MALLOC_ZERO(refFrame->mvs0, MV, sizeof(MV)* ((m_sourceWidth ) / 16) * ((m_sourceHeight ) / 16));
176
+    refFrame->mvsStride0 = m_sourceWidth / 16;
177
+    CHECKED_MALLOC_ZERO(refFrame->mvs1, MV, sizeof(MV)* ((m_sourceWidth ) / 16) * ((m_sourceHeight ) / 16));
178
+    refFrame->mvsStride1 = m_sourceWidth / 16;
179
+    CHECKED_MALLOC_ZERO(refFrame->mvs2, MV, sizeof(MV)* ((m_sourceWidth ) / 16)*((m_sourceHeight ) / 16));
180
+    refFrame->mvsStride2 = m_sourceWidth / 16;
181
+
182
+    CHECKED_MALLOC_ZERO(refFrame->noise, int, sizeof(int) * ((m_sourceWidth) / 4) * ((m_sourceHeight) / 4));
183
+    CHECKED_MALLOC_ZERO(refFrame->error, int, sizeof(int) * ((m_sourceWidth) / 4) * ((m_sourceHeight) / 4));
184
+
185
+    refFrame->slicetype = X265_TYPE_AUTO;
186
+
187
+    refFrame->compensatedPic = new PicYuv;
188
+    refFrame->compensatedPic->create(param, true);
189
+
190
+    return 1;
191
+fail:
192
+    return 0;
193
+}
194
+
195
+int TemporalFilter::motionErrorLumaSAD(
196
+    PicYuv *orig,
197
+    PicYuv *buffer,
198
+    int x,
199
+    int y,
200
+    int dx,
201
+    int dy,
202
+    int bs,
203
+    int besterror)
204
+{
205
+
206
+    pixel* origOrigin = orig->m_picOrg[0];
207
+    intptr_t origStride = orig->m_stride;
208
+    pixel *buffOrigin = buffer->m_picOrg[0];
209
+    intptr_t buffStride = buffer->m_stride;
210
+    int error = 0;// dx * 10 + dy * 10;
211
+    if (((dx | dy) & 0xF) == 0)
212
+    {
213
+        dx /= m_motionVectorFactor;
214
+        dy /= m_motionVectorFactor;
215
+
216
+        const pixel* bufferRowStart = buffOrigin + (y + dy) * buffStride + (x + dx);
217
+#if 0
218
+        const pixel* origRowStart = origOrigin + y *origStride + x;
219
+
220
+        for (int y1 = 0; y1 < bs; y1++)
221
+        {
222
+            for (int x1 = 0; x1 < bs; x1++)
223
+            {
224
+                int diff = origRowStart[x1] - bufferRowStart[x1];
225
+                error += abs(diff);
226
+            }
227
+
228
+            origRowStart += origStride;
229
+            bufferRowStart += buffStride;
230
+        }
231
+#else
232
+        int partEnum = partitionFromSizes(bs, bs);
233
+        /* copy PU block into cache */
234
+        primitives.pu[partEnum].copy_pp(predPUYuv.m_buf[0], FENC_STRIDE, bufferRowStart, buffStride);
235
+
236
+        error = m_metld->me.bufSAD(predPUYuv.m_buf[0], FENC_STRIDE);
237
+#endif
238
+        if (error > besterror)
239
+        {
240
+            return error;
241
+        }
242
+    }
243
+    else
244
+    {
245
+        const int *xFilter = s_interpolationFilter[dx & 0xF];
246
+        const int *yFilter = s_interpolationFilter[dy & 0xF];
247
+        int tempArray[64 + 8][64];
248
+
249
+        int iSum, iBase;
250
+        for (int y1 = 1; y1 < bs + 7; y1++)
251
+        {
252
+            const int yOffset = y + y1 + (dy >> 4) - 3;
253
+            const pixel *sourceRow = buffOrigin + (yOffset)*buffStride + 0;
254
+            for (int x1 = 0; x1 < bs; x1++)
255
+            {
256
+                iSum = 0;
257
+                iBase = x + x1 + (dx >> 4) - 3;
258
+                const pixel *rowStart = sourceRow + iBase;
259
+
260
+                iSum += xFilter[1] * rowStart[1];
261
+                iSum += xFilter[2] * rowStart[2];
262
+                iSum += xFilter[3] * rowStart[3];
263
+                iSum += xFilter[4] * rowStart[4];
264
+                iSum += xFilter[5] * rowStart[5];
265
+                iSum += xFilter[6] * rowStart[6];
266
+
267
+                tempArray[y1][x1] = iSum;
268
+            }
269
+        }
270
+
271
+        const pixel maxSampleValue = (1 << m_bitDepth) - 1;
272
+        for (int y1 = 0; y1 < bs; y1++)
273
+        {
274
+            const pixel *origRow = origOrigin + (y + y1)*origStride + 0;
275
+            for (int x1 = 0; x1 < bs; x1++)
276
+            {
277
+                iSum = 0;
278
+                iSum += yFilter[1] * tempArray[y1 + 1][x1];
279
+                iSum += yFilter[2] * tempArray[y1 + 2][x1];
280
+                iSum += yFilter[3] * tempArray[y1 + 3][x1];
281
+                iSum += yFilter[4] * tempArray[y1 + 4][x1];
282
+                iSum += yFilter[5] * tempArray[y1 + 5][x1];
283
+                iSum += yFilter[6] * tempArray[y1 + 6][x1];
284
+
285
+                iSum = (iSum + (1 << 11)) >> 12;
286
+                iSum = iSum < 0 ? 0 : (iSum > maxSampleValue ? maxSampleValue : iSum);
287
+
288
+                error += abs(iSum - origRow[x + x1]);
289
+            }
290
+            if (error > besterror)
291
+            {
292
+                return error;
293
+            }
294
+        }
295
+    }
296
+    return error;
297
+}
298
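Both error functions share the fractional-pel path above: when the motion vector has a sub-16 remainder, a 7-tap horizontal pass followed by a 7-tap vertical pass from s_interpolationFilter builds the reference block, each pass adding 6 bits of scale, so the sum is renormalised with (iSum + (1 << 11)) >> 12 and clamped before the per-pixel difference is accumulated. A minimal standalone sketch of that rounding/clamping step (the helper name is illustrative):

    #include <algorithm>

    // Bring a 12-bit-scaled interpolation sum back to sample precision.
    static inline int roundAndClampSample(int iSum, int bitDepth)
    {
        const int maxSampleValue = (1 << bitDepth) - 1;   // e.g. 255 or 1023
        iSum = (iSum + (1 << 11)) >> 12;                  // two 6-bit filter passes
        return std::min(std::max(iSum, 0), maxSampleValue);
    }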
+
299
+int TemporalFilter::motionErrorLumaSSD(
300
+    PicYuv *orig,
301
+    PicYuv *buffer,
302
+    int x,
303
+    int y,
304
+    int dx,
305
+    int dy,
306
+    int bs,
307
+    int besterror)
308
+{
309
+
310
+    pixel* origOrigin = orig->m_picOrg[0];
311
+    intptr_t origStride = orig->m_stride;
312
+    pixel *buffOrigin = buffer->m_picOrg[0];
313
+    intptr_t buffStride = buffer->m_stride;
314
+    int error = 0;// dx * 10 + dy * 10;
315
+    if (((dx | dy) & 0xF) == 0)
316
+    {
317
+        dx /= m_motionVectorFactor;
318
+        dy /= m_motionVectorFactor;
319
+
320
+        const pixel* bufferRowStart = buffOrigin + (y + dy) * buffStride + (x + dx);
321
+#if 0
322
+        const pixel* origRowStart = origOrigin + y * origStride + x;
323
+
324
+        for (int y1 = 0; y1 < bs; y1++)
325
+        {
326
+            for (int x1 = 0; x1 < bs; x1++)
327
+            {
328
+                int diff = origRowStart[x1] - bufferRowStart[x1];
329
+                error += diff * diff;
330
+            }
331
+
332
+            origRowStart += origStride;
333
+            bufferRowStart += buffStride;
334
+        }
335
+#else
336
+        int partEnum = partitionFromSizes(bs, bs);
337
+        /* copy PU block into cache */
338
+        primitives.pupartEnum.copy_pp(predPUYuv.m_buf0, FENC_STRIDE, bufferRowStart, buffStride);
339
+
340
+        error = (int)primitives.cu[partEnum].sse_pp(m_metld->me.fencPUYuv.m_buf[0], FENC_STRIDE, predPUYuv.m_buf[0], FENC_STRIDE);
341
+
342
+#endif
343
+        if (error > besterror)
344
+        {
345
+            return error;
346
+        }
347
+    }
348
+    else
349
+    {
350
+        const int *xFilter = s_interpolationFilter[dx & 0xF];
351
+        const int *yFilter = s_interpolationFilter[dy & 0xF];
352
+        int tempArray[64 + 8][64];
353
+
354
+        int iSum, iBase;
355
+        for (int y1 = 1; y1 < bs + 7; y1++)
356
+        {
357
+            const int yOffset = y + y1 + (dy >> 4) - 3;
358
+            const pixel *sourceRow = buffOrigin + (yOffset)*buffStride + 0;
359
+            for (int x1 = 0; x1 < bs; x1++)
360
+            {
361
+                iSum = 0;
362
+                iBase = x + x1 + (dx >> 4) - 3;
363
+                const pixel *rowStart = sourceRow + iBase;
364
+
365
+                iSum += xFilter[1] * rowStart[1];
366
+                iSum += xFilter[2] * rowStart[2];
367
+                iSum += xFilter[3] * rowStart[3];
368
+                iSum += xFilter[4] * rowStart[4];
369
+                iSum += xFilter[5] * rowStart[5];
370
+                iSum += xFilter[6] * rowStart[6];
371
+
372
+                tempArray[y1][x1] = iSum;
373
+            }
374
+        }
375
+
376
+        const pixel maxSampleValue = (1 << m_bitDepth) - 1;
377
+        for (int y1 = 0; y1 < bs; y1++)
378
+        {
379
+            const pixel *origRow = origOrigin + (y + y1)*origStride + 0;
380
+            for (int x1 = 0; x1 < bs; x1++)
381
+            {
382
+                iSum = 0;
383
+                iSum += yFilter[1] * tempArray[y1 + 1][x1];
384
+                iSum += yFilter[2] * tempArray[y1 + 2][x1];
385
+                iSum += yFilter[3] * tempArray[y1 + 3][x1];
386
+                iSum += yFilter[4] * tempArray[y1 + 4][x1];
387
+                iSum += yFilter[5] * tempArray[y1 + 5][x1];
388
+                iSum += yFilter[6] * tempArray[y1 + 6][x1];
389
+
390
+                iSum = (iSum + (1 << 11)) >> 12;
391
+                iSum = iSum < 0 ? 0 : (iSum > maxSampleValue ? maxSampleValue : iSum);
392
+
393
+                error += (iSum - origRow[x + x1]) * (iSum - origRow[x + x1]);
394
+            }
395
+            if (error > besterror)
396
+            {
397
+                return error;
398
+            }
399
+        }
400
+    }
401
+    return error;
402
+}
403
+
404
+void TemporalFilter::applyMotion(MV *mvs, uint32_t mvsStride, PicYuv *input, PicYuv *output)
405
+{
406
+    static const int lumaBlockSize = 8;
407
+    int srcStride = 0;
408
+    int dstStride = 0;
409
+    int csx = 0, csy = 0;
410
+    for (int c = 0; c < m_numComponents; c++)
411
+    {
412
+        const pixel maxValue = (1 << X265_DEPTH) - 1;
413
+
414
+        const pixel *pSrcImage = input->m_picOrg[c];
415
+        pixel *pDstImage = output->m_picOrg[c];
416
+
417
+        if (!c)
418
+        {
419
+            srcStride = (int)input->m_stride;
420
+            dstStride = (int)output->m_stride;
421
+        }
422
+        else
423
+        {
424
+            srcStride = (int)input->m_strideC;
425
+            dstStride = (int)output->m_strideC;
426
+            csx = CHROMA_H_SHIFT(m_internalCsp);
427
+            csy = CHROMA_V_SHIFT(m_internalCsp);
428
+        }
429
+        const int blockSizeX = lumaBlockSize >> csx;
430
+        const int blockSizeY = lumaBlockSize >> csy;
431
+        const int height = input->m_picHeight >> csy;
432
+        const int width = input->m_picWidth >> csx;
433
+
434
+        for (int y = 0, blockNumY = 0; y + blockSizeY <= height; y += blockSizeY, blockNumY++)
435
+        {
436
+            for (int x = 0, blockNumX = 0; x + blockSizeX <= width; x += blockSizeX, blockNumX++)
437
+            {
438
+                int mvIdx = blockNumY * mvsStride + blockNumX;
439
+                const MV &mv = mvs[mvIdx];
440
+                const int dx = mv.x >> csx;
441
+                const int dy = mv.y >> csy;
442
+                const int xInt = mv.x >> (4 + csx);
443
+                const int yInt = mv.y >> (4 + csy);
444
+
445
+                const int *xFilter = s_interpolationFilter[dx & 0xf];
446
+                const int *yFilter = s_interpolationFilter[dy & 0xf]; // will add 6 bit.
447
+                const int numFilterTaps = 7;
448
+                const int centreTapOffset = 3;
449
+
450
+                int tempArray[lumaBlockSize + numFilterTaps][lumaBlockSize];
451
+
452
+                for (int by = 1; by < blockSizeY + numFilterTaps; by++)
453
+                {
454
+                    const int yOffset = y + by + yInt - centreTapOffset;
455
+                    const pixel *sourceRow = pSrcImage + yOffset * srcStride;
456
+                    for (int bx = 0; bx < blockSizeX; bx++)
457
+                    {
458
+                        int iBase = x + bx + xInt - centreTapOffset;
459
+                        const pixel *rowStart = sourceRow + iBase;
460
+
461
+                        int iSum = 0;
462
+                        iSum += xFilter[1] * rowStart[1];
463
+                        iSum += xFilter[2] * rowStart[2];
464
+                        iSum += xFilter[3] * rowStart[3];
465
+                        iSum += xFilter[4] * rowStart[4];
466
+                        iSum += xFilter[5] * rowStart[5];
467
+                        iSum += xFilter[6] * rowStart[6];
468
+
469
+                        tempArray[by][bx] = iSum;
470
+                    }
471
+                }
472
+
473
+                pixel *pDstRow = pDstImage + y * dstStride;
474
+                for (int by = 0; by < blockSizeY; by++, pDstRow += dstStride)
475
+                {
476
+                    pixel *pDstPel = pDstRow + x;
477
+                    for (int bx = 0; bx < blockSizeX; bx++, pDstPel++)
478
+                    {
479
+                        int iSum = 0;
480
+
481
+                        iSum += yFilter[1] * tempArray[by + 1][bx];
482
+                        iSum += yFilter[2] * tempArray[by + 2][bx];
483
+                        iSum += yFilter[3] * tempArray[by + 3][bx];
484
+                        iSum += yFilter[4] * tempArray[by + 4][bx];
485
+                        iSum += yFilter[5] * tempArray[by + 5][bx];
486
+                        iSum += yFilter[6] * tempArray[by + 6][bx];
487
+
488
+                        iSum = (iSum + (1 << 11)) >> 12;
489
+                        iSum = iSum < 0 ? 0 : (iSum > maxValue ? maxValue : iSum);
490
+                        *pDstPel = (pixel)iSum;
491
+                    }
492
+                }
493
+            }
494
+        }
495
+    }
496
+}
497
+
498
+void TemporalFilter::bilateralFilter(Frame* frame,
499
+    TemporalFilterRefPicInfo* m_mcstfRefList,
500
+    double overallStrength)
501
+{
502
+
503
+    const int numRefs = frame->m_mcstf->m_numRef;
504
+
505
+    for (int i = 0; i < numRefs; i++)
506
+    {
507
+        TemporalFilterRefPicInfo *ref = &m_mcstfRefList[i];
508
+        applyMotion(m_mcstfRefList[i].mvs, m_mcstfRefList[i].mvsStride, m_mcstfRefList[i].picBuffer, ref->compensatedPic);
509
+    }
510
+
511
+    int refStrengthRow = 2;
512
+    if (numRefs == m_range * 2)
513
+    {
514
+        refStrengthRow = 0;
515
+    }
516
+    else if (numRefs == m_range)
517
+    {
518
+        refStrengthRow = 1;
519
+    }
520
+
521
+    const double lumaSigmaSq = (m_QP - m_sigmaZeroPoint) * (m_QP - m_sigmaZeroPoint) * m_sigmaMultiplier;
522
+    const double chromaSigmaSq = 30 * 30;
523
+
524
+    PicYuv* orgPic = frame->m_fencPic;
525
+
526
+    for (int c = 0; c < m_numComponents; c++)
527
+    {
528
+        int height, width;
529
+        pixel *srcPelRow = NULL;
530
+        intptr_t srcStride, correctedPicsStride = 0;
531
+
532
+        if (!c)
533
+        {
534
+            height = orgPic->m_picHeight;
535
+            width = orgPic->m_picWidth;
536
+            srcPelRow = orgPic->m_picOrg[c];
537
+            srcStride = orgPic->m_stride;
538
+        }
539
+        else
540
+        {
541
+            int csx = CHROMA_H_SHIFT(m_internalCsp);
542
+            int csy = CHROMA_V_SHIFT(m_internalCsp);
543
+
544
+            height = orgPic->m_picHeight >> csy;
545
+            width = orgPic->m_picWidth >> csx;
546
+            srcPelRow = orgPic->m_picOrg[c];
547
+            srcStride = (int)orgPic->m_strideC;
548
+        }
549
+
550
+        const double sigmaSq = (!c)  ? lumaSigmaSq : chromaSigmaSq;
551
+        const double weightScaling = overallStrength * ( (!c) ? 0.4 : m_chromaFactor);
552
+
553
+        const double maxSampleValue = (1 << m_bitDepth) - 1;
554
+        const double bitDepthDiffWeighting = 1024.0 / (maxSampleValue + 1);
555
+
556
+        const int blkSize = (!c) ? 8 : 4;
557
+
558
+        for (int y = 0; y < height; y++, srcPelRow += srcStride)
559
+        {
560
+            pixel *srcPel = srcPelRow;
561
+
562
+            for (int x = 0; x < width; x++, srcPel++)
563
+            {
564
+                const int orgVal = (int)*srcPel;
565
+                double temporalWeightSum = 1.0;
566
+                double newVal = (double)orgVal;
567
+
568
+                if ((y % blkSize == 0) && (x % blkSize == 0))
569
+                {
570
+                    for (int i = 0; i < numRefs; i++)
571
+                    {
572
+                        TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefList[i];
573
+
574
+                        if (!c)
575
+                            correctedPicsStride = refPicInfo->compensatedPic->m_stride;
576
+                        else
577
+                            correctedPicsStride = refPicInfo->compensatedPic->m_strideC;
578
+
579
+                        double variance = 0, diffsum = 0;
580
+                        for (int y1 = 0; y1 < blkSize - 1; y1++)
581
+                        {
582
+                            for (int x1 = 0; x1 < blkSize - 1; x1++)
583
+                            {
584
+                                int pix = *(srcPel + x1);
585
+                                int pixR = *(srcPel + x1 + 1);
586
+                                int pixD = *(srcPel + x1 + srcStride);
587
+
588
+                                int ref = *(refPicInfo->compensatedPic->m_picOrg[c] + ((y + y1) * correctedPicsStride + x + x1));
590
+                                int refR = *(refPicInfo->compensatedPic->m_picOrg[c] + ((y + y1) * correctedPicsStride + x + x1 + 1));
591
+                                int refD = *(refPicInfo->compensatedPic->m_picOrg[c] + ((y + y1 + 1) * correctedPicsStride + x + x1));
591
+
592
+                                int diff = pix - ref;
593
+                                int diffR = pixR - refR;
594
+                                int diffD = pixD - refD;
595
+
596
+                                variance += diff * diff;
597
+                                diffsum += (diffR - diff) * (diffR - diff);
598
+                                diffsum += (diffD - diff) * (diffD - diff);
599
+                            }
600
+                        }
601
+
602
+                        refPicInfo->noise[(y / blkSize) * refPicInfo->mvsStride + (x / blkSize)] = (int)round((300 * variance + 50) / (10 * diffsum + 50));
603
+                    }
604
+                }
605
+
606
+                double minError = 9999999;
607
+                for (int i = 0; i < numRefs; i++)
608
+                {
609
+                    TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefList[i];
610
+                    minError = X265_MIN(minError, (double)refPicInfo->error[(y / blkSize) * refPicInfo->mvsStride + (x / blkSize)]);
611
+                }
612
+
613
+                for (int i = 0; i < numRefs; i++)
614
+                {
615
+                    TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefList[i];
616
+
617
+                    const int error = refPicInfo->error[(y / blkSize) * refPicInfo->mvsStride + (x / blkSize)];
618
+                    const int noise = refPicInfo->noise[(y / blkSize) * refPicInfo->mvsStride + (x / blkSize)];
619
+
620
+                    const pixel *pCorrectedPelPtr = refPicInfo->compensatedPic->m_picOrg[c] + (y * correctedPicsStride + x);
621
+                    const int refVal = (int)*pCorrectedPelPtr;
622
+                    double diff = (double)(refVal - orgVal);
623
+                    diff *= bitDepthDiffWeighting;
624
+                    double diffSq = diff * diff;
625
+
626
+                    const int index = X265_MIN(3, std::abs(refPicInfo->origOffset) - 1);
627
+                    double ww = 1, sw = 1;
628
+                    ww *= (noise < 25) ? 1 : 1.2;
629
+                    sw *= (noise < 25) ? 1.3 : 0.8;
630
+                    ww *= (error < 50) ? 1.2 : ((error > 100) ? 0.8 : 1);
631
+                    sw *= (error < 50) ? 1.3 : 1;
632
+                    ww *= ((minError + 1) / (error + 1));
633
+                    const double weight = weightScaling * s_refStrengths[refStrengthRow][index] * ww * exp(-diffSq / (2 * sw * sigmaSq));
634
+
635
+                    newVal += weight * refVal;
636
+                    temporalWeightSum += weight;
637
+                }
638
+                newVal /= temporalWeightSum;
639
+                double sampleVal = round(newVal);
640
+                sampleVal = (sampleVal < 0 ? 0 : (sampleVal > maxSampleValue ? maxSampleValue : sampleVal));
641
+                *srcPel = (pixel)sampleVal;
642
+            }
643
+        }
644
+    }
645
+}
646
+
647
+void TemporalFilter::motionEstimationLuma(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize,
648
+    MV *previous, uint32_t prevMvStride, int factor)
649
+{
650
+
651
+    int range = 5;
652
+
653
+
654
+    const int stepSize = blockSize;
655
+
656
+    const int origWidth = orig->m_picWidth;
657
+    const int origHeight = orig->m_picHeight;
658
+
659
+    int error;
660
+
661
+    for (int blockY = 0; blockY + blockSize <= origHeight; blockY += stepSize)
662
+    {
663
+        for (int blockX = 0; blockX + blockSize <= origWidth; blockX += stepSize)
664
+        {
665
+            const intptr_t pelOffset = blockY * orig->m_stride + blockX;
666
+            m_metld->me.setSourcePU(orig->m_picOrg[0], orig->m_stride, pelOffset, blockSize, blockSize, X265_HEX_SEARCH, 1);
667
+
668
+
669
+            MV best(0, 0);
670
+            int leastError = INT_MAX;
671
+
672
+            if (previous == NULL)
673
+            {
674
+                range = 8;
675
+            }
676
+            else
677
+            {
678
+
679
+                for (int py = -1; py <= 1; py++)
680
+                {
681
+                    int testy = blockY / (2 * blockSize) + py;
682
+
683
+                    for (int px = -1; px <= 1; px++)
684
+                    {
685
+
686
+                        int testx = blockX / (2 * blockSize) + px;
687
+                        if ((testx >= 0) && (testx < origWidth / (2 * blockSize)) && (testy >= 0) && (testy < origHeight / (2 * blockSize)))
688
+                        {
689
+                            int mvIdx = testy * prevMvStride + testx;
690
+                            MV old = previous[mvIdx];
691
+
692
+                            if (m_useSADinME)
693
+                                error = motionErrorLumaSAD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError);
694
+                            else
695
+                                error = motionErrorLumaSSD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError);
696
+
697
+                            if (error < leastError)
698
+                            {
699
+                                best.set(old.x * factor, old.y * factor);
700
+                                leastError = error;
701
+                            }
702
+                        }
703
+                    }
704
+                }
705
+
706
+                if (m_useSADinME)
707
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError);
708
+                else
709
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError);
710
+
711
+                if (error < leastError)
712
+                {
713
+                    best.set(0, 0);
714
+                    leastError = error;
715
+                }
716
+
717
+            }
718
+
719
+            MV prevBest = best;
720
+            for (int y2 = prevBest.y / m_motionVectorFactor - range; y2 <= prevBest.y / m_motionVectorFactor + range; y2++)
721
+            {
722
+                for (int x2 = prevBest.x / m_motionVectorFactor - range; x2 <= prevBest.x / m_motionVectorFactor + range; x2++)
723
+                {
724
+                    if (m_useSADinME)
725
+                        error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError);
726
+                    else
727
+                        error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError);
728
+                    if (error < leastError)
729
+                    {
730
+                        best.set(x2 * m_motionVectorFactor, y2 * m_motionVectorFactor);
731
+                        leastError = error;
732
+                    }
733
+                }
734
+            }
735
+
736
+            if (blockY > 0)
737
+            {
738
+                int idx = ((blockY - stepSize) / stepSize) * mvStride + (blockX / stepSize);
739
+                MV aboveMV = mvs[idx];
740
+
741
+                if (m_useSADinME)
742
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError);
743
+                else
744
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError);
745
+
746
+                if (error < leastError)
747
+                {
748
+                    best.set(aboveMV.x, aboveMV.y);
749
+                    leastError = error;
750
+                }
751
+            }
752
+
753
+            if (blockX > 0)
754
+            {
755
+                int idx = ((blockY / stepSize) * mvStride + (blockX - stepSize) / stepSize);
756
+                MV leftMV = mvs[idx];
757
+
758
+                if (m_useSADinME)
759
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError);
760
+                else
761
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError);
762
+
763
+                if (error < leastError)
764
+                {
765
+                    best.set(leftMV.x, leftMV.y);
766
+                    leastError = error;
767
+                }
768
+            }
769
+
770
+            // calculate average
771
+            double avg = 0.0;
772
+            for (int x1 = 0; x1 < blockSize; x1++)
773
+            {
774
+                for (int y1 = 0; y1 < blockSize; y1++)
775
+                {
776
+                    avg = avg + *(orig->m_picOrg[0] + (blockX + x1 + orig->m_stride * (blockY + y1)));
777
+                }
778
+            }
779
+            avg = avg / (blockSize * blockSize);
780
+
781
+            // calculate variance
782
+            double variance = 0;
783
+            for (int x1 = 0; x1 < blockSize; x1++)
784
+            {
785
+                for (int y1 = 0; y1 < blockSize; y1++)
786
+                {
787
+                    int pix = *(orig->m_picOrg[0] + (blockX + x1 + orig->m_stride * (blockY + y1)));
788
+                    variance = variance + (pix - avg) * (pix - avg);
789
+                }
790
+            }
791
+
792
+            leastError = (int)(20 * ((leastError + 5.0) / (variance + 5.0)) + (leastError / (blockSize * blockSize)) / 50);
793
+
794
+            int mvIdx = (blockY / stepSize) * mvStride + (blockX / stepSize);
795
+            mvs[mvIdx] = best;
796
+        }
797
+    }
798
+}
799
+
800
+
801
+void TemporalFilter::motionEstimationLumaDoubleRes(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize,
802
+    MV *previous, uint32_t prevMvStride, int factor, int* minError)
803
+{
804
+
805
+    int range = 0;
806
+
807
+
808
+    const int stepSize = blockSize;
809
+
810
+    const int origWidth = orig->m_picWidth;
811
+    const int origHeight = orig->m_picHeight;
812
+
813
+    int error;
814
+
815
+    for (int blockY = 0; blockY + blockSize <= origHeight; blockY += stepSize)
816
+    {
817
+        for (int blockX = 0; blockX + blockSize <= origWidth; blockX += stepSize)
818
+        {
819
+
820
+            const intptr_t pelOffset = blockY * orig->m_stride + blockX;
821
+            m_metld->me.setSourcePU(orig->m_picOrg[0], orig->m_stride, pelOffset, blockSize, blockSize, X265_HEX_SEARCH, 1);
822
+
823
+            MV best(0, 0);
824
+            int leastError = INT_MAX;
825
+
826
+            if (previous == NULL)
827
+            {
828
+                range = 8;
829
+            }
830
+            else
831
+            {
832
+
833
+                for (int py = -1; py <= 1; py++)
834
+                {
835
+                    int testy = blockY / (2 * blockSize) + py;
836
+
837
+                    for (int px = -1; px <= 1; px++)
838
+                    {
839
+
840
+                        int testx = blockX / (2 * blockSize) + px;
841
+                        if ((testx >= 0) && (testx < origWidth / (2 * blockSize)) && (testy >= 0) && (testy < origHeight / (2 * blockSize)))
842
+                        {
843
+                            int mvIdx = testy * prevMvStride + testx;
844
+                            MV old = previous[mvIdx];
845
+
846
+                            if (m_useSADinME)
847
+                                error = motionErrorLumaSAD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError);
848
+                            else
849
+                                error = motionErrorLumaSSD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError);
850
+
851
+                            if (error < leastError)
852
+                            {
853
+                                best.set(old.x * factor, old.y * factor);
854
+                                leastError = error;
855
+                            }
856
+                        }
857
+                    }
858
+                }
859
+
860
+                if (m_useSADinME)
861
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError);
862
+                else
863
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError);
864
+
865
+                if (error < leastError)
866
+                {
867
+                    best.set(0, 0);
868
+                    leastError = error;
869
+                }
870
+
871
+            }
872
+
873
+            MV prevBest = best;
874
+            for (int y2 = prevBest.y / m_motionVectorFactor - range; y2 <= prevBest.y / m_motionVectorFactor + range; y2++)
875
+            {
876
+                for (int x2 = prevBest.x / m_motionVectorFactor - range; x2 <= prevBest.x / m_motionVectorFactor + range; x2++)
877
+                {
878
+                    if (m_useSADinME)
879
+                        error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError);
880
+                    else
881
+                        error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError);
882
+
883
+                    if (error < leastError)
884
+                    {
885
+                        best.set(x2 * m_motionVectorFactor, y2 * m_motionVectorFactor);
886
+                        leastError = error;
887
+                    }
888
+                }
889
+            }
890
+
891
+            prevBest = best;
892
+            int doubleRange = 3 * 4;
893
+            for (int y2 = prevBest.y - doubleRange; y2 <= prevBest.y + doubleRange; y2 += 4)
894
+            {
895
+                for (int x2 = prevBest.x - doubleRange; x2 <= prevBest.x + doubleRange; x2 += 4)
896
+                {
897
+                    if (m_useSADinME)
898
+                        error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError);
899
+                    else
900
+                        error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError);
901
+
902
+                    if (error < leastError)
903
+                    {
904
+                        best.set(x2, y2);
905
+                        leastError = error;
906
+                    }
907
+                }
908
+            }
909
+
910
+            prevBest = best;
911
+            doubleRange = 3;
912
+            for (int y2 = prevBest.y - doubleRange; y2 <= prevBest.y + doubleRange; y2++)
913
+            {
914
+                for (int x2 = prevBest.x - doubleRange; x2 <= prevBest.x + doubleRange; x2++)
915
+                {
916
+                    if (m_useSADinME)
917
+                        error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError);
918
+                    else
919
+                        error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError);
920
+
921
+                    if (error < leastError)
922
+                    {
923
+                        best.set(x2, y2);
924
+                        leastError = error;
925
+                    }
926
+                }
927
+            }
928
+
929
+
930
+            if (blockY > 0)
931
+            {
932
+                int idx = ((blockY - stepSize) / stepSize) * mvStride + (blockX / stepSize);
933
+                MV aboveMV = mvs[idx];
934
+
935
+                if (m_useSADinME)
936
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError);
937
+                else
938
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError);
939
+
940
+                if (error < leastError)
941
+                {
942
+                    best.set(aboveMV.x, aboveMV.y);
943
+                    leastError = error;
944
+                }
945
+            }
946
+
947
+            if (blockX > 0)
948
+            {
949
+                int idx = ((blockY / stepSize) * mvStride + (blockX - stepSize) / stepSize);
950
+                MV leftMV = mvs[idx];
951
+
952
+                if (m_useSADinME)
953
+                    error = motionErrorLumaSAD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError);
954
+                else
955
+                    error = motionErrorLumaSSD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError);
956
+
957
+                if (error < leastError)
958
+                {
959
+                    best.set(leftMV.x, leftMV.y);
960
+                    leastError = error;
961
+                }
962
+            }
963
+
964
+            // calculate average
965
+            double avg = 0.0;
966
+            for (int x1 = 0; x1 < blockSize; x1++)
967
+            {
968
+                for (int y1 = 0; y1 < blockSize; y1++)
969
+                {
970
+                    avg = avg + *(orig->m_picOrg[0] + (blockX + x1 + orig->m_stride * (blockY + y1)));
971
+                }
972
+            }
973
+            avg = avg / (blockSize * blockSize);
974
+
975
+            // calculate variance
976
+            double variance = 0;
977
+            for (int x1 = 0; x1 < blockSize; x1++)
978
+            {
979
+                for (int y1 = 0; y1 < blockSize; y1++)
980
+                {
981
+                    int pix = *(orig->m_picOrg[0] + (blockX + x1 + orig->m_stride * (blockY + y1)));
982
+                    variance = variance + (pix - avg) * (pix - avg);
983
+                }
984
+            }
985
+
986
+            leastError = (int)(20 * ((leastError + 5.0) / (variance + 5.0)) + (leastError / (blockSize * blockSize)) / 50);
987
+
988
+            int mvIdx = (blockY / stepSize) * mvStride + (blockX / stepSize);
989
+            mvs[mvIdx] = best;
989
+            minError[mvIdx] = leastError;
991
+        }
992
+    }
993
+}
994
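Both search routines finish a block by normalising the best error against the block's activity before it is used (only the double-resolution variant stores the value, in minError):

    leastError = 20 * (leastError + 5.0) / (variance + 5.0) + (leastError / (blockSize * blockSize)) / 50

so a given residual counts for more on flat blocks than on textured ones; bilateralFilter later turns this per-block error, together with the noise estimate, into its ww/sw weight adjustments.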
+
995
+void TemporalFilter::destroyRefPicInfo(TemporalFilterRefPicInfo* curFrame)
996
+{
997
+    if (curFrame)
998
+    {
999
+        if (curFrame->compensatedPic)
1000
+        {
1001
+            curFrame->compensatedPic->destroy();
1002
+            delete curFrame->compensatedPic;
1003
+        }
1004
+
1005
+        if (curFrame->mvs)
1006
+            X265_FREE(curFrame->mvs);
1007
+        if (curFrame->mvs0)
1008
+            X265_FREE(curFrame->mvs0);
1009
+        if (curFrame->mvs1)
1010
+            X265_FREE(curFrame->mvs1);
1011
+        if (curFrame->mvs2)
1012
+            X265_FREE(curFrame->mvs2);
1013
+        if (curFrame->noise)
1014
+            X265_FREE(curFrame->noise);
1015
+        if (curFrame->error)
1016
+            X265_FREE(curFrame->error);
1017
+    }
1018
+}
1019
x265_3.6.tar.gz/source/common/temporalfilter.h Added
187
 
1
@@ -0,0 +1,185 @@
2
+/*****************************************************************************
3
+* Copyright (C) 2013-2021 MulticoreWare, Inc
4
+*
5
+ * Authors: Ashok Kumar Mishra <ashok@multicorewareinc.com>
6
+ *
7
+* This program is free software; you can redistribute it and/or modify
8
+* it under the terms of the GNU General Public License as published by
9
+* the Free Software Foundation; either version 2 of the License, or
10
+* (at your option) any later version.
11
+*
12
+* This program is distributed in the hope that it will be useful,
13
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
14
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
15
+* GNU General Public License for more details.
16
+*
17
+* You should have received a copy of the GNU General Public License
18
+* along with this program; if not, write to the Free Software
19
+* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
20
+*
21
+* This program is also available under a commercial proprietary license.
22
+* For more information, contact us at license @ x265.com.
23
+*****************************************************************************/
24
+
25
+#ifndef X265_TEMPORAL_FILTER_H
26
+#define X265_TEMPORAL_FILTER_H
27
+
28
+#include "x265.h"
29
+#include "picyuv.h"
30
+#include "mv.h"
31
+#include "piclist.h"
32
+#include "yuv.h"
33
+#include "motion.h"
34
+
35
+const int s_interpolationFilter[16][8] =
36
+{
37
+    {   0,   0,   0,  64,   0,   0,   0,   0 },   //0
38
+    {   0,   1,  -3,  64,   4,  -2,   0,   0 },   //1 -->-->
39
+    {   0,   1,  -6,  62,   9,  -3,   1,   0 },   //2 -->
40
+    {   0,   2,  -8,  60,  14,  -5,   1,   0 },   //3 -->-->
41
+    {   0,   2,  -9,  57,  19,  -7,   2,   0 },   //4
42
+    {   0,   3, -10,  53,  24,  -8,   2,   0 },   //5 -->-->
43
+    {   0,   3, -11,  50,  29,  -9,   2,   0 },   //6 -->
44
+    {   0,   3, -11,  44,  35, -10,   3,   0 },   //7 -->-->
45
+    {   0,   1,  -7,  38,  38,  -7,   1,   0 },   //8
46
+    {   0,   3, -10,  35,  44, -11,   3,   0 },   //9 -->-->
47
+    {   0,   2,  -9,  29,  50, -11,   3,   0 },   //10-->
48
+    {   0,   2,  -8,  24,  53, -10,   3,   0 },   //11-->-->
49
+    {   0,   2,  -7,  19,  57,  -9,   2,   0 },   //12
50
+    {   0,   1,  -5,  14,  60,  -8,   2,   0 },   //13-->-->
51
+    {   0,   1,  -3,   9,  62,  -6,   1,   0 },   //14-->
52
+    {   0,   0,  -2,   4,  64,  -3,   1,   0 }    //15-->-->
53
+};
54
+
55
+const double s_refStrengths[3][4] =
56
+{ // abs(POC offset)
57
+  //  1,    2     3     4
58
+  {0.85, 0.57, 0.41, 0.33},  // m_range * 2
59
+  {1.13, 0.97, 0.81, 0.57},  // m_range
60
+  {0.30, 0.30, 0.30, 0.30}   // otherwise
61
+};
62
+
63
+namespace X265_NS {
64
+    class OrigPicBuffer
65
+    {
66
+    public:
67
+        PicList    m_mcstfPicList;
68
+        PicList    m_mcstfOrigPicFreeList;
69
+        PicList    m_mcstfOrigPicList;
70
+
71
+        ~OrigPicBuffer();
72
+        void addPicture(Frame*);
73
+        void addEncPicture(Frame*);
74
+        void setOrigPicList(Frame*, int);
75
+        void recycleOrigPicList();
76
+        void addPictureToFreelist(Frame*);
77
+        void addEncPictureToPicList(Frame*);
78
+    };
79
+
80
+    struct MotionEstimatorTLD
81
+    {
82
+        MotionEstimate  me;
83
+
84
+        MotionEstimatorTLD()
85
+        {
86
+            me.init(X265_CSP_I400);
87
+            me.setQP(X265_LOOKAHEAD_QP);
88
+        }
89
+
90
+        ~MotionEstimatorTLD() {}
91
+    };
92
+
93
+    struct TemporalFilterRefPicInfo
94
+    {
95
+        PicYuv*    picBuffer;
96
+        PicYuv*    picBufferSubSampled2;
97
+        PicYuv*    picBufferSubSampled4;
98
+        MV*        mvs;
99
+        MV*        mvs0;
100
+        MV*        mvs1;
101
+        MV*        mvs2;
102
+        uint32_t   mvsStride;
103
+        uint32_t   mvsStride0;
104
+        uint32_t   mvsStride1;
105
+        uint32_t   mvsStride2;
106
+        int*       error;
107
+        int*       noise;
108
+
109
+        int16_t    origOffset;
110
+        bool       isFilteredFrame;
111
+        PicYuv*    compensatedPic;
112
+
113
+        int*       isSubsampled;
114
+
115
+        int        slicetype;
116
+    };
117
+
118
+    class TemporalFilter
119
+    {
120
+    public:
121
+        TemporalFilter();
122
+        ~TemporalFilter() {}
123
+
124
+        void init(const x265_param* param);
125
+
126
+        //private:
127
+            // Private static member variables
128
+        const x265_param *m_param;
129
+        int32_t  m_bitDepth;
130
+        int m_range;
131
+        uint8_t m_numRef;
132
+        double m_chromaFactor;
133
+        double m_sigmaMultiplier;
134
+        double m_sigmaZeroPoint;
135
+        int m_motionVectorFactor;
136
+        int m_padding;
137
+
138
+        // Private member variables
139
+
140
+        int m_sourceWidth;
141
+        int m_sourceHeight;
142
+        int m_QP;
143
+
144
+        int m_internalCsp;
145
+        int m_numComponents;
146
+        uint8_t m_sliceTypeConfig;
147
+
148
+        MotionEstimatorTLD* m_metld;
149
+        Yuv  predPUYuv;
150
+        int m_useSADinME;
151
+
152
+        int createRefPicInfo(TemporalFilterRefPicInfo* refFrame, x265_param* param);
153
+
154
+        void bilateralFilter(Frame* frame, TemporalFilterRefPicInfo* mctfRefList, double overallStrength);
155
+
156
+        void motionEstimationLuma(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int bs,
157
+            MV *previous = 0, uint32_t prevmvStride = 0, int factor = 1);
158
+
159
+        void motionEstimationLumaDoubleRes(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize,
160
+            MV *previous, uint32_t prevMvStride, int factor, int* minError);
161
+
162
+        int motionErrorLumaSSD(PicYuv *orig,
163
+            PicYuv *buffer,
164
+            int x,
165
+            int y,
166
+            int dx,
167
+            int dy,
168
+            int bs,
169
+            int besterror = 8 * 8 * 1024 * 1024);
170
+
171
+        int motionErrorLumaSAD(PicYuv *orig,
172
+            PicYuv *buffer,
173
+            int x,
174
+            int y,
175
+            int dx,
176
+            int dy,
177
+            int bs,
178
+            int besterror = 8 * 8 * 1024 * 1024);
179
+
180
+        void destroyRefPicInfo(TemporalFilterRefPicInfo* curFrame);
181
+
182
+        void applyMotion(MV *mvs, uint32_t mvsStride, PicYuv *input, PicYuv *output);
183
+
184
+    };
185
+}
186
+#endif
187
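For orientation, the encoder is expected to drive this interface per filtered frame roughly as below: a coarse-to-fine luma motion search over subsampled copies of the original pictures, a final half-block-size pass that also fills the per-block error map, then the bilateral blend. This is a simplified sketch, not the upstream call site (which lives in the encoder code outside this hunk); the subsampled planes of the current picture are passed in as hypothetical parameters, and the block sizes per level are illustrative.

    #include "temporalfilter.h"
    using namespace X265_NS;

    void filterFrameSketch(TemporalFilter &tf, Frame *cur,
                           PicYuv *curHalf, PicYuv *curQuarter,   // assumed pre-subsampled copies
                           TemporalFilterRefPicInfo *refs, int numRefs, double strength)
    {
        for (int i = 0; i < numRefs; i++)
        {
            TemporalFilterRefPicInfo *ref = &refs[i];
            // 1/4-res, 1/2-res, full-res searches, each level seeding the next
            tf.motionEstimationLuma(ref->mvs0, ref->mvsStride0, curQuarter, ref->picBufferSubSampled4, 16);
            tf.motionEstimationLuma(ref->mvs1, ref->mvsStride1, curHalf, ref->picBufferSubSampled2, 16, ref->mvs0, ref->mvsStride0, 2);
            tf.motionEstimationLuma(ref->mvs2, ref->mvsStride2, cur->m_fencPic, ref->picBuffer, 16, ref->mvs1, ref->mvsStride1, 2);
            // final pass at 8x8 granularity also records the per-block error
            tf.motionEstimationLumaDoubleRes(ref->mvs, ref->mvsStride, cur->m_fencPic, ref->picBuffer, 8, ref->mvs2, ref->mvsStride2, 2, ref->error);
        }
        tf.bilateralFilter(cur, refs, strength);
    }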
x265_3.5.tar.gz/source/common/threading.h -> x265_3.6.tar.gz/source/common/threading.h Changed
340
 
1
@@ -3,6 +3,7 @@
2
  *
3
  * Authors: Steve Borho <steve@borho.org>
4
  *          Min Chen <chenm003@163.com>
5
+            liwei <liwei@multicorewareinc.com>
6
  *
7
  * This program is free software; you can redistribute it and/or modify
8
  * it under the terms of the GNU General Public License as published by
9
@@ -253,6 +254,47 @@
10
     int                m_val;
11
 };
12
 
13
+class NamedSemaphore
14
+{
15
+public:
16
+    NamedSemaphore() : m_sem(NULL)
17
+    {
18
+    }
19
+
20
+    ~NamedSemaphore()
21
+    {
22
+    }
23
+
24
+    bool create(const char* name, const int initcnt, const int maxcnt)
25
+    {
26
+        if(!m_sem)
27
+        {
28
+            m_sem = CreateSemaphoreA(NULL, initcnt, maxcnt, name);
29
+        }
30
+        return m_sem != NULL;
31
+    }
32
+
33
+    bool give(const int32_t cnt)
34
+    {
35
+        return ReleaseSemaphore(m_sem, (LONG)cnt, NULL) != FALSE;
36
+    }
37
+
38
+    bool take(const uint32_t time_out = INFINITE)
39
+    {
40
+        int32_t rt = WaitForSingleObject(m_sem, time_out);
41
+        return rt != WAIT_TIMEOUT && rt != WAIT_FAILED;
42
+    }
43
+
44
+    void release()
45
+    {
46
+        CloseHandle(m_sem);
47
+        m_sem = NULL;
48
+    }
49
+
50
+private:
51
+    HANDLE m_sem;
52
+};
53
+
54
 #else /* POSIX / pthreads */
55
 
56
 typedef pthread_t ThreadHandle;
57
@@ -459,6 +501,282 @@
58
     int             m_val;
59
 };
60
 
61
+#define TIMEOUT_INFINITE 0xFFFFFFFF
62
+
63
+class NamedSemaphore
64
+{
65
+public:
66
+    NamedSemaphore() 
67
+        : m_sem(NULL)
68
+#ifndef __APPLE__
69
+        , m_name(NULL)
70
+#endif //__APPLE__
71
+    {
72
+    }
73
+
74
+    ~NamedSemaphore()
75
+    {
76
+    }
77
+
78
+    bool create(const char* name, const int initcnt, const int maxcnt)
79
+    {
80
+        bool ret = false;
81
+
82
+        if (initcnt >= maxcnt)
83
+        {
84
+            return false;
85
+        }
86
+
87
+#ifdef __APPLE__
88
+        do
89
+        {
90
+            int32_t pshared = name != NULL ? PTHREAD_PROCESS_SHARED : PTHREAD_PROCESS_PRIVATE;
91
+
92
+            m_sem = (mac_sem_t *)malloc(sizeof(mac_sem_t));
93
+            if (!m_sem)
94
+            {
95
+                break;
96
+            }
97
+
98
+            if (pthread_mutexattr_init(&m_sem->mutexAttr))
99
+            {
100
+                break;
101
+            }
102
+
103
+            if (pthread_mutexattr_setpshared(&m_sem->mutexAttr, pshared))
104
+            {
105
+                break;
106
+            }
107
+
108
+            if (pthread_condattr_init(&m_sem->condAttr))
109
+            {
110
+                break;
111
+            }
112
+
113
+            if (pthread_condattr_setpshared(&m_sem->condAttr, pshared))
114
+            {
115
+                break;
116
+            }
117
+
118
+            if (pthread_mutex_init(&m_sem->mutex, &m_sem->mutexAttr))
119
+            {
120
+                break;
121
+            }
122
+
123
+            if (pthread_cond_init(&m_sem->cond, &m_sem->condAttr))
124
+            {
125
+                break;
126
+            }
127
+
128
+            m_sem->curCnt = initcnt;
129
+            m_sem->maxCnt = maxcnt;
130
+
131
+            ret = true;
132
+        } while (0);
133
+        
134
+        if (!ret)
135
+        {
136
+            release();
137
+        }
138
+
139
+#else  //__APPLE__
140
+        m_sem = sem_open(name, O_CREAT | O_EXCL, 0666, initcnt);
141
+        if (m_sem != SEM_FAILED) 
142
+        {
143
+            m_name = strdup(name);
144
+            ret = true;
145
+        }
146
+        else 
147
+        {
148
+            if (EEXIST == errno) 
149
+            {
150
+                m_sem = sem_open(name, 0);
151
+                if (m_sem != SEM_FAILED) 
152
+                {
153
+                    m_name = strdup(name);
154
+                    ret = true;
155
+                }
156
+            }
157
+        }
158
+#endif //__APPLE__
159
+
160
+        return ret;
161
+    }
162
+
163
+    bool give(const int32_t cnt)
164
+    {
165
+        if (!m_sem)
166
+        {
167
+            return false;
168
+        }
169
+
170
+#ifdef __APPLE__
171
+        if (pthread_mutex_lock(&m_sem->mutex))
172
+        {
173
+            return false;
174
+        }
175
+
176
+        int oldCnt = m_sem->curCnt;
177
+        m_sem->curCnt += cnt;
178
+        if (m_sem->curCnt > m_sem->maxCnt)
179
+        {
180
+            m_sem->curCnt = m_sem->maxCnt;
181
+        }
182
+
183
+        bool ret = true;
184
+        if (!oldCnt)
185
+        {
186
+            ret = 0 == pthread_cond_broadcast(&m_sem->cond);
187
+        }
188
+
189
+        if (pthread_mutex_unlock(&m_sem->mutex))
190
+        {
191
+            return false;
192
+        }
193
+
194
+        return ret;
195
+#else //__APPLE__
196
+        int ret = 0;
197
+        int32_t curCnt = cnt;
198
+        while (curCnt-- && !ret) {
199
+            ret = sem_post(m_sem);
200
+        }
201
+
202
+        return 0 == ret;
203
+#endif //_APPLE__
204
+    }
205
+
206
+    bool take(const uint32_t time_out = TIMEOUT_INFINITE)
207
+    {
208
+        if (!m_sem)
209
+        {
210
+            return false;
211
+        }
212
+
213
+#ifdef __APPLE__
214
+
215
+        if (pthread_mutex_lock(&m_sem->mutex))
216
+        {
217
+            return false;
218
+        }
219
+
220
+        bool ret = true;
221
+        if (TIMEOUT_INFINITE == time_out) 
222
+        {
223
+            if (!m_sem->curCnt)
224
+            {
225
+                if (pthread_cond_wait(&m_sem->cond, &m_sem->mutex))
226
+                {
227
+                    ret = false;
228
+                } 
229
+            }
230
+
231
+            if (m_sem->curCnt && ret)
232
+            {
233
+                m_sem->curCnt--;
234
+            }
235
+        }
236
+        else
237
+        {
238
+            if (0 == time_out)
239
+            {
240
+                if (m_sem->curCnt)
241
+                {
242
+                    m_sem->curCnt--;
243
+                }
244
+                else
245
+                {
246
+                    ret = false;
247
+                }
248
+            }
249
+            else
250
+            {
251
+                if (!m_sem->curCnt)
252
+                {
253
+                    struct timespec ts;
254
+                    ts.tv_sec = time_out / 1000L;
255
+                    ts.tv_nsec = (time_out * 1000000L) - ts.tv_sec * 1000 * 1000 * 1000;
256
+
257
+                    if (pthread_cond_timedwait(&m_sem->cond, &m_sem->mutex, &ts))
258
+                    {
259
+                        ret = false;
260
+                    }
261
+                }
262
+
263
+                if (m_sem->curCnt && ret)
264
+                {
265
+                    m_sem->curCnt--;
266
+                }
267
+            }
268
+        }
269
+
270
+        if (pthread_mutex_unlock(&m_sem->mutex))
271
+        {
272
+            return false;
273
+        }
274
+
275
+        return ret;
276
+#else //__APPLE__
277
+        if (TIMEOUT_INFINITE == time_out) 
278
+        {
279
+            return 0 == sem_wait(m_sem);
280
+        }
281
+        else 
282
+        {
283
+            if (0 == time_out)
284
+            {
285
+                return 0 == sem_trywait(m_sem);
286
+            }
287
+            else
288
+            {
289
+                struct timespec ts;
290
+                ts.tv_sec = time_out / 1000L;
291
+                ts.tv_nsec = (time_out * 1000000L) - ts.tv_sec * 1000 * 1000 * 1000;
292
+                return 0 == sem_timedwait(m_sem, &ts);
293
+            }
294
+        }
295
+#endif //_APPLE__
296
+    }
297
+
298
+    void release()
299
+    {
300
+        if (m_sem)
301
+        {
302
+#ifdef __APPLE__
303
+            pthread_condattr_destroy(&m_sem->condAttr);
304
+            pthread_mutexattr_destroy(&m_sem->mutexAttr);
305
+            pthread_mutex_destroy(&m_sem->mutex);
306
+            pthread_cond_destroy(&m_sem->cond);
307
+            free(m_sem);
308
+            m_sem = NULL;
309
+#else //__APPLE__
310
+            sem_close(m_sem);
311
+            sem_unlink(m_name);
312
+            m_sem = NULL;
313
+            free(m_name);
314
+            m_name = NULL;
315
+#endif //__APPLE__
316
+        }
317
+    }
318
+
319
+private:
320
+#ifdef __APPLE__
321
+    typedef struct
322
+    {
323
+        pthread_mutex_t     mutex;
324
+        pthread_cond_t      cond;
325
+        pthread_mutexattr_t mutexAttr;
326
+        pthread_condattr_t  condAttr;
327
+        uint32_t            curCnt;
328
+        uint32_t            maxCnt;
329
+    }mac_sem_t;
330
+    mac_sem_t *m_sem;
331
+#else // __APPLE__
332
+    sem_t *m_sem;
333
+    char  *m_name;
334
+#endif // __APPLE_
335
+};
336
+
337
 #endif // ifdef _WIN32
338
 
339
 class ScopedLock
340
x265_3.5.tar.gz/source/common/threadpool.cpp -> x265_3.6.tar.gz/source/common/threadpool.cpp Changed
10
 
1
@@ -301,7 +301,7 @@
2
     /* limit threads based on param->numaPools
3
      * For windows because threads can't be allocated to live across sockets
4
      * changing the default behavior to be per-socket pools -- FIXME */
5
-#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
6
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 || HAVE_LIBNUMA
7
     if (!p->numaPools || (strcmp(p->numaPools, "NULL") == 0 || strcmp(p->numaPools, "*") == 0 || strcmp(p->numaPools, "") == 0))
8
     {
9
          char poolString[50] = "";
10
x265_3.5.tar.gz/source/common/version.cpp -> x265_3.6.tar.gz/source/common/version.cpp Changed
10
 
1
@@ -71,7 +71,7 @@
2
 #define ONOS    "Unk-OS"
3
 #endif
4
 
5
-#if X86_64
6
+#if defined(_LP64) || defined(_WIN64)
7
 #define BITS    "64 bit"
8
 #else
9
 #define BITS    "32 bit"
10
x265_3.5.tar.gz/source/common/x86/asm-primitives.cpp -> x265_3.6.tar.gz/source/common/x86/asm-primitives.cpp Changed
85
 
1
@@ -1091,6 +1091,7 @@
2
 
3
         p.frameInitLowres = PFX(frame_init_lowres_core_sse2);
4
         p.frameInitLowerRes = PFX(frame_init_lowres_core_sse2);
5
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_sse2);
6
         // TODO: the planecopy_sp is really planecopy_SC now, must be fix it 
7
         //p.planecopy_sp = PFX(downShift_16_sse2);
8
         p.planecopy_sp_shl = PFX(upShift_16_sse2);
9
@@ -1121,6 +1122,7 @@
10
     {
11
         ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3);
12
         p.scale2D_64to32 = PFX(scale2D_64to32_ssse3);
13
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_ssse3);
14
 
15
        // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_ssse3); this one is broken
16
         ALL_LUMA_PU(satd, pixel_satd, ssse3);
17
@@ -1462,6 +1464,7 @@
18
        p.pu[LUMA_64x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx);
19
        p.pu[LUMA_64x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx);
20
         p.propagateCost = PFX(mbtree_propagate_cost_avx);
21
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_avx);
22
     }
23
     if (cpuMask & X265_CPU_XOP)
24
     {
25
@@ -1473,6 +1476,7 @@
26
         LUMA_VAR(xop);
27
         p.frameInitLowres = PFX(frame_init_lowres_core_xop);
28
         p.frameInitLowerRes = PFX(frame_init_lowres_core_xop);
29
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_xop);
30
     }
31
     if (cpuMask & X265_CPU_AVX2)
32
     {
33
@@ -2301,6 +2305,9 @@
34
 
35
         p.frameInitLowres = PFX(frame_init_lowres_core_avx2);
36
         p.frameInitLowerRes = PFX(frame_init_lowres_core_avx2);
37
+
38
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_avx2);
39
+
40
         p.propagateCost = PFX(mbtree_propagate_cost_avx2);
41
         p.fix8Unpack = PFX(cutree_fix8_unpack_avx2);
42
         p.fix8Pack = PFX(cutree_fix8_pack_avx2);
43
@@ -3300,6 +3307,7 @@
44
         //p.frameInitLowres = PFX(frame_init_lowres_core_mmx2);
45
         p.frameInitLowres = PFX(frame_init_lowres_core_sse2);
46
         p.frameInitLowerRes = PFX(frame_init_lowres_core_sse2);
47
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_sse2);
48
 
49
        ALL_LUMA_TU(blockfill_s[NONALIGNED], blockfill_s, sse2);
50
        ALL_LUMA_TU(blockfill_s[ALIGNED], blockfill_s, sse2);
51
@@ -3424,6 +3432,8 @@
52
         ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3);
53
         p.scale2D_64to32 = PFX(scale2D_64to32_ssse3);
54
 
55
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_ssse3);
56
+
57
        ASSIGN2(p.pu[LUMA_8x4].convert_p2s, filterPixelToShort_8x4_ssse3);
58
        ASSIGN2(p.pu[LUMA_8x8].convert_p2s, filterPixelToShort_8x8_ssse3);
59
        ASSIGN2(p.pu[LUMA_8x16].convert_p2s, filterPixelToShort_8x16_ssse3);
60
@@ -3691,6 +3701,7 @@
61
         p.frameInitLowres = PFX(frame_init_lowres_core_avx);
62
         p.frameInitLowerRes = PFX(frame_init_lowres_core_avx);
63
         p.propagateCost = PFX(mbtree_propagate_cost_avx);
64
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_avx);
65
     }
66
     if (cpuMask & X265_CPU_XOP)
67
     {
68
@@ -3702,6 +3713,7 @@
69
        p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_xop);
70
         p.frameInitLowres = PFX(frame_init_lowres_core_xop);
71
         p.frameInitLowerRes = PFX(frame_init_lowres_core_xop);
72
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_xop);
73
 
74
     }
75
 #if X86_64
76
@@ -4684,6 +4696,8 @@
77
         p.saoCuStatsE2 = PFX(saoCuStatsE2_avx2);
78
         p.saoCuStatsE3 = PFX(saoCuStatsE3_avx2);
79
 
80
+        p.frameSubSampleLuma = PFX(frame_subsample_luma_avx2);
81
+
82
         if (cpuMask & X265_CPU_BMI2)
83
         {
84
             p.scanPosLast = PFX(scanPosLast_avx2_bmi2);
85
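Each hunk above registers the new frameSubSampleLuma entry point at a progressively higher SIMD level, so the most capable branch matching cpuMask ends up owning the slot. A reduced sketch of that dispatch pattern (the struct and setup function are illustrative, not x265's real EncoderPrimitives):

#include <cstdint>
#include <cstdio>

typedef void (*subsample_t)(const uint8_t*, uint8_t*, intptr_t, intptr_t, int, int);

static void subsample_c(const uint8_t*, uint8_t*, intptr_t, intptr_t, int, int)    { std::puts("C"); }
static void subsample_sse2(const uint8_t*, uint8_t*, intptr_t, intptr_t, int, int) { std::puts("SSE2"); }
static void subsample_avx2(const uint8_t*, uint8_t*, intptr_t, intptr_t, int, int) { std::puts("AVX2"); }

enum { CPU_SSE2 = 1 << 0, CPU_AVX2 = 1 << 1 };

struct Primitives { subsample_t frameSubSampleLuma; };

void setupPrimitives(Primitives& p, uint32_t cpuMask)
{
    p.frameSubSampleLuma = subsample_c;          // portable fallback
    if (cpuMask & CPU_SSE2)
        p.frameSubSampleLuma = subsample_sse2;   // overwritten when supported
    if (cpuMask & CPU_AVX2)
        p.frameSubSampleLuma = subsample_avx2;   // highest matching level wins
}

int main()
{
    Primitives p;
    setupPrimitives(p, CPU_SSE2 | CPU_AVX2);
    p.frameSubSampleLuma(nullptr, nullptr, 0, 0, 0, 0);   // prints "AVX2"
    return 0;
}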
x265_3.5.tar.gz/source/common/x86/const-a.asm -> x265_3.6.tar.gz/source/common/x86/const-a.asm Changed
10
 
1
@@ -100,7 +100,7 @@
2
 const pw_2000,              times 16 dw 0x2000
3
 const pw_8000,              times  8 dw 0x8000
4
 const pw_3fff,              times 16 dw 0x3fff
5
-const pw_32_0,              times  4 dw 32,
6
+const pw_32_0,              times  4 dw 32
7
                             times  4 dw 0
8
 const pw_pixel_max,         times 16 dw ((1 << BIT_DEPTH)-1)
9
 
10
x265_3.5.tar.gz/source/common/x86/h-ipfilter8.asm -> x265_3.6.tar.gz/source/common/x86/h-ipfilter8.asm Changed
20
 
1
@@ -125,6 +125,9 @@
2
 ALIGN 32
3
 interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
4
 
5
+ALIGN 32
6
+const interp_4tap_8x8_horiz_shuf,   dd 0, 4, 1, 5, 2, 6, 3, 7
7
+
8
 SECTION .text
9
 
10
 cextern pw_1
11
@@ -1459,8 +1462,6 @@
12
 
13
     RET
14
 
15
-ALIGN 32
16
-const interp_4tap_8x8_horiz_shuf,   dd 0, 4, 1, 5, 2, 6, 3, 7
17
 
18
 %macro FILTER_H4_w6 3
19
     movu        %1, srcq - 1
20
x265_3.5.tar.gz/source/common/x86/mc-a2.asm -> x265_3.6.tar.gz/source/common/x86/mc-a2.asm Changed
264
 
1
@@ -992,6 +992,262 @@
2
 FRAME_INIT_LOWRES
3
 %endif
4
 
5
+%macro SUBSAMPLEFILT8x4 7
6
+    mova      %3, r0+%7
7
+    mova      %4, r0+r2+%7
8
+    pavgb     %3, %4
9
+    pavgb     %4, r0+r2*2+%7
10
+    PALIGNR   %1, %3, 1, m6
11
+    PALIGNR   %2, %4, 1, m6
12
+%if cpuflag(xop)
13
+    pavgb     %1, %3
14
+    pavgb     %2, %4
15
+%else
16
+    pavgb     %1, %3
17
+    pavgb     %2, %4
18
+    psrlw     %5, %1, 8
19
+    psrlw     %6, %2, 8
20
+    pand      %1, m7
21
+    pand      %2, m7
22
+%endif
23
+%endmacro
24
+
25
+%macro SUBSAMPLEFILT32x4U 1
26
+    movu      m1, r0+r2
27
+    pavgb     m0, m1, r0
28
+    movu      m3, r0+r2+1
29
+    pavgb     m2, m3, r0+1
30
+    pavgb     m1, r0+r2*2
31
+    pavgb     m3, r0+r2*2+1
32
+    pavgb     m0, m2
33
+    pavgb     m1, m3
34
+
35
+    movu      m3, r0+r2+mmsize
36
+    pavgb     m2, m3, r0+mmsize
37
+    movu      m5, r0+r2+1+mmsize
38
+    pavgb     m4, m5, r0+1+mmsize
39
+    pavgb     m2, m4
40
+
41
+    pshufb    m0, m7
42
+    pshufb    m2, m7
43
+    punpcklqdq m0, m0, m2
44
+    vpermq    m0, m0, q3120
45
+    movu    %1, m0
46
+%endmacro
47
+
48
+%macro SUBSAMPLEFILT16x2 3
49
+    mova      m3, r0+%3+mmsize
50
+    mova      m2, r0+%3
51
+    pavgb     m3, r0+%3+r2+mmsize
52
+    pavgb     m2, r0+%3+r2
53
+    PALIGNR   %1, m3, 1, m6
54
+    pavgb     %1, m3
55
+    PALIGNR   m3, m2, 1, m6
56
+    pavgb     m3, m2
57
+%if cpuflag(xop)
58
+    vpperm    m3, m3, %1, m6
59
+%else
60
+    pand      m3, m7
61
+    pand      %1, m7
62
+    packuswb  m3, %1
63
+%endif
64
+    mova    %2, m3
65
+    mova      %1, m2
66
+%endmacro
67
+
68
+%macro SUBSAMPLEFILT8x2U 2
69
+    mova      m2, r0+%2
70
+    pavgb     m2, r0+%2+r2
71
+    mova      m0, r0+%2+1
72
+    pavgb     m0, r0+%2+r2+1
73
+    pavgb     m1, m3
74
+    pavgb     m0, m2
75
+    pand      m1, m7
76
+    pand      m0, m7
77
+    packuswb  m0, m1
78
+    mova    %1, m0
79
+%endmacro
80
+
81
+%macro SUBSAMPLEFILT8xU 2
82
+    mova      m3, r0+%2+8
83
+    mova      m2, r0+%2
84
+    pavgw     m3, r0+%2+r2+8
85
+    pavgw     m2, r0+%2+r2
86
+    movu      m1, r0+%2+10
87
+    movu      m0, r0+%2+2
88
+    pavgw     m1, r0+%2+r2+10
89
+    pavgw     m0, r0+%2+r2+2
90
+    pavgw     m1, m3
91
+    pavgw     m0, m2
92
+    psrld     m3, m1, 16
93
+    pand      m1, m7
94
+    pand      m0, m7
95
+    packssdw  m0, m1
96
+    movu    %1, m0
97
+%endmacro
98
+
99
+%macro SUBSAMPLEFILT8xA 3
100
+    movu      m3, r0+%3+mmsize
101
+    movu      m2, r0+%3
102
+    pavgw     m3, r0+%3+r2+mmsize
103
+    pavgw     m2, r0+%3+r2
104
+    PALIGNR   %1, m3, 2, m6
105
+    pavgw     %1, m3
106
+    PALIGNR   m3, m2, 2, m6
107
+    pavgw     m3, m2
108
+%if cpuflag(xop)
109
+    vpperm    m3, m3, %1, m6
110
+%else
111
+    pand      m3, m7
112
+    pand      %1, m7
113
+    packssdw  m3, %1
114
+%endif
115
+%if cpuflag(avx2)
116
+    vpermq     m3, m3, q3120
117
+%endif
118
+    movu    %2, m3
119
+    movu      %1, m2
120
+%endmacro
121
+
122
+;-----------------------------------------------------------------------------
123
+; void frame_subsample_luma( uint8_t *src0, uint8_t *dst0,
124
+;                              intptr_t src_stride, intptr_t dst_stride, int width, int height )
125
+;-----------------------------------------------------------------------------
126
+
127
+%macro FRAME_SUBSAMPLE_LUMA 0
128
+cglobal frame_subsample_luma, 6,7,(12-4*(BIT_DEPTH/9)) ; 8 for HIGH_BIT_DEPTH, 12 otherwise
129
+%if HIGH_BIT_DEPTH
130
+    shl   dword r3m, 1
131
+    FIX_STRIDES r2
132
+    shl   dword r4m, 1
133
+%endif
134
+%if mmsize >= 16
135
+    add   dword r4m, mmsize-1
136
+    and   dword r4m, ~(mmsize-1)
137
+%endif
138
+    ; src += 2*(height-1)*stride + 2*width
139
+    mov      r6d, r5m
140
+    dec      r6d
141
+    imul     r6d, r2d
142
+    add      r6d, r4m
143
+    lea       r0, r0+r6*2
144
+    ; dst += (height-1)*stride + width
145
+    mov      r6d, r5m
146
+    dec      r6d
147
+    imul     r6d, r3m
148
+    add      r6d, r4m
149
+    add       r1, r6
150
+    ; gap = stride - width
151
+    mov      r6d, r3m
152
+    sub      r6d, r4m
153
+    PUSH      r6
154
+    %define dst_gap rsp+gprsize
155
+    mov      r6d, r2d
156
+    sub      r6d, r4m
157
+    shl      r6d, 1
158
+    PUSH      r6
159
+    %define src_gap rsp
160
+%if HIGH_BIT_DEPTH
161
+%if cpuflag(xop)
162
+    mova      m6, deinterleave_shuf32a
163
+    mova      m7, deinterleave_shuf32b
164
+%else
165
+    pcmpeqw   m7, m7
166
+    psrld     m7, 16
167
+%endif
168
+.vloop:
169
+    mov      r6d, r4m
170
+%ifnidn cpuname, mmx2
171
+    movu      m0, r0
172
+    movu      m1, r0+r2
173
+    pavgw     m0, m1
174
+    pavgw     m1, r0+r2*2
175
+%endif
176
+.hloop:
177
+    sub       r0, mmsize*2
178
+    sub       r1, mmsize
179
+%ifidn cpuname, mmx2
180
+    SUBSAMPLEFILT8xU r1, 0
181
+%else
182
+    SUBSAMPLEFILT8xA m0, r1, 0
183
+%endif
184
+    sub      r6d, mmsize
185
+    jg .hloop
186
+%else ; !HIGH_BIT_DEPTH
187
+%if cpuflag(avx2)
188
+    mova      m7, deinterleave_shuf
189
+%elif cpuflag(xop)
190
+    mova      m6, deinterleave_shuf32a
191
+    mova      m7, deinterleave_shuf32b
192
+%else
193
+    pcmpeqb   m7, m7
194
+    psrlw     m7, 8
195
+%endif
196
+.vloop:
197
+    mov      r6d, r4m
198
+%ifnidn cpuname, mmx2
199
+%if mmsize <= 16
200
+    mova      m0, r0
201
+    mova      m1, r0+r2
202
+    pavgb     m0, m1
203
+    pavgb     m1, r0+r2*2
204
+%endif
205
+%endif
206
+.hloop:
207
+    sub       r0, mmsize*2
208
+    sub       r1, mmsize
209
+%if mmsize==32
210
+    SUBSAMPLEFILT32x4U r1
211
+%elifdef m8
212
+    SUBSAMPLEFILT8x4   m0, m1, m2, m3, m10, m11, mmsize
213
+    mova      m8, m0
214
+    mova      m9, m1
215
+    SUBSAMPLEFILT8x4   m2, m3, m0, m1, m4, m5, 0
216
+%if cpuflag(xop)
217
+    vpperm    m4, m2, m8, m7
218
+    vpperm    m2, m2, m8, m6
219
+%else
220
+    packuswb  m2, m8
221
+%endif
222
+    mova    r1, m2
223
+%elifidn cpuname, mmx2
224
+    SUBSAMPLEFILT8x2U  r1, 0
225
+%else
226
+    SUBSAMPLEFILT16x2  m0, r1, 0
227
+%endif
228
+    sub      r6d, mmsize
229
+    jg .hloop
230
+%endif ; HIGH_BIT_DEPTH
231
+.skip:
232
+    mov       r3, dst_gap
233
+    sub       r0, src_gap
234
+    sub       r1, r3
235
+    dec    dword r5m
236
+    jg .vloop
237
+    ADD      rsp, 2*gprsize
238
+    emms
239
+    RET
240
+%endmacro ; FRAME_SUBSAMPLE_LUMA
241
+
242
+INIT_MMX mmx2
243
+FRAME_SUBSAMPLE_LUMA
244
+%if ARCH_X86_64 == 0
245
+INIT_MMX cache32, mmx2
246
+FRAME_SUBSAMPLE_LUMA
247
+%endif
248
+INIT_XMM sse2
249
+FRAME_SUBSAMPLE_LUMA
250
+INIT_XMM ssse3
251
+FRAME_SUBSAMPLE_LUMA
252
+INIT_XMM avx
253
+FRAME_SUBSAMPLE_LUMA
254
+INIT_XMM xop
255
+FRAME_SUBSAMPLE_LUMA
256
+%if ARCH_X86_64 == 1
257
+INIT_YMM avx2
258
+FRAME_SUBSAMPLE_LUMA
259
+%endif
260
+
261
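FRAME_SUBSAMPLE_LUMA generates the 2:1 luma downscalers used by the new temporal filter, one instantiation per instruction set. A plain C++ reference of the arithmetic being vectorised (a sketch of the intent; the pavgb-based asm may round slightly differently than this exact 2x2 mean):

#include <cstddef>
#include <cstdint>

// Each output pixel is the rounded average of a 2x2 block of the source.
void frameSubsampleLumaRef(const uint8_t* src, uint8_t* dst,
                           ptrdiff_t srcStride, ptrdiff_t dstStride,
                           int dstWidth, int dstHeight)
{
    for (int y = 0; y < dstHeight; y++)
    {
        const uint8_t* row0 = src + (2 * y) * srcStride;
        const uint8_t* row1 = row0 + srcStride;
        for (int x = 0; x < dstWidth; x++)
        {
            int sum = row0[2 * x] + row0[2 * x + 1] + row1[2 * x] + row1[2 * x + 1];
            dst[y * dstStride + x] = (uint8_t)((sum + 2) >> 2);   // round to nearest
        }
    }
}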
 ;-----------------------------------------------------------------------------
262
 ; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs,
263
 ;                             uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len )
264
x265_3.5.tar.gz/source/common/x86/mc.h -> x265_3.6.tar.gz/source/common/x86/mc.h Changed
19
 
1
@@ -36,6 +36,17 @@
2
 
3
 #undef LOWRES
4
 
5
+#define SUBSAMPLELUMA(cpu) \
6
+    void PFX(frame_subsample_luma_ ## cpu)(const pixel* src0, pixel* dst0, intptr_t src_stride, intptr_t dst_stride, int width, int height);
7
+SUBSAMPLELUMA(mmx2)
8
+SUBSAMPLELUMA(sse2)
9
+SUBSAMPLELUMA(ssse3)
10
+SUBSAMPLELUMA(avx)
11
+SUBSAMPLELUMA(avx2)
12
+SUBSAMPLELUMA(xop)
13
+
14
+#undef SUBSAMPLELUMA
15
+
16
 #define PROPAGATE_COST(cpu) \
17
     void PFX(mbtree_propagate_cost_ ## cpu)(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, \
18
                                               const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
19
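SUBSAMPLELUMA only stamps out one extern prototype per SIMD flavour so that asm-primitives.cpp can take the symbols' addresses. Roughly, SUBSAMPLELUMA(sse2) expands to the declaration below (the exact prefix added by PFX depends on EXPORT_C_API and the configured bit depth, so treat it as an assumption):

#include <cstdint>

typedef uint8_t pixel;   // 8-bit build assumed for this sketch

void x265_frame_subsample_luma_sse2(const pixel* src0, pixel* dst0,
                                    intptr_t src_stride, intptr_t dst_stride,
                                    int width, int height);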
x265_3.5.tar.gz/source/common/x86/x86inc.asm -> x265_3.6.tar.gz/source/common/x86/x86inc.asm Changed
96
 
1
@@ -401,16 +401,6 @@
2
     %endif
3
 %endmacro
4
 
5
-%macro DEFINE_ARGS_INTERNAL 3+
6
-    %ifnum %2
7
-        DEFINE_ARGS %3
8
-    %elif %1 == 4
9
-        DEFINE_ARGS %2
10
-    %elif %1 > 4
11
-        DEFINE_ARGS %2, %3
12
-    %endif
13
-%endmacro
14
-
15
 %if WIN64 ; Windows x64 ;=================================================
16
 
17
 DECLARE_REG 0,  rcx
18
@@ -429,7 +419,7 @@
19
 DECLARE_REG 13, R12, 112
20
 DECLARE_REG 14, R13, 120
21
 
22
-%macro PROLOGUE 2-5+ 0 ; #args, #regs, #xmm_regs, stack_size, arg_names...
23
+%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names...
24
     %assign num_args %1
25
     %assign regs_used %2
26
     ASSERT regs_used >= num_args
27
@@ -441,7 +431,15 @@
28
         WIN64_SPILL_XMM %3
29
     %endif
30
     LOAD_IF_USED 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
31
-    DEFINE_ARGS_INTERNAL %0, %4, %5
32
+    %if %0 > 4
33
+         %ifnum %4
34
+             DEFINE_ARGS %5
35
+         %else
36
+             DEFINE_ARGS %4, %5
37
+         %endif
38
+     %elifnnum %4
39
+         DEFINE_ARGS %4
40
+     %endif
41
 %endmacro
42
 
43
 %macro WIN64_PUSH_XMM 0
44
@@ -537,7 +535,7 @@
45
 DECLARE_REG 13, R12, 64
46
 DECLARE_REG 14, R13, 72
47
 
48
-%macro PROLOGUE 2-5+ 0; #args, #regs, #xmm_regs, stack_size, arg_names...
49
+%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names...
50
     %assign num_args %1
51
     %assign regs_used %2
52
     %assign xmm_regs_used %3
53
@@ -547,7 +545,15 @@
54
     PUSH_IF_USED 9, 10, 11, 12, 13, 14
55
     ALLOC_STACK %4
56
     LOAD_IF_USED 6, 7, 8, 9, 10, 11, 12, 13, 14
57
-    DEFINE_ARGS_INTERNAL %0, %4, %5
58
+    %if %0 > 4
59
+         %ifnum %4
60
+             DEFINE_ARGS %5
61
+         %else
62
+             DEFINE_ARGS %4, %5
63
+         %endif
64
+     %elifnnum %4
65
+         DEFINE_ARGS %4
66
+     %endif
67
 %endmacro
68
 
69
 %define has_epilogue regs_used > 9 || stack_size > 0 || vzeroupper_required
70
@@ -588,7 +594,7 @@
71
 
72
 DECLARE_ARG 7, 8, 9, 10, 11, 12, 13, 14
73
 
74
-%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, stack_size, arg_names...
75
+%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names...
76
     %assign num_args %1
77
     %assign regs_used %2
78
     ASSERT regs_used >= num_args
79
@@ -603,7 +609,15 @@
80
     PUSH_IF_USED 3, 4, 5, 6
81
     ALLOC_STACK %4
82
     LOAD_IF_USED 0, 1, 2, 3, 4, 5, 6
83
-    DEFINE_ARGS_INTERNAL %0, %4, %5
84
+    %if %0 > 4
85
+         %ifnum %4
86
+             DEFINE_ARGS %5
87
+         %else
88
+             DEFINE_ARGS %4, %5
89
+         %endif
90
+     %elifnnum %4
91
+         DEFINE_ARGS %4
92
+     %endif
93
 %endmacro
94
 
95
 %define has_epilogue regs_used > 3 || stack_size > 0 || vzeroupper_required
96
x265_3.5.tar.gz/source/common/x86/x86util.asm -> x265_3.6.tar.gz/source/common/x86/x86util.asm Changed
13
 
1
@@ -578,8 +578,10 @@
2
     %elif %1==2
3
         %if mmsize==8
4
             SBUTTERFLY dq, %3, %4, %5
5
-        %else
6
+        %elif %0==6
7
             TRANS q, ORDER, %3, %4, %5, %6
8
+        %else
9
+            TRANS q, ORDER, %3, %4, %5
10
         %endif
11
     %elif %1==4
12
         SBUTTERFLY qdq, %3, %4, %5
13
x265_3.5.tar.gz/source/encoder/analysis.cpp -> x265_3.6.tar.gz/source/encoder/analysis.cpp Changed
10
 
1
@@ -3645,7 +3645,7 @@
2
             qp += distortionData->offsetctu.m_cuAddr;
3
     }
4
 
5
-    if (m_param->analysisLoadReuseLevel == 10 && m_param->rc.cuTree)
6
+    if (m_param->analysisLoadReuseLevel >= 2 && m_param->rc.cuTree)
7
     {
8
         int cuIdx = (ctu.m_cuAddr * ctu.m_numPartitions) + cuGeom.absPartIdx;
9
         if (ctu.m_slice->m_sliceType == I_SLICE)
10
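Relaxing the guard from == 10 to >= 2 lets the cuTree offsets stored with the analysis data be reapplied at every load-reuse level that carries them, not only the highest one. A hedged sketch of the per-CU lookup implied by the indexing above (field names are illustrative, not the exact x265 structures):

#include <cstdint>

struct CuTreeAnalysis
{
    const double* intraQpOffsets;   // one entry per partition, used for I slices
    const double* interQpOffsets;   // one entry per partition, used for P/B slices
};

double applyCuTreeOffset(double baseQp, const CuTreeAnalysis& a,
                         uint32_t ctuAddr, uint32_t numPartitions,
                         uint32_t absPartIdx, bool isISlice)
{
    uint32_t cuIdx = ctuAddr * numPartitions + absPartIdx;   // same indexing as above
    return baseQp + (isISlice ? a.intraQpOffsets[cuIdx] : a.interQpOffsets[cuIdx]);
}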
x265_3.5.tar.gz/source/encoder/api.cpp -> x265_3.6.tar.gz/source/encoder/api.cpp Changed
50
 
1
@@ -208,7 +208,6 @@
2
     memcpy(zoneParam, param, sizeof(x265_param));
3
     for (int i = 0; i < param->rc.zonefileCount; i++)
4
     {
5
-        param->rc.zones[i].startFrame = -1;
6
        encoder->configureZone(zoneParam, param->rc.zones[i].zoneParam);
7
     }
8
 
9
@@ -608,6 +607,14 @@
10
     if (numEncoded < 0)
11
         encoder->m_aborted = true;
12
 
13
+    if ((!encoder->m_numDelayedPic && !numEncoded) && (encoder->m_param->bEnableEndOfSequence || encoder->m_param->bEnableEndOfBitstream))
14
+    {
15
+        Bitstream bs;
16
+        encoder->getEndNalUnits(encoder->m_nalList, bs);
17
+        *pp_nal = &encoder->m_nalList.m_nal[0];
18
+        if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal;
19
+    }
20
+
21
     return numEncoded;
22
 }
23
 
24
@@ -1042,6 +1049,7 @@
25
     &PARAM_NS::x265_param_free,
26
     &PARAM_NS::x265_param_default,
27
     &PARAM_NS::x265_param_parse,
28
+    &PARAM_NS::x265_scenecut_aware_qp_param_parse,
29
     &PARAM_NS::x265_param_apply_profile,
30
     &PARAM_NS::x265_param_default_preset,
31
     &x265_picture_alloc,
32
@@ -1288,6 +1296,8 @@
33
             if (param->csvLogLevel)
34
             {
35
                 fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, Scenecut, ");
36
+                if (!!param->bEnableTemporalSubLayers)
37
+                    fprintf(csvfp, "Temporal Sub Layer ID, ");
38
                 if (param->csvLogLevel >= 2)
39
                     fprintf(csvfp, "I/P cost ratio, ");
40
                 if (param->rc.rateControlMode == X265_RC_CRF)
41
@@ -1401,6 +1411,8 @@
42
     const x265_frame_stats* frameStats = &pic->frameData;
43
     fprintf(param->csvfpt, "%d, %c-SLICE, %4d, %2.2lf, %10d, %d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc,
44
                                                                    frameStats->qp, (int)frameStats->bits, frameStats->bScenecut);
45
+    if (!!param->bEnableTemporalSubLayers)
46
+        fprintf(param->csvfpt, "%d,", frameStats->tLayer);
47
     if (param->csvLogLevel >= 2)
48
         fprintf(param->csvfpt, "%.2f,", frameStats->ipCostRatio);
49
     if (param->rc.rateControlMode == X265_RC_CRF)
50
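The new block in x265_encoder_encode lets a flush call that produced no frames still hand back end-of-sequence / end-of-bitstream NAL units. A sketch of the matching caller-side drain loop (standard public API calls; bEnableEndOfSequence and bEnableEndOfBitstream are the option fields referenced above):

#include <cstdint>
#include <cstdio>
#include <x265.h>

void drainAndClose(x265_encoder* enc, FILE* out)
{
    x265_nal* nal = nullptr;
    uint32_t  numNal = 0;

    // Passing NULL as pic_in flushes the delayed frames one call at a time.
    while (x265_encoder_encode(enc, &nal, &numNal, nullptr, nullptr) > 0)
        for (uint32_t i = 0; i < numNal; i++)
            fwrite(nal[i].payload, 1, nal[i].sizeBytes, out);

    // The terminating call returned 0 encoded frames; with EOS/EOB enabled it
    // may still have filled nal/numNal with the trailing units.
    for (uint32_t i = 0; i < numNal; i++)
        fwrite(nal[i].payload, 1, nal[i].sizeBytes, out);
}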
x265_3.5.tar.gz/source/encoder/dpb.cpp -> x265_3.6.tar.gz/source/encoder/dpb.cpp Changed
258
 
1
@@ -70,10 +70,18 @@
2
     {
3
         Frame *curFrame = iterFrame;
4
         iterFrame = iterFrame->m_next;
5
-        if (!curFrame->m_encData->m_bHasReferences && !curFrame->m_countRefEncoders)
6
+        bool isMCSTFReferenced = false;
7
+
8
+        if (curFrame->m_param->bEnableTemporalFilter)
9
+            isMCSTFReferenced =!!(curFrame->m_refPicCnt1);
10
+
11
+        if (!curFrame->m_encData->m_bHasReferences && !curFrame->m_countRefEncoders && !isMCSTFReferenced)
12
         {
13
             curFrame->m_bChromaExtended = false;
14
 
15
+            if (curFrame->m_param->bEnableTemporalFilter)
16
+                *curFrame->m_isSubSampled = false;
17
+
18
             // Reset column counter
19
             X265_CHECK(curFrame->m_reconRowFlag != NULL, "curFrame->m_reconRowFlag check failure");
20
             X265_CHECK(curFrame->m_reconColCount != NULL, "curFrame->m_reconColCount check failure");
21
@@ -142,12 +150,13 @@
22
     {
23
         newFrame->m_encData->m_bHasReferences = false;
24
 
25
+        newFrame->m_tempLayer = (newFrame->m_param->bEnableTemporalSubLayers && !m_bTemporalSublayer) ? 1 : newFrame->m_tempLayer;
26
         // Adjust NAL type for unreferenced B frames (change from _R "referenced"
27
         // to _N "non-referenced" NAL unit type)
28
         switch (slice->m_nalUnitType)
29
         {
30
         case NAL_UNIT_CODED_SLICE_TRAIL_R:
31
-            slice->m_nalUnitType = m_bTemporalSublayer ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N;
32
+            slice->m_nalUnitType = newFrame->m_param->bEnableTemporalSubLayers ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N;
33
             break;
34
         case NAL_UNIT_CODED_SLICE_RADL_R:
35
             slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N;
36
@@ -168,13 +177,94 @@
37
 
38
     m_picList.pushFront(*newFrame);
39
 
40
+    if (m_bTemporalSublayer && getTemporalLayerNonReferenceFlag())
41
+    {
42
+        switch (slice->m_nalUnitType)
43
+        {
44
+        case NAL_UNIT_CODED_SLICE_TRAIL_R:
45
+            slice->m_nalUnitType =  NAL_UNIT_CODED_SLICE_TRAIL_N;
46
+            break;
47
+        case NAL_UNIT_CODED_SLICE_RADL_R:
48
+            slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N;
49
+            break;
50
+        case NAL_UNIT_CODED_SLICE_RASL_R:
51
+            slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RASL_N;
52
+            break;
53
+        default:
54
+            break;
55
+        }
56
+    }
57
     // Do decoding refresh marking if any
58
     decodingRefreshMarking(pocCurr, slice->m_nalUnitType);
59
 
60
-    computeRPS(pocCurr, slice->isIRAP(), &slice->m_rps, slice->m_sps->maxDecPicBuffering);
61
-
62
+    computeRPS(pocCurr, newFrame->m_tempLayer, slice->isIRAP(), &slice->m_rps, slice->m_sps->maxDecPicBuffering[newFrame->m_tempLayer]);
63
+    bool isTSAPic = ((slice->m_nalUnitType == 2) || (slice->m_nalUnitType == 3)) ? true : false;
64
     // Mark pictures in m_piclist as unreferenced if they are not included in RPS
65
-    applyReferencePictureSet(&slice->m_rps, pocCurr);
66
+    applyReferencePictureSet(&slice->m_rps, pocCurr, newFrame->m_tempLayer, isTSAPic);
67
+
68
+
69
+    if (m_bTemporalSublayer && newFrame->m_tempLayer > 0
70
+        && !(slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RADL_N     // Check if not a leading picture
71
+            || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RADL_R
72
+            || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RASL_N
73
+            || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RASL_R)
74
+        )
75
+    {
76
+        if (isTemporalLayerSwitchingPoint(pocCurr, newFrame->m_tempLayer) || (slice->m_sps->maxTempSubLayers == 1))
77
+        {
78
+            if (getTemporalLayerNonReferenceFlag())
79
+            {
80
+                slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TSA_N;
81
+            }
82
+            else
83
+            {
84
+                slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TSA_R;
85
+            }
86
+        }
87
+        else if (isStepwiseTemporalLayerSwitchingPoint(&slice->m_rps, pocCurr, newFrame->m_tempLayer))
88
+        {
89
+            bool isSTSA = true;
90
+            int id = newFrame->m_gopOffset % x265_gop_ra_length[newFrame->m_gopId];
91
+            for (int ii = id; (ii < x265_gop_ra_length[newFrame->m_gopId] && isSTSA == true); ii++)
92
+            {
93
+                int tempIdRef = x265_gop_ra[newFrame->m_gopId][ii].layer;
94
+                if (tempIdRef == newFrame->m_tempLayer)
95
+                {
96
+                    for (int jj = 0; jj < slice->m_rps.numberOfPositivePictures + slice->m_rps.numberOfNegativePictures; jj++)
97
+                    {
98
+                        if (slice->m_rps.bUsed[jj])
99
+                        {
100
+                            int refPoc = x265_gop_ra[newFrame->m_gopId][ii].poc_offset + slice->m_rps.deltaPOC[jj];
101
+                            int kk = 0;
102
+                            for (kk = 0; kk < x265_gop_ra_length[newFrame->m_gopId]; kk++)
103
+                            {
104
+                                if (x265_gop_ra[newFrame->m_gopId][kk].poc_offset == refPoc)
105
+                                {
106
+                                    break;
107
+                                }
108
+                            }
109
+                            if (x265_gop_ra[newFrame->m_gopId][kk].layer >= newFrame->m_tempLayer)
110
+                            {
111
+                                isSTSA = false;
112
+                                break;
113
+                            }
114
+                        }
115
+                    }
116
+                }
117
+            }
118
+            if (isSTSA == true)
119
+            {
120
+                if (getTemporalLayerNonReferenceFlag())
121
+                {
122
+                    slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_STSA_N;
123
+                }
124
+                else
125
+                {
126
+                    slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_STSA_R;
127
+                }
128
+            }
129
+        }
130
+    }
131
 
132
     if (slice->m_sliceType != I_SLICE)
133
        slice->m_numRefIdx[0] = x265_clip3(1, newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures);
134
@@ -218,7 +308,7 @@
135
     }
136
 }
137
 
138
-void DPB::computeRPS(int curPoc, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer)
139
+void DPB::computeRPS(int curPoc, int tempId, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer)
140
 {
141
     unsigned int poci = 0, numNeg = 0, numPos = 0;
142
 
143
@@ -228,7 +318,7 @@
144
     {
145
         if ((iterPic->m_poc != curPoc) && iterPic->m_encData->m_bHasReferences)
146
         {
147
-            if ((m_lastIDR >= curPoc) || (m_lastIDR <= iterPic->m_poc))
148
+            if ((!m_bTemporalSublayer || (iterPic->m_tempLayer <= tempId)) && ((m_lastIDR >= curPoc) || (m_lastIDR <= iterPic->m_poc)))
149
             {
150
                    rps->poc[poci] = iterPic->m_poc;
151
                    rps->deltaPOC[poci] = rps->poc[poci] - curPoc;
152
@@ -247,6 +337,18 @@
153
     rps->sortDeltaPOC();
154
 }
155
 
156
+bool DPB::getTemporalLayerNonReferenceFlag()
157
+{
158
+    Frame* curFrame = m_picList.first();
159
+    if (curFrame->m_encData->m_bHasReferences)
160
+    {
161
+        curFrame->m_sameLayerRefPic = true;
162
+        return false;
163
+    }
164
+    else
165
+        return true;
166
+}
167
+
168
 /* Marking reference pictures when an IDR/CRA is encountered. */
169
 void DPB::decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType)
170
 {
171
@@ -296,7 +398,7 @@
172
 }
173
 
174
 /** Function for applying picture marking based on the Reference Picture Set */
175
-void DPB::applyReferencePictureSet(RPS *rps, int curPoc)
176
+void DPB::applyReferencePictureSet(RPS *rps, int curPoc, int tempId, bool isTSAPicture)
177
 {
178
     // loop through all pictures in the reference picture buffer
179
     Frame* iterFrame = m_picList.first();
180
@@ -317,9 +419,68 @@
181
             }
182
             if (!referenced)
183
                 iterFrame->m_encData->m_bHasReferences = false;
184
+
185
+            if (m_bTemporalSublayer)
186
+            {
187
+                //check that pictures of higher temporal layers are not used
188
+                assert(referenced == 0 || iterFrame->m_encData->m_bHasReferences == false || iterFrame->m_tempLayer <= tempId);
189
+
190
+                //check that pictures of higher or equal temporal layer are not in the RPS if the current picture is a TSA picture
191
+                if (isTSAPicture)
192
+                {
193
+                    assert(referenced == 0 || iterFrame->m_tempLayer < tempId);
194
+                }
195
+                //check that pictures marked as temporal layer non-reference pictures are not used for reference
196
+                if (iterFrame->m_tempLayer == tempId)
197
+                {
198
+                    assert(referenced == 0 || iterFrame->m_sameLayerRefPic == true);
199
+                }
200
+            }
201
+        }
202
+        iterFrame = iterFrame->m_next;
203
+    }
204
+}
205
+
206
+bool DPB::isTemporalLayerSwitchingPoint(int curPoc, int tempId)
207
+{
208
+    // loop through all pictures in the reference picture buffer
209
+    Frame* iterFrame = m_picList.first();
210
+    while (iterFrame)
211
+    {
212
+        if (iterFrame->m_poc != curPoc && iterFrame->m_encData->m_bHasReferences)
213
+        {
214
+            if (iterFrame->m_tempLayer >= tempId)
215
+            {
216
+                return false;
217
+            }
218
+        }
219
+        iterFrame = iterFrame->m_next;
220
+    }
221
+    return true;
222
+}
223
+
224
+bool DPB::isStepwiseTemporalLayerSwitchingPoint(RPS *rps, int curPoc, int tempId)
225
+{
226
+    // loop through all pictures in the reference picture buffer
227
+    Frame* iterFrame = m_picList.first();
228
+    while (iterFrame)
229
+    {
230
+        if (iterFrame->m_poc != curPoc && iterFrame->m_encData->m_bHasReferences)
231
+        {
232
+            for (int i = 0; i < rps->numberOfPositivePictures + rps->numberOfNegativePictures; i++)
233
+            {
234
+                if ((iterFrame->m_poc == curPoc + rps->deltaPOC[i]) && rps->bUsed[i])
235
+                {
236
+                    if (iterFrame->m_tempLayer >= tempId)
237
+                    {
238
+                        return false;
239
+                    }
240
+                }
241
+            }
242
         }
243
         iterFrame = iterFrame->m_next;
244
     }
245
+    return true;
246
 }
247
 
248
 /* deciding the nal_unit_type */
249
@@ -328,7 +489,7 @@
250
     if (!curPOC)
251
         return NAL_UNIT_CODED_SLICE_IDR_N_LP;
252
     if (bIsKeyFrame)
253
-        return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP;
254
+        return (m_bOpenGOP || m_craNal) ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP;
255
     if (m_pocCRA && curPOC < m_pocCRA)
256
         // All leading pictures are being marked as TFD pictures here since
257
         // current encoder uses all reference pictures while encoding leading
258
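The added marking logic only promotes a picture to TSA/STSA when no picture still usable as a reference sits at the same or a higher temporal layer, which is the HEVC constraint on up-switch points. A compact sketch of that test over a simplified DPB (stand-in types, not x265's Frame/RPS):

#include <vector>

struct DpbPicture
{
    int  poc;
    int  tempLayer;
    bool hasReferences;   // still referenced by pictures yet to be coded
};

// True when a picture at curTempLayer may be marked as a switching point.
bool isTemporalSwitchingPoint(const std::vector<DpbPicture>& dpb,
                              int curPoc, int curTempLayer)
{
    for (const DpbPicture& pic : dpb)
        if (pic.poc != curPoc && pic.hasReferences && pic.tempLayer >= curTempLayer)
            return false;   // a same-or-higher layer reference blocks the switch
    return true;
}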
x265_3.5.tar.gz/source/encoder/dpb.h -> x265_3.6.tar.gz/source/encoder/dpb.h Changed
35
 
1
@@ -40,6 +40,7 @@
2
     int                m_lastIDR;
3
     int                m_pocCRA;
4
     int                m_bOpenGOP;
5
+   int                m_craNal;
6
     int                m_bhasLeadingPicture;
7
     bool               m_bRefreshPending;
8
     bool               m_bTemporalSublayer;
9
@@ -66,7 +67,8 @@
10
         m_bRefreshPending = false;
11
         m_frameDataFreeList = NULL;
12
         m_bOpenGOP = param->bOpenGOP;
13
-        m_bTemporalSublayer = !!param->bEnableTemporalSubLayers;
14
+       m_craNal = param->craNal;
15
+        m_bTemporalSublayer = (param->bEnableTemporalSubLayers > 2);
16
     }
17
 
18
     ~DPB();
19
@@ -77,10 +79,13 @@
20
 
21
 protected:
22
 
23
-    void computeRPS(int curPoc, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer);
24
+    void computeRPS(int curPoc,int tempId, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer);
25
 
26
-    void applyReferencePictureSet(RPS *rps, int curPoc);
27
+    void applyReferencePictureSet(RPS *rps, int curPoc, int tempId, bool isTSAPicture);
28
+    bool getTemporalLayerNonReferenceFlag();
29
     void decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType);
30
+    bool isTemporalLayerSwitchingPoint(int curPoc, int tempId);
31
+    bool isStepwiseTemporalLayerSwitchingPoint(RPS *rps, int curPoc, int tempId);
32
 
33
     NalUnitType getNalUnitType(int curPoc, bool bIsKeyFrame);
34
 };
35
x265_3.5.tar.gz/source/encoder/encoder.cpp -> x265_3.6.tar.gz/source/encoder/encoder.cpp Changed
1237
 
1
@@ -72,7 +72,40 @@
2
 {
3
     { 1, 1, 1, 1, 1, 5, 1,  2, 2, 2, 50 },
4
     { 1, 1, 1, 1, 1, 5, 0, 16, 9, 9, 81 },
5
-    { 1, 1, 1, 1, 1, 5, 0,  1, 1, 1, 82 }
6
+    { 1, 1, 1, 1, 1, 5, 0,  1, 1, 1, 82 },
7
+    { 1, 1, 1, 1, 1, 5, 0, 18, 9, 9, 84 }
8
+};
9
+
10
+typedef struct
11
+{
12
+    int bEnableVideoSignalTypePresentFlag;
13
+    int bEnableColorDescriptionPresentFlag;
14
+    int bEnableChromaLocInfoPresentFlag;
15
+    int colorPrimaries;
16
+    int transferCharacteristics;
17
+    int matrixCoeffs;
18
+    int bEnableVideoFullRangeFlag;
19
+    int chromaSampleLocTypeTopField;
20
+    int chromaSampleLocTypeBottomField;
21
+    const char* systemId;
22
+}VideoSignalTypePresets;
23
+
24
+VideoSignalTypePresets vstPresets[] =
25
+{
26
+    {1, 1, 1, 6, 6, 6, 0, 0, 0, "BT601_525"},
27
+    {1, 1, 1, 5, 6, 5, 0, 0, 0, "BT601_626"},
28
+    {1, 1, 1, 1, 1, 1, 0, 0, 0, "BT709_YCC"},
29
+    {1, 1, 0, 1, 1, 0, 0, 0, 0, "BT709_RGB"},
30
+    {1, 1, 1, 9, 14, 1, 0, 2, 2, "BT2020_YCC_NCL"},
31
+    {1, 1, 0, 9, 16, 9, 0, 0, 0, "BT2020_RGB"},
32
+    {1, 1, 1, 9, 16, 9, 0, 2, 2, "BT2100_PQ_YCC"},
33
+    {1, 1, 1, 9, 16, 14, 0, 2, 2, "BT2100_PQ_ICTCP"},
34
+    {1, 1, 0, 9, 16, 0, 0, 0, 0, "BT2100_PQ_RGB"},
35
+    {1, 1, 1, 9, 18, 9, 0, 2, 2, "BT2100_HLG_YCC"},
36
+    {1, 1, 0, 9, 18, 0, 0, 0, 0, "BT2100_HLG_RGB"},
37
+    {1, 1, 0, 1, 1, 0, 1, 0, 0, "FR709_RGB"},
38
+    {1, 1, 0, 9, 14, 0, 1, 0, 0, "FR2020_RGB"},
39
+    {1, 1, 1, 12, 1, 6, 1, 1, 1, "FRP3D65_YCC"}
40
 };
41
 }
42
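The vstPresets[] table maps one system ID string onto the whole set of VUI colour description and chroma location fields. A sketch of the lookup such a video-signal-type preset option implies (the helper and the reduced struct below are illustrative; the real wiring happens later in encoder.cpp):

#include <cstring>

struct VstPreset
{
    int colorPrimaries, transferCharacteristics, matrixCoeffs;
    int fullRange, chromaLocTop, chromaLocBottom;
    const char* systemId;
};

// A few rows copied from the table above.
static const VstPreset kPresets[] = {
    { 1,  1, 1, 0, 0, 0, "BT709_YCC" },
    { 9, 16, 9, 0, 2, 2, "BT2100_PQ_YCC" },
    { 9, 18, 9, 0, 2, 2, "BT2100_HLG_YCC" },
};

const VstPreset* findPreset(const char* systemId)
{
    for (const VstPreset& p : kPresets)
        if (!strcmp(p.systemId, systemId))
            return &p;
    return nullptr;   // unknown ID: keep the explicitly configured VUI flags
}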
 
43
@@ -109,6 +142,7 @@
44
     m_threadPool = NULL;
45
     m_analysisFileIn = NULL;
46
     m_analysisFileOut = NULL;
47
+    m_filmGrainIn = NULL;
48
     m_naluFile = NULL;
49
     m_offsetEmergency = NULL;
50
     m_iFrameNum = 0;
51
@@ -134,12 +168,8 @@
52
     m_prevTonemapPayload.payload = NULL;
53
     m_startPoint = 0;
54
     m_saveCTUSize = 0;
55
-    m_edgePic = NULL;
56
-    m_edgeHistThreshold = 0;
57
-    m_chromaHistThreshold = 0.0;
58
-    m_scaledEdgeThreshold = 0.0;
59
-    m_scaledChromaThreshold = 0.0;
60
     m_zoneIndex = 0;
61
+    m_origPicBuffer = 0;
62
 }
63
 
64
 inline char *strcatFilename(const char *input, const char *suffix)
65
@@ -216,34 +246,6 @@
66
         }
67
     }
68
 
69
-    if (m_param->bHistBasedSceneCut)
70
-    {
71
-        m_planeSizes0 = (m_param->sourceWidth >> x265_cli_cspsp->internalCsp.width0) * (m_param->sourceHeight >> x265_cli_cspsm_param->internalCsp.height0);
72
-        uint32_t pixelbytes = m_param->internalBitDepth > 8 ? 2 : 1;
73
-        m_edgePic = X265_MALLOC(pixel, m_planeSizes0 * pixelbytes);
74
-        m_edgeHistThreshold = m_param->edgeTransitionThreshold;
75
-        m_chromaHistThreshold = x265_min(m_edgeHistThreshold * 10.0, MAX_SCENECUT_THRESHOLD);
76
-        m_scaledEdgeThreshold = x265_min(m_edgeHistThreshold * SCENECUT_STRENGTH_FACTOR, MAX_SCENECUT_THRESHOLD);
77
-        m_scaledChromaThreshold = x265_min(m_chromaHistThreshold * SCENECUT_STRENGTH_FACTOR, MAX_SCENECUT_THRESHOLD);
78
-        if (m_param->sourceBitDepth != m_param->internalBitDepth)
79
-        {
80
-            int size = m_param->sourceWidth * m_param->sourceHeight;
81
-            int hshift = CHROMA_H_SHIFT(m_param->internalCsp);
82
-            int vshift = CHROMA_V_SHIFT(m_param->internalCsp);
83
-            int widthC = m_param->sourceWidth >> hshift;
84
-            int heightC = m_param->sourceHeight >> vshift;
85
-
86
-            m_inputPic0 = X265_MALLOC(pixel, size);
87
-            if (m_param->internalCsp != X265_CSP_I400)
88
-            {
89
-                for (int j = 1; j < 3; j++)
90
-                {
91
-                    m_inputPicj = X265_MALLOC(pixel, widthC * heightC);
92
-                }
93
-            }
94
-        }
95
-    }
96
-
97
     // Do not allow WPP if only one row or fewer than 3 columns, it is pointless and unstable
98
     if (rows == 1 || cols < 3)
99
     {
100
@@ -357,6 +359,10 @@
101
             lookAheadThreadPooli.start();
102
     m_lookahead->m_numPools = pools;
103
     m_dpb = new DPB(m_param);
104
+
105
+    if (m_param->bEnableTemporalFilter)
106
+        m_origPicBuffer = new OrigPicBuffer();
107
+
108
     m_rateControl = new RateControl(*m_param, this);
109
     if (!m_param->bResetZoneConfig)
110
     {
111
@@ -518,6 +524,15 @@
112
             }
113
         }
114
     }
115
+    if (m_param->filmGrain)
116
+    {
117
+        m_filmGrainIn = x265_fopen(m_param->filmGrain, "rb");
118
+        if (!m_filmGrainIn)
119
+        {
120
+            x265_log_file(NULL, X265_LOG_ERROR, "Failed to open film grain characteristics binary file %s\n", m_param->filmGrain);
121
+        }
122
+    }
123
+
124
     m_bZeroLatency = !m_param->bframes && !m_param->lookaheadDepth && m_param->frameNumThreads == 1 && m_param->maxSlices == 1;
125
     m_aborted |= parseLambdaFile(m_param);
126
 
127
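--film-grain points at a binary file of film grain characteristics that the encoder later emits as an SEI message; the code above only opens it and logs a failure. A hedged sketch of loading such a payload as an opaque blob (how x265 actually parses it is not shown here):

#include <cstdint>
#include <cstdio>
#include <vector>

bool loadFilmGrainPayload(const char* path, std::vector<uint8_t>& payload)
{
    FILE* f = fopen(path, "rb");
    if (!f)
        return false;                    // mirrors the error log above
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    if (size <= 0) { fclose(f); return false; }
    payload.resize((size_t)size);
    size_t got = fread(payload.data(), 1, payload.size(), f);
    fclose(f);
    return got == payload.size();        // caller attaches the blob as an SEI payload
}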
@@ -879,26 +894,6 @@
128
         }
129
     }
130
 
131
-    if (m_param->bHistBasedSceneCut)
132
-    {
133
-        if (m_edgePic != NULL)
134
-        {
135
-            X265_FREE_ZERO(m_edgePic);
136
-        }
137
-
138
-        if (m_param->sourceBitDepth != m_param->internalBitDepth)
139
-        {
140
-            X265_FREE_ZERO(m_inputPic0);
141
-            if (m_param->internalCsp != X265_CSP_I400)
142
-            {
143
-                for (int i = 1; i < 3; i++)
144
-                {
145
-                    X265_FREE_ZERO(m_inputPici);
146
-                }
147
-            }
148
-        }
149
-    }
150
-
151
     for (int i = 0; i < m_param->frameNumThreads; i++)
152
     {
153
         if (m_frameEncoderi)
154
@@ -924,6 +919,10 @@
155
         delete zoneReadCount;
156
         delete zoneWriteCount;
157
     }
158
+
159
+    if (m_param->bEnableTemporalFilter)
160
+        delete m_origPicBuffer;
161
+
162
     if (m_rateControl)
163
     {
164
         m_rateControl->destroy();
165
@@ -963,6 +962,8 @@
166
      }
167
     if (m_naluFile)
168
         fclose(m_naluFile);
169
+    if (m_filmGrainIn)
170
+        x265_fclose(m_filmGrainIn);
171
 
172
 #ifdef SVT_HEVC
173
     X265_FREE(m_svtAppData);
174
@@ -974,6 +975,7 @@
175
         /* release string arguments that were strdup'd */
176
         free((char*)m_param->rc.lambdaFileName);
177
         free((char*)m_param->rc.statFileName);
178
+        free((char*)m_param->rc.sharedMemName);
179
         free((char*)m_param->analysisReuseFileName);
180
         free((char*)m_param->scalingLists);
181
         free((char*)m_param->csvfn);
182
@@ -982,6 +984,7 @@
183
         free((char*)m_param->toneMapFile);
184
         free((char*)m_param->analysisSave);
185
         free((char*)m_param->analysisLoad);
186
+        free((char*)m_param->videoSignalTypePreset);
187
         PARAM_NS::x265_param_free(m_param);
188
     }
189
 }
190
@@ -1358,215 +1361,90 @@
191
     dest->planes2 = (char*)dest->planes1 + src->stride1 * (src->height >> x265_cli_cspssrc->colorSpace.height1);
192
 }
193
 
194
-bool Encoder::computeHistograms(x265_picture *pic)
195
+bool Encoder::isFilterThisframe(uint8_t sliceTypeConfig, int curSliceType)
196
 {
197
-    pixel *src = NULL, *planeV = NULL, *planeU = NULL;
198
-    uint32_t widthC, heightC;
199
-    int hshift, vshift;
200
-
201
-    hshift = CHROMA_H_SHIFT(pic->colorSpace);
202
-    vshift = CHROMA_V_SHIFT(pic->colorSpace);
203
-    widthC = pic->width >> hshift;
204
-    heightC = pic->height >> vshift;
205
-
206
-    if (pic->bitDepth == X265_DEPTH)
207
+    uint8_t newSliceType = 0;
208
+    switch (curSliceType)
209
     {
210
-        src = (pixel*)pic->planes0;
211
-        if (m_param->internalCsp != X265_CSP_I400)
212
-        {
213
-            planeU = (pixel*)pic->planes1;
214
-            planeV = (pixel*)pic->planes2;
215
-        }
216
-    }
217
-    else if (pic->bitDepth == 8 && X265_DEPTH > 8)
218
-    {
219
-        int shift = (X265_DEPTH - 8);
220
-        uint8_t *yChar, *uChar, *vChar;
221
-
222
-        yChar = (uint8_t*)pic->planes0;
223
-        primitives.planecopy_cp(yChar, pic->stride0 / sizeof(*yChar), m_inputPic0, pic->stride0 / sizeof(*yChar), pic->width, pic->height, shift);
224
-        src = m_inputPic0;
225
-        if (m_param->internalCsp != X265_CSP_I400)
226
-        {
227
-            uChar = (uint8_t*)pic->planes1;
228
-            vChar = (uint8_t*)pic->planes2;
229
-            primitives.planecopy_cp(uChar, pic->stride1 / sizeof(*uChar), m_inputPic1, pic->stride1 / sizeof(*uChar), widthC, heightC, shift);
230
-            primitives.planecopy_cp(vChar, pic->stride2 / sizeof(*vChar), m_inputPic2, pic->stride2 / sizeof(*vChar), widthC, heightC, shift);
231
-            planeU = m_inputPic1;
232
-            planeV = m_inputPic2;
233
-        }
234
-    }
235
-    else
236
-    {
237
-        uint16_t *yShort, *uShort, *vShort;
238
-        /* mask off bits that are supposed to be zero */
239
-        uint16_t mask = (1 << X265_DEPTH) - 1;
240
-        int shift = abs(pic->bitDepth - X265_DEPTH);
241
-
242
-        yShort = (uint16_t*)pic->planes0;
243
-        uShort = (uint16_t*)pic->planes1;
244
-        vShort = (uint16_t*)pic->planes2;
245
-
246
-        if (pic->bitDepth > X265_DEPTH)
247
-        {
248
-            /* shift right and mask pixels to final size */
249
-            primitives.planecopy_sp(yShort, pic->stride0 / sizeof(*yShort), m_inputPic0, pic->stride0 / sizeof(*yShort), pic->width, pic->height, shift, mask);
250
-            if (m_param->internalCsp != X265_CSP_I400)
251
-            {
252
-                primitives.planecopy_sp(uShort, pic->stride1 / sizeof(*uShort), m_inputPic1, pic->stride1 / sizeof(*uShort), widthC, heightC, shift, mask);
253
-                primitives.planecopy_sp(vShort, pic->stride2 / sizeof(*vShort), m_inputPic2, pic->stride2 / sizeof(*vShort), widthC, heightC, shift, mask);
254
-            }
255
-        }
256
-        else /* Case for (pic.bitDepth < X265_DEPTH) */
257
-        {
258
-            /* shift left and mask pixels to final size */
259
-            primitives.planecopy_sp_shl(yShort, pic->stride0 / sizeof(*yShort), m_inputPic0, pic->stride0 / sizeof(*yShort), pic->width, pic->height, shift, mask);
260
-            if (m_param->internalCsp != X265_CSP_I400)
261
-            {
262
-                primitives.planecopy_sp_shl(uShort, pic->stride1 / sizeof(*uShort), m_inputPic1, pic->stride1 / sizeof(*uShort), widthC, heightC, shift, mask);
263
-                primitives.planecopy_sp_shl(vShort, pic->stride2 / sizeof(*vShort), m_inputPic2, pic->stride2 / sizeof(*vShort), widthC, heightC, shift, mask);
264
-            }
265
-        }
266
-
267
-        src = m_inputPic0;
268
-        planeU = m_inputPic1;
269
-        planeV = m_inputPic2;
270
-    }
271
-
272
-    size_t bufSize = sizeof(pixel) * m_planeSizes0;
273
-    memset(m_edgePic, 0, bufSize);
274
-
275
-    if (!computeEdge(m_edgePic, src, NULL, pic->width, pic->height, pic->width, false, 1))
276
-    {
277
-        x265_log(m_param, X265_LOG_ERROR, "Failed to compute edge!");
278
-        return false;
279
-    }
280
-
281
-    pixel pixelVal;
282
-    int32_t *edgeHist = m_curEdgeHist;
283
-    memset(edgeHist, 0, EDGE_BINS * sizeof(int32_t));
284
-    for (uint32_t i = 0; i < m_planeSizes0; i++)
285
-    {
286
-        if (m_edgePici)
287
-            edgeHist1++;
288
-        else
289
-            edgeHist0++;
290
-    }
291
-
292
-    /* Y Histogram Calculation */
293
-    int32_t *yHist = m_curYUVHist0;
294
-    memset(yHist, 0, HISTOGRAM_BINS * sizeof(int32_t));
295
-    for (uint32_t i = 0; i < m_planeSizes0; i++)
296
-    {
297
-        pixelVal = srci;
298
-        yHistpixelVal++;
299
+    case 1: newSliceType |= 1 << 0;
300
+        break;
301
+    case 2: newSliceType |= 1 << 0;
302
+        break;
303
+    case 3: newSliceType |= 1 << 1;
304
+        break;
305
+    case 4: newSliceType |= 1 << 2;
306
+        break;
307
+    case 5: newSliceType |= 1 << 3;
308
+        break;
309
+    default: return 0;
310
     }
311
+    return ((sliceTypeConfig & newSliceType) != 0);
312
+}
313
 
314
-    if (pic->colorSpace != X265_CSP_I400)
315
-    {
316
-        /* U Histogram Calculation */
317
-        int32_t *uHist = m_curYUVHist1;
318
-        memset(uHist, 0, sizeof(m_curYUVHist1));
319
-        for (uint32_t i = 0; i < m_planeSizes1; i++)
320
-        {
321
-            pixelVal = planeUi;
322
-            uHistpixelVal++;
323
-        }
324
+inline int enqueueRefFrame(FrameEncoder* curframeEncoder, Frame* iterFrame, Frame* curFrame, bool isPreFiltered, int16_t i)
325
+{
326
+    TemporalFilterRefPicInfo* dest = &curframeEncoder->m_mcstfRefList[curFrame->m_mcstf->m_numRef];
327
+    dest->picBuffer = iterFrame->m_fencPic;
328
+    dest->picBufferSubSampled2 = iterFrame->m_fencPicSubsampled2;
329
+    dest->picBufferSubSampled4 = iterFrame->m_fencPicSubsampled4;
330
+    dest->isFilteredFrame = isPreFiltered;
331
+    dest->isSubsampled = iterFrame->m_isSubSampled;
332
+    dest->origOffset = i;
333
+    curFrame->m_mcstf->m_numRef++;
334
 
335
-        /* V Histogram Calculation */
336
-        pixelVal = 0;
337
-        int32_t *vHist = m_curYUVHist2;
338
-        memset(vHist, 0, sizeof(m_curYUVHist2));
339
-        for (uint32_t i = 0; i < m_planeSizes2; i++)
340
-        {
341
-            pixelVal = planeVi;
342
-            vHistpixelVal++;
343
-        }
344
-    }
345
-    return true;
346
+    return 1;
347
 }
348
 
349
-void Encoder::computeHistogramSAD(double *normalizedMaxUVSad, double *normalizedEdgeSad, int curPoc)
350
+bool Encoder::generateMcstfRef(Frame* frameEnc, FrameEncoder* currEncoder)
351
 {
352
+    frameEnc->m_mcstf->m_numRef = 0;
353
 
354
-    if (curPoc == 0)
355
-    {   /* first frame is scenecut by default no sad computation for the same. */
356
-        *normalizedMaxUVSad = 0.0;
357
-        *normalizedEdgeSad = 0.0;
358
-    }
359
-    else
360
+    for (int iterPOC = (frameEnc->m_poc - frameEnc->m_mcstf->m_range);
361
+        iterPOC <= (frameEnc->m_poc + frameEnc->m_mcstf->m_range); iterPOC++)
362
     {
363
-        /* compute sum of absolute differences of histogram bins of chroma and luma edge response between the current and prev pictures. */
364
-        int32_t edgeHistSad = 0;
365
-        int32_t uHistSad = 0;
366
-        int32_t vHistSad = 0;
367
-        double normalizedUSad = 0.0;
368
-        double normalizedVSad = 0.0;
369
-
370
-        for (int j = 0; j < HISTOGRAM_BINS; j++)
371
+        bool isFound = false;
372
+        if (iterPOC != frameEnc->m_poc)
373
         {
374
-            if (j < 2)
375
+            //search for the reference frame in the Original Picture Buffer
376
+            if (!isFound)
377
             {
378
-                edgeHistSad += abs(m_curEdgeHistj - m_prevEdgeHistj);
379
-            }
380
-            uHistSad += abs(m_curYUVHist1j - m_prevYUVHist1j);
381
-            vHistSad += abs(m_curYUVHist2j - m_prevYUVHist2j);
382
-        }
383
-        *normalizedEdgeSad = normalizeRange(edgeHistSad, 0, 2 * m_planeSizes0, 0.0, 1.0);
384
-        normalizedUSad = normalizeRange(uHistSad, 0, 2 * m_planeSizes1, 0.0, 1.0);
385
-        normalizedVSad = normalizeRange(vHistSad, 0, 2 * m_planeSizes2, 0.0, 1.0);
386
-        *normalizedMaxUVSad = x265_max(normalizedUSad, normalizedVSad);
387
-    }
388
-
389
-    /* store histograms of previous frame for reference */
390
-    memcpy(m_prevEdgeHist, m_curEdgeHist, sizeof(m_curEdgeHist));
391
-    memcpy(m_prevYUVHist, m_curYUVHist, sizeof(m_curYUVHist));
392
-}
393
+                for (int j = 0; j < (2 * frameEnc->m_mcstf->m_range); j++)
394
+                {
395
+                    if (iterPOC < 0)
396
+                        continue;
397
+                    if (iterPOC >= m_pocLast)
398
+                    {
399
 
400
-double Encoder::normalizeRange(int32_t value, int32_t minValue, int32_t maxValue, double rangeStart, double rangeEnd)
401
-{
402
-    return (double)(value - minValue) * (rangeEnd - rangeStart) / (maxValue - minValue) + rangeStart;
403
-}
404
+                        TemporalFilter* mcstf = frameEnc->m_mcstf;
405
+                        while (mcstf->m_numRef)
406
+                        {
407
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs0,  0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) * (mcstf->m_sourceHeight / 16)));
408
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs1,  0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) * (mcstf->m_sourceHeight / 16)));
409
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs2,  0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) * (mcstf->m_sourceHeight / 16)));
410
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs,   0, sizeof(MV) * ((mcstf->m_sourceWidth /  4) * (mcstf->m_sourceHeight /  4)));
411
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.noise, 0, sizeof(int) * ((mcstf->m_sourceWidth / 4) * (mcstf->m_sourceHeight / 4)));
412
+                            memset(currEncoder->m_mcstfRefListmcstf->m_numRef.error, 0, sizeof(int) * ((mcstf->m_sourceWidth / 4) * (mcstf->m_sourceHeight / 4)));
413
 
414
-void Encoder::findSceneCuts(x265_picture *pic, bool& bDup, double maxUVSad, double edgeSad, bool& isMaxThres, bool& isHardSC)
415
-{
416
-    double minEdgeT = m_edgeHistThreshold * MIN_EDGE_FACTOR;
417
-    double minChromaT = minEdgeT * SCENECUT_CHROMA_FACTOR;
418
-    double maxEdgeT = m_edgeHistThreshold * MAX_EDGE_FACTOR;
419
-    double maxChromaT = maxEdgeT * SCENECUT_CHROMA_FACTOR;
420
-    pic->frameData.bScenecut = false;
421
+                            mcstf->m_numRef--;
422
+                        }
423
 
424
-    if (pic->poc == 0)
425
-    {
426
-        /* for first frame */
427
-        pic->frameData.bScenecut = false;
428
-        bDup = false;
429
-    }
430
-    else
431
-    {
432
-        if (edgeSad == 0.0 && maxUVSad == 0.0)
433
-        {
434
-            bDup = true;
435
-        }
436
-        else if (edgeSad < minEdgeT && maxUVSad < minChromaT)
437
-        {
438
-            pic->frameData.bScenecut = false;
439
-        }
440
-        else if (edgeSad > maxEdgeT && maxUVSad > maxChromaT)
441
-        {
442
-            pic->frameData.bScenecut = true;
443
-            isMaxThres = true;
444
-            isHardSC = true;
445
-        }
446
-        else if (edgeSad > m_scaledEdgeThreshold || maxUVSad >= m_scaledChromaThreshold
447
-                 || (edgeSad > m_edgeHistThreshold && maxUVSad >= m_chromaHistThreshold))
448
-        {
449
-            pic->frameData.bScenecut = true;
450
-            bDup = false;
451
-            if (edgeSad > m_scaledEdgeThreshold || maxUVSad >= m_scaledChromaThreshold)
452
-                isHardSC = true;
453
+                        break;
454
+                    }
455
+                    Frame* iterFrame = frameEnc->m_encData->m_slice->m_mcstfRefFrameList[1][j];
456
+                    if (iterFrame->m_poc == iterPOC)
457
+                    {
458
+                        if (!enqueueRefFrame(currEncoder, iterFrame, frameEnc, false, (int16_t)(iterPOC - frameEnc->m_poc)))
459
+                        {
460
+                            return false;
461
+                        };
462
+                        break;
463
+                    }
464
+                }
465
+            }
466
         }
467
     }
468
+
469
+    return true;
470
 }
471
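generateMcstfRef gathers the original pictures within +/- m_range POCs of the current frame so the motion-compensated temporal filter can average across them. A sketch of just the window selection, with the same clipping at the sequence boundaries (simplified; no frame buffers or motion fields here):

#include <vector>

std::vector<int> mcstfReferenceWindow(int poc, int range, int totalFrames)
{
    std::vector<int> refs;
    for (int p = poc - range; p <= poc + range; p++)
    {
        if (p == poc || p < 0)
            continue;                    // skip the current picture and pre-stream POCs
        if (totalFrames && p >= totalFrames)
            break;                       // clipped at the end of the sequence
        refs.push_back(p);
    }
    return refs;
}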
 
472
 /**
473
@@ -1595,40 +1473,24 @@
474
     const x265_picture* inputPic = NULL;
475
     static int written = 0, read = 0;
476
     bool dontRead = false;
477
-    bool bdropFrame = false;
478
     bool dropflag = false;
479
-    bool isMaxThres = false;
480
-    bool isHardSC = false;
481
 
482
     if (m_exportedPic)
483
     {
484
         if (!m_param->bUseAnalysisFile && m_param->analysisSave)
485
             x265_free_analysis_data(m_param, &m_exportedPic->m_analysisData);
486
+
487
         ATOMIC_DEC(&m_exportedPic->m_countRefEncoders);
488
+
489
         m_exportedPic = NULL;
490
         m_dpb->recycleUnreferenced();
491
+
492
+        if (m_param->bEnableTemporalFilter)
493
+            m_origPicBuffer->recycleOrigPicList();
494
     }
495
+
496
     if ((pic_in && (!m_param->chunkEnd || (m_encodedFrameNum < m_param->chunkEnd))) || (m_param->bEnableFrameDuplication && !pic_in && (read < written)))
497
     {
498
-        if (m_param->bHistBasedSceneCut && pic_in)
499
-        {
500
-            x265_picture *pic = (x265_picture *) pic_in;
501
-
502
-            if (pic->poc == 0)
503
-            {
504
-                /* for entire encode compute the chroma plane sizes only once */
505
-                for (int i = 1; i < x265_cli_cspsm_param->internalCsp.planes; i++)
506
-                    m_planeSizesi = (pic->width >> x265_cli_cspsm_param->internalCsp.widthi) * (pic->height >> x265_cli_cspsm_param->internalCsp.heighti);
507
-            }
508
-
509
-            if (computeHistograms(pic))
510
-            {
511
-                double maxUVSad = 0.0, edgeSad = 0.0;
512
-                computeHistogramSAD(&maxUVSad, &edgeSad, pic_in->poc);
513
-                findSceneCuts(pic, bdropFrame, maxUVSad, edgeSad, isMaxThres, isHardSC);
514
-            }
515
-        }
516
-
517
         if ((m_param->bEnableFrameDuplication && !pic_in && (read < written)))
518
             dontRead = true;
519
         else
520
@@ -1672,20 +1534,7 @@
521
                     written++;
522
                 }
523
 
524
-                if (m_param->bEnableFrameDuplication && m_param->bHistBasedSceneCut)
525
-                {
526
-                    if (!bdropFrame && m_dupBuffer1->dupPic->frameData.bScenecut == false)
527
-                    {
528
-                        psnrWeight = ComputePSNR(m_dupBuffer0->dupPic, m_dupBuffer1->dupPic, m_param);
529
-                        if (psnrWeight >= m_param->dupThreshold)
530
-                            dropflag = true;
531
-                    }
532
-                    else
533
-                    {
534
-                        dropflag = true;
535
-                    }
536
-                }
537
-                else if (m_param->bEnableFrameDuplication)
538
+                if (m_param->bEnableFrameDuplication)
539
                 {
540
                     psnrWeight = ComputePSNR(m_dupBuffer0->dupPic, m_dupBuffer1->dupPic, m_param);
541
                     if (psnrWeight >= m_param->dupThreshold)
542
@@ -1768,12 +1617,6 @@
543
                         }
544
                     }
545
                 }
546
-                if (m_param->recursionSkipMode == EDGE_BASED_RSKIP && m_param->bHistBasedSceneCut)
547
-                {
548
-                    pixel* src = m_edgePic;
549
-                    primitives.planecopy_pp_shr(src, inFrame->m_fencPic->m_picWidth, inFrame->m_edgeBitPic, inFrame->m_fencPic->m_stride,
550
-                        inFrame->m_fencPic->m_picWidth, inFrame->m_fencPic->m_picHeight, 0);
551
-                }
552
             }
553
             else
554
             {
555
@@ -1794,6 +1637,8 @@
556
             inFrame->m_lowres.satdCost = (int64_t)-1;
557
             inFrame->m_lowresInit = false;
558
             inFrame->m_isInsideWindow = 0;
559
+            inFrame->m_tempLayer = 0;
560
+            inFrame->m_sameLayerRefPic = 0;
561
         }
562
 
563
         /* Copy input picture into a Frame and PicYuv, send to lookahead */
564
@@ -1802,13 +1647,6 @@
565
         inFrame->m_poc       = ++m_pocLast;
566
         inFrame->m_userData  = inputPic->userData;
567
         inFrame->m_pts       = inputPic->pts;
568
-        if (m_param->bHistBasedSceneCut)
569
-        {
570
-            inFrame->m_lowres.bScenecut = (inputPic->frameData.bScenecut == 1) ? true : false;
571
-            inFrame->m_lowres.m_bIsMaxThres = isMaxThres;
572
-            if (m_param->radl && m_param->keyframeMax != m_param->keyframeMin)
573
-                inFrame->m_lowres.m_bIsHardScenecut = isHardSC;
574
-        }
575
 
576
         if ((m_param->bEnableSceneCutAwareQp & BACKWARD) && m_param->rc.bStatRead)
577
         {
578
@@ -1816,7 +1654,7 @@
579
             rcEntry = &(m_rateControl->m_rce2PassinFrame->m_poc);
580
             if(rcEntry->scenecut)
581
             {
582
-                int backwardWindow = X265_MIN(int((m_param->bwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom)), p->lookaheadDepth);
583
+                int backwardWindow = X265_MIN(int((m_param->bwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom)), p->lookaheadDepth);
584
                 for (int i = 1; i <= backwardWindow; i++)
585
                 {
586
                     int frameNum = inFrame->m_poc - i;
587
@@ -1826,16 +1664,7 @@
588
                 }
589
             }
590
         }
591
-        if (m_param->bHistBasedSceneCut && m_param->analysisSave)
592
-        {
593
-            memcpy(inFrame->m_analysisData.edgeHist, m_curEdgeHist, EDGE_BINS * sizeof(int32_t));
594
-            memcpy(inFrame->m_analysisData.yuvHist[0], m_curYUVHist[0], HISTOGRAM_BINS *sizeof(int32_t));
595
-            if (inputPic->colorSpace != X265_CSP_I400)
596
-            {
597
-                memcpy(inFrame->m_analysisData.yuvHist[1], m_curYUVHist[1], HISTOGRAM_BINS * sizeof(int32_t));
598
-                memcpy(inFrame->m_analysisData.yuvHist[2], m_curYUVHist[2], HISTOGRAM_BINS * sizeof(int32_t));
599
-            }
600
-        }
601
+
602
         inFrame->m_forceqp   = inputPic->forceqp;
603
         inFrame->m_param     = (m_reconfigure || m_reconfigureRc) ? m_latestParam : m_param;
604
         inFrame->m_picStruct = inputPic->picStruct;
605
@@ -1881,7 +1710,8 @@
606
         }
607
 
608
         /* Use the frame types from the first pass, if available */
609
-        int sliceType = (m_param->rc.bStatRead) ? m_rateControl->rateControlSliceType(inFrame->m_poc) : inputPic->sliceType;
610
+        int sliceType = (m_param->rc.bStatRead) ? m_rateControl->rateControlSliceType(inFrame->m_poc) : X265_TYPE_AUTO;
611
+        inFrame->m_lowres.sliceTypeReq = inputPic->sliceType;
612
 
613
         /* In analysisSave mode, x265_analysis_data is allocated in inputPic and inFrame points to this */
614
         /* Load analysis data before lookahead->addPicture, since sliceType has been decided */
615
@@ -1977,6 +1807,59 @@
616
         if (m_reconfigureRc)
617
             inFrame->m_reconfigureRc = true;
618
 
619
+        if (m_param->bEnableTemporalFilter)
620
+        {
621
+            if (!m_pocLast)
622
+            {
623
+                /*One shot allocation of frames in OriginalPictureBuffer*/
624
+                int numFramesinOPB = X265_MAX(m_param->bframes, (inFrame->m_mcstf->m_range << 1)) + 1;
625
+                for (int i = 0; i < numFramesinOPB; i++)
626
+                {
627
+                    Frame* dupFrame = new Frame;
628
+                    if (!(dupFrame->create(m_param, pic_in->quantOffsets)))
629
+                    {
630
+                        m_aborted = true;
631
+                        x265_log(m_param, X265_LOG_ERROR, "Memory allocation failure, aborting encode\n");
632
+                        fflush(stderr);
633
+                        dupFrame->destroy();
634
+                        delete dupFrame;
635
+                        return -1;
636
+                    }
637
+                    else
638
+                    {
639
+                        if (m_sps.cuOffsetY)
640
+                        {
641
+                            dupFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC;
642
+                            dupFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC;
643
+                            dupFrame->m_fencPic->m_cuOffsetY = m_sps.cuOffsetY;
644
+                            dupFrame->m_fencPic->m_buOffsetY = m_sps.buOffsetY;
645
+                            if (m_param->internalCsp != X265_CSP_I400)
646
+                            {
647
+                                dupFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC;
648
+                                dupFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC;
649
+                            }
650
+                            m_origPicBuffer->addEncPicture(dupFrame);
651
+                        }
652
+                    }
653
+                }
654
+            }
655
+
656
+            inFrame->m_refPicCnt1 = 2 * inFrame->m_mcstf->m_range + 1;
657
+            if (inFrame->m_poc < inFrame->m_mcstf->m_range)
658
+                inFrame->m_refPicCnt1 -= (uint8_t)(inFrame->m_mcstf->m_range - inFrame->m_poc);
659
+            if (m_param->totalFrames && (inFrame->m_poc >= (m_param->totalFrames - inFrame->m_mcstf->m_range)))
660
+                inFrame->m_refPicCnt1 -= (uint8_t)(inFrame->m_poc + inFrame->m_mcstf->m_range - m_param->totalFrames + 1);
661
+
662
+            //Extend full-res original picture border
663
+            PicYuv *orig = inFrame->m_fencPic;
664
+            extendPicBorder(orig->m_picOrg[0], orig->m_stride, orig->m_picWidth, orig->m_picHeight, orig->m_lumaMarginX, orig->m_lumaMarginY);
665
+            extendPicBorder(orig->m_picOrg[1], orig->m_strideC, orig->m_picWidth >> orig->m_hChromaShift, orig->m_picHeight >> orig->m_vChromaShift, orig->m_chromaMarginX, orig->m_chromaMarginY);
666
+            extendPicBorder(orig->m_picOrg[2], orig->m_strideC, orig->m_picWidth >> orig->m_hChromaShift, orig->m_picHeight >> orig->m_vChromaShift, orig->m_chromaMarginX, orig->m_chromaMarginY);
667
+
668
+            //TODO: Add subsampling here if required
669
+            m_origPicBuffer->addPicture(inFrame);
670
+        }
671
+
672
         m_lookahead->addPicture(*inFrame, sliceType);
673
         m_numDelayedPic++;
674
     }
675
@@ -2019,6 +1902,7 @@
676
                 pic_out->bitDepth = X265_DEPTH;
677
                 pic_out->userData = outFrame->m_userData;
678
                 pic_out->colorSpace = m_param->internalCsp;
679
+                pic_out->frameData.tLayer = outFrame->m_tempLayer;
680
                 frameData = &(pic_out->frameData);
681
 
682
                 pic_out->pts = outFrame->m_pts;
683
@@ -2041,16 +1925,6 @@
684
                     pic_out->analysisData.poc = pic_out->poc;
685
                     pic_out->analysisData.sliceType = pic_out->sliceType;
686
                     pic_out->analysisData.bScenecut = outFrame->m_lowres.bScenecut;
687
-                    if (m_param->bHistBasedSceneCut)
688
-                    {
689
-                        memcpy(pic_out->analysisData.edgeHist, outFrame->m_analysisData.edgeHist, EDGE_BINS * sizeof(int32_t));
690
-                        memcpy(pic_out->analysisData.yuvHist[0], outFrame->m_analysisData.yuvHist[0], HISTOGRAM_BINS * sizeof(int32_t));
691
-                        if (pic_out->colorSpace != X265_CSP_I400)
692
-                        {
693
-                            memcpy(pic_out->analysisData.yuvHist[1], outFrame->m_analysisData.yuvHist[1], HISTOGRAM_BINS * sizeof(int32_t));
694
-                            memcpy(pic_out->analysisData.yuvHist[2], outFrame->m_analysisData.yuvHist[2], HISTOGRAM_BINS * sizeof(int32_t));
695
-                        }
696
-                    }
697
                     pic_out->analysisData.satdCost  = outFrame->m_lowres.satdCost;
698
                     pic_out->analysisData.numCUsInFrame = outFrame->m_analysisData.numCUsInFrame;
699
                     pic_out->analysisData.numPartitions = outFrame->m_analysisData.numPartitions;
700
@@ -2198,7 +2072,7 @@
701
                 if (m_rateControl->writeRateControlFrameStats(outFrame, &curEncoder->m_rce))
702
                     m_aborted = true;
703
             if (pic_out)
704
-            { 
705
+            {
706
                 /* m_rcData is allocated for every frame */
707
                 pic_out->rcData = outFrame->m_rcData;
708
                 outFrame->m_rcData->qpaRc = outFrame->m_encData->m_avgQpRc;
709
@@ -2216,6 +2090,18 @@
710
                 outFrame->m_rcData->iCuCount = outFrame->m_encData->m_frameStats.percent8x8Intra * m_rateControl->m_ncu;
711
                 outFrame->m_rcData->pCuCount = outFrame->m_encData->m_frameStats.percent8x8Inter * m_rateControl->m_ncu;
712
                 outFrame->m_rcData->skipCuCount = outFrame->m_encData->m_frameStats.percent8x8Skip  * m_rateControl->m_ncu;
713
+                outFrame->m_rcData->currentSatd = curEncoder->m_rce.coeffBits;
714
+            }
715
+
716
+            if (m_param->bEnableTemporalFilter)
717
+            {
718
+                Frame *curFrame = m_origPicBuffer->m_mcstfPicList.getPOCMCSTF(outFrame->m_poc);
719
+                X265_CHECK(curFrame, "Outframe not found in DPB's mcstfPicList");
720
+                curFrame->m_refPicCnt0--;
721
+                curFrame->m_refPicCnt1--;
722
+                curFrame = m_origPicBuffer->m_mcstfOrigPicList.getPOCMCSTF(outFrame->m_poc);
723
+                X265_CHECK(curFrame, "Outframe not found in OPB's mcstfOrigPicList");
724
+                curFrame->m_refPicCnt1--;
725
             }
726
 
727
             /* Allow this frame to be recycled if no frame encoders are using it for reference */
728
@@ -2223,6 +2109,8 @@
729
             {
730
                 ATOMIC_DEC(&outFrame->m_countRefEncoders);
731
                 m_dpb->recycleUnreferenced();
732
+                if (m_param->bEnableTemporalFilter)
733
+                    m_origPicBuffer->recycleOrigPicList();
734
             }
735
             else
736
                 m_exportedPic = outFrame;
737
@@ -2253,7 +2141,7 @@
738
                         m_rateControl->m_lastScenecut = frameEnc->m_poc;
739
                     else
740
                     {
741
-                        int maxWindowSize = int((m_param->fwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5);
742
+                        int maxWindowSize = int((m_param->fwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5);
743
                         if (frameEnc->m_poc > (m_rateControl->m_lastScenecut + maxWindowSize))
744
                             m_rateControl->m_lastScenecut = frameEnc->m_poc;
745
                     }
746
@@ -2422,8 +2310,36 @@
747
                 analysis->numPartitions  = m_param->num4x4Partitions;
748
                 x265_alloc_analysis_data(m_param, analysis);
749
             }
750
+            if (m_param->bEnableTemporalSubLayers > 2)
751
+            {
752
+                //Re-assign temporalid if the current frame is at the end of encode or when I slice is encountered
753
+                if ((frameEnc->m_poc == (m_param->totalFrames - 1)) || (frameEnc->m_lowres.sliceType == X265_TYPE_I) || (frameEnc->m_lowres.sliceType == X265_TYPE_IDR))
754
+                {
755
+                    frameEnc->m_tempLayer = (int8_t)0;
756
+                }
757
+            }
758
             /* determine references, setup RPS, etc */
759
             m_dpb->prepareEncode(frameEnc);
760
+
761
+            if (m_param->bEnableTemporalFilter)
762
+            {
763
+                X265_CHECK(!m_origPicBuffer->m_mcstfOrigPicFreeList.empty(), "Frames not available in Encoded OPB");
764
+
765
+                Frame *dupFrame = m_origPicBuffer->m_mcstfOrigPicFreeList.popBackMCSTF();
766
+                dupFrame->m_fencPic->copyFromFrame(frameEnc->m_fencPic);
767
+                dupFrame->m_poc = frameEnc->m_poc;
768
+                dupFrame->m_encodeOrder = frameEnc->m_encodeOrder;
769
+                dupFrame->m_refPicCnt1 = 2 * dupFrame->m_mcstf->m_range + 1;
770
+
771
+                if (dupFrame->m_poc < dupFrame->m_mcstf->m_range)
772
+                    dupFrame->m_refPicCnt1 -= (uint8_t)(dupFrame->m_mcstf->m_range - dupFrame->m_poc);
773
+                if (m_param->totalFrames && (dupFrame->m_poc >= (m_param->totalFrames - dupFrame->m_mcstf->m_range)))
774
+                    dupFrame->m_refPicCnt1 -= (uint8_t)(dupFrame->m_poc + dupFrame->m_mcstf->m_range - m_param->totalFrames + 1);
775
+
776
+                m_origPicBuffer->addEncPictureToPicList(dupFrame);
777
+                m_origPicBuffer->setOrigPicList(frameEnc, m_pocLast);
778
+            }
779
+
780
             if (!!m_param->selectiveSAO)
781
             {
782
                 Slice* slice = frameEnc->m_encData->m_slice;
783
@@ -2449,9 +2365,72 @@
784
 
785
             if (m_param->rc.rateControlMode != X265_RC_CQP)
786
                 m_lookahead->getEstimatedPictureCost(frameEnc);
787
+
788
             if (m_param->bIntraRefresh)
789
                  calcRefreshInterval(frameEnc);
790
 
791
+            // Generate MCSTF References and perform HME
792
+            if (m_param->bEnableTemporalFilter && isFilterThisframe(frameEnc->m_mcstf->m_sliceTypeConfig, frameEnc->m_lowres.sliceType))
793
+            {
794
+
795
+                if (!generateMcstfRef(frameEnc, curEncoder))
796
+                {
797
+                    m_aborted = true;
798
+                    x265_log(m_param, X265_LOG_ERROR, "Failed to initialize MCSTFReferencePicInfo at POC %d\n", frameEnc->m_poc);
799
+                    fflush(stderr);
800
+                    return -1;
801
+                }
802
+
803
+
804
+                if (!*frameEnc->m_isSubSampled)
805
+                {
806
+                    primitives.frameSubSampleLuma((const pixel *)frameEnc->m_fencPic->m_picOrg[0],frameEnc->m_fencPicSubsampled2->m_picOrg[0], frameEnc->m_fencPic->m_stride, frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled2->m_picWidth, frameEnc->m_fencPicSubsampled2->m_picHeight);
807
+                    extendPicBorder(frameEnc->m_fencPicSubsampled2->m_picOrg[0], frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled2->m_picWidth, frameEnc->m_fencPicSubsampled2->m_picHeight, frameEnc->m_fencPicSubsampled2->m_lumaMarginX, frameEnc->m_fencPicSubsampled2->m_lumaMarginY);
808
+                    primitives.frameSubSampleLuma((const pixel *)frameEnc->m_fencPicSubsampled2->m_picOrg[0],frameEnc->m_fencPicSubsampled4->m_picOrg[0], frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled4->m_stride, frameEnc->m_fencPicSubsampled4->m_picWidth, frameEnc->m_fencPicSubsampled4->m_picHeight);
809
+                    extendPicBorder(frameEnc->m_fencPicSubsampled4->m_picOrg[0], frameEnc->m_fencPicSubsampled4->m_stride, frameEnc->m_fencPicSubsampled4->m_picWidth, frameEnc->m_fencPicSubsampled4->m_picHeight, frameEnc->m_fencPicSubsampled4->m_lumaMarginX, frameEnc->m_fencPicSubsampled4->m_lumaMarginY);
810
+                    *frameEnc->m_isSubSampled = true;
811
+                }
812
+
813
+                for (uint8_t i = 1; i <= frameEnc->m_mcstf->m_numRef; i++)
814
+                {
815
+                    TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefList[i - 1];
816
+                    if (!*ref->isSubsampled)
817
+                    {
818
+                        primitives.frameSubSampleLuma((const pixel *)ref->picBuffer->m_picOrg[0], ref->picBufferSubSampled2->m_picOrg[0], ref->picBuffer->m_stride, ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled2->m_picWidth, ref->picBufferSubSampled2->m_picHeight);
819
+                        extendPicBorder(ref->picBufferSubSampled2->m_picOrg[0], ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled2->m_picWidth, ref->picBufferSubSampled2->m_picHeight, ref->picBufferSubSampled2->m_lumaMarginX, ref->picBufferSubSampled2->m_lumaMarginY);
820
+                        primitives.frameSubSampleLuma((const pixel *)ref->picBufferSubSampled2->m_picOrg[0],ref->picBufferSubSampled4->m_picOrg[0], ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled4->m_stride, ref->picBufferSubSampled4->m_picWidth, ref->picBufferSubSampled4->m_picHeight);
821
+                        extendPicBorder(ref->picBufferSubSampled4->m_picOrg[0], ref->picBufferSubSampled4->m_stride, ref->picBufferSubSampled4->m_picWidth, ref->picBufferSubSampled4->m_picHeight, ref->picBufferSubSampled4->m_lumaMarginX, ref->picBufferSubSampled4->m_lumaMarginY);
822
+                        *ref->isSubsampled = true;
823
+                    }
824
+                }
825
+
826
+                for (uint8_t i = 1; i <= frameEnc->m_mcstf->m_numRef; i++)
827
+                {
828
+                    TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefList[i - 1];
829
+
830
+                    curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs0, ref->mvsStride0, frameEnc->m_fencPicSubsampled4, ref->picBufferSubSampled4, 16);
831
+                    curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs1, ref->mvsStride1, frameEnc->m_fencPicSubsampled2, ref->picBufferSubSampled2, 16, ref->mvs0, ref->mvsStride0, 2);
832
+                    curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs2, ref->mvsStride2, frameEnc->m_fencPic, ref->picBuffer, 16, ref->mvs1, ref->mvsStride1, 2);
833
+                    curEncoder->m_frameEncTF->motionEstimationLumaDoubleRes(ref->mvs,  ref->mvsStride, frameEnc->m_fencPic, ref->picBuffer, 8, ref->mvs2, ref->mvsStride2, 1, ref->error);
834
+                }
835
+
836
+                for (int i = 0; i < frameEnc->m_mcstf->m_numRef; i++)
837
+                {
838
+                    TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefList[i];
839
+                    ref->slicetype = m_lookahead->findSliceType(frameEnc->m_poc + ref->origOffset);
840
+                    Frame* dpbframePtr = m_dpb->m_picList.getPOC(frameEnc->m_poc + ref->origOffset);
841
+                    if (dpbframePtr != NULL)
842
+                    {
843
+                        if (dpbframePtr->m_encData->m_slice->m_sliceType == B_SLICE)
844
+                            ref->slicetype = X265_TYPE_B;
845
+                        else if (dpbframePtr->m_encData->m_slice->m_sliceType == P_SLICE)
846
+                            ref->slicetype = X265_TYPE_P;
847
+                        else
848
+                            ref->slicetype = X265_TYPE_I;
849
+                    }
850
+                }
851
+            }
852
+
853
             /* Allow FrameEncoder::compressFrame() to start in the frame encoder thread */
854
             if (!curEncoder->startCompressFrame(frameEnc))
855
                 m_aborted = true;
856
@@ -2523,7 +2502,11 @@
857
         encParam->dynamicRd = param->dynamicRd;
858
         encParam->bEnableTransformSkip = param->bEnableTransformSkip;
859
         encParam->bEnableAMP = param->bEnableAMP;
860
-
861
+        if (param->confWinBottomOffset == 0 && param->confWinRightOffset == 0)
862
+        {
863
+            encParam->confWinBottomOffset = param->confWinBottomOffset;
864
+            encParam->confWinRightOffset = param->confWinRightOffset;
865
+        }
866
         /* Resignal changes in params in Parameter Sets */
867
         m_sps.maxAMPDepth = (m_sps.bUseAMP = param->bEnableAMP && param->bEnableAMP) ? param->maxCUDepth : 0;
868
         m_pps.bTransformSkipEnabled = param->bEnableTransformSkip ? 1 : 0;
869
@@ -2729,18 +2712,7 @@
870
             (float)100.0 * m_numLumaWPBiFrames / m_analyzeB.m_numPics,
871
             (float)100.0 * m_numChromaWPBiFrames / m_analyzeB.m_numPics);
872
     }
873
-    int pWithB = 0;
874
-    for (int i = 0; i <= m_param->bframes; i++)
875
-        pWithB += m_lookahead->m_histogram[i];
876
 
877
-    if (pWithB)
878
-    {
879
-        int p = 0;
880
-        for (int i = 0; i <= m_param->bframes; i++)
881
-            p += sprintf(buffer + p, "%.1f%% ", 100. * m_lookahead->m_histogram[i] / pWithB);
882
-
883
-        x265_log(m_param, X265_LOG_INFO, "consecutive B-frames: %s\n", buffer);
884
-    }
885
     if (m_param->bLossless)
886
     {
887
         float frameSize = (float)(m_param->sourceWidth - m_sps.conformanceWindow.rightOffset) *
888
@@ -3341,6 +3313,19 @@
889
     }
890
 }
891
 
892
+void Encoder::getEndNalUnits(NALList& list, Bitstream& bs)
893
+{
894
+    NALList nalList;
895
+    bs.resetBits();
896
+
897
+    if (m_param->bEnableEndOfSequence)
898
+        nalList.serialize(NAL_UNIT_EOS, bs);
899
+    if (m_param->bEnableEndOfBitstream)
900
+        nalList.serialize(NAL_UNIT_EOB, bs);
901
+
902
+    list.takeContents(nalList);
903
+}
904
+
905
 void Encoder::initVPS(VPS *vps)
906
 {
907
     /* Note that much of the VPS is initialized by determineLevel() */
908
@@ -3375,10 +3360,14 @@
909
     sps->bUseAMP = m_param->bEnableAMP;
910
     sps->maxAMPDepth = m_param->bEnableAMP ? m_param->maxCUDepth : 0;
911
 
912
-    sps->maxTempSubLayers = m_param->bEnableTemporalSubLayers ? 2 : 1;
913
-    sps->maxDecPicBuffering = m_vps.maxDecPicBuffering;
914
-    sps->numReorderPics = m_vps.numReorderPics;
915
-    sps->maxLatencyIncrease = m_vps.maxLatencyIncrease = m_param->bframes;
916
+    sps->maxTempSubLayers = m_vps.maxTempSubLayers;// Getting the value from the user
917
+
918
+    for(uint8_t i = 0; i < sps->maxTempSubLayers; i++)
919
+    {
920
+        sps->maxDecPicBuffering[i] = m_vps.maxDecPicBuffering[i];
921
+        sps->numReorderPics[i] = m_vps.numReorderPics[i];
922
+        sps->maxLatencyIncrease[i] = m_vps.maxLatencyIncrease[i] = m_param->bframes;
923
+    }
924
 
925
     sps->bUseStrongIntraSmoothing = m_param->bEnableStrongIntraSmoothing;
926
     sps->bTemporalMVPEnabled = m_param->bEnableTemporalMvp;
927
@@ -3518,6 +3507,11 @@
928
             p->rc.aqMode = X265_AQ_NONE;
929
             p->rc.hevcAq = 0;
930
         }
931
+        if (p->rc.aqMode == 0 && p->rc.cuTree)
932
+        {
933
+            p->rc.aqMode = X265_AQ_VARIANCE;
934
+            p->rc.aqStrength = 0;
935
+        }
936
         p->radl = zone->radl;
937
     }
938
     memcpy(zone, p, sizeof(x265_param));
939
@@ -3548,6 +3542,65 @@
940
         p->crQpOffset = 3;
941
 }
942
 
943
+void Encoder::configureVideoSignalTypePreset(x265_param* p)
944
+{
945
+    char systemId[20] = {};
946
+    char colorVolume[20] = {};
947
+    sscanf(p->videoSignalTypePreset, "%[^:]:%s", systemId, colorVolume);
948
+    uint32_t sysId = 0;
949
+    while (strcmp(vstPresets[sysId].systemId, systemId))
950
+    {
951
+        if (sysId + 1 == sizeof(vstPresets) / sizeof(vstPresets[0]))
952
+        {
953
+            x265_log(NULL, X265_LOG_ERROR, "Incorrect system-id, aborting\n");
954
+            m_aborted = true;
955
+            break;
956
+        }
957
+        sysId++;
958
+    }
959
+
960
+    p->vui.bEnableVideoSignalTypePresentFlag = vstPresets[sysId].bEnableVideoSignalTypePresentFlag;
961
+    p->vui.bEnableColorDescriptionPresentFlag = vstPresets[sysId].bEnableColorDescriptionPresentFlag;
962
+    p->vui.bEnableChromaLocInfoPresentFlag = vstPresets[sysId].bEnableChromaLocInfoPresentFlag;
963
+    p->vui.colorPrimaries = vstPresets[sysId].colorPrimaries;
964
+    p->vui.transferCharacteristics = vstPresets[sysId].transferCharacteristics;
965
+    p->vui.matrixCoeffs = vstPresets[sysId].matrixCoeffs;
966
+    p->vui.bEnableVideoFullRangeFlag = vstPresets[sysId].bEnableVideoFullRangeFlag;
967
+    p->vui.chromaSampleLocTypeTopField = vstPresets[sysId].chromaSampleLocTypeTopField;
968
+    p->vui.chromaSampleLocTypeBottomField = vstPresets[sysId].chromaSampleLocTypeBottomField;
969
+
970
+    if (colorVolume[0] != '\0')
971
+    {
972
+        if (!strcmp(systemId, "BT2100_PQ_YCC") || !strcmp(systemId, "BT2100_PQ_ICTCP") || !strcmp(systemId, "BT2100_PQ_RGB"))
973
+        {
974
+            p->bEmitHDR10SEI = 1;
975
+            if (!strcmp(colorVolume, "P3D65x1000n0005"))
976
+            {
977
+                p->masteringDisplayColorVolume = strdup("G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,5)");
978
+            }
979
+            else if (!strcmp(colorVolume, "P3D65x4000n005"))
980
+            {
981
+                p->masteringDisplayColorVolume = strdup("G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(40000000,50)");
982
+            }
983
+            else if (!strcmp(colorVolume, "BT2100x108n0005"))
984
+            {
985
+                p->masteringDisplayColorVolume = strdup("G(8500,39850)B(6550,2300)R(34000,146000)WP(15635,16450)L(10000000,1)");
986
+            }
987
+            else
988
+            {
989
+                x265_log(NULL, X265_LOG_ERROR, "Incorrect color-volume, aborting\n");
990
+                m_aborted = true;
991
+            }
992
+        }
993
+        else
994
+        {
995
+            x265_log(NULL, X265_LOG_ERROR, "Color-volume is not supported with the given system-id, aborting\n");
996
+            m_aborted = true;
997
+        }
998
+    }
999
+
1000
+}
1001
+
1002
 void Encoder::configure(x265_param *p)
1003
 {
1004
     this->m_param = p;
1005
@@ -3610,6 +3663,12 @@
1006
     if (!p->rdoqLevel)
1007
         p->psyRdoq = 0;
1008
 
1009
+    if (p->craNal && p->keyframeMax > 1)
1010
+    {
1011
+        x265_log_file(NULL, X265_LOG_ERROR, " --cra-nal works only with keyint 1, but given keyint = %s\n", p->keyframeMax);
1012
+        m_aborted = true;
1013
+    }
1014
+
1015
     /* Disable features which are not supported by the current RD level */
1016
     if (p->rdLevel < 3)
1017
     {
1018
@@ -3848,12 +3907,37 @@
1019
         p->limitReferences = 0;
1020
     }
1021
 
1022
-    if (p->bEnableTemporalSubLayers && !p->bframes)
1023
+    if ((p->bEnableTemporalSubLayers > 2) && !p->bframes)
1024
     {
1025
         x265_log(p, X265_LOG_WARNING, "B frames not enabled, temporal sublayer disabled\n");
1026
         p->bEnableTemporalSubLayers = 0;
1027
     }
1028
 
1029
+    if (!!p->bEnableTemporalSubLayers && p->bEnableTemporalSubLayers < 2)
1030
+    {
1031
+        p->bEnableTemporalSubLayers = 0;
1032
+        x265_log(p, X265_LOG_WARNING, "No support for temporal sublayers less than 2; Disabling temporal layers\n");
1033
+    }
1034
+
1035
+    if (p->bEnableTemporalSubLayers > 5)
1036
+    {
1037
+        p->bEnableTemporalSubLayers = 5;
1038
+        x265_log(p, X265_LOG_WARNING, "No support for temporal sublayers more than 5; Reducing the temporal sublayers to 5\n");
1039
+    }
1040
+
1041
+    // Assign number of B frames for temporal layers
1042
+    if (p->bEnableTemporalSubLayers > 2)
1043
+            p->bframes = x265_temporal_layer_bframes[p->bEnableTemporalSubLayers - 1];
1044
+
1045
+    if (p->bEnableTemporalSubLayers > 2)
1046
+    {
1047
+        if (p->bFrameAdaptive)
1048
+        {
1049
+            x265_log(p, X265_LOG_WARNING, "Disabling adaptive B-frame placement to support temporal sub-layers\n");
1050
+            p->bFrameAdaptive = 0;
1051
+        }
1052
+    }
1053
+
1054
     m_bframeDelay = p->bframes ? (p->bBPyramid ? 2 : 1) : 0;
1055
 
1056
     p->bFrameBias = X265_MIN(X265_MAX(-90, p->bFrameBias), 100);
1057
@@ -3907,6 +3991,16 @@
1058
         p->rc.bStatRead = 0;
1059
     }
1060
 
1061
+    if ((p->rc.bStatWrite || p->rc.bStatRead) && p->rc.dataShareMode != X265_SHARE_MODE_FILE && p->rc.dataShareMode != X265_SHARE_MODE_SHAREDMEM)
1062
+    {
1063
+        p->rc.dataShareMode = X265_SHARE_MODE_FILE;
1064
+    }
1065
+
1066
+    if (!p->rc.bStatRead || p->rc.rateControlMode != X265_RC_CRF)
1067
+    {
1068
+        p->rc.bEncFocusedFramesOnly = 0;
1069
+    }
1070
+
1071
     /* some options make no sense if others are disabled */
1072
     p->bSaoNonDeblocked &= p->bEnableSAO;
1073
     p->bEnableTSkipFast &= p->bEnableTransformSkip;
1074
@@ -4243,6 +4337,9 @@
1075
         }
1076
     }
1077
 
1078
+    if (p->videoSignalTypePreset)     // Default disabled.
1079
+        configureVideoSignalTypePreset(p);
1080
+
1081
     if (m_param->toneMapFile || p->bHDR10Opt || p->bEmitHDR10SEI)
1082
     {
1083
         if (!p->bRepeatHeaders)
1084
@@ -4313,12 +4410,26 @@
1085
             m_param->searchRange = m_param->hmeRange[2];
1086
     }
1087
 
1088
-   if (p->bHistBasedSceneCut && !p->edgeTransitionThreshold)
1089
-   {
1090
-       p->edgeTransitionThreshold = 0.03;
1091
-       x265_log(p, X265_LOG_WARNING, "using  default threshold %.2lf for scene cut detection\n", p->edgeTransitionThreshold);
1092
-   }
1093
+    if (p->bEnableSBRC && (p->rc.rateControlMode != X265_RC_CRF || (p->rc.vbvBufferSize == 0 || p->rc.vbvMaxBitrate == 0)))
1094
+    {
1095
+        x265_log(p, X265_LOG_WARNING, "SBRC can be enabled only with CRF+VBV mode. Disabling SBRC\n");
1096
+        p->bEnableSBRC = 0;
1097
+    }
1098
 
1099
+    if (p->bEnableSBRC)
1100
+    {
1101
+        p->rc.ipFactor = p->rc.ipFactor * X265_IPRATIO_STRENGTH;
1102
+        if (p->bOpenGOP)
1103
+        {
1104
+            x265_log(p, X265_LOG_WARNING, "Segment based RateControl requires closed gop structure. Enabling closed GOP.\n");
1105
+            p->bOpenGOP = 0;
1106
+        }
1107
+        if (p->keyframeMax != p->keyframeMin)
1108
+        {
1109
+            x265_log(p, X265_LOG_WARNING, "Segment based RateControl requires fixed gop length. Force set min-keyint equal to keyint.\n");
1110
+            p->keyframeMin = p->keyframeMax;
1111
+        }
1112
+    }
1113
 }
1114
 
1115
 void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, const x265_picture* picIn, int paramBytes)
1116
@@ -4379,16 +4490,6 @@
1117
     analysis->frameRecordSize = frameRecordSize;
1118
     X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType));
1119
     X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut));
1120
-    if (m_param->bHistBasedSceneCut)
1121
-    {
1122
-        X265_FREAD(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileIn, &m_curEdgeHist);
1123
-        X265_FREAD(&analysis->yuvHist[0], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[0]);
1123
-        if (m_param->internalCsp != X265_CSP_I400)
1124
-        {
1125
-            X265_FREAD(&analysis->yuvHist[1], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[1]);
1126
-            X265_FREAD(&analysis->yuvHist[2], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[2]);
1128
-        }
1129
-    }
1130
     X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost));
1131
     X265_FREAD(&numCUsLoad, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame));
1132
     X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions));
1133
@@ -4711,16 +4812,6 @@
1134
     analysis->frameRecordSize = frameRecordSize;
1135
     X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType));
1136
     X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut));
1137
-    if (m_param->bHistBasedSceneCut)
1138
-    {
1139
-        X265_FREAD(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileIn, &m_curEdgeHist);
1140
-        X265_FREAD(&analysis->yuvHist[0], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[0]);
1140
-        if (m_param->internalCsp != X265_CSP_I400)
1141
-        {
1142
-            X265_FREAD(&analysis->yuvHist[1], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[1]);
1143
-            X265_FREAD(&analysis->yuvHist[2], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist[2]);
1145
-        }
1146
-    }
1147
     X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost));
1148
     X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame));
1149
     X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions));
1150
@@ -4810,8 +4901,14 @@
1151
 
1152
     if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I)
1153
     {
1154
-        if (m_param->analysisLoadReuseLevel < 2)
1155
-            return;
1156
+       if (m_param->analysisLoadReuseLevel < 2)
1157
+       {
1158
+           /* Restore to the current encode's numPartitions and numCUsInFrame */
1159
+           analysis->numPartitions = m_param->num4x4Partitions;
1160
+           analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU;
1161
+           analysis->numCuInHeight = cuLoc.heightInCU;
1162
+           return;
1163
+       }
1164
 
1165
         uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSizes = NULL;
1166
         int8_t *cuQPBuf = NULL;
1167
@@ -4879,8 +4976,14 @@
1168
         uint32_t numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2;
1169
         uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 1 : 3;
1170
         X265_FREAD((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFileIn, (picIn->analysisData.wt));
1171
-        if (m_param->analysisLoadReuseLevel < 2)
1172
-            return;
1173
+       if (m_param->analysisLoadReuseLevel < 2)
1174
+       {
1175
+           /* Restore to the current encode's numPartitions and numCUsInFrame */
1176
+           analysis->numPartitions = m_param->num4x4Partitions;
1177
+           analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU;
1178
+           analysis->numCuInHeight = cuLoc.heightInCU;
1179
+           return;
1180
+       }
1181
 
1182
         uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSize = NULL, *mergeFlag = NULL;
1183
         uint8_t *interDir = NULL, *chromaDir = NULL, *mvpIdx2;
1184
@@ -5167,7 +5270,7 @@
1185
 
1186
         int bcutree;
1187
         X265_FREAD(&bcutree, sizeof(int), 1, m_analysisFileIn, &(saveParam->cuTree));
1188
-        if (loadLevel == 10 && m_param->rc.cuTree && (!bcutree || saveLevel < 2))
1189
+        if (loadLevel >= 2 && m_param->rc.cuTree && (!bcutree || saveLevel < 2))
1190
         {
1191
             x265_log(NULL, X265_LOG_ERROR, "Error reading cu-tree info. Disabling cutree offsets. \n");
1192
             m_param->rc.cuTree = 0;
1193
@@ -5337,6 +5440,7 @@
1194
             distortionData->highDistortionCtuCount++;
1195
     }
1196
 }
1197
+
1198
 void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, int sliceType)
1199
 {
1200
 
1201
@@ -5486,17 +5590,6 @@
1202
     /* calculate frameRecordSize */
1203
     analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis->poc) + sizeof(analysis->sliceType) +
1204
                       sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost);
1205
-    if (m_param->bHistBasedSceneCut)
1206
-    {
1207
-        analysis->frameRecordSize += sizeof(analysis->edgeHist);
1208
-        analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS;
1209
-        if (m_param->internalCsp != X265_CSP_I400)
1210
-        {
1211
-            analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS;
1212
-            analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS;
1213
-        }
1214
-    }
1215
-
1216
     if (analysis->sliceType > X265_TYPE_I)
1217
     {
1218
         numDir = (analysis->sliceType == X265_TYPE_P) ? 1 : 2;
1219
@@ -5641,17 +5734,6 @@
1220
     X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFileOut);
1221
     X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFileOut);
1222
     X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFileOut);
1223
-    if (m_param->bHistBasedSceneCut)
1224
-    {
1225
-        X265_FWRITE(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileOut);
1226
-        X265_FWRITE(&analysis->yuvHist[0], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut);
1226
-        if (m_param->internalCsp != X265_CSP_I400)
1227
-        {
1228
-            X265_FWRITE(&analysis->yuvHist[1], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut);
1229
-            X265_FWRITE(&analysis->yuvHist[2], sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut);
1231
-        }
1232
-    }
1233
-
1234
     X265_FWRITE(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileOut);
1235
     X265_FWRITE(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileOut);
1236
     X265_FWRITE(&analysis->numPartitions, sizeof(int), 1, m_analysisFileOut);
1237
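A minimal standalone sketch of the "<system-id>:<color-volume>" split that configureVideoSignalTypePreset() performs above; this is illustrative code only, not part of the x265 sources, and the preset value is just an example taken from the names visible in the hunk:

#include <cstdio>

int main()
{
    // Example value for --video-signal-type-preset
    const char* preset = "BT2100_PQ_YCC:P3D65x1000n0005";
    char systemId[20] = {};
    char colorVolume[20] = {};
    // Same split as the encoder: everything before the first ':' is the
    // system-id, the remainder (if any) is the colour-volume name.
    if (sscanf(preset, "%[^:]:%s", systemId, colorVolume) >= 1)
        printf("system-id=%s colour-volume=%s\n", systemId, colorVolume);
    return 0;
}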
x265_3.5.tar.gz/source/encoder/encoder.h -> x265_3.6.tar.gz/source/encoder/encoder.h Changed
72
 
1
@@ -32,6 +32,7 @@
2
 #include "nal.h"
3
 #include "framedata.h"
4
 #include "svt.h"
5
+#include "temporalfilter.h"
6
 #ifdef ENABLE_HDR10_PLUS
7
     #include "dynamicHDR10/hdr10plus.h"
8
 #endif
9
@@ -256,19 +257,6 @@
10
     int                m_bToneMap; // Enables tone-mapping
11
     int                m_enableNal;
12
 
13
-    /* For histogram based scene-cut detection */
14
-    pixel*             m_edgePic;
15
-    pixel*             m_inputPic[3];
16
-    int32_t            m_curYUVHist[3][HISTOGRAM_BINS];
17
-    int32_t            m_prevYUVHist[3][HISTOGRAM_BINS];
18
-    int32_t            m_curEdgeHist[2];
19
-    int32_t            m_prevEdgeHist[2];
20
-    uint32_t           m_planeSizes[3];
21
-    double             m_edgeHistThreshold;
22
-    double             m_chromaHistThreshold;
23
-    double             m_scaledEdgeThreshold;
24
-    double             m_scaledChromaThreshold;
25
-
26
 #ifdef ENABLE_HDR10_PLUS
27
     const hdr10plus_api     *m_hdr10plus_api;
28
     uint8_t                 **m_cim;
29
@@ -295,6 +283,9 @@
30
 
31
     ThreadSafeInteger* zoneReadCount;
32
     ThreadSafeInteger* zoneWriteCount;
33
+    /* Film grain model file */
34
+    FILE* m_filmGrainIn;
35
+    OrigPicBuffer*          m_origPicBuffer;
36
 
37
     Encoder();
38
     ~Encoder()
39
@@ -327,6 +318,8 @@
40
 
41
     void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs);
42
 
43
+    void getEndNalUnits(NALList& list, Bitstream& bs);
44
+
45
     void fetchStats(x265_stats* stats, size_t statsSizeBytes);
46
 
47
     void printSummary();
48
@@ -373,11 +366,6 @@
49
 
50
     void copyPicture(x265_picture *dest, const x265_picture *src);
51
 
52
-    bool computeHistograms(x265_picture *pic);
53
-    void computeHistogramSAD(double *maxUVNormalizedSAD, double *edgeNormalizedSAD, int curPoc);
54
-    double normalizeRange(int32_t value, int32_t minValue, int32_t maxValue, double rangeStart, double rangeEnd);
55
-    void findSceneCuts(x265_picture *pic, bool& bDup, double m_maxUVSADVal, double m_edgeSADVal, bool& isMaxThres, bool& isHardSC);
56
-
57
     void initRefIdx();
58
     void analyseRefIdx(int *numRefIdx);
59
     void updateRefIdx();
60
@@ -387,6 +375,11 @@
61
 
62
     void configureDolbyVisionParams(x265_param* p);
63
 
64
+    void configureVideoSignalTypePreset(x265_param* p);
65
+
66
+    bool isFilterThisframe(uint8_t sliceTypeConfig, int curSliceType);
67
+    bool generateMcstfRef(Frame* frameEnc, FrameEncoder* currEncoder);
68
+
69
 protected:
70
 
71
     void initVPS(VPS *vps);
72
x265_3.5.tar.gz/source/encoder/entropy.cpp -> x265_3.6.tar.gz/source/encoder/entropy.cpp Changed
41
 
1
@@ -245,9 +245,9 @@
2
 
3
     for (uint32_t i = 0; i < vps.maxTempSubLayers; i++)
4
     {
5
-        WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]");
6
-        WRITE_UVLC(vps.numReorderPics,         "vps_num_reorder_pics[i]");
7
-        WRITE_UVLC(vps.maxLatencyIncrease + 1, "vps_max_latency_increase_plus1[i]");
8
+        WRITE_UVLC(vps.maxDecPicBuffering[i] - 1, "vps_max_dec_pic_buffering_minus1[i]");
9
+        WRITE_UVLC(vps.numReorderPics[i],         "vps_num_reorder_pics[i]");
10
+        WRITE_UVLC(vps.maxLatencyIncrease[i] + 1, "vps_max_latency_increase_plus1[i]");
11
     }
12
 
13
     WRITE_CODE(0, 6, "vps_max_nuh_reserved_zero_layer_id");
14
@@ -291,9 +291,9 @@
15
 
16
     for (uint32_t i = 0; i < sps.maxTempSubLayers; i++)
17
     {
18
-        WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]");
19
-        WRITE_UVLC(sps.numReorderPics,         "sps_num_reorder_pics[i]");
20
-        WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]");
21
+        WRITE_UVLC(sps.maxDecPicBuffering[i] - 1, "sps_max_dec_pic_buffering_minus1[i]");
22
+        WRITE_UVLC(sps.numReorderPics[i],         "sps_num_reorder_pics[i]");
23
+        WRITE_UVLC(sps.maxLatencyIncrease[i] + 1, "sps_max_latency_increase_plus1[i]");
24
     }
25
 
26
     WRITE_UVLC(sps.log2MinCodingBlockSize - 3,    "log2_min_coding_block_size_minus3");
27
@@ -418,8 +418,11 @@
28
 
29
     if (maxTempSubLayers > 1)
30
     {
31
-         WRITE_FLAG(0, "sub_layer_profile_present_flag[i]");
32
-         WRITE_FLAG(0, "sub_layer_level_present_flag[i]");
33
+        for(int i = 0; i < maxTempSubLayers - 1; i++)
34
+        {
35
+            WRITE_FLAG(0, "sub_layer_profile_present_flag[i]");
36
+            WRITE_FLAG(0, "sub_layer_level_present_flag[i]");
37
+        }
38
          for (int i = maxTempSubLayers - 1; i < 8 ; i++)
39
              WRITE_CODE(0, 2, "reserved_zero_2bits");
40
     }
41
x265_3.5.tar.gz/source/encoder/frameencoder.cpp -> x265_3.6.tar.gz/source/encoder/frameencoder.cpp Changed
200
 
1
@@ -34,6 +34,7 @@
2
 #include "common.h"
3
 #include "slicetype.h"
4
 #include "nal.h"
5
+#include "temporalfilter.h"
6
 
7
 namespace X265_NS {
8
 void weightAnalyse(Slice& slice, Frame& frame, x265_param& param);
9
@@ -101,6 +102,16 @@
10
         delete m_rce.picTimingSEI;
11
         delete m_rce.hrdTiming;
12
     }
13
+
14
+    if (m_param->bEnableTemporalFilter)
15
+    {
16
+        delete m_frameEncTF->m_metld;
17
+
18
+        for (int i = 0; i < (m_frameEncTF->m_range << 1); i++)
19
+            m_frameEncTF->destroyRefPicInfo(&m_mcstfRefList[i]);
20
+
21
+        delete m_frameEncTF;
22
+    }
23
 }
24
 
25
 bool FrameEncoder::init(Encoder *top, int numRows, int numCols)
26
@@ -195,6 +206,16 @@
27
         m_sliceAddrBits = (uint16_t)(tmp + 1);
28
     }
29
 
30
+    if (m_param->bEnableTemporalFilter)
31
+    {
32
+        m_frameEncTF = new TemporalFilter();
33
+        if (m_frameEncTF)
34
+            m_frameEncTF->init(m_param);
35
+
36
+        for (int i = 0; i < (m_frameEncTF->m_range << 1); i++)
37
+            ok &= !!m_frameEncTF->createRefPicInfo(&m_mcstfRefList[i], m_param);
38
+    }
39
+
40
     return ok;
41
 }
42
 
43
@@ -450,7 +471,7 @@
44
     m_ssimCnt = 0;
45
     memset(&(m_frame->m_encData->m_frameStats), 0, sizeof(m_frame->m_encData->m_frameStats));
46
 
47
-    if (!m_param->bHistBasedSceneCut && m_param->rc.aqMode != X265_AQ_EDGE && m_param->recursionSkipMode == EDGE_BASED_RSKIP)
48
+    if (m_param->rc.aqMode != X265_AQ_EDGE && m_param->recursionSkipMode == EDGE_BASED_RSKIP)
49
     {
50
         int height = m_frame->m_fencPic->m_picHeight;
51
         int width = m_frame->m_fencPic->m_picWidth;
52
@@ -467,6 +488,12 @@
53
      * unit) */
54
     Slice* slice = m_frame->m_encData->m_slice;
55
 
56
+    if (m_param->bEnableEndOfSequence && m_frame->m_lowres.sliceType == X265_TYPE_IDR && m_frame->m_poc)
57
+    {
58
+        m_bs.resetBits();
59
+        m_nalList.serialize(NAL_UNIT_EOS, m_bs);
60
+    }
61
+
62
     if (m_param->bEnableAccessUnitDelimiters && (m_frame->m_poc || m_param->bRepeatHeaders))
63
     {
64
         m_bs.resetBits();
65
@@ -573,6 +600,12 @@
66
     int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top);
67
     m_rce.newQp = qp;
68
 
69
+    if (m_param->bEnableTemporalFilter)
70
+    {
71
+        m_frameEncTF->m_QP = qp;
72
+        m_frameEncTF->bilateralFilter(m_frame, m_mcstfRefList, m_param->temporalFilterStrength);
73
+    }
74
+
75
     if (m_nr)
76
     {
77
         if (qp > QP_MAX_SPEC && m_frame->m_param->rc.vbvBufferSize)
78
@@ -744,7 +777,7 @@
79
             // wait after removal of the access unit with the most recent
80
             // buffering period SEI message
81
             sei->m_auCpbRemovalDelay = X265_MIN(X265_MAX(1, m_rce.encodeOrder - prevBPSEI), (1 << hrd->cpbRemovalDelayLength));
82
-            sei->m_picDpbOutputDelay = slice->m_sps->numReorderPics + poc - m_rce.encodeOrder;
83
+            sei->m_picDpbOutputDelay = slice->m_sps->numReorderPics[m_frame->m_tempLayer] + poc - m_rce.encodeOrder;
84
         }
85
 
86
         sei->writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal);
87
@@ -756,7 +789,14 @@
88
         m_seiAlternativeTC.m_preferredTransferCharacteristics = m_param->preferredTransferCharacteristics;
89
         m_seiAlternativeTC.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal);
90
     }
91
-
92
+    /* Write Film grain characteristics if present */
93
+    if (this->m_top->m_filmGrainIn)
94
+    {
95
+        FilmGrainCharacteristics m_filmGrain;
96
+        /* Read the Film grain model file */
97
+        readModel(&m_filmGrain, this->m_top->m_filmGrainIn);
98
+        m_filmGrain.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal);
99
+    }
100
     /* Write user SEI */
101
     for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++)
102
     {
103
@@ -933,6 +973,23 @@
104
     if (m_param->bDynamicRefine && m_top->m_startPoint <= m_frame->m_encodeOrder) //Avoid collecting data that will not be used by future frames.
105
         collectDynDataFrame();
106
 
107
+    if (m_param->bEnableTemporalFilter && m_top->isFilterThisframe(m_frame->m_mcstf->m_sliceTypeConfig, m_frame->m_lowres.sliceType))
108
+    {
109
+        //Reset the MCSTF context in Frame Encoder and Frame
110
+        for (int i = 0; i < (m_frameEncTF->m_range << 1); i++)
111
+            memset(m_mcstfRefList[i].mvs0, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16)));
113
+            memset(m_mcstfRefList[i].mvs1, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16)));
114
+            memset(m_mcstfRefList[i].mvs2, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16)));
115
+            memset(m_mcstfRefList[i].mvs,  0, sizeof(MV) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4)));
116
+            memset(m_mcstfRefList[i].noise, 0, sizeof(int) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4)));
117
+            memset(m_mcstfRefList[i].error, 0, sizeof(int) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4)));
117
+            memset(m_mcstfRefListi.error, 0, sizeof(int) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4)));
118
+
119
+            m_frame->m_mcstf->m_numRef = 0;
120
+        }
121
+    }
122
+
123
+
124
     if (m_param->rc.bStatWrite)
125
     {
126
         int totalI = 0, totalP = 0, totalSkip = 0;
127
@@ -1041,7 +1098,7 @@
128
             
129
             m_bs.writeByteAlignment();
130
 
131
-            m_nalList.serialize(slice->m_nalUnitType, m_bs);
132
+            m_nalList.serialize(slice->m_nalUnitType, m_bs, (!!m_param->bEnableTemporalSubLayers ? m_frame->m_tempLayer + 1 : (1 + (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N))));
133
         }
134
     }
135
     else
136
@@ -1062,7 +1119,7 @@
137
             m_entropyCoder.codeSliceHeaderWPPEntryPoints(m_substreamSizes, (slice->m_sps->numCuInHeight - 1), maxStreamSize);
138
         m_bs.writeByteAlignment();
139
 
140
-        m_nalList.serialize(slice->m_nalUnitType, m_bs);
141
+        m_nalList.serialize(slice->m_nalUnitType, m_bs, (!!m_param->bEnableTemporalSubLayers ? m_frame->m_tempLayer + 1 : (1 + (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N))));
142
     }
143
 
144
     if (m_param->decodedPictureHashSEI)
145
@@ -2127,6 +2184,54 @@
146
         m_nr->nrOffsetDenoise[cat][0] = 0;
147
     }
148
 }
149
+
150
+void FrameEncoder::readModel(FilmGrainCharacteristics* m_filmGrain, FILE* filmgrain)
151
+{
152
+    char const* errorMessage = "Error reading FilmGrain characteristics\n";
153
+    FilmGrain m_fg;
154
+    x265_fread((char* )&m_fg, sizeof(bool) * 3 + sizeof(uint8_t), 1, filmgrain, errorMessage);
155
+    m_filmGrain->m_filmGrainCharacteristicsCancelFlag = m_fg.m_filmGrainCharacteristicsCancelFlag;
156
+    m_filmGrain->m_filmGrainCharacteristicsPersistenceFlag = m_fg.m_filmGrainCharacteristicsPersistenceFlag;
157
+    m_filmGrain->m_filmGrainModelId = m_fg.m_filmGrainModelId;
158
+    m_filmGrain->m_separateColourDescriptionPresentFlag = m_fg.m_separateColourDescriptionPresentFlag;
159
+    if (m_filmGrain->m_separateColourDescriptionPresentFlag)
160
+    {
161
+        ColourDescription m_clr;
162
+        x265_fread((char* )&m_clr, sizeof(bool) + sizeof(uint8_t) * 5, 1, filmgrain, errorMessage);
163
+        m_filmGrain->m_filmGrainBitDepthLumaMinus8 = m_clr.m_filmGrainBitDepthLumaMinus8;
164
+        m_filmGrain->m_filmGrainBitDepthChromaMinus8 = m_clr.m_filmGrainBitDepthChromaMinus8;
165
+        m_filmGrain->m_filmGrainFullRangeFlag = m_clr.m_filmGrainFullRangeFlag;
166
+        m_filmGrain->m_filmGrainColourPrimaries = m_clr.m_filmGrainColourPrimaries;
167
+        m_filmGrain->m_filmGrainTransferCharacteristics = m_clr.m_filmGrainTransferCharacteristics;
168
+        m_filmGrain->m_filmGrainMatrixCoeffs = m_clr.m_filmGrainMatrixCoeffs;
169
+    }
170
+    FGPresent m_present;
171
+    x265_fread((char* )&m_present, sizeof(bool) * 3 + sizeof(uint8_t) * 2, 1, filmgrain, errorMessage);
172
+    m_filmGrain->m_blendingModeId = m_present.m_blendingModeId;
173
+    m_filmGrain->m_log2ScaleFactor = m_present.m_log2ScaleFactor;
174
+    m_filmGrain->m_compModel[0].bPresentFlag = m_present.m_presentFlag[0];
175
+    m_filmGrain->m_compModel[1].bPresentFlag = m_present.m_presentFlag[1];
176
+    m_filmGrain->m_compModel[2].bPresentFlag = m_present.m_presentFlag[2];
177
+    for (int i = 0; i < MAX_NUM_COMPONENT; i++)
178
+    {
179
+        if (m_filmGrain->m_compModel[i].bPresentFlag)
180
+        {
181
+            x265_fread((char* )(&m_filmGrain->m_compModel[i].m_filmGrainNumIntensityIntervalMinus1), sizeof(uint8_t), 1, filmgrain, errorMessage);
182
+            x265_fread((char* )(&m_filmGrain->m_compModel[i].numModelValues), sizeof(uint8_t), 1, filmgrain, errorMessage);
183
+            m_filmGrain->m_compModel[i].intensityValues = (FilmGrainCharacteristics::CompModelIntensityValues* ) malloc(sizeof(FilmGrainCharacteristics::CompModelIntensityValues) * (m_filmGrain->m_compModel[i].m_filmGrainNumIntensityIntervalMinus1+1)) ;
184
+            for (int j = 0; j <= m_filmGrain->m_compModel[i].m_filmGrainNumIntensityIntervalMinus1; j++)
185
+            {
186
+                x265_fread((char* )(&m_filmGrain->m_compModel[i].intensityValues[j].intensityIntervalLowerBound), sizeof(uint8_t), 1, filmgrain, errorMessage);
187
+                x265_fread((char* )(&m_filmGrain->m_compModel[i].intensityValues[j].intensityIntervalUpperBound), sizeof(uint8_t), 1, filmgrain, errorMessage);
188
+                m_filmGrain->m_compModel[i].intensityValues[j].compModelValue = (int* ) malloc(sizeof(int) * (m_filmGrain->m_compModel[i].numModelValues));
189
+                for (int k = 0; k < m_filmGrain->m_compModel[i].numModelValues; k++)
190
+                {
191
+                    x265_fread((char* )(&m_filmGrain->m_compModel[i].intensityValues[j].compModelValue[k]), sizeof(int), 1, filmgrain, errorMessage);
192
+                }
193
+            }
194
+        }
195
+    }
196
+}
197
 #if ENABLE_LIBVMAF
198
 void FrameEncoder::vmafFrameLevelScore()
199
 {
200
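readModel() above consumes the binary model file passed with --film-grain. Below is a hedged sketch of a writer that produces a file in the layout those x265_fread() calls expect; field order and sizes are inferred from the reads above, the values are arbitrary examples, and tight struct packing plus a 4-byte int are assumed. It is not an official format specification:

#include <cstdint>
#include <cstdio>

int main()
{
    FILE* f = fopen("grain.bin", "wb");
    if (!f)
        return 1;

    // FilmGrain leading fields: cancel, persistence, separate-colour flags + model id
    uint8_t header[4] = { 0, 1, 0, 0 };
    fwrite(header, sizeof(header), 1, f);

    // FGPresent: blending mode id, log2 scale factor, per-component present flags (Y only)
    uint8_t present[5] = { 0, 2, 1, 0, 0 };
    fwrite(present, sizeof(present), 1, f);

    // Component 0: one intensity interval [0,255] with a single model value
    uint8_t numIntervalsMinus1 = 0, numModelValues = 1;
    uint8_t lower = 0, upper = 255;
    int32_t modelValue = 16;
    fwrite(&numIntervalsMinus1, 1, 1, f);
    fwrite(&numModelValues, 1, 1, f);
    fwrite(&lower, 1, 1, f);
    fwrite(&upper, 1, 1, f);
    fwrite(&modelValue, sizeof(modelValue), 1, f);

    fclose(f);
    return 0;
}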
x265_3.5.tar.gz/source/encoder/frameencoder.h -> x265_3.6.tar.gz/source/encoder/frameencoder.h Changed
63
 
1
@@ -40,6 +40,7 @@
2
 #include "ratecontrol.h"
3
 #include "reference.h"
4
 #include "nal.h"
5
+#include "temporalfilter.h"
6
 
7
 namespace X265_NS {
8
 // private x265 namespace
9
@@ -113,6 +114,34 @@
10
     }
11
 };
12
 
13
+/*Film grain characteristics*/
14
+struct FilmGrain
15
+{
16
+    bool    m_filmGrainCharacteristicsCancelFlag;
17
+    bool    m_filmGrainCharacteristicsPersistenceFlag;
18
+    bool    m_separateColourDescriptionPresentFlag;
19
+    uint8_t m_filmGrainModelId;
20
+    uint8_t m_blendingModeId;
21
+    uint8_t m_log2ScaleFactor;
22
+};
23
+
24
+struct ColourDescription
25
+{
26
+    bool        m_filmGrainFullRangeFlag;
27
+    uint8_t     m_filmGrainBitDepthLumaMinus8;
28
+    uint8_t     m_filmGrainBitDepthChromaMinus8;
29
+    uint8_t     m_filmGrainColourPrimaries;
30
+    uint8_t     m_filmGrainTransferCharacteristics;
31
+    uint8_t     m_filmGrainMatrixCoeffs;
32
+};
33
+
34
+struct FGPresent
35
+{
36
+    uint8_t     m_blendingModeId;
37
+    uint8_t     m_log2ScaleFactor;
38
+    bool        m_presentFlag[3];
39
+};
40
+
41
 // Manages the wave-front processing of a single encoding frame
42
 class FrameEncoder : public WaveFront, public Thread
43
 {
44
@@ -205,6 +234,10 @@
45
     FrameFilter              m_frameFilter;
46
     NALList                  m_nalList;
47
 
48
+    // initialization for mcstf
49
+    TemporalFilter*          m_frameEncTF;
50
+    TemporalFilterRefPicInfo m_mcstfRefList[MAX_MCSTF_TEMPORAL_WINDOW_LENGTH];
51
+
52
     class WeightAnalysis : public BondedTaskGroup
53
     {
54
     public:
55
@@ -250,6 +283,7 @@
56
     void collectDynDataFrame();
57
     void computeAvgTrainingData();
58
     void collectDynDataRow(CUData& ctu, FrameStats* rowStats);    
59
+    void readModel(FilmGrainCharacteristics* m_filmGrain, FILE* filmgrain);
60
 };
61
 }
62
 
63
x265_3.5.tar.gz/source/encoder/level.cpp -> x265_3.6.tar.gz/source/encoder/level.cpp Changed
86
 
1
@@ -72,7 +72,7 @@
2
      * for intra-only profiles (vps.ptl.intraConstraintFlag) */
3
     vps.ptl.lowerBitRateConstraintFlag = true;
4
 
5
-    vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
6
+    vps.maxTempSubLayers = !!param.bEnableTemporalSubLayers ? param.bEnableTemporalSubLayers : 1;
7
     
8
     if (param.internalCsp == X265_CSP_I420 && param.internalBitDepth <= 10)
9
     {
10
@@ -167,7 +167,7 @@
11
 
12
         /* The value of sps_max_dec_pic_buffering_minus1 HighestTid  + 1 shall be less than
13
          * or equal to MaxDpbSize */
14
-        if (vps.maxDecPicBuffering > maxDpbSize)
15
+        if (vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] > maxDpbSize)
16
             continue;
17
 
18
         /* For level 5 and higher levels, the value of CtbSizeY shall be equal to 32 or 64 */
19
@@ -182,8 +182,8 @@
20
         }
21
 
22
         /* The value of NumPocTotalCurr shall be less than or equal to 8 */
23
-        int numPocTotalCurr = param.maxNumReferences + vps.numReorderPics;
24
-        if (numPocTotalCurr > 8)
25
+        int numPocTotalCurr = param.maxNumReferences + vps.numReorderPics[vps.maxTempSubLayers - 1];
26
+        if (numPocTotalCurr > 10)
27
         {
28
             x265_log(&param, X265_LOG_WARNING, "level %s detected, but NumPocTotalCurr (total references) is non-compliant\n", levelsi.name);
29
             vps.ptl.profileIdc = Profile::NONE;
30
@@ -289,9 +289,40 @@
31
  * circumstances it will be quite noisy */
32
 bool enforceLevel(x265_param& param, VPS& vps)
33
 {
34
-    vps.numReorderPics = (param.bBPyramid && param.bframes > 1) ? 2 : !!param.bframes;
35
-    vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 2, (uint32_t)param.maxNumReferences) + 1);
36
+    vps.maxTempSubLayers = !!param.bEnableTemporalSubLayers ? param.bEnableTemporalSubLayers : 1;
37
+    for (uint32_t i = 0; i < vps.maxTempSubLayers; i++)
38
+    {
39
+        vps.numReorderPics[i] = (i == 0) ? ((param.bBPyramid && param.bframes > 1) ? 2 : !!param.bframes) : i;
40
+        vps.maxDecPicBuffering[i] = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics[i] + 2, (uint32_t)param.maxNumReferences) + 1);
41
+    }
42
 
43
+    if (!!param.bEnableTemporalSubLayers)
44
+    {
45
+        for (int i = 0; i < MAX_T_LAYERS - 1; i++)
46
+        {
47
+            // a lower layer can not have higher value of numReorderPics than a higher layer
48
+            if (vps.numReorderPics[i + 1] < vps.numReorderPics[i])
49
+            {
50
+                vps.numReorderPics[i + 1] = vps.numReorderPics[i];
51
+            }
52
+            // the value of numReorderPics[i] shall be in the range of 0 to maxDecPicBuffering[i] - 1, inclusive
53
+            if (vps.numReorderPics[i] > vps.maxDecPicBuffering[i] - 1)
54
+            {
55
+                vps.maxDecPicBuffering[i] = vps.numReorderPics[i] + 1;
56
+            }
57
+            // a lower layer can not have higher value of maxDecPicBuffering than a higher layer
58
+            if (vps.maxDecPicBuffering[i + 1] < vps.maxDecPicBuffering[i])
59
+            {
60
+                vps.maxDecPicBuffering[i + 1] = vps.maxDecPicBuffering[i];
61
+            }
62
+        }
63
+
64
+        // the value of numReorderPics[i] shall be in the range of 0 to maxDecPicBuffering[i] - 1, inclusive
65
+        if (vps.numReorderPics[MAX_T_LAYERS - 1] > vps.maxDecPicBuffering[MAX_T_LAYERS - 1] - 1)
66
+        {
67
+            vps.maxDecPicBuffering[MAX_T_LAYERS - 1] = vps.numReorderPics[MAX_T_LAYERS - 1] + 1;
68
+        }
69
+    }
70
     /* no level specified by user, just auto-detect from the configuration */
71
     if (param.levelIdc <= 0)
72
         return true;
73
@@ -391,10 +422,10 @@
74
     }
75
 
76
     int savedRefCount = param.maxNumReferences;
77
+    while (vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] > maxDpbSize && param.maxNumReferences > 1)
78
+    while (vps.maxDecPicBufferingvps.maxTempSubLayers - 1 > maxDpbSize && param.maxNumReferences > 1)
79
     {
80
         param.maxNumReferences--;
81
+        vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics[vps.maxTempSubLayers - 1] + 1, (uint32_t)param.maxNumReferences) + 1);
82
+        vps.maxDecPicBufferingvps.maxTempSubLayers - 1 = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPicsvps.maxTempSubLayers - 1 + 1, (uint32_t)param.maxNumReferences) + 1);
83
     }
84
     if (param.maxNumReferences != savedRefCount)
85
         x265_log(&param, X265_LOG_WARNING, "Lowering max references to %d to meet level requirement\n", param.maxNumReferences);
86
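A standalone sketch (illustrative only, not x265 code) of the per-sub-layer derivation enforceLevel() now performs: layer 0 keeps the old single-layer reorder value, higher layers reorder up to their layer index, and both arrays are forced monotonic with numReorderPics[i] < maxDecPicBuffering[i]. The constants stand in for MAX_NUM_REF and the user's settings:

#include <algorithm>
#include <cstdio>

int main()
{
    const int maxTempSubLayers = 4;        // e.g. --temporal-layers 4
    const unsigned maxNumReferences = 3;
    unsigned numReorderPics[8] = {};
    unsigned maxDecPicBuffering[8] = {};

    for (int i = 0; i < maxTempSubLayers; i++)
    {
        // layer 0: B-pyramid with bframes > 1 assumed; higher layers reorder up to i pictures
        numReorderPics[i] = (i == 0) ? 2 : (unsigned)i;
        maxDecPicBuffering[i] = std::min(16u, std::max(numReorderPics[i] + 2, maxNumReferences) + 1);
    }
    for (int i = 0; i < maxTempSubLayers - 1; i++)
    {
        numReorderPics[i + 1] = std::max(numReorderPics[i + 1], numReorderPics[i]);
        maxDecPicBuffering[i] = std::max(maxDecPicBuffering[i], numReorderPics[i] + 1);
        maxDecPicBuffering[i + 1] = std::max(maxDecPicBuffering[i + 1], maxDecPicBuffering[i]);
    }
    for (int i = 0; i < maxTempSubLayers; i++)
        printf("layer %d: num_reorder_pics %u, max_dec_pic_buffering %u\n",
               i, numReorderPics[i], maxDecPicBuffering[i]);
    return 0;
}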
x265_3.5.tar.gz/source/encoder/motion.cpp -> x265_3.6.tar.gz/source/encoder/motion.cpp Changed
33
 
1
@@ -190,6 +190,31 @@
2
     X265_CHECK(!bChromaSATD, "chroma distortion measurements impossible in this code path\n");
3
 }
4
 
5
+/* Called by lookahead, luma only, no use of PicYuv */
6
+void MotionEstimate::setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int method, const int refine)
7
+{
8
+    partEnum = partitionFromSizes(pwidth, pheight);
9
+    X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n");
10
+    sad = primitives.pupartEnum.sad;
11
+    ads = primitives.pupartEnum.ads;
12
+    satd = primitives.pupartEnum.satd;
13
+    sad_x3 = primitives.pu[partEnum].sad_x3;
14
+    sad_x4 = primitives.pu[partEnum].sad_x4;
15
+
16
+
17
+    blockwidth = pwidth;
18
+    blockOffset = offset;
19
+    absPartIdx = ctuAddr = -1;
20
+
21
+    /* Search params */
22
+    searchMethod = method;
23
+    subpelRefine = refine;
24
+
25
+    /* copy PU block into cache */
26
+    primitives.pu[partEnum].copy_pp(fencPUYuv.m_buf[0], FENC_STRIDE, fencY + offset, stride);
27
+    X265_CHECK(!bChromaSATD, "chroma distortion measurements impossible in this code path\n");
28
+}
29
+
30
 /* Called by Search::predInterSearch() or --pme equivalent, chroma residual might be considered */
31
 void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int method, const int refine, bool bChroma)
32
 {
33
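
The new overload above caches a luma-only PU directly from a frame plane so the lookahead can run motion search without a PicYuv. A hypothetical call site, purely for illustration (the block position, size and search settings are invented here, not taken from the lookahead code):

// Illustrative only: drive the luma-only overload on a 16x16 block of a luma plane.
void searchLowresBlock(MotionEstimate& me, pixel* fencY, intptr_t stride, int puX, int puY)
{
    me.init(X265_CSP_I400);                          // luma-only, matching "no use of PicYuv"
    intptr_t offset = puY * stride + puX;            // top-left of the PU within the plane
    me.setSourcePU(fencY, stride, offset, 16, 16, X265_HEX_SEARCH, 1 /* subpel refine */);
}
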
x265_3.5.tar.gz/source/encoder/motion.h -> x265_3.6.tar.gz/source/encoder/motion.h Changed
10
 
1
@@ -77,7 +77,7 @@
2
     void init(int csp);
3
 
4
     /* Methods called at slice setup */
5
-
6
+    void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int searchMethod, const int subpelRefine);
7
     void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int searchMethod, const int searchL0, const int searchL1, const int subpelRefine);
8
     void setSourcePU(const Yuv& srcFencYuv, int ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int searchMethod, const int subpelRefine, bool bChroma);
9
 
10
x265_3.5.tar.gz/source/encoder/nal.cpp -> x265_3.6.tar.gz/source/encoder/nal.cpp Changed
19
 
1
@@ -57,7 +57,7 @@
2
     other.m_buffer = X265_MALLOC(uint8_t, m_allocSize);
3
 }
4
 
5
-void NALList::serialize(NalUnitType nalUnitType, const Bitstream& bs)
6
+void NALList::serialize(NalUnitType nalUnitType, const Bitstream& bs, uint8_t temporalID)
7
 {
8
     static const char startCodePrefix[] = { 0, 0, 0, 1 };
9
 
10
@@ -114,7 +114,7 @@
11
      * nuh_reserved_zero_6bits  6-bits
12
      * nuh_temporal_id_plus1    3-bits */
13
     out[bytes++] = (uint8_t)nalUnitType << 1;
14
-    out[bytes++] = 1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N);
15
+    out[bytes++] = temporalID;
16
 
17
     /* 7.4.1 ...
18
      * Within the NAL unit, the following three-byte sequences shall not occur at
19
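
With this change the two-byte NAL unit header carries the frame's temporal ID instead of a hard-coded TSA special case. A standalone sketch of the header layout that the two assignments above produce (nuh_layer_id is zero here, as in serialize()):

#include <cstdint>

// byte 0: forbidden_zero_bit(1) | nal_unit_type(6) | high bit of nuh_layer_id
// byte 1: low bits of nuh_layer_id | nuh_temporal_id_plus1(3)
void writeNalHeader(uint8_t out[2], uint8_t nalUnitType, uint8_t temporalIdPlus1)
{
    out[0] = (uint8_t)(nalUnitType << 1);   // layer id bits are all zero
    out[1] = temporalIdPlus1;               // e.g. 1 for temporal layer 0
}
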
x265_3.5.tar.gz/source/encoder/nal.h -> x265_3.6.tar.gz/source/encoder/nal.h Changed
10
 
1
@@ -56,7 +56,7 @@
2
 
3
     void takeContents(NALList& other);
4
 
5
-    void serialize(NalUnitType nalUnitType, const Bitstream& bs);
6
+    void serialize(NalUnitType nalUnitType, const Bitstream& bs, uint8_t temporalID = 1);
7
 
8
     uint32_t serializeSubstreams(uint32_t* streamSizeBytes, uint32_t streamCount, const Bitstream* streams);
9
 };
10
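
The temporalID parameter defaults to 1, i.e. nuh_temporal_id_plus1 for temporal layer 0, so existing two-argument calls to serialize() keep compiling. Callers that track sub-layers pass the frame's temporal layer plus one, for example (illustrative, variable names assumed): nalList.serialize(nalUnitType, bs, (uint8_t)(tLayer + 1));
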
x265_3.5.tar.gz/source/encoder/ratecontrol.cpp -> x265_3.6.tar.gz/source/encoder/ratecontrol.cpp Changed
1457
 
1
@@ -41,6 +41,10 @@
2
 #define BR_SHIFT  6
3
 #define CPB_SHIFT 4
4
 
5
+#define SHARED_DATA_ALIGNMENT      4 ///< 4 byte, 32 bit
6
+#define CUTREE_SHARED_MEM_NAME     "cutree"
7
+#define GOP_CNT_CU_TREE            3
8
+
9
 using namespace X265_NS;
10
 
11
 /* Amortize the partial cost of I frames over the next N frames */
12
@@ -104,6 +108,37 @@
13
     return output;
14
 }
15
 
16
+typedef struct CUTreeSharedDataItem
17
+{
18
+    uint8_t  *type;
19
+    uint16_t *stats;
20
+}CUTreeSharedDataItem;
21
+
22
+void static ReadSharedCUTreeData(void *dst, void *src, int32_t size)
23
+{
24
+    CUTreeSharedDataItem *statsDst = reinterpret_cast<CUTreeSharedDataItem *>(dst);
25
+    uint8_t *typeSrc = reinterpret_cast<uint8_t *>(src);
26
+    *statsDst->type = *typeSrc;
27
+
28
+    ///< for memory alignment, the type will take 32bit in the shared memory
29
+    int32_t offset = (sizeof(*statsDst->type) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1);
30
+    uint16_t *statsSrc = reinterpret_cast<uint16_t *>(typeSrc + offset);
31
+    memcpy(statsDst->stats, statsSrc, size - offset);
32
+}
33
+
34
+void static WriteSharedCUTreeData(void *dst, void *src, int32_t size)
35
+{
36
+    CUTreeSharedDataItem *statsSrc = reinterpret_cast<CUTreeSharedDataItem *>(src);
37
+    uint8_t *typeDst = reinterpret_cast<uint8_t *>(dst);
38
+    *typeDst = *statsSrc->type;
39
+
40
+    ///< for memory alignment, the type will take 32bit in the shared memory
41
+    int32_t offset = (sizeof(*statsSrc->type) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1);
42
+    uint16_t *statsDst = reinterpret_cast<uint16_t *>(typeDst + offset);
43
+    memcpy(statsDst, statsSrc->stats, size - offset);
44
+}
45
+
46
+
47
 inline double qScale2bits(RateControlEntry *rce, double qScale)
48
 {
49
     if (qScale < 0.1)
50
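
Both helpers above rely on the same round-up idiom: the one-byte slice type is padded to SHARED_DATA_ALIGNMENT (4 bytes) so that the uint16_t stats that follow it stay aligned inside a shared-memory item. The idiom in isolation (a sketch; power-of-two alignment assumed):

#include <cassert>
#include <cstdint>

// Round size up to the next multiple of align (align must be a power of two).
int32_t alignUp(int32_t size, int32_t align)
{
    assert((align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
// alignUp(1, 4) == 4: the type byte occupies a full 32-bit slot before the stats block.
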
@@ -209,6 +244,7 @@
51
     m_lastAbrResetPoc = -1;
52
     m_statFileOut = NULL;
53
     m_cutreeStatFileOut = m_cutreeStatFileIn = NULL;
54
+    m_cutreeShrMem = NULL;
55
     m_rce2Pass = NULL;
56
     m_encOrder = NULL;
57
     m_lastBsliceSatdCost = 0;
58
@@ -224,6 +260,8 @@
59
     m_initVbv = false;
60
     m_singleFrameVbv = 0;
61
     m_rateTolerance = 1.0;
62
+    m_encodedSegmentBits = 0;
63
+    m_segDur = 0;
64
 
65
     if (m_param->rc.vbvBufferSize)
66
     {
67
@@ -320,47 +358,86 @@
68
         m_cuTreeStats.qpBuffer[i] = NULL;
69
 }
70
 
71
-bool RateControl::init(const SPS& sps)
72
+bool RateControl::initCUTreeSharedMem()
73
 {
74
-    if (m_isVbv && !m_initVbv)
75
-    {
76
-        /* We don't support changing the ABR bitrate right now,
77
-         * so if the stream starts as CBR, keep it CBR. */
78
-        if (m_param->rc.vbvBufferSize < (int)(m_param->rc.vbvMaxBitrate / m_fps))
79
+    if (!m_cutreeShrMem) {
80
+        m_cutreeShrMem = new RingMem();
81
+        if (!m_cutreeShrMem)
82
         {
83
-            m_param->rc.vbvBufferSize = (int)(m_param->rc.vbvMaxBitrate / m_fps);
84
-            x265_log(m_param, X265_LOG_WARNING, "VBV buffer size cannot be smaller than one frame, using %d kbit\n",
85
-                     m_param->rc.vbvBufferSize);
86
+            return false;
87
         }
88
-        int vbvBufferSize = m_param->rc.vbvBufferSize * 1000;
89
-        int vbvMaxBitrate = m_param->rc.vbvMaxBitrate * 1000;
90
 
91
-        if (m_param->bEmitHRDSEI && !m_param->decoderVbvMaxRate)
92
+        ///< now cutree data from at most 3 gops would be stored in the shared memory at the same time
93
+        int32_t itemSize = (sizeof(uint8_t) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1);
94
+        if (m_param->rc.qgSize == 8)
95
         {
96
-            const HRDInfo* hrd = &sps.vuiParameters.hrdParameters;
97
-            vbvBufferSize = hrd->cpbSizeValue << (hrd->cpbSizeScale + CPB_SHIFT);
98
-            vbvMaxBitrate = hrd->bitRateValue << (hrd->bitRateScale + BR_SHIFT);
99
+            itemSize += sizeof(uint16_t) * m_ncu * 4;
100
         }
101
-        m_bufferRate = vbvMaxBitrate / m_fps;
102
-        m_vbvMaxRate = vbvMaxBitrate;
103
-        m_bufferSize = vbvBufferSize;
104
-        m_singleFrameVbv = m_bufferRate * 1.1 > m_bufferSize;
105
+        else
106
+        {
107
+            itemSize += sizeof(uint16_t) * m_ncu;
108
+        }
109
+
110
+        int32_t itemCnt = X265_MIN(m_param->keyframeMax, (int)(m_fps + 0.5));
111
+        itemCnt *= GOP_CNT_CU_TREE;
112
 
113
-        if (m_param->rc.vbvBufferInit > 1.)
114
-            m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize);
115
-        if (m_param->vbvBufferEnd > 1.)
116
-            m_param->vbvBufferEnd = x265_clip3(0.0, 1.0, m_param->vbvBufferEnd / m_param->rc.vbvBufferSize);
117
-        if (m_param->vbvEndFrameAdjust > 1.)
118
-            m_param->vbvEndFrameAdjust = x265_clip3(0.0, 1.0, m_param->vbvEndFrameAdjust);
119
-        m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize));
120
-        m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit;
121
-        m_bufferFillActual = m_bufferFillFinal;
122
-        m_bufferExcess = 0;
123
-        m_minBufferFill = m_param->minVbvFullness / 100;
124
-        m_maxBufferFill = 1 - (m_param->maxVbvFullness / 100);
125
-        m_initVbv = true;
126
+        char shrname[MAX_SHR_NAME_LEN] = { 0 };
127
+        strcpy(shrname, m_param->rc.sharedMemName);
128
+        strcat(shrname, CUTREE_SHARED_MEM_NAME);
129
+
130
+        if (!m_cutreeShrMem->init(itemSize, itemCnt, shrname))
131
+        {
132
+            return false;
133
+        }
134
     }
135
 
136
+    return true;
137
+}
138
+
139
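
For orientation, the sizing computed in initCUTreeSharedMem() above works out as follows; the helpers below only restate that arithmetic with local names, and the example numbers are hypothetical:

#include <algorithm>
#include <cstdint>

// One ring-buffer item: aligned slice-type byte plus per-CTU uint16_t stats
// (four stats per CTU when qgSize == 8, one otherwise).
int32_t cutreeItemSize(int32_t ncu, int qgSize)
{
    int32_t size = (int32_t(sizeof(uint8_t)) + 3) & ~3;              // 4-byte aligned type
    size += int32_t(sizeof(uint16_t)) * ncu * (qgSize == 8 ? 4 : 1);
    return size;
}

// Items kept live at once: up to one GOP of frames, for GOP_CNT_CU_TREE (3) GOPs.
int32_t cutreeItemCount(int keyframeMax, double fps)
{
    return std::min(keyframeMax, int(fps + 0.5)) * 3;
}
// Example (hypothetical numbers): ncu = 510, qgSize 16, 25 fps, keyint 250
//   itemSize = 4 + 2 * 510 = 1024 bytes, itemCnt = min(250, 25) * 3 = 75 items.
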
+void RateControl::initVBV(const SPS& sps)
140
+{
141
+    /* We don't support changing the ABR bitrate right now,
142
+ * so if the stream starts as CBR, keep it CBR. */
143
+    if (m_param->rc.vbvBufferSize < (int)(m_param->rc.vbvMaxBitrate / m_fps))
144
+    {
145
+        m_param->rc.vbvBufferSize = (int)(m_param->rc.vbvMaxBitrate / m_fps);
146
+        x265_log(m_param, X265_LOG_WARNING, "VBV buffer size cannot be smaller than one frame, using %d kbit\n",
147
+            m_param->rc.vbvBufferSize);
148
+    }
149
+    int vbvBufferSize = m_param->rc.vbvBufferSize * 1000;
150
+    int vbvMaxBitrate = m_param->rc.vbvMaxBitrate * 1000;
151
+
152
+    if (m_param->bEmitHRDSEI && !m_param->decoderVbvMaxRate)
153
+    {
154
+        const HRDInfo* hrd = &sps.vuiParameters.hrdParameters;
155
+        vbvBufferSize = hrd->cpbSizeValue << (hrd->cpbSizeScale + CPB_SHIFT);
156
+        vbvMaxBitrate = hrd->bitRateValue << (hrd->bitRateScale + BR_SHIFT);
157
+    }
158
+    m_bufferRate = vbvMaxBitrate / m_fps;
159
+    m_vbvMaxRate = vbvMaxBitrate;
160
+    m_bufferSize = vbvBufferSize;
161
+    m_singleFrameVbv = m_bufferRate * 1.1 > m_bufferSize;
162
+
163
+    if (m_param->rc.vbvBufferInit > 1.)
164
+        m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize);
165
+    if (m_param->vbvBufferEnd > 1.)
166
+        m_param->vbvBufferEnd = x265_clip3(0.0, 1.0, m_param->vbvBufferEnd / m_param->rc.vbvBufferSize);
167
+    if (m_param->vbvEndFrameAdjust > 1.)
168
+        m_param->vbvEndFrameAdjust = x265_clip3(0.0, 1.0, m_param->vbvEndFrameAdjust);
169
+    m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize));
170
+    m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit;
171
+    m_bufferFillActual = m_bufferFillFinal;
172
+    m_bufferExcess = 0;
173
+    m_minBufferFill = m_param->minVbvFullness / 100;
174
+    m_maxBufferFill = 1 - (m_param->maxVbvFullness / 100);
175
+    m_initVbv = true;
176
+}
177
+
178
+bool RateControl::init(const SPS& sps)
179
+{
180
+    if (m_isVbv && !m_initVbv)
181
+        initVBV(sps);
182
+
183
     if (!m_param->bResetZoneConfig && (m_relativeComplexity == NULL))
184
     {
185
         m_relativeComplexity = X265_MALLOC(double, m_param->reconfigWindowSize);
186
@@ -373,7 +450,9 @@
187
 
188
     m_totalBits = 0;
189
     m_encodedBits = 0;
190
+    m_encodedSegmentBits = 0;
191
     m_framesDone = 0;
192
+    m_segDur = 0;
193
     m_residualCost = 0;
194
     m_partialResidualCost = 0;
195
     m_amortizeFraction = 0.85;
196
@@ -421,244 +500,257 @@
197
         /* Load stat file and init 2pass algo */
198
         if (m_param->rc.bStatRead)
199
         {
200
-            m_expectedBitsSum = 0;
201
-            char *p, *statsIn, *statsBuf;
202
-            /* read 1st pass stats */
203
-            statsIn = statsBuf = x265_slurp_file(fileName);
204
-            if (!statsBuf)
205
-                return false;
206
-            if (m_param->rc.cuTree)
207
+            if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode)
208
             {
209
-                char *tmpFile = strcatFilename(fileName, ".cutree");
210
-                if (!tmpFile)
211
+                m_expectedBitsSum = 0;
212
+                char *p, *statsIn, *statsBuf;
213
+                /* read 1st pass stats */
214
+                statsIn = statsBuf = x265_slurp_file(fileName);
215
+                if (!statsBuf)
216
                     return false;
217
-                m_cutreeStatFileIn = x265_fopen(tmpFile, "rb");
218
-                X265_FREE(tmpFile);
219
-                if (!m_cutreeStatFileIn)
220
+                if (m_param->rc.cuTree)
221
                 {
222
-                    x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.cutree\n", fileName);
223
-                    return false;
224
+                    char *tmpFile = strcatFilename(fileName, ".cutree");
225
+                    if (!tmpFile)
226
+                        return false;
227
+                    m_cutreeStatFileIn = x265_fopen(tmpFile, "rb");
228
+                    X265_FREE(tmpFile);
229
+                    if (!m_cutreeStatFileIn)
230
+                    {
231
+                        x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.cutree\n", fileName);
232
+                        return false;
233
+                    }
234
                 }
235
-            }
236
 
237
-            /* check whether 1st pass options were compatible with current options */
238
-            if (strncmp(statsBuf, "#options:", 9))
239
-            {
240
-                x265_log(m_param, X265_LOG_ERROR,"options list in stats file not valid\n");
241
-                return false;
242
-            }
243
-            {
244
-                int i, j, m;
245
-                uint32_t k , l;
246
-                bool bErr = false;
247
-                char *opts = statsBuf;
248
-                statsIn = strchr(statsBuf, '\n');
249
-                if (!statsIn)
250
-                {
251
-                    x265_log(m_param, X265_LOG_ERROR, "Malformed stats file\n");
252
-                    return false;
253
-                }
254
-                *statsIn = '\0';
255
-                statsIn++;
256
-                if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2)
257
-                {
258
-                    x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n");
259
-                    return false;
260
-                }
261
-                if ((p = strstr(opts, " fps=")) == 0 || sscanf(p, " fps=%u/%u", &k, &l) != 2)
262
-                {
263
-                    x265_log(m_param, X265_LOG_ERROR, "fps specified in stats file not valid\n");
264
-                    return false;
265
-                }
266
-                if (((p = strstr(opts, " vbv-maxrate=")) == 0 || sscanf(p, " vbv-maxrate=%d", &m) != 1) && m_param->rc.rateControlMode == X265_RC_CRF)
267
-                {
268
-                    x265_log(m_param, X265_LOG_ERROR, "Constant rate-factor is incompatible with 2pass without vbv-maxrate in the previous pass\n");
269
-                    return false;
270
-                }
271
-                if (k != m_param->fpsNum || l != m_param->fpsDenom)
272
+                /* check whether 1st pass options were compatible with current options */
273
+                if (strncmp(statsBuf, "#options:", 9))
274
                 {
275
-                    x265_log(m_param, X265_LOG_ERROR, "fps mismatch with 1st pass (%u/%u vs %u/%u)\n",
276
-                              m_param->fpsNum, m_param->fpsDenom, k, l);
277
+                    x265_log(m_param, X265_LOG_ERROR, "options list in stats file not valid\n");
278
                     return false;
279
                 }
280
-                if (m_param->analysisMultiPassRefine)
281
                 {
282
-                    p = strstr(opts, "ref=");
283
-                    sscanf(p, "ref=%d", &i);
284
-                    if (i > m_param->maxNumReferences)
285
+                    int i, j, m;
286
+                    uint32_t k, l;
287
+                    bool bErr = false;
288
+                    char *opts = statsBuf;
289
+                    statsIn = strchr(statsBuf, '\n');
290
+                    if (!statsIn)
291
                     {
292
-                        x265_log(m_param, X265_LOG_ERROR, "maxNumReferences cannot be less than 1st pass (%d vs %d)\n",
293
-                            i, m_param->maxNumReferences);
294
+                        x265_log(m_param, X265_LOG_ERROR, "Malformed stats file\n");
295
                         return false;
296
                     }
297
-                }
298
-                if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion)
299
-                {
300
-                    p = strstr(opts, "ctu=");
301
-                    sscanf(p, "ctu=%u", &k);
302
-                    if (k != m_param->maxCUSize)
303
+                    *statsIn = '\0';
304
+                    statsIn++;
305
+                    if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2)
306
                     {
307
-                        x265_log(m_param, X265_LOG_ERROR, "maxCUSize mismatch with 1st pass (%u vs %u)\n",
308
-                            k, m_param->maxCUSize);
309
+                        x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n");
310
                         return false;
311
                     }
312
+                    if ((p = strstr(opts, " fps=")) == 0 || sscanf(p, " fps=%u/%u", &k, &l) != 2)
313
+                    {
314
+                        x265_log(m_param, X265_LOG_ERROR, "fps specified in stats file not valid\n");
315
+                        return false;
316
+                    }
317
+                    if (((p = strstr(opts, " vbv-maxrate=")) == 0 || sscanf(p, " vbv-maxrate=%d", &m) != 1) && m_param->rc.rateControlMode == X265_RC_CRF)
318
+                    {
319
+                        x265_log(m_param, X265_LOG_ERROR, "Constant rate-factor is incompatible with 2pass without vbv-maxrate in the previous pass\n");
320
+                        return false;
321
+                    }
322
+                    if (k != m_param->fpsNum || l != m_param->fpsDenom)
323
+                    {
324
+                        x265_log(m_param, X265_LOG_ERROR, "fps mismatch with 1st pass (%u/%u vs %u/%u)\n",
325
+                            m_param->fpsNum, m_param->fpsDenom, k, l);
326
+                        return false;
327
+                    }
328
+                    if (m_param->analysisMultiPassRefine)
329
+                    {
330
+                        p = strstr(opts, "ref=");
331
+                        sscanf(p, "ref=%d", &i);
332
+                        if (i > m_param->maxNumReferences)
333
+                        {
334
+                            x265_log(m_param, X265_LOG_ERROR, "maxNumReferences cannot be less than 1st pass (%d vs %d)\n",
335
+                                i, m_param->maxNumReferences);
336
+                            return false;
337
+                        }
338
+                    }
339
+                    if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion)
340
+                    {
341
+                        p = strstr(opts, "ctu=");
342
+                        sscanf(p, "ctu=%u", &k);
343
+                        if (k != m_param->maxCUSize)
344
+                        {
345
+                            x265_log(m_param, X265_LOG_ERROR, "maxCUSize mismatch with 1st pass (%u vs %u)\n",
346
+                                k, m_param->maxCUSize);
347
+                            return false;
348
+                        }
349
+                    }
350
+                    CMP_OPT_FIRST_PASS("bitdepth", m_param->internalBitDepth);
351
+                    CMP_OPT_FIRST_PASS("weightp", m_param->bEnableWeightedPred);
352
+                    CMP_OPT_FIRST_PASS("bframes", m_param->bframes);
353
+                    CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid);
354
+                    CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP);
355
+                    CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax);
356
+                    CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold);
357
+                    CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh);
358
+                    CMP_OPT_FIRST_PASS("frame-dup", m_param->bEnableFrameDuplication);
359
+                    if (m_param->bMultiPassOptRPS)
360
+                    {
361
+                        CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS);
362
+                        CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders);
363
+                        CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin);
364
+                    }
365
+
366
+                    if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS)
367
+                    {
368
+                        m_param->bFrameAdaptive = i;
369
+                    }
370
+                    else if (m_param->bframes)
371
+                    {
372
+                        x265_log(m_param, X265_LOG_ERROR, "b-adapt method specified in stats file not valid\n");
373
+                        return false;
374
+                    }
375
+
376
+                    if ((p = strstr(opts, "rc-lookahead=")) != 0 && sscanf(p, "rc-lookahead=%d", &i))
377
+                        m_param->lookaheadDepth = i;
378
                 }
379
-                CMP_OPT_FIRST_PASS("bitdepth", m_param->internalBitDepth);
380
-                CMP_OPT_FIRST_PASS("weightp", m_param->bEnableWeightedPred);
381
-                CMP_OPT_FIRST_PASS("bframes", m_param->bframes);
382
-                CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid);
383
-                CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP);
384
-                CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax);
385
-                CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold);
386
-                CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh);
387
-                CMP_OPT_FIRST_PASS("frame-dup", m_param->bEnableFrameDuplication);
388
-                if (m_param->bMultiPassOptRPS)
389
+                /* find number of pics */
390
+                p = statsIn;
391
+                int numEntries;
392
+                for (numEntries = -1; p; numEntries++)
393
+                    p = strchr(p + 1, ';');
394
+                if (!numEntries)
395
                 {
396
-                    CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS);
397
-                    CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders);
398
-                    CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin);
399
+                    x265_log(m_param, X265_LOG_ERROR, "empty stats file\n");
400
+                    return false;
401
                 }
402
+                m_numEntries = numEntries;
403
 
404
-                if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS)
405
+                if (m_param->totalFrames < m_numEntries && m_param->totalFrames > 0)
406
                 {
407
-                    m_param->bFrameAdaptive = i;
408
+                    x265_log(m_param, X265_LOG_WARNING, "2nd pass has fewer frames than 1st pass (%d vs %d)\n",
409
+                        m_param->totalFrames, m_numEntries);
410
                 }
411
-                else if (m_param->bframes)
412
+                if (m_param->totalFrames > m_numEntries && !m_param->bEnableFrameDuplication)
413
                 {
414
-                    x265_log(m_param, X265_LOG_ERROR, "b-adapt method specified in stats file not valid\n");
415
+                    x265_log(m_param, X265_LOG_ERROR, "2nd pass has more frames than 1st pass (%d vs %d)\n",
416
+                        m_param->totalFrames, m_numEntries);
417
                     return false;
418
                 }
419
 
420
-                if ((p = strstr(opts, "rc-lookahead=")) != 0 && sscanf(p, "rc-lookahead=%d", &i))
421
-                    m_param->lookaheadDepth = i;
422
-            }
423
-            /* find number of pics */
424
-            p = statsIn;
425
-            int numEntries;
426
-            for (numEntries = -1; p; numEntries++)
427
-                p = strchr(p + 1, ';');
428
-            if (!numEntries)
429
-            {
430
-                x265_log(m_param, X265_LOG_ERROR, "empty stats file\n");
431
-                return false;
432
-            }
433
-            m_numEntries = numEntries;
434
-
435
-            if (m_param->totalFrames < m_numEntries && m_param->totalFrames > 0)
436
-            {
437
-                x265_log(m_param, X265_LOG_WARNING, "2nd pass has fewer frames than 1st pass (%d vs %d)\n",
438
-                         m_param->totalFrames, m_numEntries);
439
-            }
440
-            if (m_param->totalFrames > m_numEntries && !m_param->bEnableFrameDuplication)
441
-            {
442
-                x265_log(m_param, X265_LOG_ERROR, "2nd pass has more frames than 1st pass (%d vs %d)\n",
443
-                         m_param->totalFrames, m_numEntries);
444
-                return false;
445
-            }
446
-
447
-            m_rce2Pass = X265_MALLOC(RateControlEntry, m_numEntries);
448
-            if (!m_rce2Pass)
449
-            {
450
-                 x265_log(m_param, X265_LOG_ERROR, "Rce Entries for 2 pass cannot be allocated\n");
451
-                 return false;
452
-            }
453
-            m_encOrder = X265_MALLOC(int, m_numEntries);
454
-            if (!m_encOrder)
455
-            {
456
-                x265_log(m_param, X265_LOG_ERROR, "Encode order for 2 pass cannot be allocated\n");
457
-                return false;
458
-            }
459
-            /* init all to skipped p frames */
460
-            for (int i = 0; i < m_numEntries; i++)
461
-            {
462
-                RateControlEntry *rce = &m_rce2Pass[i];
463
-                rce->sliceType = P_SLICE;
464
-                rce->qScale = rce->newQScale = x265_qp2qScale(20);
465
-                rce->miscBits = m_ncu + 10;
466
-                rce->newQp = 0;
467
-            }
468
-            /* read stats */
469
-            p = statsIn;
470
-            double totalQpAq = 0;
471
-            for (int i = 0; i < m_numEntries; i++)
472
-            {
473
-                RateControlEntry *rce, *rcePocOrder;
474
-                int frameNumber;
475
-                int encodeOrder;
476
-                char picType;
477
-                int e;
478
-                char *next;
479
-                double qpRc, qpAq, qNoVbv, qRceq;
480
-                next = strstr(p, ";");
481
-                if (next)
482
-                    *next++ = 0;
483
-                e = sscanf(p, " in:%d out:%d", &frameNumber, &encodeOrder);
484
-                if (frameNumber < 0 || frameNumber >= m_numEntries)
485
+                m_rce2Pass = X265_MALLOC(RateControlEntry, m_numEntries);
486
+                if (!m_rce2Pass)
487
                 {
488
-                    x265_log(m_param, X265_LOG_ERROR, "bad frame number (%d) at stats line %d\n", frameNumber, i);
489
+                    x265_log(m_param, X265_LOG_ERROR, "Rce Entries for 2 pass cannot be allocated\n");
490
                     return false;
491
                 }
492
-                rce = &m_rce2Pass[encodeOrder];
493
-                rcePocOrder = &m_rce2Pass[frameNumber];
494
-                m_encOrder[frameNumber] = encodeOrder;
495
-                if (!m_param->bMultiPassOptRPS)
496
-                {
497
-                    int scenecut = 0;
498
-                    e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf sc:%d",
499
-                        &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
500
-                        &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
501
-                        &rce->skipCuCount, &scenecut);
502
-                    rcePocOrder->scenecut = scenecut != 0;
503
+                m_encOrder = X265_MALLOC(int, m_numEntries);
504
+                if (!m_encOrder)
505
+                {
506
+                    x265_log(m_param, X265_LOG_ERROR, "Encode order for 2 pass cannot be allocated\n");
507
+                    return false;
508
                 }
509
-                else
510
+                /* init all to skipped p frames */
511
+                for (int i = 0; i < m_numEntries; i++)
512
                 {
513
-                    char deltaPOC[128];
514
-                    char bUsed[40];
515
-                    memset(deltaPOC, 0, sizeof(deltaPOC));
516
-                    memset(bUsed, 0, sizeof(bUsed));
517
-                    e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s",
518
-                        &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
519
-                        &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
520
-                        &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed);
521
-                    splitdeltaPOC(deltaPOC, rce);
522
-                    splitbUsed(bUsed, rce);
523
-                    rce->rpsIdx = -1;
524
-                }
525
-                rce->keptAsRef = true;
526
-                rce->isIdr = false;
527
-                if (picType == 'b' || picType == 'p')
528
-                    rce->keptAsRef = false;
529
-                if (picType == 'I')
530
-                    rce->isIdr = true;
531
-                if (picType == 'I' || picType == 'i')
532
-                    rce->sliceType = I_SLICE;
533
-                else if (picType == 'P' || picType == 'p')
534
+                    RateControlEntry *rce = &m_rce2Pass[i];
535
                     rce->sliceType = P_SLICE;
536
-                else if (picType == 'B' || picType == 'b')
537
-                    rce->sliceType = B_SLICE;
538
-                else
539
-                    e = -1;
540
-                if (e < 10)
541
+                    rce->qScale = rce->newQScale = x265_qp2qScale(20);
542
+                    rce->miscBits = m_ncu + 10;
543
+                    rce->newQp = 0;
544
+                }
545
+                /* read stats */
546
+                p = statsIn;
547
+                double totalQpAq = 0;
548
+                for (int i = 0; i < m_numEntries; i++)
549
+                {
550
+                    RateControlEntry *rce, *rcePocOrder;
551
+                    int frameNumber;
552
+                    int encodeOrder;
553
+                    char picType;
554
+                    int e;
555
+                    char *next;
556
+                    double qpRc, qpAq, qNoVbv, qRceq;
557
+                    next = strstr(p, ";");
558
+                    if (next)
559
+                        *next++ = 0;
560
+                    e = sscanf(p, " in:%d out:%d", &frameNumber, &encodeOrder);
561
+                    if (frameNumber < 0 || frameNumber >= m_numEntries)
562
+                    {
563
+                        x265_log(m_param, X265_LOG_ERROR, "bad frame number (%d) at stats line %d\n", frameNumber, i);
564
+                        return false;
565
+                    }
566
+                    rce = &m_rce2Pass[encodeOrder];
567
+                    rcePocOrder = &m_rce2Pass[frameNumber];
568
+                    m_encOrder[frameNumber] = encodeOrder;
569
+                    if (!m_param->bMultiPassOptRPS)
570
+                    {
571
+                        int scenecut = 0;
572
+                        e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf sc:%d",
573
+                            &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
574
+                            &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
575
+                            &rce->skipCuCount, &scenecut);
576
+                        rcePocOrder->scenecut = scenecut != 0;
577
+                    }
578
+                    else
579
+                    {
580
+                        char deltaPOC[128];
581
+                        char bUsed[40];
582
+                        memset(deltaPOC, 0, sizeof(deltaPOC));
583
+                        memset(bUsed, 0, sizeof(bUsed));
584
+                        e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s",
585
+                            &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits,
586
+                            &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount,
587
+                            &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed);
588
+                        splitdeltaPOC(deltaPOC, rce);
589
+                        splitbUsed(bUsed, rce);
590
+                        rce->rpsIdx = -1;
591
+                    }
592
+                    rce->keptAsRef = true;
593
+                    rce->isIdr = false;
594
+                    if (picType == 'b' || picType == 'p')
595
+                        rce->keptAsRef = false;
596
+                    if (picType == 'I')
597
+                        rce->isIdr = true;
598
+                    if (picType == 'I' || picType == 'i')
599
+                        rce->sliceType = I_SLICE;
600
+                    else if (picType == 'P' || picType == 'p')
601
+                        rce->sliceType = P_SLICE;
602
+                    else if (picType == 'B' || picType == 'b')
603
+                        rce->sliceType = B_SLICE;
604
+                    else
605
+                        e = -1;
606
+                    if (e < 10)
607
+                    {
608
+                        x265_log(m_param, X265_LOG_ERROR, "statistics are damaged at line %d, parser out=%d\n", i, e);
609
+                        return false;
610
+                    }
611
+                    rce->qScale = rce->newQScale = x265_qp2qScale(qpRc);
612
+                    totalQpAq += qpAq;
613
+                    rce->qpNoVbv = qNoVbv;
614
+                    rce->qpaRc = qpRc;
615
+                    rce->qpAq = qpAq;
616
+                    rce->qRceq = qRceq;
617
+                    p = next;
618
+                }
619
+                X265_FREE(statsBuf);
620
+                if (m_param->rc.rateControlMode != X265_RC_CQP)
621
+                {
622
+                    m_start = 0;
623
+                    m_isQpModified = true;
624
+                    if (!initPass2())
625
+                        return false;
626
+                } /* else we're using constant quant, so no need to run the bitrate allocation */
627
+            }
628
+            else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode
629
+            {
630
+                if (m_param->rc.cuTree)
631
                 {
632
-                    x265_log(m_param, X265_LOG_ERROR, "statistics are damaged at line %d, parser out=%d\n", i, e);
633
-                    return false;
634
+                    if (!initCUTreeSharedMem())
635
+                    {
636
+                        return false;
637
+                    }
638
                 }
639
-                rce->qScale = rce->newQScale = x265_qp2qScale(qpRc);
640
-                totalQpAq += qpAq;
641
-                rce->qpNoVbv = qNoVbv;
642
-                rce->qpaRc = qpRc;
643
-                rce->qpAq = qpAq;
644
-                rce->qRceq = qRceq;
645
-                p = next;
646
-            }
647
-            X265_FREE(statsBuf);
648
-            if (m_param->rc.rateControlMode != X265_RC_CQP)
649
-            {
650
-                m_start = 0;
651
-                m_isQpModified = true;
652
-                if (!initPass2())
653
-                    return false;
654
-            } /* else we're using constant quant, so no need to run the bitrate allocation */
655
+            }
656
         }
657
         /* Open output file */
658
         /* If input and output files are the same, output to a temp file
659
@@ -682,19 +774,29 @@
660
             X265_FREE(p);
661
             if (m_param->rc.cuTree && !m_param->rc.bStatRead)
662
             {
663
-                statFileTmpname = strcatFilename(fileName, ".cutree.temp");
664
-                if (!statFileTmpname)
665
-                    return false;
666
-                m_cutreeStatFileOut = x265_fopen(statFileTmpname, "wb");
667
-                X265_FREE(statFileTmpname);
668
-                if (!m_cutreeStatFileOut)
669
+                if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode)
670
                 {
671
-                    x265_log_file(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s.cutree.temp\n", fileName);
672
-                    return false;
673
+                    statFileTmpname = strcatFilename(fileName, ".cutree.temp");
674
+                    if (!statFileTmpname)
675
+                        return false;
676
+                    m_cutreeStatFileOut = x265_fopen(statFileTmpname, "wb");
677
+                    X265_FREE(statFileTmpname);
678
+                    if (!m_cutreeStatFileOut)
679
+                    {
680
+                        x265_log_file(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s.cutree.temp\n", fileName);
681
+                        return false;
682
+                    }
683
+                }
684
+                else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode
685
+                {
686
+                    if (!initCUTreeSharedMem())
687
+                    {
688
+                        return false;
689
+                    }
690
                 }
691
             }
692
         }
693
-        if (m_param->rc.cuTree)
694
+        if (m_param->rc.cuTree && !m_cuTreeStats.qpBuffer[0])
695
         {
696
             if (m_param->rc.qgSize == 8)
697
             {
698
@@ -714,6 +816,10 @@
699
     return true;
700
 }
701
 
702
+void RateControl::skipCUTreeSharedMemRead(int32_t cnt)
703
+{
704
+    m_cutreeShrMem->skipRead(cnt);
705
+}
706
 void RateControl::reconfigureRC()
707
 {
708
     if (m_isVbv)
709
@@ -806,7 +912,7 @@
710
 
711
     TimingInfo *time = &sps.vuiParameters.timingInfo;
712
     int maxCpbOutputDelay = (int)(X265_MIN(m_param->keyframeMax * MAX_DURATION * time->timeScale / time->numUnitsInTick, INT_MAX));
713
-    int maxDpbOutputDelay = (int)(sps.maxDecPicBuffering * MAX_DURATION * time->timeScale / time->numUnitsInTick);
714
+    int maxDpbOutputDelay = (int)(sps.maxDecPicBuffering[sps.maxTempSubLayers - 1] * MAX_DURATION * time->timeScale / time->numUnitsInTick);
715
     int maxDelay = (int)(90000.0 * cpbSizeUnscale / bitRateUnscale + 0.5);
716
 
717
     hrd->initialCpbRemovalDelayLength = 2 + x265_clip3(4, 22, 32 - calcLength(maxDelay));
718
@@ -1000,125 +1106,103 @@
719
 {
720
     uint64_t allConstBits = 0, allCodedBits = 0;
721
     uint64_t allAvailableBits = uint64_t(m_param->rc.bitrate * 1000. * m_numEntries * m_frameDuration);
722
-    int startIndex, framesCount, endIndex;
723
+    int startIndex, endIndex;
724
     int fps = X265_MIN(m_param->keyframeMax, (int)(m_fps + 0.5));
725
-    startIndex = endIndex = framesCount = 0;
726
-    int diffQp = 0;
727
+    int distance = fps << 1;
728
+    distance = distance > m_param->keyframeMax ? (m_param->keyframeMax << 1) : m_param->keyframeMax;
729
+    startIndex = endIndex = 0;
730
     double targetBits = 0;
731
     double expectedBits = 0;
732
-    for (startIndex = m_start, endIndex = m_start; endIndex < m_numEntries; endIndex++)
733
+    double targetBits2 = 0;
734
+    double expectedBits2 = 0;
735
+    double cpxSum = 0;
736
+    double cpxSum2 = 0;
737
+
738
+    if (m_param->rc.rateControlMode == X265_RC_ABR)
739
     {
740
-        allConstBits += m_rce2Pass[endIndex].miscBits;
741
-        allCodedBits += m_rce2Pass[endIndex].coeffBits + m_rce2Pass[endIndex].mvBits;
742
-        if (m_param->rc.rateControlMode == X265_RC_CRF)
743
+        for (endIndex = m_start; endIndex < m_numEntries; endIndex++)
744
         {
745
-            framesCount = endIndex - startIndex + 1;
746
-            diffQp += int (m_rce2Pass[endIndex].qpaRc - m_rce2Pass[endIndex].qpNoVbv);
747
-            if (framesCount > fps)
748
-                diffQp -= int (m_rce2Pass[endIndex - fps].qpaRc - m_rce2Pass[endIndex - fps].qpNoVbv);
749
-            if (framesCount >= fps)
750
-            {
751
-                if (diffQp >= 1)
752
-                {
753
-                    if (!m_isQpModified && endIndex > fps)
754
-                    {
755
-                        double factor = 2;
756
-                        double step = 0;
757
-                        if (endIndex + fps >= m_numEntries)
758
-                        {
759
-                            m_start = endIndex - (endIndex % fps);
760
-                            return true;
761
-                        }
762
-                        for (int start = endIndex + 1; start <= endIndex + fps && start < m_numEntries; start++)
763
-                        {
764
-                            RateControlEntry *rce = &m_rce2Pass[start];
765
-                            targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
766
-                            expectedBits += qScale2bits(rce, rce->qScale);
767
-                        }
768
-                        if (expectedBits < 0.95 * targetBits)
769
-                        {
770
-                            m_isQpModified = true;
771
-                            m_isGopReEncoded = true;
772
-                            while (endIndex + fps < m_numEntries)
773
-                            {
774
-                                step = pow(2, factor / 6.0);
775
-                                expectedBits = 0;
776
-                                for (int start = endIndex + 1; start <= endIndex + fps; start++)
777
-                                {
778
-                                    RateControlEntry *rce = &m_rce2Pass[start];
779
-                                    rce->newQScale = rce->qScale / step;
780
-                                    X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n");
781
-                                    expectedBits += qScale2bits(rce, rce->newQScale);
782
-                                    rce->newQp = x265_qScale2qp(rce->newQScale);
783
-                                }
784
-                                if (expectedBits >= targetBits && step > 1)
785
-                                    factor *= 0.90;
786
-                                else
787
-                                    break;
788
-                            }
789
-
790
-                            if (m_isVbv && endIndex + fps < m_numEntries)
791
-                                if (!vbv2Pass((uint64_t)targetBits, endIndex + fps, endIndex + 1))
792
-                                    return false;
793
-
794
-                            targetBits = 0;
795
-                            expectedBits = 0;
796
-
797
-                            for (int start = endIndex - fps + 1; start <= endIndex; start++)
798
-                            {
799
-                                RateControlEntry *rce = &m_rce2Pass[start];
800
-                                targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
801
-                            }
802
-                            while (1)
803
-                            {
804
-                                step = pow(2, factor / 6.0);
805
-                                expectedBits = 0;
806
-                                for (int start = endIndex - fps + 1; start <= endIndex; start++)
807
-                                {
808
-                                    RateControlEntry *rce = &m_rce2Pass[start];
809
-                                    rce->newQScale = rce->qScale * step;
810
-                                    X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n");
811
-                                    expectedBits += qScale2bits(rce, rce->newQScale);
812
-                                    rce->newQp = x265_qScale2qp(rce->newQScale);
813
-                                }
814
-                                if (expectedBits > targetBits && step > 1)
815
-                                    factor *= 1.1;
816
-                                else
817
-                                     break;
818
-                            }
819
-                            if (m_isVbv)
820
-                                if (!vbv2Pass((uint64_t)targetBits, endIndex, endIndex - fps + 1))
821
-                                    return false;
822
-                            diffQp = 0;
823
-                            m_reencode = endIndex - fps + 1;
824
-                            endIndex = endIndex + fps;
825
-                            startIndex = endIndex + 1;
826
-                            m_start = startIndex;
827
-                            targetBits = expectedBits = 0;
828
-                        }
829
-                        else
830
-                            targetBits = expectedBits = 0;
831
-                    }
832
-                }
833
-                else
834
-                    m_isQpModified = false;
835
-            }
836
+            allConstBits += m_rce2Pass[endIndex].miscBits;
837
+            allCodedBits += m_rce2Pass[endIndex].coeffBits + m_rce2Pass[endIndex].mvBits;
838
         }
839
-    }
840
 
841
-    if (m_param->rc.rateControlMode == X265_RC_ABR)
842
-    {
843
         if (allAvailableBits < allConstBits)
844
         {
845
             x265_log(m_param, X265_LOG_ERROR, "requested bitrate is too low. estimated minimum is %d kbps\n",
846
-                     (int)(allConstBits * m_fps / framesCount * 1000.));
847
+                (int)(allConstBits * m_fps / (m_numEntries - m_start) * 1000.));
848
             return false;
849
         }
850
         if (!analyseABR2Pass(allAvailableBits))
851
             return false;
852
+
853
+        return true;
854
+    }
855
+
856
+    if (m_isQpModified)
857
+    {
858
+        return true;
859
+    }
860
+
861
+    if (m_start + (fps << 1) > m_numEntries)
862
+    {
863
+        return true;
864
+    }
865
+
866
+    for (startIndex = m_start, endIndex = m_numEntries - 1; startIndex < endIndex; startIndex++, endIndex--)
867
+    {
868
+        cpxSum += m_rce2Pass[startIndex].qScale / m_rce2Pass[startIndex].coeffBits;
869
+        cpxSum2 += m_rce2Pass[endIndex].qScale / m_rce2Pass[endIndex].coeffBits;
870
+
871
+        RateControlEntry *rce = &m_rce2Pass[startIndex];
872
+        targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
873
+        expectedBits += qScale2bits(rce, rce->qScale);
874
+
875
+        rce = &m_rce2Pass[endIndex];
876
+        targetBits2 += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
877
+        expectedBits2 += qScale2bits(rce, rce->qScale);
878
     }
879
 
880
-    m_start = X265_MAX(m_start, endIndex - fps);
881
+    if (expectedBits < 0.95 * targetBits || expectedBits2 < 0.95 * targetBits2)
882
+    {
883
+        if (cpxSum / cpxSum2 < 0.95 || cpxSum2 / cpxSum < 0.95)
884
+        {
885
+            m_isQpModified = true;
886
+            m_isGopReEncoded = true;
887
+
888
+            m_shortTermCplxSum = 0;
889
+            m_shortTermCplxCount = 0;
890
+            m_framesDone = m_start;
891
+
892
+            for (startIndex = m_start; startIndex < m_numEntries; startIndex++)
893
+            {
894
+                m_shortTermCplxSum *= 0.5;
895
+                m_shortTermCplxCount *= 0.5;
896
+                m_shortTermCplxSum += m_rce2Pass[startIndex].currentSatd / (CLIP_DURATION(m_frameDuration) / BASE_FRAME_DURATION);
897
+                m_shortTermCplxCount++;
898
+            }
899
+
900
+            m_bufferFill = m_rce2Pass[m_start - 1].bufferFill;
901
+            m_bufferFillFinal = m_rce2Pass[m_start - 1].bufferFillFinal;
902
+            m_bufferFillActual = m_rce2Pass[m_start - 1].bufferFillActual;
903
+
904
+            m_reencode = m_start;
905
+            m_start = m_numEntries;
906
+        }
907
+        else
908
+        {
909
+
910
+            m_isQpModified = false;
911
+            m_isGopReEncoded = false;
912
+        }
913
+    }
914
+    else
915
+    {
916
+
917
+        m_isQpModified = false;
918
+        m_isGopReEncoded = false;
919
+    }
920
+
921
+    m_start = X265_MAX(m_start, m_numEntries - distance + m_param->keyframeMax);
922
 
923
     return true;
924
 }
925
@@ -1271,6 +1355,16 @@
926
     m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType);
927
     rce->poc = m_curSlice->m_poc;
928
 
929
+    if (m_param->bEnableSBRC)
930
+    {
931
+        if (rce->poc == 0 || (m_framesDone % m_param->keyframeMax == 0))
932
+        {
933
+            //Reset SBRC buffer
934
+            m_encodedSegmentBits = 0;
935
+            m_segDur = 0;
936
+        }
937
+    }
938
+
939
     if (!m_param->bResetZoneConfig && (rce->encodeOrder % m_param->reconfigWindowSize == 0))
940
     {
941
         int index = m_zoneBufferIdx % m_param->rc.zonefileCount;
942
@@ -1304,7 +1398,8 @@
943
             {
944
                 m_param = m_param->rc.zones[i].zoneParam;
945
                 reconfigureRC();
946
-                init(*m_curSlice->m_sps);
947
+                if (!m_param->bNoResetZoneConfig)
948
+                    init(*m_curSlice->m_sps);
949
             }
950
         }
951
     }
952
@@ -1391,15 +1486,57 @@
953
             rce->frameSizeMaximum *= m_param->maxAUSizeFactor;
954
         }
955
     }
956
+
957
+    ///< regenerate the qp
958
     if (!m_isAbr && m_2pass && m_param->rc.rateControlMode == X265_RC_CRF)
959
     {
960
-        rce->qpPrev = x265_qScale2qp(rce->qScale);
961
-        rce->qScale = rce->newQScale;
962
-        rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = x265_qScale2qp(rce->newQScale);
963
-        m_qp = int(rce->qpaRc + 0.5);
964
-        rce->frameSizePlanned = qScale2bits(rce, rce->qScale);
965
-        m_framesDone++;
966
-        return m_qp;
967
+        if (!m_param->rc.bEncFocusedFramesOnly)
968
+        {
969
+            rce->qpPrev = x265_qScale2qp(rce->qScale);
970
+            if (m_param->bEnableSceneCutAwareQp)
971
+            {
972
+                double lqmin = m_lmin[m_sliceType];
973
+                double lqmax = m_lmax[m_sliceType];
974
+                if (m_param->bEnableSceneCutAwareQp & FORWARD)
975
+                    rce->newQScale = forwardMasking(curFrame, rce->newQScale);
976
+                if (m_param->bEnableSceneCutAwareQp & BACKWARD)
977
+                    rce->newQScale = backwardMasking(curFrame, rce->newQScale);
978
+                rce->newQScale = x265_clip3(lqmin, lqmax, rce->newQScale);
979
+            }
980
+            rce->qScale = rce->newQScale;
981
+            rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = x265_qScale2qp(rce->newQScale);
982
+            m_qp = int(rce->qpaRc + 0.5);
983
+            rce->frameSizePlanned = qScale2bits(rce, rce->qScale);
984
+            m_framesDone++;
985
+            return m_qp;
986
+        }
987
+        else
988
+        { 
989
+            int index = m_encOrder[rce->poc];
990
+            index++;
991
+            double totalDuration = m_frameDuration;
992
+            for (int j = 0; totalDuration < 1.0 && index < m_numEntries; j++)
993
+            {
994
+                switch (m_rce2Pass[index].sliceType)
995
+                {
996
+                case B_SLICE:
997
+                    curFrame->m_lowres.plannedType[j] = m_rce2Pass[index].keptAsRef ? X265_TYPE_BREF : X265_TYPE_B;
998
+                    break;
999
+                case P_SLICE:
1000
+                    curFrame->m_lowres.plannedType[j] = X265_TYPE_P;
1001
+                    break;
1002
+                case I_SLICE:
1003
+                    curFrame->m_lowres.plannedType[j] = m_param->bOpenGOP ? X265_TYPE_I : X265_TYPE_IDR;
1004
+                    break;
1005
+                default:
1006
+                    break;
1007
+                }
1008
+
1009
+                curFrame->m_lowres.plannedSatd[j] = m_rce2Pass[index].currentSatd;
1010
+                totalDuration += m_frameDuration;
1011
+                index++;
1012
+            }
1013
+        }
1014
     }
1015
 
1016
     if (m_isAbr || m_2pass) // ABR,CRF
1017
@@ -1655,10 +1792,25 @@
1018
             {
1019
                 m_cuTreeStats.qpBufPos++;
1020
 
1021
-                if (!fread(&type, 1, 1, m_cutreeStatFileIn))
1022
-                    goto fail;
1023
-                if (fread(m_cuTreeStats.qpBuffer[m_cuTreeStats.qpBufPos], sizeof(uint16_t), ncu, m_cutreeStatFileIn) != (size_t)ncu)
1024
-                    goto fail;
1025
+                if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode)
1026
+                {
1027
+                    if (!fread(&type, 1, 1, m_cutreeStatFileIn))
1028
+                        goto fail;
1029
+                    if (fread(m_cuTreeStats.qpBuffer[m_cuTreeStats.qpBufPos], sizeof(uint16_t), ncu, m_cutreeStatFileIn) != (size_t)ncu)
1030
+                        goto fail;
1031
+                }
1032
+                else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode
1033
+                {
1034
+                    if (!m_cutreeShrMem)
1035
+                    {
1036
+                        goto fail;
1037
+                    }
1038
+
1039
+                    CUTreeSharedDataItem shrItem;
1040
+                    shrItem.type = &type;
1041
+                    shrItem.stats = m_cuTreeStats.qpBuffer[m_cuTreeStats.qpBufPos];
1042
+                    m_cutreeShrMem->readNext(&shrItem, ReadSharedCUTreeData);
1043
+                }
1044
 
1045
                 if (type != sliceTypeActual && m_cuTreeStats.qpBufPos == 1)
1046
                 {
1047
@@ -1785,7 +1937,7 @@
1048
         m_sliderPos++;
1049
     }
1050
 
1051
-    if (m_sliceType == B_SLICE)
1052
+    if((!m_param->bEnableSBRC && m_sliceType == B_SLICE) || (m_param->bEnableSBRC && !IS_REFERENCED(curFrame)))
1053
     {
1054
         /* B-frames don't have independent rate control, but rather get the
1055
          * average QP of the two adjacent P-frames + an offset */
1056
@@ -1836,8 +1988,16 @@
1057
             double minScenecutQscale =x265_qp2qScale(ABR_SCENECUT_INIT_QP_MIN); 
1058
             m_lastQScaleForP_SLICE = X265_MAX(minScenecutQscale, m_lastQScaleForP_SLICE);
1059
         }
1060
+
1061
         double qScale = x265_qp2qScale(q);
1062
         rce->qpNoVbv = q;
1063
+
1064
+        if (m_param->bEnableSBRC)
1065
+        {
1066
+            qScale = tuneQscaleForSBRC(curFrame, qScale);
1067
+            rce->qpNoVbv = x265_qScale2qp(qScale);
1068
+        }
1069
+
1070
         double lmin = 0, lmax = 0;
1071
         if (m_isGrainEnabled && m_isFirstMiniGop)
1072
         {
1073
@@ -1890,7 +2050,7 @@
1074
                 qScale = x265_clip3(lqmin, lqmax, qScale);
1075
             }
1076
 
1077
-            if (!m_2pass || m_param->bliveVBV2pass)
1078
+            if (!m_2pass || m_param->bliveVBV2pass || (m_2pass && m_param->rc.rateControlMode == X265_RC_CRF && m_param->rc.bEncFocusedFramesOnly))
1079
             {
1080
                 /* clip qp to permissible range after vbv-lookahead estimation to avoid possible 
1081
                  * mispredictions by initial frame size predictors */
1082
@@ -1927,7 +2087,7 @@
1083
     else
1084
     {
1085
         double abrBuffer = 2 * m_rateTolerance * m_bitrate;
1086
-        if (m_2pass)
1087
+        if (m_2pass && (m_param->rc.rateControlMode != X265_RC_CRF || !m_param->rc.bEncFocusedFramesOnly))
1088
         {
1089
             double lmin = m_lminm_sliceType;
1090
             double lmax = m_lmaxm_sliceType;
1091
@@ -2057,6 +2217,19 @@
1092
 
1093
             if (m_param->rc.rateControlMode == X265_RC_CRF)
1094
             {
1095
+                if (m_param->bEnableSBRC)
1096
+                {
1097
+                    double rfConstant = m_param->rc.rfConstant;
1098
+                    if (m_currentSatd < rce->movingAvgSum)
1099
+                        rfConstant += 2;
1100
+                    double ipOffset = (curFrame->m_lowres.bScenecut ? m_ipOffset : m_ipOffset / 2.0);
1101
+                    rfConstant = (rce->sliceType == I_SLICE ? rfConstant - ipOffset :
1102
+                        (rce->sliceType == B_SLICE ? rfConstant + m_pbOffset : rfConstant));
1103
+                    double mbtree_offset = m_param->rc.cuTree ? (1.0 - m_param->rc.qCompress) * 13.5 : 0;
1104
+                    double qComp = (m_param->rc.cuTree && !m_param->rc.hevcAq) ? 0.99 : m_param->rc.qCompress;
1105
+                    m_rateFactorConstant = pow(m_currentSatd, 1.0 - qComp) /
1106
+                        x265_qp2qScale(rfConstant + mbtree_offset);
1107
+                }
1108
                 q = getQScale(rce, m_rateFactorConstant);
1109
                 x265_zone* zone = getZone();
1110
                 if (zone)
1111
@@ -2082,7 +2255,7 @@
1112
                 }
1113
                 double tunedQScale = tuneAbrQScaleFromFeedback(initialQScale);
1114
                 overflow = tunedQScale / initialQScale;
1115
-                q = !m_partialResidualFrames? tunedQScale : initialQScale;
1116
+                q = !m_partialResidualFrames ? tunedQScale : initialQScale;
1117
                 bool isEncodeEnd = (m_param->totalFrames && 
1118
                     m_framesDone > 0.75 * m_param->totalFrames) ? 1 : 0;
1119
                 bool isEncodeBeg = m_framesDone < (int)(m_fps + 0.5);
1120
@@ -2138,6 +2311,9 @@
1121
                 q = X265_MAX(minScenecutQscale, q);
1122
                 m_lastQScaleForP_SLICE = X265_MAX(minScenecutQscale, m_lastQScaleForP_SLICE);
1123
             }
1124
+            if (m_param->bEnableSBRC)
1125
+                q = tuneQscaleForSBRC(curFrame, q);
1126
+
1127
             rce->qpNoVbv = x265_qScale2qp(q);
1128
             if (m_sliceType == P_SLICE)
1129
             {
1130
@@ -2319,6 +2495,43 @@
1131
     return (p->coeff * var + p->offset) / (q * p->count);
1132
 }
1133
 
1134
+double RateControl::tuneQscaleForSBRC(Frame* curFrame, double q)
1135
+{
1136
+    int depth = 0;
1137
+    int framesDoneInSeg = m_framesDone % m_param->keyframeMax;
1138
+    if (framesDoneInSeg + m_param->lookaheadDepth <= m_param->keyframeMax)
1139
+        depth = m_param->lookaheadDepth;
1140
+    else
1141
+        depth = m_param->keyframeMax - framesDoneInSeg;
1142
+    for (int iterations = 0; iterations < 1000; iterations++)
1143
+    {
1144
+        double totalDuration = m_segDur;
1145
+        double frameBitsTotal = m_encodedSegmentBits + predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
1146
+        for (int i = 0; i < depth; i++)
1147
+        {
1148
+            int type = curFrame->m_lowres.plannedType[i];
1149
+            if (type == X265_TYPE_AUTO)
1150
+                break;
1151
+            int64_t satd = curFrame->m_lowres.plannedSatd[i] >> (X265_DEPTH - 8);
1152
+            type = IS_X265_TYPE_I(curFrame->m_lowres.plannedType[i]) ? I_SLICE : IS_X265_TYPE_B(curFrame->m_lowres.plannedType[i]) ? B_SLICE : P_SLICE;
1153
+            int predType = getPredictorType(curFrame->m_lowres.plannedType[i], type);
1154
+            double curBits = predictSize(&m_pred[predType], q, (double)satd);
1155
+            frameBitsTotal += curBits;
1156
+            totalDuration += m_frameDuration;
1157
+        }
1158
+        //Check for segment buffer overflow and adjust QP accordingly
1159
+        double segDur = m_param->keyframeMax / m_fps;
1160
+        double allowedSize = m_vbvMaxRate * segDur;
1161
+        double remDur = segDur - totalDuration;
1162
+        double remainingBits = frameBitsTotal / totalDuration * remDur;
1163
+        if (frameBitsTotal + remainingBits > 0.9 * allowedSize)
1164
+            q = q * 1.01;
1165
+        else
1166
+            break;
1167
+    }
1168
+    return q;
1169
+}
1170
+
1171
 double RateControl::clipQscale(Frame* curFrame, RateControlEntry* rce, double q)
1172
 {
1173
     // B-frames are not directly subject to VBV,
1174
@@ -2395,7 +2608,7 @@
1175
                     {
1176
                         finalDur = x265_clip3(0.4, 1.0, totalDuration);
1177
                     }
1178
-                    targetFill = X265_MIN(m_bufferFill + totalDuration * m_vbvMaxRate * 0.5, m_bufferSize * (1 - m_minBufferFill * finalDur));
1179
+                    targetFill = X265_MIN(m_bufferFill + totalDuration * m_vbvMaxRate * 0.5, m_bufferSize * (m_minBufferFill * finalDur));
1180
                     if (bufferFillCur < targetFill)
1181
                     {
1182
                         q *= 1.01;
1183
@@ -2828,7 +3041,7 @@
1184
 
1185
     if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion || bEnableDistOffset)
1186
     {
1187
-        if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF))
1188
+        if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF && !m_param->rc.bEncFocusedFramesOnly))
1189
         {
1190
             double avgQpRc = 0;
1191
             /* determine avg QP decided by VBV rate control */
1192
@@ -2862,8 +3075,9 @@
1193
     if (m_param->rc.rateControlMode == X265_RC_CRF)
1194
     {
1195
         double crfVal, qpRef = curEncData.m_avgQpRc;
1196
+
1197
         bool is2passCrfChange = false;
1198
-        if (m_2pass)
1199
+        if (m_2pass && !m_param->rc.bEncFocusedFramesOnly)
1200
         {
1201
             if (fabs(curEncData.m_avgQpRc - rce->qpPrev) > 0.1)
1202
             {
1203
@@ -2921,6 +3135,8 @@
1204
         m_wantedBitsWindow += m_frameDuration * m_bitrate;
1205
         m_totalBits += bits - rce->rowTotalBits;
1206
         m_encodedBits += actualBits;
1207
+        m_encodedSegmentBits += actualBits;
1208
+        m_segDur += m_frameDuration;
1209
         int pos = m_sliderPos - m_param->frameNumThreads;
1210
         if (pos >= 0)
1211
             m_encodedBitsWindow[pos % s_slidingWindowFrames] = actualBits;
1212
@@ -3048,10 +3264,26 @@
1213
     {
1214
         uint8_t sliceType = (uint8_t)rce->sliceType;
1215
         primitives.fix8Pack(m_cuTreeStats.qpBuffer[0], curFrame->m_lowres.qpCuTreeOffset, ncu);
1216
-        if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1)
1217
-            goto writeFailure;
1218
-        if (fwrite(m_cuTreeStats.qpBuffer[0], sizeof(uint16_t), ncu, m_cutreeStatFileOut) < (size_t)ncu)
1219
-            goto writeFailure;
1220
+
1221
+        if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode)
1222
+        {
1223
+            if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1)
1224
+                goto writeFailure;
1225
+            if (fwrite(m_cuTreeStats.qpBuffer[0], sizeof(uint16_t), ncu, m_cutreeStatFileOut) < (size_t)ncu)
1226
+                goto writeFailure;
1227
+        }
1228
+        else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode
1229
+        {
1230
+            if (!m_cutreeShrMem)
1231
+            {
1232
+                goto writeFailure;
1233
+            }
1234
+
1235
+            CUTreeSharedDataItem shrItem;
1236
+            shrItem.type = &sliceType;
1237
+            shrItem.stats = m_cuTreeStats.qpBuffer[0];
1238
+            m_cutreeShrMem->writeData(&shrItem, WriteSharedCUTreeData);
1239
+        } 
1240
     }
1241
     return 0;
1242
 
1243
@@ -3127,6 +3359,13 @@
1244
     if (m_cutreeStatFileIn)
1245
         fclose(m_cutreeStatFileIn);
1246
 
1247
+    if (m_cutreeShrMem)
1248
+    {
1249
+        m_cutreeShrMem->release();
1250
+        delete m_cutreeShrMem;
1251
+        m_cutreeShrMem = NULL;
1252
+    }
1253
+
1254
     X265_FREE(m_rce2Pass);
1255
     X265_FREE(m_encOrder);
1256
     for (int i = 0; i < 2; i++)
1257
@@ -3186,13 +3425,20 @@
1258
 double RateControl::forwardMasking(Frame* curFrame, double q)
1259
 {
1260
     double qp = x265_qScale2qp(q);
1261
-    uint32_t maxWindowSize = uint32_t((m_param->fwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5);
1262
-    uint32_t windowSize = maxWindowSize / 3;
1263
+    uint32_t maxWindowSize = uint32_t((m_param->fwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5);
1264
+    uint32_t windowSize[6], prevWindow = 0;
1265
     int lastScenecut = m_top->m_rateControl->m_lastScenecut;
1266
-    int lastIFrame = m_top->m_rateControl->m_lastScenecutAwareIFrame;
1267
-    double fwdRefQpDelta = double(m_param->fwdRefQpDelta);
1268
-    double fwdNonRefQpDelta = double(m_param->fwdNonRefQpDelta);
1269
-    double sliceTypeDelta = SLICE_TYPE_DELTA * fwdRefQpDelta;
1270
+
1271
+    double fwdRefQpDelta[6], fwdNonRefQpDelta[6], sliceTypeDelta[6];
1272
+    for (int i = 0; i < 6; i++)
1273
+    {
1274
+        windowSize[i] = prevWindow + (uint32_t((m_param->fwdScenecutWindow[i] / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5));
1275
+        fwdRefQpDelta[i] = double(m_param->fwdRefQpDelta[i]);
1276
+        fwdNonRefQpDelta[i] = double(m_param->fwdNonRefQpDelta[i]);
1277
+        sliceTypeDelta[i] = SLICE_TYPE_DELTA * fwdRefQpDelta[i];
1278
+        prevWindow = windowSize[i];
1279
+    }
1280
+
1281
 
1282
     //Check whether the current frame is within the forward window
1283
     if (curFrame->m_poc > lastScenecut && curFrame->m_poc <= (lastScenecut + int(maxWindowSize)))
1284
@@ -3205,45 +3451,51 @@
1285
         }
1286
         else if (curFrame->m_lowres.sliceType == X265_TYPE_P)
1287
         {
1288
-            if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize))
1289
-                && curFrame->m_poc >= lastIFrame))
1290
-            {
1291
-                //Add offsets corresponding to the window in which the P-frame occurs
1292
-                if (curFrame->m_poc <= (lastScenecut + int(windowSize)))
1293
-                    qp += WINDOW1_DELTA * (fwdRefQpDelta - sliceTypeDelta);
1294
-                else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize))))
1295
-                    qp += WINDOW2_DELTA * (fwdRefQpDelta - sliceTypeDelta);
1296
-                else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize))
1297
-                    qp += WINDOW3_DELTA * (fwdRefQpDelta - sliceTypeDelta);
1298
-            }
1299
+            //Add offsets corresponding to the window in which the P-frame occurs
1300
+            if (curFrame->m_poc <= (lastScenecut + int(windowSize[0])))
1301
+                qp += fwdRefQpDelta[0] - sliceTypeDelta[0];
1302
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[0]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[1]))))
1303
+                qp += fwdRefQpDelta[1] - sliceTypeDelta[1];
1304
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[1]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[2]))))
1305
+                qp += fwdRefQpDelta[2] - sliceTypeDelta[2];
1306
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[2]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[3]))))
1307
+                qp += fwdRefQpDelta[3] - sliceTypeDelta[3];
1308
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[3]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[4]))))
1309
+                qp += fwdRefQpDelta[4] - sliceTypeDelta[4];
1310
+            else if (curFrame->m_poc > lastScenecut + int(windowSize[4]))
1311
+                qp += fwdRefQpDelta[5] - sliceTypeDelta[5];
1312
         }
1313
         else if (curFrame->m_lowres.sliceType == X265_TYPE_BREF)
1314
         {
1315
-            if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize))
1316
-                && curFrame->m_poc >= lastIFrame))
1317
-            {
1318
-                //Add offsets corresponding to the window in which the B-frame occurs
1319
-                if (curFrame->m_poc <= (lastScenecut + int(windowSize)))
1320
-                    qp += WINDOW1_DELTA * fwdRefQpDelta;
1321
-                else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize))))
1322
-                    qp += WINDOW2_DELTA * fwdRefQpDelta;
1323
-                else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize))
1324
-                    qp += WINDOW3_DELTA * fwdRefQpDelta;
1325
-            }
1326
+            //Add offsets corresponding to the window in which the B-frame occurs
1327
+            if (curFrame->m_poc <= (lastScenecut + int(windowSize[0])))
1328
+                qp += fwdRefQpDelta[0];
1329
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[0]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[1]))))
1330
+                qp += fwdRefQpDelta[1];
1331
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[1]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[2]))))
1332
+                qp += fwdRefQpDelta[2];
1333
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[2]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[3]))))
1334
+                qp += fwdRefQpDelta[3];
1335
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[3]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[4]))))
1336
+                qp += fwdRefQpDelta[4];
1337
+            else if (curFrame->m_poc > lastScenecut + int(windowSize[4]))
1338
+                qp += fwdRefQpDelta[5];
1339
         }
1340
         else if (curFrame->m_lowres.sliceType == X265_TYPE_B)
1341
         {
1342
-            if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize))
1343
-                && curFrame->m_poc >= lastIFrame))
1344
-            {
1345
-                //Add offsets corresponding to the window in which the b-frame occurs
1346
-                if (curFrame->m_poc <= (lastScenecut + int(windowSize)))
1347
-                    qp += WINDOW1_DELTA * fwdNonRefQpDelta;
1348
-                else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize))))
1349
-                    qp += WINDOW2_DELTA * fwdNonRefQpDelta;
1350
-                else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize))
1351
-                    qp += WINDOW3_DELTA * fwdNonRefQpDelta;
1352
-            }
1353
+            //Add offsets corresponding to the window in which the b-frame occurs
1354
+            if (curFrame->m_poc <= (lastScenecut + int(windowSize[0])))
1355
+                qp += fwdNonRefQpDelta[0];
1356
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[0]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[1]))))
1357
+                qp += fwdNonRefQpDelta[1];
1358
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[1]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[2]))))
1359
+                qp += fwdNonRefQpDelta[2];
1360
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[2]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[3]))))
1361
+                qp += fwdNonRefQpDelta[3];
1362
+            else if (((curFrame->m_poc) > (lastScenecut + int(windowSize[3]))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize[4]))))
1363
+                qp += fwdNonRefQpDelta[4];
1364
+            else if (curFrame->m_poc > lastScenecut + int(windowSize[4]))
1365
+                qp += fwdNonRefQpDelta[5];
1366
         }
1367
     }
1368
 
1369
@@ -3252,24 +3504,75 @@
1370
 double RateControl::backwardMasking(Frame* curFrame, double q)
1371
 {
1372
     double qp = x265_qScale2qp(q);
1373
-    double fwdRefQpDelta = double(m_param->fwdRefQpDelta);
1374
-    double bwdRefQpDelta = double(m_param->bwdRefQpDelta);
1375
-    double bwdNonRefQpDelta = double(m_param->bwdNonRefQpDelta);
1376
+    uint32_t windowSize[6], prevWindow = 0;
1377
+    int lastScenecut = m_top->m_rateControl->m_lastScenecut;
1378
 
1379
-    if (curFrame->m_isInsideWindow == BACKWARD_WINDOW)
1380
+    double bwdRefQpDelta[6], bwdNonRefQpDelta[6], sliceTypeDelta[6];
1381
+    for (int i = 0; i < 6; i++)
1382
     {
1383
-        if (bwdRefQpDelta < 0)
1384
-            bwdRefQpDelta = WINDOW3_DELTA * fwdRefQpDelta;
1385
-        double sliceTypeDelta = SLICE_TYPE_DELTA * bwdRefQpDelta;
1386
-        if (bwdNonRefQpDelta < 0)
1387
-            bwdNonRefQpDelta = bwdRefQpDelta + sliceTypeDelta;
1388
+        windowSize[i] = prevWindow + (uint32_t((m_param->bwdScenecutWindow[i] / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5));
1389
+        prevWindow = windowSize[i];
1390
+        bwdRefQpDelta[i] = double(m_param->bwdRefQpDelta[i]);
1391
+        bwdNonRefQpDelta[i] = double(m_param->bwdNonRefQpDelta[i]);
1392
+
1393
+        if (bwdRefQpDelta[i] < 0)
1394
+            bwdRefQpDelta[i] = BWD_WINDOW_DELTA * m_param->fwdRefQpDelta[i];
1395
+        sliceTypeDelta[i] = SLICE_TYPE_DELTA * bwdRefQpDelta[i];
1396
+
1397
+        if (bwdNonRefQpDelta[i] < 0)
1398
+            bwdNonRefQpDelta[i] = bwdRefQpDelta[i] + sliceTypeDelta[i];
1399
+    }
1400
 
1401
+    if (curFrame->m_isInsideWindow == BACKWARD_WINDOW)
1402
+    {
1403
         if (curFrame->m_lowres.sliceType == X265_TYPE_P)
1404
-            qp += bwdRefQpDelta - sliceTypeDelta;
1405
+        {
1406
+            //Add offsets corresponding to the window in which the P-frame occurs
1407
+            if (curFrame->m_poc >= (lastScenecut - int(windowSize[0])))
1408
+                qp += bwdRefQpDelta[0] - sliceTypeDelta[0];
1409
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[0]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[1]))))
1410
+                qp += bwdRefQpDelta[1] - sliceTypeDelta[1];
1411
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[1]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[2]))))
1412
+                qp += bwdRefQpDelta[2] - sliceTypeDelta[2];
1413
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[2]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[3]))))
1414
+                qp += bwdRefQpDelta[3] - sliceTypeDelta[3];
1415
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[3]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[4]))))
1416
+                qp += bwdRefQpDelta[4] - sliceTypeDelta[4];
1417
+            else if (curFrame->m_poc < lastScenecut - int(windowSize[4]))
1418
+                qp += bwdRefQpDelta[5] - sliceTypeDelta[5];
1419
+        }
1420
         else if (curFrame->m_lowres.sliceType == X265_TYPE_BREF)
1421
-            qp += bwdRefQpDelta;
1422
+        {
1423
+            //Add offsets corresponding to the window in which the B-frame occurs
1424
+            if (curFrame->m_poc >= (lastScenecut - int(windowSize[0])))
1425
+                qp += bwdRefQpDelta[0];
1426
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[0]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[1]))))
1427
+                qp += bwdRefQpDelta[1];
1428
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[1]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[2]))))
1429
+                qp += bwdRefQpDelta[2];
1430
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[2]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[3]))))
1431
+                qp += bwdRefQpDelta[3];
1432
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[3]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[4]))))
1433
+                qp += bwdRefQpDelta[4];
1434
+            else if (curFrame->m_poc < lastScenecut - int(windowSize[4]))
1435
+                qp += bwdRefQpDelta[5];
1436
+        }
1437
         else if (curFrame->m_lowres.sliceType == X265_TYPE_B)
1438
-            qp += bwdNonRefQpDelta;
1439
+        {
1440
+            //Add offsets corresponding to the window in which the b-frame occurs
1441
+            if (curFrame->m_poc >= (lastScenecut - int(windowSize[0])))
1442
+                qp += bwdNonRefQpDelta[0];
1443
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[0]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[1]))))
1444
+                qp += bwdNonRefQpDelta[1];
1445
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[1]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[2]))))
1446
+                qp += bwdNonRefQpDelta[2];
1447
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[2]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[3]))))
1448
+                qp += bwdNonRefQpDelta[3];
1449
+            else if (((curFrame->m_poc) < (lastScenecut - int(windowSize[3]))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize[4]))))
1450
+                qp += bwdNonRefQpDelta[4];
1451
+            else if (curFrame->m_poc < lastScenecut - int(windowSize[4]))
1452
+                qp += bwdNonRefQpDelta[5];
1453
+        }
1454
     }
1455
 
1456
     return x265_qp2qScale(qp);
1457
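
For reference: the reworked scenecut-aware QP path above drops the fixed WINDOW1/2/3_DELTA offsets and instead takes six per-window durations and QP deltas from --masking-strength, accumulates the durations (given in milliseconds) into cumulative frame-count boundaries, and applies the delta of the window that contains the frame's distance from the last scenecut. A minimal standalone sketch of that selection logic follows; the names, the fps value and the example window/delta tables are placeholders, not the x265 API.

#include <cstdint>
#include <cstdio>

// Sketch only: pick the forward-masking QP delta for a frame that is
// pocOffset frames past the last scenecut, given six window durations in
// milliseconds (as with --masking-strength) and six per-window deltas.
static double pickFwdDelta(int pocOffset, const int windowMs[6],
                           const double qpDelta[6], double fps)
{
    uint32_t windowSize[6], prev = 0;
    for (int i = 0; i < 6; i++)
    {
        windowSize[i] = prev + uint32_t(windowMs[i] / 1000.0 * fps + 0.5);
        prev = windowSize[i];
    }
    for (int i = 0; i < 5; i++)
        if (pocOffset <= int(windowSize[i]))
            return qpDelta[i];
    return qpDelta[5];            // past the fifth boundary: last window applies
}

int main()
{
    const int    win[6]   = { 1000, 1000, 1000, 1000, 1000, 1000 }; // example values
    const double delta[6] = { 5.0, 4.0, 3.0, 2.0, 1.0, 0.5 };       // example values
    printf("delta at +10 frames (25 fps): %.1f\n", pickFwdDelta(10, win, delta, 25.0));
    return 0;
}
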
x265_3.5.tar.gz/source/encoder/ratecontrol.h -> x265_3.6.tar.gz/source/encoder/ratecontrol.h Changed
90
 
1
@@ -28,6 +28,7 @@
2
 
3
 #include "common.h"
4
 #include "sei.h"
5
+#include "ringmem.h"
6
 
7
 namespace X265_NS {
8
 // encoder namespace
9
@@ -46,11 +47,6 @@
10
 #define MIN_AMORTIZE_FRACTION 0.2
11
 #define CLIP_DURATION(f) x265_clip3(MIN_FRAME_DURATION, MAX_FRAME_DURATION, f)
12
 
13
-/*Scenecut Aware QP*/
14
-#define WINDOW1_DELTA           1.0 /* The offset for the frames coming in the window-1*/
15
-#define WINDOW2_DELTA           0.7 /* The offset for the frames coming in the window-2*/
16
-#define WINDOW3_DELTA           0.4 /* The offset for the frames coming in the window-3*/
17
-
18
 struct Predictor
19
 {
20
     double coeffMin;
21
@@ -73,6 +69,7 @@
22
     Predictor  rowPreds[3][2];
23
     Predictor* rowPred2;
24
 
25
+    int64_t currentSatd;
26
     int64_t lastSatd;      /* Contains the picture cost of the previous frame, required for resetAbr and VBV */
27
     int64_t leadingNoBSatd;
28
     int64_t rowTotalBits;  /* update cplxrsum and totalbits at the end of 2 rows */
29
@@ -87,6 +84,8 @@
30
     double  rowCplxrSum;
31
     double  qpNoVbv;
32
     double  bufferFill;
33
+    double  bufferFillFinal;
34
+    double  bufferFillActual;
35
     double  targetFill;
36
     bool    vbvEndAdj;
37
     double  frameDuration;
38
@@ -192,6 +191,8 @@
39
     double  m_qCompress;
40
     int64_t m_totalBits;        /* total bits used for already encoded frames (after ammortization) */
41
     int64_t m_encodedBits;      /* bits used for encoded frames (without ammortization) */
42
+    int64_t m_encodedSegmentBits;      /* bits used for encoded frames in a segment*/
43
+    double  m_segDur;
44
     double  m_fps;
45
     int64_t m_satdCostWindow[50];
46
     int64_t m_encodedBitsWindow[50];
47
@@ -237,6 +238,8 @@
48
     FILE*   m_statFileOut;
49
     FILE*   m_cutreeStatFileOut;
50
     FILE*   m_cutreeStatFileIn;
51
+    ///< store the cutree data in memory instead of file
52
+    RingMem *m_cutreeShrMem;
53
     double  m_lastAccumPNorm;
54
     double  m_expectedBitsSum;   /* sum of qscale2bits after rceq, ratefactor, and overflow, only includes finished frames */
55
     int64_t m_predictedBits;
56
@@ -254,6 +257,7 @@
57
     RateControl(x265_param& p, Encoder *enc);
58
     bool init(const SPS& sps);
59
     void initHRD(SPS& sps);
60
+    void initVBV(const SPS& sps);
61
     void reconfigureRC();
62
 
63
     void setFinalFrameCount(int count);
64
@@ -271,6 +275,9 @@
65
     int writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce);
66
     bool   initPass2();
67
 
68
+    bool initCUTreeSharedMem();
69
+    void skipCUTreeSharedMemRead(int32_t cnt);
70
+
71
     double forwardMasking(Frame* curFrame, double q);
72
     double backwardMasking(Frame* curFrame, double q);
73
 
74
@@ -291,6 +298,7 @@
75
     double rateEstimateQscale(Frame* pic, RateControlEntry *rce); // main logic for calculating QP based on ABR
76
     double tuneAbrQScaleFromFeedback(double qScale);
77
     double tuneQScaleForZone(RateControlEntry *rce, double qScale); // Tune qScale to adhere to zone budget
78
+    double tuneQscaleForSBRC(Frame* curFrame, double q); // Tune qScale to adhere to segment budget
79
     void   accumPQpUpdate();
80
 
81
     int    getPredictorType(int lowresSliceType, int sliceType);
82
@@ -311,6 +319,7 @@
83
     double tuneQScaleForGrain(double rcOverflow);
84
     void   splitdeltaPOC(char deltapoc, RateControlEntry *rce);
85
     void   splitbUsed(char deltapoc, RateControlEntry *rce);
86
+    void   checkAndResetCRF(RateControlEntry* rce);
87
 };
88
 }
89
 #endif // ifndef X265_RATECONTROL_H
90
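
The declarations above add tuneQscaleForSBRC(), the hook behind segment-based rate control (SBRC): for each segment of keyframeMax frames the encoder projects the bits already spent plus the bits the lookahead frames would take at the candidate qScale, extrapolates over the rest of the segment, and raises the qScale by 1% per iteration while the projection exceeds 90% of the segment budget (vbv-maxrate times the segment length). A simplified, self-contained sketch of that loop, assuming a toy bit predictor in place of x265's predictSize():

#include <cstdio>

// Sketch only: raise qScale until the projected segment size fits the budget.
static double tuneForSegment(double qScale, double bitsEncodedSoFar,
                             double durationSoFar, double segLengthSec,
                             double maxRateBps, double (*predictBits)(double))
{
    for (int iter = 0; iter < 1000; iter++)
    {
        double total = bitsEncodedSoFar + predictBits(qScale);   // projected bits at qScale
        double remaining = total / durationSoFar * (segLengthSec - durationSoFar);
        if (total + remaining > 0.9 * maxRateBps * segLengthSec)
            qScale *= 1.01;                                      // spend less on coming frames
        else
            break;
    }
    return qScale;
}

int main()
{
    double (*toyPredictor)(double) = [](double q) { return 4.0e6 / q; };  // placeholder model
    printf("tuned qScale: %.3f\n",
           tuneForSegment(20.0, 2.0e6, 1.0, 2.0, 5.0e6, toyPredictor));
    return 0;
}
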
x265_3.5.tar.gz/source/encoder/sei.cpp -> x265_3.6.tar.gz/source/encoder/sei.cpp Changed
10
 
1
@@ -68,7 +68,7 @@
2
     {
3
         if (nalUnitType != NAL_UNIT_UNSPECIFIED)
4
             bs.writeByteAlignment();
5
-        list.serialize(nalUnitType, bs);
6
+        list.serialize(nalUnitType, bs, (1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N)));
7
     }
8
 }
9
 
10
x265_3.5.tar.gz/source/encoder/sei.h -> x265_3.6.tar.gz/source/encoder/sei.h Changed
103
 
1
@@ -73,6 +73,101 @@
2
     }
3
 };
4
 
5
+/* Film grain characteristics */
6
+class FilmGrainCharacteristics : public SEI
7
+{
8
+  public:
9
+
10
+    FilmGrainCharacteristics()
11
+    {
12
+        m_payloadType = FILM_GRAIN_CHARACTERISTICS;
13
+        m_payloadSize = 0;
14
+    }
15
+
16
+    struct CompModelIntensityValues
17
+    {
18
+        uint8_t intensityIntervalLowerBound;
19
+        uint8_t intensityIntervalUpperBound;
20
+        int*    compModelValue;
21
+    };
22
+
23
+    struct CompModel
24
+    {
25
+        bool    bPresentFlag;
26
+        uint8_t numModelValues;
27
+        uint8_t m_filmGrainNumIntensityIntervalMinus1;
28
+        CompModelIntensityValues* intensityValues;
29
+    };
30
+
31
+    CompModel   m_compModel[MAX_NUM_COMPONENT];
32
+    bool        m_filmGrainCharacteristicsPersistenceFlag;
33
+    bool        m_filmGrainCharacteristicsCancelFlag;
34
+    bool        m_separateColourDescriptionPresentFlag;
35
+    bool        m_filmGrainFullRangeFlag;
36
+    uint8_t     m_filmGrainModelId;
37
+    uint8_t     m_blendingModeId;
38
+    uint8_t     m_log2ScaleFactor;
39
+    uint8_t     m_filmGrainBitDepthLumaMinus8;
40
+    uint8_t     m_filmGrainBitDepthChromaMinus8;
41
+    uint8_t     m_filmGrainColourPrimaries;
42
+    uint8_t     m_filmGrainTransferCharacteristics;
43
+    uint8_t     m_filmGrainMatrixCoeffs;
44
+
45
+    void writeSEI(const SPS&)
46
+    {
47
+        WRITE_FLAG(m_filmGrainCharacteristicsCancelFlag, "film_grain_characteristics_cancel_flag");
48
+
49
+        if (!m_filmGrainCharacteristicsCancelFlag)
50
+        {
51
+            WRITE_CODE(m_filmGrainModelId, 2, "film_grain_model_id");
52
+            WRITE_FLAG(m_separateColourDescriptionPresentFlag, "separate_colour_description_present_flag");
53
+            if (m_separateColourDescriptionPresentFlag)
54
+            {
55
+                WRITE_CODE(m_filmGrainBitDepthLumaMinus8, 3, "film_grain_bit_depth_luma_minus8");
56
+                WRITE_CODE(m_filmGrainBitDepthChromaMinus8, 3, "film_grain_bit_depth_chroma_minus8");
57
+                WRITE_FLAG(m_filmGrainFullRangeFlag, "film_grain_full_range_flag");
58
+                WRITE_CODE(m_filmGrainColourPrimaries, X265_BYTE, "film_grain_colour_primaries");
59
+                WRITE_CODE(m_filmGrainTransferCharacteristics, X265_BYTE, "film_grain_transfer_characteristics");
60
+                WRITE_CODE(m_filmGrainMatrixCoeffs, X265_BYTE, "film_grain_matrix_coeffs");
61
+            }
62
+            WRITE_CODE(m_blendingModeId, 2, "blending_mode_id");
63
+            WRITE_CODE(m_log2ScaleFactor, 4, "log2_scale_factor");
64
+            for (uint8_t c = 0; c < 3; c++)
65
+            {
66
+                WRITE_FLAG(m_compModel[c].bPresentFlag && m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 > 0 && m_compModel[c].numModelValues > 0, "comp_model_present_flag[c]");
67
+            }
68
+            for (uint8_t c = 0; c < 3; c++)
69
+            {
70
+                if (m_compModel[c].bPresentFlag && m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 > 0 && m_compModel[c].numModelValues > 0)
71
+                {
72
+                    assert(m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 <= 256);
73
+                    assert(m_compModel[c].numModelValues <= X265_BYTE);
74
+                    WRITE_CODE(m_compModel[c].m_filmGrainNumIntensityIntervalMinus1, X265_BYTE, "num_intensity_intervals_minus1[c]");
75
+                    WRITE_CODE(m_compModel[c].numModelValues - 1, 3, "num_model_values_minus1[c]");
76
+                    for (uint8_t interval = 0; interval < m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1; interval++)
77
+                    {
78
+                        WRITE_CODE(m_compModel[c].intensityValues[interval].intensityIntervalLowerBound, X265_BYTE, "intensity_interval_lower_bound[c][i]");
79
+                        WRITE_CODE(m_compModel[c].intensityValues[interval].intensityIntervalUpperBound, X265_BYTE, "intensity_interval_upper_bound[c][i]");
80
+                        for (uint8_t j = 0; j < m_compModel[c].numModelValues; j++)
81
+                        {
82
+                            WRITE_SVLC(m_compModel[c].intensityValues[interval].compModelValue[j], "comp_model_value[c][i]");
83
+                        }
84
+                    }
85
+                }
86
+            }
87
+            WRITE_FLAG(m_filmGrainCharacteristicsPersistenceFlag, "film_grain_characteristics_persistence_flag");
88
+        }
89
+        if (m_bitIf->getNumberOfWrittenBits() % X265_BYTE != 0)
90
+        {
91
+            WRITE_FLAG(1, "payload_bit_equal_to_one");
92
+            while (m_bitIf->getNumberOfWrittenBits() % X265_BYTE != 0)
93
+            {
94
+                WRITE_FLAG(0, "payload_bit_equal_to_zero");
95
+            }
96
+        }
97
+    }
98
+};
99
+
100
 static const uint32_t ISO_IEC_11578_LEN = 16;
101
 
102
 class SEIuserDataUnregistered : public SEI
103
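
The FilmGrainCharacteristics SEI added above writes the film_grain_characteristics() syntax from the HEVC specification: everything after the cancel flag is conditional, and each colour component carries a list of intensity intervals with per-interval model values (signed Exp-Golomb in the real payload). A rough, illustrative bit-count walk of that nesting; the Counter type, the example sizes and the fixed 8-bit treatment of the model values are placeholders, not the encoder's code.

#include <cstdint>
#include <cstdio>

struct Counter { uint64_t bits = 0; void u(int n) { bits += n; } void flag() { bits += 1; } };

int main()
{
    const int numIntensityIntervals = 8;   // example values only
    const int numModelValues = 3;
    Counter bs;
    bs.flag();                             // film_grain_characteristics_cancel_flag (0 = grain present)
    bs.u(2);                               // film_grain_model_id
    bs.flag();                             // separate_colour_description_present_flag (0 here)
    bs.u(2);                               // blending_mode_id
    bs.u(4);                               // log2_scale_factor
    for (int c = 0; c < 3; c++)
        bs.flag();                         // comp_model_present_flag[c]
    // assume a single present component
    bs.u(8);                               // num_intensity_intervals_minus1[c]
    bs.u(3);                               // num_model_values_minus1[c]
    for (int i = 0; i < numIntensityIntervals; i++)
    {
        bs.u(8); bs.u(8);                  // intensity interval lower/upper bound
        for (int j = 0; j < numModelValues; j++)
            bs.u(8);                       // comp_model_value[c][i] (se(v) in the real syntax)
    }
    bs.flag();                             // film_grain_characteristics_persistence_flag
    printf("approx payload bits before byte alignment: %llu\n", (unsigned long long)bs.bits);
    return 0;
}
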
x265_3.5.tar.gz/source/encoder/slicetype.cpp -> x265_3.6.tar.gz/source/encoder/slicetype.cpp Changed
1444
 
1
@@ -87,6 +87,14 @@
2
 
3
 namespace X265_NS {
4
 
5
+uint32_t acEnergyVarHist(uint64_t sum_ssd, int shift)
6
+{
7
+    uint32_t sum = (uint32_t)sum_ssd;
8
+    uint32_t ssd = (uint32_t)(sum_ssd >> 32);
9
+
10
+    return ssd - ((uint64_t)sum * sum >> shift);
11
+}
12
+
13
 bool computeEdge(pixel* edgePic, pixel* refPic, pixel* edgeTheta, intptr_t stride, int height, int width, bool bcalcTheta, pixel whitePixel)
14
 {
15
     intptr_t rowOne = 0, rowTwo = 0, rowThree = 0, colOne = 0, colTwo = 0, colThree = 0;
16
@@ -184,7 +192,7 @@
17
     {
18
         for (int colNum = 0; colNum < width; colNum++)
19
         {
20
-            if ((rowNum >= 2) && (colNum >= 2) && (rowNum != height - 2) && (colNum != width - 2)) //Ignoring the border pixels of the picture
21
+            if ((rowNum >= 2) && (colNum >= 2) && (rowNum < height - 2) && (colNum < width - 2)) //Ignoring the border pixels of the picture
22
             {
23
                 /*  5x5 Gaussian filter
24
                     2   4   5   4   2
25
@@ -519,7 +527,7 @@
26
                 if (param->rc.aqMode == X265_AQ_EDGE)
27
                     edgeFilter(curFrame, param);
28
 
29
-                if (param->rc.aqMode == X265_AQ_EDGE && !param->bHistBasedSceneCut && param->recursionSkipMode == EDGE_BASED_RSKIP)
30
+                if (param->rc.aqMode == X265_AQ_EDGE && param->recursionSkipMode == EDGE_BASED_RSKIP)
31
                 {
32
                     pixel* src = curFrame->m_edgePic + curFrame->m_fencPic->m_lumaMarginY * curFrame->m_fencPic->m_stride + curFrame->m_fencPic->m_lumaMarginX;
33
                     primitives.planecopy_pp_shr(src, curFrame->m_fencPic->m_stride, curFrame->m_edgeBitPic,
34
@@ -1050,7 +1058,48 @@
35
     m_countPreLookahead = 0;
36
 #endif
37
 
38
-    memset(m_histogram, 0, sizeof(m_histogram));
39
+    m_accHistDiffRunningAvgCb = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*));
40
+    m_accHistDiffRunningAvgCb[0] = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
41
+    memset(m_accHistDiffRunningAvgCb[0], 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
42
+    for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) {
43
+        m_accHistDiffRunningAvgCb[w] = m_accHistDiffRunningAvgCb[0] + w * NUMBER_OF_SEGMENTS_IN_HEIGHT;
44
+    }
45
+
46
+    m_accHistDiffRunningAvgCr = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*));
47
+    m_accHistDiffRunningAvgCr[0] = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
48
+    memset(m_accHistDiffRunningAvgCr[0], 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
49
+    for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) {
50
+        m_accHistDiffRunningAvgCr[w] = m_accHistDiffRunningAvgCr[0] + w * NUMBER_OF_SEGMENTS_IN_HEIGHT;
51
+    }
52
+
53
+    m_accHistDiffRunningAvg = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*));
54
+    m_accHistDiffRunningAvg[0] = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
55
+    memset(m_accHistDiffRunningAvg[0], 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT);
56
+    for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) {
57
+        m_accHistDiffRunningAvg[w] = m_accHistDiffRunningAvg[0] + w * NUMBER_OF_SEGMENTS_IN_HEIGHT;
58
+    }
59
+
60
+    m_resetRunningAvg = true;
61
+
62
+    m_segmentCountThreshold = (uint32_t)(((float)((NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT) * 50) / 100) + 0.5);
63
+
64
+    if (m_param->bEnableTemporalSubLayers > 2)
65
+    {
66
+        switch (m_param->bEnableTemporalSubLayers)
67
+        {
68
+        case 3:
69
+            m_gopId = 0;
70
+            break;
71
+        case 4:
72
+            m_gopId = 1;
73
+            break;
74
+        case 5:
75
+            m_gopId = 2;
76
+            break;
77
+        default:
78
+            break;
79
+        }
80
+    }
81
 }
82
 
83
 #if DETAILED_CU_STATS
84
@@ -1098,6 +1147,7 @@
85
             m_pool[i].stopWorkers();
86
     }
87
 }
88
+
89
 void Lookahead::destroy()
90
 {
91
     // these two queues will be empty unless the encode was aborted
92
@@ -1309,32 +1359,32 @@
93
     default:
94
         return;
95
     }
96
-    if (!m_param->analysisLoad || !m_param->bDisableLookahead)
97
+    if (!curFrame->m_param->analysisLoad || !curFrame->m_param->bDisableLookahead)
98
     {
99
         X265_CHECK(curFrame->m_lowres.costEst[b - p0][p1 - b] > 0, "Slice cost not estimated\n")
100
 
101
-        if (m_param->rc.cuTree && !m_param->rc.bStatRead)
102
+        if (curFrame->m_param->rc.cuTree && !curFrame->m_param->rc.bStatRead)
103
             /* update row satds based on cutree offsets */
104
             curFrame->m_lowres.satdCost = frameCostRecalculate(frames, p0, p1, b);
105
-        else if (!m_param->analysisLoad || m_param->scaleFactor || m_param->bAnalysisType == HEVC_INFO)
106
+        else if (!curFrame->m_param->analysisLoad || curFrame->m_param->scaleFactor || curFrame->m_param->bAnalysisType == HEVC_INFO)
107
         {
108
-            if (m_param->rc.aqMode)
109
+            if (curFrame->m_param->rc.aqMode)
110
                 curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstAq[b - p0][p1 - b];
111
             else
112
                 curFrame->m_lowres.satdCost = curFrame->m_lowres.costEst[b - p0][p1 - b];
113
         }
114
-        if (m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate)
115
+        if (curFrame->m_param->rc.vbvBufferSize && curFrame->m_param->rc.vbvMaxBitrate)
116
         {
117
             /* aggregate lowres row satds to CTU resolution */
118
             curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCosts[b - p0][p1 - b];
119
             uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0;
120
-            uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE);
121
-            uint32_t numCuInHeight = (m_param->sourceHeight + m_param->maxCUSize - 1) / m_param->maxCUSize;
122
+            uint32_t scale = curFrame->m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE);
123
+            uint32_t numCuInHeight = (curFrame->m_param->sourceHeight + curFrame->m_param->maxCUSize - 1) / curFrame->m_param->maxCUSize;
124
             uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height;
125
             double *qp_offset = 0;
126
             /* Factor in qpoffsets based on Aq/Cutree in CU costs */
127
-            if (m_param->rc.aqMode || m_param->bAQMotion)
128
-                qp_offset = (framesb->sliceType == X265_TYPE_B || !m_param->rc.cuTree) ? framesb->qpAqOffset : framesb->qpCuTreeOffset;
129
+            if (curFrame->m_param->rc.aqMode || curFrame->m_param->bAQMotion)
130
+                qp_offset = (framesb->sliceType == X265_TYPE_B || !curFrame->m_param->rc.cuTree) ? framesb->qpAqOffset : framesb->qpCuTreeOffset;
131
 
132
             for (uint32_t row = 0; row < numCuInHeight; row++)
133
             {
134
@@ -1350,7 +1400,7 @@
135
                         if (qp_offset)
136
                         {
137
                             double qpOffset;
138
-                            if (m_param->rc.qgSize == 8)
139
+                            if (curFrame->m_param->rc.qgSize == 8)
140
                                 qpOffset = (qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 +
141
                                 qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 + 1 +
142
                                 qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 + curFrame->m_lowres.maxBlocksInRowFullRes +
143
@@ -1361,7 +1411,7 @@
144
                             int32_t intraCuCost = curFrame->m_lowres.intraCost[lowresCuIdx];
146
                             curFrame->m_lowres.intraCost[lowresCuIdx] = (intraCuCost * x265_exp2fix8(qpOffset) + 128) >> 8;
146
                         }
147
-                        if (m_param->bIntraRefresh && slice->m_sliceType == X265_TYPE_P)
148
+                        if (curFrame->m_param->bIntraRefresh && slice->m_sliceType == X265_TYPE_P)
149
                             for (uint32_t x = curFrame->m_encData->m_pir.pirStartCol; x <= curFrame->m_encData->m_pir.pirEndCol; x++)
150
                                 diff += curFrame->m_lowres.intraCostlowresCuIdx - lowresCuCost;
151
                         curFrame->m_lowres.lowresCostForRclowresCuIdx = lowresCuCost;
152
@@ -1377,6 +1427,291 @@
153
     }
154
 }
155
 
156
+uint32_t LookaheadTLD::calcVariance(pixel* inpSrc, intptr_t stride, intptr_t blockOffset, uint32_t plane)
157
+{
158
+    pixel* src = inpSrc + blockOffset;
159
+
160
+    uint32_t var;
161
+    if (!plane)
162
+        var = acEnergyVarHist(primitives.cu[BLOCK_8x8].var(src, stride), 6);
163
+    else
164
+        var = acEnergyVarHist(primitives.cu[BLOCK_4x4].var(src, stride), 4);
165
+
166
+    x265_emms();
167
+    return var;
168
+}
169
+
170
+/*
171
+** Compute Block and Picture Variance, Block Mean for all blocks in the picture
172
+*/
173
+void LookaheadTLD::computePictureStatistics(Frame *curFrame)
174
+{
175
+    int maxCol = curFrame->m_fencPic->m_picWidth;
176
+    int maxRow = curFrame->m_fencPic->m_picHeight;
177
+    intptr_t inpStride = curFrame->m_fencPic->m_stride;
178
+
179
+    // Variance
180
+    uint64_t picTotVariance = 0;
181
+    uint32_t variance;
182
+
183
+    uint64_t blockXY = 0;
184
+    pixel* src = curFrame->m_fencPic->m_picOrg[0];
185
+
186
+    for (int blockY = 0; blockY < maxRow; blockY += 8)
187
+    {
188
+        uint64_t rowVariance = 0;
189
+        for (int blockX = 0; blockX < maxCol; blockX += 8)
190
+        {
191
+            intptr_t blockOffsetLuma = blockX + (blockY * inpStride);
192
+
193
+            variance = calcVariance(
194
+                src,
195
+                inpStride,
196
+                blockOffsetLuma, 0);
197
+
198
+            rowVariance += variance;
199
+            blockXY++;
200
+        }
201
+        picTotVariance += (uint16_t)(rowVariance / maxCol);
202
+    }
203
+
204
+    curFrame->m_lowres.picAvgVariance = (uint16_t)(picTotVariance / maxRow);
205
+
206
+    // Collect chroma variance
207
+    int hShift = curFrame->m_fencPic->m_hChromaShift;
208
+    int vShift = curFrame->m_fencPic->m_vChromaShift;
209
+
210
+    int maxColChroma = curFrame->m_fencPic->m_picWidth >> hShift;
211
+    int maxRowChroma = curFrame->m_fencPic->m_picHeight >> vShift;
212
+    intptr_t cStride = curFrame->m_fencPic->m_strideC;
213
+
214
+    pixel* srcCb = curFrame->m_fencPic->m_picOrg[1];
215
+
216
+    picTotVariance = 0;
217
+    for (int blockY = 0; blockY < maxRowChroma; blockY += 4)
218
+    {
219
+        uint64_t rowVariance = 0;
220
+        for (int blockX = 0; blockX < maxColChroma; blockX += 4)
221
+        {
222
+            intptr_t blockOffsetChroma = blockX + blockY * cStride;
223
+
224
+            variance = calcVariance(
225
+                srcCb,
226
+                cStride,
227
+                blockOffsetChroma, 1);
228
+
229
+            rowVariance += variance;
230
+            blockXY++;
231
+        }
232
+        picTotVariance += (uint16_t)(rowVariance / maxColChroma);
233
+    }
234
+
235
+    curFrame->m_lowres.picAvgVarianceCb = (uint16_t)(picTotVariance / maxRowChroma);
236
+
237
+
238
+    pixel* srcCr = curFrame->m_fencPic->m_picOrg[2];
239
+
240
+    picTotVariance = 0;
241
+    for (int blockY = 0; blockY < maxRowChroma; blockY += 4)
242
+    {
243
+        uint64_t rowVariance = 0;
244
+        for (int blockX = 0; blockX < maxColChroma; blockX += 4)
245
+        {
246
+            intptr_t blockOffsetChroma = blockX + blockY * cStride;
247
+
248
+            variance = calcVariance(
249
+                srcCr,
250
+                cStride,
251
+                blockOffsetChroma, 2);
252
+
253
+            rowVariance += variance;
254
+            blockXY++;
255
+        }
256
+        picTotVariance += (uint16_t)(rowVariance / maxColChroma);
257
+    }
258
+
259
+    curFrame->m_lowres.picAvgVarianceCr = (uint16_t)(picTotVariance / maxRowChroma);
260
+}
261
+
262
+/*
263
+* Compute histogram of n-bins for the input
264
+*/
265
+void LookaheadTLD::calculateHistogram(
266
+    pixel     *inputSrc,
267
+    uint32_t   inputWidth,
268
+    uint32_t   inputHeight,
269
+    intptr_t   stride,
270
+    uint8_t    dsFactor,
271
+    uint32_t  *histogram,
272
+    uint64_t  *sum)
273
+
274
+{
275
+    *sum = 0;
276
+
277
+    for (uint32_t verticalIdx = 0; verticalIdx < inputHeight; verticalIdx += dsFactor)
278
+    {
279
+        for (uint32_t horizontalIdx = 0; horizontalIdx < inputWidth; horizontalIdx += dsFactor)
280
+        {
281
+            ++(histogram[inputSrc[horizontalIdx]]);
282
+            *sum += inputSrc[horizontalIdx];
283
+        }
284
+        inputSrc += (stride << (dsFactor >> 1));
285
+    }
286
+
287
+    return;
288
+}
289
+
290
+/*
291
+* Compute histogram bins and chroma pixel intensity *
292
+*/
293
+void LookaheadTLD::computeIntensityHistogramBinsChroma(
294
+    Frame    *curFrame,
295
+    uint64_t *sumAverageIntensityCb,
296
+    uint64_t *sumAverageIntensityCr)
297
+{
298
+    uint64_t    sum;
299
+    uint8_t     dsFactor = 4;
300
+
301
+    uint32_t segmentWidth = curFrame->m_lowres.widthFullRes / NUMBER_OF_SEGMENTS_IN_WIDTH;
302
+    uint32_t segmentHeight = curFrame->m_lowres.heightFullRes / NUMBER_OF_SEGMENTS_IN_HEIGHT;
303
+
304
+    for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++)
305
+    {
306
+        for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++)
307
+        {
308
+            // Initialize bins to 1
309
+            for (uint32_t cuIndex = 0; cuIndex < 256; cuIndex++) {
310
+                curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1cuIndex = 1;
311
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][1][cuIndex] = 1;
311
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][2][cuIndex] = 1;
313
+
314
+            uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ?
315
+                curFrame->m_lowres.widthFullRes - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0;
316
+
317
+            uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ?
318
+                curFrame->m_lowres.heightFullRes - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0;
319
+
320
+
321
+            // U Histogram
322
+            calculateHistogram(
323
+                curFrame->m_fencPic->m_picOrg[1] + ((segmentInFrameWidthIndex * segmentWidth) >> 1) + (((segmentInFrameHeightIndex * segmentHeight) >> 1) * curFrame->m_fencPic->m_strideC),
324
+                (segmentWidth + segmentWidthOffset) >> 1,
325
+                (segmentHeight + segmentHeightOffset) >> 1,
326
+                curFrame->m_fencPic->m_strideC,
327
+                dsFactor,
328
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][1],
329
+                &sum);
330
+
331
+            sum = (sum << dsFactor);
332
+            *sumAverageIntensityCb += sum;
333
+            curFrame->m_lowres.averageIntensityPerSegment[segmentInFrameWidthIndex][segmentInFrameHeightIndex][1] =
334
+                (uint8_t)((sum + (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 3)) / (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 2));
335
+
336
+            for (uint16_t histogramBin = 0; histogramBin < HISTOGRAM_NUMBER_OF_BINS; histogramBin++) {
337
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][1][histogramBin] =
338
+                    curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][1][histogramBin] << dsFactor;
339
+            }
340
+
341
+            // V Histogram
342
+                curFrame->m_fencPic->m_picOrg[2] + ((segmentInFrameWidthIndex * segmentWidth) >> 1) + (((segmentInFrameHeightIndex * segmentHeight) >> 1) * curFrame->m_fencPic->m_strideC),
343
+                curFrame->m_fencPic->m_picOrg2 + ((segmentInFrameWidthIndex * segmentWidth) >> 1) + (((segmentInFrameHeightIndex * segmentHeight) >> 1) * curFrame->m_fencPic->m_strideC),
344
+                (segmentWidth + segmentWidthOffset) >> 1,
345
+                (segmentHeight + segmentHeightOffset) >> 1,
346
+                curFrame->m_fencPic->m_strideC,
347
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][2],
348
+                curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2,
349
+                &sum);
350
+
351
+            sum = (sum << dsFactor);
352
+            curFrame->m_lowres.averageIntensityPerSegment[segmentInFrameWidthIndex][segmentInFrameHeightIndex][2] =
353
+            curFrame->m_lowres.averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex2 =
354
+                (uint8_t)((sum + (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 3)) / (((segmentWidth + segmentHeightOffset) * (segmentHeight + segmentHeightOffset)) >> 2));
355
+
356
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][2][histogramBin] =
358
+                    curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][2][histogramBin] << dsFactor;
358
+                    curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2histogramBin << dsFactor;
359
+            }
360
+        }
361
+    }
362
+    return;
363
+
364
+}
365
+
366
+/*
367
+* Compute histogram bins and luma pixel intensity *
368
+*/
369
+void LookaheadTLD::computeIntensityHistogramBinsLuma(
370
+    Frame    *curFrame,
371
+    uint64_t *sumAvgIntensityTotalSegmentsLuma)
372
+{
373
+    uint64_t sum;
374
+
375
+    uint32_t segmentWidth = curFrame->m_lowres.quarterSampleLowResWidth / NUMBER_OF_SEGMENTS_IN_WIDTH;
376
+    uint32_t segmentHeight = curFrame->m_lowres.quarterSampleLowResHeight / NUMBER_OF_SEGMENTS_IN_HEIGHT;
377
+
378
+    for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++)
379
+    {
380
+        for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++)
381
+        {
382
+            // Initialize bins to 1
383
+            for (uint32_t cuIndex = 0; cuIndex < 256; cuIndex++) {
384
+                curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0cuIndex = 1;
385
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][0][cuIndex] = 1;
386
+
387
+            uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ?
388
+                curFrame->m_lowres.quarterSampleLowResWidth - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0;
389
+
390
+            uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ?
391
+                curFrame->m_lowres.quarterSampleLowResHeight - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0;
392
+
393
+            // Y Histogram
394
+            calculateHistogram(
395
+                curFrame->m_lowres.quarterSampleLowResBuffer + (curFrame->m_lowres.quarterSampleLowResOriginX + segmentInFrameWidthIndex * segmentWidth) + ((curFrame->m_lowres.quarterSampleLowResOriginY + segmentInFrameHeightIndex * segmentHeight) * curFrame->m_lowres.quarterSampleLowResStrideY),
396
+                segmentWidth + segmentWidthOffset,
397
+                segmentHeight + segmentHeightOffset,
398
+                curFrame->m_lowres.quarterSampleLowResStrideY,
399
+                1,
400
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][0],
401
+                &sum);
402
+
403
+            curFrame->m_lowres.averageIntensityPerSegment[segmentInFrameWidthIndex][segmentInFrameHeightIndex][0] = (uint8_t)((sum + (((segmentWidth + segmentWidthOffset)*(segmentWidth + segmentHeightOffset)) >> 1)) / ((segmentWidth + segmentWidthOffset)*(segmentHeight + segmentHeightOffset)));
404
+            (*sumAvgIntensityTotalSegmentsLuma) += (sum << 4);
405
+            for (uint32_t histogramBin = 0; histogramBin < HISTOGRAM_NUMBER_OF_BINS; histogramBin++)
406
+            {
407
+                curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][0][histogramBin] =
408
+                    curFrame->m_lowres.picHistogram[segmentInFrameWidthIndex][segmentInFrameHeightIndex][0][histogramBin] << 4;
409
+            }
410
+        }
411
+    }
412
+}
413
+
414
+void LookaheadTLD::collectPictureStatistics(Frame *curFrame)
415
+{
416
+
417
+    uint64_t sumAverageIntensityCb = 0;
418
+    uint64_t sumAverageIntensityCr = 0;
419
+    uint64_t sumAverageIntensity = 0;
420
+
421
+    // Histogram bins for Luma
422
+    computeIntensityHistogramBinsLuma(
423
+        curFrame,
424
+        &sumAverageIntensity);
425
+
426
+    // Histogram bins for Chroma
427
+    computeIntensityHistogramBinsChroma(
428
+        curFrame,
429
+        &sumAverageIntensityCb,
430
+        &sumAverageIntensityCr);
431
+
432
+    curFrame->m_lowres.averageIntensity[0] = (uint8_t)((sumAverageIntensity + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 1)) / (curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes));
433
+    curFrame->m_lowres.averageIntensity[1] = (uint8_t)((sumAverageIntensityCb + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 3)) / ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 2));
434
+    curFrame->m_lowres.averageIntensity[2] = (uint8_t)((sumAverageIntensityCr + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 3)) / ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 2));
435
+
436
+    computePictureStatistics(curFrame);
437
+
438
+    curFrame->m_lowres.bHistScenecutAnalyzed = false;
439
+}
440
+
441
 void PreLookaheadGroup::processTasks(int workerThreadID)
442
 {
443
     if (workerThreadID < 0)
444
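
The functions above feed the histogram-based scene-change detection (--hist-scenecut): per-segment intensity histograms are built from downsampled luma and chroma planes with every bin seeded to 1, and consecutive frames are later compared segment by segment against a threshold on the number of changed segments. A small self-contained sketch of the histogram build and a SAD-style comparison; names, sizes and the synthetic frames are illustrative only.

#include <cstdint>
#include <cstdio>

// Build a 256-bin histogram of one plane, sampling every dsFactor-th pixel.
static void buildHistogram(const uint8_t* src, int width, int height,
                           int stride, int dsFactor, uint32_t hist[256])
{
    for (int b = 0; b < 256; b++)
        hist[b] = 1;                                  // seed bins, as in the diff
    for (int y = 0; y < height; y += dsFactor)
        for (int x = 0; x < width; x += dsFactor)
            ++hist[src[y * stride + x]];
}

// Sum of absolute differences between two histograms of the same segment.
static uint64_t histogramSad(const uint32_t a[256], const uint32_t b[256])
{
    uint64_t sad = 0;
    for (int i = 0; i < 256; i++)
        sad += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return sad;
}

int main()
{
    const int w = 64, h = 64;
    static uint8_t prev[w * h], cur[w * h];
    for (int i = 0; i < w * h; i++) { prev[i] = uint8_t(i & 0xff); cur[i] = uint8_t(~i & 0xff); }
    uint32_t hPrev[256], hCur[256];
    buildHistogram(prev, w, h, w, 1, hPrev);
    buildHistogram(cur, w, h, w, 1, hCur);
    printf("segment histogram SAD: %llu\n", (unsigned long long)histogramSad(hPrev, hCur));
    return 0;
}
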
@@ -1393,6 +1728,10 @@
445
         preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
446
         if (m_lookahead.m_bAdaptiveQuant)
447
             tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param);
448
+
449
+        if (m_lookahead.m_param->bHistBasedSceneCut)
450
+            tld.collectPictureStatistics(preFrame);
451
+
452
         tld.lowresIntraEstimate(preFrame->m_lowres, m_lookahead.m_param->rc.qgSize);
453
         preFrame->m_lowresInit = true;
454
 
455
@@ -1401,6 +1740,53 @@
456
     m_lock.release();
457
 }
458
 
459
+
460
+void Lookahead::placeBref(Frame** frames, int start, int end, int num, int *brefs)
461
+{
462
+    int avg = (start + end) / 2;
463
+    if (m_param->bEnableTemporalSubLayers < 2)
464
+    {
465
+        (*frames[avg]).m_lowres.sliceType = X265_TYPE_BREF;
466
+        (*brefs)++;
467
+        return;
468
+    }
469
+    else
470
+    {
471
+        if (num <= 2)
472
+            return;
473
+        else
474
+        {
475
+            (*frames[avg]).m_lowres.sliceType = X265_TYPE_BREF;
476
+            (*brefs)++;
477
+            placeBref(frames, start, avg, avg - start, brefs);
478
+            placeBref(frames, avg + 1, end, end - avg, brefs);
479
+            return;
480
+        }
481
+    }
482
+}
483
+
484
+
485
+void Lookahead::compCostBref(Lowres **frames, int start, int end, int num)
486
+{
487
+    CostEstimateGroup estGroup(*this, frames);
488
+    int avg = (start + end) / 2;
489
+    if (num <= 2)
490
+    {
491
+        for (int i = start; i < end; i++)
492
+        {
493
+            estGroup.singleCost(start, end + 1, i + 1);
494
+        }
495
+        return;
496
+    }
497
+    else
498
+    {
499
+        estGroup.singleCost(start, end + 1, avg + 1);
500
+        compCostBref(frames, start, avg, avg - start);
501
+        compCostBref(frames, avg + 1, end, end - avg);
502
+        return;
503
+    }
504
+}
505
+
506
 /* called by API thread or worker thread with inputQueueLock acquired */
507
 void Lookahead::slicetypeDecide()
508
 {
509
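
placeBref() above builds the hierarchical B-frame (temporal sub-layer) structure by recursively promoting the middle frame of each half of the mini-GOP to a reference B while more than two frames remain, and compCostBref() estimates costs in the same order. A toy standalone illustration of that recursion; the vector-of-chars GOP model is a simplification, not the encoder's data structures.

#include <cstdio>
#include <string>
#include <vector>

// 'b' = non-reference B, 'R' = reference B (X265_TYPE_BREF in the diff).
static void placeBrefSketch(std::vector<char>& gop, int start, int end, int num, int& brefs)
{
    if (num <= 2)
        return;                               // nothing to promote in a pair
    int mid = (start + end) / 2;
    gop[mid] = 'R';
    brefs++;
    placeBrefSketch(gop, start, mid, mid - start, brefs);
    placeBrefSketch(gop, mid + 1, end, end - mid, brefs);
}

int main()
{
    std::vector<char> gop(8, 'b');            // eight B frames before the next P
    int brefs = 0;
    placeBrefSketch(gop, 0, (int)gop.size() - 1, (int)gop.size(), brefs);
    printf("mini-GOP: %s (brefs=%d)\n", std::string(gop.begin(), gop.end()).c_str(), brefs);
    return 0;
}
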
@@ -1416,6 +1802,18 @@
510
         ScopedLock lock(m_inputLock);
511
 
512
         Frame *curFrame = m_inputQueue.first();
513
+        if (m_param->bResetZoneConfig)
514
+        {
515
+            for (int i = 0; i < m_param->rc.zonefileCount; i++)
516
+            {
517
+                if (m_param->rc.zones[i].startFrame == curFrame->m_poc)
518
+                    m_param = m_param->rc.zones[i].zoneParam;
519
+                int nextZoneStart = m_param->rc.zones[i].startFrame;
520
+                nextZoneStart += nextZoneStart ? m_param->rc.zones[i].zoneParam->radl : 0;
521
+                if (nextZoneStart < curFrame->m_poc + maxSearch && curFrame->m_poc < nextZoneStart)
522
+                    maxSearch = nextZoneStart - curFrame->m_poc;
523
+            }
524
+        }
525
         int j;
526
         for (j = 0; j < m_param->bframes + 2; j++)
527
         {
528
@@ -1502,7 +1900,7 @@
529
          m_param->rc.cuTree || m_param->scenecutThreshold || m_param->bHistBasedSceneCut ||
530
          (m_param->lookaheadDepth && m_param->rc.vbvBufferSize)))
531
     {
532
-        if(!m_param->rc.bStatRead)
533
+        if (!m_param->rc.bStatRead)
534
             slicetypeAnalyse(frames, false);
535
         bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
536
         if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass)
537
@@ -1526,6 +1924,8 @@
538
         {
539
             Lowres& frm = list[bframes]->m_lowres;
540
 
541
+            if (frm.sliceTypeReq != X265_TYPE_AUTO && frm.sliceTypeReq != frm.sliceType)
542
+                frm.sliceType = frm.sliceTypeReq;
543
             if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid)
544
             {
545
                 frm.sliceType = X265_TYPE_B;
546
@@ -1583,12 +1983,9 @@
547
             }
548
             if (frm.sliceType == X265_TYPE_IDR && frm.bScenecut && isClosedGopRadl)
549
             {
550
-                if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && frm.m_bIsHardScenecut))
551
-                {
552
-                    for (int i = bframes; i < bframes + m_param->radl; i++)
553
-                        list[i]->m_lowres.sliceType = X265_TYPE_B;
554
-                    list[(bframes + m_param->radl)]->m_lowres.sliceType = X265_TYPE_IDR;
555
-                }
556
+                for (int i = bframes; i < bframes + m_param->radl; i++)
557
+                    list[i]->m_lowres.sliceType = X265_TYPE_B;
558
+                list[(bframes + m_param->radl)]->m_lowres.sliceType = X265_TYPE_IDR;
559
             }
560
             if (frm.sliceType == X265_TYPE_IDR)
561
             {
562
@@ -1649,138 +2046,454 @@
563
                 break;
564
         }
565
     }
566
-    if (bframes)
567
-        list[bframes - 1]->m_lowres.bLastMiniGopBFrame = true;
568
-    list[bframes]->m_lowres.leadingBframes = bframes;
569
-    m_lastNonB = &list[bframes]->m_lowres;
570
-    m_histogram[bframes]++;
571
-
572
-    /* insert a bref into the sequence */
573
-    if (m_param->bBPyramid && bframes > 1 && !brefs)
574
-    {
575
-        listbframes / 2->m_lowres.sliceType = X265_TYPE_BREF;
576
-        brefs++;
577
-    }
578
-    /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */
579
-    if (m_param->rc.rateControlMode != X265_RC_CQP)
580
-    {
581
-        int p0, p1, b;
582
-        /* For zero latency tuning, calculate frame cost to be used later in RC */
583
-        if (!maxSearch)
584
+
585
+    if (m_param->bEnableTemporalSubLayers > 2)
586
+    {
587
+        //Split the partial mini GOP into sub mini GOPs when temporal sub layers are enabled
588
+        if (bframes < m_param->bframes)
589
         {
590
-            for (int i = 0; i <= bframes; i++)
591
-               framesi + 1 = &listi->m_lowres;
592
-        }
593
+            int leftOver = bframes + 1;
594
+            int8_t gopId = m_gopId - 1;
595
+            int gopLen = x265_gop_ra_lengthgopId;
596
+            int listReset = 0;
597
 
598
-        /* estimate new non-B cost */
599
-        p1 = b = bframes + 1;
600
-        p0 = (IS_X265_TYPE_I(framesbframes + 1->sliceType)) ? b : 0;
601
+            m_outputLock.acquire();
602
 
603
-        CostEstimateGroup estGroup(*this, frames);
604
+            while ((gopId >= 0) && (leftOver > 3))
605
+            {
606
+                if (leftOver < gopLen)
607
+                {
608
+                    gopId = gopId - 1;
609
+                    gopLen = x265_gop_ra_lengthgopId;
610
+                    continue;
611
+                }
612
+                else
613
+                {
614
+                    int newbFrames = listReset + gopLen - 1;
615
+                    //Re-assign GOP
616
+                    listnewbFrames->m_lowres.sliceType = IS_X265_TYPE_I(listnewbFrames->m_lowres.sliceType) ? listnewbFrames->m_lowres.sliceType : X265_TYPE_P;
617
+                    if (newbFrames)
618
+                        listnewbFrames - 1->m_lowres.bLastMiniGopBFrame = true;
619
+                    listnewbFrames->m_lowres.leadingBframes = newbFrames;
620
+                    m_lastNonB = &listnewbFrames->m_lowres;
621
+
622
+                    /* insert a bref into the sequence */
623
+                    if (m_param->bBPyramid && newbFrames)
624
+                    {
625
+                        placeBref(list, listReset, newbFrames, newbFrames + 1, &brefs);
626
+                    }
627
+                    if (m_param->rc.rateControlMode != X265_RC_CQP)
628
+                    {
629
+                        int p0, p1, b;
630
+                        /* For zero latency tuning, calculate frame cost to be used later in RC */
631
+                        if (!maxSearch)
632
+                        {
633
+                            for (int i = listReset; i <= newbFrames; i++)
634
+                                framesi + 1 = &listlistReset + i->m_lowres;
635
+                        }
636
 
637
-        estGroup.singleCost(p0, p1, b);
638
+                        /* estimate new non-B cost */
639
+                        p1 = b = newbFrames + 1;
640
+                        p0 = (IS_X265_TYPE_I(framesnewbFrames + 1->sliceType)) ? b : listReset;
641
 
642
-        if (bframes)
643
+                        CostEstimateGroup estGroup(*this, frames);
644
+
645
+                        estGroup.singleCost(p0, p1, b);
646
+
647
+                        if (newbFrames)
648
+                            compCostBref(frames, listReset, newbFrames, newbFrames + 1);
649
+                    }
650
+
651
+                    m_inputLock.acquire();
652
+                    /* dequeue all frames from inputQueue that are about to be enqueued
653
+                     * in the output queue. The order is important because Frame can
654
+                     * only be in one list at a time */
655
+                    int64_t ptsX265_BFRAME_MAX + 1;
656
+                    for (int i = 0; i < gopLen; i++)
657
+                    {
658
+                        Frame *curFrame;
659
+                        curFrame = m_inputQueue.popFront();
660
+                        ptsi = curFrame->m_pts;
661
+                        maxSearch--;
662
+                    }
663
+                    m_inputLock.release();
664
+
665
+                    int idx = 0;
666
+                    /* add non-B to output queue */
667
+                    listnewbFrames->m_reorderedPts = ptsidx++;
668
+                    listnewbFrames->m_gopOffset = 0;
669
+                    listnewbFrames->m_gopId = gopId;
670
+                    listnewbFrames->m_tempLayer = x265_gop_ragopId0.layer;
671
+                    m_outputQueue.pushBack(*listnewbFrames);
672
+
673
+                    /* add B frames to output queue */
674
+                    int i = 1, j = 1;
675
+                    while (i < gopLen)
676
+                    {
677
+                        int offset = listReset + (x265_gop_ragopIdj.poc_offset - 1);
678
+                        if (!listoffset || offset == newbFrames)
679
+                            continue;
680
+
681
+                        // Assign gop offset and temporal layer of frames
682
+                        listoffset->m_gopOffset = j;
683
+                        listbframes->m_gopId = gopId;
684
+                        listoffset->m_tempLayer = x265_gop_ragopIdj++.layer;
685
+
686
+                        listoffset->m_reorderedPts = ptsidx++;
687
+                        m_outputQueue.pushBack(*listoffset);
688
+                        i++;
689
+                    }
690
+
691
+                    listReset += gopLen;
692
+                    leftOver = leftOver - gopLen;
693
+                    gopId -= 1;
694
+                    gopLen = (gopId >= 0) ? x265_gop_ra_lengthgopId : 0;
695
+                }
696
+            }
697
+
698
+            if (leftOver > 0 && leftOver < 4)
699
+            {
700
+                int64_t ptsX265_BFRAME_MAX + 1;
701
+                int idx = 0;
702
+
703
+                int newbFrames = listReset + leftOver - 1;
704
+                listnewbFrames->m_lowres.sliceType = IS_X265_TYPE_I(listnewbFrames->m_lowres.sliceType) ? listnewbFrames->m_lowres.sliceType : X265_TYPE_P;
705
+                if (newbFrames)
706
+                        listnewbFrames - 1->m_lowres.bLastMiniGopBFrame = true;
707
+                listnewbFrames->m_lowres.leadingBframes = newbFrames;
708
+                m_lastNonB = &listnewbFrames->m_lowres;
709
+
710
+                /* insert a bref into the sequence */
711
+                if (m_param->bBPyramid && (newbFrames- listReset) > 1)
712
+                    placeBref(list, listReset, newbFrames, newbFrames + 1, &brefs);
713
+
714
+                if (m_param->rc.rateControlMode != X265_RC_CQP)
715
+                {
716
+                    int p0, p1, b;
717
+                    /* For zero latency tuning, calculate frame cost to be used later in RC */
718
+                    if (!maxSearch)
719
+                    {
720
+                        for (int i = listReset; i <= newbFrames; i++)
721
+                            framesi + 1 = &listlistReset + i->m_lowres;
722
+                    }
723
+
724
+                        /* estimate new non-B cost */
725
+                    p1 = b = newbFrames + 1;
726
+                    p0 = (IS_X265_TYPE_I(framesnewbFrames + 1->sliceType)) ? b : listReset;
727
+
728
+                    CostEstimateGroup estGroup(*this, frames);
729
+
730
+                    estGroup.singleCost(p0, p1, b);
731
+
732
+                    if (newbFrames)
733
+                        compCostBref(frames, listReset, newbFrames, newbFrames + 1);
734
+                }
735
+
736
+                m_inputLock.acquire();
737
+                /* dequeue all frames from inputQueue that are about to be enqueued
738
+                 * in the output queue. The order is important because Frame can
739
+                 * only be in one list at a time */
740
+                for (int i = 0; i < leftOver; i++)
741
+                {
742
+                    Frame *curFrame;
743
+                    curFrame = m_inputQueue.popFront();
744
+                    ptsi = curFrame->m_pts;
745
+                    maxSearch--;
746
+                }
747
+                m_inputLock.release();
748
+
749
+                m_lastNonB = &listnewbFrames->m_lowres;
750
+                listnewbFrames->m_reorderedPts = ptsidx++;
751
+                listnewbFrames->m_gopOffset = 0;
752
+                listnewbFrames->m_gopId = -1;
753
+                listnewbFrames->m_tempLayer = 0;
754
+                m_outputQueue.pushBack(*listnewbFrames);
755
+                if (brefs)
756
+                {
757
+                    for (int i = listReset; i < newbFrames; i++)
758
+                    {
759
+                        if (listi->m_lowres.sliceType == X265_TYPE_BREF)
760
+                        {
761
+                            listi->m_reorderedPts = ptsidx++;
762
+                            listi->m_gopOffset = 0;
763
+                            listi->m_gopId = -1;
764
+                            listi->m_tempLayer = 0;
765
+                            m_outputQueue.pushBack(*listi);
766
+                        }
767
+                    }
768
+                }
769
+
770
+                /* add B frames to output queue */
771
+                for (int i = listReset; i < newbFrames; i++)
772
+                {
773
+                    /* push all the B frames into output queue except B-ref, which already pushed into output queue */
774
+                    if (listi->m_lowres.sliceType != X265_TYPE_BREF)
775
+                    {
776
+                        listi->m_reorderedPts = ptsidx++;
777
+                        listi->m_gopOffset = 0;
778
+                        listi->m_gopId = -1;
779
+                        listi->m_tempLayer = 1;
780
+                        m_outputQueue.pushBack(*listi);
781
+                    }
782
+                }
783
+            }
784
+        }
785
+        else
786
+        // Fill the complete mini GOP when temporal sub layers are enabled
787
         {
788
-            p0 = 0; // last nonb
789
-            bool isp0available = framesbframes + 1->sliceType == X265_TYPE_IDR ? false : true;
790
 
791
-            for (b = 1; b <= bframes; b++)
792
+            listbframes - 1->m_lowres.bLastMiniGopBFrame = true;
793
+            listbframes->m_lowres.leadingBframes = bframes;
794
+            m_lastNonB = &listbframes->m_lowres;
795
+
796
+            /* insert a bref into the sequence */
797
+            if (m_param->bBPyramid && !brefs)
798
             {
799
-                if (!isp0available)
800
-                    p0 = b;
801
+                placeBref(list, 0, bframes, bframes + 1, &brefs);
802
+            }
803
 
804
-                if (framesb->sliceType == X265_TYPE_B)
805
-                    for (p1 = b; framesp1->sliceType == X265_TYPE_B; p1++)
806
-                        ; // find new nonb or bref
807
-                else
808
-                    p1 = bframes + 1;
809
+            /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */
810
+            if (m_param->rc.rateControlMode != X265_RC_CQP)
811
+            {
812
+                int p0, p1, b;
813
+                /* For zero latency tuning, calculate frame cost to be used later in RC */
814
+                if (!maxSearch)
815
+                {
816
+                    for (int i = 0; i <= bframes; i++)
817
+                        framesi + 1 = &listi->m_lowres;
818
+                }
819
 
820
+                /* estimate new non-B cost */
821
+                p1 = b = bframes + 1;
822
+                p0 = (IS_X265_TYPE_I(framesbframes + 1->sliceType)) ? b : 0;
823
+
824
+                CostEstimateGroup estGroup(*this, frames);
825
                 estGroup.singleCost(p0, p1, b);
826
 
827
-                if (framesb->sliceType == X265_TYPE_BREF)
828
+                compCostBref(frames, 0, bframes, bframes + 1);
829
+            }
830
+
831
+            m_inputLock.acquire();
832
+            /* dequeue all frames from inputQueue that are about to be enqueued
833
+            * in the output queue. The order is important because Frame can
834
+            * only be in one list at a time */
835
+            int64_t ptsX265_BFRAME_MAX + 1;
836
+            for (int i = 0; i <= bframes; i++)
837
+            {
838
+                Frame *curFrame;
839
+                curFrame = m_inputQueue.popFront();
840
+                ptsi = curFrame->m_pts;
841
+                maxSearch--;
842
+            }
843
+            m_inputLock.release();
844
+
845
+            m_outputLock.acquire();
846
+
847
+            int idx = 0;
848
+            /* add non-B to output queue */
849
+            listbframes->m_reorderedPts = ptsidx++;
850
+            listbframes->m_gopOffset = 0;
851
+            listbframes->m_gopId = m_gopId;
852
+            listbframes->m_tempLayer = x265_gop_ram_gopId0.layer;
853
+            m_outputQueue.pushBack(*listbframes);
854
+
855
+            int i = 1, j = 1;
856
+            while (i <= bframes)
857
+            {
858
+                int offset = x265_gop_ram_gopIdj.poc_offset - 1;
859
+                if (!listoffset || offset == bframes)
860
+                    continue;
861
+
862
+                // Assign gop offset and temporal layer of frames
863
+                listoffset->m_gopOffset = j;
864
+                listoffset->m_gopId = m_gopId;
865
+                listoffset->m_tempLayer = x265_gop_ram_gopIdj++.layer;
866
+
867
+                /* add B frames to output queue */
868
+                listoffset->m_reorderedPts = ptsidx++;
869
+                m_outputQueue.pushBack(*listoffset);
870
+                i++;
871
+            }
872
+        }
873
+
874
+        bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth));
875
+        if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType))
876
+        {
877
+            m_inputLock.acquire();
878
+            Frame *curFrame = m_inputQueue.first();
879
+            frames0 = m_lastNonB;
880
+            int j;
881
+            for (j = 0; j < maxSearch; j++)
882
+            {
883
+                framesj + 1 = &curFrame->m_lowres;
884
+                curFrame = curFrame->m_next;
885
+            }
886
+            m_inputLock.release();
887
+
888
+            framesj + 1 = NULL;
889
+            if (!m_param->rc.bStatRead)
890
+                slicetypeAnalyse(frames, true);
891
+            bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
892
+            if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass)
893
+            {
894
+                int numFrames;
895
+                for (numFrames = 0; numFrames < maxSearch; numFrames++)
896
                 {
897
-                    p0 = b;
898
-                    isp0available = true;
899
+                    Lowres *fenc = framesnumFrames + 1;
900
+                    if (!fenc)
901
+                        break;
902
                 }
903
+                vbvLookahead(frames, numFrames, true);
904
             }
905
         }
906
-    }
907
 
908
-    m_inputLock.acquire();
909
-    /* dequeue all frames from inputQueue that are about to be enqueued
910
-     * in the output queue. The order is important because Frame can
911
-     * only be in one list at a time */
912
-    int64_t ptsX265_BFRAME_MAX + 1;
913
-    for (int i = 0; i <= bframes; i++)
914
-    {
915
-        Frame *curFrame;
916
-        curFrame = m_inputQueue.popFront();
917
-        ptsi = curFrame->m_pts;
918
-        maxSearch--;
919
-    }
920
-    m_inputLock.release();
921
 
922
-    m_outputLock.acquire();
923
-    /* add non-B to output queue */
924
-    int idx = 0;
925
-    listbframes->m_reorderedPts = ptsidx++;
926
-    m_outputQueue.pushBack(*listbframes);
927
-    /* Add B-ref frame next to P frame in output queue, the B-ref encode before non B-ref frame */
928
-    if (brefs)
929
+        m_outputLock.release();
930
+    }
931
+    else
932
     {
933
-        for (int i = 0; i < bframes; i++)
934
+
935
+        if (bframes)
936
+            listbframes - 1->m_lowres.bLastMiniGopBFrame = true;
937
+        listbframes->m_lowres.leadingBframes = bframes;
938
+        m_lastNonB = &listbframes->m_lowres;
939
+
940
+        /* insert a bref into the sequence */
941
+        if (m_param->bBPyramid && bframes > 1 && !brefs)
942
         {
943
-            if (listi->m_lowres.sliceType == X265_TYPE_BREF)
944
+            placeBref(list, 0, bframes, bframes + 1, &brefs);
945
+        }
946
+        /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */
947
+        if (m_param->rc.rateControlMode != X265_RC_CQP)
948
+        {
949
+            int p0, p1, b;
950
+            /* For zero latency tuning, calculate frame cost to be used later in RC */
951
+            if (!maxSearch)
952
             {
953
-                listi->m_reorderedPts = ptsidx++;
954
-                m_outputQueue.pushBack(*listi);
955
+                for (int i = 0; i <= bframes; i++)
956
+                    framesi + 1 = &listi->m_lowres;
957
+            }
958
+
959
+            /* estimate new non-B cost */
960
+            p1 = b = bframes + 1;
961
+            p0 = (IS_X265_TYPE_I(framesbframes + 1->sliceType)) ? b : 0;
962
+
963
+            CostEstimateGroup estGroup(*this, frames);
964
+            estGroup.singleCost(p0, p1, b);
965
+
966
+            if (m_param->bEnableTemporalSubLayers > 1 && bframes)
967
+            {
968
+                compCostBref(frames, 0, bframes, bframes + 1);
969
+            }
970
+            else
971
+            {
972
+                if (bframes)
973
+                {
974
+                    p0 = 0; // last nonb
975
+                    bool isp0available = framesbframes + 1->sliceType == X265_TYPE_IDR ? false : true;
976
+
977
+                    for (b = 1; b <= bframes; b++)
978
+                    {
979
+                        if (!isp0available)
980
+                            p0 = b;
981
+
982
+                        if (framesb->sliceType == X265_TYPE_B)
983
+                            for (p1 = b; framesp1->sliceType == X265_TYPE_B; p1++)
984
+                                ; // find new nonb or bref
985
+                        else
986
+                            p1 = bframes + 1;
987
+
988
+                        estGroup.singleCost(p0, p1, b);
989
+
990
+                        if (framesb->sliceType == X265_TYPE_BREF)
991
+                        {
992
+                            p0 = b;
993
+                            isp0available = true;
994
+                        }
995
+                    }
996
+                }
997
             }
998
         }
999
-    }
1000
 
1001
-    /* add B frames to output queue */
1002
-    for (int i = 0; i < bframes; i++)
1003
-    {
1004
-        /* push all the B frames into output queue except B-ref, which already pushed into output queue */
1005
-        if (listi->m_lowres.sliceType != X265_TYPE_BREF)
1006
+        m_inputLock.acquire();
1007
+        /* dequeue all frames from inputQueue that are about to be enqueued
1008
+         * in the output queue. The order is important because Frame can
1009
+         * only be in one list at a time */
1010
+        int64_t ptsX265_BFRAME_MAX + 1;
1011
+        for (int i = 0; i <= bframes; i++)
1012
+        {
1013
+            Frame *curFrame;
1014
+            curFrame = m_inputQueue.popFront();
1015
+            ptsi = curFrame->m_pts;
1016
+            maxSearch--;
1017
+        }
1018
+        m_inputLock.release();
1019
+
1020
+        m_outputLock.acquire();
1021
+
1022
+        /* add non-B to output queue */
1023
+        int idx = 0;
1024
+        listbframes->m_reorderedPts = ptsidx++;
1025
+        m_outputQueue.pushBack(*listbframes);
1026
+
1027
+        /* Add B-ref frame next to P frame in output queue, the B-ref encode before non B-ref frame */
1028
+        if (brefs)
1029
         {
1030
-            listi->m_reorderedPts = ptsidx++;
1031
-            m_outputQueue.pushBack(*listi);
1032
+            for (int i = 0; i < bframes; i++)
1033
+            {
1034
+                if (listi->m_lowres.sliceType == X265_TYPE_BREF)
1035
+                {
1036
+                    listi->m_reorderedPts = ptsidx++;
1037
+                    m_outputQueue.pushBack(*listi);
1038
+                }
1039
+            }
1040
         }
1041
-    }
1042
 
1043
-    bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth));
1044
-    if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType))
1045
-    {
1046
-        m_inputLock.acquire();
1047
-        Frame *curFrame = m_inputQueue.first();
1048
-        frames0 = m_lastNonB;
1049
-        int j;
1050
-        for (j = 0; j < maxSearch; j++)
1051
+        /* add B frames to output queue */
1052
+        for (int i = 0; i < bframes; i++)
1053
         {
1054
-            framesj + 1 = &curFrame->m_lowres;
1055
-            curFrame = curFrame->m_next;
1056
+            /* push all the B frames into output queue except B-ref, which already pushed into output queue */
1057
+            if (listi->m_lowres.sliceType != X265_TYPE_BREF)
1058
+            {
1059
+                listi->m_reorderedPts = ptsidx++;
1060
+                m_outputQueue.pushBack(*listi);
1061
+            }
1062
         }
1063
-        m_inputLock.release();
1064
 
1065
-        framesj + 1 = NULL;
1066
-        if (!m_param->rc.bStatRead)
1067
-            slicetypeAnalyse(frames, true);
1068
-        bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
1069
-        if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass)
1070
+
1071
+        bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth));
1072
+        if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType))
1073
         {
1074
-            int numFrames;
1075
-            for (numFrames = 0; numFrames < maxSearch; numFrames++)
1076
+            m_inputLock.acquire();
1077
+            Frame *curFrame = m_inputQueue.first();
1078
+            frames0 = m_lastNonB;
1079
+            int j;
1080
+            for (j = 0; j < maxSearch; j++)
1081
+            {
1082
+                framesj + 1 = &curFrame->m_lowres;
1083
+                curFrame = curFrame->m_next;
1084
+            }
1085
+            m_inputLock.release();
1086
+
1087
+            framesj + 1 = NULL;
1088
+            if (!m_param->rc.bStatRead)
1089
+                slicetypeAnalyse(frames, true);
1090
+            bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
1091
+            if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass)
1092
             {
1093
-                Lowres *fenc = framesnumFrames + 1;
1094
-                if (!fenc)
1095
-                    break;
1096
+                int numFrames;
1097
+                for (numFrames = 0; numFrames < maxSearch; numFrames++)
1098
+                {
1099
+                    Lowres *fenc = framesnumFrames + 1;
1100
+                    if (!fenc)
1101
+                        break;
1102
+                }
1103
+                vbvLookahead(frames, numFrames, true);
1104
             }
1105
-            vbvLookahead(frames, numFrames, true);
1106
         }
1107
+
1108
+        m_outputLock.release();
1109
     }
1110
-    m_outputLock.release();
1111
 }
1112
 
1113
 void Lookahead::vbvLookahead(Lowres **frames, int numFrames, int keyframe)
1114
@@ -1909,6 +2622,8 @@
1115
             nextZoneStart += (i + 1 < m_param->rc.zonefileCount) ? m_param->rc.zonesi + 1.startFrame + m_param->rc.zonesi + 1.zoneParam->radl : m_param->totalFrames;
1116
             if (curZoneStart <= frames0->frameNum && nextZoneStart > frames0->frameNum)
1117
                 m_param->keyframeMax = nextZoneStart - curZoneStart;
1118
+            if (m_param->rc.zonesm_param->rc.zonefileCount - 1.startFrame <= frames0->frameNum && nextZoneStart == 0)
1119
+                m_param->keyframeMax = m_param->rc.zones0.keyframeMax;
1120
         }
1121
     }
1122
     int keylimit = m_param->keyframeMax;
1123
@@ -2013,44 +2728,13 @@
1124
     int numAnalyzed = numFrames;
1125
     bool isScenecut = false;
1126
 
1127
-    /* Temporal computations for scenecut detection */
1128
     if (m_param->bHistBasedSceneCut)
1129
-    {
1130
-        for (int i = numFrames - 1; i > 0; i--)
1131
-        {
1132
-            if (framesi->interPCostPercDiff > 0.0)
1133
-                continue;
1134
-            int64_t interCost = framesi->costEst10;
1135
-            int64_t intraCost = framesi->costEst00;
1136
-            if (interCost < 0 || intraCost < 0)
1137
-                continue;
1138
-            int times = 0;
1139
-            double averagePcost = 0.0, averageIcost = 0.0;
1140
-            for (int j = i - 1; j >= 0 && times < 5; j--, times++)
1141
-            {
1142
-                if (framesj->costEst00 > 0 && framesj->costEst10 > 0)
1143
-                {
1144
-                    averageIcost += framesj->costEst00;
1145
-                    averagePcost += framesj->costEst10;
1146
-                }
1147
-                else
1148
-                    times--;
1149
-            }
1150
-            if (times)
1151
-            {
1152
-                averageIcost = averageIcost / times;
1153
-                averagePcost = averagePcost / times;
1154
-                framesi->interPCostPercDiff = abs(interCost - averagePcost) / X265_MIN(interCost, averagePcost) * 100;
1155
-                framesi->intraCostPercDiff = abs(intraCost - averageIcost) / X265_MIN(intraCost, averageIcost) * 100;
1156
-            }
1157
-        }
1158
-    }
1159
-
1160
-    /* When scenecut threshold is set, use scenecut detection for I frame placements */
1161
-    if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && frames1->bScenecut))
1162
+        isScenecut = histBasedScenecut(frames, 0, 1, origNumFrames);
1163
+    else
1164
         isScenecut = scenecut(frames, 0, 1, true, origNumFrames);
1165
 
1166
-    if (isScenecut && (m_param->bHistBasedSceneCut || m_param->scenecutThreshold))
1167
+    /* When scenecut threshold is set, use scenecut detection for I frame placements */
1168
+    if (m_param->scenecutThreshold && isScenecut)
1169
     {
1170
         frames1->sliceType = X265_TYPE_I;
1171
         return;
1172
@@ -2061,8 +2745,7 @@
1173
         m_extendGopBoundary = false;
1174
         for (int i = m_param->bframes + 1; i < origNumFrames; i += m_param->bframes + 1)
1175
         {
1176
-            if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && framesi + 1->bScenecut))
1177
-                scenecut(frames, i, i + 1, true, origNumFrames);
1178
+            scenecut(frames, i, i + 1, true, origNumFrames);
1179
 
1180
             for (int j = i + 1; j <= X265_MIN(i + m_param->bframes + 1, origNumFrames); j++)
1181
             {
1182
@@ -2175,10 +2858,8 @@
1183
         {
1184
             for (int j = 1; j < numBFrames + 1; j++)
1185
             {
1186
-                bool isNextScenecut = false;
1187
-                if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && framesj + 1->bScenecut))
1188
-                    isNextScenecut = scenecut(frames, j, j + 1, false, origNumFrames);
1189
-                if (isNextScenecut || (bForceRADL && framesj->frameNum == preRADL))
1190
+                if (scenecut(frames, j, j + 1, false, origNumFrames) ||
1191
+                    (bForceRADL && (framesj->frameNum == preRADL)))
1192
                 {
1193
                     framesj->sliceType = X265_TYPE_P;
1194
                     numAnalyzed = j;
1195
@@ -2244,9 +2925,10 @@
1196
         /* Where A and B are scenes: AAAAAABBBAAAAAA
1197
          * If BBB is shorter than (maxp1-p0), it is detected as a flash
1198
          * and not considered a scenecut. */
1199
+
1200
         for (int cp1 = p1; cp1 <= maxp1; cp1++)
1201
         {
1202
-            if (!scenecutInternal(frames, p0, cp1, false) && !m_param->bHistBasedSceneCut)
1203
+            if (!scenecutInternal(frames, p0, cp1, false))
1204
             {
1205
                 /* Any frame in between p0 and cur_p1 cannot be a real scenecut. */
1206
                 for (int i = cp1; i > p0; i--)
1207
@@ -2255,7 +2937,7 @@
1208
                     noScenecuts = false;
1209
                 }
1210
             }
1211
-            else if ((m_param->bHistBasedSceneCut && framescp1->m_bIsMaxThres) || scenecutInternal(frames, cp1 - 1, cp1, false))
1212
+            else if (scenecutInternal(frames, cp1 - 1, cp1, false))
1213
             {
1214
                 /* If current frame is a Scenecut from p0 frame as well as Scenecut from
1215
                  * preceeding frame, mark it as a Scenecut */
1216
@@ -2316,9 +2998,6 @@
1217
 
1218
     if (!framesp1->bScenecut)
1219
         return false;
1220
-    /* Check only scene transitions if max threshold */
1221
-    if (m_param->bHistBasedSceneCut && framesp1->m_bIsMaxThres)
1222
-        return framesp1->bScenecut;
1223
 
1224
     return scenecutInternal(frames, p0, p1, bRealScenecut);
1225
 }
1226
@@ -2336,19 +3015,8 @@
1227
     /* magic numbers pulled out of thin air */
1228
     float threshMin = (float)(threshMax * 0.25);
1229
     double bias = m_param->scenecutBias;
1230
-    if (m_param->bHistBasedSceneCut)
1231
-    {
1232
-        double minT = TEMPORAL_SCENECUT_THRESHOLD * (1 + m_param->edgeTransitionThreshold);
1233
-        if (frame->interPCostPercDiff > minT || frame->intraCostPercDiff > minT)
1234
-        {
1235
-            if (bRealScenecut && frame->bScenecut)
1236
-                x265_log(m_param, X265_LOG_DEBUG, "scene cut at %d \n", frame->frameNum);
1237
-            return frame->bScenecut;
1238
-        }
1239
-        else
1240
-            return false;
1241
-    }
1242
-    else if (bRealScenecut)
1243
+
1244
+    if (bRealScenecut)
1245
     {
1246
         if (m_param->keyframeMin == m_param->keyframeMax)
1247
             threshMin = threshMax;
1248
@@ -2375,6 +3043,167 @@
1249
     return res;
1250
 }
1251
 
1252
+bool Lookahead::detectHistBasedSceneChange(Lowres **frames, int p0, int p1, int p2)
1253
+{
1254
+    bool isAbruptChange;
1255
+    bool isSceneChange;
1256
+
1257
+    Lowres  *previousFrame = framesp0;
1258
+    Lowres  *currentFrame = framesp1;
1259
+    Lowres  *futureFrame = framesp2;
1260
+
1261
+    currentFrame->bHistScenecutAnalyzed = true;
1262
+
1263
+    uint32_t **accHistDiffRunningAvgCb = m_accHistDiffRunningAvgCb;
1264
+    uint32_t **accHistDiffRunningAvgCr = m_accHistDiffRunningAvgCr;
1265
+    uint32_t **accHistDiffRunningAvg = m_accHistDiffRunningAvg;
1266
+
1267
+    uint8_t absIntDiffFuturePast = 0;
1268
+    uint8_t absIntDiffFuturePresent = 0;
1269
+    uint8_t absIntDiffPresentPast = 0;
1270
+
1271
+    uint32_t abruptChangeCount = 0;
1272
+    uint32_t sceneChangeCount = 0;
1273
+
1274
+    uint32_t segmentWidth = frames1->widthFullRes / NUMBER_OF_SEGMENTS_IN_WIDTH;
1275
+    uint32_t segmentHeight = frames1->heightFullRes / NUMBER_OF_SEGMENTS_IN_HEIGHT;
1276
+
1277
+    for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++)
1278
+    {
1279
+        for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++)
1280
+        {
1281
+            isAbruptChange = false;
1282
+            isSceneChange = false;
1283
+
1284
+            // accumulative absolute histogram differences between the past and current frame
1285
+            uint32_t accHistDiff = 0;
1286
+            uint32_t accHistDiffCb = 0;
1287
+            uint32_t accHistDiffCr = 0;
1288
+
1289
+            uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ?
1290
+                frames1->widthFullRes - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0;
1291
+
1292
+            uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ?
1293
+                frames1->heightFullRes - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0;
1294
+
1295
+            segmentWidth += segmentWidthOffset;
1296
+            segmentHeight += segmentHeightOffset;
1297
+
1298
+            uint32_t segmentThreshHold = (
1299
+                ((X265_ABS((int64_t)currentFrame->picAvgVariance - (int64_t)previousFrame->picAvgVariance)) > PICTURE_DIFF_VARIANCE_TH) &&
1300
+                (currentFrame->picAvgVariance > PICTURE_VARIANCE_TH || previousFrame->picAvgVariance > PICTURE_VARIANCE_TH)) ?
1301
+                HIGH_VAR_SCENE_CHANGE_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_TH * NUM64x64INPIC(segmentWidth, segmentHeight);
1302
+
1303
+            uint32_t segmentThreshHoldCb = (
1304
+                ((X265_ABS((int64_t)currentFrame->picAvgVarianceCb - (int64_t)previousFrame->picAvgVarianceCb)) > PICTURE_DIFF_VARIANCE_CHROMA_TH) &&
1305
+                (currentFrame->picAvgVarianceCb > PICTURE_VARIANCE_CHROMA_TH || previousFrame->picAvgVarianceCb > PICTURE_VARIANCE_CHROMA_TH)) ?
1306
+                HIGH_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight);
1307
+
1308
+            uint32_t segmentThreshHoldCr = (
1309
+                ((X265_ABS((int64_t)currentFrame->picAvgVarianceCr - (int64_t)previousFrame->picAvgVarianceCr)) > PICTURE_DIFF_VARIANCE_CHROMA_TH) &&
1310
+                (currentFrame->picAvgVarianceCr > PICTURE_VARIANCE_CHROMA_TH || previousFrame->picAvgVarianceCr > PICTURE_VARIANCE_CHROMA_TH)) ?
1311
+                HIGH_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight);
1312
+
1313
+            for (uint32_t bin = 0; bin < HISTOGRAM_NUMBER_OF_BINS; ++bin) {
1314
+                accHistDiff += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0bin);
1315
+                accHistDiffCb += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1bin);
1316
+                accHistDiffCr += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2bin);
1317
+            }
1318
+
1319
+            if (m_resetRunningAvg) {
1320
+                accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiff;
1321
+                accHistDiffRunningAvgCbsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiffCb;
1322
+                accHistDiffRunningAvgCrsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiffCr;
1323
+            }
1324
+
1325
+            // difference between accumulative absolute histogram differences and the running average at the current frame.
1326
+            uint32_t accHistDiffError = X265_ABS((int32_t)accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiff);
1327
+            uint32_t accHistDiffErrorCb = X265_ABS((int32_t)accHistDiffRunningAvgCbsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiffCb);
1328
+            uint32_t accHistDiffErrorCr = X265_ABS((int32_t)accHistDiffRunningAvgCrsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiffCr);
1329
+
1330
+            if ((accHistDiffError > segmentThreshHold     && accHistDiff >= accHistDiffError) ||
1331
+                (accHistDiffErrorCb > segmentThreshHoldCb && accHistDiffCb >= accHistDiffErrorCb) ||
1332
+                (accHistDiffErrorCr > segmentThreshHoldCr && accHistDiffCr >= accHistDiffErrorCr)) {
1333
+
1334
+                isAbruptChange = true;
1335
+            }
1336
+
1337
+            if (isAbruptChange)
1338
+            {
1339
+                absIntDiffFuturePast = (uint8_t)X265_ABS((int16_t)futureFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)previousFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0);
1340
+                absIntDiffFuturePresent = (uint8_t)X265_ABS((int16_t)futureFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)currentFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0);
1341
+                absIntDiffPresentPast = (uint8_t)X265_ABS((int16_t)currentFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)previousFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0);
1342
+
1343
+                if (absIntDiffFuturePresent >= FLASH_TH * absIntDiffFuturePast && absIntDiffPresentPast >= FLASH_TH * absIntDiffFuturePast) {
1344
+                    x265_log(m_param, X265_LOG_DEBUG, "Flash in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast);
1345
+                }
1346
+                else if (absIntDiffFuturePresent < FADE_TH && absIntDiffPresentPast < FADE_TH) {
1347
+                    x265_log(m_param, X265_LOG_DEBUG, "Fade in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast);
1348
+                }
1349
+                else if (X265_ABS(absIntDiffFuturePresent - absIntDiffPresentPast) < INTENSITY_CHANGE_TH && absIntDiffFuturePresent + absIntDiffPresentPast >= absIntDiffFuturePast) {
1350
+                    x265_log(m_param, X265_LOG_DEBUG, "Intensity Change in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast);
1351
+                }
1352
+                else {
1353
+                    isSceneChange = true;
1354
+                    x265_log(m_param, X265_LOG_DEBUG, "Scene change in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast);
1355
+                }
1356
+
1357
+            }
1358
+            else {
1359
+                accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex = (3 * accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex + accHistDiff) / 4;
1360
+            }
1361
+
1362
+            abruptChangeCount += isAbruptChange;
1363
+            sceneChangeCount += isSceneChange;
1364
+        }
1365
+    }
1366
+
1367
+    if (abruptChangeCount >= m_segmentCountThreshold) {
1368
+        m_resetRunningAvg = true;
1369
+    }
1370
+    else {
1371
+        m_resetRunningAvg = false;
1372
+    }
1373
+
1374
+    if ((sceneChangeCount >= m_segmentCountThreshold)) {
1375
+        x265_log(m_param, X265_LOG_DEBUG, "Scene Change in Pic Number# %i\n", currentFrame->frameNum);
1376
+
1377
+        return true;
1378
+    }
1379
+    else {
1380
+        return false;
1381
+    }
1382
+
1383
+}
1384
+
1385
+bool Lookahead::histBasedScenecut(Lowres **frames, int p0, int p1, int numFrames)
1386
+{
1387
+    /* Only do analysis during a normal scenecut check. */
1388
+    if (m_param->bframes)
1389
+    {
1390
+        int origmaxp1 = p0 + 1;
1391
+        /* Look ahead to avoid coding short flashes as scenecuts. */
1392
+        origmaxp1 += m_param->bframes;
1393
+        int maxp1 = X265_MIN(origmaxp1, numFrames);
1394
+
1395
+        for (int cp1 = p0; cp1 < maxp1; cp1++)
1396
+        {
1397
+            if (framescp1 + 1->bHistScenecutAnalyzed == true)
1398
+                continue;
1399
+
1400
+            if (framescp1 + 2 != NULL && detectHistBasedSceneChange(frames, cp1, cp1 + 1, cp1 + 2))
1401
+            {
1402
+                /* If current frame is a Scenecut from p0 frame as well as Scenecut from
1403
+                 * preceeding frame, mark it as a Scenecut */
1404
+                framescp1+1->bScenecut = true;
1405
+            }
1406
+        }
1407
+
1408
+    }
1409
+
1410
+    return framesp1->bScenecut;
1411
+}
1412
+
1413
 void Lookahead::slicetypePath(Lowres **frames, int length, char(*best_paths)X265_LOOKAHEAD_MAX + 1)
1414
 {
1415
     char paths2X265_LOOKAHEAD_MAX + 1;
1416
@@ -2404,6 +3233,27 @@
1417
     memcpy(best_pathslength % (X265_BFRAME_MAX + 1), pathsidx ^ 1, length);
1418
 }
1419
 
1420
+// Find slicetype of the frame with poc # in lookahead buffer
1421
+int Lookahead::findSliceType(int poc)
1422
+{
1423
+    int out_slicetype = X265_TYPE_AUTO;
1424
+    if (m_filled)
1425
+    {
1426
+        m_outputLock.acquire();
1427
+        Frame* out = m_outputQueue.first();
1428
+        while (out != NULL) {
1429
+            if (poc == out->m_poc)
1430
+            {
1431
+                out_slicetype = out->m_lowres.sliceType;
1432
+                break;
1433
+            }
1434
+            out = out->m_next;
1435
+        }
1436
+        m_outputLock.release();
1437
+    }
1438
+    return out_slicetype;
1439
+}
1440
+
1441
 int64_t Lookahead::slicetypePathCost(Lowres **frames, char *path, int64_t threshold)
1442
 {
1443
     int64_t cost = 0;
1444
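Note on the slicetype.cpp hunk above: with bEnableTemporalSubLayers > 2, a partial mini GOP at the end of the stream is no longer reordered as one flat unit; the while ((gopId >= 0) && (leftOver > 3)) loop peels it into predefined random-access sub-GOPs, largest first, and any remainder of fewer than four frames is coded as a flat tail without a pyramid. A minimal standalone sketch of that splitting policy; the gopSizes values are assumptions for illustration, since x265 walks gopId downward through its x265_gop_ra_length[] table rather than indexing up through an array like this:

    #include <cstdio>

    int main()
    {
        int gopSizes[] = { 8, 4, 2 };   // hypothetical predefined RA GOP lengths
        int leftOver = 13;              // frames in the partial mini GOP (bframes + 1)
        int idx = 0, start = 0;
        while (idx < 3 && leftOver > 3)
        {
            if (leftOver < gopSizes[idx]) { idx++; continue; } // try the next smaller GOP
            printf("sub-GOP: frames %d..%d\n", start, start + gopSizes[idx] - 1);
            start += gopSizes[idx];
            leftOver -= gopSizes[idx];
        }
        if (leftOver > 0)               // remainder (< 4 frames): flat mini GOP
            printf("tail: frames %d..%d (no pyramid)\n", start, start + leftOver - 1);
        return 0;
    }

For 13 leftover frames this prints an 8-frame and a 4-frame sub-GOP plus a 1-frame flat tail, matching the listReset/leftOver bookkeeping in the hunk.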
x265_3.5.tar.gz/source/encoder/slicetype.h -> x265_3.6.tar.gz/source/encoder/slicetype.h Changed
 
@@ -44,6 +44,24 @@
 #define EDGE_INCLINATION 45
 #define TEMPORAL_SCENECUT_THRESHOLD 50
 
+#define X265_ABS(a)                        (((a) < 0) ? (-(a)) : (a))
+
+#define PICTURE_DIFF_VARIANCE_TH            390
+#define PICTURE_VARIANCE_TH                 1500
+#define LOW_VAR_SCENE_CHANGE_TH             2250
+#define HIGH_VAR_SCENE_CHANGE_TH            3500
+
+#define PICTURE_DIFF_VARIANCE_CHROMA_TH     10
+#define PICTURE_VARIANCE_CHROMA_TH          20
+#define LOW_VAR_SCENE_CHANGE_CHROMA_TH      2250/4
+#define HIGH_VAR_SCENE_CHANGE_CHROMA_TH     3500/4
+
+#define FLASH_TH                            1.5
+#define FADE_TH                             4
+#define INTENSITY_CHANGE_TH                 4
+
+#define NUM64x64INPIC(w,h)                  ((w*h)>> (MAX_LOG2_CU_SIZE<<1))
+
 #if HIGH_BIT_DEPTH
 #define EDGE_THRESHOLD 1023.0
 #else
@@ -93,7 +111,29 @@
 
     ~LookaheadTLD() { X265_FREE(wbuffer[0]); }
 
+    void collectPictureStatistics(Frame *curFrame);
+    void computeIntensityHistogramBinsLuma(Frame *curFrame, uint64_t *sumAvgIntensityTotalSegmentsLuma);
+
+    void computeIntensityHistogramBinsChroma(
+        Frame    *curFrame,
+        uint64_t *sumAverageIntensityCb,
+        uint64_t *sumAverageIntensityCr);
+
+    void calculateHistogram(
+        pixel    *inputSrc,
+        uint32_t  inputWidth,
+        uint32_t  inputHeight,
+        intptr_t  stride,
+        uint8_t   dsFactor,
+        uint32_t *histogram,
+        uint64_t *sum);
+
+    void computePictureStatistics(Frame *curFrame);
+
+    uint32_t calcVariance(pixel* src, intptr_t stride, intptr_t blockOffset, uint32_t plane);
+
     void calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param);
+    void calcFrameSegment(Frame *curFrame);
     void lowresIntraEstimate(Lowres& fenc, uint32_t qgSize);
 
     void weightsAnalyse(Lowres& fenc, Lowres& ref);
@@ -124,7 +164,6 @@
 
     /* pre-lookahead */
     int           m_fullQueueSize;
-    int           m_histogram[X265_BFRAME_MAX + 1];
     int           m_lastKeyframe;
     int           m_8x8Width;
     int           m_8x8Height;
@@ -153,6 +192,16 @@
     bool          m_isFadeIn;
     uint64_t      m_fadeCount;
     int           m_fadeStart;
+
+    uint32_t    **m_accHistDiffRunningAvgCb;
+    uint32_t    **m_accHistDiffRunningAvgCr;
+    uint32_t    **m_accHistDiffRunningAvg;
+
+    bool          m_resetRunningAvg;
+    uint32_t      m_segmentCountThreshold;
+
+    int8_t                  m_gopId;
+
     Lookahead(x265_param *param, ThreadPool *pool);
 #if DETAILED_CU_STATS
     int64_t       m_slicetypeDecideElapsedTime;
@@ -174,6 +223,7 @@
 
     void    getEstimatedPictureCost(Frame *pic);
     void    setLookaheadQueue();
+    int     findSliceType(int poc);
 
 protected:
 
@@ -184,6 +234,10 @@
     /* called by slicetypeAnalyse() to make slice decisions */
     bool    scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames);
     bool    scenecutInternal(Lowres **frames, int p0, int p1, bool bRealScenecut);
+
+    bool    histBasedScenecut(Lowres **frames, int p0, int p1, int numFrames);
+    bool    detectHistBasedSceneChange(Lowres **frames, int p0, int p1, int p2);
+
     void    slicetypePath(Lowres **frames, int length, char(*best_paths)[X265_LOOKAHEAD_MAX + 1]);
     int64_t slicetypePathCost(Lowres **frames, char *path, int64_t threshold);
     int64_t vbvFrameCost(Lowres **frames, int p0, int p1, int b);
@@ -199,6 +253,9 @@
 
     /* called by getEstimatedPictureCost() to finalize cuTree costs */
     int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b);
+    /*Compute index for positioning B-Ref frames*/
+    void     placeBref(Frame** frames, int start, int end, int num, int *brefs);
+    void     compCostBref(Lowres **frame, int start, int end, int num);
 };
 
 class PreLookaheadGroup : public BondedTaskGroup
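The new slicetype.h thresholds scale per segment with how many 64x64 blocks the segment contains, via NUM64x64INPIC(w,h). A worked example of the luma abrupt-change threshold; the 4x4 segment grid and MAX_LOG2_CU_SIZE = 6 are assumptions here, since NUMBER_OF_SEGMENTS_IN_WIDTH/HEIGHT are defined outside this hunk:

    #include <cstdio>

    int main()
    {
        const int MAX_LOG2_CU_SIZE = 6;                          // 64x64 CUs (assumed)
        unsigned w = 1920 / 4, h = 1080 / 4;                     // one segment of a 1080p frame: 480x270
        unsigned num64x64 = (w * h) >> (MAX_LOG2_CU_SIZE << 1);  // NUM64x64INPIC(w,h) = 31
        printf("low-variance luma threshold:  %u\n", 2250 * num64x64);  // 69750
        printf("high-variance luma threshold: %u\n", 3500 * num64x64);  // 108500
        return 0;
    }

The variance comparison against PICTURE_DIFF_VARIANCE_TH/PICTURE_VARIANCE_TH selects the high or low multiplier, so busier content needs a larger histogram difference before a segment votes for a scene change.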
x265_3.5.tar.gz/source/output/output.cpp -> x265_3.6.tar.gz/source/output/output.cpp Changed
 
@@ -30,14 +30,14 @@
 
 using namespace X265_NS;
 
-ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
+ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int sourceBitDepth)
 {
     const char * s = strrchr(fname, '.');
 
     if (s && !strcmp(s, ".y4m"))
-        return new Y4MOutput(fname, width, height, fpsNum, fpsDenom, csp);
+        return new Y4MOutput(fname, width, height, bitdepth, fpsNum, fpsDenom, csp, sourceBitDepth);
     else
-        return new YUVOutput(fname, width, height, bitdepth, csp);
+        return new YUVOutput(fname, width, height, bitdepth, csp, sourceBitDepth);
 }
 
 OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo)
x265_3.5.tar.gz/source/output/output.h -> x265_3.6.tar.gz/source/output/output.h Changed
 
@@ -42,7 +42,7 @@
     ReconFile()           {}
 
     static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth,
-                           uint32_t fpsNum, uint32_t fpsDenom, int csp);
+                           uint32_t fpsNum, uint32_t fpsDenom, int csp, int sourceBitDepth);
 
     virtual bool isFail() const = 0;
 
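The extra sourceBitDepth parameter threads the input depth through to the recon writers, which now choose their write path at run time instead of at compile time. A hypothetical call site (the argument values are illustrative; only the signature comes from the diff):

    // Open a Y4M recon writer for 10-bit 1080p25 4:2:0 where the source was
    // also 10-bit, so samples are written through without down-shifting.
    ReconFile* recon = ReconFile::open("recon.y4m", 1920, 1080,
                                       10,      /* recon bit depth */
                                       25, 1,   /* fps numerator/denominator */
                                       X265_CSP_I420,
                                       10);     /* sourceBitDepth (new argument) */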
x265_3.5.tar.gz/source/output/y4m.cpp -> x265_3.6.tar.gz/source/output/y4m.cpp Changed
 
@@ -28,11 +28,13 @@
 using namespace X265_NS;
 using namespace std;
 
-Y4MOutput::Y4MOutput(const char *filename, int w, int h, uint32_t fpsNum, uint32_t fpsDenom, int csp)
+Y4MOutput::Y4MOutput(const char* filename, int w, int h, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int inputdepth)
     : width(w)
     , height(h)
+    , bitDepth(bitdepth)
     , colorSpace(csp)
     , frameSize(0)
+    , inputDepth(inputdepth)
 {
     ofs.open(filename, ios::binary | ios::out);
     buf = new char[width];
@@ -41,7 +43,13 @@
 
     if (ofs)
     {
-        ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "\n";
+        if (bitDepth == 10)
+            ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "p10" << " XYSCSS = " << cf << "P10" << "\n";
+        else if (bitDepth == 12)
+            ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "p12" << " XYSCSS = " << cf << "P12" << "\n";
+        else
+            ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "\n";
+
         header = ofs.tellp();
     }
 
@@ -58,52 +66,81 @@
 bool Y4MOutput::writePicture(const x265_picture& pic)
 {
     std::ofstream::pos_type outPicPos = header;
-    outPicPos += (uint64_t)pic.poc * (6 + frameSize);
+    if (pic.bitDepth > 8)
+        outPicPos += (uint64_t)(pic.poc * (6 + frameSize * 2));
+    else
+        outPicPos += (uint64_t)pic.poc * (6 + frameSize);
     ofs.seekp(outPicPos);
     ofs << "FRAME\n";
 
-#if HIGH_BIT_DEPTH
-    if (pic.bitDepth > 8 && pic.poc == 0)
-        x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n");
-#else
-    if (pic.bitDepth > 8 && pic.poc == 0)
-        x265_log(NULL, X265_LOG_WARNING, "y4m: forcing reconstructed pixels to 8 bits\n");
-#endif
+    if (inputDepth > 8)
+    {
+        if (pic.bitDepth == 8 && pic.poc == 0)
+            x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n");
+    }
 
     X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n");
 
-#if HIGH_BIT_DEPTH
-
-    // encoder gave us short pixels, downshift, then write
-    X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n");
-    int shift = pic.bitDepth - 8;
-    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+    if (inputDepth > 8)//if HIGH_BIT_DEPTH
     {
-        uint16_t *src = (uint16_t*)pic.planes[i];
-        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+        if (pic.bitDepth == 8)
         {
-            for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
-                buf[w] = (char)(src[w] >> shift);
-
-            ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
-            src += pic.stride[i] / sizeof(*src);
+            // encoder gave us short pixels, downshift, then write
+            X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n");
+            int shift = pic.bitDepth - 8;
+            for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+            {
+                char *src = (char*)pic.planes[i];
+                for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+                {
+                    for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
+                        buf[w] = (char)(src[w] >> shift);
+
+                    ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
+                    src += pic.stride[i] / sizeof(*src);
+                }
+            }
+        }
+        else
+        {
+            X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n");
+            for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+            {
+                uint16_t *src = (uint16_t*)pic.planes[i];
+                for (int h = 0; h < (height * 1) >> x265_cli_csps[colorSpace].height[i]; h++)
+                {
+                    ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
+                    src += pic.stride[i] / sizeof(*src);
+                }
+            }
         }
     }
-
-#else // if HIGH_BIT_DEPTH
-
-    X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n");
-    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+    else if (inputDepth == 8 && pic.bitDepth > 8)
     {
-        char *src = (char*)pic.planes[i];
-        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+        X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n");
+        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
         {
-            ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
-            src += pic.stride[i] / sizeof(*src);
+            uint16_t* src = (uint16_t*)pic.planes[i];
+            for (int h = 0; h < (height * 1) >> x265_cli_csps[colorSpace].height[i]; h++)
+            {
+                ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
+                src += pic.stride[i] / sizeof(*src);
+            }
+        }
+    }
+    else
+    {
+        X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n");
+        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+        {
+            char *src = (char*)pic.planes[i];
+            for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+            {
+                ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
+                src += pic.stride[i] / sizeof(*src);
+            }
         }
     }
-
-#endif // if HIGH_BIT_DEPTH
 
     return true;
 }
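With the header change above, a 10- or 12-bit recon now gets a depth-tagged Y4M header instead of being forced down to 8 bits. A self-contained sketch reproducing the exact header the new code emits for a 10-bit 4:2:0 stream (note the literal spaces around '=' in the XYSCSS tag, which is what the stream insertions in the diff produce):

    #include <iostream>

    int main()
    {
        const char *cf = "420";              // chroma tag, as derived from csp
        int width = 1920, height = 1080;
        unsigned fpsNum = 25, fpsDenom = 1;  // example frame rate
        std::cout << "YUV4MPEG2 W" << width << " H" << height
                  << " F" << fpsNum << ":" << fpsDenom << " Ip"
                  << " C" << cf << "p10" << " XYSCSS = " << cf << "P10" << "\n";
        // prints: YUV4MPEG2 W1920 H1080 F25:1 Ip C420p10 XYSCSS = 420P10
        return 0;
    }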
x265_3.5.tar.gz/source/output/y4m.h -> x265_3.6.tar.gz/source/output/y4m.h Changed
 
@@ -38,10 +38,14 @@
 
     int height;
 
+    uint32_t bitDepth;
+
     int colorSpace;
 
     uint32_t frameSize;
 
+    int inputDepth;
+
     std::ofstream ofs;
 
     std::ofstream::pos_type header;
@@ -52,7 +56,7 @@
 
 public:
 
-    Y4MOutput(const char *filename, int width, int height, uint32_t fpsNum, uint32_t fpsDenom, int csp);
+    Y4MOutput(const char *filename, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int inputDepth);
 
     virtual ~Y4MOutput();
 
x265_3.5.tar.gz/source/output/yuv.cpp -> x265_3.6.tar.gz/source/output/yuv.cpp Changed
107
 
1
@@ -28,12 +28,13 @@
2
 using namespace X265_NS;
3
 using namespace std;
4
 
5
-YUVOutput::YUVOutput(const char *filename, int w, int h, uint32_t d, int csp)
6
+YUVOutput::YUVOutput(const char *filename, int w, int h, uint32_t d, int csp, int inputdepth)
7
     : width(w)
8
     , height(h)
9
     , depth(d)
10
     , colorSpace(csp)
11
     , frameSize(0)
12
+    , inputDepth(inputdepth)
13
 {
14
     ofs.open(filename, ios::binary | ios::out);
15
     buf = new charwidth;
16
@@ -56,50 +57,52 @@
17
     X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n");
18
     X265_CHECK(pic.bitDepth == (int)depth, "invalid bit depth\n");
19
 
20
-#if HIGH_BIT_DEPTH
21
-    if (depth == 8)
22
+    if (inputDepth > 8)
23
     {
24
-        int shift = pic.bitDepth - 8;
25
-        ofs.seekp((std::streamoff)fileOffset);
26
-        for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++)
27
-        {
28
-            uint16_t *src = (uint16_t*)pic.planesi;
29
-            for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++)
30
-            {
31
-                for (int w = 0; w < width >> x265_cli_cspscolorSpace.widthi; w++)
32
-                    bufw = (char)(srcw >> shift);
33
+   if (depth == 8)
34
+   {
35
+       int shift = pic.bitDepth - 8;
36
+       ofs.seekp((std::streamoff)fileOffset);
37
+       for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++)
38
+       {
39
+           uint16_t *src = (uint16_t*)pic.planesi;
40
+           for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++)
41
+           {
42
+               for (int w = 0; w < width >> x265_cli_cspscolorSpace.widthi; w++)
43
+                   bufw = (char)(srcw >> shift);
44
 
45
-                ofs.write(buf, width >> x265_cli_cspscolorSpace.widthi);
46
-                src += pic.stridei / sizeof(*src);
47
-            }
48
-        }
49
+               ofs.write(buf, width >> x265_cli_cspscolorSpace.widthi);
50
+               src += pic.stridei / sizeof(*src);
51
+           }
52
+       }
53
+   }
54
+   else
55
+   {
56
+       ofs.seekp((std::streamoff)(fileOffset * 2));
57
+       for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
58
+       {
59
+           uint16_t *src = (uint16_t*)pic.planes[i];
60
+           for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
61
+           {
62
+               ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
63
+               src += pic.stride[i] / sizeof(*src);
64
+           }
65
+       }
66
+   }
67
     }
68
     else
69
     {
70
-        ofs.seekp((std::streamoff)(fileOffset * 2));
71
-        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
72
-        {
73
-            uint16_t *src = (uint16_t*)pic.planes[i];
74
-            for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
75
-            {
76
-                ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
77
-                src += pic.stride[i] / sizeof(*src);
78
-            }
79
-        }
80
+   ofs.seekp((std::streamoff)fileOffset);
81
+   for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
82
+   {
83
+       char *src = (char*)pic.planes[i];
84
+       for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
85
+       {
86
+           ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
87
+           src += pic.stride[i] / sizeof(*src);
88
+       }
89
+   }
90
     }
91
-#else // if HIGH_BIT_DEPTH
92
-    ofs.seekp((std::streamoff)fileOffset);
93
-    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
94
-    {
95
-        char *src = (char*)pic.planes[i];
96
-        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
97
-        {
98
-            ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
99
-            src += pic.stride[i] / sizeof(*src);
100
-        }
101
-    }
102
-
103
-#endif // if HIGH_BIT_DEPTH
104
 
105
     return true;
106
 }
107
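YUVOutput::writeFrame now branches on the run-time inputDepth instead of the HIGH_BIT_DEPTH compile-time switch: high-depth input is either shifted down to an 8-bit file or written as raw 16-bit samples, and 8-bit input is copied byte for byte. A condensed, self-contained sketch of that per-plane choice (simplified to one plane, stride counted in samples here, helper name invented):

    #include <fstream>
    #include <vector>
    #include <cstdint>

    // Illustrative restatement of the branch structure above, not a drop-in replacement.
    static void writePlaneSketch(std::ofstream& ofs, const uint16_t* src, int width, int height,
                                 intptr_t stride, int inputDepth, int depth, int bitDepth)
    {
        if (inputDepth > 8 && depth == 8)
        {
            std::vector<char> row(width);
            int shift = bitDepth - 8;                        // shift high-depth samples down to 8 bits
            for (int h = 0; h < height; h++, src += stride)
            {
                for (int w = 0; w < width; w++)
                    row[w] = (char)(src[w] >> shift);
                ofs.write(row.data(), width);
            }
        }
        else if (inputDepth > 8)
        {
            for (int h = 0; h < height; h++, src += stride)  // two bytes per sample, written as-is
                ofs.write(reinterpret_cast<const char*>(src), width * 2);
        }
        else
        {
            const char* p = reinterpret_cast<const char*>(src);
            for (int h = 0; h < height; h++, p += stride)    // 8-bit end to end
                ofs.write(p, width);
        }
    }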
x265_3.5.tar.gz/source/output/yuv.h -> x265_3.6.tar.gz/source/output/yuv.h Changed
18
 
1
@@ -46,13 +46,15 @@
2
 
3
     uint32_t frameSize;
4
 
5
+    int inputDepth;
6
+
7
     char *buf;
8
 
9
     std::ofstream ofs;
10
 
11
 public:
12
 
13
-    YUVOutput(const char *filename, int width, int height, uint32_t bitdepth, int csp);
14
+    YUVOutput(const char *filename, int width, int height, uint32_t bitdepth, int csp, int inputDepth);
15
 
16
     virtual ~YUVOutput();
17
 
18
x265_3.5.tar.gz/source/test/CMakeLists.txt -> x265_3.6.tar.gz/source/test/CMakeLists.txt Changed
24
 
1
@@ -23,15 +23,13 @@
2
 
3
 # add ARM assembly files
4
 if(ARM OR CROSS_COMPILE_ARM)
5
-    if(NOT ARM64)
6
-        enable_language(ASM)
7
-        set(NASM_SRC checkasm-arm.S)
8
-        add_custom_command(
9
-            OUTPUT checkasm-arm.obj
10
-            COMMAND ${CMAKE_CXX_COMPILER}
11
-            ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj
12
-            DEPENDS checkasm-arm.S)
13
-    endif()
14
+    enable_language(ASM)
15
+    set(NASM_SRC checkasm-arm.S)
16
+    add_custom_command(
17
+        OUTPUT checkasm-arm.obj
18
+        COMMAND ${CMAKE_CXX_COMPILER}
19
+        ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj
20
+        DEPENDS checkasm-arm.S)
21
 endif(ARM OR CROSS_COMPILE_ARM)
22
 
23
 # add PowerPC assembly files
24
x265_3.5.tar.gz/source/test/pixelharness.cpp -> x265_3.6.tar.gz/source/test/pixelharness.cpp Changed
63
 
1
@@ -406,6 +406,32 @@
2
     return true;
3
 }
4
 
5
+bool PixelHarness::check_downscaleluma_t(downscaleluma_t ref, downscaleluma_t opt)
6
+{
7
+    ALIGN_VAR_16(pixel, ref_destf[32 * 32]);
8
+    ALIGN_VAR_16(pixel, opt_destf[32 * 32]);
9
+
10
+    intptr_t src_stride = 64;
11
+    intptr_t dst_stride = 32;
12
+    int bx = 32;
13
+    int by = 32;
14
+    int j = 0;
15
+    for (int i = 0; i < ITERS; i++)
16
+    {
17
+        int index = i % TEST_CASES;
18
+        ref(pixel_test_buff[index] + j, ref_destf, src_stride, dst_stride, bx, by);
19
+        checked(opt, pixel_test_buff[index] + j, opt_destf, src_stride, dst_stride, bx, by);
20
+
21
+        if (memcmp(ref_destf, opt_destf, 32 * 32 * sizeof(pixel)))
22
+            return false;
23
+
24
+        reportfail();
25
+        j += INCR;
26
+    }
27
+
28
+    return true;
29
+}
30
+
31
 bool PixelHarness::check_cpy2Dto1D_shl_t(cpy2Dto1D_shl_t ref, cpy2Dto1D_shl_t opt)
32
 {
33
     ALIGN_VAR_16(int16_t, ref_dest[64 * 64]);
34
@@ -2793,6 +2819,15 @@
35
         }
36
     }
37
 
38
+    if (opt.frameSubSampleLuma)
39
+    {
40
+        if (!check_downscaleluma_t(ref.frameSubSampleLuma, opt.frameSubSampleLuma))
41
+        {
42
+            printf("SubSample Luma failed!\n");
43
+            return false;
44
+        }
45
+    }
46
+
47
     if (opt.scale1D_128to64NONALIGNED)
48
     {
49
         if (!check_scale1D_pp(ref.scale1D_128to64NONALIGNED, opt.scale1D_128to64NONALIGNED))
50
@@ -3492,6 +3527,12 @@
51
         REPORT_SPEEDUP(opt.frameInitLowres, ref.frameInitLowres, pbuf2, pbuf1, pbuf2, pbuf3, pbuf4, 64, 64, 64, 64);
52
     }
53
 
54
+    if (opt.frameSubSampleLuma)
55
+    {
56
+        HEADER0("downscaleluma");
57
+        REPORT_SPEEDUP(opt.frameSubSampleLuma, ref.frameSubSampleLuma, pbuf2, pbuf1, 64, 64, 64, 64);
58
+    }
59
+
60
     if (opt.scale1D_128to64NONALIGNED)
61
     {
62
         HEADER0("scale1D_128to64");
63
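check_downscaleluma_t validates the new frameSubSampleLuma primitive (the luma subsampling step used by the motion-compensated temporal filter, now also covered by x86 assembly) by running the C reference and the optimized kernel over the shared random test buffers and memcmp-ing the 32x32 outputs. A plausible scalar reference, assuming plain 2x2 averaging with rounding; the exact rounding and the typedef of downscaleluma_t live in the encoder sources and may differ:

    #include <cstdint>

    // Assumed semantics only: average each 2x2 luma block into one output sample.
    static void frame_subsample_luma_sketch(const uint16_t* src, uint16_t* dst,
                                            intptr_t srcStride, intptr_t dstStride,
                                            int outWidth, int outHeight)
    {
        for (int y = 0; y < outHeight; y++, src += 2 * srcStride, dst += dstStride)
            for (int x = 0; x < outWidth; x++)
            {
                const uint16_t* s = src + 2 * x;
                dst[x] = (uint16_t)((s[0] + s[1] + s[srcStride] + s[srcStride + 1] + 2) >> 2);
            }
    }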
x265_3.5.tar.gz/source/test/pixelharness.h -> x265_3.6.tar.gz/source/test/pixelharness.h Changed
9
 
1
@@ -138,6 +138,7 @@
2
     bool check_integral_inith(integralh_t ref, integralh_t opt);
3
     bool check_ssimDist(ssimDistortion_t ref, ssimDistortion_t opt);
4
     bool check_normFact(normFactor_t ref, normFactor_t opt, int block);
5
+    bool check_downscaleluma_t(downscaleluma_t ref, downscaleluma_t opt);
6
 
7
 public:
8
 
9
x265_3.5.tar.gz/source/test/rate-control-tests.txt -> x265_3.6.tar.gz/source/test/rate-control-tests.txt Changed
10
 
1
@@ -15,7 +15,7 @@
2
 112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
3
 Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
4
 Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
5
-News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
6
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers 3
7
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
8
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000  --tune grain
9
 big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
10
x265_3.5.tar.gz/source/test/regression-tests.txt -> x265_3.6.tar.gz/source/test/regression-tests.txt Changed
91
 
1
@@ -18,12 +18,12 @@
2
 BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 --slices 3
3
 BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless --tu-inter-depth 3 --limit-tu 1
4
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
5
-BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --bitrate 7000 --limit-modes
6
+BasketballDrive_1920x1080_50.y4m,--preset medium --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --bitrate 7000 --limit-modes
7
 BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1
8
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4
9
-BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --bitrate 7000 --limit-tu 0
10
+BasketballDrive_1920x1080_50.y4m,--preset slower --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --bitrate 7000 --limit-tu 0
11
 BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3
12
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2
13
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2
14
 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
15
 Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
16
 Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
17
@@ -33,7 +33,7 @@
18
 Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1
19
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
20
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
21
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
22
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers 2 --tune grain
23
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32
24
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full
25
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing --limit-refs 1
26
@@ -41,7 +41,7 @@
27
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode --limit-refs 2
28
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop
29
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
30
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2
31
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers 2 --repeat-headers --limit-refs 2
32
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 --limit-modes
33
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut --limit-tu 1
34
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --aq-mode 3 --aq-strength 1.5 --aq-motion --bitrate 5000
35
@@ -49,11 +49,11 @@
36
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --hevc-aq --no-cutree --qg-size 16
37
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
38
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 --limit-modes
39
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless
40
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers 2 --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless
41
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
42
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
43
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 --tu-inter-depth 4 --limit-tu 3
44
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1::--preset fast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1
45
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1::--preset fast --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1
46
 FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
47
 FourPeople_1280x720_60.y4m,--preset veryfast --aq-mode 2 --aq-strength 1.5 --qg-size 8
48
 FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
49
@@ -158,13 +158,10 @@
50
 ducks_take_off_420_1_720p50.y4m,--preset medium --selective-sao 4 --sao --crf 20
51
 Traffic_4096x2048_30p.y4m, --preset medium --frame-dup --dup-threshold 60 --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000
52
 Kimono1_1920x1080_24_400.yuv,--preset superfast --qp 28 --zones 0,139,q=32
53
-sintel_trailer_2k_1920x1080_24.yuv, --preset medium --hist-scenecut --hist-threshold 0.02 --frame-dup --dup-threshold 60 --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000
54
-sintel_trailer_2k_1920x1080_24.yuv, --preset medium --hist-scenecut --hist-threshold 0.02
55
-sintel_trailer_2k_1920x1080_24.yuv, --preset ultrafast --hist-scenecut --hist-threshold 0.02
56
 crowd_run_1920x1080_50.yuv, --preset faster --ctu 32 --rskip 2 --rskip-edge-threshold 5
57
 crowd_run_1920x1080_50.yuv, --preset fast --ctu 64 --rskip 2 --rskip-edge-threshold 5 --aq-mode 4
58
-crowd_run_1920x1080_50.yuv, --preset slow --ctu 32 --rskip 2 --rskip-edge-threshold 5 --hist-scenecut --hist-threshold 0.1
59
-crowd_run_1920x1080_50.yuv, --preset slower --ctu 16 --rskip 2 --rskip-edge-threshold 5 --hist-scenecut --hist-threshold 0.1 --aq-mode 4
60
+crowd_run_1920x1080_50.yuv, --preset ultrafast --video-signal-type-preset BT2100_PQ_YCC:BT2100x108n0005
61
+crowd_run_1920x1080_50.yuv, --preset ultrafast --eob --eos
62
  
63
 # Main12 intraCost overflow bug test
64
 720p50_parkrun_ter.y4m,--preset medium
65
@@ -182,14 +179,22 @@
66
 
67
 #scaled save/load test
68
 crowd_run_1080p50.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 
69
-crowd_run_1080p50.y4m,--preset superfast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 
70
-crowd_run_1080p50.y4m,--preset fast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --scale-factor 2 --qp 18
71
+crowd_run_1080p50.y4m,--preset superfast --analysis-save x265_analysis.dat  --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --analysis-load x265_analysis.dat  --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 
72
+crowd_run_1080p50.y4m,--preset fast --analysis-save x265_analysis.dat  --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --scale-factor 2 --qp 18
73
 crowd_run_1080p50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000  --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3
74
-RaceHorses_416x240_30.y4m,--preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22  --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat  --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat  --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2
75
+RaceHorses_416x240_30.y4m,--preset slow --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22  --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --ctu 32 --analysis-load x265_analysis.dat  --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --ctu 64 --analysis-load x265_analysis_2.dat  --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2
76
 ElFunete_960x540_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-save-reuse-level 10 --analysis-save elfuente_960x540.dat --scale-factor 2::ElFunete_1920x1080_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --analysis-save elfuente_1920x1080.dat --limit-tu 0 --scale-factor 2 --analysis-load elfuente_960x540.dat --refine-intra 4 --refine-inter 2::ElFuente_3840x2160_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune=psnr --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 --analysis-load-reuse-level 10 --limit-tu 0 --scale-factor 2 --analysis-load elfuente_1920x1080.dat --refine-intra 4 --refine-inter 2
77
 #save/load with ctu distortion refinement
78
 CrowdRun_1920x1080_50_10bit_422.yuv,--no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --refine-ctu-distortion 1 --bitrate 7000::--no-cutree --analysis-load x265_analysis.dat --refine-ctu-distortion 1 --bitrate 7000 --analysis-load-reuse-level 5
79
 #segment encoding
80
 BasketballDrive_1920x1080_50.y4m, --preset ultrafast --no-open-gop --chunk-start 100 --chunk-end 200
81
 
82
+#Test FG SEI message addition
83
+#OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune grain --film-grain "OldTownCross_1920x1080_50_10bit_422.bin"
84
+#RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --signhide --colormatrix bt709 --film-grain "RaceHorses_416x240_30_10bit.bin"
85
+
86
+#Temporal layers tests
87
+ducks_take_off_420_720p50.y4m,--preset slow --temporal-layers 3 --b-adapt 0
88
+parkrun_ter_720p50.y4m,--preset medium --temporal-layers 4 --b-adapt 0
89
+BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --temporal-layers 5 --b-adapt 0
90
 # vim: tw=200
91
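--temporal-layers is no longer a bare switch; the updated entries pass an explicit sub-layer count (the tests above use 2 through 5) and pair it with --b-adapt 0 so the mini-GOP can follow the fixed hierarchical-B pattern. The equivalent request through the library API, as an illustrative fragment (option names are the CLI spellings, which x265_param_parse also accepts):

    x265_param* p = x265_param_alloc();
    x265_param_default_preset(p, "medium", NULL);
    x265_param_parse(p, "temporal-layers", "3");   // three temporal sub-layers
    x265_param_parse(p, "b-adapt", "0");           // fixed B-frame placement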
x265_3.5.tar.gz/source/test/save-load-tests.txt -> x265_3.6.tar.gz/source/test/save-load-tests.txt Changed
16
 
1
@@ -12,10 +12,10 @@
2
 # not auto-detected.
3
 crowd_run_1080p50.y4m, --preset ultrafast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000
4
 crowd_run_540p50.y4m, --preset ultrafast --no-cutree --analysis-save x265_analysis.dat --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_1080p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000
5
-crowd_run_1080p50.y4m, --preset superfast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m,   --preset superfast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000
6
-crowd_run_1080p50.y4m,  --preset fast --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m,   --preset fast --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --scale-factor 2 --qp 18
7
-crowd_run_1080p50.y4m,   --preset medium --no-cutree --analysis-save x265_analysis.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000  --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m,    --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m,    --preset medium --no-cutree --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3
8
+crowd_run_1080p50.y4m, --preset superfast --analysis-save x265_analysis.dat  --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m,   --preset superfast --analysis-load x265_analysis.dat  --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000
9
+crowd_run_1080p50.y4m,  --preset fast --analysis-save x265_analysis.dat  --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m,   --preset fast --analysis-load x265_analysis.dat  --analysis-load-reuse-level 5 --scale-factor 2 --qp 18
10
+crowd_run_1080p50.y4m,   --preset medium --analysis-save x265_analysis.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000  --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m,    --preset medium --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m,    --preset medium --analysis-load x265_analysis.dat  --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3
11
 RaceHorses_416x240_30.y4m,   --preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22  --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m,    --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat  --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,   --preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat  --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2
12
-crowd_run_540p50.y4m,   --preset veryslow --no-cutree --analysis-save x265_analysis_540.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m,   --preset veryslow --no-cutree --analysis-save x265_analysis_1080.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m,  --preset veryslow --no-cutree --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m,  --preset veryslow --no-cutree --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m,  --preset veryslow --no-cutree --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000
13
+crowd_run_540p50.y4m,   --preset veryslow --analysis-save x265_analysis_540.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m,   --preset veryslow --analysis-save x265_analysis_1080.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m,  --preset veryslow --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m,  --preset veryslow --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m,  --preset veryslow --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000
14
 crowd_run_540p50.y4m,  --preset medium --no-cutree --analysis-save x265_analysis_540.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m,  --preset medium --no-cutree --analysis-save x265_analysis_1080.dat  --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m,  --preset medium --no-cutree --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m,  --preset medium --no-cutree --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m,  --preset medium --no-cutree --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000
15
 News-4k.y4m,  --preset medium --analysis-save x265_analysis_fdup.dat --frame-dup --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000::News-4k.y4m, --analysis-load x265_analysis_fdup.dat --frame-dup --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000
16
x265_3.5.tar.gz/source/test/smoke-tests.txt -> x265_3.6.tar.gz/source/test/smoke-tests.txt Changed
9
 
1
@@ -23,3 +23,7 @@
2
 # Main12 intraCost overflow bug test
3
 720p50_parkrun_ter.y4m,--preset medium
4
 720p50_parkrun_ter.y4m,--preset=fast --hevc-aq --no-cutree
5
+# Test FG SEI message addition
6
+# CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --weightp --keyint -1 --film-grain "CrowdRun_1920x1080_50_10bit_444.bin"
7
+# DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16 --film-grain "DucksAndLegs_1920x1080_60_10bit_422.bin"
8
+# NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset=superfast --bitrate 10000 --sao --limit-sao --cll --max-cll "1000,400" --film-grain "NebutaFestival_2560x1600_60_10bit_crop.bin"
9
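The commented-out entries exercise the new film-grain SEI path, which embeds the characteristics read from a pre-analysed grain file (--film-grain <filename>) into the bitstream for Film Grain Synthesis. A hedged API-side fragment; the file name is a placeholder, and it is assumed that x265_param_parse accepts the same option name as the CLI:

    x265_param* p = x265_param_alloc();
    x265_param_default(p);
    x265_param_parse(p, "film-grain", "grain_characteristics.bin");  // placeholder path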
x265_3.5.tar.gz/source/test/testbench.cpp -> x265_3.6.tar.gz/source/test/testbench.cpp Changed
43
 
1
@@ -174,6 +174,8 @@
2
         { "AVX512", X265_CPU_AVX512 },
3
         { "ARMv6", X265_CPU_ARMV6 },
4
         { "NEON", X265_CPU_NEON },
5
+        { "SVE2", X265_CPU_SVE2 },
6
+        { "SVE", X265_CPU_SVE },
7
         { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
8
         { "", 0 },
9
     };
10
@@ -208,15 +210,8 @@
11
 
12
         EncoderPrimitives asmprim;
13
         memset(&asmprim, 0, sizeof(asmprim));
14
-        setupAssemblyPrimitives(asmprim, test_arch[i].flag);
15
-
16
-#if X265_ARCH_ARM64
17
-        /* Temporary workaround because luma_vsp assembly primitive has not been completed
18
-         * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
19
-         * Otherwise, segment fault occurs. */
20
-        setupAliasCPrimitives(cprim, asmprim, test_archi.flag);
21
-#endif
22
 
23
+        setupAssemblyPrimitives(asmprim, test_arch[i].flag);
24
         setupAliasPrimitives(asmprim);
25
         memcpy(&primitives, &asmprim, sizeof(EncoderPrimitives));
26
         for (size_t h = 0; h < sizeof(harness) / sizeof(TestHarness*); h++)
27
@@ -239,14 +234,8 @@
28
 #if X265_ARCH_X86
29
     setupInstrinsicPrimitives(optprim, cpuid);
30
 #endif
31
-    setupAssemblyPrimitives(optprim, cpuid);
32
 
33
-#if X265_ARCH_ARM64
34
-    /* Temporary workaround because luma_vsp assembly primitive has not been completed
35
-     * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
36
-     * Otherwise, segment fault occurs. */
37
-    setupAliasCPrimitives(cprim, optprim, cpuid);
38
-#endif
39
+    setupAssemblyPrimitives(optprim, cpuid);
40
 
41
     /* Note that we do not setup aliases for performance tests, that would be
42
      * redundant. The testbench only verifies they are correctly aliased */
43
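With SVE and SVE2 added to the architecture table, the corresponding AArch64 kernels can be selected explicitly from the testbench command line. The flags are ordinary bit masks from x265.h, so a capability check is just a mask test; illustrative fragment, where cpuid would come from the parsed --cpuid argument or from run-time detection:

    uint32_t cpuid = 0;                              // e.g. filled from the parsed --cpuid value
    bool haveSVE  = (cpuid & X265_CPU_SVE)  != 0;
    bool haveSVE2 = (cpuid & X265_CPU_SVE2) != 0;
    if (haveSVE2)
        printf("benchmarking SVE2 primitives\n");
    else if (haveSVE)
        printf("benchmarking SVE primitives\n");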
x265_3.5.tar.gz/source/test/testharness.h -> x265_3.6.tar.gz/source/test/testharness.h Changed
48
 
1
@@ -73,7 +73,7 @@
2
 #include <x86intrin.h>
3
 #elif ( !defined(__APPLE__) && defined (__GNUC__) && defined(__ARM_NEON__))
4
 #include <arm_neon.h>
5
-#elif defined(__GNUC__) && (!defined(__clang__) || __clang_major__ < 4)
6
+#else
7
 /* fallback for older GCC/MinGW */
8
 static inline uint32_t __rdtsc(void)
9
 {
10
@@ -82,15 +82,13 @@
11
 #if X265_ARCH_X86
12
     asm volatile("rdtsc" : "=a" (a) ::"edx");
13
 #elif X265_ARCH_ARM
14
-#if X265_ARCH_ARM64
15
-    asm volatile("mrs %0, cntvct_el0" : "=r"(a));
16
-#else
17
     // TOD-DO: verify following inline asm to get cpu Timestamp Counter for ARM arch
18
     // asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(a));
19
 
20
     // TO-DO: replace clock() function with appropriate ARM cpu instructions
21
     a = clock();
22
-#endif
23
+#elif  X265_ARCH_ARM64
24
+    asm volatile("mrs %0, cntvct_el0" : "=r"(a));
25
 #endif
26
     return a;
27
 }
28
@@ -128,8 +126,8 @@
29
         x265_emms(); \
30
         float optperf = (10.0f * cycles / runs) / 4; \
31
         float refperf = (10.0f * refcycles / refruns) / 4; \
32
-        printf("\t%3.2fx ", refperf / optperf); \
33
-        printf("\t %-8.2lf \t %-8.2lf\n", optperf, refperf); \
34
+        printf(" | \t%3.2fx | ", refperf / optperf); \
35
+        printf("\t %-8.2lf | \t %-8.2lf\n", optperf, refperf); \
36
     }
37
 
38
 extern "C" {
39
@@ -140,7 +138,7 @@
40
  * needs an explicit asm check because it only sometimes crashes in normal use. */
41
 intptr_t PFX(checkasm_call)(intptr_t (*func)(), int *ok, ...);
42
 float PFX(checkasm_call_float)(float (*func)(), int *ok, ...);
43
-#elif X265_ARCH_ARM == 0
44
+#elif (X265_ARCH_ARM == 0 && X265_ARCH_ARM64 == 0)
45
 #define PFX(stack_pagealign)(func, align) func()
46
 #endif
47
 
48
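After the reorder, AArch64 builds fall through to the generic-timer read instead of clock(), which keeps the REPORT_SPEEDUP ratios meaningful on those targets. A standalone sketch of the same counter read (AArch64 with a GNU-style compiler; converting ticks to seconds would additionally need cntfrq_el0):

    #if defined(__aarch64__)
    static inline uint64_t read_virtual_counter(void)
    {
        uint64_t ticks;
        asm volatile("mrs %0, cntvct_el0" : "=r"(ticks));   // same instruction as the fallback above
        return ticks;
    }
    #endif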
x265_3.5.tar.gz/source/x265.cpp -> x265_3.6.tar.gz/source/x265.cpp Changed
18
 
1
@@ -296,6 +296,16 @@
2
 
3
     int ret = 0;
4
 
5
+    if (cliopt[0].scenecutAwareQpConfig)
6
+    {
7
+        if (!cliopt[0].parseScenecutAwareQpConfig())
8
+        {
9
+            x265_log(NULL, X265_LOG_ERROR, "Unable to parse scenecut aware qp config file \n");
10
+            fclose(cliopt[0].scenecutAwareQpConfig);
11
+            cliopt[0].scenecutAwareQpConfig = NULL;
12
+        }
13
+    }
14
+
15
     AbrEncoder* abrEnc = new AbrEncoder(cliopt, numEncodes, ret);
16
     int threadsActive = abrEnc->m_numActiveEncodes.get();
17
     while (threadsActive)
18
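The CLI now tries to parse an optional scenecut-aware-QP configuration file up front and, on failure, logs an error and drops the handle so encoding continues without it. Library users get a parallel entry point, x265_scenecut_aware_qp_param_parse(), declared in x265.h below; an illustrative fragment, where the key and value strings are placeholders rather than documented names:

    x265_param* p = x265_param_alloc();
    x265_param_default(p);
    x265_param_parse(p, "scenecut-aware-qp", "3");                          // bidirectional masking
    x265_scenecut_aware_qp_param_parse(p, "masking-strength", "(dur,qp)");  // placeholder key/value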
x265_3.5.tar.gz/source/x265.h -> x265_3.6.tar.gz/source/x265.h Changed
470
 
1
@@ -26,6 +26,7 @@
2
 #define X265_H
3
 #include <stdint.h>
4
 #include <stdio.h>
5
+#include <sys/stat.h>
6
 #include "x265_config.h"
7
 #ifdef __cplusplus
8
 extern "C" {
9
@@ -59,7 +60,7 @@
10
     NAL_UNIT_CODED_SLICE_TRAIL_N = 0,
11
     NAL_UNIT_CODED_SLICE_TRAIL_R,
12
     NAL_UNIT_CODED_SLICE_TSA_N,
13
-    NAL_UNIT_CODED_SLICE_TLA_R,
14
+    NAL_UNIT_CODED_SLICE_TSA_R,
15
     NAL_UNIT_CODED_SLICE_STSA_N,
16
     NAL_UNIT_CODED_SLICE_STSA_R,
17
     NAL_UNIT_CODED_SLICE_RADL_N,
18
@@ -311,6 +312,7 @@
19
     double           vmafFrameScore;
20
     double           bufferFillFinal;
21
     double           unclippedBufferFillFinal;
22
+    uint8_t          tLayer;
23
 } x265_frame_stats;
24
 
25
 typedef struct x265_ctu_info_t
26
@@ -536,6 +538,8 @@
27
 /* ARM */
28
 #define X265_CPU_ARMV6           0x0000001
29
 #define X265_CPU_NEON            0x0000002  /* ARM NEON */
30
+#define X265_CPU_SVE2            0x0000008  /* ARM SVE2 */
31
+#define X265_CPU_SVE             0x0000010  /* ARM SVE2 */
32
 #define X265_CPU_FAST_NEON_MRC   0x0000004  /* Transfer from NEON to ARM register is fast (Cortex-A9) */
33
 
34
 /* IBM Power8 */
35
@@ -613,6 +617,13 @@
36
 #define SLICE_TYPE_DELTA        0.3 /* The offset decremented or incremented for P-frames or b-frames respectively*/
37
 #define BACKWARD_WINDOW         1 /* Scenecut window before a scenecut */
38
 #define FORWARD_WINDOW          2 /* Scenecut window after a scenecut */
39
+#define BWD_WINDOW_DELTA        0.4
40
+
41
+#define X265_MAX_GOP_CONFIG 3
42
+#define X265_MAX_GOP_LENGTH 16
43
+#define MAX_T_LAYERS 7
44
+
45
+#define X265_IPRATIO_STRENGTH   1.43
46
 
47
 typedef struct x265_cli_csp
48
 {
49
@@ -696,6 +707,7 @@
50
 typedef struct x265_zone
51
 {
52
     int   startFrame, endFrame; /* range of frame numbers */
53
+    int   keyframeMax;          /* it store the default/user defined keyframeMax value*/
54
     int   bForceQp;             /* whether to use qp vs bitrate factor */
55
     int   qp;
56
     float bitrateFactor;
57
@@ -747,6 +759,271 @@
58
 
59
 static const x265_vmaf_commondata vcd[] = { { NULL, (char *)"/usr/local/share/model/vmaf_v0.6.1.pkl", NULL, NULL, 0, 0, 0, 0, 0, 0, 0, NULL, 0, 1, 0 } };
60
 
61
+typedef struct x265_temporal_layer {
62
+    int poc_offset;      /* POC offset */
63
+    int8_t layer;        /* Current layer */
64
+    int8_t qp_offset;    /* QP offset */
65
+} x265_temporal_layer;
66
+
67
+static const int8_t x265_temporal_layer_bframes[MAX_T_LAYERS] = {-1, -1, 3, 7, 15, -1, -1};
68
+
69
+static const int8_t x265_gop_ra_length[X265_MAX_GOP_CONFIG] = { 4, 8, 16};
70
+static const x265_temporal_layer x265_gop_ra[X265_MAX_GOP_CONFIG][X265_MAX_GOP_LENGTH] = {
71
+    {
72
+        {
73
+            4,
74
+            0,
75
+            1,
76
+        },
77
+        {
78
+            2,
79
+            1,
80
+            5,
81
+        },
82
+        {
83
+            1,
84
+            2,
85
+            3,
86
+        },
87
+        {
88
+            3,
89
+            2,
90
+            5,
91
+        },
92
+        {
93
+            -1,
94
+            -1,
95
+            -1,
96
+        },
97
+        {
98
+            -1,
99
+            -1,
100
+            -1,
101
+        },
102
+        {
103
+            -1,
104
+            -1,
105
+            -1,
106
+        },
107
+        {
108
+            -1,
109
+            -1,
110
+            -1,
111
+        },
112
+        {
113
+            -1,
114
+            -1,
115
+            -1,
116
+        },
117
+        {
118
+            -1,
119
+            -1,
120
+            -1,
121
+        },
122
+        {
123
+            -1,
124
+            -1,
125
+            -1,
126
+        },
127
+        {
128
+            -1,
129
+            -1,
130
+            -1,
131
+        },
132
+        {
133
+            -1,
134
+            -1,
135
+            -1,
136
+        },
137
+        {
138
+            -1,
139
+            -1,
140
+            -1,
141
+        },
142
+        {
143
+            -1,
144
+            -1,
145
+            -1,
146
+        },
147
+        {
148
+            -1,
149
+            -1,
150
+            -1,
151
+        }
152
+    },
153
+
154
+    {
155
+        {
156
+            8,
157
+            0,
158
+            1,
159
+        },
160
+        {
161
+            4,
162
+            1,
163
+            5,
164
+        },
165
+        {
166
+            2,
167
+            2,
168
+            4,
169
+        },
170
+        {
171
+            1,
172
+            3,
173
+            5,
174
+        },
175
+        {
176
+            3,
177
+            3,
178
+            2,
179
+        },
180
+        {
181
+            6,
182
+            2,
183
+            5,
184
+        },
185
+        {
186
+            5,
187
+            3,
188
+            4,
189
+        },
190
+        {
191
+            7,
192
+            3,
193
+            5,
194
+        },
195
+        {
196
+            -1,
197
+            -1,
198
+            -1,
199
+        },
200
+        {
201
+            -1,
202
+            -1,
203
+            -1,
204
+        },
205
+        {
206
+            -1,
207
+            -1,
208
+            -1,
209
+        },
210
+        {
211
+            -1,
212
+            -1,
213
+            -1,
214
+        },
215
+        {
216
+            -1,
217
+            -1,
218
+            -1,
219
+        },
220
+        {
221
+            -1,
222
+            -1,
223
+            -1,
224
+        },
225
+        {
226
+            -1,
227
+            -1,
228
+            -1,
229
+        },
230
+        {
231
+            -1,
232
+            -1,
233
+            -1,
234
+        },
235
+    },
236
+    {
237
+        {
238
+            16,
239
+            0,
240
+            1,
241
+        },
242
+        {
243
+            8,
244
+            1,
245
+            6,
246
+        },
247
+        {
248
+            4,
249
+            2,
250
+            5,
251
+        },
252
+        {
253
+            2,
254
+            3,
255
+            6,
256
+        },
257
+        {
258
+            1,
259
+            4,
260
+            4,
261
+        },
262
+        {
263
+            3,
264
+            4,
265
+            6,
266
+        },
267
+        {
268
+            6,
269
+            3,
270
+            5,
271
+        },
272
+        {
273
+            5,
274
+            4,
275
+            6,
276
+        },
277
+        {
278
+            7,
279
+            4,
280
+            1,
281
+        },
282
+        {
283
+            12,
284
+            2,
285
+            6,
286
+        },
287
+        {
288
+            10,
289
+            3,
290
+            5,
291
+        },
292
+        {
293
+            9,
294
+            4,
295
+            6,
296
+        },
297
+        {
298
+            11,
299
+            4,
300
+            4,
301
+        },
302
+        {
303
+            14,
304
+            3,
305
+            6,
306
+        },
307
+        {
308
+            13,
309
+            4,
310
+            5,
311
+        },
312
+        {
313
+            15,
314
+            4,
315
+            6,
316
+        }
317
+    }
318
+};
319
+
320
+typedef enum
321
+{
322
+    X265_SHARE_MODE_FILE = 0,
323
+    X265_SHARE_MODE_SHAREDMEM
324
+}X265_DATA_SHARE_MODES;
325
+
326
 /* x265 input parameters
327
  *
328
  * For version safety you may use x265_param_alloc/free() to manage the
329
@@ -983,6 +1260,9 @@
330
      * performance impact, but the use case may preclude it.  Default true */
331
     int       bOpenGOP;
332
 
333
+   /*Force nal type to CRA to all frames expect first frame. Default disabled*/
334
+   int       craNal;
335
+
336
     /* Scene cuts closer together than this are coded as I, not IDR. */
337
     int       keyframeMin;
338
 
339
@@ -1433,10 +1713,10 @@
340
         double    rfConstantMin;
341
 
342
         /* Multi-pass encoding */
343
-        /* Enable writing the stats in a multi-pass encode to the stat output file */
344
+        /* Enable writing the stats in a multi-pass encode to the stat output file/memory */
345
         int       bStatWrite;
346
 
347
-        /* Enable loading data from the stat input file in a multi pass encode */
348
+        /* Enable loading data from the stat input file/memory in a multi pass encode */
349
         int       bStatRead;
350
 
351
         /* Filename of the 2pass output/input stats file, if unspecified the
352
@@ -1489,6 +1769,21 @@
353
         /* internally enable if tune grain is set */
354
         int      bEnableConstVbv;
355
 
356
+        /* if only the focused frames would be re-encode or not */
357
+        int       bEncFocusedFramesOnly;
358
+
359
+        /* Share the data with stats file or shared memory.
360
+        It must be one of the X265_DATA_SHARE_MODES enum values
361
+        Available if the bStatWrite or bStatRead is true.
362
+        Use stats file by default.
363
+        The stats file mode would be used among the encoders running in sequence.
364
+        The shared memory mode could only be used among the encoders running in parallel.
365
+        Now only the cutree data could be shared among shared memory. More data would be support in the future.*/
366
+        int       dataShareMode;
367
+
368
+        /* Unique shared memory name. Required if the shared memory mode enabled. NULL by default */
369
+        const char* sharedMemName;
370
+
371
     } rc;
372
 
373
     /*== Video Usability Information ==*/
374
@@ -1850,6 +2145,10 @@
375
       Default 1 (Enabled). API only. */
376
     int       bResetZoneConfig;
377
 
378
+    /*Flag to indicate rate-control history has not to be reset during zone reconfiguration.
379
+      Default 0 (Disabled) */
380
+    int       bNoResetZoneConfig;
381
+
382
     /* It reduces the bits spent on the inter-frames within the scenecutWindow before and / or after a scenecut
383
      * by increasing their QP in ratecontrol pass2 algorithm without any deterioration in visual quality.
384
      * 0 - Disabled (default).
385
@@ -1860,20 +2159,15 @@
386
 
387
     /* The duration(in milliseconds) for which there is a reduction in the bits spent on the inter-frames after a scenecut
388
      * by increasing their QP, when bEnableSceneCutAwareQp is 1 or 3. Default is 500ms.*/
389
-    int       fwdScenecutWindow;
390
+    int       fwdMaxScenecutWindow;
391
+    int       fwdScenecutWindow[6];
392
 
393
     /* The offset by which QP is incremented for inter-frames after a scenecut when bEnableSceneCutAwareQp is 1 or 3.
394
      * Default is +5. */
395
-    double    fwdRefQpDelta;
396
+    double    fwdRefQpDelta[6];
397
 
398
     /* The offset by which QP is incremented for non-referenced inter-frames after a scenecut when bEnableSceneCutAwareQp is 1 or 3. */
399
-    double    fwdNonRefQpDelta;
400
-
401
-    /* A genuine threshold used for histogram based scene cut detection.
402
-     * This threshold determines whether a frame is a scenecut or not
403
-     * when compared against the edge and chroma histogram sad values.
404
-     * Default 0.03. Range: Real number in the interval (0,1). */
405
-    double    edgeTransitionThreshold;
406
+    double    fwdNonRefQpDelta[6];
407
 
408
     /* Enables histogram based scenecut detection algorithm to detect scenecuts. Default disabled */
409
     int       bHistBasedSceneCut;
410
@@ -1941,13 +2235,39 @@
411
 
412
     /* The duration(in milliseconds) for which there is a reduction in the bits spent on the inter-frames before a scenecut
413
      * by increasing their QP, when bEnableSceneCutAwareQp is 2 or 3. Default is 100ms.*/
414
-    int       bwdScenecutWindow;
415
+    int       bwdMaxScenecutWindow;
416
+    int       bwdScenecutWindow[6];
417
 
418
     /* The offset by which QP is incremented for inter-frames before a scenecut when bEnableSceneCutAwareQp is 2 or 3. */
419
-    double    bwdRefQpDelta;
420
+    double    bwdRefQpDelta[6];
421
 
422
     /* The offset by which QP is incremented for non-referenced inter-frames before a scenecut when bEnableSceneCutAwareQp is 2 or 3. */
423
-    double    bwdNonRefQpDelta;
424
+    double    bwdNonRefQpDelta[6];
425
+
426
+    /* Specify combinations of color primaries, transfer characteristics, color matrix,
427
+    * range of luma and chroma signals, and chroma sample location. This has higher
428
+    * precedence than individual VUI parameters. If any individual VUI option is specified
429
+    * together with this, which changes the values set corresponding to the system-id
430
+    * or color-volume, it will be discarded. */
431
+    const char* videoSignalTypePreset;
432
+
433
+    /* Flag indicating whether the encoder should emit an End of Bitstream
434
+     * NAL at the end of bitstream. Default false */
435
+    int      bEnableEndOfBitstream;
436
+
437
+    /* Flag indicating whether the encoder should emit an End of Sequence
438
+     * NAL at the end of every Coded Video Sequence. Default false */
439
+    int      bEnableEndOfSequence;
440
+
441
+    /* Film Grain Characteristic file */
442
+    char* filmGrain;
443
+
444
+    /*Motion compensated temporal filter*/
445
+    int      bEnableTemporalFilter;
446
+    double   temporalFilterStrength;
447
+
448
+    /*SBRC*/
449
+    int      bEnableSBRC;
450
 } x265_param;
451
 
452
 /* x265_param_alloc:
453
@@ -1982,6 +2302,8 @@
454
 
455
 int x265_zone_param_parse(x265_param* p, const char* name, const char* value);
456
 
457
+int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value);
458
+
459
 static const char * const x265_profile_names[] = {
460
     /* HEVC v1 */
461
     "main", "main10", "mainstillpicture", /* alias */ "msp",
462
@@ -2251,6 +2573,7 @@
463
     void          (*param_free)(x265_param*);
464
     void          (*param_default)(x265_param*);
465
     int           (*param_parse)(x265_param*, const char*, const char*);
466
+    int           (*scenecut_aware_qp_param_parse)(x265_param*, const char*, const char*);
467
     int           (*param_apply_profile)(x265_param*, const char*);
468
     int           (*param_default_preset)(x265_param*, const char*, const char *);
469
     x265_picture* (*picture_alloc)(void);
470
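The new fixed random-access GOP tables describe, for each supported mini-GOP size (x265_gop_ra_length holds 4, 8 and 16), the encode-order entries with their display offset, temporal layer and QP offset, while x265_temporal_layer_bframes maps a requested layer count to the B-frame count that realizes it. A short fragment that simply walks the 16-frame table:

    #include <cstdio>
    #include "x265.h"

    int main()
    {
        const int cfg = 2;                                   // x265_gop_ra_length[2] == 16
        for (int i = 0; i < x265_gop_ra_length[cfg]; i++)
        {
            const x265_temporal_layer& e = x265_gop_ra[cfg][i];
            printf("poc_offset=%d layer=%d qp_offset=%d\n", e.poc_offset, e.layer, e.qp_offset);
        }
        return 0;
    }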
x265_3.5.tar.gz/source/x265cli.cpp -> x265_3.6.tar.gz/source/x265cli.cpp Changed
393
 
1
@@ -28,8 +28,8 @@
2
 #include "x265cli.h"
3
 #include "svt.h"
4
 
5
-#define START_CODE 0x00000001
6
-#define START_CODE_BYTES 4
7
+#define START_CODE 0x00000001
8
+#define START_CODE_BYTES 4
9
 
10
 #ifdef __cplusplus
11
 namespace X265_NS {
12
@@ -166,6 +166,7 @@
13
         H0("   --rdpenalty <0..2>            penalty for 32x32 intra TU in non-I slices. 0:disabled 1:RD-penalty 2:maximum. Default %d\n", param->rdPenalty);
14
         H0("\nSlice decision options:\n");
15
         H0("   --no-open-gop               Enable open-GOP, allows I slices to be non-IDR. Default %s\n", OPT(param->bOpenGOP));
16
+       H0("   --cra-nal                     Force nal type to CRA to all frames expect first frame, works only with keyint 1. Default %s\n", OPT(param->craNal));
17
         H0("-I/--keyint <integer>            Max IDR period in frames. -1 for infinite-gop. Default %d\n", param->keyframeMax);
18
         H0("-i/--min-keyint <integer>        Scenecuts closer together than this are coded as I, not IDR. Default: auto\n");
19
         H0("   --gop-lookahead <integer>     Extends gop boundary if a scenecut is found within this from keyint boundary. Default 0\n");
20
@@ -174,7 +175,6 @@
21
         H1("   --scenecut-bias <0..100.0>    Bias for scenecut detection. Default %.2f\n", param->scenecutBias);
22
         H0("   --hist-scenecut               Enables histogram based scene-cut detection using histogram based algorithm.\n");
23
         H0("   --no-hist-scenecut            Disables histogram based scene-cut detection using histogram based algorithm.\n");
24
-        H1("   --hist-threshold <0.0..1.0>   Luma Edge histogram's Normalized SAD threshold for histogram based scenecut detection Default %.2f\n", param->edgeTransitionThreshold);
25
         H0("   --no-fades                  Enable detection and handling of fade-in regions. Default %s\n", OPT(param->bEnableFades));
26
         H1("   --scenecut-aware-qp <0..3>    Enable increasing QP for frames inside the scenecut window around scenecut. Default %s\n", OPT(param->bEnableSceneCutAwareQp));
27
         H1("                                 0 - Disabled\n");
28
@@ -182,6 +182,7 @@
29
         H1("                                 2 - Backward masking\n");
30
         H1("                                 3 - Bidirectional masking\n");
31
         H1("   --masking-strength <string>   Comma separated values which specify the duration and offset for the QP increment for inter-frames when scenecut-aware-qp is enabled.\n");
32
+        H1("   --scenecut-qp-config <file>   File containing scenecut-aware-qp mode, window duration and offsets settings required for the masking. Works only with --pass 2\n");
33
         H0("   --radl <integer>              Number of RADL pictures allowed in front of IDR. Default %d\n", param->radl);
34
         H0("   --intra-refresh               Use Periodic Intra Refresh instead of IDR frames\n");
35
         H0("   --rc-lookahead <integer>      Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth);
36
@@ -262,6 +263,7 @@
37
         H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
38
         H0("   --qp-adaptation-range <float> Delta QP range by QP adaptation based on a psycho-visual model (1.0 to 6.0). Default %.2f\n", param->rc.qpAdaptationRange);
39
         H0("   --no-aq-motion              Block level QP adaptation based on the relative motion between the block and the frame. Default %s\n", OPT(param->bAQMotion));
40
+        H1("   --no-sbrc                   Enables the segment based rate control. Default %s\n", OPT(param->bEnableSBRC));
41
         H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16, 8). Default %d\n", param->rc.qgSize);
42
         H0("   --no-cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
43
         H0("   --no-rc-grain               Enable ratecontrol mode to handle grains specifically. turned on with tune grain. Default %s\n", OPT(param->rc.bEnableGrain));
44
@@ -282,6 +284,7 @@
45
         H1("                                       q=<integer> (force QP)\n");
46
         H1("                                   or  b=<float> (bitrate multiplier)\n");
47
         H0("   --zonefile <filename>         Zone file containing the zone boundaries and the parameters to be reconfigured.\n");
48
+        H0("   --no-zonefile-rc-init         This allow to use rate-control history across zones in zonefile.\n");
49
         H1("   --lambda-file <string>        Specify a file containing replacement values for the lambda tables\n");
50
         H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");
51
         H1("                                 Blank lines and lines starting with hash(#) are ignored\n");
52
@@ -314,6 +317,30 @@
53
         H0("   --master-display <string>     SMPTE ST 2086 master display color volume info SEI (HDR)\n");
54
         H0("                                    format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n");
55
         H0("   --max-cll <string>            Specify content light level info SEI as \"cll,fall\" (HDR).\n");
56
+        H0("   --video-signal-type-preset <string>    Specify combinations of color primaries, transfer characteristics, color matrix, range of luma and chroma signals, and chroma sample location\n");
57
+        H0("                                            format: <system-id>:<color-volume>\n");
58
+        H0("                                            This has higher precedence than individual VUI parameters. If any individual VUI option is specified together with this,\n");
59
+        H0("                                            which changes the values set corresponding to the system-id or color-volume, it will be discarded.\n");
60
+        H0("                                            The color-volume can be used only with the system-id options BT2100_PQ_YCC, BT2100_PQ_ICTCP, and BT2100_PQ_RGB.\n");
61
+        H0("                                            system-id options and their corresponding values:\n");
62
+        H0("                                              BT601_525:       --colorprim smpte170m --transfer smpte170m --colormatrix smpte170m --range limited --chromaloc 0\n");
63
+        H0("                                              BT601_626:       --colorprim bt470bg --transfer smpte170m --colormatrix bt470bg --range limited --chromaloc 0\n");
64
+        H0("                                              BT709_YCC:       --colorprim bt709 --transfer bt709 --colormatrix bt709 --range limited --chromaloc 0\n");
65
+        H0("                                              BT709_RGB:       --colorprim bt709 --transfer bt709 --colormatrix gbr --range limited\n");
66
+        H0("                                              BT2020_YCC_NCL:  --colorprim bt2020 --transfer bt2020-10 --colormatrix bt709 --range limited --chromaloc 2\n");
67
+        H0("                                              BT2020_RGB:      --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc --range limited\n");
68
+        H0("                                              BT2100_PQ_YCC:   --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc --range limited --chromaloc 2\n");
69
+        H0("                                              BT2100_PQ_ICTCP: --colorprim bt2020 --transfer smpte2084 --colormatrix ictcp --range limited --chromaloc 2\n");
70
+        H0("                                              BT2100_PQ_RGB:   --colorprim bt2020 --transfer smpte2084 --colormatrix gbr --range limited\n");
71
+        H0("                                              BT2100_HLG_YCC:  --colorprim bt2020 --transfer arib-std-b67 --colormatrix bt2020nc --range limited --chromaloc 2\n");
72
+        H0("                                              BT2100_HLG_RGB:  --colorprim bt2020 --transfer arib-std-b67 --colormatrix gbr --range limited\n");
73
+        H0("                                              FR709_RGB:       --colorprim bt709 --transfer bt709 --colormatrix gbr --range full\n");
74
+        H0("                                              FR2020_RGB:      --colorprim bt2020 --transfer bt2020-10 --colormatrix gbr --range full\n");
75
+        H0("                                              FRP3D65_YCC:     --colorprim smpte432 --transfer bt709 --colormatrix smpte170m --range full --chromaloc 1\n");
76
+        H0("                                            color-volume options and their corresponding values:\n");
77
+        H0("                                              P3D65x1000n0005: --master-display G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,5)\n");
78
+        H0("                                              P3D65x4000n005:  --master-display G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(40000000,50)\n");
79
+        H0("                                              BT2100x108n0005: --master-display G(8500,39850)B(6550,2300)R(34000,146000)WP(15635,16450)L(10000000,1)\n");
80
         H0("   --no-cll                    Emit content light level info SEI. Default %s\n", OPT(param->bEmitCLL));
81
         H0("   --no-hdr10                  Control dumping of HDR10 SEI packet. If max-cll or master-display has non-zero values, this is enabled. Default %s\n", OPT(param->bEmitHDR10SEI));
82
         H0("   --no-hdr-opt                Add luma and chroma offsets for HDR/WCG content. Default %s. Now deprecated.\n", OPT(param->bHDROpt));
83
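A minimal usage sketch for the --video-signal-type-preset option documented above (file names are placeholders, and the positional output file is assumed from the standard x265 command line):

    x265 --input source.y4m --video-signal-type-preset BT2100_PQ_YCC:P3D65x1000n0005 out.hevc

Per the help text, this single option sets the corresponding --colorprim/--transfer/--colormatrix/--range/--chromaloc values plus the P3D65 --master-display string, and any conflicting individual VUI option given alongside it is discarded.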
@@ -324,9 +351,11 @@
84
         H0("   --no-repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
85
         H0("   --no-info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
86
         H0("   --no-hrd                    Enable HRD parameters signaling. Default %s\n", OPT(param->bEmitHRDSEI));
87
-        H0("   --no-idr-recovery-sei      Emit recovery point infor SEI at each IDR frame \n");
88
-        H0("   --no-temporal-layers        Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers));
89
+        H0("   --no-idr-recovery-sei       Emit recovery point infor SEI at each IDR frame \n");
90
+        H0("   --temporal-layers             Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers));
91
         H0("   --no-aud                    Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
92
+        H0("   --no-eob                    Emit end of bitstream nal unit at the end of the bitstream. Default %s\n", OPT(param->bEnableEndOfBitstream));
93
+        H0("   --no-eos                    Emit end of sequence nal unit at the end of every coded video sequence. Default %s\n", OPT(param->bEnableEndOfSequence));
94
         H1("   --hash <integer>              Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
95
         H0("   --atc-sei <integer>           Emit the alternative transfer characteristics SEI message where the integer is the preferred transfer characteristics. Default disabled\n");
96
         H0("   --pic-struct <integer>        Set the picture structure and emits it in the picture timing SEI message. Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.\n");
97
@@ -344,6 +373,7 @@
98
         H0("   --lowpass-dct                 Use low-pass subband dct approximation. Default %s\n", OPT(param->bLowPassDct));
99
         H0("   --no-frame-dup              Enable Frame duplication. Default %s\n", OPT(param->bEnableFrameDuplication));
100
         H0("   --dup-threshold <integer>     PSNR threshold for Frame duplication. Default %d\n", param->dupThreshold);
101
+        H0("   --no-mcstf                  Enable GOP based temporal filter. Default %d\n", param->bEnableTemporalFilter);
102
 #ifdef SVT_HEVC
103
         H0("   --nosvt                     Enable SVT HEVC encoder %s\n", OPT(param->bEnableSvtHevc));
104
         H0("   --no-svt-hme                Enable Hierarchial motion estimation(HME) in SVT HEVC encoder \n");
105
@@ -365,6 +395,9 @@
106
         H1("    2 - unable to open encoder\n");
107
         H1("    3 - unable to generate stream headers\n");
108
         H1("    4 - encoder abort\n");
109
+        H0("\nSEI Message Options\n");
110
+        H0("   --film-grain <filename>           File containing Film Grain Characteristics to be written as a SEI Message\n");
111
+
112
 #undef OPT
113
 #undef H0
114
 #undef H1
115
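A hedged example for the new SEI message option listed above; the exact contents of the film-grain file are not described in this diff and are assumed to carry Film Grain Characteristics data:

    x265 --input source.y4m --film-grain grain_characteristics.bin out.hevc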
@@ -484,6 +517,9 @@
116
 
117
         memcpy(globalParam->rc.zones[zonefileCount].zoneParam, globalParam, sizeof(x265_param));
118
 
119
+        if (zonefileCount == 0)
120
+            globalParam->rc.zones[zonefileCount].keyframeMax = globalParam->keyframeMax;
121
+
122
         for (optind = 0;;)
123
         {
124
             int long_options_index = -1;
125
@@ -708,12 +744,19 @@
126
                         return true;
127
                     }
128
                 }
129
+                OPT("scenecut-qp-config")
130
+                {
131
+                    this->scenecutAwareQpConfig = x265_fopen(optarg, "rb");
132
+                    if (!this->scenecutAwareQpConfig)
133
+                        x265_log_file(param, X265_LOG_ERROR, "%s scenecut aware qp config file not found or error in opening config file\n", optarg);
134
+                }
135
                 OPT("zonefile")
136
                 {
137
                     this->zoneFile = x265_fopen(optarg, "rb");
138
                     if (!this->zoneFile)
139
                         x265_log_file(param, X265_LOG_ERROR, "%s zone file not found or error in opening zone file\n", optarg);
140
                 }
141
+                OPT("no-zonefile-rc-init") this->param->bNoResetZoneConfig = true;
142
                 OPT("fullhelp")
143
                 {
144
                     param->logLevel = X265_LOG_FULL;
145
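Judging from the parseScenecutAwareQpConfig() routine added further down, each non-comment line of the config file is tokenized starting at the first '-' and run through the normal long-option parser, so the file is expected to contain CLI-style options. A hypothetical sketch (the option value is illustrative only):

    # scenecut-qp.cfg
    --scenecut-aware-qp 1

    x265 --input source.y4m --hist-scenecut --scenecut-qp-config scenecut-qp.cfg out.hevc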
@@ -875,7 +918,7 @@
146
             if (reconFileBitDepth == 0)
147
                 reconFileBitDepth = param->internalBitDepth;
148
             this->recon = ReconFile::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth,
149
-                param->fpsNum, param->fpsDenom, param->internalCsp);
150
+                param->fpsNum, param->fpsDenom, param->internalCsp, param->sourceBitDepth);
151
             if (this->recon->isFail())
152
             {
153
                 x265_log(param, X265_LOG_WARNING, "unable to write reconstructed outputs file\n");
154
@@ -973,6 +1016,7 @@
155
         param->rc.zones = X265_MALLOC(x265_zone, param->rc.zonefileCount);
156
         for (int i = 0; i < param->rc.zonefileCount; i++)
157
         {
158
+            param->rc.zones[i].startFrame = -1;
159
             while (fgets(line, sizeof(line), zoneFile))
160
             {
161
                 if (*line == '#' || (strcmp(line, "\r\n") == 0))
162
@@ -1010,57 +1054,179 @@
163
         return 1;
164
     }
165
 
166
-    /* Parse the RPU file and extract the RPU corresponding to the current picture
167
-    * and fill the rpu field of the input picture */
168
-    int CLIOptions::rpuParser(x265_picture * pic)
169
-    {
170
-        uint8_t byteVal;
171
-        uint32_t code = 0;
172
-        int bytesRead = 0;
173
-        pic->rpu.payloadSize = 0;
174
-
175
-        if (!pic->pts)
176
-        {
177
-            while (bytesRead++ < 4 && fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu))
178
-                code = (code << 8) | byteVal;
179
-
180
-            if (code != START_CODE)
181
-            {
182
-                x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU startcode in POC %d\n", pic->pts);
183
-                return 1;
184
-            }
185
-        }
186
-
187
-        bytesRead = 0;
188
-        while (fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu))
189
-        {
190
-            code = (code << 8) | byteVal;
191
-            if (bytesRead++ < 3)
192
-                continue;
193
-            if (bytesRead >= 1024)
194
-            {
195
-                x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU size in POC %d\n", pic->pts);
196
-                return 1;
197
-            }
198
-
199
-            if (code != START_CODE)
200
-                pic->rpu.payload[pic->rpu.payloadSize++] = (code >> (3 * 8)) & 0xFF;
201
-            else
202
-                return 0;
203
-        }
204
-
205
-        int ShiftBytes = START_CODE_BYTES - (bytesRead - pic->rpu.payloadSize);
206
-        int bytesLeft = bytesRead - pic->rpu.payloadSize;
207
-        code = (code << ShiftBytes * 8);
208
-        for (int i = 0; i < bytesLeft; i++)
209
-        {
210
-            pic->rpu.payload[pic->rpu.payloadSize++] = (code >> (3 * 8)) & 0xFF;
211
-            code = (code << 8);
212
-        }
213
-        if (!pic->rpu.payloadSize)
214
-            x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU not found for POC %d\n", pic->pts);
215
-        return 0;
216
-    }
217
+    /* Parse the RPU file and extract the RPU corresponding to the current picture
218
+    * and fill the rpu field of the input picture */
219
+    int CLIOptions::rpuParser(x265_picture * pic)
220
+    {
221
+        uint8_t byteVal;
222
+        uint32_t code = 0;
223
+        int bytesRead = 0;
224
+        pic->rpu.payloadSize = 0;
225
+
226
+        if (!pic->pts)
227
+        {
228
+            while (bytesRead++ < 4 && fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu))
229
+                code = (code << 8) | byteVal;
230
+
231
+            if (code != START_CODE)
232
+            {
233
+                x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU startcode in POC %d\n", pic->pts);
234
+                return 1;
235
+            }
236
+        }
237
+
238
+        bytesRead = 0;
239
+        while (fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu))
240
+        {
241
+            code = (code << 8) | byteVal;
242
+            if (bytesRead++ < 3)
243
+                continue;
244
+            if (bytesRead >= 1024)
245
+            {
246
+                x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU size in POC %d\n", pic->pts);
247
+                return 1;
248
+            }
249
+
250
+            if (code != START_CODE)
251
+                pic->rpu.payload[pic->rpu.payloadSize++] = (code >> (3 * 8)) & 0xFF;
252
+            else
253
+                return 0;
254
+        }
255
+
256
+        int ShiftBytes = START_CODE_BYTES - (bytesRead - pic->rpu.payloadSize);
257
+        int bytesLeft = bytesRead - pic->rpu.payloadSize;
258
+        code = (code << ShiftBytes * 8);
259
+        for (int i = 0; i < bytesLeft; i++)
260
+        {
261
+            pic->rpu.payload[pic->rpu.payloadSize++] = (code >> (3 * 8)) & 0xFF;
262
+            code = (code << 8);
263
+        }
264
+        if (!pic->rpu.payloadSize)
265
+            x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU not found for POC %d\n", pic->pts);
266
+        return 0;
267
+    }
268
+
269
+    bool CLIOptions::parseScenecutAwareQpConfig()
270
+    {
271
+        char line[256];
272
+        char* argLine;
273
+        rewind(scenecutAwareQpConfig);
274
+        while (fgets(line, sizeof(line), scenecutAwareQpConfig))
275
+        {
276
+            if (*line == '#' || (strcmp(line, "\r\n") == 0))
277
+                continue;
278
+            int index = (int)strcspn(line, "\r\n");
279
+            line[index] = '\0';
280
+            argLine = line;
281
+            while (isspace((unsigned char)*argLine)) argLine++;
282
+            char* start = strchr(argLine, '-');
283
+            int argCount = 0;
284
+            char **args = (char**)malloc(256 * sizeof(char *));
285
+            //Adding a dummy string to avoid file parsing error
286
+            args[argCount++] = (char *)"x265";
287
+            char* token = strtok(start, " ");
288
+            while (token)
289
+            {
290
+                args[argCount++] = token;
291
+                token = strtok(NULL, " ");
292
+            }
293
+            args[argCount] = NULL;
294
+            CLIOptions cliopt;
295
+            if (cliopt.parseScenecutAwareQpParam(argCount, args, param))
296
+            {
297
+                cliopt.destroy();
298
+                if (cliopt.api)
299
+                    cliopt.api->param_free(cliopt.param);
300
+                exit(1);
301
+            }
302
+            break;
303
+        }
304
+        return 1;
305
+    }
306
+    bool CLIOptions::parseScenecutAwareQpParam(int argc, char **argv, x265_param* globalParam)
307
+    {
308
+        bool bError = false;
309
+        int bShowHelp = false;
310
+        int outputBitDepth = 0;
311
+        const char *profile = NULL;
312
+        /* Presets are applied before all other options. */
313
+        for (optind = 0;;)
314
+        {
315
+            int c = getopt_long(argc, argv, short_options, long_options, NULL);
316
+            if (c == -1)
317
+                break;
318
+            else if (c == 'D')
319
+                outputBitDepth = atoi(optarg);
320
+            else if (c == 'P')
321
+                profile = optarg;
322
+            else if (c == '?')
323
+                bShowHelp = true;
324
+        }
325
+        if (!outputBitDepth && profile)
326
+        {
327
+            /*try to derive the output bit depth from the requested profile*/
328
+            if (strstr(profile, "10"))
329
+                outputBitDepth = 10;
330
+            else if (strstr(profile, "12"))
331
+                outputBitDepth = 12;
332
+            else
333
+                outputBitDepth = 8;
334
+        }
335
+        api = x265_api_get(outputBitDepth);
336
+        if (!api)
337
+        {
338
+            x265_log(NULL, X265_LOG_WARNING, "falling back to default bit-depth\n");
339
+            api = x265_api_get(0);
340
+        }
341
+        if (bShowHelp)
342
+        {
343
+            printVersion(globalParam, api);
344
+            showHelp(globalParam);
345
+        }
346
+        for (optind = 0;;)
347
+        {
348
+            int long_options_index = -1;
349
+            int c = getopt_long(argc, argv, short_options, long_options, &long_options_index);
350
+            if (c == -1)
351
+                break;
352
+            if (long_options_index < 0 && c > 0)
353
+            {
354
+                for (size_t i = 0; i < sizeof(long_options) / sizeof(long_options[0]); i++)
355
+                {
356
+                    if (long_options[i].val == c)
357
+                    {
358
+                        long_options_index = (int)i;
359
+                        break;
360
+                    }
361
+                }
362
+                if (long_options_index < 0)
363
+                {
364
+                    /* getopt_long might have already printed an error message */
365
+                    if (c != 63)
366
+                        x265_log(NULL, X265_LOG_WARNING, "internal error: short option '%c' has no long option\n", c);
367
+                    return true;
368
+                }
369
+            }
370
+            if (long_options_index < 0)
371
+            {
372
+                x265_log(NULL, X265_LOG_WARNING, "short option '%c' unrecognized\n", c);
373
+                return true;
374
+            }
375
+            bError |= !!api->scenecut_aware_qp_param_parse(globalParam, long_options[long_options_index].name, optarg);
376
+            if (bError)
377
+            {
378
+                const char *name = long_options_index > 0 ? long_options[long_options_index].name : argv[optind - 2];
379
+                x265_log(NULL, X265_LOG_ERROR, "invalid argument: %s = %s\n", name, optarg);
380
+                return true;
381
+            }
382
+        }
383
+        if (optind < argc)
384
+        {
385
+            x265_log(param, X265_LOG_WARNING, "extra unused command arguments given <%s>\n", argv[optind]);
386
+            return true;
387
+        }
388
+        return false;
389
+    }
390
 
391
 #ifdef __cplusplus
392
 }
393
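A usage sketch for the zone-file changes in this file (zones.txt is a placeholder; per the option handling above, --no-zonefile-rc-init keeps rate-control history across zone boundaries instead of re-initialising it for each zone):

    x265 --input source.y4m --zonefile zones.txt --no-zonefile-rc-init out.hevc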
x265_3.5.tar.gz/source/x265cli.h -> x265_3.6.tar.gz/source/x265cli.h Changed
104
 
1
@@ -135,6 +135,7 @@
2
     { "no-fast-intra",        no_argument, NULL, 0 },
3
     { "no-open-gop",          no_argument, NULL, 0 },
4
     { "open-gop",             no_argument, NULL, 0 },
5
+    { "cra-nal",              no_argument, NULL, 0 },
6
     { "keyint",         required_argument, NULL, 'I' },
7
     { "min-keyint",     required_argument, NULL, 'i' },
8
     { "gop-lookahead",  required_argument, NULL, 0 },
9
@@ -143,7 +144,6 @@
10
     { "scenecut-bias",  required_argument, NULL, 0 },
11
     { "hist-scenecut",        no_argument, NULL, 0},
12
     { "no-hist-scenecut",     no_argument, NULL, 0},
13
-    { "hist-threshold", required_argument, NULL, 0},
14
     { "fades",                no_argument, NULL, 0 },
15
     { "no-fades",             no_argument, NULL, 0 },
16
     { "scenecut-aware-qp", required_argument, NULL, 0 },
17
@@ -182,6 +182,8 @@
18
     { "qp",             required_argument, NULL, 'q' },
19
     { "aq-mode",        required_argument, NULL, 0 },
20
     { "aq-strength",    required_argument, NULL, 0 },
21
+    { "sbrc",                 no_argument, NULL, 0 },
22
+    { "no-sbrc",              no_argument, NULL, 0 },
23
     { "rc-grain",             no_argument, NULL, 0 },
24
     { "no-rc-grain",          no_argument, NULL, 0 },
25
     { "ipratio",        required_argument, NULL, 0 },
26
@@ -244,6 +246,7 @@
27
     { "crop-rect",      required_argument, NULL, 0 }, /* DEPRECATED */
28
     { "master-display", required_argument, NULL, 0 },
29
     { "max-cll",        required_argument, NULL, 0 },
30
+    {"video-signal-type-preset", required_argument, NULL, 0 },
31
     { "min-luma",       required_argument, NULL, 0 },
32
     { "max-luma",       required_argument, NULL, 0 },
33
     { "log2-max-poc-lsb", required_argument, NULL, 8 },
34
@@ -263,11 +266,16 @@
35
     { "repeat-headers",       no_argument, NULL, 0 },
36
     { "aud",                  no_argument, NULL, 0 },
37
     { "no-aud",               no_argument, NULL, 0 },
38
+    { "eob",                  no_argument, NULL, 0 },
39
+    { "no-eob",               no_argument, NULL, 0 },
40
+    { "eos",                  no_argument, NULL, 0 },
41
+    { "no-eos",               no_argument, NULL, 0 },
42
     { "info",                 no_argument, NULL, 0 },
43
     { "no-info",              no_argument, NULL, 0 },
44
     { "zones",          required_argument, NULL, 0 },
45
     { "qpfile",         required_argument, NULL, 0 },
46
     { "zonefile",       required_argument, NULL, 0 },
47
+    { "no-zonefile-rc-init",  no_argument, NULL, 0 },
48
     { "lambda-file",    required_argument, NULL, 0 },
49
     { "b-intra",              no_argument, NULL, 0 },
50
     { "no-b-intra",           no_argument, NULL, 0 },
51
@@ -298,8 +306,7 @@
52
     { "dynamic-refine",       no_argument, NULL, 0 },
53
     { "no-dynamic-refine",    no_argument, NULL, 0 },
54
     { "strict-cbr",           no_argument, NULL, 0 },
55
-    { "temporal-layers",      no_argument, NULL, 0 },
56
-    { "no-temporal-layers",   no_argument, NULL, 0 },
57
+    { "temporal-layers",      required_argument, NULL, 0 },
58
     { "qg-size",        required_argument, NULL, 0 },
59
     { "recon-y4m-exec", required_argument, NULL, 0 },
60
     { "analyze-src-pics", no_argument, NULL, 0 },
61
@@ -349,6 +356,8 @@
62
     { "frame-dup",            no_argument, NULL, 0 },
63
     { "no-frame-dup", no_argument, NULL, 0 },
64
     { "dup-threshold", required_argument, NULL, 0 },
65
+    { "mcstf",                 no_argument, NULL, 0 },
66
+    { "no-mcstf",              no_argument, NULL, 0 },
67
 #ifdef SVT_HEVC
68
     { "svt",     no_argument, NULL, 0 },
69
     { "no-svt",  no_argument, NULL, 0 },
70
@@ -373,6 +382,8 @@
71
     { "abr-ladder", required_argument, NULL, 0 },
72
     { "min-vbv-fullness", required_argument, NULL, 0 },
73
     { "max-vbv-fullness", required_argument, NULL, 0 },
74
+    { "scenecut-qp-config", required_argument, NULL, 0 },
75
+    { "film-grain", required_argument, NULL, 0 },
76
     { 0, 0, 0, 0 },
77
     { 0, 0, 0, 0 },
78
     { 0, 0, 0, 0 },
79
@@ -388,6 +399,7 @@
80
         FILE*       qpfile;
81
         FILE*       zoneFile;
82
         FILE*    dolbyVisionRpu;    /* File containing Dolby Vision BL RPU metadata */
83
+        FILE*    scenecutAwareQpConfig; /* File containing scenecut aware frame quantization related CLI options */
84
         const char* reconPlayCmd;
85
         const x265_api* api;
86
         x265_param* param;
87
@@ -425,6 +437,7 @@
88
             qpfile = NULL;
89
             zoneFile = NULL;
90
             dolbyVisionRpu = NULL;
91
+            scenecutAwareQpConfig = NULL;
92
             reconPlayCmd = NULL;
93
             api = NULL;
94
             param = NULL;
95
@@ -455,6 +468,8 @@
96
         bool parseQPFile(x265_picture &pic_org);
97
         bool parseZoneFile();
98
         int rpuParser(x265_picture * pic);
99
+        bool parseScenecutAwareQpConfig();
100
+        bool parseScenecutAwareQpParam(int argc, char **argv, x265_param* globalParam);
101
     };
102
 #ifdef __cplusplus
103
 }
104
x265_3.5.tar.gz/x265Version.txt -> x265_3.6.tar.gz/x265Version.txt Changed
8
 
1
@@ -1,4 +1,4 @@
2
 #Attribute:         Values
3
-repositorychangeset: f0c1022b6
4
+repositorychangeset: aa7f602f7
5
 releasetagdistance: 1
6
-releasetag: 3.5
7
+releasetag: 3.6
8
No rpmlint log
Request History

Aloysius created request 9 months ago



Aloysius accepted request 9 months ago