MFE Overview

Overview

HW and algorithms limitations overview

Problem Statement

The GPU supports only single-context execution. The limitation is that, within a single context, the EUs/VMEs can execute workload (encoder kernel threads) from only one frame at any given time.
In the current AVC/HEVC encoder design, the MBENC kernel thread for each MB/LCU is launched following a 26-degree wavefront dependency pattern, because each MB depends on its left, above and above-right neighbors. These neighbors are used to calculate the initial search starting point and the neighboring INTRA prediction mode. The dependency is defined by the ITU-T H.264/H.265 recommendations. Because of it, there is an inherent limit on the number of parallel threads that can be dispatched for the computation. Larger frames allow more parallel threads (the count grows from the top of the frame toward the center diagonal), while smaller frames allow fewer. As a result, the VME is used inefficiently for smaller frames (below 1080p).

Figure 1. wavefront dependency
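
The dependency pattern can be checked with a small model. The sketch below is illustrative only and is not taken from the encoder: the function name `earliest_start_times` is hypothetical, and it assumes that each MB takes one time unit and can start as soon as its left, above and above-right neighbors have finished. Under those assumptions, the earliest start time of the MB at column x and row y comes out as x + 2*y, which is the diagonal wavefront described above.

```python
def earliest_start_times(width_mb, height_mb):
    """Earliest start time per MB, assuming 1 time unit per MB (illustrative model)."""
    start = [[0] * width_mb for _ in range(height_mb)]
    # Raster order guarantees every dependency is computed before it is read.
    for y in range(height_mb):
        for x in range(width_mb):
            deps = []
            if x > 0:
                deps.append(start[y][x - 1])          # left neighbor
            if y > 0:
                deps.append(start[y - 1][x])          # above neighbor
                if x + 1 < width_mb:
                    deps.append(start[y - 1][x + 1])  # above-right neighbor
            # An MB starts one unit after its latest dependency started.
            start[y][x] = 1 + max(deps) if deps else 0
    return start


if __name__ == "__main__":
    for row in earliest_start_times(6, 4):
        print(row)
```

MBs that share the same start time form one wave and are the only ones that can run in parallel.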


Thread dispatching: the overhead of dispatching a thread is high compared to the time needed to process a single macroblock.

VME utilization

This section looks at the relationship between the encoded frame size and VME utilization. As discussed previously, MB dependencies (the wavefront dependency pattern) limit the number of parallel threads, and that number depends on the frame size. To tie this to VME utilization, consider the case below:

Assume a 1080p frame in which each MB takes 1 unit of time to process.
Assume we are not limited by resources, meaning we have enough threads and VMEs.
Even so, EU/VME utilization is very low:
- The peak number of parallel kernel threads is about 60.
- It takes about 256 units of time to finish the whole frame.
- It takes about 120 units of time to ramp up to the peak state.
- It takes about 120 units of time to ramp down from the peak state.
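
These figures can be approximated with a short calculation. The sketch below is a simplified model, not the driver's scheduler: it assumes a 1080p frame is 120 x 68 macroblocks (1920/16 columns, 1080 rounded up to 1088/16 rows), that each MB takes one time unit, and that an MB at column x and row y runs in wave x + 2*y, as implied by the left/above/above-right dependency. The name `wave_profile` is hypothetical.

```python
from collections import Counter


def wave_profile(width_mb, height_mb):
    """Number of MBs that can run in parallel in each wave (wave index = x + 2*y)."""
    counts = Counter(x + 2 * y for y in range(height_mb) for x in range(width_mb))
    return [counts[t] for t in range(max(counts) + 1)]


if __name__ == "__main__":
    profile = wave_profile(120, 68)  # assumed MB dimensions of a 1080p frame
    peak = max(profile)
    first_peak = profile.index(peak)
    last_peak = len(profile) - 1 - profile[::-1].index(peak)
    print("waves (time units):", len(profile))
    print("peak parallel threads:", peak)
    print("ramp-up waves:", first_peak)
    print("ramp-down waves:", len(profile) - 1 - last_peak)
    print("average parallel threads:", sum(profile) / len(profile))
```

In this model the frame needs 254 waves, the peak is 60 parallel threads, and both the ramp-up and the ramp-down take 118 waves, which is in line with the approximate numbers listed above. The peak is held only for a short stretch in the middle of the frame.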

Figure 2: MBs per Wavefront in single frame


As a result of these algorithmic and thread-dispatching limitations, we cannot reach 100% VME utilization on high-end GPU SKUs, or even on low-end SKUs at low resolutions.

Figure 3. VME utilization at different SKUs.


On an SKL GT4 system, which has 3 slices and 9 VMEs, encoding 1080p in "best speed" mode utilizes only around 30% of the available VMEs, which is equivalent to a single slice's worth of VMEs. As a result, when a single stream is being encoded, disabling 2 slices and keeping only 1 slice on becomes beneficial: concentrating execution on one slice instead of spreading it across 3 reduces (1) thread-scheduling overhead and (2) power consumption.

Figure 4. VME utilization result.


System utilization

In most transcode use cases we are bound by the EUs (due to MBEnc kernel timings). Combined with the poor VME scalability and the single-context limitation, this hurts transcode workloads, especially ABR and multi-transcode use cases. The impact is worse at low resolutions. Consider the pipeline below with 4 parallel transcodes:

Figure 5. System utilization bound by EU array.


Multi-frame encode concept

Frames from different streams/sessions are combined into a single batch buffer for the ENC kernel (MF-ENC). This increases the number of independent wavefronts executed in parallel, creating more parallelism for the ENC operation and thus increasing VME utilization. As a result, the EU array can process several frames in the same, or nearly the same, execution time as a single frame.
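
The effect of batching can be illustrated with the same kind of simplified model. The sketch below is an assumption-laden illustration, not the MF-ENC implementation: it treats each frame's wavefront as independent, assumes one time unit per MB and unlimited threads/VMEs, and simply sums the per-wave thread counts of all frames in the batch. The names `wave_profile` and `batched_profile` are hypothetical, and 80 x 45 is the assumed MB size of a 720p frame (1280/16 by 720/16).

```python
from collections import Counter
from itertools import zip_longest


def wave_profile(width_mb, height_mb):
    """Parallel MBs per wave for one frame (wave index = x + 2*y)."""
    counts = Counter(x + 2 * y for y in range(height_mb) for x in range(width_mb))
    return [counts[t] for t in range(max(counts) + 1)]


def batched_profile(frames):
    """Parallel MBs per wave when independent frames are dispatched together."""
    profiles = [wave_profile(w, h) for w, h in frames]
    # Frames do not depend on each other, so their waves overlap in time.
    return [sum(column) for column in zip_longest(*profiles, fillvalue=0)]


if __name__ == "__main__":
    single = wave_profile(80, 45)              # one 720p frame
    batch = batched_profile([(80, 45)] * 4)    # four 720p frames in one batch
    print("single frame: waves =", len(single), "peak =", max(single))
    print("batch of 4:   waves =", len(batch), "peak =", max(batch))
```

In this idealized model the batch finishes in the same number of waves as a single frame, while four times as many threads are available in every wave. Real gains depend on the number of EUs/VMEs and on dispatch overhead, which the model ignores.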

Figure 6. MFE Encode kernel concept


VME utilization improvement

VME utilization increases, while ENC execution time stays the same as, or close to, that of a single frame.

Figure 7. VME utilization improvement.


System utilization improvement

Because multiple frames execute in the time of a single frame, system utilization improves through better use of EU array time.

Figure 8. System utilization improvement.

