This repository has been archived by the owner on May 17, 2023. It is now read-only.

MFE Overview

Artem edited this page Aug 17, 2018 · 1 revision

Overview

HW and algorithms limitations overview

Problem Statement

  • The GPU supports only single-context execution. Within a single context, at any given time, the EU/VME can execute the workload (encoder kernel threads) of only one frame.

  • In the current AVC/HEVC encoder design, the MBENC kernel thread for each MB/LCU is launched following a 26-degree wavefront dependency pattern, because each MB depends on its left, above, and above-right neighbors. These neighbors are used to calculate the initial search starting point and the neighboring INTRA prediction mode. The dependency is defined by the H.264/H.265 ITU recommendations. It inherently limits the number of threads that can be dispatched in parallel: the larger the frame, the more threads can run in parallel (the count grows from the top of the frame toward the center diagonal), while smaller frames allow fewer. As a result, the VME is used inefficiently for smaller frames (below 1080p).

Figure 1. Wavefront dependency

  • Thread dispatching: the overhead of dispatching a thread is high compared with the time needed to process one macroblock.
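The wavefront pattern can be reproduced from the dependency rule alone. The sketch below is an illustration, not the driver's scheduler: it assumes every MB takes 1 unit of time, that threads are unlimited, and that an MB waits only for its left, above, and above-right neighbors. The function name `earliest_start_times` is hypothetical.

```python
def earliest_start_times(width_mbs, height_mbs):
    """Earliest start time of each MB under the left/above/above-right rule.

    Assumptions (not from the encoder itself): 1 time unit per MB,
    unlimited threads, no dispatch overhead.
    """
    start = [[0] * width_mbs for _ in range(height_mbs)]
    for y in range(height_mbs):
        for x in range(width_mbs):
            deps = []
            if x > 0:
                deps.append(start[y][x - 1])          # left neighbor
            if y > 0:
                deps.append(start[y - 1][x])          # above neighbor
                if x + 1 < width_mbs:
                    deps.append(start[y - 1][x + 1])  # above-right neighbor
            # An MB starts one unit after its latest dependency started.
            start[y][x] = max(deps) + 1 if deps else 0
    return start


grid = earliest_start_times(8, 4)
for row in grid:
    print(" ".join(f"{t:2d}" for t in row))
```

In this model the MB at column x and row y starts at time x + 2y, so all MBs with the same value of x + 2y form one wavefront. Each row lags the row above by two MBs, which gives the roughly 26-degree slope.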

VME utilization  

This section looks at the relation between the encoded frame size and VME utilization. As discussed above, MB dependencies (the wavefront dependency pattern) limit the number of parallel threads, and that number depends on the frame size. To tie this to VME utilization, consider the following case:

  • Assume a 1080p frame in which each MB takes 1 unit of time to process.

  • Assume there is no resource limit, meaning there are enough threads and VMEs.

  • Even so, EU/VME utilization is very low:

    • The peak number of parallel kernel threads is about 60.

    • It takes about 256 units of time to finish the whole frame.

    • It takes about 120 units of time to ramp up to the peak state.

    • It takes about 120 units of time to ramp down from the peak state.
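These figures can be checked with a short calculation. The sketch below assumes 16x16 macroblocks, 1 unit of time per MB, unlimited threads, and a wavefront index of x + 2y for the MB at column x and row y (the left/above/above-right dependency). The function name `wavefront_profile` is hypothetical.

```python
from collections import Counter


def wavefront_profile(width_px, height_px, mb_size=16):
    """Number of MBs that can run in parallel at each time unit.

    Assumes an idealized schedule where the MB at (x, y) runs at
    time x + 2*y; real hardware adds dispatch overhead.
    """
    w = -(-width_px // mb_size)   # ceiling division: MBs per row
    h = -(-height_px // mb_size)  # ceiling division: MB rows
    counts = Counter(x + 2 * y for y in range(h) for x in range(w))
    return [counts[t] for t in range(w + 2 * (h - 1))]


profile = wavefront_profile(1920, 1080)   # 120 x 68 MBs
peak = max(profile)
ramp_up = profile.index(peak)             # first time unit at peak
last_peak = len(profile) - 1 - profile[::-1].index(peak)
ramp_down = len(profile) - 1 - last_peak  # time units after the last peak

print("time units:", len(profile))   # 254
print("peak threads:", peak)         # 60
print("ramp up:", ramp_up, "ramp down:", ramp_down)
```

Under these assumptions the frame needs 254 time units and peaks at 60 parallel threads, with ramp-up and ramp-down phases of close to 120 units each, which is consistent with the approximate figures listed above.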

Figure 2. MBs per wavefront in a single frame

  • As a result of the algorithm and thread-dispatching limitations, the VME cannot be fully utilized on high-end GPU SKUs, nor even on low-end SKUs at low resolutions.

Figure 3. VME utilization at different SKUs.


  • On a SKL GT4 system, which has 3 slices and 9 VMEs, encoding 1080p in "best speed" mode uses only around 30% of the available VMEs, which is equivalent to one slice's worth. When a single stream is being encoded, disabling 2 slices and keeping only 1 slice on therefore starts to become beneficial: it (1) reduces thread scheduling and (2) reduces power consumption by concentrating execution on one slice instead of spreading it across 3.

Figure 4. VME utilization result.


System utilization

In most transcode use cases, performance is bound by the EUs (due to MBEnc kernel timings). Combined with poor VME scalability and the single-context limitation, this hurts transcode workloads, especially ABR and multi-transcode use cases. The impact is worse at low resolutions. Consider the pipeline below with 4 parallel transcodes:

Figure 5. System utilization bound by EU array.


Multi-frame encode concept

Frames from different streams/sessions are combined into a single batch buffer of the ENC kernel (MF-ENC). This increases the number of independent wavefronts executed in parallel, which creates more parallelism for the ENC operation and thus increases VME utilization. As a result, the EU array can process several frames in the same, or nearly the same, execution time as a single frame.
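The effect of batching can be illustrated with the same idealized wavefront model. The sketch below is not the MF-ENC implementation: it assumes 1 unit of time per MB, unlimited threads, no dispatch overhead, and that the wavefronts of all frames in a batch start together. The function names are hypothetical, and frame sizes are given in MBs (120 x 68 for 1080p).

```python
def wavefront_profile(width_mbs, height_mbs):
    """Parallel MBs per time unit for one frame (the MB at x, y runs at x + 2*y)."""
    profile = [0] * (width_mbs + 2 * (height_mbs - 1))
    for y in range(height_mbs):
        for x in range(width_mbs):
            profile[x + 2 * y] += 1
    return profile


def batched_profile(frames):
    """Sum the per-frame profiles of independent frames submitted as one batch.

    frames: list of (width_mbs, height_mbs) tuples.
    """
    profiles = [wavefront_profile(w, h) for w, h in frames]
    length = max(len(p) for p in profiles)
    return [sum(p[t] for p in profiles if t < len(p)) for t in range(length)]


single = batched_profile([(120, 68)])
batch = batched_profile([(120, 68)] * 4)

for name, p in (("1 frame", single), ("4 frames", batch)):
    average = sum(p) / len(p)
    print(f"{name}: time={len(p)} peak={max(p)} average={average:.1f}")
```

In this model four 1080p frames finish in the same 254 time units as one, while the average number of parallel threads rises from about 32 to about 128. On real hardware the gain is bounded by the number of EUs and VMEs available, which is why the text says "the same or near to the same" execution time.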

Figure 6. MFE Encode kernel concept


VME utilization improvement

VME utilization increases while the ENC execution time stays the same as, or close to, that of a single frame.

Figure 7. VME utilization improvement.


System utilization improvement

Because multiple frames execute in the time of a single frame, system utilization improves through better use of EU array time.

Figure 8. System utilization improvement.


 
