VirtualDub Plugin SDK 1.2

CPU dependent optimization

Depending on the CPU in the user's machine, it may be possible to use certain instruction set extensions to accelerate execution of the filter, such as MMX or SSE2. Good use of these extensions can result in significant speedups, as much as 2-4x. However, optional extensions must be checked for before they are used.

Checking for CPU features

The video filter API exports several entry points that allow the filter to query for optional CPU features. Although the filter can query the CPU directly, using the host callbacks has the benefit that the filter honors any CPU feature overrides the user has set in the host's UI. The isFPUEnabled() call returns true if FPU (x87) optimizations should be used; isMMXEnabled() returns true if MMX should be used. For more advanced features, getCPUFlags() also reports support for integer SSE, SSE, SSE2, 3DNow!, and 3DNow! Professional.
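As an illustration, a filter might choose a code path from the flags returned by getCPUFlags(). The sketch below is not taken from the SDK: the HostCallbacks struct is a mock reduced to one call, and the flag names and bit values are placeholders. The real definitions come from the SDK headers.

```cpp
// Hypothetical stand-ins for the SDK's CPU flag constants. The names and
// bit values here are placeholders; use the definitions in the SDK headers.
enum : long {
    kCpuFlagMMX  = 0x04,
    kCpuFlagSSE2 = 0x20,
};

// Mock of the host callback table, reduced to the one call used here.
struct HostCallbacks {
    long (*getCPUFlags)();
};

enum class Kernel { Scalar, MMX, SSE2 };

// Pick the most capable code path that the host reports as enabled,
// falling back to plain scalar code when no extension is reported.
inline Kernel SelectKernel(const HostCallbacks& host) {
    const long flags = host.getCPUFlags();
    if (flags & kCpuFlagSSE2) return Kernel::SSE2;
    if (flags & kCpuFlagMMX)  return Kernel::MMX;
    return Kernel::Scalar;
}
```

Because the decision goes through the host callback rather than a direct CPU query, a user override that disables SSE2 in the host would make this sketch fall back to the MMX or scalar path.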

Keep in mind when writing code that the video filter API makes no guarantees about CPU extensions: only the bare instruction set is supported. In particular, on x86, neither MMX nor P6 conditional moves (CMOVcc/FCMOVcc) should be used before checking feature flags. On x64, SSE2 is standard, but 3DNow! is not.

Note: The host can change the value of the flags on the fly in response to a change in the user preferences. A filter does not have to support all dynamic changes; it can cache the state of the flags when required.
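One way to cache the flags is to sample them once at a point of the filter's choosing, such as when processing starts, and consult only the cached copy afterwards. The following is a minimal sketch under that assumption; FilterState, BeginProcessing, ShouldUseMMX, and the flag value are all hypothetical names introduced for illustration.

```cpp
// Placeholder for the SDK's MMX flag; the real name and value come from
// the SDK headers.
enum : long { kCpuFlagMMX = 0x04 };

// Hypothetical per-instance filter state holding a snapshot of the flags.
struct FilterState {
    long cachedCpuFlags = 0;
};

// Take the snapshot once, for example when a processing run begins.
inline void BeginProcessing(FilterState& state, long (*getCPUFlags)()) {
    state.cachedCpuFlags = getCPUFlags();
}

// Later decisions use the snapshot, so a mid-run change in the host's
// preferences does not switch code paths halfway through.
inline bool ShouldUseMMX(const FilterState& state) {
    return (state.cachedCpuFlags & kCpuFlagMMX) != 0;
}
```

With this arrangement, a change in the host is picked up the next time BeginProcessing is called rather than immediately.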

Note: If you are using compiler options to generate code that uses instruction set extensions, such as /QxW on Intel C/C++ or /arch:SSE2 with Microsoft Visual C++, you must ensure that such code is not executed until support for the extensions is verified. This can be done either by instructing the compiler to check for the extensions (e.g., /QaxW in Intel C/C++), or by selectively compiling the startup code for the filter with CPU-specific optimizations disabled. A blunter approach is to compile only the module initialization routine in this manner and have it add filter entries to the host only if the host reports that the necessary CPU extensions are available.
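The last approach, registering filters only when the required extensions are present, can be sketched as follows. This is an illustration only: MockHost, FilterEntry, and RegisterSupportedFilters are hypothetical and do not reflect the SDK's actual registration interface. The function shown would be the part compiled with CPU-specific code generation disabled.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a filter and the CPU flags it requires.
struct FilterEntry {
    const char* name;
    long requiredCpuFlags;   // 0 means "runs on the bare instruction set"
};

// Mock host: reports CPU flags and records which filters were added.
struct MockHost {
    long cpuFlags;
    std::vector<std::string> registered;
    void addFilter(const FilterEntry& e) { registered.push_back(e.name); }
};

// Add only those filters whose required flags are all reported by the host.
// Returns the number of filters registered.
inline int RegisterSupportedFilters(MockHost& host,
                                    const FilterEntry* entries, int count) {
    int added = 0;
    for (int i = 0; i < count; ++i) {
        const long need = entries[i].requiredCpuFlags;
        if ((host.cpuFlags & need) == need) {
            host.addFilter(entries[i]);
            ++added;
        }
    }
    return added;
}
```

A filter that is never registered is never called, so its extension-dependent code cannot run on a CPU that lacks the extension.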

VirtualDub specific: VirtualDub requires support for 80486 instructions, but does not guarantee support for instructions introduced in later CPUs.

Requesting aligned buffers (V14+ only)

When working with vector instruction sets it is frequently advantageous to have aligned data, that is, data placed at addresses in memory that are a multiple of some alignment size. For MMX instructions, this is 4 or 8 bytes, and for SSE and above, it is 16 bytes. By default, filter image buffers are only guaranteed to have natural alignment, the usual alignment for the pixel type. For 32-bit RGB frame buffers this is 4 bytes, and for most others, it is only byte alignment. This complicates vectorization, as vector instruction sets often have awkward and slow handling of unaligned data and require fixup routines to handle odd pixel counts.
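To make the alignment arithmetic concrete, the helpers below test whether a pointer is aligned and compute how many leading bytes a routine would need to handle with scalar code before reaching an aligned address. Both functions are hypothetical helpers written for this illustration and assume the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// True if p is a multiple of `alignment` (which must be a power of two).
inline bool IsAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Number of bytes from p up to the next aligned address (0 if p is
// already aligned). A vectorized routine working on unaligned buffers
// would process this many bytes with fixup code first.
inline std::size_t BytesToAlignment(const void* p, std::size_t alignment) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalignment =
        static_cast<std::size_t>(addr & (alignment - 1));
    return (alignment - misalignment) & (alignment - 1);
}
```

For example, a 32-bit RGB pixel that sits 4 bytes past a 16-byte boundary is aligned for 4-byte access but needs 12 more bytes to reach the next 16-byte boundary.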

Starting with the V14 API, it is possible to request 16-byte alignment of all scanlines by having paramProc return the FILTERPARAM_ALIGN_SCANLINES flag. This flag modifies the allocation of frame buffers so that scanlines are always aligned to a 16-byte boundary and are a multiple of 16 bytes long. This simplifies vectorization, since routines can read vectors directly from memory without having to worry about unaligned loads or crashing due to reading beyond the end of a scanline.
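The effect on a processing loop can be sketched as follows. The code is a stand-alone illustration, not SDK code: ExampleParamProc is a simplified stand-in for a real paramProc (whose actual signature is defined by the SDK), the flag constant is a placeholder for FILTERPARAM_ALIGN_SCANLINES, and the inner 16-byte loop stands in for a single vector instruction.

```cpp
#include <cstddef>

// Placeholder for FILTERPARAM_ALIGN_SCANLINES; the real value comes from
// the SDK headers.
enum : long { kFilterParamAlignScanlines = 0x08 };

// Simplified stand-in for paramProc: request aligned, padded scanlines.
inline long ExampleParamProc() {
    return kFilterParamAlignScanlines;
}

// Round a row length in bytes up to the next multiple of 16.
inline std::size_t PaddedRowBytes(std::size_t widthBytes) {
    return (widthBytes + 15) & ~static_cast<std::size_t>(15);
}

// Invert one row in whole 16-byte chunks. Because rows are padded to a
// multiple of 16 bytes, no tail loop is needed for odd widths.
inline void InvertRow(unsigned char* dst, const unsigned char* src,
                      std::size_t widthBytes) {
    const std::size_t n = PaddedRowBytes(widthBytes);
    for (std::size_t x = 0; x < n; x += 16) {
        for (int i = 0; i < 16; ++i)   // stands in for one 16-byte vector op
            dst[x + i] = static_cast<unsigned char>(255 - src[x + i]);
    }
}
```

A row of five 32-bit pixels is 20 bytes wide, so with padding the loop processes 32 bytes in two full chunks.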

The FILTERPARAM_ALIGN_SCANLINES flag has a second effect: it pads out scanlines in the output buffer so that the filter can write multiples of 16 bytes. This further reduces the complexity of the filter, as it often makes fixup code unnecessary for images that are an odd number of pixels in width. Padding applies to all planes, so the quarter-size chroma planes in a 4:1:0 YCbCr buffer will also be padded to a multiple of 16 bytes.

Note that there are a couple of gotchas to aligned scanline support. The first is that, although the padding at the end of output scanlines is ignored and can be written with any value, the extra padding on source scanlines is not guaranteed to have any particular value. This means that filters still need to be carefully written so that those undefined values don't affect the output, which might otherwise show up as noise on the right side of the image. The second and more minor gotcha is that in some cases using this flag can impose a small performance penalty as the host may have to realign buffers that aren't aligned appropriately from the source.
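The first gotcha matters most for filters that read neighboring pixels. As a sketch, consider a hypothetical routine that averages each pixel of an 8-bit plane with its right-hand neighbor: the last real pixel must not read past the true width, because the byte beyond it is source padding with undefined contents. The function name and the choice to copy the edge pixel are assumptions made for this example.

```cpp
#include <cstddef>

// Average each pixel with its right-hand neighbour on one 8-bit row.
// `width` is the true image width in pixels, not the padded row length.
inline void AverageWithRightNeighbour(unsigned char* dst,
                                      const unsigned char* src,
                                      std::size_t width) {
    if (width == 0)
        return;

    // Interior pixels: both inputs lie inside the real image.
    for (std::size_t x = 0; x + 1 < width; ++x)
        dst[x] = static_cast<unsigned char>((src[x] + src[x + 1] + 1) >> 1);

    // Last real pixel: src[width] would be padding with undefined
    // contents, so reuse the edge pixel instead of reading it.
    dst[width - 1] = src[width - 1];
}
```

A purely per-pixel operation can safely run over the padding, since garbage read from source padding only lands in output padding, which is ignored. Only operations that mix pixels horizontally need this kind of edge handling.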