How will Netflix members rate the quality of this video: poor, average or excellent? Which video clip looks better: encoded with Codec A or Codec B? For this episode, at a given bitrate, is it better to encode at HD resolution, with some blockiness, or will SD look better?
These were example questions we asked ourselves as we worked towards delivering the best quality of experience for Netflix members. It was possible to deploy existing video quality metrics, such as PSNR and SSIM, at scale, but they fell short of accurately capturing human perception.
Video Multi-method Assessment Fusion, or VMAF for short, is a video quality metric that combines human vision modeling with machine learning. The project started as a research collaboration between our team and Prof. Jay Kuo from the University of Southern California. His research group had previously worked on perceptual metrics for images, and together we worked on extending those ideas to video. Over time, we have collaborated with other research partners as well.
In this tech blog, we want to share our journey. Thanks to its adoption by the industry, the project is benefiting from broader contributions from researchers, video-related companies and the open-source community. We are pleased to see that other research groups have cross-verified the perceptual accuracy of VMAF. Barman et al. from Kingston University tested several quality assessment metrics on gaming content and concluded that VMAF was the best at predicting the subjective scores.
We have also read studies claiming that VMAF does not perform as expected. We invite industry and researchers to evaluate the latest VMAF models and encourage them to share with us counterexamples and corner cases that can potentially improve the next VMAF version. We also give best practices for using VMAF in a later section to address some of these concerns.
VMAF can be used as an optimization criterion for better encoding decisions, and we have seen reports of other companies applying VMAF for this purpose. Traditionally, codec comparisons share the same methodology: PSNR values are calculated for a number of video sequences, each encoded at predefined resolutions and fixed quantization settings according to a set of test conditions. Subsequently, rate-quality curves are constructed, and the average differences between those curves (BD-rate) are calculated.
Such settings work well for small differences between codecs, or for evaluating tools within the same codec. For our use case of video streaming, however, the use of PSNR is ill-suited, since it correlates poorly with perceptual quality. VMAF, by contrast, enables us to compare codecs in the regions that are truly relevant, i.e., at the bitrates and quality levels that are actually streamed. VMAF is used throughout our production pipeline, not only to measure the outcome of our encoding process, but also to guide our encodes towards the best possible quality.
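For concreteness, here is a minimal sketch of the classic BD-rate calculation mentioned above; the function and the sample numbers are our own illustration, and the quality axis can be PSNR or, as in our comparisons, VMAF.

```python
import numpy as np

def bd_rate(rates_a, quality_a, rates_b, quality_b):
    """Average bitrate difference (%) of codec B relative to codec A over
    the overlapping quality range (Bjontegaard-delta rate)."""
    # Fit log-bitrate as a cubic polynomial of quality (PSNR or VMAF).
    pa = np.polyfit(quality_a, np.log(rates_a), 3)
    pb = np.polyfit(quality_b, np.log(rates_b), 3)
    lo = max(min(quality_a), min(quality_b))
    hi = min(max(quality_a), max(quality_b))
    # Integrate both fitted curves over the common quality interval.
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_b = np.polyval(np.polyint(pb), hi) - np.polyval(np.polyint(pb), lo)
    avg_log_diff = (int_b - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100

# Four rate-quality points per codec: (kbps, VMAF). Codec B reaches the same
# quality at ~20% lower bitrate, so the BD-rate comes out near -20%.
print(bd_rate([1000, 2000, 4000, 8000], [70, 80, 88, 94],
              [800, 1600, 3200, 6400], [70, 80, 88, 94]))
```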
An important example of how VMAF is used within encoding is our Dynamic Optimizer, where encoding decisions for each individual shot are guided by bitrate and quality measurements for each encoder option.
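To make this concrete, here is a toy sketch of the Lagrangian selection idea that underlies shot-based optimization; it is only an illustration under simplified assumptions (made-up candidate encodes, a plain bisection on the multiplier), not the actual Dynamic Optimizer.

```python
# Each shot has candidate encodes (bitrate_kbps, vmaf); for a multiplier lam,
# pick per shot the candidate maximizing vmaf - lam * bitrate, then bisect on
# lam until the total bitrate fits the budget.

def pick(shots, lam):
    return [max(opts, key=lambda o: o[1] - lam * o[0]) for opts in shots]

def optimize(shots, budget_kbps, lo=0.0, hi=1.0, iters=50):
    for _ in range(iters):
        lam = (lo + hi) / 2
        total = sum(o[0] for o in pick(shots, lam))
        if total > budget_kbps:
            lo = lam   # over budget: penalize bitrate more strongly
        else:
            hi = lam   # under budget: try spending more on quality
    return pick(shots, hi)

# Two shots, three candidate encodes each: (bitrate_kbps, vmaf).
shots = [[(500, 60), (1000, 75), (2000, 85)],
         [(400, 70), (900, 82), (1800, 90)]]
print(optimize(shots, budget_kbps=2000))  # -> [(1000, 75), (900, 82)]
```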
Researchers in different business areas, such as the TV UI and streaming client teams, are constantly innovating to improve streaming quality. For example, a researcher changes the adaptive streaming algorithm or deploys new encodes, runs an experiment, and compares VMAF between the old and new algorithms or encodes.
This metric is well-suited for assessing quality in experiments because of its consistency across content and its accuracy in reflecting human perception of quality. When we first released VMAF on Github back in June 2016, it had its core feature extraction library written in C and the control code in Python, with the main goal of supporting algorithm experimentation and fast prototyping.
Most recently, in June 2018, we added frame-level multithreading, a long-overdue feature (special shout-out to DonTequila). We also introduced a frame-skipping feature, allowing VMAF to be computed on one in every N frames. This is the first time that VMAF can be computed in real time, even in 4K, albeit with a slight accuracy loss. Since we open-sourced VMAF, we have been continuously improving its prediction accuracy. Over time, we have fixed a number of undesirable cases found in the elementary metrics and the machine learning model, yielding more accurate predictions overall.
For example, the elementary metrics were modified to yield improved consistency with luminance masking; motion scores at scene boundaries were updated to avoid overshoot due to scene changes; and QP-VMAF monotonicity is now maintained when extrapolating into high-QP regions. We have collected a subjective dataset with a broadened scope compared to our previous dataset, including more diverse content and source artifacts such as film grain and camera noise, and more comprehensive coverage of encoding resolutions and compression parameters.
We have also developed a new data cleaning technique to remove human bias and inconsistency from the raw data, and open-sourced it on Github. The new approach uses maximum likelihood estimation to jointly optimize its parameters based on the available information and eliminates the need for explicit subject rejection.
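For intuition, below is a simplified sketch of this kind of maximum likelihood formulation; the model and update rules follow the general idea (true quality plus per-subject bias and inconsistency), but the function and variable names are our own, and the open-sourced tool implements a more complete treatment.

```python
import numpy as np

# Each raw score x[e, s] for video e and subject s is modeled as
#     x[e, s] = q[e] + b[s] + eps,  eps ~ N(0, v[s])
# with true quality q, subject bias b and subject inconsistency v; the
# parameters are refined by alternating closed-form updates.

def recover_quality(x, iters=100):
    q = x.mean(axis=1)                     # init with per-video mean score
    b = np.zeros(x.shape[1])
    v = np.ones(x.shape[1])
    for _ in range(iters):
        w = 1.0 / v                        # weight consistent subjects more
        q = ((x - b) * w).sum(axis=1) / w.sum()
        resid = x - q[:, None]
        b = resid.mean(axis=0)             # per-subject bias
        v = ((resid - b) ** 2).mean(axis=0) + 1e-8  # per-subject variance
    return q, b, v

# Synthetic raw scores: 20 videos rated by 15 subjects.
rng = np.random.default_rng(0)
true_q = rng.uniform(1, 5, size=20)
scores = true_q[:, None] + rng.normal(0, 0.3, 15) + rng.normal(0, 0.5, (20, 15))
q_hat, b_hat, v_hat = recover_quality(scores)
```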
The original model released when we open-sourced VMAF was based on the assumption that viewers sit in front of a 1080p display in a living-room-like environment, at a viewing distance of three times the screen height (3H). This is a setup that is generally useful for many scenarios.
In applying this model to mobile phone viewing, however, we found that it did not accurately reflect how a viewer perceives quality on a phone. For example, on a mobile phone there is less distinction between neighboring encoding resolutions than on other devices.
With this in mind, we trained and released a VMAF phone model. An example VMAF-bitrate relationship for the default model and the phone model is shown above. It can be interpreted that the same distorted video would be perceived as having higher quality when viewed on a phone screen than on an HD TV, and that the VMAF score differences between neighboring resolutions are smaller under the phone model. A viewing distance of 1.5 times the screen height (1.5H) is assumed by our 4K model, which predicts quality as perceived on a 4K display; at that distance the angular resolution, in pixels per degree, is roughly the same as for a 1080p display at 3H.
However, the 4K model assumes a wider viewing angle, which affects the foveal vs peripheral vision that the subject uses.
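As a quick sanity check of the geometry (our own calculation, not from the original text), one can compare the viewing angle and angular resolution of the two setups:

```python
import math

# Viewing angle subtended by the screen, and the resulting angular
# resolution in pixels per degree, for a display with v_pixels rows
# viewed at a distance of d screen heights.
def angle_and_ppd(v_pixels, d_heights):
    angle = 2 * math.degrees(math.atan(0.5 / d_heights))
    return angle, v_pixels / angle

print(angle_and_ppd(1080, 3.0))  # 1080p at 3H:  ~18.9 deg, ~57 px/deg
print(angle_and_ppd(2160, 1.5))  # 4K at 1.5H:   ~36.9 deg, ~59 px/deg
```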
VMAF is trained on a set of representative video genres and distortions. Due to limitations in the size of lab-based subjective experiments, the selection of video sequences does not cover the entire space of perceptual video qualities.
Therefore, VMAF predictions need to be associated with a confidence interval (CI) that expresses the inherent uncertainty of the training process. The CI is established through bootstrapping on the prediction residuals using the full training data: multiple models are trained on resampled versions of the training data, and each of these models produces a slightly different prediction.
The variability of these predictions quantifies the level of confidence: the closer the predictions are to one another, the more reliable the prediction using the full data will be.
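The following toy sketch illustrates the bootstrap mechanics; for simplicity it resamples training rows and substitutes a linear least-squares model for the actual regressor, whereas the released models bootstrap on the prediction residuals, so treat every name and number below as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y):
    A = np.column_stack([X, np.ones(len(X))])   # linear model with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, x):
    return np.append(x, 1.0) @ coef

# Synthetic training set: 200 clips, 4 elementary-metric features each.
X = rng.uniform(0, 1, size=(200, 4))
y = X @ np.array([20.0, 30.0, 25.0, 25.0]) + rng.normal(0, 3, 200)

models = []
for _ in range(100):                     # 100 bootstrap resamples
    idx = rng.integers(0, len(X), len(X))
    models.append(train(X[idx], y[idx]))

x_new = rng.uniform(0, 1, size=4)        # features of an unseen clip
preds = np.array([predict(m, x_new) for m in models])
lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"prediction {preds.mean():.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```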
Check out the tech blog for an overview, and the technical paper published at PCS (note that the paper describes an initial version of CAMBI that no longer matches the code exactly, but it is still a good introduction). Also check out the usage page.
The libvmaf library also has a new API that is more flexible and extensible. For documentation, there is an overview with links to specific pages, covering FAQs, available models and metrics, software usage guides, and a list of resources.
The command-line tool vmaf provides a complete algorithm implementation, such that one can easily deploy VMAF in a production environment. The C library libvmaf provides an interface to incorporate VMAF into your code, and tools to integrate other feature extractors into the library.
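As a usage sketch, assuming a libvmaf 2.x build of the vmaf tool on the PATH and two placeholder .y4m files, one could drive the command-line tool from Python as follows (check vmaf --help on your build, as flags may differ across versions):

```python
import json
import subprocess

# Placeholder file names; the flags follow the libvmaf 2.x command-line tool.
subprocess.run([
    "vmaf",
    "--reference", "reference.y4m",    # pristine source
    "--distorted", "distorted.y4m",    # encoded/decoded version
    "--model", "version=vmaf_v0.6.1",  # default 1080p model
    "--output", "vmaf_out.json",
    "--json",                          # write the output file as JSON
], check=True)

with open("vmaf_out.json") as f:
    result = json.load(f)

# Per-frame scores live under "frames"; pooled statistics under "pooled_metrics".
print(result["pooled_metrics"]["vmaf"]["mean"])
```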
The Python library offers a full array of wrapper classes and scripts for software testing, VMAF model training and validation, dataset processing, data visualization, etc.
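Below is a minimal usage sketch of the Python library, following the Asset/QualityRunner pattern used in the repository's own tests; exact class names, constructor arguments, and result keys may differ across versions, and the file paths are placeholders.

```python
from vmaf.config import VmafConfig
from vmaf.core.asset import Asset
from vmaf.core.quality_runner import VmafQualityRunner

# Placeholder raw YUV paths; width/height/pixel format must be supplied
# for .yuv input since the file carries no header.
asset = Asset(dataset="example", content_id=0, asset_id=0,
              workdir_root=VmafConfig.workdir_path(),
              ref_path="reference_576x324.yuv",
              dis_path="distorted_576x324.yuv",
              asset_dict={'width': 576, 'height': 324,
                          'yuv_type': 'yuv420p'})

runner = VmafQualityRunner([asset], None, fifo_mode=True,
                           delete_workdir=True, result_store=None)
runner.run()

# Each result aggregates the per-frame scores for one asset.
print(runner.results[0]['VMAF_score'])
```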