
Video Quality Assessment (VQA)

The growing ubiquity of audiovisual content on the Internet has increased the importance of online advertising. As digital video rendering keeps improving, users’ expectations rise with it, so video quality is an important element to consider in online video advertising: a high-quality video is more likely to interest users than a low-quality one. It is therefore crucial to be able to quantify a video’s quality. However, the multiplicity of video formats and the variety of communication networks (wireless, fiber, xDSL…) make video quality assessment complex. Since the “end receiver” of a video is a human, the most accurate VQA is subjective, that is, performed by human viewers. Subjective assessment is time-consuming, though, and its outcome depends on the evaluator (mood, culture…). Researchers have therefore built objective assessment methods that model subjective judgments. An advantage of objective VQA methods is that they can operate in real time. We focus here on objective quality assessment of video signals.

Objective VQA

There are three types of quality assessments of video signals, depending on the availability of original images for assessment (Figure 1): Full Reference (FR), Reduced Reference (RR), and No Reference (NR).

FR VQA metrics

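FR metrics compare the degraded signal to the original signal, so they require the original to be available. One of the simplest and most popular FR metrics is PSNR (peak signal-to-noise ratio) [1], which is usually computed to estimate the degradation caused by compression. Let $I$ denote an $M \times N$ monochrome original image and $K$ a degraded version of $I$, and assume that each pixel is coded with $B$ bits, so that the dynamic range of the pixel values is $2^B - 1$. The PSNR, in decibels, is then

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \bigl(I(i,j) - K(i,j)\bigr)^2, \qquad \mathrm{PSNR} = 10 \log_{10} \frac{(2^B - 1)^2}{\mathrm{MSE}}.$$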
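As an illustration, PSNR can be sketched in a few lines: compute the mean squared error (MSE) between the original and degraded images, then take 10·log10 of the squared peak value (2^B − 1 for B-bit pixels) divided by the MSE. The function names `mse` and `psnr` and the tiny 3×3 images below are made up for this example; plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def mse(original, degraded):
    """Mean squared error between two equally sized grayscale images.

    Images are given as lists of rows of pixel values.
    """
    rows = len(original)
    cols = len(original[0])
    total = 0.0
    for row_o, row_d in zip(original, degraded):
        for p_o, p_d in zip(row_o, row_d):
            total += (p_o - p_d) ** 2
    return total / (rows * cols)


def psnr(original, degraded, bits=8):
    """PSNR in dB for pixels coded on `bits` bits (peak value 2**bits - 1)."""
    peak = 2 ** bits - 1
    err = mse(original, degraded)
    if err == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical 3x3 example images (8-bit grayscale).
original = [[52, 55, 61], [59, 79, 61], [85, 76, 62]]
degraded = [[54, 55, 60], [59, 77, 61], [85, 76, 65]]
print(f"MSE  = {mse(original, degraded):.2f}")
print(f"PSNR = {psnr(original, degraded):.2f} dB")
```

A higher PSNR means the degraded image is numerically closer to the original; identical images give an infinite PSNR.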

The drawback of PSNR is that it does not relate to the Human Visual System (HVS). SSIM (Structural Similarity), which quantifies differences in image structure, was introduced to mitigate this: the human eye is more sensitive to differences in image structure than to straightforward pixel-wise differences (the PSNR case). SSIM is computed over several windows. The formula for SSIM between two windows $x$ and $y$ is:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

The variables $\mu_x$ and $\sigma_x$ represent, respectively, the average and the standard deviation of the window $x$ (and likewise $\mu_y$ and $\sigma_y$ for $y$), and $\sigma_{xy}$ is the covariance between $x$ and $y$. The constants $c_1$ and $c_2$ stabilize the division when the denominator is small. Other FR metrics are also in use: Visual Information Fidelity (VIF) [1] and visual signal-to-noise ratio (VSNR) [2]. There is also the Video Quality Metric (VQM), an FR metric designed by the Institute for Telecommunication Sciences in Colorado and standardized by the American National Standards Institute (ANSI) in 2003 [3]. The prediction scores of VQM closely match human subjective scores.
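The SSIM comparison of two windows can be sketched as follows. The function name `ssim_window` is hypothetical, windows are passed as flat lists of pixel values, and the stabilizing constants are set to (0.01·L)² and (0.03·L)² with L the peak pixel value, a conventional choice that is assumed here rather than taken from the text above. A full SSIM score would average this value over many sliding windows, which this sketch leaves out.

```python
def ssim_window(x, y, bits=8, k1=0.01, k2=0.03):
    """SSIM between two windows given as flat, equally long lists of pixels."""
    n = len(x)
    peak = 2 ** bits - 1
    # Stabilizing constants (assumed conventional defaults).
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2

    # Window averages.
    mu_x = sum(x) / n
    mu_y = sum(y) / n

    # Sample variances and covariance.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical windows: a ramp, and the same ramp reversed.
ramp = [10, 20, 30, 40]
reversed_ramp = [40, 30, 20, 10]
print(ssim_window(ramp, ramp))           # identical structure
print(ssim_window(ramp, reversed_ramp))  # same mean, opposite structure
```

The two windows in the example have the same mean and variance, so only the covariance term separates them; this is the structural sensitivity that PSNR lacks.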


RR VQA metrics

In audiovisual communication over the Internet, it is difficult, if not impossible, to access the original video at the receiver, so FR VQA is not applicable. Reduced Reference metrics address this: a few spatiotemporal features of the original video (Spatial Information, SI, and Temporal Information, TI) are extracted, encoded in very few bits, and embedded into the video bitstream. The receiver decodes these features and compares them with the same features computed on the degraded video to quantify the distortion. One of the first RR metrics was introduced by Wolf and Pinson in 1999 [4]. Some RR VQA systems are designed for a specific type of distortion (e.g. MPEG compression artifacts [5]). More recently, Lin Ma et al. proposed an RR metric based on Generalized Gaussian Distribution modeling of the reorganized Discrete Cosine Transform (DCT) coefficient distribution [6].
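The reduced-reference idea can be sketched with two scalar features per sequence. The definitions below are an assumption, not the method of [4]: SI is taken as the standard deviation of the Sobel gradient magnitude of a frame, TI as the standard deviation of the difference between consecutive frames, each maximized over the sequence. All function names are hypothetical, and the encoding and embedding of the features into the bitstream is omitted.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def std(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def spatial_information(frame):
    """Std of the Sobel gradient magnitude over the interior pixels of a frame."""
    height, width = len(frame), len(frame[0])
    magnitudes = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = gy = 0
            for dr in range(3):
                for dc in range(3):
                    pixel = frame[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            magnitudes.append(math.hypot(gx, gy))
    return std(magnitudes)


def temporal_information(previous, current):
    """Std of the pixel-wise difference between two consecutive frames."""
    diffs = [c - p
             for row_p, row_c in zip(previous, current)
             for p, c in zip(row_p, row_c)]
    return std(diffs)


def rr_features(frames):
    """(SI, TI) summary of a sequence: the few numbers sent with the video."""
    si = max(spatial_information(f) for f in frames)
    ti = max(temporal_information(a, b) for a, b in zip(frames, frames[1:]))
    return si, ti


def rr_distortion(reference_features, received_frames):
    """Absolute SI and TI gaps between the sender's features and the received video."""
    si_ref, ti_ref = reference_features
    si_rx, ti_rx = rr_features(received_frames)
    return abs(si_ref - si_rx), abs(ti_ref - ti_rx)


# Hypothetical 4x5 frames: a flat frame followed by a frame with a vertical edge.
flat = [[0] * 5 for _ in range(4)]
edge = [[0, 0, 0, 255, 255] for _ in range(4)]
sent = rr_features([flat, edge])            # computed at the sender
print(rr_distortion(sent, [flat, edge]))    # intact video
print(rr_distortion(sent, [flat, flat]))    # edge lost in transmission
```

Only the two numbers in `sent` need to travel with the video, which is what makes the approach usable when the original frames are unavailable at the receiver.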


NR VQA metrics

The lack of any information about the reference video complicates NR VQA. There are three main types of NR VQA approaches.

a) The first type is a distortion-specific approach, in which the VQA seeks to model a predefined distortion. The distortions considered are mostly the common compression artifacts: block edge effects (blockiness), blurriness, and ringing.

  • Blocking artifact

In most image/video compression techniques, the DCT is applied to the image after partitioning it into blocks (usually 8×8 pixels). Although the DCT has very powerful compression abilities, at low bitrates “blocking artifacts” appear in the decoded signal: visible discontinuities along the block boundaries.

  • Blurring artifact

The blurring artifact appears when high-frequency components of the original image are lost. Blurred images look “fuzzy.”

  • Ringing artifact

Ringing, also known as the Gibbs phenomenon, results from the quantization of DCT coefficients, which causes ripples (oscillations) around sharp edges in the image.
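To make the distortion-specific idea concrete, here is a minimal sketch of a no-reference blockiness indicator. It is a simplified illustration, not one of the published metrics: it compares the average horizontal pixel jump across assumed 8-pixel block boundaries with the average jump inside blocks, so a ratio well above 1 suggests visible block edges. The function name `blockiness`, the block size, and the test images are assumptions made for this example, and only vertical block edges are examined.

```python
def blockiness(image, block=8):
    """Ratio of mean pixel jump across block boundaries to mean jump inside blocks.

    `image` is a list of rows of grayscale values; only horizontal neighbours
    are compared, so only vertical block edges contribute.
    """
    boundary_jumps = []
    inside_jumps = []
    for row in image:
        for c in range(1, len(row)):
            jump = abs(row[c] - row[c - 1])
            if c % block == 0:
                boundary_jumps.append(jump)  # pixel pair straddles a block edge
            else:
                inside_jumps.append(jump)
    mean_boundary = sum(boundary_jumps) / len(boundary_jumps)
    mean_inside = sum(inside_jumps) / len(inside_jumps)
    # Small epsilon avoids division by zero on perfectly flat blocks.
    return mean_boundary / (mean_inside + 1e-6)


# Hypothetical 16-pixel-wide images: a smooth ramp, and two flat 8-pixel blocks.
smooth = [list(range(16)) for _ in range(4)]
blocky = [[10] * 8 + [50] * 8 for _ in range(4)]
print(blockiness(smooth))  # close to 1: no preferred jump position
print(blockiness(blocky))  # very large: all change sits on the block edge
```

No original image is needed, but the indicator relies on knowing the block grid, which illustrates why distortion-specific NR methods generalize poorly to other distortion types.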

b) The second approach is based on machine learning: image features are used to train a supervised learning tool. For example, the method in [7] uses color features with a support vector machine (SVM) as the learning tool.

c) The third approach is based on natural scene statistics (NSS) modeling. It assumes that non-distorted (natural) images belong to a small subset of all possible images/signals, and it seeks the distance between the distorted image and this subspace of natural images [8].

NR VQA systems are usually designed for specific compression types, so when a new type of distortion appears they cannot quantify it well. Research on blind VQA is under way to reduce this drawback. For this purpose, Saad et al. [9] proposed a model called BLIINDS (BLind Image Integrity Notator using DCT Statistics), which combines the machine learning and natural scene statistics (NSS) approaches.

Figure 1. Overview of the different types of VQA systems


[1] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006.

[2] D. M. Chandler and S. S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” IEEE Trans. Image Process., vol. 16, no. 9, pp. 2284–2298, Sep. 2007.


[4] S. Wolf and M. H. Pinson, “Spatial-temporal distortion metrics for in-service quality monitoring of any digital video system,” in Proc. SPIE Multimedia Systems and Applications, vol. 3845, Boston, MA, 1999, pp. 266–277.

[5] K. T. Tan, M. Ghanbari, and D. E. Pearson, “An objective measurement tool for MPEG video quality,” Signal Process., vol. 70, no. 3, pp. 279–294, 1998.

[6] L. Ma, S. Li, F. Zhang et al., “Reduced-reference image quality assessment using reorganized DCT-based image representation,” IEEE Trans. Multimedia, vol. 13, no. 4, pp. 824–829, 2011.

[7] C. Charrier, G. Lebrun, and O. Lezoray, “A machine learning-based color image quality metric,” in Third Eur. Conf. Color Graphics, Imaging, and Vision, June 2006, pp. 251–256.

[8] T. Brandao and M. P. Queluz, “No-reference image quality assessment based on DCT-domain statistics,” Signal Process., vol. 88, no. 4, pp. 822–833, Apr. 2008.

[9] M. A. Saad et al., “A perceptual DCT statistics based blind image quality metric,” IEEE Signal Process. Lett., vol. 17, no. 6, pp. 583–586, 2010.
