Multi-Foveated Image and Video Compression

This page describes a new method for automatically finding regions of interest in static images or video clips, using a neurobiological model of visual attention. The model computes a topographic saliency map that indicates how conspicuous every location in the input image is, based on the responses of simulated neurons in primary visual cortex that are sensitive to color, intensity, oriented edges, flicker and oriented motion. The method is applied as a front-end filter for image and video compression: a spatially variable blur is applied to the input images prior to compression. Image locations that the attention model finds interesting (or salient) receive little or no blur, while locations increasingly far from those hot spots are increasingly blurred. More heavily blurred locations compress more efficiently and hence yield a smaller overall compressed file size. To the extent that the algorithm marks as hot spots the regions that human observers would find interesting, the blur around those regions should be tolerable when viewing the resulting clips.
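To make "increasingly blurred with distance from the hot spots" concrete, here is a minimal sketch that computes a distance map with a two-pass chamfer transform and converts it into a blur mask in [0, 1]. The 3-4 chamfer weights, the clipped linear ramp, and the names `chamfer_distance`, `distance_to_blur` and `max_dist` are assumptions chosen for illustration; they are not taken from the implementation described on this page.

```python
def chamfer_distance(seeds):
    """Approximate Euclidean distance (in pixels) to the nearest seed pixel.

    seeds: 2D list of booleans, True where a pixel belongs to a hot spot.
    Uses the classic two-pass 3-4 chamfer mask (an assumed choice of weights).
    """
    h, w = len(seeds), len(seeds[0])
    inf = 10 ** 9
    d = [[0 if seeds[y][x] else inf for x in range(w)] for y in range(h)]

    # Forward pass: propagate distances from the top-left corner.
    for y in range(h):
        for x in range(w):
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 3)
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + 4)

    # Backward pass: propagate distances from the bottom-right corner.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 3)
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 3)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + 4)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + 4)

    # Dividing by 3 expresses the result in units of one horizontal step.
    return [[v / 3.0 for v in row] for row in d]


def distance_to_blur(dist, max_dist):
    """Map distances to a blur mask: 0 = sharp, 1 = maximally blurred."""
    return [[min(v / max_dist, 1.0) for v in row] for row in dist]


# Toy example: a single hot spot in the middle of a 5x5 image.
seeds = [[(x, y) == (2, 2) for x in range(5)] for y in range(5)]
dist = chamfer_distance(seeds)
blur = distance_to_blur(dist, max_dist=2.0)
```

In this toy case the hot spot itself gets a blur value of 0, its immediate horizontal neighbor gets 0.5, and the image corners saturate at 1.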

The basic operation of the algorithm is as follows. Frames are evaluated using our model of saliency-based visual attention, and a saliency map is computed for each frame. The saliency map may either be used directly to determine the amount of blur to apply at every location in the image, or a small number of discrete foveas endowed with mass/spring/friction dynamics may attempt to follow a collection of the most salient objects, using proximity as well as feature similarity to track the objects across successive frames. If the saliency map is used for continuous blur, it is simply passed through a squashing function and used as the blurring mask. If discrete foveas are used, object segmentation is performed around each fovea, and a chamfer distance transform is used to determine the distance from every pixel in the image to the closest segmented object; this distance map is then used as the blurring mask. Blurring is obtained by computing a Gaussian pyramid from the input image and performing a trilinear interpolation over x, y, and depth in the pyramid (i.e., amount of blur, as given by the blurring mask value). The resulting frames are encoded using an MPEG-1 video encoder.
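The final blurring step can be sketched as follows for a grayscale image stored as a list of rows. This is only an illustration under stated assumptions: the pyramid uses a separable [1, 2, 1]/4 binomial kernel as a stand-in for a Gaussian, borders are clamped, four levels are built, and the mask is taken to lie in [0, 1] with 0 meaning sharp. The function names (`downsample`, `build_pyramid`, `bilinear`, `foveate`) are hypothetical.

```python
def downsample(img):
    """Smooth with a 3x3 binomial kernel, then keep every other pixel."""
    h, w = len(img), len(img[0])
    k = (1, 2, 1)

    def px(y, x):
        # Clamp coordinates so the kernel can overlap the image border.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            s = sum(k[j] * k[i] * px(y + j - 1, x + i - 1)
                    for j in range(3) for i in range(3))
            row.append(s / 16.0)
        out.append(row)
    return out


def build_pyramid(img, levels):
    """Level 0 is the input; each further level is half the resolution."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr


def bilinear(img, y, x):
    """Interpolate img at fractional coordinates (y, x), clamped to bounds."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy


def foveate(img, mask, levels=4):
    """Blur each pixel by an amount set by mask (0 = sharp, 1 = most blur)."""
    pyr = build_pyramid(img, levels)
    out = []
    for y in range(len(img)):
        out_row = []
        for x in range(len(img[0])):
            m = min(max(mask[y][x], 0.0), 1.0)
            depth = m * (levels - 1)          # fractional pyramid level
            d0 = min(int(depth), levels - 1)  # two neighboring levels
            d1 = min(d0 + 1, levels - 1)
            fd = depth - d0
            # Bilinear in (x, y) at each level, then linear in depth.
            a = bilinear(pyr[d0], y / 2 ** d0, x / 2 ** d0)
            b = bilinear(pyr[d1], y / 2 ** d1, x / 2 ** d1)
            out_row.append(a * (1 - fd) + b * fd)
        out.append(out_row)
    return out
```

With a mask of all zeros the output equals the input, and a constant image is left unchanged by any mask, which are two quick sanity checks for this kind of interpolation.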

On one example clip, we explore a number of variations of our saliency-based multi-foveation algorithm. Namely, we vary:

Results are shown below for all possible combinations of these parameters, on one sample video clip.

Comparison to human eye movements

To validate our approach, we compared the locations of high priority computed by our algorithm to the gaze locations of human observers watching the unfoveated clips. Eye movements were recorded from eight human observers, each watching a subset of a collection of 50 unfoveated video clips (each clip was viewed by 4 to 6 observers). We measured the average blur compounded over each human scanpath and compared it to the average blur over the entire field of view (which is very close to the average blur compounded over a random scanpath, or over a human scanpath after the saliency map has been randomly scrambled for each frame).
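The comparison described above can be sketched as follows. The data layout (one 2D blur mask per frame, eye samples as `(frame, x, y)` tuples), the function names, and the toy values are all assumptions made for illustration; they are not the recorded data or the analysis code used in the study.

```python
def mean_blur_on_scanpath(blur_masks, scanpath):
    """Average blur at the gaze position of every eye sample."""
    samples = [blur_masks[frame][y][x] for frame, x, y in scanpath]
    return sum(samples) / len(samples)


def mean_blur_over_field(blur_masks):
    """Average blur over every pixel of every frame (the baseline)."""
    values = [v for mask in blur_masks for row in mask for v in row]
    return sum(values) / len(values)


# Made-up 2x2 blur masks for two frames (0 = sharp, 1 = maximally blurred).
masks = [
    [[0.0, 0.5],
     [0.5, 1.0]],
    [[1.0, 0.5],
     [0.5, 0.0]],
]
# Made-up gaze samples: (frame index, x, y).
scanpath = [(0, 0, 0), (1, 0, 1)]

ratio = mean_blur_on_scanpath(masks, scanpath) / mean_blur_over_field(masks)
```

A ratio below 1 means that gaze tends to land on regions the algorithm left sharper than average; a ratio near 1 is what a random scanpath would be expected to give.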

The clips included:

In the clips below, human eye position is marked by a small cyan square. Since we recorded eye position at 240 Hz and played the clips at 30.13 fps, there are typically 7-8 cyan squares displayed on each frame.
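The 7-8 figure follows from the ratio 240 / 30.13 ≈ 7.97 eye samples per frame. A small sketch that assigns each sample to the frame on screen at its timestamp shows the same thing; it assumes that sampling and playback start together at time zero, and the function name `samples_per_frame` is hypothetical.

```python
from collections import Counter

EYE_HZ = 240.0   # eye-tracker sampling rate
FPS = 30.13      # clip playback rate


def samples_per_frame(n_frames):
    """Count how many eye samples fall within each of the first n_frames."""
    counts = Counter()
    i = 0
    while True:
        frame = int(i / EYE_HZ * FPS)  # frame displayed at sample i
        if frame >= n_frames:
            break
        counts[frame] += 1
        i += 1
    return [counts[f] for f in range(n_frames)]


counts = samples_per_frame(300)
average = sum(counts) / len(counts)
```

Every frame receives either 7 or 8 samples, with the average close to 7.97.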

Overall, a remarkable agreement was found between human eye fixations and regions of low blur as independently predicted by our algorithm. This study validates our algorithm as a valuable tool for the automatic selection of foveation centers in an unconstrained variety of video clips.


Copyright © 2003 by the University of Southern California, iLab and Prof. Laurent Itti