Feature Quantization and Pooling for Videos
CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE
Building video representations typically involves four steps: feature extraction, quantization, encoding, and pooling. While there have been large advances in feature extraction and encoding, the questions of how to quantize video features and what kinds of regions to pool them over have been relatively unexplored. To tackle the challenges present in video data, it is necessary to develop robust quantization and pooling methods.

The first contribution of this thesis, Source Constrained Clustering, quantizes features into a codebook that generalizes better across actions. The main insight is to incorporate readily available labels of the sources generating the data. Sources can be the people who performed each cooking recipe, the directors who made each movie, or the YouTube users who shared their videos.

In the pooling step, it is common to pool feature vectors over local regions. The regions of choice include the entire video, coarse spatio-temporal pyramids, or cuboids of pre-determined fixed size. A consequence of using indiscriminately chosen cuboids is that widely dissimilar features may be pooled together if they are in nearby locations. It is natural to consider pooling video features over supervoxels, for example obtained from a video segmentation. However, since videos can have different numbers of supervoxels, this produces a video representation of variable size. The second contribution of this thesis is a new, fixed-size video representation, Motion Words, where we pool features over video segments.

The ultimate goal of video segmentation is to recover object boundaries, often grouping pixels from regions of very different motion. However, in the context of Motion Words, it is important that regions preserve motion boundaries. The third contribution of this thesis is a supervoxel segmentation, Globally Consistent Supervoxels, which respects motion boundaries and provides better spatio-temporal support for Motion Words.
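To make the quantize-and-pool steps of the pipeline concrete, the following is a minimal sketch of the standard bag-of-words baseline that the thesis builds on: a k-means codebook quantizes local descriptors, and a histogram pools the assignments into one fixed-size vector per video. This illustrates the generic pipeline only, not Source Constrained Clustering itself; all array sizes and names here are illustrative.

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Toy k-means codebook: each row of `features` is a local descriptor."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        # Assign every descriptor to its nearest codeword.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def quantize_and_pool(features, centers):
    """Hard-assign descriptors to codewords, then pool into one histogram."""
    dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
    labels = dists.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized bag-of-words vector

# Illustrative run: 200 random 16-D descriptors, a codebook of 8 words.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))
codebook = build_codebook(feats, k=8)
bow = quantize_and_pool(feats, codebook)
print(bow.shape)  # one fixed-size vector regardless of video length
```

Source Constrained Clustering replaces the plain k-means step above with a clustering that also sees source labels (cook, director, or uploader), so the resulting codebook generalizes better across actions.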
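One way to see why pooling over supervoxels creates a variable-size problem, and how a second quantization step can restore a fixed size, is sketched below. Pooling yields one vector per segment, so the count varies per video; quantizing those per-segment vectors against a codebook of "motion words" and histogramming gives a fixed-size output. Note that this second step is an assumption made for illustration, not the thesis's actual Motion Words construction, and all names and sizes are hypothetical.

```python
import numpy as np

def pool_per_segment(features, segment_ids):
    """Average-pool local descriptors within each supervoxel/segment.
    Returns one pooled vector per segment (count varies per video)."""
    segs = np.unique(segment_ids)
    return np.stack([features[segment_ids == s].mean(axis=0) for s in segs])

def fixed_size_from_segments(segment_descs, word_centers):
    """Assumed construction: quantize per-segment descriptors against a
    'motion word' codebook and histogram the assignments, giving a
    fixed-size vector however many segments the video has."""
    dists = np.linalg.norm(segment_descs[:, None] - word_centers[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(word_centers))
    return hist / max(hist.sum(), 1)

# Illustrative run with random data standing in for real descriptors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 8))           # local descriptors
seg_ids = rng.integers(0, 25, size=300)     # hypothetical supervoxel labels
words = rng.normal(size=(16, 8))            # hypothetical motion-word codebook
segment_descs = pool_per_segment(feats, seg_ids)
vec = fixed_size_from_segments(segment_descs, words)
print(segment_descs.shape[0], vec.shape)    # variable segments, fixed vector
```

The quality of `segment_ids` is exactly what the third contribution addresses: supervoxels that leak across motion boundaries would average together dissimilar features in `pool_per_segment`, which is why Globally Consistent Supervoxels aim to preserve those boundaries.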