Learning and Parsing Video Events with Goal and Intent Prediction
Journal article preprint
CALIFORNIA UNIV LOS ANGELES DEPT OF STATISTICS
In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, which are specified by a number of grounded relations over image frames. An And-node represents a sequence of actions. An Or-node represents a number of alternative ways of such concatenations. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as a stochastic context-free grammar (SCFG). For each And-node, we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm to learn the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG which can understand events, infer the goal of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from the video data without manual supervision. (iii) Our algorithm infers the goal of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, and keeps all possible interpretations of the video to preserve the ambiguities.
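As a rough illustration of how And-nodes and Or-nodes generate a set of valid configurations of atomic actions, the following minimal Python sketch enumerates every action sequence a small And-Or graph can produce, together with its probability. It covers only the context-free (SCFG-equivalent) part of the representation and does not model the temporal relations between children. The class names, the `configurations` function, the action names, and the branch probabilities are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Tuple


@dataclass(frozen=True)
class Terminal:
    """An atomic action (hypothetical name-only stand-in)."""
    name: str


@dataclass(frozen=True)
class And:
    """A sequence of sub-events, performed in order."""
    children: tuple


@dataclass(frozen=True)
class Or:
    """Alternative sub-events; each branch is (probability, node)."""
    branches: tuple


def configurations(node) -> Dict[Tuple[str, ...], float]:
    """Map each action sequence the node can generate to its probability."""
    if isinstance(node, Terminal):
        return {(node.name,): 1.0}
    if isinstance(node, And):
        # Concatenate the children's sequences; probabilities multiply.
        result = {(): 1.0}
        for child in node.children:
            combined: Dict[Tuple[str, ...], float] = {}
            pairs = product(result.items(), configurations(child).items())
            for (seq_a, p_a), (seq_b, p_b) in pairs:
                key = seq_a + seq_b
                combined[key] = combined.get(key, 0.0) + p_a * p_b
            result = combined
        return result
    if isinstance(node, Or):
        # Mix the branches; each is weighted by its branching probability.
        result = {}
        for prob, child in node.branches:
            for seq, p in configurations(child).items():
                result[seq] = result.get(seq, 0.0) + prob * p
        return result
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical toy event: approach, then one of two alternatives, then pick up.
EVENT = And((
    Terminal("approach"),
    Or((
        (0.7, Terminal("press_button")),
        (0.3, And((Terminal("insert_coin"), Terminal("press_button")))),
    )),
    Terminal("pick_up"),
))

CONFIGS = configurations(EVENT)

if __name__ == "__main__":
    for sequence, probability in sorted(CONFIGS.items()):
        print(f"{probability:.2f}  {' -> '.join(sequence)}")
```

Enumerating all configurations is practical only for small graphs like this one; it is used here because it makes the correspondence between the And-Or structure and a set of weighted action sequences explicit.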