Unsupervised Grounding of Textual Descriptions of Object Features and Actions in Video

Our paper, “Unsupervised Grounding of Textual Descriptions of Object Features and Actions in Video”, has been accepted for publication at the 15th International Conference on Principles of Knowledge Representation and Reasoning (KR 2016).

Abstract

We propose a novel method for learning visual concepts and their correspondence to the words of a natural language. The concepts and correspondences are jointly inferred from video clips depicting simple actions involving multiple objects, together with corresponding natural language commands that would elicit these actions. Individual objects are first detected, together with quantitative measurements of their colour, shape, location and motion. Visual concepts emerge from the co-occurrence of regions within a measurement space and words of the language.

The method is evaluated on a set of videos generated automatically using computer graphics from a database of initial and goal configurations of objects. Each video is annotated with multiple natural language commands obtained from human annotators through crowdsourcing.
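For readers curious about the general idea of grounding words in visual measurements through co-occurrence, the sketch below is a minimal, hypothetical illustration, not the implementation described in the paper. It assumes object measurements are quantised into candidate concepts with k-means and that word–concept associations are scored with pointwise mutual information; the function name, feature layout, and scoring choice are all illustrative assumptions.

```python
# A minimal sketch (NOT the authors' implementation) of co-occurrence-based
# grounding of words in visual measurements.  The clustering step and the
# PMI scoring are illustrative assumptions.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def ground_words(videos, commands, n_concepts=10):
    """videos: list of (n_objects x d) arrays of object measurements
    (colour, shape, location, motion); commands: list of tokenised
    natural-language commands, one per video."""
    # 1. Pool all object measurements and quantise the measurement space
    #    into candidate visual concepts (here: k-means clusters).
    all_feats = np.vstack(videos)
    km = KMeans(n_clusters=n_concepts, n_init=10).fit(all_feats)

    # 2. Count co-occurrences between concepts and words within each
    #    video/command pair.
    co = Counter()            # (concept, word) -> count
    concept_freq = Counter()  # concept -> count
    word_freq = Counter()     # word -> count
    total = 0
    for feats, words in zip(videos, commands):
        concepts = set(km.predict(feats))
        words = set(words)
        for c in concepts:
            concept_freq[c] += 1
            for w in words:
                co[(c, w)] += 1
        for w in words:
            word_freq[w] += 1
        total += 1

    # 3. Score each (concept, word) pair by pointwise mutual information;
    #    high-scoring pairs are taken as candidate groundings.
    scores = {}
    for (c, w), n in co.items():
        p_cw = n / total
        p_c = concept_freq[c] / total
        p_w = word_freq[w] / total
        scores[(c, w)] = np.log(p_cw / (p_c * p_w))
    return km, scores
```

In this toy setup, each cluster in the pooled measurement space plays the role of a visual concept, and a word is grounded in whichever concepts it co-occurs with more often than chance across the video–command pairs.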

Cite as:

M. Al-Omari, E. Chinellato, Y. Gatsoulis, D. C. Hogg and A. G. Cohn. Unsupervised Grounding of Textual Descriptions of Object Features and Actions in Video. In Proc. of the 15th International Conference on Principles of Knowledge Representation and Reasoning (KR 2016), Cape Town, South Africa, April 2016.
