Our paper titled “Unsupervised Grounding of Textual Descriptions of Object Features and Actions in Video” has been accepted for publication at the 15th International Conference on Principles of Knowledge Representation and Reasoning (KR 2016).
We propose a novel method for learning visual concepts and their correspondence to the words of a natural language. The concepts and correspondences are jointly inferred from video clips depicting simple actions involving multiple objects, together with corresponding natural language commands that would elicit these actions. Individual objects are first detected, together with quantitative measurements of their colour, shape, location, and motion. Visual concepts then emerge from the co-occurrence of regions of the measurement space with words of the language.
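To make the co-occurrence idea concrete, here is a minimal, self-contained sketch. It is not the paper's actual model: the clips, hue values, binning width, and correlation-style score below are all illustrative assumptions. It only shows the core intuition that a region of a measurement space (here, coarse bins of colour hue) can be grounded to the word it co-occurs with most strongly across clips.

```python
from collections import defaultdict

# Toy data standing in for the paper's pipeline (all values are
# illustrative assumptions, not from the paper): each clip pairs
# per-object hue measurements (degrees) with a command.
clips = [
    ([10.0, 100.0], "pick up the red block and the green block"),
    ([8.0], "move the red block left"),
    ([100.0, 230.0], "put the green block on the blue block"),
    ([235.0], "push the blue block forward"),
]

def bin_hue(h, width=60.0):
    """Quantise a hue measurement into a coarse region of colour space."""
    return int(h // width)

cooc = defaultdict(int)         # (region, word) -> clips where both occur
region_freq = defaultdict(int)  # region -> clips containing it
word_freq = defaultdict(int)    # word -> clips containing it

for hues, command in clips:
    words = set(command.split())
    regions = {bin_hue(h) for h in hues}
    for w in words:
        word_freq[w] += 1
    for r in regions:
        region_freq[r] += 1
    for w in words:
        for r in regions:
            cooc[(r, w)] += 1

def ground(region):
    """Map a colour-space region to the word it co-occurs with most
    strongly, using a symmetric correlation-style score so that
    ubiquitous words like 'the' and rare one-off words are discounted."""
    scores = {
        w: cooc[(region, w)] ** 2 / (region_freq[region] * word_freq[w])
        for w in word_freq if cooc[(region, w)] > 0
    }
    return max(scores, key=scores.get)

print(ground(bin_hue(10.0)))  # the region covering red hues -> "red"
```

The squared-co-occurrence score is one simple way to penalise both function words (high word frequency) and coincidental one-off pairings; the paper's joint inference is of course richer, operating over colour, shape, location, and motion simultaneously.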
The method is evaluated on a set of videos generated automatically using computer graphics from a database of initial and goal configurations of objects. Each video is annotated with multiple natural language commands obtained from human annotators via crowdsourcing.
M. Al-Omari, E. Chinellato, Y. Gatsoulis, D. C. Hogg and A. G. Cohn. Unsupervised Grounding of Textual Descriptions of Object Features and Actions in Video. In Proc. of the 15th International Conference on Principles of Knowledge Representation and Reasoning (KR). Cape Town, South Africa. April 2016.