Current computer vision models for image segmentation and object detection rely on awkward feature mappings and decoupled subtask representations, which make joint inference clumsy and leave the classification of many parts of the scene ambiguous. Gould et al. propose a hierarchical model that combines image segmentation and object detection, reasoning simultaneously about pixels, regions, and objects so that every pixel is uniquely explained, pixels are grouped into coherent regions, and object boundaries are precise.
At the heart of this methodology is an energy function that captures the a priori location of the horizon in the scene, preferences for assigning each region to different semantic labels, the quality of region boundaries, the likelihood that a group of regions carries a given object label, and contextual relationships between objects. Inference iteratively proposes moves, such as merging or splitting adjacent regions or objects and locally reassigning pixels to neighboring regions or objects, and accepts a move only if it reduces the energy.
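The move-making inference described above can be sketched as a greedy hill climb. The following is a minimal toy illustration, not the paper's actual energy: the `energy` function here is a hypothetical stand-in (label count plus neighbor disagreement), and only the merge move is implemented, whereas the real model also scores horizon position, semantic and geometric labels, boundaries, and object context, and also proposes splits and pixel reassignments.

```python
import random

# Toy stand-in for the paper's energy: number of distinct region labels plus
# a penalty for each pair of neighboring pixels with different labels.
# (The real energy also scores horizon, semantics, boundaries, and context.)
def energy(labels, edges):
    smoothness = sum(1 for a, b in edges if labels[a] != labels[b])
    return len(set(labels.values())) + smoothness

def propose_merge(labels):
    """Propose merging one region label into another (one of the move types)."""
    values = list(set(labels.values()))
    if len(values) < 2:
        return labels
    src, dst = random.sample(values, 2)
    return {p: (dst if l == src else l) for p, l in labels.items()}

def hill_climb(labels, edges, iters=200, seed=0):
    """Greedy inference: accept a proposed move only if it lowers the energy."""
    random.seed(seed)
    best, best_e = labels, energy(labels, edges)
    for _ in range(iters):
        cand = propose_merge(best)
        e = energy(cand, edges)
        if e < best_e:
            best, best_e = cand, e
    return best, best_e
```

On a four-pixel chain initially over-segmented into three regions, the loop keeps accepting merges until a single region remains, since each merge here strictly lowers the toy energy.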
Key improvements of Gould et al.'s model over other image segmentation and object detection models include enforced consistency between sub-tasks and the incorporation of a background region, which eliminates large portions of the image and reduces the number of component regions that must be considered for each object. Interestingly, the authors use a closed-loop learning regime in which mistakes made while running inference on the training set are added back to the training set; this improves the diversity of examples and substantially improves performance rather than causing over-fitting. Other creative tricks employed by Gould et al. include a multi-scale approach that runs scene decomposition on a low-resolution version of the image while extracting features from the high-resolution version, lessening the computational burden.
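The closed-loop learning regime can be sketched with a toy learner; everything below is hypothetical scaffolding rather than the paper's actual training procedure. The "model" is just a 1-D threshold classifier, but the loop structure matches the described idea: train, run the model over the training pool, and fold its mistakes back into the training set for the next round.

```python
def fit_threshold(points):
    """Toy learner: pick the threshold t minimizing errors on (x, label)
    pairs, predicting label 1 when x >= t."""
    candidates = sorted(x for x, _ in points)
    def errors(t):
        return sum((x >= t) != bool(y) for x, y in points)
    return min(candidates, key=errors)

def closed_loop_train(train_pool, seed_size=4, rounds=3):
    """Closed-loop regime (sketch): start from a small seed set; each round,
    run the current model on the full pool and add its mistakes back into
    the training data, so later rounds see harder, more diverse examples."""
    data = list(train_pool[:seed_size])
    t = fit_threshold(data)
    for _ in range(rounds):
        mistakes = [(x, y) for x, y in train_pool if (x >= t) != bool(y)]
        if not mistakes:
            break  # model is consistent with the whole pool
        data.extend(mistakes)
        t = fit_threshold(data)
    return t
```

After a round or two, the threshold fit on the augmented data classifies the full pool correctly, illustrating how feeding inference-time mistakes back in sharpens the model on examples the seed set missed.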