TensorFlow Deep Learning Projects

The Microsoft common objects in context

Advances in the application of deep learning to computer vision are often highly focused on the kind of classification problems summarized by challenges such as ImageNet (but also, for instance, PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/) and on the ConvNets suitable to crack them (Xception, VGG16, VGG19, ResNet50, InceptionV3, and MobileNet, to quote just the ones available in the well-known package Keras: https://keras.io/applications/).

Though deep learning networks based on ImageNet data are the actual state of the art, such networks can experience difficulties when faced with real-world applications. In practical applications, we have to process images that are quite different from the examples provided by ImageNet. In ImageNet, the element to be classified is clearly the only prominent element in the image, ideally set in an unobstructed way near the center of a neatly composed photo. In images taken in the field, by contrast, objects are scattered around at random, often in large numbers. These objects are also quite different from each other, sometimes creating confusing scenes. Moreover, objects of interest often cannot be clearly and directly perceived because they are visually obstructed by other, potentially interesting, objects.

Please refer to the figure from the reference mentioned below:

Figure 1: A sample of images from ImageNet: they are arranged in a hierarchical structure, allowing you to work with both general and more specific classes.

SOURCE: DENG, Jia, et al. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 2009. p. 248-255.

Realistic images contain multiple objects that sometimes can hardly be distinguished from a noisy background. Often you cannot create interesting projects just by labeling an image with a single tag telling you which object was recognized with the highest confidence.

In a real-world application, you really need to be able to do the following:

  • Object classification of single and multiple instances, recognizing various objects, often of the same class
  • Image localization, that is, understanding where the objects are in the image
  • Image segmentation, marking each pixel in the image with a label (the type of object or background) in order to be able to cut interesting parts out from the background

The necessity of training a ConvNet able to achieve some or all of the preceding objectives led to the creation of the Microsoft common objects in context (MS COCO) dataset, described in the paper: LIN, Tsung-Yi, et al. Microsoft COCO: common objects in context. In: European Conference on Computer Vision. Springer, Cham, 2014. p. 740-755 (you can read the original paper at https://arxiv.org/abs/1405.0312). The dataset is made up of 91 common object categories, hierarchically ordered, 82 of which have more than 5,000 labeled instances. In total, the dataset contains 2,500,000 labeled objects distributed over 328,000 images.

Here are the classes that can be recognized in the MS COCO dataset:

{1: 'person', 2: 'bicycle', 3: 'car', 4: 'motorcycle', 5: 'airplane', 6: 'bus', 7: 'train', 8: 'truck', 9: 'boat', 10: 'traffic light', 11: 'fire hydrant', 13: 'stop sign', 14: 'parking meter', 15: 'bench', 16: 'bird', 17: 'cat', 18: 'dog', 19: 'horse', 20: 'sheep', 21: 'cow', 22: 'elephant', 23: 'bear', 24: 'zebra', 25: 'giraffe', 27: 'backpack', 28: 'umbrella', 31: 'handbag', 32: 'tie', 33: 'suitcase', 34: 'frisbee', 35: 'skis', 36: 'snowboard', 37: 'sports ball', 38: 'kite', 39: 'baseball bat', 40: 'baseball glove', 41: 'skateboard', 42: 'surfboard', 43: 'tennis racket', 44: 'bottle', 46: 'wine glass', 47: 'cup', 48: 'fork', 49: 'knife', 50: 'spoon', 51: 'bowl', 52: 'banana', 53: 'apple', 54: 'sandwich', 55: 'orange', 56: 'broccoli', 57: 'carrot', 58: 'hot dog', 59: 'pizza', 60: 'donut', 61: 'cake', 62: 'chair', 63: 'couch', 64: 'potted plant', 65: 'bed', 67: 'dining table', 70: 'toilet', 72: 'tv', 73: 'laptop', 74: 'mouse', 75: 'remote', 76: 'keyboard', 77: 'cell phone', 78: 'microwave', 79: 'oven', 80: 'toaster', 81: 'sink', 82: 'refrigerator', 84: 'book', 85: 'clock', 86: 'vase', 87: 'scissors', 88: 'teddy bear', 89: 'hair drier', 90: 'toothbrush'}
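Note that the IDs in the mapping above are not contiguous (for instance, 12 and 26 are unused), so lookups should always go by ID rather than by list position. A minimal sketch of a safe lookup, using a hypothetical excerpt of the dictionary shown above:

```python
# Excerpt of the MS COCO category mapping shown above;
# note the gap between IDs 11 and 13 (ID 12 is unassigned)
coco_classes = {1: 'person', 2: 'bicycle', 3: 'car', 10: 'traffic light',
                11: 'fire hydrant', 13: 'stop sign'}

def class_name(class_id):
    """Return the label for a COCO class ID, or 'unknown' for unassigned IDs."""
    return coco_classes.get(class_id, 'unknown')

print(class_name(1))   # person
print(class_name(12))  # unknown: ID 12 is not assigned in MS COCO
```

Using `dict.get` with a default avoids a `KeyError` when a model emits one of the unassigned IDs.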

Though the ImageNet dataset presents 1,000 object classes (as described at https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a) distributed over 14,197,122 images, MS COCO offers the peculiar feature of multiple objects distributed over a smaller number of images (the dataset was gathered using Amazon Mechanical Turk, a somewhat more costly approach, one shared by ImageNet, too). Given such premises, the MS COCO images can be considered very good examples of contextual relationships and non-iconic object views, since objects are arranged in realistic positions and settings. This can be verified in the following comparative example taken from the MS COCO paper previously mentioned:

Figure 2: Examples of iconic and non-iconic images. SOURCE: LIN, Tsung-Yi, et al. Microsoft COCO: common objects in context. In: European Conference on Computer Vision. Springer, Cham, 2014. p. 740-755.

In addition, the image annotation in MS COCO is particularly rich, offering the coordinates of the contours of the objects present in the images. These contours can easily be translated into bounding boxes, the boxes that delimit the part of the image where an object is located. Bounding boxes are a rougher way to locate objects than the annotation originally used for MS COCO itself, which is based on pixel segmentation.
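As a sketch of the translation just described, the axis-aligned bounding box of a segmentation contour is simply the minimum and maximum of its x and y coordinates (the function and variable names below are illustrative, not part of any MS COCO API):

```python
def contour_to_bbox(contour):
    """Convert a polygon contour, given as a list of (x, y) points,
    into a COCO-style bounding box: (x_min, y_min, width, height)."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min, max(ys) - y_min)

# A triangular contour around an object
print(contour_to_bbox([(10, 20), (30, 25), (18, 40)]))  # (10, 20, 20, 20)
```

The conversion loses detail by design: every pixel inside the contour is inside the box, but not vice versa.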

In the following figure, a crowded scene has been carefully segmented by defining notable areas in the image and creating a textual description of those areas. In machine learning terms, this translates to assigning a label to every pixel in the image and trying to predict the segmentation class (corresponding to the textual description). Historically, this was done with image processing techniques, until ImageNet 2012, when deep learning proved a much more effective solution.
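In code, a segmentation target is just an integer class label per pixel. A minimal NumPy sketch, with a toy mask and illustrative class IDs of our own choosing:

```python
import numpy as np

# A toy 4x4 segmentation mask: 0 = background, 1 = 'person', 3 = 'car'
# (IDs chosen only for illustration)
mask = np.zeros((4, 4), dtype=np.int32)
mask[0:2, 0:2] = 1   # a person occupies the top-left 2x2 block
mask[2:4, 1:3] = 3   # a car occupies a 2x2 block lower down

# A segmentation model is trained to predict this label for every pixel
print(np.unique(mask))    # [0 1 3]
print((mask == 1).sum())  # 4 pixels belong to the person
```

A real mask has the same height and width as the input image, and the model's output is a per-pixel probability distribution over the classes.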

The year 2012 marked a milestone in computer vision because, for the first time, a deep learning solution provided results far superior to those of any technique used before: KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON, Geoffrey E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. 2012. p. 1097-1105 (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf).

Image segmentation is particularly useful for various tasks, such as:

  • Highlighting the important objects in an image, for instance, detecting diseased areas in medical applications
  • Locating objects in an image so that a robot can pick them up or manipulate them
  • Helping with road scene understanding for self-driving cars or drones to navigate
  • Editing images by automatically extracting portions of an image or removing a background

This kind of annotation is very expensive (hence the reduced number of examples in MS COCO) because it has to be done completely by hand, and it requires attention and precision. There are some tools to help with annotating an image by segmentation. You can find a comprehensive list at https://stackoverflow.com/questions/8317787/image-labelling-and-annotation-tool. However, we can suggest the following two tools, if you want to annotate images by segmentation yourself:

All these tools can also be used for the much simpler annotation by bounding boxes, and they can really come in handy if you want to retrain a model from MS COCO on a class of your own (we will mention this again at the end of the chapter):

A pixel segmentation of an image used in the MS COCO training phase