Object detection: a quick overview
Since the deep learning breakthrough in 2012, when a deep CNN model called AlexNet won the annual ImageNet visual recognition challenge by dramatically reducing the error rate, many researchers in computer vision and natural language processing have started to take advantage of the power of deep learning models. Modern deep-learning-based object detectors are all based on CNNs and built on top of pre-trained models such as AlexNet, Google Inception, or VGG, another popular network. These CNNs typically have millions of trained parameters and can convert an input image into a set of features that can be further used for tasks such as image classification, which we covered in the previous chapter, and object detection, among other computer-vision-related tasks.
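To see what "converting an image into features" looks like in practice, here's a minimal sketch, independent of any particular detector, that loads a pre-trained VGG16 via tf.keras with its classification head removed so it acts as a pure feature extractor; the random array is just a stand-in for a real image:

```python
import numpy as np
import tensorflow as tf

# VGG16 pre-trained on ImageNet, without its classification head,
# so its output is a spatial grid of feature vectors, not class scores.
model = tf.keras.applications.VGG16(weights='imagenet', include_top=False)

image = np.random.rand(1, 224, 224, 3).astype(np.float32) * 255.0  # stand-in for a real photo
image = tf.keras.applications.vgg16.preprocess_input(image)        # VGG's mean subtraction

features = model.predict(image)
print(features.shape)  # (1, 7, 7, 512): 512-d features at each of 7x7 spatial locations
```

Detectors build on exactly this kind of feature map, attaching classification and box-prediction heads to it instead of a single image-level classifier.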
In 2014, a state-of-the-art object detector called RCNN (Regions with CNN features), which retrained AlexNet on a labeled object detection dataset, was proposed, and it offered a big improvement in accuracy over traditional detection methods. RCNN combines a technique called region proposals, which generates about 2,000 possible region candidates, with a CNN run on each of those regions for classification and bounding-box prediction; it then merges those per-region results to generate the final detections. The training process of RCNN is pretty complicated and takes several days, and the inference speed is also quite slow, taking almost a minute per image even on a GPU.
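The "merge" step at the end of detectors like RCNN is typically non-maximum suppression (NMS): among overlapping boxes that fire on the same object, keep only the most confident one. Here's a minimal NumPy sketch of greedy NMS, as a generic illustration rather than RCNN's exact implementation:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes that overlap it too much, then repeat.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # detection indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # keep only low-overlap boxes
    return keep
```

Every detector covered in this chapter, single-shot or not, ends its pipeline with some variant of this step.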
Since RCNN was proposed, better-performing object detection algorithms have come one after another: Fast RCNN, Faster RCNN, YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and YOLO v2.
Andrej Karpathy wrote a good introduction to RCNN, "Playing around with RCNN, State of the Art Object Detector," in 2014 (https://cs.stanford.edu/people/karpathy/rcnn). There's also a nice video lecture on object detection, "Spatial Localization and Detection," given by Justin Johnson as part of Stanford's CS231n course, with details on RCNN, Fast RCNN, Faster RCNN, and YOLO. SSD is described in detail at https://github.com/weiliu89/caffe/tree/ssd, and the cool YOLO v2 website is https://pjreddie.com/darknet/yolo.
Fast RCNN significantly improves both the training process and inference time (about 10 hours of training and just over 2 seconds of inference per image) by first applying a CNN to the whole input image, instead of to thousands of proposed regions, and only then dealing with the region proposals. Faster RCNN further improves the inference speed to real time (0.2 seconds per image) by using a region proposal network, so that after training, the time-consuming region proposal process is no longer needed.
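What lets Fast RCNN share one CNN pass across all proposals is RoI pooling: each proposed region is cropped out of the shared feature map and max-pooled to a fixed size, so regions of any shape can feed the same fully connected layers. Here's a toy NumPy sketch of the idea; real implementations handle sub-pixel region boundaries more carefully:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Toy RoI pooling: crop one region from a shared CNN feature map and
    max-pool it to a fixed spatial size, whatever the region's shape.
    feature_map: (H, W, C) array; roi: (y1, x1, y2, x2) in feature-map cells."""
    y1, x1, y2, x2 = roi
    crop = feature_map[y1:y2, x1:x2, :]
    out_h, out_w = output_size
    # Cell boundaries that split the crop into an out_h x out_w grid
    ys = np.linspace(0, crop.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, crop.shape[1], out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]), feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = crop[ys[i]:max(ys[i + 1], ys[i] + 1),   # ensure non-empty cells
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

# Each proposal, whatever its size, becomes a 7x7xC tensor for the classifier head.
features = np.random.rand(38, 50, 512).astype(np.float32)  # one shared CNN pass
print(roi_pool(features, (5, 10, 20, 40)).shape)           # (7, 7, 512)
```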
Unlike the RCNN family of detectors, SSD and YOLO are both single-shot methods, meaning they apply a single CNN to the full input image without using region proposals and per-region classification. This makes both methods very fast, and their mean Average Precision (mAP) is about 80%, outperforming that of Faster RCNN.
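A quick aside on that metric: mAP is the mean, over all classes, of the Average Precision, the area under a class's precision-recall curve built from its confidence-ranked detections. Here's a simplified NumPy sketch; PASCAL VOC and COCO actually use interpolated variants of this computation:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Simplified AP for one class. A detection counts as a true positive
    when its IoU with an unmatched ground-truth box exceeds a threshold
    (commonly 0.5); that matching is assumed to have been done already."""
    order = np.argsort(scores)[::-1]                # rank detections by confidence
    tp = np.asarray(is_true_positive, float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    return float(np.trapz(precision, recall))       # area under the PR curve

# mAP is simply this value averaged over all classes.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], 4))
```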
If this is the first time you've heard of these methods, you probably feel a bit lost. But as a developer interested in powering up your mobile apps with AI, you don't need to understand all the details of setting up deep neural network architectures and training models for object detection; you just need to know how to use and, if needed, retrain pre-trained models, and how to use the pre-trained or retrained models in your iOS and Android apps.
If you're really interested in deep learning research and want to know all the details of how each detector works in order to decide which one to use, you should definitely read the paper on each method and try to reproduce the training process on your own. It'll be a long but rewarding road. But if you want to take Andrej Karpathy's advice, "don't be a hero" (search on YouTube for "deep learning for computer vision Andrej"), then you can "take whatever works best, download a pre-trained model, potentially add/delete some parts of it, and fine-tune it on your app," which is also the approach we'll use here.
Before we start looking at what works best with TensorFlow, a quick note on datasets. There are three main datasets used for training object detectors: PASCAL VOC (http://host.robots.ox.ac.uk/pascal/VOC), ImageNet (http://image-net.org), and Microsoft COCO (http://cocodataset.org), covering 20, 200, and 80 object classes, respectively. Most of the pre-trained models the TensorFlow Object Detection API currently supports are trained on the 80-class MS COCO dataset (for a complete list of the pre-trained models and the datasets they're trained on, see https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md).
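To give you a feel for what's coming, here's a minimal TensorFlow 1.x-style sketch of running one of those pre-trained models; the model path is a placeholder for a frozen graph you'd download and unpack from the model zoo, and the input/output tensor names are the ones the Object Detection API's exported graphs use:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x style, as used by the Object Detection API

# Placeholder path: point this at a frozen graph unpacked from the model zoo,
# e.g. ssd_mobilenet_v1_coco's frozen_inference_graph.pb.
MODEL_PATH = 'ssd_mobilenet_v1_coco/frozen_inference_graph.pb'

graph_def = tf.GraphDef()
with tf.gfile.GFile(MODEL_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    # A dummy 300x300 RGB image; replace with a real uint8 image array.
    image = np.zeros((1, 300, 300, 3), dtype=np.uint8)
    boxes, scores, classes, num = sess.run(
        ['detection_boxes:0', 'detection_scores:0',
         'detection_classes:0', 'num_detections:0'],
        feed_dict={'image_tensor:0': image})
    print('Detections:', int(num[0]))
```

The returned boxes are in normalized coordinates and the class IDs index into the dataset's label map, which is why knowing which dataset a model was trained on matters.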
Although we won't do the training from scratch, you'll see frequent mention of the PASCAL VOC and MS COCO data formats, as well as the 20 or 80 common classes they cover, when retraining or using the trained models. In the last section of this chapter, we'll try both a VOC-trained YOLO model and a COCO-trained one.