
Understanding the process flow
Features are extracted, matched, and tracked by the FeatureMatching class, specifically by its public match method. However, before we can begin analyzing the incoming video stream, we have some homework to do. It might not be immediately clear what some of these steps mean (especially those involving SURF and FLANN), but we will discuss them in detail in the following sections.
For now, we only have to worry about initialization:
class FeatureMatching:
    def __init__(self, train_image: str = "train.png") -> None:
The following steps cover the initialization process:
- The following line sets up a SURF detector, which we will use to detect and extract features from images (see the Learning feature extraction section for further details). Its Hessian threshold is set to 400; values between 300 and 500 generally work well:
self.f_extractor = cv.xfeatures2d_SURF.create(hessianThreshold=400)
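Note that SURF ships in the non-free xfeatures2d module of opencv-contrib, so this call will fail on builds compiled without it. As a hedged sketch (not part of this chapter's code), one could fall back to the free ORB detector in that case; keep in mind that ORB produces binary descriptors, so the FLANN setup shown later would then have to switch to an LSH index or a Hamming-norm BFMatcher:
try:
    # SURF lives in the non-free xfeatures2d module of opencv-contrib
    self.f_extractor = cv.xfeatures2d_SURF.create(hessianThreshold=400)
except (AttributeError, cv.error):
    # Hypothetical substitution: fall back to the patent-free ORB detector
    self.f_extractor = cv.ORB_create(nfeatures=1000)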
- We load a template of our object of interest as a grayscale image (self.img_obj), raising an error if it cannot be found:
self.img_obj = cv.imread(train_image, cv.IMREAD_GRAYSCALE)
assert self.img_obj is not None, f"Could not find train image {train_image}"
- Also, we store the shape of the image (self.sh_train) for convenience:
self.sh_train = self.img_obj.shape[:2]
We will call the template image the train image, as our algorithm will be trained to find it, and we will call every incoming frame a query image, as we will query each of them for the trained object. The following photograph is the train image:
The previous train image has a size of 512 x 512 pixels and will be used to train the algorithm.
- Next, we apply SURF to the object of interest. This can be done with a convenient function call that returns both a list of keypoints and the descriptor (you can refer to the Learning feature extraction section for further explanation):
self.key_train, self.desc_train = \
self.f_extractor.detectAndCompute(self.img_obj, None)
We will do the same with each incoming frame and then compare lists of features across images.
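To get a feel for what detectAndCompute returns, here is a small stand-alone sketch; the file names and the SURF-enabled OpenCV build are assumptions, not part of the chapter's code:
import cv2 as cv

img = cv.imread("train.png", cv.IMREAD_GRAYSCALE)
surf = cv.xfeatures2d_SURF.create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)

# Each keypoint carries a position, scale, and orientation; each descriptor is
# a 64-dimensional float vector (128 if the extended flag is set)
print(f"found {len(keypoints)} keypoints")
print(f"descriptor array shape: {descriptors.shape}")

# Draw the keypoints with their size and orientation for visual inspection
out = cv.drawKeypoints(img, keypoints, None,
                       flags=cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv.imwrite("train_keypoints.png", out)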
- Now, we set up a FLANN object that will be used to match the features of the train and query images (refer to the Understanding feature matching section for further details). This requires the specification of some additional parameters via dictionaries, such as which algorithm to use and how many trees to run in parallel:
index_params = {"algorithm": 0, "trees": 5}
search_params = {"checks": 50}
self.flann = cv.FlannBasedMatcher(index_params, search_params)
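Here, algorithm 1 selects FLANN's kd-tree index, trees controls how many randomized trees are built, and checks bounds how many leaves are visited per query. To illustrate how such a matcher is typically queried, here is a short sketch in which desc_frame is a hypothetical array holding the SURF descriptors of an incoming frame; the exact call used by match_features, including the ratio threshold, is covered in the Understanding feature matching section:
# desc_frame: hypothetical descriptors of the current query frame
matches = self.flann.knnMatch(desc_frame, self.desc_train, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the
# second-best candidate for the same keypoint
good = [m[0] for m in matches
        if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]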
- Finally, we initialize some additional bookkeeping variables. These will come in handy when we want to make our feature tracking both faster and more accurate. For example, we will keep track of the latest computed homography matrix and of the number of frames we have spent without locating our object of interest (refer to the Learning feature tracking section for more details):
self.last_hinv = np.zeros((3, 3))
self.max_error_hinv = 50.
self.num_frames_no_success = 0
self.max_frames_no_success = 5
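To give a rough idea of how these variables could interact, consider the following illustrative sketch, where hinv stands for the homography estimated in the current frame; this is not the exact criterion, which is developed in the Learning feature tracking section:
# Illustrative only: if we located the object just a few frames ago, a new
# homography hinv that deviates strongly from the last accepted one is
# probably an outlier and the frame is skipped
similar = np.linalg.norm(hinv - self.last_hinv) < self.max_error_hinv
recent = self.num_frames_no_success < self.max_frames_no_success
if recent and not similar:
    self.num_frames_no_success += 1
else:
    self.last_hinv = hinv
    self.num_frames_no_success = 0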
Then, the bulk of the work is done by the FeatureMatching.match method. This method follows the procedure outlined here (a code sketch of the geometric steps appears after the list):
- It extracts interesting image features from each incoming video frame.
- It matches features between the template image and the video frame. This is done in FeatureMatching.match_features. If no such match is found, it skips to the next frame.
- It finds the corner points of the template image in the video frame. This is done in the detect_corner_points function. If any of the corners lie (significantly) outside the frame, it skips to the next frame.
- It calculates the area of the quadrilateral that the four corner points span. If the area is either too small or too large, it skips to the next frame.
- It outlines the corner points of the template image in the current frame.
- It finds the perspective transform that is necessary to bring the located object from the current frame to the frontoparallel plane. If the result is significantly different from the result we got recently for an earlier frame, it skips to the next frame.
- It warps the perspective of the current frame to make the object of interest appear centered and upright.
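The corner-point localization, the perspective-transform check, and the warping step all revolve around a homography between the train image and the current frame. The following stand-alone sketch shows how OpenCV's findHomography, perspectiveTransform, and warpPerspective cover these geometric steps; the function name, argument names, and the RANSAC threshold are illustrative and not identical to the helpers used in this chapter:
import numpy as np
import cv2 as cv

def locate_and_warp(frame, sh_train, pts_train, pts_query):
    # pts_train and pts_query are (N, 2) float32 arrays of matched keypoint
    # coordinates in the train image and the query frame, respectively
    h_train, w_train = sh_train

    # Homography that maps train-image coordinates into the query frame
    hmat, _mask = cv.findHomography(pts_train, pts_query, cv.RANSAC, 5.0)

    # Project the template's corner points into the frame
    corners = np.float32([[0, 0], [w_train, 0],
                          [w_train, h_train], [0, h_train]]).reshape(-1, 1, 2)
    corners_in_frame = cv.perspectiveTransform(corners, hmat)

    # Area of the quadrilateral spanned by the projected corners, which can
    # be checked against plausible bounds
    area = cv.contourArea(corners_in_frame)

    # Warp the frame with the inverse homography to bring the located object
    # back to a frontoparallel view
    hinv = np.linalg.inv(hmat)
    warped = cv.warpPerspective(frame, hinv, (w_train, h_train))
    return corners_in_frame, area, warped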
In the following sections, we will discuss the previous steps in detail.
Let's first take a look at the feature extraction step in the next section. This step is the core of our algorithm. It will find informative areas in the image and represent them in a lower dimensionality so that we can use those representations afterward to decide whether two images contain similar features.