Learning about the seeds dataset
We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with the Iris dataset. This dataset consists of measurements of wheat seeds. Each seed is described by the following seven features:
- Area A
- Perimeter P
- Compactness C = 4πA/P²
- Length of kernel
- Width of kernel
- Asymmetry coefficient
- Length of kernel groove
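To make the compactness feature concrete, here is a minimal sketch that evaluates C = 4πA/P² for two idealized shapes. The function name `compactness` and the example shapes are illustrative assumptions, not part of the dataset. A circle gives exactly 1, the largest possible value, while less round shapes score lower.

```python
import math


def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P**2 (hypothetical helper name)."""
    return 4 * math.pi * area / perimeter ** 2


# A circle of radius r has A = pi*r**2 and P = 2*pi*r, so C = 1.
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A unit square has A = 1 and P = 4, so C = pi/4 (about 0.785).
square_c = compactness(1.0, 4.0)

print(circle_c, square_c)
```

Because area scales with the square of the size and the perimeter scales linearly, the ratio does not depend on how large the shape is, only on how round it is.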
There are three classes corresponding to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to classify the variety based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images.
This is how image pattern recognition can be implemented: you can take images, in digital form, compute a few relevant features from them, and use a generic classification system. In Chapter 12, Computer Vision, we will work through the computer vision side of this problem and compute features in images. For the moment, we will work with the features that are given to us.
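The pipeline just described (image, then features, then a generic classifier) can be sketched on toy data. Everything in the following block is an assumption made for illustration: the "images" are tiny synthetic binary masks rather than seed photographs, the helper names are hypothetical, the only feature is compactness, and the classifier is a plain nearest-neighbour rule, which is not necessarily the method used later in this chapter.

```python
import math


def shape_features(mask):
    """Compute a one-element feature tuple (compactness) from a binary mask.

    Assumes the mask contains at least one foreground pixel.
    """
    rows, cols = len(mask), len(mask[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            # Each pixel side that faces background (or the image edge)
            # contributes one unit to the perimeter.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if not inside or not mask[nr][nc]:
                    perimeter += 1
    return (4 * math.pi * area / perimeter ** 2,)


def nearest_neighbour(train, query):
    """Return the label of the training example closest to the query."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda item: distance(item[0], query))[1]


def rectangle(height, width):
    """Build a synthetic all-foreground mask (toy stand-in for an image)."""
    return [[1] * width for _ in range(height)]


# Toy training set: one example per class, made up for this sketch.
train = [
    (shape_features(rectangle(4, 4)), "square"),
    (shape_features(rectangle(2, 8)), "bar"),
]

# Unseen shapes of different sizes are classified by their features alone.
prediction_square = nearest_neighbour(train, shape_features(rectangle(5, 5)))
prediction_bar = nearest_neighbour(train, shape_features(rectangle(2, 10)))
print(prediction_square, prediction_bar)
```

The classifier never sees pixels, only feature tuples, which is why the same generic code could be reused with the seven seed features once they have been computed.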
The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, it lists 233 datasets). Both the Iris and the seeds datasets used in this chapter were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/.