Object recognition with feature engineering and shallow learning
Recognising objects in images using computer vision has been both a research task and an industrial reality for the last 60 years. While this may seem contradictory, we need to consider the different tasks that can be solved by image processing. On the one hand we have simple tasks such as detecting whether an object exists in an image with well-defined lighting conditions, well-chosen camera angles and a neutral background; on the other hand, we may need to detect whether it exists in an image with arbitrary background, lighting, etc. The first type of task is often encountered in the production industry and can be solved with traditional computer vision methods, while the second is more often encountered in mobile robotics, autonomous cars and other applications requiring interaction with an unstructured world. Needless to say, the second poses a much harder problem that has only recently become solvable using a mixture of deep learning and large datasets, which can often be quite expensive solutions.
This blog post is a continuation of a series of posts on using Sympathy for Data for image processing. Sympathy is an open-source tool for graphically programming data flows which lends itself well to quickly setting up and testing different image processing and classical machine learning algorithms, which we can use to classify objects in an industrial setting. We will show how to perform simple object recognition using only a modicum of feature engineering, a very small dataset and some simple machine learning algorithms. By doing feature engineering on the input data we obtain a high-precision training set that is sufficient for the machine learning algorithm to classify objects. This can be contrasted with the shotgun approach of deep learning, which requires vast datasets of training examples to solve the same task.
In the previous entry we started on an algorithm for automatically extracting objects from a top-down image of objects against a neutral background. These example objects consist of a mix of screws, washers and nuts on a conveyor belt, and we would ideally like to classify them in order to sort them in a later step.
The output from our previous step was a list containing the mask for each object found in the input image. We will continue from this step by using image processing to do feature engineering as a pre-processing step, before applying a simple machine learning algorithm to do the classification.
What is feature engineering?
Many simplistic approaches to object classification using machine learning feed raw pixel data to algorithms such as support vector machines, random forests or classical neural networks in order to solve tasks such as MNIST classification. While these approaches have been successful in small domains, such as the 28×28 pixel images of MNIST, it is much more problematic to classify arbitrarily sized objects in larger images, due in part to an explosion in the number of parameters in the models, which in turn requires very large datasets. We cannot reasonably train a model with fewer examples than it has free parameters.
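To put the parameter explosion in concrete numbers, here is a back-of-the-envelope sketch (the image sizes and layer width are illustrative, not from the original post):

```python
# Parameter count of a single fully connected layer fed raw pixels.
def dense_params(inputs, units):
    return inputs * units + units  # one weight per input per unit, plus biases

mnist = dense_params(28 * 28, 100)         # 28x28 grayscale input -> 78,500
larger = dense_params(512 * 512 * 3, 100)  # 512x512 RGB input -> ~78.6 million
```

A hundred hidden units on MNIST-sized input already costs tens of thousands of parameters; the same layer on a modest camera image costs tens of millions, far more than a small industrial dataset can support.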
Solutions to this problem include feature learning and feature engineering. While the former is within the purview of deep learning and out of scope for our solution, we can instead use the latter: classical image processing that extracts new features which enable the machine learning algorithm to work with the images.
A classical algorithm for pre-processing images before feeding them to machine learning algorithms is SIFT (PDF). The original algorithm, proposed in 1999, was considered by many to be a large step forward since it allows extracting features for points on a real-world object such that the points extracted from two different images of the same object are close in feature space, regardless of the scale (size) and rotation of the object. This allows us to compare the features of the same object across two different images.
Each feature consists of the XY position of a keypoint (e.g. a corner) as well as a multi-dimensional vector that describes that point in a way that is mostly invariant under different scale, rotation and lighting conditions. While this algorithm has been used by many for successful object recognition, it is not as often used today, due to being patented and due to the many newer alternative algorithms for extracting image features.
One good free alternative to SIFT (and the later SURF) is the ORB algorithm (PDF), which combines two other algorithms: the FAST keypoint detector and the BRIEF feature descriptor computed for each such point. We will base our solution on this algorithm.
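Sympathy wraps ORB extraction in a node, but the same step can be sketched in plain Python with scikit-image's ORB implementation (a sketch only; the node's exact defaults may differ, and the test image here is a stand-in):

```python
from skimage.data import camera
from skimage.feature import ORB

image = camera()              # bundled grayscale test image
orb = ORB(n_keypoints=50)     # limit the number of keypoints returned
orb.detect_and_extract(image)

keypoints = orb.keypoints     # (N, 2) array of row/col keypoint positions
descriptors = orb.descriptors # (N, 256) boolean array, i.e. f0 ... f255
```

The 256-bit binary descriptor is what shows up as the columns f0 … f255 in the feature table later in this post.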
Using ORB features for object recognition
Step one is to load an image containing only the object we want to detect, in this case an example with a number of screws. We give this image to the ORB feature extractor in the image statistics node. With the default arguments we get an output that contains a number of XY points (see table below) as well as a feature vector f0 … f255 describing each such point. We can draw a small circle around each XY point in order to see which points have been extracted from the image, and we see that we have a number of such points for each object in the image, at key locations such as the head and bottom of the screws.
Next we can train a one-class classification algorithm to match these features. Two options included in the default Sympathy installation are the isolation forest and the one-class support vector machine. We will use the former to create a machine learning model that matches the features present in the first image while rejecting all other features. Note that with only a single image with a few screws as a training example we are doing a very light and cheap form of machine learning, and should adjust our expectations for the end result accordingly.
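In scikit-learn terms, which Sympathy's machine learning nodes build on, the one-class setup looks roughly like this (the synthetic feature vectors are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 8))           # features from the training image
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 8)),  # features like the training set
                  rng.normal(8.0, 1.0, size=(5, 8))]) # clearly different features

model = IsolationForest(random_state=0).fit(train)
pred = model.predict(test)  # +1 = matches the training features, -1 = everything else
```

The isolation forest never sees negative examples; it simply learns what the training features look like and flags everything else as an outlier, which is exactly the one-class behaviour we want here.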
Before feeding the feature points to the classification algorithm we remove the XY coordinates using the select columns from table node. We use the fit node to train the isolation forest on the features from the training image, and the predict node to create a prediction for each of the features in a test image containing screws, washers and nuts.
The output of the predict node is a single column Y with value +1 for features that match the original features and value -1 for all other features. We can use this Y value to determine a color to be drawn on top of each feature in order to see how the model handles each feature in the test image.
As we can see in the images above, we have a large number of positive features (Y=1, white circles) for the screws in the image, and mostly negative features (Y=-1, black circles) for the washers and nuts. To make a final classification we just need to count the number of positive versus negative features for each object identified in the image. If the ratio of positive to negative features exceeds a threshold (e.g. 0.6), we classify the object as a screw.
We do this by creating a lambda sub-flow that takes two inputs. The first input should be the table with a y0 prediction for each feature, and the second an image mask. Note that on the main flow you can click on your lambda and select add input port to make these two ports visible, so you can give test inputs to them. Connecting the table with features/predictions as well as an input mask to the lambda allows you to test-run the lambda on these values while you are editing it.
Once we have our inputs to the lambda, we take a look at its contents. You can right-click on the lambda sub-flow and select edit, just like you would on a normal sub-flow. The first thing we do inside the lambda is to use morphology to extend the border around each object, since we want keypoints not only inside the objects but also along their borders.
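With scikit-image, extending the mask border can be sketched as follows (the structuring element shape and radius are assumptions; the node's settings may differ):

```python
import numpy as np
from skimage.morphology import binary_dilation, disk

mask = np.zeros((20, 20), dtype=bool)
mask[8:12, 8:12] = True               # a small object mask

grown = binary_dilation(mask, disk(2))  # extend the border by ~2 pixels
```

Dilation only ever grows the mask, so every pixel of the original object stays inside the extended one, while keypoints sitting just outside the object outline are now covered as well.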
After that it is a small matter of extracting the value of the mask (true/false) at each keypoint and summing the keypoints that have y0=1 and y0=-1 respectively in a calculator node. We do this by giving the XY coordinates of each keypoint to the Extract Image Data node, which gives a table with a single column ch0_values containing the mask value at the XY coordinate of each keypoint. Next we can use the following expression in the calculator node to compute the ratio of positive to negative features for each object:
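The exact calculator expression is not reproduced here, but based on the description that follows it amounts to roughly this in plain numpy (the sample column values are invented for illustration):

```python
import numpy as np

ch0_values = np.array([True, True, False, True, False])  # is the keypoint inside the mask?
y0 = np.array([1, -1, 1, 1, -1])                         # classifier prediction per keypoint

correct = np.sum(ch0_values * (y0 == 1))    # positive keypoints inside the mask
incorrect = np.sum(ch0_values * (y0 != 1))  # negative keypoints inside the mask
ratio = correct / (correct + incorrect)
```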
What this means is that we require a keypoint to be inside the mask, using the column ch0_values, and multiply that with a check of whether the column y0 has the value 1. The result is 1 only for points that had y0=1 and True in the input mask. The sum of all of these is the correct column, which gives the number of features predicted true by the classifier.
Similarly, doing the same but with the comparison y0 != 1 gives us the number of keypoints predicted false by the classifier.
The final step is to apply this lambda to the classified data and map it over each input mask, in order to get a classification for each object.
Note that we need to use apply first, with the table as input, since we only have one table that should be used for all invocations of the lambda. We use map second, since we have a list of input masks to check and want a list of outputs. Finally we can use the filter list predicate function to keep only the outputs with a sufficiently high score.
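The apply/map/filter combination above can be sketched in plain Python terms like this (the score function, coordinates and masks are invented stand-ins for the flow's data):

```python
import numpy as np

def score_object(xy, y0, mask):
    """Ratio of positive features among the keypoints inside one object mask."""
    inside = mask[xy[:, 1], xy[:, 0]]       # mask value at each keypoint (row = y, col = x)
    pos = np.sum(inside & (y0 == 1))
    neg = np.sum(inside & (y0 != 1))
    return pos / max(pos + neg, 1)

# One shared feature table (apply), and a list of per-object masks (map).
xy = np.array([[1, 1], [2, 2], [5, 5]])
y0 = np.array([1, 1, -1])
masks = [np.zeros((8, 8), dtype=bool) for _ in range(2)]
masks[0][:4, :4] = True   # contains the two positive keypoints
masks[1][4:, 4:] = True   # contains only the negative keypoint

scores = [score_object(xy, y0, m) for m in masks]      # "map" over the mask list
kept = [m for m, s in zip(masks, scores) if s >= 0.6]  # "filter" on the score
```

The single feature table is bound once and reused for every mask, which is exactly why apply comes before map in the flow.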
The final output is a list of all the objects that were classified as screws. Note that with the given threshold of 0.6 we miss two of the screws. You can experiment with different values of the threshold and with different parameters for the basic classifier (isolation forest) to get better results. You can also try the One class SVM node instead of the isolation forest.
We have shown how you can use the built-in nodes in Sympathy for Data to solve a simple image classification task, using a simple one-class classifier machine learning node with ORB features as a pre-processing step on the image. The final system can work with only a single example of the object to detect, albeit with a high misclassification rate. For better classifications, more training examples can be added and/or a different machine learning algorithm can be substituted for the isolation forest, while keeping ORB features as the feature engineering method.