David Knox - Principal, Data Science
In the second blog in our series on processing underwater video, I’ll discuss some of our data gathering and classification techniques, leading into our next blog post, which discusses semantic segmentation and object detection…
Annotating data is possibly the least exciting part of a machine learning project. It certainly doesn’t receive the hype accorded to new neural network architectures or the applications of AI. In supervised machine learning we require annotated data to learn from. Lots of data. At the start of a project where you’re “doing something cool with Machine Learning” there are some important questions to ask up front:
· What kind of data do we need?
· How much data do we need or how little data can we get away with?
· How do we get the data?
What data do we need?
For image classification, we need only a binary yes/no answer for each image: does the object or scene of interest appear in it? For video, we start by splitting the video into individual frames and classifying those.
For object detection, we need a bounding box around each instance of each object of interest within the image.
For semantic segmentation, we need to know the class of each pixel within the image.
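As a rough illustration of how the three annotation types differ, here is a minimal Python sketch of the records each one might produce. The class names, field names and the example file name are hypothetical and chosen only for this sketch; real tools use their own formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationLabel:
    image_id: str
    contains_object: bool  # one yes/no answer per image (or video frame)


@dataclass
class BoundingBox:
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


@dataclass
class DetectionLabel:
    image_id: str
    boxes: List[BoundingBox]  # one box per instance of each object of interest


@dataclass
class SegmentationLabel:
    image_id: str
    mask: List[List[int]]  # mask[row][col] holds the class index of that pixel


# Hypothetical annotations for the same 2x2 "image"
cls = ClassificationLabel("frame_0001.jpg", contains_object=True)
det = DetectionLabel("frame_0001.jpg", [BoundingBox("starfish", 1, 1, 2, 2)])
seg = SegmentationLabel("frame_0001.jpg", [[0, 0], [0, 1]])
```

The amount of information per image grows from a single value, to a handful of coordinates per object, to one value per pixel, which is why the later annotation types are slower to capture.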
How much data do we need?
If we’re experimenting with new algorithms or techniques using supervised learning, then we’ll want to use a well-known, well-studied and hopefully open dataset. However, if our project has a commercial goal then we’ll probably need to gather data that other people don’t have. That means collecting or buying data.
How do we get the data?
In previous projects, we’ve used Amazon Mechanical Turk, a crowd sourcing service through which we can get large amounts of imagery classified in a short time for a nominal sum. While this can be a great source of data, the accuracy of the crowd can vary, and a good deal of effort is required to QA and assess the results.
There are tricks, however, to make this job simpler. One is to include a sample of images with known labels, which helps identify workers with lower accuracy so that their work can be excluded from the final set.
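The known-image check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (worker, image, label) tuple layout and the 0.8 accuracy cut-off are all hypothetical, not values from our projects.

```python
from collections import defaultdict


def worker_accuracy(answers, gold):
    """Accuracy of each worker on the images whose true label is known.

    answers: list of (worker_id, image_id, label) tuples
    gold:    dict mapping image_id -> known label
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, image_id, label in answers:
        if image_id in gold:
            seen[worker_id] += 1
            if label == gold[image_id]:
                correct[worker_id] += 1
    return {w: correct[w] / seen[w] for w in seen}


def filter_answers(answers, gold, min_accuracy=0.8):
    """Keep answers on unknown images from workers who pass the check.

    Workers who never saw a known image cannot be assessed, so this
    sketch drops their answers as well (an assumption, not a rule).
    """
    accuracy = worker_accuracy(answers, gold)
    trusted = {w for w, acc in accuracy.items() if acc >= min_accuracy}
    return [
        (w, image_id, label)
        for w, image_id, label in answers
        if w in trusted and image_id not in gold
    ]
```

For example, a worker who gets every known image right keeps their answers, while one who gets half of them wrong is excluded.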
Another approach is to include options for contradictory classes. If an image is tagged with contradictory categories, such as “Deep Sea” and “Beach” then we can exclude that answer.
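A minimal sketch of the contradictory-class check, assuming each response is a dictionary with a "tags" list. That layout and the function names are hypothetical; the "Deep Sea" / "Beach" pair comes from the example above.

```python
# Pairs of tags that should never appear together on one image.
CONTRADICTORY = [frozenset({"Deep Sea", "Beach"})]


def is_contradictory(tags, contradictory_pairs=CONTRADICTORY):
    """True if the tag set contains both members of any contradictory pair."""
    tag_set = set(tags)
    return any(pair <= tag_set for pair in contradictory_pairs)


def drop_contradictory(responses):
    """Remove responses whose tags contradict each other."""
    return [r for r in responses if not is_contradictory(r["tags"])]


responses = [
    {"image_id": "img1", "tags": ["Deep Sea", "Diver"]},
    {"image_id": "img2", "tags": ["Deep Sea", "Beach"]},
]
kept = drop_contradictory(responses)
```

Here only the response for img1 survives, because img2 was tagged with both halves of a contradictory pair.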
So, while crowd sourcing can be good for gathering large amounts of data quickly, it invariably requires a lot of QA work and upfront design.
Using internal staff or an outsource provider is often a better option but naturally comes at a greater cost and still requires some level of QA and upfront design.
Fortunately, by using transfer learning and fine-tuning pre-trained models, we can often achieve great results from a few thousand images of each class rather than tens of thousands.
This leads to our most relied on approach for labelling new data for classification problems… hand labelling using a simple web form. We use PHP forms backed by a simple database.
Let’s peel away the veneer of glamour covering data science…
Here’s an example form we’ve used to build a starfish classifier:
We find that with simple binary choice forms like this, we can comfortably get through 1 image per second for easy cases and maybe 2 seconds for harder cases, easily getting through a few thousand images in an hour. While not exciting work, we can get a good feel for the imagery and context while working through them, often spotting challenges in the data before we even begin modelling.
This approach is all very well if our images fall evenly between the positive and negative classes. However, if, as in the example above, most images don’t contain a starfish, it will be slow going to find enough starfish images to train our neural network to detect starfish.
A shortcut is to train a weaker classifier with relatively few training images first and then use this model to identify candidate images for our next round of manual classification. This approach is called “Active Learning” and can be a very effective way to build up a training set of data provided we take some precautions. There is a danger with active learning that we will reinforce some bias present in the model by selecting those images that most agree with the model rather than more informative examples.
When we take an active learning approach – which we do for most image classification data sets – with each pass we randomly select 1000 of the positive images, 1000 of the negative images and the 1000 most in-between images. However, for validation we stick to the data gathered in the original round of manual annotation and not the active-learning contributed images. This way we can monitor for any decrease in accuracy due to biased training data.
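The selection step in each pass could look something like the sketch below. It assumes the weaker classifier gives each unlabelled image a score between 0 and 1, that 0.5 separates positive from negative, and that "most in-between" means closest to 0.5. The function name, the threshold and the choice to keep the three groups disjoint are assumptions made for this sketch.

```python
import random


def select_for_labelling(scores, n_each=1000, threshold=0.5, seed=0):
    """Pick the next batch of images to label by hand.

    scores: dict mapping image_id -> predicted probability of the positive class
    """
    rng = random.Random(seed)

    # The most "in-between" images are those scored closest to the threshold.
    by_uncertainty = sorted(scores, key=lambda i: abs(scores[i] - threshold))
    uncertain = by_uncertainty[:n_each]
    taken = set(uncertain)

    # Random samples of confident positives and negatives, excluding the
    # images already chosen as uncertain so that no image is selected twice.
    positives = [i for i in scores if scores[i] >= threshold and i not in taken]
    negatives = [i for i in scores if scores[i] < threshold and i not in taken]

    return {
        "positive": rng.sample(positives, min(n_each, len(positives))),
        "negative": rng.sample(negatives, min(n_each, len(negatives))),
        "uncertain": uncertain,
    }
```

Sampling the positives and negatives at random, rather than taking the most confident ones, is one way to avoid feeding the model only the images it already agrees with. The validation set stays outside this loop entirely.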
Moving on from classification data: image segmentation and object detection data take longer to capture manually, and capturing them is a more skilled job. We can use the image classifier to identify candidate images (or frames of video) to feed to our annotation process.
In the case of image segmentation there is a shortcut that sometimes works but often doesn’t. Extracting a high-level feature map for a particular class gives us a heat-map for that class that can help locate an object or group of objects within an image.
For some types of objects, with careful network design and a bit of luck, this heat-map can approximate a segmentation mask. Good candidates are fairly simple, amorphous objects and those that can be detected largely from pixel values, such as water bodies in aerial images.
In the following example, we show heat-maps extracted from a classifier looking for divers in underwater images.
The classifier output is a global max pooling layer following a one-class feature map. The network is a simple Convolutional Neural Network (CNN) with down-sampling and residual blocks but no up-sampling or deconvolution. When we run the model, we extract the greyscale image from the heat map layer.
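To make the relationship between the heat map and the classifier output concrete, here is a small framework-free sketch. It treats the one-class feature map as a 2-D list of activations; the function names and the linear rescaling to 0..255 are assumptions for illustration, not the exact code we run.

```python
def global_max_pool(feature_map):
    """Classifier score: the single largest activation in the feature map."""
    return max(max(row) for row in feature_map)


def to_greyscale(feature_map):
    """Linearly rescale activations to 0..255 so they can be viewed as an image."""
    lo = min(min(row) for row in feature_map)
    hi = max(max(row) for row in feature_map)
    span = hi - lo
    if span == 0:
        # A flat map carries no spatial information; show it as black.
        return [[0 for _ in row] for row in feature_map]
    return [[round(255 * (v - lo) / span) for v in row] for row in feature_map]


feature_map = [
    [0.0, 1.0],
    [5.0, 2.0],
]
score = global_max_pool(feature_map)   # 5.0
heatmap = to_greyscale(feature_map)    # [[0, 51], [255, 102]]
```

Because the score depends only on the strongest location, the network is rewarded for lighting up somewhere on the object, not for outlining all of it, which is one reason such heat maps tend to be blobs rather than masks.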
Figure 3 shows the diver detection heatmap.
Figure 4 shows a heatmap from an encoder-decoder style CNN with reduction of spatial extent followed by increase in spatial extent.
As you can see, neither of these heatmaps generates a good segmentation mask. The first network produces a blob in approximately the right place and the second network produces something that looks interesting but quite different from a segmentation mask. Still, both are better than random.
Segmentation Mask and Object Detection Data
However, heatmaps like this can be useful for two things: roughly locating an object in an image (for example, the centre of mass of a heatmap can approximate the centroid of an object) and pre-training a segmentation model. Using a pre-trained model such as this should cut down on the number of segmentation masks that must be created manually to get good results from our trained model.
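The centre-of-mass idea can be sketched as follows, assuming the heatmap is a 2-D list of non-negative intensities. The function name is hypothetical, and the result is in (row, column) pixel coordinates.

```python
def centre_of_mass(heatmap):
    """Intensity-weighted mean position of a heatmap, as (row, col).

    Assumes non-negative values; returns None for an all-zero heatmap.
    """
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return None
    row_centre = sum(r * sum(row) for r, row in enumerate(heatmap)) / total
    col_centre = sum(
        c * value for row in heatmap for c, value in enumerate(row)
    ) / total
    return row_centre, col_centre


heatmap = [
    [0, 0, 0],
    [0, 0, 2],
    [0, 0, 2],
]
location = centre_of_mass(heatmap)  # (1.5, 2.0)
```

With several objects in the frame the centre of mass lands somewhere between them, so this is only a rough locator for images containing a single object or a tight group.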
This leads to the problem of how to manually capture segmentation masks. Having spent many frustrating hours trying to build our own tool or create masks in an image editing program, we found the excellent “Computer Vision Annotation Tool”: https://github.com/opencv/cvat
This tool lets us annotate an image much more easily with bounding boxes (for object detection) or polygons (for segmentation). For video, it’s even better. We can annotate key frames and the tool will interpolate the annotations for intermediate frames.
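To show what key-frame interpolation means in practice, here is a sketch that linearly interpolates a bounding box between two annotated frames. It is an illustration of the general idea only: the function name and data layout are hypothetical, and the tool’s own interpolation may work differently.

```python
def interpolate_box(frame, key_a, key_b):
    """Linearly interpolate a bounding box between two key frames.

    key_a, key_b: (frame_index, (x_min, y_min, x_max, y_max)), with key_a first
    frame:        index of an intermediate frame between the two key frames
    """
    frame_a, box_a = key_a
    frame_b, box_b = key_b
    if frame_a >= frame_b or not frame_a <= frame <= frame_b:
        raise ValueError("frame must lie between two distinct, ordered key frames")
    t = (frame - frame_a) / (frame_b - frame_a)  # 0.0 at key_a, 1.0 at key_b
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))


# A box drawn at frame 0 and again at frame 10, having moved 20 pixels right.
box = interpolate_box(5, (0, (0, 0, 10, 10)), (10, (20, 0, 30, 10)))
# box is (10.0, 0.0, 20.0, 10.0): halfway between the two key frames
```

The annotator then only needs to correct the frames where the object’s motion departs from a straight line, instead of drawing a box on every frame.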