Phil Woods, Principal Analytics & Visualisation
In this blog, we look at two training methods within the “supervised” camp of machine learning and the effect of bias on our training data.
The largest hurdle we face with machine learning is training data. It’s not just the amount of training data that’s difficult to overcome but, equally, its quality. It’s a big problem, and one that can be significantly compounded by “biases” in the data.
As a somewhat hyperbolic example of bias and a training data problem: if you were to Google “great hair” and look at the images, you’d notice that most of the results are images of attractive white females, despite the lack of any gender- or race-specific keywords in the query.
This is due to biases within Google’s search algorithms, which have been trained through use to implicitly correlate these attributes, race and gender, with the term “great hair”. This problem of bias is unavoidable in socially trained AIs, like Google’s, but is also difficult to manage in general within the applied Machine Learning domain.
For a more practical example, related to my previous blog, imagine the extraction of building features from aerial imagery. To train a model on this process, we needed to provide the AI with a representative amount of hand-captured building features from the source imagery. In this instance, we used something like x% of the whole area and a representative cross-section of building types from across the area of interest.
With this training method, known as “supervised learning”, managing bias can be difficult because the number of permutations, and the contexts the buildings sit in, are highly variable. We must accept some level of error and bias; an understanding of where humans are required to mitigate and rectify the effects is therefore very important.
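One common way to draw a “representative cross-section” of training data is stratified sampling: take roughly the same share from every building type, so that rare types are not lost. The sketch below is a minimal illustration, not the process we actually used. The tile records, the `building_type` field and the `fraction` value are all hypothetical placeholders.

```python
import random
from collections import defaultdict

def stratified_sample(tiles, fraction, seed=0):
    """Pick roughly `fraction` of the tiles from every building type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for tile in tiles:
        by_type[tile["building_type"]].append(tile)
    sample = []
    for group in by_type.values():
        # Keep at least one tile per type so rare types are not dropped.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical inventory of image tiles, tagged by dominant building type.
tiles = (
    [{"id": f"res-{i}", "building_type": "residential"} for i in range(90)]
    + [{"id": f"ind-{i}", "building_type": "industrial"} for i in range(8)]
    + [{"id": f"her-{i}", "building_type": "heritage"} for i in range(2)]
)

# `fraction` is a placeholder value chosen only for this illustration.
training_tiles = stratified_sample(tiles, fraction=0.1)
print(len(training_tiles), "tiles selected for hand capture")
```

A plain random sample of the same size could easily miss the two heritage tiles altogether, which is exactly the kind of gap that shows up later as bias in the model.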
When we first began capturing buildings using Machine Learning, we thought we could teach the AI to map through areas of buildings encroached by vegetation, in the same way we could teach a human operator to. At first our results were encouraging, but we soon noticed an unwanted side effect. By capturing training features that included the areas of vegetation encroachment, we had introduced a “bias” that led the AI to start capturing random patches of vegetation as buildings, presumably because it had inferred that certain textures and colours were common to both buildings and vegetation.
Another training approach, less useful for well-defined feature capture but great for object detection and classification, is Active Learning (or semi-supervised learning). It is a pool-based training method which begins with a collection (or pool) of unlabelled data. The trainer works their way through the data, labelling it until the model being trained starts to yield confidence scores that indicate it is suitably accurate.
This makes it a great method where there is little or no training data: it orders the worst “guesses” to the top of the pool, so the trainer is always working towards the smallest amount of labelled data required to produce an accurate and relatively unbiased model.
Care should still be taken, if nothing else to avoid frustration, but this approach helps to mitigate bias by ensuring it is “smoothed” out of the model eventually.
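The pool-based loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production implementation: the “model” is a single threshold on one-dimensional data, the confidence score is simply the distance from that threshold, and `oracle` is a hypothetical stand-in for the human trainer.

```python
def fit_threshold(labelled):
    """Toy 1-D model: midpoint between the largest 0 and the smallest 1.

    Assumes the labelled set already contains at least one item per class.
    """
    zeros = [x for x, y in labelled if y == 0]
    ones = [x for x, y in labelled if y == 1]
    return (max(zeros) + min(ones)) / 2

def confidence(x, threshold):
    """Toy confidence score: distance from the decision boundary."""
    return abs(x - threshold)

def active_learning(pool, oracle, seed_items, min_confidence, max_labels):
    """Pool-based loop: always ask the trainer about the worst guess."""
    labelled = [(x, oracle(x)) for x in seed_items]
    unlabelled = [x for x in pool if x not in seed_items]
    threshold = fit_threshold(labelled)
    while unlabelled and len(labelled) < max_labels:
        # Order the pool so the least confident item comes first.
        unlabelled.sort(key=lambda x: confidence(x, threshold))
        if confidence(unlabelled[0], threshold) >= min_confidence:
            break  # even the worst guess is now confident enough
        query = unlabelled.pop(0)
        labelled.append((query, oracle(query)))  # the trainer labels it
        threshold = fit_threshold(labelled)
    return threshold, labelled

# Hypothetical data: 200 evenly spaced values, true class boundary at 0.5.
pool = [i / 200 for i in range(200)]
oracle = lambda x: int(x >= 0.5)  # stands in for the human trainer

threshold, labelled = active_learning(
    pool, oracle, seed_items=[min(pool), max(pool)],
    min_confidence=0.01, max_labels=50,
)
print(f"labelled {len(labelled)} of {len(pool)} items")
```

Because each query goes to the item the model is least sure about, the trainer labels only a small part of the pool before the loop stops. The stopping values (`min_confidence`, `max_labels`) are arbitrary choices for the illustration.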
The two methods don’t produce the same results though, and which method is deployed comes down to the objective.
The first, the more traditional form of supervised learning, requires input data that exactly matches the desired output. Active Learning, however, is better suited to classification and object detection, as it was developed using label classes rather than label features.