Tolerant Machine Learning for Deficient Training Data
While supervised machine learning methods have shown great potential for automating time-consuming manual data classification tasks, their application is often hindered by deficiencies in training datasets that degrade model accuracy. This thesis proposes tolerant machine learning methods that can be applied despite common training data deficiencies. By enabling users to derive value proportional to data quality, tolerant machine learning makes exploring new machine learning applications more cost-effective.
The first training data deficiency addressed by this thesis is a lack of class labels for training data, including when the set of possible classes is unknown. Prior work in the weak supervision paradigm of data programming has sought to provide large quantities of data labels through user-defined heuristic labelling functions. Despite the development of methods to assist users in defining such functions, users must still have a small labelled dataset or at least upfront knowledge of the set of possible classes. The Witan algorithm proposed in this thesis can suggest labelling functions without any initial supervision, enabling the user to discover classes progressively. Experiments with binary and multi-class datasets demonstrate Witan's competitive efficiency and accuracy compared to alternative labelling methods, despite its lack of supervision.
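To make the data programming setting concrete, the following is a minimal illustrative sketch of heuristic labelling functions combined by majority vote. It shows the paradigm that Witan builds on, not the Witan algorithm itself; the class names and heuristics are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labelling function may decline to vote on an instance

def lf_contains_refund(text):
    # Hypothetical heuristic: refund requests suggest a "complaint" class.
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_contains_thanks(text):
    # Hypothetical heuristic: expressions of thanks suggest a "praise" class.
    return "praise" if "thank" in text.lower() else ABSTAIN

def majority_vote(text, labelling_functions):
    """Aggregate labelling-function votes; return None if all abstain."""
    votes = [lf(text) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

lfs = [lf_contains_refund, lf_contains_thanks]
print(majority_vote("I want a refund now", lfs))  # "complaint"
print(majority_vote("See you tomorrow", lfs))     # None (all abstain)
```

In practice, data programming systems replace the simple majority vote with a generative model of labelling-function accuracies; Witan's contribution is suggesting such functions without any initial labels.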
The second training data deficiency addressed by this thesis is the problem of dataset shift, where the data distribution of the training dataset differs from that of a target population. Dataset shift is a challenging yet expected problem when estimating the prevalences of classes in different target samples, a task known as quantification. Existing quantification methods make strong assumptions about the nature of dataset shift that may not hold in practice. This thesis proposes a Gain-Some-Lose-Some (GSLS) quantification model that is experimentally demonstrated to provide more reliable quantification prediction intervals than existing methods under more general conditions of shift. GSLS is integrated into a decision tree for dynamically selecting an appropriate quantification method for a given target sample. Selection by a Kolmogorov-Smirnov test for any shift followed by a newly proposed "Adjusted Kolmogorov-Smirnov" test for non-prior shift is found to best balance quantification and runtime performance. Additionally, a framework is presented for constraining quantification prediction intervals to user-specified limits by requesting class labels from the user for smaller sets of instances than would be required with rejection of classifications based on confidence scores alone.
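For readers unfamiliar with quantification, the following sketch shows a standard baseline, Adjusted Classify-and-Count, rather than the GSLS model proposed here. Naive classify-and-count is biased under prior shift; the adjusted variant corrects the estimate using the classifier's true and false positive rates measured on held-out source data (the example rates are illustrative).

```python
def classify_and_count(predictions):
    """Naive prevalence estimate: fraction of instances predicted positive."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Adjusted Classify-and-Count: correct the naive estimate using
    tpr/fpr from source data. Assumes prior shift only, i.e. p(x|y)
    is unchanged between source and target."""
    cc = classify_and_count(predictions)
    if tpr == fpr:
        return cc  # correction undefined; fall back to the naive estimate
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))

# Example: 50% of target instances are predicted positive by a classifier
# with tpr = 0.8 and fpr = 0.2 on held-out source data.
preds = [1, 0, 1, 0, 1, 0, 1, 0]
print(adjusted_count(preds, tpr=0.8, fpr=0.2))  # (0.5 - 0.2) / (0.8 - 0.2) = 0.5
```

This baseline's reliance on the prior-shift assumption is exactly the kind of restriction that motivates GSLS's more general treatment of shift.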
The third and final training data deficiency addressed by this thesis is the problem of class noise. When a class noise process distorts the relationships between input features and class labels or true class values, a classifier should reject instances for which it cannot provide a confident classification. This thesis demonstrates how the popular model-agnostic confidence-thresholding rejection method does not leverage relationships between input features and class noise. A novel model-agnostic null-labelling rejection method is proposed to learn such relationships, and an experimental evaluation demonstrates its ability to achieve a significantly better trade-off between classification error and the rate of rejection under an evaluation framework that unifies prior theories for combining rejecting classifiers. Additionally, null-labelling is demonstrated to enable users to understand relationships between input features and regions of class noise, allowing them to identify and potentially address sources of noise.
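The confidence-thresholding baseline that null-labelling is compared against can be sketched in a few lines: an instance is rejected whenever the classifier's top class probability falls below a user-chosen threshold. The class names and threshold below are illustrative.

```python
def classify_with_rejection(probabilities, threshold=0.7):
    """probabilities: dict mapping class label -> predicted probability.
    Return the top label, or None (reject) when confidence is too low."""
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else None

print(classify_with_rejection({"cat": 0.9, "dog": 0.1}))    # "cat"
print(classify_with_rejection({"cat": 0.55, "dog": 0.45}))  # None (rejected)
```

Note that this rule depends only on the model's output probabilities; it cannot exploit any relationship between the input features themselves and where class noise occurs, which is the gap the proposed null-labelling method addresses.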
Finally, a software architecture is presented that combines all of the presented methods for addressing training data deficiencies. The architecture demonstrates a tolerant machine learning system with minimal data quality requirements, allowing users to start deriving value from their data with less upfront effort.