Semi-Unsupervised Learning: Clustering and Classifying using Ultra-Sparse Labels
Willetts M., Roberts S., Holmes C.
In semi-supervised learning for classification, it is assumed that every ground-truth class of data is present in the small labelled dataset. In many real-world sparsely-labelled datasets, however, not all ground-truth classes may be captured in the labelled dataset: a biased data collection process could result in some classes of data being found only in the unlabelled dataset. We call this regime 'semi-unsupervised learning', an extreme case of semi-supervised learning in which some classes have no labelled exemplars. First, we outline the pitfalls associated with applying deep generative model (DGM)-based semi-supervised learning algorithms to datasets of this type. We then show how a combination of clustering and semi-supervised learning, using DGMs, can be brought to bear on this problem. We study several different datasets, showing how one can still learn effectively when half of the ground-truth classes are entirely unlabelled and the other half are sparsely labelled.