Machine learning algorithms often rely on a very large quantity of data that needs to be carefully annotated by human experts. For instance, given a large collection of images, the role of the annotator is to indicate which elements are present in each photo. Data annotation and labeling is a long, tedious and costly process.

Unsupervised classification

Unsupervised classification relies on a clustering algorithm to produce the labels for a dataset. The unsupervised nature of clustering fits the absence of labels in the dataset, but it also comes with a difficulty: the annotator must find the algorithm and the hyperparameters that produce the labels they have in mind. This task is challenging, especially for a user who is not a machine learning expert.
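To make the hyperparameter difficulty concrete, here is a minimal sketch, assuming a plain k-means on scalar pixel intensities. The function `kmeans_1d`, its deterministic initialisation and the toy `pixels` list are illustrative assumptions, not the methods studied in this project.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar values (e.g. pixel intensities) into k groups."""
    lo, hi = min(values), max(values)
    # Deterministic initialisation (an assumption of this sketch):
    # centroids are spread evenly over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Toy "image": three dark, three mid-grey and three bright pixels.
pixels = [10, 12, 11, 120, 125, 118, 240, 245, 238]

labels_k2, _ = kmeans_1d(pixels, k=2)
labels_k3, _ = kmeans_1d(pixels, k=3)
print(labels_k2)
print(labels_k3)
```

On this toy input, k=2 merges the dark and mid-grey pixels into one segment, while k=3 separates them. Neither result is wrong in itself; which one is useful depends on the segmentation the annotator has in mind, and that is exactly the choice a non-expert user should not have to make by tuning k.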

We explore solutions to make the underlying machine learning task invisible to the user, while guaranteeing alignment of the results with the user's goals. We are testing these methods on image segmentation tasks. 

Single positive multi-label classification

Annotations are often partial: even when an instance is inherently associated with multiple labels (for instance, a photo containing both a cat and a dog), it is common that only one of them is given in the dataset (for instance, only the dog is indicated, which makes the AI "believe" that there is no cat). This problem is usually addressed by predicting the missing labels or by adapting the loss function. As an alternative, we explore the possibility of incorporating additional information to guide the classification, for instance statistical data about the distribution of labels, or semantic information in the form of logical constraints on the labels (e.g. hierarchical constraints such as “a dog is an animal” or impossibility constraints such as “a photo cannot be taken simultaneously inside and outside”).
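As an illustration of how logical constraints can add information to a single positive annotation, here is a minimal sketch. The constraint tables `IMPLIES` and `EXCLUDES`, the label list and the function `complete_labels` are hypothetical, and the sketch only propagates constraints over the given labels; it is not the classification approach explored in this project.

```python
# Hypothetical constraint sets, written here only for illustration.
IMPLIES = {"dog": {"animal"}, "cat": {"animal"}}  # hierarchical: label -> implied labels
EXCLUDES = {("inside", "outside")}                # impossibility: cannot both hold


def complete_labels(observed_positive, all_labels):
    """Return (positives, negatives, unknown) after applying the constraints."""
    positives = set(observed_positive)
    # Propagate hierarchical constraints until nothing new is added.
    changed = True
    while changed:
        changed = False
        for label in list(positives):
            new = IMPLIES.get(label, set()) - positives
            if new:
                positives |= new
                changed = True
    # A label excluded by a known positive must be negative.
    negatives = set()
    for a, b in EXCLUDES:
        if a in positives:
            negatives.add(b)
        if b in positives:
            negatives.add(a)
    # Everything else stays unknown rather than being assumed absent.
    unknown = set(all_labels) - positives - negatives
    return positives, negatives, unknown


LABELS = ["dog", "cat", "animal", "inside", "outside"]
pos, neg, unk = complete_labels({"dog"}, LABELS)
pos2, neg2, unk2 = complete_labels({"dog", "inside"}, LABELS)
print(sorted(pos), sorted(neg), sorted(unk))
print(sorted(pos2), sorted(neg2), sorted(unk2))
```

With only "dog" annotated, the hierarchy adds "animal" as a positive, while "cat" remains unknown instead of being treated as absent. Adding "inside" lets the impossibility constraint mark "outside" as a negative.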