
The method shown in the section on semi-supervised learning seems clever but may be dangerous to use. Imagine an incomplete dataset with 3 classes (A, B, C) for which we have a few examples with known labels A and B, but none with label C. The iterative method will identify unlabeled items of classes A and B, but it cannot work for class-C items. The model only predicts probabilities for classes A and B (and with a softmax classification layer those probabilities sum to 100%), so every C-class item is assigned to whichever of A or B scores higher. If we continue the same process, class C will be absorbed into A and/or B.
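A minimal numpy sketch of this failure mode, with hypothetical logits for a class-C item that a 2-class (A, B) model has never seen: the softmax must still sum to 1, so one of the known classes always "wins", often confidently.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits the 2-class model might produce for a C-class
# item: neither logit should dominate, yet the probabilities are
# forced to sum to 1, so the item gets pseudo-labeled A or B anyway.
logits_for_c_item = np.array([1.2, 0.7])  # [A, B]
probs = softmax(logits_for_c_item)

print(probs.sum())    # 1.0 -- no probability mass left for class C
print(probs.argmax()) # 0   -- the C item is pseudo-labeled as A
```

Iterating pseudo-labeling with these confident-but-wrong labels is exactly what merges class C into A and B.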

Data clustering could be used to estimate an approximate number of classes. Say we have 4 classes: A, B, and 2 unknown classes Z1 and Z2. We can create a new dataset where all known A elements go into group A, the same for B elements, and where we put a similar number of unlabeled items into groups Z1 and Z2. Next we train a 4-class model on this dataset.
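A sketch of how such a 4-class dataset might be assembled, using a minimal hand-rolled k-means on toy 2-D data (the data, the cluster geometry, and the label encoding A=0, B=1, Z1=2, Z2=3 are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Minimal k-means: returns a cluster index for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy setup: labeled A and B clusters, plus an unlabeled pool that
# in reality contains two unknown groups (the future Z1 and Z2).
A = rng.normal([0, 0], 0.3, size=(20, 2))
B = rng.normal([4, 0], 0.3, size=(20, 2))
unlabeled = np.vstack([
    rng.normal([0, 4], 0.3, size=(20, 2)),
    rng.normal([4, 4], 0.3, size=(20, 2)),
])

# Split the unlabeled pool into 2 provisional classes Z1 / Z2.
z_labels = kmeans(unlabeled, k=2)

# Assemble the 4-class training set: A=0, B=1, Z1=2, Z2=3.
X = np.vstack([A, B, unlabeled])
y = np.concatenate([np.zeros(20), np.ones(20), z_labels + 2]).astype(int)
```

A 4-class model trained on `(X, y)` then plays the role described above.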

We expect the classification results for classes A and B to be better than those for Z1 and Z2. Now look at the misclassifications of Z1 elements: those allocated with high probability to class A or B can be relabeled accordingly. We do the same for elements in class Z2. The training process then restarts, and the relabeling continues for as long as desired.
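One relabeling pass could look like the following sketch, with hypothetical softmax outputs of the 4-class model for five items currently labeled Z1 and an assumed confidence threshold of 0.8:

```python
import numpy as np

# Hypothetical model probabilities (columns: A, B, Z1, Z2) for
# five items currently carrying the provisional label Z1.
probs = np.array([
    [0.92, 0.03, 0.03, 0.02],  # confidently A  -> relabel as A
    [0.05, 0.88, 0.04, 0.03],  # confidently B  -> relabel as B
    [0.10, 0.15, 0.70, 0.05],  # still looks like Z1 -> keep
    [0.40, 0.35, 0.15, 0.10],  # ambiguous -> keep as Z1
    [0.02, 0.95, 0.02, 0.01],  # confidently B  -> relabel as B
])

A_CLS, B_CLS, Z1, THRESHOLD = 0, 1, 2, 0.8
labels = np.full(len(probs), Z1)

pred = probs.argmax(axis=1)
confident = probs.max(axis=1) >= THRESHOLD
# Only move items that are confidently predicted as a *known* class.
move = confident & np.isin(pred, [A_CLS, B_CLS])
labels[move] = pred[move]

print(labels)  # -> [0 1 2 2 1]
```

The same pass is applied to Z2 elements before retraining.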

We can add new unknown Z1 and Z2 elements to the dataset and redo the training-relabeling process.

At some point, if necessary, we could inspect the data in the Z1 and Z2 classes and figure out what they contain. If they are easy to identify (e.g., images of horses and whales), we can use active learning and ask an expert to resolve the very difficult cases where the probabilities of 2 or more classes are similar. This will refine the boundaries between classes.
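Selecting those "2 or more classes are similar" cases is a standard margin-sampling query; a small sketch with made-up probabilities and an assumed margin cutoff of 0.1:

```python
import numpy as np

def margin_scores(probs):
    """Gap between the two highest class probabilities per row;
    a small gap marks a hard case worth sending to an expert."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Hypothetical 4-class softmax outputs for three items.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # easy: one clear winner
    [0.48, 0.46, 0.03, 0.03],  # hard: two classes nearly tied
    [0.34, 0.33, 0.32, 0.01],  # hard: three-way ambiguity
])

margins = margin_scores(probs)
to_expert = np.where(margins < 0.1)[0]
print(to_expert)  # -> [1 2], the ambiguous items near a boundary
```

Expert labels on exactly these items are what sharpen the class boundaries.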

This is obviously not foolproof, but it is worth a try.
