Semi-supervised scenario
A typical semi-supervised scenario is not very different from a supervised one. Let's suppose we have a data generating process, p_data:

p_data(x, y)
However, contrary to a supervised approach, we have only a limited number N of samples drawn from p_data and provided with a label, as follows:

X_l = {(x_i, y_i)}, with i = 1, 2, ..., N and (x_i, y_i) ~ p_data(x, y)
Instead, we have a larger amount (M) of unlabeled samples drawn from the marginal distribution p(x):

X_u = {x_j}, with j = 1, 2, ..., M and x_j ~ p(x) = Σ_y p_data(x, y)
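To make the two datasets concrete, the following minimal sketch builds X_l and X_u for a hypothetical generating process (two Gaussian blobs, one per class; the generator and all sizes are illustrative assumptions, not part of the original text). Note that the unlabeled set discards the labels, so it only exposes the marginal p(x):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical generating process p_data(x, y): two 2D Gaussian blobs,
# class y = 0 centered at (-1, -1), class y = 1 centered at (+1, +1)
def sample_pdata(n):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=(2.0 * y - 1.0)[:, None], scale=0.8, size=(n, 2))
    return x, y

N, M = 50, 5000                 # M >> N: the typical semi-supervised setting

X_l, y_l = sample_pdata(N)      # labeled set X_l, drawn from p_data(x, y)
X_u, _ = sample_pdata(M)        # unlabeled set X_u: labels are thrown away,
                                # so only the marginal p(x) is observed

print(X_l.shape, y_l.shape, X_u.shape)  # (50, 2) (50,) (5000, 2)
```

The same structure is used by most semi-supervised libraries; for example, scikit-learn conventionally marks the unlabeled portion with the placeholder label -1 when both sets are passed to an estimator.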
In general, there are no restrictions on the values of N and M; however, a semi-supervised problem arises when the number of unlabeled samples is much higher than the number of complete (labeled) samples. If we can draw N >> M labeled samples from p_data, it is probably useless to keep working with semi-supervised approaches, and preferring classical supervised methods is likely to be the best choice. The extra complexity of a semi-supervised method is instead justified when M >> N, which is a common condition in all those situations where the amount of available unlabeled data is large while the number of correctly labeled samples is considerably lower. For example, we can easily access millions of free images, but detailed labeled datasets are expensive and cover only a limited subset of possibilities.

However, is it always possible to apply semi-supervised learning to improve our models? Unfortunately, no. As a rule of thumb, if the knowledge of X_u increases our knowledge about the prior distribution p(x), a semi-supervised algorithm is likely to perform better than a purely supervised counterpart (which is limited to X_l). On the other hand, if the unlabeled samples are drawn from a different distribution, the final result can be considerably worse. In real cases, it is often hard to decide in advance whether a semi-supervised algorithm is the best choice; therefore, cross-validation and direct comparison with supervised baselines are the best practices when evaluating a scenario.
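The rule of thumb above can be illustrated with a deliberately simple numpy sketch (all sizes and the Gaussian p(x) are assumptions for illustration): we estimate a property of the marginal p(x), here its mean, first using only the N labeled inputs and then pooling them with the M >> N unlabeled ones drawn from the same distribution. The unlabeled samples carry no label information, yet they sharpen the estimate of p(x) considerably:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, TRIALS = 10, 1000, 500    # N labeled, M unlabeled, averaged over trials

err_labeled, err_pooled = 0.0, 0.0
for _ in range(TRIALS):
    x_l = rng.normal(0.0, 1.0, N)   # inputs of the labeled set X_l
    x_u = rng.normal(0.0, 1.0, M)   # unlabeled set X_u, same p(x)

    # absolute error of the estimated mean of p(x) (true mean is 0)
    err_labeled += abs(x_l.mean())
    err_pooled += abs(np.concatenate([x_l, x_u]).mean())

print("labeled only:", err_labeled / TRIALS)
print("labeled + unlabeled:", err_pooled / TRIALS)
```

If x_u were instead drawn from a shifted distribution (a different p(x)), pooling would bias the estimate and make it worse than the labeled-only one, which is exactly the failure mode described above and the reason cross-validated comparison with a supervised baseline is always advisable.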