Class-prior estimation in a semi-supervised setup

Problem formulation

In this problem, we have a fully labeled training dataset:

\displaystyle \left\{ (x_i, y_i)\right\}_{i=1}^{n_{tr}} \sim p_{tr}(x, y),

and an unlabeled test dataset drawn according to:

\displaystyle \left\{ x_j'\right\}_{j=1}^{n_{te}} \sim p_{te}(x).

We assume that the training and test distributions differ only in the class priors:

\displaystyle p_{tr}(x|y) = p_{te}(x|y), \quad p_{tr}(y) \neq p_{te}(y).
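For concreteness, here is a minimal sketch of data generated under this assumption (hypothetical Gaussian class-conditionals; all names are illustrative), where both sets share p(x|y) but the priors differ:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, prior_pos, rng):
        # Shared class-conditionals p(x|y): unit-variance Gaussians at +2 / -2.
        y = np.where(rng.random(n) < prior_pos, 1, -1)
        x = rng.normal(loc=2.0 * y, scale=1.0)
        return x, y

    # Training and test sets differ only in the class prior p(y = 1).
    x_tr, y_tr = sample(500, 0.5, rng)  # p_tr(y = 1) = 0.5
    x_te, _ = sample(500, 0.8, rng)     # p_te(y = 1) = 0.8; labels discarded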

The goal is then to obtain an estimate \widehat{p}_{te}(y) that allows us to reweight any empirical average
computed from the training samples. Since p_{te}(x|y) = p_{tr}(x|y), each class-conditional expectation can be estimated from the training samples of the corresponding class:

\displaystyle \int \sum_{y=1}^{c} \ell(y g(x))\, p_{te}(x, y)\, dx \approx \sum_{y=1}^c \widehat{p}_{te}(y) \frac{1}{n_y} \sum_{i \,:\, y_i = y} \ell(y_i g(x_i)),

where \ell is a loss function, g a decision function, and n_y the number of training samples with label y.
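Given such an estimate, the reweighting can be computed as in the following sketch (a hedged illustration; reweighted_risk, g, and loss are made-up names, not part of any library):

    import numpy as np

    def reweighted_risk(x_tr, y_tr, g, loss, p_te_hat):
        # Approximates E_{p_te}[loss(y g(x))] by class-wise training averages,
        # each weighted by the estimated test prior p_te_hat[y].
        total = 0.0
        for y, w in p_te_hat.items():
            m = y_tr == y
            total += w * loss(y * g(x_tr[m])).mean()  # (1/n_y) * sum over class y
        return total

    # Example: linear scorer and logistic loss on 1-D inputs.
    rng = np.random.default_rng(0)
    y_tr = rng.choice([-1, 1], size=200)
    x_tr = rng.normal(loc=2.0 * y_tr, scale=1.0)
    g = lambda x: 0.7 * x
    logistic = lambda margin: np.log1p(np.exp(-margin))
    print(reweighted_risk(x_tr, y_tr, g, logistic, {-1: 0.2, 1: 0.8}))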

The general strategy we follow is to model the test input density as a mixture of the training class-conditional densities:

\displaystyle q(x; \theta) = \sum_{y=1}^c \theta_y p_{tr}(x|y).

We then select \theta (with \theta_y \geq 0 and \sum_{y=1}^c \theta_y = 1) so that the model q(x; \theta) matches p_{te}(x) as closely as possible; since, under our assumption, p_{te}(x) = \sum_{y=1}^c p_{te}(y) p_{tr}(x|y), the best-matching \theta recovers the test class priors. To compare q(x; \theta) and p_{te}(x), we
use a divergence (such as an f-divergence or the L_2-distance), which can in turn be estimated directly from samples, avoiding explicit density estimation.
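To make this concrete for the L_2-distance: expanding \int (q(x; \theta) - p_{te}(x))^2 dx gives a quadratic in \theta whose coefficients are the inner products \int p_{tr}(x|y) p_{tr}(x|y') dx and \int p_{tr}(x|y) p_{te}(x) dx, each estimable from samples. The sketch below (illustrative names only; a smoothed Gaussian-kernel plug-in for the inner products, with a hand-picked bandwidth rather than proper model selection) solves the binary case, where the minimizer over [0, 1] has a closed form:

    import numpy as np

    def inner_product(X, Y, sigma):
        # Smoothed plug-in estimate of  \int p(x) q(x) dx  from X ~ p, Y ~ q:
        # the average Gaussian density kernel over all sample pairs (1-D inputs).
        d2 = (X[:, None] - Y[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma**2)).mean() / np.sqrt(2 * np.pi * sigma**2)

    def estimate_prior_l2(x_tr, y_tr, x_te, sigma=0.5, labels=(-1, 1)):
        # Minimize  \int (t p1 + (1 - t) p2 - p_te)^2 dx  over t in [0, 1],
        # with p_y = p_tr(x|y); t estimates p_te(y = labels[0]).
        x1, x2 = x_tr[y_tr == labels[0]], x_tr[y_tr == labels[1]]
        a11 = inner_product(x1, x1, sigma)
        a12 = inner_product(x1, x2, sigma)
        a22 = inner_product(x2, x2, sigma)
        b1 = inner_product(x1, x_te, sigma)
        b2 = inner_product(x2, x_te, sigma)
        t = (a22 - a12 + b1 - b2) / (a11 - 2 * a12 + a22)  # closed-form minimizer
        return float(np.clip(t, 0.0, 1.0))

    # Example: balanced training set, test priors shifted to (0.8, 0.2).
    rng = np.random.default_rng(0)
    y_tr = rng.choice([-1, 1], size=1000)
    x_tr = rng.normal(loc=2.0 * y_tr, scale=1.0)
    y_te = rng.choice([-1, 1], size=1000, p=[0.8, 0.2])
    x_te = rng.normal(loc=2.0 * y_te, scale=1.0)
    print(estimate_prior_l2(x_tr, y_tr, x_te))  # should be close to 0.8

With more than two classes, the same quadratic form is minimized over the probability simplex, which calls for a small quadratic program instead of the closed form above.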

We provide implementations for two methods:

- matching q(x; \theta) to p_{te}(x) under an f-divergence;
- matching under the L_2-distance.

Experimentally, the L_2-distance method seems to give the best results.