Friday, December 15, 2006

Sample selection bias = Multitask learning?

I just browsed an interesting paper:
Dirichlet enhanced spam filtering based on biased samples, which poses a question to me:

Is sample selection bias = multitask learning?

In this paper, each user's spam filtering is considered as one task.

However, the paper makes two very "strong" assumptions:

1. The sample selection bias between the publicly available data and the personalized data is 0-1 consistent; that is, if P(x|personal) != 0 then P(x|public) != 0.

2. P(y|x, public) = P(y|x, personal).

The 1st assumption is fine with me, since most work on sample selection bias adopts it. (Otherwise, I think there is no way to infer anything about inputs you have no chance to observe.)

The 2nd assumption is UNACCEPTABLE to me. Given the same message, some users might treat it as ham while others treat it as spam. How could the two conditional distributions possibly be equal?!
(Is this assumption made only because it keeps the analysis easy?)
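
To spell out what these two assumptions buy (this is the standard importance-weighting derivation, not something taken from the paper itself): under them, the expected loss on the personal distribution can be rewritten as a weighted expected loss on the public distribution,

\begin{align*}
E_{\text{personal}}[\ell(f(x),y)]
  &= \sum_{x,y} P(x \mid \text{per})\, P(y \mid x,\text{per})\, \ell(f(x),y) \\
  &= \sum_{x,y} \frac{P(x \mid \text{per})}{P(x \mid \text{pub})}\, P(x \mid \text{pub})\, P(y \mid x,\text{pub})\, \ell(f(x),y) \\
  &= E_{\text{public}}\!\left[\frac{P(x \mid \text{per})}{P(x \mid \text{pub})}\, \ell(f(x),y)\right],
\end{align*}

where the second line swaps the conditionals by assumption 2, and assumption 1 guarantees the ratio is well-defined wherever P(x|personal) > 0.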

In my opinion, MTL cannot be modeled as a feature bias (in sample-selection-bias terms, P(s=1|x,y) = P(s=1|x)), nor as a label bias (P(s=1|x,y) = P(s=1|y)). A more general model should be a complete bias; that is, P(s=1|x,y) cannot be decomposed into any simpler form.

Actually, I am wondering whether it is possible to learn P(s=1|x,y) (as in the paper I mentioned) if biased and unbiased samples are both provided. OK. Let's start with the ideal case, where both samples are plentiful. Then a logistic regression learner can be adopted to estimate the bias. The trick here is to treat (x, y) as the input features and s = 1 or 0 as the output, pooling the biased sample (s = 1) with the unbiased one (s = 0); see the sketch below.
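
Here is a minimal sketch of that idea in Python (assuming NumPy and scikit-learn; the function name and data layout are mine, purely for illustration, not from the paper):

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_selection_bias(X_pub, y_pub, X_per, y_per):
    """Estimate P(s=1 | x, y) by treating (x, y) as input and s as output.

    X_pub, y_pub: biased (public) sample, which was selected, so s = 1.
    X_per, y_per: unbiased (personal) sample, s = 0.
    """
    # Encode each pair (x, y) as one feature vector: append the class
    # label y to x as an extra input dimension.
    Z_pub = np.hstack([X_pub, y_pub.reshape(-1, 1)])
    Z_per = np.hstack([X_per, y_per.reshape(-1, 1)])
    Z = np.vstack([Z_pub, Z_per])
    s = np.concatenate([np.ones(len(Z_pub)), np.zeros(len(Z_per))])

    model = LogisticRegression(max_iter=1000)
    model.fit(Z, s)
    return model

# Usage: with p = model.predict_proba(Z_pub)[:, 1], the ratio (1 - p) / p
# is proportional to the density ratio P(x,y | unbiased) / P(x,y | biased),
# so it can serve as an importance weight for the biased examples.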

In MTL, however, only very little labeled data is provided in the unbiased (personalized) setting. Can we estimate the bias density reliably from it? The dilemma is this: we need "enough" unbiased data to improve the bias density estimation, but if we already had "enough" unbiased data, we could derive a very good model from it directly.

Any other possible way to enhance the estimation?

This paper adopts the Dirichlet process to improve the estimation. But why does it work? I couldn't figure it out.


