I just read an interesting paper, "Dirichlet enhanced spam filtering based on biased samples," which poses a question to me:
Is sample selection bias = multitask learning?
In this paper, each user's spam filtering can be considered one task.
However, the paper makes two very "strong" assumptions:
1. The sample selection bias of the publicly available data and the personalized data is 0-1 consistent; that is, if P(x|public) != 0 then P(x|personal) != 0;
2. P(y|x, public) = P(y|x, personal).
The 1st assumption is OK with me, since most work on sample selection bias adopts it. (Otherwise, I think there is no way to infer anything about examples you have no chance to observe.)
The 2nd assumption is way too UNACCEPTABLE to me. Given the same message, some users might treat it as ham while others treat it as spam. How could the two conditionals possibly be equal?!
(Is this assumed just because it makes the analysis easy?)
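For what it's worth, these two assumptions together are exactly the standard covariate-shift setting, in which the personalized risk can be rewritten as an importance-weighted public risk (a textbook identity, not necessarily the paper's derivation; the predictor f and loss \ell below are my notation):

```latex
% Covariate-shift correction sketch: assumption 2 lets P(y|x) be shared across
% public and personal data, and the support assumption keeps the ratio defined.
\begin{align*}
\mathbb{E}_{(x,y)\sim P(\cdot \mid \mathrm{personal})}\bigl[\ell(f(x), y)\bigr]
  = \mathbb{E}_{(x,y)\sim P(\cdot \mid \mathrm{public})}\Bigl[
      \tfrac{P(x \mid \mathrm{personal})}{P(x \mid \mathrm{public})}\,
      \ell(f(x), y)\Bigr]
\end{align*}
```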
In my opinion, MTL cannot be modeled as a feature bias (in the sample-selection-bias sense), nor as a label bias. A more general model is needed: a complete bias, in which P(s=1|x,y) cannot be decomposed into any simpler form.
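To spell out the taxonomy I have in mind (the usual sample-selection-bias terminology, where s = 1 means an example is selected into the biased sample; the notation is mine, not the paper's):

```latex
\begin{align*}
\text{feature bias:} \quad & P(s{=}1 \mid x, y) = P(s{=}1 \mid x) \\
\text{label bias:} \quad & P(s{=}1 \mid x, y) = P(s{=}1 \mid y) \\
\text{complete bias:} \quad & P(s{=}1 \mid x, y) \text{ depends jointly on } (x, y) \text{ and does not factor}
\end{align*}
```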
Actually, I am wondering whether it is possible to learn P(s=1|x,y) (as in the paper I mentioned) when both biased and unbiased samples are provided. Let's start with the ideal case, where plenty of both are available. Then a logistic regression learner can be adopted to estimate the bias: the trick is to treat (x, y) as the input features and the selection indicator s = 1 or 0 (whether the example comes from the biased sample) as the output.
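A minimal sketch of this ideal case (my own illustration, not code from the paper; it assumes scikit-learn and that x is already a numeric feature vector):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_selection_bias(X_biased, y_biased, X_unbiased, y_unbiased):
    """Estimate P(s=1 | x, y) with logistic regression.

    s = 1 for examples drawn from the biased (public) sample,
    s = 0 for examples drawn from the unbiased (personal) sample.
    The classifier input is the concatenation of x and y.
    """
    # Build (x, y) inputs: append the label y as one extra feature column.
    Z_biased = np.hstack([X_biased, np.asarray(y_biased).reshape(-1, 1)])
    Z_unbiased = np.hstack([X_unbiased, np.asarray(y_unbiased).reshape(-1, 1)])
    Z = np.vstack([Z_biased, Z_unbiased])

    # Selection indicator: 1 = came from the biased sample, 0 = unbiased.
    s = np.concatenate([np.ones(len(Z_biased)), np.zeros(len(Z_unbiased))])

    # Fit P(s=1 | x, y); note the estimate is only calibrated up to the
    # class prior implied by the relative sizes of the two samples.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Z, s)
    return clf  # clf.predict_proba(z)[:, 1] approximates P(s=1 | x, y)
```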
In MTL, however, only very few labeled examples are provided for the unbiased setting. Can we estimate the density reliably? The dilemma is that we need "enough" data to improve the bias estimation, but if we had "enough" data, we could already derive a very good model directly.
Any other possible way to enhance the estimation?
This paper adopts the Dirichlet process to improve the estimation. But why does it work? I couldn't figure it out.
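My rough guess at the mechanism (a generic hierarchical construction with a Dirichlet process prior, not necessarily the paper's exact model) is that per-user parameters are drawn from a shared random measure, so users with little personal data are shrunk toward what the pooled data supports:

```latex
\begin{align*}
G &\sim \mathrm{DP}(\alpha, G_0) && \text{shared random measure over filter parameters} \\
\theta_u &\sim G && \text{parameters of user } u\text{'s filter} \\
y_{u,i} &\sim p(y \mid x_{u,i}, \theta_u) && \text{user } u\text{'s labels}
\end{align*}
```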