
Monday, March 26, 2007

Several future directions

While pondering future directions, I think the following problems might be interesting in text categorization.

1. Multi-label classification with a large number of categories. Typically, a hierarchy is employed to decompose the problem, which reduces it to learning with structured outputs.

2. "Dirty" text categorization. Typical text categorization requires the features to be clean, such as newswire articles, paper abstract etc. However, current fashion extends to "dirty" texts, such as notes(spell errors), prescription(lots of abbreviations) customer telephone log (usually with noisy, contradicting facts). Another example is email spam filtering. Currently, most of spams consists of images rather than just text. However, existing OCR techniques can not extract the characters very correctly. Hence, the final words/terms obtained might not a "proper" feature. Hence, some techniques are required to transform a "image-derived" word into a word in the dictionary. Such kind of transformation can be done via some algorithm like shortest-path algorithm. However, when the spammer add noise in purpose in the text within images, this problem seems more complicated. Is it possible to automatically learn feature similarity? How to extract useful similarities measure between these noisy vectors? How to derive a useful kernel? So, this problem actually is related to feature extraction, kernel learning, and robust learning and uncertainty.

3. Event detection and concept drift. I believe this direction is more promising. The difficulty lies mainly in the lack of benchmark data sets, but with the development of Web 2.0, this kind of problem should gain more attention in the future.

4. Ambiguous label problem. I really doubt that the small-sample setting exists in text classification: labeling a few documents requires very little human labor. Some websites already provide schemes for users to tag web pages and blog posts, but how to effectively employ this tag information seems to be missing from current work. When I tried delicious, only some keyword matching was performed. How can we organize the text in a more sensible way?

5. "Universal" text classification. As so many benchmark data sets are online, can we any how use all of them. This might be related to transfer learning. At least, the benchmark data can serve to provide a common prior for the target classification task. But can we extract more? Human beings can immediately classify the documents given very few examples. Existing transfer learning (most actually are doing MTL), in nature, is doing dimensionality reduction. How to related the features of different domains? Is it possible to extract the "structural" information? Zhang Tong's work talks about that, but it actually focus more on semi-supervised learning.

6. Sentiment classification / author gender identification / genre identification. These kinds of problems require new feature extraction techniques.
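To make the word-normalization idea in point 2 concrete: one simple instantiation of the shortest-path idea is to map each noisy, OCR-derived token to its nearest dictionary word by edit distance (which is just a shortest path in the edit graph). A minimal sketch, with a made-up dictionary and token:

# Minimal sketch (made-up dictionary and token): normalize a noisy,
# OCR-derived token to the closest dictionary word by edit distance.
def edit_distance(a, b):
    # Classic dynamic program: d[i][j] = edits to turn a[:i] into b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def normalize(token, dictionary):
    # Map the token to the dictionary word at minimum edit distance.
    return min(dictionary, key=lambda w: edit_distance(token, w))

print(normalize("v1agr@", ["viagra", "mortgage", "offer", "free"]))  # -> viagra

Of course, once the spammer perturbs the image text on purpose, a fixed edit distance is exactly the kind of hand-crafted similarity that the kernel-learning question above tries to move beyond.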

Some other concerns:

Feature selection for text categorization? As far as I can see, I do not think this direction will provide more interesting results. It works and is efficient, and it can be used as a preprocessing step to reduce the computational burden. But more sophisticated methods (such as kernel learning) can do a better job.
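For concreteness, this is roughly the kind of cheap preprocessing I mean; a minimal sketch with a made-up toy corpus, using scikit-learn's chi-square filter:

# Minimal sketch (toy corpus, made-up labels): chi-square feature selection
# as a cheap preprocessing step before training a text classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap offer now", "project meeting notes attached"]
labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = ham-like

X = CountVectorizer().fit_transform(docs)         # bag-of-words counts
selector = SelectKBest(chi2, k=4).fit(X, labels)  # keep the 4 highest-scoring terms
X_reduced = selector.transform(X)                 # smaller matrix for the classifier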

Active learning: a greedy method can work well enough.
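By a greedy method I mean something like plain uncertainty sampling: at each round, query the unlabeled document the current model is least sure about. A minimal sketch (the pool and the model choice are only illustrative):

# Minimal sketch: greedy uncertainty sampling for a binary text classifier.
# X_labeled/y_labeled and X_pool are assumed to be already-vectorized data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_next(model, X_pool):
    # Greedily pick the example whose predicted probability is closest to 0.5.
    proba = model.predict_proba(X_pool)[:, 1]
    return int(np.argmin(np.abs(proba - 0.5)))

# model = LogisticRegression().fit(X_labeled, y_labeled)
# i = query_next(model, X_pool)  # send example i to the annotator, retrain, repeat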

Clustering does not make much sense to me. But for simple text, clustering might show some potential impact. I believe clustering should be "customized": different users will ask for different clustering results. It seems more interesting to do clustering given some pre-specified parameters. Clustering of multiple labels under concept drift can also be explored.

Sunday, January 21, 2007

Some experiment results

I just finished some demo experiments. Originally, I wanted to find a toy example to show the effect of task selection in transfer learning. Unfortunately, all the results I found are very disappointing.
Let me summarize the results a little bit:
(1) If the target task has very limited training data, transfer learning does help a lot compared with single-task learning.
(2) The tasks selected make a very tiny difference (within 1%). Actually, it seems that combining all the tasks together is a very robust and reliable strategy for the data set I'm working on.
(3) Combining all the data together always seems better than transfer learning.


(1) is not surprising and has been confirmed by existing research.
(2) cannot justify task selection.
(3) suggests that there's no real difference between the tasks in this data set.

Maybe one interesting problem is to determine whether the data extracted from multiple sources actually come from the same distribution. But I feel that's a more difficult problem than task selection.

Monday, December 18, 2006

Domain adaptation & learning from multiple sources

It seems that different people describe transfer learning with different terms:
multitask learning, domain adaptation, learning from multiple sources, and sample selection bias.

Here, I just try to distinguish these terms and discuss the differences.

Multitask learning has been studied a lot. Usually, the objective is to learn a model for each task such that the overall performance is optimized, while the models share some commonality.
A very strong limitation of this method is that
overall performance != performance on one specific task.
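To be concrete, one common formulation (my notation, not tied to any particular paper) gives task t a weight vector w_0 + v_t, with a shared part w_0 and a task-specific part v_t, and minimizes the summed per-task losses:

\min_{w_0, v_1, \ldots, v_T} \; \sum_{t=1}^{T} \sum_{i=1}^{n_t} \ell\big(y_{ti}, (w_0 + v_t)^\top x_{ti}\big) \;+\; \lambda_0 \|w_0\|^2 \;+\; \lambda_1 \sum_{t=1}^{T} \|v_t\|^2

The objective is a sum over all tasks, which is exactly why minimizing it does not guarantee the best performance on any single target task.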

In my opinion, MTL actually solves the learning problem in an indirect way. There are lots of issues involved: how to find similar tasks, which tasks to trust, how much to trust each task, and what the trade-off is between data and tasks. All these problems require additional knowledge from the domain or the data, which makes MTL rather limited. Actually, MTL only helps if very little data is available for each task, and if there's no training data at all for one task, MTL cannot work.

Domain adaptation is more like a transfer learning setting. It can be considered as involving two tasks: one is the source domain (support task), the other is the target domain (target task). This term was coined (as far as I can see) in the NLP community. The goal is exactly the same as in transfer learning: to improve the performance on the target task.

Learning from multiple sources can be a little tricky. There are actually two interpretations: one assumes that all the sources are drawn from the same underlying distribution, but with various levels of noise or uncertainty (if some additional information, like a bound on the noise, can be obtained, then it's possible to take advantage of it); the other assumes that different sources have different, but related, underlying distributions. Thus, the former is more like data selection (how to learn ONE model given some noisy data), while the latter is exactly like multitask learning (develop one model for each task/source).

Finally, transfer learning can also be connected to sample selection bias, as I mentioned in my last post. However, sample selection bias usually deals with the case where only one biased sample is available and the question is how to obtain an unbiased model. This situation is more like domain adaptation, and we can effectively apply methods from one field to the other.





Friday, December 15, 2006

Sample selection bias = Multitask learning?

I just browsed an interesting paper, "Dirichlet enhanced spam filtering based on biased samples", which poses a question to me:

Is sample selection bias = multitask learning?

In this paper, each user's spam filtering can be considered as one task.

However, there are two very strong assumptions:

1. The sample selection bias between the publicly available data and the personalized data is still 0-1 consistent. That is, if P(x|public) != 0 then P(x|personal) != 0;

2. P(y|x, public) = P(y|x, personal).

The 1st assumption is still OK, since most work on sample selection bias adopts it. (Otherwise, I think there's no way to infer something you have no chance to observe.)

The 2nd assumption is simply UNACCEPTABLE to me. Given a message, some users might treat it as ham and some might treat it as spam. How could these conditional distributions possibly be equal?!
(Is this assumed just because it makes the analysis easy?)

In my opinion, MTL cannot be considered as a feature bias (in terms of sample selection bias), nor as a label bias. A more general model should be a complete bias; that is, P(s=1|x,y) cannot be decomposed into any simpler form.

Actually, I am wondering whether it's possible to learn P(s=1|x,y) (as in the paper I mentioned) if both biased and unbiased samples are provided. OK, let's start with the ideal case. A logistic regression learner can be adopted to estimate the bias. The trick here is to treat (x, y) as the input features and s = 1 or 0 as the output.
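A minimal sketch of that ideal case (array names and data are made up): stack the biased pool (s = 1) and the unbiased pool (s = 0), append the label y to the feature vector x, and fit a logistic regression for s.

# Minimal sketch of estimating P(s=1 | x, y) in the ideal case where both
# a biased pool and an unbiased pool are available (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_biased, y_biased: examples observed under the selection bias (s = 1)
# X_unbiased, y_unbiased: examples drawn without bias (s = 0)
X_biased, y_biased = np.random.randn(100, 5), np.random.randint(0, 2, 100)
X_unbiased, y_unbiased = np.random.randn(30, 5), np.random.randint(0, 2, 30)

# The trick from above: the input is the pair (x, y), the target is s.
XY = np.vstack([np.column_stack([X_biased, y_biased]),
                np.column_stack([X_unbiased, y_unbiased])])
s = np.concatenate([np.ones(len(y_biased)), np.zeros(len(y_unbiased))])

bias_model = LogisticRegression().fit(XY, s)
# bias_model.predict_proba(...) now estimates P(s=1 | x, y), which could be
# turned into importance weights for correcting the biased sample.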

In MTL, only very little labeled data is provided in the unbiased setting. Can we estimate the density reliably? The dilemma is this: we need "enough" data to improve the bias density estimation, but if we had "enough" data, we could already derive a very good model.

Is there any other possible way to enhance the estimation?

This paper adopts a Dirichlet process to improve the estimation. But why does it work? I couldn't figure it out.