Monday, March 26, 2007

Several Future Directions

While mulling over future directions, I think the following problems in text categorization might be interesting.

1. Multi-label classification with a large number of categories. Typically, a hierarchy is employed to decompose the problem, which reduces it to learning with structured output.

2. "Dirty" text categorization. Typical text categorization requires the features to be clean, such as newswire articles, paper abstract etc. However, current fashion extends to "dirty" texts, such as notes(spell errors), prescription(lots of abbreviations) customer telephone log (usually with noisy, contradicting facts). Another example is email spam filtering. Currently, most of spams consists of images rather than just text. However, existing OCR techniques can not extract the characters very correctly. Hence, the final words/terms obtained might not a "proper" feature. Hence, some techniques are required to transform a "image-derived" word into a word in the dictionary. Such kind of transformation can be done via some algorithm like shortest-path algorithm. However, when the spammer add noise in purpose in the text within images, this problem seems more complicated. Is it possible to automatically learn feature similarity? How to extract useful similarities measure between these noisy vectors? How to derive a useful kernel? So, this problem actually is related to feature extraction, kernel learning, and robust learning and uncertainty.

3. Event detection and concept drift. I believe this direction is more promising. The main difficulty lies in the lack of benchmark data sets, but with the development of Web 2.0, this kind of problem should gain more attention in the future.

4. The ambiguous-label problem. I really doubt that small-sample settings are common in text classification; labeling some documents seems to require very little human labor. Some websites already provide schemes for users to tag web pages and blog posts, but how to effectively employ that tag information seems to be missing from current work. When I tried delicious, only some keyword matching was performed. How can the text be organized in a more sensible way?

5. "Universal" text classification. As so many benchmark data sets are online, can we any how use all of them. This might be related to transfer learning. At least, the benchmark data can serve to provide a common prior for the target classification task. But can we extract more? Human beings can immediately classify the documents given very few examples. Existing transfer learning (most actually are doing MTL), in nature, is doing dimensionality reduction. How to related the features of different domains? Is it possible to extract the "structural" information? Zhang Tong's work talks about that, but it actually focus more on semi-supervised learning.

6. Sentiment classification / author gender identification / genre identification. These kinds of problems require new feature extraction techniques.
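
To make item 2 concrete, here is a minimal sketch (my own toy example, not from any system mentioned above) of mapping a noisy, image-derived token back to the closest dictionary word using plain edit distance, one simple stand-in for the "shortest-path" style transformation; the dictionary and tokens are made up.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def normalize_token(token, dictionary):
    """Return the dictionary word closest to the noisy OCR token."""
    return min(dictionary, key=lambda w: edit_distance(token, w))

dictionary = ["viagra", "mortgage", "university", "meeting"]
print(normalize_token("v1agra", dictionary))    # -> "viagra"
print(normalize_token("m0rtg4ge", dictionary))  # -> "mortgage"
```

Of course, once spammers add noise on purpose, a fixed distance like this is easy to fool, which is exactly where learning the similarity (or a kernel) becomes interesting.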

Some other concerns:

Feature selection for text categorization? As far as I can see, I do not think this direction will provide many more interesting results. It works and is efficient, and it can be used as a preprocessing step to reduce the computational burden, but more sophisticated methods (such as kernel learning) can do a better job.
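
For what it's worth, a minimal sketch of that preprocessing view, ranking terms by a chi-square score and keeping only the top few; the library choice (scikit-learn), the toy corpus, and k are mine, not from the post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stock market falls", "market rally continues",
        "team wins final", "coach praises team"]
labels = [0, 0, 1, 1]  # 0 = finance, 1 = sports (made-up labels)

X = CountVectorizer().fit_transform(docs)   # term-count features
selector = SelectKBest(chi2, k=3)           # keep the 3 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)                      # (4, 3): a much smaller matrix
```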

Active learning: a greedy method can work well enough.
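
A minimal sketch of what I mean by "greedy" here, namely pool-based uncertainty sampling: at each round, label the unlabeled example the current model is least confident about. The synthetic data and the logistic regression model are my own choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic labels

# Start with a handful of labeled examples from each class.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(200) if i not in labeled]

for _ in range(20):                               # 20 greedy queries
    model = LogisticRegression().fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain point
    labeled.append(query)                         # "ask the oracle" for its label
    pool.remove(query)

print(model.score(X, y))                          # accuracy after the queries
```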

Clustering does not make much sense to me. But for simple text, clustering might show some potential impact. I believe clustering should be "customized": different users will ask for different clustering results, so it seems more interesting to do clustering given some prespecified parameters. Clustering of multi-label data under concept drift could also be explored.

Thursday, March 22, 2007

Leonhard Euler

"Leonhard Euler(1707-1783) A Swiss mathematician and physicist.
......
During the last 17 years of his life, he was almost totally blind, and yet he produced nearly half of his results during that period."

(Copied from Pattern Recognition and Machine Learning, Page 465)

I am wondering whether he would have had any chance to win the Fields Medal :)

Where to Go?

Recently, I seem to have lost my enthusiasm for research. No mood to read, to discuss, or to work on experiments. I know this is not the right way to go, but I just can't help becoming anxious. I would rather there were a problem I could just jump into.

I don't know whether this is related to frustration with my recent progress. Too slow.

To be frank, I really doubt the current development of the machine learning field. Too fast. Each year the top conferences publish lots of papers, but very few of them are worth reading. Many of them are just rubbish. Is this common in other fields?

I am wondering which direction would be the right way to go. Searching blindly is really difficult :(

Friday, March 2, 2007

Human Intelligence and Current Artificial Intelligence

I'm taking Pat Langley's cognitive systems class this semester.

My notion of intelligence is rather vague.

What is intelligence? If a system is complicated enough, is it intelligent? What should the components of an intelligence be? Take Deep Blue: is that intelligence? I guess most people would not vote for it. Now people are trying to develop a general game player that can automatically train itself once the rules of a game are given. Is that intelligent? No. Human beings are open-minded: yes, machines can outperform humans in one field, but that alone does not make them "artificial intelligence". That's why I feel AI will never come true in my lifetime. But I agree with Pat that more focus should be on the nature of mind.

For a long time, I have been thinking that the difference between the brain and a machine is mainly a hardware difference (the brain's processing speed is low and its memory is limited, but it is highly parallel). In tasks like playing chess, planning, and scheduling, machines can do an exhaustive or deeper search, while human beings prune a lot of branches at each step (I guess some pattern recognition is involved). It is more like beam search, as I see it. The problem for efficient processing is how to extract those useful patterns.
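
To make the analogy concrete, here is a minimal, generic beam search sketch (my own toy illustration, not a cognitive model): instead of expanding every branch, keep only the few most promising partial solutions at each step.

```python
def beam_search(start, expand, score, steps, beam_width=3):
    """expand(state) -> candidate next states; score(state) -> bigger is better."""
    beam = [start]
    for _ in range(steps):
        candidates = [nxt for state in beam for nxt in expand(state)]
        if not candidates:
            break
        # Prune: keep only the beam_width highest-scoring partial solutions.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

# Toy usage: build a 5-character digit string with the largest digit sum,
# exploring only a few continuations per step instead of all 10**5 strings.
digits = "0123456789"
best = beam_search(start="",
                   expand=lambda s: [s + d for d in digits],
                   score=lambda s: sum(int(c) for c in s),
                   steps=5)
print(best)  # "99999"
```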

Then recently, Pat mentioned this paper:
"The Magical Number Seven, Plus or Minus Two" (George Miller),

which tries to understand why human beings can hold around seven digits in short-term memory easily but struggle with many more (this is partly why telephone numbers are around seven digits long).

Based on information theory, the more digits we have, the more information they contain, and thus the more bits are required to represent them. But for human beings, this does not seem to be the case. This phenomenon is really interesting, and it justifies the study of human brains. I guess Pat didn't realize how exciting this example would be to me :)
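
A rough back-of-the-envelope version of that information-theoretic point (my own numbers, not from the paper): if each decimal digit is uniform and independent, it carries log2(10) ≈ 3.32 bits, so the cost grows linearly with length, yet human recall does not degrade that smoothly.

```python
import math

# Bits needed to encode an n-digit decimal string, assuming each digit is
# uniform and independent (a simplifying assumption, not from Miller's paper).
def digit_bits(n):
    return n * math.log2(10)

for n in (4, 7, 10):
    print(n, "digits ->", round(digit_bits(n), 1), "bits")
# 4 digits -> 13.3 bits, 7 digits -> 23.3 bits, 10 digits -> 33.2 bits
```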

Thursday, March 1, 2007

KDD submission is done

Nitin and I coauthored a paper on finding the influentials in the blogosphere. It's an interesting problem, though the method seems a little ad hoc. But we finally made it.

It was all long-distance collaboration (that's the wonder of the Web): Nitin was working in India while Dr. Liu and I were working in the United States. The final version seems much better, but whether or not this work gets accepted really depends on our luck. Part of my concern is that the technical part is really subjective.

I don't know whether the reviewers are obsessed with cross-validation.

Let's see.