
Monday, March 26, 2007

Several future directions

While thinking over which directions to go next, I find the following problems in text categorization interesting.

1. Classification with a large number of categories and multiple labels per document. Typically, a hierarchy is employed to decompose the problem, reducing it to learning with structured output (a toy top-down version is sketched after this list).

2. "Dirty" text categorization. Typical text categorization requires the features to be clean, such as newswire articles, paper abstract etc. However, current fashion extends to "dirty" texts, such as notes(spell errors), prescription(lots of abbreviations) customer telephone log (usually with noisy, contradicting facts). Another example is email spam filtering. Currently, most of spams consists of images rather than just text. However, existing OCR techniques can not extract the characters very correctly. Hence, the final words/terms obtained might not a "proper" feature. Hence, some techniques are required to transform a "image-derived" word into a word in the dictionary. Such kind of transformation can be done via some algorithm like shortest-path algorithm. However, when the spammer add noise in purpose in the text within images, this problem seems more complicated. Is it possible to automatically learn feature similarity? How to extract useful similarities measure between these noisy vectors? How to derive a useful kernel? So, this problem actually is related to feature extraction, kernel learning, and robust learning and uncertainty.

3. Event detection and concept drift. I believe this kind of direction has more promising impact. I think the difficulty lies mainly in the lack of benchmark data sets, but with the development of Web 2.0, this kind of problem should gain some attention in the future.

4. The ambiguous label problem. I really doubt that the small-sample problem exists in text classification; labeling some documents seems to require very little human labor. Some websites already provide schemes for users to tag web pages and blog posts, yet how to effectively exploit that tag information seems to be missing from current work. When I tried delicious, only some keyword matching was performed. How can we organize the text in a more sensible way?

5. "Universal" text classification. As so many benchmark data sets are online, can we any how use all of them. This might be related to transfer learning. At least, the benchmark data can serve to provide a common prior for the target classification task. But can we extract more? Human beings can immediately classify the documents given very few examples. Existing transfer learning (most actually are doing MTL), in nature, is doing dimensionality reduction. How to related the features of different domains? Is it possible to extract the "structural" information? Zhang Tong's work talks about that, but it actually focus more on semi-supervised learning.

6. Sentiment classification / author gender identification / genre identification. These kinds of problems require new feature extraction techniques.
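To make point 1 concrete, here is a toy top-down version of the hierarchy idea. This is only a sketch: scikit-learn is assumed to be available, and the taxonomy, documents, and labels below are all made up for illustration.

# Hypothetical sketch: top-down hierarchical classification with one
# flat classifier per internal node, routing a document down to a leaf.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# children of each internal node; anything without an entry is a leaf
taxonomy = {
    "root": ["sports", "science"],
    "science": ["physics", "biology"],
}

# (document, path-from-root) training pairs -- all invented
train = [
    ("the team won the match", ["sports"]),
    ("goal scored in the final minute", ["sports"]),
    ("quantum particles and energy levels", ["science", "physics"]),
    ("electrons orbit the nucleus", ["science", "physics"]),
    ("cells divide and genes mutate", ["science", "biology"]),
    ("proteins fold inside the cell", ["science", "biology"]),
]

# train one classifier per internal node on the documents passing through it
node_clf = {}
for node in taxonomy:
    docs, labels = [], []
    for text, path in train:
        full = ["root"] + path
        if node in full[:-1]:               # this doc passes through the node
            labels.append(full[full.index(node) + 1])
            docs.append(text)
    node_clf[node] = make_pipeline(
        TfidfVectorizer(), LogisticRegression()).fit(docs, labels)

def classify(text):
    """Route a document from the root down to a leaf label."""
    node = "root"
    while node in taxonomy:
        node = node_clf[node].predict([text])[0]
    return node

print(classify("genes and cells in the lab"))   # expect something like: biology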
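And for the OCR cleanup in point 2, a minimal sketch of the edit-distance idea. The spam vocabulary here is of course made up; this only illustrates mapping a noisy token to the nearest dictionary word, not a real filter.

# Hypothetical sketch: map a noisy image-derived token onto the closest
# dictionary word via Levenshtein (edit) distance, a simple stand-in for
# the "shortest-path" style correction mentioned above.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalize(token, vocabulary):
    """Return the dictionary word nearest to a noisy token."""
    return min(vocabulary, key=lambda w: edit_distance(token, w))

vocab = ["viagra", "mortgage", "refinance", "pharmacy"]  # invented example
print(normalize("v1agr@", vocab))    # -> viagra
print(normalize("m0rtg4ge", vocab))  # -> mortgage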

Some other concerns:

Feature selection for text categorization? As far as I can see, I do not think this direction will yield more interesting results. It works, and it is efficient; it can serve as a preprocessing step to reduce the computational burden, as in the sketch below. But more sophisticated methods (such as kernel learning) can do a better job.
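For example (a minimal sketch, assuming scikit-learn and its 20-newsgroups loader are available; the chi-square score and k=1000 are arbitrary choices):

# Hypothetical sketch: chi-square feature selection as cheap preprocessing.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"])
X = CountVectorizer().fit_transform(data.data)

# keep only the 1000 terms most correlated with the labels
X_small = SelectKBest(chi2, k=1000).fit_transform(X, data.target)
print(X.shape, "->", X_small.shape)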

Active learning: a greedy method works well enough, along the lines of the sketch below.
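By "greedy" I mean something like uncertainty sampling: repeatedly query the unlabeled example whose predicted probability is closest to 0.5. A minimal sketch on synthetic data (everything here is invented for illustration):

# Hypothetical sketch: greedy uncertainty sampling on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(y)) if i not in labeled]

for _ in range(20):                    # 20 greedy queries
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    prob = clf.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(prob - 0.5)))]   # most uncertain example
    labeled.append(pick)
    pool.remove(pick)

print("accuracy:", clf.score(X, y))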

Clustering does not make much sense to me. But for plain text, clustering might still show some potential impact. I believe clustering should be "customized": different users will ask for different clustering results, so it seems more interesting to do clustering given some prespecified parameters. Clustering of multi-label data under concept drift could also be explored.

Thursday, December 14, 2006

Finally done with course work

Finally finished all the finals and TA work.

Since this is the end of the semester, I would like to make a summary of the work I have done.

OK, this year I took two courses. One was on numerical linear algebra. I think this class was pretty good; I'm finally no longer scared away by all that matrix and SVD stuff. But unfortunately, I didn't spend much time on it. I think the projects in this class were pretty cool, especially the Arnoldi process. I didn't expect that such a large sparse matrix could be handled in only a few iterations.
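From memory, the process looks roughly like this. This is a NumPy sketch, not the actual project code, and I use a dense random matrix as a stand-in for the sparse one:

# Hypothetical sketch of the Arnoldi process: build an orthonormal Krylov
# basis Q and a small Hessenberg matrix H whose eigenvalues approximate
# the extreme eigenvalues of A after only a few iterations.
import numpy as np

def arnoldi(A, v, k):
    n = len(v)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):                  # Gram-Schmidt against the basis
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
A = A + A.T                                     # symmetric test matrix
Q, H = arnoldi(A, rng.standard_normal(500), 30)
print(max(abs(np.linalg.eigvals(H[:30, :30]))))  # ~ largest |eigenvalue| of A
print(max(abs(np.linalg.eigvals(A))))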

As for the SVD, LU decomposition, and QR decomposition (finally I got a chance to show off on this blog :)), I don't think I'll use them much in future research work. But combined with convex optimization, they should help.

Well... the second course: distributed operating systems. The instructor is a good talker and explains everything very well. Unfortunately, I am an eager learner (not a kNN-style "lazy learner"), and he speaks so slowly that I couldn't help falling asleep. Is this my fault?

Anyway, it was good to learn the concepts of lock/unlock, synchronization, multithreading, PVM, messenger systems, shared memory, the net messenger server, RPC, lock servers, distributed shared memory, and distributed mutual exclusion/snapshots. All the terms sound very cool... I did really badly in the final. But who cares? (PhD students don't care about coursework. That has become my motto now. Hehe...)

Finally, let's talk about the TA work. I worked as a TA for Subbarao Kambhampati.
To be frank, this is the toughest TA job I've ever had: four homeworks (the last one had around 50 questions, which really drove me crazy!!!!) and five projects. I was grading either a homework or a project nearly every weekend. So much stuff.

But from another perspective, I've never understood the AI concepts so clearly. Rao connected agent design, search, planning, MDPs, logic, Bayesian networks, and learning so well that I have to admit I learned much more from this TA work than from taking a course. Actually, I don't think I learned anything really exciting when I took Dr. Liu's AI class; Rao did a much better job. But I don't like his projects: no challenge, though lots of students spent way too much time on Lisp. Anyway, I'm thankful for this experience. (I guess this is due to my gf. She changed me a lot!)

But...... I have to work as a TA again next semester, for both AI and data mining. I want to kill someone.....