While reckon over the future directions to go, I think the following problems might be interesting concerning text categorization.
1. Large-number of categories, multi-label classification problem. Typically, a hierarchy is employed to dissect the problem, thus reduce to learning with structured output.
2. "Dirty" text categorization. Typical text categorization requires the features to be clean, such as newswire articles, paper abstract etc. However, current fashion extends to "dirty" texts, such as notes(spell errors), prescription(lots of abbreviations) customer telephone log (usually with noisy, contradicting facts). Another example is email spam filtering. Currently, most of spams consists of images rather than just text. However, existing OCR techniques can not extract the characters very correctly. Hence, the final words/terms obtained might not a "proper" feature. Hence, some techniques are required to transform a "image-derived" word into a word in the dictionary. Such kind of transformation can be done via some algorithm like shortest-path algorithm. However, when the spammer add noise in purpose in the text within images, this problem seems more complicated. Is it possible to automatically learn feature similarity? How to extract useful similarities measure between these noisy vectors? How to derive a useful kernel? So, this problem actually is related to feature extraction, kernel learning, and robust learning and uncertainty.
3. Event detection and concept drift. I believe such kind of directions has more promising effect. I think the difficulty lays mainly on the lack of benchmark data set. But with the development of Web 2.0, this kind of problem should gain some attention in the future.
4. Ambiguous label problem. I really doubt the existence of small sample in text classification. Seems labellings some documents requires very little human labor. Now, some websites already provides some schemes for users to tag some web pages and blog posts. How to effectively employ the tag info seems to be missing in current work. When I tried delicious, only some key-word matching are performed. How to organize the text into more sensible way?
5. "Universal" text classification. As so many benchmark data sets are online, can we any how use all of them. This might be related to transfer learning. At least, the benchmark data can serve to provide a common prior for the target classification task. But can we extract more? Human beings can immediately classify the documents given very few examples. Existing transfer learning (most actually are doing MTL), in nature, is doing dimensionality reduction. How to related the features of different domains? Is it possible to extract the "structural" information? Zhang Tong's work talks about that, but it actually focus more on semi-supervised learning.
6. Sentimental classification/author gender identification/ genre identification. Such kind of problems requires new feature extraction techniques.
Some other concerns:
Feature selection for text categorization? As far as I can see, I do not think this direction will provide more interesting result. It works, and efficient. It can be used as a preprocessing step to reduce the computation burden. But some complicated methods (such as kernel learning) can be used to do a better job.
Active Learning, a greedy method can works fine enough.
Clustering, not making any sense to me. But as for simple text, clustering might show some potential impact. I believe clustering should be "customized". Different user will ask different clustering results. It seems more interesting to do clustering given some prespecified parameters. Clustering of multi-labels under concept drift can also be explored.
"Our Days Are Numbered"
-
[image: Proofs are amenable to chess techniques. "Our Days are Numbered".]
Slide in Lev Reyzin's JMM talk "Problems in AI and ML for Mathematicians"
Reyz...