Friday, December 22, 2006

Travel to LA (23rd-27th)

More will come after I come back:)

Tuesday, December 19, 2006

Data Mining dead for effective counterterrorism?

Jeff Jonas has just posted a paper on his blog arguing that data mining would fail to detect suspects for counterterrorism. This kind of dissenting voice is really interesting to me.

In his article, he mentions the limitations of existing data mining techniques:

1. Typical data mining requires lots of data to discover a pattern of terrorism. However, terrorist attacks are so rare that data mining would fail to detect them.

2. The cost of false positives. If the monitored population is large, say 1 billion people, then even with very high accuracy (say 99.9%), the false positives would amount to 1,000,000 suspects, which would be far too costly for further investigation and tracking.
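The false-positive argument is easy to verify with back-of-the-envelope arithmetic (the population and accuracy figures are the hypothetical ones from the post, not real numbers):

```python
# Hypothetical figures from the post: 1 billion people monitored by a
# detector that is 99.9% accurate on non-terrorists.
population = 1_000_000_000
accuracy = 0.999
false_positive_rate = 1 - accuracy            # 0.1% wrongly flagged

flagged = population * false_positive_rate
print(f"{flagged:,.0f} innocent people flagged")  # about 1,000,000
```

Even a seemingly tiny error rate overwhelms investigators when the base rate of the event is near zero.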

After reading it, I just have several questions in mind:
  1. Is inductive data mining dead for this case?
  2. Is deductive analysis more suitable for this case?
  3. How do we detect novelty? What is an "anomaly"?
  4. Terrorists will change their strategy for the next attack. Is it possible to detect that?

Instead of lamenting the limitations of data mining, I would cheer for the "death" of statistical data mining. Machine learning finally has to rethink its correct direction.

Monday, December 18, 2006

Data mining to detect perpetrators and accomplices

One article seems interesting. In it, the researchers tried to use data mining techniques to detect perpetrators and accomplices in online auctions. It seems an interesting problem, but the data must be really difficult to obtain, I guess; such data always involves lots of privacy issues.


Carnegie Mellon Researchers Uncover Online Auction Fraud; Data Mining Software Fingers Both Perpetrators and Accomplices

PITTSBURGH, Dec. 5 (AScribe Newswire) -- Computer scientists at Carnegie Mellon University are using data mining techniques to identify perpetrators of fraud among online auction users as well as their otherwise unknown accomplices.

The new method analyzes publicly available histories of transactions posted by online auction sites such as eBay and identifies suspicious online behaviors and dubious associations among users.

Online auction sites are immensely popular. The largest, eBay, reported third quarter revenues of $1.449 billion, up 31 percent from the previous year, and registered 212 million users, up 26 percent. But the popularity of online auction sites also makes them a target for crooks. Internet auction fraud, such as failure to deliver goods after a sale, accounted for almost two-thirds of the 97,000 complaints referred to law enforcement agencies last year by the federal Internet Crime Complaint Center.

Perpetrators of these frauds have distinctive online behaviors that cause them to be readily purged from an online auction site, said Computer Science Professor Christos Faloutsos. The software developed by his research team - Network Detection via Propagation of Beliefs, or NetProbe - could prevent future frauds by identifying their accomplices, who can lurk on a site indefinitely and enable new generations of fraudsters.

In a test analysis of about one million transactions between almost 66,000 eBay users, NetProbe correctly detected 10 previously identified perpetrators, as well as more than a dozen probable fraudsters and several dozen apparent accomplices.

"To the best of our knowledge, this is the first work that uses a systematic approach to analyze and detect electronic auction frauds," said Faloutsos, who noted that NetProbe could eventually be useful for both law enforcement and security personnel of online sites.

The researchers have already adapted the software to provide a trustworthiness score for individual user IDs. Though not yet available to the public, the NetProbe score would complement user reputation scores that many auction sites already provide to help prevent fraud.

"We want to help people detect potential fraud before the fraud occurs," said research associate Duen Horng "Polo" Chau, who developed NetProbe with Faloutsos, undergraduate student Samuel Wang and graduate student Shashank Pandit.

Many auction sites try to avert fraud with so-called reputation systems. In eBay's case, buyers can report whether they had a positive, neutral or negative experience in a transaction, and that report is then translated into a feedback score for that seller.

Unfortunately, a crook can manipulate these feedback scores, obtaining a favorable score by engaging in a number of legitimate sales. But that is costly and time-consuming and, once the fraudster starts cheating buyers, that user identification is quickly red-flagged and shut down.

Perpetrating frauds may be sustainable, however, if a fraudster has accomplices or sets up separate user IDs to serve as accomplices. These accomplice accounts conduct legitimate transactions and maintain good reputations. They also have many transactions with the user IDs of fraudsters, using their good reputations to boost the fraudsters' feedback scores. Because accomplices don't perpetrate frauds, they usually escape notice and can keep working to establish new fraudster accounts, Faloutsos said.

But an unnatural pattern becomes evident when the transactions are plotted as a graph, with each user represented as a node, or dot, and transactions between individual users represented by lines connecting the nodes.

In the resulting graph, transactions between accomplices and fraudsters create a pattern that sticks out like "a guiding light," Chau said. Graph theorists call this pattern a "bipartite core" - members of one group have lots of transactions with members of a second group, but don't have transactions with members of their own group. One group, the accomplices, also deals with honest eBay users, but most of the transactions are with fraudster groups.

The researchers tested their method, in part, by accumulating transaction histories from eBay and demonstrating that they could detect the distinctive fraud patterns within these massive data sets. Chau reported on an analysis involving about 100 eBay users at a September data mining conference in Berlin. The team has since analyzed about a million transactions between almost 66,000 eBay users, and those as-yet unpublished findings have been submitted for presentation at an upcoming scientific conference.

"Crooks are extremely ingenious," Faloutsos warned, so identifying accomplices would not eliminate all online auction fraud. But eliminating accomplices would force crooks to resort to more sophisticated, complex schemes. "These schemes will require more effort and cost, so fraud would be increasingly unprofitable," he added.
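The bipartite-core pattern the article describes can be illustrated on a toy transaction graph (all user IDs below are invented; NetProbe itself uses belief propagation, while this is only a crude structural check for intuition):

```python
from collections import defaultdict

# Toy transaction graph (invented IDs): fraudsters trade almost
# exclusively with accomplices and never with each other.
edges = [
    ("fraud1", "acc1"), ("fraud1", "acc2"),
    ("fraud2", "acc1"), ("fraud2", "acc2"),
    ("acc1", "honest1"), ("acc2", "honest2"),
    ("honest1", "honest2"),
]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Bipartite-core signal: members of group A (fraudsters) connect only
# to group B (accomplices), with no edges inside A.
A = {"fraud1", "fraud2"}
no_internal_A = all(adj[u].isdisjoint(A) for u in A)
print(no_internal_A)  # True: fraudsters never deal with each other
```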

Domain adaptation & learning from multiple sources

It seems that different people describe transfer learning in different terms:
multitask learning, domain adaptation, learning from multiple sources, and sample selection bias.

Here, I will try to distinguish these terms and discuss the differences.

Multitask learning (MTL) has been studied a lot. Usually, the objective is to learn a model for each task such that the overall performance is optimized; the models share some commonality.
A very strong limitation of this approach is that
overall performance != performance on one specific task.

In my opinion, MTL actually solves the learning problem in an indirect way. There are lots of issues involved: how to find similar tasks, which tasks to trust, how much to trust each task, and what the trade-off is between data and tasks. All these problems require additional knowledge about the domain or the data, which makes MTL rather limited. Actually, MTL only works when very little data is available for each task; if there is no training data at all for a task, MTL cannot work.

Domain adaptation is more like a transfer learning setting. It can be considered as involving two tasks: a source domain (support task) and a target domain (target task). The term was coined (as far as I can tell) in the NLP community. The goal is exactly the same as transfer learning: to improve the performance on the target task.

Learning from multiple sources can be a little tricky. There are actually two interpretations. One assumes that all sources are drawn from the same underlying distribution but with varying levels of noise or uncertainty (if additional information, such as a bound on the noise, is available, it can be exploited). The other assumes that different sources have different, but related, underlying distributions. Thus, the former is more like data selection (how to learn ONE model from noisy data), while the latter is exactly like multitask learning (develop one model per task/source).

Finally, transfer learning can also be connected to sample selection bias, as I mentioned in the last post. However, sample selection bias always deals with the case where only one biased sample is available and the goal is to obtain an unbiased model. This situation is much like domain adaptation, so methods from one field can be applied effectively in the other.
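One standard bridge between the two fields is importance weighting under covariate shift: if source and target share P(y|x) but differ in P(x), weighting source examples by the density ratio corrects the bias. A minimal sketch on synthetic data (the densities are known by construction here; in practice they would have to be estimated from unlabeled samples):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic covariate shift: source and target share y = 2x + noise,
# but the target inputs are centered at 1.5 instead of 0.
x_src = rng.normal(0.0, 1.0, 1000)
y_src = 2.0 * x_src + rng.normal(0.0, 0.1, 1000)

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_target(x) / p_source(x)
w = gauss(x_src, 1.5, 1.0) / gauss(x_src, 0.0, 1.0)

# Weighted least squares recovers the shared relationship y ~ 2x
slope = np.sum(w * x_src * y_src) / np.sum(w * x_src * x_src)
```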





Friday, December 15, 2006

Sample selection bias = Multitask learning?

I just browsed one interesting paper:
Dirichlet enhanced spam filtering based on biased samples, which poses a question to me:

Is sample selection bias = multitask learning?

As in this paper, each user's spam filtering can be considered as one task.

However, the paper makes two very "strong" assumptions:

1. The sample selection bias between publicly available data and personalized data is 0-1 consistent. That is, if P(x|public) != 0 then P(x|personal) != 0;

2. P(y|x, public) = P(y|x, personal).

The 1st assumption is OK, since most sample selection bias work adopts it. (Otherwise, I think there is no way to infer something you have no chance to observe.)

The 2nd assumption is way too UNACCEPTABLE to me. Given a message, some users might treat it as ham while others treat it as spam. How could the two be equal?!
(Is it just because it makes the analysis easy?)

In my opinion, MTL cannot be considered a feature bias (in terms of sample selection bias), nor a label bias. A more general model should be a complete bias; that is, P(s=1|x,y) cannot be decomposed into any simpler form.

Actually, I am wondering whether it is possible to learn P(s=1|x,y) (as in the paper I mentioned) if biased and unbiased samples are both provided. OK, let's start with the ideal case. Then a logistic regression learner can be adopted to estimate the bias. The trick here is to treat (x, y) as the input features and s=1 or 0 as the output.
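That trick can be sketched in a few lines on synthetic data (everything below is invented for illustration; the logistic regression is a plain gradient-descent version rather than any particular library's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: the biased sample (s=1) over-represents
# inputs with large x, mimicking selection bias.
X_b = rng.normal(1.0, 1.0, size=(500, 2))   # biased sample,   s = 1
X_u = rng.normal(0.0, 1.0, size=(500, 2))   # unbiased sample, s = 0
y_b = (X_b[:, 0] > 0).astype(float)
y_u = (X_u[:, 0] > 0).astype(float)

# The trick: treat (x, y) as the input features and the selection
# indicator s as the label of a logistic regression problem.
Z = np.column_stack([np.vstack([X_b, X_u]),
                     np.concatenate([y_b, y_u])])
s = np.concatenate([np.ones(len(X_b)), np.zeros(len(X_u))])

Zb = np.column_stack([Z, np.ones(len(Z))])  # append a bias column
w = np.zeros(Zb.shape[1])
for _ in range(2000):                       # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-Zb @ w))
    w -= 0.1 * Zb.T @ (p - s) / len(s)

p_select = 1.0 / (1.0 + np.exp(-Zb @ w))    # estimated P(s=1 | x, y)
```

As expected, the estimated selection probability comes out higher on the biased sample than on the unbiased one.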

In MTL, only very little labeled data is provided for the unbiased setting. Can we estimate the density reliably? The dilemma is that we need "enough" data to improve the bias density estimation, but if we had "enough" data, we could already derive a very good model.

Any other possible way to enhance the estimation?

This paper adopts a Dirichlet process to improve the estimation. But why does it work? I couldn't figure it out.



Thursday, December 14, 2006

Finally done with course work

Finally finished all the finals and TA work.

Since this is the end of the semester, I would like to make a summary of the work I have done.

OK, this year I took 2 courses. One was about numerical linear algebra. I think this class was pretty good; I am finally no longer scared away by all that matrix and SVD stuff, though unfortunately I didn't spend much time on it. The projects in this class were pretty cool, especially the Arnoldi process. I didn't expect such a large sparse matrix could be handled in only a few iterations.
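The Arnoldi process mentioned above fits in a few lines (a textbook sketch, not the course project code): k iterations build an orthonormal Krylov basis Q and a small upper Hessenberg matrix H whose eigenvalues already approximate extreme eigenvalues of a large matrix A.

```python
import numpy as np

def arnoldi(A, b, k):
    """k steps of Arnoldi: returns Q (n x k+1, orthonormal columns)
    and upper Hessenberg H (k+1 x k) with A @ Q[:, :k] = Q @ H."""
    n = len(b)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        v = A @ Q[:, j]
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ v
            v = v - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        Q[:, j + 1] = v / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))           # stand-in for a large matrix
b = rng.normal(size=200)
Q, H = arnoldi(A, b, 10)                  # only 10 matrix-vector products
```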

As for the SVD, LU decomposition, and QR decomposition (finally I got a chance to show off in this blog :), I don't think I will use them much in future research work. But combined with convex optimization, they should help.

Well... The 2nd course: distributed operating systems. The instructor is a good talker and explains everything very well. Unfortunately, I am an eager learner (not a kNN-style "lazy learner"). He speaks so slowly that I couldn't help falling asleep. Is that my fault?

Anyway, it was good to learn the concepts: lock/unlock, synchronization, multithreading, PVM, message systems, shared memory, net messenger servers, RPC, lock servers, distributed shared memory, and distributed mutual exclusion/snapshots. All the terms sound very cool... I did really badly on the final. But who cares? (PhDs do not care about course work. That has become my motto now. hehe...)

Finally, let's talk about the TA work. I am working as a TA for Subbarao Kambhampati.
To be frank, this is the toughest TA work I've ever had: 4 homeworks (the last one had around 50 questions, which really drove me crazy!!!!) and 5 projects. I was grading either homework or a project nearly every weekend. So much stuff.

But from another perspective, I've never understood AI concepts so clearly. Rao connected agent design, search, planning, MDPs, logic, Bayesian networks, and learning so well that I have to admit I learned much more from doing this TA work than from taking a course. Actually, I don't think I learned anything really exciting when I took Dr. Liu's AI class; Rao did a much better job. But I don't like his projects: no challenge, though lots of students spent way too much time on Lisp. Anyway, I am thankful for this experience. (I guess this is due to my gf. She changed me a lot!)

But...... I have to work as a TA again next semester, for both AI and data mining. I want to kill someone.....






Some interesting papers from NIPS 2006

NIPS 2006 has just released its online proceedings:
http://books.nips.cc/nips19.html

Here are some interesting papers:


------------------------------------------------------------
Dirichlet-Enhanced Spam Filtering based on Biased Samples
Steffen Bickel, Tobias Scheffer [ps.gz][pdf][bibtex]

Denoising and Dimension Reduction in Feature Space
Mikio Braun, Joachim Buhmann, Klaus-Robert Müller [ps.gz][pdf][bibtex]


Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Gavin Cawley, Nicola Talbot, Mark Girolami [ps.gz][pdf][bibtex]

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers [ps.gz][pdf][bibtex]

Learning from Multiple Sources
Koby Crammer, Michael Kearns, Jennifer Wortman [ps.gz][pdf][bibtex]

Optimal Single-Class Classification Strategies
Ran El-Yaniv, Mordechai Nisenson [ps.gz][pdf][bibtex]

Clustering Under Prior Knowledge with Application to Image Segmentation
Mario Figueiredo, Dong Seon Cheng, Vittorio Murino [ps.gz][pdf][bibtex]

Data Integration for Classification Problems Employing Gaussian Process Priors
Mark Girolami, Mingjun Zhong [ps.gz][pdf][bibtex][zip]


Correcting Sample Selection Bias by Unlabeled Data
Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf [ps.gz][pdf][bibtex]

Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
Matthias Seeger [ps.gz][pdf][bibtex]





---------------------------------------------------------------------
Interesting:
Image Retrieval and Classification Using Local Distance Functions
Andrea Frome, Yoram Singer, Jitendra Malik [ps.gz][pdf][bibtex][tgz]

Branch and Bound for Semi-Supervised Support Vector Machines
Olivier Chapelle, Vikas Sindhwani, Sathiya Keerthi [ps.gz][pdf][bibtex]


Max-margin classification of incomplete data
Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller [ps.gz][pdf][bibtex]

Bayesian Ensemble Learning
Hugh Chipman, Edward George, Robert McCulloch [ps.gz][pdf][bibtex]


Map-Reduce for Machine Learning on Multicore
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, Kunle Olukotun [ps.gz][pdf][bibtex]


Hierarchical Dirichlet Processes with Random Effects
Seyoung Kim, Padhraic Smyth [ps.gz][pdf][bibtex]

Ordinal Regression by Extended Binary Classification
Ling Li, Hsuan-Tien Lin [ps.gz][pdf][bibtex]

Commencement!!

OK. This is the 3rd time I have tried to create a blog.

I wonder why I can never keep up blog posting for long. Who knows?

I hope this time I can make it work :)