Tuesday, December 1, 2009

matlab sparse matrix format

I have encountered several cases where I needed to handle MATLAB sparse matrices in MEX C or C++.

Here, I use a simple example to explain how a sparse matrix is stored in MATLAB. (The format is quite clever, though not always convenient for my usage. Hence, it always helps to understand the structure.)

The sparse matrix format in MATLAB involves several components:
nz: the number of nonzero entries
m: the number of rows
n: the number of columns

jc: the column index array
ir: the row index array
pr: the nonzero entries, stored in a double vector

The tricky part is the relation between jc, ir and pr.

For example, consider a 7-by-3 sparse mxArray named Sparrow containing six nonzero elements, created by typing:

Sparrow = zeros(7,3);
Sparrow(2,1) = 1;
Sparrow(5,1) = 1;
Sparrow(3,2) = 1;
Sparrow(2,3) = 2;
Sparrow(5,3) = 1;
Sparrow(6,3) = 1;
Sparrow = sparse(Sparrow);
The full matrix looks like this:

>> full(Sparrow)

ans =

0 0 0
1 0 2
0 1 0
0 0 0
1 0 1
0 0 1
0 0 0

The contents of the ir, jc, and pr arrays are listed in the following table:

Subscript  ir  pr  jc  Comment
(2,1)      1   1   0   Column 1 contains two nonzero elements, with rows designated by ir[0] and ir[1]
(5,1)      4   1   2   Column 2 contains one nonzero element, with row designated by ir[2]
(3,2)      2   1   3   Column 3 contains three nonzero elements, with rows designated by ir[3], ir[4], and ir[5]
(2,3)      1   2   6   There are six nonzero elements in all
(5,3)      4   1
(6,3)      5   1


If the jth column of the sparse mxArray has any nonzero elements:

  • jc[j] is the index in ir, pr, and pi (if it exists) of the first nonzero element in the jth column.

  • jc[j+1]-1 is the index of the last nonzero element in the jth column.

  • For the jth column of the sparse matrix, jc[j] is the total number of nonzero elements in all preceding columns.

The number of nonzero elements in the jth column of the sparse mxArray is:

jc[j+1] - jc[j];

Note that the size of jc is n+1, while the size of ir is nz, the same as pr.

Hence, to iterate over the sparse matrix in C, you can use the following code:

/* In a MEX file, obtain the arrays via mxGetJc, mxGetIr, and mxGetPr */
mwIndex col, i;
for (col = 0; col < n; col++) {
    mwIndex startIndex = jc[col];
    mwIndex endIndex = jc[col + 1];
    for (i = startIndex; i < endIndex; i++) {
        mwIndex row = ir[i];  /* 0-based row index */
        double val = pr[i];   /* nonzero value */
        /* ... process entry (row, col) ... */
    }
}

Monday, November 30, 2009

ICDM Trip Soon

This Sunday, I will be out of town for the ICDM conference in Miami, FL.

I will present my paper:

Uncovering Groups via Heterogeneous Interaction Analysis.


This work addresses the community detection problem when multiple types of interactions are present among the same set of users.

A more general case is when users are registered at different social media sites. Can we somehow uncover the hidden community structure?

We show that an integration based on structural features is more robust. For evaluation, I proposed a simple cross-dimension network validation scheme, similar to cross-validation. This could be used as a simple rule for evaluation in the future.

Of course, there are many interesting directions to pursue in the future. One important aspect is that some of the dimensions are noisy. Is it possible to identify them? Is this the same as tensor decomposition?

Wednesday, November 18, 2009

about collective behavior

Recently, I submitted a magazine article about collective behavior to IEEE Intelligent Systems, based on my recent work published this year:
Relational Learning via Latent Social Dimensions  KDD-2009
Scalable Learning of Collective Behavior based on Sparse Social Dimensions CIKM-2009.

So what is collective behavior? I define collective behavior as the behaviors of individuals who are exposed to a social network environment.

I found that different areas have quite different definitions. 
(1) According to Wikipedia (mostly written by sociologists), "The term 'collective behavior' was first used by Robert E. Park, and employed definitively by Herbert Blumer, to refer to social processes and events which do not reflect existing social structure (laws, conventions, and institutions), but which emerge in a 'spontaneous' way." Collective behavior can be divided into four categories:
1. The crowd
2. The public
3. The mass
4. The social movement
It is subtle to differentiate these four terms based on their names alone.
(2) On the other hand, in other fields such as artificial life (see, e.g., the chapter in Swarm Intelligence in Data Mining), collective behavior refers to a group of agents that can be treated as a single entity, as commonly observed in bird flocks, ant colonies, and other animal groups. There are several principles: homogeneity (all agents follow the same behavior model), locality (each agent is influenced only by its neighbors), collision avoidance (avoid nearby flockmates), velocity matching, and flock centering. Two popular methods are introduced in the chapter: particle swarm optimization and ant colony optimization. Quite interesting ideas :)
(3) In data mining field, there is one paper talking about learning from collective behavior.
The setup is oracle-like: suppose you have the luxury of observing the interactions and corresponding actions of a group of people, where each person follows a fixed policy. How can we learn the policies and strategies so that when new situations arrive, we can predict the collective behavior? The authors provide some theoretical bounds on learnability. Unfortunately, everything is synthetic, and it is really difficult for me to figure out a proper scenario in which their setup might hold.
The nice part of this work is that at least two kinds of policies are studied:
  • one is to mimic your friends
  • the other is to differentiate yourself from your neighbors
While the first one is well known, the second strategy, which I believe exists in the real world, has not been studied well in data mining or SNA. One way of achieving this is to connect nodes that are two hops away and separate them from 1-hop neighbors, as suggested in the work on using ghost edges for sparsely labeled networks. But when everything is mixed, more work needs to be done.
Another key difference of this work from my article is that my article talks about "spatial" prediction of collective behavior (given some observations, predict the others within the same network), while this one talks about "temporal" prediction.
I think it is a great idea to combine these two aspects together.

(4) Adaptive networks and behavior. Concerning social networks and collective behavior, two directions are converging: one studies the dynamics of networks, the other studies the dynamics on networks. In reality, these two factors evolve simultaneously: social networks evolve, and so does collective behavior. How can we capture both factors in the modeling?

This also relates to social influence and social selection, and several papers address this issue:
Modeling the co-evolution of network and behavior
Nonequilibrium phase transition in the coevolution of networks and opinions
Feedback effects between similarity and social influence in online communities

(5) Collective attention investigates how a news item or a resource attracts users' attention. Do networks come into play? This can also be a further direction for investigation.

Anyway, I feel that this direction has many more issues to address, along with many challenges in problem formulation, data collection, and evaluation. More in the near future :)
Just check my homepage


Monday, November 16, 2009

Change default paper size in tetex in Linux

Run command in sudo mode:

texconfig
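Besides the interactive menu, texconfig also accepts subcommands; assuming the teTeX version at hand supports it (the exact subcommand set may vary by distribution), the paper size can be set directly:

```shell
# Interactive menu (choose PAPER, then the size):
sudo texconfig

# Non-interactive form, if supported by your teTeX install:
sudo texconfig paper letter    # or: sudo texconfig paper a4
```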

Tips in research

Writing is like an hourglass.

Start big, narrow down to the specifics at the neck, then end with the big picture.

Use conference deadlines to make yourself efficient.

Start from bad writing then improve!


Monday, October 26, 2009

www abstract due today

Today is the abstract deadline. I have uploaded the abstract.

As I'm leaving on the 29th, I need to finish the paper in 3 days!!!

This time, we did some interesting work rather than conventional data mining (which mostly focuses on cross-validation results and is not interesting enough).

Hopefully we can make it!

Thursday, October 22, 2009

An interesting reading

I came across this interesting paper by the statistician Leo Breiman discussing the difference between statistics and machine learning. Basically, there are two schools: one adopts the data model, trying to understand the natural process (widely studied by statisticians); the other adopts an algorithmic model. The latter does not care about the process inside the black box, but just tries to approximate it using whatever methods are effective, such as neural networks and decision trees (this is the philosophy of the machine learning camp).


Some of the discussions are very incisive. Leo argued that the overemphasis on data models might lead to wrong conclusions, as there could be many different but comparable models achieving the same performance. Taking a machine learning approach seems more practical.



I am more interested in reading a paper by a machine learning guy talking about statistics. 


Especially, for a newbie to work on machine learning, should he go to the statistics department or computer science department?

Wednesday, October 14, 2009

refreshing by writing papers

I'm not sure whether this is a good sign for researchers: I feel excited when I write papers. This has been confirmed many times.

Wednesday, October 7, 2009

starting job search

After submitting all the journal papers relevant to my past work, I'm now ready to look for jobs.

Go Go Go!

Hope I can find an ideal job :)

Sunday, September 27, 2009

US visa still pending...

As I am planning to go to HK to attend CIKM'09, I went to Mexico to renew my US visa two weeks ago. As expected, I got checked. But fortunately, I "slipped" back into the US.

Until today, I still didn't get any update from the consulate about the visa.

Hope I will receive the good news next week!

Friday, September 25, 2009

determine number of clusters?

At present, many community detection methods have been proposed. "Unfortunately", most of these methods require users to specify the number of clusters, which seems insane at first glimpse.

Imagine a conversation like this:

Person: powerful machine, can you tell me something about the data set I have?
Machine: Honored user! Sure, but please tell me how many clusters there are.
Person: I do not know the data set. That's why I am asking you!
Machine: Sorry, I cannot tell you anything unless you tell me something.
Person: ....... (stupid machine!)
Machine: ...... (this user is an idiot about data mining!)

So when an algorithm claims to determine the number of clusters automatically, everyone just embraces it. For instance, modularity is claimed to find the optimal number of clusters, as do the Bayesian approaches that enforce a Chinese restaurant process as a prior. These methods, as I understand them, amount to setting a threshold for the clustering process or an objective to optimize. They basically tell the user that, under this objective function, this number of clusters seems reasonable. Instead of setting the number of clusters, these algorithms implicitly set a parameter or a biased objective function to optimize.

But do we really need to determine the number of clusters?

If it is just for data exploration, I think a hierarchical organization of data objects is more reasonable. But the current state of hierarchical clustering seems far from satisfactory. Most methods just give you a binary tree, which does not reflect any interesting structure. Another disadvantage is the high sensitivity of the hierarchical structure to the algorithm implementation and the data processing order.

Instead of begging machines to determine the number of communities, organizing clusters at multiple resolutions is more interesting and more consistent with what users actually ask for.

Wednesday, August 19, 2009

IJCAI post-conference report

This year, thanks to the support from IJCAI conference, I can travel to Pasadena to attend the IJCAI conference. This is the first time I attended IJCAI (I attended AAAI in 2005). AI motivated me to pursue my Ph.D study in the first place. I was quite excited to attend this top conference on AI.

Overall, I liked the invited talks and demos most. AI is now separated into so many subfields (e.g., machine learning and data mining) that each field rarely talks much about intelligence. So the high-quality invited talks at IJCAI really gave me a fresh view of AI from different perspectives. However, the papers at the conference, I have to say, were not very interesting. What is even worse, it is difficult to grasp any idea if you are not working directly in the field a paper addresses. While IJCAI aims to encourage collaboration and cross-discipline conversation among AI researchers, the barrier is higher than ever before. This makes me worry about the quality of the IJCAI conference. Attracting high-quality work to IJCAI and breaking the boundaries between different areas will require a lot of effort, or even a revolutionary reform. Maybe it would be more reasonable to host IJCAI as a symposium consisting of high-quality talks and pioneering work, rather than yet another conference publishing delta papers.

The conference provided two tours, to APL and USC. I think these tours were great, and such activities should be kept in following conferences, as they provide a great opportunity to explore classical AI projects. It was a little sad that some attendees reserved seats but did not show up, while some others who wished to join the tour could not because they had no ticket. I hope a ticket exchange at the registration desk will be allowed to accommodate more people at the next IJCAI.

Another point concerns the organization of IJCAI. The sessions were well organized, but the catering was really poor. I attended the student reception, and the food there was not attractive at all; some other attendees were complaining about it too. Then again, maybe the organizers spent most of the funding on travel scholarships for students; in that case, I'll take my words back, given the bad economic situation.

I would like to thank IJCAI for providing me this great opportunity to make friends with other AI researchers, to meet those AI "stars", and to re-cherish the AI dreams I used to have. I hope this conference becomes better and inspires more AI communication and exploration.

Wednesday, August 5, 2009

what's your chance of acceptance if you submit late?

Just came across this question:

For the same paper, which would have a higher chance of being accepted at a conference?

Submit early? normal? or submit late?


Basically, does a larger submission number give you a higher chance of being accepted?

I believe so.

Actually, as both a reviewer and an author, I have seen many people rushing their papers for a conference, especially in CS. So a large submission number automatically hints that "the paper was rushed", and the reviewer may have lower expectations. A good-quality paper then stands out among its local neighbors and is more likely to be accepted.

Of course, this is just a small trick and no evidence can be found.

A simple way to avoid this bias is to shuffle the papers when assigning them to reviewers, so that a paper's submission number has no correlation with its perceived quality. That seems a fairer process.

Just my 2 cents.

Monday, July 27, 2009

A special issue of network analysis on Science

http://www.sciencemag.org/cgi/content/short/325/5939/405

Quite interesting special issue.

"It is not enough to look at patterns; we need to study how they evolve and change."

Saturday, June 27, 2009

Recent Update

Finished conference papers for ICDM and CIKM. (A paper on understanding groups is still in progress, probably to be submitted to WSDM.)


Need to come back to the journal papers and thesis proposal now.

Thursday, May 28, 2009

recent conferences to go

I'm working on papers to submit to CIKM or ICDE.

I just checked both conferences, one requires 10 pages, and the other requires 12 pages.

The ridiculous thing is that once your paper is accepted as a short paper, you have to reduce it to 2 pages for CIKM and 4 pages for ICDE.

Saturday, May 16, 2009

some notes on professional writing

1. Use punctuation in equations.
2. Avoid starting a sentence with a mathematical expression; separate symbols by punctuation marks or words.
3. "The" is inappropriate when the object referred to is not unique or does not exist.
4. Write "the kth term", not "the k-th term".
5. There are two types of ellipsis: vertically centered and ground level. The centered one indicates a product, as in A_1...A_n; the ground-level one goes between a list of symbols, as in i = 1, 2, ..., n.
6. Several commonly used abbreviations: e.g., i.e., et al.
7. Avoid imprecise adjectives and adverbs such as very, rather, quite, nice, and interesting. The same goes for "essentially".
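For the ellipsis rule, LaTeX provides both forms directly, \cdots (vertically centered) and \ldots (ground level); a minimal illustration:

```latex
% Centered dots, e.g., in a product:
$A_1 \cdots A_n$
% Ground-level dots between a list of symbols:
$i = 1, 2, \ldots, n$
```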

Wednesday, April 22, 2009

what's the point of 30% new material?

I decided to convert my KDD08 paper into a journal version. Unfortunately, it is said that I have to add at least 30% new material.

The sad thing is that I wrote too much for the conference paper, since I wanted to make it solid and comprehensive.

But it turns out I have nothing left to add for the journal version. (Of course, you can always add something new, but the cost is way too high, almost another journal paper.)

So here is the lesson I learned:
Don't write too much in one conference paper.

It's really funny and ridiculous.