Tuesday, February 13, 2007

No hope for this Google interview

Just finished the Google interview. I believe this is the worst one I've ever done.
It was so bad. The interviewer is Chinese, but we still talked in English.

The first question asked me to write a SQL query. I haven't touched a database for a long time, so after 10 minutes I still couldn't come up with a good answer.

Second question: how do you implement merge sort? Write the pseudocode. I asked him whether or not I could use recursion. He asked back: without recursion, how would you do that? (Stupid. Why not just give the answer with recursion?) Then: what are the time complexity and space complexity? I knew the time complexity is O(n lg n), but I was not sure about the space complexity. I finally figured out that the correct answer should be O(n). I guess he had already crossed me off his candidate list based on my performance.
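For reference, here is a minimal Python sketch of merge sort, both the recursive version and a bottom-up version that needs no recursion. The function names are my own, not anything from the interview. Both take O(n lg n) time and O(n) extra space for the merged output.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list (stable)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # At most one of these two slices is non-empty.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(items):
    """Recursive merge sort: split in half, sort each half, merge."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


def merge_sort_bottom_up(items):
    """Merge sort without recursion: merge neighbouring runs pairwise
    until only one run is left."""
    runs = [[x] for x in items]
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge(runs[k], runs[k + 1]))
            else:
                merged.append(runs[k])  # odd run out, carried to next pass
        runs = merged
    return runs[0] if runs else []
```

So the "without recursion" version does exist: start from runs of length 1 and double the run length on every pass.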

Third question: given a 10 GB array and 2 GB of RAM, how do you sort it? I knew this should be a merge sort, but how much data should you load into memory each time?
I answered that you can split an array of size n into 2 parts, 3 parts, or even more, and then combine them. With 2 parts you need a working memory of n/2; with 3 parts, n/3. But what is the optimal number? No idea.
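Looking back, the usual answer is an external merge sort: read chunks that fit in memory, sort each chunk, write it to disk as a sorted run, and then do one k-way merge that streams all the runs at once. The chunks are typically made about as large as memory allows, so 10 GB of data with 2 GB of RAM means at least five runs (more in practice, since not all RAM is usable). The Python sketch below is only an illustration under simple assumptions: the data are integers, runs are stored as text files with one number per line, and `chunk_size` (a name I made up) stands in for "how many items fit in memory".

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(sorted_chunk):
    """Write one sorted run to a temporary file, one integer per line."""
    tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    with tmp:
        tmp.writelines(f"{x}\n" for x in sorted_chunk)
    return tmp.name


def _read_run(path):
    """Stream a run back from disk lazily, one integer at a time."""
    with open(path) as run:
        for line in run:
            yield int(line)


def external_sort(values, chunk_size):
    """Yield the integers in `values` in sorted order while holding only
    about `chunk_size` items in memory at a time."""
    it = iter(values)
    run_paths = []
    try:
        # Phase 1: cut the input into sorted runs that each fit in memory.
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            chunk.sort()
            run_paths.append(_write_run(chunk))
        # Phase 2: k-way merge; heapq.merge keeps one item per run in memory.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3))` sorts seven numbers using three runs on disk.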

I was too nervous and could not think any more. And he mentioned several times that we were out of time.

4. This one was about the project: how to solve their problem.

He asked me if I had any questions, but I had already given up, so I just asked him some ordinary questions.

I really need to know the algorithms and data structures book well, but I was too busy to prepare for the interview. Shame on me.

Saturday, February 10, 2007

How to collaborate with others?

I just read John's blog and found it really helpful, so I am just copying and pasting his post here.

The full article can be found via the link below:

Best Practices for Collaboration
Filed under: Papers, Research — jl @ 1:51 pm

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
   1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
   2. Don’t keep the write lock unnecessarily. Some people are perfectionists, so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock (at least) when you sleep is a good idea.
   3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
   4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last Tuesday”. Coupled with “send all necessary materials”, this implies that you naturally back up previous work.
3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
   1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly: a theoretician should be generous about crediting experimental results and vice versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
   2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
   3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cc’ed” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Friday, February 9, 2007

Done with the ICML paper

OK. Payam and I just finished the ICML paper. It's good. I believe it should be accepted unless we are really unlucky.

Anyway, it was a good experience to work with Payam. I believe our cooperation made the task much easier and more efficient.

Hope we can get good news soon.

Need to rush for my KDD paper now :)

Monday, February 5, 2007

How to test whether two data sets come from the same distribution?

Suppose I have two data sets. How can I tell whether they come from the same distribution or from two different ones? And what is the difference between the two data sets?

This is actually a very generic question that arises in machine learning.

Some intuitive ideas:
1. Estimate the density of the data samples for each data set. This method might be very weak. In reality, density estimation is a very difficult task compared with "simple classification". This approach is generally not applicable in practice.

2. Estimate the sufficient statistics of each data set, such as the mean and variance of each feature and the class-conditional distributions. This can be interpreted as analogy in cognition: an analogy can be derived if the relationship between multiple symbols is maintained. The problem is, when can we conclude that the difference is large enough? It seems some hypothesis testing is required.

3. Transformation. A data set can be transformed into another data set. But how do you know the feature mapping? A more reasonable way is to enforce equivalence of the sufficient statistics in a newly generated space.

4. Dimensionality reduction. Assume that the two data sets share the same distribution on a fixed number of dimensions. By projecting the two data sets onto those dimensions, we can probably find some interesting patterns.

5. Learn a classification function and use the decision function to measure the difference. This is closely related to transfer learning.

6. Any other ways?
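
As a small illustration of idea 2, one feature's mean can be compared across the two data sets with Welch's t statistic, which is one standard choice of hypothesis test and not the only one. The sketch below uses only the Python standard library. It returns the statistic and the approximate degrees of freedom, and leaves the p-value lookup (which needs the t distribution) out. The function name is my own.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for the
    difference in means of two samples (each needs at least 2 values,
    and at least one sample must have non-zero variance)."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    s1, s2 = v1 / n1, v2 / n2
    t = (m1 - m2) / math.sqrt(s1 + s2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t, df
```

A large |t| for some feature suggests that the two data sets differ in that feature's mean. It says nothing about differences in higher moments or in the dependence between features.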

OK. To pursue this direction, where can we find suitable data sets?

The kernel Maximum Mean Discrepancy (MMD) approach might be related.
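
To make the kernel Maximum Mean Discrepancy idea concrete, here is a rough pure-Python sketch. It computes the biased estimate of the squared MMD with a Gaussian (RBF) kernel, then uses a permutation test to judge whether the observed value is unusually large. The kernel choice, the bandwidth parameter `gamma`, the number of permutations, and all function names are my own assumptions for illustration.

```python
import math
import random


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_k(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


def permutation_p_value(xs, ys, n_perm=200, gamma=1.0, seed=0):
    """Estimate how often a random relabelling of the pooled data gives
    an MMD at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, gamma)
    pooled = list(xs) + list(ys)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], gamma) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)
```

A small p-value suggests the two samples come from different distributions. The cost is quadratic in the sample size for every permutation, so this form only suits small data sets.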



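1. The most awesome PhD thesis is one that has already been published in the best journal before the defense, and, given that the thesis is very long, ...

2. The most awesome PhD defense is one in which the candidate keeps challenging the committee members, questioning these professors until they ...

3. The most awesome paper submission is one for which the editor cannot find, anywhere in the world, a single anonymous reviewer who can understand the paper, and in the end ...

4. The most awesome paper need not be long; a thousand words or so is enough. Example: de Broglie was a playboy aristocrat, originally ...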



