Tuesday, February 13, 2007

No hope for this Google interview

Just finished the Google interview. I believe this is the worst interview I've ever done.
It went so badly. The interviewer is Chinese, but we still talked in English.

The first question asked me to write a SQL query. I haven't touched a database for a long time, so after 10 minutes I still couldn't come up with a good answer.

Second question: how do you implement merge sort? Write the pseudocode. I asked him whether or not I could use recursion. He asked back: without recursion, how would you do it? (Stupid. Why not just give the answer with recursion?) Then: what are the time complexity and the space complexity? I knew the time complexity is O(n lg n), but I was not sure about the space complexity. I finally figured out that the correct answer should be O(n). I guess he had already kicked me off his candidate list based on my performance.
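
For my own reference, here is a minimal sketch of the recursive version in Python. The function name is my own choice, not something from the interview. The recursion depth is O(lg n) and each level does O(n) work in the merge, which gives the O(n lg n) time; the temporary lists built during merging are where the O(n) extra space comes from.

```python
def merge_sort(items):
    """Return a new sorted list using top-down (recursive) merge sort."""
    if len(items) <= 1:
        return list(items)

    # Split in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # "<=" keeps equal items in their original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```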

Third question: a large array of 10 GB of data and 2 GB of RAM; how do you sort it? I knew this should be a merge sort. But how much data should you load into memory each time?
I answered that you can split an array of size n into 2 parts, 3 parts, or even more, and then combine them. So for 2 parts you need working memory of n/2; for 3 parts, n/3. But what is the optimal number? No idea.
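
Thinking about it afterwards, the usual approach seems to be an external merge sort: cut the input into runs that each fit in RAM, sort every run in memory, write the runs to disk, and then do one k-way merge over all of them. With 10 GB of data and 2 GB of RAM that means at least five runs, and in practice probably more, to leave room for buffers. Below is a rough sketch under my own assumptions: the data is a text file with one integer per line, and `external_sort` and `max_items_in_memory` are names I made up for illustration.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, max_items_in_memory):
    """Sort a text file of integers (one per line) with bounded memory.

    Returns the number of sorted runs that were written to disk.
    """
    run_paths = []
    try:
        # Phase 1: read at most max_items_in_memory values at a time,
        # sort them in memory, and write each sorted run to a temp file.
        with open(input_path) as src:
            while True:
                chunk = [int(line) for line in itertools.islice(src, max_items_in_memory)]
                if not chunk:
                    break
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(f"{value}\n" for value in chunk)
                run_paths.append(path)

        # Phase 2: k-way merge of all runs. heapq.merge is lazy, so it holds
        # roughly one pending value per run instead of whole runs.
        run_files = [open(path) for path in run_paths]
        try:
            streams = [(int(line) for line in f) for f in run_files]
            with open(output_path, "w") as out:
                out.writelines(f"{value}\n" for value in heapq.merge(*streams))
        finally:
            for f in run_files:
                f.close()
    finally:
        # Clean up the temporary run files.
        for path in run_paths:
            os.remove(path)
    return len(run_paths)
```

So the memory needed in the first phase is the run size, and in the merge phase it is about one buffer per run. That suggests picking the run size close to what fits in RAM rather than hunting for a magic number of parts.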

I was too nervous and couldn't think any more. And he mentioned several times that we were running out of time.

Fourth question: it was about the project, and how to solve their problem.

He asked me if I had any questions, but I had already given up, so I just asked him some ordinary questions.

I really need to know the algorithms and data structures book well. But I was too busy to prepare for the interview. Shame on me.

Saturday, February 10, 2007

How to collaborate with others?

I just read John's blog and found it really helpful, so I am just copying and pasting his post here.

The full article can be found via the link below:
http://hunch.net/?p=251

2/10/2007
Best Practices for Collaboration
Filed under: Papers, Research — jl @ 1:51 pm

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
   1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
   2. Don’t keep the write lock unnecessarily. Some people are perfectionists so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock (at least) when you sleep is a good idea.
   3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
   4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last tuesday”. Coupled with “send all necessary materials”, this implies that you naturally backup previous work.
3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
   1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly: a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
   2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
   3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cced” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Friday, February 9, 2007

Done with the ICML paper

OK. Payam and I just finished the ICML paper. It's good. I believe it should be accepted unless we are really unlucky.

Anyway, it was a good experience working with Payam. I believe our cooperation made the task much easier and more efficient.

Hope we can get good news soon.

Need to rush for my KDD paper now:)

Monday, February 5, 2007

How to test whether two data sets are from the same distribution?

Suppose I have two data sets. How can I tell whether they come from two different distributions? And what is the difference between the two data sets?

This is actually a very generic question that arises in machine learning.

Some intuitive ideas:
1. Estimate the density of the data samples for each data set. This method might be very weak. In reality, density estimation is a very difficult task compared with "simple classification". This approach is generally not applicable in practice.

2. Estimate the sufficient statistics of each data set, like the mean and variance of each feature, or the class-conditional distribution. This can be interpreted as analogy in cognition: an analogy can be derived if the relationship between multiple symbols is maintained. The problem is, when can we conclude that the difference is large enough? It seems some hypothesis testing is required.

3. Transformation. A data set can be transformed into another data set. But how do you know the feature mapping? A more reasonable way is to enforce equivalence of sufficient statistics in a newly generated space.

4. Dimensionality reduction. Assume that the two data sets share the same distribution on a fixed number of dimensions. By projecting the two data sets onto those dimensions, we can probably find some interesting patterns.

5. Learn the classification function. Use the decision function to measure the difference. This is closely related to transfer learning.

6. Any other ways?

OK. To pursue this direction, where can we find the data sets?

The kernel maximum mean discrepancy (MMD) approach might be related.
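
To make the idea concrete for myself, here is a small sketch of how I understand it: compute a (biased) estimate of the squared MMD with a Gaussian kernel, then use a permutation test to decide whether the value is "large enough". Everything here is my own illustration, not taken from any particular paper: the function names, the fixed bandwidth of 1.0 and the number of permutations are all assumptions, and the bandwidth in particular would need tuning on real data.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


def permutation_p_value(xs, ys, num_permutations=200, bandwidth=1.0, seed=0):
    """Estimate how often a random relabeling gives an MMD at least as large."""
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    at_least_as_large = 0
    for _ in range(num_permutations):
        # Under the null hypothesis the labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        permuted = mmd_squared(pooled[:len(xs)], pooled[len(xs):], bandwidth)
        if permuted >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_large + 1) / (num_permutations + 1)
```

A small p-value would suggest the two data sets come from different distributions. The double loops make this quadratic in the sample size, so it is only meant for small samples.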

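The most awesome PhD theses! (Reposted)

The most awesome PhD theses! The professors got so nervous and dazed that they thought they were the ones defending

1. The most awesome PhD thesis is one that has already been published in the best journal before the defense, and, because the thesis is so long, the journal has to serialize it like a novel.

Example: Steven Cheung's (张五常) PhD thesis, "The Theory of Share Tenancy", was serialized over four issues of JLE at the time.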

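2. The most awesome thesis defense is one where the candidate keeps challenging the committee members until the professors are so nervous and dazed that they think they are the ones defending.

Example: after Samuelson's thesis defense ended, committee member Schumpeter (one of the greatest economists of the last century) turned to another member, Leontief (a Nobel laureate), and asked: "Wassily, did we pass?"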

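3. The most awesome submission is one for which the editor cannot find, anywhere in the world, an anonymous referee who can understand it, so in the end it can only be published, with no revision needed at all.

Example: the paper Sims published in 1971 in The Annals of Mathematical Statistics, "Distributed Lag Estimation in an Infinite-Dimensional Parameter Space". After finishing it, Sims did not submit it to an economics journal, because he obviously knew nobody there would understand it. So he sent it to the top mathematical statistics journal, where the editor simply could not find a referee. When one was finally scraped together, the referee report read: "I really do not understand what this paper is about, but I checked a few of the theorems and they seem to be correct. So I guess it should be published."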

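4. The most awesome thesis does not need to be long; a thousand words or so is enough. Example: de Broglie was a playboy aristocrat who majored in history as an undergraduate. Later, out of sheer boredom, he spent 5 years on a PhD, and the thesis he finally handed in was a single page, and was even suspected of "plagiarism".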

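The defense committee was so angry that they did not even want to let him defend. His advisor, the famous physicist Langevin, felt he was losing face, since having a student who could not graduate would be a disgrace, so he got Einstein to help plead the case: let the kid pass; his father is the French Minister of the Interior, and we cannot afford to offend him. That "garbage" thesis was later read by Schrödinger, who pondered that one page for a month and then published the Schrödinger equation, one of the most important theories in quantum mechanics; Schrödinger's cat also became the most interesting cat of all.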

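De Broglie won the Nobel Prize in Physics for the ideas set out in this thesis. Building on it, Schrödinger made outstanding contributions to quantum mechanics, leaping from an ordinary, frustrated lecturer to a great scientist, and also won the Nobel Prize in Physics. A one-page PhD thesis that led to two Nobel Prizes in Physics is arguably unprecedented and will probably never be repeated. So the most awesome thesis does not need to be serialized like Steven Cheung's; a single A4 page is enough. But I think de Broglie would have been in trouble had he done his PhD in China: with so few words, the thesis would not even have qualified for a defense.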

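I have to add a few words: de Broglie lost both parents at a young age and was raised single-handedly by his older brother, Duke Maurice (also an outstanding physicist). In the year before his famous 1924 PhD thesis, de Broglie had already published three papers in a row proposing the matter-wave conjecture. As for how many pages the thesis had, I have not verified that.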

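About Schrödinger: he was multitalented, spoke 4 languages, and published a volume of poetry. His 1944 book "What Is Life?" also drew a large number of physicists into biological research, among them Watson and Crick, who later discovered the double helix. So these giants are not necessarily as legendary as people imagine, nor can their success be attributed simply to chance. As the saying goes: the great stay great.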