Wednesday, December 12, 2007

Daily Linux (1) -- FreeNX, remote access, list files

1. Here is a very useful guide to installing the FreeNX server on Ubuntu Gutsy, and it works pretty well! Much, much faster than VNC.

2. To list files matching a certain pattern in a directory, use the following command:

find . -name "*.c"

This basically finds all the C files under the current directory, searching recursively.
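If you need the same listing from inside a script, a rough Python equivalent looks like this; just a minimal sketch, nothing specific to the FreeNX setup above:

import pathlib

# Recursively list all .c files under the current directory,
# mirroring `find . -name "*.c"`.
for path in pathlib.Path(".").rglob("*.c"):
    print(path)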

Tuesday, November 20, 2007

dynamic topic models

Recently I've become pretty interested in topic models, but this one is very difficult to follow.

Instead, I'll read some material on Kalman filtering and wavelets first.

Learning is endless...

Monday, November 19, 2007

NIPS 07 proceedings available

Here are some interesting papers I'm planning to read or browse.
My biggest concern is how effective the work actually is. It seems that most current papers are beautiful in their formulas but cannot even beat the simplest methods.

It's always humans who make the world so complicated.
My belief is: "The world is simple!"


Heterogeneous Component Analysis
Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii

Neural characterization in partially observed populations of spiking neurons
Jonathan Pillow, Peter Latham

Probabilistic Matrix Factorization
Ruslan Salakhutdinov, Andriy Mnih

Hidden Common Cause Relations in Relational Learning
Ricardo Silva, Wei Chu, Zoubin Ghahramani

Hierarchical Penalization
Marie Szafranski, Yves Grandvalet, Pierre Morizet-Mahoudeaux

Learning with Transformation Invariant Kernels
Christian Walder

A Spectral Regularization Framework for Multi-Task Structure Learning
Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, Yiming Ying


Supervised Topic Models
David Blei, Jon McAuliffe

Learning Bounds for Domain Adaptation
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman

Multi-task Gaussian Process Prediction
Edwin Bonilla, Kian Ming Chai, Chris Williams

Automatic Generation of Social Tags for Music Recommendation
Douglas Eck, Paul Lamere, Thierry Bertin-Mahieux, Stephen Green

Kernel Measures of Conditional Dependence
Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf

Monday, August 27, 2007

A funny joke

A wolf passing by a cave saw a rabbit sitting in front of a laptop, typing away furiously. The wolf asked, "Rabbit, what are you doing?" "Writing my thesis!" The wolf thought, "Times have really changed. Even rabbits can write theses now?" "What's the topic?" "Why rabbits are tougher than wolves!" "Are you out of your mind?" "Come into the cave and see for yourself. I've collected plenty of supporting material!" The wolf went into the cave and never came out. Toward dusk, the rabbit packed up the laptop and went back into the cave, where a lion sat picking its teeth. Seeing the rabbit come in, the lion pounded its chest and roared with laughter: "I told you! It doesn't matter what your topic is or what you write. What matters is who your advisor is!"

Wednesday, August 22, 2007

Research is a joke?

Just read the chapter on quasi-Newton methods and came across this paragraph:
"The first quasi-Newton algorithm turned out to be one of the most creative ideas in nonlinear optimization. .... An interesting historical irony is that Davidon's paper was not accepted for publication; it remained as a technical report for more than thirty years until it appeared in the first issue of the SIAM Journal on Optimization in 1991."

It seems that high-quality work is more difficult to publish.

However, computer science still generates thousands of papers each year. How many of them will be remembered after 5 years?

Fast food is everywhere, and that now includes "fast papers". Ridiculous!

Wednesday, July 18, 2007

SIGIR & KDD papers

SIGIR'07


An Interactive Algorithm for Asking and Incorporating Feature Feedback into Support Vector Machines
Hema Raghavan, James Allan

Analyzing Feature Trajectories for Event Detection
Qi He, Kuiyu Chang, Ee-Peng Lim

New Event Detection Based on Indexing-tree and Named Entity
Kuo Zhang, Juanzi Li, Gang Wu

Session 12: Learning to Rank I (Wed 09:00-10:30)

Chair: Bruce Croft, Foyer Room

  • A Support Vector Method for Optimizing Average Precision
    Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims
  • Ranking with Multiple Hyperplanes
    Tie-Yan Liu, Tao Qin, Wei Lai, Xu-Dong Zhang, De-Sheng Wang, Hang Li
  • A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments
    Zhaohui Zheng, Hongyuan Zha, Keke Chen, Gordon Sun

Session 16: Learning to Rank II (Wed 14:30-16:30)

Chair: Keith van Rijsbergen, Grand Ballroom

  • FRank: A Ranking Method with Fidelity Loss
    Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, Wei-Ying Ma
  • AdaRank: A Boosting Algorithm for Information Retrieval
    Jun Xu, Hang Li
  • A Combined Component Approach for Finding Collection-Adapted Ranking Functions based on Genetic Programming
    Humberto Almeida, Marcos Goncalves, Marco Cristo, Pavel Calado
  • Feature Selection for Ranking
    Tie-Yan Liu, Xiubo Geng, Tao Qin

KDD'07

Active Exploration for Learning Rankings from Clickthrough Data - Filip Radlinski and Thorsten Joachims

Co-clustering based Classification for Out-of-domain Documents - Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu

Enhancing Semi-Supervised Clustering: A Feature Projection Perspective - Wei Tang, Hui Xiong, Shi Zhong, and Jie Wu

Evolutionary Spectral Clustering by Incorporating Temporal Smoothness - Yun Chi, Xiaodan Song, Dengyong Zhou, Koji Hino, and Belle Tseng

Mining Statistically Important Equivalence Classes - Jinyan Li, Guimei Liu, and Limsoon Wong

Model-Shared Subspace Boosting for Multi-label Classification - Rong Yan, Jelena Tesic, and John Smith

Support Feature Machine for Classification of Abnormal Brain Activity - W. Art Chaovalitwongse, Ya-Ju Fan, and Rajesh Sachdeo

Sunday, May 27, 2007

ICML 07 proceedings are online now

http://oregonstate.edu/conferences/icml2007/paperlist.html

Here are some interesting papers:

Kernel Selection:

Kernel Selection for Semi-Supervised Kernel Machines

Learning Nonparametric Kernel Matrices from Pairwise Constraints

More Efficiency in Multiple Kernel Learning

Multiclass Multiple Kernel Learning

MTL and Transfer Learning:

Uncovering Shared Structures in Multiclass Classification

Discriminative Learning for Differing Training and Test Distributions

Boosting for Transfer Learning

Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks

Multi-Task Learning for Sequential Data via iHMMs and the Nested Dirichlet Process

Self-taught Learning: Transfer Learning from Unlabeled Data

The Matrix Stick-Breaking Process for Flexible Multi-Task Learning

Asymptotic Bayesian Generalization Error When Training and Test Distributions Are Different

Robust Multi-Task Learning with t-Processes

Relational Learning and Structured Prediction:

Relational Clustering by Symmetric Convex Coding

Fast and Effective Kernels for Relational Learning from Texts

Exponentiated Gradient Algorithms for Log-Linear Structured Prediction

Ranking:

Learning Random Walks to Rank Nodes in Graphs

Feature Selection:

Supervised Feature Selection via Dependence Estimation

Feature Selection in Kernel Space

Minimum Reference Set Based Feature Selection for Small Sample Classifications

Spectral Feature Selection for Supervised and Unsupervised Learning

Monday, May 21, 2007

MSR internship offer

I just received an internship offer from MSR for this summer, but I decided to decline it since it's a more application-oriented position. Maybe staying in school will help me produce more work. That's my hope and belief, anyway.

BTW: I just checked with Martha and found that we can use CPT twice. Anyhow, this is good news for ASU students. I'm not sure whether my complaint made the difference. Again, doing things proactively always makes the future better.

Tuesday, May 8, 2007

Crafting papers in machine learning

Pat is like a tutor.

http://lyonesse.stanford.edu/~langley/papers/craft.ml2k.ps

I just encountered Pat's interesting paper on how to write a machine learning paper. I'm not sure whether he now regrets emphasizing experimental evaluation so strongly in it, but I think most of the general guidelines are still reasonable.

Friday, May 4, 2007

linux trick

I just found that the rdesktop tool on Linux is better for remote access to a Windows system, especially in full-screen mode.

One trick to quit full-screen mode:
Ctrl+Alt+Enter

Tuesday, May 1, 2007

Windows Backup & Norton Ghost

After my system crashed, I had to reinstall Windows on my laptop, which was horrible: it took almost three days to finish the whole installation and configuration process. Having suffered through the reinstall, I decided to try Norton Ghost to back up the system and make my life easier. Unfortunately, Ghost requires a floppy drive to make sure your system is bootable after a crash. More seriously, I have a dual-boot setup on my laptop (Ubuntu 7.04 + Windows XP Pro SP2) with the first partition, the Windows one, formatted as NTFS. So the Windows MBR has actually been modified by Linux, and Ghost cannot find the proper MBR info for Windows. Backing up the Linux system is easy: just compressing everything into one tar file does the job.

So I'm wondering whether I can back up the whole hard disk from within Linux. The problem is that even if I can back up the Windows partition under Linux, I cannot restore it, as Ubuntu 7.04 does not support writing to NTFS. Haha, really frustrated with this backup research work.

It seems that Linux backup is much easier than Windows backup. But I'm still a newbie to Linux. Maybe after some time I can get rid of Windows.

Tuesday, April 3, 2007

Feedback from ICML

"I think the paper is techincally sound, but I think that it is quite narrow in
scope, and am not sure it will be of great general interest. There are only few
and unsurprising conclusions, and moreover the results seem to indicate that
there is an interacation between IN/OUT and the specific FS algorithms used.
Focussing so strongly on toy data, is also a considerable worry. I recommend
rejection. "

It seems that this year is not my "lucky" year. The same goes for my internship applications: I got plenty of interviews (7 of them), but none of them resulted in an offer.

Anyway, just keep going!

Monday, March 26, 2007

Several Future directions

While reckoning over future directions, I think the following problems concerning text categorization might be interesting.

1. Multi-label classification with a large number of categories. Typically, a hierarchy is employed to dissect the problem, thus reducing it to learning with structured output.

2. "Dirty" text categorization. Typical text categorization requires the features to be clean, such as newswire articles, paper abstract etc. However, current fashion extends to "dirty" texts, such as notes(spell errors), prescription(lots of abbreviations) customer telephone log (usually with noisy, contradicting facts). Another example is email spam filtering. Currently, most of spams consists of images rather than just text. However, existing OCR techniques can not extract the characters very correctly. Hence, the final words/terms obtained might not a "proper" feature. Hence, some techniques are required to transform a "image-derived" word into a word in the dictionary. Such kind of transformation can be done via some algorithm like shortest-path algorithm. However, when the spammer add noise in purpose in the text within images, this problem seems more complicated. Is it possible to automatically learn feature similarity? How to extract useful similarities measure between these noisy vectors? How to derive a useful kernel? So, this problem actually is related to feature extraction, kernel learning, and robust learning and uncertainty.

3. Event detection and concept drift. I believe this kind of direction is more promising. The difficulty lies mainly in the lack of benchmark data sets, but with the development of Web 2.0, this kind of problem should gain attention in the future.

4. The ambiguous-label problem. I really doubt that the small-sample setting actually exists in text classification; labeling some documents seems to require very little human labor. Some websites already provide schemes for users to tag web pages and blog posts, yet how to effectively employ the tag information seems to be missing from current work. When I tried delicious, only some keyword matching was performed. How can we organize text in a more sensible way?

5. "Universal" text classification. As so many benchmark data sets are online, can we any how use all of them. This might be related to transfer learning. At least, the benchmark data can serve to provide a common prior for the target classification task. But can we extract more? Human beings can immediately classify the documents given very few examples. Existing transfer learning (most actually are doing MTL), in nature, is doing dimensionality reduction. How to related the features of different domains? Is it possible to extract the "structural" information? Zhang Tong's work talks about that, but it actually focus more on semi-supervised learning.

6. Sentiment classification, author gender identification, and genre identification. These kinds of problems require new feature extraction techniques.
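As a footnote to point 2 above, here is a minimal sketch of the shortest-path/edit-distance idea for cleaning OCR-derived tokens: treat each character edit as a unit-cost step and snap a noisy token to its nearest dictionary word. The dictionary and tokens below are made up purely for illustration.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def snap_to_dictionary(token, dictionary):
    # Return the dictionary word with the smallest edit distance.
    return min(dictionary, key=lambda w: edit_distance(token, w))

print(snap_to_dictionary("v1agra", ["viagra", "niagara", "vicar"]))

Of course, this greedy snapping breaks down once the spammer's deliberate noise pushes a token closer to the wrong word, which is exactly where the similarity/kernel-learning questions above come in.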

Some other concerns:

Feature selection for text categorization? As far as I can see, I don't think this direction will produce more interesting results. It works and is efficient, and it can serve as a preprocessing step to reduce the computational burden, but more sophisticated methods (such as kernel learning) can do a better job.

Active learning: a greedy method works well enough.

Clustering: it doesn't make much sense to me, though for simple text it might show some potential impact. I believe clustering should be "customized": different users will ask for different clustering results, so it seems more interesting to do clustering given some prespecified parameters. Multi-label clustering under concept drift could also be explored.

Thursday, March 22, 2007

Leonhard Euler

"Leonhard Euler(1707-1783) A Swiss mathematician and physicist.
......
During the last 17 years of his life, he was almost totally blind, and yet he produced nearly half of his results during that period."

(Copied from Pattern Recognition and Machine Learning, Page 465)

I am wondering whether he would have had any chance to win the Fields Medal :)

Where to Go?

Recently, I seem to have lost my enthusiasm for research. No mood to read, to discuss, or to work on experiments. I know this is not the right way to go, but I just can't help becoming anxious. I wish there were a problem I could simply jump into.

I don't know whether this is related to frustration with my recent progress. Too slow.

To be frank, I really doubt the current development of the machine learning field. Too fast. Each year the top-level conferences produce lots of publications, but very few of them are worth reading. Lots of them are just rubbish. Is this common in other fields?

I am wondering which direction would be the right way to go. Searching blindly is really difficult :(

Friday, March 2, 2007

Human Intelligence and Current Artificial Intelligence

I'm taking Pat Langley's cognitive systems class this semester.

My sense of what intelligence is remains rather vague.

What is intelligence? If a system is complicated enough, is it intelligent? What should the components of an intelligence be? Take Deep Blue: is that intelligence? I guess most people won't vote for it. Now people are trying to develop a meta-game player that can automatically train itself as long as the rules of the game are given. Is that intelligent? No. Human beings are actually open-minded. Yes, machines can outperform human beings in one field, but that can never be "artificial intelligence". That's why I feel AI will never come true in my lifetime. But I agree with Pat that more focus should be on the nature of mind.

For a long time, I've been thinking that the difference between the brain and a machine is mainly a hardware difference (the brain's processing speed is low and its memory is limited, but it is highly parallel). In chess playing, planning, and scheduling, machines can do an exhaustive or deeper search, while human beings prune a lot of branches at each step (I guess some pattern recognition is involved); it's more like beam search, I think. The problem for efficient processing is how to extract those useful patterns.

Then recently Pat mentioned this paper:
"The Magical Number Seven, Plus or Minus Two"

which tries to understand why human short-term memory holds about seven items at a time (this is partly the reason telephone numbers are around seven digits long).

Based on information theory, the more digits we have, the more information they contain, and thus the more bits are needed to represent them. But for human beings this is not the case. This phenomenon is really interesting and justifies the study of human brains. I guess Pat didn't realize this example would be so exciting to me :)

Thursday, March 1, 2007

KDD submission is done

Nitin and I coauthored a paper on finding influentials in the blogosphere. It's an interesting problem; the method seems a little ad hoc, but we finally made it.

It was done through long-distance communication (that's the wonder of the Web): Nitin was working in India while Dr. Liu and I were working in the United States. The final version seems much better, but whether this work gets accepted really depends on our luck. Part of my concern is that the technical part is rather subjective.

I don't know whether the reviewers will be obsessed with cross-validation.

Let's see.

Tuesday, February 13, 2007

No hope for this google interview

Just finished the Google interview. I believe this is the worst I've ever done.
It was so bad. The interviewer was Chinese, but we still talked in English.

The first question asked me to write an SQL query. I haven't touched databases for a long time, so after 10 minutes I still couldn't come up with a good answer.

Second question: how do you implement merge sort? Write the pseudocode. I asked him whether I could use recursion, and he replied: without recursion, how would you do it? (Stupid of me; why not just give the answer with recursion right away?) What are the time and space complexities? I knew the time complexity is O(n log n) but wasn't sure about the space; I finally figured out that the correct answer is O(n). A sketch is below. I guess he had already kicked me off his candidate list based on my performance.
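For the record, here is a minimal recursive merge sort in Python (my own after-the-fact sketch, not what I wrote in the interview); the O(n) extra space is the merge buffer:

def merge_sort(a):
    # O(n log n) time; O(n) extra space for the merged lists.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 3]))  # [1, 2, 3, 5, 9]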

Third question: you have 10 GB of array data and 2 GB of RAM; how do you sort it? I knew the answer should be merge sort, but how much data should you load into memory each time?
I answered that you can split an array of size n into 2 parts, 3 parts, or even more, and then combine them: with 2 parts you need working memory of n/2, with 3 parts n/3. But what is the optimal number of parts? No idea.
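In hindsight, the textbook answer is an external merge sort: sort memory-sized chunks, write each sorted run to disk, then do a k-way merge of the runs with a heap. A hedged Python sketch, assuming one integer per line; the chunk size and file layout are arbitrary choices of mine:

import heapq
import os
import tempfile

def external_sort(infile, outfile, chunk_lines=1_000_000):
    # Phase 1: sort chunk-sized runs and spill them to temporary files.
    runs = []
    with open(infile) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.write("\n".join(map(str, chunk)) + "\n")
            run.close()
            runs.append(run.name)
    # Phase 2: k-way merge of the sorted runs (heapq.merge keeps a heap
    # with one entry per run, so memory stays bounded).
    files = [open(r) for r in runs]
    with open(outfile, "w") as out:
        for value in heapq.merge(*(map(int, f) for f in files)):
            out.write(f"{value}\n")
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)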

I was too nervous to think any more, and he mentioned several times that we were out of time.

The fourth question was about my project and how to solve their problem.

He asked me if I had any questions, but I had already given up, so I just asked him some routine questions.

I really need to know the algorithms and data structures book well, but I was too busy to prepare for the interview. Shame on me.

Saturday, February 10, 2007

How to collaborate with others?

I just read John's blog and found it really very helpful, so I'll just copy and paste his post here.

The full article can be found via the link below:
http://hunch.net/?p=251

2/10/2007
Best Practices for Collaboration
Filed under: Papers, Research — jl @ 1:51 pm

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

1. Travel and Discuss Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
2. Don’t keep the write lock unnecessarily. Some people are perfectionists so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing write lock (at least) when you sleep, is a good idea.
3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborators time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last tuesday”. Coupled with “send all necessary materials”, this implies that you naturally backup previous work.
3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
4. Communicate by default Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents “conversation shifts into subjects interesting to everyone, but oops! you weren’t cced” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Friday, February 9, 2007

Done with the ICML paper

OK. Payam and I just finished the ICML paper. It's good. I believe it should be accepted unless we are really unlucky.

Anyway, it was a good experience working with Payam. I believe our cooperation made the task much easier and more efficient.

Hope we can get good news soon.

Need to rush on my KDD paper now :)

Monday, February 5, 2007

How to test two data sets from the same distribution?

Suppose I have two data sets. How can I tell whether they come from two different distributions? What is the difference between these two data sets?

This is actually a very generic question that arises in machine learning.

Some intuitive ideas:
1. Estimate the density of the samples in each data set. This method might be very weak: in practice, density estimation is a very difficult task compared with "simple" classification, so this approach is generally not applicable in reality.

2. Estimate the sufficient statistics of each data set, like the mean and variance of each feature, or the class-conditional distributions. This can be interpreted as analogy in cognition: an analogy can be drawn if the relationships among multiple symbols are maintained. The problem is: when can we conclude that the difference is large enough? It seems some hypothesis testing is required.

3. Transformation. One data set can be transformed into the other, but how do you know the feature mapping? A more reasonable way is to enforce equivalence of the sufficient statistics in a newly generated space.

4. Dimensionality reduction. Assume the two data sets share the same distribution on a fixed number of dimensions. By projecting both data sets onto those dimensions, we can probably find some interesting patterns.

5. Learn a classification function that separates the two data sets, and use the decision function to measure the difference. This is quite related to transfer learning.

6. Any other ways?

OK. To pursue this direction, where can we find the data sets?

The kernel Maximum Mean Discrepancy (MMD) approach might be related; a rough sketch follows.
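A minimal numpy sketch of the (biased) squared-MMD estimate with an RBF kernel, on synthetic data; the bandwidth, sample sizes, and mean shift are arbitrary choices of mine:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), for all pairs of rows.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(0.5, 1.0, size=(200, 5))  # same shape, shifted mean
print(mmd2(X[:100], X[100:]))  # near zero: same distribution
print(mmd2(X, Y))              # clearly larger: different distributions

To turn the statistic into a yes/no answer, some significance threshold is still needed, e.g. from a permutation test over the pooled samples, which connects back to the hypothesis-testing point above.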

The most impressive PhD theses! (reposted)

The most impressive PhD theses: the professors got so nervous they half-believed they were the ones defending.

1. The most impressive PhD thesis is one that is published in the best journal before the defense even takes place, and, since the thesis is so long, the journal has to serialize it like a novel.

Example: Steven N. S. Cheung's PhD thesis, "The Theory of Share Tenancy," which was serialized over four issues of the JLE.

2. The most impressive thesis defense is one where the candidate keeps challenging the committee members until the professors get so nervous they half-believe they are the ones defending.

Example: after Samuelson's thesis defense, committee member Schumpeter (one of the greatest economists of the last century) turned to another member, Leontief (a Nobel laureate), and asked: "Wassily, did we pass?"

3. The most impressive journal submission is one where the editor cannot find a single anonymous reviewer in the whole world who can understand the paper, so in the end it simply has to be published, with no revisions required.

Example: Sims's 1971 paper in the Annals of Mathematical Statistics, "Distributed Lag Estimation When the Parameter Space Is Explicitly Infinite-Dimensional." After writing it, Sims did not submit it to an economics journal, since he obviously knew nobody there would understand it. So he sent it to the top journal in mathematical statistics, where the editor could not find a reviewer anywhere. The one finally roped in wrote a report along these lines: "I really do not understand what this paper is saying, but I checked a few of the theorems and they seem to be correct, so I guess it should be published."

4. The most impressive thesis needs no lengthy treatise; a thousand words or so will do. Example: de Broglie was an aristocratic playboy who majored in history as an undergraduate. Later, out of sheer boredom, he spent five years on a PhD, and the thesis he finally handed in was a single page, one even suspected of "plagiarism."

The committee was so annoyed they did not want to let him defend at all. His advisor, the famous physicist Langevin, felt it would be a disgrace if his own student failed to graduate, so he got Einstein to join him in pleading the case: let the kid pass; his father is the French Minister of the Interior, and we cannot afford to offend him. That "rubbish" thesis was later seen by Schrödinger, who pored over the single page for a month and then published the Schrödinger equation, one of the most important results in quantum mechanics; Schrödinger's cat went on to become the most interesting cat in science.

De Broglie won the Nobel Prize in Physics for the ideas set out in that thesis. Building on the same paper, Schrödinger made outstanding contributions to quantum mechanics, leaping from an ordinary, frustrated lecturer to a great scientist and a Nobel laureate in physics himself. A one-page PhD thesis behind two Nobel Prizes in Physics is surely unprecedented, and probably never to be repeated. So the most impressive thesis need not be serialized like Cheung's; a single sheet of A4 is enough. Though I suspect that if de Broglie had done his PhD in China, he would have been out of luck: with so few words, his thesis would not even have qualified for a defense.

A couple of corrections are in order: de Broglie lost both parents in childhood and was raised by his elder brother Maurice, a duke and himself a distinguished physicist. In the year before his famous 1924 thesis, de Broglie had already published three papers in a row proposing the matter-wave hypothesis; as for how many pages the thesis actually ran, I have not verified.

On Schrödinger: he was a man of many talents who spoke four languages and published a collection of poetry. His 1944 book "What Is Life?" drew a whole generation of physicists into biology, including Watson and Crick, the later discoverers of the double helix. So these giants are not necessarily as legendary as people imagine, nor can their success be put down purely to chance. As the saying goes: the great stay great.

Wednesday, January 31, 2007

science research is not proof

Two days ago, we had a very nice discussion with Pat Langley.
Pat is more a cognitive scientist than a typical computer scientist, and very talkative :)

Basically, I questioned why he recommended a paper claiming that it takes human beings longer to reason about difficult puzzles. The authors used a forward-chaining scheme to build their model and, from the similar performance of the model and human subjects, concluded that humans take longer to solve difficult puzzles.

That paper is a little boring to me, as the result is totally unsurprising, even too "obvious". Also, I really doubt that human beings reason by forward chaining or backward chaining. At first sight, the paper's conclusion seemed totally unconvincing to me, even though the result itself is obvious.

So I asked him about the validity of this experimental setup, and got his counter-question: other than this, what can you do? Yep, I just posed questions but forgot to think outside the box. What would I do if I were the researcher? You propose a model, and then all you can do is find "sufficient" evidence to support your claim. In most disciplines, however, claims cannot be proved the way mathematics is, and even something proved in one setting might not work in reality. That's a common situation in data mining, and in planning too. The disadvantage of deduction is that you have to trust your premises 100%: if one of your assumptions is wrong, you mess up.

After reckoning over the problem for a while, I have to sadly admit that this is the only possible, or at least the proper, way to justify a new claim. So the problem becomes: how do you find "enough" evidence?

It seems that our scientific research is very weak. Or maybe that's exactly the process of doing research: you propose a model to solve a problem or explain some phenomenon; then people use a single counterexample to disprove your model (disproof is always much easier than proof, though this seems not to be the case for refutation in formal logic); then a new model or theory is proposed.

You'll find that almost all science and engineering research repeats the same cycle.

Coming back to machine learning experiments, I think I've already put too much faith in 10-fold cross-validation. Actually, most tasks have no single correct evaluation method. Take information retrieval or data analysis: these tasks require a reasonably good answer, not an optimal solution.

In the recent paper I coauthored with Payam for ICML, we found that two dramatically different evaluation methods for feature selection turn out to agree almost completely when comparing feature selection algorithms. There's actually no proof, but I believe it's interesting research.

Scientific research is pragmatic.

Thursday, January 25, 2007

Just finished the journal paper

I've just finished the journal paper for TKDD.
This paper really took a while to finish; I rewrote it so many times it made me want to puke.

I guess I've been through it at least 15 times. I really cannot bear to read it any more.

I am wondering: to do research, should I put more effort into polishing papers or into thinking about ideas?

I really hate spending so much time on writing papers. I don't think it's worthwhile.

The ridiculous thing about the current state of machine learning and data mining is that papers tend to pile on UNNECESSARY formulas just to make them easier to accept.

That's not the goal of research. I would much rather motivate the problem well and present simple, intuitive, sensible, working methods!!

Sunday, January 21, 2007

Some experiment result

I just finished some demo experiments. Originally, I wanted to find a toy example showing the effect of task selection in transfer learning. Unfortunately, all the results I found were very disappointing.
Let me summarize the results a little bit:
(1) If the target task has very limited training data, transfer learning does help a lot compared with single-task learning.
(2) The choice of tasks makes only a tiny difference (within 1%). Actually, combining all the tasks together seems to be a very robust and reliable strategy for the data set I'm working on.
(3) Combining all the data together always seems better than transfer learning.


(1) is not surprising and has been confirmed by existing research.
(2) cannot justify task selection.
(3) suggests there is no real difference between the tasks in this data set.

Maybe one interesting problem is to determine whether data extracted from multiple sources actually follow the same distribution. But I feel that's a more difficult problem than task selection.

Saturday, January 20, 2007

Flexing Muscle, China Destroys Satellite in Test

This is great news for the Chinese!!

http://www.nytimes.com/2007/01/19/world/asia/19china.html?n=Top%2fNews%2fWorld%2fCountries%20and%20Territories%2fChina

By WILLIAM J. BROAD and DAVID E. SANGER
Published: January 19, 2007

China successfully carried out its first test of an antisatellite weapon last week, signaling its resolve to play a major role in military space activities and bringing expressions of concern from Washington and other capitals, the Bush administration said yesterday.

Only two nations — the Soviet Union and the United States — have previously destroyed spacecraft in antisatellite tests, most recently the United States in the mid-1980s.

Arms control experts called the test, in which the weapon destroyed an aging Chinese weather satellite, a troubling development that could foreshadow an antisatellite arms race. Alternatively, however, some experts speculated that it could precede a diplomatic effort by China to prod the Bush administration into negotiations on a weapons ban.

“This is the first real escalation in the weaponization of space that we’ve seen in 20 years,” said Jonathan McDowell, a Harvard astronomer who tracks rocket launchings and space activity. “It ends a long period of restraint.”

White House officials said the United States and other nations, which they did not identify, had “expressed our concern regarding this action to the Chinese.” Despite its protest, the Bush administration has long resisted a global treaty banning such tests because it says it needs freedom of action in space.

Jianhua Li, a spokesman at the Chinese Embassy in Washington, said that he had heard about the antisatellite story but that he had no statement or information.

At a time when China is modernizing its nuclear weapons, expanding the reach of its navy and sending astronauts into orbit for the first time, the test appears to mark a new sphere of technical and military competition. American officials complained yesterday that China had made no public or private announcements about its test, despite repeated requests by American officials for more openness about its actions.

The weather satellite hit by the weapon had circled the globe at an altitude of roughly 500 miles. In theory, the test means that China can now hit American spy satellites, which orbit closer to Earth. The satellites presumably in range of the Chinese missile include most of the imagery satellites used for basic military reconnaissance, which are essentially the eyes of the American intelligence community for military movements, potential nuclear tests and even some counterterrorism, and commercial satellites.

Experts said the weather satellite’s speeding remnants could pose a threat to other satellites for years or even decades.

In late August, President Bush authorized a new national space policy that ignored calls for a global prohibition on such tests. The policy said the United States would “preserve its rights, capabilities, and freedom of action in space” and “dissuade or deter others from either impeding those rights or developing capabilities intended to do so.” It declared the United States would “deny, if necessary, adversaries the use of space capabilities hostile to U.S. national interests.”

The Chinese test “could be a shot across the bow,” said Theresa Hitchens, director of the Center for Defense Information, a private group in Washington that tracks military programs. “For several years, the Russians and Chinese have been trying to push a treaty to ban space weapons. The concept of exhibiting a hard-power capability to bring somebody to the negotiating table is a classic cold war technique.”

Gary Samore, the director of studies at the Council on Foreign Relations, said in an interview: “I think it makes perfect sense for the Chinese to do this both for deterrence and to hedge their bets. It puts pressure on the U.S. to negotiate agreements not to weaponize space.”

Ms. Hitchens and other critics have accused the administration of conducting secret research on advanced antisatellite weapons using lasers, which are considered a far speedier and more powerful way of destroying satellites than the weapons of two decades ago.

The White House statement, issued by the National Security Council, said China’s “development and testing of such weapons is inconsistent with the spirit of cooperation that both countries aspire to in the civil space area.”

An administration official who had reviewed the intelligence about China’s test said the launching was detected by the United States in the early evening of Jan. 11, which would have been early morning on Jan. 12 in China. American satellites tracked the launching of the medium-range ballistic missile, and later space radars saw the debris.

The antisatellite test was first reported late Wednesday on the Web site of Aviation Week and Space Technology, an industry magazine. It said intelligence agencies had yet to “complete confirmation of the test.”

The test, the magazine said, appeared to employ a ground-based interceptor that used the sheer force of impact rather than an exploding warhead to shatter the satellite.

Dr. McDowell of Harvard said the satellite was known as Feng Yun, or “wind and cloud.” Launched in 1999, it was the third in a series. He said that it was a cube measuring 4.6 feet on each side, and that its solar panels extended about 28 feet. He added that it was due for retirement but that it still appeared to be electronically alive, making it an ideal target.

“If it stops working,” he said, “you know you have a successful hit.”

David C. Wright, a senior scientist at the Union of Concerned Scientists, a private group in Cambridge, Mass., said he calculated that the Chinese satellite had shattered into 800 fragments four inches wide or larger, and millions of smaller pieces.

The Soviet Union conducted roughly a dozen antisatellite tests from 1968 to 1982, Dr. McDowell said, adding that the Reagan administration carried out its experiments in 1985 and 1986.

The Bush administration has conducted research that critics say could produce a powerful ground-based laser weapon that would be used against enemy satellites.

The largely secret project, parts of which were made public through Air Force budget documents submitted to Congress last year, appears to be part of a wide-ranging administration effort to develop space weapons, both defensive and offensive.

The administration’s laser research is far more ambitious than a previous effort by the Clinton administration to develop an antisatellite laser, though the administration denies that it is an attempt to build a laser weapon.

The current research takes advantage of an optical technique that uses sensors, computers and flexible mirrors to counteract the atmospheric turbulence that seems to make stars twinkle. The weapon would essentially reverse that process, shooting focused beams of light upward with great clarity and force.

Michael Krepon, co-founder of the Henry L. Stimson Center, a group that studies national security, called the Chinese test very un-Chinese.

“There’s nothing subtle about this,” he said. “They’ve created a huge debris cloud that will last a quarter century or more. It’s at a higher elevation than the test we did in 1985, and for that one the last trackable debris took 17 years to clear out.”

Mr. Krepon added that the administration had long argued that the world needed no space-weapons treaty because no such arms existed and because the last tests were two decades ago. “It seems,” he said, “that argument is no longer operative.”

Mark Mazzetti contributed reporting.

Friday, January 19, 2007

Rejected by Google

I thought I could reject Google, but unfortunately Google rejected me!! :(



We would like to thank you for your interest in Google. After carefully reviewing your experience and qualifications, we have determined that we do not have a 'Software Engineering Intern' position available which is a strong match at this time.

Thanks again for considering Google. We wish you well in your endeavors
and hope you might consider us again in the future.

Thursday, January 18, 2007

Two TA jobs again!!!!

This semester I have to work as the TA for both the AI and the Data Mining classes again. So sad...
I wonder why I am always so unlucky.

Just returned the new version of the journal paper to my boss. I feel really tired of making any more changes to it.

Wednesday, January 17, 2007

Google Phone Interview (1st round)

Why do you like Google?
What's the difference between processes and threads?
What's the difference between Java and C++?
Tell me the basic concepts of object-oriented programming.
How do you implement multiple inheritance in Java? Why does Java use interfaces while C++ keeps full multiple inheritance?

Have you ever been involved in a team project? What did you do?
How do you handle having a different opinion from your manager?

Then we discussed my research topic.

How do you find the most common word in billions of documents?
1. If memory allows (a hash table), what's the time complexity? (See the sketch below.)
2. What about multiple machines? What's the bottleneck?
3. What if there is just one machine and the hash table cannot fit in memory?
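For the in-memory case, a hedged sketch of the hash-table answer; the file-based setup is my own framing, and counting is O(total number of words). When the data exceeds memory, the standard trick is to hash words into on-disk buckets (or across machines), count each bucket independently, and take the max of the per-bucket winners:

from collections import Counter

def most_common_word(doc_paths):
    # One hash-table (Counter) pass over all documents.
    counts = Counter()
    for path in doc_paths:
        with open(path) as f:
            for line in f:
                counts.update(line.lower().split())
    return counts.most_common(1)[0]  # (word, count)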

Then I asked him some general questions.
For the last task, he asked me to send him code within 30 minutes: write a function that converts a string into an integer.
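A function along those lines (my own sketch after the fact, not necessarily what I actually sent), handling an optional sign and rejecting bad input:

def string_to_int(s):
    # Convert a decimal string like "-42" to an integer, digit by digit.
    s = s.strip()
    if not s:
        raise ValueError("empty string")
    sign, i = 1, 0
    if s[0] in "+-":
        sign = -1 if s[0] == "-" else 1
        i = 1
    if i == len(s):
        raise ValueError("no digits")
    value = 0
    for ch in s[i:]:
        if not "0" <= ch <= "9":
            raise ValueError("bad character: " + repr(ch))
        value = value * 10 + (ord(ch) - ord("0"))
    return sign * value

assert string_to_int("-123") == -123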

Google Phone Interview (2nd round)

Just finished my interview with Google. This time it was an engineer, asking lots of detailed questions.

He knows a lot about Lisp, so we discussed Lisp issues, like:
What's the difference between Lisp and other languages?
Which Lisp compiler do you use? Emacs Lisp vs. CLISP?

Then some questions about operating systems:
What's the difference between a process and a thread? What information does a thread maintain? Its own stack? Its own heap?
How and when does a context switch happen? How do you handle a time-slice interrupt?
What are the possible pitfalls of multithreaded programming?

How does a compiler work?
Can regular expressions handle nested structures?
Tell me something about grammars.
Is type checking done before or after parsing?
(I did pretty badly in this session, so he finally stopped.)

Familiar with TCP/IP, RPC, network programming? (NO, skipped)

Are you familiar with B-trees and red-black trees? (No, so we switched to binary search trees.)
What's the time complexity of insertion or lookup in a binary search tree? O(log n).
Worst case? O(n).
How do you transform an unbalanced tree into a balanced one?
Are you familiar with TreeMap in Java?

How does a hash table work? What if two objects have the same hash value? Show me an example of a hash function.
What is the underlying structure of a hash table? (I said an array.) How do you map a key to an index? (A sketch is below.)
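A sketch of the answers I should have given: the underlying structure is an array of buckets, a key maps to an index via hash(key) % capacity, and colliding keys chain inside a bucket:

class ChainedHashTable:
    # Toy hash table: an array of buckets, chaining on collisions.
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Map a key to an array index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))      # collisions just extend the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"), t.get("b"))  # 1 2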

Finally, one technical question:
Given a source word, a target word, and a dictionary, how do you transform the source word into the target word by changing only one letter at each step? The word you get at each step must be in the dictionary.
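This is a shortest-path search over words, so breadth-first search finds a shortest transformation. A hedged sketch; the toy dictionary is mine:

from collections import deque

def word_ladder(source, target, dictionary):
    # BFS over one-letter edits; every intermediate word must be in
    # the dictionary. Returns a shortest path, or None if none exists.
    words = set(dictionary) | {target}
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == target:
            return path
        for i in range(len(word)):
            for c in "abcdefghijklmnopqrstuvwxyz":
                nxt = word[:i] + c + word[i + 1:]
                if nxt in words and nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None

print(word_ladder("cold", "warm", ["cord", "card", "ward", "warm"]))
# ['cold', 'cord', 'card', 'ward', 'warm']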

Then I asked him about some project details.

Wednesday, January 3, 2007

Desirable features of blog site

I just started two blogs, one on Google Blogspot and the other on Windows Live Spaces. Comparing these two blog sites is kind of interesting.

Windows Live:
1. Better photo handling.
2. The templates seem more beautiful.

Google blogspot:

1. Supports group blogs, which Windows Live Spaces lacks.


Both suck:
1. Cannot set permissions on a specific post;
2. Cannot change the layout and template freely, e.g., change a template's background.

I am wondering how these two blog sites could be so successful. Are there any other good ones?

Tuesday, January 2, 2007

Redundant features: useful or not?

I just browsed one interesting paper published in ICML'06:

nightmare at test time: robust learning by feature deletion

The motivation for the paper can be described as being like buying stocks.

Suppose you have several stocks with the same risk, and you have $1000. What would you do?

Of course, dividing the money evenly across these stocks should be more reliable than putting it all into just one.

This is the same situation for feature selection. Suppose you select some relevant features from the training data; the selection could be wrong due to small samples or noise of any kind.

In this process, you probably remove those redundant features as well. From this point of view, it seems more robust to keep those redundant features rather than remove them.

But from the curse-of-dimensionality point of view, redundant features should be removed.

How to trade off redundancy and robustness?
I guess this is highly related to the definition of redundancy.

I'll comment more on this issue in the future.

Monday, January 1, 2007

Some interesting search engines

Some interesting search engines: Hakia, Powerset, Snap.
There's an article in the NYTimes talking about these search engines.

Will one of these be the next Google?

I used to be a Google fan, but I just found that Google Maps sucks. I tried different map services for an LA trip, and MapQuest seems to do a much better job; Google Maps always got me lost in the end. To be sure, this was not the first time.