Wednesday, June 18, 2008

some interesting ICML papers

Maybe too many.

The GroupLASSO for Generalized Linear Models: Uniqueness of Solutions and Efficient Algorithms

Volker Roth and Bernd Fischer

The GroupLASSO method for finding important explanatory factors suffers from the potential non-uniqueness of solutions and also from high computational costs. We formulate conditions for the uniqueness of GroupLASSO solutions which lead to an easily implementable test procedure. In addition to merely detecting ambiguities in solutions, this testing procedure identifies all potentially active groups. These results are used to derive an efficient algorithm that can deal with input dimensions in the millions and can approximate the solution path efficiently. The derived methods are applied to large-scale learning problems where they exhibit excellent performance. We show that the proposed testing procedure helps to avoid misinterpretations of GroupLASSO solutions.

[Full paper]
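
To fix ideas about what the group penalty actually does (the abstract doesn't spell out an algorithm), here is a minimal proximal-gradient sketch for a group-lasso-penalized least-squares problem. This is my own toy illustration, not the authors' GLM algorithm; the group structure, step size, and regularization weight below are made up.

    import numpy as np

    def group_soft_threshold(v, t):
        """Block soft-thresholding: shrink the whole group toward zero."""
        norm = np.linalg.norm(v)
        return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

    def group_lasso_ls(X, y, groups, lam, step=None, n_iter=500):
        """Proximal gradient for 0.5*||y - X w||^2 + lam * sum_g ||w_g||_2."""
        n, d = X.shape
        w = np.zeros(d)
        if step is None:
            step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y)
            z = w - step * grad
            for g in groups:                          # proximal step, group by group
                w[g] = group_soft_threshold(z[g], step * lam)
        return w

    # toy usage: 3 groups of 2 features each, only the first group is active
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 6))
    y = X[:, :2] @ np.array([2.0, -1.5]) + 0.1 * rng.standard_normal(100)
    groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
    print(group_lasso_ls(X, y, groups, lam=5.0).round(2))

The block soft-thresholding step is what zeroes out entire groups at once.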

Dirichlet Component Analysis: Feature Extraction for Compositional Data

Hua-Yan Wang, Qiang Yang, Hong Qin, and Hongbin Zha

We consider feature extraction (dimensionality reduction) for compositional data, where the data vectors are constrained to be positive and constant-sum. In real-world problems, the data components (variables) usually have complicated "correlations" while their total number is huge. Such a scenario demands feature extraction. That is, we shall de-correlate the components and reduce their dimensionality. Traditional techniques such as Principal Component Analysis (PCA) are not suitable for these problems due to unique statistical properties and the need to satisfy the constraints in compositional data. This paper presents a novel approach to feature extraction for compositional data. Our method first identifies a family of dimensionality reduction projections that preserve all relevant constraints, and then finds the optimal projection that maximizes the estimated Dirichlet precision on projected data. It reduces the compositional data to a given lower dimensionality while the components in the lower-dimensional space are de-correlated as much as possible. We develop the theoretical foundation of our approach, and validate its effectiveness on some synthetic and real-world datasets.

[Full paper]

Pairwise Constraint Propagation by Semidefinite Programming for Semi-Supervised Classification

Zhenguo Li, Jianzhuang Liu, and Xiaoou Tang

We consider the general problem of learning from pairwise constraints and unlabeled data. The pairwise constraints specify whether two objects belong to the same class or not, known as the must-link constraints and the cannot-link constraints. We propose to learn a mapping that is smooth over the data graph and maps the data onto a unit hypersphere, where two must-link objects are mapped to the same point while two cannot-link objects are mapped to be orthogonal. We show that such a mapping can be achieved by formulating a semidefinite programming problem, which is convex and can be solved globally. Our approach can effectively propagate pairwise constraints to the whole data set. It can be directly applied to multi-class classification and can handle data labels, pairwise constraints, or a mixture of them in a unified framework. Promising experimental results are presented for classification tasks on a variety of synthetic and real data sets.

[Full paper]
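
My best guess at the kind of semidefinite program the abstract hints at: learn a PSD Gram matrix of the mapped points with unit diagonal (the hypersphere), must-link pairs forced to inner product 1, cannot-link pairs to 0, and smoothness measured by the graph Laplacian. The cvxpy formulation and the exact objective are my assumptions, not necessarily the authors' program.

    import numpy as np
    import cvxpy as cp

    def constrained_gram(W, must_link, cannot_link):
        """Learn a PSD Gram matrix K of points on a unit hypersphere that is
        smooth w.r.t. the data graph W and respects pairwise constraints."""
        n = W.shape[0]
        D = np.diag(W.sum(axis=1))
        L = D - W                                   # unnormalized graph Laplacian
        K = cp.Variable((n, n), PSD=True)
        cons = [cp.diag(K) == 1]                    # points live on the unit sphere
        cons += [K[i, j] == 1 for i, j in must_link]     # mapped to the same point
        cons += [K[i, j] == 0 for i, j in cannot_link]   # mapped to orthogonal points
        prob = cp.Problem(cp.Minimize(cp.trace(L @ K)), cons)
        prob.solve()
        return K.value

    # toy graph of 4 points: two tight pairs, one must-link and one cannot-link
    W = np.array([[0., 1., .1, .1],
                  [1., 0., .1, .1],
                  [.1, .1, 0., 1.],
                  [.1, .1, 1., 0.]])
    K = constrained_gram(W, must_link=[(0, 1)], cannot_link=[(0, 2)])
    print(np.round(K, 2))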

Localized Multiple Kernel Learning

Mehmet Gonen and Ethem Alpaydin

Recently, instead of selecting a single kernel, multiple kernel learning (MKL) has been proposed, which uses a convex combination of kernels, where the weight of each kernel is optimized during training. However, MKL assigns the same weight to a kernel over the whole input space. In this paper, we develop a localized multiple kernel learning (LMKL) algorithm using a gating model for selecting the appropriate kernel function locally. The localizing gating model and the kernel-based classifier are coupled and their optimization is done in a joint manner. Empirical results on ten benchmark and two bioinformatics data sets validate the applicability of our approach. LMKL achieves accuracy statistically similar to MKL while storing fewer support vectors. LMKL can also combine multiple copies of the same kernel function localized in different parts of the input space. For example, LMKL with multiple linear kernels gives better accuracy results than using a single linear kernel on the bioinformatics data sets.

[Full paper]
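
The central object here is a locally weighted kernel: a gating function eta_m(x) reweights each base kernel per data point, K_eta(x_i, x_j) = sum_m eta_m(x_i) K_m(x_i, x_j) eta_m(x_j). The sketch below just assembles such a combined kernel with a softmax gate on a linear model; the gating parameterization and the two base kernels are my assumptions, and the joint training with the classifier is omitted.

    import numpy as np

    def softmax_gate(X, V, b):
        """eta[i, m]: how much kernel m is trusted at point x_i (softmax gating)."""
        scores = X @ V + b                            # (n, M)
        scores -= scores.max(axis=1, keepdims=True)
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    def locally_combined_kernel(kernels, eta):
        """K_eta(x_i, x_j) = sum_m eta[i, m] * K_m[i, j] * eta[j, m]."""
        K = np.zeros_like(kernels[0])
        for m, Km in enumerate(kernels):
            K += np.outer(eta[:, m], eta[:, m]) * Km
        return K

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    K_lin = X @ X.T                                            # linear kernel
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_rbf = np.exp(-sq / 2.0)                                  # RBF kernel
    V, b = rng.standard_normal((3, 2)), np.zeros(2)            # untrained gate
    eta = softmax_gate(X, V, b)
    K = locally_combined_kernel([K_lin, K_rbf], eta)
    print(K.shape, np.allclose(K, K.T))

The combined matrix stays positive semidefinite because each term is a base kernel Hadamard-multiplied by a rank-one PSD matrix.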

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

Haiping Lu, Konstantinos Plataniotis, and Anastasios Venetsanopoulos

Tensorial data are frequently encountered in various machine learning tasks today, and dimensionality reduction is one of their most important applications. This paper extends the classical principal component analysis (PCA) to its multilinear version by proposing a novel dimensionality reduction algorithm for tensorial data, named uncorrelated multilinear PCA (UMPCA). UMPCA seeks a tensor-to-vector projection that captures most of the variation in the original tensorial input while producing uncorrelated features through successive variance maximization. We evaluate the proposed algorithm on a second-order tensorial problem, face recognition, and the experimental results show its superiority, especially in low-dimensional spaces, in comparison with three other PCA-based algorithms.

[Full paper]

A Dual Coordinate Descent Method for Large-scale Linear SVM

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan

In many applications, data appear with a huge number of instances as well as features. The linear Support Vector Machine (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an epsilon-accurate solution in O(log(1/epsilon)) iterations. Experiments indicate that our method is much faster than state-of-the-art solvers such as Pegasos, Tron, svmperf, and a recent primal coordinate descent implementation.

[Full paper]
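
A minimal sketch of the kind of dual coordinate descent loop described here, for the L1-loss (hinge) case: keep w = sum_i alpha_i y_i x_i in memory and update one alpha_i at a time with a clipped Newton step. The epoch count, the random sweep order, and the missing shrinking/stopping heuristics are simplifications of mine.

    import numpy as np

    def dcd_linear_svm(X, y, C=1.0, epochs=20, seed=0):
        """Dual coordinate descent for the L1-loss linear SVM:
        min_a 0.5*||sum_i a_i y_i x_i||^2 - sum_i a_i,  0 <= a_i <= C."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.zeros(d)
        Qii = (X * X).sum(axis=1)                 # diagonal of the dual Hessian
        for _ in range(epochs):
            for i in rng.permutation(n):
                if Qii[i] == 0:
                    continue
                G = y[i] * (w @ X[i]) - 1.0       # gradient of the i-th dual coordinate
                a_new = np.clip(alpha[i] - G / Qii[i], 0.0, C)
                w += (a_new - alpha[i]) * y[i] * X[i]
                alpha[i] = a_new
        return w

    # toy usage on a linearly separable problem
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 1e-9)
    w = dcd_linear_svm(X, y)
    print("training accuracy:", (np.sign(X @ w) == y).mean())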

Efficient MultiClass Maximum Margin Clustering

Bin Zhao, Fei Wang, and Changshui Zhang

This paper presents a cutting plane algorithm for multiclass maximum margin clustering (MMC). The proposed algorithm constructs a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence can be efficiently solved using the constrained concave-convex procedure (CCCP). Experimental evaluations on several real-world datasets show that our algorithm converges much faster than existing MMC methods with guaranteed accuracy, and can thus handle much larger datasets efficiently.

[Full paper]

Nearest Hyperdisk Methods for High-Dimensional Classification

Hakan Cevikalp, Bill Triggs, and Robi Polikar

In high-dimensional classification problems it is infeasible to include enough training samples to cover the class regions densely. Irregularities in the resulting sparse sample distributions cause local classifiers such as Nearest Neighbors (NN) and kernel methods to have irregular decision boundaries. One solution is to "fill in the holes" by building a convex model of the region spanned by the training samples of each class and classifying examples based on their distances to these approximate models. Methods of this kind based on affine and convex hulls and bounding hyperspheres have already been studied. Here we propose a method based on the bounding hyperdisk of each class -- the intersection of the affine hull and the smallest bounding hypersphere of its training samples. We argue that in many cases hyperdisks are preferable to affine and convex hulls and hyperspheres: they bound the classes more tightly than affine hulls or hyperspheres while avoiding much of the sample overfitting and computational complexity that is inherent in high-dimensional convex hulls. We show that the hyperdisk method can be kernelized to provide nonlinear classifiers based on non-Euclidean distance metrics. Experiments on several classification problems show promising results.

[Full paper]
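
Geometrically, the classifier assigns a query to the class whose hyperdisk (affine hull intersected with a bounding hypersphere) is closest. A rough sketch of that distance computation is below; for simplicity I center the bounding sphere at the class mean with radius equal to the farthest sample, rather than computing the true minimum enclosing hypersphere, and I skip the kernelized version entirely.

    import numpy as np

    def hyperdisk_distance(x, S):
        """Distance from x to {affine hull of rows of S} intersected with the
        ball around the class mean that contains all rows of S."""
        c = S.mean(axis=0)
        r = np.linalg.norm(S - c, axis=1).max()
        U, s, Vt = np.linalg.svd(S - c, full_matrices=False)
        B = Vt[s > 1e-10]                        # orthonormal basis of hull directions
        p = c + B.T @ (B @ (x - c))              # projection onto the affine hull
        if np.linalg.norm(p - c) > r:            # clamp to the sphere if outside
            p = c + r * (p - c) / np.linalg.norm(p - c)
        return np.linalg.norm(x - p)

    def nearest_hyperdisk_classify(x, class_samples):
        return int(np.argmin([hyperdisk_distance(x, S) for S in class_samples]))

    rng = np.random.default_rng(0)
    classes = [rng.standard_normal((20, 10)) + mu for mu in (0.0, 3.0)]
    print(nearest_hyperdisk_classify(rng.standard_normal(10) + 3.0, classes))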

Local Likelihood Modeling of Temporal Text Streams

Guy Lebanon and Yang Zhao

Temporal text data is often generated by a time-changing process or distribution. Such a drift in the underlying distribution cannot be captured by stationary likelihood techniques. We consider the application of local likelihood methods to generative and conditional modeling of temporal document sequences. We examine the asymptotic bias and variance and present an experimental study using the RCV1 dataset containing a temporal sequence of Reuters news stories.

[Full paper]
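
"Local likelihood" here means fitting, at each time point, a weighted maximum-likelihood estimate in which documents are weighted by a kernel in time. A tiny unigram (multinomial) example of that idea follows; the Gaussian time kernel, its bandwidth, and the add-one smoothing are my choices, not the paper's.

    import numpy as np

    def local_unigram(counts, times, t0, bandwidth=2.0, alpha=1.0):
        """Kernel-weighted multinomial MLE at time t0.
        counts: (n_docs, vocab) term counts; times: (n_docs,) timestamps."""
        w = np.exp(-0.5 * ((times - t0) / bandwidth) ** 2)    # temporal kernel
        weighted = (w[:, None] * counts).sum(axis=0) + alpha  # smoothed counts
        return weighted / weighted.sum()

    # toy stream: the second word becomes more frequent over time
    rng = np.random.default_rng(0)
    times = np.arange(20.0)
    probs = np.stack([np.linspace(0.8, 0.2, 20), np.linspace(0.2, 0.8, 20)], axis=1)
    counts = np.stack([rng.multinomial(100, p) for p in probs])
    print(local_unigram(counts, times, t0=1.0).round(2))
    print(local_unigram(counts, times, t0=18.0).round(2))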

Learning to Classify with Missing and Corrupted Features

Ofer Dekel and Ohad Shamir

After a classifier is trained using a machine learning algorithm and put to use in a real world system, it often faces noise which did not appear in the training data. Particularly, some subset of features may be missing or may become corrupted. We present two novel machine learning techniques that are robust to this type of classification-time noise. First, we solve an approximation to the learning problem using linear programming. We analyze the tightness of our approximation and prove statistical risk bounds for this approach. Second, we define the online-learning variant of our problem, address this variant using a modified Perceptron, and obtain a statistical learning algorithm using an online-to-batch technique. We conclude with a set of experiments that demonstrate the effectiveness of our algorithms.

[Full paper]

Fast Solvers and Efficient Implementations for Distance Metric Learning

Kilian Weinberger and Lawrence Saul

In this paper we study how to improve nearest neighbor classification by learning a Mahalanobis distance metric. We build on a recently proposed framework for distance metric learning known as large margin nearest neighbor (LMNN) classification. Within this framework, we focus specifically on the challenges in scalability and adaptability posed by large data sets. Our paper makes three contributions. First, we describe a highly efficient solver for the particular instance of semidefinite programming that arises in LMNN classification; our solver can handle problems with billions of large margin constraints in a few hours. Second, we show how to reduce both training and testing times using metric ball trees; the speedups from ball trees are further magnified by learning low dimensional representations of the input space. Third, we show how to learn different Mahalanobis distance metrics in different parts of the input space. For large data sets, these mixtures of locally adaptive metrics lead to even lower error rates.

[Full paper]

Random Classification Noise Defeats All Convex Potential Boosters

Philip M. Long and Rocco A. Servedio

A broad class of boosting algorithms can be interpreted as performing coordinate-wise gradient descent to minimize some potential function of the margins of a data set. This class includes AdaBoost, LogitBoost, and other widely used and well-studied boosters. In this paper we show that for a broad class of convex potential functions, any such boosting algorithm is highly susceptible to random classification noise. We do this by showing that for any such booster and any nonzero random classification noise rate R, there is a simple data set of examples which is efficiently learnable by such a booster if there is no noise, but which cannot be learned to accuracy better than 1/2 if there is random classification noise at rate R. This negative result is in contrast with known branching program based boosters which do not fall into the convex potential function framework and which can provably learn to high accuracy in the presence of random classification noise.

[Full paper]

Learning Diverse Rankings with Multi-Armed Bandits

Filip Radlinski, Robert Kleinberg, and Thorsten Joachims

Algorithms for learning to rank Web documents usually assume a document's relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two new learning algorithms that directly learn a diverse ranking of documents based on users' clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top k positions of a ranking. We show that one of our algorithms asymptotically achieves the best possible payoff obtainable in polynomial time even as users' interests change. The other performs better empirically when user interests are static, and is still theoretically near-optimal in that case.

[Full paper]
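
The construction is a bandit per rank position. The sketch below is a heavily simplified ranked-bandit loop with an epsilon-greedy bandit at each of the k slots, each rewarded only when its slot receives the first click; the click simulator and the epsilon-greedy choice are illustrative assumptions of mine and carry none of the paper's guarantees.

    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, k, eps, T = 10, 3, 0.1, 5000
    user_types = [{0}, {7}]            # two user types, each wants one document

    value = np.zeros((k, n_docs))      # per-slot estimated click value
    pulls = np.ones((k, n_docs))       # pull counts (start at 1 to avoid /0)

    for t in range(T):
        # each slot's bandit proposes a document; duplicates fall back to unused docs
        ranking, used = [], set()
        for slot in range(k):
            doc = (rng.integers(n_docs) if rng.random() < eps
                   else int(np.argmax(value[slot])))
            if doc in used:
                doc = next(d for d in range(n_docs) if d not in used)
            ranking.append(doc)
            used.add(doc)
        # simulated user clicks the first relevant document, if any
        relevant = user_types[t % 2]
        clicked = next((i for i, d in enumerate(ranking) if d in relevant), None)
        for slot, doc in enumerate(ranking):
            reward = 1.0 if slot == clicked else 0.0
            pulls[slot, doc] += 1
            value[slot, doc] += (reward - value[slot, doc]) / pulls[slot, doc]

    print("learned top-3:", [int(np.argmax(value[s])) for s in range(k)])

With two user types wanting different documents, the first two slots should end up covering both, which is the diversity effect the abstract describes.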

SVM Optimization: Inverse Dependence on Training Set Size

Shai Shalev-Shwartz and Nathan Srebro

We discuss how the runtime of SVM optimization should decrease as the size of the training data increases. We present theoretical and empirical results demonstrating how a simple subgradient descent approach indeed displays such behavior, at least for linear kernels.

[Full paper]
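
The "simple subgradient descent approach" is in the Pegasos family: sample one example per step and take a subgradient step on the regularized hinge loss with step size 1/(lambda*t). A minimal sketch on made-up data, without the optional projection step:

    import numpy as np

    def pegasos_like(X, y, lam=0.01, T=10000, seed=0):
        """Stochastic subgradient descent on lam/2*||w||^2 + mean hinge loss."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(n)
            eta = 1.0 / (lam * t)                  # standard Pegasos step size
            grad = lam * w
            if y[i] * (w @ X[i]) < 1.0:            # hinge subgradient is active
                grad -= y[i] * X[i]
            w -= eta * grad
        return w

    rng = np.random.default_rng(2)
    X = rng.standard_normal((500, 10))
    y = np.sign(X @ rng.standard_normal(10))
    w = pegasos_like(X, y)
    print("training accuracy:", (np.sign(X @ w) == y).mean())

The point of the paper is that the number of such steps needed for a fixed target accuracy does not grow with the training set size, so more data can mean less optimization work.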

Training Structural SVMs when Exact Inference is Intractable

Thomas Finley and Thorsten Joachims

While discriminative training (e.g., CRF, structural SVM) holds much promise for machine translation, image segmentation, and clustering, the complex inference these applications require makes exact training intractable. This leads to a need for approximate training methods. Unfortunately, knowledge about how to perform efficient and effective approximate training is limited. Focusing on structural SVMs, we provide and explore two different classes of approximate training algorithms, which we call undergenerating (e.g., greedy) and overgenerating (e.g., relaxations) algorithms. We provide a theoretical and empirical analysis of both types of approximately trained structural SVMs, focusing on fully connected pairwise Markov random fields. We find that models trained with overgenerating methods have theoretical advantages over undergenerating methods, are empirically robust relative to their undergenerating brethren, and that models trained with relaxations favor non-fractional predictions from relaxed predictors.

[Full paper]

Graph Transduction via Alternating Minimization

Jun Wang, Tony Jebara, and Shih-Fu Chang

Graph transduction methods label input data by learning a classification function that is regularized to exhibit smoothness along a graph over labeled and unlabeled samples. In practice, these algorithms are sensitive to the initial set of labels provided by the user. For instance, classification accuracy drops if the training set contains weak labels, if imbalances exist across label classes, or if the labeled portion of the data is not chosen at random. This paper introduces a propagation algorithm that more reliably minimizes a cost function over both a function on the graph and a binary label matrix. The cost function generalizes prior work in graph transduction and also introduces node normalization terms for resilience to label imbalances. We demonstrate that global minimization of the function is intractable, but instead provide an alternating minimization scheme that incrementally adjusts the function and the labels towards a reliable local minimum. Unlike prior methods, the resulting propagation of labels does not prematurely commit to an erroneous labeling and obtains more consistent labels. Experiments are shown for synthetic and real classification tasks including digit and text recognition. A substantial improvement in accuracy compared to state-of-the-art semi-supervised methods is achieved. The advantages are even more dramatic when labeled instances are limited.

[Full paper]

Grassmann Discriminant Analysis: a Unifying View on Subspace-Based Learning

Jihun Hamm and Daniel Lee

In this paper we propose a discriminant learning framework for problems in which data consist of linear subspaces instead of vectors. By treating subspaces as basic elements, we can make learning algorithms adapt naturally to the problems with linear invariant structures. We propose a unifying view on the subspace-based learning method by formulating the problems on the Grassmann manifold, which is the set of fixed-dimensional subspaces of a Euclidean space. Previous methods on the problem typically adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean dissimilarity measures are used. In our approach, we treat each subspace as a point in the Grassmann space, and perform feature extraction and classification in the same space. We show the feasibility of the approach by using Grassmann kernel functions such as the Projection kernel and the Binet-Cauchy kernel. Experiments with real image databases show that the proposed method performs well compared with state-of-the-art algorithms.

[Full paper]
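
The Projection kernel mentioned in the abstract has an implementation short enough to quote: represent each subspace by an orthonormal basis Y and set k(Y1, Y2) = ||Y1^T Y2||_F^2. The random-subspace test data below is mine.

    import numpy as np

    def orthonormal_basis(A):
        """Orthonormal basis for the column space of A."""
        Q, _ = np.linalg.qr(A)
        return Q

    def projection_kernel(Y1, Y2):
        """Grassmann Projection kernel: k(Y1, Y2) = ||Y1^T Y2||_F^2."""
        return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

    rng = np.random.default_rng(0)
    d, p = 20, 3                                    # 3-dimensional subspaces of R^20
    Y1 = orthonormal_basis(rng.standard_normal((d, p)))
    Y2 = orthonormal_basis(rng.standard_normal((d, p)))
    print(projection_kernel(Y1, Y1))                # equals p for identical subspaces
    print(projection_kernel(Y1, Y2))                # smaller for different subspaces

The value depends only on the subspaces, not on which orthonormal bases were chosen, which is what makes it a kernel on the Grassmann manifold.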

On-line Discovery of Temporal-Difference Networks

Takaki Makino and Toshihisa Takagi

We present an algorithm for on-line, incremental discovery of temporal-difference (TD) networks. The key contribution is the establishment of three criteria to expand a node in TD network: a node is expanded when the node is well-known, independent, and has a prediction error that requires further explanation. Since none of these criteria requires centralized calculation operations, they are easily computed in a parallel and distributed manner, and scalable for bigger problems compared to other discovery methods of predictive state representations. Through computer experiments, we demonstrate the empirical effectiveness of our algorithm.

[Full paper]

Confidence-Weighted Linear Classification

Mark Dredze, Koby Crammer, and Fernando Pereira

We introduce confidence-weighted linear classifiers, a new class of algorithms that maintain confidence information about classifier parameters. Learning in this framework updates parameters by estimating weights and increasing model confidence. We investigate a new online algorithm that maintains a Gaussian distribution over weight vectors, updating the mean and variance of the model with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state-of-the-art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.

[Full paper]

On the Chance Accuracies of Large Collections of Classifiers

Mark Palatucci and Andrew Carlson

We provide a theoretical analysis of the chance accuracies of large collections of classifiers. We show that on problems with small numbers of examples, some classifier can perform well by random chance, and we derive a theorem to explicitly calculate this accuracy. We use this theorem to provide a principled feature selection criterion for sparse, high-dimensional problems. We evaluate this method on both microarray and fMRI datasets and show that it performs very close to the optimal accuracy obtained from an oracle. We also show that on the fMRI dataset this technique chooses relevant features successfully while another state-of-the-art method, the False Discovery Rate (FDR), completely fails at standard significance levels.

[Full paper]
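
The basic phenomenon is easy to reproduce: with n examples and N classifiers that each just guess, the best of the N can look very accurate, since P(max accuracy >= a) = 1 - P(Binomial(n, 1/2) < a*n)^N. The snippet below simply evaluates that expression with scipy's binomial CDF; it is not the paper's theorem or selection criterion.

    import numpy as np
    from scipy.stats import binom

    def chance_of_accuracy_at_least(a, n_examples, n_classifiers, p=0.5):
        """P(best of N guessing classifiers reaches accuracy >= a on n examples)."""
        k = int(np.ceil(a * n_examples)) - 1       # most correct guesses still below a
        return 1.0 - binom.cdf(k, n_examples, p) ** n_classifiers

    # with 30 examples and a million candidate classifiers, an apparent 80%
    # accuracy is almost guaranteed by chance alone...
    print(chance_of_accuracy_at_least(0.8, n_examples=30, n_classifiers=10**6))
    # ...while with 1000 examples it essentially never happens
    print(chance_of_accuracy_at_least(0.8, n_examples=1000, n_classifiers=10**6))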

Efficient Projections onto the L1-Ball for Learning in High Dimensions

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra

We describe efficient algorithms for projecting a vector onto the L1-ball. We present two methods for projection. The first performs exact projection in O(n) time, where n is the dimension of the space. The second works on vectors, k of whose elements are perturbed outside the L1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform state-of-the-art optimization techniques such as interior point methods. We also show that in online settings gradient updates with L1 projections outperform the EG algorithm while obtaining models with high degrees of sparsity.

[Full paper]
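
The sorting-based version of this projection is short enough to write down from the description; the paper's contribution includes an expected O(n) variant, but the O(n log n) one below already conveys the idea (the test vector is made up).

    import numpy as np

    def project_to_l1_ball(v, z=1.0):
        """Euclidean projection of v onto {w : ||w||_1 <= z}, sorting-based variant."""
        if np.abs(v).sum() <= z:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]                    # sorted magnitudes, descending
        cssv = np.cumsum(u)
        rho = np.nonzero(u - (cssv - z) / np.arange(1, len(u) + 1) > 0)[0][-1]
        theta = (cssv[rho] - z) / (rho + 1.0)           # soft-threshold level
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    v = np.array([0.8, -0.7, 0.05, 0.01])
    w = project_to_l1_ball(v, z=1.0)
    print(w, np.abs(w).sum())                           # ||w||_1 == 1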

Tailoring Density Estimation via Reproducing Kernel Moment Matching

Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schoelkopf

Moment matching is a popular means of parametric density estimation. We extend this technique to nonparametric estimation of mixture models. Our approach works by embedding distributions into a reproducing kernel Hilbert space, and performing moment matching in that space. This allows us to tailor density estimators to a function class of interest (i.e., for which we would like to compute expectations). We show our density estimation approach is useful in applications such as message compression in graphical models, and image classification and retrieval.

[Full paper]

Graph Kernels Between Point Clouds

Francis Bach

Point clouds are sets of points in two or three dimensions. Most kernel methods for learning on sets of points have not yet dealt with the specific geometrical invariances and practical constraints associated with point clouds in computer vision and graphics. In this paper, we present extensions of graph kernels for point clouds, which make it possible to use kernel methods for objects such as shapes, line drawings, or any three-dimensional point clouds. In order to design rich and numerically efficient kernels with as few free parameters as possible, we use kernels between covariance matrices and their factorizations on graphical models. We derive polynomial-time dynamic programming recursions and present applications to the recognition of handwritten digits and Chinese characters from few training examples.

[Full paper]

Learning Dissimilarities by Ranking: From SDP to QP

Hua Ouyang and Alexander Gray

We consider the problem of learning dissimilarities between points via formulations which preserve a specified ordering between points rather than the numerical values of the dissimilarities. Dissimilarity ranking (d-ranking) learns from instances like "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H". Three formulations of d-ranking problems are presented and new algorithms are presented for two of them, one by semidefinite programming (SDP) and one by quadratic programming (QP). Among the novel capabilities of these approaches are out-of-sample prediction and scalability to large problems.

[Full paper]

The Skew Spectrum of Graphs

Risi Kondor and Karsten Borgwardt

The central issue in representing graph-structured data instances in learning algorithms is designing features which are invariant to permuting the numbering of the vertices. We present a new system of invariant graph features which we call the skew spectrum of graphs. The skew spectrum is based on mapping the adjacency matrix to a function on the symmetric group and computing bispectral invariants. The reduced form of the skew spectrum is computable in O(n^3) time, and experiments show that on several benchmark datasets it can outperform state-of-the-art graph kernels.

[Full paper]

Accurate Max-margin Training for Structured Output Spaces

Sunita Sarawagi and Rahul Gupta

Tsochantaridis et al. (2005) proposed two formulations for maximum margin training over structured output spaces: margin scaling and slack scaling. While margin scaling has been extensively used since it requires the same kind of MAP inference as normal structured prediction, slack scaling is believed to be more accurate and better-behaved. We present an efficient variational approximation to the slack scaling method that solves its inference bottleneck while retaining its accuracy advantage over margin scaling. We further argue that existing scaling approaches do not separate the true labeling comprehensively while generating violating constraints. We propose a new max-margin trainer, PosLearn, that generates violators to ensure separation at each position of a decomposable loss function. Empirical results on real datasets illustrate that PosLearn can reduce test error by up to 25%. Further, PosLearn violators can be generated more efficiently than slack violators; for many structured tasks the time required is just twice that of MAP inference.

[Full paper]

Optimized Cutting Plane Algorithm for Support Vector Machines

Vojtech Franc and Soeren Sonnenburg

We have developed a new linear Support Vector Machine (SVM) training algorithm called OCAS. Its computational effort scales linearly with the sample size. In an extensive empirical evaluation, OCAS significantly outperforms current state-of-the-art SVM solvers such as SVMLight, SVMPerf, and BMRM, achieving speedups of over 1,000 over SVMLight and of 20 over SVMPerf on some datasets, while obtaining the same precise support vector solution. Even in the early optimization steps, OCAS often shows faster convergence than SGD and Pegasos, the approximative methods that have so far prevailed in this domain. By effectively parallelizing OCAS, we were able to train on a dataset of 15 million examples (about 32GB in size) in just 671 seconds --- a competing string kernel SVM required 97,484 seconds to train on 10 million examples sub-sampled from this dataset.

[Full paper]

Self-taught Clustering

Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu

This paper focuses on a new clustering task, called self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning, which aims at clustering a small collection of target unlabeled data with the help of a large amount of auxiliary unlabeled data. The target and auxiliary data can be different in topic distribution. We show that even when the target data are not sufficient to allow effective learning of a high quality feature representation, it is possible to learn the useful features with the help of the auxiliary data on which the target data can be clustered effectively. We propose a co-clustering based self-taught clustering algorithm to tackle this problem, by clustering the target and auxiliary data simultaneously to allow the feature representation from the auxiliary data to influence the target data through a common set of features. Under the new data representation, clustering on the target data can be improved. Our experiments on image clustering show that our algorithm can greatly outperform several state-of-the-art clustering methods when utilizing irrelevant unlabeled auxiliary data.

[Full paper]

A Quasi-Newton Approach to Nonsmooth Convex Optimization

Jin Yu, S.V.N. Vishwanathan, Simon Guenter, and Nicol Schraudolph

We extend the well-known BFGS quasi-Newton method and its limited-memory variant (LBFGS) to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We apply the resulting sub(L)BFGS algorithm to L2-regularized risk minimization with binary hinge loss, and its direction-finding component to L1-regularized risk minimization with logistic loss. In both settings, our generic algorithms perform comparably to or better than their counterparts in specialized state-of-the-art solvers.

[Full paper]

Predicting Diverse Subsets Using Structural SVMs

Yisong Yue and Thorsten Joachims

In many retrieval tasks, one important goal involves retrieving a diverse set of results (e.g., documents covering a wide range of topics for a search query). First of all, this reduces redundancy, effectively presenting more information with the presented results. Secondly, search queries are often ambiguous at some level. For example, the query “Jaguar” can refer to many different topics (such as the car or the feline). A set of documents with high topic diversity ensures that fewer users abandon the query because none of the results are relevant to them. Unlike existing approaches to learning retrieval functions, we present a method that explicitly trains to diversify results. In particular, we formulate the learning problem of predicting a diverse subset and derive a training algorithm based on structural SVMs.

[Full paper]

Improved Nystrom Low-Rank Approximation and Error Analysis

Kai Zhang, Ivor Tsang, and James Kwok

Low-rank matrix approximation is an effective tool for alleviating the memory and computational burdens of kernel methods, and sampling, as the mainstream of such approximation algorithms, has drawn considerable attention in both theory and practice. This paper presents detailed studies of the Nystrom sampling scheme and, in particular, an error analysis that directly relates the Nystrom approximation quality to the encoding power of the landmark points in summarizing the data. The resultant error bound suggests a simple and efficient sampling scheme, the k-means clustering algorithm, for Nystrom low-rank approximation. We compare it with state-of-the-art approaches that range from greedy schemes to probabilistic sampling. Our algorithm achieves significant performance gains in a number of supervised/unsupervised learning tasks including kernel PCA and least squares SVM.

[Full paper]
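
The practical recipe that comes out of the analysis is: pick the m landmark points as k-means centers, then apply the usual Nystrom formula K ≈ K_nm K_mm^+ K_nm^T. A small RBF-kernel sketch follows; using scikit-learn's KMeans and the pseudo-inverse are conveniences of mine, not the authors' code.

    import numpy as np
    from sklearn.cluster import KMeans

    def rbf(A, B, gamma=0.5):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def nystrom_kmeans(X, m=20, gamma=0.5):
        """Nystrom approximation with k-means centers as the landmark points."""
        centers = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X).cluster_centers_
        K_nm = rbf(X, centers, gamma)             # n x m cross kernel
        K_mm = rbf(centers, centers, gamma)       # m x m landmark kernel
        return K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5))
    K = rbf(X, X)
    K_approx = nystrom_kmeans(X)
    print("relative error:", np.linalg.norm(K - K_approx) / np.linalg.norm(K))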

Expectation-Maximization for Sparse and Non-Negative PCA

Christian David Sigg and Joachim M. Buhmann

We study the problem of finding the dominant eigenvector of the sample covariance matrix, under additional constraints on its elements: a cardinality constraint limits the number of non-zero elements, and non-negativity forces the elements to have equal sign. This problem is known as sparse and non-negative principal component analysis (PCA), and has many applications including dimensionality reduction and feature selection. Based on expectation-maximization for probabilistic PCA, we present an algorithm for any combination of these constraints. Its complexity is at most quadratic in the number of dimensions of the data. We demonstrate significant improvements in performance and computational efficiency compared to the state-of-the-art, using large data sets from biology and computer vision.

[Full paper]
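
A very rough sketch of the flavor of algorithm described: EM for single-component probabilistic PCA reduces, in the zero-noise limit, to alternating least-squares updates, and the constraints can be imposed by projecting the weight vector after each M-step. The projections used below (clipping negatives, hard-thresholding to the k largest entries) are my simplifications and not necessarily the authors' constraint handling.

    import numpy as np

    def em_constrained_pc(X, card=None, nonneg=False, n_iter=200, seed=0):
        """Leading 'principal component' under optional cardinality and
        non-negativity constraints, via EM-style alternating updates."""
        rng = np.random.default_rng(seed)
        Xc = X - X.mean(axis=0)
        w = np.abs(rng.standard_normal(Xc.shape[1]))
        for _ in range(n_iter):
            z = Xc @ w / (w @ w)                  # E-step: latent scores
            w = Xc.T @ z / (z @ z)                # M-step: new direction
            if nonneg:
                w = np.maximum(w, 0.0)            # project onto the non-negative orthant
            if card is not None:                  # keep only the card largest entries
                w[np.argsort(np.abs(w))[:-card]] = 0.0
            nrm = np.linalg.norm(w)
            if nrm == 0.0:                        # constraints wiped the vector out
                break
            w /= nrm
        return w

    rng = np.random.default_rng(3)
    X = rng.standard_normal((500, 10))
    X[:, :3] += 3 * rng.standard_normal((500, 1))   # shared variance on 3 features
    print(em_constrained_pc(X, card=3, nonneg=True).round(2))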

Training SVM with Indefinite Kernels

Jianhui Chen and Jieping Ye

Similarity matrices generated from many applications may not be positive semidefinite, and hence can't fit into the kernel machine framework. In this paper, we study the problem of training support vector machines with an indefinite kernel. We consider a regularized SVM formulation, in which the indefinite kernel matrix is treated as a noisy observation of some unknown positive semidefinite one (proxy kernel) and the support vectors and the proxy kernel can be computed simultaneously. We propose a semi-infinite quadratically constrained linear program formulation for the optimization, which can be solved iteratively to find a global optimum solution. We further propose to employ an additional pruning strategy, which significantly improves the efficiency of the algorithm, while retaining the convergence property of the algorithm. In addition, we show the close relationship between the proposed formulation and multiple kernel learning. Experiments on a collection of benchmark data sets demonstrate the efficiency and effectiveness of the proposed algorithm.

[Full paper]
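
For context, the simplest way to get a positive semidefinite proxy from an indefinite similarity matrix is to clip its negative eigenvalues, a common baseline in this literature; the paper's point is to optimize the proxy kernel jointly with the SVM instead. The sketch below is only that clipping baseline, not the paper's algorithm.

    import numpy as np

    def spectrum_clip(S):
        """Nearest PSD matrix to a symmetric similarity matrix S in Frobenius
        norm: drop the negative part of its eigenspectrum."""
        S = 0.5 * (S + S.T)                       # symmetrize first
        evals, evecs = np.linalg.eigh(S)
        return (evecs * np.maximum(evals, 0.0)) @ evecs.T

    # toy indefinite similarity matrix (it has a negative eigenvalue)
    S = np.array([[1.0, 0.9, -0.8],
                  [0.9, 1.0, 0.9],
                  [-0.8, 0.9, 1.0]])
    K = spectrum_clip(S)
    print(np.linalg.eigvalsh(S).round(2), np.linalg.eigvalsh(K).round(2))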

An RKHS for Multi-View Learning and Manifold Co-Regularization

Vikas Sindhwani and David Rosenberg

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent “co-regularization” norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

[Full paper]

A Generalization of Haussler's Convolution Kernel - Mapping Kernel

Kilho Shin and Tetsuji Kuboyama

Haussler's convolution kernel provides a successful framework for engineering new positive semidefinite kernels, and has been applied to a wide range of data types and applications. In the framework, each data object represents a finite set of finer grained components. Then, Haussler's convolution kernel takes a pair of data objects as input, and returns the sum of the return values of the predetermined primitive kernel calculated for all the possible pairs of the components of the input data objects. Due to the definition, Haussler's convolution kernel is also known as the cross product kernel, and is positive semidefinite, if so is the primitive kernel. On the other hand, the mapping kernel that we introduce in this paper is a natural generalization of Haussler's convolution kernel, in that the input to the primitive kernel moves over a predetermined subset rather than the entire cross product. Although we have plural instances of the mapping kernel in the literature, their positive semidefiniteness was investigated in case-by-case manners, and worse yet, was sometimes incorrectly concluded. In fact, there exists a simple and easily checkable necessary and sufficient condition, which is generic in the sense that it enables us to investigate the positive semidefiniteness of an arbitrary instance of the mapping kernel. This is the first paper that presents and proves the validity of the condition. In addition, we introduce two important instances of the mapping kernel, which we refer to as the size-of-index-structure-distribution kernel and the edit-cost-distribution kernel. Both of them are naturally derived from well known (dis)similarity measurements in the literature (e.g. the maximum agreement tree, the edit distance), and are reasonably expected to improve the performance of the existing measures by evaluating their distributional features rather than their peak (maximum/minimum) features.

[Full paper]
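
The definition is directly implementable: where Haussler's convolution kernel sums a primitive kernel over all component pairs, the mapping kernel sums it only over a chosen subset of pairs. Here is a toy sketch with sequences of scalars as the objects; the aligned-index subset and the Gaussian primitive kernel are illustrative choices of mine.

    import numpy as np

    def kappa(a, b):
        """Primitive kernel on components (here: scalars)."""
        return np.exp(-(a - b) ** 2)

    def convolution_kernel(X, Y):
        """Haussler's convolution / cross product kernel: sum over all pairs."""
        return sum(kappa(x, y) for x in X for y in Y)

    def mapping_kernel(X, Y, pairs):
        """Mapping kernel: sum only over the pairs selected by pairs(X, Y)."""
        return sum(kappa(X[i], Y[j]) for i, j in pairs(X, Y))

    # example subset: only compare components at the same position
    aligned = lambda X, Y: [(i, i) for i in range(min(len(X), len(Y)))]

    X, Y = [0.0, 1.0, 2.0], [0.1, 1.2, 5.0]
    print(convolution_kernel(X, Y), mapping_kernel(X, Y, aligned))

Whether such a restricted sum stays positive semidefinite is exactly what the paper's necessary and sufficient condition settles.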

Composite Kernel Learning

Marie Szafranski, Yves Grandvalet, and Alain Rakotomamonjy

The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.

[Full paper]

Nonnegative Matrix Factorization via Rank-One Downdate

Michael Biggs, Ali Ghodsi, and Stephen Vavasis

Nonnegative matrix factorization (NMF) was popularized as a tool for data mining by Lee and Seung in 1999. NMF attempts to approximate a matrix with nonnegative entries by a product of two low-rank matrices, also with nonnegative entries. We propose an algorithm called rank-one downdate (R1D) for computing an NMF that is partly motivated by singular value decomposition. This algorithm computes the dominant singular values and vectors of adaptively determined submatrices of a matrix. On each iteration, R1D extracts a rank-one submatrix from the dataset according to an objective function. We establish a theoretical result that maximizing this objective function corresponds to correctly classifying articles in a nearly separable corpus. We also provide computational experiments showing the success of this method in identifying features in realistic datasets. The method is much faster than either LSI or other NMF routines.

[Full paper]


Closed-form Supervised Dimensionality Reduction with Generalized Linear Models

Irina Rish, Genady Grabarnik, Guillermo Cecchi, Francisco Pereira, and Geoffrey J. Gordon

We propose a family of supervised dimensionality reduction (SDR) algorithms that combine feature extraction (dimensionality reduction) with learning a predictive model in a unified optimization framework, using data- and class-appropriate generalized linear models (GLMs), and handling both classification and regression problems. Our approach uses simple closed-form update rules and is provably convergent. Promising empirical results are demonstrated on a variety of high-dimensional datasets.

[Full paper]