Daddy's Technology Notes

Read, think, and write down the notes.

Saturday, July 16, 2005

Feature selection

Discriminant Analysis

Constructing new features via linear combinations of the existing features so that classification is easier. The notion of "easier" is quantified by Fisher's criterion, which is exactly optimal for Gaussian classes with equal covariance and approximately so for other models. Variants like flexible discriminant analysis consider nonlinear combinations as well. See Duda & Hart, "Eigenfaces vs. Fisherfaces", and "Flexible Discriminant and Mixture Models".
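
A quick NumPy sketch of the two-class case (illustrative only; the function name and inputs are mine, not from the references above):

import numpy as np

def fisher_direction(X1, X2):
    """Illustrative sketch: Fisher's discriminant direction for two classes.

    X1, X2 are arrays of shape (n1, d) and (n2, d) holding the samples of
    each class.  Returns the unit vector w maximizing Fisher's criterion
    (between-class scatter divided by within-class scatter).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices.
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Closed form: w is proportional to S_w^{-1} (m1 - m2).
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)

Classification then amounts to thresholding the one-dimensional projection X @ w.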

Principal Component Analysis

Constructing new features which are the principal components of a data set. The principal components are random variables of maximal variance constructed from linear combinations of the input features. Equivalently, they are the projections onto the principal component axes, which are lines that minimize the average squared distance to each point in the data set. To ensure uniqueness, all of the principal component axes must be orthogonal. PCA is a maximum-likelihood technique for linear regression in the presence of Gaussian noise on both x and y. In some cases, PCA corresponds to a Fourier transform, such as the DCT used in JPEG image compression. See "Eigenfaces for Recognition" (Turk & Pentland, Journal of Cognitive Neuroscience 3(1), 1991), Bishop, and "Probabilistic Principal Component Analysis".
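
A quick NumPy sketch of PCA via the singular value decomposition of the centered data (illustrative only; the function name is mine):

import numpy as np

def pca(X, k):
    """Illustrative sketch: top-k principal component axes and projections."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:k]                              # orthonormal axes of maximal variance
    scores = Xc @ axes.T                       # projections = the new features
    explained_var = s[:k] ** 2 / (len(X) - 1)  # variance along each axis
    return axes, scores, explained_var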

Principal curve

A nonlinear principal component axis. Principal curves are smooth curves that minimize the average squared orthogonal distance to each point in a data set. Fitting a principal curve is a maximum-likelihood technique for nonlinear regression in the presence of Gaussian noise on both x and y. Principal points are a set of points that minimize the average squared distance from each data point to the nearest of them (they are the cluster centers found by k-means). See "Principal Curves" (Hastie & Stuetzle, Journal of the American Statistical Association 84:502-516, 1989) and "Learning and Design of Principal Curves".
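
Below is a rough NumPy-only sketch of a Hastie-Stuetzle-style iteration, assuming a simple running-mean smoother and a nearest-point projection; it conveys the flavor of the algorithm rather than reproducing the authors' version:

import numpy as np

def principal_curve(X, n_iter=10, span=0.1):
    """Illustrative sketch of a Hastie-Stuetzle-style principal curve fit.

    The curve is represented by one point per data point.  Each iteration
    (a) orders the points by their current parameter values, (b) smooths
    each coordinate along that ordering with a running mean, and
    (c) re-parameterizes each data point by the rank of its nearest curve
    point along the ordering.
    """
    n, d = X.shape
    window = max(3, int(span * n))
    # Initialize the parameter with the projection onto the first principal axis.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]
    curve = np.empty_like(X, dtype=float)
    for _ in range(n_iter):
        order = np.argsort(lam)
        rank = np.empty(n)
        rank[order] = np.arange(n)      # position of each curve point along the curve
        # Smoothing step: running mean of each coordinate along the ordering.
        for j in range(d):
            curve[order, j] = np.convolve(
                X[order, j], np.ones(window) / window, mode="same")
        # Projection step: each data point takes the parameter (rank) of its
        # nearest curve point.
        dists = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=-1)
        lam = rank[dists.argmin(axis=1)]
    return curve, lam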

Factor analysis

A generalization of PCA which is based explicitly on maximum likelihood. As in PCA, each data point is assumed to arise by sampling a point in a subspace and then perturbing it with full-dimensional Gaussian noise. The difference is that factor analysis allows the noise to have an arbitrary diagonal covariance matrix, while PCA assumes the noise is spherical. In addition to estimating the subspace, factor analysis estimates the noise covariance matrix. See "The EM Algorithm for Mixtures of Factor Analyzers".
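
A small sketch contrasting the two noise models on synthetic data, using scikit-learn (my own choice of tool, not mentioned above):

import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

# Illustrative sketch: 2 latent factors embedded in 5 dimensions, with a
# different noise variance on each coordinate (diagonal, not spherical).
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                   # latent factors
W = rng.normal(size=(2, 5))                     # factor loadings
noise_sd = np.array([0.1, 0.5, 1.0, 0.2, 0.8])
X = Z @ W + rng.normal(size=(500, 5)) * noise_sd

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.noise_variance_)     # one noise variance per coordinate (roughly noise_sd**2)

ppca = PCA(n_components=2).fit(X)
print(ppca.noise_variance_)   # a single spherical noise variance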

Independent Component Analysis

Constructing new features which are the independent components of a data set. The independent components are random variables of minimum entropy constructed from linear combinations of the input features. The entropy is normalized by the variance of the component, so absolute scale doesn't matter. It is a fact of information theory that such variables will be as independent as possible. This feature extraction technique is closely related to exploratory projection pursuit, commonly used for visualization. See "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis" (with software), "Independent Component Analysis: A flexible non-linearity and decorrelating manifold approach" and the ICA Web site.
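
A small sketch of the usual unmixing demo with scikit-learn's FastICA (scikit-learn and the mixing setup are my own choices for illustration):

import numpy as np
from sklearn.decomposition import FastICA

# Illustrative sketch: two independent non-Gaussian sources, linearly mixed
# into two observed features, then recovered by FastICA.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),       # square wave
          rng.laplace(size=t.size)]     # heavy-tailed source
A = np.array([[1.0, 0.5],
              [0.6, 1.0]])              # mixing matrix
X = S @ A.T                             # observed features

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # recovered sources, up to scale and permutation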

Clustering

Grouping similar objects in a multidimensional space. It is useful for constructing new features which are abstractions of the existing features. Some algorithms, like k-means, simply partition the feature space. Other algorithms, like single-link agglomeration, create nested partitionings which form a taxonomy. Another possibility is to learn a graph structure between the partitions, as in the Growing Neural Gas algorithm. The quality of the clustering depends crucially on the distance metric in the space. Most techniques are very sensitive to irrelevant features, so they should be combined with feature selection. See Richard Duda's notes on clustering, Duda & Hart, and "Clustering sequences with hidden Markov models".
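
For instance, a quick sketch of single-link agglomeration with SciPy (my own choice of tool; the data are synthetic):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Illustrative sketch: two well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])

# Single-link agglomeration builds a nested hierarchy (a dendrogram);
# cutting it at a chosen number of clusters gives a flat partition.
Z = linkage(X, method="single", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")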

K-means

A parametric algorithm for clustering data into exactly k clusters. First, choose initial cluster parameters (for k-means, the k cluster centers). Second, assign each data point to the nearest cluster. Third, recompute better cluster parameters given the assignment (for k-means, the mean of each cluster). Iterate back to step two until the assignments no longer change. It is a special case of the Expectation-Maximization algorithm for fitting a mixture of Gaussians. See Richard Duda's notes on clustering and Duda & Hart.
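
A bare-bones NumPy sketch of these steps (illustrative; the random initialization and the stopping rule are simple choices, not the only ones):

import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    """Illustrative sketch: plain k-means by alternating assignment and update."""
    rng = np.random.default_rng(seed)
    # Step 1: initial cluster parameters = k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                        # assignments (and centers) have converged
        centers = new_centers
    return centers, labels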

Feature selection

Not extracting new features but rather removing features which seem irrelevant for modeling. This is a combinatorial optimization problem. The direct approach (the "wrapper" method) retrains and re-evaluates a given model for many different feature sets. An approximation (the "filter" method) instead optimizes simple criteria which tend to improve performance. The two simplest optimization methods are forward selection (keep adding the best feature) and backward elimination (keep removing the worst feature). See "Selecting relevant features in machine learning", "Feature Selection Evaluation", "Toward Optimal Feature Selection", a bibliography of feature selection in machine learning, and another bibliography.
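
A small sketch of wrapper-style forward selection, using scikit-learn with a logistic-regression classifier and cross-validated accuracy as the score (all of these are my own illustrative choices):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None):
    """Illustrative sketch: greedily add the feature that most improves the
    cross-validated score; stop when no remaining feature helps."""
    model = LogisticRegression(max_iter=1000)
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    while len(selected) < (max_features or n_features):
        scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                  for j in range(n_features) if j not in selected}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break                        # no remaining feature improves the model
        selected.append(j_best)
        best_score = scores[j_best]
    return selected

Backward elimination is the mirror image: start with all features and repeatedly drop the one whose removal hurts the score least.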
