Daddy's Technology Notes

Read, think, and write down the notes.

Monday, July 18, 2005

Note of "Debugging Applications for MS.Net and MS Windows": Chapter 2

Getting Started Debugging

1. Track changes until you throw away the project

  • 2 tools: version control and bug tracking;
  • Check the unit test programs into version control along with the product code;
  • Control changes with "green, yellow, and red times": in green time, new features are developed and checked in; yellow time is for bug fixes only, with no new features (put a detailed description of each fix in the check-in comment); red time is the release or code-freeze period, when every check-in needs the project manager's approval.
  • During red time, weigh the risk of each bug to decide whether its fix really has to be checked in.
  • Labelling: A. Label all internal milestones; B. Label any transition between green, yellow, and red development times; C. Label any build sent to someone outside the team; D. Label any time you branch a development tree in the version control software; E. Label after the daily build and smoke tests complete correctly.
  • To avoid trouble reproducing builds sent outside the team, archive the build tree on CD/DVD, including all source code and built binaries.

Note of "Debugging Applications for MS.Net and MS Windows": Chapter 1

Bugs: where they come from and how you solve them

1. What are bugs
  • Inconsistent user interfaces
  • Unmet expectations
  • Poor performance
  • Crashes or data corruption

2. Reasons for bugs

  • Short or impossible deadlines
  • The "code first, think later" approach
  • Misunderstood requirements
  • Engineer ignorance or improper training
  • Lack of commitment to quality

3. What is proper code review

  • Bad: A formal group review meeting
  • Good: one-on-one informal review, line by line
  • Good: Have junior developers review senior developers' code

4. Planning for debugging

4.1. Prerequisites

  • Skill set: know your project, language, technology/tools, operating systems, and CPU
  • Learn the skill set

4.2. Debug process

  • Duplicate the bug
  • Describe the bug
  • Always assume that the bug is yours
  • Divide and conquer
  • Think creatively
  • Leverage tools
  • Start heavy debugging
  • Verify that the bug is fixed
  • Learn and share

Saturday, July 16, 2005

Feature selection

Discriminant Analysis

Constructing new features via linear combination so that classification is easier. The notion of "easier" is quantified by Fisher's criterion, which applies exactly to Gaussian classes with equal variance and approximately to other models. Variants like Flexible discriminant analysis consider nonlinear combinations as well. See Duda&Hart, "Eigenfaces vs. Fisherfaces", and "Flexible Discriminant and Mixture Models".
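
For concreteness, Fisher's criterion for a candidate projection direction \vec w in the two-class case can be written (a standard formulation, stated here for reference rather than taken from the sources above) as

J(\vec w) = \frac{(\vec w \cdot (\vec \mu_1 - \vec \mu_0))^2}{\vec w^T \Sigma_1 \vec w + \vec w^T \Sigma_0 \vec w}

The discriminant direction is the \vec w that maximizes J; with equal class covariances the maximizer is proportional to \Sigma^{-1}(\vec \mu_1 - \vec \mu_0), which is exactly the LDA direction discussed further down.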

Principal Component Analysis

Constructing new features which are the principal components of a data set. The principal components are random variables of maximal variance constructed from linear combinations of the input features. Equivalently, they are the projections onto the principal component axes, which are lines that minimize the average squared distance to each point in the data set. To ensure uniqueness, all of the principal component axes must be orthogonal. PCA is a maximum-likelihood technique for linear regression in the presence of Gaussian noise on both x and y. In some cases, PCA corresponds to a Fourier transform, such as the DCT used in JPEG image compression. See "Eigenfaces for recognition" (Turk&Pentland, J Cognitive Neuroscience 3(1), 1991), Bishop and "Probabilistic Principal Component Analysis".
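
As a minimal sketch of the computation (plain NumPy; the function name and toy data are illustrative, not from any of the references), the principal component axes can be taken from the eigendecomposition of the sample covariance matrix:

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the top-k principal component axes."""
        X_centered = X - X.mean(axis=0)            # center the data
        cov = np.cov(X_centered, rowvar=False)     # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
        axes = eigvecs[:, order]                   # orthogonal principal component axes
        return X_centered @ axes, axes             # projections (scores) and axes

    X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # toy correlated data
    scores, axes = pca(X, k=2)

The axes returned this way are orthogonal, matching the uniqueness condition above.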

Principal curve

A nonlinear principal component axis. Principal curves are smooth curves that minimize the average squared orthogonal distance to each point in a data set. Fitting a principal curve is a maximum-likelihood technique for nonlinear regression in the presence of Gaussian noise on both x and y. Principal points are individual points that minimize the average distance to each point in a data set (they are the output of k-means). See "Principal Curves" (Hastie & Stuetzle, Journal of the American Statistical Association 84:502-516, 1989) and "Learning and Design of Principal Curves".

Factor analysis

A generalization of PCA which is based explicitly on maximum-likelihood. Like PCA, each data point is assumed to arise from sampling a point in a subspace and then perturbing it with full-dimensional Gaussian noise. The difference is that factor analysis allows the noise to have an arbitrary diagonal covariance matrix, while PCA assumes the noise is spherical. In addition to estimating the subspace, factor analysis estimates the noise covariance matrix. See "The EM Algorithm for Mixtures of Factor Analyzers".
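
A small illustration of the difference in noise assumptions, using scikit-learn as one convenient implementation (the library and the toy data are my own choice, not part of the note):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis, PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(500, 2))                     # points in a 2-D subspace
    loading = rng.normal(size=(2, 6))
    noise_std = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 2.0])   # one very noisy feature
    X = latent @ loading + rng.normal(size=(500, 6)) * noise_std

    fa = FactorAnalysis(n_components=2).fit(X)
    pca = PCA(n_components=2).fit(X)

    print(fa.noise_variance_)    # per-feature noise variances (diagonal covariance)
    print(pca.noise_variance_)   # a single spherical noise variance (probabilistic PCA view)

Factor analysis should attribute the large variance of the last feature to noise rather than to the subspace.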

Independent Component Analysis

Constructing new features which are the independent components of a data set. The independent components are random variables of minimum entropy constructed from linear combinations of the input features. The entropy is normalized by the variance of the component, so absolute scale doesn't matter. It is a fact of information theory that such variables will be as independent as possible. This feature extraction technique is closely related to exploratory projection pursuit, commonly used for visualization. See "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis" (with software), "Independent Component Analysis: A flexible non-linearity and decorrelating manifold approach" and the ICA Web site.
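
A minimal sketch of unmixing two linearly mixed signals, using scikit-learn's FastICA as a stand-in for the software referenced above (the signals and mixing matrix are made up for illustration):

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                        # sinusoidal source
    s2 = np.sign(np.sin(3 * t))               # square-wave source
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.5], [0.5, 2.0]])    # mixing matrix
    X = S @ A.T                               # observed mixtures

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)              # recovered sources, up to order and scale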

Clustering

Grouping similar objects in a multidimensional space. It is useful for constructing new features which are abstractions of the existing features. Some algorithms, like k-means, simply partition the feature space. Other algorithms, like single-link agglomeration, create nested partitionings which form a taxonomy. Another possibility is to learn a graph structure between the partitions, as in the Growing Neural Gas. The quality of the clustering depends crucially on the distance metric in the space. Most techniques are very sensitive to irrelevant features, so they should be combined with feature selection. See Richard Duda's notes on clustering, Duda&Hart, and "Clustering sequences with hidden Markov models".

K-means

A parametric algorithm for clustering data into exactly k clusters. First, define some initial cluster parameters. Second, assign data points to clusters. Third, recompute better cluster parameters, given the data assignment. Iterate back to step two. It is a special case of the Expectation-Maximization algorithm for fitting a mixture of Gaussians. See Richard Duda's notes on clustering and Duda&Hart.
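
A minimal NumPy sketch of that loop, with the cluster means as the parameters and Euclidean distance assumed (the helper below is illustrative, not from the cited notes):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]    # step 1: initial parameters
        for _ in range(n_iters):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                          # step 2: assign points
            new_centers = np.array([X[labels == j].mean(axis=0)    # step 3: recompute means
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    labels, centers = kmeans(np.random.randn(300, 2), k=3)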

Feature selection

Not extracting new features but rather removing features which seem irrelevant for modeling. This is a combinatorial optimization problem. The direct approach (the "wrapper" method) retrains and re-evaluates a given model for many different feature sets. An approximation (the "filter" method) instead optimizes simple criteria which tend to improve performance. The two simplest optimization methods are forward selection (keep adding the best feature) and backward elimination (keep removing the worst feature). See "Selecting relevant features in machine learning", "Feature Selection Evaluation", "Toward Optimal Feature Selection", a bibliography of feature selection in machine learning, and another bibliography.
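
A sketch of the "wrapper" style forward selection; the model (logistic regression) and the cross-validated scorer are arbitrary stand-ins from scikit-learn, chosen only to make the example runnable:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features):
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            # retrain and re-evaluate the model for each candidate feature set
            scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                         X[:, selected + [f]], y, cv=5).mean()
                      for f in remaining}
            best = max(scores, key=scores.get)    # keep adding the best feature
            selected.append(best)
            remaining.remove(best)
        return selected

Backward elimination is the mirror image: start with all features and keep removing the one whose removal hurts the score least.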

Linear discriminant analysis

Linear discriminant analysis (LDA) is sometimes known as Fisher's linear discriminant, after its inventor, Ronald A. Fisher, who published it in "The Use of Multiple Measurements in Taxonomic Problems" (1936). It is typically used as a feature extraction step before classification.

LDA is used for two-class classification: given a vector of observations \vec x, predict the probability of a binary class variable c. LDA is based on the following observation: if the densities p(\vec x|c=1) and p(\vec x|c=0) are both normal, with identical full-rank covariances but possibly different means, then a sufficient statistic for P(c|\vec x) is given by \vec x \cdot \vec w, where

\vec w = \Sigma^{-1} (\vec \mu_1 - \vec \mu_0)

That is, the probability of an input \vec x being in class c is purely a function of this dot product.

A nice property of this dot product is that, out of all possible one-dimensional projections, it maximizes the ratio of the distance between the projected means to the variance of the projected distributions. Thus, in some sense, this projection maximizes the signal-to-noise ratio.

In practice, this technique is applied by assuming that the two densities p(\vec x|c=1) and p(\vec x|c=0) have different means and a shared covariance, and then plugging in the maximum-likelihood or maximum a posteriori estimates of the means and covariance.
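
A minimal NumPy sketch of that recipe, using the maximum-likelihood plug-in estimates (the function name and toy data are illustrative):

    import numpy as np

    def lda_direction(X, c):
        """Estimate w = Sigma^{-1} (mu_1 - mu_0) from labeled data, c in {0, 1}."""
        X0, X1 = X[c == 0], X[c == 1]
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        # pooled covariance: the ML estimate under the shared-covariance assumption
        Sigma = (np.cov(X0, rowvar=False) * (len(X0) - 1)
                 + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X0) + len(X1))
        return np.linalg.solve(Sigma, mu1 - mu0)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(1, 1, size=(100, 3))])
    c = np.r_[np.zeros(100), np.ones(100)]
    w = lda_direction(X, c)
    scores = X @ w    # the sufficient statistic x . w for each point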

LDA can be generalized to multiple discriminant analysis, where c becomes a categorical variable with N possible states instead of only two. Analogously, if the class-conditional densities p(\vec x|c=i) are normal with a shared covariance, the sufficient statistics for P(c|\vec x) are the values of N projections onto the subspace spanned by the N means, affine projected by the inverse covariance matrix. These projections can be found by solving a generalized eigenvalue problem, where the numerator is the covariance matrix formed by treating the means as the samples, and the denominator is the shared covariance matrix.
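
A sketch of that generalized eigenvalue problem (scipy.linalg.eigh is used here as the generalized symmetric eigensolver; this illustrates the recipe, it is not code from the source):

    import numpy as np
    from scipy.linalg import eigh

    def multi_class_lda(X, c, n_dims):
        classes = np.unique(c)
        mu = X.mean(axis=0)
        # "numerator": covariance formed by treating the class means as the samples
        Sb = sum(np.sum(c == k) * np.outer(X[c == k].mean(axis=0) - mu,
                                           X[c == k].mean(axis=0) - mu)
                 for k in classes)
        # "denominator": shared within-class covariance
        Sw = sum(np.cov(X[c == k], rowvar=False) * (np.sum(c == k) - 1)
                 for k in classes) / len(X)
        # solve Sb v = lambda Sw v; eigh returns eigenvalues in ascending order
        vals, vecs = eigh(Sb, Sw)
        return vecs[:, np.argsort(vals)[::-1][:n_dims]]    # the top projection directions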