Overview
Component Analysis is an unsupervised or class-free approach to finding the most
informative or explanatory features in data. In particular, Principal Component Analysis
(PCA) substantially reduces the complexity of data in which a large number of variables
(e.g. thousands) are interrelated, such as in large-scale gene expression data obtained
across a variety of different samples or conditions. PCA accomplishes this by computing
a new, much smaller set of uncorrelated variables which best represent the original
data. PCA is a powerful, well-established technique for data reduction and visualization.
2D and 3D PCA plots often place objects with similar patterns near each other.
GeneLinker™ provides a choice of two orientations for PCA: Orientation by Genes or
Orientation by Samples. In brief, PCA oriented by genes is useful for distinguishing
sample classes or sample clusters, while PCA oriented by samples is useful for
distinguishing gene classes or gene sets.
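The two orientations can be illustrated with a minimal NumPy sketch. It is not GeneLinker's implementation: the toy data, the helper name pca_scores, and the choice of a plain covariance-based PCA (centering only, no scaling) are all assumptions made for illustration. It only shows that the same routine yields one point per sample when genes are the variables, and one point per gene when samples are the variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 6 samples (rows) x 50 genes (columns).
X = rng.normal(size=(6, 50))

def pca_scores(data, n_components=2):
    """Project the rows of `data` onto the leading principal
    components computed from its columns (the variables)."""
    centered = data - data.mean(axis=0)      # center each variable (column)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Orientation by genes: genes are the variables, so each sample becomes a point.
sample_points = pca_scores(X)      # shape (6, 2)

# Orientation by samples: samples are the variables, so each gene becomes a point.
gene_points = pca_scores(X.T)      # shape (50, 2)
```

Plotting sample_points would give a 2D view in which similar samples tend to fall near each other; plotting gene_points does the same for genes.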
Mathematical Details and Examples of Orientation
To understand how the two orientations (PCA by Genes and PCA by Samples) differ,
and what that difference means for interpretation, it is helpful to think about the
data analysis in terms of covariance matrices. A dataset can be thought of as a set of
distinct mathematical or statistical variables (e.g. columns), each measured over a
number of statistical samples (e.g. rows).
a) Genes vs. Genes (Orientation by Genes)
• Typically, genes are considered the mathematical or statistical variables and
samples are considered the statistical samples. The corresponding covariance
matrix (if it were computed) would carry the covariance of one gene vs. another
gene, assessed over the samples, and recorded for each pairwise combination
of genes (i.e., pairwise combinations of the statistical variables). Thus, if there
are n genes and m samples, the corresponding covariance matrix would be an
n by n matrix, each entry being the covariance of the ith gene vs. the
jth gene, with i and j running from 1 through n. The ith element along the diagonal of
this covariance matrix is simply the conventional variance of the ith variable, in
this case the variance of the ith gene over all the m samples.
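As a small illustration of the gene-vs-gene case, the sketch below builds the n by n covariance matrix for a toy dataset and checks that its diagonal holds the per-gene variances. The data and the sizes m and n are made up, and the layout (rows as samples, columns as genes) is an assumption matching the convention described above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 8                        # m samples, n genes (toy sizes)
X = rng.normal(size=(m, n))        # rows = samples, columns = genes

# rowvar=False tells NumPy that the columns (genes) are the variables.
gene_cov = np.cov(X, rowvar=False)     # shape (n, n)

# The ith diagonal entry is the variance of gene i over the m samples
# (np.cov uses the unbiased estimator, hence ddof=1 here).
gene_var = X.var(axis=0, ddof=1)
```

Here gene_cov[i, j] is the covariance of gene i vs. gene j, and np.diag(gene_cov) matches gene_var.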
b) Samples vs. Samples (Orientation by Samples)
• However, if the samples are considered to be the mathematical or statistical
variables, then the genes would play the role of the statistical samples. This
case is less typical, but is still useful for biological interpretation in some
situations (e.g., when the samples are different specific times of the cell cycle).
In this case, the corresponding covariance matrix (if we were to compute it)
would be an m by m matrix, each entry being the covariance of the ith
sample vs. the jth sample, assessed over the genes, with i and j now running
from 1 through m. Again, the ith element along the diagonal of this covariance
matrix is simply the conventional variance of the ith variable. In this case, it is
the variance of the ith sample (i.e., the ith mathematical or statistical variable)
over all the n genes (the statistical samples).
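The sample-vs-sample case can be sketched by writing the covariance out explicitly. The toy data and sizes are assumptions, as is the rows-as-samples, columns-as-genes layout; the point is only that each sample is centered over the n genes, which now play the role of the statistical samples.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 8                        # m samples, n genes (toy sizes)
X = rng.normal(size=(m, n))        # rows = samples, columns = genes

# Samples are the variables here, so center each sample (row) over the genes.
centered = X - X.mean(axis=1, keepdims=True)

# Covariance of sample i vs. sample j, assessed over the n genes.
sample_cov = centered @ centered.T / (n - 1)     # shape (m, m)

# The ith diagonal entry is the variance of sample i over the n genes.
sample_var = X.var(axis=1, ddof=1)
```

The same matrix is returned by np.cov(X), since NumPy treats rows as variables by default.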
In GeneLinker™, a Principal Component (PC) is defined as a mathematical entity (i.e.,
vector) computed from the data which is equivalent to a characteristic vector (i.e.,
GeneLinker Gold 3.1 / GeneLinker Platinum 2.1