Principal component analysis (PCA) is probably one of the most widely used tools in research data analysis. For a long time, it has been treated as little more than a convenient method for exploratory visualization and dimensionality reduction. But once you look at it closely, especially when a stubborn data problem forces you to, you start to notice that PCA sits at the crossroads of many ideas in statistics, geometry, optimization, and modeling.
It can be understood through distance minimization, variance preservation, matrix decomposition, latent-variable models, and even as a stepping stone toward factor analysis and independent component analysis. Once these pieces are connected, PCA stops being just a plotting trick and becomes a way of thinking about data.
From a line of numbers to a single point
Start with the simplest possible case: you have a bunch of numbers and want one number to represent them.
If you think in terms of distance, the question becomes: which point is closest to all the observed values? That is an optimization problem, and the answer depends on how distance is measured. If you minimize the sum of absolute deviations, you get the median. If you minimize the sum of squared deviations, differentiation leads you to the mean.
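A quick numerical check of that claim, as a minimal sketch: the sample values and the grid of candidate points are made up for illustration, and a brute-force search stands in for the calculus.

```python
import numpy as np

# Hypothetical sample with one large value to pull the mean and median apart.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate representative points on a fine grid covering the data.
candidates = np.linspace(values.min(), values.max(), 99001)

# Cost of each candidate under the two definitions of distance.
deviations = candidates[:, None] - values[None, :]
abs_cost = np.abs(deviations).sum(axis=1)
sq_cost = (deviations ** 2).sum(axis=1)

best_abs = candidates[abs_cost.argmin()]
best_sq = candidates[sq_cost.argmin()]

print("minimizer of absolute deviations:", best_abs, "| median:", np.median(values))
print("minimizer of squared deviations: ", best_sq, "| mean:  ", values.mean())
```

The grid search lands on the median for absolute deviations and on the mean for squared deviations, up to the grid resolution.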
Seen this way, reducing many one-dimensional observations to one representative number is already a form of dimensionality reduction: from 1 dimension down to 0 dimensions. If we restrict ourselves to squared error, then taking the mean is simply the 1D-to-0D version of the same logic that PCA uses.
From a plane to a line
Now move up to two dimensions. If a cloud of points lies in a plane, then reducing it by one dimension means representing it with a line; reducing it by two dimensions means representing it with a point.
The point obtained by reducing from 2D to 0D should naturally lie on the line obtained by reducing from 2D to 1D. Otherwise the reduced representation would lose coherence.
What does that line look like? It passes through the mean of all points. That may remind you of linear regression in two dimensions, since the ordinary least-squares regression line also passes through the mean point. But the regression line is not the PCA line.
The difference is subtle but fundamental. In least-squares regression, the quantity being minimized is the vertical deviation of the dependent variable from the line. In PCA, what is minimized is the perpendicular distance from all points to the line. A small difference in formulation leads to a different geometric object, and the solution method is not the same either.
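The two lines can be compared on a small synthetic cloud. This is only a sketch: the data, the random seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D cloud (assumed for illustration): y depends on x plus noise.
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
points = np.column_stack([x, y])

center = points.mean(axis=0)            # both lines pass through this point
cov = np.cov(points - center, rowvar=False)

# Regression of y on x minimizes vertical distances: slope = cov(x, y) / var(x).
ols_slope = cov[0, 1] / cov[0, 0]

# PCA minimizes perpendicular distances: direction = top eigenvector of cov.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]
pca_slope = pc1[1] / pc1[0]

print("regression slope:", ols_slope)
print("PCA line slope:  ", pca_slope)
```

Both lines share the mean point, but the PCA line is steeper here because it treats errors in x and y symmetrically, while regression attributes all of the error to y.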
Two ways to understand PCA
The core idea of PCA is often explained in two equivalent ways.
One is geometric: find a lower-dimensional subspace such that the projected points stay as close as possible to the original high-dimensional points.
The other is statistical: find the direction along which the projected data have the largest variance, then within the orthogonal directions find the next direction with the largest remaining variance, and so on.
Both viewpoints can be turned into optimization problems. In the usual treatment, the data are first centered, and standardized as well if the variables are measured on different scales; then a covariance matrix is computed. By solving for its eigenvalues and eigenvectors, one obtains the principal directions. The eigenvector associated with the largest eigenvalue gives the direction of maximum variance, which is equivalently the one-dimensional subspace that best preserves the data under projection. The remaining directions are chosen to be orthogonal to the previous ones and are selected in the same way.
Of course, computing the covariance matrix is not the only route. If you use singular value decomposition, you can bypass that step entirely. Either way, the analytical solution gives orthogonal components. If the resulting directions are not orthogonal, then it is not PCA.
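The equivalence of the two routes can be checked directly. The following is a minimal sketch on synthetic data; the matrix sizes and names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): 200 samples, 5 correlated variables.
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
centered = data - data.mean(axis=0)
n_samples = centered.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = centered.T @ centered / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data, with no covariance matrix formed.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = singular_values ** 2 / (n_samples - 1)

# The variances agree, and the directions agree up to the sign of each axis.
same_variances = np.allclose(eigvals, svd_variances)
same_directions = np.allclose(np.abs(eigvecs.T @ vt.T), np.eye(5), atol=1e-6)
orthogonal = np.allclose(vt @ vt.T, np.eye(5))
print(same_variances, same_directions, orthogonal)
```

The sign ambiguity is expected: an eigenvector and its negative describe the same direction.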
In principle, PCA can also be solved iteratively through latent-variable formulations, but when an analytical solution exists, there is little reason to approximate it numerically unless the modeling goal changes. Matrix-based computation is one of PCA’s biggest practical strengths, especially because some of those matrix operations can be distributed when the dataset becomes very large.
This also points to a broader lesson: multivariate methods may look interchangeable in theory, but they are rooted in different ideas and suit different kinds of data. In research, one may pursue the most theoretically satisfying method. In industry, computational cost and payoff may matter more.
Why normalization matters
In three dimensions, dimensionality reduction proceeds from space to a plane, then from that plane to a line, and finally from the line to a point.
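Now imagine multiplying all coordinates by 2. The locations of the representative plane, line, and point all move accordingly, although their directions stay the same. Multiply only one coordinate by 2, say by changing its unit, and the cloud stretches along that axis, so the best-fitting plane and line tilt toward it. That makes one fact easy to see: PCA is sensitive to scale.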
This is why measurements taken on different scales usually need to be normalized or standardized before PCA. Otherwise, variables with large numerical ranges will dominate the projection and pull the result away from the structure you actually care about.
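A small sketch of this effect, using made-up measurements: the variable names, units, and helper functions `first_pc` and `standardize` are all hypothetical and exist only for this illustration.

```python
import numpy as np

def first_pc(data):
    """Return the unit-length first principal direction of `data`."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def standardize(data):
    """Center each column and scale it to unit standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

rng = np.random.default_rng(2)

# Hypothetical correlated measurements: a length in meters and a mass in kilograms.
length_m = rng.normal(loc=1.7, scale=0.1, size=300)
mass_kg = 70 + 50 * (length_m - 1.7) + rng.normal(scale=8.0, size=300)
raw = np.column_stack([length_m, mass_kg])

# The same data with length expressed in millimeters instead.
rescaled = raw * np.array([1000.0, 1.0])

pc_raw = first_pc(raw)                   # points almost entirely along mass
pc_rescaled = first_pc(rescaled)         # points almost entirely along length
pc_doubled = first_pc(2.0 * raw)         # uniform scaling: same direction as pc_raw
pc_std_raw = first_pc(standardize(raw))
pc_std_rescaled = first_pc(standardize(rescaled))

print(pc_raw, pc_rescaled, pc_std_raw, pc_std_rescaled)
```

Changing the unit of one variable flips which variable dominates the first component, while the standardized versions give the same direction regardless of units.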
At an intuitive level, PCA maps points from a high-dimensional space into a lower-dimensional one while trying to preserve their differences as much as possible. Whether you describe this as preserving distances under projection or preserving variance, the resulting computation still leads to orthogonal directions through matrix operations. Once the transformation matrix is obtained, the lower-dimensional space is determined—and it is orthogonal by construction.
What the variance of projected points really means
PCA is frequently used for visualization, but it is worth being precise about what exactly is being visualized.
Return to the 2D-to-1D case. Suppose a point A in the original plane is projected onto a 1D line as point B. Let C be the mean point of all data, which is also the 0D projection. The segment AC is fixed once the data are given, and the distance from A to the 1D line must be minimized, so AB is perpendicular to the line. That means A, B, and C form a right triangle.
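By the Pythagorean theorem, AC² = AB² + BC². Since AC is fixed by the data, minimizing the squared perpendicular distance AB², summed over all points, is the same as maximizing the sum of BC². The average of BC² over all points is exactly the variance of the 1D projection. This is why the first principal component is the direction of greatest variance.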
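The right-triangle argument can be written out for all points at once. In this sketch, the symbols \(a_i\), \(b_i\), \(c\), and \(n\) are notation introduced here: \(a_i\) is the \(i\)-th original point (A), \(b_i\) its projection onto the line (B), \(c\) the mean point (C), and \(n\) the number of points.

```latex
\begin{aligned}
\|a_i - c\|^2 &= \|a_i - b_i\|^2 + \|b_i - c\|^2
  && \text{(right angle at } b_i\text{)} \\
\underbrace{\sum_{i=1}^{n} \|a_i - c\|^2}_{\text{fixed by the data}}
  &= \underbrace{\sum_{i=1}^{n} \|a_i - b_i\|^2}_{\text{perpendicular error}}
   + \underbrace{\sum_{i=1}^{n} \|b_i - c\|^2}_{n \,\times\, \text{projected variance}}
\end{aligned}
```

Because the left-hand side does not depend on the choice of line, making the first term on the right as small as possible makes the second term as large as possible.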
There is also an unavoidable consequence: a projection can never increase the distance between two points, so every time dimension is reduced, distances shrink or at best stay the same relative to the higher-dimensional space. So if projected points crowd together in a low-dimensional plot, that does not automatically mean they are truly very similar. But if the leading principal components explain a large share of the total variance, then such closeness is much more likely to reflect genuine similarity.
PCA and classical multidimensional scaling
Once distance enters the discussion, PCA starts to look very close to multidimensional scaling.
In classical multidimensional scaling, one may not have direct coordinates for the points, but one can measure Euclidean distances between them and build a distance matrix. Double-centering the squared distances turns that matrix into a matrix of inner products between centered points, and its leading eigenvectors give low-dimensional coordinates. When the distances are Euclidean, these coordinates coincide with the PCA scores up to the sign of each axis, so the subspace that preserves the most variance also preserves the interpoint distances as well as possible. That is why PCA can be viewed as a form of classical multidimensional scaling.
This means that when Euclidean distances are observable but coordinates are not, it may still be possible to reconstruct the original configuration, up to rotation, reflection, and translation. That idea has applications in areas such as structural biology, where relationships are often measured as pairwise distances rather than direct coordinates.
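The link between the two methods can be checked numerically. This is a minimal sketch on synthetic points: the coordinates are generated only so that a distance matrix exists, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hidden coordinates (assumed for illustration); pretend only distances are observed.
coords = rng.normal(size=(30, 3)) * np.array([3.0, 2.0, 1.0])
diff = coords[:, None, :] - coords[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)           # squared Euclidean distance matrix

# Classical MDS: double-center the squared distances to get a Gram matrix.
n = sq_dist.shape[0]
centering = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * centering @ sq_dist @ centering

# Top eigenvectors of the Gram matrix, scaled by sqrt(eigenvalue), give coordinates.
eigvals, eigvecs = np.linalg.eigh(gram)
top = np.argsort(eigvals)[::-1][:2]
mds_coords = eigvecs[:, top] * np.sqrt(eigvals[top])

# PCA scores computed from the hidden coordinates, for comparison.
centered = coords - coords.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pca_scores = u[:, :2] * s[:2]

# The two embeddings match up to the sign of each axis.
matches = np.allclose(np.abs(mds_coords), np.abs(pca_scores), atol=1e-6)
print("MDS from distances equals PCA scores:", matches)
```

Only the distance matrix enters the MDS half of the computation, yet it recovers the same 2D configuration that PCA produces from the coordinates.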
Probabilistic PCA: adding uncertainty back into the picture
Standard PCA relies heavily on matrix algebra, and the different algebraic routes lead to equivalent results. But this does not make it a full probabilistic model. The white-noise uncertainty that might generate variation in the data never explicitly appears in the usual derivation.
So it is natural to ask whether the PCA subspace can be interpreted as the solution to a statistical model. In other words, if ordinary PCA gives something like an estimate of the mean structure without explicitly modeling uncertainty, can we rewrite it in a probabilistic framework and perhaps use that framework later for hypothesis testing?
Assume the observed data points are generated from Gaussian latent variables plus Gaussian noise. Then finding the PCA subspace becomes a parameter-estimation problem under that model, usually solved by maximum likelihood. One way to express the relationship is:
\[ t = Wx + \mu + \epsilon \]
Here, \(t\) is the observed data point and \(W\) is the matrix that maps the low-dimensional space into the observed space; if the dimensionality is unchanged you can think of it as a rotation of coordinate axes. The variable \(x\) is the point in the mapped (latent) space, usually taken to be standard normal, \(\mu\) is the mean of the observed data \(t\), and \(\epsilon\) is a Gaussian random noise term.
Under this setup, the observed points follow a normal distribution with mean \(\mu\) and covariance \(WW^{\top} + \psi\), where \(\psi\) is the covariance of the random error \(\epsilon\). If that error term is ignored, then PCA can again be solved directly by eigenvalues and eigenvectors, in effect assuming zero variance in the noise term.
But in real data, every high-dimensional sample point typically contains at least some measurement error, and that error variance is not zero. So the model should include an error term. The problem is that the error structure is usually unknown. A common simplification is to assume that all points share Gaussian noise with the same variance. Once that restriction is imposed, the model becomes identifiable and solvable.
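Under the equal-variance restriction, the model can be written out in full together with its standard closed-form maximum-likelihood solution (the result usually attributed to Tipping and Bishop). The symbols below are notation introduced for this sketch: \(d\) is the observed dimension, \(q\) the latent dimension, \(\sigma^2\) the shared noise variance (so \(\psi = \sigma^2 I_d\)), \(\lambda_1 \ge \dots \ge \lambda_d\) the eigenvalues of the sample covariance, \(U_q\) its top \(q\) eigenvectors, \(\Lambda_q\) the diagonal matrix of the top \(q\) eigenvalues, and \(R\) an arbitrary rotation.

```latex
\begin{aligned}
x &\sim \mathcal{N}(0, I_q), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2 I_d), \qquad
t = Wx + \mu + \epsilon \\
t &\sim \mathcal{N}\!\left(\mu,\; WW^{\top} + \sigma^2 I_d\right) \\
\hat{\sigma}^2 &= \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j, \qquad
\hat{W} = U_q \left(\Lambda_q - \hat{\sigma}^2 I_q\right)^{1/2} R
\end{aligned}
```

The estimated noise variance is the average variance in the discarded directions, and the columns of \(\hat{W}\) span the same subspace as the leading principal directions. The rotation \(R\) shows that \(W\) is determined only up to a rotation of the latent axes.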
With that addition, PCA can be placed within a statistical framework where hypothesis testing becomes possible—for example, assessing whether a point behaves like an outlier.
Why the EM algorithm keeps appearing
Once probabilistic PCA is introduced, the EM algorithm is hard to avoid. EM is a broadly useful method for models with hidden variables, and PCA can be cast in exactly that form.
You may think of the low-dimensional representation as hidden behind the observed high-dimensional points, or conversely the observed points as noisy manifestations of lower-dimensional latent variables.
The EM idea is straightforward. Start by proposing some low-dimensional space. In the E-step, project the observed points into that space and compute the expected latent structure or corresponding distances. In the M-step, update the model to minimize those distances or maximize the likelihood. Then repeat: E-step, M-step, over and over until improvement stops.
In plain terms, if the model structure is not directly visible, you create an initial guess and iteratively force it to fit the data better according to your objective.
One of the strengths of EM is that the variance term introduced in probabilistic PCA can be optimized at the same time. That gives probabilistic PCA a workable solution. The implementation details can be delicate, but the main point is that EM is a very general strategy. The same idea shows up across many latent-variable models, including hidden Markov models, Bayesian networks, and conditional random fields.
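The E-step and M-step can be sketched for probabilistic PCA with equal-variance noise. This is a minimal illustration, not a production implementation: the function name `ppca_em`, the fixed iteration count, the random initialization, and the synthetic data are all assumptions, and the update formulas are the standard ones for this model.

```python
import numpy as np

def ppca_em(data, n_components, n_iter=500, seed=0):
    """EM for probabilistic PCA with isotropic noise; returns (W, sigma2, mu)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    mu = data.mean(axis=0)
    centered = data - mu
    W = rng.normal(size=(n_features, n_components))    # initial guess
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior means and summed second moments of the latent points.
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(n_components))
        Ex = centered @ W @ M_inv                       # shape (n_samples, q)
        Exx_sum = n_samples * sigma2 * M_inv + Ex.T @ Ex
        # M-step: update the mapping W and the shared noise variance together.
        W = (centered.T @ Ex) @ np.linalg.inv(Exx_sum)
        sigma2 = (
            np.sum(centered ** 2)
            - 2.0 * np.sum((centered @ W) * Ex)
            + np.trace(Exx_sum @ W.T @ W)
        ) / (n_samples * n_features)
    return W, sigma2, mu

# Synthetic data (assumed): 2 latent dimensions embedded in 6, plus noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(1000, 2)) * np.array([3.0, 2.0])
mixing = np.linalg.qr(rng.normal(size=(6, 2)))[0]       # orthonormal columns
data = latent @ mixing.T + rng.normal(scale=0.5, size=(1000, 6))

W, sigma2, mu = ppca_em(data, n_components=2)

# Compare with the eigenvalue route: mean of the discarded eigenvalues.
sample_cov = (data - mu).T @ (data - mu) / data.shape[0]
eigvals, eigvecs = np.linalg.eigh(sample_cov)           # ascending order
sigma2_closed_form = eigvals[:-2].mean()
print("EM noise variance:", sigma2, "| eigenvalue route:", sigma2_closed_form)
```

On this example the iterative estimate should agree with the eigenvalue-based one, which illustrates the earlier point that iteration is unnecessary when an analytical solution exists. The EM route becomes useful when the model changes, for example with missing values.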
Factor analysis: close to PCA, but not the same
In many introductions, probabilistic PCA mainly serves as a bridge to factor analysis.
The key difference is that factor analysis does not require the noise to have equal variance in every direction. The noise is still usually taken to be Gaussian, but each observed variable gets its own noise variance. That makes the computation harder, but it also makes factor analysis more credible for interpreting hidden structure.
At the same time, factor analysis is less transparent than PCA in practice. Deciding how many factors to keep often requires separate criteria or subjective judgment. Factor analysis can also be used for prediction, with the latent factors themselves as the target. From a probabilistic viewpoint, PCA can also support prediction, but only when the application is carefully thought through.
Like PCA, factor analysis in its standard setup works with uncorrelated (orthogonal) factors. Orthogonality is useful because it removes correlation among the mapped dimensions. But orthogonality is not the same as independence. If the data-generating process calls for truly independent latent sources, then independent component analysis becomes more appropriate.
Independent component analysis and the difference between uncorrelated and independent
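Independent component analysis, or ICA, gains nothing over PCA when the independent components are Gaussian: uncorrelated Gaussian variables are already independent, and any rotation of them is equally valid, so the sources cannot be pinned down. But when the latent sources are not Gaussian, ICA has a stronger ability to recover them.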
That is because independence is a stricter requirement than zero correlation. Variables can be uncorrelated without being independent, but if they are independent, then they are necessarily uncorrelated. In information-theoretic terms, independent variables have zero mutual information, and in higher-order statistics they show no dependence structure.
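A standard textbook illustration of the gap between the two notions, sketched on simulated data (the distribution and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# y is completely determined by x, yet their linear correlation is near zero.
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

corr_xy = np.corrcoef(x, y)[0, 1]

# The dependence appears in higher-order statistics: x**2 tracks y perfectly.
corr_x2_y = np.corrcoef(x ** 2, y)[0, 1]

print("correlation of x and y:   ", corr_xy)
print("correlation of x**2 and y:", corr_x2_y)
```

Correlation only measures linear association, so a symmetric nonlinear relationship like this one is invisible to it.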
The classic example is the cocktail party problem. Imagine a noisy room where many people are speaking at once. Several microphones are placed at fixed positions, and each microphone records a mixture of voices. The task is to separate the mixed signals back into individual speakers.
If PCA is applied, what you tend to recover are dominant common patterns in the audio mixtures. ICA, however, is much better suited to teasing apart each individual source—that is, the latent variables behind the mixed observations. EM-style thinking can also be used to solve ICA-type models.
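A toy version of the cocktail party problem can be sketched with two synthetic "voices" and two "microphones". Everything here is an assumption made for illustration: the signals, the mixing matrix, and the function names. The separation routine is a simplified symmetric FastICA with a tanh nonlinearity, written out in NumPy rather than taken from a library.

```python
import numpy as np

def whiten(X):
    """Center the rows of X (signals x samples); return uncorrelated, unit-variance rows."""
    X = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X))
    return (eigvecs / np.sqrt(eigvals)).T @ X

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T @ W

def fast_ica(X, n_iter=200, seed=0):
    """Simplified symmetric FastICA with a tanh nonlinearity."""
    Z = whiten(X)
    k, n = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update for every row of W, then re-orthogonalize.
        W_new = G @ Z.T / n - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W = sym_decorrelate(W_new)
    return W @ Z

def match_quality(estimates, sources):
    """For each true source, the best absolute correlation with any estimate."""
    k = sources.shape[0]
    corr = np.abs(np.corrcoef(np.vstack([sources, estimates]))[:k, k:])
    return corr.max(axis=1)

# Two synthetic "voices": a sine wave and a square wave.
t = np.linspace(0, 8, 2000)
sources = np.vstack([np.sin(2 * np.pi * t), np.sign(np.sin(3 * np.pi * t))])

# Each "microphone" records a different mixture of the two voices.
mixing = np.array([[1.0, 0.5], [0.6, 1.0]])
recorded = mixing @ sources

ica_quality = match_quality(fast_ica(recorded), sources)
pca_quality = match_quality(whiten(recorded), sources)   # PCA scores, rescaled
print("ICA match per source:", ica_quality)
print("PCA match per source:", pca_quality)
```

A match close to 1 means a recovered component is essentially one original voice, up to sign and scale. The PCA components are uncorrelated but remain mixtures of both voices in this example.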
One practical difference is that ICA is not primarily a dimensionality-reduction method. Usually, if you ask for a certain number of components, that is how many you get. Still, PCA, factor analysis, and ICA all share a basic property: they are linear models. Their principal components, factors, or independent components are all linear combinations of the original variables.
In some areas such as omics data, ICA may be especially attractive because it can extract independent modules directly from the data for annotation and biological interpretation. PCA or factor analysis can also be used, of course, but then the assumptions about orthogonality and underlying distributions need to be examined much more carefully.
PCA is not just for pretty plots
A common use of PCA in applied papers is to draw clusters and circles on a score plot, then claim that samples share some internal commonality. This is especially common in environmental analysis, where hundreds of compounds may be measured simultaneously.
Since PCA tries to preserve differences among samples as much as possible in a low-dimensional space, samples that map close together may indeed share similar contamination profiles, sources, or environmental processes. But that does not mean PCA is the only choice. In many cases, clustering or other statistical models may be more direct.
Part of the confusion comes from the fact that many users do not have a clear grasp of eigenvalues, eigenvectors, and loadings, so PCA gets reduced to a dimensionality-reduction plotting tool. But the exploratory value of PCA lies in the latent common structure it reveals.
Take a concrete example. Suppose there are 100 samples, and each sample has 1000 measured variables. That gives a \(100 \times 1000\) matrix. Because centering 100 samples leaves at most 99 directions with nonzero variance, PCA can return at most 99 components here. In practice far fewer are usually kept: you might reduce the data to a \(100 \times 25\) matrix that still preserves 95% of the original variance.
So what are these 25 new variables? They are the scores: the coordinates of each sample along the eigenvectors, which are the new projection directions. Those directions can be interpreted as hidden common patterns. And the eigenvalues? They are the weights of those common patterns: the larger the eigenvalue, the more variance that component explains, and therefore the more important it is.
What about loadings? Roughly speaking, they indicate how much each of the original 1000 variables contributes to each of the 25 new components.
This inserts an additional layer between samples and measured variables: a layer of common structure. On one side, the dimensionality is reduced. On the other, the method extracts common patterns that are uncorrelated with one another, though not necessarily independent.
That extra layer is often where interpretation begins. We know how samples distribute across the common patterns, and we know how variables contribute to them. A biplot is useful precisely because it can display both sample points and variable directions together on the two most important components.
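The pieces named above (eigenvalues, directions, scores, loadings) can be computed in a few lines. This sketch uses a synthetic matrix built from five made-up hidden patterns; the sizes, scales, and names are assumptions, and the loadings follow one common convention (directions scaled by the square root of the eigenvalue), since conventions differ between software packages.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic 100 x 1000 data matrix (assumed): 5 hidden patterns plus noise.
n_samples, n_variables, n_patterns = 100, 1000, 5
hidden = rng.normal(size=(n_samples, n_patterns)) * np.array([10, 8, 6, 4, 2])
patterns = rng.normal(size=(n_patterns, n_variables))
data = hidden @ patterns + rng.normal(size=(n_samples, n_variables))

centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Centering removes one degree of freedom, so at most n_samples - 1
# components carry any variance; the last singular value is numerically zero.
eigenvalues = s ** 2 / (n_samples - 1)          # variance of each component
explained = eigenvalues / eigenvalues.sum()     # share of the total variance

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

directions = vt[:n_keep]                        # eigenvectors, (n_keep, 1000)
scores = centered @ directions.T                # samples in the new coordinates
loadings = directions.T * np.sqrt(eigenvalues[:n_keep])   # variable contributions

print("components kept:", n_keep)
print("scores shape:", scores.shape, "| loadings shape:", loadings.shape)
```

Plotting the first two columns of `scores` as points and the first two columns of `loadings` as arrows gives the two layers that a biplot overlays.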
At that stage, domain knowledge becomes essential. Interactions among variables may suggest the physical meaning of a hidden factor. A component that cleanly separates certain samples may point to a latent source or mechanism. This is why PCA can be useful for tasks as different as identifying text topics in natural language processing or discovering gene modules in genomics.
Still, it is worth remembering that these so-called common patterns are not objective laws of nature. They are the result of a linear transformation. If they do not align with the substantive factors you actually care about, then a direct regression model may be more appropriate.
Compression and denoising
Another major use of PCA—or equivalently the singular value decomposition that implements it—is data compression.
In the example above, a \(100 \times 1000\) data space may be wasteful to store, especially if the matrix is sparse. SVD can compress that representation into a lower-dimensional form while preserving most of the important structure.
From a signal-processing perspective, PCA and transformations such as the Fourier transform share a common spirit: replace the original signal with a new set of basis signals. If meaningful signal is assumed to have larger variance than noise, then keeping the principal components—or retaining selected frequency bands in other transforms—can also reduce noise.
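Compression and denoising by truncation can be sketched on a synthetic matrix. The low-rank signal, the noise level, and the choice of `k` are assumptions made for illustration; in real data the right number of components is not known in advance.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic rank-3 "signal" plus noise (assumed for illustration).
clean = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 50)) * 3.0
noisy = clean + rng.normal(scale=1.0, size=clean.shape)

# Truncated SVD: keep only the k strongest components and rebuild the matrix.
mean = noisy.mean(axis=0)
u, s, vt = np.linalg.svd(noisy - mean, full_matrices=False)
k = 3
denoised = (u[:, :k] * s[:k]) @ vt[:k] + mean

# Compression: store k scaled score columns, k directions, and the column means.
stored_numbers = u[:, :k].size + vt[:k].size + mean.size

error_noisy = np.linalg.norm(noisy - clean)
error_denoised = np.linalg.norm(denoised - clean)
print("numbers stored:", stored_numbers, "of", noisy.size)
print("error before:", error_noisy, "| error after:", error_denoised)
```

The discarded components carry mostly noise here because the signal was built to have much larger variance than the noise, which is exactly the assumption stated above.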
The same idea extends naturally to image processing. And since all data can in some sense be rendered as images or geometric objects, the connection between image denoising and data denoising is not hard to make. In fields such as trace environmental analysis, this denoising role can be especially valuable.
Once you put together structure reconstruction, variance preservation, compression, and denoising, PCA starts to look far less narrow than it first appears. Even if this were the only method you knew well, there would still be a surprising amount you could do with it.
You just need a little imagination.