PCA (Principal Component Analysis)

PCA stands for “Principal Component Analysis.” It’s a statistical technique that scientists use to find patterns in large amounts of data. When it’s used for ancestry analysis, it helps researchers understand where a person’s ancestors might have come from and how that person is related to other people.

Here’s an example of how it might work: Imagine that scientists have collected a lot of information about the genes of many different people from all around the world. They can use PCA to look for patterns in this data that might show how people are related to each other. For example, they might find that people who live in the same region of the world tend to have similar patterns of genes. This could help researchers understand where a person’s ancestors might have come from.

PCA is just one tool that scientists use to learn about ancestry. There are other ways to study ancestry too, like looking at the history of a place or the language that people speak. But PCA is a very powerful way to help us understand the connections between different people and their ancestors.

How does it work?

In principal component analysis (PCA), we aim to find the directions in which the data varies the most, and these directions are called principal components. The first principal component is the direction in which the data varies the most, the second principal component is the direction in which the data varies the second most, and so on.
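The idea that the first principal component captures the direction of greatest variation can be seen with a small sketch. Here we generate made-up 2D data where one variable is roughly twice the other, so almost all the variation lies along a single diagonal direction; the largest eigenvalue of the covariance matrix then accounts for nearly all of the variance. The data and parameters are illustrative, not from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: y is roughly 2*x plus a little noise, so most of
# the variation lies along one diagonal direction.
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

# Eigen-decomposition of the covariance matrix gives the principal
# components (eigh is appropriate because the matrix is symmetric).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))

# The largest eigenvalue dominates: the first principal component
# explains almost all of the variance in this data.
explained = eigenvalues.max() / eigenvalues.sum()
print(round(explained, 3))
```

Because the noise is small relative to the diagonal trend, the first component explains well over 90% of the variance here; the second component only captures the leftover noise.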

Mathematically, PCA is defined as follows:

  • We start with a dataset of N observations (samples) and p variables (features). We want to find the directions in which the data varies the most.
  • We standardize the data by subtracting the mean and dividing by the standard deviation for each variable. Centering (subtracting the mean) is part of PCA itself; dividing by the standard deviation is optional, but it puts all the variables on the same scale, which matters when they are measured in different units.
  • We calculate the covariance matrix of the standardized data. The covariance matrix is a p x p matrix that contains the pairwise covariances between the variables.
  • We calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors give the directions of variation, and each corresponding eigenvalue measures how much of the data’s variance lies along that direction.
  • We sort the eigenvectors and eigenvalues by the magnitude of the eigenvalues. The eigenvectors with the largest eigenvalues are the principal components.
  • We can then project the data onto the principal components by multiplying the data matrix by the matrix of top eigenvectors. This gives us a new dataset with fewer dimensions, where each dimension is a combination of the original variables.
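The steps above can be sketched end to end with NumPy. The dataset here is a tiny made-up matrix of 6 samples and 3 features, chosen only to keep the example readable; a real genetic dataset would have far more of both.

```python
import numpy as np

# Toy dataset: 6 samples (rows) x 3 features (columns). Values are
# illustrative only.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
    [2.3, 2.7, 1.0],
])

# Step 1: standardize each feature (subtract mean, divide by std).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (p x p).
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: sort by eigenvalue, largest first; the leading eigenvectors
# are the principal components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: project the data onto the first k principal components.
k = 2
scores = Z @ eigenvectors[:, :k]

print(scores.shape)  # (6, 2): each sample is now described by 2 numbers
```

The resulting `scores` array is the lower-dimensional view of the data; in ancestry studies, plotting the first two columns against each other is what produces the familiar PCA scatter plots where samples from the same region tend to cluster together.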