Current location - Music Encyclopedia - Dating - Transcriptome Primer (7): Differential Expression Analysis
Transcriptome Primer (7): Differential Expression Analysis

The original HTSeq-count data of the three samples can be found in my GitHub, but as mentioned before, Jimmy’s mistake allowed us to analyze only 3 samples of humans, and the other sample needs to be obtained from another sample. A batch of data is acquired (please pay attention to the batch effect), so there is no guarantee that each group has two duplicates.

I have always believed in the words "You are not alone." I am definitely not the only one who encounters this situation, so I found several solutions

The above methods They will be introduced later, but our DESeq2 must have duplicate problems that need to be solved urgently. I have no choice but to make them up myself. Even though we are editing, we still need to have a certain pattern and cannot directly copy it. We must take into account that the reads of high-throughput sequencing conform to the Poisson distribution by default. I made it up like this.

This is just a way to fill in the gaps. For better methods of simulating data, you need to refer to more professional literature. I hope to make up for this part in my lifetime.

This part of the content was first seen in Section 8.5.3 of RNA-Seq Data Analysis. I didn’t understand it at all at first, but after studying biostatistics, I think it is necessary to understand all differential gene expression analysis. The key to R packages.

Basically, statistics classes will introduce how to use the t test to compare the difference between two samples, and then use analysis of variance to determine whether there is a difference between the samples when there are many samples. Of course, the sample comes from a normally distributed population, or is randomly and independently sampled in large numbers.

For differential expression analysis of gene chips, since it is generally believed that the data obeys a normal distribution, differential expression analysis is nothing more than applying t test sum or variance analysis to each gene. High-throughput searches for many genes at once, so multiple tests need to be corrected to control false positives. Currently, limma is most commonly used in gene chip analysis.

However, the read count of high-throughput sequencing (HTS) is generally believed to obey the Poisson distribution (of course there are other different opinions), and it is impossible to directly use the t test and variance analysis of normal distribution. Of course, we can use the non-parametric test method simply and crudely, but the statistical power is not enough, and the corrected p-value estimate of the result cannot find a single differential gene. The boss spent a lot of money, but it turned out that there was no differential gene and it was a negative result. So thousands of dollars were wasted. He must not be happy. Therefore, we still have to use the parametric test method, so we have to talk about the relationship between variance analysis and linear models.

Linear regression and analysis of variance are two sets of methods developed during the same period. In my undergraduate field statistics course, I introduced the use of analysis of variance (ANOVA) to analyze the yield differences after different fertilizer treatments. The experimental design is as follows

This is the simplest single-factor analysis of variance. Each result It can be regarded as yij = ai u eij, where u is the overall mean, ai is the difference of each treatment, and eij is the random error.

Note: The name of Analysis of Variance (ANAOVA) sounds like it is testing variance, but it is actually to determine whether the differences between samples really exist. To do this, it is necessary to prove that the variances within different treatments are significant. is greater than the variance between different treatments.

Linear regression is generally used to predict quantitative response variables using quantitative predictor variables.

For example, modeling the relationship between weight and height:

Of course, linear regression can also be used to process nominal or ordinal factors (that is, discrete variables) as predictor variables. If you want to draw a graph, it is the following situation.

If we need to find the gene changes between the control group and the control group after different treatments through an experiment, then gene expression can be simply written as, y = a b · treament e. Compared with the previous yij = ai u eij, you will find that the formula is so consistent. This is because both linear models and ANOVA are special forms of generalizing linear models (GLM) with normally distributed predictors. GLM itself can handle modeling of any type of variables as long as it uses appropriate connection functions.

It is currently believed that the difference between read counts conforms to the negative binomial distribution, also called gamma-Possion distribution. So the question is, how to use GLM or LM to analyze the difference between two processed parts? In fact, it can be simply explained by the slope of the fitting line in the figure above. If there are differences between different treatments, then the slope of the fitting line must not be zero, that is, parallel to the X-axis. But this is an easy-to-understand way (although you may not understand it), but it is actually more complicated and has more factors to consider.

Note 1 The negative two-way distribution has two parameters, the mean and the dispersion. The dispersion describes the degree to which the variance deviates from the mean. The Poisson distribution can be thought of as a negative two-way distribution with a discrete value of 1, that is, a situation where the mean is equal to the variance (mean=variance).

Note 2 This part involves a lot of statistical knowledge. If you don’t understand it, just use Wikipedia to check it out one by one.

After talking about linear models and variance analysis, the design matrix below is easy to understand. It is actually used to tell different difference analysis functions how to treat variables. For example, if we want to study the change between KD and control, the design matrix is ??

Then the contrast matrix tells the difference analysis function how to compare which factor. Here is the comparison of expressions under different treatments. Quantity changes.

In fact, there are many ways to standardize read count. The most commonly used are FPKM and RPKM, although they are actually wrong - FPKM/RPKM are wrong.

I recommend reading Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data to understand the differences between different normalization methods.

Some methods require raw data, while others require certain types of standardized data. Remember to distinguish.

Regarding DESeq2 analysis of differentially expressed genes, it is actually standardized by position bias.

edgeR's calcNormFactors function uses the TMM algorithm to normalize DGEList

Note that most mRNA-Seq data analysis can be normalized using TMM, but there are exceptions, such as single-cell RNA-Seq (Lun, Bach, and Marioni 2016), and there is global differential expression. More than half of the genes in the genome are differentially expressed, so please try to avoid it (D. Wu et al. 2013), otherwise you need to use internal reference for standardization (Risso et al. 2014).

Step 4: Experimental design matrix (Design matrix), similar to the design parameters in DESeq2. edgeR's linear model and differential expression analysis require the definition of an experimental design matrix. It is very straightforward to find that it is 1vs0

Step 5: Estimate the discrete value (Dispersion). As mentioned earlier, the negative binomial distribution (negative binomial, NB) requires two parameters: mean and discrete value. edgeR estimates an empirical Bayes moderated dispersion for each gene, a common dispersion (the mean of the empirical Bayes moderated dispersion of all genes), and A trend discrete value

can further fit the NB model through quasi-likelihood (QL) to explain gene-specific variations caused by biology and technology (Lund et al. 2012; Lun, Chen, and Smyth 2016).

Note 1 The step of estimating discrete values ??actually has many estimate*Disp functions. When there is no experimental design matrix (design matrix), estimateDisp is equivalent to estimateCommonDisp and estimateTagwiseDisp. When an experimental design matrix is ??given, estimateDisp is equivalent to estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp. Among them, tag is synonymous with gene.

Note 2 In fact, the third, fourth, and fifth steps here correspond to the two steps included in DESeq of DESeq2, standardization and discrete value estimation.

Step 6: Differential expression test (1). This step mainly constructs a comparison matrix, similar to the contrast parameter of the results function in DESeq2.

The reason glmQLFTest is used here instead of glmLRT is because glmQLTFit was used for fitting earlier, so QL F-test needs to be used for testing. If glmFit was used previously, the corresponding one is glmLRT. The author claims that QL F-test is more stringent. Multiple test correction also uses the BH method.

The next step is to extract significantly different genes for downstream analysis and make some pictures to see.

Step 6: Differential expression test (2). The genes with significant differences found above do not take into account the effect value, that is, the specific number of times the change has occurred.

We can also find genes with relatively large expression changes, and the corresponding function is glmTreat.

After being baptized by the above two methods, you will basically know the routine. I will briefly summarize it first, and then continue to introduce the voom of the limma package.

Limma was originally used to process gene expression chip data, but it is said to be the leader in this field: sunglasses:. If you look carefully at the edgeR import interface, you will find that some functions of edgeR depend on the limma package. Limma uses the Empirical Bayesian model to make the results more robust.

When processing RNA-Seq data, raw read count is first converted into log2-counts-per-million (logCPM), and then the mean-variance relationship is modeled. There are two methods of modeling:

Data preprocessing: Limma uses edgeR’s DGEList object, and the filtering methods are consistent, corresponding to the first, second, and third steps of edgeR

Differential expression analysis: use "limma-trend"

Differential expression analysis: use "limma-voom"

If you analyze gene chip data, you must understand the LIMMA package well .

Basically for every package, I extracted various significant genes. For comparison, I need to use Venn diagram, but I don’t: stick_out_tongue: I want to use UpSetR.

I feel that the result of limma is a bit strange, I will have to worry about it for the rest of my life.

Okay, I missed this part

[1] Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data

[ 2] https://www.bioconductor.org/help/workflows/rnaseqGene/

[3] https://www.bioconductor.org/help/workflows/RnaSeqGeneEdgeRQL/

[4] https://www.bioconductor.org/help/workflows/RNAseq123/