SoyNcRNAExp collected 4,179 Long non-coding RNAs (lncRNAs), 1,245 microRNAs (miRNAs), 12,970 phased siRNAs (phasiRNAs) and 451 PHAS/TAS genes. In addition, 3,747 transcription factors (TFs), 135 RNA-binding proteins (RBPs) and other 52,297 protein-coding genes were also collected for constructing ncRNA co-expression networks to expand ncRNA-centerred networks comprising more gene expression regulators.
SoyNcRNAExp contains four major functionalities: ncRNA expression, ncRNA differential expression, ncRNA mining, and ncRNA co-expression. SoyNcRNAExp is the first comprehensive resource for soybean ncRNA expression and co-expression analysis, which may greatly promote soybean ncRNA studies for soybean research community.
We collected 1,263 soybean RNA-seq datasets (including 518 paired-end and 765 single-end) and 333 soybean sRNA-seq datasets (all single-end) from the GEO database on NCBI, of which 141 datasets included both RNA-seq datasets and sRNA-datasets.The basic information of these datasets (data type, tissue, processing conditions, sequencing instrument, etc.) was summarized. For the RNA-seq datasets, 53 tissues and 49 stress conditions were involved. For the sRNA-seq data, 49 tissues and 8 stress conditions were involved. The information of datasets is available for download on the download page.
Downloaded soybean genome file and genome annotation file from NCBI and phytozome databases, the expression of lncRNA, PHAS gene, RBP were calculated based on NCBI annotation file, TF expression was calculated based on the annotation file of phytozome.
Information sources for each regulator are shown in the following table:
For RNA-seq libraries, first used trimmomatic (version 0.39) to remove sequencing adapters and low-quality sequences in batches, then adopted FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) to perform quality testing on clean data, and clean reads were aligned with soybean reference genomic files from Phytozome (Gmax_275_Wm82.a2.v1.gene.gff3) and NCBI Reference Sequence Database (ref_Glycine_max_v2.1_top_level.gff3) respectively by Hisat2 (version 2.1.0), and then used samtools (version 1.9) to convert the sam file generated by the alignment into a bam file, and sorted it. Based on phytozome and NCBI gene annotation files, used featureCounts (version 1.6.4) and stringtie (version 1.3.6) to calculate count value and TPM apart. The expression of TFs and protein-coding genes were extracted from the expression calculated based on phytozome annotation according to the annotation information file of TFs, and the expression of lncRNAs, PHAS genes, RBPs were extracted from the expression calculated based on NCBI annotation according to the respective annotation information file. Finally, filtered the genes whose TPM was 0 in all samples. Afterwards, we employed Combat-seq of the Sva package(version 3.40.0) for batch effect processing.
For sRNA-seq libraries, firstly used cutadapt(http://pypi.python.org/pypi/cutadapt) to remove the sequencing adapters, and then calculated the count value of reads and CPM (counts per million) through a python script. For the calculation of miRNAs expression, our laboratory has a complete set of sRNA analysis process. First, aligned the reads to the soybean genome file of NCBI for preliminary filtering, and then aligned the reads on the alignment to the miRNAs precursor sequence again, and the mature miRNAs sequences were aligned to the precursor sequence. The miRNAs precursor sequences and mature miRNAs sequences were merged from miRbase and PmiREN databases. Finally, the two alignment results were combined that the reads aligned to the same position of the precursor with the known mature miRNAs can represent the mature miRNAs, thus, the miRNAs expression would be the sum of the expression of all reads aligned to the position, then filtered the miRNAs that were expressed as 0 in all samples. For the calculation of phasiRNAs expression, kept reads with a length of 16-26 nt and CPM greater than or equal to 1, and then aligned them to soybean cDNA sequences. According to the principle of its production, used the phasiRNAs analysis tools, PhasiRNAnalyzer, set the shift offset to 0, and required the PHAS sites of 4 or more to be called a PHAS cluster, to predict and calculate the expression of phasiRNAs. Similarly, filtered the phasiRNAs whose expression level is 0 in all samples. Afterwards, we employed Combat-seq of the Sva package for batch effect processing.
We calculated the average expression of ncRNAs in each soybean tissue, if the ncRNAs with low average expression values (averaged expression values less than or equal to 1) are considered as not expressed. All expressed ncRNAs (averaged expression values more than 1) were further ranked under a specific tissue. NcRNAs with expression values greater than the upper quantile are considered as 'HE' (high expression level), while those less than the lower quantile as 'LE' (low expression level) and the remaining as ‘ME’ (medium expression level). The following principles are followed when defining the expression ability of a certain ncRNA in all tissues: high expression capacity ncRNAs (HC-ncRNAs) are defined as ncRNAs with high expression potential, so there is at least one ‘HE’ among all tissues; low expression capacity ncRNAs (LC-ncRNAs) have ‘LC’ in all tissues; the remaining are medium expression capacity ncRNAs (MC-ncRNAs).
The τ value was calculated to measure whether the ncRNAs are specifically expressed in specific tissues. For tissues with less than 5 samples, the analysis of tissue-specific expression is not involved. τ varies between 0 and 1, with 0 indicating extensive expression and 1 indicating specific expression. τ is defined as follows:
The following criteria were used to identify tissue-specifically expressed ncRNAs:
1.The expression level in all samples must all be greater than 0.
2.Maximum expression in all tissues greater than or equal to 5.
3.τ greater than 0.8.
Based on DESeq2 in R, used raw read counts instead of normalized read counts for differential expression analysis of ncRNAs and genes, to generate log2 (foldChange) and p-value, FoldChange is the ratio of the treatment expression to control expression. Each treatment and control must contain two or more biological replicates. Log2 (foldChange) greater than or equal to 1 or less than or equal to -1 and FDR(adjusted p-value) less than or equal to 0.05 were defined as differential expressed genes or sRNAs.
The libraries used for co-expression analysis were divided into two categories: one is a total of 141 RNA-seq and sRNA-seq libraries from the same experiment; the other is 21 representative samples selected by removing the repetitive samples. Selection is based on the following criteria:
1.ensuring one SRR in each GSE.
2.different tissue samples were selected from the same GSE, and repeated tissue samples were removed.
3.selected different treatments from the same GSE, and removed the samples that were repeatedly treated.
Filtered the ncRNAs whose expression level is 0 in the 141 or 21 co-expression samples and then calculated the Pearson correlation coefficient (PCC) among the ncRNAs and between ncRNAs and target protein-coding genes. It was required that the log2 value of target expression in at least one sample was required to be greater than 3, otherwise, it would be filtered, and finally generated the co-expression line chart based on PCC, and a co-expression network was generated using the open-source graphics library cytoscape written in Javascript. PCC is defined as follows:
Using the R package WGCNA (version 1.70), the co-expression network module analysis of all ncRNAs and protein-coding genes was performed, we calculated the soft threshold value as 10. Using the default Pearson correlation coefficient, the blockwiseModules function was used to obtain the Adjacency matrix and Topological Overlap matrix (TOM). TOM matrix was derived and cytoscape was used for network visualization.
Browse page presents the basic information of all ncRNAs in the database, including long non-coding RNAs, microRNAs, phased siRNAs and PHAS genes.
ID is the name of the database, symbol is the name in corresponding NCBI.
Click ID to jump to the detailed information page of the ncRNA (if the ncRNA not expressed in all samples, it cannot be clicked.)
The detailed information page shows the transcripts information, the tissue expression and differential expression of the selected ncRNA.
Blast page can provide the known sequence, align with the ncRNAs in the database, and retrieve its ID.
Take the results of lncRNA as an example
The first column is the input sequence ID.
The second column is the transcript ID.
The third column is the gene ID of the transcript.
The fourth column is the name of the gene in the database, click it to jump to the detailed information page.
The fifth column is the alignment rate.
The sixth column is the alignment length.
The seventh column is the number of mismatches.
The eighth column is gap number.
The ninth column is the alignment start position of the input sequence.
The tenth column is the alignment end position of the input sequence.
The eleventh column is the start position of the aligned sequence.
The twelfth column is the end position of the aligned sequence.
The thirteenth column is the E-value. The smaller the E-value, the better the match.
The last column is the score of the alignment.
The analysis page is divided into four parts.
The first part is NcRNA Expression, including Tissue/Organ-specific expression and Anatomy sub-functions.
Tissue/Organ-specific expression is that query the expression of a certain ncRNA in all available soybean tissues.
For lncRNAs and PHAS Genes, we have 25 different tissues of soybean.
For miRNAs and phasiRNAs, we have 22 different tissues of soybean.
All the tissue information can be downloaded from the download page.
Parameter:line: For miRNAs and phasiRNAs, we use CPM to represent their expression; For lncRNAs and PHAS genes, we use TPM to represent their expression.
log2: Using log2 value for expression. (log2(expression+1)).
Sample type:Control: Untreated samples
All:Including all samples of control and treatment
After submission, the box plot of the ncRNA expression in all tissues will obtain.
Anatomy is query the expression of ncRNAs in a specific soybean tissue.
After submission, you get a histogram of the expression of all ncRNAs in the selected tissue.
The second part is Differential Expression, including Condition-specific expression and Perturbations sub-functions.
Condition-specific expression is that query the expression of a specific ncRNA under various conditions.
For lncRNAs and PHAS Genes, 23 different stress conditions are involved from 96 GSE datasets.
For miRNAs and phasiRNAs, 4 different stress conditions are involved from 9 GSE datasets.
After submission, the bar chart of the differential expression of the ncRNA under all stress conditions.
Perturbations is query ncRNAs that are differentially expressed under a certain condition.
All stress information can be downloaded from the download page.
Result is a heatmap. The x-axis of the result represents the name of ncRNA that meets the parameter(status, fold change cutoff, FDR cutoff and the condition) which you choosed. The first line of the heat map is the condition which you choosed. The following is the differential expression of these ncRNAs in other conditions.
The third part is NcRNA mining, is Tissue-specific ncRNA mining, is query tissue-specifically expressed ncRNAs by using tau method.
Tau formula is in help 2.6 Tissue-Specific Expression.
After submission, you will get an information form, then you can click on the ID to view its detailed expression information
The fourth part is Co-Expression, including PCC Curve, PCC Network and WGCNA Network sub-functions.
The PCC curve function can give the expression curves of a pair of source(ncRNAs) and target((ncRNAs, TFs, RBPs, protein-coding genes)).
Parameter:Source Data Type: Four target ncRNAs can be selected, it is single option, including long non-coding RNA, microRNA, phased siRNA and PHAS gene.
Target Data Type: Seven target types can be selected, it is multiple option, including long non-coding RNA, PHAS gene, RNA-binding protein, transcript factor , microRNA, phased siRNA and coding gene(phytozome).
Library: "All" indicates a total of 141 samples used for co-expression analysis. "Portion" means 21 samples for the analysis. For detailed classification criteria, see help 2.8 Co-expression.
Correlation: Positive: Positive correlation; Negative: Negative correlation
PCC cutoff: Threshold of Pearson correlation coefficient(PCC), formula is in help 2.8 Co-expression.
Result:Firstly, the expression of the selected source in selected library sample is shown as a line chart, and then the correlation coefficient table of the source and target is shown. Click the last column graph to display the expression relationship between the source and target.
The PCC network function can generate a network online based on the co-expression relationship between source (ncRNAs) and target (ncRNAs, TFs, RBPs, protein-coding genes).
Parameter:Start Point Type: The type of ncRNA at the start of the network.
Library: "All" indicates a total of 141 samples used for co-expression analysis. "Portion" means 21 samples for the analysis. For detailed classification criteria, see help 2.8 Co-expression.
Layer: The number of layers of the network, you can set 1 to 3 layers.
Target Point Type: The type of target per layer.
PCC cutoff: Threshold of Pearson correlation coefficient(PCC), formula is in help 2.8 Co-expression.
Correlation: Positive: Positive correlation; Negative: Negative correlation
Maximum of Nodes: Maximum number of nodes per layer.
Result:There are detailed network information in the parameter box on the right.
Different layers are represented by different colors, and different ncRNAs and genes are represented by different shapes. The red line represents the positive co-expression relationship, and the blue line represents the negative co-expression relationship.
The WGCNA network function can generate networks of different modules based on weighted values between source (ncRNAs) and target (ncRNAs, TFs, RBPs, protein-coding genes).
First, enter an ncRNA to obtain the modules identified for the ncRNA by WGCNA analysis. The first table shows the number of target genes corresponding to each module, and the second table lists the weight values and the target module. Click network to display the module network figure of the ncRNA and the target genes.