Additional file 4: Figure S2. of A high-quality annotated transcriptome of swine peripheral blood

Flowchart for de novo blood transcriptome assembly, annotation and filtering. The diagram shows the steps involved in constructing and filtering the de novo assembly and, where appropriate, the number of PTs that resulted from each step. Refer to the Materials and Methods section for details. The quality of the raw RNA-seq reads for each library was first checked with FASTQC. Sequencing adaptors and low-quality bases were then trimmed from the raw reads, and the trimmed reads were digitally normalized to reduce k-mer redundancy. The normalized reads were assembled into putative Trinity transcripts (PTs), collectively referred to as the “de novo transcriptome assembly”. This assembly was then analyzed in three ways. First, the coding potential of each PT was predicted with PLEK; PTs with a coding potential higher than zero were considered potentially protein-coding. Second, all PTs were aligned separately to the two pig reference genomes, USMARCv1.0 and SSC10.2, with GMAP. Third, PTs with significant BLAST hits in the NCBI NT and NR databases were identified with DC-megaBLAST and BLASTX, using E-value cutoffs of 10^-20 and 10^-6, respectively. Because the alignment frequency of the PTs to the USMARCv1.0 reference genome was much higher than to the SSC10.2 assembly, the de novo transcriptome was filtered based on the USMARCv1.0 mapping results. PTs with top megaBLAST hits on sequences from non-vertebrates and without better alignments to the two reference genomes were considered “contaminants” and were filtered out. The potential biotypes of the PTs were determined from the biotypes of their top megaBLAST hits, where available.
Other removed PTs were (i) PTs with top megaBLAST hits on sequences of mitochondrial genomes; (ii) unspliced intronic PTs; (iii) unspliced nonintronic PTs mapped to genomic regions with a maximal coverage per base (CPB) lower than 50× (low-CPB regions); and (iv) PTs that mapped to multiple locations, or not at all, on the USMARCv1.0 assembly and had no top megaBLAST hit on an RNA sequence in the NT database. The final filtered de novo transcriptome assembly was composed of 126,225 PTs. (PDF 498 kb)
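
The filtering rules listed in this caption can be read as a short decision procedure. The Python sketch below is only an illustration of that reading, not the authors' implementation: the `PT` record, its field names, the reason labels and the order in which the rules are tested are all assumptions made for the example. Only the 50× CPB threshold and the removal categories themselves come from the caption.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_CPB = 50.0  # low-CPB threshold stated in the caption (50x)


@dataclass
class PT:
    """Hypothetical per-transcript summary; all field names are assumptions."""
    name: str
    top_hit_taxon: Optional[str] = None    # "vertebrate", "non_vertebrate" or None (no hit)
    top_hit_is_mito: bool = False          # top megaBLAST hit is a mitochondrial genome
    top_hit_is_rna: bool = False           # top megaBLAST hit is an RNA sequence in NT
    genome_alignment_better: bool = False  # alignment to a reference genome beats the BLAST hit
    n_genome_mappings: int = 0             # number of mapping locations on USMARCv1.0
    spliced: bool = False
    intronic: bool = False
    max_cpb: float = 0.0                   # maximal coverage per base of the mapped region


def removal_reason(pt: PT) -> Optional[str]:
    """Return a label for why a PT would be removed, or None if it is kept.

    The order of the checks is an assumption; the caption does not specify one.
    """
    # "Contaminants": non-vertebrate top hit without a better genome alignment.
    if pt.top_hit_taxon == "non_vertebrate" and not pt.genome_alignment_better:
        return "contaminant"
    # (i) top megaBLAST hit on a mitochondrial genome
    if pt.top_hit_is_mito:
        return "mitochondrial"
    # (ii) and (iii) apply to unspliced PTs that map to the genome
    if pt.n_genome_mappings >= 1 and not pt.spliced:
        if pt.intronic:
            return "unspliced_intronic"
        if pt.max_cpb < MIN_CPB:
            return "unspliced_low_cpb"
    # (iv) multi-mapping or non-mapping PTs without an RNA top hit
    if pt.n_genome_mappings != 1 and not pt.top_hit_is_rna:
        return "multi_or_unmapped_without_rna_hit"
    return None


def filter_pts(pts: List[PT]) -> List[PT]:
    """Keep only the PTs for which no removal rule fires."""
    return [pt for pt in pts if removal_reason(pt) is None]
```

For example, `removal_reason(PT("pt1", n_genome_mappings=1, intronic=True))` returns `"unspliced_intronic"`, while a spliced, uniquely mapped PT with no BLAST hit passes every rule and is kept.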