Additional file 15: of Recombination-driven generation of the largest pathogen repository of antigen variants in the protozoan Trypanosoma cruzi
Dataset posted on 13.09.2016 by D. Weatherly, Duo Peng, Rick Tarleton
Generation of the recombination-positive control dataset. Two types of recombination events were introduced into simulated sequences: gene conversion and homologous recombination (reciprocal exchange). For gene conversion, a subsequence is extracted from a randomly chosen donor gene and used to replace the homologous subsequence in a randomly chosen recipient gene. For homologous recombination (reciprocal exchange), homologous subsequences from two randomly chosen genes are extracted and exchanged. A normal distribution is used to model the length of the recombination region, and an exponential distribution is used to model the number of recombination events per gene. The exponential distribution with probability density function $y = 0.52622\,e^{-0.52622x}$ is a good approximation of the frequency distribution of the number of recombination events per gene observed in our preliminary analysis.

A custom Perl script was used to introduce multiple recombination events into a single dataset. The script takes four parameters: the percentage of gene conversion events, "g%"; the total number of recombinant genes, "n"; the mean length of the recombination region, "l"; and the standard deviation of the recombination region length, "std". The program randomly chose g% of the genes to introduce recombination into. For each recombinant gene, the program sampled the exponential distribution described above to determine the number of recombination events to introduce into that gene. Each recombination event was a gene conversion with probability g% and a homologous recombination (reciprocal exchange) otherwise (probability 100% − g%). The length of each recombination region was sampled from the normal distribution $N(l, \mathrm{std}^2)$. (ZIP 23 kb)
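The procedure described above can be illustrated with a short Python sketch. This is not the authors' Perl script; it is a minimal reconstruction under several assumptions that are not stated in the description: sequences are equal-length, aligned strings, so "homologous" means the same coordinates in both genes; exactly n genes are selected as recombinants (one reading of the parameter "n"); g is passed as a fraction between 0 and 1; the continuous exponential draw is rounded up so each recombinant gene receives at least one event; and region lengths are clamped to the sequence length. All function and variable names are hypothetical.

```python
import math
import random

RATE = 0.52622  # rate of the exponential density quoted in the description


def sample_event_count(rng):
    # Assumption: round the continuous draw up, so every recombinant gene
    # receives at least one recombination event.
    return max(1, math.ceil(rng.expovariate(RATE)))


def sample_region(rng, seq_len, mean_len, std_len):
    # Region length ~ Normal(mean_len, std_len^2), clamped to [1, seq_len];
    # the start position is uniform over the positions where the region fits.
    length = int(round(rng.gauss(mean_len, std_len)))
    length = max(1, min(seq_len, length))
    start = rng.randrange(seq_len - length + 1)
    return start, start + length


def introduce_recombination(genes, g, n, mean_len, std_len, seed=None):
    """Return (recombined sequences, event log) for aligned, equal-length genes.

    Requires at least two genes and n <= len(genes).
    """
    rng = random.Random(seed)
    seqs = list(genes)
    seq_len = len(seqs[0])
    log = []
    # Assumption: n distinct genes are chosen as recombinants.
    for recipient in rng.sample(range(len(seqs)), n):
        for _ in range(sample_event_count(rng)):
            partner = rng.choice([i for i in range(len(seqs)) if i != recipient])
            start, end = sample_region(rng, seq_len, mean_len, std_len)
            a, b = seqs[recipient], seqs[partner]
            if rng.random() < g:
                # Gene conversion: the donor segment overwrites the recipient;
                # the donor itself is unchanged.
                seqs[recipient] = a[:start] + b[start:end] + a[end:]
                kind = "conversion"
            else:
                # Reciprocal exchange: the two genes swap the segment.
                seqs[recipient] = a[:start] + b[start:end] + a[end:]
                seqs[partner] = b[:start] + a[start:end] + b[end:]
                kind = "reciprocal"
            log.append((kind, recipient, partner, start, end))
    return seqs, log


if __name__ == "__main__":
    toy = ["A" * 50, "C" * 50, "G" * 50, "T" * 50]
    recombined, events = introduce_recombination(
        toy, g=0.5, n=2, mean_len=10, std_len=3, seed=1
    )
    for seq in recombined:
        print(seq)
    for event in events:
        print(event)
```

The event log records the type, the genes involved, and the breakpoints of every event, which is what makes a simulated dataset usable as a positive control: detected events can be compared against known ones. Before rounding, the exponential draw has mean 1/0.52622 ≈ 1.9 events per recombinant gene.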