Additional file 1: Figure S1. of Shedding light on the expansion and diversification of the Cdc48 protein family during the rise of the eukaryotic cell

Statistical validation of the AAA domain classification. (A) We used a resampling approach to evaluate the quality of our Hidden Markov Models (HMMs). New models were trained with a random subset of 90 % of the original sequences used to generate each model. We used the other 10 % as the search database with a fixed size of 100,000 sequences. This process was repeated 1000 times, and we considered the profile with the best expectation value to be correct. The positive predictive rate (PPR, black, left) and the sensitivity (white, right) are displayed. All models achieved at least 97 % PPR and sensitivity. (B-D) The Cdc48 family is part of a superfamily of classical AAA proteins that also includes proteasome subunits, metalloproteases, meiotic ATPases, and BCS1 [14]. As all our models were trained using Cdc48 AAA domain sequences, non-Cdc48 AAA domain sequences should be a much weaker fit to these models. To evaluate the specificity of our HMMs, we tested the extent to which our models also recognized non-Cdc48 AAA domains. For this, we selected approximately 1800 sequences from the larger family of classical AAA proteins and scanned these sequences with our models. The results are shown as box plots, with the 5 % and 95 % percentiles as whiskers. The plots show the scores (negative logarithm of the expectation value) of our models for the predictions of (B) Cdc48 sequences and (D) non-Cdc48 sequences. We used the two E-value distributions to define the cut-offs for the confidence of our Cdc48 AAA domain predictions. The 5 % percentile of the expectation value distribution in (B) was used as a ‘strict’ cut-off, whereas the 95 % percentile of the expectation value distribution from (D) served as a ‘soft’ cut-off. An overview of the ‘strict’ versus ‘soft’ cut-offs for all Cdc48 domain models is displayed in (C). SPAF.d1 is the only model whose ‘strict’ cut-off is lower than its ‘soft’ cut-off.
The ‘strict’ cut-off for this model is especially low, whereas the ‘soft’ cut-off is in a similar range to that of most other models. Plotting the scores of predictions on Cdc48 sequences for each AAA domain model shows that most curves have a logarithmic shape, whereas SPAF.d1 follows a linear trend (data not shown). This suggests a higher degree of sequence diversity within this domain. To uphold the quality of our predictions, we decided to use the higher ‘soft’ cut-off as the ‘strict’ cut-off for SPAF.d1 as well. (AI 4 mb)
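
The cut-off rule described for panels (B-D) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the percentiles are taken on the score scale (negative logarithm of the E-value) and that the logarithm is base 10, neither of which is stated in the caption. The function names, the E-value floor, and the three confidence labels are hypothetical and were chosen only for illustration.

```python
import numpy as np


def neg_log10_evalue(evalues, floor=1e-300):
    """Convert E-values to scores, -log10(E).

    Assumption: base 10. The floor is a hypothetical guard that
    keeps E-values reported as 0 from producing infinite scores.
    """
    clipped = np.maximum(np.asarray(evalues, dtype=float), floor)
    return -np.log10(clipped)


def derive_cutoffs(cdc48_scores, non_cdc48_scores):
    """Return (strict, soft) score cut-offs for one domain model.

    strict: 5th percentile of the scores on Cdc48 sequences, panel (B).
    soft:   95th percentile of the scores on non-Cdc48 sequences, panel (D).
    If strict falls below soft (the SPAF.d1 case in the caption),
    the soft value is used for both.
    """
    strict = float(np.percentile(cdc48_scores, 5))
    soft = float(np.percentile(non_cdc48_scores, 95))
    if strict < soft:
        strict = soft
    return strict, soft


def confidence(score, strict, soft):
    """Label a prediction by comparing its score to both cut-offs.

    The labels are hypothetical names for the two confidence tiers.
    """
    if score >= strict:
        return "strict"
    if score >= soft:
        return "soft"
    return "rejected"
```

A call such as derive_cutoffs(scores_b, scores_d), where scores_b and scores_d are hypothetical arrays holding the scores behind panels (B) and (D), would give the pair of cut-offs summarised for each model in panel (C).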