MOESM1 of The chemfp project

Dalke, Andrew

doi:10.6084/m9.figshare.11329592.v1

13321_2019_398_MOESM1_ESM.pdf (92.1 kB)

journal contribution

posted on 2019-12-06, 04:45 authored by Andrew Dalke

Additional file 1:

Table S1. Performance of different popcount implementations, in milliseconds and relative to the 8-bit lookup table time, measured using the threshold searches from the chemfp benchmark suite (T = 0.4 for 2048-bit searches, otherwise T = 0.7). In most cases the search algorithm uses a function pointer to dispatch to the appropriate popcount function, without memory prefetching. The POPCNT and AVX2 versions show times using loops of different sizes and “fully unrolled” versions, which implement the fingerprint popcount without a loop. The ‘inline’ variants inline the calculation and the ‘prefetch’ variants use memory prefetching. Timings were made with chemfp 3.3.

Figure S1. Scaling of k = 1 nearest-neighbor searches as a function of the number of targets, for different fingerprint types. MACCS and FP2 searches scale as roughly O(n^0.65), and the PubChem/CACTVS and Morgan searches scale as roughly O(n^0.8), in the number of fingerprints in the data set, which is the sublinear scaling expected from using BitBound. Timings were made with chemfp 1.5.

Table S2. Chemfp file scan search performance for 100 queries from each of the data sets in the chemfp benchmark. The search times show that chemfp processes 500–600 MiB/s. The GNU program “wc” version 8.25 can count the number of lines in about one tenth of the time, indicating that chemfp is not disk I/O bound.

Table S3. Number of Tanimotos evaluated for an in-memory search of each of the test cases in the chemfp benchmark suite. The number of Tanimotos is much less than the expected 1 billion (1000 queries × 1 million targets) because of the BitBound limits. The number of divisions is the number of tests which passed the fast rational rejection test and so require a 64-bit division. It shows the effectiveness of the rational rejection test.

Table S4. Performance comparison as a function of the number of fingerprints between the fastest implementation from Kristensen et al. [28] and chemfp 3.3, using the Kristensen benchmark data set. The benchmark does a threshold = 0.9 search using the first 100 fingerprints in the data set.

Table S5. Performance comparison as a function of minimum Tanimoto threshold between the fastest implementation from Kristensen et al. and chemfp 3.3, using the Kristensen benchmark data set. The benchmark uses the first 100 fingerprints in the data set to search the first 1,999,998 fingerprints. LinearSearcher is the fastest Kristensen method for all Tanimoto thresholds at or below 0.76. Timings for some thresholds are omitted here as they add little useful information. The full table, with a row for each threshold step of 0.01, is available from this paper’s BitBucket repository.
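The two filters named in the Table S3 description can be illustrated with a short sketch. This is not chemfp's code or API: the function names are hypothetical, fingerprints are modeled as Python integers, and the rational test is written in one plausible form (a cross-multiplied integer comparison against a threshold expressed as a fraction), which is an assumption about how such a test could avoid division. The BitBound limits use the standard popcount bounds: a target can reach Tanimoto T against a query with a bits set only if its own popcount lies between ceil(a·T) and floor(a/T).

```python
from fractions import Fraction
import math


def popcount(fp: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def bitbound_limits(query_popcount: int, threshold: Fraction):
    """Popcount range a target must fall in to possibly reach the threshold.

    Assumes threshold > 0; returns (lowest, highest) inclusive.
    """
    lowest = math.ceil(query_popcount * threshold)
    highest = math.floor(query_popcount / threshold)
    return lowest, highest


def threshold_search(query: int, targets, threshold: Fraction):
    """Hypothetical threshold search returning (index, score) pairs."""
    a = popcount(query)
    lowest, highest = bitbound_limits(a, threshold)
    hits = []
    for index, target in enumerate(targets):
        b = popcount(target)
        # BitBound filter: skip targets whose popcount rules them out.
        if b < lowest or b > highest:
            continue
        c = popcount(query & target)
        union = a + b - c
        if union == 0:
            continue  # both fingerprints empty; skipped in this sketch
        # Rational rejection (assumed form): compare c/union with p/q
        # by cross-multiplication, so no division is needed to reject.
        if c * threshold.denominator < threshold.numerator * union:
            continue
        # Only candidates that pass both filters pay for the division.
        hits.append((index, c / union))
    return hits


targets = [0b11110000, 0b11100000, 0b11000000, 0b11110011, 0b01111000]
print(threshold_search(0b11110000, targets, Fraction(7, 10)))
```

In this toy run, two targets are excluded by the popcount limits and one by the cross-multiplied comparison, so only two divisions are performed. A production implementation would typically sort the targets by popcount so that the excluded range is never visited at all, rather than testing each target in turn.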