Posted on 2017-12-29, 18:22. Authored by Florian Wenzel, Théo Galy-Fajou, Matthäus Deutsch, Marius Kloft
This record contains seven real-world test datasets used in experiments with the Bayesian SVM algorithm in the ECML PKDD 2017 paper by Wenzel et al., "Bayesian Nonlinear Support Vector Machines for Big Data".
The datasets are used in the related experiments to compare the prediction performance, the quality of the uncertainty estimates, and the run time of the various methods. Collectively, these contain millions of samples. The datasets are all from the Rätsch benchmark datasets commonly used to test the accuracy of binary nonlinear classifiers.
Data files are in the .data format used by Analysis Studio, a statistical analysis and data mining program. Each file contains mined data in a plain-text, tab-delimited format, including an Analysis Studio file header. The raw data can be opened with any text editor.
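Because the files are plain text and tab-delimited, they can likely be read without Analysis Studio. The following is a minimal Python sketch, not a documented loader for this record. It assumes that the file header occupies a fixed number of leading lines (the hypothetical `header_lines` parameter, default 1) and that every remaining field is numeric. Neither assumption is stated in the record, so inspect a file in a text editor first. The sample text below is made up purely for illustration.

```python
import csv
import io


def read_tab_delimited(stream, header_lines=1):
    """Parse tab-delimited text into (header, rows).

    header_lines is an assumption: the record does not say how many
    lines the Analysis Studio header occupies.
    """
    # Consume the leading header lines verbatim.
    header = [next(stream).rstrip("\r\n") for _ in range(header_lines)]
    # Parse the remaining lines as tab-separated numeric fields
    # (assumes all values are numeric), skipping blank lines.
    reader = csv.reader(stream, delimiter="\t")
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


# Hypothetical sample standing in for the contents of a real file.
sample = "x1\tx2\tlabel\n0.5\t1.25\t1\n-0.75\t2.0\t-1\n"
header, rows = read_tab_delimited(io.StringIO(sample))
print(header)
print(rows)
```

To try this on a downloaded file, open it with `open(path, newline="")` and pass the file object in place of the `io.StringIO` sample, adjusting `header_lines` to match what the file actually contains.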
The data are from a range of disciplines that correspond to applications considered in the related publication:
Processed_BreastCancer.data
Processed_Diabetis.data
Processed_Flare.data
Processed_German.data
Processed_Heart.data
Processed_Splice.data
Processed_Waveform.data
Background
We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors, such as accurate predictive uncertainty estimates and automatic hyperparameter search.