Co-Regularised Support Vector Regression (CoSVR): Data and Python Implementation

CoSVR Python implementation: an Eclipse project containing the CoSVR implementations and a framework for setting up and running experiments.

Data and code are provided as .zip archives, which can be opened with freely available zip utilities. Source code for the CoSVR experiments is provided as Python .py files that can be edited with any text editor and run in an IDE or from the command line on a system with Python installed. Raw data are supplied in openly accessible .csv and .txt files.

Ligand datasets:
24 datasets, each corresponding to one target protein. The ligands (molecules) in each dataset are labelled with their binding affinity to the target protein.

Datasets are available in 3 different fingerprints / views:
ECFP4, MACCS, GpiDAPH3

The raw datasets have been preprocessed and stored as CSV files in the csv folder.

Content

This dataset contains a Python project (including Eclipse project files) for running CoSVR experiments on ligand affinity prediction tasks.
  • data: the data package contains a data handler class.
  • experiments: the experiments folder contains the exp.py files, which set up and run an experiment.
  • framework: the framework package contains classes for running an experiment, performing parameter tuning, and evaluating the results.
  • learner: the learner package contains all CoSVR variants as well as several baselines, including co-regularised least squares regression (CoRLSR), standard SVR, and standard RLSR.
  • test: the test package contains some unit tests for the project.

How do I get set up?

  • Download the project.
  • Make sure you have all necessary Python packages installed (numpy, cvxopt, and scikit-learn, which is imported as sklearn).
  • Run an experiment (python exp.py).
  • A new folder, in which all results will be stored, is created in the folder that contains the exp.py file.
  • Enjoy!
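
Before running an experiment, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the project: the function name missing_packages is hypothetical, and the check only looks up each package by its import name without verifying versions.

```python
import importlib.util

# Import names of the packages listed in the setup steps
# (scikit-learn is imported as "sklearn").
REQUIRED = ("numpy", "cvxopt", "sklearn")


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```

If the script reports missing packages, install them with your usual package manager and then run the experiment again.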

Background

In the related publication linked from this dataset, we consider a semi-supervised learning scenario for regression in which only a few labelled examples, many unlabelled instances, and different data representations (multiple views) are available. For this setting, we extend support vector regression with a co-regularisation term and obtain co-regularised support vector regression (CoSVR). In addition to labelled data, co-regularisation includes information from unlabelled examples by ensuring that models trained on different views make similar predictions.

Ligand affinity prediction is an important real-world problem that fits into this scenario. The characterisation of the strength of protein-ligand bonds is a crucial step in the process of drug discovery and design. We introduce variants of the base CoSVR algorithm and discuss their theoretical and computational properties. For the CoSVR function class, we provide a theoretical bound on the Rademacher complexity. Finally, we demonstrate the usefulness of CoSVR for the affinity prediction task and evaluate its performance empirically on different protein-ligand datasets. We show that CoSVR outperforms co-regularised least squares regression as well as existing state-of-the-art approaches for affinity prediction.
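
To make the idea of co-regularisation concrete, the sketch below evaluates a co-regularised objective for linear models with one weight vector per view. It is an illustration under stated assumptions, not the project's implementation: the function names, the parameters nu, lam and eps, the use of linear models, and the squared disagreement penalty on unlabelled examples are all chosen here for simplicity. The related publication defines the actual CoSVR variants.

```python
def eps_insensitive(residual, eps):
    """Epsilon-insensitive loss as used in SVR: zero inside the eps-tube."""
    return max(0.0, abs(residual) - eps)


def predict(weights, x):
    """Linear prediction for a single feature vector."""
    return sum(w * xi for w, xi in zip(weights, x))


def coregularised_objective(weights_by_view, labelled, unlabelled,
                            nu=1.0, lam=1.0, eps=0.1):
    """Evaluate an illustrative co-regularised objective (hypothetical helper).

    weights_by_view: one weight vector per view
    labelled:        list of (views, y); views holds one feature vector per view
    unlabelled:      list of views (no label)
    """
    total = 0.0
    # Per-view part: norm regulariser plus epsilon-insensitive loss on labelled data.
    for v, w in enumerate(weights_by_view):
        total += 0.5 * nu * sum(wi * wi for wi in w)
        for views, y in labelled:
            total += eps_insensitive(y - predict(w, views[v]), eps)
    # Co-regularisation part: penalise disagreement between every pair of
    # views on the unlabelled examples (squared difference is an assumption).
    n_views = len(weights_by_view)
    for views in unlabelled:
        for u in range(n_views):
            for v in range(u + 1, n_views):
                diff = (predict(weights_by_view[u], views[u])
                        - predict(weights_by_view[v], views[v]))
                total += lam * diff * diff
    return total
```

With lam set to zero the views are trained independently; increasing lam pushes the per-view models towards agreeing on the unlabelled examples, which is how unlabelled data enters the objective.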