SVM-FOLD detects subtle protein sequence similarities by learning from all available annotated proteins, as well as utilizing potential hits as identified by PSI-BLAST. Predictions of classes of proteins that do not have any known example with a significant pairwise PSI-BLAST E-value can still be found using SVMs.
We are currently upgrading the web server to SCOP v1.73 (02/20/09).
Users of the website can enter raw sequence data in FASTA format directly into a form on the web page, or select a local FASTA file to upload to the server. Alternatively, the user can supply a PSI-BLAST profile file (the output of blastpgp -Q) instead. An example sequence can be loaded into the query text box by clicking the link marked 'load example'.
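Before submitting, it can help to check that a query is well-formed FASTA. The following is a minimal sketch of such a check, not the server's own validation code. The function names, the requirement of a '>' header line, and the restriction to the 20 standard amino acid codes are assumptions made for illustration.

```python
# Hypothetical helper names; this is not part of SVM-FOLD itself.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard 20 residues only


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not standard amino acid codes."""
    return sorted(set(sequence) - AMINO_ACIDS)


example = """>query1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq), invalid_residues(seq))
```

A sequence for which invalid_residues returns a non-empty list may contain ambiguity codes or stray characters and is worth inspecting before upload.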
The server can perform both superfamily and fold detection but performs fold detection by default. One can move between the two modes using the link that says "Change to fold detection" or "Change to superfamily detection" on the front page (which link is visible depends on the current mode). The coverage for each mode, that is, the set of folds or superfamilies that a given query sequence is ranked against, is given via the link named 'coverage' on the front page.
The advanced options can be accessed by clicking on the spiral above the text entry box. In the advanced options one can select either zero-one or balanced loss optimized SVMs, or alternatively just use standard PSI-BLAST ranking.
The server will then add the query to a queue and compute the results. A query currently takes approximately 6 minutes from when it starts processing. Approximately 5 minutes of this is spent computing the profile with PSI-BLAST for use with the profile kernel, so if the user can supply their sequence in the form of a precomputed profile (the output of blastpgp -Q), the query time will be dramatically reduced.
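As a rough illustration of how a profile might be precomputed locally, the sketch below assembles a blastpgp command line. The -i (query file), -d (database), -j (number of iterations) and -Q (ASCII profile output) options are standard blastpgp options, but the database name "nr", the iteration count and the file names are assumptions; the settings used by the server itself are not stated here.

```python
import subprocess  # only needed if the command is actually run


def build_blastpgp_command(fasta_path, database, profile_path, iterations=3):
    """Build an argument list for blastpgp that writes an ASCII profile via -Q.

    The database and iteration count are illustrative assumptions, not the
    server's settings.
    """
    return [
        "blastpgp",
        "-i", fasta_path,        # query sequence in FASTA format
        "-d", database,          # sequence database to search
        "-j", str(iterations),   # number of PSI-BLAST iterations
        "-Q", profile_path,      # ASCII profile file to upload to the server
    ]


command = build_blastpgp_command("query.fasta", "nr", "query.pssm")
print(" ".join(command))
# To run it (requires a local BLAST installation and database):
# subprocess.run(command, check=True)
```

The resulting query.pssm file is the kind of file that can be uploaded in place of the raw sequence.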
On completion, the user is presented with a table showing the resulting SCOP fold (or superfamily) ranking of their sequence, along with an empirically estimated confidence value and SCOP-derived comments relating to the fold or superfamily. The results table also contains links to pages detailing results for each target SCOP class. These pages link to the relevant pages on the SCOP website and show both PSI-BLAST E-values and profile kernel scores between the query protein and the proteins from that SCOP class in the training set. For each protein in these rankings, the user can follow links to its full SCOP definition or to a molecule rendering of that protein. The molecule renderer uses OpenRasMol on the server side to deliver small animated 3D renders. There are controls on this page to rotate the molecule and to alter the render style.
The scores reported by this algorithm lie roughly in the range -1 to 1; the larger the score, the more significant the match. The confidence value is estimated empirically from the score and the fold or superfamily, and lies in the range 0 to 1.
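The exact procedure used to estimate the confidence value is not described here. One simple way such an empirical estimate could be made, shown purely as an assumption, is to take held-out proteins scored against a given fold or superfamily and report the fraction of those scoring at least as high as the query that truly belong to it. The function name and the toy data below are hypothetical.

```python
def empirical_confidence(score, heldout_scores, heldout_labels):
    """Fraction of held-out examples with score >= `score` that are true members.

    Returns None when no held-out example reaches the given score.
    This is an illustrative estimator, not the server's documented method.
    """
    selected = [
        label
        for s, label in zip(heldout_scores, heldout_labels)
        if s >= score
    ]
    if not selected:
        return None
    return sum(1 for label in selected if label) / len(selected)


# Toy held-out scores for one class; True marks a real member of the class.
scores = [0.9, 0.5, 0.2, -0.3]
labels = [True, True, False, False]

high = empirical_confidence(0.4, scores, labels)
middle = empirical_confidence(0.0, scores, labels)
low = empirical_confidence(-1.0, scores, labels)
print(high, middle, low)
```

Under this kind of estimate the value depends on the class as well as the score, because each class has its own held-out score distribution.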
If the query sequence produces strong matches in the PSI-BLAST stage of the computation, the PSI-BLAST E-value hits are also reported in a separate table.
An example results page can be found here
SVMs build prediction models in a vector space. The feature representation we use is given by the profile kernel, which uses PSI-BLAST profiles to define a mutation neighborhood for each k-length segment of an amino acid sequence.
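To make the idea of a mutation neighborhood concrete, here is a toy sketch in the spirit of the profile kernel of reference [2]: a k-mer belongs to the neighborhood of a k-length profile segment when its summed negative log-probability under the segment's columns falls below a threshold sigma, and the kernel is the inner product of the resulting k-mer counts. The two-letter alphabet, the profile values, the threshold and all function names are assumptions chosen for readability. The brute-force enumeration is only practical for tiny examples and is not how the server computes the kernel.

```python
import math
from collections import Counter
from itertools import product


def neighborhood(profile_segment, alphabet, sigma):
    """k-mers whose summed -log probability under the segment is below sigma."""
    k = len(profile_segment)
    result = []
    for kmer in product(alphabet, repeat=k):
        cost = 0.0
        for column, letter in zip(profile_segment, kmer):
            p = column.get(letter, 0.0)
            if p <= 0.0:
                cost = math.inf  # a zero-probability letter rules the k-mer out
                break
            cost -= math.log(p)
        if cost < sigma:
            result.append("".join(kmer))
    return result


def feature_map(profile, alphabet, k, sigma):
    """Count how many k-length segments have each k-mer in their neighborhood."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        counts.update(neighborhood(profile[start:start + k], alphabet, sigma))
    return counts


def profile_kernel(profile_a, profile_b, alphabet, k, sigma):
    """Inner product of the two profiles' k-mer count vectors."""
    fa = feature_map(profile_a, alphabet, k, sigma)
    fb = feature_map(profile_b, alphabet, k, sigma)
    return sum(count * fb[kmer] for kmer, count in fa.items())


# Toy profile: one dict of letter probabilities per sequence position.
toy_profile = [{"A": 0.9, "B": 0.1}, {"A": 0.5, "B": 0.5}]
print(neighborhood(toy_profile, "AB", 1.0))
print(profile_kernel(toy_profile, toy_profile, "AB", 2, 1.0))
```

With the 20-letter amino acid alphabet the number of candidate k-mers grows as 20^k, so a practical implementation needs to avoid enumerating them all.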
To solve the multiclass problem we employ a novel method of learning error correcting codes, which utilizes the class hierarchy of SCOP. The entire method employed on this server is based on the following publications:
[1] E. Ie, J. Weston, W. Stafford Noble and C. Leslie. Multi-class protein fold recognition using adaptive codes. International Conference on Machine Learning, 2005.
This web server was supported by National Institutes of Health grant GM74257-01. This work was also funded in part by award EIA-0312706 from the National Science Foundation. Jason Weston and Iain Melvin also thank NEC Laboratories America, Princeton, where they work, for their support of this project.
[2] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Proceedings of the IEEE Computational Systems Bioinformatics 2004, Stanford, August, 2004.
[3] C. Leslie, E. Eskin, A. Cohen, J. Weston and W. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20:4, pp. 467-476, 2004.
[4] I. Melvin, E. Ie, J. Weston, W. Noble and C. Leslie. Multi-class protein classification using adaptive codes.
Credits
SVM-FOLD was created by Iain Melvin, Jason Weston, Rui Kuang, Christina Leslie and William Stafford Noble.