Use of Text Mining for Protein Structure Prediction and Functional Annotation in Lack of Sequence Homology

Andreas Rechtsteiner*, Jeremy Luinstra**, Luis M. Rocha*, Charlie E. Strauss**

*Indiana University
1900 East Tenth Street, Bloomington IN 47408

**Los Alamos National Laboratory
P.O.Box 1663, Los Alamos, NM 87545

Citation: Rechtsteiner, A., Luinstra, J., Rocha, L.M., Strauss, C.E. [2006]. "Use of Text Mining for Protein Structure Prediction and Functional Annotation in Lack of Sequence Homology". In: Joint BioLINK and Bio-Ontologies Meeting 2006 (ISMB Special Interest Group). In Press.

The full paper is available in a preprint pdf version


Background: Linking information from different data sources, particularly the literature, is becoming increasingly important for annotating the growing number of new genome sequences. For the large percentage of genes with no known sequence homologs, new, possibly integrative, methods need to be developed. Some of us previously pursued ab-initio structure prediction and comparison as a method for functional annotation of sequences with no known homologs. Here we use a large set of sequences of known structure to evaluate a new method that uses keyword information from the literature to improve our previous ab-initio structure prediction method.

Results: We report two results. First, the literature and keyword similarity measure we employ here performs well in identifying functional and/or structural relationships even when there is little or no sequence homology between the compared proteins, the difficult but frequent so-called “twilight zone” case in annotation and structure prediction. Second, our novel method, which uses literature to assist SCOP super-family prediction [2], significantly improves on our original ab-initio structure prediction algorithm.

Conclusions: We show that the literature keywords and similarity measure used here are of great value for the increasingly important task of functionally annotating new sequences with little or no sequence homology.

Keywords: Bioinformatics, Text Mining, Natural Language Processing, Proteomics, Structure Prediction, Homology, Proteins, Bibliome, Information Retrieval, Pfam, SwissProt, Classification, MeSH, PubMed, Gene Ontology.

For more information contact Luis Rocha at
Last Modified: June 28, 2006