Protein Annotation as Term Categorization in the Gene Ontology using Word Proximity Networks

Karin Verspoor*, Judith Cohn*, Cliff Joslyn*, Sue Mniszewski*, Andreas Rechtsteiner*, Luis M. Rocha**, Tiago Simas**

*Modeling, Algorithms, and Informatics Group (CCS-3)
Los Alamos National Laboratory, MS B256
Los Alamos, New Mexico 87545, USA

**School of Informatics and Cognitive Science Program
Indiana University
1900 East Tenth Street, Bloomington IN 47408

Citation: Verspoor, K., J. Cohn, C. Joslyn, S. Mniszewski, A. Rechtsteiner, L.M. Rocha, T. Simas [2005]. "Protein Annotation as Term Categorization in the Gene Ontology using Word Proximity Networks". BMC Bioinformatics, 6(Suppl 1):S20. doi:10.1186/1471-2105-6-S1-S20

The full paper is freely available on the BMC Bioinformatics site. A PDF version is also available.

This paper was part of BioCreative: A critical assessment of text mining methods in molecular biology. More papers from this competition are also available on the BMC Bioinformatics website.

Abstract.

Background

We participated in BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document, and the selection of evidence text from the document justifying that annotation. We approached the task using several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology that treats annotation as categorization of terms from a protein's document neighborhood into the GO.

Results

The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results.

Conclusion

The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.

Keywords: Text Mining, Information Retrieval, Computational Biology, Bioinformatics, Genomics, Proteomics, Gene Ontology, Protein Function, Function, Annotations.

For more information contact Luis Rocha at rocha@indiana.edu.
Last Modified: May 25, 2005