Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Alaa Abi-Haidar¹,⁶, Jasleen Kaur¹, Ana G. Maguitman², Predrag Radivojac¹, Andreas Retchsteiner³, Karin Verspoor⁴, Zhiping Wang⁵, Luis M. Rocha¹,⁶,*

¹School of Informatics, Indiana University, 1900 East Tenth Street, Bloomington IN 47408, USA
²Universidad Nacional del Sur, Bahia Blanca, Argentina
³Center for Genomics and Bioinformatics, Indiana University, USA
⁴Information Sciences Group, Los Alamos National Laboratory, USA
⁵Biostatistics, School of Medicine, Indiana University, USA
⁶FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciencia, Portugal
*To whom correspondence should be addressed: rocha@indiana.edu

Citation: A. Abi-Haidar, J. Kaur, A. Maguitman, P. Radivojac, A. Retchsteiner, K. Verspoor, Z. Wang, and L.M. Rocha [2008]. "Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks". Genome Biology, 9(Suppl 2):S11. doi:10.1186/gb-2008-9-s2-s11. PMC2559982.

The full text and a PDF reprint are available from the Genome Biology open access site. Supplemental materials are also available. Because of the mathematical notation and graphics, only the abstract is presented here.

Background: We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (IAS), discovery of protein pairs (IPS) and text passages characterizing protein interaction (ISS) in full text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam-detection techniques, as well as an uncertainty-based integration scheme. We also used a Support Vector Machine and the Singular Value Decomposition on the same features for comparison purposes. Our approach to the full text subtasks (protein pair and passage identification) includes a feature expansion method based on word-proximity networks.
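The feature expansion idea mentioned above can be illustrated with a minimal sketch. The abstract does not define the proximity measure or the expansion rule, so everything here is an assumption made for illustration: proximity is taken to be a Jaccard-style ratio of the documents in which two words co-occur, and the function names, the `threshold` parameter and the toy documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def proximity_network(docs):
    """Build a word-proximity network from tokenized documents.

    Assumption: proximity(a, b) = |docs containing both| / |docs containing either|.
    Returns a dict mapping an alphabetically ordered word pair to its proximity.
    """
    occurs = defaultdict(set)  # word -> ids of the documents that contain it
    for doc_id, tokens in enumerate(docs):
        for word in set(tokens):
            occurs[word].add(doc_id)

    network = {}
    for a, b in combinations(sorted(occurs), 2):
        both = len(occurs[a] & occurs[b])
        if both:  # keep only pairs that co-occur at least once
            either = len(occurs[a] | occurs[b])
            network[(a, b)] = both / either
    return network


def expand_features(seed_words, network, threshold=0.5):
    """Add every word whose proximity to some seed word is >= threshold."""
    seeds = set(seed_words)
    expanded = set(seeds)
    for (a, b), proximity in network.items():
        if proximity >= threshold:
            if a in seeds:
                expanded.add(b)
            if b in seeds:
                expanded.add(a)
    return expanded


# Toy corpus, invented for this example.
docs = [
    ["binds", "interacts", "protein"],
    ["binds", "interacts"],
    ["protein", "gene"],
]
network = proximity_network(docs)
expanded = expand_features({"binds"}, network, threshold=0.5)
```

In this toy corpus "binds" and "interacts" always appear together, so expanding the seed word "binds" pulls in "interacts", while "protein" stays out because it shares only one of three documents with "binds".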

Results: Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of the performance measures used in the challenge evaluation (accuracy, F-score and AUC). We also report on a web tool we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages.
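For readers unfamiliar with mean reciprocal rank, the following sketch computes it in its standard form: for each query, take the reciprocal of the rank of the first correct item, then average over queries. The input format (one list of correctness flags per query, in ranked order) and the convention that a query with no correct item contributes zero are assumptions of this example; the challenge's exact evaluation protocol is not described in this abstract.

```python
def mean_reciprocal_rank(rankings):
    """Standard mean reciprocal rank.

    rankings: one list per query, holding True/False flags in ranked order,
    where True marks a correct item. A query with no correct item adds 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for flags in rankings:
        for rank, correct in enumerate(flags, start=1):
            if correct:
                total += 1.0 / rank
                break  # only the first correct item counts
    return total / len(rankings)


# Three hypothetical queries: first hit at rank 2, at rank 1, and no hit at all.
mrr = mean_reciprocal_rank([[False, True], [True], [False, False]])
```

Here the three queries contribute 1/2, 1 and 0, so the mean is 0.5.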

Conclusions: Our approach to abstract classification shows that a simple linear model, using relatively few features, is capable of generalizing and uncovering the conceptual nature of protein-protein interaction from the bibliome. Since the novel approach is based on a very lightweight linear model, it can be easily ported and applied to similar problems. In full text problems, the expansion of word features with word-proximity networks is shown to be useful, though the need for some improvements is discussed.

Keywords: Protein interaction, text mining, bibliome informatics, support vector machines, singular value decomposition, spam detection, uncertainty measures, proximity graphs, complex networks.

For more information, contact Luis Rocha at rocha@indiana.edu.
Last Modified: October 27, 2008