Citation: Maguitman, A. G., Rechtsteiner, A., Verspoor, K., Strauss, C. E., Rocha, L. M. [2006]. "Large-Scale Testing Of Bibliome Informatics Using Pfam Protein Families". In: Pacific Symposium on Biocomputing 11:76-87.
Abstract.
Literature mining is expected to help not only with automatically sifting through the huge biomedical literature and annotation databases, but also with linking bio-chemical entities to appropriate functional hypotheses. However, there has been very limited success in testing literature mining methods due to the lack of large, objectively validated test sets or “gold standards”. To improve this situation we created a large-scale test of literature mining methods and resources. We report on a specific implementation of this test: how well can the Pfam protein family classification be replicated from independently mining different literature/annotation resources? We test and compare different keyterm sets as well as different algorithms for issuing protein family predictions. We find that protein families can indeed be automatically predicted from the literature. Using words from PubMed abstracts, of 3663 proteins tested, over 75% were correctly assigned to one of 618 Pfam families. For 90% of proteins the correct Pfam family was among the top 5 ranked families. We found that protein family prediction is far superior with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself (in combination with the vector space model) is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership. Finally, we show that Shannon’s entropy can be exploited to improve prediction by facilitating the integration of the different literature sources tested.
Keywords: Bioinformatics, Text Mining, Natural Language Processing, Proteins, Bibliome, Information Retrieval, Pfam, SwissProt, Classification, MeSH, PubMed, Gene Ontology.
1. Introduction
Biology was until recently essentially a hypothesis-driven science in which experiments were carefully designed to answer one or very few specific questions — e.g. to test the function of a specific protein in a specific context. In the last decade, fueled by the widespread use of high-throughput technology, we have witnessed the emergence of a more data-driven paradigm for biological research. Since high-throughput experiments are frequently conducted for the sake of discovery rather than hypothesis testing, and due to the sheer number of measured variables they entail, it is very difficult to interpret their results. Moreover, since the goal of many experiments is to uncover bio-chemical and functional information about genes and proteins, there is an obvious need to understand the linkages amongst biological entities in literature and databases that allow us to make inferences. Literature mining is expected to help with those inferences; its objective is to automatically sort through huge collections of literature and suggest the most relevant pieces of information for a specific analysis task, e.g. the annotation of proteins. Another application is to uncover similarities of genes according to “publication space”, or the more tongue-in-cheek term "bibliome".
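The notion of gene similarity in "publication space" can be made concrete with a standard vector-space sketch: represent each gene by a term vector built from its linked abstracts, and compare genes by cosine similarity. This is a minimal illustration of the general idea, not the specific representation used in this study; the tokenization and function names are our own.

```python
import math
from collections import Counter

def term_vector(abstracts):
    """Bag-of-words term vector for a gene, built from its linked abstracts."""
    counts = Counter()
    for text in abstracts:
        counts.update(text.lower().split())
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two genes whose literatures share vocabulary score higher than unrelated ones.
g1 = term_vector(["kinase domain binds atp", "kinase activity in signaling"])
g2 = term_vector(["atp binding kinase motif"])
g3 = term_vector(["ribosomal rna processing"])
print(cosine(g1, g2) > cosine(g1, g3))
```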
Since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. Indeed, the Bibliome is not just the collection of publications and annotations available; its usefulness ultimately depends on the quality of linking resources that allow us to associate experimental data with publications and annotations. Interestingly, while literature mining is receiving considerable attention in Bioinformatics, it has hitherto not been seriously validated. Towards improving this situation, we present here our large-scale testing and comparison of literature mining algorithms, paired with specific bibliome resources.
We present a general method for testing bibliome resources and literature mining algorithms in the context of classification of biological entities. This method formalizes and extends a previous study in which we tested how well the Pfam protein family classification can be inferred from PubMed as indexed by the MeSH keyterm vocabulary. We expand on these results by testing additional bibliome resources such as GO annotations and text extracted from PubMed abstracts for the same classification problem. We additionally propose a new method based on Shannon’s entropy to integrate results from different bibliome resources, and show that it significantly improves protein family predictions.
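The entropy-based integration can be sketched in outline: each bibliome source yields a score distribution over Pfam families for a protein, and sources whose distributions have lower Shannon entropy (i.e. are more decisive) are given more weight when the scores are combined. The sketch below is a minimal illustration under our own assumptions; in particular, the inverse-entropy weighting form and the normalization are hypothetical, not the paper's exact formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def integrate(source_scores):
    """Combine per-source family scores for one protein.

    source_scores: dict of source name -> dict of family -> raw score.
    Each source's scores are normalized to a distribution; low-entropy
    (peaked, hence decisive) sources receive higher weight.  Returns
    families ranked by combined score, best first.
    """
    combined = {}
    for source, scores in source_scores.items():
        total = sum(scores.values())
        if total == 0:
            continue  # source has no signal for this protein
        dist = {fam: s / total for fam, s in scores.items()}
        weight = 1.0 / (1.0 + entropy(dist.values()))  # assumed weighting form
        for fam, p in dist.items():
            combined[fam] = combined.get(fam, 0.0) + weight * p
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# A decisive source (abstract words) outweighs an indifferent one (MeSH).
ranked = integrate({
    "abstracts": {"PF00069": 9, "PF00001": 1},
    "mesh": {"PF00069": 1, "PF00001": 1, "PF00018": 1},
})
print(ranked[0][0])  # → PF00069
```

In this toy run, the abstract-derived distribution is sharply peaked (entropy ≈ 0.47 bits) while the MeSH distribution is uniform (entropy ≈ 1.58 bits), so the peaked source dominates the final ranking.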