Citation: Rocha, Luis M. [2001]. "Integrative Technology for Bioinformatics." Los Alamos National Laboratory Internal Report LAUR 01-6859.
This document outlines the research directions in Information Retrieval for Bioinformatics being pursued by Luis M. Rocha and the project teams of the Active Recommendation Project and of the LDRD-ER project FY02-CSSE-012: Identification of interests, trends and dynamics in Document Networks. Copyright Luis M. Rocha, December 2001. LAUR 01-6859
The production of ever larger databases in molecular biology, particularly those containing genomic data, has led to a strong interest in Bioinformatics and Computational Biology, driven by the obvious need to analyze and understand such large collections of data. In particular, microarray technology, with its ability to measure the expression patterns of thousands of genes simultaneously, presents researchers with formidable data analysis difficulties.
The first wave of methods employed to analyze gene expression data was brought in from the fields of data-mining, machine learning, and statistics [e.g. Eisen et al, 1998; Alter et al, 2000; Hastie et al, 2000]. These methods are typically used to discover patterns of expression behavior and thereby identify the subsets of genes associated with them. However, this analysis relies exclusively on the numerical expression values obtained from microarray experiments, so these methods cannot directly help us derive functional knowledge. The biological reasons for the patterns identified by these techniques must ultimately be ascertained by biologists, who need to be able to integrate knowledge about a large number of possible underlying biological mechanisms. Given the large number of genes in microarrays and the myriad possible networks of cellular interaction, this is a daunting task indeed.
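To make the limitation concrete, the following minimal Python sketch groups genes purely by the similarity of their expression profiles. It is an illustration only, not the method of any of the works cited above: the gene names, the expression values, the function names `pearson` and `coexpressed_pairs`, and the correlation threshold are all hypothetical, and the profiles are assumed to be non-constant so that the correlation is defined.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical expression values (one list per gene, one entry per experiment).
EXPRESSION = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 4.0, 6.0, 8.0],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
}


def coexpressed_pairs(expression, threshold):
    """Return the gene pairs whose profiles correlate at or above threshold."""
    return {
        (g, h)
        for g, h in combinations(sorted(expression), 2)
        if pearson(expression[g], expression[h]) >= threshold
    }


pairs = coexpressed_pairs(EXPRESSION, threshold=0.9)
```

The result is a set of gene pairs and nothing more: the computation has no access to what the genes do, which is why the functional interpretation still falls to the biologist.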
Renewed interest in Systems Biology has led researchers in Bioinformatics to the idea that, in general, no single set of measurements, data analysis method, or research team will be sufficient to understand complex biological networks of vast size [e.g. Kanehisa, 2000; Kitano, 2000; Eckardt, 2001]. Instead, this research needs to be carried out by interdisciplinary teams empowered with Informatics technology capable of automatically integrating the results of pattern recognition analysis of microarray data with available sources of functional knowledge. Clearly, such integrative technology does not aim to replace biologists, but rather to assist them by reducing the number of possible explanations of functional behavior.
One of the most promising avenues for developing such integrative technology lies in the application of modern Information Retrieval (IR) and Knowledge Management (KM) algorithms to databases of biomedical publications and data [Masys, 2001]. Modern information resources can be thought of as networks of documents. The prime example of a Document Network is the World Wide Web (WWW). But many other types of such networks exist: bibliographic databases containing scientific publications (e.g. MEDLINE: http://www.nlm.nih.gov), preprints (e.g. the e-Print Arxiv @ LANL: http://arxiv.org), as well as databases of datasets used in scientific endeavors (e.g. GenBank: http://www.ncbi.nlm.nih.gov/Genbank/ and PROSITE: http://www.expasy.org/prosite/). Each of these databases possesses several distinct relationships among documents, and between documents and the semantic tags or indices that classify them. For instance, documents in the WWW are related via a hyperlink network, while documents in bibliographic databases are related by citation and collaboration networks [Newman, 2000].
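As a rough illustration of this view, a Document Network can be sketched as two relations: one among documents (hyperlinks or citations) and one between documents and semantic tags. The Python class below is a minimal sketch under that assumption; the class name `DocumentNetwork`, its methods, and the document and tag names are hypothetical and do not correspond to any of the databases mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentNetwork:
    # document -> set of documents it links to or cites
    links: dict = field(default_factory=dict)
    # document -> set of semantic tags (keyterms, indices) classifying it
    tags: dict = field(default_factory=dict)

    def add_document(self, doc, tags=(), cites=()):
        """Register a document with its tags and outgoing links."""
        self.tags.setdefault(doc, set()).update(tags)
        self.links.setdefault(doc, set()).update(cites)
        # Make sure every cited document is also a node of the network.
        for target in cites:
            self.links.setdefault(target, set())
            self.tags.setdefault(target, set())

    def cited_by(self, doc):
        """Documents that link to or cite the given document."""
        return {d for d, targets in self.links.items() if doc in targets}

    def documents_with(self, tag):
        """Documents classified under the given semantic tag."""
        return {d for d, doc_tags in self.tags.items() if tag in doc_tags}


# Hypothetical three-document network.
net = DocumentNetwork()
net.add_document("doc1", tags={"microarray"}, cites={"doc2"})
net.add_document("doc2", tags={"microarray", "clustering"})
net.add_document("doc3", tags={"clustering"}, cites={"doc1", "doc2"})
```

Here `net.cited_by("doc2")` follows the document-to-document relation, while `net.documents_with("clustering")` follows the document-to-tag relation. Other relations, such as co-authorship, could be added as further mappings in the same way.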
The first two approaches described above [Masys et al 2001; Jenssen et al 2001] faced the known problems of synonymy and polysemy plaguing keyterm analysis in IR [Masys, 2001]. Synonymy means that several keyterms can refer to the same item (e.g. gene), and polysemy means that the same keyterm can refer to several items. To evaluate their network, Jenssen et al [2001] manually studied the validity of a set of gene associations. Of 500 randomly chosen pairs of genes with more than 5 co-occurrences, 29% were incorrect, mostly because the same keyterm is used to identify more than one gene, or a gene keyterm is also used to refer to some other entirely different concept.
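The effect of synonymy and polysemy on co-occurrence counts can be illustrated with a small Python sketch. It is not the procedure of Jenssen et al [2001], only a toy approximation of keyterm co-occurrence: the lexicon, the gene identifiers, the document texts, and the function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lexicon: keyterm -> gene identifiers it may denote.
LEXICON = {
    "ga1": {"GENE_A"},
    "alpha1": {"GENE_A"},        # synonymy: a second keyterm for GENE_A
    "gb": {"GENE_B"},
    "x2": {"GENE_C", "GENE_D"},  # polysemy: one keyterm, two genes
}


def genes_in(text):
    """Return the genes matched in text and the ambiguous keyterms seen."""
    tokens = set(text.lower().split())
    genes, ambiguous = set(), set()
    for term, ids in LEXICON.items():
        if term in tokens:
            genes |= ids
            if len(ids) > 1:
                ambiguous.add(term)
    return genes, ambiguous


def cooccurrence(docs):
    """Count, for each gene pair, the documents mentioning both genes."""
    counts = Counter()
    for text in docs:
        genes, _ = genes_in(text)
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


def associations(counts, min_count):
    """Keep the gene pairs with more than min_count co-occurrences."""
    return {pair for pair, n in counts.items() if n > min_count}


# Hypothetical document texts.
DOCS = [
    "ga1 interacts with gb",
    "alpha1 binds gb in vitro",
    "x2 regulates gb",
]
counts = cooccurrence(DOCS)
strong = associations(counts, min_count=1)
```

Because the lexicon resolves both `ga1` and `alpha1` to the same gene, the first two documents reinforce a single association. The polysemous keyterm `x2`, in contrast, produces pairs for both of its candidate genes from one mention, at least one of which is presumably spurious. Errors of this kind are what the manual evaluation described above measured.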
Our research in this area aims to reduce the linguistic ambiguity errors found in the type of approaches pursued by Masys et al [2001] and Jenssen et al [2001], and to reduce the dependence on human experts in the type of approach pursued by Shatkay et al [2000]. To achieve this, we follow two interacting lines of research.
It should be emphasized that the IR techniques detailed here are not meant as a substitute for pattern analysis of microarray expression experiments, nor for human expertise in Biology. Rather, we propose the development of these IR techniques for Bioinformatics as a complement to both of these sources of knowledge. Clearly, pattern recognition methods can discover expression relationships amongst groups of genes, but cannot by themselves reveal underlying biological causes or function. Furthermore, expert biologists are easily overwhelmed when trying to grasp the biological causes of the groupings discovered by pattern recognition methods, due to the sheer volume of genes and potential biological mechanisms involved. Therefore, techniques that recommend possible functional mechanisms and associated literature can help biologists by mediating between the results of pattern recognition and the scientific explanations available in the literature.
Alter, O., P.O. Brown, and D. Botstein [2000]. "Singular value decomposition for genome-wide expression data processing and modeling." Proc. Natl. Acad. Sci. USA. Vol. 97, No. 18, pp. 10101-10106.
Berry, M.W., S.T. Dumais, and G.W. O'Brien [1995]. "Using linear algebra for intelligent information retrieval." SIAM Review. Vol. 37, No. 4, pp. 573-595.
Eckardt, N.A. [2001]. "The New Biology: Genomics fosters a 'Systems Approach' and Collaborations between Academic, Government, and Industry Scientists." Plant Cell. Vol. 13, pp. 725-734.
Eisen, M.B., P.T. Spellman, P.O. Brown, and D. Botstein [1998]. "Cluster analysis and display of genome-wide expression patterns." Proc. Natl. Acad. Sci. USA. Vol. 95, pp. 14863-14868.
Hastie, T., et al [2000]. "'Gene Shaving' as a method for identifying distinct sets of genes with similar expression patterns." Genome Biology. Vol. 1, No. 2: 0003.1-0003.21, http://genomebiology.com/2000/1/2/research/0003/.
Jenssen, T.K., A. Lægreid, J. Komorowski, and E. Hovig [2001]. "A literature network of human genes for high-throughput analysis of gene expression." Nature Genetics. Vol. 28, No. 1, pp. 21-28.
Kitano, H. [2000]. "Perspectives on Systems Biology." New Generation Computing. Vol. 18, pp. 199-216.
Kleinberg, J.M. [1998]. "Authoritative sources in a hyperlinked environment." In: Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms. pp. 668-677.
Landauer, T.K., P.W. Foltz, and D. Laham [1998]. "Introduction to Latent Semantic Analysis." Discourse Processes. Vol. 25, pp. 259-284.
Masys, D.R. [2001]. "Linking microarray data to the literature." Nature Genetics. Vol. 28, No. 1, pp. 9-10.
Masys, D.R. et al [2001]. "Use of keyword hierarchies to interpret gene expression patterns." Bioinformatics. Vol. 17, No. 4, pp. 319-326.
Newman, M.E.J. [2000]. "The structure of scientific collaboration networks." Proc. Natl. Acad. Sci. USA. Vol. 98, pp. 404-409.
Rocha, Luis M. [1999a]. "TalkMine and the Adaptive Recommendation Project." In: Proceedings of the Association for Computing Machinery (ACM) - Digital Libraries 99. U.C. Berkeley, August 1999. pp. 242-243.
Rocha, Luis M. [1999b]. "Evidence sets: modeling subjective categories." International Journal of General Systems. Vol. 27, pp. 457-494.
Rocha, Luis M. [2001a]. "TalkMine: A Soft Computing Approach to Adaptive Knowledge Recommendation." In: Soft Computing Agents: New Trends for Designing Autonomous Systems. V. Loia and S. Sessa (Eds.). Springer-Verlag.
Rocha, Luis M. [2001b]. "Identification of interests, trends and dynamics in Document Networks." Los Alamos National Laboratory Internal Report for LDRD-ER FY02-CSSE-012. LAUR 01-2380.
Rocha, Luis M. and Johan Bollen [2001]. "Biologically motivated distributed designs for adaptive knowledge management." In: Design Principles for the Immune System and Other Distributed Autonomous Systems. I. Cohen and L. Segel (Eds.). Santa Fe Institute Series in the Sciences of Complexity. Oxford University Press, pp. 305-334.
Shatkay, H., S. Edwards, W. Wilbur, and M. Boguski [2000]. "Genes, themes, and microarrays: using information retrieval for large-scale gene analysis." In: Intelligent Systems for Molecular Biology. AAAI Press, pp. 317-328.