Introduction to Bioinformatics
These are the course materials for an introduction to Bioinformatics for the PhD program in Biomedicine at the Instituto Gulbenkian de Ciência, Oeiras, Portugal. It was designed as a one-week intensive course, taught March 11-15, 2002. Below are some of the materials used. The materials for the more recent 2006 course are available separately.
Contents
Syllabus
Monday, March 11th 2002
Tuesday, March 12th 2002
- Information Processes in Biology
- Synthetic, Multi-Disciplinary Approach to Biology
- Grand Challenges
- Full Curriculum for Bioinformatics
- Some traditional components of Bioinformatics: Sequence Analysis, Similarity Search, Motif Search, Data-driven vs. Knowledge-based Functional Interpretation, Sequence Alignment, Dynamic Programming for Sequence Alignment Optimization, Similarity Database Search, basics of the FASTA Method, Simulated Annealing and Genetic Algorithms for Multiple Sequence Alignment, basics of BLAST, Hidden Markov Models, Suffix Trees for Sequence Alignment, Evolutionary Trees.
- Literature Discussion and Useful Resources.
Bioinformatics Practice by Pedro Fernandes, Instituto Gulbenkian de Ciência.
Introduction: From Bioinformatics to Systems Biology by Luis M. Rocha, Los Alamos National Laboratory.
Wednesday, March 13th 2002
DNA Chip Technology by Michael Wall, Los Alamos National Laboratory.
- Experimental systems (What precisely is measured?): Biological samples, Chip technology
- Analysis methods: Clustering (binary & k-means), SVD, SVMs, Bidirectional "clustering" (Plaid, regulatory motifs)
- Establishing biological context (URS, TFs, functional annotation)
Regulatory network models with applications by Michael Wall, Los Alamos National Laboratory.
- Michaelis-Menten reaction kinetics and Drosophila development: Deterministic reaction kinetics modeling, et al. & Odell Application to stripe fixing
- S-systems and fundamental network studies: Savageau S-system framework & relation to reaction kinetics, Use of S-systems for model comparison & evaluation according to natural selection criteria
- Brief motivation for and intro to stochastic models
Thursday, March 14th 2002
Network Inference by Patrik D'Haeseleer, Harvard Medical School.
- Beyond Co-Expression: Gene Network Inference
- Combining expression and sequence data: searching for motifs in expression clusters (Tavazoie), linear contribution of motifs to expression levels (Bussemaker), combinations of motifs (Pilpel).
- Protein interaction networks: Y2H and other methods, tracing pathways through the protein network
Friday, March 15th 2002
Integrative Technology for Computational Biology by Luis M. Rocha, Los Alamos National Laboratory.
- Database Technology and Bioinformatics
- Semantic Annotation of Biology Data: From XML to Bio-Ontologies
- Information Retrieval: Vector Searches, Latent Databases, Natural Language Processing in Biology
- Extracting Functional Knowledge from Published Literature: 3 Examples, Collaborative Systems for Biology
- Literature Discussion and Useful Resources
Materials and References
References for Tuesday, March 12th 2002
Bioinformatics Overviews
Systems Science and Systems Biology
Microarray Data Analysis (SVD/PCA)
Dynamic Programming and Sequence Alignment
Similarity Matrices
FASTA algorithm and BLAST algorithm
Statistical Significance
Simulated Annealing
Genetic Algorithms
Kanehisa, M. [2000]. Post-Genome Informatics. Oxford University Press.
Waterman, M.S. [1995] Introduction to Computational Biology. Chapman and Hall.
Baldi, P. and S. Brunak [1998]. Bioinformatics: The Machine Learning Approach. MIT Press.
Wada, A. [2000]. Bioinformatics: the necessity of the quest for first principles in life. Bioinformatics. V. 16, pp. 663-664.
Altman, R.B. [1998]. A Curriculum for Bioinformatics: The Time is Ripe. Bioinformatics 14(7):549-550
Altman, R.B. [1998]. Bioinformatics in Support of Molecular Medicine. In C.G. Chute, Ed., 1998 AMIA Annual Symposium, Orlando, FL, 53-61. 1998.
Altman's Biomedical Informatics course
von Bertalanffy [1968]. General System Theory: Foundations, Development, Applications. New York, 1968.
Cariani, Peter [1989]. On The Design of Devices With Emergent Semantic Functions. PhD Dissertation. State University of New York at Binghamton.
Conrad, Michael [1983]. Adaptability. Plenum Press.
Kauffman, S. [1993]. The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press.
Klir, George, J. [1991]. Facets of Systems Science. Plenum Press.
Mesarovic, MD: (1968) "Auxiliary Functions and Constructive Specification of Gen. Sys.", Mathematical Systems Theory, v. 2:3
Pattee, Howard H. [1973]."The physical basis and origin of control." In: Hierarchy Theory: The Challenge of Complex Systems. H.H. Pattee (Ed.). George Braziller, pp.71-108.
Pattee, Howard H. [1978]."The complementary principle in biological and social structures." Journal of Social and Biological Structures. Vol. 1, pp. 191-200.
Pattee, Howard H. [1982]."Cell psychology: an evolutionary approach to the symbol-matter problem." Cognition and Brain Theory. Vol. 5, no. 4, pp. 191-200.
Rosen, R. [1969]."Hierarchical organization in automata-theoretic models of biological systems." In: Hierarchical structures. L.L. Whyte, A. Wilson, and D. Wilson (Eds.). Elsevier, pp. 179.
Rosen, Robert [1991]. Life Itself. Columbia University Press.
Rosen, Robert [1993]."Bionics revisited." In: The Machine as a Metaphor and Tool. H. Haken, A. Karlqvist, and U. Svedin (eds.). Springer-Verlag, pp. 87-100.
Kitano Symbiotic Systems Project
Literature on Microarray Data Analysis
Literature on Microarray Data Analysis II
Alter, O., P.O. Brown and D. Botstein [2000]."Singular value decomposition for genome-wide expression data processing and modeling." PNAS. Vol. 97, no. 18, pp. 10101-06.
Fellenberg K, et al [2001]. Correspondence analysis applied to microarray data. Proc Natl Acad Sci. 98(19):10781-6
Hastie, T. et al [2000]."'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns." Genome Biology. Vol. 1, no. 2, pp. 3.1-3.21.
Holter, N.S. et al [2000]."Fundamental patterns underlying gene expression profiles: Simplicity from complexity." PNAS. Vol. 97, no. 15, pp. 8409-14.
Raychaudhuri, S., J.M. Stuart and R.B. Altman [2000]."Principal components analysis to summarize microarray experiments: Application to sporulation data." http://cmgm.stanford.edu.
Wall, M., P.A. Dyck, and T. Brettin [2001]."SVDMAN -- Singular value decomposition analysis of microarray data." Bioinformatics. Vol. 17, no. 6, pp. 566-568.
Yeung K.Y. and W. L. Ruzzo [2001] Principal component analysis for clustering gene expression data. Bioinformatics Vol. 17 no. 9, pp. 763-774
Bellman, R.E. [1957] Dynamic Programming. Princeton University Press, Princeton
Bertsekas, D. [1995]. Dynamic Programming and Optimal Control. Athena Scientific.
Needleman, S. B. and Wunsch, C. D. [1970]. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-53.
Giegerich, R. [2000]. A systematic approach to dynamic programming in bioinformatics. Bioinformatics. V. 16, pp. 665-677.
Sankoff, D. [1972]. Matching sequences under deletion/insertion constraints. Proc. Natl. Acad. Sci. USA, 69, 4-6.
Sellers, P. H. [1974]. On the theory and computation of evolutionary distances. SIAM J. Appl. Math., 26, 787-793.
Sellers, P. H. [1980]. The theory and computation of evolutionary distances: pattern recognition. J. Algorithms, 1, 359-73.
Smith, T. F. and Waterman, M. S. [1981]. Identification of common molecular subsequences. J. Mol. Biol., 147, 195-7.
Goad, W. B. and Kanehisa, M. I. [1982]. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Res., 10, 247-63.
Scientific Computation (Gaston Gonnet)
Probabilistic Dynamic Programming and Multiple Alignments (Gaston Gonnet)
Pairwise Alignment via Dynamic Programming
Hardware Protein Database Search using Local Alignment (Smith-Waterman algorithm)
Dayhoff, M. O., Schwartz, R. M. and Orcutt, B.C. [1978]. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3 (ed. M. O. Dayhoff), pp. 345-52. National Biomedical Research Foundation, Washington, DC.
Henikoff, S. and Henikoff, J. G. [1992]. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915-19.
Wilbur, W.J. and Lipman, D.J. [1983]. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA, 80, 726-30.
Lipman, D.J. and Pearson, W.R. [1985]. Rapid and sensitive protein similarity searches. Science, 227, 1435-41.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D.J. [1990]. Basic local alignment search tool. J. Mol. Biol., 215, 403-10.
Altschul, S. F., Madden, T. L., Schaeffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. [1997]. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-402.
Karlin, S. and Altschul, S. F. [1990]. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87, 2264-8.
Pearson, W.R. [1995]. Comparison of methods for searching protein sequence databases. Protein Sci., 4, 1145-60.
Ishikawa, M. et al [1993]. Multiple sequence alignment by parallel simulated annealing. Comput. Appl. Biosci. 9, 267-73.
Bertsimas, D. and J. Tsitsiklis [1993]. Simulated Annealing. Statist. Sci. 8, 10-15.
Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi [1983]. Optimization by simulated annealing. Science. 220, 671-680.
Goldberg, D.E. [1989]. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
Holland, J.H. [1975]. Adaptation in Natural and Artificial Systems. University of Michigan Press.
Holland, J.H. [1995]. Hidden Order: How Adaptation Builds Complexity. Addison-Wesley.
Mitchell, Melanie [1996]. An Introduction to Genetic Algorithms. MIT Press.
References for Wednesday, March 13th 2002
Reviews of chip technology
Analysis methods
Regulatory Network Models
Supplemental reading
Supplemental books
Harrington, Rosenow & Retief. 2000. Curr Opin Microbiol 3:285
Gerhold, Rushmore & Caskey. 1999. TIBS 24:168
Making and reading microarrays for academia: Cheung et al. 1999. Nature Genetics Supplement 21:15
High-throughput assay of transcription-factor binding: Ren et al. 2000. Science 290:2306
Clustering: Sherlock. 2000. Curr. Opin. Immun. 12:201
Singular Value Decomposition: Holter et al. 2000. PNAS 97:8409
Support Vector Machines: Brown et al. 2000. PNAS 97:262
Plaid analysis: search www.google.com for "plaid gene expression"
Cell cycle modeling
Simple frog egg cell-cycle model: Tyson. 1991. Proc Natl Acad Sci USA 88:7328
Modern yeast cell cycle model: Chen et al. 2000. Mol Biol Cell 11:369
High-throughput identification of cell-cycle regulated genes: Spellman et al. 1998. Mol Biol Cell 9:3273
Current high-throughput yeast cell cycle transcription factor experiment: Simon et al. 2001. Cell 106:697
S-systems
Review of gene circuit design principles: Savageau. 2001. CHAOS 11:142
Stochastic models
Stochastic reaction kinetics modeling: Gibson & Bruck. 2000. J. Phys. Chem. A 104:1876 (read sections 1. and 2., through Gillespie's First Reaction Method)
Lambda Phage model: Arkin, Ross & McAdams. 1998. Genetics 149:1633
Misc. reviews of DNA chips: Nature Genetics Supplement Volume 21, January 1999
Equivalence of stochastic reaction kinetics to deterministic in the high N limit: Kurtz. 1972. J. Chem. Phys. 57:2976
Review of stochastic chemical reaction kinetics and continuous-time Markov chains: McQuarrie. 1967. J. App. Prob. 4:413
Web site on power law modeling (relevant to S-systems)
Web site source for PLAS software (for simulating S-systems)
Computational Modeling of Genetic and Biochemical Networks. Bower and Bolouri, eds. 2001. MIT Press, Cambridge, MA, USA (Many contributed chapters on biochemical networks topics from leading researchers)
Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists. E.O. Voit. 2000. Cambridge University Press, Cambridge, UK (Comprehensive hands-on explanation of S-systems modeling of biochemical networks; includes PLAS software written by A. Ferreira, University of Lisbon)
References for Thursday, March 14th 2002
Gene network inference
Linking regulatory motifs to expression levels
Protein networks
Genetic Network Inference: From Co-Expression Clustering to Reverse Engineering. D'haeseleer et al; Bioinformatics 16, 707-726 (2000)
Reconstructing Gene Networks from Large Scale Gene Expression Data. D'haeseleer, P.; Ph.D. dissertation, University of New Mexico (2000)
Systematic determination of genetic network architecture. Tavazoie et al; Nature Genetics 22, 281-285 (1999)
Regulatory element detection using correlation with expression. Bussemaker et al; Nature Genetics 27, 167-174 (2001)
Identifying regulatory networks by combinatorial analysis of promoter elements. Pilpel et al; Nature Genetics 29, 153-159 (2001)
A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Uetz et al; Nature 403, 623-627 (2000)
A comprehensive two-hybrid analysis to explore the yeast protein interactome. Ito et al; PNAS 98, 4569-4574 (2001)
Correlation between transcriptome and interactome data obtained from S. cerevisiae. Ge et al; Nature Genetics 29, 482-486 (2002)
References for Friday, March 15th 2002
Semantic Markup of Biological Data
SVD and Latent Semantic Analysis in IR
Knowledge Discovery from Publication Databases
Supporting Gene Expression Data Papers
Achard, F., G. Vaysseix, and E. Barillot [2001]. "XML, bioinformatics and data integration". Bioinformatics Vol. 17, no. 2, pp. 115-125.
Sowa's Definition of Ontology.
Robert Stevens' Bio-ontology Page.
Karp, P.D. [2000]. An ontology for biological function based on molecular interactions. Bioinformatics Vol. 16, no. 3, pp. 269-285.
EcoCyc: Encyclopedia of E. coli Genes and Metabolism.
Berry, M.W., S.T. Dumais, and G.W. O'Brien [1995]."Using linear algebra for intelligent information retrieval." SIAM Review. Vol. 37, no. 4, pp. 573-595.
Kannan, R. and S. Vempala [1999]."Real-time clustering and ranking of documents on the web." Unpublished Manuscript.
Landauer, T.K., P.W. Foltz, and D. Laham [1998]."Introduction to Latent Semantic Analysis." Discourse Processes. Vol. 25, pp. 259-284.
Masys et al [2001]. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics 17 (4), 319-326.
Jenssen et al [2001]. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 28 (1), 21-28.
Shatkay et al [2000]. Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis. ISMB 2000, AAAI Press, 317-328.
Rocha, L.M. [2001]. Integrative technology for bioinformatics. Los Alamos National Laboratory Technical Report. LAUR 01-6859.
Golub et al [1999]. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
Eisen et al [1998]. Cluster analysis and display of genome-wide expression patterns. PNAS 95 (25), 14863-14868.
Alter et al [2000]. Singular value decomposition for genome-wide expression data processing and modeling. PNAS 97 (18), 10101-10106.
Hastie et al [2000]. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1 (2), 0003.1-0003.21.