Representation of DNA sequences by Virtual Potentials


Joao Aires de Sousa
Departamento de Quimica, Faculdade de Ciencias e Tecnologia, Universidade Nova de Lisboa

This is the abstract of a talk prepared for the Oeiras Mathematical and Computational Biology Workshop, June 20, 2003, Instituto Gulbenkian de Ciência.

Abstract: Processing of DNA sequences by neural networks and other machine learning techniques requires an appropriate representation of the sequences by fixed-length codes. This talk presents a representation of individual positions in DNA sequences by virtual potentials generated by the other bases of the same sequence. A virtual potential is a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can be used as a representation of the entire sequence (SEQREP code). It is a flexible code whose length is independent of the sequence length, and it does not require prior alignment of the sequences.
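
The idea of a fixed-length code built from per-position potentials can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the per-base "charges" in CHARGES, the Coulomb-like 1/distance decay, the neighbourhood window, and the histogram range and bin count. The function names virtual_potentials and seqrep_like_code are hypothetical and are not the SEQREP software.

```python
from typing import Dict, List

# Hypothetical per-base "charges"; the abstract does not specify them.
CHARGES: Dict[str, float] = {"A": 1.0, "C": -1.0, "G": 2.0, "T": -2.0}


def virtual_potentials(seq: str, window: int = 5) -> List[float]:
    """Potential at each position, generated by the other bases nearby.

    Assumption: each neighbour within `window` positions contributes
    charge / distance (a Coulomb-like form chosen for illustration).
    """
    seq = seq.upper()
    potentials = []
    for i in range(len(seq)):
        lo = max(0, i - window)
        hi = min(len(seq), i + window + 1)
        v = 0.0
        for j in range(lo, hi):
            if j != i:
                # Unknown symbols (e.g. N) contribute nothing.
                v += CHARGES.get(seq[j], 0.0) / abs(i - j)
        potentials.append(v)
    return potentials


def seqrep_like_code(seq: str, bins: int = 8, window: int = 5,
                     v_min: float = -10.0, v_max: float = 10.0) -> List[float]:
    """Fixed-length code: normalised histogram of the potentials.

    The code length equals `bins`, whatever the sequence length.
    """
    pots = virtual_potentials(seq, window)
    counts = [0] * bins
    width = (v_max - v_min) / bins
    for v in pots:
        k = int((v - v_min) / width)
        k = min(max(k, 0), bins - 1)  # clamp out-of-range values
        counts[k] += 1
    n = len(pots)
    return [c / n for c in counts] if n else [0.0] * bins
```

Under these assumptions, seqrep_like_code("ACGTACGTAC") and seqrep_like_code("ACGT" * 100) both return vectors of 8 values, which is the property that lets sequences of different lengths be fed to the same network without alignment.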

The application of SEQREP codes will be illustrated with an example: Kohonen self-organizing maps (SOMs) were trained to classify sequences encoding the HIV-1 envelope glycoprotein (env) into subtypes A-G, using SEQREP codes as input.
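
For readers unfamiliar with Kohonen maps, the following is a minimal sketch of SOM training on fixed-length input vectors. It is a generic textbook-style SOM, not the software used for the HIV-1 example; the grid size, the linearly decaying learning rate and neighbourhood radius, and the names train_som and best_matching_unit are all assumptions chosen for illustration.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def best_matching_unit(weights: List[List[Vector]], x: Vector) -> Tuple[int, int]:
    """Grid position of the unit whose weights are closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data: List[Vector], rows: int = 4, cols: int = 4,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0) -> List[List[Vector]]:
    """Train a rows x cols map; returns the grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the unit towards x, scaled by grid proximity.
                        w[k] += lr * h * (x[k] - w[k])
    return weights
```

One common way to use such a map for classification is to label each unit after training with the class of the training vectors that map to it, then assign a new vector the label of its best matching unit; whether the talk's example followed exactly this scheme is not stated in the abstract.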

Software for the calculation of SEQREP codes and their processing by Kohonen self-organizing maps will be demonstrated.

Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment.


For more information contact Luis Rocha at rocha@indiana.edu
Last Modified: June 16, 2003