This is the abstract of a talk prepared for the Oeiras Mathematical and Computational Biology Workshop, June 20, 2003, Instituto Gulbenkian de Ciência.
Abstract: Processing DNA sequences with neural networks and other machine learning techniques requires representing the sequences as fixed-length codes. This talk presents a representation of individual positions in a DNA sequence by virtual potentials generated by the other bases of the same sequence. A virtual potential is a compact representation of the neighbourhood of a base. The distribution of the virtual potentials over the whole sequence can be used as a representation of the entire sequence (the SEQREP code). The code is flexible, its length is independent of the sequence length, and it does not require prior alignment.
The application of the SEQREP code will be illustrated with an example: Kohonen self-organizing maps (SOMs) were trained to classify sequences encoding the HIV-1 envelope glycoprotein (env) into subtypes A-G, using SEQREP codes as input.
Software for the calculation of SEQREP codes and their processing by Kohonen self-organizing maps will be demonstrated.
Possible areas of application of SEQREP codes include functional genomics, phylogenetic analysis, detection of repetitions, database retrieval, and automatic alignment.
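To make the idea of position-wise virtual potentials and their distribution more concrete, the following is a minimal sketch and not the SEQREP software itself. The abstract does not give the formula for the potential, so the sketch assumes that each other base contributes 1/distance to a potential tracked separately for A, C, G, and T. The function names (`virtual_potentials`, `seqrep_like_code`), the number of bins, and the `max_potential` cap are all hypothetical choices made for illustration.

```python
from typing import Dict, List

BASES = "ACGT"


def virtual_potentials(seq: str) -> List[Dict[str, float]]:
    """Return, for each position, one potential per base type.

    Assumed form (not taken from the abstract): every other base at
    position j adds 1/|i - j| to the potential of its own type at i.
    """
    seq = seq.upper()
    result = []
    for i in range(len(seq)):
        pot = {b: 0.0 for b in BASES}
        for j, base in enumerate(seq):
            # Skip the position itself and any non-ACGT symbol.
            if j != i and base in pot:
                pot[base] += 1.0 / abs(i - j)
        result.append(pot)
    return result


def seqrep_like_code(seq: str, n_bins: int = 8,
                     max_potential: float = 4.0) -> List[float]:
    """Summarise a sequence as a fixed-length vector.

    For each base type, the potentials over all positions are binned
    into a normalised histogram; the four histograms are concatenated.
    The output length is 4 * n_bins regardless of the sequence length.
    """
    pots = virtual_potentials(seq)
    total = len(pots) or 1  # avoid division by zero for an empty sequence
    code: List[float] = []
    for b in BASES:
        counts = [0] * n_bins
        for pot in pots:
            # Potentials above max_potential fall into the last bin.
            k = min(int(pot[b] / max_potential * n_bins), n_bins - 1)
            counts[k] += 1
        code.extend(c / total for c in counts)
    return code
```

Under these assumptions, `seqrep_like_code("ACGTACGT")` and `seqrep_like_code("ACGT" * 50)` both return vectors of 32 values, which is the property that lets sequences of different lengths be fed to a network with a fixed number of inputs without aligning them first. The double loop is quadratic in the sequence length, so it is only suitable for short illustrative inputs.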