Functional Genomics Project: Analysis of Expression Arrays

Expression array data from the Functional Genomics Project at the Los Alamos National Laboratory is being analysed with data-mining techniques in search of relevant expression correlations. The data takes the form of an expression array, that is, a vector representation of trait manifestation caused by particular genes under varied experimental conditions. We are asked to find maximal-size subsets of genes that are highly correlated.
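As a rough illustration of the data layout described above, the sketch below stores one expression vector per gene (one entry per experiment) and lists the gene pairs whose Pearson correlation exceeds a threshold. The gene names, the values, the 0.9 threshold, and the choice of Pearson correlation are all assumptions made for the example; the project text does not specify a correlation measure, and finding maximal-size correlated subsets would need a further step beyond this pairwise check.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression array: one vector per gene, one entry per experiment.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}


def correlated_pairs(data, threshold=0.9):
    """Return gene pairs whose absolute correlation meets the threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(data), 2)
        if abs(pearson(data[g1], data[g2])) >= threshold
    ]


pairs = correlated_pairs(expression)
```

With this toy data only geneA and geneB track each other closely, so they form the single reported pair.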

We are interested in techniques that will help to both identify genes that interact in significant ways and ‘filter out’ those that do not. Proposed analysis methods so far include Association Rule Mining and Fuzzy Clustering. Possible exploration of temporal data (i.e. tissue sampled over time) has also been discussed, in which case general systems problem solving (GSPS) methods would be applied (a preliminary description of how GSPS could be used for this problem is available).

An Association Rule is a conditional implication of the form A => B, where A is an item or subset of items from the data set, and B is a single item from the data set. In mining for these rules the user defines a threshold on the rule's confidence, that is, on the conditional probability of B given A, which the implication must exceed. The primary task here is to identify frequent itemsets, that is, sets of items that occur together with at least a minimum frequency, called minimum support, as defined by the user. Once the frequent itemsets are identified, the association rules are formulated from among them. For example, an association rule would be a statement of the form “if gene expression is high within set A, then the expression level of B is also high in 90% of those experiments”.

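The two thresholds described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: each experiment is represented as the set of genes observed to be highly expressed (a hypothetical encoding), the gene labels and threshold values are invented, and the frequent itemsets are found by brute-force enumeration rather than by an optimised algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical data: for each experiment, the set of highly expressed genes.
experiments = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g3"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Map each itemset whose support meets min_support to its count."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            c = count(itemset, transactions)
            if c / n >= min_support:
                found[itemset] = c
    return found


def association_rules(transactions, min_support, min_confidence):
    """Return rules (A, B, confidence) with a single-item consequent B."""
    freq = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, c in freq.items():
        if len(itemset) < 2:
            continue
        for b in sorted(itemset):
            a = itemset - {b}
            # Every subset of a frequent itemset is frequent, so a is in freq.
            confidence = c / freq[a]
            if confidence >= min_confidence:
                rules.append((tuple(sorted(a)), b, confidence))
    return rules


rules = association_rules(experiments, min_support=0.5, min_confidence=0.7)
```

Here the confidence of A => B is computed as count(A and B) / count(A), which is the conditional probability of B given A estimated from the experiments.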
Fuzzy Clustering has the advantage of offering degrees of compatibility, rather than declaring elements strictly related or unrelated as in classical clustering. Initially, we will work with the Fuzzy c-Means algorithm, where the user determines the number of clusters. The method calculates the designated number of cluster centers, or prototypes, and then assigns a measure of compatibility between each element of the data set and each cluster center. In this way, data points are members of all clusters, to some degree, rather than being assigned to only one. Identifying the genes that are moderately related to a number of prototypes may be as significant as identifying those that are strongly related to only one or two prototypes. Fuzzy clustering is a means of identifying both groups.
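A minimal sketch of the Fuzzy c-Means iteration follows, written in plain Python for illustration. It assumes Euclidean distance, a fuzzifier of m = 2, and random initial memberships; the expression vectors are hypothetical, and none of these choices is taken from the project itself. Each pass recomputes the cluster centers as membership-weighted means and then recomputes each element's degree of compatibility with every center.

```python
import random
from math import dist


def fuzzy_c_means(points, c, m=2.0, iterations=100, tol=1e-6, seed=0):
    """Return (centers, memberships); each membership row sums to 1."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iterations):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)]
            )
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        new_u = []
        for p in points:
            d = [dist(p, ctr) for ctr in centers]
            if 0.0 in d:
                # A point sitting exactly on a center belongs fully to it.
                zeros = d.count(0.0)
                row = [1.0 / zeros if dj == 0.0 else 0.0 for dj in d]
            else:
                inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
                total = sum(inv)
                row = [v / total for v in inv]
            new_u.append(row)
        change = max(abs(a - b) for ra, rb in zip(u, new_u)
                     for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centers, u


# Hypothetical expression vectors: two tight groups and one in-between gene.
genes = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (2.6, 2.5),
]
centers, memberships = fuzzy_c_means(genes, c=2)
```

Each row of `memberships` holds one gene's degrees of compatibility with the two prototypes. A gene lying between the groups, like the last one here, receives a split membership instead of being forced into a single cluster.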

The domain of the problem data is still being defined. We are currently working with the functional genomics project to clarify the source of the data and how to represent it appropriately in the frameworks mentioned above.

Analysis of Expression Array Data for the Functional Genomics Project

Project Members

Srinivas Doddi (T-10), Luis Rocha (CIC-3), Deborah Stungis (CIC-3), and David Torney (T-10)
