*By Deborah Stungis*

Expression array data from the functional genomics project at Los Alamos National Laboratory is being analysed with data-mining techniques in search of relevant expression correlations. The data takes the form of an expression array: a vector representation of the trait manifestation caused by particular genes under varied experimental conditions. We are asked to find maximal-size subsets of genes that are highly correlated.
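As a rough illustration of what "highly correlated" could mean for expression vectors, the sketch below computes pairwise correlations and keeps the pairs above a threshold. The Pearson coefficient, the 0.9 threshold, the gene names, and the expression values are all assumptions made for the example; the project has not fixed a correlation measure.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(var_x * var_y)

# Hypothetical expression array: one vector per gene, one entry per experiment.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 2.5],
}

def correlated_pairs(data, threshold=0.9):
    """Return gene pairs whose |correlation| meets the (assumed) threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(data), 2):
        r = pearson(data[g1], data[g2])
        if abs(r) >= threshold:  # strong anti-correlation is kept as well
            pairs.append((g1, g2, r))
    return pairs

for g1, g2, r in correlated_pairs(expression):
    print(f"{g1} ~ {g2}: r = {r:.3f}")
```

Pairwise correlation is only a starting point: finding maximal-size correlated subsets requires a further grouping step on top of these pairs.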

We are interested in techniques that will help to both identify genes that interact in significant ways
and ‘filter out’ those that do not. Proposed analysis methods so far include *Association Rule Mining* and
*Fuzzy Clustering*. Possible exploration of temporal data (i.e. tissue sampled over time) has also been
discussed, in which case *general systems problem solving (GSPS)* methods would be applied (a preliminary description of how GSPS could be used for this problem is available).

- An *Association Rule* is a conditional implication of the form A => B, where A is an item or subset of items from the data set, and B is a single item from the data set. In mining for these rules the user defines a threshold, called confidence, which the implication, measured in terms of a conditional probability, must exceed. The primary task here is to identify frequent itemsets, that is, sets of data points that occur together with at least a minimum frequency, called minimum support, as defined by the user. Once the frequent itemsets are identified, the association rules are formulated from among them. For example, an association rule would be a statement of the form “if gene expression is high within set A, then in at least 90% of those experiments the expression level of B will also be high”.
- *Fuzzy Clustering* has the advantage of offering degrees of compatibility, rather than declaring elements strictly related or unrelated as in classical clustering. Initially, we will work with the Fuzzy c-Means algorithm, where the user determines the number of clusters. The method calculates the designated number of cluster centers, or prototypes, and then assigns a measure of compatibility between each element of the data set and each cluster center. In this way, data points are members of all clusters, to some degree, rather than being assigned to only one. Identifying the genes that are moderately related to a number of prototypes may be as significant as identifying those that are strongly related to only one or two prototypes. Fuzzy clustering is a means of identifying both groups.
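The support and confidence thresholds described for association rules can be sketched on a toy example. The code assumes that expression levels have already been discretized, so that each experiment is a set of genes whose expression was "high"; the gene labels, the data, and both thresholds are hypothetical. Frequent itemsets are enumerated by brute force here, which is only workable for tiny inputs and is not the efficient algorithm of the cited literature.

```python
from itertools import combinations

# Hypothetical transactions: genes with "high" expression in each experiment.
experiments = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g1", "g2", "g3"},
    {"g2", "g3"},
    {"g1", "g2", "g3"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets meeting minimum support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

def association_rules(frequent, min_confidence):
    """Form rules A => B with a single-item consequent B."""
    rules = []
    for itemset, s in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in sorted(itemset):
            antecedent = itemset - {consequent}
            # confidence = P(B | A) = support(A and B) / support(A);
            # every subset of a frequent itemset is itself frequent.
            confidence = s / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((tuple(sorted(antecedent)), consequent, confidence))
    return rules

frequent = frequent_itemsets(experiments, min_support=0.5)
for antecedent, consequent, conf in association_rules(frequent, min_confidence=0.9):
    print(f"{set(antecedent)} => {consequent} (confidence {conf:.2f})")
```

Confidence is asymmetric: in this toy data g1 => g2 holds in every experiment where g1 is high, while g2 => g1 falls below the 0.9 threshold.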
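A minimal sketch of the standard Fuzzy c-Means iteration may help make the idea of graded membership concrete. It assumes Euclidean distance, a fuzzifier of m = 2, random initial memberships, and a fixed number of iterations rather than a convergence test; the two-dimensional profiles are made-up values, not project data.

```python
import random
from math import dist

def fuzzy_c_means(points, c, m=2.0, iterations=200, seed=0):
    """Return (centers, memberships) for c clusters with fuzzifier m > 1."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iterations):
        # Each center is the mean of all points, weighted by membership^m.
        centers = []
        for k in range(c):
            weights = [u[i][k] ** m for i in range(n)]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Memberships are proportional to distance^(-2 / (m - 1)).
        for i, p in enumerate(points):
            d = [dist(p, ctr) for ctr in centers]
            if min(d) == 0.0:
                # The point sits exactly on one or more centers.
                zeros = [dk == 0.0 for dk in d]
                u[i] = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                inv = [dk ** (-2.0 / (m - 1.0)) for dk in d]
                total = sum(inv)
                u[i] = [v / total for v in inv]
    return centers, u

# Hypothetical profiles: two tight groups and one point lying between them.
profiles = [
    (0.1, 0.0), (0.0, 0.9), (1.0, 0.2),
    (9.8, 10.1), (10.2, 11.0), (11.0, 9.9),
    (5.4, 5.3),
]

centers, memberships = fuzzy_c_means(profiles, c=2)
for point, row in zip(profiles, memberships):
    print(point, [round(v, 2) for v in row])
```

In this toy data the last point lies between the two groups, so it should receive comparable membership in both clusters. That is the kind of moderately related element a crisp clustering would be forced to assign to a single group.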

The domain of the problem data is still being defined. We are currently working with the functional genomics project to clarify the source of the data and how to represent it appropriately in the frameworks mentioned above.

Chen, J.J.W. et al. [1998]. "Profiling expression patterns and isolating differentially expressed genes by cDNA microarray system with colorimetry detection". Genomics, **51**, 313-324.

Mannila, H., H. Toivonen, and A. I. Verkamo [1994]. "Efficient algorithms for discovering association rules". In Knowledge Discovery in Databases (KDD'94), 181-192, Seattle, Washington, July 1994. AAAI Press.

Pickert, L., I. Reuter, F. Klawonn, and E. Wingender [1998]. "Transcription regulatory region analysis using signal detection and fuzzy clustering". Bioinformatics, **14**(3), 244-251.


Zaki, M. J. and M. Ogihara [1998]. "Theoretical Foundations of Association Rules". 3rd SIGMOD'98 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), pp 7:1-7:8, Seattle, WA, June 1998.