Research and Development

In Part 1 we presented in detail the infrastructure we have developed for the Active Recommendation Project (ARP), which we deem essential for research and development on the Distributed Information Systems available to the Library Without Walls Project at the Los Alamos National Laboratory. Clearly, the infrastructure of Part 1 can and should also be used for research and development projects other than ARP. Below we identify the research and development program for ARP, while suggesting other possible projects.

1. First Year

The bulk of the first year has been dedicated to establishing the XML and Relational Repositories described in Part 1. We are now in a position to use and profit from this infrastructure.

1.1 Spreading Activation on Subsets of the Relational Information

We have calculated the semantic (section I.2.1) and structural (section I.2.2) proximity measures for the most frequent keywords and most cited documents. Given these sub-matrices of the overall relational repository, we developed a spreading activation utility to recommend frequent keywords and most cited documents to users. As we compute more proximity information, we will increase the size of the spreading activation matrices.

Spreading activation (SA) is an algorithm for searching information in associative networks such as those represented by the proximity matrices. Once a network with relational information is available, the SA algorithm essentially pursues a distributed shortest-path search: distributed because it can start from several nodes of the network simultaneously and makes use of local relational information.

For instance, given the network defined by the keyword semantic proximity matrix, that is, a network of keywords related by proximity values, SA can be used to discover keywords that are highly related to several initially selected keywords. The selected keywords receive an initial activation value, which then follows the links to associated keywords. The activation reaching a keyword is the sum of the weighted activations coming through all its input links. From an activated keyword, activation diffuses further to the keywords linked to it, and so on. In this way, SA finds keywords related to all the keywords that are initially activated. For a definition of spreading activation in information retrieval see [Salton and Buckley, 1998]. The leading advantage of this technique is that it retrieves information not by textual keyword matching (as most search engines do), but by using relational information such as that developed in our relational repository. Clearly, the usefulness of SA depends heavily on the quality of the underlying networks. We therefore expect the current ARP spreading activation utilities to improve as more proximity information is computed, and even more so when adaptive user feedback is enabled from TalkMine and other adaptive techniques (see below).
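The diffusion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the ARP utilities: the `decay` factor, the fixed number of `steps`, the function names, and the toy keyword network are all hypothetical and do not come from the repository.

```python
def spread_activation(links, seeds, decay=0.5, steps=3):
    """Diffuse activation from seed nodes over a weighted network.

    links: dict mapping node -> {neighbor: proximity weight in [0, 1]}
    seeds: dict mapping node -> initial activation value
    decay and steps are illustrative assumptions, not ARP parameters.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        incoming = {}
        # Activation reaching a node is the sum of the weighted
        # activations arriving through all of its input links.
        for node, value in frontier.items():
            for neighbor, weight in links.get(node, {}).items():
                incoming[neighbor] = incoming.get(neighbor, 0.0) + decay * weight * value
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = incoming
    return activation


def recommend(links, seed_keywords, top=3, **options):
    """Rank the non-seed nodes by the activation they accumulate."""
    seeds = {keyword: 1.0 for keyword in seed_keywords}
    scores = spread_activation(links, seeds, **options)
    ranked = sorted((n for n in scores if n not in seeds), key=lambda n: -scores[n])
    return ranked[:top]


# Hypothetical keyword proximity network (symmetric weights).
links = {
    "fuzzy": {"logic": 0.8, "control": 0.4},
    "logic": {"fuzzy": 0.8, "inference": 0.6},
    "control": {"fuzzy": 0.4, "robotics": 0.7},
    "inference": {"logic": 0.6},
    "robotics": {"control": 0.7},
}

print(recommend(links, ["fuzzy", "robotics"]))
```

In this toy network, "control" is linked to both selected keywords, so it collects activation from both and ranks first, which is the behavior the paragraph above describes: keywords related to all the initially activated keywords are favored.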

The ARP spreading activation utilities can be accessed on the web at http://www.c3.lanl.gov/~rocha/lww/SA.html. In addition to the proximity networks of the relational repository, SA has also been applied to a network of ISSN Journal Titles organized according to an adaptive algorithm that uses archives of user interaction. This algorithm, due to Bollen and Heylighen [1996], was initially developed to organize hypertext networks according to user choices. It is based on a simple Hebbian rule, together with symmetry and transitivity constraints: it increases the proximity of elements in the network that users tend to associate, and tends to decrease the proximity of those they do not. It is similar to the adaptive components of TalkMine and will be used for further research and development in ARP.

The utilities currently provide the following recommendations:

Related keywords for searches in other systems, using the SA utility on the subset of the most frequent keywords and documents in the semantic proximity matrices (I.2.1).
Related documents, using the SA utility on the subset of the most cited documents in the structural proximity matrices (I.2.2).
Related Journals, using the SA utility on the ISSN journal list organized according to the adaptive algorithm of Bollen and Heylighen.
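To make the adaptive organization of the journal network more concrete, the following Python sketch shows one way a Hebbian rule with symmetry and transitivity terms could be written. It is a hedged illustration only: the reward values, the function names, the session format, and the row normalization used to lower the relative proximity of unassociated items are our assumptions, not the published parameters of Bollen and Heylighen.

```python
from collections import defaultdict


def new_network():
    """Empty weighted network: weights[a][b] is the raw link strength a -> b."""
    return defaultdict(lambda: defaultdict(float))


def adapt(weights, session, reward=1.0, symmetry=0.5, transitivity=0.3):
    """Update link strengths from one user session (an ordered list of items).

    The three bonus values are illustrative assumptions.
    """
    for i in range(len(session) - 1):
        a, b = session[i], session[i + 1]
        weights[a][b] += reward        # Hebbian term: a was followed by b
        weights[b][a] += symmetry      # symmetry term: reinforce the reverse link
        if i + 2 < len(session) and session[i + 2] != a:
            # transitivity term: a -> b -> c also suggests a -> c
            weights[a][session[i + 2]] += transitivity
    return weights


def proximities(weights):
    """Normalize each row to sum to 1, so that reinforcing one link
    lowers the relative proximity of the links that were not reinforced."""
    result = {}
    for a, row in weights.items():
        total = sum(row.values())
        result[a] = {b: w / total for b, w in row.items()} if total else {}
    return result


network = new_network()
adapt(network, ["J1", "J2", "J3"])   # hypothetical journal identifiers
adapt(network, ["J1", "J2"])
print(proximities(network)["J1"])
```

After the second session, the normalized proximity from J1 to J2 rises while that from J1 to J3 falls, even though the raw J1 to J3 weight was never reduced.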

1.2 Citation Examination

Cliff Joslyn has performed an analysis of the citation network in the structural relational repository. This statistical analysis, available in a separate document (LANL Only), culminates in the following conclusions and recommendations:

The current structural relational repository, based on source files from 1996-1999, is acceptable for applying citation analysis algorithms if we disregard cited documents published before 1994. This would result in a balanced distribution of root, internal, and leaf documents (see section I.2.2).
Nevertheless, a richer citation structure could be obtained by adding ISI records back to 1992.
However, this would increase the size of our repositories dramatically, possibly affecting storage and algorithm development.
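The effect of such a cutoff on the distribution of roots, internals, and leaves can be checked with a short script. The sketch below is an assumption-laden illustration: section I.2.2 defines these terms, and here we merely assume that a root cites but is not cited within the corpus, a leaf is cited but does not cite, and an internal document does both. The function names, the `year_of` lookup, and the sample records are hypothetical.

```python
def drop_old_cited(citations, year_of, cutoff=1994):
    """Discard citation links whose cited document predates the cutoff year.

    citations: list of (citing, cited) pairs
    year_of:   dict mapping document id -> publication year
    Documents with unknown year are kept (an assumption of this sketch).
    """
    return [(a, b) for a, b in citations if year_of.get(b, cutoff) >= cutoff]


def classify(citations):
    """Split documents into roots, internals, and leaves.

    Assumed convention: roots only cite, leaves are only cited,
    internals both cite and are cited.
    """
    citing = {a for a, _ in citations}
    cited = {b for _, b in citations}
    return {
        "roots": citing - cited,
        "internals": citing & cited,
        "leaves": cited - citing,
    }


# Hypothetical records, not taken from the repository.
citations = [("d3", "d2"), ("d2", "d1"), ("d3", "d0")]
year_of = {"d0": 1990, "d1": 1995, "d2": 1997, "d3": 1999}

groups = classify(drop_old_cited(citations, year_of))
print({name: sorted(docs) for name, docs in groups.items()})
```

Comparing the sizes of the three groups with and without the cutoff gives a quick view of how balanced the resulting citation structure is.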

We thus conclude that pre-1996 ISI data would enhance the citation analysis, but could also make this very analysis intractable while adding very little to the adaptive semantic algorithms. The best solution may be to add pre-1996 ISI data only for a narrower set of documents, for instance only the records of a given subject area or set of journals. We will continue to investigate these alternatives while pursuing our citation analysis efforts.

2. Second Year and Beyond

2.1 On Semantic Information

With the semantic relational information in place, we can now deploy a number of techniques leading to unique utilities. We can apply such techniques to the entire repository, or to subsets of it obtained via the request operators presented in Part 1. We are implementing the first on the fast computer system (TURBO) available in CIC-3, and the second on PCs, since the objective is to define algorithms capable of analyzing the results of user queries in real time on the users' own machines. The techniques we are now using are:

Latent Semantic Indexing (LSI). We produce lower-rank projections of A using LSI [Berry et al., 1995] with the random sampling algorithm of Kannan and Vempala [1999]. This technique allows the discovery of appropriate lower-rank projections and the production of an LSI database, which yields a clustering of keywords and documents according to cross-relevance.
Semantic Proximity Clustering. We cluster the matrices KSP and RSP using Singular Value Decomposition (SVD) [Drineas et al., 1999] and hierarchical clustering.
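For reference, the lower-rank projection underlying LSI can be written as a truncated SVD. The notation below is the standard one rather than anything specific to our implementation, and it assumes that A is the keyword-by-record matrix of Part 1, that k is the chosen rank, and that q is a query vector expressed over keywords:

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad k < \operatorname{rank}(A)

\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Here A_k is the best rank-k approximation of A in the Frobenius norm (the Eckart-Young theorem), and the second expression maps a query into the same k-dimensional space as the records, where it can be compared with them, for example by cosine similarity. The random sampling algorithm cited above approximates this decomposition and is not reproduced here.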

Both of these techniques will allow us, first, to characterize and understand the semantic information in the relational repository and, second, to reduce the number of important keywords and documents. With this reduction we can produce smaller matrices that retain the essential structure of the original information and that can be used by retrieval algorithms such as spreading activation and TalkMine. In particular, we are now starting the following tasks:

Use LSI on queries to the semantic relational repository intended for TalkMine. This will reduce the matrices used by the user-side TalkMine.
Implement the adaptive feedback from TalkMine and other adaptive algorithms back to the semantic relational repository.
Choose a semantically consistent subset of the semantic repository to analyze the success of our adaptive techniques and retrieval algorithms. This entails choosing an appropriate scientific sub-field and personnel capable of validating our results.

2.2 On Structural Information

Authoritative Sources. Kleinberg's [1998] analysis on C, P^in and P^out. Construct a hierarchically distinct directed sub-graph consisting only of hubs and authorities.
Structural Proximity Clustering. Cluster matrices P. We use SVD and hierarchical clustering.
Simple Structural Analysis. A more detailed study of the citation structure to supplement the current study (I.2.2). What are the descriptive statistics of the structural relational information: connectivity and component distribution, cycles, etc.?
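The hub and authority analysis mentioned above can be illustrated with the usual iterative scheme, in which a document's authority score sums the hub scores of the documents citing it, and its hub score sums the authority scores of the documents it cites. The sketch below is a minimal pure-Python version under our own assumptions: the edge-list input, the fixed iteration count, and the sample citations are hypothetical, and it operates on a plain citation list rather than on the matrices C, P^in and P^out.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(citations, iterations=50):
    """Hub and authority scores for a list of (citing, cited) pairs."""
    nodes = sorted({n for pair in citations for n in pair})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the citing documents.
        auth = {n: 0.0 for n in nodes}
        for citing, cited in citations:
            auth[cited] += hub[citing]
        auth = _normalize(auth)
        # Hub update: sum of authority scores of the cited documents.
        hub = {n: 0.0 for n in nodes}
        for citing, cited in citations:
            hub[citing] += auth[cited]
        hub = _normalize(hub)
    return hub, auth


# Hypothetical citation links: r1..r3 are citing records, a..c are cited papers.
citations = [("r1", "a"), ("r2", "a"), ("r3", "a"), ("r1", "b"), ("r2", "c")]
hub, auth = hits(citations)
print(max(auth, key=auth.get), sorted(hub, key=hub.get, reverse=True)[:2])
```

Keeping only the documents with the largest hub or authority scores, together with the links between them, gives one possible starting point for the hub/authority sub-graph.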

Investigate how adaptive algorithms can use structural information.
Investigate whether the special case of a Kleinbergian analysis of a partially ordered set such as the citation graph (rather than a general graph such as the Web) introduces interesting properties with respect to the resulting hub/authority structure. Is a source corpus of three years or less of sufficient depth to generate interesting Kleinbergian statistics? If so, generate them. Propose a user interface to show the user the position of the retrieved document in a "Kleinbergian conversation".

2.3 Long-Term Development

After completion of the efforts described in 2.1 and 2.2, we will tackle the following long-term research tasks and goals:

Add more information resources to the repositories. This could be the xxx physics archives or some other database.
Improve the "real-time" querying of the relational repositories from user side applications. This will require the porting of the repositories now residing on TURBO, to machines that can be accessed more easily from user applications. Scientific computation may still be performed on TURBO, but the interface to user-side applications needs to reside on another machine.
Develop and evaluate TalkMine and other adaptive algorithms on the interface created in the previous task, building on the results of the earlier tests on TURBO discussed in 2.1.
Expand and evaluate spreading activation prototypes currently in place with the results of the adaptive algorithms.
Algorithmic development to continue the study of the relational repositories. This includes several research problems:

Effects of Adaptation. As the adaptive algorithms change the initial relational repository, we need techniques to compare and evaluate the changes. We intend to use measures of the metricity of the semantic proximity information and track how they change with time. Comparing the metricity of clusters against that of the overall repository may allow us to track the formation of highly semantically related keywords and records, and to measure the emergence of "concept" clusters from user-driven adaptation. Besides the analysis of metricity, we intend to use Watts' small-world graphs for the same purpose.
Point and Cluster Dispersion. Do "interesting" regions in the structure (clusters or hub/authority nodes) map to similarly "interesting" regions in the semantic proximity matrices of the relational repository? For example, measures are available to determine whether tight clusters in the citation structure are highly dispersed in the semantics (indicating interdisciplinary working groups), or conversely. Tight semantic clusters (either induced by, or otherwise coherent with, the structure) are candidates to be the famous "conceptual level" semantic representations.
Multiple Semantic Spaces. Now assume a single structural space (e.g. citations) related to multiple semantic spaces (e.g. proximity on keywords and authors). How do the two relations (from citations to keywords and citations to authors) induce a new relation from keywords to authors? Now measures of dispersion in the semantic space can recognize truly interdisciplinary authors or groups of authors, in addition to citation threads.
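As an illustration of what a metricity measure for the "Effects of Adaptation" task could look like, the sketch below counts how often the triangle inequality fails among the items of a cluster. It is only one possible measure and rests on assumptions of ours: proximities are converted to distances with d = 1/s - 1, zero proximity is treated as infinite distance, and the function names and sample values are hypothetical.

```python
from itertools import permutations


def distance(proximity):
    """Assumed conversion from a proximity in [0, 1] to a distance."""
    return float("inf") if proximity == 0 else 1.0 / proximity - 1.0


def make_prox(table):
    """Symmetric proximity lookup built from a dict keyed by item pairs."""
    def prox(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return prox


def triangle_violation_rate(prox, items):
    """Fraction of ordered triples (i, j, k) with d(i,k) > d(i,j) + d(j,k).

    A rate of 0 means the derived distances behave like a metric on
    these items; higher rates indicate a less metric structure.
    """
    violations = total = 0
    for i, j, k in permutations(items, 3):
        total += 1
        direct = distance(prox(i, k))
        detour = distance(prox(i, j)) + distance(prox(j, k))
        if direct > detour + 1e-12:
            violations += 1
    return violations / total if total else 0.0


# Hypothetical proximities among three keywords.
table = {("x", "y"): 0.5, ("y", "z"): 0.5, ("x", "z"): 0.1}
print(triangle_violation_rate(make_prox(table), ["x", "y", "z"]))
```

Computing this rate for a cluster and for the whole repository at successive times would give one way to follow how user-driven adaptation changes the metric structure of the semantic proximity information.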

Part 2: Research and Development

From the Active Recommendation Project (ARP) R&D Report