The full research and development report can be downloaded in pdf format (LANL only).
In Part 1 we presented in detail the infrastructure we have developed for the Active Recommendation Project (ARP), which we deem essential for research and development on the Distributed Information Systems available to the Library Without Walls Project at the Los Alamos National Laboratory. Clearly, the infrastructure of part 1 can and should be used for research and development projects other than ARP. Below we identify the research and development program for ARP, while suggesting other possible projects.
The bulk of the first year has been dedicated to establishing the XML and Relational Repositories described in part 1. We are now in a position to use and profit from such an infrastructure.
We have calculated the semantic (section I.2.1) and structural (section I.2.2) proximity measures for the most frequent keywords and most cited documents. Given these sub-matrices of the overall relational repository, we developed a spreading activation utility to recommend frequent keywords and most cited documents to users. As we compute more proximity information, we will increase the size of the spreading activation matrices.
Spreading activation (SA) is an algorithm for searching information in associative networks such as those represented by the proximity matrices. Once a network with relational information is available, the SA algorithm essentially performs a distributed shortest-path search: distributed because it can start from several nodes of the network simultaneously and uses only local relational information.
For instance, given the network defined by the keyword semantic proximity matrix, that is, a network of keywords related by proximity values, SA can be used to discover keywords that are highly related to several initially selected keywords. The selected keywords receive an initial activation value, which then propagates along the links to associated keywords. The activation reaching a keyword is the sum of the weighted activations arriving through all its input links. From an activated keyword, activation diffuses further to the keywords linked to it, and so on. In this way, SA finds keywords related to all the keywords that are initially activated. For a definition of spreading activation in information retrieval see [Salton and Buckley, 1998]. The leading advantage of this technique is that it allows the retrieval of information not by textual keyword matching (as most search engines do), but rather by using relational information such as that stored in our relational repository. Clearly, the usefulness of SA depends heavily on the quality of the underlying networks. Therefore, we expect the current ARP spreading activation utilities to improve as more proximity information is computed, and even more so when adaptive user feedback is enabled from TalkMine and other adaptive techniques (see below).
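The propagation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the ARP utility itself: the function names, the `decay` factor, the fixed number of `steps`, and the toy keyword network are all assumptions introduced here, and the proximity values are invented for the example.

```python
from typing import Dict, List

# A network maps each keyword to its neighbours and their proximity values.
Network = Dict[str, Dict[str, float]]


def spread_activation(network: Network, seeds: List[str],
                      steps: int = 2, decay: float = 0.5) -> Dict[str, float]:
    """Accumulate activation spreading from the seed keywords.

    `steps` and `decay` are assumed parameters for this sketch.
    """
    frontier = {k: 0.0 for k in network}
    for seed in seeds:
        frontier[seed] = 1.0          # initial activation of selected keywords
    total = dict(frontier)

    for _ in range(steps):
        incoming = {k: 0.0 for k in network}
        for source, level in frontier.items():
            if level == 0.0:
                continue
            # Activation follows each link, weighted by its proximity value.
            for target, weight in network[source].items():
                incoming[target] += decay * weight * level
        # The activation reaching a keyword is the sum over its input links.
        for k in network:
            total[k] += incoming[k]
        frontier = incoming
    return total


def recommend(network: Network, seeds: List[str], top_n: int = 3,
              steps: int = 2, decay: float = 0.5) -> List[str]:
    """Rank non-seed keywords by accumulated activation."""
    scores = spread_activation(network, seeds, steps, decay)
    candidates = [k for k in scores if k not in seeds]
    return sorted(candidates, key=lambda k: scores[k], reverse=True)[:top_n]


# Hypothetical symmetric keyword proximity network (values are made up).
toy_network: Network = {
    "fuzzy": {"logic": 0.8, "control": 0.4},
    "logic": {"fuzzy": 0.8, "inference": 0.6},
    "control": {"fuzzy": 0.4, "robot": 0.7},
    "inference": {"logic": 0.6},
    "robot": {"control": 0.7},
}

if __name__ == "__main__":
    print(recommend(toy_network, ["fuzzy", "robot"]))
```

In this toy network, "control" is linked to both seed keywords, so it collects activation from two sources and ranks above "logic", which is linked to only one. That is the sense in which SA favors keywords related to all of the initially activated ones.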
The ARP spreading activation utilities can be accessed on the web at http://www.c3.lanl.gov/~rocha/lww/SA.html. In addition to the proximity networks of the relational repository, SA has also been applied to a network of ISSN Journal Titles organized according to an adaptive algorithm using archives of user interaction. This algorithm, due to Bollen and Heylighen [1996], was originally devised to organize hypertext networks according to user choices; it is based on a simple Hebbian rule, together with symmetry and transitivity constraints. It increases the proximity of elements in the network that users tend to associate, and tends to decrease the proximity of those they do not. It is similar to the adaptive components of TalkMine and will be used for further research and development in ARP.
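To make the adaptive idea concrete, the following Python sketch shows one way a Hebbian-style update with symmetry and transitivity reinforcement could look. It is an assumption-laden illustration, not the published algorithm or the ARP code: the bonus values, the function names, the journal labels, and the use of row normalization to lower the relative proximity of unreinforced links are all choices made here for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Weights = DefaultDict[Tuple[str, str], float]


def learn_from_path(weights: Weights, path: List[str],
                    frequency_bonus: float = 1.0,
                    symmetry_bonus: float = 0.5,
                    transitivity_bonus: float = 0.3) -> None:
    """Reinforce links along one user's sequence of visited items.

    The three bonus values are hypothetical, chosen only for illustration.
    """
    for i in range(len(path) - 1):
        a, b = path[i], path[i + 1]
        weights[(a, b)] += frequency_bonus       # Hebbian: used link grows
        weights[(b, a)] += symmetry_bonus        # symmetry: reverse link grows
        if i + 2 < len(path) and path[i + 2] != a:
            # transitivity: a -> b -> c also reinforces a -> c
            weights[(a, path[i + 2])] += transitivity_bonus


def normalize(weights: Weights) -> Dict[str, Dict[str, float]]:
    """Turn raw link weights into relative proximities per source item.

    Assumed here: normalizing each row makes unreinforced links lose
    relative proximity as other links from the same item are reinforced.
    """
    totals: DefaultDict[str, float] = defaultdict(float)
    for (a, _), w in weights.items():
        totals[a] += w
    proximity: Dict[str, Dict[str, float]] = {}
    for (a, b), w in weights.items():
        proximity.setdefault(a, {})[b] = w / totals[a]
    return proximity


if __name__ == "__main__":
    w: Weights = defaultdict(float)
    learn_from_path(w, ["Journal A", "Journal B", "Journal C"])
    learn_from_path(w, ["Journal A", "Journal C"])
    print(normalize(w)["Journal A"])
```

A network produced this way has the same shape as a proximity matrix, so it could in principle be handed to a spreading activation search in place of the keyword or citation proximities.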
The SA utilities accessible on the web provide a means for users to obtain:
Cliff Joslyn has performed an analysis of the citation network in the structural relational repository. This statistical analysis, available in a separate document (LANL Only), culminates in the following conclusions and recommendations:
We thus conclude that pre-1996 ISI data would enhance the citation analysis, but may conversely make this very analysis intractable while adding very little to the adaptive semantic algorithms. The best solution could be to add pre-1996 ISI data, but only for a narrower set of documents, for instance by adding only records of a given subject area or set of journals. We will keep on investigating these alternatives, while continuing our citation analysis efforts.
With the semantic relational information in place, we can now deploy a number of techniques leading to unique utilities. We can apply such techniques to the entire repository, or to subsets of it obtained via the request operators presented in part 1. We are implementing the former on the fast computer system (TURBO) available in CIC-3, and the latter on PCs, since the objective is to define algorithms capable of analyzing the results of user queries in real time on the users' own machines. The techniques we are now using are:
Both of these techniques will, first, allow us to characterize and understand the semantic information in the relational repository and, second, reduce the set of keywords and documents to the most important ones. With this reduction we can produce smaller matrices that retain the essential structure of the original information and can be used by retrieval algorithms such as spreading activation and TalkMine. In particular, we are now starting the following tasks:
Similarly, for the relational structural information we are now using the following techniques:
Likewise, we intend to use the results of these techniques to:
After completion of the efforts described in 2.1 and 2.2, we will tackle the following long-term research tasks and goals: