Citation: Rocha, Luis M. [2001]."Identification of Interests, Trends and Dynamics in Document Networks". Los Alamos National Laboratory LDRD ER Research Proposal. Awarded for FY0204.
Also available in Adobe Acrobat (.pdf) format.
Abstract.
This proposal aims to investigate the hypothesis that the metric behavior of the distance functions defining these associative networks, can be used as an indicator of the relevance of collections of documents, the interests of users who have selected certain sets of documents, the trends in communities associated with sets of documents, as well the dynamics of such networks in general. The hypothesis itself is based on empirical evidence gathered and discussed in the proposal. We are requesting funds to gather more empirical evidence, investigate adequate formalisms, validate the hypothesis, and build a recommendation system that makes use of results obtained.
The success of this research would have a strong impact on information retrieval and knowledge management. If the hypothesis is correct, we would be able to predict trends in a given community by the automatic analysis of the documents they produce . We would also gain another technique to identify the relevance of documents and the interests of users, which would need to be compared to existing methodology. This impact would be felt on LANL’s current knowledge management initiatives, as well as externally, on advancing the study of DN in particular and social networks in general, as well as by producing better recommendation systems for the World Wide Web and digital libraries.
For each DN we can identify several distinct relations among documents and between documents and semantic tags used to classify documents appropriately. For instance, documents in the WWW are related via a hyperlink network, while documents in bibliographic databases are related by citation and collaboration networks [Newman, 2000]. Furthermore, documents can be related to semantic tags such as keywords used to describe their content. Although all the technology and the hypothesis here discussed would apply equally to any of these relations extracted from DN, let us exemplify the problem with the datasets we have created for the Active Recommendation Project (ARP) (http://arp.lanl.gov), part of the Library Without Walls Project, at the Research Library of the Los Alamos National Laboratory [Rocha and Bollen, 2000].
Frequency 
Keyword 
187705 
Cell 
150795 
studi 
149594 
system 
140738 
express 
127350 
protein 
124094 
model 
120215 
activ 
113740 
human 
112737 
rat 
112702 
patient 
To discern closeness between keywords according to the documents they classify, we compute the Keyword Semantic Proximity (KSP), obtained from A by the following formula:

cell 
studi 
system 
express 
protein 
model 
activ 
human 
rat 
patient 
cell 
1.000 
0.022 
0.019 
0.158 
0.084 
0.017 
0.085 
0.114 
0.068 
0.032 
studi 
0.022 
1.000 
0.029 
0.013 
0.017 
0.028 
0.020 
0.020 
0.020 
0.037 
system 
0.019 
0.029 
1.000 
0.020 
0.017 
0.046 
0.022 
0.014 
0.021 
0.014 
express 
0.158 
0.013 
0.020 
1.000 
0.126 
0.011 
0.071 
0.103 
0.078 
0.020 
protein 
0.084 
0.017 
0.017 
0.126 
1.000 
0.013 
0.070 
0.061 
0.041 
0.014 
model 
0.017 
0.028 
0.046 
0.011 
0.013 
1.000 
0.016 
0.016 
0.026 
0.005 
activ 
0.085 
0.020 
0.022 
0.071 
0.070 
0.016 
1.000 
0.058 
0.053 
0.021 
human 
0.114 
0.020 
0.014 
0.103 
0.061 
0.016 
0.058 
1.000 
0.029 
0.021 
rat 
0.068 
0.020 
0.021 
0.078 
0.041 
0.026 
0.053 
0.029 
1.000 
0.008 
patient 
0.032 
0.037 
0.014 
0.020 
0.014 
0.005 
0.021 
0.021 
0.008 
1.000 
From the inverse of KSP we obtain a distance function between keywords:
d is a distance function because it is a nonnegative, symmetric realvalued function such that d(k, k) = 0 [Shore and Sawyer, 1993]. This distance function indicates how far, semantically, a keyword is from another in the set of keywords. This way, it defines a weighted graph D whose nodes are all of the keywords extracted from a given DN, and the edges are the values of d. Clearly, many other types of distance functions can be defined on the elements of a DN. All of the observations and hypothesis below would apply equally, but naturally, the conclusions drawn cannot be separated by how well, and how appropriately for a given application, a distance function is capable of discerning the elements of the set it is applied to. Thus, different distance functions applied to citation structures or collaboration networks, will require distinct semantic considerations than those used for keyword sets.
d (eq. 2) is not an Euclidean metric because it may violate the triangle inequality: d(k_{1}, k_{2}) ≤ d(k_{1}, k_{3}) + d(k_{3}, k_{2}) for some keyword k_{3}. This means that the shortest distance between two keywords may not be the direct link but rather an indirect pathway. Such measures of distance are referred to as semimetrics [Galvin and Shore, 1991]. Indeed, given that most social and knowledgederived networks possess SmallWorld behavior [Watts, 1999], we expect that nodes which tend to be clustered in a local neighborhood of related nodes, have large distances to nodes in other clusters, but because of the existence of “gateway” nodes relating nodes in different clusters (the smallworld phenomenon), smaller indirect distances between nodes in distinct clusters, through these “gateway” nodes, are to be expected.
Most hypotheses are born out of anecdotal, often personal, evidence. The one put forward here is no exception. It arose from questioning what could one infer from the semimetric behavior of the distance functions calculated from DN. Given a distance function, what can we say about a pair of highly semimetric elements from a finite set? And what can we say about the set, from the pairs of highly semimetric pairs it contains?
To construct an intuition to answer these questions, one needs to deal with very familiar examples. In this case, the author could think of no DN more familiar than the set of books cited by his own dissertation [Rocha, 1997]! A database similar (but much smaller) to the one used by ARP contains the relevant information. This database contains about 150 books, each indexed by the respective Library of Congress Keywords, for example:
Kearfott, R. Baker and Vladik Kreinovich (Editors). [1996]. Applications of Interval Computations. Kluwer. Keywords: Optimization algorithms, Fuzzy logic, Uncertainty, Mathematics, Reliable Computation, Interval Computation.

Adaptive Systems 
Evolution 
Modeling systems 
Complex Systems 
Social Systems 
Adaptive Systems 
0.00 
3.89 
12.00 
10.33 
16.00 
Evolution 
3.89 
0.00 
21.50 
4.22 
35.00 
Modeling Systems 
12.00 
21.50 
0.00 
5.75 
10.00 
Complex Systems 
10.33 
4.22 
5.75 
0.00 
19.00 
Social Systems 
16.00 
35.00 
10.00 
19.00 
0.00 
From this database, 86 keywords are extracted. A distance function d is calculated according to eq. 2. Table III shows the values of d for 5 of the keywords. One needs to note that this distance function is obtained from the relations extracted from a particular set of documents (in this case 150 books). Therefore, one should not expect these values to represent a universally accepted thesaurus or the associations one would anticipate from common sense knowledge. Indeed, this kind of distance is used to characterize particular information resources and users from the documents they contain or retrieve [Rocha, 2001]. In this case, the associative distances between keywords denote the way the dissertation set of books is related.

Adaptive Systems 
Evolution 
Modeling systems 
Complex Systems 
Social Systems 
Adaptive Systems 
0.00 
3.89 
12.00 
8.11 
16.00 
Evolution 
3.89 
0.00 
9.97 
4.22 
19.89 
Modeling Systems 
12.00 
9.97 
0.00 
5.75 
10.00 
Complex Systems 
8.11 
4.22 
5.75 
0.00 
15.75 
Social Systems 
16.00 
19.89 
10.00 
15.75 
0.00 
Clearly, semimetric behavior is a question of degree. For some pairs of keywords, the indirect distance provides a much shorter shortcut, a larger reduction of distance, than for others. One way to capture this property of pairs of semimetric keywords is to compute a semimetric ratio:
s is positive and ≥ 1 for semimetric pairs. In our example, s(Modeling Systems,Evolution) = 21.5/9.97 =2.157. This ratio is important to discover semimetric behavior necessary for our hypothesis as discussed below, but given that larger graphs tend to show a much larger spread of distance, s tends to increase with the number of keywords. Therefore, to be able to compare semimetric behavior between different DN and their respective different sets of keywords, a relative semimetric ratio is also used:
rs compares the semimetric distance reduction to the maximum possible distance reduction in graph D. d_{max} is the largest distance in the graph, and d_{min} = 0 is the shortest distance.
Often, the direct distance between two keywords is ∞ because they do not index any documents in common. As a result, in closed graphs, s and rs are also ∞ for these cases. Thus, s and rs are not capable of discerning the degree of semimetric behavior for pairs that do not have a finite direct distance. To detect relevant instances of this infinite semimetric reduction, we define the below average ratio:
where represents the average direct distance from k_{i} to all k_{j} such that d_{direct}(k_{\i}, k_{j}) ≥ 0. b measures how much an indirect distance falls below the average distance of all keywords directly associated with a keyword. Of course, b can also be applied to pairs with finite semimetric reduction.
The semimetric ratios were applied to graph D of the dissertation database, and the semimetric pairs with higher ratios were identified. Table V lists the top 5 pairs for semimetric ratio s. If we rank pairs for the relative semimetric ratio rs, there is a slight ordering of the top as the pair EvolutionDNA drops to rank 11 and the pair LifeCognition to 6^{th}, while the pair EvolutionControl rises to rank 3 (from 6^{th}) and the pair EvolutionInformation Theory rises to 5^{th} (from 20^{th}).
What is most interesting about these results is that these pairs denote the original contributions that were offered by the dissertation! Indeed, the dissertation was about using ideas and methodologies from Complex Adaptive Systems, Evolutionary Systems, and Artificial Life and apply them to Artificial Intelligence and Cognitive Science. In particular, the mathematical models (from Psychology) of cognitive categories were expanded using evolutionary ideas, by drawing an analogy with the symbolic characteristics of DNA. Furthermore, this framework was named Evolutionary Constructivism, a term that did not exist previously, but draws both from Evolutionary Theory and the Philosophy of Constructivism in Cognitive Science and Systems Theory.
To understand these results, we need to remember that the distance function d is derived from the finite set of books used in the dissertation. A high degree of semimetricity for a keyword pair means that very few of the books in the database are simultaneously indexed by these two keywords, but that there exists a strong indirect association between these keywords via some indirect path whose short distances require the existence of many related books for each keywords pair in the pathway. Thus, a keyword pair with high semimetric behavior, implies an association that is a property of the specific collection of documents, but not one identifiable in many included documents, and rather constructed from an indirect series of strongly related documents. In other words, the highly semimetric pairs represent associations that “were begging to be made, given this specific collection of documents. Indeed, the two pairs (EvolutionControl and EvolutionInformation Theory) ranked in the top 5 for the relative semimetric ratio, identify two associations that are certainly implied by the collection of books (given its large subsets of Cybernetics and Information Theory books), but which were not dealt with in this dissertation – offering some topics for other dissertations!
The below average ratio b, was also used to identify keyword pairs with infinite semimetric reduction. Those are the pairs that do not index simultaneously a single book in the collection, but which are nonetheless indirectly strongly related. For this particular dataset, the pairs with highest values of b, did not seem to produce meaningful results. This could be because, being a small, tight collection of books, the relevant associations implied by the collections are already made at least by a small set of books producing a large, but finite, distance. Such situation is not expected to occur in larger, multiauthored collections unlike the dissertation one.
The anecdotal analysis of the author’s dissertation database served the purpose of creating an intuition of what semimetric behavior may mean for DN, but to even build a hypothesis, other more “subjectindependent” datasets need to be studied.
The same semimetric behavior ratios were used to study the ARP dataset describe above. The distance function d (eq. 2) was calculated for the set of the 500 most common keywords, and the semimetric ratios (formulas 3 to 5) were calculated for all keyword pairs. Table VI shows the top 5 keyword pairs ranked by highest values of s.
To analyze these results, again, one must remember the original collection of documents, in this case, all the scientific articles published in journals indexed by ISI in SciSearch between the years of 1996 to 1999. A keyword pair with high semimetricity, implies that while very few articles discuss the two topics together, a very large series of articles exists which creates an indirect pathway between these two keywords in D. To obtain a large semimetric ratio, it is necessary that each link in the indirect pathway be defined by short distances, which in turn require the existence of many articles associated to both keywords in the link. Thus, a highly semimetric keyword association implies that very few documents make that association, but that there are large sets of documents indirectly supporting it. In this sense, the existence of such support (particularly in scientific databases) may identify a trend that can be expected to be picked up.
The hypothesis generated from the anecdotal evidence described above, is that high semimetric keyword associations, discovered in the distance function of DN defined by eq. 2, can (1) capture the interests of a person associated with a given small collection of documents, and (2) be used to identify trends in large, multiauthored document collections.
Clearly, much more evidence is needed to support this hypothesis. We have already applied this study to other cases such as random distance graphs (very small relative semimetric ratios), distance graphs built from word association norms used in psychological tests (very little semimetric behavior), distance graphs built from website collective usage [Bollen et al, 1999] (similar results to section 5), etc. In addition to a more thorough analysis of the semimetric behavior of different distance functions applied to different DN, we propose to study this hypothesis by (1) creating an experimental database interface to evaluate how well the interests of users are captured by semimetricity, and (2) studying the trend dynamics in large collections of documents such as the ARP database.
LANL’s research library is currently offering an ideal system to study how well semimetric keyword associations capture the interests of a single person. The MyLibrary web portal (http://mylibrary.lanl.gov) allows LANL users to store links to publications in LANL’s several digital libraries with scientific articles, as well as links elsewhere on the WWW. We obtained permission from the research library to develop a recommendation system [Rocha, 2001] to be added to MyLibrary.
We propose to issue recommendations to users of MyLibrary from pairs of highly semimetric keywords. We will use the distance function defined by eq. 2 on the keywords extracted from the documents stored by each user of the MyLibrary system. Recommendations will be issued in the form of documents that users may be unaware of but which directly associate the semimetric keyword pairs, and directly as keywords. We will collect data from user behavior to validate how useful the recommendations are, thus gathering evidence for the first aspect of the semimetric hypothesis.
If high semimetricity is indeed an indicator of trends, then, by analyzing how semimetric behavior changes in time in a DN we should be able to distinguish real trends from indirect associations never picked up by authors of documents. As a trend develops in time, we can expect the semimetric ratios to lower, as more and more documents are added which directly associate the previously highly semimetric keyword pairs. For instance, if the high semimetricity pairs of Bioinformatics keywords related to Gene Expression Arrays identified in 6 do indeed imply a trend phenomenon, we should expect to see lower values of semimetricity for these concepts in those articles published in 2000 and beyond.
Thus, we propose to collect the SciSearch articles for years 2000 and beyond, and use them as a validation set for the trends picked in the ARP database. We will also conduct the semimetric analysis year by year, rather than the whole interval from 1996 to 1999, as well as for much more than the top 500 most frequent keywords. A more detailed study such as this, will allow us to capture the change in semimetric behavior of a large set of keywords through the years, thus gathering evidence for the second aspect of the semimetric hypothesis to study the dynamics of trends in DN.
We intend to tackle this proposal with the following chronological milestones:
In the first year we expect to finalize 1 and 2, and start 3 which will be ongoing through the duration of the project. In the second year we will tackle 4, 5, and 6. In the third year we will continue 6, and complete 7 and 8.
Bollen, J., H. Vandesompel, L.M. Rocha [1999]."Mining associative relations from website logs and their application to contexdependent retrieval using Spreading Activation." In: Workshop on Organizing Web Space (WOWS), ACM Digital Libraries 99, August 1999, Berkeley, California.
Galvin, F. and S.D. Shore [1991]."Distance functions and topologies." The American Mathematical Monthly. Vol. 98, No. 7, pp. 620623.
Klir, G.J. and B. Yuan [1995]. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall.
Miyamoto, S. [1990]. Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer.
Newman, MJ [2000]."The structure of scientific collaboration networks." Proc.Nat.Acad.Sci. No. 98, pp. 404409.
Rocha, Luis M. [1997]. Evidence Sets and Contextual Genetic Algorithms: Exploring Uncertainty, Context and Embodiment in Cognitive and biological Systems. PhD. Dissertation. State University of New York at Binghamton. UMI Microform 9734528. (http://www.c3.lanl.gov/~rocha/dissert.html)
Rocha, Luis M. [2001]."TalkMine: A Soft Computing Approach to Adaptive Knowledge Recommendation." In: Soft Computing Agents: New Trends for Designing Autonomous Systems. V. Loia and S. Sessa (Eds.). SpringerVerlag. In press. (http://www.c3.lanl.gov/~rocha/softrec.html)
Rocha, Luis M. and Johan Bollen [2000]."Biologically motivated distributed designs for adaptive knowledge management." In: Design Principles for the Immune System and Other Distributed Autonomous Systems. Cohen I. And L. Segel (Eds.). Santa Fe Institute Series in the Sciences of Complexity. Oxford University Press. In Press. (http://www.c3.lanl.gov/~rocha/SFI99.html)
Shore, SD, Sawyer LJ [1993]."Explicit Metrization." Annals of the New York Academy of Sciences. v. 704 pp. 328336 1993.
Watts, D. [1999]. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press.