Adaptive Recommendation Project Description

Modern library systems at universities and research institutes are prime examples of today's complex Distributed Information Systems (DIS): they serve large and diverse technical communities by providing access to an equally large and heterogeneous set of electronic information resources. As the complexity and size of both user communities and information resources grow, the fundamental limitations of traditional information retrieval systems have become evident. A recent LANL library user survey, conducted by Rick Luce, group leader of LANL's research library, revealed desired functionality that is unavailable in today's library:

There is no crossover of information: it can be very difficult for users to search across databases from different disciplines.
There is no "push" of information: recommendations from the system to its users about related topics that they may be unaware of are not issued.
There is no user profiling: the system does not remember user preferences or user-specific keyword categories.
There is a failure to work at the concept level: the system relies on fixed keywords, but does not infer categories of keywords used by its communities of users.

The sources of these limitations can be traced directly to a number of technical deficiencies of current DIS, in particular, that they are:

Passive: Information retrieval in DIS is generally unidirectional and query-based, and thus only able to respond to specific user requests. These systems can generally neither proactively generate information for users nor respond to queries in a user-specific fashion. Instead, users must know in advance what information they need, and then try to pull it from the environment.
Semantically Fixed: Semantic tags (keywords) must be provided explicitly by authors (or publishers, librarians, and indexers). The keywords in each document or database are bound to the "concept space" of these authors, which may be incoherent with the concept spaces of the users, or of the authors of other documents or databases.
Static: Once deployed to users, the knowledge in DIS remains fixed. Any indirect knowledge available through analysis of these structures, or implicit knowledge inherent in the patterns of information retrieval, cannot be exploited to enable push of user-specific content or to enhance semantic representations of content.
Isolated: Knowledge is represented in distinct formats on separate systems. Thus knowledge about the common properties of related domains or databases (available, for example, from an analysis of common structure or directly from users) cannot be exploited.

2. Existing Technologies for Recommendation Systems

New approaches for information retrieval have been proposed to address these limitations. These active recommendation systems, also known as Active Collaborative Filtering, Knowledge Mining, or Knowledge Self-Organization environments, rely on active computational environments that interact with and adapt to their users. They effectively push relevant information to users according to previous patterns of information retrieval or individual user profiling. Two broad approaches can be distinguished:

In content-based systems, user profiles are created based on the system's keywords. These establish a means of recommending documents to users according to their profiles and some kind of semantic metric that describes the relationships between keywords inferred from their association with common documents.
Collaborative systems do not involve any description of the semantics or content of documents, but rather issue recommendations according to a comparison of the profiles of several users that tend to access the same documents. These user profiles are not based on keywords, but on the actual documents retrieved.

Content-based systems depend on single user profiles, and thus cannot effectively recommend documents about previously unrequested content. Conversely, pure collaborative systems, with no content analysis, match only the profiles of users that (to a great extent) have requested the same exact documents; for instance, different book editions are considered distinct documents. It is clear that effective recommendation systems require aspects of both approaches.
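The contrast between the two approaches can be made concrete with a minimal sketch. Everything in it is an illustrative assumption: the access log, the keyword index, the function names, and the use of Jaccard overlap as the similarity measure are hypothetical and do not describe any existing system.

```python
from collections import Counter

# Hypothetical access log: user -> set of retrieved document ids.
ACCESS = {
    "ana": {"d1", "d2", "d3"},
    "bob": {"d2", "d3", "d4"},
    "cy": {"d5"},
}

# Hypothetical index: document -> set of keywords.
KEYWORDS = {
    "d1": {"fuzzy", "logic"},
    "d2": {"fuzzy", "retrieval"},
    "d3": {"retrieval", "indexing"},
    "d4": {"filtering", "retrieval"},
    "d5": {"genomics"},
    "d6": {"fuzzy", "indexing"},
}


def jaccard(a, b):
    """Overlap of two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collaborative(user):
    """Score unseen documents by the similarity of the users who retrieved them."""
    seen = ACCESS[user]
    scores = Counter()
    for other, docs in ACCESS.items():
        if other == user:
            continue
        weight = jaccard(seen, docs)
        if weight == 0:
            continue  # no shared documents, so no basis for comparison
        for doc in docs - seen:
            scores[doc] += weight
    return dict(scores)


def content_based(user):
    """Score unseen documents by keyword overlap with the user's own profile."""
    seen = ACCESS[user]
    profile = set().union(*(KEYWORDS[d] for d in seen))
    return {
        doc: jaccard(profile, kws)
        for doc, kws in KEYWORDS.items()
        if doc not in seen and profile & kws
    }
```

In this toy data, the collaborative scorer can only reach d4 for user "ana" (through the overlap with "bob"), and it returns nothing for "cy", who shares no documents with anyone. The content-based scorer also reaches d6, which no user has retrieved yet, but only because d6 shares keywords with the user's own history.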

3. Proposed Systems Development and Research

We propose developing and researching recommendation systems for LANL's Library Without Walls (LWW). These systems will be both collaborative and content-based, and will exploit currently untapped sources of information in DIS. In particular, they will integrate information from the patterns of usage of groups of users, and also categorize database content or semantics in a manner relevant to those groups. Moreover, we intend that the semantic tags and conceptual categories need not be just designed into these systems, but may also be induced and evolved from document content, user-supplied information, and group interaction.

Our overall aims are to deploy software applications within the LWW, and to use the LWW and its user community itself as an object of scientific study. These efforts will provide substantial benefits to the expanding needs of the library by responding to the specific issues revealed in the recent survey, in particular by:

Establishing crossover of different subject matter by enabling search across multiple, interdisciplinary databases.
Also establishing crossover among heterogeneous types of information resources (for example linking abstract indexes such as Inspec with deep-content sources such as e-journals).
Pushing recommendations of related topics that users may not have thought of.
Expanding the "semantic space" of the databases to a more conceptual level, including qualified keywords, keywords derived from the analysis of document content, and higher-level "conceptual" groupings reflecting collaborative usage patterns.
Deploying a more personalized human-machine interaction from a consolidated point of access.
Detecting patterns and relationships of information retrieval leading to the adaptation of the environment to the users.

These overall goals will be pursued in a modest first-year effort to demonstrate fundamental engineering capabilities and scientific results. Initially, an existing prototype of the TalkMine recommendation system will be developed and deployed, and the inherent semantic structure of LWW databases will be analyzed. Later work will see the analysis of customer satisfaction and experimental results, and lay the basis for expansion of these methods in following years.

3.1 TalkMine: Adaptive Recommendation on Multiple Databases

TalkMine is an adaptive recommendation system which is both content-based and collaborative, and further allows the crossover of information among multiple databases searched by users. In this way, different databases learn new keywords and adapt existing ones to the categories recognized by their communities of users. TalkMine is based on several theories of uncertainty, such as fuzzy set theory and Dempster-Shafer theory of evidence, as well as on biologically inspired adaptionist ideas.
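As background on the evidence theory named above, the following is a minimal sketch of Dempster's rule of combination, the standard way of merging two bodies of evidence in Dempster-Shafer theory. It is not taken from TalkMine; the function name, the topic labels, and the mass values are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: combine two mass functions over the same frame.

    Each mass function maps a frozenset of hypotheses to a mass; masses sum to 1.
    """
    raw = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            raw[common] = raw.get(common, 0.0) + x * y
        else:
            conflict += x * y  # product mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the surviving masses sum to 1 again.
    return {s: v / (1.0 - conflict) for s, v in raw.items()}


# Hypothetical example: two sources weigh in on the topic of one document.
PHYSICS, BIOLOGY = "physics", "biology"
EITHER = frozenset({PHYSICS, BIOLOGY})

index_evidence = {frozenset({PHYSICS}): 0.6, EITHER: 0.4}
usage_evidence = {frozenset({PHYSICS}): 0.5, frozenset({BIOLOGY}): 0.3, EITHER: 0.2}

combined = combine(index_evidence, usage_evidence)
```

Mass assigned to the whole frame (EITHER) expresses ignorance rather than a split vote, which is what distinguishes this representation from an ordinary probability distribution.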

Luis Rocha (CIC-3) has developed TalkMine as a fully functional prototype for Microsoft Windows computers. The architecture has both user-side and system-side components. Each user owns a browser (or plug-in to an existing Internet browser), which functions as a consolidated interface to all information resources searched. This individual browser stores user preferences and tracks information retrieval patterns and relationships which it utilizes to adapt to the user.

Whereas existing DIS are strictly unidirectional and query-based, TalkMine relies on an interactive, conversational, multi-directional exchange between user-side and system-side components. Each user's browser engages in an interactive algorithm with the information resources it queries. The first result of this interaction is a list of document and related-topic recommendations, issued according to the user's profile and present interests and integrating knowledge from the several information resources queried. The second result is that all sides exchange information, so that all parties can potentially learn new information in an adaptive fashion. Indeed, databases can learn new keywords from users and other databases, and will adapt the associations between keywords and documents according to the expectations of their users.

In this way TalkMine establishes an open-ended human-machine symbiosis, which can be used in the automatic, adaptive organization of knowledge in DIS such as library databases or the Internet, facilitating the rapid dissemination of relevant information and the discovery of new knowledge.

In particular, TalkMine offers:

A consolidated information access point.
An adaptive individual interface.
The capacity to search across multiple databases efficiently in the users' own keywords, and to achieve information crossover between these.
The capacity to push recommendations of new concepts and keywords that users may be unaware of.
The detection of patterns and relationships in the information retrieval of users leading to an adaptation of information resources to users.

3.2 Semantic Analysis and Extended Conceptual Spaces

The central data structure used in traditional DIS is a many-many mapping among a set of documents and a set of keywords which act as semantic "tags" on the content of the documents. The keywords are usually provided directly by the authors, or at best by secondary editors or librarians. In traditional DIS these keywords form the basis for query matching for information pull. TalkMine also uses keywords on documents, but through its adaptive, conversational approach provides an effective means for communities of users to explore multiple keyword spaces, thereby both pushing this semantic content to users and sharing it among databases.
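The many-many mapping and the query matching it supports can be sketched as follows. The document ids, keywords, and function names are hypothetical, chosen only to illustrate the traditional pull model described above.

```python
from collections import defaultdict

# Hypothetical records: document id -> keywords assigned by authors or indexers.
DOC_KEYWORDS = {
    "doc-a": {"fuzzy sets", "information retrieval"},
    "doc-b": {"information retrieval", "collaborative filtering"},
    "doc-c": {"fuzzy sets", "evidence theory"},
}


def invert(doc_keywords):
    """Build the reverse direction of the mapping: keyword -> documents."""
    index = defaultdict(set)
    for doc, keywords in doc_keywords.items():
        for kw in keywords:
            index[kw].add(doc)
    return dict(index)


def pull(index, query):
    """Traditional query matching: documents tagged with every query keyword."""
    hits = [index.get(kw, set()) for kw in query]
    return set.intersection(*hits) if hits else set()


KEYWORD_DOCS = invert(DOC_KEYWORDS)
```

A query succeeds only when the user happens to choose the same keywords as the indexer: `pull(KEYWORD_DOCS, ["semantic web"])` returns an empty set even if relevant documents exist under other terms.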

Beyond deploying TalkMine in its present form, we will also pursue research and development goals to use other sources of information in DIS beyond author-supplied keywords, and to augment the given keywords and document-keyword mappings to capture important information not available in current DIS. These efforts can not only serve future enhancements of TalkMine's capabilities, but will have general applicability to recommendation systems of many types.

There are many diverse sources of information available in DIS to enhance representations of semantic content. In addition to the actual contents of the texts themselves (including both the abstracts of indexing services such as SciSearch and Inspec, and the contents of e-journals), there are various sources of structural information about linkages among documents, among keywords, and between documents and keywords. Given such structural mappings, quantitative information is available to induce indirect connections among both documents and keywords. These mappings include:

The many-many document-keyword mapping itself.
The citations among documents.
The hierarchical structure of the DIS (text and keywords within documents, which are in turn within databases which are within the LWW corpus).
Finally, and perhaps most significantly, the patterns of access and traversal of these databases as users browse, search, and explore them.

We will first explore methods to instrument LWW systems in order to gather information of these types. Analysis of this information can then be useful in a number of ways. Methods can be deployed to accomplish the following, either separately or in combination:

To analyze document content in order to derive new keywords not provided by authors.
To produce extended keywords where the extent or degree to which the keyword belongs to a document is both represented and exploited.
To produce weighted graphical representations of the semantic space of the keywords (whether extended or unextended).
Finally, to analyze groupings of keywords to move towards higher-level conceptual representations of semantic content.
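To illustrate what extended keywords and a weighted graph of the keyword space might look like, here is a minimal sketch. The membership degrees are made up, and the edge weight (a fuzzy Jaccard ratio of summed minima to summed maxima) is one possible choice assumed for illustration, not a method prescribed by this proposal.

```python
from itertools import combinations

# Hypothetical extended keywords: document -> {keyword: degree of membership in [0, 1]}.
EXTENDED = {
    "doc-a": {"fuzzy sets": 1.0, "retrieval": 0.5},
    "doc-b": {"retrieval": 1.0, "filtering": 0.5},
    "doc-c": {"fuzzy sets": 0.5, "retrieval": 0.5, "filtering": 1.0},
}


def keyword_graph(extended):
    """Weighted keyword graph: edge weight = fuzzy Jaccard over all documents.

    For keywords k1 and k2, sum the smaller of their two degrees over every
    document and divide by the sum of the larger of their two degrees.
    """
    keywords = sorted({kw for degrees in extended.values() for kw in degrees})
    graph = {}
    for k1, k2 in combinations(keywords, 2):
        shared = sum(min(d.get(k1, 0.0), d.get(k2, 0.0)) for d in extended.values())
        total = sum(max(d.get(k1, 0.0), d.get(k2, 0.0)) for d in extended.values())
        if shared > 0:  # keep only keyword pairs that co-occur somewhere
            graph[(k1, k2)] = shared / total
    return graph


GRAPH = keyword_graph(EXTENDED)
```

Clusters of strongly connected keywords in such a graph are one candidate for the higher-level conceptual groupings mentioned in the last item of the list.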

We also wish to examine the role that existing conceptual maps can play in providing enhanced semantic linkages among documents and keywords. Such maps would be provided as external sources of semantic information, and are available from the research community as products of prior Artificial Intelligence and Information Science research efforts. These ontologies, such as WordNet from Princeton (see http://www.cogsci.princeton.edu/~wn), are effectively lexicons or thesauri which have been augmented to include taxonomies of semantic relations among terms. These structures can be brought to bear both to aid in the analysis of the LWW's existing semantic space, and to induce further semantic connections among DIS components.

4. Project Plan

The primary purposes for the FY99 effort will be to develop initial engineering capabilities and perform scientific analysis of the existing LWW systems and the results of an initial TalkMine deployment. In particular, we foresee the following specific goals, some of which may be pursued in parallel:

Definition of Problem Area and Survey of Current Technology: Outline of the problem area, technical concepts and issues, and survey of the state of the art of recommendation systems for DIS.
LWW System Assessment: Survey existing LWW systems and environments. Develop understanding of appropriate methods to integrate with LWW systems for database access, software development and deployment, data acquisition, and database extension.
Development of TalkMine Testbed: Plan and execute porting or rewriting of TalkMine prototype for execution in a (portion of the) LWW environment.
Additional Software Development: Develop additional software as needed to support TalkMine data analysis, instrumentation of LWW resources for capture of data for semantic analysis, and other purposes as appropriate.
Deployment of TalkMine Testbed: Identify user community, operate system, and gather and analyze data.
Data Acquisition for Semantic Analysis: Construct (hierarchical) keyword, citation, and/or user traversal maps.
Analysis of LWW Semantic Space: Construct new and extended keywords and concepts.

At this point it may be desirable to augment the TalkMine application to utilize results of the semantic analysis, for example, to incorporate the hierarchical relations among keywords. For other results, it may be appropriate to design additional user-accessible applications or extensions to existing applications. While these will require separate deployment, they may also serve to enhance TalkMine's effectiveness. For example, these methods may provide new keywords, or a new space of keywords at a new semantic level, in which TalkMine can operate.

5. Deliverables

Assessment Report: Paper assessing LWW systems and resources, and proposing a multi-year development strategy.
White Paper on Recommendation Systems for DIS: This paper will offer an outline of the problem area, technical concepts and issues, and a survey of the state of the art of recommendation systems for DIS.
TalkMine Testbed: Testbed implementation of TalkMine client-side only (server-side development dependent on additional software engineering support).
Interim Report: Description of accomplishments to date, analysis of TalkMine results, results of semantic analysis, refined proposal for future work.

6. Schedule

1Q 1999: Deliver Assessment Report and White Paper.
3Q 1999: Deliver TalkMine Testbed.
4Q 1999: Deliver Interim Report.

7. Staffing

Members of the proposed team have not only achieved significant scientific accomplishments in the theory of DIS and recommendation systems, but also have extensive experience in the design and development of software systems and in successful interaction with clients to realistically satisfy their needs.

Johan Bollen (GRA, CIC-3)
Cliff Joslyn (TSM, CIC-3)
Luis Rocha (TSM, CIC-3)

Many engineering issues will need to be addressed in this project, some of which may require assistance from Library staff resources. On the server side, we will need to understand the database protocols used within the LWW corpus, and how we can construct new tables which will record the structural information and implement aspects of the algorithms needed for our systems. For the TalkMine Testbed, we may also need assistance in developing new applications for data gathering and data analysis, in developing new or enhancing existing applications for user interaction, and in providing expert knowledge of the data content and user base for testing and analysis.

Luis Rocha, Cliff Joslyn, and Marianna Kantor

1. Problem Analysis