Information agents as organizers

George Cybenko, Robert Gray, Alexy Khrabrov and Yunxin Wu. Information agents as organizers. In Yannis Labrou and Tim Finin, editors, Proceedings of the CIKM Workshop on Intelligent Information Agents, Third International Conference on Information and Knowledge Management (CIKM 94), Gaithersburg, Maryland, December 1994.

* Abstract

Part of an information agent's task is to search a set of information resources for relevant documents and then present these documents to the user for consideration. The relevant documents should be organized in some way so that the user can easily browse the collection and understand the interrelationships among the documents. However, the document collection will not necessarily have any a priori structure, so the information agent must be prepared to automatically impose some structure on the collection. Here we consider the automatic generation of hyperlinks, automatic clustering, and methods for eliminating redundant information from the collection.

* Introduction

Part of an information agent's task is to accept a request, search a set of information resources for documents relevant to the request, and then present these relevant documents to the user for consideration (a document is defined as any set of data). The number of relevant documents could be large, in which case the documents should be organized in some way so that the user can easily browse the collection and understand the interrelationships among the documents. Unfortunately, the document collection will not necessarily have any useful a priori structure, especially since the collection is dynamic -- the contents of the collection depend on the user's request and on how the agent decides which information is relevant to that request. Therefore the agent must be prepared to automatically impose a structure on the collection. There are many ways that a document collection can be structured. We discuss the automatic generation of hyperlinks and automatic clustering, in which documents are grouped so that each group contains similar documents. In addition to the lack of organization, documents within the collection might repeat certain information over and over. The agent must be able to recognize and eliminate these redundancies. We present various methods for deciding when an entire document is redundant.

* Clustering

In the vector space model of text documents [SM83], points in a high-dimensional vector space represent documents, and the distances between the points are correlated with the similarities between the documents. Using cluster-analysis methods or vector quantization techniques [And73,Kun93], documents with vector representations can be grouped automatically and organized into a hierarchical cluster structure. This structure is a tree and is very useful for efficiently searching and browsing the document collection. For example, if the clusters are concentrated enough, the search for relevant (similar) documents can be sped up through tree traversal. The structure can be visualized in two-dimensional space as a tool for user retrieval and browsing. For each cluster we can compute a 2D map in which the objects are the child clusters of that cluster. The size and density of the clusters can be displayed by using different sizes and colors for the objects in the map. The density can be computed as the mean distance from the vectors in the cluster to the center of the cluster, divided by the size of the cluster. The distances between the objects in the 2D map should approximate the distances between the clusters in the high-dimensional space.
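
The grouping and density computation just described can be sketched in a few lines of Python. This is only an illustration on toy two-dimensional vectors with a plain k-means loop; the paper's implementation used vector quantization on high-dimensional LSI vectors, and the data below are invented:

```python
import math

def dist(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a set of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, centers, rounds=10):
    """Plain k-means: assign each vector to its nearest center,
    then recompute each center as the centroid of its group."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in vectors:
            i = min(range(len(centers)), key=lambda i: dist(v, centers[i]))
            groups[i].append(v)
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

def density(cluster, center):
    """Mean distance to the center divided by cluster size, as in the text."""
    return (sum(dist(v, center) for v in cluster) / len(cluster)) / len(cluster)

# Toy 2-D "document vectors" forming two obvious groups.
docs = [[0.0, 1.0], [0.0, 2.0], [5.0, 5.0], [6.0, 5.0]]
groups, centers = kmeans(docs, centers=[[0.0, 0.0], [5.0, 4.0]])
```

The density value computed per cluster is what would drive the size and color of each object in the 2D map.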

The automatic clustering of text documents and the visualization of the clustering can be based on latent semantic indexing analysis [D+90] and have been implemented using vector quantization and nonlinear mapping. An example is grouping personal e-mail letters according to the text of the letters rather than the headers. Letters on different subjects form distinct clusters. Clustering results for this and other document collections exhibit a common problem -- the clusters can be unbalanced. The size of each cluster can range from a few hundred documents down to a single document. This compromises the effectiveness of automatic clustering. There are methods for balanced clustering, but they introduce biases -- a large but dense cluster may be split, and two small clusters that are relatively far apart may be merged. There is a trade-off between balance and unbiasedness, and current work is focusing on minimizing this trade-off. Considering the concept of redundancy in document collections and the corresponding metric introduced in section 4 of this abstract, we find that naturally balanced cluster structures -- i.e., structures in which balance is not forced -- indicate good information collections in the sense that they cover the information subspace well with the least amount of overlap.

* Hyperlinks

Recent work at Dartmouth has focused on the automatic generation of hyperlinks in unstructured text documents. Here a hyperlink is defined as a link from a word or words in one document to relevant information in another document. Unstructured means that the text of the documents does not explicitly state potential links. In other words the documents do not contain phrases such as ``Refer to the article on communism for additional information''. Automatic means no manual intervention during or after the generation process. Our goal has been to develop a generic algorithm that automatically generates these hyperlinks and does not need knowledge specific to the topics discussed in the documents.

Our current approach is a text-analysis technique in which key phrases are identified in each document. Two cases are considered: in the first, each sentence is a key phrase; in the second, each proper noun phrase is a key phrase. A dictionary-lookup heuristic is used to identify proper noun phrases. A latent semantic indexing (LSI) package is used to determine which documents are relevant to each phrase. LSI was chosen since it is one of the best generic text-retrieval methods [D+90,Gup94]. The LSI package assigns a numerical similarity score to each document/phrase pair, and the highest-scoring pairs become the hyperlinks. This approach generates many useful links in a collection of TIME magazine articles but generates many nonsensical links as well.
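
The scoring step can be sketched with raw term-frequency vectors standing in for the LSI package (the documents, phrase, and threshold below are invented for illustration; LSI would score in a reduced latent space rather than on raw terms):

```python
import math
from collections import Counter

def tf(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link_phrases(phrases, docs, threshold=0.3):
    """Score every phrase/document pair; pairs above the threshold
    become hyperlinks, highest-scoring first."""
    links = []
    for p in phrases:
        pv = tf(p)
        for name, text in docs.items():
            score = cosine(pv, tf(text))
            if score >= threshold:
                links.append((p, name, round(score, 3)))
    return sorted(links, key=lambda t: -t[2])

docs = {"d1": "the cold war and communism in europe",
        "d2": "baseball season opens in the spring"}
links = link_phrases(["communism in europe"], docs)
```

The threshold plays the role of the cutoff that separates the highest-scoring pairs from the rest; choosing it poorly is one source of the nonsensical links mentioned above.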

An implementation challenge is that LSI gives odd results for short phrases due to its statistical nature. A hybrid text-retrieval scheme that uses both LSI and inverted indices [Gup94] is being incorporated into our approach to address this problem. The major conceptual challenges are that our approach ignores the context of the key phrases and does not impose any structure on the hyperlinks. Some key phrases are linked to documents that are relevant to the phrase only if the phrase is considered in isolation. Navigation through the hyperdocument collection -- although easier than reading through every document -- is confusing since there is no underlying structure to the links. Current work addressing these challenges includes applying the clustering methods discussed in the previous section; developing methods to automatically identify main and supporting documents and main and supporting sections within documents; incorporating postprocessing techniques to verify that there is sufficient similarity between two documents that are connected with a hyperlink; and exploring more natural ways to identify "key" phrases.

* Redundancy

Besides identifying relevant documents, another important task of an information agent is to prevent redundancy in the document collection. Finding a metric for redundancy is a hard problem. Using the vector space model for text processing, there is a convenient interpretation of redundancy. Notice that the concatenation of two documents does not generate any new information and corresponds to the addition of the document vectors. One can conclude that linear dependence among the document vectors represents information redundancy in the collection. Thus the problem of preventing redundancy can be solved by finding a linearly independent subset of the relevant document vectors.
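
This interpretation can be checked on a toy example: concatenating two documents sums their term-frequency vectors, and a rank computation exposes the dependence. The sketch below uses a simple Gram-Schmidt rank count rather than the SVD, and the vocabulary and vectors are invented:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def rank(vectors, tol=1e-9):
    """Numerical rank via Gram-Schmidt: a vector that leaves no residual
    after projecting out the basis built so far is linearly dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        n = norm(w)
        if n > tol:
            basis.append([x / n for x in w])
    return len(basis)

# Term-frequency vectors over a 3-word vocabulary.
d1 = [1.0, 0.0, 2.0]
d2 = [0.0, 1.0, 1.0]
concat = [a + b for a, b in zip(d1, d2)]  # concatenation of d1 and d2
```

Adding `concat` to the collection leaves the rank unchanged, which is exactly the sense in which the concatenated document is redundant.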

In practice, when the number of relevant documents that the agent finds is large (which will happen more and more often as information resources grow), we need to limit the number of documents presented in order to reduce user effort and the computational and storage complexity of the application. The same strategy can be used to find the subset of the relevant documents that best covers the query or domain topic; that is, we find the most independent subset of the document vectors. This is a numerical linear algebra problem that can be solved efficiently by singular value decomposition and QR factorization with column pivoting [GL90].
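
The selection step can be sketched as a greedy analogue of QR with column pivoting (a hand-rolled illustration, not the pivoted-QR routine of [GL90]): at each step take the vector whose residual, after projecting out the vectors already chosen, is largest, then deflate the remaining residuals against it. The vectors below are invented:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def most_independent(vectors, k):
    """Greedy column pivoting: pick the k vectors that are most
    independent of one another, returning their indices."""
    residuals = [list(v) for v in vectors]
    chosen = []
    for _ in range(k):
        # Pivot: the vector with the largest residual norm.
        pivot = max(range(len(residuals)), key=lambda j: norm(residuals[j]))
        n = norm(residuals[pivot])
        if n == 0:
            break  # everything left is linearly dependent on the chosen set
        q = [x / n for x in residuals[pivot]]
        chosen.append(pivot)
        # Deflate every residual against the newly chosen direction.
        for j in range(len(residuals)):
            proj = sum(x * y for x, y in zip(residuals[j], q))
            residuals[j] = [x - proj * y for x, y in zip(residuals[j], q)]
    return chosen

vectors = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
picked = most_independent(vectors, 2)
```

Here the first two vectors point the same way, so the greedy procedure picks one of them together with the orthogonal third vector, never both copies of the same direction.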

Another natural idea for finding the best representative subset of vectors is to find the vectors that are evenly located in the vector space. This can be done with cluster analysis. After clustering, the vector that is closest to the center of each cluster is used to represent the whole cluster. This method is advantageous when we have an existing clustered collection and we want to add new documents to the collection: we choose the documents whose vectors will form new clusters. The problem with this method is that it does not prevent linear dependence among the vectors. A hybrid method may help -- the center vector represents the whole cluster, and we choose the new vectors that are most independent of all the center vectors. Current work is focusing on refining these methods of handling redundancy and on developing new methods.
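
The closest-to-centroid representative can be sketched as follows (toy vectors; the real system works on high-dimensional LSI vectors):

```python
import math

def dist(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a set of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representative(cluster):
    """The member closest to the cluster centroid stands in for the cluster."""
    c = centroid(cluster)
    return min(cluster, key=lambda v: dist(v, c))
```

A hybrid scheme would then run an independence check (as in the previous section) on these representatives before admitting a new one.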

* Summary

An information agent must be prepared not only to retrieve relevant information but also to automatically organize that information. Here we have considered automatic clustering, the automatic generation of hyperlinks, and redundancy elimination. Many other organizational techniques have been, and still need to be, developed.

* Bibliography

[And73]

Michael R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.

[D+90]

Susan Deerwester et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[GL90]

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1990.

[Gup94]

Vishal Gupta. Dartmouth associative retrieval kernel (DARK). Master's thesis, Thayer School of Engineering, Dartmouth College, 1994.

[Kun93]

S. Y. Kung. Digital Neural Networks. Prentice-Hall, Englewood Cliffs, New Jersey, 1993.

[SM83]

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series. McGraw-Hill, New York, 1983.
