Scatter/Gather: A Cluster-based Approach to
Browsing Large Document Collections

Douglass R. Cutting tex2html_wrap_inline445 - David R. Karger tex2html_wrap_inline447 - Jan O. Pedersen tex2html_wrap_inline445 - John W. Tukey tex2html_wrap_inline451

[1]Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304 [2]Stanford University [3]Princeton University

In Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pages 318-329, June 1992. Also available as Xerox PARC technical report SSL-92-02.

© Copyright 1992 by ACM, Inc.

Abstract:

Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.

We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

Introduction

Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure (see, for example, [8, 13, 2]). In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available.

These strategies are essentially variations of near-neighbor searchgif where nearness is defined in terms of the pairwise document similarity measure used to generate the clustering. Indeed, cluster search techniques are typically compared to direct near-neighbor search [9], and are evaluated in terms of precision and recall. Various studies indicate that cluster search strategies are not markedly superior to near-neighbor search, and, in some situations, can be inferior (see, for example, [6, 12, 4]). Furthermore, document clustering algorithms are often slow, with quadratic running times. It is therefore unsurprising that cluster search, with its indifferent performance, has not gained wide popularity.

Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in that possibility [1].

In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how clustering can be effective as an access method in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques.

To implement Scatter/Gather, fast document clustering is a necessity. We introduce two new near linear time clustering algorithms which experimentation has shown to be effective, and also discuss reasons for their effectiveness.

Browsing vs Search

The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query precisely. For example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit himself to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to discover the general information content of the corpus. Access to a document collection in fact covers an entire spectrum: at one end is a narrowly specified search for a particular document, given something as specific as its title; at the other end is a browsing session with no well defined goal, satisfying a need to learn more about the document collection. It is common for a session to move across the spectrum, from browsing to search: the user starts with a partially defined goal which is refined as he finds out more about the document collection. Standard information access techniques tend to emphasize the search end of the spectrum. A glaring example of this emphasis is cluster search, where clustering, a technology capable of topic extraction, is submerged from view and used only to assist near-neighbor search.

We propose an alternative application for clustering in information access, taking our inspiration from the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults the index, which directs one to passages of interest. However, if one is simply interested in gaining an overview, or has a general question, one peruses the table of contents, which lays out the logical structure of the text. The table of contents gives a sense of what sort of questions might be answered by a more intensive examination of the text, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index.

By direct analogy, we propose an information access system with two components: our browsing method, Scatter/Gather, which uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents; and one or more word-based, directed, text search methods, such as near-neighbor search or snippet search [7]. The browsing component describes groups of similar documents, one or more of which can be selected for further examination. This can be iterated until the user is directly viewing individual documents. Based on documents found in this process, or on terms used to describe document groups, the user may, at any time, switch to a more focused search method. In particular, we anticipate that the browsing tool will not necessarily be used to find particular documents, but may instead help the user formulate a search request, which will then be serviced by some other means. Scatter/Gather may also be used to organize the results of word-based queries that retrieve too many documents.

Scatter/Gather Browsing

In the basic iteration of the proposed browsing method, the user is presented with short summaries of a small number of document groups.

Initially the system scatters the collection into a small number of document groups, or clusters, and presents short summaries of them to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together to form a subcollection. The system then applies clustering again to scatter the new subcollection into a small number of document groups, which are again presented to the user. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.

 

figure32


Figure: Illustration of Scatter/Gather 

An Illustration

  We now describe a Scatter/Gather session, where the text collection consists of about 5000 articles posted to the New York Times News Service during the month of August 1990. This session is summarized in figure 1. Here, to simplify the figure, we manually assigned single-word labels based on the full cluster descriptions. The full session is provided as Appendix A.

Suppose the user wants to find out what happened that month. Several issues prevent the application of conventional search techniques:

With Scatter/Gather, rather than being forced to provide terms, the user is presented with a set of clusters, an outline of the corpus. She need only select those clusters which seem potentially relevant to the topic of interest. In the example, the big stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads the user to focus on international issues: she selects the `Kuwait' and `Germany' and `Oil' clusters. These three clusters are gathered together.

This reduced corpus is then reclustered on the fly to produce eight new clusters covering the reduced corpus. Since the reduced corpus contains a subset of the articles, these new clusters reveal a finer level of detail than the original eight. The articles on the Iraqi invasion and some of the `Oil' articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and one which the user deduces is about hostages in Kuwait.

The user feels her understanding of these large stories is adequate, but wishes to find out what happened in other corners of the world. She selects the `Pakistan' cluster, which also contains other foreign political stories, and a cluster containing articles about Africa. This reveals a number of specific international situations as well as a small collection of miscellaneous international articles. The user thus learns of a coup in Pakistan, and about hostages being taken in Trinidad, stories otherwise lost among the major stories of that month.

Requirements

Scatter/Gather depends on the existence of two facilities. First, since clustering and reclustering is an essential part of the basic iteration, we need an algorithm which can appropriately cluster a large number of documents within a time tolerable for user interaction (e.g., less than a minute). Second, given a group of documents, some method for automatically summarizing that group must be specified. This cluster description must be sufficiently revealing for to give the user a sense of the topic defined by the group, yet short enough for many descriptions to be appreciated simultaneously.

We will present two algorithms that meet the first requirement. Buckshot is a fast clustering algorithm suitable for the online reclustering essential for Scatter/Gather. Fractionation is another, more careful, clustering algorithm whose greater accuracy makes it suitable for the static offline partitioning of the entire corpus which is presented first to the user. We also define the cluster digest, an easy to generate, concise description of a cluster suitable for Scatter/Gather.

Document Clustering

Before presenting our document clustering algorithms, we review the terminology established in prior work, and discuss why existing clustering algorithms fail to meet our needs. Throughout this paper, n denotes the number of documents in the collection, and k denotes the desired number of clusters.

In order to cluster documents one must first establish a pairwise measure of document similarity and then define a method to partition the collection into clusters of similar documents. Numerous document similarity measures have been proposed, all of which treat each document as a set of words, often with frequency information, and measure the degree of word overlap between documents [11]. The documents are typically represented by sparse vectors of length equal to the number of unique words (or types) in the corpus. Each component of the vector has a value reflecting the occurrence of the corresponding word in the document. We may use a binary scale, in which the value is one or zero, to represent the presence or the absence of the word, or the value might be some function of the word's frequency within that document. If a word does not occur in a document, its value is zero. A popular similarity measure, the cosine measure, computes the cosine of the angle between these two sparse vectors. If both document vectors are normalized to unit length, the cosine is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. Willett [15] has suggested that that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering algorithm.

Two different types of document clusters can be constructed. One is a flat partition of the documents into a collection of subsets. The other is a hierarchical cluster, which can be defined recursively as either an individual document or a partition of the corpus into sets, each of which is then hierarchically clustered. A hierarchical clustering defines a tree, called a dendrogram, on the documents.

Numerous clustering algorithms have been applied to build hierarchical document clusters, including, most prominently, single-linkage hierarchical clustering [5, 2]. These algorithms generally proceed by iteratively considering all pairs of clusters built so far, and fusing the pair which exhibits the greatest similarity into a single document group (which then becomes a node of the dendrogram). They differ in the procedure used to compute similarity when one of the pair is the product of a previous fusion. Single-linkage clustering defines the similarity as the maximum similarity between any two individuals, one from each of the two groups. Alternative methods consider the minimum similarity (complete-linkage), the average similarity (group-average linkage), as well as other aggregate measures. Although single-linkage clustering is known to have an undesirable chaining behavior, typically forming elongated straggly clusters, it remains popular due to its simplicity and the availability of an optimal space and time algorithm for its computation [10].

These algorithms share certain common characteristics. They are agglomerative, in that they proceed by iteratively choosing two document groups to agglomerate into a single document group. They agglomerate in a greedy manner, in that the pair of document groups chosen for agglomeration is the pair which is considered best or most similar under some criterion. Lastly, they are global in that all pairs of inter-group similarities are considered in the course of selecting an agglomeration. Global algorithms have running times which are intrinsically tex2html_wrap_inline459 ,gif because all pairs of similarities must be considered. This sharply limits their usefulness, even given algorithms that attain the theoretical quadratic lower bound on performance.

Partitional strategies, those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions, have also been studied [8, 13]. Some of these algorithms are global in nature and thus have the same slow performance as the above mentioned greedy, global, agglomerative algorithms. Other partitional algorithms, by contrast, typically have rectangular running times, i.e., O(kn). Generally, these algorithms proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated, with, at each stage, an improved selection of cluster seeds. It is noteworthy that any partitional clustering algorithm can be transformed into a hierarchical clustering algorithm by recursively partitioning each of the clusters found in an application of the partitioning algorithm.

One application of a partitional clustering has been to improve the performance of near-neighbor search by including, with each document, some closely related documents that might otherwise be missed. However, to be useful for near-neighbor search, the partition must be fairly fine, since it is desirable for each set to only contain a few documents. For example, Willett generates a partition who size is related to the number of unique words in the document collection [13]. From this perspective, the potential computational benefits of a seed-based strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.

We present two partitioning algorithms which use techniques drawn from the hierarchical algorithms, but which acheive rectangular time bounds. For our application, the number of clusters desired is small and thus the speedup over quadratic time algorithms is substantial.

Definitions

For each document tex2html_wrap_inline481 in a collection (or corpus) C, let the countfile tex2html_wrap_inline485 be the set of words, with their frequencies, that occur in that document.gif Let V be the set of unique words occurring in C. Then tex2html_wrap_inline485 can be represented as a vector of length |V|;

displaymath463

where tex2html_wrap_inline495 is the ith word in V and tex2html_wrap_inline499 is the frequency of tex2html_wrap_inline495 in tex2html_wrap_inline481 .

To measure the similarity between pairs of documents, tex2html_wrap_inline481 and tex2html_wrap_inline507 , let us employ the cosine between monotone element-wise functions of tex2html_wrap_inline485 and tex2html_wrap_inline511 . In particular, let

displaymath464

where g is a monotone damping function, `` tex2html_wrap_inline515 '' denotes inner product, and tex2html_wrap_inline517 denotes vector norm. It has been our experience that taking g to be component-wise square-root produces better results than the traditional component-wise logarithm.

It is useful to consider similarity to be a function of document profiles tex2html_wrap_inline521 , where

displaymath465

in which case

displaymath466

Suppose tex2html_wrap_inline523 is a set of documents, or a document group. A simple profile can be associated with tex2html_wrap_inline523 by defining it to be the normalized sum of profiles of the contained individuals. Let

displaymath467

be the unnormalized sum profile, and then

displaymath468

Similarly, the cosine measure can be extended to tex2html_wrap_inline523 by employing this profile definition:

displaymath469

Sometimes for our purposes, the normalized sum profile is not a good measure of a document group's ``contents'' because it takes into account documents which lie on the outskirts of the group. To solve this problem, we define the trimmed sum profile tex2html_wrap_inline529 for any cluster tex2html_wrap_inline523 by considering only the m ``most central'' documents of the cluster. For every tex2html_wrap_inline481 in tex2html_wrap_inline523 let tex2html_wrap_inline539 be the m documents tex2html_wrap_inline481 whose similarity to tex2html_wrap_inline523 , namely tex2html_wrap_inline547 , is largest. Then define

displaymath470

and

displaymath471

This computation can be completed in time proportional to tex2html_wrap_inline549 .gif The trimming parameter m may be defined adaptively as some percentage of tex2html_wrap_inline549 , or may be fixed.

Cluster Digest

Another description of a document group is in some sense dual to the trimmed sum profile. Rather than considering the central documents of a cluster, we can consider the central words, namely those which appear most frequently in the group as a whole. We thus define tex2html_wrap_inline555 , the topical words of tex2html_wrap_inline523 , to be the w highest weighted terms in tex2html_wrap_inline561 (or perhaps in tex2html_wrap_inline529 ).

Taken together, the two sets tex2html_wrap_inline565 form the (m,w) cluster digest of tex2html_wrap_inline523 , a short description of the contents of the cluster. The cluster digest can easily be computed in time tex2html_wrap_inline571 , and is in fact the summary used to describe a cluster to a user of Scatter/Gather.

Partitional Clustering

Seed-based partitional clustering algorithms have three phases:

1
Find k centers.
2
Assign each document in the collection to a center.
3
Refine the partition so constructed.

The result is a set P of k disjoint document groups such that tex2html_wrap_inline579 .

The Buckshot and Fractionation algorithms are both designed to find the initial centers. They can be thought of as rough clustering algorithms, however their output is only used to define centers. Both algorithms assume the existence of some algorithm which clusters well, but which may run slowly. Let us call this procedure the cluster subroutine. We use group average agglomerative clustering for this subroutine (see appendix B). Each of our algorithms uses this cluster subroutine locally over small sets, and builds on its results to find the k centers.

Buckshot applies the cluster subroutine to a random sample to find centers. Fractionation uses successive application of the cluster subroutine over fixed sized groups to find centers. We believe that Fractionation is the more accurate center finding procedure. However, Buckshot is significantly faster, and, hence, is more appropriate for the on-the-fly online reclustering required by iterations of Scatter/Gather. Fractionation can be used to establish the primary partitioning of the entire corpus, which is displayed in the first iteration of Scatter/Gather.

We implement Step 2 by assigning each document to the ``nearest'' center (in a sense to be defined later).

Our refinement algorithms also reflect a time-accuracy tradeoff. The simplest refinement procedure, iterated move-to-nearest, is fast but limited. A more comprehensive refinement is achieved through repeated application of procedures that attempt to Split, Join, and clarify elements of the partition P.

Finding Initial Centers

Buckshot

The idea of the buckshot algorithm is quite simple. To achieve a rectangular time clustering algorithm, merely choose a small random sample of the documents (of size tex2html_wrap_inline585 ), and apply the cluster subroutine. Return the centers of the clusters found. This algorithm clearly runs in time O(kn).

Since random sampling is employed, the Buckshot algorithm is not deterministic. That is, repeated calls to this algorithm on the same corpus may produce different partitions, although in our experience repeated trials generally produce qualitatively similar partitions.

Fractionation

The Fractionation algorithm finds k centers by initially breaking C into N/m buckets of a fixed size m;SPMgt;k. The cluster subroutine is then applied to each of these buckets separately to agglomerate individuals into document groups such that the reduction in number (from individuals to groups in each bucket) is roughly a factor of tex2html_wrap_inline603 . These groups are now treated as if they were individuals, and the entire process repeated. The iteration terminates when only k groups remain. Fractionation can be viewed as building a tex2html_wrap_inline607 branching tree bottom up, where the leaves are individual documents, terminating when only k roots remain.

Suppose the individuals in C are enumerated, so that tex2html_wrap_inline613 . This ordering could reflect an extrinsic ordering on C, but a better procedure sorts C based on a key which is the word index of the tex2html_wrap_inline619 most common word in each individual. Typically j is a small number, such as three, which favors medium frequency terms. This procedure thus encourages nearby individuals in the corpus ordering to have at least one word in common.

The initial bucketing creates a partition

displaymath589

such that

displaymath590

Each tex2html_wrap_inline623 is then separately clustered (using the cluster subroutine) into tex2html_wrap_inline625 groups, where tex2html_wrap_inline603 is the desired reduction factor. Note that each of these computations occurs in tex2html_wrap_inline629 time, and, hence, all n/m occur in nm time. Each application of agglomerative clustering produces an associated partition tex2html_wrap_inline635 . The union of the documents groups contained in these partitions are then treated as individuals for the next iteration. That is, define

displaymath591

C' inherits an enumeration order by taking the tex2html_wrap_inline639 in lexicographic order on i and j. The process is then repeated with C' replacing C. That is, the tex2html_wrap_inline649 components of C' are broken into tex2html_wrap_inline653 buckets, which are further reduced to tex2html_wrap_inline655 groups through separate agglomeration. The process terminates at iteration j if tex2html_wrap_inline659 . At this point one final application of agglomerative clustering can reduce the remaining groups to a partition P of size k.

To determine the running time, observe that the tex2html_wrap_inline619 iteration, which operates on tex2html_wrap_inline667 items, takes time tex2html_wrap_inline669 . The overall running time is thus tex2html_wrap_inline671 . Thus if m=O(k) this algorithm has rectangular running time.

Assigning Documents to Centers

Once k centers have been found, and suitable profiles defined for those centers, each document in C must be assigned to one of those centers based on some criterion. The simplest algorithm, Assign-to-Nearest, assigns each document to the nearest center.

Let G be a partition of the collection into k groups, and let tex2html_wrap_inline683 be the ith group in G. Let tex2html_wrap_inline687 if i maximizes tex2html_wrap_inline691 . Ties can be broken by assigning tex2html_wrap_inline481 to the group with lowest index. The set tex2html_wrap_inline695 is then the desired partition.

P can be efficiently computed by constructing an inverted map for the k centers tex2html_wrap_inline701 , and for each tex2html_wrap_inline703 simultaneously computing the similarity to all the centers. In any case, the cost of this procedure is proportional to kn.

Refinement

Given an initial clustering, it is now desirable to refine it into a better one. As with our initial clustering algorithms, there is a tradeoff between speed and accuracy. The simplest process is simply to iterate the Assign-to-Nearest process just discussed. The Split algorithm separates poorly defined clusters into two well separated parts and Join merges clusters which are too similar.

Iterated Assign-to-Nearest

The Assign-to-Nearest procedure mentioned above can also be seen as the first of our refinement algorithms. From a given set of clusters, we generate cluster centers using the trimmed sum profiles above, and we then assign each document to the nearest center so as to form new clusters. This process can be iterated indefinitely, though it makes its greatest gains in the first few steps, and hence is typically iterated only a small fixed number of times.gif

Split

Split divides each document group tex2html_wrap_inline523 in a partition P into two new groups. This can be accomplished by applying Buckshot clustering (without refinement) with tex2html_wrap_inline717 and k=2. The resulting Buckshot partition G provides the two new groups.

Let tex2html_wrap_inline723 and let tex2html_wrap_inline725 be a two element Buckshot partition of tex2html_wrap_inline683 . The new partition P' is simply the union of the tex2html_wrap_inline731 's:

displaymath707

Each application of Buckshot requires time proportional to tex2html_wrap_inline733 . Hence, the overall computation can be performed in time proportional to N.

A modification of this procedure would only split groups that score poorly on some coherency criterion. One simple criterion is the cluster self similarity tex2html_wrap_inline737 . This quantity is in fact proportional to the average similarity between documents in the cluster, as well as to the average similarity of a document to the cluster centroid. We thus define:

displaymath708

Let tex2html_wrap_inline739 be the rank of tex2html_wrap_inline741 in the set

displaymath709

The procedure would then only split groups such that tex2html_wrap_inline743 for some tex2html_wrap_inline745 . This modification does not change the order of the algorithm since the coherence criterion can be computed in time proportional to N.

Join

The purpose of the Join refinement operator is to merge document groups in a partition P that are not usefully distinguished by their cluster digests. Since, by definition, any two elements of P are disjoint, they will never have ``typical'' documents in common. However, their lists of ``topical'' words may well overlap. Therefore the criterion of distinguishability between two groups tex2html_wrap_inline523 and tex2html_wrap_inline757 will be

displaymath749

where tex2html_wrap_inline555 is the list of w most topical words for tex2html_wrap_inline523 . We merge tex2html_wrap_inline523 and tex2html_wrap_inline757 if tex2html_wrap_inline769 , for some tex2html_wrap_inline771 .

Determining the topical words for each cluster takes time proportional to the number of words in the corpus, and we must then compute tex2html_wrap_inline773 intersections to decide which clusters to merge. In large corpora, the number of words is typically less than the number of documents, and the running time of Join is thus O(kn).

Application to Scatter/Gather

Combinations of the various initial clustering and refinement procedures give several possible complete clustering algorithms. We have used two of these combinations in the course of implementing the Scatter/Gather method.

The initial partition used in Scatter/Gather is completely determined by the corpus under consideration. Hence, when the corpus is available in advance, one can compute the initial partition offline. We can therefore use a slower clustering algorithm to improve the accuracy of the initial partition. However, for corpora consisting of tens of thousands of documents, a quadratic time algorithm is likely to be too slow even for offline computation. We thus use the Fractionation algorithm to find centers, and then perform a great deal of refinement using the Split, Join, and Assign-to-Nearest operators. Note that the running time for each of the refinement procedures is O(kN) and thus does not affect the overall running time.

In an interactive session, however, it is vital for the clustering algorithm to run as quickly as possible, even at the expense of some accuracy. We therefore use the Buckshot center finding procedure, and then follow it with a bare minimum of refinement. We have found that two iterations of the Assign-to-Nearest procedure yield a reasonably accurate clustering, and that further refinement produces additional improvement, but with quickly diminishing returns.

By virtue of the Buckshot center finding procedure this algorithm is not deterministic. However, in the contemplated application, Scatter/Gather, it is more important that the partition be computed at high speed than that the algorithm be deterministic. Indeed, the lack of determinism might be interpreted as a feature, since the user then has the option of discarding an unrevealing partition in favor of a fresh reclustering.

The overall complexity of both clustering procedures described in this section is clearly O(kN). The constant factor for the Buckshot-based procedure is small enough to permit interactive use with large document collections. The Fractionation-based procedure has a somewhat larger constant factor, but one which is still acceptable for offline applications.

Naturally Clustered Data

It is worth examining the performance of our algorithms when the data set consists of well separated clusters of points. If the input data has k natural clusters, i.e., the smallest intra-cluster document similarity is larger than the largest inter-cluster document similarity, then both of our algorithms will find this partition.

For Buckshot, if we have a corpus containing k widely separated and equal size centers, then a random sample of size tex2html_wrap_inline585 will select some documents from each of the centers with high probability so long as tex2html_wrap_inline787 . This will certainly be true for our case in which k=20 or so. To see this, compute the probability that, if we choose a sample of size s, we fail to get any individual from some cluster. This is at most k times the probability that none of our s individuals is a member of cluster i, namely tex2html_wrap_inline799 . So, the total probability of failure is at most tex2html_wrap_inline801 . If we now take tex2html_wrap_inline803 for some a, then the failure probability is at most

displaymath807

Thus in our case, with k=20, taking a=5 means that 400 samples find all the clusters 999 times in 1000. Given that we start with at least one element from each cluster, our resulting clusters will each be a subset of one of the clusters. Thus the set of centers found will include a center within each actual cluster.

For Fractionation, we need merely note that if we have more than k documents in a single bucket, some pair of them is necessarily in the same actual cluster. Then clearly, this pair will be merged in preference to any other pair. Therefore, no pair of documents not in the same cluster will ever be merged. Thus, when we finish, each cluster we have found will be a subset of some one of the actual clusters.

 

figure144


Figure 2: Initial Scattering  

 

figure149


Figure 3: Second Scatter  

Conclusion

Scatter/Gather demonstrates that document clustering can be an effective information access tool in its own right. The table-of-contents metaphor gives the method an intuitive basis, and experience has shown that it is indeed easy to use. Scatter/Gather is particularly helpful in situations in which it is difficult or undesirable to specify a query formally. Claims of improved performance must await evaluation metrics appropriate to the vaguely defined information access goals in which Scatter/Gather excels.

To support Scatter/Gather, fast clustering algorithms are essential. Clustering can be done quickly by working in a local manner on small groups of documents rather than trying to deal with the entire corpus globally.

For extremely large corpora, even the linear time clustering achieved by the Buckshot or Fractionation algorithms may be too slow. We are working to develop variations on Scatter/Gather which will scale to arbitrarily large corpora, under the assumption that linear time preprocessing will always be feasible.

Clearly, the accuracy of the Buckshot and Fractionation algorithms is affected by the quality of the clustering provided by the slow cluster subroutine. This provides further motivation to find highly accurate clustering algorithms, whatever their running time may be.

 

figure154


Figure 4: Third Scatter  

 

figure158


Figure: Titles of articles in topic 1 from Figure 4  

A Scatter/Gather Session

 

In figures 2 through 5, we present the full output of the Scatter/Gather session described in section 2.1. The corpus is the set of articles distributed by the New York Times News Service during the month of August 1990. This consists of roughly 30 megabytes of ASCII text in about 5000 articles. Some articles are repeated due to updates of news stories.

Here our goal is to learn about international political events during this month. To create the initial partition we've applied the Buckshot clustering algorithm (figure 2). Fractionation is recommended for this task, time permitting.

Each cluster is described with the two line display of its cluster digest. The first line contains the number of the cluster, the number of documents in the cluster, and titles of documents near the centroid. The second line contains words frequent in the cluster.

We select clusters 2 (Iraq's invasion of Kuwait), 5 (Markets, including oil) and 6 (Germany, and probably other international issues) as those which seem likely to contain articles of interest, recluster, and display a new cluster digest (figure 3).

Next, in figure 4, we iterate, this time selecting clusters 3 (Pakistan, and probably other international issues) and 4 (African issues). Specific incidents have been separated out. We find hostages in Trinidad, war in Liberia, police action in South Africa, and so on.

We obtain more detail about the situation in Liberia by viewing the titles of the articles contained in that cluster (figure 5).

Group Average Clustering

 

Here we present a quadratic time greedy global agglomerative clustering algorithm which has given good results in our implementation. This is similar to the algorithm presented in [3].

Let tex2html_wrap_inline523 be a document group. The average similarity between any two documents in tex2html_wrap_inline523 is defined to be

displaymath815

Let G be a set of disjoint document groups. The basic iteration of group average agglomerative clustering finds the two different clusters tex2html_wrap_inline831 and tex2html_wrap_inline833 which maximize tex2html_wrap_inline835 over all choices from G.

A new, smaller, partition G' is then constructed by merging tex2html_wrap_inline831 with tex2html_wrap_inline833 .

displaymath816

Initially, G is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates when |G'| = k. Note that the output from this procedure is the final flat partition G', rather than a nested hierarchy of partitions, although the latter could be computed by recording each pairwise join as one level in a dendrogram.

If we employ the cosine similarity measure, the inner maximization can be significantly accelerated. Recall that tex2html_wrap_inline851 is the unnormalized sum profile associated with tex2html_wrap_inline523 . Then the average pairwise similarity, tex2html_wrap_inline855 , is simply related to the inner product, tex2html_wrap_inline857 . That is, since

eqnarray183

displaymath817

Similarly, for the union of two disjoint groups, tex2html_wrap_inline859

displaymath818

where

eqnarray196

Therefore, if for every tex2html_wrap_inline861 , tex2html_wrap_inline855 and tex2html_wrap_inline851 are known, the pairwise merge that will produce the least decrease in average similarity can be cheaply updated each time a merge is performed. Further, suppose for every tex2html_wrap_inline861 the tex2html_wrap_inline757 were known such that

displaymath819

then finding the best pair would simply involve scanning the |G| candidates. Updating these quantities with each iteration is straightforward, since only those involving tex2html_wrap_inline831 and tex2html_wrap_inline833 need be recomputed.

Using techniques such as these, it can be seen that the average time complexity for truncated group average agglomerative clustering is tex2html_wrap_inline877 where n is equal to the number of individuals to be clustered.

References

1
Chris Buckley and Alan F. Lewit. Optimizations of inverted vector searches. In Proceedings of the Eighth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97-110, 1985.

2
W.B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.

3
A. El-Hamdouchi and P. Willett. Hierarchical document clustering using Ward's method. In Proceedings of the Ninth International Conference on Research and Development in Information Retrieval, pages 149-156, 1986.

4
A. Griffiths, H.C. Luckhurst, and P. Willett. Using inter-document similarity information in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.

5
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Pretice Hall, Engelwood Cliffs, N.J. 07632, 1988.

6
N. Jardine and C.J. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7:217-240, 1971.

7
J. O. Pedersen, D. R. Cutting, and J. W. Tukey. Snippet search: a single phrase approach to text access. In Proceedings of the 1991 Joint Statistical Meetings. American Statistical Association, 1991. Also available as Xerox PARC technical report SSL-91-08.

8
G. Salton. The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, N.J., 1971.

9
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

10
R. Sibson. SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34, 1973.

11
C.J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

12
C.J. van Rijsbergen and W.B. Croft. Document clustering: An evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11:171-182, 1975.

13
P. Willett. Document clustering using an inverted file approach. Journal of Information Science, 2:223-231, 1980.

14
P. Willett. A fast procedure for the calculation of similarity coefficients in automatic classification. Information Processing & Management, 17:53-60, 1981.

15
P. Willett. Recent trends in hierarchical document clustering: A critical review. Information Processing & Management, 24(5):577-597, 1988.
...search
Also known as ``vector space'' or ``similarity'' search.
... tex2html_wrap_inline459 ,
Willett [14] discusses an inverted file approach which can ameliorate this quadratic behavior when a large number of small clusters are desired. Unfortunately, when clusters are large enough to contain a large proportion of the terms in the corpus, this approach yields less improvement.
...document.
Throughout this paper, lower case Greek letters will be used to denote individual documents. Upper case Greek letters will denote sets of documents (document groups) and upper case Roman letters will denote sets of document groups.
... tex2html_wrap_inline549 .
A full sort of the similarities is not required.
...times.
Excessive iteration may in fact worsen the partition rather than improving it, since ``fuzzy'' elongated clusters can pull documents away from other clusters and become even fuzzier.