Douglass R. Cutting
-
David R. Karger
-
Jan O. Pedersen
-
John W. Tukey
[1]Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304
[2]Stanford University
[3]Princeton University
In Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pages 318-329, June 1992. Also available as Xerox PARC technical report SSL-92-02.
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.
We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutually similar documents will tend to be relevant to the same queries, and, hence, that automatic determination of groups of such documents can improve recall by effectively broadening a search request (see [11] for a discussion of the cluster hypothesis). Typically a fixed corpus of documents is clustered either into an exhaustive partition, disjoint or otherwise, or into a hierarchical tree structure (see, for example, [8, 13, 2]). In the case of a partition, queries are matched against clusters and the contents of the best scoring clusters are returned as a result, possibly sorted by score. In the case of a hierarchy, queries are processed downward, always taking the highest scoring branch, until some stopping condition is achieved. The subtree at that point is then returned as a result. Hybrid strategies are also available.
These strategies are essentially variations of near-neighbor
search
where nearness is defined in terms of the pairwise document
similarity measure used to generate the clustering. Indeed, cluster
search techniques are typically compared to direct near-neighbor search
[9], and are evaluated in
terms of precision and recall. Various studies indicate that cluster
search strategies are not markedly superior to near-neighbor search,
and, in some situations, can be inferior (see, for example,
[6, 12, 4]).
Furthermore, document clustering algorithms are often slow, with
quadratic running times. It is therefore unsurprising that cluster
search, with its indifferent performance, has not gained wide
popularity.
Document clustering has also been studied as a method for accelerating near-neighbor search, but the development of fast algorithms for near-neighbor search has decreased interest in that possibility [1].
In this paper, we take a new approach to document clustering. Rather than dismissing document clustering as a poor tool for enhancing near-neighbor search, we ask how clustering can be effective as an access method in its own right. We describe a document browsing method, called Scatter/Gather, which uses document clustering as its primitive operation. This technique is directed towards information access with non-specific goals and serves as a complement to more focused techniques.
To implement Scatter/Gather, fast document clustering is a necessity. We introduce two new near linear time clustering algorithms which experimentation has shown to be effective, and also discuss reasons for their effectiveness.
The standard formulation of the information access problem presumes a query, the user's expression of an information need. The task is then to search a corpus for documents that match this need. However, it is not difficult to imagine a situation in which it is hard, if not impossible, to formulate such a query precisely. For example, the user may not be familiar with the vocabulary appropriate for describing a topic of interest, or may not wish to commit himself to a particular choice of words. Indeed, the user may not be looking for anything specific at all, but rather may wish to discover the general information content of the corpus. Access to a document collection in fact covers an entire spectrum: at one end is a narrowly specified search for a particular document, given something as specific as its title; at the other end is a browsing session with no well defined goal, satisfying a need to learn more about the document collection. It is common for a session to move across the spectrum, from browsing to search: the user starts with a partially defined goal which is refined as he finds out more about the document collection. Standard information access techniques tend to emphasize the search end of the spectrum. A glaring example of this emphasis is cluster search, where clustering, a technology capable of topic extraction, is submerged from view and used only to assist near-neighbor search.
We propose an alternative application for clustering in information access, taking our inspiration from the access methods typically provided with a conventional textbook. If one has a specific question in mind, and specific terms which define that question, one consults the index, which directs one to passages of interest. However, if one is simply interested in gaining an overview, or has a general question, one peruses the table of contents, which lays out the logical structure of the text. The table of contents gives a sense of what sort of questions might be answered by a more intensive examination of the text, and may also lead to specific sections of interest. One can easily alternate between browsing the table of contents and searching the index.
By direct analogy, we propose an information access system with two components: our browsing method, Scatter/Gather, which uses a cluster-based, dynamic table-of-contents metaphor for navigating a collection of documents; and one or more word-based, directed, text search methods, such as near-neighbor search or snippet search [7]. The browsing component describes groups of similar documents, one or more of which can be selected for further examination. This can be iterated until the user is directly viewing individual documents. Based on documents found in this process, or on terms used to describe document groups, the user may, at any time, switch to a more focused search method. In particular, we anticipate that the browsing tool will not necessarily be used to find particular documents, but may instead help the user formulate a search request, which will then be serviced by some other means. Scatter/Gather may also be used to organize the results of word-based queries that retrieve too many documents.
In the basic iteration of the proposed browsing method, the user is presented with short summaries of a small number of document groups.
Initially the system scatters the collection into a small number of document groups, or clusters, and presents short summaries of them to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together to form a subcollection. The system then applies clustering again to scatter the new subcollection into a small number of document groups, which are again presented to the user. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents.
Figure: Illustration of Scatter/Gather
We now describe a Scatter/Gather session, where the text collection consists of about 5000 articles posted to the New York Times News Service during the month of August 1990. This session is summarized in figure 1. Here, to simplify the figure, we manually assigned single-word labels based on the full cluster descriptions. The full session is provided as Appendix A.
Suppose the user wants to find out what happened that month. Several issues prevent the application of conventional search techniques:
With Scatter/Gather, rather than being forced to provide terms, the user is presented with a set of clusters, an outline of the corpus. She need only select those clusters which seem potentially relevant to the topic of interest. In the example, the big stories of the month are immediately obvious from the initial scattering: Iraq invades Kuwait, and Germany considers reunification. This leads the user to focus on international issues: she selects the `Kuwait' and `Germany' and `Oil' clusters. These three clusters are gathered together.
This reduced corpus is then reclustered on the fly to produce eight new clusters covering the reduced corpus. Since the reduced corpus contains a subset of the articles, these new clusters reveal a finer level of detail than the original eight. The articles on the Iraqi invasion and some of the `Oil' articles have now been separated into clusters discussing the U.S. military deployment, the effects of the invasion upon the oil market, and one which the user deduces is about hostages in Kuwait.
The user feels her understanding of these large stories is adequate, but wishes to find out what happened in other corners of the world. She selects the `Pakistan' cluster, which also contains other foreign political stories, and a cluster containing articles about Africa. This reveals a number of specific international situations as well as a small collection of miscellaneous international articles. The user thus learns of a coup in Pakistan, and about hostages being taken in Trinidad, stories otherwise lost among the major stories of that month.
Scatter/Gather depends on the existence of two facilities. First, since clustering and reclustering is an essential part of the basic iteration, we need an algorithm which can appropriately cluster a large number of documents within a time tolerable for user interaction (e.g., less than a minute). Second, given a group of documents, some method for automatically summarizing that group must be specified. This cluster description must be sufficiently revealing for to give the user a sense of the topic defined by the group, yet short enough for many descriptions to be appreciated simultaneously.
We will present two algorithms that meet the first requirement. Buckshot is a fast clustering algorithm suitable for the online reclustering essential for Scatter/Gather. Fractionation is another, more careful, clustering algorithm whose greater accuracy makes it suitable for the static offline partitioning of the entire corpus which is presented first to the user. We also define the cluster digest, an easy to generate, concise description of a cluster suitable for Scatter/Gather.
Before presenting our document clustering algorithms, we review the terminology established in prior work, and discuss why existing clustering algorithms fail to meet our needs. Throughout this paper, n denotes the number of documents in the collection, and k denotes the desired number of clusters.
In order to cluster documents one must first establish a pairwise measure of document similarity and then define a method to partition the collection into clusters of similar documents. Numerous document similarity measures have been proposed, all of which treat each document as a set of words, often with frequency information, and measure the degree of word overlap between documents [11]. The documents are typically represented by sparse vectors of length equal to the number of unique words (or types) in the corpus. Each component of the vector has a value reflecting the occurrence of the corresponding word in the document. We may use a binary scale, in which the value is one or zero, to represent the presence or the absence of the word, or the value might be some function of the word's frequency within that document. If a word does not occur in a document, its value is zero. A popular similarity measure, the cosine measure, computes the cosine of the angle between these two sparse vectors. If both document vectors are normalized to unit length, the cosine is, of course, simply the inner product of the two vectors. Other measures include the Dice and Jaccard coefficients, which are normalized word overlap counts. Willett [15] has suggested that that the choice of similarity measure has less qualitative impact on clustering results than the choice of clustering algorithm.
Two different types of document clusters can be constructed. One is a flat partition of the documents into a collection of subsets. The other is a hierarchical cluster, which can be defined recursively as either an individual document or a partition of the corpus into sets, each of which is then hierarchically clustered. A hierarchical clustering defines a tree, called a dendrogram, on the documents.
Numerous clustering algorithms have been applied to build hierarchical document clusters, including, most prominently, single-linkage hierarchical clustering [5, 2]. These algorithms generally proceed by iteratively considering all pairs of clusters built so far, and fusing the pair which exhibits the greatest similarity into a single document group (which then becomes a node of the dendrogram). They differ in the procedure used to compute similarity when one of the pair is the product of a previous fusion. Single-linkage clustering defines the similarity as the maximum similarity between any two individuals, one from each of the two groups. Alternative methods consider the minimum similarity (complete-linkage), the average similarity (group-average linkage), as well as other aggregate measures. Although single-linkage clustering is known to have an undesirable chaining behavior, typically forming elongated straggly clusters, it remains popular due to its simplicity and the availability of an optimal space and time algorithm for its computation [10].
These algorithms share certain common characteristics. They are
agglomerative, in that they proceed by iteratively choosing two
document groups to agglomerate into a single document group. They
agglomerate in a greedy manner, in that the pair of document
groups chosen for agglomeration is the pair which is considered best
or most similar under some criterion. Lastly, they are global
in that all pairs of inter-group similarities are considered in the
course of selecting an agglomeration. Global algorithms have running
times which are intrinsically
,
because all pairs of similarities
must be considered. This sharply limits their usefulness, even given
algorithms that attain the theoretical quadratic lower bound on
performance.
Partitional strategies, those that strive for a flat decomposition of the collection into sets of documents rather than a hierarchy of nested partitions, have also been studied [8, 13]. Some of these algorithms are global in nature and thus have the same slow performance as the above mentioned greedy, global, agglomerative algorithms. Other partitional algorithms, by contrast, typically have rectangular running times, i.e., O(kn). Generally, these algorithms proceed by choosing, in some manner, a number of seeds equal to the desired size (number of sets) of the final partition. Each document in the collection is then assigned to the closest seed. As a refinement, the procedure can be iterated, with, at each stage, an improved selection of cluster seeds. It is noteworthy that any partitional clustering algorithm can be transformed into a hierarchical clustering algorithm by recursively partitioning each of the clusters found in an application of the partitioning algorithm.
One application of a partitional clustering has been to improve the performance of near-neighbor search by including, with each document, some closely related documents that might otherwise be missed. However, to be useful for near-neighbor search, the partition must be fairly fine, since it is desirable for each set to only contain a few documents. For example, Willett generates a partition who size is related to the number of unique words in the document collection [13]. From this perspective, the potential computational benefits of a seed-based strategy are largely obviated by the large size (relative to the number of documents) of the required partition. For this reason partitional strategies have not been aggressively pursued by the information retrieval community.
We present two partitioning algorithms which use techniques drawn from the hierarchical algorithms, but which acheive rectangular time bounds. For our application, the number of clusters desired is small and thus the speedup over quadratic time algorithms is substantial.
For each document
in a collection (or corpus) C, let the
countfile
be the set of words, with their
frequencies, that occur in that document.
Let V be the set of unique words occurring in
C. Then
can be represented as a vector of length |V|;
where
is the ith word in V and
is the
frequency of
in
.
To measure the similarity between pairs of documents,
and
, let us employ the cosine between monotone element-wise
functions of
and
. In particular, let
where g is a monotone damping function,
``
'' denotes inner product, and
denotes vector norm. It has been our experience that taking g to be
component-wise square-root produces better results than the
traditional component-wise logarithm.
It is useful to consider similarity to be a function of document
profiles
, where
in which case
Suppose
is a set of documents, or a document group. A
simple profile can be associated with
by defining it to be
the normalized sum of profiles of the contained individuals. Let
be the unnormalized sum profile, and then
Similarly, the cosine measure can be extended to
by employing
this profile definition:
Sometimes for our purposes, the normalized sum profile is not a good
measure of a document group's ``contents'' because it takes into
account documents which lie on the outskirts of the group. To solve
this problem, we define the trimmed sum profile
for any cluster
by considering only the m ``most central''
documents of the cluster. For every
in
let
be the m documents
whose similarity to
, namely
, is largest. Then define
and
This computation can be completed in time proportional to
.
The trimming parameter m may be defined adaptively as some
percentage of
, or may be fixed.
Another description of a document group is in some sense dual to the
trimmed sum profile. Rather than considering the central documents of
a cluster, we can consider the central words, namely those which
appear most frequently in the group as a whole. We thus define
, the topical words of
, to be the w
highest weighted terms in
(or perhaps in
).
Taken together, the two sets
form the
(m,w) cluster digest of
, a short description of the
contents of the cluster. The cluster digest can easily be computed in
time
, and is in fact the summary used to describe
a cluster to a user of Scatter/Gather.
Seed-based partitional clustering algorithms have three phases:
The Buckshot and Fractionation algorithms are both designed to find the initial centers. They can be thought of as rough clustering algorithms, however their output is only used to define centers. Both algorithms assume the existence of some algorithm which clusters well, but which may run slowly. Let us call this procedure the cluster subroutine. We use group average agglomerative clustering for this subroutine (see appendix B). Each of our algorithms uses this cluster subroutine locally over small sets, and builds on its results to find the k centers.
Buckshot applies the cluster subroutine to a random sample to find centers. Fractionation uses successive application of the cluster subroutine over fixed sized groups to find centers. We believe that Fractionation is the more accurate center finding procedure. However, Buckshot is significantly faster, and, hence, is more appropriate for the on-the-fly online reclustering required by iterations of Scatter/Gather. Fractionation can be used to establish the primary partitioning of the entire corpus, which is displayed in the first iteration of Scatter/Gather.
We implement Step 2 by assigning each document to the ``nearest'' center (in a sense to be defined later).
Our refinement algorithms also reflect a time-accuracy tradeoff. The simplest refinement procedure, iterated move-to-nearest, is fast but limited. A more comprehensive refinement is achieved through repeated application of procedures that attempt to Split, Join, and clarify elements of the partition P.
The idea of the buckshot algorithm is quite simple. To achieve a
rectangular time clustering algorithm, merely choose a small random
sample of the documents (of size
), and apply the cluster
subroutine. Return the centers of the clusters found. This algorithm
clearly runs in time O(kn).
Since random sampling is employed, the Buckshot algorithm is not deterministic. That is, repeated calls to this algorithm on the same corpus may produce different partitions, although in our experience repeated trials generally produce qualitatively similar partitions.
The Fractionation algorithm finds k centers by initially breaking
C into N/m buckets of a fixed size m;SPMgt;k. The cluster subroutine
is then applied to each of these buckets separately to agglomerate
individuals into document groups such that the reduction in number
(from individuals to groups in each bucket) is roughly a factor of
. These groups are now treated as if they were individuals, and
the entire process repeated. The iteration terminates when only k
groups remain. Fractionation can be viewed as building a
branching tree bottom up, where the leaves are individual documents,
terminating when only k roots remain.
Suppose the individuals in C are enumerated, so that
. This ordering could reflect an
extrinsic ordering on C, but a better procedure sorts C based on a
key which is the word index of the
most common word in each
individual. Typically j is a small number, such as three, which
favors medium frequency terms. This procedure thus encourages nearby
individuals in the corpus ordering to have at least one word in
common.
The initial bucketing creates a partition
such that
Each
is then separately clustered (using the cluster
subroutine) into
groups, where
is the desired
reduction factor. Note that each of these computations occurs in
time, and, hence, all n/m occur in nm time. Each
application of agglomerative clustering produces an associated
partition
. The union of the
documents groups contained in these partitions are then treated as
individuals for the next iteration. That is, define
C' inherits an enumeration order by taking the
in
lexicographic order on i and j. The process is then repeated with
C' replacing C. That is, the
components of C' are
broken into
buckets, which are further reduced to
groups through separate agglomeration. The process terminates at
iteration j if
. At this point one final application
of agglomerative clustering can reduce the remaining groups to a
partition P of size k.
To determine the running time, observe that the
iteration,
which operates on
items, takes time
. The
overall running time is thus
.
Thus if m=O(k) this algorithm has rectangular running time.
Once k centers have been found, and suitable profiles defined for those centers, each document in C must be assigned to one of those centers based on some criterion. The simplest algorithm, Assign-to-Nearest, assigns each document to the nearest center.
Let G be a partition of the collection into k groups, and let
be the ith group in G. Let
if i
maximizes
. Ties can be broken by assigning
to the group with lowest index. The set
is then the desired partition.
P can be efficiently computed by constructing an inverted map for
the k centers
, and for each
simultaneously computing the similarity to all the centers. In any
case, the cost of this procedure is proportional to kn.
Given an initial clustering, it is now desirable to refine it into a better one. As with our initial clustering algorithms, there is a tradeoff between speed and accuracy. The simplest process is simply to iterate the Assign-to-Nearest process just discussed. The Split algorithm separates poorly defined clusters into two well separated parts and Join merges clusters which are too similar.
The Assign-to-Nearest procedure mentioned above can also be seen as
the first of our refinement algorithms.
From a given set of clusters,
we generate cluster centers using the trimmed sum profiles above, and
we then assign each document to the nearest center so as to form new
clusters. This process can be iterated indefinitely, though it makes
its greatest gains in the first few steps, and hence is typically
iterated only a small fixed number of times.
Split divides each document group
in a partition P into two
new groups. This can be accomplished by applying Buckshot clustering
(without refinement) with
and k=2. The resulting
Buckshot partition G provides the two new groups.
Let
and let
be a two element Buckshot partition of
. The new partition P' is simply the union of the
's:
Each application of Buckshot requires time proportional to
. Hence, the overall computation can be performed in time
proportional to N.
A modification of this procedure would only split groups that score
poorly on some coherency criterion. One simple criterion is the
cluster self similarity
. This quantity is in fact
proportional to the average similarity between documents in the
cluster, as well as to the average similarity of a document to the
cluster centroid. We thus define:
Let
be the rank of
in the set
The procedure
would then only split groups such that
for some
. This modification does not change the order
of the algorithm since the coherence criterion can be computed in time
proportional to N.
The purpose of the Join refinement operator is to merge document
groups in a partition P that are not usefully distinguished by their
cluster digests. Since, by definition, any two elements of P are
disjoint, they will never have ``typical'' documents in common.
However, their lists of ``topical'' words may well overlap. Therefore
the criterion of distinguishability between two groups
and
will be
where
is the list of w most topical words for
.
We merge
and
if
, for
some
.
Determining the topical words for each cluster takes
time proportional to the number of words in the corpus, and we must
then compute
intersections to decide which clusters to merge.
In large corpora, the number of words is typically less than the
number of documents, and the running time of Join is thus O(kn).
Combinations of the various initial clustering and refinement procedures give several possible complete clustering algorithms. We have used two of these combinations in the course of implementing the Scatter/Gather method.
The initial partition used in Scatter/Gather is completely determined by the corpus under consideration. Hence, when the corpus is available in advance, one can compute the initial partition offline. We can therefore use a slower clustering algorithm to improve the accuracy of the initial partition. However, for corpora consisting of tens of thousands of documents, a quadratic time algorithm is likely to be too slow even for offline computation. We thus use the Fractionation algorithm to find centers, and then perform a great deal of refinement using the Split, Join, and Assign-to-Nearest operators. Note that the running time for each of the refinement procedures is O(kN) and thus does not affect the overall running time.
In an interactive session, however, it is vital for the clustering algorithm to run as quickly as possible, even at the expense of some accuracy. We therefore use the Buckshot center finding procedure, and then follow it with a bare minimum of refinement. We have found that two iterations of the Assign-to-Nearest procedure yield a reasonably accurate clustering, and that further refinement produces additional improvement, but with quickly diminishing returns.
By virtue of the Buckshot center finding procedure this algorithm is not deterministic. However, in the contemplated application, Scatter/Gather, it is more important that the partition be computed at high speed than that the algorithm be deterministic. Indeed, the lack of determinism might be interpreted as a feature, since the user then has the option of discarding an unrevealing partition in favor of a fresh reclustering.
The overall complexity of both clustering procedures described in this section is clearly O(kN). The constant factor for the Buckshot-based procedure is small enough to permit interactive use with large document collections. The Fractionation-based procedure has a somewhat larger constant factor, but one which is still acceptable for offline applications.
It is worth examining the performance of our algorithms when the data set consists of well separated clusters of points. If the input data has k natural clusters, i.e., the smallest intra-cluster document similarity is larger than the largest inter-cluster document similarity, then both of our algorithms will find this partition.
For Buckshot, if we have a corpus containing k widely separated and
equal size centers, then a random sample of size
will
select some documents from each of the centers with high probability
so long as
. This will certainly be true for our case
in which k=20 or so. To see this, compute the probability that, if
we choose a sample of size s, we fail to get any individual from
some cluster. This is at most k times the probability that none of
our s individuals is a member of cluster i, namely
. So,
the total probability of failure is at most
. If we now
take
for some a, then the failure probability is at
most
Thus in our case, with k=20, taking a=5 means that 400 samples find all the clusters 999 times in 1000. Given that we start with at least one element from each cluster, our resulting clusters will each be a subset of one of the clusters. Thus the set of centers found will include a center within each actual cluster.
For Fractionation, we need merely note that if we have more than k documents in a single bucket, some pair of them is necessarily in the same actual cluster. Then clearly, this pair will be merged in preference to any other pair. Therefore, no pair of documents not in the same cluster will ever be merged. Thus, when we finish, each cluster we have found will be a subset of some one of the actual clusters.
Scatter/Gather demonstrates that document clustering can be an effective information access tool in its own right. The table-of-contents metaphor gives the method an intuitive basis, and experience has shown that it is indeed easy to use. Scatter/Gather is particularly helpful in situations in which it is difficult or undesirable to specify a query formally. Claims of improved performance must await evaluation metrics appropriate to the vaguely defined information access goals in which Scatter/Gather excels.
To support Scatter/Gather, fast clustering algorithms are essential. Clustering can be done quickly by working in a local manner on small groups of documents rather than trying to deal with the entire corpus globally.
For extremely large corpora, even the linear time clustering achieved by the Buckshot or Fractionation algorithms may be too slow. We are working to develop variations on Scatter/Gather which will scale to arbitrarily large corpora, under the assumption that linear time preprocessing will always be feasible.
Clearly, the accuracy of the Buckshot and Fractionation algorithms is affected by the quality of the clustering provided by the slow cluster subroutine. This provides further motivation to find highly accurate clustering algorithms, whatever their running time may be.
Figure: Titles of articles in topic 1 from Figure 4
In figures 2 through 5, we present the full output of the Scatter/Gather session described in section 2.1. The corpus is the set of articles distributed by the New York Times News Service during the month of August 1990. This consists of roughly 30 megabytes of ASCII text in about 5000 articles. Some articles are repeated due to updates of news stories.
Here our goal is to learn about international political events during this month. To create the initial partition we've applied the Buckshot clustering algorithm (figure 2). Fractionation is recommended for this task, time permitting.
Each cluster is described with the two line display of its cluster digest. The first line contains the number of the cluster, the number of documents in the cluster, and titles of documents near the centroid. The second line contains words frequent in the cluster.
We select clusters 2 (Iraq's invasion of Kuwait), 5 (Markets, including oil) and 6 (Germany, and probably other international issues) as those which seem likely to contain articles of interest, recluster, and display a new cluster digest (figure 3).
Next, in figure 4, we iterate, this time selecting clusters 3 (Pakistan, and probably other international issues) and 4 (African issues). Specific incidents have been separated out. We find hostages in Trinidad, war in Liberia, police action in South Africa, and so on.
We obtain more detail about the situation in Liberia by viewing the titles of the articles contained in that cluster (figure 5).
Here we present a quadratic time greedy global agglomerative clustering algorithm which has given good results in our implementation. This is similar to the algorithm presented in [3].
Let
be a document group. The average similarity between
any two documents in
is defined to be
Let G be a set of disjoint document groups. The basic iteration of
group average agglomerative clustering finds the two different
clusters
and
which maximize
over all choices
from G.
A new, smaller, partition G' is then constructed by merging
with
.
Initially, G is simply a set of singleton groups, one for each individual to be clustered. The iteration terminates when |G'| = k. Note that the output from this procedure is the final flat partition G', rather than a nested hierarchy of partitions, although the latter could be computed by recording each pairwise join as one level in a dendrogram.
If we employ the cosine similarity measure, the inner maximization
can be significantly accelerated. Recall that
is
the unnormalized sum profile associated with
.
Then the average pairwise similarity,
, is simply related
to the inner product,
. That is, since
Similarly, for the union of two disjoint groups,
where
Therefore, if for every
,
and
are
known, the pairwise merge that will produce the least decrease in average
similarity can be cheaply updated each time a merge is performed.
Further, suppose for every
the
were known such that
then finding the best pair would simply involve scanning the |G|
candidates. Updating these quantities with each iteration is
straightforward, since only those involving
and
need be recomputed.
Using techniques such as these, it can be seen that the average time
complexity for truncated group average agglomerative clustering is
where n is equal to the number of individuals to be
clustered.