We formulated a query containing the terms
and instructed the system to retrieve the 500 top-ranked documents according to a vector space weighting. Out of these 500 retrieved documents, only 21 had been judged relevant to the query by the TREC judges (some may not have been judged at all, but for the purposes of this example, those with no judgement are simply considered to be not relevant). These documents were not ranked especially high by the vector space measure: none of the documents judged relevant appeared in the top 10, only one appeared in the top 20, and only four appeared in the top 40.
A tool that can guide the user towards the relevant subgroups would indeed be useful. Here we show that the Scatter/Gather tool can be effective in this way. The system is instructed to gather the 500 documents into five clusters; below are shown the resulting sizes and topical terms:
Cluster 4 stands out for the purposes of the query in that it contains terms pertaining to fraud, investigation, lawyers, and courts. Note that in a general corpus these terms might not be descriptive for this query since the user would assume the documents were about legal issues in general. However, since we know the system has retrieved documents that also pertain to financial institutions, we can assume that the legal terms occur in the context of financial documents.
The topical terms for Cluster 5 are less compelling, for they have only one term corresponding to a crime and seem to clearly indicate documents discussing the scandal involving the leader of the Philippines in the late 1980s. The topical terms of Cluster 0 are very general (and there are only four documents in the cluster which can be quickly scanned). It appears that Cluster 1 contains very general documents that do not fit into any of the other clusters particularly well, whereas Cluster 5 contains documents that relate to a very specific allegation of fraud.
Cluster 2 is also compelling in that its summary contains many financial terms; however, it is less promising than Cluster 4 in that it seems more related to assets and risk assessment than criminal charges and failed banks. Finally, Cluster 3 seems related most strongly to rules and regulations, rather than indictments and fraud. Note again that this cluster, if taken out of context, might seem to refer to government regulations in general; however, since it was generated as the result of a query on financial terms, like Cluster 4, it most likely contains documents discussing rules and regulations on financial matters. A re-scattering of the cluster confirmed this suspicion.
Based on this assessment, Cluster 4 looks most promising. If the user re-scatters (or re-clusters) it, five new clusters are produced:
The user might choose the first, third and fourth clusters since their topical terms all seem to pertain to the topic of interest. Clusters 3 and 4 are especially compelling since they contain terms pertaining both to finance and to criminal proceedings. Cluster 3 has more terms about conviction, but Cluster 4 has more terms pertaining to failure and the kinds of financial institutions that the user may have known to have failed, namely S&L's and thrifts. As it turns out, the clusters' contents reflect these observations: Cluster 3 contains mainly articles about indictments pertaining to financial fraud involving securities and stocks, but not failed banks. The user can view the contents of a cluster in ranked order, according to the score generated by the similarity search, or can view the documents according to some other search tool. Based on the topical terms, the most promising-looking clusters are 1, 3, and 4.
It turns out that Cluster 1 has one relevant document out of four. The third cluster, Cluster 3, has one relevant document out of 28. Highly ranked documents in this group include indictments for other crimes, some of which are financial in nature:
Cluster 4 has eleven relevant documents out of 29:
The second cluster also contains two relevant documents. Most other documents in this cluster discuss indictments for money laundering along with one article involving Noriega and another on a teen scandal in San Francisco. The last cluster has no relevant documents although it has several that discuss the BCCI.
So it turns out that the top-level Cluster 4 contains 15 of the 21 relevant documents. The remaining 6 relevant documents are found exclusively in Cluster 2. When this cluster, which contains 187 documents, is scattered:
all six relevant documents appear in one cluster, Cluster 3, of size 88:
This example shows the tendency of relevant documents to clump into a few clusters. If users can determine which clusters are of the most interest, they will then look through a higher density of relevant documents than if the documents were simply shown in ranked order.
(This hypothesis has been supported with some experimental work; see the paper Re-examining the Cluster Hypothesis for more details.)
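The density claim can be checked against the counts reported in this example. The short Python sketch below only restates figures given in the text (21 relevant out of 500 retrieved, 11 out of 29 in the re-scattered Cluster 4, 6 out of 88 in the cluster obtained by scattering Cluster 2); the variable and function names are illustrative and are not part of the Scatter/Gather system.

```python
# Counts taken from the walkthrough above: (relevant, total) per document set.
document_sets = {
    "all 500 retrieved documents": (21, 500),
    "Cluster 4 after re-scattering top-level Cluster 4": (11, 29),
    "Cluster 3 after scattering top-level Cluster 2": (6, 88),
}


def density(relevant, total):
    """Fraction of documents in a set that were judged relevant."""
    return relevant / total


for name, (relevant, total) in document_sets.items():
    print(f"{name}: {relevant}/{total} = {density(relevant, total):.1%}")

# Relevant documents in the five subclusters of top-level Cluster 4,
# in the order first, second, third, fourth, last.
subcluster_relevant = [1, 2, 1, 11, 0]
print("relevant in top-level Cluster 4:", sum(subcluster_relevant))
```

Under these counts the ranked list as a whole is about 4% relevant, while the best second-level cluster is roughly 38% relevant, so a user who picks that cluster reads far fewer non-relevant documents per relevant one.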
The collection used here is the TREC/Tipster collection, provided by NIST. For more information, see Donna Harman, editor, Proceedings of the Third Text Retrieval Conference TREC-3, National Institute of Standards and Technology Special Publication 500-225, 1995.
In the vector space weighting
approach, championed by Gerry Salton, Chris Buckley, and other members
of the SMART group at Cornell, documents and queries are represented
as weighted vectors, where each word in the document corresponds to
one position in the vector. Documents are ranked according to a
normalized inner-product between the query and the document's
representation. Thus, not all the query terms need be present in the
document in order for the document to be retrieved, and a document
that has many instances of a highly weighted word might be ranked
higher than a document with a few instances of many of the words.
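As a rough illustration of this ranking scheme, the Python sketch below scores documents by the cosine (a normalized inner product) between sparse weighted vectors. The particular weighting used here, term frequency times log(1 + N/df), is an assumption chosen for brevity and is not necessarily the weighting used by SMART or by the system described on this page; the function names and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Weight = term frequency * log(1 + N / document frequency).
    This weighting is an illustrative assumption, not the SMART formula.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = [
        {t: count * math.log(1 + n_docs / doc_freq[t])
         for t, count in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, doc_freq, n_docs


def cosine(u, v):
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_terms, docs, top_k=500):
    """Return up to top_k (score, doc_index) pairs, best score first."""
    vectors, doc_freq, n_docs = build_vectors(docs)
    # Query terms that never occur in the collection contribute nothing.
    query = {t: math.log(1 + n_docs / doc_freq[t])
             for t in set(query_terms) if t in doc_freq}
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy collection (hypothetical): document 1 lacks "fraud" but is still scored.
docs = [
    ["bank", "fraud", "indictment"],
    ["bank", "bank", "bank", "loan"],
    ["weather", "report"],
]
ranking = rank(["bank", "fraud", "failure"], docs)
for score, index in ranking:
    print(index, round(score, 3))
```

In the toy collection, the second document receives a nonzero score although it matches only one query term, which mirrors the point that not every query term has to be present for a document to be retrieved.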
Xerox PARC
2/13/97