Here we show that Scatter/Gather text clustering does a reasonably good job at organizing the documents into meaningful themes or topics.
We ask Scatter/Gather to place the 250 documents into 5 groups. Here is the result. (Bear in mind that encyclopedia articles are well written and uniform in format. The next example shows the results of a more complicated query on a more unruly text collection.)
Shown here are the clusters' sizes (how many documents they contain), a list of topical terms, and a list of document titles. One can see from the topical terms of Cluster 1 that this cluster contains documents that involve stars as symbols, as in military rank and patriotic songs.
Cluster 2 has 68 documents that appear mainly to be about movie and TV stars.
Cluster 3 contains 97 documents that have to do with aspects of astrophysics.
Cluster 4 contains 67 documents also about astronomy and astrophysics. This cluster contains many articles about people who are astronomers (this is apparent when the list is scrolled down).
Cluster 5 contains all the articles that discuss animals or plants and that happen to contain the word "star" (for example, "star fish").
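The kind of summary described above (cluster size, topical terms, document titles) can be sketched in a few lines. This is a minimal illustration, not the Scatter/Gather implementation: it assumes a plain k-means-style loop over bag-of-words vectors with cosine similarity, seeds the clusters with the first k documents purely for determinism, and takes the highest-weighted centroid terms as the "topical terms". The function names and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts scaled to unit length (text must be non-empty)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def cosine(a, b):
    """Dot product of two sparse unit-length vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {term: w / norm for term, w in total.items()}

def cluster(docs, k, rounds=10):
    """Partition (title, text) pairs into k groups.

    Assumption: seeds are the first k documents, an arbitrary choice
    that keeps the sketch deterministic.
    """
    vectors = [vectorize(text) for _, text in docs]
    centroids = vectors[:k]
    assignment = None
    for _ in range(rounds):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the partition is stable
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    groups = [[] for _ in range(k)]
    for doc, a in zip(docs, assignment):
        groups[a].append(doc)
    return groups, centroids

def summarize(groups, centroids, n_terms=3):
    """Per cluster: its size, its heaviest centroid terms, and its titles."""
    summaries = []
    for group, cen in zip(groups, centroids):
        terms = sorted(cen, key=cen.get, reverse=True)[:n_terms]
        summaries.append({
            "size": len(group),
            "terms": terms,
            "titles": [title for title, _ in group],
        })
    return summaries

# Toy stand-ins for encyclopedia articles retrieved for the query "star".
docs = [
    ("Supernova", "star explosion galaxy astronomy star"),
    ("Film actor", "star movie actor film movie"),
    ("Nebula", "galaxy star astronomy gas"),
    ("Actress", "movie star actress film"),
]
groups, centroids = cluster(docs, k=2)
summaries = summarize(groups, centroids)
for number, summary in enumerate(summaries, start=1):
    print(f"Cluster {number}:", summary)
```

A real system would also need stop-word removal and term weighting before the topical terms become informative; here the shared query word "star" shows up in every cluster's term list, which hints at why summaries can be hard to tell apart.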
If we ask Scatter/Gather to re-cluster the 68 documents that appear in Cluster 2, the one that discusses movie and TV stars, and place the results into three clusters, we see the following clusters:
This re-clustering reveals that the cluster actually held more kinds of documents than its topical terms had led us to think. These three clusters can be rather neatly summarized as containing articles about (Cluster 1) people who are sports stars, (Cluster 2) stars of film, TV, and theatre, and (Cluster 3) musicians.
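The drill-down step just described, selecting one cluster and splitting its documents again, can be sketched as two small functions. This is only an outline of the control flow under stated assumptions: `gather`, `scatter`, and `round_robin` are hypothetical names, the document titles are invented, and `round_robin` is a placeholder that deals documents into piles rather than a real clustering routine.

```python
def gather(clusters, selected):
    """Merge the user-selected clusters into one sub-collection."""
    return [doc for index in selected for doc in clusters[index]]

def scatter(documents, k, cluster_fn):
    """Split a sub-collection into k new clusters using cluster_fn."""
    if not 1 <= k <= len(documents):
        raise ValueError("k must be between 1 and the number of documents")
    return cluster_fn(documents, k)

def round_robin(documents, k):
    """Placeholder clusterer (hypothetical): deal documents into k piles in turn."""
    return [documents[i::k] for i in range(k)]

# Invented titles standing in for a top-level clustering.
top_level = [
    ["Flag", "Rank insignia"],
    ["Actor A", "Pitcher B", "Singer C"],
    ["Supernova"],
]

# Zero-based index 1 is the second cluster; re-cluster it into three groups.
subcollection = gather(top_level, selected=[1])
subclusters = scatter(subcollection, 3, round_robin)
print(subclusters)
```

Passing the clustering routine in as `cluster_fn` keeps the gather and scatter steps independent of whichever algorithm does the grouping.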
Now if we back up a step and re-cluster Cluster 3 from the original set, placing the results into four clusters, we see the following:
The contents of these four clusters can be glossed as general astrophysics, galaxies and stars, constellations, and a cluster of leftover, or outlying, documents.
This example suggests the potential power of the system for automatically grouping documents according to themes. It also shows some issues that remain to be addressed. First, we need to determine automatically the best number of clusters at each phase. Currently the user decides how many clusters to show for each document subcollection; we are working on making this choice automatically, based on the characteristics of the subcollection. Second, the summary is sometimes misleading or incomplete about which documents are to be found in the cluster. We saw this with the cluster about film and TV stars: it also contained documents about sports and music stars, although these were in the minority. We are working on how to indicate to the user when there are hidden topic areas in the cluster.
Click here for another example on a more complex query.