Web-based inference detection
Newly published data,
when combined with existing public knowledge, allows for complex and sometimes
unintended inferences. We propose semi-automated tools for detecting these
inferences prior to releasing data. Our tools give data owners a fuller
understanding of the implications of releasing data and help them adjust the
amount of data they release to avoid unwanted inferences.
Our tools first extract salient keywords from the private data intended for
release. Then, they issue search queries for documents that match subsets of
these keywords, within a reference corpus (such as the public Web) that
encapsulates as much relevant public knowledge as possible. Finally, our
tools parse the documents returned by the search queries for keywords not
present in the original private data. These additional keywords allow us to
automatically estimate the likelihood of certain inferences. Potentially
dangerous inferences are flagged for manual review.
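The three steps above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the reference corpus is a small in-memory list of strings standing in for the Web, keyword extraction is a plain frequency count, and the likelihood estimate is simply the fraction of retrieved documents that contain a new keyword. The function names, the stopword list, and the parameters top_k, subset_size and threshold are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "at", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(private_text, top_k=10):
    """Step 1: pick up to top_k keywords, most frequent first (ties broken alphabetically)."""
    counts = Counter(tokenize(private_text))
    return sorted(counts, key=lambda w: (-counts[w], w))[:top_k]


def search(corpus_tokens, query_terms):
    """Step 2: return indices of corpus documents that contain every query term."""
    query = set(query_terms)
    return [i for i, words in enumerate(corpus_tokens) if query <= words]


def detect_inferences(private_text, corpus, subset_size=2, threshold=0.5, top_k=10):
    """Step 3: score keywords that appear in retrieved documents but not in the
    private text, and return those whose score reaches the threshold."""
    private_words = set(tokenize(private_text))
    corpus_tokens = [set(tokenize(doc)) for doc in corpus]
    keywords = extract_keywords(private_text, top_k)

    # Query every keyword subset of the given size; collect distinct hits.
    retrieved = set()
    for subset in combinations(keywords, subset_size):
        retrieved.update(search(corpus_tokens, subset))
    if not retrieved:
        return {}

    # Count, for each new keyword, how many retrieved documents mention it.
    new_word_hits = Counter()
    for i in retrieved:
        new_word_hits.update(corpus_tokens[i] - private_words)

    scores = {w: n / len(retrieved) for w, n in new_word_hits.items()}
    return {w: s for w, s in scores.items() if s >= threshold}


# Toy example with fictional names and a four-document "Web".
private_text = "The author studied cryptography at Northfield and later founded Acme Robotics."
corpus = [
    "Jane Doe founded Acme Robotics after studying cryptography at Northfield.",
    "Acme Robotics, founded by Jane Doe, ships a new robot arm.",
    "Northfield hosts a cryptography workshop every spring.",
    "A recipe for lentil soup with cumin.",
]
flagged = detect_inferences(private_text, corpus)
print(sorted(flagged))  # candidates for manual review
```

In this toy corpus the words "jane" and "doe" appear in two of the three retrieved documents and are therefore flagged, which mirrors the re-identification scenario: the private text never names its author, yet the reference corpus makes the name easy to infer. A realistic version would replace the in-memory search with queries to a search engine and use a stronger keyword extractor.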
We call this new technology Web-based inference control. The paper reports on
two experiments that demonstrate early successes of this technology. The first
experiment shows the use of our tools to automatically estimate the risk that an
anonymous document allows for re-identification of its author. The second
experiment shows the use of our tools to detect the risk that a document is
linked to a sensitive topic. These experiments, while simple, capture the full
complexity of inference detection and illustrate the power of our approach.