Practical Use of Semantic Search
Focus On The Hay, Not The Needle
The needle-in-a-haystack analogy for data search is not only tired, it is fundamentally flawed. The analogy implies that what you are looking for is a single ‘thing’, or a small number of distinct ‘things’, that are of value, and that the remainder is not. In reality, and especially in enterprise information management, you are looking at a continuous range: documents lie along a spectrum of relevance for any particular topic. There is rarely a bright line between what we are looking for and what we are not. Adding to the challenge, the difference between what we want to find and what we don’t care about is often subtle and buried within the context of the documents.
Not Quite Total Recall
What seems counter-intuitive is that you are much better served, initially, to try to find the things you know you don’t want. The reason, as I’ll explain below, is tied to recall and precision. Recall describes how effective a search is at finding all the targeted documents or records. For example, if you had 100,000 documents in a collection, 1,100 of which related to an issue you wanted to search for, and your search results returned 800 of those 1,100, then your recall would be 73% (800 ÷ 1,100). Recall is a very important concept in information management. When you search Google you really don’t care about recall, and for the most part neither does Google. If you search for “how long do zebras live?” you don’t care about finding every document on the web that relates to that question; you just want a few good examples to pick from. On the other hand, if you are investigating pricing irregularities at your organization, you really want to see all the documents that might be relevant. Google can have a recall of 60% and you can be very happy, but relying on 60% recall for your enterprise search can be very unhealthy for one’s career.
Precision can be seen as the Yin to recall’s Yang. Precision is a measure of the quality of your search results, expressed as “what percent of the search results are actually what I wanted?” If your precision is low, it means that your search returned a lot of things that are irrelevant to what you want. Conversely, high precision means that of the results returned by your search, most are spot on topic. In the example above, if your search with 73% recall had 60% precision, it would have returned 1,333 documents, of which 800 were actually relevant (800 ÷ 1,333 ≈ 60%). We unconsciously play a balancing game of recall and precision when doing web searches every day. If I’m looking to buy a new Jeep Wrangler, I search for “Jeep Wrangler”, but that gives me all kinds of results I don’t want. So I increase the precision by adding more terms like “used” or “2 door”. Adding specificity increases the precision of my search but also reduces the recall: there may be “previously owned” Jeeps that do not have the word “used” in the description, so increasing my precision is reducing my recall.
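The arithmetic behind these two measures is simple enough to sketch. The numbers below are the hypothetical 100,000-document collection from the example above:

```python
def recall(relevant_found: int, total_relevant: int) -> float:
    """Fraction of all relevant documents the search actually returned."""
    return relevant_found / total_relevant

def precision(relevant_found: int, total_returned: int) -> float:
    """Fraction of the returned documents that were actually relevant."""
    return relevant_found / total_returned

# Worked example: 1,100 relevant documents exist; the search returns
# 1,333 documents, of which 800 are relevant.
print(round(recall(800, 1_100), 2))     # 0.73
print(round(precision(800, 1_333), 2))  # 0.6
```

Notice the tension: tightening a query (adding “used”, “2 door”) shrinks `total_returned`, which tends to raise precision, but it usually shrinks `relevant_found` too, which lowers recall.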
Without going into details, I can assure you that no technology for searching unstructured data like emails or documents has 100% recall and 100% precision. In fact, recall of 50% or higher is very good for traditional text searches, and precision of those same searches is rarely over 50%. So if you are trying to find your ‘needle in a haystack’, you are destined to miss half the needles and to get a whole lot of hay with the needles you do find.
Culling Your Way To Shorter Timelines and Reduced Cost
The answer is to look for the hay. By searching for things you know are not relevant, you negate the recall problem. If you are looking for things you do not want, essentially what you want to cull, then it doesn’t matter if you only find 50, 60 or 70 percent of them, because you only risk leaving a few extra documents in your collection. As you iterate through different content areas you know are not relevant, you delete them from your collection. With each iteration, the remaining population becomes more concentrated with the documents you do want to isolate. This provides two very real-world benefits:
- Reducing your collection saves you time, reduces resource demands and generally makes all downstream processes faster and cheaper. This is a recurring theme in information governance and e-discovery: the further upstream you start attacking data size, the greater the cost reductions you can realize, and
- All search and analysis technologies work better in a ‘target-rich’ environment. Reducing the noise in the population via culling allows all downstream searches and analysis to deliver much richer results.
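To see why imperfect recall stops mattering when you invert the search, here is a toy simulation. Everything in it is hypothetical: it reuses the 100,000-document collection from earlier and assumes each culling pass is a mediocre 60%-recall search for unwanted content.

```python
import random

random.seed(7)

# Hypothetical collection: 100,000 docs, 1,100 of them actually relevant.
collection = (["relevant"] * 1_100) + (["noise"] * 98_900)
random.shuffle(collection)

def cull_noise(docs, cull_recall=0.6):
    """One culling pass: a search for unwanted content that only finds
    60% of the noise it targets. Missed noise simply stays behind."""
    kept = []
    for doc in docs:
        if doc == "noise" and random.random() < cull_recall:
            continue  # identified as unwanted and deleted
        kept.append(doc)
    return kept

docs = collection
for _ in range(3):  # three passes over different irrelevant content areas
    docs = cull_noise(docs)

density = docs.count("relevant") / len(docs)
print(len(docs), round(density, 3))
```

Even with each pass missing 40% of its target, three passes leave roughly 98,900 × 0.4³ ≈ 6,300 noise documents, so the relevant share of the collection climbs from about 1% to well over 10%, and no relevant document was put at risk.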
But how exactly do you search for things you don’t want, when you don’t know what those things might be? The answer lies in semantic, or concept, search. A robust semantic search engine can naively (meaning without training) crawl your data and group documents by concept or topic. You’ll be able to identify meeting reminders, travel alerts, commercial advertisements and various other documents and records that are clearly irrelevant to your investigation, and remove that noise. Most importantly, you can do it in bulk, without putting eyes on every document, because of the power of the technology and because you are inverting the search process to target the unwanted items. The net result is a machine-assisted way to radically streamline workflows and information management tasks. For high-cost processes like e-discovery, it is not uncommon for this simple change in workflow and technology to yield 60-80% reductions in document volumes and similar levels of cost reduction. I have worked on more than one project that has seen seven-figure cost savings from purposefully integrating semantic search into early data analysis.
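Real concept-search engines are far more sophisticated than this, typically building semantic models of the text rather than matching raw keywords, but a toy sketch conveys the grouping idea. The sample documents and the keyword-overlap (cosine) similarity below are purely illustrative stand-ins:

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    overlap = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return overlap / norm if norm else 0.0

def group_by_topic(docs, threshold=0.3):
    """Greedy untrained grouping: each doc joins the first group whose
    seed document it resembles, otherwise it starts a new group."""
    groups = []  # list of (seed_vector, member_docs)
    for doc in docs:
        vec = bow(doc)
        for seed, members in groups:
            if cosine(vec, seed) >= threshold:
                members.append(doc)
                break
        else:
            groups.append((vec, [doc]))
    return [members for _, members in groups]

docs = [
    "reminder weekly status meeting room 4b",
    "reminder project status meeting moved to friday",
    "flight delay alert travel itinerary updated",
    "pricing discussion for the northeast region contracts",
]
for group in group_by_topic(docs):
    print(group)
```

The two meeting reminders cluster together while the travel alert and the pricing discussion each stand alone, which is the point of the workflow: a reviewer can glance at the “meeting reminder” group once and cull it in bulk, never opening each message.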
So concept search is not just a shiny new technology; it has real-world applications that reduce project and operational costs across the whole spectrum of information management. The very early days of concept search spawned all manner of claims, but the best tools out there today require no training or configuration, will scale to millions of records, and can easily pay for themselves in cost avoidance within a single year.