NOTE: This was from my old A&L Enterprises blog – but I thought it was interesting…
In my last post I discussed applying techniques of Textual Disambiguation at the document level, as opposed to the paragraph or sentence level. Overall I’ve covered quite a few techniques that Bill Inmon shared – hopefully with meaningful examples. Before I summarize my thoughts and compare them to other techniques, I wanted to share two more examples I found in my presentation notes (which he didn’t directly cover):
- Analyzing Customer Feedback in the Airline Environment
- The example was fragments of text from customers that provided feedback about their experiences with an airline.
- One analysis that can be done is gauging the “tone” of the customer (e.g., “I think your airline is the best that I have flown on” vs. “Plane was late. Messed up my schedule. Airline sucks.”).
- You can extract cities (to and from) as well as names of other airlines.
- He had examples in other languages – specifically Spanish and French. NOTE: The “stop words” concept applies to many Latin-based languages.
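A minimal sketch of this kind of feedback analysis (my own illustration, not Inmon's actual implementation) might score tone with simple word lists and pull out known city and airline names. The word lists and entity sets here are assumptions for demonstration:

```python
# Toy tone/entity analysis of airline feedback fragments.
# Word lists and entity sets are illustrative, not from Inmon's talk.
POSITIVE = {"best", "great", "friendly", "comfortable"}
NEGATIVE = {"late", "sucks", "terrible", "messed"}
CITIES = {"denver", "chicago", "atlanta"}
AIRLINES = {"united", "delta", "southwest"}

def analyze(feedback: str) -> dict:
    # Normalize: strip punctuation and lowercase each word.
    words = [w.strip(".,!?\u201c\u201d\"").lower() for w in feedback.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "tone": "positive" if score > 0 else "negative" if score < 0 else "neutral",
        "cities": sorted(set(words) & CITIES),
        "airlines": sorted(set(words) & AIRLINES),
    }

print(analyze("Plane was late. Messed up my schedule. Airline sucks."))
```

A real system would need stemming, negation handling, and per-language stop-word lists, but even this crude pass separates the two example quotes above.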
- Automating Extraction of Data from Raw Reports
- The first step Bill Inmon recommends is to strip the metadata from the report.
- This metadata often takes the form of labels on the report, which also exist in a hierarchy that needs to be retained.
- From those labels he creates a “Metadata Template” – so that the report can be broken down further into its data aspects.
- That “Metadata Template” can then be applied to the report itself to generate a list of the data values for a given piece of metadata (which can repeat).
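Reading between the lines, the template step could be as simple as a mapping from report labels to field names, applied line by line. This is my own illustrative sketch under that assumption (the labels and report format are hypothetical, not from Inmon's software):

```python
# Illustrative "Metadata Template": map report labels to field names,
# then strip the labels and collect the (possibly repeating) data values.
TEMPLATE = {
    "Well Name:": "well_name",
    "Depth:": "depth",
    "Operator:": "operator",
}

def apply_template(report_lines):
    records = []
    for line in report_lines:
        for label, field in TEMPLATE.items():
            if line.strip().startswith(label):
                value = line.strip()[len(label):].strip()
                records.append((field, value))
    return records

report = [
    "Well Name: Alpha-1",
    "Depth: 8200 ft",
    "Well Name: Beta-7",
    "Depth: 6100 ft",
]
print(apply_template(report))
```

Note that `well_name` appears twice in the output, one row per occurrence, which is the repetition Inmon describes; a fuller version would also carry the label hierarchy along with each value.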
Overall I found the concepts that Bill Inmon presented to be of great value and insight. He has spent years thinking about this – as he realized that the next focus would be on the “unstructured” data within our enterprises. I’m not sure that I agree with his assertion that most vendors aren’t thinking about this – but I do believe he has a well thought out process. The key aspect he continued to communicate was that context is needed in order to perform detailed analysis of free-form text.
His techniques are clearly focused on taking free-form text and resolving it to a traditional relational format. He does this by adding context to the meaningful elements of the text so that they can fit into that structure. By doing so, traditional query logic can be applied and the understanding of the data is much richer (Bob Jones, the pilot, is mentioned 6 times in the reports; how many times is an animal mentioned; and so on). I agree with his assertion that there is a lot of potential business value to be found in this unstructured data – so focus needs to be spent on it.
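To make that concrete: once the text has been resolved to relational rows, “how many times is Bob Jones mentioned” becomes an ordinary SQL query. A minimal sketch using sqlite3 (the schema and sample rows are mine, for illustration only):

```python
import sqlite3

# Once free-form text is resolved to relational rows, counting mentions
# is plain SQL. The schema and sample data here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (report_id INTEGER, entity TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?)",
    [
        (1, "Bob Jones", "pilot"),
        (2, "Bob Jones", "pilot"),
        (2, "deer", "animal"),
    ],
)
count = conn.execute(
    "SELECT COUNT(*) FROM mentions WHERE entity = ?", ("Bob Jones",)
).fetchone()[0]
print(count)  # 2 mentions in the sample rows
```

The hard part, of course, is the disambiguation that fills the table; the querying afterward is the easy, familiar part, which is exactly the appeal of Inmon's approach.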
He does correctly point out that many of the tools in the Big Data space do not address this challenge – as they either assume a structure (Hive) or ignore structure (Pig, MapReduce). Re-creating structure from that free-form text is a different exercise that requires careful planning and effort.
I don’t agree with his assertion that other vendors are not supplying any solutions for free-form text. IBM has a whole software solution built around text analytics. They have created a new query language called AQL (Annotation Query Language) to drive their text analytics engine. While I agree that most vendors (Greenplum, Oracle, IBM, etc.) are focused more on infrastructure than software solutions, they are addressing it. What may confuse the issue is that many Big Data use cases actually use structured/semi-structured data – not unstructured data.
The other aspect to be considered is that you don’t necessarily need to deeply analyze and structure free-form text in order to gain valuable insight from it. A simple case could be sentiment analysis of tweets that mention your company. If I scan through the text of tweets looking for my company name and, for each tweet found, look for sentiment words (great, terrible, always, never…), that can provide a rough gauge of how a company is doing. You can do similar things to understand how often a topic is mentioned (even using known taxonomies to increase the quality of that analysis).
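The scan described above fits in a few lines. A rough sketch, where the company name and sentiment word lists are placeholders of my own choosing:

```python
# Rough sentiment gauge: find tweets mentioning the company, then tally
# sentiment words in just those tweets. All word lists are illustrative.
COMPANY = "acme"
POSITIVE = {"great", "love", "always"}
NEGATIVE = {"terrible", "never", "worst"}

def rough_sentiment(tweets):
    pos = neg = 0
    for tweet in tweets:
        words = tweet.lower().split()
        if COMPANY in words:
            pos += sum(w in POSITIVE for w in words)
            neg += sum(w in NEGATIVE for w in words)
    return pos - neg  # > 0 leans positive, < 0 leans negative

tweets = [
    "acme support is great",
    "acme is terrible never again",
    "unrelated chatter",
]
print(rough_sentiment(tweets))
```

This misses sarcasm, negation, and misspellings entirely – and that is the point: a deliberately rough measure that still yields a usable signal without Textual Disambiguation’s level of effort.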
For geeks like me, we often want a near perfect solution – one that addresses all the edge cases. Many people in business simply don’t care – they focus on the core cases. This is an area where we can ask ourselves – is it good enough? Can I, without the deep structure that Textual Disambiguation provides, get my answers without the same level of effort? I think in many cases we are learning that we can – that the rough answer is good enough for our purposes.
I think any company wanting to derive value from “unstructured data” should carefully consider Bill Inmon’s approach. I think many IT professionals (especially “Data” people) need to understand this – as it is a paradigm shift from what we’re used to dealing with. He does identify a hole in our tendencies – as we often assume that our data will be structured in some way (even when it isn’t). He has some very intelligent ways of addressing these needs – including working software. Overall I was very glad that I attended this talk as it gave me lots to think about and process through.
Finally I dare you to say “Textual Disambiguation” 5 times while rubbing your stomach and patting your head.