Textual Disambiguation – Just a fancy word? (Part 1)

NOTE: This was from my old A&L Enterprises blog – but I thought it was interesting…

Recently I had the privilege of attending a presentation by Bill Inmon at the local KC DAMA Day. The focus of his presentation was on gaining business data from unstructured data – which lies at the heart of much of the “Big Data” craze. A common quote in the Big Data realm is that 80% of the world’s data is unstructured.

That word – unstructured – is one that bothers me – as much of the data we’re talking about does have a structure. Often what we’re describing with the term “unstructured” is the free-form text within that data. Tweets, Facebook messages, e-mails, texts, etc. – do have a structure. At a minimum they have some metadata wrapped around them (such as e-mail headers). In other cases, such as a Twitter message, there is a lot of potential data beyond just the “Tweet” itself. If you look at the format of a Twitter message there is information about the user, a time-stamp, reply-to information, and possibly location information.

I think often what we really are saying with the term “unstructured” is the free-form text within these data sources. It’s the text of a document, the body of the e-mail, the “tweet” itself, notes and comments. We can’t use traditional techniques on this data for a variety of reasons – the most prominent being that most tools rely on a structure.

One of the main points Bill Inmon made was that when processing this “text” data we lack context. If you think about how we often process data, we have some form of context:

  • A relational database relies on a defined schema – so you know the column name and its characteristics
  • XML data is self-describing (it carries its own context) [<myfield>abc</myfield>]
  • Flat files are often either delimited or position-based – which implies a context by an external definition
  • JSON data – like the tweet – is also self-describing [“coordinates”: null, “created_at”: “Thu Oct 21 16:02:46 +0000 2010”]
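To make the list above a little more concrete, here is a minimal Python sketch of where the context comes from in each case. The XML and JSON fragments are the ones quoted above; the flat-file layout and record are made up by me purely for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML: the tag name travels with the value, so the context is in the data.
xml_field = ET.fromstring("<myfield>abc</myfield>")
xml_context = {xml_field.tag: xml_field.text}

# JSON: every value is paired with a key, as in the tweet fragment.
tweet = json.loads(
    '{"coordinates": null, "created_at": "Thu Oct 21 16:02:46 +0000 2010"}'
)

# Position-based flat file: the context lives in an external definition.
# LAYOUT and record are hypothetical, not taken from any real system.
LAYOUT = {"name": (0, 10), "state": (10, 12)}
record = "Jane Doe  TX"
flat_context = {
    field: record[start:end].strip() for field, (start, end) in LAYOUT.items()
}

# Free-form text: nothing in the data says what the words refer to.
free_text = "She's hot"

print(xml_context)
print(tweet["created_at"])
print(flat_context)
```

In the first three cases a program can ask "what is this value?" and get an answer from a tag, a key, or a layout. With the free-form string there is nothing to ask.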

Context is vitally important to drawing useful conclusions – as language is often imprecise. Much of what Google does today is to try to guess what your intent is – using the context of your previous behavior (try searching for something in Incognito mode in Google Chrome vs. searching normally). Amazon, Facebook, Netflix, etc. depend on delivering a solution based on the context of your past behavior in conjunction with other users’ behavior.

Let’s take an example right from Bill [he has a website – http://forestrimtech.com/ – where you can learn more – including downloading white papers]: “She’s hot”. OK – so what does this mean? Does it mean she is very attractive? Does it mean she is running a fever? Does it mean she’s sweating heavily due to the heat?

So how could we know what “She’s hot” actually means? What if this was mentioned in San Antonio during the heat of the summer? What if it was said by a young male around other young males? What if it was spoken by a concerned parent at a doctor’s office?

What we are doing is adding a context around “She’s hot” in order to understand what it means. Apart from that context “She’s hot” doesn’t have a clear meaning and could easily lead to wrong conclusions. [I personally remember an example years ago on a bus, probably in Junior High, where a young girl said “I’m hot”. Shortly after she said it she clarified that it was temperature hot – as many people were thinking she was complimenting herself.]
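As a toy illustration of the idea (this is my own sketch, not Bill Inmon's method or anything from Textual ETL), the Python below picks a meaning for “She’s hot” from context attached to the text. The context keys (`setting`, `speaker`) and the rules are hypothetical, chosen only to mirror the scenarios above.

```python
# Toy sketch: resolve an ambiguous phrase using context metadata.
# The context keys and the rules are hypothetical, for illustration only.

def disambiguate_hot(context: dict) -> str:
    """Guess what "She's hot" means from the context around it."""
    setting = context.get("setting")
    if setting == "doctor's office":
        return "running a fever"
    if setting == "San Antonio in summer":
        return "overheated from the weather"
    if context.get("speaker") == "young male among peers":
        return "very attractive"
    # Without usable context we should not guess.
    return "ambiguous"


print(disambiguate_hot({"setting": "doctor's office"}))
print(disambiguate_hot({"speaker": "young male among peers"}))
print(disambiguate_hot({}))
```

The point is not the rules themselves but that the same three words map to different meanings depending on what surrounds them, and to no reliable meaning when nothing does.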

So in one sense we have this jumbled mess of text – without any organization. But in another sense there is structure to derive from this text – which provides us the missing context. Documents have some type of structure (especially legal documents like contracts) – even if it wasn’t planned that way. Sentences have structure – anyone remember diagramming a sentence?

Why do we care?

One simple reason we should care is that 80% of corporate data is unstructured – vs. 20% that is structured. Yet only 1-2% of corporate decisions are based on that unstructured 80% (vs. 98% on the 20% that is structured). So this is potentially an untapped reservoir of information to make better decisions. There are business opportunities – both internal and external – lurking in this untapped data resource. We will have to learn and implement new techniques to utilize this data – but the clear trend is in that direction.

In my next post we will look in more detail at Bill Inmon’s Textual Disambiguation concept itself (which he has commercialized as Textual ETL). In a later post we’ll look at applying this at a document level – instead of just at a sentence/paragraph level.
