NOTE: This was from my old A&L Enterprises blog – but I thought it was interesting…
In my previous post I introduced the concept of applying context to what is unstructured text. Often when we talk about Big Data and/or unstructured data we are really talking about the free-form text within it.
As I previously mentioned I attended a seminar where Bill Inmon taught about “Textual Disambiguation” (quite a mouthful). His concept is that we need to apply context to that text in order to analyze it effectively. He has some specific principles on how to draw that context out – which he believes are still unique within the industry (he believes that many vendors are ignoring this issue).
The heart of his process is to work through text in order to discover context – and therefore derive meaning. His company’s software (which he didn’t dwell on in his talk, as it was NOT a sales presentation) starts with free-form text and ends up with a structured analysis of the words. He was very clear that this is different from NLP (Natural Language Processing), many text mining techniques, MapReduce, Hive, Pig, etc.
I don’t have a copy of his presentation of this – but I will share what I learned using my own examples. Here is the list of techniques (probably not a complete list) to add context back into unstructured text:
- Remove Stop Words
- Correct Misspellings
- Word Stemming
- Standardize Dates
- Document metadata
- Taxonomy / Ontology
- Proximity Analysis
- Number Patterns (SSN, Phone Number)
- Numeric Value Tagging
- Date Naming
To make things interesting I’m going to use some Enron e-mail fragments to demonstrate some of these techniques. You can download the whole set at https://www.cs.cmu.edu/~enron/
Remove Stop Words
- Stop words are basically the words that connect the words of meaning. These are typically words like “the”, “is”, “at”, “which”, etc.
- While they are part of the language, they really don’t have any value outside of the sentence.
- Below is a paragraph from an Enron e-mail where I struck through the stop words based on this list I found.
- This technique is also used for search engine indexing – as these words wouldn’t help you locate the page you want.
Mr Lay -
If you really think that this sale creates a "great opportunity" for shareholders then you are more out of touch with reality than I previously thought (unless you were referring to Dynegy shareholders). Under your "leadership" the shareholders have been devastated, employees have lost their retirements, college funds have been desiminated and reputations have been ruined, including your own.
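Here is a minimal sketch of stop-word removal in Python. The `STOP_WORDS` set is a tiny hand-picked assumption (not the list I linked to above), and `remove_stop_words` is just a name I made up for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists run to hundreds of words).
STOP_WORDS = {"the", "is", "at", "which", "a", "an", "of", "to", "and", "that",
              "this", "for", "with", "you", "your", "have", "been", "are",
              "than", "then", "if", "i"}

def remove_stop_words(text):
    """Return the words of `text` that are not in STOP_WORDS, in original order."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

sample = "Under your leadership the shareholders have been devastated"
print(remove_stop_words(sample))
# ['Under', 'leadership', 'shareholders', 'devastated']
```

What is left is roughly the set of words that carry the meaning of the sentence.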
Correct Misspellings
- A simple technique is to correct misspellings of words in the text.
- By doing so we can match words together that we otherwise couldn’t (how many times is x mentioned?).
- I suspect this requires some sophisticated logic to perform correctly – but I did find this list from the Oxford Dictionary.
- I took a snippet of an e-mail and deliberately misspelled some words as an example.
We spent quite a bit of time over the past several months discussing a possible minority investiment of about $5MM in Silicon Energy. We have broken off those discussions because (1) their proposed pre-money valuation of $150MM was, in our opinion, excesive, and (2) our people at EES, who would be the primary users of Silicon Energy, were not happy with Silicon Energy's functionalty.
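One simple way to approximate this, sketched below, is fuzzy matching against a list of known words using Python's standard `difflib` module. The `VOCABULARY` list and the `correct_word` helper are assumptions for illustration; a real corrector would need a full dictionary and probably the surrounding context as well.

```python
import difflib

# Hypothetical mini-vocabulary; a real system would use a full dictionary.
VOCABULARY = ["investment", "excessive", "functionality", "valuation", "discussions"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for misspelled in ["investiment", "excesive", "functionalty"]:
    print(misspelled, "->", correct_word(misspelled))
```

Words with no close match (for example a company name like "Silicon") are returned untouched, which keeps the sketch from "correcting" things it doesn't know about.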
Word Stemming
- Word stemming is a concept where we take a word and break it down into its root.
- For example – move, mover, moving – all share the stem “mov”.
- I found this page that suggests that there are algorithms that can be used to accomplish this.
Maureen Smith and Ruth Concannon have raised some issues regarding how Brooklyn Union Gas is in the books, specifically, the final contract year (11/1/03 - 10/31/04) is not in the books at all (which will produce a gain when booked), and the current deal structure is telescoped incorrectly at Transco Zones 1,3, & 4 when it should be telescoped 17% at Zone 1, 25% at Zone 2, and 58% at Zone 3. This will produce a slight loss when rebooked, but the gain from booking the final year is more than enough to offset the loss.
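To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm or any other published stemmer; the `SUFFIXES` tuple and `crude_stem` function are my own assumptions, just enough to show "move", "mover" and "moving" collapsing to one stem.

```python
# Suffixes tried in order, longest first; a crude assumption, not a real stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "e", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower

print([crude_stem(w) for w in ["move", "mover", "moving"]])
# ['mov', 'mov', 'mov']
print([crude_stem(w) for w in ["books", "booked", "booking"]])
# ['book', 'book', 'book']
```

In the e-mail above, this is what would let "books", "booked" and "booking" be counted as mentions of the same thing.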
Standardize Dates
- Dates come in many formats (05/29/2013, 29/05/2013, May 29, 2013…)
- For ease of comparison it is best to standardize them
- Below are some sentence fragments out of e-mails with different formats
Date: Thursday, January 24, 2002
at the time of their next board meeting on February 12, 2001.
Attached is a revised Credit Watch listing as of 4/09/01. Please note that there are 12 counterparty additions/revisions to the Watchlist for this week.
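A rough sketch of how the three formats above could be normalized with Python's `datetime.strptime`. The `FORMATS` list is an assumption: it treats numeric dates as US-style month-first and expects English month and day names.

```python
from datetime import datetime

# Candidate formats, assuming US-style month-first numeric dates.
FORMATS = ["%A, %B %d, %Y", "%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]

def standardize_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

for raw in ["Thursday, January 24, 2002", "February 12, 2001", "4/09/01"]:
    print(raw, "->", standardize_date(raw))
```

Note that a string like 05/06/2013 is genuinely ambiguous between May 6 and June 5; the sketch simply picks month-first, which is a choice you would have to make per data source.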
Document Metadata
- Many documents have metadata associated with them – including Word documents, e-mails, tweets, etc.
- This includes dates, information about the document itself and, in the case of e-mail, the from/to addresses.
- Here is an example of an e-mail header:
Date: Tue, 16 Oct 2001 14:41:10 -0700 (PDT)
From: email@example.com
To: firstname.lastname@example.org
Subject: Outlook Web Access for Calgary
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Enron Messaging Administration </O=ENRON/OU=NA/CN=RECIPIENTS/CN=NOTESADDR/CN=ENRON MESSAGING ADMINISTRATION>
X-To: OWA.Notification@enron.com
X-cc:
X-bcc:
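For e-mail, this metadata can be pulled out with Python's standard `email` package. The sketch below uses a shortened copy of the header above; the body line and the keys of the `metadata` dictionary are my own assumptions.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Shortened copy of the header above, with a placeholder body.
raw = (
    "Date: Tue, 16 Oct 2001 14:41:10 -0700 (PDT)\n"
    "From: email@example.com\n"
    "To: firstname.lastname@example.org\n"
    "Subject: Outlook Web Access for Calgary\n"
    "\n"
    "Body text would go here.\n"
)

msg = message_from_string(raw)
metadata = {
    "sent": parsedate_to_datetime(msg["Date"]).isoformat(),  # standardized timestamp
    "from": msg["From"],
    "to": msg["To"],
    "subject": msg["Subject"],
}
print(metadata)
```

Each of these fields becomes structured context that travels with the free-form body text.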
Taxonomy / Ontology
- One way to capture the relationships between words is to build a taxonomy or ontology
- This is typically a “type of” or “kind of” relationship. For example, dogs and cats are types of animals
- Therefore these words can be grouped together (even though they are in different parts of a document)
- If you look at the e-mail snippet below, the words “jet skis”, “boats”, and “catamaran” all refer to types of boats, which are a type of vehicle
This is the info.- Kim Hillis is making the reservations (I hope there are rooms available this late) A couple guys here have stayed there and said it's awesome. With a great beach, close to town and golf. There's jet skis and boats right there. I'm up for chartering a a catamaran for snorkeling and cruisin' all day.
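A taxonomy can be as simple as a lookup from each term to its parent category. The sketch below is a hand-built assumption (the `PARENT` table and `ancestors` function are hypothetical), using the examples from this section.

```python
# Hypothetical hand-built taxonomy: each term maps to its parent category.
PARENT = {
    "jet ski": "boat",
    "catamaran": "boat",
    "boat": "vehicle",
    "dog": "animal",
    "cat": "animal",
}

def ancestors(term):
    """Return the chain of categories above `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

print(ancestors("catamaran"))  # ['boat', 'vehicle']
print(ancestors("jet ski"))    # ['boat', 'vehicle']
```

Because "jet ski" and "catamaran" share the ancestor "boat", they can be grouped together even when they appear in different sentences.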
Proximity Analysis
- Another technique is to look at how close together words are (i.e. are they within a few words of each other, or paragraphs apart)
- This is a known process – used in search engines to some degree
- There is no guarantee this will add any value – but it could be useful in some cases
- Below is a pretty weak example, as I struggled to find a good one (I don’t have good notes on Bill Inmon’s example). Because “snow” is near “forecasts” it can create a connection to quality forecasts:
Cooper-- I have been looking at the chassis for the model we need to develop. it looks like we can expand (piggy back) the model AE uses for the Snow & water supply forecasts. They only have reports for March thru to August i.e. the period of the bulk of the runoff from snow and rain. We would have to add the remainder of the year as well as the power plants. Will fax ou a sketch of my proposed basic layout.
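One way to measure proximity, as a sketch: count how many word positions separate two terms. The `min_word_distance` function is my own illustration, not Bill Inmon's method, and it only handles single-word terms.

```python
import re

def min_word_distance(text, first, second):
    """Smallest number of word positions between `first` and `second`, or None if either is absent."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(i - j) for i in first_pos for j in second_pos)

sample = "expand the model AE uses for the Snow & water supply forecasts"
print(min_word_distance(sample, "snow", "forecasts"))  # 3
```

You would then pick a threshold (say, within five words) to decide whether two terms are "near" each other; that threshold is a judgment call per use case.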
Number Patterns (SSN, Phone Number)
- There are many common number patterns that can be identified – given how we typically format them.
- Phone numbers and SSNs are probably some of the most common examples – but each use case may have its own format(s)
Pursuant to your e-mail to Jeff Skilling dated February 20, 2001, Mr. Skilling suggests that you contact Mr. Jim Fallon regarding CAIS Internet Inc. Mr. Fallon is managing director of trading at Enron Broadband Services. He can be reached at 713.853.3354.
EEI member utilities wishing to have access should contact Lynn Hailes at: email@example.com or 202/508-5624.
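Regular expressions are a natural fit here. The sketch below assumes US formats only and covers the separators seen in the samples above ('.', '/' and '-'); the `PATTERNS` table and `find_number_patterns` function are hypothetical names.

```python
import re

# Assumed US formats only; separators seen in the samples are '.', '/' and '-'.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[./-]\d{3}[.-]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_number_patterns(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((m.start(), label, m.group()) for m in pattern.finditer(text))
    return [(label, value) for _, label, value in sorted(hits)]

sample = "He can be reached at 713.853.3354. Or call 202/508-5624."
print(find_number_patterns(sample))
# [('phone', '713.853.3354'), ('phone', '202/508-5624')]
```

Other patterns (account numbers, ZIP codes, and so on) would just be more entries in the table.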
Numeric Value Tagging
- Numeric values in a document don’t provide a lot of value without context
- Often, though, there is text adjacent to the number that indicates what type of number it is.
3. Book administrator rolls showing the following new deal P&L for the following new deals rolling up to the Executive DPR:
Deal #559092.1 on 3/23/01
Deal #514509.1 on 2/6/01
Deal #568025.1 on 4/2/01
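A sketch of how adjacent text can be used to tag a number. The cue patterns below are assumptions I picked from the e-mail fragments in this post ("Deal #", "$…MM", "%"); the tag names are hypothetical.

```python
import re

# Hypothetical cue patterns: text adjacent to a number that tells us what it is.
CUES = [
    ("deal_number", re.compile(r"Deal\s*#\s*(\d+(?:\.\d+)?)", re.IGNORECASE)),
    ("amount_usd_millions", re.compile(r"\$(\d+(?:\.\d+)?)MM\b")),
    ("percentage", re.compile(r"(\d+(?:\.\d+)?)%")),
]

def tag_numbers(text):
    """Return (tag, number_text) pairs for every number with a recognised cue."""
    tagged = []
    for tag, pattern in CUES:
        tagged.extend((tag, m.group(1)) for m in pattern.finditer(text))
    return tagged

sample = "Deal #559092.1 on 3/23/01 Deal #514509.1 on 2/6/01"
print(tag_numbers(sample))
# [('deal_number', '559092.1'), ('deal_number', '514509.1')]
```

Numbers with no recognised cue (like the dates here) are simply left for other techniques to handle.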
Date Naming
- While we can easily standardize dates, without any context they still don’t mean anything.
- We want to look for nearby words to provide that context – the name – of a date
- Below we have a date of February 12, 2001. There are two nearby sets of words that can provide context:
- effective ==> Effective Date of February 12, 2001
- board meeting ==> Board Meeting Date of February 12, 2001
It is my great pleasure to announce that the Board has accepted my recommendation to appoint Jeff Skilling as chief executive officer, effective at the time of their next board meeting on February 12, 2001. Jeff will also retain his duties as president and chief operating officer. I will continue as chairman of the Board and will remain at Enron, working with Jeff on the strategic direction of the company and our day-to-day global operations.
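Here is a sketch of the idea: find a date, then look in the text just before it for cue phrases that give the date a name. The `CUES` mapping, the 80-character window and the `name_dates` function are all assumptions for illustration.

```python
import re

DATE_RE = re.compile(r"(?:January|February|March|April|May|June|July|August|"
                     r"September|October|November|December) \d{1,2}, \d{4}")
# Hypothetical cue phrases mapped to the name they give a date.
CUES = {"effective": "Effective Date", "board meeting": "Board Meeting Date"}

def name_dates(text, window=80):
    """Pair each date with names suggested by cue phrases in the preceding `window` characters."""
    results = []
    for match in DATE_RE.finditer(text):
        context = text[max(0, match.start() - window):match.start()].lower()
        names = [name for cue, name in CUES.items() if cue in context]
        results.append((match.group(), names))
    return results

sample = ("appoint Jeff Skilling as chief executive officer, effective at the "
          "time of their next board meeting on February 12, 2001.")
print(name_dates(sample))
# [('February 12, 2001', ['Effective Date', 'Board Meeting Date'])]
```

This combines naturally with proximity analysis: the closer the cue is to the date, the more likely the name applies.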
Acronyms
- We often use a cryptic set of letters in place of longer phrases – for ease of use.
- However, these acronyms don’t have any real meaning on their own – the meaning lies in what they stand for.
If you are interested in the Live or Archived meetings of the Federal Communications Commission (FCC) or National Transportation Safety Board (NTSB), please contact us.
If you have any questions concerning the FERC Video Archives, please contact us at firstname.lastname@example.org or at 703-993-3100.
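When the text itself spells out the acronym, as in the first fragment above, the definition can be picked up automatically. This sketch (the `find_acronym_definitions` function is my own assumption) looks for a parenthesised acronym and checks that the initials of the preceding words spell it.

```python
import re

def find_acronym_definitions(text):
    """Map each parenthesised acronym to the preceding words whose initials spell it."""
    definitions = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            w[0].upper() == letter for w, letter in zip(candidate, acronym)
        ):
            definitions[acronym] = " ".join(candidate)
    return definitions

sample = ("meetings of the Federal Communications Commission (FCC) or "
          "National Transportation Safety Board (NTSB), please contact us.")
print(find_acronym_definitions(sample))
# {'FCC': 'Federal Communications Commission', 'NTSB': 'National Transportation Safety Board'}
```

An acronym that is never spelled out, like FERC in the second fragment, would need an external lookup table instead.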
I feel like I just gave a very brief summary of these techniques – and there is a lot more knowledge lurking out there. By no means is this an exhaustive or complete list – it is just a beginning. Again – Bill Inmon’s company website – forestrimtech.com – has a lot more information. I think the important part is to realize that you can discover context to wrap around what otherwise looks like meaningless text.
Next we’ll address this at the document level – instead of at the paragraph/sentence level.