Textual Disambiguation – working at the document level (Part 3)

NOTE: This was from my old A&L Enterprises blog – but I thought it was interesting…

In our last post we went over in detail techniques for adding context back into what is otherwise free-form text. Another level we can operate at is the document level, as opposed to just the paragraph or sentence level. There are two document types I’m going to review: contracts and e-mail.

Contracts

Below I have an example of the template of a mortgage – a very common contract. A completed contract would have the “….” filled in with actual data, so that it blends with the text around it. Obviously, if you knew the format of the contract ahead of time (such as the template below), you could parse out the terms using the words around them. For example, the text between “this” and “day” would be the day of the month. Ideally we would store the text of the contract along with the details that were entered, for easy use.
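As a rough sketch of the known-format case, a single regular expression can pull the terms out of the opening clause. The pattern and the group names (day, month, year, borrower) are my own assumptions for illustration, not something from the original material, and the sample text is a filled-in version of the first line of the template.

```python
import re

# Hypothetical pattern for the opening clause of the mortgage template.
# The group names are illustrative assumptions.
OPENING_CLAUSE = re.compile(
    r"made this\s+(?P<day>.+?)\s+day of\s+(?P<month>\w+)\s+(?P<year>\d{4}),\s+"
    r"between the Mortgagor,\s+(?P<borrower>.+?)\s+\(herein",
    re.IGNORECASE | re.DOTALL,
)


def parse_opening_clause(text):
    """Return the filled-in terms as a dict, or None if the text does not match."""
    match = OPENING_CLAUSE.search(text)
    return match.groupdict() if match else None


sample = (
    "This mortgage is made this 14th day of October 1998, between the "
    'Mortgagor, John William Smith (herein known as the "Borrower")'
)
terms = parse_opening_clause(sample)
```

This only works while the wording around each term stays fixed, which is exactly the assumption that breaks down once the contracts start to vary.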

But let’s take an example of a large volume of varying contracts, such that we can’t rely on a specific structure (even for the same lender/agent). We therefore need to apply some techniques to break the text down into meaningful data. One of the first steps is to remove the “stop” words (common words such as “this”, “is”, “the” and “of” that carry little meaning on their own) from the documents. For example, consider this opening clause:

This mortgage is made this 14th day of October 1998, between the Mortgagor, John William Smith (herein known as the "Borrower"), and the Mortgagee, Countrywide Financial a corporation organized and existing under the laws of California, whose address is 128 W. Absalom Rd, San Diego, California (herein "Lender").
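A minimal sketch of stop-word removal might look like the following. The stop-word list here is tiny and hand-picked purely for illustration; a real list would be much longer and tuned to the documents.

```python
import re

# Tiny, hand-picked stop-word list for illustration only (an assumption).
STOP_WORDS = {"a", "an", "and", "as", "is", "of", "the", "this", "under", "whose"}


def remove_stop_words(text):
    """Drop stop words, keeping the remaining tokens in their original order."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return [token for token in tokens if token.lower() not in STOP_WORDS]


sentence = "This mortgage is made this 14th day of October 1998"
kept = remove_stop_words(sentence)
# kept == ["mortgage", "made", "14th", "day", "October", "1998"]
```

Note that this tokenizer also throws away punctuation, which you may want to keep if commas and parentheses are going to serve as delimiters later on.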

The next thing we need to do is find the “delimiters” of the different terms in the document. If we had a CSV file, we would use the commas to know where a field begins and ends. Similarly, we need to identify the beginning and ending words that indicate where to find an item of meaning. In the example above (with the stop words removed), the text between “between Mortgagor,” and “(herein known” gives us the name of the Borrower. I suspect this would be an iterative process, as the documents can vary significantly (so you may end up with either many different algorithms or a few very smart ones).
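One way to sketch the delimiter idea is a table of start/end phrases per field, tried in order until one matches. The field names, the second borrower pair, the lender pair and the stop-word-stripped sample text are all assumptions made up for this illustration.

```python
import re

# Each field has a list of (start, end) delimiter pairs, tried in order.
# Only the first borrower pair comes from the example; the rest are assumed.
DELIMITERS = {
    "borrower": [
        ("between Mortgagor,", "(herein known"),
        ("between Mortgagor,", "(herein"),
    ],
    "lender": [
        ("Mortgagee,", "corporation organized"),
    ],
}


def extract_field(text, pairs):
    """Return the text between the first delimiter pair that matches, else None."""
    for start, end in pairs:
        pattern = re.escape(start) + r"\s*(.+?)\s*" + re.escape(end)
        match = re.search(pattern, text, re.DOTALL)
        if match:
            return match.group(1)
    return None


def extract_all(text):
    """Apply every field's delimiter pairs to the text."""
    return {field: extract_field(text, pairs) for field, pairs in DELIMITERS.items()}


stripped = (
    "mortgage made 14th day October 1998, between Mortgagor, John William Smith "
    '(herein known "Borrower"), Mortgagee, Countrywide Financial corporation '
    "organized existing laws California"
)
fields = extract_all(stripped)
```

Handling a new contract layout then means adding another pair to the list rather than writing a new parser, which fits the iterative process described above.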

Mortgage Template Example

THIS MORTGAGE is made this ........................ day of .............................
19 ....., between the Mortgagor, ................................................................ (herein
"Borrower"), and the Mortgagee, ...................................................... a corporation
organized and existing under the laws of ......................................
......................................................................................................, whose address is
..................................................................................
.......................................................................................................(herein "Lender").

WHEREAS, Borrower is indebted to Lender in the principal sum of ...................
Dollars, which indebtedness is evidenced by
Borrower's note dated ..................... (herein
"Note"), providing for monthly installments of principal and interest, with the balance of
the indebtedness, if not sooner paid, due and payable on .......................

TO SECURE to Lender (a) the repayment of the indebtedness evidenced by the Note,
with interest thereon, the payment of all other sums, with interest thereon, advanced in
accordance herewith to protect the security of this Mortgage, and the performance of the
covenants and agreements of Borrower herein contained, and (b) the repayment of any
future advances, with interest thereon, made to Borrower by Lender pursuant to
paragraph 21 hereof (herein "Future Advances"), Borrower does hereby mortgage, grant
and convey to Lender, with power of sale, the following described property located in the
County of .................. ......................................................................... ......... , State of
Massachusetts:

which has the address of .........................................................................
[Street] [City]
.................................................................... (herein "Property Address");
[State and Zip Code]

E-mail Filtering

E-mail is a good example of where a filtering process may need to be applied before a more advanced Textual Disambiguation (still a hard word to type) process. This is because of the large volume of e-mails, many of which don’t need to be looked at. Bill Inmon described three categories of e-mail:

  1. SPAM – this is externally generated content from outside the organization with little to no value (extraneous information).
  2. Blather – this is internally generated content (from your users) with little to no business value. Examples would be jokes, personal e-mails, broadcast e-mails, etc.
  3. Business Meaningful – the content you actually want to look at.

One example scenario he gave was applying analysis to current e-mails to detect and prevent potential new liabilities for a company (so they don’t end up like Enron). He quoted a figure of approximately $65 billion spent defending lawsuits in 2012. For this to be effective, the “SPAM” and “Blather” need to be filtered out so that time and energy are spent on e-mails with potential meaning.

The principal process he applied to this problem was building a “relevance” taxonomy: basically a list of words (using known taxonomies as a guide) used to filter out e-mails that aren’t of interest (i.e. if an e-mail doesn’t contain at least one of the words, it is filtered out).
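The relevance filter can be sketched as a simple set intersection between the words in an e-mail and the taxonomy. The terms below are placeholders I made up; a real taxonomy would be built from known industry taxonomies as described above.

```python
import re

# Hypothetical relevance taxonomy; these terms are placeholders.
RELEVANCE_TERMS = {"contract", "invoice", "shipment", "audit", "lawsuit"}


def tokenize(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def is_relevant(body, taxonomy=RELEVANCE_TERMS):
    """Keep an e-mail only if it contains at least one taxonomy word."""
    return not tokenize(body).isdisjoint(taxonomy)


emails = [
    "Lunch on Friday? I heard a great joke today.",
    "Please review the contract before the audit next week.",
]
kept = [body for body in emails if is_relevant(body)]
```

Here only the second e-mail survives the filter. Exact word matching is the simplest choice; it would miss plurals and other word forms unless the taxonomy lists them or stemming is added.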

The next principle is to use “words of concern” to identify the more significant e-mails. These are words chosen by people: the words a human reviewer searching the e-mail would want to see. Some examples are:

  • apologize
  • attorney
  • risk
  • ashamed
  • scandal

Additionally, the relevance of an e-mail can be refined by looking at how close these words of concern are to each other (proximity). Typically a “proximity boundary” is selected (in bytes) to determine how close the words need to be to increase the relevance of an e-mail. Once the list of e-mails is generated, there are some other techniques to narrow down the e-mails that should be looked at by a person:

  • Use header information – such as From and To, Dates, etc.
  • Use the number of words of concern found (i.e. the more words found, the more likely the e-mail needs to be looked at)
  • “Hot” words – words that attract the attention of a human when scanning e-mails.  Use these words to increase the relevance of those e-mails.
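To make the scoring ideas concrete, here is a minimal sketch that combines the count of words of concern, a proximity boundary measured in bytes, and hot words into one number. The hot-word list, the 50-byte boundary and the weights are arbitrary assumptions of mine, and header information is left out to keep it short.

```python
import re

WORDS_OF_CONCERN = {"apologize", "attorney", "risk", "ashamed", "scandal"}
HOT_WORDS = {"urgent", "confidential"}  # hypothetical examples
PROXIMITY_BOUNDARY = 50  # bytes; an assumed value


def find_hits(body, vocabulary):
    """Return (byte_offset, word) for each vocabulary word found in the body."""
    data = body.lower().encode("utf-8")
    hits = []
    for match in re.finditer(rb"[a-z']+", data):
        word = match.group().decode("ascii")
        if word in vocabulary:
            hits.append((match.start(), word))
    return hits


def score_email(body):
    """Higher scores mean the e-mail more likely deserves a human look."""
    concern = find_hits(body, WORDS_OF_CONCERN)
    hot = find_hits(body, HOT_WORDS)
    # Count neighbouring pairs of different words of concern that fall
    # within the proximity boundary of each other.
    close_pairs = sum(
        1
        for (pos_a, word_a), (pos_b, word_b) in zip(concern, concern[1:])
        if word_a != word_b and pos_b - pos_a <= PROXIMITY_BOUNDARY
    )
    # Assumed weights: 1 per word of concern, 2 per close pair, 1 per hot word.
    return len(concern) + 2 * close_pairs + len(hot)


emails = [
    "See you at lunch.",
    "There is some risk in the schedule.",
    "I am ashamed of this and want to apologize before the attorney calls.",
]
ranked = sorted(emails, key=score_email, reverse=True)
```

Sorting by the score puts the e-mail with several words of concern close together at the top of the review queue. The weights are the obvious thing to adjust as reviewers give feedback.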

In reality this would likely be a very iterative process, as the users provide feedback on why an e-mail is more or less relevant to their concerns.  Additionally, changes in the business and/or legal environment could shift what needs to be looked for.

This post was about applying techniques to either filter out documents or break them apart in a systematic way.  In my next post I will summarize what I’ve learned and contrast it with other techniques in the industry.
