- Forward


Approximate Text Matching
An Introduction with Examples in Java


Prof. David Bernstein
James Madison University

Computer Science Department
bernstdh@jmu.edu

Print

Motivation
Back SMYC Forward
  • Some Observations:
    • Many applications need to compare pieces of text
    • Users often don't know how things are spelled (e.g., "Bernstein" or "Burnstein?") or how they should be described (e.g., "1 Duke Plaza" or "One Duke Plaza")
  • Possible Resolutions:
    • Find things that "sound alike"
    • Handle numbers in many ways
    • Ignore small differences (e.g., transposed letters)
  • In Practice:
    • More matches are often better than fewer (especially in a GUI-based application in which users can be offered a choice)
The Soundex Algorithm
Back SMYC Forward
  • History:
    • Developed by Odell and Russel (for the U.S. Census Bureau) in the 1870s to search for names
    • Many "versions" exist; we will use the version sanctioned by the National Archives
  • Concepts:
    • Ignore (non-initial) vowels
    • Create phonetic groups
    • Eliminate repeating sounds
    • Create a 4-digit code
The Soundex Algorithm (cont.)
Back SMYC Forward
  • Phoentic Groups:
    • A,E,I,O,U,H,W,Y (Removed)
    • B,F,P,V (1)
    • C,G,J,K,Q,S,X,Z (2)
    • D,T (3)
    • L (4)
    • M,N (5)
    • R (6)
  • Rules:
    • Always use the first letter
    • Remove duplicate soundex codes
    • If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded
    • If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right of the vowel is not coded
    • Truncate to four characters (pad with '0' if less than four characters)
The Soundex Algorithm (cont.)
Back SMYC Forward

A Simple Implementation

javaexamples/strings/Soundex.java
 
Simple Phonetic Coding
Back SMYC Forward
  • Observations:
    • Soundex was developed when computing power was very limited
    • It is possible to use more (language-specific, dialect-specific) phonetic knowledge
  • Some Existing Algorithms:
    • Metaphone
    • Double Metaphone
Simple Phonetic Coding - A Home Grown Algorithm
Back SMYC Forward
  • Getting Started:
    • Convert to upper case
  • Some Phonetic Rules (In Order):
    • ght becomes t
    • gh and ph become f
    • gn becomes n
    • ch and sh becomes x
    • ci, ce, cy become s
    • z becomes s
    • c becomes k
    • q becomes k
    • dg becomes j
    • g becomes j
    • wr becomes r
  • Finishing:
    • Remove vowels
    • Remove repeated consonants
Ignoring Small Differences
Back SMYC Forward
  • Jaro Measure:
    • Part of UNIMATCH record linkage system for the U.S. Census (1976)
    • The weighted sum of percentage of matched characters from each string (including transposed characters)
  • Ratcliff-Obershelp Measure:
    • Part of a pattern matching algorithm (1988)
    • The number of matching characters divided by the total number of characters in the two strings (where the matching characters are those in the longest common subsequence plus the matching characters on either side of the subsequence)
  • Levenshtein Distance:
    • The smallest number of insertions, deletions, and substitutions required to change one string into another
Number Matching
Back SMYC Forward
  • Absolute Numbers:
    • "1" and "one", "2" and "two", etc...
  • Relative Numbers:
    • "1st" and "first", "2nd" and "second", "3rd" and "third", "4th" and "fourth", etc...
    • "1st" and "1", etc... (e.g., for streets)
Number Matching (cont.)
Back SMYC Forward
  • An Observation:
    • More likely to be a problem for small numbers (e.g., "521" and "five hundred twenty one" are not interchanged often)
  • A Simple Algorithm:
    • Use a hash table with multiple keys for the same value (e.g., "1st", "1 st", "1" and "first" are all keys for "1st")
Domain-Specific Situations
Back SMYC Forward
  • Transportation Domain:
    • Interchangeability of "Interstate 81", "I81", "Route 81"
  • Geographic Domain:
    • Interchangeability of "N" and "North"
There's Always More to Learn
Back -