Feed on
Posts
Comments

I am running into an approximate string matching(aka fuzzy string matching, inexact string matching) problem on my Diplomarbeit. Dirk showed me some papers on this topic and coded a tiny c++ program for me to learn. I got the idea and coded a Java Approximate String Matching demo based on the algorithm described by Boy Lowrance and Robert A. Wagner in ACM journal, April, 1975. This kind of problem is in fact very interesting, I believe google has a fancy ASM algorithm running in the engine. Things like search string auto completion has much to do with approximate string matching. (By the way, approximate string matching is not regular expression, but a super concept of regular expression in some way)

The text file used in my testing food.txt can be downloaded here. The java source code is as following. The test-result.txt is here. I am happy for the way it find matches but there is new problem to solve. This text file is tiny, has only 75 items in it. In the program for my Diplomarbeit, there are about 10000 strings sitting in the memory as reference strings. How should I maximize the efficiency?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
/**
 * Created on 5 Jun 2007, 22:51:02 Copyright Ning Zhao. All rights
 * reserved.
 */
package org.ningning.sandbox;
 
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
 
/**
 * A very simple implementation of a shortest path based algorithm for
 * approximate string-to-string matching problem. The weight (in this
 * program I call it error score) of the difference between an input
 * string and a reference string is computed concerning the following
 * error cases: character absence (input: chery, reference cherry);
 * character redundance (input: cherrry, reference: cherry); character
 * mismatch (input: cherrz, reference: cherry); character position swap
 * (input: chrery, reference: cherry). More complex error cases can be
 * inserted.
 * <p />
 * The total error score between a reference string and input string is
 * the sum of minimal error scores of each pairs of correspoding
 * characters at the same positions in the strings. Reference strings
 * which have the lowest error scores are most likely to match the input
 * string, thus will be recommended by the program.
 * <p />
 * The attached food.txt has one food item per line. If you want to use
 * your own reference strings, you can either use the same format as
 * food.txt or modify the string scanning code in this program.
 *
 * @author Ellen Ning Zhao
 * @license <a href="http://www.gnu.org/copyleft/gpl.html">GPL V2</a>
 */
public class ApproximateStringMatchingDemo {
 
  private List<String> foods = new ArrayList<String>();
 
  private Map<String, Double> matchingScores = new HashMap<String, Double>();
 
  private Scanner scanner;
 
  /**
   * This table stores scores of combinations of any two ASCII symbols.
   * It is for the typo case errMatching to look up. For example, I
   * meant "cherry" but I typed "cherrz". In my typing there is a
   * matching error at the last position. More frequent typos are given
   * lower scores. For example, the character "v" and "b" are frequently
   * mistyped for each other, so if "v" and "b" appears at the same
   * positions of two words, the error score would be very low. This is
   * a symetric matrix, actually filling half matrix is already correct,
   * but inversing this matrix or write one more auxiliary method for
   * table looking up is slower than filling up the whole table. So I am
   * doing the simpler way.
   */
  private static double[][] mismatchScoreTable;
 
  private String in;
 
  private int inLength;
 
  public ApproximateStringMatchingDemo(String fileName) {
 
    // read the file, fill the food list
    try {
      scanner = new Scanner(new FileReader(fileName));
      while (scanner.hasNext()) {
        this.foods.add(scanner.nextLine());
      }
    } catch (FileNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
 
    if (mismatchScoreTable == null) initMismatchScoreTable();
  }
 
  public List<String> getFoods() {
    return this.foods;
  }
 
  /**
   * Initialize the score table. Finer scoring can be added later. Here
   * these cases are just for example. The many cases here can be
   * externalized into a rule file, and this rule file can be loaded
   * dynamically. We can add more rules like ignore cases, etc.
   */
  private static void initMismatchScoreTable() {
 
    mismatchScoreTable = new double[256][256];
 
    // Score any combination of two characters as 1 by default.
    for (int i = 0; i < 256; i++)
      for (int j = 0; j < 256; j++)
        mismatchScoreTable[i][j] = 1.0d;
 
    // If the input charater and reference character are the same,
    // there is no typo. So the error score is 0.
    for (int i = 0; i < 256; i++)
      mismatchScoreTable[i][i] = 0.0d;
 
    // For people who use both German keyboard and English keyboard,
    // this typo is highly frequent.
    mismatchScoreTable['y']['z'] = 0.1d;
    mismatchScoreTable['z']['y'] = 0.1d;
 
    mismatchScoreTable['v']['b'] = 0.15d;
    mismatchScoreTable['b']['v'] = 0.15d;
 
    mismatchScoreTable['n']['m'] = 0.11d;
    mismatchScoreTable['m']['n'] = 0.11d;
 
    mismatchScoreTable['t']['r'] = 0.15d;
    mismatchScoreTable['r']['t'] = 0.15d;
 
    mismatchScoreTable['g']['h'] = 0.15d;
    mismatchScoreTable['h']['g'] = 0.15d;
 
    mismatchScoreTable['y']['u'] = 0.15d;
    mismatchScoreTable['u']['y'] = 0.15d;
 
    // more typo possibilities can be inserted here....
  }
 
  /**
   * @param in
   *          the input string
   * @return the error score map.
   */
  public Map<String, Double> getScores(String in) {
    this.in = in;
    this.inLength = in.length();
 
    for (String food : this.foods) {
 
      int refLength = food.length();
 
      double[][] errScore = new double[inLength + 1][refLength + 1];
      errScore[0][0] = 0.0d;
 
      for (int inCharAt = 1; inCharAt <= this.inLength; inCharAt++)
        errScore[inCharAt][0] = inCharAt;
 
      for (int refCharAt = 1; refCharAt <= refLength; refCharAt++)
        errScore[0][refCharAt] = refCharAt;
 
      for (int inCharAt = 1; inCharAt <= inLength; inCharAt++)
        for (int refCharAt = 1; refCharAt <= refLength; refCharAt++) {
          // if a character is absent at the given position
          // in the input string, we add score 1.
          double charAbsence = errScore[inCharAt - 1][refCharAt] + 1;
 
          // if a character is redundant at the given position in the
          // input string, we add score 1.
          double charRedundance = errScore[inCharAt][refCharAt - 1] + 1;
 
          // if it is a matching error, we add the score specified in
          // the score table for matching errors.
          double mismatch = errScore[inCharAt - 1][refCharAt - 1]
                  + mismatchScoreTable[this.in.charAt(inCharAt - 1)][food
                          .charAt(refCharAt - 1)];
 
          // initialize the score for swap error to a very big value.
          double charPositionSwap = 999999d;
 
          // score for swap error
          if (inCharAt > 1
                  && refCharAt > 1
                  && this.in.charAt(inCharAt - 1) == food
                          .charAt(refCharAt - 2)
                  && this.in.charAt(inCharAt - 2) == food
                          .charAt(refCharAt - 1)) {
 
            // the score for typing "ie" as "ei" and vice versa
            // is even lower
            if (this.in.charAt(inCharAt - 2) == 'e'
                    && this.in.charAt(inCharAt - 1) == 'i') {
              charPositionSwap = errScore[inCharAt - 2][refCharAt - 2] + 0.25;
            }
 
            // more cases can be inserted here.
 
            else charPositionSwap = errScore[inCharAt - 2][refCharAt - 2] + 0.5;
          }
 
          // more error cases can be inserted here.
 
          double minScore = mismatch;
 
          if (charAbsence < minScore) {
            minScore = charAbsence;
          }
 
          if (charRedundance < minScore) {
            minScore = charRedundance;
          }
 
          if (charPositionSwap < minScore) {
            minScore = charPositionSwap;
          }
 
          errScore[inCharAt][refCharAt] = minScore;
        }
      this.matchingScores.put(food, errScore[this.inLength][refLength]);
    }
    return this.matchingScores;
  }
 
  /**
   * @param args
   */
  public static void main(String[] args) {
    System.out
            .println("Please type the file name. Be sure to check path: ");
    Scanner sc = new Scanner(System.in);
    String filename = sc.next();
    ApproximateStringMatchingDemo demo = new ApproximateStringMatchingDemo(
            filename);
 
    System.out.println("There are " + demo.getFoods().size()
            + " items in the file.");
 
 
    System.out.print("Please type a word. Type q for exit: ");
    sc.nextLine();
    while (sc.hasNext()) {
      String in  = sc.nextLine();
 
      if (in.equals("q")) {
        System.exit(0);
      }
 
      System.out.println("You typed " + in);
 
      System.out.println("--------------------------------------------");
      Map scoreMap = demo.getScores(in);
      for (String food : demo.getFoods()) {
        System.out.println(food + "\t error score: " + scoreMap.get(food));
      }
 
      System.out.println("--------------------------------------------");
 
      double minScore = (Double) Collections.min(scoreMap.values());
 
      if (minScore == 0.0d) {
        System.out.println(in + " is in the list.");
      } else {
        List<String> corrections = new ArrayList<String>();
        StringBuffer sb = new StringBuffer("Do you mean ");
        for (String food : demo.getFoods()) {
          if (scoreMap.get(food).equals(minScore)) {
            corrections.add(food);
            sb.append(food).append(" or ");
          }
        }
        sb.append("?");
        System.out.println(sb.toString());
      }
      System.out.println();
      System.out.println("Please type a word. Type q for exit: ");
    }
  }
}

10 Responses to “A Small Demo of Approximate String Matching in Java”

  1. Leo Na says:

    This is not good for Full-Text Searching from DB.
    I am looking for an algorithm about fuzzy full text searching (part of the terms) engine from DB without using Full-Text indexing.
    Hope I can discuss with you. qq23407934

    • ellen says:

      Hi Leo,

      you are right. If the reference text strings have to come from DB, performance will suffer. In my case, there were about 8000 strings (names of food), I would simply load them into some cache when the application starts. How big is the DB size you have to deal with?

      I do not use QQ or MSN. You can ping me via Google Talk/Gmail/Google Plus. Just look for my name, should be easy to find me (My profile shows up with a photo).

      -Ning

      • Leo Na says:

        Hi Ellen,
        I have tried different way, including “Soundex”, “Double Metaphone”, Viterbi(Bellman) algorithm, yours, and other “Approximate String Matching” implements. Unfortunately, neither one is suitable for my requirement (performance and accuracy). I will try IBM DB2 10.x build-in fuzzy searching techniques. Hope it works.
        Well, thanks for your reply and this is a nice WP blog to share your points.

        Xin

      • Leo Na says:

        Sorry, one more thing about the accuracy for short terms, I don’t think the result is quite good.
        For example, I want to search a keyword”iphone”:
        compare with “iphone5″, the matching score is 1.0
        compare with “ifone”, the matching score is 2.0
        Then, I feel the result is acceptable and set “2.0″ as a matching score threshold.
        However, I will do another search and I want to search “a”:
        compare with “b”, the matching score is 1.0
        compare with “cd”, the matching score is 2.0
        Then, the result is wrong.

        And, how to decide a matching score threshold automatically?
        The above is just my testing result, it might be wrong.
        Leo

        • ellen says:

          Hi Leo,

          first thanks to your nice works. I just wish I will soon have enough time to write more on my blog. Quite a few ideas and interesting things/people in my short daily logs right now.

          Just a quick thought to your scoring problem: I would try to “normalize” the score with the length of the string. In your first example, say the reference string is “iphone”, you first typed “iphone5″ and got score 1.0. The reference string “iphone” has the length of 6. So, the “normalized” score is 1.0/6 = 0.1667. Then you typed “ifone” and got score 2.0. Normalized score is 2.0/6 = 0.3333. In your second example, the reference string was “a”, you first typed “b”, got score 1.0. Normalize it you get 1.0/1 = 1. This is significantly greater than 0.1667, in the first example. Then you typed “cd”, this time, your normalized score will be 2.0/1 = 2.0, again, 2.0 looks far crazier than 0.3333.

          So in your case, I will look up two criteria when filtering strings:
          * criterion 1: the absolute score must be smaller than 2.0 (threshold is 2 for the first criterion)
          * criterion 2: the normalized score must be smaller than 1.0 (threshold is 1 for the second criterion)
          You may just use criterion 2 or combine these two criteria in your situation. (For very long reference strings, threshold 2.0 in criterion 1 might be too tight.)

          Hope that helps,
          Ning

          • Leo Na says:

            Thanks Ellen, that sounds good to me for single words comparison. I appreciate it.
            Have a good weekend!
            Leo

  2. Harish says:

    Have you tried using SecondString java api?
    I was looking at similar solution, but I couldn’t find good examples of using SecondString.

    • ellen says:

      No I have never tried using SecondString API. Recently have been busy getting familiar with the “products” of my own company, who has the broadest software product portfolio in the industry…..

  3. ellen says:

    @SKC,

    yes you may use this code as is in your project. But please quote the original author’s name. My name is Ellen Ning Zhao. Thanks!

  4. SKC says:

    Hi,

    This is good code base.
    I want to use this code, in my project.
    Can I use this as is?

    regards
    SKC

Leave a Reply