I am running into an approximate string matching(aka fuzzy string matching, inexact string matching) problem on my Diplomarbeit. Dirk showed me some papers on this topic and coded a tiny c++ program for me to learn. I got the idea and coded a Java Approximate String Matching demo based on the algorithm described by Boy Lowrance and Robert A. Wagner in ACM journal, April, 1975. This kind of problem is in fact very interesting, I believe google has a fancy ASM algorithm running in the engine. Things like search string auto completion has much to do with approximate string matching. (By the way, approximate string matching is not regular expression, but a super concept of regular expression in some way)
The text file used in my testing food.txt can be downloaded here. The java source code is as following. The test-result.txt is here. I am happy for the way it find matches but there is new problem to solve. This text file is tiny, has only 75 items in it. In the program for my Diplomarbeit, there are about 10000 strings sitting in the memory as reference strings. How should I maximize the efficiency?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 | /** * Created on 5 Jun 2007, 22:51:02 Copyright Ning Zhao. All rights * reserved. */ package org.ningning.sandbox; import java.io.FileNotFoundException; import java.io.FileReader; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Scanner; /** * A very simple implementation of a shortest path based algorithm for * approximate string-to-string matching problem. The weight (in this * program I call it error score) of the difference between an input * string and a reference string is computed concerning the following * error cases: character absence (input: chery, reference cherry); * character redundance (input: cherrry, reference: cherry); character * mismatch (input: cherrz, reference: cherry); character position swap * (input: chrery, reference: cherry). More complex error cases can be * inserted. * <p /> * The total error score between a reference string and input string is * the sum of minimal error scores of each pairs of correspoding * characters at the same positions in the strings. Reference strings * which have the lowest error scores are most likely to match the input * string, thus will be recommended by the program. * <p /> * The attached food.txt has one food item per line. If you want to use * your own reference strings, you can either use the same format as * food.txt or modify the string scanning code in this program. * * @author Ellen Ning Zhao * @license <a href="http://www.gnu.org/copyleft/gpl.html">GPL V2</a> */ public class ApproximateStringMatchingDemo { private List<String> foods = new ArrayList<String>(); private Map<String, Double> matchingScores = new HashMap<String, Double>(); private Scanner scanner; /** * This table stores scores of combinations of any two ASCII symbols. * It is for the typo case errMatching to look up. For example, I * meant "cherry" but I typed "cherrz". In my typing there is a * matching error at the last position. More frequent typos are given * lower scores. For example, the character "v" and "b" are frequently * mistyped for each other, so if "v" and "b" appears at the same * positions of two words, the error score would be very low. This is * a symetric matrix, actually filling half matrix is already correct, * but inversing this matrix or write one more auxiliary method for * table looking up is slower than filling up the whole table. So I am * doing the simpler way. */ private static double[][] mismatchScoreTable; private String in; private int inLength; public ApproximateStringMatchingDemo(String fileName) { // read the file, fill the food list try { scanner = new Scanner(new FileReader(fileName)); while (scanner.hasNext()) { this.foods.add(scanner.nextLine()); } } catch (FileNotFoundException e) { e.printStackTrace(); System.exit(1); } if (mismatchScoreTable == null) initMismatchScoreTable(); } public List<String> getFoods() { return this.foods; } /** * Initialize the score table. Finer scoring can be added later. Here * these cases are just for example. The many cases here can be * externalized into a rule file, and this rule file can be loaded * dynamically. We can add more rules like ignore cases, etc. */ private static void initMismatchScoreTable() { mismatchScoreTable = new double[256][256]; // Score any combination of two characters as 1 by default. for (int i = 0; i < 256; i++) for (int j = 0; j < 256; j++) mismatchScoreTable[i][j] = 1.0d; // If the input charater and reference character are the same, // there is no typo. So the error score is 0. for (int i = 0; i < 256; i++) mismatchScoreTable[i][i] = 0.0d; // For people who use both German keyboard and English keyboard, // this typo is highly frequent. mismatchScoreTable['y']['z'] = 0.1d; mismatchScoreTable['z']['y'] = 0.1d; mismatchScoreTable['v']['b'] = 0.15d; mismatchScoreTable['b']['v'] = 0.15d; mismatchScoreTable['n']['m'] = 0.11d; mismatchScoreTable['m']['n'] = 0.11d; mismatchScoreTable['t']['r'] = 0.15d; mismatchScoreTable['r']['t'] = 0.15d; mismatchScoreTable['g']['h'] = 0.15d; mismatchScoreTable['h']['g'] = 0.15d; mismatchScoreTable['y']['u'] = 0.15d; mismatchScoreTable['u']['y'] = 0.15d; // more typo possibilities can be inserted here.... } /** * @param in * the input string * @return the error score map. */ public Map<String, Double> getScores(String in) { this.in = in; this.inLength = in.length(); for (String food : this.foods) { int refLength = food.length(); double[][] errScore = new double[inLength + 1][refLength + 1]; errScore[0][0] = 0.0d; for (int inCharAt = 1; inCharAt <= this.inLength; inCharAt++) errScore[inCharAt][0] = inCharAt; for (int refCharAt = 1; refCharAt <= refLength; refCharAt++) errScore[0][refCharAt] = refCharAt; for (int inCharAt = 1; inCharAt <= inLength; inCharAt++) for (int refCharAt = 1; refCharAt <= refLength; refCharAt++) { // if a character is absent at the given position // in the input string, we add score 1. double charAbsence = errScore[inCharAt - 1][refCharAt] + 1; // if a character is redundant at the given position in the // input string, we add score 1. double charRedundance = errScore[inCharAt][refCharAt - 1] + 1; // if it is a matching error, we add the score specified in // the score table for matching errors. double mismatch = errScore[inCharAt - 1][refCharAt - 1] + mismatchScoreTable[this.in.charAt(inCharAt - 1)][food .charAt(refCharAt - 1)]; // initialize the score for swap error to a very big value. double charPositionSwap = 999999d; // score for swap error if (inCharAt > 1 && refCharAt > 1 && this.in.charAt(inCharAt - 1) == food .charAt(refCharAt - 2) && this.in.charAt(inCharAt - 2) == food .charAt(refCharAt - 1)) { // the score for typing "ie" as "ei" and vice versa // is even lower if (this.in.charAt(inCharAt - 2) == 'e' && this.in.charAt(inCharAt - 1) == 'i') { charPositionSwap = errScore[inCharAt - 2][refCharAt - 2] + 0.25; } // more cases can be inserted here. else charPositionSwap = errScore[inCharAt - 2][refCharAt - 2] + 0.5; } // more error cases can be inserted here. double minScore = mismatch; if (charAbsence < minScore) { minScore = charAbsence; } if (charRedundance < minScore) { minScore = charRedundance; } if (charPositionSwap < minScore) { minScore = charPositionSwap; } errScore[inCharAt][refCharAt] = minScore; } this.matchingScores.put(food, errScore[this.inLength][refLength]); } return this.matchingScores; } /** * @param args */ public static void main(String[] args) { System.out .println("Please type the file name. Be sure to check path: "); Scanner sc = new Scanner(System.in); String filename = sc.next(); ApproximateStringMatchingDemo demo = new ApproximateStringMatchingDemo( filename); System.out.println("There are " + demo.getFoods().size() + " items in the file."); System.out.print("Please type a word. Type q for exit: "); sc.nextLine(); while (sc.hasNext()) { String in = sc.nextLine(); if (in.equals("q")) { System.exit(0); } System.out.println("You typed " + in); System.out.println("--------------------------------------------"); Map scoreMap = demo.getScores(in); for (String food : demo.getFoods()) { System.out.println(food + "\t error score: " + scoreMap.get(food)); } System.out.println("--------------------------------------------"); double minScore = (Double) Collections.min(scoreMap.values()); if (minScore == 0.0d) { System.out.println(in + " is in the list."); } else { List<String> corrections = new ArrayList<String>(); StringBuffer sb = new StringBuffer("Do you mean "); for (String food : demo.getFoods()) { if (scoreMap.get(food).equals(minScore)) { corrections.add(food); sb.append(food).append(" or "); } } sb.append("?"); System.out.println(sb.toString()); } System.out.println(); System.out.println("Please type a word. Type q for exit: "); } } } |





This is not good for Full-Text Searching from DB.
I am looking for an algorithm about fuzzy full text searching (part of the terms) engine from DB without using Full-Text indexing.
Hope I can discuss with you. qq23407934
Hi Leo,
you are right. If the reference text strings have to come from DB, performance will suffer. In my case, there were about 8000 strings (names of food), I would simply load them into some cache when the application starts. How big is the DB size you have to deal with?
I do not use QQ or MSN. You can ping me via Google Talk/Gmail/Google Plus. Just look for my name, should be easy to find me (My profile shows up with a photo).
-Ning
Hi Ellen,
I have tried different way, including “Soundex”, “Double Metaphone”, Viterbi(Bellman) algorithm, yours, and other “Approximate String Matching” implements. Unfortunately, neither one is suitable for my requirement (performance and accuracy). I will try IBM DB2 10.x build-in fuzzy searching techniques. Hope it works.
Well, thanks for your reply and this is a nice WP blog to share your points.
Xin
Sorry, one more thing about the accuracy for short terms, I don’t think the result is quite good.
For example, I want to search a keyword”iphone”:
compare with “iphone5″, the matching score is 1.0
compare with “ifone”, the matching score is 2.0
Then, I feel the result is acceptable and set “2.0″ as a matching score threshold.
However, I will do another search and I want to search “a”:
compare with “b”, the matching score is 1.0
compare with “cd”, the matching score is 2.0
Then, the result is wrong.
And, how to decide a matching score threshold automatically?
The above is just my testing result, it might be wrong.
Leo
Hi Leo,
first thanks to your nice works. I just wish I will soon have enough time to write more on my blog. Quite a few ideas and interesting things/people in my short daily logs right now.
Just a quick thought to your scoring problem: I would try to “normalize” the score with the length of the string. In your first example, say the reference string is “iphone”, you first typed “iphone5″ and got score 1.0. The reference string “iphone” has the length of 6. So, the “normalized” score is 1.0/6 = 0.1667. Then you typed “ifone” and got score 2.0. Normalized score is 2.0/6 = 0.3333. In your second example, the reference string was “a”, you first typed “b”, got score 1.0. Normalize it you get 1.0/1 = 1. This is significantly greater than 0.1667, in the first example. Then you typed “cd”, this time, your normalized score will be 2.0/1 = 2.0, again, 2.0 looks far crazier than 0.3333.
So in your case, I will look up two criteria when filtering strings:
* criterion 1: the absolute score must be smaller than 2.0 (threshold is 2 for the first criterion)
* criterion 2: the normalized score must be smaller than 1.0 (threshold is 1 for the second criterion)
You may just use criterion 2 or combine these two criteria in your situation. (For very long reference strings, threshold 2.0 in criterion 1 might be too tight.)
Hope that helps,
Ning
Thanks Ellen, that sounds good to me for single words comparison. I appreciate it.
Have a good weekend!
Leo
Have you tried using SecondString java api?
I was looking at similar solution, but I couldn’t find good examples of using SecondString.
No I have never tried using SecondString API. Recently have been busy getting familiar with the “products” of my own company, who has the broadest software product portfolio in the industry…..
@SKC,
yes you may use this code as is in your project. But please quote the original author’s name. My name is Ellen Ning Zhao. Thanks!
Hi,
This is good code base.
I want to use this code, in my project.
Can I use this as is?
regards
SKC