Thursday, March 12, 2015

String Matching



Theses related to databases are best done by data mining (finding patterns in) large data sets or databases. As an example, you could query a data set of strings by posing a data string in question and see if it matches with any of the strings in the database. If there is a direct match, the edit distance between the query string and the string from the database is zero. Now what is an edit distance?

By definition, the edit distance between two strings, ed(x,y), is the  minimum  number of
character insertions, deletions and replacements that transforms x to y.

Example:
hellohallo: replace e by a (Edit distance:1)
hellohell: delete o (Edit distance:1)
helloshell: delete o, insert s  (Edit distance:2)

The edit distance is also called Levenshtein distance.

There are many algorithms that can be googled on the net to find exact string matches from data sets. These algorithms are called exact string matching algorithms. The string in question may require to retrieve an approximately closely matched string if exact matches do not exist in the data set. For this, the need for approximate string matching algorithms is required.

You may ask how all this relates to data mining. It does because we are dealing with patterns in strings. Similarity retrieval can be applied in real life in case of finger prints, images, and audios and much more.

In my next blog post, I am going to discuss approximate string matching algorithms.





No comments:

Post a Comment