Theses related to databases are
best done by data mining (finding patterns in) large data sets or databases. As
an example, you could query a data set of strings by posing a data string in
question and see if it matches with any of the strings in the database. If
there is a direct match, the edit distance between the query string and the
string from the database is zero. Now what is an edit distance?
By definition, the edit distance between two strings, ed(x,y), is the minimum
number of
character insertions, deletions and replacements that transforms
x to y.
Example:
hello→hallo: replace e by a (Edit distance:1)
hello→hell: delete o (Edit
distance:1)
hello→shell: delete o, insert s (Edit distance:2)
The edit
distance is also called Levenshtein distance.
There
are many algorithms that can be googled on the net to find exact string matches
from data sets. These algorithms are called exact string matching algorithms.
The string in question may require to retrieve an approximately closely matched
string if exact matches do not exist in the data set. For this, the need for
approximate string matching algorithms is required.
You may
ask how all this relates to data mining. It does because we are dealing with
patterns in strings. Similarity retrieval can be applied in real life in case
of finger prints, images, and audios and much more.
In my
next blog post, I am going to discuss approximate string matching algorithms.
No comments:
Post a Comment