Thursday, March 26, 2015

Lipschitz Embeddings



One of the well known approximate string matching algorithms is Lipschitz Embeddings. Here it goes:

Input: A text file T containing N strings (O1, O2,…….ON) and a query object q.
Output: Nearest string to q with corresponding edit distance.

Steps:
  1. Construct a text file R of k strings (A1,A2,……..Ak) by choosing randomly from N string
  1. Compute F(q),the array of edit distances of the query object q to k strings of R, that is, F(q) = (d(q,A1 ),d(q,A2 ),……,d(q,Ak ) ) where, d(q,Aj ) is the edit distance between q and Aj .Here,1≤j≤k.

  1. Compute F(Oi ),the array of edit distances of each string Oi in T to k reference strings in R.
         that is, F(Oi) =( (d(Oi ,A1),(d(Oi , A2 ),…….d(Oi , Ak ) ).Here,1≤ i ≤ N.

  1. Calculate ∂(F(Oj ),F(q)),the distance between each string Oj of T to query string q in embedding  space where,
        
          where 1≤ j ≤ N, 1≤ i ≤ k, p=2.

    5. Find the minimum value among ∂(F(Oi ),F(q)), if the minimum is ∂(F(Om ),F(q)),
         then find d(Om ,q). (1≤i≤N).

   6. For i=1 to N,
      if(∂(F(Oi ),F(q))  > d(Om ,q) )
           then edit_distance= d(Om ,q) and nearest_string= Om.
      else if (d(Oi ,q) < d(Om ,q))
           then edit_distance= d(Oi ,q) and nearest_string = Oi

7.      Show the edit_distance and nearest_string.


See ya next time.



Thursday, March 19, 2015

Brute Force Algorithm




There are different algorithms for approximate string matching. A fundamental approach is the Brute Force Algorithm.

A brute-force approach would be to compute the edit distances to query object(q) from all N substrings of  text(T), and then choose the substring with the minimum edit distance. The sequential steps are given below:


Steps:

1. Read N (say=50) strings from the text file T.
2. Take an input (query object q) from the user.
3. Calculate the edit distances to query object, q from all N strings of T.
4. Find out the minimum edit distance.
5. Output will be the string which has the minimum edit distance.  


Let N=5 and T is:{been, bid, moon, sun, star}
Say, q=’seen’
The edit distances to query object q from all N strings of T will be respectively:
{1,4,3,2,3}
The minimum distance is: 1
Hence the output will be: been

Until next time, I am signing off.


Thursday, March 12, 2015

String Matching



Theses related to databases are best done by data mining (finding patterns in) large data sets or databases. As an example, you could query a data set of strings by posing a data string in question and see if it matches with any of the strings in the database. If there is a direct match, the edit distance between the query string and the string from the database is zero. Now what is an edit distance?

By definition, the edit distance between two strings, ed(x,y), is the  minimum  number of
character insertions, deletions and replacements that transforms x to y.

Example:
hellohallo: replace e by a (Edit distance:1)
hellohell: delete o (Edit distance:1)
helloshell: delete o, insert s  (Edit distance:2)

The edit distance is also called Levenshtein distance.

There are many algorithms that can be googled on the net to find exact string matches from data sets. These algorithms are called exact string matching algorithms. The string in question may require to retrieve an approximately closely matched string if exact matches do not exist in the data set. For this, the need for approximate string matching algorithms is required.

You may ask how all this relates to data mining. It does because we are dealing with patterns in strings. Similarity retrieval can be applied in real life in case of finger prints, images, and audios and much more.

In my next blog post, I am going to discuss approximate string matching algorithms.





Thursday, March 5, 2015

Database Research Projects

For carrying out database research projects, you can design an entity relationship diagram after identifying the entities, attributes of the entities and relationships of the system, convert it to a relational model (table schemas) and normalize the tables and finally design the diagram including split tables, if any from normalization. You can fill up tables with values and make sure there are no conflicts between primary and foreign keys. For this to set up, you can use MySQL or SQL Server. You can query the system with proper queries and check to see if the system is giving the correct results. So, this will be the back-end of the database system. You can create a user friendly interface at the front-end and connect it with the back-end. For designing the interface you can use PhP/javascript, C# or ASP.net. This system can be any enterprise you choose from daily life like a hospital, bank, travel agency, fast food restaurant, any sport or music band that you would like to design as a database system.

In my next blog, I will talk about database theses that are useful to work on.