Most frequent k chars distance

This page uses content from Wikipedia. The original article was at Most frequent k chars distance. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)

In wp:information theory, MostFreqKDistance is a string metric technique for quickly estimating how similar two ordered sets or strings are. The scheme was invented by Sadi Evren SEKER,^[1] and initially used in text mining applications like author recognition.^[1] Method is originally based on a hashing function MaxFreqKChars ^[2] classical wp:author recognition problem and idea first came out while studying on wp:data stream mining.^[3] The string distance

Definition

Method has two steps.

Hash input strings str1 and str2 separately using MostFreqKHashing and output hstr1 and hstr2 respectively
Calculate string distance (or string similarity coefficient) of two hash outputs, hstr1 and hstr2 and output an integer value

Most Frequent K Hashing

The first step of algorithm is calculating the hashing based on the most frequent k characters. The hashing algorithm has below steps: <lang java> String function MostFreqKHashing (String inputString, int K)

   def string outputString
   for each district characters 
       count occurrence of each character
   for i := 0 to K 
       char c = next most freq ith character  (if two chars have same frequency than get the first occurrence in inputString)
       int count = number of occurrence of the character
       append to outputString, c and count
   end for
   return outputString

</lang>

Above function, simply gets an input string and an integer K value and outputs the most frequent K characters from the input string. The only condition during the creation of output string is adding the first occurring character first, if the frequencies of two characters are equal. Similar to the most of hashing functions, Most Frequent K Hashing is also a wp:one way function.

Most Frequent K Distance

The second step of algorithm works on two outputs from two different input strings and outputs the similarity coefficient (or distance metric). <lang java> int function MostFreqKSimilarity (String inputStr1, String inputStr2)

   def int similarity
   for each c = next character from inputStr1
       lookup c in inputStr2
       if c is null
            continue
       similarity += frequency of c in inputStr1 + frequency of c in inputStr2
   return similarity

</lang> Above function, simply gets two input strings, previously outputted from the MostFreqKHashing function. From the most frequent k hashing function, the characters and their frequencies are returned. So, the similarity function calculates the similarity based on characters and their frequencies by checking if the same character appears on both strings and if their frequencies are equal.

In some implementations, the distance metric is required instead of similarity coefficient. In order to convert the output of above similarity coefficient to distance metric, the output can be subtracted from any constant value (like the maximum possible output value). For the case, it is also possible to implement a wp:wrapper function over above two functions.

String Distance Wrapper Function

In order to calculate the distance between two strings, below function can be implemented <lang java> int function MostFreqKSDF (String inputStr1, String inputStr2, int K, int maxDistance)

   return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))

</lang>

Any call to above string distance function will supply two input strings and a maximum distance value. The function will calculate the similarity and subtract that value from the maximum possible distance. It can be considered as a simple wp:additive inverse of similarity.

Examples

Let's consider maximum 2 frequent hashing over two strings ‘research’ and ‘seeking’. MostFreqKHashing('research',2) = r2e2 because we have 2 'r' and 2 'e' characters with the highest frequency and we return in the order they appear in the string. MostFreqKHashing('seeking',2) = e2s1 Again we have character 'e' with highest frequency and rest of the characters have same frequency of 1, so we return the first character of equal frequencies, which is 's'. Finally we make the comparison: MostFreqKSimilarity('r2e2','e2s1') = 2 We simply compared the outputs and only character occurring in both input is character 'e' and the occurrence in both input is 2. Instead running the sample step by step as above, we can simply run by using the string distance wrapper function as below: MostFreqKSDF('research', 'seeking',2) = 2

Below table holds some sample runs between example inputs for K=2:

Inputs	Hash Outputs	SDF Output (max from 10)
'night' 'nacht'	n1i1 n1a1	9
'my' 'a'	m1y1 a1NULL0	10
‘research’ ‘research’	r2e2 r2e2	8
‘aaaaabbbb’ ‘ababababa’	a5b4 a5b4	1
‘significant’ ‘capabilities’	i3n2 i3a2	5

Method is also suitable for bioinformatics to compare the genetic strings like in wp:fasta format

Str1= LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV

Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG

MostFreqKHashing(str1,2) = L9T8

MostFreqKHashing(str2,2) = F9L8

MostFreqKSDF(str1,str2,2,100) = 83

Java

Translation of the pseudo-code of the Wikipedia article wp:Most frequent k characters to wp:java implementation of three functions given in the definition section are given below with wp:JavaDoc comments:

<lang Java> import java.util.Collections; import java.util.Comparator; import java.util.HashMap; import java.util.Iterator; import java.util.LinkedHashMap; import java.util.LinkedList; import java.util.List; import java.util.Map;

public class SDF {

   /** Counting the number of occurrences of each character
    * @param character array
    * @return hashmap : Key = char, Value = num of occurrence
    */
   public static HashMap<Character, Integer> countElementOcurrences(char[] array) {

       HashMap<Character, Integer> countMap = new HashMap<Character, Integer>();

       for (char element : array) {
           Integer count = countMap.get(element);
           count = (count == null) ? 1 : count + 1;
           countMap.put(element, count);
       }

       return countMap;
   }
   
   /**
    * Sorts the counted numbers of characters (keys, values) by java Collection List
    * @param HashMap (with key as character, value as number of occurrences)
    * @return sorted HashMap
    */
 private static HashMap sortByValues(HashMap map) { 
      List list = new LinkedList(map.entrySet());
      // Defined Custom Comparator here
      Collections.sort(list, new Comparator() {
           public int compare(Object o1, Object o2) {
              return -1*((Comparable) ((Map.Entry) (o1)).getValue())
                 .compareTo(((Map.Entry) (o2)).getValue());
           }
      });

      // Here I am copying the sorted list in HashMap
      // using LinkedHashMap to preserve the insertion order
      HashMap sortedHashMap = new LinkedHashMap();
      for (Iterator it = list.iterator(); it.hasNext();) {
             Map.Entry entry = (Map.Entry) it.next();
             sortedHashMap.put(entry.getKey(), entry.getValue());
      } 
      return sortedHashMap;
 }
 /**
  * get most frequent k characters
  * @param array of characters
  * @param limit of k
  * @return hashed String 
  */
   public static String mostOcurrencesElement(char[] array, int k) {
       HashMap<Character, Integer> countMap = countElementOcurrences(array);
       System.out.println(countMap);
       Map<Integer, String> map = sortByValues(countMap); 
       System.out.println(map);
       Iterator it = map.entrySet().iterator();
       int i = 0;
       String output = new String();
       while (it.hasNext()&&i++<k) {
           Map.Entry pairs = (Map.Entry)it.next();
           output += ""+pairs.getKey()+pairs.getValue();
           it.remove(); // avoids a ConcurrentModificationException
       }
       return output;
   }
   /**
    * Calculates the similarity between two input strings
    * @param input string 1
    * @param input string 2
    * @param maximum possible limit value 
    * @return distance as integer
    */
   public int getDiff(String str1, String str2, int limit){
       int similarity=0;
       for(int i =0;i<str1.length();i+=2){
           System.out.println(i);
           int pos = str2.indexOf(str1.charAt(i));
           System.out.println(str2.charAt(i)+" - "+pos);
           if(pos>=0){
               similarity += Integer.parseInt(str2.substring(pos+1, pos+2))+Character.getNumericValue(str1.charAt(i+1)) ;
           }
               
       }
           
       return limit-similarity;
   }
   /**
    * Wrapper function 
    * @param input string 1
    * @param input string 2
    * @param maximum possible limit value 
    * @return distance as integer
    */
   public int SDFfunc(String str1, String str2, int limit){
       return getDiff(mostOcurrencesElement(str1.toCharArray(),2),mostOcurrencesElement(str2.toCharArray(),2),limit);
   }

   public static void main(String[] args) {
       String input1 = new String("LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV");
       String input2 = new String("EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG");
       SDF sdf = new SDF();
       System.out.println (sdf.SDFfunc(input1,input2,100));

} </lang>

References

↑ ^1.0 ^1.1 SEKER, Sadi E.; Altun, Oguz; Ayan, Ugur; Mert, Cihan (2014), "A Novel String Distance Function based on Most Frequent K Characters", wp:International Journal of Machine Learning and Computing (IJMLC), 4, wp:International Association of Computer Science and Information Technology Press (IACSIT Press), pp. 177-183, http://arxiv.org/abs/1401.6596 .
↑ Seker, Sadi E.; Mert, Cihan (2013), "A Novel Feature Hashing For Text Mining", Journal of Technical Science and Technologies, 2, wp:International Black Sea University, pp. 37 -41, Template:Citation/identifier, http://journal.ibsu.edu.ge/index.php/jtst/article/view/428 .
↑ Seker, Sadi E.; Al-Naami, Khaled; Khan, Latifur (2013), "Author attribution on streaming data", Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, San Fransisco, USA, Aug 14-16, 2013, wp:IEEE, pp. 497-503, Template:Citation/identifier, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6642511 .

[mfkc-1] 1.0 ^1.1 SEKER, Sadi E.; Altun, Oguz; Ayan, Ugur; Mert, Cihan (2014), "A Novel String Distance Function based on Most Frequent K Characters", wp:International Journal of Machine Learning and Computing (IJMLC), 4, wp:International Association of Computer Science and Information Technology Press (IACSIT Press), pp. 177-183, http://arxiv.org/abs/1401.6596 .

[hashfunc-2] Seker, Sadi E.; Mert, Cihan (2013), "A Novel Feature Hashing For Text Mining", Journal of Technical Science and Technologies, 2, wp:International Black Sea University, pp. 37 -41, Template:Citation/identifier, http://journal.ibsu.edu.ge/index.php/jtst/article/view/428 .

[author-3] Seker, Sadi E.; Al-Naami, Khaled; Khan, Latifur (2013), "Author attribution on streaming data", Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, San Fransisco, USA, Aug 14-16, 2013, wp:IEEE, pp. 497-503, Template:Citation/identifier, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6642511 .

[1]

[2]

[3]