Most frequent k chars distance: Difference between revisions
rewrite task description based on original article, on Python entry, and Discussion page comments
(rewrite task description based on original article, on Python entry, and Discussion page comments) |
|||
Line 1:
{{
The string distance function (SDF) that is the subject of this entry
was proposed in a paper published in 2014 as a method for quickly
estimating how [[wp:Similarity measure|similar]] two
[[wp:Order theory|ordered sets]] or
[[wp:String (computer science)|strings]] are.
This SDF has also been proposed as being suitable in bioinformatics, e.g. for comparing sequences in the [[wp:fasta]] format.
Unfortunately, the original paper
by Sadi Evren Seker et al.
(arXiv:1401.6596 [cs.DS])
<ref name="mfkc">{{citation
| last1 = SEKER | first1 = Sadi E. | author1-link = Sadi Evren SEKER
| last2 = Altun | first2 = Oguz
Line 12 ⟶ 22:
| publisher = [[wp:International Association of Computer Science and Information Technology Press (IACSIT Press)]]
| title = [[wp:International Journal of Machine Learning and Computing (IJMLC)]]
| url =
| year = 2014}}.</ref>
has a number of internal inconsistencies and for this and other
reasons is not entirely clear in several key respects, but the paper
does give some worked examples and several of these (notably the two
given under the "Worked Examples" heading below) agree with the
interpretation used by the authors of the Python entry herein, so that
interpretation is used in the present task description.
;The Task
(a) Show a time-efficient implementation of the SDF as if by a
function mostFreqKSDF(s1, s2, k, n), where s1 and s2 are arbitrary
strings, and k and n are non-negative integers as explained below;
(b) Show the values produced by the SDF for the following values of s1 and s2,
with k = 2 and n = 10:
{|class="wikitable"
|-
!String Inputs
!"Top Two"
!SDF Output (
|-
|'night'
'nacht'
|n:1 i:1
n:1 a:1
|
|-
|'my'
'a'
|m:1 y:1
a:1
|10
|-
|‘research’
‘research’
|r:2 e:2
r:2 e:2
|
|-
|-
|‘significant’
‘capabilities’
|i:3 n:2
i:3 a:2
|
|}
(c) Show the value produced by the SDF for k = 2, n = 100,
and the two strings:
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
; Reference Algorithm
The original paper employs an unusual scheme for encoding the
frequencies of single letters as a string of letters and numbers. This
scheme has several disadvantages, and so in the following we take a
more abstract view and simply assume, solely for the purposes
of this description, that two functions are available:
1) MostFreqK(s, k) returns (in decreasing order of their frequency of occurrence)
the k most frequently occurring letters in s, with
any ties broken by the position of the first occurrence in s;
2) For any letter c in MostFreqK(s, k), Freq(s,c) returns the frequency
of c in s.
; MFKsimilarity
Next define MFKsimilarlity(s1, s2, k) as if by the following pseudo-code:
<pre>
int function MFKsimilarity(string s1, string s2, int k):
let similarity = 0
let mfk1 = MostFreqK(s1, k)
let mfk2 = MostFreqK(s2, k)
for each c in mfk1
if (c occurs in mfk2)
then similarity += Freq(s1, c) + Freq(s2, c)
end
return similarity
</pre>
; String Distance Wrapper Function
<pre>
int function MostFreqKSDF(string s1, string s2, int k, int n)
return n - MFKsimilarity(s1, s2, k)
</pre>
; Worked Examples:
The original paper gives several worked examples, including
these two:
(i) MostFreqKSDF("night", "nacht", 2, 10)
For "night", the "top two" letters with their frequencies are: n:1 and i:1.
For "nacht", the "top two" letters with their frequencies are: n:1 and a:1.
Since "n" is the only shared letter amongst these candidates, the
SDF is 10 - 1 - 1 = 8, as per the original paper.
(2)
MostFreqKSDF(
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG",
2, 100)
For the first string, the "top two" letters with their frequencies are: L:9 and T:8;
and for the second, the "top two" are F:9 and L:8.
Since "L" is the only shared letter amongst these top-k characters,
and since the sum of the corresponding frequencies is 9+8 = 17,
the SDF is 83.
|