Most frequent k chars distance: Difference between revisions

m
m (→‎{{header|Wren}}: Minor tidy)
 
(16 intermediate revisions by 8 users not shown)
Line 1:
{{clarifydraft task}}
 
{{draft task}}{{Wikipedia|Most frequent k characters}}
The string distance function (SDF) that is the subject of this entry
In [[wp:information theory|information theory]], the '''MostFreqKDistance''' is a [[wp:String metric|string metric]] for quickly estimating how [[wp:Similarity measure|similar]] two [[wp:Order theory|ordered sets]] or [[wp:String (computer science)|strings]] are. The scheme was invented by Sadi Evren SEKER,<ref name="mfkc"/> and initially used in [[wp:text mining|text mining]] applications like [[wp:author recognition|author recognition]].<ref name="mfkc">{{citation
was proposed in a paper published in 2014 as a method for quickly
estimating how [[wp:Similarity measure|similar]] two
[[wp:Order theory|ordered sets]] or
[[wp:String (computer science)|strings]] are.
This SDF has also been proposed as being suitable in bioinformatics, e.g. for comparing sequences in the [[wp:fasta]] format.
 
Unfortunately, the original paper
by Sadi Evren Seker et al.
(arXiv:1401.6596 [cs.DS])
<ref name="mfkc">{{citation
| last1 = SEKER | first1 = Sadi E. | author1-link = Sadi Evren SEKER
| last2 = Altun | first2 = Oguz
Line 12 ⟶ 22:
| publisher = [[wp:International Association of Computer Science and Information Technology Press (IACSIT Press)]]
| title = [[wp:International Journal of Machine Learning and Computing (IJMLC)]]
| url = httphttps://arxiv.org/abs/1401.6596
| year = 2014}}.</ref>
has a number of internal inconsistencies and for this and other
This method is originally based on a hashing function, MaxFreqKChars<ref name="hashfunc">{{citation
reasons is not entirely clear in several key respects, but the paper
| last1 = Seker | first1 = Sadi E. | author1-link = Sadi Evren SEKER
does give some worked examples and several of these (notably the two
| last2 = Mert | first2 = Cihan
given under the "Worked Examples" heading below) agree with the
| contribution = A Novel Feature Hashing For Text Mining
interpretation used by the authors of the Python entry herein, so that
| url = http://journal.ibsu.edu.ge/index.php/jtst/article/view/428
interpretation is used in the present task description.
| pages = 37 -41
| publisher = [[wp:International Black Sea University]]
| title = Journal of Technical Science and Technologies
| ISSN = 2298-0032
| volume = 2
| issue = 1
| year = 2013}}.</ref>
classical [[wp:author recognition|author recognition]] problem and idea first came out while studying [[wp:data stream mining|data stream mining]].<ref name="author">{{citation
| last1 = Seker | first1 = Sadi E. | author1-link = Sadi Evren SEKER
| last2 = Al-Naami | first2 = Khaled
| last3 = Khan | first3 = Latifur
| contribution = Author attribution on streaming data
| doi = 10.1109/IRI.2013.6642511
| url = http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6642511
| pages = 497-503
| publisher = [[wp:IEEE]]
| title = Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, San Fransisco, USA, Aug 14-16, 2013
| year = 2013}}.</ref>
The string distance
 
;The Task
;Definition
Method has two steps.
* [[wp:Hash function|Hash]] input strings str1 and str2 separately using MostFreqKHashing and output hstr1 and hstr2 respectively
* Calculate string distance (or string similarity coefficient) of two hash outputs, hstr1 and hstr2 and output an integer value
 
(a) Show a time-efficient implementation of the SDF as if by a
;Most Frequent K Hashing
function mostFreqKSDF(s1, s2, k, n), where s1 and s2 are arbitrary
The first step of algorithm is calculating the hashing based on the most frequent k characters. The hashing algorithm has below steps:
strings, and k and n are non-negative integers as explained below;
'''string function''' ''MostFreqKHashing'' ('''string''' inputString, '''int''' K)
'''def string''' outputString
'''for each''' distinct characters
'''count''' occurrence of each character
'''for''' i '''from''' 0 '''to''' K
'''char''' c = '''next''' most frequent ''i''<sup>th</sup> character (if two chars have same frequency than get the first occurrence in inputString)
'''int''' count = number of occurrence of the character
'''append''' to ''outputString'', c and count
'''end for'''
'''return''' outputString
 
(b) Show the values produced by the SDF for the following values of s1 and s2,
Aim of 'Most Frequent K Hashing' function is calculating the most count of each character and returning the K most frequent character with the character and count. Rules of hash can be listed as below:
with k = 2 and n = 10:
* Output will hold the character and count
* Most frequent character and count will appear before the least frequent at the output
* if two characters have equal frequency, the first appearing in input will appear before at the output
 
Similar to the most of [[wp:hashing function|hashing functions]], ''Most Frequent K Hashing'' is also a [[wp:one way function]].
 
;Most Frequent K Distance
Distance calculation between two strings is based on the hash outputs of two strings.
'''int function''' ''MostFreqKSimilarity'' ('''string''' inputStr1, '''string''' inputStr2)
'''def int''' similarity
'''for each''' c = '''next''' character '''from''' inputStr1
'''lookup''' c '''in''' inputStr2
'''if''' c '''is not null'''
similarity '''+=''' frequency of c in inputStr1 + frequency of c in inputStr2
'''return''' similarity
Above function, simply gets two input strings, previously outputted from the MostFreqKHashing function. From the most frequent k hashing function, the characters and their frequencies are returned. So, the similarity function calculates the similarity based on characters and their frequencies by checking if the same character appears on both strings and if their frequencies are equal.
 
In some implementations, the distance metric is required instead of similarity coefficient. In order to convert the output of above similarity coefficient to distance metric, the output can be subtracted from any constant value (like the maximum possible output value). For the case, it is also possible to implement a [[wp:wrapper function]] over above two functions.
 
;;String Distance Wrapper Function
In order to calculate the distance between two strings, below function can be implemented
'''int function''' MostFreqKSDF ('''string''' inputStr1, '''string''' inputStr2, '''int''' K, '''int''' maxDistance)
'''return''' maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
 
Any call to above string distance function will supply two input strings and a maximum distance value. The function will calculate the similarity and subtract that value from the maximum possible distance. It can be considered as a simple [[wp:additive inverse]] of similarity.
 
;Examples
Let's consider maximum 2 frequent hashing over two strings ‘research’ and ‘seeking’.
<lang javascript>MostFreqKHashing('research',2) = 'r2e2'</lang>
because we have 2 'r' and 2 'e' characters with the highest frequency and we return in the order they appear in the string.
<lang javascript>MostFreqKHashing('seeking',2) = 'e2s1'</lang>
Again we have character 'e' with highest frequency and rest of the characters have same frequency of 1, so we return the first character of equal frequencies, which is 's'.
Finally we make the comparison:
<lang javascript>MostFreqKSimilarity('r2e2','e2s1') = 2</lang>
We simply compared the outputs and only character occurring in both input is character 'e' and the occurrence in both input is 2.
Instead running the sample step by step as above, we can simply run by using the string distance wrapper function as below:
<lang javascript>MostFreqKSDF('research', 'seeking',2) = 2</lang>
 
Below table holds some sample runs between example inputs for K=2:
{|class="wikitable"
|-
!String Inputs
!"Top Two"
!Hash Outputs
!SDF Output (max fromk=2, n=10)
|-
|'night'
'nacht'
|n:1 i:1
|n1i1
n:1 a:1
n1a1
|98
|-
|'my'
'a'
|m:1 y:1
|m1y1
a:1
a1NULL0
|10
|-
|‘research’
‘research’
|r:2 e:2
|r2e2
r:2 e:2
r2e2
|82
|-
|‘aaaaabbbb’
‘ababababa’
|a5b4
a5b4
|1
|-
|‘significant’
‘capabilities’
|i:3 n:2
|i3n2
i:3 a:2
i3a2
|54
|}
 
 
Method is also suitable for bioinformatics to compare the genetic strings like in [[wp:fasta]] format
(c) Show the value produced by the SDF for k = 2, n = 100,
and the two strings:
 
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
 
 
; Reference Algorithm
 
The original paper employs an unusual scheme for encoding the
frequencies of single letters as a string of letters and numbers. This
scheme has several disadvantages, and so in the following we take a
more abstract view and simply assume, solely for the purposes
of this description, that two functions are available:
 
1) MostFreqK(s, k) returns (in decreasing order of their frequency of occurrence)
the k most frequently occurring letters in s, with
any ties broken by the position of the first occurrence in s;
 
2) For any letter c in MostFreqK(s, k), Freq(s,c) returns the frequency
of c in s.
 
; MFKsimilarity
 
Next define MFKsimilarlity(s1, s2, k) as if by the following pseudo-code:
<pre>
int function MFKsimilarity(string s1, string s2, int k):
Str1= LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
let similarity = 0
Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
let mfk1 = MostFreqK(s1, k)
MostFreqKHashing(str1,2) = L9T8
let mfk2 = MostFreqK(s2, k)
MostFreqKHashing(str2,2) = F9L8
for each c in mfk1
MostFreqKSDF(str1,str2,2,100) = 83
if (c occurs in mfk2)
then similarity += Freq(s1, c) + Freq(s2, c)
end
return similarity
</pre>
 
; String Distance Wrapper Function
 
Now define MostFreqKSDF as if by:
<pre>
int function MostFreqKSDF(string s1, string s2, int k, int n)
return n - MFKsimilarity(s1, s2, k)
</pre>
 
; Worked Examples:
 
The original paper gives several worked examples, including
these two:
 
(i) MostFreqKSDF("night", "nacht", 2, 10)
 
For "night", the "top two" letters with their frequencies are: n:1 and i:1.
 
For "nacht", the "top two" letters with their frequencies are: n:1 and a:1.
 
Since "n" is the only shared letter amongst these candidates, the
SDF is 10 - 1 - 1 = 8, as per the original paper.
 
(2)
MostFreqKSDF(
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG",
2, 100)
 
For the first string, the "top two" letters with their frequencies are: L:9 and T:8;
and for the second, the "top two" are F:9 and L:8.
 
Since "L" is the only shared letter amongst these top-k characters, and since the sum of the corresponding frequencies is 9+8 = 17, the SDF is 83.
<br/>
 
=={{header|11l}}==
{{trans|Python}}
 
<syntaxhighlight lang="11l">F most_freq_khashing(inputString, K)
DefaultDict[Char, Int] occuDict
L(c) inputString
occuDict[c]++
V occuList = sorted(occuDict.items(), key' x -> x[1], reverse' 1B)
V outputDict = Dict(occuList[0 .< K])
R outputDict
 
F most_freq_ksimilarity(inputStr1, inputStr2)
V similarity = 0
L(c, cnt1) inputStr1
I c C inputStr2
V cnt2 = inputStr2[c]
similarity += cnt1 + cnt2
R similarity
 
F most_freq_ksdf(inputStr1, inputStr2, K, maxDistance)
R maxDistance - most_freq_ksimilarity(most_freq_khashing(inputStr1, K), most_freq_khashing(inputStr2, K))
 
V str1 = ‘LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV’
V str2 = ‘EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG’
V K = 2
V maxDistance = 100
V dict1 = most_freq_khashing(str1, 2)
print(dict1, end' ":\n")
print(dict1.map((c, cnt) -> c‘’String(cnt)).join(‘’))
V dict2 = most_freq_khashing(str2, 2)
print(dict2, end' ":\n")
print(dict2.map((c, cnt) -> c‘’String(cnt)).join(‘’))
print(most_freq_ksdf(str1, str2, K, maxDistance))</syntaxhighlight>
 
{{out}}
<pre>
[L = 9, T = 8]:
L9T8
[F = 9, L = 8]:
F9L8
83
</pre>
 
=={{header|C++}}==
<langsyntaxhighlight lang="cpp">#include <string>
#include <vector>
#include <map>
Line 200 ⟶ 242:
return 0 ;
}
</syntaxhighlight>
</lang>
{{out}}
<pre>MostFreqKHashing( s1 , 2 ) = L9T8
MostFreqKHashing( s2 , 2 ) = F9L8
</pre>
 
=={{header|Go}}==
{{trans|Kotlin}}
<syntaxhighlight lang="go">package main
 
import (
"fmt"
"sort"
)
 
type cf struct {
c rune
f int
}
 
func reverseStr(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
 
func indexOfCf(cfs []cf, r rune) int {
for i, cf := range cfs {
if cf.c == r {
return i
}
}
return -1
}
 
func minOf(i, j int) int {
if i < j {
return i
}
return j
}
 
func mostFreqKHashing(input string, k int) string {
var cfs []cf
for _, r := range input {
ix := indexOfCf(cfs, r)
if ix >= 0 {
cfs[ix].f++
} else {
cfs = append(cfs, cf{r, 1})
}
}
sort.SliceStable(cfs, func(i, j int) bool {
return cfs[i].f > cfs[j].f // descending, preserves order when equal
})
acc := ""
min := minOf(len(cfs), k)
for _, cf := range cfs[:min] {
acc += fmt.Sprintf("%c%c", cf.c, cf.f)
}
return acc
}
 
func mostFreqKSimilarity(input1, input2 string) int {
similarity := 0
runes1, runes2 := []rune(input1), []rune(input2)
for i := 0; i < len(runes1); i += 2 {
for j := 0; j < len(runes2); j += 2 {
if runes1[i] == runes2[j] {
freq1, freq2 := runes1[i+1], runes2[j+1]
if freq1 != freq2 {
continue // assuming here that frequencies need to match
}
similarity += int(freq1)
}
}
}
return similarity
}
 
func mostFreqKSDF(input1, input2 string, k, maxDistance int) {
fmt.Println("input1 :", input1)
fmt.Println("input2 :", input2)
s1 := mostFreqKHashing(input1, k)
s2 := mostFreqKHashing(input2, k)
fmt.Printf("mfkh(input1, %d) = ", k)
for i, c := range s1 {
if i%2 == 0 {
fmt.Printf("%c", c)
} else {
fmt.Printf("%d", c)
}
}
fmt.Printf("\nmfkh(input2, %d) = ", k)
for i, c := range s2 {
if i%2 == 0 {
fmt.Printf("%c", c)
} else {
fmt.Printf("%d", c)
}
}
result := maxDistance - mostFreqKSimilarity(s1, s2)
fmt.Printf("\nSDF(input1, input2, %d, %d) = %d\n\n", k, maxDistance, result)
}
 
func main() {
pairs := [][2]string{
{"research", "seeking"},
{"night", "nacht"},
{"my", "a"},
{"research", "research"},
{"aaaaabbbb", "ababababa"},
{"significant", "capabilities"},
}
for _, pair := range pairs {
mostFreqKSDF(pair[0], pair[1], 2, 10)
}
 
s1 := "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
s2 := "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
mostFreqKSDF(s1, s2, 2, 100)
s1 = "abracadabra12121212121abracadabra12121212121"
s2 = reverseStr(s1)
mostFreqKSDF(s1, s2, 2, 100)
}</syntaxhighlight>
 
{{out}}
<pre>
input1 : research
input2 : seeking
mfkh(input1, 2) = r2e2
mfkh(input2, 2) = e2s1
SDF(input1, input2, 2, 10) = 8
 
input1 : night
input2 : nacht
mfkh(input1, 2) = n1i1
mfkh(input2, 2) = n1a1
SDF(input1, input2, 2, 10) = 9
 
input1 : my
input2 : a
mfkh(input1, 2) = m1y1
mfkh(input2, 2) = a1
SDF(input1, input2, 2, 10) = 10
 
input1 : research
input2 : research
mfkh(input1, 2) = r2e2
mfkh(input2, 2) = r2e2
SDF(input1, input2, 2, 10) = 6
 
input1 : aaaaabbbb
input2 : ababababa
mfkh(input1, 2) = a5b4
mfkh(input2, 2) = a5b4
SDF(input1, input2, 2, 10) = 1
 
input1 : significant
input2 : capabilities
mfkh(input1, 2) = i3n2
mfkh(input2, 2) = i3a2
SDF(input1, input2, 2, 10) = 7
 
input1 : LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
input2 : EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
mfkh(input1, 2) = L9T8
mfkh(input2, 2) = F9L8
SDF(input1, input2, 2, 100) = 100
 
input1 : abracadabra12121212121abracadabra12121212121
input2 : 12121212121arbadacarba12121212121arbadacarba
mfkh(input1, 2) = 112a10
mfkh(input2, 2) = 112210
SDF(input1, input2, 2, 100) = 88
</pre>
 
=={{header|Haskell}}==
<langsyntaxhighlight Haskelllang="haskell">module MostFrequentK
where
import Data.List ( nub , sortBy )
Line 246 ⟶ 461:
mostFreqKSDF :: String -> String -> Int ->Int
mostFreqKSDF s t n = mostFreqKSimilarity ( mostFreqKHashing s n ) (mostFreqKHashing t n )
</syntaxhighlight>
</lang>
{{out}}
<pre>mostFrequentKHashing "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" 2
Line 255 ⟶ 470:
 
=={{header|J}}==
'''Solution''':<langsyntaxhighlight lang="j">NB. String Distance Wrapper Function
mfksDF =: {:@:[ - (mfks@:(mfkh&.>)~ {.)~
 
Line 272 ⟶ 487:
NB. No need to fix mfksDF
mfkh =: mfkh f.
mfks =: mfks f.</langsyntaxhighlight>
 
'''Examples''':<langsyntaxhighlight lang="j">verb define ''
fkh =. ;@:,@:(":&.>) NB. format k hash
 
Line 301 ⟶ 516:
'pass'
)
pass</langsyntaxhighlight>
'''Notes''': As of press time, there are significant discrepancies between the task description, its pseudocode, the test cases provided, and the two other existing implementations. See the [[Talk:Most frequent k chars distance#Prank_Page.3F|talk page]] for the assumptions made in this implementation to reconcile these discrepancies (in particular, in the scoring function).
 
 
 
=={{header|Java}}==
Line 310 ⟶ 523:
Translation of the pseudo-code of the Wikipedia article [[wp:Most frequent k characters]] to [[wp:java]] implementation of three functions given in the definition section are given below with [[wp:JavaDoc]] comments:
 
<langsyntaxhighlight lang="java">import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
Line 436 ⟶ 649:
}
 
}</langsyntaxhighlight>
 
 
=={{header|Javascript}}==
Defines the functions for counting and frequency outside of the hashing function, as it is easier to calculate the comparison from a structured result rather than parsing a hash.
 
<syntaxhighlight lang="javascript">//returns an object of counts keyed by character
const kCounts = str => {
const counts = {};
for (let char of str) {
counts[char] = counts[char] ? counts[char] + 1 : 1;
}
return counts;
};
 
//returns an array of length k containing the characters with the highest counts
const frequentK = (counts, k) => {
//note that this is written for clarity rather than speed,
//as it sorts all of the counts when only the top k are needed
return Object.keys(counts)
.sort((a, b) => counts[b] - counts[a])
.slice(0, k);
};
 
//returns a hashed string of the most frequent k characters and their frequencies
const mostFreqKHashing = (str, k) => {
const counts = kCounts(str);
return frequentK(counts, k)
.map(char => char + counts[char])
.join("");
};
 
//numeric score of similarity based on the sum of counts of characters appearing in the top k of both strings
const mostFreqKSimilarity = (str1, str2, k) => {
const counts1 = kCounts(str1);
const counts2 = kCounts(str2);
const freq1 = frequentK(counts1, k);
const freq2 = frequentK(counts2, k);
let similarity = 0;
for (let char of freq1) {
//only considers a character if it is in the top k of both strings
if (freq2.includes(char)) {
//the examples on this page are inconsistent in whether the similarity
//should be the sum of the two counts or only the shared count (the minimum of the two)
//this code uses the sum
similarity += counts1[char] + counts2[char];
}
}
return similarity;
};
 
//subtracts the similarity score from the maxDifference
const mostFreqKSDF = (str1, str2, k, maxDistance) => {
return maxDistance - mostFreqKSimilarity(str1, str2, k);
};</syntaxhighlight>
 
'''Testing with Jest'''
<syntaxhighlight lang="javascript">test('hash of "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV" is "L9T8"', () => {
expect(mostFreqKHashing(str1, 2)).toBe("L9T8");
});
 
test('hash of "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG" is "F9L8"', () => {
expect(mostFreqKHashing(str2, 2)).toBe("F9L8");
});
 
test("SDF of strings 1 and 2 with k=2 and max=100 is 83", () => {
expect(mostFreqKSDF(str1, str2, 2, 100)).toBe(83);
});</syntaxhighlight>
 
'''Testing in Console'''
<syntaxhighlight lang="javascript">const str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV";
const str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG";
const K = 2;
const maxDistance = 100;
console.log(mostFreqKHashing(str1, K));
console.log(mostFreqKHashing(str2, K));
console.log(mostFreqKSDF(str1, str2, K, maxDistance));</syntaxhighlight>
 
{{out}}
<pre>L9T8
F9L8
83</pre>
 
=={{header|jq}}==
{{works with|jq}}
'''Works with gojq, the Go implementation of jq'''
 
'''Preliminaries'''
<syntaxhighlight lang="jq"># bag of words
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
 
# Like sort_by(f) but for items that compare equal, retain the original order
def sort_by_decreasing(f):
# Add $n-$i
def enumerate: . as $in | length as $n | reduce range(0;$n) as $i ([]; . + [ [$n-$i, $in[$i] ] ]);
enumerate
| sort_by((.[1]|f), .[0])
| reverse
| map(.[1]);</syntaxhighlight>
'''The Tasks'''
<syntaxhighlight lang="jq"># Output: { characters: array_of_characters_in_decreasing_order_of_frequency, frequency: object}
def MostFreqKHashing($K):
def chars: explode | range(0;length) as $i | [.[$i]] | implode;
. as $in
| bow(tostring | chars) as $bow
| $bow
| to_entries
| sort_by_decreasing(.value) # if two chars have same frequency than get the first occurrence in $in
| (reduce .[0:$K][] as $kv ({}; .[$kv.key] = $kv.value) ) as $frequency
| {characters: map(.key)[:$K], $frequency};
 
def MostFreqKSimilarity($in1; $in2; $K):
[$in1, $in2] | map( MostFreqKHashing($K)) as [$s1, $s2]
| reduce $s1.characters[] as $c (0;
$s2.frequency[$c] as $f
| if $f then . + $s1.frequency[$c] + $f
else . end) ;
 
def MostFreqKSDF($inputStr1; $inputStr2; $K; $maxDistance):
$maxDistance - MostFreqKSimilarity($inputStr1; $inputStr2; $K);
 
def MostFreqKSDF($K; $maxDistance):
. as [$inputStr1, $inputStr2]
| $maxDistance - MostFreqKSimilarity($inputStr1; $inputStr2; $K);
 
def task2:
["night", "nacht"],
["my", "a"],
["research", "research"],
["research", "seeking"],
["significant", "capabilities"]
| MostFreqKSDF(2; 10) as $sdk
| [., $sdk] ;
 
def task100:
["LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV",
"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"]
| MostFreqKSDF(2; 100);
 
task2, task100</syntaxhighlight>
{{out}}
<pre>
[["night","nacht"],8]
[["my","a"],10]
[["research","research"],2]
[["research","seeking"],6]
[["significant","capabilities"],4]
83
</pre>
 
=={{header|Kotlin}}==
Line 446 ⟶ 808:
 
I have no idea what NULL0 is supposed to mean so I've ignored that.
<langsyntaxhighlight lang="scala">// version 1.1.51
 
fun mostFreqKHashing(input: String, k: Int): String =
Line 499 ⟶ 861:
s2 = s1.reversed()
mostFreqKSDF(s1, s2, 2, 100)
}</langsyntaxhighlight>
 
{{out}}
Line 551 ⟶ 913:
SDF(input1, input2, 2, 100) = 88
</pre>
 
=={{header|Nim}}==
It is impossible to get consistent results. We chose to implement Wikipedia algorithm (note that the page is no longer available, except in Internet Archive). With this algorithm, we get the same results as those of Wikipedia except for the last example where we get 91 instead of 83.
 
In order to avoid the limitation to 10 occurrences, we chose to use an ordered table (equivalent to Python OrderedDict). We could have used Kotlin solution or used a sequence of tuples (char, count) but the ordered table gives better performance.
 
<syntaxhighlight lang="nim">import algorithm, sugar, tables
 
type CharCounts = OrderedTable[char, int]
 
func `$`(counts: CharCounts): string =
for (c, count) in counts.pairs:
result.add $c & $count
 
func mostFreqKHashing(str: string; k: Positive): CharCounts =
var counts: CharCounts # To get the counts in apparition order.
for c in str: inc counts.mgetOrPut(c, 0)
counts.sort((x, y) => cmp(x[1], y[1]), Descending) # Note that sort is stable.
var count = 0
for c, val in counts.pairs:
inc count
result[c] = val
if count == k: break
 
func mostFreqKSimilarity(input1, input2: CharCounts): int =
for c, count in input1.pairs:
if c in input2:
result += count
 
func mostFreqKSDF(str1, str2: string; k, maxDistance: Positive): int =
maxDistance - mostFreqKSimilarity(mostFreqKHashing(str1, k), mostFreqKHashing(str2, k))
 
 
const
 
Pairs = [("night", "nacht"),
("my", "a"),
("research", "research"),
("aaaaabbbb", "ababababa"),
("significant", "capabilities")]
 
for (str1, str2) in Pairs:
echo "str1: ", str1
echo "str2: ", str2
echo "mostFreqKHashing(str1, 2) = ", mostFreqKHashing(str1, 2)
echo "mostFreqKHashing(str2, 2) = ", mostFreqKHashing(str2, 2)
echo "mostFreqKSDF(str1, str2, 2, 10) = ", mostFreqKSDF(str1, str2, 2, 10)
echo()
 
const
S1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
S2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
 
echo "str1: ", S1
echo "str2: ", S2
echo "mostFreqKHashing(str1, 2) = ", mostFreqKHashing(S1, 2)
echo "mostFreqKHashing(str2, 2) = ", mostFreqKHashing(S2, 2)
echo "mostFreqKSDF(str1, str2, 2, 100) = ", mostFreqKSDF(S1, S2, 2, 100)</syntaxhighlight>
 
{{out}}
<pre>str1: night
str2: nacht
mostFreqKHashing(str1, 2) = n1i1
mostFreqKHashing(str2, 2) = n1a1
mostFreqKSDF(str1, str2, 2, 10) = 9
 
str1: my
str2: a
mostFreqKHashing(str1, 2) = m1y1
mostFreqKHashing(str2, 2) = a1
mostFreqKSDF(str1, str2, 2, 10) = 10
 
str1: research
str2: research
mostFreqKHashing(str1, 2) = r2e2
mostFreqKHashing(str2, 2) = r2e2
mostFreqKSDF(str1, str2, 2, 10) = 6
 
str1: aaaaabbbb
str2: ababababa
mostFreqKHashing(str1, 2) = a5b4
mostFreqKHashing(str2, 2) = a5b4
mostFreqKSDF(str1, str2, 2, 10) = 1
 
str1: significant
str2: capabilities
mostFreqKHashing(str1, 2) = i3n2
mostFreqKHashing(str2, 2) = i3a2
mostFreqKSDF(str1, str2, 2, 10) = 7
 
str1: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
str2: EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
mostFreqKHashing(str1, 2) = L9T8
mostFreqKHashing(str2, 2) = F9L8
mostFreqKSDF(str1, str2, 2, 100) = 91</pre>
 
=={{header|Perl}}==
<langsyntaxhighlight Perllang="perl">#!/usr/bin/perl
use strict ;
use warnings ;
Line 603 ⟶ 1,060:
print "MostFreqKHashing ( " . '$firststring , 2)' . " is " . mostFreqHashing( $firststring , 2 ) . "\n" ;
print "MostFreqKHashing ( " . '$secondstring , 2)' . " is " . mostFreqHashing( $secondstring , 2 ) . "\n" ;
</syntaxhighlight>
</lang>
{{out}}
<pre>MostFreqKHashing ( $firststring , 2) is L9T8
MostFreqKHashing ( $secondstring , 2) is F9L8</pre>
 
=={{header|Perl 6Phix}}==
{{trans|Go}}
{{works with|Rakudo|2017.09}}
<!--<syntaxhighlight lang="phix">(phixonline)-->
My initial impressions of this task are registered on the discussion page under "Prank Page?".
<span style="color: #008080;">with</span> <span style="color: #008080;">javascript_semantics</span>
 
<span style="color: #008080;">function</span> <span style="color: #000000;">mostFreqKHashing</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">input</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">k</span><span style="color: #0000FF;">)</span>
The "most frequent k characters" hashing function is straightforward enough to implement. The distance function is incomprehensible though. The description doesn't match the pseudo-code and neither of them match the examples. I'm not going to bother trying to figure it out unless there is some possible value.
<span style="color: #004080;">string</span> <span style="color: #000000;">cfs</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span>
 
<span style="color: #004080;">sequence</span> <span style="color: #000000;">cfsnx</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
Maybe I am too hasty though. Lets give it a try. Implement a MFKC routine and run an assortment of words through it to get a feel for how it hashes different words.
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">input</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
 
<span style="color: #004080;">integer</span> <span style="color: #000000;">r</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">input</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">],</span>
<lang perl6># Fairly straightforward implementation, actually returns a list of pairs
<span style="color: #000000;">ix</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">r</span><span style="color: #0000FF;">,</span><span style="color: #000000;">cfs</span><span style="color: #0000FF;">)</span>
# which can be joined to make a string or manipulated further.
<span style="color: #008080;">if</span> <span style="color: #000000;">ix</span><span style="color: #0000FF;">></span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span>
 
<span style="color: #000000;">cfsnx</span><span style="color: #0000FF;">[</span><span style="color: #000000;">ix</span><span style="color: #0000FF;">][</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">-=</span> <span style="color: #000000;">1</span>
sub mfkh ($string, \K = 2) {
<span style="color: #008080;">else</span>
my %h;
<span style="color: #000000;">cfs</span> <span style="color: #0000FF;">&=</span> <span style="color: #000000;">r</span>
$string.comb.kv.map: { %h{$^v}[1] //= $^k; %h{$^v}[0]++ };
<span style="color: #000000;">cfsnx</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">cfsnx</span><span style="color: #0000FF;">,{-</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">cfs</span><span style="color: #0000FF;">)})</span>
%h.sort( { -$_.value[0], $_.value[1] } ).head(K).map( { $_.key => $_.value[0] } );
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
}
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
 
<span style="color: #000000;">cfsnx</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">sort</span><span style="color: #0000FF;">(</span><span style="color: #000000;">cfsnx</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- (aside: the idx forces stable sort)</span>
# lets try running 150 or so words from unixdic.txt through it to see
<span style="color: #004080;">sequence</span> <span style="color: #000000;">acc</span> <span style="color: #0000FF;">:=</span> <span style="color: #0000FF;">{}</span>
# how many unique hash values it comes up with.
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">min</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">cfs</span><span style="color: #0000FF;">),</span><span style="color: #000000;">k</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
 
<span style="color: #004080;">integer</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">n</span><span style="color: #0000FF;">,</span><span style="color: #000000;">ix</span><span style="color: #0000FF;">}</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">cfsnx</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
my @test-words = <
<span style="color: #000000;">acc</span> <span style="color: #0000FF;">&=</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">cfs</span><span style="color: #0000FF;">[</span><span style="color: #000000;">ix</span><span style="color: #0000FF;">],</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">n</span><span style="color: #0000FF;">}</span>
aminobenzoic arginine asinine biennial biennium brigantine brinkmanship
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
britannic chinquapin clinging corinthian declination dickinson dimension
<span style="color: #008080;">return</span> <span style="color: #000000;">acc</span>
dinnertime dionysian diophantine dominican financial financier finessing
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
fingernail fingerprint finnish giovanni hopkinsian inaction inalienable
inanimate incaution incendiary incentive inception incident incidental
<span style="color: #008080;">function</span> <span style="color: #000000;">mostFreqKSimilarity</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">input1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">input2</span><span style="color: #0000FF;">)</span>
incinerate incline inclusion incommunicable incompletion inconceivable
<span style="color: #004080;">integer</span> <span style="color: #000000;">similarity</span> <span style="color: #0000FF;">:=</span> <span style="color: #000000;">0</span>
inconclusive incongruity inconsiderable inconsiderate inconspicuous
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">input1</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">by</span> <span style="color: #000000;">2</span> <span style="color: #008080;">do</span>
incontrovertible inconvertible incurring incursion indefinable indemnify
<span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">input2</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">by</span> <span style="color: #000000;">2</span> <span style="color: #008080;">do</span>
indemnity indeterminacy indian indiana indicant indifferent indigene
<span style="color: #008080;">if</span> <span style="color: #000000;">input1</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">==</span> <span style="color: #000000;">input2</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">then</span>
indigenous indigent indispensable indochina indochinese indoctrinate
<span style="color: #004080;">integer</span> <span style="color: #000000;">freq1</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">input1</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">+</span><span style="color: #000000;">1</span><span style="color: #0000FF;">],</span>
indonesia inequivalent inexplainable infantile inferential inferring
<span style="color: #000000;">freq2</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">input2</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">+</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]</span>
infestation inflammation inflationary influential information infringe
<span style="color: #008080;">if</span> <span style="color: #000000;">freq1</span><span style="color: #0000FF;">=</span><span style="color: #000000;">freq2</span> <span style="color: #008080;">then</span>
infusion ingenious ingenuity ingestion ingredient inhabitant inhalation
<span style="color: #000000;">similarity</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">freq1</span>
inharmonious inheritance inholding inhomogeneity inkling inoffensive
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
inordinate inorganic inputting inseminate insensible insincere insinuate
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
insistent insomnia insomniac insouciant installation instinct instinctual
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
insubordinate insubstantial insulin insurrection intangible intelligent
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
intensify intensive interception interruption intestinal intestine
<span style="color: #008080;">return</span> <span style="color: #000000;">similarity</span>
intoxicant introduction introversion intrusion invariant invasion inventive
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
inversion involution justinian kidnapping kingpin lineprinter liniment
livingston mainline mcginnis minion minneapolis minnie pigmentation
<span style="color: #008080;">function</span> <span style="color: #000000;">flat</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">s</span><span style="color: #0000FF;">)</span>
pincushion pinion quinine quintessential resignation ruination seminarian
<span style="color: #004080;">string</span> <span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span>
triennial wilkinson wilmington wilsonian wineskin winnie winnipeg
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">by</span> <span style="color: #000000;">2</span> <span style="color: #008080;">do</span>
>;
<span style="color: #000000;">res</span> <span style="color: #0000FF;">&=</span> <span style="color: #7060A8;">sprintf</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"%c%d"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">s</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">..</span><span style="color: #000000;">i</span><span style="color: #0000FF;">+</span><span style="color: #000000;">1</span><span style="color: #0000FF;">])</span>
 
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
say @test-words.map( { join '', mfkh($_)».kv } ).Bag;</lang>
<span style="color: #008080;">return</span> <span style="color: #000000;">res</span>
{{out}}
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<pre>Bag(i2n2(151))</pre>
One... Nope, I was right, it ''is'' pretty much worthless.
<span style="color: #008080;">procedure</span> <span style="color: #000000;">mostFreqKSDF</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">input1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">input2</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">k</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">maxDistance</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"input1 : %s\n"</span><span style="color: #0000FF;">,</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">input1</span><span style="color: #0000FF;">})</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"input2 : %s\n"</span><span style="color: #0000FF;">,</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">input2</span><span style="color: #0000FF;">})</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">s1</span> <span style="color: #0000FF;">:=</span> <span style="color: #000000;">mostFreqKHashing</span><span style="color: #0000FF;">(</span><span style="color: #000000;">input1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">k</span><span style="color: #0000FF;">),</span>
<span style="color: #000000;">s2</span> <span style="color: #0000FF;">:=</span> <span style="color: #000000;">mostFreqKHashing</span><span style="color: #0000FF;">(</span><span style="color: #000000;">input2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">k</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"mfkh(input1, %d) = %s\n"</span><span style="color: #0000FF;">,</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">k</span><span style="color: #0000FF;">,</span><span style="color: #000000;">flat</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s1</span><span style="color: #0000FF;">)})</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"mfkh(input2, %d) = %s\n"</span><span style="color: #0000FF;">,</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">k</span><span style="color: #0000FF;">,</span><span style="color: #000000;">flat</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s2</span><span style="color: #0000FF;">)})</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">result</span> <span style="color: #0000FF;">:=</span> <span style="color: #000000;">maxDistance</span> <span style="color: #0000FF;">-</span> <span style="color: #000000;">mostFreqKSimilarity</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">s2</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"SDF(input1, input2, %d, %d) = %d\n\n"</span><span style="color: #0000FF;">,</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">k</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">maxDistance</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">result</span><span style="color: #0000FF;">})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">constant</span> <span style="color: #000000;">tests</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{{</span><span style="color: #008000;">"research"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"seeking"</span><span style="color: #0000FF;">},</span>
<span style="color: #0000FF;">{</span><span style="color: #008000;">"night"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"nacht"</span><span style="color: #0000FF;">},</span>
<span style="color: #0000FF;">{</span><span style="color: #008000;">"my"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"a"</span><span style="color: #0000FF;">},</span>
<span style="color: #0000FF;">{</span><span style="color: #008000;">"research"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"research"</span><span style="color: #0000FF;">},</span>
<span style="color: #0000FF;">{</span><span style="color: #008000;">"aaaaabbbb"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"ababababa"</span><span style="color: #0000FF;">},</span>
<span style="color: #0000FF;">{</span><span style="color: #008000;">"significant"</span><span style="color: #0000FF;">,</span> <span style="color: #008000;">"capabilities"</span><span style="color: #0000FF;">}}</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">tests</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">string</span> <span style="color: #0000FF;">{</span><span style="color: #000000;">t1</span><span style="color: #0000FF;">,</span><span style="color: #000000;">t2</span><span style="color: #0000FF;">}</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">tests</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #000000;">mostFreqKSDF</span><span style="color: #0000FF;">(</span><span style="color: #000000;">t1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">t2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">10</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">s1</span> <span style="color: #0000FF;">:=</span> <span style="color: #008000;">"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"</span><span style="color: #0000FF;">,</span>
<span style="color: #000000;">s2</span> <span style="color: #0000FF;">:=</span> <span style="color: #008000;">"EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"</span>
<span style="color: #000000;">mostFreqKSDF</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">s2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">100</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">s1</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">"abracadabra12121212121abracadabra12121212121"</span>
<span style="color: #000000;">s2</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">reverse</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s1</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">mostFreqKSDF</span><span style="color: #0000FF;">(</span><span style="color: #000000;">s1</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">s2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">2</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">100</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
Output matches Go and Kotlin.
 
=={{header|Python}}==
{{works with|Python|2.7+}}
'''unoptimized and limited'''
<langsyntaxhighlight lang="python">import collections
def MostFreqKHashing(inputString, K):
occuDict = collections.defaultdict(int)
Line 685 ⟶ 1,172:
 
def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))</langsyntaxhighlight>
 
'''optimized'''
 
A version that replaces the intermediate string with OrderedDict to reduce the time complexity of lookup operation:
<langsyntaxhighlight lang="python">import collections
def MostFreqKHashing(inputString, K):
occuDict = collections.defaultdict(int)
Line 710 ⟶ 1,197:
 
def MostFreqKSDF(inputStr1, inputStr2, K, maxDistance):
return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))</langsyntaxhighlight>
Test:
<langsyntaxhighlight lang="python">str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
K = 2
Line 722 ⟶ 1,209:
print("%s:"%dict2)
print(''.join(c + str(cnt) for c, cnt in dict2.items()))
print(MostFreqKSDF(str1, str2, K, maxDistance))</langsyntaxhighlight>
{{out}}
<pre>
Line 734 ⟶ 1,221:
=={{header|Racket}}==
 
<langsyntaxhighlight Racketlang="racket">#lang racket
 
(define (MostFreqKHashing inputString K)
Line 761 ⟶ 1,248:
;; (Should add more tests, but it looks like there's a bunch of mistakes
;; in the given tests...)
</syntaxhighlight>
</lang>
 
=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2017.09}}
My initial impressions of this task are registered on the discussion page under "Prank Page?".
 
The "most frequent k characters" hashing function is straightforward enough to implement. The distance function is incomprehensible though. The description doesn't match the pseudo-code and neither of them match the examples. I'm not going to bother trying to figure it out unless there is some possible value.
 
Maybe I am too hasty though. Lets give it a try. Implement a MFKC routine and run an assortment of words through it to get a feel for how it hashes different words.
 
<syntaxhighlight lang="raku" line># Fairly straightforward implementation, actually returns a list of pairs
# which can be joined to make a string or manipulated further.
 
sub mfkh ($string, \K = 2) {
my %h;
$string.comb.kv.map: { %h{$^v}[1] //= $^k; %h{$^v}[0]++ };
%h.sort( { -$_.value[0], $_.value[1] } ).head(K).map( { $_.key => $_.value[0] } );
}
 
# lets try running 150 or so words from unixdic.txt through it to see
# how many unique hash values it comes up with.
 
my @test-words = <
aminobenzoic arginine asinine biennial biennium brigantine brinkmanship
britannic chinquapin clinging corinthian declination dickinson dimension
dinnertime dionysian diophantine dominican financial financier finessing
fingernail fingerprint finnish giovanni hopkinsian inaction inalienable
inanimate incaution incendiary incentive inception incident incidental
incinerate incline inclusion incommunicable incompletion inconceivable
inconclusive incongruity inconsiderable inconsiderate inconspicuous
incontrovertible inconvertible incurring incursion indefinable indemnify
indemnity indeterminacy indian indiana indicant indifferent indigene
indigenous indigent indispensable indochina indochinese indoctrinate
indonesia inequivalent inexplainable infantile inferential inferring
infestation inflammation inflationary influential information infringe
infusion ingenious ingenuity ingestion ingredient inhabitant inhalation
inharmonious inheritance inholding inhomogeneity inkling inoffensive
inordinate inorganic inputting inseminate insensible insincere insinuate
insistent insomnia insomniac insouciant installation instinct instinctual
insubordinate insubstantial insulin insurrection intangible intelligent
intensify intensive interception interruption intestinal intestine
intoxicant introduction introversion intrusion invariant invasion inventive
inversion involution justinian kidnapping kingpin lineprinter liniment
livingston mainline mcginnis minion minneapolis minnie pigmentation
pincushion pinion quinine quintessential resignation ruination seminarian
triennial wilkinson wilmington wilsonian wineskin winnie winnipeg
>;
 
say @test-words.map( { join '', mfkh($_)».kv } ).Bag;</syntaxhighlight>
{{out}}
<pre>Bag(i2n2(151))</pre>
One... Nope, I was right, it ''is'' pretty much worthless.
 
=={{header|Ring}}==
<langsyntaxhighlight lang="ring">
# Project : Most frequent k chars distance
# Date : 2018/02/16
# Author : Gal Zsolt (~ CalmoSoft ~)
# Email : <calmosoft@gmail.com>
 
load "stdlib.ring"
Line 811 ⟶ 1,347:
next
return aList
</syntaxhighlight>
</lang>
Output:
<pre>
Line 821 ⟶ 1,357:
 
=={{header|Sidef}}==
<langsyntaxhighlight lang="ruby">func _MostFreqKHashing(string, k) {
 
var seen = Hash()
Line 857 ⟶ 1,393:
}
}
}.sum(0)
}
 
Line 901 ⟶ 1,437:
MostFreqKSDF(s, t, k, limit),
)
}</langsyntaxhighlight>
{{out}}
<pre>
Line 920 ⟶ 1,456:
=={{header|Tcl}}==
{{works with|Tcl|8.6}}
<langsyntaxhighlight lang="tcl">package require Tcl 8.6
 
proc MostFreqKHashing {inputString k} {
Line 945 ⟶ 1,481:
set h2 [MostFreqKHashing $inputStr2 $k]
expr {$limit - [MostFreqKSimilarity $h1 $h2]}
}</langsyntaxhighlight>
Demonstrating:
<langsyntaxhighlight lang="tcl">set str1 "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
set str2 "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
puts [MostFreqKHashing $str1 2]
puts [MostFreqKHashing $str2 2]
puts [MostFreqKSDF $str1 $str2 2 100]</langsyntaxhighlight>
{{out}}
<pre>
Line 960 ⟶ 1,496:
;A more efficient metric calculator
This version is appreciably more efficient because it does not compute the intermediate string representation “hash”, instead working directly on the intermediate dictionaries and lists:
<langsyntaxhighlight lang="tcl">proc MostFreqKSDF {inputStr1 inputStr2 k limit} {
set c1 [set c2 {}]
foreach ch [split $inputStr1 ""] {dict incr c1 $ch}
Line 972 ⟶ 1,508:
}
return [expr {$limit - $s}]
}</langsyntaxhighlight>
It computes the identical value on the identical inputs.
 
=={{header|Typescript}}==
{{trans|Javascript}} with added type annotations
<syntaxhighlight lang="typescript">//returns an object of counts keyed by character
const kCounts = (str: string): Record<string, number> => {
const counts: Record<string, number> = {};
for (let char of str) {
counts[char] = counts[char] ? counts[char] + 1 : 1;
}
return counts;
};
 
//returns an array of length k containing the characters with the highest counts
const frequentK = ( counts: Record<string, number>, k: number ): string[] => {
//note that this is written for clarity rather than speed,
//as it sorts all of the counts when only the top k are needed
return Object.keys(counts)
.sort((a, b) => counts[b] - counts[a])
.slice(0, k);
};
 
//returns a hashed string of the most frequent k characters and their frequencies
const mostFreqKHashing = (str: string, k: number): string => {
const counts = kCounts(str);
return frequentK(counts, k)
.map(char => char + counts[char])
.join("");
};
 
//numeric score of similarity based on the sum of counts of characters appearing in the top k of both strings
const mostFreqKSimilarity = ( str1: string, str2: string, k: number ): number => {
const counts1 = kCounts(str1);
const counts2 = kCounts(str2);
const freq1 = frequentK(counts1, k);
const freq2 = frequentK(counts2, k);
let similarity = 0;
for (let char of freq1) {
//only considers a character if it is in the top k of both strings
if (freq2.includes(char)) {
//the examples on this page are inconsistent in whether the similarity
//should be the sum of the two counts or only the shared count (the minimum of the two)
//this code uses the sum
similarity += counts1[char] + counts2[char];
}
}
return similarity;
};
 
//subtracts the similarity score from the maxDifference
const mostFreqKSDF = ( str1: string, str2: string, k: number, maxDistance: number ): number => {
return maxDistance - mostFreqKSimilarity(str1, str2, k);
};</syntaxhighlight>
 
=={{header|Wren}}==
{{trans|Kotlin}}
{{libheader|Wren-seq}}
{{libheader|Wren-sort}}
{{libheader|Wren-iterate}}
<syntaxhighlight lang="wren">import "./seq" for Lst
import "./sort" for Sort
import "./iterate" for Stepped
 
var mostFreqKHashing = Fn.new { |input, k|
var indivs = Lst.individuals(input.toList).map { |indiv| [indiv[0], indiv[1]] }.toList
var cmp = Fn.new { |p1, p2| (p2[1] - p1[1]).sign }
Sort.insertion(indivs, cmp)
return indivs.take(k).reduce("") { |acc, p| acc + "%(p[0])%(String.fromByte(p[1]))" }
}
 
var mostFreqKSimilarity = Fn.new { |input1, input2|
var similarity = 0
for (i in Stepped.new(0...input1.count, 2)) {
for (j in Stepped.new(0...input2.count, 2)) {
if (input1[i] == input2[j]) {
var freq1 = input1[i + 1].bytes[0]
var freq2 = input2[j + 1].bytes[0]
if (freq1 == freq2) similarity = similarity + freq1
}
}
}
return similarity
}
 
var mostFreqKSDF = Fn.new { |input1, input2, k, maxDistance|
System.print("input1 : %(input1)")
System.print("input2 : %(input2)")
var s1 = mostFreqKHashing.call(input1, k)
var s2 = mostFreqKHashing.call(input2, k)
System.write("mfkh(input1, %(k)) = ")
var i = 0
for (c in s1) {
System.write((i % 2 == 0) ? c : c.bytes[0])
i = i + 1
}
System.write("\nmfkh(input2, %(k)) = ")
i = 0
for (c in s2) {
System.write((i % 2 == 0) ? c : c.bytes[0])
i = i + 1
}
var result = maxDistance - mostFreqKSimilarity.call(s1, s2)
System.print("\nSDF(input1, input2, %(k), %(maxDistance)) = %(result)\n")
}
 
var pairs = [
["research", "seeking"],
["night", "nacht"],
["my", "a"],
["research", "research"],
["aaaaabbbb", "ababababa"],
["significant", "capabilities"]
]
for (pair in pairs) mostFreqKSDF.call(pair[0], pair[1], 2, 10)
 
var s1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
var s2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"
mostFreqKSDF.call(s1, s2, 2, 100)
s1 = "abracadabra12121212121abracadabra12121212121"
s2 = s1[-1..0]
mostFreqKSDF.call(s1, s2, 2, 100)</syntaxhighlight>
 
{{out}}
<pre>
Sane as Kotlin entry.
</pre>
 
;References
9,476

edits