Inverted index: Difference between revisions
m (J: grammar fix) |
(J: ignore requests to invert a file multiple times) |
||
Line 85: | Line 85: | ||
invert=: verb define |
invert=: verb define |
||
files=: |
files=: files,todo=: ~.y-.files |
||
invert1 each |
>invert1 each todo |
||
) |
) |
||
Revision as of 20:25, 27 April 2010
You are encouraged to solve this task according to the task description, using any language you may know.
An Inverted Index is a data structure used to create full text search.
Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.
AutoHotkey
<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt
word2docs := object() ; autohotkey_L is needed.
stime := A_tickcount Loop, %files%, 0,1 {
tooltip,%A_index% / 500 wordList := WordsIn(A_LoopFileFullPath) InvertedIndex(wordList, A_loopFileFullpath)
}
tooltip msgbox, % "total time " (A_tickcount-stime)/1000
gosub, search return
search: Loop {
InputBox, keyword , input single keyword only msgbox, % foundDocs := findword(keyword)
} return
WordsIn(docpath) {
FileRead, content, %docpath% spos = 1 Loop { if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos)) break spos += strlen(match) this_wordList .= match "`n" } Sort, this_wordList, U return this_wordList
}
InvertedIndex(byref words, docpath) {
global word2docs
loop, parse, words, `n,`r { if A_loopField = continue word2docs[A_loopField] := word2docs[A_loopField] docpath "`n" }
}
findWord(word2find) {
global word2docs
if (word2docs[word2find] = "") return "" else return word2docs[word2find]
}</lang>
J
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.
<lang J>require'files regex strings'
files=:words=:buckets=: wordre=: rxcomp '[\w]+' HACK=: -.&(128}.a.) NB. work around bug in pcre parse=: ,@:rxfrom~ wordre&rxmatches
invert=: verb define
files=: files,todo=: ~.y-.files >invert1 each todo
)
invert1=: verb define
file=: files i.<y words=: ~.words,contents=: ~.parse HACK tolower fread y ind=: words i. contents buckets=:buckets,(1+words -&# buckets)#a: #buckets=: (file,~each ind{buckets) ind}buckets
)
search=: verb define
hits=. buckets{~words i.~.parse tolower y files {~ >([-.-.)each/hits
)</lang>
Example use:
<lang J> invert 'help/primer/cut.htm';'help/primer/end.htm';'help/primer/gui.htm'
>search 'finally learning'
help/primer/end.htm help/primer/gui.htm
>search 'argument'
help/primer/cut.htm help/primer/gui.htm
>search 'around'
help/primer/gui.htm</lang>
Tcl
<lang tcl>package require Tcl 8.5 proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters". # The other possible definition is to use 'non-space' characters. regexp -all -inline {\w+} $str
}
- Adds a document to the index. The index is a map from words to a map
- from filenames to lists of word locations.
proc addDocumentToIndex {filename} {
global index set f [open $filename] set data [read $f] close $f
set i 0 array set localidx {} foreach word [wordsInString $data] {
lappend localidx($word) $i incr i
}
# Transcribe into global index foreach {word places} [array get localidx] {
dict set index($word) $filename $places
}
}
- How to use the index to find files containing a word
proc findFilesForWord {word} {
global index if {[info exists index($word)]} {
return [dict keys $index($word)]
}
}
- How to use the index to find files containing all words from a list.
- Note that this does not use the locations within the file.
proc findFilesWithAllWords {words} {
set files [findFilesForWord [lindex $words 0]] foreach w [lrange $words 1 end] {
set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles
} return $files
}
- How to use the index to find a sequence of words in a file.
proc findFilesWithWordSequence {words} {
global index set files {} foreach w $words {
if {![info exist index($w)]} { return }
} dict for {file places} $index([lindex $words 0]) {
if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }
} return $files
}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right
- The actions invoked by various GUI buttons
proc AddFile {} {
global files set f [tk_getOpenFile] if {$f ne ""} {
addDocumentToIndex $f lappend files $f
}
} proc FindAll {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithAllWords $words] lappend found "Searching for files with all $terms" {*}$fs \
"---------------------" } proc FindSeq {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithWordSequence $words] lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------" }</lang>