Inverted index: Difference between revisions
(→Tcl: Added implementation) |
(J) |
||
Line 72: | Line 72: | ||
return word2docs[word2find] |
return word2docs[word2find] |
||
}</lang> |
}</lang> |
||
=={{header|J}}== |
|||
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor phrase searching. |
|||
<lang J>require'files regex strings' |
|||
files=:words=:buckets=:'' |
|||
wordre=: rxcomp '[\w'']+' |
|||
HACK=: -.&(128}.a.) NB. work around bug in pcre |
|||
parse=: ,@:rxfrom~ wordre&rxmatches |
|||
invert=: verb define |
|||
files=: ~.files,y |
|||
invert1 each y |
|||
) |
|||
invert1=: verb define |
|||
file=: files i.<y |
|||
words=: ~.words,contents=: ~.parse HACK tolower fread y |
|||
ind=: words i. contents |
|||
buckets=:buckets,(1+words -&# buckets)#a: |
|||
#buckets=: (file,~each ind{buckets) ind}buckets |
|||
) |
|||
search=: verb define |
|||
hits=. buckets{~words i.~.wordcut tolower y |
|||
files {~ >([-.-.)each/hits |
|||
)</lang> |
|||
Example use: |
|||
<lang J> invert 'help/primer/cut.htm';'help/primer/end.htm';'help/primer/gui.htm' |
|||
>search 'finally learning' |
|||
help/primer/end.htm |
|||
help/primer/gui.htm |
|||
>search 'argument' |
|||
help/primer/cut.htm |
|||
help/primer/gui.htm |
|||
>search 'around' |
|||
help/primer/gui.htm</lang> |
|||
=={{header|Tcl}}== |
=={{header|Tcl}}== |
Revision as of 19:52, 27 April 2010
You are encouraged to solve this task according to the task description, using any language you may know.
An Inverted Index is a data structure used to create full text search.
Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.
AutoHotkey
<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt
word2docs := object() ; autohotkey_L is needed.
stime := A_tickcount Loop, %files%, 0,1 {
tooltip,%A_index% / 500 wordList := WordsIn(A_LoopFileFullPath) InvertedIndex(wordList, A_loopFileFullpath)
}
tooltip msgbox, % "total time " (A_tickcount-stime)/1000
gosub, search return
search: Loop {
InputBox, keyword , input single keyword only msgbox, % foundDocs := findword(keyword)
} return
WordsIn(docpath) {
FileRead, content, %docpath% spos = 1 Loop { if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos)) break spos += strlen(match) this_wordList .= match "`n" } Sort, this_wordList, U return this_wordList
}
InvertedIndex(byref words, docpath) {
global word2docs
loop, parse, words, `n,`r { if A_loopField = continue word2docs[A_loopField] := word2docs[A_loopField] docpath "`n" }
}
findWord(word2find) {
global word2docs
if (word2docs[word2find] = "") return "" else return word2docs[word2find]
}</lang>
J
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor phrase searching.
<lang J>require'files regex strings'
files=:words=:buckets=: wordre=: rxcomp '[\w]+' HACK=: -.&(128}.a.) NB. work around bug in pcre parse=: ,@:rxfrom~ wordre&rxmatches
invert=: verb define
files=: ~.files,y invert1 each y
)
invert1=: verb define
file=: files i.<y words=: ~.words,contents=: ~.parse HACK tolower fread y ind=: words i. contents buckets=:buckets,(1+words -&# buckets)#a: #buckets=: (file,~each ind{buckets) ind}buckets
)
search=: verb define
hits=. buckets{~words i.~.wordcut tolower y files {~ >([-.-.)each/hits
)</lang>
Example use:
<lang J> invert 'help/primer/cut.htm';'help/primer/end.htm';'help/primer/gui.htm'
>search 'finally learning'
help/primer/end.htm help/primer/gui.htm
>search 'argument'
help/primer/cut.htm help/primer/gui.htm
>search 'around'
help/primer/gui.htm</lang>
Tcl
<lang tcl>package require Tcl 8.5 proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters". # The other possible definition is to use 'non-space' characters. regexp -all -inline {\w+} $str
}
- Adds a document to the index. The index is a map from words to a map
- from filenames to lists of word locations.
proc addDocumentToIndex {filename} {
global index set f [open $filename] set data [read $f] close $f
set i 0 array set localidx {} foreach word [wordsInString $data] {
lappend localidx($word) $i incr i
}
# Transcribe into global index foreach {word places} [array get localidx] {
dict set index($word) $filename $places
}
}
- How to use the index to find files containing a word
proc findFilesForWord {word} {
global index if {[info exists index($word)]} {
return [dict keys $index($word)]
}
}
- How to use the index to find files containing all words from a list.
- Note that this does not use the locations within the file.
proc findFilesWithAllWords {words} {
set files [findFilesForWord [lindex $words 0]] foreach w [lrange $words 1 end] {
set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles
} return $files
}
- How to use the index to find a sequence of words in a file.
proc findFilesWithWordSequence {words} {
global index set files {} foreach w $words {
if {![info exist index($w)]} { return }
} dict for {file places} $index([lindex $words 0]) {
if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }
} return $files
}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right
- The actions invoked by various GUI buttons
proc AddFile {} {
global files set f [tk_getOpenFile] if {$f ne ""} {
addDocumentToIndex $f lappend files $f
}
} proc FindAll {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithAllWords $words] lappend found "Searching for files with all $terms" {*}$fs \
"---------------------" } proc FindSeq {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithWordSequence $words] lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------" }</lang>