Inverted index: Difference between revisions
Underscore (talk | contribs) (→{{header|Perl}}: Cleaned up.) |
(Added PicoLisp) |
||
Line 210: | Line 210: | ||
print "$_\n" |
print "$_\n" |
||
foreach @{search_words_with_index({createindex(@ARGV)}, @searchwords)};</lang> |
foreach @{search_words_with_index({createindex(@ARGV)}, @searchwords)};</lang> |
||
=={{header|PicoLisp}}== |
|||
Assuming three files "file1", "file2" and "file3": |
|||
<pre>$ cat file1 |
|||
it is what it is |
|||
$ cat file2 |
|||
what is it |
|||
$ cat file3 |
|||
it is a banana</pre> |
|||
we can read them into a binary tree in the global variable '*MyIndex' |
|||
<lang PicoLisp>(off *MyIndex) |
|||
(use Word |
|||
(for File '("file1" "file2" "file3") |
|||
(in File |
|||
(while (skip) |
|||
(if (idx '*MyIndex (setq Word (till " ^I^J^M" T)) T) |
|||
(push1 (car @) File) |
|||
(set Word (cons File)) ) ) ) ) ) |
|||
(de searchFor @ |
|||
(apply sect |
|||
(extract |
|||
'((Word) (val (car (idx '*MyIndex Word)))) |
|||
(rest) ) ) )</lang> |
|||
Output: |
|||
<pre>: (searchFor "what" "is" "it") |
|||
-> ("file2" "file1") |
|||
: (searchFor "a" "banana") |
|||
-> ("file3") |
|||
: (searchFor "it" "is") |
|||
-> ("file3" "file2" "file1")</pre> |
|||
=={{header|Tcl}}== |
=={{header|Tcl}}== |
Revision as of 17:37, 21 June 2010
You are encouraged to solve this task according to the task description, using any language you may know.
An Inverted Index is a data structure used to create full text search.
Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.
AutoHotkey
<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt
word2docs := object() ; autohotkey_L is needed.
stime := A_tickcount Loop, %files%, 0,1 {
tooltip,%A_index% / 500 wordList := WordsIn(A_LoopFileFullPath) InvertedIndex(wordList, A_loopFileFullpath)
}
tooltip msgbox, % "total time " (A_tickcount-stime)/1000
gosub, search return
search: Loop {
InputBox, keyword , input single keyword only msgbox, % foundDocs := findword(keyword)
} return
WordsIn(docpath) {
FileRead, content, %docpath% spos = 1 Loop { if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos)) break spos += strlen(match) this_wordList .= match "`n" } Sort, this_wordList, U return this_wordList
}
InvertedIndex(byref words, docpath) {
global word2docs
loop, parse, words, `n,`r { if A_loopField = continue word2docs[A_loopField] := word2docs[A_loopField] docpath "`n" }
}
findWord(word2find) {
global word2docs
if (word2docs[word2find] = "") return "" else return word2docs[word2find]
}</lang>
Haskell
<lang haskell>import Control.Monad import Data.Char (isAlpha, toLower) import qualified Data.Map as M import qualified Data.IntSet as S import System.Environment (getArgs)
main = do
(files, _ : q) <- liftM (break (== "--")) getArgs buildII files >>= mapM_ putStrLn . queryII q
data IIndex = IIndex
[FilePath] -- Files in the index (M.Map String S.IntSet) -- Maps word to indices of the list deriving Show
buildII :: [FilePath] -> IO IIndex buildII files =
liftM (IIndex files . foldl f M.empty . zip [0..]) $ mapM readFile files where f m (i, s) = foldl g m $ map (lowercase . filter isAlpha) $ words s where g m word = M.insertWith S.union word (S.singleton i) m
queryII :: [String] -> IIndex -> [FilePath] queryII q (IIndex files m) =
map (files !!) $ S.toList $ intersections $ map (\word -> M.findWithDefault S.empty (lowercase word) m) q
intersections [] = S.empty intersections xs = foldl1 S.intersection xs
lowercase = map toLower</lang>
An example of use, assuming the program is named iindex
and there exist files t0
, t1
, and t2
with contents "It is what it is.", "What is it?", and "It is a banana.":
$ iindex t0 t1 t2 -- what is it t0 t1
J
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.
<lang J>require'files regex strings'
rxutf8 0 NB. support latin1 searches for this example, instead of utf8 files=:words=:buckets=: wordre=: rxcomp '[\w]+' parse=: ,@:rxfrom~ wordre&rxmatches
invert=: verb define
files=: files,todo=. ~.y-.files >invert1 each todo
)
invert1=: verb define
file=. files i.<y words=: ~.words,contents=. ~.parse tolower fread jpath y ind=. words i. contents buckets=: buckets,(1+words -&# buckets)#a: #buckets=: (file,~each ind{buckets) ind}buckets
)
search=: verb define
hits=. buckets{~words i.~.parse tolower y files {~ >([-.-.)each/hits
)</lang>
Example use:
<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
>search 'finally learning'
~help/primer/end.htm ~help/primer/gui.htm
>search 'argument'
~help/primer/cut.htm ~help/primer/gui.htm
>search 'around'
~help/primer/gui.htm</lang>
Perl
<lang perl>use Set::Object 'set';
- given an array of files, returns the index
sub createindex {
my @files = @_;
my %iindex;
foreach my $file (@files) {
open(F, "<", $file) or die "Can't read file $file: $!"; while(<F>) {
s/\A\W+//;
foreach my $w (map {lc} grep {length() >= 3} split /\W+/) { if ( exists($iindex{$w}) ) { $iindex{$w}->insert($file); } else { $iindex{$w} = set($file); } } } close(F);
} return %iindex;
}
- given an index, search for words
sub search_words_with_index {
my %idx = %{shift()}; my @words = @_; my $res = set(); foreach my $w (map {lc} @words) {
$w =~ s/\W+//g; # strip non-words chars
length $w < 3 and next; exists $idx{$w} or return set(); $res = $res->is_null ? set(@{$idx{$w}}) : $res * $idx{$w}; # set intersection } return $res;
}
- TESTING
- USAGE: invidx.pl the,list,of,words file1 file2 .. fileN
my @searchwords = split /,/, shift @ARGV;
- first arg is a comma-separated list of words to search for
print "$_\n"
foreach @{search_words_with_index({createindex(@ARGV)}, @searchwords)};</lang>
PicoLisp
Assuming three files "file1", "file2" and "file3":
$ cat file1 it is what it is $ cat file2 what is it $ cat file3 it is a banana
we can read them into a binary tree in the global variable '*MyIndex' <lang PicoLisp>(off *MyIndex)
(use Word
(for File '("file1" "file2" "file3") (in File (while (skip) (if (idx '*MyIndex (setq Word (till " ^I^J^M" T)) T) (push1 (car @) File) (set Word (cons File)) ) ) ) ) )
(de searchFor @
(apply sect (extract '((Word) (val (car (idx '*MyIndex Word)))) (rest) ) ) )</lang>
Output:
: (searchFor "what" "is" "it") -> ("file2" "file1") : (searchFor "a" "banana") -> ("file3") : (searchFor "it" "is") -> ("file3" "file2" "file1")
Tcl
<lang tcl>package require Tcl 8.5 proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters". # The other possible definition is to use 'non-space' characters. regexp -all -inline {\w+} $str
}
- Adds a document to the index. The index is a map from words to a map
- from filenames to lists of word locations.
proc addDocumentToIndex {filename} {
global index set f [open $filename] set data [read $f] close $f
set i 0 array set localidx {} foreach word [wordsInString $data] {
lappend localidx($word) $i incr i
}
# Transcribe into global index foreach {word places} [array get localidx] {
dict set index($word) $filename $places
}
}
- How to use the index to find files containing a word
proc findFilesForWord {word} {
global index if {[info exists index($word)]} {
return [dict keys $index($word)]
}
}
- How to use the index to find files containing all words from a list.
- Note that this does not use the locations within the file.
proc findFilesWithAllWords {words} {
set files [findFilesForWord [lindex $words 0]] foreach w [lrange $words 1 end] {
set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles
} return $files
}
- How to use the index to find a sequence of words in a file.
proc findFilesWithWordSequence {words} {
global index set files {} foreach w $words {
if {![info exist index($w)]} { return }
} dict for {file places} $index([lindex $words 0]) {
if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }
} return $files
}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right
- The actions invoked by various GUI buttons
proc AddFile {} {
global files set f [tk_getOpenFile] if {$f ne ""} {
addDocumentToIndex $f lappend files $f
}
} proc FindAll {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithAllWords $words] lappend found "Searching for files with all $terms" {*}$fs \
"---------------------" } proc FindSeq {} {
global found terms set words [wordsInString $terms] set fs [findFilesWithWordSequence $words] lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------" }</lang>