Inverted index: Difference between revisions

Content added Content deleted

Inline

Revision as of 17:37, 21 June 2010

An Inverted Index is a data structure used to create full text search.

Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.

AutoHotkey

Works with: AutoHotkey_L

<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt

word2docs := object() ; autohotkey_L is needed.

stime := A_tickcount Loop, %files%, 0,1 {

  tooltip,%A_index%  / 500  
  
  wordList := WordsIn(A_LoopFileFullPath)
  InvertedIndex(wordList, A_loopFileFullpath)

}

tooltip msgbox, % "total time " (A_tickcount-stime)/1000

gosub, search return

search: Loop {

  InputBox, keyword , input single keyword only
  msgbox, % foundDocs := findword(keyword)

} return

WordsIn(docpath) {

  FileRead, content, %docpath%
 spos = 1
  Loop
  {
    if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos))
      break
    spos += strlen(match)
    this_wordList .= match "`n"
  }
 
 Sort, this_wordList, U  
 return this_wordList

}

InvertedIndex(byref words, docpath) {

  global word2docs

 loop, parse, words, `n,`r 
 {                          
   if A_loopField =
     continue
   word2docs[A_loopField] := word2docs[A_loopField] docpath "`n"
 }

}

findWord(word2find) {

 global word2docs

 if (word2docs[word2find] = "")
    return ""
 else
   return word2docs[word2find]

}</lang>

Haskell

<lang haskell>import Control.Monad import Data.Char (isAlpha, toLower) import qualified Data.Map as M import qualified Data.IntSet as S import System.Environment (getArgs)

main = do

   (files, _ : q) <- liftM (break (== "--")) getArgs
   buildII files >>= mapM_ putStrLn . queryII q

data IIndex = IIndex

   [FilePath]              -- Files in the index
   (M.Map String S.IntSet) -- Maps word to indices of the list
 deriving Show

buildII :: [FilePath] -> IO IIndex buildII files =

   liftM (IIndex files . foldl f M.empty . zip [0..]) $
   mapM readFile files
 where f m (i, s) =
           foldl g m $ map (lowercase . filter isAlpha) $ words s
         where g m word = M.insertWith S.union word (S.singleton i) m

queryII :: [String] -> IIndex -> [FilePath] queryII q (IIndex files m) =

   map (files !!) $ S.toList $ intersections $
   map (\word -> M.findWithDefault S.empty (lowercase word) m) q

intersections [] = S.empty intersections xs = foldl1 S.intersection xs

lowercase = map toLower</lang>

An example of use, assuming the program is named iindex and there exist files t0, t1, and t2 with contents "It is what it is.", "What is it?", and "It is a banana.":

$ iindex t0 t1 t2 -- what is it
t0
t1

J

This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.

<lang J>require'files regex strings'

rxutf8 0 NB. support latin1 searches for this example, instead of utf8 files=:words=:buckets=: wordre=: rxcomp '[\w]+' parse=: ,@:rxfrom~ wordre&rxmatches

invert=: verb define

 files=: files,todo=. ~.y-.files
 >invert1 each todo

)

invert1=: verb define

 file=. files i.<y
 words=: ~.words,contents=. ~.parse tolower fread jpath y
 ind=. words i. contents
 buckets=: buckets,(1+words -&# buckets)#a:
 #buckets=: (file,~each ind{buckets) ind}buckets

)

search=: verb define

 hits=. buckets{~words i.~.parse tolower y
 files {~ >([-.-.)each/hits

)</lang>

Example use:

<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'

  >search 'finally learning'

~help/primer/end.htm ~help/primer/gui.htm

  >search 'argument'

~help/primer/cut.htm ~help/primer/gui.htm

  >search 'around'

~help/primer/gui.htm</lang>

Perl

<lang perl>use Set::Object 'set';

given an array of files, returns the index

sub createindex {

   my @files = @_;

   my %iindex;

   foreach my $file (@files)
   {

open(F, "<", $file) or die "Can't read file $file: $!"; while(<F>) {

           s/\A\W+//;

foreach my $w (map {lc} grep {length() >= 3} split /\W+/) { if ( exists($iindex{$w}) ) { $iindex{$w}->insert($file); } else { $iindex{$w} = set($file); } } } close(F);

   }
   return %iindex;

}

given an index, search for words

sub search_words_with_index {

   my %idx = %{shift()};
   my @words = @_;
   my $res = set();
   
   foreach my $w (map {lc} @words)
   {

$w =~ s/\W+//g; # strip non-words chars

       length $w < 3 and next;
       exists $idx{$w} or return set();
       $res = $res->is_null
         ? set(@{$idx{$w}})
         : $res * $idx{$w};       # set intersection
   }
   return $res;

}

TESTING
USAGE: invidx.pl the,list,of,words file1 file2 .. fileN

my @searchwords = split /,/, shift @ARGV;

first arg is a comma-separated list of words to search for

print "$_\n"

   foreach @{search_words_with_index({createindex(@ARGV)}, @searchwords)};</lang>

PicoLisp

Assuming three files "file1", "file2" and "file3":

$ cat file1
it is what it is

$ cat file2
what is it

$ cat file3
it is a banana

we can read them into a binary tree in the global variable '*MyIndex' <lang PicoLisp>(off *MyIndex)

(use Word

  (for File '("file1" "file2" "file3")
     (in File
        (while (skip)
           (if (idx '*MyIndex (setq Word (till " ^I^J^M" T)) T)
              (push1 (car @) File)
              (set Word (cons File)) ) ) ) ) )

(de searchFor @

  (apply sect
     (extract
        '((Word) (val (car (idx '*MyIndex Word))))
        (rest) ) ) )</lang>

Output:

: (searchFor "what" "is" "it")
-> ("file2" "file1")

: (searchFor "a" "banana")
-> ("file3")

: (searchFor "it" "is")
-> ("file3" "file2" "file1")

Tcl

<lang tcl>package require Tcl 8.5 proc wordsInString str {

   # We define "words" to be "maximal sequences of 'word' characters".
   # The other possible definition is to use 'non-space' characters.
   regexp -all -inline {\w+} $str

}

Adds a document to the index. The index is a map from words to a map
from filenames to lists of word locations.

proc addDocumentToIndex {filename} {

   global index
   set f [open $filename]
   set data [read $f]
   close $f

   set i 0
   array set localidx {}
   foreach word [wordsInString $data] {

lappend localidx($word) $i incr i

   # Transcribe into global index
   foreach {word places} [array get localidx] {

dict set index($word) $filename $places

}

How to use the index to find files containing a word

proc findFilesForWord {word} {

   global index
   if {[info exists index($word)]} {

return [dict keys $index($word)]

}

How to use the index to find files containing all words from a list.
Note that this does not use the locations within the file.

proc findFilesWithAllWords {words} {

   set files [findFilesForWord [lindex $words 0]]
   foreach w [lrange $words 1 end] {

set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles

   }
   return $files

}

How to use the index to find a sequence of words in a file.

proc findFilesWithWordSequence {words} {

   global index
   set files {}
   foreach w $words {

if {![info exist index($w)]} { return }

   }
   dict for {file places} $index([lindex $words 0]) {

if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }

   }
   return $files

}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right

The actions invoked by various GUI buttons

proc AddFile {} {

   global files
   set f [tk_getOpenFile]
   if {$f ne ""} {

addDocumentToIndex $f lappend files $f

} proc FindAll {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithAllWords $words]
   lappend found "Searching for files with all $terms" {*}$fs \

"---------------------" } proc FindSeq {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithWordSequence $words]
   lappend found "Searching for files with \"$terms\"" {*}$fs \

"---------------------" }</lang>