Inverted index: Difference between revisions


Revision as of 17:41, 28 May 2010

An inverted index is a data structure that maps each word to the documents in which it occurs; it is the basis of full-text search.

Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index, which returns a list of files that contain the query term or terms. The search index can be in memory.
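As a language-neutral illustration of the task, here is a minimal Python sketch; it is not one of the solutions on this page. It assumes the documents are already in memory as a dict named `docs` (standing in for the set of text files) and treats a word as a lowercase run of `\w` characters. The names `build_index` and `search` are illustrative only.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def search(index, terms):
    """Return the sorted names of documents that contain every query term."""
    if not terms:
        return []
    # .get() avoids creating empty entries for unknown words
    postings = [index.get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings))

# Hypothetical in-memory stand-in for three text files
docs = {
    "t0": "It is what it is.",
    "t1": "What is it?",
    "t2": "It is a banana.",
}
index = build_index(docs)
print(search(index, ["what", "is", "it"]))  # ['t0', 't1']
print(search(index, ["banana"]))            # ['t2']
```

Storing a set per word keeps a multi-term query down to a set intersection; a real program would read the file contents from disk instead of using the `docs` dict.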

AutoHotkey

Works with: AutoHotkey_L

<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479
inputbox, files, files, file pattern such as c:\files\*.txt

word2docs := object() ; autohotkey_L is needed.

stime := A_tickcount
Loop, %files%, 0,1
{

  tooltip,%A_index%  / 500  
  
  wordList := WordsIn(A_LoopFileFullPath)
  InvertedIndex(wordList, A_loopFileFullpath)

}

tooltip
msgbox, % "total time " (A_tickcount-stime)/1000
gosub, search
return

search:
Loop
{
  InputBox, keyword , input single keyword only
  msgbox, % foundDocs := findword(keyword)
}
return

WordsIn(docpath) {

  FileRead, content, %docpath%
 spos = 1
  Loop
  {
    if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos))
      break
    spos += strlen(match)
    this_wordList .= match "`n"
  }
 
 Sort, this_wordList, U  
 return this_wordList

}

InvertedIndex(byref words, docpath) {

  global word2docs

 loop, parse, words, `n,`r 
 {                          
   if A_loopField =
     continue
   word2docs[A_loopField] := word2docs[A_loopField] docpath "`n"
 }

}

findWord(word2find) {

 global word2docs

 if (word2docs[word2find] = "")
    return ""
 else
   return word2docs[word2find]

}</lang>

Haskell

<lang haskell>import Control.Monad
import Data.Char (isAlpha, toLower)
import qualified Data.Map as M
import qualified Data.IntSet as S
import System.Environment (getArgs)

main = do

   (files, _ : q) <- liftM (break (== "--")) getArgs
   buildII files >>= mapM_ putStrLn . queryII q

data IIndex = IIndex

   [FilePath]              -- Files in the index
   (M.Map String S.IntSet) -- Maps word to indices of the list
 deriving Show

buildII :: [FilePath] -> IO IIndex
buildII files =

   liftM (IIndex files . foldl f M.empty . zip [0..]) $
   mapM readFile files
 where f m (i, s) =
           foldl g m $ map (lowercase . filter isAlpha) $ words s
         where g m word = M.insertWith S.union word (S.singleton i) m

queryII :: [String] -> IIndex -> [FilePath]
queryII q (IIndex files m) =

   map (files !!) $ S.toList $ intersections $
   map (\word -> M.findWithDefault S.empty (lowercase word) m) q

intersections [] = S.empty
intersections xs = foldl1 S.intersection xs

lowercase = map toLower</lang>

An example of use, assuming the program is named iindex and there exist files t0, t1, and t2 with contents "It is what it is.", "What is it?", and "It is a banana.":

$ iindex t0 t1 t2 -- what is it
t0
t1

J

This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.

<lang J>require'files regex strings'

rxutf8 0  NB. support latin1 searches for this example, instead of utf8
files=:words=:buckets=:''
wordre=: rxcomp '[\w]+'
parse=: ,@:rxfrom~ wordre&rxmatches

invert=: verb define

 files=: files,todo=. ~.y-.files
 >invert1 each todo

)

invert1=: verb define

 file=. files i.<y
 words=: ~.words,contents=. ~.parse tolower fread jpath y
 ind=. words i. contents
 buckets=: buckets,(1+words -&# buckets)#a:
 #buckets=: (file,~each ind{buckets) ind}buckets

)

search=: verb define

 hits=. buckets{~words i.~.parse tolower y
 files {~ >([-.-.)each/hits

)</lang>

Example use:

<lang J>   invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
   >search 'finally learning'
~help/primer/end.htm
~help/primer/gui.htm
   >search 'argument'
~help/primer/cut.htm
~help/primer/gui.htm
   >search 'around'
~help/primer/gui.htm</lang>
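How a "word" is defined changes what ends up in the index. The Python sketch below, which is not part of any solution here, compares two of the definitions used on this page: `\w+` (as in the J and Tcl entries) and `[a-zA-Z]{2,}` (as in the AutoHotkey entry). The function names and the sample sentence are made up for illustration, and lowercasing is applied in both cases only to make the results easy to compare.

```python
import re

def words_regex_w(text):
    # \w+ : maximal runs of letters, digits and underscores
    return re.findall(r"\w+", text.lower())

def words_alpha2(text):
    # [a-zA-Z]{2,} : ASCII letters only, at least two in a row
    return re.findall(r"[a-zA-Z]{2,}", text.lower())

sample = "It is a banana, isn't it? 42 times."
print(words_regex_w(sample))
# ['it', 'is', 'a', 'banana', 'isn', 't', 'it', '42', 'times']
print(words_alpha2(sample))
# ['it', 'is', 'banana', 'isn', 'it', 'times']
```

The second definition drops numbers and one-letter words, so a search for "a" or "42" can never match under it.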

Perl

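The index created here is not a full inverted index: it records which files contain each word, but not where in the file the word occurs. This solution provides no GUI, only a command-line interface.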

<lang perl>use strict;

# given an array of files, returns the index

sub createindex {

   my @files = @_;

   my %iindex;

   foreach my $file (@files)
   {

        open(F, "<", $file);
        while(<F>) {
            chomp;
            # sanitize... this de facto defines what a "word" is
            s/\W+/ /g;
            my @words = split / +/;
            foreach my $w (@words) {
                $w = lc $w; # index lowercase...
                # this says not to index "words" with length < 3
                next if length($w) < 3;
                if ( exists($iindex{$w}) ) {
                    push @{$iindex{$w}}, $file unless grep{$_ eq $file} @{$iindex{$w}};
                } else {
                    $iindex{$w} = [$file];
                }
            }
        }
        close(F);

   }
   return %iindex;

}</lang>

<lang perl># given an index, search for words
sub search_words_with_index {

   my $i = shift;
   my %idx = %$i;
   my @words = @_;
   my @res;
   
   foreach my $w (@words)
   {

        next if length($w) < 3; # ignore unindexed word...
        $w =~ s/\W+//g; # strip non words chars
        $w = lc $w;
        if ( exists($idx{$w}) ) {
            push @res, { "found" => 1, "word" => $w, "files" => $idx{$w} };
        } else {
            push @res, { "found" => 0, "word" => $w };
        }

   }
   return @res;

}</lang>

<lang perl># TESTING

# USAGE: invidx.pl the,list,of,words file1 file2 .. fileN

my @searchwords = split /,/, (shift @ARGV);
# first arg is a comma separated list of words to search for

my %iindex = createindex(@ARGV); # create the index...

# use the index to search words; the sub returns an array of
# references to hash like:
# ( found => 0/1, word => "theword", files => arrayref of files where theword appears )

my @sr = search_words_with_index(\%iindex, @searchwords);
foreach my $res (@sr) {
    if ( $res->{"found"} )
    {
        print $res->{"word"}, ":\n", join("\n", @{$res->{"files"}}), "\n\n";
    } else {
        print $res->{"word"}, ": NOT FOUND\n\n";
    }
}

exit 0;</lang>
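For comparison, a full inverted index also records where each word occurs, which is what makes phrase queries possible. The following Python sketch is an illustration only and is not taken from any solution on this page; the `docs` dict is a hypothetical stand-in for files, and the names `build_positional_index` and `phrase_search` are invented for the example.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {document name -> list of word positions}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(name, []).append(pos)
    return index

def phrase_search(index, words):
    """Return documents in which the words occur consecutively, in order."""
    words = [w.lower() for w in words]
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for name, starts in index[words[0]].items():
        for start in starts:
            # every following word must sit exactly k positions after the start
            if all(start + k in index[w].get(name, [])
                   for k, w in enumerate(words[1:], 1)):
                hits.append(name)
                break
    return sorted(hits)

docs = {"t0": "It is what it is.", "t1": "What is it?", "t2": "It is a banana."}
index = build_positional_index(docs)
print(phrase_search(index, ["what", "it", "is"]))  # ['t0']
print(phrase_search(index, ["is", "it"]))          # ['t1']
```

The extra position lists cost memory, which is one reason a simpler word-to-files index is often enough when only "contains all terms" queries are needed.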

Tcl

<lang tcl>package require Tcl 8.5
proc wordsInString str {

   # We define "words" to be "maximal sequences of 'word' characters".
   # The other possible definition is to use 'non-space' characters.
   regexp -all -inline {\w+} $str

}

# Adds a document to the index. The index is a map from words to a map
# from filenames to lists of word locations.

proc addDocumentToIndex {filename} {

   global index
   set f [open $filename]
   set data [read $f]
   close $f

   set i 0
   array set localidx {}
   foreach word [wordsInString $data] {
       lappend localidx($word) $i
       incr i
   }

   # Transcribe into global index
   foreach {word places} [array get localidx] {
       dict set index($word) $filename $places
   }
}

# How to use the index to find files containing a word

proc findFilesForWord {word} {
   global index
   if {[info exists index($word)]} {
       return [dict keys $index($word)]
   }
}

# How to use the index to find files containing all words from a list.
# Note that this does not use the locations within the file.

proc findFilesWithAllWords {words} {

   set files [findFilesForWord [lindex $words 0]]
   foreach w [lrange $words 1 end] {

       set wf [findFilesForWord $w]
       set newfiles {}
       foreach f $files {
           if {$f in $wf} {lappend newfiles $f}
       }
       set files $newfiles

   }
   return $files

}

# How to use the index to find a sequence of words in a file.

proc findFilesWithWordSequence {words} {

   global index
   set files {}
   foreach w $words {

if {![info exist index($w)]} { return }

   }
   dict for {file places} $index([lindex $words 0]) {

       if {$file in $files} continue
       foreach start $places {
           set gotStart 1
           foreach w [lrange $words 1 end] {
               incr start
               set gotNext 0
               foreach {f ps} $index($w) {
                   if {$f ne $file} continue
                   foreach p $ps {
                       if {$p == $start} {
                           set gotNext 1
                           break
                       }
                   }
                   if {$gotNext} break
               }
               if {!$gotNext} {
                   set gotStart 0
                   break
               }
           }
           if {$gotStart} {
               lappend files $file
               break
           }
       }

   }
   return $files

}</lang>
For the GUI:
<lang tcl>package require Tk
pack [labelframe .files -text Files] -side left -fill y
pack [listbox .files.list -listvariable files]
pack [button .files.add -command AddFile -text "Add File to Index"]
pack [labelframe .found -text Found] -side right -fill y
pack [listbox .found.list -listvariable found] -fill x
pack [entry .found.entry -textvariable terms] -fill x
pack [button .found.findAll -command FindAll \
        -text "Find File with All"] -side left
pack [button .found.findSeq -command FindSeq \
        -text "Find File with Sequence"] -side right

# The actions invoked by various GUI buttons

proc AddFile {} {

   global files
   set f [tk_getOpenFile]
   if {$f ne ""} {
       addDocumentToIndex $f
       lappend files $f
   }
}
proc FindAll {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithAllWords $words]
   lappend found "Searching for files with all $terms" {*}$fs \
       "---------------------"
}
proc FindSeq {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithWordSequence $words]
   lappend found "Searching for files with \"$terms\"" {*}$fs \
       "---------------------"
}</lang>

@@ Line 108: / Line 108: @@
 lowercase = map toLower</lang>
-An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t1</code>, <code>t2</code>, and <code>t3</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
+An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
 <pre>$ iindex t0 t1 t2 -- what is it