Inverted index: Difference between revisions

From Rosetta Code
Content added Content deleted
Line 246: Line 246:
let words_of_file in_file =
let words_of_file in_file =
let str = load_file in_file in
let str = load_file in_file in
let words = Str.split (Str.regexp "\b") str in
let words = Str.split (Str.regexp "[ \r\n\t,;.?!:'/\034()]") str in
let words = uniq words in
let words = uniq words in
(words)
(words)

Revision as of 21:09, 14 August 2010

Task
Inverted index
You are encouraged to solve this task according to the task description, using any language you may know.

An Inverted Index is a data structure used to create full text search.

Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.

AutoHotkey

Works with: AutoHotkey_L

<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt

word2docs := object() ; autohotkey_L is needed.

stime := A_tickcount Loop, %files%, 0,1 {

  tooltip,%A_index%  / 500  
  
  wordList := WordsIn(A_LoopFileFullPath)
  InvertedIndex(wordList, A_loopFileFullpath)   

}

tooltip msgbox, % "total time " (A_tickcount-stime)/1000

gosub, search return

search: Loop {

  InputBox, keyword , input single keyword only
  msgbox, % foundDocs := findword(keyword)

} return

WordsIn(docpath) {

  FileRead, content, %docpath%
 spos = 1
  Loop
  {
    if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos))
      break
    spos += strlen(match)
    this_wordList .= match "`n"
  }
 
 Sort, this_wordList, U  
 return this_wordList   

}

InvertedIndex(byref words, docpath) {

  global word2docs
 loop, parse, words, `n,`r 
 {                          
   if A_loopField =
     continue
   word2docs[A_loopField] := word2docs[A_loopField] docpath "`n"
 }

}

findWord(word2find) {

 global word2docs
 if (word2docs[word2find] = "")
    return ""
 else
   return word2docs[word2find]

}</lang>

Factor

<lang factor>USING: assocs fry io.encodings.utf8 io.files kernel sequences sets splitting vectors ; IN: rosettacode.inverted-index

file-words ( file -- assoc )
   utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
add-to-file-list ( files file -- files )
   over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
add-to-index ( words index file -- )
   '[ _ [ _ add-to-file-list ] change-at ] each ;
(index-files) ( files index -- )
  [ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
index-files ( files -- index )
   H{ } clone [ (index-files) ] keep ;
query ( terms index -- files )
   [ at ] curry map [ ] [ intersect ] map-reduce ;

</lang>

Example use : <lang>( scratchpad ) { "f1" "f2" "f3" } index-files

--- Data stack: H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ... ( scratchpad ) { "what" "is" "it" } swap query . V{ "f1" "f2" } </lang>

Haskell

<lang haskell>import Control.Monad import Data.Char (isAlpha, toLower) import qualified Data.Map as M import qualified Data.IntSet as S import System.Environment (getArgs)

main = do

   (files, _ : q) <- liftM (break (== "--")) getArgs
   buildII files >>= mapM_ putStrLn . queryII q

data IIndex = IIndex

   [FilePath]              -- Files in the index
   (M.Map String S.IntSet) -- Maps word to indices of the list
 deriving Show

buildII :: [FilePath] -> IO IIndex buildII files =

   liftM (IIndex files . foldl f M.empty . zip [0..]) $
   mapM readFile files
 where f m (i, s) =
           foldl g m $ map (lowercase . filter isAlpha) $ words s
         where g m word = M.insertWith S.union word (S.singleton i) m

queryII :: [String] -> IIndex -> [FilePath] queryII q (IIndex files m) =

   map (files !!) $ S.toList $ intersections $
   map (\word -> M.findWithDefault S.empty (lowercase word) m) q

intersections [] = S.empty intersections xs = foldl1 S.intersection xs

lowercase = map toLower</lang>

An example of use, assuming the program is named iindex and there exist files t0, t1, and t2 with contents "It is what it is.", "What is it?", and "It is a banana.":

$ iindex t0 t1 t2 -- what is it
t0
t1

J

This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.

<lang J>require'files regex strings'

rxutf8 0 NB. support latin1 searches for this example, instead of utf8 files=:words=:buckets=: wordre=: rxcomp '[\w]+' parse=: ,@:rxfrom~ wordre&rxmatches

invert=: verb define

 files=: files,todo=. ~.y-.files
 >invert1 each todo

)

invert1=: verb define

 file=. files i.<y
 words=: ~.words,contents=. ~.parse tolower fread jpath y
 ind=. words i. contents
 buckets=: buckets,(1+words -&# buckets)#a:
 #buckets=: (file,~each ind{buckets) ind}buckets

)

search=: verb define

 hits=. buckets{~words i.~.parse tolower y
 files {~ >([-.-.)each/hits

)</lang>

Example use:

<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'

  >search 'finally learning'

~help/primer/end.htm ~help/primer/gui.htm

  >search 'argument'

~help/primer/cut.htm ~help/primer/gui.htm

  >search 'around'

~help/primer/gui.htm</lang>


OCaml

We store the inverted index data in the file "data.inv" using the sexplib library, so we compile with:

ocamlc -c \
  -pp "camlp4o -I `ocamlc -where`/type-conv \
               -I `ocamlc -where`/sexplib \
               pa_type_conv.cmo pa_sexp_conv.cmo" \
  unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \
  inv.ml

ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo

<lang ocaml>TYPE_CONV_PATH "Inverted_index"

type files = string array with sexp type inverted_index = (string * int list) list with sexp

type t = files * inverted_index with sexp

open Sexplib

let data_file = "data.inv" let data_path = Filename.concat Filename.temp_dir_name data_file

let get_inv_index() =

 if Sys.file_exists data_path
 then t_of_sexp(Sexp.load_sexp data_path)
 else ([| |], [])

let load_file f =

 let ic = open_in f in
 let n = in_channel_length ic in
 let s = String.create n in
 really_input ic s 0 n;
 close_in ic;
 (s)

let array_push ar v =

 let len = Array.length ar in
 Array.init (succ len) (fun i ->
   if i < len then Array.unsafe_get ar i else v), len

let uniq lst =

 let h = Hashtbl.create (List.length lst) in
 List.iter (fun x -> Hashtbl.replace h x ()) lst;
 Hashtbl.fold (fun x () xs -> x :: xs) h []

let combine words i inv_index =

 let h = Hashtbl.create (List.length inv_index) in
 List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index;
 List.iter (fun w ->
   if Hashtbl.mem h w
   then begin
     let from = Hashtbl.find h w in
     Hashtbl.replace h w (i::from)
   end else
     Hashtbl.add h w [i]
 ) words;
 Hashtbl.fold (fun w from acc -> (w, from) :: acc) h []

let words_of_file in_file =

 let str = load_file in_file in
 let words = Str.split (Str.regexp "[ \r\n\t,;.?!:'/\034()]") str in
 let words = uniq words in
 (words)

let index_file in_file =

 let words = words_of_file in_file in
 let files, inv_index = get_inv_index() in
 let files, i = array_push files in_file in
 let inv_index = combine words i inv_index in
 let se = sexp_of_t (files, inv_index) in
 Sexp.save data_path se

let search_word word =

 let files, inv_index = get_inv_index() in
 try
   let is_in = List.assoc word inv_index in
   List.iter (fun i -> print_endline files.(i)) is_in
 with Not_found ->
   print_endline "# Not Found"

let usage() =

 Printf.printf "Usage: %s \
   --index-file <file.txt> / \
   --search-word <some-word>\n%!" Sys.argv.(0);
 exit 1

let () =

 let cmd, arg = try (Sys.argv.(1), Sys.argv.(2)) with _ -> usage() in
 match cmd, arg with
 | "--index-file", in_file -> index_file in_file
 | "--search-word", word -> search_word word
 | _ -> usage()</lang>


Perl

<lang perl>use Set::Object 'set';

  1. given an array of files, returns the index

sub createindex {

   my @files = @_;
   my %iindex;
   foreach my $file (@files)
   {

open(F, "<", $file) or die "Can't read file $file: $!"; while(<F>) {

           s/\A\W+//;

foreach my $w (map {lc} grep {length() >= 3} split /\W+/) { if ( exists($iindex{$w}) ) { $iindex{$w}->insert($file); } else { $iindex{$w} = set($file); } } } close(F);

   }
   return %iindex;

}

  1. given an index, search for words

sub search_words_with_index {

   my %idx = %{shift()};
   my @words = @_;
   my $res = set();
   
   foreach my $w (map {lc} @words)
   {

$w =~ s/\W+//g; # strip non-words chars

       length $w < 3 and next;
       exists $idx{$w} or return set();
       $res = $res->is_null
         ? set(@{$idx{$w}})
         : $res * $idx{$w};       # set intersection
   }
   return @$res;

}

  1. TESTING
  2. USAGE: invidx.pl the,list,of,words file1 file2 .. fileN

my @searchwords = split /,/, shift;

  1. first arg is a comma-separated list of words to search for

print "$_\n"

   foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</lang>

PicoLisp

Assuming three files "file1", "file2" and "file3":

$ cat file1
it is what it is

$ cat file2
what is it

$ cat file3
it is a banana

we can read them into a binary tree in the global variable '*MyIndex' <lang PicoLisp>(off *MyIndex)

(use Word

  (for File '("file1" "file2" "file3")
     (in File
        (while (skip)
           (if (idx '*MyIndex (setq Word (till " ^I^J^M" T)) T)
              (push1 (car @) File)
              (set Word (cons File)) ) ) ) ) )

(de searchFor @

  (apply sect
     (extract
        '((Word) (val (car (idx '*MyIndex Word))))
        (rest) ) ) )</lang>

Output:

: (searchFor "what" "is" "it")
-> ("file2" "file1")

: (searchFor "a" "banana")
-> ("file3")

: (searchFor "it" "is")
-> ("file3" "file2" "file1")

Python

Simple inverted index

First the simple inverted index from here together with an implementation of a search for (multiple) terms from that index. <lang python> This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10

from pprint import pprint as pp from glob import glob try: reduce except: from functools import reduce try: raw_input except: raw_input = input


def parsetexts(fileglob='InvertedIndex/T*.txt'):

   texts, words = {}, set()
   for txtfile in glob(fileglob):
       with open(txtfile, 'r') as f:
           txt = f.read().split()
           words |= set(txt)
           texts[txtfile.split('\\')[-1]] = txt
   return texts, words

def termsearch(terms): # Searches simple inverted index

   return reduce(set.intersection,
                 (invindex[term] for term in terms),
                 set(texts.keys()))

texts, words = parsetexts() print('\nTexts') pp(texts) print('\nWords') pp(sorted(words))

invindex = {word:set(txt

                       for txt, wrds in texts.items() if word in wrds)
           for word in words}

print('\nInverted Index') pp({k:sorted(v) for k,v in invindex.items()})

terms = ["what", "is", "it"] print('\nTerm Search for: ' + repr(terms)) pp(sorted(termsearch(terms)))</lang>

Sample Output

Texts
{'T0.txt': ['it', 'is', 'what', 'it', 'is'],
 'T1.txt': ['what', 'is', 'it'],
 'T2.txt': ['it', 'is', 'a', 'banana']}

Words
['a', 'banana', 'is', 'it', 'what']

Inverted Index
{'a': ['T2.txt'],
 'banana': ['T2.txt'],
 'is': ['T0.txt', 'T1.txt', 'T2.txt'],
 'it': ['T0.txt', 'T1.txt', 'T2.txt'],
 'what': ['T0.txt', 'T1.txt']}

Term Search for: ['what', 'is', 'it']
['T0.txt', 'T1.txt']

Full inverted index

There is a re-write of the termsearch function to work off this type of index, as well as a new phrasesearch function

The phrasesearch function will return multiple matches in a text, and goes on to show how this can be used to pick the text with most matches.

It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results <lang python>from collections import Counter


def termsearch(terms): # Searches full inverted index

   if not set(terms).issubset(words):
       return set()
   return reduce(set.intersection,
                 (set(x[0] for x in txtindx)
                  for term, txtindx in finvindex.items()
                  if term in terms),
                 set(texts.keys()) )

def phrasesearch(phrase):

   wordsinphrase = phrase.strip().strip('"').split()
   if not set(wordsinphrase).issubset(words):
       return set()
   #firstword, *otherwords = wordsinphrase # Only Python 3
   firstword, otherwords = wordsinphrase[0], wordsinphrase[1:]
   found = []
   for txt in termsearch(wordsinphrase):
       # Possible text files
       for firstindx in (indx for t,indx in finvindex[firstword]
                         if t == txt):
           # Over all positions of the first word of the phrase in this txt
           if all( (txt, firstindx+1 + otherindx) in finvindex[otherword]
                   for otherindx, otherword in enumerate(otherwords) ):
               found.append(txt)
   return found


finvindex = {word:set((txt, wrdindx)

                     for txt, wrds in texts.items()
                     for wrdindx in (i for i,w in enumerate(wrds) if word==w)
                     if word in wrds)
            for word in words}

print('\nFull Inverted Index') pp({k:sorted(v) for k,v in finvindex.items()})

print('\nTerm Search on full inverted index for: ' + repr(terms)) pp(sorted(termsearch(terms)))

phrase = '"what is it"' print('\nPhrase Search for: ' + phrase) print(phrasesearch(phrase))

  1. Show multiple match capability

phrase = '"it is"' print('\nPhrase Search for: ' + phrase) ans = phrasesearch(phrase) print(ans) ans = Counter(ans) print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</lang>

Sample Output

Full Inverted Index
{'a': [('T2.txt', 2)],
 'banana': [('T2.txt', 3)],
 'is': [('T0.txt', 1), ('T0.txt', 4), ('T1.txt', 1), ('T2.txt', 1)],
 'it': [('T0.txt', 0), ('T0.txt', 3), ('T1.txt', 2), ('T2.txt', 0)],
 'what': [('T0.txt', 2), ('T1.txt', 0)]}

Term Search on full inverted index for: ['what', 'is', 'it']
['T0.txt', 'T1.txt']

Phrase Search for: "what is it"
['T1.txt']

Phrase Search for: "it is"
['T0.txt', 'T0.txt', 'T2.txt']
  The phrase is found most commonly in text: 'T0.txt'

Tcl

<lang tcl>package require Tcl 8.5 proc wordsInString str {

   # We define "words" to be "maximal sequences of 'word' characters".
   # The other possible definition is to use 'non-space' characters.
   regexp -all -inline {\w+} $str

}

  1. Adds a document to the index. The index is a map from words to a map
  2. from filenames to lists of word locations.

proc addDocumentToIndex {filename} {

   global index
   set f [open $filename]
   set data [read $f]
   close $f
   set i 0
   array set localidx {}
   foreach word [wordsInString $data] {

lappend localidx($word) $i incr i

   }
   # Transcribe into global index
   foreach {word places} [array get localidx] {

dict set index($word) $filename $places

   }

}

  1. How to use the index to find files containing a word

proc findFilesForWord {word} {

   global index
   if {[info exists index($word)]} {

return [dict keys $index($word)]

   }

}

  1. How to use the index to find files containing all words from a list.
  2. Note that this does not use the locations within the file.

proc findFilesWithAllWords {words} {

   set files [findFilesForWord [lindex $words 0]]
   foreach w [lrange $words 1 end] {

set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles

   }
   return $files

}

  1. How to use the index to find a sequence of words in a file.

proc findFilesWithWordSequence {words} {

   global index
   set files {}
   foreach w $words {

if {![info exist index($w)]} { return }

   }
   dict for {file places} $index([lindex $words 0]) {

if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }

   }
   return $files

}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right

  1. The actions invoked by various GUI buttons

proc AddFile {} {

   global files
   set f [tk_getOpenFile]
   if {$f ne ""} {

addDocumentToIndex $f lappend files $f

   }

} proc FindAll {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithAllWords $words]
   lappend found "Searching for files with all $terms" {*}$fs \

"---------------------" } proc FindSeq {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithWordSequence $words]
   lappend found "Searching for files with \"$terms\"" {*}$fs \

"---------------------" }</lang>