Inverted index: Difference between revisions

From Rosetta Code
Content added Content deleted
m (remove HACK)
(J portability fix: apparently mac needs jpath)
Line 91: Line 91:
invert1=: verb define
invert1=: verb define
file=. files i.<y
file=. files i.<y
words=: ~.words,contents=. ~.parse tolower fread y
words=: ~.words,contents=. ~.parse tolower fread jpath y
ind=. words i. contents
ind=. words i. contents
buckets=: buckets,(1+words -&# buckets)#a:
buckets=: buckets,(1+words -&# buckets)#a:
Line 104: Line 104:
Example use:
Example use:


<lang J> invert 'help/primer/cut.htm';'help/primer/end.htm';'help/primer/gui.htm'
<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
>search 'finally learning'
>search 'finally learning'
help/primer/end.htm
~help/primer/end.htm
help/primer/gui.htm
~help/primer/gui.htm
>search 'argument'
>search 'argument'
help/primer/cut.htm
~help/primer/cut.htm
help/primer/gui.htm
~help/primer/gui.htm
>search 'around'
>search 'around'
help/primer/gui.htm</lang>
~help/primer/gui.htm</lang>


=={{header|Tcl}}==
=={{header|Tcl}}==

Revision as of 16:14, 30 April 2010

Task
Inverted index
You are encouraged to solve this task according to the task description, using any language you may know.

An Inverted Index is a data structure used to create full text search.

Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.

AutoHotkey

Works with: AutoHotkey_L

<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\*.txt

word2docs := object() ; autohotkey_L is needed.

stime := A_tickcount Loop, %files%, 0,1 {

  tooltip,%A_index%  / 500  
  
  wordList := WordsIn(A_LoopFileFullPath)
  InvertedIndex(wordList, A_loopFileFullpath)   

}

tooltip msgbox, % "total time " (A_tickcount-stime)/1000

gosub, search return

search: Loop {

  InputBox, keyword , input single keyword only
  msgbox, % foundDocs := findword(keyword)

} return

WordsIn(docpath) {

  FileRead, content, %docpath%
 spos = 1
  Loop
  {
    if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos))
      break
    spos += strlen(match)
    this_wordList .= match "`n"
  }
 
 Sort, this_wordList, U  
 return this_wordList   

}

InvertedIndex(byref words, docpath) {

  global word2docs
 loop, parse, words, `n,`r 
 {                          
   if A_loopField =
     continue
   word2docs[A_loopField] := word2docs[A_loopField] docpath "`n"
 }

}

findWord(word2find) {

 global word2docs
 if (word2docs[word2find] = "")
    return ""
 else
   return word2docs[word2find]

}</lang>

J

This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.

<lang J>require'files regex strings'

rxutf8 0 NB. support latin1 searches for this example, instead of utf8 files=:words=:buckets=: wordre=: rxcomp '[\w]+' parse=: ,@:rxfrom~ wordre&rxmatches

invert=: verb define

 files=: files,todo=. ~.y-.files
 >invert1 each todo

)

invert1=: verb define

 file=. files i.<y
 words=: ~.words,contents=. ~.parse tolower fread jpath y
 ind=. words i. contents
 buckets=: buckets,(1+words -&# buckets)#a:
 #buckets=: (file,~each ind{buckets) ind}buckets

)

search=: verb define

 hits=. buckets{~words i.~.parse tolower y
 files {~ >([-.-.)each/hits

)</lang>

Example use:

<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'

  >search 'finally learning'

~help/primer/end.htm ~help/primer/gui.htm

  >search 'argument'

~help/primer/cut.htm ~help/primer/gui.htm

  >search 'around'

~help/primer/gui.htm</lang>

Tcl

<lang tcl>package require Tcl 8.5 proc wordsInString str {

   # We define "words" to be "maximal sequences of 'word' characters".
   # The other possible definition is to use 'non-space' characters.
   regexp -all -inline {\w+} $str

}

  1. Adds a document to the index. The index is a map from words to a map
  2. from filenames to lists of word locations.

proc addDocumentToIndex {filename} {

   global index
   set f [open $filename]
   set data [read $f]
   close $f
   set i 0
   array set localidx {}
   foreach word [wordsInString $data] {

lappend localidx($word) $i incr i

   }
   # Transcribe into global index
   foreach {word places} [array get localidx] {

dict set index($word) $filename $places

   }

}

  1. How to use the index to find files containing a word

proc findFilesForWord {word} {

   global index
   if {[info exists index($word)]} {

return [dict keys $index($word)]

   }

}

  1. How to use the index to find files containing all words from a list.
  2. Note that this does not use the locations within the file.

proc findFilesWithAllWords {words} {

   set files [findFilesForWord [lindex $words 0]]
   foreach w [lrange $words 1 end] {

set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } set files $newfiles

   }
   return $files

}

  1. How to use the index to find a sequence of words in a file.

proc findFilesWithWordSequence {words} {

   global index
   set files {}
   foreach w $words {

if {![info exist index($w)]} { return }

   }
   dict for {file places} $index([lindex $words 0]) {

if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } if {$gotNext} break } if {!$gotNext} { set gotStart 0 break } } if {$gotStart} { lappend files $file break } }

   }
   return $files

}</lang> For the GUI: <lang tcl>package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] pack [button .files.add -command AddFile -text "Add File to Index"] pack [labelframe .found -text Found] -side right -fill y pack [listbox .found.list -listvariable found] -fill x pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right

  1. The actions invoked by various GUI buttons

proc AddFile {} {

   global files
   set f [tk_getOpenFile]
   if {$f ne ""} {

addDocumentToIndex $f lappend files $f

   }

} proc FindAll {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithAllWords $words]
   lappend found "Searching for files with all $terms" {*}$fs \

"---------------------" } proc FindSeq {} {

   global found terms
   set words [wordsInString $terms]
   set fs [findFilesWithWordSequence $words]
   lappend found "Searching for files with \"$terms\"" {*}$fs \

"---------------------" }</lang>