Inverted index: Difference between revisions

← Older edit

Inverted index (view source)

Revision as of 11:44, 11 December 2023

99,946 bytes added , 5 months ago

m

→‎{{header|Wren}}: Minor tidy

PureFox

9,479

edits

Revision as of 21:02, 19 April 2011 (view source) Sonia (talk \| contribs) (→‎{{header\|Go}}: update after library change) ← Older edit		Latest revision as of 11:44, 11 December 2023 (view source) PureFox (talk \| contribs) m (→‎{{header\|Wren}}: Minor tidy)
(63 intermediate revisions by 34 users not shown)
Line 1: {{task\|Classic CS problems and programs}}[[Category:Search]] An [[wp:Inverted_index\|Inverted Index]] is a data structure used to create full text search. Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory. ;Task: Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory. <br><br> =={{header\|11l}}== {{trans\|D}} <syntaxhighlight lang="11l">DefaultDict[String, Set[String]] index F parse_file(fname, fcontents) L(word) fcontents.split(re:‘\W’) :index[word.lowercase()].add(fname) L(fname, fcontents) [(‘inv1.txt’, ‘It is what it is.’), (‘inv2.txt’, ‘What is it?’), (‘inv3.txt’, ‘It is a banana!’)] parse_file(fname, fcontents) L(w) [‘cat’, ‘is’, ‘banana’, ‘it’, ‘what’] print("\nEnter a word to search for: (q to quit): "w) I w C index print(‘'#.' found in #..’.format(w, sorted(Array(index[w])))) E print(‘'#.' not found.’.format(w))</syntaxhighlight> {{out}} <pre> Enter a word to search for: (q to quit): cat 'cat' not found. Enter a word to search for: (q to quit): is 'is' found in [inv1.txt, inv2.txt, inv3.txt]. Enter a word to search for: (q to quit): banana 'banana' found in [inv3.txt]. Enter a word to search for: (q to quit): it 'it' found in [inv1.txt, inv2.txt, inv3.txt]. Enter a word to search for: (q to quit): what 'what' found in [inv1.txt, inv2.txt]. </pre> =={{header\|Ada}}== ===Main program=== Here is the main program (file inverted_index.adb): <syntaxhighlight lang="ada">with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines; use Ada.Text_IO; procedure Inverted_Index is type Process_Word is access procedure (Word: String); package Inv_Idx is new Generic_Inverted_Index (Source_Type => String, Item_Type => String, Hash => Ada.Strings.Hash); use Inv_Idx; procedure Output(Sources: Source_Vecs.Vector) is Any_Output: Boolean := False; procedure Print_Source(S: String) is begin if not Any_Output then -- this is the first source found Put("Found in the following files: "); Any_Output := True; else -- there has been at least one source before Put(", "); end if; Put(S); end Print_Source; procedure Print is new Inv_Idx.Iterate(Print_Source); begin Print(Sources); if not Any_Output then Put("I did not find this in any of the given files!"); end if; New_Line(2); end Output; procedure Read_From_File(Table: in out Storage_Type; Filename: String) is F: File_Type; procedure Enter_Word(S: String) is begin Table.Store(Source => Filename, Item => S); end Enter_Word; procedure Store_Words is new Parse_Lines.Iterate_Words(Parse_Lines.Word_Pattern, Enter_Word); begin Open(File => F, Mode => In_File, Name => Filename); while not End_Of_File(F) loop Store_Words(Get_Line(F)); end loop; Close(F); exception when others => Put_Line("Something wrong with File '" & Filename & "'"); Put_Line("I'll ignore this!"); end Read_From_File; procedure Read_Files(Tab: out Storage_Type; Line: in String) is procedure Read_File(S: String) is begin Read_From_File(Tab, S); end Read_File; procedure Read_All is new Parse_Lines.Iterate_Words(Parse_Lines.Filename_Pattern, Read_File); begin Read_All(Line); end Read_Files; S: Storage_Type; Done: Boolean := False; begin Put_Line("Enter Filenames:"); Read_Files(S, Get_Line); New_Line; while not Done loop Put_Line("Enter one or more words to search for; <return> to finish:"); declare Words: String := Get_Line; First: Boolean := True; Vec: Source_Vecs.Vector := Source_Vecs.Empty_Vector; procedure Compute_Vector(Item: String) is begin if First then Vec := S.Find(Item); First := False; else Vec := Vec and S.Find(Item); end if; end Compute_Vector; procedure Compute is new Parse_Lines.Iterate_Words(Parse_Lines.Word_Pattern, Compute_Vector); begin if Words = "" then Done := True; else Compute(Words); Output(Vec); end if; end; end loop; end Inverted_Index;</syntaxhighlight> A sample output: <pre>Enter Filenames: 0.txt 1.txt 2.txt Enter one or more words to search for; <return> to finish: it Found in the following files: 0.txt, 1.txt, 2.txt Enter one or more words to search for; <return> to finish: that I did not find this in any of the given files! Enter one or more words to search for; <return> to finish: what is it Found in the following files: 0.txt, 1.txt Enter one or more words to search for; <return> to finish: </pre> ===Package Generic_Inverted_Index=== The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads): <syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors; private with Ada.Containers.Indefinite_Hashed_Maps; generic type Source_Type (<>) is private; type Item_Type (<>) is private; with function Hash(Item: Item_Type) return Ada.Containers.Hash_Type is <>; package Generic_Inverted_Index is type Storage_Type is tagged private; package Source_Vecs is new Ada.Containers.Indefinite_Vectors (Index_Type => Positive, Element_Type => Source_Type); procedure Store(Storage: in out Storage_Type; Source: Source_Type; Item: Item_Type); -- stores Source in a table, indexed by Item -- if there is already an Item/Source entry, the Table isn_t changed function Find(Storage: Storage_Type; Item: Item_Type) return Source_Vecs.Vector; -- Generates a vector of all Sources for the given Item function "and"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector; -- returns a vector of all sources, which are both in Left and in Right function "or"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector; -- returns a vector of all sources, which are in Left, Right, or both function Empty(Vec: Source_Vecs.Vector) return Boolean; -- returns true if Vec is empty function First_Source(The_Sources: Source_Vecs.Vector) return Source_Type; -- returns the first enty in The_Sources; pre: The_Sourses is not empty procedure Delete_First_Source(The_Sources: in out Source_Vecs.Vector; Count: Ada.Containers.Count_Type := 1); -- Removes the first Count entries; pre: The_Sourses has that many entries type Process_Source is not null access procedure (Source: Source_Type); generic with procedure Do_Something(Source: Source_Type); procedure Iterate(The_Sources: Source_Vecs.Vector); -- calls Do_Something(Source) for all sources in The_Sources; private function Same_Vector(U,V: Source_Vecs.Vector) return Boolean; package Maps is new Ada.Containers.Indefinite_Hashed_Maps -- for each item (=key) we store a vector with sources (Key_Type => Item_Type, Element_Type => Source_Vecs.Vector, Hash => Hash, Equivalent_Keys => "=", "=" => Same_Vector); type Storage_Type is new Maps.Map with null record; end Generic_Inverted_Index;</syntaxhighlight> Here is the implementation (generic_inverted_index.adb): <syntaxhighlight lang="ada">package body Generic_Inverted_Index is use Source_Vecs; use type Maps.Cursor; procedure Store(Storage: in out Storage_Type; Source: Source_Type; Item: Item_Type) is begin if (Storage.Find(Item) = Maps.No_Element) then Storage.Insert(Key => Item, New_Item => Empty_Vector & Source); else declare The_Vector: Vector := Storage.Element(Item); begin if The_Vector.Last_Element /= Source then Storage.Replace (Key => Item, New_Item => Storage.Element(Item) & Source); end if; end; end if; end Store; function Find(Storage: Storage_Type; Item: Item_Type) return Vector is begin return Storage.Element(Item); exception when Constraint_Error => return Empty_Vector; -- found nothing end Find; function Is_In(S: Source_Type; V: Vector) return Boolean is VV: Vector := V; begin if Empty(V) then return False; elsif First_Source(V) = S then return True; else Delete_First_Source(VV); return Is_In(S, VV); end if; end Is_In; function "and"(Left, Right: Vector) return Vector is V: Vector := Empty_Vector; begin for I in First_Index(Left) .. Last_Index(Left) loop if Is_In(Element(Left, I), Right) then V := V & Element(Left, I); end if; end loop; return V; end "and"; function "or"(Left, Right: Vector) return Vector is V: Vector := Left; -- all sources in Left begin -- ... add all sources in Right, which are not already in Left for I in First_Index(Right) .. Last_Index(Right) loop if not Is_In(Element(Right, I), Left) then V := V & Element(Right, I); end if; end loop; return V; end "or"; function Empty(Vec: Vector) return Boolean renames Is_Empty; function First_Source(The_Sources: Vector) return Source_Type renames First_Element; procedure Delete_First_Source(The_Sources: in out Vector; Count: Ada.Containers.Count_Type := 1) renames Delete_First; procedure Iterate(The_Sources: Vector) is V: Vector := The_Sources; begin while not Empty(V) loop Do_Something(First_Source(V)); Delete_First_Source(V); end loop; end Iterate; function Same_Vector(U,V: Vector) return Boolean is begin raise Program_Error with "there is no need to call this function"; return False; -- this avoices a compiler warning end Same_Vector; end Generic_Inverted_Index;</syntaxhighlight> ===Package Parse_Lines=== The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada\|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads): <syntaxhighlight lang="ada">with Gnat.Regpat; package Parse_Lines is Word_Pattern: constant String := "([a-zA-Z]+)"; Filename_Pattern: constant String := "([a-zA-Z0-9_.,;:]+)"; procedure Search_For_Pattern(Pattern: Gnat.Regpat.Pattern_Matcher; Search_In: String; First, Last: out Positive; Found: out Boolean); function Compile(Raw: String) return Gnat.Regpat.Pattern_Matcher; generic Pattern: String; with procedure Do_Something(Word: String); procedure Iterate_Words(S: String); end Parse_Lines;</syntaxhighlight> And here is the implementation (parse_lines.adb): <syntaxhighlight lang="ada">with Gnat.Regpat; package body Parse_Lines is procedure Search_For_Pattern(Pattern: Gnat.Regpat.Pattern_Matcher; Search_In: String; First, Last: out Positive; Found: out Boolean) is use Gnat.Regpat; Result: Match_Array (0 .. 1); begin Match(Pattern, Search_In, Result); Found := Result(1) /= No_Match; if Found then First := Result(1).First; Last := Result(1).Last; end if; end Search_For_Pattern; function Compile(Raw: String) return Gnat.Regpat.Pattern_Matcher is begin return Gnat.Regpat.Compile(Raw); end Compile; procedure Iterate_Words(S: String) is Current_First: Positive := S'First; First, Last: Positive; Found: Boolean; use Parse_Lines; Compiled_P: Gnat.Regpat.Pattern_Matcher := Compile(Pattern); begin loop Search_For_Pattern(Compiled_P, S(Current_First .. S'Last), First, Last, Found); exit when not Found; Do_Something(S(First .. Last)); Current_First := Last+1; end loop; end Iterate_Words; end Parse_Lines;</syntaxhighlight> ===Alternative Implementation of Generic_Inverted_Index (Ada 2012)=== The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index: <syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors; private with Ada.Containers.Indefinite_Hashed_Maps; generic type Source_Type (<>) is private; type Item_Type (<>) is private; with function Hash(Item: Item_Type) return Ada.Containers.Hash_Type is <>; package Generic_Inverted_Index is type Storage_Type is tagged private; package Source_Vecs is new Ada.Containers.Indefinite_Vectors (Index_Type => Positive, Element_Type => Source_Type); procedure Store(Storage: in out Storage_Type; Source: Source_Type; Item: Item_Type); -- stores Source in a table, indexed by Item -- if there is already an Item/Source entry, the Table isn_t changed function Find(Storage: Storage_Type; Item: Item_Type) return Source_Vecs.Vector; -- Generates a vector of all Sources for the given Item function "and"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector; -- returns a vector of all sources, which are both in Left and in Right function "or"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector; -- returns a vector of all sources, which are in Left, Right, or both function Empty(Vec: Source_Vecs.Vector) return Boolean; -- returns true if Vec is empty type Process_Source is not null access procedure (Source: Source_Type); generic with procedure Do_Something(Source: Source_Type); procedure Iterate(The_Sources: Source_Vecs.Vector); -- calls Do_Something(Source) for all sources in The_Sources; private function Same_Vector(U,V: Source_Vecs.Vector) return Boolean; package Maps is new Ada.Containers.Indefinite_Hashed_Maps -- for each item (=key) we store a vector with sources (Key_Type => Item_Type, Element_Type => Source_Vecs.Vector, Hash => Hash, Equivalent_Keys => "=", "=" => Same_Vector); type Storage_Type is new Maps.Map with null record; end Generic_Inverted_Index;</syntaxhighlight> The implementation: <syntaxhighlight lang="ada">package body Generic_Inverted_Index is -- uses some of the new Ada 2012 syntax use Source_Vecs; procedure Store(Storage: in out Storage_Type; Source: Source_Type; Item: Item_Type) is use type Maps.Cursor; begin if (Storage.Find(Item) = Maps.No_Element) then Storage.Insert(Key => Item, New_Item => Empty_Vector & Source); else declare The_Vector: Vector := Storage.Element(Item); begin if The_Vector.Last_Element /= Source then Storage.Replace (Key => Item, New_Item => Storage.Element(Item) & Source); end if; end; end if; end Store; function Find(Storage: Storage_Type; Item: Item_Type) return Vector is begin return Storage.Element(Item); exception when Constraint_Error => return Empty_Vector; -- found nothing end Find; function Is_In(S: Source_Type; V: Vector) return Boolean is begin for Some_Element of V loop if Some_Element = S then return True; end if; end loop; return False; end Is_In; function "and"(Left, Right: Vector) return Vector is V: Vector := Empty_Vector; begin for Some_Element of Left loop if Is_In(Some_Element, Right) then V := V & Some_Element; end if; end loop; return V; end "and"; function "or"(Left, Right: Vector) return Vector is V: Vector := Left; -- all sources in Left begin for Some_Element of Right loop if not Is_In(Some_Element, Left) then V := V & Some_Element; end if; end loop; return V; end "or"; function Empty(Vec: Vector) return Boolean renames Is_Empty; procedure Iterate(The_Sources: Vector) is begin for Some_Element in The_Sources loop Do_Something(Element(Some_Element)); end loop; end Iterate; function Same_Vector(U,V: Vector) return Boolean is begin raise Program_Error with "there is no need to call this function"; return False; -- this avoices a compiler warning end Same_Vector; end Generic_Inverted_Index;</syntaxhighlight> =={{header\|AutoHotkey}}== {{works with\|AutoHotkey_L}} <~~lang~~syntaxhighlight ~~AutoHotkey~~lang="autohotkey">; http://www.autohotkey.com/forum/viewtopic.php?t=41479 inputbox, files, files, file pattern such as c:\files\.txt Line 15 ⟶ 585: Loop, %files%, 0,1 { tooltip,%A_index% / 500 wordList := WordsIn(A_LoopFileFullPath) InvertedIndex(wordList, A_loopFileFullpath) } Line 36 ⟶ 606: WordsIn(docpath) { FileRead, content, %docpath% spos = 1 Line 46 ⟶ 616: this_wordList .= match "`n" } Sort, this_wordList, U return this_wordList } Line 55 ⟶ 625: global word2docs loop, parse, words, `n,`r { { if A_loopField = continue Line 71 ⟶ 641: else return word2docs[word2find] }</~~lang~~syntaxhighlight> =={{header\|BBC BASIC}}== {{works with\|BBC BASIC for Windows}} This uses a hashed index and linked lists to hold the file numbers. <syntaxhighlight lang="bbcbasic"> DIM FileList$(4) FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \ \ "BBCKEY3.TXT", "BBCKEY4.TXT" DictSize% = 30000 DIM Index{(DictSize%-1) word$, link%} REM Build inverted index: FOR file% = DIM(FileList$(),1) TO 0 STEP -1 filename$ = FileList$(file%) F% = OPENIN(filename$) IF F% = 0 ERROR 100, "Failed to open file" WHILE NOT EOF#F% REPEAT C%=BGET#F% : UNTIL C%>64 OR EOF#F% : word$ = CHR$(C%) REPEAT C%=BGET#F% : word$ += CHR$(C%) : UNTIL C%<65 word$ = FNlower(LEFT$(word$)) hash% = FNhash(word$) WHILE Index{(hash%)}.word$<>"" AND Index{(hash%)}.word$<>word$ hash% = (hash% + 1) MOD DictSize% : REM Collision ENDWHILE Index{(hash%)}.word$ = word$ link% = Index{(hash%)}.link% IF link%=0 OR link%!4<>file% THEN DIM heap% 7 : heap%!4 = file% !heap% = link% Index{(hash%)}.link% = heap% : REM Linked list ENDIF ENDWHILE CLOSE #F% NEXT file% REM Now query the index: PRINT FNquery("random") PRINT FNquery("envelope") PRINT FNquery("zebra") PRINT FNquery("the") END DEF FNquery(A$) LOCAL hash%, link%, temp% A$ = FNlower(A$) hash% = FNhash(A$) temp% = hash% WHILE Index{(hash%)}.word$ <> A$ hash% = (hash% + 1) MOD DictSize% IF hash% = temp% THEN = """" + A$ + """ not found" ENDWHILE link% = Index{(hash%)}.link% A$ = """" + A$ + """ found in " WHILE link% A$ += FileList$(link%!4) + ", " link% = !link% ENDWHILE = LEFT$(LEFT$(A$)) DEF FNhash(A$) LOCAL hash% IF LEN(A$) < 4 A$ += STRING$(4-LEN(A$),CHR$0) hash% = !!^A$ IF LEN(A$) > 4 hash% EOR= !(!^A$ + LEN(A$) - 4) = hash% MOD DictSize% DEF FNlower(A$) LOCAL A%,C% FOR A% = 1 TO LEN(A$) C% = ASCMID$(A$,A%) IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32) NEXT = A$</syntaxhighlight> '''Output:''' <pre> "random" found in BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT "envelope" found in BBCKEY1.TXT, BBCKEY4.TXT "zebra" not found "the" found in BBCKEY0.TXT, BBCKEY1.TXT, BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT </pre> =={{header\|C}}== The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at. <syntaxhighlight lang="c">#include <stdio.h> #include <stdlib.h> char chr_legal[] = "abcdefghijklmnopqrstuvwxyz0123456789_-./"; int chr_idx[256] = {0}; char idx_chr[256] = {0}; #define FNAME 0 typedef struct trie_t trie, trie_t; struct trie_t { trie next[sizeof(chr_legal)]; /* next letter; slot 0 is for file name / int eow; }; trie trie_new() { return calloc(sizeof(trie_t), 1); } #define find_word(r, w) trie_trav(r, w, 1) / tree traversal: returns node if end of word and matches string, optionally * create node if doesn't exist / trie trie_trav(trie root, const char str, int no_create) { int c; while (root) { if ((c = str[0]) == '\0') { if (!root->eow && no_create) return 0; break; } if (! (c = chr_idx[c]) ) { str++; continue; } if (!root->next[c]) { if (no_create) return 0; root->next[c] = trie_new(); } root = root->next[c]; str++; } return root; } /* complete traversal of whole tree, calling callback at each end of word node. * similar method can be used to free nodes, had we wanted to do that. / int trie_all(trie root, char path[], int depth, int (callback)(char )) { int i; if (root->eow && !callback(path)) return 0; for (i = 1; i < sizeof(chr_legal); i++) { if (!root->next[i]) continue; path[depth] = idx_chr[i]; path[depth + 1] = '\0'; if (!trie_all(root->next[i], path, depth + 1, callback)) return 0; } return 1; } void add_index(trie root, const char word, const char fname) { trie x = trie_trav(root, word, 0); x->eow = 1; if (!x->next[FNAME]) x->next[FNAME] = trie_new(); x = trie_trav(x->next[FNAME], fname, 0); x->eow = 1; } int print_path(char path) { printf(" %s", path); return 1; } /* pretend we parsed text files and got lower cased words: dealing * * with text file is a whole other animal and would make code too long / const char files[] = { "f1.txt", "source/f2.txt", "other_file" }; const char text[][5] ={{ "it", "is", "what", "it", "is" }, { "what", "is", "it", 0 }, { "it", "is", "a", "banana", 0 }}; trie init_tables() { int i, j; trie root = trie_new(); for (i = 0; i < sizeof(chr_legal); i++) { chr_idx[(int)chr_legal[i]] = i + 1; idx_chr[i + 1] = chr_legal[i]; } / Enable USE_ADVANCED_FILE_HANDLING to use advanced file handling. * You need to have files named like above files[], with words in them * like in text[][]. Case doesn't matter (told you it's advanced). / #define USE_ADVANCED_FILE_HANDLING 0 #if USE_ADVANCED_FILE_HANDLING void read_file(const char fname) { char cmd[1024]; char word[1024]; sprintf(cmd, "perl -p -e 'while(/(\\w+)/g) {print lc($1),\"\\n\"}' %s", fname); FILE in = popen(cmd, "r"); while (!feof(in)) { fscanf(in, "%1000s", word); add_index(root, word, fname); } pclose(in); }; read_file("f1.txt"); read_file("source/f2.txt"); read_file("other_file"); #else for (i = 0; i < 3; i++) { for (j = 0; j < 5; j++) { if (!text[i][j]) break; add_index(root, text[i][j], files[i]); } } #endif /USE_ADVANCED_FILE_HANDLING/ return root; } void search_index(trie root, const char word) { char path[1024]; printf("Search for \"%s\": ", word); trie found = find_word(root, word); if (!found) printf("not found\n"); else { trie_all(found->next[FNAME], path, 0, print_path); printf("\n"); } } int main() { trie root = init_tables(); search_index(root, "what"); search_index(root, "is"); search_index(root, "banana"); search_index(root, "boo"); return 0; }</syntaxhighlight>Output:<syntaxhighlight lang="text">Search for "what": f1.txt source/f2.txt Search for "is": f1.txt other_file source/f2.txt Search for "banana": other_file Search for "boo": not found</syntaxhighlight> =={{header\|C sharp\|C#}}== <~~lang~~syntaxhighlight lang="csharp">using System; using System.Collections.Generic; using System.IO; Line 97 ⟶ 908: Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find])); } }</~~lang~~syntaxhighlight> Sample output: <syntaxhighlight lang="text">files: file1 file2 file3 find: what what found in: file1 file2</~~lang~~syntaxhighlight> =={{header\|~~Factor~~C++}}== Same idea as the C implementation - trie to store the words ~~<lang factor>USING: assocs fry io.encodings.utf8 io.files kernel sequences~~ <syntaxhighlight lang="cpp"> ~~sets splitting vectors ;~~ #include <algorithm> ~~IN: rosettacode.inverted-index~~ #include <fstream> #include <iostream> #include <vector> #include <string> const std::string _CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/"; ~~: file-words ( file -- assoc )~~ const size_t MAX_NODES = 41; ~~utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;~~ ~~: add-to-file-list ( files file -- files )~~ ~~over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;~~ ~~: add-to-index ( words index file -- )~~ ~~'[ _ [ _ add-to-file-list ] change-at ] each ;~~ ~~: (index-files) ( files index -- )~~ ~~[ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;~~ ~~: index-files ( files -- index )~~ ~~H{ } clone [ (index-files) ] keep ;~~ ~~: query ( terms index -- files )~~ ~~[ at ] curry map [ ] [ intersect ] map-reduce ;~~ ~~</lang>~~ class node ~~Example use :~~ { ~~<lang>( scratchpad ) { "f1" "f2" "f3" } index-files~~ public: node() { clear(); } node( char z ) { clear(); } ~node() { for( int x = 0; x < MAX_NODES; x++ ) if( next[x] ) delete next[x]; } void clear() { for( int x = 0; x < MAX_NODES; x++ ) next[x] = 0; isWord = false; } bool isWord; std::vector<std::string> files; node* next[MAX_NODES]; }; class index { public: void add( std::string s, std::string fileName ) { std::transform( s.begin(), s.end(), s.begin(), tolower ); std::string h; for( std::string::iterator i = s.begin(); i != s.end(); i++ ) { if( i == 32 ) { pushFileName( addWord( h ), fileName ); h.clear(); continue; } h.append( 1, i ); } if( h.length() ) pushFileName( addWord( h ), fileName ); } void findWord( std::string s ) { std::vector<std::string> v = find( s ); if( !v.size() ) { std::cout << s + " was not found!\n"; return; } std::cout << s << " found in:\n"; for( std::vector<std::string>::iterator i = v.begin(); i != v.end(); i++ ) { std::cout << i << "\n"; } std::cout << "\n"; } private: void pushFileName( node n, std::string fn ) { std::vector<std::string>::iterator i = std::find( n->files.begin(), n->files.end(), fn ); if( i == n->files.end() ) n->files.push_back( fn ); } const std::vector<std::string>& find( std::string s ) { size_t idx; std::transform( s.begin(), s.end(), s.begin(), tolower ); node* rt = &root; for( std::string::iterator i = s.begin(); i != s.end(); i++ ) { idx = _CHARS.find( i ); if( idx < MAX_NODES ) { if( !rt->next[idx] ) return std::vector<std::string>(); rt = rt->next[idx]; } } if( rt->isWord ) return rt->files; return std::vector<std::string>(); } node addWord( std::string s ) { size_t idx; node* rt = &root, n; for( std::string::iterator i = s.begin(); i != s.end(); i++ ) { idx = _CHARS.find( i ); if( idx < MAX_NODES ) { n = rt->next[idx]; if( n ){ rt = n; continue; } n = new node( i ); rt->next[idx] = n; rt = n; } } rt->isWord = true; return rt; } node root; }; int main( int argc, char argv[] ) { index t; std::string s; std::string files[] = { "file1.txt", "f_text.txt", "text_1b.txt" }; for( int x = 0; x < 3; x++ ) { std::ifstream f; f.open( files[x].c_str(), std::ios::in ); if( f.good() ) { while( !f.eof() ) { f >> s; t.add( s, files[x] ); s.clear(); } f.close(); } } while( true ) { std::cout << "Enter one word to search for, return to exit: "; std::getline( std::cin, s ); if( !s.length() ) break; t.findWord( s ); } return 0; } </syntaxhighlight> {{out}} <pre> Enter one word to search for, return to exit: goodness goodness found in: file1.txt f_text.txt Enter one word to search for, return to exit: because because found in: f_text.txt Enter one word to search for, return to exit: her her found in: text_1b.txt Enter one word to search for, return to exit: fat fat was not found! </pre> =={{header\|Clojure}}== <syntaxhighlight lang="clojure">(ns inverted-index.core (:require [clojure.set :as sets] [clojure.java.io :as io])) (def pattern #"\w+") ; Java regex for a raw term: here a substring of alphanums (defn normalize [match] (.toLowerCase match)) ; normalization of a raw term (defn term-seq [text] (map normalize (re-seq pattern text))) (defn set-assoc "Produces map with v added to the set associated with key k in map m" [m k v] (assoc m k (conj (get m k #{}) v))) (defn index-file [index file] (with-open [reader (io/reader file)] (reduce (fn [idx term] (set-assoc idx term file)) index (mapcat term-seq (line-seq reader))))) (defn make-index [files] (reduce index-file {} files)) (defn search [index query] (apply sets/intersection (map index (term-seq query)))) </syntaxhighlight> =={{header\|CoffeeScript}}== <syntaxhighlight lang="coffeescript"> fs = require 'fs' make_index = (fns) -> # words are indexed by filename and 1-based line numbers index = {} for fn in fns for line, line_num in fs.readFileSync(fn).toString().split '\n' words = get_words line for word in words word = mangle(word) index[word] \|\|= [] index[word].push [fn, line_num+1] index grep = (index, word) -> console.log "locations for '#{word}':" locations = index[mangle(word)] \|\| [] for location in locations [fn, line_num] = location console.log "#{fn}:#{line_num}" console.log "\n" get_words = (line) -> words = line.replace(/\W/g, ' ').split ' ' (word for word in words when word != '') mangle = (word) -> # avoid conflicts with words like "constructor" '_' + word do -> fns = (fn for fn in fs.readdirSync('.') when fn.match /\.coffee/) index = make_index(fns) grep index, 'make_index' grep index, 'sort' </syntaxhighlight> output <syntaxhighlight lang="text"> > coffee inverted_index.coffee locations for 'make_index': inverted_index.coffee:3 inverted_index.coffee:33 inverted_index.coffee:34 locations for 'sort': anagrams.coffee:8 derangements.coffee:14 heap.coffee:34 heap.coffee:43 huffman.coffee:81 inverted_index.coffee:35 knuth_sample.coffee:12 </syntaxhighlight> =={{header\|Common Lisp}}== <syntaxhighlight lang="lisp">(defpackage rosettacode.inverted-index (:use cl)) (in-package rosettacode.inverted-index) ;; Return a list of tokens in the string LINE. This is rather ;; complicated as CL has no good standard function to do it. (defun tokenize (line) (let ((start 0) (len (length line))) (loop for s = (position-if #'alphanumericp line :start start) while s for e = (position-if-not #'alphanumericp line :start (1+ s)) collect (subseq line s e) while (and e (< e len)) do (setq start e)))) (defun index-file (index filename) (with-open-file (in filename) (loop for line = (read-line in nil nil) while line do (dolist (token (tokenize line)) (pushnew filename (gethash token index '())))))) (defun build-index (filenames) (let ((index (make-hash-table :test #'equal))) (dolist (f filenames) (index-file index f)) index)) ;; Find the files for QUERY. We use the same tokenizer for the query ;; as for files. (defun lookup (index query) (remove-duplicates (loop for token in (tokenize query) append (gethash token index)) :test #'equal)) </syntaxhighlight> Example: <syntaxhighlight lang="lisp">(defparameter index (build-index '("file1.txt" "file2.txt" "file3.txt"))) (defparameter query "foo bar") (defparameter result (lookup index query)) (format t "Result for query ~s: ~{~a~^, ~}~%" query result)</syntaxhighlight> =={{header\|D}}== <syntaxhighlight lang="d">import std.stdio, std.algorithm, std.string, std.file, std.regex; void main() { string[][string] index; void parseFile(in string fn) { if (!exists(fn) \|\| !isFile(fn)) throw new Exception("File not found"); foreach (word; readText(fn).splitter(regex(r"\W"))) { word = word.toLower(); if (!index.get(word, null).canFind(fn)) index[word] ~= fn; } } immutable fileNames = ["inv1.txt", "inv2.txt", "inv3.txt"]; foreach (fName; fileNames) parseFile(fName); while (true) { writef("\nEnter a word to search for: (q to quit): "); immutable w = readln().strip().toLower(); if (w == "q") { writeln("quitting."); break; } if (w in index) writefln("'%s' found in%( %).", w, index[w]); else writefln("'%s' not found.", w); } }</syntaxhighlight> Both the demo text files and the queries are from the Wikipedia page, they contain: It is what it is. What is it? It is a banana! {{out}} <pre>Enter a word to search for: (q to quit): cat 'cat' not found. Enter a word to search for: (q to quit): is 'is' found in "inv1.txt" "inv2.txt" "inv3.txt". Enter a word to search for: (q to quit): banana 'banana' found in "inv3.txt". Enter a word to search for: (q to quit): it 'it' found in "inv1.txt" "inv2.txt" "inv3.txt". Enter a word to search for: (q to quit): what 'what' found in "inv1.txt" "inv2.txt". Enter a word to search for: (q to quit): q quitting.</pre> =={{header\|Delphi}}== {{libheader\| System.SysUtils}} {{libheader\| system.Generics.Collections}} {{libheader\| SYstem.IOUtils}} <syntaxhighlight lang="delphi"> program Inverted_index; {$APPTYPE CONSOLE} uses System.SysUtils, system.Generics.Collections, SYstem.IOUtils; type TIndex = class private FFileNames: TArray<string>; FIndexs: TDictionary<string, string>; class function SplitWords(Text: string): TArray<string>; static; function StoreFileName(FileName: TFileName): string; procedure StoreLine(Line, Id: string); procedure StoreWord(aWord, Id: string); function DecodeFileName(Indexed: string): string; public procedure AddFile(FileName: TFileName); constructor Create; destructor Destroy; override; function Search(aWord: string): string; property Index[aWord: string]: string read Search; default; end; { TIndex } constructor TIndex.Create; begin FIndexs := TDictionary<string, string>.Create(); SetLength(FFileNames, 0); end; function TIndex.DecodeFileName(Indexed: string): string; var f: string; i: integer; begin Result := Indexed; if Length(FFileNames) > 0 then for i := 0 to High(FFileNames) do begin f := FFileNames[i]; Result := Result.Replace(format('(%x)', [i]), f); end; end; destructor TIndex.Destroy; begin FIndexs.Free; inherited; end; function TIndex.Search(aWord: string): string; begin aWord := AnsiLowerCase(aWord); if FIndexs.ContainsKey(aWord) then Result := 'found in ' + DecodeFileName(FIndexs[aWord]) else Result := 'not found'; end; class function TIndex.SplitWords(Text: string): TArray<string>; const WORD_CHARSET: set of char = (['a'..'z', 'A'..'Z', 'á', 'é', 'í', 'ó', 'ú', 'ã', 'õ', 'à', 'è', 'ì', 'ò', 'ù', 'Á', 'É', 'Í', 'Ó', 'Ú', 'Ã', 'Õ', 'À', 'È', 'Ì', 'Ò', 'Ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'ç', 'Ç', '_', '0'..'9']); procedure Append(var value: string); begin if not value.IsEmpty then begin SetLength(result, length(result) + 1); result[high(result)] := value; value := ''; end; end; var c: Char; inWord, isWordChar: boolean; w: string; begin inWord := False; w := ''; SetLength(result, 0); for c in Text do begin isWordChar := (c in WORD_CHARSET); if inWord then if isWordChar then w := w + c else begin Append(w); inWord := false; Continue; end; if not inWord and isWordChar then begin inWord := True; w := c; end; end; if inWord then Append(w); end; function TIndex.StoreFileName(FileName: TFileName): string; begin SetLength(FFileNames, length(FFileNames) + 1); FFileNames[High(FFileNames)] := FileName; Result := format('"(%x)"', [High(FFileNames)]); end; procedure TIndex.StoreLine(Line, Id: string); var Words: TArray<string>; w: string; begin Words := SplitWords(Line); for w in Words do StoreWord(w, Id); end; procedure TIndex.StoreWord(aWord, Id: string); begin aWord := AnsiLowercase(aWord); if FIndexs.ContainsKey(aWord) then begin if (FIndexs[aWord].IndexOf(Id) = -1) then FIndexs[aWord] := FIndexs[aWord] + ', ' + Id; end else FIndexs.Add(aWord, Id); end; procedure TIndex.AddFile(FileName: TFileName); var Lines: TArray<string>; Line: string; Id: string; begin if not FileExists(FileName) then exit; Lines := TFile.ReadAllLines(FileName); if length(Lines) = 0 then exit; // Remove this line if you want full filename FileName := ExtractFileName(FileName); Id := StoreFileName(FileName); for Line in Lines do StoreLine(Line, Id); end; var Index: TIndex; FileNames: TArray<TFileName> = ['file1.txt', 'file2.txt', 'file3.txt']; FileName: TFileName; Querys: TArray<string> = ['Cat', 'is', 'banana', 'it', 'what']; Query, Found: string; begin Index := TIndex.Create; for FileName in FileNames do Index.AddFile(FileName); for Query in Querys do Writeln(format('"%s" %s'#10, [Query, Index[Query]])); repeat Writeln('Enter a word to search for: (q to quit)'); Readln(Query); if Query = 'q' then Break; Writeln(format('"%s" %s'#10, [Query, Index[Query]])); until False; Index.Free; end.</syntaxhighlight> Input: Same of [[#D\|D]]. {{out}} <pre>"Cat" not found "is" found in "file1.txt", "file2.txt", "file3.txt" "banana" found in "file3.txt" "it" found in "file1.txt", "file2.txt", "file3.txt" "what" found in "file1.txt", "file2.txt" Enter a word to search for: (q to quit): q</pre> =={{header\|EchoLisp}}== ===Indexing=== Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage. <syntaxhighlight lang="lisp"> ;; set of input files (define FILES {T0.txt T1.txt T2.txt}) ;; store name for permanent inverted index (define INVERT "INVERTED-INDEX") ;; get text for each file, and call (action filename text) (define (map-files action files) (for ((file files)) (file->string action file))) ;; create store (local-make-store INVERT) ; invert-word : word -> set of files (define (invert-word word file store) (local-put-value word (make-set (cons file (local-get-value word store))) store)) ; parse file text and invert each word (define (invert-text file text) (writeln 'Inverting file text) (let ((text (text-parse text))) (for ((word text)) (invert-word (string-downcase word) file INVERT)))) </syntaxhighlight> === Query === Intersect sets values of each word. <syntaxhighlight lang="lisp"> ;; usage : (inverted-search w1 w2 ..) (define-syntax-rule (inverted-search w ...) (and-get-invert (quote w ))) ;; intersects all sets referenced by words ;; returns the intersection (define (and-get-invert words) (foldl (lambda(word res) (set-intersect res (local-get-value word INVERT))) FILES words)) </syntaxhighlight> Output : <syntaxhighlight lang="lisp"> (map-files invert-text FILES) (inverted-search is it) [0]→ { T0.txt T1.txt T2.txt } (inverted-search is banana) [1]→ { T2.txt } (inverted-search is what) [2]→ { T0.txt T1.txt } (inverted-search boule) [3]→ null </syntaxhighlight> =={{header\|Erlang}}== This might be used with a lot of large files so we use binaries to save space. That adds <<>> to the search terms. If somebody wants to avoid "end." and "end" being two different terms, just add <<".">> to binary:compile_pattern/1 Ditto for any other character. <syntaxhighlight lang="erlang"> -module( inverted_index ). -export( [from_files/1, search/2, task/0] ). from_files( Files ) -> lists:foldl( fun import_from_file/2, dict:new(), Files ). search( Binaries, Inverted_index ) -> [Files \| T] = [dict:fetch(X, Inverted_index) \|\| X <- Binaries], lists:foldl( fun search_common/2, Files, T ). task() -> Files_contents = [{"file_1", <<"it is what it is">>}, {"file_2", <<"what is it">>}, {"file_3", <<"it is a banana">>}], [file:write_file(X, Y) \|\| {X, Y} <- Files_contents], Inverted_index = from_files( [X \|\| {X, _Y} <- Files_contents] ), Result = search( [<<"what">>, <<"is">>, <<"it">>], Inverted_index ), io:fwrite( "~p~n", [Result] ), [file:delete(X) \|\| {X, _Y} <- Files_contents]. import_from_file( File, Dict_acc ) -> New_dict = dict:from_list( import_from_file_contents(File, file:read_file(File)) ), dict:merge( fun import_from_file_merge/3, Dict_acc, New_dict ). import_from_file_contents( File, {ok, Binary} ) -> [{X, [File]} \|\| X <- binary:split( Binary, binary:compile_pattern([<<" ">>, <<"\n">>]), [global] )]; import_from_file_contents( File, {error, Error} ) -> io:fwrite( "Error: could not open file ~p: ~p~nContinuing with the rest of them~n", [File, Error] ), []. import_from_file_merge( _Key, Files, [New_file] ) -> [New_file \| Files]. search_common( Files, Acc ) -> [X \|\| X <- Acc, lists:member(X, Files)]. </syntaxhighlight> ~~--- Data stack:~~ ~~H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...~~ ~~( scratchpad ) { "what" "is" "it" } swap query .~~ ~~V{ "f1" "f2" }~~ ~~</lang>~~ =={{header\|F Sharp\|F#}}== <~~lang~~syntaxhighlight lang="fsharp">open System open System.IO // Map search terms to associated set of files type searchIndexMap = Map<string, Set<string>> let inputSearchCriteria() = let readLine prompt = printf "%s: " prompt Console.ReadLine().Split() readLine "Files", (readLine "Find") \|> Array.map (fun s -> s.ToLower()) let updateIndex indexMap keyValuePair = let k, v = keyValuePair match Map.tryFind k indexMap with \| None -> Map.add k (Set.singleton v) indexMap \| Some set -> Map.add k (Set.add v set) indexMap let buildIndex files = let fileData file = File.ReadAllText(file).Split() \|> Seq.map (fun word -> word.ToLower(), file) files \|> Seq.collect fileData \|> Seq.fold updateIndex Map.empty let searchFiles() = let files, terms = inputSearchCriteria() Line 165 ⟶ 1,584: \|> Set.intersectMany printf "Found in: " ; searchResults \|> Set.iter (printf "%s ") ; printfn ""</~~lang~~syntaxhighlight> Sample usage: <pre> Line 173 ⟶ 1,592: Find: what is Found in: file1.txt file2.txt</pre> =={{header\|Factor}}== <syntaxhighlight lang="factor">USING: assocs fry io.encodings.utf8 io.files kernel sequences sets splitting vectors ; IN: rosettacode.inverted-index : file-words ( file -- assoc ) utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ; : add-to-file-list ( files file -- files ) over [ swap [ adjoin ] keep ] [ nip 1vector ] if ; : add-to-index ( words index file -- ) '[ _ [ _ add-to-file-list ] change-at ] each ; : (index-files) ( files index -- ) [ [ [ file-words ] keep ] dip swap add-to-index ] curry each ; : index-files ( files -- index ) H{ } clone [ (index-files) ] keep ; : query ( terms index -- files ) [ at ] curry map [ ] [ intersect ] map-reduce ; </syntaxhighlight> Example use : <syntaxhighlight lang="text">( scratchpad ) { "f1" "f2" "f3" } index-files --- Data stack: H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ... ( scratchpad ) { "what" "is" "it" } swap query . V{ "f1" "f2" } </syntaxhighlight> =={{header\|Go}}== <~~lang~~syntaxhighlight lang="go">package main import ( "bufio" "bytes" "~~encoding/line~~errors" "fmt" "io" "os" ) // inverted index representation var index map[string][]int // ints index into indexed var indexed []doc Line 207 ⟶ 1,656: } func indexDir(dir string) ~~os.Error~~error { df, err := os.Open(dir) if err != nil { Line 217 ⟶ 1,666: } if len(fis) == 0 { return oserrors.~~NewError~~New(fmt.Sprintf("no files in %s", dir)) } indexed := 0 for _, fi := range fis { if !fi.~~IsRegular~~IsDir() { if indexFile(dir + "/" + fi.Name()) { indexed++ } Line 243 ⟶ 1,692: // scan lines r := ~~line~~bufio.NewReader(f~~, 4096~~) lines := 0 for { b, isPrefix, err := r.ReadLine() switch { case err == osio.EOF: return true case err != nil: Line 254 ⟶ 1,703: return true case isPrefix: fmt.Printf("%s: ~~unexpeced~~unexpected long line\n", fn) return true case lines < 20 && bytes.HasPrefix(b, []byte("Title:")): Line 307 ⟶ 1,756: } } }</~~lang~~syntaxhighlight> Session: <pre> 8448 words indexed in 11 files ~~8447 words indexed in 11 files~~ enter single words to search for enter a blank line when done Line 331 ⟶ 1,779: =={{header\|Haskell}}== <~~lang~~syntaxhighlight lang="haskell">import Control.Monad import Data.Char (isAlpha, toLower) import qualified Data.Map as M Line 362 ⟶ 1,810: intersections xs = foldl1 S.intersection xs lowercase = map toLower</~~lang~~syntaxhighlight> An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.": Line 373 ⟶ 1,821: The following implements a simple case insensitive inverse index using lists simulating texts. <~~lang~~syntaxhighlight ~~Icon~~lang="icon">procedure main() texts := table() # substitute for read and parse files Line 381 ⟶ 1,829: every textname := key(texts) do # build index for each 'text' SII := InvertedIndex(SII,textname,texts[textname]) TermSearchUI(SII) # search UI end Line 402 ⟶ 1,850: repeat { writes("Enter search terms (^z to quit) : ") terms := map(trim(read() \| break)) x := [] terms ? while not pos(0) do { tab(many(' \t')) put(x,tab(upto('\ \t')\|0)) } show("Searching for : ",x) show("Found in : ",s := TermSearch(ii,x)) \| show("Not found : ",x) } write("End of search") Line 418 ⟶ 1,866: procedure TermSearch(ii,x) #: return set of matches or fail every s := !x do ( /u := ii[s] ) \| (u *:= ii[s]) if u > 0 then return u Line 426 ⟶ 1,874: every writes(s\|!x) do writes(" ") write() return end</~~lang~~syntaxhighlight> Output:<pre>Enter search terms (^z to quit) : is it Line 443 ⟶ 1,891: The following code will build a full index. Modification of search routines is left as an exercise: <~~lang~~syntaxhighlight ~~Unicon~~lang="unicon">record InvertedIndexRec(simple,full) procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index Line 461 ⟶ 1,909: return ii end</~~lang~~syntaxhighlight> =={{header\|J}}== Line 467 ⟶ 1,915: This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching. <~~lang~~syntaxhighlight Jlang="j">require'files regex strings' rxutf8 0 NB. support latin1 searches for this example, instead of utf8 Line 490 ⟶ 1,938: hits=. buckets{~words i.~.parse tolower y files {~ >([-.-.)each/hits )</~~lang~~syntaxhighlight> Example use: <~~lang~~syntaxhighlight Jlang="j"> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm' >search 'finally learning' ~help/primer/end.htm Line 502 ⟶ 1,950: ~help/primer/gui.htm >search 'around' ~help/primer/gui.htm</~~lang~~syntaxhighlight> =={{header\|Java}}== <syntaxhighlight lang="java"> package org.rosettacode; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.util.ArrayList; import java.util.Arrays; import java.util.HashMap; import java.util.HashSet; import java.util.LinkedList; import java.util.List; import java.util.Map; import java.util.Set; public class InvertedIndex { List<String> stopwords = Arrays.asList("a", "able", "about", "across", "after", "all", "almost", "also", "am", "among", "an", "and", "any", "are", "as", "at", "be", "because", "been", "but", "by", "can", "cannot", "could", "dear", "did", "do", "does", "either", "else", "ever", "every", "for", "from", "get", "got", "had", "has", "have", "he", "her", "hers", "him", "his", "how", "however", "i", "if", "in", "into", "is", "it", "its", "just", "least", "let", "like", "likely", "may", "me", "might", "most", "must", "my", "neither", "no", "nor", "not", "of", "off", "often", "on", "only", "or", "other", "our", "own", "rather", "said", "say", "says", "she", "should", "since", "so", "some", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "tis", "to", "too", "twas", "us", "wants", "was", "we", "were", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "would", "yet", "you", "your"); Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>(); List<String> files = new ArrayList<String>(); public void indexFile(File file) throws IOException { int fileno = files.indexOf(file.getPath()); if (fileno == -1) { files.add(file.getPath()); fileno = files.size() - 1; } int pos = 0; BufferedReader reader = new BufferedReader(new FileReader(file)); for (String line = reader.readLine(); line != null; line = reader .readLine()) { for (String _word : line.split("\\W+")) { String word = _word.toLowerCase(); pos++; if (stopwords.contains(word)) continue; List<Tuple> idx = index.get(word); if (idx == null) { idx = new LinkedList<Tuple>(); index.put(word, idx); } idx.add(new Tuple(fileno, pos)); } } System.out.println("indexed " + file.getPath() + " " + pos + " words"); } public void search(List<String> words) { for (String _word : words) { Set<String> answer = new HashSet<String>(); String word = _word.toLowerCase(); List<Tuple> idx = index.get(word); if (idx != null) { for (Tuple t : idx) { answer.add(files.get(t.fileno)); } } System.out.print(word); for (String f : answer) { System.out.print(" " + f); } System.out.println(""); } } public static void main(String[] args) { try { InvertedIndex idx = new InvertedIndex(); for (int i = 1; i < args.length; i++) { idx.indexFile(new File(args[i])); } idx.search(Arrays.asList(args[0].split(","))); } catch (Exception e) { e.printStackTrace(); } } private class Tuple { private int fileno; private int position; public Tuple(int fileno, int position) { this.fileno = fileno; this.position = position; } } } </syntaxhighlight> Example output: <syntaxhighlight lang="java"> java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt indexed pg30637.txt 106473 words indexed pg7025.txt 205714 words indexed pg82.txt 205060 words indexed pg9090.txt 68962 words huntsman pg82.txt pg7025.txt merit pg9090.txt pg30637.txt pg82.txt pg7025.txt dog pg30637.txt pg82.txt pg7025.txt the gutenberg pg9090.txt pg30637.txt pg82.txt pg7025.txt lovecraft pg30637.txt olympian pg30637.txt </syntaxhighlight> =={{header\|jq}}== In the first part of this section, the core functions for computing an inverted index and for searching it are presented. These functions will work with jq 1.4 as well as later (and possibly earlier) versions. The second section shows how to accomplish the interactive task using a version of jq with support for 'input' and 'input_filename' (e.g. jq 1.5). '''Part 1: inverted_index and search''' <syntaxhighlight lang="jq"># Given an array of [ doc, array_of_distinct_words ] # construct a lookup table: { word: array_of_docs } def inverted_index: reduce .[] as $pair ({}; $pair[0] as $doc \| reduce $pair[1][] as $word (.; .[$word] += [$doc])); def search(words): def overlap(that): . as $this \| reduce that[] as $item ([]; if $this\|index($item) then . + [$item] else . end); . as $dict \| if (words\|length) == 0 then [] else reduce words[1:][] as $word ( $dict[words[0]]; overlap( $dict[$word] ) ) end ; </syntaxhighlight> '''Part 2: Interactive Search''' In this section, a solution to the task is presented using two invocations of jq: one parses the input files, and the other does everything else. If your shell does not support <(...) then you could create a temporary file to hold the parsed output. <syntaxhighlight lang="jq">def prompt_search: "Enter a string or an array of strings to search for, quoting each string, or 0 to exit:", ( (input \| if type == "array" then . elif type == "string" then [.] else empty end) as $in \| search($in), prompt_search ) ; $in \| inverted_index \| prompt_search</syntaxhighlight> '''Example''': <syntaxhighlight lang="sh">$ jq -r -c -n --argfile in <(jq -R 'split(" ") \| select(length>0) \| [input_filename, unique]' T?.txt) -f Inverted_index.jq Enter a string or an array of strings to search for, quoting each string, or 0 to exit: "is" ["T0.txt","T1.txt","T2.txt"] Enter a string or an array of strings to search for, quoting each string, or 0 to exit: ["is", "banana"] ["T2.txt"] Enter a string or an array of strings to search for, quoting each string, or 0 to exit: 0 $</syntaxhighlight> =={{header\|Julia}}== <syntaxhighlight lang="julia">function makedoubleindex(files) idx = Dict{String, Dict}() for file in files str = lowercase(read(file, String)) words = split(str, r"\s+") for word in words if !haskey(idx, word) idx[word] = Dict{String, Int}() elseif !haskey(idx[word], file) (idx[word])[file] = 1 else (idx[word])[file] += 1 end end end idx end function wordsearch(dict, words::Vector{String}) for word in words if haskey(dict, word) for (f, n) in dict[word] println("File $f contains $n instances of <$word>.") end else println("No instances of \"$word\" were found.") end end end wordsearch(dict, word::String) = wordsearch(dict, [word]) const filenames = ["file1.txt", "file2.txt", "file3.txt"] const didx = makedoubleindex(filenames) const searchterms = ["forehead", "of", "hand", "a", "foot"] wordsearch(didx, searchterms) </syntaxhighlight> =={{header\|Kotlin}}== <syntaxhighlight lang="scala">// version 1.1.51 import java.io.File val invIndex = mutableMapOf<String, MutableList<Location>>() val fileNames = mutableListOf<String>() val splitter = Regex("""\W+""") class Location(val fileName: String, val wordNum: Int) { override fun toString() = "{$fileName, word number $wordNum}" } fun indexFile(fileName: String) { if (fileName in fileNames) { println("'$fileName' already indexed") return } fileNames.add(fileName) File(fileName).forEachLine { line -> for ((i, w) in line.toLowerCase().split(splitter).withIndex()) { var locations = invIndex[w] if (locations == null) { locations = mutableListOf<Location>() invIndex.put(w, locations) } locations.add(Location(fileName, i + 1)) } } println("'$fileName' has been indexed") } fun findWord(word: String) { val w = word.toLowerCase() val locations = invIndex[w] if (locations != null) { println("\n'$word' found in the following locations:") println(locations.map { " $it" }.joinToString("\n")) } else println("\n'$word' not found") println() } fun main(args: Array<String>) { // files to be indexed entered as command line arguments if (args.size == 0) { println("No file names have been supplied") return } for (arg in args) indexFile(arg) println() println("Enter word(s) to be searched for in these files or 'q' to quit") while (true) { print(" ? : ") val word = readLine()!! if (word.toLowerCase() == "q") return findWord(word) } }</syntaxhighlight> Contents of files: <pre> inv1.txt -> It is what it is. inv2.txt -> What is it? inv3.txt -> It is a banana! </pre> Sample output: <pre>$ java -jar inverted_index.jar inv1.txt inv2.txt inv3.txt inv1.txt 'inv1.txt' has been indexed 'inv2.txt' has been indexed 'inv3.txt' has been indexed 'inv1.txt' already indexed Enter word(s) to be searched for in these files or 'q' to quit ? : cat 'cat' not found ? : is 'is' found in the following locations: {inv1.txt, word number 2} {inv1.txt, word number 5} {inv2.txt, word number 2} {inv3.txt, word number 2} ? : banana 'banana' found in the following locations: {inv3.txt, word number 4} ? : it 'it' found in the following locations: {inv1.txt, word number 1} {inv1.txt, word number 4} {inv2.txt, word number 3} {inv3.txt, word number 1} ? : what 'what' found in the following locations: {inv1.txt, word number 3} {inv2.txt, word number 1} ? : a 'a' found in the following locations: {inv3.txt, word number 3} ? : q </pre> =={{header\|Mathematica}}/{{header\|Wolfram Language}}== <syntaxhighlight lang="mathematica">si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"]; Manipulate[Grid[sr = TextSearch[si, ToString[s]]; {FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]], Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose, Frame -> All], {s, "tree"}]</syntaxhighlight> {{out}} Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string. =={{header\|Nim}}== We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line. <syntaxhighlight lang="nim">import os, strformat, strutils, tables type WordRef = tuple[docnum, linenum: int] InvertedIndex = object docs: seq[string] index: Table[string, seq[WordRef]] SearchResult = object total: int refs: seq[tuple[docname: string; linenums: seq[int]]] IndexingError = object of CatchableError const # Characters composing a word (letters + "'"). WordChars = Letters + {'\''} # Words to ignore. StopWords = ["about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"] template plural(n: int): string = (if n > 1: "s" else: "") proc add(invIndex: var InvertedIndex; docPath: string) = ## Add a document to the index. let docName = docPath.extractFilename() if docName in invIndex.docs: raise newException(IndexingError, &"{docName}: a document with this name is alreadry indexed.") invIndex.docs.add docName let docIndex = invIndex.docs.high var linenum = 0 var count = 0 try: for line in docPath.lines: inc linenum var word = "" for ch in line: if ch in WordChars: word.add ch.toLowerAscii() else: if word.len > 1 and word notin StopWords: invIndex.index.mgetOrPut(word, newSeq[WordRef]()).add (docIndex, linenum) inc count word = "" except IOError: raise newException(IndexingError, &"{docName}: error while reading file.") if count > 0: echo &"{docName}: {count} word{plural(count)} indexed." else: raise newException(IndexingError, &"{docName}: nothing to index.") func status(invIndex: InvertedIndex): string = ## Display information about the inverted index status. let words = invIndex.index.len let docs = invIndex.docs.len &"Index contains {words} word{plural(words)} from {docs} document{plural(docs)}." proc search(invIndex: InvertedIndex; word: string): SearchResult = ## Search a word in the inverted index. ## The references are grouped by document. let refs = invIndex.index.getOrDefault(word) if refs.len == 0: return result.total = refs.len var prevdoc = "" var prevline = 0 for (docnum, linenum) in refs: let docname = invIndex.docs[docnum] if docname == prevDoc: if linenum != prevline: result.refs[^1].linenums.add(linenum) prevline = linenum else: result.refs.add((docname, @[linenum])) prevdoc = docname prevline = linenum #——————————————————————————————————————————————————————————————————————————————————————————————————— var invertedIndex: InvertedIndex if paramCount() != 1 or not paramStr(1).dirExists(): quit &"Usage: {getAppFileName().lastPathPart} directory" # Index the documents. for doc in walkFiles(paramStr(1) / ".txt"): try: invertedIndex.add(doc) except IndexingError: echo getCurrentExceptionMsg() echo invertedIndex.status() echo "" # Enter search interface. var word: string while true: try: stdout.write "Word to search? " word = stdin.readLine().toLowerAscii().strip() if word == "q": echo "Quitting" break if not word.allCharsInSet(WordChars): echo "This word contains invalid characters." continue except EOFError: echo "" echo "Quitting" break # Search word. let result = invertedIndex.search(word) if result.refs.len == 0: echo "No reference found for this word." continue # Display the references. echo &"Found {result.total} reference{plural(result.total)} to “{word}”:" for (docname, linenums) in result.refs: stdout.write &"... in “{docName}”, line{plural(linenums.len)}" for num in linenums: stdout.write ' ', num echo ""</syntaxhighlight> {{out}} Example of session. <pre>./inverted_index texts a.txt: 129 words indexed. b.txt: 183 words indexed. c.txt: 16 words indexed. Index contains 197 words from 3 documents. Word to search? inverted Found 23 references to “inverted”: ... in “a.txt”, lines 1 3 5 ... in “b.txt”, lines 1 3 5 7 ... in “c.txt”, line 1 Word to search? q Quitting</pre> =={{header\|OCaml}}== Line 514 ⟶ 2,479: unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \ inv.ml ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo <~~lang~~syntaxhighlight lang="ocaml">TYPE_CONV_PATH "Inverted_index" type files = string array with sexp Line 556 ⟶ 2,521: List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index; List.iter (fun w -> iflet ~~Hashtbl.mem~~from ~~h w~~= try Hashtbl.find h w ~~then begin~~ except Not_found -> [] ~~let from = Hashtbl.find h w in~~ in ~~Hashtbl.replace h w (i::from)~~ Hashtbl.replace h w (i::from) ~~end else~~ ~~Hashtbl.add h w [i]~~ ) words; Hashtbl.fold (fun w from acc -> (w, from) :: acc) h [] Line 598 ⟶ 2,562: \| "--index-file", in_file -> index_file in_file \| "--search-word", word -> search_word word \| _ -> usage()</~~lang~~syntaxhighlight> =={{header\|Perl}}== <~~lang~~syntaxhighlight lang="perl">use Set::Object 'set'; # given an array of files, returns the index Line 613 ⟶ 2,576: foreach my $file (@files) { open(F, "<", $file) or die "Can't read file $file: $!"; while(<F>) { s/\A\W+//; foreach my $w (map {lc} grep {length() >= 3} split /\W+/) { if ( exists($iindex{$w}) ) { { $iindex{$w}->insert($file); } else { $iindex{$w} = set($file); } } } } } close(F); } return %iindex; Line 637 ⟶ 2,600: my @words = @_; my $res = set(); foreach my $w (map {lc} @words) { $w =~ s/\W+//g; # strip non-words chars length $w < 3 and next; exists $idx{$w} or return set(); Line 655 ⟶ 2,618: # first arg is a comma-separated list of words to search for print "$_\n" foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</~~lang~~syntaxhighlight> =={{header\|Phix}}== Loads all text files in demo\rosetta\ and builds a list of filenames and a dictionary of {word,file_indexes}, before a handful of quick tests.<br> Might be better (and almost as easy) for the dictionary values to be say {total_count, {file nos}, {file counts}}. <!--<syntaxhighlight lang="phix">(notonline)--> <span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span> <span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">word_count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">function</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">ch</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\0'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">>=</span><span style="color: #000000;">#7F</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">return</span> <span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">function</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">name</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">,</span><span style="color: #000000;">name</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #008080;">if</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]>=</span><span style="color: #008000;">'a'</span> <span style="color: #000080;font-style:italic;">-- skip numbers</span> <span style="color: #008080;">and</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- not present</span> <span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">})</span> <span style="color: #000000;">word_count</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span> <span style="color: #008080;">else</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">val</span><span style="color: #0000FF;">,</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">d</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">dir</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"."</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">d</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #008000;">'d'</span><span style="color: #0000FF;">,</span><span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_ATTRIBUTES</span><span style="color: #0000FF;">])</span> <span style="color: #000080;font-style:italic;">-- skip directories</span> <span style="color: #008080;">and</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_SIZE</span><span style="color: #0000FF;">]<</span><span style="color: #000000;">1024</span><span style="color: #0000FF;"></span><span style="color: #000000;">1024</span><span style="color: #0000FF;"></span><span style="color: #000000;">1024</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- and files > 1GB</span> <span style="color: #004080;">string</span> <span style="color: #000000;">name</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_NAME</span><span style="color: #0000FF;">]</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"rb"</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">string</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">lower</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span> <span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- skip any bitmaps etc</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split_any</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\0\r\n\t !\"#$%&\'()+,-./:;<=>?@[\\]^`{\|}~"</span><span style="color: #0000FF;">)</span> <span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #008080;">function</span> <span style="color: #000000;">lookup</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #000080;font-style:italic;">-- indexes to filenames</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span> <span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">val</span> <span style="color: #008080;">else</span> <span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">to</span> <span style="color: #000000;">1</span> <span style="color: #008080;">by</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">do</span> <span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">],</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">..</span><span style="color: #000000;">j</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]]</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">return</span> <span style="color: #000000;">files</span> <span style="color: #008080;">end</span> <span style="color: #008080;">function</span> <span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">word_count</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"load_directory"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should only be this file</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use this</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently four use this</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use both</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span> <!--</syntaxhighlight>--> {{out}} Note the distributed version has been modified and is occasionally used for real, so the output will likely differ. <pre> 3365 {"Inverted_index.exw"} {"Inverted_index.exw","viewppm.exw"} {"AlmostPrime.exw","Inverted_index.exw","RockPaperScissors.exw","viewppm.exw"} {"Inverted_index.exw","viewppm.exw"} {"Inverted_index.exw","viewppm.exw"} {} </pre> =={{header\|PHP}}== <syntaxhighlight lang="php"><?php function buildInvertedIndex($filenames) { $invertedIndex = []; foreach($filenames as $filename) { $data = file_get_contents($filename); if($data === false) die('Unable to read file: ' . $filename); preg_match_all('/(\w+)/', $data, $matches, PREG_SET_ORDER); foreach($matches as $match) { $word = strtolower($match[0]); if(!array_key_exists($word, $invertedIndex)) $invertedIndex[$word] = []; if(!in_array($filename, $invertedIndex[$word], true)) $invertedIndex[$word][] = $filename; } } return $invertedIndex; } function lookupWord($invertedIndex, $word) { return array_key_exists($word, $invertedIndex) ? $invertedIndex[$word] : false; } $invertedIndex = buildInvertedIndex2(['file1.txt', 'file2.txt', 'file3.txt']); foreach(['cat', 'is', 'banana', 'it'] as $word) { $matches = lookupWord($invertedIndex, $word); if($matches !== false) { echo "Found the word \"$word\" in the following files: " . implode(', ', $matches) . "\n"; } else { echo "Unable to find the word \"$word\" in the index\n"; } }</syntaxhighlight> =={{header\|PicoLisp}}== Line 668 ⟶ 2,782: it is a banana</pre> we can read them into a binary tree in the global variable 'MyIndex' <~~lang~~syntaxhighlight ~~PicoLisp~~lang="picolisp">(off MyIndex) (use Word Line 682 ⟶ 2,796: (extract '((Word) (val (car (idx 'MyIndex Word)))) (rest) ) ) )</~~lang~~syntaxhighlight> Output: <pre>: (searchFor "what" "is" "it") Line 692 ⟶ 2,806: : (searchFor "it" "is") -> ("file3" "file2" "file1")</pre> =={{header\|PowerShell}}== {{works with\|PowerShell\|2}} <syntaxhighlight lang="powershell">function Index-File ( [string[]]$FileList ) { # Create index hashtable, as needed If ( -not $Script:WordIndex ) { $Script:WordIndex = @{} } # For each file to be indexed... ForEach ( $File in $FileList ) { # Find any previously existing entries for this file $ExistingEntries = $Script:WordIndex.Keys \| Where { $Script:WordIndex[$_] -contains $File } # For any previously existing entries # Delete them (prior to reindexing the file) ForEach ( $Key in $ExistingEntries ) { $Script:WordIndex[$Key] = @( $Script:WordIndex[$Key] \| Where { $_ -ne $File } ) } # Get the contents of the file, split on non-alphanumeric characters, and remove duplicates $Words = ( Get-Content $File ) -split '[^a-zA-Z\d]' \| Sort -Unique # For each word in the file... ForEach ( $Word in $Words ) { # If the entry for the word already exists... If ( $Script:WordIndex[$Word] ) { # Add the file name to the entry $Script:WordIndex[$Word] += $File } Else { # Create a new entry $Script:WordIndex[$Word] = @( $File ) } } } } function Find-Word ( [string]$Word ) { return $WordIndex[$Word] }</syntaxhighlight> <syntaxhighlight lang="powershell"># Populate test files @' Files full of various words. '@ \| Out-File -FilePath C:\Test\File1.txt @' Create an index of words. '@ \| Out-File -FilePath C:\Test\File2.txt @' Use the index to find the files. '@ \| Out-File -FilePath C:\Test\File3.txt</syntaxhighlight> <syntaxhighlight lang="powershell"># Index files Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</syntaxhighlight> Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt. Alternatively, one could create a more complex custom UI or GUI if desired. <syntaxhighlight lang="powershell"># Query index Find-Word files</syntaxhighlight> {{out}} <pre>C:\Test\File1.txt C:\Test\File3.txt</pre> =={{header\|Python}}== Line 698 ⟶ 2,882: First the simple inverted index from [[wp:Inverted index\|here]] together with an implementation of a search for (multiple) terms from that index. <~~lang~~syntaxhighlight lang="python">''' This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10 ''' Line 738 ⟶ 2,922: terms = ["what", "is", "it"] print('\nTerm Search for: ' + repr(terms)) pp(sorted(termsearch(terms)))</~~lang~~syntaxhighlight> '''Sample Output''' Line 766 ⟶ 2,950: <small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small> <~~lang~~syntaxhighlight lang="python">from collections import Counter Line 817 ⟶ 3,001: print(ans) ans = Counter(ans) print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</~~lang~~syntaxhighlight> '''Sample Output''' Line 837 ⟶ 3,021: The phrase is found most commonly in text: 'T0.txt'</pre> =={{header\|~~REXX~~Racket}}== <syntaxhighlight lang="racket"> ~~Note: In this algorithm, word indices start at 1.~~ #!/usr/bin/env racket ~~<lang rexx>~~ #lang racket ~~/REXX program illustrates building a simple inverted index & word find./~~ (command-line #:args (term . files) (define rindex (make-hasheq)) (for ([file files]) (call-with-input-file file (λ(in) (let loop () (define w (regexp-match #px"\\w+" in)) (when w (let ([w (bytes->string/utf-8 (car w))] [w (string->symbol (string-foldcase w))] [r (hash-ref rindex w '())]) (unless (member file r) (hash-set! rindex w (cons file r))) (loop))))))) (define res (for/list ([w (regexp-match* #px"\\w+" term)]) (list->set (hash-ref rindex (string->symbol (string-foldcase w)) '())))) (define all (set->list (apply set-intersect res))) (if (null? all) (printf "No matching files.\n") (printf "Terms found at: ~a.\n" (string-join all ", ")))) </syntaxhighlight> {{out}} <pre> $ echo "It is what it is." > F1 $ echo "What is it?" > F2 $ echo "It is a banana." > F3 $ ./search.rkt "what" F? Terms found at: F1, F2. $ ./search.rkt "a" F? Terms found at: F3. $ ./search.rkt "what a" F? No matching files. $ ./search.rkt "what is it" F? Terms found at: F1, F2. </pre> =={{header\|Raku}}== ~~@.='' /dictionary of words (so far)./~~ (formerly Perl 6) ~~!='' /a list of found words (so far)./~~ {{works with\|rakudo\|2015-09-16}} <syntaxhighlight lang="raku" line>sub MAIN (@files) { my %norm; do for @files -> $file { %norm.push: $file X=> slurp($file).lc.words; } (my %inv).push: %norm.invert.unique; while prompt("Search terms: ").words -> @words { ~~call invertI 0,'BURMA0.TXT' /read file 0 ... /~~ for @words -> $word { ~~call invertI 1,'BURMA1.TXT' / " " 1 ... /~~ say "$word => {%inv.{$word.lc}//'(not found)'}"; ~~call invertI 2,'BURMA2.TXT' / " " 2 ... /~~ } ~~call invertI 3,'BURMA3.TXT' / " " 3 ... /~~ } ~~call invertI 4,'BURMA4.TXT' / " " 4 ... /~~ }</syntaxhighlight> ~~call invertI 5,'BURMA5.TXT' / " " 5 ... /~~ ~~call invertI 6,'BURMA6.TXT' / " " 6 ... /~~ ~~call invertI 7,'BURMA7.TXT' / " " 7 ... /~~ ~~call invertI 8,'BURMA8.TXT' / " " 8 ... /~~ ~~call invertI 9,'BURMA9.TXT' / " " 9 ... /~~ =={{header\|REXX}}== ~~call findAword 'does' /find a word. /~~ Note: In this algorithm, word indices start at 1. ~~call findAword '60' /find another word. /~~ ~~call findAword "don't" /and find another word. /~~ ~~call findAword "burma-shave" /and find yet another word. /~~ ~~exit /enough of this, I'm tired. /~~ Note:   the Burma Shave signs were created from   1930 ──► 1951   and were common among the rural byways of America. To see more about Burma Shave signs, see the Wikipedia entry:   [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.] <syntaxhighlight lang="rexx">/REXX program illustrates building a simple inverted index and a method of word find./ @.= /a dictionary of words (so far). / != /a list of found words (so far). / call invertI 0, 'BURMA0.TXT' /read the file: BURMA0.TXT ··· / call invertI 1, 'BURMA1.TXT' / " " " BURMA1.TXT ··· / call invertI 2, 'BURMA2.TXT' / " " " BURMA2.TXT ··· / call invertI 3, 'BURMA3.TXT' / " " " BURMA3.TXT ··· / call invertI 4, 'BURMA4.TXT' / " " " BURMA4.TXT ··· / call invertI 5, 'BURMA5.TXT' / " " " BURMA5.TXT ··· / call invertI 6, 'BURMA6.TXT' / " " " BURMA6.TXT ··· / call invertI 7, 'BURMA7.TXT' / " " " BURMA7.TXT ··· / call invertI 8, 'BURMA8.TXT' / " " " BURMA8.TXT ··· / call invertI 9, 'BURMA9.TXT' / " " " BURMA9.TXT ··· / call findAword "huz" /find a word. / call findAword "60" /find another word. / call findAword "don't" /and find another word. / call findAword "burma-shave" /and find yet another word. / exit /stick a fork in it, we're all done. / /──────────────────────────────────────────────────────────────────────────────────────/ findAword: procedure expose @.; arg x /get an uppercase version of the X arg/ parse arg ox /get original (as-is) value of X arg./ _=@.x; oxo='───'ox"───" if _=='' then do say 'word' oxo "not found." return 0 end _@=_ /save _ text, pass it back to invoker./ say 'word' oxo "found in:" do until _==''; parse var _ f w _ say ' file='f " word="w end /until ··· / return _@ /──────────────────────────────────────────────────────────────────────────────────────/ invertI: procedure expose @. !; parse arg #,fn /the file number and the filename. / call lineout fn /close the file, ··· just in case. / w=0 /the number of words found (so far). / do while lines(fn)\==0 / [↓] process the entire file. / _=space( linein(fn) ) /read a line, elide superfluous blanks/ if _=='' then iterate /if a blank record, then ignore it. / say 'file' #", record:" _ /display the record ──► terminal. / do until _=='' /pick off words from record until done/ ~~/─────────────────────────────────────FINDAWORD subroutine─────────────/~~ ~~findAword:~~ ~~procedure~~ ~~expose~~ @. parse upper var _ ? _ /~~get~~pick Aoff a word, ~~and uppercase~~ (it.'s in uppercase)./ ~~parse~~ ~~arg~~ ~~ox;~~ ~~arg~~ x ?=stripper(?) /~~OX=~~strip any trailing punctuation. ~~word;~~ X= ~~uppercase~~ ~~version~~/ if ?='' then iterate /is the word now all blank (or null)? / ~~_=@.x~~ w=w+1 /bump the word counter (index). / ~~oxo='───'ox"───"~~ @.?=@.? # w /append the new word to a list. / if wordpos(?,!)==0 then !=! ? /add it to the list of words found. / end /until ··· / end /while ··· / say; call lineout fn /close the file, just to be neat&safe./ return w /return the index of word in record. / /──────────────────────────────────────────────────────────────────────────────────────/ stripper: procedure; parse arg q /remove punctuation at the end of word/ @punctuation= '.,:;?¿!¡∙·'; do j=1 for length(@punctuation) q=strip(q, 'T', substr(@punctuation, j, 1) ) end /j/ return q</syntaxhighlight> {{out\|output\|text=  when using the default inputs:}} Output note:   The Burma-Shave company was shocked at the number of automobile fenders mailed to the company. ~~if _=='' then do~~ ~~say 'word' oxo "not found."~~ ~~return 0~~ ~~end~~ All fender senders received a half-pound jar of Burma-Shave. ~~_@=_ /save _, pass it back to invoker/~~ <pre style="height:87ex"> ~~say 'word' oxo "found in:"~~ file 0, record: Rip a fender file 0, record: Off your Car file 0, record: Send it in file 0, record: For a half-pound jar file 0, record: Burma-Shave file 1, record: A peach ~~do until _==''~~ file 1, record: Looks good ~~parse var _ f w _~~ file 1, record: With lots of fuzz ~~say ' file='f ' word='w~~ file 1, record: Man's no peach ~~end~~ file 1, record: And never was file 1, record: Burma-Shave file 2, record: Does your husband ~~return _@~~ file 2, record: Misbehave file 2, record: Grunt and grumble file 2, record: Rant and rave ? file 2, record: Shoot the brute some file 2, record: Burma-Shave file 3, record: Don't take a curve file 3, record: At 60 per file 3, record: We hate to lose file 3, record: A customer file 3, record: Burma-Shave file 4, record: Every shaver ~~/─────────────────────────────────────INVERTI subroutine───────────────/~~ file 4, record: Now can snore ~~invertI: procedure expose @. !; parse arg #,fn /file#, filename/~~ file 4, record: Six more minutes ~~call lineout fn /close the file, just in case. /~~ file 4, record: Than before ~~w=0 /number of words so far. /~~ file 4, record: By using file 4, record: Burma-Shave file 5, record: He played ~~do while lines(fn)\==0 /read the entire file. /~~ file 5, record: a sax ~~_=linein(fn) /read the file, 1 line at a time/~~ file 5, record: Had no B.O. ~~_=space(_); if _='' then iterate /if blank record, then ignore it/~~ file 5, record: But his whiskers scratched file 5, record: So she let him go file 5, record: Burma-Shave ~~say '~~file' #"6,~~record="_ /echo a~~ record~~, just~~: toHenry bethe ~~verbose./~~Eighth file 6, record: Prince of Friskers file 6, record: Lost five wives file 6, record: But kept his whiskers file 6, record: Burma-Shave file 7, record: Listen birds ~~upper _ /make it case insensative. /~~ file 7, record: These signs cost file 7, record: Money file 7, record: So roost a while file 7, record: But don't get funny file 7, record: Burma-Shave file 8, record: My man ~~do until _=='' /pick off words until done. /~~ file 8, record: Won't shave ~~parse var _ xxx _ /pick off a word. /~~ file 8, record: Sez Hazel Huz ~~xxx=stripper(xxx) /go and strip off ending punct. /~~ file 8, record: But I should worry ~~if xxx='' then iterate /is the word now blank (null) ? /~~ file 8, record: Dora's does ~~w=w+1 /bump the word counter. /~~ file 8, record: Burma-Shave ~~@.xxx=@.xxx # w~~ ~~if wordpos(xxx,!)==0 then !=! xxx /add to THE list of words found./~~ ~~end~~ file 9, record: Past ~~end~~ file 9, record: Schoolhouses file 9, record: Take it slow file 9, record: Let the little file 9, record: Shavers grow file 9, record: Burma-Shave word ───huz─── found in: ~~call lineout fn /close the file, just to be neat/~~ file=8 word=7 ~~return w~~ word ───60─── found in: file=3 word=6 word ───don't─── found in: ~~/─────────────────────────────────────STRIPPER subroutine──────────────/~~ file=3 word=1 ~~stripper: procedure; parse arg q /remove punctuation at word-end./~~ file=7 word=12 ~~@punctuation='.,:;?¿!¡' /serveral punctuation marks. /~~ word ───burma-shave─── found in: file=0 word=14 ~~do j=1 for length(@punctuation)~~ file=1 word=15 ~~q=strip(q,'T',substr(@punctuation,j,1))~~ file=2 word=15 ~~end~~ file=3 word=14 file=4 word=13 ~~return q~~ file=5 word=17 ~~</lang>~~ file=6 word=14 ~~Output:~~ file=7 word=15 ~~<pre style="height:30ex;overflow:scroll">~~ file=8 word=14 ~~file 0,record=Rip a fender~~ file=9 word=11 ~~file 0,record=off your car~~ ~~file 0,record=send it in~~ ~~file 0,record=for a half-pound jar~~ ~~file 0,record=Burma-shave~~ ~~file 1,record=A peach~~ ~~file 1,record=looks good~~ ~~file 1,record=with lots of fuzz~~ ~~file 1,record=but a man's no peach~~ ~~file 1,record=and never was~~ ~~file 1,record=Burma-shave~~ ~~file 2,record=Does your husband~~ ~~file 2,record=misbehave~~ ~~file 2,record=grunt and grumble~~ ~~file 2,record=rant and rave ?~~ ~~file 2,record=shoot the brute some~~ ~~file 2,record=Burma-shave~~ ~~file 3,record=Don't take a curve~~ ~~file 3,record=at 60 per~~ ~~file 3,record=we hate to lose~~ ~~file 3,record=a customer~~ ~~file 3,record=Burma-shave~~ ~~file 4,record=Every shaver~~ ~~file 4,record=now can snore~~ ~~file 4,record=six more minutes~~ ~~file 4,record=than before~~ ~~file 4,record=by using~~ ~~file 4,record=Burma-shave~~ ~~file 5,record=He played~~ ~~file 5,record=a sax~~ ~~file 5,record=had no B.O.~~ ~~file 5,record=but his whiskers scratched~~ ~~file 5,record=so they let him go~~ ~~file 5,record=Burma-shave~~ ~~file 6,record=Henry the Eighth~~ ~~file 6,record=Prince of Friskers~~ ~~file 6,record=lost five wives~~ ~~file 6,record=but kept his whiskers~~ ~~file 6,record=Burma-shave~~ ~~file 7,record=Listen, birds~~ ~~file 7,record=those signs cost~~ ~~file 7,record=money~~ ~~file 7,record=so roost a while but~~ ~~file 7,record=don't get funny~~ ~~file 7,record=Burma-shave~~ ~~file 8,record=My man~~ ~~file 8,record=won't shave~~ ~~file 8,record=sez Hazel Huz~~ ~~file 8,record=but I should worry~~ ~~file 8,record=Dora's does~~ ~~file 8,record=Burma-shave~~ ~~file 9,record=Past schoolhouses~~ ~~file 9,record=take it slow~~ ~~file 9,record=let the little~~ ~~file 9,record=shavers~~ ~~file 9,record=grow~~ ~~file 9,record=Burma-shave~~ ~~word ---does--- found in:~~ ~~file=2 word=1~~ ~~file=8 word=13~~ ~~word ---60--- found in:~~ ~~file=3 word=6~~ ~~word ---don't--- found in:~~ ~~file=3 word=1~~ ~~file=7 word=12~~ ~~word ---burma-shave--- found in:~~ ~~file=0 word=14~~ ~~file=1 word=17~~ ~~file=2 word=15~~ ~~file=3 word=14~~ ~~file=4 word=13~~ ~~file=5 word=17~~ ~~file=6 word=14~~ ~~file=7 word=15~~ ~~file=8 word=14~~ ~~file=9 word=11~~ </pre> Line 1,009 ⟶ 3,240: '''indexmerge.rb''' <~~lang~~syntaxhighlight lang="ruby">if File.exist? "index.dat" @data = Marshal.load open("index.dat") else Line 1,037 ⟶ 3,268: open("index.dat", "w") do \|index\| index.write Marshal.dump(@data) end</~~lang~~syntaxhighlight> '''indexsearch.rb''' <~~lang~~syntaxhighlight lang="ruby">if File.exist? "index.dat" @data = Marshal.load open("index.dat") else Line 1,061 ⟶ 3,292: end p @result</~~lang~~syntaxhighlight> '''Output''' Line 1,072 ⟶ 3,303: > ./indexsearch.rb It iS\! ["file1", "file2", "file3"]</pre> =={{header\|Rust}}== <syntaxhighlight lang="rust">// Part 1: Inverted index structure use std::{ borrow::Borrow, collections::{BTreeMap, BTreeSet}, }; #[derive(Debug, Default)] pub struct InvertedIndex<T> { indexed: BTreeMap<String, BTreeSet<usize>>, sources: Vec<T>, } impl<T> InvertedIndex<T> { pub fn add<I, V>(&mut self, source: T, tokens: I) where I: IntoIterator<Item = V>, V: Into<String>, { let source_id = self.sources.len(); self.sources.push(source); for token in tokens { self.indexed .entry(token.into()) .or_insert_with(BTreeSet::new) .insert(source_id); } } pub fn search<'a, I, K>(&self, tokens: I) -> impl Iterator<Item = &T> where String: Borrow<K>, K: Ord + ?Sized + 'a, I: IntoIterator<Item = &'a K>, { let mut tokens = tokens.into_iter(); tokens .next() .and_then(\|token\| self.indexed.get(token).cloned()) .and_then(\|first\| { tokens.try_fold(first, \|found, token\| { self.indexed .get(token) .map(\|sources\| { found .intersection(sources) .cloned() .collect::<BTreeSet<_>>() }) .filter(\|update\| !update.is_empty()) }) }) .unwrap_or_else(BTreeSet::new) .into_iter() .map(move \|source\| &self.sources[source]) } pub fn tokens(&self) -> impl Iterator<Item = &str> { self.indexed.keys().map(\|it\| it.as_str()) } pub fn sources(&self) -> &[T] { &self.sources } } // Part 2: File walking and processing use std::{ ffi::OsString, fmt::{Debug, Display}, fs::{read_dir, DirEntry, File, ReadDir}, io::{self, stdin, Read}, path::{Path, PathBuf}, }; #[derive(Debug)] pub struct Files { dirs: Vec<ReadDir>, } impl Files { pub fn walk<P: AsRef<Path>>(path: P) -> io::Result<Self> { Ok(Files { dirs: vec![read_dir(path)?], }) } } impl Iterator for Files { type Item = DirEntry; fn next(&mut self) -> Option<Self::Item> { 'outer: while let Some(mut current) = self.dirs.pop() { while let Some(entry) = current.next() { if let Ok(entry) = entry { let path = entry.path(); if !path.is_dir() { self.dirs.push(current); return Some(entry); } else if let Ok(dir) = read_dir(path) { self.dirs.push(current); self.dirs.push(dir); continue 'outer; } } } } None // No directory left } } fn tokenize<'a>(input: &'a str) -> impl Iterator<Item = String> + 'a { input .split(\|c: char\| !c.is_alphanumeric()) .filter(\|token\| !token.is_empty()) .map(\|token\| token.to_lowercase()) } fn tokenize_file<P: AsRef<Path>>(path: P) -> io::Result<BTreeSet<String>> { let mut buffer = Vec::new(); File::open(path)?.read_to_end(&mut buffer)?; let text = String::from_utf8_lossy(&buffer); Ok(tokenize(&text).collect::<BTreeSet<_>>()) } fn tokenize_query(input: &str) -> Vec<String> { let result = tokenize(input).collect::<BTreeSet<_>>(); // Make a vector sorted by length, so that longer tokens are processed first. // This heuristics should narrow the resulting set faster. let mut result = result.into_iter().collect::<Vec<_>>(); result.sort_by_key(\|item\| usize::MAX - item.len()); result } // Part 3: Interactive application fn args() -> io::Result<(OsString, BTreeSet<OsString>)> { let mut args = std::env::args_os().skip(1); // Skip the executable's name let path = args .next() .ok_or_else(\|\| io::Error::new(io::ErrorKind::Other, "missing path"))?; let extensions = args.collect::<BTreeSet<_>>(); Ok((path, extensions)) } fn print_hits<'a, T>(hits: impl Iterator<Item = T>) where T: Display, { let mut found_none = true; for (number, hit) in hits.enumerate() { println!(" [{}] {}", number + 1, hit); found_none = false; } if found_none { println!("(none)") } } fn main() -> io::Result<()> { let (path, extensions) = args()?; let mut files = InvertedIndex::<PathBuf>::default(); let mut content = InvertedIndex::<PathBuf>::default(); println!( "Indexing {:?} files in '{}'", extensions, path.to_string_lossy() ); for path in Files::walk(path)?.map(\|file\| file.path()).filter(\|path\| { path.extension() .filter(\|&ext\| extensions.is_empty() \|\| extensions.contains(ext)) .is_some() }) { files.add(path.clone(), tokenize(&path.to_string_lossy())); match tokenize_file(&path) { Ok(tokens) => content.add(path, tokens), Err(e) => eprintln!("Skipping a file {}: {}", path.display(), e), } } println!( "Indexed {} tokens in {} files.", content.tokens().count(), content.sources.len() ); // Run the query UI loop let mut query = String::new(); loop { query.clear(); println!("Enter search query:"); if stdin().read_line(&mut query).is_err() \|\| query.trim().is_empty() { break; } match query.trim() { "/exit" \| "/quit" \| "" => break, "/tokens" => { println!("Tokens:"); for token in content.tokens() { println!("{}", token); } println!(); } "/files" => { println!("Sources:"); for source in content.sources() { println!("{}", source.display()); } println!(); } _ => { let query = tokenize_query(&query); println!(); println!("Found hits:"); print_hits(content.search(query.iter()).map(\|it\| it.display())); println!("Found file names:"); print_hits(files.search(query.iter()).map(\|it\| it.display())); println!(); } } } Ok(()) }</syntaxhighlight> {{out}} <pre>C:\Temp>inverted_index books html htm txt Indexing {"htm", "html", "txt"} files in 'books' Indexed 34810 tokens in 9 files. Enter search query: war Found hits: [1] books\EN\C\Chaucer, Geoffrey - Canterbury Tales.txt [2] books\EN\H\Homer - The Odyssey.txt [3] books\EN\O\Orwell, George - 1984.html [4] books\EN\W\Wells, Herbert George - The Invisible Man.txt [5] books\EN\W\Wells, Herbert George - The Island of Doctor Moreau.txt [6] books\EN\W\Wells, Herbert George - The Time Machine.txt [7] books\EN\W\Wells, Herbert George - War of the Worlds.txt Found file names: [1] books\EN\W\Wells, Herbert George - War of the Worlds.txt Enter search query: war gun Found hits: [1] books\EN\O\Orwell, George - 1984.html [2] books\EN\W\Wells, Herbert George - War of the Worlds.txt Found file names: (none) Enter search query: /exit C:\Temp></pre> =={{header\|Scala}}== <syntaxhighlight lang="scala">object InvertedIndex extends App { import java.io.File // indexer val WORD = raw"(\w+)".r def parse(s: String) = WORD.findAllIn(s).map(_.toString.toLowerCase) def invertedIndex(files: Seq[File]): Map[String,Set[File]] = { var i = Map[String,Set[File]]() withDefaultValue Set.empty files.foreach{f => scala.io.Source.fromFile(f).getLines flatMap parse foreach (w => i = i + (w -> (i(w) + f)))} i } // user interface args match { case _ if args.length < 2 => println("Usage: InvertedIndex ALLSEARCHWORDS FILENAME...") case Array(searchwords, filenames @ _) => val queries = parse(searchwords).toList val files = filenames.map(new File(_)).filter{f => if (!f.exists) println(s"Ignoring $f"); f.exists} (queries, files) match { case (q, _) if q.isEmpty => println("Missing search words") case (_, f) if f.isEmpty => println("Missing extant files") case _ => val index = invertedIndex(files) println(s"""Searching for ${queries map ("\""+_+"\"") mkString " and "} in ${files.size} files:""") queries.map(index).foldLeft(files.toSet)(_ intersect _) match { case m if m.isEmpty => println("No matching files") case m => println(m mkString "\n") } } } }</syntaxhighlight> {{out}} <pre>> InvertedIndex "the" file1.txt file2.txt file3.txt Searching for "the" in 3 files: data/file1.txt data/file2.txt data/file3.txt > InvertedIndex "the cat sat" file1.txt file2.txt file3.txt Searching for "the" and "cat" and "sat" in 3 files: file1.txt file2.txt > InvertedIndex fox file1.txt file2.txt file3.txt Searching for "fox" in 3 files: file3.txt > InvertedIndex abc file1.txt file2.txt file3.txt Searching for "abc" in 3 files: No matching files</pre> =={{header\|Tcl}}== <~~lang~~syntaxhighlight lang="tcl">package require Tcl 8.5 proc wordsInString str { # We define "words" to be "maximal sequences of 'word' characters". Line 1,092 ⟶ 3,650: array set localidx {} foreach word [wordsInString $data] { lappend localidx($word) $i incr i } # Transcribe into global index foreach {word places} [array get localidx] { dict set index($word) $filename $places } } Line 1,106 ⟶ 3,664: global index if {[info exists index($word)]} { return [dict keys $index($word)] } } Line 1,114 ⟶ 3,672: set files [findFilesForWord [lindex $words 0]] foreach w [lrange $words 1 end] { set wf [findFilesForWord $w] set newfiles {} foreach f $files { if {$f in $wf} {lappend newfiles $f} } } set files $newfiles } return $files Line 1,129 ⟶ 3,687: set files {} foreach w $words { if {![info exist index($w)]} { return } } } dict for {file places} $index([lindex $words 0]) { if {$file in $files} continue foreach start $places { set gotStart 1 foreach w [lrange $words 1 end] { incr start set gotNext 0 foreach {f ps} $index($w) { if {$f ne $file} continue foreach p $ps { if {$p == $start} { set gotNext 1 break } } } if {$gotNext} break } } if {!$gotNext} { set gotStart 0 break } } } if {$gotStart} { lappend files $file break } } } } return $files }</~~lang~~syntaxhighlight> For the GUI: <~~lang~~syntaxhighlight lang="tcl">package require Tk pack [labelframe .files -text Files] -side left -fill y pack [listbox .files.list -listvariable files] Line 1,172 ⟶ 3,730: pack [entry .found.entry -textvariable terms] -fill x pack [button .found.findAll -command FindAll \ -text "Find File with All"] -side left pack [button .found.findSeq -command FindSeq \ -text "Find File with Sequence"] -side right # The actions invoked by various GUI buttons Line 1,181 ⟶ 3,739: set f [tk_getOpenFile] if {$f ne ""} { addDocumentToIndex $f lappend files $f } } Line 1,190 ⟶ 3,748: set fs [findFilesWithAllWords $words] lappend found "Searching for files with all $terms" {}$fs \ "---------------------" } proc FindSeq {} { Line 1,197 ⟶ 3,755: set fs [findFilesWithWordSequence $words] lappend found "Searching for files with \"$terms\"" {}$fs \ "---------------------" }</~~lang~~syntaxhighlight> =={{header\|TUSCRIPT}}== <~~lang~~syntaxhighlight lang="tuscript"> $$ MODE TUSCRIPT Line 1,232 ⟶ 3,790: ENDLOOP PRINT "-> ",files </syntaxhighlight> ~~</lang>~~ Output: <pre> Line 1,249 ⟶ 3,807: {{works with\|ksh93}} <~~lang~~syntaxhighlight lang="bash">#!/bin/ksh typeset -A INDEX Line 1,273 ⟶ 3,831: (( count == $# )) && echo $file done }</~~lang~~syntaxhighlight> Example use: <~~lang~~syntaxhighlight lang="korn">index .txt search hello world </syntaxhighlight> ~~</lang>~~ ===Directory on filesystem=== Line 1,288 ⟶ 3,846: Add note about slowness. <~~lang~~syntaxhighlight lang="bash">#!/bin/sh # index.sh - create an inverted index Line 1,297 ⟶ 3,855: # the record separator for $INDEX/all.tab). for file in "$@"; do # Use printf(1), not echo, because "$file" might start with # a hyphen and become an option to echo. test 0 -eq $(printf %s "$file" \| wc -l) \|\| { printf '%s\n' "$file: newline in filename" >&2 exit 1 } } done Line 1,311 ⟶ 3,869: fi=1 for file in "$@"; do printf %s "Indexing $file." >&2 # all.tab maps $fi => $file echo "$fi $file" >> "$INDEX/all.tab" # Use punctuation ([:punct:]) and whitespace (IFS) # to split tokens. ti=1 tr -s '[:punct:]' ' ' < "$file" \| while read line; do for token in $line; do # Index token by position ($fi, $ti). Ignore # error from mkdir(1) if directory exists. mkdir "$INDEX/$token" 2>/dev/null echo $ti >> "$INDEX/$token/$fi" : $((ti += 1)) # Show progress. Print a dot per 1000 tokens. case "$ti" in 000) printf . esac done done echo >&2 : $((fi += 1)) done</~~lang~~syntaxhighlight> <~~lang~~syntaxhighlight lang="bash">#!/bin/sh # search.sh - search an inverted index Line 1,346 ⟶ 3,904: want=sequence while getopts aos name; do case "$name" in a) want=all;; o) want=one;; s) want=sequence;; ) exit 2;; esac done shift $((OPTIND - 1)) all() { echo "TODO" exit 2 } one() { echo "TODO" exit 2 } sequence() { echo "TODO" exit 2 } $want "$@"</~~lang~~syntaxhighlight> =={{header\|Wren}}== {{trans\|Kotlin}} {{libheader\|Wren-ioutil}} {{libheader\|Wren-pattern}} {{libheader\|Wren-str}} <syntaxhighlight lang="wren">import "./ioutil" for FileUtil, Input import "./pattern" for Pattern import "./str" for Str import "os" for Process var invIndex = {} var fileNames = [] var splitter = Pattern.new("+1/W") class Location { construct new(fileName, wordNum) { _fileName = fileName _wordNum = wordNum } toString { "%(_fileName), word number %(_wordNum)" } } var indexFile = Fn.new { \|fileName\| if (fileNames.contains(fileName)) { System.print("'%(fileName)' already indexed") return } fileNames.add(fileName) var lines = FileUtil.readLines(fileName) lines.each { \|line\| line = Str.lower(line) var i = 0 for (w in splitter.splitAll(line)) { var locations = invIndex[w] if (!locations) { locations = [] invIndex[w] = locations } locations.add(Location.new(fileName, i + 1)) i = i + 1 } } System.print("'%(fileName)' has been indexed") } var findWord = Fn.new { \|word\| var w = Str.lower(word) var locations = invIndex[w] if (locations) { System.print("\n'%(word)' found in the following locations:") System.print(locations.map { \|l\| " %(l)" }.join("\n")) } else { System.print("\n'%(word)' not found") } System.print() } // files to be indexed entered as command line arguments var args = Process.arguments if (args.count == 0) { System.print("No file names have been supplied") return } for (arg in args) indexFile.call(arg) System.print("\nEnter word(s) to be searched for in these files or 'q' to quit") while (true) { var word = Input.text(" ? : ", 1) if (word == "q" \|\| word == "Q") return findWord.call(word) }</syntaxhighlight> {{out}} Uses the same files as the Kotlin example: <pre> $ wren-cli inverted_index.wren inv1.txt inv2.txt inv3.txt inv1.txt 'inv1.txt' has been indexed 'inv2.txt' has been indexed 'inv3.txt' has been indexed 'inv1.txt' already indexed Enter word(s) to be searched for in these files or 'q' to quit ? : cat 'cat' not found ? : is 'is' found in the following locations: inv1.txt, word number 2 inv1.txt, word number 5 inv2.txt, word number 2 inv3.txt, word number 2 ? : banana 'banana' found in the following locations: inv3.txt, word number 4 ? : it 'it' found in the following locations: inv1.txt, word number 1 inv1.txt, word number 4 inv2.txt, word number 3 inv3.txt, word number 1 ? : what 'what' found in the following locations: inv1.txt, word number 3 inv2.txt, word number 1 ? : a 'a' found in the following locations: inv3.txt, word number 3 ? : q </pre>