Inverted index: Difference between revisions

From Rosetta Code
Content added Content deleted
m (→‎{{header|Wren}}: Minor tidy)
 
(43 intermediate revisions by 25 users not shown)
Line 1: Line 1:
{{task|Classic CS problems and programs}}[[Category:Search]]
{{task|Classic CS problems and programs}}[[Category:Search]]

An [[wp:Inverted_index|Inverted Index]] is a data structure used to create full text search.
An [[wp:Inverted_index|Inverted Index]] is a data structure used to create full text search.



Given a set of text files, implement a program to create an inverted index. Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms. The search index can be in memory.
;Task:
Given a set of text files, implement a program to create an inverted index.

Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms.

The search index can be in memory.
<br><br>

=={{header|11l}}==
{{trans|D}}

<syntaxhighlight lang="11l">DefaultDict[String, Set[String]] index

F parse_file(fname, fcontents)
L(word) fcontents.split(re:‘\W’)
:index[word.lowercase()].add(fname)

L(fname, fcontents) [(‘inv1.txt’, ‘It is what it is.’),
(‘inv2.txt’, ‘What is it?’),
(‘inv3.txt’, ‘It is a banana!’)]
parse_file(fname, fcontents)

L(w) [‘cat’, ‘is’, ‘banana’, ‘it’, ‘what’]
print("\nEnter a word to search for: (q to quit): "w)
I w C index
print(‘'#.' found in #..’.format(w, sorted(Array(index[w]))))
E
print(‘'#.' not found.’.format(w))</syntaxhighlight>

{{out}}
<pre>

Enter a word to search for: (q to quit): cat
'cat' not found.

Enter a word to search for: (q to quit): is
'is' found in [inv1.txt, inv2.txt, inv3.txt].

Enter a word to search for: (q to quit): banana
'banana' found in [inv3.txt].

Enter a word to search for: (q to quit): it
'it' found in [inv1.txt, inv2.txt, inv3.txt].

Enter a word to search for: (q to quit): what
'what' found in [inv1.txt, inv2.txt].
</pre>


=={{header|Ada}}==
=={{header|Ada}}==
Line 10: Line 58:
Here is the main program (file inverted_index.adb):
Here is the main program (file inverted_index.adb):


<lang Ada>with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines;
<syntaxhighlight lang="ada">with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines;
use Ada.Text_IO;
use Ada.Text_IO;


Line 123: Line 171:
end;
end;
end loop;
end loop;
end Inverted_Index;</lang>
end Inverted_Index;</syntaxhighlight>


A sample output:
A sample output:
Line 130: Line 178:


Enter one or more words to search for; <return> to finish:
Enter one or more words to search for; <return> to finish:
it
it
Found in the following files: 0.txt, 1.txt, 2.txt
Found in the following files: 0.txt, 1.txt, 2.txt


Line 148: Line 196:
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads):
The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads):


<lang Ada>with Ada.Containers.Indefinite_Vectors;
<syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
private with Ada.Containers.Indefinite_Hashed_Maps;


Line 210: Line 258:
type Storage_Type is new Maps.Map with null record;
type Storage_Type is new Maps.Map with null record;


end Generic_Inverted_Index;</lang>
end Generic_Inverted_Index;</syntaxhighlight>


Here is the implementation (generic_inverted_index.adb):
Here is the implementation (generic_inverted_index.adb):


<lang Ada>package body Generic_Inverted_Index is
<syntaxhighlight lang="ada">package body Generic_Inverted_Index is


use Source_Vecs;
use Source_Vecs;
Line 308: Line 356:
end Same_Vector;
end Same_Vector;


end Generic_Inverted_Index;</lang>
end Generic_Inverted_Index;</syntaxhighlight>


===Package Parse_Lines===
===Package Parse_Lines===
Line 314: Line 362:
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads):
The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the [[Regular expressions#Ada|Ada implementation of the regular expressions task]]. Here is the spec (parse_lines.ads):


<lang Ada>with Gnat.Regpat;
<syntaxhighlight lang="ada">with Gnat.Regpat;


package Parse_Lines is
package Parse_Lines is
Line 333: Line 381:
procedure Iterate_Words(S: String);
procedure Iterate_Words(S: String);


end Parse_Lines;</lang>
end Parse_Lines;</syntaxhighlight>


And here is the implementation (parse_lines.adb):
And here is the implementation (parse_lines.adb):


<lang Ada>with Gnat.Regpat;
<syntaxhighlight lang="ada">with Gnat.Regpat;


package body Parse_Lines is
package body Parse_Lines is
Line 378: Line 426:
end Iterate_Words;
end Iterate_Words;


end Parse_Lines;</lang>
end Parse_Lines;</syntaxhighlight>


===Alternative Implementation of Generic_Inverted_Index (Ada 2012)===
===Alternative Implementation of Generic_Inverted_Index (Ada 2012)===
Line 384: Line 432:
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index:
The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index:


<lang Ada>with Ada.Containers.Indefinite_Vectors;
<syntaxhighlight lang="ada">with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;
private with Ada.Containers.Indefinite_Hashed_Maps;


Line 439: Line 487:
type Storage_Type is new Maps.Map with null record;
type Storage_Type is new Maps.Map with null record;


end Generic_Inverted_Index;</lang>
end Generic_Inverted_Index;</syntaxhighlight>


The implementation:
The implementation:


<lang Ada>package body Generic_Inverted_Index is
<syntaxhighlight lang="ada">package body Generic_Inverted_Index is
-- uses some of the new Ada 2012 syntax
-- uses some of the new Ada 2012 syntax
use Source_Vecs;
use Source_Vecs;
Line 524: Line 572:
end Same_Vector;
end Same_Vector;


end Generic_Inverted_Index;</lang>
end Generic_Inverted_Index;</syntaxhighlight>


=={{header|AutoHotkey}}==
=={{header|AutoHotkey}}==
{{works with|AutoHotkey_L}}
{{works with|AutoHotkey_L}}


<lang AutoHotkey>; http://www.autohotkey.com/forum/viewtopic.php?t=41479
<syntaxhighlight lang="autohotkey">; http://www.autohotkey.com/forum/viewtopic.php?t=41479
inputbox, files, files, file pattern such as c:\files\*.txt
inputbox, files, files, file pattern such as c:\files\*.txt


Line 537: Line 585:
Loop, %files%, 0,1
Loop, %files%, 0,1
{
{
tooltip,%A_index% / 500
tooltip,%A_index% / 500

wordList := WordsIn(A_LoopFileFullPath)
wordList := WordsIn(A_LoopFileFullPath)
InvertedIndex(wordList, A_loopFileFullpath)
InvertedIndex(wordList, A_loopFileFullpath)
}
}


Line 558: Line 606:


WordsIn(docpath)
WordsIn(docpath)
{
{
FileRead, content, %docpath%
FileRead, content, %docpath%
spos = 1
spos = 1
Line 568: Line 616:
this_wordList .= match "`n"
this_wordList .= match "`n"
}
}

Sort, this_wordList, U
Sort, this_wordList, U
return this_wordList
return this_wordList
}
}


Line 577: Line 625:
global word2docs
global word2docs


loop, parse, words, `n,`r
loop, parse, words, `n,`r
{
{
if A_loopField =
if A_loopField =
continue
continue
Line 593: Line 641:
else
else
return word2docs[word2find]
return word2docs[word2find]
}</lang>
}</syntaxhighlight>


=={{header|BBC BASIC}}==
=={{header|BBC BASIC}}==
{{works with|BBC BASIC for Windows}}
{{works with|BBC BASIC for Windows}}
This uses a hashed index and linked lists to hold the file numbers.
This uses a hashed index and linked lists to hold the file numbers.
<lang bbcbasic> DIM FileList$(4)
<syntaxhighlight lang="bbcbasic"> DIM FileList$(4)
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \
FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \
\ "BBCKEY3.TXT", "BBCKEY4.TXT"
\ "BBCKEY3.TXT", "BBCKEY4.TXT"

DictSize% = 30000
DictSize% = 30000
DIM Index{(DictSize%-1) word$, link%}
DIM Index{(DictSize%-1) word$, link%}

REM Build inverted index:
REM Build inverted index:
FOR file% = DIM(FileList$(),1) TO 0 STEP -1
FOR file% = DIM(FileList$(),1) TO 0 STEP -1
Line 610: Line 658:
F% = OPENIN(filename$)
F% = OPENIN(filename$)
IF F% = 0 ERROR 100, "Failed to open file"
IF F% = 0 ERROR 100, "Failed to open file"

WHILE NOT EOF#F%
WHILE NOT EOF#F%
REPEAT C%=BGET#F% : UNTIL C%>64 OR EOF#F% : word$ = CHR$(C%)
REPEAT C%=BGET#F% : UNTIL C%>64 OR EOF#F% : word$ = CHR$(C%)
REPEAT C%=BGET#F% : word$ += CHR$(C%) : UNTIL C%<65
REPEAT C%=BGET#F% : word$ += CHR$(C%) : UNTIL C%<65
word$ = FNlower(LEFT$(word$))
word$ = FNlower(LEFT$(word$))

hash% = FNhash(word$)
hash% = FNhash(word$)
WHILE Index{(hash%)}.word$<>"" AND Index{(hash%)}.word$<>word$
WHILE Index{(hash%)}.word$<>"" AND Index{(hash%)}.word$<>word$
Line 628: Line 676:
ENDIF
ENDIF
ENDWHILE
ENDWHILE

CLOSE #F%
CLOSE #F%
NEXT file%
NEXT file%

REM Now query the index:
REM Now query the index:
PRINT FNquery("random")
PRINT FNquery("random")
Line 638: Line 686:
PRINT FNquery("the")
PRINT FNquery("the")
END
END

DEF FNquery(A$)
DEF FNquery(A$)
LOCAL hash%, link%, temp%
LOCAL hash%, link%, temp%
Line 655: Line 703:
ENDWHILE
ENDWHILE
= LEFT$(LEFT$(A$))
= LEFT$(LEFT$(A$))

DEF FNhash(A$)
DEF FNhash(A$)
LOCAL hash%
LOCAL hash%
Line 662: Line 710:
IF LEN(A$) > 4 hash% EOR= !(!^A$ + LEN(A$) - 4)
IF LEN(A$) > 4 hash% EOR= !(!^A$ + LEN(A$) - 4)
= hash% MOD DictSize%
= hash% MOD DictSize%

DEF FNlower(A$)
DEF FNlower(A$)
LOCAL A%,C%
LOCAL A%,C%
Line 669: Line 717:
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32)
IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32)
NEXT
NEXT
= A$</lang>
= A$</syntaxhighlight>
'''Output:'''
'''Output:'''
<pre>
<pre>
Line 680: Line 728:
=={{header|C}}==
=={{header|C}}==
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at.
The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at.
<lang C>#include <stdio.h>
<syntaxhighlight lang="c">#include <stdio.h>
#include <stdlib.h>
#include <stdlib.h>


Line 686: Line 734:
int chr_idx[256] = {0};
int chr_idx[256] = {0};
char idx_chr[256] = {0};
char idx_chr[256] = {0};

#define FNAME 0
#define FNAME 0
typedef struct trie_t *trie, trie_t;
typedef struct trie_t *trie, trie_t;
struct trie_t {
struct trie_t {
trie next[sizeof(chr_legal)]; /* next letter; slot 0 is for file name */
trie next[sizeof(chr_legal)]; /* next letter; slot 0 is for file name */
int eow;
int eow;
};
};

trie trie_new() { return calloc(sizeof(trie_t), 1); }
trie trie_new() { return calloc(sizeof(trie_t), 1); }

#define find_word(r, w) trie_trav(r, w, 1)
#define find_word(r, w) trie_trav(r, w, 1)
/* tree traversal: returns node if end of word and matches string, optionally
/* tree traversal: returns node if end of word and matches string, optionally
Line 702: Line 750:
trie trie_trav(trie root, const char * str, int no_create)
trie trie_trav(trie root, const char * str, int no_create)
{
{
int c;
int c;
while (root) {
while (root) {
if ((c = str[0]) == '\0') {
if ((c = str[0]) == '\0') {
if (!root->eow && no_create) return 0;
if (!root->eow && no_create) return 0;
break;
break;
}
}
if (! (c = chr_idx[c]) ) {
if (! (c = chr_idx[c]) ) {
str++;
str++;
continue;
continue;
}
}

if (!root->next[c]) {
if (!root->next[c]) {
if (no_create) return 0;
if (no_create) return 0;
root->next[c] = trie_new();
root->next[c] = trie_new();
}
}
root = root->next[c];
root = root->next[c];
str++;
str++;
}
}
return root;
return root;
}
}

/* complete traversal of whole tree, calling callback at each end of word node.
/* complete traversal of whole tree, calling callback at each end of word node.
* similar method can be used to free nodes, had we wanted to do that.
* similar method can be used to free nodes, had we wanted to do that.
Line 728: Line 776:
int trie_all(trie root, char path[], int depth, int (*callback)(char *))
int trie_all(trie root, char path[], int depth, int (*callback)(char *))
{
{
int i;
int i;
if (root->eow && !callback(path)) return 0;
if (root->eow && !callback(path)) return 0;

for (i = 1; i < sizeof(chr_legal); i++) {
for (i = 1; i < sizeof(chr_legal); i++) {
if (!root->next[i]) continue;
if (!root->next[i]) continue;

path[depth] = idx_chr[i];
path[depth] = idx_chr[i];
path[depth + 1] = '\0';
path[depth + 1] = '\0';
if (!trie_all(root->next[i], path, depth + 1, callback))
if (!trie_all(root->next[i], path, depth + 1, callback))
return 0;
return 0;
}
}
return 1;
return 1;
}
}

void add_index(trie root, const char *word, const char *fname)
void add_index(trie root, const char *word, const char *fname)
{
{
trie x = trie_trav(root, word, 0);
trie x = trie_trav(root, word, 0);
x->eow = 1;
x->eow = 1;

if (!x->next[FNAME])
if (!x->next[FNAME])
x->next[FNAME] = trie_new();
x->next[FNAME] = trie_new();
x = trie_trav(x->next[FNAME], fname, 0);
x = trie_trav(x->next[FNAME], fname, 0);
x->eow = 1;
x->eow = 1;
}
}

int print_path(char *path)
int print_path(char *path)
{
{
printf(" %s", path);
printf(" %s", path);
return 1;
return 1;
}
}

/* pretend we parsed text files and got lower cased words: dealing *
/* pretend we parsed text files and got lower cased words: dealing *
* with text file is a whole other animal and would make code too long */
* with text file is a whole other animal and would make code too long */
const char *files[] = { "f1.txt", "source/f2.txt", "other_file" };
const char *files[] = { "f1.txt", "source/f2.txt", "other_file" };
const char *text[][5] ={{ "it", "is", "what", "it", "is" },
const char *text[][5] ={{ "it", "is", "what", "it", "is" },
{ "what", "is", "it", 0 },
{ "what", "is", "it", 0 },
{ "it", "is", "a", "banana", 0 }};
{ "it", "is", "a", "banana", 0 }};

trie init_tables()
trie init_tables()
{
{
int i, j;
int i, j;
trie root = trie_new();
trie root = trie_new();
for (i = 0; i < sizeof(chr_legal); i++) {
for (i = 0; i < sizeof(chr_legal); i++) {
chr_idx[(int)chr_legal[i]] = i + 1;
chr_idx[(int)chr_legal[i]] = i + 1;
idx_chr[i + 1] = chr_legal[i];
idx_chr[i + 1] = chr_legal[i];
}
}


/* Enable USE_ADVANCED_FILE_HANDLING to use advanced file handling.
/* Enable USE_ADVANCED_FILE_HANDLING to use advanced file handling.
* You need to have files named like above files[], with words in them
* You need to have files named like above files[], with words in them
* like in text[][]. Case doesn't matter (told you it's advanced).
* like in text[][]. Case doesn't matter (told you it's advanced).
*/
*/
#define USE_ADVANCED_FILE_HANDLING 0
#define USE_ADVANCED_FILE_HANDLING 0
#if USE_ADVANCED_FILE_HANDLING
#if USE_ADVANCED_FILE_HANDLING
void read_file(const char * fname) {
void read_file(const char * fname) {
char cmd[1024];
char cmd[1024];
char word[1024];
char word[1024];
sprintf(cmd, "perl -p -e 'while(/(\\w+)/g) {print lc($1),\"\\n\"}' %s", fname);
sprintf(cmd, "perl -p -e 'while(/(\\w+)/g) {print lc($1),\"\\n\"}' %s", fname);
FILE *in = popen(cmd, "r");
FILE *in = popen(cmd, "r");
while (!feof(in)) {
while (!feof(in)) {
fscanf(in, "%1000s", word);
fscanf(in, "%1000s", word);
add_index(root, word, fname);
add_index(root, word, fname);
}
}
pclose(in);
pclose(in);
};
};


read_file("f1.txt");
read_file("f1.txt");
read_file("source/f2.txt");
read_file("source/f2.txt");
read_file("other_file");
read_file("other_file");
#else
#else
for (i = 0; i < 3; i++) {
for (i = 0; i < 3; i++) {
for (j = 0; j < 5; j++) {
for (j = 0; j < 5; j++) {
if (!text[i][j]) break;
if (!text[i][j]) break;
add_index(root, text[i][j], files[i]);
add_index(root, text[i][j], files[i]);
}
}
}
}
#endif /*USE_ADVANCED_FILE_HANDLING*/
#endif /*USE_ADVANCED_FILE_HANDLING*/


return root;
return root;
}
}

void search_index(trie root, const char *word)
void search_index(trie root, const char *word)
{
{
char path[1024];
char path[1024];
printf("Search for \"%s\": ", word);
printf("Search for \"%s\": ", word);
trie found = find_word(root, word);
trie found = find_word(root, word);

if (!found) printf("not found\n");
if (!found) printf("not found\n");
else {
else {
trie_all(found->next[FNAME], path, 0, print_path);
trie_all(found->next[FNAME], path, 0, print_path);
printf("\n");
printf("\n");
}
}
}
}

int main()
int main()
{
{
trie root = init_tables();
trie root = init_tables();

search_index(root, "what");
search_index(root, "what");
search_index(root, "is");
search_index(root, "is");
search_index(root, "banana");
search_index(root, "banana");
search_index(root, "boo");
search_index(root, "boo");
return 0;
return 0;
}</lang>Output:<lang>Search for "what": f1.txt source/f2.txt
}</syntaxhighlight>Output:<syntaxhighlight lang="text">Search for "what": f1.txt source/f2.txt
Search for "is": f1.txt other_file source/f2.txt
Search for "is": f1.txt other_file source/f2.txt
Search for "banana": other_file
Search for "banana": other_file
Search for "boo": not found</lang>
Search for "boo": not found</syntaxhighlight>


=={{header|C sharp|C#}}==
=={{header|C sharp|C#}}==
<lang csharp>using System;
<syntaxhighlight lang="csharp">using System;
using System.Collections.Generic;
using System.Collections.Generic;
using System.IO;
using System.IO;
Line 860: Line 908:
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find]));
Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find]));
}
}
}</lang>
}</syntaxhighlight>
Sample output:
Sample output:
<lang>files: file1 file2 file3
<syntaxhighlight lang="text">files: file1 file2 file3
find: what
find: what
what found in: file1 file2</lang>
what found in: file1 file2</syntaxhighlight>

=={{header|C++}}==
Same idea as the C implementation - trie to store the words
<syntaxhighlight lang="cpp">
#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>

const std::string _CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/";
const size_t MAX_NODES = 41;

class node
{
public:
node() { clear(); }
node( char z ) { clear(); }
~node() { for( int x = 0; x < MAX_NODES; x++ ) if( next[x] ) delete next[x]; }
void clear() { for( int x = 0; x < MAX_NODES; x++ ) next[x] = 0; isWord = false; }
bool isWord;
std::vector<std::string> files;
node* next[MAX_NODES];
};

class index {
public:
void add( std::string s, std::string fileName ) {
std::transform( s.begin(), s.end(), s.begin(), tolower );
std::string h;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
if( *i == 32 ) {
pushFileName( addWord( h ), fileName );
h.clear();
continue;
}
h.append( 1, *i );
}
if( h.length() )
pushFileName( addWord( h ), fileName );
}
void findWord( std::string s ) {
std::vector<std::string> v = find( s );
if( !v.size() ) {
std::cout << s + " was not found!\n";
return;
}
std::cout << s << " found in:\n";
for( std::vector<std::string>::iterator i = v.begin(); i != v.end(); i++ ) {
std::cout << *i << "\n";
}
std::cout << "\n";
}
private:
void pushFileName( node* n, std::string fn ) {
std::vector<std::string>::iterator i = std::find( n->files.begin(), n->files.end(), fn );
if( i == n->files.end() ) n->files.push_back( fn );
}
const std::vector<std::string>& find( std::string s ) {
size_t idx;
std::transform( s.begin(), s.end(), s.begin(), tolower );
node* rt = &root;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
idx = _CHARS.find( *i );
if( idx < MAX_NODES ) {
if( !rt->next[idx] ) return std::vector<std::string>();
rt = rt->next[idx];
}
}
if( rt->isWord ) return rt->files;
return std::vector<std::string>();
}
node* addWord( std::string s ) {
size_t idx;
node* rt = &root, *n;
for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
idx = _CHARS.find( *i );
if( idx < MAX_NODES ) {
n = rt->next[idx];
if( n ){
rt = n;
continue;
}
n = new node( *i );
rt->next[idx] = n;
rt = n;
}
}
rt->isWord = true;
return rt;
}
node root;
};
int main( int argc, char* argv[] ) {
index t;
std::string s;
std::string files[] = { "file1.txt", "f_text.txt", "text_1b.txt" };

for( int x = 0; x < 3; x++ ) {
std::ifstream f;
f.open( files[x].c_str(), std::ios::in );
if( f.good() ) {
while( !f.eof() ) {
f >> s;
t.add( s, files[x] );
s.clear();
}
f.close();
}
}

while( true ) {
std::cout << "Enter one word to search for, return to exit: ";
std::getline( std::cin, s );
if( !s.length() ) break;
t.findWord( s );

}
return 0;
}
</syntaxhighlight>
{{out}}
<pre>
Enter one word to search for, return to exit: goodness
goodness found in:
file1.txt
f_text.txt

Enter one word to search for, return to exit: because
because found in:
f_text.txt

Enter one word to search for, return to exit: her
her found in:
text_1b.txt

Enter one word to search for, return to exit: fat
fat was not found!
</pre>

=={{header|Clojure}}==
<syntaxhighlight lang="clojure">(ns inverted-index.core
(:require [clojure.set :as sets]
[clojure.java.io :as io]))

(def pattern #"\w+") ; Java regex for a raw term: here a substring of alphanums
(defn normalize [match] (.toLowerCase match)) ; normalization of a raw term

(defn term-seq [text] (map normalize (re-seq pattern text)))

(defn set-assoc
"Produces map with v added to the set associated with key k in map m"
[m k v] (assoc m k (conj (get m k #{}) v)))

(defn index-file [index file]
(with-open [reader (io/reader file)]
(reduce
(fn [idx term] (set-assoc idx term file))
index
(mapcat term-seq (line-seq reader)))))

(defn make-index [files]
(reduce index-file {} files))

(defn search [index query]
(apply sets/intersection (map index (term-seq query))))
</syntaxhighlight>

=={{header|CoffeeScript}}==
=={{header|CoffeeScript}}==
<lang coffeescript>
<syntaxhighlight lang="coffeescript">
fs = require 'fs'
fs = require 'fs'


Line 888: Line 1,104:
console.log "#{fn}:#{line_num}"
console.log "#{fn}:#{line_num}"
console.log "\n"
console.log "\n"

get_words = (line) ->
get_words = (line) ->
words = line.replace(/\W/g, ' ').split ' '
words = line.replace(/\W/g, ' ').split ' '
Line 902: Line 1,118:
grep index, 'make_index'
grep index, 'make_index'
grep index, 'sort'
grep index, 'sort'
</syntaxhighlight>
</lang>
output
output
<lang>
<syntaxhighlight lang="text">
> coffee inverted_index.coffee
> coffee inverted_index.coffee
locations for 'make_index':
locations for 'make_index':
inverted_index.coffee:3
inverted_index.coffee:3
Line 920: Line 1,136:
inverted_index.coffee:35
inverted_index.coffee:35
knuth_sample.coffee:12
knuth_sample.coffee:12
</syntaxhighlight>
</lang>


=={{header|Common Lisp}}==
=={{header|Common Lisp}}==
<lang Lisp>(defpackage rosettacode.inverted-index
<syntaxhighlight lang="lisp">(defpackage rosettacode.inverted-index
(:use cl))
(:use cl))
(in-package rosettacode.inverted-index)
(in-package rosettacode.inverted-index)
Line 958: Line 1,174:
:test #'equal))
:test #'equal))


</syntaxhighlight>
</lang>
Example:
Example:
<lang lisp>(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt")))
<syntaxhighlight lang="lisp">(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt")))
(defparameter *query* "foo bar")
(defparameter *query* "foo bar")
(defparameter *result* (lookup *index* *query*))
(defparameter *result* (lookup *index* *query*))
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</lang>
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)</syntaxhighlight>


=={{header|Factor}}==
=={{header|D}}==
<syntaxhighlight lang="d">import std.stdio, std.algorithm, std.string, std.file, std.regex;
<lang factor>USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index


void main() {
: file-words ( file -- assoc )
string[][string] index;
utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
: add-to-file-list ( files file -- files )
over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
: add-to-index ( words index file -- )
'[ _ [ _ add-to-file-list ] change-at ] each ;
: (index-files) ( files index -- )
[ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
: index-files ( files -- index )
H{ } clone [ (index-files) ] keep ;
: query ( terms index -- files )
[ at ] curry map [ ] [ intersect ] map-reduce ;
</lang>


void parseFile(in string fn) {
Example use :
if (!exists(fn) || !isFile(fn))
<lang>( scratchpad ) { "f1" "f2" "f3" } index-files
throw new Exception("File not found");

foreach (word; readText(fn).splitter(regex(r"\W"))) {
word = word.toLower();
if (!index.get(word, null).canFind(fn))
index[word] ~= fn;
}
}

immutable fileNames = ["inv1.txt", "inv2.txt", "inv3.txt"];
foreach (fName; fileNames)
parseFile(fName);

while (true) {
writef("\nEnter a word to search for: (q to quit): ");
immutable w = readln().strip().toLower();
if (w == "q") {
writeln("quitting.");
break;
}
if (w in index)
writefln("'%s' found in%( %).", w, index[w]);
else
writefln("'%s' not found.", w);
}
}</syntaxhighlight>
Both the demo text files and the queries are from the Wikipedia page, they contain:
It is what it is.

What is it?

It is a banana!
{{out}}
<pre>Enter a word to search for: (q to quit): cat
'cat' not found.

Enter a word to search for: (q to quit): is
'is' found in "inv1.txt" "inv2.txt" "inv3.txt".

Enter a word to search for: (q to quit): banana
'banana' found in "inv3.txt".

Enter a word to search for: (q to quit): it
'it' found in "inv1.txt" "inv2.txt" "inv3.txt".

Enter a word to search for: (q to quit): what
'what' found in "inv1.txt" "inv2.txt".

Enter a word to search for: (q to quit): q
quitting.</pre>
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| system.Generics.Collections}}
{{libheader| SYstem.IOUtils}}
<syntaxhighlight lang="delphi">
program Inverted_index;

{$APPTYPE CONSOLE}

uses
System.SysUtils,
system.Generics.Collections,
SYstem.IOUtils;

type
TIndex = class
private
FFileNames: TArray<string>;
FIndexs: TDictionary<string, string>;
class function SplitWords(Text: string): TArray<string>; static;
function StoreFileName(FileName: TFileName): string;
procedure StoreLine(Line, Id: string);
procedure StoreWord(aWord, Id: string);
function DecodeFileName(Indexed: string): string;
public
procedure AddFile(FileName: TFileName);
constructor Create;
destructor Destroy; override;
function Search(aWord: string): string;
property Index[aWord: string]: string read Search; default;
end;

{ TIndex }

constructor TIndex.Create;
begin
FIndexs := TDictionary<string, string>.Create();
SetLength(FFileNames, 0);
end;

function TIndex.DecodeFileName(Indexed: string): string;
var
f: string;
i: integer;
begin
Result := Indexed;
if Length(FFileNames) > 0 then
for i := 0 to High(FFileNames) do
begin
f := FFileNames[i];
Result := Result.Replace(format('(%x)', [i]), f);
end;
end;

destructor TIndex.Destroy;
begin
FIndexs.Free;
inherited;
end;

function TIndex.Search(aWord: string): string;
begin
aWord := AnsiLowerCase(aWord);
if FIndexs.ContainsKey(aWord) then
Result := 'found in ' + DecodeFileName(FIndexs[aWord])
else
Result := 'not found';
end;

class function TIndex.SplitWords(Text: string): TArray<string>;
const
WORD_CHARSET: set of char = (['a'..'z', 'A'..'Z', 'á', 'é', 'í', 'ó', 'ú', 'ã',
'õ', 'à', 'è', 'ì', 'ò', 'ù', 'Á', 'É', 'Í', 'Ó', 'Ú', 'Ã', 'Õ', 'À', 'È',
'Ì', 'Ò', 'Ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'ç', 'Ç', '_', '0'..'9']);

procedure Append(var value: string);
begin
if not value.IsEmpty then
begin
SetLength(result, length(result) + 1);
result[high(result)] := value;
value := '';
end;
end;

var
c: Char;
inWord, isWordChar: boolean;
w: string;
begin
inWord := False;
w := '';
SetLength(result, 0);
for c in Text do
begin
isWordChar := (c in WORD_CHARSET);
if inWord then
if isWordChar then
w := w + c
else
begin
Append(w);
inWord := false;
Continue;
end;

if not inWord and isWordChar then
begin
inWord := True;
w := c;
end;
end;
if inWord then
Append(w);
end;

function TIndex.StoreFileName(FileName: TFileName): string;
begin
SetLength(FFileNames, length(FFileNames) + 1);
FFileNames[High(FFileNames)] := FileName;
Result := format('"(%x)"', [High(FFileNames)]);
end;

procedure TIndex.StoreLine(Line, Id: string);
var
Words: TArray<string>;
w: string;
begin
Words := SplitWords(Line);
for w in Words do
StoreWord(w, Id);
end;

procedure TIndex.StoreWord(aWord, Id: string);
begin
aWord := AnsiLowercase(aWord);
if FIndexs.ContainsKey(aWord) then
begin
if (FIndexs[aWord].IndexOf(Id) = -1) then
FIndexs[aWord] := FIndexs[aWord] + ', ' + Id;
end
else
FIndexs.Add(aWord, Id);
end;

procedure TIndex.AddFile(FileName: TFileName);
var
Lines: TArray<string>;
Line: string;
Id: string;
begin
if not FileExists(FileName) then
exit;
Lines := TFile.ReadAllLines(FileName);

if length(Lines) = 0 then
exit;

// Remove this line if you want full filename
FileName := ExtractFileName(FileName);

Id := StoreFileName(FileName);

for Line in Lines do
StoreLine(Line, Id);
end;

var
Index: TIndex;
FileNames: TArray<TFileName> = ['file1.txt', 'file2.txt', 'file3.txt'];
FileName: TFileName;
Querys: TArray<string> = ['Cat', 'is', 'banana', 'it', 'what'];
Query, Found: string;

begin
Index := TIndex.Create;

for FileName in FileNames do
Index.AddFile(FileName);

for Query in Querys do
Writeln(format('"%s" %s'#10, [Query, Index[Query]]));

repeat
Writeln('Enter a word to search for: (q to quit)');
Readln(Query);

if Query = 'q' then
Break;

Writeln(format('"%s" %s'#10, [Query, Index[Query]]));
until False;

Index.Free;
end.</syntaxhighlight>
Input: Same of [[#D|D]].
{{out}}
<pre>"Cat" not found

"is" found in "file1.txt", "file2.txt", "file3.txt"

"banana" found in "file3.txt"

"it" found in "file1.txt", "file2.txt", "file3.txt"

"what" found in "file1.txt", "file2.txt"

Enter a word to search for: (q to quit):
q</pre>

=={{header|EchoLisp}}==

===Indexing===
Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage.
<syntaxhighlight lang="lisp">
;; set of input files
(define FILES {T0.txt T1.txt T2.txt})
;; store name for permanent inverted index
(define INVERT "INVERTED-INDEX")

;; get text for each file, and call (action filename text)
(define (map-files action files)
(for ((file files))
(file->string action file)))

;; create store
(local-make-store INVERT)

; invert-word : word -> set of files
(define (invert-word word file store)
(local-put-value word
(make-set (cons file (local-get-value word store))) store))

; parse file text and invert each word
(define (invert-text file text)
(writeln 'Inverting file text)
(let ((text (text-parse text)))
(for ((word text)) (invert-word (string-downcase word) file INVERT))))
</syntaxhighlight>

=== Query ===
Intersect sets values of each word.
<syntaxhighlight lang="lisp">
;; usage : (inverted-search w1 w2 ..)
(define-syntax-rule (inverted-search w ...)
(and-get-invert (quote w )))

;; intersects all sets referenced by words
;; returns the intersection
(define (and-get-invert words)
(foldl
(lambda(word res)
(set-intersect res (local-get-value word INVERT)))
FILES words))
</syntaxhighlight>
Output :
<syntaxhighlight lang="lisp">
(map-files invert-text FILES)
(inverted-search is it)
[0]→ { T0.txt T1.txt T2.txt }
(inverted-search is banana)
[1]→ { T2.txt }
(inverted-search is what)
[2]→ { T0.txt T1.txt }
(inverted-search boule)
[3]→ null
</syntaxhighlight>

=={{header|Erlang}}==
This might be used with a lot of large files so we use binaries to save space. That adds <<>> to the search terms.
If somebody wants to avoid "end." and "end" being two different terms, just add <<".">> to binary:compile_pattern/1
Ditto for any other character.

<syntaxhighlight lang="erlang">
-module( inverted_index ).

-export( [from_files/1, search/2, task/0] ).

from_files( Files ) ->
lists:foldl( fun import_from_file/2, dict:new(), Files ).

search( Binaries, Inverted_index ) ->
[Files | T] = [dict:fetch(X, Inverted_index) || X <- Binaries],
lists:foldl( fun search_common/2, Files, T ).

task() ->
Files_contents = [{"file_1", <<"it is what it is">>}, {"file_2", <<"what is it">>}, {"file_3", <<"it is a banana">>}],
[file:write_file(X, Y) || {X, Y} <- Files_contents],
Inverted_index = from_files( [X || {X, _Y} <- Files_contents] ),
Result = search( [<<"what">>, <<"is">>, <<"it">>], Inverted_index ),
io:fwrite( "~p~n", [Result] ),
[file:delete(X) || {X, _Y} <- Files_contents].



import_from_file( File, Dict_acc ) ->
New_dict = dict:from_list( import_from_file_contents(File, file:read_file(File)) ),
dict:merge( fun import_from_file_merge/3, Dict_acc, New_dict ).

import_from_file_contents( File, {ok, Binary} ) ->
[{X, [File]} || X <- binary:split( Binary, binary:compile_pattern([<<" ">>, <<"\n">>]), [global] )];
import_from_file_contents( File, {error, Error} ) ->
io:fwrite( "Error: could not open file ~p: ~p~nContinuing with the rest of them~n", [File, Error] ),
[].

import_from_file_merge( _Key, Files, [New_file] ) -> [New_file | Files].

search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)].
</syntaxhighlight>


--- Data stack:
H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }
</lang>
=={{header|F Sharp|F#}}==
=={{header|F Sharp|F#}}==
<lang fsharp>open System
<syntaxhighlight lang="fsharp">open System
open System.IO
open System.IO


// Map search terms to associated set of files
// Map search terms to associated set of files
type searchIndexMap = Map<string, Set<string>>
type searchIndexMap = Map<string, Set<string>>

let inputSearchCriteria() =
let inputSearchCriteria() =
let readLine prompt =
let readLine prompt =
printf "%s: " prompt
printf "%s: " prompt
Console.ReadLine().Split()
Console.ReadLine().Split()

readLine "Files", (readLine "Find") |> Array.map (fun s -> s.ToLower())
readLine "Files", (readLine "Find") |> Array.map (fun s -> s.ToLower())

let updateIndex indexMap keyValuePair =
let updateIndex indexMap keyValuePair =
let k, v = keyValuePair
let k, v = keyValuePair

match Map.tryFind k indexMap with
match Map.tryFind k indexMap with
| None -> Map.add k (Set.singleton v) indexMap
| None -> Map.add k (Set.singleton v) indexMap
| Some set -> Map.add k (Set.add v set) indexMap
| Some set -> Map.add k (Set.add v set) indexMap

let buildIndex files =
let buildIndex files =
let fileData file =
let fileData file =
File.ReadAllText(file).Split() |> Seq.map (fun word -> word.ToLower(), file)
File.ReadAllText(file).Split() |> Seq.map (fun word -> word.ToLower(), file)

files |> Seq.collect fileData
files |> Seq.collect fileData
|> Seq.fold updateIndex Map.empty
|> Seq.fold updateIndex Map.empty

let searchFiles() =
let searchFiles() =
let files, terms = inputSearchCriteria()
let files, terms = inputSearchCriteria()
Line 1,027: Line 1,584:
|> Set.intersectMany
|> Set.intersectMany


printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</lang>
printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""</syntaxhighlight>
Sample usage:
Sample usage:
<pre>
<pre>
Line 1,035: Line 1,592:
Find: what is
Find: what is
Found in: file1.txt file2.txt</pre>
Found in: file1.txt file2.txt</pre>

=={{header|Factor}}==
<syntaxhighlight lang="factor">USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index

: file-words ( file -- assoc )
utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
: add-to-file-list ( files file -- files )
over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
: add-to-index ( words index file -- )
'[ _ [ _ add-to-file-list ] change-at ] each ;
: (index-files) ( files index -- )
[ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
: index-files ( files -- index )
H{ } clone [ (index-files) ] keep ;
: query ( terms index -- files )
[ at ] curry map [ ] [ intersect ] map-reduce ;
</syntaxhighlight>

Example use :
<syntaxhighlight lang="text">( scratchpad ) { "f1" "f2" "f3" } index-files

--- Data stack:
H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }
</syntaxhighlight>


=={{header|Go}}==
=={{header|Go}}==
<lang go>package main
<syntaxhighlight lang="go">package main


import (
import (
Line 1,145: Line 1,730:
}
}
return true
return true
}
}

func ui() {
func ui() {
fmt.Println(len(index), "words indexed in", len(indexed), "files")
fmt.Println(len(index), "words indexed in", len(indexed), "files")
Line 1,164: Line 1,749:
fmt.Println("one match:")
fmt.Println("one match:")
fmt.Println(" ", indexed[dl[0]].file, indexed[dl[0]].title)
fmt.Println(" ", indexed[dl[0]].file, indexed[dl[0]].title)
default:
default:
fmt.Println(len(dl), "matches:")
fmt.Println(len(dl), "matches:")
for _, d := range dl {
for _, d := range dl {
Line 1,171: Line 1,756:
}
}
}
}
}</lang>
}</syntaxhighlight>
Session:
Session:
<pre>
<pre>
Line 1,194: Line 1,779:
=={{header|Haskell}}==
=={{header|Haskell}}==


<lang haskell>import Control.Monad
<syntaxhighlight lang="haskell">import Control.Monad
import Data.Char (isAlpha, toLower)
import Data.Char (isAlpha, toLower)
import qualified Data.Map as M
import qualified Data.Map as M
Line 1,225: Line 1,810:
intersections xs = foldl1 S.intersection xs
intersections xs = foldl1 S.intersection xs


lowercase = map toLower</lang>
lowercase = map toLower</syntaxhighlight>


An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
An example of use, assuming the program is named <code>iindex</code> and there exist files <code>t0</code>, <code>t1</code>, and <code>t2</code> with contents "It is what it is.", "What is it?", and "It is a banana.":
Line 1,236: Line 1,821:


The following implements a simple case insensitive inverse index using lists simulating texts.
The following implements a simple case insensitive inverse index using lists simulating texts.
<lang Icon>procedure main()
<syntaxhighlight lang="icon">procedure main()


texts := table() # substitute for read and parse files
texts := table() # substitute for read and parse files
Line 1,244: Line 1,829:


every textname := key(texts) do # build index for each 'text'
every textname := key(texts) do # build index for each 'text'
SII := InvertedIndex(SII,textname,texts[textname])
SII := InvertedIndex(SII,textname,texts[textname])

TermSearchUI(SII) # search UI
TermSearchUI(SII) # search UI

end
end


Line 1,265: Line 1,850:
repeat {
repeat {
writes("Enter search terms (^z to quit) : ")
writes("Enter search terms (^z to quit) : ")
terms := map(trim(read() | break))
terms := map(trim(read() | break))


x := []
x := []
terms ? while not pos(0) do {
terms ? while not pos(0) do {
tab(many(' \t'))
tab(many(' \t'))
put(x,tab(upto('\ \t')|0))
put(x,tab(upto('\ \t')|0))
}
}

show("Searching for : ",x)
show("Searching for : ",x)
show("Found in : ",s := TermSearch(ii,x)) | show("Not found : ",x)
show("Found in : ",s := TermSearch(ii,x)) | show("Not found : ",x)
}
}
write("End of search")
write("End of search")
Line 1,281: Line 1,866:


procedure TermSearch(ii,x) #: return set of matches or fail
procedure TermSearch(ii,x) #: return set of matches or fail
every s := !x do
every s := !x do
( /u := ii[s] ) | (u **:= ii[s])
( /u := ii[s] ) | (u **:= ii[s])
if *u > 0 then return u
if *u > 0 then return u
Line 1,289: Line 1,874:
every writes(s|!x) do writes(" ")
every writes(s|!x) do writes(" ")
write()
write()
return
return
end</lang>
end</syntaxhighlight>


Output:<pre>Enter search terms (^z to quit) : is it
Output:<pre>Enter search terms (^z to quit) : is it
Line 1,306: Line 1,891:


The following code will build a full index. Modification of search routines is left as an exercise:
The following code will build a full index. Modification of search routines is left as an exercise:
<lang Unicon>record InvertedIndexRec(simple,full)
<syntaxhighlight lang="unicon">record InvertedIndexRec(simple,full)


procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index
procedure FullInvertedIndex(ii,k,words) #: accumulate a full inverted index
Line 1,324: Line 1,909:


return ii
return ii
end</lang>
end</syntaxhighlight>


=={{header|J}}==
=={{header|J}}==
Line 1,330: Line 1,915:
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.
This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.


<lang J>require'files regex strings'
<syntaxhighlight lang="j">require'files regex strings'


rxutf8 0 NB. support latin1 searches for this example, instead of utf8
rxutf8 0 NB. support latin1 searches for this example, instead of utf8
Line 1,353: Line 1,938:
hits=. buckets{~words i.~.parse tolower y
hits=. buckets{~words i.~.parse tolower y
files {~ >([-.-.)each/hits
files {~ >([-.-.)each/hits
)</lang>
)</syntaxhighlight>


Example use:
Example use:


<lang J> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
<syntaxhighlight lang="j"> invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
>search 'finally learning'
>search 'finally learning'
~help/primer/end.htm
~help/primer/end.htm
Line 1,365: Line 1,950:
~help/primer/gui.htm
~help/primer/gui.htm
>search 'around'
>search 'around'
~help/primer/gui.htm</lang>
~help/primer/gui.htm</syntaxhighlight>

=={{header|Java}}==
=={{header|Java}}==
<syntaxhighlight lang="java">
<lang Java>
package org.rosettacode;
package org.rosettacode;


Line 1,385: Line 1,971:
public class InvertedIndex {
public class InvertedIndex {


List<String> stopwords = Arrays.asList("a", "able", "about",
List<String> stopwords = Arrays.asList("a", "able", "about",
"across", "after", "all", "almost", "also", "am", "among", "an",
"across", "after", "all", "almost", "also", "am", "among", "an",
"and", "any", "are", "as", "at", "be", "because", "been", "but",
"and", "any", "are", "as", "at", "be", "because", "been", "but",
"by", "can", "cannot", "could", "dear", "did", "do", "does",
"by", "can", "cannot", "could", "dear", "did", "do", "does",
"either", "else", "ever", "every", "for", "from", "get", "got",
"either", "else", "ever", "every", "for", "from", "get", "got",
"had", "has", "have", "he", "her", "hers", "him", "his", "how",
"had", "has", "have", "he", "her", "hers", "him", "his", "how",
"however", "i", "if", "in", "into", "is", "it", "its", "just",
"however", "i", "if", "in", "into", "is", "it", "its", "just",
"least", "let", "like", "likely", "may", "me", "might", "most",
"least", "let", "like", "likely", "may", "me", "might", "most",
"must", "my", "neither", "no", "nor", "not", "of", "off", "often",
"must", "my", "neither", "no", "nor", "not", "of", "off", "often",
"on", "only", "or", "other", "our", "own", "rather", "said", "say",
"on", "only", "or", "other", "our", "own", "rather", "said", "say",
"says", "she", "should", "since", "so", "some", "than", "that",
"says", "she", "should", "since", "so", "some", "than", "that",
"the", "their", "them", "then", "there", "these", "they", "this",
"the", "their", "them", "then", "there", "these", "they", "this",
"tis", "to", "too", "twas", "us", "wants", "was", "we", "were",
"tis", "to", "too", "twas", "us", "wants", "was", "we", "were",
"what", "when", "where", "which", "while", "who", "whom", "why",
"what", "when", "where", "which", "while", "who", "whom", "why",
"will", "with", "would", "yet", "you", "your");
"will", "with", "would", "yet", "you", "your");


Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>();
Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>();
List<String> files = new ArrayList<String>();
List<String> files = new ArrayList<String>();


public void indexFile(File file) throws IOException {
public void indexFile(File file) throws IOException {
int fileno = files.indexOf(file.getPath());
int fileno = files.indexOf(file.getPath());
if (fileno == -1) {
if (fileno == -1) {
files.add(file.getPath());
files.add(file.getPath());
fileno = files.size() - 1;
fileno = files.size() - 1;
}
}


int pos = 0;
int pos = 0;
BufferedReader reader = new BufferedReader(new FileReader(file));
BufferedReader reader = new BufferedReader(new FileReader(file));
for (String line = reader.readLine(); line != null; line = reader
for (String line = reader.readLine(); line != null; line = reader
.readLine()) {
.readLine()) {
for (String _word : line.split("\\W+")) {
for (String _word : line.split("\\W+")) {
String word = _word.toLowerCase();
String word = _word.toLowerCase();
pos++;
pos++;
if (stopwords.contains(word))
if (stopwords.contains(word))
continue;
continue;
List<Tuple> idx = index.get(word);
List<Tuple> idx = index.get(word);
if (idx == null) {
if (idx == null) {
idx = new LinkedList<Tuple>();
idx = new LinkedList<Tuple>();
index.put(word, idx);
index.put(word, idx);
}
}
idx.add(new Tuple(fileno, pos));
idx.add(new Tuple(fileno, pos));
}
}
}
}
System.out.println("indexed " + file.getPath() + " " + pos + " words");
System.out.println("indexed " + file.getPath() + " " + pos + " words");
}
}


public void search(List<String> words) {
public void search(List<String> words) {
for (String _word : words) {
for (String _word : words) {
Set<String> answer = new HashSet<String>();
Set<String> answer = new HashSet<String>();
String word = _word.toLowerCase();
String word = _word.toLowerCase();
List<Tuple> idx = index.get(word);
List<Tuple> idx = index.get(word);
if (idx != null) {
if (idx != null) {
for (Tuple t : idx) {
for (Tuple t : idx) {
answer.add(files.get(t.fileno));
answer.add(files.get(t.fileno));
}
}
}
}
System.out.print(word);
System.out.print(word);
for (String f : answer) {
for (String f : answer) {
System.out.print(" " + f);
System.out.print(" " + f);
}
}
System.out.println("");
System.out.println("");
}
}
}
}


public static void main(String[] args) {
public static void main(String[] args) {
try {
try {
InvertedIndex idx = new InvertedIndex();
InvertedIndex idx = new InvertedIndex();
for (int i = 1; i < args.length; i++) {
for (int i = 1; i < args.length; i++) {
idx.indexFile(new File(args[i]));
idx.indexFile(new File(args[i]));
}
}
idx.search(Arrays.asList(args[0].split(",")));
idx.search(Arrays.asList(args[0].split(",")));
} catch (Exception e) {
} catch (Exception e) {
e.printStackTrace();
e.printStackTrace();
}
}
}
}


private class Tuple {
private class Tuple {
private int fileno;
private int fileno;
private int position;
private int position;


public Tuple(int fileno, int position) {
public Tuple(int fileno, int position) {
this.fileno = fileno;
this.fileno = fileno;
this.position = position;
this.position = position;
}
}
}
}
}
}


</syntaxhighlight>
</lang>


Example output:
Example output:
<syntaxhighlight lang="java">
<lang Java>
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt
java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt
indexed pg30637.txt 106473 words
indexed pg30637.txt 106473 words
indexed pg7025.txt 205714 words
indexed pg7025.txt 205714 words
Line 1,489: Line 2,075:
olympian pg30637.txt
olympian pg30637.txt


</syntaxhighlight>
</lang>

=={{header|jq}}==

In the first part of this section, the core functions for computing an inverted index and for searching it are presented.
These functions will work with jq 1.4 as well as later (and possibly earlier) versions.

The second section shows how to accomplish the interactive task using a version of jq with support for 'input' and 'input_filename' (e.g. jq 1.5).

'''Part 1: inverted_index and search'''
<syntaxhighlight lang="jq"># Given an array of [ doc, array_of_distinct_words ]
# construct a lookup table: { word: array_of_docs }
def inverted_index:
reduce .[] as $pair
({};
$pair[0] as $doc
| reduce $pair[1][] as $word
(.; .[$word] += [$doc]));

def search(words):
def overlap(that): . as $this
| reduce that[] as $item ([]; if $this|index($item) then . + [$item] else . end);

. as $dict
| if (words|length) == 0 then []
else reduce words[1:][] as $word
( $dict[words[0]]; overlap( $dict[$word] ) )
end ; </syntaxhighlight>

'''Part 2: Interactive Search'''

In this section, a solution to the task is presented using two
invocations of jq: one parses the input files, and the other does
everything else. If your shell does not support <(...) then you
could create a temporary file to hold the parsed output.

<syntaxhighlight lang="jq">def prompt_search:
"Enter a string or an array of strings to search for, quoting each string, or 0 to exit:",
( (input | if type == "array" then . elif type == "string" then [.]
else empty
end) as $in
| search($in), prompt_search ) ;

$in | inverted_index | prompt_search</syntaxhighlight>

'''Example''':
<syntaxhighlight lang="sh">$ jq -r -c -n --argfile in <(jq -R 'split(" ") | select(length>0) | [input_filename, unique]' T?.txt) -f Inverted_index.jq
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
"is"
["T0.txt","T1.txt","T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
["is", "banana"]
["T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
0
$</syntaxhighlight>

=={{header|Julia}}==
<syntaxhighlight lang="julia">function makedoubleindex(files)
idx = Dict{String, Dict}()
for file in files
str = lowercase(read(file, String))
words = split(str, r"\s+")
for word in words
if !haskey(idx, word)
idx[word] = Dict{String, Int}()
elseif !haskey(idx[word], file)
(idx[word])[file] = 1
else
(idx[word])[file] += 1
end
end
end
idx
end

function wordsearch(dict, words::Vector{String})
for word in words
if haskey(dict, word)
for (f, n) in dict[word]
println("File $f contains $n instances of <$word>.")
end
else
println("No instances of \"$word\" were found.")
end
end
end
wordsearch(dict, word::String) = wordsearch(dict, [word])


const filenames = ["file1.txt", "file2.txt", "file3.txt"]
const didx = makedoubleindex(filenames)
const searchterms = ["forehead", "of", "hand", "a", "foot"]
wordsearch(didx, searchterms)
</syntaxhighlight>

=={{header|Kotlin}}==
<syntaxhighlight lang="scala">// version 1.1.51

import java.io.File

val invIndex = mutableMapOf<String, MutableList<Location>>()
val fileNames = mutableListOf<String>()
val splitter = Regex("""\W+""")

class Location(val fileName: String, val wordNum: Int) {
override fun toString() = "{$fileName, word number $wordNum}"
}

fun indexFile(fileName: String) {
if (fileName in fileNames) {
println("'$fileName' already indexed")
return
}
fileNames.add(fileName)
File(fileName).forEachLine { line ->
for ((i, w) in line.toLowerCase().split(splitter).withIndex()) {
var locations = invIndex[w]
if (locations == null) {
locations = mutableListOf<Location>()
invIndex.put(w, locations)
}
locations.add(Location(fileName, i + 1))
}
}
println("'$fileName' has been indexed")
}

fun findWord(word: String) {
val w = word.toLowerCase()
val locations = invIndex[w]
if (locations != null) {
println("\n'$word' found in the following locations:")
println(locations.map { " $it" }.joinToString("\n"))
}
else println("\n'$word' not found")
println()
}

fun main(args: Array<String>) {
// files to be indexed entered as command line arguments
if (args.size == 0) {
println("No file names have been supplied")
return
}
for (arg in args) indexFile(arg)
println()
println("Enter word(s) to be searched for in these files or 'q' to quit")
while (true) {
print(" ? : ")
val word = readLine()!!
if (word.toLowerCase() == "q") return
findWord(word)
}
}</syntaxhighlight>

Contents of files:
<pre>
inv1.txt -> It is what it is.
inv2.txt -> What is it?
inv3.txt -> It is a banana!
</pre>

Sample output:
<pre>$ java -jar inverted_index.jar inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed

Enter word(s) to be searched for in these files or 'q' to quit
? : cat

'cat' not found

? : is

'is' found in the following locations:
{inv1.txt, word number 2}
{inv1.txt, word number 5}
{inv2.txt, word number 2}
{inv3.txt, word number 2}

? : banana

'banana' found in the following locations:
{inv3.txt, word number 4}

? : it

'it' found in the following locations:
{inv1.txt, word number 1}
{inv1.txt, word number 4}
{inv2.txt, word number 3}
{inv3.txt, word number 1}

? : what

'what' found in the following locations:
{inv1.txt, word number 3}
{inv2.txt, word number 1}

? : a

'a' found in the following locations:
{inv3.txt, word number 3}

? : q
</pre>

=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"];
Manipulate[Grid[sr = TextSearch[si, ToString[s]];
{FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]],
Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose,
Frame -> All], {s, "tree"}]</syntaxhighlight>
{{out}}
Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string.

=={{header|Nim}}==
We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line.

<syntaxhighlight lang="nim">import os, strformat, strutils, tables

type

WordRef = tuple[docnum, linenum: int]

InvertedIndex = object
docs: seq[string]
index: Table[string, seq[WordRef]]

SearchResult = object
total: int
refs: seq[tuple[docname: string; linenums: seq[int]]]

IndexingError = object of CatchableError

const

# Characters composing a word (letters + "'").
WordChars = Letters + {'\''}

# Words to ignore.
StopWords = ["about", "above", "after", "again", "against", "all", "am", "an", "and", "any",
"are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below",
"between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did",
"didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each",
"few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have",
"haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's",
"hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll",
"i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself",
"let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not",
"of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours",
"ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll",
"she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's",
"the", "their", "theirs", "them", "themselves", "then", "there", "there's",
"these", "they", "they'd", "they'll", "they're", "they've", "this", "those",
"through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we",
"we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when",
"when's", "where", "where's", "which", "while", "who", "who's", "whom", "why",
"why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're",
"you've", "your", "yours", "yourself", "yourselves"]


template plural(n: int): string = (if n > 1: "s" else: "")


proc add(invIndex: var InvertedIndex; docPath: string) =
## Add a document to the index.

let docName = docPath.extractFilename()
if docName in invIndex.docs:
raise newException(IndexingError, &"{docName}: a document with this name is alreadry indexed.")
invIndex.docs.add docName
let docIndex = invIndex.docs.high

var linenum = 0
var count = 0

try:
for line in docPath.lines:
inc linenum
var word = ""
for ch in line:
if ch in WordChars:
word.add ch.toLowerAscii()
else:
if word.len > 1 and word notin StopWords:
invIndex.index.mgetOrPut(word, newSeq[WordRef]()).add (docIndex, linenum)
inc count
word = ""

except IOError:
raise newException(IndexingError, &"{docName}: error while reading file.")

if count > 0:
echo &"{docName}: {count} word{plural(count)} indexed."
else:
raise newException(IndexingError, &"{docName}: nothing to index.")


func status(invIndex: InvertedIndex): string =
## Display information about the inverted index status.
let words = invIndex.index.len
let docs = invIndex.docs.len
&"Index contains {words} word{plural(words)} from {docs} document{plural(docs)}."


proc search(invIndex: InvertedIndex; word: string): SearchResult =
## Search a word in the inverted index.
## The references are grouped by document.

let refs = invIndex.index.getOrDefault(word)
if refs.len == 0: return

result.total = refs.len
var prevdoc = ""
var prevline = 0
for (docnum, linenum) in refs:
let docname = invIndex.docs[docnum]
if docname == prevDoc:
if linenum != prevline:
result.refs[^1].linenums.add(linenum)
prevline = linenum
else:
result.refs.add((docname, @[linenum]))
prevdoc = docname
prevline = linenum


#———————————————————————————————————————————————————————————————————————————————————————————————————

var invertedIndex: InvertedIndex

if paramCount() != 1 or not paramStr(1).dirExists():
quit &"Usage: {getAppFileName().lastPathPart} directory"

# Index the documents.
for doc in walkFiles(paramStr(1) / "*.txt"):
try:
invertedIndex.add(doc)
except IndexingError:
echo getCurrentExceptionMsg()

echo invertedIndex.status()
echo ""

# Enter search interface.
var word: string
while true:
try:
stdout.write "Word to search? "
word = stdin.readLine().toLowerAscii().strip()
if word == "q":
echo "Quitting"
break
if not word.allCharsInSet(WordChars):
echo "This word contains invalid characters."
continue
except EOFError:
echo ""
echo "Quitting"
break

# Search word.
let result = invertedIndex.search(word)
if result.refs.len == 0:
echo "No reference found for this word."
continue

# Display the references.
echo &"Found {result.total} reference{plural(result.total)} to “{word}”:"
for (docname, linenums) in result.refs:
stdout.write &"... in “{docName}”, line{plural(linenums.len)}"
for num in linenums: stdout.write ' ', num
echo ""</syntaxhighlight>

{{out}}
Example of session.
<pre>./inverted_index texts
a.txt: 129 words indexed.
b.txt: 183 words indexed.
c.txt: 16 words indexed.
Index contains 197 words from 3 documents.

Word to search? inverted
Found 23 references to “inverted”:
... in “a.txt”, lines 1 3 5
... in “b.txt”, lines 1 3 5 7
... in “c.txt”, line 1
Word to search? q
Quitting</pre>


=={{header|OCaml}}==
=={{header|OCaml}}==
Line 1,501: Line 2,479:
unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \
unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \
inv.ml
inv.ml

ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo


<lang ocaml>TYPE_CONV_PATH "Inverted_index"
<syntaxhighlight lang="ocaml">TYPE_CONV_PATH "Inverted_index"


type files = string array with sexp
type files = string array with sexp
Line 1,543: Line 2,521:
List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index;
List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index;
List.iter (fun w ->
List.iter (fun w ->
if Hashtbl.mem h w
let from =
try Hashtbl.find h w
then begin
except Not_found -> []
let from = Hashtbl.find h w in
in
Hashtbl.replace h w (i::from)
Hashtbl.replace h w (i::from)
end else
Hashtbl.add h w [i]
) words;
) words;
Hashtbl.fold (fun w from acc -> (w, from) :: acc) h []
Hashtbl.fold (fun w from acc -> (w, from) :: acc) h []
Line 1,585: Line 2,562:
| "--index-file", in_file -> index_file in_file
| "--index-file", in_file -> index_file in_file
| "--search-word", word -> search_word word
| "--search-word", word -> search_word word
| _ -> usage()</lang>
| _ -> usage()</syntaxhighlight>



=={{header|Perl}}==
=={{header|Perl}}==
<lang perl>use Set::Object 'set';
<syntaxhighlight lang="perl">use Set::Object 'set';


# given an array of files, returns the index
# given an array of files, returns the index
Line 1,600: Line 2,576:
foreach my $file (@files)
foreach my $file (@files)
{
{
open(F, "<", $file) or die "Can't read file $file: $!";
open(F, "<", $file) or die "Can't read file $file: $!";
while(<F>) {
while(<F>) {
s/\A\W+//;
s/\A\W+//;
foreach my $w (map {lc} grep {length() >= 3} split /\W+/)
foreach my $w (map {lc} grep {length() >= 3} split /\W+/)
{
{
if ( exists($iindex{$w}) )
if ( exists($iindex{$w}) )
{
{
$iindex{$w}->insert($file);
$iindex{$w}->insert($file);
} else {
} else {
$iindex{$w} = set($file);
$iindex{$w} = set($file);
}
}
}
}
}
}
close(F);
close(F);
}
}
return %iindex;
return %iindex;
Line 1,624: Line 2,600:
my @words = @_;
my @words = @_;
my $res = set();
my $res = set();

foreach my $w (map {lc} @words)
foreach my $w (map {lc} @words)
{
{
$w =~ s/\W+//g; # strip non-words chars
$w =~ s/\W+//g; # strip non-words chars
length $w < 3 and next;
length $w < 3 and next;
exists $idx{$w} or return set();
exists $idx{$w} or return set();
Line 1,642: Line 2,618:
# first arg is a comma-separated list of words to search for
# first arg is a comma-separated list of words to search for
print "$_\n"
print "$_\n"
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</lang>
foreach search_words_with_index({createindex(@ARGV)}, @searchwords);</syntaxhighlight>
=={{header|Perl 6}}==
<lang perl6>sub MAIN (*@files) {
(my %norm).push: do for @files -> $file {
$file X=> slurp($file).lc.words;
}
(my %inv).push: %norm.invert.uniq;


=={{header|Phix}}==
while prompt("Search terms: ").words -> @words {
Loads all text files in demo\rosetta\ and builds a list of filenames and
for @words -> $word {
a dictionary of {word,file_indexes}, before a handful of quick tests.<br>
say "$word => %inv.{$word.lc}";
Might be better (and almost as easy) for the dictionary values to be say
{total_count, {file nos}, {file counts}}.
<!--<syntaxhighlight lang="phix">(notonline)-->
<span style="color: #000080;font-style:italic;">-- demo\rosetta\Inverted_index.exw</span>
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- (file i/o)</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">word_count</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">ch</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">txt</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">=</span><span style="color: #008000;">'\0'</span> <span style="color: #008080;">or</span> <span style="color: #000000;">ch</span><span style="color: #0000FF;">>=</span><span style="color: #000000;">#7F</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #008080;">procedure</span> <span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">name</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">,</span><span style="color: #000000;">name</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]>=</span><span style="color: #008000;">'a'</span> <span style="color: #000080;font-style:italic;">-- skip numbers</span>
<span style="color: #008080;">and</span> <span style="color: #000000;">word</span><span style="color: #0000FF;">[</span><span style="color: #000000;">1</span><span style="color: #0000FF;">]<=</span><span style="color: #008000;">'z'</span> <span style="color: #008080;">then</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- not present</span>
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">})</span>
<span style="color: #000000;">word_count</span> <span style="color: #0000FF;">+=</span> <span style="color: #000000;">1</span>
<span style="color: #008080;">else</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span>
<span style="color: #7060A8;">setd</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">val</span><span style="color: #0000FF;">,</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">procedure</span> <span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">d</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">dir</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"."</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">d</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #008000;">'d'</span><span style="color: #0000FF;">,</span><span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_ATTRIBUTES</span><span style="color: #0000FF;">])</span> <span style="color: #000080;font-style:italic;">-- skip directories</span>
<span style="color: #008080;">and</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_SIZE</span><span style="color: #0000FF;">]<</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span><span style="color: #0000FF;">*</span><span style="color: #000000;">1024</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- and files &gt; 1GB</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">name</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">d</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">][</span><span style="color: #004600;">D_NAME</span><span style="color: #0000FF;">]</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"rb"</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">txt</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">lower</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">))</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">is_ascii</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- skip any bitmaps etc</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">split_any</span><span style="color: #0000FF;">(</span><span style="color: #000000;">txt</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"\0\r\n\t !\"#$%&\'()*+,-./:;&lt;=&gt;?@[\\]^`{|}~"</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">add_words</span><span style="color: #0000FF;">(</span><span style="color: #000000;">name</span><span style="color: #0000FF;">,</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">lookup</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">)</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span> <span style="color: #000080;font-style:italic;">-- indexes to filenames</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">words</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">word</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">words</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">node</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">word</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">node</span><span style="color: #0000FF;">=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #004080;">sequence</span> <span style="color: #000000;">val</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getd_by_index</span><span style="color: #0000FF;">(</span><span style="color: #000000;">node</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">files</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">val</span>
<span style="color: #008080;">else</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">to</span> <span style="color: #000000;">1</span> <span style="color: #008080;">by</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">do</span>
<span style="color: #008080;">if</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">find</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">],</span><span style="color: #000000;">val</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">j</span><span style="color: #0000FF;">..</span><span style="color: #000000;">j</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{}</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)=</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">{}</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">files</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span>
<span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">files</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">]]</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">files</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<span style="color: #000000;">load_directory</span><span style="color: #0000FF;">()</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">word_count</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"load_directory"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should only be this file</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use this</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently four use this</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- currently two use both</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"dir"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"lower"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be the same two</span>
<span style="color: #0000FF;">?</span><span style="color: #000000;">lookup</span><span style="color: #0000FF;">({</span><span style="color: #008000;">"ban"</span><span style="color: #0000FF;">&</span><span style="color: #008000;">"anafish"</span><span style="color: #0000FF;">})</span> <span style="color: #000080;font-style:italic;">-- should be none ({})</span>
<!--</syntaxhighlight>-->
{{out}}
Note the distributed version has been modified and is occasionally used for real, so the output will likely differ.
<pre>
3365
{"Inverted_index.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"AlmostPrime.exw","Inverted_index.exw","RockPaperScissors.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{}
</pre>

=={{header|PHP}}==

<syntaxhighlight lang="php"><?php

function buildInvertedIndex($filenames)
{
$invertedIndex = [];

foreach($filenames as $filename)
{
$data = file_get_contents($filename);

if($data === false) die('Unable to read file: ' . $filename);

preg_match_all('/(\w+)/', $data, $matches, PREG_SET_ORDER);

foreach($matches as $match)
{
$word = strtolower($match[0]);

if(!array_key_exists($word, $invertedIndex)) $invertedIndex[$word] = [];
if(!in_array($filename, $invertedIndex[$word], true)) $invertedIndex[$word][] = $filename;
}
}
}
}

}</lang>
return $invertedIndex;
}

function lookupWord($invertedIndex, $word)
{
return array_key_exists($word, $invertedIndex) ? $invertedIndex[$word] : false;
}

$invertedIndex = buildInvertedIndex2(['file1.txt', 'file2.txt', 'file3.txt']);

foreach(['cat', 'is', 'banana', 'it'] as $word)
{
$matches = lookupWord($invertedIndex, $word);

if($matches !== false)
{
echo "Found the word \"$word\" in the following files: " . implode(', ', $matches) . "\n";
}
else
{
echo "Unable to find the word \"$word\" in the index\n";
}
}</syntaxhighlight>


=={{header|PicoLisp}}==
=={{header|PicoLisp}}==
Line 1,668: Line 2,782:
it is a banana</pre>
it is a banana</pre>
we can read them into a binary tree in the global variable '*MyIndex'
we can read them into a binary tree in the global variable '*MyIndex'
<lang PicoLisp>(off *MyIndex)
<syntaxhighlight lang="picolisp">(off *MyIndex)


(use Word
(use Word
Line 1,682: Line 2,796:
(extract
(extract
'((Word) (val (car (idx '*MyIndex Word))))
'((Word) (val (car (idx '*MyIndex Word))))
(rest) ) ) )</lang>
(rest) ) ) )</syntaxhighlight>
Output:
Output:
<pre>: (searchFor "what" "is" "it")
<pre>: (searchFor "what" "is" "it")
Line 1,692: Line 2,806:
: (searchFor "it" "is")
: (searchFor "it" "is")
-> ("file3" "file2" "file1")</pre>
-> ("file3" "file2" "file1")</pre>

=={{header|PowerShell}}==
{{works with|PowerShell|2}}
<syntaxhighlight lang="powershell">function Index-File ( [string[]]$FileList )
{
# Create index hashtable, as needed
If ( -not $Script:WordIndex ) { $Script:WordIndex = @{} }

# For each file to be indexed...
ForEach ( $File in $FileList )
{
# Find any previously existing entries for this file
$ExistingEntries = $Script:WordIndex.Keys | Where { $Script:WordIndex[$_] -contains $File }

# For any previously existing entries
# Delete them (prior to reindexing the file)
ForEach ( $Key in $ExistingEntries )
{
$Script:WordIndex[$Key] = @( $Script:WordIndex[$Key] | Where { $_ -ne $File } )
}

# Get the contents of the file, split on non-alphanumeric characters, and remove duplicates
$Words = ( Get-Content $File ) -split '[^a-zA-Z\d]' | Sort -Unique

# For each word in the file...
ForEach ( $Word in $Words )
{
# If the entry for the word already exists...
If ( $Script:WordIndex[$Word] )
{
# Add the file name to the entry
$Script:WordIndex[$Word] += $File
}
Else
{
# Create a new entry
$Script:WordIndex[$Word] = @( $File )
}
}
}
}

function Find-Word ( [string]$Word )
{
return $WordIndex[$Word]
}</syntaxhighlight>
<syntaxhighlight lang="powershell"># Populate test files
@'
Files full of
various words.
'@ | Out-File -FilePath C:\Test\File1.txt

@'
Create an index
of words.
'@ | Out-File -FilePath C:\Test\File2.txt

@'
Use the index
to find the files.
'@ | Out-File -FilePath C:\Test\File3.txt</syntaxhighlight>
<syntaxhighlight lang="powershell"># Index files
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt</syntaxhighlight>
Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt.
Alternatively, one could create a more complex custom UI or GUI if desired.
<syntaxhighlight lang="powershell"># Query index
Find-Word files</syntaxhighlight>
{{out}}
<pre>C:\Test\File1.txt
C:\Test\File3.txt</pre>


=={{header|Python}}==
=={{header|Python}}==
Line 1,698: Line 2,882:


First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index.
First the simple inverted index from [[wp:Inverted index|here]] together with an implementation of a search for (multiple) terms from that index.
<lang python>'''
<syntaxhighlight lang="python">'''
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10
'''
'''
Line 1,738: Line 2,922:
terms = ["what", "is", "it"]
terms = ["what", "is", "it"]
print('\nTerm Search for: ' + repr(terms))
print('\nTerm Search for: ' + repr(terms))
pp(sorted(termsearch(terms)))</lang>
pp(sorted(termsearch(terms)))</syntaxhighlight>


'''Sample Output'''
'''Sample Output'''
Line 1,766: Line 2,950:


<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small>
<small>It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results</small>
<lang python>from collections import Counter
<syntaxhighlight lang="python">from collections import Counter




Line 1,817: Line 3,001:
print(ans)
print(ans)
ans = Counter(ans)
ans = Counter(ans)
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</lang>
print(' The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))</syntaxhighlight>


'''Sample Output'''
'''Sample Output'''
Line 1,836: Line 3,020:
['T0.txt', 'T0.txt', 'T2.txt']
['T0.txt', 'T0.txt', 'T2.txt']
The phrase is found most commonly in text: 'T0.txt'</pre>
The phrase is found most commonly in text: 'T0.txt'</pre>

=={{header|Racket}}==
<syntaxhighlight lang="racket">
#!/usr/bin/env racket
#lang racket
(command-line
#:args (term . files)
(define rindex (make-hasheq))
(for ([file files])
(call-with-input-file file
(λ(in) (let loop ()
(define w (regexp-match #px"\\w+" in))
(when w
(let* ([w (bytes->string/utf-8 (car w))]
[w (string->symbol (string-foldcase w))]
[r (hash-ref rindex w '())])
(unless (member file r) (hash-set! rindex w (cons file r)))
(loop)))))))
(define res
(for/list ([w (regexp-match* #px"\\w+" term)])
(list->set (hash-ref rindex (string->symbol (string-foldcase w)) '()))))
(define all (set->list (apply set-intersect res)))
(if (null? all)
(printf "No matching files.\n")
(printf "Terms found at: ~a.\n" (string-join all ", "))))
</syntaxhighlight>
{{out}}
<pre>
$ echo "It is what it is." > F1
$ echo "What is it?" > F2
$ echo "It is a banana." > F3
$ ./search.rkt "what" F?
Terms found at: F1, F2.
$ ./search.rkt "a" F?
Terms found at: F3.
$ ./search.rkt "what a" F?
No matching files.
$ ./search.rkt "what is it" F?
Terms found at: F1, F2.
</pre>

=={{header|Raku}}==
(formerly Perl 6)
{{works with|rakudo|2015-09-16}}
<syntaxhighlight lang="raku" line>sub MAIN (*@files) {
my %norm;
do for @files -> $file {
%norm.push: $file X=> slurp($file).lc.words;
}
(my %inv).push: %norm.invert.unique;

while prompt("Search terms: ").words -> @words {
for @words -> $word {
say "$word => {%inv.{$word.lc}//'(not found)'}";
}
}
}</syntaxhighlight>


=={{header|REXX}}==
=={{header|REXX}}==
Note: In this algorithm, word indices start at 1.
Note: In this algorithm, word indices start at 1.
<br><br>Note: the Burma Shave signs were created from 1930 --&gt; 1951.
<lang rexx>/*REXX program illustrates building a simple inverted index & word find.*/
@.='' /*dictionary of words (so far).*/
!='' /*a list of found words (so far).*/


Note: &nbsp; the Burma Shave signs were created from &nbsp; 1930 ──► 1951 &nbsp; and were common among the rural byways of America.
call invertI 0, 'BURMA0.TXT' /*read file 0 ... */
call invertI 1, 'BURMA1.TXT' /* " " 1 ... */
call invertI 2, 'BURMA2.TXT' /* " " 2 ... */
call invertI 3, 'BURMA3.TXT' /* " " 3 ... */
call invertI 4, 'BURMA4.TXT' /* " " 4 ... */
call invertI 5, 'BURMA5.TXT' /* " " 5 ... */
call invertI 6, 'BURMA6.TXT' /* " " 6 ... */
call invertI 7, 'BURMA7.TXT' /* " " 7 ... */
call invertI 8, 'BURMA8.TXT' /* " " 8 ... */
call invertI 9, 'BURMA9.TXT' /* " " 9 ... */


To see more about Burma Shave signs, see the Wikipedia entry: &nbsp; [http://en.wikipedia.org/wiki/Burma-Shave Burma Shave signs.]
call findAword 'does' /*find a word. */
<syntaxhighlight lang="rexx">/*REXX program illustrates building a simple inverted index and a method of word find.*/
call findAword '60' /*find another word. */
call findAword "don't" /*and find another word. */
@.= /*a dictionary of words (so far). */
call findAword "burma-shave" /*and find yet another word. */
!= /*a list of found words (so far). */
exit /*stick a fork in it, we're done.*/
call invertI 0, 'BURMA0.TXT' /*read the file: BURMA0.TXT ··· */
call invertI 1, 'BURMA1.TXT' /* " " " BURMA1.TXT ··· */
/*──────────────────────────────────FINDAWORD subroutine────────────────*/
call invertI 2, 'BURMA2.TXT' /* " " " BURMA2.TXT ··· */
findAword: procedure expose @. /*get A word, and uppercase it. */
parse arg ox; arg x /*OX= word; X= uppercase version*/
call invertI 3, 'BURMA3.TXT' /* " " " BURMA3.TXT ··· */
call invertI 4, 'BURMA4.TXT' /* " " " BURMA4.TXT ··· */
_=@.x
call invertI 5, 'BURMA5.TXT' /* " " " BURMA5.TXT ··· */
oxo='───'ox"───"
call invertI 6, 'BURMA6.TXT' /* " " " BURMA6.TXT ··· */
if _=='' then do
call invertI 7, 'BURMA7.TXT' /* " " " BURMA7.TXT ··· */
say 'word' oxo "not found."
call invertI 8, 'BURMA8.TXT' /* " " " BURMA8.TXT ··· */
return 0
call invertI 9, 'BURMA9.TXT' /* " " " BURMA9.TXT ··· */
end
_@=_ /*save _, pass it back to invoker*/
call findAword "huz" /*find a word. */
call findAword "60" /*find another word. */
say 'word' oxo "found in:"
do until _==''; parse var _ f w _; say
call findAword "don't" /*and find another word. */
say ' file='f ' word='w
call findAword "burma-shave" /*and find yet another word. */
end /*until ... */
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
return _@
findAword: procedure expose @.; arg x /*get an uppercase version of the X arg*/
/*─────────────────────────────────────INVERTI subroutine───────────────*/
invertI: procedure expose @. !; parse arg #,fn /*file#, filename*/
parse arg ox /*get original (as-is) value of X arg.*/
_=@.x; oxo='───'ox"───"
call lineout fn /*close the file, just in case. */
if _=='' then do
w=0 /*number of words so far. */
say 'word' oxo "not found."
return 0
end
_@=_ /*save _ text, pass it back to invoker.*/
say 'word' oxo "found in:"
do until _==''; parse var _ f w _
say ' file='f " word="w
end /*until ··· */
return _@
/*──────────────────────────────────────────────────────────────────────────────────────*/
invertI: procedure expose @. !; parse arg #,fn /*the file number and the filename. */
call lineout fn /*close the file, ··· just in case. */
w=0 /*the number of words found (so far). */
do while lines(fn)\==0 /* [↓] process the entire file. */
_=space( linein(fn) ) /*read a line, elide superfluous blanks*/
if _=='' then iterate /*if a blank record, then ignore it. */
say 'file' #", record:" _ /*display the record ──► terminal. */


do while lines(fn)\==0 /*process the entire file (below)*/
do until _=='' /*pick off words from record until done*/
_=space(linein(fn)) /*read 1 line, elide extra blanks*/
parse upper var _ ? _ /*pick off a word (it's in uppercase).*/
if _=='' then iterate /*if blank record, then ignore it*/
?=stripper(?) /*strip any trailing punctuation. */
say 'file' #",record="_ /*echo a record, just to be verbose.*/
if ?='' then iterate /*is the word now all blank (or null)? */
w=w+1 /*bump the word counter (index). */
@.?=@.? # w /*append the new word to a list. */
if wordpos(?,!)==0 then !=! ? /*add it to the list of words found. */
end /*until ··· */
end /*while ··· */
say; call lineout fn /*close the file, just to be neat&safe.*/
return w /*return the index of word in record. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
stripper: procedure; parse arg q /*remove punctuation at the end of word*/
@punctuation= '.,:;?¿!¡∙·'; do j=1 for length(@punctuation)
q=strip(q, 'T', substr(@punctuation, j, 1) )
end /*j*/
return q</syntaxhighlight>
{{out|output|text=&nbsp; when using the default inputs:}}


Output note: &nbsp; The Burma-Shave company was shocked at the number of automobile fenders mailed to the company.
do until _=='' /*pick off words until done. */
parse upper var _ xxx _ /*pick off a word (uppercased). */
xxx=stripper(xxx) /*strip any ending punctuation. */
if xxx='' then iterate /*is the word now blank (null) ? */
w=w+1 /*bump the word counter. */
@.xxx=@.xxx # w
if wordpos(xxx,!)==0 then !=! xxx /*add to THE list of words found.*/
end /*until ... */
end /*while lines(fn)¬==0*/


All fender senders received a half-pound jar of Burma-Shave.
say; call lineout fn /*close the file, just to be neat*/
<pre style="height:87ex">
return w /*return the index of the word. */
file 0, record: Rip a fender
/*─────────────────────────────────────STRIPPER subroutine──────────────*/
file 0, record: Off your Car
stripper: procedure; parse arg q /*remove punctuation at word-end.*/
file 0, record: Send it in
@punctuation='.,:;?¿!¡' /*serveral punctuation marks. */
file 0, record: For a half-pound jar
do j=1 for length(@punctuation)
file 0, record: Burma-Shave
q=strip(q,'T',substr(@punctuation,j,1))
end /*j*/
return q</lang>
'''output'''
<pre style="height:30ex;overflow:scroll">
file 0,record=Rip a fender
file 0,record=off your car
file 0,record=send it in
file 0,record=for a half-pound jar
file 0,record=Burma-shave


file 1,record=A peach
file 1, record: A peach
file 1,record=looks good
file 1, record: Looks good
file 1,record=with lots of fuzz
file 1, record: With lots of fuzz
file 1,record=but a man's no peach
file 1, record: Man's no peach
file 1,record=and never was
file 1, record: And never was
file 1,record=Burma-shave
file 1, record: Burma-Shave


file 2,record=Does your husband
file 2, record: Does your husband
file 2,record=misbehave
file 2, record: Misbehave
file 2,record=grunt and grumble
file 2, record: Grunt and grumble
file 2,record=rant and rave ?
file 2, record: Rant and rave ?
file 2,record=shoot the brute some
file 2, record: Shoot the brute some
file 2,record=Burma-shave
file 2, record: Burma-Shave


file 3,record=Don't take a curve
file 3, record: Don't take a curve
file 3,record=at 60 per
file 3, record: At 60 per
file 3,record=we hate to lose
file 3, record: We hate to lose
file 3,record=a customer
file 3, record: A customer
file 3,record=Burma-shave
file 3, record: Burma-Shave


file 4,record=Every shaver
file 4, record: Every shaver
file 4,record=now can snore
file 4, record: Now can snore
file 4,record=six more minutes
file 4, record: Six more minutes
file 4,record=than before
file 4, record: Than before
file 4,record=by using
file 4, record: By using
file 4,record=Burma-shave
file 4, record: Burma-Shave


file 5,record=He played
file 5, record: He played
file 5,record=a sax
file 5, record: a sax
file 5,record=had no B.O.
file 5, record: Had no B.O.
file 5,record=but his whiskers scratched
file 5, record: But his whiskers scratched
file 5,record=so they let him go
file 5, record: So she let him go
file 5,record=Burma-shave
file 5, record: Burma-Shave


file 6,record=Henry the Eighth
file 6, record: Henry the Eighth
file 6,record=Prince of Friskers
file 6, record: Prince of Friskers
file 6,record=lost five wives
file 6, record: Lost five wives
file 6,record=but kept his whiskers
file 6, record: But kept his whiskers
file 6,record=Burma-shave
file 6, record: Burma-Shave


file 7,record=Listen, birds
file 7, record: Listen birds
file 7,record=those signs cost
file 7, record: These signs cost
file 7,record=money
file 7, record: Money
file 7,record=so roost a while but
file 7, record: So roost a while
file 7,record=don't get funny
file 7, record: But don't get funny
file 7,record=Burma-shave
file 7, record: Burma-Shave


file 8,record=My man
file 8, record: My man
file 8,record=won't shave
file 8, record: Won't shave
file 8,record=sez Hazel Huz
file 8, record: Sez Hazel Huz
file 8,record=but I should worry
file 8, record: But I should worry
file 8,record=Dora's does
file 8, record: Dora's does
file 8,record=Burma-shave
file 8, record: Burma-Shave


file 9,record=Past schoolhouses
file 9, record: Past
file 9,record=take it slow
file 9, record: Schoolhouses
file 9,record=let the little
file 9, record: Take it slow
file 9,record=shavers
file 9, record: Let the little
file 9,record=grow
file 9, record: Shavers grow
file 9,record=Burma-shave
file 9, record: Burma-Shave

word ───does─── found in:
file=2 word=1
file=8 word=13


word ───huz─── found in:
file=8 word=7
word ───60─── found in:
word ───60─── found in:
file=3 word=6
file=3 word=6

word ───don't─── found in:
word ───don't─── found in:
file=3 word=1
file=3 word=1
file=7 word=12
file=7 word=12

word ───burma-shave─── found in:
word ───burma-shave─── found in:
file=0 word=14
file=0 word=14
file=1 word=17
file=1 word=15
file=2 word=15
file=2 word=15
file=3 word=14
file=3 word=14
file=4 word=13
file=4 word=13
file=5 word=17
file=5 word=17
file=6 word=14
file=6 word=14
file=7 word=15
file=7 word=15
file=8 word=14
file=8 word=14
file=9 word=11
file=9 word=11
</pre>
</pre>


Line 2,002: Line 3,240:


'''indexmerge.rb'''
'''indexmerge.rb'''
<lang ruby>if File.exist? "index.dat"
<syntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
@data = Marshal.load open("index.dat")
else
else
Line 2,030: Line 3,268:
open("index.dat", "w") do |index|
open("index.dat", "w") do |index|
index.write Marshal.dump(@data)
index.write Marshal.dump(@data)
end</lang>
end</syntaxhighlight>


'''indexsearch.rb'''
'''indexsearch.rb'''
<lang ruby>if File.exist? "index.dat"
<syntaxhighlight lang="ruby">if File.exist? "index.dat"
@data = Marshal.load open("index.dat")
@data = Marshal.load open("index.dat")
else
else
Line 2,054: Line 3,292:
end
end


p @result</lang>
p @result</syntaxhighlight>


'''Output'''
'''Output'''
Line 2,065: Line 3,303:
> ./indexsearch.rb It iS\!
> ./indexsearch.rb It iS\!
["file1", "file2", "file3"]</pre>
["file1", "file2", "file3"]</pre>

=={{header|Rust}}==
<syntaxhighlight lang="rust">// Part 1: Inverted index structure

use std::{
borrow::Borrow,
collections::{BTreeMap, BTreeSet},
};

#[derive(Debug, Default)]
pub struct InvertedIndex<T> {
indexed: BTreeMap<String, BTreeSet<usize>>,
sources: Vec<T>,
}

impl<T> InvertedIndex<T> {
pub fn add<I, V>(&mut self, source: T, tokens: I)
where
I: IntoIterator<Item = V>,
V: Into<String>,
{
let source_id = self.sources.len();
self.sources.push(source);
for token in tokens {
self.indexed
.entry(token.into())
.or_insert_with(BTreeSet::new)
.insert(source_id);
}
}

pub fn search<'a, I, K>(&self, tokens: I) -> impl Iterator<Item = &T>
where
String: Borrow<K>,
K: Ord + ?Sized + 'a,
I: IntoIterator<Item = &'a K>,
{
let mut tokens = tokens.into_iter();

tokens
.next()
.and_then(|token| self.indexed.get(token).cloned())
.and_then(|first| {
tokens.try_fold(first, |found, token| {
self.indexed
.get(token)
.map(|sources| {
found
.intersection(sources)
.cloned()
.collect::<BTreeSet<_>>()
})
.filter(|update| !update.is_empty())
})
})
.unwrap_or_else(BTreeSet::new)
.into_iter()
.map(move |source| &self.sources[source])
}

pub fn tokens(&self) -> impl Iterator<Item = &str> {
self.indexed.keys().map(|it| it.as_str())
}

pub fn sources(&self) -> &[T] {
&self.sources
}
}

// Part 2: File walking and processing

use std::{
ffi::OsString,
fmt::{Debug, Display},
fs::{read_dir, DirEntry, File, ReadDir},
io::{self, stdin, Read},
path::{Path, PathBuf},
};

#[derive(Debug)]
pub struct Files {
dirs: Vec<ReadDir>,
}

impl Files {
pub fn walk<P: AsRef<Path>>(path: P) -> io::Result<Self> {
Ok(Files {
dirs: vec![read_dir(path)?],
})
}
}

impl Iterator for Files {
type Item = DirEntry;

fn next(&mut self) -> Option<Self::Item> {
'outer: while let Some(mut current) = self.dirs.pop() {
while let Some(entry) = current.next() {
if let Ok(entry) = entry {
let path = entry.path();
if !path.is_dir() {
self.dirs.push(current);
return Some(entry);
} else if let Ok(dir) = read_dir(path) {
self.dirs.push(current);
self.dirs.push(dir);
continue 'outer;
}
}
}
}

None // No directory left
}
}

fn tokenize<'a>(input: &'a str) -> impl Iterator<Item = String> + 'a {
input
.split(|c: char| !c.is_alphanumeric())
.filter(|token| !token.is_empty())
.map(|token| token.to_lowercase())
}

fn tokenize_file<P: AsRef<Path>>(path: P) -> io::Result<BTreeSet<String>> {
let mut buffer = Vec::new();
File::open(path)?.read_to_end(&mut buffer)?;
let text = String::from_utf8_lossy(&buffer);
Ok(tokenize(&text).collect::<BTreeSet<_>>())
}

fn tokenize_query(input: &str) -> Vec<String> {
let result = tokenize(input).collect::<BTreeSet<_>>();
// Make a vector sorted by length, so that longer tokens are processed first.
// This heuristics should narrow the resulting set faster.
let mut result = result.into_iter().collect::<Vec<_>>();
result.sort_by_key(|item| usize::MAX - item.len());
result
}

// Part 3: Interactive application

fn args() -> io::Result<(OsString, BTreeSet<OsString>)> {
let mut args = std::env::args_os().skip(1); // Skip the executable's name

let path = args
.next()
.ok_or_else(|| io::Error::new(io::ErrorKind::Other, "missing path"))?;

let extensions = args.collect::<BTreeSet<_>>();

Ok((path, extensions))
}

fn print_hits<'a, T>(hits: impl Iterator<Item = T>)
where
T: Display,
{
let mut found_none = true;
for (number, hit) in hits.enumerate() {
println!(" [{}] {}", number + 1, hit);
found_none = false;
}

if found_none {
println!("(none)")
}
}

fn main() -> io::Result<()> {
let (path, extensions) = args()?;
let mut files = InvertedIndex::<PathBuf>::default();
let mut content = InvertedIndex::<PathBuf>::default();

println!(
"Indexing {:?} files in '{}'",
extensions,
path.to_string_lossy()
);

for path in Files::walk(path)?.map(|file| file.path()).filter(|path| {
path.extension()
.filter(|&ext| extensions.is_empty() || extensions.contains(ext))
.is_some()
}) {
files.add(path.clone(), tokenize(&path.to_string_lossy()));

match tokenize_file(&path) {
Ok(tokens) => content.add(path, tokens),
Err(e) => eprintln!("Skipping a file {}: {}", path.display(), e),
}
}

println!(
"Indexed {} tokens in {} files.",
content.tokens().count(),
content.sources.len()
);

// Run the query UI loop
let mut query = String::new();

loop {
query.clear();
println!("Enter search query:");
if stdin().read_line(&mut query).is_err() || query.trim().is_empty() {
break;
}

match query.trim() {
"/exit" | "/quit" | "" => break,

"/tokens" => {
println!("Tokens:");
for token in content.tokens() {
println!("{}", token);
}

println!();
}

"/files" => {
println!("Sources:");
for source in content.sources() {
println!("{}", source.display());
}

println!();
}

_ => {
let query = tokenize_query(&query);
println!();
println!("Found hits:");
print_hits(content.search(query.iter()).map(|it| it.display()));
println!("Found file names:");
print_hits(files.search(query.iter()).map(|it| it.display()));
println!();
}
}
}

Ok(())
}</syntaxhighlight>

{{out}}
<pre>C:\Temp>inverted_index books html htm txt
Indexing {"htm", "html", "txt"} files in 'books'
Indexed 34810 tokens in 9 files.
Enter search query:
war

Found hits:
[1] books\EN\C\Chaucer, Geoffrey - Canterbury Tales.txt
[2] books\EN\H\Homer - The Odyssey.txt
[3] books\EN\O\Orwell, George - 1984.html
[4] books\EN\W\Wells, Herbert George - The Invisible Man.txt
[5] books\EN\W\Wells, Herbert George - The Island of Doctor Moreau.txt
[6] books\EN\W\Wells, Herbert George - The Time Machine.txt
[7] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
[1] books\EN\W\Wells, Herbert George - War of the Worlds.txt

Enter search query:
war gun

Found hits:
[1] books\EN\O\Orwell, George - 1984.html
[2] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
(none)

Enter search query:
/exit

C:\Temp></pre>

=={{header|Scala}}==
<syntaxhighlight lang="scala">object InvertedIndex extends App {
import java.io.File

// indexer
val WORD = raw"(\w+)".r
def parse(s: String) = WORD.findAllIn(s).map(_.toString.toLowerCase)
def invertedIndex(files: Seq[File]): Map[String,Set[File]] = {
var i = Map[String,Set[File]]() withDefaultValue Set.empty
files.foreach{f => scala.io.Source.fromFile(f).getLines flatMap parse foreach
(w => i = i + (w -> (i(w) + f)))}
i
}

// user interface
args match {
case _ if args.length < 2 => println("Usage: InvertedIndex ALLSEARCHWORDS FILENAME...")
case Array(searchwords, filenames @ _*) =>
val queries = parse(searchwords).toList
val files = filenames.map(new File(_)).filter{f => if (!f.exists) println(s"Ignoring $f"); f.exists}
(queries, files) match {
case (q, _) if q.isEmpty => println("Missing search words")
case (_, f) if f.isEmpty => println("Missing extant files")
case _ => val index = invertedIndex(files)
println(s"""Searching for ${queries map ("\""+_+"\"") mkString " and "} in ${files.size} files:""")
queries.map(index).foldLeft(files.toSet)(_ intersect _) match {
case m if m.isEmpty => println("No matching files")
case m => println(m mkString "\n")
}
}
}
}</syntaxhighlight>
{{out}}
<pre>> InvertedIndex "the" file1.txt file2.txt file3.txt
Searching for "the" in 3 files:
data/file1.txt
data/file2.txt
data/file3.txt

> InvertedIndex "the cat sat" file1.txt file2.txt file3.txt
Searching for "the" and "cat" and "sat" in 3 files:
file1.txt
file2.txt

> InvertedIndex fox file1.txt file2.txt file3.txt
Searching for "fox" in 3 files:
file3.txt

> InvertedIndex abc file1.txt file2.txt file3.txt
Searching for "abc" in 3 files:
No matching files</pre>


=={{header|Tcl}}==
=={{header|Tcl}}==
<lang tcl>package require Tcl 8.5
<syntaxhighlight lang="tcl">package require Tcl 8.5
proc wordsInString str {
proc wordsInString str {
# We define "words" to be "maximal sequences of 'word' characters".
# We define "words" to be "maximal sequences of 'word' characters".
Line 2,085: Line 3,650:
array set localidx {}
array set localidx {}
foreach word [wordsInString $data] {
foreach word [wordsInString $data] {
lappend localidx($word) $i
lappend localidx($word) $i
incr i
incr i
}
}


# Transcribe into global index
# Transcribe into global index
foreach {word places} [array get localidx] {
foreach {word places} [array get localidx] {
dict set index($word) $filename $places
dict set index($word) $filename $places
}
}
}
}
Line 2,099: Line 3,664:
global index
global index
if {[info exists index($word)]} {
if {[info exists index($word)]} {
return [dict keys $index($word)]
return [dict keys $index($word)]
}
}
}
}
Line 2,107: Line 3,672:
set files [findFilesForWord [lindex $words 0]]
set files [findFilesForWord [lindex $words 0]]
foreach w [lrange $words 1 end] {
foreach w [lrange $words 1 end] {
set wf [findFilesForWord $w]
set wf [findFilesForWord $w]
set newfiles {}
set newfiles {}
foreach f $files {
foreach f $files {
if {$f in $wf} {lappend newfiles $f}
if {$f in $wf} {lappend newfiles $f}
}
}
set files $newfiles
set files $newfiles
}
}
return $files
return $files
Line 2,122: Line 3,687:
set files {}
set files {}
foreach w $words {
foreach w $words {
if {![info exist index($w)]} {
if {![info exist index($w)]} {
return
return
}
}
}
}
dict for {file places} $index([lindex $words 0]) {
dict for {file places} $index([lindex $words 0]) {
if {$file in $files} continue
if {$file in $files} continue
foreach start $places {
foreach start $places {
set gotStart 1
set gotStart 1
foreach w [lrange $words 1 end] {
foreach w [lrange $words 1 end] {
incr start
incr start
set gotNext 0
set gotNext 0
foreach {f ps} $index($w) {
foreach {f ps} $index($w) {
if {$f ne $file} continue
if {$f ne $file} continue
foreach p $ps {
foreach p $ps {
if {$p == $start} {
if {$p == $start} {
set gotNext 1
set gotNext 1
break
break
}
}
}
}
if {$gotNext} break
if {$gotNext} break
}
}
if {!$gotNext} {
if {!$gotNext} {
set gotStart 0
set gotStart 0
break
break
}
}
}
}
if {$gotStart} {
if {$gotStart} {
lappend files $file
lappend files $file
break
break
}
}
}
}
}
}
return $files
return $files
}</lang>
}</syntaxhighlight>
For the GUI:
For the GUI:
<lang tcl>package require Tk
<syntaxhighlight lang="tcl">package require Tk
pack [labelframe .files -text Files] -side left -fill y
pack [labelframe .files -text Files] -side left -fill y
pack [listbox .files.list -listvariable files]
pack [listbox .files.list -listvariable files]
Line 2,165: Line 3,730:
pack [entry .found.entry -textvariable terms] -fill x
pack [entry .found.entry -textvariable terms] -fill x
pack [button .found.findAll -command FindAll \
pack [button .found.findAll -command FindAll \
-text "Find File with All"] -side left
-text "Find File with All"] -side left
pack [button .found.findSeq -command FindSeq \
pack [button .found.findSeq -command FindSeq \
-text "Find File with Sequence"] -side right
-text "Find File with Sequence"] -side right


# The actions invoked by various GUI buttons
# The actions invoked by various GUI buttons
Line 2,174: Line 3,739:
set f [tk_getOpenFile]
set f [tk_getOpenFile]
if {$f ne ""} {
if {$f ne ""} {
addDocumentToIndex $f
addDocumentToIndex $f
lappend files $f
lappend files $f
}
}
}
}
Line 2,183: Line 3,748:
set fs [findFilesWithAllWords $words]
set fs [findFilesWithAllWords $words]
lappend found "Searching for files with all $terms" {*}$fs \
lappend found "Searching for files with all $terms" {*}$fs \
"---------------------"
"---------------------"
}
}
proc FindSeq {} {
proc FindSeq {} {
Line 2,190: Line 3,755:
set fs [findFilesWithWordSequence $words]
set fs [findFilesWithWordSequence $words]
lappend found "Searching for files with \"$terms\"" {*}$fs \
lappend found "Searching for files with \"$terms\"" {*}$fs \
"---------------------"
"---------------------"
}</lang>
}</syntaxhighlight>


=={{header|TUSCRIPT}}==
=={{header|TUSCRIPT}}==
<lang tuscript>
<syntaxhighlight lang="tuscript">
$$ MODE TUSCRIPT
$$ MODE TUSCRIPT


Line 2,225: Line 3,790:
ENDLOOP
ENDLOOP
PRINT "-> ",files
PRINT "-> ",files
</syntaxhighlight>
</lang>
Output:
Output:
<pre>
<pre>
Line 2,242: Line 3,807:
{{works with|ksh93}}
{{works with|ksh93}}


<lang bash>#!/bin/ksh
<syntaxhighlight lang="bash">#!/bin/ksh


typeset -A INDEX
typeset -A INDEX
Line 2,266: Line 3,831:
(( count == $# )) && echo $file
(( count == $# )) && echo $file
done
done
}</lang>
}</syntaxhighlight>


Example use:
Example use:
<lang korn>index *.txt
<syntaxhighlight lang="korn">index *.txt
search hello world
search hello world
</syntaxhighlight>
</lang>


===Directory on filesystem===
===Directory on filesystem===
Line 2,281: Line 3,846:
* Add note about slowness.
* Add note about slowness.


<lang bash>#!/bin/sh
<syntaxhighlight lang="bash">#!/bin/sh
# index.sh - create an inverted index
# index.sh - create an inverted index


Line 2,290: Line 3,855:
# the record separator for $INDEX/all.tab).
# the record separator for $INDEX/all.tab).
for file in "$@"; do
for file in "$@"; do
# Use printf(1), not echo, because "$file" might start with
# Use printf(1), not echo, because "$file" might start with
# a hyphen and become an option to echo.
# a hyphen and become an option to echo.
test 0 -eq $(printf %s "$file" | wc -l) || {
test 0 -eq $(printf %s "$file" | wc -l) || {
printf '%s\n' "$file: newline in filename" >&2
printf '%s\n' "$file: newline in filename" >&2
exit 1
exit 1
}
}
done
done


Line 2,304: Line 3,869:
fi=1
fi=1
for file in "$@"; do
for file in "$@"; do
printf %s "Indexing $file." >&2
printf %s "Indexing $file." >&2


# all.tab maps $fi => $file
# all.tab maps $fi => $file
echo "$fi $file" >> "$INDEX/all.tab"
echo "$fi $file" >> "$INDEX/all.tab"


# Use punctuation ([:punct:]) and whitespace (IFS)
# Use punctuation ([:punct:]) and whitespace (IFS)
# to split tokens.
# to split tokens.
ti=1
ti=1
tr -s '[:punct:]' ' ' < "$file" | while read line; do
tr -s '[:punct:]' ' ' < "$file" | while read line; do
for token in $line; do
for token in $line; do
# Index token by position ($fi, $ti). Ignore
# Index token by position ($fi, $ti). Ignore
# error from mkdir(1) if directory exists.
# error from mkdir(1) if directory exists.
mkdir "$INDEX/$token" 2>/dev/null
mkdir "$INDEX/$token" 2>/dev/null
echo $ti >> "$INDEX/$token/$fi"
echo $ti >> "$INDEX/$token/$fi"
: $((ti += 1))
: $((ti += 1))


# Show progress. Print a dot per 1000 tokens.
# Show progress. Print a dot per 1000 tokens.
case "$ti" in
case "$ti" in
*000) printf .
*000) printf .
esac
esac
done
done
done
done


echo >&2
echo >&2
: $((fi += 1))
: $((fi += 1))
done</lang>
done</syntaxhighlight>


<lang bash>#!/bin/sh
<syntaxhighlight lang="bash">#!/bin/sh
# search.sh - search an inverted index
# search.sh - search an inverted index


Line 2,339: Line 3,904:
want=sequence
want=sequence
while getopts aos name; do
while getopts aos name; do
case "$name" in
case "$name" in
a) want=all;;
a) want=all;;
o) want=one;;
o) want=one;;
s) want=sequence;;
s) want=sequence;;
*) exit 2;;
*) exit 2;;
esac
esac
done
done
shift $((OPTIND - 1))
shift $((OPTIND - 1))


all() {
all() {
echo "TODO"
echo "TODO"
exit 2
exit 2
}
}


one() {
one() {
echo "TODO"
echo "TODO"
exit 2
exit 2
}
}


sequence() {
sequence() {
echo "TODO"
echo "TODO"
exit 2
exit 2
}
}


$want "$@"</lang>
$want "$@"</syntaxhighlight>

=={{header|Wren}}==
{{trans|Kotlin}}
{{libheader|Wren-ioutil}}
{{libheader|Wren-pattern}}
{{libheader|Wren-str}}
<syntaxhighlight lang="wren">import "./ioutil" for FileUtil, Input
import "./pattern" for Pattern
import "./str" for Str
import "os" for Process

var invIndex = {}
var fileNames = []
var splitter = Pattern.new("+1/W")

class Location {
construct new(fileName, wordNum) {
_fileName = fileName
_wordNum = wordNum
}

toString { "%(_fileName), word number %(_wordNum)" }
}

var indexFile = Fn.new { |fileName|
if (fileNames.contains(fileName)) {
System.print("'%(fileName)' already indexed")
return
}
fileNames.add(fileName)
var lines = FileUtil.readLines(fileName)
lines.each { |line|
line = Str.lower(line)
var i = 0
for (w in splitter.splitAll(line)) {
var locations = invIndex[w]
if (!locations) {
locations = []
invIndex[w] = locations
}
locations.add(Location.new(fileName, i + 1))
i = i + 1
}
}
System.print("'%(fileName)' has been indexed")
}

var findWord = Fn.new { |word|
var w = Str.lower(word)
var locations = invIndex[w]
if (locations) {
System.print("\n'%(word)' found in the following locations:")
System.print(locations.map { |l| " %(l)" }.join("\n"))
} else {
System.print("\n'%(word)' not found")
}
System.print()
}

// files to be indexed entered as command line arguments
var args = Process.arguments
if (args.count == 0) {
System.print("No file names have been supplied")
return
}
for (arg in args) indexFile.call(arg)
System.print("\nEnter word(s) to be searched for in these files or 'q' to quit")
while (true) {
var word = Input.text(" ? : ", 1)
if (word == "q" || word == "Q") return
findWord.call(word)
}</syntaxhighlight>

{{out}}
Uses the same files as the Kotlin example:
<pre>
$ wren-cli inverted_index.wren inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed

Enter word(s) to be searched for in these files or 'q' to quit
? : cat

'cat' not found

? : is

'is' found in the following locations:
inv1.txt, word number 2
inv1.txt, word number 5
inv2.txt, word number 2
inv3.txt, word number 2

? : banana

'banana' found in the following locations:
inv3.txt, word number 4

? : it

'it' found in the following locations:
inv1.txt, word number 1
inv1.txt, word number 4
inv2.txt, word number 3
inv3.txt, word number 1

? : what

'what' found in the following locations:
inv1.txt, word number 3
inv2.txt, word number 1

? : a

'a' found in the following locations:
inv3.txt, word number 3

? : q
</pre>

Latest revision as of 11:44, 11 December 2023

Task
Inverted index
You are encouraged to solve this task according to the task description, using any language you may know.

An Inverted Index is a data structure used to create full text search.


Task

Given a set of text files, implement a program to create an inverted index.

Also create a user interface to do a search using that inverted index which returns a list of files that contain the query term / terms.

The search index can be in memory.

11l

Translation of: D
DefaultDict[String, Set[String]] index

F parse_file(fname, fcontents)
   L(word) fcontents.split(re:‘\W’)
      :index[word.lowercase()].add(fname)

L(fname, fcontents) [(‘inv1.txt’, ‘It is what it is.’),
                     (‘inv2.txt’, ‘What is it?’),
                     (‘inv3.txt’, ‘It is a banana!’)]
   parse_file(fname, fcontents)

L(w) [‘cat’, ‘is’, ‘banana’, ‘it’, ‘what’]
   print("\nEnter a word to search for: (q to quit): "w)
   I w C index
      print(‘'#.' found in #..’.format(w, sorted(Array(index[w]))))
   E
      print(‘'#.' not found.’.format(w))
Output:

Enter a word to search for: (q to quit): cat
'cat' not found.

Enter a word to search for: (q to quit): is
'is' found in [inv1.txt, inv2.txt, inv3.txt].

Enter a word to search for: (q to quit): banana
'banana' found in [inv3.txt].

Enter a word to search for: (q to quit): it
'it' found in [inv1.txt, inv2.txt, inv3.txt].

Enter a word to search for: (q to quit): what
'what' found in [inv1.txt, inv2.txt].

Ada

Main program

Here is the main program (file inverted_index.adb):

with Ada.Text_IO, Generic_Inverted_Index, Ada.Strings.Hash, Parse_Lines;
use Ada.Text_IO;

procedure Inverted_Index is

   type Process_Word is access procedure (Word: String);

   package Inv_Idx is new Generic_Inverted_Index
     (Source_Type => String,
      Item_Type   => String,
      Hash        => Ada.Strings.Hash);

   use Inv_Idx;

   procedure Output(Sources: Source_Vecs.Vector) is
      Any_Output: Boolean := False;

      procedure Print_Source(S: String) is
      begin
         if not Any_Output then -- this is the first source found
            Put("Found in the following files: ");
            Any_Output := True;
         else -- there has been at least one source before
            Put(", ");
         end if;
         Put(S);
      end Print_Source;

      procedure Print is new Inv_Idx.Iterate(Print_Source);

   begin
      Print(Sources);
      if not Any_Output then
         Put("I did not find this in any of the given files!");
      end if;
      New_Line(2);
   end Output;

   procedure Read_From_File(Table: in out Storage_Type;
                            Filename: String) is
      F: File_Type;

      procedure Enter_Word(S: String) is
      begin
         Table.Store(Source => Filename,  Item => S);
      end Enter_Word;

      procedure Store_Words is new
        Parse_Lines.Iterate_Words(Parse_Lines.Word_Pattern, Enter_Word);

   begin
      Open(File => F, Mode => In_File, Name => Filename);
       while not End_Of_File(F) loop
          Store_Words(Get_Line(F));
      end loop;
      Close(F);
   exception
      when others =>
         Put_Line("Something wrong with File '" & Filename & "'");
         Put_Line("I'll ignore this!");
   end Read_From_File;

   procedure Read_Files(Tab: out Storage_Type; Line: in String) is

      procedure Read_File(S: String) is
      begin
         Read_From_File(Tab, S);
      end Read_File;

      procedure Read_All is new
        Parse_Lines.Iterate_Words(Parse_Lines.Filename_Pattern, Read_File);

   begin
      Read_All(Line);
   end Read_Files;

   S: Storage_Type;
   Done: Boolean := False;

begin
   Put_Line("Enter Filenames:");
   Read_Files(S, Get_Line);
   New_Line;

   while not Done loop
      Put_Line("Enter one or more words to search for; <return> to finish:");
      declare
         Words: String := Get_Line;
         First: Boolean := True;
         Vec: Source_Vecs.Vector := Source_Vecs.Empty_Vector;

         procedure Compute_Vector(Item: String) is
         begin
            if First then
               Vec := S.Find(Item);
               First := False;
            else
               Vec := Vec and S.Find(Item);
            end if;
         end Compute_Vector;

         procedure Compute is new
           Parse_Lines.Iterate_Words(Parse_Lines.Word_Pattern, Compute_Vector);

      begin
         if Words = "" then
            Done := True;
         else
            Compute(Words);
            Output(Vec);
         end if;
      end;
   end loop;
end Inverted_Index;

A sample output:

Enter Filenames:
0.txt 1.txt 2.txt

Enter one or more words to search for; <return> to finish:
it
Found in the following files: 0.txt, 1.txt, 2.txt

Enter one or more words to search for; <return> to finish:
that
I did not find this in any of the given files!

Enter one or more words to search for; <return> to finish:
what is it
Found in the following files: 0.txt, 1.txt

Enter one or more words to search for; <return> to finish:

Package Generic_Inverted_Index

The real work is actually done in the package Generic_Inverted_Index. Here is the specification (file generic_inverted_index.ads):

with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;

generic
   type Source_Type (<>) is private;
   type Item_Type (<>) is private;
   with function Hash(Item: Item_Type) return Ada.Containers.Hash_Type is <>;
package Generic_Inverted_Index is

   type Storage_Type is tagged private;

   package Source_Vecs is new Ada.Containers.Indefinite_Vectors
     (Index_Type   => Positive,
      Element_Type => Source_Type);

   procedure Store(Storage: in out Storage_Type;
                   Source: Source_Type;
                   Item: Item_Type);
   -- stores Source in a table, indexed by Item
   -- if there is already an Item/Source entry, the Table isn_t changed

   function Find(Storage: Storage_Type; Item: Item_Type)
                return Source_Vecs.Vector;
   -- Generates a vector of all Sources for the given Item

   function "and"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector;
   -- returns a vector of all sources, which are both in Left and in Right

   function "or"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector;
   -- returns a vector of all sources, which are in Left, Right, or both

   function Empty(Vec: Source_Vecs.Vector) return Boolean;
   -- returns true if Vec is empty

   function First_Source(The_Sources: Source_Vecs.Vector) return Source_Type;
   -- returns the first enty in The_Sources; pre: The_Sourses is not empty

   procedure Delete_First_Source(The_Sources: in out Source_Vecs.Vector;
                                 Count: Ada.Containers.Count_Type := 1);
   -- Removes the first Count entries; pre: The_Sourses has that many entries

   type Process_Source is not null access procedure (Source: Source_Type);

   generic
      with procedure Do_Something(Source: Source_Type);
   procedure Iterate(The_Sources: Source_Vecs.Vector);
   -- calls Do_Something(Source) for all sources in The_Sources;

private

   function Same_Vector(U,V: Source_Vecs.Vector) return Boolean;

   package Maps is new Ada.Containers.Indefinite_Hashed_Maps
     -- for each item (=key) we store a vector with sources
     (Key_Type         => Item_Type,
      Element_Type     => Source_Vecs.Vector,
      Hash             => Hash,
      Equivalent_Keys  => "=",
      "="              => Same_Vector);

   type Storage_Type is new Maps.Map with null record;

end Generic_Inverted_Index;

Here is the implementation (generic_inverted_index.adb):

package body Generic_Inverted_Index is

   use Source_Vecs;
   use type Maps.Cursor;


   procedure Store(Storage: in out Storage_Type;
                   Source: Source_Type;
                   Item: Item_Type) is
   begin
      if (Storage.Find(Item) = Maps.No_Element) then
         Storage.Insert(Key => Item,
                        New_Item => Empty_Vector & Source);
      else
         declare
            The_Vector: Vector := Storage.Element(Item);
         begin
            if The_Vector.Last_Element /= Source then
               Storage.Replace
                 (Key      => Item,
                  New_Item => Storage.Element(Item) & Source);
            end if;
         end;
      end if;
   end Store;

   function Find(Storage: Storage_Type; Item: Item_Type)
                return Vector is
   begin
      return Storage.Element(Item);
   exception
      when Constraint_Error => return Empty_Vector; -- found nothing
   end Find;

   function Is_In(S: Source_Type; V: Vector) return Boolean is
      VV: Vector := V;
   begin
      if Empty(V) then
         return False;
      elsif First_Source(V) = S then
         return True;
      else
         Delete_First_Source(VV);
         return Is_In(S, VV);
      end if;
   end Is_In;

   function "and"(Left, Right: Vector) return Vector is
       V: Vector := Empty_Vector;
   begin
      for I in First_Index(Left) .. Last_Index(Left) loop
         if Is_In(Element(Left, I), Right) then
            V := V & Element(Left, I);
         end if;
      end loop;
      return V;
   end "and";

   function "or"(Left, Right: Vector) return Vector is
       V: Vector := Left; -- all sources in Left
   begin -- ... add all sources in Right, which are not already in Left
      for I in First_Index(Right) .. Last_Index(Right) loop
         if not Is_In(Element(Right, I), Left) then
            V := V & Element(Right, I);
         end if;
      end loop;
      return V;
   end "or";

   function Empty(Vec: Vector) return Boolean
     renames Is_Empty;

   function First_Source(The_Sources: Vector)
                        return Source_Type renames First_Element;

   procedure Delete_First_Source(The_Sources: in out Vector;
                                 Count: Ada.Containers.Count_Type := 1)
     renames Delete_First;

   procedure Iterate(The_Sources: Vector) is
      V: Vector := The_Sources;
   begin
      while not Empty(V) loop
         Do_Something(First_Source(V));
         Delete_First_Source(V);
      end loop;
   end Iterate;

   function Same_Vector(U,V: Vector) return Boolean is
   begin
      raise Program_Error with "there is no need to call this function";
      return False; -- this avoices a compiler warning
   end Same_Vector;

end Generic_Inverted_Index;

Package Parse_Lines

The main program also uses an auxiliary package Parse_Lines. Note the usage of Gnat.Regpat, which itself is pattern matching package, specific for gnat/gcc. This package is derived from the Ada implementation of the regular expressions task. Here is the spec (parse_lines.ads):

with Gnat.Regpat;

package Parse_Lines is

   Word_Pattern: constant String := "([a-zA-Z]+)";
   Filename_Pattern: constant String := "([a-zA-Z0-9_.,;:]+)";

   procedure Search_For_Pattern(Pattern: Gnat.Regpat.Pattern_Matcher;
                                Search_In: String;
                                First, Last: out Positive;
                                Found: out Boolean);

   function Compile(Raw: String) return Gnat.Regpat.Pattern_Matcher;

   generic
      Pattern: String;
      with procedure Do_Something(Word: String);
   procedure Iterate_Words(S: String);

end Parse_Lines;

And here is the implementation (parse_lines.adb):

with Gnat.Regpat;

package body Parse_Lines is

   procedure Search_For_Pattern(Pattern: Gnat.Regpat.Pattern_Matcher;
                                Search_In: String;
                                First, Last: out Positive;
                                Found: out Boolean) is
      use Gnat.Regpat;
      Result: Match_Array (0 .. 1);
   begin
      Match(Pattern, Search_In, Result);
      Found := Result(1) /= No_Match;
      if Found then
         First := Result(1).First;
         Last := Result(1).Last;
      end if;
   end Search_For_Pattern;

   function Compile(Raw: String) return Gnat.Regpat.Pattern_Matcher is
   begin
      return Gnat.Regpat.Compile(Raw);
   end Compile;

      procedure Iterate_Words(S: String) is
      Current_First: Positive := S'First;
      First, Last:   Positive;
      Found:         Boolean;
      use Parse_Lines;
      Compiled_P: Gnat.Regpat.Pattern_Matcher := Compile(Pattern);
   begin
      loop
         Search_For_Pattern(Compiled_P,
                            S(Current_First .. S'Last),
                            First, Last, Found);
         exit when not Found;
         Do_Something(S(First .. Last));
         Current_First := Last+1;
      end loop;
   end Iterate_Words;

end Parse_Lines;

Alternative Implementation of Generic_Inverted_Index (Ada 2012)

The new standard Ada 2012 simplifies the usage of containers significantly. The following runs under gnat (GNAT GPL 2011 (20110419)), when using the experimental -gnat2012 switch. The main program is the same. Here is the spec for Generic_Inverted_Index:

with Ada.Containers.Indefinite_Vectors;
private with Ada.Containers.Indefinite_Hashed_Maps;

generic
   type Source_Type (<>) is private;
   type Item_Type (<>) is private;
   with function Hash(Item: Item_Type) return Ada.Containers.Hash_Type is <>;
package Generic_Inverted_Index is

   type Storage_Type is tagged private;

   package Source_Vecs is new Ada.Containers.Indefinite_Vectors
     (Index_Type   => Positive,
      Element_Type => Source_Type);

   procedure Store(Storage: in out Storage_Type;
                   Source: Source_Type;
                   Item: Item_Type);
   -- stores Source in a table, indexed by Item
   -- if there is already an Item/Source entry, the Table isn_t changed

   function Find(Storage: Storage_Type; Item: Item_Type)
                return Source_Vecs.Vector;
   -- Generates a vector of all Sources for the given Item

   function "and"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector;
   -- returns a vector of all sources, which are both in Left and in Right

   function "or"(Left, Right: Source_Vecs.Vector) return Source_Vecs.Vector;
   -- returns a vector of all sources, which are in Left, Right, or both

   function Empty(Vec: Source_Vecs.Vector) return Boolean;
   -- returns true if Vec is empty

   type Process_Source is not null access procedure (Source: Source_Type);

   generic
      with procedure Do_Something(Source: Source_Type);
   procedure Iterate(The_Sources: Source_Vecs.Vector);
   -- calls Do_Something(Source) for all sources in The_Sources;

private

   function Same_Vector(U,V: Source_Vecs.Vector) return Boolean;

   package Maps is new Ada.Containers.Indefinite_Hashed_Maps
     -- for each item (=key) we store a vector with sources
     (Key_Type         => Item_Type,
      Element_Type     => Source_Vecs.Vector,
      Hash             => Hash,
      Equivalent_Keys  => "=",
      "="              => Same_Vector);

   type Storage_Type is new Maps.Map with null record;

end Generic_Inverted_Index;

The implementation:

package body Generic_Inverted_Index is
             -- uses some of the new Ada 2012 syntax
   use Source_Vecs;

   procedure Store(Storage: in out Storage_Type;
                   Source: Source_Type;
                   Item: Item_Type) is
      use type Maps.Cursor;
   begin
      if (Storage.Find(Item) = Maps.No_Element) then
         Storage.Insert(Key => Item,
                        New_Item => Empty_Vector & Source);
      else
         declare
            The_Vector: Vector := Storage.Element(Item);
         begin
            if The_Vector.Last_Element /= Source then
               Storage.Replace
                 (Key      => Item,
                  New_Item => Storage.Element(Item) & Source);
            end if;
         end;
      end if;
   end Store;

   function Find(Storage: Storage_Type; Item: Item_Type)
                return Vector is
   begin
      return Storage.Element(Item);
   exception
      when Constraint_Error => return Empty_Vector; -- found nothing
   end Find;

   function Is_In(S: Source_Type; V: Vector) return Boolean is
   begin
      for Some_Element of V loop
         if Some_Element = S then
            return True;
         end if;
      end loop;
      return False;
   end Is_In;

   function "and"(Left, Right: Vector) return Vector is
       V: Vector := Empty_Vector;
   begin
      for Some_Element of Left loop
         if Is_In(Some_Element, Right) then
            V := V & Some_Element;
         end if;
      end loop;
      return V;
   end "and";

   function "or"(Left, Right: Vector) return Vector is
       V: Vector := Left; -- all sources in Left
   begin
      for Some_Element of Right loop
         if not Is_In(Some_Element, Left) then
            V := V & Some_Element;
         end if;
      end loop;
      return V;
   end "or";

   function Empty(Vec: Vector) return Boolean
     renames Is_Empty;

   procedure Iterate(The_Sources: Vector) is
   begin
      for Some_Element in The_Sources loop
         Do_Something(Element(Some_Element));
      end loop;
   end Iterate;

   function Same_Vector(U,V: Vector) return Boolean is
   begin
      raise Program_Error with "there is no need to call this function";
      return False; -- this avoices a compiler warning
   end Same_Vector;

end Generic_Inverted_Index;

AutoHotkey

Works with: AutoHotkey_L
; http://www.autohotkey.com/forum/viewtopic.php?t=41479
inputbox, files, files, file pattern such as c:\files\*.txt

word2docs := object() ; autohotkey_L is needed.

stime := A_tickcount
Loop, %files%, 0,1
{
   tooltip,%A_index%  / 500

   wordList := WordsIn(A_LoopFileFullPath)
   InvertedIndex(wordList, A_loopFileFullpath)
}

tooltip
msgbox, % "total time " (A_tickcount-stime)/1000

gosub, search
return

search:
Loop
{
   InputBox, keyword , input single keyword only
   msgbox, % foundDocs := findword(keyword)
}
return

WordsIn(docpath)
{
   FileRead, content, %docpath%
  spos = 1
   Loop
   {
     if !(spos := Regexmatch(content, "[a-zA-Z]{2,}",match, spos))
       break
     spos += strlen(match)
     this_wordList .= match "`n"
   }

  Sort, this_wordList, U
  return this_wordList
}

InvertedIndex(byref words, docpath)
{
   global word2docs

  loop, parse, words, `n,`r
  {
    if A_loopField =
      continue
    word2docs[A_loopField] := word2docs[A_loopField] docpath "`n"
  }
}

findWord(word2find)
{
  global word2docs

  if (word2docs[word2find] = "")
     return ""
  else
    return word2docs[word2find]
}

BBC BASIC

This uses a hashed index and linked lists to hold the file numbers.

      DIM FileList$(4)
      FileList$() = "BBCKEY0.TXT", "BBCKEY1.TXT", "BBCKEY2.TXT", \
      \             "BBCKEY3.TXT", "BBCKEY4.TXT"

      DictSize% = 30000
      DIM Index{(DictSize%-1) word$, link%}

      REM Build inverted index:
      FOR file% = DIM(FileList$(),1) TO 0 STEP -1
        filename$ = FileList$(file%)
        F% = OPENIN(filename$)
        IF F% = 0 ERROR 100, "Failed to open file"

        WHILE NOT EOF#F%
          REPEAT C%=BGET#F% : UNTIL C%>64 OR EOF#F% : word$ = CHR$(C%)
          REPEAT C%=BGET#F% : word$ += CHR$(C%) : UNTIL C%<65
          word$ = FNlower(LEFT$(word$))

          hash% = FNhash(word$)
          WHILE Index{(hash%)}.word$<>"" AND Index{(hash%)}.word$<>word$
            hash% = (hash% + 1) MOD DictSize% : REM Collision
          ENDWHILE
          Index{(hash%)}.word$ = word$
          link% = Index{(hash%)}.link%
          IF link%=0 OR link%!4<>file% THEN
            DIM heap% 7 : heap%!4 = file%
            !heap% = link%
            Index{(hash%)}.link% = heap% : REM Linked list
          ENDIF
        ENDWHILE

        CLOSE #F%
      NEXT file%

      REM Now query the index:
      PRINT FNquery("random")
      PRINT FNquery("envelope")
      PRINT FNquery("zebra")
      PRINT FNquery("the")
      END

      DEF FNquery(A$)
      LOCAL hash%, link%, temp%
      A$ = FNlower(A$)
      hash% = FNhash(A$)
      temp% = hash%
      WHILE Index{(hash%)}.word$ <> A$
        hash% = (hash% + 1) MOD DictSize%
        IF hash% = temp% THEN = """" + A$ + """ not found"
      ENDWHILE
      link% = Index{(hash%)}.link%
      A$ = """" + A$ + """ found in "
      WHILE link%
        A$ += FileList$(link%!4) + ", "
        link% = !link%
      ENDWHILE
      = LEFT$(LEFT$(A$))

      DEF FNhash(A$)
      LOCAL hash%
      IF LEN(A$) < 4 A$ += STRING$(4-LEN(A$),CHR$0)
      hash% = !!^A$
      IF LEN(A$) > 4 hash% EOR= !(!^A$ + LEN(A$) - 4)
      = hash% MOD DictSize%

      DEF FNlower(A$)
      LOCAL A%,C%
      FOR A% = 1 TO LEN(A$)
        C% = ASCMID$(A$,A%)
        IF C% >= 65 IF C% <= 90 MID$(A$,A%,1) = CHR$(C%+32)
      NEXT
      = A$

Output:

"random" found in BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT
"envelope" found in BBCKEY1.TXT, BBCKEY4.TXT
"zebra" not found
"the" found in BBCKEY0.TXT, BBCKEY1.TXT, BBCKEY2.TXT, BBCKEY3.TXT, BBCKEY4.TXT

C

The code is stupidly long, having to implement a Trie to store strings and all -- the program doesn't do anything shiny, but Tries may be interesting to look at.

#include <stdio.h>
#include <stdlib.h>

char chr_legal[] = "abcdefghijklmnopqrstuvwxyz0123456789_-./";
int  chr_idx[256] = {0};
char idx_chr[256] = {0};

#define FNAME 0
typedef struct trie_t *trie, trie_t;
struct trie_t {
    trie next[sizeof(chr_legal)]; /* next letter; slot 0 is for file name */
    int eow;
};

trie trie_new() { return calloc(sizeof(trie_t), 1); }

#define find_word(r, w) trie_trav(r, w, 1)
/* tree traversal: returns node if end of word and matches string, optionally
 * create node if doesn't exist
 */
trie trie_trav(trie root, const char * str, int no_create)
{
    int c;
    while (root) {
        if ((c = str[0]) == '\0') {
            if (!root->eow && no_create) return 0;
            break;
        }
        if (! (c = chr_idx[c]) ) {
            str++;
            continue;
        }

        if (!root->next[c]) {
            if (no_create) return 0;
            root->next[c] = trie_new();
        }
        root = root->next[c];
        str++;
    }
    return root;
}

/*  complete traversal of whole tree, calling callback at each end of word node.
 *  similar method can be used to free nodes, had we wanted to do that.
 */
int trie_all(trie root, char path[], int depth, int (*callback)(char *))
{
    int i;
    if (root->eow && !callback(path)) return 0;

    for (i = 1; i < sizeof(chr_legal); i++) {
        if (!root->next[i]) continue;

        path[depth] = idx_chr[i];
        path[depth + 1] = '\0';
        if (!trie_all(root->next[i], path, depth + 1, callback))
            return 0;
    }
    return 1;
}

void add_index(trie root, const char *word, const char *fname)
{
    trie x = trie_trav(root, word, 0);
    x->eow = 1;

    if (!x->next[FNAME])
        x->next[FNAME] = trie_new();
    x = trie_trav(x->next[FNAME], fname, 0);
    x->eow = 1;
}

int print_path(char *path)
{
    printf(" %s", path);
    return 1;
}

/*  pretend we parsed text files and got lower cased words: dealing     *
 *  with text file is a whole other animal and would make code too long */
const char *files[] = { "f1.txt", "source/f2.txt", "other_file" };
const char *text[][5] ={{ "it", "is", "what", "it", "is" },
                { "what", "is", "it", 0 },
                { "it", "is", "a", "banana", 0 }};

trie init_tables()
{
    int i, j;
    trie root = trie_new();
    for (i = 0; i < sizeof(chr_legal); i++) {
        chr_idx[(int)chr_legal[i]] = i + 1;
        idx_chr[i + 1] = chr_legal[i];
    }

/* Enable USE_ADVANCED_FILE_HANDLING to use advanced file handling.
 * You need to have files named like above files[], with words in them
 * like in text[][].  Case doesn't matter (told you it's advanced).
 */
#define USE_ADVANCED_FILE_HANDLING 0
#if USE_ADVANCED_FILE_HANDLING
    void read_file(const char * fname) {
        char cmd[1024];
        char word[1024];
        sprintf(cmd, "perl -p -e 'while(/(\\w+)/g) {print lc($1),\"\\n\"}' %s", fname);
        FILE *in = popen(cmd, "r");
        while (!feof(in)) {
            fscanf(in, "%1000s", word);
            add_index(root, word, fname);
        }
        pclose(in);
    };

    read_file("f1.txt");
    read_file("source/f2.txt");
    read_file("other_file");
#else
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 5; j++) {
            if (!text[i][j]) break;
            add_index(root, text[i][j], files[i]);
        }
    }
#endif /*USE_ADVANCED_FILE_HANDLING*/

    return root;
}

void search_index(trie root, const char *word)
{
    char path[1024];
    printf("Search for \"%s\": ", word);
    trie found = find_word(root, word);

    if (!found) printf("not found\n");
    else {
        trie_all(found->next[FNAME], path, 0, print_path);
        printf("\n");
    }
}

int main()
{
    trie root = init_tables();

    search_index(root, "what");
    search_index(root, "is");
    search_index(root, "banana");
    search_index(root, "boo");
    return 0;
}

Output:

Search for "what":  f1.txt source/f2.txt
Search for "is":  f1.txt other_file source/f2.txt
Search for "banana":  other_file
Search for "boo": not found

C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class InvertedIndex
{
    static Dictionary<TItem, IEnumerable<TKey>> Invert<TKey, TItem>(Dictionary<TKey, IEnumerable<TItem>> dictionary)
    {
        return dictionary
            .SelectMany(keyValuePair => keyValuePair.Value.Select(item => new KeyValuePair<TItem, TKey>(item, keyValuePair.Key)))
            .GroupBy(keyValuePair => keyValuePair.Key)
            .ToDictionary(group => group.Key, group => group.Select(keyValuePair => keyValuePair.Value));
    }

    static void Main()
    {
        Console.Write("files: ");
        var files = Console.ReadLine();
        Console.Write("find: ");
        var find = Console.ReadLine();
        var dictionary = files.Split().ToDictionary(file => file, file => File.ReadAllText(file).Split().AsEnumerable());
        Console.WriteLine("{0} found in: {1}", find, string.Join(" ", Invert(dictionary)[find]));
    }
}

Sample output:

files: file1 file2 file3
find: what
what found in: file1 file2

C++

Same idea as the C implementation - trie to store the words

#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>

const std::string _CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/";
const size_t MAX_NODES = 41;

class node
{
public:
    node() { clear(); }
    node( char z ) { clear(); }
    ~node() { for( int x = 0; x < MAX_NODES; x++ ) if( next[x] ) delete next[x]; }
    void clear() { for( int x = 0; x < MAX_NODES; x++ ) next[x] = 0; isWord = false; }
    bool isWord;
    std::vector<std::string> files;
    node* next[MAX_NODES];
};

class index {
public:
    void add( std::string s, std::string fileName ) {
        std::transform( s.begin(), s.end(), s.begin(), tolower );
        std::string h;
        for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
            if( *i == 32 ) {
                pushFileName( addWord( h ), fileName );
                h.clear();
                continue;
            }
            h.append( 1, *i );
        }
        if( h.length() )
            pushFileName( addWord( h ), fileName );
    }
    void findWord( std::string s ) {
        std::vector<std::string> v = find( s );
        if( !v.size() ) {
            std::cout << s + " was not found!\n";
            return;
        }
        std::cout << s << " found in:\n";
        for( std::vector<std::string>::iterator i = v.begin(); i != v.end(); i++ ) {
            std::cout << *i << "\n";
        }
        std::cout << "\n";
    }
private:
    void pushFileName( node* n, std::string fn ) {
        std::vector<std::string>::iterator i = std::find( n->files.begin(), n->files.end(), fn );
        if( i == n->files.end() ) n->files.push_back( fn );
    }
    const std::vector<std::string>& find( std::string s ) {
        size_t idx;
        std::transform( s.begin(), s.end(), s.begin(), tolower );
        node* rt = &root;
        for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
            idx = _CHARS.find( *i );
            if( idx < MAX_NODES ) {
                if( !rt->next[idx] ) return std::vector<std::string>();
                rt = rt->next[idx];
            }
        }
        if( rt->isWord ) return rt->files;
        return std::vector<std::string>();
    }
    node* addWord( std::string s ) {
        size_t idx;
        node* rt = &root, *n;
        for( std::string::iterator i = s.begin(); i != s.end(); i++ ) {
            idx = _CHARS.find( *i );
            if( idx < MAX_NODES ) {
                n = rt->next[idx];
                if( n ){
                    rt = n;
                    continue;
                }
                n = new node( *i );
                rt->next[idx] = n;
                rt = n;
            }
        }
        rt->isWord = true;
        return rt;
    }
    node root;
};
int main( int argc, char* argv[] ) {
    index t;
    std::string s;
    std::string files[] = { "file1.txt", "f_text.txt", "text_1b.txt" };

    for( int x = 0; x < 3; x++ ) {
        std::ifstream f;
        f.open( files[x].c_str(), std::ios::in );
        if( f.good() ) {
            while( !f.eof() ) {
                f >> s;
                t.add( s, files[x] );
                s.clear();
            }
            f.close();
        }
    }

    while( true ) {
        std::cout << "Enter one word to search for, return to exit: ";
        std::getline( std::cin, s );
        if( !s.length() ) break;
        t.findWord( s );

    }
    return 0;
}
Output:
Enter one word to search for, return to exit: goodness
goodness found in:
file1.txt
f_text.txt

Enter one word to search for, return to exit: because
because found in:
f_text.txt

Enter one word to search for, return to exit: her
her found in:
text_1b.txt

Enter one word to search for, return to exit: fat
fat was not found!

Clojure

(ns inverted-index.core
  (:require [clojure.set :as sets]
            [clojure.java.io :as io]))

(def pattern #"\w+")     ; Java regex for a raw term: here a substring of alphanums
(defn normalize [match] (.toLowerCase match))  ; normalization of a raw term

(defn term-seq [text] (map normalize (re-seq pattern text)))

(defn set-assoc
  "Produces map with v added to the set associated with key k in map m"
  [m k v] (assoc m k (conj (get m k #{}) v)))

(defn index-file [index file]
  (with-open [reader (io/reader file)]
    (reduce
      (fn [idx term] (set-assoc idx term file))
      index
      (mapcat term-seq (line-seq reader)))))

(defn make-index [files]
  (reduce index-file {} files))

(defn search [index query]
  (apply sets/intersection (map index (term-seq query))))

CoffeeScript

fs = require 'fs'

make_index = (fns) ->
  # words are indexed by filename and 1-based line numbers
  index = {}
  for fn in fns
    for line, line_num in fs.readFileSync(fn).toString().split '\n'
      words = get_words line
      for word in words
        word = mangle(word)
        index[word] ||= []
        index[word].push [fn, line_num+1]
  index

grep = (index, word) ->
  console.log "locations for '#{word}':"
  locations = index[mangle(word)] || []
  for location in locations
    [fn, line_num] = location
    console.log "#{fn}:#{line_num}"
  console.log "\n"

get_words = (line) ->
  words = line.replace(/\W/g, ' ').split ' '
  (word for word in words when word != '')

mangle = (word) ->
  # avoid conflicts with words like "constructor"
  '_' + word

do ->
  fns = (fn for fn in fs.readdirSync('.') when fn.match /\.coffee/)
  index = make_index(fns)
  grep index, 'make_index'
  grep index, 'sort'

output

> coffee inverted_index.coffee
locations for 'make_index':
inverted_index.coffee:3
inverted_index.coffee:33
inverted_index.coffee:34


locations for 'sort':
anagrams.coffee:8
derangements.coffee:14
heap.coffee:34
heap.coffee:43
huffman.coffee:81
inverted_index.coffee:35
knuth_sample.coffee:12

Common Lisp

(defpackage rosettacode.inverted-index
  (:use cl))
(in-package rosettacode.inverted-index)

;; Return a list of tokens in the string LINE.  This is rather
;; complicated as CL has no good standard function to do it.
(defun tokenize (line)
  (let ((start 0) (len (length line)))
    (loop for s = (position-if #'alphanumericp line :start start)
          while s
          for e = (position-if-not #'alphanumericp line :start (1+ s))
          collect (subseq line s e)
          while (and e (< e len))
          do (setq start e))))

(defun index-file (index filename)
  (with-open-file (in filename)
    (loop for line = (read-line in nil nil)
          while line
          do (dolist (token (tokenize line))
               (pushnew filename (gethash token index '()))))))

(defun build-index (filenames)
  (let ((index (make-hash-table :test #'equal)))
    (dolist (f filenames)
      (index-file index f))
    index))

;; Find the files for QUERY.  We use the same tokenizer for the query
;; as for files.
(defun lookup (index query)
  (remove-duplicates (loop for token in (tokenize query)
                           append (gethash token index))
                     :test #'equal))

Example:

(defparameter *index* (build-index '("file1.txt" "file2.txt" "file3.txt")))
(defparameter *query* "foo bar")
(defparameter *result* (lookup *index* *query*))
(format t "Result for query ~s: ~{~a~^, ~}~%" *query* *result*)

D

import std.stdio, std.algorithm, std.string, std.file, std.regex;

void main() {
    string[][string] index;

    void parseFile(in string fn) {
        if (!exists(fn) || !isFile(fn))
            throw new Exception("File not found");

        foreach (word; readText(fn).splitter(regex(r"\W"))) {
            word = word.toLower();
            if (!index.get(word, null).canFind(fn))
                index[word] ~= fn;
        }
    }

    immutable fileNames = ["inv1.txt", "inv2.txt", "inv3.txt"];
    foreach (fName; fileNames)
        parseFile(fName);

    while (true) {
        writef("\nEnter a word to search for: (q to quit): ");
        immutable w = readln().strip().toLower();
        if (w == "q") {
            writeln("quitting.");
            break;
        }
        if (w in index)
            writefln("'%s' found in%( %).", w, index[w]);
        else
            writefln("'%s' not found.", w);
    }
}

Both the demo text files and the queries are from the Wikipedia page, they contain:

It is what it is.
What is it?
It is a banana!
Output:
Enter a word to search for: (q to quit): cat
'cat' not found.

Enter a word to search for: (q to quit): is
'is' found in "inv1.txt" "inv2.txt" "inv3.txt".

Enter a word to search for: (q to quit): banana
'banana' found in "inv3.txt".

Enter a word to search for: (q to quit): it
'it' found in "inv1.txt" "inv2.txt" "inv3.txt".

Enter a word to search for: (q to quit): what
'what' found in "inv1.txt" "inv2.txt".

Enter a word to search for: (q to quit): q
quitting.

Delphi

program Inverted_index;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  system.Generics.Collections,
  SYstem.IOUtils;

type
  TIndex = class
  private
    FFileNames: TArray<string>;
    FIndexs: TDictionary<string, string>;
    class function SplitWords(Text: string): TArray<string>; static;
    function StoreFileName(FileName: TFileName): string;
    procedure StoreLine(Line, Id: string);
    procedure StoreWord(aWord, Id: string);
    function DecodeFileName(Indexed: string): string;
  public
    procedure AddFile(FileName: TFileName);
    constructor Create;
    destructor Destroy; override;
    function Search(aWord: string): string;
    property Index[aWord: string]: string read Search; default;
  end;

{ TIndex }

constructor TIndex.Create;
begin
  FIndexs := TDictionary<string, string>.Create();
  SetLength(FFileNames, 0);
end;

function TIndex.DecodeFileName(Indexed: string): string;
var
  f: string;
  i: integer;
begin
  Result := Indexed;
  if Length(FFileNames) > 0 then
    for i := 0 to High(FFileNames) do
    begin
      f := FFileNames[i];
      Result := Result.Replace(format('(%x)', [i]), f);
    end;
end;

destructor TIndex.Destroy;
begin
  FIndexs.Free;
  inherited;
end;

function TIndex.Search(aWord: string): string;
begin
  aWord := AnsiLowerCase(aWord);
  if FIndexs.ContainsKey(aWord) then
    Result := 'found in ' + DecodeFileName(FIndexs[aWord])
  else
    Result := 'not found';
end;

class function TIndex.SplitWords(Text: string): TArray<string>;
const
  WORD_CHARSET: set of char = (['a'..'z', 'A'..'Z', 'á', 'é', 'í', 'ó', 'ú', 'ã',
    'õ', 'à', 'è', 'ì', 'ò', 'ù', 'Á', 'É', 'Í', 'Ó', 'Ú', 'Ã', 'Õ', 'À', 'È',
    'Ì', 'Ò', 'Ù', 'ä', 'ë', 'ï', 'ö', 'ü', 'ç', 'Ç', '_', '0'..'9']);

  procedure Append(var value: string);
  begin
    if not value.IsEmpty then
    begin
      SetLength(result, length(result) + 1);
      result[high(result)] := value;
      value := '';
    end;
  end;

var
  c: Char;
  inWord, isWordChar: boolean;
  w: string;
begin
  inWord := False;
  w := '';
  SetLength(result, 0);
  for c in Text do
  begin
    isWordChar := (c in WORD_CHARSET);
    if inWord then
      if isWordChar then
        w := w + c
      else
      begin
        Append(w);
        inWord := false;
        Continue;
      end;

    if not inWord and isWordChar then
    begin
      inWord := True;
      w := c;
    end;
  end;
  if inWord then
    Append(w);
end;

function TIndex.StoreFileName(FileName: TFileName): string;
begin
  SetLength(FFileNames, length(FFileNames) + 1);
  FFileNames[High(FFileNames)] := FileName;
  Result := format('"(%x)"', [High(FFileNames)]);
end;

procedure TIndex.StoreLine(Line, Id: string);
var
  Words: TArray<string>;
  w: string;
begin
  Words := SplitWords(Line);
  for w in Words do
    StoreWord(w, Id);
end;

procedure TIndex.StoreWord(aWord, Id: string);
begin
  aWord := AnsiLowercase(aWord);
  if FIndexs.ContainsKey(aWord) then
  begin
    if (FIndexs[aWord].IndexOf(Id) = -1) then
      FIndexs[aWord] := FIndexs[aWord] + ', ' + Id;
  end
  else
    FIndexs.Add(aWord, Id);
end;

procedure TIndex.AddFile(FileName: TFileName);
var
  Lines: TArray<string>;
  Line: string;
  Id: string;
begin
  if not FileExists(FileName) then
    exit;
  Lines := TFile.ReadAllLines(FileName);

  if length(Lines) = 0 then
    exit;

  // Remove this line if you want full filename
  FileName := ExtractFileName(FileName);

  Id := StoreFileName(FileName);

  for Line in Lines do
    StoreLine(Line, Id);
end;

var
  Index: TIndex;
  FileNames: TArray<TFileName> = ['file1.txt', 'file2.txt', 'file3.txt'];
  FileName: TFileName;
  Querys: TArray<string> = ['Cat', 'is', 'banana', 'it', 'what'];
  Query, Found: string;

begin
  Index := TIndex.Create;

  for FileName in FileNames do
    Index.AddFile(FileName);

  for Query in Querys do
    Writeln(format('"%s" %s'#10, [Query, Index[Query]]));

  repeat
    Writeln('Enter a word to search for: (q to quit)');
    Readln(Query);

    if Query = 'q' then
      Break;

    Writeln(format('"%s" %s'#10, [Query, Index[Query]]));
  until False;

  Index.Free;
end.

Input: Same of D.

Output:
"Cat" not found

"is" found in "file1.txt", "file2.txt", "file3.txt"

"banana" found in "file3.txt"

"it" found in "file1.txt", "file2.txt", "file3.txt"

"what" found in "file1.txt", "file2.txt"

Enter a word to search for: (q to quit):
q

EchoLisp

Indexing

Index values are sets associated with each word (key). We use the local-put-value function to permanently store the index, in the browser local storage.

;; set of input files
(define FILES {T0.txt T1.txt T2.txt})
;; store name for permanent inverted index
(define INVERT "INVERTED-INDEX")

;; get text for each file, and call (action filename text)
(define (map-files action files)
    (for ((file files))
    (file->string action file)))

;; create store
(local-make-store INVERT)

; invert-word : word -> set of files
(define (invert-word word file store)
        (local-put-value word
            (make-set  (cons file (local-get-value word store))) store))

; parse file text and invert each word
(define (invert-text file text)
        (writeln 'Inverting file text)
        (let ((text (text-parse text)))
        (for ((word text))  (invert-word (string-downcase word) file INVERT))))

Query

Intersect sets values of each word.

;; usage : (inverted-search w1 w2 ..)
(define-syntax-rule (inverted-search w ...)
            (and-get-invert (quote w )))

;; intersects all sets referenced by words
;; returns the intersection
(define (and-get-invert words)
        (foldl
            (lambda(word res)
                 (set-intersect res (local-get-value word  INVERT)))
            FILES words))

Output :

(map-files invert-text FILES)
(inverted-search is it)
[0] { T0.txt T1.txt T2.txt }
(inverted-search is banana)
[1] { T2.txt }
(inverted-search is what)
[2] { T0.txt T1.txt }
(inverted-search boule)
[3] null

Erlang

This might be used with a lot of large files so we use binaries to save space. That adds <<>> to the search terms. If somebody wants to avoid "end." and "end" being two different terms, just add <<".">> to binary:compile_pattern/1 Ditto for any other character.

-module( inverted_index ).

-export( [from_files/1, search/2, task/0] ).

from_files( Files ) ->
        lists:foldl( fun import_from_file/2, dict:new(), Files ).

search( Binaries, Inverted_index ) ->
        [Files | T] = [dict:fetch(X, Inverted_index) || X <- Binaries],
        lists:foldl( fun search_common/2, Files, T ).

task() ->
       Files_contents = [{"file_1", <<"it is what it is">>}, {"file_2", <<"what is it">>}, {"file_3", <<"it is a banana">>}],
       [file:write_file(X, Y) || {X, Y} <- Files_contents],
       Inverted_index = from_files( [X || {X, _Y} <- Files_contents] ),
       Result = search( [<<"what">>, <<"is">>, <<"it">>], Inverted_index ),
       io:fwrite( "~p~n", [Result] ),
       [file:delete(X) || {X, _Y} <- Files_contents].



import_from_file( File, Dict_acc ) ->
        New_dict = dict:from_list( import_from_file_contents(File, file:read_file(File)) ),
    dict:merge( fun import_from_file_merge/3, Dict_acc, New_dict ).

import_from_file_contents( File, {ok, Binary} ) ->
        [{X, [File]} || X <- binary:split( Binary, binary:compile_pattern([<<" ">>, <<"\n">>]), [global] )];
import_from_file_contents( File, {error, Error} ) ->
    io:fwrite( "Error: could not open file ~p: ~p~nContinuing with the rest of them~n", [File,  Error] ),
    [].

import_from_file_merge( _Key, Files, [New_file] ) -> [New_file | Files].

search_common( Files, Acc ) -> [X || X <- Acc, lists:member(X, Files)].

F#

open System
open System.IO

// Map search terms to associated set of files
type searchIndexMap = Map<string, Set<string>>

let inputSearchCriteria() =
    let readLine prompt =
        printf "%s: " prompt
        Console.ReadLine().Split()

    readLine "Files", (readLine "Find") |> Array.map (fun s -> s.ToLower())

let updateIndex indexMap keyValuePair =
    let k, v = keyValuePair

    match Map.tryFind k indexMap with
        | None     -> Map.add k (Set.singleton v) indexMap
        | Some set -> Map.add k (Set.add v set) indexMap

let buildIndex files =
    let fileData file =
        File.ReadAllText(file).Split() |> Seq.map (fun word -> word.ToLower(), file)

    files |> Seq.collect fileData
          |> Seq.fold updateIndex Map.empty

let searchFiles() =
    let files, terms = inputSearchCriteria()
    let indexes = buildIndex files

    let searchResults = terms |> Seq.map (fun term -> Map.find term indexes)
                              |> Set.intersectMany

    printf "Found in: " ; searchResults |> Set.iter (printf "%s ") ; printfn ""

Sample usage:

searchFiles()

Files: file1.txt file2.txt file3.txt
Find: what is
Found in: file1.txt file2.txt

Factor

USING: assocs fry io.encodings.utf8 io.files kernel sequences
sets splitting vectors ;
IN: rosettacode.inverted-index

: file-words ( file -- assoc )
    utf8 file-contents " ,;:!?.()[]{}\n\r" split harvest ;
: add-to-file-list ( files file -- files )
    over [ swap [ adjoin ] keep ] [ nip 1vector ] if ;
: add-to-index ( words index file -- )
    '[ _ [ _ add-to-file-list ] change-at ] each ;
: (index-files) ( files index -- )
   [ [ [ file-words ] keep ] dip swap add-to-index ] curry each ;
: index-files ( files -- index )
    H{ } clone [ (index-files) ] keep ;
: query ( terms index -- files )
    [ at ] curry map [ ] [ intersect ] map-reduce ;

Example use :

( scratchpad ) { "f1" "f2" "f3" } index-files

--- Data stack:
H{ { "a" ~vector~ } { "is" ~vector~ } { "what" ~vector~ } { ...
( scratchpad ) { "what" "is" "it" } swap query .
V{ "f1" "f2" }

Go

package main

import (
    "bufio"
    "bytes"
    "errors"
    "fmt"
    "io"
    "os"
)

// inverted index representation
var index map[string][]int // ints index into indexed
var indexed []doc

type doc struct {
    file  string
    title string
}

func main() {
    // initialize representation
    index = make(map[string][]int)

    // build index
    if err := indexDir("docs"); err != nil {
        fmt.Println(err)
        return
    }

    // run user interface
    ui()
}

func indexDir(dir string) error {
    df, err := os.Open(dir)
    if err != nil {
        return err
    }
    fis, err := df.Readdir(-1)
    if err != nil {
        return err
    }
    if len(fis) == 0 {
        return errors.New(fmt.Sprintf("no files in %s", dir))
    }
    indexed := 0
    for _, fi := range fis {
        if !fi.IsDir() {
            if indexFile(dir + "/" + fi.Name()) {
                indexed++
            }
        }
    }
    return nil
}

func indexFile(fn string) bool {
    f, err := os.Open(fn)
    if err != nil {
        fmt.Println(err)
        return false // only false return
    }

    // register new file
    x := len(indexed)
    indexed = append(indexed, doc{fn, fn})
    pdoc := &indexed[x]

    // scan lines
    r := bufio.NewReader(f)
    lines := 0
    for {
        b, isPrefix, err := r.ReadLine()
        switch {
        case err == io.EOF:
            return true
        case err != nil:
            fmt.Println(err)
            return true
        case isPrefix:
            fmt.Printf("%s: unexpected long line\n", fn)
            return true
        case lines < 20 && bytes.HasPrefix(b, []byte("Title:")):
            // in a real program you would write code
            // to skip the Gutenberg document header
            // and not index it.
            pdoc.title = string(b[7:])
        }
        // index line of text in b
        // again, in a real program you would write a much
        // nicer word splitter.
    wordLoop:
        for _, bword := range bytes.Fields(b) {
            bword := bytes.Trim(bword, ".,-~?!\"'`;:()<>[]{}\\|/=_+*&^%$#@")
            if len(bword) > 0 {
                word := string(bword)
                dl := index[word]
                for _, d := range dl {
                    if d == x {
                        continue wordLoop
                    }
                }
                index[word] = append(dl, x)
            }
        }
    }
    return true
}

func ui() {
    fmt.Println(len(index), "words indexed in", len(indexed), "files")
    fmt.Println("enter single words to search for")
    fmt.Println("enter a blank line when done")
    var word string
    for {
        fmt.Print("search word: ")
        wc, _ := fmt.Scanln(&word)
        if wc == 0 {
            return
        }
        switch dl := index[word]; len(dl) {
        case 0:
            fmt.Println("no match")
        case 1:
            fmt.Println("one match:")
            fmt.Println("   ", indexed[dl[0]].file, indexed[dl[0]].title)
        default:
            fmt.Println(len(dl), "matches:")
            for _, d := range dl {
                fmt.Println("   ", indexed[d].file, indexed[d].title)
            }
        }
    }
}

Session:

8448 words indexed in 11 files
enter single words to search for
enter a blank line when done
search word: dog
no match
search word: cat
one match:
    docs/pg28554.txt Beyond Lies the Wub
search word: robot
6 matches:
    docs/pg32032.txt Second Variety
    docs/pg32522.txt Mr. Spaceship
    docs/pg32832.txt Piper in the Woods
    docs/pg28698.txt The Crystal Crypt
    docs/pg28767.txt The Defenders
    docs/pg32154.txt The Variable Man

Haskell

import Control.Monad
import Data.Char (isAlpha, toLower)
import qualified Data.Map as M
import qualified Data.IntSet as S
import System.Environment (getArgs)

main =  do
    (files, _ : q) <- liftM (break (== "--")) getArgs
    buildII files >>= mapM_ putStrLn . queryII q

data IIndex = IIndex
    [FilePath]              -- Files in the index
    (M.Map String S.IntSet) -- Maps word to indices of the list
  deriving Show

buildII :: [FilePath] -> IO IIndex
buildII files =
    liftM (IIndex files . foldl f M.empty . zip [0..]) $
    mapM readFile files
  where f m (i, s) =
            foldl g m $ map (lowercase . filter isAlpha) $ words s
          where g m word = M.insertWith S.union word (S.singleton i) m

queryII :: [String] -> IIndex -> [FilePath]
queryII q (IIndex files m) =
    map (files !!) $ S.toList $ intersections $
    map (\word -> M.findWithDefault S.empty (lowercase word) m) q

intersections [] = S.empty
intersections xs = foldl1 S.intersection xs

lowercase = map toLower

An example of use, assuming the program is named iindex and there exist files t0, t1, and t2 with contents "It is what it is.", "What is it?", and "It is a banana.":

$ iindex t0 t1 t2 -- what is it
t0
t1

Icon and Unicon

The following implements a simple case insensitive inverse index using lists simulating texts.

procedure main()

  texts := table()     # substitute for read and parse files
  texts["T0.txt"] := ["it", "is", "what", "it", "is"]
  texts["T1.txt"] := ["what", "is", "it"]
  texts["T2.txt"] := ["it", "is", "a", "banana"]

  every textname := key(texts) do  # build index for each 'text'
     SII := InvertedIndex(SII,textname,texts[textname])

  TermSearchUI(SII)  # search UI

end

procedure InvertedIndex(ii,k,words)  #: accumulate a simple inverted index

/ii := table(set())    # create lookup table and null set
every w := !words do {
   if *ii[w] = 0 then ii[w] := set()  # new word, new set
   insert(ii[w],k)
   }

return ii
end

procedure TermSearchUI(ii)    #: search UI, all words must match

repeat {
   writes("Enter search terms (^z to quit) : ")
   terms := map(trim(read() | break))

   x := []
   terms ? while not pos(0) do {
      tab(many(' \t'))
      put(x,tab(upto('\ \t')|0))
      }

   show("Searching for : ",x)
   show("Found in : ",s := TermSearch(ii,x)) | show("Not found : ",x)
   }
write("End of search")
return
end

procedure TermSearch(ii,x)  #: return set of matches or fail
every s := !x do
   ( /u := ii[s] ) | (u **:= ii[s])
if *u > 0 then return u
end

procedure show(s,x) # display helper
every writes(s|!x) do writes(" ")
write()
return
end

Output:

Enter search terms (^z to quit) : is it
Searching for :  is it
Found in :  T0.txt T2.txt T1.txt
Enter search terms (^z to quit) : banana
Searching for :  banana
Found in :  T2.txt
Enter search terms (^z to quit) : fox
Searching for :  fox
Not found :  fox
Enter search terms (^z to quit) : what
Searching for :  what
Found in :  T0.txt T1.txt

The following code will build a full index. Modification of search routines is left as an exercise:

record InvertedIndexRec(simple,full)

procedure FullInvertedIndex(ii,k,words)  #: accumulate a full inverted index

/ii := InvertedIndexRec( table(set()), table() ) # create lookup table and null set

wc := 0
every (w := !words, wc +:= 1) do {
   if *ii.simple[w] = 0 then {
       ii.simple[w] := set()  # new word, new set
       ii.full[w] := table()  # also new table
       }
   insert(ii.simple[w],k)
   /ii.full[w,k] := set()
   insert(ii.full[w,k],wc)
   }

return ii
end

J

This just implements the required spec, with a simplistic definition for what a word is, and with no support for stop words, nor for phrase searching.

require'files regex strings'

rxutf8 0  NB. support latin1 searches for this example, instead of utf8
files=:words=:buckets=:''
wordre=: rxcomp '[\w'']+'
parse=: ,@:rxfrom~ wordre&rxmatches

invert=: verb define
  files=: files,todo=. ~.y-.files
  >invert1 each todo
)

invert1=: verb define
  file=. files i.<y
  words=: ~.words,contents=. ~.parse tolower fread jpath y
  ind=. words i. contents
  buckets=: buckets,(1+words -&# buckets)#a:
  #buckets=: (file,~each ind{buckets) ind}buckets
)

search=: verb define
  hits=. buckets{~words i.~.parse tolower y
  files {~ >([-.-.)each/hits
)

Example use:

   invert '~help/primer/cut.htm';'~help/primer/end.htm';'~help/primer/gui.htm'
   >search 'finally learning'
~help/primer/end.htm
~help/primer/gui.htm
   >search 'argument'
~help/primer/cut.htm
~help/primer/gui.htm
   >search 'around'
~help/primer/gui.htm

Java

package org.rosettacode;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {

    List<String> stopwords = Arrays.asList("a", "able", "about",
            "across", "after", "all", "almost", "also", "am", "among", "an",
            "and", "any", "are", "as", "at", "be", "because", "been", "but",
            "by", "can", "cannot", "could", "dear", "did", "do", "does",
            "either", "else", "ever", "every", "for", "from", "get", "got",
            "had", "has", "have", "he", "her", "hers", "him", "his", "how",
            "however", "i", "if", "in", "into", "is", "it", "its", "just",
            "least", "let", "like", "likely", "may", "me", "might", "most",
            "must", "my", "neither", "no", "nor", "not", "of", "off", "often",
            "on", "only", "or", "other", "our", "own", "rather", "said", "say",
            "says", "she", "should", "since", "so", "some", "than", "that",
            "the", "their", "them", "then", "there", "these", "they", "this",
            "tis", "to", "too", "twas", "us", "wants", "was", "we", "were",
            "what", "when", "where", "which", "while", "who", "whom", "why",
            "will", "with", "would", "yet", "you", "your");

    Map<String, List<Tuple>> index = new HashMap<String, List<Tuple>>();
    List<String> files = new ArrayList<String>();

    public void indexFile(File file) throws IOException {
        int fileno = files.indexOf(file.getPath());
        if (fileno == -1) {
            files.add(file.getPath());
            fileno = files.size() - 1;
        }

        int pos = 0;
        BufferedReader reader = new BufferedReader(new FileReader(file));
        for (String line = reader.readLine(); line != null; line = reader
                .readLine()) {
            for (String _word : line.split("\\W+")) {
                String word = _word.toLowerCase();
                pos++;
                if (stopwords.contains(word))
                    continue;
                List<Tuple> idx = index.get(word);
                if (idx == null) {
                    idx = new LinkedList<Tuple>();
                    index.put(word, idx);
                }
                idx.add(new Tuple(fileno, pos));
            }
        }
        System.out.println("indexed " + file.getPath() + " " + pos + " words");
    }

    public void search(List<String> words) {
        for (String _word : words) {
            Set<String> answer = new HashSet<String>();
            String word = _word.toLowerCase();
            List<Tuple> idx = index.get(word);
            if (idx != null) {
                for (Tuple t : idx) {
                    answer.add(files.get(t.fileno));
                }
            }
            System.out.print(word);
            for (String f : answer) {
                System.out.print(" " + f);
            }
            System.out.println("");
        }
    }

    public static void main(String[] args) {
        try {
            InvertedIndex idx = new InvertedIndex();
            for (int i = 1; i < args.length; i++) {
                idx.indexFile(new File(args[i]));
            }
            idx.search(Arrays.asList(args[0].split(",")));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private class Tuple {
        private int fileno;
        private int position;

        public Tuple(int fileno, int position) {
            this.fileno = fileno;
            this.position = position;
        }
    }
}

Example output:

java -cp bin org.rosettacode.InvertedIndex "huntsman,merit,dog,the,gutenberg,lovecraft,olympian" pg30637.txt pg7025.txt pg82.txt pg9090.txt
indexed pg30637.txt 106473 words
indexed pg7025.txt 205714 words
indexed pg82.txt 205060 words
indexed pg9090.txt 68962 words
huntsman pg82.txt pg7025.txt
merit pg9090.txt pg30637.txt pg82.txt pg7025.txt
dog pg30637.txt pg82.txt pg7025.txt
the
gutenberg pg9090.txt pg30637.txt pg82.txt pg7025.txt
lovecraft pg30637.txt
olympian pg30637.txt

jq

In the first part of this section, the core functions for computing an inverted index and for searching it are presented. These functions will work with jq 1.4 as well as later (and possibly earlier) versions.

The second section shows how to accomplish the interactive task using a version of jq with support for 'input' and 'input_filename' (e.g. jq 1.5).

Part 1: inverted_index and search

# Given an array of [ doc, array_of_distinct_words ]
# construct a lookup table: { word: array_of_docs }
def inverted_index:
  reduce .[] as $pair
    ({};
     $pair[0] as $doc
     | reduce $pair[1][] as $word
       (.; .[$word] += [$doc]));

def search(words):
  def overlap(that): . as $this
  | reduce that[] as $item ([]; if $this|index($item) then . + [$item] else . end);

  . as $dict
  | if (words|length) == 0 then []
    else reduce words[1:][] as $word
      ( $dict[words[0]]; overlap( $dict[$word] ) )
    end ;

Part 2: Interactive Search

In this section, a solution to the task is presented using two invocations of jq: one parses the input files, and the other does everything else. If your shell does not support <(...) then you could create a temporary file to hold the parsed output.

def prompt_search:
  "Enter a string or an array of strings to search for, quoting each string, or 0 to exit:",
  ( (input | if type == "array" then . elif type == "string" then [.]
             else empty
             end) as $in
    | search($in), prompt_search ) ;

$in | inverted_index | prompt_search

Example:

$ jq -r -c -n --argfile in <(jq -R 'split(" ") | select(length>0) | [input_filename, unique]' T?.txt) -f Inverted_index.jq
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
"is"
["T0.txt","T1.txt","T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
["is", "banana"]
["T2.txt"]
Enter a string or an array of strings to search for, quoting each string, or 0 to exit:
0
$

Julia

function makedoubleindex(files)
    idx = Dict{String, Dict}()
    for file in files
        str = lowercase(read(file, String))
        words = split(str, r"\s+")
        for word in words
            if !haskey(idx, word)
                idx[word] = Dict{String, Int}()
            elseif !haskey(idx[word], file)
                (idx[word])[file] = 1
            else
                (idx[word])[file] += 1
            end
        end
    end
    idx
end

function wordsearch(dict, words::Vector{String})
    for word in words
        if haskey(dict, word)
            for (f, n) in dict[word]
                println("File $f contains $n instances of <$word>.")
            end
        else
            println("No instances of \"$word\" were found.")
        end
    end
end
wordsearch(dict, word::String) = wordsearch(dict, [word])


const filenames = ["file1.txt", "file2.txt", "file3.txt"]
const didx = makedoubleindex(filenames)
const searchterms = ["forehead", "of", "hand", "a", "foot"]
wordsearch(didx, searchterms)

Kotlin

// version 1.1.51

import java.io.File

val invIndex  = mutableMapOf<String, MutableList<Location>>()
val fileNames = mutableListOf<String>()
val splitter  = Regex("""\W+""")

class Location(val fileName: String, val wordNum: Int) {
    override fun toString() = "{$fileName, word number $wordNum}"
}

fun indexFile(fileName: String) {
    if (fileName in fileNames) {
        println("'$fileName' already indexed")
        return
    }
    fileNames.add(fileName)
    File(fileName).forEachLine { line ->
        for ((i, w) in line.toLowerCase().split(splitter).withIndex()) {
            var locations = invIndex[w]
            if (locations == null) {
                locations = mutableListOf<Location>()
                invIndex.put(w, locations)
            }
            locations.add(Location(fileName, i + 1))
        }
    }
    println("'$fileName' has been indexed")
}

fun findWord(word: String) {
    val w = word.toLowerCase()
    val locations = invIndex[w]
    if (locations != null) {
       println("\n'$word' found in the following locations:")
       println(locations.map { "    $it" }.joinToString("\n"))
    }
    else println("\n'$word' not found")
    println()
}

fun main(args: Array<String>) {
    // files to be indexed entered as command line arguments
    if (args.size == 0) {
        println("No file names have been supplied")
        return
    }
    for (arg in args) indexFile(arg)
    println()
    println("Enter word(s) to be searched for in these files or 'q' to quit")
    while (true) {
        print("  ? : ")
        val word = readLine()!!
        if (word.toLowerCase() == "q") return
        findWord(word)
    }
}

Contents of files:

inv1.txt  -> It is what it is.
inv2.txt  -> What is it?
inv3.txt  -> It is a banana!

Sample output:

$ java -jar inverted_index.jar inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed

Enter word(s) to be searched for in these files or 'q' to quit
  ? : cat

'cat' not found

  ? : is

'is' found in the following locations:
    {inv1.txt, word number 2}
    {inv1.txt, word number 5}
    {inv2.txt, word number 2}
    {inv3.txt, word number 2}

  ? : banana

'banana' found in the following locations:
    {inv3.txt, word number 4}

  ? : it

'it' found in the following locations:
    {inv1.txt, word number 1}
    {inv1.txt, word number 4}
    {inv2.txt, word number 3}
    {inv3.txt, word number 1}

  ? : what

'what' found in the following locations:
    {inv1.txt, word number 3}
    {inv2.txt, word number 1}

  ? : a

'a' found in the following locations:
    {inv3.txt, word number 3}

  ? : q

Mathematica/Wolfram Language

si = CreateSearchIndex["ExampleData/Text", Method -> "TFIDF"];
Manipulate[Grid[sr = TextSearch[si, ToString[s]];
  {FileBaseName /@ Normal[Dataset[sr][All, "ReferenceLocation"]], 
    Column[#, Frame -> All] & /@ sr[All, "Snippet"]} // Transpose, 
  Frame -> All], {s, "tree"}]
Output:

Outputs a graphical user interface with an interactive input field showing filename and the snippet of text that includes the search string.

Nim

We have used as examples three files containing the text from Wikipedia article. We used here an index which keep document name and line number. It would be easy to add the position in the line.

import os, strformat, strutils, tables

type

  WordRef = tuple[docnum, linenum: int]

  InvertedIndex = object
    docs: seq[string]
    index: Table[string, seq[WordRef]]

  SearchResult = object
    total: int
    refs: seq[tuple[docname: string; linenums: seq[int]]]

  IndexingError = object of CatchableError

const

  # Characters composing a word (letters + "'").
  WordChars = Letters + {'\''}

  # Words to ignore.
  StopWords = ["about", "above", "after", "again", "against", "all", "am", "an", "and", "any",
               "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below",
               "between", "both", "but", "by", "can't", "cannot", "could", "couldn't",  "did",
               "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each",
               "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have",
               "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's",
               "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll",
               "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself",
               "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not",
               "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours",
               "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll",
               "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's",
               "the", "their", "theirs", "them", "themselves", "then", "there", "there's",
               "these", "they", "they'd", "they'll", "they're", "they've", "this", "those",
               "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we",
               "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when",
               "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why",
               "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're",
               "you've", "your", "yours", "yourself", "yourselves"]


template plural(n: int): string = (if n > 1: "s" else: "")


proc add(invIndex: var InvertedIndex; docPath: string) =
  ## Add a document to the index.

  let docName = docPath.extractFilename()
  if docName in invIndex.docs:
    raise newException(IndexingError, &"{docName}: a document with this name is alreadry indexed.")
  invIndex.docs.add docName
  let docIndex = invIndex.docs.high

  var linenum = 0
  var count = 0

  try:
    for line in docPath.lines:
      inc linenum
      var word = ""
      for ch in line:
        if ch in WordChars:
          word.add ch.toLowerAscii()
        else:
          if word.len > 1 and word notin StopWords:
            invIndex.index.mgetOrPut(word, newSeq[WordRef]()).add (docIndex, linenum)
            inc count
          word = ""

  except IOError:
    raise newException(IndexingError, &"{docName}: error while reading file.")

  if count > 0:
    echo &"{docName}: {count} word{plural(count)} indexed."
  else:
    raise newException(IndexingError, &"{docName}: nothing to index.")


func status(invIndex: InvertedIndex): string =
  ## Display information about the inverted index status.
  let words = invIndex.index.len
  let docs = invIndex.docs.len
  &"Index contains {words} word{plural(words)} from {docs} document{plural(docs)}."


proc search(invIndex: InvertedIndex; word: string): SearchResult =
  ## Search a word in the inverted index.
  ## The references are grouped by document.

  let refs = invIndex.index.getOrDefault(word)
  if refs.len == 0: return

  result.total = refs.len
  var prevdoc = ""
  var prevline = 0
  for (docnum, linenum) in refs:
    let docname = invIndex.docs[docnum]
    if docname == prevDoc:
      if linenum != prevline:
        result.refs[^1].linenums.add(linenum)
        prevline = linenum
    else:
      result.refs.add((docname, @[linenum]))
      prevdoc = docname
      prevline = linenum


#———————————————————————————————————————————————————————————————————————————————————————————————————

var invertedIndex: InvertedIndex

if paramCount() != 1 or not paramStr(1).dirExists():
  quit &"Usage: {getAppFileName().lastPathPart} directory"

# Index the documents.
for doc in walkFiles(paramStr(1) / "*.txt"):
  try:
    invertedIndex.add(doc)
  except IndexingError:
    echo getCurrentExceptionMsg()

echo invertedIndex.status()
echo ""

# Enter search interface.
var word: string
while true:
  try:
    stdout.write "Word to search? "
    word = stdin.readLine().toLowerAscii().strip()
    if word == "q":
      echo "Quitting"
      break
    if not word.allCharsInSet(WordChars):
      echo "This word contains invalid characters."
      continue
  except EOFError:
    echo ""
    echo "Quitting"
    break

  # Search word.
  let result = invertedIndex.search(word)
  if result.refs.len == 0:
    echo "No reference found for this word."
    continue

  # Display the references.
  echo &"Found {result.total} reference{plural(result.total)} to “{word}”:"
  for (docname, linenums) in result.refs:
    stdout.write &"... in “{docName}”, line{plural(linenums.len)}"
    for num in linenums: stdout.write ' ', num
    echo ""
Output:

Example of session.

./inverted_index texts
a.txt: 129 words indexed.
b.txt: 183 words indexed.
c.txt: 16 words indexed.
Index contains 197 words from 3 documents.

Word to search? inverted
Found 23 references to “inverted”:
... in “a.txt”, lines 1 3 5
... in “b.txt”, lines 1 3 5 7
... in “c.txt”, line 1
Word to search? q
Quitting

OCaml

We store the inverted index data in the file "data.inv" using the sexplib library, so we compile with:

ocamlc -c \
  -pp "camlp4o -I `ocamlc -where`/type-conv \
               -I `ocamlc -where`/sexplib \
               pa_type_conv.cmo pa_sexp_conv.cmo" \
  unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma \
  inv.ml
ocamlc -o inv.byte unix.cma bigarray.cma nums.cma -I +sexplib sexplib.cma str.cma inv.cmo
TYPE_CONV_PATH "Inverted_index"

type files = string array with sexp
type inverted_index = (string * int list) list with sexp

type t = files * inverted_index with sexp

open Sexplib

let data_file = "data.inv"
let data_path = Filename.concat Filename.temp_dir_name data_file

let get_inv_index() =
  if Sys.file_exists data_path
  then t_of_sexp(Sexp.load_sexp data_path)
  else ([| |], [])

let load_file f =
  let ic = open_in f in
  let n = in_channel_length ic in
  let s = String.create n in
  really_input ic s 0 n;
  close_in ic;
  (s)

let array_push ar v =
  let len = Array.length ar in
  Array.init (succ len) (fun i ->
    if i < len then Array.unsafe_get ar i else v), len

let uniq lst =
  let h = Hashtbl.create (List.length lst) in
  List.iter (fun x -> Hashtbl.replace h x ()) lst;
  Hashtbl.fold (fun x () xs -> x :: xs) h []

let combine words i inv_index =
  let h = Hashtbl.create (List.length inv_index) in
  List.iter (fun (w, from) -> Hashtbl.replace h w from) inv_index;
  List.iter (fun w ->
    let from =
      try Hashtbl.find h w
      except Not_found -> []
    in
    Hashtbl.replace h w (i::from)
  ) words;
  Hashtbl.fold (fun w from acc -> (w, from) :: acc) h []

let words_of_file in_file =
  let str = load_file in_file in
  let words = Str.split (Str.regexp "[ \r\n\t,;.?!:'/\034()]") str in
  let words = uniq words in
  (words)

let index_file in_file =
  let words = words_of_file in_file in
  let files, inv_index = get_inv_index() in
  let files, i = array_push files in_file in
  let inv_index = combine words i inv_index in
  let se = sexp_of_t (files, inv_index) in
  Sexp.save data_path se

let search_word word =
  let files, inv_index = get_inv_index() in
  try
    let is_in = List.assoc word inv_index in
    List.iter (fun i -> print_endline files.(i)) is_in
  with Not_found ->
    print_endline "# Not Found"

let usage() =
  Printf.printf "Usage: %s \
    --index-file <file.txt> / \
    --search-word <some-word>\n%!" Sys.argv.(0);
  exit 1

let () =
  let cmd, arg = try (Sys.argv.(1), Sys.argv.(2)) with _ -> usage() in
  match cmd, arg with
  | "--index-file", in_file -> index_file in_file
  | "--search-word", word -> search_word word
  | _ -> usage()

Perl

use Set::Object 'set';

# given an array of files, returns the index
sub createindex
{
    my @files = @_;

    my %iindex;

    foreach my $file (@files)
    {
    open(F, "<", $file) or die "Can't read file $file: $!";
    while(<F>) {
            s/\A\W+//;
        foreach my $w (map {lc} grep {length() >= 3} split /\W+/)
        {
        if ( exists($iindex{$w}) )
        {
            $iindex{$w}->insert($file);
        } else {
            $iindex{$w} = set($file);
        }
        }
    }
    close(F);
    }
    return %iindex;
}

# given an index, search for words
sub search_words_with_index
{
    my %idx = %{shift()};
    my @words = @_;
    my $res = set();

    foreach my $w (map {lc} @words)
    {
    $w =~ s/\W+//g;            # strip non-words chars
        length $w < 3 and next;
        exists $idx{$w} or return set();
        $res = $res->is_null
          ? set(@{$idx{$w}})
          : $res * $idx{$w};       # set intersection
    }
    return @$res;
}

# TESTING
# USAGE: invidx.pl the,list,of,words file1 file2 .. fileN
my @searchwords = split /,/, shift;
# first arg is a comma-separated list of words to search for
print "$_\n"
    foreach search_words_with_index({createindex(@ARGV)}, @searchwords);

Phix

Loads all text files in demo\rosetta\ and builds a list of filenames and a dictionary of {word,file_indexes}, before a handful of quick tests.
Might be better (and almost as easy) for the dictionary values to be say {total_count, {file nos}, {file counts}}.

-- demo\rosetta\Inverted_index.exw
without js -- (file i/o)
integer word_count = 0
sequence filenames = {}

function is_ascii(string txt)
    for i=1 to length(txt) do
        integer ch = txt[i]
        if ch='\0' or ch>=#7F then return 0 end if
    end for
    return 1
end function

procedure add_words(string name, sequence words)
    filenames = append(filenames,name)
    integer fn = length(filenames)
    for i=1 to length(words) do
        string word = words[i]
        if word[1]>='a'         -- skip numbers
        and word[1]<='z' then
            integer node = getd_index(word)
            if node=0 then  -- not present
                setd(word,{fn})
                word_count += 1
            else
                sequence val = getd_by_index(node)
                if find(fn,val)=0 then
                    setd(word,append(val,fn))
                end if
            end if
        end if
    end for
end procedure

procedure load_directory()
    sequence d = dir(".")
    for i=1 to length(d) do
        if not find('d',d[i][D_ATTRIBUTES])     -- skip directories
        and d[i][D_SIZE]<1024*1024*1024 then    -- and files > 1GB
            string name = d[i][D_NAME]
            integer fn = open(name,"rb")
            string txt = lower(get_text(fn))
            close(fn)
            if is_ascii(txt) then               -- skip any bitmaps etc
                sequence words = split_any(txt,"\0\r\n\t !\"#$%&\'()*+,-./:;<=>?@[\\]^`{|}~")
                add_words(name,words)
            end if
        end if
    end for
end procedure

function lookup(sequence words)
    sequence files = {}     -- indexes to filenames
    for i=1 to length(words) do
        string word = words[i]
        integer node = getd_index(word)
        if node=0 then return {} end if
        sequence val = getd_by_index(node)
        if i=1 then
            files = val
        else
            for j=length(files) to 1 by -1 do
                if not find(files[j],val) then
                    files[j..j] = {}
                end if
            end for
            if length(files)=0 then return {} end if
        end if
    end for
    for i=1 to length(files) do
        files[i] = filenames[files[i]]
    end for
    return files
end function

load_directory()
?word_count
?lookup({"load_directory"})     -- should only be this file
?lookup({"dir"})                -- currently two use this
?lookup({"lower"})              -- currently four use this
?lookup({"lower","dir"})        -- currently two use both
?lookup({"dir","lower"})        -- should be the same two
?lookup({"ban"&"anafish"})      -- should be none ({})
Output:

Note the distributed version has been modified and is occasionally used for real, so the output will likely differ.

3365
{"Inverted_index.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"AlmostPrime.exw","Inverted_index.exw","RockPaperScissors.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{"Inverted_index.exw","viewppm.exw"}
{}

PHP

<?php

function buildInvertedIndex($filenames)
{
    $invertedIndex = [];

    foreach($filenames as $filename)
    {
        $data = file_get_contents($filename);

        if($data === false) die('Unable to read file: ' . $filename);

        preg_match_all('/(\w+)/', $data, $matches, PREG_SET_ORDER);

        foreach($matches as $match)
        {
            $word = strtolower($match[0]);

            if(!array_key_exists($word, $invertedIndex)) $invertedIndex[$word] = [];
            if(!in_array($filename, $invertedIndex[$word], true)) $invertedIndex[$word][] = $filename;
        }
    }

    return $invertedIndex;
}

function lookupWord($invertedIndex, $word)
{
    return array_key_exists($word, $invertedIndex) ? $invertedIndex[$word] : false;
}

$invertedIndex = buildInvertedIndex2(['file1.txt', 'file2.txt', 'file3.txt']);

foreach(['cat', 'is', 'banana', 'it'] as $word)
{
    $matches = lookupWord($invertedIndex, $word);

    if($matches !== false)
    {
        echo "Found the word \"$word\" in the following files: " . implode(', ', $matches) . "\n";
    }
    else
    {
        echo "Unable to find the word \"$word\" in the index\n";
    }
}

PicoLisp

Assuming three files "file1", "file2" and "file3":

$ cat file1
it is what it is

$ cat file2
what is it

$ cat file3
it is a banana

we can read them into a binary tree in the global variable '*MyIndex'

(off *MyIndex)

(use Word
   (for File '("file1" "file2" "file3")
      (in File
         (while (skip)
            (if (idx '*MyIndex (setq Word (till " ^I^J^M" T)) T)
               (push1 (car @) File)
               (set Word (cons File)) ) ) ) ) )

(de searchFor @
   (apply sect
      (extract
         '((Word) (val (car (idx '*MyIndex Word))))
         (rest) ) ) )

Output:

: (searchFor "what" "is" "it")
-> ("file2" "file1")

: (searchFor "a" "banana")
-> ("file3")

: (searchFor "it" "is")
-> ("file3" "file2" "file1")

PowerShell

Works with: PowerShell version 2
function Index-File ( [string[]]$FileList )
    {
    #  Create index hashtable, as needed
    If ( -not $Script:WordIndex ) { $Script:WordIndex = @{} }

    #  For each file to be indexed...
    ForEach ( $File in $FileList )
        {
        #  Find any previously existing entries for this file
        $ExistingEntries = $Script:WordIndex.Keys | Where { $Script:WordIndex[$_] -contains $File }

        #  For any previously existing entries
        #    Delete them (prior to reindexing the file)
        ForEach ( $Key in $ExistingEntries )
            {
            $Script:WordIndex[$Key] = @( $Script:WordIndex[$Key] | Where { $_ -ne $File } )
            }

        #  Get the contents of the file, split on non-alphanumeric characters, and remove duplicates
        $Words = ( Get-Content $File ) -split '[^a-zA-Z\d]' | Sort -Unique

        #  For each word in the file...
        ForEach ( $Word in $Words )
            {
            #  If the entry for the word already exists...
            If ( $Script:WordIndex[$Word] )
                {
                #  Add the file name to the entry
                $Script:WordIndex[$Word] += $File
                }
            Else
                {
                #  Create a new entry
                $Script:WordIndex[$Word] = @( $File )
                }
            }
        }
    }

function Find-Word ( [string]$Word )
    {
    return $WordIndex[$Word]
    }
#  Populate test files
@'
Files full of
various words.
'@ | Out-File -FilePath C:\Test\File1.txt

@'
Create an index
of words.
'@ | Out-File -FilePath C:\Test\File2.txt

@'
Use the index
to find the files.
'@ | Out-File -FilePath C:\Test\File3.txt
#  Index files
Index-File C:\Test\File1.txt, C:\Test\File2.txt, C:\Test\File3.txt

Because PowerShell is a shell language, it is "a user interface to do a search". After running the script defining the functions and running a command to index the files, the user can simply run the search function at the PowerShell command prompt. Alternatively, one could create a more complex custom UI or GUI if desired.

#  Query index
Find-Word files
Output:
C:\Test\File1.txt
C:\Test\File3.txt

Python

Simple inverted index

First the simple inverted index from here together with an implementation of a search for (multiple) terms from that index.

'''
This implements: http://en.wikipedia.org/wiki/Inverted_index of 28/07/10
'''

from pprint import pprint as pp
from glob import glob
try: reduce
except: from functools import reduce
try:    raw_input
except: raw_input = input


def parsetexts(fileglob='InvertedIndex/T*.txt'):
    texts, words = {}, set()
    for txtfile in glob(fileglob):
        with open(txtfile, 'r') as f:
            txt = f.read().split()
            words |= set(txt)
            texts[txtfile.split('\\')[-1]] = txt
    return texts, words

def termsearch(terms): # Searches simple inverted index
    return reduce(set.intersection,
                  (invindex[term] for term in terms),
                  set(texts.keys()))

texts, words = parsetexts()
print('\nTexts')
pp(texts)
print('\nWords')
pp(sorted(words))

invindex = {word:set(txt
                        for txt, wrds in texts.items() if word in wrds)
            for word in words}
print('\nInverted Index')
pp({k:sorted(v) for k,v in invindex.items()})

terms = ["what", "is", "it"]
print('\nTerm Search for: ' + repr(terms))
pp(sorted(termsearch(terms)))

Sample Output

Texts
{'T0.txt': ['it', 'is', 'what', 'it', 'is'],
 'T1.txt': ['what', 'is', 'it'],
 'T2.txt': ['it', 'is', 'a', 'banana']}

Words
['a', 'banana', 'is', 'it', 'what']

Inverted Index
{'a': ['T2.txt'],
 'banana': ['T2.txt'],
 'is': ['T0.txt', 'T1.txt', 'T2.txt'],
 'it': ['T0.txt', 'T1.txt', 'T2.txt'],
 'what': ['T0.txt', 'T1.txt']}

Term Search for: ['what', 'is', 'it']
['T0.txt', 'T1.txt']

Full inverted index

There is a re-write of the termsearch function to work off this type of index, as well as a new phrasesearch function

The phrasesearch function will return multiple matches in a text, and goes on to show how this can be used to pick the text with most matches.

It is assumed that the following code is added to the end of the code for the simple case above and so shares its file opening and parsing results

from collections import Counter


def termsearch(terms): # Searches full inverted index
    if not set(terms).issubset(words):
        return set()
    return reduce(set.intersection,
                  (set(x[0] for x in txtindx)
                   for term, txtindx in finvindex.items()
                   if term in terms),
                  set(texts.keys()) )

def phrasesearch(phrase):
    wordsinphrase = phrase.strip().strip('"').split()
    if not set(wordsinphrase).issubset(words):
        return set()
    #firstword, *otherwords = wordsinphrase # Only Python 3
    firstword, otherwords = wordsinphrase[0], wordsinphrase[1:]
    found = []
    for txt in termsearch(wordsinphrase):
        # Possible text files
        for firstindx in (indx for t,indx in finvindex[firstword]
                          if t == txt):
            # Over all positions of the first word of the phrase in this txt
            if all( (txt, firstindx+1 + otherindx) in finvindex[otherword]
                    for otherindx, otherword in enumerate(otherwords) ):
                found.append(txt)
    return found


finvindex = {word:set((txt, wrdindx)
                      for txt, wrds in texts.items()
                      for wrdindx in (i for i,w in enumerate(wrds) if word==w)
                      if word in wrds)
             for word in words}
print('\nFull Inverted Index')
pp({k:sorted(v) for k,v in finvindex.items()})

print('\nTerm Search on full inverted index for: ' + repr(terms))
pp(sorted(termsearch(terms)))

phrase = '"what is it"'
print('\nPhrase Search for: ' + phrase)
print(phrasesearch(phrase))

# Show multiple match capability
phrase = '"it is"'
print('\nPhrase Search for: ' + phrase)
ans = phrasesearch(phrase)
print(ans)
ans = Counter(ans)
print('  The phrase is found most commonly in text: ' + repr(ans.most_common(1)[0][0]))

Sample Output

Full Inverted Index
{'a': [('T2.txt', 2)],
 'banana': [('T2.txt', 3)],
 'is': [('T0.txt', 1), ('T0.txt', 4), ('T1.txt', 1), ('T2.txt', 1)],
 'it': [('T0.txt', 0), ('T0.txt', 3), ('T1.txt', 2), ('T2.txt', 0)],
 'what': [('T0.txt', 2), ('T1.txt', 0)]}

Term Search on full inverted index for: ['what', 'is', 'it']
['T0.txt', 'T1.txt']

Phrase Search for: "what is it"
['T1.txt']

Phrase Search for: "it is"
['T0.txt', 'T0.txt', 'T2.txt']
  The phrase is found most commonly in text: 'T0.txt'

Racket

#!/usr/bin/env racket
#lang racket
(command-line
 #:args (term . files)
 (define rindex (make-hasheq))
 (for ([file files])
   (call-with-input-file file
     (λ(in) (let loop ()
              (define w (regexp-match #px"\\w+" in))
              (when w
                (let* ([w (bytes->string/utf-8 (car w))]
                       [w (string->symbol (string-foldcase w))]
                       [r (hash-ref rindex w '())])
                  (unless (member file r) (hash-set! rindex w (cons file r)))
                  (loop)))))))
 (define res
   (for/list ([w (regexp-match* #px"\\w+" term)])
     (list->set (hash-ref rindex (string->symbol (string-foldcase w)) '()))))
 (define all (set->list (apply set-intersect res)))
 (if (null? all)
   (printf "No matching files.\n")
   (printf "Terms found at: ~a.\n" (string-join all ", "))))
Output:
$ echo "It is what it is." > F1
$ echo "What is it?" > F2
$ echo "It is a banana." > F3
$ ./search.rkt "what" F?
Terms found at: F1, F2.
$ ./search.rkt "a" F?
Terms found at: F3.
$ ./search.rkt "what a" F?
No matching files.
$ ./search.rkt "what is it" F?
Terms found at: F1, F2.

Raku

(formerly Perl 6)

Works with: rakudo version 2015-09-16
sub MAIN (*@files) {
    my %norm;
    do for @files -> $file {
        %norm.push: $file X=> slurp($file).lc.words;
    }
    (my %inv).push: %norm.invert.unique;

    while prompt("Search terms: ").words -> @words {
        for @words -> $word {
            say "$word => {%inv.{$word.lc}//'(not found)'}";
        }
    }
}

REXX

Note: In this algorithm, word indices start at 1.

Note:   the Burma Shave signs were created from   1930 ──► 1951   and were common among the rural byways of America.

To see more about Burma Shave signs, see the Wikipedia entry:   Burma Shave signs.

/*REXX program illustrates building a simple inverted index  and  a method of word find.*/
@.=                                              /*a dictionary of words   (so far).    */
!=                                               /*a list of found words   (so far).    */
call invertI 0, 'BURMA0.TXT'                     /*read the file:  BURMA0.TXT  ···      */
call invertI 1, 'BURMA1.TXT'                     /*  "   "    "    BURMA1.TXT  ···      */
call invertI 2, 'BURMA2.TXT'                     /*  "   "    "    BURMA2.TXT  ···      */
call invertI 3, 'BURMA3.TXT'                     /*  "   "    "    BURMA3.TXT  ···      */
call invertI 4, 'BURMA4.TXT'                     /*  "   "    "    BURMA4.TXT  ···      */
call invertI 5, 'BURMA5.TXT'                     /*  "   "    "    BURMA5.TXT  ···      */
call invertI 6, 'BURMA6.TXT'                     /*  "   "    "    BURMA6.TXT  ···      */
call invertI 7, 'BURMA7.TXT'                     /*  "   "    "    BURMA7.TXT  ···      */
call invertI 8, 'BURMA8.TXT'                     /*  "   "    "    BURMA8.TXT  ···      */
call invertI 9, 'BURMA9.TXT'                     /*  "   "    "    BURMA9.TXT  ···      */
call findAword  "huz"                            /*find a word.                         */
call findAword  "60"                             /*find another word.                   */
call findAword  "don't"                          /*and find another word.               */
call findAword  "burma-shave"                    /*and find yet another word.           */
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
findAword: procedure expose @.;  arg x           /*get an uppercase version of the X arg*/
           parse arg ox                          /*get original (as-is)  value of X arg.*/
           _=@.x;    oxo='───'ox"───"
           if _==''  then do
                          say 'word'   oxo   "not found."
                          return 0
                          end
           _@=_                                  /*save _ text, pass it back to invoker.*/
           say 'word'  oxo  "found in:"
                          do  until _=='';    parse var   _   f  w  _
                          say '       file='f   "  word="w
                          end   /*until ··· */
           return _@
/*──────────────────────────────────────────────────────────────────────────────────────*/
invertI:   procedure expose @. !; parse arg #,fn /*the file number and the filename.    */
           call lineout fn                       /*close the file, ··· just in case.    */
           w=0                                   /*the number of words found  (so far). */
               do  while lines(fn)\==0           /* [↓]   process the entire file.      */
               _=space( linein(fn) )             /*read a line, elide superfluous blanks*/
               if _==''  then iterate            /*if a blank record,  then ignore it.  */
               say 'file' #", record:" _         /*display the record ──► terminal.     */

                  do  until _==''                /*pick off words from record until done*/
                  parse upper var   _   ?  _     /*pick off a word  (it's in uppercase).*/
                  ?=stripper(?)                  /*strip any trailing punctuation.      */
                  if ?=''  then iterate          /*is the word now all blank (or null)? */
                  w=w+1                          /*bump the word counter (index).       */
                  @.?=@.?  #  w                  /*append the new word to a list.       */
                  if wordpos(?,!)==0  then !=! ? /*add it to the list of words found.   */
                  end   /*until ··· */
               end      /*while ··· */
           say;     call lineout fn              /*close the file, just to be neat&safe.*/
           return w                              /*return the index of word in record.  */
/*──────────────────────────────────────────────────────────────────────────────────────*/
stripper:  procedure;  parse arg q               /*remove punctuation at the end of word*/
           @punctuation= '.,:;?¿!¡∙·';        do j=1  for length(@punctuation)
                                              q=strip(q, 'T', substr(@punctuation, j, 1) )
                                              end   /*j*/
           return q
output   when using the default inputs:

Output note:   The Burma-Shave company was shocked at the number of automobile fenders mailed to the company.

All fender senders received a half-pound jar of Burma-Shave.

file 0, record: Rip a fender
file 0, record: Off your Car
file 0, record: Send it in
file 0, record: For a half-pound jar
file 0, record: Burma-Shave

file 1, record: A peach
file 1, record: Looks good
file 1, record: With lots of fuzz
file 1, record: Man's no peach
file 1, record: And never was
file 1, record: Burma-Shave

file 2, record: Does your husband
file 2, record: Misbehave
file 2, record: Grunt and grumble
file 2, record: Rant and rave ?
file 2, record: Shoot the brute some
file 2, record: Burma-Shave

file 3, record: Don't take a curve
file 3, record: At 60 per
file 3, record: We hate to lose
file 3, record: A customer
file 3, record: Burma-Shave

file 4, record: Every shaver
file 4, record: Now can snore
file 4, record: Six more minutes
file 4, record: Than before
file 4, record: By using
file 4, record: Burma-Shave

file 5, record: He played
file 5, record: a sax
file 5, record: Had no B.O.
file 5, record: But his whiskers scratched
file 5, record: So she let him go
file 5, record: Burma-Shave

file 6, record: Henry the Eighth
file 6, record: Prince of Friskers
file 6, record: Lost five wives
file 6, record: But kept his whiskers
file 6, record: Burma-Shave

file 7, record: Listen birds
file 7, record: These signs cost
file 7, record: Money
file 7, record: So roost a while
file 7, record: But don't get funny
file 7, record: Burma-Shave

file 8, record: My man
file 8, record: Won't shave
file 8, record: Sez Hazel Huz
file 8, record: But I should worry
file 8, record: Dora's does
file 8, record: Burma-Shave

file 9, record: Past
file 9, record: Schoolhouses
file 9, record: Take it slow
file 9, record: Let the little
file 9, record: Shavers grow
file 9, record: Burma-Shave

word ───huz─── found in:
       file=8   word=7
word ───60─── found in:
       file=3   word=6
word ───don't─── found in:
       file=3   word=1
       file=7   word=12
word ───burma-shave─── found in:
       file=0   word=14
       file=1   word=15
       file=2   word=15
       file=3   word=14
       file=4   word=13
       file=5   word=17
       file=6   word=14
       file=7   word=15
       file=8   word=14
       file=9   word=11

Ruby

I broke this into two parts, storing the index as a file on disk to better represent how this might actually be used in practice. The indexmerge part will create or update the index data file with any files given on the command line, and then indexsearch will use the data file to search for any terms listed on the command line. The example is based on http://en.wikipedia.org/wiki/Inverted_index of 2010/09/10.

indexmerge.rb

if File.exist? "index.dat"
  @data = Marshal.load open("index.dat")
else
  @data = {}
end

# Let's give the string class the ability to tokenize itsself into lowercase
# words with no punctuation.
class String
  def index_sanitize
    self.split.collect do |token|
      token.downcase.gsub(/\W/, '')
    end
  end
end

# Just implementing a simple inverted index here.
ARGV.each do |filename|
  open filename do |file|
    file.read.index_sanitize.each do |word|
      @data[word] ||= []
      @data[word] << filename unless @data[word].include? filename
    end
  end
end

open("index.dat", "w") do |index|
  index.write Marshal.dump(@data)
end

indexsearch.rb

if File.exist? "index.dat"
  @data = Marshal.load open("index.dat")
else
  raise "The index data file could not be located."
end

class String
  def index_sanitize
    self.split.collect do |token|
      token.downcase.gsub(/\W/, '')
    end
  end
end

# Take anything passed in on the command line in any form and break it
# down the same way we did when making the index.
ARGV.join(' ').index_sanitize.each do |word|
  @result ||= @data[word]
  @result &= @data[word]
end

p @result

Output

> ./indexmerge.rb file1
> ./indexmerge.rb file2 file3
> ./indexsearch.rb what is it
["file1", "file2"]
> ./indexsearch.rb "a banana"
["file3"]
> ./indexsearch.rb It iS\!
["file1", "file2", "file3"]

Rust

// Part 1: Inverted index structure

use std::{
    borrow::Borrow,
    collections::{BTreeMap, BTreeSet},
};

#[derive(Debug, Default)]
pub struct InvertedIndex<T> {
    indexed: BTreeMap<String, BTreeSet<usize>>,
    sources: Vec<T>,
}

impl<T> InvertedIndex<T> {
    pub fn add<I, V>(&mut self, source: T, tokens: I)
    where
        I: IntoIterator<Item = V>,
        V: Into<String>,
    {
        let source_id = self.sources.len();
        self.sources.push(source);
        for token in tokens {
            self.indexed
                .entry(token.into())
                .or_insert_with(BTreeSet::new)
                .insert(source_id);
        }
    }

    pub fn search<'a, I, K>(&self, tokens: I) -> impl Iterator<Item = &T>
    where
        String: Borrow<K>,
        K: Ord + ?Sized + 'a,
        I: IntoIterator<Item = &'a K>,
    {
        let mut tokens = tokens.into_iter();

        tokens
            .next()
            .and_then(|token| self.indexed.get(token).cloned())
            .and_then(|first| {
                tokens.try_fold(first, |found, token| {
                    self.indexed
                        .get(token)
                        .map(|sources| {
                            found
                                .intersection(sources)
                                .cloned()
                                .collect::<BTreeSet<_>>()
                        })
                        .filter(|update| !update.is_empty())
                })
            })
            .unwrap_or_else(BTreeSet::new)
            .into_iter()
            .map(move |source| &self.sources[source])
    }

    pub fn tokens(&self) -> impl Iterator<Item = &str> {
        self.indexed.keys().map(|it| it.as_str())
    }

    pub fn sources(&self) -> &[T] {
        &self.sources
    }
}

// Part 2: File walking and processing

use std::{
    ffi::OsString,
    fmt::{Debug, Display},
    fs::{read_dir, DirEntry, File, ReadDir},
    io::{self, stdin, Read},
    path::{Path, PathBuf},
};

#[derive(Debug)]
pub struct Files {
    dirs: Vec<ReadDir>,
}

impl Files {
    pub fn walk<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        Ok(Files {
            dirs: vec![read_dir(path)?],
        })
    }
}

impl Iterator for Files {
    type Item = DirEntry;

    fn next(&mut self) -> Option<Self::Item> {
        'outer: while let Some(mut current) = self.dirs.pop() {
            while let Some(entry) = current.next() {
                if let Ok(entry) = entry {
                    let path = entry.path();
                    if !path.is_dir() {
                        self.dirs.push(current);
                        return Some(entry);
                    } else if let Ok(dir) = read_dir(path) {
                        self.dirs.push(current);
                        self.dirs.push(dir);
                        continue 'outer;
                    }
                }
            }
        }

        None // No directory left
    }
}

fn tokenize<'a>(input: &'a str) -> impl Iterator<Item = String> + 'a {
    input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
}

fn tokenize_file<P: AsRef<Path>>(path: P) -> io::Result<BTreeSet<String>> {
    let mut buffer = Vec::new();
    File::open(path)?.read_to_end(&mut buffer)?;
    let text = String::from_utf8_lossy(&buffer);
    Ok(tokenize(&text).collect::<BTreeSet<_>>())
}

fn tokenize_query(input: &str) -> Vec<String> {
    let result = tokenize(input).collect::<BTreeSet<_>>();
    // Make a vector sorted by length, so that longer tokens are processed first.
    // This heuristics should narrow the resulting set faster.
    let mut result = result.into_iter().collect::<Vec<_>>();
    result.sort_by_key(|item| usize::MAX - item.len());
    result
}

// Part 3: Interactive application

fn args() -> io::Result<(OsString, BTreeSet<OsString>)> {
    let mut args = std::env::args_os().skip(1); // Skip the executable's name

    let path = args
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "missing path"))?;

    let extensions = args.collect::<BTreeSet<_>>();

    Ok((path, extensions))
}

fn print_hits<'a, T>(hits: impl Iterator<Item = T>)
where
    T: Display,
{
    let mut found_none = true;
    for (number, hit) in hits.enumerate() {
        println!("    [{}] {}", number + 1, hit);
        found_none = false;
    }

    if found_none {
        println!("(none)")
    }
}

fn main() -> io::Result<()> {
    let (path, extensions) = args()?;
    let mut files = InvertedIndex::<PathBuf>::default();
    let mut content = InvertedIndex::<PathBuf>::default();

    println!(
        "Indexing {:?} files in '{}'",
        extensions,
        path.to_string_lossy()
    );

    for path in Files::walk(path)?.map(|file| file.path()).filter(|path| {
        path.extension()
            .filter(|&ext| extensions.is_empty() || extensions.contains(ext))
            .is_some()
    }) {
        files.add(path.clone(), tokenize(&path.to_string_lossy()));

        match tokenize_file(&path) {
            Ok(tokens) => content.add(path, tokens),
            Err(e) => eprintln!("Skipping a file {}: {}", path.display(), e),
        }
    }

    println!(
        "Indexed {} tokens in {} files.",
        content.tokens().count(),
        content.sources.len()
    );

    // Run the query UI loop
    let mut query = String::new();

    loop {
        query.clear();
        println!("Enter search query:");
        if stdin().read_line(&mut query).is_err() || query.trim().is_empty() {
            break;
        }

        match query.trim() {
            "/exit" | "/quit" | "" => break,

            "/tokens" => {
                println!("Tokens:");
                for token in content.tokens() {
                    println!("{}", token);
                }

                println!();
            }

            "/files" => {
                println!("Sources:");
                for source in content.sources() {
                    println!("{}", source.display());
                }

                println!();
            }

            _ => {
                let query = tokenize_query(&query);
                println!();
                println!("Found hits:");
                print_hits(content.search(query.iter()).map(|it| it.display()));
                println!("Found file names:");
                print_hits(files.search(query.iter()).map(|it| it.display()));
                println!();
            }
        }
    }

    Ok(())
}
Output:
C:\Temp>inverted_index books html htm txt
Indexing {"htm", "html", "txt"} files in 'books'
Indexed 34810 tokens in 9 files.
Enter search query:
war

Found hits:
    [1] books\EN\C\Chaucer, Geoffrey - Canterbury Tales.txt
    [2] books\EN\H\Homer - The Odyssey.txt
    [3] books\EN\O\Orwell, George - 1984.html
    [4] books\EN\W\Wells, Herbert George - The Invisible Man.txt
    [5] books\EN\W\Wells, Herbert George - The Island of Doctor Moreau.txt
    [6] books\EN\W\Wells, Herbert George - The Time Machine.txt
    [7] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
    [1] books\EN\W\Wells, Herbert George - War of the Worlds.txt

Enter search query:
war gun

Found hits:
    [1] books\EN\O\Orwell, George - 1984.html
    [2] books\EN\W\Wells, Herbert George - War of the Worlds.txt
Found file names:
(none)

Enter search query:
/exit

C:\Temp>

Scala

object InvertedIndex extends App {
  import java.io.File

  // indexer
  val WORD = raw"(\w+)".r
  def parse(s: String) = WORD.findAllIn(s).map(_.toString.toLowerCase)
  def invertedIndex(files: Seq[File]): Map[String,Set[File]] = {
    var i = Map[String,Set[File]]() withDefaultValue Set.empty
    files.foreach{f => scala.io.Source.fromFile(f).getLines flatMap parse foreach
      (w => i = i + (w -> (i(w) + f)))}
    i
  }

  // user interface
  args match {
    case _ if args.length < 2 => println("Usage: InvertedIndex ALLSEARCHWORDS FILENAME...")
    case Array(searchwords, filenames @ _*) =>
      val queries = parse(searchwords).toList
      val files = filenames.map(new File(_)).filter{f => if (!f.exists) println(s"Ignoring $f"); f.exists}
      (queries, files) match {
        case (q, _) if q.isEmpty => println("Missing search words")
        case (_, f) if f.isEmpty => println("Missing extant files")
        case _ => val index = invertedIndex(files)
          println(s"""Searching for ${queries map ("\""+_+"\"") mkString " and "} in ${files.size} files:""")
          queries.map(index).foldLeft(files.toSet)(_ intersect _) match {
            case m if m.isEmpty => println("No matching files")
            case m => println(m mkString "\n")
          }
      }
  }
}
Output:
> InvertedIndex "the" file1.txt file2.txt file3.txt
Searching for "the" in 3 files:
data/file1.txt
data/file2.txt
data/file3.txt

> InvertedIndex "the cat sat" file1.txt file2.txt file3.txt
Searching for "the" and "cat" and "sat" in 3 files:
file1.txt
file2.txt

> InvertedIndex fox file1.txt file2.txt file3.txt
Searching for "fox" in 3 files:
file3.txt

> InvertedIndex abc file1.txt file2.txt file3.txt
Searching for "abc" in 3 files:
No matching files

Tcl

package require Tcl 8.5
proc wordsInString str {
    # We define "words" to be "maximal sequences of 'word' characters".
    # The other possible definition is to use 'non-space' characters.
    regexp -all -inline {\w+} $str
}

# Adds a document to the index. The index is a map from words to a map
# from filenames to lists of word locations.
proc addDocumentToIndex {filename} {
    global index
    set f [open $filename]
    set data [read $f]
    close $f

    set i 0
    array set localidx {}
    foreach word [wordsInString $data] {
    lappend localidx($word) $i
    incr i
    }

    # Transcribe into global index
    foreach {word places} [array get localidx] {
    dict set index($word) $filename $places
    }
}

# How to use the index to find files containing a word
proc findFilesForWord {word} {
    global index
    if {[info exists index($word)]} {
    return [dict keys $index($word)]
    }
}
# How to use the index to find files containing all words from a list.
# Note that this does not use the locations within the file.
proc findFilesWithAllWords {words} {
    set files [findFilesForWord [lindex $words 0]]
    foreach w [lrange $words 1 end] {
    set wf [findFilesForWord $w]
    set newfiles {}
    foreach f $files {
        if {$f in $wf} {lappend newfiles $f}
    }
    set files $newfiles
    }
    return $files
}

# How to use the index to find a sequence of words in a file.
proc findFilesWithWordSequence {words} {
    global index
    set files {}
    foreach w $words {
    if {![info exist index($w)]} {
        return
    }
    }
    dict for {file places} $index([lindex $words 0]) {
    if {$file in $files} continue
    foreach start $places {
        set gotStart 1
        foreach w [lrange $words 1 end] {
        incr start
        set gotNext 0
        foreach {f ps} $index($w) {
            if {$f ne $file} continue
            foreach p $ps {
            if {$p == $start} {
                set gotNext 1
                break
            }
            }
            if {$gotNext} break
        }
        if {!$gotNext} {
            set gotStart 0
            break
        }
        }
        if {$gotStart} {
        lappend files $file
        break
        }
    }
    }
    return $files
}

For the GUI:

package require Tk
pack [labelframe .files -text Files] -side left -fill y
pack [listbox .files.list -listvariable files]
pack [button .files.add -command AddFile -text "Add File to Index"]
pack [labelframe .found -text Found] -side right -fill y
pack [listbox .found.list -listvariable found] -fill x
pack [entry .found.entry -textvariable terms] -fill x
pack [button .found.findAll -command FindAll \
    -text "Find File with All"] -side left
pack [button .found.findSeq -command FindSeq \
    -text "Find File with Sequence"] -side right

# The actions invoked by various GUI buttons
proc AddFile {} {
    global files
    set f [tk_getOpenFile]
    if {$f ne ""} {
    addDocumentToIndex $f
    lappend files $f
    }
}
proc FindAll {} {
    global found terms
    set words [wordsInString $terms]
    set fs [findFilesWithAllWords $words]
    lappend found "Searching for files with all $terms" {*}$fs \
    "---------------------"
}
proc FindSeq {} {
    global found terms
    set words [wordsInString $terms]
    set fs [findFilesWithWordSequence $words]
    lappend found "Searching for files with \"$terms\"" {*}$fs \
    "---------------------"
}

TUSCRIPT

$$ MODE TUSCRIPT

files="file1'file2'file3"
LOOP file=files
ERROR/STOP CREATE (file,seq-o,-std-)
ENDLOOP

content1="it is what it is"
content2="what is it"
content3="it is a banana"

FILE/ERASE "file1" = content1
FILE/ERASE "file2" = content2
FILE/ERASE "file3" = content3

ASK "search for": search=""
IF (search=="") STOP

BUILD R_TABLE/USER/AND search = *
DATA  {search}

LOOP/CLEAR file=files
 ACCESS q: READ/RECORDS $file s.z/u,content,count
  LOOP
  COUNT/NEXT/EXIT q (-; search;-;-)
  IF (count!=0) files=APPEND (files," ",file)
  ENDLOOP
 ENDACCESs q
ENDLOOP
PRINT "-> ",files

Output:

search for >what is it
-> file1 file2

search for >banana
-> file3

search for >it is
-> file1 file2 file3

UNIX Shell

Associative array

Works with: ksh93
#!/bin/ksh

typeset -A INDEX

function index {
  typeset num=0
  for file in "$@"; do
    tr -s '[:punct:]' ' ' < "$file" | while read line; do
      for token in $line; do
        INDEX[$token][$num]=$file
      done
    done
  ((++num))
  done
}

function search {
  for token in "$@"; do
    for file in "${INDEX[$token][@]}"; do
      echo "$file"
    done
  done | sort | uniq -c | while read count file; do
    (( count == $# )) && echo $file
  done
}

Example use:

index *.txt
search hello world

Directory on filesystem

This example is under development. It was marked thus on 20/January/2011. Please help complete the example.

The following is an attempt (not yet complete) to port the above script to pdksh, and perhaps other Bourne-compatible shells.

  • TODO Fill in "search.sh".
  • Add note about slowness.
#!/bin/sh
# index.sh - create an inverted index

unset IFS
: ${INDEX:=index}

# Prohibit '\n' in filenames (because '\n' is
# the record separator for $INDEX/all.tab).
for file in "$@"; do
    # Use printf(1), not echo, because "$file" might start with
    # a hyphen and become an option to echo.
    test 0 -eq $(printf %s "$file" | wc -l) || {
        printf '%s\n' "$file: newline in filename" >&2
        exit 1
    }
done

# Make a new directory for the index, or else
# exit with the error message from mkdir(1).
mkdir "$INDEX" || exit $?

fi=1
for file in "$@"; do
    printf %s "Indexing $file." >&2

    # all.tab maps $fi => $file
    echo "$fi $file" >> "$INDEX/all.tab"

    # Use punctuation ([:punct:]) and whitespace (IFS)
    # to split tokens.
    ti=1
    tr -s '[:punct:]' ' ' < "$file" | while read line; do
        for token in $line; do
            # Index token by position ($fi, $ti). Ignore
            # error from mkdir(1) if directory exists.
            mkdir "$INDEX/$token" 2>/dev/null
            echo $ti >> "$INDEX/$token/$fi"
            : $((ti += 1))

            # Show progress. Print a dot per 1000 tokens.
            case "$ti" in
            *000)   printf .
            esac
        done
    done

    echo >&2
    : $((fi += 1))
done
#!/bin/sh
# search.sh - search an inverted index

unset IFS
: ${INDEX:=index}

want=sequence
while getopts aos name; do
    case "$name" in
    a)  want=all;;
    o)  want=one;;
    s)  want=sequence;;
    *)  exit 2;;
    esac
done
shift $((OPTIND - 1))

all() {
    echo "TODO"
    exit 2
}

one() {
    echo "TODO"
    exit 2
}

sequence() {
    echo "TODO"
    exit 2
}

$want "$@"

Wren

Translation of: Kotlin
Library: Wren-ioutil
Library: Wren-pattern
Library: Wren-str
import "./ioutil" for FileUtil, Input
import "./pattern" for Pattern
import "./str" for Str
import "os" for Process

var invIndex = {}
var fileNames = []
var splitter = Pattern.new("+1/W")

class Location {
    construct new(fileName, wordNum) {
        _fileName = fileName
        _wordNum = wordNum
    }

    toString { "%(_fileName), word number %(_wordNum)" }
}

var indexFile = Fn.new { |fileName|
    if (fileNames.contains(fileName)) {
        System.print("'%(fileName)' already indexed")
        return
    }
    fileNames.add(fileName)
    var lines = FileUtil.readLines(fileName)
    lines.each { |line|
        line = Str.lower(line)
        var i = 0
        for (w in splitter.splitAll(line)) {
            var locations = invIndex[w]
            if (!locations) {
                locations = []
                invIndex[w] = locations
            }
            locations.add(Location.new(fileName, i + 1))
            i = i + 1
        }
    }
    System.print("'%(fileName)' has been indexed")
}

var findWord = Fn.new { |word|
    var w = Str.lower(word)
    var locations = invIndex[w]
    if (locations) {
       System.print("\n'%(word)' found in the following locations:")
       System.print(locations.map { |l| "    %(l)" }.join("\n"))
    } else {
       System.print("\n'%(word)' not found")
    }
    System.print()
}

// files to be indexed entered as command line arguments
var args = Process.arguments
if (args.count == 0) {
    System.print("No file names have been supplied")
    return
}
for (arg in args) indexFile.call(arg)
System.print("\nEnter word(s) to be searched for in these files or 'q' to quit")
while (true) {
    var word = Input.text("  ? : ", 1)
    if (word == "q" || word == "Q") return
    findWord.call(word)
}
Output:

Uses the same files as the Kotlin example:

$ wren-cli inverted_index.wren inv1.txt inv2.txt inv3.txt inv1.txt
'inv1.txt' has been indexed
'inv2.txt' has been indexed
'inv3.txt' has been indexed
'inv1.txt' already indexed

Enter word(s) to be searched for in these files or 'q' to quit
  ? : cat

'cat' not found

  ? : is

'is' found in the following locations:
    inv1.txt, word number 2
    inv1.txt, word number 5
    inv2.txt, word number 2
    inv3.txt, word number 2

  ? : banana

'banana' found in the following locations:
    inv3.txt, word number 4

  ? : it

'it' found in the following locations:
    inv1.txt, word number 1
    inv1.txt, word number 4
    inv2.txt, word number 3
    inv3.txt, word number 1

  ? : what

'what' found in the following locations:
    inv1.txt, word number 3
    inv2.txt, word number 1

  ? : a

'a' found in the following locations:
    inv3.txt, word number 3

  ? : q