INDEX DATA
On top of AMA files, the AMB archive may contain a file named DICT.IDX. This
file, if it exists, provides indexing metadata to allow the client to perform
fast and efficient full-text searches across the AMB book.
The index file contains a hash table: a series of 256 16-bit pointers, where
each pointer points to a region of the index structure that contains a list
of words (LoW). The position of a pointer in the table (0..255) is an 8-bit
hash based on the length of the word and its characters. The hash is made of
two nibbles: LC. The high nibble (L) is the length of the word minus 2, while
the low nibble (C) is a simple checksum: the low nibbles of all the word's
characters XORed together. This algorithm can be formalized as follows:
((wordlen - 2) << 4) | ((a & 15) XOR (b & 15) XOR (...))
For example, the word "Disk" would end up being indexed under value 0x25,
because:
((4 - 2) << 4) | ((D & 15) XOR (i & 15) XOR (s & 15) XOR (k & 15))
translates to: (2 << 4) | (4 XOR 9 XOR 3 XOR 11)
which leads to: 32 | 5
resulting in: 37 = 0x25
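The computation above can be sketched in a few lines of Python. This is only
an illustration: the function name amb_hash is hypothetical, and the word is
assumed to be plain ASCII, which this section does not state explicitly.

```python
def amb_hash(word: str) -> int:
    """Return the 8-bit hash: high nibble = length - 2, low nibble = XOR checksum."""
    data = word.encode("ascii")  # assumption: words are plain ASCII
    if not 2 <= len(data) <= 17:
        # The length nibble can only hold 0..15, i.e. lengths 2..17.
        raise ValueError("only words of 2 to 17 characters can be hashed")
    checksum = 0
    for byte in data:
        checksum ^= byte & 15  # XOR the low nibbles of all characters
    return ((len(data) - 2) << 4) | checksum


print(hex(amb_hash("Disk")))  # 0x25
```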
Once the hash is known, the pointer to the matching list of words can be read
from the hash table. A pointer is a 16-bit file offset counted from the start
of the index structure.
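Reading a pointer from the hash table might look like the sketch below. It
rests on two assumptions that this section does not confirm: 16-bit values
are taken to be little-endian, and the position of the hash table inside the
index data is passed in by the caller (table_offset). The function name
low_offset and the sample table are hypothetical.

```python
import struct


def low_offset(index_data: bytes, hash_value: int, table_offset: int) -> int:
    """Return the 16-bit LoW offset stored for hash_value."""
    if not 0 <= hash_value <= 255:
        raise ValueError("hash must fit in 8 bits")
    # Assumption: pointers are stored little-endian, 2 bytes per hash value.
    (offset,) = struct.unpack_from("<H", index_data, table_offset + 2 * hash_value)
    return offset


# Hypothetical table of 256 pointers: 512, 516, 520, ...
table = struct.pack("<256H", *range(512, 512 + 256 * 4, 4))
print(low_offset(table, 0x25, table_offset=0))  # 660
```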
Note that words shorter than 2 characters or longer than 17 characters cannot
be indexed, since the length nibble can only hold the values 0..15. The
presented algorithm also has the interesting side effect of hashing the
upper- and lowercase forms of letters in the ranges a..z and A..Z
identically. An important limitation is that lists of words (LoW) are
addressed through 16-bit offsets, which means that all LoWs must start at an
offset within the first 64 KiB of the file.
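The case-insensitivity follows from ASCII itself: an uppercase letter and its
lowercase form differ only in bit 0x20, which the "& 15" mask discards. A
short illustration (the helper name low_nibbles is hypothetical):

```python
def low_nibbles(word: str) -> list:
    """Return the low nibble of every character, as used by the checksum."""
    return [ord(ch) & 15 for ch in word]


print(low_nibbles("DISK"))  # [4, 9, 3, 11]
print(low_nibbles("disk"))  # [4, 9, 3, 11]

# The two cases of a letter differ only in bit 0x20.
print(hex(ord("D") ^ ord("d")))  # 0x20
```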
Now that we know the offset at which our LoW starts, we can read the words.
First seek to the offset and read a single 16-bit value, which holds the
number of words in the list. Then read the words one after another (all
words in a list have the same length, which is already known from the high
nibble of the hash). Words are always written in lowercase characters. Each
word is followed by a 1-byte value that tells how many files the word has
been found in. Then, that many 32-bit file identifiers follow.
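The reading procedure above could be sketched as follows. As before, this is
an illustration under assumptions: multi-byte integers are taken to be
little-endian and words to be plain ASCII, neither of which is stated in this
section. The function name read_low and the sample data are hypothetical. The
word length would normally be derived from the hash as (hash >> 4) + 2.

```python
import struct


def read_low(index_data: bytes, offset: int, word_len: int) -> dict:
    """Parse one list of words (LoW), mapping each word to its file identifiers."""
    # Assumption: all multi-byte integers are little-endian.
    (count,) = struct.unpack_from("<H", index_data, offset)
    pos = offset + 2
    result = {}
    for _ in range(count):
        # Assumption: words are plain ASCII.
        word = index_data[pos:pos + word_len].decode("ascii")
        pos += word_len
        nfiles = index_data[pos]  # 1-byte count of files containing the word
        pos += 1
        ids = struct.unpack_from(f"<{nfiles}I", index_data, pos)
        pos += 4 * nfiles
        result[word] = list(ids)
    return result


# Hypothetical LoW for hash 0x25 holding two 4-letter words.
blob = (
    struct.pack("<H", 2)
    + b"disk" + bytes([2]) + struct.pack("<2I", 7, 9)
    + b"skid" + bytes([1]) + struct.pack("<I", 3)
)
print(read_low(blob, 0, 4))  # {'disk': [7, 9], 'skid': [3]}
```

Both sample words are permutations of the same letters, so they share the
same length and checksum and therefore land in the same list.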
index format:

  * List of words
    xx    number of words in the list
    ?     word
    x     how many files the word is present in
    xxxx  file identifier 1
    xxxx  file identifier 2
    ...
    xxxx  file identifier n
    (other 255 lists of words follow)

  * hash table
    xx    offset of the LoW for words that match hash 0x00
    xx    offset of the LoW for words that match hash 0x01
    ...
    xx    offset of the LoW for words that match hash 0xff