
Information Systems 32 (2007) 733–754
www.elsevier.com/locate/infosys
doi:10.1016/j.is.2006.06.001

Efficient in-memory extensible inverted file


Robert W.P. Luk (a,*), Wai Lam (b)

(a) Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
(b) Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Shatin, NT, Hong Kong

Received 15 December 2005; accepted 4 June 2006
Recommended by N. Koudas

* Corresponding author. Tel.: +852 2766 5143; fax: +852 2774 0842. E-mail address: csrluk@comp.polyu.edu.hk (R.W.P. Luk).

Abstract

The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built using commodity hardware like personal computers (PCs) can substantially save cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), and the effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing the storage waste or maximizing the storage utilization under different conditions, which lead to different storage allocation schemes. Minimizing storage waste also reduces the number of address indirections (i.e., chainings). We evaluated our storage allocation schemes using a number of reference collections. We found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents is indexed, and the mean number of chainings is not large (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file, whose final storage utilization can likewise be over 90% provided that a sufficient number of documents is indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Information retrieval; Indexing; Optimization

1. Introduction

As more and more data are made available on-line, it becomes increasingly difficult to manage a single inverted file. This difficulty arises from the substantial
resource requirement for large-scale indexing and from the long indexing time, which makes the system vulnerable to unpredictable system failures. For example, the very large collection (VLC) from TREC [1] requires 100 Gb of storage, and the TREC terabyte track requires 426 Gb [2]. The WebBase repository [3] requires 220 Gb, estimated to be only 4% of the indexable web pages. The volume of well-written, non-English content is also increasing. In the



near future, Japanese patent data from NTCIR [4] may be as large as 160 Gb. One way to manage such large quantities of data is to create the index by merging smaller indices, which are built using multiple machines indexing different document subsets in parallel [5]. This limits the impact of system failures to individual machines and increases indexing speed. Acquiring the computing machines in bulk as commodity hardware substantially reduces monetary costs. Also, commodity hardware like personal computers (PCs) makes in-memory inversion an attractive proposition, because RAM for the PC market is relatively cheap and fast, and because RAM has the potential to be upgraded later at lower prices (e.g., DDR-300 RAM to DDR-400 RAM). However, PCs can hold only a relatively small amount of RAM (i.e., 4 Gb) compared with mainframe computers. Efficient RAM utilization becomes an important issue for in-memory inversion using a large number of PCs, because the entire inverted index typically cannot be stored in RAM due to the large volume of data on-line. Instead, the inverted file is typically built in relatively small batches [6,7]. For each batch, a partial index is built and held in RAM, and then written out as a run on disk. The runs are then merged into the final inverted file. Efficient RAM utilization can reduce indexing time by reducing the number of inverted files to merge, because it enables more documents to be indexed per run. During updates, temporary indices are maintained in memory and then integrated into the main inverted file in batches. Lester and Zobel [8] showed that, for different inverted file maintenance methods (i.e., the re-build, re-merge and in-place methods), the amortized time cost of integrating the temporary index with the main inverted file is reduced when more documents are indexed. Therefore, during both initial construction and update, making better use of memory resources can reduce overall costs. With the potential to better balance system resource utilization, as indexing is memory-intensive whereas loading/flushing data is disk- or network-intensive, efficient in-memory inversion is crucial in index construction.

The major contribution of this paper is enhancing existing simple-to-implement single-pass in-memory inversion to be storage-efficient for creating partial inverted files and/or a temporary index, by developing novel storage-efficient allocation schemes that predict the needed storage with

minimal storage waste. The partial index created by our in-memory inversion can be merged with the main inverted file or with other partial inverted files. The temporary index created by our in-memory inversion can also be searched while it is being built. This reduces the latency before recently indexed documents become available for searching, which is important for certain applications (e.g., searching recently available news articles).

An evaluation was carried out to determine which of our storage allocation schemes is the best and whether the results are comparable to existing methods (Section 5). The evaluation used 3.5 Gb of test data from the VLC. The best allocation scheme was the arrival rate scheme, which achieved 95% final storage utilization for this VLC dataset. To ascertain the generality of the results, various independent datasets for both English (TREC-2, TREC-6 and TREC-2005) and Chinese (NTCIR-5) were also used to evaluate the best storage allocation scheme. We also show that the indexing speed of our best storage allocation scheme is similar to the indexing speeds reported by others [6,9].

The rest of this paper is organized as follows. Section 2 discusses our extensible inverted file structures, the modifications of our extensible inverted file to incorporate compressed postings and word positions, and the related storage wastes. This section also provides the rationale behind the choice of data structure for our storage allocation schemes and behind the need to optimize their storage waste. Section 3 describes the first approach to determining optimal node sizes, using a stepwise optimization strategy; this approach results in three related storage allocation schemes. Section 4 discusses the second approach, which determines the optimal node size that minimizes the asymptotic worst-case storage waste per period for individual terms. Section 5 evaluates these storage allocation schemes and discusses the best scheme in terms of storage utilization, the mean number of chainings, robustness in performance and indexing speed. This section also shows that our storage allocation schemes can be used to allocate nodes for compressed postings, using the best storage allocation scheme and variable byte compression as an example. Section 6 discusses the related work and describes how the storage allocation schemes can predict the storage of our extended inverted file


that incorporates compressed postings and word positions. Section 7 concludes the paper.

2. Extensible inverted file and storage wastes

This section describes the structure of the extensible inverted file and the related considerations for handling compressed postings and word position information. It also discusses the storage wastes of the extensible inverted file and the rationale for optimizing them, as well as the rationale for using the variable-size linked list data structure.

2.1. Extensible inverted file

An in-memory inverted file can be considered as a set of inverted lists, which are implemented as linked lists. Each linked list node holds a set of (basic) postings of the form ⟨d_i, tf(d_i, t_k)⟩, where each basic posting consists of a document identifier d_i and the within-document term frequency tf(d_i, t_k) of the kth term in the ith document. The rest of this paper assumes that, unless otherwise indicated, all postings are basic postings. If a linked list node can hold a variable number of postings, then two additional fields of information other than postings are stored in each node, namely a node size variable and an extension pointer. The node size variable specifies the amount of storage allocated for the current node, and the extension pointer facilitates the chaining of nodes. Fig. 1 shows the conceptual structure of the extensible inverted file, implemented using a set of variable-size linked list nodes. The dictionary data structure holds the set of index terms and the start

address of the corresponding variable-size linked list. A new node is allocated whenever the linked list of the kth index term t_k is full (e.g., the linked list of the index term "Network" in Fig. 1) and a new posting for t_k arrives. The size of the new node is determined using one of the storage allocation schemes discussed in the next two sections. If the linked list nodes hold a fixed number of postings per node, then the node size variable can be discarded, saving storage space. Each dictionary entry for an index term has a start pointer and a last pointer, which point to the beginning of the inverted list and to the last linked list node of the inverted list, respectively. The last pointer avoids traversing the linked list when a new posting for the index term is inserted. In this case, during insertion, the last linked list node needs to be exclusively locked to maintain the integrity of the data under concurrent access [10]. To reduce memory usage, the start pointers can be stored in a file, since start pointers are used only for retrieval and not for inserting new postings. For clarity of presentation, additional information in each dictionary entry (e.g., document frequency) is not shown in Fig. 1. In particular, each dictionary entry should hold a variable, say mpos, which indicates the position of the unfilled portion of the last node, to improve the posting insertion speed.

The extensible inverted file can also support a special type of posting for block addressing inverted files [9,11], which index fixed-size text blocks instead of variable-size documents. This special type of posting, called a block-address posting in this paper, has only the d_i field, without the term frequency tf(d_i, t_k) field of the basic posting, where d_i is the block identifier instead of the document identifier and t_k is the kth term.

Fig. 1. The conceptual structure of our extensible inverted file, represented as variable-sized nodes. The start and last pointers point to, respectively, the first and last linked list nodes of the inverted list.
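As a concrete illustration of this layout, the following C sketch shows one possible realization of the variable-size nodes and dictionary entries just described. It is our illustration, not the authors' code: the field widths follow the configuration used later in Section 5 (6-byte postings, 2-byte node size variable), the extension pointer is a native pointer (8 bytes on a 64-bit machine rather than the 4 bytes assumed in the paper), and error handling is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

/* A basic posting: document identifier plus within-document term
   frequency (6 bytes, matching the configuration in Section 5). */
#pragma pack(push, 1)
typedef struct {
    uint32_t doc_id;
    uint16_t tf;
} posting_t;
#pragma pack(pop)

/* A variable-size linked list node: node size variable, extension
   pointer, then a flexible array of posting slots. */
typedef struct node {
    uint16_t size;         /* number of posting slots allocated (ε_s) */
    struct node *next;     /* extension pointer (ε_p); NULL if last   */
    posting_t postings[];  /* posting slots                           */
} node_t;

/* One dictionary entry per index term (zero-initialize before use). */
typedef struct {
    node_t  *start;  /* first node of the inverted list               */
    node_t  *last;   /* last node, so insertion needs no traversal    */
    uint16_t mpos;   /* index of the first unfilled slot in `last`    */
    uint32_t df;     /* document frequency, used by TGR/AR/AAR        */
} dict_entry_t;

/* Append a posting; `new_size` would come from one of the storage
   allocation schemes of Sections 3 and 4. */
static void insert_posting(dict_entry_t *e, posting_t p, uint16_t new_size)
{
    if (e->last == NULL || e->mpos == e->last->size) {  /* last node full */
        node_t *n = malloc(sizeof *n + (size_t)new_size * sizeof(posting_t));
        n->size = new_size;
        n->next = NULL;
        if (e->last) e->last->next = n; else e->start = n;
        e->last = n;
        e->mpos = 0;
    }
    e->last->postings[e->mpos++] = p;
    e->df++;
}
```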


Our storage waste optimization discussed in Sections 3 and 4 can minimize the storage waste of the nodes that store basic postings or block-address postings, because the storage of these postings is constant (i.e., c1) in our storage waste optimization.

The extensible inverted file can support the storage of compressed postings [12], as well as word positions. For compressed postings (e.g., γ coding [13] or variable byte compression [14]), each dictionary entry keeps track of the bit position (again using mpos) or the byte position of the unfilled portion of the last node. A new compressed posting is inserted at mpos as a bit/byte string. If the last node does not have enough memory for a new compressed posting, then the new compressed posting is split between the last node and the newly allocated node.

There are two general approaches to storing postings with word positions. One approach stores the word positions in the nodes; one way to do this (as in [6]) is to store the posting followed by a sequence of word positions, i.e., ⟨d_i, tf(d_i, t_k)⟩ ⟨pos(d_i, t_k, 1), …, pos(d_i, t_k, tf(d_i, t_k))⟩, where d_i is the identifier of the ith document, tf(d_i, t_k) is the within-document frequency of the kth term in the ith document, and pos(d_i, t_k, x) is the xth word position of the kth term in the ith document. In this case, the node size includes the storage to hold the word positions as well as the postings. The other approach stores extended postings of the form ⟨d_i, tf(d_i, t_k), f(d_i, t_k)⟩, where f(d_i, t_k) is the position in an auxiliary file that stores the sequence of word positions of the kth term in the ith document. In this approach, the word positions are stored in the auxiliary file. Whenever a new extended posting is added, the last position of the auxiliary file is stored as f(d_i, t_k) of this new extended posting, and the word positions of the term associated with the new extended posting are appended sequentially to the auxiliary file. These word positions in the auxiliary file can be compressed, for example, using differential coding compression [9,12–16]. If the within-document term frequency is one (i.e., tf(d_i, t_k) = 1), then f(d_i, t_k) of the extended posting can directly store the single word position of the kth term in the ith document, saving both storage and access time. For both approaches to storing word positions, the storage allocation schemes can be modified to determine the node sizes, as discussed in Section 5.
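For the compressed variant, a sketch of a standard variable-byte encoder of the kind cited above [14] is shown below, applied to a posting expressed as a d-gap plus term frequency. This is a generic textbook implementation under our own naming, not the authors' code, and the described splitting of an encoded posting across two nodes is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Variable-byte encode one unsigned integer: 7 data bits per byte,
   with the high bit marking the final byte. Returns bytes written. */
static size_t vbyte_encode(uint32_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 128) {
        out[n++] = (uint8_t)(v & 0x7F);
        v >>= 7;
    }
    out[n++] = (uint8_t)(v | 0x80);  /* terminator byte */
    return n;
}

/* Encode one posting as the d-gap of its document identifier followed
   by the term frequency. `prev_doc` is the previous document
   identifier in the same inverted list (0 before the first posting). */
static size_t encode_posting(uint32_t doc_id, uint32_t prev_doc,
                             uint32_t tf, uint8_t *out)
{
    size_t n = vbyte_encode(doc_id - prev_doc, out);
    n += vbyte_encode(tf, out + n);
    return n;
}
```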

2.2. Rationale for variable-size linked lists

The variable-size linked list data structure is chosen here because it is used in many in-memory inverted files (where the nodes are sometimes called buckets or fixed list blocks) and because it is relatively simple to implement and to analyze. Instead of linked lists, postings can be stored in RAM blocks that expand to hold more postings as they are inserted. This type of RAM block expansion may involve copying and moving data chunks. Our work can be considered as extending the RAM block approach, where each block is pre-allocated with enough storage to hold the expected number of postings, so that copying or moving data chunks in the RAM blocks is unnecessary. This pre-allocation avoids memory fragmentation, and the difficulty is shifted to predicting the expected number of postings for each term instead of relying on advanced storage allocators. If a fast storage allocator is used so that the allocation time is amortized to a constant, then storage utilization may be sacrificed. Instead of using advanced storage allocators, dynamic data structures like hash tables, skip lists and balanced trees can be used, which support deletion as well as insertion of postings. However, the storage utilizations of these dynamic data structures are typically low (i.e., no more than 60% if each node contains at least one 4-byte pointer and one 6-byte posting). It is possible to store multiple postings per node in these data structures, but then the problem of optimizing the storage waste per node re-appears, whether one is dealing with a dynamic data structure (e.g., balanced trees) or a variable-size linked list. Therefore, we propose to use variable-size linked lists in this study because they are simple to program, use simple and fast storage allocators, are commonly used by in-memory inverted files, can easily be adapted to store compressed postings, and can be optimized for storage waste in the same way as other dynamic data structures (e.g., balanced trees).

Our choice of linked lists to store (compressed) postings implies that our extensible inverted files are designed largely for append-only operation, where direct deletions and direct modifications can be avoided. Deletions can be done indirectly by filtering out document identifiers that are known to be invalid (or deleted) using a registry of stale documents [8,17], because search engines can trade off data consistency against availability [17] according to the CAP theorem [18,19].


It is expected that there will be few deletions or modifications during in-memory inversion, because the incoming documents are relatively fresh. On the other hand, deletions or modifications may occur more often once the inverted file has been transferred to disk. Since disk storage cost per byte is much cheaper than RAM, deletion by filtering document identifiers is a practical solution for large-scale search engines. Similar to deletions, modifications can be implemented effectively as a deletion operation (implemented as filtering) followed by a document insertion. When the registry of stale document identifiers becomes large, or when the temporary index is full, the main inverted file on disk can be maintained by the re-building, re-merging or in-place updating approaches [8]. Therefore, the choice of an append-only data structure, like the variable-size linked lists, may not be a severe handicap.

The use of the variable-size linked list representation of inverted lists requires some consideration of how to merge partial inverted files, as follows. The first approach saves the in-memory inverted lists, which are represented as linked lists, as contiguous lists on disk. This requires the system to follow the extension pointers of the linked list nodes when transferring the in-memory inverted file to disk. Tracing all the extension pointers incurs some additional time due to cache misses; however, most of the time cost is due to transferring data to disk, provided that the mean number of chainings per term is not large. Once the in-memory inverted file is transferred to disk as a set of contiguous lists, the conventional inverted file merging algorithm can be used to merge these partial inverted files on disk. An alternative approach dumps the in-memory inverted file onto disk as it is. During the first level of partial inverted file merging, the merging algorithm combines two inverted lists on disk, represented by two sets of linked lists, into a single contiguous inverted list on disk. The merged partial inverted file then has a set of contiguous inverted lists on disk, and it can subsequently be merged with other merged partial inverted files using the conventional merging algorithm. However, following extension pointers on disk requires file seek operations, which incur more time than cache misses. Therefore, we prefer the first approach, because the time cost of a cache miss is less than that of a file seek and because this approach is applicable to the inverted file maintenance methods mentioned by Lester and Zobel [8] (i.e., re-build, re-merge and in-place updates) using

an in-memory temporary index (called the document buffer in [8]).

2.3. Rationale for storage waste optimization

The success of representing inverted lists by linked lists rests on the ability to accurately predict the required storage, so that the RAM storage utilization is maximized. If the final storage utilization is low (say 60%), other data structures that can also support deletion should be used instead. The storage utilization of the extensible inverted file is the ratio of the total storage P of all (compressed) postings to the total storage P + S, where S is the total storage waste. Maximizing the storage utilization U can be considered as minimizing the storage waste of the extensible inverted file:

max{U} = max{ P/(P + S) } ⇔ min{S},

since P is a constant fixed by the particular data collection. Storage waste in the extensible inverted file can be divided into two types: (a) the storage overhead ε, which includes the storage for the extension pointer and for the node size variable; and (b) the latent free storage, which has been allocated but is not yet filled with posting information. If latent free storage were not counted as waste, the optimal linked list node size would be as large as possible, so that the overhead appears minimal relative to the total storage allocated to the node. The storage waste of each node in the extensible inverted file is the sum of these two types of storage waste.
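In terms of the hypothetical node_t and dict_entry_t types sketched in Section 2.1, the two waste types of a single inverted list can be tallied as follows (EPS stands for ε and C1 for the bytes per posting, using the values of Section 5):

```c
#define EPS 6  /* ε = ε_p + ε_s: 4-byte extension pointer + 2-byte size */
#define C1  6  /* c1: bytes per basic posting                           */

/* Storage waste of one inverted list: overhead ε of every chained
   node, plus the latent free space of the last, partially filled node. */
static size_t term_waste(const dict_entry_t *e)
{
    size_t waste = 0;
    for (const node_t *n = e->start; n != NULL; n = n->next)
        waste += EPS;                                    /* type (a) */
    if (e->last != NULL)
        waste += (size_t)(e->last->size - e->mpos) * C1; /* type (b) */
    return waste;
}
```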


There are many advantages to optimizing storage waste. First, it maximizes the storage utilization, which can reduce the number of inverted file merge operations and the amortized time cost of inverted file maintenance [8]. Second, it can indirectly reduce the number of chainings per term, which reduces (a) the time to search the temporary index when it is built on the fly, (b) the time to store the in-RAM inverted lists as chain-free lists on disk when the partial inverted file is transferred to disk, and (c) the number of file seeks when merging two partial inverted files on disk if these files represent inverted lists as linked lists. Third, the analysis of optimizing storage waste applies not only to linked lists but also to other dynamic data structures (e.g., balanced trees) in which each node holds more than one (compressed) posting; in that case, the optimization analysis treats the storage waste of each node of the dynamic structure as a constant ε with a different value.

3. Stepwise storage allocation approach

The stepwise storage allocation approach determines the optimal node size for the incoming documents based on statistics of the current set of indexed documents. This approach optimizes the expected worst-case storage waste E(W(S(N))) after N documents are indexed as follows:

E(W(S(N))) = (1/N)·Σ_{n=1..N} W(S(n)),   (1)

where E(·) is the expectation operator, W(·) returns the worst case of its argument, and S(n) is the storage waste after indexing n documents. The reason for optimizing the expected storage waste is to minimize the area under the curve of storage waste against the number of documents indexed, so that the storage waste is kept to a minimum across the different numbers of documents indexed (up to N documents). This approach assumes that the optimal node size for N documents is close to the optimal node size for N + ΔN documents, where ΔN is small compared with N; this assumption holds when N is sufficiently large. The approach also assumes that the measured system parameters used to determine the optimal node size are smooth, without large discontinuities. Otherwise, parameters (e.g., the size of the vocabulary) obtained after indexing N documents may vary substantially, leading to drastic changes in the optimal node size and implying that we cannot predict the optimal node size from past statistics.

This approach yields three related storage allocation schemes. The first determines the optimal node size after indexing N documents, which is the same as the optimal node size for a static collection of N documents; it is called the fixed-sized node scheme (F16). The second is called the vocabulary growth rate (VGR) scheme. It extends the formula of the F16 scheme by determining the optimal node size from the parameter values extracted at the time a new node is allocated, on the assumption that this optimal node size remains more or less the same between the time the node is allocated and the time it is filled (i.e., the system behavior should be smooth). Unfortunately, the VGR scheme allocates the same optimal node size at a given time instance for common and non-common terms, which are known to have widely different numbers of postings and different desirable node sizes. Thus, the final storage allocation scheme, called the term growth rate (TGR) scheme, determines the optimal node size for individual terms. The first two schemes optimize the expected worst-case storage waste E(W(S(N))) over all terms (as in Eq. (1)). The TGR scheme optimizes the expected worst-case storage waste E(W(S(N, t_k))) for the kth term after indexing N documents, where S(n, t_k) is the storage waste of the kth term after indexing n documents. Similar to Eq. (1), the quantity E(W(S(N, t_k))) is defined as follows:

E(W(S(N, t_k))) = (1/N)·Σ_{n=1..N} W(S(n, t_k)).   (2)

3.1. Fixed-sized node scheme (F16)

The fixed-sized node storage allocation scheme allocates storage to hold B postings in each new node. The overhead ε_p of a node allocated by this scheme is the storage for the extension pointer. Assuming that each posting occupies c1 bytes, the node requires c1·B + ε_p bytes. The storage waste S(n, t_k) for term t_k after indexing n documents is the latent free storage of the last node plus the storage overhead of all the chained nodes. The latent free storage of the last node is c1·(⌈df(n, t_k)/B⌉·B − df(n, t_k)), where df(n, t_k) is the number of documents that contain the kth term and ⌈·⌉ is the ceiling function. The storage overhead of all the chained nodes (including the last unfilled node) is due to the extension pointers, and it amounts to ε_p·⌈df(n, t_k)/B⌉. The storage waste S(n, t_k) for term t_k is therefore

S(n, t_k) = c1·(⌈df(n, t_k)/B⌉·B − df(n, t_k)) + ε_p·⌈df(n, t_k)/B⌉.
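For instance, with the parameter values used later in Section 5 (c1 = 6 bytes per posting, ε_p = 4 bytes) and illustrative values B = 16 and df(n, t_k) = 50 (numbers chosen by us purely to exercise the formula):

S(n, t_k) = 6·(⌈50/16⌉·16 − 50) + 4·⌈50/16⌉ = 6·(64 − 50) + 4·4 = 100 bytes,

i.e., 100 bytes of waste against 50 × 6 = 300 bytes of stored postings, a utilization of 75%.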


The relative frequency estimate of the probability p(t_k) that the kth term appears in a document is df(n, t_k)/n; hence df(n, t_k) ≈ p(t_k)·n. The storage waste for n indexed documents can then be rewritten as

S(n) = Σ_{k=1..D(n)} S(n, t_k) = Σ_{k=1..D(n)} { c1·(⌈p(t_k)·n/B⌉·B − p(t_k)·n) + ε_p·⌈p(t_k)·n/B⌉ },

where D(n) is the number of unique terms after indexing n documents. Since it is hard to optimize the closed form of S(n), which contains discontinuous functions (e.g., the ceiling function), upper and lower bounds on the optimal node size are considered instead. An upper bound W(S(n)) of S(n) is the storage overhead due to the extension pointers of all the chained nodes plus the latent free space, where the last node is assumed to hold no postings. Hence

W(S(n)) = Σ_{k=1..D(n)} { (p(t_k)·n/B + 1)·ε_p + c1·B } ≥ Σ_{k=1..D(n)} S(n, t_k) = S(n).   (3)

A lower bound of S(n) is the total storage of the extension pointers; this bound assumes that there is no latent free space in the last node. Therefore, the lower bound of S(n) is

Σ_{k=1..D(n)} ε_p·(p(t_k)·n/B) ≤ Σ_{k=1..D(n)} S(n, t_k) = S(n).

The two bounds differ only by the amount of latent free space and the storage of one extension pointer per term. As the number of indexed documents increases, the two bounds converge to S(n), since the latent free space becomes small compared with the storage overhead of the set of chained filled nodes. Thus, the optimal node sizes derived from the upper and lower bounds are valid approximations for large collections. By disregarding the storage waste due to the latent free space, the lower bound of the storage waste can be written as

Σ_{k=1..D(n)} ε_p·p(t_k)·n/B = ε_p·(n/B)·Σ_{k=1..D(n)} p(t_k) = ε_p·(n/B)·c(n) ≤ S(n),

where c(n) = Σ_{k=1..D(n)} p(t_k) is called the total nominal (term) arrival rate after indexing n documents. This lower bound is not useful for finding the optimal node size, because it has no optimum for finite values of B and it discounts the latent free storage, encouraging the undesirable allocation of larger-than-necessary nodes. Alternatively, optimizing the upper bound W(S(n)) of S(n) (as in Eq. (3)) limits the storage waste after indexing n documents while accounting for the latent free storage. Based on Eq. (3), the worst-case (upper bound) storage waste after indexing n documents is

W(S(n)) = ε_p·(D(n) + n·c(n)/B) + c1·B·D(n) ≥ S(n).

Substituting W(S(n)) into Eq. (1), the expected worst-case storage waste after indexing N documents is

E(W(S(N))) = (1/N)·Σ_{n=1..N} [ ε_p·(D(n) + n·c(n)/B) + c1·B·D(n) ].

The value of B that yields the global optimum of the expected worst-case storage waste can be found by differentiating E(W(S(N))) with respect to B, i.e.,

dE(W(S(N)))/dB = c1·c_d·D(N) − ε_p·c_p·(N + 1)·c(N)/(2B²),

where c_d and c_p are constants. Since the second derivative of E(W(S(N))) is always positive for positive ε_p, N and B, the global turning point must be a minimum. The optimal worst-case expected storage waste E(W(S(N))) occurs when the first derivative is zero, which yields the optimal node size

B_opt,F(N) = √(c_p/c_d) · √( ε_p·(N + 1)·c(N) / (2·c1·D(N)) )

for a static collection of N documents. Since √(c_p/c_d) is close to one, we take B_opt,F(N) ≈ √( ε_p·(N + 1)·c(N) / (2·c1·D(N)) ) in this paper. We define the quantity a(N) = D(N)/N, which is a measure of the vocabulary growth rate. Using this quantity, B_opt,F(N) simplifies to

B_opt,F(N) ≈ √( ε_p·c(N) / (2·c1·a(N)) ).   (4)

However, the vocabulary size and the number of documents typically follow Heaps' law [20], where D(N) = p·N^m with constants p and m; therefore, Eq. (4) can be re-expressed as

B_opt,F(N) ≈ √( ε_p·N^(1−m)·c(N) / (2·c1·p) ).   (5)

This shows that when the vocabulary size follows Heaps' law, there is no optimal fixed node size for collections that grow indefinitely.
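To gauge the scale of Eq. (4), take ε_p = 4 and c1 = 6 (the values used in Section 5) together with hypothetical end-of-run estimates c(N) ≈ 380 and a(N) ≈ 0.5; these two inputs are our assumptions for the arithmetic only, not measurements from Fig. 2:

B_opt,F(N) ≈ √( (4 × 380) / (2 × 6 × 0.5) ) = √253.3 ≈ 15.9 postings per node,

which is of the same order as the calculated optimum of 15.9 postings reported for the VLC subset in Section 5.3.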


However, the optimal node sizes determined by Eqs. (4) and (5) grow slowly as N increases, which is consistent with the assumption that the optimal node size after indexing N documents is similar to the optimal node size after indexing N + ΔN documents. For simplicity, Eq. (4) is used to derive the optimal node size of the next storage allocation scheme for dynamic collections; we check whether Eq. (4) is a reasonable approximation by comparing its predicted value with the experimentally determined optimal node size in Section 5.

3.2. Vocabulary growth rate scheme (VGR)

The previous (F16) scheme is not suitable for dynamic collections because (a) the vocabulary size is not constant (Fig. 2) and (b) different document sets and different languages have different vocabulary growth rates a. Therefore, the VGR scheme extends the F16 scheme to dynamic collections by assuming that the optimal node size determined after indexing N documents remains similar in the near future (i.e., when the new node is allocated), i.e., E(W(S(N))) ≈ E(W(S(N + ΔN))), where ΔN is small compared with N. Effectively, the parameters for determining the node size have to be smooth over time, which is likely to be true when the number of indexed documents is large. There are at least two choices of parameters, based on Eqs. (4) and (5). The two choices differ in whether a(N) is estimated after indexing N documents, or p and m are estimated after indexing N documents.

We decided to estimate a(N) to obtain the optimal node size because this is simpler than estimating two parameters (i.e., p and m). Here, a(N) is estimated in a piece-wise linear manner using differencing operations (i.e., â(N) = ΔD/ΔN) at different time points. Hence, B_opt,F(N) in Eq. (4) becomes B_opt,V(N), whose constant storage overhead is ε (including the node size variable) instead of ε_p, i.e.,

B_opt,V(N) ≈ √( ε·c(N) / (2·c1·â(N)) ).   (6)

In practice, c(N) is approximated by summing the estimated arrival rates of all terms after indexing N documents (i.e., c(N) ≈ Σ_{k=1..D(N)} df(N, t_k)/N, where df(N, t_k) is the number of documents containing t_k after indexing N documents). Fig. 2 shows the vocabulary growth rate estimated by backward differencing the number of distinct terms after indexing N and N−1 documents, using the TREC VLC data [1]. The estimated value of a is not smooth, violating the assumption made for the VGR scheme. To reduce discontinuities, â(N) is smoothed by taking the running average of a fixed number (i.e., 90 in this case) of past backward differences. In Fig. 2, â(N) is smoother than the raw backward-difference estimates of a(N), and the (calculated optimal) storage allocation becomes smoother over the number of documents indexed. According to Fig. 2, the total nominal arrival rate c(N) is already very smooth, and it is used directly to estimate the optimal node sizes.

Fig. 2. The vocabulary growth rate (by backward differencing), the estimated total nominal arrival rate c(n), the estimated vocabulary growth rate â(n) obtained by averaging a fixed number of past backward differences (i.e., the past 90 values), and the allocation of storage based on Eq. (5).
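The smoothing just described is straightforward to implement. The sketch below (our own naming, assuming one backward difference ΔD/ΔN is recorded per indexed document) keeps a circular buffer of the last 90 differences and evaluates Eq. (6):

```c
#include <math.h>

#define WIN      90    /* number of past backward differences averaged */
#define EPS_NODE 6.0   /* ε: extension pointer + node size variable    */
#define C1_BYTES 6.0   /* c1: bytes per posting                        */

static double diffs[WIN];  /* circular buffer of ΔD/ΔN values */
static int    nseen;       /* how many differences recorded   */

/* Record D(N) − D(N−1) for the latest document and return the
   smoothed vocabulary growth rate estimate â(N). */
static double smoothed_a(double backward_diff)
{
    diffs[nseen % WIN] = backward_diff;
    nseen++;
    int k = nseen < WIN ? nseen : WIN;
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += diffs[i];
    return sum / k;
}

/* Eq. (6): optimal VGR node size, assuming a_hat > 0. */
static double b_opt_v(double c_N, double a_hat)
{
    return sqrt(EPS_NODE * c_N / (2.0 * C1_BYTES * a_hat));
}
```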


The remaining parameters (i.e., ε and c1) for B_opt,V(N) are constants.

3.3. Term growth rate scheme (TGR)

The previous (VGR) scheme allocates nodes of the same size for common and non-common terms alike if the number of indexed documents is the same. This appears counter-intuitive, since common terms are expected to occur in more documents, and hence to generate more postings, than non-common terms. Therefore, the storage waste should be optimized for individual terms, not for the aggregate storage waste of all index terms (as in VGR). For the TGR scheme, we optimize the expected worst-case storage waste E(W(S(N, t_k))) for term t_k after indexing N documents. Similar to E(W(S(N))) in Eq. (1), E(W(S(N, t_k))) is defined as in Eq. (2), based on the quantity W(S(n, t_k)), the worst-case (i.e., upper bound) storage waste for storing the postings of the kth term after indexing n documents. By analogy to W(S(n)) in Eq. (3),

W(S(n, t_k)) = (q(t_k, n) + 1)·ε + c1·B(t_k)

after indexing n documents, where:

- ε is the constant storage overhead of a variable-size linked list node, which is the sum of the storage for the extension pointer ε_p and the node size variable ε_s (i.e., ε = ε_p + ε_s);
- q(t_k, n) is the number of chained and filled nodes for the term t_k after indexing n documents;
- B(t_k) is the node size for the term t_k.

If the number of documents indexed is large, then the size of each node is approximately the same as B(t_k). Therefore, q(n, t_k) is approximated as df(n, t_k)/B(t_k) (≈ n·p(t_k)/B(t_k)), and W(S(n, t_k)) becomes

W(S(n, t_k)) ≈ (n·p(t_k)/B(t_k) + 1)·ε + c1·B(t_k).

After some calculus manipulation, the optimal node size B_opt,T(N, t_k) of t_k for this scheme after indexing N documents is approximated as

B_opt,T(N, t_k) ≈ √( ε·df(N, t_k) / (2·c1) ).   (7)

Since the node size depends on df(N, t_k) (i.e., the number of documents that contain term t_k after indexing N documents), it varies with the number of indexed documents. As df(N, t_k) grows large, B_opt,T(N, t_k) increases only slowly, since it scales with √(df(N, t_k)), which grows much more slowly than df(N, t_k). This is consistent with our approximation that q(N, t_k) ≈ df(N, t_k)/B(N, t_k) after indexing N documents.
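As a worked instance of Eq. (7) with ε = 6 and c1 = 6 (the values used in Section 5), and df values that are ours, chosen only for illustration: a common term with df(N, t_k) = 10,000 would be given nodes of

B_opt,T(N, t_k) ≈ √( (6 × 10,000) / (2 × 6) ) = √5000 ≈ 70.7, i.e., about 71 postings,

whereas a rare term with df(N, t_k) = 100 would receive √( (6 × 100)/12 ) ≈ 7 postings per node.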
4. Stationary storage allocation approach

The stationary storage allocation approach consists of two related schemes that allocate new nodes periodically under some steady-state conditions. The period u is measured in terms of the number of documents, and it is the same constant for all terms. The two storage allocation schemes of this approach minimize the asymptotic worst-case storage waste W(U(∞, t_k)) of the term t_k per period:

W(U(∞, t_k)) = lim_{r→∞} W(U(u·r, t_k)) = lim_{r→∞} W(S(u·r, t_k))/r,   (8)

where W(S(u·r, t_k)) is the worst-case storage waste for the term t_k after indexing u·r documents. The first storage allocation scheme is called the arrival rate (AR) scheme; it allocates storage based on the arrival rate of a term, scaling the node size by the largest node size that can be represented by the node size variable. The second scheme is called the adaptive arrival rate (AAR) scheme; it extends the AR scheme with a running estimate of the arrival rate of each term, so that larger nodes can be allocated quickly in response to any sudden upsurge of term occurrences in documents.

4.1. AR allocation scheme

Given that the system is under steady state and has indexed u·r documents, the storage waste is the storage overhead due to the extension pointers of the r+1 nodes (including the new node) plus the latent free space of the new node (i.e., c1·(B_O(u·r, t_k) − 1), where B_O(u·r, t_k) is the size of the node determined by this scheme after indexing u·r documents and the new node already stores one posting). Therefore, the worst-case storage waste W(S(u·r, t_k)) for the term t_k is


W(S(u·r, t_k)) = (u·r·p(t_k)/B_O(u·r, t_k) + 1)·ε_p + c1·(B_O(u·r, t_k) − 1),

where u·r·p(t_k) is the number of documents that contain the kth term after indexing u·r documents, and the new node stores only the new posting. Substituting the above into Eq. (8), the asymptotic worst-case storage waste per period W(U(∞, t_k)) is

W(U(∞, t_k)) = lim_{r→∞} [ (u·r·p(t_k)/B_O(u·r, t_k) + 1)·ε_p/r + c1·(B_O(u·r, t_k) − 1)/r ].

Taking the limits inside, W(U(∞, t_k)) becomes

W(U(∞, t_k)) = ε_p·lim_{r→∞} u·p(t_k)/B_O(u·r, t_k) + ε_p·lim_{r→∞} 1/r + c1·lim_{r→∞} (B_O(u·r, t_k) − 1)/r = ε_p·u·p(t_k)/B_O(∞, t_k).

Since at least one new node is expected to be allocated for every u documents indexed (otherwise u cannot be the period), W(U(∞, t_k)) ≥ ε_p. Using this inequality and the above, we obtain the optimal asymptotic node size B_opt,O(∞, t_k) as follows:

ε_p·u·p(t_k)/B_opt,O(∞, t_k) ≥ ε_p  ⇒  B_opt,O(∞, t_k) ≤ u·p(t_k).

The period u is set to the largest number m of postings that can be stored in any node. This number depends on the number of bytes ε_s allocated to the node size variable (i.e., m = 2^(8·ε_s)). In order to ensure that the node size B_opt,O(N, t_k) lies between 1 and m (= u), B_opt,O(N, t_k) is approximated as

B_opt,O(N, t_k) ≈ (m − 1)·(df(N, t_k)/N) + 1,   (9)

where N is the number of documents indexed, df(N, t_k) is the number of documents that contain the term t_k after indexing N documents, and df(N, t_k)/N is the relative frequency estimate of p(t_k).
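A minimal sketch of the computation in Eq. (9), assuming the ARp configuration evaluated in Section 5 (a 2-byte node size variable counting postings, so m = 65,535); the names are ours:

```c
#include <stdint.h>

#define M 65535u  /* largest node size expressible by the 2-byte node
                     size variable; the m of Eq. (9) (ARp variant)    */

/* Eq. (9): scale the relative-frequency estimate df(N, t_k)/N of the
   term's arrival rate into a node size in [1, m]. */
static uint32_t b_opt_ar(uint32_t df, uint32_t n_docs)
{
    return (uint32_t)(((uint64_t)(M - 1) * df) / n_docs) + 1u;
}
```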

4.2. AAR allocation scheme

If the arrival rate of individual terms is allowed to vary slowly over time, then the arrival rate can be approximated piece-wise linearly, i.e., df(N, t_k)/N ≈ (df(N, t_k) − df(N−L, t_k))/(N − L) for some time lag L. The time lag L can be defined so that there is no additional storage waste, as follows. When the last node is full, the earliest document identifier id_1 in the last node is, say, N − L. The difference between id_1 and N is the number of documents indexed between the current document and id_1. The number of postings B_opt,A(N−L, t_k) in the node allocated by this scheme is the number of documents between the current document and id_1 that contain t_k. Thus, B_opt,A(N−L, t_k) = df(N, t_k) − df(N−L, t_k) and df(N, t_k)/N ≈ B_opt,A(N−L, t_k)/(N − id_1).

One problem with this estimate is that the number of postings in the node is not large for terms with small arrival rates. It is possible for a term to occur in two consecutive documents and never again afterwards; in that case, the storage allocation would have large errors. In view of this, the number of postings in a node should be larger than some constant b (> 1). Substituting the approximation of df(N, t_k)/N into Eq. (9), the number of postings to be allocated is

B_opt,A(N, t_k) = max{ (m − 1)·(B_opt,A(N−L, t_k)/(N − id_1)) + 1, b }.   (10)

5. Comparing storage allocation schemes

This section examines which of the storage allocation schemes discussed in the last two sections is the best in terms of storage utilization and the number of chainings (or address indirections). Afterwards (in Section 5.5), we evaluate whether the best storage allocation scheme achieves good performance on different datasets, in order to ascertain its generality.

5.1. Set up

A subset of the VLC [1] from TREC, from NEWS01 to NEWS04, is used for the evaluation in Sections 5.1–5.4. This subset requires 3.5 Gb to store 1.67 million documents. This amount of data was used because the evaluation was carried out by constructing the extensible inverted file in memory, using a SUN server with 4 Gb of RAM. The extension pointer occupies 4 bytes (i.e., ε_p = 4) and the node size variable occupies 2 bytes (i.e., ε_s = 2). With the exception of the arrival rate scheme, the node size variable indicates the number of postings that the node can store by default. Each posting requires 6 bytes to store the document identifier and the associated term frequency.

The storage allocation schemes discussed in the last two sections are given acronyms as follows. The (optimal) fixed-sized node scheme is F16, whose node size is determined using Eq. (4). Likewise, the vocabulary growth rate scheme is VGR, using Eq. (6), and the term growth rate scheme is TGR, using Eq. (7). For the arrival rate scheme, which uses Eq. (9) to determine node sizes, two variants are examined in order to study the effect of the node size variable ε_s. One variant, denoted ARb, uses ε_s to specify the node size in bytes. The other, denoted ARp, uses ε_s to specify the node size in postings. AAR refers to the adaptive arrival rate scheme, which uses Eq. (10) to determine node sizes. The minimum number b of postings allocated by the AAR scheme is two.

5.2. Performance measures

Two performance measures of success are used here. First, storage utilization is measured, defined as the storage (in bytes) for all postings divided by the total inverted file storage (in bytes), which includes the storage for the postings, the overhead storage (i.e., ε) and the latent free storage. This storage utilization is a micro-average measure because it is computed over the index as a whole; it therefore directly indicates the amount of RAM used by the index. Alternatively, a macro-average storage utilization could be measured as the average of the per-term storage utilizations. This macro-average is not preferred because it cannot directly indicate the amount of RAM used by the index, since term occurrence statistics are highly skewed (a sketch contrasting the two averages follows below). It is also important to examine the development of storage utilization over time, because (a) we assume that the system behavior does not change abruptly (i.e., it is smooth), and (b) we are optimizing the expected storage over all N indexed documents or over regular intervals of u documents.
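As a concrete reading of these two definitions, the sketch below (our own illustration with invented array names) computes both averages; bytes_postings[i] and bytes_total[i] hold the posting storage and total allocated storage of the ith term:

```c
/* Micro-average: one ratio over the whole index, so it directly
   reflects the RAM actually consumed. */
static double micro_utilization(const double *bytes_postings,
                                const double *bytes_total, int nterms)
{
    double p = 0.0, t = 0.0;
    for (int i = 0; i < nterms; i++) {
        p += bytes_postings[i];
        t += bytes_total[i];
    }
    return p / t;
}

/* Macro-average: mean of the per-term ratios; dominated by the many
   rare terms, because term statistics are highly skewed. */
static double macro_utilization(const double *bytes_postings,
                                const double *bytes_total, int nterms)
{
    double u = 0.0;
    for (int i = 0; i < nterms; i++)
        u += bytes_postings[i] / bytes_total[i];
    return u / nterms;
}
```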

Second, the number of chainings per index term is also an important measure, since it indicates the likelihood of cache misses. If the inverted lists are stored on disk exactly as they are in RAM, the number of chainings per index term indicates the minimum number of logical file seeks per inverted list, which is a major factor in determining retrieval speed. Obviously, the minimum number of chainings is one, because of the chaining from the dictionary to the first node.

The storage needed for the parameters of the storage allocation schemes is insignificant compared with the storage of the final extensible inverted file. The F16 scheme holds 90 backward difference values to compute the running average of the vocabulary growth rate, plus a few scalar values (e.g., ε, c1). For the TGR scheme, no backward difference values are stored. Instead, the document frequency df(N, t_k) after indexing N documents is stored for each term t_k in the dictionary data structure, together with the start and last pointers (not shown in Fig. 1). Alternatively, df(N, t_k) can be stored in the storage allocated for the extension pointer of the last linked list node of t_k, thereby obviating the need for any additional storage. Because df(N, t_k) is needed only when new nodes are allocated, neither method needs to maintain df(N, t_k) at every posting insertion, which saves processing time. Similar to the TGR scheme, the AR and AAR schemes store df(N, t_k) for each index term, plus a parameter value for m. Based on this discussion, the storage for the parameter values of the storage allocation schemes is not significant.

Many inverted file construction methods [21–23] are known to index at high speed. However, it is difficult to obtain a valid comparison of indexing speed, because indexing speed depends on many practical factors, like operating system settings and detailed programming optimizations that may not be part of the retrieval system per se. For example, our system could index faster using a faster dictionary lookup data structure (e.g., burst tries [24]). In Section 5.6, we compare the indexing speed of our storage allocation schemes with existing ones [6,9].

5.3. Storage utilization

An experiment was carried out to assess whether the optimal fixed node size determined using Eq. (4) is close to the observed result. Since the size is fixed, there is no node size variable (i.e., ε_s = 0) and the overhead is just the storage of the next pointer (i.e.,


4 bytes). The node with only one posting has a constant storage utilization rate (i.e., 60%), independent of the number of indexed documents. Fig. 3 shows how storage utilization varies with the number of indexed documents for different fixed node sizes. In general, the larger the fixed node size, the longer it takes for the storage utilization to reach its asymptotic value, which is expected since it takes longer to fill the latent free storage. Notice that the storage utilization curves are smooth, consistent with the assumption that the system behavior is smooth. It is difficult to determine the best final storage utilization from Fig. 3, since the final storage utilization rates for fixed-sized nodes of 12, 16 and 20 postings per node were very close to each

other after indexing 1.67 million documents. Fig. 4 was therefore plotted to visualize the best final storage utilization rate after indexing 1.67 million documents; it shows how the final storage utilization rate varies with the node size. The best final storage utilization rate is 92.5%, achieved using 16 postings per node. This node size is very close to the optimal value of 15.9 postings calculated using Eq. (4). The final storage utilization is not sensitive to the exact value of the optimal fixed node size. Therefore, using a near-optimal fixed-sized node slightly larger than the true optimum is preferred, because chaining performance improves and because some margin is provided for indexing more documents later.

Fig. 3. Storage utilization performance with different fixed-sized nodes (3, 6, 9, 12, 16, 20 and 24 postings per node) against the number of documents indexed.

Fig. 4. Final storage utilization performance after indexing 1.67 million documents for different fixed-sized nodes. The optimal node size is determined using Eq. (4); the best observed node size is 16 postings, against a calculated optimum of 15.9 postings.


In addition, since the storage utilization is insensitive to the exact value of the optimal node size, the approximations and assumptions made for the VGR and TGR schemes should hold.

The storage utilization curves over the number of indexed documents are shown in Fig. 5 for the different storage allocation schemes. The TGR and ARp schemes tied for the best asymptotic utilization rate, followed by ARb and F16, and then by VGR. Although ARp and TGR asymptotically achieved similar storage utilization, TGR reached the asymptotic performance earlier than ARp. Surprisingly, the storage utilization of AAR is the worst. Since the node size is at least two postings per node for the AAR scheme (i.e., b = 2), the lower storage utilization is due to latent free space that was allocated to nodes but never used. This suggests that the estimated probabilities were too large, i.e., p(t_k) << B_opt,A(N−L, t_k)/(N − id_1). Such overestimation arises because the denominator N − id_1 was small, so that (a) B_opt,A(N−L, t_k) can easily be close to N − id_1, causing overestimation, and (b) quantization errors of 1/(N − id_1) become significant, as they are amplified by a factor of m (i.e., 65,535 in this case). For example, suppose that p(t_k) = 0.65 but the closest estimate based on relative frequency is 0.75 for N − id_1 = 4. The estimation error of 0.1 results in allocating 6554 (i.e., 65,535 × 0.1) more postings than necessary. If the quantization errors instead cause underestimation, then the node is filled more quickly and a new node is simply allocated. Therefore, there is a bias towards allocating more memory than necessary.
These quantization errors are due to changes in the topic focus of the incoming documents. When there is a new topic, a new node is allocated, and it is quickly filled since the same term occurs in several incoming documents. When a new node is allocated again, there will be large quantization errors because the previous node was small (i.e., N − id_1 is small). If there is a topic change at this point, the allocated large node will remain mostly unfilled.

It is surprising that the asymptotic storage utilization of the VGR scheme is worse than that of the F16 scheme, because the VGR scheme is more sophisticated. Since the storage utilization curves for the F16, VGR and TGR schemes in Fig. 5 are smooth, the assumption that the optimal node size does not vary significantly over the near future should hold (i.e., E(W(S(N))) ≈ E(W(S(N+ΔN)))). Therefore, the relatively lower storage utilization of VGR is due to problems with the estimation of a. According to Fig. 2, as the number of documents indexed increases, a decreases towards zero. However, the optimal node size varies as the reciprocal square root of a, so the relative errors become noticeable as a tends to zero after indexing more and more documents. These estimation errors of a translate into large changes in the optimal node sizes as the number of documents indexed increases, because they are amplified by the semi-monotonically increasing function c(N). Over-allocation caused by under-estimating a is also easier to notice than under-allocation caused by over-estimating a, because when the node size is too small, new nodes are simply allocated.

Fig. 5. The storage utilization of different storage allocation schemes against the number of indexed documents.


The storage utilization of the TGR scheme was better than that of the VGR scheme, as expected, and was among the best, together with ARp. TGR owes its success to optimizing node sizes for individual terms, as well as to the fact that the estimation error decreases as the number N of indexed documents increases (i.e., (N+1)·p(t_k) ≈ df(N, t_k)). Since the node size was optimized for individual terms, the TGR scheme approaches the asymptotic storage utilization much more quickly than the VGR scheme.

In Fig. 5, the storage utilization curve of the ARp scheme shows a saw-tooth pattern, repeating approximately every 65,000 documents, which roughly corresponds to the period u. The pattern can be explained as follows. When most of the nodes are filled, the storage utilization is at a (local) peak. Immediately after the local peak, new nodes are allocated, so the storage utilization drops relatively quickly, producing a nearby (local) trough. As the nodes fill steadily, the storage utilization improves steadily until most of the nodes are filled again. Similarly, the storage utilization of the ARb scheme has a saw-tooth pattern, but it is less apparent than for ARp because the period u is much shorter for ARb (as m is smaller). Although the asymptotic storage utilization of the ARp scheme outperformed that of the ARb scheme by only a small margin, the ARb scheme reaches its asymptotic storage utilization much more quickly. However, if time is measured in the number r of saw-tooth cycles (of length u), then both the ARb and ARp schemes need about 20 cycles to reach the asymptotic storage utilization. Therefore, the rate of convergence to the asymptotic storage utilization should be measured in the number of saw-tooth cycles rather than in physical time or the number of documents indexed.

5.4. Determining the best scheme

Fig. 6 can be used to find the best storage allocation scheme. The access time is defined as the mean number of chainings (or address indirection operations) per index term for retrieval. The best scheme is ARp (i.e., nearest to the top left-hand corner in Fig. 6). In general, the mean number of chainings per term for F16 is larger than that of the more

sophisticated schemes, as expected. Also, schemes that optimize node sizes for individual terms (i.e., TGR, AR and AAR) have better chaining performance than schemes that optimize node sizes over the entire vocabulary (i.e., F16 and VGR). ARp has the best storage utilization rate (95%) and the second-smallest mean number of chainings (i.e., 1.8) after AAR (i.e., 1.5), which has the worst storage utilization rate (40%). Since AAR has a much better mean number of chainings per term than any other scheme but a much lower storage utilization rate, this confirms that AAR allocated nodes that were too large, resulting in more latent free storage and less chaining. For the AR schemes, the mean number of chainings per term for ARp is much better than that for ARb, because larger node sizes, due to scaling with a larger value of m, result in less chaining, and the latent free storage was eventually used as more documents arrived, unlike for AAR. This suggests that the prediction of large nodes by the arrival rate scheme is quite robust, since scaling the value of m substantially did not have an impact on the storage utilization. Otherwise, prediction errors of node sizes would have been translated into useless latent free space, substantially degrading storage utilization, as for AAR. In principle, according to the law of large numbers, the prediction of p(t_k), whose quality is measured by the variance σ_k of the estimate of p(t_k), improves with the number n of indexed documents by a scaling factor of 1/√n.

TGR, which tied with ARp for the best asymptotic storage utilization rate, has more chainings than ARp. This is not entirely surprising, since ARp can allocate large nodes for common terms, as the node sizes scale linearly with df(N, t_k) and the maximum node size m, whereas, for common terms, TGR can only allocate node sizes that grow sub-linearly, with √(df(N, t_k)), and without any knowledge of the maximum node size m. The impact of m can be observed by comparing the performances of the ARp and ARb schemes (Fig. 6), where m = 65,535 for ARp and m = 255 for ARb. Therefore, the mean chaining performance of TGR can be better than that of the AR scheme if m is sufficiently small. On average, ARp required just half the number of chainings of TGR. Since TGR reaches its best storage utilization much more quickly than ARp, TGR may be the better storage allocation scheme in the special case where the available RAM is small and storage utilization is paramount, out-performing the ARb and ARp schemes.


Fig. 6. Scatter diagram of the final (or near-asymptotic) storage utilization against the mean number of chainings per index term for different storage allocation schemes.

5.5. Evaluating robustness of the best schemes

In this subsection, we evaluate the robustness of the two best storage allocation schemes found in the previous subsection, ARp and TGR, using four datasets: a subset of the TREC-2 English dataset, the TREC-6 English dataset, the TREC-2005 Robust track dataset and the NTCIR-5 Chinese dataset. The statistics of these datasets are given in Table 1. The TREC datasets are articles in English, and they have about one-third to two-thirds of the number of documents used in the previous evaluation. The TREC-2 dataset is included for the indexing speed comparison. We chose the NTCIR-5 Chinese dataset for evaluation because the Chinese language is very different from alphabetic languages like English; our system uses a Chinese word indexing strategy [25] that indexes the longest word in a given Chinese word list matching the running text. The datasets used here are newswire articles, whereas the previous VLC dataset is from the web and partly contains newsgroup data. Except for the TREC-2 data, each document is stored in its own file, instead of a batch of documents being read from a single file. The disk storage for content is calculated by counting the actual number of bytes occupied by the document content, while the storage of the allocated disk blocks is the number of disk blocks used times the disk block size; the latter is obtained using the du facility in Linux.

The SUN server used in the previous evaluation had a heavy load, multitasking various resource-demanding jobs, which makes timing the indexing processes difficult.

The datasets used here are newswire articles, whereas the previous VLC dataset is from the web and some of it contains newsgroup data. Except for the TREC-2 data, each document is stored in its own file, rather than reading a batch of documents from a single file. The disk storage for content is calculated by counting the actual number of bytes occupied by the document content. The storage of the allocated disk blocks is the number of disk blocks used times the disk block size; it is obtained using the du facility in Linux. The SUN server used in the previous evaluation had a heavy load, multitasking various resource-demanding jobs that made timing the indexing processes difficult. Instead, we used a PC-server in this evaluation for timing purposes because the PC-server has less load. We used the CPU time because (a) the PC-server may be loaded by other users, since it is a computing node in our computer cluster, (b) it is easier to compare performance using CPU time, and (c) the inversion is done in RAM. The PC-server has an AMD Opteron 242 (1.6 GHz) processor with 1 Mb cache and 3 Gb RAM (DDR 300 MHz). The spindle speed of the disk is 7200 rpm. This is a reasonably fast PC-server, although it is not the fastest at present.

We obtain the predicted storage utilization in Table 2 by looking up the storage utilization curves of ARp and TGR in Fig. 5 using the number of indexed documents rounded to the nearest hundred, i.e., 566,100 for TREC-6, 1,033,500 for TREC-2005 and 901,500 for NTCIR-5. The predicted final storage utilizations based on the operating curves of ARp and TGR in Fig. 5 are not very different (within three percentage points) from the final storage utilizations achieved by the ARp and TGR schemes for the three datasets in Table 2. The mean number of chainings per term for the ARp scheme remains less than two, similar to the result in the previous subsection (i.e., 1.8). By comparison, the mean number of chainings per term for the TGR scheme is at least twice that of the ARp scheme. Therefore, the ARp scheme is preferred over the TGR scheme if the mean number of chainings per term is an important performance measure.

Table 1
Statistics of the test datasets

Dataset                                   TREC-6      TREC-2005    NTCIR-5     TREC-2 (a)
Language                                  English     English      Chinese     English
Number of documents                       566,077     1,033,461    901,446     510,637
Number of files                           566,077     1,033,461    901,446     1,025
Number of unique index terms (x 10^6)     2.22        1.31         1.32        0.62
Number of postings (x 10^6)               91.5        180          213         58
Storage for content (Gb)                  2.1         3.0          1.0         0.9
Storage of allocated disk blocks (Gb)     3.43        5.50         3.66        1.24

(a) TREC-2 is for comparison.

Table 2
Indexing efficiency of the ARp and TGR schemes

                                                        TREC-6          TREC-2005       NTCIR-5
Final storage utilization (predicted using Fig. 5)
  ARp                                                   88.3% (88.9%)   93.0% (93.1%)   94.8% (92.5%)
  TGR                                                   90.9% (93.0%)   93.4% (94.1%)   94.5% (93.8%)
Mean number of chainings per term
  ARp                                                   1.26            1.90            1.75
  TGR                                                   2.57            5.44            6.58
Indexing time (seconds)
  ARp                                                   2074            3322            4563
  TGR                                                   2178            3343            4493

5.6. Indexing speed comparison

A subset of the TREC-2 dataset was used in [9], where the reported indexing speed is one of the highest. We used this dataset to show that the time efficiency of our indexing scheme is not significantly lower than that in [9]. We also use the results of a recent single-pass in-memory inversion method by Heinz and Zobel [6], because their work is similar to ours and because they report data on their document-level inverted index construction process, which eases comparison. For valid comparisons, the TREC-2 files as distributed by TREC are used; these files typically contain more than one document, which saves a noticeable amount of time (about 100 s in this experiment) in finding and reading the files. We also changed our tokenizer (as used for the other datasets here) to a simpler one that extracts tokens as strings over the set of alphanumeric characters, as in [9]. We observe that the indexing time is a function of the total number of occurrences of all index terms. Therefore, we use the number of terms indexed per second to compare indexing speed.
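Such a tokenizer amounts to emitting maximal runs of alphanumeric characters. The sketch below is a plausible rendering under that description, not the system's actual code; the emit callback is a hypothetical stand-in for the posting routine.

```c
#include <ctype.h>
#include <stddef.h>

/* Emit each maximal run of alphanumeric characters in text as one
 * token, as in the simplified tokenizer of [9]. */
void tokenize(const char *text, void (*emit)(const char *tok, size_t len))
{
    const char *start = NULL;            /* start of the current token */
    for (const char *p = text; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (start == NULL)
                start = p;               /* token begins */
        } else {
            if (start != NULL)
                emit(start, (size_t)(p - start));  /* token ends */
            start = NULL;
            if (*p == '\0')
                break;                   /* end of input */
        }
    }
}
```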

Table 3 shows the estimated indexing rates of the single-pass index construction method by Heinz and Zobel [6], our ARp scheme and the block addressing inverted index construction by Navarro et al. [9], in terms of the number of terms indexed per second. In Table 3, we observe that the indexing rates of all the different schemes over different collections are similar (i.e., over 210,000 terms indexed per second). The estimates of the indexing rates of the index construction method by Heinz and Zobel [6] are based on their lowest reported indexing time for document-level inverted index construction (i.e., their best results). However, their results are based on elapsed time. Our results in Table 3 show that comparisons of indexing rates need to be interpreted carefully because they depend on how the indexing rate is measured and on many other factors, e.g., the file organization of the documents and the tokenization process. The final storage utilizations of the ARp scheme for the TREC-2, TREC-6 and TREC-2005 data are within one percentage point of the storage utilizations predicted using the ARp curve in Fig. 5 (i.e., 88.7% for TREC-2, 88.9% for TREC-6 and 93.1% for TREC-2005).

Table 3
Indexing rate in terms of the number of terms indexed per second

Reference             Collection   # index terms (x 10^6)   Time (s)    # index terms per second (x 10^3)   Final storage utilization   Mean # chainings per term
Heinz and Zobel [6]   WebV         324 (a)                  1475 (a)    220                                  -                           -
Heinz and Zobel [6]   WebXX        1262 (a)                 5763 (a)    219                                  -                           -
Navarro et al. [9]    TREC-2       137 (b)                  600         229                                  -                           -
Ours (ARp)            TREC-2       137                      494         278                                  88.5%                       1.25
Ours (ARp)            TREC-6       239                      939         254                                  89.5%                       1.61
Ours (ARp)            TREC-2005    355                      1453        245                                  94.0%                       2.56

(a) Based on data in [6].  (b) Based on our data.

We observe that the final storage utilizations of the ARp scheme for the TREC-6 and TREC-2005 data using the original tokenizer and the simplified tokenizer are also within two percentage points of each other (i.e., 88.3% and 89.5% for TREC-6, and 93.0% and 94.0% for TREC-2005). The mean numbers of chainings per term of the ARp scheme for the TREC-2 and TREC-6 datasets are similar to those in Table 2 and are within two chainings per term on average. However, the mean number of chainings per term for the TREC-2005 dataset using the simplified tokenizer is larger than two.

5.7. Extensible compressed inverted file

This subsection shows that the extensible inverted file can be compressed using integer compression techniques [9,12-16]. Postings are compressed here using variable byte compression [8,13] because of its simplicity and because it is a single-pass compression method (unlike, for example, parameterized compression methods). The experimental set-up is the same as in the previous subsection, except that the storage allocation scheme is modified for compressed postings. More specifically, when the memory of a node is exhausted but some compressed data remain, the remaining compressed data are stored in the (following) chained new node. This modification can be applied to our other storage allocation schemes (e.g., TGR). Apart from this modification, we evaluated the ARp storage allocation scheme (assuming that the storage for each posting is six bytes) and the same scheme, called the ARpr scheme, except that the originally calculated node size is multiplied by the compression ratio R, which is defined as the storage of the compressed posting information divided by the storage of the uncompressed posting information.
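For reference, variable byte coding represents an integer in seven-bit groups, one group per byte, with a flag bit marking the last byte; document numbers are usually coded as d-gaps (differences between successive document numbers) so the values stay small. The sketch below shows one common variant of the codec (conventions differ on where the flag bit goes); it is illustrative and not necessarily the exact codec used in our system.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v as a variable byte code: low 7 bits per byte, least
 * significant group first; the high bit is set only on the final byte.
 * Returns the number of bytes written to out (at most 5 for 32 bits). */
size_t vbyte_encode(uint32_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 128) {
        out[n++] = (uint8_t)(v & 127);      /* continuation byte */
        v >>= 7;
    }
    out[n++] = (uint8_t)(v | 128);          /* final byte: flag bit set */
    return n;
}

/* Decode one integer, advancing *in past the bytes consumed. */
uint32_t vbyte_decode(const uint8_t **in)
{
    const uint8_t *p = *in;
    uint32_t v = 0;
    unsigned shift = 0;
    while ((*p & 128) == 0) {               /* continuation bytes */
        v |= (uint32_t)*p++ << shift;
        shift += 7;
    }
    v |= (uint32_t)(*p++ & 127) << shift;   /* final byte */
    *in = p;
    return v;
}
```

Under such a code, small d-gaps cost only one or two bytes each instead of the fixed per-posting storage, which is the source of the measured compression ratio R reported below.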

The reason for scaling the node size by R is that the storage of a compressed posting is, on average, R times smaller than the storage of the original posting information, so the node size can be reduced accordingly (on average). Multiplying the node size by R is called R scaling here.

Table 4 shows the results of building the extensible compressed inverted file for the TREC-2, TREC-6 and TREC-2005 English documents and some combinations of these collections. The compression ratio R is 37% for all the different collections. R is used to predict the storage utilization by looking up the ARp storage utilization curve in Fig. 5 based on the equivalent number of documents indexed, which is defined as the number of documents indexed times the compression ratio R. The rationale for using the equivalent number of documents indexed to predict the storage utilization is that the storage utilization is a function of the total allocated storage. An estimate of the total allocated storage is the number of documents indexed. However, for extensible compressed inverted files, the estimate of the total allocated storage is approximated by the number of documents indexed times the compression ratio R, since the storage for postings is reduced by the compression ratio R. Since storage utilizations in Fig. 5 are measured after indexing every 100 documents, the equivalent number of documents indexed is rounded to the nearest hundred. We observe from Table 4 that the final storage utilizations achieved using the ARp scheme and the corresponding predicted storage utilizations using the equivalent number of documents indexed are within three percentage points for collections that have over 380,000 equivalent documents.

Table 4
Final and predicted storage utilization of our extensible compressed inverted file

(Combined) TREC    R scaling   Time (s)   Mean # chainings   R (%)   Final storage     Predicted storage   Equivalent # documents
data collection                           per term                   utilization (%)   utilization (%)     indexed
2                  No          521        1.30               37      68.2              74.7                190,500
2                  Yes         500        2.20               37      84.7              74.7                190,500
6                  No          979        1.25               37      72.3              74.5                210,000
6                  Yes         937        1.95               37      85.3              74.5                210,000
2005               No          1477       1.64               37      84.3              85.5                380,000
2005               Yes         1442       3.31               37      90.9              85.5                380,000
2+2005             No          1978       1.70               37      88.0              89.5                569,600
2+2005             Yes         1952       3.52               37      91.7              89.5                569,600
6+2005             No          2514       1.58               37      88.2              89.8                590,000
6+2005             Yes         2396       3.08               37      91.5              89.8                590,000
2+6+2005           No          3035       1.66               37      90.0              91.9                779,800
2+6+2005           Yes         2916       3.39               37      91.9              91.9                779,800
It appears that the predicted and final storage utilizations are closer together for the larger data collections (e.g., TREC-2+TREC-6+TREC-2005). This might be due to the smaller slope of the storage utilization curve as the number of documents indexed increases. The ARp scheme without R scaling has a consistently lower final storage utilization than the ARpr scheme with R scaling, but it also has a consistently lower mean number of chainings per term, because the ARp scheme allocates larger but fewer nodes than the ARpr scheme. We observe in Table 4 that the difference in final storage utilization between the ARp and ARpr schemes shrinks as the collections grow, while the difference in their mean numbers of chainings per term increases steadily with collection size. Therefore, if there is enough RAM to index a large number of documents, we prefer the ARp scheme without R scaling because of its good final storage utilization and small mean number of chainings per term.

In Table 4, we observe that the indexing times of the extensible compressed inverted file for the TREC-2, TREC-6 and TREC-2005 data collections are similar to the corresponding indexing times of the uncompressed extensible inverted file in Table 3. We also observe that the indexing time of a combined collection is close to the sum of the indexing times of its constituent collections (e.g., 1978 ≈ 521 + 1477 s for the combined TREC-2 and TREC-2005 collection).

The mean number of chainings per term using variable byte compression without R scaling (Table 4) is similar to that without variable byte compression (Table 3) for the corresponding data collections.

6. Related work

The inverted file is popular for indexing archival databases and free text. It is considered [26] the best choice for Internet searches: it can retrieve documents quickly [27], and it can be compressed [28] to require as little storage as signature files [29]. For dynamic environments (e.g., the Internet), inverted files have been modified to support (incremental or batch) updates. Cutting and Pedersen [30] modified the B-tree structure with a heap data structure to store postings, thereby improving the storage utilization rate (from 66% to 86%) and reducing indexing time. Tomasic et al. [31] used a dual-list strategy to store short and long inverted lists, respectively, whose asymptotic storage utilization rate can reach about 88%. Our work can be considered an extension of theirs, using variable-sized (linked) lists instead of predefined short and long lists. Brown et al. [32] used a persistent object store to manage an incremental inverted file, which has a low RAM requirement.


Heinz and Zobel [6] have shown that their single-pass (in-memory) inversion approach was the preferred method for inverted file construction, but the RAM storage utilization was not reported. Zobel et al. [33] used a fixed-sized block of RAM to hold inverted lists; if the RAM block overflows, the block is written to disk. Their storage utilization rate is good (93-98%), and it is robust to different block sizes. Our work can be considered an extension of their work in which the size of the RAM block is predicted, without the need to move data chunks. Similar to our approaches, but tested with a smaller collection, Shieh and Chung [34] used run-time statistics to determine the allocation of free space for the construction of inverted files by linearly interpolating the predictions of the number of arrivals under different extreme conditions. Here, our simple-to-implement single-pass method using an existing (in-memory) inverted file structure achieved final storage utilization rates between 87% and 95%, depending on the amount of data indexed, over various reference datasets that include Chinese documents. We found that, as the amount of data indexed increases, the storage utilization of our storage allocation schemes increases in the long run.

In-memory inversion is relevant to the recent substantial interest [35] in building and accessing parallel-distributed indices [36-38] for information retrieval (IR). Initial interest included partitioning the inverted file [23,39,40] and using specialized parallel hardware [41-44] for efficiency. As computational power increases, parallel architectures for IR use less specialized hardware (e.g., workstations). One solution uses symmetric multiprocessors [45] in a share-everything memory organization. An alternative uses low-cost servers interconnected by a local area network in a share-nothing memory organization to index [23,24,46] and retrieve [36,37] documents concurrently, possibly acting as state-of-the-art locally distributed web servers [47] for giant web services [48]. The hardware and software of the system should be balanced for effective system utilization [45], for instance, using (software) pipelining [25,49]. Couvreur et al. [50] analyzed the trade-off between cost and performance using different types of hardware. A more recent evaluation examined retrieval efficiency [51] for distributed IR. As the current trend in indexing large data collections is parallel indexing (in batches) (e.g., [5,7]), our storage allocation schemes can be used to build these partitioned indices. Our results serve as a reference for PC-based parallel indexing (e.g., [5]), since the amount of RAM in PCs is similar to the amount of RAM in our server.

The extensible inverted file can store compressed postings [9,11-16,28], and it is complementary to compressing postings to achieve better memory utilization for in-memory inversion. Specifically, the optimal node size for storing compressed postings can be determined by multiplying the optimal node size described in Sections 3 and 4 by the compression ratio R. This works with the better AR scheme because the storage utilization of this scheme is resilient to simple scaling, as attested by the similar storage utilization rates of ARp and ARb (92.5% and 92%, respectively), where the scaling factor between the largest node sizes of ARp and ARb is as large as six. It is possible not to multiply the optimal node size by R, but this effectively enlarges the optimal node size by a factor of 1/R.

Our proposed storage allocation schemes can also determine node sizes for inverted files that include word positions. One general approach stores the word positions after the corresponding basic posting. In this case, the storage should be allocated conservatively, because it is difficult to predict the within-document term frequency and hence the amount of word-position storage. Hence, when a new node is allocated for the index term t_k in the current document n, the storage of the new node is the sum $X_{new}$ of: (1) the storage specified by our storage allocation schemes (say $c_1 B_{old}$ bytes); (2) the storage for the word positions of t_k in the current document n (say Y bytes); and (3) the storage for the word positions of the $B_{old} - 1$ remaining postings, assuming that the kth term occurs only once in each of those documents. Therefore, the storage for posting information in the new node is $X_{new} = c_1 B_{old} + Y + c_2 (B_{old} - 1)$, where $c_2$ is the storage for one word position. The storage overhead of the new node that stores the word positions is e bytes, and the storage utilization $V_{new}$ of this new node is $V_{new} = X_{new}/(e + X_{new})$. The storage utilization $V_{old}$ of an equivalent new node that stores the same number of postings but without word position information is $V_{old} = c_1 B_{old}/(e + c_1 B_{old})$. Since $B_{old} \ge 1$ and $Y > 0$, we have $V_{new} > V_{old}$. Therefore, the storage utilization of the extensible inverted file that stores word positions should be larger than the storage utilization of the corresponding extensible inverted file that does not store word position information.
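The step the argument relies on can be spelled out: the utilization function $V(x) = x/(e + x)$ is strictly increasing, and the node that stores word positions holds strictly more payload. In the document's notation:

```latex
% V(x) = x/(e+x) is strictly increasing, so a node with strictly more
% payload has strictly higher utilization.
\[
  V(x) = \frac{x}{e + x}, \qquad
  V'(x) = \frac{e}{(e + x)^2} > 0 \quad (e > 0).
\]
\[
  X_{\mathrm{new}} = c_1 B_{\mathrm{old}} + Y + c_2 (B_{\mathrm{old}} - 1)
  \;>\; c_1 B_{\mathrm{old}}
  \qquad (\text{since } Y > 0,\ B_{\mathrm{old}} \ge 1,\ c_2 \ge 0),
\]
\[
  \text{hence}\quad
  V_{\mathrm{new}} = V(X_{\mathrm{new}})
  \;>\; V(c_1 B_{\mathrm{old}}) = V_{\mathrm{old}}.
\]
```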


There are other variants for determining the number of word positions to reserve per document. For example, we can use the average term frequency in a document and its standard deviation to estimate the minimum number of word positions to store, with a prescribed confidence level (e.g., 95%). The other general approach to handling word position information with inverted files is to use extended postings (Section 2.1) of the form <d_i, tf(d_i, t_k), f(d_i, t_k)>. Our storage allocation schemes determine the optimal node size for extended postings by simply changing the constant $c_1$ in the original storage allocation schemes to $c_1 + c_3$, where $c_3$ is the storage for the file position. The storage utilization of this approach will be lower than storing the word positions in the nodes, because the auxiliary file positions of extended postings are overhead, not information. For valid comparisons, the storage utilization of this approach should be compared with that of another scheme or approach that also uses extended postings. In summary, storing word positions in the nodes or in auxiliary files can make use of the storage allocation schemes in Sections 3 and 4.

7. Conclusion

Several storage allocation schemes are proposed for in-memory extensible inverted file construction; these schemes are based on minimizing the storage waste under different conditions. Minimizing storage waste is equivalent to maximizing storage utilization, which is important for reducing the number of inverted files to be merged and for reducing the amortized time cost of inverted file maintenance [8]. Minimizing the storage waste also indirectly reduces the number of chainings and the access time, since an address indirection may lead to a cache miss (in RAM) or a file seek (on disk). The reduction of access time is important for the extensible inverted file, both as a temporary index in memory when it is being searched and as a partial index when it is being merged to form larger partial indexes or the final inverted file.

Our storage schemes were evaluated using a sizeable (i.e., 3.5 Gb) document subset of the VLC. In our experiment, the best scheme was the arrival rate (AR) scheme, which determines the node size using term arrival rates. The AR scheme was found to be the best because it has the best final storage utilization rate (95%) for this subset of the VLC data and, at the same time, the second-best mean number of chainings per term (i.e., 1.8).

The adaptive AR scheme has the best mean number of chainings per term (i.e., 1.2), but it has the lowest final storage utilization (i.e., 42%). The TGR scheme has a final storage utilization similar to that of the AR scheme, but its mean number of chainings per term is 3.8, about double that of the AR scheme. Therefore, the AR scheme was clearly our best scheme when performance is measured by both the final storage utilization rate and the mean number of chainings per term.

We have also evaluated the ARp scheme using four additional reference data collections (i.e., TREC-2, TREC-6, TREC-2005 and NTCIR-5). The final storage utilization of the ARp scheme for these four data collections can be predicted within three percentage points using the storage utilization curve of the ARp scheme derived from the VLC data collection as a kind of operating curve. The indexing speed (i.e., the number of terms indexed per second) of our system can be increased by optimizing the program code (e.g., using a simpler tokenizer) and the operating environment (e.g., combining documents into a single file for fast disk access). The resulting indexing speed of our system was similar to the indexing speeds of others [6,9]. This illustrates that our storage allocation schemes do not incur a significant amount of time overhead when calculating node sizes.

We evaluated the ARp scheme for storing variable byte compressed postings [9,13] using the TREC-2, TREC-6 and TREC-2005 collections and some of their combinations. The measured compression ratio R is almost constant (i.e., 37%). For compressed postings, the predicted storage utilization is determined by looking up the operating ARp curve in Fig. 5 based on the equivalent number of documents indexed, defined as the number of documents indexed times R. We observe that when the equivalent number of documents indexed is 210,000 or more, the final storage utilization of the ARp scheme is close to (i.e., within three percentage points of) the corresponding predicted storage utilization. The storage utilization of the ARp scheme for compressed postings is thus similar to that of the ARp scheme for uncompressed postings when the number of documents indexed is sufficiently large.


The storage allocation schemes can determine node sizes for compressed inverted files with word position information. For nodes that directly store word position information, the node size can be the sum of the calculated optimal node size plus the storage for the word positions of the index term in the current indexed document. For nodes that store extended postings, the optimal node size can be determined by the storage allocation schemes using the storage of an extended posting as $c_1$.

Acknowledgments

We thank the Center for Intelligent Information Retrieval, University of Massachusetts, for facilitating Robert Luk to develop part of the IR system while he was on leave there. We are grateful to ROCLING for providing its word list. This work is supported by the Hong Kong Polytechnic University Grant no. A-PE36.

References

[1] D. Hawking, N. Craswell, P.B. Thistlewaite, Overview of TREC-7 very large collection track, in: Proceedings of the Seventh TREC Conference, 1998, pp. 40-52.
[2] C. Clarke, N. Craswell, I. Soboroff, Terabyte track, http://www-nlpir.nist.gov/projects/terabyte/, 2003.
[3] J. Hirai, H. Garcia-Molina, A. Paepcke, S. Raghavan, WebBase: a repository of web pages, in: Proceedings of the Ninth International World Wide Web Conference, 2000, pp. 277-293.
[4] NTCIR Patent Retrieval Task, http://www.slis.tsukuba.ac.jp/~fujii/ntcir5/cfp-en.html, 2005.
[5] L.A. Barroso, J. Dean, U. Holzle, Web search for a planet: the Google cluster architecture, IEEE Micro 23 (2) (2003) 22-28.
[6] S. Heinz, J. Zobel, Efficient single-pass index construction for text databases, J. Am. Soc. Inform. Sci. Technol. 54 (8) (2003) 713-729.
[7] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
[8] N. Lester, J. Zobel, H. Williams, Efficient online index maintenance for contiguous inverted lists, Inform. Process. Manage. 42 (2006) 916-933.
[9] G. Navarro, E.S. de Moura, M. Neubert, R. Baeza-Yates, Adding compression to block addressing inverted indexes, Inform. Retriev. 3 (2000) 49-77.
[10] A. MacFarlane, S.E. Robertson, J.A. McCann, On concurrency control of inverted files, in: F.C. Johnson (Ed.), Proceedings of the 18th BCS IRSG Annual Colloquium on Information Retrieval Research, 26-27 March 1996, pp. 67-79.
[11] U. Manber, S. Wu, Glimpse: a tool to search through entire file systems, in: Proceedings of the USENIX Winter 1994 Technical Conference, 1994, pp. 23-32.
[12] N. Ziviani, E.S. de Moura, G. Navarro, R. Baeza-Yates, Compression: a key for next-generation text retrieval systems, IEEE Comput. 33 (11) (2000) 37-44.
[13] P. Elias, Universal codeword sets and the representation of the integers, IEEE Trans. Inform. Theory 21 (2) (1975) 194-203.
[14] A. Trotman, Compressing inverted files, Inform. Retriev. 6 (1) (2003) 5-19.
[15] H. Williams, J. Zobel, Compressing integers for fast file access, Comput. J. 42 (3) (1999) 193-201.
[16] S.W. Golomb, Run-length encodings, IEEE Trans. Inform. Theory 12 (3) (1966) 399-401.
[17] E.A. Brewer, Combining systems and databases: a search engine retrospective, in: J.M. Hellerstein, M. Stonebraker (Eds.), Readings in Database Systems, fourth ed., MIT Press, Cambridge, MA, 2005.
[18] A. Fox, E.A. Brewer, Harvest, yield, and scalable tolerant systems, in: Proceedings of the 16th SOSP, St. Malo, France, October 1997.
[19] S. Gilbert, N. Lynch, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, SIGACT News 33 (2) (2002) 51-59.
[20] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, New York, 1978.
[21] C.L.A. Clarke, G.V. Cormack, Dynamic inverted indexes for a distributed full-text retrieval system, Technical Report MT-95-01, University of Waterloo, 1995.
[22] B.A. Ribeiro-Neto, J.P. Kitajima, G. Navarro, C. Santana, N. Ziviani, Parallel generation of inverted files for distributed text collections, in: Proceedings of the 18th International Conference of the Chilean Society of Computer Science, Chile, 1998, pp. 149-157.
[23] B. Ribeiro-Neto, E.S. Moura, M.S. Neubert, N. Ziviani, Efficient distributed algorithms to build inverted files, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, 1999, pp. 105-112.
[24] S. Heinz, J. Zobel, H.E. Williams, Burst tries: a fast, efficient data structure for string keys, ACM Trans. Inform. Syst. 20 (2) (2002) 192-223.
[25] C. Kit, Y. Liu, N. Liang, On methods of Chinese automatic word segmentation, J. Chin. Inform. Process. 3 (1) (1989) 13-20.
[26] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina, Building a distributed full-text index for the Web, ACM Trans. Inform. Syst. 19 (3) (2001) 217-247.
[27] J. Zobel, A. Moffat, K. Ramamohanarao, Inverted files versus signature files for text indexing, ACM Trans. Database Syst. 23 (4) (1998) 453-490.
[28] I. Witten, A. Moffat, T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, Los Altos, CA, 1999.
[29] C. Faloutsos, S. Christodoulakis, Description and performance analysis of signature file methods, ACM Trans. Office Inform. Syst. 5 (3) (1987) 237-257.
[30] D. Cutting, J. Pedersen, Optimizations for dynamic inverted index maintenance, in: Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1990, pp. 405-411.
[31] A. Tomasic, H. Garcia-Molina, K.A. Shoens, Incremental updates of inverted lists for text document retrieval, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994, pp. 289-300.

[32] E.W. Brown, J.P. Callan, W.B. Croft, Fast incremental indexing for full-text information retrieval, in: Proceedings of the 20th VLDB Conference, 1994, pp. 192-202.
[33] J. Zobel, A. Moffat, R. Sacks-Davis, Storage management for files of dynamic records, in: Proceedings of the Fourth Australian Database Conference, 1993, pp. 26-38.
[34] W.-Y. Shieh, C.-P. Chung, A statistics-based approach to incrementally update inverted files, Inform. Process. Manage. 41 (2) (2005) 275-288.
[35] A. Tomasic, H. Garcia-Molina, Issues in parallel information retrieval, Bull. Tech. Committee Data Eng. 17 (3) (1994) 41-49.
[36] C. Badue, B.A. Ribeiro-Neto, R. Baeza-Yates, N. Ziviani, Distributed query processing using partitioned inverted files, in: Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE 2001), 2001, pp. 10-20.
[37] A. MacFarlane, J.A. McCann, S.E. Robertson, Parallel search using partitioned inverted files, in: Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000, pp. 209-220.
[38] J.P. Callan, Z. Lu, W.B. Croft, Searching distributed collections with inference networks, in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, 1995, pp. 21-28.
[39] C. Stanfill, Partitioned posting files: a parallel inverted file structure for information retrieval, in: Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Brussels, 1990, pp. 413-428.
[40] B.-S. Jeong, E. Omiecinski, Inverted file partitioning schemes in multiple disk systems, IEEE Trans. Parallel Distribut. Syst. 6 (2) (1995) 142-153.
[41] S.-H. Chung, S.-C. Oh, K.R. Ryu, S.-H. Park, Parallel information retrieval on a distributed memory multiprocessor system, in: Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICAPP 97), 1997, pp. 163-176.
[42] P. Bailey, D. Hawking, A parallel architecture for query processing over a terabyte of text, Technical Report TR-CS-96-04, The Australian National University, June 1996.
[43] N. Goharian, T. El-Ghazawi, D. Grossman, Enterprise text processing: a sparse matrix approach, in: Proceedings of the International Conference on Information Technology: Coding and Computing, 2001, pp. 71-75.
[44] C. Stanfill, B. Kahle, Parallel free-text search on the connection machine system, Commun. ACM 29 (12) (1986) 1229-1239.
[45] Z. Lu, K.S. McKinley, B. Cahoon, The hardware/software balancing act for information retrieval on symmetric multiprocessors, in: Proceedings of Euro-Par '98, 1998, pp. 521-527.
[46] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina, Building a distributed full-text index for the Web, in: Proceedings of the 10th International Conference on World Wide Web, Hong Kong, 2001, pp. 396-406.
[47] V. Cardellini, E. Casalicchio, M. Colajanni, P.S. Yu, The state of the art in locally distributed web servers, ACM Comput. Surveys 34 (2) (2002) 263-311.
[48] E.A. Brewer, Lessons from giant-scale services, IEEE Internet Comput. 5 (4) (2001) 46-55.
[49] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan, Searching the web, ACM Trans. Internet Technol. 1 (1) (2001) 2-43.
[50] T.R. Couvreur, R.N. Benzel, S.F. Miller, D.N. Zeitler, D.L. Lee, M. Singhai, N. Shivaratri, W.Y.P. Wong, An analysis of performance and cost factors in searching large text databases using parallel search systems, J. Am. Soc. Inform. Sci. 45 (7) (1994) 443-464.
[51] B. Cahoon, K.S. McKinley, Z. Lu, Evaluating the performance of distributed architectures for information retrieval using a variety of workloads, ACM Trans. Inform. Syst. 18 (1) (2000) 1-43.