Comparison of spatial indexes

Nathalie Andrea Barbosa Roa

Résumé

In this report three different tree data structures were tested in order to find the index structure best suited to data groups retrieval for clustering purposes. Group retrieval is considered as the process of finding all the neighborhoods in the data set. This process involves a recursive implementation of radius-neighbor queries. Time and memory measures were taken for the tree construction process and for the group retrieval process. These measures are the average value over a hundred simulations. The selected structures are: the R*Tree [1], the BallTree [4] and the KDTree [2]. I Introduction The R*Tree, or rectangular Tree, was proposed in 1990 and has been largely used since. In clustering applications, the ClusTree algorithm uses an extension of this index to efficiently locate the right place for object insertion [3]. The key idea of this data structure is to group nearby objects and represent them with their minimum bounding rectangle in the next higher level of the tree. BallTrees is a completely binary tree in which each node refer to a region bounded by an d-dimensional hyper-sphere [4]. A parent node is hence the smallest hyper-sphere that contains all the hyper-spheres of its children. The BallTree is the only index, among the analyzed indexes, that accepts the intersection of sibling regions. A K-dimensional binary search tree (KDTree) partitions the data space into mutually exclusive hyper-rectangular regions. Each region is found by recursively splitting the space using axis-parallel hyperplanes. The splitting process finishes when each sub-region has a number of points less than or equal to a given threshold. The python implementations of the R*Tree 1 , the BallTree 2 , the KDTree 3 and the cKDTree 4 (A cython implementation of the KDTree), were tested using three different scenarios. The computation time and memory consumption (maximum resident set size used), were selected as comparative measures. II Performance with uniform distributions In the first scenario, data correspond to random samples extracted from a uniform distribution. The algorithms were tested for a data sets containing n data samples, where each sample is a d-dimensional vector. Tests were performed for n varying in the range n ∈ 10 1 , 10 2 , 10 3 , 10 4 and d in d ∈ 2 1 , 2 2 , 2 3 , 2 4 , 2 5. Figure 1 show the resources used for the tree building task. Computation time measured in seconds is presented in Figure 1a and the maximum resident set size measured in M b is shown in Figure 1b. It can be seen that the R*Tree implementation performs poorly with respect to the other indexes. The actual measurements of the computation time and maximum resident set size are presented in tables 1 and 2 respectively. The group retrieval task is computationally more expensive as can be seen in Figure 2, where the computation time for the worst case (R*Tree, n = 18,d = 10 4) passes from 33.12 seconds in the tree building task to 362.9 seconds for group retrieval (see Figure 2a). The best computation time is achieved by the cKDTree index but at cost of having the worst memory consumption (see Figure 2b). The recorded data for the group retrieval task is presented in tables 3 and 4. 1

Comparison of spatial indexes

Résumé

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager