Hadoop-GIS Features

Spatial Query Processing

Query operations in Hadoop-GIS are implemented as a combination of a framework query processor (for managing the pipeline and interacting with HDFS) and MapReduce jobs.

Spatial Query Types

Hadoop-GIS currently supports three main query types:

  • Range query - Window containment query
  • Spatial join (with different types of predicates)
  • k-NN (k-Nearest-Neighbor)

Supported Spatial Predicates

st_intersects st_touches st_crosses st_contains
st_adjacent st_disjoint st_equals st_dwithin
st_within st_overlaps st_nearest st_nearest2

See the Parameter Explanation section below for the definition of each predicate.

Supported Data Types

HadoopGIS currently supports the following Well Known Text (WKT) types:

  • Point
  • MultiPoint
  • LineString
  • MultiLineString
  • Polygon
  • MultiPolygon

For complex queries with multiple predicates, each (MapReduce) step currently processes only one predicate, so the user must invoke a sequence of simple queries, passing each intermediate result set to the next phase/step.
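This chaining can be sketched as repeated passes over a set of candidate pairs, where each pass stands in for one single-predicate MapReduce step. The predicate functions below operate on axis-aligned boxes and are illustrative assumptions, not Hadoop-GIS internals:

```python
# Conceptual sketch: chaining two single-predicate steps.
# Each "job" filters record pairs by one predicate; the intermediate
# result set is passed to the next step, mirroring how Hadoop-GIS
# requires one invocation per predicate.

def run_step(pairs, predicate):
    """Stand-in for one single-predicate MapReduce job."""
    return [p for p in pairs if predicate(*p)]

# Hypothetical predicates over boxes given as (min_x, min_y, max_x, max_y).
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def within(a, b):  # a contained in b
    return a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]

pairs = [((1, 1, 2, 2), (0, 0, 5, 5)),    # intersects and within
         ((1, 1, 6, 2), (0, 0, 5, 5)),    # intersects only
         ((9, 9, 10, 10), (0, 0, 5, 5))]  # disjoint

intermediate = run_step(pairs, intersects)  # first simple query
final = run_step(intermediate, within)      # second simple query
print(len(intermediate), len(final))        # prints: 2 1
```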

Required Input Data Format

Data Preparation. Data that HadoopGIS can accept or re-process must have the following properties:

  • Each record is located on a separate line; a record representing a spatial object must be contained in a single line.
  • If a record contains a newline (\n) character, the character must be replaced by another special character.
  • Records (objects) can have both spatial and non-spatial attributes.
  • The geometry of objects must be in WKT format (other data formats will be supported later).
  • Records must have an id attribute. Currently we do not enforce the uniqueness of the id. You can pre-process your data by simply adding sequential integers as ids.
  • Values for attributes can be empty.
  • The position number of the geometry field must be the same for every record.
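As a sketch, the rules above can be enforced in a small pre-processing pass. The tab delimiter, field layout, and geometry position used here are assumptions for illustration; Hadoop-GIS only requires that the rules listed above hold:

```python
# Sketch: prepare raw records so they satisfy the input-format rules.
# Assumed raw layout: tab-delimited "name<TAB>wkt"; after prepending a
# sequential id, the geometry sits at field 3 (1-based counting, as used
# by the --geom1/--geom2 parameters).

raw_records = [
    "alpha\tPOINT(1 2)",
    "beta\tPOLYGON((0 0, 4 0, 4 4, 0 4, 0 0))",
    "\tLINESTRING(0 0, 1 1)",  # empty non-spatial attribute is allowed
]

GEOM_FIELD = 3  # 1-based position of the geometry, same for every record

prepared = []
for i, line in enumerate(raw_records, start=1):
    # Embedded newlines must be replaced so each record stays on one line.
    line = line.replace("\n", "|")
    # Prepend a sequential integer id (uniqueness is not enforced).
    record = f"{i}\t{line}"
    fields = record.split("\t")
    # Sanity check: the geometry field must hold a supported WKT type.
    wkt_type = fields[GEOM_FIELD - 1].split("(", 1)[0].strip()
    assert wkt_type in {"POINT", "MULTIPOINT", "LINESTRING",
                        "MULTILINESTRING", "POLYGON", "MULTIPOLYGON"}
    prepared.append(record)

print(prepared[0])  # prints: 1<TAB>alpha<TAB>POINT(1 2)
```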

If you have existing data satisfying the above requirements, stage your data on HDFS. e.g.:

hdfs dfs -mkdir /user/testuser
hdfs dfs -mkdir /user/testuser/rawdata1
hdfs dfs -put mydata1.csv /user/testuser/rawdata1/

If you do not have data, you can download data from the Examples page or use the program test/generatePolygons.py to generate random triangles (see the Artificial Data Generation section below).
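For a sense of what such generated data looks like, the sketch below emits random triangles as WKT polygons, one record per line, with a sequential id in field 1. This is not the actual test/generatePolygons.py script; its output layout and parameters here are assumptions:

```python
import random

# Sketch: generate random triangle records in the required input format
# (sequential id, tab, WKT geometry). Coordinate ranges are arbitrary.

def random_triangle(max_coord=100.0, max_size=5.0):
    x, y = random.uniform(0, max_coord), random.uniform(0, max_coord)
    pts = [(x, y),
           (x + random.uniform(0.1, max_size), y),
           (x, y + random.uniform(0.1, max_size))]
    ring = ", ".join(f"{px:.2f} {py:.2f}" for px, py in pts)
    # A WKT polygon ring must be closed: repeat the first point at the end.
    first = f"{pts[0][0]:.2f} {pts[0][1]:.2f}"
    return f"POLYGON(({ring}, {first}))"

random.seed(42)
records = [f"{i}\t{random_triangle()}" for i in range(1, 4)]
print(records[0])
```

Writing the records to a local file and staging it with hdfs dfs -put, as shown above, makes the data ready for querying.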

Supported Spatial Partitioning Algorithms

The following algorithms are supported for spatial partitioning. The abbreviations are used as parameters during query invocation.

Partitioning Algorithm                  Abbreviation
Fixed Grid Partitioning                 fg
Binary Space Partitioning               bsp
Hilbert Curve Partitioning              hc
Quadtree Partitioning                   qt
Strip-based Partitioning                slc
Boundary Optimized Space Partitioning   bos
Sort Tile Recursive Partitioning        str

Query Invocation

Arguments are passed to HadoopGIS on the command line. The full list of arguments for the framework manager can be displayed by executing ../build/bin/queryprocessor_2d --help.

The following output will be displayed:

  --help                    This help message
  -q [ --querytype ] arg    Query type [ partition | containment | spjoin ]
  -p [ --bucket ] arg       Fine-grain level tile size for spjoin
  --blocksize arg           Fine-grain level bucket size for partitioning
                            (loading data)
  --roughbucket arg         Rough level bucket size for partitioning
  -a [ --input1 ] arg       HDFS file path to data set 1
  -b [ --input2 ] arg       HDFS file path to data set 2
  -i [ --geom1 ] arg        Field number of data set 1 containing the geometry
  -j [ --geom2 ] arg        Field number of data set 2 containing the geometry
  -h [ --outputpath ] arg   Output path
  -d [ --distance ] arg     Distance (used for certain predicates)
  -k [ --knn ] arg          The number of nearest neighbors. Only used in
                            conjunction with st_nearest or st_nearest2

  -f [ --outputfields ] arg Fields to be included in the final output separated
                            by commas. See the full documentation. Regular
                            fields from datasets are in the format
                            datasetnum:fieldnum, e.g. 1:1,2:3,1:14. Field
                            counting starts from 1. Optional statistics
                            include: area1, area2, union, intersect, jaccard,
                            dice, mindist
  --containfile arg         User file containing window used for containment
  --containrange arg        Comma-separated list of windows used for containment
  -t [ --predicate ] arg    Predicate for spatial join and nn queries [
                            st_intersects | st_touches | st_crosses |
                            st_contains | st_adjacent | st_disjoint | st_equals
                            | st_dwithin | st_within | st_overlaps | st_nearest
                            | st_nearest2 ]
  -s [ --samplingrate ] arg Sampling rate (0, 1]
  -u [ --partitioner ] arg  Partitioning method [ fg | bsp | hc | str | bos |
                            slc | qt ]
  -v [ --partitioner2 ] arg (Optional) Partitioning for second method [fg | bsp
                            | hc | str | bos | slc | qt ]
  --mbb1 arg                HDFS path to MBBs of data set 1
  --mbb2 arg                HDFS path to MBBs of data set 2
  -o [ --overwrite ]        Overwrite existing hdfs directories
  -z [ --parapartition ]    Use 2 partitioning steps
  -n [ --numreducers ] arg  The number of reducers
  --removetmp               Remove temporary directories on HDFS
  --removembb               Remove MBB directory on HDFS

Parameter Explanation

The framework query processor uses keyword argument passing (rather than positional arguments).

  • query_type (Required parameter)
    • Both spatial join and k-nearest-neighbor queries use spjoin as the argument for the query_type parameter.
    • Argument partition is used to partition the data into HDFS blocks that are efficient to use containment query on.
    • Argument containment is used for containment (window) query.
  • bucket (Required parameter)
    • bucket is the threshold on the number of objects per tile passed to the partitioning algorithms. This parameter is required for spjoin queries (both spatial join and k-NN).
  • blocksize (Optional parameter) is the size of an HDFS block in bytes. The default is 128 MB on most systems.
  • roughbucket (Optional parameter) is the partition maximum payload used in the first step of 2-step partitioning process.
  • input1 (Required)
    • input1 is the HDFS path to the first data set (data set 1). This should be a directory or the exact path of a file.
  • input2 (Required and used only in spatial join and nearest neighbor between 2 datasets)
    • input2 is the HDFS path to the second data set (data set 2). This should be a directory or the exact path of a file.
    • For a self-join, this parameter must be omitted.
  • geom1 (Required) is the field number of the geometry field in data set 1. Field counting starts from 1.
  • geom2 (Required and used if and only if input2 is used) is the field number of the geometry field in data set 2. Field counting starts from 1.
  • h (Required) is the HDFS output path where the final result will be located.
    • The HDFS output path cannot be a sub-directory of the original data set, and it must not already exist on HDFS (use hdfs dfs -rm path to remove it from the command line).
    • Temporary directories used by the pipeline will have the output path as a prefix, followed by an underscore and the abbreviation of the step.
  • d (Required only when predicate is st_dwithin - within a given distance or st_nearest2 - nearest neighbor with a bound). d is the distance in the coordinate system used by input objects.
  • f (Optional but strongly recommended)
    • f determines the format of the final output of spatial processing. Fields are tab-separated in the output. To specify a field, use the data set number followed by a colon and the field number; individual field specifiers are comma-separated. For instance, -f 1:1,2:3,1:14 outputs the first field of data set 1, the 3rd field of data set 2, and the 14th field of data set 1. Fields not directly obtained from the original input data have the special names below.
    • Name of special fields (used when applicable):
      • area1 : area of object 1
      • area2 : area of object 2
      • union : area of the union of both objects
      • intersect: area of the intersection between two objects
      • jaccard: Jaccard coefficient (based on object area)
      • dice: DICE coefficient (based on object area)
      • mindist: Minimum distance between the objects. By default it is the shortest Euclidean distance between any two points on objects 1 and 2.
  • containfile (Required for window query/mutually exclusive with containrange):
    • The containfile argument should be the exact path of the local file containing the window the user wants to query.
    • The file should be a text file containing 1 geometry in WKT format per line.
  • containrange (Required for window query/mutually exclusive with containfile)
    • The value should be a comma-separated list of windows in the format min_x min_y max_x max_y. Surround the entire string argument with double quotes to prevent the shell from splitting it.
  • predicate (Required)
    • predicate is the spatial predicate used in all types of query.
    • By default, st_intersects is used for window (containment) queries.
    • Predicates and their definition:
      • st_intersects: object 1 intersects with object 2
      • st_touches: object 1 touches object 2
      • st_crosses: object 1 crosses object 2
      • st_contains: object 2 is completely contained inside object 1
      • st_adjacent: object 1 and object 2 are adjacent
      • st_disjoint: object 1 and object 2 have no intersection
      • st_equals: object 1 is the same as object 2 (they might have different WKT geometry descriptions)
      • st_dwithin: the shortest distance from object 1 to object 2 is less than a threshold
      • st_within: object 1 is contained inside object 2
      • st_overlaps: object 1 overlaps with object 2
      • st_nearest: object 2 is one of the k nearest neighbors of object 1
      • st_nearest2: object 2 is one of the k nearest neighbors of object 1, subject to a distance constraint
  • samplingrate (Optional): A sampling rate used to sample the original data. Values must be in the range (0, 1].
  • partitioner (Optional): the argument is one of the 7 partitioning algorithms described above. This method is used during the coarse-grain partitioning. Default is bsp.
  • partitioner2 (Optional): the argument is one of the 7 partitioning algorithms. This method is used during the fine-grain partitioning. Default is bsp.
  • mbb1 and mbb2 (Optional): HDFS paths to the MBBs of the objects in data sets 1 and 2. Useful if the MBBs have been extracted previously.
  • overwrite (Optional): if enabled, the framework will delete all intermediate directories used in the processing and execute every single step in the pipeline.
  • parapartition (Optional): when enabled, this signifies the framework should use a 2-step partitioning approach.
  • numreducers (Optional): the maximum number of reducers available on the cluster.
  • removetmp (Optional): when enabled, all intermediate directories are removed from HDFS. Note that when this option is absent, only the _mbb and _partidx directories are retained (for reuse purposes).
  • removembb (Optional): when enabled, all intermediate directories except for the _partidx directory are removed from HDFS.
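The field-selection behavior of the -f / --outputfields parameter can be illustrated with a few lines of code. This mirrors the documented format (datasetnum:fieldnum, 1-based counting, tab-separated output) but is a sketch, not Hadoop-GIS's own implementation; the record values are made up:

```python
# Sketch: apply an --outputfields spec such as "1:1,2:3,1:3" to a joined
# record pair. Field counting starts from 1, matching the documentation.

def select_output_fields(spec, record1, record2):
    records = {1: record1, 2: record2}
    out = []
    for item in spec.split(","):
        dataset, field = (int(x) for x in item.split(":"))
        out.append(records[dataset][field - 1])  # 1-based field numbers
    return "\t".join(out)  # output fields are tab-separated

r1 = ["101", "POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))", "roads"]
r2 = ["202", "POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))", "parcels"]
print(select_output_fields("1:1,2:3,1:3", r1, r2))
# prints: 101<TAB>parcels<TAB>roads
```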

Additional Features

Partition Visualization

HadoopGIS provides a simple tool to visualize the boundaries of partitions. Optionally, users can also display the locations of object bounding boxes.

To display the argument requirements, execute: build/bin/partition_vis --help

  --help                     this help message
  -p [ --partidxfile ] arg   File name of the partition boundaries in MBB
                             format. Partition boundaries have random colors.
  -f [ --offsetpartidx ] arg Field offset for the start of MBB fields. Default
                             is 1
  -q [ --objfile ] arg       File name of the objects in MBB format. Objects
                             have black border.
  -s [ --spacefile ] arg     File containing the global space information.
                             Global space info should have the same format as
                             partition boundary. If none is specified, the
                             max/min of partition boundaries will be used as
                             the global space.
  -o [ --outputname ] arg    Name of the output file. Should contain .png

Parameter Explanation
  • partidxfile (Required): The argument should be the path to a local file containing the partition indices.
  • Format of the files provided by partidxfile, objfile, and spacefile: Each line should contain 5 fields: a unique ID, and the min_x, min_y, max_x, and max_y coordinates. Fields are tab-delimited.
  • outputname (Required): The name of the output image file.
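A file in the expected MBB format can be produced with a few lines of code; the ids and coordinates below are made-up example values:

```python
import os
import tempfile

# Sketch: write a file in the tab-delimited MBB format consumed by
# partition_vis (--partidxfile / --objfile / --spacefile): one entry per
# line with a unique id followed by min_x, min_y, max_x, max_y.

partitions = [
    ("p1", 0.0, 0.0, 50.0, 100.0),
    ("p2", 50.0, 0.0, 100.0, 100.0),
]

path = os.path.join(tempfile.mkdtemp(), "partidx.txt")
with open(path, "w") as f:
    for pid, min_x, min_y, max_x, max_y in partitions:
        f.write(f"{pid}\t{min_x}\t{min_y}\t{max_x}\t{max_y}\n")

with open(path) as f:
    lines = f.read().splitlines()
print(lines[0])  # prints: p1<TAB>0.0<TAB>0.0<TAB>50.0<TAB>100.0
```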