Example Data sets

Direct Download:

Example Queries

Below are examples of using Hadoop-GIS to analyze large-scale geo-tagged tweets data. The data sets used in these examples can be found at the top of this page.

The data were first parsed into the tab-separated ("tsv") format that Hadoop-GIS requires (see Data Preparation and Input Format).

The tweets data came from twitter.com and was collected through the Twitter API. The file contains about 10 million geo-tagged tweets from all around the world. The first field of each line is the tweet's index and the second field contains the geometry information (in WKT format). The remaining fields are attributes of that tweet.

The zip code data came from TIGER. The file contains more than 30k lines covering all US zip codes. As in the tweets data, the first and second fields of each line contain the index and the geometry information, respectively. The third field is the actual zip code value.
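
As a quick illustration of the field layout described above, the following shell sketch builds one made-up record per data set and pulls out individual fields with awk. The sample values (indexes, coordinates, tweet text, polygon, zip code) and the helper name field are invented for illustration and are not taken from the real files.

```shell
#!/bin/sh
# Hypothetical sample records; the real files contain different values.
# Fields are tab-separated: index, WKT geometry, then attributes.
tweet_line="$(printf '1\tPOINT (-73.98 40.75)\tsome tweet text')"
zip_line="$(printf '7\tPOLYGON ((-74 40, -73 40, -73 41, -74 41, -74 40))\t10001')"

# Print field number $2 of the tab-separated record $1.
field() {
    printf '%s\n' "$1" | awk -F '\t' -v n="$2" '{ print $n }'
}

field "$tweet_line" 1   # index of the tweet
field "$tweet_line" 2   # geometry in WKT
field "$zip_line" 3     # zip code value
```

The same awk pattern can be pointed at the first few lines of a downloaded file to confirm that the geometry is where you expect it before uploading anything to HDFS.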

This document includes two examples of tweets data analysis. The operations involved are window containment and spatial join. For a detailed explanation of the parameters, see the Query Invocation page.

Filter tweets by location

This section shows how to use Hadoop-GIS to find the tweets located inside the US. For illustration, an approximate latitude and longitude bounding box stands in for the extent of the contiguous US, based on these extreme points:

  • Northernmost point: Northwest Angle, Minnesota (49°23′4.1″ N)
  • Southernmost point: Ballast Key, Florida (24°31′15″ N)
  • Easternmost point: Sail Rock, just offshore of West Quoddy Head, Maine (66°57′ W)
  • Westernmost point: Bodelteh Islands, offshore from Cape Alava, Washington (124°46′ W)
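
Query ranges are written in signed decimal degrees, so the coordinates listed above need converting first. A minimal sketch of that arithmetic follows; the function name dms_to_decimal is made up here, and the convention that south and west are negative is the usual one for longitude/latitude data.

```shell
#!/bin/sh
# Convert degrees, minutes, seconds and a hemisphere letter to signed
# decimal degrees. South and West become negative.
# Usage: dms_to_decimal DEG MIN SEC HEMISPHERE
dms_to_decimal() {
    awk -v d="$1" -v m="$2" -v s="$3" -v h="$4" 'BEGIN {
        v = d + m / 60 + s / 3600
        if (h == "S" || h == "W") v = -v
        printf "%.4f\n", v
    }'
}

dms_to_decimal 49 23 4.1 N   # north bound: 49.3845
dms_to_decimal 24 31 15 N    # south bound: 24.5208
dms_to_decimal 66 57 0 W     # east bound:  -66.9500
dms_to_decimal 124 46 0 W    # west bound:  -124.7667
```

Both longitudes come out negative because the whole region lies west of the prime meridian.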

The idea is to query the tweets by their geometries and retain the ones within the region described above. The ContainmentQuery in Hadoop-GIS supports this operation. Below are the two steps of using Hadoop-GIS to filter the tweets.
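
To make the containment test concrete, here is a small local sketch that applies the same idea to a handful of records with awk. It is not how Hadoop-GIS implements the query; it assumes that the geometry in field 2 is a WKT point of the form POINT (x y) and that the box is given as min_x, min_y, max_x, max_y. The function name filter_box and the two sample records are invented.

```shell
#!/bin/sh
# Keep only records whose POINT geometry (field 2) falls inside the box.
# Records with other geometry types parse as (0, 0) and are dropped
# unless the box happens to contain that point.
# Usage: filter_box MIN_X MIN_Y MAX_X MAX_Y < input.tsv
filter_box() {
    awk -F '\t' -v minx="$1" -v miny="$2" -v maxx="$3" -v maxy="$4" '
    {
        g = $2
        # Strip the WKT wrapper, leaving "x y".
        gsub(/^POINT *\(|\) *$/, "", g)
        split(g, c, " ")
        x = c[1] + 0; y = c[2] + 0
        if (x >= minx && x <= maxx && y >= miny && y <= maxy) print
    }'
}

# Hypothetical sample: one point in New York, one in London.
printf '1\tPOINT (-73.98 40.75)\tnyc\n2\tPOINT (-0.12 51.50)\tlondon\n' |
    filter_box -124.7 24.5 -67.0 49.5
```

Only the New York record is printed. A filter like this is handy for spot-checking a small sample, while the full data set is better served by the partitioned query described next.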

Partition the Data

  1. Download the tweets file from the link at the top of this page.
  2. Extract the file into your local directory:
    gunzip -c tweets.filtered.tsv.gz > your_path/tweets.filtered.tsv
  3. Upload the file to HDFS:
    hadoop fs -mkdir YOUR_HDFS_PATH/tweets
    hadoop fs -mkdir YOUR_HDFS_PATH/tweets/rawdata
    hadoop fs -put your_path/tweets.filtered.tsv YOUR_HDFS_PATH/tweets/rawdata/
  4. Change into the Hadoop-GIS base directory and partition the data:
    cd hadoopgis
    ./build/bin/queryprocessor_2d --querytype partition --geom1 5 \
        --input1 YOUR_HDFS_PATH/tweets/rawdata/tweets.filtered.tsv \
        --outputpath YOUR_HDFS_PATH/tweetspartitioned --partitioner bsp --s 0.1 --numreducers 10
  5. After the script finishes, the partitioned tweets data is stored in the HDFS directory YOUR_HDFS_PATH/tweetspartitioned.
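
If you run these steps more than once, it may help to keep the paths in shell variables. The sketch below wraps the upload and partition commands from the steps above in a small dry-run helper. The variable names HDFS_BASE, LOCAL_DIR and DRY_RUN, the helper run, and the default path /user/yourname are all assumptions introduced here; the commands and their options are the ones listed above, unchanged.

```shell
#!/bin/sh
# Hypothetical settings; adjust to your environment.
HDFS_BASE="${HDFS_BASE:-/user/yourname}"   # stands in for YOUR_HDFS_PATH
LOCAL_DIR="${LOCAL_DIR:-.}"                # stands in for your_path
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the commands

# Print the command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

run hadoop fs -mkdir "$HDFS_BASE/tweets"
run hadoop fs -mkdir "$HDFS_BASE/tweets/rawdata"
run hadoop fs -put "$LOCAL_DIR/tweets.filtered.tsv" "$HDFS_BASE/tweets/rawdata/"
# Run from the Hadoop-GIS base directory.
run ./build/bin/queryprocessor_2d --querytype partition --geom1 5 \
    --input1 "$HDFS_BASE/tweets/rawdata/tweets.filtered.tsv" \
    --outputpath "$HDFS_BASE/tweetspartitioned" \
    --partitioner bsp --s 0.1 --numreducers 10
```

With the default DRY_RUN=1 the script only prints what it would do, which makes it easy to check the paths. Setting DRY_RUN=0 executes the commands.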

Query the Twitter Data

  1. To query the data, execute:
    ./build/bin/queryprocessor_2d --querytype containment --containrange -124.7,24.5,-67.0,49.5 \
        --input1=YOUR_HDFS_PATH/tweetspartitioned --outputpath=YOUR_HDFS_PATH/containmentout
  2. For more about the parameters, see the Features page.
  3. After processing finishes, the results are available on HDFS in YOUR_HDFS_PATH/containmentout.
  4. To list the results, run hadoop fs -ls YOUR_HDFS_PATH/containmentout,
    or run hadoop fs -cat YOUR_HDFS_PATH/containmentout/part-* to view your results,
    or run hadoop fs -get YOUR_HDFS_PATH/containmentout ./ to download your results from HDFS.
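
Once the output directory has been downloaded with hadoop fs -get, a quick count of the matching tweets can be taken locally. The sketch below assumes that each output line is one record and that the result files follow the part-* naming used above. The function name count_results and the demo files are made up.

```shell
#!/bin/sh
# Sum the line counts of all part-* files in a local result directory.
# Usage: count_results DIR
count_results() {
    cat "$1"/part-* 2>/dev/null | wc -l | tr -d ' '
}

# Demo with made-up files standing in for downloaded output.
demo_dir="$(mktemp -d)"
printf 'a\nb\n' > "$demo_dir/part-00000"
printf 'c\n' > "$demo_dir/part-00001"
count_results "$demo_dir"   # prints 3
rm -rf "$demo_dir"
```

For a real run, call count_results ./containmentout after the download.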

Add attributes to tweets

This section shows how to use Hadoop-GIS to add attributes to tweets. For instance, add a zipcode attribute to each tweet that indicates the zip code of the area from which the tweet was sent.

Joining two data sets in a database is a common way to extend records with additional attributes. Hadoop-GIS handles this kind of task with a spatial join, which matches records by the geographic information in the data sets. Below are the two steps of using Hadoop-GIS to run a spatial join between the tweets and the zip code data.
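
The following sketch shows the join idea on a toy scale. It is a deliberate simplification and not the Hadoop-GIS algorithm: zip code areas are represented as axis-aligned boxes rather than WKT polygons, and points are stored as separate x and y columns. The file layouts, the function name join_points_to_areas, and all sample values are invented for illustration.

```shell
#!/bin/sh
# Simplified spatial join: attach a zip code to each point that lies
# within a zip area. Areas here are axis-aligned boxes, not polygons.
# areas file:  id, min_x, min_y, max_x, max_y, zip  (tab-separated)
# points file: id, x, y, text                       (tab-separated)
# Usage: join_points_to_areas AREAS_FILE POINTS_FILE
join_points_to_areas() {
    awk -F '\t' -v OFS='\t' '
    NR == FNR {                     # first file: remember every area
        n++
        minx[n] = $2 + 0; miny[n] = $3 + 0
        maxx[n] = $4 + 0; maxy[n] = $5 + 0
        zip[n] = $6
        next
    }
    {                               # second file: test each point
        x = $2 + 0; y = $3 + 0
        for (i = 1; i <= n; i++)
            if (x >= minx[i] && x <= maxx[i] && y >= miny[i] && y <= maxy[i])
                print $0, zip[i]
    }' "$1" "$2"
}

# Made-up inputs: one area and two points, only one of them inside it.
tmp="$(mktemp -d)"
printf 'a\t-74.1\t40.6\t-73.9\t40.9\t10001\n' > "$tmp/areas.tsv"
printf '1\t-73.98\t40.75\thello\n2\t-0.12\t51.50\tbye\n' > "$tmp/points.tsv"
join_points_to_areas "$tmp/areas.tsv" "$tmp/points.tsv"
rm -rf "$tmp"
```

The first point is printed with the zip code appended as an extra field, and the second point is dropped because it matches no area. This nested loop compares every point with every area, which is only workable for tiny inputs.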


Preparing Data

  1. Download the zip code file.
  2. Extract the file into your local directory. Run:
    gunzip -c zcta510.tsv.gz > your_path/zcta510.tsv
  3. Upload the file to HDFS. Run:
    hadoop fs -mkdir YOUR_HDFS_PATH/zipcode
    hadoop fs -mkdir YOUR_HDFS_PATH/zipcode/rawdata
    hadoop fs -put your_path/zcta510.tsv YOUR_HDFS_PATH/zipcode/rawdata/

Spatial Join Execution

  1. Change into the Hadoop-GIS base directory.
  2. Execute the framework query processor:
    ./build/bin/queryprocessor_2d -b YOUR_HDFS_PATH/zipcode -a YOUR_HDFS_PATH/tweets/ -p st_within \
        --outputpath YOUR_HDFS_PATH/spatialjoinout --numreducers 20 \
        --outputfields 1:1,1:2,1:3,1:4,1:5,2:3 --partitioner bsp_2d -q spjoin -i 2 -j 2

    For more about the parameters, see the run queries page. Only the zip code attribute is needed, so only the 3rd field of the zip code record is added to the tweets data (field 2:3 in the --outputfields argument).

  3. After processing finishes, the results are available at YOUR_HDFS_PATH/spatialjoinout.
  4. To list the output files, run:
    hadoop fs -ls YOUR_HDFS_PATH/spatialjoinout

    The output data is saved in this directory. Run hadoop fs -cat YOUR_HDFS_PATH/spatialjoinout/part-* to view your results, or run hadoop fs -get YOUR_HDFS_PATH/spatialjoinout/* ./ to download your results from HDFS.
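
To illustrate what the output field list in step 2 selects, the sketch below applies a list of dataset:field pairs to one record from each data set. The reading that data set 1 is the tweets record and data set 2 is the zip code record is inferred from the note that 2:3 is the third field of the zip code record, not from the tool's documentation. The function name select_fields and the sample records are invented.

```shell
#!/bin/sh
# Emulate a "dataset:field" selection list on two tab-separated records.
# Usage: select_fields SPEC RECORD1 RECORD2
select_fields() {
    awk -v spec="$1" -v r1="$2" -v r2="$3" 'BEGIN {
        split(r1, f1, "\t"); split(r2, f2, "\t")
        n = split(spec, items, ",")
        out = ""; sep = ""
        for (i = 1; i <= n; i++) {
            split(items[i], p, ":")        # p[1] = data set, p[2] = field
            val = (p[1] == 1) ? f1[p[2]] : f2[p[2]]
            out = out sep val
            sep = "\t"
        }
        print out
    }'
}

# Hypothetical records: a tweet with five fields, a zip area with three.
tweet="$(printf '42\tPOINT (-73.98 40.75)\ta\tb\tc')"
zipcode="$(printf '7\tPOLYGON ((-74 40, -73 40, -73 41, -74 41, -74 40))\t10001')"
select_fields "1:1,1:2,1:3,1:4,1:5,2:3" "$tweet" "$zipcode"
```

The printed line contains the five tweet fields followed by 10001, the shape one would expect for each joined record under this reading of the field list.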