With the increased accessibility of large-scale open data, public health studies can take advantage of integrative spatial big data to increase their spatial resolution to the community or neighborhood level. One critical piece of information for such studies is the large number of patient addresses, which are private and highly sensitive. Geocoding such massive sets of private addresses poses major challenges for public health researchers. Many geocoders provide only Web APIs, which require sending private addresses over the Internet; this is not feasible for sensitive data. Commercial geocoders require high licensing fees and often limit daily usage, which becomes a major hurdle for researchers. Scalability is another major challenge for large-scale address datasets. In this paper, we present EaserGeocoder, a novel open-source geocoder for effectively geocoding massive address datasets. EaserGeocoder takes an integrative approach, using multiple references based on open address data sources contributed by governments or communities. It takes a machine learning approach to automatically find the best answer among the candidates produced by the multiple references. The system provides high scalability through parallel processing. Our comparative studies demonstrate that EaserGeocoder outperforms open-source geocoders and is comparable to commercial ones in terms of both accuracy and error. It provides a cost-effective and feasible solution for large-scale public health studies.
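The general idea described above (several references each propose a candidate, one candidate is selected, and addresses are processed in parallel) can be illustrated with a minimal Python sketch. This is not EaserGeocoder's code or API. Every name here (`REFERENCES`, `Candidate`, `geocode`, `geocode_batch`) is hypothetical, the reference tables hold made-up toy entries with placeholder coordinates, and a plain string-similarity score stands in for the learned model that the paper uses to rank candidates. A thread pool is used only to keep the sketch short; a process pool would likely suit CPU-bound matching better.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

# Toy reference tables (made-up entries and placeholder coordinates).
# In a real system these would be built from open address datasets.
REFERENCES = {
    "reference_a": {
        "100 MAIN ST ALBANY NY": (1.0, 2.0),
        "25 ELM AVE BUFFALO NY": (3.0, 4.0),
    },
    "reference_b": {
        "100 MAIN STREET ALBANY NY": (1.1, 2.1),
        "7 OAK RD ITHACA NY": (5.0, 6.0),
    },
}


@dataclass(frozen=True)
class Candidate:
    source: str           # which reference produced this candidate
    matched_address: str  # the reference entry that was matched
    lat: float
    lon: float
    score: float          # similarity to the query, in [0, 1]


def normalize(text: str) -> str:
    """Uppercase, drop commas, and collapse whitespace."""
    return " ".join(text.upper().replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Stand-in for a learned ranking model: plain string similarity."""
    return SequenceMatcher(None, a, b).ratio()


def candidates_for(address: str) -> List[Candidate]:
    """Ask every reference for its single closest entry."""
    query = normalize(address)
    found = []
    for source, table in REFERENCES.items():
        best = max(table, key=lambda entry: similarity(query, entry))
        lat, lon = table[best]
        found.append(Candidate(source, best, lat, lon, similarity(query, best)))
    return found


def geocode(address: str, min_score: float = 0.6) -> Optional[Candidate]:
    """Pick the highest-scoring candidate, or None if nothing is close enough."""
    best = max(candidates_for(address), key=lambda c: c.score)
    return best if best.score >= min_score else None


def geocode_batch(addresses: List[str], workers: int = 4) -> List[Optional[Candidate]]:
    """Geocode many addresses concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(geocode, addresses))
```

Calling `geocode_batch(["100 Main Street, Albany, NY", "zzz"])` returns one result per input address, with `None` for inputs that match no reference entry closely enough. Because everything runs locally against in-memory tables, no address leaves the machine, which is the property the privacy argument above depends on.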
We developed EaserGeocoder to geocode New York State addresses as part of our research. Although it has some limitations, it achieves high accuracy and throughput.
This research is supported in part by grants from the National Science Foundation (ACI 1443054 and IIS 1350885).
For any questions regarding this project, please do not hesitate to contact srashidian [at sign] cs [dot] stonybrook . edu.